In [1]:
import numpy as np

In [2]:
np.random.seed(32)

In [3]:
predictions = np.load('predictions.npy')
targets = np.load('targets.npy')

In [4]:
predictions.shape

(32, 64, 256)

In [5]:
targets.shape

(32, 64)

In [7]:
# One-hot encode the targets over the vocabulary axis: (32, 64) -> (32, 64, 256)
reshaped_targets = np.eye(predictions.shape[-1])[targets]

In [8]:
reshaped_targets.shape

(32, 64, 256)

In [10]:
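# Pick out the predicted value at each target token
# (assumes `predictions` holds log probabilities over the vocabulary)
log_p = np.sum(reshaped_targets * predictions, axis=-1)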
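The one-hot multiply-and-sum above selects one value per position, so the same result can be obtained by indexing directly. Here is a minimal sketch of that equivalence on small stand-in arrays; `demo_predictions` and `demo_targets` are hypothetical and are not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 2 sequences, 5 positions, vocabulary of 7 tokens.
logits = rng.normal(size=(2, 5, 7))
# Log-softmax, so each position holds log probabilities over the vocabulary.
demo_predictions = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
demo_targets = rng.integers(0, 7, size=(2, 5))

# One-hot route, as in the cell above.
one_hot = np.eye(demo_predictions.shape[-1])[demo_targets]
log_p_one_hot = np.sum(one_hot * demo_predictions, axis=-1)

# Gather route: read the target's log probability directly.
log_p_gather = np.take_along_axis(
    demo_predictions, demo_targets[..., None], axis=-1
).squeeze(-1)
```

The gather route avoids building the `(batch, seq_len, vocab)` one-hot array, which may matter for large vocabularies.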

In [12]:
# Mask of real tokens: 1.0 where the target is not the padding id (0), else 0.0
non_pad = 1.0 - np.equal(targets, 0)

In [13]:
non_pad

array([[1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       ...,
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.]])

In [14]:
non_pad.shape

(32, 64)

In [15]:
# Zero out the log probabilities at padding positions
real_log_p = log_p * non_pad

In [16]:
real_log_p

array([[ -5.39654493,  -1.03111839,  -0.66916656, ...,  -0.        ,
         -0.        ,  -0.        ],
       [ -4.58577061,  -1.13412857,  -8.53803253, ...,  -0.        ,
         -0.        ,  -0.        ],
       [ -5.22238874,  -1.28241444,  -0.17312431, ...,  -0.        ,
         -0.        ,  -0.        ],
       ...,
       [ -5.39654493, -17.29168129,  -4.36076593, ...,  -0.        ,
         -0.        ,  -0.        ],
       [ -5.93131638, -14.24741745,  -0.26373291, ...,  -0.        ,
         -0.        ,  -0.        ],
       [ -5.67053604,  -0.10595131,   0.        , ...,  -0.        ,
         -0.        ,  -0.        ]])

In [17]:
real_log_p.shape

(32, 64)

## Get the average log probability per token

In [18]:
# Mean log probability per real token in each sequence (a mean, despite the variable name)
log_prob_sum = np.sum(real_log_p, axis=1) / np.sum(non_pad, axis=1)

In [19]:
log_prob_sum

array([ -1.46443952,  -3.46996219,  -1.72545644,  -3.05195556,
        -2.43820807,  -1.50907578, -12.93874272,  -1.70937594,
        -2.53130744,  -1.69242826,  -2.15547694,  -2.7938809 ,
        -2.39598125,  -1.64991479,  -1.38344754,  -2.39099213,
        -1.84206926,  -1.47235944,  -3.56172533,  -2.57546407,
        -1.45711951,  -1.95714687,  -1.71366604,  -1.75605716,
        -3.13621035,  -1.71856025,  -1.65658158,  -2.58908648,
        -2.6263243 ,  -4.99332861,  -4.18875562,  -1.33283563])

In [24]:
# Average the per-sequence means over the batch and negate
log_perplexity = -np.mean(log_prob_sum)

In [25]:
log_perplexity

2.6211854987065033

In [26]:
perplexity = np.exp(log_perplexity)

In [27]:
perplexity

13.752016923578548
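The cells above average within each sequence first and then across the batch, so every sequence counts equally regardless of its length. A common alternative weights every real token equally. Here is a small sketch of the difference, using hypothetical arrays `demo_real_log_p` and `demo_non_pad` in place of the notebook's data:

```python
import numpy as np

# Hypothetical masked log probabilities and mask:
# sequence 0 has 4 real tokens, sequence 1 has 2.
demo_real_log_p = np.array([[-1.0, -1.0, -1.0, -1.0],
                            [-3.0, -3.0,  0.0,  0.0]])
demo_non_pad = np.array([[1.0, 1.0, 1.0, 1.0],
                         [1.0, 1.0, 0.0, 0.0]])

# Per-sequence averaging, as in the cells above: each sequence counts equally.
per_seq = np.sum(demo_real_log_p, axis=1) / np.sum(demo_non_pad, axis=1)
ppl_sequence_avg = np.exp(-np.mean(per_seq))

# Token-weighted averaging: each real token counts equally.
ppl_token_avg = np.exp(-np.sum(demo_real_log_p) / np.sum(demo_non_pad))
```

The two values coincide only when all sequences have the same number of real tokens, so it is worth stating which convention is used when reporting perplexity.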

### Language Model Probability and Perplexity Calculation

The probability of a sequence of words (or tokens) is based on the chain rule of probability, which decomposes the joint probability into a product of conditional probabilities. This reflects the dependencies between successive words in a sequence:


$$ P(w_1, w_2, \ldots, w_N) = P(w_1) \cdot P(w_2 \mid w_1) \cdots P(w_N \mid w_1, w_2, \ldots, w_{N-1}) $$

Each word $w_i$ is predicted from the context provided by the preceding words $w_1, \ldots, w_{i-1}$, capturing the sequence's inherent conditional dependencies.

Taking the logarithm of this product, which is a common step in language model evaluation, gives us a sum of log probabilities:


$$ \log P(w_1, w_2, \ldots, w_N) = \log P(w_1) + \log P(w_2 \mid w_1) + \cdots + \log P(w_N \mid w_1, w_2, \ldots, w_{N-1}) $$

This sum does not assume independence; rather, it adds the logs of the conditional probabilities, so the sequence's contextual information is retained.
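As a quick numerical check of the identity above, the sketch below uses made-up conditional probabilities (the values in `conditionals` are illustrative only). It also shows one practical reason for working in log space: a long product of small probabilities underflows to zero in floating point, while the sum of logs stays finite.

```python
import math

# Hypothetical conditional probabilities for a three-token sequence:
# P(w1), P(w2 | w1), P(w3 | w1, w2).
conditionals = [0.5, 0.2, 0.1]

joint = math.prod(conditionals)                     # chain-rule product
log_joint = sum(math.log(p) for p in conditionals)  # sum of log conditionals

# A long sequence of small probabilities: the product underflows,
# but the log-space sum is still representable.
long_conditionals = [0.1] * 400
long_joint = math.prod(long_conditionals)
long_log_joint = sum(math.log(p) for p in long_conditionals)
```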

Language models are trained by adjusting their parameters so that the sum of the log conditional probabilities is maximized for observed sequences. This reflects the nature of language, where each token depends on the tokens before it.

#### Independence vs. Conditional Probability in Language Modeling
Contrary to assuming independence, the sum of log probabilities in language modeling explicitly accounts for the sequence's conditional nature. The model's architecture (e.g., RNN, LSTM, Transformer) is designed to capture and utilize the contextual dependencies between tokens.

In perplexity calculation:

- The sum of log probabilities computes the total log likelihood of the sequence under the model, considering only actual data tokens and ignoring padding.
- The average log likelihood per token is derived by normalizing this sum over the sequence's length (excluding padding).
- Perplexity is the exponential of the negative average log likelihood per token, representing the model's predictive performance, considering contextual dependencies.
This method of perplexity calculation ensures that the metric fairly reflects model performance across sequences of varying lengths and content. It allows for a robust comparison of model predictions and actual token sequences in language processing tasks.
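The steps described above can be collected into one function. This is a sketch under stated assumptions rather than a definitive implementation: the name `masked_perplexity` and the `pad_id` parameter are introduced here for illustration, and the input is assumed to hold log probabilities. A model that spreads probability uniformly over a vocabulary of size $V$ should score a perplexity of $V$, which gives a simple sanity check.

```python
import numpy as np

def masked_perplexity(log_probs, targets, pad_id=0):
    """Perplexity from log probabilities, ignoring padding positions.

    log_probs: (batch, seq_len, vocab) array of log probabilities.
    targets:   (batch, seq_len) integer token ids; pad_id marks padding.
    """
    # One-hot select the log probability of each target token.
    one_hot = np.eye(log_probs.shape[-1])[targets]
    token_log_p = np.sum(one_hot * log_probs, axis=-1)
    # 1.0 for real tokens, 0.0 for padding.
    non_pad = 1.0 - np.equal(targets, pad_id)
    # Mean log probability per real token, per sequence, then over the batch.
    per_seq = np.sum(token_log_p * non_pad, axis=1) / np.sum(non_pad, axis=1)
    return np.exp(-np.mean(per_seq))

# Sanity check with a uniform model over a hypothetical 8-token vocabulary.
vocab_size = 8
uniform = np.full((2, 5, vocab_size), -np.log(vocab_size))
demo_targets = np.array([[3, 1, 4, 0, 0],
                         [2, 7, 1, 5, 0]])
ppl = masked_perplexity(uniform, demo_targets)
```

Note that a sequence made up entirely of padding would cause a division by zero in this sketch; such rows would need to be filtered out beforehand.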







### Independent Token Assumption in Probability

When assuming that tokens in a sequence are independent, the joint probability of the sequence is the product of the probabilities of each individual token:


$$ P(w_1, w_2, ..., w_N) = P(w_1) \cdot P(w_2) \cdot ... \cdot P(w_N) $$

Under this assumption, there is no dependency between tokens; the probability of a token occurring is not influenced by the preceding tokens in the sequence. Each token $w_i$ is predicted in isolation, based only on its own probability $P(w_i)$.

This is in contrast to language models that consider conditional probabilities, where the occurrence of each word is dependent on the previous words in the sequence.
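One consequence of the independence assumption is that word order stops mattering: any permutation of the same tokens receives the same probability. The sketch below illustrates this with unigram frequencies estimated from a tiny made-up corpus; the corpus and the helper `unigram_log_prob` are hypothetical.

```python
from collections import Counter
import math

# Tiny made-up corpus used to estimate unigram probabilities P(w).
corpus = "the cat sat the cat ran".split()
counts = Counter(corpus)
total = sum(counts.values())

def unigram_log_prob(tokens):
    """Log probability of a sequence under the independence assumption."""
    return sum(math.log(counts[t] / total) for t in tokens)

in_order = unigram_log_prob(["the", "cat", "sat"])
shuffled = unigram_log_prob(["sat", "the", "cat"])
```

Both orderings get probability $(2/6)(2/6)(1/6) = 1/54$, whereas a model built on conditional probabilities can distinguish them.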


Treating each $w_i$ without regard to the other words is a simplification that is often unsuitable for modeling language, given its inherent structure and dependencies.





