# Introduction to Perplexity in Language Modeling

Perplexity is a metric that measures how well a probability model predicts a sample, and it is commonly used to evaluate language models. For a sequence of $N$ words $W = w_1, ..., w_N$, it is defined as:

$$P(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}$$

Here $P(W)$ on the left-hand side denotes the perplexity of the sequence $W$, while $P(\cdot \mid \cdot)$ on the right-hand side denotes a probability and $w_i$ denotes the $i$-th word, so $P(w_i| w_1,...,w_{i-1})$ is the probability of word $i$, given all previous words ($1$ to $i-1$).
In practice, you usually work with the logarithm of this formula, which makes the computation less prone to underflow. You also need to exclude the padding when calculating the perplexity, so that padded positions do not distort the metric.

After taking the logarithm of $P(W)$ you have:

$$\log P(W) = \log\left(\sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}\right)$$


$$ = \log\left(\left(\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}\right)^{\frac{1}{N}}\right)$$

$$ = \log\left(\left({\prod_{i=1}^{N}{P(w_i| w_1,...,w_{i-1})}}\right)^{-\frac{1}{N}}\right)$$

$$ = -\frac{1}{N}{\log\left({\prod_{i=1}^{N}{P(w_i| w_1,...,w_{i-1})}}\right)} $$

$$ = -\frac{1}{N}{{\sum_{i=1}^{N}{\log P(w_i| w_1,...,w_{i-1})}}} $$
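To see why the log form matters numerically, here is a minimal sketch. The per-token probabilities are made up for illustration (every token gets probability 0.01); only the arithmetic follows the derivation above.

```python
import numpy as np

# Hypothetical per-token probabilities P(w_i | w_1..w_{i-1}) for a long sequence.
probs = np.full(2000, 0.01)

# Direct form: the product of 2000 small numbers underflows to 0.0 in float64,
# so the N-th root of its inverse cannot be computed from it.
direct_product = np.prod(probs)

# Log form: average the log probabilities, negate, then exponentiate.
log_ppl = -np.mean(np.log(probs))
ppl = np.exp(log_ppl)  # approximately 100, i.e. 1 / 0.01
```

Exponentiating the last line of the derivation recovers the perplexity itself, which is why the example ends with `np.exp`.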

## Understanding Perplexity
- **Lower Perplexity:** A lower perplexity indicates that the model predicts the words with higher likelihood, which means the model is more confident in its predictions.
- **Higher Perplexity:** Conversely, a higher perplexity suggests that the model is less sure about its predictions.

Perplexity can be thought of as the weighted average branching factor of a language; it is the number of choices the model feels it has, on average, when predicting the next symbol.
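The branching-factor reading can be checked on toy numbers. In this sketch the helper `perplexity` and the probability values are illustrative assumptions, not part of the notebook's data: a model that spreads its probability evenly over $k$ options has perplexity $k$, while a confident model has a perplexity close to 1.

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the observed tokens."""
    token_probs = np.asarray(token_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(token_probs))))

k = 6
uniform_ppl = perplexity([1.0 / k] * 10)  # equally unsure among 6 options: about 6
confident_ppl = perplexity([0.9] * 10)    # 0.9 on the right token each time: about 1.11
```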

## Why Use Perplexity?
**Model Comparisons:** Perplexity provides a standard way to compare different language models. A model with a lower perplexity on a test set is generally considered to be better at predicting the sample.

**Model Improvement:** It can also serve as an objective function during the training of a language model, with the goal of minimizing perplexity, thus improving the model's performance.

## Limitations of Perplexity
While perplexity is a widely used metric, it does have limitations. It assumes that the model and test data are from the same distribution, and it may not always correlate perfectly with human judgment of model quality, especially in tasks like machine translation or text generation.

## Applications and Limitations of Perplexity

Perplexity is particularly useful in the context of language modeling, where the goal is often to predict the next word in a sequence. It provides a way to evaluate how well a model has learned to perform this task:

- **Applications**: In areas such as speech recognition, machine translation, and text generation, perplexity serves as a key metric to gauge how well the model understands the language patterns.

- **Limitations**: However, perplexity alone may not capture all aspects of model performance, particularly in tasks that require semantic understanding or generation of coherent text. It is a measure of the model's uncertainty, not necessarily the quality or coherence of the text it might generate.

## Correlation with Human Judgment

The correlation between perplexity and the human judgment of language model quality can be imperfect. For example:

- A model might achieve low perplexity by being overly conservative, thus not generating diverse or interesting text.
- Conversely, a model could generate novel and contextually appropriate text but might have higher perplexity due to the unpredictability of creative language use.

## Cross-Entropy and Perplexity

Perplexity is directly related to the cross-entropy between the true distribution and the model's distribution. The cross-entropy measures the average number of bits needed to encode the data coming from the true distribution using the model's distribution.
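Concretely, perplexity is the exponential of the cross-entropy: $e^{H}$ when $H$ is measured in nats, or $2^{H}$ when it is measured in bits. The sketch below uses made-up token probabilities and estimates the cross-entropy empirically as the average negative log probability of the observed tokens.

```python
import numpy as np

# Hypothetical probabilities the model assigned to four observed tokens.
token_probs = np.array([0.5, 0.25, 0.125, 0.125])

h_nats = -np.mean(np.log(token_probs))   # cross-entropy estimate in nats
h_bits = -np.mean(np.log2(token_probs))  # same quantity in bits: 2.25

ppl_from_nats = np.exp(h_nats)
ppl_from_bits = 2.0 ** h_bits            # identical value, different log base
```

The choice of base only has to be consistent: take the logarithm and the exponential in the same base.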

## Normalization and Perplexity Calculation
The normalization by the length of the sequence $N$ when computing perplexity ensures that sequences of different lengths can be compared fairly. This is crucial in datasets with variable-length sequences and allows for a consistent evaluation metric across different models and corpora.
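A small sketch of the effect, using made-up probabilities: two sequences in which every token is predicted equally well have very different total negative log-likelihoods, but the same perplexity once the total is divided by the sequence length.

```python
import numpy as np

# Hypothetical sequences: every token receives probability 0.2.
short_seq = np.full(5, 0.2)
long_seq = np.full(50, 0.2)

# Unnormalized totals grow with length.
total_short = -np.sum(np.log(short_seq))
total_long = -np.sum(np.log(long_seq))   # ten times larger

# Dividing by N before exponentiating makes the two comparable.
ppl_short = np.exp(total_short / len(short_seq))
ppl_long = np.exp(total_long / len(long_seq))
```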

## Good Perplexity

**Lower Than Baseline:** A "good" perplexity is typically lower than some established baseline. For instance, it might be compared against the perplexity of a unigram model (a model that only considers the probability of individual words, ignoring context).

**Improvement Over Training:** A perplexity that decreases over time during training indicates that the model is learning and improving its predictions.

**Context-Specific Benchmarks:** What is considered "good" can also be highly specific to the domain or application. For language models trained on children's books, a lower perplexity is expected because of the simpler language structure, compared to models trained on more complex corpora, like legal documents.
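As an illustration of the unigram baseline mentioned above, here is a minimal sketch on a toy corpus. The corpus, the helper `unigram_prob`, and the choice of add-one smoothing are assumptions made for this example only.

```python
import math
from collections import Counter

# Toy corpora, invented for illustration.
train_tokens = "the cat sat on the mat the dog sat on the rug".split()
test_tokens = "the dog sat on the mat".split()

counts = Counter(train_tokens)
total = len(train_tokens)   # 12 tokens
vocab_size = len(counts)    # 7 distinct tokens

def unigram_prob(token):
    # Add-one smoothing; the extra +1 in the denominator reserves
    # probability mass for a single "unseen token" bucket.
    return (counts[token] + 1) / (total + vocab_size + 1)

# Context is ignored: each token is scored independently.
log_probs = [math.log(unigram_prob(t)) for t in test_tokens]
unigram_ppl = math.exp(-sum(log_probs) / len(log_probs))
```

A context-aware model evaluated on the same test tokens would be expected to score below `unigram_ppl` if it has learned anything beyond word frequencies.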

## Bad Perplexity
**Higher Than Baseline:** A "bad" perplexity would be higher than a simple or baseline model, suggesting that the model is not learning the data's structure.

**No Improvement Over Training:** If perplexity does not decrease, or worse, increases as training progresses, it suggests issues with the model architecture, learning process, or data representation.

**High Perplexity on Simple Data:** If a model has high perplexity on data that is relatively predictable or repetitive, it indicates that the model is failing to capture even the basic patterns in the data.

## Factors Influencing Good vs. Bad Perplexity

**Quality of Training Data:** The diversity and quality of the training data can greatly influence perplexity. If the training data is not representative of the language patterns in the test set, even a low perplexity might not indicate a good model.

**Model Complexity:** More complex models may achieve lower perplexity by capturing subtler patterns, but they may also be more prone to overfitting, which could result in poor performance on unseen data despite good perplexity scores on the training or validation sets.

**Corpus Size and Diversity:** For a very diverse corpus or a corpus with a large vocabulary, a higher perplexity might be acceptable compared to a more homogeneous or smaller corpus.

## Evaluating Perplexity in Practice

**Relative Performance:** A good practice is to compare perplexity across several models or against known benchmarks in the field. Improvements relative to these benchmarks can indicate good perplexity.

**Task-Specific Metrics:** Perplexity should be used alongside other performance metrics that are relevant to the specific task (e.g., BLEU scores for translation, ROUGE for summarization).

**Human Evaluation:** Ultimately, especially for generative models, human evaluation can provide the most relevant assessment of text quality, and perplexity should not be the sole measure of a language model's capability.

## Conclusion
While perplexity is a valuable metric for comparing probabilistic models, it should be used alongside other evaluations, such as qualitative assessments of generated text and task-specific performance measures, to gain a comprehensive understanding of a model's capabilities.

# Example

## Load libraries

In [2]:
import numpy as np

## Set the seed

In [3]:
np.random.seed(32)

## Load data

In [4]:
predictions = np.load('predictions.npy')
targets = np.load('targets.npy')

## Analyze data

### Targets

In [10]:
targets.shape

(32, 64)

The targets comprise the genuine tokens within a sequence. For instance, the target data contain 32 distinct sequences, each potentially including up to 64 tokens. Sequences shorter than 64 tokens are padded with the index '0' to reach this fixed length. Conversely, sequences exceeding 64 tokens are truncated to fit this limit. Each token in a sequence is denoted by a numerical index ranging from 0 to 255, which corresponds to its position in the vocabulary. Notably, the index '0' is reserved for the 'pad' token.

The first sequence looks like this:

In [11]:
targets[0, :]

array([105, 110,  32, 115, 117,  99, 104,  32, 100, 105, 115, 100,  97,
       105, 110, 102, 117, 108,  32, 109,  97, 110, 110, 101, 114,  32,
       109, 101,  32, 116, 111,  32, 119, 111, 111,  46,   1,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
      dtype=int32)

The second sequence looks like this:

In [12]:
targets[1, :]

array([ 97, 110, 110, 101,  32, 112,  97, 103, 101,   9, 110, 111, 116,
        32, 105,  44,  32, 115, 105, 114,  59,  32, 112, 114,  97, 121,
        32, 121, 111, 117,  44,  32, 107, 101, 101, 112,  32, 111, 110,
        46,   1,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
      dtype=int32)
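The printed rows are consistent with a byte-level encoding (for example, 32 is the ASCII space), followed by an index 1 that appears to mark the end of the sequence and then zero padding. That reading is an inference from the values shown, not something stated in the data files. Under that assumption, a hypothetical `encode` helper that pads and truncates to 64 positions could look like this:

```python
import numpy as np

MAX_LEN = 64
PAD_ID = 0
EOS_ID = 1  # assumption: index 1 marks the end of a sequence

def encode(text, max_len=MAX_LEN):
    """Byte-encode text, append EOS, then truncate or pad to max_len."""
    ids = list(text.encode("utf-8")) + [EOS_ID]
    ids = ids[:max_len]                          # truncate long sequences
    ids = ids + [PAD_ID] * (max_len - len(ids))  # pad short sequences with 0
    return np.array(ids, dtype=np.int32)

row = encode("in such disdainful manner me to woo.")
```

With this encoding, `row` starts with `105, 110, 32, 115`, matching the first printed sequence.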

### Predictions

In [13]:
predictions.shape

(32, 64, 256)

The same layout applies to the predictions. There are 32 predicted sequences, each with 64 positions. However, instead of each position holding a single number between 0 and 255, every position holds a prediction vector of length 256, with one entry per token index in the vocabulary. For padded positions, where the target is the 'pad' token '0', the relevant prediction is the entry at index 0 of the corresponding vector.

Let's look at an example.

Let's inspect the target for the third token (index 2) of the first sequence:

In [15]:
targets[0, 2]

32

In the first sequence, the third token is the token with index 32 in the vocabulary. Let's examine the predictions for this position:

In [16]:
predictions[0, 2, :]

array([-15.783699  , -14.416848  , -15.512791  , -15.747061  ,
       -15.875423  , -15.857048  , -15.714527  , -15.726427  ,
       -15.636484  , -13.204674  , -15.7897625 , -15.5456505 ,
       -15.627533  , -15.669703  , -15.476461  , -15.597759  ,
       -15.612074  , -15.629346  , -15.774958  , -15.947931  ,
       -15.6925955 , -15.850674  , -15.700855  , -15.57556   ,
       -15.5442295 , -15.589245  , -15.796459  , -15.781776  ,
       -15.77724   , -15.740474  , -15.735656  , -15.596254  ,
        -0.66916656,  -9.375701  , -15.529961  , -15.690064  ,
       -18.688759  , -15.854721  , -21.715818  ,  -4.7703843 ,
       -18.846731  , -20.794693  , -15.685709  , -15.793209  ,
        -6.7585335 ,  -9.853044  ,  -4.3036513 , -15.662617  ,
       -13.796404  , -15.667776  , -19.005121  , -15.684215  ,
       -15.78345   , -18.78386   , -16.584015  , -15.431082  ,
       -17.811726  , -15.437679  , -11.17949   , -11.846891  ,
       -15.628307  , -15.664791  , -15.902561  ,  -8.73

Or, more precisely, let's look at the probability the model assigns to the token with index 32. The predictions are log probabilities, so exponentiating recovers the probability:

In [22]:
np.exp(predictions[0, 2, 32])

0.5121352

Let's determine whether the prediction was accurate by identifying the index with the highest log probability, which is also the index with the highest probability.

In [23]:
np.argmax(predictions[0, 2, :])

32

This is a correct prediction: the target index for the third token in the first sequence (sequence 0, token 2) is 32, and the token with the highest probability at this position also has index 32, with a probability of 0.5121352.
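The check above relies on the entries of `predictions` being log probabilities, so that each prediction vector exponentiates to a distribution summing to 1. The sketch below shows how such vectors can be produced and verified. It uses random stand-in scores rather than the notebook's files, and the log-softmax step is an assumption about how the model's outputs were normalized.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5, 256))  # stand-in for raw model scores

# Log-softmax over the vocabulary axis, shifted by the max for stability.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

row_sums = np.exp(log_probs).sum(axis=-1)  # every entry should be close to 1.0
predicted_ids = log_probs.argmax(axis=-1)  # most likely token at each position
```

Running the same `np.exp(...).sum(axis=-1)` check on the real `predictions` array is a quick way to confirm that it holds log probabilities.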

## Reshaping `targets` to same shape as `predictions`


We need to bring the targets into the same shape as the predictions by transforming each one from a single number between $0$ and $255$, which identifies a specific token in the vocabulary, into a one-hot encoded vector. The one-hot vector has a value of $1$ at the index of the target token and $0$ everywhere else. For example, `targets[1, 40]` has a value of $1$, so after the transformation it is represented by the one-hot vector $[0, 1, 0, 0, 0, ..., 0]$, whose length is $256$ (the vocabulary size).

In [26]:
targets[1, 40]

1

In [28]:
reshaped_targets = np.eye(predictions.shape[-1])[targets]

The code `reshaped_targets = np.eye(predictions.shape[-1])[targets]` reshapes the targets to have the same dimensions as the predictions by converting them into one-hot encoded vectors. This is achieved through the following steps:

1. **Creating an Identity Matrix**: `np.eye(predictions.shape[-1])` generates an identity matrix of size equal to the number of classes or possible tokens (the last dimension of `predictions`). Each row in this matrix represents a one-hot encoded vector for a corresponding class.

2. **Indexing with Targets**: Indexing the identity matrix with `[targets]` uses each target index to select the matching row, which is the one-hot encoded vector for that token. This effectively transforms each scalar target into a one-hot encoded vector.

3. **Matching Dimensions**: The resulting `reshaped_targets` array matches the shape of `predictions`, enabling direct comparison and element-wise operations between predicted probabilities and actual target vectors. This transformation is essential for computing loss functions that require input dimensions of targets and predictions to be identical, such as categorical cross-entropy.

By using this approach, we ensure that each target token is accurately represented as a one-hot vector, facilitating effective training and evaluation of classification models where predictions are also probability distributions over the same set of classes.
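The indexing trick is easier to follow on a small example. The sketch below uses a made-up vocabulary of 4 tokens and a toy target array; only the `np.eye(...)[...]` pattern is taken from the notebook.

```python
import numpy as np

vocab_size = 4
toy_targets = np.array([[2, 0, 3],
                        [1, 1, 0]])        # shape (2, 3)

# Each integer selects one row of the 4x4 identity matrix.
one_hot = np.eye(vocab_size)[toy_targets]  # shape (2, 3, 4)

first = one_hot[0, 0]                      # target 2 -> [0., 0., 1., 0.]
```

Taking `one_hot.argmax(axis=-1)` recovers `toy_targets`, which is a convenient sanity check.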


In [29]:
reshaped_targets.shape

(32, 64, 256)

In [30]:
reshaped_targets[1, 40, 0]

0.0

In [31]:
reshaped_targets[1, 40, 1]

1.0

In [32]:
reshaped_targets[1, 40, 2]

0.0

## Summing the log probabilities

As we aim to compute the following average negative log-likelihood,

$$ \log P(W) = -\frac{1}{N}\sum_{i=1}^{N}\log P(w_i| w_1,...,w_{i-1}) $$

the first step is to extract, for every position, the log probability that the model assigned to the actual token. Multiplying the predictions element-wise by the one-hot targets zeroes out every entry except the one at the true token's index, so summing over the vocabulary axis leaves exactly that log probability.

As an example, see the following calculation:

In [35]:
reshaped_targets[1, 40, :]

array([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

In [36]:
predictions[1, 40, :]

array([-36.391533  ,  -2.946434  , -36.57123   , -36.770393  ,
       -36.335464  , -36.6263    , -36.485146  , -36.377808  ,
       -36.43523   , -21.897081  , -36.628742  , -36.537407  ,
       -36.42761   , -36.476738  , -36.472946  , -36.34853   ,
       -36.506317  , -36.689102  , -36.408024  , -36.267384  ,
       -36.523792  , -36.531082  , -36.442635  , -36.562263  ,
       -36.627666  , -36.634056  , -36.533333  , -36.47747   ,
       -36.67165   , -36.475838  , -36.439873  , -36.559284  ,
        -0.05396652, -34.01696   , -36.34834   , -36.47068   ,
       -37.786743  , -36.589767  , -43.71125   , -11.775192  ,
       -32.293053  , -27.512966  , -36.45308   , -36.458286  ,
       -22.550772  , -16.47274   , -28.42991   , -36.377464  ,
       -30.13439   , -37.79309   , -37.859344  , -37.63462   ,
       -36.533337  , -38.48908   , -35.33613   , -37.42507   ,
       -33.745102  , -37.340454  , -21.17191   , -22.333817  ,
       -36.7372    , -36.648296  , -36.459934  , -29.19

In [38]:
predictions[1, 40, 1]

-2.946434

In [39]:
np.sum(reshaped_targets[1, 40, :] * predictions[1, 40, :])

-2.9464340209960938

Now, let's do this for all sequences and tokens:

In [33]:
log_p = np.sum(predictions * reshaped_targets, axis = -1)

In [34]:
log_p.shape

(32, 64)

In [40]:
log_p

array([[ -5.39654493,  -1.03111839,  -0.66916656, ..., -22.37672997,
        -23.18770981, -21.84348297],
       [ -4.58577061,  -1.13412857,  -8.53803253, ..., -20.15686035,
        -26.83709717, -23.57501984],
       [ -5.22238874,  -1.28241444,  -0.17312431, ..., -21.328228  ,
        -19.85441208, -33.88444138],
       ...,
       [ -5.39654493, -17.29168129,  -4.36076593, ..., -20.82580185,
        -21.06583786, -22.44311523],
       [ -5.93131638, -14.24741745,  -0.26373291, ..., -26.74324799,
        -18.38433075, -22.35527802],
       [ -5.67053604,  -0.10595131,   0.        , ..., -23.33252335,
        -28.08737564, -23.87880707]])

In [41]:
log_p[1, 40]

-2.9464340209960938
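The one-hot multiplication is easy to read, but it materializes a full `(32, 64, 256)` array just to pick one entry per position. As an alternative sketch (not the notebook's method), `np.take_along_axis` gathers the same values directly. The toy arrays here are random stand-ins for `predictions` and `targets`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random log-probability vectors over 6 tokens, shape (2, 3, 6).
toy_predictions = np.log(rng.dirichlet(np.ones(6), size=(2, 3)))
toy_targets = rng.integers(0, 6, size=(2, 3))

# Notebook approach: multiply by one-hot targets and sum over the vocabulary.
via_one_hot = np.sum(toy_predictions * np.eye(6)[toy_targets], axis=-1)

# Gather approach: pick the target entry at each position directly.
via_gather = np.take_along_axis(
    toy_predictions, toy_targets[..., None], axis=-1
).squeeze(-1)
```

Both arrays have shape `(2, 3)` and hold the same values, so the choice between them is a matter of memory use and readability.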

##  Adjusting for padded values

One issue that affects the perplexity measure, and that we need to adjust for, is that sequences shorter than 64 tokens were padded with zeros up to a length of 64. Since the padding is an actual token with index 0 in the vocabulary, the padded positions contribute log probabilities of their own, which distorts the perplexity measure. We therefore need to adjust the `log_p` matrix.

For example:

In [43]:
targets[0, 63]

0

In [44]:
reshaped_targets[0, 63, :]

array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

In [47]:
reshaped_targets[0, 63, 0]

1.0

This shows that at a padded position the one-hot vector has its first element (index 0, the pad token) set to 1.0. Consequently, the sum of log probabilities would include the model's log probability of the pad token at every padded position.

For instance:

In [48]:
predictions[0, 63, :] * reshaped_targets[0, 63, :]

array([-21.84348297,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.  

In [50]:
(predictions[0, 63, :] * reshaped_targets[0, 63, :])[0]

-21.843482971191406

To adjust for this, we create a `no_pad` mask that is 1 for real tokens and 0 for padding:

In [51]:
no_pad = 1.0 - np.equal(targets, 0)
no_pad

array([[1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       ...,
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.]])

In [52]:
no_pad.shape

(32, 64)

We then multiply this mask with the `log_p` matrix, which zeroes out the padded positions:

In [53]:
real_log_p = log_p * no_pad

In [54]:
real_log_p

array([[ -5.39654493,  -1.03111839,  -0.66916656, ...,  -0.        ,
         -0.        ,  -0.        ],
       [ -4.58577061,  -1.13412857,  -8.53803253, ...,  -0.        ,
         -0.        ,  -0.        ],
       [ -5.22238874,  -1.28241444,  -0.17312431, ...,  -0.        ,
         -0.        ,  -0.        ],
       ...,
       [ -5.39654493, -17.29168129,  -4.36076593, ...,  -0.        ,
         -0.        ,  -0.        ],
       [ -5.93131638, -14.24741745,  -0.26373291, ...,  -0.        ,
         -0.        ,  -0.        ],
       [ -5.67053604,  -0.10595131,   0.        , ...,  -0.        ,
         -0.        ,  -0.        ]])
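To see how much the mask matters, the following sketch compares the per-sequence average log probability with and without it. The toy targets and log probabilities are invented; the padded positions are given large negative values, similar in spirit to the real `log_p` output above.

```python
import numpy as np

toy_targets = np.array([[5, 7, 1, 0, 0],
                        [9, 1, 0, 0, 0]])
toy_log_p = np.array([[-1.0, -2.0, -0.5, -20.0, -25.0],
                      [-3.0, -1.0, -22.0, -21.0, -30.0]])

# Same values as 1.0 - np.equal(toy_targets, 0): 1 for real tokens, 0 for padding.
mask = (toy_targets != 0).astype(float)
masked = toy_log_p * mask

tokens_per_row = mask.sum(axis=1)                     # 3 and 2 real tokens
mean_with_mask = masked.sum(axis=1) / tokens_per_row  # averages real tokens only
mean_without_mask = toy_log_p.mean(axis=1)            # padding drags the average down
```

Here the unmasked averages are far more negative, which would inflate the perplexity. With a model that predicts padding well, the distortion would go the other way.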

## Calculating Perplexity

Perplexity is a crucial metric in evaluating language models, indicating how well a model predicts a sample. The process involves three methodical steps:

### Calculate Average Log Probability Per Sequence:
Begin by summing the log probabilities for each sentence. Importantly, this summation must only include the log probabilities associated with actual tokens, excluding any padding tokens. Divide this sum by the number of actual (non-padded) tokens in the sentence to obtain the average log probability per sequence. This step ensures that each sequence is evaluated based on its content rather than its length.

In [73]:
log_ppx = np.sum(real_log_p, axis = 1) / np.sum(no_pad, axis= 1)

### Compute Mean Log Probability Across All Sequences:
Once you have the average log probability for each individual sequence, the next step is to calculate the mean of these averages across all sequences and negate it. The result is the average negative log-likelihood, a single measure of the model's performance across the entire dataset in which every sequence carries equal weight.

In [84]:
log_ppx = np.mean(-log_ppx)

In [85]:
log_ppx

2.6211854987065033

### Exponentiate to Obtain Perplexity:
Since the value computed is an average negative log probability (using natural logarithms), convert it back into a more interpretable measure by raising $e$ to that power. The result is the perplexity of the model. A lower perplexity indicates better predictive accuracy, as it means less uncertainty in predicting the next token in a sequence.

In [86]:
ppx = np.exp(log_ppx)

In [87]:
ppx

13.752016923578548
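The three steps can be collected into one function. This is a sketch under stated assumptions rather than a definitive implementation: the function name, the `pad_id` and `per_sequence` parameters, and the toy inputs are all introduced here, and it assumes every sequence contains at least one non-pad token. With `per_sequence=False` it averages over all real tokens at once, another common convention that weights tokens rather than sequences equally.

```python
import numpy as np

def perplexity(log_probs, targets, pad_id=0, per_sequence=True):
    """Perplexity from log-probabilities, ignoring padded positions.

    log_probs: (batch, seq_len, vocab) array of log probabilities
    targets:   (batch, seq_len) integer token ids, padded with pad_id
    """
    # Log probability assigned to each actual token.
    token_log_p = np.take_along_axis(
        log_probs, targets[..., None], axis=-1
    ).squeeze(-1)
    mask = (targets != pad_id).astype(float)
    masked = token_log_p * mask
    if per_sequence:
        # Average within each sequence, then across sequences.
        mean_log_p = (masked.sum(axis=1) / mask.sum(axis=1)).mean()
    else:
        # Average over all real tokens in the batch.
        mean_log_p = masked.sum() / mask.sum()
    return float(np.exp(-mean_log_p))

# Toy check: a uniform model over 8 tokens has perplexity 8 under both conventions.
toy_log_probs = np.full((2, 4, 8), np.log(1.0 / 8))
toy_targets = np.array([[3, 5, 1, 0],
                        [2, 1, 0, 0]])
toy_ppl = perplexity(toy_log_probs, toy_targets)
toy_ppl_tokens = perplexity(toy_log_probs, toy_targets, per_sequence=False)
```

The two conventions agree when all sequences are predicted equally well and diverge otherwise, so it is worth stating which one a reported perplexity uses.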