# Perplexity: How Surprised Is Your Model?

This notebook demonstrates how to compute perplexity, a metric that measures how well a language model predicts a sample of text. A lower perplexity means the model is less surprised by the text: it assigns higher probability to the words that actually occur.


Perplexity is defined as the exponential of the average negative log-likelihood of a sequence.
For a sequence of \(N\) words \(w_1, w_2, \dots, w_N\) and an n-gram model that assigns probability \(P\) to each word given the \(n-1\) words before it, the perplexity is:

\[ \text{Perplexity} = 2^{ -\frac{1}{N} \sum_{i=1}^N \log_2 P(w_i | w_{i-1}, \dots, w_{i-n+1}) } \]

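The exponent is the average number of bits needed to encode each word under the model, so the lower the perplexity, the better the model predicts the sequence. The logarithm base is a convention: any base works as long as the exponent uses the same one. In the next cell, we build a simple n-gram model and estimate its perplexity on a small corpus.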
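
As a quick sanity check on the definition, the sketch below evaluates the formula directly on a hand-picked list of per-word probabilities. The helper name `perplexity_from_probs` and the probability values are made up for illustration and are not part of the notebook's model. A useful reference point is that a model spreading its probability uniformly over k choices has perplexity exactly k.

```python
import math

def perplexity_from_probs(probs):
    # Average negative log2-probability (bits per word), then exponentiate with base 2.
    avg_bits = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** avg_bits

# Hypothetical per-word probabilities assigned by some model to a 4-word sequence.
example_probs = [0.5, 0.25, 0.5, 1.0]
print(perplexity_from_probs(example_probs))  # bits: 1 + 2 + 1 + 0 = 4, average 1 -> 2.0

# A uniform guess over 8 equally likely words gives perplexity 8.
print(perplexity_from_probs([1 / 8] * 5))  # -> 8.0
```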


In [1]:

import math
from collections import defaultdict

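def build_ngram_model(text, n=2):
    # Build a count-based n-gram model from the provided text.
    # Here n is the number of context words: each word is predicted
    # from the n words before it (n=2 gives a two-word context).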
    model = defaultdict(lambda: defaultdict(int))
    words = text.split()
    for i in range(n, len(words)):
        context = tuple(words[i-n:i])
        word = words[i]
        model[context][word] += 1
    # convert counts to probabilities
    for context, counts in model.items():
        total = float(sum(counts.values()))
        for word in list(counts.keys()):
            counts[word] /= total
    return model


def perplexity(model, text, n=2):
    # Compute perplexity of the model on the provided text.
    words = text.split()
    log_prob = 0.0
    count = 0
    for i in range(n, len(words)):
        context = tuple(words[i-n:i])
        word = words[i]
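        # Fixed probability floor for unseen context/word pairs.
        # This avoids log(0) but is not true smoothing (probabilities no longer sum to 1).
        prob = model.get(context, {}).get(word, 1e-6)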
        log_prob += -math.log(prob, 2)
        count += 1
    return math.pow(2, log_prob / max(count, 1))

# Example corpus for training and evaluation
train_text = "hello world it is a nice day hello world it is another beautiful day"
test_text = "hello world it is a sunny day hello world it is a nice day"

# Build the model with a two-word context (n=2).
# Conditioning on the two previous words makes this a trigram model.
model = build_ngram_model(train_text, n=2)

# Compute perplexity on train and test
train_perplexity = perplexity(model, train_text, n=2)
test_perplexity = perplexity(model, test_text, n=2)

print("Trigram Model Perplexity:\nTrain:", train_perplexity, " Test:", test_perplexity)


Trigram Model Perplexity:
Train: 1.122462048309373  Test: 35.49536659755571
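
To see where these two numbers come from, the arithmetic can be redone by hand. With a two-word context, each 14-word text yields 12 predictions. On the training text, 10 of them have probability 1 and 2 have probability 0.5 (`it is` is followed once by `a` and once by `another`). On the test text, 7 have probability 1, 2 have probability 0.5, and 3 fall back to the `1e-6` floor because `sunny` never appears in training. The sketch below only re-adds those terms; the variable names are illustrative.

```python
import math

steps = 12  # predictions per text: 14 words minus the 2-word starting context

# Training text: two predictions at p = 0.5 cost 1 bit each; the rest cost 0 bits.
train_bits = 2 * -math.log2(0.5)
train_ppl = 2 ** (train_bits / steps)

# Test text: the same two 0.5 predictions, plus three that hit the 1e-6 floor.
test_bits = 2 * -math.log2(0.5) + 3 * -math.log2(1e-6)
test_ppl = 2 ** (test_bits / steps)

print(round(train_ppl, 4), round(test_ppl, 4))  # approximately 1.1225 and 35.4954
```

Almost all of the test perplexity comes from the three floored terms, so the result mostly reflects the arbitrary choice of `1e-6` rather than the quality of the model. A proper smoothing scheme, such as add-k smoothing, would be the usual remedy.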


In [2]:

# Optional: Compute perplexity using a pretrained language model from Hugging Face
# This cell will attempt to load a small model and compute perplexity on a simple sentence.
# If the environment does not have internet or the model cannot be loaded, it will fall back gracefully.
try:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

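    model_name = "distilgpt2"  # small model for demo
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Use a distinct name so the n-gram `model` from the previous cell is not overwritten.
    lm = AutoModelForCausalLM.from_pretrained(model_name)

    def calculate_perplexity(text):
        encodings = tokenizer(text, return_tensors='pt')
        seq_len = encodings.input_ids.size(1)
        max_length = lm.config.n_positions
        stride = 512
        nlls = []
        prev_end_loc = 0
        for begin_loc in range(0, seq_len, stride):
            # Clamp the window to the sequence so short texts are handled correctly.
            end_loc = min(begin_loc + max_length, seq_len)
            trg_len = end_loc - prev_end_loc  # tokens not already scored by an earlier window
            input_ids = encodings.input_ids[:, begin_loc:end_loc]
            target_ids = input_ids.clone()
            target_ids[:, :-trg_len] = -100  # mask context-only tokens out of the loss

            with torch.no_grad():
                outputs = lm(input_ids, labels=target_ids)
                # outputs.loss is a mean over scored tokens; rescale to an approximate sum.
                neg_log_likelihood = outputs.loss * trg_len

            nlls.append(neg_log_likelihood)
            prev_end_loc = end_loc
            if end_loc == seq_len:
                break
        # The loss uses natural logs, so exponentiate with base e.
        ppl = torch.exp(torch.stack(nlls).sum() / end_loc).item()
        return ppl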

    sample_text = "Perplexity is a measure of how well a language model predicts text."
    print("Perplexity (distilgpt2):", calculate_perplexity(sample_text))
except Exception as e:
    print("Could not load pretrained model or compute perplexity due to:", e)


Could not load pretrained model or compute perplexity due to: No module named 'transformers'
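
The two cells use different logarithm bases: the n-gram cell sums base-2 logs and raises 2 to the average, while the Hugging Face cell exponentiates a natural-log loss with `torch.exp`. Both conventions give the same perplexity as long as the base of the logarithm matches the base of the exponent. The following is a small check with made-up probabilities; it needs no pretrained model, and the names `bits_per_word` and `nats_per_word` are illustrative.

```python
import math

# Hypothetical per-token probabilities, chosen only for illustration.
probs = [0.5, 0.25, 0.125, 0.125]

# Base-2 convention: average bits per token, then 2 ** average.
bits_per_word = -sum(math.log2(p) for p in probs) / len(probs)
ppl_base2 = 2 ** bits_per_word

# Natural-log convention: average nats per token, then exp(average).
nats_per_word = -sum(math.log(p) for p in probs) / len(probs)
ppl_base_e = math.exp(nats_per_word)

print(math.isclose(ppl_base2, ppl_base_e))  # True
```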



### Conclusion

Perplexity gives us a quantitative measure of how well a language model predicts a sequence of text. In this notebook, we implemented a simple n-gram model and calculated perplexity on both training and test texts. We also attempted to use a pretrained Hugging Face model (DistilGPT-2) to compute perplexity on a sample sentence.

In practice, lower perplexity values indicate better predictive performance. However, perplexity alone doesn't tell the whole story: a model can have low perplexity yet still produce incoherent or irrelevant outputs. Perplexity should therefore be used alongside other evaluation metrics and human judgment.
