# Lab 2: Evaluating an N-Gram Language Model

In this lab, you will evaluate the quality of an n-gram language model using perplexity.

We have built several n-gram language models and provided an implementation for computing the probabilities. The implementation includes [Laplace Smoothing](https://en.wikipedia.org/wiki/Additive_smoothing), which assigns some probability to sequences that were never encountered during training.
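
As a reference for reading the implementation, additive smoothing with strength $\alpha$ (the `smoothing` argument in the code) can be written as follows. Here $c(h, w)$ is the number of times token $w$ followed context $h$ in training and $|V|$ is the vocabulary size. The symbols $\alpha$, $c$, and $h$ are notation chosen here for illustration, not taken from the lecture.

$$
P(w \mid h) = \frac{c(h, w) + \alpha}{\sum_{w'} c(h, w') + \alpha\,|V|}
$$

With $\alpha = 1$ this is classic add-one (Laplace) smoothing; the code uses a much smaller default of 0.001.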

First, review the implementation below to make sure that it makes sense to you.

In [1]:
import pickle
BOS = '<BOS>'
EOS = '<EOS>'
OOV = '<OOV>'
class NGramLM:
    def __init__(self, path, smoothing=0.001, verbose=False):
        with open(path, 'rb') as fin:
            data = pickle.load(fin)
        self.n = data['n']
        self.V = set(data['V'])
        self.model = data['model']
        self.smoothing = smoothing
        self.verbose = verbose

    def get_prob(self, context, token):
        # Take only the n-1 most recent context (Markov Assumption)
        context = tuple(context[-self.n+1:])
        # Add <BOS> tokens if the context is too short, i.e., it's at the start of the sequence
        while len(context) < (self.n-1):
            context = (BOS,) + context
        # Handle words that were not encountered during the training by replacing them with a special <OOV> token
        context = tuple((c if c in self.V else OOV) for c in context)
        if token not in self.V:
            token = OOV
        if context in self.model:
            # Compute the probability using a Maximum Likelihood Estimation and Laplace Smoothing
            count = self.model[context].get(token, 0)
            prob = (count + self.smoothing) / (sum(self.model[context].values()) + self.smoothing * len(self.V))
        else:
            # Simplified formula if we never encountered this context; the probability of all tokens is uniform
            prob = 1 / len(self.V)
        # Optional logging
        if self.verbose:
            print(f'{prob:.4n}', *context, '->', token)
        return prob
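
One detail of `get_prob` is easy to miss: the slice `context[-self.n+1:]`. The snippet below is a small standalone check of what that slice returns for different orders (the helper name `truncate` is made up for illustration). For `n = 1` the start index is `-0`, which Python treats as `0`, so the whole context is kept rather than an empty one.

```python
def truncate(context, n):
    """Mirror the slice used in get_prob: context[-n+1:]."""
    return tuple(context[-n + 1:])

history = ('my', 'dear', 'Watson')

print(truncate(history, 3))  # ('dear', 'Watson')
print(truncate(history, 2))  # ('Watson',)
print(truncate(history, 1))  # ('my', 'dear', 'Watson'), because -0 == 0
print(truncate((), 1))       # ()
```

So a unigram model only looks up the empty context `()` when the caller passes an empty context. With any longer context it would take the `else` branch and return the uniform `1 / len(self.V)`, assuming the unigram counts are stored under `()` only.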

In [2]:
# Load pre-built n-gram language models
model_unigram = NGramLM('arthur-conan-doyle.tok.train.n1.pkl')
model_bigram = NGramLM('arthur-conan-doyle.tok.train.n2.pkl')
model_trigram = NGramLM('arthur-conan-doyle.tok.train.n3.pkl')
model_4gram = NGramLM('arthur-conan-doyle.tok.train.n4.pkl')
model_5gram = NGramLM('arthur-conan-doyle.tok.train.n5.pkl')

Now it's time to see how well these models fit our data! We'll use Perplexity for this calculation, but it's up to you to implement it below.

Recall the formula for perplexity from the lecture:

$$
\text{perplexity} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid w_{<i})}
$$

Here $N$ is the total number of tokens scored (not the n-gram order `n`).

Hint: you'll want to use the [`math.log2`](https://docs.python.org/3/library/math.html#math.log2) function.
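
To get a feel for the formula before applying it to a model, here is a minimal worked example on hand-picked probabilities. The function name `perplexity_from_probs` and the numbers are illustrative assumptions, not part of the lab code.

```python
import math
from typing import Sequence

def perplexity_from_probs(probs: Sequence[float]) -> float:
    """Perplexity of a sequence given the probability assigned to each token."""
    log_prob_sum = sum(math.log2(p) for p in probs)
    return 2 ** (-log_prob_sum / len(probs))

# Three tokens with probabilities 1/2, 1/4, 1/8: the log2 values are -1, -2, -3,
# so the average is -2 and the perplexity is 2^2.
print(perplexity_from_probs([0.5, 0.25, 0.125]))  # 4.0

# A model that always assigns 1/10 is about as uncertain as a uniform
# choice among 10 options, so the result is approximately 10.
print(perplexity_from_probs([0.1] * 5))
```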

In [6]:
from typing import List, Tuple
import math

def perplexity(model: NGramLM, texts: List[Tuple[str]]) -> float:
    log_prob_sum = 0
    total_tokens = 0  # Total number of tokens

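    for text in texts:
        # Slide over the text starting at index n-1, so for n > 1 the first
        # n-1 tokens of each text serve only as context and are not scored
        for i in range(model.n - 1, len(text)):
            context = text[i - model.n + 1:i]  # the n-1 tokens preceding position i
            token = text[i]
            prob = model.get_prob(context, token)
            log_prob_sum += math.log2(prob)
            total_tokens += 1
            #print(f"Context: {context}, Token: {token}, Probability: {prob}, Log Prob Sum: {log_prob_sum}")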

    perplexity_value = 2 ** (-log_prob_sum / total_tokens)
    return perplexity_value

# Example:
perplexity(model_unigram, [('My', 'dear', 'Watson', '.'), ('Come', 'over', 'here', '!')])

744.5701113236693

In [7]:
# Tests
assert round(perplexity(model_unigram, [('My', 'dear', 'Watson')])) == 7531
assert round(perplexity(model_bigram, [('My', 'dear', 'Watson')])) == 24
assert round(perplexity(model_trigram, [('My', 'dear', 'Watson')])) == 521

AssertionError: 
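
One possible reason for a failed assertion here: the loop in `perplexity` starts at index `model.n - 1`, so for `n > 1` the first tokens of each text are never scored, whereas the formula sums over every token $w_i$. The sketch below shows an alternative loop that scores every token and passes the full prefix as context, relying on `get_prob` to truncate and pad it. `StubLM` is a hypothetical stand-in used only to make the example runnable without the pickle files, and this variant is not verified against the lab's expected values.

```python
import math
from typing import Sequence

class StubLM:
    """Hypothetical stand-in for NGramLM: records each call, returns a fixed probability."""
    def __init__(self, n: int, prob: float = 0.25):
        self.n = n
        self.prob = prob
        self.calls = []

    def get_prob(self, context, token):
        self.calls.append((tuple(context), token))
        return self.prob

def perplexity_all_tokens(model, texts: Sequence[Sequence[str]]) -> float:
    """Score every token, passing the full prefix text[:i] as context."""
    log_prob_sum = 0.0
    total_tokens = 0
    for text in texts:
        for i, token in enumerate(text):
            log_prob_sum += math.log2(model.get_prob(text[:i], token))
            total_tokens += 1
    return 2 ** (-log_prob_sum / total_tokens)

stub = StubLM(n=3)
print(perplexity_all_tokens(stub, [('My', 'dear', 'Watson')]))  # 4.0
print(len(stub.calls))  # 3, one call per token
```

With the original loop, the same trigram input would produce a single call (for `'Watson'`), because `range(model.n - 1, len(text))` is `range(2, 3)`.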

Great! Now let's see how well the model fits a held-out test set.

The test data covers a few of the stories, and represents about 12% of the total data.

Your task is to print the perplexity for the unigram, bigram, trigram, 4-gram, and 5-gram models.

In [8]:
toks_test = []
with open('arthur-conan-doyle.tok.test.txt', 'rt') as fin:
    for line in fin:
        toks_test.append(line.split())

print("Perplexity for unigram model:", round(perplexity(model_unigram, toks_test)))
print("Perplexity for bigram model:", round(perplexity(model_bigram, toks_test)))
print("Perplexity for trigram model:", round(perplexity(model_trigram, toks_test)))
print("Perplexity for 4-gram model:", round(perplexity(model_4gram, toks_test)))
print("Perplexity for 5-gram model:", round(perplexity(model_5gram, toks_test)))

Perplexity for unigram model: 621
Perplexity for bigram model: 289
Perplexity for trigram model: 1258
Perplexity for 4-gram model: 5454
Perplexity for 5-gram model: 11273


You should see that the perplexity for the bigram model is lower than the others. What does this indicate?

The lecture mentioned that it's a bad idea to determine the quality of a model based on the perplexity of data that was used for training. Below, evaluate the same five models using the training data.

In [9]:
toks_train = []
with open('arthur-conan-doyle.tok.train.txt', 'rt') as fin:
    for line in fin:
        toks_train.append(line.split())

print("Perplexity for unigram model:", round(perplexity(model_unigram, toks_train)))
print("Perplexity for bigram model:", round(perplexity(model_bigram, toks_train)))
print("Perplexity for trigram model:", round(perplexity(model_trigram, toks_train)))
print("Perplexity for 4-gram model:", round(perplexity(model_4gram, toks_train)))
print("Perplexity for 5-gram model:", round(perplexity(model_5gram, toks_train)))

Perplexity for unigram model: 529
Perplexity for bigram model: 60
Perplexity for trigram model: 22
Perplexity for 4-gram model: 17
Perplexity for 5-gram model: 17


You should see that you get much lower perplexities when measuring on the training data, especially for the models with larger values of `n`. This suggests that the higher-order models are over-fitting to the training data: they fit long sequences seen in training that rarely recur in unseen text.

## Optional Extras:
 - In the models we explore above, we use smoothing. What happens to perplexity calculations when smoothing isn't applied? You can try this out by setting `smoothing=0`.

When smoothing is not applied (i.e., `smoothing=0`), any n-gram that never occurred in the training data gets a probability of zero. The logarithm of zero is negative infinity (and `math.log2(0)` raises a `ValueError`), so a single zero-probability token is enough to make the perplexity infinite.
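
The effect of a single zero is easy to reproduce in isolation. The snippet below is a small illustration with made-up probabilities (the helper `perplexity_with_zeros` is hypothetical): `math.log2(0)` raises a `ValueError`, and if the zero is instead treated as a log probability of negative infinity, the perplexity becomes infinite.

```python
import math

def perplexity_with_zeros(probs):
    """Perplexity from per-token probabilities, treating log2(0) as -infinity."""
    log_prob_sum = 0.0
    for p in probs:
        log_prob_sum += math.log2(p) if p > 0 else float('-inf')
    return 2 ** (-log_prob_sum / len(probs))

try:
    math.log2(0)
except ValueError as err:
    print('math.log2(0) raised ValueError:', err)

print(perplexity_with_zeros([0.5, 0.25]))       # finite
print(perplexity_with_zeros([0.5, 0.0, 0.25]))  # inf
```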

To demonstrate this, we can modify the get_prob method to remove smoothing and then observe the effect on perplexity calculations. Here's the modified version of the NGramLM class and the perplexity function:

In [16]:
import pickle

BOS = '<BOS>'
EOS = '<EOS>'
OOV = '<OOV>'

class NGramLM:
    def __init__(self, path, smoothing=0.0, verbose=False):
        with open(path, 'rb') as fin:
            data = pickle.load(fin)
        self.n = data['n']
        self.V = set(data['V'])
        self.model = data['model']
        self.smoothing = smoothing
        self.verbose = verbose

    def get_prob(self, context, token):
        # Ensure context is a tuple and contains only the n-1 most recent tokens
        context = tuple(context[-self.n + 1:])
        
        # Pad context with <BOS> tokens if it is too short
        while len(context) < (self.n - 1):
            context = (BOS,) + context
        
        # Replace OOV tokens in context with <OOV>
        context = tuple((c if c in self.V else OOV) for c in context)
        
        # Replace token with <OOV> if it's not in the vocabulary
        if token not in self.V:
            token = OOV
        
        # Retrieve the context from the model
        if context in self.model:
            count = self.model[context].get(token, 0)
            total = sum(self.model[context].values())
            # Unsmoothed maximum likelihood estimate
            prob = count / total if total > 0 else 0
        else:
            # Handle unseen context: probability is zero
            prob = 0
        
        if self.verbose:
            print(f'Context: {context} Token: {token} Count: {count if context in self.model else "N/A"} Prob: {prob:.4f}')
        
        return prob

# Load pre-built n-gram language models with smoothing=0
model_unigram_no_smoothing = NGramLM('arthur-conan-doyle.tok.train.n1.pkl', smoothing=0.0)
model_bigram_no_smoothing = NGramLM('arthur-conan-doyle.tok.train.n2.pkl', smoothing=0.0)
model_trigram_no_smoothing = NGramLM('arthur-conan-doyle.tok.train.n3.pkl', smoothing=0.0)
model_4gram_no_smoothing = NGramLM('arthur-conan-doyle.tok.train.n4.pkl', smoothing=0.0)
model_5gram_no_smoothing = NGramLM('arthur-conan-doyle.tok.train.n5.pkl', smoothing=0.0)

# Define the perplexity function
from typing import List, Tuple
import math

def perplexity(model: NGramLM, texts: List[Tuple[str]]) -> float:
    log_prob_sum = 0
    total_tokens = 0  # Total number of tokens

    for text in texts:
        # Add BOS and EOS tokens to the text
        # text = (BOS,) * (model.n - 1) + text + (EOS,)

        for i in range(model.n - 1, len(text)):
            context = text[i - model.n + 1:i]
            token = text[i]
            prob = model.get_prob(context, token)
            if prob > 0:
                log_prob_sum += math.log2(prob)
            else:
                log_prob_sum += float('-inf')  # Log(0) is negative infinity
            total_tokens += 1
            print(f"Context: {context}, Token: {token}, Probability: {prob}, Log Prob Sum: {log_prob_sum}")

    perplexity_value = 2 ** (-log_prob_sum / total_tokens)
    return perplexity_value

# Run the perplexity function without smoothing; the per-token trace is printed by perplexity()
ppl_no_smoothing = perplexity(model_unigram_no_smoothing, [('My', 'dear', 'Watson')])

Context: (), Token: My, Probability: 0.0006605441154985051, Log Prob Sum: -10.564057462156757
Context: (), Token: dear, Probability: 0.00040743842638225544, Log Prob Sum: -21.825187791356207
Context: (), Token: Watson, Probability: 0.0015289229838485645, Log Prob Sum: -31.178456340096673


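With smoothing set to zero, any n-gram that was not seen in training gets zero probability, its log probability is negative infinity, and the perplexity becomes infinite. In the run above all three tokens occur in the unigram counts, so no zero is encountered and the result stays finite. The effect appears as soon as a text contains an n-gram absent from the training counts, and it reflects how poorly unsmoothed counts generalize to unseen data.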



- Interpolated or "back-off" smoothing is sometimes used in n-gram language models. This technique computes the weighted average probability of models with different values of `n`. Try implementing this yourself. How does it affect the perplexity on the held-out test set? What about the perplexity on the training data?

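Interpolated smoothing computes a weighted average of the probabilities from models with different values of `n`. Lower-order models supply usable estimates when a higher-order context was rarely or never seen, which typically lowers perplexity on held-out data compared with a single high-order model. On the training data the effect is usually the opposite: mixing in lower-order models tends to raise perplexity relative to the highest-order model alone.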

#### Implementation Steps
- Define Interpolated NGramLM: Create a new class that combines multiple n-gram models using interpolated smoothing.
- Probability Calculation: Compute the weighted average probability from the different n-gram models.
- Perplexity Calculation: Use the interpolated model to compute perplexity on both training and test datasets.
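
In symbols, the interpolated probability described in the steps above is a weighted sum over the component models. Here $P_k$ is the model of order $k$, $\lambda_k$ is its weight, and $K$ is the highest order used (5 for the models loaded in this lab). The notation is chosen here for illustration.

$$
P_{\text{interp}}(w_i \mid w_{<i}) = \sum_{k=1}^{K} \lambda_k \, P_k(w_i \mid w_{i-k+1}, \dots, w_{i-1}), \qquad \sum_{k=1}^{K} \lambda_k = 1, \quad \lambda_k \ge 0
$$

Because the weights are non-negative and sum to 1, the result is still a valid probability distribution as long as each component model is.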

In [17]:
import pickle
from typing import List, Tuple
import math

BOS = '<BOS>'
EOS = '<EOS>'
OOV = '<OOV>'

class NGramLM:
    def __init__(self, path, smoothing=0.001, verbose=False):
        with open(path, 'rb') as fin:
            data = pickle.load(fin)
        self.n = data['n']
        self.V = set(data['V'])
        self.model = data['model']
        self.smoothing = smoothing
        self.verbose = verbose

    def get_prob(self, context, token):
        context = tuple(context[-self.n + 1:])
        while len(context) < (self.n - 1):
            context = (BOS,) + context
        context = tuple((c if c in self.V else OOV) for c in context)
        if token not in self.V:
            token = OOV

        if context in self.model:
            count = self.model[context].get(token, 0)
            total_count = sum(self.model[context].values())
            if total_count == 0:
                prob = 0
            else:
                prob = (count + self.smoothing) / (total_count + self.smoothing * len(self.V))
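        else:
            # Unseen context: every token gets smoothing / |V|. Note that these values
            # sum to `smoothing` rather than 1, unlike the uniform 1 / |V| used in the first cell
            prob = self.smoothing / len(self.V)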
        
        if self.verbose:
            print(f'Context: {context} Token: {token} Count: {count if context in self.model else "N/A"} Prob: {prob:.4f}')
        
        return prob

# Load pre-built n-gram language models
model_unigram = NGramLM('arthur-conan-doyle.tok.train.n1.pkl')
model_bigram = NGramLM('arthur-conan-doyle.tok.train.n2.pkl')
model_trigram = NGramLM('arthur-conan-doyle.tok.train.n3.pkl')
model_4gram = NGramLM('arthur-conan-doyle.tok.train.n4.pkl')
model_5gram = NGramLM('arthur-conan-doyle.tok.train.n5.pkl')

class InterpolatedNGramLM:
    def __init__(self, models, lambdas):
        assert len(models) == len(lambdas), "Number of models must match number of lambdas"
        assert abs(sum(lambdas) - 1.0) < 1e-5, "Lambdas must sum to 1"
        self.models = models
        self.lambdas = lambdas
        self.n = max(model.n for model in models)
    
    def get_prob(self, context, token):
        prob = 0.0
        for model, lambda_ in zip(self.models, self.lambdas):
            prob += lambda_ * model.get_prob(context, token)
        return prob

def perplexity(model: NGramLM, texts: List[Tuple[str]]) -> float:
    log_prob_sum = 0
    total_tokens = 0  # Total number of tokens

    for text in texts:
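        # Pad with <BOS>/<EOS> so every token and the end of the text are scored;
        # this also prevents a ZeroDivisionError when a text is shorter than n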
        text = (BOS,) * (model.n - 1) + text + (EOS,)
        for i in range(model.n - 1, len(text)):
            context = text[i - model.n + 1:i]
            token = text[i]
            prob = model.get_prob(context, token)
            if prob > 0:
                log_prob_sum += math.log2(prob)
            else:
                log_prob_sum += float('-inf')
            total_tokens += 1

    perplexity_value = 2 ** (-log_prob_sum / total_tokens)
    return perplexity_value

# Define lambdas for the interpolation
lambdas = [0.1, 0.2, 0.3, 0.2, 0.2]

# Create the interpolated model
interpolated_model = InterpolatedNGramLM(
    models=[model_unigram, model_bigram, model_trigram, model_4gram, model_5gram],
    lambdas=lambdas
)

# Exercise the perplexity function on two small hand-written samples
# (toy examples only; these are not the lab's training and test files)
train_texts = [('My', 'dear', 'Watson'), ('I', 'am', 'Sherlock', 'Holmes')]
test_texts = [('Sherlock', 'Holmes', 'said'), ('My', 'name', 'is', 'Watson')]

# Calculate perplexity on the training data
train_perplexity = perplexity(interpolated_model, train_texts)
print(f"Training Perplexity: {train_perplexity}")

# Calculate perplexity on the test data
test_perplexity = perplexity(interpolated_model, test_texts)
print(f"Test Perplexity: {test_perplexity}")

Training Perplexity: 257.7859942614262
Test Perplexity: 1463.5311769137288
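
The two lists scored above are small hand-written examples. To run the same `perplexity` function on the tokenized files used earlier in the lab, each line has to be a tuple, because the function concatenates `(BOS,)` and `(EOS,)` tuples with the text and a list would raise a `TypeError`. A minimal sketch, assuming one whitespace-tokenized text per line as in the earlier cells; `load_token_tuples` is a hypothetical helper name.

```python
from typing import List, Tuple

def load_token_tuples(path: str) -> List[Tuple[str, ...]]:
    """Read one whitespace-tokenized text per line and return each line as a tuple."""
    with open(path, 'rt') as fin:
        return [tuple(line.split()) for line in fin]

# Intended usage with the lab files (not executed here):
# toks_test = load_token_tuples('arthur-conan-doyle.tok.test.txt')
# print(perplexity(interpolated_model, toks_test))
```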
