## Pipeline 3: N-Gram Likelihood Calculation

This notebook constitutes the fourth and last pipeline of the N-Gram Language Modeling Analysis Project. It includes the functions to calculate the likelihoods, probabilities, and perplexities of the different n-gram models. It also contains an analysis of the results for three random sentences picked from the training set.

## Import Libraries

In [4]:
import pandas as pd
import numpy as np
import os
import re
import math

import nltk
from nltk.util import pad_sequence
from nltk.util import ngrams

import itertools
import pickle

## Load pkl files

In [1]:
#NOTE: Execute this cell ONLY if you want to work with the testing set
#Data for test
data_folder = "output_data/UNK 5-55/"
unigram_pkl = "unigram_dictionary_test.pkl"
bigram_pkl = "bigram_dictionary_test.pkl"
trigram_pkl = "trigram_dictionary_test.pkl"
fourgram_pkl = "fourgram_dictionary_test.pkl"
test_sentences_unk="test_sentences_unk.pkl"

In [3]:
#NOTE: Execute this cell ONLY if you want to work with the training set
#Data for training
data_folder = "output_data/UNK 5-55/" #Same folder as the test files; needed when the test cell is skipped
unigram_pkl = "unigram_dictionary_training.pkl"
bigram_pkl = "bigram_dictionary_training.pkl"
trigram_pkl = "trigram_dictionary_training.pkl"
fourgram_pkl = "fourgram_dictionary_training.pkl"
training_sentences_unk="training_sentences_unk.pkl"

Load the n-gram dictionaries into *unigram_dict*, *bigram_dict*, *trigram_dict*, and *fourgram_dict*

In [2]:
def load_pkl(folder, file):
    """Helper function to load a pkl file"""
    filename = os.path.join(folder, file)
    # Use a separate name for the handle so the `file` argument is not shadowed
    with open(filename, "rb") as handle:
        file_contents = pickle.load(handle)
    return file_contents

In [5]:
unigram_dict = load_pkl(data_folder, unigram_pkl)
bigram_dict = load_pkl(data_folder, bigram_pkl)
trigram_dict = load_pkl(data_folder, trigram_pkl)
fourgram_dict = load_pkl(data_folder, fourgram_pkl)

Combine the unigram, bigram, trigram, and fourgram dicts together

In [6]:
#For Python 3.9+ (dict union operator)
#ngram_dict = unigram_dict | bigram_dict | trigram_dict | fourgram_dict

In [6]:
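#For Python 3.7. Produces the same merged contents as the line above.

# Merge dict2 into dict1 in place using the update() method
def Merge(dict1, dict2):
    return(dict1.update(dict2))
 
# Each call returns None; unigram_dict is modified in place
# and ends up holding the n-grams of every order
Merge(unigram_dict, bigram_dict)
Merge(unigram_dict, trigram_dict)
Merge(unigram_dict, fourgram_dict)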
 

In [7]:
# ngram_dict and unigram_dict refer to the same merged dictionary
ngram_dict = unigram_dict
len(unigram_dict)

434612
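
Because `dict.update` mutates its receiver, `unigram_dict` and `ngram_dict` are the same object after the cells above, so the count shown covers n-grams of every order. The following is a minimal sketch, using hypothetical toy counts and variable names that are not part of the notebook, of a merge that leaves the unigram dictionary intact. It assumes the same key convention as the notebook (strings for unigrams, tuples for longer n-grams).

```python
# Toy counts (hypothetical), keyed the same way as in this notebook:
# unigrams by string, higher-order n-grams by tuple.
unigram_counts = {"the": 3, "cat": 1}
bigram_counts = {("the", "cat"): 1}

# Non-mutating merge; dict unpacking works on Python 3.5+,
# whereas the | operator needs Python 3.9+.
merged_counts = {**unigram_counts, **bigram_counts}

print(len(unigram_counts))  # the unigram dictionary is unchanged
print(len(merged_counts))

# By contrast, update() mutates its receiver and returns None.
result = unigram_counts.update(bigram_counts)
print(result, len(unigram_counts))
```

Keeping a separate unigram dictionary matters wherever its length is used as the vocabulary size.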

Read the sentences from the training set

In [9]:
#NOTE: Execute this cell ONLY if you are working with the training set
filename = os.path.join(data_folder, training_sentences_unk)
with open(filename, "rb") as file: 
    training_sentences_unk = pickle.load(file)

Read the sentences from the testing set

In [8]:
#NOTE: Execute this cell ONLY if you are working with the test set
filename = os.path.join(data_folder, test_sentences_unk)
with open(filename, "rb") as file: 
    test_sentences_unk = pickle.load(file)

## Helper Functions: Probability and Perplexity

In [9]:
def ngram_probability(ngram, k = None): 
    """ Computes the probability of the given ngram. 
    
    Parameters
    -------------
    ngram: tuple. An n-gram of length n>=2
    k: float or None. If None is provided, no smoothing is applied. 
        If a float is provided, add-k smoothing is applied.
    
    Return
    -------------
    The probability of the ngram (with or without add-k smoothing)

    """
    
    # print(ngram)
    n = len(ngram)
    
    # Obtain prefix
    if n > 2: 
        prefix = ngram[:(n-1)]
    elif n == 2:
        prefix = ngram[0]
    else: 
        # Fail early instead of continuing with an undefined prefix
        raise ValueError("ngram must be of length 2 or greater")
        
    # print(prefix)
    
    # No smoothing applied
    if k is None: 
        probability = ngram_dict[ngram]/float(ngram_dict[prefix])
        
    # Apply add-k smoothing
    else:
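        # Vocabulary size. Caution: if unigram_dict was merged in place above,
        # len(unigram_dict) counts n-grams of every order, not only unigrams.
        V = float(len(unigram_dict))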
        probability = (ngram_dict[ngram] + k)/(ngram_dict[prefix] + (k*V))
        
    # print(ngram_dict[ngram])
    # print(ngram_dict[prefix])
    # print(probability)
    
    return(probability)
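
As a worked illustration of the add-k formula used above, here is a small sketch with hypothetical counts. The helper `add_k_probability` and the numbers are assumptions made for this example only; they are not taken from the project's dictionaries.

```python
def add_k_probability(ngram_count, prefix_count, vocab_size, k):
    """Add-k estimate: (count(ngram) + k) / (count(prefix) + k * V)."""
    return (ngram_count + k) / (prefix_count + k * vocab_size)

# Hypothetical counts: the n-gram was seen 4 times, its prefix 10 times, V = 5.
unsmoothed = 4 / 10
add_one = add_k_probability(4, 10, vocab_size=5, k=1)         # 5 / 15
add_quarter = add_k_probability(4, 10, vocab_size=5, k=0.25)  # 4.25 / 11.25

# An unseen n-gram gets a small non-zero probability instead of zero.
unseen = add_k_probability(0, 10, vocab_size=5, k=1)          # 1 / 15

print(unsmoothed, add_one, add_quarter, unseen)
```

The larger `k * V` is relative to the prefix count, the closer every estimate moves toward the uniform value `1 / V`, so the value used for `V` has a strong effect on the smoothed results.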

In [10]:
def sentence_probabilities(sentence, n, k=None): 
    """ Computes the probability of the given sentence. 
    
    Parameters
    -------------
    sentence: list. The sentence as a tokenized list
    n: int. The degree for the language model. (e.g. n=2 for bigram, n=3 for trigram, etc.)
    k: float or None. If None is provided, no smoothing is applied. 
        If a float is provided, add-k smoothing is applied.
    
    Return
    -------------
    A list of probabilities, where index i corresponds to the probability of the ith ngram in the sentence.
    
    """

    # Obtain ngrams from the sentence
    padded_sentence = list(pad_sequence(sentence,
                         pad_left=True, left_pad_symbol="<s>",
                         pad_right=True, right_pad_symbol="</s>",
                         n=n))
    ngram_sentence = list(nltk.ngrams(padded_sentence, n))
    
    probabilities = [ngram_probability(ngram, k) for ngram in ngram_sentence] 
    
    # print("\nsentence:", sentence)
    # print("\npadded_sentence:", padded_sentence)
    # print("\nngram_sentence:", ngram_sentence)
    
    return(probabilities)
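
To make the padding step concrete, the sketch below mimics what `pad_sequence` followed by `ngrams` produces, using plain Python so it runs without NLTK. The helper `pad_and_split` is a hypothetical stand-in written for this illustration, not a function from the notebook or from NLTK.

```python
def pad_and_split(tokens, n, left="<s>", right="</s>"):
    """Pad with n-1 boundary symbols on each side, then slide a window of size n."""
    padded = [left] * (n - 1) + list(tokens) + [right] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

tokens = ["so", "you", "get"]
bigrams = pad_and_split(tokens, 2)
trigrams = pad_and_split(tokens, 3)

print(bigrams)
print(trigrams)
# A sentence of T tokens yields T + n - 1 n-grams.
print(len(bigrams), len(trigrams))
```

This is why a nine-token sentence yields 10 bigram, 11 trigram, and 12 four-gram probabilities.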

In [11]:
def sentence_likelihood(sentence_probabilities): 
    """ Computes the likelihood for the sentence. 
     sentence_probabilities: list. A list of the probabilities for each ngram in a sentence
     return: float. The probability of the sentence.
    """
    return np.prod(sentence_probabilities)

In [12]:
def sentence_log_likelihood(sentence_probabilities):
    """ Computes the log-likelihood for the sentence.
     sentence_probabilities: list. A list of the probabilities for each ngram in a sentence
     return: float. The log-likelihood of the sentence.
    """
    log_probabilities = [math.log(val) for val in sentence_probabilities]
    return(np.sum(log_probabilities))

In [13]:
def sentence_perplexity(sentence_probabilities):
    """ Computes the perplexity for the sentence.
     sentence_probabilities: list. A list of the probabilities for each ngram in a sentence
     return: float. The perplexity of the sentence.
    """
    probability = np.prod(sentence_probabilities)
    N = len(sentence_probabilities) # Number of n-gram probabilities, including those that involve padding symbols
    return probability ** (-1.0/N)
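
The perplexity above is the inverse of the geometric mean of the n-gram probabilities. A product of many small probabilities can underflow to zero in floating point, so an equivalent log-space formulation is sometimes preferred. The sketch below is an alternative shown for illustration, not the method used in this notebook; `perplexity_from_logs` is a hypothetical name.

```python
import math

def perplexity_from_logs(probabilities):
    """Perplexity as exp(-mean log-probability); avoids multiplying tiny numbers."""
    mean_log = sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(-mean_log)

# Short sentence: both formulations agree.
probs = [0.5, 0.25, 0.125]
product = 1.0
for p in probs:
    product *= p
direct = product ** (-1.0 / len(probs))
stable = perplexity_from_logs(probs)
print(direct, stable)

# Long sentence of small probabilities: the raw product underflows to 0.0,
# while the log-space version is still well defined.
long_probs = [1e-3] * 200
long_product = 1.0
for p in long_probs:
    long_product *= p
print(long_product)
print(perplexity_from_logs(long_probs))
```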

In [14]:
def corpus_statistics(corpus, n, k):
    """ Compute the average log-likelihood and perplexity for a corpus. Uses an n-gram model with or without add-k smoothing."""
    
    log_likelihoods = []
    perplexities = []

    for sentence in corpus: 
        
        # Get the list of probabilities for each ngram in the sentence
        ngram_probabilities = sentence_probabilities(sentence, n, k)

        # Get the log likelihood of the sentence
        log_likelihood = sentence_log_likelihood(ngram_probabilities)
        
        # Get the perplexity of the sentence
        perplexity = sentence_perplexity(ngram_probabilities)

        # Store the statistics for the given sentence
        log_likelihoods.append(log_likelihood)
        perplexities.append(perplexity)

    avg_log_likelihood = np.mean(log_likelihoods)
    avg_perplexity = np.mean(perplexities)
    
    return({"avg_log_likelihood":avg_log_likelihood, "avg_perplexity":avg_perplexity})

## Analysis

In [15]:
#NOTE: Assign the appropriate value to corpus and comment out the other line
corpus = test_sentences_unk #NOTE: Use this line ONLY if you are working with the test set. Comment otherwise.
#corpus = training_sentences_unk #NOTE: Use this line ONLY if you are working with the training set. Comment otherwise.

n_values = [2, 3, 4]
k_values = [None, 1, 0.25]

# No smoothing
print("\nNo smoothing")
print(f"bigram: {corpus_statistics(corpus, 2, None)}")
print(f"trigram: {corpus_statistics(corpus, 3, None)}")
print(f"fourgram: {corpus_statistics(corpus, 4, None)}")

# add-1 smoothing
print("\nadd-1 smoothing")
print(f"bigram: {corpus_statistics(corpus, 2, 1)}")
print(f"trigram: {corpus_statistics(corpus, 3, 1)}")
print(f"fourgram: {corpus_statistics(corpus, 4, 1)}")

# add-0.25 smoothing
print("\nadd-0.25 smoothing")
print(f"bigram: {corpus_statistics(corpus, 2, 0.25)}")
print(f"trigram: {corpus_statistics(corpus, 3, 0.25)}")
print(f"fourgram: {corpus_statistics(corpus, 4, 0.25)}")


No smoothing
bigram: {'avg_log_likelihood': -81.50553892566927, 'avg_perplexity': 48.07546904387368}
trigram: {'avg_log_likelihood': -32.10712816176196, 'avg_perplexity': 4.648789629178389}
fourgram: {'avg_log_likelihood': -13.264335791603559, 'avg_perplexity': 2.022652103987471}

add-1 smoothing
bigram: {'avg_log_likelihood': -234.21628779103438, 'avg_perplexity': 59551.838696568375}
trigram: {'avg_log_likelihood': -265.2576563900911, 'avg_perplexity': 131353.90587832712}
fourgram: {'avg_log_likelihood': -280.5707084126525, 'avg_perplexity': 150665.81848706104}

add-0.25 smoothing
bigram: {'avg_log_likelihood': -209.95632327272435, 'avg_perplexity': 19639.11448145478}
trigram: {'avg_log_likelihood': -242.58761757715416, 'avg_perplexity': 48261.50347826865}
fourgram: {'avg_log_likelihood': -257.7304272790726, 'avg_perplexity': 57109.21649882589}


### Result Analysis for Specific Sentences of the Training Set

Take three sample sentences from the training set and show the likelihood calculation for each n-gram model (n = 2, 3, 4).

In [16]:
#Get sentences in training set
training_sentences_unk = "training_sentences_unk.pkl"

filename = os.path.join(data_folder, training_sentences_unk)
with open(filename, "rb") as file: 
    training_sentences_unk = pickle.load(file)

len(training_sentences_unk)

38008

In [21]:
import random
random.seed(42)

#Get three random sentences from the training set
index1, index2, index3 = random.sample(range(0, len(training_sentences_unk)), 3)
sentence1 = training_sentences_unk[index1]
print("Sentence 1: ", sentence1)
sentence2 = training_sentences_unk[index2]
print("Sentence 2: ", sentence2)
sentence3 = training_sentences_unk[index3]
print("Sentence 3: ", sentence3)

Sentence 1:  ['these', 'people', 'are', 'very', 'vulnerable', 'and', 'often', 'easily', '<UNK>']
Sentence 2:  ['however', 'the', 'ifc', 'has', 'made', 'no', 'commitment', 'on', 'the', 'next', 'five', 'dams']
Sentence 3:  ['so', 'you', 'get', 'the', 'argument', 'why', 'not', 'do', 'biology', '?']


### Sentence 1

['these', 'people', 'are', 'very', 'vulnerable', 'and', 'often', 'easily', '\<UNK\>']

#### Bigram LM

In [22]:
# Get the list of probabilities for each bigram in the sentence
sentence1_bigrams_probab = sentence_probabilities(sentence1, 2)
print(sentence1_bigrams_probab)    

[0.004630604083350874, 0.02932551319648094, 0.04632867132867133, 0.005097312326227989, 0.0045662100456621, 0.10714285714285714, 0.0008663634394628547, 0.004484304932735426, 0.140625, 0.0774194765741457]


In [23]:
# Get the log likelihood of the sentence
print(sentence_log_likelihood(sentence1_bigrams_probab)) 

-41.85661864843611
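
The log-likelihood is the sum of the natural logarithms of the per-bigram probabilities. As a sanity check, the sketch below recomputes it from the probabilities printed above, using only the standard library, and also derives the corresponding sentence perplexity. The variable names are chosen for this example.

```python
import math

# Bigram probabilities for sentence 1, copied from the output above.
bigram_probs = [
    0.004630604083350874, 0.02932551319648094, 0.04632867132867133,
    0.005097312326227989, 0.0045662100456621, 0.10714285714285714,
    0.0008663634394628547, 0.004484304932735426, 0.140625,
    0.0774194765741457,
]

# Sum of log-probabilities; matches the log-likelihood printed above
# up to floating-point rounding.
log_likelihood = sum(math.log(p) for p in bigram_probs)

# Perplexity is exp of the negative mean log-probability.
perplexity = math.exp(-log_likelihood / len(bigram_probs))

print(log_likelihood)
print(perplexity)
```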


#### Trigram LM

In [24]:
# Get the list of probabilities for each trigram in the sentence
sentence1_trigrams_probab = sentence_probabilities(sentence1, 3)
print(sentence1_trigrams_probab) 

[0.004630604083350874, 0.03409090909090909, 0.25, 0.03773584905660377, 0.045454545454545456, 0.5, 0.3333333333333333, 0.0625, 1.0, 0.2222222222222222, 1.0]


In [25]:
# Get the log likelihood of the sentence
print(sentence_log_likelihood(sentence1_trigrams_probab))

-22.576699609353195


#### Four-Gram LM

In [26]:
# Get the list of probabilities for each fourgram in the sentence
sentence1_fourgrams_probab = sentence_probabilities(sentence1, 4)
print(sentence1_fourgrams_probab) 

[0.004630604083350874, 0.03409090909090909, 0.5, 0.2, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]


In [27]:
# Get the log likelihood of the sentence
print(sentence_log_likelihood(sentence1_fourgrams_probab))

-11.749524747192693


### Sentence 1, Sentence 2 and Sentence 3

Perform the same process (likelihood calculation) for the three sentences:

In [28]:
sentences = [sentence1, sentence2, sentence3]

ngram_id_dict = {2:"Bigram", 3:"Trigram", 4:"Four-gram"}

for sentence in sentences:
    print("Sentence: ", sentence)
    for n in range(2,5): # n is the n-gram order (named n to avoid confusion with the smoothing parameter k)
        print("----", ngram_id_dict[n], "----")
        probs = sentence_probabilities(sentence, n) #Pass k as a 3rd argument for smoothing
        print(">>>", ngram_id_dict[n] , "probabilities: ", probs)
        print(">>> Sentence Log Likehood: ", sentence_log_likelihood(probs))
    print("-----\n")

Sentence:  ['these', 'people', 'are', 'very', 'vulnerable', 'and', 'often', 'easily', '<UNK>']
---- Bigram ----
>>> Bigram probabilities:  [0.004630604083350874, 0.02932551319648094, 0.04632867132867133, 0.005097312326227989, 0.0045662100456621, 0.10714285714285714, 0.0008663634394628547, 0.004484304932735426, 0.140625, 0.0774194765741457]
>>> Sentence Log Likehood:  -41.85661864843611
---- Trigram ----
>>> Trigram probabilities:  [0.004630604083350874, 0.03409090909090909, 0.25, 0.03773584905660377, 0.045454545454545456, 0.5, 0.3333333333333333, 0.0625, 1.0, 0.2222222222222222, 1.0]
>>> Sentence Log Likehood:  -22.576699609353195
---- Four-gram ----
>>> Four-gram probabilities:  [0.004630604083350874, 0.03409090909090909, 0.5, 0.2, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
>>> Sentence Log Likehood:  -11.749524747192693
-----

Sentence:  ['however', 'the', 'ifc', 'has', 'made', 'no', 'commitment', 'on', 'the', 'next', 'five', 'dams']
---- Bigram ----
>>> Bigram probabilities:  [0.008550