# Model Assessment and Out of Vocabulary
In this milestone we will learn how to assess the quality of our n-gram language model and handle words not present in the original corpus. You can access the code for Milestone 2 here:[]

In milestone 3, we will work on the following items:

* Assessing the relevance of a sentence by calculating its perplexity
* Handling n-grams that are not present in the corpus with Laplace smoothing

The goal is to assess the quality of our n-gram language model and be able to generate new words/text outside of the original corpus.

In [29]:
import pandas as pd
import numpy as np
import pickle
import random
from tqdm import tqdm
from collections import defaultdict, Counter
from nltk.util import ngrams
from nltk.tokenize import WordPunctTokenizer

In [13]:
# If there's a problem with the versions of the libraries, you can uncomment this line and install the proper versions

# !pip install -r requirements.txt

In [14]:
# Set some global parameters

# Displaying all columns when displaying dataframes
pd.options.display.max_columns = None

# We will work with trigrams 
ngrams_degree = 3


In [15]:
#Load test df to assess quality of our n-gram language model


In [16]:
# Check the 1st 5 lines


In [17]:
#Load counts and freq object from Milestone 2





In [18]:
#Print 5 random samples from counts object to check




In [19]:
#Print 5 random samples from freq object to check




# Perplexity

Let's now implement a way to measure the quality of our model.

The idea is to estimate the probability of a test sentence given our model. 
An uncommon sentence should be less probable than a common one.


Notes : 
  1. At this point the sentence should exist in the corpus. Our model does not know yet how to handle out-of-vocabulary (OOV) bigrams, trigrams or tokens.
  2. To avoid the problem of underflow caused by multiplying multiple very small floats, we work in the log space:

So instead of calculating perplexity with (case ngrams_degree = 3):
 
$$PP(w_{1},\cdots, w_N) = ( \prod_{i = 3}^{N} \frac{1}{ p(w_i/ w_{i-2}w_{i-1} )} )^{\frac{1}{N}}$$

We compute

$$PP(w_{1},\cdots, w_N) = \exp \left[ - \frac{1}{N} \sum_{i = 3}^{N} \log p(w_i/ w_{i-2}w_{i-1}) \right]$$
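
To see why the log-space form matters, here is a small sketch using made-up probabilities (the values and the function names `perplexity_product` and `perplexity_log` are illustrative assumptions, not taken from our model). Both formulas agree on a short sentence, but a plain product of many small probabilities underflows to zero.

```python
import math

def perplexity_product(probs, n_tokens):
    # Direct form: (product of 1/p) ** (1/N)
    product = 1.0
    for p in probs:
        product *= 1.0 / p
    return product ** (1.0 / n_tokens)

def perplexity_log(probs, n_tokens):
    # Log-space form: exp(-(1/N) * sum(log p))
    return math.exp(-sum(math.log(p) for p in probs) / n_tokens)

# Hypothetical trigram probabilities for a 5-token sentence (3 trigrams)
probs = [0.2, 0.05, 0.1]
print(perplexity_product(probs, 5))
print(perplexity_log(probs, 5))

# Hypothetical long text: 100 trigrams, each with probability 1e-5
long_probs = [1e-5] * 100
print(math.prod(long_probs))            # the raw product underflows
print(perplexity_log(long_probs, 102))  # the log-space form stays finite
```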



In [20]:
#Define a tokenizer object using WordPunctTokenizer from NLTK


#Define a perplexity function that takes in an input sentence and returns the perplexity score of that sentence


    #Convert input sentence to lowercase and tokenize

    #Get number of tokens

    
    #Initialize logprob to be 0 - we will add the log probabilities of each ngram to this variable
    #and then take the exponent at the end to calculate the perplexity

    
    #For each ngram in the sentence

    
    
    
        
        #Try except block in case the prefix/token doesn't exist in our model
            #Get the prefix bigram (beginning to 2nd last index)

            #Get the following token (last index)

            #Get the corresponding probability of that prefix/token combination from the freq object
            #and calculate the log of this probability. Add this value to the logprob variable.

            
            #Pass in case prefix/token doesn't exist in freq object

            
    #Return the perplexity - calculate using previous definition
    #Take the exponent of -(sum of logprobabilities)/number of tokens 
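
One possible way to fill in the skeleton above is sketched here. It assumes that `freq` maps a prefix tuple to a dictionary of `{token: probability}`, which may differ from the object saved in Milestone 2. To keep the sketch runnable without NLTK, `tokenize` is a regex stand-in that mirrors the pattern documented for `WordPunctTokenizer`. The names `tokenize`, `perplexity`, and `toy_freq` are hypothetical.

```python
import math
import re

def tokenize(text):
    # Stand-in for WordPunctTokenizer: runs of word characters or runs of punctuation
    return re.findall(r"\w+|[^\w\s]+", text.lower())

def perplexity(sentence, freq, ngrams_degree=3):
    tokens = tokenize(sentence)
    n_tokens = len(tokens)
    logprob = 0.0
    # Slide a window of ngrams_degree tokens over the sentence
    for i in range(n_tokens - ngrams_degree + 1):
        ngram = tuple(tokens[i:i + ngrams_degree])
        prefix, token = ngram[:-1], ngram[-1]
        try:
            logprob += math.log(freq[prefix][token])
        except KeyError:
            # Prefix or token unknown to the model: skipped for now
            pass
    return math.exp(-logprob / n_tokens)

# Toy model with made-up probabilities (assumed structure: prefix -> {token: prob})
toy_freq = {
    ("the", "difference"): {"between": 0.5},
    ("difference", "between"): {"the": 0.25},
}
print(perplexity("The difference between the", toy_freq))
print(perplexity("completely unseen words", toy_freq))
```

In this sketch a sentence made up only of unseen trigrams leaves `logprob` at 0 and therefore gets a perplexity of 1, which is misleadingly low. This is the weakness that smoothing addresses.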



Let's calculate the perplexity on some sentences.

Take the time to see how the perplexity score varies when you modify the sentence. For instance, compare the perplexity for

* *the difference between the two approaches is discussed here.*
* *the difference between the two approaches is discussed here*
* *the difference between the two approaches*


In [21]:
#Calculate the perplexity of some test sentences








# Out of Vocabulary (OOV) 

The main weakness of our model so far is that it does not know how to handle elements that are not already in the original corpus.

Both text generation and perplexity calculation rely on the count of the prefix in the corpus. When that prefix is missing, its count is 0, which breaks the logarithm (log 0 is undefined) and the division (division by zero).

To remedy this problem, we can artificially assign a probability (although a very low one) to missing n-grams and tokens.

This method is called Laplace smoothing. It relies on calculating the frequency of a token / prefix with:

$$ p(token / prefix) = \frac{ count( prefix + token) + \delta}{count(prefix) + \delta \times |N| }$$


Where 

* N is the total number of prefixes in the model
* $\delta$ is a small positive constant that we choose (for example $\delta = 1$)

When the prefix is missing from the original corpus, both counts are 0, the $\delta$ terms cancel, and the probability of a token / prefix becomes:

$$p(token / prefix) = \frac{1} { | N |}$$
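
The formula can be checked with a few hand-picked numbers. This is a minimal sketch: the helper `laplace_prob`, the counts, and the model size `n` are made-up values for illustration only.

```python
def laplace_prob(count_ngram, count_prefix, delta, n):
    # (count(prefix + token) + delta) / (count(prefix) + delta * |N|)
    return (count_ngram + delta) / (count_prefix + delta * n)

n = 1000  # hypothetical value of |N|

# Prefix seen 10 times, followed by the token 3 times
seen = laplace_prob(3, 10, delta=1, n=n)

# Prefix seen 10 times, never followed by the token
unseen_token = laplace_prob(0, 10, delta=1, n=n)

# Prefix never seen: delta cancels and we are left with 1 / |N|
unseen_prefix = laplace_prob(0, 0, delta=1, n=n)
unseen_prefix_delta10 = laplace_prob(0, 0, delta=10, n=n)

print(seen, unseen_token, unseen_prefix, unseen_prefix_delta10)
```

Note that the missing-prefix case gives the same value whatever $\delta$ is, while the other two cases move closer to $1 / |N|$ as $\delta$ grows.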

Let's implement perplexity with Laplace smoothing.








In [22]:
#We can modify our original perplexity function to deal with words not in our model.
#In the original function we simply skipped calculating the probabilities for any prefix/tokens
#that didn't have probabilities. In this function, rather than skipping these cases we will 
#implement Laplace smoothing - artificially adding a small probability to these missing tokens and prefixes.

#Define a perplexity function that takes in an input sentence and returns the perplexity score of that sentence
#with additive Laplace smoothing for any words that are not in the original corpus.



    #Convert input sentence to lowercase and tokenize

    #Get number of tokens

    
    #Initialize logprob to be 0 - we will add the log probabilities of each ngram to this variable
    #and then take the exponent at the end to calculate the perplexity

    
    #For each ngram in the sentence

    
    
    
    
        #Get the prefix bigram (beginning to 2nd last index)

        #Get the following token (last index)

        
        #If prefix is present in model

            #Get the combined count of the potential following tokens

            #If following token is present in model

                #Get the corresponding probability of that prefix/token combination from the counts object
                #and implement Laplace smoothing using the formula defined above.
                #We need to modify our prefix/token probability calculation by adding delta to the
                #numerator and delta*number of tokens to the denominator. This adds an artificial probability 
                #so we can still return a probability in the case where the counts are 0.
                #As before take the log of this value and add it to the logprob variable.


                #If following token is missing then calculate using Laplace smoothing with delta

                
            #If prefix is missing then the smoothed probability is 1/|N|, so add log(1/|N|) = -log(|N|) to the logprob variable

            
    #Return the perplexity - calculate using previous definition
    #Take the exponent of -(sum of logprobabilities)/number of tokens 
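
The steps above could be sketched as follows. This is only one possible reading of the skeleton: it assumes `counts` maps a prefix tuple to a `Counter` of following tokens, and it takes |N| to be the number of prefixes in the model, as defined in the text (the vocabulary size is a more common choice for additive smoothing). The names `tokenize`, `perplexity_laplace`, and `toy_counts` are hypothetical, and `tokenize` is a regex stand-in for `WordPunctTokenizer`.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Stand-in for WordPunctTokenizer: runs of word characters or runs of punctuation
    return re.findall(r"\w+|[^\w\s]+", text.lower())

def perplexity_laplace(sentence, counts, delta=1.0, ngrams_degree=3):
    tokens = tokenize(sentence)
    n_tokens = len(tokens)
    model_size = len(counts)  # |N|, assumed here to be the number of prefixes
    logprob = 0.0
    for i in range(n_tokens - ngrams_degree + 1):
        prefix = tuple(tokens[i:i + ngrams_degree - 1])
        token = tokens[i + ngrams_degree - 1]
        if prefix in counts:
            # Combined count of all tokens seen after this prefix
            prefix_total = sum(counts[prefix].values())
            # 0 if the token never followed this prefix
            ngram_count = counts[prefix].get(token, 0)
            prob = (ngram_count + delta) / (prefix_total + delta * model_size)
        else:
            # Missing prefix: both counts are 0, so the formula reduces to 1 / |N|
            prob = 1.0 / model_size
        logprob += math.log(prob)
    return math.exp(-logprob / n_tokens)

# Toy counts with made-up values (assumed structure: prefix -> Counter of tokens)
toy_counts = {
    ("this", "question"): Counter({"really": 2, "belongs": 1}),
    ("question", "really"): Counter({"belongs": 2}),
}
seen_pp = perplexity_laplace("this question really belongs", toy_counts)
unseen_pp = perplexity_laplace("this model belongs on", toy_counts)
print(seen_pp, unseen_pp)
```

Unlike the version that skips missing n-grams, every trigram now contributes to `logprob`, so the sentence built from unseen prefixes ends up with the higher perplexity.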

    

We can now calculate the perplexity of sentences that were not present in the original corpus. 

For instance: 

In [23]:
#Calculate the perplexity of some new sentences - you can make them up!
#Try with different values for delta and see how the perplexity changes, especially for made-up sentences

#Try the following 2 sentences- "this model belongs on a different planet", "this question really belongs on a different site."
#Try out a delta value of 1






In [24]:
#Try the following 2 sentences- "this model belongs on a different planet", "this question really belongs on a different site."
#Try out a delta value of 10







# Perplexity on the test corpus and sentence probability

How do we calculate the perplexity of a model on a test corpus?

Suppose the corpus contains *m* sentences and *N* tokens in total. The perplexity of the corpus is given by

$$ PP(Corpus) = P(S_1, \cdots, S_m)^{-\frac{1}{N}} $$

If we assume that the sentences are independent, the joint probability becomes a product:

$$ PP(Corpus) = \left( \prod_{k = 1}^{m}  P(S_k) \right)^{-\frac{1}{N}} $$

We calculate this in the log space to avoid underflow:

$$ PP(Corpus) = \exp \left( -\frac{1}{N} \sum_{k = 1}^{m}  \log P(S_k) \right) $$

So to calculate the perplexity on a test corpus, we need the log probability of each individual sentence.

The following function calculates the probability of a sentence. 

Instead of using Laplace smoothing to deal with missing bigrams and tokens, we will simply skip missing elements to make the function faster.
Implementing Laplace smoothing requires several extra conditions that take too long to run.



In [25]:
#We can modify our original perplexity function to simply return the sum of the logprobabilities.

#Define a function that takes in an input sentence and returns the sum of the 
#log probabilities for the tokens in the sentence. 

    #Convert input sentence to lowercase and tokenize

    #Get number of tokens

    
    #Initialize logprob to be 0 - we will add the log probabilities of each ngram to this variable
    #and return the sum at the end (the exponent is taken later, at the corpus level)

    
    #For each ngram in the sentence
 




        #Try except block in case the prefix/token doesn't exist in our model

            #Get the prefix bigram (beginning to 2nd last index)

            #Get the following token (last index)

            #Get the corresponding probability of that prefix/token combination from the freq object
            #and calculate the log of this probability. Add this value to the logprob variable.


            #Pass in case prefix/token doesn't exist in freq object

            
    #Return the sum of logprobabilities


We can now implement the perplexity for a whole set of sentences





In [26]:
#Define a function to calculate the perplexity of an input corpus (list of sentences)

    #Start by calculating the total number of tokens in the corpus
    #Combine all the sentences together to form a single string

    #Convert combined sentence string to lowercase and tokenize

    #Get number of tokens in combined sentence/corpus

    
    #Initialize logprob to be 0 - we will add the log probabilities of each sentence to this variable
    #and then take the exponent at the end to calculate the perplexity

    
    #For each sentence in the input corpus

        #Calculate the logprobability of that sentence using our previously defined function.
        #Add this value to the logprob variable.

        
    #Return the corpus perplexity - calculate using definition as before
    #Take the exponent of -(sum of logprobabilities of sentences)/number of tokens in corpus
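
A compact sketch of the two pieces together is given here, under the same assumptions as before: `freq` is taken to map a prefix tuple to `{token: probability}`, `tokenize` is a regex stand-in for `WordPunctTokenizer`, and `sentence_logprob`, `corpus_perplexity`, and the toy data are hypothetical names and values.

```python
import math
import re

def tokenize(text):
    # Stand-in for WordPunctTokenizer: runs of word characters or runs of punctuation
    return re.findall(r"\w+|[^\w\s]+", text.lower())

def sentence_logprob(sentence, freq, ngrams_degree=3):
    tokens = tokenize(sentence)
    logprob = 0.0
    for i in range(len(tokens) - ngrams_degree + 1):
        prefix = tuple(tokens[i:i + ngrams_degree - 1])
        token = tokens[i + ngrams_degree - 1]
        try:
            logprob += math.log(freq[prefix][token])
        except KeyError:
            # Missing prefix or token: skipped to keep the function fast
            pass
    return logprob

def corpus_perplexity(sentences, freq):
    # N: total number of tokens in the corpus
    n_tokens = len(tokenize(" ".join(sentences)))
    # Sum of sentence log probabilities (sentences assumed independent)
    logprob = sum(sentence_logprob(s, freq) for s in sentences)
    return math.exp(-logprob / n_tokens)

# Toy model and corpus with made-up values
toy_freq = {
    ("the", "difference"): {"between": 0.5},
    ("difference", "between"): {"the": 0.25},
}
toy_corpus = ["the difference between the", "difference between the"]
print(corpus_perplexity(toy_corpus, toy_freq))
```

Because the sum of log probabilities is divided by the token count before the exponent is taken, the intermediate values stay in a comfortable floating-point range even for long corpora.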

    

# Calculate corpus perplexity
Let's now calculate the perplexity of our test corpus composed of just the titles. First we'll calculate the perplexity of a random sample of 1000 sentences from the test set. Then we'll try to calculate the perplexity of the entire test corpus as well. However, there may be an overflow warning if the number is too big!

In [144]:
# Calculate the perplexity of a sample of 1000 titles and save this value to a variable cp_1000


#Print value of cp_1000


100%|██████████| 1000/1000 [00:06<00:00, 152.88it/s]

21.584069021216166





In [143]:
# Calculate the perplexity of the whole test corpus


100%|██████████| 83685/83685 [00:05<00:00, 14843.50it/s]


21.197697454704233

In [146]:
#Create new column 'sample_perplexity' and set value to the sample perplexity value we calculated

#Check first 5 lines


| Unnamed: 0 | post_id | parent_id | comment_id | text | category | tokens | n_tokens | sample_perplexity |
|---|---|---|---|---|---|---|---|---|
| 0 | 154700 | | | Are aov with Error same as lmer of lme package... | title | ['are', 'aov', 'with', 'error', 'same', 'as', ... | 13 | 21.584069 |
| 1 | 160640 | | | How to compare contingency tables for a specif... | title | ['how', 'to', 'compare', 'contingency', 'table... | 10 | 21.584069 |
| 2 | 148203 | | | One-sided significance test for correlation | title | ['one', '-', 'sided', 'significance', 'test', ... | 7 | 21.584069 |
| 3 | 327174 | | | Visualization activization maximization for re... | title | ['visualization', 'activization', 'maximizatio... | 8 | 21.584069 |
| 4 | 169986 | | | Meaning of Intercept and what the intercept sh... | title | ['meaning', 'of', 'intercept', 'and', 'what', ... | 14 | 21.584069 |


# Export data and model
As in Milestones 1 and 2, we will export our test dataframe as a CSV file after transforming the list of tokens into a space-separated string.

In [None]:
#Change tokens column to a space separated string of tokens rather than a list


#Write test dataframe to output csv
