# n-gram Language Models 
Traditional n-gram language models based purely on straightforward statistics.

We recommend running this notebook on Google Colab rather than on your local computer, to avoid the hassle of installing the necessary Python packages locally. This notebook does not need a GPU, so setting the runtime type to "None" (CPU) is good enough. If you want to change the runtime, go to <TT>Runtime > Change Runtime Type</TT> and pick a different option from the dropdown menu.

We will train the n-gram language models on WikiText-2. A raw version of the data can easily be viewed here: https://github.com/pytorch/examples/tree/master/word_language_model/data/wikitext-2.


## Preprocessing the Data

To make the models more robust, it is necessary to perform some basic preprocessing on the corpora. 

* <b>Sentence splitting:</b>&nbsp;&nbsp;&nbsp;&nbsp;Identify individual sentences by splitting the paragraphs of the WikiText dataset at punctuation tokens (".",  "!",  "?").

* <b>Sentence markers:</b>&nbsp;&nbsp;&nbsp;&nbsp;For both training and testing corpora, each sentence must be surrounded by a start-of-sentence marker (`<s>`) and an end-of-sentence marker (`</s>`).

* <b>Unknown words:</b>&nbsp;&nbsp;&nbsp;&nbsp;In order to deal with unknown words in the test corpora, all words that do not appear in the vocabulary must be replaced with a special token for unknown words (`<UNK>`). The WikiText dataset has already done this.
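
As a small illustration of the last two steps, the sketch below wraps a tokenized sentence in the sentence markers and maps out-of-vocabulary words to the unknown token. The helper name `mark_sentence` and the toy vocabulary are assumptions made only for this example; the notebook's own `preprocess` function is what actually prepares the data.

```python
START, END, UNK = "<s>", "</s>", "<UNK>"

def mark_sentence(tokens, vocab):
    """Replace out-of-vocabulary tokens with UNK and add sentence markers."""
    body = [tok if tok in vocab else UNK for tok in tokens]
    return [START] + body + [END]

# Hypothetical toy vocabulary; "green" is deliberately missing from it
toy_vocab = {"I", "eat", "apple", "."}
print(mark_sentence(["I", "eat", "green", "apple", "."], toy_vocab))
```

The word `green` is not in the toy vocabulary, so it comes back as `<UNK>` between the two markers.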


In [None]:
# Constants 
START = "<s>"   # Start-of-sentence token
END = "</s>"    # End-of-sentence-token
UNK = "<UNK>"   # Unknown word token

In [None]:

import torchtext
import random
def preprocess(data, vocab=None):
    final_data = []
    lowercase = "abcdefghijklmnopqrstuvwxyz"
    for paragraph in data:
        # Each `paragraph` is one long string holding the text of a single paragraph from the dataset.
        # The statement below splits it into a list of individual tokens and maps the dataset's
        # own '<unk>' marker (which already appears in the raw text) to UNK.
        # For example, the paragraph "I eat <unk> apple" becomes ['I', 'eat', UNK, 'apple'].
        paragraph = [x if x != '<unk>' else UNK for x in paragraph.split()]

        if vocab is not None:
            # filling in UNK if a token is not in the vocabulary
            paragraph = [x if x in vocab else UNK for x in paragraph]
        if paragraph == [] or paragraph.count('=') >= 2: continue
        sen = []
        prev_punct, prev_quot = False, False
        for word in paragraph:
            if prev_quot:
                if word[0] not in lowercase:
                    final_data.append(sen)
                    sen = []
                    prev_punct, prev_quot = False, False
            if prev_punct:
                if word == '"':
                    prev_punct, prev_quot = False, True
                else:
                    if word[0] not in lowercase:
                        final_data.append(sen)
                        sen = []
                        prev_punct, prev_quot = False, False
            if word in {'.', '?', '!'}: prev_punct = True
            sen += [word]
        if sen[-1] not in {'.', '?', '!', '"'}: continue # Prevent a lot of short sentences
        final_data.append(sen)
    vocab_was_none = vocab is None
    if vocab is None:
        vocab = set()
    for i in range(len(final_data)):
        final_data[i] = [START] + final_data[i] + [END]
        if vocab_was_none:
            for word in final_data[i]:
                vocab.add(word)
    return final_data, vocab

def getDataset():
    dataset = torchtext.datasets.WikiText2(root='.data', split=('train', 'valid'))
    train_dataset, vocab = preprocess(dataset[0])
    test_dataset, _ = preprocess(dataset[1], vocab)

    return train_dataset, test_dataset

train_dataset, test_dataset = getDataset()

100%|██████████| 4.48M/4.48M [00:00<00:00, 28.3MB/s]


Run the next cell to see 10 random sentences of the training data.

In [None]:
if __name__ == '__main__':
    for x in random.sample(train_dataset, 10):
        print (x)

['<s>', 'The', 'plan', 'also', 'included', 'the', 'replacement', 'of', 'the', 'two', 'other', 'signaled', 'intersections', 'at', 'West', 'Allenton', 'Road', 'and', 'Oak', 'Hill', 'Road', 'with', 'overpasses', ';', 'the', 'overpass', 'for', 'West', 'Allenton', 'Road', 'is', 'planned', 'to', 'be', 'constructed', 'as', 'a', 'new', 'exit', '4', '.', '</s>']
['<s>', 'Venus', 'was', 'important', 'to', 'ancient', 'American', 'civilizations', ',', 'in', 'particular', 'for', 'the', 'Maya', ',', 'who', 'called', 'it', '<UNK>', '<UNK>', ',', '"', 'the', 'Great', 'Star', '"', 'or', '<UNK>', '<UNK>', ',', '"', 'the', 'Wasp', 'Star', '"', ';', 'they', 'embodied', 'Venus', 'in', 'the', 'form', 'of', 'the', 'god', '<UNK>', '(', 'also', 'known', 'as', 'or', 'related', 'to', '<UNK>', 'and', '<UNK>', 'in', 'other', 'parts', 'of', 'Mexico', ')', '.', '</s>']
['<s>', 'His', 'CPR', 'history', 'ends', ',', 'for', 'example', ',', 'with', 'a', 'recounting', 'of', 'Western', 'grievances', 'against', 'economic',

## The LanguageModel Class

Four types of language models will be implemented: a <b>unigram</b> model, a <b>smoothed unigram</b> model, a <b>bigram</b> model, and a <b>smoothed bigram</b> model. <b>This class is just a skeleton from which the subclasses are derived.</b> The following methods will be implemented in each subclass:

* <b>`__init__(self, trainCorpus)`</b>: Train the language model on `trainCorpus`. This will involve calculating relative frequency estimates according to the type of model you're implementing.

* <b>`generateSentence(self)`</b>: Return a sentence that is generated by the language model. It should be a list of the form <TT>[&lt;s&gt;, w<sup>(1)</sup>, ..., w<sup>(n)</sup>, &lt;&sol;s&gt;]</TT>, where each <TT>w<sup>(i)</sup></TT> is a word in the vocabulary (including <TT>&lt;UNK&gt;</TT> but excluding <TT>&lt;s&gt;</TT> and <TT>&lt;&sol;s&gt;</TT>). Assume that <TT>&lt;s&gt;</TT> starts each sentence (with probability $1$). The following words <TT>w<sup>(1)</sup></TT>, ... , <TT>w<sup>(n)</sup></TT>, <TT>&lt;&sol;s&gt;</TT> are generated according to the language model's distribution. Note that the number of words <TT>n</TT> is not fixed; instead, the sentence terminates as soon as an end token <TT>&lt;&sol;s&gt;</TT> is generated.

* <b>`getSentenceLogProbability(self, sentence)`</b>: Return the <em>natural logarithm (base-e logarithm) of the probability</em> of <TT>sentence</TT>, which is again a list of the form <TT>[&lt;s&gt;, w<sup>(1)</sup>, ..., w<sup>(n)</sup>, &lt;&sol;s&gt;]</TT>. See the note below about performing the calculations in log space.

* <b>`getCorpusPerplexity(self, testCorpus)`</b>: Compute the perplexity of `testCorpus`, i.e. the inverse probability of the corpus normalized by the number of words. For a corpus $W$ with $N$ words and a bigram model, Jurafsky and Martin's book presents the following formula for perplexity:

$$Perplexity(W) = \Big [ \prod_{i=1}^N \frac{1}{P(w^{(i)}|w^{(i-1)})} \Big ]^{1/N}$$

In order to avoid underflow, do all of the calculations in log space. That is, instead of multiplying probabilities, we add the logarithms of the probabilities and exponentiate the result:

$$\prod_{i=1}^N P(w^{(i)}|w^{(i-1)}) = \exp\Big (\sum_{i=1}^N \log P(w^{(i)}|w^{(i-1)}) \Big ) $$
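
To see why the logarithms matter, here is a minimal sketch with made-up numbers. Combining the two formulas above gives $Perplexity(W) = \exp\big(-\frac{1}{N}\sum_{i=1}^N \log P(w^{(i)}|w^{(i-1)})\big)$, which is what the hypothetical helper `perplexity_from_log_probs` computes. The probability value 0.001 and the corpus length 200 are assumptions chosen only to trigger underflow.

```python
import math

def perplexity_from_log_probs(log_probs):
    """Perplexity = exp(-(1/N) * sum of the per-token log probabilities)."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# 200 tokens, each with a (made-up) probability of 0.001
probs = [0.001] * 200

naive_product = 1.0
for p in probs:
    naive_product *= p  # underflows to 0.0 well before the loop ends

log_probs = [math.log(p) for p in probs]

print(naive_product)                         # the direct product is lost to underflow
print(sum(log_probs))                        # the log-space sum stays finite
print(perplexity_from_log_probs(log_probs))  # approximately 1000
```

The direct product is useless once it hits zero, while the sum of logarithms remains an ordinary negative number that can be safely divided by $N$ and exponentiated.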




In [None]:
import math
from collections import defaultdict

class LanguageModel(object):
    def __init__(self, trainCorpus):
        '''
        Initialize and train the model (i.e. estimate the model's underlying probability
        distribution from the training corpus.)
        '''

        return

    def generateSentence(self):
        '''
        Generate a sentence by drawing words according to the model's probability distribution.
        '''

       
        raise NotImplementedError("Implement generateSentence in each subclass.")

    def getSentenceLogProbability(self, sentence):
        '''
        Calculate the log probability of the sentence provided. 
        '''

        raise NotImplementedError("Implement getSentenceLogProbability in each subclass.")
        
    def getCorpusPerplexity(self, testCorpus):
        '''
        Calculate the perplexity of the corpus provided.
        '''

        raise NotImplementedError("Implement getCorpusPerplexity in each subclass.")

    def printSentences(self, n):
        '''
        Prints n sentences generated by the model.
        '''

        for i in range(n):
            sent = self.generateSentence()
            prob = self.getSentenceLogProbability(sent)
            print('Log Probability:', prob , '\tSentence:',sent)

## Unigram Model

Implement each of the 4 functions described above for an <b>unsmoothed unigram</b> model. The probability distribution of a word is given by $\hat P(w)$.
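
As a rough sketch of the relative frequency estimate $\hat P(w) = c(w)/N$, where $c(w)$ is the count of $w$ and $N$ the total number of counted tokens, the example below works on a two-sentence toy corpus. The corpus, the helper `unigram_prob`, and the use of `random.choices` for sampling are assumptions for illustration; the class in the next cell samples with an explicit cumulative-probability loop instead.

```python
import random
from collections import Counter

toy_corpus = [
    ["<s>", "the", "cat", "saw", "the", "dog", "</s>"],
    ["<s>", "the", "dog", "ran", "</s>"],
]

# Count every token except the start marker
counts = Counter(w for sent in toy_corpus for w in sent if w != "<s>")
total = sum(counts.values())

def unigram_prob(word):
    """Relative frequency estimate c(w) / N; 0.0 for unseen words."""
    return counts.get(word, 0) / total

print(unigram_prob("the"))   # 3 of the 10 counted tokens
print(unigram_prob("</s>"))  # 2 of the 10 counted tokens

# Draw one word in proportion to its count
words = list(counts)
sample = random.choices(words, weights=[counts[w] for w in words], k=1)[0]
print(sample)
```

Because `</s>` is counted like any other token, a unigram model can emit it at any point, which is why generated sentence lengths vary so widely.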

In [None]:
class UnigramModel(LanguageModel):
    def __init__(self, trainCorpus):

        print("Initiliaze un-smoothed UnigramModel")
        # Maps each token in the vocabulary to its frequency in the corpus
        self.counts = defaultdict(int)
        # Total number of tokens in the corpus
        self.total = 0
        
        for sent in trainCorpus:
            for word in sent:
                # Do not count the start-of-sentence token <s>
                if word != START:
                    self.counts[word] += 1
                    self.total += 1
 
    def generateSentence(self):
        # The first token of a generated sentence is always the start token <s>
        output_sent = [START]
        curr_word = START

        # Keep generating words until the end-of-sentence token is drawn
        while curr_word != END:
            rand = random.random()
            cumulative_prob = 0.0
            # Walk the cumulative distribution until it exceeds the random draw
            for word in self.counts.keys():
                cumulative_prob += float(self.counts[word]) / self.total
                if rand < cumulative_prob:
                    break
            curr_word = word
            output_sent.append(word)
        return output_sent

    def getSentenceLogProbability(self, sentence):
        
        log_prob = 0.0
        for word in sentence:
            if word != START:
                # Check membership before indexing: self.counts is a defaultdict, so looking up
                # a missing word would silently insert it with count 0. A word that never
                # appeared in the training set has probability 0, so the sentence has
                # log probability -inf (and math.log(0) would raise a ValueError).
                if word in self.counts:
                    log_prob += math.log(float(self.counts[word]) / self.total)
                else:
                    log_prob = float("-inf")

        return log_prob
 

      
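    def getCorpusPerplexity(self, testCorpus):

        log_prob = 0.0
        count = 0
        for sent in testCorpus:
            for word in sent:
                if word != START:
                    count += 1
                    # An unseen word has probability 0, which makes the perplexity infinite;
                    # returning early also avoids math.log(0) and a defaultdict insertion
                    if word not in self.counts:
                        return float("inf")
                    log_prob += math.log(float(self.counts[word]) / self.total)
        return math.exp(-1.0 / count * log_prob)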
        

Train the model on the full WikiText corpus, and evaluate it on the held-out test set.

In [None]:
def runModel(model_type):
    assert model_type in {'unigram', 'bigram', 'smoothed-unigram', 'smoothed-bigram'}
    # Build the requested model from the training data
    if model_type == 'unigram':
        model = UnigramModel(train_dataset)
    elif model_type == 'bigram':
        model = BigramModel(train_dataset)
    elif model_type == 'smoothed-unigram':
        model = SmoothedUnigramModel(train_dataset)
    else:
        model = SmoothedBigramModelAD(train_dataset)

    print("--------- 5 sentences from the model ---------")
    model.printSentences(5)

    print ("\n--------- Corpus Perplexities ---------")
    print ("Training Set:", model.getCorpusPerplexity(train_dataset))
    print ("Testing Set:", model.getCorpusPerplexity(test_dataset))

if __name__=='__main__':
    runModel('unigram')

Initiliaze un-smoothed UnigramModel
--------- 5 sentences from the model ---------
Log Probability: -286.2687966073898 	Sentence: ['<s>', 'of', 'was', 'in', 'this', 'United', 'the', '.', 'throughout', 'in', 'remains', 'In', 'where', 'the', 'to', '3', "'s", 'nitrous', 'were', 'entrenched', 'may', 'three', ';', 'a', 'availability', '12', 'two', 'Series', 'not', ',', 'Finkelstein', ',', 'depression', 'and', 'Crystal', 'the', '<UNK>', 'value', 'presented', 'Forced', 'during', 'time', 'over', '</s>']
Log Probability: -297.928905998917 	Sentence: ['<s>', 'is', 'the', 'was', 'as', '<UNK>', 'his', 'distribution', 'the', 'businesses', '.', 'its', 'with', 'another', 'of', 'had', 'the', 'ones', 'populations', ',', 'farm', 'appeared', '<UNK>', 'in', 'gained', 'worth', 'of', '<UNK>', 'lines', 'Nicaragua', 'SAS', 'advantage', 'technology', ',', 'student', 'five', 'The', 'Moriarty', 'artistic', '@-@', 'United', 'service', 'provided', '</s>']
Log Probability: -3.3214585046256566 	Sentence: ['<s>', '</

## Smoothed Unigram Model

Implement each of the 4 functions described above for a <b>unigram</b> model with <b>Laplace (add-one) smoothing</b>. The probability distribution of a word is given by $P_L(w)$. This type of smoothing takes away some of the probability mass for observed events and assigns it to unseen events.

In order to smooth the model, we need the number of words in the corpus, $N$, and the number of word types, $S$. The distinction between these is meaningful: $N$ is the number of word instances (tokens), whereas $S$ is the size of our vocabulary. For example, the sentence <em>the cat saw the dog</em> has four word types (<em>the</em>, <em>cat</em>, <em>saw</em>, <em>dog</em>), but five word tokens (<em>the</em>, <em>cat</em>, <em>saw</em>, <em>the</em>, <em>dog</em>). The token <em>the</em> appears twice in the sentence, but both occurrences share the same type <em>the</em>.

If $c(w)$ is the frequency of $w$ in the training data, then compute $P_L(w)$ as follows:

$$P_L(w)=\frac{c(w)+1}{N+S}$$
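
The formula can be checked by hand on the example sentence <em>the cat saw the dog</em>, where $N = 5$ and $S = 4$. The sketch below is only an illustration: it omits the sentence markers, and the helper name `laplace_prob` is an assumption. It uses exact fractions so the result is easy to compare with a manual calculation.

```python
from collections import Counter
from fractions import Fraction

tokens = "the cat saw the dog".split()
counts = Counter(tokens)

N = len(tokens)  # 5 word tokens
S = len(counts)  # 4 word types

def laplace_prob(word):
    """P_L(w) = (c(w) + 1) / (N + S)."""
    return Fraction(counts[word] + 1, N + S)

for w in counts:
    print(w, laplace_prob(w))

# The smoothed probabilities over the vocabulary still sum to 1
print(sum(laplace_prob(w) for w in counts))
```

Here <em>the</em> drops from an unsmoothed $2/5$ to $3/9$, while each word seen once rises from $1/5$ to $2/9$: smoothing moves probability mass from frequent words toward rare ones.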

In [None]:
class SmoothedUnigramModel(LanguageModel):
    def __init__(self, trainCorpus):

        print("Initiliaze smoothed UnigramModel")
        # Maps each token in the vocabulary to its frequency in the corpus
        self.counts = defaultdict(int)
        # Total number of tokens in the corpus
        self.total = 0
        
        for sent in trainCorpus:
            for word in sent:
                # Do not count the start-of-sentence token <s>
                if word != START:
                    self.counts[word] += 1
                    self.total += 1
        # Laplace smoothing is applied entirely here: add S (the vocabulary size) to the
        # total and 1 to every count. The remaining methods therefore stay the same as
        # in the unsmoothed unigram model.
        self.total += len(self.counts)
        for word in self.counts.keys():
            self.counts[word] += 1

    # remains same as unsmoothed unigram
    def generateSentence(self):
        # The first token of a generated sentence is always the start token <s>
        output_sent = [START]
        curr_word = START

        # Keep generating words until the end-of-sentence token is drawn
        while curr_word != END:
            rand = random.random()
            cumulative_prob = 0.0
            # Walk the cumulative distribution until it exceeds the random draw
            for word in self.counts.keys():
                cumulative_prob += float(self.counts[word]) / self.total
                if rand < cumulative_prob:
                    break
            curr_word = word
            output_sent.append(word)
        return output_sent

    # remains the same as unsmoothed unigram
    def getSentenceLogProbability(self, sentence):
        
        log_prob = 0.0
        for word in sentence:
            if word != START:
                # Check membership before indexing: self.counts is a defaultdict, so looking up
                # a missing word would silently insert it with count 0. A word outside the
                # training vocabulary has no entry in self.counts; this implementation gives
                # such a sentence log probability -inf.
                if word in self.counts:
                    log_prob += math.log(float(self.counts[word]) / self.total)
                else:
                    log_prob = float("-inf")

        return log_prob
        
    # remains the same as unsmoothed unigram
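    def getCorpusPerplexity(self, testCorpus):

        log_prob = 0.0
        count = 0
        for sent in testCorpus:
            for word in sent:
                if word != START:
                    count += 1
                    # A word outside the training vocabulary has no entry in self.counts;
                    # returning early avoids math.log(0) and a defaultdict insertion
                    if word not in self.counts:
                        return float("inf")
                    log_prob += math.log(float(self.counts[word]) / self.total)
        return math.exp(-1.0 / count * log_prob)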

In [None]:
if __name__=='__main__':
    runModel('smoothed-unigram')

Initiliaze smoothed UnigramModel
--------- 5 sentences from the model ---------
Log Probability: -68.71720936354858 	Sentence: ['<s>', ',', 'probably', ',', '.', 'the', '1794', 'have', 'H.', '(', 'predecessor', '</s>']
Log Probability: -41.084626898741575 	Sentence: ['<s>', 'World', 'modal', 'into', 'Barrow', '</s>']
Log Probability: -3.3374709753302536 	Sentence: ['<s>', '</s>']
Log Probability: -49.18116681096004 	Sentence: ['<s>', 'the', "'s", 'with', 'that', '"', 'stalled', 'features', 'the', '</s>']
Log Probability: -304.36139267616363 	Sentence: ['<s>', 'October', 'Bullet', 'casualties', 'linebacker', 'major', 'the', 'de', '6', 'to', '1919', 'transformed', 'Ruler', 'bones', 'where', 'colony', 'already', 'split', ',', 'and', 'elementary', 'century', ',', 'of', 'and', 'as', 'five', 'large', 'with', 'normal', 'dropped', 'archipelago', '"', '"', ',', 'Best', 'teammate', 'preceded', 'rock', '<UNK>', '</s>']

--------- Corpus Perplexities ---------
Training Set: 1103.0243317913958
Test

## Bigram Model

Implement each of the 4 functions described above for an <b>unsmoothed bigram</b> model. The probability distribution of a word is given by $\hat P(w'|w)$, where $w$ is the preceding word. Thus, the probability of $w_i$ is conditioned on $w_{i-1}$.
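
A minimal sketch of the relative frequency estimate $\hat P(w'|w) = c(ww')/c(w)$ on a toy corpus follows. The corpus and the helper `bigram_prob` are assumptions for illustration. This sketch keys bigrams by `(previous, current)` tuples, whereas the class in the next cell uses `"current|previous"` strings; tuples are used here only because they cannot collide if a token happens to contain `|`.

```python
from collections import defaultdict

toy_corpus = [
    ["<s>", "the", "cat", "saw", "the", "dog", "</s>"],
    ["<s>", "the", "dog", "ran", "</s>"],
]

bigram_counts = defaultdict(int)   # c(w w'), keyed by (w, w')
context_counts = defaultdict(int)  # c(w), how often w precedes another token

for sent in toy_corpus:
    for prev_word, word in zip(sent, sent[1:]):
        bigram_counts[(prev_word, word)] += 1
        context_counts[prev_word] += 1

def bigram_prob(word, prev_word):
    """Relative frequency estimate c(w w') / c(w); 0.0 if the bigram is unseen."""
    if (prev_word, word) not in bigram_counts:
        return 0.0
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]

print(bigram_prob("the", "<s>"))  # both sentences start with "the"
print(bigram_prob("dog", "the"))  # "the" is followed by "dog" 2 times out of 3
print(bigram_prob("cat", "dog"))  # never observed
```

The zero for an unseen bigram is what makes the unsmoothed model assign infinite perplexity to any test corpus containing a bigram that did not occur in training.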

In [None]:
class BigramModel(LanguageModel):
    def __init__(self, trainCorpus):

        print("Initiliaze un-smoothed BigramModel")
        # Maps each bigram to its frequency in the corpus
        self.bi_counts = defaultdict(int)
        # Maps each unigram (the leading word of a bigram) to its frequency in the corpus
        self.uni_counts = defaultdict(int)

        # prev_word is the leading word of the current bigram
        prev_word = START

        for sent in trainCorpus:
            for word in sent:
                if word != START:
                    # A bigram is stored under the key "current word|previous word".
                    # For example, "jumps|cat" is the bigram whose previous word is "cat"
                    # and whose current word is "jumps".
                    bigram = word + "|" + prev_word

                    # Update the frequency count of the bigram
                    self.bi_counts[bigram] += 1

                    # Update the unigram count of prev_word. Unlike UnigramModel, START is
                    # counted here too, so generateSentence must make sure START is never
                    # selected anywhere except at the beginning of a sentence.
                    self.uni_counts[prev_word] += 1

                    # END = "</s>" is always the last token of a sentence, so it never occurs
                    # as prev_word and would otherwise be missing from uni_counts. It is part
                    # of the vocabulary, so add it explicitly.
                    if word == END:
                        self.uni_counts[word] += 1

                prev_word = word


    def generateSentence(self):
        # the first token of a generated sentence is always the START = <s> token
        output_sent = [START]
        prev_word = START

        # as long as it is not the end of sentence, keep generating the next word
        while prev_word != END:
            rand = random.random()
            cumulative_prob = 0.0
            for curr_word in self.uni_counts.keys():
                # Make sure we do not generate a START token in the middle of a sentence
                if curr_word != START:
                    bigram = curr_word + "|" + prev_word
                    # Check membership before indexing: bi_counts is a defaultdict, so looking
                    # up a missing bigram would insert it with count 0
                    if bigram in self.bi_counts:
                        temp_count = self.bi_counts[bigram]
                        # A stored bigram should never have a zero count
                        if temp_count == 0:
                            raise Exception("Test generateSentence, temp_count == 0, how could this happen?")
                        cumulative_prob += float(temp_count) / self.uni_counts[prev_word]
                        if rand < cumulative_prob:
                            break
            prev_word = curr_word
            output_sent.append(curr_word)

        return output_sent

    def getSentenceLogProbability(self, sentence):
        log_prob = 0.0
        prev_word = START
        for curr_word in sentence:
            if curr_word != START:
                bigram = curr_word + "|" + prev_word
                # Check membership before indexing: bi_counts is a defaultdict, so looking up
                # a missing bigram would insert it with count 0
                if bigram in self.bi_counts:
                    temp_count = self.bi_counts[bigram]
                    # A stored bigram should never have a zero count
                    if temp_count == 0:
                        raise Exception("Test getSentenceLogProbability, temp_count == 0, how could this happen?")
                    log_prob += math.log(float(temp_count) / self.uni_counts[prev_word])
                else:
                    # The bigram was never observed, so the probability of this sentence is 0
                    # and its log probability is negative infinity
                    log_prob = float("-inf")
                prev_word = curr_word
        return log_prob 
        
    def getCorpusPerplexity(self, testCorpus):
        log_prob = 0.0
        word_count = 0

        for sent in testCorpus:
            # Each sentence starts with START = "<s>", which is not counted as a word,
            # so subtract 1 from the sentence length
            word_count += (len(sent) - 1)
            temp = self.getSentenceLogProbability(sent)
            if temp == float("-inf"):
                # One sentence has log probability -inf (probability 0),
                # so the perplexity of the whole corpus is +inf
                return float("inf")
            else:
                log_prob += temp

        return math.exp(-1.0/word_count * log_prob)

In [None]:
if __name__=='__main__':
    runModel('bigram')

Initiliaze un-smoothed BigramModel
--------- 5 sentences from the model ---------
Log Probability: -68.18187569931654 	Sentence: ['<s>', 'In', 'July', '2007', 'at', 'well', '@-@', '<UNK>', 'had', 'run', 'had', 'just', 'below', 'the', 'world', '.', '</s>']
Log Probability: -5.948757402608453 	Sentence: ['<s>', '<UNK>', '.', '</s>']
Log Probability: -50.69394925875432 	Sentence: ['<s>', 'He', 'also', 'occur', 'to', 'place', 'alongside', 'Julia', 'Cho', 'and', 'dreams', '.', '</s>']
Log Probability: -139.96871040735041 	Sentence: ['<s>', 'Norwegian', 'trains', 'a', 'series', 'on', '3', '–', 'because', 'Tomita', 'demonstrated', 'firing', 'of', 'his', 'tomb', 'in', 'an', '1811', 'he', 'was', 'portrayed', '<UNK>', '23', 'miles', '(', 'I', ',', 'completed', '.', '</s>']
Log Probability: -71.4764232973057 	Sentence: ['<s>', 'But', 'what', 'is', 'an', 'individual', 'components', 'shape', 'of', 'design', 'for', 'the', 'Royal', 'Horse', 'Brigades', 'and', 'training', '.', '</s>']

--------- Corpu

## Smoothed Bigram Model

Implement each of the 4 functions described above for a <b>bigram</b> model with <b>absolute discounting</b>. The probability distribution of a word is given by $P_{AD}(w'|w)$.

In order to smooth the model, a discounting factor $D$ needs to be computed. If $n_k$ is the number of bigrams $w_1w_2$ that appear exactly $k$ times, $D$ can be computed as:

$$D=\frac{n_1}{n_1+2n_2}$$

For each word $w$, the number of bigram types $ww'$ is computed as follows:

$$S(w)=|\{w'|c(ww')>0\}|$$

where $c(ww')$ is the frequency of $ww'$ in the training data.

Finally, $P_{AD}(w'|w)$ is computed as follows:

$$P_{AD}(w'|w)=\frac{\max \big (c(ww')-D,0\big )}{c(w)}+\bigg (\frac{D}{c(w)}\cdot S(w) \cdot P_L(w')\bigg )$$

where $c(w)$ is the frequency of $w$ in the training data and $P_L(w')$ is the Laplace-smoothed unigram probability of $w'$.
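
The three formulas can be traced on a toy corpus. The sketch below is an illustration under stated assumptions: the corpus and the helper `p_ad` are made up, `V` denotes the vocabulary size (every token type except `<s>`) used in $P_L$, and exact fractions are used so that the values can be checked by hand.

```python
from collections import Counter
from fractions import Fraction

toy_corpus = [
    ["<s>", "the", "cat", "saw", "the", "dog", "</s>"],
    ["<s>", "the", "dog", "ran", "</s>"],
]

bigrams = Counter()   # c(w w'), keyed by (w, w')
context = Counter()   # c(w), counted as a preceding word
unigrams = Counter()  # token counts without <s>, used for P_L

for sent in toy_corpus:
    unigrams.update(sent[1:])
    for w, w_next in zip(sent, sent[1:]):
        bigrams[(w, w_next)] += 1
        context[w] += 1

# Discount factor D = n1 / (n1 + 2 * n2)
n1 = sum(1 for c in bigrams.values() if c == 1)
n2 = sum(1 for c in bigrams.values() if c == 2)
D = Fraction(n1, n1 + 2 * n2)

N, V = sum(unigrams.values()), len(unigrams)
S = Counter(w for (w, _) in bigrams)  # S(w): distinct continuations of w

def p_ad(w_next, w):
    """Absolute-discounting estimate P_AD(w'|w) as defined above."""
    p_l = Fraction(unigrams[w_next] + 1, N + V)
    discounted = max(bigrams[(w, w_next)] - D, Fraction(0)) / context[w]
    return discounted + D * S[w] * p_l / context[w]

print("D =", D)
print("P_AD(dog | the) =", p_ad("dog", "the"))
print("sum over w' =", sum(p_ad(w_next, "the") for w_next in unigrams))
```

On this toy corpus the probabilities of all possible next words after <em>the</em> sum to 1: the mass $D \cdot S(w)/c(w)$ removed from the observed bigrams is exactly what gets redistributed according to $P_L$.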

In [1]:
class SmoothedBigramModelAD(LanguageModel):
    def __init__(self, trainCorpus):

        print("Initiliaze smoothed BigramModel")
        # The following block is almost an exact copy of the unsmoothed model; see the comments there if needed
        self.bi_counts = defaultdict(int)
        self.uni_counts = defaultdict(int)
        prev_word = START
        # Total number of tokens in the corpus
        self.total = 0

        # S(w) as defined above: the number of distinct bigram types that start with w
        self.S = defaultdict(int)

        for sent in trainCorpus:
            for word in sent:
                if word != START:
                    bigram = word + "|" + prev_word
                    # First occurrence of this bigram type ww', so increment the
                    # type count S(w) of the word w, i.e. prev_word
                    if self.bi_counts[bigram] == 0:
                        self.S[prev_word] += 1
                    self.bi_counts[bigram] += 1
                    self.uni_counts[prev_word] += 1
                    # Make sure END = "</s>" is also included in uni_counts; it is needed
                    # when generating sentences and when calculating P_L(w')
                    if word == END:
                        self.uni_counts[word] += 1

                    self.total += 1
                prev_word = word

        n1 = 0
        n2 = 0

        # Bigrams whose first word is START = "<s>" are also taken into account here
        for bigram in self.bi_counts.keys():
            if self.bi_counts[bigram] == 1:
                n1 += 1
            elif self.bi_counts[bigram] == 2:
                n2 += 1
        self.D = float(n1)/(n1 + 2 * n2)
                   
        
    def Prob_Bigram_AD(self, word, prev_word):
        # P_L(w'): Laplace-smoothed unigram probability. The keys of uni_counts serve as the
        # vocabulary, but they include START = "<s>", which the Laplace-smoothed unigram model
        # does not. The vocabulary size is therefore len(self.uni_counts) - 1.

        P_L = float(self.uni_counts[word] + 1) / (self.total + len(self.uni_counts) - 1)

        bigram = word + "|" + prev_word
        # Check membership before indexing: bi_counts is a defaultdict, so looking up a
        # missing bigram would insert it with count 0
        if bigram in self.bi_counts:
            temp_count = self.bi_counts[bigram]
            # A stored bigram should never have a zero count
            if temp_count == 0:
                raise Exception("Inside Prob_Bigram_AD, temp_count == 0, how could this happen?")
        else:
            temp_count = 0
        
        return max(temp_count - self.D, 0)/self.uni_counts[prev_word] + self.D * self.S[prev_word]*P_L/self.uni_counts[prev_word]


    def generateSentence(self):
        # Almost a straight copy of the unsmoothed model; only the probability calculation changes.
        # The first token of a generated sentence is always the START = <s> token.

        output_sent = [START]
        prev_word = START

        # Keep generating words until the end-of-sentence token is drawn
        while prev_word != END:
            rand = random.random()
            cumulative_prob = 0.0
            for curr_word in self.uni_counts.keys():
                # Make sure we do not generate a START token in the middle of a sentence
                if curr_word != START:
                    cumulative_prob += self.Prob_Bigram_AD(curr_word, prev_word)
                    if rand < cumulative_prob:
                        break
            prev_word = curr_word
            output_sent.append(curr_word)
        return output_sent

    def getSentenceLogProbability(self, sentence):
        log_prob = 0.0
        prev_word = START
        for curr_word in sentence:
            if curr_word != START:
                bigram_prob = self.Prob_Bigram_AD(curr_word, prev_word)
                if bigram_prob == 0.0:
                    # The probability of this sentence is 0, so its log probability is negative infinity
                    return float("-inf")
                else:
                    log_prob += math.log(bigram_prob)
                prev_word = curr_word
        
        return log_prob
        
    def getCorpusPerplexity(self, testCorpus):
        log_prob = 0.0
        word_count = 0

        prev_word = START
        for sent in testCorpus:
            for word in sent:
                if word != START:
                    word_count += 1
                    bigram_prob = self.Prob_Bigram_AD(word, prev_word)
                    if bigram_prob == 0.0:
                        # The probability of this sentence is 0, so the perplexity of the corpus is infinity
                        return float("inf")
                    else:
                        log_prob += math.log(bigram_prob)
                prev_word = word
        return math.exp(-1.0/word_count * log_prob)

        




In [None]:
if __name__=='__main__':
    runModel('smoothed-bigram')

Initiliaze smoothed BigramModel
--------- 5 sentences from the model ---------
Log Probability: -26.780832825267428 	Sentence: ['<s>', 'The', 'series', 'finale', 'of', 'books', '.', '</s>']
Log Probability: -166.29245216126085 	Sentence: ['<s>', 'The', '<UNK>', 'in', 'a', 'due', 'to', 'provide', 'more', ',', 'she', 'felt', 'that', 'had', 'approached', 'that', 'he', 'was', 'born', 'was', 'involved', 'with', 'these', 'is', 'just', 'a', 'composite', '2008', ',', 'the', 'town', '<UNK>', ',', 'inside', ',', '2007', '.', '</s>']
Log Probability: -66.55685142242773 	Sentence: ['<s>', 'The', 'album', 'cover', 'Minor', '<UNK>', ',', 'under', 'match', ',', 'but', 'era', '.', '</s>']
Log Probability: -86.21039496077162 	Sentence: ['<s>', 'Measuring', 'deciduous', 'and', 'publisher', 'was', 'also', 'a', 'popular', 'singer', 'and', 'in', 'her', 'visit', 'the', '<UNK>', '.', '</s>']
Log Probability: -8.659715625266967 	Sentence: ['<s>', 'He', '.', '</s>']

--------- Corpus Perplexities ---------
Tra