# 2. Language Models - NLTK Language Model

Now, let's see how we can implement what we learned in the last section. The core logic required for training a statistical language model is pretty simple, because it's all about counting the number of occurrences of different n-grams. The logic looks something like this:
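```
# tokens: a flat list of word tokens; n: the number of context words
from collections import Counter, defaultdict

language_model = defaultdict(Counter)
for i in range(len(tokens) - n):
    # a list slice is not hashable, so the context is stored as a tuple
    context, word = tuple(tokens[i : i + n]), tokens[i + n]
    language_model[context][word] += 1
```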
But we'll be using the Language Model in NLTK as a starting point, for two main reasons:
1. I don't believe in re-inventing the wheel. If there is a production grade solution, use it. Adapt it to your needs.
2. The implementation is general enough so that we can make our modifications easily.

To take advantage of the second point, it's important to understand the implementation.

In [1]:
import nltk

## NLTK LM - an exploration

Let's take a toy corpus and explore the implementation and do it the NLTK way.

The corpus is one of my favorite movie monologues of all time. It is from the movie Rocky Balboa where Rocky gives this rousing speech to his son the night before his fight.

In [2]:
text = """
I'd hold you up to say to your mother, "this kid's gonna be the best kid in the world. This kid's gonna be somebody better than anybody I ever knew." And you grew up good and wonderful. It was great just watching you, every day was like a privilege. Then the time come for you to be your own man and take on the world, and you did.

But somewhere along the line, you changed. You stopped being you. You let people stick a finger in your face and tell you you're no good. And when things got hard, you started looking for something to blame, like a big shadow. Let me tell you something you already know.

The world ain't all sunshine and rainbows. It's a very mean and nasty place and I don't care how tough you are it will beat you to your knees and keep you there permanently if you let it. You, me, or nobody is gonna hit as hard as life. But it ain't about how hard you hit. It's about how hard you can get hit and keep moving forward. How much you can take and keep moving forward. That's how winning is done! Cause if you're willing to go through all the battling you got to go through to get where you want to get, who's got the right to stop you?

I mean maybe some of you guys got something you never finished, something you really want to do, something you never said to someone, something... and you're told no, even after you paid your dues? Who's got the right to tell you that, who? Nobody! It's your right to listen to your gut, it ain't nobody's right to say no after you earned the right to be where you want to be and do what you want to do!

Now if you know what you're worth then go out and get what you're worth. But ya gotta be willing to take the hits, and not pointing fingers saying you ain't where you wanna be because of him, or her, or anybody! Cowards do that and that ain't you! You're better than that! I'm always gonna love you no matter what. No matter what happens. You're my son and you're my blood. You're the best thing in my life. But until you start believing in yourself, ya ain't gonna have a life."""

## Text Preprocessing

Let's get comfortable with a few nifty tools in the NLTK arsenal which will make our preprocessing simpler.

### Sentence Tokenizer

We have to split the text into sentences first. We can either use a naive `split(".")` or a slightly more sophisticated tokenizer like the Punkt Sentence Tokenizer:

> Punkt Sentence Tokenizer
> 
> This tokenizer divides a text into a list of sentences by using an
> unsupervised algorithm to build a model for abbreviation words,
> collocations, and words that start sentences. It must be trained on a
> large collection of plaintext in the target language before it can be
> used.
> 
> The NLTK data package includes a pre-trained Punkt tokenizer for
> English.
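
To see why a naive split on periods falls short, here is a small sketch on a made-up text (the example sentence is an assumption and is not part of the corpus). It only demonstrates the failure mode; it does not reimplement Punkt.

```python
# A made-up example text (not from the corpus) containing an abbreviation.
sample = "Dr. Smith trained hard. He won the fight."

# Naive approach: split on every period and drop empty pieces.
naive_sentences = [piece.strip() for piece in sample.split(".") if piece.strip()]

# The abbreviation "Dr." is wrongly treated as a sentence boundary,
# so we get three fragments instead of two sentences.
print(naive_sentences)
# ['Dr', 'Smith trained hard', 'He won the fight']
```

A trained sentence tokenizer is meant to learn that abbreviations like this usually do not end a sentence, which is the kind of case the naive split cannot handle.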


In [3]:
order = 3
#using the punkt sentence tokenizer to split our text into sentences.
sentences = nltk.sent_tokenize(text)
print(sentences[:5])

['\nI\'d hold you up to say to your mother, "this kid\'s gonna be the best kid in the world.', 'This kid\'s gonna be somebody better than anybody I ever knew."', 'And you grew up good and wonderful.', 'It was great just watching you, every day was like a privilege.', 'Then the time come for you to be your own man and take on the world, and you did.']


### Word Tokenizer

Now we need to tokenize each sentence into word tokens. We can do it the naive way with `sentence.split(" ")`. But to get more accurate results, we can use `word_tokenize` from NLTK, which uses a Treebank-style word tokenizer by default. The key actions it performs are:
1. Standardizes starting quotes
2. Deals with punctuation
3. Converts parentheses to tokens
4. Splits contractions like 'gonna', using the list Robert MacIntyre compiled

After tokenizing, we also need to pad both ends with special tokens and lowercase all the words. We already talked about why we add start and end tokens. We lowercase everything to make it easier for the model to learn, so that it treats "What" and "what" as the same token. But if we have a huge corpus to train on, it probably makes sense to keep the capitalization.
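
The padding and lowercasing steps can be sketched in plain Python. The helper `pad_both_ends_sketch` is a hypothetical stand-in written for illustration, assuming that an order-`n` model needs `n - 1` start tokens and `n - 1` end tokens; it is not the NLTK function itself.

```python
def pad_both_ends_sketch(tokens, n, start="<s>", end="</s>"):
    """Add n - 1 start tokens and n - 1 end tokens around a token list."""
    return [start] * (n - 1) + list(tokens) + [end] * (n - 1)

# Lowercase first so that "What" and "what" map to the same token.
raw_tokens = ["What", "you", "know"]
lowered = [token.lower() for token in raw_tokens]

padded = pad_both_ends_sketch(lowered, n=3)
print(padded)
# ['<s>', '<s>', 'what', 'you', 'know', '</s>', '</s>']
```

With `n=3`, every real word ends up with two tokens of context before it, even the first word of the sentence.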

In [4]:
#removing the '.' and putting in start and end tokens for each sentence
tokens = []
for sentence in sentences:
    sentence = sentence.replace(".","").replace('\n', ' ').replace('\r', '')
    sentence_tokens = []
    for word in nltk.word_tokenize(sentence):
        sentence_tokens.append(word.lower())
    # pad_both_ends returns a lazy iterator; materialize it so the tokens can be reused
    tokens.append(list(nltk.lm.preprocessing.pad_both_ends(sentence_tokens, n=order)))
print(tokens[0])

['<s>', '<s>', 'i', "'d", 'hold', 'you', 'up', 'to', 'say', 'to', 'your', 'mother', ',', '``', 'this', 'kid', "'s", 'gon', 'na', 'be', 'the', 'best', 'kid', 'in', 'the', 'world', '</s>', '</s>']


### N-Gram Generator
Now that we have word tokens, we can build n-grams of any order from them using ready-made functions in NLTK.

In [5]:
n_grams_l = []
for token in tokens:
    n_grams_l.append(nltk.ngrams(token, n=order))

list(n_grams_l[0])[:5]

[('<s>', '<s>', 'i'),
 ('<s>', 'i', "'d"),
 ('i', "'d", 'hold'),
 ("'d", 'hold', 'you'),
 ('hold', 'you', 'up')]
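
A sliding window is all an n-gram generator needs. The sketch below is a plain-Python stand-in (the name `ngrams_sketch` is hypothetical), applied to a shortened version of the padded first sentence.

```python
def ngrams_sketch(tokens, n):
    """Return every window of n consecutive tokens as a tuple."""
    tokens = list(tokens)  # materialize in case a lazy iterator is passed in
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

padded = ["<s>", "<s>", "i", "'d", "hold", "you", "</s>", "</s>"]
trigrams = ngrams_sketch(padded, n=3)
print(trigrams[:3])
# [('<s>', '<s>', 'i'), ('<s>', 'i', "'d"), ('i', "'d", 'hold')]
```

Note the `list(tokens)` call: if the input is a one-shot iterator that has already been consumed (for example by printing it), the result is an empty list.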

### Putting it all together - Preprocessing Pipeline
Now that we have gone through the basic blocks, let's put all this into a pipeline which we can reuse for another corpus as well.

In [6]:
def pad_tokens(sentence_tokens, order):
    return nltk.lm.preprocessing.pad_both_ends(sentence_tokens,n=order)

def clean_sentence(sentence):
    return sentence.replace(".","").replace('\n', ' ').replace('\r', '')

def tokenize_to_lower_sentence(sentence):
    sentence_tokens = []
    for word in nltk.word_tokenize(sentence):
        sentence_tokens.append(word.lower())
    return sentence_tokens

def split_to_sentences(text):
    return nltk.sent_tokenize(text)

def create_n_grams(tokens, order):
    return nltk.ngrams(tokens, n=order)

def lm_preprocessing_pipeline(text, order):
    sentences = split_to_sentences(text)
    padded_sentence = []
    n_grams = []
    for sentence in sentences:
        sentence = clean_sentence(sentence)
        sentence_tokens = tokenize_to_lower_sentence(sentence)
        sentence_tokens = list(pad_tokens(sentence_tokens, order))
        n_grams.append(create_n_grams(sentence_tokens, order))
        padded_sentence += sentence_tokens
    return n_grams, padded_sentence

In [7]:
n_grams, padded_sentence = lm_preprocessing_pipeline(text, order=3)

In [8]:
padded_sentence[:10]

['<s>', '<s>', 'i', "'d", 'hold', 'you', 'up', 'to', 'say', 'to']

In [9]:
list(n_grams[0])[:5]

[('<s>', '<s>', 'i'),
 ('<s>', 'i', "'d"),
 ('i', "'d", 'hold'),
 ("'d", 'hold', 'you'),
 ('hold', 'you', 'up')]

## Basic Language Model

Let's understand the Language Model implementation in NLTK so that we can confidently use it.

**Initialization**

The general LanguageModel class is an abstract class with just three parameters:
* `order` - The order of the language model, i.e. the n-gram length (the context length plus one). This parameter is only used while generating new text from the model. We will be overwriting that function, and we will talk about this parameter in depth then.
* `vocabulary` - (Optional) Vocabulary is a way to maintain the vocabulary of the model. If not given, it will be built up during training.
* `counter` - (Optional) Counter is the core engine of the model, which counts the context-word pairs. If not given, this too will be built up during training.

How the NgramCounter works is very important to understand.

![Ngram Counter](images/ngram_counter.png)

So, as you can see, the NgramCounter takes the window, splits it into context and word, and then counts the co-occurrences. This is important because it tells us that the order of the language model is one more than the context window you choose. For example, a trigram model has a context window of 2.
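
The context/word split can be sketched as follows. This is a simplified, hypothetical structure (`count_ngrams_sketch` and its nested dictionaries are assumptions made for illustration), not the NLTK counter itself.

```python
from collections import Counter, defaultdict

def count_ngrams_sketch(ngrams):
    """Count words per context, grouped by n-gram order."""
    # counts[order][context][word] -> number of times word followed context
    counts = defaultdict(lambda: defaultdict(Counter))
    for ngram in ngrams:
        *context, word = ngram  # the last item is the word, the rest is the context
        counts[len(ngram)][tuple(context)][word] += 1
    return counts

trigrams = [("you", "can", "get"), ("you", "can", "take"), ("can", "get", "hit")]
counts = count_ngrams_sketch(trigrams)

print(dict(counts[3][("you", "can")]))
# {'get': 1, 'take': 1}
```

The trigrams are stored under the key `3`, while each context is only two words long, which is the "order is context plus one" relationship in miniature.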

**Fit**

The other key method in the class is the `fit` method. It does two primary things:
1. Updates the vocabulary.
2. Updates the context-word co-occurrence counts.

**Other helper methods**
* `context_counts` - A helper method which retrieves all the counts for a given context.
* `entropy` and `perplexity` - Helper methods to quickly calculate Entropy and Perplexity, given a set of text ngrams
* `generate` - A helper method to generate text from the Language Model. This is the method we override to modify our text generation process

**How do we inherit?**
There is an abstract method called `unmasked_score` in the class, which any new class that inherits from it must define. For example, the MLE model simply returns the relative frequency of the word given the context, as is.

**Modified LanguageModel**

The LanguageModel implementation in NLTK has a few problems/shortcomings. So, I have made some changes to the original LanguageModel to make it easy for our use case.
The main things I've changed are:
- By default, text generation from the LM was not flexible. Since we want to look at different sampling strategies, I abstracted that part out.
- I have also included another method in the base class which calculates the probability score for the entire vocabulary, given a context. This will be useful when we generate text from the model.
- In one of the models (InterpolatedModel, which we will cover in the future), I implemented a method to track recursion. (If you haven't followed this, ignore it. I swear it'll become clearer when we reach the part where we talk about Interpolated Smoothing.)

The code for the new LanguageModel is in the `api.py` file. Now, let's import it and create an MLE language model.

In [10]:
from lm.api import LanguageModel

class MLE(LanguageModel):
    """Class for providing MLE ngram model scores.

    Inherits initialization from BaseNgramModel.
    """

    def unmasked_score(self, word, context=None):
        """Returns the MLE score for a word given a context.

        Args:
        - word is expected to be a string
        - context is expected to be something reasonably convertible to a tuple
        """
        return self.context_counts(context).freq(word)

Let's initialize our trigram model

In [11]:
order = 3
n_grams, padded_sentence = lm_preprocessing_pipeline(text, order=order)
model = MLE(order)

### Training the Model

Now, this is just the shell. It has no vocab or counts because we haven't fitted it yet.

In [12]:
len(model.vocab)

0

Now, let's fit this with the n-grams we have prepared and the padded sentence.

In [13]:
model.fit(n_grams, padded_sentence)
print(model.vocab)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 177 items>


Working with the model takes some getting used to, especially how you query out data. The easiest way to query data is using the `context_counts` method.

In [14]:
model.context_counts(('you', "'re"))

FreqDist({'worth': 2, 'my': 2, 'no': 1, 'willing': 1, 'told': 1, 'better': 1, 'the': 1})

In [15]:
model.context_counts(("you","can")).keys()

dict_keys(['get', 'take'])

The NgramCounter in the MLE model is stored under the attribute `counts`, which in turn has an attribute `_counts` where the magic happens.

It holds a structure very similar to what we were doing earlier: a defaultdict with ConditionalFreqDist objects, with a separate key for each n-gram order. In our case we passed only trigrams, so the n-gram counts sit under the key `3`. The NgramCounter also has easy getters, albeit with a non-intuitive interface.

Let's see how we can query the count of "take" after the context "you can".

In [16]:
model.counts[['you','can']]['take']

1

We can also look at all the contexts the model has seen by indexing the `model.counts` with the order. In this case `3`:

In [17]:
list(model.counts[3].keys())[:10]

[('<s>', '<s>'),
 ('<s>', 'i'),
 ('i', "'d"),
 ("'d", 'hold'),
 ('hold', 'you'),
 ('you', 'up'),
 ('up', 'to'),
 ('to', 'say'),
 ('say', 'to'),
 ('to', 'your')]

There are a couple of other convenient functions like `score`, `logscore`, and `perplexity` which we can use.

In [18]:
model.score(word='take',context="you can".split())

0.5

In [19]:
#Score for an unseen context-token pair
model.score(word='hard',context="you can".split())

0.0

In [20]:
#Score for an OOV
model.score(word='dragon',context="you can".split())

0.0

In [21]:
model.logscore(word='take',context="you can".split())

-1.0
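
These numbers follow directly from the counts. Below is a minimal sketch, assuming the context `("you", "can")` was followed once by "get" and once by "take" (which is what the outputs above imply); `mle_score` and `log_score` are hypothetical helpers, not NLTK functions.

```python
import math
from collections import Counter

# Counts for the context ("you", "can"), consistent with the outputs above.
context_counts = Counter({"get": 1, "take": 1})

def mle_score(word, counts):
    """Relative frequency of word among everything seen after the context."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

def log_score(word, counts):
    """Base-2 log of the MLE score; negative infinity for a zero score."""
    score = mle_score(word, counts)
    return math.log2(score) if score > 0 else float("-inf")

print(mle_score("take", context_counts))  # 0.5
print(mle_score("hard", context_counts))  # 0.0
print(log_score("take", context_counts))  # -1.0
```

An unseen word gets a count of zero, so its score is exactly zero; there is no probability mass left over for anything the model has not observed after that context.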

In [22]:
#Perplexity if an unseen context-token pair is present
model.perplexity([['you','can', "hard"]])



inf

In [23]:
#Perplexity if an OOV word is present
model.perplexity([['you','can', "dragon"]])



inf
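
The infinite perplexity is a direct consequence of a zero probability. A minimal sketch, assuming perplexity is computed as 2 raised to the average negative base-2 log probability (`perplexity_sketch` is a hypothetical helper that takes the per-n-gram probabilities directly):

```python
import math

def perplexity_sketch(probabilities):
    """Perplexity as 2 ** (average negative base-2 log probability)."""
    log_probs = [math.log2(p) if p > 0 else float("-inf") for p in probabilities]
    entropy = -sum(log_probs) / len(log_probs)
    # A single zero probability makes the entropy, and so the perplexity, infinite.
    return math.inf if math.isinf(entropy) else 2.0 ** entropy

print(perplexity_sketch([0.5, 0.5]))  # 2.0
print(perplexity_sketch([0.5, 0.0]))  # inf
```

One unseen n-gram is enough to push the whole evaluation to infinity, no matter how well the model scores everything else.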

### Generating Text

Now, how do we generate text from the model? There are many ways of generating text from a language model, but we have chosen the simplest and most straightforward one: we take each context, get the distribution of words after that context, and then choose the most probable one (the one with maximum likelihood).
Now, let's try and generate some text from our model.

In [24]:
seed = ('but',
 'it',
 'ai',
 "n't")

In [25]:
model.generate(num_words=5, text_seed=seed)



['all', 'sunshine', 'and', 'rainbows', '</s>']

Let's make the generated output more human-like. For that, we use the detokenizer that matches the tokenizer we used on the sentences to convert the tokens back into text.

In [26]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

In [41]:
detokenize(model.generate(num_words=5, text_seed=seed))



'all sunshine and rainbows </s>'

All is well in language model paradise, isn't it? Let's try and generate a longer sentence.

In [42]:
detokenize(model.generate(num_words=50, text_seed=seed))



"nobody's right to say to your knees and keep moving forward </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>"

## OOV problem

Looks like it is stuck at the end token, isn't it? If you look at the code for GreedySampler, you can see that it returns the stop token when it hasn't seen the context before. This is partly because of the OOV/unseen data problem.
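
The behaviour can be reproduced with a toy example. The sketch below is a hypothetical greedy generator over made-up trigram counts, written only to illustrate the fallback described above; it is not the actual GreedySampler code.

```python
from collections import Counter

# Toy trigram counts: context (two words) -> counts of the next word.
toy_counts = {
    ("keep", "moving"): Counter({"forward": 2}),
    ("moving", "forward"): Counter({"</s>": 2}),
}

def greedy_generate_sketch(counts, seed, num_words, end_token="</s>"):
    """Repeatedly append the most frequent next word for the last two tokens."""
    generated = []
    context = tuple(seed[-2:])
    for _ in range(num_words):
        next_words = counts.get(context)
        # Unseen context: fall back to the end token.
        word = next_words.most_common(1)[0][0] if next_words else end_token
        generated.append(word)
        context = (context[1], word)
    return generated

print(greedy_generate_sketch(toy_counts, ["keep", "moving"], num_words=5))
# ['forward', '</s>', '</s>', '</s>', '</s>']
```

Once an end token is emitted, the next context contains `</s>`, which was never seen as a context in the toy counts, so the generator keeps falling back to the end token.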

We talked about two kinds of problems; let's see how we can solve one of them (and save the rest for the next part in the series).

#### \<UNK> token and Closed Vocabulary

The problem that this technique solves is the first one: a brand new word that the model hasn't seen. By default, language modelling is an open vocabulary problem, but with this technique we can convert it into a closed vocabulary problem. The technique is simple: we handle OOV words with a special `<UNK>` token. We replace low-frequency words with `<UNK>` so that the rare occurrences are bunched together under this token, which gets us out of the OOV situation.

The LanguageModel in NLTK already does this for you through the vocabulary object. A word counts as known only if it occurs at least `unk_cutoff` times; with the default `unk_cutoff=1`, every word seen in training is kept and only words never seen are mapped to `<UNK>`. Whenever we query the model, the text is passed through the vocabulary object so that unknown tokens are replaced with `<UNK>`. We can also create our own vocabulary object with a higher cutoff so that more words are replaced with `<UNK>`.
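
The replacement step itself can be sketched in a few lines. This is a plain-Python illustration with a hypothetical helper `replace_rare_words`, assuming that a token is kept when its count is at least the cutoff; it is not the NLTK `Vocabulary` code.

```python
from collections import Counter

def replace_rare_words(tokens, unk_cutoff, unk_label="<UNK>"):
    """Map every token seen fewer than unk_cutoff times to unk_label."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= unk_cutoff else unk_label for tok in tokens]

tokens = ["you", "hit", "hard", "you", "keep", "moving", "you", "hit"]
print(replace_rare_words(tokens, unk_cutoff=2))
# ['you', 'hit', '<UNK>', 'you', '<UNK>', '<UNK>', 'you', 'hit']
```

After the replacement, `<UNK>` is an ordinary token with its own counts, so a model trained on the result can assign a non-zero score to words it has never seen by treating them as `<UNK>`.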

In [43]:
order = 3
n_grams, padded_sentence = lm_preprocessing_pipeline(text, order=order)
vocab = nltk.lm.Vocabulary(unk_cutoff=3)
model = MLE(order, vocabulary=vocab)
model.fit(n_grams, padded_sentence)



In [44]:
detokenize(model.generate(num_words=50, text_seed=seed))



'place how hard you after know watching privilege shadow through line time all me really are worth up moving never this winning big all blame was is her not winning care blame beat stick mother happens knew cowards say always hits mother saying then great until guys paid listen earned'

The ability to generate longer text sequences (although gibberish) comes from the probability mass of the rare tokens being pooled under the `<UNK>` token, so the model no longer gets stuck in a narrow context.