# NLP Assignment CheatSheet - N-grams Language Model Basics

N-gram language models (LMs) are a classical approach to sentence modelling and text autocompletion.

We will use the `nltk` (Natural Language Toolkit) Python package.
If you want to learn more about this popular module, refer to the [official website](https://www.nltk.org/) ([API reference](https://www.nltk.org/api/nltk.html), [installation guide](https://www.nltk.org/install.html)).

In particular, the `nltk.lm` submodule provides optimized implementations of classical n-grams language models such as the maximum likelihood estimator (MLE) and its smoothing variants (Laplace, Lidstone, ...).

To illustrate the n-gram approach in the NLP assignment, we will apply it to the Trump Tweets dataset and try to generate new tweets!

Before that, this notebook presents the basics on how to preprocess the text data into tokens and ngrams, and to fit LMs with `nltk`.

In [None]:
import numpy as np
import nltk

In [None]:
#!pip install nltk

In [None]:
# First download some nltk resources
# (By default '!pip install nltk' does not actually download every resource in the module,
# as for example some language models are heavy.)
# The following command should download every resource needed for this practical:
nltk.download('popular', quiet=True)

## 1. Introduction: preprocessing and n-grams with dummy data

For simplicity, we consider the dummy corpus `corp` with two tokenized documents (sequences of tokens). The tokens are here simple letters, but we can think of them as representing words in our vocabulary. (We have seen how to tokenize raw text in the previous seminar.)

In [None]:
corp = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

If we want to train a bigram model, we need to turn this tokenized text into n-grams. We can use the `bigrams` and `ngrams` functions from NLTK as helpers to turn each tokenized document into a list of n-grams (e.g. with $n=2$ and $n=3$).

In [None]:
from nltk.util import bigrams, ngrams

In [None]:
list(bigrams(corp[0]))

In [None]:
list(ngrams(corp[1], n=3))

*Remark:* `list()` is used here only to display the results, since `bigrams`, `ngrams` and other `nltk` functions return lazy Python generators for efficiency.
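
A practical consequence of this laziness is that such a generator can be consumed only once. The following minimal sketch (the variable names are illustrative) shows a second pass over the same generator coming back empty:

```python
from nltk.util import bigrams

tokens = ['a', 'b', 'c']
bigram_gen = bigrams(tokens)      # lazy: no bigram has been built yet

first_pass = list(bigram_gen)     # consumes the generator
second_pass = list(bigram_gen)    # nothing left to iterate over

print(first_pass)
print(second_pass)
```

If the n-grams are needed more than once, either keep the list or call the function again.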

Notice how, in the bigrams of the first document, "b" occurs both as the first and as the second member of a bigram, whereas "a" only occurs as a first member and "c" only as a second member?

It would be useful to tell the model how often sentences start with "a" and end with "c", for example, when we count those n-grams later on.


A standard way to deal with this is to add special "padding" symbols to the document/sequence before splitting it into ngrams. Fortunately, NLTK also has a `pad_sequence` function for that. We use `"<s>"` and `"</s>"` by convention in `nltk` to pad before and after the sequence, respectively.

Let's add the relevant padding and construct the bigrams and 3-grams for the first text sequence. Note the `n` argument, which tells the function that we need padding for `n`-grams.

In [None]:
from nltk.util import pad_sequence

In [None]:
#n=2
padded_seq2 = list(pad_sequence(corp[0],
                                pad_left=True, left_pad_symbol="<s>",
                                pad_right=True, right_pad_symbol="</s>",
                                n=2))  # order of the n-grams: pad n-1 times per side (once for 2-grams, twice for 3-grams, etc.)
padded_seq2

In [None]:
list(ngrams(padded_seq2, n=2))

In [None]:
#n=3
padded_seq3 = list(pad_sequence(corp[0],
                                pad_left=True, left_pad_symbol="<s>",
                                pad_right=True, right_pad_symbol="</s>",
                                n=3))  # order of the n-grams: pad n-1 times per side (once for 2-grams, twice for 3-grams, etc.)
padded_seq3

In [None]:
list(ngrams(padded_seq3, n=3))

Passing all these parameters every time can be tedious and in most cases one uses the same defaults anyway.

Thus the `nltk.lm` module provides a convenience function that has all these arguments already set while the other arguments remain the same as for `pad_sequence`.

In [None]:
from nltk.lm.preprocessing import pad_both_ends

In [None]:
list(pad_both_ends(corp[0], n=2))

Combining the two parts discussed so far we get the following preparation steps for one sentence.

In [None]:
list(bigrams(pad_both_ends(corp[0], n=2)))

For versatility and conditional probability computations, the `nltk.lm` n-gram models that we will use typically rely on counting everygrams up to order $n$, i.e. all n-grams of orders $1, 2, ..., n$.
For example, LMs of order 2 are trained by counting unigrams (single words) as well as bigrams (word pairs), LMs of order 3 by counting unigrams, bigrams and 3-grams, and so on.
That way, an `nltk` LM of order $n$ can output word probabilities for contexts (i.e. the preceding words/tokens being conditioned on) of size $0, 1, 2, ..., n-1$ tokens.

To construct those everygrams, which will serve as the training data that the LM counts, NLTK once again helpfully provides a function called `everygrams`.

In [None]:
from nltk.util import everygrams

In [None]:
padded_seq2 = list(pad_both_ends(corp[0], n=2))

list(everygrams(padded_seq2, max_len=2))
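
As a quick check, the everygrams can be grouped by length to see the unigrams and bigrams side by side. This is only a sketch: the grouping and the variable names are illustrative, and the results are sorted because the order in which `everygrams` yields its items may differ between `nltk` versions.

```python
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import everygrams

padded = list(pad_both_ends(['a', 'b', 'c'], n=2))
grams = list(everygrams(padded, max_len=2))

# Split the everygrams by their length (1 = unigram, 2 = bigram).
unigrams = sorted(g for g in grams if len(g) == 1)
bigram_list = sorted(g for g in grams if len(g) == 2)

print(len(unigrams), unigrams)
print(len(bigram_list), bigram_list)
```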

We are almost ready to start counting ngrams, just one more step left.

During training and evaluation our model will rely on a vocabulary that defines which words are "known" to the model, to efficiently perform the counting.

To create this vocabulary, we need to pad our sentences (just like for counting n-grams) and then combine the sentences into one flat stream of words.


In [None]:
from nltk.lm.preprocessing import flatten

In [None]:
list(flatten(pad_both_ends(sent, n=2) for sent in corp)) #vocab

We have now discussed all the necessary preprocessing steps. In most cases, one wants to use the same text as the source for both the vocabulary and the n-gram counts.

To this aim, the `padded_everygram_pipeline` function does exactly everything above (padding, everygrams, vocabulary stream) for us for the whole tokenized corpus, in a single function call.

In [None]:
from nltk.lm.preprocessing import padded_everygram_pipeline

In [None]:
training_neverygrams, padded_vocab_stream = padded_everygram_pipeline(order=2, text=corp)

To avoid re-creating the text in memory, both `training_neverygrams` and `padded_vocab_stream` are lazy iterators. They are evaluated on demand at training time.

To understand the outputs of `padded_everygram_pipeline`, we "materialize" the lazy iterators by converting them into lists.

In [None]:
training_neverygrams, padded_vocab_stream = padded_everygram_pipeline(2, corp)

print('==== n-everygram data (n=2) for each sequence in "corp": ====')
for ngramlize_sent in training_neverygrams:
    print(list(ngramlize_sent))
    print()
print('==== Vocabulary data: ====')
print(list(padded_vocab_stream))

## 2. Language models in NLTK: Basic usage tips

The `nltk.lm` submodule has implementations of the language models (LM) you have seen in class, and several others. In particular, you will find implementations of the simple maximum likelihood estimator (MLE) (`nltk.lm.MLE`), Laplace smoothing (`nltk.lm.Laplace`), and Lidstone smoothing (`nltk.lm.Lidstone`).
Lidstone is simply a generalization of Laplace, where a user-selected value $\gamma$ is added to the counts, instead of the value $1$ added with Laplace.
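
In formula form, additive smoothing scores a word $w$ in a context $c$ as $P(w \mid c) = \frac{C(c\,w) + \gamma}{C(c) + \gamma |V|}$, where $C(\cdot)$ are training counts and $|V|$ is the vocabulary size. The sketch below recomputes this by hand on the dummy corpus and compares it with `nltk`. The helper `fit_on_corp` is hypothetical, and taking $|V|$ to be `len(model.vocab)` (which appears to include the `<UNK>` label) is an assumption about how `nltk` normalizes.

```python
import math
from nltk.lm import Laplace, Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

corp = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

def fit_on_corp(model):
    """Hypothetical helper: fit a model on the dummy corpus and return it."""
    train, vocab = padded_everygram_pipeline(model.order, corp)
    model.fit(train, vocab)
    return model

gamma = 0.5
lidstone = fit_on_corp(Lidstone(gamma, 2))
laplace = fit_on_corp(Laplace(2))

# Ingredients of the formula for P('b' | 'a').
count_ab = lidstone.counts[['a']]['b']    # C(a b)
count_a = lidstone.counts[['a']].N()      # C(a) as a context
vocab_size = len(lidstone.vocab)          # |V| (assumed to include '<UNK>')

lidstone_by_hand = (count_ab + gamma) / (count_a + gamma * vocab_size)
laplace_by_hand = (count_ab + 1) / (count_a + vocab_size)

print(lidstone_by_hand, lidstone.score('b', ['a']))
print(laplace_by_hand, laplace.score('b', ['a']))
```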

In this section, you will find the very basics on how to use these language model implementations. For more details, you are encouraged to look into the nltk documentation.

### 2.1 Fitting an NLTK LM

The usage of these LMs is quite similar to scikit-learn. Taking the simple MLE as an example, the model is first instantiated (other models might take additional hyperparameters):
```python
    from nltk.lm import MLE
    model = MLE(order = n) # n is the desired order of the model
```
Then it is fit to a training corpus that has been properly preprocessed, into everygrams and a vocabulary stream, for the correct order $n$:
```python
    model.fit(text = training_neverygrams, vocabulary_text = padded_vocab_stream)
```
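
Put together on the dummy corpus from section 1, the two steps could look like the following sketch (the choice $n=2$ is just for illustration):

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

corp = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
n = 2  # order of the model (illustrative choice)

# Preprocess for order n, then instantiate and fit the model.
training_neverygrams, padded_vocab_stream = padded_everygram_pipeline(n, corp)
model = MLE(order=n)
model.fit(text=training_neverygrams, vocabulary_text=padded_vocab_stream)

print(model.order)
print(len(model.vocab))  # the vocabulary also counts the '<UNK>' label
```

Since the pipeline outputs are lazy iterators, they are exhausted by `fit`. To fit another model, call `padded_everygram_pipeline` again.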

### 2.2 Accessing the fitted model

Initializing the MLE model creates an empty vocabulary, which gets filled as we fit the model. The vocabulary object is accessible as an attribute of the model. Try, for example, what the following do:
```python
    model.vocab
    model.vocab.lookup(token_list)
```
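
For instance, on a bigram MLE fitted to the dummy corpus, the lookup maps tokens that were never seen in training to the unknown label. A minimal sketch, where the token `'zzz'` is just a made-up unseen word:

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

corp = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
train, vocab_stream = padded_everygram_pipeline(2, corp)
model = MLE(2)
model.fit(train, vocab_stream)

known = model.vocab.lookup(['a', 'b', 'c'])     # all seen in training
mixed = model.vocab.lookup(['a', 'zzz'])        # 'zzz' was never seen
single = model.vocab.lookup('zzz')              # a single token also works

print(known)
print(mixed)
print(single)
```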

For a more advanced usage, the vocabulary can be constructed separately and given to the model, instead of letting it infer it from the vocabulary stream. This allows for example cutting-off infrequent words from the vocabulary. If you are interested in the implementation and going a bit further, you can check out the documentation for the `nltk.lm.vocabulary.Vocabulary` class [here](https://www.nltk.org/api/nltk.lm.vocabulary.html) or the source code: [`nltk.lm.vocabulary.Vocabulary`](https://github.com/nltk/nltk/blob/develop/nltk/lm/vocabulary.py).
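
As a sketch of this more advanced usage, the example below builds a `Vocabulary` with `unk_cutoff=2` (an arbitrary threshold chosen for illustration), so that tokens seen only once in the dummy corpus are treated as unknown, and then passes it to the model.

```python
from nltk.lm import MLE, Vocabulary
from nltk.lm.preprocessing import flatten, pad_both_ends, padded_everygram_pipeline

corp = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
n = 2

# Build the vocabulary ourselves; tokens seen fewer than 2 times become '<UNK>'.
vocab = Vocabulary(flatten(pad_both_ends(sent, n=n) for sent in corp), unk_cutoff=2)

# Only the everygrams are needed for fitting, the vocabulary is already set.
train, _ = padded_everygram_pipeline(n, corp)
model = MLE(n, vocabulary=vocab)
model.fit(train)

print(sorted(model.vocab))
print(model.vocab.lookup(['a', 'b', 'c']))
print(model.counts['<UNK>'])  # number of training tokens mapped to '<UNK>'
```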

Then, fitting n-gram LMs basically boils down to counting the number of word/token and n-gram occurrences in the training data. To access token counts, and conditional token counts (in a context of one or several preceding tokens), try:
```python
    model.counts
    model.counts['word']
    model.counts[('context_word1', "context_word2", ...)]["word"]
```
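
On the dummy corpus, these accessors could be used as in the following sketch (a bigram MLE is assumed):

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

corp = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
train, vocab_stream = padded_everygram_pipeline(2, corp)
model = MLE(2)
model.fit(train, vocab_stream)

count_a = model.counts['a']             # how often 'a' occurs
count_a_b = model.counts[['a']]['b']    # how often 'b' follows 'a'
after_a = dict(model.counts[['a']])     # all continuations of the context 'a'

print(count_a, count_a_b, after_a)
```
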
However, the real purpose of training a language model is to have it score how probable words are in certain contexts. 
For the MLE, the model returns the item's relative frequency as its score, i.e. (conditional) occurrence probability.
```python
    model.score('word')                                             # P('word')
    model.score('word', ('context_word1', "context_word2", ...))    # P('word'|'context_word1 context_word2 ...')
```
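
To make the link between counts and scores concrete, the sketch below recomputes two MLE scores by hand as relative frequencies on the dummy corpus and compares them with `model.score`:

```python
import math
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

corp = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
train, vocab_stream = padded_everygram_pipeline(2, corp)
model = MLE(2)
model.fit(train, vocab_stream)

# P('a'): count of 'a' over the total number of (padded) tokens.
p_a_by_hand = model.counts['a'] / model.counts.unigrams.N()

# P('b' | 'a'): count of the bigram ('a', 'b') over all bigrams starting with 'a'.
p_b_given_a_by_hand = model.counts[['a']]['b'] / model.counts[['a']].N()

print(p_a_by_hand, model.score('a'))
print(p_b_given_a_by_hand, model.score('b', ['a']))
```
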
To avoid underflow when working with many small score values it makes sense to take their logarithm. 
For convenience this can be done by using the `logscore` method instead of the `score`.
```python
    model.logscore('word')
    model.logscore('word', ('context_word1', "context_word2", ...))
```
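
A sketch of how log-scores can be used in practice: the log-probability of a whole sentence under a bigram model is the sum of the log-scores of its padded bigrams. The logarithm used by `logscore` appears to be base 2, which the example checks on one token. The variable names are illustrative.

```python
import math
from nltk.lm import MLE
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.util import bigrams

corp = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
train, vocab_stream = padded_everygram_pipeline(2, corp)
model = MLE(2)
model.fit(train, vocab_stream)

# logscore versus the base-2 logarithm of score, for a single token.
print(model.logscore('a'), math.log2(model.score('a')))

# Sentence log-probability: sum of the bigram log-scores.
sentence = ['a', 'b', 'c']
sentence_logprob = sum(
    model.logscore(word, [context])
    for context, word in bigrams(pad_both_ends(sentence, n=2))
)
print(sentence_logprob, 2 ** sentence_logprob)
```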

### 2.3 Generation with NLTK LMs

One cool feature of n-gram models is that they can be used to generate text. The `nltk.lm` model classes have a `.generate()` method to sample sequentially from the estimated (conditional) probabilities. This can be achieved using:
```python
    model.generate(num_words = num_words, text_seed = initial_context_tokens, random_seed = None)
```
Keep in mind that this will generate `num_words` new words according to the model's fitted scores, as a list of vocabulary tokens. For a realistic output text, it might thus need some post-processing. `nltk.tokenize.treebank.TreebankWordDetokenizer()` provides a general-purpose **sentence** detokenizer, but might need some additional post-processing for specific tasks.
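
The sketch below generates a few tokens from a bigram MLE fitted on the dummy corpus and turns them into a string. The helper `tokens_to_text` is hypothetical: dropping `<s>` and cutting at the first `</s>` is one possible post-processing choice, not something `nltk` does for you. The sampled tokens depend on the random seed, so no output is shown.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.tokenize.treebank import TreebankWordDetokenizer

corp = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
train, vocab_stream = padded_everygram_pipeline(2, corp)
model = MLE(2)
model.fit(train, vocab_stream)

def tokens_to_text(tokens):
    """Hypothetical helper: drop '<s>', stop at the first '</s>', detokenize."""
    kept = []
    for tok in tokens:
        if tok == '<s>':
            continue
        if tok == '</s>':
            break
        kept.append(tok)
    return TreebankWordDetokenizer().detokenize(kept)

# Start from the sentence-start symbol; a fixed seed makes the run reproducible.
tokens = model.generate(num_words=8, text_seed=['<s>'], random_seed=3)
print(tokens)
print(tokens_to_text(tokens))
```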

### 2.4 Perplexity

The model perplexity is a normalized form of the sequence probability, as seen in the lecture. It can be used on a held-out test dataset to evaluate the performance of an n-gram probability model. The `nltk.lm` model classes have a `.perplexity()` method to compute the perplexity on a given list or corpus of n-grams.
```python
    model.perplexity(test_ngrams)
```
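
As a sketch on the dummy corpus, the example below computes the perplexity of two short test sequences under bigram models. The test sequences and the helper `fit_on_corp` are made up for illustration. The second sequence contains a bigram never seen in training, for which the unsmoothed MLE assigns probability zero, while a smoothed model such as Laplace stays finite.

```python
import math
from nltk.lm import MLE, Laplace
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.util import bigrams

corp = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

def fit_on_corp(model):
    """Hypothetical helper: fit a model on the dummy corpus and return it."""
    train, vocab = padded_everygram_pipeline(model.order, corp)
    model.fit(train, vocab)
    return model

mle = fit_on_corp(MLE(2))
laplace = fit_on_corp(Laplace(2))

# Test data must be preprocessed like the training data: padded, then split into bigrams.
seen_ngrams = list(bigrams(pad_both_ends(['a', 'b', 'c'], n=2)))
unseen_ngrams = list(bigrams(pad_both_ends(['a', 'd'], n=2)))  # ('a', 'd') never occurs in corp

ppl_seen = mle.perplexity(seen_ngrams)
ppl_unseen_mle = mle.perplexity(unseen_ngrams)
ppl_unseen_laplace = laplace.perplexity(unseen_ngrams)

print(ppl_seen, ppl_unseen_mle, ppl_unseen_laplace)
```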