# Language Models

In this session we will learn how to:
- Use NLTK to build n-gram language models
- Build a recurrent neural language model
- Apply language models to text generation

## Preliminary Steps

Make sure you have a recent NLTK version (3.4 or higher).
You can install NLTK using pip: `pip install nltk`

The following examples will be using Python 3 syntax and conventions.
Once you have installed NLTK, you need to download the required corpora. Launch the Python interpreter and type:

In [None]:
import nltk
nltk.download()

: 

A new window should open, showing the NLTK Downloader. Next, select all-corpora to download.

NLTK provides many corpora covering different types of texts. We’ll work with the C-span corpus of state of the union and inaugural speeches by US presidents.

To access the corpora:

In [None]:
from nltk.corpus import state_union
from nltk.corpus import inaugural



#To list all documents in a corpus, you can use the fileids() method:
inaugural.fileids()


: 

In [None]:
#To access the content of a given document, you can use the raw(), words() and sents() methods as follows:
inaugural.raw('2009-Obama.txt')

: 

In [None]:
inaugural.words('2009-Obama.txt')


: 

In [None]:
inaugural.sents('2009-Obama.txt')

: 

## The LM Module: Basics

Since NLTK 3.4, the **lm** module makes it possible to build language models.
The first thing we need is to count n-grams (unigrams, bigrams, trigrams, etc.). We can do this in different ways: using Counter from the collections package or, more properly, using NgramCounter from the lm module:


In [None]:
from nltk.util import ngrams
from nltk.lm import NgramCounter

: 

Let's count all unigrams (single words) in the State of the Union corpus (state_union):

In [None]:
text_unigrams = [ngrams(sent, 1) for sent in state_union.sents()]
#for tu in text_unigrams:
#    print(list(tu))
ngram_counts=NgramCounter(text_unigrams)
ngram_counts.N()


: 

Be careful: the ngrams function returns a generator, so once the n-grams have been consumed they are 'lost' and have to be generated again.
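The caveat about generators can be illustrated without NLTK. The sketch below uses `simple_ngrams`, a hypothetical stand-in written for this example (it is not the NLTK function), to show that a second pass over the same generator yields nothing:

```python
def simple_ngrams(tokens, n):
    """Hypothetical stand-in for nltk.util.ngrams: lazily yields n-grams as tuples."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

tokens = ["we", "the", "people"]
gen = simple_ngrams(tokens, 2)

first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing is left to yield

print(first_pass)   # [('we', 'the'), ('the', 'people')]
print(second_pass)  # []

# If the n-grams are needed several times, materialise them once in a list
stored = list(simple_ngrams(tokens, 2))
```

This is why the commented-out `print(list(tu))` loop in the cell above would leave NgramCounter with nothing to count if it were enabled.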

You can look at the frequencies of a type (word) in a very simple way:

In [None]:
ngram_counts['the']

: 

In [None]:
ngram_counts.unigrams.most_common(20)

: 

If you have matplotlib installed, it is possible to display a rank/frequency diagram by typing:

In [None]:
ngram_counts.unigrams.plot(50)

: 

Example for bigrams:

In [None]:
text_bigrams = [ngrams(sent, 2) for sent in state_union.sents()]
ngram_counts=NgramCounter(text_bigrams)
ngram_counts[['the']]

: 

In [None]:
ngram_counts[['the']]['people']

: 
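As mentioned earlier, similar counts can be obtained with Counter from the collections package. The following is a minimal sketch on a toy corpus (the sentences are invented for illustration); a dictionary of counters mimics the `counts[[context]][word]` lookup shown above:

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration: a list of tokenised sentences
sents = [["the", "people", "of", "the", "nation"],
         ["the", "people", "decide"]]

unigram_counts = Counter(w for sent in sents for w in sent)

# bigram_counts[previous_word][word] = number of times "previous_word word" occurs
bigram_counts = defaultdict(Counter)
for sent in sents:
    for prev, word in zip(sent, sent[1:]):
        bigram_counts[prev][word] += 1

print(unigram_counts["the"])           # 3
print(bigram_counts["the"]["people"])  # 2
print(bigram_counts["the"]["nation"])  # 1
```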

Vocabulary objects make it possible to create a vocabulary from a collection of tokens and a frequency threshold:

In [None]:
from nltk.lm import Vocabulary

vocab = Vocabulary(state_union.words(), unk_cutoff=2)
#The vocabulary will include all words that appear at least 2 times in the corpus.

vocab["America"]

: 
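To see what the frequency threshold does, here is a minimal pure-Python sketch. The helpers `build_vocab` and `lookup` are hypothetical names introduced for illustration; they assume the same rule as above (a word is kept when its count is at least the cutoff) and map every other word to a single unknown label:

```python
from collections import Counter

def build_vocab(words, unk_cutoff=2):
    """Keep the words whose count is >= unk_cutoff."""
    counts = Counter(words)
    return {w for w, c in counts.items() if c >= unk_cutoff}

def lookup(word, vocab_set, unk_label="<UNK>"):
    """Map out-of-vocabulary words to a single unknown label."""
    return word if word in vocab_set else unk_label

words = ["a", "b", "a", "c", "b", "a", "d"]
vocab_set = build_vocab(words, unk_cutoff=2)

print(sorted(vocab_set))                                # ['a', 'b']
print([lookup(w, vocab_set) for w in ["a", "c", "z"]])  # ['a', '<UNK>', '<UNK>']
```

Raising the cutoff shrinks the vocabulary and sends more rare words to the unknown label, which matters later when the model has to score words it never saw during training.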

**Exercise 1**: Count the number of unigrams for each of the presidents in the inaugural dataset. Which president gave the longest speech? Which one gave the shortest?

In [None]:
#YOUR CODE HERE
#Let's see all the presidential inaugural addresses that are available in the corpus:

president_fileid= inaugural.fileids()
unigram_counter={}
for fileid in president_fileid:
    current_president_date=fileid.split('.')[0]
    text_unigrams = [ngrams(sent, 1) for sent in inaugural.sents(fileid)]
    ngram_counts=NgramCounter(text_unigrams)
    unigram_counter[current_president_date]=ngram_counts.N()
    
#Sort the (year-president, token count) pairs from the longest to the shortest speech
unigram_counter=sorted(unigram_counter.items(), key=lambda x: x[1], reverse=True)
a,b=unigram_counter[0][0].split('-', 1)
c,d=unigram_counter[-1][0].split('-', 1)
print(f"The president with the most words in his inaugural address is {b} in year {a} with {unigram_counter[0][1]} words")
print(f"The president with the fewest words in his inaugural address is {d} in year {c} with {unigram_counter[-1][1]} words")


: 

**1.b)**: Count the number of different *types*. Which president used the "richest vocabulary" for his speech?

In [None]:
#YOUR CODE HERE
type_counter={}
for fileid in president_fileid:
    current_president_date=fileid.split('.')[0]
    text_unigrams = [ngrams(sent, 1) for sent in inaugural.sents(fileid)]
    ngram_counts=NgramCounter(text_unigrams)
    type_counter[current_president_date]=len(ngram_counts.unigrams)
#normalized_type_counter={k: v/unigram_counter[k] for k, v in type_counter.items()}
type_counter=sorted(type_counter.items(), key=lambda x: x[1], reverse=True)
print(f"The president with the most unique words in his inaugural address is {type_counter[0][0]} with {type_counter[0][1]} unique words")
print(f"The president with the fewest unique words in his inaugural address is {type_counter[-1][0]} with {type_counter[-1][1]} unique words")



: 

In [None]:
a=inaugural.raw('1793-Washington.txt').split()
print(len(set(a)))

: 

## Building a language model

Once we have the data, we need to introduce padding to tell the model where the boundaries of each sentence are, and to calculate probabilities for the start and the end of a sentence. Luckily, NLTK has a function that allows us to do it easily:

In [None]:
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import bigrams

sentence=inaugural.sents('2009-Obama.txt')[1]

list(pad_both_ends(sentence, n=2))
list(bigrams(pad_both_ends(sentence, n=2))) #extract bigrams


: 
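To make the effect of padding explicit, here is a minimal pure-Python sketch. `pad_sentence` is a hypothetical helper written for illustration, not the NLTK function; it assumes that n-1 boundary symbols are added on each side, which is what pad_both_ends does:

```python
def pad_sentence(tokens, n, left="<s>", right="</s>"):
    """Add n-1 boundary symbols on each side of the sentence."""
    return [left] * (n - 1) + list(tokens) + [right] * (n - 1)

sentence = ["We", "the", "people"]
padded = pad_sentence(sentence, n=2)
pairs = list(zip(padded, padded[1:]))  # consecutive pairs, i.e. bigrams

print(padded)  # ['<s>', 'We', 'the', 'people', '</s>']
print(pairs)   # [('<s>', 'We'), ('We', 'the'), ('the', 'people'), ('people', '</s>')]
```

The first and last bigrams are the ones that let the model estimate how likely a word is to open or close a sentence.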

The lm module also needs a flattened stream of symbols to build the vocabulary. We can obtain it with the flatten function:

In [None]:
from nltk.lm.preprocessing import flatten
list(flatten(pad_both_ends(sent, n=2) for sent in inaugural.sents('2009-Obama.txt')))


: 

There is also a convenience function called padded_everygram_pipeline that produces the two at the same time. From the manual:
    
    padded_everygram_pipeline(order, text):
    Default preprocessing for a sequence of sentences.

    Creates two iterators:
    - sentences padded and turned into sequences of `nltk.util.everygrams`
    - sentences padded as above and chained together for a flat stream of words

    :param order: Largest ngram length produced by `everygrams`.
    :param text: Text to iterate over. Expected to be an iterable of sentences:
    Iterable[Iterable[str]]
    :return: iterator over text as ngrams, iterator over text as vocabulary data

For example:

In [None]:
from nltk.lm.preprocessing import padded_everygram_pipeline

train, vocab = padded_everygram_pipeline(3, inaugural.sents())


: 

Having prepared our data we are ready to start training a model. As a simple example, let us train a Maximum Likelihood Estimator (MLE).

In [None]:
from nltk.lm import MLE
lm = MLE(3) #the parameter is the highest n-gram order for our model. We consider up to trigrams in this example 

lm.fit(train, vocab) #fit with the data. May take a while

: 

We can verify frequencies in the same way as we did with the NgramCounter:

In [None]:
lm.counts['America']

: 

In [None]:
lm.counts[['bless']]['America']

: 

The score function returns the probability of observing the given word, optionally conditioned on the preceding context:

In [None]:
#lm.score('America')
#lm.score('America', ['bless']) #or the probability of observing a word given the previous word
lm.score('America', ['God', 'bless'])

: 

Note that these probabilities are not smoothed since we used an MLE model, so any unseen word or n-gram gets probability 0. For better results, models with smoothing are available:
-	nltk.lm.Lidstone (add-gamma smoothing; requires the gamma parameter, which is added to every count)
-	nltk.lm.Laplace (add 1)
-	nltk.lm.KneserNeyInterpolated
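The difference between the unsmoothed estimate and additive smoothing can be worked out by hand on a toy unigram model. The sketch below is an illustration under stated assumptions, not the NLTK implementation: the corpus is invented, and the vocabulary is assumed to contain the observed words plus one `<UNK>` entry for everything else. For higher-order models the same formula is applied to the counts that follow a given context.

```python
from collections import Counter

# Toy corpus, invented for illustration
tokens = ["the", "cat", "sat", "on", "the", "mat"]
counts = Counter(tokens)
total = sum(counts.values())     # 6 tokens
vocab_size = len(counts) + 1     # 5 observed words + "<UNK>" (assumption)

def lookup(word):
    """Map unseen words to the unknown label."""
    return word if word in counts else "<UNK>"

def mle(word):
    """Unsmoothed relative frequency: unseen words get probability 0."""
    return counts[lookup(word)] / total

def lidstone(word, gamma):
    """Add gamma to every count; the denominator grows by gamma * |V|."""
    return (counts[lookup(word)] + gamma) / (total + gamma * vocab_size)

def laplace(word):
    """Laplace smoothing is Lidstone smoothing with gamma = 1."""
    return lidstone(word, gamma=1.0)

for w in ["the", "dog"]:
    print(w, mle(w), lidstone(w, 0.1), laplace(w))
```

Here "the" goes from 2/6 without smoothing to 3/12 with Laplace smoothing, while the unseen word "dog" goes from 0 to 1/12: smoothing moves probability mass from seen to unseen events.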

**Exercise 2**: Build a language model from the **state_union** dataset. Verify the probabilities for the words "America", "the" and "jobs", first without smoothing and then using Laplace smoothing (warning: it may take a certain time).

In [None]:
#YOUR CODE HERE
#Fit an unsmoothed (MLE) trigram model on the State of the Union corpus
from nltk.lm import Laplace


state_union_sentences=state_union.sents()
train, vocab = padded_everygram_pipeline(3, state_union_sentences)
lm=MLE(3)
lm.fit(train, vocab)

print("The probabilities without smoothing are:")
print(f"America: {lm.score('America')}")
print(f"the: {lm.score('the')}")
print(f"jobs: {lm.score('jobs')}")


: 

In [None]:

#The pipeline returns generators, so it has to be rebuilt before fitting a second model
state_union_sentences=state_union.sents()
train, vocab = padded_everygram_pipeline(3, state_union_sentences)
lm_laplace=Laplace(3)
lm_laplace.fit(train, vocab)
print("The probabilities with smoothing are:")
print(f"America: {lm_laplace.score('America')}")
print(f"the: {lm_laplace.score('the')}")
print(f"jobs: {lm_laplace.score('jobs')}")


: 

## Evaluating language models: perplexity

Perplexity measures how well your model approximates the true probability distribution behind the data. __Smaller perplexity = better model__.

To compute perplexity on one sentence of $N$ tokens with an $n$-gram model, use:
$$
    {\mathbb{P}}(w_1 \dots w_N) = 2^{-\frac{1}{N} \left( \sum_{t=1}^N \log_2 P(w_t \mid w_{t - n + 1}, \dots, w_{t - 1})\right)}
$$
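As a quick sanity check of the formula, the sketch below computes the perplexity of a single sentence from a list of per-token conditional probabilities, taking the logarithm in base 2. The probabilities are made up for illustration and `sentence_perplexity` is a hypothetical helper, independent of NLTK:

```python
import math

def sentence_perplexity(probs):
    """Perplexity of one sentence, given P(w_t | context) for each of its tokens."""
    log2_sum = sum(math.log2(p) for p in probs)
    return 2 ** (-log2_sum / len(probs))

uniform = [0.25, 0.25, 0.25, 0.25]    # the model hesitates between 4 options at each step
confident = [0.5, 0.5, 0.5, 0.5]      # the model hesitates between 2 options at each step

print(sentence_perplexity(uniform))    # 4.0
print(sentence_perplexity(confident))  # 2.0
```

Perplexity can therefore be read as the average number of equally likely choices the model is hesitating between at each position.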


**Exercise 3**: We would like to create a function that calculates the perplexity on a given test set, made of multiple sentences, returning their average. Complete the following code to calculate the perplexity as defined above

Hint: you can obtain the base-2 log-probabilities from the model with the function lm.logscore(...). To help, we include the conversion of the input sentences into sequences of n-grams, including the start-of-sentence and end-of-sentence symbols (\<s> and \</s>).

In [None]:
from nltk.util import ngrams
import numpy as np

def perplexity(lm, sents, n, min_logprob=np.log2(10 ** -50.)):
    """
    :param lm: a fitted nltk.lm language model
    :param sents: a list of sentences (each sentence a list of words)
    :param n: the size of n-grams for which to compute the perplexity. This cannot exceed the size used for the construction of the LM
    :param min_logprob: if log2(P(w | ...)) is smaller than min_logprob, set it equal to min_logprob
    :returns: the average of the per-sentence perplexities computed with the formula above
    """
    
    test_data = [ngrams(t, n, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in sents]
    prp=[]

    for test in test_data:
        logprobs=[]
        for ngram in test:
            context, word = tuple(ngram[:-1]), ngram[-1]
            #lm.logscore returns log2 P(word | context); it is -inf for unseen events, hence the clipping
            logprobs.append(max(lm.logscore(word, context), min_logprob))
        if logprobs: #skip empty sentences
            prp.append(2 ** (-np.mean(logprobs)))
        
    return np.mean(prp)


: 

**Exercise 3.a**: Evaluate your perplexity function on the *inaugural* dataset and test for $n \in \{1,2,3,4\}$. What do you obtain? Can you explain the result?

In [None]:
#example
sents=inaugural.sents()

perplexity(lm, sents, 3)
#YOUR CODE HERE
for n in range(1, 5):
    print(f"Perplexity for n={n}: {perplexity(lm, sents, n)}")



: 

## Generation

Finally, we use the model to generate language. The key function is called **generate** and belongs to the same lm interface:

    def generate(self, num_words=1, text_seed=None, random_seed=None):
        Generate words from the model.

        :param int num_words: How many words to generate. By default 1.
        :param text_seed: Generation can be conditioned on preceding context.
        :param random_seed: If provided, makes the random sampling part of
        generation reproducible.
        :return: One (str) word or a list of words generated from model.
 
**Exercise 4**: Generate 10 sentences, each composed of 10 words, using the prompt "I shall", and calculate the average perplexity on the generated set. Compare this value to the value obtained on the *inaugural* dataset. Try with 2-grams and 3-grams.

What can you conclude about the quality of the generated text (both on the basis of the values you obtained and your personal judgment)?

In [None]:
lm.generate(10, ["I", "shall"])
#YOUR CODE HERE
#The prompt must be a list of tokens, not a single string
prompt=["I", "shall"]
#Let's generate 10 sentences, each with 10 words, using the prompt "I shall"
sentences=[]
for i in range(10):
    sentences.append(lm.generate(10, text_seed=prompt))

for s in sentences:
    print(' '.join(s))

#Average perplexity on the generated set, for 2-grams and 3-grams
for n in (2, 3):
    print(f"Perplexity for n={n}: {perplexity(lm, sentences, n)}")

: 

# Neural Language Models

The following script contains a demonstration of how to create a neural language model using a recurrent neural network (in this case, an LSTM) with Keras. Input words are mapped to integer indices and passed through a learned embedding layer, while the target words are one-hot encoded. This script has an embedded training text, which is too short to produce reliable results (as you will probably notice).

In [None]:
from keras.utils import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku
import numpy as np

tokenizer = Tokenizer()

def dataset_preparation(data):
    #the purpose of this function is to transform the text in a format that can be handled by the model
    corpus = data.lower().split("\n")
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    input_sequences = []
    for line in corpus: #process corpus one line at a time
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)

    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len, total_words

def create_model(predictors, label, max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    model.add(Embedding(total_words, 10, input_length=input_len)) #Input Layer : Takes the sequence of words as input
    model.add(LSTM(150)) #LSTM Layer : Computes the output using LSTM units.
    model.add(Dropout(0.5)) #Dropout Layer : A regularisation layer which randomly turns-off the activations of some neurons in the LSTM layer. It helps in preventing over fitting.
    model.add(Dense(total_words, activation='softmax')) #Output Layer : Computes a probability distribution over the vocabulary for the next word
    
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.fit(predictors, label, epochs=500, verbose=1)
    return model


: 

The following cell has some text that is used to train the model. Use the short text to see how the whole pipeline works, and the first 10 sentences of the inaugural dataset (initially commented out) for the final experiments. If training takes too long, you can reduce the number of epochs above.

In [None]:
data = """The cat and her kittens
They put on their mittens,
To eat a Christmas pie.
The poor little kittens
They lost their mittens,
And then they began to cry.
O mother dear, we sadly fear
We cannot go to-day,
For we have lost our mittens.
If it be so, ye shall not go,
For ye are naughty kittens.
The three little kittens, they found their mittens,
And they began to cry,
Oh, mother dear, see here, see here,
For we have found our mittens.
Put on your mittens, you silly kittens,
And you shall have some pie.
Purr, purr, purr,
Oh, let us have some pie.
The three little kittens,
they washed their mittens,
And hung them out to dry,
Oh, mother dear, do you not hear,
That we have washed our mittens?
What, washed your mittens,
then you're good kittens,
But I smell a rat close by.
Meow, meow, meow,
We smell a rat close by."""

#data = '\n'.join([' '.join(s) for s in inaugural.sents()[:10]])

X, Y, msl, total_words = dataset_preparation(data)
model = create_model(X, Y, msl, total_words)

: 

### Generating with Temperature

As seen in the course, temperature can be used to tune the creativity of the model.

In the example below, when temp=0 the generate_text function does not sample: it always returns the most probable item.

As the model outputs are probabilities and not logits (the softmax has already been applied), we use a trick to apply the temperature to the final result: we take the log to reverse the softmax operation and get logit-like values, divide them by the temperature $T$, and exponentiate again:

$$e^{\log(a)/T} = a^{1/T}$$

The resulting values must be re-normalized so that they sum to 1 before sampling.
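The effect of this rescaling can be checked on a small probability vector. The sketch below is a standalone illustration (the vector is made up and `apply_temperature` is a hypothetical helper); it assumes that all probabilities are strictly positive, since the logarithm of 0 is undefined:

```python
import math

def apply_temperature(probs, temp):
    """Rescale a probability vector as p_i^(1/T), then re-normalise it to sum to 1."""
    scaled = [math.exp(math.log(p) / temp) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

p = [0.1, 0.35, 0.05, 0.2, 0.3]

unchanged = apply_temperature(p, 1.0)  # T = 1 leaves the distribution as it is
sharper = apply_temperature(p, 0.2)    # T < 1 concentrates mass on the most likely item
flatter = apply_temperature(p, 2.0)    # T > 1 moves the distribution towards uniform

for name, dist in [("T=1.0", unchanged), ("T=0.2", sharper), ("T=2.0", flatter)]:
    print(name, [round(x, 3) for x in dist])
```

A low temperature makes sampling behave almost like argmax, while a high temperature gives rare words a better chance of being picked.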

**Exercise 5**: Modify the generate_text function to sample using a temperature, and test the results with temp=2 and temp=0.2. Hint: you can use the function np.random.choice(...) to sample, passing the list of word indices and, as p, the temperature-scaled probabilities.

In [None]:
def generate_text(seed_text, next_words, max_sequence_len, model, temp=0):
    for j in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=
                             max_sequence_len-1, padding='pre')
        #here we obtain the index of the predicted word
        #note that model.predict(...) returns the probabilities associated to the words
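        if temp==0: #if Temperature == 0, then return the most probable token
            predicted = np.argmax(model.predict(token_list), axis=-1)
        else:
            probs=model.predict(token_list)[0].astype(np.float64)
            #YOUR CODE HERE
            #apply the temperature in log space: exp(log(p)/T) = p^(1/T)
            logits=np.log(probs + 1e-12)/temp #the small constant avoids log(0)
            scaled=np.exp(logits - np.max(logits)) #subtract the max for numerical stability
            scaled/=np.sum(scaled) #re-normalize so that the probabilities sum to 1
            predicted = np.random.choice(len(scaled), p=scaled)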

        output_word = ""
        for word, index in tokenizer.word_index.items():
            #we look for the index in the dictionary created by the tokenizer, then we get the word
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text



: 

In [None]:
text = generate_text("I stand here today humbled by the task before us", 30, msl, model, 0.2)
print(text)

: 

### Alternative sampling strategies

__Top-k sampling:__ on each step, sample the next token from __k most likely__ candidates from the language model.

Suppose $k=3$ and the token probabilities are $p=[0.1, 0.35, 0.05, 0.2, 0.3]$. You first need to select $k$ most likely words and set the probability of the rest to zero: $\hat p=[0.0, 0.35, 0.0, 0.2, 0.3]$ and re-normalize: 
$p^*\approx[0.0, 0.412, 0.0, 0.235, 0.353]$.
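The top-k example above can be reproduced with a few lines of plain Python. This is a minimal sketch: `top_k_filter` is a hypothetical helper introduced for illustration, and ties between equal probabilities are broken arbitrarily by the sort.

```python
import random

def top_k_filter(probs, k):
    """Zero out everything except the k most likely entries, then re-normalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

p = [0.1, 0.35, 0.05, 0.2, 0.3]
p_star = top_k_filter(p, k=3)
print([round(x, 3) for x in p_star])  # [0.0, 0.412, 0.0, 0.235, 0.353]

# Sample the index of the next token from the filtered distribution
next_index = random.choices(range(len(p_star)), weights=p_star, k=1)[0]
print(next_index)  # one of 1, 3 or 4
```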

__Nucleus sampling:__ similar to top-k sampling, but this time we select $k$ dynamically. In nucleus sampling, we sample from the most likely tokens whose cumulative probability stays below a fraction __N__ of the probability mass.

Using the same  $p=[0.1, 0.35, 0.05, 0.2, 0.3]$ and nucleus N=0.9, the nucleus words consist of:
1. most likely token $w_2$, because $p(w_2) = 0.35 < N$
2. second most likely token $w_5$, because $p(w_2) + p(w_5) = 0.65 < N$
3. third most likely token $w_4$, because $p(w_2) + p(w_5) + p(w_4) = 0.85 < N$

And that's it, because the next most likely word would overflow: $p(w_2) + p(w_5) + p(w_4) + p(w_1) = 0.95 > N$.

After you've selected the nucleus words, you need to re-normalize their probabilities as in top-k sampling and sample the next token from them.
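The selection steps listed above can be sketched in plain Python. `nucleus_filter` is a hypothetical helper introduced for illustration; it follows the rule used in this section (stop before the cumulative mass reaches N) and always keeps at least the most likely token. Some implementations instead also keep the token that crosses the threshold.

```python
def nucleus_filter(probs, nucleus):
    """Keep the most likely tokens while their cumulative mass stays below `nucleus`."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        # Stop as soon as adding the next token would reach or exceed the nucleus;
        # the most likely token is always kept.
        if keep and mass + probs[i] >= nucleus:
            break
        keep.append(i)
        mass += probs[i]
    # Re-normalise the kept probabilities and set the others to zero
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

p = [0.1, 0.35, 0.05, 0.2, 0.3]
wide = nucleus_filter(p, 0.9)
narrow = nucleus_filter(p, 0.3)

print([round(x, 3) for x in wide])    # [0.0, 0.412, 0.0, 0.235, 0.353]
print([round(x, 3) for x in narrow])  # [0.0, 1.0, 0.0, 0.0, 0.0]
```

With N=0.3 only the most likely token survives, so sampling becomes deterministic; with N=0.9 the three most likely tokens share the probability mass.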

**Exercise 6**: Implement a generate_with_nucleus_sampling function to use the nucleus sampling strategy. Compare (qualitatively) the results obtained with nucleus=0.9 and nucleus=0.3.

In [None]:
def generate_with_nucleus_sampling(seed_text, next_words, max_sequence_len, model, nucleus=0):
    for j in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=
                             max_sequence_len-1, padding='pre')
        #here we obtain the index of the predicted word
        #note that model.predict(...) returns the probabilities associated to the words
        if nucleus==0: #if nucleus == 0, then return the most probable token
            predicted = np.argmax(model.predict(token_list), axis=-1)
        else:
            probs=model.predict(token_list)[0].astype(np.float64)
            #YOUR CODE
            sorted_indexes=np.argsort(probs)[::-1] #token indices, most likely first
            cumulative_probs=np.cumsum(probs[sorted_indexes])
            #keep the tokens whose cumulative probability stays below the nucleus,
            #but always keep at least the most likely token
            n_keep=max(1, int(np.sum(cumulative_probs<nucleus)))
            taken_indexes=sorted_indexes[:n_keep]
            selected_probabilities=probs[taken_indexes]/np.sum(probs[taken_indexes])
            predicted = np.random.choice(taken_indexes, p=selected_probabilities)

        output_word = ""
        for word, index in tokenizer.word_index.items():
            #we look for the index in the dictionary created by the tokenizer, then we get the word
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text

: 

In [None]:
text = generate_with_nucleus_sampling("I stand here today humbled by the task before us", 30, msl, model, 0.9)
print(text)

: 

## Generation using GPT-2

The following script uses a **pre-trained** GPT-2 model to generate texts. This model has been trained on a vast set of documents scraped from the web.

You can use this script to continue a text, using an excerpt from a speech in one of the corpora as the prompt.

**Exercise 7**: Run this script and observe the result. Does the result look like a US inaugural or State of the Union address? How would you adapt the model to produce a realistic US inaugural or State of the Union address?


In [None]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode the text input
text = "I stand here today humbled by the task before us"

indexed_tokens = tokenizer.encode(text)

# Convert the indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])

# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model in evaluation mode to deactivate the DropOut modules
model.eval()

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

encoded_prompt = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
encoded_prompt = encoded_prompt.to(torch.device("cpu"))

output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=200, #number of tokens that will be produced (includes seed)
    temperature=0.9, #regulates "creativity of the model" - 1.0 default
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0, #default values
    do_sample=True,
)

# Batch size == 1. to add more examples please use num_return_sequences > 1
generated_sequence = output_sequences[0].tolist()
text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)

print(text)


: 