## Part 4. N-gram language models

In Part 3 you computed frequencies of n-grams in a corpus. You also computed conditional frequencies for bigrams, that is, the number of times a specific word follows some other word.

We now move on to statistical n-gram language models. In a statistical language model, plain frequencies are converted into probabilities. Since the implementation of such language models is a bit complicated, most of the program code that you need has been prepared for you in a module called `ngrams`. To begin with, you need to import both `nltk` and `ngrams`:

In [None]:
import sys
!{sys.executable} -m pip install nltk
import nltk
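# 'punkt_tab' is what nltk.word_tokenize needs in recent NLTK releases; older releases use 'punkt'
nltk.download(['punkt', 'punkt_tab', 'gutenberg'])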

sys.path.append("../../../morf-synt-2025/src")
import ngrams

### 4.1 Plotting conditional n-gram probabilities

Earlier in this course we discussed *smoothing*, a technique for transferring some probability mass from seen n-grams to unseen n-grams. This makes it possible to estimate a probability for new word sequences that do not occur in the training corpus but that might still occur in the language.
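To see why smoothing is needed, here is a minimal sketch on a toy corpus (the corpus and the helper `mle_bigram` are made up for illustration and are not part of the `ngrams` module). Without smoothing, a bigram that never occurs in the training data gets probability zero:

```python
from collections import Counter

# Toy training corpus (hypothetical, for illustration only)
tokens = "emma forgot what to say and emma forgot what to do".split()

bigrams = Counter(zip(tokens, tokens[1:]))  # counts of (previous word, word) pairs
contexts = Counter(tokens[:-1])             # counts of words that have a successor

def mle_bigram(prev, word):
    """Unsmoothed (maximum likelihood) estimate of P(word | prev)."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

seen = mle_bigram("to", "say")       # "to say" occurs in the toy corpus
unseen = mle_bigram("to", "forgot")  # "to forgot" never occurs
print(seen, unseen)
```

Any sentence containing the unseen bigram would get probability zero under this unsmoothed model, no matter how plausible the rest of the sentence is.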

In this lab, we will produce a smoothed model by mixing (interpolating) zero-, uni-, bi-, tri-, and fourgram probabilities. We have not mentioned **zerograms** before: a zerogram assigns the same (flat) probability to all words regardless of how frequent they are in the language; we also reserve some probability mass for unseen, so-called out-of-vocabulary (OOV), words.
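As a rough sketch of the zerogram idea: one simple way to reserve mass for OOV words is to treat "any unseen word" as one extra vocabulary entry. This is an assumption made for illustration; the `ngrams` module may reserve the OOV mass differently, and `zerogram_prob` is a hypothetical helper.

```python
def zerogram_prob(vocab_size):
    """Flat probability per word type, with one extra slot reserved for OOV words.

    Assumption for illustration: the OOV mass equals the mass of one word type.
    """
    return 1.0 / (vocab_size + 1)

vocab = {"emma", "forgot", "what", "to", "do"}
p = zerogram_prob(len(vocab))

# Every known word gets the same probability, frequent or not,
# and the same amount is left over for an unseen word.
print(p)
```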

To make this a bit more concrete, let us estimate a language model (LM) from Jane Austen's Emma and plot the probabilities of the next word in a given word sequence. First we initialize the language model on text that has been converted to lower case:

In [None]:
tokenized = [w.lower() for w in nltk.corpus.gutenberg.words('austen-emma.txt')]
lm = ngrams.Ngrams(tokenized)
lm.set_weights(0.2, 0.2, 0.2, 0.2, 0.2) # more about weights further below

Then we ask the model for the probabilities of all possible words that could continue a given start of a sentence: _"Emma forgot what to ..."_. (Note that we lower-case this start of a sentence, too, so it will actually be _"emma forgot what to"_.) We then print the 10 most probable words and their probabilities:

In [None]:
start = "Emma forgot what to"
pdist = lm.sorted_pdist_ngram(nltk.word_tokenize(start.lower()))
print(pdist[0:10])

Let us plot this probability distribution as a bar chart. Look at the plot. Do the probabilities make sense? What seems to be happening?

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

n = 40  # number of most probable words to plot

top = pdist[0:n]  # may hold fewer than n entries for a very small vocabulary
words = [w for w, _ in top]
probs = [p for _, p in top]

plt.figure()
plt.bar(range(len(words)), probs)
plt.xticks(range(len(words)), words, rotation=90)
plt.title(start + "...")
plt.xlabel("Word")
plt.ylabel("Probability")
plt.show()

### 4.2 Adjusting the weights

The language model that we are studying is a _mixture_ of five separate models: a zerogram, unigram, bigram, trigram, and fourgram model. Each model contributes to the final result in proportion to the _weight_ that is assigned to it. In the example above we assigned the same weight (= 0.2) to every model, so each of the five models contributed 20% to the end result.
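The mixture can be sketched as a weighted sum of the five component probabilities for one candidate word. The numbers below are invented, and `interpolate` is a hypothetical helper, not the implementation inside `ngrams`:

```python
def interpolate(weights, probs):
    """Weighted sum of component probabilities; the weights must sum to 1.0."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(w * p for w, p in zip(weights, probs))

# Invented probabilities of one candidate word under each model,
# in the order zero-, uni-, bi-, tri-, fourgram.
# The fourgram model has never seen this word in this context.
component_probs = [0.0001, 0.002, 0.05, 0.3, 0.0]

equal = interpolate([0.2, 0.2, 0.2, 0.2, 0.2], component_probs)
fourgram_only = interpolate([0.0, 0.0, 0.0, 0.0, 1.0], component_probs)
print(equal, fourgram_only)
```

With equal weights the lower-order models rescue the word from a zero probability; with all weight on the fourgram model the word becomes impossible.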

Your next task is to modify the weights:
* What happens if you put all weight (1.0) on the zero-, uni-, bi-, tri-, or fourgram? Why?
* What happens if you put most weight on some of the models? Why?
* Is there an optimal configuration for the weights?

Below you can find code for repeating the experiment with your own weights. Note that the sum of the weights must always be 1.0 (= 100%).

In [None]:
# Modify the weights; their sum must be 1.0
zerogram_weight = 0.2
unigram_weight = 0.2
bigram_weight = 0.2
trigram_weight = 0.2
fourgram_weight = 0.2

# You can change the start of sentence, too:
start = "Emma forgot what to"

# This part you don't need to change
lm.set_weights(zerogram_weight, unigram_weight, bigram_weight, trigram_weight, fourgram_weight)
pdist = lm.sorted_pdist_ngram(nltk.word_tokenize(start.lower()))
n = 40  # number of most probable words to plot
top = pdist[0:n]  # may hold fewer than n entries for a very small vocabulary
words = [w for w, _ in top]
probs = [p for _, p in top]
plt.figure()
plt.bar(range(len(words)), probs)
plt.xticks(range(len(words)), words, rotation=90)
plt.title(start + "...")
plt.xlabel("Word")
plt.ylabel("Probability")
plt.show()

### 4.3 Perplexity

One way to evaluate a language model is to calculate the overall probability it assigns to some test corpus that the model has _not_ been trained on. If the model gives a high probability to the test data, then the model has been able to predict the test data rather well, which usually means it is a good model. If the model gives a low probability to the test data, then the model has not been able to predict the data very well, and the model is bad, at least for this type of text.

Rather than using probability as a measure, **perplexity** is typically used. Perplexity can be derived from the probability, and a good thing about perplexity is that it is not dependent on the _length_ of the test corpus. (If we calculate the probability of the test corpus, the longer the corpus, the lower the probability, usually.)

Perplexity measures how "perplexed", "confused", or "surprised" our language model is by the test corpus. Perplexity is a positive number, such as 120. Roughly speaking, this number means that the model is, on average, as uncertain as if it had to choose among 120 equally likely continuations for each word. We want this value to be low, because that means that the model is fairly confident about what the next word should be.
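As a minimal sketch of the standard definition, perplexity is the inverse of the geometric mean of the per-word probabilities. The helper below is hypothetical; `lm.perplexity` may differ in details such as how sentence boundaries are handled.

```python
import math

def perplexity(word_probs):
    """Perplexity from the probabilities a model assigned to each word of a text."""
    n = len(word_probs)
    log_sum = sum(math.log(p) for p in word_probs)  # log of the total probability
    return math.exp(-log_sum / n)                   # normalized by text length

# A model that always hesitates between 120 equally likely words
short_text = perplexity([1 / 120] * 10)
long_text = perplexity([1 / 120] * 1000)
print(short_text, long_text)
```

Both values are approximately 120: the total probability of the longer text is far smaller, but dividing by the number of words removes the effect of length.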

Let us compute the perplexity of our Emma LM on some test sentences:

In [None]:
lm.set_weights(0.2, 0.2, 0.2, 0.2, 0.2) # reset the weights

sent1 = "She was the youngest of the two daughters of a most affectionate, indulgent father; " + \
        "and had, in consequence of her sister's marriage, been mistress of his house from a " + \
        "very early period."

sent2 = "The jury further said in term-end presentments that the City Executive Committee, which had " + \
        "over-all charge of the election, ''deserves the praise and thanks of the City of Atlanta'' for " + \
        "the manner in which the election was conducted."

sent3 = "Emma could not admit to herself that all that she dreamed of was that one day a handsome, rich, " + \
        "and particularly intelligent nobleman would ask her father for her hand in marriage."

print("Perplexity of sentence 1:", lm.perplexity(nltk.word_tokenize(sent1.lower())))
print("Perplexity of sentence 2:", lm.perplexity(nltk.word_tokenize(sent2.lower())))
print("Perplexity of sentence 3:", lm.perplexity(nltk.word_tokenize(sent3.lower())))

What can you say about the perplexities, given the following additional information?
* Sentence 1 is straight from the novel, so it is part of the training corpus.
* Sentence 2 is from the Brown corpus.
* Sentence 3 is an invented sentence, inspired by Jane Austen.

### 4.4 Generating text using an n-gram model

In this last section we will use statistical language models to generate random text. The text is only partly random, in the sense that words are selected according to their n-gram probabilities. The more specific the model is, the more "real" the language will appear.
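Selecting words "according to their probabilities" can be sketched as weighted random sampling. The distribution below is invented, and `sample_next` is a hypothetical helper; how `ngrams` samples internally is not shown here.

```python
import random

def sample_next(pdist, rng):
    """Draw one word from a list of (word, probability) pairs."""
    words = [w for w, _ in pdist]
    weights = [p for _, p in pdist]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so that the run is repeatable
toy_pdist = [("do", 0.6), ("say", 0.3), ("think", 0.1)]  # invented distribution

samples = [sample_next(toy_pdist, rng) for _ in range(1000)]
for word, _ in toy_pdist:
    print(word, samples.count(word))
```

Probable words are drawn often and improbable words rarely, which is why the generated text varies from run to run but still follows the patterns of the training corpus.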

To start with, let us refresh our memory on what texts are available in the Gutenberg corpus. You can pick some other text than Jane Austen's Emma:

In [None]:
print("Available in the Gutenberg corpus:", nltk.corpus.gutenberg.fileids())

text = 'austen-emma.txt'  # you can change this
tokenized = [w.lower() for w in nltk.corpus.gutenberg.words(text)]
lm = ngrams.Ngrams(tokenized)

Next set the weights (or you can come back to this later):

In [None]:
lm.set_weights(0.01, 0.1, 0.2, 0.3, 0.39) # you can change this

Then generate a sentence (or something more or less like a sentence) at random. Every time you rerun this command, the sentence will change:

In [None]:
words = lm.generate_sentence()
print(" ".join(words))

You can also write your own beginning of the sentence and let the language model continue from that...

In [None]:
start = "Oh, I wish I could"
words = lm.generate_sentence(start=nltk.word_tokenize(start.lower()))
print(" ".join(words))

Test out different corpora to train your model on, and also modify the weights.

After this, you can continue to the home assignment.