[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 7.1 N-gram language modelling using NLTK

In this exercise we experiment with n-gram language models using [`NLTK`'s functionality](https://www.nltk.org/api/nltk.lm.html).

You might also find the following article insightful: [Language Modeling with NLTK](https://medium.com/swlh/language-modelling-with-nltk-20eac7e70853)



In [1]:
!pip install -U "nltk>=3.7.0"

import nltk
nltk.download("reuters")
nltk.download("punkt_tab")
from nltk.corpus import reuters

from nltk.lm import Vocabulary
from nltk.util import pad_sequence, ngrams
from nltk.lm.preprocessing import flatten
from nltk.lm.models import Laplace

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


(a) ***Collect*** all the sentences from the `reuters` corpus, ***lowercase*** them, and ***pad*** the start and end with special symbols (this means we will have n-grams that distinguish the start and end of sentences). For the left pad symbol use `<s>` and the right use `</s>`.

Hint: Use the [`reuters.sents()`](https://www.nltk.org/api/nltk.corpus.html#corpus-reader-functions) method and the `pad_sequence` function, both of which have already been imported.
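To see what the padding does before running it on the whole corpus, here is a minimal sketch on a made-up three-word sentence. The helper name `pad` and the toy sentence are illustrative assumptions, not part of the exercise.

```python
from nltk.util import pad_sequence

# Hypothetical toy sentence, not taken from the Reuters corpus.
toy_sentence = ["Trade", "friction", "rises"]

def pad(tokens, n):
    """Lowercase the tokens and add n-1 boundary symbols on each side."""
    return list(pad_sequence([t.lower() for t in tokens],
                             n=n,
                             pad_left=True, left_pad_symbol="<s>",
                             pad_right=True, right_pad_symbol="</s>"))

padded_bigram = pad(toy_sentence, 2)   # one pad symbol per side
padded_trigram = pad(toy_sentence, 3)  # two pad symbols per side
print(padded_bigram)
print(padded_trigram)
```

With `n=2` each sentence gets a single `<s>` and a single `</s>`, which is what the exercise output shows; a larger `n` adds more boundary symbols.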

In [2]:
#(a)
sentences = []
# reuters.sents() returns the corpus as a list of sentences, each a list of word strings.
for s in reuters.sents():
  # Lowercase every token of the sentence.
  lower_s = [word.lower() for word in s]
  # Pad with one boundary symbol on each side (n=2 adds n-1 = 1 symbol per side).
  padded_lower_s = list(pad_sequence(lower_s,
                                     n=2,
                                     pad_left=True,
                                     left_pad_symbol="<s>",
                                     pad_right=True,
                                     right_pad_symbol="</s>"))
  sentences.append(padded_lower_s)

#Inspect the first 3 sentences
for s in sentences[:3]:
  print(s)

['<s>', 'asian', 'exporters', 'fear', 'damage', 'from', 'u', '.', 's', '.-', 'japan', 'rift', 'mounting', 'trade', 'friction', 'between', 'the', 'u', '.', 's', '.', 'and', 'japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.', '</s>']
['<s>', 'they', 'told', 'reuter', 'correspondents', 'in', 'asian', 'capitals', 'a', 'u', '.', 's', '.', 'move', 'against', 'japan', 'might', 'boost', 'protectionist', 'sentiment', 'in', 'the', 'u', '.', 's', '.', 'and', 'lead', 'to', 'curbs', 'on', 'american', 'imports', 'of', 'their', 'products', '.', '</s>']
['<s>', 'but', 'some', 'exporters', 'said', 'that', 'while', 'the', 'conflict', 'would', 'hurt', 'them', 'in', 'the', 'long', '-', 'run', ',', 'in', 'the', 'short', '-', 'term', 'tokyo', "'", 's', 'loss', 'might', 'be', 'their', 'gain', '.', '</s>']


(b) If needed, ***flatten*** the sentences into a single list of words representing the entire corpus and create a finite vocabulary using NLTK's [`Vocabulary` constructor](https://www.nltk.org/api/nltk.lm.html#nltk.lm.Vocabulary) by specifying a frequency cut-off at 10.

(c) Subsequently, inspect the lengths of the corpus and the vocabulary. Compare the length of the vocabulary with the number of unique words in the corpus.
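The effect of the frequency cut-off can be checked on a tiny example first. The sketch below uses a made-up token list and a cut-off of 2 (both assumptions chosen for illustration) to show that the vocabulary length counts only the words at or above the cut-off, plus the `<UNK>` label, and so differs from the number of distinct tokens.

```python
from nltk.lm import Vocabulary

# Hypothetical toy corpus: "the" x3, "trade" x2, "rift" x1, "row" x1.
toy_tokens = ["the", "the", "the", "trade", "trade", "rift", "row"]
toy_vocab = Vocabulary(toy_tokens, unk_cutoff=2)

vocab_size = len(toy_vocab)          # words with count >= 2, plus <UNK>
unique_tokens = len(set(toy_tokens)) # distinct tokens before the cut-off
looked_up = toy_vocab.lookup(["the", "rift"])

print(vocab_size, unique_tokens)
print(looked_up)
```

Words below the cut-off are not deleted from the counts; they are simply mapped to `<UNK>` when looked up.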

In [10]:
#(b)
# Flatten the padded sentences into one token list and build the vocabulary from it.
# unk_cutoff (int) – Words that occur less frequently than this value are not considered part of the vocabulary.
flat_sentences = list(flatten(sentences))
vocab = Vocabulary(flat_sentences, unk_cutoff=10)

#(c)
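print(f"Number of tokens in corpus: {len(flat_sentences)}")
# len(vocab) counts only the words at or above the cut-off, plus the <UNK> label.
# For the comparison asked for in (c), len(set(flat_sentences)) gives the number of distinct tokens before the cut-off.
print(f"Vocabulary size (cut-off 10, including <UNK>): {len(vocab)}")

Number of tokens in corpus: 1830349
Vocabulary size (cut-off 10, including <UNK>): 8070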


(d) Split the text into `train` and `test` sets as follows: reserve the first 10,000 words for the `test` set and the rest for `train`.

In [4]:
#(d)
train_words = flat_sentences[10000:]
test = flat_sentences[:10000]

Train three n-gram language models with n = 1, 2, 3 respectively, using add-one smoothing via NLTK's `Laplace` class (already imported from `nltk.lm.models`).
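As a reminder of what add-one smoothing computes, the following is a hand-rolled sketch for bigrams on a made-up token list: the probability of a word given its context is (count(context, word) + 1) / (count(context) + V). The function name `laplace_prob` and the toy data are assumptions for illustration; in NLTK, V is the length of the model's vocabulary, which also includes `<UNK>`.

```python
from collections import Counter

# Hypothetical toy token stream with sentence boundary symbols.
toy = ["<s>", "a", "b", "a", "b", "a", "</s>"]

vocab_words = sorted(set(toy))
vocab_size = len(vocab_words)                # V = 4 in this toy example
bigram_counts = Counter(zip(toy, toy[1:]))   # count(context, word)
context_counts = Counter(toy[:-1])           # count(context)

def laplace_prob(word, context):
    """Add-one smoothed estimate of P(word | context)."""
    return (bigram_counts[(context, word)] + 1) / (context_counts[context] + vocab_size)

p_seen = laplace_prob("b", "a")     # bigram observed twice
p_unseen = laplace_prob("a", "a")   # bigram never observed, still non-zero
total = sum(laplace_prob(w, "a") for w in vocab_words)
print(p_seen, p_unseen, total)
```

Every unseen bigram receives a small non-zero probability, and the probabilities for a given context still sum to one.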

In [5]:
lms = {}

for n in [1, 2, 3]:
  train_ngrams = list(ngrams(train_words, n))
  print(train_ngrams[:10])
  # Laplace implements add-one smoothing for a model of order n.
  lm = Laplace(n)
  # fit() expects an iterable of n-gram sequences, hence the extra list around train_ngrams.
  lm.fit([train_ngrams], vocab)
  lms[n] = lm

[('reasons',), ('are',), ('low',), ('domestic',), ('inflation',), (',',), ('a',), ('bottoming',), ('out',), ('of',)]
[('reasons', 'are'), ('are', 'low'), ('low', 'domestic'), ('domestic', 'inflation'), ('inflation', ','), (',', 'a'), ('a', 'bottoming'), ('bottoming', 'out'), ('out', 'of'), ('of', 'the')]
[('reasons', 'are', 'low'), ('are', 'low', 'domestic'), ('low', 'domestic', 'inflation'), ('domestic', 'inflation', ','), ('inflation', ',', 'a'), (',', 'a', 'bottoming'), ('a', 'bottoming', 'out'), ('bottoming', 'out', 'of'), ('out', 'of', 'the'), ('of', 'the', 'fall')]


(e) Evaluate each of the 3 language models by determining the perplexity on the `train` and `test` sets.

Hint: Use the [`perplexity` method](https://www.nltk.org/api/nltk.lm.api.html#nltk.lm.api.LanguageModel.perplexity)
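The `perplexity` method takes an iterable of n-gram tuples rather than a plain list of words. The sketch below, on made-up toy data, fits a bigram `Laplace` model and checks the method against the definition: 2 raised to the negative mean log2-probability of each word given its context. The variable names and toy tokens are assumptions for illustration.

```python
import math

from nltk.lm import Laplace
from nltk.util import ngrams

# Hypothetical toy data, already padded.
toy_train = ["<s>", "a", "b", "a", "b", "a", "</s>"]
toy_test = ["<s>", "a", "b", "</s>"]

toy_lm = Laplace(2)
toy_lm.fit([ngrams(toy_train, 2)], vocabulary_text=toy_train)

# perplexity() is given bigram tuples, not raw words.
test_bigrams = list(ngrams(toy_test, 2))
pp = toy_lm.perplexity(test_bigrams)

# The same quantity computed from the definition.
log_probs = [math.log2(toy_lm.score(word, (context,))) for context, word in test_bigrams]
manual_pp = 2 ** (-sum(log_probs) / len(log_probs))
print(pp, manual_pp)
```

For a model of order n, the evaluated text therefore has to be converted with `ngrams(text, n)` before it is passed to `perplexity`.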

In [9]:
#(e)
# Reuse the models trained in the previous cell (lms[1], lms[2], lms[3]).
# perplexity() expects an iterable of n-gram tuples, so each word list is
# converted with ngrams() at the order of the model being evaluated.
for n, lm in lms.items():
    train_perplexity = lm.perplexity(ngrams(train_words, n))
    test_perplexity = lm.perplexity(ngrams(test, n))

    print(f"Perplexity of train words for {n}-gram model: {train_perplexity}")
    print(f"Perplexity of test words for {n}-gram model: {test_perplexity}")


(f) Comment on the performance of the three models. Which one is best, and why?

In [None]:
#(f)
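# Perplexity is the exponentiated average negative log-probability that a model assigns to each token of a text.
# A lower perplexity means the model gave the evaluated text a higher probability, so the model with the lowest test perplexity performs best.
# The comparison is only fair when all models are scored on the same tokens with the same vocabulary.
# Add-one smoothing over a large vocabulary also moves a lot of probability mass to unseen n-grams,
# which can penalise the higher-order models because their counts are sparser.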

(g) What would the perplexity be for a predictor which randomly guesses from any one of the words occurring in the test set?
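A predictor that guesses uniformly among V distinct words assigns probability 1/V to every token, so its perplexity works out to V regardless of the text. The sketch below verifies this on a made-up token list; the function name `uniform_perplexity` and the toy tokens are assumptions for illustration.

```python
import math

def uniform_perplexity(tokens):
    """Perplexity of a guesser that picks uniformly among the distinct tokens."""
    v = len(set(tokens))
    p = 1 / v
    cross_entropy = -sum(math.log2(p) for _ in tokens) / len(tokens)
    return 2 ** cross_entropy

# Hypothetical toy test set with 4 distinct words.
toy_test_tokens = ["the", "row", "hurt", "trade", "the", "row"]
toy_uniform_pp = uniform_perplexity(toy_test_tokens)
print(toy_uniform_pp, len(set(toy_test_tokens)))
```

Applied to the exercise, the same reasoning gives `len(set(test))` as the perplexity of the random guesser.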

In [None]:
#(g)
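# A uniform guesser over V distinct words has perplexity V, so the answer is the number of distinct words in the test set, len(set(test)).
# 8070 is the size of the cut-off vocabulary built from the whole corpus, not the number of distinct words in the test set.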

(h) **Optional:** Experiment with models using different smoothing approaches. What is the best perplexity you can achieve?
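One possible starting point is `Lidstone`, which generalises add-one smoothing to add-gamma. The sketch below compares `Laplace` with `Lidstone` at gamma = 0.1 on made-up toy data; the helper `fit_and_score`, the gamma value, and the tokens are assumptions for illustration, and a toy result says nothing definite about what happens on the Reuters corpus.

```python
from nltk.lm import Laplace, Lidstone
from nltk.util import ngrams

# Hypothetical toy data, already padded and flattened.
toy_train = ["<s>", "the", "row", "hurt", "trade", "</s>",
             "<s>", "the", "row", "hurt", "exports", "</s>"]
toy_test = ["<s>", "the", "row", "hurt", "trade", "</s>"]

def fit_and_score(model, order=2):
    """Fit a model on the toy training n-grams and return its test perplexity."""
    model.fit([ngrams(toy_train, order)], vocabulary_text=toy_train)
    return model.perplexity(ngrams(toy_test, order))

laplace_pp = fit_and_score(Laplace(2))
lidstone_pp = fit_and_score(Lidstone(0.1, 2))  # gamma = 0.1, order = 2
print(f"Laplace (add-one):  {laplace_pp:.3f}")
print(f"Lidstone (add-0.1): {lidstone_pp:.3f}")
```

A smaller gamma reserves less probability mass for unseen n-grams, so trying a few gamma values on the real `train` and `test` sets is a cheap first experiment.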