## Introduction to Natural Language Processing
[**CC-BY-NC-SA**](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en)<br/>
Prof. Dr. Annemarie Friedrich / Fabio Mariani<br/>
Faculty of Applied Computer Science, University of Augsburg<br/>

# Language Modeling

**Learning Goals**

* In this notebook, we will explore how to train and evaluate trigram (3-gram) language models on two genres of the Brown Corpus.
* We will assess how well each model performs on a test set using *perplexity* as a metric.
* We will analyze the results to understand how genre, model parameters, and smoothing techniques influence language model performance.

## Resources

Before we build our language model, we need to import several modules and download necessary datasets:

- `nltk`: The Natural Language Toolkit, a popular library for working with NLP.
- `brown`, `gutenberg`: Two corpora provided by NLTK — the Brown Corpus contains categorized texts (e.g., fiction, news), while the Gutenberg Corpus includes classic literature.
- `math`: Used for mathematical operations, such as calculating logarithms and exponentials during perplexity evaluation.

We also use `nltk.download()` to ensure that the required corpora and tokenizers are available locally. This step downloads:
- `'gutenberg'` – Jane Austen and other literary texts.
- `'punkt_tab'` – Tokenizers.
- `'brown'` – A categorized corpus of American English.

In [1]:
!pip install --user -U nltk

import nltk
from nltk.lm import Laplace, StupidBackoff
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.corpus import brown, gutenberg
from nltk.util import ngrams
from nltk import pad_sequence
import math
import pprint as pp

nltk.download('gutenberg')
nltk.download('punkt_tab')
nltk.download('brown')



[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

## Preparing the Training and Test Data


We prepare the data used to train and test our trigram language models.

We extract sentences from the Brown Corpus, grouped by two distinct categories (genres):
- ``fiction``: used to train our first language model.
- ``news``: used to train our second language model.

&#9997; To what extent do we expect the two language models to differ?

_Your Answer:_
> Informative texts could be argued to be more predictable, since they are bound by more constraints (they are more realistic, with no dragons or elves). In contrast, narrative texts may be more "creatively free", so that nonsensical or highly unlikely sentences remain possible in this context. The higher predictability of the former could result in a language model performing better when trained on this type of text.

> First "suspicion": news might be the better source material for training these models.

In [2]:
# obtain sentences from Brown corpus object
model_fiction_sentences = brown.sents(categories='fiction')
model_news_sentences = brown.sents(categories='news')

# print the number of sentences that we use to train each language model
print(len(model_fiction_sentences))
print(len(model_news_sentences))

4249
4623


We use 200 sentences from "Emma" by Jane Austen (Gutenberg Corpus) as our test data, i.e., we will check how well the language models estimated on the training corpora match the distribution of n-grams in this new (unseen) text. The better a language model captures the token distributions and sequences in the target domain (or genre), the better the language model (for this domain).

&#9997; Do you expect the `fiction` or the `news` model to match this test text better?

_Your answer:_
> Considering Jane Austen to be a writer of fiction, and Emma to be a work of fiction, I expect the `fiction` model to match this test better.

> Going back to my first "suspicion": even though I suspect a model trained on news might perform better in general, I would expect the model trained on fiction sentences to perform better if the unseen data we use for testing is a narrative text.

In [3]:
# We test on these sentences.
austen_sentences = gutenberg.sents('austen-emma.txt')[700:900]

## Training the N-gram Language Model

The following function trains an N-gram language model using the **Laplace smoothing** technique.

### Why Do We Need Smoothing in Language Models?

When training N-gram language models, we estimate the probability of a word given its preceding context (e.g., for trigrams, `P(w3 | w1, w2)`). However, we face a major challenge:

#### The Zero-Probability Problem

If an N-gram from the test set **never appeared** in the training data, the model assigns it a probability of **zero**.

- Even large training corpora can miss many valid n-gram combinations.
- This leads to entire sentences getting **zero probability**, which is undesirable in practice: even if we have not seen a particular combination of words, the sentence may still contain some likely parts, and we would like the score to reflect this.
- Computing perplexity also breaks down if the test set contains unseen sequences (which almost always happens in practice).
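
To make the problem concrete, here is a minimal sketch using plain Python counters instead of NLTK. The two-sentence training corpus and the helper names (`pad`, `mle_prob`, `sentence_prob`) are made up for illustration; the estimate is the unsmoothed relative frequency `count(w1, w2, w3) / count(w1, w2)`.

```python
from collections import Counter

# Toy training corpus (hypothetical, for illustration only).
train = [["the", "dog", "barked"], ["the", "dog", "slept"]]


def pad(sentence, n=3):
    """Prepend n-1 start symbols so the first words also have a full context."""
    return ["<s>"] * (n - 1) + list(sentence)


# Count trigrams and their two-word contexts.
trigram_counts = Counter()
context_counts = Counter()
for sent in train:
    tokens = pad(sent)
    for i in range(len(tokens) - 2):
        trigram_counts[tuple(tokens[i:i + 3])] += 1
        context_counts[tuple(tokens[i:i + 2])] += 1


def mle_prob(word, context):
    """Unsmoothed estimate: count(context + word) / count(context)."""
    if context_counts[context] == 0:
        return 0.0
    return trigram_counts[context + (word,)] / context_counts[context]


def sentence_prob(sentence):
    """Product of the trigram probabilities of all words in the sentence."""
    tokens = pad(sentence)
    prob = 1.0
    for i in range(2, len(tokens)):
        prob *= mle_prob(tokens[i], tuple(tokens[i - 2:i]))
    return prob


seen = sentence_prob(["the", "dog", "barked"])    # every trigram was observed
unseen = sentence_prob(["the", "dog", "ran"])     # last trigram never observed
print(seen, unseen)
```

In this toy setting, a single unseen trigram is enough to wipe out the probability of the whole sentence, even though "the dog" is a perfectly familiar beginning.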

&#9997; Check the formula in the slides: why do zero probabilities pose a problem for computing perplexity?

_Your answer:_

> We define perplexity as "the inverse probability" of the test sequence (normalized by its length), i.e. 1 / probability. If any n-gram in a sequence gets a probability of 0, the probability of the entire sequence becomes 0 (as mentioned above), and its perplexity is impossible to calculate: we would have to divide by 0 or, in the logarithmic formulation, take $\log 0$, which is undefined.

**Smoothing** adjusts the probability distribution so that **no possible sequence gets a probability of zero**, even if it was unseen during training.

#### Laplace Smoothing

In this exercise, we use **Laplace smoothing**:

- It adds `1` to every possible N-gram count.
- This ensures **every possible word combination has a non-zero count**.
- It’s simple and effective for our purposes.

Smoothing is crucial for building robust language models that can handle new or rare word sequences.
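
As a rough sketch of the arithmetic, add-one smoothing for trigrams can be written as `(count(w1, w2, w3) + 1) / (count(w1, w2) + V)`, where `V` is the vocabulary size. The function `laplace_prob`, the toy counts, and `vocab_size = 5` below are illustrative assumptions, not NLTK code; to my understanding NLTK's `Laplace` applies the same formula, with `V` counted over the model's own vocabulary (including special tokens).

```python
from collections import Counter


def laplace_prob(word, context, ngram_counts, context_counts, vocab_size):
    """Add-one estimate: (count(context + word) + 1) / (count(context) + V)."""
    numerator = ngram_counts[context + (word,)] + 1
    denominator = context_counts[context] + vocab_size
    return numerator / denominator


# Toy counts (hypothetical): the context ('the', 'dog') was seen twice.
ngram_counts = Counter({("the", "dog", "barked"): 1, ("the", "dog", "slept"): 1})
context_counts = Counter({("the", "dog"): 2})
vocab_size = 5  # assumed vocabulary size for this example

# Seen trigram: (1 + 1) / (2 + 5)
p_seen = laplace_prob("barked", ("the", "dog"), ngram_counts, context_counts, vocab_size)
# Unseen word after a seen context: (0 + 1) / (2 + 5)
p_unseen_word = laplace_prob("ran", ("the", "dog"), ngram_counts, context_counts, vocab_size)
# Unseen context: (0 + 1) / (0 + 5), i.e. a uniform 1 / V
p_unseen_context = laplace_prob("ran", ("a", "cat"), ngram_counts, context_counts, vocab_size)

print(p_seen, p_unseen_word, p_unseen_context)
```

Note that a completely unseen context falls back to the uniform probability `1 / V`, so with a large vocabulary the smoothed probabilities of unseen events are small but never zero.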


In [4]:
def train_ngram_model(corpus_sentences, n=3):
    train_data, vocab = padded_everygram_pipeline(n, corpus_sentences)
    model = Laplace(n)
    model.fit(train_data, vocab)
    return model

We can then train the two models:

In [5]:
n = 3 # trigram

model_fiction = train_ngram_model(model_fiction_sentences, n)
model_news = train_ngram_model(model_news_sentences, n)

## Applying the Language Model on New Data

Once a language model is trained, we want to see how well it performs on new, unseen data. The function below takes a trained model and a list of tokenized sentences, then calculates the probability of each word based on its context using the model:

1) We add ``<s>`` padding to the beginning of the sentence so that the first n-grams can be formed correctly. For trigrams ``(n=3)``, this adds two ``<s>`` tokens: ``['<s>', '<s>', 'the', ...]``.

2) We split the sentence into overlapping sequences of tokens (n-grams). For trigrams: ``('<s>', '<s>', 'the'), ('<s>', 'the', 'dog'), ('the', 'dog', 'barked'), ... ``

3) We separate each n-gram into context and target word.
    - n-gram: ``('the', 'dog', 'barked')``:
        - context (what the model sees): ``('the', 'dog')``
        - target word (what we expect the model to predict): ``barked``

4) We calculate the probability that the model assigns to the expected word given the context.

5) The function returns a list of ``(word, probability, context)`` tuples for all the words in the new data.
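
Steps 1 to 3 can be sketched without NLTK as follows. The helper names `left_pad` and `context_target_pairs` are hypothetical and only mirror what `pad_sequence` and `ngrams` are used for in the function below; the actual notebook code relies on the NLTK utilities.

```python
def left_pad(tokens, n, pad_symbol="<s>"):
    """Step 1: prepend n-1 start symbols to the token list."""
    return [pad_symbol] * (n - 1) + list(tokens)


def context_target_pairs(tokens, n):
    """Steps 2 and 3: slide a window of size n and split it into (context, target)."""
    padded = left_pad(tokens, n)
    pairs = []
    for i in range(len(padded) - n + 1):
        ngram = tuple(padded[i:i + n])
        pairs.append((ngram[:-1], ngram[-1]))
    return pairs


pairs = context_target_pairs(["the", "dog", "barked"], n=3)
for context, target in pairs:
    print(context, "->", target)
```

Each word of the sentence appears exactly once as a target, so the number of pairs equals the number of tokens in the sentence.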


In [6]:
def apply_model(model, tokenized_sentences, n):
    prob_t = []
    for sentence in tokenized_sentences:
        padded_tokens = pad_sequence(sentence, n=n, pad_left=True, left_pad_symbol='<s>')
        for ngram in ngrams(padded_tokens, n):
            context = ngram[:-1]
            word = ngram[-1]
            prob = model.score(word, context)
            prob_t.append((word, prob, context))
    return prob_t

In [7]:
model_fiction_probs = apply_model(model_fiction, austen_sentences, n)
model_news_probs = apply_model(model_news, austen_sentences, n)

# Take a look at the data structure
pp.pprint(model_fiction_probs[:10])

[('"', 7.377895824110964e-05, ('<s>', '<s>')),
 ('You', 0.00010746910263299302, ('<s>', '"')),
 ('have', 0.00010746910263299302, ('"', 'You')),
 ('made', 0.0002149151085321298, ('You', 'have')),
 ('her', 0.00010743446497636442, ('have', 'made')),
 ('too', 0.0001074229240519927, ('made', 'her')),
 ('tall', 0.0001074575542660649, ('her', 'too')),
 (',', 0.00010746910263299302, ('too', 'tall')),
 ('Emma', 0.00010746910263299302, ('tall', ',')),
 (',"', 0.00010746910263299302, (',', 'Emma'))]
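
Several entries in this output share exactly the same value. A possible reading, assuming the add-one formula `(count(context, word) + 1) / (count(context) + V)`, is that these are trigrams whose context never occurred in the fiction training data, so the score collapses to `1 / V`. The short sketch below only does arithmetic on the printed numbers; the variable names are made up for illustration.

```python
# Values copied from the printed output above.
p_unseen_context = 0.00010746910263299302   # e.g. 'You' after ('<s>', '"')
p_sentence_start = 7.377895824110964e-05    # '"' after ('<s>', '<s>')

# Assumption: with count(context) = 0 and count(context, word) = 0,
# add-one smoothing gives 1 / V, so V can be read off as the inverse.
implied_vocab_size = round(1 / p_unseen_context)

# Assumption: '"' never starts a training sentence, so its score is
# 1 / (count('<s>', '<s>') + V); solving for the context count:
implied_start_count = round(1 / p_sentence_start) - implied_vocab_size

print(implied_vocab_size, implied_start_count)
```

Under these assumptions the implied vocabulary size is 9305, and the implied count of the sentence-start context is 4249, which matches the number of fiction training sentences printed earlier. This is consistent with the interpretation that most of the first ten test trigrams have contexts the fiction model has never seen.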


## Exercise: Calculate Perplexity

Now that we’ve evaluated word probabilities using our language models, it’s time to **quantify** how surprised the models are with new data (test text) using **perplexity**.

---

### What is Perplexity?

Perplexity measures how well a language model predicts a sequence of words. It can be interpreted as:

> "**How surprised** is the model when it sees the actual test data?"

- A **lower** perplexity score means the model is more confident and better at predicting the text.
- A **higher** perplexity indicates more uncertainty or poor prediction.

Mathematically, for a sequence of $N$ words with probabilities $p_1, p_2, \ldots, p_N$, perplexity is defined as:

$$
\text{Perplexity} = \exp \left( -\frac{1}{N} \sum_{i=1}^{N} \log p_i \right)
$$

Where:
- $p_i$ is the predicted probability of the $i$-th word (following a history whose length depends on the language model),
- $N$ is the total number of predicted words.
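
The same definition can be read as the inverse of the geometric mean of the word probabilities. A short derivation, using only the symbols introduced above:

$$
\exp \left( -\frac{1}{N} \sum_{i=1}^{N} \log p_i \right)
= \exp \left( \log \prod_{i=1}^{N} p_i^{-1/N} \right)
= \left( \prod_{i=1}^{N} p_i \right)^{-\frac{1}{N}}
$$

As a sanity check, if every $p_i = 1/k$, the product is $k^{-N}$ and the perplexity is exactly $k$: the model is as uncertain as if it were choosing uniformly among $k$ words at each step.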


### Your Task

Complete the following function to compute perplexity based on the list of predicted probabilities.

**Input**

``words_probs``: A list of tuples like ``(word, probability, context)``

Example: ``[('the', 0.003, ('<s>', '<s>')), ('dog', 0.02, ('<s>', 'the'))]``


**Output**

Returns a single float value: the computed perplexity of the model over the test data.



The function may use Python's built-in ``math`` module:
- `math.log(prob)` calculates the natural logarithm (ln) of a probability.
- `math.exp(x)` computes the exponential $e^x$.

You can read more here: [math documentation](https://docs.python.org/3/library/math.html)
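
One practical reason for summing logarithms rather than multiplying probabilities is floating-point underflow. The following sketch uses a made-up list of 5000 identical probabilities (not the notebook's data) to compare both approaches.

```python
import math

# Hypothetical test set: 5000 tokens, each assigned probability 1e-4.
probs = [1e-4] * 5000

# Naive approach: multiply all probabilities directly.
product = 1.0
for p in probs:
    product *= p
# The true value (1e-20000) is far below the smallest representable float,
# so the running product collapses to 0.0 and its inverse cannot be taken.

# Log-space approach: sum the logs, average, negate, exponentiate.
log_sum = sum(math.log(p) for p in probs)
perplexity = math.exp(-log_sum / len(probs))

print(product, perplexity)
```

The log-space result is approximately 10000, i.e. the inverse of the per-token probability, which also gives a rough sense of scale for the perplexity values reported below.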

In [29]:
def calculate_perplexity(words_probs):
    # Sum the natural logs of the probabilities.
    # Each item has the form (word, probability, context).
    sum_of_lns = 0.0
    for _, prob, _ in words_probs:
        sum_of_lns += math.log(prob)

    # Multiply the sum by -1/N, where N is the number of predicted words.
    exponent = -sum_of_lns / len(words_probs)

    # Perplexity = e ^ exponent.
    perp = math.exp(exponent)
    print(f"Perplexity = {perp}")

    return perp

In [30]:
ppl_fiction = calculate_perplexity(model_fiction_probs)
ppl_news = calculate_perplexity(model_news_probs)

print(f'Perplexity for model_fiction: {ppl_fiction}')
print(f'Perplexity for model_news: {ppl_news}')

Perplexity = 7670.091861316205
Perplexity = 12425.717876764873
Perplexity for model_fiction: 7670.091861316205
Perplexity for model_news: 12425.717876764873


 &#9997; What do you observe? What does this mean? Do the observations match your expectations?

 _Your answer:_

 > The perplexity reported for model_fiction (7670.0918...) is lower than that reported for model_news (12425.7178...). The observation matches my expectations: being a model trained on fiction sentences, I expected model_fiction to be less surprised by 'Emma'.

  &#9997; Experiment with at least two parameters, e.g., change the length of the history ($n$), try a different smoothing method, or add training data (check out how to use text from several categories of the Brown corpus, for example), and briefly describe your findings. Submit the version of the notebook containing the answers and code for the original question, but keep your variants either below it or commented out.