<a href="https://colab.research.google.com/github/probabll/ntmi-tutorials/blob/main/T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Guide

* Check the entire notebook before you get started; this gives you an idea of what lies ahead.
* Note that, as always, the notebook contains a condensed version of the theory. We recommend you read the theory part before the LC session.


## ILOs

After completing this lab you should be able to

* develop NGram LMs (classic and neural) in Python (and PyTorch)
* estimate parameters of LMs via MLE
* evaluate LMs intrinsically in terms of perplexity
* evaluate LMs statistically in terms of properties of generated text

## General Notes

* In this notebook you are expected to use $\LaTeX$. 
* Use python3.
* Use Torch
* To have GPU support run this notebook on Google Colab (you will find more instructions later).

We will use a set of standard libraries that are often used in machine learning projects. If you are running this notebook on Google Colab, all libraries should be pre-installed. If you are running this notebook locally you will need to install some additional packages; ask your TA for help if you have problems setting up.

If you need a short introduction to PyTorch [check this tutorial](https://github.com/probabll/ntmi-tutorials/blob/main/PyTorch.ipynb).


## Table of Contents

* Data
* Word segmentation and tokenisation
* Language models
* NGram language models
    * Unigram LM
    * Higher order LMs
    * Implementation
    * Experiment
* Evaluation
* Neural NGram LM

## Table of Graded Exercises

**Important.** The grader may re-run your notebook to investigate its correctness, but you should still upload your notebook with the cells already run, and make sure that all your answers are visible without the need to re-run the notebook. 

The weight of each exercise is indicated below.

* [Statistics from classic LMs](#classic) (30%)
* [Perplexity](#perplexity) (20%)
* [Neural NGram LMs](#neural) (50%)


## Setting up

In [None]:
import random
import numpy as np
np.random.seed(42)
random.seed(42)

In [None]:
!pip install nltk
!pip install sentencepiece
!pip install tabulate

# Data

In this tutorial we will develop models of text generation. So our data for this tutorial will be collections of sentences, or *corpora*. We will use corpora available in NLTK.

In [None]:
import nltk
nltk.download('treebank')
nltk.download('brown')
nltk.download('punkt')

In [None]:
def split_nltk_corpus(corpus, max_length=30):
    """
    Shuffle and split a corpus.
    corpus: list of sentences, each sentence is a list of tokens, each token is a string.
    max_length: discard sentences longer than this

    Return: training sentences, dev sentences, test sentences
        in each corpus a sentence is now a string where tokens are space separated
    """
    sentences = corpus.sents()
    # do not change the seed in here    
    order = np.random.RandomState(42).permutation(np.arange(len(sentences)))    
    shuffled = [' '.join(sentences[i]) for i in order if len(sentences[i]) <= max_length]    
    return shuffled[2000:], shuffled[1000:2000], shuffled[:1000]

In [None]:
from nltk.corpus import brown

training, dev, test = split_nltk_corpus(brown)

In [None]:
print(f"Number of sentences: training={len(training)} dev={len(dev)} test={len(test)}")

In [None]:
print("# A few training sentences\n\n")
for x in training[:10]:
    print(x)

# Word segmentation and tokenisation

Our models of text generation are probability distributions over finite-length sequences of discrete symbols. These discrete symbols are what we call *tokens*. The **vocabulary** of the model is therefore the finite set of known tokens which it can use to make sequences. 

We are interested in a special type of sequence, namely, sentences. Sentences are typically made of linguistic units that we call words. The linguistic notion of *word* is much too complex for our models. In practice, we use a data-driven and computationally convenient notion instead.

In this tutorial we will work with tokens that are subword units obtained via a compression algorithm known as *byte pair encoding* (BPE). You can optionally check the [original paper](http://www.aclweb.org/anthology/P16-1162) for more detail. 

We will use a package called `sentencepiece` that implements an efficient BPE tokeniser (and word segmenter) for us. This tokeniser is language independent: it learns a vocabulary of subword units from a corpus of sentences (without the need for any prior tokenisation). Because it is based on a compression algorithm, we can choose the level of compression, that is, the number of tokens that we want to have in the vocabulary; the BPE algorithm then finds the collection of tokens that best describes the corpus within the given budget. 

Once trained, the BPE model can tokenise and detokenise sentences for us deterministically. 
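To make the idea behind BPE concrete, here is a minimal sketch of the merge-learning loop on a toy corpus. It is only an illustration and not how `sentencepiece` is implemented: the helper names (`count_pairs`, `merge_pair`, `learn_bpe`) and the toy corpus are made up for this example, and we assume words are pre-split on whitespace.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of whitespace-separated strings."""
    word_freqs = Counter(w for sentence in corpus for w in sentence.split())
    # every word starts out as a sequence of single characters
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (ties: first seen)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


toy_corpus = ["low low low lower lowest", "new newer newest"]
merges, segmented = learn_bpe(toy_corpus, num_merges=3)
print(merges)
print(segmented)
```

Each merge adds one new symbol to the vocabulary, so the number of merges plays the role of the vocabulary budget discussed above.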

In [None]:
import sentencepiece as spm
import io

This helper function trains a BPE model for us:

In [None]:
def fit_vocabulary(corpus, vocab_size):
    """
    Return a BPE model as implemented by the sentencepiece package.

    corpus: an iterable of sentences, each sentence is a python string
    """
    proto = io.BytesIO()

    spm.SentencePieceTrainer.train(
        sentence_iterator=iter(corpus), 
        model_writer=proto, 
        vocab_size=vocab_size,
        pad_id=0,
        bos_id=1,
        eos_id=2,
        unk_id=3,
    )

    return spm.SentencePieceProcessor(model_proto=proto.getvalue())

In [None]:
tokenizer = fit_vocabulary(training, vocab_size=1000)

Note how we control the vocabulary size:

In [None]:
tokenizer.vocab_size()

And see how we can use this object to tokenize and detokenize text:

In [None]:
example_str = training[0]
print(example_str)

The function `encode` can be used to tokenize a string into a list of tokens (each a string). To be able to read the output, you need to use the argument `out_type=str`, otherwise the tokenizer will convert the tokens to numerical codes.

In [None]:
tokenizer.encode(example_str, out_type=str)

The `decode` method can map the tokens back to original form:

In [None]:
tokenizer.decode(tokenizer.encode(example_str, out_type=str))

Without `out_type=str` we get a sequence of codes:

In [None]:
tokenizer.encode(example_str)

And, of course, `decode` can map it back to text:

In [None]:
tokenizer.decode(tokenizer.encode(example_str))

There are two main advantages of this strategy for tokenization:

1. we control the vocabulary size (which helps us control memory usage)
2. oftentimes unseen words are made of a combination of existing subword units, so we can deal with more words than before

In [None]:
tokenizer.encode("This is a tutorial within Natuurlijke Talmodellen en Interfaces at the UvA.", out_type=str)

You can also preprocess whole batches of sentences:

In [None]:
tokenizer.encode(["This is a sentence.",  "And this is another"], out_type=str)

In [None]:
tokenizer.decode(tokenizer.encode(["This is a sentence.",  "And this is another"], out_type=str))

**Ungraded exercise.** Play a bit with the tokenizer object.

# Language Models
    
A language model (LM) is a **probability distribution over text**, where text is a finite sequence of words. 
    
LMs can be used to generate text as well as to quantify a degree of naturalness (or rather, a degree of resemblance to training data) which is useful for example to compare and rank alternative pieces of text in terms of fluency.
    
To design an LM, we need to talk about units of text (e.g., documents, paragraphs, sentences) as outcomes of a random experiment, and for that we need random variables.
    
We will generally refer to the unit of text as a *sentence*, but that's just for convenience, you could design an LM over documents and very little, if anything, would change.  

A **random sentence** is a finite sequence of symbols from the vocabulary of a given language. As a running example, we will refer to this language as English. The vocabulary of English will be denoted by $\mathcal W$, a finite collection of unique symbols, each of which we refer to as a *word* (but in practice, these symbols are any unit we care to model). We will denote a random sentence by $X$, or more explicitly, by the random sequence $X = \langle W_1, \ldots, W_L \rangle$. Here $L$ indicates the sequence length. Each word in the sequence is a random variable that takes on values in $\mathcal W$. We will adopt an important convention, every sentence is a finite sequence that ends with a special symbol, the end-of-sequence (EOS) symbol. 

Formally, random sentences take on values in the set $\mathcal W^*$ of all strings made of symbols in $\mathcal W$, which is a set that does include an infinite number of valid English sentences (possibly not all English sentences, as our vocabulary may not be complete enough, but hopefully this space is still large enough for the LM to be useful) as well as an infinite number of sequences that are not valid English sentences.       
    
Part of the deal with a language model is to define and estimate a probability distribution that expresses a preference for sentences that are more likely to be accepted as English sentences. In practice an LM will prefer sentences that reproduce statistics of the observations used to estimate its parameters, whether these sentences will resemble English sentences or not will depend on how expressive the LM is, that is, whether or not the LM can capture patterns as complex as those arising from well-formed English (or whatever variant/register of English was observed during training).
    
**Notation guide** Some textbooks or papers use $W_1^L$ instead of $W_{1:L}$ for ordered sequences, both are clear enough, but we will use the notation adopted by the textbook, that is, $W_{1:L}$. The textbook uses $W_1\cdots W_L$ (without commas) as another notation for ordered sequences, but we prefer to explicitly mark the sequence with angle brackets to avoid ambiguities, i.e., we prefer $\langle W_1, \ldots, W_L \rangle$. For assignments, we will use the lowercase version of the letter that names the random variable: $w_{1:l} = \langle w_1, \ldots, w_l \rangle$. 


The following notational shortcuts are rather convenient:

* we will often use $W_{1:L}$ for a random sentence instead of the longer form $\langle W_1, \ldots, W_L \rangle$, and similarly for outcomes (i.e., $w_{1:l}$ instead of $\langle w_1, \ldots, w_l\rangle$), but in the long form we shall never drop the angle brackets, as otherwise it's hard to tell that we mean an ordered sequence
* we will use $W_{<i}$ (or $w_{<i}$ for an outcome) to denote the sequence of tokens that precedes the $i$th token, this sequence is empty $W_{<i} \triangleq \langle \rangle$ for $i \le 1$, for $i>1$ the sequence is defined as $W_{<i} \triangleq \langle W_1, \ldots, W_{i-1}\rangle$
* sometimes it will be useful to find a more compact notation for $W_{<i}$, in those cases we refer to it as a *random history* and denote it by $H$.



LMs can be described by a **generative story**, that is, a stochastic procedure that explains how an outcome $w_{1:l}$ is drawn from the model distribution. Though we may find inspiration in how we believe the data were generated, the generative story is not a faithful representation of any linguistic process; it is nothing but an abstraction that codes our own assumptions about the problem. 

The most general form this generative story can take, that is, the form with the least amount of assumptions, looks as follows:

1. For each position $i$ of a sequence, condition on the history $h_i$ and draw the $i$th word $w_i$ with probability $P(W=w_i|H=h_i)$. 
2. Append $w_i$ to the end of the history: $h_{i+1} = h_i \circ \langle w_i \rangle$
3. Stop generating if $w_i$ is the EOS token, else repeat from (1).

We say this procedure is very general because it is essentially just chain rule spelled out in English words, though here the order of enumeration is determined by the left-to-right order of tokens in an English sentence.
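The generative story can be sketched in a few lines of Python. This is only an illustration under assumptions: `toy_distribution` is a made-up stand-in for $P_{W|H}$ (it makes EOS more likely as the history grows), and `sample_sentence` and the `</s>` symbol are names chosen for this example.

```python
import random


def sample_sentence(next_word_distribution, rng, eos="</s>", max_length=50):
    """
    Follow the generative story: draw a word given the history,
    append it to the history, and stop once EOS is drawn.
    """
    history = []
    for _ in range(max_length):
        dist = next_word_distribution(tuple(history))
        words = list(dist.keys())
        probs = list(dist.values())
        word = rng.choices(words, weights=probs, k=1)[0]
        history.append(word)
        if word == eos:
            break
    return history


def toy_distribution(history):
    """A hypothetical P(W | H=h): longer histories make EOS more likely."""
    p_eos = min(1.0, 0.2 * len(history))
    rest = (1.0 - p_eos) / 2
    return {"a": rest, "b": rest, "</s>": p_eos}


rng = random.Random(0)
sentence = sample_sentence(toy_distribution, rng)
print(sentence)
```

Note that nothing in `sample_sentence` restricts how much of the history the distribution looks at; that is exactly the sense in which the procedure makes no independence assumptions.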


Here is an example for a sequence of length $l=3$:

$P_X(\langle w_1, w_2, w_3 \rangle) = P_{W|H}(w_1|\langle \rangle) P_{W|H}(w_2|\langle w_1 \rangle) P_{W|H}(w_3 |\langle w_1, w_2 \rangle)$

For our example sentence *He went to the store* this means:

\begin{align}
P_X(\langle \text{He, went, to, the, store, EOS} \rangle) &= P_{W|H}(\text{He}|\langle \rangle) \\
    &\times P_{W|H}(\text{went}|\langle \text{He} \rangle) \\
    &\times P_{W|H}(\text{to}|\langle \text{He}, \text{went} \rangle) \\
    &\times P_{W|H}(\text{the}|\langle \text{He},  \text{went}, \text{to} \rangle) \\
    &\times P_{W|H}(\text{store}|\langle \text{He},  \text{went}, \text{to}, \text{the} \rangle) \\
    &\times P_{W|H}(\text{EOS}|\langle \text{He},  \text{went}, \text{to}, \text{the}, \text{store} \rangle) 
\end{align}

* where with some abuse of notation we use the words themselves as outcomes instead of their corresponding indices. 




**Exercise with solution**  Write down the general rule for the probability $P_X$ of a sentence $w_{1:l}$. Don't forget to indicate the precise random variable associated with every distribution (that is, for example, $P_X(w_{1:l})$ and $P(X=w_{1:l})$ are correct while $P(w_{1:l})$ is incomplete). 

<details>
    <summary><b>SOLUTION</b></summary>

$P_X(w_{1:l}) = \prod_{i=1}^{l}P_{W|H}(w_i|w_{<i})$

where $l$ is the length of the sentence.
    
</details>
    
---

The LM above is just an abstraction, not a concrete implementation, think of it as a general template for building models. 

A concrete model design needs to specify the conditional probability distributions in the model, this is known as the **parameterisation** of the model, and an algorithm for parameter estimation. 
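One way to picture the split between template and parameterisation is a function that implements the chain rule once and accepts any conditional distribution as an argument. The sketch below assumes a hypothetical interface `cond_prob(word, history)`; the uniform distribution plugged in here is only a placeholder, not a model we would actually use.

```python
import math


def sentence_log_prob(sentence, cond_prob, eos="</s>"):
    """
    log P_X(w_{1:l}) = sum_i log P_{W|H}(w_i | w_{<i}), with EOS appended if missing.
    cond_prob(word, history) must return P(W=word | H=history).
    """
    words = list(sentence)
    if not words or words[-1] != eos:
        words.append(eos)
    total = 0.0
    for i, word in enumerate(words):
        history = tuple(words[:i])  # everything that precedes position i
        total += math.log(cond_prob(word, history))
    return total


def uniform_cond_prob(word, history, vocab=("He", "went", "to", "the", "store", "</s>")):
    """A hypothetical parameterisation: uniform over 6 symbols, ignoring the history."""
    return 1.0 / len(vocab)


lp = sentence_log_prob("He went to the store".split(), uniform_cond_prob)
print(lp, 6 * math.log(1 / 6))
```

We work with log probabilities because a product of many small factors quickly underflows in floating point.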

# NGram Language Models

One way to achieve a tractable parameterisation of a language model is to make a conditional independence assumption, which simplifies the factors in the chain rule.

## Unigram LM

From week 1 of the course, we actually already know the *unigram LM*, which is the simplest language model: the idea is to forget the history completely, therefore making the strong assumption that words are drawn from the same distribution independently of their preceding context (i.e., $W_i \perp W_{<i}$):

\begin{equation}
P_X(w_{1:l}) \overset{\text{ind.}}{\triangleq}  \prod_{i=1}^l P_W(w_i)
\end{equation}

For the parameterisation, we let $W$ follow a Categorical distribution with parameter $\pi_{1:V} \in \Delta_{V-1}$. 

Then the probability mass function of the unigram LM assigns probability 

\begin{align}
f_X(w_{1:l}; \theta) &\triangleq \prod_{i=1}^l \text{Cat}(w_i|\pi_{1:V}) \\
&=\prod_{i=1}^l \pi_{w_i}
\end{align}

where $\theta = \{\pi_{1}, \ldots, \pi_V\}$ are the trainable parameters of the language model, and the Categorical pmf is $\text{Cat}(c|\pi_{1:V}) = \pi_c$.

From now on we will often have multiple pmfs in the same expression, that's because our models will have multiple components. To remind you of which pmf we are talking about, we will subscript the letter $f$ with the random variable with which the pmf is associated. The stuff that appears after the semi-colon is the parameter upon which the pmf depends parametrically.

***Example***

Consider the sentence *He went to the store*, its probability under the unigram LM is

$P_X(\langle \text{He, went, to, the, store, EOS} \rangle) = P_W(\text{He}) \times P_W(\text{went}) \times P_W(\text{to}) \times P_W(\text{the}) \times P_W(\text{store}) \times P_W(\text{EOS})$

which, as a function of the parameters of the model, evaluates to

$P_X(\langle \text{He, went, to, the, store, EOS} \rangle) = \pi_{\text{He}} \times \pi_{\text{went}} \times \pi_{\text{to}} \times \pi_{\text{the}} \times \pi_{\text{store}} \times \pi_{\text{EOS}}$

where again we use the words instead of their indices.
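As a small numerical illustration, we can evaluate this product for a made-up parameter vector. The values in `pi` below are hypothetical (chosen only so that they sum to 1), and `unigram_log_prob` is a name introduced for this sketch.

```python
import math

# Hypothetical unigram parameters (they must sum to 1)
pi = {"He": 0.1, "went": 0.1, "to": 0.2, "the": 0.3, "store": 0.1, "</s>": 0.2}
assert abs(sum(pi.values()) - 1.0) < 1e-12


def unigram_log_prob(sentence, pi, eos="</s>"):
    """log f_X(w_{1:l}; theta) = sum_i log pi_{w_i}, with EOS appended if missing."""
    words = list(sentence)
    if not words or words[-1] != eos:
        words.append(eos)
    return sum(math.log(pi[w]) for w in words)


log_p = unigram_log_prob("He went to the store".split(), pi)
print(math.exp(log_p))  # 0.1 * 0.1 * 0.2 * 0.3 * 0.1 * 0.2, up to rounding
```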

---

We now have a concrete instance of a language model:
* a set of independence assumptions (the unigram assumption)
* a parameterisation (we will use a single Categorical distribution)

If we are given the parameters $\pi_{1:V}$, we will be able to sample $X$ from this model, and we will be able to assess the probability the model assigns to any given sentence in its sample space. 

So, we just need to find a way to choose $\pi_{1:V}$. We could pick any vector in the probability simplex. Clearly, some probability vectors are better than others though, so we had better find a procedure that enjoys some theoretical support.

We can turn to basic statistics for help, in particular, we can turn to maximum likelihood estimation.

### Parameter estimation

Suppose we are given a corpus $\mathcal D$ containing $N$ sentences

* each sentence is of the form $w_{1:l_n}^{(n)}$ for $n=1, \ldots, N$
* where $l_n$ is the length of the $n$th sentence

The MLE solution for the unigram LM is based on gathering counts and computing the relative frequency of word types:

\begin{equation}
\pi_w = \frac{\mathrm{count}_W(w)}{\sum_{o \in \mathcal W}\mathrm{count}_W(o)} 
\end{equation}

And we may, for example, employ a smoothing technique such as Laplace smoothing.
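The relative-frequency estimate, with optional Laplace smoothing, can be sketched as follows. This is a simplified illustration: `fit_unigram` is a name made up for this example, it counts one EOS per sentence, and, unlike a full implementation, it reserves no probability for unknown words.

```python
from collections import Counter


def fit_unigram(corpus, alpha=0.0, eos="</s>"):
    """
    Relative-frequency estimate of pi from tokenised sentences.
    With alpha > 0, alpha pseudo-counts are added to every word type (Laplace smoothing).
    """
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence)
        counts[eos] += 1  # every sentence ends with EOS
    total = sum(counts.values()) + alpha * len(counts)
    return {w: (c + alpha) / total for w, c in counts.items()}


toy_corpus = ["a a a".split(), "b b b b b b".split(), "c".split()]
pi_mle = fit_unigram(toy_corpus)
pi_smooth = fit_unigram(toy_corpus, alpha=1.0)
print(pi_mle)
print(pi_smooth)
```

With $\alpha=1$ the rare word `c` moves from $1/13$ to $2/17$: smoothing shifts probability mass from frequent to infrequent outcomes.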



## Higher order LMs

    
A unigram LM makes some rather unreasonable assumptions. Clearly, words in a sentence do depend on one another.

**Exercise with solution** Can the unigram LM assign different probabilities to `what a nice day` and `day what nice a`?

<details>
    <summary><b>SOLUTION</b></summary>
    
    Not really. The two sentences contain the exact same tokens and they occur the exact same number of times, so the unigram LM multiplies the exact same factors in both cases.
    
    
</details>

---
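We can also verify this numerically: under a unigram LM, every reordering of the same words receives the same probability. The parameter values below are hypothetical and only serve the check.

```python
import math
from itertools import permutations

# Hypothetical unigram parameters over a tiny vocabulary
pi = {"what": 0.2, "a": 0.3, "nice": 0.1, "day": 0.1, "</s>": 0.3}


def unigram_log_prob(words, pi, eos="</s>"):
    """Sum of log pi_w over the words plus the final EOS."""
    return sum(math.log(pi[w]) for w in list(words) + [eos])


reference = unigram_log_prob("what a nice day".split(), pi)
# compare all 24 orderings of the four words against the original order
all_equal = all(
    math.isclose(unigram_log_prob(p, pi), reference)
    for p in permutations("what a nice day".split())
)
print(all_equal)
```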

If we want to capture dependencies in English sentences, we had better go back to the general chain rule
    
\begin{equation}
P_X(w_{1:l}) = \prod_{i=1}^{l}P_{W|H}(w_i|w_{<i})
\end{equation}

and ask ourselves, why did we make such a strong independence assumption in the unigram LM and can we avoid it?

The deal is that with our current way of parameterising a Categorical distribution, we need to store  a $V$-dimensional parameter vector for every unique history in the training data. This type of parameterisation, where we have one parameter (a probability value) per outcome per context, is known as **tabular** (you can imagine storing the parameters in a table, each row is a context, each column is an outcome).
The more words we allow in the context, the more parameters we will have to estimate, and this tabular representation grows very quickly. Not only does this cost a lot of parameters, but most histories will appear very few times, and a very long history will probably appear only once. Thus we will not be able to gather enough data to estimate the parameters of the LM reliably. 
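To get a feeling for how quickly the table grows, the sketch below counts the entries of a full table for a vocabulary of 1000 tokens. It counts all possible histories, so it is an upper bound on what we would store in practice (where only histories observed in training get a row); the function name is made up for this example.

```python
def num_tabular_parameters(vocab_size, ngram_size):
    """
    Number of probability values in a full table:
    one V-dimensional row for each of the V^(n-1) possible histories.
    """
    num_histories = vocab_size ** (ngram_size - 1)
    return num_histories * vocab_size


for n in (1, 2, 3, 4):
    print(n, num_tabular_parameters(1000, n))
```

Every extra word of context multiplies the table size by $V$, while the amount of training data stays the same.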

We have at least two ways around this problem: we can revisit the model from the point of view of the (conditional) independencies we make, or we can change the way we parameterise Categorical distributions to make them more parameter efficient. The combination of the two strategies lies at the core of much of the progress in modern NLP, so we will look into both.

The core of the problem is the data sparsity due to long histories, so let's shorten the history. But unlike the unigram LM, we won't discard the history entirely; we will just forget some of it instead. 

The $n$-gram LM assumes that, given the $n-1$ immediately preceding words, $W_i$ is independent of all earlier words. It uses Categorical distributions to model the distribution of $W | H=w_{i-n+1:i-1}$ in context. 


In its standard formulation, the $n$-gram LM uses tabular cpds, that is, it stores one Categorical distribution per unique history: $W | H=h \sim \text{Cat}(\pi_{1:V}^{(h)})$ with $\pi_{1:V}^{(h)} \in \Delta_{V-1}$ for each history $h \in \mathcal W^{n-1}$ (a sequence of $n-1$ tokens).

The joint probability the $n$-gram LM assigns to a sentence $w_{1:l}$ is

\begin{align}
P_X(w_{1:l}) &\overset{cond.ind.}{\triangleq} \prod_{i=1}^l P_{W|H}(w_i|\underbrace{w_{i-n+1:i-1}}_{h_i})
\end{align}

and therefore the pmf of the $n$-gram LM is:
\begin{align}
    f_X(w_{1:l}; \theta)&= \prod_{i=1}^l \text{Cat}(w_i|\pi_{1:V}^{(h_i)}) \\
    &=\prod_{i=1}^l \pi^{(h_i)}_{w_i}
\end{align}
where $\theta = \{\pi_{1:V}^{(h)} \text{ for every }h \in \mathcal W^{n-1} \}$ are the parameters of the model.


The $n$-gram LM is what we call a **Markov model** of order $o=n-1$, the order indicates the length of the shortened history we condition on when drawing $W_i$.
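The shortened histories are easy to extract in code. The sketch below pads the sentence with $n-1$ begin-of-sentence symbols so that every history has the same length; that padding, the symbols `<s>` and `</s>`, and the function name are assumptions made for this illustration.

```python
def history_word_pairs(sentence, ngram_size, bos="<s>", eos="</s>"):
    """
    Yield (history, word) pairs for an n-gram LM, where each history holds
    the n-1 tokens immediately before the word.
    """
    order = ngram_size - 1
    padded = [bos] * order + list(sentence) + [eos]
    for i in range(order, len(padded)):
        yield tuple(padded[i - order:i]), padded[i]


pairs = list(history_word_pairs("He went to the store".split(), ngram_size=3))
for h, w in pairs:
    print(h, "->", w)
```

Each pair corresponds to one factor $P_{W|H}(w_i|h_i)$ in the product above; with `ngram_size=1` every history is the empty tuple and we recover the unigram LM.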






**Exercise with solution**  Write down the probability of the sentence 

    He went to the store
    
under a bigram language model. 

<details>
    <summary><b>SOLUTION</b></summary>

$P_X(\langle \text{He, went, to, the, store, EOS} \rangle) = P_{W|H}(\text{He}|\langle \rangle) \times P_{W|H}(\text{went}|\langle \text{He} \rangle) \times P_{W|H}(\text{to}| \langle \text{went} \rangle) \times P_{W|H}(\text{the}| \langle \text{to} \rangle) \times P_{W|H}(\text{store}|\langle \text{the} \rangle) \times P_{W|H}(\text{EOS}|\langle \text{store} \rangle)$

Tip: recall that *the* is a word while $\langle \text{the} \rangle$ is a sequence, and recall that though sentences in corpora don't usually come decorated with an EOS, our LMs treat them as if they did.
    
Trick: it's quite handy to imagine that we had $n-1$ occurrences of a begin-of-sentence (BOS) symbol prior to the first token in the sentence, this gives us a fixed-size history across time steps: the first step would look like $P_{W|H}(\text{He}|\langle \text{BOS} \rangle)$.

</details>

---

### Parameter estimation

The Laplace-smoothed MLE solution for tabular cpds is

\begin{equation}
\pi_w^{(h)} = \frac{\mathrm{count}_{HW}(h \circ \langle w \rangle) + \alpha}{\mathrm{count}_H(h) + \alpha V}
\end{equation}

where  $h \circ \langle w \rangle$ is the concatenation of history and word. 
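As a worked example of this formula, the sketch below fits a Laplace-smoothed bigram LM on a two-sentence toy corpus. It is a simplified illustration with made-up names (`fit_bigram`, the toy corpus); it uses sparse counters instead of count vectors and leaves out UNK.

```python
from collections import Counter, defaultdict


def fit_bigram(corpus, alpha=1.0, bos="<s>", eos="</s>"):
    """Laplace-smoothed estimates pi_w^(h) for every history h seen in the corpus."""
    joint = defaultdict(Counter)  # joint[h][w] = count of h followed by w
    vocab = {eos}
    for sentence in corpus:
        vocab.update(sentence)
        padded = [bos] + list(sentence) + [eos]
        for h, w in zip(padded, padded[1:]):
            joint[h][w] += 1
    V = len(vocab)
    cpds = {}
    for h, counts in joint.items():
        denom = sum(counts.values()) + alpha * V  # count_H(h) + alpha * V
        cpds[h] = {w: (counts[w] + alpha) / denom for w in vocab}
    return cpds


toy_corpus = ["a b".split(), "a a b".split()]
cpds = fit_bigram(toy_corpus, alpha=1.0)
print(cpds["a"])
```

Here the history `a` occurs 3 times (followed by `b` twice and by `a` once) and $V=3$, so $\pi_b^{(a)} = (2+1)/(3+3) = 0.5$, while the unseen continuation EOS still receives $(0+1)/(3+3) = 1/6$.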

Often this is not enough for smoothing higher order models, the reason being that not only will certain words be unseen, but very often entire histories will be unseen. For example, we may have seen "a nice dog" as a history, but not "a strange dog". There are many techniques to overcome missing histories, the simplest of which is the *backoff* technique, whereby upon failing to encounter "a strange dog", we try "UNK strange dog"; failing to encounter it again, we try "UNK UNK dog", and if we also fail to encounter that, we use "UNK UNK UNK", at which point, at the very least, the corresponding cpd will be uniform over the vocabulary. You will see this technique being used in the implementation below.
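The order in which backoff tries histories can be sketched as follows. This is an illustration only: the function names and the tiny `table` are hypothetical, and distributions are represented as plain lists over a vocabulary of size 2.

```python
def backoff_histories(history, unk="<unk>"):
    """
    List the histories to try, in order: the full history first, then versions
    where words are replaced by UNK one at a time from left to right.
    """
    candidates = [tuple(history)]
    masked = list(history)
    for i in range(len(masked)):
        masked[i] = unk
        candidates.append(tuple(masked))
    return candidates


def lookup_with_backoff(table, history, vocab_size, unk="<unk>"):
    """Return the first stored distribution among the candidates, else a uniform one."""
    for h in backoff_histories(history, unk):
        if h in table:
            return table[h]
    return [1.0 / vocab_size] * vocab_size


# A hypothetical table with V=2 outcomes; only a partially masked history is stored
table = {("<unk>", "strange", "dog"): [0.9, 0.1]}
print(backoff_histories(("a", "strange", "dog")))
print(lookup_with_backoff(table, ("a", "strange", "dog"), vocab_size=2))
print(lookup_with_backoff(table, ("a", "happy", "cat"), vocab_size=2))
```

For backoff to find anything, the masked histories must have been given counts at training time as well.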
    
**Notation guideline** When we use a superscript like $\pi_{1:V}^{(h)}$ we mean the probability vector selected by the history, or the probability vector that is specific to the cpd that conditions on the given history $h$. For example, in a bigram LM, we have one $V$-dimensional parameter vector for each one of the $V$ cpds in the model. That is, there is one cpd per word in our vocabulary, because each time we condition on a history $\langle w \rangle$, we get a different cpd, and each of these cpds is a discrete distribution over the entire vocabulary. 
    

## Implementation

Here we provide a complete implementation of an NGram LM. It is kept simple for didactic purposes and is, therefore, not very efficient. You will only be able to play with models of small order (we recommend not going beyond a trigram LM).

**Ungraded exercise.** Study the implementation of the NGramLM. 

In [None]:
from collections import defaultdict, Counter
from itertools import tee
import numpy as np


class NGramLM:
    """
    This is an n-gram LM with generalised Laplace smoothing.    
    """
    
    def __init__(self, ngram_size: int, alpha=0.0, backoff=False, BOS="<s>", EOS="</s>", UNK="<unk>", seed=None):  
        """
        ngram_size: the n in n-gram (1 for a unigram LM, 2 for a bigram LM, 3 for a trigram LM)
        alpha: Laplace smoothing coefficient
        backoff: if True, for every known h, we also store counts for shortened versions of h
         where, from left-to-right, we 'forget' words in h (by setting them to UNK)
         
         Example: in a trigram LM, if 'a cute dog' precedes the next word 'barks',
            the history is ['cute', 'dog'] and we gather counts for:
            * ['cute', 'dog'] , 'barks'
            * ['UNK', 'dog'] , 'barks'
            * ['UNK', 'UNK'] , 'barks'

        BOS: symbol to be used internally as BOS
        EOS: symbol to be used as EOS (the EOS symbol is within the vocab, and it is used to stop generation)
        UNK: symbol to be used as UNK (used for Laplace smoothing and backoff and whenever new words come in at test time)
        seed: seed for numpy random state (used in the sampling algorithm)
        """
        if ngram_size < 1:
            raise ValueError(f"ngram_size must be at least 1, got {ngram_size}")
        if ngram_size > 3:
            raise ValueError("This implementation is not efficient enough, you will probably run out of memory if you try going beyond trigrams")
        self._order = ngram_size - 1
        self._BOS = BOS
        self._EOS = EOS
        self._UNK = UNK
        self._alpha = alpha   
        self._backoff = backoff
        
        # Used to store cpds
        self._cpds = None
        # Used to store the vocabulary
        self._word2int = {
            EOS: 0,
            UNK: 1,
        }
        self._EOS_ID = 0
        self._UNK_ID = 1
        # Used to map word ids back to words
        self._words = [EOS, UNK]
        self._rng = np.random if seed is None else np.random.RandomState(seed)
        
    @property
    def order(self):
        """The ngram_size minus 1 (the length of the history)"""
        return self._order
    
    @property
    def vocab_size(self):
        """Number of known words"""
        return len(self._words)

    def num_parameters(self):
        """Count the number of parameters in the model"""
        if self._cpds is None:
            return 0
        return len(self._cpds) * self.vocab_size

    def _get_cpd(self, history):
        """
        This function returns the categorical parameter associated with a certain history.

        :param history: a sequence of words (a tuple)
        """
        if len(history) != self._order:
            raise ValueError(f"A history should have length {self._order}")
        # lookup the cpd for this history
        cpd = self._cpds.get(history, None)        
        if cpd is None:  # if the history is unknown 
            if self._backoff: # we either try to backoff to a shorter history
                # we backoff one word at a time from left-to-right
                unk_history = list(history)
                for i in range(len(history)):
                    unk_history[i] = self._UNK # forget another word 
                    cpd = self._cpds.get(tuple(unk_history), None) # try finding a cpd for the remaining history
                    if cpd is not None: # if we find, return it
                        return cpd
            # or return a uniform distribution
            return np.ones(self.vocab_size) / self.vocab_size
        else:
            # if we have a cpd, we return it
            return cpd
                
    def _get_parameter(self, history, word):
        """
        This function returns the categorical parameter associated with a certain word given a certain history.

        :param history: a sequence of words (a tuple)
        :param word: a word (a str)
        :return: a float representing P(word|history)
            this function takes care of backoff if needed
            and it takes care of unknown words
        """
        cpd = self._get_cpd(history)        
        return cpd[self._word2int.get(word, self._UNK_ID)]            
        
    def fit(self, data_stream):
        """
        Fit the parameters of the NGramLM via MLE, possibly using Laplace smoothing and backoff.
        data_stream: an iterable of sentences, each sentence is a list of tokens, each token is a string
        """
        # we will iterate through data_stream twice, once for making a vocabulary
        # once for gathering counts        
        stream1, stream2 = tee(data_stream, 2)

        # Make the vocabulary of known words
        for sentence in stream1:
            if type(sentence) not in [list, tuple]:
                raise ValueError("Did you forget to tokenize your sentences")
            for word in sentence:
                wid = self._word2int.get(word, None)
                if wid is None: # if we haven't seen this word
                    # it must be new, so we get a new id for it
                    wid = len(self._words)
                    # and store it
                    self._word2int[word] = wid
                    self._words.append(word)

        V = len(self._words)
        
        # Make the dictionary of counts        
        #  for each known history h we will store a V-dimensional count vector
        #  (we could use a sparse data structure, but this implementation is meant
        #   to be didactic, rather than efficient)
        joint_counts = defaultdict(lambda: np.zeros(V))
        for sentence in stream2:            
            sentence = [self._BOS] * self._order + list(sentence) # pad with BOS tokens (list() so that tuples work too)
            if len(sentence) == 0 or sentence[-1] != self._EOS: # pad with EOS token
                sentence.append(self._EOS)
            l = len(sentence)
            for i in range(self.order, l):  # for each content word (all after BOS)
                # we have to make it a tuple otherwise we cannot use it as a key in a python dict
                history = tuple(sentence[i - self._order: i])
                # get the id of the next word
                word = sentence[i]
                word_id = self._word2int[word]
                # count the joint occurrence
                joint_counts[history][word_id] += 1
                # reserve some events for the future                
                if self._backoff:
                    unk_history = list(history)
                    # erase the history from left to right creating unk histories
                    # this is a way to further smooth the model
                    for j in range(self._order):
                        unk_history[j] = self._UNK
                        joint_counts[tuple(unk_history)][word_id] += 1
        
        # Make probabilities
        cpds = dict()
        for h, counts in joint_counts.items():
            counts += self._alpha  # Laplace smoothing
            cpds[h] = counts / np.sum(counts)
        
        self._cpds = cpds
        
    def log_prob(self, sentence):
        """
        Compute the log probability of a sentence under this model. 
                
        input: 
            sentence: a list of tokens
        output:
            log probability
        """
        if type(sentence) not in [list, tuple]:
            raise ValueError("Did you forget to tokenize your sentences")
        if self._cpds is None:
            raise ValueError("Did you fit the model?")
        log_prob = 0.        
        sentence = [self._BOS] * self._order + list(sentence)  # pad sentence with BOS (list() so that tuples work too)
        if len(sentence) == 0 or sentence[-1] != self._EOS:
            sentence.append(self._EOS) # pad sentence with EOS
        for i in range(self._order, len(sentence)):  # skip the BOS padding, include the final EOS
            history = tuple(sentence[i - self._order: i])
            word = sentence[i]        
            ngram_probability = self._get_parameter(history, word)
            # accumulate the log prob
            log_prob += np.log(ngram_probability)    
        return log_prob 

    def _sample(self, rng, trim_EOS=False, max_length=1000):
        """
        Draw and return a single sample.

        rng: a np random state
        trim_EOS: whether we should trim the EOS symbol from the output
        max_length: used to interrupt the sampler in case EOS is not drawn
        """
        sentence = [self._BOS] * self._order
        V = self.vocab_size
        #uniform = np.ones(V) / V
        eos_id = self._word2int[self._EOS]
        for i in range(max_length):
            h = tuple(sentence[len(sentence)-self._order:])
            #cpd = self._cpds.get(h, uniform)
            cpd = self._get_cpd(h)
            k = rng.choice(V, p=cpd)
            sentence.append(self._words[k])
            if k == eos_id:
                break
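        # only trim when the last token really is EOS (it is not if max_length was reached first)
        if trim_EOS and sentence and sentence[-1] == self._EOS:
            return sentence[self._order:-1]
        return sentence[self._order:]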

    def sample(self, num_samples=None, trim_EOS=False, max_length=1000, seed=None):
        """
        num_samples: if None, returns a single sample (a list of tokens);
            if a positive integer, returns a list of num_samples samples,
            each of which is a list of tokens
        trim_EOS: whether we should trim the EOS symbol from the output
        max_length: used to interrupt the sampler in case EOS is not drawn
        seed: use it to control the np random state
        """
        if self._cpds is None:
            raise ValueError("Did you fit the model?")
        rng = self._rng if seed is None else np.random.RandomState(seed)
        if num_samples is None:
            return self._sample(rng, trim_EOS=trim_EOS, max_length=max_length)
        elif num_samples > 0:
            return [self._sample(rng, trim_EOS=trim_EOS, max_length=max_length) for _ in range(num_samples)]
        else:
            raise ValueError(f"num_samples should be a positive integer, got {num_samples}")
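
The smoothing step in `fit` adds `alpha` to every count before normalising. Here is a minimal, self-contained sketch of that idea, using plain Python lists rather than the NumPy arrays of the class. The function name `laplace_smooth` and the example counts are hypothetical, chosen only for illustration.

```python
def laplace_smooth(counts, alpha):
    """Add-alpha smoothing: add alpha to each count, then normalise to sum to 1."""
    smoothed = [c + alpha for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]


# Hypothetical counts for one history over a 4-word vocabulary;
# the last word was never observed after this history.
counts = [3, 6, 1, 0]
mle = laplace_smooth(counts, alpha=0.0)       # plain relative frequencies
smoothed = laplace_smooth(counts, alpha=1.0)  # add-one smoothing

print(mle)       # the unseen word gets probability 0
print(smoothed)  # the unseen word now gets a small non-zero probability
```

With `alpha=0`, a word never seen after a given history receives probability zero, so any sentence containing that pair gets a log-probability of minus infinity; a small positive `alpha` avoids this.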

Here's a toy demonstration:

In [None]:
unigram_lm = NGramLM(1, seed=42)  

We fit the model by giving it a collection of tokenized sentences for MLE:

In [None]:
from collections import Counter
from itertools import chain


toy_tokenized_corpus = [
    "a a a".split(),
    "b b b b b b".split(),
    "c".split()
]
for w, n in Counter(chain.from_iterable(toy_tokenized_corpus)).most_common():
    print(f"word={w} occurrences={n}")
print(f"Number of EOS: {len(toy_tokenized_corpus)}")
print(f"Average sentence length: {np.mean([len(x) for x in toy_tokenized_corpus])}") 

In [None]:
unigram_lm.fit(toy_tokenized_corpus)

We can use the model to draw samples as well as to assess the probability of an outcome:

In [None]:
xs = unigram_lm.sample(100, trim_EOS=True, seed=42)
for x in xs[:10]:
    print(f"{unigram_lm.log_prob(x):.2f} {' '.join(x)}")

Note that language models aim at reproducing statistics of the observed data, or at least, the statistics that are within their capacity (given the conditional independence assumptions in place).

A unigram LM can match observed word frequencies:

In [None]:
counter = Counter(chain.from_iterable(xs))
total = sum(counter.values())
for w, n in counter.most_common():
    print(w, n/total)

It should also be able to learn the average sequence length (though the length distribution as a whole is limited to a Geometric law, which oftentimes is not the observed pattern).

In [None]:
import matplotlib.pyplot as plt
print(f"Average sentence length from samples: {np.mean([len(x) for x in xs])}")
_ = plt.hist([len(x) for x in xs])
_ = plt.xlabel("Length of sample")
_ = plt.ylabel("Frequency")
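
To see where the Geometric law comes from: a unigram LM draws every token independently, so at each step it stops with the same probability $p = P(\text{EOS})$. The sketch below assumes unsmoothed relative-frequency estimates on the toy corpus (3 EOS symbols among 13 tokens); the class's actual smoothing settings may differ. It simulates this stopping process and compares the average length with the analytic mean $(1-p)/p$. The function name `sample_length` is hypothetical.

```python
import random


def sample_length(p_eos, rng, max_length=1000):
    """Number of tokens drawn before EOS when EOS has probability p_eos at every step."""
    length = 0
    while length < max_length and rng.random() >= p_eos:
        length += 1
    return length


p_eos = 3 / 13  # toy corpus: 3 EOS symbols out of 10 + 3 = 13 tokens
rng = random.Random(42)
lengths = [sample_length(p_eos, rng) for _ in range(20000)]

# mean of a Geometric law that counts failures before the first success
analytic_mean = (1 - p_eos) / p_eos
empirical_mean = sum(lengths) / len(lengths)
print(f"analytic mean: {analytic_mean:.3f}  empirical mean: {empirical_mean:.3f}")
```

Under these assumptions the analytic mean is 10/3, which is exactly the average sentence length of the toy corpus. The model can therefore match the mean length even though the shape of its length distribution is fixed.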

You can increase the NGram size and use some smoothing tricks:

In [None]:
trigram_lm = NGramLM(3, alpha=1e-3, backoff=True, seed=42)  
trigram_lm.fit(toy_tokenized_corpus)
xs3 = trigram_lm.sample(100, trim_EOS=True, seed=42)
for x in xs3[:10]:
    print(f"{trigram_lm.log_prob(x):.2f} {' '.join(x)}")

In [None]:
counter3 = Counter(chain.from_iterable(xs3))
total3 = sum(counter3.values())
for w, n in counter3.most_common():
    print(w, n/total3)

In [None]:
print(f"Average sentence length from samples: {np.mean([len(x) for x in xs3])}")
_ = plt.hist([len(x) for x in xs3])
_ = plt.xlabel("Length of sample")
_ = plt.ylabel("Frequency")

## Experiment 

In [None]:
brown_unigram_lm = NGramLM(1, alpha=1e-3, seed=42)  
brown_unigram_lm.fit(
    tokenizer.encode(training, out_type=str) # here we tokenize with sentencepiece in order to keep the vocabulary size manageable
)
brown_unigram_lm.num_parameters()

You will find made-up words: the unigram LM samples subword units independently, so units that should not be next to one another can end up adjacent, and the tokenizer merges them when decoding.

In [None]:
for x in brown_unigram_lm.sample(10, seed=42):
    print(f"{brown_unigram_lm.log_prob(x):.2f} {tokenizer.decode(x)}")

A bigram LM should do better, but at the cost of a big increase in model size:

In [None]:
brown_bigram_lm = NGramLM(2, alpha=1e-3, backoff=True, seed=42)  
brown_bigram_lm.fit(
    tokenizer.encode(training, out_type=str)
)
brown_bigram_lm.num_parameters()

In [None]:
for x in brown_bigram_lm.sample(10, seed=42):
    print(f"{brown_bigram_lm.log_prob(x):.2f} {tokenizer.decode(x)}")

A trigram LM should do even better, but the model is getting much too large. For a small vocabulary size (e.g., 1000) you will be able to use this class; for larger vocabulary sizes our implementation is not sufficiently efficient and you will run out of memory.

In [None]:
brown_trigram_lm = NGramLM(3, alpha=1e-3, backoff=True, seed=42)  
brown_trigram_lm.fit(
    tokenizer.encode(training, out_type=str)
)
brown_trigram_lm.num_parameters()

In [None]:
for x in brown_trigram_lm.sample(10, seed=42):
    print(f"{brown_trigram_lm.log_prob(x):.2f} {tokenizer.decode(x)}")
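
The growth in model size can be estimated with a back-of-the-envelope calculation. The sketch below counts the entries of a dense table with one $V$-dimensional distribution per possible history. This is an upper bound and an assumption made for illustration: `num_parameters()` of the class above may count differently (for example, only histories observed in training). The function name `dense_table_size` is hypothetical.

```python
def dense_table_size(vocab_size, ngram_size):
    """Entries in a dense n-gram table: V^(n-1) histories times V outcomes."""
    num_histories = vocab_size ** (ngram_size - 1)
    return num_histories * vocab_size


for n in [1, 2, 3]:
    print(f"{n}-gram, V=1000: {dense_table_size(1000, n):,} entries")
```

Each extra word of history multiplies the size of such a table by $V$.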

<a name="classic"> **Graded exercises - Statistics from classic NGram LMs**

For each model (unigram, bigram, and trigram LM):

* Sample from the LM and compare the length distribution (for BPE-tokenized sentences) of the LM and the dev set. Use as many samples as you have sentences in the dev set. While sampling use the seed 42.
* Using the same samples: first BPE-detokenize the samples, then tokenize them at the white space (with python `.split()`), check how often the LM generates tokens that did not exist in the original NLTK training corpus, and report the ratio of out-of-vocabulary (OOV) tokens. Compare the three LMs in terms of this ratio.
* Manually inspect a random subset of 30 of these new tokens and report how many are words that do not exist in English.
* Comment on what you observe and relate your observations to technical aspects of the model.

For hyperparameters use `alpha=1e-3, backoff=True` (these are not necessarily optimum, but we will not invest time searching for better ones).

In [None]:
# CONTRIBUTE YOUR SOLUTION

# Evaluation

There are a few ways to evaluate the performance of an LM.

Sometimes you can plug it into an application (e.g., an auto-complete system or a system that ranks sentences for fluency). In those cases we can test whether the downstream application improves as we modify the language model. This is called *extrinsic* evaluation.

To evaluate an LM independently from an application we need to evaluate its statistical properties in an attempt to determine how well the model fits the data, namely, how well it reproduces statistics of observed data. This is called *intrinsic* evaluation. In this course, we are going to focus on intrinsic evaluation of the LM.

We generally have access to 3 datasets: 

* Training is used for estimating $\theta$.
* Development is used to make choices during the design phase (choose hyperparameters such as smoothing technique, order of model, etc).
* Test is used for measuring the accuracy of the final model.
   
One indication of the model's fitness to the data is the value of the model likelihood given novel sentences (e.g., sentences held out from training). 
We assume this dataset $\mathcal T$ of novel sentences consists of $K$ independent sentences, each denoted $w_{1:l_k}^{(k)}$; then the model likelihood given $\mathcal T$ is the probability mass that the model assigns to $\mathcal T$: 

$\prod_{k=1}^K f_X(w_{1:l_k}^{(k)}; \theta)$

or in form of the log-probability:

$\sum_{k=1}^K \log f_{X}(w_{1:l_k}^{(k)}; \theta)$

Then define the log-likelihood as follows:

$\mathcal L_{\mathcal T}(\theta) = \sum_{k=1}^K \log f_X(w_{1:l_k}^{(k)}; \theta)$


Then the model that assigns the higher $\mathcal L$ given the test set is the one that better predicts future data, presumably because it found a better fit to the training data (a better compromise between memorisation and generalisation). 

In other words, given two probabilistic models, the one that assigns a higher probability to the test data is taken as intrinsically better. One detail we need to abstract away from is differences in factorisation of the models which may cause their likelihoods not to be comparable, but for that we will define *perplexity* below. 

The log-likelihood is used because the probability of a particular sentence according to the LM can be a very small number, and the product of many such numbers becomes smaller still, which causes numerical precision problems (underflow). 
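
A small sketch of the numerical issue, using made-up sentence probabilities (the numbers are hypothetical and not taken from any model in this notebook):

```python
import math

# 100 hypothetical sentences, each with probability 1e-5 under some model
sentence_probs = [1e-5] * 100

likelihood = math.prod(sentence_probs)                     # underflows to 0.0 in double precision
log_likelihood = sum(math.log(p) for p in sentence_probs)  # remains an ordinary finite number

print(likelihood)
print(log_likelihood)
```

The true product is $10^{-500}$, far below the smallest positive double, whereas the sum of logs is a perfectly representable negative number.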


**Perplexity** of a language model on a test set is the inverse probability of the test set, normalized
by the number of tokens (that is, the inverse of the geometric-mean probability per token). Perplexity is a notion of average branching factor, thus an LM with low perplexity can be thought of as a *less confused* LM. That is, each time it predicts a word given some history, it effectively picks from a reduced subset of the entire vocabulary (in other words, it is more certain of how to continue from the history). 

If a dataset contains $t$ tokens where $t = \sum_{k=1}^K l_k$, then the perplexity of the model given the test set is

\begin{equation}
\text{PP}_{\mathcal T}(\theta) = \exp\left( -\frac{1}{t} \mathcal L_{\mathcal T}(\theta) \right)
\end{equation}

The lower the perplexity, the better the model is. Comparisons in terms of perplexity are only fair if the models have the same vocabulary.
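
As a sanity check on the "branching factor" reading, a model that is uniform over $V$ outcomes at every step has perplexity exactly $V$. The sketch below computes perplexity from per-sentence log-probabilities. The helper name `perplexity_from_log_probs` and the toy numbers are hypothetical, and which tokens count towards $t$ (for instance, whether EOS is included) is a convention that the function simply takes as given.

```python
import math


def perplexity_from_log_probs(sentence_log_probs, num_tokens):
    """exp(-L/t), where L is the total log-likelihood and t the number of tokens."""
    return math.exp(-sum(sentence_log_probs) / num_tokens)


V = 50
lengths = [4, 7, 2]  # hypothetical numbers of predicted tokens per sentence
# under a uniform model every token has log-probability log(1/V)
uniform_log_probs = [l * math.log(1 / V) for l in lengths]
pp = perplexity_from_log_probs(uniform_log_probs, num_tokens=sum(lengths))
print(pp)
```

Up to floating-point error the result equals `V`: a uniform model is as "confused" as choosing among all $V$ words at each step.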

<a name="perplexity"> **Graded exercise - Perplexity**

* Implement the perplexity per token function and make sure it passes the two test cases
* Use it to compare the unigram LM, the bigram LM and the trigram LM for the Brown corpus. Report a table with the number of parameters and the perplexity in the dev and in the test set. For hyperparameters use `alpha=1e-3, backoff=True` (these are not necessarily optimum, but we will not invest time searching for better ones).

In [None]:
def perplexity_per_token(model: NGramLM, dataset):
    """
    model: a trained NGramLM
    dataset: a tokenized corpus of sentences

    Return the perplexity of the model given the dataset. This should be a single real number.
    """
    raise ValueError("Implement me")    

We implemented a few test cases for you, so you can make sure your implementation of perplexity is correct.

In [None]:
def toy_test_perplexity_per_token():
    print("Testing with toy corpus")
    toy_tokenized_corpus = [
        "a a a".split(),
        "b b b b b b".split(),
        "c".split()
    ]
    unigram_lm = NGramLM(1, seed=42)
    unigram_lm.fit(toy_tokenized_corpus)
    assert np.isclose(perplexity_per_token(unigram_lm, toy_tokenized_corpus), 4.954, 1e-3)

    trigram_lm = NGramLM(3, alpha=1e-3, backoff=True, seed=42)  
    trigram_lm.fit(toy_tokenized_corpus)
    assert np.isclose(perplexity_per_token(trigram_lm, toy_tokenized_corpus), 2.055, 1e-3)

toy_test_perplexity_per_token()

In [None]:
from nltk.corpus import treebank 
from tabulate import tabulate


def ptb_test_perplexity_per_token():
    print("Testing with PTB without BPE tokenisation")
    ptb_training, ptb_dev, ptb_test = split_nltk_corpus(treebank)
    rows = []
    for ngram_size in [1, 2, 3]:
        lm = NGramLM(ngram_size, alpha=1e-3, backoff=True, seed=42)
        lm.fit((x.split() for x in ptb_training))
        rows.append(['ptb', ngram_size, lm.vocab_size, lm.num_parameters(), perplexity_per_token(lm, [x.split() for x in ptb_dev]), perplexity_per_token(lm, [x.split() for x in ptb_test])])

    print("Testing with PTB with BPE tokenisation")
    ptb_tokenizer = fit_vocabulary(ptb_training, vocab_size=1000)    
    for ngram_size in [1, 2, 3]:        
        lm = NGramLM(ngram_size, alpha=1e-3, backoff=True, seed=42)                                 
        lm.fit(ptb_tokenizer.encode(ptb_training, out_type=str))
        rows.append(['ptb', ngram_size, lm.vocab_size, lm.num_parameters(), perplexity_per_token(lm, ptb_tokenizer.encode(ptb_dev, out_type=str)), perplexity_per_token(lm, ptb_tokenizer.encode(ptb_test, out_type=str))])

    print(tabulate(rows, headers=['corpus', 'ngram size', 'vocab size', 'num parameters', 'perplexity dev', 'perplexity test']))

    print("\n\nRemarks:")    
    print("* with BPE tokenisation the model is not as often surprised with unknown symbols, this should improve perplexity a lot")
    print("* but note that the comparison of models with different vocabularies is not strictly fair")
    print("* note how the number of parameters grows fast as a function of vocab size and ngram size")
    print("* alpha and backoff can affect the perplexity, one would have to adjust those numbers, but we won't do that in this tutorial")

    assert np.isclose(rows[0][-2], 3953, 1), "Did you change anything? Did you implement perplexity correctly?"
    assert np.isclose(rows[1][-2], 1889, 1), "Did you change anything? Did you implement perplexity correctly?"
    assert np.isclose(rows[2][-2], 3215, 1), "Did you change anything? Did you implement perplexity correctly?"
    assert np.isclose(rows[3][-2], 328, 1), "Did you change anything? Did you implement perplexity correctly?"
    assert np.isclose(rows[4][-2], 143, 1), "Did you change anything? Did you implement perplexity correctly?"
    assert np.isclose(rows[5][-2], 265, 1), "Did you change anything? Did you implement perplexity correctly?"

ptb_test_perplexity_per_token()    

In [None]:
# REPORT YOUR OWN EXPERIMENT USING THE BROWN CORPUS

# Neural NGram LM

Just like the classic $n$-gram LM, a neural $n$-gram LM makes a conditional independence assumption to simplify the factors in the chain rule. Rather than storing the probability values of observed $(h, w)$ pairs, the neural model stores the parameters necessary to predict those probabilities by transformation of the concatenation of the one-hot encodings of the words in the history. The pmf of the neural $n$-gram LM is defined as follows:

\begin{align}
f_X(w_{1:l};\theta) &= \prod_{i=1}^l \mathrm{Cat}(w_i|\mathbf g(w_{i-n+1:i-1}; \theta)) 
\end{align}

where $\mathbf g$ is a neural network with parameters $\theta$ that maps the history $w_{i-n+1:i-1}$, a tuple of $n-1$ words from a vocabulary of $V$ known words, to a $V$-dimensional probability vector.

This neural network is typically implemented as follows:

* embed each word in the history into a $D$-dimensional space: $\mathbf e_j = \mathrm{embed}_D(w_j; \theta_{\text{in}})$
* concatenate the word embeddings for the words in the history: $\mathbf u_i = \mathrm{concat}(\mathbf e_{i-n+1}, \ldots, \mathbf e_{i-1})$
* use a feed-forward NN with a hidden layer of size $H$ to transform the history encoding $\mathbf u_i$ into a vector of $V$ logits: $\mathbf s_i = \mathrm{ffnn}_V(\mathbf u_i; \theta_{\text{out}})$
* use softmax to obtain probabilities for the possible words at the $i$th position: $\mathbf g(w_{i-n+1:i-1}; \theta) = \mathrm{softmax}(\mathbf s_i)$
* the parameters are $\theta = \theta_{\text{in}} \cup \theta_{\text{out}}$ where 
    * $\theta_{\text{in}}$ is an embedding matrix $\mathbf E \in \mathbb R^{V\times D}$
    * $\theta_{\text{out}}$ are the parameters of the hidden layer $\mathbf W^{[1]} \in \mathbb R^{H\times (n-1)D}$ and $\mathbf b^{[1]} \in \mathbb R^H$, and the parameters of the output layer $\mathbf W^{[2]} \in \mathbb R^{V\times H}$ and $\mathbf b^{[2]} \in \mathbb R^V$
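
To get a feel for the size of this model, the sketch below counts the parameters listed above for the one-hidden-layer network described in the text. The function name `neural_ngram_num_parameters` and the example sizes are hypothetical. The PyTorch class further down uses two hidden layers, so its own `num_parameters()` will report a somewhat larger number.

```python
def neural_ngram_num_parameters(V, D, H, n):
    """Parameter count for embeddings + one hidden layer + output layer."""
    embedding = V * D              # E in R^{V x D}
    hidden = H * (n - 1) * D + H   # W^[1] and b^[1]
    output = V * H + V             # W^[2] and b^[2]
    return embedding + hidden + output


for n in [2, 3, 4, 5]:
    print(n, neural_ngram_num_parameters(V=1000, D=32, H=64, n=n))
```

Only the first hidden layer depends on $n$, so each extra word of history adds just $H \times D$ parameters, whereas a dense count table grows by a factor of $V$.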


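Next we implement this in PyTorch using `torch.nn.Embedding` for the embedding layer and `torch.nn.Linear` for the affine transformations inside the FFNN. As non-linearity for hidden layers we will use ReLU. Note that the implementation below stacks two hidden layers of size $H$, rather than the single hidden layer described above.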

In [None]:
import random
import numpy as np
import torch

def seed_all(seed=42):    
    np.random.seed(seed)
    random.seed(seed)    
    torch.manual_seed(seed)


seed_all()

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as td
import torch.optim as opt
from tqdm.auto import tqdm
import matplotlib.pyplot as plt

**Ungraded exercise** Study the NeuralNGram LM class below and complete the code for the forward pass and for the loss function. We have a test case that you can use to check your implementation. Once you are satisfied, compare your solution to ours; if they differ, work out why, and use ours for the rest of the notebook.

In [None]:
class NeuralNGramLM(nn.Module):

    def __init__(self, ngram_size: int, vocab_size, embedding_dim: int, hidden_size: int, pad_id=0, bos_id=1, eos_id=2):
        """
        ngram_size: longest ngram
        vocab_size: number of known words
        embedding_dim: dimensionality of word embeddings
        hidden_size: dimensionality of hidden layers
        """
        super().__init__()
        assert ngram_size > 1, "This class expects at least ngram_size 2"
        self.tokenizer = tokenizer  # note: this is the global `tokenizer` defined earlier in the notebook, not an argument
        self.vocab_size = vocab_size
        self.ngram_size = ngram_size
        self.embedding_dim = embedding_dim
        self.pad = pad_id
        self.bos = bos_id
        self.eos = eos_id

        self.embed = nn.Embedding(vocab_size, embedding_dim=embedding_dim)
        self.logits_predictor = nn.Sequential(
            nn.Linear((ngram_size - 1) * embedding_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, vocab_size),
        )

    def num_parameters(self):
        return sum(np.prod(theta.shape) for theta in self.parameters())
        
    def forward(self, x):
        """
        Parameterise the conditional distributions over X[i] given history x[:i] for i=1...I.

        This procedure takes care that the ith output distribution conditions only on the n-1 observations before x[i].
        It also takes care of padding to the left with BOS symbols.

        x: [batch_size, max_length]

        Return: a batch of V-dimensional Categorical distributions, one per step of the sequence.
        """
        # The inputs to the FFNN are the ngram_size-1 previous words:

        # Create a sequence of BOS symbols to be prepended to x.
        # [batch_size, ngram_size - 1]
        bos = torch.full((x.shape[0], self.ngram_size - 1), self.bos, device=x.device)
        # [batch_size, max_length + ngram_size - 1]
        _x = torch.cat([bos, x], 1)

        # For each output step, we will have ngram_size - 1 inputs, so we collect those from x
        # [batch_size, max_length, ngram_size - 1]
        inputs = torch.cat([_x.unsqueeze(-1)[:,i:i+self.ngram_size-1].reshape(x.shape[0], 1, -1) for i in range(x.shape[1])], 1)
        
        raise NotImplementedError("Use the layers in this model to predict a batch `s` of logits with shape [batch_size, max_length, vocab_size] for a batch of `inputs` with shape [batch_size, max_length, ngram_size - 1]")
        s = None  # use your own implementation (or copy ours from the answer model below)

        # For numerical stability, we prefer to parameterise the Categorical using logits, rather than probs.
        # It would be equivalent (up to numerical precision) to use: Categorical(probs=F.softmax(s, -1))
        return td.Categorical(logits=s)

    def log_prob(self, x):
        """
        Computes the log probability of each sentence in a batch.

        x: [batch_size, max_length]
        """
        # [batch_size, max_length]
        logp = self(x).log_prob(x)
        # [batch_size]
        logp = torch.where(x != self.pad, logp, torch.zeros_like(logp)).sum(-1)
        return logp    

    def sample(self, batch_size=1, max_length=50):
        """
        Draws a number of samples from the model, each sample is a complete sequence.
        We impose a maximum number of steps, to avoid infinite loops.
        This procedure takes care of mapping sampled symbols to pad after the EOS symbol is generated.
        """
        with torch.no_grad():  # sampling discrete outcomes is not differentiable
            # Reserve memory for the samples (it's not important what symbol to use, I use BOS for clarity)
            x = torch.full((batch_size, max_length), self.bos, device=self.embed.weight.device) 
            # Keeps track of which samples are complete (i.e., already include EOS)
            complete = torch.zeros(batch_size, device=self.embed.weight.device)

            for i in range(max_length):
                # We condition on x[:,:i] (the prefixes) and parameterise Categoricals per step
                #  then sample tokens. This will sample all tokens (including tokens in the prefix), 
                #  but we are only interested in the 'current' one, which we use to update our
                #  actual sample x
                # [batch_size, length]
                x_i = self(x[:,:i+1]).sample()
                # Here we update the current token to something freshly sampled
                #  and also replace the token by 0 (pad) in case the sentence is already complete
                x[:, i] = x_i[:, i] * (1 - complete)
                # Here we update the state of the sentence (i.e., complete or not).
                complete = (complete.bool() + (x_i[:, i] == self.eos)).float()
            
            return x

    def loss(self, x):   
        """
        Compute a scalar loss from a batch of sentences.
        The loss is the negative log likelihood of the model estimated on a single batch:
            - 1/batch_size * sum_{s} log P(x[s]|theta)

        x: [batch_size, max_length] 
        """
        raise NotImplementedError("Implement me!")
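
The window construction in `forward` can be hard to read in tensor form. Here is a plain-Python sketch of the same idea for a single sentence. The function name `history_windows` and the token ids are hypothetical; the class does this for a whole batch at once with tensors.

```python
def history_windows(sentence, ngram_size, bos_id):
    """For each position i, return the ngram_size - 1 tokens that precede sentence[i]."""
    padded = [bos_id] * (ngram_size - 1) + list(sentence)
    return [padded[i:i + ngram_size - 1] for i in range(len(sentence))]


sentence = [5, 7, 6, 2]
windows = history_windows(sentence, ngram_size=3, bos_id=1)
for target, history in zip(sentence, windows):
    print(f"predict {target} from {history}")
```

The first positions see only BOS padding in their history, and no window ever contains the token it is used to predict.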

In [None]:
seed_all()
toy_lm = NeuralNGramLM(ngram_size=3, vocab_size=tokenizer.vocab_size(), embedding_dim=12, hidden_size=7)
toy_lm.num_parameters()

In [None]:
# With the default ids, PAD is 0, BOS is 1 and EOS is 2 (accessed below via toy_lm.pad and toy_lm.eos)
obs = torch.tensor(
    [[5, 7, 6, 2, toy_lm.eos, toy_lm.pad],
    [4, 5, 7, 4, 6, toy_lm.eos]]
)

Check that the forward pass is correct:

In [None]:
assert type(toy_lm(obs)) is td.Categorical, "Did you change the return type?"
assert torch.allclose(torch.sum(toy_lm(obs).probs, -1), torch.ones_like(obs).float(), 1e-3), "Your probabilities do not sum to 1"
assert toy_lm(obs).probs.shape == obs.shape + (tokenizer.vocab_size(),), "The shape should be [2, 6, vocab_size]"

We can estimate the loss using the 2 observations above:

In [None]:
toy_lm.loss(obs)

In [None]:
assert type(toy_lm.loss(obs)) is torch.Tensor, "Your loss should be a torch tensor"
assert toy_lm.loss(obs).requires_grad, "Your loss should be differentiable"
assert toy_lm.loss(obs).shape == tuple(), "Your loss should be a scalar tensor"
assert np.isclose(toy_lm.loss(obs).item(), 37, 1), "Without training, with seed 42, and the obs we gave you, your loss should be close to 37. If this is not correct, check with your TA."

<details>
    <summary><b>Solution for forward method </b> </summary>

```python
def forward(self, x):
    """
    Parameterise the conditional distributions over X[i] given history x[:i] for i=1...I.

    This procedure takes care that the ith output distribution conditions only on the n-1 observations before x[i].
    It also takes care of padding to the left with BOS symbols.

    x: [batch_size, max_length]

    Return: a batch of V-dimensional Categorical distributions, one per step of the sequence.
    """
    # The inputs to the FFNN are the ngram_size-1 previous words:

    # Create a sequence of BOS symbols to be prepended to x.
    # [batch_size, ngram_size - 1]
    bos = torch.full((x.shape[0], self.ngram_size - 1), self.bos, device=x.device)
    # [batch_size, max_length + ngram_size - 1]
    _x = torch.cat([bos, x], 1)

    # For each output step, we will have ngram_size - 1 inputs, so we collect those from x
    # [batch_size, max_length, ngram_size - 1]
    inputs = torch.cat([_x.unsqueeze(-1)[:,i:i+self.ngram_size-1].reshape(x.shape[0], 1, -1) for i in range(x.shape[1])], 1)

    # Embed the input histories
    # [batch_size, max_length, ngram_size - 1, D]
    e = self.embed(inputs)
    # [batch_size, max_length, (ngram_size - 1) * D]
    e = e.reshape(x.shape + (-1,))

    # Compute the V-dimensional scores (logits) 
    # [batch_size, max_length, V]
    s = self.logits_predictor(e)

    # For numerical stability, we prefer to parameterise the Categorical using logits, rather than probs.
    # It would be equivalent (up to numerical precision) to use: Categorical(probs=F.softmax(s, -1))
    return td.Categorical(logits=s)
```    
</details>

<details>
    <summary> <b> Solution for loss method </b> </summary>

```python
def loss(self, x):   
    """
    Compute a scalar loss from a batch of sentences.
    The loss is the negative log likelihood of the model estimated on a single batch:
        - 1/batch_size * sum_{s} log P(x[s]|theta)

    x: [batch_size, max_length]    
    """
    # [batch_size]
    loss = - self.log_prob(x)    
    # []
    return loss.mean(0)        
```        
</details>

We can also obtain samples from the model:

In [None]:
toy_lm.sample(5, max_length=20)

We can plot statistics of sampled strings, for example, length and word frequency:

In [None]:
sample = toy_lm.sample(100, max_length=20)
_ = plt.hist(( sample > 0).sum(-1), bins='auto')
_ = plt.xlabel("Length distribution before training")

In [None]:
from collections import Counter
import numpy as np

flat_samples = sample.flatten()
word_counts = Counter(flat_samples[flat_samples > 0].numpy())
wcs = np.array([(w, n) for w, n in word_counts.items()])
_ = plt.plot(wcs[:,0], wcs[:,1]/wcs[:,1].sum(), 'x')
_ = plt.xlabel("Word id")
_ = plt.ylabel("Frequency before training")

We can train the model using gradient-based optimisation:

In [None]:
optimiser = opt.Adam(toy_lm.parameters(), lr=0.01)

In [None]:
with tqdm(range(1000)) as bar:
  for _ in bar:
    toy_lm.train()
    optimiser.zero_grad()

    loss = toy_lm.loss(obs)
    bar.set_postfix({'loss': f"{loss:.2f}" } )
    
    loss.backward()
    optimiser.step()
    

After training, our samples should look less arbitrary:

In [None]:
toy_lm.sample(10, 8)

And statistics such as length and word frequency should be closer to those of the training data:

In [None]:
sample = toy_lm.sample(100, max_length=20)
_ = plt.hist((sample > 0).sum(-1), bins='auto')
_ = plt.xlabel("Length distribution after training")

In [None]:
flat_samples = sample.flatten()
word_counts = Counter(flat_samples[flat_samples > 0].numpy())
wcs = np.array([(w, n) for w, n in word_counts.items()])
_ = plt.plot(wcs[:,0], wcs[:,1]/wcs[:,1].sum(), 'x')
_ = plt.xlabel("Word id")
_ = plt.ylabel("Frequency after training")

Now we will conduct an experiment with an actual corpus. GPU support is strongly recommended for this (on Google Colab, change the runtime type to GPU).

In [None]:
my_device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is available
my_device

As we did in the PyTorch tutorial, we will create a `Dataset` object and a `DataLoader` for batching:

In [None]:
from torch.utils.data import Dataset, DataLoader

In [None]:
class Corpus(Dataset):

    def __init__(self, corpus, tokenizer):
        """
        In PyTorch it is best to manipulate numerical codes rather than text,
        so our Corpus object holds a tokenizer that converts text to codes.

        corpus: a list of sentences, each a string
        tokenizer: a BPE tokenizer from sentencepiece
        """
        self.corpus = list(corpus)
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.corpus)
    
    def __getitem__(self, idx):
        """Return corpus[idx] but BPE tokenized, converted to codes, and with the EOS symbol"""
        return self.tokenizer.encode(self.corpus[idx], add_eos=True)

Our neural models are **a lot** more parameter efficient than our classic models, so we could use a larger vocabulary, but to keep the tutorial lightweight, we will still use a vocabulary of size 1000.

In [None]:
tokenizer = fit_vocabulary(training, vocab_size=1000)
training_tok = Corpus(training, tokenizer)
dev_tok = Corpus(dev, tokenizer)
test_tok = Corpus(test, tokenizer)
len(training_tok), len(dev_tok), len(test_tok)

When we manipulate sequences of variable length, we need to "pad" them all to the same length. That's because, to batch them using tensors, they need to look as if they had the same length. 

We do that with a special symbol that will get ignored later on.

In [None]:
def pad_to_longest(sequences, pad_id=0):
    """
    Take a list of coded sequences and returns a torch tensor where 
    every sentence has the same length (by means of using PAD tokens)
    """
    longest = max(len(seq) for seq in sequences)
    return torch.tensor([seq + [pad_id] * (longest - len(seq)) for seq in sequences])

See what this does to the first few sentences in the batch (they should end with a few 0s, indicating they are shorter than the longest sentence in the batch).

In [None]:
pad_to_longest([training_tok[0], training_tok[1], training_tok[2]])
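
How padding gets "ignored later on" is worth spelling out: wherever a position holds the pad symbol, its contribution is dropped before summing. Here is a plain-Python sketch with made-up per-token log-probabilities. The names `masked_sum`, `codes`, and `token_log_probs` are hypothetical; the model's `log_prob` method does the equivalent with `torch.where`.

```python
def masked_sum(codes, token_log_probs, pad_id=0):
    """Sum the log-probabilities of the non-pad positions of one padded sequence."""
    return sum(lp for code, lp in zip(codes, token_log_probs) if code != pad_id)


codes = [5, 7, 2, 0, 0]                           # a sequence padded with two pad symbols
token_log_probs = [-1.0, -2.0, -0.5, -9.0, -9.0]  # made-up values; the last two belong to padding
print(masked_sum(codes, token_log_probs))
```

Whatever the model predicts at the padded positions has no effect on the sentence's log-probability, and hence none on the loss.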

Now that we can convert batches of sentences to codes and guarantee they have the same length, we can construct a data loader to create mini batches at random:

In [None]:
batcher = DataLoader(training_tok, batch_size=10, shuffle=True, collate_fn=pad_to_longest)
len(batcher)

We will need a batched PyTorch version of perplexity. To make sure you can run experiments with the correct code, we implement it here for you:

In [None]:
def perplexity(model: NeuralNGramLM, dl, device):
    """
    model: an instance of NeuralNGramLM
    dl: a data loader for the heldout data
    device: the PyTorch device where the model is stored
    """
    model.eval()
    total_tokens = 0
    total_log_prob = 0.
    with torch.no_grad():
        for batch in dl:
            total_tokens += (batch != model.pad).float().sum()  # count non-pad tokens
            total_log_prob = total_log_prob + model.log_prob(batch.to(device)).sum()
    return torch.exp(-total_log_prob / total_tokens)
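
Note that the function accumulates log-probabilities and token counts over all batches and exponentiates only once at the end. Averaging per-batch perplexities would generally give a different number, as this sketch with made-up totals illustrates (all names and numbers are hypothetical):

```python
import math

# (total log-probability, number of tokens) for two hypothetical batches
batches = [(-30.0, 10), (-200.0, 40)]

total_log_prob = sum(lp for lp, _ in batches)
total_tokens = sum(t for _, t in batches)
corpus_ppl = math.exp(-total_log_prob / total_tokens)

mean_of_batch_ppls = sum(math.exp(-lp / t) for lp, t in batches) / len(batches)

print(f"corpus-level perplexity: {corpus_ppl:.2f}")
print(f"mean of per-batch perplexities: {mean_of_batch_ppls:.2f}")
```

The corpus-level number matches the definition of perplexity given earlier and does not depend on how the data happens to be split into batches.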

Here we have the training loop (already fully implemented for you). Do study it.

In [None]:
def train_neural_model(model, optimiser, training_corpus, dev_corpus, batch_size=200, num_epochs=10, device=torch.device('cuda:0')):
    # we use the training data in random order for parameter estimation
    batcher = DataLoader(training_corpus, batch_size=batch_size, shuffle=True, collate_fn=pad_to_longest)
    # we use the dev data for evaluation during training (no need for randomisation here)
    dev_batcher = DataLoader(dev_corpus, batch_size=batch_size, shuffle=False, collate_fn=pad_to_longest)

    total_steps = num_epochs * len(batcher)
    log = defaultdict(list)
    ppl = perplexity(model, dev_batcher, device=device).item()
    log['ppl'].append(ppl)
    
    step = 0

    with tqdm(range(total_steps)) as bar:
        for epoch in range(num_epochs):
            
            for batch in batcher:
                model.train()
                optimiser.zero_grad()

                loss = model.loss(batch.to(device))

                loss.backward()
                optimiser.step()

                bar.set_postfix({'loss': f"{loss.item():.2f}", 'ppl': f"{ppl:.2f}"} )
                bar.update()  
                log['loss'].append(loss.item())

                if step % 100 == 0:
                    ppl = perplexity(model, dev_batcher, device=device).item()
                    log['ppl'].append(ppl)

                step += 1

    ppl = perplexity(model, dev_batcher, device=device).item()
    log['ppl'].append(ppl)
    return log            

## Experiment

Here we demonstrate how to train and evaluate a model. 

After that you will conduct an experiment.

On GPU, this should take just about 2 minutes:

In [None]:
seed_all() # reset random number generators before creating your model and training it

# Create our LM
lm = NeuralNGramLM(
    ngram_size=3, 
    vocab_size=tokenizer.vocab_size(), 
    embedding_dim=32, 
    hidden_size=64, 
    pad_id=tokenizer.pad_id(),
    bos_id=tokenizer.bos_id(),
    eos_id=tokenizer.eos_id(),
).to(my_device)

# construct an Adam optimiser
optimiser = opt.Adam(lm.parameters(), lr=5e-3)

print("Model")
print(lm)
# report number of parameters
print("Model size:", lm.num_parameters())

log = train_neural_model(
    lm, optimiser, training_tok, dev_tok, 
    batch_size=200, num_epochs=10, 
    device=my_device
)
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
_ = axs[0].plot(np.arange(len(log['loss'])), log['loss'])
_ = axs[0].set_xlabel('steps')
_ = axs[0].set_ylabel('training loss')
_ = axs[1].plot(np.arange(len(log['ppl'])), log['ppl'])
_ = axs[1].set_xlabel('steps (in 100s)')
_ = axs[1].set_ylabel('model ppl given dev')
_ = fig.tight_layout(h_pad=2, w_pad=2)
plt.show()

print("\n# Samples\n\n")
for x in lm.sample(10, 20):
    print(tokenizer.decode([int(a) for a in x]))

test_batcher = DataLoader(test_tok, batch_size=100, shuffle=False, collate_fn=pad_to_longest)
print("\n\n Model perplexity given test set", perplexity(lm, test_batcher, device=my_device).item())

<a name="neural"></a> **Graded exercise - Neural NGram LM**

Train NeuralNGram LMs using the Brown corpus and a BPE vocabulary of 1000 tokens.

* Train 2-gram, 3-gram, 4-gram and 5-gram LMs.
* For each n-gram size, train two configurations:
    * 32 dimensions for embeddings and 64 dimensions for hidden size
    * 64 dimensions for embeddings and 128 dimensions for hidden size
* For each model, plot the training loss and the dev-set perplexity throughout training.
* For the trained models, report a table with embedding size, hidden size, n-gram size, model size (in number of parameters), model perplexity on the dev set, and model perplexity on the test set.
* As always, write a few discussion points. Here are some ideas: comment on how increasing the n-gram size is relatively cheap here (compared to the classic model); comment on the effect of increasing the n-gram size on model perplexity. Was the effect similar or different in the classic case?
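
The configurations listed above form a small grid (4 n-gram sizes times 2 dimension settings). A minimal sketch of how one might enumerate that grid and lay out a results table is shown below. The helper names `experiment_grid` and `markdown_table` are hypothetical and not part of this notebook. In the notebook, each settings dict would be used to construct a `NeuralNGramLM` and passed to `train_neural_model`; the model size and perplexities would then be added to the dict as extra columns.

```python
from itertools import product

# Values taken from the exercise text.
NGRAM_SIZES = [2, 3, 4, 5]
DIMENSIONS = [(32, 64), (64, 128)]  # (embedding_dim, hidden_size)


def experiment_grid():
    """Return one settings dict per (n-gram size, dimensions) combination."""
    return [
        {"ngram_size": n, "embedding_dim": emb, "hidden_size": hid}
        for n, (emb, hid) in product(NGRAM_SIZES, DIMENSIONS)
    ]


def markdown_table(rows, columns):
    """Format a list of dicts as a markdown table with the given columns."""
    header = "| " + " | ".join(columns) + " |"
    rule = "|" + "|".join(" --- " for _ in columns) + "|"
    body = [
        "| " + " | ".join(str(row.get(c, "")) for c in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, rule] + body)


grid = experiment_grid()
print(len(grid), "configurations")
print(markdown_table(grid, ["ngram_size", "embedding_dim", "hidden_size"]))
```

Keeping the settings in plain dicts makes it easy to append measured quantities to each row after training and to print the final table with the same helper.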


Using the best model you have (measured in perplexity given the test set):

* Draw samples from the model (use the same number of sentences as in the dev set).
* Compare the length distribution of the generated data with the length distribution of the dev set (use BPE tokenization for that).
* BPE-detokenize your samples and re-tokenize them using Python's `.split()`, then investigate the rate at which the model creates words that did not occur in the training data.
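
The two statistics above can be computed with a few lines of plain Python. The sketch below is one possible way to do it, run on toy data invented for illustration: `length_distribution` and `novel_word_rate` are hypothetical helper names, not functions from this notebook. It assumes the novel-word rate is counted over word tokens (counting over word types is an equally reasonable choice), and it splits on whitespace only to keep the toy example short; for the exercise, the lengths should come from the BPE token sequences.

```python
from collections import Counter


def length_distribution(sentences):
    """Relative frequency of each sentence length (in tokens)."""
    counts = Counter(len(s) for s in sentences)
    total = sum(counts.values())
    return {length: c / total for length, c in sorted(counts.items())}


def novel_word_rate(sample_texts, training_texts):
    """Fraction of word tokens in the samples that never occur in training.

    Words are obtained with str.split(), i.e. by splitting on whitespace.
    """
    known = {w for text in training_texts for w in text.split()}
    words = [w for text in sample_texts for w in text.split()]
    if not words:
        return 0.0
    return sum(w not in known for w in words) / len(words)


# Toy data, invented for illustration only.
toy_training = ["the cat sat", "the dog ran"]
toy_samples = ["the cat ran", "the dag sat down"]

print(length_distribution([s.split() for s in toy_samples]))
print(novel_word_rate(toy_samples, toy_training))
```

Applying `length_distribution` to both the generated sentences and the dev sentences gives two dictionaries that can be plotted side by side, for example as bar charts or histograms.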

