# Back to Basics: Let's Build a Language Model – Andrej Jovanović

![title](images/llm.jpg)

Large language models have taken the world by storm. Anyone heard of ChatGPT or something like that? No? Me neither…

The powerful thing about these technologies is that they are using neural networks to create successful models of language. However, language model machinery was not always so grand and complex: we once relied on basic probability theory. I find it helpful to view these language models through this probabilistic lens: it helps to understand and normalise them, in a sense. Join me in this tutorial where we create a relatively impoverished n-gram language model from scratch!

*Note: this Jupyter Notebook ties in with a set of lecture slides that explain the theoretical backbone of the code. The link to the slides can be found here: \
This tutorial is based on a few resources, which can be found here: \
https://www.geeksforgeeks.org/n-gram-language-modelling-with-nltk/ \
https://web.stanford.edu/~jurafsky/slp3/3.pdf \
https://github.com/nltk/nltk/blob/develop/nltk/lm/__init__.py*

## Step 1: What is our data?
For this tutorial, we are going to use the famous Gutenberg corpus (a dataset that just contains a lot of text). We will load it with the NLTK package, which is very useful for a wide variety of NLP tasks.

In [None]:
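# Collect all of our imports at the top of the notebook - just for convenience ;)
import os, string, random
from collections import Counter, defaultdict

import nltk
# Download only the resources this notebook uses (a bare nltk.download() opens an interactive downloader).
# Newer NLTK releases may also need "punkt_tab" for sentence tokenisation.
for resource in ("gutenberg", "webtext", "stopwords", "punkt"):
    nltk.download(resource)
from nltk import FreqDist, word_tokenize, sent_tokenize
from nltk.corpus import gutenberg, stopwords, webtext
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.util import ngrams, pad_sequence, bigrams, everygrams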

In [None]:
## TODO - view data
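
A minimal sketch of how this cell might be filled in. The helper names `preview_sentences` and `load_emma_sentences` are made up for this illustration, and the toy sentences stand in for the real corpus so that the snippet runs without any downloads. `gutenberg.sents()` is the NLTK corpus-reader call that returns tokenised sentences; using it assumes the `gutenberg` resource has already been downloaded.

```python
from itertools import islice


def preview_sentences(sentences, n=3):
    """Return the first n sentences as plain lists of tokens."""
    return [list(sentence) for sentence in islice(sentences, n)]


def load_emma_sentences():
    """Load the tokenised sentences of Emma (assumes the corpus is downloaded)."""
    # Imported lazily so that preview_sentences can be tried without NLTK data.
    from nltk.corpus import gutenberg
    return gutenberg.sents("austen-emma.txt")


# Toy stand-in with the same shape as gutenberg.sents(): a list of token lists.
toy_sentences = [["Emma", "was", "clever", "."], ["She", "was", "rich", "."]]
preview = preview_sentences(toy_sentences, n=1)
print(preview)
```

In the notebook, `preview_sentences(load_emma_sentences())` would show the opening sentences of *Emma* instead of the toy data.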

As you can see, we get back a list of lists, where each inner list is a sentence. These sentences actually come from novels: this sentence in particular comes from Emma by Jane Austen. Our goal now is to create a model of the English language, based on the novels contained in the Gutenberg corpus. However, we first need to clean the data before we are ready to create our n-gram language model.

## Step 2: Cleaning the data

First, we need to remove some stop words. These are words that occur very frequently in English and will distract from our language modelling task by bloating some of the probabilities.

In [None]:
### TODO
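
One possible way to fill in this cell, as a sketch rather than the definitive solution. The function name `remove_stop_words` is hypothetical, and the tiny stop-word set below is only a stand-in so the example runs without downloads; in the notebook you would more likely pass `set(stopwords.words("english"))` from NLTK.

```python
def remove_stop_words(sentences, stop_words):
    """Drop every token whose lowercase form appears in stop_words."""
    return [
        [tok for tok in sent if tok.lower() not in stop_words]
        for sent in sentences
    ]


# Stand-in for set(stopwords.words("english")); an assumption for illustration only.
toy_stop_words = {"the", "was", "a", "of", "and"}

sentences = [["The", "house", "was", "old"], ["A", "friend", "of", "Emma"]]
filtered = remove_stop_words(sentences, toy_stop_words)
print(filtered)
```

Turning the stop-word list into a `set` matters on a corpus this size, since membership tests on a set are much faster than on a list.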

Now, we want to build up our vocabulary. To achieve this, we first need to do two things:
- To start with, we need to tokenise our text. In this context, this means cleaning our input data so that all the words are lowercase.
- Then, we want to create our unigram, bigram and trigram vocabularies.


In [None]:
### TODO
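
The two steps above can be sketched in plain Python as follows. The helper names `normalise` and `extract_ngrams` are made up for this example, and dropping punctuation-only tokens is an extra assumption on top of the lowercasing described above. `nltk.util.ngrams`, imported at the top of the notebook, could be used in place of `extract_ngrams`.

```python
import string


def normalise(sentences):
    """Lowercase every token and drop tokens made only of punctuation."""
    punctuation = set(string.punctuation)
    return [
        [tok.lower() for tok in sent if not all(ch in punctuation for ch in tok)]
        for sent in sentences
    ]


def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens in one sentence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentences = [["Emma", "was", "clever", "."], ["Emma", "was", "rich", "!"]]
clean = normalise(sentences)

# N-grams are collected per sentence so that none spans a sentence boundary.
unigram_list = [g for sent in clean for g in extract_ngrams(sent, 1)]
bigram_list = [g for sent in clean for g in extract_ngrams(sent, 2)]
trigram_list = [g for sent in clean for g in extract_ngrams(sent, 3)]

# The vocabularies are the distinct n-grams of each order.
unigram_vocab = set(unigram_list)
bigram_vocab = set(bigram_list)
trigram_vocab = set(trigram_list)
print(len(unigram_vocab), len(bigram_vocab), len(trigram_vocab))
```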

Now let's see how frequent our tokens are:

In [None]:
### TODO
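
A small sketch of counting token frequencies with `collections.Counter`, using toy sentences as a stand-in for the cleaned corpus. NLTK's `FreqDist`, imported at the top of the notebook, is a `Counter` subclass and would work the same way here.

```python
from collections import Counter

# Toy stand-in for the cleaned, lowercased sentences from the previous step.
sentences = [["emma", "was", "clever"], ["emma", "was", "rich"]]

# Count single tokens and adjacent pairs, one sentence at a time.
unigram_counts = Counter(tok for sent in sentences for tok in sent)
bigram_counts = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))

# Relative frequency = count divided by the total number of tokens.
total_tokens = sum(unigram_counts.values())
relative_freq = {tok: count / total_tokens for tok, count in unigram_counts.items()}

for tok, count in unigram_counts.most_common(3):
    print(f"{tok!r}: count={count}, relative frequency={relative_freq[tok]:.2f}")
```

These relative frequencies are already a unigram language model: each one is the maximum likelihood estimate of that word's probability.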

Now, let's try to generate some sentences from our trigram language model:

In [None]:
### TODO
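
Here is a minimal from-scratch sketch of trigram generation, under a few assumptions that are not in the original notebook: sentences are padded with two `<s>` tokens and one `</s>` token, and the function names are made up for this example. Each next word is sampled in proportion to how often it followed the previous two words in the training data.

```python
import random
from collections import Counter, defaultdict


def train_trigram_counts(sentences):
    """Map each two-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>", "<s>"] + list(sent) + ["</s>"]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            counts[(w1, w2)][w3] += 1
    return counts


def generate_sentence(counts, rng, max_len=20):
    """Sample words one at a time until </s> is drawn or max_len is reached."""
    context = ("<s>", "<s>")
    words = []
    for _ in range(max_len):
        followers = counts.get(context)
        if not followers:
            break  # context never seen in training, nothing to sample from
        candidates = list(followers)
        weights = [followers[c] for c in candidates]
        next_word = rng.choices(candidates, weights=weights, k=1)[0]
        if next_word == "</s>":
            break
        words.append(next_word)
        context = (context[1], next_word)
    return words


toy_sentences = [["emma", "was", "clever"], ["emma", "was", "rich"]]
toy_counts = train_trigram_counts(toy_sentences)
rng = random.Random(0)  # fixed seed so that runs are repeatable
sample = generate_sentence(toy_counts, rng)
print(" ".join(sample))
```

On a corpus this small the model can only replay its training sentences, which is worth keeping in mind when judging the fluency of samples from the real corpus.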

As we can see, we have some output. In some cases it is fluent, but in most cases it isn't. Why do we think that is? Do you notice anything interesting about the type of sentences that the model is generating?

## Let's step it up a bit: Let's use NLTK

Fortunately, NLTK has a lot of this functionality baked into it. Let's try using its classes and methods on different data.

In [None]:
### TODO

Let's generate our sentences and our training data for our trigram model:

In [None]:
### TODO
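
A sketch of what this step could look like with `padded_everygram_pipeline`, which was imported at the top of the notebook. The toy sentences are an assumption made so the example needs no corpus download; in the notebook they would be replaced by the tokenised, lowercased corpus sentences.

```python
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy stand-in for the tokenised corpus: a list of token lists.
tokenised = [["a", "b", "c"], ["a", "c", "d", "c", "e", "f"]]
n = 3  # trigram model

# Returns two lazy iterators: per-sentence n-grams of order 1..n (with padding),
# and the flat stream of padded tokens used to build the vocabulary.
train_data, padded_vocab = padded_everygram_pipeline(n, tokenised)

# Peek at the first sentence. This consumes the iterators, so the pipeline
# has to be rebuilt before the data is passed to a model.
first_sentence_ngrams = list(next(iter(train_data)))
vocab_tokens = list(padded_vocab)

print([g for g in first_sentence_ngrams if len(g) == 3])
print(vocab_tokens[:7])
```

Because both return values are single-use iterators, a common stumbling block is inspecting them in one cell and then fitting an empty model in the next.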

Now let's fit our model

In [None]:
### TODO
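
Fitting could look roughly like this, again on toy data so that the example is self-contained. `MLE`, `fit`, `counts` and `score` come from `nltk.lm`; the toy sentences and variable names are assumptions for illustration.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

tokenised = [["a", "b", "c"], ["a", "c", "d", "c", "e", "f"]]
n = 3

train_data, padded_vocab = padded_everygram_pipeline(n, tokenised)
model = MLE(n)                       # maximum likelihood trigram model
model.fit(train_data, padded_vocab)  # counts every n-gram up to order n

vocab_size = len(model.vocab)        # includes <s>, </s> and <UNK>
count_a = model.counts["a"]          # how often "a" occurs
p_b_given_a = model.score("b", ["a"])          # P(b | a)
p_c_given_ab = model.score("c", ["a", "b"])    # P(c | a, b)
p_f_given_a = model.score("f", ["a"])          # "f" never follows "a"

print(vocab_size, count_a, p_b_given_a, p_c_given_ab, p_f_given_a)
```

The last score is the interesting one: "f" is in the vocabulary, but since it never follows "a" in the training data, the maximum likelihood estimate assigns it zero probability.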

In [None]:
### TODO

In [None]:
### TODO

In [None]:
### TODO

In [None]:
### TODO

In [None]:
### TODO

In [None]:
### TODO
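
Once a model is fitted, one of the remaining cells could sample sentences from it. The sketch below uses `model.generate` from `nltk.lm` one word at a time and stops at the end-of-sentence token; the helper name `generate_sentence` and the toy training data are assumptions for illustration.

```python
import random

from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

tokenised = [["a", "b", "c"], ["a", "c", "d", "c", "e", "f"]]
n = 3
train_data, padded_vocab = padded_everygram_pipeline(n, tokenised)
model = MLE(n)
model.fit(train_data, padded_vocab)


def generate_sentence(model, max_words=15, seed=0):
    """Sample one word at a time, starting from the padding and stopping at </s>."""
    rng = random.Random(seed)
    words = ["<s>", "<s>"]
    for _ in range(max_words):
        # With num_words=1, generate returns a single token (a string).
        next_word = model.generate(1, text_seed=words, random_seed=rng)
        if next_word == "</s>":
            break
        words.append(next_word)
    return words[2:]  # strip the start-of-sentence padding


sentence = generate_sentence(model, seed=42)
print(" ".join(sentence))
```

With only two training sentences, every trigram context has a single continuation after the first choice, so the model can only reproduce one of its training sentences.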

## But wait, there is more!

Did you notice something strange when we were creating our language model? We imported something called MLE, which stands for maximum likelihood estimation. This is a method of creating our language model where we estimate our probabilities directly from the counts in the data. Furthermore, in our case, we set our vocab cutoff to 1: the cutoff lets us ignore words that don't occur frequently enough to provide useful information, because in NLTK any word counted fewer times than the cutoff is treated as unknown. But what happens if we want to compute the probability of a sentence that contains a word, or a combination of words, that the model never saw during training? Should the probability immediately be 0?

I think not. The family of techniques that addresses this problem is known as smoothing!
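
To make the zero-probability problem concrete, here is the trigram maximum likelihood estimate next to a Lidstone-smoothed counterpart. The notation is an assumption of this sketch: $C(\cdot)$ is a count in the training data, $|V|$ is the vocabulary size, and $\gamma > 0$ is the amount added to every count, with $\gamma = 1$ giving Laplace (add-one) smoothing.

$$
P_{\text{MLE}}(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
$$

$$
P_{\text{Lidstone}}(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i) + \gamma}{C(w_{i-2}\, w_{i-1}) + \gamma\, |V|}
$$

If a trigram never occurs in training, the MLE numerator is 0, and so is the probability of any sentence containing it. The smoothed numerator is at least $\gamma$, so the estimate stays strictly positive, and adding $\gamma\,|V|$ to the denominator keeps the probabilities over the vocabulary summing to 1.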

To find examples of smoothing, look at [these objects from `nltk.lm.models`](https://github.com/nltk/nltk/blob/develop/nltk/lm/models.py):

 - `Lidstone`: Provides Lidstone-smoothed scores.
 - `Laplace`: Implements Laplace (add one) smoothing.
 - `InterpolatedLanguageModel`: Logic common to all interpolated language models (Chen & Goodman 1995).
 - `WittenBellInterpolated`: Interpolated version of Witten-Bell smoothing.
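
As a quick illustration of the difference smoothing makes, the sketch below fits `MLE` and `Laplace` from `nltk.lm` on the same toy data and compares their scores. The helper `fit_model` and the toy sentences are assumptions for this example, not part of the original notebook.

```python
from nltk.lm import MLE, Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

tokenised = [["a", "b", "c"], ["a", "c", "d", "c", "e", "f"]]
n = 3


def fit_model(model_class):
    """Build a fresh pipeline (its iterators are single-use) and fit the model."""
    train_data, padded_vocab = padded_everygram_pipeline(n, tokenised)
    model = model_class(n)
    model.fit(train_data, padded_vocab)
    return model


mle = fit_model(MLE)
laplace = fit_model(Laplace)

# "b" follows "a" once in training; "f" never follows "a".
seen_mle, unseen_mle = mle.score("b", ["a"]), mle.score("f", ["a"])
seen_lap, unseen_lap = laplace.score("b", ["a"]), laplace.score("f", ["a"])

print(f"MLE:     P(b|a)={seen_mle:.3f}  P(f|a)={unseen_mle:.3f}")
print(f"Laplace: P(b|a)={seen_lap:.3f}  P(f|a)={unseen_lap:.3f}")
```

Smoothing takes some probability mass away from the continuations that were seen and hands it to the ones that were not, so the seen score drops while the unseen score becomes positive.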

In [None]:
### TODO - IMPLEMENT YOUR OWN SMOOTHED N-GRAM MODELS!
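
One possible starting point for this exercise is a hand-rolled Lidstone-smoothed bigram model in plain Python. Everything here is a sketch under stated assumptions: the class name `LidstoneBigramModel` is made up, the model uses bigrams rather than trigrams to stay short, and the vocabulary is taken to be the training words plus `</s>` and `<UNK>`, which differs slightly from how NLTK builds its vocabulary.

```python
from collections import Counter, defaultdict


class LidstoneBigramModel:
    """Bigram model with add-gamma smoothing (gamma=1 is Laplace)."""

    def __init__(self, sentences, gamma=1.0):
        self.gamma = gamma
        self.followers = defaultdict(Counter)  # previous word -> next-word counts
        self.vocab = {"<UNK>", "</s>"}         # every word the model can predict
        for sent in sentences:
            self.vocab.update(sent)
            padded = ["<s>"] + list(sent) + ["</s>"]
            for prev, word in zip(padded, padded[1:]):
                self.followers[prev][word] += 1

    def score(self, word, prev):
        """Return the smoothed estimate of P(word | prev)."""
        if word not in self.vocab:
            word = "<UNK>"  # map out-of-vocabulary words to the unknown token
        counts = self.followers.get(prev, Counter())
        total = sum(counts.values())
        return (counts[word] + self.gamma) / (total + self.gamma * len(self.vocab))


toy_sentences = [["a", "b", "c"], ["a", "c", "d", "c", "e", "f"]]
model = LidstoneBigramModel(toy_sentences, gamma=1.0)

p_seen = model.score("b", "a")         # "b" follows "a" once
p_unseen = model.score("f", "a")       # in vocabulary, but never follows "a"
p_unknown = model.score("zebra", "a")  # not in the vocabulary at all
print(p_seen, p_unseen, p_unknown)
```

A natural next step would be to try smaller values of `gamma`, or to extend the context from one word to two to match the trigram models used earlier.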