# Creating a language model from ngrams

We know that a language model is just the encoding of knowledge about likely sequences and word associations. You can create a language model by obtaining statistics about a large corpus of ngrams. You can also create a large language model (LLM) like the ones behind ChatGPT or Llama using deep learning. 

Here, we'll create simple language models from ngrams using NLTK. The intuition behind n-grams is that you can predict the next item in a sequence if you know how often sequences of _n_ items occur in a corpus. To keep things simple, let's set _n_ to 2 and assume that each item is a word. But you can also calculate ngram frequencies for characters, sounds, or sentences.

So, let's say that we are calculating 2-gram sequences of words. Then, we are calculating bigrams. There are also unigrams (one word at a time), trigrams (sequences of three words), 4-grams, etc. You get the idea. 

Thus, we can figure out what the next word is if we know what the previous words are. Let's say that we want to find out the likelihood that the next word in the sequence _I really like_ is _you_. This is, by the way, what Google suggested when I typed _I really like..._ The first link was to a [Carly Rae Jepsen song](https://youtu.be/qV5lzRHrGeg). We can calculate that as:


$$ P(you | I, really, like ) $$

Written that way, the formula describes a 4-gram (a sequence of 4 words): three words of context plus the word we want to predict. This can be difficult to calculate, especially for less frequent combinations of words. So, to make this into a bigram probability, we calculate the following, which reads as "the probability of _you_ given _like_":

$$ P(you | like ) $$

The general formula is below. The probability of $w_i$ given the sequence $w_1$ to $w_{i-1}$ is approximately the probability of $w_i$ given $w_{i-1}$. So, instead of calculating probabilities for a long sequence of words, we do it for a sequence of 2 words at a time.

$$ P(w_i | w_1, w_2, w_3, ..., w_{i-1} ) \approx P(w_i | w_{i-1}) $$

Note that above we say "the probability of _x_ given _y_". To estimate that, we count how often _y_ is followed by _x_ in a large enough corpus and divide by how often _y_ is followed by anything at all. This is what we'll do in this notebook!
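As a minimal sketch of that counting step, the cell below estimates $P(you | like)$ from a made-up nine-word corpus using only Python's standard library. The corpus and the helper name `bigram_prob` are assumptions for illustration; they are not part of NLTK.

```python
from collections import Counter

# Toy corpus (an assumption for illustration, not real data)
tokens = "i really like you and i really like tea".split()

# Pair each word with the word that follows it
pairs = list(zip(tokens, tokens[1:]))

pair_counts = Counter(pairs)
# How often each word appears as the first element of a pair
context_counts = Counter(first for first, _ in pairs)

def bigram_prob(prev_word, word):
    """Estimate P(word | prev_word) from the toy counts."""
    if context_counts[prev_word] == 0:
        return 0.0
    return pair_counts[(prev_word, word)] / context_counts[prev_word]

# "like" is followed once by "you" and once by "tea"
print(bigram_prob("like", "you"))
# "i" is followed by "really" both times it appears
print(bigram_prob("i", "really"))
```

With such a tiny corpus the numbers are not meaningful; the point is only that the probability is a ratio of two counts.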

Credits: [NLTK LM documentation](https://github.com/nltk/nltk/blob/develop/nltk/lm/__init__.py), [N-gram language models](https://www.kaggle.com/code/alvations/n-gram-language-model-with-nltk), [N-gram language modelling with NLTK](https://www.geeksforgeeks.org/n-gram-language-modelling-with-nltk/), [Predicting next word using n-gram model NLTK](https://stackoverflow.com/questions/75565130/predicting-next-word-using-n-gram-model-nltk).

## Import statements

We import everything we need, including bits of NLTK. To train only on "important" or content words, we will remove punctuation and stopwords. We'll first use the [NLTK Reuters corpus](https://www.nltk.org/book/ch02.html#reuters-corpus) to train. 

In [None]:
import string 
import nltk 
from nltk.corpus import stopwords
from nltk.util import bigrams
from nltk.util import ngrams
from nltk.util import everygrams
from nltk.corpus import reuters 
from nltk import FreqDist 
from nltk import word_tokenize, sent_tokenize 
from itertools import chain
nltk.download('punkt') 
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases look for instead of 'punkt'
nltk.download('stopwords') 
nltk.download('reuters') 

## NLTK ngram functions

There are several functions in NLTK that you can use. We start with `nltk.bigrams()`. It takes a list of tokens as input and gives you every pair of adjacent words. Then we can compute the frequency distribution of those bigrams.

But the most flexible function is `everygrams`, which builds ngrams of several lengths from one input. You give it word tokens (it also works with character tokens) and tell it the shortest and longest ngrams to build. In my example, I say `1, 3`, which means: give me unigrams, bigrams, and trigrams.
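To make it concrete what "ngrams of length 1 to 3" means, here is a rough pure-Python sketch. The helper `simple_everygrams` is hypothetical and is not how NLTK implements `everygrams`; NLTK may also return the ngrams in a different order.

```python
def simple_everygrams(tokens, min_len=1, max_len=3):
    """Return every ngram of length min_len..max_len, ordered by start position."""
    grams = []
    for start in range(len(tokens)):
        for size in range(min_len, max_len + 1):
            # Skip windows that would run past the end of the token list
            if start + size <= len(tokens):
                grams.append(tuple(tokens[start:start + size]))
    return grams

example_tokens = ["I", "really", "like", "you", "."]
for gram in simple_everygrams(example_tokens, 1, 3):
    print(gram)
```

For five tokens this gives 5 unigrams, 4 bigrams, and 3 trigrams.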

In [None]:
sent1 = "I really like you."

sent1_tokens = nltk.word_tokenize(sent1)

sent1_bi = nltk.bigrams(sent1_tokens)

#compute frequency distribution for all the bigrams in the text
sent1_fdist = nltk.FreqDist(sent1_bi)
for key, value in sent1_fdist.items():
    print(key, value)

In [None]:
# Try the same with your own sentence. 
# Copy the code from above to generate bigrams for sent2

sent2 = "I really really very much like you."

sent2_tokens = nltk.word_tokenize(sent2)

sent2_bi = nltk.bigrams(sent2_tokens)

sent2_fdist = nltk.FreqDist(sent2_bi)
for k, v in sent2_fdist.items():
    print(k, v)

In [None]:
# now, everygrams for sent1

sent1_every = everygrams(sent1_tokens, 1, 3)

In [None]:
list(sent1_every)

In [None]:
# build everygrams for your sent2
# you can also build them of different length (1-2, 1-3, 1-4, etc)



## Calculating ngram frequencies from Reuters

Ngrams are really useful when we have large numbers of them and their frequencies. In this part, we take all the sentences in the Reuters corpus and count how often each ngram occurs in them. First, we create a `removal_list` with all the things that we want to strip (punctuation and stopwords). Then, we create the lists of unigrams, bigrams, and trigrams, padding to the left and to the right. Padding just means adding a special "word" that indicates the beginning and end of a sentence, so that the first and last words also participate in all possible bigrams.

For instance, in _I really like you_, we could have the following bigrams:

```
I, really
really, like
like, you
```

But notice how, unlike the other words, _I_ and _you_ only participate in one bigram each. That is because they are the beginning and end of the sentence, and we want the model to know that. Padding adds that information, which here I am representing with the HTML-style tags `<s>` and `</s>`. So then we create the following bigrams:

```
<s>, I
I, really
really, like
like, you
you, </s>
```
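The padded bigrams above can be reproduced with a few lines of plain Python. This is only a sketch: the helper name `padded_bigrams` is made up, and in NLTK itself the `pad_both_ends` function in `nltk.lm.preprocessing` plays a similar role.

```python
def padded_bigrams(tokens, start="<s>", end="</s>"):
    """Add sentence-boundary markers, then pair each item with its successor."""
    padded = [start] + list(tokens) + [end]
    return list(zip(padded, padded[1:]))

for pair in padded_bigrams(["I", "really", "like", "you"]):
    print(pair)
```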

So, we will first create a list of removal words: the punctuation and stopwords that we don't want to include in the lists of ngrams. You can see what it contains below.

Next, we import the Reuters sentences and use `everygrams` to create unigrams, bigrams, and trigrams. We remove those that have words in the removal list. 

After that, `word_salad` is a dictionary with the frequency distribution of those ngrams.

Finally, we use `word_salad` to find the ngrams that start with a certain prompt. Each result is made up of the prompt, possibly followed by a word that comes after it in the corpus. The match ignores case, so, if the prompt is "it will", we'll get something like the following:

```
('it', 'will'), 
('it', 'will', 'be'), 
('It', 'will'), 
('it', 'will', 'pay'), 
('it', 'will', 'not'), 
('IT', 'WILL'), 
('it', 'will', 'continue'), 
('it', 'will', 'have'), 
('it', 'will', 'make'), 
('it', 'will', 'take'), 
('It', 'will', 'be'), 
('it', 'will', 'raise'), 
('it', 'will', 'acquire'), 
('it', 'will', 'also'), 
('it', 'will', 'report'), 
('it', 'will', 'offer'), 
('it', 'will', 'issue'), 
('it', 'will', 'receive'), 
('it', 'will', 'increase'), 
('it', 'will', 'sell'),
 etc.
```

In [None]:
# create the list of things we'll remove (punctuation and stopwords)

stop_words = set(stopwords.words('english'))
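# string.punctuation plus curly quotes and the em dash
# (kept in a local variable rather than overwriting string.punctuation)
punctuation = string.punctuation + '“' + '”' + '‘' + '’' + '—'
removal_list = list(stop_words) + list(punctuation) + ['lt', 'rt']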

In [None]:
len(removal_list)

In [None]:
removal_list

In [None]:
# import Reuters and create ngrams
sents = reuters.sents()

one_to_three_ngrams = chain(*[everygrams(sent, 1, 3, pad_left=True, pad_right=True) for sent in sents])
one_to_three_ngrams = [ng for ng in one_to_three_ngrams if all(word for word in ng if word not in removal_list)]
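
A note on the filter in the cell above, based on my reading of the expression rather than on a run over the full corpus. `all(word for word in ng if word not in removal_list)` checks whether the words that are *not* in the removal list are truthy. As far as I can tell, that mostly drops ngrams that contain the default padding symbol (`None`) and keeps ngrams with stopwords, which would explain why results such as `('it', 'will', 'be')` still show up later. The toy data below is made up to compare that expression with a stricter variant; the names `toy_ngrams`, `toy_removal`, `as_written`, and `strict` are assumptions for illustration.

```python
# Toy ngrams and removal set (assumptions for illustration only)
toy_ngrams = [(None, "Oil"), ("Oil", "prices"), ("the", "oil"),
              ("it", "will"), ("prices", ",")]
toy_removal = {"the", "it", "will", ","}

# Same expression as in the cell above: it tests whether the words that are
# NOT in the removal set are truthy, so only the padded ngram is dropped
as_written = [ng for ng in toy_ngrams
              if all(word for word in ng if word not in toy_removal)]

# Stricter variant: keep an ngram only if none of its words is in the removal set
strict = [ng for ng in toy_ngrams
          if all(word not in toy_removal for word in ng)]

print(as_written)
print(strict)
```

If you switch to the stricter variant, prompts made up only of stopwords, such as "it will", would no longer match anything, and padded ngrams would need string pad symbols before they can be joined into text.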

In [None]:
# get the frequency distribution, so that we can see which combos are more frequent
word_salad = FreqDist(one_to_three_ngrams)

In [None]:
word_salad

In [None]:
# sort the dictionary in reverse order (the result is a list, but that's fine, as we only want to see it)
word_salad_ordered = sorted(word_salad.items(), key=lambda x:x[1], reverse=True)

# print the first 20 items
word_salad_ordered[:20]

In [None]:
# Given an input "prompt"
prefix = 'it will'

# List every ngram that starts with the prompt (the candidate continuations):
print([ng for ng in word_salad if ' '.join(ng).lower().startswith(prefix.lower())])

In [None]:
# try a different prompt

prefix2 = 'they said'

# List every ngram that starts with the prompt (the candidate continuations):
print([ng for ng in word_salad if ' '.join(ng).lower().startswith(prefix2.lower())])
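
The two cells above list every ngram that starts with the prompt, without ranking them explicitly. If you want the *most likely* next word, one option is to add up the counts for each continuation and sort them. The sketch below does that on made-up counts; `rank_next_words` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter

def rank_next_words(ngram_counts, prompt, top_k=3):
    """Rank the words that follow `prompt` by how often they were counted."""
    prompt_tokens = tuple(prompt.lower().split())
    ranked = Counter()
    for ngram, count in ngram_counts.items():
        # Only look at ngrams that are exactly one word longer than the prompt
        if len(ngram) != len(prompt_tokens) + 1:
            continue
        lowered = tuple(word.lower() for word in ngram)
        if lowered[:-1] == prompt_tokens:
            # Merge case variants such as 'It will be' and 'it will be'
            ranked[lowered[-1]] += count
    return ranked.most_common(top_k)

# Toy counts (made up for illustration, not taken from Reuters)
toy_counts = Counter({
    ("it", "will"): 10,
    ("it", "will", "be"): 5,
    ("It", "will", "be"): 2,
    ("it", "will", "pay"): 3,
    ("it", "was", "not"): 4,
})
print(rank_next_words(toy_counts, "it will"))
```

Something like `rank_next_words(word_salad, prefix)` should work the same way, assuming every token in `word_salad` is a string.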

## Generate Globe and Mail articles

Here, we are going to do something a little different, based on a [notebook](https://www.kaggle.com/code/alvations/n-gram-language-model-with-nltk) on how to create a sentence generator from ngrams. 

Because this is a different section, I have all the separate import statements here, so that you see what we need. We will take a sample of article text from the Globe and Mail articles we worked on the other day (100 rows, but you can change that number).  We tokenize the text in the column `article_text`. 

Then, we preprocess the text and build what is essentially a language model (but not a _large_ language model) from the ngrams in the articles. A function produces sentences, one word at a time, from the frequencies in the ngrams. 
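To see what "one word at a time, from the frequencies" means before we hand the work to NLTK, here is a minimal pure-Python sketch. It uses bigrams rather than the trigrams we train below, and the function names and toy sentences are assumptions for illustration; it is not how `nltk.lm.MLE` is implemented.

```python
import random
from collections import Counter, defaultdict

def train_bigram_counts(sentences):
    """Count, for each word, which words follow it (with <s> and </s> padding)."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] + list(sentence) + ["</s>"]
        for prev_word, word in zip(padded, padded[1:]):
            followers[prev_word][word] += 1
    return followers

def generate(followers, max_words=20, seed=42):
    """Sample one word at a time, weighted by bigram counts, until </s>."""
    rng = random.Random(seed)
    word, output = "<s>", []
    for _ in range(max_words):
        options = followers[word]
        # More frequent followers are proportionally more likely to be picked
        word = rng.choices(list(options), weights=list(options.values()))[0]
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

# Toy training data (an assumption for illustration)
toy_sentences = [["the", "cat", "sleeps"], ["the", "dog", "barks"]]
toy_model = train_bigram_counts(toy_sentences)
print(generate(toy_model, seed=1))
```

The same seed always produces the same sentence, which is also how the `random_seed` argument behaves in the NLTK-based function further down.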

I hope you can see how this is a small step towards creating very large language models, simply based on the frequencies of sequences of words. 

In [None]:
import pandas as pd
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
from nltk.tokenize.treebank import TreebankWordDetokenizer

In [None]:
socc_df = pd.read_csv('data/gnm_articles.csv', encoding='utf-8', nrows=100)

In [None]:
socc_df.head()

In [None]:
socc_comments =  list(socc_df['article_text'].apply(word_tokenize))

In [None]:
# Preprocess the tokenized text for 3-gram language modelling:
# the pipeline returns the padded everygrams to train on and the padded words for the vocabulary
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, socc_comments)

In [None]:
# Train a 3-grams model
socc_model = MLE(n) 
socc_model.fit(train_data, padded_sents)

In [None]:
# Create a function that generates sentences from a model

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
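    """
    Generate a sentence from a trained ngram model, one word at a time.

    :param model: An ngram language model from `nltk.lm.model`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    :return: The generated words, detokenized into a single string.
    """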
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

In [None]:
# Now we use it to generate 'sentences' 
# Try changing the max number of words, or the random seed
# A different seed gives a different sentence; the same seed always gives the same one

generate_sent(socc_model, num_words=100, random_seed=70)