At their core, n-gram language models estimate the probability of a word from counts of the words that precede it. For a trigram model ($n = 3$), the estimate is: $p(w_3|w_1, w_2) = \frac{\textit{count}(w_1,w_2, w_3)}{\textit{count}(w_1, w_2)}$ .

That is to say, they draw each word from a conditional distribution estimated from counts, and they rely on the Markov assumption: the next word depends only on the previous $n-1$ words, not on the whole history.
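To make the formula concrete, here is a minimal sketch that applies it to a tiny made-up corpus. The names `toy_tokens` and `trigram_probability` are illustrative assumptions, not part of the model we build below.

```python
from collections import Counter

# Toy corpus, made up for illustration only.
toy_tokens = "the cat sat on the cat sat down".split()

# Count every run of three consecutive words and every run of two.
trigram_counts = Counter(zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]))
bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))

def trigram_probability(w1, w2, w3):
    """Estimate p(w3 | w1, w2) as count(w1, w2, w3) / count(w1, w2)."""
    context_count = bigram_counts[(w1, w2)]
    if context_count == 0:
        return 0.0  # unseen context: the counts give no estimate
    return trigram_counts[(w1, w2, w3)] / context_count

print(trigram_probability("the", "cat", "sat"))  # "the cat" is always followed by "sat"
print(trigram_probability("cat", "sat", "on"))   # "cat sat" is followed by "on" half the time
```

In this toy corpus "cat sat" occurs twice, once before "on" and once before "down", so each continuation gets half of the probability mass.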

Let's build our own n-gram language model!

# Dataset
First we'll need to get our data ready. We'll use a Shakespeare dataset.



In [28]:
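# Read the whole corpus; the with-block closes the file for us.
with open("shake.txt", "r") as file:
    content = file.read()
# Lowercase everything and split on whitespace to get the tokens.
tokens = content.lower().split()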
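It is worth seeing what this tokenization does before moving on. The sketch below runs the same `lower().split()` step on a made-up string (`sample` is an assumption for illustration, not a line from `shake.txt`).

```python
# Made-up text, used only to illustrate the tokenization step.
sample = "How now, my lord?\nIndeed, the King HENRY."

# Same recipe as above: lowercase, then split on any run of whitespace.
sample_tokens = sample.lower().split()
print(sample_tokens)
# ['how', 'now,', 'my', 'lord?', 'indeed,', 'the', 'king', 'henry.']
```

Punctuation stays attached to the word next to it, so "now," and "now" would count as two different tokens. That explains tokens such as "henry." and "now," in the generated text further down.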

# Utility functions

In [33]:
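def generate_ngrams(tokens, n):
    # Zip n shifted copies of the token list so each tuple holds n consecutive tokens.
    ngrams = zip(*[tokens[i:] for i in range(n)])
    # Join each tuple into a single space-separated string.
    return [" ".join(ngram) for ngram in ngrams]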

from collections import defaultdict

# Generate the unigrams and bigrams
unigrams = generate_ngrams(tokens, 1)
bigrams = generate_ngrams(tokens, 2)

# Count occurrences
unigram_model = defaultdict(int)
bigram_model = defaultdict(int)

for unigram in unigrams:
    unigram_model[unigram] += 1

for bigram in bigrams:
    bigram_model[bigram] += 1

# Turn counts into conditional probabilities:
# p(w2 | w1) = count(w1 w2) / count(w1)
bigram_probabilities = defaultdict(dict)

for bigram in bigram_model:
    w1, w2 = bigram.split()
    bigram_probabilities[w1][w2] = bigram_model[bigram] / unigram_model[w1]



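def predict_next_word(word):
    # Greedy choice: return the most probable successor, or None if the word has none.
    next_words = bigram_probabilities.get(word, {})
    return max(next_words, key=next_words.get) if next_words else None

def next_n_words(word, n):
    sentence = word
    for _ in range(n):
        word = predict_next_word(word)
        if word is None:
            # Stop early: no bigram starts with the current word.
            break
        sentence = sentence + " " + word
    return sentence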
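The zip idiom in `generate_ngrams` is compact, so here is a small sketch of what it produces on a toy token list. `toy_ngrams` is a stand-alone copy written for illustration, and `toy` is made-up input.

```python
def toy_ngrams(tokens, n):
    # Same idiom as generate_ngrams: n shifted copies of the list, zipped together.
    shifted = [tokens[i:] for i in range(n)]
    # zip stops at the shortest copy, so we get len(tokens) - n + 1 n-grams.
    return [" ".join(group) for group in zip(*shifted)]

toy = ["to", "be", "or", "not", "to", "be"]
print(toy_ngrams(toy, 2))
# ['to be', 'be or', 'or not', 'not to', 'to be']
print(toy_ngrams(toy, 3))
# ['to be or', 'be or not', 'or not to', 'not to be']
```

Repeated n-grams such as "to be" appear once per occurrence, which is what lets the counting loops above tally them.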

Let's test it out:

In [41]:
next_n_words("indeed", 5)

'indeed the king henry. i am'

In [44]:
next_n_words("why", 5)

'why should be a man of'

In [45]:
next_n_words("how", 5)

'how now, my lord, i am'

I'm noticing a trend.
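One plausible reason: `predict_next_word` always takes the single most likely successor, so any two sentences that reach the same word continue identically from there. A possible alternative, in line with the idea of sampling from a conditional distribution, is to draw the next word in proportion to its probability. The sketch below assumes a table shaped like `bigram_probabilities`; `toy_probabilities`, `sample_next_word`, and `sample_n_words` are hypothetical names, and the numbers are made up.

```python
import random

# Hypothetical toy table in the same shape as bigram_probabilities:
# {previous_word: {next_word: probability}}. The numbers are made up.
toy_probabilities = {
    "i": {"am": 0.6, "have": 0.4},
    "am": {"a": 0.5, "not": 0.5},
    "have": {"a": 1.0},
}

def sample_next_word(word, table, rng=random):
    """Draw the next word in proportion to its conditional probability."""
    next_words = table.get(word)
    if not next_words:
        return None  # no observed successor for this word
    candidates = list(next_words)
    weights = [next_words[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def sample_n_words(word, n, table, rng=random):
    words = [word]
    for _ in range(n):
        word = sample_next_word(word, table, rng)
        if word is None:
            break  # dead end: stop instead of failing
        words.append(word)
    return " ".join(words)

rng = random.Random(0)  # fixed seed so repeated runs give the same draw
print(sample_n_words("i", 3, toy_probabilities, rng))
```

Passing `bigram_probabilities` as `table` would apply the same idea to the Shakespeare model. Different seeds can then give different continuations from the same starting word.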