Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [A Comprehensive Guide to Build your own Language Model in Python](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/) by Mohd Sanad Zaki Rizvi.

# N-GRAM LANGUAGE MODELS

N-gram language models are based on computing probabilities for the occurrence of each word given _n-1_ previous words.
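
For reference, the estimate computed in the rest of this notebook can be written as a ratio of counts. The notation is introduced here as an assumption and does not come from the original tutorial: $C(\cdot)$ is the number of times a word sequence occurs in the corpus, and the sum in the denominator runs over every word $w$ seen after the same context.

```latex
P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
  = \frac{C(w_{i-n+1} \dots w_{i-1}\, w_i)}{\sum_{w} C(w_{i-n+1} \dots w_{i-1}\, w)}
```

For $n = 1$ the context is empty, so the formula reduces to the relative frequency of each word.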

To "train" such models, we will use the [Reuters](https://www.nltk.org/book/ch02.html) corpus, which contains 10,788 news documents totaling about 1.3 million words.


In [None]:
import nltk
from nltk.corpus import reuters

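# Download the corpus itself
nltk.download("reuters")
# reuters.sents() relies on NLTK's Punkt sentence tokenizer; depending on the
# NLTK version, the required resource is "punkt" or "punkt_tab"
nltk.download("punkt")
nltk.download("punkt_tab")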


We can check how many sentences the corpus contains. Each sentence is a list of tokens (words and punctuation marks).


In [None]:
print(len(reuters.sents()))

print(reuters.sents()[0])
for w in reuters.sents()[0]:
    print(w, end=" ")


## Unigram model

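For starters, let's build a unigram language model, which ignores context and simply estimates each word's probability from its relative frequency in the corpus.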


In [None]:
from collections import defaultdict

# Create a placeholder for the model
uni_model = defaultdict(int)

# Count the frequency of each token
for sentence in reuters.sents():
    for w in sentence:
        uni_model[w] += 1


Now that we have the counts, we need to transform them into probabilities:


In [None]:
total_count = float(sum(uni_model.values()))
for w in uni_model:
    uni_model[w] /= total_count


In [None]:
uni_model
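
To make the count-to-probability step easier to inspect, here is a minimal sketch on a made-up three-sentence corpus (`toy_sents` and `toy_uni` are hypothetical names, not part of the notebook). It follows the same idea as above: count tokens, then divide by the total.

```python
from collections import Counter

# Toy corpus standing in for reuters.sents(); these sentences are made up.
toy_sents = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

# Count every token, then divide each count by the total number of tokens.
counts = Counter(w for sentence in toy_sents for w in sentence)
total = sum(counts.values())
toy_uni = {w: c / total for w, c in counts.items()}

print(toy_uni["the"])          # 3 occurrences out of 9 tokens
print(sum(toy_uni.values()))   # should be (very close to) 1.0
```

Checking that the probabilities sum to roughly 1 is a cheap sanity test that also applies to the full `uni_model`.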


#### Likely words

How likely is the word 'the'?


In [None]:
# your code here
print(uni_model["the"])


What is the most likely word in the corpus?


In [None]:
# your code here
print(max(uni_model, key=uni_model.get))


#### Generating text

Based on this unigram language model, we can try generating some text. It will not be pretty, though...


In [None]:
import random

# number of words to generate
total_words = 100
text = []

for i in range(total_words):
    # select a random probability threshold
    r = random.random()

    # select word above the probability threshold
    accumulator = 0.0
    for word in uni_model.keys():
        accumulator += uni_model[word]
        if accumulator >= r:
            text.append(word)
            break

print(" ".join(text))
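
As an alternative to the explicit accumulator loop, Python's standard library offers `random.choices`, which draws items in proportion to given weights. The sketch below uses a small made-up distribution (`toy_uni` is a hypothetical stand-in for `uni_model`) and a seeded generator only so that runs are repeatable.

```python
import random

# Hypothetical toy distribution standing in for uni_model.
toy_uni = {"the": 0.5, "cat": 0.3, "sat": 0.2}

rng = random.Random(0)  # seeded only so the sketch is repeatable
words = list(toy_uni)
weights = [toy_uni[w] for w in words]

# random.choices draws k items with replacement, proportionally to the weights.
sample = rng.choices(words, weights=weights, k=10)
print(" ".join(sample))
```

With the real model, the same call would take `list(uni_model)` and the matching probabilities as weights.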


## Bigram model

In a bigram model, we'll compute the probability of each word given the previous word as context. To obtain bigrams, we can use NLTK's [bigrams](https://www.nltk.org/_modules/nltk/util.html#bigrams). When doing so, we can pad the input on the left and on the right, and define our own sequence-start and sequence-end symbols.

We first need to obtain the counts:


In [None]:
from nltk import bigrams

# Create a placeholder for the model
bi_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each bigram
for sentence in reuters.sents():
    for w1, w2 in bigrams(
        sentence,
        pad_right=True,
        pad_left=True,
        left_pad_symbol="<s>",
        right_pad_symbol="</s>",
    ):
        bi_model[w1][w2] += 1


In [None]:
bi_model
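
To see what padding does to a single sentence, the sketch below uses a small pure-Python helper. `padded_ngrams` is a hypothetical function written for illustration; it is intended to mirror the padding used above (n-1 symbols on each side) without depending on NLTK.

```python
def padded_ngrams(tokens, n, left="<s>", right="</s>"):
    """Return the n-grams of tokens after padding n-1 symbols on each side."""
    padded = [left] * (n - 1) + list(tokens) + [right] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


# Made-up sentence, used only to show the shape of the output.
toy_sentence = ["prices", "rose", "today"]
for gram in padded_ngrams(toy_sentence, 2):
    print(gram)
```

The first bigram pairs the start symbol with the first word and the last one pairs the final word with the end symbol, which is what lets the model learn how sentences begin and end.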


As before, we need to transform counts into probabilities. For that, we divide each count by the total number of occurrences of the first word in the bigram.


In [None]:
# your code here
for w1 in bi_model:
    total_count = float(sum(bi_model[w1].values()))
    for w2 in bi_model[w1]:
        bi_model[w1][w2] /= total_count


In [None]:
bi_model
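
The normalization can be checked by hand on a tiny example. This sketch builds a bigram model from a made-up corpus (`toy_sents` and `toy_bi` are hypothetical names) and pads each sentence manually instead of calling NLTK.

```python
from collections import defaultdict

# Toy corpus standing in for reuters.sents(); these sentences are made up.
toy_sents = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

# Count bigrams, padding each sentence with start and end symbols.
toy_bi = defaultdict(lambda: defaultdict(float))
for sentence in toy_sents:
    padded = ["<s>"] + sentence + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        toy_bi[w1][w2] += 1

# Divide each count by the total count of bigrams starting with w1.
for w1 in toy_bi:
    row_total = sum(toy_bi[w1].values())
    for w2 in toy_bi[w1]:
        toy_bi[w1][w2] /= row_total

# "the" is followed by "cat" twice and by "dog" once.
print(dict(toy_bi["the"]))
```

After normalization, the probabilities of the words following any given first word sum to 1.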


#### Likely pairs

What are the probabilities of each word following 'today'?


In [None]:
# your code here
print(bi_model["today"])


What are the probabilities for sentence-starting words? What do most of them have in common? (Hint: check the _left_pad_symbol_ defined above for collecting bigrams.)


In [None]:
# your code here
print(bi_model["<s>"])


#### Generating text

Now that we have a bigram model, we can generate text based on it.


In [None]:
import random

# sequence start symbol
text = ["<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()

    # select the first word whose cumulative probability reaches the threshold,
    # conditioned on the previous word text[-1]
    accumulator = 0.0
    for word in bi_model[text[-1]].keys():
        accumulator += bi_model[text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break

print(" ".join([t for t in text if t]))
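
The sampling step can be factored into a small function. The sketch below is one possible way to do it, not the notebook's own code: `sample_next` is a hypothetical helper and `toy_bi` is a hand-written model. It also handles a rare edge case: because of floating-point rounding, the probabilities may sum to slightly less than 1, so the accumulator might never reach the threshold.

```python
import random


def sample_next(dist, rng=random):
    """Draw one key from a {word: probability} dict by cumulative sum."""
    r = rng.random()
    accumulator = 0.0
    word = None
    for word, p in dist.items():
        accumulator += p
        if accumulator >= r:
            return word
    # Rounding can leave the total slightly below 1.0; fall back to the last word.
    return word


# Hand-written toy bigram model (already normalized).
toy_bi = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

rng = random.Random(1)  # seeded only so the sketch is repeatable
text = ["<s>"]
while text[-1] != "</s>":
    text.append(sample_next(toy_bi[text[-1]], rng))

print(" ".join(text))
```

In the loop above, a missed threshold simply triggers another draw, so the fallback mainly matters when a fixed number of words is expected, as in the unigram generator.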


## Trigram model

In a trigram model, we'll compute the probability of each word given the previous two words as context. To obtain trigrams, we can use NLTK's [trigrams](https://www.nltk.org/_modules/nltk/util.html#trigrams).


In [None]:
# your code here
from nltk import trigrams

# Create a placeholder for the model
tri_model = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0)))

# Count the frequency of each trigram
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(
        sentence,
        pad_right=True,
        pad_left=True,
        left_pad_symbol="<s>",
        right_pad_symbol="</s>",
    ):
        tri_model[w1][w2][w3] += 1

In [None]:
tri_model


#### Likely triplets

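Before answering the questions below, we need to transform the trigram counts into probabilities, as we did for the bigram model.

What are the most likely words following "today the"?
What about "England has"?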


In [None]:
for w1 in tri_model:
    for w2 in tri_model[w1]:
        total_count = float(sum(tri_model[w1][w2].values()))
        for w3 in tri_model[w1][w2]:
            tri_model[w1][w2][w3] /= total_count

In [None]:
tri_model


In [None]:
tri_model["today"]["the"]


In [None]:
tri_model["England"]["has"]
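
Printing a whole dictionary makes it hard to spot the most likely continuations. A minimal sketch for ranking them is shown below; `top_continuations` is a hypothetical helper and the probabilities in `toy_dist` are made up, not taken from the Reuters corpus.

```python
import heapq


def top_continuations(dist, k=3):
    """Return the k (word, probability) pairs with the highest probability."""
    return heapq.nlargest(k, dist.items(), key=lambda item: item[1])


# Made-up probabilities standing in for something like tri_model["today"]["the"].
toy_dist = {"company": 0.4, "bank": 0.25, "dollar": 0.2, "market": 0.15}
print(top_continuations(toy_dist, k=2))  # [('company', 0.4), ('bank', 0.25)]
```

With the real model, the call would look like `top_continuations(tri_model["today"]["the"])`.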


#### Generating text

Create your text generator based on the trigram model. Does the generated text start to read a bit more coherently?


In [None]:
# your code here
import random

# sequence start symbol
text = ["<s>", "<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()

    # select the first word whose cumulative probability reaches the threshold,
    # conditioned on the previous two words text[-2] and text[-1]
    accumulator = 0.0
    for word in tri_model[text[-2]][text[-1]].keys():
        accumulator += tri_model[text[-2]][text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break

print(" ".join([t for t in text[1:] if t]))


## N-gram models

For larger _n_, we can use NLTK's [n-grams](https://www.nltk.org/_modules/nltk/util.html#ngrams), which allows us to choose an arbitrary _n_.

Create your own 4-gram model.


In [None]:
# your code here
from nltk import ngrams

# Create a placeholder for the model
four_model = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0))))

# Count the frequency of each 4-gram
for sentence in reuters.sents():
    for w1, w2, w3, w4 in ngrams(
        sentence,
        4,
        pad_right=True,
        pad_left=True,
        left_pad_symbol="<s>",
        right_pad_symbol="</s>",
    ):
        four_model[w1][w2][w3][w4] += 1

In [None]:
four_model
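
Nesting one `defaultdict` per context word becomes unwieldy as _n_ grows. One possible alternative, sketched here under the assumption that keying the model by a tuple of context words is acceptable, works for any _n_ with the same code. `train_ngram_model` is a hypothetical helper, and it pads sentences by hand so that the sketch does not depend on NLTK.

```python
from collections import Counter, defaultdict


def train_ngram_model(sentences, n, left="<s>", right="</s>"):
    """Map each (n-1)-word context tuple to a {next_word: probability} dict."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad n-1 symbols on each side, as in the cells above.
        padded = [left] * (n - 1) + list(sentence) + [right] * (n - 1)
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            counts[context][padded[i + n - 1]] += 1

    # Normalize each context's counts into probabilities.
    model = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        model[context] = {w: c / total for w, c in counter.items()}
    return model


# Toy corpus standing in for reuters.sents(); these sentences are made up.
toy_sents = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
toy_tri = train_ngram_model(toy_sents, 3)
print(toy_tri[("the", "cat")])  # {'sat': 0.5, 'ran': 0.5}
```

With this layout, a 4-gram lookup such as `four_model["today"]["the"]["public"]` would become `model[("today", "the", "public")]`.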

#### Likely tuples

After transforming the counts into probabilities, check the most likely words following "today the public".


In [None]:
# your code here
for w1 in four_model:
    for w2 in four_model[w1]:
        for w3 in four_model[w1][w2]:
            total_count = float(sum(four_model[w1][w2][w3].values()))
            for w4 in four_model[w1][w2][w3]:
                four_model[w1][w2][w3][w4] /= total_count

In [None]:
four_model

In [None]:
four_model["today"]["the"]["public"]

#### Generating text

Create your text generator based on the 4-gram model. Even better, huh?


In [None]:
# your code here
import random

# sequence start symbol
text = ["<s>", "<s>", "<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()

    # select the first word whose cumulative probability reaches the threshold,
    # conditioned on the previous three words text[-3], text[-2] and text[-1]
    accumulator = 0.0
    for word in four_model[text[-3]][text[-2]][text[-1]].keys():
        accumulator += four_model[text[-3]][text[-2]][text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break

print(" ".join([t for t in text[2:] if t]))
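
The three generators above differ only in how many previous words they look at. As a sketch of how they could be unified, the function below samples from a model keyed by context tuples. Both `generate` and the hand-written `toy_tri` model are hypothetical; the `max_words` cap and the early stop on unseen contexts are additions that the cells above do not have.

```python
import random


def generate(model, n, rng=random, max_words=50, left="<s>", right="</s>"):
    """Sample words from a tuple-keyed n-gram model until the end symbol."""
    text = [left] * (n - 1)
    while len(text) - (n - 1) < max_words:
        # The context is the last n-1 generated tokens (empty tuple for n=1).
        context = tuple(text[len(text) - (n - 1):])
        dist = model.get(context)
        if not dist:
            break  # unseen context: nothing to sample from
        words = list(dist)
        word = rng.choices(words, weights=[dist[w] for w in words], k=1)[0]
        if word == right:
            break
        text.append(word)
    return text[n - 1:]  # drop the start-of-sequence padding


# Hand-written toy trigram model (already normalized).
toy_tri = {
    ("<s>", "<s>"): {"the": 1.0},
    ("<s>", "the"): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sat": 1.0},
    ("the", "dog"): {"sat": 1.0},
    ("cat", "sat"): {"</s>": 1.0},
    ("dog", "sat"): {"</s>": 1.0},
}

words = generate(toy_tri, 3, rng=random.Random(0))
print(" ".join(words))
```

Unlike the cells above, this version returns the words without the padding symbols, which makes the output easier to read.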
