Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [A Comprehensive Guide to Build your own Language Model in Python](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/) by Mohd Sanad Zaki Rizvi.

# N-GRAM LANGUAGE MODELS

N-gram language models are based on computing probabilities for the occurrence of each word given *n-1* previous words.
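As a sketch of what "computing probabilities" means in the cells that follow, the estimate used throughout this notebook divides the count of an n-gram by the total count of its context. The notation $C(\cdot)$ for corpus counts is an assumption introduced here for illustration.

```latex
P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
  = \frac{C(w_{i-n+1}, \dots, w_{i-1}, w_i)}
         {\sum_{w} C(w_{i-n+1}, \dots, w_{i-1}, w)}
```

For *n = 1* the context is empty, so the estimate reduces to the count of the word divided by the total number of tokens.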

To "train" such models, we will use the [Reuters](https://www.nltk.org/book/ch02.html) corpus, which contains 10,788 news documents totaling 1.3 million words.

In [73]:
import nltk
nltk.download('reuters')

from nltk.corpus import reuters

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/pedromacedo/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

We can check how many sentences the corpus contains. Each sentence is a list of tokens (words and punctuation marks).

In [74]:
print(len(reuters.sents()))

print(reuters.sents()[0])
for w in reuters.sents()[0]:
    print(w, end=' ')

54716
['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said . 

## Unigram model

For starters, let's build a unigram language model.

In [75]:
from collections import defaultdict

# Create a placeholder for the model
uni_model = defaultdict(int)

# Count the frequency of each token
for sentence in reuters.sents():
    for w in sentence:
        uni_model[w] += 1

Now that we have the counts, we need to transform them into probabilities:

In [76]:
total_count = float(sum(uni_model.values())) # total number of words
for w in uni_model:
    uni_model[w] /= total_count
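To make the count-then-normalize step concrete, here is a minimal sketch on a made-up two-sentence corpus. The names `toy_sents`, `toy_counts` and `toy_uni` are hypothetical and are not part of the notebook's model.

```python
from collections import Counter

# Toy corpus (made up for illustration): each sentence is a list of tokens
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]

# Count every token, then divide each count by the total number of tokens
toy_counts = Counter(w for sent in toy_sents for w in sent)
toy_total = sum(toy_counts.values())
toy_uni = {w: c / toy_total for w, c in toy_counts.items()}

print(toy_uni["the"])         # 'the' accounts for 2 of the 7 tokens
print(sum(toy_uni.values()))  # the probabilities should sum to about 1.0
```

Checking that the probabilities sum to (approximately) one is a quick sanity test that also applies to `uni_model`.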

#### Likely words

How likely is the word 'the'?

In [77]:
# your code here
print(uni_model['the'])

0.03384881432399122


What is the most likely word in the corpus?

In [78]:
# your code here
print(max(uni_model, key=uni_model.get))

.


#### Generating text

Based on this unigram language model, we can try generating some text. It will not be pretty, though...

In [79]:
import random

# number of words to generate
total_words = 100
text = []

for i in range(total_words):
    # select a random probability threshold
    r = random.random()

    # select word above the probability threshold
    accumulator = .0
    for word in uni_model.keys():
        accumulator += uni_model[word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))

3 . reported 8218 legislation shrs billion IN Computer hit which to the . CORP , the . , 149 L , 13 about violating supplied Chemicals accounts the 0 owned a expected , Shr disputes decelerate , S and A 98 Oper offer Paterson NET parts letter The developments retaliation week 1 , the in 17 . ( Samuel the is Capital June been TO its American 859 Note dlr per mln vs American pct That a on only loss market GE said , vs of CORP oper K quarter 166 major mln the of General of production 2
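The loop above walks through the vocabulary, accumulating probability mass until it passes a random threshold. As an alternative sketch, the standard library's `random.choices` performs the same kind of weighted draw. The distribution `toy_uni` below is a hypothetical stand-in for `uni_model`, and the fixed seed is only there to make runs repeatable.

```python
import random

# Hypothetical toy distribution standing in for uni_model
toy_uni = {"the": 0.5, "cat": 0.3, "sat": 0.2}

rng = random.Random(0)  # fixed seed so that runs are repeatable
words = list(toy_uni)
weights = list(toy_uni.values())

# random.choices draws k items with replacement, proportionally to the weights
sample = rng.choices(words, weights=weights, k=10)
print(" ".join(sample))
```

With the real model, `words` and `weights` would be built once from `uni_model` and reused for every draw, which avoids rescanning the vocabulary for each generated word.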


## Bigram model

In a bigram model, we'll compute the probability of each word given the previous word as context. To obtain bigrams, we can use NLTK's [bigrams](https://www.nltk.org/_modules/nltk/util.html#bigrams). When doing so, we can pad the input on the left and on the right, and define our own sequence-start and sequence-end symbols.

We first need to obtain the counts:

In [80]:
from nltk import bigrams

# Create a placeholder for the model
bi_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each bigram (a bigram is a two-word sequence)
for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        bi_model[w1][w2] += 1

As before, we need to transform counts into probabilities. For that, we divide each count by the total number of occurrences of the first word in the bigram.

In [81]:
# your code here
for w1 in bi_model:
    total_w1 = float(sum(bi_model[w1].values()))
    for w2 in bi_model[w1]:
        bi_model[w1][w2] /= total_w1
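The same two steps (count bigrams, then divide by the total for the first word) can be checked by hand on a tiny corpus. This is a minimal sketch: the padding is done with plain list concatenation rather than NLTK, and `toy_sents` and `toy_bi` are hypothetical names.

```python
from collections import defaultdict

# Toy corpus (made up for illustration)
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# Count bigrams over sentences padded with start and end symbols
toy_bi = defaultdict(lambda: defaultdict(int))
for sent in toy_sents:
    padded = ["<s>"] + sent + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        toy_bi[w1][w2] += 1

# Normalize: divide each count by the total count of the first word
for w1 in toy_bi:
    total_w1 = sum(toy_bi[w1].values())
    for w2 in toy_bi[w1]:
        toy_bi[w1][w2] /= total_w1

print(dict(toy_bi["the"]))  # 'cat' and 'dog' each follow 'the' once
print(dict(toy_bi["<s>"]))  # every toy sentence starts with 'the'
```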

#### Likely pairs

What are the probabilities of each word following 'today'?

In [82]:
# your code here
bi_model['today']

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {'.': 0.18636363636363637,
             'to': 0.0659090909090909,
             "'": 0.10681818181818181,
             'and': 0.025,
             'as': 0.013636363636363636,
             ',': 0.16363636363636364,
             'with': 0.007575757575757576,
             'by': 0.020454545454545454,
             'when': 0.0030303030303030303,
             'on': 0.011363636363636364,
             'recommended': 0.0007575757575757576,
             'he': 0.005303030303030303,
             'its': 0.0022727272727272726,
             'for': 0.01893939393939394,
             'De': 0.0007575757575757576,
             'European': 0.0007575757575757576,
             'described': 0.0007575757575757576,
             'the': 0.013636363636363636,
             ',"': 0.007575757575757576,
             'they': 0.0015151515151515152,
             'issued': 0.0015151515151515152,
             'being': 0.0007575757575757576,
            

What are the probabilities for sentence-starting words? What do most of them have in common? (Hint: check the *left_pad_symbol* defined above for collecting bigrams.)

In [83]:
# your code here
bi_model['<s>']

# Most sentence-starting tokens begin with an uppercase letter
# (the opening quotation mark '"' is a notable exception)

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {'ASIAN': 7.31047591198187e-05,
             'They': 0.008151180641859785,
             'But': 0.019263104028072228,
             'The': 0.16154324146501936,
             'Unofficial': 1.8276189779954676e-05,
             '"': 0.06559324512025733,
             'In': 0.02522114189633745,
             'Threat': 3.655237955990935e-05,
             'Taiwan': 0.0006944952116382777,
             'Retaliation': 5.482856933986402e-05,
             'A': 0.013963008991885371,
             'Last': 0.0036917903355508444,
             'Much': 0.0001462095182396374,
             'He': 0.028986036991008116,
             'Meanwhile': 0.0007493237809781417,
             'Japan': 0.0020286570655749687,
             'Deputy': 0.0001462095182396374,
             'CHINA': 0.0009138094889977337,
             'It': 0.03231230353095987,
             'JAPAN': 0.002997295123912567,
             'MITI': 0.0002193142773594561,
             

#### Generating text

Now that we have a bigram model, we can generate text based on it.

In [84]:
import random

# sequence start symbol
text = ["<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select word above the probability threshold, conditioned on the previous word text[-1]
    # your code here

    possible_next_words = bi_model[text[-1]]
    cumulative_prob = 0.0
    for word in possible_next_words:
        cumulative_prob += possible_next_words[word]
        if cumulative_prob >= r:
            text.append(word)
            break
    
    

print(' '.join(text))

<s> The Belgian coast presence of the contract . S . </s>


## Trigram model

In a trigram model, we'll compute the probability of each word given the previous two words as context. To obtain trigrams, we can use NLTK's [trigrams](https://www.nltk.org/_modules/nltk/util.html#trigrams).

In [85]:
from nltk import trigrams

# Create a placeholder for the model
tri_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each trigram
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(
        sentence,
        pad_right=True,
        pad_left=True,
        left_pad_symbol="<s>",
        right_pad_symbol="</s>",
    ):
        tri_model[(w1,w2)][w3] += 1

In [86]:
for w1_w2 in tri_model:
    total_w1_w2 = float(sum(tri_model[w1_w2].values()))
    for w3 in tri_model[w1_w2]:
        tri_model[w1_w2][w3] /= total_w1_w2

#### Likely triplets

What are the most likely words following "today the"?
What about "England has"?

In [87]:
# your code here
# Lists every word observed after each context, in insertion order (not ranked by probability)
print("today the _:", list(tri_model[('today', 'the')]))

print("England has _: ", list(tri_model[('England', 'has')]))

today the _: ['public', 'European', 'Bank', 'price', 'emirate', 'overseas', 'newspaper', 'company', 'Turkish', 'increase', 'options', 'Higher', 'pound', 'Italian', 'time']
England has _:  ['carried', 'been', 'recently']
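Listing the keys of the inner dictionary returns every observed continuation in insertion order, which does not say which ones are most likely. A minimal sketch of ranking by probability follows. The helper `top_k` and the distribution `toy_next` are hypothetical, with `toy_next` standing in for something like `tri_model[('today', 'the')]`.

```python
# Hypothetical next-word distribution (values made up for illustration)
toy_next = {"public": 0.1, "company": 0.4, "Bank": 0.3, "price": 0.2}

def top_k(dist, k=3):
    """Return the k (word, probability) pairs with the highest probability."""
    return sorted(dist.items(), key=lambda item: item[1], reverse=True)[:k]

print(top_k(toy_next))
# [('company', 0.4), ('Bank', 0.3), ('price', 0.2)]
```

Applied to the notebook's model, a call such as `top_k(tri_model[('today', 'the')])` would return the three most probable continuations together with their probabilities.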


#### Generating text

Create your own text generator based on the trigram model. Does the generated text start to read a bit more naturally?

In [88]:
# your code here
import random

# sequence start symbol
text = ["<s>", "<s>"]

# generate text until we find the end of sequence symbol
while text[-2:] != ["</s>", "</s>"]:
    # select a random probability threshold
    r = random.random()

    possible_next_words = tri_model[tuple(text[-2:])]
    cumulative_prob = 0.0
    for word in possible_next_words:
        cumulative_prob += possible_next_words[word]
        if cumulative_prob >= r:
            text.append(word)
            break


print(" ".join(text))

<s> <s> COMMUNICATIONS CORP OF AMERICA INC & lt ; GTE > CITES STRONG PROSPECTS AM International Inc ' s inexperience in the General Agreement on Tariffs and Trade against an expected restructuring later this year ' s needs , a coffee analyst in New Orleans elevators , the increasing focus on economy - wide average of around 85 mln dlrs last year . </s> </s>


## N-gram models

For larger *n*, we can use NLTK's [ngrams](https://www.nltk.org/_modules/nltk/util.html#ngrams), which lets us choose an arbitrary *n*.
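The unigram, bigram and trigram cells all follow the same pattern, so the idea can be sketched once for any *n*. The helpers `padded_ngrams` and `build_ngram_model` below are hypothetical, written in plain Python rather than with NLTK, and assume each side is padded with *n-1* boundary symbols. Unlike `bi_model`, every context here is a tuple, even when it holds a single word.

```python
from collections import defaultdict

def padded_ngrams(tokens, n, left="<s>", right="</s>"):
    """Yield n-tuples over tokens padded with n-1 boundary symbols per side."""
    padded = [left] * (n - 1) + list(tokens) + [right] * (n - 1)
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])

def build_ngram_model(sentences, n):
    """Map each (n-1)-word context to a dict of next-word probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for gram in padded_ngrams(sent, n):
            counts[gram[:-1]][gram[-1]] += 1
    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {w: c / total for w, c in followers.items()}
    return model

# Toy corpus (made up for illustration)
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]
toy_tri = build_ngram_model(toy_sents, 3)
print(toy_tri[("<s>", "<s>")])  # distribution over sentence-initial words
print(toy_tri[("<s>", "the")])  # distribution over words following an initial 'the'
```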

Create your own 4-gram model.

In [89]:
# your code here
from nltk import ngrams

# Create a placeholder for the model
four_model = defaultdict(lambda: defaultdict(lambda: 0))
n = 4  # order of the model (4-gram)

# Count the frequency of each 4-gram
for sentence in reuters.sents():
    for w1, w2, w3, w4 in ngrams(
        sentence,
        n,
        pad_right=True,
        pad_left=True,
        left_pad_symbol="<s>",
        right_pad_symbol="</s>"
    ):
        four_model[(w1,w2,w3)][w4] += 1

four_model

defaultdict(<function __main__.<lambda>()>,
            {('<s>',
              '<s>',
              '<s>'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'ASIAN': 4,
                          'They': 446,
                          'But': 1054,
                          'The': 8839,
                          'Unofficial': 1,
                          '"': 3589,
                          'In': 1380,
                          'Threat': 2,
                          'Taiwan': 38,
                          'Retaliation': 3,
                          'A': 764,
                          'Last': 202,
                          'Much': 8,
                          'He': 1586,
                          'Meanwhile': 41,
                          'Japan': 111,
                          'Deputy': 8,
                          'CHINA': 50,
                          'It': 1768,
                          'JAPAN': 164,
                          'MITI': 12,
       

In [90]:
for sequence in four_model:
    total_seq = float(sum(four_model[sequence].values()))
    for w4 in four_model[sequence]:
        four_model[sequence][w4] /= total_seq

four_model

defaultdict(<function __main__.<lambda>()>,
            {('<s>',
              '<s>',
              '<s>'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'ASIAN': 7.31047591198187e-05,
                          'They': 0.008151180641859785,
                          'But': 0.019263104028072228,
                          'The': 0.16154324146501936,
                          'Unofficial': 1.8276189779954676e-05,
                          '"': 0.06559324512025733,
                          'In': 0.02522114189633745,
                          'Threat': 3.655237955990935e-05,
                          'Taiwan': 0.0006944952116382777,
                          'Retaliation': 5.482856933986402e-05,
                          'A': 0.013963008991885371,
                          'Last': 0.0036917903355508444,
                          'Much': 0.0001462095182396374,
                          'He': 0.028986036991008116,
                          'Meanwhile

#### Likely tuples

Check the most likely words following "today the public".

In [91]:
from nltk import word_tokenize

phrase = "today the public"
tokens = tuple(word_tokenize(phrase))

print("Tokens: ", tokens)
print("Final result: ", list(four_model[tokens]))

Tokens:  ('today', 'the', 'public')
Final result:  ['is']


#### Generating text

Create your own text generator based on the 4-gram model. Even better, huh?

In [92]:
# your code here
import random

# Sequence start symbol
text = ["<s>", "<s>", "<s>"]

# Generate text until it reaches the end of the sequence
while text[-3:] != ["</s>", "</s>", "</s>"]:
    # Select a random probability threshold
    r = random.random()

    possible_next_words = four_model[tuple(text[-3:])]
    cumulative_prob = 0.0

    for word in possible_next_words:
        cumulative_prob += possible_next_words[word]
        if cumulative_prob >= r:
            text.append(word)
            break

print(" ".join(text))

<s> <s> <s> For the first quarter a loss more than twice the size of the breeding herd in the December quarter were probably artifically inflated by unusually high gross margins on gasoline sales of 12 . 6 mln </s> </s> </s>
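The bigram, trigram and 4-gram generators differ only in how many previous words they condition on, so they can be folded into a single function. The sketch below is one possible way to do that, assuming *n ≥ 2* and a model that maps context tuples to next-word probabilities. The function `generate`, its `max_tokens` safety limit, and the hand-written `toy_tri` model are all hypothetical.

```python
import random

def generate(model, n, rng, max_tokens=50):
    """Sample tokens from an n-gram model (n >= 2) until n-1 end symbols appear."""
    text = ["<s>"] * (n - 1)
    end = ["</s>"] * (n - 1)
    while text[-(n - 1):] != end and len(text) < max_tokens:
        followers = model[tuple(text[-(n - 1):])]
        words = list(followers)
        weights = list(followers.values())
        # Weighted draw of the next word given the last n-1 words
        text.append(rng.choices(words, weights=weights, k=1)[0])
    return text

# Hand-written toy trigram model (probabilities made up for illustration)
toy_tri = {
    ("<s>", "<s>"): {"the": 1.0},
    ("<s>", "the"): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sat": 1.0},
    ("the", "dog"): {"sat": 1.0},
    ("cat", "sat"): {"</s>": 1.0},
    ("dog", "sat"): {"</s>": 1.0},
    ("sat", "</s>"): {"</s>": 1.0},
}

out = generate(toy_tri, 3, random.Random(0))
print(" ".join(out))
```

The `max_tokens` limit is a design choice rather than something the notebook's generators have: it guarantees termination even if the model rarely produces the end symbol.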
