# Language Model Example

In this notebook, you'll train an n-gram language model yourself.

This example is based on:

https://nlpforhackers.io/language-models/

For (very!) detailed information about this topic, please refer to:
https://web.stanford.edu/~jurafsky/slp3/3.pdf 

# Setup

In [None]:
import nltk
nltk.download('reuters')
nltk.download('punkt')


[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import random
from collections import Counter, defaultdict
from functools import reduce
from operator import mul

from nltk import bigrams, trigrams
from nltk.corpus import reuters
 

# Bag-of-words & Frequency calculation

We start by constructing a bag-of-words.

Recall that a language model assigns probabilities to words (and word sequences), estimated from how often they occur in a corpus.

We calculate the word frequencies by simply counting the word occurrences.

In [None]:
counts = Counter(reuters.words())
total_count = len(reuters.words())

print(counts.most_common(n=20))

[('.', 94687), (',', 72360), ('the', 58251), ('of', 35979), ('to', 34035), ('in', 26478), ('said', 25224), ('and', 25043), ('a', 23492), ('mln', 18037), ('vs', 14120), ('-', 13705), ('for', 12785), ('dlrs', 11730), ("'", 11272), ('The', 10968), ('000', 10277), ('1', 9977), ('s', 9298), ('pct', 9093)]


In [None]:
# Compute the frequencies
for word in counts:
    counts[word] /= float(total_count)

# The frequencies should add up to 1
print(sum(counts.values()))

1.0000000000006808


Let's generate a random text passage and calculate the probability of that passage under these word frequencies.

Before continuing and executing the next cells:

**Please pause and guess: what do you expect the probability of a random 5-word passage to be? How about a 100-word one?**

In [None]:
# Generate 100 words of language
text = []
 
for _ in range(100):
    r = random.random()
    accumulator = .0
 
    for word, freq in counts.items():
        accumulator += freq
 
        if accumulator >= r:
            text.append(word)
            break
 
print(' '.join(text))
 

movements S it Harbour stock - - vs profit SHARES ACQUIRES at the / cts got said 5 net financial mln 1 March of and & don 16 mln remedial the speculated stabilisation ' CYCLOPS information days 6 quite ." The Means strikes 7 hour members said & Icahn the 08 officials the said likely 300 taken two lie include 755 31 Flank > and terms from be in privately 7 prices in public Excluding Year optimistic want figure , been of mln rose of . orders , MAD BUSINESS tax . pct the market about investment Beaver . 14
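
The loop above is a form of inverse-CDF sampling: it walks through the cumulative frequencies until they pass a uniform random number `r`. As a minimal sketch, the same kind of weighted draw can be done with the standard library's `random.choices`. The `toy_freqs` dictionary and the `sample_words` helper are made up for illustration; `toy_freqs` stands in for `counts`.

```python
import random

# Hypothetical toy frequencies standing in for `counts` (they sum to 1).
toy_freqs = {"the": 0.5, "of": 0.3, "mln": 0.2}

def sample_words(freqs, k, seed=None):
    """Draw k words, each with probability proportional to its frequency."""
    rng = random.Random(seed)  # a local generator, so the seed stays contained
    words = list(freqs)
    weights = [freqs[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

sample = sample_words(toy_freqs, k=10, seed=0)
print(' '.join(sample))
```

Each word is still drawn independently of the previous ones, so the result should look just as incoherent as the passage above.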


In [None]:
# The probability of the text

print(reduce(mul, [counts[w] for w in text], 1.0))


2.5910690708843e-310


Compare the probability above to the one you guessed earlier.

Try changing the number of words and compare the results.
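
A practical note: the value printed above is already smaller than the smallest normal double-precision float (about 2.2e-308), and with more words the product would underflow to exactly 0.0. A common workaround is to add log-probabilities instead of multiplying probabilities. The sketch below assumes made-up toy frequencies (`toy_freqs`) and a hypothetical helper `log_prob`; in the notebook, the frequencies would come from `counts`.

```python
import math
from functools import reduce
from operator import mul

# Hypothetical toy frequencies; in the notebook these would come from `counts`.
toy_freqs = {"the": 0.05, "of": 0.03, "said": 0.02}

def log_prob(words, freqs):
    """Sum of log-probabilities, i.e. the log of the product of probabilities."""
    return sum(math.log(freqs[w]) for w in words)

short_text = ["the", "of", "said"]
long_text = short_text * 300  # 900 words

# The direct product is fine for 3 words but underflows to 0.0 for 900 words.
direct_short = reduce(mul, [toy_freqs[w] for w in short_text], 1.0)
direct_long = reduce(mul, [toy_freqs[w] for w in long_text], 1.0)

print(direct_short, math.exp(log_prob(short_text, toy_freqs)))
print(direct_long, log_prob(long_text, toy_freqs))
```

The log-probability stays finite for the long passage, so two long texts can still be compared even when their direct products are both 0.0.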


# n-gram Language Model

Now let's construct an n-gram language model from the text.

nltk already has functions for extracting n-grams, such as `bigrams` and `trigrams`.

Let's use them:

In [None]:
first_sentence = reuters.sents()[0]
print(first_sentence)

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']


In [None]:
# Get the bigrams
print(list(bigrams(first_sentence)))

[('ASIAN', 'EXPORTERS'), ('EXPORTERS', 'FEAR'), ('FEAR', 'DAMAGE'), ('DAMAGE', 'FROM'), ('FROM', 'U'), ('U', '.'), ('.', 'S'), ('S', '.-'), ('.-', 'JAPAN'), ('JAPAN', 'RIFT'), ('RIFT', 'Mounting'), ('Mounting', 'trade'), ('trade', 'friction'), ('friction', 'between'), ('between', 'the'), ('the', 'U'), ('U', '.'), ('.', 'S'), ('S', '.'), ('.', 'And'), ('And', 'Japan'), ('Japan', 'has'), ('has', 'raised'), ('raised', 'fears'), ('fears', 'among'), ('among', 'many'), ('many', 'of'), ('of', 'Asia'), ('Asia', "'"), ("'", 's'), ('s', 'exporting'), ('exporting', 'nations'), ('nations', 'that'), ('that', 'the'), ('the', 'row'), ('row', 'could'), ('could', 'inflict'), ('inflict', 'far'), ('far', '-'), ('-', 'reaching'), ('reaching', 'economic'), ('economic', 'damage'), ('damage', ','), (',', 'businessmen'), ('businessmen', 'and'), ('and', 'officials'), ('officials', 'said'), ('said', '.')]


In [None]:
# Get the padded bigrams
print(list(bigrams(first_sentence, pad_left=True, pad_right=True)))

[(None, 'ASIAN'), ('ASIAN', 'EXPORTERS'), ('EXPORTERS', 'FEAR'), ('FEAR', 'DAMAGE'), ('DAMAGE', 'FROM'), ('FROM', 'U'), ('U', '.'), ('.', 'S'), ('S', '.-'), ('.-', 'JAPAN'), ('JAPAN', 'RIFT'), ('RIFT', 'Mounting'), ('Mounting', 'trade'), ('trade', 'friction'), ('friction', 'between'), ('between', 'the'), ('the', 'U'), ('U', '.'), ('.', 'S'), ('S', '.'), ('.', 'And'), ('And', 'Japan'), ('Japan', 'has'), ('has', 'raised'), ('raised', 'fears'), ('fears', 'among'), ('among', 'many'), ('many', 'of'), ('of', 'Asia'), ('Asia', "'"), ("'", 's'), ('s', 'exporting'), ('exporting', 'nations'), ('nations', 'that'), ('that', 'the'), ('the', 'row'), ('row', 'could'), ('could', 'inflict'), ('inflict', 'far'), ('far', '-'), ('-', 'reaching'), ('reaching', 'economic'), ('economic', 'damage'), ('damage', ','), (',', 'businessmen'), ('businessmen', 'and'), ('and', 'officials'), ('officials', 'said'), ('said', '.'), ('.', None)]


In [None]:
# Get the trigrams
print(list(trigrams(first_sentence)))


[('ASIAN', 'EXPORTERS', 'FEAR'), ('EXPORTERS', 'FEAR', 'DAMAGE'), ('FEAR', 'DAMAGE', 'FROM'), ('DAMAGE', 'FROM', 'U'), ('FROM', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.-'), ('S', '.-', 'JAPAN'), ('.-', 'JAPAN', 'RIFT'), ('JAPAN', 'RIFT', 'Mounting'), ('RIFT', 'Mounting', 'trade'), ('Mounting', 'trade', 'friction'), ('trade', 'friction', 'between'), ('friction', 'between', 'the'), ('between', 'the', 'U'), ('the', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.'), ('S', '.', 'And'), ('.', 'And', 'Japan'), ('And', 'Japan', 'has'), ('Japan', 'has', 'raised'), ('has', 'raised', 'fears'), ('raised', 'fears', 'among'), ('fears', 'among', 'many'), ('among', 'many', 'of'), ('many', 'of', 'Asia'), ('of', 'Asia', "'"), ('Asia', "'", 's'), ("'", 's', 'exporting'), ('s', 'exporting', 'nations'), ('exporting', 'nations', 'that'), ('nations', 'that', 'the'), ('that', 'the', 'row'), ('the', 'row', 'could'), ('row', 'could', 'inflict'), ('could', 'inflict', 'far'), ('inflict', 'far', '-'), ('far', '-', 'rea

In [None]:
# Get the padded trigrams
print(list(trigrams(first_sentence, pad_left=True, pad_right=True)))


[(None, None, 'ASIAN'), (None, 'ASIAN', 'EXPORTERS'), ('ASIAN', 'EXPORTERS', 'FEAR'), ('EXPORTERS', 'FEAR', 'DAMAGE'), ('FEAR', 'DAMAGE', 'FROM'), ('DAMAGE', 'FROM', 'U'), ('FROM', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.-'), ('S', '.-', 'JAPAN'), ('.-', 'JAPAN', 'RIFT'), ('JAPAN', 'RIFT', 'Mounting'), ('RIFT', 'Mounting', 'trade'), ('Mounting', 'trade', 'friction'), ('trade', 'friction', 'between'), ('friction', 'between', 'the'), ('between', 'the', 'U'), ('the', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.'), ('S', '.', 'And'), ('.', 'And', 'Japan'), ('And', 'Japan', 'has'), ('Japan', 'has', 'raised'), ('has', 'raised', 'fears'), ('raised', 'fears', 'among'), ('fears', 'among', 'many'), ('among', 'many', 'of'), ('many', 'of', 'Asia'), ('of', 'Asia', "'"), ('Asia', "'", 's'), ("'", 's', 'exporting'), ('s', 'exporting', 'nations'), ('exporting', 'nations', 'that'), ('nations', 'that', 'the'), ('that', 'the', 'row'), ('the', 'row', 'could'), ('row', 'could', 'inflict'), ('could', 'inflict
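
To make the padding concrete, here is a minimal pure-Python sketch of what a padded n-gram extraction does. `padded_ngrams` is a hypothetical helper written for illustration, not an nltk function; it assumes `None` as the padding symbol and `n - 1` padding symbols on each side, which is consistent with the padded output shown above.

```python
def padded_ngrams(tokens, n, pad=None):
    """Return all n-grams of `tokens`, with n - 1 `pad` symbols added on each side."""
    padded = [pad] * (n - 1) + list(tokens) + [pad] * (n - 1)
    # Slide a window of width n over the padded sequence.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

tokens = ['officials', 'said', '.']
print(padded_ngrams(tokens, 2))
print(padded_ngrams(tokens, 3))
```

The padding gives the model explicit "start of sentence" and "end of sentence" contexts, which the generation step later in this notebook relies on.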

Let's use the trigrams on the Reuters corpus.

We start by counting the occurrences:

In [None]:
model = defaultdict(lambda: defaultdict(lambda: 0))
 
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1
 
 
print(model["what", "the"]["economists"])
print(model["what", "the"]["nonexistingword"])
print(model[None, None]["The"])

2
0
8839


We then convert the counts into frequencies (probabilities) by dividing each count by the total number of trigrams that start with the same first two words:

In [None]:
# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

In [None]:
print(model["what", "the"]["economists"])
print(model["what", "the"]["nonexistingword"])
print(model[None, None]["The"])

0.043478260869565216
0.0
0.16154324146501936
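
As a quick check of the numbers above: `("what", "the")` was followed by `"economists"` 2 times, and 0.0434... equals 2/46, so that context was counted 46 times in total. The sketch below repeats the same count-then-normalise procedure on a tiny made-up corpus, so each step is easy to verify by hand. `toy_sents` and `trigram_probs` are hypothetical names used only for illustration, and the padding is done by hand instead of with nltk.

```python
from collections import defaultdict

# Hypothetical toy corpus of tokenised sentences (not taken from Reuters).
toy_sents = [
    ['prices', 'rose', '.'],
    ['prices', 'fell', '.'],
    ['prices', 'rose', 'sharply', '.'],
]

def trigram_probs(sentences):
    """Count padded trigrams, then normalise the counts per (w1, w2) context."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        padded = [None, None] + list(sent) + [None, None]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            counts[(w1, w2)][w3] += 1
    probs = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        probs[context] = {w3: c / total for w3, c in followers.items()}
    return probs

toy_model = trigram_probs(toy_sents)
print(toy_model[(None, 'prices')])

# Each context's distribution should sum to 1 (up to rounding).
for context, followers in toy_model.items():
    assert abs(sum(followers.values()) - 1.0) < 1e-9
```

Here `'prices'` at the start of a sentence is followed by `'rose'` in two of the three sentences and by `'fell'` in one, so the two probabilities are 2/3 and 1/3.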


And now we're ready to try it out.

Let's generate a random sentence again, but this time, we'll use our trigram model:

In [None]:
text = [None, None]
prob = 1.0 
 
sentence_finished = False
 
while not sentence_finished:
    r = random.random()
    accumulator = .0
 
    for word in model[tuple(text[-2:])].keys():
        accumulator += model[tuple(text[-2:])][word]
 
        if accumulator >= r:
            prob *= model[tuple(text[-2:])][word] 
            text.append(word)
            break
 
    if text[-2:] == [None, None]:
        sentence_finished = True
 
print(f'Probability of text={prob}')
print(' '.join([t for t in text if t]))

Probability of text=2.0872967232657588e-57
January ' s economic outlook also attended by representatives from 151 , 000 Revs 6 , 271 , 000 vs 2 . 75 dlrs in excess of its Geneva , Switzerland , which accounted for just under one pct increase in profitability by midyear , which will improve prospects for the block was Eduardo Cojuanco , the department said .


Try running this cell several times, and compare the output sentences to the previous random ones.

> Which one looks better to you?
 
> Can you explain why?


Note that we did not use a complicated RNN or LSTM here, and still managed to generate reasonable-looking sentences using only simple probabilities and word counts.

---

Your Turn:
- Optional: use your own corpus.
- Create a function that parses the text into 4-grams.
- Train a language model using the 4-grams.
- Generate a few sentences.

Q1: How can you use your language model as a spelling or grammar checker?

Q2: What steps are needed to create a spelling or grammar correction suggestion using your language model? 


Task 2:
- For those of you who finished quickly: can you improve the generation even further? Hint: currently the comparison is against a random threshold...

> Do they look better? Do they make more sense?