# Creating N-gram Language Models

Generate n-grams from text using Python.

In [3]:
raw_text = "jack be nimble. jack be quick. jack jump over \
the candlestick."

In [4]:
from nltk import word_tokenize
unigrams = word_tokenize(raw_text)
unigrams 

['jack',
 'be',
 'nimble',
 '.',
 'jack',
 'be',
 'quick',
 '.',
 'jack',
 'jump',
 'over',
 'the',
 'candlestick',
 '.']

In [5]:
bigrams = [(unigrams[k], unigrams[k+1]) for k in range(len(unigrams)-1)]
bigrams

[('jack', 'be'),
 ('be', 'nimble'),
 ('nimble', '.'),
 ('.', 'jack'),
 ('jack', 'be'),
 ('be', 'quick'),
 ('quick', '.'),
 ('.', 'jack'),
 ('jack', 'jump'),
 ('jump', 'over'),
 ('over', 'the'),
 ('the', 'candlestick'),
 ('candlestick', '.')]

NLTK's ngrams function (from nltk.util), called with n=2, produces the same result.


In [6]:
from nltk.util import ngrams

tokens = word_tokenize(raw_text)
bigrams = list(ngrams(tokens, 2))
bigrams

[('jack', 'be'),
 ('be', 'nimble'),
 ('nimble', '.'),
 ('.', 'jack'),
 ('jack', 'be'),
 ('be', 'quick'),
 ('quick', '.'),
 ('.', 'jack'),
 ('jack', 'jump'),
 ('jump', 'over'),
 ('over', 'the'),
 ('the', 'candlestick'),
 ('candlestick', '.')]
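
The list comprehension above extends to any n. Here is a minimal sketch, assuming a hypothetical helper named make_ngrams (not part of NLTK) and the token list copied from the output above so that it runs without NLTK:

```python
def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens."""
    # zip over n staggered views of the token list; zip stops at the shortest view
    return list(zip(*(tokens[i:] for i in range(n))))

# tokens copied from the word_tokenize output shown earlier
tokens = ['jack', 'be', 'nimble', '.', 'jack', 'be', 'quick', '.',
          'jack', 'jump', 'over', 'the', 'candlestick', '.']

trigrams = make_ngrams(tokens, 3)
print(trigrams[:3])
# [('jack', 'be', 'nimble'), ('be', 'nimble', '.'), ('nimble', '.', 'jack')]
```

A list of k tokens yields k - n + 1 n-grams, so the 14 tokens here give 13 bigrams and 12 trigrams.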

Next, make dictionaries of counts for the unigrams and bigrams.

In [7]:
unigram_dict = {t:unigrams.count(t) for t in set(unigrams)}
unigram_dict

{'candlestick': 1,
 'jack': 3,
 '.': 3,
 'be': 2,
 'over': 1,
 'nimble': 1,
 'jump': 1,
 'the': 1,
 'quick': 1}
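
Calling list.count once per distinct token rescans the whole list each time, which gets slow on a large corpus. One possible alternative is collections.Counter from the standard library, which builds the counts in a single pass. This is a sketch, with the token list copied from the earlier output and the names unigram_counts and bigram_counts chosen only for illustration:

```python
from collections import Counter

# tokens copied from the word_tokenize output shown earlier
tokens = ['jack', 'be', 'nimble', '.', 'jack', 'be', 'quick', '.',
          'jack', 'jump', 'over', 'the', 'candlestick', '.']

unigram_counts = Counter(tokens)
# pair each token with its successor to count bigrams
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(unigram_counts['jack'], bigram_counts[('jack', 'be')])  # 3 2
print(bigram_counts[('be', 'smart')])                         # 0
```

A Counter returns 0 for a key it has never seen, so no explicit membership check is needed for unseen n-grams.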

In [8]:
bigram_dict = {b:bigrams.count(b) for b in set(bigrams)}
bigram_dict
# key is tuple, value is the count

{('jack', 'jump'): 1,
 ('be', 'nimble'): 1,
 ('nimble', '.'): 1,
 ('jump', 'over'): 1,
 ('.', 'jack'): 2,
 ('be', 'quick'): 1,
 ('the', 'candlestick'): 1,
 ('candlestick', '.'): 1,
 ('quick', '.'): 1,
 ('over', 'the'): 1,
 ('jack', 'be'): 2}

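Notice that the bigram ('.', 'jack') has a count of 2, even though 'jack' starts 3 sentences: the first 'jack' has no token before it. A way around this is to put a start symbol at the beginning of every sentence, so that every sentence-initial 'jack' is counted in a bigram such as 'start-jack'. However, in a really large corpus it won't make a huge difference in the counts.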
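
The start-symbol idea can be sketched as follows. The marker '<s>' and the hand-split sentence lists are assumptions made for illustration; the notebook itself does not split the corpus into sentences.

```python
from collections import Counter

# the corpus split by hand into tokenized sentences (an assumption for this sketch)
sentences = [['jack', 'be', 'nimble', '.'],
             ['jack', 'be', 'quick', '.'],
             ['jack', 'jump', 'over', 'the', 'candlestick', '.']]

START = '<s>'  # hypothetical start-of-sentence marker

padded_bigrams = []
for sent in sentences:
    padded = [START] + sent
    # bigrams are built per sentence, so none crosses a sentence boundary
    padded_bigrams.extend(zip(padded, padded[1:]))

counts = Counter(padded_bigrams)
print(counts[(START, 'jack')])  # 3
print(counts[('.', 'jack')])    # 0
```

With padding, all three sentence-initial occurrences of 'jack' are counted, and the cross-sentence bigram ('.', 'jack') disappears.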

### Create a probabilistic model

Now that we have "trained" a model on our tiny corpus, we can compute probabilities on new data, which we will call test data. 

Calculate P(jack be nimble .) = P(be | jack) * P(nimble | be) * P(. | nimble), using smoothing so that bigrams unseen in the training data do not get probability zero.

In [15]:
import math

def compute_prob(text, unigram_dict, bigram_dict, N, V):
    # N is the number of tokens in the training data
    # V is the vocabulary size in the training data (unique tokens)

    unigrams_test = word_tokenize(text)
    bigrams_test = list(ngrams(unigrams_test, 2))
    
    p_gt = 1       # calculate p using a variation of Good-Turing smoothing
    p_laplace = 1  # calculate p using Laplace smoothing
    p_log = 0      # add log(p) to prevent underflow

    for bigram in bigrams_test:
        n = bigram_dict[bigram] if bigram in bigram_dict else 0
        n_gt = bigram_dict[bigram] if bigram in bigram_dict else 1/N
        d = unigram_dict[bigram[0]] if bigram[0] in unigram_dict else 0
        if d == 0:
            p_gt = p_gt * (1 / N)
        else:
            p_gt = p_gt * (n_gt / d)
        p_laplace = p_laplace * ((n + 1) / (d + V))
        p_log = p_log + math.log((n + 1) / (d + V))

    print("\nprobability with simplified Good-Turing is %.5f" % (p_gt))
    print("probability with laplace smoothing is %.5f" % p_laplace)
    print("log prob is %.5f == %.5f" % (p_log, math.exp(p_log)))


In [16]:
N = len(unigrams)
V = len(unigram_dict)

test_text = 'jack be nimble.'

compute_prob(test_text, unigram_dict, bigram_dict, N, V) 



probability with simplified Good-Turing is 0.33333
probability with laplace smoothing is 0.00909
log prob is -4.70048 == 0.00909
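
To see where the Laplace figure comes from, the arithmetic can be redone by hand. This sketch hard-codes the counts from the dictionaries above (V = 9 distinct tokens) and uses exact fractions; the variable names are chosen only for illustration.

```python
from fractions import Fraction
import math

V = 9  # vocabulary size of the training data

# (bigram count n, count d of the bigram's first word) for 'jack be nimble.'
steps = [(2, 3),   # ('jack', 'be')
         (1, 2),   # ('be', 'nimble')
         (1, 1)]   # ('nimble', '.')

p = Fraction(1)
for n, d in steps:
    p *= Fraction(n + 1, d + V)  # Laplace estimate: (n + 1) / (d + V)

print(p)                    # 1/110
print(round(float(p), 5))   # 0.00909
print(round(math.log(p), 5))
```

The three factors are 3/12, 2/11 and 2/10, and their product 1/110 matches the 0.00909 printed by compute_prob.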


In [18]:
test_text = 'jack be smart.'

compute_prob(test_text, unigram_dict, bigram_dict, N, V)



probability with simplified Good-Turing is 0.00170
probability with laplace smoothing is 0.00253
log prob is -5.98141 == 0.00253
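
The log probability matters more as the test text gets longer. Here is a small sketch of the underflow problem, using a made-up per-bigram probability and length rather than values from the corpus:

```python
import math

p = 1e-5          # hypothetical probability of each bigram
n_bigrams = 100   # hypothetical number of bigrams in a long text

product = 1.0
log_sum = 0.0
for _ in range(n_bigrams):
    product *= p            # shrinks toward zero and eventually underflows
    log_sum += math.log(p)  # stays a moderate negative number

print(product)  # 0.0
print(log_sum)
```

The true product, 1e-500, is smaller than the smallest positive float, so it collapses to 0.0, while the sum of logs stays finite and can still be compared across texts.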


### Generation from an ngram model

Could we use these probabilities to generate text?

First, create probability dictionaries from the corpus. Then take a very naive, greedy approach to language generation: given a start word, append the most likely next word, and continue until you reach a sentence-ending period.


In [10]:
u_probs = {t:unigrams.count(t)/len(unigrams) for t in set(unigrams)}
b_probs = {b:bigrams.count(b)/unigrams.count(b[0]) for b in set(bigrams)}

print('unigram probs:', u_probs)
print('\nbigram probs:', b_probs)

unigram probs: {'.': 0.21428571428571427, 'candlestick': 0.07142857142857142, 'quick': 0.07142857142857142, 'jack': 0.21428571428571427, 'over': 0.07142857142857142, 'the': 0.07142857142857142, 'be': 0.14285714285714285, 'nimble': 0.07142857142857142, 'jump': 0.07142857142857142}

bigram probs: {('over', 'the'): 1.0, ('be', 'nimble'): 0.5, ('nimble', '.'): 1.0, ('.', 'jack'): 0.6666666666666666, ('quick', '.'): 1.0, ('the', 'candlestick'): 1.0, ('jump', 'over'): 1.0, ('jack', 'jump'): 0.3333333333333333, ('be', 'quick'): 0.5, ('candlestick', '.'): 1.0, ('jack', 'be'): 0.6666666666666666}


In [11]:
def naive_gen(start_word, u_probs, b_probs):
    phrase = [start_word]
    while phrase[-1] != '.':
        candidate_next = {k:b_probs[k] for k in b_probs if k[0] == phrase[-1]}
        candidate_next = sorted(candidate_next.items(), key=lambda x:x[1], reverse=True)
        if not candidate_next:
            break
        phrase += [candidate_next[0][0][1]]  # [0][0] for first bigram, tuple
        print(phrase)
    
    return ' '.join(phrase)

In [12]:
naive_gen('jack', u_probs, b_probs)

['jack', 'be']
['jack', 'be', 'nimble']
['jack', 'be', 'nimble', '.']


'jack be nimble .'

In [13]:
naive_gen('over', u_probs, b_probs)

['over', 'the']
['over', 'the', 'candlestick']
['over', 'the', 'candlestick', '.']


'over the candlestick .'
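
Always picking the most likely next word gives the same sentence for a given start word, and on another corpus it could loop forever if the most likely words form a cycle. One possible variation is to sample the next word in proportion to its bigram probability and to cap the length. This is a sketch, not the notebook's method: the function name sample_gen and the parameters max_len and seed are assumptions, and b_probs is copied from the output above.

```python
import random

# bigram probabilities copied from the b_probs output shown earlier
b_probs = {('jack', 'be'): 2/3, ('jack', 'jump'): 1/3,
           ('be', 'nimble'): 0.5, ('be', 'quick'): 0.5,
           ('nimble', '.'): 1.0, ('quick', '.'): 1.0,
           ('jump', 'over'): 1.0, ('over', 'the'): 1.0,
           ('the', 'candlestick'): 1.0, ('candlestick', '.'): 1.0,
           ('.', 'jack'): 2/3}

def sample_gen(start_word, b_probs, max_len=20, seed=None):
    """Generate a phrase by sampling each next word from the bigram probabilities."""
    rng = random.Random(seed)  # a seed makes the run repeatable
    phrase = [start_word]
    # stop at a period, or at max_len words as a guard against endless loops
    while phrase[-1] != '.' and len(phrase) < max_len:
        candidates = [(b[1], p) for b, p in b_probs.items() if b[0] == phrase[-1]]
        if not candidates:
            break
        words, weights = zip(*candidates)
        phrase.append(rng.choices(words, weights=weights, k=1)[0])
    return ' '.join(phrase)

print(sample_gen('jack', b_probs, seed=0))
```

Starting from 'jack', this can produce 'jack be nimble .', 'jack be quick .' or 'jack jump over the candlestick .', depending on the random draws.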

### Try NLTK's generator

In [14]:
import nltk
nltkText = nltk.Text(unigrams)
nltkText.generate()

be nimble . . jack jump over the candlestick . over the candlestick .
jack be quick . jack be quick . . jack be nimble . the candlestick .
be quick . over the candlestick . . the candlestick . . be nimble .
jack jump over the candlestick . quick . jack be nimble . jack be
quick . jack be quick . . . jack jump over the candlestick .
candlestick . over the candlestick . . nimble . jack jump over the
candlestick . jack be nimble . jack be nimble . jump over the
candlestick


Building ngram index...



It looks like we need a bigger corpus to get more interesting results.

In [15]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [16]:
text4.generate()

Building ngram index...


occur , and especially the truth that democratic government has innate
capacity to govern its affairs aright through the Province ceded , by
any timid forebodings of evil were not to overtake them while I
possess the property of the peaks and the exercise of free and firm on
the farm , in view that the best of my countrymen will ever find me
ready to confer their benefits on countless generations yet to make
its promise for all generations ." , remains essentially unchanged .
cost of the Rocky Mountains . abuses of an ever - expanding American
dream

