# Add-One Smoothing

Here, I take some English corpora - the King James Version of the Bible, The 1662 Book of Common Prayer, and the Universal Declaration of Human Rights - and apply add-one smoothing to generate n-gram models. For simplicity, all words will be in lowercase and the models will be case-insensitive.

Add-one smoothing is a method of computing the probability of a word in a n-gram model in such a way that the sequences that never appear in the corpus do not get zero probability.

$$
\begin{align*}
p &= \frac{c + 1}{n + v} \\
c &= \textrm{count of the n-gram} \\
n &= \textrm{count of the history (the n-gram excluding the last word)} \\
v &= \textrm{size of the vocabulary}
\end{align*}
$$

For each model, I compute its cross-entropy and perplexity and develop a simple sentence generator. After that, I analyze the effectiveness of each corpus at training n-gram models.

In [1]:
%cd ..

/home/mtj0712/Documents/playground


  self.shell.db['dhist'] = compress_dhist(dhist)[-100:]


In [2]:
import re
import math

import pygtrie
import numpy as np
from numpy import random

from reader import *

`punc_pattern` will help us separate the punctuations from actual words.

In [3]:
punc_pattern = re.compile('[’!\"#$%&\'()*+,\\-./:;<=>?@[\\\\\\]^_`{|}~]')
end_mark_set = {'!', '.', '?'}

## Early Modern English

First, I build n-gram models with the King James Version of the Bible and The 1662 Book of Common Prayer. This will give us a language model for Early Modern English.

In [4]:
kjv = KJVReader()
bcp = BCPReader()
eme_list = []

Before building the models, I parse the text into a list of words and punctuations. This will be convenient for building the models.

In [5]:
while not kjv.is_eof():
    units = kjv.read_sentence().lower().split()
    for u in units:
        while u:
            match = punc_pattern.search(u)
            if match:
                i = match.start()
                if i != 0:
                    first_word = u[:i]
                    eme_list.append(first_word)
                punc = u[i]
                u = u[i+1:]
                eme_list.append(punc)
            else:
                eme_list.append(u)
                break

while not bcp.is_eof():
    units = bcp.read_sentence().lower().split()
    for u in units:
        while u:
            match = punc_pattern.search(u)
            if match:
                i = match.start()
                if i != 0:
                    first_word = u[:i]
                    eme_list.append(first_word)
                
                if u[i:i+2] == '&c':
                    punc = u[i:i+2]
                    u = u[i+2:]
                else:
                    punc = u[i]
                    u = u[i+1:]
                eme_list.append(punc)
            else:
                eme_list.append(u)
                break

First, I build the unigram model without add-one smoothing.

In [6]:
# Unigram Model

eme_trie = pygtrie.StringTrie()
eme_v = 0 # size of the vocabulary
eme_wordlist = []

for w in eme_list:
    try:
        eme_trie[w] += 1
    except KeyError:
        eme_trie[w] = 1
        eme_wordlist.append(w)
        eme_v += 1

print('Size of the vocabulary:', eme_v)
print('Count of all words:', len(eme_list))

eme_unigram_H = 0 # cross entropy

for w in eme_list:
    p = eme_trie[w] / len(eme_list)
    eme_unigram_H += math.log2(p)
eme_unigram_H /= -len(eme_list)

print('Cross entropy:', eme_unigram_H)
print('Perplexity:', 2 ** eme_unigram_H)
print()

eme_unigram_list = sorted(eme_trie.items(), key=lambda t : t[1], reverse=True)[:10]
for pair in eme_unigram_list:
    p = pair[1] / len(eme_list)
    print('Word:', pair[0], '| Probability:', p)

Size of the vocabulary: 13701
Count of all words: 1108945
Cross entropy: 8.367674133713269
Perplexity: 330.3093810598126

Word: , | Probability: 0.07600557286429895
Word: the | Probability: 0.06699881418826001
Word: and | Probability: 0.05442289743855647
Word: of | Probability: 0.03621910915329435
Word: . | Probability: 0.029303527226327727
Word: to | Probability: 0.015170274450040353
Word: : | Probability: 0.014834820482530693
Word: that | Probability: 0.014345165900923851
Word: in | Probability: 0.014148582661899374
Word: ; | Probability: 0.011146630355878786


The cross entropy and perplexity of the model is extremely high. This is expected, since a unigram model is far from sufficient in representing an actual language. As expected, the most probable words are some common punctuations and grammatical words, such as articles, prepositions, and pronouns.

Next, I build 2~5-gram models. Again, I do not apply add-one smoothing. This time, I do not print out the probabilities for each n-gram, since it would be too lengthy.

In [7]:
# 2~5-gram Model

for n in range(2, 6):
    print('========== ' + ('bi' if n == 2 else ('tri' if n == 3 else f'{n}-')) + 'gram ==========')
    print()
    
    ngram = [''] * n
    for w in eme_list:
        ngram[-1] = w
        try:
            eme_trie['/'.join(ngram)] += 1
        except KeyError:
            eme_trie['/'.join(ngram)] = 1
    
        if w in end_mark_set:
            ngram[:-1] = [''] * (n - 1)
        else:
            ngram[:-1] = ngram[1:]
    
    eme_ngram_H = 0 # cross entropy
    
    ngram = [''] * n
    for w in eme_list:
        ngram[-1] = w
        if ngram[-2] == '':
            history_count = eme_trie['.'] + eme_trie['?'] + eme_trie['!']
        else:
            history_count = eme_trie['/'.join(ngram[:-1])]
        p = eme_trie['/'.join(ngram)] / history_count
        eme_ngram_H += math.log2(p)
        
        if w in end_mark_set:
            ngram[:-1] = [''] * (n - 1)
        else:
            ngram[:-1] = ngram[1:]
    eme_ngram_H /= -len(eme_list)
    
    print('Cross entropy:', eme_ngram_H)
    print('Perplexity:', 2 ** eme_ngram_H)
    print()


Cross entropy: 5.598702936624395
Perplexity: 48.459342883245306


Cross entropy: 3.449037978821828
Perplexity: 10.921037234655252


Cross entropy: 1.7657884292431998
Perplexity: 3.4005979071151464


Cross entropy: 0.9644434292663494
Perplexity: 1.951310589122527



As expected, as the order of the n-gram model increases, the perplexity of the language model decreases. Now, I try the same while applying add-one smoothing.

In [8]:
print('========== unigram ==========')
print()

eme_unigram_H = 0 # cross entropy

for w in eme_list:
    p = (eme_trie[w] + 1) / (len(eme_list) + eme_v)
    eme_unigram_H += math.log2(p)
eme_unigram_H /= -len(eme_list)

print('Cross entropy:', eme_unigram_H)
print('Perplexity:', 2 ** eme_unigram_H)
print()

for n in range(2, 6):
    print('========== ' + ('bi' if n == 2 else ('tri' if n == 3 else f'{n}-')) + 'gram ==========')
    print()
    
    eme_ngram_H = 0 # cross entropy
    
    ngram = [''] * n
    for w in eme_list:
        ngram[-1] = w
        if ngram[-2] == '':
            history_count = eme_trie['.'] + eme_trie['?'] + eme_trie['!']
        else:
            history_count = eme_trie['/'.join(ngram[:-1])]
        p = (eme_trie['/'.join(ngram)] + 1) / (history_count + eme_v)
        eme_ngram_H += math.log2(p)
        
        if w in end_mark_set:
            ngram[:-1] = [''] * (n - 1)
        else:
            ngram[:-1] = ngram[1:]
    eme_ngram_H /= -len(eme_list)
    
    print('Cross entropy:', eme_ngram_H)
    print('Perplexity:', 2 ** eme_ngram_H)
    print()


Cross entropy: 8.370217051082884
Perplexity: 330.89210306822014


Cross entropy: 8.368786536619904
Perplexity: 330.56416727548105


Cross entropy: 10.468221166299237
Perplexity: 1416.6043540383293


Cross entropy: 11.514010973017717
Perplexity: 2924.5743943784537


Cross entropy: 11.905454702616424
Perplexity: 3836.1800062193606



When add-one smoothing is applied, contrary to our expectation, the perplexity of the language model increases as the order of the n-gram model increases. This might be because at higher orders of the n-gram model, the count of the history before each word ($n$) is smaller, and this causes the size of the vocabulary ($v$) to comprise a bigger portion in the denominator of the probability fraction.

It seems that add-one smoothing is only useful for dealing with new texts, and not for computing the cross-entropy or the perplexity of a language model with the very corpus it was trained on.

Below, I implement a sentence generator with 2~5-gram models developed above. New word is generated by randomly choosing from the corpus vocabulary with the probability calculated with add-one smoothing. Whenever the history of an n-gram cannot be found in the corpus, I reduce the order of the n-gram by $1$. After that, I try to increase the order back to `n`.

In [9]:
for n in range(2, 6):
    print('========== ' + ('bi' if n == 2 else ('tri' if n == 3 else f'{n}-')) + 'gram ==========')
    print()
    
    ngram = [''] * n
    sentence_len = 0
    start = 0 # for finding the starting point of history
    
    while ngram[-1] not in end_mark_set:
        if 200 <= sentence_len:
            print(' <<< The sentence being generated has exceeded 200 tokens. Sentence finishing failed. >>>')
            break
        
        ngram[:-1] = ngram[1:]
        
        if ngram[-2] == '':
            history = '/' * (n - 2)
            history_count = eme_trie['.'] + eme_trie['?'] + eme_trie['!']
        else:
            if 0 < start:
                start -= 1
            
            for i in range(start, n - 1):
                history = '/'.join(ngram[i:-1])
                try:
                    history_count = eme_trie[history]
                    start = i
                    break
                except KeyError:
                    pass

        p = np.zeros(eme_v)
        for i in range(eme_v):
            try:
                count = eme_trie[history + '/' + eme_wordlist[i]]
            except KeyError:
                count = 0
            p[i] = (count + 1) / (history_count + eme_v)
        
        ngram[-1] = eme_wordlist[random.choice(eme_v, p=p)]

        if sentence_len == 0 or punc_pattern.fullmatch(ngram[-1]) or ngram[-2] == '-' or ngram[-2:] == ['’', 's']:
            output = ngram[-1]
        else:
            output = ' ' + ngram[-1]
        print(output, end='')

        sentence_len += 1
    
    print()
    print()


a ship ammon lod silvanus discomfit themselves mushites dash persecuting woven cherethites dibri requisite dimittis treasurers holds clouted chiun fourfold hurl it was amos pervert supreme angry menstruous stripling adar salmone complain sacrifice uriel gopher huzzab stammerers puteoli lecah lengthening thyself nations opinion zelah shiphi spies horribly savour shimei flour straiteneth axletrees vacant surnamed obey likeminded cades markest beard idolaters marvellously sychar alarm appeal occasion chastened ink drops vineyards telleth meraioth esarhaddon treasury abdeel perversely proved jehiel sacrifice thicket peevish padan doorkeeper slanderers mortify invade stoicks enticeth ithran greedy leanfleshed genealogy vessels diversities wearing hazerim salcah shemiramoth considerations limited tackling paths span commending such marched another baaseiah satisfying eber creep rapha bridegroom confirmeth bethmeon zalaph gallio thicker veil heinous flatteries shephi marketplace institution 

As anyone can tell from the outputs above, the generated sentences do not make sense and are too long. This might be because add-one smoothing gives significant probability to words that are otherwise extremely unlikely to appear.

Now, I will once again try generating sentences, but this time add-one smoothing will not be used.

In [20]:
for n in range(2, 6):
    print('========== ' + ('bi' if n == 2 else ('tri' if n == 3 else f'{n}-')) + 'gram ==========')
    print()
    
    ngram = [''] * n
    sentence_len = 0
    
    while ngram[-1] not in end_mark_set:
        if 200 <= sentence_len:
            print(' <<< The sentence being generated has exceeded 200 tokens. Sentence finishing failed. >>>')
            break
        
        ngram[:-1] = ngram[1:]
        
        if ngram[-2] == '':
            history = '/' * (n - 2)
            history_count = eme_trie['.'] + eme_trie['?'] + eme_trie['!']
        else:
            history = '/'.join(ngram[:-1])
            history_count = eme_trie[history]

        p = np.zeros(eme_v)
        for i in range(eme_v):
            try:
                p[i] = eme_trie[history + '/' + eme_wordlist[i]] / history_count
            except KeyError:
                pass
        
        ngram[-1] = eme_wordlist[random.choice(eme_v, p=p)]

        if sentence_len == 0 or punc_pattern.fullmatch(ngram[-1]) or ngram[-2] == '-' or ngram[-2:] == ['’', 's']:
            output = ngram[-1]
        else:
            output = ' ' + ngram[-1]
        print(output, end='')

        sentence_len += 1
    
    print()
    print()


our kindred, christ.


my lips shall be taken in hand, that there may be even as i ought to be redeemed, and set forwards, as the heat thereof.


hearken unto me, son of man, thou unclean spirit.


but they cried out, away with him, away with him.



The sentences are much shorter and somewhat recognizable. The meaning of the sentence generated by the 5-gram model makes more sense than that of the sentence generated by the bigram model.

This time, each new word will be the most probable word at that position. If there are multiple words with equal probability that are most probable, one of those words will be chosen randomly.

In [11]:
for n in range(2, 6):
    print('========== ' + ('bi' if n == 2 else ('tri' if n == 3 else f'{n}-')) + 'gram ==========')
    print()
    
    ngram = [''] * n
    sentence_len = 0
    
    while ngram[-1] not in end_mark_set:
        if 200 <= sentence_len:
            print(' <<< The sentence being generated has exceeded 200 tokens. Sentence finishing failed. >>>')
            break
        
        ngram[:-1] = ngram[1:]
        
        if ngram[-2] == '':
            history = '/' * (n - 2)
            history_count = eme_trie['.'] + eme_trie['?'] + eme_trie['!']
        else:
            if 0 < start:
                start -= 1
            
            for i in range(start, n - 1):
                history = '/'.join(ngram[i:-1])
                try:
                    history_count = eme_trie[history]
                    start = i
                    break
                except KeyError:
                    pass

        p = np.zeros(eme_v)
        for i in range(eme_v):
            try:
                p[i] = eme_trie[history + '/' + eme_wordlist[i]] / history_count
            except KeyError:
                pass
        
        max_indices = (p == p.max()).nonzero()[0]
        ngram[-1] = eme_wordlist[random.choice(max_indices)]

        if sentence_len == 0 or punc_pattern.fullmatch(ngram[-1]) or ngram[-2] == '-' or ngram[-2:] == ['’', 's']:
            output = ngram[-1]
        else:
            output = ' ' + ngram[-1]
        print(output, end='')
        
        sentence_len += 1
    
    print()
    print()


and and and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the <<< The sentence being generated has exceeded 200 tokens. Sentence finishing failed. >>>



and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the lord, and the l

This time, most n-gram models ended up generating repeated sequences. It seems that the best method of text generation is random generation without add-one smoothing.

## Modern English

Now, I build n-gram models with the Universal Declaration of Human Rights. This will give us a language model for Modern English.

In [12]:
udhr_eng = UDHREngReader()
udhr_eng_list = []

I parse the text into a list of words and punctuations.

In [13]:
while not udhr_eng.is_eof():
    units = udhr_eng.read_sentence().lower().split()
    for u in units:
        while u:
            match = punc_pattern.search(u)
            if match:
                i = match.start()
                if i == 0:
                    punc = u[0]
                    u = u[1:]
                else:
                    first_word = u[:i]
                    punc = u[i]
                    u = u[i+1:]
                    udhr_eng_list.append(first_word)
                udhr_eng_list.append(punc)
            else:
                udhr_eng_list.append(u)
                break

First, I build the unigram model without add-one smoothing.

In [14]:
# Unigram Model

udhr_eng_trie = pygtrie.StringTrie()
udhr_eng_v = 0 # size of the vocabulary
udhr_eng_wordlist = []

for w in udhr_eng_list:
    try:
        udhr_eng_trie[w] += 1
    except KeyError:
        udhr_eng_trie[w] = 1
        udhr_eng_wordlist.append(w)
        udhr_eng_v += 1

print('Size of the vocabulary:', udhr_eng_v)
print('Count of all words:', len(udhr_eng_list))

udhr_eng_unigram_H = 0 # cross entropy

for w in udhr_eng_list:
    p = udhr_eng_trie[w] / len(udhr_eng_list)
    udhr_eng_unigram_H += math.log2(p)
udhr_eng_unigram_H /= -len(udhr_eng_list)

print('Cross entropy:', udhr_eng_unigram_H)
print('Perplexity:', 2 ** udhr_eng_unigram_H)
print()

udhr_eng_unigram_list = sorted(udhr_eng_trie.items(), key=lambda t : t[1], reverse=True)[:10]
for pair in udhr_eng_unigram_list:
    p = pair[1] / len(udhr_eng_list)
    print('Word:', pair[0], '| Probability:', p)

Size of the vocabulary: 507
Count of all words: 1851
Cross entropy: 7.270885549836804
Perplexity: 154.43816953351106

Word: the | Probability: 0.06537007023230686
Word: and | Probability: 0.05726634251755808
Word: , | Probability: 0.051323608860075635
Word: of | Probability: 0.04862236628849271
Word: to | Probability: 0.044840626688276604
Word: . | Probability: 0.032955159373311727
Word: in | Probability: 0.023230686115613183
Word: right | Probability: 0.017828200972447326
Word: be | Probability: 0.016747703943814155
Word: everyone | Probability: 0.01620745542949757


The cross entropy and perplexity of the model is extremely high, although it is lower than the model trained with the King James Version under the same setting. The high cross entropy and perplexity are expected, since a unigram model is far from sufficient in representing an actual language. The fact that they are lower than those of the model trained with the King James Version might be due to the smaller vocabulary. As expected, the most probable words are some common punctuations and grammatical words, such as articles, prepositions, and pronouns.

Next, I build 2~5-gram models. Again, I do not apply add-one smoothing.

In [15]:
# 2~5-gram Model

for n in range(2, 6):
    print('========== ' + ('bi' if n == 2 else ('tri' if n == 3 else f'{n}-')) + 'gram ==========')
    print()
    
    ngram = [''] * n
    for w in udhr_eng_list:
        ngram[-1] = w
        try:
            udhr_eng_trie['/'.join(ngram)] += 1
        except KeyError:
            udhr_eng_trie['/'.join(ngram)] = 1
    
        if w in end_mark_set:
            ngram[:-1] = [''] * (n - 1)
        else:
            ngram[:-1] = ngram[1:]
    
    udhr_eng_ngram_H = 0 # cross entropy
    
    ngram = [''] * n
    for w in udhr_eng_list:
        ngram[-1] = w
        if ngram[-2] == '':
            history_count = udhr_eng_trie['.']
        else:
            history_count = udhr_eng_trie['/'.join(ngram[:-1])]
        p = udhr_eng_trie['/'.join(ngram)] / history_count
        udhr_eng_ngram_H += math.log2(p)
        
        if w == '.':
            ngram[:-1] = [''] * (n - 1)
        else:
            ngram[:-1] = ngram[1:]
    udhr_eng_ngram_H /= -len(udhr_eng_list)
    
    print('Cross entropy:', udhr_eng_ngram_H)
    print('Perplexity:', 2 ** udhr_eng_ngram_H)
    print()


Cross entropy: 2.595287184490895
Perplexity: 6.043093167424632


Cross entropy: 0.6473648949951232
Perplexity: 1.56630470150222


Cross entropy: 0.29122669103505566
Perplexity: 1.2236803032205519


Cross entropy: 0.23331235403964518
Perplexity: 1.1755308118979229



As the order of the n-gram model increases, the perplexity of the language model decreases. Now, I try the same while applying add-one smoothing.

In [16]:
print('========== unigram ==========')
print()

udhr_eng_unigram_H = 0 # cross entropy

for w in udhr_eng_list:
    p = (udhr_eng_trie[w] + 1) / (len(udhr_eng_list) + udhr_eng_v)
    udhr_eng_unigram_H += math.log2(p)
udhr_eng_unigram_H /= -len(udhr_eng_list)

print('Cross entropy:', udhr_eng_unigram_H)
print('Perplexity:', 2 ** udhr_eng_unigram_H)
print()

for n in range(2, 6):
    print('========== ' + ('bi' if n == 2 else ('tri' if n == 3 else f'{n}-')) + 'gram ==========')
    print()
    
    udhr_eng_ngram_H = 0 # cross entropy
    
    ngram = [''] * n
    for w in udhr_eng_list:
        ngram[-1] = w
        if ngram[-2] == '':
            history_count = udhr_eng_trie['.']
        else:
            history_count = udhr_eng_trie['/'.join(ngram[:-1])]
        p = (udhr_eng_trie['/'.join(ngram)] + 1) / (history_count + udhr_eng_v)
        udhr_eng_ngram_H += math.log2(p)
        
        if w == '.':
            ngram[:-1] = [''] * (n - 1)
        else:
            ngram[:-1] = ngram[1:]
    udhr_eng_ngram_H /= -len(udhr_eng_list)
    
    print('Cross entropy:', udhr_eng_ngram_H)
    print('Perplexity:', 2 ** udhr_eng_ngram_H)
    print()


Cross entropy: 7.319682672955677
Perplexity: 159.75116843096006


Cross entropy: 7.368149400576789
Perplexity: 165.20910638346845


Cross entropy: 7.617181906870332
Perplexity: 196.3361345540559


Cross entropy: 7.682698852451345
Perplexity: 205.45787984447395


Cross entropy: 7.708541991274838
Perplexity: 209.17142842979257



The perplexity of the language model increases as the order of the n-gram model increases.

Next, I implement a sentence generator with 2~5-gram models developed above. New word is generated by randomly choosing from the corpus vocabulary with the probability calculated with add-one smoothing. Whenever the history of an n-gram cannot be found in the corpus, I reduce the order of the n-gram by $1$. After that, I try to increase the order back to `n`.

In [17]:
for n in range(2, 6):
    print('========== ' + ('bi' if n == 2 else ('tri' if n == 3 else f'{n}-')) + 'gram ==========')
    print()
    
    ngram = [''] * n
    sentence_len = 0
    start = 0 # for finding the starting point of history
    
    while ngram[-1] != '.':
        if 200 <= sentence_len:
            print(' <<< The sentence being generated has exceeded 200 tokens. Sentence finishing failed. >>>')
            break
        
        ngram[:-1] = ngram[1:]
        
        if ngram[-2] == '':
            history = '/' * (n - 2)
            history_count = udhr_eng_trie['.']
        else:
            if 0 < start:
                start -= 1
            
            for i in range(start, n - 1):
                history = '/'.join(ngram[i:-1])
                try:
                    history_count = udhr_eng_trie[history]
                    start = i
                    break
                except KeyError:
                    pass

        p = np.zeros(udhr_eng_v)
        for i in range(udhr_eng_v):
            try:
                count = udhr_eng_trie[history + '/' + udhr_eng_wordlist[i]]
            except KeyError:
                count = 0
            p[i] = (count + 1) / (history_count + udhr_eng_v)
        
        ngram[-1] = udhr_eng_wordlist[random.choice(udhr_eng_v, p=p)]

        if sentence_len == 0 or punc_pattern.fullmatch(ngram[-1]) or ngram[-2] == '-':
            output = ngram[-1]
        else:
            output = ' ' + ngram[-1]
        print(output, end='')

        sentence_len += 1
    
    print()
    print()


beyond existence deprived meeting importance born colour merit organization rule acts benefits before members made possible violation reasonable importance parents securing all no one others worship family omission take social endowed competent prohibited peaceful governing discrimination least when origin working impartial artistic equal change had faith charge forth sickness correspondence chosen family genuine activities status essential incitement regardless impart committed tyranny circumstances organization arts guarantees compulsory peace genuine interests everywhere belongs if state better genuine equally as health reputation practice livelihood tyranny man public prior maintenance marry representatives housing rule suffrage works towards purposes self secure towards fear unemployment slave private these through as compulsory purposes himself either pledged race freely reaffirmed manifest impartial for contempt progress beyond competent kind has hold higher elections chosen sl

The generated sentences do not make sense and are too long.

Next, I will try generating sentences without add-one smoothing.

In [18]:
for n in range(2, 6):
    print('========== ' + ('bi' if n == 2 else ('tri' if n == 3 else f'{n}-')) + 'gram ==========')
    print()
    
    ngram = [''] * n
    sentence_len = 0
    
    while ngram[-1] != '.':
        if 200 <= sentence_len:
            print(' <<< The sentence being generated has exceeded 200 tokens. Sentence finishing failed. >>>')
            break
        
        ngram[:-1] = ngram[1:]
        
        if ngram[-2] == '':
            history = '/' * (n - 2)
            history_count = udhr_eng_trie['.']
        else:
            if 0 < start:
                start -= 1
            
            for i in range(start, n - 1):
                history = '/'.join(ngram[i:-1])
                try:
                    history_count = udhr_eng_trie[history]
                    start = i
                    break
                except KeyError:
                    pass

        p = np.zeros(udhr_eng_v)
        for i in range(udhr_eng_v):
            try:
                p[i] = udhr_eng_trie[history + '/' + udhr_eng_wordlist[i]] / history_count
            except KeyError:
                pass
        
        ngram[-1] = udhr_eng_wordlist[random.choice(udhr_eng_v, p=p)]

        if sentence_len == 0 or punc_pattern.fullmatch(ngram[-1]) or ngram[-2] == '-':
            output = ngram[-1]
        else:
            output = ' ' + ngram[-1]
        print(output, end='')

        sentence_len += 1
    
    print()
    print()


everyone everyone no one shall be subject only with others and leisure, without distinction shall a person belongs, liberty and freedoms may be imposed than the natural and freedom of his interests resulting from acts violating the human rights, shall further the kind of any discrimination, religion, have in other status.


motherhood and childhood are entitled to realization, through national effort and international co-operation and in the dignity and of meeting the just requirements of morality, public order and the free and full development of his family, including his own, and the right to work, to secure their universal and effective recognition and respect for these rights and freedoms set forth in this declaration and against any discrimination, has the right to seek and to seek, receive and impart information and ideas through any media and regardless of frontiers.


all human beings are born free and equal in dignity and rights.


everyone has the right to an effective remed

The sentences are shorter and somewhat recognizable. As the order of the n-gram model increases, the generated sentence gets closer to the ending of the preamble of the Universal Declaration of Human Rights (Now, therefore, The General Assembly, ...). This may be due to the short length of the corpus.

This time, each new word will be the most probable word at that position. If there are multiple words with equal probability that are most probable, one of those words will be chosen randomly.

In [19]:
for n in range(2, 6):
    print('========== ' + ('bi' if n == 2 else ('tri' if n == 3 else f'{n}-')) + 'gram ==========')
    print()
    
    ngram = [''] * n
    sentence_len = 0
    
    while ngram[-1] != '.':
        if 200 <= sentence_len:
            print(' <<< The sentence being generated has exceeded 200 tokens. Sentence finishing failed. >>>')
            break
        
        ngram[:-1] = ngram[1:]
        
        if ngram[-2] == '':
            history = '/' * (n - 2)
            history_count = udhr_eng_trie['.']
        else:
            if 0 < start:
                start -= 1
            
            for i in range(start, n - 1):
                history = '/'.join(ngram[i:-1])
                try:
                    history_count = udhr_eng_trie[history]
                    start = i
                    break
                except KeyError:
                    pass

        p = np.zeros(udhr_eng_v)
        for i in range(udhr_eng_v):
            try:
                p[i] = udhr_eng_trie[history + '/' + udhr_eng_wordlist[i]] / history_count
            except KeyError:
                pass
        
        max_indices = (p == p.max()).nonzero()[0]
        ngram[-1] = udhr_eng_wordlist[random.choice(max_indices)]

        if sentence_len == 0 or punc_pattern.fullmatch(ngram[-1]) or ngram[-2] == '-':
            output = ngram[-1]
        else:
            output = ' ' + ngram[-1]
        print(output, end='')
        
        sentence_len += 1
    
    print()
    print()


everyone has the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to the right to <<< The sentence being generated has exceeded 200 tokens. Sentence finishing failed. >>>



everyone has the right to freedom of

This time, the bigram and 5-gram models ended up generating repeated sequences. The trigram model generated a grammatically incorrect sentence, and the 4-gram model generated the most coherent sentence. It seems that the best method of text generation is random generation without add-one smoothing.