# Add-One Smoothing

Here, I take two English corpora, the King James Version (KJV) of the Bible and the Universal Declaration of Human Rights (UDHR), and apply add-one smoothing to build n-gram models for each corpus.

For each model, I compute its cross-entropy and perplexity and develop a simple sentence finisher. After that, I analyze the effectiveness of each corpus and the ideal order of the n-gram model.

In [1]:
%cd ..

/home/mtj0712/Documents/playground



In [2]:
import re
import math

import pygtrie

from reader import *

`punc_pattern` matches a single punctuation character. It will help us separate punctuation marks from actual words.

In [3]:
punc_pattern = re.compile('[‘’“”!\"#$%&\'()*+,\\-./:;<=>?@[\\\\\\]^_`{|}~]')

## King James Version

First, I build n-gram models from the King James Version of the Bible. This will give us a language model based on Early Modern English.

Let us start with the unigram model with add-one smoothing.
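For reference, the quantities computed in the cells below can be written out as follows. The symbols are my own shorthand rather than names from the code: $c(w)$ is the count of token $w$ (`kjv_unigram_trie[w]`), $N$ is the total number of tokens (`len(kjv_list)`), and $V$ is the vocabulary size (`kjv_unigram_v`).

$$
P(w) = \frac{c(w) + 1}{N + V}, \qquad
H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i), \qquad
\mathrm{PP} = 2^{H}
$$

Here $H$ is the cross-entropy in bits per token, measured on the same tokens the counts came from, and $\mathrm{PP}$ is the perplexity.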

In [4]:
kjv = KJVReader()
kjv_list = []
kjv_unigram_trie = pygtrie.StringTrie()
kjv_unigram_v = 0 # size of the vocabulary
kjv_unigram_H = 0 # cross entropy

In [5]:
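# Fill kjv_list with lowercase tokens, splitting punctuation marks off from words.
while not kjv.is_eof():
    units = kjv.read_sentence().lower().split()
    for u in units:
        # Peel tokens off the front of u until nothing is left.
        while u:
            match = punc_pattern.search(u)
            if match:
                i = match.start()
                if i == 0:
                    # u starts with a punctuation mark.
                    punc = u[0]
                    u = u[1:]
                else:
                    # A word precedes the punctuation mark; emit the word first.
                    first_word = u[:i]
                    punc = u[i]
                    u = u[i+1:]
                    kjv_list.append(first_word)
                kjv_list.append(punc)
            else:
                # No punctuation left: the remainder is a single word.
                kjv_list.append(u)
                break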
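To see what the splitting loop does to a single string, here is a minimal sketch that wraps the same logic in a function. The name `split_punctuation` and the sample sentence are introduced here for illustration only; they are not part of the notebook's `reader` module.

```python
import re

# Same pattern as punc_pattern in the notebook: one punctuation character.
punc_pattern = re.compile('[‘’“”!\"#$%&\'()*+,\\-./:;<=>?@[\\\\\\]^_`{|}~]')


def split_punctuation(text):
    """Lowercase text and split it into word and punctuation tokens.

    Hypothetical helper mirroring the tokenization loop above.
    """
    tokens = []
    for unit in text.lower().split():
        while unit:
            match = punc_pattern.search(unit)
            if match is None:
                # No punctuation left: the remainder is a single word.
                tokens.append(unit)
                break
            i = match.start()
            if i > 0:
                # Emit the word that precedes the punctuation mark.
                tokens.append(unit[:i])
            tokens.append(unit[i])
            unit = unit[i + 1:]
    return tokens


print(split_punctuation("And God said, Let there be light: and there was light."))
```

Every punctuation mark becomes its own token, so the comma and the period are counted like any other word by the models below.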

In [6]:
# Count each token; a KeyError marks the first occurrence of a new vocabulary entry.
for w in kjv_list:
    try:
        kjv_unigram_trie[w] += 1
    except KeyError:
        kjv_unigram_trie[w] = 1
        kjv_unigram_v += 1

print('Size of the vocabulary:', kjv_unigram_v)
print('Count of all words:', len(kjv_list))

# Cross-entropy in bits per token under the add-one-smoothed unigram model.
for w in kjv_list:
    p = (kjv_unigram_trie[w] + 1) / (len(kjv_list) + kjv_unigram_v)
    kjv_unigram_H += math.log2(p)
kjv_unigram_H /= -len(kjv_list)

print('Cross entropy:', kjv_unigram_H)
print('Perplexity:', 2 ** kjv_unigram_H)
print()

# List every token with its smoothed probability, most frequent first.
kjv_unigram_list = sorted(kjv_unigram_trie.items(), key=lambda t: t[1], reverse=True)
for word, count in kjv_unigram_list:
    p = (count + 1) / (len(kjv_list) + kjv_unigram_v)
    print('Word:', word, '| Probability:', p)

Size of the vocabulary: 12557
Count of all words: 917082
Cross entropy: 8.301704011836414
Perplexity: 315.54545031215395

Word: , | Probability: 0.07592194389435039
Word: the | Probability: 0.06876862954329584
Word: and | Probability: 0.055609758196461204
Word: of | Probability: 0.03723918639385826
Word: . | Probability: 0.028182982856786346
Word: to | Probability: 0.01458953421704554
Word: that | Probability: 0.013890338077468782
Word: : | Probability: 0.01366336825369848
Word: in | Probability: 0.01362679491716677
Word: he | Probability: 0.01120865196059976
Word: ; | Probability: 0.010907459777397464
Word: shall | Probability: 0.010583678180454994
Word: unto | Probability: 0.009679025944479523
Word: for | Probability: 0.009651058098896454
Word: i | Probability: 0.009525202793772636
Word: his | Probability: 0.009115366287343796
Word: a | Probability: 0.008796963122244227
Word: lord | Probability: 0.008567841925736765
Word: they | Probability: 0.007935338341011941
Word: be | Probability: ...

The cross-entropy (about 8.30 bits per token) and perplexity (about 315.5) of the model are extremely high. This is expected, since a unigram model ignores context entirely and is far from sufficient to represent an actual language. Also as expected, the most probable tokens are common punctuation marks and grammatical words, such as articles, prepositions, and pronouns.
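The same computation is easier to check by hand on a tiny corpus. The following is a minimal sketch that uses `collections.Counter` in place of `pygtrie`; the function names and the four-token corpus are made up for illustration.

```python
import math
from collections import Counter


def unigram_add_one(tokens):
    """Return a function giving the add-one-smoothed unigram probability."""
    counts = Counter(tokens)
    n = len(tokens)   # total number of tokens
    v = len(counts)   # size of the vocabulary

    def prob(word):
        # Counter returns 0 for unseen words, so they get 1 / (n + v).
        return (counts[word] + 1) / (n + v)

    return prob


def cross_entropy(tokens, prob):
    """Average negative log2 probability per token, in bits."""
    return -sum(math.log2(prob(w)) for w in tokens) / len(tokens)


def perplexity(tokens, prob):
    return 2 ** cross_entropy(tokens, prob)


toy = ['a', 'a', 'b', 'c']
p = unigram_add_one(toy)
print('P(a):', p('a'))   # (2 + 1) / (4 + 3)
print('Cross entropy:', cross_entropy(toy, p))
print('Perplexity:', perplexity(toy, p))
```

As in the cell above, the model is evaluated on the same tokens it was trained on, so these numbers describe the fit to the corpus rather than performance on unseen text.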

Next, I build 2- to 5-gram models. This time, instead of listing the probabilities of all n-grams, I build a sentence finisher for each model.
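As a rough idea of what a sentence finisher can look like, here is a minimal sketch for the bigram case only. It is one possible design and not necessarily the one used in this notebook: it stores counts in plain dictionaries instead of a trie, always picks the most probable next token (greedy), and stops at a sentence-ending mark or a length limit. All names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Collect context counts, continuation counts, and the vocabulary."""
    context_counts = Counter(tokens[:-1])      # how often each token has a successor
    next_counts = defaultdict(Counter)         # next_counts[prev][word]
    for prev, word in zip(tokens, tokens[1:]):
        next_counts[prev][word] += 1
    vocab = sorted(set(tokens))
    return context_counts, next_counts, vocab


def bigram_prob(model, prev, word):
    """Add-one-smoothed P(word | prev)."""
    context_counts, next_counts, vocab = model
    return (next_counts[prev][word] + 1) / (context_counts[prev] + len(vocab))


def finish_sentence(model, start, max_len=10, end_tokens=('.', '?', '!')):
    """Greedily extend start until an end token or max_len tokens."""
    vocab = model[2]
    words = list(start)
    while len(words) < max_len and words[-1] not in end_tokens:
        prev = words[-1]
        # Ties go to the token that comes first in the sorted vocabulary.
        words.append(max(vocab, key=lambda w: bigram_prob(model, prev, w)))
    return words


toy = "the lord is my shepherd . the lord is good .".split()
model = train_bigram(toy)
print(' '.join(finish_sentence(model, ['the'])))
```

A greedy finisher like this is deterministic and can fall into repeating loops on a large corpus, which is why the length limit is there; sampling from the smoothed distribution is a common alternative.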

In [7]:
# udhr_eng = UDHREngReader()

In [8]:
# while not udhr_eng.is_eof():
    # sentence = udhr_eng.read_sentence()
    # print(sentence, end='\n\n')
