# n-gram Language Modeling

In [1]:
import os, sys, re, json, time, unittest
import itertools, collections

import numpy as np
from scipy import stats

import nltk

# Helper libraries for this notebook
import ngram_lm, ngram_lm_test
import ngram_utils
from shared_lib import utils, vocabulary

# Add-k Smoothing

Below is the unsmoothed maximum likelihood estimate of $ P(w_i\ |\ w_{i-1}, w_{i-2})$, which uses the raw distribution over words seen in each context in the training data:

$$  \hat{P}(w_i = c\ |\ w_{i-1} = b, w_{i-2} = a) = \frac{C_{abc}}{\sum_{c'} C_{abc'}} $$

Add-k smoothing is a simple refinement in which we add $k > 0$ to each count $C_{abc}$, pretending we have seen every vocabulary word $k$ extra times in each context. So we have:

$$ \hat{P}_k(w_i = c\ |\ w_{i-1} = b, w_{i-2} = a) = \frac{C_{abc} + k}{\sum_{c'} (C_{abc'} + k)} = \frac{C_{abc} + k}{C_{ab} + k\cdot|V|} $$

where $|V|$ is the size of our vocabulary.

In the code below, we'll refer to $(w_{i-2}, w_{i-1})$ as the *context*, and $w_i$ as the current *word*. By convention, we'll somewhat interchangeably refer to the sequence $(w_{i-2}, w_{i-1}, w_i)$ as $abc$.
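As a minimal sketch of the add-k formula above, assuming trigram counts are kept in a plain nested dictionary: the names `count_trigrams`, `addk_proba`, and the toy corpus are illustrative only and are not part of `ngram_lm.py`.

```python
from collections import defaultdict

def count_trigrams(tokens):
    """Count C_abc, stored as counts[(a, b)][c], over a list of tokens."""
    counts = defaultdict(lambda: defaultdict(float))
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1.0
    return counts

def addk_proba(counts, vocab, context, word, k):
    """P_k(word | context) = (C_abc + k) / (C_ab + k * |V|)."""
    # .get() avoids inserting new keys into the defaultdict while predicting.
    context_counts = counts.get(context, {})
    c_abc = context_counts.get(word, 0.0)
    c_ab = sum(context_counts.values())
    return (c_abc + k) / (c_ab + k * len(vocab))

# Hypothetical toy corpus: two padded sentences.
toy_tokens = "<s> <s> the cat sat </s> <s> <s> the cat ran </s>".split()
toy_vocab = sorted(set(toy_tokens))
toy_counts = count_trigrams(toy_tokens)

# Probability of "sat" after ("the", "cat"), and the total over the vocabulary.
p_sat = addk_proba(toy_counts, toy_vocab, ("the", "cat"), "sat", k=0.5)
p_total = sum(addk_proba(toy_counts, toy_vocab, ("the", "cat"), w, k=0.5)
              for w in toy_vocab)
```

Because $k\cdot|V|$ is added to the denominator, the estimates still sum to one over the vocabulary, and a context that was never seen falls back to the uniform distribution $1/|V|$.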

## Part (b): Implementing the Add-k Model

Despite its shortcomings, it's worth implementing an add-k model as a baseline. Unlike the unsmoothed model, we'll be able to get some reasonable (or at least, finite) perplexity numbers which we can compare to the Kneser-Ney model below.

In [2]:
reload(ngram_lm)
reload(ngram_lm_test)
unittest.TextTestRunner(verbosity=2).run(
    unittest.TestLoader().loadTestsFromName(
        'TestAddKTrigramLM', ngram_lm_test))

test_context_totals (ngram_lm_test.TestAddKTrigramLM) ... ok
test_counts (ngram_lm_test.TestAddKTrigramLM) ... ok
test_next_word_proba_k_exists (ngram_lm_test.TestAddKTrigramLM) ... ok
test_next_word_proba_no_smoothing (ngram_lm_test.TestAddKTrigramLM) ... ok
test_no_mutate_on_predict (ngram_lm_test.TestAddKTrigramLM) ... ok
test_words (ngram_lm_test.TestAddKTrigramLM) ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.012s

OK


<unittest.runner.TextTestResult run=6 errors=0 failures=0>

# Kneser-Ney Smoothing

We now explore Kneser-Ney smoothing as a more sophisticated way of estimating the probabilities of unseen n-grams.

When building an n-gram model, we're limited by the model order (e.g. trigram, 4-gram, or 5-gram) and how much data is available. Within that, we want to use as much information as possible. Within, say, a trigram context, we can compute a number of different statistics that might be helpful. Let's review a few goals:
1. If we don't have good n-gram estimates, we want to back off to (n-1) grams.
2. If we back off to (n-1) grams, we should do it "smoothly".
3. Our counts $C_{abc}$ are probably _overestimates_ for the n-grams we observe (see *held-out reweighting*).
4. Type fertilities tell us more about $P(w_{new}\ |\ \text{context})$ than the unigram distribution does.

Kneser-Ney smoothing combines all four of these ideas. 

**Absolute discounting**, which follows from goal 3, gives us an easy way to back off (goals 1 and 2) by redistributing the subtracted probability mass according to the backoff distribution $\tilde{P}(c\ |\ b)$. The discount, $\delta$, is a hyperparameter normally selected on a cross-validation set, although here we'll just let $\delta = 0.75$.

$$ P_{ad}(c\ |\ b, a) = \frac{C_{abc} - \delta}{C_{ab}} + \alpha_{ab} \tilde{P}(c\ |\  b) $$

Where $\alpha_{ab}$ is a backoff factor, derived from the counts, that guarantees that the probabilities are normalized: $\sum_{c'} P_{ad}(c'\ |\ b, a) = 1$. This definition is recursive: if we let $\tilde{P}(c\ |\  b) = P_{ad}(c\ |\ b)$, then the backoff distribution can also back off to even lower n-grams.

*Note:* we need the numerator above to be non-negative, so it should actually read $\max(0, C_{abc} - \delta)$.
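One way to see where $\alpha_{ab}$ comes from: if every observed count is at least $\delta$ (true for integer counts and $\delta \le 1$), each of the distinct words seen after the context gives up exactly $\delta$, so the removed mass is $\delta$ times the number of distinct following words, divided by $C_{ab}$. The sketch below checks this numerically. The counts are made up, the uniform backoff is only a placeholder for $\tilde{P}$, and `discounted_proba` is a hypothetical helper, not a function from `ngram_lm.py`.

```python
def discounted_proba(context_counts, backoff, word, delta=0.75):
    """Absolute discounting with interpolation for one fixed context (a, b).

    context_counts: dict mapping word -> C_abc for this context.
    backoff: dict mapping word -> lower-order probability (sums to 1).
    """
    c_ab = float(sum(context_counts.values()))
    # Number of distinct words observed after this context.
    nnz = sum(1 for count in context_counts.values() if count > 0)
    # Mass removed by discounting; assumes every nonzero count is >= delta.
    alpha = delta * nnz / c_ab
    discounted = max(0.0, context_counts.get(word, 0) - delta) / c_ab
    return discounted + alpha * backoff[word]

# Made-up counts for a single context, with a uniform placeholder backoff.
toy_context_counts = {"sat": 3, "ran": 1}
toy_words = ["sat", "ran", "slept", "</s>"]
toy_backoff = {w: 1.0 / len(toy_words) for w in toy_words}

probs = {w: discounted_proba(toy_context_counts, toy_backoff, w)
         for w in toy_words}
```

Words never seen in this context, such as `slept` here, receive probability only through the backoff term, while the seen words keep most of their original mass.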

**Type fertility** is goal 4. Instead of falling back to the unigram distribution at the end, we'll define $\hat{P}(w)$ as proportional to the type fertility of $w$, that is, the *number of unique preceding words* $w_{i-1}$. In the following equation, $c$ is the word whose probability we are estimating, and $b'$ ranges over the words that occur immediately before $c$ in the training data.

$$ \hat{P}_{tf}(c) \propto \left|\ \{ b' : C_{b'c} > 0 \}\ \right| = tf(c)$$

In order to make this a valid probability distribution, we need to normalize it with a factor $Z_{tf} = \sum_{w} tf(w)$, so we have $\hat{P}_{tf}(w) = \frac{tf(w)}{Z_{tf}} $
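A small sketch of computing $tf(w)$ and $Z_{tf}$ from a flat token list follows. The function name and toy corpus are illustrative assumptions. Because the tokens form a single stream, the sketch also counts the boundary bigram `</s> <s>`, which a real implementation might treat differently.

```python
from collections import defaultdict

def type_fertility(tokens):
    """Return (tf, z_tf), where tf[w] counts distinct words seen right before w."""
    preceding = defaultdict(set)
    for b, c in zip(tokens, tokens[1:]):
        preceding[c].add(b)
    tf = {w: len(prev) for w, prev in preceding.items()}
    z_tf = float(sum(tf.values()))
    return tf, z_tf

# "francisco" and "cat" both occur twice, but "francisco" only ever follows "san".
toy_tokens = ("<s> san francisco </s> <s> san francisco </s> "
              "<s> the cat </s> <s> a cat </s>").split()
tf, z_tf = type_fertility(toy_tokens)
p_tf = {w: tf[w] / z_tf for w in tf}
```

In this toy corpus the two words have the same unigram count, yet `cat` gets twice the type-fertility probability of `francisco`, because it appears after two different words. Note also that $Z_{tf}$ equals the number of distinct bigram types.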

### KN Equations

Putting it all together, we have our equations for a KN trigram model:

$$ P_{kn}(c\ |\ b, a) = \frac{\max(0, C_{abc} - \delta)}{C_{ab}} + \alpha_{ab} P_{kn}(c\ |\  b) $$
where the bigram backoff is:
$$ P_{kn}(c\ |\ b) = \frac{\max(0, C_{bc} - \delta)}{C_{b}} + \alpha_{b} P_{kn}(c) $$
and the unigram (type fertility) backoff is:
$$ P_{kn}(c) = \frac{tf(c)}{Z_{tf}} \quad \text{where} \quad tf(c) = \left|\ \{ b' : C_{b'c} > 0 \}\ \right| $$

Note that there is only one free parameter in this model, $\delta$.
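To make the recursion concrete, here is a toy model that follows the three equations above. It is only a sketch and is not the `KNTrigramLM` from `ngram_lm.py`: the class name, its attributes, the use of $\alpha = \delta \cdot N / C$ (with $N$ the number of distinct words seen after the context), and the choice to fall straight through to the lower-order estimate for an unseen context are all assumptions.

```python
from collections import defaultdict

class ToyKN(object):
    """Toy interpolated Kneser-Ney trigram model over a closed vocabulary."""

    def __init__(self, tokens, delta=0.75):
        self.delta = delta
        self.tri = defaultdict(lambda: defaultdict(float))  # (a, b) -> c -> C_abc
        self.bi = defaultdict(lambda: defaultdict(float))   # (b,)   -> c -> C_bc
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            self.tri[(a, b)][c] += 1.0
        preceding = defaultdict(set)
        for b, c in zip(tokens, tokens[1:]):
            self.bi[(b,)][c] += 1.0
            preceding[c].add(b)
        self.words = sorted(set(tokens))
        # Type fertility and its normalizer for the unigram backoff.
        self.tf = {w: len(preceding[w]) for w in self.words}
        self.z_tf = float(sum(self.tf.values()))

    def _interp(self, counts, context, word, lower_proba):
        """One level of discounting plus weighted backoff."""
        ctx = counts.get(context)  # .get() leaves the counts unmodified
        if not ctx:
            # Assumption: an unseen context defers entirely to the lower order.
            return lower_proba
        total = sum(ctx.values())
        alpha = self.delta * len(ctx) / total
        discounted = max(0.0, ctx.get(word, 0.0) - self.delta) / total
        return discounted + alpha * lower_proba

    def proba(self, a, b, word):
        """P_kn(word | b, a), built up from unigram to bigram to trigram."""
        p_uni = self.tf[word] / self.z_tf
        p_bi = self._interp(self.bi, (b,), word, p_uni)
        return self._interp(self.tri, (a, b), word, p_bi)

# Hypothetical toy corpus with trigram padding.
toy_tokens = ("<s> <s> the cat sat </s> <s> <s> the dog sat </s> "
              "<s> <s> a cat ran </s>").split()
toy_lm = ToyKN(toy_tokens)
seen_total = sum(toy_lm.proba("the", "cat", w) for w in toy_lm.words)
unseen_total = sum(toy_lm.proba("dog", "ran", w) for w in toy_lm.words)
```

Both totals come out to one (up to floating-point error), which is a quick sanity check that the backoff weights restore the mass removed by discounting.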

## Part (d): Implementing the KN Model

Implement the `KNTrigramLM` in `ngram_lm.py`.

In [3]:
reload(ngram_lm)
reload(ngram_lm_test)
unittest.TextTestRunner(verbosity=2).run(
    unittest.TestLoader().loadTestsFromName(
        'TestKNTrigramLM', ngram_lm_test))

test_context_nnz (ngram_lm_test.TestKNTrigramLM) ... ok
test_context_totals (ngram_lm_test.TestKNTrigramLM) ... ok
test_counts (ngram_lm_test.TestKNTrigramLM) ... ok
test_kn_interp (ngram_lm_test.TestKNTrigramLM) ... ok
test_next_word_proba (ngram_lm_test.TestKNTrigramLM) ... ok
test_no_mutate_on_predict (ngram_lm_test.TestKNTrigramLM) ... ok
test_type_contexts (ngram_lm_test.TestKNTrigramLM) ... ok
test_type_fertility (ngram_lm_test.TestKNTrigramLM) ... ok
test_words (ngram_lm_test.TestKNTrigramLM) ... ok
test_z_tf (ngram_lm_test.TestKNTrigramLM) ... ok

----------------------------------------------------------------------
Ran 10 tests in 0.024s

OK


<unittest.runner.TextTestResult run=10 errors=0 failures=0>

# Training the Model

The same code below can be used with either model; in the "Select the model" section, you can choose the add-k model or the KN model.

## Loading & Preprocessing
Once again, we'll build our model on the Brown corpus. We'll do an 80/20 train/test split, and preprocess words by lowercasing and replacing digits with `DG` (so `2016` becomes `DGDGDGDG`).

In a slight departure from the `lm1.ipynb` demo, we'll restrict the vocabulary to 30000 words. This way, a small fraction of the *training* data will be mapped to `<unk>` tokens, and the model can learn n-gram probabilities that include `<unk>` for prediction on the test set. (If we interpret `<unk>` as meaning "rare word", then this is somewhat plausible as a way to infer things about the class of rare words.)
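The actual preprocessing is done by `utils.canonicalize_word` and `vocabulary.Vocabulary`, whose internals are not shown in this notebook. As a rough stand-in, the hypothetical helpers below illustrate the idea: lowercase, map each digit to `DG`, keep only the most frequent words, and replace everything else with `<unk>`. Details such as how special tokens count toward the vocabulary size are assumptions and may differ from the real code.

```python
import re
from collections import Counter

def toy_canonicalize(word, wordset=None):
    """Lowercase, replace each digit with 'DG', map unknown words to <unk>."""
    word = re.sub(r"\d", "DG", word.lower())
    if wordset is not None and word not in wordset:
        return "<unk>"
    return word

def build_wordset(words, size):
    """Keep the `size` most frequent canonicalized words."""
    counts = Counter(toy_canonicalize(w) for w in words)
    return set(w for w, _ in counts.most_common(size))

# Hypothetical toy data with a vocabulary limit of two words.
toy_words = "The cat saw the dog in 2016 and the cat ran".split()
toy_wordset = build_wordset(toy_words, size=2)
canonical = [toy_canonicalize(w, toy_wordset) for w in toy_words]
```

With a limit of two words, only `the` and `cat` survive, and every other token (including the canonicalized year) becomes `<unk>`, so the model sees `<unk>` in training contexts just as it will at test time.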

In [4]:
assert(nltk.download('brown'))  # Make sure we have the data.
corpus = nltk.corpus.brown
V = 30000
train_sents, test_sents = utils.get_train_test_sents(corpus, split=0.8, shuffle=False)
vocab = vocabulary.Vocabulary((utils.canonicalize_word(w) for w in utils.flatten(train_sents)), size=V)
# vocab = vocabulary.Vocabulary((utils.canonicalize_word(w) for w in utils.flatten(corpus.sents())), size=V)
print "Train set vocabulary: %d words" % vocab.size

[nltk_data] Downloading package brown to
[nltk_data]     /home/guangzhi_xie/nltk_data...
[nltk_data]   Package brown is already up-to-date!
Loaded 57340 sentences (1.16119e+06 tokens)
Training set: 45872 sentences (979646 tokens)
Test set: 11468 sentences (181546 tokens)
Train set vocabulary: 30000 words


Our smoothed models will also be trigram models, so for convenience we'll also prepend *two* `<s>` markers. (We could avoid this, but then we'd need special handling for the first token of each sentence.)

To make the tokens easier to work with, we'll store them in a NumPy array.

In [5]:
def sents_to_tokens(sents):
    """Returns a flattened array of the words in the sentences, with padding for a trigram model."""
    padded_sentences = (["<s>", "<s>"] + s + ["</s>"] for s in sents)
    # This will canonicalize words, and replace anything not in vocab with <unk>
    return np.array([utils.canonicalize_word(w, wordset=vocab.wordset) 
                     for w in utils.flatten(padded_sentences)], dtype=object)

train_tokens = sents_to_tokens(train_sents)
test_tokens = sents_to_tokens(test_sents)
print "Sample data: \n" + repr(train_tokens[:20])

Sample data: 
array(['<s>', '<s>', u'the', u'fulton', u'county', u'grand', u'jury',
       u'said', u'friday', u'an', u'investigation', u'of', u"atlanta's",
       u'recent', u'primary', u'election', u'produced', u'``', u'no',
       u'evidence'], dtype=object)


## Select the model

Select either `AddKTrigramLM` or `KNTrigramLM` in the cell below. If switching models, you only need to re-run the cells below here - no need to re-run the preprocessing.

In [6]:
import ngram_lm
reload(ngram_lm)

# Uncomment the line below for the model you want to run.
# Model = ngram_lm.AddKTrigramLM
Model = ngram_lm.KNTrigramLM

t0 = time.time()
print "Building trigram LM...",
lm = Model(train_tokens)
print "done in %.02f s" % (time.time() - t0)
ngram_utils.print_stats(lm)

Building trigram LM... done in 6.83 s
=== N-gram Language Model stats ===
30000 unique 1-grams
358274 unique 2-grams
733388 unique 3-grams
Optimal memory usage (counts only): 24 MB


Use `set_live_params` to change the smoothing parameters. `AddKTrigramLM` will ignore the value of `delta`, and `KNTrigramLM` will ignore `k`.

In [7]:
lm.set_live_params(k = 0.001, delta=0.75)

## Sampling Sentences

In [8]:
max_length = 20
num_sentences = 5

for _ in range(num_sentences):
    seq = ["<s>", "<s>"]
    for i in range(max_length):
        seq.append(ngram_utils.predict_next(lm, seq))
        # Stop at end-of-sentence.
        if seq[-1] == "</s>": break
    print " ".join(seq)
    print "[{1:d} tokens; log P(seq): {0:.02f}]".format(*ngram_utils.score_seq(lm, seq))
    print ""

<s> <s> but fatty foods unless it gives him , ( DG ) the possibility of the corresponding changes in world development
[20 tokens; log P(seq): -109.24]

<s> <s> <s> when he died young bearden experienced idea-exchange violin called up to the porch of the american '' . </s>
[18 tokens; log P(seq): -144.88]

<s> <s> a christian youth for christ lives . </s>
[7 tokens; log P(seq): -46.83]

<s> <s> alec leaned on the jelke cleared circus if he angry , try to play , was treated effluent varied to
[20 tokens; log P(seq): -144.93]

<s> <s> mr. lay with its very existence to blind decomposition of this distinguished associated graduate stake in the principal post to
[20 tokens; log P(seq): -143.36]



## Scoring on Held-Out Data

In [9]:
log_p_data, num_real_tokens = ngram_utils.score_seq(lm, train_tokens)
print "Train perplexity: %.02f" % (2**(-1*log_p_data/num_real_tokens))

log_p_data, num_real_tokens = ngram_utils.score_seq(lm, test_tokens)
print "Test perplexity: %.02f" % (2**(-1*log_p_data/num_real_tokens))

Train perplexity: 17.17
Test perplexity: 286.60
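The perplexity formula used in the cell above is 2 raised to the negative average log probability per token, assuming `score_seq` returns base-2 log probabilities (as the `2**` suggests). A toy check with a hypothetical `perplexity` helper: a model that assigns probability 1/8 to every token should have a perplexity of 8.

```python
import math

def perplexity(log2_probs):
    """Perplexity from a list of per-token base-2 log probabilities."""
    return 2 ** (-sum(log2_probs) / len(log2_probs))

# A model that gives every token probability 1/8 is as uncertain as a
# uniform choice among 8 words.
uniform_log2_probs = [math.log(1.0 / 8, 2)] * 10
ppl = perplexity(uniform_log2_probs)
```

Read this way, the test perplexity above says the model is, on average, about as uncertain as a uniform choice among roughly 287 words at each position.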


## Linguistic Curiosities

You might have seen this floating around the internet:
![Adjective Order](adjective_order.jpg)
*source: https://twitter.com/MattAndersonBBC/status/772002757222002688?lang=en*

Let's see if it holds true, statistically at least. Note that log probabilities are never positive, so a score closer to zero (smaller magnitude) is better. And remember the log scale: since these are base-2 logs, a difference of 8 units means one utterance is $2^8 = 256$ times more likely!

In [10]:
def preprocess_for_scoring(sentence):
    # Pre-process words, replace anything the model doesn't know
    # with <unk>
    words = [utils.canonicalize_word(w, wordset=known_words)
             for w in sentence]
    # Pad sequence with start and end markers
    return ["<s>", "<s>"] + words + ["</s>"]

known_words = vocab.wordset
s0 = preprocess_for_scoring("square green plastic toys".split())
s1 = preprocess_for_scoring("plastic green square toys".split())

In [11]:
print "s0 score: %.02f" % ngram_utils.score_seq(lm, s0)[0]
print "s1 score: %.02f" % ngram_utils.score_seq(lm, s1)[0]

s0 score: -51.91
s1 score: -60.99
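To turn these two scores into a likelihood ratio, we can exponentiate their difference. This assumes the scores are base-2 log probabilities, consistent with the `2**` used in the perplexity cell; the numbers are copied from the output above.

```python
s0_score = -51.91  # "square green plastic toys"
s1_score = -60.99  # "plastic green square toys"

# Subtracting log probabilities corresponds to dividing probabilities.
likelihood_ratio = 2 ** (s0_score - s1_score)
```

Under this model, the first ordering comes out several hundred times more likely than the second.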


In [12]:
noun = "toys"
adjectives = ["square", "green", "plastic"]
results = []
for adjs in itertools.permutations(adjectives):
    words = list(adjs) + [noun]
    seq = preprocess_for_scoring(words)
    score = ngram_utils.score_seq(lm, seq)
    results.append((score[0], words))

# Sort results
for score, words in sorted(results, reverse=True):
    print "\"%s\" : %.02f" % (" ".join(words), score)

"square green plastic toys" : -51.91
"green square plastic toys" : -51.94
"plastic square green toys" : -60.99
"plastic green square toys" : -60.99
"square plastic green toys" : -61.21
"green plastic square toys" : -61.24
