# 2.2: Examining language models

The `ARPA` (and `iARPA`) format is very interpretable.  If you haven't yet done so, read this short [blog post](https://cmusphinx.github.io/wiki/arpaformat/) for more information on how to interpret them.

In [1]:
%%bash
cat resource_files/language_model/animal_lm-2_gram.iarpa

iARPA

\data\
ngram 1=	15
ngram 2=	33

\1-grams:
-1.826075	<s>	-0.698970
-0.583037	the	-0.767686
-1.348954	mouse	-0.352183
-0.951014	ate	-1.176091
-1.348954	cheese	-0.544068
-1.282007	</s>	-0.845098
-1.282007	cat	-0.397940
-1.085712	and	-1.041393
-1.282007	dog	-0.397940
-1.282007	lion	-0.477121
-1.348954	tyrannosaurus	-0.778151
-1.348954	rex	-0.544068
-1.826075	human	-0.301030
-1.826075	shot	-0.301030
-0.951014	<unk>
\2-grams:
-0.698970	<s> <s>
-0.221849	<s> the
-0.913814	the mouse
-0.913814	the cheese
-0.834633	the cat
-0.834633	the dog
-0.834633	the lion
-0.913814	the tyrannosaurus
-1.612784	the human
-0.954243	mouse the
-0.954243	mouse ate
-0.954243	mouse </s>
-0.653213	mouse and
-0.029963	ate the
-0.845098	cheese </s>
-0.243038	cheese and
-1.000000	cat the
-0.698970	cat ate
-1.000000	cat </s>
-0.698970	cat and
-0.041393	and the
-1.000000	dog the
-0.522879	dog ate
-1.000000	dog </s>
-1.000000	dog and
-0.352183	lion ate
-0.954243	lion </s>
-0.954243	lion and
-0.079181	tyrannosaurus r
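To make the layout of each entry concrete, here is a minimal sketch of how a single `n-gram` line could be split into its parts.  It assumes the fields are tab-separated and that the words of an `n-gram` are separated by single spaces, as in the file above; `ArpaEntry` and `parse_arpa_line` are hypothetical helper names, not part of any library.

```python
from typing import NamedTuple, Optional, Tuple


class ArpaEntry(NamedTuple):
    """One n-gram line from an ARPA-style file (hypothetical helper type)."""
    log_prob: float            # log10 probability of the n-gram
    ngram: Tuple[str, ...]     # the words of the n-gram, as a tuple
    backoff: Optional[float]   # log10 backoff weight, or None when absent


def parse_arpa_line(line: str) -> ArpaEntry:
    """Split a tab-separated n-gram line into probability, words, and backoff."""
    fields = line.rstrip("\n").split("\t")
    log_prob = float(fields[0])
    ngram = tuple(fields[1].split(" "))
    # The backoff weight is optional: highest-order n-grams do not carry one.
    backoff = float(fields[2]) if len(fields) > 2 else None
    return ArpaEntry(log_prob, ngram, backoff)


print(parse_arpa_line("-1.282007\tdog\t-0.397940"))
print(parse_arpa_line("-0.834633\tthe dog"))
```

The first number is always the probability; a number after the words, when present, is the backoff weight.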

## Using `PyNLPl`

We can use the [`PyNLPl`](http://pynlpl.readthedocs.io/en/latest/) (pronounced "pineapple") package in `python` to examine our language model.

In [2]:
import pynlpl.lm.lm as lm

### loading in `.iARPA` files

`ARPALanguageModel()` will import an existing **`iARPA`** formatted language model.

**Note:** Recall that in the last notebook we had to run a quick `sed` command over the `.iarpa` format because there were times where the whitespace between a probability and the `1-gram` was a `" "` instead of a `\t`.

In [3]:
bi_gram_lm = lm.ARPALanguageModel(
    filename="resource_files/language_model/animal_lm-2_gram.iarpa",
    base_e=False,  # this will keep the log probabilities in `base 10` so that they match up with the original file
    debug=True     # this argument will allow you to more easily see how the data is stored in the object
)

Unable to parse ARPA LM line: iARPA
Adding to LM: (u'<s>',)	-1.826075	-0.69897
Adding to LM: (u'the',)	-0.583037	-0.767686
Adding to LM: (u'mouse',)	-1.348954	-0.352183
Adding to LM: (u'ate',)	-0.951014	-1.176091
Adding to LM: (u'cheese',)	-1.348954	-0.544068
Adding to LM: (u'</s>',)	-1.282007	-0.845098
Adding to LM: (u'cat',)	-1.282007	-0.39794
Adding to LM: (u'and',)	-1.085712	-1.041393
Adding to LM: (u'dog',)	-1.282007	-0.39794
Adding to LM: (u'lion',)	-1.282007	-0.477121
Adding to LM: (u'tyrannosaurus',)	-1.348954	-0.778151
Adding to LM: (u'rex',)	-1.348954	-0.544068
Adding to LM: (u'human',)	-1.826075	-0.30103
Adding to LM: (u'shot',)	-1.826075	-0.30103
Adding to LM: (u'<unk>',)	-0.951014
Adding to LM: (u'<s>', u'<s>')	-0.69897
Adding to LM: (u'<s>', u'the')	-0.221849
Adding to LM: (u'the', u'mouse')	-0.913814
Adding to LM: (u'the', u'cheese')	-0.913814
Adding to LM: (u'the', u'cat')	-0.834633
Adding to LM: (u'the', u'dog')	-0.834633
Adding to LM: (u'the', u'lion')	-0.834633
Addin

You'll notice that each `n-gram` is stored as a `<tuple>`, even `1-gram`s ==> `([word],)`.

### looking up **existing** `n-gram`s

`.ngrams` contains all of the `n-gram`s **present** in our language model.  We can access either:
 - the probability ==> `.prob()`
 - the backoff probability ==> `.backoff()`

In [4]:
bi_gram_lm.ngrams.prob(("dog",)), bi_gram_lm.ngrams.backoff(("dog",))

(-1.282007, -0.39794)

We can confirm this by double-checking the values in the original `.iarpa` file.

In [5]:
%%bash
cat resource_files/language_model/animal_lm-2_gram.iarpa | grep -P "\tdog\t"

-1.282007	dog	-0.397940


In [6]:
bi_gram_lm.ngrams.prob(("the", "dog")), bi_gram_lm.ngrams.backoff(("the", "dog"))

(-0.834633, 0.0)

In [7]:
%%bash
cat resource_files/language_model/animal_lm-2_gram.iarpa | grep -P "the dog"

-0.834633	the dog


But if we try to look up an `n-gram` that is **not** explicitly present in the language model, we get a `KeyError`.

In [8]:
try:
    bi_gram_lm.ngrams.prob(("human", "ate"))
except Exception as e:
    print("n-gram {} doesn't exist in language model".format(e))

n-gram ('human', 'ate') doesn't exist in language model


In [9]:
try:
    bi_gram_lm.ngrams.prob(("the", "dog", "ate"))
except Exception as e:
    print("n-gram {} doesn't exist in language model".format(e))

n-gram ('the', 'dog', 'ate') doesn't exist in language model


### calculating new `n-gram` probabilities

There are two cases where this will occur:
 1. The `n-gram` is of the size of the language model **but** this particular `n-gram` is **not found** in the language model.
 2. The `n-gram` is **larger** than that of the language model.  In other words, you want the probability of a `3-gram`, but your language model is only made up of `2-gram`s.
 
In both cases, we can use `.score()`.  To score a new `n-gram`, provide that `n-gram` as a `<tuple>`.

#### `n-gram` is **not present** in language model

In cases like these, we need to access `backoff` probabilities, which are designed precisely for this purpose.  You'll notice that in our `2-gram` language model, backoff probabilities exist for `1-gram`s only.  It is the number that comes **after** the word.

In [10]:
cat resource_files/language_model/animal_lm-2_gram.iarpa | grep -A15 -E "1-grams"

\1-grams:
-1.826075	<s>	-0.698970
-0.583037	the	-0.767686
-1.348954	mouse	-0.352183
-0.951014	ate	-1.176091
-1.348954	cheese	-0.544068
-1.282007	</s>	-0.845098
-1.282007	cat	-0.397940
-1.085712	and	-1.041393
-1.282007	dog	-0.397940
-1.282007	lion	-0.477121
-1.348954	tyrannosaurus	-0.778151
-1.348954	rex	-0.544068
-1.826075	human	-0.301030
-1.826075	shot	-0.301030
-0.951014	<unk>


So if we wanted to find the probability of "human ate", it would be calculated as:

$p(human\_ate) = p(human) + p(ate|human) = p(human) + p(ate) + bWt(human)$

**Note:** Remember that because our probabilities are in `log`-space, we **add** instead of **multiply**.  And since all of our log probabilities are `negative`, the **closer** the value is to `0`, the "more likely" the word/sequence is.

In [11]:
bi_gram_lm.score(("human", "ate"))

-3.078119

We can confirm this by doing the calculations ourselves.

In [12]:
bi_gram_lm.ngrams.prob(("human",)) + \
bi_gram_lm.ngrams.prob(("ate",)) + \
bi_gram_lm.ngrams.backoff(("human",))

-3.078119
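The backoff lookup can also be sketched in plain Python.  This is only an illustration of the standard backoff recipe under stated assumptions, not `PyNLPl`'s actual implementation: `PROB`, `BACKOFF`, `score_word`, and `score` are hypothetical names, and the dictionaries hold a small subset of the values from the `.iarpa` file.

```python
# Log10 values copied from the `.iarpa` file shown earlier (a small subset).
PROB = {
    ("the",): -0.583037,
    ("ate",): -0.951014,
    ("dog",): -1.282007,
    ("human",): -1.826075,
    ("<unk>",): -0.951014,   # note: same value as ("ate",) in this model
    ("the", "dog"): -0.834633,
    ("dog", "ate"): -0.522879,
}
BACKOFF = {
    ("the",): -0.767686,
    ("ate",): -1.176091,
    ("dog",): -0.397940,
    ("human",): -0.301030,
}
ORDER = 2  # we are imitating a 2-gram model


def score_word(word, history=()):
    """Log10 probability of `word` given `history`, backing off when needed."""
    context = tuple(history)[-(ORDER - 1):] if ORDER > 1 else ()
    ngram = context + (word,)
    if ngram in PROB:
        return PROB[ngram]            # the n-gram is modeled explicitly
    if not context:
        return PROB[("<unk>",)]       # unknown word with nothing left to drop
    # Pay the backoff weight of the context, then retry with a shorter context.
    return BACKOFF.get(context, 0.0) + score_word(word, context[1:])


def score(words):
    """Sum the per-word log10 probabilities over the whole sequence."""
    total, history = 0.0, ()
    for word in words:
        total += score_word(word, history)
        history += (word,)
    return total


print(round(score(("human", "ate")), 6))
```

A context with no recorded backoff weight is treated as contributing `0`, which is an assumption of this sketch.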

**Note:** If you forget to enter the `n-gram` as a `<tuple>`, the `<string>` will be considered an `n-gram` of **characters**, **none of which** will be present in the language model, so the score will be equal to $p(UNK) * len(string)$.

In [13]:
bi_gram_lm.score("human ate")

-8.559126

In [14]:
result = 0
for i in "human ate":
    result += bi_gram_lm.scoreword(i)
result

-8.559126
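One way to avoid this pitfall is to normalise the input before scoring.  The helper below is a hypothetical convenience (`as_ngram` is not part of `PyNLPl`) and assumes that splitting on whitespace is an acceptable tokenisation for your text.

```python
def as_ngram(text_or_tokens):
    """Return a tuple of word tokens, splitting a plain string on whitespace."""
    if isinstance(text_or_tokens, str):
        return tuple(text_or_tokens.split())
    return tuple(text_or_tokens)


print(as_ngram("human ate"))           # word tokens, safe to pass to .score()
print(as_ngram(["the", "dog", "ate"]))  # lists are converted to tuples
print(len("human ate"))                 # number of characters scored by mistake
```

The result of `as_ngram("human ate")` could then be passed to `bi_gram_lm.score()` in place of the raw string.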

#### `n-gram` is **larger** than language model

If we want to get the probability of the sequence, `"the dog ate"` using our `2-gram` language model, it will be calculated as follows:

$p(the\_dog\_ate) = p(the) + p(dog|the) + p(ate|dog)$

In [15]:
bi_gram_lm.score(("the", "dog", "ate"))

-1.9405489999999999

We can again confirm this by doing the calculation ourselves.

In [16]:
bi_gram_lm.ngrams.prob(("the",)) + \
bi_gram_lm.ngrams.prob(("the", "dog")) + \
bi_gram_lm.ngrams.prob(("dog", "ate"))

-1.9405489999999999

But if any one of the `n-grams` we use to compose our sequence is **not** present in our language model, we again need to utilize the `backoff` probabilities.

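$p(the\_triceratops\_ate) = p(the) + p(triceratops|the) + p(ate|triceratops) = p(the) + p(UNK) + bWt(the) + p(ate) + bWt(triceratops)$

Because `"triceratops"` is not in the language model, it has no backoff weight of its own, so $bWt(triceratops)$ contributes `0` to the sum.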

In [17]:
bi_gram_lm.score(("the", "triceratops", "ate"))

-3.252751

In [18]:
bi_gram_lm.ngrams.prob(("the",)) + \
bi_gram_lm.ngrams.prob(("<unk>",)) + \
bi_gram_lm.ngrams.backoff(("the",)) + \
bi_gram_lm.ngrams.prob(("ate",))

-3.252751

## Comparing probabilities

Now that we have the tools, let's look at how "likely" particular sequences of words will be given our language model.

Our animal corpus provided us with a "hierarchical" understanding of the food chain.  For example, our `"mouse"` could only eat **one** thing while our `"lion"` ate **four** things.

In [19]:
cat resource_files/language_model/animal_corpus.txt | grep "mouse ate"

the mouse ate the cheese


In [20]:
cat resource_files/language_model/animal_corpus.txt | grep "lion ate"

the lion ate the cheese and the lion ate the mouse and the lion ate the cat and the lion ate the dog


And so intuitively, the `2-gram` `"mouse ate"` should be **four times less likely** than `"lion ate"`.

**Note:** Remember that these probabilities are in `log base 10` space, so we need to do a quick conversion in order to see the expected ratio between the probabilities.

In [21]:
10**bi_gram_lm.ngrams.prob(("mouse", "ate")), 10**bi_gram_lm.ngrams.prob(("lion", "ate"))

(0.11111098560477115, 0.4444439512937877)
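The ratio itself can be checked without the language model object.  This small sketch hard-codes the two `log base 10` values from the `.iarpa` file; the variable names are only illustrative.

```python
# Log10 probabilities copied from the 2-gram section of the `.iarpa` file.
log_p_mouse_ate = -0.954243
log_p_lion_ate = -0.352183

# Subtracting logs is the same as dividing the underlying probabilities.
ratio = 10 ** (log_p_lion_ate - log_p_mouse_ate)
print(round(ratio, 2))
```

Working with the difference of the logs avoids converting each value separately before dividing.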

And as the `"lion"` is relatively "high up" in our food chain, it is eaten by fewer things than the `"mouse"` is.

In [22]:
cat resource_files/language_model/animal_corpus.txt | grep "ate the lion"

the tyrannosaurus rex ate the cheese and the tyrannosaurus rex ate the cat and the tyrannosaurus rex ate the dog and the tyrannosaurus rex ate the lion


In [23]:
cat resource_files/language_model/animal_corpus.txt | grep "ate the mouse"

the cat ate the cheese and the cat ate the mouse
the dog ate the cheese and the dog ate the mouse and the dog ate the cat
the lion ate the cheese and the lion ate the mouse and the lion ate the cat and the lion ate the dog


So we might expect `"ate the lion"` to be **three times less likely** than `"ate the mouse"`.  And since we are dealing with a `2-gram` language model, we will need to do some probability calculations this time (instead of being able to simply "look up" the probabilities).

In [24]:
10**bi_gram_lm.score(("ate", "the", "lion")), 10**bi_gram_lm.score(("ate", "the", "mouse"))

(0.015289384411767658, 0.012741160894919555)

But this is clearly not the case!  In fact, `"ate the lion"` is **more likely** than `"ate the mouse"`!

Remembering how these probabilities are calculated, we can see why this is the case:
    
$p(ate\_the\_XXX) = p(ate) + p(the|ate) + p(XXX|the)$

In [25]:
bi_gram_lm.ngrams.prob(("the", "lion")), bi_gram_lm.ngrams.prob(("the", "mouse"))

(-0.834633, -0.913814)

Because the `"lion"` appeared **one more time** than the `"mouse"`, $p(lion|the)$ is higher than $p(mouse|the)$, and this increased probability carried over into our `3-gram` calculation.
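The two scores can be reproduced by hand from the values in the `.iarpa` file.  This is a worked sketch with hard-coded numbers (the variable names are illustrative); the only term that differs between the two sequences is the last one.

```python
# Log10 values copied from the `.iarpa` file.
p_ate = -0.951014              # 1-gram "ate"
p_the_given_ate = -0.029963    # 2-gram "ate the"
p_lion_given_the = -0.834633   # 2-gram "the lion"
p_mouse_given_the = -0.913814  # 2-gram "the mouse"

# Add in log space, then convert back to a plain probability.
ate_the_lion = 10 ** (p_ate + p_the_given_ate + p_lion_given_the)
ate_the_mouse = 10 ** (p_ate + p_the_given_ate + p_mouse_given_the)

print(ate_the_lion, ate_the_mouse)
print(ate_the_lion > ate_the_mouse)
```

Since the first two terms are shared, the comparison is decided entirely by the `2-gram`s `"the lion"` and `"the mouse"`, which say nothing about what was eaten.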

But then, in that case, our `3-gram` language model should be able to handle this problem better.  It would have modeled `"ate the XXX"` explicitly and would not, therefore, need to generate a probability by considering the smaller `n-gram`s that caused us problems above.

In [26]:
tri_gram_lm = lm.ARPALanguageModel(
    filename="resource_files/language_model/animal_lm-3_gram.iarpa",
    base_e=False  # this will keep the log probabilities in `base 10` so that they match up with the original file
)

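In this case, `score()` no longer needs to approximate the last word from a `2-gram`: the final term is the `3-gram` captured explicitly by the language model, which we can also look up directly with `.ngrams.prob()`.  Note that `score()` still adds the probabilities of the earlier words in the sequence, so the two sets of values below differ, but the ratio between `"ate the lion"` and `"ate the mouse"` is the same.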

In [27]:
10**tri_gram_lm.score(("ate", "the", "lion")), 10**tri_gram_lm.score(("ate", "the", "mouse"))

(0.005498813623791417, 0.016496469180468848)

In [28]:
10**tri_gram_lm.ngrams.prob(("ate", "the", "lion")), 10**tri_gram_lm.ngrams.prob(("ate", "the", "mouse"))

(0.05263153058738709, 0.15789486272078615)

And now, our intuitions match.  `"ate the lion"` is **three times less likely** than `"ate the mouse"`.

Obviously, the higher the order of `n-gram` you are willing to model, the more "accurate" your language modeling will become.  But at a certain point, this becomes unwieldy.  You can see that our `librispeech` data comes with a `3-gram` **and** a `4-gram` model, but notice that the size nearly doubles just going from `3-gram`s to `4-gram`s.

In [29]:
%%bash
ls -lah raw_data/ | grep -E "gram\.arpa"

-rw-r--r--  1 root root 725M Oct  3 20:30 3-gram.arpa.gz
-rw-r--r--  1 root root 1.3G Oct  3 20:30 4-gram.arpa.gz
lrwxr-xr-x  1 root root   14 Oct 20 20:07 lm_fglarge.arpa.gz -> 4-gram.arpa.gz
lrwxr-xr-x  1 root root   14 Oct 20 20:07 lm_tglarge.arpa.gz -> 3-gram.arpa.gz


And by looking at the number of `n-gram`s modeled, this shouldn't be a surprise as there are ~60 million `4-gram`s in the `4-gram` language model.

In [32]:
%%bash
head -n5 raw_data/3-gram.arpa


\data\
ngram 1=200003
ngram 2=38229161
ngram 3=49712290


In [33]:
%%bash
head -n6 raw_data/4-gram.arpa


\data\
ngram 1=200003
ngram 2=38229161
ngram 3=45941329
ngram 4=60975692
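The same header can be read programmatically.  The sketch below assumes the header follows the `\data\` layout shown above; `read_ngram_counts` is a hypothetical helper that works on any iterable of lines, so it could be fed an open file handle as well as the inline list used here.

```python
import re


def read_ngram_counts(lines):
    """Return {order: count} from the `\\data\\` header of an ARPA-style file."""
    counts = {}
    in_data = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_data = True
            continue
        if not in_data or not line:
            continue  # skip anything before the header and blank lines
        match = re.match(r"ngram\s+(\d+)\s*=\s*(\d+)$", line)
        if match is None:
            break  # first non-count line (e.g. `\1-grams:`) ends the header
        counts[int(match.group(1))] = int(match.group(2))
    return counts


header = [
    "",
    "\\data\\",
    "ngram 1=200003",
    "ngram 2=38229161",
    "ngram 3=45941329",
    "ngram 4=60975692",
    "",
    "\\1-grams:",
]
print(read_ngram_counts(header))
```

Because the function stops at the first section marker, it never has to read the (very large) body of the file.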


### `pruning` a language model

This is where `pruning` can come in handy.  Since a good number of the total `n-gram`s that appear in a corpus are infrequent (remember, any `n-gram` seen **even once** has to be modeled), these can be removed from the language model without reducing the "accuracy" of the model too much.

The `IRSTLM` manual (found in `resource_files/resources/irstlm-manual.pdf`) explains pruning this way:

```
Large LMs files can be pruned in a smart way by means of the command prune-lm that removes n-grams for which resorting to the back-off results in a small loss.
```

The `librispeech` data provides two `pruned` `3-gram` language models, each with a different `pruning` threshold.

In [38]:
%%bash
ls -lh raw_data/ | grep 3-gram

-rw-r--r--  1 root root 2.3G Oct  3 20:30 3-gram.arpa
-rw-r--r--  1 root root  33M Oct  3 20:30 3-gram.pruned.1e-7.arpa.gz
-rw-r--r--  1 root root  14M Oct  3 20:30 3-gram.pruned.3e-7.arpa.gz
lrwxr-xr-x  1 root root   14 Oct 20 20:07 lm_tglarge.arpa.gz -> 3-gram.arpa.gz
lrwxr-xr-x  1 root root   26 Oct 20 20:07 lm_tgmed.arpa.gz -> 3-gram.pruned.1e-7.arpa.gz
lrwxr-xr-x  1 root root   26 Oct 20 20:07 lm_tgsmall.arpa.gz -> 3-gram.pruned.3e-7.arpa.gz


Notice the significant reduction in size.  This will be evident in the number of `2-gram`s and `3-gram`s as well.

In [43]:
%%bash
head -n5 raw_data/3-gram.arpa
echo ...
head -n5 raw_data/3-gram.pruned.1e-7.arpa
echo ...
head -n5 raw_data/3-gram.pruned.3e-7.arpa


\data\
ngram 1=200003
ngram 2=38229161
ngram 3=49712290
...

\data\
ngram 1=200003
ngram 2=2451827
ngram 3=1134656
...

\data\
ngram 1=200003
ngram 2=1016673
ngram 3=340026
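To put these counts in perspective, the fraction of `n-gram`s that survives each threshold can be computed directly.  This sketch hard-codes the counts printed above; `kept_fraction` is a hypothetical helper name.

```python
# n-gram counts copied from the three headers printed above.
full = {1: 200003, 2: 38229161, 3: 49712290}
pruned = {
    "1e-7": {1: 200003, 2: 2451827, 3: 1134656},
    "3e-7": {1: 200003, 2: 1016673, 3: 340026},
}


def kept_fraction(threshold, order):
    """Fraction of the original n-grams of this order that remain after pruning."""
    return pruned[threshold][order] / full[order]


for threshold in pruned:
    for order in sorted(full):
        print(f"threshold {threshold}: {kept_fraction(threshold, order):.2%} of {order}-grams kept")
```

The `1-gram` counts are identical across all three files, while only a few percent of the higher-order `n-gram`s remain.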


While it's not necessary to `prune` our toy animal language models, it **is** easy to do with `IRSTLM`.

In [45]:
%%bash
export IRSTLM=${KALDI_PATH}/tools/irstlm
export PATH=${PATH}:${IRSTLM}/bin
prune-lm


prune-lm - prunes language models

USAGE:
       prune-lm [options] <inputfile> [<outputfile>]

DESCRIPTION:
       prune-lm reads a LM in either ARPA or compiled format and
       prunes out n-grams (n=2,3,..) for which backing-off to the
       lower order n-gram results in a small difference in probability.
       The pruned LM is saved in ARPA format

OPTIONS:
Parameters:
    Help:      print this help
    abs:      uses absolute value of weighted difference; default is 0
    h:      print this help
    t:      pruning thresholds for 2-grams, 3-grams, 4-grams,...; if less thresholds are specified, the last one is applied to all following n-gram levels; default is 0
    threshold:      pruning thresholds for 2-grams, 3-grams, 4-grams,...; if less thresholds are specified, the last one is applied to all following n-gram levels; default is 0

DEBUG_LEVEL:0/1 Everything OK


**Note:** You can provide a different threshold for each `n-gram` order (`2-gram`s, `3-gram`s, ...).  In the `librispeech` language models, you can see that no `pruning` was done on `1-gram`s.
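As a rough sketch, a `prune-lm` call could be assembled from Python as follows.  Several things here are assumptions rather than facts from this notebook: the `--threshold=` spelling with comma-separated values is inferred from the `threshold` parameter in the usage text above, the file names are hypothetical placeholders, and `build_prune_cmd` is an illustrative helper.  The command is only printed, not executed.

```python
import shlex


def build_prune_cmd(input_lm, output_lm, thresholds):
    """Assemble an argv list for IRSTLM's `prune-lm` (not executed here)."""
    # One threshold per n-gram order, starting at 2-grams (per the usage text).
    threshold_arg = ",".join(str(t) for t in thresholds)
    # ASSUMPTION: the option is passed as `--threshold=<values>`.
    return ["prune-lm", "--threshold=" + threshold_arg, input_lm, output_lm]


cmd = build_prune_cmd(
    "my_lm.arpa",          # hypothetical input file
    "my_lm.pruned.arpa",   # hypothetical output file
    ["1e-7", "1e-7"],      # thresholds for 2-grams and 3-grams
)
print(shlex.join(cmd))
```

If the printed command looks right for your installation, it could be run with `subprocess.run(cmd, check=True)` once `prune-lm` is on your `PATH`.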