# 2.2: Examining language models

The `ARPA` (and `iARPA`) format is very interpretable.  If you haven't yet done so, read this short [blog post](https://cmusphinx.github.io/wiki/arpaformat/) for more information on how to interpret it.

In [None]:
%%bash
cat resource_files/language_model/animal_lm-2_gram.iarpa

## Using `PyNLPl`

We can use the [`PyNLPl`](http://pynlpl.readthedocs.io/en/latest/) (pronounced "pineapple") package in `python` to examine our language model.

In [None]:
import pynlpl.lm.lm as lm

### loading in `.iARPA` files

`ARPALanguageModel()` will import an existing **`iARPA`** formatted language model.

**Note:** Recall that in the last notebook we had to run a quick `sed` command over the `.iarpa` format because there were times where the whitespace between a probability and the `1-gram` was a `" "` instead of a `\t`.

In [None]:
bi_gram_lm = lm.ARPALanguageModel(
    filename="resource_files/language_model/animal_lm-2_gram.iarpa",
    base_e=False,  # this will keep the log probabilities in `base 10` so that they match up with the original file
    debug=True     # this argument will allow you to more easily see how the data is stored in the object
)

You'll notice that each `n-gram` is stored as a `<tuple>`, even `1-gram`s ==> `([word],)`.
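
To make the `<tuple>` keys concrete, here is a minimal sketch of how an `ARPA`-style file could be read into a dictionary.  This is **not** `PyNLPl`'s implementation: `parse_arpa` and `TOY_ARPA` are hypothetical names, the numbers are made up, and we assume tab-separated fields as in our cleaned-up `.iarpa` files.

```python
# Hypothetical toy model; the log10 values are made up for illustration.
TOY_ARPA = "\n".join([
    "\\data\\",
    "ngram 1=3",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-0.5\tthe\t-0.3",
    "-0.8\tdog\t-0.2",
    "-1.2\t<unk>",
    "",
    "\\2-grams:",
    "-0.4\tthe dog",
    "",
    "\\end\\",
])


def parse_arpa(text):
    """Return {ngram_tuple: (log10_prob, backoff_or_None)}."""
    ngrams = {}
    in_ngram_section = False
    for line in text.splitlines():
        if line.startswith("\\"):
            # Only markers such as "\1-grams:" open an n-gram section;
            # "\data\" and "\end\" do not.
            in_ngram_section = line.endswith("-grams:")
            continue
        if not in_ngram_section or not line.strip():
            continue
        fields = line.split("\t")
        log_prob = float(fields[0])
        words = tuple(fields[1].split(" "))  # even a 1-gram becomes a tuple
        backoff = float(fields[2]) if len(fields) > 2 else None
        ngrams[words] = (log_prob, backoff)
    return ngrams


toy = parse_arpa(TOY_ARPA)
print(toy[("dog",)])         # (-0.8, -0.2)
print(toy[("the", "dog")])   # (-0.4, None)
```

Keying every entry by a `<tuple>` means `1-gram`s and longer `n-gram`s can live in the same lookup table.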

### looking up **existing** `n-gram`s

`.ngrams` contains all of the `n-gram`s **present** in our language model.  We can access either:
 - the probability ==> `.prob()`
 - the backoff probability ==> `.backoff()`

In [None]:
bi_gram_lm.ngrams.prob(("dog",)), bi_gram_lm.ngrams.backoff(("dog",))

We can confirm this by double-checking the values in the original `.iarpa` file.

In [None]:
%%bash
cat resource_files/language_model/animal_lm-2_gram.iarpa | grep -P "\tdog\t"

In [None]:
bi_gram_lm.ngrams.prob(("the", "dog")), bi_gram_lm.ngrams.backoff(("the", "dog"))

In [None]:
%%bash
cat resource_files/language_model/animal_lm-2_gram.iarpa | grep -P "the dog"

But if we try to look up an `n-gram` that does **not** exist in the language model explicitly, we get a `KeyError`.

In [None]:
try:
    bi_gram_lm.ngrams.prob(("human", "ate"))
except Exception as e:
    print("n-gram {} doesn't exist in language model".format(e))

In [None]:
try:
    bi_gram_lm.ngrams.prob(("the", "dog", "ate"))
except Exception as e:
    print("n-gram {} doesn't exist in language model".format(e))

### calculating new `n-gram` probabilities

There are two cases where this will occur:
 1. The `n-gram` is of the size of the language model **but** this particular `n-gram` is **not found** in the language model.
 2. The `n-gram` is **larger** than that of the language model.  In other words, you want the probability of a `3-gram`, but your language model is only made up of `2-gram`s.
 
In both cases, we can use `.score()`.  To score a new `n-gram`, provide that `n-gram` as a `<tuple>`.

#### `n-gram` is **not present** in language model

In cases like these, we need to access `backoff` probabilities, which are designed precisely for this purpose.  You'll notice that in our `2-gram` language model, backoff probabilities exist for `1-gram`s only.  Each one is the number that comes **after** the word.

In [None]:
%%bash
cat resource_files/language_model/animal_lm-2_gram.iarpa | grep -A15 -E "1-grams"

So if we wanted to find the probability of "human ate", it would be calculated as:

$p(human\_ate) = p(human) + p(ate|human) = p(human) + p(ate) + bWt(human)$

Here the unseen `2-gram` backs off to the `1-gram` probability of `"ate"` plus the backoff weight of its history, `"human"`.

**Note:** Because our probabilities are stored as logarithms (`log base 10`), we **add** them instead of **multiplying**.  And since log probabilities are never positive, the **closer** a value is to `0`, the "more likely" the word/sequence is.

In [None]:
bi_gram_lm.score(("human", "ate"))

We can confirm this by doing the calculations ourselves.

In [None]:
bi_gram_lm.ngrams.prob(("human",)) + \
bi_gram_lm.ngrams.prob(("ate",)) + \
bi_gram_lm.ngrams.backoff(("human",))
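
The log-space arithmetic from the note above can be sanity-checked with plain `python`.  This is a small sketch with made-up probabilities; nothing here comes from our language model.

```python
import math

# Made-up probabilities, for illustration only.
p_first, p_second = 0.1, 0.05

product = p_first * p_second                            # multiply in linear space
log_sum = math.log10(p_first) + math.log10(p_second)    # add in log10 space

# Converting the sum back out of log space recovers the product.
assert math.isclose(10 ** log_sum, product)

# A log probability closer to 0 corresponds to a more likely event.
assert math.log10(0.5) > math.log10(0.05)
```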

**Note:** If you forget to enter the `n-gram` as a `<tuple>`, the `<string>` will be considered an `n-gram` of **characters**, **none of which** will be present in the language model, so it will be equal to $p(UNK) * len(string)$.

In [None]:
bi_gram_lm.score("human ate")

In [None]:
result = 0
for i in "human ate":
    result += bi_gram_lm.scoreword(i)
result
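
The reason a `<string>` gets scored character by character is simply how `python` iterates over it.  A short sketch, independent of `PyNLPl`:

```python
as_string = "human ate"
as_tuple = ("human", "ate")

# Iterating over a string yields single characters (including the space) ...
print(list(as_string))

# ... while iterating over a tuple yields whole words.
print(list(as_tuple))

# Splitting on whitespace is one way to turn the string into the tuple we want.
print(tuple(as_string.split()))
```

So a quick `tuple(text.split())` before scoring avoids the problem for whitespace-tokenized input.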

#### `n-gram` is **larger** than language model

If we want to get the probability of the sequence, `"the dog ate"` using our `2-gram` language model, it will be calculated as follows:

$p(the\_dog\_ate) = p(the) + p(dog|the) + p(ate|dog)$

In [None]:
bi_gram_lm.score(("the", "dog", "ate"))

We can again confirm this by doing the calculation ourselves.

In [None]:
bi_gram_lm.ngrams.prob(("the",)) + \
bi_gram_lm.ngrams.prob(("the", "dog")) + \
bi_gram_lm.ngrams.prob(("dog", "ate"))

But if any one of the `n-grams` we use to compose our sequence is **not** present in our language model, we again need to utilize the `backoff` probabilities.

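$p(the\_triceratops\_ate) = p(the) + p(triceratops|the) + p(ate|triceratops) = p(the) + p(UNK) + bWt(the) + p(ate) + bWt(triceratops)$

Because `"triceratops"` is not in the language model, it has no backoff weight of its own, so $bWt(triceratops)$ is treated as `0` and drops out of the manual calculation.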

In [None]:
bi_gram_lm.score(("the", "triceratops", "ate"))

In [None]:
bi_gram_lm.ngrams.prob(("the",)) + \
bi_gram_lm.ngrams.prob(("<unk>",)) + \
bi_gram_lm.ngrams.backoff(("the",)) + \
bi_gram_lm.ngrams.prob(("ate",))
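
The backoff logic can be sketched on a toy model.  This is an illustration under stated assumptions, not `PyNLPl`'s code: `PROBS`, `BACKOFFS`, `score_word` and `score` are hypothetical, the numbers are made up, unknown words fall back to `<unk>`, and a history without a stored backoff weight contributes `0`.

```python
import math

# Hypothetical toy 2-gram model in log10 space; all numbers are made up.
PROBS = {
    ("<unk>",): -1.5,
    ("the",): -0.5,
    ("dog",): -0.9,
    ("ate",): -0.7,
    ("the", "dog"): -0.3,
    ("dog", "ate"): -0.2,
}
BACKOFFS = {("the",): -0.4, ("dog",): -0.6, ("ate",): -0.1}
ORDER = 2  # largest n-gram stored in the model


def score_word(word, history=()):
    """log10 p(word | history), backing off to shorter histories."""
    lookup = (history + (word,))[-ORDER:]  # keep at most ORDER words
    if lookup in PROBS:
        return PROBS[lookup]
    if len(lookup) == 1:
        # Unknown word and no context left: fall back to <unk>.
        return PROBS[("<unk>",)]
    context = lookup[:-1]
    # A context the model has never seen contributes no backoff weight.
    return BACKOFFS.get(context, 0.0) + score_word(word, context[1:])


def score(words):
    """Sum the log10 probability of each word given the words before it."""
    total, history = 0.0, ()
    for word in words:
        total += score_word(word, history)
        history += (word,)
    return total


manual = PROBS[("the",)] + PROBS[("<unk>",)] + BACKOFFS[("the",)] + PROBS[("ate",)]
assert math.isclose(score(("the", "triceratops", "ate")), manual)
```

The manual sum mirrors the four-term calculation above: the unseen word costs $p(UNK) + bWt(the)$, and the word after it backs off to its `1-gram` probability.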

## Comparing probabilities

Now that we have the tools, let's look at how "likely" particular sequences of words will be given our language model.

Our animal corpus provided us with a "hierarchical" understanding of the food chain.  For example, our `"mouse"` could only eat **one** thing while our `"lion"` ate **four** things.

In [None]:
%%bash
cat resource_files/language_model/animal_corpus.txt | grep "mouse ate"

In [None]:
%%bash
cat resource_files/language_model/animal_corpus.txt | grep "lion ate"

And so intuitively, the `2-gram` `"mouse ate"` should be **four times less likely** than `"lion ate"`.

**Note:** Remember that these probabilities are in `log base 10` space, so we need to do a quick conversion in order to see the expected ratio between the probabilities.

In [None]:
10**bi_gram_lm.ngrams.prob(("mouse", "ate")), 10**bi_gram_lm.ngrams.prob(("lion", "ate"))
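
When only the ratio matters, it can also be read straight off the log values, since a **difference** in log space is a **ratio** in linear space.  A sketch with made-up numbers (not taken from our language model):

```python
import math

# Made-up log10 probabilities for two hypothetical n-grams.
log10_rare = math.log10(0.05)
log10_common = math.log10(0.2)

# Convert each value back to a plain probability ...
linear_rare, linear_common = 10 ** log10_rare, 10 ** log10_common

# ... or get the ratio directly from the difference of the logs.
ratio = 10 ** (log10_common - log10_rare)

assert math.isclose(ratio, linear_common / linear_rare)
print(round(ratio, 6))  # 4.0
```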

And as the `"lion"` is relatively "high up" in our food chain, it is eaten by fewer things than the `"mouse"` is.

In [None]:
%%bash
cat resource_files/language_model/animal_corpus.txt | grep "ate the lion"

In [None]:
%%bash
cat resource_files/language_model/animal_corpus.txt | grep "ate the mouse"

So we should expect `"ate the lion"` to be **three times less likely** than `"ate the mouse"`.  And since we are dealing with a `2-gram` language model, we will need to do some probability calculations this time (instead of being able to simply "look up" the probabilities).

In [None]:
10**bi_gram_lm.score(("ate", "the", "lion")), 10**bi_gram_lm.score(("ate", "the", "mouse"))

But this is clearly not the case!  In fact, `"ate the lion"` is **more likely** than `"ate the mouse"`!

Remembering how these probabilities are calculated, we can see why this is the case:
    
$p(ate\_the\_XXX) = p(ate) + p(the|ate) + p(XXX|the)$

In [None]:
bi_gram_lm.ngrams.prob(("the", "lion")), bi_gram_lm.ngrams.prob(("the", "mouse"))

Because the `"lion"` appeared **one more time than** the `"mouse"` (notice that the `"t-rex"` did **not** eat the `"mouse"`), this increased probability impacted our `3-gram` calculation.

But then, in that case, our `3-gram` language model should be able to handle this problem better.  It would have modeled `"ate the XXX"` explicitly and would not, therefore, need to generate a probability by considering the smaller `n-gram`s that caused us problems above.
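
The effect is easy to reproduce with raw counts.  The sketch below uses a **hypothetical** mini-corpus (not our `animal_corpus.txt`) and plain counting, with none of the smoothing a real language model applies:

```python
from collections import Counter

# Hypothetical mini-corpus, made up for illustration.
SENTENCES = [
    "the lion ate the mouse",
    "the lion ate the cat",
    "the lion ate the snake",
    "the lion ate the zebra",
    "the cat ate the mouse",
    "the snake ate the mouse",
    "the t-rex ate the lion",
    "the mouse ate the cheese",
]

bigrams, trigrams = Counter(), Counter()
for sentence in SENTENCES:
    words = sentence.split()
    bigrams.update(zip(words, words[1:]))
    trigrams.update(zip(words, words[1:], words[2:]))

# The 2-gram counts mix "eater" and "eaten" uses of each animal ...
print(bigrams[("the", "lion")], bigrams[("the", "mouse")])                  # 5 4

# ... while the 3-gram counts only see the animal being eaten.
print(trigrams[("ate", "the", "lion")], trigrams[("ate", "the", "mouse")])  # 1 3
```

The `2-gram` `"the lion"` is frequent mostly because the lion does a lot of eating, which is exactly the information a `2-gram` model cannot separate out.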

In [None]:
tri_gram_lm = lm.ARPALanguageModel(
    filename="resource_files/language_model/animal_lm-3_gram.iarpa",
    base_e=False  # this will keep the log probabilities in `base 10` so that they match up with the original file
)

And so in this case, `score()` simply needs to look up (`.ngrams.prob()`) the `3-gram`, which the language model captured explicitly.

In [None]:
10**tri_gram_lm.score(("ate", "the", "lion")), 10**tri_gram_lm.score(("ate", "the", "mouse"))

In [None]:
10**tri_gram_lm.ngrams.prob(("ate", "the", "lion")), 10**tri_gram_lm.ngrams.prob(("ate", "the", "mouse"))

And now, our intuitions match.  `"ate the lion"` is **three times less likely** than `"ate the mouse"`.

Obviously, the higher the order of `n-gram` you are willing to model, the more "accurate" your language modeling will become.  But at a certain point, this becomes unwieldy.  You can see that our `librispeech` data comes with a `3-gram` **and** a `4-gram` model, but notice that the size doubles just going from `3-gram`s to `4-gram`s.

In [None]:
%%bash
ls -lah raw_data/ | grep -E "gram\.arpa"

And by looking at the number of `n-gram`s modeled, this shouldn't be a surprise as there are ~60 million `4-gram`s in the `4-gram` language model.

In [None]:
%%bash
head -n5 raw_data/3-gram.arpa

In [None]:
%%bash
head -n6 raw_data/4-gram.arpa

### `pruning` a language model

This is where `pruning` can come in handy.  Since a good number of the total `n-gram`s that appear in a corpus are infrequent (remember, any `n-gram` seen **even once** has to be modeled), these can be removed from the language model without reducing the "accuracy" of the model too much.

The `IRSTLM` manual (found in `resource_files/resources/irstlm-manual.pdf`) explains pruning this way:

```
Large LMs files can be pruned in a smart way by means of the command prune-lm that removes n-grams for which resorting to the back-off results in a small loss.
```
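
The idea in that quote can be sketched on a toy model.  This is a deliberately simplified criterion, **not** the one `prune-lm` actually uses: `prune_bigrams` is a hypothetical helper, the numbers are made up, and the threshold is applied to the raw difference between log values.

```python
# Hypothetical toy model (log10 values are made up for illustration).
probs = {
    ("the",): -0.5,
    ("dog",): -0.9,
    ("ate",): -0.7,
    ("the", "dog"): -0.3,
    ("dog", "ate"): -1.25,
}
backoffs = {("the",): -0.4, ("dog",): -0.6}


def prune_bigrams(probs, backoffs, threshold):
    """Drop 2-grams whose backed-off estimate is within `threshold` of the stored value."""
    kept = {}
    for ngram, log_prob in probs.items():
        if len(ngram) == 2:
            history, word = ngram[:1], ngram[1:]
            backed_off = backoffs.get(history, 0.0) + probs[word]
            if abs(log_prob - backed_off) < threshold:
                continue  # backing off loses little, so drop the explicit entry
        kept[ngram] = log_prob
    return kept


pruned = prune_bigrams(probs, backoffs, threshold=0.5)
print(sorted(pruned))
```

Here `"dog ate"` is dropped because backing off gives almost the same value, while `"the dog"` is kept because its explicit probability is far better than the backed-off estimate.  The sketch leaves the backoff weights untouched and ignores any renormalization.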

The `librispeech` data provides two `pruned` `3-gram` language models, each with a different `pruning` threshold.

In [None]:
%%bash
ls -lh raw_data/ | grep 3-gram

Notice the significant reduction in size.  This will be evident in the number of `2-gram`s and `3-gram`s as well.

In [None]:
%%bash
head -n5 raw_data/3-gram.arpa
echo ...
head -n5 raw_data/3-gram.pruned.1e-7.arpa
echo ...
head -n5 raw_data/3-gram.pruned.3e-7.arpa

While it's not necessary to `prune` our toy animal language models, it **is** easy to do with `IRSTLM`.

In [None]:
%%bash
export IRSTLM=${KALDI_PATH}/tools/irstlm
export PATH=${PATH}:${IRSTLM}/bin
prune-lm

**Note:** You can provide a different threshold for each `n-gram` order.  In the `librispeech` language models, you can see that no `pruning` was done on `1-gram`s.