# 2.2: Examining language models

The `ARPA` (and `iARPA`) format is human-readable: each entry line lists a base-10 log probability, the `n-gram` itself, and (optionally) a backoff weight.  If you haven't yet done so, read this short [blog post](https://cmusphinx.github.io/wiki/arpaformat/) for more information on how to interpret it.

In [1]:
%%bash
cat resource_files/language_model/animal_lm-2_gram.iarpa

iARPA

\data\
ngram 1=	15
ngram 2=	33

\1-grams:
-1.826075	<s>	-0.698970
-0.583037	the	-0.767686
-1.348954	mouse	-0.352183
-0.951014	ate	-1.176091
-1.348954	cheese	-0.544068
-1.282007	</s>	-0.845098
-1.282007	cat	-0.397940
-1.085712	and	-1.041393
-1.282007	dog	-0.397940
-1.282007	lion	-0.477121
-1.348954	tyrannosaurus	-0.778151
-1.348954	rex	-0.544068
-1.826075	human	-0.301030
-1.826075	shot	-0.301030
-0.951014	<unk>
\2-grams:
-0.698970	<s> <s>
-0.221849	<s> the
-0.913814	the mouse
-0.913814	the cheese
-0.834633	the cat
-0.834633	the dog
-0.834633	the lion
-0.913814	the tyrannosaurus
-1.612784	the human
-0.954243	mouse the
-0.954243	mouse ate
-0.954243	mouse </s>
-0.653213	mouse and
-0.029963	ate the
-0.845098	cheese </s>
-0.243038	cheese and
-1.000000	cat the
-0.698970	cat ate
-1.000000	cat </s>
-0.698970	cat and
-0.041393	and the
-1.000000	dog the
-0.522879	dog ate
-1.000000	dog </s>
-1.000000	dog and
-0.352183	lion ate
-0.954243	lion </s>
-0.954243	lion and
-0.079181	tyrannosaurus r
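
To make the layout above concrete, here is a minimal sketch (not part of `PyNLPl`) of how one entry line can be split into its parts.  The helper name `parse_arpa_line` is hypothetical, and the sketch assumes tab-separated fields holding base-10 log values, as in the file shown above.

```python
def parse_arpa_line(line):
    """Split one entry into (log10 probability, n-gram tuple, log10 backoff weight or None)."""
    fields = line.rstrip("\n").split("\t")
    log_prob = float(fields[0])
    ngram = tuple(fields[1].split(" "))
    # The backoff weight is optional: in this file, <unk> and the 2-grams omit it.
    backoff = float(fields[2]) if len(fields) > 2 else None
    return log_prob, ngram, backoff


log_prob, ngram, backoff = parse_arpa_line("-1.282007\tdog\t-0.397940")
print(ngram, log_prob, backoff)
# The values are base-10 logs, so 10 ** log_prob recovers the probability itself.
print(round(10 ** log_prob, 4))
```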

## Using `PyNLPl`

We can use the [`PyNLPl`](http://pynlpl.readthedocs.io/en/latest/) (pronounced "pineapple") package in `python` to examine our language model.

In [2]:
import pynlpl.lm.lm as lm

### loading in `.iARPA` files

`ARPALanguageModel()` will import an existing **`iARPA`** formatted language model.

**Note:** Recall that in the last notebook we had to run a quick `sed` command over the `.iarpa` file because the whitespace between a probability and its unigram was sometimes a `" "` instead of a `\t`.

In [3]:
bi_gram_lm = lm.ARPALanguageModel(
    filename="resource_files/language_model/animal_lm-2_gram.iarpa",
    base_e=False,  # this will keep the log probabilities in `base 10` so that they match up with the original file
    debug=True     # this argument will allow you to more easily see how the data is stored in the object
)

Unable to parse ARPA LM line: iARPA
Adding to LM: (u'<s>',)	-1.826075	-0.69897
Adding to LM: (u'the',)	-0.583037	-0.767686
Adding to LM: (u'mouse',)	-1.348954	-0.352183
Adding to LM: (u'ate',)	-0.951014	-1.176091
Adding to LM: (u'cheese',)	-1.348954	-0.544068
Adding to LM: (u'</s>',)	-1.282007	-0.845098
Adding to LM: (u'cat',)	-1.282007	-0.39794
Adding to LM: (u'and',)	-1.085712	-1.041393
Adding to LM: (u'dog',)	-1.282007	-0.39794
Adding to LM: (u'lion',)	-1.282007	-0.477121
Adding to LM: (u'tyrannosaurus',)	-1.348954	-0.778151
Adding to LM: (u'rex',)	-1.348954	-0.544068
Adding to LM: (u'human',)	-1.826075	-0.30103
Adding to LM: (u'shot',)	-1.826075	-0.30103
Adding to LM: (u'<unk>',)	-0.951014
Adding to LM: (u'<s>', u'<s>')	-0.69897
Adding to LM: (u'<s>', u'the')	-0.221849
Adding to LM: (u'the', u'mouse')	-0.913814
Adding to LM: (u'the', u'cheese')	-0.913814
Adding to LM: (u'the', u'cat')	-0.834633
Adding to LM: (u'the', u'dog')	-0.834633
Adding to LM: (u'the', u'lion')	-0.834633
Addin

You'll notice that each `n-gram` is stored as a `<tuple>`, even `unigram`s ==> `([word],)`.

### looking up **existing** `n-gram`s

`.ngrams` contains all of the `n-gram`s **present** in our language model.  For each one, we can access either:
 - the (log) probability ==> `.prob()`
 - the (log) backoff weight ==> `.backoff()`

In [4]:
bi_gram_lm.ngrams.prob(("dog",)), bi_gram_lm.ngrams.backoff(("dog",))

(-1.282007, -0.39794)

We can confirm this by double-checking the values in the original `.iarpa` file.

In [5]:
%%bash
cat resource_files/language_model/animal_lm-2_gram.iarpa | grep -P "\tdog\t"

-1.282007	dog	-0.397940


In [6]:
bi_gram_lm.ngrams.prob(("the", "dog")), bi_gram_lm.ngrams.backoff(("the", "dog"))

(-0.834633, 0.0)

In [7]:
%%bash
cat resource_files/language_model/animal_lm-2_gram.iarpa | grep -P "the dog"

-0.834633	the dog
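
The lookups above behave like dictionary accesses keyed by tuples.  A minimal sketch of that idea follows, using plain dictionaries rather than `PyNLPl`'s own classes.  The names `probs`, `backoffs`, and `lookup` are hypothetical, and returning `0.0` for a missing backoff weight is an assumption that is consistent with the `0.0` shown above for `("the", "dog")`.

```python
# Values copied from the .iarpa file above (base-10 logs).
probs = {("dog",): -1.282007, ("the", "dog"): -0.834633}
backoffs = {("dog",): -0.397940}  # the bigram has no backoff weight in the file


def lookup(ngram):
    """Return (log10 probability, log10 backoff weight) for an n-gram stored in the model."""
    # A missing probability raises KeyError; a missing backoff weight defaults to 0.0,
    # which is log10(1) and therefore leaves a summed score unchanged.
    return probs[ngram], backoffs.get(ngram, 0.0)


print(lookup(("dog",)))
print(lookup(("the", "dog")))
```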


But if we try to look up an `n-gram` that is **not** explicitly listed in the language model, we get a `KeyError`.

In [8]:
try:
    bi_gram_lm.ngrams.prob(("the", "dog", "ate"))
except KeyError as e:  # raised when the n-gram is not stored in the model
    print("n-gram {} doesn't exist in language model".format(e))

n-gram ('the', 'dog', 'ate') doesn't exist in language model


In [9]:
try:
    bi_gram_lm.ngrams.prob(("human", "ate"))
except KeyError as e:  # raised when the n-gram is not stored in the model
    print("n-gram {} doesn't exist in language model".format(e))

n-gram ('human', 'ate') doesn't exist in language model


### calculating new `n-gram` probabilities

In these cases, we can use `.score()`.  To score a new `n-gram`, provide that `n-gram` as a `<tuple>`.

In [10]:
bi_gram_lm.score(("the", "dog", "ate"))

-1.9405489999999999
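
The score above is consistent with summing three base-10 log probabilities taken from the file: the unigram `the`, then `dog` given `the`, then `ate` given `dog`.  The following sketch checks that arithmetic by hand; it illustrates the sum only and is not a description of `PyNLPl`'s internals.

```python
# Log10 probabilities copied from the .iarpa file above.
log_p_the = -0.583037            # 1-gram: the
log_p_dog_given_the = -0.834633  # 2-gram: the dog
log_p_ate_given_dog = -0.522879  # 2-gram: dog ate

# Adding logs corresponds to multiplying the underlying probabilities.
total = log_p_the + log_p_dog_given_the + log_p_ate_given_dog
print(round(total, 6))
```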

**Note:** If you forget to submit the `n-gram` as a `<tuple>`, the `<string>` will be treated as a sequence of **characters**.  **None** of its 11 characters (spaces included) is present in the language model, so each one is scored as `<unk>`.

In [23]:
bi_gram_lm.score("the dog ate")

-10.461154

In [24]:
result = 0
for i in "the dog ate":
    result += bi_gram_lm.scoreword(i)
result

-10.461154

**Note:** If you pass "the dog" to `.scoreword()` as a single `<string>`, it is treated as one unknown word, so you will get the score of the `unigram` `<unk>`!

In [14]:
bi_gram_lm.scoreword("the dog")

-0.951014

In [15]:
%%bash
cat resource_files/language_model/animal_lm-2_gram.iarpa | grep "<unk>"

-0.951014	<unk>


In [16]:
bi_gram_lm.scoreword(("human",), ("ate",)), bi_gram_lm.scoreword(("human", "ate")), bi_gram_lm.score(("human",), ("ate",))

(-3.002166, -0.951014, -3.002166)
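
The `-3.002166` above can be reproduced with the usual backoff rule for `ARPA` models, taking `ate` as the history and `human` as the word: if the bigram is listed, use its log probability; otherwise add the backoff weight of the history to the unigram log probability of the word.  The sketch below is a minimal illustration with hypothetical names (`bigram_score`, `unigrams`, `bigrams`, `backoffs`).  It is not `PyNLPl`'s implementation, and it assumes a missing backoff weight counts as `0.0`.

```python
# Log10 values copied from the .iarpa file above.
unigrams = {("human",): -1.826075, ("ate",): -0.951014, ("the",): -0.583037}
backoffs = {("human",): -0.301030, ("ate",): -1.176091, ("the",): -0.767686}
bigrams = {("ate", "the"): -0.029963}


def bigram_score(word, history):
    """Log10 score of `word` given a one-word `history`, backing off to the unigram if needed."""
    if (history, word) in bigrams:
        return bigrams[(history, word)]
    # Unlisted bigram: the history's backoff weight plus the word's unigram log probability.
    return backoffs.get((history,), 0.0) + unigrams[(word,)]


print(round(bigram_score("human", "ate"), 6))  # bigram "ate human" is not listed
print(round(bigram_score("the", "ate"), 6))    # bigram "ate the" is listed
```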

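Since we are using a `2-gram` language model, there is no entry for a `3-gram`, but `PyNLPl` will do the calculation for us by chaining `2-gram`s.  With `the` as the history:

$\log p(dog, ate \mid the) = \log p(dog \mid the) + \log p(ate \mid dog)$  

**Note:** Remember that because we are in `log` space, we `add` instead of `multiply`.  The `-1.940549` returned by `.score(("the", "dog", "ate"))` above additionally includes the `unigram` log probability of `the` (`-0.583037`).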

In [17]:
bi_gram_lm.scoreword(("dog",), ("the",)) + bi_gram_lm.scoreword(("ate",), ("dog",))

-1.3575119999999998

But instead of doing the addition ourselves, we can use the `score()` method.  Like `scoreword()`, it takes two arguments: the words to score and, optionally, the history that precedes them:

In [18]:
bi_gram_lm.score(("dog", "ate"), ("the",)), bi_gram_lm.scoreword(("dog", "ate"), ("the",))

(-1.3575119999999998, -0.522879)
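
One way to picture what the two-argument form does: score each word given the word before it, then move that word into the history.  The sketch below is a simplified stand-in with hypothetical names (`score_sequence`, `UNIGRAMS`, `BIGRAMS`).  It only handles `n-gram`s that are listed in the file and does no backoff, so it is an illustration rather than `PyNLPl`'s algorithm.

```python
# Log10 probabilities copied from the .iarpa file above.
UNIGRAMS = {"the": -0.583037}
BIGRAMS = {("the", "dog"): -0.834633, ("dog", "ate"): -0.522879}


def score_sequence(words, history=None):
    """Sum log10 scores over `words`, conditioning each word on the previous one."""
    previous = history[-1] if history else None
    total = 0.0
    for word in words:
        if previous is None:
            total += UNIGRAMS[word]  # no context yet: use the unigram
        else:
            total += BIGRAMS[(previous, word)]
        previous = word  # the word just scored becomes the context for the next one
    return total


print(round(score_sequence(("dog", "ate"), ("the",)), 6))
print(round(score_sequence(("the", "dog", "ate")), 6))
```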

In [19]:
bi_gram_lm.scoreword(("the", "dog"))

-0.834633