# 2.1: Language models explained

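`kaldi` natively supports the `ARPA` format for language models.  A good explanation of that format can be read [here](https://cmusphinx.github.io/wiki/arpaformat/).  In brief, an `ARPA` file opens with a `\data\` header listing how many n-grams of each order it contains, continues with one section per order (`\1-grams:`, `\2-grams:`, and so on), and closes with `\end\`.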
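For orientation, the overall layout of an `ARPA` file can be sketched as below for a bigram model.  This is a schematic only: everything in angle brackets is a placeholder rather than real model output, and the variable name `arpa_skeleton` is made up for this illustration.

```shell
# Schematic layout of a bigram ARPA file.
# Angle-bracketed fields are placeholders, not values from a real model.
arpa_skeleton=$(cat <<'EOF'
\data\
ngram 1=<number of unigrams>
ngram 2=<number of bigrams>

\1-grams:
<log10 prob> <word> <log10 backoff weight>

\2-grams:
<log10 prob> <word1> <word2>

\end\
EOF
)
printf '%s\n' "$arpa_skeleton"
```

Entries of the highest order (here the bigrams) carry no backoff weight, because there is no longer n-gram to back off from.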

A popular open-source language modeling toolkit that outputs in the `ARPA` format is `IRSTLM`.  Its manual can be found in `resource_files/irstlm-manual.pdf`.

We will build a language model from a toy corpus (using `IRSTLM`) and then analyze it.

## the toy corpus

A toy corpus is in `resource_files/animal_corpus.txt`.  In this corpus, each line represents a sentence, and there is *no* punctuation present.

**Note:** From the perspective of a language model, one *could* model punctuation if that were of importance, but since our purpose is to model *spoken* text, we do *not* have any need to model punctuation.
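If your own text still contains punctuation and capitalization, one possible way to bring it into the same shape as the toy corpus is sketched below.  The two input sentences are hypothetical (they are not part of `animal_corpus.txt`), and the variable names are made up for this illustration.

```shell
# Hypothetical raw input; these sentences are NOT in animal_corpus.txt.
raw_text='The cat, surprisingly, ate the mouse!
Did the dog eat the cat?'

# Delete all punctuation, then lowercase; the one-sentence-per-line layout is kept.
normalized=$(printf '%s\n' "$raw_text" | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]')
printf '%s\n' "$normalized"
```

Note that `[:punct:]` also covers apostrophes and hyphens, so words such as `don't` would become `dont`; a real pipeline may want a more careful rule.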

In [2]:
cat resource_files/animal_corpus.txt

the mouse ate the cheese
the cat ate the mouse
the dog ate the cat
the lion at the dog
the tyrannosaurus rex ate the lion
the human shot the mouse
the human shot the cat
the human shot the dog
the human shot the lion
the human shot the tyrannosaurus rex


## building the language model with `IRSTLM`

After `export`ing a few variables, we will be able to call scripts from `IRSTLM` without a full path.

In [8]:
export IRSTLM=${KALDI_PATH}/tools/irstlm
export PATH=${PATH}:${IRSTLM}/bin

We're also going to need to write some intermediate files to disk, so we'll make a temporary directory that we will remove at the end of this process.

In [18]:
mkdir -p resource_files/lm_temp

### `add-start-end.sh`

Since our corpus does *not* have periods, we need to add custom symbols to mark the *beginning* (`<s>`) and *end* (`</s>`) of each sentence.
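To make the effect of this step concrete, here is a rough approximation using `sed`.  It is only a sketch and not how `add-start-end.sh` is actually implemented: the real script also trims very long words, and its spacing may differ.  The function name `mark_sentences` is made up for this illustration.

```shell
# Approximate the effect of add-start-end.sh (illustration only).
# Prefix every line with "<s> " and suffix it with " </s>".
mark_sentences() {
    sed -e 's/^/<s> /' -e 's/$/ <\/s>/'
}

marked=$(printf '%s\n' 'the mouse ate the cheese' 'the cat ate the mouse' | mark_sentences)
printf '%s\n' "$marked"
```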

In [15]:
add-start-end.sh -h


add-start-end.sh - adds sentence start/end symbols in each line and trims very very long words

USAGE:
       add-start-end.sh [options]

OPTIONS:
       -h        Show this message
       -r count  Specify symbol repetitions (default 1)
       -t length Trim words up to _length_ chars (default 80)
       -s char   Specify symbol (default s)



In [19]:
add-start-end.sh < resource_files/animal_corpus.txt > resource_files/lm_temp/animal_corpus_start_stop.txt

In [20]:
cat resource_files/lm_temp/animal_corpus_start_stop.txt

<s> the mouse ate the cheese  </s>
<s> the cat ate the mouse  </s>
<s> the dog ate the cat  </s>
<s> the lion at the dog  </s>
<s> the tyrannosaurus rex ate the lion  </s>
<s> the human shot the mouse  </s>
<s> the human shot the cat  </s>
<s> the human shot the dog  </s>
<s> the human shot the lion  </s>
<s> the human shot the tyrannosaurus rex  </s>


### `build-lm.sh`

Now let's build the actual language model using `build-lm.sh`.

In [16]:
build-lm.sh -h


build-lm.sh - estimates a language model file and saves it in intermediate ARPA format

USAGE:
       build-lm.sh [options]

OPTIONS:
       -i|--InputFile          Input training file e.g. 'gunzip -c train.gz'
       -o|--OutputFile         Output gzipped LM, e.g. lm.gz
       -k|--Parts              Number of splits (default 5)
       -n|--NgramSize          Order of language model (default 3)
       -d|--Dictionary         Define subdictionary for n-grams (optional, default is without any subdictionary)
       -s|--LanguageModelType  Smoothing methods: witten-bell (default), shift-beta, improved-shift-beta, stupid-backoff; kneser-ney and improved-kneser-ney still accepted for back-compatibility, but mapped into shift-beta and improved-shift-beta, respectively
       -p|--PruneSingletons    Prune singleton n-grams (default false)
       -f|--PruneFrequencyThreshold      Pruning frequency threshold for each level; comma-separated list of values; (default is '0,0,...,0', for all level

In [None]:
build-lm.sh \
    -i resource_files/lm_temp/animal_corpus_start_stop.txt \
    -o resource_files/lm_temp/lm.gz \
    -n 2

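The `ARPA` language model is an `n-gram` model: it stores the probability of each word given the `n-1` words that precede it, and it scores a whole sentence by combining the probabilities of the `n-gram`s the sentence contains.  Because we passed `-n 2`, our model is a bigram model.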
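As a hand-worked illustration of what a bigram probability means, the sketch below counts bigrams in the toy corpus and computes the unsmoothed (maximum-likelihood) estimate `count(history word) / count(history)`.  This is not what `build-lm.sh` computes: `IRSTLM` applies smoothing (witten-bell by default) and stores log10 values, so the numbers in the real model will differ.  The helper `bigram_mle` is made up for this illustration, and the corpus is embedded so the block runs on its own.

```shell
# The toy corpus with sentence markers, embedded so the sketch is self-contained.
corpus='<s> the mouse ate the cheese </s>
<s> the cat ate the mouse </s>
<s> the dog ate the cat </s>
<s> the lion at the dog </s>
<s> the tyrannosaurus rex ate the lion </s>
<s> the human shot the mouse </s>
<s> the human shot the cat </s>
<s> the human shot the dog </s>
<s> the human shot the lion </s>
<s> the human shot the tyrannosaurus rex </s>'

# bigram_mle HISTORY WORD
# Unsmoothed estimate of P(WORD | HISTORY): how often HISTORY is immediately
# followed by WORD, divided by how often HISTORY is followed by anything.
bigram_mle() {
    printf '%s\n' "$corpus" | awk -v h="$1" -v w="$2" '
        {
            for (i = 1; i < NF; i++) {
                if ($i == h) {
                    history++
                    if ($(i + 1) == w) pair++
                }
            }
        }
        END {
            if (history > 0) printf "%.2f\n", pair / history
            else print "NA"
        }
    '
}

p_human=$(bigram_mle the human)
p_shot=$(bigram_mle human shot)
echo "P(human | the)  = $p_human"
echo "P(shot | human) = $p_shot"
```

In this corpus `the` occurs 20 times as a history and is followed by `human` 5 times, giving 0.25, while `human` is always followed by `shot`, giving 1.00.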