# 2.1: Building a language model

If you need some background on `n-gram` language modeling, Stanford has a very good set of slides explaining it.  They can be found in `resource_files/resources/language_modeling.pdf`.

`kaldi` has native support for the `ARPA` format for language models.  A good explanation of that format can be read [here](https://cmusphinx.github.io/wiki/arpaformat/).
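
To make the layout of the format concrete, the sketch below writes a tiny hand-made `ARPA` file and converts each unigram's base-10 log probability back into a plain probability. The temporary file and every number in it are made up for illustration; they are not output from `IRSTLM` or `kaldi`.

```shell
# write a tiny, hand-made ARPA file (all values are illustrative only)
toy_arpa=$(mktemp)
{
    printf '\\data\\\n'
    printf 'ngram 1=3\n\n'
    printf '\\1-grams:\n'
    printf '%s\t%s\t%s\n' '-0.30103' 'the' '-0.1'
    printf '%s\t%s\n' '-0.60206' 'cat'
    printf '%s\t%s\n' '-0.60206' 'dog'
    printf '\n\\end\\\n'
} > "${toy_arpa}"

# each n-gram line is: log10(probability) <TAB> n-gram [<TAB> log10(backoff weight)]
# convert the log10 probability in column 1 back into a probability
unigram_probs=$(awk -F'\t' '
    /^\\1-grams:/ { in_section = 1; next }   # start of the unigram section
    /^\\/         { in_section = 0 }         # any other backslash line ends it
    in_section && NF >= 2 { printf "%s %.2f\n", $2, exp($1 * log(10)) }
' "${toy_arpa}")

echo "${unigram_probs}"
rm -f "${toy_arpa}"
```

With these made-up values, `the` comes out at 0.50 and `cat` and `dog` at 0.25 each, which is a quick reminder that the first column stores logarithms rather than probabilities.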

A popular open-source language modeling toolkit that outputs in the `ARPA` format is `IRSTLM`.  Its manual can be found in `resource_files/resources/irstlm-manual.pdf`.

We will build language models from a toy corpus (using `IRSTLM`) and then examine them.

## the toy corpus

A toy corpus is in `resource_files/language_model/animal_corpus.txt`.  In this corpus, each line represents a sentence, and there is *no* punctuation present.

**Note:** From the perspective of a language model, one *could* model punctuation if that were of importance, but since our purpose is to model *spoken* text, we do *not* have any need to model punctuation.

In [None]:
cat resource_files/language_model/animal_corpus.txt

## building the language model with `IRSTLM`

After `export`ing a few variables, we will be able to call scripts from `IRSTLM` without a full path.

In [None]:
export IRSTLM=${KALDI_PATH}/tools/irstlm
export PATH=${PATH}:${IRSTLM}/bin
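
If you want to confirm that the `export`s took effect, a quick check along these lines may help. `check_tool` is a hypothetical helper defined here for illustration only; it is not part of `IRSTLM` or `kaldi`.

```shell
# report whether a command can be found on the current PATH
check_tool() {
    if command -v "$1" > /dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# the three IRSTLM commands used in this notebook
for tool in add-start-end.sh build-lm.sh compile-lm; do
    check_tool "${tool}"
done
```

Any `missing:` line suggests that `KALDI_PATH` is unset or that `IRSTLM` was not built under `${KALDI_PATH}/tools`.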

### `add-start-end.sh`

Since our corpus does *not* have periods, we need to add a custom symbol to represent the *beginning* and *end* of each sentence.

In [None]:
add-start-end.sh -h

In [None]:
cat resource_files/language_model/animal_corpus.txt | add-start-end.sh > resource_files/language_model/animal_corpus_start_stop.txt

In [None]:
cat resource_files/language_model/animal_corpus_start_stop.txt
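
The effect of this step can be approximated with plain `sed`. This is only a rough sketch: it assumes the symbols added are `<s>` and `</s>`, the real script may do more than this, and the two sentences are made up rather than taken from `animal_corpus.txt`.

```shell
# two made-up sentences, one per line
sample_corpus=$(printf 'the dog barks\nthe cat sleeps\n')

# prepend "<s> " and append " </s>" to every line
wrapped=$(printf '%s\n' "${sample_corpus}" | sed -e 's:^:<s> :' -e 's:$: </s>:')

echo "${wrapped}"
```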

### `build-lm.sh`

Now let's build the actual language model using `build-lm.sh`.

In [None]:
build-lm.sh -h

The main arguments we will focus on are:
 - `-i`
 - `-o`
 - `-n`

`-k` is an important argument for efficient language modeling on a very large corpus.  With our toy example, we do not need to worry about that.  You'll also notice a number of options for `-s` which relate to the type of `smoothing` used.  Stanford has a great resource on `smoothing` that you can find in `resource_files/smoothing_explained.pdf`.  For now, we will ignore both of these arguments.
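
As a reminder of what the order passed to `-n` controls, the following sketch counts the raw n-grams of order `n` in a two-sentence sample. It only illustrates what an n-gram is; it does not reproduce the smoothing that `build-lm.sh` applies, and the sample sentences are made up.

```shell
# two made-up sentences that already carry start and end symbols
sample=$(printf '<s> the dog barks </s>\n<s> the cat sleeps </s>\n')

# order of the n-grams to count (2 = bigrams)
n=2

# slide a window of n words over each line and tally every window seen
ngram_counts=$(printf '%s\n' "${sample}" | awk -v n="${n}" '
    {
        for (i = 1; i + n - 1 <= NF; i++) {
            gram = $i
            for (j = 1; j < n; j++) gram = gram " " $(i + j)
            count[gram]++
        }
    }
    END { for (gram in count) print count[gram], gram }
' | sort)

echo "${ngram_counts}"
```

Setting `n=3` in the sketch counts trigrams instead, which mirrors how a larger `-n` makes the model condition on more preceding words.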

In [None]:
build-lm.sh \
    -i resource_files/language_model/animal_corpus_start_stop.txt \
    -o resource_files/language_model/animal_lm-2_gram.iarpa \
    -n 2

`IRSTLM` automatically `compresses` the resulting language model.  So we will `decompress` it so we can look at it.

In [None]:
gzip -d resource_files/language_model/animal_lm-2_gram.iarpa.gz

In [None]:
cat resource_files/language_model/animal_lm-2_gram.iarpa

## `iARPA` to `ARPA` format

You'll notice the header line of our language model above says `iARPA`.  The `IRSTLM` manual explains:

```
Notice that build-lm.sh produces a LM file train.ilm.gz that is NOT in the final ARPA format, but in an intermediate format called iARPA, that is recognized by the compile-lm command and by the Moses SMT decoder running with IRSTLM. 
```

It goes on to explain how `iARPA` differs from `ARPA`:

```
This is an intermediate ARPA format in the sense that each entry of the file does not contain in the first position the full n-gram probability, but just its smoothed frequency.
```

And so we must run a final step over our language model (using `compile-lm`) in order to create the proper `ARPA` format.

In [None]:
compile-lm

In [None]:
compile-lm resource_files/language_model/animal_lm-2_gram.iarpa --text=yes resource_files/language_model/animal_lm-2_gram.arpa


You'll notice some small differences in the values of `ARPA` compared to `iARPA`.

In [None]:
diff resource_files/language_model/animal_lm-2_gram.arpa resource_files/language_model/animal_lm-2_gram.iarpa

Now let's build a `2-gram` language model that does **not** include `start` and `stop` symbols.  We can do this by using our original `animal_corpus.txt` file as `input`.

**Note:** We can run `compile-lm` over the `gz`ed output of `build-lm.sh`, so we can skip the manual step of `decompress`ing the `iarpa.gz` file.

In [None]:
build-lm.sh \
    -i resource_files/language_model/animal_corpus.txt \
    -o resource_files/language_model/animal_lm-2_gram-no_start_stop.iarpa \
    -n 2
    
compile-lm \
    resource_files/language_model/animal_lm-2_gram-no_start_stop.iarpa.gz \
    --text=yes \
    resource_files/language_model/animal_lm-2_gram-no_start_stop.arpa


In [None]:
cat resource_files/language_model/animal_lm-2_gram-no_start_stop.arpa

Let's also build a `3-gram` and a `4-gram` model, both using `start` and `stop` symbols.

In [None]:
for n in 3 4; do
    lm_out=resource_files/language_model/animal_lm-${n}_gram
    
    # build the `iarpa` format
    build-lm.sh \
        -i resource_files/language_model/animal_corpus_start_stop.txt \
        -o ${lm_out}.iarpa \
        -n ${n}

    # compile into `arpa` format
    compile-lm \
        ${lm_out}.iarpa.gz \
        --text=yes \
        ${lm_out}.arpa

    # decompress `iarpa` format
    gzip -d ${lm_out}.iarpa.gz
done

We'll just `decompress` the remaining `gzip`ped file, and then we should have **four** language models, in both the `ARPA` and the `iARPA` formats.

In [None]:
gzip -d resource_files/language_model/*.gz

In [None]:
ls resource_files/language_model | grep "\.iarpa"

In [None]:
ls resource_files/language_model | grep "\.arpa"

We will be using the `ARPA` format in our `ASR` pipeline.  However, the `python` package we will use in the next notebook to examine the language models requires the `iARPA` format.

Unfortunately, that `python` package is a bit picky about formatting, and so we have to run a quick `sed` command over our `.iarpa` language models to make them acceptable.

In [None]:
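for lm in resource_files/language_model/*.iarpa; do
    # turn the space after each run of digits and dots into a tab
    sed -i.bak -E "s:([\-\.0-9]+) :\1\t:g" "${lm}"
    # remove the backup copy that `sed -i.bak` left for this file
    rm -f "${lm}.bak"
done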
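
To see what that substitution does, here it is applied to a single line. The line is a made-up example of a space-separated entry, not copied from one of the generated files, and the sketch assumes GNU `sed`, which understands `\t` as a tab.

```shell
# a made-up n-gram line with space-separated fields
sample_line='-0.30103 the -0.1'

# same substitution as in the loop: a space that directly follows
# a run of digits and dots becomes a tab
converted=$(printf '%s\n' "${sample_line}" | sed -E "s:([\-\.0-9]+) :\1\t:g")

# print the result with the tab made visible as <TAB>
printf '%s\n' "${converted}" | sed -E "s:\t:<TAB>:g"
```

Only the space after the leading number changes. The space after `the` is left alone because it does not follow a digit or a dot.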