# 2.1: Building a language model

`kaldi` has native support for the `ARPA` format for language models.  A good explanation of that format can be read [here](https://cmusphinx.github.io/wiki/arpaformat/).
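
As a rough sketch of the basic layout, the snippet below writes a tiny hand-made file in the general `ARPA` shape and lists its section markers.  The values and the `mktemp` scratch file are illustrative assumptions, not output from any toolkit.  Each n-gram line holds a base-10 log probability, the n-gram itself, and (for all but the highest order) an optional backoff weight.

```shell
# Write a tiny, hand-made ARPA-style file (all numbers are made up for illustration).
arpa_file=$(mktemp)
cat > "$arpa_file" <<'EOF'

\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.60	<s>	-0.30
-0.48	the	-0.30
-0.60	</s>

\2-grams:
-0.18	<s> the
-0.30	the </s>

\end\
EOF

# List the section markers, which is a quick way to see the overall layout.
arpa_sections() {
    grep -E '^\\' "$1"
}

arpa_sections "$arpa_file"
```

The header under `\data\` gives the number of entries per order, and each `\N-grams:` section lists that many lines before the closing `\end\`.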

A popular open-source language modeling toolkit that outputs in the `ARPA` format is `IRSTLM`.  Its manual can be found in `resource_files/irstlm-manual.pdf`.

We will build a language model from a toy corpus (using `IRSTLM`) and then analyze it.

## the toy corpus

A toy corpus is in `resource_files/animal_corpus.txt`.  In this corpus, each line represents a sentence, and there is *no* punctuation present.

**Note:** From the perspective of a language model, one *could* model punctuation if that were of importance, but since our purpose is to model *spoken* text, we do *not* have any need to model punctuation.

In [1]:
cat resource_files/animal_corpus.txt

the mouse ate the cheese
the cat ate the mouse
the dog ate the cat
the lion at the dog
the tyrannosaurus rex ate the lion
the human shot the mouse
the human shot the cat
the human shot the dog
the human shot the lion
the human shot the tyrannosaurus rex
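
Before building a model, it can help to see the raw counts it will be estimated from.  The following is a minimal sketch using standard Unix tools; the scratch file from `mktemp` and the helper name `word_counts` are assumptions made for illustration.

```shell
# Recreate the toy corpus in a scratch file so the sketch is self-contained.
corpus=$(mktemp)
cat > "$corpus" <<'EOF'
the mouse ate the cheese
the cat ate the mouse
the dog ate the cat
the lion at the dog
the tyrannosaurus rex ate the lion
the human shot the mouse
the human shot the cat
the human shot the dog
the human shot the lion
the human shot the tyrannosaurus rex
EOF

# Put one token on each line, then count how often each word type occurs.
word_counts() {
    tr -s ' ' '\n' < "$1" | sort | uniq -c | sort -k1,1nr -k2
}

word_counts "$corpus"
echo "sentences:  $(awk 'END { print NR }' "$corpus")"
echo "word types: $(word_counts "$corpus" | awk 'END { print NR }')"
```

For this corpus the sketch reports 10 sentences and 12 distinct word types, with `the` the most frequent word at 20 occurrences.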


## building the language model with `IRSTLM`

After `export`ing a few variables, we will be able to call scripts from `IRSTLM` without a full path.

In [2]:
export IRSTLM=${KALDI_PATH}/tools/irstlm
export PATH=${PATH}:${IRSTLM}/bin
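
A quick way to confirm that the `export`s took effect is to ask the shell whether it can find the scripts used below.  This is a sketch only: it assumes `KALDI_PATH` was already set to the Kaldi checkout, and `require_tool` is a hypothetical helper name.

```shell
# Report whether a command can be found on the current PATH.
require_tool() {
    if command -v "$1" > /dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

for tool in add-start-end.sh build-lm.sh; do
    require_tool "$tool" || echo "check that IRSTLM is built and its bin directory is on PATH"
done
```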

### `add-start-end.sh`

Since our corpus does *not* have periods, we need to add custom symbols to mark the *beginning* and *end* of each sentence.

In [4]:
add-start-end.sh -h


add-start-end.sh - adds sentence start/end symbols in each line and trims very very long words

USAGE:
       add-start-end.sh [options]

OPTIONS:
       -h        Show this message
       -r count  Specify symbol repetitions (default 1)
       -t length Trim words up to _length_ chars (default 80)
       -s char   Specify symbol (default s)



In [22]:
cat resource_files/animal_corpus.txt | add-start-end.sh > resource_files/animal_corpus_start_stop.txt

In [24]:
cat resource_files/animal_corpus_start_stop.txt

<s> the mouse ate the cheese  </s>
<s> the cat ate the mouse  </s>
<s> the dog ate the cat  </s>
<s> the lion at the dog  </s>
<s> the tyrannosaurus rex ate the lion  </s>
<s> the human shot the mouse  </s>
<s> the human shot the cat  </s>
<s> the human shot the dog  </s>
<s> the human shot the lion  </s>
<s> the human shot the tyrannosaurus rex  </s>
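
To make the transformation concrete, here is a rough stand-in written with `sed`.  It is not the `IRSTLM` script: it only adds the markers, does not trim long words, and puts a single space before `</s>` (the real output above has two).  The function name `add_markers` is hypothetical.

```shell
# Prefix every line with "<s> " and suffix it with " </s>".
add_markers() {
    sed -e 's/^/<s> /' -e 's/$/ <\/s>/'
}

printf 'the mouse ate the cheese\nthe cat ate the mouse\n' | add_markers
```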


### `build-lm.sh`

Now let's build the actual language model using `build-lm.sh`.

In [7]:
build-lm.sh -h


build-lm.sh - estimates a language model file and saves it in intermediate ARPA format

USAGE:
       build-lm.sh [options]

OPTIONS:
       -i|--InputFile          Input training file e.g. 'gunzip -c train.gz'
       -o|--OutputFile         Output gzipped LM, e.g. lm.gz
       -k|--Parts              Number of splits (default 5)
       -n|--NgramSize          Order of language model (default 3)
       -d|--Dictionary         Define subdictionary for n-grams (optional, default is without any subdictionary)
       -s|--LanguageModelType  Smoothing methods: witten-bell (default), shift-beta, improved-shift-beta, stupid-backoff; kneser-ney and improved-kneser-ney still accepted for back-compatibility, but mapped into shift-beta and improved-shift-beta, respectively
       -p|--PruneSingletons    Prune singleton n-grams (default false)
       -f|--PruneFrequencyThreshold      Pruning frequency threshold for each level; comma-separated list of values; (default is '0,0,...,0', for all level

The main arguments we will focus on are:
 - `-i`
 - `-o`
 - `-n`

`-k` is an important argument for efficient language modeling on a very large corpus.  With our toy example, we do not need to worry about that.  You'll also notice a number of options for `-s` which relate to the type of `smoothing` used.  Stanford has a great resource on `smoothing` that you can find in `resource_files/smoothing_explained.pdf`.  For now, we will ignore both of these arguments.

In [40]:
build-lm.sh \
    -i resource_files/animal_corpus_start_stop.txt \
    -o resource_files/animal_lm-2_gram.lm \
    -n 2

LOGFILE:/dev/null
$bin/ngt -i="$inpfile" -n=$order -gooout=y -o="$gzip -c > $tmpdir/ngram.${sdict}.gz" -fd="$tmpdir/$sdict" $dictionary $additional_parameters >> $logfile 2>&1
$scr/build-sublm.pl $verbose $prune $prune_thr_str $smoothing "$additional_smoothing_parameters" --size $order --ngrams "$gunzip -c $tmpdir/ngram.${sdict}.gz" -sublm $tmpdir/lm.$sdict $additional_parameters >> $logfile 2>&1


`IRSTLM` automatically compresses the resulting language model with `gzip` (note the `.gz` suffix added to the output name), so we will decompress it in order to inspect it.

In [42]:
gzip -d resource_files/animal_lm-2_gram.lm.gz
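
Note that `gzip -d` replaces the `.gz` file with the decompressed one.  If keeping the archive is preferable, the contents can be streamed instead.  The sketch below uses a hand-made stand-in file in a scratch directory rather than a real model.

```shell
# Build a small gzipped file in a scratch directory to stand in for the model.
workdir=$(mktemp -d)
printf 'iARPA\n\n\\data\\\nngram 1=\t15\n' > "$workdir/example.lm"
gzip "$workdir/example.lm"          # leaves only example.lm.gz

# Stream the decompressed text to stdout; the .gz file stays in place.
gzip -dc "$workdir/example.lm.gz" | head -n 3
```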

In [44]:
cat resource_files/animal_lm-2_gram.lm

iARPA

\data\
ngram 1=	15
ngram 2=	25

\1-grams:
-1.662758	<s>	-0.845098
-0.641569	the	-0.586266
-1.361728	mouse	-0.397940
-1.264818	ate	-0.698970
-1.662758	cheese	-0.301030
-0.922395	</s>	-1.041393
-1.361728	cat	-0.397940
-1.361728	dog	-0.397940
-1.361728	lion	-0.397940
-1.662758	at	-0.301030
-1.486667	tyrannosaurus	-0.477121
-1.486667	rex	-0.301030
-1.185637	human	-0.778151
-1.185637	shot	-0.778151
-0.787697 <unk>
\2-grams:
-0.845098	<s> <s>
-0.146128	<s> the
-0.954243	the mouse
-1.431364	the cheese
-0.954243	the cat
-0.954243	the dog
-0.954243	the lion
-1.130334	the tyrannosaurus
-0.732394	the human
-0.698970	mouse ate
-0.397940	mouse </s>
-0.096910	ate the
-0.301030	cheese </s>
-0.698970	cat ate
-0.397940	cat </s>
-0.698970	dog ate
-0.397940	dog </s>
-0.397940	lion </s>
-0.698970	lion at
-0.301030	at the
-0.176091	tyrannosaurus rex
-0.602060	rex ate
-0.602060	rex </s>
-0.079181	human shot
-0.079181	shot the
\end\
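
The 2-gram values in this listing appear consistent with the Witten-Bell estimate (the default `-s` setting), in which a seen bigram gets `c(h,w) / (c(h) + T(h))`, where `c(h)` is the count of the history word and `T(h)` is the number of distinct words that follow it.  The sketch below reproduces a few entries under that assumption.  It is an observation about these numbers, not a description of `IRSTLM` internals, and the function names are hypothetical.

```shell
# log10 of c(h,w) / (c(h) + T(h)): the Witten-Bell discounted bigram estimate.
wb_log10_prob() {
    awk -v c="$1" -v n="$2" -v t="$3" \
        'BEGIN { printf "%.6f\n", log(c / (n + t)) / log(10) }'
}

# log10 of T(h) / (c(h) + T(h)): the mass reserved for backing off from h.
wb_log10_backoff() {
    awk -v n="$1" -v t="$2" \
        'BEGIN { printf "%.6f\n", log(t / (n + t)) / log(10) }'
}

# "the" occurs 20 times in the corpus and is followed by 7 distinct words.
wb_log10_prob 3 20 7      # "the mouse" occurs 3 times
wb_log10_prob 1 20 7      # "the cheese" occurs once
wb_log10_backoff 20 7     # weight listed next to "the" in the 1-gram section
```

These calls print `-0.954243`, `-1.431364`, and `-0.586266`, matching the `the mouse`, `the cheese`, and `the` entries above.  Because the header reads `iARPA` (the intermediate format mentioned in the `build-lm.sh` help text), these are best read as smoothed estimates plus backoff weights rather than as final probabilities.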


Now let's build a `2-gram` language model that does **not** include `start` and `stop` symbols.  We can do this by using our original `animal_corpus.txt` file as `input`.

In [4]:
build-lm.sh \
    -i resource_files/animal_corpus.txt \
    -o resource_files/animal_lm-2_gram-no_start_stop.lm \
    -n 2

LOGFILE:/dev/null
$bin/ngt -i="$inpfile" -n=$order -gooout=y -o="$gzip -c > $tmpdir/ngram.${sdict}.gz" -fd="$tmpdir/$sdict" $dictionary $additional_parameters >> $logfile 2>&1
$scr/build-sublm.pl $verbose $prune $prune_thr_str $smoothing "$additional_smoothing_parameters" --size $order --ngrams "$gunzip -c $tmpdir/ngram.${sdict}.gz" -sublm $tmpdir/lm.$sdict $additional_parameters >> $logfile 2>&1
$scr/build-sublm.pl $verbose $prune $prune_thr_str $smoothing "$additional_smoothing_parameters" --size $order --ngrams "$gunzip -c $tmpdir/ngram.${sdict}.gz" -sublm $tmpdir/lm.$sdict $additional_parameters >> $logfile 2>&1


In [5]:
gzip -d resource_files/animal_lm-2_gram-no_start_stop.lm.gz && cat resource_files/animal_lm-2_gram-no_start_stop.lm

iARPA

\data\
ngram 1=	13
ngram 2=	23

\1-grams:
-0.564271	the	-0.586266
-1.284431	mouse	-0.397940
-1.187521	ate	-0.698970
-1.585461	cheese	-0.301030
-1.284431	cat	-0.397940
-1.284431	dog	-0.397940
-1.284431	lion	-0.397940
-1.585461	at	-0.301030
-1.409369	tyrannosaurus	-0.477121
-1.409369	rex	-0.301030
-1.108339	human	-0.778151
-1.108339	shot	-0.778151
-0.772547 <unk>
\2-grams:
-0.954243	the mouse
-1.431364	the cheese
-0.954243	the cat
-0.954243	the dog
-0.954243	the lion
-1.130334	the tyrannosaurus
-0.732394	the human
-0.397940	mouse the
-0.698970	mouse ate
-0.096910	ate the
-0.301030	cheese the
-0.397940	cat the
-0.698970	cat ate
-0.397940	dog the
-0.698970	dog ate
-0.397940	lion the
-0.698970	lion at
-0.301030	at the
-0.176091	tyrannosaurus rex
-0.602060	rex <s>
-0.602060	rex ate
-0.079181	human shot
-0.079181	shot the
\end\
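
Entries such as `mouse the` and `cheese the` never occur inside a single sentence of the corpus.  One plausible reading is that, without boundary symbols, n-grams are counted across line breaks as if the file were one long token stream.  The sketch below illustrates that reading on the first three corpus lines; it is an assumption about the behaviour rather than a statement about `IRSTLM` internals, and it does not account for the `rex <s>` entry.

```shell
# Treat the input as one token stream and count adjacent word pairs.
stream_bigrams() {
    tr -s ' ' '\n' |
        awk 'NR > 1 { print prev, $0 } { prev = $0 }' |
        sort | uniq -c
}

# First three lines of the toy corpus, with no start/stop symbols.
printf 'the mouse ate the cheese\nthe cat ate the mouse\nthe dog ate the cat\n' |
    stream_bigrams | grep -E ' (mouse|cheese) the$'
```

On this excerpt the pairs `cheese the` and `mouse the` each show up once, and both exist only because the last word of one line is followed by the first word of the next.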


Let's also build a `3-gram` and a `4-gram` model, both using `start` and `stop` symbols.

In [6]:
for n in 3 4; do
    lm_out=resource_files/animal_lm-${n}_gram.lm
    build-lm.sh \
        -i resource_files/animal_corpus_start_stop.txt \
        -o ${lm_out} \
        -n ${n}
    gzip -d ${lm_out}.gz
done

LOGFILE:/dev/null
$bin/ngt -i="$inpfile" -n=$order -gooout=y -o="$gzip -c > $tmpdir/ngram.${sdict}.gz" -fd="$tmpdir/$sdict" $dictionary $additional_parameters >> $logfile 2>&1
ls: cannot access 'stat_204/dict.*': No such file or directory
gzip: resource_files/animal_lm-3_gram.lm.gz: No such file or directory
LOGFILE:/dev/null
$bin/ngt -i="$inpfile" -n=$order -gooout=y -o="$gzip -c > $tmpdir/ngram.${sdict}.gz" -fd="$tmpdir/$sdict" $dictionary $additional_parameters >> $logfile 2>&1
ls: cannot access 'stat_223/dict.*': No such file or directory
gzip: resource_files/animal_lm-4_gram.lm.gz: No such file or directory
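
In the output above, neither build wrote an archive, so the `gzip -d` step failed as well.  A defensive variant of the loop could skip decompression when the archive is missing, which keeps the real error easier to spot.  The sketch below shows only that guard, using a hand-made archive in a scratch directory in place of a real build; `decompress_if_built` is a hypothetical helper.

```shell
# Decompress a model only if the expected archive was actually written.
decompress_if_built() {
    if [ -f "$1.gz" ]; then
        gzip -d "$1.gz"
    else
        echo "skipping $1: $1.gz was not produced" >&2
        return 1
    fi
}

# Stand-in for a successful build: create an archive by hand.
workdir=$(mktemp -d)
echo 'iARPA' | gzip -c > "$workdir/ok.lm.gz"

decompress_if_built "$workdir/ok.lm"                                  # leaves ok.lm
decompress_if_built "$workdir/missing.lm" || echo "build step needs attention"
```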

