# Welcome to ExKaldi

In this section, we will train an N-gram language model and query it.

Although __SriLM__ is available in ExKaldi, we recommend the __KenLM__ toolkit.

In [None]:
import exkaldi

import os
dataDir = "librispeech_dummy"

First, prepare the lexicons. We have already generated a __LexiconBank__ object and saved it to a file (in 3_prepare_lexicons), so we restore it directly.

In [None]:
lexFile = os.path.join(dataDir, "exp", "lexicons.lex")

lexicons = exkaldi.load_lex(lexFile)

lexicons

We will use the training text corpus to train the language model. Although we have prepared a transcription file in the data directory, we do not need the utterance-ID information at the head of each line, so a bit of work is needed to produce a new text file.

The exkaldi __Transcription__ class can help with this.

In [None]:
textFile = os.path.join(dataDir, "train", "text")

trans = exkaldi.load_transcription(textFile)

trans

In [None]:
newTextFile = os.path.join(dataDir, "exp", "train_lm_text")

trans.save(fileName=newTextFile, discardUttID=True)

Actually, you don't need to do this. If you use a __Transcription__ object to train the language model, the utterance-ID information is discarded automatically.
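For reference, discarding the utterance ID amounts to dropping the first whitespace-separated field of each line. The following is a minimal plain-Python sketch of that idea, not exkaldi's implementation; the helper `strip_utt_ids` and the sample lines are hypothetical.

```python
def strip_utt_ids(lines):
    """Drop the first whitespace-separated field (the utterance ID) of each line."""
    sentences = []
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        # A line that holds only an utterance ID yields an empty sentence.
        sentences.append(parts[1] if len(parts) > 1 else "")
    return sentences


# Hypothetical lines in Kaldi "text" format: <utt-id> <word> <word> ...
kaldi_lines = [
    "utt-0001 HELLO WORLD",
    "utt-0002 GOOD MORNING EVERYONE",
]

lm_text = strip_utt_ids(kaldi_lines)
print(lm_text)
```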

Now we train a 2-gram model with the __KenLM__ backend.

In [None]:
arpaFile = os.path.join(dataDir, "exp", "2-gram.arpa")

exkaldi.lm.train_ngrams_kenlm(lexicons, order=2, text=trans, outFile=arpaFile, config={"-S":"20%"})

An ARPA model can be transformed to binary format in order to accelerate loading and reduce memory cost.  
Although the __KenLM__ Python API supports reading the ARPA format, in exkaldi we only accept the KenLM binary format.

In [None]:
binaryLmFile = os.path.join(dataDir, "exp", "2-gram.binary")

exkaldi.lm.arpa_to_binary(arpaFile, binaryLmFile)
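An ARPA file is plain text: a `\data\` header lists how many n-grams of each order the model holds, followed by one section per order. As an illustration of why a text model has to be parsed line by line on every load, here is a small sketch that reads the header counts. The helper `read_arpa_counts` and the toy model content (including its probability values) are hypothetical and were not produced by the training step above.

```python
import io


def read_arpa_counts(stream):
    """Return {order: count} read from the data header of an ARPA file."""
    counts = {}
    in_header = False
    for line in stream:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header and line.startswith("ngram "):
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif in_header and line.startswith("\\"):
            break  # first n-gram section reached, the header is over
    return counts


# A toy 2-gram model in ARPA layout (values are made up for illustration).
toy_arpa = "\n".join([
    "\\data\\",
    "ngram 1=4",
    "ngram 2=3",
    "",
    "\\1-grams:",
    "-99.0\t<s>\t-0.30",
    "-0.70\t</s>",
    "-0.60\tHELLO\t-0.25",
    "-0.65\tWORLD\t-0.20",
    "",
    "\\2-grams:",
    "-0.20\t<s> HELLO",
    "-0.30\tHELLO WORLD",
    "-0.15\tWORLD </s>",
    "",
    "\\end\\",
])

counts = read_arpa_counts(io.StringIO(toy_arpa))
print(counts)
```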

Use the binary LM file to initialize a Python KenLM n-grams object.

In [None]:
model = exkaldi.lm.KenNGrams(binaryLmFile)

model

__KenNGrams__ is a simple wrapper of the KenLM Python model. Check the model information:

In [None]:
model.info

You can query this model with a sentence.

In [None]:
model.score_sentence("HELLO WORLD", bos=True, eos=True)
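To give an intuition for what such a query returns, the sketch below scores a sentence with a toy back-off bigram model: the sentence is optionally padded with `<s>` and `</s>`, and the log10 probabilities of successive words are summed. This is an assumption-laden illustration, not the KenLM or exkaldi code; the tables, the function names and all probability values are hypothetical, and out-of-vocabulary words are not handled.

```python
# Hypothetical log10 probabilities and back-off weights (not from a trained model).
UNIGRAMS = {  # word: (log10 probability, log10 back-off weight)
    "<s>": (-99.0, -0.30),
    "</s>": (-0.70, 0.0),
    "HELLO": (-0.60, -0.25),
    "WORLD": (-0.65, -0.20),
}
BIGRAMS = {  # (history, word): log10 probability
    ("<s>", "HELLO"): -0.20,
    ("HELLO", "WORLD"): -0.30,
    ("WORLD", "</s>"): -0.15,
}


def bigram_logprob(history, word):
    """Use the bigram if it was seen, otherwise back off to the unigram."""
    if (history, word) in BIGRAMS:
        return BIGRAMS[(history, word)]
    return UNIGRAMS[history][1] + UNIGRAMS[word][0]


def toy_score_sentence(sentence, bos=True, eos=True):
    """Sum log10 probabilities over the sentence, with optional <s> and </s>."""
    words = sentence.split()
    if eos:
        words = words + ["</s>"]
    history = "<s>" if bos else None
    total = 0.0
    for word in words:
        if history is None:
            total += UNIGRAMS[word][0]  # no context: plain unigram
        else:
            total += bigram_logprob(history, word)
        history = word
    return total


in_order = toy_score_sentence("HELLO WORLD")
reversed_order = toy_score_sentence("WORLD HELLO")
print(in_order, reversed_order)
```

In this toy model the word order seen in the bigram table gets the higher (less negative) score, because the reversed sentence has to back off to unigrams at every step.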

Here is an example of scoring a test corpus in order to evaluate the language model.

In [None]:
evalTrans = exkaldi.load_transcription(os.path.join(dataDir, "test", "text"))

score = model.score(evalTrans)

score

In [None]:
type(score)

___score___ is an exkaldi __Metric__ (a subclass of Python dict) object. 

We designed a group of classes to hold Kaldi text-format tables and exkaldi's own text-format data:

__ListTable__: spk2utt, utt2spk, words, phones and so on.  
__Transcription__: transcription corpus, n-best decoding result and so on.  
__Metric__: AM score, LM score, LM perplexity, sentence lengths and so on.  
__IndexTable__: The index of binary data.  
__WavSegment__: The wave information.  

All these classes are subclasses of Python dict. They share some methods and attributes, and each class also has its own.

In this case, for example, we can compute the average value of __Metric__.

In [None]:
score.mean()

More precisely, we can compute the average weighted by the length of each sentence.

In [None]:
score.mean(weight=evalTrans.sentence_length())
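The difference between the two averages can be sketched with plain dicts. This assumes the weighted mean is the usual sum of score times weight divided by the sum of weights; the helper `weighted_mean`, the utterance IDs and the numbers are hypothetical.

```python
def weighted_mean(scores, weights):
    """Average the scores, weighting each utterance by its weight."""
    total_weight = sum(weights[utt] for utt in scores)
    return sum(scores[utt] * weights[utt] for utt in scores) / total_weight


# Hypothetical per-utterance LM scores (log10) and sentence lengths.
scores = {"utt-0001": -4.0, "utt-0002": -10.0}
lengths = {"utt-0001": 2, "utt-0002": 6}

plain = sum(scores.values()) / len(scores)   # every utterance counts equally
weighted = weighted_mean(scores, lengths)    # longer utterances count more
print(plain, weighted)
```

Here the longer utterance pulls the weighted average toward its own score.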

In practice, perplexity is used more often than the raw score to evaluate a language model.

In [None]:
model.perplexity(evalTrans)
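Perplexity can be derived from a log10 sentence score. The sketch below assumes the common convention of counting one extra token for the end-of-sentence symbol; how exkaldi aggregates perplexity over a whole corpus is not shown here. The function name `perplexity_from_score` and the numbers are hypothetical.

```python
def perplexity_from_score(log10_score, num_words, count_eos=True):
    """Convert a log10 sentence score into a per-token perplexity."""
    tokens = num_words + (1 if count_eos else 0)  # optionally count </s>
    return 10.0 ** (-log10_score / tokens)


# A two-word sentence with a hypothetical log10 score of -6.0.
ppl = perplexity_from_score(-6.0, num_words=2)
print(ppl)
```

A lower perplexity means the model assigns a higher average probability to each token of the test text.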

Back to the language model. If you want to query the ARPA model directly, you can use this function.


In [None]:
model = exkaldi.load_ngrams(arpaFile)

model.info

To finish this section, we generate the grammar FST for further steps.

In [None]:
Gfile = os.path.join(dataDir, "exp", "G.fst")

exkaldi.decode.graph.make_G(lexicons, arpaFile, outFile=Gfile, order=2)