# Welcome to ExKaldi

In this section, we will train an N-gram language model and query it.

Although __SriLM__ is available in ExKaldi, we recommend the __KenLM__ toolkit.

In [1]:
CONDA_DIR = "/home/khanh/workspace/miniconda3"
KALDI_ENV = "kaldi"
EXKALDI_ENV = "exkaldi"
KALDI_ROOT = "/home/khanh/workspace/projects/kaldi"

DATA_DIR = "librispeech_dummy"

def import_exkaldi():
    import os

    # add lib paths (entries are joined with os.pathsep, which is ":" on Linux, not ";")
    os.environ["LD_LIBRARY_PATH"] = os.pathsep.join([
        os.path.join(CONDA_DIR, "envs", KALDI_ENV, "lib"),
        os.path.join(CONDA_DIR, "envs", EXKALDI_ENV, "lib"),
    ])

    import exkaldi
    exkaldi.info.reset_kaldi_root(KALDI_ROOT)

    return exkaldi
exkaldi = import_exkaldi()
dataDir = DATA_DIR

import os

Make sure the Kaldi root directory is set, as the helper above does with KALDI_ROOT:
exkaldi.info.reset_kaldi_root( yourPath )
If it is not, errors will occur when calling some core functions.
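
The helper above sets a library search path made of several directories. As a minimal sketch (the function name `build_library_path` and the `/opt/conda` prefix are hypothetical, chosen only for illustration), the directories can be joined with the platform's own separator instead of a hard-coded character:

```python
import os

def build_library_path(conda_dir, env_names):
    """Join the lib directory of each conda environment into one search path."""
    # Hypothetical helper: one "lib" directory per environment name.
    lib_dirs = [os.path.join(conda_dir, "envs", name, "lib") for name in env_names]
    # os.pathsep is ":" on Linux and macOS and ";" on Windows.
    return os.pathsep.join(lib_dirs)

library_path = build_library_path("/opt/conda", ["kaldi", "exkaldi"])
print(library_path)
```

Note that a value written to os.environ is inherited by child processes started afterwards; it does not change how libraries already loaded by the running Python process were resolved.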


First, prepare the lexicons. We have already generated a __LexiconBank__ object and saved it to a file (3_prepare_lexicons), so we restore it directly.

In [2]:
lexFile = os.path.join(dataDir, "exp", "lexicons.lex")

lexicons = exkaldi.load_lex(lexFile)

lexicons

<exkaldi.decode.graph.LexiconBank at 0x7efc105385b0>

We will use the training text corpus to train the language model. Although we have prepared a transcription file in the data directory, we do not need the utterance-ID information at the head of each line, so a little work is needed to produce a new text file.

The exkaldi __Transcription__ class can help with this.

In [3]:
textFile = os.path.join(dataDir, "train", "text")

trans = exkaldi.load_transcription(textFile)

trans

{'103-1240-0000': 'CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A BROOK',
 '103-1240-0001': "THAT HAD ITS SOURCE AWAY BACK IN THE WOODS OF THE OLD CUTHBERT PLACE IT WAS REPUTED TO BE AN INTRICATE HEADLONG BROOK IN ITS EARLIER COURSE THROUGH THOSE WOODS WITH DARK SECRETS OF POOL AND CASCADE BUT BY THE TIME IT REACHED LYNDE'S HOLLOW IT WAS A QUIET WELL CONDUCTED LITTLE STREAM",
 '103-1240-0002': "FOR NOT EVEN A BROOK COULD RUN PAST MISSUS RACHEL LYNDE'S DOOR WITHOUT DUE REGARD FOR DECENCY AND DECORUM IT PROBABLY WAS CONSCIOUS THAT MISSUS RACHEL WAS SITTING AT HER WINDOW KEEPING A SHARP EYE ON EVERYTHING THAT PASSED FROM BROOKS AND CHILDREN UP",
 '103-1240-0003': "AND THAT IF SHE NOTICED ANYTHING ODD OR OUT OF PLACE SHE WOULD NEVER REST UNTIL SHE HAD FERRETED OUT THE WHYS AND WHEREFORES THEREOF THERE ARE PLENTY OF PEOPLE IN AVONLEA AND OUT OF

In [4]:
newTextFile = os.path.join(dataDir, "exp", "train_lm_text")

trans.save(fileName=newTextFile, discardUttID=True)

'librispeech_dummy/exp/train_lm_text'

Actually, you do not need to do this. If you use a __Transcription__ object to train the language model, the utterance-ID information is discarded automatically.

Now we train a 2-gram model with the __KenLM__ backend.
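
To make concrete what discarding the utterance ID means, here is a minimal plain-Python sketch. It assumes the usual Kaldi text layout, where each line is an utterance ID followed by the transcript; the function name `strip_utt_ids` is hypothetical and is not part of exkaldi, and the two sample lines are shortened versions of the utterances shown above.

```python
def strip_utt_ids(lines):
    """Drop the leading utterance ID from each Kaldi-style transcript line."""
    sentences = []
    for line in lines:
        # Split once on whitespace: first field is the ID, the rest is the text.
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        sentences.append(parts[1] if len(parts) == 2 else "")
    return sentences

kaldi_text = [
    "103-1240-0000 CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED",
    "103-1240-0001 THAT HAD ITS SOURCE AWAY BACK IN THE WOODS",
]

for sentence in strip_utt_ids(kaldi_text):
    print(sentence)
```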

In [5]:
arpaFile = os.path.join(dataDir, "exp", "2-gram.arpa")

exkaldi.lm.train_ngrams_kenlm(lexicons, order=2, text=trans, outFile=arpaFile, config={"-S":"20%"})

'librispeech_dummy/exp/2-gram.arpa'

An ARPA model can be transformed to binary format in order to accelerate loading and reduce memory cost.  
Although the __KenLM__ Python API supports reading the ARPA format, exkaldi only expects the KenLM binary format.

In [8]:
binaryLmFile = os.path.join(dataDir, "exp", "2-gram.binary")

exkaldi.lm.arpa_to_binary(arpaFile, binaryLmFile)

'librispeech_dummy/exp/2-gram.binary'
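
For readers unfamiliar with the ARPA text format, the sketch below reads the n-gram counts from the header of a tiny hand-written ARPA string. The probabilities in `TOY_ARPA` are made up for illustration and were not produced by KenLM, and `read_arpa_counts` is a hypothetical helper, not an exkaldi function.

```python
import io

# A tiny hand-written ARPA model: a header with counts, then one section per order.
# Each entry is "log10 probability <TAB> n-gram [<TAB> log10 backoff weight]".
TOY_ARPA = """\\data\\
ngram 1=4
ngram 2=3

\\1-grams:
-0.9\t<unk>
-99\t<s>\t-0.3
-0.6\t</s>
-0.5\tHELLO\t-0.2

\\2-grams:
-0.3\t<s> HELLO
-0.4\tHELLO </s>
-0.7\tHELLO HELLO

\\end\\
"""

def read_arpa_counts(stream):
    """Return {order: number of n-grams} parsed from the ARPA header."""
    counts = {}
    in_header = False
    for line in stream:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header and line.startswith("ngram "):
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif in_header and line.startswith("\\"):
            break  # first n-gram section reached, header is over
    return counts

arpa_counts = read_arpa_counts(io.StringIO(TOY_ARPA))
print(arpa_counts)
```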

Use the binary LM file to initialize a Python KenLM n-grams object.

In [9]:
model = exkaldi.lm.KenNGrams(binaryLmFile)

model

<exkaldi.lm.lm.KenNGrams at 0x7efbf9959d00>

__KenNGrams__ is a simple wrapper of the KenLM Python Model. Check the model information:

In [10]:
model.info

NgramInfo(order=2, path="b'/home/khanh/workspace/projects/exkaldi/tutorials/librispeech_dummy/exp/2-gram.binary'")

You can query this model with a sentence.

In [11]:
model.score_sentence("HELLO WORLD", bos=True, eos=True)

-8.526777267456055
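
The returned value is a log10 probability of the whole sentence. The following toy sketch shows how a backoff bigram model arrives at such a number; the tables `UNIGRAMS` and `BIGRAMS` contain made-up values, and the functions are illustrative stand-ins, not the KenLM or exkaldi implementation.

```python
# word -> (log10 probability, log10 backoff weight); values are made up.
UNIGRAMS = {
    "<unk>": (-1.0, 0.0),
    "<s>": (-99.0, -0.3),
    "</s>": (-0.6, 0.0),
    "HELLO": (-0.5, -0.2),
    "WORLD": (-0.8, -0.1),
}
# (previous word, word) -> log10 probability; values are made up.
BIGRAMS = {("<s>", "HELLO"): -0.3, ("HELLO", "WORLD"): -0.4}

def bigram_logprob(prev, word):
    """log10 P(word | prev), backing off to the unigram when the bigram is unseen."""
    if (prev, word) in BIGRAMS:
        return BIGRAMS[(prev, word)]
    return UNIGRAMS[prev][1] + UNIGRAMS[word][0]

def toy_score_sentence(sentence, bos=True, eos=True):
    """Sum the log10 probabilities of every token in the sentence."""
    # Map out-of-vocabulary words to <unk>.
    words = [w if w in UNIGRAMS else "<unk>" for w in sentence.split()]
    if eos:
        words.append("</s>")
    prev = "<s>" if bos else None
    total = 0.0
    for word in words:
        if prev is None:
            total += UNIGRAMS[word][0]  # no context: plain unigram probability
        else:
            total += bigram_logprob(prev, word)
        prev = word
    return total

toy_score = toy_score_sentence("HELLO WORLD", bos=True, eos=True)
print(f"{toy_score:.2f}")  # -0.3 + -0.4 + (-0.1 + -0.6) = -1.40
```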

Here is an example that scores a test corpus in order to evaluate the language model.

In [12]:
evalTrans = exkaldi.load_transcription( os.path.join(dataDir, "test", "text") )

score = model.score(evalTrans)

score

{'1272-128104-0000': -46.6329231262207,
 '1272-128104-0001': -34.10002136230469,
 '1272-128104-0002': -92.48436737060547,
 '1272-128104-0003': -71.13314056396484,
 '1272-128104-0004': -193.6737518310547,
 '1272-135031-0000': -79.55458068847656,
 '1272-135031-0001': -75.40349578857422,
 '1272-135031-0002': -76.00621032714844,
 '1272-135031-0003': -38.27351379394531,
 '1272-135031-0004': -37.07489013671875,
 '1272-141231-0000': -23.847429275512695,
 '1272-141231-0001': -47.61217498779297,
 '1272-141231-0002': -88.02192687988281,
 '1272-141231-0003': -43.62815856933594,
 '1272-141231-0004': -28.254606246948242,
 '1462-170138-0000': -90.78235626220703,
 '1462-170138-0001': -33.32261657714844,
 '1462-170138-0002': -54.319480895996094,
 '1462-170138-0003': -12.376655578613281,
 '1462-170138-0004': -20.674068450927734}

In [13]:
type(score)

exkaldi.core.archive.Metric

___score___ is an exkaldi __Metric__ (a subclass of Python dict) object. 

We designed a group of classes to hold Kaldi text-format tables and exkaldi's own text-format data:

__ListTable__: spk2utt, utt2spk, words, phones and so on.  
__Transcription__: transcription corpus, n-best decoding result and so on.  
__Metric__: AM score, LM score, LM perplexity, sentence lengths and so on.  
__IndexTable__: The index of binary data.  
__WavSegment__: The wave information.  

All these classes are subclasses of Python dict. They have some common and respective methods and attributes. 

For example, we can compute the average value of a __Metric__.

In [14]:
score.mean()

-59.358818435668944

More precisely, we can compute the average weighted by the length of the sentences.

In [15]:
score.mean( weight= evalTrans.sentence_length() )

-85.36429487857703
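
The idea of a dict subclass with a plain and a weighted mean can be sketched as follows. `ToyMetric` is a hypothetical stand-in, not exkaldi's __Metric__, and it assumes the conventional weighted-average definition, sum(weight * value) / sum(weight); the scores and lengths are invented.

```python
class ToyMetric(dict):
    """A dict of utterance ID -> score with a mean() helper (illustrative only)."""

    def mean(self, weight=None):
        if not self:
            raise ValueError("cannot average an empty metric")
        if weight is None:
            # Plain average: every utterance counts equally.
            return sum(self.values()) / len(self)
        # Weighted average: each score is weighted, e.g. by sentence length.
        total_weight = sum(weight[utt] for utt in self)
        return sum(self[utt] * weight[utt] for utt in self) / total_weight

toy_scores = ToyMetric({"utt-1": -10.0, "utt-2": -30.0})
toy_lengths = {"utt-1": 2, "utt-2": 8}

print(toy_scores.mean())                    # (-10 - 30) / 2 = -20.0
print(toy_scores.mean(weight=toy_lengths))  # (-20 - 240) / 10 = -26.0
```

Because the longer utterance has the lower score here, weighting by length pulls the average down, which is the same direction as the change seen in the real output above.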

In practice, perplexity is more commonly used to evaluate a language model.

In [16]:
model.perplexity(evalTrans)

PPL(prob=-1187.18, sentences=20, words=419, ppl=506.14, ppl1=681.33)
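
The reported figures are consistent with the usual perplexity definition, 10 raised to the negative average log10 probability per token, where ppl counts one end-of-sentence token per sentence and ppl1 counts words only. The sketch below recomputes both from the numbers printed above; the function name and its parameters are illustrative and are not part of exkaldi.

```python
def perplexity_from_logprob(total_log10_prob, n_words, n_sentences, count_eos=True):
    """Perplexity = 10 ** (-total log10 probability / number of scored tokens)."""
    # With count_eos=True, each sentence contributes one extra </s> token.
    n_tokens = n_words + (n_sentences if count_eos else 0)
    return 10 ** (-total_log10_prob / n_tokens)

# Values copied from the PPL output above (prob is rounded to two decimals there).
ppl = perplexity_from_logprob(-1187.18, n_words=419, n_sentences=20)
ppl1 = perplexity_from_logprob(-1187.18, n_words=419, n_sentences=20, count_eos=False)

print(f"ppl={ppl:.2f} ppl1={ppl1:.2f}")
```

The results land close to the reported 506.14 and 681.33; small differences in the last digit come from the rounded prob value.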

Back to the language model. If you want to query an ARPA model directly, you can use this function.

In [17]:
model = exkaldi.load_ngrams(arpaFile)

model.info

NgramInfo(order=2, path='librispeech_dummy/exp/2-gram.arpa')

To conclude this section, we generate the grammar FST (G.fst) for further steps.

In [18]:
Gfile = os.path.join(dataDir, "exp", "G.fst")

exkaldi.decode.graph.make_G(lexicons, arpaFile, outFile=Gfile, order=2)

'/home/khanh/workspace/projects/exkaldi/tutorials/librispeech_dummy/exp/G.fst'