# Introduction to KenLM


KenLM is an efficient, widely used toolkit for building and querying statistical n-gram language models (LMs) at scale. It estimates models from text and returns the probability of word sequences, and it is often used in speech recognition, machine translation, and other natural language processing (NLP) tasks.

Here are some key details about KenLM:

- Efficient N-gram Language Modeling: KenLM is built on n-gram models, in which the probability of a word is conditioned on the previous $n-1$ words. This makes it cheaper to train and query than, for example, neural language models.

- Memory Optimization: KenLM is optimized for both memory use and speed, which makes it practical for large-scale applications where a model can reach several gigabytes. It can handle corpora with billions of words while keeping memory use manageable.

- C++ Implementation: KenLM is implemented in C++, which contributes to its speed and efficiency. It also offers Python bindings for easier integration in Python-based NLP pipelines.

- Probabilistic Querying: KenLM allows querying the probability of sequences of words, which is critical in applications like decoding speech recognition hypotheses or ranking candidate translations in machine translation systems.

- ARPA and Binary Formats: KenLM reads the text-based ARPA format and can also use its own binary format, which loads and queries faster.
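To make the n-gram idea in the list above concrete, here is a minimal sketch of an unsmoothed bigram model in plain Python. It is not how KenLM estimates probabilities (KenLM uses modified Kneser-Ney smoothing with backoff); the toy corpus and the helper names `train_bigram` and `bigram_prob` are made up for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over sentences padded with <s> and </s>."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts


def bigram_prob(word, prev, context_counts, bigram_counts):
    """Unsmoothed estimate P(word | prev) = count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # unseen context: a real model would back off instead
    return bigram_counts[(prev, word)] / context_counts[prev]


# Hypothetical toy corpus, for illustration only.
corpus = ["language modeling is fun", "language modeling is hard"]
contexts, bigrams = train_bigram(corpus)

# "is" is followed by "fun" in one of its two occurrences.
p_fun = bigram_prob("fun", "is", contexts, bigrams)
print(p_fun)
```

An unseen bigram gets probability zero here, which is exactly the problem that smoothing and backoff in a real toolkit are meant to solve.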

## Building KenLM

The first step is to build KenLM. Then we build the ARPA file, which KenLM uses to evaluate text.

First, clone the KenLM repository:

In [2]:
# Clone over HTTPS; the SSH URL only works with a configured GitHub key.
!git clone https://github.com/kpu/kenlm.git

Cloning into 'kenlm'...


In [None]:
# CMake is needed to compile KenLM; run this in a terminal if sudo asks for a password.
!sudo snap install cmake

In [1]:
%pwd

'/home/levi/Speech-Course-Lab/Lab/Lab04 - Whisper and KenLM/4.2. KenLM'

In [1]:
import os
import kenlm

In [2]:
LM = './test.arpa'
model = kenlm.LanguageModel(LM)
print('{0}-gram model'.format(model.order))

5-gram model


Loading the LM will be faster if you build a binary file.
Reading /home/levi/Speech-Course-Lab/Lab/Lab04 - Whisper and KenLM/4.2. KenLM/test.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
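The loader message above suggests converting the ARPA file to a binary file. As a rough sketch, the helpers below assemble the command lines for KenLM's `lmplz` (estimate an ARPA model from text) and `build_binary` (convert ARPA to binary), following the usage shown in the KenLM README. The directory `kenlm/build/bin` assumes KenLM was compiled with CMake inside `kenlm/build`, and `corpus.txt` and `test.binary` are hypothetical file names; adjust them to your setup.

```python
import subprocess
from pathlib import Path

# Assumption: KenLM was compiled in kenlm/build, so its tools live here.
KENLM_BIN = Path("kenlm") / "build" / "bin"


def lmplz_command(order=5):
    """Command for lmplz, which reads text on stdin and writes ARPA to stdout."""
    return [str(KENLM_BIN / "lmplz"), "-o", str(order)]


def build_binary_command(arpa_path, binary_path):
    """Command for build_binary, which converts an ARPA file to a binary file."""
    return [str(KENLM_BIN / "build_binary"), str(arpa_path), str(binary_path)]


def build_arpa(corpus_path, arpa_path, order=5):
    """Estimate an n-gram model from a tokenized, one-sentence-per-line corpus."""
    with open(corpus_path, "rb") as src, open(arpa_path, "wb") as dst:
        subprocess.run(lmplz_command(order), stdin=src, stdout=dst, check=True)


def build_binary(arpa_path, binary_path):
    """Convert an existing ARPA file to KenLM's binary format."""
    subprocess.run(build_binary_command(arpa_path, binary_path), check=True)


# Nothing is executed at import time; call the helpers explicitly, e.g.:
# build_arpa("corpus.txt", "test.arpa", order=5)   # hypothetical corpus file
# build_binary("test.arpa", "test.binary")
```

A binary file produced this way can be passed to `kenlm.LanguageModel` in place of the ARPA path.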


In [3]:
sentence = 'language modeling is fun .'
print(sentence)
print(model.score(sentence))

language modeling is fun .
-64.59443664550781
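The score printed above is a log10 probability for the whole sentence, including the end-of-sentence token. As a sketch of how to read it, the helper below (`perplexity_from_log10`, a name introduced here) turns such a score into a per-token perplexity. The kenlm Python module also provides `model.perplexity(sentence)`, which should give the same value; this standalone version only assumes the score is a base-10 log probability and that tokens are separated by whitespace.

```python
def perplexity_from_log10(total_log10_prob, sentence):
    """Per-token perplexity from a sentence-level log10 probability."""
    # Count the words plus one for the end-of-sentence token </s>.
    n_tokens = len(sentence.split()) + 1
    return 10.0 ** (-total_log10_prob / n_tokens)


sentence = 'language modeling is fun .'
total = -64.59443664550781  # the score printed above
ppl = perplexity_from_log10(total, sentence)
print(ppl)
```

The very large perplexity reflects the out-of-vocabulary words in this sentence relative to the small test model.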


In [4]:
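def score(s):
    # Sum the per-token log10 probabilities returned by full_scores.
    return sum(prob for prob, _, _ in model.full_scores(s))

# The summed token scores should match model.score up to floating-point error.
assert abs(score(sentence) - model.score(sentence)) < 1e-3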

In [5]:
# Show scores and n-gram matches
words = ['<s>'] + sentence.split() + ['</s>']
for i, (prob, length, oov) in enumerate(model.full_scores(sentence)):
    print('{0} {1}: {2}'.format(prob, length, ' '.join(words[i+2-length:i+2])))
    if oov:
        print('\t"{0}" is an OOV'.format(words[i+1]))

# Find out-of-vocabulary words
for w in words:
    if w not in model:
        print('"{0}" is an OOV'.format(w))

-2.4106082916259766 1: language
	"language" is an OOV
-15.0 2: language modeling
	"modeling" is an OOV
-23.6878719329834 1: is
-2.2966649532318115 1: fun
	"fun" is an OOV
-21.139057159423828 1: .
-0.060235898941755295 2: . </s>
"language" is an OOV
"modeling" is an OOV
"fun" is an OOV
