# Introduction to KenLM

KenLM is an efficient, widely used toolkit for statistical n-gram language models (LMs) that scales to large text corpora. It estimates and queries probabilistic models over sequences of words, and it is often used in speech recognition, machine translation, and other natural language processing (NLP) tasks.

Here are some key details about KenLM:

- Efficient N-gram Language Modeling: KenLM is based on n-grams, where the probability of a word is predicted from the previous n−1 words. Because scoring reduces to looking up stored n-grams, queries are cheap compared with heavier models in some NLP tasks.

- Memory Optimization: It is optimized for memory usage and speed, which makes it practical for large-scale applications where the model can be several gigabytes in size and the training corpus can contain billions of words.

- C++ Implementation: KenLM is implemented in C++, which contributes to its speed and efficiency. It also offers Python bindings for easier integration in Python-based NLP pipelines.

- Probabilistic Querying: KenLM allows querying the probability of sequences of words, which is critical in applications like decoding speech recognition hypotheses or ranking candidate translations in machine translation systems.

- ARPA and Binary Formats: KenLM reads both the text-based ARPA format and its own binary format, which loads faster.
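
To make the n-gram idea from the list above concrete, here is a minimal sketch of unsmoothed maximum-likelihood bigram estimates on a two-sentence toy corpus. The function name `bigram_probabilities` and the corpus are illustrative assumptions; this is not how KenLM estimates its models, since it applies smoothing and stores backoff weights, but the conditional-probability idea is the same.

```python
from collections import Counter

def bigram_probabilities(sentences):
    """Unsmoothed estimates: P(word | prev) = count(prev, word) / count(prev)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Wrap each sentence in begin/end markers, as n-gram models usually do.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }

# Toy corpus (an assumption for illustration only).
probs = bigram_probabilities(["a little more", "a little dog"])
print(probs[("a", "little")])     # "little" follows "a" in both sentences
print(probs[("little", "more")])  # "more" follows "little" in one of two cases
```

With real data, unseen bigrams would get probability zero under this scheme, which is why practical toolkits add smoothing and backoff.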

# Install
The first step is to build KenLM. Clone the repository:

In [2]:
!git clone https://github.com/kpu/kenlm.git


You need CMake to build KenLM. Install it, then follow the build instructions from the GitHub repository:
```bash
sudo snap install cmake --classic  # the cmake snap requires classic confinement
cd kenlm                           # enter the cloned repository
mkdir -p build
cd build
cmake ..
make -j 4
```
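The CMake build produces the command-line tools (such as `lmplz` and `build_binary`) under `build/bin`. The Python module is installed separately; the KenLM README suggests `pip install https://github.com/kpu/kenlm/archive/master.zip`. Once the module is installed, you can import `kenlm` as follows.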

In [1]:
import kenlm

# Usage
The Python API loads an existing n-gram language model from a file, such as the .arpa file used here. ARPA is a standardized text format for n-gram models: it lists word sequences together with their log-probabilities. Each n-gram entry gives the likelihood of a word following a given sequence of previous words, which lets the language model score word sequences.

For example, consider the file test.arpa. This file contains metadata about the n-gram model, such as the number of n-grams of each order. In this case, we have 37 unigrams (1-grams), 47 bigrams (2-grams), and so on:
```
\data\
ngram 1=37
ngram 2=47
ngram 3=11
ngram 4=6
ngram 5=4
```
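
As a rough illustration of how this header can be read programmatically, the sketch below extracts the per-order counts from the `\data\` section shown above. The helper name `parse_arpa_counts` is hypothetical and not part of KenLM.

```python
import re

def parse_arpa_counts(header_text):
    """Return {order: number of n-grams} from the \\data\\ section of an ARPA file."""
    counts = {}
    for line in header_text.splitlines():
        # Header lines look like "ngram 3=11".
        match = re.fullmatch(r"ngram (\d+)=(\d+)", line.strip())
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts

# The header from test.arpa, as listed above.
header = "\n".join([
    "\\data\\",
    "ngram 1=37",
    "ngram 2=47",
    "ngram 3=11",
    "ngram 4=6",
    "ngram 5=4",
])

counts = parse_arpa_counts(header)
print(counts)
print("model order:", max(counts))
```

The highest order listed in the header is the order of the model, which matches what `model.order` reports once the file is loaded.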

Each n-gram entry follows the format `<log-probability> <n-gram sequence> <backoff weight>`, with fields separated by tabs and both numbers given as base-10 logarithms. When no backoff weight is present, only the log-probability and n-gram sequence are listed.
```
\3-grams:
-0.01916512	more . </s>
-0.0283603	on a little	-0.4771212
-0.0283603	screening a little	-0.4771212
-0.01660496	a little more	-0.09409451
```
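
The entry layout above can be sketched as a small parser. The function name `parse_arpa_entry` is hypothetical; the sketch assumes tab-separated fields and treats a missing backoff weight as 0.0 (that is, log10 of 1), which is the usual ARPA convention.

```python
def parse_arpa_entry(line):
    """Split one ARPA n-gram line into (log10 probability, words, log10 backoff)."""
    fields = line.rstrip("\n").split("\t")
    log_prob = float(fields[0])
    ngram = tuple(fields[1].split())
    # A missing backoff field is read as 0.0, i.e. no penalty.
    backoff = float(fields[2]) if len(fields) > 2 else 0.0
    return log_prob, ngram, backoff

# Two lines taken from the 3-gram section above.
with_backoff = parse_arpa_entry("-0.0283603\ton a little\t-0.4771212")
without_backoff = parse_arpa_entry("-0.01916512\tmore . </s>")

print(with_backoff)
print(without_backoff)
```

N-grams that end in `</s>` never need a backoff weight, because nothing can follow the end-of-sentence marker.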

## Create your own ARPA file
You can also build your own n-gram language model from a source text. For example, let's build one from WikiText-2, a commonly used dataset in natural language processing. The following repository provides scripts that build an ARPA file:

```
git clone git@github.com:daandouwe/ngram-lm.git
cd ngram-lm
mkdir data
./get-data.sh
mkdir arpa
./main.py --order 3 --interpolate --save-arpa --name wiki-interpolate
```



# Initialize model 

In [2]:
LM = './test.arpa'
model = kenlm.LanguageModel(LM)
print('{0}-gram model'.format(model.order))

5-gram model


Loading the LM will be faster if you build a binary file.
Reading /home/levi/Speech-Course-Lab/Lab/Lab04 - Whisper and KenLM/4.2. KenLM/test.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************


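`model.score(sentence)` returns the base-10 log-probability of the sentence under the loaded language model. By default the sentence is wrapped in begin- and end-of-sentence markers (`<s>` and `</s>`), so the end-of-sentence token contributes to the score as well.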
- A lower (more negative) score means the sentence is less likely or less common according to the model.
- A higher (less negative) score means the sentence is more likely or more common according to the model.
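
Because higher scores mean more likely sentences, a common use is to rank alternative hypotheses. The sketch below shows the pattern with a stand-in scorer so that it runs without a model file; `rank_candidates`, `toy_score`, and `TOY_VOCAB` are hypothetical names, and the toy penalties are made up. In practice you would pass `model.score` as `score_fn`.

```python
def rank_candidates(candidates, score_fn):
    """Sort candidate sentences from most to least likely under score_fn."""
    return sorted(candidates, key=score_fn, reverse=True)

# Stand-in for model.score: made-up penalties, not real KenLM output.
TOY_VOCAB = {"looking", "on", "a", "little", "more"}

def toy_score(sentence):
    # Known words cost -1.0 each, unknown words cost -5.0 each.
    return sum(-1.0 if word in TOY_VOCAB else -5.0 for word in sentence.split())

candidates = ["looking on a lidl", "looking on a little"]
ranked = rank_candidates(candidates, toy_score)
best = ranked[0]
print(best)
```

Note that log-probabilities are summed over tokens, so longer sentences tend to score lower; comparing candidates of very different lengths may call for length normalization.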

In [16]:
sentence = 'looking on a little dog'
print(sentence)
print(model.score(sentence))

looking on a little dog
-26.229387283325195
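
To relate this score to more familiar quantities, the sketch below converts it into a probability and a per-token perplexity. It assumes the score is a base-10 logarithm and that the token count is the number of words plus one for `</s>`; the helper names are hypothetical, and the score is copied from the output above rather than recomputed.

```python
def sentence_probability(log10_score):
    """Convert a base-10 log-probability into a plain probability."""
    return 10.0 ** log10_score

def sentence_perplexity(log10_score, sentence):
    """Per-token perplexity, counting the words plus the </s> token."""
    n_tokens = len(sentence.split()) + 1
    return 10.0 ** (-log10_score / n_tokens)

sentence = 'looking on a little dog'
score = -26.229387283325195  # value printed by model.score(sentence) above

probability = sentence_probability(score)
ppl = sentence_perplexity(score, sentence)
print(probability)
print(ppl)
```

Perplexity normalizes by length, which makes it easier to compare sentences with different numbers of words than the raw score.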


`model.full_scores(sentence)` provides more detail than `model.score(sentence)`. It yields one tuple per token, including the final `</s>`, with the following information:
- Log-probability (base 10) of the current word given the previous words.
- N-gram length: the order of the longest n-gram the model matched for this word. A length shorter than the available context means the model backed off to a lower order.
- OOV flag (boolean: True if the word is out of vocabulary, False otherwise).

In [17]:
for e in model.full_scores(sentence):
    print(e)

(-0.48465219140052795, 2, False)
(-0.3488368093967438, 3, False)
(-0.01552657037973404, 4, False)
(-0.0030612230766564608, 5, False)
(-4.347817420959473, 1, True)
(-21.02949333190918, 1, False)
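
As a quick consistency check, the per-token log-probabilities should add up to the sentence score reported earlier. The sketch below uses the tuples printed above as literal data, read as (log-probability, n-gram length, OOV flag), so it runs without a model file; a small tolerance is allowed for floating-point rounding.

```python
import math

# Tuples copied from the full_scores output above.
full_scores = [
    (-0.48465219140052795, 2, False),
    (-0.3488368093967438, 3, False),
    (-0.01552657037973404, 4, False),
    (-0.0030612230766564608, 5, False),
    (-4.347817420959473, 1, True),
    (-21.02949333190918, 1, False),
]
sentence_score = -26.229387283325195  # value printed by model.score(sentence)

# Summing log-probabilities corresponds to multiplying probabilities.
total = sum(log_prob for log_prob, _, _ in full_scores)
print(total)
print(math.isclose(total, sentence_score, abs_tol=1e-5))

# Positions whose third field is True are out-of-vocabulary tokens.
oov_positions = [i for i, (_, _, oov) in enumerate(full_scores) if oov]
print(oov_positions)
```

Most of the penalty comes from the last two tokens: the unknown word and the `</s>` that follows it, both of which were matched only at length 1.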


In [21]:
# Show scores and n-gram matches
words = ['<s>'] + sentence.split() + ['</s>']
for i, (prob, length, oov) in enumerate(model.full_scores(sentence)):
    print('{0} {1}: {2}'.format(prob, length, ' '.join(words[i+2-length:i+2])))
    if oov:
        print('\t"{0}" is an OOV'.format(words[i+1]))

# Find out-of-vocabulary words
for w in words:
    if w not in model:
        print('"{0}" is an OOV'.format(w))

-0.48465219140052795 2: <s> looking
-0.3488368093967438 3: <s> looking on
-0.01552657037973404 4: <s> looking on a
-0.0030612230766564608 5: <s> looking on a little
-4.347817420959473 1: dog
	"dog" is an OOV
-21.02949333190918 1: </s>
"dog" is an OOV
