# KenLM

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/kenlm](https://github.com/huseinzol05/Malaya/tree/master/example/kenlm).
    
</div>

KenLM is a very fast and accurate n-gram language model that does not use neural networks, https://github.com/kpu/kenlm

In [1]:
import malaya

### Dependency

Make sure you have already installed the KenLM Python wrapper,

```bash
pip3 install pypi-kenlm==0.1.20210121
```

This package is a simple Python wrapper for the original KenLM, https://github.com/kpu/kenlm

### List available KenLM models

In [2]:
malaya.kenlm.available_models()

| Model | Size (MB) | LM order | Description | Command |
|---|---|---|---|---|
| bahasa-news | 24.0 | 3 | Gathered from malaya-speech bahasa ASR transcr... | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
| bahasa-combined | 29.0 | 3 | Gathered from malaya-speech ASR bahasa transcr... | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
| redape-community | 887.1 | 4 | Mirror for https://github.com/redapesolutions/... | [./lmplz --text text.txt --arpa out.arpa -o 4 ... |
| dump-combined | 310.0 | 3 | Academia + News + IIUM + Parliament + Watpadd ... | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |


### Load KenLM model

```python
def load(
    model: str = 'dump-combined', **kwargs
):
    """
    Load KenLM language model.

    Parameters
    ----------
    model : str, optional (default='dump-combined')
        Model architecture supported. Allowed values:

        * ``'bahasa-news'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences).
        * ``'bahasa-combined'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences) + Bahasa Wikipedia (Random sample 150k sentences).
        * ``'redape-community'`` - Mirror for https://github.com/redapesolutions/suara-kami-community
        * ``'dump-combined'`` - Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl + training set from https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt.

    Returns
    -------
    result : kenlm.Model class
    """
```

In [3]:
model = malaya.kenlm.load()

In [4]:
model.score('saya suke awak')

-11.912322044372559

In [5]:
model.score('saya suka awak')

-6.80517053604126

In [6]:
model.score('najib razak')

-5.256608009338379

In [7]:
model.score('najib comel')

-10.580080032348633
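
KenLM's `score` returns a log10 probability for the whole sentence, so values closer to zero mean the model finds the sentence more likely. This is why `'saya suka awak'` scores higher than the misspelled `'saya suke awak'`. A minimal sketch of using this to rank candidate sentences is shown below. The helper `rank_candidates` and the stub scorer are hypothetical names for illustration, not part of Malaya or KenLM; with a loaded model you would pass `model.score` as `score_fn`.

```python
from typing import Callable, List, Tuple


def rank_candidates(
    candidates: List[str], score_fn: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Return (sentence, score) pairs sorted from most to least likely."""
    scored = [(sentence, score_fn(sentence)) for sentence in candidates]
    # Higher log10 probability (closer to zero) means more likely.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stub scorer reusing the scores printed above, so the sketch runs without a model.
# With a loaded model, use `score_fn=model.score` instead.
example_scores = {
    'saya suke awak': -11.912322044372559,
    'saya suka awak': -6.80517053604126,
}

ranked = rank_candidates(list(example_scores), example_scores.__getitem__)
print(ranked[0][0])  # saya suka awak
```

Because the scores are sums of log probabilities, longer sentences tend to score lower, so this kind of ranking is most meaningful between candidates of similar length.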

### Build custom Language Model

1. Build KenLM from source,

```bash
wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
```

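2. Prepare a text file with one sentence per line, then build the ARPA model with `lmplz` and convert it to a binary trie with `build_binary`. Feel free to use some of the text from https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping,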

```bash
kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1
kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm
```

3. Once you have `out.trie.klm`, you can load it with the scorer interface,

```python
import kenlm
model = kenlm.Model('out.trie.klm')
```