# **Boosting Wav2Vec2 with n-grams**

We install `datasets` and `transformers`, as well as `pyctcdecode` and the Python bindings for `KenLM`, to be able to run the language model integration.



In [None]:
!pip3 install https://github.com/kpu/kenlm/archive/master.zip
!pip3 install -r requirements.txt
!pip3 install kenlm

Collecting https://github.com/kpu/kenlm/archive/master.zip
  Downloading https://github.com/kpu/kenlm/archive/master.zip (553 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 553.6/553.6 kB 2.1 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: kenlm
  Building wheel for kenlm (pyproject.toml) ... done
  Created wheel for kenlm: filename=kenlm-0.2.0-cp310-cp310-linux_x86_64.whl size=3184343 sha256=87c938a27e29cba6df9634d002feb6a4d594aaef5bffb149ea149c1a5883aa90
  Stored in directory: /tmp/pip-ephem-wheel-cache-ogmjp85s/wheels/a5/73/ee/670fbd0cee8f6f0b21d10987cb042291e662e26e1a07026462
Successfully built kenlm
Installing collected packages: kenlm
Successfully installed kenlm-0.2.0
Collecting datasets (from -r requirements.txt (line 1))
  Downloading datase



## **1. Set Git config vars and log in to the HF Hub**

In [None]:
# Ignore for now ...

# !git config --global user.name "lucas-meyer"
# !git config --global user.email "lucas.meyer77@gmail.com"

In [None]:
from huggingface_hub import login
from utils import WRITE_ACCESS_TOKEN

login(WRITE_ACCESS_TOKEN)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## **2. Build an *n-gram* with KenLM**

We will use the popular [KenLM library](https://github.com/kpu/kenlm) to do so. Let's start by installing the Ubuntu library prerequisites:

In [None]:
!sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
libboost-program-options-dev is already the newest version (1.74.0.3ubuntu7).
libboost-program-options-dev set to manually installed.
libboost-system-dev is already the newest version (1.74.0.3ubuntu7).
libboost-system-dev set to manually installed.
libboost-thread-dev is already the newest version (1.74.0.3ubuntu7).
libboost-thread-dev set to manually installed.
libbz2-dev is already the newest version (1.0.8-5build1).
libbz2-dev set to manually installed.
liblzma-dev is already the newest version (5.2.5-2ubuntu1).
liblzma-dev set to manually installed.
libboost-test-dev is already the newest version (1.74.0.3ubuntu7).
libboost-test-dev set to manually installed.
cmake is already the newest version (3.22.1-1ubuntu1.22.04.1).
zlib1g-dev is already the newest version (1:1.2.11.dfsg-2ubuntu9.2).
zlib1g-dev set to manually installed.

Next, we download and unpack the KenLM repo.

In [None]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

--2023-09-20 10:36:03--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 35.196.63.85
Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491888 (480K) [application/x-gzip]
Saving to: ‘STDOUT’


2023-09-20 10:36:05 (425 KB/s) - written to stdout [491888/491888]



KenLM is written in C++, so we'll make use of `cmake` to build the binaries.

In [None]:
!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.74.0/BoostConfig.cmake (found suitable version "1.74.0", minimum required is "1.41.0") found components: program_options system thread unit_test_framework 
-- Found Threads: TRUE  
-- Found ZLIB: /usr

Great, the executables have been built successfully under `kenlm/build/bin/`.

By default, KenLM computes an *n-gram* with [Kneser-Ney smoothing](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing). All text data used to create the *n-gram* is expected to be stored in a text file.
We read our training and validation transcripts, clean them, and save the result as a single `.txt` file.

In [None]:
from utils import remove_special_characters

# Write the cleaned training and validation transcripts to one text file,
# separated by spaces, as input for KenLM.
with open("text.txt", "w", encoding="utf-8") as txt_file:
    for path in ("train.af.txt", "val.af.txt"):
        with open(path, "r", encoding="utf-8") as in_file:
            for line in in_file:
                txt_file.write(remove_special_characters(line.strip()))
                txt_file.write(" ")
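
The `remove_special_characters` helper lives in `utils` and is not shown in this notebook. The following is only a minimal sketch of what such a helper might look like, assuming it lowercases the text and strips common punctuation. The function name `remove_special_characters_sketch`, the regular expression, and the exact character set are assumptions, not the notebook's actual implementation.

```python
import re

# Hypothetical set of characters to drop; the real helper in utils may differ.
CHARS_TO_REMOVE = re.compile(r'[,?.!;:"“”‘%]')


def remove_special_characters_sketch(text: str) -> str:
    """Drop punctuation, lowercase the text and collapse whitespace."""
    text = CHARS_TO_REMOVE.sub("", text).lower()
    return re.sub(r"\s+", " ", text).strip()


print(remove_special_characters_sketch("Afrika is die wêreld se tweede grootste kontinent (na Asië)."))
```

Normalizing the LM training text the same way as the transcripts used to fine-tune the acoustic model keeps the *n-gram* vocabulary aligned with what the decoder can actually emit.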

Now, we just have to run KenLM's `lmplz` command to build our *n-gram*, called `"5gram.arpa"`. As is relatively common in speech recognition, we build a *5-gram* by passing the `-o 5` parameter.

In [None]:
!kenlm/build/bin/lmplz -o 5 <"text.txt" > "5gram.arpa"

=== 1/5 Counting and sorting n-grams ===
Reading /content/text.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 126607 types 18966
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:227592 2:1062479680 3:1992149504 4:3187439104 5:4648348672
Statistics:
1 18965 D1=0.701236 D2=1.07126 D3+=1.38437
2 74866 D1=0.84063 D2=1.22174 D3+=1.3743
3 111354 D1=0.934183 D2=1.28639 D3+=1.66123
4 121849 D1=0.978675 D2=1.52225 D3+=1.25401
5 124429 D1=0.987787 D2=1.63104 D3+=0.462948
Memory estimate for binary LM:
type       kB
probing  9889 assuming -p 1.5
probing 11768 assuming -r models -p 1.5
trie     4715 without quantization
trie     2606 assuming -q 8 -b 8 quantization 
trie     4345 assuming -a 22 array pointer compression
trie     2236 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 
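
The first step reported in the log, counting *n-grams*, can be illustrated in plain Python. This is a simplified sketch on a toy sentence: it only collects raw counts, whereas `lmplz` also handles sentence boundaries and applies smoothing. The helper name `count_ngrams` and the example sentence are made up for illustration.

```python
from collections import Counter


def count_ngrams(tokens, max_order):
    """Return {order: Counter of n-gram tuples} for orders 1..max_order."""
    counts = {}
    for order in range(1, max_order + 1):
        counts[order] = Counter(
            tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)
        )
    return counts


tokens = "die kat sit op die mat en die kat slaap".split()
counts = count_ngrams(tokens, max_order=3)

print("unigram tokens:", sum(counts[1].values()))  # total number of words
print("unigram types:", len(counts[1]))            # number of distinct words
print("most common bigram:", counts[2].most_common(1))
```

The distinction between tokens (all occurrences) and types (distinct entries) is the same one `lmplz` reports in its `Unigram tokens ... types ...` line.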

Great, we have built a *5-gram* LM! Let's inspect the first couple of lines.

In [None]:
!head -20 5gram.arpa

\data\
ngram 1=18965
ngram 2=74866
ngram 3=111354
ngram 4=121849
ngram 5=124429

\1-grams:
-4.921278	<unk>	0
0	<s>	-0.075394996
-3.5029752	afrika	-0.18277334
-1.6681302	is	-0.56963915
-1.6041853	die	-0.33530608
-3.7125254	wêreld	-0.13948235
-2.3630524	se	-0.1724671
-3.852555	tweede	-0.09405188
-3.6835868	grootste	-0.11331307
-4.2198124	kontinent	-0.075394996
-4.328598	(na	-0.075394996
-4.47408	asië)	-0.075394996
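
Each entry in an ARPA file is tab-separated: a base-10 log-probability, the *n-gram* itself, and (except for the highest order) a base-10 log backoff weight. A small sketch of how one such line could be read; the function name `parse_arpa_entry` is hypothetical.

```python
def parse_arpa_entry(line):
    """Split one ARPA n-gram line into (probability, n-gram, log10 backoff)."""
    fields = line.rstrip("\n").split("\t")
    log10_prob = float(fields[0])
    ngram = fields[1]
    # The highest-order n-grams carry no backoff column.
    log10_backoff = float(fields[2]) if len(fields) > 2 else None
    return 10 ** log10_prob, ngram, log10_backoff


prob, word, backoff = parse_arpa_entry("-1.6041853\tdie\t-0.33530608")
print(f"P({word}) = {prob:.4f}, log10 backoff = {backoff}")
```

Converting the log-probability back with `10 ** x` makes the numbers easier to interpret: frequent words such as `die` and `is` have much higher unigram probabilities than rare ones such as `kontinent`.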


There is a small problem that 🤗 Transformers will not be happy about later on.
The *5-gram* correctly includes an "Unknown" token, `<unk>`, as well as a *begin-of-sentence* token, `<s>`, but no *end-of-sentence* token, `</s>`.
Unfortunately, this currently has to be corrected after the build.

We can add the *end-of-sentence* token by copying the `<s>` line, replacing `<s>` with `</s>` in the copy, writing it directly below the *begin-of-sentence* token, and increasing the `ngram 1` count by 1. The file holds roughly 450,000 *n-gram* entries, so rewriting it is quick.

In [None]:
with open("5gram.arpa", "r", encoding="utf-8") as read_file, open("5gram_correct.arpa", "w", encoding="utf-8") as write_file:
    has_added_eos = False
    for line in read_file:
        if not has_added_eos and "ngram 1=" in line:
            # Bump the unigram count in the header to account for </s>.
            count = line.strip().split("=")[-1]
            write_file.write(f"ngram 1={int(count) + 1}\n")
        elif not has_added_eos and "<s>" in line:
            # Copy the <s> entry and reuse its values for the new </s> entry.
            write_file.write(line)
            write_file.write(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            write_file.write(line)

Let's now inspect the corrected *5-gram*.

In [None]:
!head -20 5gram_correct.arpa

\data\
ngram 1=18966
ngram 2=74866
ngram 3=111354
ngram 4=121849
ngram 5=124429

\1-grams:
-4.921278	<unk>	0
0	<s>	-0.075394996
0	</s>	-0.075394996
-3.5029752	afrika	-0.18277334
-1.6681302	is	-0.56963915
-1.6041853	die	-0.33530608
-3.7125254	wêreld	-0.13948235
-2.3630524	se	-0.1724671
-3.852555	tweede	-0.09405188
-3.6835868	grootste	-0.11331307
-4.2198124	kontinent	-0.075394996
-4.328598	(na	-0.075394996
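
Beyond eyeballing the first lines, the fix can also be checked programmatically. The sketch below, with the hypothetical helper `check_arpa_unigrams`, compares the unigram count declared in the header with the number of entries actually listed in the `\1-grams:` section and reports whether `</s>` is present. It is demonstrated on a tiny in-memory sample rather than the real file.

```python
def check_arpa_unigrams(lines):
    """Return (declared_count, actual_count, has_eos) for the 1-gram section."""
    declared, actual, has_eos = None, 0, False
    in_unigrams = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("ngram 1="):
            declared = int(line.split("=")[1])
        elif line == "\\1-grams:":
            in_unigrams = True
        elif in_unigrams:
            if line.startswith("\\"):  # next section, e.g. \2-grams: or \end\
                break
            if line:  # skip the blank line that closes the section
                actual += 1
                has_eos = has_eos or line.split("\t")[1] == "</s>"
    return declared, actual, has_eos


sample = [
    "\\data\\\n",
    "ngram 1=3\n",
    "\n",
    "\\1-grams:\n",
    "-4.921278\t<unk>\t0\n",
    "0\t<s>\t-0.075394996\n",
    "0\t</s>\t-0.075394996\n",
    "\n",
    "\\end\\\n",
]
print(check_arpa_unigrams(sample))
```

To run the same check on the real model, the open file handle can be passed in directly, for example `check_arpa_unigrams(open("5gram_correct.arpa", encoding="utf-8"))`; the declared and actual counts should match and the last value should be `True`.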


Great, this looks better! We're done at this point and all that is left to do is to correctly integrate the `"ngram"` with [`pyctcdecode`](https://github.com/kensho-technologies/pyctcdecode) and 🤗 Transformers.

## **3. Combine an *n-gram* with Wav2Vec2**

In a final step, we want to wrap the *5-gram* into a `Wav2Vec2ProcessorWithLM` object to make *5-gram*-boosted decoding seamless.
We start by downloading the currently "LM-less" processor of [`wav2vec2-xls-r-300m-with-LM-asr_af`](https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af).

In [None]:
from transformers import AutoProcessor

user_name = "lucas-meyer"
repo_name = "wav2vec2-xls-r-300m-with-LM-asr_af"

processor = AutoProcessor.from_pretrained(f"{user_name}/{repo_name}")

Downloading (…)rocessor_config.json:   0%|          | 0.00/214 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/328 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/2.09k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/619 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/51.0 [00:00<?, ?B/s]

Next, we extract the vocabulary of its tokenizer, as it represents the `"labels"` of `pyctcdecode`'s `BeamSearchDecoderCTC` class.

In [None]:
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
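
The sorting step matters because the decoder matches labels to the model's output by position: the label at list index `i` has to be the token whose id is `i`. A toy illustration with a made-up four-token vocabulary:

```python
# Toy vocabulary; a real Wav2Vec2 tokenizer has one entry per character
# plus special tokens such as the padding token.
vocab_dict = {"b": 2, "[PAD]": 3, "a": 1, "|": 0}

# Sort by token id so that list position i corresponds to logit column i.
sorted_vocab_dict = {
    token.lower(): idx
    for token, idx in sorted(vocab_dict.items(), key=lambda item: item[1])
}
labels = list(sorted_vocab_dict.keys())
print(labels)
```

After sorting, every label sits at the index given by its token id, regardless of the order in which the tokenizer returned the dictionary.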

The `"labels"` and the previously built `5gram_correct.arpa` file are all that's needed to build the decoder.

In [None]:
from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path="5gram_correct.arpa",
)



Any warning printed here can safely be ignored. All that is left to do now is to wrap the newly created `decoder`, together with the processor's `tokenizer` and `feature_extractor`, into a `Wav2Vec2ProcessorWithLM` object.

In [None]:
from transformers import Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)
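
To get an intuition for why attaching a language model changes the transcription, it helps to look at how a beam hypothesis can be scored. The general idea is to add a weighted LM log-probability and a word-count bonus to the acoustic score. `pyctcdecode` exposes weights named `alpha` and `beta` for this purpose, but the function below is a simplified illustration and not the library's actual implementation; all scores and weight values are made up.

```python
def fused_score(acoustic_logp, lm_logp, n_words, alpha, beta):
    """Combine acoustic and LM log-probabilities for one beam hypothesis."""
    return acoustic_logp + alpha * lm_logp + beta * n_words


# Hypothetical scores for two candidate transcriptions of the same audio.
candidates = {
    "die kat sit": {"acoustic": -4.2, "lm": -6.0},
    "di kat sit": {"acoustic": -4.0, "lm": -11.0},
}

alpha, beta = 0.5, 1.5  # illustrative weights
ranked = sorted(
    candidates,
    key=lambda text: fused_score(
        candidates[text]["acoustic"],
        candidates[text]["lm"],
        len(text.split()),
        alpha,
        beta,
    ),
    reverse=True,
)
print(ranked[0])
```

In this toy case the acoustic model alone slightly prefers the misspelled candidate, while the LM term penalizes the unlikely word sequence enough to flip the ranking.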

We want to upload the LM-boosted processor directly into the model folder of [`wav2vec2-xls-r-300m-with-LM-asr_af`](https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af) to have all relevant files in one place.

Let's clone the repo, add the new decoder files and upload them afterward.
First, we need to install `git-lfs`.

In [None]:
!sudo apt-get install git-lfs tree

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 18 not upgraded.
Need to get 47.9 kB of archives.
After this operation, 116 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tree amd64 2.0.2-1 [47.9 kB]
Fetched 47.9 kB in 1s (42.5 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package tree.
(Reading 

Cloning and uploading model files can be done conveniently with `huggingface_hub`'s `Repository` class.

For more information on how to use `huggingface_hub` to upload files, please take a look at the [official docs](https://huggingface.co/docs/hub/how-to-upstream).

In [None]:
from huggingface_hub import Repository

repo = Repository(local_dir=f"{repo_name}", clone_from=f"{user_name}/{repo_name}")

Cloning https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af into local empty directory.


Download file pytorch_model.bin:   0%|          | 1.40k/1.18G [00:00<?, ?B/s]

Download file training_args.bin: 100%|##########| 3.93k/3.93k [00:00<?, ?B/s]

Clean file training_args.bin:  25%|##5       | 1.00k/3.93k [00:00<?, ?B/s]

Clean file pytorch_model.bin:   0%|          | 1.00k/1.18G [00:00<?, ?B/s]

Having cloned the repository, let's save the new processor with LM into it.

In [None]:
processor_with_lm.save_pretrained(f"{repo_name}")

Let's inspect the local repository. Conveniently, the `tree` command can also show the size of the different files.

In [None]:
!tree -h {repo_name}/

[4.0K]  wav2vec2-xls-r-300m-with-LM-asr_af/
├── [ 373]  alphabet.json
├── [2.0K]  config.json
├── [4.0K]  language_model
│   ├── [ 19M]  5gram_correct.arpa
│   ├── [  78]  attrs.json
│   └── [181K]  unigrams.txt
├── [ 262]  preprocessor_config.json
├── [1.2G]  pytorch_model.bin
├── [2.2K]  README.md
├── [  51]  special_tokens_map.json
├── [ 399]  tokenizer_config.json
├── [3.9K]  training_args.bin
└── [ 619]  vocab.json

1 directory, 12 files


As can be seen, the *5-gram* LM in `.arpa` format takes up 19 MB.
To reduce the size of the *n-gram* and make loading faster, `KenLM` allows converting `.arpa` files to binary ones using the `build_binary` executable.

Let's make use of it here.

In [None]:
# Convert the .arpa file into a binary file using the build_binary executable
!kenlm/build/bin/build_binary {repo_name}/language_model/5gram_correct.arpa {repo_name}/language_model/5gram.bin

Reading wav2vec2-xls-r-300m-with-LM-asr_af/language_model/5gram_correct.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


Great, it worked! Let's remove the `.arpa` file and check the size of the binary *5-gram* LM.

In [None]:
# Remove .arpa file and view the size of repo
!rm {repo_name}/language_model/5gram_correct.arpa && tree -h {repo_name}/

[4.0K]  wav2vec2-xls-r-300m-with-LM-asr_af/
├── [ 373]  alphabet.json
├── [2.0K]  config.json
├── [4.0K]  language_model
│   ├── [9.8M]  5gram.bin
│   ├── [  78]  attrs.json
│   └── [181K]  unigrams.txt
├── [ 262]  preprocessor_config.json
├── [1.2G]  pytorch_model.bin
├── [2.2K]  README.md
├── [  51]  special_tokens_map.json
├── [ 399]  tokenizer_config.json
├── [3.9K]  training_args.bin
└── [ 619]  vocab.json

1 directory, 12 files
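
The saving from the binary conversion can be quantified from the two `tree` listings. A small sketch with two hypothetical helpers, `human_size` and `reduction`, using the sizes reported above (19M for the `.arpa` file, 9.8M for the binary):

```python
def human_size(num_bytes):
    """Format a byte count with binary prefixes, e.g. 19 * 2**20 -> '19.0M'."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024


def reduction(before, after):
    """Fraction of the original size that was saved."""
    return 1 - after / before


arpa_bytes = 19 * 2**20        # size of 5gram_correct.arpa, from the listing
binary_bytes = int(9.8 * 2**20)  # size of 5gram.bin, from the listing

print(human_size(arpa_bytes), "->", human_size(binary_bytes))
print(f"saved: {reduction(arpa_bytes, binary_bytes):.0%}")
```

With real files, the byte counts could be read with `os.path.getsize` instead, provided the `.arpa` file is measured before it is removed.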


In [None]:
# Push all the files to hub
repo.push_to_hub(commit_message="Upload lm-boosted decoder")

Upload file language_model/5gram.bin:   0%|          | 32.0k/9.83M [00:00<?, ?B/s]

To https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af
   6c4c5fd..ff9d129  main -> main




'https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af/commit/ff9d129c81002549e3a9d2b9c7ea1ebd69ca34b0'