# **Boosting Wav2Vec2 with n-grams**

We install `datasets` and `transformers` as well as `pyctcdecode` and `kenLM`'s Python bindings to be able to run the language model integration.



In [1]:
!pip3 install https://github.com/kpu/kenlm/archive/master.zip
!pip3 install -r requirements.txt
!pip3 install kenlm

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple
Collecting https://github.com/kpu/kenlm/archive/master.zip
ERROR: Could not install packages due to an OSError: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /kpu/kenlm/archive/master.zip (Caused by NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f49be4a2440>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple


DEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of omegaconf or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple


DEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of omegaconf or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063

## **1. Log-in to HF hub**

In [12]:
from huggingface_hub import login
from utils import WRITE_ACCESS_TOKEN

login(WRITE_ACCESS_TOKEN)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/kiff/.cache/huggingface/token
Login successful


## **2. Build an *n-gram* with KenLM**

Build an **n-gram** language model with **[Kneser-Ney smoothing](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing)**.
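
For orientation, the bigram case of interpolated Kneser-Ney smoothing is commonly written as shown below. This is the textbook form with a single discount $D$ and is only a sketch: `lmplz` estimates a modified variant with three discounts per order, which is what the `D1`, `D2` and `D3+` columns in its log output in Section 2.5 report. Here $c(\cdot)$ denotes a count in the training text.

$$
P_{\mathrm{KN}}(w_i \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1} w_i) - D,\ 0\bigr)}{\sum_{w} c(w_{i-1} w)} + \lambda(w_{i-1})\, P_{\mathrm{cont}}(w_i)
$$

$$
\lambda(w_{i-1}) = \frac{D \cdot \lvert\{w : c(w_{i-1} w) > 0\}\rvert}{\sum_{w} c(w_{i-1} w)}, \qquad
P_{\mathrm{cont}}(w_i) = \frac{\lvert\{w' : c(w' w_i) > 0\}\rvert}{\lvert\{(w', w'') : c(w' w'') > 0\}\rvert}
$$

The first term discounts every observed bigram, and the second redistributes the freed probability mass according to how many distinct contexts a word has been seen in.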

### 2.1 Install KenLM dependencies

In [4]:
!sudo apt-get update
!sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

[sudo] password for kiff: 


### 2.2 Download KenLM code

In [6]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

--2023-10-05 19:45:26--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 35.196.63.85
Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491888 (480K) [application/x-gzip]
Saving to: ‘STDOUT’


2023-10-05 19:45:29 (320 KB/s) - written to stdout [491888/491888]



### 2.3 Build KenLM

In [7]:
!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.74.0/BoostConfig.cmake (found suitable version "1.74.0", minimum required is "1.41.0") found components: program_options system thread unit_test_framework 
-- Found Threads: TRUE  
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11") 
-- Found BZip2: /usr/lib/x86_64-linux-gnu/libbz2.so (found version "1.0.8") 
-- Looking for BZ2_bzCompressInit
-- Looking for BZ2_bzCompressInit - found
-- Looking for lzma_auto_decod

[ 76%] Built target filter
[ 77%] Building CXX object lm/filter/CMakeFiles/phrase_table_vocab.dir/phrase_table_vocab_main.cc.o
[ 78%] Linking CXX static library ../../lib/libkenlm_builder.a
[ 78%] Built target kenlm_builder
[ 79%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/backoff_reunification.cc.o
[ 80%] Linking CXX executable ../../bin/phrase_table_vocab
[ 80%] Built target phrase_table_vocab
[ 81%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/bounded_sequence_encoding.cc.o
[ 82%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/merge_probabilities.cc.o
[ 83%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/merge_vocab.cc.o
[ 84%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/normalize.cc.o
[ 85%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/pipeline.cc.o
[ 86%] [3

### 2.4 Concatenate LM data

In [1]:
import re
import os

from datasets import load_dataset
from utils import remove_special_characters

language = "af"
language_data_files = [
    f"train.{language}.txt",
    f"val.{language}.txt"
]

asr_af = load_dataset("lucas-meyer/asr_af")
asr_af = asr_af["train"]

with open("concatenated_lm_data.txt", "w") as txt_file:
    # Add WikiMedia data
#     for data_file_name in language_data_files:
#         data_path = os.path.join("language_model_data", data_file_name)
#         with open(data_path, "r") as data_file:
#             for line in data_file.readlines():
#                 txt_file.write(remove_special_characters(line.strip()))
#                 txt_file.write(" ")
    
    # Add asr_af["train"] transcription data
    for data_entry in asr_af:
        line = data_entry["transcription"]
        txt_file.write(remove_special_characters(line.strip()))
        txt_file.write(" ")
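
Before handing the file to KenLM, it can help to sanity-check its size. The sketch below counts whitespace-separated tokens and distinct word types; `corpus_stats` is a hypothetical helper name, not part of `utils`. The numbers should land close to the `Unigram tokens ... types ...` line that `lmplz` prints, although an exact match is not guaranteed because KenLM also counts its special tokens.

```python
from collections import Counter


def corpus_stats(path):
    """Return (token_count, type_count) for a whitespace-tokenized text file.

    Hypothetical helper for illustration; it streams the file line by line
    so it does not need to hold the whole corpus in memory.
    """
    counts = Counter()
    with open(path, "r", encoding="utf-8") as text_file:
        for line in text_file:
            counts.update(line.split())
    return sum(counts.values()), len(counts)
```

Calling `corpus_stats("concatenated_lm_data.txt")` after the cell above returns the two counts as a tuple.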

### 2.5 Use concatenated LM data to build n-gram model

Build **n-gram** model with KenLM's `lmplz` command. We build an **n-gram** by passing the `-o n` parameter.

In [18]:
!kenlm/build/bin/lmplz -o 20 < "concatenated_lm_data.txt" > "20gram.arpa"

=== 1/5 Counting and sorting n-grams ===
Reading /home/kiff/Desktop/Speech-Recognition-Afrikaans-isiXhosa/src/concatenated_lm_data.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 171156 types 24299
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:291588 2:32394674 3:60740016 4:97184024 5:141726704 6:194368048 7:255108064 8:323946752 9:400884096 10:485920128 11:579054784 12:680288192 13:789620160 14:907050944 15:1032580224 16:1166208256 17:1307934976 18:1457760256 19:1615684352 20:1781707136
Statistics:
1 24298 D1=0.706689 D2=1.07291 D3+=1.40799
2 98572 D1=0.844393 D2=1.20538 D3+=1.43815
3 145901 D1=0.935629 D2=1.31635 D3+=1.59696
4 159490 D1=0.978046 D2=1.51264 D3+=1.50456
5 163175 D1=0.990547 D2=1.67366 D3+=1.41512
6 164694 D1=0.994571 D2=1.72694 D3+=2.02969
7 165638 D1=0.996118 D2=1.76798 D3+=2.0

Great, we have built a *20-gram* LM! Let's inspect the first 20 lines.

In [19]:
!head -20 20gram.arpa

\data\
ngram 1=24298
ngram 2=98572
ngram 3=145901
ngram 4=159490
ngram 5=163175
ngram 6=164694
ngram 7=165638
ngram 8=166340
ngram 9=166938
ngram 10=167490
ngram 11=167968
ngram 12=168374
ngram 13=168712
ngram 14=168993
ngram 15=169249
ngram 16=169470
ngram 17=169676
ngram 18=169862
ngram 19=170021
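
Each `ngram k=` line in the header gives the number of distinct k-grams stored for that order. The toy sketch below shows what that means for a plain token list; `count_ngram_types` is a hypothetical helper, and it ignores the sentence-boundary padding that KenLM applies, so it will not reproduce the header numbers exactly.

```python
def count_ngram_types(tokens, max_order):
    """Map each order k in 1..max_order to the number of distinct k-grams."""
    counts = {}
    for k in range(1, max_order + 1):
        # Slide a window of length k over the tokens and keep unique tuples.
        ngrams = {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}
        counts[k] = len(ngrams)
    return counts


tokens = "die kat sit op die mat".split()
toy_counts = count_ngram_types(tokens, 3)
print(toy_counts)  # {1: 5, 2: 5, 3: 4}
```

On a small corpus, almost every window of high order is unique, which is consistent with the header counts above levelling off near the 171,156 tokens reported by `lmplz`.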


There is a small problem that 🤗 Transformers will not be happy about later on.
The *20-gram* correctly includes an "unknown" token, `<unk>`, as well as a *begin-of-sentence* token, `<s>`, but no *end-of-sentence* token, `</s>`.
Unfortunately, this currently has to be corrected after the build.

We can add the *end-of-sentence* token by duplicating the `<s>` line of the unigram section with `<s>` replaced by `</s>`, and increasing the `ngram 1` count by 1. The script below streams the file line by line, so it also works for large `.arpa` files.

In [20]:
with open("20gram.arpa", "r") as read_file, open("20gram_correct.arpa", "w") as write_file:
    has_added_eos = False
    for line in read_file:
        if not has_added_eos and "ngram 1=" in line:
            count=line.strip().split("=")[-1]
            write_file.write(line.replace(f"{count}", f"{int(count)+1}"))
        elif not has_added_eos and "<s>" in line:
            write_file.write(line)
            write_file.write(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            write_file.write(line)
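
To confirm that the fix took effect, one option is to compare the declared `ngram 1=` count with the entries actually listed under `\1-grams:`. The sketch below does that for any iterable of lines; `check_arpa_unigrams` is a hypothetical helper and assumes the usual ARPA layout in which each unigram line reads "log-probability, token, optional backoff".

```python
def check_arpa_unigrams(lines):
    """Summarize the unigram section of an ARPA file given as an iterable of lines."""
    declared = None
    tokens = []
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if declared is None and line.startswith("ngram 1="):
            declared = int(line.split("=")[1])
        elif line == "\\1-grams:":
            in_unigrams = True
        elif in_unigrams:
            if line.startswith("\\"):
                # Reached the next section (for example \2-grams: or \end\).
                break
            if line:
                # Second whitespace-separated field is the token itself.
                tokens.append(line.split()[1])
    return {
        "declared": declared,
        "listed": len(tokens),
        "has_bos": "<s>" in tokens,
        "has_eos": "</s>" in tokens,
    }
```

Passing an open file handle, for example `check_arpa_unigrams(open("20gram_correct.arpa"))`, stops reading at the start of the bigram section. For a consistent file, `declared` equals `listed` and both flags are `True`.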

Let's now inspect the corrected *20-gram*.

In [21]:
!head -20 20gram_correct.arpa

\data\
ngram 1=24299
ngram 2=98572
ngram 3=145901
ngram 4=159490
ngram 5=163175
ngram 6=164694
ngram 7=165638
ngram 8=166340
ngram 9=166938
ngram 10=167490
ngram 11=167968
ngram 12=168374
ngram 13=168712
ngram 14=168993
ngram 15=169249
ngram 16=169470
ngram 17=169676
ngram 18=169862
ngram 19=170021


Great, this looks better! We're done at this point and all that is left to do is to correctly integrate the `"ngram"` with [`pyctcdecode`](https://github.com/kensho-technologies/pyctcdecode) and 🤗 Transformers.

## **3. Combine an *n-gram* with Wav2Vec2**

In a final step, we want to wrap the *n-gram* into a `Wav2Vec2ProcessorWithLM` object to make *n-gram* boosted decoding seamless.
We start by downloading the currently "LM-less" processor of [`wav2vec2-xls-r-300m-with-LM-asr_af`](https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af).

In [None]:
from transformers import AutoProcessor

user_name = "lucas-meyer"
repo_name = "wav2vec2-xls-r-300m-with-LM-asr_af"

processor = AutoProcessor.from_pretrained(f"{user_name}/{repo_name}")

Downloading (…)rocessor_config.json:   0%|          | 0.00/214 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/328 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/2.09k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/619 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/51.0 [00:00<?, ?B/s]

Next, we extract the vocabulary of its tokenizer as it represents the `"labels"` of `pyctcdecode`'s `BeamSearchDecoder` class.

In [None]:
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
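
The ordering matters because position `i` in the label list has to correspond to column `i` of the model's logits. The toy sketch below mirrors the cell above on a made-up vocabulary; `labels_from_vocab` and `toy_vocab` are hypothetical names. It additionally raises an error if lowercasing makes two tokens collide, a case in which the dictionary comprehension above would silently drop one entry.

```python
def labels_from_vocab(vocab_dict):
    """Return lowercased tokens ordered by their integer id."""
    ordered = sorted(vocab_dict.items(), key=lambda item: item[1])
    labels = [token.lower() for token, _ in ordered]
    if len(set(labels)) != len(labels):
        # Two tokens that differ only in case would map to the same label.
        raise ValueError("lowercasing produced duplicate labels")
    return labels


# Made-up vocabulary, deliberately listed out of id order.
toy_vocab = {"a": 2, "[PAD]": 0, "|": 1, "b": 3}
toy_labels = labels_from_vocab(toy_vocab)
print(toy_labels)  # ['[pad]', '|', 'a', 'b']
```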

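The `"labels"` and a corrected `.arpa` file are all that's needed to build the decoder. The cell below loads `5gram_correct.arpa`; set `kenlm_model_path` to whichever corrected file you built (the commands in Section 2.5 write `20gram_correct.arpa`).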

In [None]:
from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path="5gram_correct.arpa",
)



We can safely ignore the warning. All that is left to do now is to wrap the newly created `decoder`, together with the processor's `tokenizer` and `feature_extractor`, into a `Wav2Vec2ProcessorWithLM` object.

In [None]:
from transformers import Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)

We want to upload the LM-boosted processor directly into
the model repository [`wav2vec2-xls-r-300m-with-LM-asr_af`](https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af) to have all relevant files in one place.

Let's clone the repo, add the new decoder files and upload them afterward.
First, we need to install `git-lfs`.

In [None]:
!sudo apt-get install git-lfs tree

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 18 not upgraded.
Need to get 47.9 kB of archives.
After this operation, 116 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tree amd64 2.0.2-1 [47.9 kB]
Fetched 47.9 kB in 1s (42.5 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package tree.
(Reading 

Cloning and uploading of modeling files can be done conveniently with the `huggingface_hub`'s `Repository` class.

For more information on how to use `huggingface_hub` to upload files, please take a look at the [official docs](https://huggingface.co/docs/hub/how-to-upstream).

In [None]:
from huggingface_hub import Repository

repo = Repository(local_dir=f"{repo_name}", clone_from=f"{user_name}/{repo_name}")

Cloning https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af into local empty directory.


Download file pytorch_model.bin:   0%|          | 1.40k/1.18G [00:00<?, ?B/s]

Download file training_args.bin: 100%|##########| 3.93k/3.93k [00:00<?, ?B/s]

Clean file training_args.bin:  25%|##5       | 1.00k/3.93k [00:00<?, ?B/s]

Clean file pytorch_model.bin:   0%|          | 1.00k/1.18G [00:00<?, ?B/s]

Having cloned the repository, let's save the new processor with LM into it.

In [None]:
processor_with_lm.save_pretrained(f"{repo_name}")

Let's inspect the local repository. The `tree` command can also conveniently show the size of each file.

In [None]:
!tree -h {repo_name}/

[4.0K]  wav2vec2-xls-r-300m-with-LM-asr_af/
├── [ 373]  alphabet.json
├── [2.0K]  config.json
├── [4.0K]  language_model
│   ├── [ 19M]  5gram_correct.arpa
│   ├── [  78]  attrs.json
│   └── [181K]  unigrams.txt
├── [ 262]  preprocessor_config.json
├── [1.2G]  pytorch_model.bin
├── [2.2K]  README.md
├── [  51]  special_tokens_map.json
├── [ 399]  tokenizer_config.json
├── [3.9K]  training_args.bin
└── [ 619]  vocab.json

1 directory, 12 files


As can be seen, the `.arpa` file is by far the largest of the language model files: it takes up 19 MB.
To reduce the size of the *n-gram* and make loading faster, `kenLM` allows converting `.arpa` files to binary ones using the `build_binary` executable.

Let's make use of it here.

In [None]:
# Convert the .arpa file into a binary file using the build_binary executable
!kenlm/build/bin/build_binary {repo_name}/language_model/5gram_correct.arpa {repo_name}/language_model/5gram.bin

Reading wav2vec2-xls-r-300m-with-LM-asr_af/language_model/5gram_correct.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


Great, it worked! Let's remove the `.arpa` file and check the size of the binary *5-gram* LM.

In [None]:
# Remove .arpa file and view the size of repo
!rm {repo_name}/language_model/5gram_correct.arpa && tree -h {repo_name}/

[4.0K]  wav2vec2-xls-r-300m-with-LM-asr_af/
├── [ 373]  alphabet.json
├── [2.0K]  config.json
├── [4.0K]  language_model
│   ├── [9.8M]  5gram.bin
│   ├── [  78]  attrs.json
│   └── [181K]  unigrams.txt
├── [ 262]  preprocessor_config.json
├── [1.2G]  pytorch_model.bin
├── [2.2K]  README.md
├── [  51]  special_tokens_map.json
├── [ 399]  tokenizer_config.json
├── [3.9K]  training_args.bin
└── [ 619]  vocab.json

1 directory, 12 files


In [None]:
# Push all the files to hub
repo.push_to_hub(commit_message="Upload lm-boosted decoder")

Upload file language_model/5gram.bin:   0%|          | 32.0k/9.83M [00:00<?, ?B/s]

To https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af
   6c4c5fd..ff9d129  main -> main



'https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af/commit/ff9d129c81002549e3a9d2b9c7ea1ebd69ca34b0'