# **Boosting Wav2Vec2 with n-grams**

## Install dependencies

If you need sudo to install the system packages, set `sudo_password = True` and replace "awe" with your sudo password.

In [6]:
sudo_password = False
if sudo_password:
    !echo "awe" > password.txt

Install the Python dependencies and the system packages, then build KenLM.

In [7]:
# Install Python dependencies
!pip3 install https://github.com/kpu/kenlm/archive/master.zip
!pip3 install -r requirements.txt

if sudo_password:
    # -S makes sudo read the password from stdin, so no command waits for a prompt
    !sudo -S apt-get update < password.txt
    # Install GitLFS
    !sudo -S apt-get install -y git-lfs tree < password.txt
    # Install KenLM dependencies
    !sudo -S apt-get install -y build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev < password.txt
else:
    !apt-get update
    # Install GitLFS
    !apt-get install -y git-lfs tree
    # Install KenLM dependencies
    !apt-get install -y build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

# Download build code
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

# Build KenLM
!mkdir -p kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

[sudo] password for kiff: 
[sudo] password for kiff: 

KeyboardInterrupt: 

## **1. Log-in to HF hub**

In [7]:
from huggingface_hub import login
from utils import WRITE_ACCESS_TOKEN

login(WRITE_ACCESS_TOKEN)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/kiff/.cache/huggingface/token
Login successful


## **2. Build an *n-gram* with KenLM**

Build an **n-gram** language model with **[Kneser-Ney smoothing](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing)**.
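
Kneser-Ney smoothing subtracts a discount from every observed n-gram count and passes the freed probability mass on to lower-order estimates. The modified variant uses three discounts per order (for counts of 1, 2, and 3 or more), which `lmplz` reports as `D1`, `D2` and `D3+` while it runs. As a rough sketch, assuming the standard closed-form estimates of Chen and Goodman, the discounts can be computed from the count-of-counts. The function name and the toy numbers are illustrative and not part of KenLM.

```python
def modified_kn_discounts(n1, n2, n3, n4):
    """Closed-form discounts for modified Kneser-Ney smoothing.

    n_k is the number of distinct n-grams (of one order) seen exactly k times.
    """
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2
    d3_plus = 3 - 4 * y * n4 / n3
    return d1, d2, d3_plus


# Toy count-of-counts (hypothetical, not taken from the Afrikaans corpus)
d1, d2, d3_plus = modified_kn_discounts(n1=4, n2=2, n3=1, n4=1)
print(d1, d2, d3_plus)
```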

### 2.1 Concatenate LM data

In [23]:
import os

from datasets import load_dataset
from utils import remove_special_characters

# Define the paths up front so later cells can use them even if the file already exists
data_dir = os.path.join("data", "language_model_data")
output_file_name = "lm_data.txt"
output_file_path = os.path.join(data_dir, output_file_name)

if not os.path.exists(output_file_path):
    # WikiMedia data
    language = "af"
    lm_file_names = [
        f"train.{language}.txt",
        f"val.{language}.txt",
    ]

    # ASR transcription data
    asr_af = load_dataset("lucas-meyer/asr_af")
    asr_af = asr_af["train"]

    with open(output_file_path, "w", encoding="utf-8") as txt_file:
        # Add WikiMedia data
        for file_name in lm_file_names:
            data_path = os.path.join(data_dir, file_name)
            with open(data_path, "r", encoding="utf-8") as data_file:
                for line in data_file:
                    txt_file.write(remove_special_characters(line.strip()))
                    txt_file.write(" ")

        # Add asr_af["train"] transcription data
        for data_entry in asr_af:
            line = data_entry["transcription"]
            txt_file.write(remove_special_characters(line.strip()))
            txt_file.write(" ")
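
`remove_special_characters` is imported from the project's `utils` module, which is not shown here. Below is a minimal sketch of what such a helper might look like, assuming it strips punctuation and lowercases the text so that the LM vocabulary matches the characters the acoustic model can emit. The character set is a guess, not the author's actual implementation.

```python
import re

# Hypothetical set of characters to drop; the real utils module may use a different one
CHARS_TO_REMOVE = re.compile(r'[,?.!;:"“”‘’%()…]')


def remove_special_characters(text: str) -> str:
    """Strip punctuation, collapse whitespace, and lowercase the text."""
    text = CHARS_TO_REMOVE.sub("", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


print(remove_special_characters("Hallo, Wêreld!  Hoe gaan dit?"))
```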

### 2.2 Use concatenated LM data to build n-gram model

Build the **n-gram** model with KenLM's `lmplz` command. The `-o` option sets the order *n* of the model; here we use *n* = 5.

In [38]:
n = 5
arpa_file_name = "%d-gram.arpa" % (n)
arpa_file_path = os.path.join("kenlm", arpa_file_name)
corrected_arpa_file_path = os.path.join("kenlm", f"corrected_{arpa_file_name}")

In [33]:
!kenlm/build/bin/lmplz -o {n} < {output_file_path} > {arpa_file_path}

=== 1/5 Counting and sorting n-grams ===
Reading /home/kiff/Desktop/Speech-Recognition-Afrikaans-isiXhosa/src/data/language_model_data/lm_data.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 715500 types 60024
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:720288 2:1298510464 3:2434707200 4:3895531520 5:5680983552
Statistics:
1 60023 D1=0.709251 D2=1.06535 D3+=1.39207
2 306597 D1=0.830473 D2=1.13429 D3+=1.39941
3 521550 D1=0.92046 D2=1.25482 D3+=1.41513
4 605009 D1=0.972484 D2=1.40699 D3+=1.53842
5 630810 D1=0.976074 D2=1.58889 D3+=1.96178
Memory estimate for binary LM:
type       kB
probing 46202 assuming -p 1.5
probing 54833 assuming -r models -p 1.5
trie    22308 without quantization
trie    12321 assuming -q 8 -b 8 quantization 
trie    20111 assuming -a 22 array pointer compression
trie    1

Great, we have built a *5-gram* LM! Let's inspect the first couple of lines.

### 2.3 Fix tokens of model

There is a small problem that 🤗 Transformers will not be happy about later on.
The *5-gram* correctly includes an "Unknown" token, `<unk>`, as well as a *begin-of-sentence* token, `<s>`, but no *end-of-sentence* token, `</s>`.
Unfortunately, this currently has to be corrected after the build.

We can simply add the *end-of-sentence* token by adding the line `0 </s>  -0.11831701` below the *begin-of-sentence* token and increasing the `ngram 1` count by 1.

In [3]:
!head -{n} {arpa_file_path}

head: invalid option -- '{'
Try 'head --help' for more information.


In [39]:
with open(arpa_file_path, "r") as read_file, open(corrected_arpa_file_path, "w") as write_file:
    has_added_eos = False
    for line in read_file:
        if not has_added_eos and "ngram 1=" in line:
            # Increase the unigram count by one to account for the new </s> entry
            count = line.strip().split("=")[-1]
            write_file.write(f"ngram 1={int(count) + 1}\n")
        elif not has_added_eos and "<s>" in line:
            # Keep the <s> entry and add an identical entry for </s> right below it
            write_file.write(line)
            write_file.write(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            write_file.write(line)

In [37]:
!head -{n} {arpa_file_path}

\data\
ngram 1=60023
ngram 2=306597
ngram 3=521550
ngram 4=605009


Let's now inspect the corrected *5-gram*.
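
One way to check the result without leaving Python is to read the unigram section of the ARPA text and compare it with the header. The helper below is a sketch (its name and return format are made up for illustration) that assumes the usual ARPA layout: a `\data\` header with `ngram N=count` lines, followed by a `\1-grams:` section whose entries are whitespace-separated as log-probability, token, backoff.

```python
def summarize_unigrams(arpa_lines):
    """Return the declared unigram count, the number of unigram entries found,
    and whether <s> and </s> are both present."""
    declared = None
    tokens = []
    in_unigrams = False
    for raw in arpa_lines:
        line = raw.strip()
        if line.startswith("ngram 1="):
            declared = int(line.split("=")[-1])
        elif line == "\\1-grams:":
            in_unigrams = True
        elif in_unigrams:
            if not line or line.startswith("\\"):
                break  # end of the unigram section
            tokens.append(line.split()[1])
    return {
        "declared": declared,
        "found": len(tokens),
        "has_bos": "<s>" in tokens,
        "has_eos": "</s>" in tokens,
    }


# Tiny hand-written example; with the real file, pass open(corrected_arpa_file_path)
example = [
    "\\data\\",
    "ngram 1=4",
    "",
    "\\1-grams:",
    "-1.0\t<unk>\t0",
    "0\t<s>\t-0.5",
    "0\t</s>\t-0.5",
    "-0.7\tdie\t-0.3",
    "",
    "\\2-grams:",
]
print(summarize_unigrams(example))
```

For the corrected file, `declared` and `found` should agree and both flags should be true.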

## **3. Combine an *n-gram* with Wav2Vec2**

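Great, this looks better! The language model is now ready, and all that is left to do is to integrate the *n-gram* with [`pyctcdecode`](https://github.com/kensho-technologies/pyctcdecode) and 🤗 Transformers: we load the processor of the fine-tuned model, build a decoder from its vocabulary and the corrected *5-gram*, and wrap the decoder, tokenizer and feature extractor in a `Wav2Vec2ProcessorWithLM` object.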
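
Conceptually, LM-boosted decoding re-ranks the candidate transcriptions of the beam search by adding a weighted language-model score to the acoustic (CTC) score. The toy example below is a simplified sketch of that idea, not `pyctcdecode`'s actual code: the candidate strings, the scores, and the helper name are invented, while `alpha` (LM weight) and `beta` (word insertion bonus) mirror the names of the corresponding `build_ctcdecoder` parameters.

```python
def fused_score(ctc_log_prob, lm_log_prob, num_words, alpha=0.5, beta=1.0):
    """Combine the acoustic and LM log-probabilities of one candidate."""
    return ctc_log_prob + alpha * lm_log_prob + beta * num_words


# Invented candidates: (text, CTC log-prob, LM log-prob)
candidates = [
    ("ek hed n hond", -4.0, -12.0),  # acoustically best, but unlikely text
    ("ek het n hond", -4.5, -6.0),   # slightly worse acoustics, likely text
]

best = max(candidates, key=lambda c: fused_score(c[1], c[2], len(c[0].split())))
print(best[0])
```

With these made-up numbers the LM term outweighs the small acoustic advantage of the misspelled candidate, which is the effect the *n-gram* is meant to have on real transcriptions.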

### 3.1 Load pre-trained model from HF 

In [42]:
from transformers import AutoProcessor

user_name = "lucas-meyer"
repo_name = "wav2vec2-xls-r-300m-with-LM-asr_af"

processor = AutoProcessor.from_pretrained(f"{user_name}/{repo_name}")

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/619 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/399 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/51.0 [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Downloading (…)747680/alphabet.json:   0%|          | 0.00/373 [00:00<?, ?B/s]

Downloading (…)e_model/unigrams.txt:   0%|          | 0.00/185k [00:00<?, ?B/s]

Downloading 5gram.bin:   0%|          | 0.00/10.3M [00:00<?, ?B/s]

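### 3.2 Build decoder

The tokenizer's vocabulary, sorted by token id, provides the `labels` for `pyctcdecode`. The labels and the corrected `.arpa` file are all that's needed to build the decoder.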

In [44]:
from pyctcdecode import build_ctcdecoder

vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path=corrected_arpa_file_path,
)

Loading the LM will be faster if you build a binary file.
Reading /home/kiff/Desktop/Speech-Recognition-Afrikaans-isiXhosa/src/kenlm/corrected_5-gram.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigrams and labels don't seem to agree.
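
The warning `Unigrams and labels don't seem to agree.` is worth a quick look. A plausible cause (an assumption, since the check itself lives inside `pyctcdecode`) is that the LM vocabulary contains characters the tokenizer has no label for. The sketch below lists such characters; the function name and the sample data are illustrative.

```python
def uncovered_characters(labels, unigrams):
    """Return characters that occur in the LM unigrams but in none of the labels."""
    label_chars = set("".join(labels))
    unigram_chars = set("".join(unigrams))
    return sorted(unigram_chars - label_chars)


# Illustrative data; in the notebook, use list(sorted_vocab_dict.keys()) as labels
# and the words from the unigram section of the .arpa file as unigrams
labels = ["<pad>", "|", "a", "d", "e", "i", "n"]
unigrams = ["die", "en", "één", "2021"]
print(uncovered_characters(labels, unigrams))
```

If the list is not empty, the LM training text probably needs the same normalization as the ASR transcriptions.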


### 3.3 Wrap the decoder together with the tokenizer and feature_extractor

In [None]:
from transformers import Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)


Cloning the model repository and uploading files to it can be done conveniently with the `Repository` class of `huggingface_hub`.

For more information on how to use `huggingface_hub` to upload files, please take a look at the [official docs](https://huggingface.co/docs/hub/how-to-upstream).

In [None]:
from huggingface_hub import Repository

repo = Repository(local_dir=f"{repo_name}", clone_from=f"{user_name}/{repo_name}")

Cloning https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af into local empty directory.


Download file pytorch_model.bin:   0%|          | 1.40k/1.18G [00:00<?, ?B/s]

Download file training_args.bin: 100%|##########| 3.93k/3.93k [00:00<?, ?B/s]

Clean file training_args.bin:  25%|##5       | 1.00k/3.93k [00:00<?, ?B/s]

Clean file pytorch_model.bin:   0%|          | 1.00k/1.18G [00:00<?, ?B/s]

Having cloned `wav2vec2-xls-r-300m-with-LM-asr_af`, let's save the new processor with LM into it.

In [None]:
processor_with_lm.save_pretrained(f"{repo_name}")

Let's inspect the local repository. Conveniently, the `tree` command can also show the size of each file.

In [None]:
!tree -h {repo_name}/

[4.0K]  wav2vec2-xls-r-300m-with-LM-asr_af/
├── [ 373]  alphabet.json
├── [2.0K]  config.json
├── [4.0K]  language_model
│   ├── [ 19M]  5gram_correct.arpa
│   ├── [  78]  attrs.json
│   └── [181K]  unigrams.txt
├── [ 262]  preprocessor_config.json
├── [1.2G]  pytorch_model.bin
├── [2.2K]  README.md
├── [  51]  special_tokens_map.json
├── [ 399]  tokenizer_config.json
├── [3.9K]  training_args.bin
└── [ 619]  vocab.json

1 directory, 12 files


As can be seen, the *5-gram* LM is the largest file apart from the model weights: the `.arpa` file amounts to about 19 MB.
To reduce the size of the *n-gram* and make loading faster, KenLM allows converting `.arpa` files to binary ones using the `build_binary` executable.

Let's make use of it here.

In [None]:
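# Convert the .arpa file into a binary LM using the build_binary executable
# (assumes the .arpa file was saved into language_model/ under its base name)
!kenlm/build/bin/build_binary {repo_name}/language_model/{os.path.basename(corrected_arpa_file_path)} {repo_name}/language_model/5gram.bin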

Reading wav2vec2-xls-r-300m-with-LM-asr_af/language_model/5gram_correct.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


Great, it worked! Let's remove the `.arpa` file and check the size of the binary *5-gram* LM.

In [None]:
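# Remove the .arpa file and view the size of the repo
# (assumes the .arpa file was saved into language_model/ under its base name)
!rm {repo_name}/language_model/{os.path.basename(corrected_arpa_file_path)} && tree -h {repo_name}/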

[4.0K]  wav2vec2-xls-r-300m-with-LM-asr_af/
├── [ 373]  alphabet.json
├── [2.0K]  config.json
├── [4.0K]  language_model
│   ├── [9.8M]  5gram.bin
│   ├── [  78]  attrs.json
│   └── [181K]  unigrams.txt
├── [ 262]  preprocessor_config.json
├── [1.2G]  pytorch_model.bin
├── [2.2K]  README.md
├── [  51]  special_tokens_map.json
├── [ 399]  tokenizer_config.json
├── [3.9K]  training_args.bin
└── [ 619]  vocab.json

1 directory, 12 files


The binary *5-gram* is less than 10 MB. With it in place, all that is left is to push the files to the Hub.

In [None]:
# Push all the files to hub
repo.push_to_hub(commit_message="Upload lm-boosted decoder")

Upload file language_model/5gram.bin:   0%|          | 32.0k/9.83M [00:00<?, ?B/s]

To https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af
   6c4c5fd..ff9d129  main -> main




'https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af/commit/ff9d129c81002549e3a9d2b9c7ea1ebd69ca34b0'

We want to upload the LM-boosted processor directly into
the model repository [`wav2vec2-xls-r-300m-with-LM-asr_af`](https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-with-LM-asr_af) to have all relevant files in one place.

Let's clone the repo, add the new decoder files and upload them afterward.
First, we need to install `git-lfs`.