# **Boosting Wav2Vec2 with n-grams**

## Install dependencies
**Note:** This only needs to be run once per machine.

If installation requires a sudo password, set `sudo_password = True` below and replace "awe" with your sudo password.

In [1]:
sudo_password = False
if sudo_password:
    !echo "awe" > password.txt

Install the Python and system dependencies, then download and build KenLM.

In [2]:
# Install Python dependencies
!pip3 install https://github.com/kpu/kenlm/archive/master.zip
!pip3 install -r requirements.txt

if sudo_password:
    # -S makes sudo read the password from stdin; -y answers apt prompts automatically
    !sudo -S apt-get update < password.txt
    # Install GitLFS
    !sudo -S apt-get install -y git-lfs tree < password.txt
    # Install KenLM dependencies
    !sudo -S apt-get install -y build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev < password.txt
else:
    !apt-get update
    # Install GitLFS
    !apt-get install -y git-lfs tree
    # Install KenLM dependencies
    !apt-get install -y build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

# Download build code
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

# Build KenLM
!mkdir -p kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

Collecting https://github.com/kpu/kenlm/archive/master.zip
  Using cached https://github.com/kpu/kenlm/archive/master.zip (553 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Get:8 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [1,330 kB]
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:1

## **1. Log in to the HF Hub**
**Note:** This needs to be run every time.

In [1]:
from huggingface_hub import login
from utils import WRITE_ACCESS_TOKEN

login(WRITE_ACCESS_TOKEN)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/kiff/.cache/huggingface/token
Login successful


## **2. Build an *n-gram* with KenLM**

**Note:** This needs to be run every time.

Build an **n-gram** language model with **[Kneser-Ney smoothing](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing)**.
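
As a reminder of what the smoothing does, the interpolated Kneser-Ney estimate for a bigram can be written as below. This is the textbook single-discount form with discount $D$ and counts $c(\cdot)$. KenLM's `lmplz` uses the modified variant with several discounts per order (the `D1`, `D2`, `D3+` values it prints during the build), so treat this as a simplified sketch rather than KenLM's exact computation.

```latex
$$
P_{\text{KN}}(w_i \mid w_{i-1})
  = \frac{\max\bigl(c(w_{i-1} w_i) - D,\ 0\bigr)}{\sum_{w} c(w_{i-1} w)}
  + \lambda(w_{i-1})\, P_{\text{cont}}(w_i)
$$

$$
\lambda(w_{i-1}) = \frac{D}{\sum_{w} c(w_{i-1} w)}\,
  \bigl|\{\, w : c(w_{i-1} w) > 0 \,\}\bigr|,
\qquad
P_{\text{cont}}(w_i) =
  \frac{\bigl|\{\, w' : c(w' w_i) > 0 \,\}\bigr|}
       {\bigl|\{\, (w', w'') : c(w' w'') > 0 \,\}\bigr|}
$$
```

The first term discounts every observed bigram count, and the second term redistributes the removed mass according to how many distinct contexts a word appears in.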

### 2.1 Concatenate LM data

In [19]:
import re
import os

from datasets import load_dataset
from utils import remove_special_characters

language = "af"
# language = "xh"

data_dir = os.path.join("data", "language_model_data")
os.makedirs(data_dir, exist_ok=True)

# Output
output_file_name = f"{language}_lm_data.txt"
output_file_path = os.path.join(data_dir, output_file_name)

In [20]:
if not os.path.exists(output_file_path):
    # ASR transcription data
    asr_dataset = load_dataset(f"lucas-meyer/asr_{language}")
    asr_dataset = asr_dataset["train"]

    # WikiMedia data
    lm_file_names = [
        f"train.{language}.txt",
        f"val.{language}.txt"
    ]

    with open(output_file_path, "w", encoding="utf-8") as txt_file:
        # Add WikiMedia data
        for file_name in lm_file_names:
            data_path = os.path.join(data_dir, file_name)
            with open(data_path, "r", encoding="utf-8") as data_file:
                # Iterate lazily instead of loading the whole file with readlines()
                for line in data_file:
                    txt_file.write(remove_special_characters(line.strip()))
                    txt_file.write(" ")

        # Add asr_dataset transcription data
        for data_entry in asr_dataset:
            line = data_entry["transcription"]
            txt_file.write(remove_special_characters(line.strip()))
            txt_file.write(" ")
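
The `remove_special_characters` helper comes from the project's own `utils` module, which is not shown in this notebook. The sketch below is a hypothetical stand-in that shows the kind of normalisation such a helper typically performs: it lower-cases the text and drops a small, assumed set of punctuation marks. The ARPA output later in this notebook still contains tokens such as `(na`, so the real helper evidently leaves parentheses in place, and the sketch does the same. The function name and the character set are assumptions.

```python
import re

# Hypothetical stand-in for utils.remove_special_characters.
# The set of characters removed here is an assumption, not the project's actual list.
CHARS_TO_REMOVE = re.compile(r'[,?.!;:"]')


def remove_special_characters_sketch(text: str) -> str:
    """Lower-case the text and drop a few common punctuation marks."""
    return CHARS_TO_REMOVE.sub("", text).lower()


print(remove_special_characters_sketch("Afrika is die wêreld se tweede grootste kontinent (na Asië)."))
```

Keeping the normalisation identical for the LM text and the ASR transcriptions matters, because the decoder can only match words that are spelled the same way in both.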

### 2.2 Use concatenated LM data to build n-gram model

Build the **n-gram** model with KenLM's `lmplz` command. The order of the model is set with the `-o n` option; here we use `n = 5`.
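
To make the role of the order concrete, here is a toy sketch that counts every n-gram up to a given order in a list of tokens. It only illustrates what the order controls. It is not how `lmplz` is implemented, and it ignores sentence-boundary tokens and smoothing. The function name `count_ngrams` and the example sentence are made up for this illustration.

```python
from collections import Counter


def count_ngrams(tokens, max_order):
    """Count all n-grams of order 1..max_order in a list of tokens."""
    counts = {order: Counter() for order in range(1, max_order + 1)}
    for order in range(1, max_order + 1):
        # Slide a window of length `order` over the token list.
        for start in range(len(tokens) - order + 1):
            counts[order][tuple(tokens[start:start + order])] += 1
    return counts


tokens = "die kat sit op die mat".split()
counts = count_ngrams(tokens, max_order=3)
print(counts[1][("die",)])  # 2
print(len(counts[2]))       # 5 distinct bigrams
print(len(counts[3]))       # 4 distinct trigrams
```

A higher order captures longer contexts but produces many more distinct n-grams, which is why the ARPA file grows quickly with `n`.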

In [21]:
n = 5
arpa_file_name = f"{n}-gram_{language}.arpa"
arpa_file_path = os.path.join("kenlm", arpa_file_name)
corrected_arpa_file_path = os.path.join("kenlm", f"corrected_{arpa_file_name}")

In [22]:
!kenlm/build/bin/lmplz -o {n} < {output_file_path} > {arpa_file_path}

=== 1/5 Counting and sorting n-grams ===
Reading /home/kiff/Desktop/Speech-Recognition-Afrikaans-isiXhosa/src/data/language_model_data/af_lm_data.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 715500 types 60024
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:720288 2:1298509568 3:2434705664 4:3895528704 5:5680979968
Statistics:
1 60023 D1=0.709251 D2=1.06535 D3+=1.39207
2 306597 D1=0.830473 D2=1.13429 D3+=1.39941
3 521550 D1=0.92046 D2=1.25482 D3+=1.41513
4 605009 D1=0.972484 D2=1.40699 D3+=1.53842
5 630810 D1=0.976074 D2=1.58889 D3+=1.96178
Memory estimate for binary LM:
type       kB
probing 46202 assuming -p 1.5
probing 54833 assuming -r models -p 1.5
trie    22308 without quantization
trie    12321 assuming -q 8 -b 8 quantization 
trie    20111 assuming -a 22 array pointer compression
trie  

### 2.3 Fix tokens of model

There is one small problem: the *n-gram* correctly includes an "unknown" token, `<unk>`, and a *begin-of-sentence* token, `<s>`, but no *end-of-sentence* token, `</s>`. This has to be corrected after the build.

In [23]:
# View n-gram file
!head -20 {arpa_file_path}

\data\
ngram 1=60023
ngram 2=306597
ngram 3=521550
ngram 4=605009
ngram 5=630810

\1-grams:
-5.5290265	<unk>	0
0	<s>	-0.080674514
-3.6975882	afrika	-0.20852761
-1.73772	is	-0.6712574
-1.7047756	die	-0.44148853
-3.791833	wêreld	-0.17801061
-2.355747	se	-0.25122693
-3.9361587	tweede	-0.13276275
-3.8189769	grootste	-0.16491456
-4.508168	kontinent	-0.12698652
-4.389144	(na	-0.102394395
-4.940668	asië)	-0.080674514
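
Each line in an n-gram section of the ARPA file holds a base-10 log probability, the n-gram itself, and an optional backoff weight, separated by tabs. The sketch below parses one such line; the helper name `parse_arpa_line` is made up for illustration and is not part of KenLM.

```python
def parse_arpa_line(line):
    """Split one ARPA n-gram line into (log10_prob, ngram, backoff)."""
    fields = line.rstrip("\n").split("\t")
    log10_prob = float(fields[0])
    ngram = fields[1]
    # The backoff weight may be absent, e.g. for the highest-order n-grams.
    backoff = float(fields[2]) if len(fields) > 2 else None
    return log10_prob, ngram, backoff


log10_prob, ngram, backoff = parse_arpa_line("-1.7047756\tdie\t-0.44148853\n")
# Convert the log10 probability back to a plain probability.
print(ngram, 10 ** log10_prob, backoff)
```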


In [24]:
# Add the end-of-sentence token
with open(arpa_file_path, "r", encoding="utf-8") as read_file, open(corrected_arpa_file_path, "w", encoding="utf-8") as write_file:
    has_added_eos = False
    for line in read_file:
        if not has_added_eos and "ngram 1=" in line:
            # One extra unigram (</s>) is added below, so bump the header count
            count = line.strip().split("=")[-1]
            write_file.write(f"ngram 1={int(count) + 1}\n")
        elif not has_added_eos and "<s>" in line:
            write_file.write(line)
            write_file.write(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            write_file.write(line)

In [25]:
# View corrected n-gram file
!head -20 {corrected_arpa_file_path}

\data\
ngram 1=60024
ngram 2=306597
ngram 3=521550
ngram 4=605009
ngram 5=630810

\1-grams:
-5.5290265	<unk>	0
0	<s>	-0.080674514
0	</s>	-0.080674514
-3.6975882	afrika	-0.20852761
-1.73772	is	-0.6712574
-1.7047756	die	-0.44148853
-3.791833	wêreld	-0.17801061
-2.355747	se	-0.25122693
-3.9361587	tweede	-0.13276275
-3.8189769	grootste	-0.16491456
-4.508168	kontinent	-0.12698652
-4.389144	(na	-0.102394395
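
As a sanity check on the corrected file, the declared unigram count in the header should equal the number of entries in the `\1-grams:` section, and `</s>` should be among them. The sketch below performs that check on a tiny hand-written ARPA fragment. The function name `check_arpa_unigrams` and the toy data are assumptions made for this illustration.

```python
def check_arpa_unigrams(lines):
    """Return (declared_count, actual_count, has_eos) for the 1-gram section."""
    declared = None
    actual = 0
    has_eos = False
    in_unigrams = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("ngram 1="):
            declared = int(line.split("=")[1])
        elif line == "\\1-grams:":
            in_unigrams = True
        elif in_unigrams:
            if not line or line.startswith("\\"):
                break  # a blank line or the next section ends the 1-grams
            actual += 1
            if line.split("\t")[1] == "</s>":
                has_eos = True
    return declared, actual, has_eos


toy_arpa = [
    "\\data\\\n",
    "ngram 1=3\n",
    "\n",
    "\\1-grams:\n",
    "-1.0\t<unk>\t0\n",
    "0\t<s>\t-0.3\n",
    "0\t</s>\t-0.3\n",
    "\n",
    "\\end\\\n",
]
print(check_arpa_unigrams(toy_arpa))  # (3, 3, True)
```

Because the function only iterates over lines, it should also accept an open file handle for `corrected_arpa_file_path` in place of the toy list.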


## **3. Combine an *n-gram* with Wav2Vec2**

**Note:** This needs to be run every time.

### 3.1 Load pre-trained model from HF

In [26]:
from transformers import AutoProcessor

user_name = "lucas-meyer"
repo_name = input("Remember to make a clone!!! Enter the repo name (excluding username): ")

processor = AutoProcessor.from_pretrained(f"{user_name}/{repo_name}")

Remember to make a clone!!! Enter the repo name (excluding username): wav2vec2-xls-r-300m-asr_af-run1-with-LM


### 3.2 Build decoder

In [27]:
from pyctcdecode import build_ctcdecoder

vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path=corrected_arpa_file_path,
)

Loading the LM will be faster if you build a binary file.
Reading /home/kiff/Desktop/Speech-Recognition-Afrikaans-isiXhosa/src/kenlm/corrected_5-gram_af.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigrams and labels don't seem to agree.
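
The decoder expects its labels in the same order as the model's output columns, which is why the vocabulary is sorted by token id before the keys are passed to `build_ctcdecoder`. The toy example below shows the effect of that sort on a small, made-up vocabulary; the tokens and ids are purely illustrative.

```python
# Toy vocabulary in the shape returned by a tokenizer's get_vocab(): token -> id.
# The tokens and ids here are made up for illustration.
toy_vocab = {"|": 4, "a": 5, "<pad>": 0, "B": 6, "<unk>": 3}

# Sort by id so that position i in the label list lines up with output column i.
sorted_items = sorted(toy_vocab.items(), key=lambda item: item[1])
labels = [token.lower() for token, _ in sorted_items]
print(labels)  # ['<pad>', '<unk>', '|', 'a', 'b']
```

One thing to watch: when lower-cased tokens are used as dictionary keys, two tokens that differ only by case would collapse into a single entry and shift the labels after them, so it may be worth checking that the label list has the same length as the original vocabulary.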


### 3.3 Wrap the decoder together with the tokenizer and feature_extractor

In [28]:
from transformers import Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)

### 3.4 Clone HuggingFace repository and save LM to local clone

In [29]:
from huggingface_hub import Repository

repo = Repository(local_dir=f"{repo_name}", clone_from=f"{user_name}/{repo_name}")

processor_with_lm.save_pretrained(f"{repo_name}")

Cloning https://huggingface.co/lucas-meyer/wav2vec2-xls-r-300m-asr_af-run1-with-LM into local empty directory.


Download file pytorch_model.bin:   0%|          | 8.00k/1.18G [00:00<?, ?B/s]

Download file runs/Oct11_21-26-14_3303824dbc25/events.out.tfevents.1697059587.3303824dbc25.332.0: 100%|#######…

Download file runs/Oct12_00-40-45_3303824dbc25/events.out.tfevents.1697071247.3303824dbc25.332.2: 100%|#######…

Download file runs/Oct11_21-26-14_3303824dbc25/1697059587.6734695/events.out.tfevents.1697059587.3303824dbc25.…

Download file runs/Oct12_00-40-45_3303824dbc25/1697071247.813094/events.out.tfevents.1697071247.3303824dbc25.3…

Download file training_args.bin: 100%|##########| 3.56k/3.56k [00:00<?, ?B/s]

Clean file runs/Oct11_21-26-14_3303824dbc25/events.out.tfevents.1697059587.3303824dbc25.332.0:  12%|#1        …

Clean file runs/Oct12_00-40-45_3303824dbc25/events.out.tfevents.1697071247.3303824dbc25.332.2:  17%|#7        …

Clean file training_args.bin:  28%|##8       | 1.00k/3.56k [00:00<?, ?B/s]

Clean file runs/Oct12_00-40-45_3303824dbc25/1697071247.813094/events.out.tfevents.1697071247.3303824dbc25.332.…

Clean file runs/Oct11_21-26-14_3303824dbc25/1697059587.6734695/events.out.tfevents.1697059587.3303824dbc25.332…

Clean file pytorch_model.bin:   0%|          | 1.00k/1.18G [00:00<?, ?B/s]

### 3.5 Replace the ARPA file with a binary file (saves space)

In [30]:
# View the size of repo
!tree -h {repo_name}/

[4.0K]  wav2vec2-xls-r-300m-asr_af-run1-with-LM/
├── [ 363]  alphabet.json
├── [2.0K]  config.json
├── [4.0K]  language_model
│   ├── [  78]  attrs.json
│   ├── [ 89M]  corrected_5-gram_af.arpa
│   └── [599K]  unigrams.txt
├── [ 262]  preprocessor_config.json
├── [1.2G]  pytorch_model.bin
├── [2.1K]  README.md
├── [4.0K]  runs
│   ├── [4.0K]  Oct11_21-26-14_3303824dbc25
│   │   ├── [4.0K]  1697059587.6734695
│   │   │   └── [5.8K]  events.out.tfevents.1697059587.3303824dbc25.332.1
│   │   └── [8.4K]  events.out.tfevents.1697059587.3303824dbc25.332.0
│   └── [4.0K]  Oct12_00-40-45_3303824dbc25
│       ├── [4.0K]  1697071247.813094
│       │   └── [5.8K]  events.out.tfevents.1697071247.3303824dbc25.332.3
│       └── [5.8K]  events.out.tfevents.1697071247.3303824dbc25.332.2
├── [  

In [31]:
# Convert the .arpa file into KenLM's binary format using the build_binary executable
corrected_arpa_file_name = os.path.basename(corrected_arpa_file_path)
bin_file_name = f"{n}-gram_{language}.bin"
!kenlm/build/bin/build_binary {repo_name}/language_model/{corrected_arpa_file_name} {repo_name}/language_model/{bin_file_name}

# Remove .arpa file
!rm {repo_name}/language_model/{corrected_arpa_file_name}

Reading wav2vec2-xls-r-300m-asr_af-run1-with-LM/language_model/corrected_5-gram_af.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


In [32]:
# View the size of repo
!tree -h {repo_name}/

[4.0K]  wav2vec2-xls-r-300m-asr_af-run1-with-LM/
├── [ 363]  alphabet.json
├── [2.0K]  config.json
├── [4.0K]  language_model
│   ├── [ 46M]  5-gram_af.bin
│   ├── [  78]  attrs.json
│   └── [599K]  unigrams.txt
├── [ 262]  preprocessor_config.json
├── [1.2G]  pytorch_model.bin
├── [2.1K]  README.md
├── [4.0K]  runs
│   ├── [4.0K]  Oct11_21-26-14_3303824dbc25
│   │   ├── [4.0K]  1697059587.6734695
│   │   │   └── [5.8K]  events.out.tfevents.1697059587.3303824dbc25.332.1
│   │   └── [8.4K]  events.out.tfevents.1697059587.3303824dbc25.332.0
│   └── [4.0K]  Oct12_00-40-45_3303824dbc25
│       ├── [4.0K]  1697071247.813094
│       │   └── [5.8K]  events.out.tfevents.1697071247.3303824dbc25.332.3
│       └── [5.8K]  events.out.tfevents.1697071247.3303824dbc25.332.2
├── [  51]  s

### 3.6 Push updated model to HuggingFace repository

In [34]:
# Enable GitLFS
!git lfs install

Updated Git hooks.
Git LFS initialized.


In [35]:
# Push all the files to hub
repo.push_to_hub(commit_message="Upload lm-boosted decoder")