## **1. Decoding audio data with Wav2Vec2 and a language model**

As shown in 🤗 Transformers [example docs of Wav2Vec2](https://huggingface.co/docs/transformers/master/en/model_doc/wav2vec2#transformers.Wav2Vec2ForCTC), audio can be transcribed as follows.

We install `datasets` and `transformers` as well as `pyctcdecode` and `kenLM`'s Python bindings to be able to run the language model integration.



In [None]:
!pip install https://github.com/kpu/kenlm/archive/master.zip
!pip install pyctcdecode
!pip install datasets[audio]
!pip install evaluate
!pip install transformers
!pip install accelerate

Collecting https://github.com/kpu/kenlm/archive/master.zip
  Downloading https://github.com/kpu/kenlm/archive/master.zip
     - 553.6 kB 10.4 MB/s 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: kenlm
  Building wheel for kenlm (pyproject.toml) ... done
  Created wheel for kenlm: filename=kenlm-0.2.0-cp310-cp310-linux_x86_64.whl size=3184306 sha256=b1f983c09d74a933c5fcdb32909615e9242ab39f7a784e94942832cc49ab6594
  Stored in directory: /tmp/pip-ephem-wheel-cache-9ohiqnbs/wheels/a5/73/ee/670fbd0cee8f6f0b21d10987cb042291e662e26e1a07026462
Successfully built kenlm
Installing collected packages: kenlm
Successfully installed kenlm-0.2.0
Collecting pyctcdecode
  Downloading pyctcdecode-0.5.0-py2.py3-none-any.whl (39 kB)
Collecting pygtrie<3.0,>=2.1 (from pyctcdecode)


Comparing the transcription to the target transcription above, we can see that some words *sound* correct, but are not *spelled* correctly, *e.g.*:

- *christmaus* vs. *christmas*
- *rose* vs. *roast*
- *simalyis* vs. *similes*

Let's see whether combining Wav2Vec2 with an ***n-gram*** language model can help here.

For demonstration purposes, we have prepared a new model repository [patrickvonplaten/wav2vec2-base-100h-with-lm](https://huggingface.co/patrickvonplaten/wav2vec2-base-100h-with-lm) which contains the same Wav2Vec2 checkpoint but has an additional **4-gram** language model for English.

Instead of using `Wav2Vec2Processor`, this time we use `Wav2Vec2ProcessorWithLM` to load the **4-gram** model in addition to the feature extractor and tokenizer.

Cool! Recalling the words `facebook/wav2vec2-base-100h` without a language model transcribed incorrectly previously, *e.g.*,

- *christmaus* vs. *christmas*
- *rose* vs. *roast*
- *simalyis* vs. *similes*

we can take another look at the transcription of `facebook/wav2vec2-base-100h` **with** a 4-gram language model. 2 out of 3 errors are corrected; *christmas* and *similes* have been correctly transcribed.

Interestingly, the incorrect transcription of *rose* persists. However, this should not surprise us very much. Decoding audio without a language model is much more prone to yield spelling mistakes, such as *christmaus* or *simalyis* (those words don't exist in the English language as far as I know). This is because the speech recognition system almost solely bases its prediction on the acoustic input it was given and not really on the language modeling context of previous and successive predicted letters ${}^1$.
If, on the other hand, we add a language model, we can be fairly sure that the speech recognition system will heavily reduce spelling errors, since a well-trained *n-gram* model will surely not predict a word that has spelling errors. But the word *rose* is a valid English word and therefore the 4-gram will predict this word with a probability that is not insignificant.

The language model on its own most likely does favor the correct word *roast* since the word sequence *roast beef* is much more common in English than *rose beef*. Because the final transcription is derived from a weighted combination of `facebook/wav2vec2-base-100h` output probabilities and those of the *n-gram* language model, it is quite common to see incorrectly transcribed words such as *rose*.
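
To make the weighted combination concrete, here is a minimal sketch of how two competing hypotheses could be scored. All log-probabilities and the weights `alpha` and `beta` are made-up illustrative values, not numbers taken from `facebook/wav2vec2-base-100h` or its 4-gram, and the formula (acoustic score plus a weighted LM score plus a per-word bonus) is the common shallow-fusion form, assumed here for illustration. The helper name `fused_score` is hypothetical.

```python
def fused_score(acoustic_logp, lm_logp, num_words, alpha=0.5, beta=1.5):
    """Combine acoustic and LM log-probabilities (shallow-fusion style).

    alpha weights the language model, beta is a per-word bonus.
    Both default values are illustrative assumptions.
    """
    return acoustic_logp + alpha * lm_logp + beta * num_words


# Hypothetical scores: the acoustic model strongly prefers "rose",
# while the language model prefers "roast".
hypotheses = {
    "rose beef": {"acoustic": -1.0, "lm": -9.0},
    "roast beef": {"acoustic": -4.0, "lm": -5.0},
}

scores = {
    text: fused_score(s["acoustic"], s["lm"], num_words=len(text.split()))
    for text, s in hypotheses.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

With these made-up numbers the acoustic preference for *rose* still wins at `alpha=0.5`; raising `alpha` to `1.0` would flip the decision to *roast beef*, which is why the LM weight is worth tuning.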

For more information on how you can tweak different parameters when decoding with `Wav2Vec2ProcessorWithLM`, please take a look at the official documentation [here](https://huggingface.co/docs/transformers/master/en/model_doc/wav2vec2#transformers.Wav2Vec2ProcessorWithLM.batch_decode).

---
${}^1$ Some research shows that a model such as `facebook/wav2vec2-base-100h` - when sufficiently large and trained on enough data - can learn language modeling dependencies between intermediate audio representations similar to a language model.


Great, now that you have seen the advantages adding an *n-gram* language model can bring, let's dive into how to create an *n-gram* and `Wav2Vec2ProcessorWithLM` from scratch.

## **3. Build an *n-gram* with KenLM**

While large language models based on the [Transformer architecture](https://jalammar.github.io/illustrated-transformer/) have become the standard in NLP, it is still very common to use an ***n-gram*** LM to boost speech recognition systems - as shown in Section 1.

Looking again at Table 9 of Appendix C of the [official Wav2Vec2 paper](https://arxiv.org/abs/2006.11477), it can be noticed that using a *Transformer*-based LM for decoding clearly yields better results than using an *n-gram* model, but the difference between *n-gram* and *Transformer*-based LM is much less significant than the difference between *n-gram* and no LM.

*E.g.*, for the large Wav2Vec2 checkpoint that was fine-tuned on 10min only, an *n-gram* reduces the word error rate (WER) compared to no LM by *ca.* 80% while a *Transformer*-based LM *only* reduces the WER by another 23% compared to the *n-gram*. This relative WER reduction shrinks the more data the acoustic model has been trained on. *E.g.*, for the large checkpoint a *Transformer*-based LM reduces the WER by merely 8% compared to an *n-gram* LM whereas the *n-gram* still yields a 21% WER reduction compared to no language model.
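
The percentages above are relative reductions, which is easy to confuse with absolute WER differences. The short sketch below shows the arithmetic. The function name `relative_wer_reduction` and the two WER values are hypothetical and chosen only so that the result is a round number; they are not taken from the paper's table.

```python
def relative_wer_reduction(wer_before, wer_after):
    """Return the relative WER reduction in percent."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return 100.0 * (wer_before - wer_after) / wer_before


# Hypothetical WERs (in percent), chosen only to illustrate the arithmetic.
wer_no_lm = 40.0
wer_ngram = 8.0
print(relative_wer_reduction(wer_no_lm, wer_ngram))
```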

The reason why an *n-gram* is preferred over a *Transformer*-based LM is that *n-grams* come at a significantly smaller computational cost. For an *n-gram*, retrieving the probability of a word given previous words is almost only as computationally expensive as querying a look-up table or tree-like data storage - *i.e.* it's very fast compared to modern *Transformer*-based language models that would require a full forward pass to retrieve the next word probabilities.
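
As a rough illustration of the "look-up table" point, the following toy sketch estimates bigram probabilities from counts stored in dictionaries. It is an unsmoothed maximum-likelihood estimate on a made-up corpus, not what KenLM does internally (KenLM adds smoothing and back-off and uses far more compact data structures); the names `bigram_prob`, `unigram_counts` and `bigram_counts` are hypothetical.

```python
from collections import Counter

corpus = "we like roast beef and we like roast chicken".split()

# Count unigrams and bigrams once, up front.
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))


def bigram_prob(previous_word, word):
    """Maximum-likelihood estimate of P(word | previous_word).

    Just two dictionary look-ups and a division; no smoothing.
    """
    context_count = unigram_counts[previous_word]
    if context_count == 0:
        return 0.0
    return bigram_counts[(previous_word, word)] / context_count


print(bigram_prob("roast", "beef"))
print(bigram_prob("like", "roast"))
```

Querying a probability here never touches a neural network, which is the cost advantage described above.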

For more information on how *n-grams* function and why they are (still) so useful for speech recognition, the reader is advised to take a look at [this excellent summary](https://web.stanford.edu/~jurafsky/slp3/3.pdf) from Stanford.

Great, let's see step-by-step how to build an *n-gram*. We will use the popular [KenLM library](https://github.com/kpu/kenlm) to do so. Let's start by installing the Ubuntu library prerequisites:

In [None]:
!sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
libboost-program-options-dev is already the newest version (1.74.0.3ubuntu7).
libboost-program-options-dev set to manually installed.
libboost-system-dev is already the newest version (1.74.0.3ubuntu7).
libboost-system-dev set to manually installed.
libboost-thread-dev is already the newest version (1.74.0.3ubuntu7).
libboost-thread-dev set to manually installed.
libbz2-dev is already the newest version (1.0.8-5build1).
libbz2-dev set to manually installed.
liblzma-dev is already the newest version (5.2.5-2ubuntu1).
liblzma-dev set to manually installed.
libboost-test-dev is already the newest version (1.74.0.3ubuntu7).
libboost-test-dev set to manually installed.
cmake is already the newest version (3.22.1-1ubuntu1.22.04.1).
zlib1g-dev is already the newest version (1:1.2.11.dfsg-2ubuntu9.2).
zlib1g-dev set to manually installed.

Next, we download and unpack the KenLM repo.

In [None]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

--2023-10-25 19:09:26--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 35.196.63.85
Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491888 (480K) [application/x-gzip]
Saving to: ‘STDOUT’


2023-10-25 19:09:27 (1.33 MB/s) - written to stdout [491888/491888]



KenLM is written in C++, so we'll make use of `cmake` to build the binaries.

In [None]:
!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.74.0/BoostConfig.cmake (found suitable version "1.74.0", minimum required is "1.41.0") found components: program_options system thread unit_test_framework 
-- Found Threads: TRUE  
-- Found ZLIB: /usr

Great, as we can see, the executables have successfully been built under `kenlm/build/bin/`.

KenLM by default computes an *n-gram* with [Kneser-Ney smoothing](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing). All text data used to create the *n-gram* is expected to be stored in a text file.
We download our dataset and save it as a `.txt` file.

In [None]:
from datasets import load_dataset

# token=True reuses the token stored by `huggingface-cli login` or `notebook_login()`;
# never paste a personal access token into the notebook itself.
datasetss = load_dataset("vivos", split="train", token=True)
data_cmv = load_dataset("mozilla-foundation/common_voice_13_0", "vi", split="train+validation", token=True)


In [None]:
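import re

# Raw string so that the backslashes reach the regex engine unchanged.
chars_to_remove_regex = r"[,?.!\-;:\"“%‘”�']"

def remove_special_characters(batch):
    # Strip punctuation, then lowercase the transcription.
    batch["sentence"] = re.sub(chars_to_remove_regex, '', batch["sentence"]).lower()
    return batch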
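
To see what this kind of normalization does, here is a standalone sketch applied to a made-up sentence. It uses a reduced, illustrative character set and a hypothetical helper name `normalize`, so it is not a drop-in replacement for the cell above.

```python
import re

# Reduced, illustrative set of punctuation characters to strip.
punctuation_regex = r"[,?.!;:\"'%-]"


def normalize(sentence):
    """Remove punctuation and lowercase the sentence."""
    return re.sub(punctuation_regex, "", sentence).lower()


print(normalize("Xin chào, Việt Nam!"))
```

Lowercasing and stripping punctuation keeps the *n-gram* vocabulary aligned with the characters the acoustic model can actually emit.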

In [None]:
datasetss = datasetss.map(remove_special_characters)
data_cmv = data_cmv.map(remove_special_characters)

In [None]:
with open("text.txt", "w", encoding="utf-8") as file:
  file.write(" ".join(datasetss["sentence"]).lower())
  file.write(" ")  # separator, so the two corpora are not glued together
  file.write(" ".join(data_cmv["sentence"]).lower())

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!cp /content/text.txt /content/drive/MyDrive/s2t_mms_vivos

Now, we just have to run KenLM's `lmplz` command to build our *n-gram*, called `"5gram.arpa"`. As it's relatively common in speech recognition, we build a *5-gram* by passing the `-o 5` parameter.
For more information on the different *n-gram* LM that can be built
with KenLM, one can take a look at the [official website of KenLM](https://kheafield.com/code/kenlm/).

Executing the command below might take a minute or so.

In [None]:
!kenlm/build/bin/lmplz -o 5 <"text.txt" > "5gram.arpa"

=== 1/5 Counting and sorting n-grams ===
Reading /content/text.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 192501 types 5108
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:61296 2:1062493056 3:1992174592 4:3187479296 5:4648407552
Statistics:
1 5107 D1=0.532035 D2=1.04861 D3+=1.58484
2 96008 D1=0.768947 D2=1.16013 D3+=1.50535
3 167005 D1=0.920674 D2=1.40209 D3+=1.57355
4 182673 D1=0.977923 D2=1.61056 D3+=1.72508
5 186191 D1=0.962728 D2=1.59959 D3+=1.35863
Memory estimate for binary LM:
type       kB
probing 13848 assuming -p 1.5
probing 16479 assuming -r models -p 1.5
trie     6233 without quantization
trie     3161 assuming -q 8 -b 8 quantization 
trie     5646 assuming -a 22 array pointer compression
trie     2573 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Ca

Great, we have built a *5-gram* LM! Let's inspect the first couple of lines.

In [None]:
!head -20 5gram.arpa

\data\
ngram 1=5107
ngram 2=96008
ngram 3=167005
ngram 4=182673
ngram 5=186191

\1-grams:
-4.901245	<unk>	0
0	<s>	-0.11410332
-3.3630016	nửa	-0.22942758
-3.510717	vòng	-0.16067217
-3.171938	trái	-0.2881549
-2.9585598	đất	-0.23522994
-2.6506364	hơn	-0.3561004
-3.070509	bảy	-0.3478618
-2.7738113	năm	-0.3887286
-2.9585598	bốn	-0.46917245
-3.070509	chiếc	-0.31345212
-3.5256321	trống	-0.13543405
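
Each entry in the `\1-grams:` section follows the ARPA layout: a base-10 log-probability, the token, and a base-10 back-off weight, separated by tabs. The sketch below parses one such line and converts the log-probability back to a plain probability. The helper name `parse_arpa_entry` is hypothetical; the sample line is copied from the output above.

```python
def parse_arpa_entry(line):
    """Split one ARPA n-gram line into (log10_prob, words, log10_backoff).

    Fields are tab-separated; the back-off weight is absent for the
    highest-order n-grams, in which case None is returned for it.
    """
    fields = line.rstrip("\n").split("\t")
    log10_prob = float(fields[0])
    words = fields[1].split(" ")
    log10_backoff = float(fields[2]) if len(fields) > 2 else None
    return log10_prob, words, log10_backoff


log10_prob, words, backoff = parse_arpa_entry("-2.6506364\thơn\t-0.3561004")
print(words, 10 ** log10_prob)
```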


There is a small problem that 🤗 Transformers will not be happy about later on.
The *5-gram* correctly includes an "Unknown" or `<unk>` token, as well as a *begin-of-sentence* `<s>` token, but no *end-of-sentence* `</s>` token.
Unfortunately, this currently has to be corrected after the build.

We can simply add the *end-of-sentence* token by duplicating the `<s>` line directly below itself with `<s>` replaced by `</s>` (here this gives `0 </s> -0.11410332`) and increasing the `ngram 1` count by 1. The script below streams through the file once, so its runtime grows with the file size; the ARPA file built here has roughly 640,000 n-gram entries (the sum of the counts in its header).

In [None]:
# Copy the ARPA file, bumping the 1-gram count and inserting an </s> entry after <s>.
with open("5gram.arpa", "r", encoding="utf-8") as read_file, open("5gram_correct.arpa", "w", encoding="utf-8") as write_file:
  has_added_eos = False
  for line in read_file:
    if not has_added_eos and "ngram 1=" in line:
      # Header line: one more unigram will be added below.
      count = line.strip().split("=")[-1]
      write_file.write(f"ngram 1={int(count)+1}\n")
    elif not has_added_eos and "<s>" in line:
      # Keep the <s> entry and duplicate it as </s>.
      write_file.write(line)
      write_file.write(line.replace("<s>", "</s>"))
      has_added_eos = True
    else:
      write_file.write(line)
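
A quick sanity check for this kind of patch is to confirm that the declared 1-gram count matches the number of entries in the `\1-grams:` section and that `</s>` is present. The following is a minimal sketch operating on a tiny hand-written ARPA fragment; the function name `check_unigram_section` is hypothetical and the sample values are made up.

```python
def check_unigram_section(arpa_lines):
    """Return (declared_count, actual_count, has_eos) for the 1-gram section."""
    declared_count = None
    actual_count = 0
    has_eos = False
    in_unigrams = False
    for raw_line in arpa_lines:
        line = raw_line.strip()
        if line.startswith("ngram 1="):
            declared_count = int(line.split("=")[1])
        elif line == "\\1-grams:":
            in_unigrams = True
        elif in_unigrams and line.startswith("\\"):
            break  # the next section (e.g. \2-grams:) begins
        elif in_unigrams and line:
            actual_count += 1
            if line.split("\t")[1] == "</s>":
                has_eos = True
    return declared_count, actual_count, has_eos


# Tiny hand-written ARPA fragment, for illustration only.
sample = [
    "\\data\\",
    "ngram 1=3",
    "",
    "\\1-grams:",
    "-4.9\t<unk>\t0",
    "0\t<s>\t-0.11",
    "0\t</s>\t-0.11",
    "",
    "\\2-grams:",
]
print(check_unigram_section(sample))
```

Since the function only iterates over lines, it could also be given an open file handle for `5gram_correct.arpa` instead of the sample list.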

Let's now inspect the corrected *5-gram*.

In [None]:
!head -20 5gram_correct.arpa

\data\
ngram 1=5108
ngram 2=96008
ngram 3=167005
ngram 4=182673
ngram 5=186191

\1-grams:
-4.901245	<unk>	0
0	<s>	-0.11410332
0	</s>	-0.11410332
-3.3630016	nửa	-0.22942758
-3.510717	vòng	-0.16067217
-3.171938	trái	-0.2881549
-2.9585598	đất	-0.23522994
-2.6506364	hơn	-0.3561004
-3.070509	bảy	-0.3478618
-2.7738113	năm	-0.3887286
-2.9585598	bốn	-0.46917245
-3.070509	chiếc	-0.31345212


Great, this looks better! We're done at this point and all that is left to do is to correctly integrate the `"ngram"` with [`pyctcdecode`](https://github.com/kensho-technologies/pyctcdecode) and 🤗 Transformers.

## **4. Combine an *n-gram* with Wav2Vec2**

In a final step, we want to wrap the *5-gram* into a `Wav2Vec2ProcessorWithLM` object to make the *5-gram* boosted decoding as seamless as shown in Section 1.
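We start by downloading the currently "LM-less" processor of the fine-tuned checkpoint. The text uses [`xls-r-300m-sv`](https://huggingface.co/hf-test/xls-r-300m-sv) as its running example; the cells below load `aiface/mms_s2t_vivos` instead.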

In [None]:
model_name = "aiface/mms_s2t_vivos"

In [None]:
from transformers import AutoProcessor

# token=True reuses the locally stored Hugging Face token instead of a hard-coded one.
processor = AutoProcessor.from_pretrained(model_name, token=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Next, we extract the vocabulary of its tokenizer as it represents the `"labels"` of `pyctcdecode`'s `BeamSearchDecoder` class.

In [None]:
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
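
The sort by token id matters because the decoder expects the i-th label to correspond to the i-th column of the model's logits. The toy sketch below shows the effect on a hypothetical four-token vocabulary; the names `toy_vocab`, `labels` and `frame_ids` and all values are made up for illustration.

```python
# Hypothetical miniature vocabulary; real tokenizers have many more entries.
toy_vocab = {"B": 2, "<pad>": 0, "A": 1, "|": 3}

# Sort by token id so that labels[i] is the token for logit column i.
labels = [token.lower() for token, _ in sorted(toy_vocab.items(), key=lambda item: item[1])]
print(labels)

# Per-frame argmax ids (made up) map straight onto the sorted labels.
frame_ids = [1, 1, 0, 2, 2, 3]
print([labels[i] for i in frame_ids])
```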

The `"labels"` and the previously built `5gram_correct.arpa` file are all that's needed to build the decoder.

In [None]:
from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path="5gram_correct.arpa",
)



We can safely ignore the warning and all that is left to do now is to wrap the just created `decoder`, together with the processor's `tokenizer` and `feature_extractor` into a `Wav2Vec2ProcessorWithLM` class.

In [None]:
from transformers import Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)

We want to directly upload the LM-boosted processor into
the model repository itself (`aiface/mms_s2t_vivos` in the cells below, standing in for [`xls-r-300m-sv`](https://huggingface.co/hf-test/xls-r-300m-sv)) to have all relevant files in one place.

Let's clone the repo, add the new decoder files and upload them afterward.
First, we need to install `git-lfs`.

In [None]:
!sudo apt-get install git-lfs tree

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 18 not upgraded.
Need to get 47.9 kB of archives.
After this operation, 116 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tree amd64 2.0.2-1 [47.9 kB]
Fetched 47.9 kB in 1s (67.6 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package tree.
(Reading 

Cloning and uploading of modeling files can be done conveniently with the `huggingface_hub`'s `Repository` class.

For more information on how to use the `huggingface_hub` to upload any files, please take a look at the [official docs](https://huggingface.co/docs/hub/how-to-upstream).

In [None]:
model_name

'aiface/mms_s2t_vivos'

In [None]:
from huggingface_hub import Repository

# token=True reuses the locally stored Hugging Face token instead of a hard-coded one.
repo = Repository(local_dir="xls-r-300m-vi", clone_from=model_name, token=True)

Cloning https://huggingface.co/aiface/mms_s2t_vivos into local empty directory.


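Having cloned the repository into the local folder `xls-r-300m-vi`, let's save the new processor with LM into it. We first delete any `language_model` folder that is already there so that old files do not linger.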

In [None]:
!rm -rf /content/xls-r-300m-vi/language_model

In [None]:
processor_with_lm.save_pretrained("xls-r-300m-vi")

Let's inspect the local repository. The `tree` command conveniently can also show the size of the different files.

In [None]:
!tree -h xls-r-300m-vi/

[4.0K]  xls-r-300m-vi/
├── [  30]  added_tokens.json
├── [ 848]  alphabet.json
├── [2.0K]  config.json
├── [4.0K]  language_model
│   ├── [ 25M]  5gram_correct.arpa
│   ├── [  78]  attrs.json
│   └── [ 30K]  unigrams.txt
├── [ 262]  preprocessor_config.json
├── [3.6G]  pytorch_model.bin
├── [ 608]  special_tokens_map.json
├── [1.2K]  tokenizer_config.json
├── [3.9K]  training_args.bin
└── [1.1K]  vocab.json

1 directory, 12 files


As can be seen, the *5-gram* LM in `.arpa` format takes up about 25 MB.
To reduce the size of the *n-gram* and make loading faster, `kenLM` allows converting `.arpa` files to binary ones using the `build_binary` executable.

Let's make use of it here.

In [None]:
!kenlm/build/bin/build_binary xls-r-300m-vi/language_model/5gram_correct.arpa xls-r-300m-vi/language_model/5gram.bin

Reading xls-r-300m-vi/language_model/5gram_correct.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


Great, it worked! Let's remove the `.arpa` file and check the size of the binary *5-gram* LM.

In [None]:
!rm xls-r-300m-vi/language_model/5gram_correct.arpa && tree -h xls-r-300m-vi/

[4.0K]  xls-r-300m-vi/
├── [  30]  added_tokens.json
├── [ 848]  alphabet.json
├── [2.0K]  config.json
├── [4.0K]  language_model
│   ├── [ 14M]  5gram.bin
│   ├── [  78]  attrs.json
│   └── [ 30K]  unigrams.txt
├── [ 262]  preprocessor_config.json
├── [3.6G]  pytorch_model.bin
├── [ 608]  special_tokens_map.json
├── [1.2K]  tokenizer_config.json
├── [3.9K]  training_args.bin
└── [1.1K]  vocab.json

1 directory, 12 files


Nice, we reduced the *n-gram* from 25 MB to 14 MB. In the final step, let's upload all files.

In [None]:
repo.push_to_hub(commit_message="Upload lm-boosted decoder")

To https://huggingface.co/aiface/mms_s2t_vivos
   d5d2ef9..7b5328a  main -> main

'https://huggingface.co/aiface/mms_s2t_vivos/commit/7b5328a03bf4efc42826f1e4a00dd11f215179a2'

That's it. Now you should be able to use the *5gram* for LM-boosted decoding as shown in Section 1.

As can be seen on [`xls-r-300m-sv`'s model card](https://huggingface.co/hf-test/xls-r-300m-sv#inference-with-lm), our *5gram* LM-boosted decoder yields a WER of 18.85% on the Common Voice 7 test set, which is a relative improvement of *ca.* 30% 🔥.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id="vietnamese_s2t", private=True)

RepoUrl('https://huggingface.co/aiface/vietnamese_s2t', endpoint='https://huggingface.co', repo_type='model', repo_id='aiface/vietnamese_s2t')

In [None]:
from huggingface_hub import HfApi
api = HfApi()

api.upload_folder(
    folder_path="/content/xls-r-300m-vi",
    repo_id="aiface/vietnamese_s2t",
)

'https://huggingface.co/aiface/vietnamese_s2t/tree/main/'