<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_asr_asr-python-advanced-customize-vocabulary-and-lexicon/nvidia_logo.png" style="width: 90px; float: right;">

# How to Customize Riva ASR Vocabulary and Pronunciation with Lexicon Mapping

This notebook walks you through the process of customizing Riva ASR vocabulary and lexicon, in order to improve Riva vocabulary coverage and recognition of difficult words, such as acronyms.

## Overview

The Flashlight decoder, deployed by default in Riva, is a lexicon-based decoder and only emits words that are present in the provided lexicon file. As a result, uncommon and new words, such as domain-specific terminology, cannot be generated unless they appear in the lexicon file.

In contrast, the greedy decoder (available as an option during the `riva-build` process with the flag `--decoder_type=greedy`) is not lexicon-based and can therefore produce virtually any word or character sequence.
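To make the lexicon constraint concrete, here is a minimal sketch that lists which words of interest a lexicon-based decoder could never emit. The function name `out_of_lexicon` and the toy word lists are illustrative assumptions, not part of Riva, and the comparison is an exact, case-sensitive match.

```python
def out_of_lexicon(words, lexicon_words):
    """Return the words (in order) that are absent from the lexicon.

    A lexicon-based decoder can only emit words listed in its lexicon,
    so anything returned here can never appear in a transcript.
    """
    known = set(lexicon_words)
    return [w for w in words if w not in known]


# Toy lexicon and toy domain terms, for illustration only.
lexicon_words = ["the", "i", "to", "and", "a"]
domain_terms = ["the", "antiberta", "ablooper"]
print(out_of_lexicon(domain_terms, lexicon_words))
```

Running a check like this on your own domain terms is a quick way to decide whether customization is needed at all.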

### Prerequisites

This notebook assumes that you are familiar with manually deploying a Riva ASR pipeline using the Riva ServiceMaker tool and its `riva-build` and `riva-deploy` commands. <br>
These were covered in the primer notebook `1_deploy_speech_recognition_pipeline.ipynb` on deploying a speech recognition pipeline. Please run through that notebook first. Also, see the Riva [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/model-overview.html#) for additional details.



### Terminologies
- **Vocabulary file**: The vocabulary file is a flat text file containing a list of vocabulary words, each on its own line. For example:
```
the
i
to
and
a
you
of
that
...
```

This file is used by the `riva-build` process to generate the lexicon file. 

- **Lexicon file**: The lexicon file is a flat text file that maps each vocabulary word to its tokenized form (e.g., sentencepiece tokens). The word and its tokenized form are separated by a `tab`. Below is an example:

```
with    ▁with
not     ▁not
this    ▁this
just    ▁just
my      ▁my
as      ▁as
don't   ▁don ' t
```

*Note: Ultimately, the Riva decoder makes use only of the lexicon file directly at run time (but not the vocabulary file).*

Riva ServiceMaker automatically tokenizes the words in the vocabulary file to generate the lexicon file. It uses the correct tokenizer model that is packaged together with the acoustic model in the `.riva` file. By default, Riva generates 1 tokenized form for each word in the vocabulary file. You will learn more about finetuning the acoustic model and the .riva format in subsequent notebooks in this lab.
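The lexicon format described above can be read with a few lines of Python. This is a minimal sketch, assuming each line is `word<TAB>token token ...`; the function name `parse_lexicon` is hypothetical. It stores a list of tokenized forms per word because a customized lexicon may list the same word on several lines.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map each word to the list of its tokenized forms.

    Each non-blank line is expected to be `word<TAB>token token ...`.
    A line without a tab raises ValueError.
    """
    lexicon = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        word, tokens = line.split("\t", 1)
        lexicon[word].append(tokens.split())
    return dict(lexicon)


# Two entries taken from the example above.
sample = ["with\t▁with\n", "don't\t▁don ' t\n"]
print(parse_lexicon(sample))
```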

---
## What can be customized?

Both the vocabulary and the lexicon files can be customized.

1. **Extending the vocabulary** enriches the Riva default vocabulary, providing additional coverage for out-of-vocabulary words, terminologies, and abbreviations.

2. **Customizing the lexicon file** can further enrich the Riva knowledge base by providing one or more explicit pronunciations, in the form of tokenized sequences.

---
## 1. Extending the vocabulary

Extending the vocabulary must be done at Riva **build** time.

When building a Riva ASR pipeline, pass the [extended vocabulary file](#modify_vocab) to the `--decoding_vocab=<vocabulary_file>` parameter of the build command. For example, the build command for the Citrinet model:

```
   # Pass the modified vocabulary file via --decoding_vocab
   riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=citrinet-1024-english-asr-streaming \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<vocabulary_file> \
   --language_code=en-US \
   <other_parameters>...
```

Refer to Riva [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/service-asr.html#pipeline-configuration) for build commands for supported models.




<a id='modify_vocab'></a>
### How to modify vocabulary file

You can either provide your own vocabulary file, or extend Riva's default vocabulary file.

- **BYO vocabulary file**: provide a flat text file containing a list of vocabulary words, each on its own line. Note that this file must contain not just a small list of "difficult words", but all the words that you want the ASR pipeline to be able to generate, including all common words.

- **Modifying an existing one**: This is the recommended approach. Out-of-the-box vocabulary files for Riva supported languages can be found either:
    1. **On NGC**
    2. Or **In a local Riva deployment**

You can make a copy, then extend these default vocabulary files with the words of interest.

#### Option 1. Modifying an existing vocab - Vocabulary file from NGC

**On NGC**, for example, for English, the vocabulary file named `flashlight_decoder_vocab.txt` can be found at this [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_lm/files?version=deployable_v1.1).

We downloaded the deployable ASR models, including the [Language Model](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_citrinet/files?version=deployable_v3.0), from NGC into `ASR_MODEL_DIR` in the first notebook in this lab.

In [33]:
import os

ASR_MODEL_DIR = os.path.join(os.getcwd(), "asr-models")

!head -n 5 $ASR_MODEL_DIR/speechtotext_en_us_lm_vdeployable_v1.1/flashlight_decoder_vocab.txt

the
i
to
and
a


#### Option 2. Modifying an existing vocab - Vocabulary file from local Riva deployment
The actual physical location of Riva assets depends on the value of the `riva_model_loc` variable in the `config.sh` file under the Riva quickstart folder. The vocabulary file is bundled with the Flashlight decoder.
-  By default, `riva_model_loc` is set to `riva-model-repo`, which is a docker volume. You can inspect this docker volume and copy the vocabulary file from within the docker volume to the host file system with commands such as:
        
    ```bash
    # Inspect the Riva model docker volume
    docker inspect riva-model-repo

    # Inspect the content of the Riva model docker volume
    docker run --rm -v riva-model-repo:/riva-model-repo alpine ls /riva-model-repo

    # Copy the vocabulary file from the docker volume to the current directory
    docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/dict_vocab.txt /dest
    ```
        
- If you modify `riva_model_loc` to an absolute path pointing to a folder, then the specified folder in the local file system will be used to store Riva assets instead. Assuming `<RIVA_REPO_DIR>` is the directory where Riva assets are stored, then the vocabulary file can similarly be found under, for example, `<RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/dict_vocab.txt`.
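When the assets live in a local folder, the decoder directory name varies with the deployed pipeline. The following is a minimal sketch for locating the file without typing the full path. The helper name `find_decoder_files` is hypothetical, and the glob pattern is an assumption based on the example path above.

```python
from pathlib import Path


def find_decoder_files(repo_dir, filename="dict_vocab.txt"):
    """List matching files under <repo_dir>/models/*ctc-decoder*/<version>/.

    Returns a sorted list of Path objects (empty if nothing matches).
    """
    pattern = f"models/*ctc-decoder*/*/{filename}"
    return sorted(Path(repo_dir).glob(pattern))
```

For example, `find_decoder_files("<RIVA_REPO_DIR>")` would return the vocabulary files of every deployed CTC decoder, and passing a different `filename` looks for other files in the same directories.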

In [42]:
# Path to Riva Quickstart - this was downloaded as part of the first notebook
RIVA_QSS_DIR = os.path.join(os.getcwd(), "riva_quickstart_v2.2.1")

# In the first notebook, we modified our config.sh to point to an absolute path, where our riva-deploy stored the ASR model
!sed -n 58,64p $RIVA_QSS_DIR/config.sh

# Models ($riva_model_loc/models)
# During the riva_init process, the RMIR files in $riva_model_loc/rmir
# are inspected and optimized for deployment. The optimized versions are
# stored in $riva_model_loc/models. The riva server exclusively uses these
# optimized versions.
riva_model_loc="/home/shashankv/riva/ai-launchpad/tutorials/asr-models"



You can make a copy, then extend this default vocabulary file with the words of interest.
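Copying and extending the vocabulary file can be sketched as follows. This is a minimal example under stated assumptions: the function names are hypothetical, the file is plain UTF-8 text with one word per line, and new words are added exactly as given (no case normalization is applied).

```python
def extend_vocabulary(existing_words, new_words):
    """Return the existing words followed by any new words not already present."""
    seen = set(existing_words)
    extended = list(existing_words)
    for word in new_words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            extended.append(word)
    return extended


def extend_vocabulary_file(src_path, dst_path, new_words):
    """Copy a one-word-per-line vocabulary file, appending unseen words.

    Returns the number of words that were added.
    """
    with open(src_path, encoding="utf-8") as src:
        existing = [line.strip() for line in src if line.strip()]
    extended = extend_vocabulary(existing, new_words)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(extended) + "\n")
    return len(extended) - len(existing)
```

Writing to a separate destination file keeps the default vocabulary intact, so you can always rebuild with the original.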


**Once modified, you'll have to rebuild the Riva ASR pipeline with `riva-build`, passing the flag `--decoding_vocab=<modified_vocabulary_file>`, and then redeploy it.** <br>
This notebook does not run that workflow. Instead, it runs the workflow of the next section, customizing pronunciation with lexicon mapping, whose rebuild and redeployment steps are nearly identical.

---
## 2. Customizing pronunciation with lexicon mapping

The lexicon file that is used by the Flashlight decoder can be found in the Riva assets directory, as specified by the value of the `riva_model_loc` variable in the `config.sh` file under the Riva quickstart folder (see above).

- If `riva_model_loc` points to a docker volume (by default), you can find and copy the lexicon file with:
```bash
        # Copy the lexicon file from the docker volume to the current directory
        docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp  /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt /dest
```      
<br>
- If you modify `riva_model_loc` to an absolute path pointing to a folder, then the specified folder in the local file system will be used to store Riva assets instead. Assuming `<ASR_MODEL_DIR>` is the directory where Riva assets are stored, then the lexicon file can similarly be found under, for example, `<ASR_MODEL_DIR>/models/citrinet-1024-en-US-asr-<offline/streaming>-ctc-decoder-cpu-offline/1/lexicon.txt`.

In [47]:
ASR_MODEL_DIR = os.path.join(os.getcwd(), "asr-models")

!head -5 $ASR_MODEL_DIR/models/citrinet-1024-en-US-asr-streaming-ctc-decoder-cpu-offline/1/lexicon.txt

the	▁the
i	▁i
to	▁to
and	▁and
a	▁a


### How to generate the correct tokenized form

When modifying the lexicon file, ensure that:

- The new lines follow the same format as the rest of the file: the word, a `tab`, then the space-separated tokens.

- The tokens are valid tokens as determined by the tokenizer model (packaged with the Riva acoustic model).

The latter ensures that you use only tokens that the acoustic model has been trained on. To do this, you’ll need the tokenizer model and the `sentencepiece` Python package (`pip install sentencepiece`). <br>

You can get the tokenizer model for the deployed pipeline from:

1. The model repository `ctc-decoder-...` directory for your model. It will be named `<hash>_tokenizer.model`. For example:

`<ASR_MODEL_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/498056ba420d4bb3831ad557fba06032_tokenizer.model`

2. When using a docker volume to store Riva assets (by default), you can copy the tokenizer model to the local directory with a command such as:

```bash
    # Copy the tokenizer model file from the docker volume to the current directory
    docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp  /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/498056ba420d4bb3831ad557fba06032_tokenizer.model /dest
```  

In [48]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com, https://urm.nvidia.com/artifactory/api/pypi/sw-colossus-pypi/simple


In [60]:
import sentencepiece as spm

You can then generate new lexicon entries, for example:
    

In [70]:
# We follow option 1, i.e., we use the tokenizer model from the model repository that we specified in config.sh and deployed
TOKENIZER_MODEL = os.path.join(ASR_MODEL_DIR, "models/citrinet-1024-en-US-asr-streaming-ctc-decoder-cpu-offline/1/498056ba420d4bb3831ad557fba06032_tokenizer.model")

In [77]:
TOKEN="BRAF"
PRONUNCIATION="b raf"

s = spm.SentencePieceProcessor(model_file=TOKENIZER_MODEL)
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))

BRAF	▁ b ▁ra f
BRAF	▁b ▁ r a f
BRAF	▁b ▁ ra f
BRAF	▁b ▁ra f
BRAF	▁ b ▁r a f


Note: `TOKEN` represents the desired written form of the word, while `PRONUNCIATION` is what the word should sound like.
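Before adding a candidate line to the lexicon, both requirements listed earlier (format and token validity) can be checked mechanically. The sketch below is illustrative: `check_lexicon_line` is a hypothetical helper, and the piece set is a toy stand-in. With the tokenizer loaded as `s`, the real set could presumably be built with `{s.id_to_piece(i) for i in range(s.get_piece_size())}`.

```python
def check_lexicon_line(line, valid_pieces):
    """Return a list of problems found in one lexicon line (empty if none)."""
    problems = []
    word, sep, tokens = line.rstrip("\n").partition("\t")
    if not sep:
        return ["missing tab between word and tokens"]
    if not word:
        problems.append("empty word")
    pieces = tokens.split()
    if not pieces:
        problems.append("no tokens")
    unknown = [p for p in pieces if p not in valid_pieces]
    if unknown:
        problems.append("unknown tokens: " + " ".join(unknown))
    return problems


# Toy piece set for illustration; in practice it would come from the tokenizer model.
toy_pieces = {"▁b", "▁ra", "f", "▁", "r", "a"}
print(check_lexicon_line("BRAF\t▁b ▁ra f", toy_pieces))  # well-formed line
print(check_lexicon_line("BRAF ▁b ▁ra f", toy_pieces))   # space instead of tab
```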

Other examples:

In [78]:
TOKEN="WhatsApp"
PRONUNCIATION="what's app"

s = spm.SentencePieceProcessor(model_file=TOKENIZER_MODEL)
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))

WhatsApp	▁ w h a t ' s ▁a p p
WhatsApp	▁what ' s ▁a p p
WhatsApp	▁what ' s ▁ ap p
WhatsApp	▁what ' s ▁app
WhatsApp	▁what ' s ▁app
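Because the tokenizations are sampled, the same sequence can come up more than once, as in the last two lines of the output above. A minimal sketch for turning sampled tokenizations into unique lexicon lines follows; `unique_lexicon_lines` is a hypothetical helper that takes the token lists as plain Python lists.

```python
def unique_lexicon_lines(word, tokenizations):
    """Format `word<TAB>tokens` lines, dropping repeated tokenizations."""
    lines = []
    seen = set()
    for pieces in tokenizations:
        key = tuple(pieces)
        if key not in seen:
            seen.add(key)
            lines.append(word + "\t" + " ".join(pieces))
    return lines


# The last three sampled tokenizations from the WhatsApp output above.
samples = [
    ["▁what", "'", "s", "▁", "ap", "p"],
    ["▁what", "'", "s", "▁app"],
    ["▁what", "'", "s", "▁app"],
]
for line in unique_lexicon_lines("WhatsApp", samples):
    print(line)
```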


In [79]:
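TOKEN="Cya"
# Note: the capital "S" comes out as its own piece in the output below. The default
# lexicon is all lowercase, so a lowercase pronunciation ("see ya") may be the safer choice.
PRONUNCIATION="See ya"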

s = spm.SentencePieceProcessor(model_file=TOKENIZER_MODEL)
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))

Cya	▁ S e e ▁y a
Cya	▁ S e e ▁ y a
Cya	▁ S e e ▁y a
Cya	▁ S e e ▁ y a
Cya	▁ S e e ▁y a


In [116]:
TOKEN="antiberta"
PRONUNCIATION="anti berta aa"

# Reuse the tokenizer model path defined above
s = spm.SentencePieceProcessor(model_file=TOKENIZER_MODEL)
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))

antiberta	▁a n t i ▁be r ta ▁ a a
antiberta	▁ an t i ▁be r ta ▁ a a
antiberta	▁an t i ▁ b er ta ▁a a
antiberta	▁ ant i ▁be r ta ▁a a
antiberta	▁an t i ▁b er ta ▁a a


In [118]:
TOKEN="ablooper"
PRONUNCIATION="a blooper"

# Reuse the tokenizer model path defined above
s = spm.SentencePieceProcessor(model_file=TOKENIZER_MODEL)
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))

ablooper	▁a ▁b lo op er
ablooper	▁a ▁b l o o pe r
ablooper	▁a ▁b l o o pe r
ablooper	▁a ▁ b l o o per
ablooper	▁ a ▁ b lo op er


In [157]:
TOKEN="shashank"
# Alternative pronunciation to try: "sh uh sh ah nk"
PRONUNCIATION="shuh sha ah nk"

# Reuse the tokenizer model path defined above
s = spm.SentencePieceProcessor(model_file=TOKENIZER_MODEL)
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))

shashank	▁sh u h ▁sh a ▁ah ▁ n k
shashank	▁ s huh ▁sha ▁a h ▁n k
shashank	▁ s huh ▁s h a ▁a h ▁ n k
shashank	▁s huh ▁ s ha ▁ah ▁n k
shashank	▁ s huh ▁s h a ▁a h ▁n k




### How to modify the lexicon file

First, locate and make a copy of the lexicon file. For example:
```
cp <ASR_MODEL_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt modified_lexicon.txt
```

In [86]:
!cp $ASR_MODEL_DIR/models/citrinet-1024-en-US-asr-streaming-ctc-decoder-cpu-offline/1/lexicon.txt $ASR_MODEL_DIR/modified_lexicon.txt

In [109]:
!tail -n 5 $ASR_MODEL_DIR/modified_lexicon.txt

zyuganov	▁z y u g an o v
zyx	▁z y x
zyzzyva	▁z y z z y v a
AntiBERTa	▁a n t i ▁b er t ▁a a
ABlooper	▁a b l o o pe r


Next, modify it to add the sentencepiece tokenizations for the words of interest. For example, one could add:
```
manu ▁ma n u
manu ▁man n n ew
manu ▁man n ew
```
These are three different pronunciations/tokenizations of the word `manu`. If the acoustic model predicts any of these token sequences, it will be decoded as `manu`.

In [103]:
!echo -e "antiberta\t▁a n t i ▁b er t ▁a a" >> $ASR_MODEL_DIR/modified_lexicon.txt
!echo -e "ablooper\t▁a ▁b l o o pe r" >> $ASR_MODEL_DIR/modified_lexicon.txt
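The behavior of `echo -e` depends on the shell that runs the command, so it can be worth appending entries from Python instead. The following is a minimal sketch, not the notebook's original method. The helper name `append_lexicon_entries` is hypothetical, and it assumes a UTF-8 lexicon file with one tab-separated entry per line.

```python
def append_lexicon_entries(path, entries):
    """Append `(word, tokens)` pairs to a lexicon file as tab-separated lines.

    Lines already present in the file are not written again.
    Returns the number of lines actually appended.
    """
    try:
        with open(path, encoding="utf-8") as fh:
            content = fh.read()
    except FileNotFoundError:
        content = ""
    existing = set(content.splitlines())
    new_lines = []
    for word, tokens in entries:
        line = f"{word}\t{tokens}"
        if line not in existing:
            existing.add(line)
            new_lines.append(line)
    with open(path, "a", encoding="utf-8") as fh:
        # Make sure the first new entry starts on its own line.
        if content and not content.endswith("\n"):
            fh.write("\n")
        for line in new_lines:
            fh.write(line + "\n")
    return len(new_lines)
```

A call such as `append_lexicon_entries(os.path.join(ASR_MODEL_DIR, "modified_lexicon.txt"), [("antiberta", "▁a n t i ▁b er t ▁a a")])` would add the first entry from the cell above, and re-running it would not create a duplicate line.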

Finally, regenerate the model repository with the new lexicon by passing `--decoding_lexicon=modified_lexicon.txt` to `riva-build` instead of `--decoding_vocab=decoding_vocab.txt`.
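The choice between the two flags can be captured in a small helper when the build command is assembled from Python. This is a sketch under an assumption drawn from the wording above (one flag is passed instead of the other); it has not been checked against `riva-build` itself, and `decoder_flags` is a hypothetical name.

```python
def decoder_flags(lm_binary, vocab_file=None, lexicon_file=None):
    """Build the Flashlight decoder flags for `riva-build` as a list of strings.

    Exactly one of `vocab_file` or `lexicon_file` must be given, mirroring
    the "one instead of the other" usage described in this notebook.
    """
    if (vocab_file is None) == (lexicon_file is None):
        raise ValueError("pass exactly one of vocab_file or lexicon_file")
    flags = [
        "--decoder_type=flashlight",
        f"--decoding_language_model_binary={lm_binary}",
    ]
    if lexicon_file is not None:
        flags.append(f"--decoding_lexicon={lexicon_file}")
    else:
        flags.append(f"--decoding_vocab={vocab_file}")
    return flags


print(" ".join(decoder_flags("<lm_binary>", lexicon_file="modified_lexicon.txt")))
```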

Let's now put these ideas into action!

---
# Exercise

## Step 1. Default Inference

In [121]:
import io
import IPython.display as ipd
import grpc

import riva_api.riva_asr_pb2 as rasr
import riva_api.riva_asr_pb2_grpc as rasr_srv
import riva_api.riva_audio_pb2 as ra


channel = grpc.insecure_channel('localhost:50051')
riva_asr = rasr_srv.RivaSpeechRecognitionStub(channel)


# Load a sample audio file from local disk
# This example uses a .wav file with LINEAR_PCM encoding.
# path = "audio_samples/en-US_wordboosting_sample1.wav"
path = "audio_samples/Name.wav"
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)

In [158]:
# Creating RecognitionConfig
config = rasr.RecognitionConfig(
  language_code="en-US",
  max_alternatives=1,
  enable_automatic_punctuation=True,
  audio_channel_count = 1
)

# Creating RecognizeRequest
req = rasr.RecognizeRequest(audio = content, config = config)

# ASR Inference call with Recognize 
response = riva_asr.Recognize(req)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript without modifying Lexicon:", asr_best_transcript)

ASR Transcript without modifying Lexicon: My name is shah. 


ASR has a hard time recognizing uncommon words, such as the name in this sample or domain-specific terms like `AntiBERTa` and `ABlooper`. <br>
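The inference cell above indexes `response.results[0]` directly, which raises an `IndexError` if the response contains no results. A more defensive lookup could look like the sketch below. `best_transcript` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins shaped like the fields used above, not real Riva messages.

```python
from types import SimpleNamespace


def best_transcript(response, default=""):
    """Return the top transcript of the first result, or `default` if absent."""
    if not response.results:
        return default
    alternatives = response.results[0].alternatives
    if not alternatives:
        return default
    return alternatives[0].transcript


# Stand-in objects for illustration only.
fake = SimpleNamespace(
    results=[SimpleNamespace(alternatives=[SimpleNamespace(transcript="My name is shah. ")])]
)
empty = SimpleNamespace(results=[])
print(repr(best_transcript(fake)))
print(repr(best_transcript(empty)))
```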

## Step 2. Riva-build

In [159]:
# Stop the existing server
RIVA_QSS_DIR = os.path.join(os.getcwd(), "riva_quickstart_v2.2.1")

! cd $RIVA_QSS_DIR && chmod +x riva_stop.sh
! cd $RIVA_QSS_DIR && ./riva_stop.sh config.sh

Shutting down docker containers...


In [160]:
# ServiceMaker Docker
RIVA_SM_CONTAINER = "nvcr.io/nvidia/riva/riva-speech:2.2.1-servicemaker"

# Directory where the Acoustic .riva model is stored $MODEL_LOC/*.riva
MODEL_LOC = os.path.join(ASR_MODEL_DIR, "speechtotext_en_us_citrinet_vdeployable_v3.0")

# Name of the .riva file
MODEL_NAME = "citrinet-1024-Jarvis-asrset-3_0-encrypted.riva"

# Key that model is encrypted with, while exporting with TAO
KEY = "tlt_encode"

Run `riva-build` for the offline recognition use case. Reference: [Pipeline configuration](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-customizing.html#pipeline-configuration)

Note that we pass `--decoding_lexicon=modified_lexicon.txt` to `riva-build` instead of `--decoding_vocab=decoding_vocab.txt`.

In [161]:
# This will create "asr-mod-lexicon.rmir"
! docker run --rm --gpus 0 -v $ASR_MODEL_DIR:/data $RIVA_SM_CONTAINER -- \
            riva-build speech_recognition /data/asr-mod-lexicon.rmir:$KEY /data/speechtotext_en_us_citrinet_vdeployable_v3.0/$MODEL_NAME:$KEY --force --offline \
            --streaming=False \
            --wfst_tokenizer_model=/data/inverse_normalization_en_us_vdeployable_v1.0/tokenize_and_classify.far \
            --wfst_verbalizer_model=/data/inverse_normalization_en_us_vdeployable_v1.0/verbalize.far \
            --name=citrinet-1024-en-US-asr-streaming \
            --ms_per_timestep=80 \
            --featurizer.use_utterance_norm_params=False \
            --featurizer.precalc_norm_time_steps=0 \
            --featurizer.precalc_norm_params=False \
            --vad.residue_blanks_at_start=-2 \
            --chunk_size=300 \
            --left_padding_size=0. \
            --right_padding_size=0. \
            --decoder_type=flashlight \
            --flashlight_decoder.asr_model_delay=-1 \
            --decoding_language_model_binary=/data/speechtotext_en_us_lm_vdeployable_v1.1/riva_asr_train_datasets_3gram.binary  \
            --decoding_lexicon=/data/modified_lexicon.txt \
            --flashlight_decoder.lm_weight=0.2 \
            --flashlight_decoder.word_insertion_score=0.2 \
            --flashlight_decoder.beam_threshold=20. \
            --language_code=en-US


=== Riva Speech Skills ===

NVIDIA Release 22.05.1 (build 38809663)
Riva Speech Server Version 2.2.1
Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

https://developer.nvidia.com/tensorrt

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

To install Python sample dependencies, run /opt/tensorrt/python/python_setup.sh

To install the open-source samples corresponding to this TensorRT release version
run /opt/tensorrt/install_opensource.sh.  To build the open source parsers,
plugins, and samples for current top-of-tree on master or a different branch,
run /opt/tensorrt/install_

Next, we run `riva-deploy` to generate the model repository.

In [162]:
# Syntax: riva-deploy -f dir-for-rmir/model.rmir:key output-dir-for-repository
! docker run --rm --gpus 0 -v $ASR_MODEL_DIR:/data $RIVA_SM_CONTAINER -- \
            riva-deploy -f  /data/asr-mod-lexicon.rmir:$KEY /data/models/



### Re-deploy Riva pipeline

In [163]:
! cd $RIVA_QSS_DIR && ./riva_start.sh config.sh

Starting Riva Speech Services. This may take several minutes depending on the number of models deployed.
957003f3137aa10517dec8be881191000e69bd968e1998ed1521f9408bbcf447
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...


### Trying the sample again

In [166]:
# Trying again
channel = grpc.insecure_channel('localhost:50051')
riva_asr = rasr_srv.RivaSpeechRecognitionStub(channel)

path = "audio_samples/Name.wav"
with io.open(path, 'rb') as fh:
    content = fh.read()
    
# Creating RecognitionConfig
config = rasr.RecognitionConfig(
  language_code="en-US",
  max_alternatives=1,
  enable_automatic_punctuation=True,
  audio_channel_count = 1
)

# Creating RecognizeRequest
req = rasr.RecognizeRequest(audio = content, config = config)

# ASR Inference call with Recognize 
response = riva_asr.Recognize(req)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript with modified Lexicon:", asr_best_transcript)

ASR Transcript with modified Lexicon: My name is shaan. 
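To judge whether a lexicon change had the intended effect, it helps to check the transcripts for the words you expect. The sketch below is illustrative only: `missing_terms` is a hypothetical helper that handles single-word terms, and the expected term `shashank` is an assumption taken from the tokenization example earlier in this notebook, so replace it with whatever your audio actually contains.

```python
import string


def missing_terms(transcript, expected_terms):
    """Return the expected terms that do not appear as words in the transcript.

    Comparison ignores case and surrounding punctuation.
    """
    words = {w.strip(string.punctuation).lower() for w in transcript.split()}
    return [t for t in expected_terms if t.lower() not in words]


# Transcripts copied from the two inference runs in this notebook.
before = "My name is shah. "
after = "My name is shaan. "
print(missing_terms(before, ["shashank"]))
print(missing_terms(after, ["shashank"]))
```

If a term is still missing after redeployment, one thing to verify is that the modified lexicon actually contains an entry for that term with a plausible tokenization.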
