<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_asr_asr-python-advanced-customize-vocabulary-and-lexicon/nvidia_logo.png" style="width: 90px; float: right;">

# How to Customize Riva ASR Vocabulary and Pronunciation with Lexicon Mapping

This notebook walks through the process of customizing the Riva ASR vocabulary and lexicon to improve vocabulary coverage and the recognition of difficult words, such as acronyms.

---
## Overview

The Flashlight decoder, deployed by default in Riva, is a lexicon-based decoder: it only emits words that are present in the provided lexicon file. As a result, uncommon and new words, such as domain-specific terminology, that are not present in the lexicon file have no chance of being generated.

In contrast, the greedy decoder (available as an option during the `riva-build` process with the flag `--decoder_type=greedy`) is not lexicon-based and hence can produce virtually any word or character sequence.

---
## Pre-requisites

This notebook assumes that the user is familiar with manually deploying a Riva ASR pipeline using the Riva ServiceMaker tools, `riva-build` and `riva-deploy`. <br>
These were covered in the primer notebook `1_deploy_speech_recognition_pipeline.ipynb` on deploying a speech recognition pipeline. Please run through that notebook as a pre-requisite, and ensure you have the initial Riva ASR pipeline deployed. 

In [None]:
# Check if your Riva Speech Server is running
!docker ps

You should see a container with the image `nvcr.io/nvidia/riva/riva-speech:*` running. If not, please execute/re-visit the earlier notebook on deploying a speech recognition pipeline.

---
## Terminologies
- **Vocabulary file**: The vocabulary file is a flat text file containing a list of vocabulary words, each on its own line. For example:
```
with
not
this
just
my  
as  
don't
...
```

This file is used by the `riva-build` process to generate the lexicon file. 

- **Lexicon file**: The lexicon file is a flat text file that maps each vocabulary word to its tokenized form (e.g., sentencepiece tokens). The word and its tokens are separated by a `tab`. Below is an example:

```
with    ▁with
not     ▁not
this    ▁this
just    ▁just
my      ▁my
as      ▁as
don't   ▁don ' t
```

*Note: Ultimately, the Riva decoder makes use only of the lexicon file directly at run time (but not the vocabulary file).*

Riva ServiceMaker automatically tokenizes the words in the vocabulary file to generate the lexicon file, using the tokenizer model that is packaged together with the acoustic model in the `.riva` file. By default, Riva generates one tokenized form for each word in the vocabulary file. You will learn more about fine-tuning the acoustic model and the `.riva` format in subsequent notebooks in this lab.
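
To make the file format concrete, here is a minimal sketch that reads lexicon text into a dictionary mapping each word to its tokenized forms. The helper name `parse_lexicon` and the sample text are illustrative assumptions, not part of Riva. A word that appears on several lines simply collects several tokenizations.

```python
from collections import defaultdict


def parse_lexicon(text):
    """Map each word to the list of its tokenized forms (one per lexicon line)."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()  # the word first, then its tokens
        if not parts:
            continue  # skip blank lines
        word, tokens = parts[0], parts[1:]
        lexicon[word].append(tokens)
    return dict(lexicon)


# Sample lexicon text mirroring the entries shown above (tab after the word)
sample = "with\t▁with\nnot\t▁not\ndon't\t▁don ' t\n"
lexicon = parse_lexicon(sample)
print(lexicon["don't"])
```

On a real deployment, the same function could be applied to the contents of `lexicon.txt` read from disk.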

---
## What can be customized?

Both the vocabulary and the lexicon files can be customized.

1. **Extending the vocabulary** enriches the Riva default vocabulary, providing additional coverage for out-of-vocabulary words, terminologies, and abbreviations.

2. **Customizing the lexicon file** can further enrich the Riva knowledge base by providing one or more explicit pronunciations, in the form of tokenized sequences.

---
## 1. Extending the vocabulary

Extending the vocabulary must be done at Riva **build** time.

When building a Riva ASR pipeline, pass the [extended vocabulary file](#modify_vocab) to the `--decoding_vocab=<vocabulary_file>` parameter of the build command. For example, the build command for the Citrinet model:

```
# Pass the modified vocabulary file through --decoding_vocab
riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=citrinet-1024-english-asr-streaming \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<vocabulary_file> \
   --language_code=en-US \
   <other_parameters>...
```

Refer to Riva [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/service-asr.html#pipeline-configuration) for build commands for supported models.




<a id='modify_vocab'></a>
### How to modify vocabulary file

You can either provide your own vocabulary file, or extend Riva's default vocabulary file.

- **BYO vocabulary file**: provide a flat text file containing a list of vocabulary words, each on its own line. Note that this file must contain not just a small list of "difficult words", but every word that you want the ASR pipeline to be able to generate, including all common words.

- **Modifying an existing vocabulary**: This is the recommended approach. Out-of-the-box vocabulary files for Riva-supported languages can be found in either of two places:
    1. **On NGC**
    2. **In a local Riva deployment**

You can make a copy, then extend these default vocabulary files with the words of interest. Let's take a look at the different options for **modifying an existing vocabulary:**

#### Option 1. Modifying an existing vocab - Vocabulary file from NGC

**On NGC**, for example, for English, the vocabulary file named `flashlight_decoder_vocab.txt` can be found at this [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_lm/files?version=deployable_v1.1).

In the first notebook in this lab, we downloaded the deployable ASR models, including the [Language Model](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_citrinet/files?version=deployable_v3.0), from NGC into `ASR_MODEL_DIR`.

In [None]:
import os

# Path to the ASR models
ASR_MODEL_DIR = os.path.join(os.getcwd(), "asr-models")

# See what the decoder vocab file looks like
!tail -n 5 $ASR_MODEL_DIR/speechtotext_en_us_lm_vdeployable_v1.1/flashlight_decoder_vocab.txt

#### Option 2.  Modifying an existing vocab - Vocabulary file from local Riva deployment
The actual physical location of Riva assets depends on the value of the `riva_model_loc` variable in the `config.sh` file in the Riva quickstart folder. The vocabulary file is bundled with the Flashlight decoder.
-  By default, `riva_model_loc` is set to `riva-model-repo`, which is a docker volume. You can inspect this docker volume and copy the vocabulary file from within the docker volume to the host file system with commands such as:
        
    ```bash
    # Inspect the Riva model docker volume
    docker inspect riva-model-repo

    # Inspect the content of the Riva model docker volume
    docker run --rm -v riva-model-repo:/riva-model-repo alpine ls /riva-model-repo

    # Copy the vocabulary file from the docker volume to the current directory
    docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/dict_vocab.txt /dest
    ```
        
- If you modify `riva_model_loc` to an absolute path pointing to a folder, then the specified folder in the local file system will be used to store Riva assets instead. Assuming `ASR_MODEL_DIR` is the directory where Riva assets are stored, then the vocabulary file can similarly be found under, for example, `$ASR_MODEL_DIR/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/dict_vocab.txt`.

In [None]:
# Path to Riva Quickstart - this was downloaded as part of the first notebook
RIVA_QSS_DIR = os.path.join(os.getcwd(), "riva_quickstart_v2.3.0")

# In the first notebook, we modified our config.sh to point to an absolute path, where the riva-deploy command stored the ASR model
!sed -n 58,64p $RIVA_QSS_DIR/config.sh

You can make a copy, then extend this default vocabulary file with the words of interest.
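
As a rough sketch of this copy-and-extend step, the hypothetical helper below writes a new vocabulary file containing the original words plus any new words that are not already present. The function name, the file names, and the choice to lowercase new words are assumptions made for illustration (the vocabulary entries shown earlier are lowercase). The demonstration runs on a small temporary stand-in file rather than the real vocabulary.

```python
import os
import tempfile


def extend_vocab(src_path, dst_path, new_words):
    """Copy the vocabulary at src_path to dst_path, appending unseen words.

    Returns the list of words that were actually added.
    """
    with open(src_path, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]

    existing = set(words)
    added = []
    for word in new_words:
        word = word.strip().lower()  # assumption: the vocabulary is lowercase
        if word and word not in existing:
            existing.add(word)
            added.append(word)

    with open(dst_path, "w", encoding="utf-8") as f:
        f.write("\n".join(words + added) + "\n")
    return added


# Toy demonstration on a temporary stand-in for the default vocabulary file
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "vocab.txt")
    dst = os.path.join(tmp, "modified_vocab.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("with\nnot\nthis\n")
    print(extend_vocab(src, dst, ["AntiBERTa", "this", "ABlooper"]))
```

Writing to a separate destination file leaves the default vocabulary untouched, so the original pipeline can be rebuilt at any time.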

<font color='red'>**ATTENTION:**</font> **Once the vocabulary file is modified, you'll have to rebuild and redeploy the Riva ASR pipeline, passing the flag `--decoding_vocab=<modified_vocabulary_file>` to `riva-build`**. <br>

For the purposes of this notebook, we will execute the workflow of the next section - customizing pronunciation with lexicon mapping, which is similar enough in terms of pipeline re-deployment.

---
## 2. Customizing pronunciation with lexicon mapping

The lexicon file that is used by the Flashlight decoder can be found in the Riva assets directory, as specified by the value of the `riva_model_loc` variable in the `config.sh` file under the Riva quickstart folder.

- If `riva_model_loc` points to a docker volume (by default), you can find and copy the lexicon file with:
```bash
        # Copy the lexicon file from the docker volume to the current directory
        docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp  /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt /dest
```      
<br>
- If you modify `riva_model_loc` to an absolute path pointing to a folder, then the specified folder in the local file system will be used to store Riva assets instead. Assuming `ASR_MODEL_DIR` is the directory where Riva assets are stored, then the lexicon file can similarly be found under, for example, `$ASR_MODEL_DIR/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt`.

In [None]:
ASR_MODEL_DIR = os.path.join(os.getcwd(), "asr-models")

!tail -5 $ASR_MODEL_DIR/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt

### How to generate the correct tokenized form

When modifying the lexicon file, ensure that:

- Each new line follows the same format as the rest of the file: the word, a tab, then the space-separated tokens.

- The tokens are valid tokens as determined by the tokenizer model (packaged with the Riva acoustic model).

The latter ensures that you use only tokens that the acoustic model has been trained on. To do this, you’ll need the **tokenizer model** and the **[Sentencepiece](https://github.com/google/sentencepiece)** Python package (`pip install sentencepiece`). <br>

You can get the tokenizer model for the deployed pipeline from one of the below locations:

1. The model repository `ctc-decoder-...` directory for your model. It will be named `<hash>_tokenizer.model`. For example:

`<ASR_MODEL_DIR>/models/citrinet-1024-en-US-asr-streaming-ctc-decoder-cpu-offline/1/498056ba420d4bb3831ad557fba06032_tokenizer.model`

2. When using a docker volume to store Riva assets (by default), you can copy the tokenizer model to the local directory with a command such as:

```bash
    # Copy the tokenizer model file from the docker volume to the current directory
    docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp  /riva-model-repo/models/citrinet-1024-en-US-asr-streaming-ctc-decoder-cpu-offline/1/498056ba420d4bb3831ad557fba06032_tokenizer.model /dest
```  

In [None]:
!pip install sentencepiece

In [None]:
import sentencepiece as spm

In [None]:
# Let's first copy the tokenizer from the previously running instance of Riva server to current directory
!docker cp riva-speech:/data/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/498056ba420d4bb3831ad557fba06032_tokenizer.model {ASR_MODEL_DIR}

TOKENIZER_MODEL = os.path.join(ASR_MODEL_DIR, "498056ba420d4bb3831ad557fba06032_tokenizer.model")

You can then generate new lexicon entries as shown below. The pronunciation you choose is tokenized by the tokenizer model, and the resulting token sequence is what the decoder must detect in order to emit the word in the transcript.
    

In [None]:
import os

In [None]:
TOKEN="antiberta" # Antiberta
PRONUNCIATION="anti berta"

s = spm.SentencePieceProcessor(model_file=TOKENIZER_MODEL)

# Enabling sampling can help sample several possible segmentations
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))

In [None]:
TOKEN="ablooper" # and ABlooper
PRONUNCIATION="a blooper"

s = spm.SentencePieceProcessor(model_file=TOKENIZER_MODEL)

for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))
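
Because sampling is random, the loops above may print the same segmentation more than once. The sketch below collects only the distinct candidate lexicon lines. The helper `unique_lexicon_lines` and its `sample_fn` argument are hypothetical names; here `sample_fn` is a stand-in that replays fixed, made-up segmentations so that the example runs without a tokenizer model.

```python
import itertools


def unique_lexicon_lines(word, pronunciation, sample_fn, n_samples=20):
    """Return distinct tab-separated lexicon lines for `word`.

    `sample_fn` takes the pronunciation text and returns one list of tokens.
    """
    seen = set()
    lines = []
    for _ in range(n_samples):
        tokens = tuple(sample_fn(pronunciation))
        if tokens not in seen:
            seen.add(tokens)
            lines.append(word + "\t" + " ".join(tokens))
    return lines


# Stand-in sampler for illustration only. With the real tokenizer, one might pass:
#   lambda text: s.encode(text, out_type=str, enable_sampling=True,
#                         alpha=0.1, nbest_size=-1)
_fake_samples = itertools.cycle([
    ["▁an", "t", "i", "b", "er", "ta"],
    ["▁an", "ti", "b", "er", "ta"],
    ["▁an", "t", "i", "b", "er", "ta"],
])


def fake_sampler(text):
    return next(_fake_samples)


for line in unique_lexicon_lines("antiberta", "anti berta", fake_sampler, n_samples=6):
    print(line)
```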



### How to modify the lexicon file

First, locate and make a copy of the lexicon file. For example:
```
cp <ASR_MODEL_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt modified_lexicon.txt
```

In [None]:
!cp $ASR_MODEL_DIR/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt $ASR_MODEL_DIR/modified_lexicon.txt

In [None]:
!tail -n 5 $ASR_MODEL_DIR/modified_lexicon.txt

Next, modify it to add the sentencepiece tokenizations for the words of interest. We choose one of the sampled tokenizations for each of the words.

In [None]:
# Let's append tokenized antiberta and ablooper following the convention in the lexicon file
!echo -e "antiberta\t▁an t i b er ta" >> $ASR_MODEL_DIR/modified_lexicon.txt
!echo -e "ablooper\t▁a b lo op er" >> $ASR_MODEL_DIR/modified_lexicon.txt

# Verify that words of interest have been successfully appended
!tail -n 5 $ASR_MODEL_DIR/modified_lexicon.txt
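
Before rebuilding, it can be worth checking that every token in a hand-written entry really belongs to the tokenizer's vocabulary. The following is a minimal sketch of such a check. The helper name `invalid_tokens` and the toy `valid_pieces` set are assumptions for illustration; with the real tokenizer loaded as `s`, the set could be built from `s.id_to_piece(i)` for each `i` in `range(s.get_piece_size())`.

```python
def invalid_tokens(lexicon_line, valid_pieces):
    """Return the tokens of a lexicon line that are not known tokenizer pieces."""
    line = lexicon_line.rstrip("\n")
    if "\t" not in line:
        raise ValueError("expected '<word><TAB><tokens>', got: %r" % line)
    _word, _, tokens = line.partition("\t")
    return [tok for tok in tokens.split() if tok not in valid_pieces]


# Toy piece set for illustration only; a real one would come from the tokenizer:
#   valid_pieces = {s.id_to_piece(i) for i in range(s.get_piece_size())}
valid_pieces = {"▁an", "▁a", "t", "i", "b", "er", "ta", "lo", "op"}

print(invalid_tokens("antiberta\t▁an t i b er ta", valid_pieces))
print(invalid_tokens("ablooper\t▁a b loo per", valid_pieces))
```

An empty list means all tokens of the entry were found in the piece set. A non-empty list names the tokens to fix before passing the file to `riva-build`.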

Finally, once this is done, we need to regenerate the model repository with the modified lexicon by passing `--decoding_lexicon=/path/to/modified_lexicon.txt` to `riva-build` instead of `--decoding_vocab=/path/to/decoding_vocab.txt`. <br>
Let's now put these ideas into action in the next section!

---
# Exercise

## Step 1. Default Inference
First, let's try to transcribe using the pipeline without any changes.

In [None]:
import io
import IPython.display as ipd
import grpc

import riva.client

auth = riva.client.Auth(uri='localhost:50051')
riva_asr = riva.client.ASRService(auth)

# Load a sample audio file from local disk
# This example uses a .wav file with LINEAR_PCM encoding.
path = "audio_samples/en-US_wordboosting_sample1.wav"
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)

In [None]:
# Creating RecognitionConfig
config = riva.client.RecognitionConfig(
  language_code="en-US",
  max_alternatives=1,
  enable_automatic_punctuation=True,
  audio_channel_count = 1
)

# ASR Inference call with Recognize 
response = riva_asr.offline_recognize(content, config)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript with the default lexicon:", asr_best_transcript)

With the unmodified pipeline, ASR has a hard time recognizing domain-specific terms like `AntiBERTa` and `ABlooper`. <br>
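
To make this check repeatable, a small helper can report which expected terms are absent from a transcript. This is a sketch only: the function name `missing_terms` is hypothetical, and the transcript string below is an invented example, not actual Riva output. Matching is case-insensitive and ignores punctuation.

```python
import re


def missing_terms(transcript, terms):
    """Return the terms that do not occur as whole words in the transcript."""
    words = set(re.findall(r"[a-z0-9']+", transcript.lower()))
    return [term for term in terms if term.lower() not in words]


# Invented transcript for illustration; substitute asr_best_transcript in practice
example_transcript = "Anti Berta and a blooper are both transformer based models."
print(missing_terms(example_transcript, ["AntiBERTa", "ABlooper"]))
```

Running the same helper on the transcript before and after customization gives a quick before/after comparison.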

## Step 2. Riva-build

In [None]:
# Stop the existing server
RIVA_QSS_DIR = os.path.join(os.getcwd(), "riva_quickstart_v2.3.0")

! cd $RIVA_QSS_DIR && chmod +x riva_stop.sh
! cd $RIVA_QSS_DIR && ./riva_stop.sh config.sh

In [None]:
# Check that the Riva Speech Server has stopped. No containers should be running at this point.
!docker ps

In [None]:
# ServiceMaker Docker
RIVA_SM_CONTAINER = "nvcr.io/nvidia/riva/riva-speech:2.3.0-servicemaker"

# Get the ServiceMaker docker
! docker pull $RIVA_SM_CONTAINER

# Default key that model is encrypted with
KEY = "tlt_encode"

`riva-build` for the offline recognition use case. Reference: [Pipeline configuration](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-customizing.html#pipeline-configuration)

First, let's set the relevant paths relative to where we will mount the models in the ServiceMaker docker:

In [None]:
# All model paths relative to Riva Servicemaker docker include the _SM suffix

ASR_MODEL_DIR_SM = "/data" # Path where we mount the downloaded ASR models in the Servicemaker docker

# Relative path to Acoustic Model
AM_SM = os.path.join(ASR_MODEL_DIR_SM, "speechtotext_en_us_citrinet_vdeployable_v3.0", "citrinet-1024-Jarvis-asrset-3_0-encrypted.riva")

# Relative path to LM model artifacts
DECODING_LM_BINARY_SM = os.path.join(ASR_MODEL_DIR_SM, "speechtotext_en_us_lm_vdeployable_v1.1", "riva_asr_train_datasets_3gram.binary")

# Relative path to WFST artifacts
WFST_TOKENIZER_MODEL_SM = os.path.join(ASR_MODEL_DIR_SM, "inverse_normalization_en_us_vdeployable_v1.0", "tokenize_and_classify.far")
WFST_VERBALIZER_MODEL_SM = os.path.join(ASR_MODEL_DIR_SM, "inverse_normalization_en_us_vdeployable_v1.0", "verbalize.far")

<font color='red'>**ATTENTION:**</font> We will provide `--decoding_lexicon=/path/to/modified_lexicon.txt` to `riva-build` instead of `--decoding_vocab=/path/to/flashlight_decoder_vocab.txt`, through `DECODING_LEXICON_SM` declared below.

In [None]:
# We'll use the modified lexicon file instead of the default vocab
DECODING_LEXICON_SM = os.path.join(ASR_MODEL_DIR_SM, "modified_lexicon.txt") 

# Relative path where the generated .rmir file will be stored, we indicate that this is with the modified lexicon
ASR_RMIR_SM = os.path.join(ASR_MODEL_DIR_SM, "asr-mod-lexicon.rmir")
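
The `_SM` paths above only resolve inside the container if the corresponding files sit under the host directory that gets mounted. As a sketch of that relationship, the hypothetical helper below translates a host path into the path the container would see, given the host directory and its mount point. The function name and the example paths are assumptions for illustration.

```python
import os
import posixpath


def to_container_path(host_path, host_root, container_root):
    """Translate a host path under host_root to its path under container_root."""
    rel = os.path.relpath(os.path.abspath(host_path), os.path.abspath(host_root))
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError("%s is not inside %s" % (host_path, host_root))
    if rel == os.curdir:
        return container_root
    # Container paths always use forward slashes
    return posixpath.join(container_root, *rel.split(os.sep))


# Hypothetical host layout: the models directory is mounted at /data
print(to_container_path("/home/user/asr-models/modified_lexicon.txt",
                        "/home/user/asr-models", "/data"))
```

A check such as `os.path.isfile` on the host path before launching the build can catch a missing `modified_lexicon.txt` early.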

We use the Riva ServiceMaker docker to run `riva-build`:

In [None]:
# This will create a "asr-mod-lexicon.rmir"
! docker run --rm --gpus 0 -v $ASR_MODEL_DIR:$ASR_MODEL_DIR_SM $RIVA_SM_CONTAINER -- \
            riva-build speech_recognition $ASR_RMIR_SM:$KEY $AM_SM:$KEY \
            --name=citrinet-1024-en-US-asr-offline \
            --offline \
            --streaming=False \
            --wfst_tokenizer_model=$WFST_TOKENIZER_MODEL_SM \
            --wfst_verbalizer_model=$WFST_VERBALIZER_MODEL_SM \
            --ms_per_timestep=80 \
            --featurizer.use_utterance_norm_params=False \
            --featurizer.precalc_norm_time_steps=0 \
            --featurizer.precalc_norm_params=False \
            --vad.residue_blanks_at_start=-2 \
            --chunk_size=300 \
            --left_padding_size=0. \
            --right_padding_size=0. \
            --decoder_type=flashlight \
            --flashlight_decoder.asr_model_delay=-1 \
            --decoding_language_model_binary=$DECODING_LM_BINARY_SM  \
            --decoding_lexicon=$DECODING_LEXICON_SM \
            --flashlight_decoder.lm_weight=0.2 \
            --flashlight_decoder.word_insertion_score=0.2 \
            --flashlight_decoder.beam_threshold=20. \
            --language_code=en-US \
            --force

Next, we run `riva-deploy` to generate the model repository.

In [None]:
# Syntax: riva-deploy -f dir-for-rmir/model.rmir:key output-dir-for-repository
! docker run --rm --gpus 0 -v $ASR_MODEL_DIR:/data $RIVA_SM_CONTAINER -- \
            riva-deploy -f  $ASR_RMIR_SM:$KEY /data/models/

### Re-deploy the speech recognition pipeline

In [None]:
! cd $RIVA_QSS_DIR && ./riva_start.sh config.sh

### Trying the sample again

In [None]:
# Trying again
auth = riva.client.Auth(uri='localhost:50051')
riva_asr = riva.client.ASRService(auth)

path = "audio_samples/en-US_wordboosting_sample1.wav"
with io.open(path, 'rb') as fh:
    content = fh.read()
    
# Creating RecognitionConfig
config = riva.client.RecognitionConfig(
  language_code="en-US",
  max_alternatives=1,
  enable_automatic_punctuation=True,
  audio_channel_count = 1
)

# ASR Inference call with Recognize 
response = riva_asr.offline_recognize(content, config)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript with the customized lexicon:", asr_best_transcript)

As you can see, `AntiBERTa` and `ABlooper` are now transcribed correctly after customizing the lexicon. <br>
To recap, customizing the lexicon can help extend the vocabulary, and also lets us explicitly provide one or more custom pronunciations per word for better recognition.