<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_asr_asr-python-advanced-tao-ngram-deployment/nvidia_logo.png?time=229" style="width: 90px; float: right;">

# How to Deploy a Custom Language Model (n-gram) Trained with TAO Toolkit on Riva
This tutorial walks you through the deployment of a custom language model (n-gram) trained with NVIDIA TAO Toolkit on NVIDIA Riva.

## NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech and natural language understanding services such as:

- Automated speech recognition (ASR).
- Text-to-Speech synthesis (TTS).
- A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will deploy an ASR language model (n-gram) trained with TAO Toolkit on Riva. <br> 
To understand the basics of Riva ASR APIs, refer to [Getting started with Riva ASR in Python](https://github.com/nvidia-riva/tutorials/blob/stable/asr-python-basics.ipynb). <br>
To see how to pretrain an n-gram language model for ASR with TAO Toolkit, refer to [this tutorial](https://github.com/nvidia-riva/tutorials/blob/stable/asr-python-advanced-tao-ngram-pretrain.ipynb). <br>

For more information about Riva, refer to the [Riva developer documentation](https://developer.nvidia.com/riva).

## Train, Adapt, and Optimize TAO Toolkit
[Train Adapt Optimize (TAO) Toolkit](https://developer.nvidia.com/tao-toolkit) can export your model in a format that can be deployed using [NVIDIA Riva](https://developer.nvidia.com/riva), a highly performant application framework for multi-modal conversational AI services using GPUs.

This tutorial explores taking a `.riva` model, the result of the `tao n_gram train` and `tao n_gram export` commands ([pretrain tutorial](https://github.com/nvidia-riva/tutorials/blob/dev/22.06/asr-python-advanced-tao-ngram-pretrain.ipynb)), and leveraging the Riva ServiceMaker framework to aggregate all the necessary artifacts for Riva deployment to a target environment. After the model is deployed in Riva, you can issue inference requests to the server. We will demonstrate how quick and straightforward this whole process is.
In this tutorial, you will learn how to:  
- Use Riva ServiceMaker to take a TAO exported `.riva` file and convert it to `.rmir`.
- Deploy the model locally on the Riva server.
- Send inference requests from a demo client using Riva API bindings.

---
## Prerequisites

Before we get started, ensure you have:
- Access to NVIDIA NGC and the ability to download the Riva Quick Start [resources](https://ngc.nvidia.com/catalog/resources/nvidia:riva:riva_quickstart).
- A _language_ model file that you want to deploy. <br>
For more information on training and exporting an n-gram language model, refer to the [TAO Toolkit N-Gram Language Model Documentation](https://docs.nvidia.com/tao/tao-toolkit/text/lm/n_gram.html). <br> 
The language model file can be in one of the three following formats: 
    - `.riva`. You can obtain this from `tao <task> export` (with `export_format=RIVA`). 
    - `.binary`. You can obtain this from `tao <task> export` (with `export_format=binary`).
    - `.arpa`. You can obtain this from `tao <task> train`. 
- An _acoustic_ model file in the `.riva` format that you want to deploy. You can obtain this from `tao <task> export` (with `export_format=RIVA`). <br>
For more information on training and exporting a `.riva` acoustic model for ASR, refer to the [Speech Recognition](https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition.html), [Speech Recognition with CitriNet](https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition_with_citrinet.html), or [Speech Recognition with Conformer](https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition_with_conformer.html) pages in the [TAO Toolkit Documentation](https://docs.nvidia.com/tao/tao-toolkit/index.html). <br>
You can also work through our tutorial notebooks on [fine-tuning CitriNet](./asr-python-advanced-finetune-am-citrinet-tao-finetuning.ipynb), [fine-tuning CitriNet for noisy audio](./asr-python-advanced-finetune-am-citrinet-for-noisy-audio-withtao.ipynb), and [deploying CitriNet](./asr-python-advanced-finetune-am-citrinet-tao-deployment.ipynb). 
- Weighted Finite State Transducer (WFST) tokenizer and verbalizer files for Inverse Text Normalization (ITN). <br> 
For more information on WFST and ITN, refer to the [NeMo Inverse Text Normalization: From Development to Production](https://arxiv.org/pdf/2104.05055.pdf) paper. <br>
You can download pretrained WFST ITN model files from this [NVIDIA GPU Cloud (NGC) model page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/inverse_normalization_en_us). 
- A decoder vocabulary file. You can download one from the [Riva ASR LM NGC model page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_lm). 

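Before building anything, it can help to confirm that every file listed above is actually present in the directory you will mount into the ServiceMaker container. The following is a minimal sketch; `find_missing_files` is a hypothetical helper, and the directory and file names are placeholders that you should replace with your own.

```python
from pathlib import Path


def find_missing_files(model_loc, file_names):
    """Return the names in file_names that do not exist under model_loc."""
    base = Path(model_loc)
    return [name for name in file_names if not (base / name).is_file()]


# Hypothetical file names; replace them with the files you plan to deploy.
required = [
    "acoustic_model.riva",
    "lm_model.riva",
    "decoder_vocab.txt",
    "tokenize_and_classify.far",
    "verbalize.far",
]
missing = find_missing_files("/path/to/models", required)
print("Missing files:", missing or "none")
```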
---
## Riva ServiceMaker
Riva ServiceMaker is a set of tools that aggregates all the necessary artifacts (models, files, configurations, and user settings) for Riva deployment to a target environment. It has two main components:

### Riva-Build

This step helps build a Riva-ready version of the model. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. Let's consider an ASR n-gram language model. <br>

`riva-build` is responsible for the combination of one or more exported models (`.riva` files) into a single file containing an intermediate format called Riva Model Intermediate Representation (`.rmir`). This file contains a deployment-agnostic specification of the whole end-to-end pipeline along with all the assets required for the final deployment and inference. For more information, refer to the [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-pipeline-configuration.html?highlight=pipeline%20configuration).

In [None]:
# IMPORTANT: UPDATE THESE PATHS 

# ServiceMaker Docker
RIVA_SM_CONTAINER = "<add container name>"

# Directory where the .riva model is stored $MODEL_LOC/*.riva
MODEL_LOC = "<add path to model location>"

# Name of the acoustic model .riva file
ACOUSTIC_MODEL_NAME = "<add model name>"

# Name of the language model .riva file
LM_RIVA_MODEL_NAME = "<add model name>"

# Name of the language model .arpa file
LM_ARPA_MODEL_NAME = "<add model name>"

# Name of the language model .binary file
LM_BINARY_MODEL_NAME = "<add model name>"

# Name of the decoder vocab file
DECODER_VOCAB_NAME = "<add decoder vocab file name>"

# Name of the WFST tokenizer
WFST_TOKENIZER = "<add WFST tokenizer model name>"

# Name of the WFST verbalizer
WFST_VERBALIZER = "<add WFST verbalizer model name>"

# Key that model is encrypted with, while exporting with TAO
KEY = "<add encryption key used for trained model>"

In [None]:
# Get the ServiceMaker Docker
! docker pull $RIVA_SM_CONTAINER

Run the following command to build an `.rmir` file from a `.riva`-formatted n-gram language model file:

In [None]:
# Syntax: 
# riva-build <task-name> \
#     output-dir-for-rmir/model.rmir:key \
#     dir-for-riva/acoustic_model.riva:key \
#     dir-for-riva/lm_model.riva:key
! docker run --rm --gpus 0 -v $MODEL_LOC:/servicemaker-dev $RIVA_SM_CONTAINER -- \
    riva-build speech_recognition \
        /servicemaker-dev/asr_riva_ngram_lm.rmir:$KEY \
        /servicemaker-dev/$ACOUSTIC_MODEL_NAME:$KEY \
        /servicemaker-dev/$LM_RIVA_MODEL_NAME:$KEY \
        --name=riva_ngram_lm_pipeline \
        --wfst_tokenizer_model=/servicemaker-dev/$WFST_TOKENIZER \
        --wfst_verbalizer_model=/servicemaker-dev/$WFST_VERBALIZER \
        --decoding_vocab=/servicemaker-dev/$DECODER_VOCAB_NAME \
        --decoder_type=flashlight \
        --chunk_size=0.16 \
        --padding_size=1.92 \
        --ms_per_timestep=80 \
        --flashlight_decoder.asr_model_delay=-1 \
        --vad.residue_blanks_at_start=-2 \
        --featurizer.use_utterance_norm_params=False \
        --featurizer.precalc_norm_time_steps=0 \
        --featurizer.precalc_norm_params=False 

Run the following command to build an `.rmir` file from a `.binary`-formatted n-gram language model file:

In [None]:
# Syntax: 
# riva-build <task-name> \
#     output-dir-for-rmir/model.rmir:key \
#     dir-for-riva/acoustic_model.riva:key \
#     --decoding_language_model_binary=lm_model.binary
! docker run --rm --gpus 0 -v $MODEL_LOC:/servicemaker-dev $RIVA_SM_CONTAINER -- \
    riva-build speech_recognition \
        /servicemaker-dev/asr_binary_ngram_lm.rmir:$KEY \
        /servicemaker-dev/$ACOUSTIC_MODEL_NAME:$KEY \
        --decoding_language_model_binary=/servicemaker-dev/$LM_BINARY_MODEL_NAME \
        --decoding_vocab=/servicemaker-dev/$DECODER_VOCAB_NAME \
        --wfst_tokenizer_model=/servicemaker-dev/$WFST_TOKENIZER \
        --wfst_verbalizer_model=/servicemaker-dev/$WFST_VERBALIZER \
        --name=binary_ngram_lm_pipeline \
        --decoder_type=flashlight \
        --chunk_size=0.16 \
        --padding_size=1.92 \
        --ms_per_timestep=80 \
        --flashlight_decoder.asr_model_delay=-1 \
        --vad.residue_blanks_at_start=-2 \
        --featurizer.use_utterance_norm_params=False \
        --featurizer.precalc_norm_time_steps=0 \
        --featurizer.precalc_norm_params=False 

Run the following command to build an `.rmir` file from an `.arpa`-formatted n-gram language model file:

In [None]:
# Syntax: 
# riva-build <task-name> \
#     output-dir-for-rmir/model.rmir:key \
#     dir-for-riva/acoustic_model.riva:key \
#     --decoding_language_model_arpa=lm_model.arpa
! docker run --rm --gpus 0 -v $MODEL_LOC:/servicemaker-dev $RIVA_SM_CONTAINER -- \
    riva-build speech_recognition \
        /servicemaker-dev/asr_arpa_ngram_lm.rmir:$KEY \
        /servicemaker-dev/$ACOUSTIC_MODEL_NAME:$KEY \
        --decoding_language_model_arpa=/servicemaker-dev/$LM_ARPA_MODEL_NAME \
        --decoding_vocab=/servicemaker-dev/$DECODER_VOCAB_NAME \
        --wfst_tokenizer_model=/servicemaker-dev/$WFST_TOKENIZER \
        --wfst_verbalizer_model=/servicemaker-dev/$WFST_VERBALIZER \
        --name=arpa_ngram_lm_pipeline \
        --decoder_type=flashlight \
        --chunk_size=0.16 \
        --padding_size=1.92 \
        --ms_per_timestep=80 \
        --flashlight_decoder.asr_model_delay=-1 \
        --vad.residue_blanks_at_start=-2 \
        --featurizer.use_utterance_norm_params=False \
        --featurizer.precalc_norm_time_steps=0 \
        --featurizer.precalc_norm_params=False 

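The three build commands above differ only in how the language model is passed: a `.riva` file is a positional input with the encryption key appended, while `.binary` and `.arpa` files are passed through `--decoding_language_model_binary` and `--decoding_language_model_arpa`. The sketch below captures that rule. `language_model_args` is a hypothetical helper, and it assumes the file extension reflects the actual format of the file.

```python
from pathlib import Path

# Flags used in the cells above for language models that are not in .riva format.
LM_FLAGS = {
    ".binary": "--decoding_language_model_binary",
    ".arpa": "--decoding_language_model_arpa",
}


def language_model_args(lm_file, key, mount_dir="/servicemaker-dev"):
    """Return the riva-build arguments that pass lm_file to the pipeline."""
    suffix = Path(lm_file).suffix
    path = f"{mount_dir}/{lm_file}"
    if suffix == ".riva":
        # .riva files are positional inputs and carry the encryption key.
        return [f"{path}:{key}"]
    if suffix in LM_FLAGS:
        return [f"{LM_FLAGS[suffix]}={path}"]
    raise ValueError(f"Unsupported language model format: {suffix!r}")


for name in ("lm.riva", "lm.binary", "lm.arpa"):
    print(language_model_args(name, "tlt_encode"))
```

The returned list can be spliced into the rest of the `riva-build speech_recognition` arguments, which are identical across the three variants.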
### Riva-Deploy

The deployment tool takes as input one or more RMIR files and a target model repository directory. It creates an ensemble configuration specifying the pipeline for the execution and finally writes all those assets to the output model repository directory.

If you built an `.rmir` file using a `.binary`- or `.arpa`-formatted n-gram language model file, change `asr_riva_ngram_lm` in the cell below to `asr_binary_ngram_lm` or `asr_arpa_ngram_lm` as appropriate. 

In [None]:
# Syntax: riva-deploy -f dir-for-rmir/model.rmir:key output-dir-for-repository
! docker run --rm --gpus 0 -v $MODEL_LOC:/data $RIVA_SM_CONTAINER -- \
            riva-deploy -f  /data/asr_riva_ngram_lm.rmir:$KEY /data/models/

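After `riva-deploy` finishes, a quick look at the output directory shows whether anything was written. This is a minimal sketch that assumes each generated model lands in its own subdirectory of the repository; `list_repository_models` is a hypothetical helper and the path is a placeholder for `<MODEL_LOC>/models`.

```python
from pathlib import Path


def list_repository_models(repo_dir):
    """Return the sorted names of the subdirectories in repo_dir."""
    repo = Path(repo_dir)
    if not repo.is_dir():
        return []
    return sorted(p.name for p in repo.iterdir() if p.is_dir())


# Placeholder path; the cell above writes the repository to <MODEL_LOC>/models.
print(list_repository_models("/path/to/model_loc/models"))
```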
---
## Start the Riva Server
After the model repository is generated, we are ready to start the Riva server. First, download the [Riva Quick Start](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/resources/riva_quickstart) resource from NGC. 
Set the path to the directory here:

In [None]:
# Set the Riva Quick Start directory
RIVA_DIR = "<Path to the uncompressed folder downloaded from quickstart(include the folder name)>"

Next, modify the `config.sh` file to enable the relevant Riva services (ASR, which uses the n-gram language model), provide the encryption key, and set the path to the model repository (`riva_model_loc`) generated in the previous step, among other configurations.

For example, if the model repository was generated at `$MODEL_LOC/models`, set `riva_model_loc` to the same directory as `MODEL_LOC`. <br>

Pretrained versions of the models specified in `models_asr/nlp/tts` are fetched from NGC. Since we are using our custom model, we can comment out the corresponding entry in `models_asr` (and any others that are not relevant to your use case). <br>

#### config.sh snippet
```
# Enable or Disable Riva Services 
service_enabled_asr=true                                                      ## MAKE CHANGES HERE
service_enabled_nlp=false                                                      ## MAKE CHANGES HERE
service_enabled_tts=false                                                     ## MAKE CHANGES HERE

# Specify one or more GPUs to use
# specifying more than one GPU is currently an experimental feature, and may result in undefined behaviours.
gpus_to_use="device=0"

# Specify the encryption key to use to deploy models
MODEL_DEPLOY_KEY="tlt_encode"                                                  ## MAKE CHANGES HERE

# Locations to use for storing models artifacts
#
# If an absolute path is specified, the data will be written to that location
# Otherwise, a Docker volume will be used (default).
#
# riva_init.sh will create a `rmir` and `models` directory in the volume or
# path specified. 
#
# RMIR ($riva_model_loc/rmir)
# Riva uses an intermediate representation (RMIR) for models
# that are ready to deploy but not yet fully optimized for deployment. Pretrained
# versions can be obtained from NGC (by specifying NGC models below) and will be
# downloaded to $riva_model_loc/rmir by `riva_init.sh`
# 
# Custom models produced by NeMo or TAO and prepared using riva-build
# may also be copied manually to this location $(riva_model_loc/rmir).
#
# Models ($riva_model_loc/models)
# During the riva_init process, the RMIR files in $riva_model_loc/rmir
# are inspected and optimized for deployment. The optimized versions are
# stored in $riva_model_loc/models. The riva server exclusively uses these
# optimized versions.
riva_model_loc="<add path>"                              ## MAKE CHANGES HERE (Replace with MODEL_LOC)                      
```
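If you prefer to apply these changes from the notebook instead of editing the file by hand, the substitution can be sketched in Python. This assumes `config.sh` uses plain `name=value` assignments at the start of a line, as in the snippet above; `set_shell_variable` is a hypothetical helper, not part of the Riva Quick Start scripts.

```python
import re


def set_shell_variable(config_text, name, value):
    """Replace the value assigned to `name` in shell-style config text."""
    # Match either a double-quoted value or a run of non-whitespace characters,
    # so that trailing comments on the same line are left untouched.
    pattern = re.compile(
        rf'^({re.escape(name)}=)(?:"[^"]*"|\S*)', flags=re.MULTILINE
    )
    new_text, count = pattern.subn(
        lambda m: f'{m.group(1)}"{value}"', config_text
    )
    if count == 0:
        raise KeyError(f"{name} is not assigned in the config text")
    return new_text


sample = (
    'service_enabled_nlp=true   ## MAKE CHANGES HERE\n'
    'riva_model_loc="<add path>"   ## MAKE CHANGES HERE\n'
)
updated = set_shell_variable(sample, "riva_model_loc", "/path/to/models")
updated = set_shell_variable(updated, "service_enabled_nlp", "false")
print(updated)
```

To use it on the real file, read `config.sh` into a string, apply the substitutions, and write the result back; keep a backup copy of the original first.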

In [None]:
# Ensure you have permission to execute these scripts
! cd $RIVA_DIR && chmod +x ./riva_init.sh && chmod +x ./riva_start.sh

In [None]:
# Run Riva Init. This will fetch the containers/models
# YOU CAN SKIP THIS STEP IF YOU ALREADY RAN RIVA DEPLOY
! cd $RIVA_DIR && ./riva_init.sh config.sh

In [None]:
# Run Riva Start. This will deploy your model.
! cd $RIVA_DIR && ./riva_start.sh config.sh

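The server can take a while to load models after `riva_start.sh` is launched. The sketch below polls the gRPC port until it accepts TCP connections. It assumes the server listens on `localhost:50051`, the address used by the inference cell in this tutorial; `wait_for_port` is a hypothetical helper. An open port only shows that the process is listening, not that every model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=120.0, interval_s=2.0):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait and try again.
            time.sleep(interval_s)
    return False


# Short timeout for demonstration; raise timeout_s for a real server startup.
if wait_for_port("localhost", 50051, timeout_s=3.0, interval_s=1.0):
    print("Port is open; the server may still be loading models.")
else:
    print("Port is not reachable yet.")
```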
---
## Run Inference
After the Riva server is up and running with your models, you can send inference requests querying the server. 

To send gRPC requests, install the Riva Python API bindings for the client. They are provided as a `.whl` file in the Quick Start resources and can be installed with `pip`.


In [None]:
# Install the Client API Bindings
! cd $RIVA_DIR && pip install <add .whl file>

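The recognition request in this tutorial declares `LINEAR_PCM` audio with a single channel and takes its sample rate from the WAV header, so it is worth inspecting the file before sending it. The following sketch uses only the standard `wave` module; `describe_wav` is a hypothetical helper, and the demo writes a short silent file so that it runs without any input data.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return channel count, sample width (bytes), and sample rate of a WAV file."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate_hz": wf.getframerate(),
        }


# Self-contained demo: write one second of 16 kHz mono silence, then inspect it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "silence.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 16000)
    info = describe_wav(demo_path)

print(info)
if info["channels"] != 1:
    print("Warning: the request below declares audio_channel_count=1.")
```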
### Connect to the Riva Server and Run Inference
Now we can actually query the Riva server. The following cell queries the Riva server (using gRPC) to yield a result.

In [None]:
import grpc
import wave

try:
    import riva_api.riva_audio_pb2 as ra  # Riva 2.0.0 and above
except ImportError:
    import riva_api.audio_pb2 as ra  # earlier Riva releases
import riva_api.riva_asr_pb2 as rasr
import riva_api.riva_asr_pb2_grpc as rasr_srv

audio_file = "<add path to .wav file>"
server = "localhost:50051"

# Read the sample rate from the WAV header
with wave.open(audio_file, 'rb') as wf:
    sample_rate = wf.getframerate()

# Read the whole file as the audio payload
with open(audio_file, 'rb') as fh:
    data = fh.read()

# Open an unencrypted gRPC channel to the Riva server
channel = grpc.insecure_channel(server)
client = rasr_srv.RivaSpeechRecognitionStub(channel)
config = rasr.RecognitionConfig(
    encoding=ra.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=sample_rate,
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=False,
    audio_channel_count=1
)

request = rasr.RecognizeRequest(config=config, audio=data)

# Send the full audio file in a single (non-streaming) request
response = client.Recognize(request)
print(response)

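Printing the response shows the full message. To pull out only the recognized text, you can walk its `results` and `alternatives` fields. The sketch below assumes that layout (each result holding a list of alternatives with a `transcript` attribute); `best_transcripts` is a hypothetical helper, and a stand-in object replaces the real response so that the example runs without a server.

```python
from types import SimpleNamespace


def best_transcripts(response):
    """Return the top alternative's transcript for each result in the response."""
    return [
        result.alternatives[0].transcript
        for result in response.results
        if result.alternatives
    ]


# Stand-in with the assumed attribute layout; pass the real `response` instead.
fake_response = SimpleNamespace(
    results=[
        SimpleNamespace(
            alternatives=[SimpleNamespace(transcript="hello world")]
        ),
        SimpleNamespace(alternatives=[]),
    ]
)
print(best_transcripts(fake_response))
```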
You can stop all Docker containers before shutting down the Jupyter kernel. **Caution: The following command will stop all running containers.**

In [None]:
! docker stop $(docker ps -a -q)