<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/rivaasrasr-deploy-am-and-ngram-lm/nvidia_logo.png" style="width: 90px; float: right;">

# How to Deploy a Custom Language Model (n-gram) Trained with NeMo as a Riva ASR NIM
This tutorial walks you through deploying a custom n-gram language model, trained with NVIDIA NeMo, as an NVIDIA Riva ASR NIM.

## NVIDIA Riva Overview

NVIDIA Riva ASR NIM APIs provide easy access to state-of-the-art automatic speech recognition (ASR) models for multiple languages. Riva ASR NIM models are built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to offer out-of-the-box GPU acceleration.

In this tutorial, we will interact with the ASR APIs.

For more information about Riva ASR NIM, refer to the [Riva NIM documentation](https://docs.nvidia.com/nim/riva/asr/latest/overview.html).

## NeMo (Neural Modules) and `nemo2riva`
[NVIDIA NeMo](https://developer.nvidia.com/nvidia-nemo) is an open-source framework for building, training, and fine-tuning GPU-accelerated speech AI and natural language understanding (NLU) models with a simple Python interface. To fine-tune a Parakeet-CTC acoustic model with NeMo, refer to the [Parakeet-CTC fine-tuning tutorial](https://github.com/nvidia-riva/tutorials/blob/main/asr-finetune-parakeet-nemo.ipynb).

The `nemo2riva` command-line tool exports your `.nemo` model in a format that can be deployed using [NVIDIA Riva](https://docs.nvidia.com/nim/riva/asr/latest/overview.html) ASR NIM. `nemo2riva` is available as a Python wheel on [PyPI](https://pypi.org/project/nemo2riva/), and you can install it with `pip`, as shown in the [Parakeet-CTC fine-tuning tutorial](https://github.com/nvidia-riva/tutorials/blob/main/asr-finetune-parakeet-nemo.ipynb).

This tutorial takes a `.riva` model, the result of invoking the `nemo2riva` CLI tool (refer to the [Parakeet-CTC fine-tuning tutorial](https://github.com/nvidia-riva/tutorials/blob/main/asr-finetune-parakeet-nemo.ipynb)), and uses the Riva ServiceMaker framework to aggregate all the artifacts needed to deploy it to a target environment. Once the model is deployed as a Riva NIM, you can issue inference requests to the server. The steps below show how quick and straightforward this process is.

In this tutorial, you will learn how to:
- Build an `.rmir` model pipeline from a `.riva` file with Riva ServiceMaker.
- Deploy the model locally on the Riva server.
- Send inference requests from a demo client using Riva API bindings.

---
## Prerequisites

Before we get started, ensure you have:
- Access to NVIDIA NGC.
- A _language_ model file that you want to deploy.
    - For more information on training and exporting an n-gram language model, refer to the [NeMo Language Modeling documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/asr_language_modeling.html).  
    - The language model file can be in one of the two following formats: 
        - `.binary`. You can download a pre-trained version from the [Riva ASR LM NGC model page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/models/speechtotext_en_us_lm).
        - `.arpa`. You can download a pre-trained version from the [Riva ASR LM NGC model page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/models/speechtotext_en_us_lm). 
- An _acoustic_ model file in the `.riva` format that you want to deploy. You can convert a `.nemo` model file to a `.riva` model file with the `nemo2riva` command.
    - For more information on customizing a Parakeet-CTC acoustic model with NeMo and exporting the resulting model with `nemo2riva`, refer to the [Parakeet-CTC fine-tuning tutorial](https://github.com/nvidia-riva/tutorials/blob/main/asr-finetune-parakeet-nemo.ipynb). 
    - Alternatively, you can obtain a pre-trained Parakeet-CTC `.riva` model for English ASR [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/models/speechtotext_en_us_conformer). 
    - For more information on training NeMo models, refer to the [Training](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/core/core.html#training) section in the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/index.html). 
    - For more information on Parakeet-CTC's architecture, refer to the [Parakeet](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#parakeet) section of the [NeMo ASR Models](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html) page. 
    - For more information on the configuration files necessary for training Parakeet-CTC with NeMo, refer to the [Fastconformer configs](https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf/fastconformer/).
- Weighted Finite State Transducer (WFST) tokenizer and verbalizer files for Inverse Text Normalization (ITN). 
    - For more information on WFST and ITN, refer to the [NeMo Inverse Text Normalization: From Development to Production](https://arxiv.org/pdf/2104.05055.pdf) paper.
    - You can download pretrained WFST ITN model files from this [NVIDIA GPU Cloud (NGC)](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/models/inverse_normalization_en_us) model page. 
- A decoder vocabulary file. You can download one from the [Riva ASR LM NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/models/speechtotext_en_us_lm) model page. 

---
## Riva ServiceMaker
Riva ServiceMaker is a set of tools that aggregates all the necessary artifacts (models, files, configurations, and user settings) for Riva NIM deployment to a target environment. It has two main components:

### Riva-Build

This step builds a Riva-ready version of the model. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. This tutorial uses an ASR pipeline with an n-gram language model as the example.

`riva-build` is responsible for the combination of one or more exported models (`.riva` files) into a single file containing an intermediate format called Riva Model Intermediate Representation (`.rmir`). This file contains a deployment-agnostic specification of the whole end-to-end pipeline along with all the assets required for the final deployment and inference. For more information, refer to the [documentation](https://docs.nvidia.com/nim/riva/asr/latest/custom-deployment.html#deploying-custom-models-as-nim).

In [None]:
# IMPORTANT: UPDATE THESE PATHS 

# Riva NIM Docker
CONTAINER_ID = "<add container name>"
# Refer to this table to get the CONTAINER_ID for the model architecture you want to deploy.
# Example: CONTAINER_ID = "parakeet-1-1b-ctc-en-us"

# Directory where model files are stored,
# e.g. $MODEL_LOC/$ACOUSTIC_MODEL_NAME
MODEL_LOC = "<add path to model location>"


# Name of the acoustic model .riva file, including the extension
ACOUSTIC_MODEL_NAME = "<add model name>"

# Name of the language model .riva (or .arpa or .binary) file
LANGUAGE_MODEL_NAME = "<add model name>"

# Name of the decoder vocab file
DECODER_VOCAB_NAME = "<add decoder vocab file name>"

# Name of the WFST tokenizer
WFST_TOKENIZER = "<add WFST tokenizer model name>"

# Name of the WFST verbalizer
WFST_VERBALIZER = "<add WFST verbalizer model name>"

# Path to store the NIM model repository. Make sure that this directory is empty.
NIM_EXPORT_PATH = "~/nim_cache"

! mkdir -p $NIM_EXPORT_PATH
! chmod 777 $NIM_EXPORT_PATH
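
Before invoking `riva-build`, it can help to confirm that every file named in the cell above actually exists under `MODEL_LOC`. The following is a minimal sketch of such a check; the helper name `find_missing_files` is hypothetical and is not part of Riva or NeMo.

```python
import os


def find_missing_files(model_dir, file_names):
    """Return the entries of file_names that are not regular files in model_dir."""
    # Expand a leading "~" so paths such as "~/models" resolve correctly.
    model_dir = os.path.expanduser(model_dir)
    return [
        name
        for name in file_names
        if not os.path.isfile(os.path.join(model_dir, name))
    ]
```

In this notebook you could call it as `find_missing_files(MODEL_LOC, [ACOUSTIC_MODEL_NAME, LANGUAGE_MODEL_NAME, DECODER_VOCAB_NAME, WFST_TOKENIZER, WFST_VERBALIZER])`; an empty list means all five files were found.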

#### Build the `.rmir` file

**Notes** 
1. If you encrypted your acoustic model and/or language model by adding the `--key` flag when invoking `nemo2riva`, or you downloaded a pre-trained model from NGC, you'll need to append a colon and then the key's value to the model's name in the `riva-build` command, as shown below. You might find it convenient to set a string variable named `KEY` and pass it into the appropriate `riva-build` arguments as `$KEY`. The standard encryption key for the older pre-trained models is `tlt_encode`.
2. If your language model is in the `.arpa` format, pass it to `riva-build` as `--decoding_language_model_arpa=/servicemaker-dev/$LANGUAGE_MODEL_NAME` instead of as a positional `/servicemaker-dev/$LANGUAGE_MODEL_NAME:$KEY` argument.
3. If your language model is in the `.binary` format, pass it as `--decoding_language_model_binary=/servicemaker-dev/$LANGUAGE_MODEL_NAME` instead of as a positional `/servicemaker-dev/$LANGUAGE_MODEL_NAME:$KEY` argument. The command below assumes this format.
4. Refer to the [Riva ASR NIM Pipeline Configuration documentation](https://docs.nvidia.com/nim/riva/asr/latest/pipeline-configuration.html) if you want to build an ASR NIM. To obtain the proper `riva-build` parameters for your particular application, select the acoustic model (the parameters below assume Parakeet-CTC), language, and pipeline type (offline for the purposes of this tutorial) from the interactive web menu at the bottom of the first section of the page.
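
The choice described in notes 1 to 3 can be sketched as a small helper that picks the `riva-build` argument from the language model's file extension. This is only an illustration: the function name `language_model_args` is hypothetical, and the sketch assumes that any file that is neither `.arpa` nor `.binary` is an exported model passed positionally with an optional `:key` suffix.

```python
def language_model_args(lm_path, key=None):
    """Return the riva-build argument list for a language model file.

    .arpa and .binary files use the flags named in notes 2 and 3.
    Any other file is passed positionally, with ":<key>" appended
    when an encryption key is given (note 1).
    """
    if lm_path.endswith(".arpa"):
        return [f"--decoding_language_model_arpa={lm_path}"]
    if lm_path.endswith(".binary"):
        return [f"--decoding_language_model_binary={lm_path}"]
    return [f"{lm_path}:{key}" if key else lm_path]
```

For example, `language_model_args("/servicemaker-dev/lm.binary")` yields the same `--decoding_language_model_binary=...` form that the `riva-build` command in this tutorial uses.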

In [None]:
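# Replace tlt_encode below with your own key if you used a different one,
# or drop the ":key" suffix if your models are not encrypted.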
! docker run --gpus all --rm \
     -v $MODEL_LOC:/servicemaker-dev \
     --name riva-servicemaker \
     --entrypoint="" \
     nvcr.io/nim/nvidia/$CONTAINER_ID \
     riva-build speech_recognition \
     /servicemaker-dev/asr_offline_riva_ngram_lm.rmir:tlt_encode \
     /servicemaker-dev/$ACOUSTIC_MODEL_NAME:tlt_encode \
      --offline \
      --name=Parakeet-en-US-asr-offline \
      --return_separate_utterances=True \
      --featurizer.use_utterance_norm_params=False \
      --featurizer.precalc_norm_time_steps=0 \
      --featurizer.precalc_norm_params=False \
      --ms_per_timestep=80 \
      --endpointing.start_history=200 \
      --nn.fp16_needs_obey_precision_pass \
      --endpointing.residue_blanks_at_start=-2 \
      --chunk_size=4.8 \
      --left_padding_size=1.6 \
      --right_padding_size=1.6 \
      --max_batch_size=16 \
      --featurizer.max_batch_size=512 \
      --featurizer.max_execution_batch_size=512 \
      --decoder_type=flashlight \
      --decoding_language_model_binary=/servicemaker-dev/$LANGUAGE_MODEL_NAME \
      --decoding_vocab=/servicemaker-dev/$DECODER_VOCAB_NAME \
      --flashlight_decoder.lm_weight=0.2 \
      --flashlight_decoder.word_insertion_score=0.2 \
      --flashlight_decoder.beam_threshold=20. \
      --language_code=en-US \
      --wfst_tokenizer_model=/servicemaker-dev/$WFST_TOKENIZER \
      --wfst_verbalizer_model=/servicemaker-dev/$WFST_VERBALIZER

### Riva-Deploy

The deployment tool takes as input one or more RMIR files and a target model repository directory. It creates an ensemble configuration specifying the pipeline for the execution and finally writes all those assets to the output model repository directory.

**Note:** If you added an encryption key to your `.rmir` file when building it with `riva-build`, make sure to append a colon and then the key's value to the model's name in the `riva-deploy` command, as shown below.

In [None]:
# Syntax: riva-deploy -f dir-for-rmir/model.rmir[:key] output-dir-for-repository
! docker run --gpus all --rm \
     -v $MODEL_LOC:/servicemaker-dev \
     -v $NIM_EXPORT_PATH:/model_tar \
     --name riva-servicemaker \
     --entrypoint="" \
     nvcr.io/nim/nvidia/$CONTAINER_ID \
     bash -c "riva-deploy -f /servicemaker-dev/asr_offline_riva_ngram_lm.rmir:tlt_encode /data/models/ && tar -czf /model_tar/custom_models.tar.gz -C /data/models ."
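
The deploy step packs the generated model repository into `custom_models.tar.gz` inside the export directory. As a quick sanity check, you could list the first few entries of that archive with Python's standard `tarfile` module. This is a minimal sketch, and the helper name `list_archive` is hypothetical.

```python
import tarfile


def list_archive(archive_path, limit=20):
    """Return up to `limit` member names from a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()[:limit]
```

In this notebook, the archive path would be `os.path.join(os.path.expanduser(NIM_EXPORT_PATH), "custom_models.tar.gz")`. An empty listing, or a missing file, suggests that the deploy step did not complete.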

---
## Start the Riva ASR NIM
After the model repository is generated, we are ready to start the Riva NIM server. 

In [None]:

# Run the container with the cache directory mounted in the appropriate location:
! docker run -it --rm -d --name=$CONTAINER_ID \
   --runtime=nvidia \
   --gpus '"device=0"' \
   --shm-size=8GB \
   -e NGC_API_KEY \
   -e NIM_TAGS_SELECTOR \
   -e NIM_DISABLE_MODEL_DOWNLOAD=true \
   -e NIM_HTTP_API_PORT=9000 \
   -e NIM_GRPC_API_PORT=50051 \
   -p 9000:9000 \
   -p 50051:50051 \
   -v $NIM_EXPORT_PATH:/opt/nim/export \
   -e NIM_EXPORT_PATH=/opt/nim/export \
   nvcr.io/nim/nvidia/$CONTAINER_ID:latest
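
The `-e NGC_API_KEY` and `-e NIM_TAGS_SELECTOR` options in the command above carry no value, so Docker copies each variable from the environment of the process that launches it. A small check like the following sketch can flag variables that are unset or empty before you start the container. The helper name `unset_variables` is hypothetical, and whether `NIM_TAGS_SELECTOR` is required depends on your setup.

```python
import os


def unset_variables(names, environ=None):
    """Return the names that are missing or empty in the given environment."""
    # Default to the current process environment, which `!` shell commands inherit.
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]
```

For example, `unset_variables(["NGC_API_KEY", "NIM_TAGS_SELECTOR"])` returns an empty list when both variables are set in the notebook kernel's environment.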

---
## Run Inference
After the Riva NIM server is up and running with your models, you can send inference requests to it.

To send gRPC requests, install the Riva Python API bindings for the client. They are available as a [Python module on PyPI](https://pypi.org/project/nvidia-riva-client/).

In [None]:
# Install the Client API Bindings
! pip install nvidia-riva-client

In [None]:
import riva.client

### Connect to the Riva Server and Run Inference

The NIM server can take some time to load. Wait until it is ready to serve requests before sending any.

In [None]:
import time

import requests

# Poll the HTTP health endpoint up to 30 times, 5 seconds apart.
for attempt in range(30):
    try:
        r = requests.get("http://localhost:9000/v1/health/live", timeout=2)
        if "live" in r.text:
            print("NIM server is ready!")
            break
    except requests.RequestException:
        # The server is not accepting connections yet.
        pass
    print("Waiting for NIM server to load, retrying in 5 seconds...")
    time.sleep(5)
else:
    print("Server did not become ready after 30 attempts.")
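
The health check above uses the HTTP port (9000), while the inference requests later in this tutorial go to the gRPC port (50051). If you want a quick extra check that something is listening on that port, a plain TCP connection attempt is enough. The sketch below uses only the standard library, and the helper name `port_is_open` is hypothetical. An open port shows only that a process is listening, not that the models have finished loading.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For this tutorial's setup, the call would be `port_is_open("localhost", 50051)`.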


Once the server is ready, we can call this inference function to query the Riva NIM server (using gRPC) to transcribe an audio file.

In [None]:
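def run_inference(audio_file, server='localhost:50051', print_full_response=False):
    # Read the whole audio file; offline recognition sends it in a single request.
    with open(audio_file, 'rb') as fh:
        data = fh.read()

    # Create the gRPC client for the ASR service.
    auth = riva.client.Auth(uri=server)
    client = riva.client.ASRService(auth)
    config = riva.client.RecognitionConfig(
        language_code="en-US",
        max_alternatives=1,
        enable_automatic_punctuation=False,
    )

    response = client.offline_recognize(data, config)
    if print_full_response:
        print(response)
    elif response.results:
        print(response.results[0].alternatives[0].transcript)
    else:
        # Guard against an empty result list, for example on silent audio.
        print("No transcript returned.")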

In [None]:
audio_file = "audio_samples/en-US_sample.wav"
run_inference(audio_file)
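
If a transcript comes back empty or looks wrong, one thing worth checking is the audio file itself. The following sketch reads the header of a PCM WAV file with Python's standard `wave` module and reports its basic properties. The helper name `describe_wav` is hypothetical, and which sample rates and channel counts a given model handles well is something to confirm against that model's documentation rather than assume from this sketch.

```python
import wave


def describe_wav(path):
    """Return basic header fields of a PCM WAV file as a dictionary."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            # Duration in seconds; guard against a zero sample rate.
            "duration_s": frames / rate if rate else 0.0,
        }
```

For the sample used above, the call would be `describe_wav("audio_samples/en-US_sample.wav")`.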

You can stop the Riva NIM server before shutting down the Jupyter kernel.

In [None]:
# The container was started with --rm, so stopping it also removes it.
! docker stop $CONTAINER_ID