<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_asr_asr-python-advanced-finetune-am-citrinet-tao-deployment/nvidia_logo.png" style="width: 90px; float: right;">

# How to deploy a Riva Speech Recognition Pipeline
In this tutorial, you will learn how to deploy Riva speech recognition models, specifically the **Acoustic model (Citrinet)**, **Language model (n-gram)**, and **Inverse Text Normalization (WFST)** pre-trained models downloaded from NVIDIA NGC.

This will serve as a primer for customization tutorials in this lab, which require configuring the Riva speech pipeline.

---
## Prerequisites

Before we get started, ensure that you have access to [**NVIDIA NGC**](https://ngc.nvidia.com/signin).
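
The cells below call the `ngc` CLI and `docker` from the notebook, so it can help to confirm both are on `PATH` before starting. The following is a minimal sketch; `find_missing_tools` is a hypothetical helper, not part of Riva or NGC.

```python
import shutil

def find_missing_tools(tools=("ngc", "docker")):
    """Return the names of command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = find_missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("ngc and docker are available")
```

Note that this only checks that the executables exist. It does not verify that `ngc` has been configured with a valid API key.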

---
## Fetch ASR models from NGC
### Download the CitriNet Acoustic Model

The CitriNet Acoustic Model is located on NGC [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_citrinet/files?version=deployable_v3.0). Let's download it to a local path.

In [None]:
# Imports
import os

# Create a local directory to save models
ASR_MODEL_DIR = os.path.join(os.getcwd(), "asr-models")
!mkdir -p $ASR_MODEL_DIR

In [None]:
# Path where ngc will download the Acoustic Model
AM_DIR = "speechtotext_en_us_citrinet_vdeployable_v3.0"
AM_PATH = os.path.join(ASR_MODEL_DIR, AM_DIR)

if os.path.exists(AM_PATH):
    print("Acoustic Model exists, skipping download")
else:
    print("Downloading the Acoustic Model")
    !ngc registry model download-version "nvidia/tao/speechtotext_en_us_citrinet:deployable_v3.0" --dest $ASR_MODEL_DIR

In [None]:
# Inspect downloaded files
!ls $AM_PATH

### Download the n-gram Language Model

The n-gram LM is located on NGC [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_lm/files?version=deployable_v1.1).

`NOTE:` This may take up to 30 minutes to download.

In [None]:
LM_DIR = "speechtotext_en_us_lm_vdeployable_v1.1"
LM_PATH = os.path.join(ASR_MODEL_DIR, LM_DIR)

if os.path.exists(LM_PATH):
    print("Language Model exists, skipping download")
else:
    print("Downloading the Language Model")
    !ngc registry model download-version "nvidia/tao/speechtotext_en_us_lm:deployable_v1.1" --dest $ASR_MODEL_DIR

In [None]:
# Inspect downloaded files
!ls $LM_PATH

### Download Inverse Text Normalization (ITN) Model

The ITN model is located on NGC [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/inverse_normalization_en_us/files?version=deployable_v1.0).

In [None]:
ITN_DIR = "inverse_normalization_en_us_vdeployable_v1.0"
ITN_PATH = os.path.join(ASR_MODEL_DIR, ITN_DIR)

if os.path.exists(ITN_PATH):
    print("ITN Model exists, skipping download")
else:
    print("Downloading the ITN Model")
    !ngc registry model download-version "nvidia/tao/inverse_normalization_en_us:deployable_v1.0" --dest $ASR_MODEL_DIR

In [None]:
# Inspect downloaded files
!ls $ITN_PATH
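
The three download cells above repeat the same "skip if the folder exists" pattern. As a sketch, the pattern could be factored into one helper. `ngc_download_dir` and `download_if_missing` are hypothetical names, and the folder naming rule (`<name>_v<version>`) is inferred from the directory names used in this notebook rather than from the NGC documentation.

```python
import os
import subprocess

def ngc_download_dir(model_spec):
    """Map 'org/team/name:version' to the folder name used in this notebook: 'name_v<version>'."""
    name, version = model_spec.rsplit(":", 1)
    return f"{name.split('/')[-1]}_v{version}"

def download_if_missing(model_spec, dest_dir, run=subprocess.run):
    """Download an NGC model version unless its folder already exists; return the folder path."""
    target = os.path.join(dest_dir, ngc_download_dir(model_spec))
    if os.path.exists(target):
        print(f"{target} exists, skipping download")
    else:
        print(f"Downloading {model_spec}")
        # Same command as the `!ngc registry model download-version` cells above
        run(["ngc", "registry", "model", "download-version", model_spec, "--dest", dest_dir],
            check=True)
    return target

# Example (not executed here):
# AM_PATH = download_if_missing("nvidia/tao/speechtotext_en_us_citrinet:deployable_v3.0", ASR_MODEL_DIR)
```

The `run` argument exists only so that the command can be replaced with a stub when testing without network access.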

---
## Riva ServiceMaker
Riva ServiceMaker is a set of tools that aggregates all the necessary artifacts (models, files, configurations, and user settings) for Riva deployment to a target environment. It has two main components: `riva-build` and `riva-deploy`.

### Riva-build

This step helps build a Riva-ready version of the model. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. <br>

`riva-build` is responsible for the combination of one or more exported models (`.riva` files) into a single file containing an intermediate format called Riva Model Intermediate Representation (`.rmir`). This file contains a deployment-agnostic specification of the whole end-to-end pipeline along with all the assets required for the final deployment and inference. For more information, refer to the [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-customizing.html#pipeline-configuration).

In [None]:
# ServiceMaker Docker
RIVA_SM_CONTAINER = "nvcr.io/nvidia/riva/riva-speech:2.4.0-servicemaker"

# Get the ServiceMaker docker
! docker pull $RIVA_SM_CONTAINER

# Key that model is encrypted with, while exporting with TAO
KEY = "tlt_encode"

Below, we run `riva-build` to create a pipeline configured for offline recognition. For reference, this command also appears in the [pipeline configuration](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-customizing.html#pipeline-configuration) section of the docs. <br>

First, let's set the relevant paths relative to where we will mount the models in the ServiceMaker container:

In [None]:
# All model paths relative to Riva Servicemaker docker include the _SM suffix

ASR_MODEL_DIR_SM = "/data" # Path where we mount the downloaded ASR models in the Servicemaker docker

# Relative path to Acoustic Model
AM_SM = os.path.join(ASR_MODEL_DIR_SM, AM_DIR, "citrinet-1024-Jarvis-asrset-3_0-encrypted.riva")

# Relative path to LM model artifacts
DECODING_LM_BINARY_SM = os.path.join(ASR_MODEL_DIR_SM, LM_DIR, "riva_asr_train_datasets_3gram.binary")
DECODING_VOCAB_SM = os.path.join(ASR_MODEL_DIR_SM, LM_DIR, "flashlight_decoder_vocab.txt")

# Relative path to WFST artifacts
WFST_TOKENIZER_MODEL_SM = os.path.join(ASR_MODEL_DIR_SM, ITN_DIR, "tokenize_and_classify.far")
WFST_VERBALIZER_MODEL_SM = os.path.join(ASR_MODEL_DIR_SM, ITN_DIR, "verbalize.far")

# Relative path where the generated .rmir file will be stored
ASR_RMIR_SM = os.path.join(ASR_MODEL_DIR_SM, "asr.rmir")
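
Because the `_SM` paths only exist inside the container, a typo in a file name would surface only when `riva-build` runs. As a sketch, the container paths can be mapped back to the host directory and checked up front. `container_to_host` and `missing_artifacts` are hypothetical helpers, and they assume the host directory is mounted at `/data` as in this notebook.

```python
import os

def container_to_host(container_path, host_root, container_root="/data"):
    """Translate a path under the container mount point to the matching host path."""
    relative = os.path.relpath(container_path, container_root)
    if relative == os.pardir or relative.startswith(os.pardir + os.sep):
        raise ValueError(f"{container_path} is not under {container_root}")
    return os.path.join(host_root, relative)

def missing_artifacts(container_paths, host_root, container_root="/data"):
    """Return the container paths whose host-side files do not exist."""
    return [path for path in container_paths
            if not os.path.exists(container_to_host(path, host_root, container_root))]

# Example with the variables defined above (not executed here):
# missing_artifacts([AM_SM, DECODING_LM_BINARY_SM, DECODING_VOCAB_SM,
#                    WFST_TOKENIZER_MODEL_SM, WFST_VERBALIZER_MODEL_SM],
#                   ASR_MODEL_DIR, ASR_MODEL_DIR_SM)
```

An empty list means every input file was found on the host.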

We use the Riva ServiceMaker container to run `riva-build`:

In [None]:
! docker run --rm --gpus 0 -v $ASR_MODEL_DIR:$ASR_MODEL_DIR_SM $RIVA_SM_CONTAINER -- \
            riva-build speech_recognition $ASR_RMIR_SM:$KEY $AM_SM:$KEY \
            --name=citrinet-1024-en-US-asr-offline \
            --offline \
            --streaming=False \
            --wfst_tokenizer_model=$WFST_TOKENIZER_MODEL_SM \
            --wfst_verbalizer_model=$WFST_VERBALIZER_MODEL_SM \
            --ms_per_timestep=80 \
            --featurizer.use_utterance_norm_params=False \
            --featurizer.precalc_norm_time_steps=0 \
            --featurizer.precalc_norm_params=False \
            --vad.residue_blanks_at_start=-2 \
            --chunk_size=300 \
            --left_padding_size=0. \
            --right_padding_size=0. \
            --decoder_type=flashlight \
            --flashlight_decoder.asr_model_delay=-1 \
            --decoding_language_model_binary=$DECODING_LM_BINARY_SM  \
            --decoding_vocab=$DECODING_VOCAB_SM  \
            --flashlight_decoder.lm_weight=0.2 \
            --flashlight_decoder.word_insertion_score=0.2 \
            --flashlight_decoder.beam_threshold=20. \
            --language_code=en-US \
            --force

The arguments we used above are just an example, and there are many more optional parameters you can configure. For now, let's look at what the arguments we used above mean:

* General pipeline parameters:
    * `--name`: Name of the ASR pipeline, used to set the model names in the Riva model repository
    * `--offline`: The Riva ASR pipeline can be configured for both streaming and offline recognition use cases. Here, we mark it to use it in the `offline` setting. More details on recommended configuration for offline/streaming are in the [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-pipeline-configuration.html#streaming-offline-recognition).
    * `--language_code`: Language of the model
    * `--force`: Overwrites the existing .rmir file if it exists.
    * `--chunk_size`: Size, in seconds, of the audio chunks to use during inference. If not specified, a default is selected based on the online/offline setting.
    * `--left_padding_size`: The duration in seconds of the backward looking padding to prepend to the audio chunk. The acoustic model input corresponds to a duration of (left_padding_size + chunk_size + right_padding_size) seconds
    * `--right_padding_size`: The duration in seconds of the forward looking padding to append to the audio chunk. The acoustic model input corresponds to a duration of (left_padding_size + chunk_size + right_padding_size) seconds
* ITN model specific parameters
    * `--wfst_tokenizer_model`: Sparrowhawk model to use for tokenization and classification, must be in .far (finite-state archive) format. 
    * `--wfst_verbalizer_model`: Sparrowhawk model to use for verbalizer, must be in .far (finite-state archive) format.
* Acoustic model specific parameters
    * `--ms_per_timestep`: The duration in milliseconds of one timestep of the acoustic model output.
* Featurizer specific parameters
    * `--featurizer.use_utterance_norm_params`: Apply normalization at utterance level
    * `--featurizer.precalc_norm_time_steps`: Weight of the precomputed normalization parameters, in timesteps. Setting to 0 will disable use of precalculated normalization parameters.
    * `--featurizer.precalc_norm_params`: Boolean that controls if precalculated normalization parameters should be used
* Voice activity detection (VAD) specific parameters
    * `--vad.residue_blanks_at_start`: Number of time steps to ignore at the beginning of the acoustic model output when trying to detect start/end of speech
* Decoder & Language Model specific parameters
    * `--decoder_type`: Type of decoder to use. Valid entries are greedy, os2s, flashlight or kaldi. In this example, we used the flashlight decoder.
    * `--flashlight_decoder.asr_model_delay`: Number of time steps by which the acoustic model output should be shifted when computing timestamps. This parameter must be tuned since the CTC model is not guaranteed to predict correct alignment.
    * `--decoding_language_model_binary`: Language model .binary used during decoding
    * `--decoding_vocab`: File of unique words separated by white space. Only used if decoding_lexicon not provided
    * `--flashlight_decoder.lm_weight`: Weight of language model. This affects the overall contribution of the language model score to the overall hypothesis score.
    * `--flashlight_decoder.word_insertion_score`: Word insertion score used when scoring hypotheses
    * `--flashlight_decoder.beam_threshold`: Threshold used to prune hypotheses
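
To make the timing parameters above concrete, the sketch below works through the arithmetic for the values used in this notebook. It assumes that the number of acoustic model output timesteps is simply the window duration divided by `ms_per_timestep`; `acoustic_window` is a hypothetical helper for illustration only.

```python
def acoustic_window(chunk_size, left_padding_size, right_padding_size, ms_per_timestep):
    """Return (window duration in seconds, approximate number of output timesteps)."""
    # The acoustic model input spans the chunk plus both paddings
    duration_s = left_padding_size + chunk_size + right_padding_size
    timesteps = round(duration_s * 1000 / ms_per_timestep)
    return duration_s, timesteps

# Values from the riva-build command above
duration_s, timesteps = acoustic_window(300, 0.0, 0.0, 80)
print(f"{duration_s:.0f} s window, about {timesteps} timesteps of 80 ms each")
```

With `--chunk_size=300`, no padding, and `--ms_per_timestep=80`, this gives a 300 second window and 3750 timesteps.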

This information is also accessible through the `riva-build speech_recognition -h` command, and more information about additional parameters to `riva-build` can be found in the [riva-build optional parameters](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-custom.html?highlight=riva%20build#riva-build-optional-parameters) documentation. 

In [None]:
! docker run --rm $RIVA_SM_CONTAINER -- riva-build speech_recognition -h

In [None]:
# Inspect the .rmir
!ls -lt $ASR_MODEL_DIR/*.rmir

### Riva-deploy

The deployment tool takes as input one or more Riva Model Intermediate Representation (RMIR) files and a target model repository directory. It creates an ensemble configuration specifying the pipeline for the execution and finally writes all those assets to the output model repository directory.

`NOTE:` This step may take about 10 minutes to complete.

In [None]:
# Path to the model repository relative to the SM docker
MODEL_REPO_SM = os.path.join(ASR_MODEL_DIR_SM, "models")

In [None]:
# Syntax: riva-deploy -f dir-for-rmir/model.rmir:key output-dir-for-repository
! docker run --rm --gpus 0 -v $ASR_MODEL_DIR:$ASR_MODEL_DIR_SM $RIVA_SM_CONTAINER -- \
            riva-deploy -f  $ASR_RMIR_SM:$KEY $MODEL_REPO_SM

In [None]:
!echo $RIVA_SM_CONTAINER

In [None]:
# Inspect the models directory
!ls -lt $ASR_MODEL_DIR/models
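
The same inspection can be done in Python. The sketch below lists each model directory in the repository and reports whether it contains a `config.pbtxt` file. That file name is an assumption based on the Triton-style repository layout; `list_deployed_models` is a hypothetical helper.

```python
import os

def list_deployed_models(model_repo):
    """Return {model_name: has_config} for each subdirectory of the model repository."""
    models = {}
    for entry in sorted(os.listdir(model_repo)):
        path = os.path.join(model_repo, entry)
        if os.path.isdir(path):
            # Assumption: each deployed model directory carries a config.pbtxt
            models[entry] = os.path.isfile(os.path.join(path, "config.pbtxt"))
    return models

# Example with the variables defined above (not executed here):
# list_deployed_models(os.path.join(ASR_MODEL_DIR, "models"))
```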

---
## Start the Riva Server
After the model repository is generated, we are ready to start the Riva server. First, download the Riva Skills Quick Start resources from NGC. 

### Download the Riva Skills Quick Start guide
The [Riva Skills Quick Start](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/resources/riva_quickstart) guide contains easy-to-use scripts to download and deploy models. 

`NOTE:` The scripts in Quick Start can download and deploy the default models. We downloaded the ASR models above just to demonstrate how to use Riva ServiceMaker tools, which will be used during customization tutorials to re-deploy the pipeline.

In [None]:
# Set the Riva Quick Start directory
RIVA_QSG = os.path.join(os.getcwd(), "riva_quickstart_v2.4.0")

# Downloads the quick start directory to a folder in the current directory and uncompresses it
if os.path.exists(RIVA_QSG):
    print("Riva Quick Start guide exists, skipping download")
else:
    print("Downloading the Riva Quick Start guide")
    !ngc registry resource download-version "nvidia/riva/riva_quickstart:2.4.0"

### Configure Riva Quick Start 
This step configures the Quick Start scripts to deploy the ASR models we produced with the Riva ServiceMaker tools in the previous section. <br>
To do this, we modify the `config.sh` file to enable the relevant Riva services (ASR for the Citrinet model), provide the encryption key, and set the path to the model repository (`riva_model_loc`) generated in the previous step, among other settings.

In [None]:
!ls $RIVA_QSG/config.sh

For example, if the model repository above was generated at `$ASR_MODEL_DIR/models`, set `riva_model_loc` to `ASR_MODEL_DIR`, the directory that contains `models`. <br>

Pretrained versions of models specified in `models_asr/nlp/tts` are fetched from NGC. Since we are using our custom model, we can comment it out in `models_asr` (along with any others that are not relevant to your use case). <br>

#### config.sh snippet
```
# Enable or Disable Riva Services 
service_enabled_asr=true
service_enabled_nlp=false                                                      ## MAKE CHANGES HERE - SET TO FALSE
service_enabled_tts=false                                                     ## MAKE CHANGES HERE - SET TO FALSE

# Specify one or more GPUs to use
# specifying more than one GPU is currently an experimental feature, and may result in undefined behaviours.
gpus_to_use="device=0"

# Specify the encryption key to use to deploy models
MODEL_DEPLOY_KEY="tlt_encode"

# Locations to use for storing models artifacts
#
# If an absolute path is specified, the data will be written to that location
# Otherwise, a docker volume will be used (default).
#
# riva_init.sh will create a `rmir` and `models` directory in the volume or
# path specified. 
#
# RMIR ($riva_model_loc/rmir)
# Riva uses an intermediate representation (RMIR) for models
# that are ready to deploy but not yet fully optimized for deployment. Pretrained
# versions can be obtained from NGC (by specifying NGC models below) and will be
# downloaded to $riva_model_loc/rmir by `riva_init.sh`
# 
# Custom models produced by NeMo or TAO and prepared using riva-build
# may also be copied manually to this location $(riva_model_loc/rmir).
#
# Models ($riva_model_loc/models)
# During the riva_init process, the RMIR files in $riva_model_loc/rmir
# are inspected and optimized for deployment. The optimized versions are
# stored in $riva_model_loc/models. The riva server exclusively uses these
# optimized versions.
riva_model_loc="<add path>"                              ## MAKE CHANGES HERE (Replace with the path ASR_MODEL_DIR)                      
```

<font color='red'>**ATTENTION:**</font> **Make sure to do the following before moving forward:**
1. In the file navigator in Jupyter Lab, navigate to riva_quickstart_v2.* and open config.sh
2. Configure settings as shown in the snippet above
   - Set nlp and tts services to false
   - Configure the riva_model_loc path to where the models resulting from riva-deploy are stored
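
If you prefer to script these changes instead of editing the file by hand, the sketch below rewrites the three settings in the text of `config.sh`. It assumes each setting is a plain `name=value` assignment at the start of a line, as in the snippet above. `set_shell_var` and `configure_asr_only` are hypothetical helpers, so back up `config.sh` before writing the result.

```python
import re

def set_shell_var(config_text, name, value):
    """Replace the value of the first `name=...` assignment that starts a line."""
    # Match either a double-quoted value or an unquoted run of non-space characters
    pattern = re.compile(rf'^({re.escape(name)}=)("[^"]*"|\S*)', flags=re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + value, config_text, count=1)
    if count == 0:
        raise KeyError(f"{name} is not assigned in the config text")
    return new_text

def configure_asr_only(config_text, model_loc):
    """Disable NLP and TTS and point riva_model_loc at the given directory."""
    text = set_shell_var(config_text, "service_enabled_nlp", "false")
    text = set_shell_var(text, "service_enabled_tts", "false")
    return set_shell_var(text, "riva_model_loc", f'"{model_loc}"')

# Example with the variables defined above (not executed here):
# config_path = os.path.join(RIVA_QSG, "config.sh")
# with open(config_path) as fh:
#     updated = configure_asr_only(fh.read(), ASR_MODEL_DIR)
# with open(config_path, "w") as fh:
#     fh.write(updated)
```

Trailing comments on the edited lines are left in place, and commenting out entries in `models_asr` is still a manual step.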

In [None]:
# Set `riva_model_loc` to where the models resulting from riva-deploy are stored. In our case it is ASR_MODEL_DIR
!echo $ASR_MODEL_DIR

In [None]:
# Ensure you have permission to execute these scripts
! cd $RIVA_QSG && chmod +x ./riva_start.sh

In [None]:
# Run Riva Start to start the server. This will deploy your model(s).
! cd $RIVA_QSG && ./riva_start.sh config.sh
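
Before sending requests, you may want to confirm that the server port accepts connections. The sketch below polls a TCP port until it opens or a timeout expires. `wait_for_port` is a hypothetical helper, and the default port `50051` is taken from the client URI used later in this notebook.

```python
import socket
import time

def wait_for_port(host="localhost", port=50051, timeout_s=120.0, interval_s=2.0):
    """Poll until a TCP connection to host:port succeeds; return True, or False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)

# Example (not executed here):
# if not wait_for_port():
#     print("Riva server is not reachable on localhost:50051")
```

An open port only shows that the server process is listening. It does not guarantee that every model has finished loading.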

---
## Run Inference
Once the Riva server is up and running with the models, you can send inference requests to it.

To send gRPC requests, you can install the Riva Python API bindings for the client. These are available as a `pip` [package](https://pypi.org/project/nvidia-riva-client/). Feel free to read more about the Python client [here](https://github.com/nvidia-riva/python-clients).

In [None]:
# Install the Client API Bindings
! pip install nvidia-riva-client

### Connect to the Riva Server and Run Automatic Speech Recognition
The following cells query the Riva server (using gRPC) with an input audio file to yield a transcript.

In [None]:
import io
import time
import wave

import grpc
import IPython.display as ipd

# The cells below use the riva.client API from the nvidia-riva-client package (Riva 2.3.0 and above)
import riva.client

The following URI assumes that the Riva Speech API server is deployed locally on the default port. If the server is deployed on a different host or through a Helm chart on Kubernetes, use the appropriate URI.
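
One way to avoid hard-coding the URI is to build it from environment variables. In the sketch below, `RIVA_HOST` and `RIVA_PORT` are hypothetical variable names chosen for illustration, not settings that Riva itself reads, and `riva_uri` is a hypothetical helper.

```python
import os

def riva_uri(default_host="localhost", default_port=50051, env=os.environ):
    """Build a 'host:port' URI, letting RIVA_HOST and RIVA_PORT override the defaults."""
    host = env.get("RIVA_HOST", default_host)
    port = int(env.get("RIVA_PORT", default_port))
    return f"{host}:{port}"

print(riva_uri())
```

The result can then be passed as the `uri` argument of `riva.client.Auth`.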

In [None]:
auth = riva.client.Auth(uri='localhost:50051')

riva_asr = riva.client.ASRService(auth)

In [None]:
# Load a sample audio file from local disk
# This example uses a .wav file with LINEAR_PCM encoding.
audio_file = "audio_samples/en-US_wordboosting_sample1.wav"
    
# Listen to the sample audio we are looking to transcribe
ipd.Audio(audio_file)
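
It can also be useful to look at the WAV header before transcribing, for example to confirm the channel count passed to the recognition config. The sketch below uses the standard library `wave` module; `wav_info` is a hypothetical helper.

```python
import wave

def wav_info(path):
    """Read basic properties from the header of a PCM .wav file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }

# Example with the variable defined above (not executed here):
# wav_info(audio_file)
```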

In [None]:
# Read the entire audio file into memory
with open(audio_file, 'rb') as fh:
    content = fh.read()

# Create the RecognitionConfig
config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
    audio_channel_count=1,
)

# Run offline (batch) recognition on the audio
response = riva_asr.offline_recognize(content, config)

print(response)
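
Printing the response shows the full message. To get just the text, you can walk the results and take the first alternative of each. The sketch below assumes the response exposes `results`, each with an `alternatives` list whose items carry a `transcript` field; `best_transcript` is a hypothetical helper.

```python
def best_transcript(response):
    """Join the top alternative of every result into a single string."""
    pieces = []
    for result in response.results:
        if result.alternatives:
            # Alternatives are assumed to be ordered best-first
            pieces.append(result.alternatives[0].transcript.strip())
    return " ".join(pieces)

# Example with the response obtained above (not executed here):
# print(best_transcript(response))
```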

With this, you should see a transcription result for the input audio sequence. Now you have a speech recognition pipeline running! 

The ground truth transcription is: *AntiBERTa and ABlooper both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for particular target antigens.*

So, it looks like the domain-specific terms such as `AntiBERTa` and `ABlooper` were not transcribed well. In the next notebook, you will look into how you can improve the transcription of such words!
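
A quick way to check this programmatically is to test whether each domain-specific term appears as a word in the transcript. The sketch below is a simple case-insensitive check, not a full word error rate computation; `missing_terms` is a hypothetical helper, and the sample transcript is made up for illustration.

```python
import re

def missing_terms(transcript, terms):
    """Return the terms that do not appear as whole words in the transcript (case-insensitive)."""
    words = set(re.findall(r"[a-z0-9']+", transcript.lower()))
    return [term for term in terms if term.lower() not in words]

# Made-up transcript for illustration, not actual Riva output
sample = "anti berta and a blooper both transformer based language models"
print(missing_terms(sample, ["AntiBERTa", "ABlooper", "transformer"]))
```

Running the same check on the transcript before and after customization gives a rough signal of whether the terms are now recognized.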