<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_asr_asr-python-advanced-finetune-am-citrinet-tao-deployment/nvidia_logo.png" style="width: 90px; float: right;">

# How to deploy a Riva Speech Recognition Pipeline
In this tutorial, you will learn how to deploy Riva speech recognition models, specifically the pre-trained **Acoustic model (Conformer-CTC)**, **Language model (n-gram)**, and **Inverse Text Normalization (WFST)** models downloaded from NVIDIA NGC.

This will serve as a primer for customization tutorials in this lab, which require configuring the Riva speech pipeline.

---
## Prerequisites

Before we get started, ensure that you have access to [**NVIDIA NGC**](https://ngc.nvidia.com/signin).

---
## Fetch ASR models from NGC

In [None]:
# Imports
import logging  # used by ngc_download_and_get_dir to report failed downloads
import os

# Create a local directory to save models
ASR_MODEL_DIR = os.path.join(os.getcwd(), "asr-models")
!mkdir -p $ASR_MODEL_DIR

### Define a function for downloading NGC resources

In [None]:
def ngc_download_and_get_dir(ngc_resource_name, resource_description, resource_type="model", parent_dir=ASR_MODEL_DIR):
    default_download_folder = "_v".join(ngc_resource_name.split("/")[-1].split(":"))
    download_path = os.path.join(parent_dir, default_download_folder)
    if os.path.exists(download_path):
        print(f"{resource_description} exists, skipping download")
        return default_download_folder
    ngc_output = !ngc registry $resource_type download-version $ngc_resource_name --dest $parent_dir
    if not os.path.exists(download_path):
        ngc_output_formatted='\n'.join(ngc_output)
        logging.error(
            f"NGC was not able to download the requested model {ngc_resource_name}. "
            "Please check the NGC error message, remove all directories, and re-start the "
            f"notebook. NGC message: {ngc_output_formatted}"
        )
        return None
    print(f"Successfully downloaded {resource_description}")
    return default_download_folder
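
The folder name that `ngc_download_and_get_dir` looks for is derived from the resource name alone. As a minimal sketch, the same expression can be tried in isolation; the helper name `ngc_folder_name` is ours and is not part of the NGC CLI:

```python
def ngc_folder_name(ngc_resource_name: str) -> str:
    """Mirror the folder-name expression used in ngc_download_and_get_dir."""
    # Keep only the last path component, e.g. "speechtotext_en_us_lm:deployable_v6.0"
    name_and_version = ngc_resource_name.split("/")[-1]
    # Replace the ":" between resource name and version with "_v"
    return "_v".join(name_and_version.split(":"))


print(ngc_folder_name("nvidia/riva/speechtotext_en_us_lm:deployable_v6.0"))
# prints: speechtotext_en_us_lm_vdeployable_v6.0
```

If a download directory with this name already exists under the parent directory, the function skips the download, so a partially downloaded folder should be removed before retrying.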

### Download the Conformer-CTC Acoustic Model
The Conformer-CTC Acoustic Model is located on NGC [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/models/speechtotext_en_us_conformer). Let's download it to a local path.

In [None]:
# Path where ngc will download the Acoustic Model
AM_DIR = ngc_download_and_get_dir("nvidia/riva/speechtotext_en_us_conformer:deployable_v6.0_export_v2", "Acoustic model")
AM_PATH = os.path.join(ASR_MODEL_DIR, AM_DIR)

In [None]:
# Inspect downloaded files
!ls -lt $AM_PATH

### Download the n-gram Language Model

The n-gram LM is located on NGC [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/models/speechtotext_en_us_lm). 

`NOTE:` This may take up to 30 minutes to download.

In [None]:
LM_DIR = ngc_download_and_get_dir("nvidia/riva/speechtotext_en_us_lm:deployable_v6.0", "Language model")
LM_PATH = os.path.join(ASR_MODEL_DIR, LM_DIR)

In [None]:
# Inspect downloaded files
!ls -lt $LM_PATH

### Download Inverse Text Normalization (ITN) Model

The ITN model is located on NGC [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/models/inverse_normalization_en_us).

In [None]:
ITN_DIR = ngc_download_and_get_dir("nvidia/riva/inverse_normalization_en_us:deployable_v2.2", "Inverse text normalization model")
ITN_PATH = os.path.join(ASR_MODEL_DIR, ITN_DIR)

In [None]:
# Inspect downloaded files
!ls -lt $ITN_PATH

---
## Riva ServiceMaker
Riva ServiceMaker is a set of tools that aggregates all the necessary artifacts (models, files, configurations, and user settings) for Riva deployment to a target environment. It has two main components: `riva-build` and `riva-deploy`.

### Riva-build

This step builds a Riva-ready version of the model. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. <br>

`riva-build` is responsible for the combination of one or more exported models (`.riva` files) into a single file containing an intermediate format called Riva Model Intermediate Representation (`.rmir`). This file contains a deployment-agnostic specification of the whole end-to-end pipeline along with all the assets required for the final deployment and inference. For more information, refer to the [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-customizing.html#pipeline-configuration).

In [None]:
riva_line_list = !wget -qO- https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html | grep "NVIDIA Riva Skills"
riva_line_string = riva_line_list[0]
__riva_version__ = riva_line_string.split(' ')[3]
# __riva_version__ = '2.14.0'
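
The cell above relies on the version being the fourth space-separated token of the matched line, so a change in the page layout could yield a wrong tag without raising an error. A hedged alternative is to search the line for a version-shaped pattern. The helper name `extract_version` and the sample line below are hypothetical; the real page content may differ:

```python
import re


def extract_version(line: str) -> str:
    """Return the first X.Y.Z version number found in a line of text."""
    match = re.search(r"\d+\.\d+\.\d+", line)
    if match is None:
        raise ValueError(f"No version number found in: {line!r}")
    return match.group(0)


# Hypothetical example line, for illustration only
sample_line = "<title>NVIDIA Riva Skills 2.14.0 documentation</title>"
print(extract_version(sample_line))
# prints: 2.14.0
```

If the lookup fails, the version can still be set by hand, as in the commented-out assignment above.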

In [None]:
MACHINE_TYPE="AMD64" #Change this to `ARM64_linux` or `ARM64_l4t` in case of an ARM64 machine.
TARGET_MACHINE="AMD64" #Change this to `ARM64_linux` or `ARM64_l4t` in case of an ARM64 machine.
# KEY = "nemotoriva" ## Encryption key used during nemo2riva
KEY = "tlt_encode" ## Encryption key for the .riva models used in this tutorial; passed to riva-build and riva-deploy
FORCE = True ## Whether to force-build a new ASR RMIR and replace any existing RMIRs

In [None]:
## Riva NGC, servicemaker image config.
if MACHINE_TYPE.lower() in ["amd64", "arm64_linux"]:
    RIVA_SM_CONTAINER = f"nvcr.io/nvidia/riva/riva-speech:{__riva_version__}-servicemaker"
elif MACHINE_TYPE.lower()=="arm64_l4t":
    RIVA_SM_CONTAINER = f"nvcr.io/nvidia/riva/riva-speech:{__riva_version__}-servicemaker-l4t-aarch64"
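
Note that the cell above leaves `RIVA_SM_CONTAINER` undefined when `MACHINE_TYPE` is none of the three expected values. One possible way to make that failure explicit is to wrap the same mapping in a small function; the name `servicemaker_image` is our own, not a Riva utility:

```python
def servicemaker_image(machine_type: str, riva_version: str) -> str:
    """Return the ServiceMaker image tag for a machine type, as in the cell above."""
    base = f"nvcr.io/nvidia/riva/riva-speech:{riva_version}-servicemaker"
    machine = machine_type.lower()
    if machine in ("amd64", "arm64_linux"):
        return base
    if machine == "arm64_l4t":
        return f"{base}-l4t-aarch64"
    # Fail early instead of leaving the image name undefined
    raise ValueError(f"Unsupported MACHINE_TYPE: {machine_type!r}")


print(servicemaker_image("AMD64", "2.14.0"))
# prints: nvcr.io/nvidia/riva/riva-speech:2.14.0-servicemaker
```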

In [None]:
# Get the ServiceMaker Docker container
! docker pull $RIVA_SM_CONTAINER

Below, we execute `riva-build` to create a pipeline configured for offline recognition. For reference, this command also appears in the [pipeline configuration](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-customizing.html#pipeline-configuration) section of the docs. <br>

First, let's set relevant paths relative to where we will mount the models in the Servicemaker docker:

In [None]:
# All model paths relative to Riva Servicemaker docker include the _SM suffix

ASR_MODEL_DIR_SM = "/data" # Path where we mount the downloaded ASR models in the Servicemaker docker

# Relative path to Acoustic Model
AM_SM = os.path.join(ASR_MODEL_DIR_SM, AM_DIR, "Conformer-CTC-L_spe128_en-US_6.0.riva")

# Relative path to LM model artifacts
DECODING_LM_BINARY_SM = os.path.join(ASR_MODEL_DIR_SM, LM_DIR, "en-US_default_6.0.bin")
DECODING_VOCAB_SM = os.path.join(ASR_MODEL_DIR_SM, LM_DIR, "en-US_default_6.0_dict_vocab.txt")

# Relative path to ITN artifacts
WFST_TOKENIZER_MODEL_SM = os.path.join(ASR_MODEL_DIR_SM, ITN_DIR, "tokenize_and_classify.far")
WFST_VERBALIZER_MODEL_SM = os.path.join(ASR_MODEL_DIR_SM, ITN_DIR, "verbalize.far")
FAR_SPEECH_HINTS_SM = os.path.join(ASR_MODEL_DIR_SM, ITN_DIR, "speech_class.far")

# Relative path where the generated .rmir file will be stored
RMIR_DIR = "default-models/rmir"
!mkdir -p $ASR_MODEL_DIR/$RMIR_DIR
ASR_RMIR_DIR_SM = os.path.join(ASR_MODEL_DIR_SM, RMIR_DIR)
ASR_RMIR_SM = os.path.join(ASR_RMIR_DIR_SM, "asr_lm_itn_offline.rmir")
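
Every `_SM` path is the host path with the mounted host directory swapped for `/data`. The following sketch illustrates that translation under the assumption that the models live below a single mounted directory; the function name `to_container_path` and the host path are hypothetical:

```python
import posixpath


def to_container_path(host_path: str, host_root: str, container_root: str = "/data") -> str:
    """Translate a path under the mounted host directory to its path inside the container."""
    relative = posixpath.relpath(host_path, host_root)
    if relative == ".." or relative.startswith("../"):
        raise ValueError(f"{host_path} is not inside {host_root}")
    return posixpath.join(container_root, relative)


# Hypothetical host location of the asr-models directory
host_root = "/home/user/asr-models"
print(to_container_path(host_root + "/default-models/rmir/asr_lm_itn_offline.rmir", host_root))
# prints: /data/default-models/rmir/asr_lm_itn_offline.rmir
```

Files outside the mounted directory are not visible inside the container, which is why the sketch rejects them.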

We use the Riva ServiceMaker Docker container to run `riva-build`:

In [None]:
! docker run --rm --gpus all -v $ASR_MODEL_DIR:$ASR_MODEL_DIR_SM $RIVA_SM_CONTAINER -- \
    riva-build speech_recognition $ASR_RMIR_SM:$KEY $AM_SM:$KEY \
        --force \
        --offline \
        --name=conformer-en-US-asr-offline \
        --return_separate_utterances=True \
        --featurizer.use_utterance_norm_params=False \
        --featurizer.precalc_norm_time_steps=0 \
        --featurizer.precalc_norm_params=False \
        --ms_per_timestep=40 \
        --endpointing.start_history=200 \
        --nn.fp16_needs_obey_precision_pass \
        --endpointing.residue_blanks_at_start=-2 \
        --chunk_size=4.8 \
        --left_padding_size=1.6 \
        --right_padding_size=1.6 \
        --max_batch_size=16 \
        --featurizer.max_batch_size=512 \
        --featurizer.max_execution_batch_size=512 \
        --decoder_type=flashlight \
        --flashlight_decoder.asr_model_delay=-1 \
        --decoding_language_model_binary=$DECODING_LM_BINARY_SM \
        --decoding_vocab=$DECODING_VOCAB_SM \
        --flashlight_decoder.lm_weight=0.8 \
        --flashlight_decoder.word_insertion_score=1.0 \
        --flashlight_decoder.beam_size=32 \
        --flashlight_decoder.beam_threshold=20. \
        --flashlight_decoder.num_tokenization=1 \
        --language_code=en-US \
        --wfst_tokenizer_model=$WFST_TOKENIZER_MODEL_SM \
        --wfst_verbalizer_model=$WFST_VERBALIZER_MODEL_SM \
        --speech_hints_model=$FAR_SPEECH_HINTS_SM

The arguments used above are just an example; many more optional parameters can be configured. For now, let's look at what the arguments used above mean:

* General pipeline parameters:
    * `--name`: Name of the ASR pipeline, used to set the model names in the Riva model repository
    * `--offline`: The Riva ASR pipeline can be configured for both streaming and offline recognition use cases. Here, we mark it to use it in the `offline` setting. More details on recommended configuration for offline/streaming are in the [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-pipeline-configuration.html#streaming-offline-recognition).
    * `--language_code`: Language of the model
    * `--force`: Overwrites the specified `.rmir` file if it exists.
    * `--max_batch_size`: Default maximum parallel requests in a single forward pass
    * `--chunk_size`: Size of audio chunks to use during inference. If not specified, default will be selected based on online/offline setting. 
    * `--left_padding_size`: The duration in seconds of the backward looking padding to prepend to the audio chunk. The acoustic model input corresponds to a duration of (left_padding_size + chunk_size + right_padding_size) seconds
    * `--right_padding_size`: The duration in seconds of the forward looking padding to append to the audio chunk. The acoustic model input corresponds to a duration of (left_padding_size + chunk_size + right_padding_size) seconds
* ITN model specific parameters
    * `--wfst_tokenizer_model`: Sparrowhawk model to use for tokenization and classification, must be in `.far` (finite-state archive) format. 
    * `--wfst_verbalizer_model`: Sparrowhawk model to use for verbalizer, must be in `.far` (finite-state archive) format.
    * `--speech_hints_model`: Speechhints class `.far` file used to enable speechhints.
* Acoustic model specific parameters
    * `--ms_per_timestep`: The duration in milliseconds of one timestep of the acoustic model output.
* Neural network specific parameters
    * `--nn.fp16_needs_obey_precision_pass`: Flag to explicitly mark layers as float when parsing the ONNX network
* Featurizer specific parameters
    * `--featurizer.use_utterance_norm_params`: Apply normalization at utterance level
    * `--featurizer.precalc_norm_time_steps`: Weight of the precomputed normalization parameters, in timesteps. Setting to 0 will disable use of precalculated normalization parameters.
    * `--featurizer.precalc_norm_params`: Boolean that controls if precalculated normalization parameters should be used
* Endpointing specific parameters
    * `--endpointing.residue_blanks_at_start`: Number of time steps to ignore at the beginning of the acoustic model output when trying to detect start/end of speech
* Decoder & Language Model specific parameters
    * `--decoder_type`: Type of decoder to use. Valid entries are greedy, os2s, flashlight or kaldi. In this example, we used the flashlight decoder.
    * `--flashlight_decoder.asr_model_delay`: Number of time steps by which the acoustic model output should be shifted when computing timestamps. This parameter must be tuned since the CTC model is not guaranteed to predict correct alignment.
    * `--decoding_language_model_binary`: Language model .binary used during decoding
    * `--decoding_vocab`: File of unique words separated by white space. Only used if decoding_lexicon not provided
    * `--flashlight_decoder.lm_weight`: Weight of language model. This affects the overall contribution of the language model score to the overall hypothesis score.
    * `--flashlight_decoder.word_insertion_score`: Word insertion score used when scoring hypothesis
    * `--flashlight_decoder.beam_threshold`: Threshold to prune hypothesis
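
To make the relationship between the chunk, padding, and timestep parameters concrete, here is a small worked calculation using the values from the `riva-build` command above. It is plain arithmetic on the documented definitions, with durations expressed in milliseconds to avoid floating-point rounding:

```python
# Values taken from the riva-build command above, expressed in milliseconds
chunk_ms = 4800          # --chunk_size=4.8
left_padding_ms = 1600   # --left_padding_size=1.6
right_padding_ms = 1600  # --right_padding_size=1.6
ms_per_timestep = 40     # --ms_per_timestep=40

# Duration of audio seen by the acoustic model for each chunk
window_ms = left_padding_ms + chunk_ms + right_padding_ms
# Number of acoustic model output timesteps covering that duration
window_timesteps = window_ms // ms_per_timestep
chunk_timesteps = chunk_ms // ms_per_timestep

print(f"Acoustic model input: {window_ms / 1000} s -> {window_timesteps} output timesteps")
print(f"New audio per chunk:  {chunk_ms / 1000} s -> {chunk_timesteps} output timesteps")
```

With these settings, each acoustic model input covers 8.0 seconds of audio, of which 4.8 seconds are new and the remainder is context on either side.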

This information is also accessible through the `riva-build speech_recognition -h` command, and more information about additional parameters to `riva-build` can be found in the [riva-build optional parameters](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-pipeline-configuration.html#riva-build-optional-parameters) documentation. 

In [None]:
! docker run --rm $RIVA_SM_CONTAINER -- riva-build speech_recognition -h

In [None]:
# Inspect the .rmir
!ls -lt $ASR_MODEL_DIR/$RMIR_DIR/*.rmir

### Riva-deploy

The deployment tool takes as input one or more Riva Model Intermediate Representation (RMIR) files and a target model repository directory. It creates an ensemble configuration specifying the pipeline for the execution and finally writes all those assets to the output model repository directory.

`NOTE`: 
1. This step may take about 10 mins to complete.
2. When running `riva-deploy`, we map `$ASR_MODEL_DIR/default-models` to `$ASR_MODEL_DIR_SM` (`/data`) inside the Riva ServiceMaker Docker container. This is because the scripts in the Riva Skills Quick Start resource folder (which we'll download later) expect the directory containing the `rmir` and `models` directories to be mapped to `/data`.  

In [None]:
# Path to the model repository relative to the ServiceMaker Docker container
MODEL_REPO_SM = os.path.join(ASR_MODEL_DIR_SM, "models")
# Reset the RMIR path relative to the ServiceMaker Docker container
ASR_RMIR_SM = os.path.join(ASR_MODEL_DIR_SM, "rmir", "asr_lm_itn_offline.rmir")

In [None]:
# Syntax: riva-deploy -f dir-for-rmir/model.rmir:key output-dir-for-repository
! docker run --rm --gpus all -v $ASR_MODEL_DIR/default-models:$ASR_MODEL_DIR_SM $RIVA_SM_CONTAINER -- \
            riva-deploy -f  $ASR_RMIR_SM:$KEY $MODEL_REPO_SM
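
Because `$ASR_MODEL_DIR/default-models` is mounted at `/data` for this step, the RMIR resolves to `/data/rmir/...` and the model repository to `/data/models` inside the container. As an illustration only, the sketch below assembles the same invocation as an argument list without executing it; the function name, the host path, and the version in the image tag are placeholders:

```python
import shlex


def riva_deploy_command(host_dir, container_dir, image, rmir_path, key, model_repo):
    """Assemble the docker invocation used above as an argument list (nothing is executed)."""
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "-v", f"{host_dir}:{container_dir}",
        image, "--",
        "riva-deploy", "-f", f"{rmir_path}:{key}", model_repo,
    ]


cmd = riva_deploy_command(
    host_dir="/home/user/asr-models/default-models",  # hypothetical host path
    container_dir="/data",
    image="nvcr.io/nvidia/riva/riva-speech:2.14.0-servicemaker",  # placeholder version
    rmir_path="/data/rmir/asr_lm_itn_offline.rmir",
    key="tlt_encode",
    model_repo="/data/models",
)
print(shlex.join(cmd))
```

Printing the command before running it makes it easy to check that the RMIR path and the mount point agree.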

In [None]:
# Inspect the models directory
!ls -lt $ASR_MODEL_DIR/default-models/models

---
## Start the Riva Server
After the model repository is generated, we are ready to start the Riva server. First, download the Riva Skills Quick Start resources from NGC. 

### Download the Riva Skills Quick Start guide
The [Riva Skills Quick Start](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/resources/riva_quickstart) guide contains easy-to-use scripts to download and deploy models. 

`NOTE:` The scripts in Quick Start can download and deploy the default models. We downloaded the ASR models above just to demonstrate how to use Riva ServiceMaker tools, which will be used during customization tutorials to re-deploy the pipeline.

In [None]:
if TARGET_MACHINE.lower() in ["amd64", "arm64_linux"]:
    quickstart_link = f"nvidia/riva/riva_quickstart:{__riva_version__}"
else:
    quickstart_link = f"nvidia/riva/riva_quickstart_arm64:{__riva_version__}"

RIVA_DIR = ngc_download_and_get_dir(quickstart_link, "Riva Quick Start resource folder", resource_type="resource", parent_dir=os.getcwd())
RIVA_DIR = os.path.join(os.getcwd(), RIVA_DIR)

### Configure Riva Quick Start 
This step configures the Quick Start scripts to deploy the ASR models produced by the Riva ServiceMaker tools in the previous section. <br>
For this, we modify the `config.sh` file to enable the relevant Riva services (ASR for the Conformer-CTC model), provide the encryption key, and set the path to the model repository (`riva_model_loc`) generated in the previous step, among other configurations.

In [None]:
!ls -lt $RIVA_DIR/config.sh

For example, if the model repository above was generated at `$ASR_MODEL_DIR/default-models/models`, set `riva_model_loc` to `$ASR_MODEL_DIR/default-models`. <br>

Pretrained versions of the models specified in `models_asr/nlp/tts/nmt` are fetched from NGC. Since we are using our custom model, we can comment out the corresponding entry in `models_asr` (and any others that are not relevant to your use case). <br>

#### config.sh snippet
```sh
### config.sh snippet  
# Enable or Disable Riva Services
# For any language other than en-US: service_enabled_nlp must be set to false
service_enabled_asr=true
service_enabled_nlp=true          ## MAKE CHANGES HERE - SET TO FALSE
service_enabled_tts=true          ## MAKE CHANGES HERE - SET TO FALSE
service_enabled_nmt=true          ## MAKE CHANGES HERE - SET TO FALSE

...

# Locations to use for storing models artifacts
#
# If an absolute path is specified, the data will be written to that location
# Otherwise, a Docker volume will be used (default).
#
# riva_init.sh will create a `rmir` and `models` directory in the volume or
# path specified.
#
# RMIR ($riva_model_loc/rmir)
# Riva uses an intermediate representation (RMIR) for models
# that are ready to deploy but not yet fully optimized for deployment. Pretrained
# versions can be obtained from NGC (by specifying NGC models below) and will be
# downloaded to $riva_model_loc/rmir by `riva_init.sh`
#
# Custom models produced by NeMo or TLT and prepared using riva-build
# may also be copied manually to this location $(riva_model_loc/rmir).
#
# Models ($riva_model_loc/models)
# During the riva_init process, the RMIR files in $riva_model_loc/rmir
# are inspected and optimized for deployment. The optimized versions are
# stored in $riva_model_loc/models. The riva server exclusively uses these
# optimized versions.
riva_model_loc="riva-model-repo"  ## MAKE CHANGES HERE (Replace with the path ASR_MODEL_DIR/default-models)

if [[ $riva_target_gpu_family == "tegra" ]]; then
    riva_model_loc="`pwd`/model_repository"
fi

# The default RMIRs are downloaded from NGC by default in the above $riva_rmir_loc directory
# If you'd like to skip the download from NGC and use the existing RMIRs in the $riva_rmir_loc
# then set the below $use_existing_rmirs flag to true. You can also deploy your set of custom
# RMIRs by keeping them in the riva_rmir_loc dir and use this quickstart script with the
# below flag to deploy them all together.
use_existing_rmirs=false          ## MAKE CHANGES HERE - SET TO TRUE
```

Run the cell below to make the following changes to `config.sh` without opening the file in a text editor:

1. Set NLP, NMT, and TTS services to `false`
2. Set `riva_model_loc` to the `ASR_MODEL_DIR/default-models` path
3. Set the variable `use_existing_rmirs` to `true`

In [None]:
with open(f"{RIVA_DIR}/config.sh", "r") as config_in:
    config_file = config_in.readlines()

for i, line in enumerate(config_file):
    # Disable services
    if line.startswith("service_enabled_asr"):
        config_file[i] = "service_enabled_asr=true\n"
    elif line.startswith("service_enabled_nlp"):
        config_file[i] = "service_enabled_nlp=false\n"
    elif line.startswith("service_enabled_nmt"):
        config_file[i] = "service_enabled_nmt=false\n"
    elif line.startswith("service_enabled_tts"):
        config_file[i] = "service_enabled_tts=false\n"
    # Update riva_model_loc to our rmir folder
    elif line.startswith("riva_model_loc"):
        config_file[i] = f'riva_model_loc="{ASR_MODEL_DIR}/default-models"\n'
    elif line.startswith("use_existing_rmirs"):
        config_file[i] = "use_existing_rmirs=true\n"

with open(f"{RIVA_DIR}/config.sh", "w") as config_out:
    config_out.writelines(config_file)

print("".join(config_file))
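
The same line-by-line replacement can be expressed as a small reusable function. This is a sketch under the assumption that the settings of interest are unindented `name=value` lines, as in the snippet above; the function name `set_shell_variables` and the path in `updates` are hypothetical:

```python
def set_shell_variables(lines, updates):
    """Return a copy of `lines` where top-level `name=value` assignments are replaced."""
    result = []
    for line in lines:
        # Text before the first "=" is the variable name (including any indentation)
        name = line.split("=", 1)[0]
        if "=" in line and name in updates:
            result.append(f"{name}={updates[name]}\n")
        else:
            result.append(line)
    return result


sample = [
    "service_enabled_asr=true\n",
    "service_enabled_nlp=true\n",
    'riva_model_loc="riva-model-repo"\n',
    '    riva_model_loc="`pwd`/model_repository"\n',  # indented: left untouched
    "use_existing_rmirs=false\n",
]
updates = {
    "service_enabled_nlp": "false",
    "riva_model_loc": '"/home/user/asr-models/default-models"',  # hypothetical path
    "use_existing_rmirs": "true",
}
result = set_shell_variables(sample, updates)
print("".join(result))
```

Matching on the exact variable name, rather than on a prefix, leaves the indented `riva_model_loc` assignment inside the `tegra` branch unchanged.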

In [None]:
# Ensure you have permission to execute these scripts
! cd $RIVA_DIR && chmod +x ./riva_start.sh && chmod +x ./riva_stop.sh

Normally, one runs `riva_init.sh` before `riva_start.sh`. However, since we've already built our `.rmir` file with `riva-build` and deployed the associated model files by running `riva-deploy`, we can skip straight to `riva_start.sh`.

In [None]:
# Run Riva Start to start the server. This will deploy your model(s).
! cd $RIVA_DIR && ./riva_start.sh config.sh
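
Before sending requests from a separate script or process, it can be useful to confirm that something is listening on the gRPC port. The helper below is a generic TCP check of our own, not part of the Riva tooling, and its name and defaults are assumptions:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connection is closed immediately by the context manager
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, `wait_for_port("localhost", 50051)` returns `True` once the port used by the client cells in this notebook accepts connections. An open port only shows that a process is listening; it does not confirm that the models have finished loading.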

---
## Run Inference
Once the Riva server is up and running with the models, you can send inference requests querying the server. 

To send gRPC requests, you can install the Riva Python API bindings for the client. These are available as a `pip` [package](https://pypi.org/project/nvidia-riva-client/). Feel free to read more about the Python client [here](https://github.com/nvidia-riva/python-clients).

In [None]:
# Install the Client API Bindings
! pip install nvidia-riva-client

### Connect to the Riva Server and Run Automatic Speech Recognition
The following cells query the Riva server (using gRPC) with an input audio file to obtain a transcript.

In [None]:
import io
import IPython.display as ipd
import grpc
import time

try:
    import riva.client # RIVA 2.3.0 and above
except ImportError:
    import riva_api.riva_audio_pb2 as ra # RIVA 2.0.0 and above
    import riva_api.audio_pb2 as ra
    import riva_api.riva_asr_pb2 as rasr
    import riva_api.riva_asr_pb2_grpc as rasr_srv
import wave

The following URI assumes a local deployment of the Riva Speech API server is on the default port. In case the server deployment is on a different host or via a Helm chart on Kubernetes, use an appropriate URI.

In [None]:
auth = riva.client.Auth(uri='localhost:50051')

riva_asr = riva.client.ASRService(auth)

In [None]:
# Load a sample audio file from local disk
# This example uses a .wav file with LINEAR_PCM encoding.
audio_file = "audio_samples/en-US_wordboosting_sample1.wav"
    
# Listen to the sample audio we are looking to transcribe
ipd.Audio(audio_file)

In [None]:
# Read the audio file as raw bytes to send to the server
with open(audio_file, 'rb') as fh:
    content = fh.read()

# Creating RecognitionConfig
config = riva.client.RecognitionConfig(
  language_code="en-US",
  max_alternatives=1,
  enable_automatic_punctuation=True,
  audio_channel_count = 1
)

# ASR Inference call with Recognize 
response = riva_asr.offline_recognize(content, config)

print(response)
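
The printed response is nested, so pulling out just the text takes a few attribute lookups. The sketch below assumes the response exposes `results`, each with a list of `alternatives` carrying a `transcript` field; the helper name `best_transcript` is ours, and a stand-in object is used so the example runs without a server:

```python
from types import SimpleNamespace


def best_transcript(response) -> str:
    """Concatenate the top alternative of every result in a recognition response."""
    parts = []
    for result in response.results:
        if result.alternatives:
            parts.append(result.alternatives[0].transcript)
    return "".join(parts).strip()


# Stand-in object with the assumed nesting, for illustration only
fake_response = SimpleNamespace(
    results=[
        SimpleNamespace(alternatives=[SimpleNamespace(transcript="hello ")]),
        SimpleNamespace(alternatives=[SimpleNamespace(transcript="world ")]),
    ]
)
print(best_transcript(fake_response))
# prints: hello world
```

With a live server, the same function could be applied to the `response` returned by `riva_asr.offline_recognize`, provided the field layout matches the assumption above.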

With this, you should see a transcription result for the input audio sequence. Now you have a speech recognition pipeline running! 

The ground truth transcription is: *AntiBERTa and ABlooper both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for particular target antigens.*

So, it looks like the domain-specific terms like `AntiBERTa` and `ABlooper` were not transcribed well. In the next notebook, you will look into how you can improve transcription of such words!

You can stop the Riva server container (`riva-speech`), and thus shut down the Riva server, before shutting down the Jupyter kernel.

In [None]:
! docker container stop riva-speech