<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/rivaasrasr-finetuning-conformer-ctc-nemo/nvidia_logo.png" style="width: 90px; float: right;">

# Training and Deploying N-GPU Language Models for Parakeet RNNT with NVIDIA NIM

This comprehensive tutorial demonstrates how to train and deploy an NVIDIA N-GPU Language Model (LM) for Parakeet RNNT acoustic models using NVIDIA NeMo and deploy them as NVIDIA NIM (NVIDIA Inference Microservices). You'll learn the complete pipeline from data preparation to model deployment and inference.

## What You'll Learn
- How to train n-gram language models using NeMo and KenLM
- How to integrate language models with Parakeet RNNT acoustic models
- How to deploy custom models using NVIDIA Riva NIM
- How to perform inference with your deployed models

## Prerequisites
- Basic understanding of automatic speech recognition (ASR)
- Familiarity with Python and Jupyter notebooks
- Access to NVIDIA NGC and GPU resources

## NVIDIA Riva NIM Overview

NVIDIA Riva ASR NIM APIs provide easy access to state-of-the-art automatic speech recognition (ASR) models for multiple languages. Riva ASR NIM models are built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to offer out-of-the-box GPU acceleration.

In this tutorial, we will interact with the automated speech recognition (ASR) APIs.

For more information about Riva ASR NIM, refer to the [Riva NIM documentation](https://docs.nvidia.com/nim/riva/asr/latest/overview.html).

## NeMo (Neural Modules)
[NVIDIA NeMo](https://developer.nvidia.com/nvidia-nemo) is an open-source framework for building, training, and fine-tuning GPU-accelerated speech AI and NLU models with a simple Python interface. For information about how to set up NeMo, refer to the [NeMo GitHub](https://github.com/NVIDIA/NeMo) instructions.

### n-gram Language Model
There are primarily two types of language models:

- **n-gram language models**: These models use the frequency of n-grams to learn the probability distribution over words. Two benefits of n-gram language models are simplicity and scalability – with a larger `n`, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.
- **Neural language models**: These models use different kinds of neural networks to model the probability distribution over words, and have surpassed the n-gram language models in the ability to model language, but are generally slower to evaluate.

In this tutorial, we will show how to train an [n-gram language model](https://web.stanford.edu/~jurafsky/slp3/3.pdf) leveraging NeMo and deploy as NGPU LM in NVIDIA ASR NIM.


In [None]:
"""
You can run either this tutorial locally (if you have all the dependencies and a GPU) or on Google Colab.

Perform the following steps to setup in Google Colab:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub.
   a. Click **File** > **Upload Notebook** > **GITHUB** tab > copy/paste the GitHub URL.
3. Connect to an instance with a GPU.
   a. Click **Runtime** > Change the runtime type > select **GPU** for the hardware accelerator.
4. Run this cell to set up the dependencies.
5. Restart the runtime.
   a. Click **Runtime** > **Restart Runtime** for any upgraded packages to take effect.
"""

# Install Dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg libsox-fmt-mp3 jq
!pip install text-unidecode
!pip install matplotlib>=3.3.2
!pip install Cython

## Install NeMo
BRANCH = 'v2.4.0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, in the case where you want to use the "Run All Cells" (or similar) option,
uncomment `exit()` below to crash and restart the kernel.
"""
# exit()

### Prerequisites
Ensure you meet the following prerequisites.
1. You have access and are logged into NVIDIA NGC. For step-by-step instructions, refer to the [NGC Getting Started Guide](https://docs.nvidia.com/ngc/ngc-overview/index.html#registering-activating-ngc-account).
2. You have installed Kaggle API. For step-by-step instructions, refer to this [install and authenticate Kaggle API](https://www.kaggle.com/docs/api).

---
## Training an ngram model with NeMo

### Installing the required packages

This step clones the NeMo repository and installs the required dependencies, including KenLM, which is used for building language models. The installation process may take several minutes to complete.

In [8]:
import os
NEMO_ROOT = "NeMo" # Path to clone the NeMo repository.
NEMO_ROOT = "/media/mayjain/Seagate/mayjain/work/del_this/NeMo"
os.environ["NEMO_ROOT"] = NEMO_ROOT
!git clone -b $BRANCH --single-branch https://github.com/NVIDIA/NeMo.git $NEMO_ROOT
!cd $NEMO_ROOT/scripts/asr_language_modeling/ngram_lm/ && bash install_beamsearch_decoders.sh $NEMO_ROOT

### Preparing the Dataset
#### LibriSpeech LM Normalized Dataset
For this tutorial, we use the normalized version of the LibriSpeech LM dataset to train our n-gram language model. The normalized version of the LibriSpeech LM dataset is available [here](https://www.openslr.org/11/).<br>
The training data is publicly available [here](https://www.openslr.org/resources/11/librispeech-lm-corpus.tgz) and can be downloaded directly.
#### Downloading the Dataset

In this tutorial, we will use the popular Librispeech dataset. Let's download it.

In [6]:
# Set the path to a folder where you want your data and results to be saved.
DATA_DOWNLOAD_DIR="content/datasets"
MODELS_DIR="content/models"

DATA_DOWNLOAD_DIR = "/media/mayjain/Seagate/mayjain/work/del_this/datasets"
MODELS_DIR = "/media/mayjain/Seagate/mayjain/work/del_this/models"

os.environ["DATA_DOWNLOAD_DIR"] = DATA_DOWNLOAD_DIR
os.environ["MODELS_DIR"] = MODELS_DIR

!mkdir -p $DATA_DOWNLOAD_DIR $MODELS_DIR

After downloading, untar the dataset and move it to the correct directory.

In [5]:
# Note: Ensure that wget and unzip utilities are available. If not, install them.
!wget 'https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz' -P $DATA_DOWNLOAD_DIR

# Extract the data
!gzip -dk $DATA_DOWNLOAD_DIR/librispeech-lm-norm.txt.gz

--2025-09-19 03:31:54--  https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
Resolving www.openslr.org (www.openslr.org)... 136.243.171.4
connected. to www.openslr.org (www.openslr.org)|136.243.171.4|:443... 
HTTP request sent, awaiting response... 200 OK
Length: 1507274412 (1.4G) [application/x-gzip]
Saving to: ‘/media/mayjain/Seagate/mayjain/work/del_this/datasets/librispeech-lm-norm.txt.gz’


2025-09-19 03:33:35 (14.5 MB/s) - ‘/media/mayjain/Seagate/mayjain/work/del_this/datasets/librispeech-lm-norm.txt.gz’ saved [1507274412/1507274412]



For the sake of reducing the time this tutorial takes, we reduced the number of lines of the training dataset. Feel free to modify the number of used lines.

In [7]:
# Use a random 100,000 lines for training
!shuf -n 100000 $DATA_DOWNLOAD_DIR/librispeech-lm-norm.txt | tr '[:upper:]' '[:lower:]' > $DATA_DOWNLOAD_DIR/reduced_training.txt

The N-GPU LMs for Parakeet RNNT models are token based. So we need access to ASR's tokenizer model to tokenize the training data. Lets download the RNNT model we want to deploy the N-GPU LM with.

In [8]:
!wget -P $MODELS_DIR https://huggingface.co/nvidia/parakeet-rnnt-1.1b/resolve/main/parakeet-rnnt-1.1b.nemo

--2025-09-19 03:37:11--  https://huggingface.co/nvidia/parakeet-rnnt-1.1b/resolve/main/parakeet-rnnt-1.1b.nemo
Resolving huggingface.co (huggingface.co)... 108.158.251.34, 108.158.251.67, 108.158.251.89, ...
Connecting to huggingface.co (huggingface.co)|108.158.251.34|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cas-bridge.xethub.hf.co/xet-bridge-us/658cb5dd8cff48d3a45472a7/a67a354491ca2c944dc2ac6d4c8710ed9f38b6051acd9740305a3e0ba4d468ce?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20250919%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250919T033711Z&X-Amz-Expires=3600&X-Amz-Signature=5fc2e9678ba8a2bb3eb0429dbe9097b34cf0dc623b74eb61f3b25968a20d8237&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27parakeet-rnnt-1.1b.nemo%3B+filename%3D%22parakeet-rnnt-1.1b.nemo%22%3B&x-id=GetObject&Expires=1758256631&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvb

Now we have all the required artifacts. Lets train the N-GPU LM!

In [14]:
!cd $NEMO_ROOT/scripts/asr_language_modeling/ngram_lm/ && python3 train_kenlm.py \
              nemo_model_file=$MODELS_DIR/parakeet-rnnt-1.1b.nemo \
              train_paths=['{DATA_DOWNLOAD_DIR}/reduced_training.txt'] \
              kenlm_bin_path=$NEMO_ROOT/decoders/kenlm/build/bin \
              kenlm_model_file=$MODELS_DIR/ngpu_6g \
              ngram_length=6 save_nemo=True

    
['/media/mayjain/Seagate/mayjain/work/del_this/datasets/reduced_training.txt'] ['/media/mayjain/Seagate/mayjain/work/del_this/datasets/reduced_training.txt']
/media/mayjain/Seagate/mayjain/work/del_this/NeMo/decoders/kenlm/build/bin
[NeMo I 2025-09-19 04:32:02 nemo_logging:393] Loading nemo model '/media/mayjain/Seagate/mayjain/work/del_this/models/parakeet-rnnt-1.1b.nemo' ...
[NeMo I 2025-09-19 04:33:57 nemo_logging:393] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo W 2025-09-19 04:33:57 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /disk1/NVIDIA/datasets/LibriSpeech_NeMo/librivox-train-all.json
    sample_rate: 16000
    batch_size: 16
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 16.7
    min

The model is successfully saved as ngpu_6g.nemo

### Deploying the n-GPU LM with Parakeet RNNT in Nvidia NIM

## NeMo (Neural Modules) and `nemo2riva`
[NVIDIA NeMo](https://developer.nvidia.com/nvidia-nemo) is an open-source framework for building, training, and fine-tuning GPU-accelerated speech AI and natural language understanding (NLU) models with a simple Python interface. To fine-tune a Parakeet-CTC acoustic model with NeMo, refer to the [Parakeet-CTC fine-tuning tutorial](https://github.com/nvidia-riva/tutorials/blob/main/asr-finetune-parakeet-nemo.ipynb).

The [`nemo2riva`]() command-line tool provides the capability to export your `.nemo` model in a format that can be deployed using [NVIDIA Riva](https://docs.nvidia.com/nim/riva/asr/latest/overview.html) ASR NIM. A Python `.whl` file for `nemo2riva` is available in [PyPi](https://pypi.org/project/nemo2riva/). You can install `nemo2riva` with `pip`, as shown in the [Parakeet-CTC fine-tuning tutorial](https://github.com/nvidia-riva/tutorials/blob/main/asr-finetune-parakeet-nemo.ipynb). 

This tutorial explores taking a `.riva` model &mdash; the result of invoking the `nemo2riva` CLI tool (refer to the [Parakeet-CTC fine-tuning tutorial](https://github.com/nvidia-riva/tutorials/blob/main/asr-finetune-parakeet-nemo.ipynb)) &mdash; and leveraging the Riva ServiceMaker framework to aggregate all the necessary artifacts for Riva deployment to a target environment. Once the model is deployed as a Riva NIM, you can issue inference requests to the server. We will demonstrate how quick and straightforward this whole process is.
In this tutorial, you will learn how to:
- Build an `.rmir` model pipeline from a `.riva` file with Riva ServiceMaker.
- Deploy the model locally on the Riva server.
- Send inference requests from a demo client using Riva API bindings.

Lets install nemo2riva to convert the donwloaded Parakeet-RNNT model to .riva format.

In [None]:
# install nemo2riva 
!pip3 install --extra-index-url https://pypi.nvidia.com  nemo2riva

In [15]:
!nemo2riva --key tlt_encode --format nemo $MODELS_DIR/parakeet-rnnt-1.1b.nemo

    
INFO: PyTorch version 2.7.0a0+7c8ec84dab.nv25.3 available.
INFO: Polars version 1.21.0 available.
[NeMo I 2025-09-19 05:19:38 nemo_logging:393] Logging level set to 20
[NeMo I 2025-09-19 05:19:38 nemo_logging:393] Restoring NeMo model from '/media/mayjain/Seagate/mayjain/work/del_this/models/parakeet-rnnt-1.1b.nemo'
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: `Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
INFO: `Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
INFO: `Trainer(limit_test_batches=1.0)` was configured so 100% of the batches will be used..
INFO: `Trainer(limit_predict_batches=1.0)` was configured so 100% of the batches will be used..
INFO: `Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
[NeMo I 2025-09-19 05:21:33 nemo_logging:393]

---
## Riva ServiceMaker
Riva ServiceMaker is a set of tools that aggregates all the necessary artifacts (models, files, configurations, and user settings) for Riva NIM deployment to a target environment. It has two main components:

### Riva-Build

This step helps build a Riva-ready version of the model. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. Let's consider an ASR n-gram language model. <br>

`riva-build` is responsible for the combination of one or more exported models (`.riva` files) into a single file containing an intermediate format called Riva Model Intermediate Representation (`.rmir`). This file contains a deployment-agnostic specification of the whole end-to-end pipeline along with all the assets required for the final deployment and inference. For more information, refer to the [documentation](https://docs.nvidia.com/nim/riva/asr/latest/custom-deployment.html#deploying-custom-models-as-nim).

In [16]:
# IMPORTANT: UPDATE THESE PATHS 

# Riva NIM Docker

# Refer to this table to get the CONTAINER_ID for the model architecture you want to deploy.
# https://docs.nvidia.com/nim/riva/asr/latest/support-matrix.html#supported-models
# Since this is RNNT model, we should use following CONTAINER_ID
CONTAINER_ID = "parakeet-1-1b-rnnt-multilingual"

# Name of the acoustic model .riva file
ACOUSTIC_MODEL_NAME = f"{MODELS_DIR}/parakeet-rnnt-1.1b.riva"

# Name of the language model .nemo file
LANGUAGE_MODEL_NAME = f"{MODELS_DIR}/ngpu_6g.nemo"

# Path to store NIM model repository, Make sure that this directory is empty
NIM_EXPORT_PATH="~/nim_cache" 
NIM_EXPORT_PATH="/media/mayjain/Seagate/mayjain/work/del_this/nim_cache"

! mkdir -p $NIM_EXPORT_PATH
! chmod 777 $NIM_EXPORT_PATH

#### Build the `.rmir` file

Refer to the [Riva ASR NIM Pipeline Configuration documentation](https://docs.nvidia.com/nim/riva/asr/latest/pipeline-configuration.html) to obtain the proper `riva-build` parameters for your particular application, select the acoustic model, language, and pipeline type (offline for the purposes of this tutorial) from the interactive web menu.

In [None]:
# Set the appropriate value
! docker run --gpus all --rm \
     -v $MODEL_DIR:/servicemaker-dev \
     --name riva-servicemaker \
     --entrypoint="" \
     nvcr.io/nim/nvidia/$CONTAINER_ID \
     riva-build speech_recognition \
        /servicemaker-dev/asr_offline_riva_ngram_lm.rmir:tlt_encode \
        /servicemaker-dev/$ACOUSTIC_MODEL_NAME:tlt_encode \
        --offline --name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-offline \
        --return_separate_utterances=True --featurizer.use_utterance_norm_params=False \
        --featurizer.precalc_norm_time_steps=0 --featurizer.precalc_norm_params=False \
        --ms_per_timestep=80 --language_code=en-US \
        --nn.fp16_needs_obey_precision_pass --unified_acoustic_model \
        --chunk_size=8.0 --left_padding_size=0 --right_padding_size=0 \
        --featurizer.max_batch_size=256 --featurizer.max_execution_batch_size=256 \
        --max_batch_size=128 --nn.opt_batch_size=128 \
        --endpointing_type=niva --endpointing.stop_history=0  \
        --decoder_type=nemo --nemo_decoder.language_model_alpha=0.5 \
        --nemo_decoder.language_model_file=/servicemaker-dev/ngpu_6g.nemo

### Riva-Deploy

The deployment tool takes as input one or more RMIR files and a target model repository directory. It creates an ensemble configuration specifying the pipeline for the execution and finally writes all those assets to the output model repository directory.

**Note:** If you added an encryption key to your `.rmir` file when building it with `riva-build`, make sure to append a colon and then the key's value to the model's name in the `riva-deploy` command, as shown below.

In [None]:
# Syntax: riva-deploy -f dir-for-rmir/model.rmir[:key] output-dir-for-repository
! docker run --gpus all --rm \
     -v $MODEL_LOC:/servicemaker-dev \
     -v $NIM_EXPORT_PATH:/model_tar \
     --name riva-servicemaker \
     --entrypoint="" \
     nvcr.io/nim/nvidia/$CONTAINER_ID \
     bash -c "riva-deploy -f /servicemaker-dev/asr_offline_riva_ngram_lm.rmir /data/models/ && tar -czf /model_tar/custom_models.tar.gz -C /data/models ."

---
## Start the Riva ASR NIM
After the model repository is generated, we are ready to start the Riva NIM server. 

In [None]:

# Run the container with the cache directory mounted in the appropriate location:
! docker run -it --rm -d --name=$CONTAINER_ID \
   --runtime=nvidia \
   --gpus '"device=0"' \
   --shm-size=8GB \
   -e NGC_API_KEY \
   -e NIM_TAGS_SELECTOR \
   -e NIM_DISABLE_MODEL_DOWNLOAD=true \
   -e NIM_HTTP_API_PORT=9000 \
   -e NIM_GRPC_API_PORT=50051 \
   -p 9000:9000 \
   -p 50051:50051 \
   -v $NIM_EXPORT_PATH:/opt/nim/export \
   -e NIM_EXPORT_PATH=/opt/nim/export \
   nvcr.io/nim/nvidia/$CONTAINER_ID:latest

---
## Run Inference
After the Riva NIM server is up and running with your models, you can send inference requests querying the server. 

To send gRPC requests, you can install the Riva Python API bindings for the client. This is available as a [Python module on PyPI](https://pypi.org/project/nvidia-riva-client/).

In [None]:
# Install the Client API Bindings
! pip install nvidia-riva-client

In [None]:
import riva.client

In [None]:
### Connect to the Riva Server and Run Inference

NIM server can take some time to load, wait till the server is ready to serve the requests

In [None]:
import requests, time

for i in range(30):
    try:
        print(f"Waiting for NIM server to load, retrying in 5 seconds...")
        r = requests.get("http://0.0.0.0:9000/v1/health/live", timeout=2)
        if "live" in r.text:
            print("NIM server is ready!")
            break
    except requests.RequestException as e:
        pass
    time.sleep(5)
else:
    print("Server did not become ready after 30 attempts.")

#### Once the server is ready, we can call this inference function to query the Riva NIM server (using gRPC) to transcribe an audio file. 

In [None]:
def run_inference(audio_file, server='localhost:50051', print_full_response=False):
    with open(audio_file, 'rb') as fh:
        data = fh.read()
    
    auth = riva.client.Auth(uri=server)
    client = riva.client.ASRService(auth)
    config = riva.client.RecognitionConfig(
        language_code="en-US",
        max_alternatives=1,
        enable_automatic_punctuation=False,
    )
    
    response = client.offline_recognize(data, config)
    if print_full_response: 
        print(response)
    else:
        print(response.results[0].alternatives[0].transcript)

In [None]:
audio_file = "audio_samples/en-US_sample.wav"
run_inference(audio_file)

You can stop the Riva NIM server before shutting down the Jupyter kernel.

In [None]:
!docker stop $CONTAINER_ID
!docker rm $CONTAINER_ID