<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/rivaasrasr-finetuning-conformer-ctc-nemo/nvidia_logo.png" style="width: 90px; float: right;">

# Training and Deploying N-GPU Language Models for Parakeet RNNT with NVIDIA NIM

This tutorial demonstrates how to train an NVIDIA N-GPU Language Model (LM) for Parakeet RNNT acoustic models using NVIDIA NeMo and deploy it with NVIDIA NIM (NVIDIA Inference Microservices). You'll learn the complete pipeline, from data preparation through model deployment to inference.

## What You'll Learn
- How to train n-gram language models using NeMo and KenLM for Parakeet RNNT models
- How to integrate language models with Parakeet RNNT acoustic models
- How to deploy custom models using NVIDIA Riva NIM
- How to perform inference with your deployed models

## Prerequisites
- Basic understanding of automatic speech recognition (ASR)
- Familiarity with Python and Jupyter notebooks
- Access to NVIDIA GPU resources

## NVIDIA Riva NIM Overview

**NVIDIA Riva NIM (NVIDIA Inference Microservices)** provides enterprise-grade, production-ready AI services with optimized performance and scalability. The Riva ASR NIM specifically offers:

### Key Features
- **Multi-language Support**: State-of-the-art automatic speech recognition models for multiple languages
- **GPU Acceleration**: Built on NVIDIA's software platform with CUDA, TensorRT, and Triton integration
- **Production Ready**: Optimized for enterprise deployment with high throughput and low latency
- **Easy Integration**: Simple REST and gRPC APIs for seamless application integration
- **Custom Model Support**: Ability to deploy your own trained models

### Architecture Benefits
- **Containerized Deployment**: Easy deployment using Docker containers
- **Model Optimization**: Automatic model optimization for target hardware
- **Real-time Processing**: Support for both streaming and batch inference

In this tutorial, we'll deploy a custom Parakeet RNNT model with an n-gram language model to demonstrate the complete workflow from training to production deployment.

For comprehensive documentation, visit the [Riva NIM documentation](https://docs.nvidia.com/nim/riva/asr/latest/overview.html).

## NVIDIA NeMo Framework

**[NVIDIA NeMo](https://developer.nvidia.com/nvidia-nemo)** is a powerful, open-source framework designed for building, training, and fine-tuning GPU-accelerated conversational AI models. NeMo provides a comprehensive toolkit for:

### Core Capabilities
- **Speech AI**: Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Voice Activity Detection (VAD)
- **Multi-GPU Training**: Distributed training across multiple GPUs for faster model development
- **Pre-trained Models**: Access to state-of-the-art pre-trained models for quick prototyping
- **Model Export**: Easy export to various deployment formats for Riva NIM

### Why NeMo for This Tutorial?
- **Language Model Training**: Built-in support for n-gram language model training with KenLM integration
- **Model Integration**: Seamless integration between acoustic models and language models
- **Production Deployment**: Direct export capabilities to Riva NIM format
- **Research to Production**: Smooth transition from research experiments to production deployment

### Getting Started
For detailed setup instructions and comprehensive documentation, visit the [NeMo GitHub repository](https://github.com/NVIDIA/NeMo).

## Understanding N-gram Language Models

Language models are crucial components in automatic speech recognition systems, helping to improve accuracy by incorporating linguistic knowledge. There are two primary approaches to language modeling:

### N-gram Language Models
**N-gram models** are statistical language models that predict the next word based on the previous `n-1` words. They work by:

- **Frequency Analysis**: Learning probability distributions from word sequence frequencies in training data
- **Context Window**: Using a fixed context window of the `n-1` preceding words to make predictions
- **Efficiency**: Providing fast inference with predictable computational requirements
- **Scalability**: Offering a clear space-time tradeoff - larger `n` values capture more context but require more memory

**Advantages:**
- Simple and interpretable
- Fast inference and training
- Well-understood mathematical foundation
- Excellent for domain-specific applications
- Low computational overhead
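
To make the counting idea concrete, here is a minimal sketch of a maximum-likelihood n-gram estimate over a toy corpus. The function names (`train_ngram`, `ngram_prob`) and the corpus are illustrative assumptions, and no smoothing is applied. KenLM, used later in this tutorial, estimates probabilities with modified Kneser-Ney smoothing instead.

```python
from collections import Counter, defaultdict


def train_ngram(sentences, n=2):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad so the first real word also has a full-length context.
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            counts[context][tokens[i]] += 1
    return counts


def ngram_prob(counts, context, word):
    """Unsmoothed estimate: count(context, word) / count(context)."""
    followers = counts.get(tuple(context))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())


toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
bigram_counts = train_ngram(toy_corpus, n=2)
p_cat = ngram_prob(bigram_counts, ["the"], "cat")
print(f"P(cat | the) = {p_cat:.3f}")
```

In the toy corpus, `the` is followed by `cat` twice and by `dog` once, so the bigram estimate for `cat` given `the` is 2/3. Unseen contexts get probability 0 here, which is exactly the problem that smoothing addresses in a real toolkit.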

### Neural Language Models
**Neural language models** use deep learning architectures (RNNs, Transformers, etc.) to model language:

- **Superior Performance**: Generally achieve better language modeling capabilities
- **Context Awareness**: Can capture long-range dependencies and complex patterns
- **Computational Cost**: Require more computational resources for training and inference

### Why N-gram Models for ASR?
For speech recognition applications, n-gram models offer several practical advantages:
- **Real-time Performance**: Fast enough for streaming ASR applications
- **Resource Efficiency**: Lower memory and computational requirements
- **Domain Adaptation**: Easy to retrain on domain-specific data
- **Integration**: Seamless integration with existing ASR pipelines

In this tutorial, we'll train a 6-gram language model using the KenLM toolkit integrated with NeMo, then deploy it as an N-GPU Language Model in NVIDIA Riva NIM for production use.

For a deeper treatment, see the [n-gram language models chapter of Jurafsky and Martin's *Speech and Language Processing*](https://web.stanford.edu/~jurafsky/slp3/3.pdf).


In [None]:
"""
You can either run this tutorial locally (if you have all the dependencies and a GPU) or on Google Colab.

Perform the following steps to setup in Google Colab:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub.
   a. Click **File** > **Upload Notebook** > **GITHUB** tab > copy/paste the GitHub URL.
3. Connect to an instance with a GPU.
   a. Click **Runtime** > Change the runtime type > select **GPU** for the hardware accelerator.
4. Run this cell to set up the dependencies.
5. Restart the runtime.
   a. Click **Runtime** > **Restart Runtime** for any upgraded packages to take effect.
"""

# Install Dependencies
! pip install wget
! apt-get install sox libsndfile1 ffmpeg libsox-fmt-mp3 jq
! pip install text-unidecode
# Quote the requirement so the shell does not treat ">" as an output redirect.
! pip install "matplotlib>=3.3.2"
! pip install Cython

## Install NeMo
BRANCH = 'v2.4.0'
! python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, in the case where you want to use the "Run All Cells" (or similar) option,
uncomment `exit()` below to crash and restart the kernel.
"""
# exit()

## Prerequisites

Before starting this tutorial, ensure you have the following requirements in place:

### 1. NVIDIA NGC Account
- **NGC Account**: You need access to NVIDIA NGC (NVIDIA GPU Cloud) for downloading models and containers
- **Authentication**: Ensure you're logged into your NGC account
- **API Key**: Have your NGC API key ready for container downloads

**Setup Instructions**: Follow the [NGC Getting Started Guide](https://docs.nvidia.com/ngc/ngc-overview/index.html#registering-activating-ngc-account) for account creation and authentication.

### 2. System Requirements
- **GPU**: NVIDIA GPU with CUDA support (recommended: RTX 3080 or better)
- **Memory**: At least 16GB RAM (32GB recommended for large models)
- **Storage**: 60GB+ free disk space for models and datasets
- **Docker**: Docker with NVIDIA Container Toolkit installed

### 3. Software Dependencies
- **Python**: Python 3.10 or higher (NeMo 2.x does not support older versions)
- **CUDA**: CUDA 12.8 or compatible version
- **Docker**: Latest version with GPU support
- **Git**: For cloning repositories

### 4. Environment Setup
- **Jupyter**: Jupyter Notebook or JupyterLab
- **Virtual Environment**: Python virtual environment (recommended)
- **Internet Connection**: Stable connection for downloading models and datasets

**Note**: This tutorial can be run locally or on cloud platforms like Google Colab with GPU support.
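
Before going further, it can save time to confirm that the command-line tools and credentials listed above are visible from the notebook. The following is a minimal sketch; `check_prerequisites` is a hypothetical helper, and the tool and variable names it checks are assumptions drawn from this prerequisites list. Finding `nvidia-smi` on the `PATH` does not by itself prove that the NVIDIA Container Toolkit is configured for Docker.

```python
import os
import shutil


def check_prerequisites(tools=("docker", "git", "nvidia-smi"),
                        env_vars=("NGC_API_KEY",)):
    """Map each requirement to True (found) or False (missing)."""
    # shutil.which returns None when the executable is not on PATH.
    report = {tool: shutil.which(tool) is not None for tool in tools}
    # An unset or empty environment variable counts as missing.
    report.update({name: bool(os.environ.get(name)) for name in env_vars})
    return report


for requirement, ok in check_prerequisites().items():
    print(f"{'OK     ' if ok else 'MISSING'}  {requirement}")
```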

---

## Training an N-gram Language Model with NeMo

This section covers the complete process of training an n-gram language model using NVIDIA NeMo and the KenLM toolkit. We'll walk through:

1. **Environment Setup**: Installing required dependencies and tools
2. **Data Preparation**: Processing the LibriSpeech dataset for training
3. **Model Training**: Building the n-gram language model with KenLM
4. **Model Integration**: Preparing the model for deployment with Parakeet RNNT

The training process leverages KenLM, a highly optimized library for building and using n-gram language models, integrated seamlessly with NeMo's workflow.

### Installing NeMo and KenLM Dependencies

This step performs the following operations:

1. **Clone NeMo Repository**: Clones the NeMo branch pinned in `BRANCH` from GitHub
2. **Install KenLM**: Builds and installs the KenLM toolkit for n-gram language model training
3. **Setup Dependencies**: Installs all required libraries and tools

**What is KenLM?**
- **KenLM** is a highly optimized library for building and using n-gram language models
- **Performance**: Provides fast training and inference for large-scale language models
- **Integration**: Seamlessly integrates with NeMo's ASR pipeline
- **Optimization**: Includes advanced pruning and quantization techniques

**Installation Process:**
- The build process may take 10-15 minutes depending on your system
- Requires sufficient disk space for compilation
- Automatically handles dependency resolution

**Note**: This installation is required for the n-gram language model training workflow.

In [None]:
import os
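# Path to clone the NeMo repository. It is made absolute so that it stays valid
# after the `cd` commands used in this cell and in the training cell.
NEMO_ROOT = os.path.abspath("NeMo")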
os.environ["NEMO_ROOT"] = NEMO_ROOT
! git clone -b $BRANCH --single-branch https://github.com/NVIDIA/NeMo.git $NEMO_ROOT
! cd $NEMO_ROOT/scripts/asr_language_modeling/ngram_lm/ && bash install_beamsearch_decoders.sh $NEMO_ROOT

## Dataset Preparation

### LibriSpeech Language Model Dataset

For this tutorial, we'll use the **LibriSpeech Language Model dataset**, which is a widely-used benchmark dataset for training language models in speech recognition applications.

#### Dataset Overview
- **Source**: Derived from LibriVox audiobooks, providing high-quality, read speech
- **Size**: Approximately 800 million words of training text
- **Format**: Normalized text data suitable for language model training
- **Language**: English
- **License**: Public domain

#### Why LibriSpeech LM?
- **Quality**: Clean, normalized text drawn from published books
- **Diversity**: Covers various topics and speaking styles
- **Standard**: Widely used benchmark in ASR research
- **Size**: Large enough for robust n-gram model training
- **Compatibility**: Well-suited for integration with ASR systems

#### Dataset Structure
The dataset contains:
- **Training Text**: Normalized text files for language model training
- **Vocabulary**: Common English words and phrases
- **Format**: Plain text files, one sentence per line

**Download Links:**
- [LibriSpeech LM Dataset](https://www.openslr.org/11/)
- [Direct Download (normalized text used in this tutorial)](https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz)

### Downloading the Dataset

Now let's download the LibriSpeech Language Model dataset. This process will:

1. **Download** the compressed dataset file (~1.4GB)
2. **Extract** the text data for processing
3. **Prepare** the data for language model training

The download may take several minutes depending on your internet connection speed.

In [None]:
# Set the path to a folder where you want your data and results to be saved.
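# Absolute paths keep working after `cd` in later cells and are required for Docker bind mounts.
DATA_DOWNLOAD_DIR = os.path.abspath("content/datasets")
MODELS_DIR = os.path.abspath("content/models")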

os.environ["DATA_DOWNLOAD_DIR"] = DATA_DOWNLOAD_DIR
os.environ["MODELS_DIR"] = MODELS_DIR

! mkdir -p $DATA_DOWNLOAD_DIR $MODELS_DIR

Next, download the normalized text corpus and decompress it.

In [None]:
# Note: Ensure that the wget and gzip utilities are available. If not, install them.
! wget 'https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz' -P $DATA_DOWNLOAD_DIR

# Extract the data
! gzip -dk $DATA_DOWNLOAD_DIR/librispeech-lm-norm.txt.gz

To keep this tutorial short, we train on a random subset of the corpus. Feel free to change the number of lines used.
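
`shuf` draws a different subset on every run. If you want a repeatable subset, one option is seeded reservoir sampling in Python. This is a sketch and not part of the original workflow; `sample_lines` and its parameters are hypothetical names. It lowercases each line, mirroring the `tr` step in the shell command.

```python
import random


def sample_lines(lines, k, seed=0):
    """Return up to k lowercased lines chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)  # private generator so the result is reproducible
    reservoir = []
    for i, line in enumerate(lines):
        text = line.strip().lower()
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(text)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = text
    return reservoir


demo = sample_lines(["Alpha", "BRAVO", "charlie", "Delta", "ECHO"], k=3, seed=1)
print(demo)
```

Because the function only iterates once over `lines`, you can pass an open file handle for the full corpus without loading it into memory.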

In [None]:
# Use a random 100,000 lines for training
!shuf -n 100000 $DATA_DOWNLOAD_DIR/librispeech-lm-norm.txt | tr '[:upper:]' '[:lower:]' > $DATA_DOWNLOAD_DIR/reduced_training.txt

The N-GPU LMs for Parakeet RNNT models are token-based, so we need the ASR model's tokenizer to tokenize the training data. Let's download the RNNT model that we want to deploy the N-GPU LM with.
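
Token-based means that the n-gram statistics are collected over subword pieces, not whole words. The sketch below uses a hypothetical greedy longest-match tokenizer and a made-up vocabulary purely to illustrate the effect. The real pipeline uses the tokenizer stored inside the `.nemo` checkpoint, which will split text differently.

```python
WORD_START = "\u2581"  # marker that a piece begins a new word


def toy_subword_tokenize(text, vocab):
    """Split each word into the longest pieces found in `vocab` (illustrative only)."""
    tokens = []
    for word in text.split():
        marked = WORD_START + word
        start = 0
        while start < len(marked):
            # Try the longest remaining substring first, then shrink.
            for end in range(len(marked), start, -1):
                if marked[start:end] in vocab:
                    tokens.append(marked[start:end])
                    start = end
                    break
            else:
                # No piece matched: emit an unknown token and skip one character.
                tokens.append("<unk>")
                start += 1
    return tokens


toy_vocab = {WORD_START + "the", WORD_START + "cat", WORD_START + "s", "at"}
tokens = toy_subword_tokenize("the cat sat", toy_vocab)
print(tokens)
```

Three words become four tokens in this example. The same thing happens at scale, so an n-gram window measured in tokens usually covers fewer than `n` whole words.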

In [None]:
! wget -P $MODELS_DIR https://huggingface.co/nvidia/parakeet-rnnt-1.1b/resolve/main/parakeet-rnnt-1.1b.nemo

### Training the N-GPU Language Model

Now that we have all the required components in place, we can proceed with training the n-gram language model. This process involves:

**Training Components:**
- **Acoustic Model**: Parakeet RNNT 1.1B model (for tokenizer integration)
- **Training Data**: Processed LibriSpeech text corpus
- **Toolkit**: KenLM for efficient n-gram model building
- **Framework**: NeMo for seamless integration

**Training Parameters:**
- **N-gram Order**: 6-gram model (each token is predicted from the 5 preceding tokens)
- **Model Format**: N-GPU optimized format for Riva NIM deployment
- **Tokenizer**: Uses Parakeet RNNT's SentencePiece tokenizer
- **Output**: `.nemo` format for easy deployment

**Training Process:**
The training will:
1. Load the Parakeet RNNT model to access its tokenizer
2. Tokenize the training text using the model's vocabulary
3. Build the 6-gram language model using KenLM
4. Save the model in N-GPU format for deployment

**Expected Duration**: 5-15 minutes depending on your hardware and dataset size.

In [None]:
! cd $NEMO_ROOT/scripts/asr_language_modeling/ngram_lm/ && python3 train_kenlm.py \
              nemo_model_file=$MODELS_DIR/parakeet-rnnt-1.1b.nemo \
              train_paths=['{DATA_DOWNLOAD_DIR}/reduced_training.txt'] \
              kenlm_bin_path=$NEMO_ROOT/decoders/kenlm/build/bin \
              kenlm_model_file=$MODELS_DIR/ngpu_6g \
              ngram_length=6 save_nemo=True

### Training Complete! 🎉

The n-gram language model has been successfully trained and saved as `ngpu_6g.nemo`. 

**Model Details:**
- **File**: `ngpu_6g.nemo`
- **Type**: 6-gram language model
- **Format**: N-GPU optimized for Riva NIM deployment
- **Tokenizer**: Integrated with Parakeet RNNT vocabulary
- **Size**: Optimized for production deployment

**What's Next?**
Now that we have our trained language model, we can proceed to:
1. **Deploy** the model using NVIDIA Riva NIM
2. **Integrate** it with the Parakeet RNNT acoustic model
3. **Test** the complete ASR pipeline with language model enhancement

The model is ready for deployment. Incorporating linguistic knowledge into the recognition process can improve the accuracy of the ASR system, especially on speech whose wording resembles the training corpus.

---

## Deploying the N-GPU Language Model with Parakeet RNNT in NVIDIA NIM

This section covers the complete deployment pipeline for our trained n-gram language model with the Parakeet RNNT acoustic model using NVIDIA Riva NIM. We'll walk through:

### Deployment Workflow
1. **Model Conversion**: Convert NeMo models to Riva-compatible format
2. **Pipeline Building**: Create an end-to-end ASR pipeline with Riva ServiceMaker
3. **Model Deployment**: Deploy the complete pipeline to NVIDIA NIM
4. **Inference Testing**: Test the deployed model with real audio samples

### Key Components
- **Acoustic Model**: Parakeet RNNT 1.1B (converted to `.riva` format)
- **Language Model**: Our trained 6-gram model (`ngpu_6g.nemo`)
- **Deployment Platform**: NVIDIA Riva NIM with optimized inference
- **Integration**: Seamless combination of acoustic and language models

This deployment process ensures optimal performance and scalability for production ASR applications.

### Model Conversion with nemo2riva

To deploy our NeMo models with NVIDIA Riva NIM, we need to convert them from the `.nemo` format to the `.riva` format using the `nemo2riva` tool.

#### What is nemo2riva?
**nemo2riva** is a command-line tool that:
- **Converts Models**: Transforms `.nemo` models to Riva-compatible `.riva` format
- **Optimizes Performance**: Applies optimizations for production deployment
- **Handles Dependencies**: Manages model dependencies and configurations
- **Ensures Compatibility**: Guarantees compatibility with Riva NIM deployment

#### Conversion Process
The conversion process involves:
1. **Model Analysis**: Examining the NeMo model structure and dependencies
2. **Format Translation**: Converting to Riva's internal representation
3. **Optimization**: Applying deployment-specific optimizations
4. **Validation**: Ensuring the converted model is ready for deployment
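
A simple file-level sanity check after conversion is to confirm that a non-empty `.riva` file exists. The sketch below assumes the converter writes its output next to the input with the extension changed to `.riva`, which matches the paths used later in this tutorial; if you pass an explicit output path, adjust accordingly. Both helper names are hypothetical, and the check says nothing about whether the model contents are valid.

```python
from pathlib import Path


def expected_riva_path(nemo_path):
    """Return the assumed output path: same folder and stem, `.riva` extension."""
    return Path(nemo_path).with_suffix(".riva")


def conversion_succeeded(nemo_path):
    """True if the assumed `.riva` output exists and is not empty."""
    riva_path = expected_riva_path(nemo_path)
    return riva_path.is_file() and riva_path.stat().st_size > 0


print(expected_riva_path("content/models/parakeet-rnnt-1.1b.nemo").name)
```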

#### Installation
The `nemo2riva` tool is available as a Python package:
- **PyPI Package**: Available on [PyPI](https://pypi.org/project/nemo2riva/)
- **Installation**: Simple `pip install` command

#### Next Steps
After conversion, we'll use the Riva ServiceMaker framework to:
- **Build Pipeline**: Create an end-to-end ASR pipeline (`.rmir` format)
- **Deploy Model**: Deploy to NVIDIA NIM for production inference
- **Test Integration**: Verify the complete pipeline works correctly

This workflow ensures seamless transition from research models to production deployment.

### Installing nemo2riva

Let's install the `nemo2riva` tool to convert our downloaded Parakeet RNNT model from `.nemo` format to `.riva` format for Riva NIM deployment.

**Installation Process:**
- Installs the `nemo2riva` package, using `https://pypi.nvidia.com` as an extra package index alongside PyPI
- Includes all necessary dependencies for model conversion
- Provides command-line interface for easy model conversion

**Authentication:**
- The `pip` command below uses public package indexes
- Keep your NGC API key at hand: it is needed later to pull the NIM container

In [None]:
# install nemo2riva 
! pip3 install --extra-index-url https://pypi.nvidia.com  nemo2riva

In [None]:
! nemo2riva --key tlt_encode --format nemo $MODELS_DIR/parakeet-rnnt-1.1b.nemo

---

## Riva ServiceMaker Framework

**Riva ServiceMaker** is a comprehensive toolkit that streamlines the deployment of custom models to NVIDIA Riva NIM. It handles the complex process of aggregating all necessary artifacts for production deployment.

### ServiceMaker Components

Riva ServiceMaker consists of two main tools that work together to create a complete deployment pipeline:

#### 1. Riva-Build
- **Purpose**: Creates deployment-ready model pipelines
- **Input**: `.riva` model files and configuration
- **Output**: `.rmir` (Riva Model Intermediate Representation) files
- **Function**: Combines models, configurations, and optimizations into a single deployable package

#### 2. Riva-Deploy
- **Purpose**: Deploys the pipeline to target environments
- **Input**: `.rmir` files and deployment configuration
- **Output**: Complete model repository ready for NIM deployment
- **Function**: Creates the final deployment package with all necessary artifacts

### Workflow Overview
1. **Model Preparation**: Convert models to `.riva` format
2. **Pipeline Building**: Use `riva-build` to create `.rmir` files
3. **Deployment**: Use `riva-deploy` to create the final model repository
4. **NIM Deployment**: Deploy the repository to NVIDIA NIM

This framework ensures consistent, optimized deployment across different environments and use cases.

### Riva-Build: Creating the Model Pipeline

**Riva-Build** is the first step in the ServiceMaker workflow, responsible for creating a deployment-ready model pipeline.

#### What Riva-Build Does
- **Model Integration**: Combines multiple `.riva` model files into a unified pipeline
- **Configuration Management**: Applies deployment-specific configurations and optimizations
- **Asset Aggregation**: Collects all necessary files, dependencies, and metadata
- **Pipeline Creation**: Generates a complete end-to-end inference pipeline

#### Output: RMIR Files
The primary output is an **RMIR (Riva Model Intermediate Representation)** file that contains:
- **Pipeline Specification**: Complete end-to-end inference workflow
- **Model Assets**: All model files, weights, and configurations
- **Deployment Metadata**: Hardware requirements, optimization settings
- **Service Configuration**: API endpoints, input/output specifications

#### For ASR with N-gram Language Models
In our case, `riva-build` will:
1. **Combine Models**: Integrate the Parakeet RNNT acoustic model with our n-gram language model
2. **Configure Pipeline**: Set up the complete ASR pipeline with language model integration
3. **Optimize Performance**: Apply optimizations for the target deployment environment
4. **Generate RMIR**: Create the intermediate representation ready for deployment

#### Key Benefits
- **Deployment Agnostic**: RMIR files work across different deployment environments
- **Optimized Performance**: Includes hardware-specific optimizations
- **Complete Pipeline**: Contains everything needed for end-to-end inference
- **Easy Deployment**: Simplifies the final deployment step

For detailed configuration options, refer to the [Riva ASR NIM Pipeline Configuration documentation](https://docs.nvidia.com/nim/riva/asr/latest/custom-deployment.html#deploying-custom-models-as-nim).

In [None]:
# IMPORTANT: UPDATE THESE PATHS 

# Riva NIM Docker

# Refer to this table to get the CONTAINER_ID for the model architecture you want to deploy.
# https://docs.nvidia.com/nim/riva/asr/latest/support-matrix.html#supported-models
# Since this is RNNT model, we should use following CONTAINER_ID
CONTAINER_ID = "parakeet-1-1b-rnnt-multilingual"

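# File name of the acoustic model .riva file (expected in MODELS_DIR, which is mounted into the container)
ACOUSTIC_MODEL_NAME = "parakeet-rnnt-1.1b.riva"

# File name of the language model .nemo file (saved in MODELS_DIR by train_kenlm.py)
LANGUAGE_MODEL_NAME = "ngpu_6g.nemo"

# Path to store the NIM model repository. Make sure that this directory is empty.
# expanduser() resolves "~" so that the path can also be used as a Docker bind mount.
NIM_EXPORT_PATH = os.path.expanduser("~/nim_cache")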

! mkdir -p $NIM_EXPORT_PATH
! chmod 777 $NIM_EXPORT_PATH

#### Build the `.rmir` file

Refer to the [Riva ASR NIM Pipeline Configuration documentation](https://docs.nvidia.com/nim/riva/asr/latest/pipeline-configuration.html) to obtain the proper `riva-build` parameters for your particular application. In the interactive web menu, select the acoustic model, language, and pipeline type (offline for the purposes of this tutorial).
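
Among the flags in the build command, `--nemo_decoder.language_model_alpha` sets the weight of the language model. A common way such a weight is applied is shallow fusion, where each hypothesis is scored as the acoustic log-probability plus `alpha` times the LM log-probability. The sketch below illustrates that general idea with made-up probabilities; it is an assumption about the technique, not a description of the Riva decoder's internals.

```python
import math


def fused_score(am_logprob, lm_logprob, alpha=0.5):
    """Shallow fusion: acoustic score plus a weighted language model score."""
    return am_logprob + alpha * lm_logprob


# hypothesis -> (acoustic log-probability, LM log-probability); values are invented
candidates = {
    "recognize speech": (math.log(0.40), math.log(0.30)),
    "wreck a nice beach": (math.log(0.45), math.log(0.01)),
}

best = max(candidates, key=lambda h: fused_score(*candidates[h]))
print(best)
```

With `alpha=0.0` the acoustically stronger but implausible hypothesis would win. With `alpha=0.5` the LM score tips the decision toward the more plausible word sequence. Larger values trust the LM more, so the weight is normally tuned on held-out audio.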

In [None]:
# Docker bind mounts need an absolute host path, and inside the container the
# models are addressed by file name under /servicemaker-dev. basename() works
# whether the variables hold a bare file name or a full path.
MODELS_MOUNT = os.path.abspath(MODELS_DIR)
AM_FILE = os.path.basename(ACOUSTIC_MODEL_NAME)
LM_FILE = os.path.basename(LANGUAGE_MODEL_NAME)

! docker run --gpus all --rm \
     -v $MODELS_MOUNT:/servicemaker-dev \
     --name riva-servicemaker \
     --entrypoint="" \
     nvcr.io/nim/nvidia/$CONTAINER_ID \
     riva-build speech_recognition \
        /servicemaker-dev/asr_offline_riva_ngram_lm.rmir:tlt_encode \
        /servicemaker-dev/$AM_FILE:tlt_encode \
        --offline --name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-offline \
        --return_separate_utterances=True --featurizer.use_utterance_norm_params=False \
        --featurizer.precalc_norm_time_steps=0 --featurizer.precalc_norm_params=False \
        --ms_per_timestep=80 --language_code=en-US \
        --nn.fp16_needs_obey_precision_pass --unified_acoustic_model \
        --chunk_size=8.0 --left_padding_size=0 --right_padding_size=0 \
        --featurizer.max_batch_size=256 --featurizer.max_execution_batch_size=256 \
        --max_batch_size=128 --nn.opt_batch_size=128 \
        --endpointing_type=niva --endpointing.stop_history=0  \
        --decoder_type=nemo --nemo_decoder.language_model_alpha=0.5 \
        --nemo_decoder.language_model_file=/servicemaker-dev/$LM_FILE

### Riva-Deploy: Final Deployment Preparation

**Riva-Deploy** is the second step in the ServiceMaker workflow, responsible for creating the final deployment package for NVIDIA NIM.

#### What Riva-Deploy Does
- **RMIR Processing**: Takes one or more RMIR files as input
- **Repository Creation**: Generates a complete model repository structure
- **Configuration Generation**: Creates ensemble configurations for pipeline execution
- **Asset Organization**: Organizes all files in the proper directory structure
- **Deployment Package**: Produces a ready-to-deploy model repository

#### Input and Output
- **Input**: RMIR files (from `riva-build`) and target directory path
- **Output**: Complete model repository with all necessary artifacts
- **Format**: Standardized directory structure compatible with NVIDIA NIM

#### Security Considerations
**Encryption Support**: If you used encryption during the `riva-build` step:
- **Key Format**: Append `:your_encryption_key` to the model name in the deploy command
- **Security**: Ensures model protection during deployment
- **Example**: `model.rmir:my_secret_key`
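
The key suffix is easy to forget when the command is assembled by hand. As a small illustration, the sketch below builds the `riva-deploy` argument list following the `model.rmir[:key]` convention described above. The helper names are hypothetical, and the code only formats strings; it does not run `riva-deploy`.

```python
import shlex


def rmir_arg(path, key=None):
    """Append `:key` to the RMIR path when an encryption key was used at build time."""
    return f"{path}:{key}" if key else path


def riva_deploy_command(rmir_path, output_dir, key=None):
    """Return the riva-deploy invocation as a list of arguments."""
    return ["riva-deploy", "-f", rmir_arg(rmir_path, key), output_dir]


cmd = riva_deploy_command(
    "/servicemaker-dev/asr_offline_riva_ngram_lm.rmir",
    "/data/models/",
    key="tlt_encode",
)
print(shlex.join(cmd))
```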

#### Repository Structure
The generated repository includes:
- **Model Files**: All model weights and configurations
- **Pipeline Config**: Ensemble configuration for execution
- **Metadata**: Deployment information and requirements
- **Dependencies**: All necessary runtime dependencies

#### Next Steps
After `riva-deploy` completes:
1. **Model Repository**: Ready for NVIDIA NIM deployment
2. **Container Deployment**: Can be deployed using Docker containers
3. **Service Activation**: Ready to serve inference requests

This step finalizes the deployment preparation, making your custom models ready for production use with NVIDIA NIM.

In [None]:
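# Syntax: riva-deploy -f dir-for-rmir/model.rmir[:key] output-dir-for-repository
# The .rmir file was built with the key tlt_encode, so the same key is appended here.
# Docker bind mounts need absolute host paths.
MODELS_MOUNT = os.path.abspath(MODELS_DIR)
NIM_EXPORT_MOUNT = os.path.abspath(os.path.expanduser(NIM_EXPORT_PATH))

! docker run --gpus all --rm \
     -v $MODELS_MOUNT:/servicemaker-dev \
     -v $NIM_EXPORT_MOUNT:/model_tar \
     --name riva-servicemaker \
     --entrypoint="" \
     nvcr.io/nim/nvidia/$CONTAINER_ID \
     bash -c "riva-deploy -f /servicemaker-dev/asr_offline_riva_ngram_lm.rmir:tlt_encode /data/models/ && tar -czf /model_tar/custom_models.tar.gz -C /data/models ."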

---

## Starting the NVIDIA Riva ASR NIM Server

Now that we have successfully generated the model repository, we can start the NVIDIA Riva NIM server with our custom models.
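
Before starting the server, it may be worth confirming that the deploy step actually produced the `custom_models.tar.gz` archive in the export directory. The sketch below lists the first few entries of that archive using the standard library; `list_model_archive` is a hypothetical helper, and the archive name is taken from the deploy command in this tutorial.

```python
import tarfile
from pathlib import Path


def list_model_archive(export_dir, archive_name="custom_models.tar.gz", limit=10):
    """Return up to `limit` member names from the model archive in `export_dir`."""
    archive = Path(export_dir).expanduser() / archive_name
    if not archive.is_file():
        raise FileNotFoundError(f"Model archive not found: {archive}")
    # "r:gz" opens a gzip-compressed tar file for reading.
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()[:limit]
```

Calling `list_model_archive(NIM_EXPORT_PATH)` should return a non-empty list. A `FileNotFoundError` suggests that the deploy step failed or wrote to a different directory.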

### Server Configuration
The Riva NIM server will be configured with:
- **Custom Models**: Our trained n-gram language model integrated with Parakeet RNNT
- **GPU Acceleration**: Optimized for NVIDIA GPU inference
- **API Endpoints**: Both REST and gRPC interfaces for client connections
- **Container Deployment**: Running in a Docker container for easy management

### Server Features
- **High Performance**: Optimized inference with TensorRT acceleration
- **Scalability**: Support for concurrent requests and load balancing
- **Monitoring**: Built-in health checks and performance metrics
- **Security**: Secure API endpoints with authentication support

### Port Configuration
- **REST API**: Port 9000 for HTTP-based requests
- **gRPC API**: Port 50051 for high-performance gRPC requests
- **Health Checks**: Available at `/v1/health/live` endpoint

### Environment Variables
The server uses several environment variables for configuration:
- **NGC_API_KEY**: For accessing NVIDIA NGC resources
- **NIM_EXPORT_PATH**: Path to our custom model repository

Once started, the server will load our models and be ready to serve inference requests. 

In [None]:

# Run the container with the cache directory mounted in the appropriate location:
! docker run -it --rm -d --name=$CONTAINER_ID \
   --runtime=nvidia \
   --gpus '"device=0"' \
   --shm-size=8GB \
   -e NGC_API_KEY \
   -e NIM_TAGS_SELECTOR \
   -e NIM_DISABLE_MODEL_DOWNLOAD=true \
   -e NIM_HTTP_API_PORT=9000 \
   -e NIM_GRPC_API_PORT=50051 \
   -p 9000:9000 \
   -p 50051:50051 \
   -v $NIM_EXPORT_PATH:/opt/nim/export \
   -e NIM_EXPORT_PATH=/opt/nim/export \
   nvcr.io/nim/nvidia/$CONTAINER_ID:latest

---

## Running Inference with the Deployed Model

Now that our NVIDIA Riva NIM server is running with our custom n-gram language model, we can perform inference to test the complete ASR pipeline.

### Inference Capabilities
Our deployed system provides:
- **Enhanced ASR**: Automatic Speech Recognition with n-gram language model integration
- **Improved Accuracy**: Language model helps correct and improve transcription quality
- **Real-time Processing**: Fast inference suitable for streaming applications
- **Multiple APIs**: Both REST and gRPC interfaces for different use cases

### Client Setup
To interact with the server, we'll use the **Riva Python Client**:
- **Package**: Available on [PyPI](https://pypi.org/project/nvidia-riva-client/)
- **Features**: Easy-to-use Python API for gRPC requests
- **Documentation**: Comprehensive API documentation and examples
- **Authentication**: Built-in support for secure connections

### Inference Process
The inference workflow includes:
1. **Audio Input**: Load and prepare audio files for processing
2. **Server Connection**: Establish connection to the Riva NIM server
3. **Request Configuration**: Set up recognition parameters and options
4. **Inference Execution**: Send requests and receive transcriptions
5. **Result Processing**: Handle and display the recognition results
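
For the audio input step, a quick look at the file header can catch surprises such as an unexpected sample rate or a stereo recording. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files. `describe_wav` is a hypothetical helper, and which formats the server accepts is not covered here.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }
```

For example, `describe_wav("audio_samples/en-US_sample.wav")` reports the channel count, sample rate, sample width, and duration of the sample used later in this notebook, provided that file exists on disk.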

In [None]:
# Install the Client API Bindings
! pip install nvidia-riva-client

In [None]:
import riva.client

### Connecting to the Riva Server

Before we can run inference, we need to ensure the Riva NIM server is fully loaded and ready to serve requests.

#### Server Loading Process
The NIM server goes through several initialization steps:
1. **Container Startup**: Docker container initialization
2. **Model Loading**: Loading our custom Parakeet RNNT model with n-gram language model
3. **GPU Initialization**: Setting up CUDA and TensorRT optimizations
4. **Service Activation**: Starting the REST and gRPC API endpoints
5. **Health Check**: Verifying all components are operational

#### Loading Time Considerations
- **Model Size**: Larger models take longer to load into memory
- **GPU Memory**: Initial GPU memory allocation and optimization
- **First Request**: May take additional time for model warm-up
- **Typical Duration**: 2-5 minutes depending on hardware and model size

#### Health Monitoring
We'll monitor the server status using the health check endpoint:
- **Endpoint**: `http://localhost:9000/v1/health/live`
- **Response**: Returns "live" when server is ready
- **Retry Logic**: Automatic retry with 5-second intervals
- **Timeout**: Maximum 30 attempts (2.5 minutes) before giving up

**Please wait** while the server completes its initialization process. The next cell will automatically detect when the server is ready.

In [None]:
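import requests, time

HEALTH_URL = "http://localhost:9000/v1/health/live"

for attempt in range(1, 31):
    try:
        r = requests.get(HEALTH_URL, timeout=2)
        if "live" in r.text:
            print("NIM server is ready!")
            break
    except requests.RequestException:
        # Connection refused or timed out: the server is still starting up.
        pass
    print(f"Waiting for NIM server to load (attempt {attempt}/30), retrying in 5 seconds...")
    time.sleep(5)
else:
    # The for-else branch runs only if the loop finished without a break.
    print("Server did not become ready after 30 attempts.")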

### Inference Function

Once the server is ready, we can use this inference function to query the Riva NIM server and transcribe audio files.

#### Function Overview
The `run_inference()` function provides a simple interface to:
- **Load Audio**: Read audio files from disk
- **Connect to Server**: Establish gRPC connection to the Riva NIM server
- **Configure Recognition**: Set up recognition parameters (language, punctuation, etc.)
- **Process Audio**: Send audio data for transcription
- **Return Results**: Receive and display the transcription results

#### Key Features
- **gRPC Protocol**: Uses high-performance gRPC for fast communication
- **Flexible Configuration**: Supports various recognition settings
- **Error Handling**: Minimal by design; connection and gRPC errors are raised to the caller
- **Result Formatting**: Clean output with optional detailed response display

#### Recognition Configuration
The function uses the following default settings:
- **Language**: English (en-US)
- **Alternatives**: Single best transcription result
- **Punctuation**: Automatic punctuation disabled (can be enabled)
- **Format**: Plain text output

#### Usage
Simply call `run_inference(audio_file_path)` with the path to your audio file to get the transcription. The function handles all the complexity of server communication and result processing. 

In [None]:
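def run_inference(audio_file, server='localhost:50051', print_full_response=False):
    """Send `audio_file` to the Riva NIM server at `server` and print the transcription."""
    # Offline recognition takes the whole encoded audio file in a single request.
    with open(audio_file, 'rb') as fh:
        data = fh.read()

    auth = riva.client.Auth(uri=server)
    client = riva.client.ASRService(auth)
    config = riva.client.RecognitionConfig(
        language_code="en-US",
        max_alternatives=1,
        enable_automatic_punctuation=False,
    )

    response = client.offline_recognize(data, config)
    if print_full_response:
        print(response)
    elif not response.results:
        # An empty result list means no speech was recognized in the audio.
        print("No transcription returned.")
    else:
        # The pipeline may return several utterances; join the best alternative of each.
        print(" ".join(r.alternatives[0].transcript.strip()
                       for r in response.results if r.alternatives))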

In [None]:
audio_file = "audio_samples/en-US_sample.wav"
run_inference(audio_file)

### Cleanup: Stopping the Riva NIM Server

When you're finished with the tutorial, it's important to properly stop the Riva NIM server to free up system resources.

#### Why Stop the Server?
- **Resource Management**: Frees up GPU memory and CPU resources
- **Container Cleanup**: Properly terminates the Docker container
- **System Performance**: Prevents resource conflicts with other applications
- **Best Practices**: Ensures clean shutdown of all services

#### Cleanup Process
The cleanup commands will:
1. **Stop Container**: Gracefully stop the running Riva NIM container
2. **Remove Container**: Clean up the container instance
3. **Free Resources**: Release GPU memory and system resources

#### Alternative: Keep Server Running
If you want to continue using the server for additional testing:
- **Skip Cleanup**: Don't run the cleanup commands
- **Server Persistence**: The server will continue running until manually stopped
- **Resource Usage**: Monitor system resources if keeping the server active

#### Next Steps
After stopping the server, you can:
- **Review Results**: Analyze the transcription results and model performance
- **Experiment Further**: Try different audio files or configuration settings
- **Deploy Production**: Use the same workflow for production deployments
- **Scale Up**: Deploy multiple instances for higher throughput
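
To put a number on model performance, ASR output is usually compared against a reference transcript using word error rate (WER): the word-level edit distance divided by the number of reference words. The sketch below is a plain-Python implementation for small experiments. `word_error_rate` is a hypothetical helper, and text normalization (casing, punctuation) is left to the caller.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.3f}")
```

Comparing WER on the same audio with and without the language model, or across different `language_model_alpha` values, gives a concrete measure of what the n-gram model contributes.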

**Congratulations!** You've successfully completed the full pipeline from training an n-gram language model to deploying it with NVIDIA Riva NIM for production ASR inference.

In [None]:
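! docker stop $CONTAINER_ID
# The container was started with --rm, so Docker removes it automatically once it stops.
# The explicit removal below only matters if you launched it without --rm; errors are ignored otherwise.
! docker rm $CONTAINER_ID 2>/dev/null || true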