<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_asr_asr-python-advanced-tao-ngram-pretrain/nvidia_logo.png" style="width: 90px; float: right;">

# How to pretrain a Riva ASR language model (n-gram) with TAO Toolkit
This tutorial walks you through pretraining a Riva ASR language model (n-gram) with Train Adapt Optimize (TAO) Toolkit.

## NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech and natural language understanding services such as:

- Automated speech recognition (ASR)
- Text-to-Speech synthesis (TTS)
- A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will pretrain a Riva ASR language model (n-gram) with TAO Toolkit. <br>
To understand the basics of Riva ASR APIs, refer to [Getting started with Riva ASR in Python](https://github.com/nvidia-riva/tutorials/blob/dev/22.04/asr-python-basics.ipynb). <br>

For more information about Riva, refer to the [Riva developer documentation](https://developer.nvidia.com/riva).

## TAO Toolkit
Train Adapt Optimize (TAO) Toolkit is a simple and easy-to-use Python based AI toolkit for taking purpose-built AI models and customizing them with users' own data. Developers, researchers, and software partners building intelligent vision AI applications and services can bring their own data to fine-tune pre-trained models instead of going through the hassle of training from scratch.

![Train Adapt Optimize (TAO) Toolkit](https://developer.nvidia.com/sites/default/files/akamai/embedded-transfer-learning-toolkit-software-stack-1200x670px.png)

Transfer learning extracts learned features from an existing neural network into a new one. Transfer learning is often used when creating a large training dataset is not feasible. The goal of this toolkit is to reduce an 80-hour workload to an 8-hour workload, which lets data scientists run considerably more train-test iterations in the same time frame.

Let's see this in action with a use case for Automatic Speech Recognition Language Modeling!

<a id='isc-task-description'></a>
## Language Modeling

### Task Description

A language model assigns a probability to a sequence of words. It can also assign a probability to a given word (or sequence of words) following a sequence of words. <br>

> The sentence:  **all of a sudden I notice three guys standing on the sidewalk**
> would be scored higher than 
> the sentence: **on guys all I of notice sidewalk three a sudden standing the** by the language model. <br>

Recent research suggests that a language model trained on a large corpus can significantly improve the accuracy of an ASR system.

### N-gram Language Model
There are primarily two types of language models:

- **N-gram language models**: These models use frequency of n-grams to learn the probability distribution over words. Two benefits of N-gram language models are simplicity and scalability – with a larger `n`, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.
- **Neural language models**: These models use different kinds of neural networks to model the probability distribution over words, and have surpassed the N-gram language models in the ability to model language, but are generally slower to evaluate.

In this tutorial, we will show how to train, evaluate, and optionally fine-tune an [N-gram language model](https://web.stanford.edu/~jurafsky/slp3/3.pdf) leveraging TAO Toolkit.
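
To make the idea of "using the frequency of n-grams" concrete, here is a minimal sketch of a bigram (`n = 2`) model estimated by plain counting. The toy corpus and the function names `train_bigram` and `bigram_prob` are illustrative assumptions, not part of TAO Toolkit. Production toolkits typically add smoothing on top of raw counts, which this sketch deliberately omits.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])                 # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (previous word, word) pairs
    return histories, bigrams

def bigram_prob(histories, bigrams, prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]

# Toy corpus (hypothetical data, for illustration only)
corpus = ["the cat sat", "the cat ran", "the dog sat"]
histories, bigrams = train_bigram(corpus)
p_cat = bigram_prob(histories, bigrams, "the", "cat")
p_dog = bigram_prob(histories, bigrams, "the", "dog")
print(p_cat, p_dog)
```

In this toy corpus "the" is followed by "cat" twice and by "dog" once, so the model prefers "cat" after "the". A larger `n` conditions on a longer history at the cost of storing more n-grams.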

---
## Let's Dig in: Riva Language Modeling using TAO

### Installing and setting up TAO
Install TAO Toolkit inside a Python virtual environment. We recommend performing this step first and then launching the tutorial from the virtual environment.

It's a simple `pip` install.

In [None]:
! pip install nvidia-pyindex
! pip install nvidia-tao

To view the Docker image versions and the tasks that TAO can perform, use the `tao info` command.

In [None]:
!tao info --verbose

In addition to installing the TAO Toolkit package, ensure you meet the following software requirements:

1. Python 3.6.9
2. `docker-ce` > 19.03.5
3. `docker-API` 1.40
4. `nvidia-container-toolkit` > 1.3.0-1
5. `nvidia-container-runtime` > 3.4.0-1
6. `nvidia-docker2` > 2.5.0-1
7. `nvidia-driver` >= 455.23

Check to see if the GPU device(s) is visible.
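
One way to run this check from Python is sketched below. It assumes the NVIDIA driver's `nvidia-smi` utility is on the `PATH`; the helper name `list_visible_gpus` is an illustrative assumption. The helper returns `None` instead of raising an error when the utility is missing or fails.

```python
import shutil
import subprocess

def list_visible_gpus():
    """Return the lines printed by `nvidia-smi -L`, or None if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "-L"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    if result.returncode != 0:
        return None
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

gpus = list_visible_gpus()
print(gpus if gpus else "No GPU visible (or nvidia-smi is not installed).")
```

Running `!nvidia-smi` directly in a notebook cell gives the same information in a more detailed table.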

---
<a id='isc-prepare-data'></a>
### Preparing the dataset
#### LibriSpeech LM Normalized dataset
For this tutorial, we use the normalized version of the LibriSpeech LM dataset to train our N-gram language model. The normalized version of the LibriSpeech LM dataset is available [here](https://www.openslr.org/11/).

#### LibriSpeech dev-clean dataset
For this tutorial, we also use the clean version of the LibriSpeech development set to evaluate our N-gram language model. The clean version of the LibriSpeech development set is available [here](https://www.openslr.org/12/).

#### LibriSpeech LM Normalized dataset
The training data is publicly available [here](https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz) and can be downloaded directly.

#### Downloading the dataset

In [None]:
import os
# IMPORTANT NOTE: Set the path to a folder where you want your data and results to be saved.
# TODO
DATA_DOWNLOAD_DIR = "<YOUR_PATH_TO_DATA_DIR>"
assert os.path.exists(DATA_DOWNLOAD_DIR), "Provided DATA_DOWNLOAD_DIR does not exist."

In [None]:
# NOTE: Ensure that the wget and gzip utilities are available. If not, install them.
!wget 'https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz' -P $DATA_DOWNLOAD_DIR

# Extract the data
!gzip -dk $DATA_DOWNLOAD_DIR/librispeech-lm-norm.txt.gz

#### LibriSpeech dev-clean dataset
The evaluation data is publicly available [here](https://www.openslr.org/resources/12/dev-clean.tar.gz) and can be downloaded directly. We provide a Python script to download and preprocess the dataset for you.

In [None]:
"""
Scripts to download and preprocess LibriSpeech dev-clean
"""
import os
from multiprocessing import Pool

LOG_STR = " To regenerate this file, please, remove it."

def find_transcript_files(root_dir):
    # Recursively collect every LibriSpeech transcript file under root_dir
    files = []
    for dirpath, _, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.endswith(".trans.txt"):
                files.append(os.path.join(dirpath, filename))
    return files

def transcript_to_list(file):
    audio_path = os.path.dirname(file)
    ret = []
    with open(file, "r") as f:
        for line in f:
            file_id, trans = line.strip().split(" ", 1)
            audio_file = os.path.abspath(os.path.join(audio_path, file_id + ".flac"))
            duration = 0  # We are not using the audio
            ret.append([file_id, audio_file, str(duration), trans.lower()])

    return ret


if __name__ == "__main__":
    
    name = "dev-clean"
    data_path = os.path.join(DATA_DOWNLOAD_DIR, "eval_data")
    text_path = os.path.join(DATA_DOWNLOAD_DIR, "text")
    lists_path = os.path.join(DATA_DOWNLOAD_DIR, "lists")
    os.makedirs(data_path, exist_ok=True)
    os.makedirs(text_path, exist_ok=True)
    os.makedirs(lists_path, exist_ok=True)
    data_http = "http://www.openslr.org/resources/12/"

    # Download the audio data
    print("Downloading the evaluation data.", flush=True)
    if not os.path.exists(os.path.join(data_path, "LibriSpeech", name)):
        print("Downloading and unpacking {}...".format(name))
        cmd = """wget -c {http}{name}.tar.gz -P {path};
                 yes n 2>/dev/null | gunzip {path}/{name}.tar.gz;
                 tar -C {path} -xf {path}/{name}.tar"""
        os.system(cmd.format(path=data_path, http=data_http, name=name))
    else:
        log_str = "{} part of data exists, skip its downloading and unpacking"
        print(log_str.format(name) + LOG_STR, flush=True)

    # Prepare the audio data
    print("Converting data into necessary format.", flush=True)
    word_dict = {}
    word_dict[name] = set()
    src = os.path.join(data_path, "LibriSpeech", name)
    assert os.path.exists(src), "Unable to find the directory - '{src}'".format(
        src=src
    )

    dst_list = os.path.join(lists_path, name + ".lst")
    if os.path.exists(dst_list):
        # The list file already exists, so skip regenerating it
        print(
            "Path {} exists, skip its generation.".format(dst_list) + LOG_STR,
            flush=True,
        )
    else:
        print("Analyzing {src}...".format(src=src), flush=True)
        transcript_files = find_transcript_files(src)
        transcript_files.sort()

        print("Writing to {dst}...".format(dst=dst_list), flush=True)
        with Pool(processes=8) as p:
            samples = list(p.imap(transcript_to_list, transcript_files))

        with open(dst_list, "w") as fout:
            for sp in samples:
                for s in sp:
                    word_dict[name].update(s[-1].split(" "))
                    s[0] = name + "-" + s[0]
                    fout.write(" ".join(s) + "\n")

    current_path = os.path.join(text_path, name + ".txt")
    if not os.path.exists(current_path):
        with open(os.path.join(lists_path, name + ".lst"), "r") as flist, open(
            os.path.join(text_path, name + ".txt"), "w"
        ) as fout:
            for line in flist:
                fout.write(" ".join(line.strip().split(" ")[3:]) + "\n")
    else:
        print(
            "Path {} exists, skip its generation.".format(current_path) + LOG_STR,
            flush=True,
        )

print("Done!", flush=True)


For the sake of reducing the time this demo takes, we reduce the number of lines of the training dataset. Feel free to modify the number of used lines.

In [None]:
# Use a random 100,000 lines for training
!shuf -n 100000 $DATA_DOWNLOAD_DIR/librispeech-lm-norm.txt  > $DATA_DOWNLOAD_DIR/reduced_training.txt

---
## TAO Toolkit workflow
The rest of the tutorial demonstrates what a sample TAO Toolkit workflow looks like.

### Setting TAO Toolkit Mounts

Now that our dataset has been downloaded, an important step in using TAO Toolkit is to set up the directory mounts. The TAO Toolkit launcher uses Docker containers under the hood, and **for our data and results directory to be visible to Docker, they need to be mapped**. The launcher can be configured using the config file `~/.tao_mounts.json`. Apart from the mounts, you can also configure additional options like the environment variables and the amount of shared memory available to the TAO Toolkit launcher. <br>

`IMPORTANT NOTE:` The following code creates a sample `~/.tao_mounts.json` file that maps the directories in which we save the data, specs, results, and cache. Configure it for your specific use case so that these directories are visible to the Docker container. **Ensure that the source directories exist on your machine.**

In [None]:
%%bash
tee ~/.tao_mounts.json <<'EOF'
{
   "Mounts":[
       {
           "source": "<YOUR_PATH_TO_DATA_DIR>",
           "destination": "/data"
       },
       {
           "source": "<YOUR_PATH_TO_SPECS_DIR>",
           "destination": "/specs"
       },
       {
           "source": "<YOUR_PATH_TO_RESULTS_DIR>",
           "destination": "/results"
       },
       {
           "source": "<YOUR_PATH_TO_CACHE_DIR eg. /home/user/.cache>",
           "destination": "/root/.cache"
       }
   ]
}
EOF

In [None]:
# Make sure the source directories exist; if not, create them. Provide absolute paths.
SPECS_DIR_LOCAL = "<YOUR_PATH_TO_SPECS_DIR>"
RESULT_DIR_LOCAL = "<YOUR_PATH_TO_RESULTS_DIR>"
CACHE_DIR_LOCAL = "<YOUR_PATH_TO_CACHE_DIR>"
! mkdir -p $SPECS_DIR_LOCAL
! mkdir -p $RESULT_DIR_LOCAL
! mkdir -p $CACHE_DIR_LOCAL
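
As an alternative to editing the JSON by hand, the mounts file can be generated from Python so that the paths stay in sync with your variables. The sketch below is an illustration under assumptions: the helper name `write_tao_mounts` is hypothetical, and the demo writes to a temporary folder rather than to `~/.tao_mounts.json`, so that nothing on your machine is overwritten.

```python
import json
import os
import tempfile

def write_tao_mounts(config_path, data_dir, specs_dir, results_dir, cache_dir):
    """Create the source folders and write a mounts file to config_path."""
    pairs = [
        (data_dir, "/data"),
        (specs_dir, "/specs"),
        (results_dir, "/results"),
        (cache_dir, "/root/.cache"),
    ]
    mounts = []
    for source, destination in pairs:
        source = os.path.abspath(source)
        os.makedirs(source, exist_ok=True)  # source directories must exist
        mounts.append({"source": source, "destination": destination})
    config = {"Mounts": mounts}
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config

# Demo in a temporary folder (hypothetical paths, for illustration only)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tao_mounts.json")
    demo_dirs = [os.path.join(tmp, n) for n in ("data", "specs", "results", "cache")]
    demo_config = write_tao_mounts(demo_path, *demo_dirs)
    with open(demo_path) as f:
        written = json.load(f)

print([m["destination"] for m in written["Mounts"]])
```

To use it for real, pass `os.path.expanduser("~/.tao_mounts.json")` as `config_path` together with your own absolute directories.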

Users with basic knowledge of deep learning can get started building their own custom models using a simple specification file. It's essentially just one command each to run data preprocessing, training, fine-tuning, evaluation, inference, and export. All configurations happen through `.yaml` spec files. <br>

---
### Configuration/Specification Files

The essence of all commands in TAO Toolkit lies within `.yaml` spec files. There are sample spec files already available for you to use directly or as a reference to create your own.  Through these spec files, you can tune many knobs like the model, dataset, hyperparameters, etc. Each command (like train, fine-tune, evaluate, etc.) should have a dedicated spec file with configurations pertinent to it. <br>

Here is an example of the training spec file:

---
```
model:
  intermediate: True
  order: 2
  pruning:
    - 0
training_ds:
  is_tarred: false
  is_file: true
  data_file: ???

vocab_file: ""
encryption_key: "tlt_encode"
...
```


---
### Downloading Specs
Let's download the spec files. You may choose to modify/rewrite these specs, or even individually override them through the launcher. You can download the default spec files by using the `download_specs` command. <br>

The `-o` argument indicates the folder where the default specification files will be downloaded. The `-r` argument instructs the script on where to save the logs. **Ensure the `-o` argument points to an empty folder.**

In [None]:
!tao n_gram download_specs \
    -r /results \
    -o /specs

---
### Data Convert


In preparation for training/fine-tuning, we need to preprocess the dataset. The `tao n_gram dataset_convert` command can be used in conjunction with the appropriate configuration in the spec file. Here is the sample `dataset_convert.yaml` spec file we use:
```
# Dataset. Available options: [assistant]
dataset_name: assistant

# Extension of the files containing in dataset
extension: ???

# Path to the folder containing the dataset source files.
source_data_dir: ???

# Path to the output folder.
target_data_file: ???

```
Take a look at the `.yaml` spec files we provide.
As we show below, you can override the `source_data_dir` and `target_data_file` options with the appropriate paths.

In [None]:
# Preprocess training data (LibriSpeech LM Normalized)
!tao n_gram dataset_convert \
            -e /specs/dataset_convert.yaml \
            -r /results/dataset_convert \
            extension=*.txt \
            source_data_dir=/data/reduced_training.txt \
            target_data_file=/data/preprocessed.txt

# Preprocess evaluation data (LibriSpeech dev-clean)
!tao n_gram dataset_convert \
            -e /specs/dataset_convert.yaml \
            -r /results/dataset_convert \
            extension=*.txt \
            source_data_dir=/data/text/dev-clean.txt \
            target_data_file=/data/preprocessed_dev_clean.txt

The command preprocesses the training and evaluation datasets using basic text preprocessing, including lowercasing, normalization, and punctuation removal, and writes the results to `preprocessed.txt` (training) and `preprocessed_dev_clean.txt` (evaluation). In both files, each preprocessed sentence is on its own line.
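
For intuition, the kind of per-line cleanup described here can be sketched in a few lines of Python. This is only an approximation under assumptions: the exact normalization rules applied by `tao n_gram dataset_convert` are not shown in this tutorial, and the function name `preprocess_line` is hypothetical. The sketch keeps lowercase letters, apostrophes, and single spaces, and drops everything else (including digits).

```python
import re

def preprocess_line(line):
    """Lowercase a line, drop punctuation and digits, and collapse whitespace."""
    lowered = line.lower()
    # Replace every run of characters other than a-z, apostrophe, or space
    letters_only = re.sub(r"[^a-z' ]+", " ", lowered)
    # Collapse repeated whitespace and trim the ends
    return " ".join(letters_only.split())

sample = "Hello, World!  It's 7:30 -- time to GO."
cleaned = preprocess_line(sample)
print(cleaned)  # hello world it's time to go
```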

---
<a id='isc-training'></a>
### Training / Fine-tuning


Training a model using TAO Toolkit is as simple as configuring your spec file and running the train command. The following code uses the `train.yaml` spec file available to you as reference. The spec file configurations can easily be overridden using the `tao-launcher` CLI. For example, below we override the `model.order` and `training_ds.data_file` configurations to suit our needs. <br>

For training an N-gram language model in TAO Toolkit, we use the `tao n_gram train` command with the following arguments:
- `-e`: Path to the spec file
- `-k`: User specified encryption key to use while saving/loading the model
- `-r`: Path to a folder where the outputs should be written. Ensure this is mapped in the `~/.tao_mounts.json` file.
- Any overrides to the spec file. For example, `model.order`.
<br>


For more information about these arguments, refer to the [TAO Toolkit Getting Started Guide](https://docs.nvidia.com/tao/tao-toolkit/text/overview.html). <br>
`Note:` All file paths correspond to the destination mounted directory that is visible in the TAO Toolkit Docker container used in the backend.<br>

In [None]:
!tao n_gram train \
            -e /specs/train.yaml \
            -r /results/base \
            training_ds.data_file=/data/preprocessed.txt \
            model.order=4 

The train command produces three files called `train_n_gram.arpa`, `train_n_gram.vocab`, and `train_n_gram.kenlm_intermediate`, saved at `$RESULT_DIR_LOCAL/base/checkpoints`.

---
<a id='evaluation'></a>
### Evaluation
The evaluation spec `.yaml` is as simple as:

```
# Name of the `.arpa` or `.binary` file where the trained model will be restored from.
restore_from: ???

test_ds:
  data_file: ???
  
```

In [None]:
!tao n_gram evaluate \
     -e /specs/evaluate.yaml \
     -r /results/evaluate \
     restore_from=/results/base/checkpoints/train_n_gram.arpa \
     test_ds.data_file=/data/preprocessed_dev_clean.txt

The output of the evaluation gives us the perplexity of the N-gram language model on the evaluation (LibriSpeech dev-clean) dataset.
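
As a reminder of what perplexity measures, the sketch below computes it from per-token log probabilities: it is the inverse of the geometric mean of the token probabilities, so lower is better. The function name `perplexity` and the toy numbers are illustrative assumptions. Base-10 logs are used because that is the convention of the ARPA format, and the exact token accounting in TAO's evaluation may differ.

```python
import math

def perplexity(log10_probs):
    """Perplexity from a list of per-token base-10 log probabilities."""
    if not log10_probs:
        raise ValueError("need at least one token")
    average = sum(log10_probs) / len(log10_probs)
    return 10 ** (-average)

# A model that assigns probability 0.1 to each of 5 tokens (toy values)
uniform = [math.log10(0.1)] * 5
ppl = perplexity(uniform)
print(round(ppl, 6))
```

A model that is always choosing uniformly among 10 equally likely words has a perplexity of 10, which is why perplexity is often read as an effective branching factor.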

---
<a id='isc-inference'></a>
### Inference
Inference using a trained `.arpa` or `.binary` model uses the `tao n_gram infer` command.  <br>
The `infer.yaml` is also very simple, and we can directly give inputs for the model to run inference.
```
# "Simulate" user input:
input_batch:
  - 'set alarm for seven thirty am'
  - 'lower volume by fifty percent'
  - 'what is my schedule for tomorrow'

restore_from: ???

```

Try out your own inputs as an exercise.

In [None]:
!tao n_gram infer \
            -e /specs/infer.yaml \
            -r /results/infer \
            restore_from=/results/base/checkpoints/train_n_gram.arpa

This command returns the log likelihood, perplexity, and all n-grams for each of the input sequences that users provided.

---
<a id='isc-export-riva'></a>
### Export to Riva

With TAO Toolkit, you can also export your model in a format that can be deployed using [NVIDIA Riva](https://developer.nvidia.com/riva), a highly performant application framework for multi-modal conversational AI services using GPUs. The export command converts the trained language model from `.arpa` to `.binary`, with the option of quantizing the model binary. We set `export_format` in the spec file to `RIVA` to create a `.riva` file that contains the language model binary and its corresponding vocabulary.

In [None]:
!tao n_gram export \
            -e /specs/export.yaml \
            -r /results/base \
            export_format=RIVA \
            export_to=exported-base.riva \
            restore_from=/results/base/checkpoints/train_n_gram.arpa \
            binary_type=trie \
            binary_q_bits=8 \
            binary_b_bits=7 \
            binary_a_bits=256         

The model is exported as `exported-base.binary`, which is in a format suited for deployment in Riva.

---
<a id='isc-deploy'></a>
## RIVA deployment with ASR


### Riva ServiceMaker
ServiceMaker is the set of tools that aggregates all the necessary artifacts (models, files, configurations, and user settings) for Riva deployment to a target environment. Its main components are described below:


### 1. Riva-build

This step builds a Riva-ready version of the model. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. This tutorial uses an ASR Citrinet model, although the same setup can also be used for Conformer models.<br>

`riva-build` is responsible for the combination of one or more exported models (.riva files) into a single file containing an intermediate format called Riva Model Intermediate Representation (.rmir). This file contains a deployment-agnostic specification of the whole end-to-end pipeline along with all the assets required for the final deployment and inference. Check out the [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/service-asr.html#pipeline-configuration) to find out more.

In [None]:
# Riva skills version
RIVA_VERSION="2.6.0"

# ServiceMaker Docker
RIVA_SM_CONTAINER = f"nvcr.io/nvidia/riva/riva-speech:{RIVA_VERSION}-servicemaker"

# Riva API Docker
RIVA_API_CONTAINER =f"nvcr.io/nvidia/riva/riva-speech:{RIVA_VERSION}"

# Directory in which to create the model repo
MODEL_LOC = "<YOUR_PATH_TO_MODEL_REPO>"

# NGC name and version of the pretrained ASR model to download
MODEL_NAME = "nvidia/tao/speechtotext_en_us_citrinet:deployable_v3.0"

# Key that model is encrypted with, while exporting with TAO
KEY = "tlt_encode"

# NGC API KEY, can be generated from ngc.nvidia.com/setup
NGC_API_KEY="<YOUR_NGC_API_KEY>"

In [None]:
# Get the ServiceMaker docker and latest riva ASR model
! mkdir $MODEL_LOC
! docker pull $RIVA_SM_CONTAINER
! ngc registry model download-version $MODEL_NAME
! mv speechtotext_en_us_citrinet_vdeployable_v3.0/citrinet-1024-Jarvis-asrset-3_0-encrypted.riva $MODEL_LOC/Citrinet.riva

### 2. Set up the Flashlight decoder
The Flashlight decoder, deployed by default in Riva, is a lexicon-based decoder and only emits words that are present in the provided lexicon file.

Vocabulary file: The vocabulary file is a flat text file containing a list of vocabulary words, each on its own line. For example:
```
the
i
to
and
a
you
of
that
```
This file is used by the riva-build process to generate the lexicon file.

Lexicon file: The lexicon file is a flat text file that contains the mapping of each vocabulary word to its tokenized form (for example, SentencePiece tokens), separated by a tab. Below is an example:
```
with    ▁with
not     ▁not
this    ▁this
just    ▁just
my      ▁my
as      ▁as
don't   ▁don ' t
```
Note: Ultimately, the Riva decoder makes use only of the lexicon file directly at run time (but not the vocabulary file).

Riva ServiceMaker automatically tokenizes the words in the vocabulary file to generate the lexicon file. It uses the correct tokenizer model that is packaged together with the acoustic model in the .riva file. By default, Riva generates 1 tokenized form for each word in the vocabulary file.
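
The vocabulary file described above is simply the set of unique words in the LM training text, one per line. A minimal Python sketch of building it is shown here; the helper name `build_vocab` and the toy text are assumptions for illustration, and the demo runs in a temporary folder. Unlike a plain split on single spaces, `str.split()` never produces empty entries.

```python
import os
import tempfile

def build_vocab(text_paths, vocab_path):
    """Write the sorted set of unique whitespace-separated words, one per line."""
    words = set()
    for path in text_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                words.update(line.split())
    ordered = sorted(words)
    with open(vocab_path, "w", encoding="utf-8") as f:
        for word in ordered:
            f.write(word + "\n")
    return ordered

# Demo on a tiny toy file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "preprocessed.txt")
    with open(text_path, "w", encoding="utf-8") as f:
        f.write("the patient saw the doctor\nthe doctor left\n")
    vocab = build_vocab([text_path], os.path.join(tmp, "dict_vocab.txt"))

print(vocab)
```

Because `build_vocab` accepts several input files, the same helper can merge base and domain text into a single vocabulary.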

In [None]:
# Generate vocabulary using base LM training data
! cat $DATA_DOWNLOAD_DIR/preprocessed.txt | sed "s/ /\n/g" | sort -u > $RESULT_DIR_LOCAL/base/dict_vocab.txt

In [None]:
# Generate the RMIR file with trained Base Language Model
! docker run -it --rm --gpus 0 -v $MODEL_LOC:/data \
            -v $RESULT_DIR_LOCAL/base:/lm \
            --name riva-service-maker-lm \
            $RIVA_SM_CONTAINER  -- \
            riva-build speech_recognition /data/base_asr.rmir:$KEY \
            /data/Citrinet.riva:$KEY \
            --ms_per_timestep=80 \
            --chunk_size=0.16 \
            --left_padding_size=1.92 \
            --right_padding_size=1.92 \
            --decoder_type=flashlight \
            --decoding_language_model_binary=/lm/exported-base.binary \
            --decoding_vocab=/lm/dict_vocab.txt \
            --flashlight_decoder.lm_weight=0.2 \
            --flashlight_decoder.word_insertion_score=0.2 \
            --flashlight_decoder.beam_threshold=20. \
            --featurizer.dither=0.0

### 3. Riva-deploy

The deployment tool takes as input one or more Riva Model Intermediate Representation (RMIR) files and a target model repository directory. It creates an ensemble configuration specifying the pipeline for the execution and finally writes all those assets to the output model repository directory.

In [None]:
# Syntax: riva-deploy -f dir-for-rmir/model.rmir:key output-dir-for-repository
! docker run --rm --gpus 0 -v $MODEL_LOC:/data $RIVA_SM_CONTAINER -- \
            riva-deploy -f  /data/base_asr.rmir:$KEY /data/models/

---
## Start Riva Server
Once the model repository is generated, we are ready to start the Riva server. From this step onwards you need to download the Riva QuickStart Resource from NGC. 

In [None]:
# Download Riva Quickstart
! ngc registry resource download-version nvidia/riva/riva_quickstart:$RIVA_VERSION

#### config.sh snippet
```
service_enabled_asr=true                                                              ## MAKE CHANGES HERE
service_enabled_nlp=false                                                             ## MAKE CHANGES HERE
service_enabled_tts=false                                                             ## MAKE CHANGES HERE

# Enable Riva Enterprise
# If enrolled in Enterprise, enable Riva Enterprise by setting configuration
# here. You must explicitly acknowledge you have read and agree to the EULA.
# RIVA_API_KEY=<ngc api key>                                                               
# RIVA_API_NGC_ORG=<ngc organization>                                                             
# RIVA_EULA=accept

# Language code to fetch models of a specify language
# Currently only ASR supports languages other than English
# Supported language codes: en-US, de-DE, es-US, ru-RU, zh-CN, hi-IN, fr-FR
# for any language other than English, set service_enabled_nlp and service_enabled_tts to False
# for multiple languages enter space separated language codes.
language_code=("en-US")

# ASR acoustic model architecture
# Supported values are: conformer, citrinet_1024, citrinet_256 (en-US + arm64 only), jasper (en-US + amd64 only), quartznet (en-US + amd64 only)
asr_acoustic_model=("conformer")

# Specify one or more GPUs to use
# specifying more than one GPU is currently an experimental feature, and may result in undefined behaviours.
gpus_to_use="device=0"

# Specify the encryption key to use to deploy models
MODEL_DEPLOY_KEY="tlt_encode"                                                        ## MAKE CHANGES HERE

# Locations to use for storing models artifacts
#
# If an absolute path is specified, the data will be written to that location
# Otherwise, a docker volume will be used (default).
#
# riva_init.sh will create a `rmir` and `models` directory in the volume or
# path specified.
#
# RMIR ($riva_model_loc/rmir)
# Riva uses an intermediate representation (RMIR) for models
# that are ready to deploy but not yet fully optimized for deployment. Pretrained
# versions can be obtained from NGC (by specifying NGC models below) and will be
# downloaded to $riva_model_loc/rmir by `riva_init.sh`
#
# Custom models produced by NeMo or TLT and prepared using riva-build
# may also be copied manually to this location $(riva_model_loc/rmir).
#
# Models ($riva_model_loc/models)
# During the riva_init process, the RMIR files in $riva_model_loc/rmir
# are inspected and optimized for deployment. The optimized versions are
# stored in $riva_model_loc/models. The riva server exclusively uses these
# optimized versions.
riva_model_loc="riva-model-repo"                                                  ## MAKE CHANGES HERE (Replace with MODEL_LOC)            
```

In [None]:
# Ensure you have permission to execute these scripts
! cd riva_quickstart_v$RIVA_VERSION && chmod +x ./riva_init.sh && chmod +x ./riva_start.sh && chmod +x ./riva_stop.sh

In [None]:
# Run Riva Start. This will deploy your model(s).
! cd riva_quickstart_v$RIVA_VERSION && ./riva_start.sh config.sh

## Download Evaluation dataset


In [None]:
#Note: This data can be used only with NVIDIA’s products or services for evaluation and benchmarking purposes.
! ngc registry resource  download-version --dest $DATA_DOWNLOAD_DIR nvstaging/tao/healthcare_eval_dataset:1.0

---
## Run Inference
Once the Riva server is up and running with your models, you can send inference requests querying the server. 

In [None]:
! docker run --rm -v $DATA_DOWNLOAD_DIR/healthcare_eval_dataset_v1.0:/data  \
    --net=host $RIVA_API_CONTAINER -- \
     riva_streaming_asr_client \
        --automatic_punctuation=false \
        --interim_results=false \
        --word_time_offsets=false \
        --audio_file /data/general.json \
        --output_filename=/data/base_asr_on_base_output.json

In [None]:
! docker run --rm -v $DATA_DOWNLOAD_DIR/healthcare_eval_dataset_v1.0/:/data  \
    --net=host $RIVA_API_CONTAINER -- \
     riva_streaming_asr_client \
        --automatic_punctuation=false \
        --interim_results=false \
        --word_time_offsets=false \
        --audio_file /data/healthcare.json \
        --output_filename=/data/base_asr_on_domain_output.json

### Calculate word error rate


In [None]:
! pip install jiwer
from jiwer import wer
import json

def calculate_wer(ground_truth_manifest, asr_transcript):
    data ={}
    ground_truths = []
    predictions = []
    with open(ground_truth_manifest) as file:
        for line in file:
            dt = json.loads(line)
            data[dt['audio_filepath']] = dt['text']
    with open(asr_transcript) as file:
        for line in file:
            dt = json.loads(line)
            if dt['audio_filepath'] in data:
                ground_truths.append(data[dt['audio_filepath']])
                predictions.append(dt['text'])
    return round(100*wer(ground_truths, predictions), 2)
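
For intuition about what `wer` measures, the sketch below computes the word error rate of a single sentence pair by hand: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions. This is not the `jiwer` implementation, which also handles lists of sentences.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# One substitution ("an" -> "a") and one deletion ("scan") over 6 reference words
score = word_error_rate("the patient needs an mri scan", "the patient needs a mri")
print(round(100 * score, 2))
```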

In [None]:
print( "WER of base model on generic domain data", calculate_wer(f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/general.json", f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/base_asr_on_base_output.json"))

In [None]:
print("WER of base model on Healthcare domain data", calculate_wer(f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/healthcare.json", f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/base_asr_on_domain_output.json"))

The results above show that the model performs well on general data but not on healthcare-specific domain data. We can fine-tune the language model on healthcare domain data to boost ASR performance.

---
<a id='isc-Finetuning'></a>
### Finetuning/Interpolation


Fine-tuning trains a second model on new domain data and interpolates it with the previously trained model. This requires that the original model was trained with `intermediate` enabled (`model.intermediate: True` in the training spec). A fine-tuned model cannot be fine-tuned again. <br>
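
To illustrate what interpolating two language models means, the sketch below mixes two toy word distributions with a single weight. This is an assumption-laden illustration, not TAO's implementation: it uses simple linear interpolation, whereas the scheme applied by `tao n_gram finetune` is not described in this tutorial and may differ (for example, it could be log-linear). The function name `interpolate` and all probabilities are hypothetical.

```python
def interpolate(p_base, p_domain, weight):
    """Linear mix: weight * P_domain(w) + (1 - weight) * P_base(w)."""
    vocab = set(p_base) | set(p_domain)
    return {
        word: weight * p_domain.get(word, 0.0) + (1.0 - weight) * p_base.get(word, 0.0)
        for word in vocab
    }

# Toy next-word distributions (hypothetical values)
base = {"patient": 0.1, "weather": 0.6, "doctor": 0.3}
domain = {"patient": 0.5, "weather": 0.1, "doctor": 0.4}

# weight is the share given to the domain-specific model
mixed = interpolate(base, domain, weight=0.6)
print({word: round(p, 2) for word, p in sorted(mixed.items())})
```

With a domain weight of 0.6, domain-typical words such as "patient" gain probability while general words such as "weather" lose some, and the mixture still sums to 1.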


### Downloading and processing domain data (healthcare) for LM finetuning
We can use Reddit data for this purpose. It is a collection of corpora built from the Pushshift.io Reddit Corpus. Each corpus contains posts and comments from an individual subreddit from its inception until October 2018.

In [None]:
!pip install convokit

In [None]:
from convokit import Corpus, download
corpus = Corpus(download('subreddit-healthcare'))

In [None]:
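# Perform basic text cleaning and generate domain data
import re

def clean_text(text):
    # Lowercase, then keep only letters, apostrophes, and spaces
    text = re.sub(r"[^a-z' ]+", "", text.lower().strip())
    # Collapse repeated whitespace
    text = ' '.join(text.split())
    # Keep only sentences longer than five words; shorter ones return None
    if len(text.split()) > 5:
        return text
    return None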
    
with open(f'{DATA_DOWNLOAD_DIR}/domain_data_all.txt', 'w') as file:
    for utt in corpus.iter_utterances():
        text = clean_text(utt.text)
        if text:
            file.write(text+'\n')
            
# Picking top 10000 lines from dataset
! head -10000 $DATA_DOWNLOAD_DIR/domain_data_all.txt > $DATA_DOWNLOAD_DIR/domain_data.txt


To fine-tune an N-gram language model in TAO Toolkit, we use the `tao n_gram finetune` command with the following arguments:
- `-e`: Path to the spec file
- `-k`: User specified encryption key to use while saving/loading the model
- `-r`: Path to a folder where the outputs should be written. Ensure this is mapped in the `~/.tao_mounts.json` file.
- Any overrides to the spec file. For example, `model.order` or `weight`.
<br>


More details about these arguments are available in the [TAO Toolkit Getting Started Guide](https://docs.nvidia.com/tao/tao-toolkit/text/overview.html). <br>
`Note:` All file paths refer to the destination mounted directory that is visible inside the TAO Toolkit Docker container used in the backend.<br>
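
To build intuition for the `weight` override, the sketch below mixes two toy unigram distributions by linear interpolation, giving `weight` to the domain model and `1 - weight` to the base model. This is only an illustration: the probabilities are made up, and the interpolation scheme that TAO Toolkit applies internally may differ (for example, it may be log-linear).

```python
def interpolate(p_base, p_domain, weight):
    """Linearly interpolate two word -> probability dicts.

    `weight` is the share given to the domain model (illustrative assumption).
    """
    vocab = set(p_base) | set(p_domain)
    return {
        w: weight * p_domain.get(w, 0.0) + (1.0 - weight) * p_base.get(w, 0.0)
        for w in vocab
    }

# Made-up unigram probabilities for a base and a domain model
p_base = {"the": 0.6, "weather": 0.3, "dosage": 0.1}
p_domain = {"the": 0.5, "weather": 0.1, "dosage": 0.4}

mixed = interpolate(p_base, p_domain, weight=0.6)
for word in sorted(mixed):
    print(f"{word}: {mixed[word]:.2f}")
```

With a higher `weight`, domain terms such as "dosage" gain probability at the expense of words that are more typical of the base corpus.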

In [None]:
# Interpolate the domain LM with base LM
# weight is the interpolation weight given to the domain-specific model
!tao n_gram finetune \
            -e /specs/finetune.yaml \
            -r /results \
            restore_from=/results/base/checkpoints/train_n_gram.kenlm_intermediate \
            tuning_ds.data_file=/data/domain_data.txt \
            model.order=4 \
            weight=0.6 \
            -k $KEY

In [None]:
# Export interpolated LM to Riva compatible format
!tao n_gram export \
            -e /specs/export.yaml \
            -r /results/interpolated \
            export_format=RIVA \
            export_to=exported-model.riva \
            restore_from=/results/checkpoints/finetune_n_gram.arpa \
            binary_type=trie \
            binary_q_bits=8 \
            binary_b_bits=7 \
            binary_a_bits=256

The interpolated LM is now generated at /results/interpolated/exported-model.binary. <br>
We can now use this LM, along with a new vocabulary file, to generate a model repository for domain-specific ASR.

In [None]:
# Add domain specific words to vocabulary file
! cat $DATA_DOWNLOAD_DIR/domain_data.txt | sed "s/ /\n/g" | sort -u > $RESULT_DIR_LOCAL/interpolated/dict_vocab_domain.txt
! cat $RESULT_DIR_LOCAL/base/dict_vocab.txt $RESULT_DIR_LOCAL/interpolated/dict_vocab_domain.txt | sort -u > $RESULT_DIR_LOCAL/interpolated/dict_vocab.txt
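
For readers who prefer to stay in Python, the two shell pipelines above amount to collecting the unique words of the domain text and taking their union with the base vocabulary. The following is a minimal sketch on in-memory data; the helper name `merge_vocab` and the sample inputs are hypothetical, and Python's `sorted` may order words differently from `sort -u` depending on your locale.

```python
def merge_vocab(base_vocab, domain_lines):
    """Return the sorted union of base words and all words in domain_lines."""
    domain_words = {word for line in domain_lines for word in line.split()}
    return sorted(set(base_vocab) | domain_words)

# Made-up inputs standing in for dict_vocab.txt and domain_data.txt
base_vocab = ["a", "patient", "the"]
domain_lines = ["the patient needs a referral", "insurance covers the referral"]

print(merge_vocab(base_vocab, domain_lines))
```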

In [None]:
# Generate a new model repo with the interpolated LM. Set MODEL_LOC_DOMAIN to the absolute path where the repo should be created
MODEL_LOC_DOMAIN = "<YOUR_PATH_TO_DOMAIN_MODEL_REPO>"
! mkdir $MODEL_LOC_DOMAIN
! cp $MODEL_LOC/Citrinet.riva $MODEL_LOC_DOMAIN/
! docker run -it --rm --gpus 0 -v $MODEL_LOC_DOMAIN:/data \
            -v $RESULT_DIR_LOCAL/interpolated:/lm \
            --name riva-service-maker-lm \
            $RIVA_SM_CONTAINER  -- \
            riva-build speech_recognition /data/interpolated_asr.rmir:$KEY \
            /data/Citrinet.riva:$KEY \
            --ms_per_timestep=80 \
            --chunk_size=0.16 \
            --left_padding_size=1.92 \
            --right_padding_size=1.92 \
            --decoder_type=flashlight \
            --decoding_language_model_binary=/lm/exported-model.binary \
            --decoding_vocab=/lm/dict_vocab.txt \
            --flashlight_decoder.lm_weight=0.2 \
            --flashlight_decoder.word_insertion_score=0.2 \
            --flashlight_decoder.beam_threshold=20. \
            --force --featurizer.dither=0.0
! docker run --rm --gpus 0 -v $MODEL_LOC_DOMAIN:/data $RIVA_SM_CONTAINER -- \
            riva-deploy -f  /data/interpolated_asr.rmir:$KEY /data/models/

In [None]:
# Update riva_model_loc in riva_quickstart config file to MODEL_LOC_DOMAIN
! cd riva_quickstart_v$RIVA_VERSION && ./riva_stop.sh && ./riva_start.sh config.sh
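
The config update mentioned in the comment above can be done by hand or scripted. The sketch below rewrites the `riva_model_loc` assignment in the text of a config file. It assumes the quickstart `config.sh` sets the variable on a single line of the form `riva_model_loc="..."`, which you should verify against your copy. The helper name `set_riva_model_loc` and the sample text are hypothetical.

```python
import re

def set_riva_model_loc(config_text, new_loc):
    """Replace the value of riva_model_loc in shell-style config text."""
    pattern = re.compile(r"^riva_model_loc=.*$", flags=re.MULTILINE)
    replacement = f'riva_model_loc="{new_loc}"'
    # A callable replacement avoids backslash interpretation in the path
    new_text, count = pattern.subn(lambda m: replacement, config_text)
    if count == 0:
        raise ValueError("riva_model_loc assignment not found")
    return new_text

# Made-up excerpt standing in for the contents of config.sh
sample = 'service_enabled_asr=true\nriva_model_loc="riva-model-repo"\n'
print(set_riva_model_loc(sample, "/abs/path/to/domain_model_repo"))
```

To apply it to a real file, read the file contents, pass them through the function, and write the result back before restarting the server.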

In [None]:
# Get model transcripts on base data
! docker run --rm -v $DATA_DOWNLOAD_DIR/healthcare_eval_dataset_v1.0:/data  \
    --net=host $RIVA_API_CONTAINER -- \
     riva_streaming_asr_client \
        --automatic_punctuation=false \
        --interim_results=false \
        --word_time_offsets=false \
        --audio_file /data/general.json \
        --output_filename=/data/interpolated_asr_on_base_output.json

In [None]:
# Get model transcripts on Healthcare domain data
! docker run --rm -v $DATA_DOWNLOAD_DIR/healthcare_eval_dataset_v1.0:/data  \
    --net=host $RIVA_API_CONTAINER -- \
     riva_streaming_asr_client \
        --automatic_punctuation=false \
        --interim_results=false \
        --word_time_offsets=false \
        --audio_file /data/healthcare.json \
        --output_filename=/data/interpolated_asr_on_domain_output.json

In [None]:
# Check WER on base data
print("WER of base model on generic data: ", calculate_wer(f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/general.json", f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/base_asr_on_base_output.json"))
print("WER of Domain model on generic data: ", calculate_wer(f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/general.json", f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/interpolated_asr_on_base_output.json"))

In [None]:
# Check WER on Healthcare domain data
print("WER of base model on Healthcare domain data: ", calculate_wer(f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/healthcare.json", f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/base_asr_on_domain_output.json"))
print("WER of Domain model on Healthcare domain data: ", calculate_wer(f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/healthcare.json", f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/interpolated_asr_on_domain_output.json"))


With the help of interpolation, we were able to improve the performance of our ASR model on the healthcare domain as well as on the generic domain.
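
As a reminder of what these numbers measure, word error rate (WER) is the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below computes it for a single sentence pair. It is an illustration only and is not the `calculate_wer` helper used in this notebook, which takes JSON files as input; the function name `word_error_rate` and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the patient has a fever", "the patient has fever"))
```

Here one of the five reference words is missing from the hypothesis, so the sentence-level WER is one deletion out of five words.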

### Pruning

An LM generated by simply passing the text corpus to TAO Toolkit contains some n-grams that occur infrequently in the corpus and therefore have very low probabilities. Such n-grams can be removed by `pruning`.<br>
Pruning requires thresholds, which can be defined in the `train.yaml` file as follows (for a 4-gram model):
```
pruning:
    - 0
    - 1
    - 7
    - 9
```
or passed as a command-line argument as follows:<br>
`model.pruning=[0,1,7,9]`

All n-grams with a frequency less than or equal to the specified threshold are eliminated.<br>
Here, 2-grams with frequency <= 1, 3-grams with frequency <= 7, and 4-grams with frequency <= 9 are eliminated.<br>
There is a tradeoff between the degree of pruning and accuracy: higher pruning thresholds reduce the size of the language model, but can come at the cost of model accuracy.  

#### *Note:
Pruning of 1-grams is not supported; the threshold for 1-grams should always be 0.
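
The thresholding rule can be sketched on a toy count table. The counts below are invented, and the function only illustrates the rule as described here (drop an n-gram of order k when its count is at or below the k-th threshold); the actual pruning is performed by the toolkit during training and may differ in detail.

```python
def prune(ngram_counts, thresholds):
    """Keep n-grams whose count exceeds the threshold for their order."""
    kept = {}
    for ngram, count in ngram_counts.items():
        order = len(ngram)  # 1 for unigrams, 2 for bigrams, ...
        if count > thresholds[order - 1]:
            kept[ngram] = count
    return kept

# Made-up counts keyed by word tuples
counts = {
    ("patient",): 1,
    ("the", "patient"): 1,
    ("blood", "pressure"): 5,
    ("high", "blood", "pressure"): 7,
    ("check", "your", "blood", "pressure"): 12,
}

for ngram, count in prune(counts, thresholds=[0, 1, 7, 9]).items():
    print(" ".join(ngram), count)
```

With thresholds `[0, 1, 7, 9]`, the unigram is always kept, while the bigram seen once and the trigram seen seven times are dropped.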

In [None]:
!tao n_gram train \
    -e /specs/train.yaml \
    -r /results/pruned \
    training_ds.data_file=/data/preprocessed.txt \
    model.order=4 \
    model.pruning=[0,1,7,9]

In [None]:
# Let's check the size of the original LM and the pruned LM
!echo "Size of unpruned ARPA: $(du -h $RESULT_DIR_LOCAL/base/checkpoints/train_n_gram.arpa | cut -f 1)"
!echo "Size of pruned ARPA: $(du -h $RESULT_DIR_LOCAL/pruned/checkpoints/train_n_gram.arpa | cut -f 1)"
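
File size is a coarse measure. An ARPA file also begins with a `\data\` header that lists how many n-grams of each order it contains, so the effect of pruning can be inspected per order. The sketch below parses such a header from an iterable of lines; the sample header and its counts are made up, and `arpa_ngram_counts` is a hypothetical helper.

```python
def arpa_ngram_counts(lines):
    """Parse 'ngram N=COUNT' entries from the header of an ARPA file."""
    counts = {}
    in_header = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header and line.startswith("ngram "):
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif in_header and line.startswith("\\"):
            break  # reached the first n-gram section, e.g. \1-grams:
    return counts

# Made-up header for illustration
sample_header = [
    "\\data\\",
    "ngram 1=5000",
    "ngram 2=42000",
    "ngram 3=9000",
    "",
    "\\1-grams:",
]
print(arpa_ngram_counts(sample_header))
```

Because the function accepts any iterable of lines, an open file handle for `train_n_gram.arpa` can be passed in directly, and the parser stops reading once the header ends.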

In [None]:
# Let's deploy the ASR server with the pruned LM. Set MODEL_LOC_PRUNED to the absolute path where the repo should be created
MODEL_LOC_PRUNED = "<YOUR_PATH_TO_PRUNING_MODEL_REPO>"

# Export to Riva format
!tao n_gram export \
            -e /specs/export.yaml \
            -r /results/base \
            export_format=RIVA \
            export_to=pruned-base.riva \
            restore_from=/results/pruned/checkpoints/train_n_gram.arpa \
            binary_type=trie \
            binary_q_bits=8 \
            binary_b_bits=7 \
            binary_a_bits=256

# Generate RMIR
! mkdir $MODEL_LOC_PRUNED
! cp $MODEL_LOC/Citrinet.riva $MODEL_LOC_PRUNED/
! docker run -it --rm --gpus 0 -v $MODEL_LOC_PRUNED:/data \
            -v $RESULT_DIR_LOCAL/base:/lm \
            --name riva-service-maker-lm \
            $RIVA_SM_CONTAINER  -- \
            riva-build speech_recognition /data/pruned_asr.rmir:$KEY \
            /data/Citrinet.riva:$KEY \
            --ms_per_timestep=80 \
            --chunk_size=0.16 \
            --left_padding_size=1.92 \
            --right_padding_size=1.92 \
            --decoder_type=flashlight \
            --decoding_language_model_binary=/lm/pruned-base.binary \
            --decoding_vocab=/lm/dict_vocab.txt \
            --flashlight_decoder.lm_weight=0.2 \
            --flashlight_decoder.word_insertion_score=0.2 \
            --flashlight_decoder.beam_threshold=20. \
            --force --featurizer.dither=0.0
                
# Deploy RMIR with Pruned LM
! docker run --rm --gpus 0 -v $MODEL_LOC_PRUNED:/data $RIVA_SM_CONTAINER -- \
            riva-deploy -f  /data/pruned_asr.rmir:$KEY /data/models/

In [None]:
# Update riva_model_loc in riva_quickstart config file to MODEL_LOC_PRUNED and then start server
! cd riva_quickstart_v$RIVA_VERSION && ./riva_stop.sh && ./riva_start.sh config.sh

In [None]:
# Evaluate model and calculate WERs
! docker run --rm -v $DATA_DOWNLOAD_DIR/healthcare_eval_dataset_v1.0/:/data  \
    --net=host $RIVA_API_CONTAINER -- \
     riva_streaming_asr_client \
        --automatic_punctuation=false \
        --interim_results=false \
        --word_time_offsets=false \
        --audio_file /data/general.json \
        --output_filename=/data/pruned_asr_on_base_output.json

! docker run --rm -v $DATA_DOWNLOAD_DIR/healthcare_eval_dataset_v1.0/:/data  \
    --net=host $RIVA_API_CONTAINER -- \
     riva_streaming_asr_client \
        --automatic_punctuation=false \
        --interim_results=false \
        --word_time_offsets=false \
        --audio_file /data/healthcare.json \
        --output_filename=/data/pruned_asr_on_domain_output.json

In [None]:
# Check WER on base data
print("WER of base model on generic data: ", calculate_wer(f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/general.json", f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/base_asr_on_base_output.json"))
print("WER of Pruned base model on generic data: ", calculate_wer(f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/general.json", f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/pruned_asr_on_base_output.json"))

# Check WER on Healthcare domain data
print("WER of base model on Healthcare domain data: ", calculate_wer(f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/healthcare.json", f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/base_asr_on_domain_output.json"))
print("WER of Pruned base model on Healthcare domain data: ", calculate_wer(f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/healthcare.json", f"{DATA_DOWNLOAD_DIR}/healthcare_eval_dataset_v1.0/pruned_asr_on_domain_output.json"))


Pruning drops some of the low-probability n-grams from the language model. This can affect the model either way, improving or degrading accuracy.
In our case, we were able to improve model performance by reducing the perplexity of the language model. 
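
Perplexity is derived from the per-token log probabilities that an LM assigns to held-out text, and lower is better. The sketch below computes it from base-10 log probabilities, the convention used in ARPA files. The scores are invented for illustration and do not come from the models trained in this notebook.

```python
def perplexity(log10_probs):
    """Perplexity from a list of per-token base-10 log probabilities."""
    if not log10_probs:
        raise ValueError("need at least one token score")
    avg_log10 = sum(log10_probs) / len(log10_probs)
    return 10 ** (-avg_log10)

# Made-up scores: every token is assigned probability 0.1
print(perplexity([-1.0, -1.0, -1.0, -1.0]))
```

A model that assigns each token a probability of 0.1 is, on average, as uncertain as a uniform choice among ten words, which is what a perplexity of 10 expresses.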