<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_asr_asr-python-advanced-tao-ngram-pretrain/nvidia_logo.png" style="width: 90px; float: right;">

# How to pretrain a Riva ASR language model (n-gram) with TAO Toolkit
This notebook walks through pretraining the Riva ASR language model (n-gram) with [NVIDIA Train Adapt Optimize (TAO)](https://developer.nvidia.com/tao) Toolkit.

## TAO Toolkit
Train Adapt Optimize (TAO) Toolkit is a simple and easy-to-use Python based AI toolkit for taking purpose-built AI models and customizing them with users' own data. Developers, researchers, and software partners building intelligent vision AI applications and services can bring their own data to fine-tune pre-trained models instead of going through the hassle of training from scratch.

![Train Adapt Optimize (TAO) Toolkit](https://developer.nvidia.com/sites/default/files/akamai/embedded-transfer-learning-toolkit-software-stack-1200x670px.png)

Transfer learning extracts learned features from an existing neural network into a new one. Transfer learning is often used when creating a large training dataset is not feasible. The goal of this toolkit is to reduce an 80-hour workload to an 8-hour workload, which lets data scientists run considerably more train-test iterations in the same time frame.

Let's see this in action with a use case for Automatic Speech Recognition Language Modeling!

---
<a id='isc-task-description'></a>
## Language Modeling

### Task Description

Language modeling returns a probability distribution over a sequence of words. Besides assigning a probability to a whole sequence, a language model also assigns a probability to a given word (or sequence of words) following a sequence of words. <br>

> The sentence:  **all of a sudden I notice three guys standing on the sidewalk**
> would be scored higher than 
> the sentence: **on guys all I of notice sidewalk three a sudden standing the** by the language model. <br>

A language model trained on a large corpus can significantly improve the accuracy of an ASR system, as suggested in recent research.

There are primarily two types of language models:

- **N-gram language models**: These models use frequency of n-grams to learn the probability distribution over words. Two benefits of N-gram language models are simplicity and scalability – with a larger `n`, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.
- **Neural language models**: These models use different kinds of neural networks to model the probability distribution over words, and have surpassed the N-gram language models in the ability to model language, but are generally slower to evaluate.
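
To make the n-gram idea concrete, here is a minimal sketch of a bigram model with add-one smoothing that scores the two example sentences from the task description. The two-sentence corpus and the helper names `train_bigram` and `log10_prob` are made up for illustration; this is not how TAO Toolkit or KenLM estimates probabilities.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log10_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log10 probability of a sentence."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log10(numerator / denominator)
    return total


# Tiny illustrative corpus (hypothetical, not the LibriSpeech data)
corpus = [
    "all of a sudden I notice three guys standing on the sidewalk",
    "three guys notice a sign on the sidewalk",
]
unigrams, bigrams = train_bigram(corpus)

natural = "all of a sudden I notice three guys standing on the sidewalk"
shuffled = "on guys all I of notice sidewalk three a sudden standing the"
print(log10_prob(natural, unigrams, bigrams))
print(log10_prob(shuffled, unigrams, bigrams))
```

The natural word order receives the higher (less negative) score because its bigrams were observed in the corpus, while most bigrams of the shuffled sentence were not.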

In this tutorial, we will show how to train, evaluate, and optionally fine-tune an [**N-gram language model**](https://web.stanford.edu/~jurafsky/slp3/3.pdf) leveraging TAO Toolkit.

---
## Let's Dig in: Riva Language Modeling using TAO

### Installing and setting up TAO
Install TAO Toolkit inside a Python virtual environment.

It's a simple `pip` install.

In [None]:
!pip install nvidia-pyindex
!pip install nvidia-tao

To view the Docker image versions and the tasks that TAO can perform, use the `tao info` command.

In [None]:
!tao info --verbose

---
<a id='isc-prepare-data'></a>
### Preparing the dataset
#### LibriSpeech LM Normalized dataset
For this tutorial, we use the **normalized version of the LibriSpeech LM dataset** to **train** our N-gram language model. The normalized version of the LibriSpeech LM dataset is available [here](https://www.openslr.org/11/).

#### LibriSpeech dev-clean dataset
For this tutorial, we also use the **clean version of the LibriSpeech development set** to **evaluate** our N-gram language model. The clean version of the LibriSpeech development set is available [here](https://www.openslr.org/12/).

### Downloading the dataset
#### LibriSpeech LM Normalized dataset
The training data is publicly available [here](https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz) and can be downloaded directly.

In [None]:
# Imports
import os

# Create a local directory to save artifacts in this tutorial
LM_ARTIFACTS = os.path.join(os.getcwd(), "lm-pretraining-artifacts")
!mkdir -p $LM_ARTIFACTS

In [None]:
# Set the path to a folder where you want your data and results to be saved.
DATA_DOWNLOAD_DIR = os.path.join(LM_ARTIFACTS, "data")

!mkdir -p $DATA_DOWNLOAD_DIR

assert os.path.exists(DATA_DOWNLOAD_DIR), "Provided DATA_DOWNLOAD_DIR does not exist."

In [None]:
# NOTE: Ensure that the wget and gzip utilities are available. If not, install them.
if os.path.exists(os.path.join(DATA_DOWNLOAD_DIR, "librispeech-lm-norm.txt")):
    print("Dataset exists, skipping download")
else:
    print("Downloading and Extracting the Data")
    !wget 'https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz' -P $DATA_DOWNLOAD_DIR
    !gzip -dk $DATA_DOWNLOAD_DIR/librispeech-lm-norm.txt.gz;

#### LibriSpeech dev-clean dataset
The evaluation data is publicly available [here](https://www.openslr.org/resources/12/dev-clean.tar.gz) and can be downloaded directly. We provide a Python script below to download and preprocess the dataset for you.

In [None]:
"""
Scripts to download and preprocess LibriSpeech dev-clean
"""
import os
from multiprocessing import Pool

LOG_STR = " To regenerate this file, please remove it."

def find_transcript_files(root_dir):
    """Recursively collect all LibriSpeech transcript files under root_dir."""
    files = []
    for dirpath, _, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.endswith(".trans.txt"):
                files.append(os.path.join(dirpath, filename))
    return files

def transcript_to_list(file):
    audio_path = os.path.dirname(file)
    ret = []
    with open(file, "r") as f:
        for line in f:
            file_id, trans = line.strip().split(" ", 1)
            audio_file = os.path.abspath(os.path.join(audio_path, file_id + ".flac"))
            duration = 0  # We are not using the audio
            ret.append([file_id, audio_file, str(duration), trans.lower()])

    return ret


if __name__ == "__main__":
    
    name = "dev-clean"
    data_path = os.path.join(DATA_DOWNLOAD_DIR, "eval_data")
    text_path = os.path.join(DATA_DOWNLOAD_DIR, "text")
    lists_path = os.path.join(DATA_DOWNLOAD_DIR, "lists")
    os.makedirs(data_path, exist_ok=True)
    os.makedirs(text_path, exist_ok=True)
    os.makedirs(lists_path, exist_ok=True)
    data_http = "http://www.openslr.org/resources/12/"

    # Download the audio data
    print("Downloading the evaluation data.", flush=True)
    if not os.path.exists(os.path.join(data_path, "LibriSpeech", name)):
        print("Downloading and unpacking {}...".format(name))
        cmd = """wget -c {http}{name}.tar.gz -P {path};
                 yes n 2>/dev/null | gunzip {path}/{name}.tar.gz;
                 tar -C {path} -xf {path}/{name}.tar"""
        os.system(cmd.format(path=data_path, http=data_http, name=name))
    else:
        log_str = "{} part of data exists, skip its downloading and unpacking"
        print(log_str.format(name) + LOG_STR, flush=True)

    # Prepare the audio data
    print("Converting data into necessary format.", flush=True)
    word_dict = {}
    word_dict[name] = set()
    src = os.path.join(data_path, "LibriSpeech", name)
    assert os.path.exists(src), "Unable to find the directory - '{src}'".format(
        src=src
    )

    dst_list = os.path.join(lists_path, name + ".lst")
    if os.path.exists(dst_list):
        # The list file already exists, so do not regenerate it
        print(
            "Path {} exists, skip its generation.".format(dst_list) + LOG_STR,
            flush=True,
        )
    else:
        print("Analyzing {src}...".format(src=src), flush=True)
        transcript_files = find_transcript_files(src)
        transcript_files.sort()

        print("Writing to {dst}...".format(dst=dst_list), flush=True)
        with Pool(processes=8) as p:
            samples = list(p.imap(transcript_to_list, transcript_files))

        with open(dst_list, "w") as fout:
            for sp in samples:
                for s in sp:
                    word_dict[name].update(s[-1].split(" "))
                    s[0] = name + "-" + s[0]
                    fout.write(" ".join(s) + "\n")

    current_path = os.path.join(text_path, name + ".txt")
    if not os.path.exists(current_path):
        with open(os.path.join(lists_path, name + ".lst"), "r") as flist, open(
            os.path.join(text_path, name + ".txt"), "w"
        ) as fout:
            for line in flist:
                fout.write(" ".join(line.strip().split(" ")[3:]) + "\n")
    else:
        print(
            "Path {} exists, skip its generation.".format(current_path) + LOG_STR,
            flush=True,
        )

print("Done!", flush=True)


For the sake of reducing the time this tutorial takes, we reduce the number of lines of the training dataset. Feel free to modify the number of used lines.

In [None]:
# Use a random 10,000 lines for training
!shuf -n 10000 $DATA_DOWNLOAD_DIR/librispeech-lm-norm.txt  > $DATA_DOWNLOAD_DIR/reduced_training.txt

In [None]:
!head -n 5 $DATA_DOWNLOAD_DIR/reduced_training.txt

---
## TAO Toolkit workflow
The rest of the tutorial demonstrates what a sample TAO Toolkit workflow looks like.

### Setting TAO Toolkit Mounts

Now that our dataset has been downloaded, an important step in using TAO Toolkit is to set up the directory mounts. The TAO Toolkit launcher uses Docker containers under the hood, and **for our data and results directories to be visible to Docker, they need to be mapped**. The launcher can be configured using the config file `~/.tao_mounts.json`. Apart from the mounts, you can also configure additional options like the environment variables and the amount of shared memory available to the TAO Toolkit launcher. <br>

`IMPORTANT NOTE:` The following code creates a sample `~/.tao_mounts.json` file. Here, we map the directories in which we save the data, specs, results, and cache. You should configure it for your specific use case so that these directories are visible to the Docker container. **Ensure that the source directories exist on your machine.**

In [None]:
! ls -ltr {LM_ARTIFACTS}/data

In [None]:
# Define these paths on your local host machine
%env HOST_DATA_DIR={LM_ARTIFACTS}/data
%env HOST_SPECS_DIR={LM_ARTIFACTS}/specs
%env HOST_RESULTS_DIR={LM_ARTIFACTS}/results
%env HOST_CACHE_DIR={LM_ARTIFACTS}/cache

In [None]:
# Create these directories if they don't already exist
!mkdir -p {LM_ARTIFACTS}/specs
!mkdir -p {LM_ARTIFACTS}/results
!mkdir -p {LM_ARTIFACTS}/cache

In [None]:
# Mapping the local directories to the TAO docker.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tlt_configs = {
   "Mounts":[
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       },
       {
           "source": os.path.expanduser("~/.cache"),
           "destination": "/root/.cache"
       }
   ],
   "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         }
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tlt_configs, mfile, indent=4)

In [None]:
!cat ~/.tao_mounts.json

Users with basic knowledge of deep learning can get started building their own custom models using a simple specification file. It's essentially just one command each to run data preprocessing, training, fine-tuning, evaluation, inference, and export. All configurations happen through `.yaml` spec files. <br>

---
### Configuration/Specification Files

The essence of all commands in TAO Toolkit lies within `.yaml` spec files. There are sample spec files already available for you to use directly or as a reference to create your own.  Through these spec files, you can tune many knobs like the model, dataset, hyperparameters, etc. Each command (like train, fine-tune, evaluate, etc.) should have a dedicated spec file with configurations pertinent to it. <br>

Here is an example of the training spec file:

---
```
model:
  intermediate: True
  order: 2
  pruning:
    - 0
training_ds:
  is_tarred: false
  is_file: true
  data_file: ???

vocab_file: ""
encryption_key: "tlt_encode"
...
```
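
Command-line overrides such as `model.order=3` address nested keys of a spec like the one above with dotted paths. The following sketch shows that idea on a plain Python dictionary. The helper `apply_override` is hypothetical and is not the launcher's actual implementation.

```python
import copy


def apply_override(config, dotted_key, value):
    """Return a copy of a nested dict with `dotted_key` set to `value`."""
    updated = copy.deepcopy(config)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Walk down the nesting, creating intermediate levels if needed
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


# Dictionary mirroring the sample training spec
spec = {
    "model": {"intermediate": True, "order": 2, "pruning": [0]},
    "training_ds": {"is_tarred": False, "is_file": True, "data_file": None},
}

overridden = apply_override(spec, "model.order", 3)
overridden = apply_override(overridden, "training_ds.data_file", "/data/preprocessed.txt")
print(overridden["model"]["order"], overridden["training_ds"]["data_file"])
```

The original `spec` is left untouched, which mirrors the way overrides change a single run without editing the spec file on disk.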


---
### Set Relevant Paths
Please set these paths according to your environment.

In [None]:
# NOTE: The following paths are set from the perspective of the TAO Toolkit Docker. 

# The data is saved here
DATA_DIR='/data'

# The configuration files are stored here
SPECS_DIR='/specs/n_gram'

# The results are saved at this path
RESULTS_DIR='/results/n_gram'

# Set your encryption key, and use the same key for all commands
KEY='tlt_encode'

---
### Downloading Specs
Let's download the spec files. You may choose to modify/rewrite these specs, or even individually override them through the launcher. You can download the default spec files by using the `download_specs` command. <br>

The `-o` argument indicates the folder where the default specification files will be downloaded. The `-r` argument instructs the script on where to save the logs. **Ensure the `-o` argument points to an empty folder.**

In [None]:
if not os.path.exists(os.path.join(LM_ARTIFACTS, "specs/n_gram")):
    print("Downloading Specs")
    !tao n_gram download_specs \
        -r $RESULTS_DIR \
        -o $SPECS_DIR
else:
    print("n_gram .yaml specs already exist. If you want to re-download, please remove the contents of this directory.")

---
### Data Convert


In preparation for training/fine-tuning, we need to preprocess the dataset. The `tao n_gram dataset_convert` command can be used in conjunction with the appropriate configuration in the spec file. Here is the sample `dataset_convert.yaml` spec file we use:
```
# Dataset. Available options: [assistant]
dataset_name: assistant

# Extension of the files containing in dataset
extension: ???

# Path to the folder containing the dataset source files.
source_data_dir: ???

# Path to the output folder.
target_data_file: ???

```
Take a look at the `.yaml` spec files we provide.
As we show below, you can override the `source_data_dir` and `target_data_file` options with the appropriate paths.

In [None]:
# Preprocess training data (LibriSpeech LM Normalized)
!tao n_gram dataset_convert \
            -e $SPECS_DIR/dataset_convert.yaml \
            -r $RESULTS_DIR/dataset_convert \
            extension=*.txt \
            source_data_dir=$DATA_DIR/reduced_training.txt \
            target_data_file=$DATA_DIR/preprocessed.txt

# Preprocess evaluation data (LibriSpeech dev-clean)
!tao n_gram dataset_convert \
            -e $SPECS_DIR/dataset_convert.yaml \
            -r $RESULTS_DIR/dataset_convert \
            extension=*.txt \
            source_data_dir=$DATA_DIR/text/dev-clean.txt \
            target_data_file=$DATA_DIR/preprocessed_dev_clean.txt

These commands preprocess the training and evaluation datasets with basic text preprocessing, including lowercasing, normalization, and punctuation removal, and write the results to `preprocessed.txt` (training) and `preprocessed_dev_clean.txt` (evaluation). In both files, each preprocessed sentence is on its own line.
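
As a rough illustration of this kind of text cleanup, the sketch below lowercases a line, replaces punctuation with spaces, and collapses whitespace. The function `normalize_line` and its exact rules (for example, keeping apostrophes) are assumptions made for this example and may differ from what `dataset_convert` does internally.

```python
import re
import string


def normalize_line(line):
    """Lowercase, strip punctuation, and collapse whitespace (illustrative only)."""
    line = line.lower()
    # Replace every ASCII punctuation character except the apostrophe with a space
    punctuation = string.punctuation.replace("'", "")
    line = line.translate(str.maketrans(punctuation, " " * len(punctuation)))
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", line).strip()


print(normalize_line("Hello, World!  It's 7:30."))
```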

In [None]:
# Peek into the preprocessed dataset
!head -n 5 $DATA_DOWNLOAD_DIR/preprocessed.txt

---
<a id='isc-training'></a>
### Training / Fine-tuning


Training a model using TAO Toolkit is as simple as configuring your spec file and running the train command. The following code uses the `train.yaml` spec file available to you as reference. The spec file configurations can easily be overridden using the `tao-launcher` CLI. For example, below we override `model.order`, `model.pruning` and `training_ds.data_file` configurations to suit our needs. <br>
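
To give an intuition for a pruning setting such as `[0,0,1]`, the sketch below counts n-grams up to order 3 and drops, for each order, the n-grams whose count is less than or equal to that order's threshold. It works on raw counts over a made-up corpus, and `count_ngrams` and `prune` are hypothetical helpers; the real trainer may prune on adjusted counts rather than raw ones.

```python
from collections import Counter


def count_ngrams(sentences, order):
    """Count all n-grams of length 1..order in whitespace-tokenized sentences."""
    counts = {n: Counter() for n in range(1, order + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, thresholds):
    """Keep only n-grams whose count is strictly above the threshold for their order."""
    pruned = {}
    for n, table in counts.items():
        threshold = thresholds[n - 1]  # one threshold per n-gram order
        pruned[n] = Counter({gram: c for gram, c in table.items() if c > threshold})
    return pruned


sentences = ["the cat sat", "the cat sat", "the cat ran"]
counts = count_ngrams(sentences, order=3)
pruned = prune(counts, thresholds=[0, 0, 1])

for n in sorted(pruned):
    print(n, len(counts[n]), "->", len(pruned[n]))
```

With thresholds `[0, 0, 1]`, all unigrams and bigrams are kept, and only trigrams seen once are removed, which shrinks the model at the cost of the rarest context.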

In [None]:
TRAIN_YAML = os.path.join(LM_ARTIFACTS, "specs", "n_gram", "train.yaml")
!cat $TRAIN_YAML

Here are some parameters specific to `n_gram` training that you can modify or add: <br>

| Parameter                 | Data Type   | Default | Description |
| -----------               | ----------- |-------- |-----------  |
| training_ds.data_file     | string      | -       |Path to dataset file. |
| model.order               | int         | -       | Order of N-Gram model (maximum number of grams) |
| vocab_file                | string      | -       | Optional path to vocab file to limit vocabulary learned by model. |
| model.intermediate        | boolean     | true    | Choose from [true,false]. If True, creates intermediate file - required for finetune and interpolate |
| model.pruning             | list[int]   | [0]     | Prune grams with counts less than or equal to the threshold provided for each gram. Non-decreasing. Starts with 0 |
| export_to                 | string      | -       | The path to the trained .tlt model |



For training an N-gram language model in TAO Toolkit, we use the `tao n_gram train` command with the following general TAO arguments:
- `-e`: Path to the spec file
- `-k`: User specified encryption key to use while saving/loading the model
- `-r`: Path to a folder where the outputs should be written. Ensure this is mapped in the `~/.tao_mounts.json` file.
- Any overrides to the spec file. For example, `model.order`.
<br>


For more information about these arguments, refer to the [TAO Toolkit Getting Started Guide](https://docs.nvidia.com/tao/tao-toolkit/text/overview.html). <br>
`Note:` All file paths refer to the destination (mounted) directories that are visible inside the TAO Toolkit Docker container used in the backend.<br>
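
The relationship between host paths and container paths can be sketched as a prefix substitution over the `Mounts` entries. The helper `to_container_path` and the `/home/user/...` host paths below are hypothetical and only illustrate the mapping on a POSIX system; the launcher performs the actual mounting through Docker.

```python
import os


def to_container_path(host_path, mounts):
    """Map a host path to its path inside the container, or return None if unmapped."""
    host_path = os.path.abspath(host_path)
    for mount in mounts:
        source = os.path.abspath(mount["source"])
        if host_path == source:
            return mount["destination"]
        if host_path.startswith(source + os.sep):
            relative = os.path.relpath(host_path, source)
            # Container paths always use forward slashes
            return mount["destination"].rstrip("/") + "/" + relative.replace(os.sep, "/")
    return None


# Same structure as the "Mounts" list in ~/.tao_mounts.json (example host paths)
mounts = [
    {"source": "/home/user/lm-pretraining-artifacts/data", "destination": "/data"},
    {"source": "/home/user/lm-pretraining-artifacts/results", "destination": "/results"},
]

print(to_container_path("/home/user/lm-pretraining-artifacts/data/preprocessed.txt", mounts))
print(to_container_path("/tmp/not-mounted.txt", mounts))
```

A result of `None` signals that the file lives outside every mounted directory, so a `tao` command would not be able to see it.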

In [None]:
# Here, we train a 3-gram model
!tao n_gram train \
            -e $SPECS_DIR/train.yaml \
            -r $RESULTS_DIR/train \
            training_ds.data_file=$DATA_DIR/preprocessed.txt \
            model.order=3 \
            model.pruning=[0,0,1]

The train command saves its results at `$RESULTS_DIR/train/checkpoints`, including three files called `train_n_gram.arpa`, `train_n_gram.vocab`, and `train_n_gram.kenlm_intermediate`.
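
An ARPA file is plain text that starts with a `\data\` header listing how many n-grams of each order the model contains. The sketch below reads those counts from the header, which is a quick way to confirm the order of a trained model. The function `arpa_ngram_counts` is a hypothetical helper, and the sample lines are made up rather than taken from a real training run.

```python
import re


def arpa_ngram_counts(lines):
    """Return {order: count} parsed from the data header of an ARPA file."""
    counts = {}
    in_header = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header:
            match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line.startswith("\\"):
                break  # reached the first "\N-grams:" section
    return counts


# Made-up ARPA fragment for illustration
sample = [
    "\\data\\",
    "ngram 1=4",
    "ngram 2=3",
    "",
    "\\1-grams:",
    "-0.6\tthe\t-0.3",
    "\\end\\",
]
print(arpa_ngram_counts(sample))
```

To inspect a real model, pass an open file handle instead of the sample list, for example `with open(path) as f: arpa_ngram_counts(f)`, where `path` points to `train_n_gram.arpa` on the host.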

In [None]:
# Check the generated artifacts at Local path
!ls -ltr $LM_ARTIFACTS/results/n_gram/train/checkpoints

---
<a id='evaluation'></a>
### Evaluation
The evaluation spec `.yaml` is as simple as:

```
# Name of the `.arpa` or `.binary` file where the trained model will be restored from.
restore_from: ???

test_ds:
  data_file: ???
  
```

In [None]:
!tao n_gram evaluate \
     -e $SPECS_DIR/evaluate.yaml \
     -r $RESULTS_DIR/evaluate \
     restore_from=$RESULTS_DIR/train/checkpoints/train_n_gram.arpa \
     test_ds.data_file=$DATA_DIR/preprocessed_dev_clean.txt

The output of the evaluation gives us the **perplexity** of the N-gram language model on the evaluation (LibriSpeech dev-clean) dataset.

#### A note on perplexity

Language models are typically evaluated not using raw probabilities, but with the [perplexity](https://en.wikipedia.org/wiki/Perplexity) metric. The perplexity (PP) of a language model on a test set is the inverse probability of the test set, normalized by the number of words. This normalization matters because datasets differ in the number of sentences and in the number of words per sentence. Without normalization, a larger test set would have a lower probability simply because it contains more words. Thanks to the normalization, perplexity can be compared across test sets of different sizes.

For a given test set $W = w_1 w_2 \ldots w_N$, <br> 
    perplexity is defined as <br>
$PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}$

Because of the inverse probability, the higher the probability of a sentence, the lower the perplexity. Therefore, with perplexity, lower is better. 
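
In practice, the probability of the test set is a product of per-word conditional probabilities, so perplexity is usually computed from their logarithms: $PP(W) = 10^{-\frac{1}{N}\sum_{i=1}^{N} \log_{10} P(w_i \mid w_1 \ldots w_{i-1})}$. The sketch below applies this formula to a list of per-word log10 probabilities. The function name `perplexity` and the example values are illustrative assumptions, not output of the `tao n_gram evaluate` command.

```python
import math


def perplexity(word_log10_probs):
    """Perplexity from per-word log10 probabilities: 10 ** (-sum / N)."""
    n = len(word_log10_probs)
    return 10 ** (-sum(word_log10_probs) / n)


# A model that assigns probability 0.1 to each of 4 words
uniform = [math.log10(0.1)] * 4
# A model that is more confident about the same 4 words
confident = [math.log10(0.5)] * 4

print(perplexity(uniform))
print(perplexity(confident))
```

A model that assigns probability 0.1 to every word has a perplexity of about 10, which matches the intuition that it is as uncertain as a uniform choice among 10 words. The more confident model has a perplexity of about 2.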

---
<a id='isc-inference'></a>
### Inference
Now, we run inference with a trained `.arpa` or `.binary` model using the `tao n_gram infer` command.  <br>
The `infer.yaml` is also very simple, and we can directly give inputs for the model to run inference.
```
# "Simulate" user input:
input_batch:
  - 'set alarm for seven thirty am'
  - 'lower volume by fifty percent'
  - 'what is my schedule for tomorrow'

restore_from: ???

```

In [None]:
!tao n_gram infer \
            -e $SPECS_DIR/infer.yaml \
            -r $RESULTS_DIR/infer \
            restore_from=$RESULTS_DIR/train/checkpoints/train_n_gram.arpa

This command returns the **log likelihood**, **perplexity**, and all n-grams for each of the input sequences that users provided.

---
<a id='isc-export-riva'></a>
### Export to Riva

With TAO Toolkit, you can also export your model in a format that can be deployed using [NVIDIA Riva](https://developer.nvidia.com/riva). The export command converts the trained language model from `.arpa` to `.binary`, with the option of quantizing the model binary. We set `export_format` to `RIVA` to create a `.riva` file that contains the language model binary and its corresponding vocabulary.

`NOTE:` More information about the different arguments can be found in the [TAO documentation](https://docs.nvidia.com/tao/tao-toolkit/text/lm/n_gram.html?highlight=binary_q#model-export)

In [None]:
!tao n_gram export \
            -e $SPECS_DIR/export.yaml \
            -r $RESULTS_DIR/export \
            export_format=RIVA \
            export_to=exported-model.riva \
            restore_from=$RESULTS_DIR/train/checkpoints/train_n_gram.arpa \
            binary_type=trie \
            binary_q_bits=8 \
            binary_b_bits=7 \
            binary_a_bits=256

The model is exported as `exported-model.riva`, which is in a format suited for deployment in Riva.

---
### What's Next?

Deploying the exported n-gram language model in the speech recognition pipeline follows steps similar to those in `1_deploy-speech-recognition-pipeline.ipynb`. Redeploying is not part of this tutorial. <br>

For reference, you can point the `--decoding_language_model_binary` argument of `riva-build` to your freshly exported language model.

After `riva-build` and `riva-deploy`, you can follow the rest of the tutorial `1_deploy-speech-recognition-pipeline.ipynb` to deploy your newly generated language model along with the downloaded acoustic model.