[Home Page](../START_HERE_RIVA_HACKATHON.ipynb)

[1]
[2](punctuation-and-capitalization-deployment.ipynb)
[3](punctuation-and-capitalization-Exercise.ipynb)
[Next Notebook](punctuation-and-capitalization-deployment.ipynb)

# Punctuation And Capitalization using Transfer Learning Toolkit

The goal of the Transfer Learning Toolkit (TLT) is to reduce an 80-hour workload to an 8-hour workload, which lets data scientists run considerably more train-test iterations in the same time frame.

In this notebook, you will learn how to leverage the simplicity and convenience of TLT to:
- Take a BERT model and __Train/Finetune__ it on a dataset for the Punctuation and Capitalization task
- Run __Inference__
- __Export__ the model to the ONNX format, or to a format suitable for deployment in Riva

The first part of this notebook gives a brief introduction to the Punctuation and Capitalization task and the dataset being used.

## Prerequisites

Please follow the steps shown in the setup notebook to install the TLT package.

In [None]:
# Check that TLT is installed and available
!tlt info --verbose

Check if the GPU device(s) are available.

In [None]:
!nvidia-smi

## Punctuation And Capitalization using TLT 
### Task Description

Automatic Speech Recognition (ASR) systems typically generate text without punctuation or capitalization. This tutorial explains how to implement a model that predicts punctuation and capitalization for each word in a sentence, which makes ASR output more readable and boosts the performance of downstream named entity recognition, machine translation, or text-to-speech models. We'll show how to train a model for this task using a pre-trained BERT model. For every word in our training dataset we’re going to predict:

- punctuation mark that should follow the word 
- whether the word should be capitalized

---
### TLT Workflow
### Setting TLT Mounts

Once TLT is installed, the next step is to set up the directory mounts, that is, the ``source`` and ``destination`` of each mount. The ``source_mount`` stores all the data, pre-trained models and scripts pertaining to the Punctuation and Capitalization task. Please select an empty directory for ``source_mount``. 
The ``destination_mount`` is the directory inside the container to which ``source_mount`` is mapped.

The TLT launcher uses a Docker container under the hood, so the data and results directories must be mapped for them to be visible inside the container.

The launcher can be configured using the config file ``.tlt_mounts.json``. Apart from the mounts, you can also configure additional options such as environment variables and the amount of shared memory available to the TLT launcher.

``Important Note:`` The code below creates a sample ``.tlt_mounts.json`` file. Here, we can map directories in which we save the data, specs, results and cache.  You must configure it for your specific case to make sure both your data and results are correctly mapped to the docker. **Please also ensure that the source directories exist on your machine!**

In [None]:
%%bash
tee ~/.tlt_mounts.json <<'EOF' 
{
   "Mounts":[
       {
           "source": "<YOUR_PATH_TO_DATA_DIR>",
           "destination": "/data"
       },
       {
           "source": "<YOUR_PATH_TO_SPECS_DIR>",
           "destination": "/specs"
       },
       {
           "source": "<YOUR_PATH_TO_RESULTS_DIR>",
           "destination": "/results"
       },
       {
           "source": "<YOUR_PATH_TO_CACHE_DIR>",
           "destination": "/root/.cache"
       }
   ]
}
EOF

In [None]:
# Make sure the source directories exist; create them if they do not
! mkdir -p <YOUR_PATH_TO_DATA_DIR>
! mkdir -p <YOUR_PATH_TO_SPECS_DIR>
! mkdir -p <YOUR_PATH_TO_RESULTS_DIR>
! mkdir -p <YOUR_PATH_TO_CACHE_DIR>
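
The same configuration can also be produced from Python, which creates the host directories and writes the JSON in one step. The following is a minimal sketch, not part of TLT: the helper name `write_tlt_mounts` and the throwaway demo paths are hypothetical, and the file layout simply mirrors the sample ``.tlt_mounts.json`` shown above.

```python
import json
import tempfile
from pathlib import Path


def write_tlt_mounts(mounts, config_path):
    """Create each host directory and write a mounts file in the layout shown above.

    `mounts` maps a host (source) directory to a container (destination) path.
    """
    entries = []
    for source, destination in mounts.items():
        source_path = Path(source).expanduser().resolve()
        # Create the host directory if it does not exist yet.
        source_path.mkdir(parents=True, exist_ok=True)
        entries.append({"source": str(source_path), "destination": destination})

    config = {"Mounts": entries}
    Path(config_path).write_text(json.dumps(config, indent=4))
    return config


# Demonstration in a throwaway directory. For real use, pass your own host
# paths and Path.home() / ".tlt_mounts.json" as the config path.
with tempfile.TemporaryDirectory() as workspace:
    root = Path(workspace)
    demo_config = write_tlt_mounts(
        {
            str(root / "data"): "/data",
            str(root / "specs"): "/specs",
            str(root / "results"): "/results",
            str(root / "cache"): "/root/.cache",
        },
        root / "tlt_mounts.json",
    )
    print(json.dumps(demo_config, indent=4))
```

Because the directories are created before the file is written, a typo in a host path shows up as an unexpected new folder rather than as a missing mount inside the container.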

The rest of the notebook exemplifies the simplicity of the TLT workflow. 
Users with any level of Deep Learning knowledge can get started building their own custom models using a simple specification file. It's essentially just one command each to run data download and preprocessing, training, fine-tuning, evaluation, inference, and export. 
All configuration happens through YAML specification files.

---
### Configuration/Specification Files

The configuration for every TLT command lives in a YAML specification file. Sample specification files are already available for you to use directly or as a reference when creating your own.

Through these specification files, you can tune many things, such as the model, dataset, hyperparameters and optimizer.

Each command (like download_and_convert, train, finetune, evaluate etc.) should have a dedicated specification file with configurations pertinent to it.

Here is an example of the training spec file:

```yaml
save_to: trained-model.tlt

trainer:
  max_epochs: 5
  
model:
  punct_label_ids:
    O: 0
    ',': 1
    '.': 2
    '?': 3

  capit_label_ids:
    O: 0
    U: 1 

  tokenizer:
      tokenizer_name: ${model.language_model.pretrained_model_name} # or sentencepiece
      vocab_file: null # path to vocab file 
      tokenizer_model: null # only used if tokenizer is sentencepiece
      special_tokens: null

  language_model:
    pretrained_model_name: bert-base-uncased
    lm_checkpoint: null
    config_file: null # json file, precedence over config
    config: null 

  punct_head:
    punct_num_fc_layers: 1
    fc_dropout: 0.1
    activation: 'relu'
    use_transformer_init: true

  capit_head:
    capit_num_fc_layers: 1
    fc_dropout: 0.1
    activation: 'relu'
    use_transformer_init: true

# Data dir containing dataset.
data_dir: ???

training_ds:
  text_file: text_train.txt
  labels_file: labels_train.txt
  shuffle: true
  num_samples: -1 # number of samples to be considered, -1 means the whole dataset
  batch_size: 64

validation_ds:
  text_file: text_dev.txt
  labels_file: labels_dev.txt
  shuffle: false
  num_samples: -1 # number of samples to be considered, -1 means the whole dataset
  batch_size: 64

optim:
  name: adam
  lr: 1e-5
  weight_decay: 0.00

  sched:
    name: WarmupAnnealing
    # Scheduler params
    warmup_steps: null
    warmup_ratio: 0.1
    last_epoch: -1

    # pytorch lightning args
    monitor: val_loss
    reduce_on_plateau: false 
```
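
To make the two label maps in this spec concrete: the model has one classification head per map, and each head predicts an integer id for every word. The sketch below is an illustration only, not TLT code. It inverts the `punct_label_ids` and `capit_label_ids` maps from the spec above and joins the two predictions into the two-symbol labels used in the dataset files. The function name `ids_to_labels` and the sample ids are hypothetical.

```python
# Label maps copied from the training spec above.
PUNCT_LABEL_IDS = {"O": 0, ",": 1, ".": 2, "?": 3}
CAPIT_LABEL_IDS = {"O": 0, "U": 1}


def ids_to_labels(punct_ids, capit_ids):
    """Turn per-word class ids from both heads into labels such as 'OU' or '?O'."""
    if len(punct_ids) != len(capit_ids):
        raise ValueError("Both heads must produce one prediction per word")
    id_to_punct = {idx: symbol for symbol, idx in PUNCT_LABEL_IDS.items()}
    id_to_capit = {idx: symbol for symbol, idx in CAPIT_LABEL_IDS.items()}
    return [id_to_punct[p] + id_to_capit[c] for p, c in zip(punct_ids, capit_ids)]


# Hypothetical predictions for a three-word sentence.
print(ids_to_labels([0, 1, 3], [1, 0, 0]))  # ['OU', ',O', '?O']
```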

---
### Set Relevant Paths
Please set these paths according to your environment.

In [None]:
# NOTE: The following paths are set from the perspective of the TLT Docker. 

# The data is saved here
DATA_DIR='/data'

# The configuration files are stored here
SPECS_DIR='/specs/punctuation_and_capitalization'

# The results are saved at this path
RESULTS_DIR='/results/punctuation_and_capitalization'

# Set your encryption key, and use the same key for all commands
KEY='tlt_encode'

---
### Downloading Specs
We can proceed to downloading the spec files. The user may choose to modify/rewrite these specs, or even individually override them through the launcher. You can download the default spec files by using the `download_specs` command. <br>

The `-o` argument indicates the folder where the default specification files will be downloaded, and `-r` tells the script where to save the logs. **Make sure `-o` points to an empty folder!**

In [None]:
!tlt punctuation_and_capitalization download_specs \
    -r $RESULTS_DIR/ \
    -o $SPECS_DIR/

---
### Dataset 

This model can work with any dataset as long as it follows the format specified below. The training and evaluation data is divided into _2 files: text.txt and labels.txt_. Each line of the __text.txt__ file contains text sequences, where words are separated with spaces: [WORD] [SPACE] [WORD] [SPACE] [WORD], for example:

when is the next flight to new york<br>
the next flight is ...<br>
...<br>

The __labels.txt__ file contains corresponding labels for each word in text.txt, the labels are separated with spaces. Each label in labels.txt file consists of 2 symbols:

- the first symbol of the label indicates what punctuation mark should follow the word (where O means no punctuation needed);
- the second symbol determines if a word needs to be capitalized or not (where U indicates that the word should be upper cased, and O means no capitalization is needed).

In this tutorial, we consider only commas, periods, and question marks; all other punctuation marks were removed. To use more punctuation marks, update the dataset to include the desired labels; no changes to the model are needed.

Each line of the __labels.txt__ file should follow the format: [LABEL] [SPACE] [LABEL] [SPACE] [LABEL]. For example, labels for the above text.txt file should be:

OU OO OO OO OO OO OU ?U<br>
OU OO OO OO ...<br>
...

The complete list of all possible labels for this task used in this tutorial is: OO, ,O, .O, ?O, OU, ,U, .U, ?U.
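
The format above can be sketched in a few lines of Python. This is a hedged illustration rather than the conversion code TLT uses: the function names are hypothetical, the sketch treats `U` as "capitalize the first letter", it only recognizes a mark attached to the end of a word, and it drops any token that consists solely of a punctuation mark.

```python
PUNCTUATION = {",", ".", "?"}  # the three marks used in this tutorial


def sentence_to_example(sentence):
    """Split a punctuated sentence into a lowercase text line and a labels line."""
    words, labels = [], []
    for token in sentence.split():
        punct = "O"
        if token[-1] in PUNCTUATION:
            punct, token = token[-1], token[:-1]
        if not token:
            continue  # skip tokens that were only a punctuation mark
        capit = "U" if token[0].isupper() else "O"
        words.append(token.lower())
        labels.append(punct + capit)
    return " ".join(words), " ".join(labels)


def apply_labels(text, labels):
    """Rebuild a readable sentence from a text line and its labels line."""
    restored = []
    for word, label in zip(text.split(), labels.split()):
        punct, capit = label[0], label[1]
        if capit == "U":
            word = word.capitalize()
        if punct != "O":
            word += punct
        restored.append(word)
    return " ".join(restored)


text, labels = sentence_to_example("When is the next flight to New York?")
print(text)    # when is the next flight to new york
print(labels)  # OU OO OO OO OO OO OU ?U
print(apply_labels(text, labels))  # When is the next flight to New York?
```

The second function is what makes the labels useful in practice: applying predicted labels to raw ASR output restores a readable sentence.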

### Download and preprocess the data 

In this notebook we are going to use a subset of English examples from the [Tatoeba collection of sentences](https://tatoeba.org/eng). 

Downloading and preprocessing the data using TLT is as simple as configuring a YAML specification file and running the ``download_and_convert_tatoeba`` command. The code cell below uses the default `download_and_convert_tatoeba.yaml` available for the users as a reference. 

The configurations in the specification file can be easily overridden using the tlt-launcher CLI as shown below. For instance, we override the ``source_data_dir`` and ``target_data_dir`` configurations.

We encourage you to take a look at the YAML files we have provided.

After executing the cell below, your data folder will contain the following 4 files:
- labels_dev.txt
- labels_train.txt
- text_dev.txt
- text_train.txt

In [None]:
### To download and convert the dataset
!tlt punctuation_and_capitalization download_and_convert_tatoeba \
    -e $SPECS_DIR/download_and_convert_tatoeba.yaml \
    -r $RESULTS_DIR/download_and_convert_tatoeba \
    source_data_dir=$DATA_DIR \
    target_data_dir=$DATA_DIR
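
After the conversion, it can be worth checking that each text file and its labels file line up, since the format requires exactly one label per word. The sketch below is an assumption-laden helper, not a TLT command: `check_pair` is a hypothetical name, and the demo runs on a tiny throwaway file pair rather than on the real dataset.

```python
import tempfile
from pathlib import Path


def check_pair(text_path, labels_path):
    """Return a list of problems found in a text/labels file pair (empty if none)."""
    text_lines = Path(text_path).read_text(encoding="utf-8").splitlines()
    label_lines = Path(labels_path).read_text(encoding="utf-8").splitlines()
    problems = []
    if len(text_lines) != len(label_lines):
        problems.append(f"{len(text_lines)} text lines vs {len(label_lines)} label lines")
    for number, (text, labels) in enumerate(zip(text_lines, label_lines), start=1):
        if len(text.split()) != len(labels.split()):
            problems.append(f"line {number}: word and label counts differ")
    return problems


# Demonstration on a throwaway pair whose second line is deliberately broken.
with tempfile.TemporaryDirectory() as demo_dir:
    text_file = Path(demo_dir) / "text_demo.txt"
    labels_file = Path(demo_dir) / "labels_demo.txt"
    text_file.write_text("when is the next flight to new york\nthank you\n", encoding="utf-8")
    labels_file.write_text("OU OO OO OO OO OO OU ?U\nOU\n", encoding="utf-8")
    demo_problems = check_pair(text_file, labels_file)
    print(demo_problems)  # ['line 2: word and label counts differ']
```

To check the real files, call `check_pair` on `text_train.txt` and `labels_train.txt` (and the `dev` pair) using the host-side directory that is mounted as `/data`, because this code runs outside the container.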

---
### Training 

In the Punctuation and Capitalization Model, we are jointly training two token-level classifiers on top of the pretrained [BERT](https://arxiv.org/pdf/1810.04805.pdf) model:

- one classifier to predict punctuation, and
- another to predict capitalization.

Training a model using TLT is as simple as configuring your spec file and running the train command. The code cell below uses the default train.yaml available for users as reference. It is configured by default to use the ``bert-base-uncased`` pretrained model. Additionally, these configurations could easily be overridden using the tlt-launcher CLI as shown below. For instance, below we override the trainer.max_epochs, training_ds.num_samples and validation_ds.num_samples configurations to suit our needs. We encourage you to take a look at the .yaml spec files we provide!

The command for training is very similar to that of ``download_and_convert_tatoeba``. Instead of ``tlt punctuation_and_capitalization download_and_convert_tatoeba``, we use ``tlt punctuation_and_capitalization train``. The ``tlt punctuation_and_capitalization train`` command has the following arguments:

- ``-e`` : Path to the spec file
- ``-g`` : Number of GPUs to use
- ``-k`` : User specified encryption key to use while saving/loading the model
- ``-r`` : Path to the folder where the outputs should be written. Make sure this is mapped in ``.tlt_mounts.json``
- Any overrides to the spec file, e.g. ``trainer.max_epochs``
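
Conceptually, each `key=value` override addresses a nested entry of the spec file by its dotted path. The following sketch illustrates that idea in plain Python. It is not the launcher's implementation, which this notebook does not show; `apply_overrides` and `parse_value` are hypothetical helpers, and the value parsing is deliberately simplistic.

```python
import copy


def parse_value(raw):
    """Interpret a command-line string as bool, int, float, or plain string."""
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(spec, overrides):
    """Return a copy of `spec` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(spec)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down the nested dictionaries, creating levels as needed.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return result


# A small excerpt of the training spec shown earlier.
spec = {"trainer": {"max_epochs": 5}, "training_ds": {"num_samples": -1, "batch_size": 64}}
updated = apply_overrides(spec, ["trainer.max_epochs=2", "data_dir=/data"])
print(updated["trainer"]["max_epochs"])  # 2
print(updated["data_dir"])               # /data
print(spec["trainer"]["max_epochs"])     # 5 (the original spec is untouched)
```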

More details about these arguments are present in the  [TLT Getting Started Guide](https://docs.nvidia.com/tlt/tlt-user-guide/index.html).

``NOTE:`` All file paths refer to the destination (mounted) directories that are visible inside the TLT Docker container used in the backend.
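
If it is unclear which path to pass to a command, the mapping can be worked out from the mounts file. The sketch below is a hypothetical helper, not part of TLT, and the host directories in the example are made up. It translates a host path into the path the container sees, using entries in the same layout as ``.tlt_mounts.json``.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, mounts):
    """Map a host path to the path seen inside the container, or None if unmapped."""
    host = PurePosixPath(host_path)
    for mount in mounts:
        source = PurePosixPath(mount["source"])
        try:
            relative = host.relative_to(source)
        except ValueError:
            continue  # host_path is not under this mount
        return str(PurePosixPath(mount["destination"]) / relative)
    return None


# Hypothetical host directories, in the layout of ~/.tlt_mounts.json.
mounts = [
    {"source": "/home/user/tlt/data", "destination": "/data"},
    {"source": "/home/user/tlt/results", "destination": "/results"},
]
print(to_container_path("/home/user/tlt/data/text_train.txt", mounts))  # /data/text_train.txt
print(to_container_path("/tmp/elsewhere.txt", mounts))  # None
```

A result of `None` means the file is not visible to the container at all, which is a common cause of "file not found" errors when running the commands below.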

In [None]:
### To train the dataset with BERT-base-uncased model
!tlt punctuation_and_capitalization train \
    -e $SPECS_DIR/train.yaml \
    -g 4 \
    -r $RESULTS_DIR/train \
    data_dir=$DATA_DIR \
    trainer.max_epochs=2 \
    training_ds.num_samples=-1  \
    validation_ds.num_samples=-1 \
    -k $KEY

The train command produces a .tlt file called ``trained-model.tlt`` saved at ``$RESULTS_DIR/train/checkpoints/trained-model.tlt``.

### Other tips and tricks:

- To accelerate the training without loss of quality, it is possible to train with these parameters: ``trainer.amp_level="O1"`` and ``trainer.precision=16`` for reduced precision.
- The batch size ``training_ds.batch_size`` may influence the validation accuracy. Larger batch sizes are faster to train with, however, you may get slightly better results with smaller batches.
- You can also change the optimizer parameter ``optim.name`` and can see its effect on the punctuation and capitalization task by the change in accuracy.
- You can specify the number of layers in the head of the model ``model.punct_head.punct_num_fc_layers`` and ``model.capit_head.capit_num_fc_layers``.

---
### Finetuning

As stated above, the commands for all tasks are very similar, but each has its own YAML specification file that can be tweaked.

Note: If you wish to start from an already trained model for better inference results, you can find a .nemo model [here](https://ngc.nvidia.com/catalog/collections/nvidia:nemotrainingframework).

Simply rename the .nemo file to .tlt and pass it through the finetune pipeline.

In [None]:
### To finetune on the dataset
!tlt punctuation_and_capitalization finetune \
    -e $SPECS_DIR/finetune.yaml \
    -g 4 \
    -m $RESULTS_DIR/train/checkpoints/trained-model.tlt \
    -r $RESULTS_DIR/finetune \
    data_dir=$DATA_DIR \
    trainer.max_epochs=3 \
    -k $KEY

The finetune command produces a .tlt file called ``finetuned-model.tlt`` saved at ``$RESULTS_DIR/finetune/checkpoints/finetuned-model.tlt``.

---
### Evaluation

To evaluate our TLT model, we run the command below. It is advisable to read ``evaluate.yaml`` first to understand what the command does.

In [None]:
### For evaluation
!tlt punctuation_and_capitalization evaluate \
    -e $SPECS_DIR/evaluate.yaml \
    -g 4 \
    -m $RESULTS_DIR/finetune/checkpoints/finetuned-model.tlt \
    data_dir=$DATA_DIR \
    -r $RESULTS_DIR/evaluate \
    -k $KEY

Evaluating the model produces a set of results; based on them, you can either retrain the model for more epochs or continue with inference.
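
This notebook does not show which metrics the evaluate command prints. For token-level classification, per-label precision, recall and F1 are a common choice, so the sketch below computes them from scratch as a reference for reading such a report. The function name and the sample labels are hypothetical, and the numbers are not results from this model.

```python
def per_label_scores(gold, predicted):
    """Compute precision, recall and F1 for every label seen in gold or predicted."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)  # correct hits
        fp = sum(g != label and p == label for g, p in pairs)  # false alarms
        fn = sum(g == label and p != label for g, p in pairs)  # misses
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical punctuation labels for six words.
gold = ["O", "O", ",", "O", "O", "?"]
predicted = ["O", ",", ",", "O", "O", "."]
for label, values in per_label_scores(gold, predicted).items():
    print(label, values)
```

Because most words carry the `O` label, overall accuracy can look high even when commas or question marks are predicted poorly, which is why per-label scores are more informative for this task.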

---
### Inference

Inference with a TLT trained and fine-tuned model is done with the ``tlt punctuation_and_capitalization infer`` command. It is again advisable to look at the ``infer.yaml`` file.

In [None]:
### For inference
!tlt punctuation_and_capitalization infer \
    -e $SPECS_DIR/infer.yaml \
    -g 4 \
    -m $RESULTS_DIR/finetune/checkpoints/finetuned-model.tlt \
    -r $RESULTS_DIR/infer \
    -k $KEY

---
### Export to ONNX
[ONNX](https://onnx.ai/) is a popular open format for machine learning models. It enables interoperability between different frameworks, making the path to production much easier. 

TLT provides commands to export the .tlt model to the ONNX format in an .eonnx archive. The `export_format` configuration can be set to `ONNX` to achieve this.

The tlt export command for ``punctuation_and_capitalization`` is shown in the cell below.

In [None]:
### For export to ONNX
!tlt punctuation_and_capitalization export \
    -e $SPECS_DIR/export.yaml \
    -g 1 \
    -m $RESULTS_DIR/finetune/checkpoints/finetuned-model.tlt \
    -r $RESULTS_DIR/export \
    -k $KEY \
    export_format=ONNX

This command exports the model as ``exported-model.eonnx`` which is essentially an archive containing the .onnx model.

---
### Inference using ONNX

TLT provides the capability to use the exported .eonnx model for inference. The command ``tlt punctuation_and_capitalization infer_onnx`` is very similar to the inference command for .tlt models. The input used here is just for demo purposes; you may try your own custom input.

In [None]:
### For inference using ONNX
!tlt punctuation_and_capitalization infer_onnx \
    -e $SPECS_DIR/infer_onnx.yaml \
    -g 1 \
    -m $RESULTS_DIR/export/exported-model.eonnx \
    -r $RESULTS_DIR/infer_onnx \
    -k $KEY

---
### Export to Riva

With TLT, you can also export your model in a format that can be deployed using [NVIDIA Riva](https://developer.nvidia.com/riva), a highly performant application framework for multi-modal conversational AI services using GPUs. The same command used for exporting to ONNX can be used here. The only small variation is the configuration for ``export_format`` in the spec file.

In [None]:
### For export to RIVA
!tlt punctuation_and_capitalization export \
    -e $SPECS_DIR/export.yaml \
    -g 1 \
    -m $RESULTS_DIR/finetune/checkpoints/finetuned-model.tlt \
    -r $RESULTS_DIR/export_riva \
    export_format=JARVIS \
    export_to=punct-capit-model.riva \
    -k $KEY

The model is exported as ``punct-capit-model.riva`` which is in a format suited for deployment in Riva.

---
### What next?

You can use TLT to build custom models for your own NLP applications.