[Home Page](../START_HERE_RIVA_BOOTCAMP.ipynb)

[1]
[2](token-classification-deployment.ipynb)
[3](token-classification-Exercise.ipynb)
[Next Notebook](token-classification-deployment.ipynb)

# TLT - Named Entity Recognition

Transfer Learning Toolkit (TLT) is a Python-based AI toolkit for taking purpose-built pre-trained AI models and customizing them with your own data.

Transfer learning extracts learned features from an existing neural network to a new one. Transfer learning is often used when creating a large training dataset is not feasible. 

Developers, researchers and software partners building intelligent vision AI apps and services can bring their own data to fine-tune pre-trained models instead of going through the hassle of training from scratch.

![Transfer Learning Toolkit](https://developer.nvidia.com/sites/default/files/akamai/embedded-transfer-learning-toolkit-software-stack-1200x670px.png)

The goal of this toolkit is to reduce an 80-hour workload to an 8-hour workload, which lets data scientists run considerably more train-test iterations in the same time frame.

Let's see this in action with a use case for Named Entity Recognition.

---
## Named Entity Recognition

Named entity recognition (NER), also referred to as entity chunking, identification or extraction, is the task of detecting and classifying key information (entities) in text. 

For example, in the sentence "Mary lives in Santa Clara and works at NVIDIA", we should detect that **Mary** is a person, **Santa Clara** is a location and **NVIDIA** is a company.

## Named Entity Recognition using TLT

---
### Dataset Details

In this tutorial we are going to use the [GMB (Groningen Meaning Bank)](http://www.let.rug.nl/bjerva/gmb/about.php) corpus for entity recognition.

GMB is a fairly large corpus with a lot of annotations. Note that GMB is not completely human-annotated and is not considered 100% correct.

The data is labeled using the [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) (short for inside, outside, beginning).

The following classes appear in the dataset:

* LOC = Geographical Entity
* ORG = Organization
* PER = Person
* GPE = Geopolitical Entity
* TIME = Time indicator
* ART = Artifact
* EVE = Event
* NAT = Natural Phenomenon

For this tutorial, the classes ART, EVE, and NAT were combined into a single MISC class because each has only a small number of examples.
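
The class merge described above can be sketched as a small label-mapping step. This is only an illustration: the function name `merge_rare_classes` is hypothetical, and we assume the IOB prefixes `B-` and `I-` are kept while only the class name changes. The downloaded dataset is already preprocessed, so you do not need to run this step yourself.

```python
# Classes that the tutorial folds into MISC because they have few examples.
RARE_CLASSES = {"ART", "EVE", "NAT"}


def merge_rare_classes(label: str) -> str:
    """Map B-/I- labels of ART, EVE and NAT to the matching MISC label."""
    if label == "O":
        return label
    prefix, _, entity_class = label.partition("-")
    if entity_class in RARE_CLASSES:
        return f"{prefix}-MISC"
    return label


# Hypothetical label sequence, used only to show the mapping.
labels = "B-EVE I-EVE O B-PER O B-NAT".split()
print([merge_rare_classes(label) for label in labels])
```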

### Download Data

In [None]:
# IMPORTANT NOTE: Set the path to a folder where you want your data to be saved.
DATA_DOWNLOAD_DIR = "<YOUR_PATH_TO_DATA_DIR>"

In [None]:
! wget https://dldata-public.s3.us-east-2.amazonaws.com/gmb_v_2.2.0_clean.zip

In [None]:
! unzip gmb_v_2.2.0_clean.zip -d $DATA_DOWNLOAD_DIR

Now, the data folder should contain 5 files:

- labels_dev.txt
- labels_train.txt
- text_dev.txt
- text_train.txt
- label_ids.csv

In [None]:
! ls -l $DATA_DOWNLOAD_DIR/gmb_v_2.2.0_clean
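
If you prefer to check the download programmatically rather than by reading the `ls` output, a minimal sketch could look like the following. The helper name `missing_files` is hypothetical; the file names are the five listed above.

```python
from pathlib import Path

# The five files the unzipped dataset folder is expected to contain.
EXPECTED_FILES = [
    "labels_dev.txt",
    "labels_train.txt",
    "text_dev.txt",
    "text_train.txt",
    "label_ids.csv",
]


def missing_files(data_dir):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]
```

Calling `missing_files(Path(DATA_DOWNLOAD_DIR) / "gmb_v_2.2.0_clean")` should return an empty list when the download and unzip steps succeeded.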

In [None]:
# let's take a look at the data 
print('Train text:')
! head -n 5 {DATA_DOWNLOAD_DIR}/gmb_v_2.2.0_clean/text_train.txt

print('\nTrain label:')
! head -n 5 {DATA_DOWNLOAD_DIR}/gmb_v_2.2.0_clean/labels_train.txt

---
### Set Relevant Paths

Once TLT is installed, the next step is to set up the mounts for TLT. <br>

The file `~/.tlt_mounts.json` defines the mounts inside the Docker container, as well as any additional arguments to pass to the `docker run` command. This file is stored in the user's home directory.

In [None]:
!ls -la ~/.tlt_mounts.json

Let's overwrite `~/.tlt_mounts.json` with the mounts needed. Please change the paths under the "source" keys below.
The TLT launcher uses Docker containers under the hood, and **for our data and results directories to be visible to Docker, they need to be mapped**. The launcher is configured through the config file `~/.tlt_mounts.json`. Apart from the mounts, you can also configure additional options such as environment variables and the amount of shared memory available to the TLT launcher. <br>

`IMPORTANT NOTE:` The code below creates a sample `~/.tlt_mounts.json` file. Here, we map the directories in which we save the data, specs, results and cache. You should configure it for your specific case so that these directories are correctly visible to the Docker container. **Please also ensure that the source directories exist on your machine!**


In [None]:
%%writefile ~/.tlt_mounts.json
{
   "Mounts":[
       {
           "source": "<add path to DATA_DIR>",
           "destination": "/data"
       },
       {
           "source": "<add path to SPECS_DIR>",
           "destination": "/specs"
       },
       {
           "source": "<add path to RESULTS_DIR>",
           "destination": "/results"
       },
       {
           "source": "<add path to CACHE_DIR eg. /home/user/.cache>",
           "destination": "/root/.cache"
       }
   ],
   "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         }
   }
}

Each entry under "Mounts" maps a "source" directory on your machine to a "destination" path inside the Docker container, so that the container can access the dataset, specs, results and cache.
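
As an alternative to editing the JSON by hand, the same file can be generated from Python. The sketch below is an assumption-laden illustration: the helper names `build_mounts_config` and `write_mounts_config` and the example host paths are hypothetical, while the destinations and Docker options are copied from the sample file above.

```python
import json
from pathlib import Path


def build_mounts_config(data_dir, specs_dir, results_dir, cache_dir):
    """Build the mounts dictionary using the destinations from the sample file."""
    mapping = [
        (data_dir, "/data"),
        (specs_dir, "/specs"),
        (results_dir, "/results"),
        (cache_dir, "/root/.cache"),
    ]
    return {
        "Mounts": [
            {"source": str(Path(source).expanduser()), "destination": destination}
            for source, destination in mapping
        ],
        "DockerOptions": {
            "shm_size": "16G",
            "ulimits": {"memlock": -1, "stack": 67108864},
        },
    }


def write_mounts_config(config, target):
    """Create the source directories if needed, then write the config as JSON."""
    for mount in config["Mounts"]:
        Path(mount["source"]).mkdir(parents=True, exist_ok=True)
    Path(target).write_text(json.dumps(config, indent=4))


# Hypothetical host paths; replace them with your own directories.
config = build_mounts_config(
    "~/tlt/data", "~/tlt/specs", "~/tlt/results", "~/.cache"
)
print(json.dumps(config, indent=4))
```

To actually install the file, you would call `write_mounts_config(config, Path.home() / ".tlt_mounts.json")`, which also creates any missing source directories.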

In [None]:
# Make sure the source directories exist; -p creates them only if they are missing
! mkdir -p <add path to SPECS_DIR>
! mkdir -p <add path to RESULTS_DIR>
! mkdir -p <add path to CACHE_DIR>

In [None]:
# NOTE: The following paths are set from the perspective of the TLT Docker.

# The data is saved here
DATA_DIR = "/data"
SPECS_DIR = "/specs"
RESULTS_DIR = "/results"

# Set your encryption key, and use the same key for all commands
KEY = 'tlt_encode'

You can check the Docker image versions and the tasks that TLT can perform with `tlt info --verbose`.

In [None]:
! tlt info --verbose

Now that everything is set up, let's take a moment to explain the TLT interface. The command structure can be broken down as follows: `tlt <task name> <subcommand>` <br> 

Let's see this in further detail.

---
### Downloading Specs
TLT's Conversational AI Toolkit works off of spec files, which make it easy to edit hyperparameters on the fly. You can modify or rewrite these specs, or override individual values through the launcher. You can download the default spec files with the `download_specs` command. <br>

The `-o` argument indicates the folder where the default specification files will be downloaded, and `-r` tells the script where to save the logs. **Make sure `-o` points to an empty folder!**

In [None]:
! tlt token_classification download_specs \
    -r $RESULTS_DIR/token_classification \
    -o $SPECS_DIR/token_classification

---
### Data pre-processing

[TokenClassification Model in NeMo](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/token_classification/token_classification_model.py) supports NER and other token level classification tasks, as long as the data follows the format specified below. 

Token Classification Model requires the data to be split into 2 files: 
* text.txt and 
* labels.txt. 

Each line of the **text.txt** file contains text sequences, where words are separated with spaces, i.e.: 
[WORD] [SPACE] [WORD] [SPACE] [WORD].

The **labels.txt** file contains corresponding labels for each word in text.txt, the labels are separated with spaces, i.e.:
[LABEL] [SPACE] [LABEL] [SPACE] [LABEL].

Example of a text.txt file:
```
Jennifer is from New York City .
She likes ...
...
```
Corresponding labels.txt file:
```
B-PER O O B-LOC I-LOC I-LOC O
O O ...
...
```
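
To make the relationship between the two files concrete, here is a minimal sketch that pairs one line of text.txt with the matching line of labels.txt and groups the IOB tags into entities. The function name `decode_entities` is hypothetical and not part of TLT or NeMo; it also checks that both lines contain the same number of items, which the format requires.

```python
def decode_entities(text_line, labels_line):
    """Group IOB-tagged tokens into (entity text, class) pairs."""
    tokens = text_line.split()
    labels = labels_line.split()
    if len(tokens) != len(labels):
        raise ValueError(
            f"{len(tokens)} tokens but {len(labels)} labels: the lines are misaligned"
        )

    entities = []
    current_tokens, current_class = [], None
    for token, label in zip(tokens, labels):
        prefix, _, entity_class = label.partition("-")
        # A B- tag, or an I- tag of a different class, starts a new entity.
        starts_new = prefix == "B" or (prefix == "I" and entity_class != current_class)
        if current_tokens and (label == "O" or starts_new):
            entities.append((" ".join(current_tokens), current_class))
            current_tokens, current_class = [], None
        if label != "O":
            current_tokens.append(token)
            current_class = entity_class
    if current_tokens:
        entities.append((" ".join(current_tokens), current_class))
    return entities


print(decode_entities(
    "Jennifer is from New York City .",
    "B-PER O O B-LOC I-LOC I-LOC O",
))
# → [('Jennifer', 'PER'), ('New York City', 'LOC')]
```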

To convert IOB-format data to the format required for training, you can run the `dataset_convert` command in TLT on your train and dev files.

For this tutorial, we are using the preprocessed GMB dataset, thus we won't be required to convert the dataset.

---
### Train the NER model from scratch using pre-trained BERT base uncased model

Our Named Entity Recognition model is comprised of the pretrained BERT model followed by a Token Classification layer.

The model is defined in a config file. Sample config files are already available for you to use directly or as a reference for creating your own.

Through these spec files, the user can tune many knobs like the model, dataset, hyperparameters, optimizer etc. Each command (like train, finetune, evaluate etc.) should have a dedicated spec file. These sample spec files are available at `$SPECS_DIR/token_classification`

The important sections to look out for are:

- model: All arguments that are related to the model - language model, token classifier, optimizer and schedulers, datasets and any other related information
- trainer: Any argument to be passed to PyTorch Lightning

The model config file for token classification is available at [Github](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/token_classification/conf/token_classification_config.yaml). A similar file (at the above location inside TLT docker image) is used for TLT.

Among other things, the config file contains dictionaries called **dataset**, **train_ds** and **validation_ds**. These are the configurations used to set up the Dataset and DataLoaders for the corresponding split.

We assume that both training and evaluation files are located in the same directory, and use the default names mentioned during the data download step. So, to start model training, we simply need to specify `model.dataset.data_dir`, like we are going to do below.

Also notice that some config lines, including `model.dataset.data_dir`, have `???` in place of paths, which means that the user is required to specify values for these fields.
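
The idea of `???` as a "must be filled in" marker can be sketched in a few lines of plain Python. The helper `find_missing_fields` is hypothetical, and the example dictionary only mimics the fields named in this tutorial; it is not the actual content of the TLT spec file.

```python
def find_missing_fields(config, prefix=""):
    """Return the dotted paths of all fields whose value is still '???'."""
    missing = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            missing.extend(find_missing_fields(value, path))
        elif value == "???":
            missing.append(path)
    return missing


# Hypothetical, heavily shortened stand-in for a parsed spec file.
example_spec = {
    "data_dir": "???",
    "model": {
        "label_ids": "???",
        "language_model": {"pretrained_model_name": "bert-base-uncased"},
    },
    "trainer": {"max_epochs": 1},
}

print(find_missing_fields(example_spec))
# → ['data_dir', 'model.label_ids']
```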

Thus, we must provide the following parameters to TLT:
1. **data_dir** - path to the processed data
2. **model.label_ids** - mapping of labels with their numerical ids

We can pass these parameters while using the TLT train command. In addition to these, we can also change other parameters like max_epochs.

A similar version of model config file we saw above is also passed to the TLT train command.

In [None]:
!tlt token_classification train \
                        -e $SPECS_DIR/token_classification/train.yaml \
                        -g 1  \
                        -k $KEY \
                        -r $RESULTS_DIR/token_classification/train \
                        data_dir=$DATA_DIR/gmb_v_2.2.0_clean \
                        training_ds.num_samples=-1 \
                        validation_ds.num_samples=-1 \
                        model.label_ids=$DATA_DIR/gmb_v_2.2.0_clean/label_ids.csv \
                        trainer.max_epochs=1 \
                        model.language_model.pretrained_model_name=bert-base-uncased
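
Every TLT call in this notebook follows the same pattern: `tlt <task name> <subcommand>`, a handful of flags, and then `key=value` overrides for the spec file. As a rough sketch, the helper below assembles such a command as an argument list. The function `build_tlt_command` is hypothetical and not part of TLT; it only uses the flags that appear in this notebook (`-e`, `-g`, `-k`, `-r`, `-m`).

```python
import shlex


def build_tlt_command(task, subcommand, spec, results_dir, key,
                      gpus=1, model=None, overrides=None):
    """Assemble a TLT launcher call as a list of arguments."""
    command = [
        "tlt", task, subcommand,
        "-e", spec,           # spec file for this subcommand
        "-g", str(gpus),      # number of GPUs
        "-k", key,            # encryption key
        "-r", results_dir,    # where results and logs are written
    ]
    if model is not None:
        command += ["-m", model]  # existing .tlt model, e.g. for finetune
    # Spec overrides are passed as trailing key=value arguments.
    for name, value in (overrides or {}).items():
        command.append(f"{name}={value}")
    return command


train_command = build_tlt_command(
    "token_classification", "train",
    spec="/specs/token_classification/train.yaml",
    results_dir="/results/token_classification/train",
    key="tlt_encode",
    overrides={
        "data_dir": "/data/gmb_v_2.2.0_clean",
        "trainer.max_epochs": 1,
    },
)
print(shlex.join(train_command))
```

Keeping the overrides in a dictionary makes it easy to change a single hyperparameter, such as `trainer.max_epochs`, between runs without touching the spec file.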

---
### Finetune NER model

When we were training from scratch, the datasets were prepared for training during model initialization. When we use a pretrained NER model, we need to set up the training and evaluation data before training, and also provide the path to the pre-trained `.tlt` model.

For simplicity of this tutorial, we will be using the `trained-model.tlt` that was saved during the last section (Train the NER model from scratch using pre-trained BERT Base uncased model) and finetune it again on the GMB dataset.

Note: If you would rather start from a pre-trained NER model for better inference results, you can find a .nemo model [here](https://ngc.nvidia.com/catalog/collections/nvidia:nemotrainingframework).

Simply rename the .nemo file to .tlt and pass it through the finetune pipeline.
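
If you want to keep the original download untouched, you can copy the file under the new extension instead of renaming it. The helper below is a small sketch; the name `copy_nemo_as_tlt` is hypothetical.

```python
import shutil
from pathlib import Path


def copy_nemo_as_tlt(nemo_path):
    """Copy a .nemo file next to itself with a .tlt extension and return the new path."""
    nemo_path = Path(nemo_path)
    if nemo_path.suffix != ".nemo":
        raise ValueError(f"expected a .nemo file, got {nemo_path.name}")
    tlt_path = nemo_path.with_suffix(".tlt")
    shutil.copyfile(nemo_path, tlt_path)
    return tlt_path
```

The returned path, once visible inside the container through one of the mounts, is what you would pass to the `-m` argument of the finetune command.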

In [None]:
!tlt token_classification finetune \
                        -e $SPECS_DIR/token_classification/finetune.yaml \
                        -g 1 \
                        -m $RESULTS_DIR/token_classification/train/checkpoints/trained-model.tlt \
                        -k $KEY \
                        -r $RESULTS_DIR/token_classification/finetune-bert-base \
                        data_dir=$DATA_DIR/gmb_v_2.2.0_clean \
                        trainer.max_epochs=1

This command will generate a fine-tuned model `finetuned-model.tlt` at `$RESULTS_DIR/token_classification/finetune-bert-base/checkpoints`.

---
### Evaluate NER on the Validation Set

To see how the model performs, we can use it to generate predictions for a dataset stored in a file, for example to perform a final evaluation or an error analysis. Below, we use a subset of the dev set, but it could be any text file as long as it follows the data format described above. 

`labels_file` is optional here; if provided, it is used to compute metrics.

Using the command below, TLT will evaluate the `finetuned-model.tlt` model on the `text_dev.txt` file present in the data_dir.

In [None]:
!tlt token_classification evaluate \
                        -e $SPECS_DIR/token_classification/evaluate.yaml \
                        -g 1 \
                        -m $RESULTS_DIR/token_classification/finetune-bert-base/checkpoints/finetuned-model.tlt \
                        -k $KEY \
                        -r $RESULTS_DIR/token_classification/evaluate \
                        data_dir=$DATA_DIR/gmb_v_2.2.0_clean
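
To build some intuition for the kind of metrics an evaluation reports, the sketch below computes token-level precision, recall and F1 for a single label from a gold and a predicted label sequence. This is only an illustration under our own assumptions: the function `token_level_scores` is hypothetical, and the report produced by TLT may be computed or aggregated differently.

```python
def token_level_scores(gold, predicted, label):
    """Return (precision, recall, f1) for one label, counted per token."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == label and p == label for g, p in pairs)
    false_pos = sum(g != label and p == label for g, p in pairs)
    false_neg = sum(g == label and p != label for g, p in pairs)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical prediction that misses the last token of "New York City".
gold = "B-PER O O B-LOC I-LOC I-LOC O".split()
predicted = "B-PER O O B-LOC I-LOC O O".split()
print(token_level_scores(gold, predicted, "I-LOC"))
```

In this example the model finds one of the two `I-LOC` tokens and predicts no spurious ones, so precision is 1.0 and recall is 0.5.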

---
### Export model to ONNX format for deployment

Once the model has been trained to satisfactory metrics, we can export it in ONNX format so that it can be deployed with an inference solution that supports ONNX.

In [None]:
!tlt token_classification export \
                        -e $SPECS_DIR/token_classification/export.yaml \
                        -g 1 \
                        -m $RESULTS_DIR/token_classification/finetune-bert-base/checkpoints/finetuned-model.tlt \
                        -k $KEY \
                        -r $RESULTS_DIR/token_classification/export \
                        export_format=ONNX

This command exports the model as `exported-model.eonnx` which is essentially an archive containing the .onnx model.

If you specifically want to deploy this model using NVIDIA Riva, set `export_format=JARVIS` (Jarvis is the former name of Riva).

In [None]:
!tlt token_classification export \
                        -e $SPECS_DIR/token_classification/export.yaml \
                        -g 1 \
                        -m $RESULTS_DIR/token_classification/finetune-bert-base/checkpoints/finetuned-model.tlt \
                        -k $KEY \
                        -r $RESULTS_DIR/token_classification/riva \
                        export_format=JARVIS

The model is exported as `exported-model.riva` which is in a format suited for deployment in Riva.

---
### Infer on a custom input sentence

To get the model's predictions for sentences of your choice, save those sentences in infer.yaml and use the TLT `infer` command.

The infer.yaml contains:

```
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
# TLT Spec file for inference using the previously pretrained Token Classification model.

# "Simulate" user input: batch with two samples.
input_batch:
  - 'We bought four shirts from the Nvidia gear store in Santa Clara.'
  - 'Nvidia is a company.'

```
Please edit `input_batch` with your own sentences to see the output of this NER model.
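
If you would like to generate the spec from a Python list instead of editing the file by hand, a minimal sketch follows. It assumes the spec only needs the `input_batch` key shown in the listing above; the helper name `render_infer_spec` is hypothetical.

```python
import json


def render_infer_spec(sentences):
    """Render an infer.yaml body with one list item per input sentence."""
    lines = ["input_batch:"]
    for sentence in sentences:
        # json.dumps yields a double-quoted string, which YAML also accepts,
        # so quotes and special characters in the sentence are escaped safely.
        lines.append(f"  - {json.dumps(sentence)}")
    return "\n".join(lines) + "\n"


print(render_infer_spec([
    "We bought four shirts from the Nvidia gear store in Santa Clara.",
    "Nvidia is a company.",
]))
```

You could then write the returned string to the infer.yaml file inside the host directory that is mounted as `/specs`.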

In [None]:
!tlt token_classification infer \
                        -e $SPECS_DIR/token_classification/infer.yaml \
                        -g 1 \
                        -m $RESULTS_DIR/token_classification/finetune-bert-base/checkpoints/finetuned-model.tlt \
                        -k $KEY \
                        -r $RESULTS_DIR/token_classification/infer

---
### What's Next?

You could use TLT to build custom models for your own applications, or you could deploy the custom model to NVIDIA Riva!