# Manufacturing Use Case using TAO MaskRCNN

Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use, Python-based AI toolkit for taking purpose-built AI models and customizing them with your own data. In this example, we train an instance segmentation model for tracking defects along a manufacturing line.

<!-- <img align="center" src="https://developer.nvidia.com/sites/default/files/akamai/embedded-transfer-learning-toolkit-software-stack-1200x670px.png" width="1080">  -->

## Hands-on Lab Objectives
In this notebook, you will learn how to leverage the simplicity and convenience of TAO to:

* Prepare a dataset which was labeled in Azure Machine Learning for use with the TAO Toolkit
* Take a pretrained resnet50 model and train a MaskRCNN model on a dataset in COCO format
* Evaluate the trained model
* Optimize the trained model by pruning and retraining
* Export the trained model to a .etlt file for deployment to DeepStream
* Integrate the exported model into a DeepStream Pipeline

## Prerequisites
> **NOTE:** To complete this hands-on lab, we assume you have already completed the steps in the [README.md](README.md) document of this repository. Please complete those steps before proceeding. Additionally, this notebook should be opened from the conda environment you configured for use with the TAO Toolkit.


### Table of Contents
This notebook shows an example use case for instance segmentation using the Train Adapt Optimize (TAO) Toolkit.

1. [Configure Key Directories](#head-1)
2. [Install Python Dependencies](#head-2)
3. [Install Nvidia NGC CLI Tool](#head-3)
4. Install Protocol Buffer Compiler (Protoc)
5. [Download and Prepare Training Data](#head-4)
6. [Download pre-trained model](#head-5)
7. [Provide the TAO training specification](#head-6)
8. [Train a MaskRCNN model](#head-7)
9. [Evaluate trained model](#head-8)
10. [Prune the model](#head-9)
11. [Retrain pruned models](#head-10)
12. [Evaluate retrained model](#head-11)
13. [Export the model for use with DeepStream](#head-12)
14. [Run a DeepStream pipeline with the model](#head-13)


## 1. Configure Key Directories <a class="anchor" id="head-1"></a>
---
When using the purpose-built pretrained models from NGC, make sure to set the `$KEY` environment variable to the key mentioned in the model overview. Failing to do so can lead to errors when trying to load them as pretrained models.

This notebook requires an environment variable called `$PROJECT_DIR`, set to the path of the user's workspace. The dataset for this notebook is expected to reside in `$PROJECT_DIR/data`, while the collateral generated by the TAO experiment is written to `$PROJECT_DIR/models/maskrcnn`. More information on how to set up the dataset and the supported steps in the TAO workflow is provided in the subsequent cells.

The cell below configures the key project directories required for running this notebook. The table below summarizes the directories and their expected values.

|Directory Name|Description|
|--------------|-----------|
|PROJECT_DIR|General project directory. This should point to the root of your workspace|
|DATA_DIR|Sub-directory of your project where the images and annotations are stored|
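|AML_DATA_DIR|Sub-directory of DATA_DIR where the raw image data should be downloaded|
|EXPERIMENT_DIR|Sub-directory of your project where the MaskRCNN configs, scripts, and experiment outputs are stored|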

*Note: Please remove any stray artifacts/files from the `$EXPERIMENT_DIR` or `$AML_DATA_DIR` paths that may have been generated by previous experiments. Leftover checkpoint files and similar artifacts may interfere with creating a training graph for a new experiment.*

*Note: By default, this notebook is set up to run training on 1 GPU. To use more GPUs, update the `--gpus` argument of the `tao mask_rcnn train` commands accordingly.*

In [None]:
import os
os.environ['PROJECT_DIR'] = os.environ['PWD'] + "/workspace"
os.environ['DATA_DIR'] = os.environ['PROJECT_DIR'] + "/data"
os.environ['AML_DATA_DIR'] = os.environ['PROJECT_DIR'] + "/data/raw-images"
os.environ['EXPERIMENT_DIR'] = os.environ['PROJECT_DIR'] + "/models/maskrcnn"

### Configure TAO mount directories for container bindings

The TAO toolkit requires a file which defines the directories to be mounted to the container for the various tasks. This file is located in `~/.tao_mounts.json`. The cell below uses the previously specified directories to generate this file for you.

In [None]:
import os
import json

# map the workspace directory from this repo to the location in the container

mounts_file = os.path.expanduser("~/.tao_mounts.json")

# Define the dictionary with the mapped drives
drive_map = {
    "Mounts": [
        # Map the project directory to /workspace inside the container
        {
            "source": os.environ["PROJECT_DIR"],
            "destination": "/workspace"
        }
    ]
}

# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(drive_map, mfile, indent=4)

In [None]:
!cat ~/.tao_mounts.json
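
As an optional sanity check, the mounts file can also be validated programmatically. The following is a minimal sketch, assuming the file has the `Mounts` / `source` / `destination` structure written above; `missing_mount_sources` is a hypothetical helper name used for illustration and is not part of the TAO Toolkit.

```python
import json
import os


def missing_mount_sources(mounts_path):
    """Return the "source" paths in a mounts file that do not exist on the host."""
    with open(mounts_path) as mfile:
        config = json.load(mfile)
    # Each entry is expected to look like {"source": ..., "destination": ...}
    return [
        entry["source"]
        for entry in config.get("Mounts", [])
        if not os.path.isdir(entry["source"])
    ]
```

Calling `missing_mount_sources(os.path.expanduser("~/.tao_mounts.json"))` should return an empty list once the workspace directory exists on the host.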

## 2. Install Python dependencies <a class="anchor" id="head-2"></a>
---

This lab requires several key packages. The cell below installs the following packages and their dependencies into your current conda environment.

### Installed Packages
- nvidia-tao
- azure-storage-blob
- pycocotools
- absl-py
- tensorflow-object-detection-api

In [None]:
# install packages from requirements file
!pip3 install -r requirements.txt

### Check that the TAO Launcher is available

Run the command below to validate that the TAO launcher is available. The output should look similar to the following:


#### Sample Output
```text
$ tao info

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022
```

In [None]:
!tao info

## 3. Install Nvidia NGC CLI tool <a class="anchor" id="head-3"></a>

During the prerequisites for this lab, you should have configured your NGC account and retrieved your API key. You should also have logged in to the nvcr.io Docker registry with the `docker login nvcr.io` command. The cell below builds on this by downloading the NGC CLI tool, which is used to retrieve assets from NGC for the TAO Toolkit, such as the pretrained model downloaded later in this lab.

This cell will perform the following actions:
* Remove any existing NGC CLI instances
* Download the NGC CLI zip file
* Unpack the NGC CLI tool and cleanup files
* Add the new instance to your PATH

In [None]:
# Installing NGC CLI on the local machine.
## Download and install
%env CLI=ngccli_cat_linux.zip
!mkdir -p $PROJECT_DIR/ngccli

# Remove any previously existing CLI installations
!rm -rf $PROJECT_DIR/ngccli/*
!wget "https://ngc.nvidia.com/downloads/$CLI" -P $PROJECT_DIR/ngccli
!unzip -u "$PROJECT_DIR/ngccli/$CLI" -d $PROJECT_DIR/ngccli
!rm $PROJECT_DIR/ngccli/*.zip 
os.environ["PATH"]="{}/ngccli/ngc-cli:{}".format(os.getenv("PROJECT_DIR", ""), os.getenv("PATH", ""))

In [None]:
# validate the NGC CLI successfully installed
!ngc --version
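
Because the `PATH` change above only applies to the current notebook kernel, it can be useful to confirm from Python that the executable is actually discoverable. This is a small sketch using the standard library's `shutil.which`; `tool_on_path` is a hypothetical helper name, not part of the NGC CLI.

```python
import shutil


def tool_on_path(name, path=None):
    """Return the full path of an executable found on the search path, or None."""
    # With path=None, shutil.which falls back to the PATH environment variable.
    return shutil.which(name, path=path)
```

For example, `tool_on_path("ngc")` should return a location under `$PROJECT_DIR/ngccli/ngc-cli` if the installation cell succeeded, and `None` otherwise.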

## 4. Install Protocol Buffer Compiler (Protoc) on your system.

In [None]:
# Install Protocol buffer compiler (protoc). This binary will be used in the next cell to compile all the .proto files
# under tf-models/research/object_detection/protos/ folder. 
# https://google.github.io/proto-lens/installing-protoc.html
# This installation is for Ubuntu 20.04 OS.

%env PROTOC_ZIP=protoc-3.14.0-linux-x86_64.zip
!curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.14.0/$PROTOC_ZIP
!unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
!unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
!rm -f $PROTOC_ZIP

The above cell should install the Protobuf compiler on your system, which is required to compile the .proto files. These .proto files are downloaded in the following cells while we prepare the dataset for training. If the above commands fail due to permission issues, connect to your VM via SSH, go to the workspace folder, and execute these commands manually.

## 5. Download and Prepare Training Data <a class="anchor" id="head-4"></a>

At this point, we have configured our environment for use with the TAO toolkit. Now the fun part begins. We start by downloading the labeled data from Azure storage. For this step, you will need a SAS token for downloading the images from blob storage. It is expected that you already have an annotations.json file in COCO format which has been exported from the Azure Machine Learning labeling project. We will be using the annotations.json file stored in this repository, found [here](workspace/data/annotations.json).

### Download Data from Azure Storage

In [None]:
# Run the python script with the following arguments to download the images
!python3 $PROJECT_DIR/models/maskrcnn/scripts/download_aml_blobs.py --output_dir $AML_DATA_DIR

### Split the dataset into train, test, and validation sets

Next, we run the below Python script to split our images into train, test, and validation sets. We randomly select the images to be put into each of these categories. The main variable to control here is the `--train_pct` argument which determines what percentage of the data is used for training vs testing and validation. 

![docs/images/train_val_test.jpg](docs/images/train_val_test.jpg)

Below is a table summarizing the inputs for the script

|Input Parameter|Description|
|:--------------|:----------|
|--annotations_file|This is an absolute path reference to the `annotations.json` file which was exported from the Azure ML labeling project|
|--input_dir|This is an absolute path reference to the directory where the raw image dataset was downloaded to in the previous step|
|--output_dir|This is an absolute path reference to the path where the split dataset will reside after it is split|
|--train_pct|This determines the percentage of the data to be used for training vs testing and validation|


In [None]:
# Split the data in preparation for training
!python3 $PROJECT_DIR/models/maskrcnn/scripts/split_aml_exported_coco.py \
    --annotations_file $DATA_DIR/annotations.json \
    --input_dir $AML_DATA_DIR \
    --output_dir $DATA_DIR/split-images \
    --train_pct 0.8
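
To make the effect of `--train_pct` concrete, here is a minimal sketch of a random split over a COCO-style dictionary. It is not the code in `split_aml_exported_coco.py`; in particular, dividing the remaining images evenly between validation and test is an assumption made for illustration, and `split_coco` is a hypothetical name.

```python
import random


def split_coco(coco, train_pct, seed=0):
    """Randomly split a COCO-style dict into train/val/test dicts."""
    images = list(coco["images"])
    random.Random(seed).shuffle(images)

    n_train = int(len(images) * train_pct)
    # Assumption: the remainder is divided evenly between val and test.
    n_val = (len(images) - n_train) // 2
    groups = {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],
    }

    splits = {}
    for name, imgs in groups.items():
        ids = {img["id"] for img in imgs}
        splits[name] = {
            "images": imgs,
            # Keep only the annotations that belong to this split's images.
            "annotations": [a for a in coco["annotations"] if a["image_id"] in ids],
            "categories": coco.get("categories", []),
        }
    return splits
```

With `train_pct=0.8` and 10 images, this sketch places 8 images in `train` and 1 each in `val` and `test`, with every annotation following its image.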


### Convert COCO labels to TFRecords

The last step in the data preparation process is to convert the images and annotations into TFRecord format. This is done via a helper script which has been modified for this lab. Behind the scenes, this script downloads some additional Tensorflow utilities required for the conversion process and then converts all the data to TFRecord format.

In [None]:
# Create TFRecords from coco
!bash $PROJECT_DIR/models/maskrcnn/scripts/process-labels-aml.sh

Note: Ignore the following if the above step was successful. If it fails due to permission issues while running the protocol buffer compiler (protoc), connect to your VM via SSH, go to the workspace folder, and execute the command below manually:
```bash
cd tf-models/research && protoc object_detection/protos/*.proto --python_out=.
```
And then re-run the above step (process-labels-aml.sh) to successfully convert your data into TFRecords.

### Validate Data Directory Structure 

At this point you have successfully prepared your dataset for use with the TAO toolkit for training an optimized model ready for DeepStream integration. To validate everything went smoothly, your data directory should look like the following:
```
workspace
│   README.md
│   file001.txt    
│
└───data
    │   annotations.json
    │
    └───raw-images
    │   │   image1.jpg
    │   │   image2.jpg
    │   │   ...
    │
    └───split-images
        │   
        └───train
        │   │   annotations.json
        │   └───images
        │       │   image1.jpg
        │       │   image2.jpg
        │       │   ...
        │       
        └───test
        │   │   annotations.json
        │   └───images
        │       │   image1.jpg
        │       │   image2.jpg
        │       │   ...
        │
        └───val
            │   annotations.json
            └───images
                │   image1.jpg
                │   image2.jpg
                │   ...        
```
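
The layout above can also be checked from Python. The following is a minimal sketch, assuming the `split-images/<split>/annotations.json` and `split-images/<split>/images` structure shown in the tree; `missing_split_paths` is a hypothetical helper name.

```python
import os


def missing_split_paths(data_dir, splits=("train", "test", "val")):
    """List the expected files/directories that are absent under data_dir/split-images."""
    missing = []
    for split in splits:
        split_dir = os.path.join(data_dir, "split-images", split)
        annotations = os.path.join(split_dir, "annotations.json")
        images = os.path.join(split_dir, "images")
        if not os.path.isfile(annotations):
            missing.append(annotations)
        if not os.path.isdir(images):
            missing.append(images)
    return missing
```

Calling `missing_split_paths(os.environ["DATA_DIR"])` should return an empty list if the split step completed as expected.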

We will use the NGC CLI to get the pre-trained models. For more details, go to ngc.nvidia.com and click SETUP on the navigation bar.

## 6. Download pre-trained model <a class="anchor" id="head-5"></a>

In an earlier step, we downloaded the NVIDIA NGC CLI tool. Now we are ready to use this tool to download the pretrained model we will use for transfer learning. The cells below create a new directory for the pretrained model, then use the NGC CLI to download the model for use in the TAO toolkit.

### Prepare directory for model download

In [None]:
# Remove existing instances and create a new empty directory
!rm -rf $EXPERIMENT_DIR/pretrained_resnet50/
!mkdir -p $EXPERIMENT_DIR/pretrained_resnet50/

### List available instance segmentation models
The below command lists the available instance segmentation models for use with the TAO toolkit. From the output list we are interested in the `resnet50` model which we will download in the next command.

In [None]:
# List the available instance segmentation models
!ngc registry model list nvidia/tao/pretrained_instance_segmentation:*

### Download the resnet50 model to the output directory created above

In [None]:
# Pull pretrained model from NGC
!ngc registry model download-version \
    nvidia/tao/pretrained_instance_segmentation:resnet50 \
    --dest $EXPERIMENT_DIR/pretrained_resnet50

### Validate model exists in target directory

In [None]:
print("Check that model is downloaded into dir.")
!ls -l $EXPERIMENT_DIR/pretrained_resnet50/pretrained_instance_segmentation_vresnet50

## 7. Provide the TAO training specification <a class="anchor" id="head-6"></a>

* TFRecords for the training dataset
    * To use the newly generated TFRecords, we have updated the `dataset_config` parameter in the spec file at `$EXPERIMENT_DIR/configs/maskrcnn_train_resnet50.txt`
    
Note that the learning rate in the spec file is set for 1 GPU training. 

We have configured this experiment to work with the sample dataset, but you may need to alter these settings if using another dataset. Please refer to the [Nvidia Documentation](https://docs.nvidia.com/tao/tao-toolkit/text/instance_segmentation/mask_rcnn.html) for further information on how to configure the training specification for your specific use case.

In [None]:
# Print out the training spec
!cat $EXPERIMENT_DIR/configs/maskrcnn_train_resnet50.txt

## 8. Train a MaskRCNN model <a class="anchor" id="head-7"></a>

Now we are ready to train our model. As final preparation, we will set a couple environment variables which will be used throughout the tao commands used for the remainder of the lab. A description of these variables is provided below.

|ENV Variable|Description|
|:-----------|:----------|
|KEY|The value provided here is used to encrypt the model|
|CONTAINER_EXPERIMENT_DIR|This is the absolute path of the experiment directory inside the container, under the destination directory specified in the `~/.tao_mounts.json` file created earlier in this tutorial|


### Set ENV Variables

In [None]:
# Configure the key variables for the TAO training steps
os.environ['KEY'] = "nvidia_tlt"
os.environ['CONTAINER_EXPERIMENT_DIR'] = "/workspace/models/maskrcnn"
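
The relationship between `$EXPERIMENT_DIR` on the host and `$CONTAINER_EXPERIMENT_DIR` in the container follows from the mount mapping. Here is a minimal sketch of that translation, assuming mount entries shaped like those in `~/.tao_mounts.json`; `to_container_path` is a hypothetical helper and the host path in the usage note is only an example.

```python
import os


def to_container_path(host_path, mounts):
    """Translate a host path into its container path using "Mounts" entries."""
    host_path = os.path.abspath(host_path)
    for entry in mounts:
        source = os.path.abspath(entry["source"])
        # The path must be the mount source itself or live underneath it.
        if host_path == source or host_path.startswith(source + os.sep):
            relative = os.path.relpath(host_path, source)
            if relative == ".":
                return entry["destination"]
            return entry["destination"].rstrip("/") + "/" + relative.replace(os.sep, "/")
    raise ValueError(f"{host_path} is not under any mounted source directory")
```

For instance, with a mount from `/home/user/repo/workspace` to `/workspace`, the host path `/home/user/repo/workspace/models/maskrcnn` maps to `/workspace/models/maskrcnn`, which is the value assigned to `CONTAINER_EXPERIMENT_DIR` above.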

### Prepare output directory

In [None]:
# Remove and recreate output directory
!sudo rm -rf $EXPERIMENT_DIR/experiment_dir_unpruned
!mkdir -p $EXPERIMENT_DIR/experiment_dir_unpruned

### Train the model

Now we use the command below to train the model using the training spec and output the files to the output directory created above. 

Notes about training:
* The command requires the sample spec file and the output directory location for models
* Evaluation uses COCO metrics. For more info, please refer to: https://cocodataset.org/#detection-eval
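
COCO metrics for instance segmentation are built on the intersection over union (IoU) between predicted and ground-truth masks. As a toy illustration of that quantity only (this is not how the TAO evaluation or pycocotools computes it internally), IoU for two small binary masks can be sketched as:

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks (lists of rows)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks have no overlap to measure; report 0.0 by convention here.
    return intersection / union if union else 0.0


predicted = [[1, 1, 0],
             [1, 1, 0],
             [0, 0, 0]]
ground_truth = [[0, 1, 1],
                [0, 1, 1],
                [0, 0, 0]]
print(mask_iou(predicted, ground_truth))  # 2 shared pixels out of 6 covered
```

A prediction counts as a match at a given IoU threshold only if its IoU with a ground-truth mask meets that threshold, and the COCO average precision is averaged over a range of such thresholds.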

> **WARNING:** The training process will take some time (i.e. several hours to days) to complete depending on how many iterations are in your training spec and the machine you are using for training

In [None]:
# Train the MaskRCNN model using resnet50 as starting point
!tao mask_rcnn train -e $CONTAINER_EXPERIMENT_DIR/configs/maskrcnn_train_resnet50.txt \
                     -d $CONTAINER_EXPERIMENT_DIR/experiment_dir_unpruned \
                     -k $KEY \
                     --gpus 1

### Inspect Training output

The command below lists each of the checkpoint files saved during the training process. Each checkpoint is of the pattern 

`model.step-<checkpoint-number>.tlt`

In [None]:
print('Model for each epoch:')
print('---------------------')
!ls -ltrh $EXPERIMENT_DIR/experiment_dir_unpruned/
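
Since later cells refer to a checkpoint by its step number, it can help to extract the highest step from the file names. This is a minimal sketch, assuming checkpoints follow the `model.step-<checkpoint-number>.tlt` pattern described above; `latest_checkpoint_step` is a hypothetical helper name.

```python
import re

CHECKPOINT_PATTERN = re.compile(r"^model\.step-(\d+)\.tlt$")


def latest_checkpoint_step(filenames):
    """Return the highest step number among model.step-<n>.tlt names, or None."""
    steps = []
    for name in filenames:
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None
```

Passing the result of `os.listdir()` on the `experiment_dir_unpruned` directory to this function yields a value that can be used for `NUM_STEP` in the following sections.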

## 9. Evaluate trained model <a class="anchor" id="head-8"></a>

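After training the model, we can evaluate the model's performance. To do so, we set `NUM_STEP` to the step number of the checkpoint we want to evaluate, typically the last one saved. The cell below uses step 1000; if your training spec runs for a different number of steps (for example 25000), change the value to match a `model.step-<checkpoint-number>.tlt` file in the output directory.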

In [None]:
%env NUM_STEP=1000

In [None]:
!sudo rm -rf $EXPERIMENT_DIR/evaluate/
!mkdir -p $EXPERIMENT_DIR/evaluate/

!tao mask_rcnn evaluate -e $CONTAINER_EXPERIMENT_DIR/configs/maskrcnn_train_resnet50.txt \
                        -m $CONTAINER_EXPERIMENT_DIR/experiment_dir_unpruned/model.step-$NUM_STEP.tlt \
                        -k $KEY

## 10. Prune the model <a class="anchor" id="head-9"></a>

Now that we have trained a model, we will prune it to reduce its size and to improve integration with DeepStream. To run the pruning step, we need to specify the information below:

#### Pruning inputs
- Trained model to prune
- Output directory to store the pruned model
- Threshold for pruning
- A key to save and load the model

Usually, you only need to adjust `-pth` (the threshold) to trade off accuracy against model size. A higher `-pth` gives a smaller model (and thus higher inference speed) but worse accuracy. The right threshold depends on the dataset and the model; the 0.5 used in the cell below is just a starting point. If the accuracy after retraining is good, you can increase this value to get a smaller model. Otherwise, lower it to recover accuracy.
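
To build intuition for why a higher threshold produces a smaller model, here is a conceptual sketch of threshold-based filter pruning. It assumes each filter is scored by its L2 norm divided by the largest norm in the layer, and that filters scoring at or below the threshold are dropped. This is an illustration of the general idea only, not the TAO Toolkit's implementation, and `filters_to_keep` is a hypothetical name.

```python
import math


def filters_to_keep(filters, pth):
    """Return indices of filters whose max-normalized L2 norm exceeds pth."""
    norms = [math.sqrt(sum(w * w for w in f)) for f in filters]
    largest = max(norms)
    if largest == 0:
        return []
    return [i for i, n in enumerate(norms) if n / largest > pth]


# Three toy "filters" with L2 norms of roughly 5.0, 0.5 and 2.5
toy_filters = [[3.0, 4.0], [0.3, 0.4], [1.5, 2.0]]
print(filters_to_keep(toy_filters, 0.3))  # the weakest filter is dropped
print(filters_to_keep(toy_filters, 0.6))  # only the strongest filter survives
```

Raising the threshold removes more filters, which shrinks the model but discards more of what it learned; that is why retraining is needed afterwards.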


### Prepare output directory

In [None]:
# Create an output directory to save the pruned model. 
# Remove the directory first if it already exists
!sudo rm -rf $EXPERIMENT_DIR/experiment_dir_pruned
!mkdir -p $EXPERIMENT_DIR/experiment_dir_pruned

### Perform model pruning

In [None]:
# Prune the trained model and write the result to the pruned output directory
!tao mask_rcnn prune -m $CONTAINER_EXPERIMENT_DIR/experiment_dir_unpruned/model.step-$NUM_STEP.tlt \
                     -o $CONTAINER_EXPERIMENT_DIR/experiment_dir_pruned \
                     -pth 0.5 \
                     -k $KEY

### Validate pruning output exists

In [None]:
!ls -l $EXPERIMENT_DIR/experiment_dir_pruned

**Note** that you should retrain the pruned model first, as it cannot be directly used for evaluation or inference. 

## 11. Retrain pruned models <a class="anchor" id="head-10"></a>

After pruning, the model will have lost some accuracy. For that reason, it is necessary to retrain it using the TAO toolkit. The toolkit supports this by using the pruned model as a starting checkpoint to produce a final trained model for integration with DeepStream. This is achieved through a separate training spec file located at `$EXPERIMENT_DIR/configs/maskrcnn_retrain_resnet50.txt`.

The inputs for the retraining step are listed below:
- Path to the retraining model specification
- Output directory to store retrained model
- Key used with the TAO toolkit

> **WARNING:** As with the initial training step, this training will take several hours to one day to complete depending on the training spec and the machine you are using for training

In [None]:
# output the training spec
!cat $EXPERIMENT_DIR/configs/maskrcnn_retrain_resnet50.txt

# prepare the output directories
!sudo rm -rf $EXPERIMENT_DIR/experiment_dir_retrain
!mkdir -p $EXPERIMENT_DIR/experiment_dir_retrain

# run the retraining step
!tao mask_rcnn train -e $CONTAINER_EXPERIMENT_DIR/configs/maskrcnn_retrain_resnet50.txt \
                     -d $CONTAINER_EXPERIMENT_DIR/experiment_dir_retrain \
                     -k $KEY \
                     --gpus 1

## 12. Evaluate retrained model <a class="anchor" id="head-11"></a>

We will once again evaluate the retrained model to make sure the performance is satisfactory. We do so by running the `tao mask_rcnn evaluate` command on the final training checkpoint.

In [None]:
%env NUM_STEP=25000

In [None]:
!tao mask_rcnn evaluate -e $CONTAINER_EXPERIMENT_DIR/configs/maskrcnn_retrain_resnet50.txt \
                        -m $CONTAINER_EXPERIMENT_DIR/experiment_dir_retrain/model.step-$NUM_STEP.tlt \
                        -k $KEY

## 13. Export the model for use with DeepStream <a class="anchor" id="head-12"></a>

This is the final step in the TAO toolkit for preparing your model to be used with DeepStream. The output of the cell below is a .etlt file which can be used with DeepStream applications.

In [None]:
# Export the model
!tao mask_rcnn export -m $CONTAINER_EXPERIMENT_DIR/experiment_dir_retrain/model.step-$NUM_STEP.tlt \
                      -k $KEY \
                      -e $CONTAINER_EXPERIMENT_DIR/configs/maskrcnn_retrain_resnet50.txt

### Validate exported model is in output directory

We are looking for the `.etlt` file to be present in the output of the command below.

In [None]:
# Check if etlt model is correctly saved.
!ls -l $EXPERIMENT_DIR/experiment_dir_retrain/

## 14. Run a DeepStream pipeline with the model <a class="anchor" id="head-13"></a>

The final step in the process is to run a DeepStream pipeline using the model. In this instance, we use the `deepstream-app` that is provided as part of the DeepStream sample apps when installing the DeepStream SDK.

We have already configured the configuration files required for running this application so it works for this lab exercise, but typically you would need to configure your `pgie_config.txt` and `tracker_config.txt` files according to where your model assets reside and your particular use case.

This sample application will use a sample input video file in the `media` folder of this repository and run the video through the DeepStream pipeline. The pipeline will save an output with masks and metadata to the root of this repository.

In [None]:
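# Run the DeepStream sample app with the config file.
# NOTE: the absolute path below is specific to the machine this notebook was written on;
# update it to point at deepstream/deepstream_config.txt in your copy of this repository.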
!deepstream-app -c /home/edwin/repos/manufacturing-demo/deepstream/deepstream_config.txt

### Validate output file exists

In [None]:
# notice the out.mp4 file that was generated by the DeepStream pipeline
!ls | grep out.mp4

### Play the output video file

<video controls src="out.mp4" width=640 height=480/>