### Notebook to demonstrate Image Classification workflow

Transfer learning reuses features learned on one task for another: you take a model trained on one task and retrain it for a different one. Train Adapt Optimize (TAO) Toolkit is a Python-based AI toolkit for customizing purpose-built AI models with your own data.

![image](https://d29g4g2dyqv443.cloudfront.net/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)

### Sample prediction for an Image Classification model
<img align="center" src="../example_images/sample_image_classification.jpg">

### The workflow in a nutshell

- Create a dataset
- Upload the dataset to the service
- Get a pretrained model (PTM) from NGC
- Model Actions
    - Train (Normal/AutoML)
    - Evaluate
    - Prune, retrain
    - Export
    - TAO-Deploy
    - Inference on TAO

### Table of contents

1. [Create datasets](#head-1)
1. [List the created datasets](#head-2)
1. [Create an experiment](#head-4)
1. [List experiments](#head-5)
1. [Assign train, eval datasets](#head-6)
1. [Assign PTM](#head-7)
1. [View hyperparameters that are enabled by default](#head-8)
1. [Set AutoML related configurations](#head-9)
1. [Actions](#head-10)
1. [Train](#head-11)
1. [Evaluate](#head-12)
1. [Optimize: Apply specs for prune](#head-14)
1. [Optimize: Apply specs for retrain](#head-15)
1. [Optimize: Run actions](#head-16)
1. [Export](#head-17)
1. [TRT Engine generation using TAO-Deploy](#head-19)
1. [TAO inference](#head-20)

### Requirements
See the [server requirements](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#).

In [286]:
import json
import os
import requests
import uuid
import time
from IPython.display import clear_output
import subprocess
import glob

### FIXME

1. Assign a model_name in FIXME 1
1. Assign a workdir in FIXME 2
1. Assign the ip_address and port_number in FIXME 3 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
1. Assign the ngc_api_key variable in FIXME 4
1. (Optional) Enable AutoML if needed in FIXME 5
1. (Optional) Choose between the bayesian and hyperband automl_algorithm in FIXME 6 (only if AutoML was enabled in FIXME 5)
1. Choose to download jobs or not in FIXME 7
1. Choose between default and custom dataset in FIXME 8
1. Assign path of DATA_DIR in FIXME 9

In [287]:
# Define model_name workspaces and other variables
# Available models (#FIXME 1):
# 1. classification_pyt - https://docs.nvidia.com/tao/tao-toolkit/text/image_classification.html
# 2. classification_tf1 - https://docs.nvidia.com/tao/tao-toolkit/text/image_classification.html
# 3. classification_tf2 - https://docs.nvidia.com/tao/tao-toolkit/text/image_classification_tf2.html
# 4. multitask_classification - https://docs.nvidia.com/tao/tao-toolkit/text/multitask_image_classification.html
# classification is the same as multi-class classification

model_name = "classification_tf2" # FIXME1 (Add the model name from the above mentioned list)

In [288]:
workdir = "/home/zyw/tao-env/siemens_product_cla" # FIXME2
host_url = "http://192.168.1.85:31951" # FIXME3 example: https://10.137.149.22:32334
# In host machine, node ip_address and port number can be obtained as follows,
# ip_address: hostname -i
# port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
ngc_api_key = "<YOUR_NGC_API_KEY>" # FIXME4 (Add your NGC API key; do not leave a real key in a shared notebook)

In [289]:
automl_enabled = True # FIXME5 set to True if you want to run automl for the model chosen in the previous cell
automl_algorithm = "bayesian" # FIXME6 example: bayesian/hyperband
# FIXME7 Defaulted to False as downloading jobs from service to your machine takes time
# Set to True if you want to download jobs where examples have been provided like for train, export, inference.
download_jobs = True

In [292]:
# Exchange NGC_API_KEY for JWT
data = json.dumps({"ngc_api_key": ngc_api_key})
response = requests.post(f"{host_url}/api/v1/login", data=data)
assert response.status_code in (200, 201)
assert "user_id" in response.json().keys()
user_id = response.json()["user_id"]
print("User ID",user_id)
assert "token" in response.json().keys()
token = response.json()["token"]
print("JWT",token)

# Set base URL
base_url = f"{host_url}/api/v1/users/{user_id}"
print("API Calls will be forwarded to",base_url)

headers = {"Authorization": f"Bearer {token}"}

User ID a1c02cba-b62b-52f9-9e49-e3de0e5b66ab
JWT eyJraWQiOiJFUkNPOklCWFY6TjY2SDpOUEgyOjNMRlQ6SENVVToyRkFTOkJJTkw6WkxKRDpNWk9ZOkRVN0o6TVlVWSIsImFsZyI6IlJTMjU2In0.eyJzdWIiOiJhNWZhazdoODFnZDZobmhjdjJzZHBkb2drZCIsImF1ZCI6Im5nYyIsImFjY2VzcyI6W10sImlzcyI6ImF1dGhuLm52aWRpYS5jb20iLCJvcHRpb25zIjpbXSwiZXhwIjoxNzI0MjE4OTQzLCJpYXQiOjE3MjQyMTgzNDMsImp0aSI6IjlhZmExZjQzLTk1NzYtNDAxNS05YWQ5LWU1MDJmOTQ4NmI5MCJ9.1lIV9LlaYYQzOceG_fsxpUTkuvHMvDfYwkjkgwkRVPKj0dXWyty3OG2UzsyHJf7SGz4UQLrzPWz8nL2SpPt34uOy4InczgAHchERY-PLY7gLcB-U3G36aDnALwwMwS_iVuGSztfoomKQFM-Qitn0bqEt1I4bo5DftA9Yzq8CsX-XaAMdZQe2veLwV8Wz0M7T3TCO4p7SZoysz03BW7M9ymMZMUf3d5cMJiv71FywXLbpMqdLQvdj43A50qGpjlkTEWszXA5rrAeWp-N8wbqhAmS-h9LMZ6zmXhnlRCqU5OpX1V1fpK2qja94p8NToTJ4DuebfXx25PZ-Ucraw8JXzfuX4sr1KsbjdICbrh5KOrFL2gXNbipzk13SbuivYtvoDUyG4_bZlF4J3DlQThXI3BziPPel38vJAOSPzfNJWoNZaAmKdqMYlws06_X3LVzELDHH2Z30UgkDx6LSVrZpvQ6ythXHiUgv7lCWJUTnvK8YavCO4k1TvKuJJsUcyz1LI3B6XVVjnXRZ0LdhtjXgJgwE-uSfJkS3JEyeBANSwVq030tyrWZtnpMToWe0xkczHwKf4ALE1iKUHeOa0T-F0adSIv
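
The login cell above checks the response with bare `assert` statements, which say little when a field is missing. The following is a minimal sketch of a more explicit parsing step. `parse_login` is a hypothetical helper, not part of the TAO API, and it assumes only the `user_id` and `token` fields already used above.

```python
def parse_login(payload, host_url):
    """Build the base URL and auth headers from a decoded login response."""
    missing = [key for key in ("user_id", "token") if key not in payload]
    if missing:
        raise KeyError(f"login response is missing: {', '.join(missing)}")
    # Same URL layout as the cell above: {host_url}/api/v1/users/{user_id}
    base_url = f"{host_url.rstrip('/')}/api/v1/users/{payload['user_id']}"
    headers = {"Authorization": f"Bearer {payload['token']}"}
    return base_url, headers


# Example with a made-up payload (not a real user ID or token)
base_url, headers = parse_login(
    {"user_id": "1234", "token": "abc"}, "http://localhost:31951/"
)
print(base_url)
```

In the notebook, this could be called as `parse_login(response.json(), host_url)` after the status code check.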

In [296]:
# Creating workdir
if not os.path.isdir(workdir):
    os.makedirs(workdir)

### Function to split tar files <a class="anchor" id="head-1.1"></a>

In [297]:
import os
import tarfile

def split_tar_file(input_tar_path, output_dir, max_split_size=0.2*1024*1024*1024):
    """Split a tar archive into smaller tar files of at most max_split_size bytes of member data."""
    os.makedirs(output_dir, exist_ok=True)

    def open_split(number):
        return tarfile.open(os.path.join(output_dir, f'smaller_file_{number}.tar'), 'w')

    with tarfile.open(input_tar_path, 'r') as original_tar:
        current_split_size = 0
        current_split_number = 0
        split_tar = open_split(current_split_number)
        try:
            for member in original_tar.getmembers():
                # Start a new split when this member would exceed the limit
                # (but never leave an empty split behind)
                if current_split_size + member.size > max_split_size and current_split_size > 0:
                    split_tar.close()
                    current_split_number += 1
                    current_split_size = 0
                    split_tar = open_split(current_split_number)
                split_tar.addfile(member, original_tar.extractfile(member))
                current_split_size += member.size
        finally:
            # Always close the last split so the archive is finalized on disk
            split_tar.close()
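
One way to check the splitter is to confirm that the split archives together contain every member of the source archive. The sketch below is a hypothetical helper (`collect_member_names` is not part of the notebook). It uses only the standard `tarfile` module and assumes the `smaller_file_*.tar` naming used above.

```python
import glob
import os
import tarfile


def collect_member_names(path_or_dir):
    """Return the sorted member names of one tar file, or of all
    smaller_file_*.tar splits inside a directory."""
    if os.path.isdir(path_or_dir):
        paths = sorted(glob.glob(os.path.join(path_or_dir, "smaller_file_*.tar")))
    else:
        paths = [path_or_dir]
    names = []
    for path in paths:
        with tarfile.open(path, "r") as tar:
            names.extend(tar.getnames())
    return sorted(names)
```

After a call such as `split_tar_file(src, out_dir)`, `collect_member_names(src) == collect_member_names(out_dir)` should hold if no member was lost.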

### Set dataset type, format <a class="anchor" id="head-1.1"></a>

**For multi-class classification:**

We will use the Pascal VOC dataset for this tutorial. For details, see the [VOC2012 page](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html#devkit). Download the [dataset](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar) into the directory that the environment variable `$DATA_DIR` points to.

**If you use a custom dataset, it should follow the structure below; in that case, skip running** "**Split dataset into train and val sets**".
```
DATA_DIR
├── classes.txt
├── images_test
│   ├── class_name_1
│   │   ├── image_name_1.jpg
│   │   ├── image_name_2.jpg
│   │   ├── ...
│   │   ...
│   └── class_name_n
│       ├── image_name_3.jpg
│       ├── image_name_4.jpg
│       ├── ...
├── images_train
│   ├── class_name_1
│   │   ├── image_name_5.jpg
│   │   ├── image_name_6.jpg
│   │   ...
│   └── class_name_n
│       ├── image_name_7.jpg
│       ├── image_name_8.jpg
│       ├── ...
│
└── images_val
    ├── class_name_1
    │   ├── image_name_9.jpg
    │   ├── image_name_10.jpg
    │   ├── ...
    │   ...
    └── class_name_n
        ├── image_name_11.jpg
        ├── image_name_12.jpg
        ├── ...
```
- Each class name folder should contain the images corresponding to that class
- Same class name folders should be present across images_test, images_train and images_val
- classes.txt is a file that lists the names of all classes (one name per line)
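
The layout rules above can be checked before any archive is built. The following is a minimal sketch of such a check. `check_classification_layout` is a hypothetical helper, not part of TAO, and it assumes exactly the folder and file names shown in the tree above.

```python
import os


def check_classification_layout(data_dir):
    """Return a list of problems found in a multi-class dataset directory."""
    problems = []
    classes_file = os.path.join(data_dir, "classes.txt")
    if not os.path.isfile(classes_file):
        return [f"missing {classes_file}"]
    with open(classes_file) as f:
        expected = {line.strip() for line in f if line.strip()}
    for split in ("images_train", "images_val", "images_test"):
        split_dir = os.path.join(data_dir, split)
        if not os.path.isdir(split_dir):
            problems.append(f"missing directory {split}")
            continue
        found = {d for d in os.listdir(split_dir)
                 if os.path.isdir(os.path.join(split_dir, d))}
        # Every class in classes.txt needs a folder in every split
        for name in sorted(expected - found):
            problems.append(f"{split}: no folder for class {name}")
        for name in sorted(found - expected):
            problems.append(f"{split}: folder {name} is not listed in classes.txt")
    return problems
```

An empty list means the directory matches the expected structure; otherwise each entry names one mismatch.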

**For multi-task classification:**

We will use the `Fashion Product Images (Small)` dataset for this tutorial. It is available on Kaggle. In this tutorial, the trained classification network performs three tasks: article category classification, base color classification, and target season classification.

To download the dataset, you need a Kaggle account. After logging in, download the dataset zip file [here](https://www.kaggle.com/paramaggarwal/fashion-product-images-small). The downloaded file is archive.zip, which contains a subfolder called myntradataset. Unzip the contents of this subfolder into the workdir created in the cell above; you should then have a folder called images and a CSV file called styles.csv.

**If you use a custom dataset, it should follow this structure:**
```
DATA_DIR
├── images
│   ├── image_name_1.jpg
│   ├── image_name_2.jpg
│   └── ...
└── styles.csv
```

In [298]:
# Create train dataset
ds_type = "image_classification"
if model_name == "classification_pyt":
    ds_format = model_name
elif "classification_" in model_name:
    ds_format = "default"
elif model_name == "multitask_classification":
    ds_format = "custom"
print(ds_format)

default
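
The `if`/`elif` chain above leaves `ds_format` undefined when `model_name` matches none of the branches, so a typo only surfaces later as a `NameError`. A minimal sketch of the same mapping as a function that fails early follows. `dataset_format_for` is a hypothetical name, and the mapping covers only the four model names listed under FIXME 1.

```python
def dataset_format_for(model_name):
    """Map a model name to the dataset format used in this notebook."""
    if model_name == "classification_pyt":
        return model_name
    if model_name in ("classification_tf1", "classification_tf2"):
        return "default"
    if model_name == "multitask_classification":
        return "custom"
    raise ValueError(f"unsupported model_name: {model_name!r}")


print(dataset_format_for("classification_tf2"))  # default
```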


In [299]:
model_name

'classification_tf2'

In [300]:
dataset_to_be_used = "custom" #FIXME8 #default/custom; default for the dataset used in this tutorial notebook; custom for a different dataset
DATA_DIR = os.path.join(workdir, model_name, "source_data") # FIXME9
os.environ['DATA_DIR']= DATA_DIR
!mkdir -p $DATA_DIR
job_map = {}

In [301]:
os.path.abspath(DATA_DIR)

'/home/zyw/tao-env/siemens_product_cla/classification_tf2/source_data'

### Dataset download and pre-processing <a class="anchor" id="head-1"></a>

In [302]:
print(DATA_DIR)

/home/zyw/tao-env/siemens_product_cla/classification_tf2/source_data


In [303]:
# if dataset_to_be_used == "default":
#     if "classification_" in model_name:
#         assert os.path.exists(os.path.join(DATA_DIR,"VOCtrainval_11-May-2012.tar"))
#         !tar -xf $DATA_DIR/VOCtrainval_11-May-2012.tar -C $DATA_DIR
#         assert (os.path.exists(f"{DATA_DIR}/VOCdevkit/"))
#         !rm -rf $DATA_DIR/split
#     elif model_name == "multitask_classification":
#         assert os.path.exists(os.path.join(DATA_DIR,"archive.zip"))
#         !unzip -uq $DATA_DIR/archive.zip -d $DATA_DIR/
#         assert (os.path.exists(f"{DATA_DIR}/images"))
#         assert (os.path.exists(f"{DATA_DIR}/styles.csv"))
#         # Create subdirectories and remove existing files in them
#         !mkdir -p $DATA_DIR/images_train && rm -rf $DATA_DIR/images_train/*
#         !mkdir -p $DATA_DIR/images_val && rm -rf $DATA_DIR/images_val/*
#         !mkdir -p $DATA_DIR/images_test && rm -rf $DATA_DIR/images_test/*

#### Split dataset into train and val sets

In [304]:
model_name

'classification_tf2'

In [305]:
dataset_to_be_used

'custom'

In [306]:
# # Split dataset into train and val sets
# !python3 -m pip install numpy pandas==1.5.1 tqdm
# if "classification_" in model_name and dataset_to_be_used == "default":
#     !python3 ../dataset_prepare/classification/dataset_split.py
#     assert (os.path.exists(f"{DATA_DIR}/split/images_train/"))
#     assert (os.path.exists(f"{DATA_DIR}/split/images_val/"))
#     assert (os.path.exists(f"{DATA_DIR}/split/images_test/"))
# elif model_name == "multitask_classification" and dataset_to_be_used == "default":
#     !python3 ../dataset_prepare/multitask_classification/dataset_split.py --max_images 10000
#     assert (os.path.exists(f"{DATA_DIR}/images_train/"))
#     assert (os.path.exists(f"{DATA_DIR}/images_val/"))
#     assert (os.path.exists(f"{DATA_DIR}/images_test/"))
#     assert (os.path.exists(f"{DATA_DIR}/train.csv"))
#     assert (os.path.exists(f"{DATA_DIR}/val.csv"))

### Split my custom Siemens dataset into train, val and test datasets

In [307]:
import os
import glob
import shutil
from tqdm import tqdm
import random

# Set up directories
DATA_DIR = os.environ.get('DATA_DIR')
SOURCE_DIR = DATA_DIR
IMAGES_TRAIN_DIR = os.path.join(os.path.dirname(DATA_DIR), 'split', 'images_train')
IMAGES_VAL_DIR = os.path.join(os.path.dirname(DATA_DIR), 'split', 'images_val')
IMAGES_TEST_DIR = os.path.join(os.path.dirname(DATA_DIR), 'split', 'images_test')

# Get class names from the directory structure
class_names = [d for d in os.listdir(SOURCE_DIR) if os.path.isdir(os.path.join(SOURCE_DIR, d))]
# Create directories if they don't exist
for dir_path in [IMAGES_TRAIN_DIR, IMAGES_VAL_DIR, IMAGES_TEST_DIR]:
    print("Path:", dir_path)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
        
# Create classes.txt file
with open(os.path.join(os.path.dirname(IMAGES_TRAIN_DIR), 'classes.txt'), 'w') as f:
    for class_name in class_names:
        f.write(f"{class_name}\n")

print("Class names:", class_names)


Path: /home/zyw/tao-env/siemens_product_cla/classification_tf2/split/images_train
Path: /home/zyw/tao-env/siemens_product_cla/classification_tf2/split/images_val
Path: /home/zyw/tao-env/siemens_product_cla/classification_tf2/split/images_test
Class names: ['ET200ecoPN', 'ET200sp', 'S7_1200', 'ET200AL', 'S7_1500']


In [308]:
from collections import defaultdict

In [309]:
def get_source_distribution(src_dir):
    class_distribution = {}
    total_images = 0
    for class_name in class_names:
        assert os.path.exists(os.path.join(src_dir, class_name))
        images = glob.glob(os.path.join(src_dir, class_name, '*.*'))
        class_distribution[class_name] = len(images)
        total_images += len(images)
    return class_distribution, total_images

# Get and print source distribution
source_distribution, source_total = get_source_distribution(SOURCE_DIR)
print("\nOriginal Source Dataset Distribution:")
print(f"{'Class':<15} {'Count':<10} {'Percentage':<10}")
print("-" * 35)
for class_name, count in source_distribution.items():
    percentage = count / source_total * 100
    print(f"{class_name:<15} {count:<10} {percentage:.2f}%")
print(f"{'Total':<15} {source_total:<10} 100.00%")


Original Source Dataset Distribution:
Class           Count      Percentage
-----------------------------------
ET200ecoPN      226        22.07%
ET200sp         193        18.85%
S7_1200         189        18.46%
ET200AL         253        24.71%
S7_1500         163        15.92%
Total           1024       100.00%


In [310]:
# Function to split and copy images
def split_and_copy(src_dir, train_dir, val_dir, test_dir, train_ratio=0.7, val_ratio=0.15):
    class_distribution = defaultdict(lambda: defaultdict(int))
    total_images = 0

    for class_name in class_names:
        images = glob.glob(os.path.join(src_dir, class_name, '*.*'))
        total_images += len(images)
        for dir_path in [train_dir, val_dir, test_dir]:
            os.makedirs(os.path.join(dir_path, class_name), exist_ok=True)

    for class_name in tqdm(class_names, desc="Splitting dataset"):
        images = glob.glob(os.path.join(src_dir, class_name, '*.*'))
        random.shuffle(images)
        
        for img in images:
            rand_val = random.random()
            if rand_val < train_ratio:
                dest_dir = train_dir
                split_name = 'train'
            elif rand_val < train_ratio + val_ratio:
                dest_dir = val_dir
                split_name = 'val'
            else:
                dest_dir = test_dir
                split_name = 'test'
            
            shutil.copy2(img, os.path.join(dest_dir, class_name))
            class_distribution[class_name][split_name] += 1

    return class_distribution, total_images

# Split the dataset
class_distribution, total_images = split_and_copy(SOURCE_DIR, IMAGES_TRAIN_DIR, IMAGES_VAL_DIR, IMAGES_TEST_DIR)

print("\nClass Distribution:")
print(f"{'Class':<15} {'Train':<20} {'Val':<20} {'Test':<20} {'Total':<10}")
print("-" * 85)

empty_classes = []

for class_name in class_names:
    train_count = class_distribution[class_name]['train']
    val_count = class_distribution[class_name]['val']
    test_count = class_distribution[class_name]['test']
    total_count = train_count + val_count + test_count
    
    if total_count == 0:
        empty_classes.append(class_name)
        print(f"{class_name:<15} No images found")
    else:
        print(f"{class_name:<15} "
              f"{train_count:<10} ({train_count/total_count:.1%}) "
              f"{val_count:<10} ({val_count/total_count:.1%}) "
              f"{test_count:<10} ({test_count/total_count:.1%}) "
              f"{total_count:<10}")

# Count images in each split
train_count = sum(class_distribution[c]['train'] for c in class_names)
val_count = sum(class_distribution[c]['val'] for c in class_names)
test_count = sum(class_distribution[c]['test'] for c in class_names)

print(f"\nTotal images: Train: {train_count} ({train_count/total_images:.1%}), "
      f"Validation: {val_count} ({val_count/total_images:.1%}), "
      f"Test: {test_count} ({test_count/total_images:.1%})")

if empty_classes:
    print("\nWarning: The following classes have no images:")
    for class_name in empty_classes:
        print(f"- {class_name}")

print('\nDataset preparation completed.')

Splitting dataset: 100%|███████████████████████████████████████████████| 5/5 [00:00<00:00, 21.55it/s]


Class Distribution:
Class           Train                Val                  Test                 Total     
-------------------------------------------------------------------------------------
ET200ecoPN      171        (75.7%) 30         (13.3%) 25         (11.1%) 226       
ET200sp         134        (69.4%) 26         (13.5%) 33         (17.1%) 193       
S7_1200         130        (68.8%) 29         (15.3%) 30         (15.9%) 189       
ET200AL         173        (68.4%) 47         (18.6%) 33         (13.0%) 253       
S7_1500         109        (66.9%) 31         (19.0%) 23         (14.1%) 163       

Total images: Train: 717 (70.0%), Validation: 163 (15.9%), Test: 144 (14.1%)

Dataset preparation completed.
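
In the output above, the per-class train share ranges from 66.9% to 75.7%, because `split_and_copy` assigns each image by an independent random draw and uses no seed. If reproducible splits with near-exact ratios are preferred, one option is to shuffle once with a fixed seed and slice. The sketch below is an alternative to the notebook's method, not the method it uses. `split_exact` and its `seed` parameter are hypothetical.

```python
import random


def split_exact(items, train_ratio=0.7, val_ratio=0.15, seed=0):
    """Shuffle items with a fixed seed and cut them into train/val/test
    slices whose sizes follow the ratios as closely as integer counts allow."""
    items = sorted(items)  # stable starting order, independent of glob order
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_ratio)
    n_val = round(len(items) * val_ratio)
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],
    }


# Example with 226 placeholder file names (the size of the largest-share class above)
parts = split_exact([f"img_{i}.jpg" for i in range(226)])
print({name: len(files) for name, files in parts.items()})
```

Applied per class, this keeps every class close to the requested 70/15/15 split and gives the same assignment on every run with the same seed.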





### Create Tar files to upload

In [311]:
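# Move DATA_DIR up one level, from .../source_data to the model folder.
# Run this cell only once: each run strips another path component.
DATA_DIR = os.path.dirname(DATA_DIR)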

In [312]:
DATA_DIR

'/home/zyw/tao-env/siemens_product_cla/classification_tf2'

In [313]:
# Write the archives to a folder called split_tar in DATA_DIR, creating it if it does not exist
split_tar_dir = os.path.join(DATA_DIR, 'split_tar')
os.makedirs(split_tar_dir, exist_ok=True)

if "classification_" in model_name:
    !tar -C $DATA_DIR/split/ -czf $split_tar_dir/classification_train.tar.gz images_train classes.txt
    !tar -C $DATA_DIR/split/ -czf $split_tar_dir/classification_val.tar.gz images_val classes.txt
    !tar -C $DATA_DIR/split/ -czf $split_tar_dir/classification_test.tar.gz images_test classes.txt
elif model_name == "multitask_classification":
    !tar -C $DATA_DIR/ -czf $split_tar_dir/mt_classification_train.tar.gz images_train train.csv val.csv
    !tar -C $DATA_DIR/ -czf $split_tar_dir/mt_classification_val.tar.gz images_val val.csv
    !tar -C $DATA_DIR/ -czf $split_tar_dir/mt_classification_test.tar.gz images_test

In [314]:

if "classification_" in model_name:
    train_dataset_path =  os.path.join(split_tar_dir, "classification_train.tar.gz")
    eval_dataset_path = os.path.join(split_tar_dir, "classification_val.tar.gz")
    test_dataset_path = os.path.join(split_tar_dir, "classification_test.tar.gz")
elif model_name == "multitask_classification":
    train_dataset_path =  os.path.join(split_tar_dir, "mt_classification_train.tar.gz")
    eval_dataset_path = os.path.join(split_tar_dir, "mt_classification_val.tar.gz")
    test_dataset_path = os.path.join(split_tar_dir, "mt_classification_test.tar.gz")

In [315]:
os.path.abspath(train_dataset_path)

'/home/zyw/tao-env/siemens_product_cla/classification_tf2/split_tar/classification_train.tar.gz'
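
The `!tar -C ... -czf ...` commands above depend on a shell with `tar` available. A rough equivalent using only Python's standard `tarfile` module is sketched below. `make_tar_gz` is a hypothetical helper, and it mirrors `tar -C` only in that member names are stored relative to `source_dir`.

```python
import os
import tarfile


def make_tar_gz(source_dir, output_path, members):
    """Create a gzip-compressed tar at output_path containing the named
    files or folders, stored relative to source_dir."""
    os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True)
    with tarfile.open(output_path, "w:gz") as tar:
        for name in members:
            path = os.path.join(source_dir, name)
            if not os.path.exists(path):
                raise FileNotFoundError(path)
            # arcname keeps the archive paths relative, e.g. images_train/...
            tar.add(path, arcname=name)
    return output_path
```

For the train archive, a call might look like `make_tar_gz(os.path.join(DATA_DIR, "split"), train_dataset_path, ["images_train", "classes.txt"])`.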

### Create and upload train dataset <a class="anchor" id="head-1.2"></a>

In [316]:
data = json.dumps({"type":ds_type,"format":ds_format})

endpoint = f"{base_url}/datasets"

response = requests.post(endpoint,data=data,headers=headers)
assert response.status_code in (200, 201)


print(response)
print(response.json())
assert "id" in response.json().keys()
train_dataset_id = response.json()["id"]

<Response [201]>
{'actions': [], 'client_url': None, 'created_on': '2024-08-21T05:34:56.135928', 'description': 'My TAO Dataset', 'docker_env_vars': {}, 'format': 'default', 'id': '1c63db8b-5281-4aea-95dd-8f7b029039dc', 'jobs': [], 'last_modified': '2024-08-21T05:34:56.135944', 'logo': 'https://www.nvidia.com', 'name': 'My Dataset', 'pull': None, 'status': 'not_present', 'type': 'image_classification', 'version': '1.0.0'}


In [317]:
# Update
dataset_information = {"name":"Siemens Train Dataset",
                       "description":"My train dataset"}
data = json.dumps(dataset_information)

endpoint = f"{base_url}/datasets/{train_dataset_id}"

response = requests.patch(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)


print(response)
print(response.json())

<Response [200]>
{'actions': [], 'client_url': None, 'created_on': '2024-08-21T05:34:56.135928', 'description': 'My train dataset', 'docker_env_vars': {}, 'format': 'default', 'id': '1c63db8b-5281-4aea-95dd-8f7b029039dc', 'jobs': [], 'last_modified': '2024-08-21T05:35:28.731986', 'logo': 'https://www.nvidia.com', 'name': 'Siemens Train Dataset', 'pull': None, 'status': 'not_present', 'type': 'image_classification', 'version': '1.0.0'}


In [318]:
# Upload
output_dir = os.path.join(os.path.dirname(os.path.abspath(train_dataset_path)), model_name, "train")
split_tar_file(train_dataset_path, output_dir)
tar_splits = sorted(os.listdir(output_dir))
for idx, tar_dataset_path in enumerate(tar_splits):
    print(f"Uploading {idx+1}/{len(tar_splits)} tar split")

    endpoint = f"{base_url}/datasets/{train_dataset_id}:upload"

    # Open each split in a context manager so the file handle is closed after the request
    with open(os.path.join(output_dir, tar_dataset_path), "rb") as tar_file:
        files = [("file", tar_file)]
        response = requests.post(endpoint, files=files, headers=headers)
    assert response.status_code in (200, 201)
    # The expected message reproduces the server's own spelling ("recieved")
    assert "message" in response.json().keys() and response.json()["message"] == "Server recieved file and upload process started"

    print(response)
    print(response.json())

Uploading 1/1 tar split
<Response [201]>
{'message': 'Server recieved file and upload process started'}


In [319]:
print(output_dir)

/home/zyw/tao-env/siemens_product_cla/classification_tf2/split_tar/classification_tf2/train
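
The upload response only says that the upload process has started, and the dataset record created earlier reports `'status': 'not_present'`, so the dataset may not be usable immediately. A generic polling loop is sketched below. `wait_for_status` is a hypothetical helper. The status value that signals completion does not appear in this section, so it is passed in as a parameter rather than assumed.

```python
import time


def wait_for_status(fetch_status, ready_values, timeout=600, interval=5, sleep=time.sleep):
    """Call fetch_status() until it returns a value in ready_values.

    Returns the final status, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in ready_values:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"last status was {status!r}")
        sleep(interval)
```

Here `fetch_status` could be a function that issues a GET on `f"{base_url}/datasets/{train_dataset_id}"` and returns the `status` field, assuming that endpoint returns the same fields as the PATCH response shown earlier.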


### Create and upload val dataset <a class="anchor" id="head-1.3"></a>

In [320]:
# Create eval dataset
ds_type = "image_classification"
if model_name == "classification_pyt":
    ds_format = model_name
else:
    ds_format = "default"
data = json.dumps({"type":ds_type,"format":ds_format})

endpoint = f"{base_url}/datasets"

response = requests.post(endpoint,data=data,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())
assert "id" in response.json().keys()
eval_dataset_id = response.json()["id"]

<Response [201]>
{'actions': [], 'client_url': None, 'created_on': '2024-08-21T05:35:46.334652', 'description': 'My TAO Dataset', 'docker_env_vars': {}, 'format': 'default', 'id': 'd79beccd-32d3-4c7b-a936-4574f52ec94b', 'jobs': [], 'last_modified': '2024-08-21T05:35:46.334666', 'logo': 'https://www.nvidia.com', 'name': 'My Dataset', 'pull': None, 'status': 'not_present', 'type': 'image_classification', 'version': '1.0.0'}


In [321]:
# Update
dataset_information = {"name":"Siemens Eval dataset",
                       "description":"S eval dataset"}
data = json.dumps(dataset_information)

endpoint = f"{base_url}/datasets/{eval_dataset_id}"

response = requests.patch(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())

<Response [200]>
{'actions': [], 'client_url': None, 'created_on': '2024-08-21T05:35:46.334652', 'description': 'S eval dataset', 'docker_env_vars': {}, 'format': 'default', 'id': 'd79beccd-32d3-4c7b-a936-4574f52ec94b', 'jobs': [], 'last_modified': '2024-08-21T05:35:48.708111', 'logo': 'https://www.nvidia.com', 'name': 'Siemens Eval dataset', 'pull': None, 'status': 'not_present', 'type': 'image_classification', 'version': '1.0.0'}


In [322]:
# Upload
output_dir = os.path.join(os.path.dirname(os.path.abspath(eval_dataset_path)), model_name, "eval")
split_tar_file(eval_dataset_path, output_dir)
tar_splits = sorted(os.listdir(output_dir))
for idx, tar_dataset_path in enumerate(tar_splits):
    print(f"Uploading {idx+1}/{len(tar_splits)} tar split")

    endpoint = f"{base_url}/datasets/{eval_dataset_id}:upload"

    # Open each split in a context manager so the file handle is closed after the request
    with open(os.path.join(output_dir, tar_dataset_path), "rb") as tar_file:
        files = [("file", tar_file)]
        response = requests.post(endpoint, files=files, headers=headers)
    assert response.status_code in (200, 201)
    # The expected message reproduces the server's own spelling ("recieved")
    assert "message" in response.json().keys() and response.json()["message"] == "Server recieved file and upload process started"

    print(response)
    print(response.json())

Uploading 1/1 tar split
<Response [201]>
{'message': 'Server recieved file and upload process started'}


### Create and upload test dataset <a class="anchor" id="head-1.4"></a>

In [343]:
# Assuming the JSON object is stored in the variable 'response_json'
pretty_json = json.dumps(response_json, indent=4)

print(pretty_json)

[
    {
        "actions": [
            "train",
            "evaluate",
            "prune",
            "retrain",
            "export",
            "gen_trt_engine",
            "inference"
        ],
        "additional_id_info": null,
        "automl_add_hyperparameters": "[]",
        "automl_algorithm": null,
        "automl_enabled": false,
        "automl_remove_hyperparameters": "[]",
        "base_experiment": [],
        "calibration_dataset": "1c63db8b-5281-4aea-95dd-8f7b029039dc",
        "checkpoint_choose_method": "best_model",
        "checkpoint_epoch_number": {},
        "created_on": "2024-08-21T05:42:01.537074",
        "dataset_type": "image_classification",
        "description": "My Experiments",
        "docker_env_vars": {},
        "encryption_key": "nvidia_tlt",
        "eval_dataset": "d79beccd-32d3-4c7b-a936-4574f52ec94b",
        "id": "d8ac076a-6bef-48cb-9dee-21e24d7cd615",
        "inference_dataset": "aa022042-11c4-44c7-968d-0a5301059500",
        "is

In [326]:
# Create testing dataset for inference
ds_type = "image_classification"
ds_format = "default"
data = json.dumps({"type":ds_type,"format":ds_format})

endpoint = f"{base_url}/datasets"

response = requests.post(endpoint,data=data,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())
assert "id" in response.json().keys()
test_dataset_id = response.json()["id"]

<Response [201]>
{'actions': [], 'client_url': None, 'created_on': '2024-08-21T05:39:43.333864', 'description': 'My TAO Dataset', 'docker_env_vars': {}, 'format': 'default', 'id': 'aa022042-11c4-44c7-968d-0a5301059500', 'jobs': [], 'last_modified': '2024-08-21T05:39:43.333874', 'logo': 'https://www.nvidia.com', 'name': 'My Dataset', 'pull': None, 'status': 'not_present', 'type': 'image_classification', 'version': '1.0.0'}


In [327]:
# Upload
output_dir = os.path.join(os.path.dirname(os.path.abspath(test_dataset_path)), model_name, "test")
split_tar_file(test_dataset_path, output_dir)
tar_splits = sorted(os.listdir(output_dir))
for idx, tar_dataset_path in enumerate(tar_splits):
    print(f"Uploading {idx+1}/{len(tar_splits)} tar split")

    endpoint = f"{base_url}/datasets/{test_dataset_id}:upload"

    # Open each split in a context manager so the file handle is closed after the request
    with open(os.path.join(output_dir, tar_dataset_path), "rb") as tar_file:
        files = [("file", tar_file)]
        response = requests.post(endpoint, files=files, headers=headers)
    assert response.status_code in (200, 201)
    # The expected message reproduces the server's own spelling ("recieved")
    assert "message" in response.json().keys() and response.json()["message"] == "Server recieved file and upload process started"

    print(response)
    print(response.json())

Uploading 1/1 tar split
<Response [201]>
{'message': 'Server recieved file and upload process started'}


### List the created datasets <a class="anchor" id="head-2"></a>

In [328]:
endpoint = f"{base_url}/datasets"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
# print(response.json()) ## Uncomment for verbose list output
print("id\t\t\t\t\t type\t\t\t format\t\t name")
for rsp in response.json():
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])

<Response [200]>
id					 type			 format		 name
aa022042-11c4-44c7-968d-0a5301059500 	 image_classification 	 default 		 My Dataset
b43ccf4a-90cc-4a7e-a75d-a7baf303a040 	 image_classification 	 raw 		 My Dataset
d79beccd-32d3-4c7b-a936-4574f52ec94b 	 image_classification 	 default 		 Siemens Eval dataset
1c63db8b-5281-4aea-95dd-8f7b029039dc 	 image_classification 	 default 		 Siemens Train Dataset
75d2d437-b225-410e-ab66-9eb7e769cc3a 	 image_classification 	 raw 		 My Dataset
1e9c8565-7e5c-42e7-a137-e811f1e05ddb 	 image_classification 	 default 		 Siemens Eval dataset
3a87ca28-3129-4f80-9a5a-2800434acace 	 image_classification 	 default 		 Siemens Train Dataset
2e94065b-c972-4002-98b9-480c09d8f98a 	 image_classification 	 raw 		 My Dataset
112b79fa-e93e-4487-89a1-43b5954fdf71 	 image_classification 	 default 		 Siemens Eval dataset
7998bcf9-1e40-4e3c-9b6c-675b60d24e6d 	 image_classification 	 default 		 Siemens Train Dataset
7162202a-aa48-43c6-a2dd-4a5a1c0e5976 	 image_classification 	 
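
The tab-separated listing above drifts out of alignment when names or formats differ in length. A small formatter that pads each column to its widest value is sketched below. `format_dataset_table` is a hypothetical helper, and it assumes only the four keys already asserted in the cell above.

```python
def format_dataset_table(datasets, columns=("id", "type", "format", "name")):
    """Render a list of dataset dicts as fixed-width text columns."""
    rows = [[str(ds.get(col, "")) for col in columns] for ds in datasets]
    # Each column is as wide as its header or its longest value
    widths = [max([len(col)] + [len(row[i]) for row in rows])
              for i, col in enumerate(columns)]
    lines = ["  ".join(col.ljust(w) for col, w in zip(columns, widths))]
    for row in rows:
        lines.append("  ".join(cell.ljust(w) for cell, w in zip(row, widths)))
    return "\n".join(line.rstrip() for line in lines)
```

In the notebook, `print(format_dataset_table(response.json()))` would replace the manual `print` loop.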

### Create an experiment <a class="anchor" id="head-4"></a>

In [362]:
if "classification" in model_name:
    encode_key = "nvidia_tlt"
else:
    encode_key = "tlt_encode"
checkpoint_choose_method = "best_model"
data = json.dumps({"network_arch":model_name,"encryption_key":encode_key,"checkpoint_choose_method":checkpoint_choose_method})

endpoint = f"{base_url}/experiments"

response = requests.post(endpoint,data=data,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())
assert "id" in response.json().keys()
experiment_id = response.json()["id"]

<Response [201]>
{'actions': ['train', 'evaluate', 'prune', 'retrain', 'export', 'gen_trt_engine', 'inference'], 'additional_id_info': None, 'automl_add_hyperparameters': '[]', 'automl_algorithm': None, 'automl_enabled': False, 'automl_remove_hyperparameters': '[]', 'base_experiment': [], 'calibration_dataset': None, 'checkpoint_choose_method': 'best_model', 'checkpoint_epoch_number': {}, 'created_on': '2024-08-21T06:43:41.391967', 'dataset_type': 'image_classification', 'description': 'My Experiments', 'docker_env_vars': {}, 'encryption_key': 'nvidia_tlt', 'eval_dataset': None, 'id': 'd862a1b9-1305-4ac9-9d1f-a9e94e2b9360', 'inference_dataset': None, 'is_ptm_backbone': True, 'jobs': [], 'last_modified': '2024-08-21T06:43:41.391984', 'logo': 'https://www.nvidia.com', 'metric': None, 'model_params': {}, 'name': 'My Experiment', 'network_arch': 'classification_tf2', 'ngc_path': '', 'public': False, 'read_only': False, 'realtime_infer': False, 'realtime_infer_request_timeout': 60, 'realtim

In [364]:
response.json()

{'actions': ['train',
  'evaluate',
  'prune',
  'retrain',
  'export',
  'gen_trt_engine',
  'inference'],
 'additional_id_info': None,
 'automl_add_hyperparameters': '[]',
 'automl_algorithm': None,
 'automl_enabled': False,
 'automl_remove_hyperparameters': '[]',
 'base_experiment': [],
 'calibration_dataset': None,
 'checkpoint_choose_method': 'best_model',
 'checkpoint_epoch_number': {},
 'created_on': '2024-08-21T06:43:41.391967',
 'dataset_type': 'image_classification',
 'description': 'My Experiments',
 'docker_env_vars': {},
 'encryption_key': 'nvidia_tlt',
 'eval_dataset': None,
 'id': 'd862a1b9-1305-4ac9-9d1f-a9e94e2b9360',
 'inference_dataset': None,
 'is_ptm_backbone': True,
 'jobs': [],
 'last_modified': '2024-08-21T06:43:41.391984',
 'logo': 'https://www.nvidia.com',
 'metric': None,
 'model_params': {},
 'name': 'My Experiment',
 'network_arch': 'classification_tf2',
 'ngc_path': '',
 'public': False,
 'read_only': False,
 'realtime_infer': False,
 'realtime_infer_reque

### List experiments <a class="anchor" id="head-5"></a>

In [365]:
endpoint = f"{base_url}/experiments"
params = {"network_arch": model_name}
response = requests.get(endpoint, params=params, headers=headers)
assert response.status_code in (200, 201)

print(response)
# print(response.json()) ## Uncomment for verbose list output
print("model id\t\t\t     network architecture")
for rsp in response.json():
    rsp_keys = rsp.keys()
    assert "name" in rsp_keys and "id" in rsp_keys and "network_arch" in rsp_keys
    print(rsp["name"], rsp["id"], rsp["network_arch"])

<Response [200]>
model id			     network architecture
My Experiment d862a1b9-1305-4ac9-9d1f-a9e94e2b9360 classification_tf2
My Experiment 9b2e365c-8726-4ef6-8b15-f0aa9a047476 classification_tf2
My Experiment d8ac076a-6bef-48cb-9dee-21e24d7cd615 classification_tf2
My Experiment 7cfe7c67-27a9-4fb7-8bd1-983c10be635e classification_tf2
My Experiment a92eff4a-2c58-48d2-8cde-6499d3c0c939 classification_tf2
My Experiment e7d878d7-ab21-4890-8d9f-76ade7be1c2a classification_tf2
TAO Pretrained NVImageNet Classification backbone 82b75315-2ae3-514b-82b3-00e781a11517 classification_tf2
TAO Pretrained NVImageNet Classification backbone 83f99c5a-cf58-53e5-9cc0-1185f30007c0 classification_tf2
TAO Pretrained NVImageNet Classification backbone 67e24d03-2ca2-59cd-b654-ec203303a339 classification_tf2
TAO Pretrained NVImageNet Classification backbone 4b9bc192-924e-5d49-895a-e55db53a6b4f classification_tf2
TAO Pretrained NVImageNet Classification backbone 6f64702e-624a-57e6-bba4-9809eddb6ab0 classification_

### Assign train, eval datasets <a class="anchor" id="head-6"></a>

In [366]:
dataset_information = {"train_datasets":[train_dataset_id],
                       "eval_dataset":eval_dataset_id,
                       "inference_dataset":test_dataset_id,
                       "calibration_dataset":train_dataset_id}
print(dataset_information)
data = json.dumps(dataset_information)

endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.patch(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())

{'train_datasets': ['1c63db8b-5281-4aea-95dd-8f7b029039dc'], 'eval_dataset': 'd79beccd-32d3-4c7b-a936-4574f52ec94b', 'inference_dataset': 'aa022042-11c4-44c7-968d-0a5301059500', 'calibration_dataset': '1c63db8b-5281-4aea-95dd-8f7b029039dc'}
<Response [200]>
{'actions': ['train', 'evaluate', 'prune', 'retrain', 'export', 'gen_trt_engine', 'inference'], 'additional_id_info': None, 'automl_add_hyperparameters': '[]', 'automl_algorithm': None, 'automl_enabled': False, 'automl_remove_hyperparameters': '[]', 'base_experiment': [], 'calibration_dataset': '1c63db8b-5281-4aea-95dd-8f7b029039dc', 'checkpoint_choose_method': 'best_model', 'checkpoint_epoch_number': {}, 'created_on': '2024-08-21T06:43:41.391967', 'dataset_type': 'image_classification', 'description': 'My Experiments', 'docker_env_vars': {}, 'encryption_key': 'nvidia_tlt', 'eval_dataset': 'd79beccd-32d3-4c7b-a936-4574f52ec94b', 'id': 'd862a1b9-1305-4ac9-9d1f-a9e94e2b9360', 'inference_dataset': 'aa022042-11c4-44c7-968d-0a5301059500'

In [263]:
job_map

{}

### Assign PTM <a class="anchor" id="head-7"></a>

Search NGC for the pretrained model (PTM) that matches the chosen classification network.

In [333]:
params

{'network_arch': 'classification_tf2'}

In [346]:
# List all pretrained models for the chosen network architecture
endpoint = f"{base_url}/experiments"
params = {"network_arch": model_name}
response = requests.get(endpoint, params=params, headers=headers)
assert response.status_code in (200, 201)

response_json = response.json()

for rsp in response_json:
    rsp_keys = rsp.keys()
    if "encryption_key" not in rsp.keys():
        assert "name" in rsp_keys and "version" in rsp_keys and "ngc_path" in rsp_keys and "additional_id_info" in rsp_keys
        print(f'PTM Name: {rsp["name"]}; PTM version: {rsp["version"]}; NGC PATH: {rsp["ngc_path"]}; Additional info: {rsp["additional_id_info"]}')

PTM Name: TAO Pretrained NVImageNet Classification backbone; PTM version: efficientnet-b4; NGC PATH: nvidia/tao/pretrained_efficientdet_tf2_nvimagenet:efficientnet-b4; Additional info: None
PTM Name: TAO Pretrained NVImageNet Classification backbone; PTM version: efficientnet-b5; NGC PATH: nvidia/tao/pretrained_efficientdet_tf2_nvimagenet:efficientnet-b5; Additional info: None
PTM Name: TAO Pretrained NVImageNet Classification backbone; PTM version: efficientnet-b3; NGC PATH: nvidia/tao/pretrained_efficientdet_tf2_nvimagenet:efficientnet-b3; Additional info: None
PTM Name: TAO Pretrained NVImageNet Classification backbone; PTM version: efficientnet-b2; NGC PATH: nvidia/tao/pretrained_efficientdet_tf2_nvimagenet:efficientnet-b2; Additional info: None
PTM Name: TAO Pretrained NVImageNet Classification backbone; PTM version: efficientnet-b1; NGC PATH: nvidia/tao/pretrained_efficientdet_tf2_nvimagenet:efficientnet-b1; Additional info: None
PTM Name: TAO Pretrained NVImageNet Classification

In [347]:
# Map each classification network to its default pretrained model (PTM)
# To use a different PTM backbone, pick one from the output of the previous cell and update this map.
# Changing the backbone also requires matching changes to the default spec/config for train, evaluate, etc.
# For example, if you switch the PTM to resnet34, manually set the config key num_layers (if it exists) to 34.
pretrained_map = {"classification_tf1" : "pretrained_classification:resnet18",
                  "classification_tf2" : "pretrained_classification_tf2:efficientnet_b0",
                  "classification_pyt" : "pretrained_fan_classification_imagenet:fan_hybrid_tiny",
                  "multitask_classification" : "pretrained_classification:resnet10"}
no_ptm_models = set([])

In [367]:
# Get pretrained model for classification
if model_name not in no_ptm_models:
    endpoint = f"{base_url}/experiments"
    params = {"network_arch": model_name}
    response = requests.get(endpoint, params=params, headers=headers)
    assert response.status_code in (200, 201)

    response_json = response.json()

    # Search for ptm with given ngc path
    ptm = []
    for rsp in response_json:
        rsp_keys = rsp.keys()  # keys of the current entry
        assert "ngc_path" in rsp_keys
        if rsp["ngc_path"].endswith(pretrained_map[model_name]):
            assert "id" in rsp_keys
            ptm_id = rsp["id"]
            ptm = [ptm_id]
            print("Metadata for model with requested NGC Path")
            print(rsp)
            break

Metadata for model with requested NGC Path
{'actions': ['train', 'evaluate', 'prune', 'retrain', 'export', 'gen_trt_engine', 'inference'], 'additional_id_info': None, 'base_experiment': [], 'base_experiment_pull_complete': 'present', 'calibration_dataset': None, 'checkpoint_choose_method': 'best_model', 'checkpoint_epoch_number': {'id': 0}, 'created_on': '2022-12-08T23:34:10.884Z', 'dataset_type': 'image_classification', 'description': 'Pretrained backbones for TAO Toolkit TF2 image classification', 'eval_dataset': None, 'id': 'cd5a8d37-132f-5811-b84c-c16ffe61b9e3', 'inference_dataset': None, 'is_ptm_backbone': True, 'last_modified': '2022-12-08T23:40:30.534Z', 'logo': 'https://www.nvidia.com', 'name': 'TAO Pretrained Classification', 'network_arch': 'classification_tf2', 'ngc_path': 'nvidia/tao/pretrained_classification_tf2:efficientnet_b0', 'public': True, 'read_only': True, 'realtime_infer_support': False, 'sha256_digest': {}, 'train_datasets': [], 'type': 'vision', 'version': 'effi
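
The lookup above leaves `ptm` empty without any message when no `ngc_path` matches. A minimal sketch of the same search as a pure function, which makes the no-match case explicit; `find_ptm_id` is a hypothetical helper name, and the sample entries only reuse ids and paths that appear in the outputs of this notebook:

```python
def find_ptm_id(experiments, ngc_suffix):
    """Return the id of the first entry whose ngc_path ends with ngc_suffix, else None."""
    for entry in experiments:
        # ngc_path may be missing or empty for user-created experiments
        if (entry.get("ngc_path") or "").endswith(ngc_suffix):
            return entry.get("id")
    return None

sample = [
    {"id": "d862a1b9-1305-4ac9-9d1f-a9e94e2b9360", "ngc_path": ""},
    {"id": "cd5a8d37-132f-5811-b84c-c16ffe61b9e3",
     "ngc_path": "nvidia/tao/pretrained_classification_tf2:efficientnet_b0"},
]
ptm_id = find_ptm_id(sample, "pretrained_classification_tf2:efficientnet_b0")
ptm = [ptm_id] if ptm_id else []
print(ptm)
```

Returning `None` lets the caller decide whether a missing PTM should stop the notebook or fall back to training without one.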

In [349]:
model_name

'classification_tf2'

In [350]:
data

'{"network_arch": "classification_tf2", "encryption_key": "nvidia_tlt", "checkpoint_choose_method": "best_model"}'

In [368]:
if model_name not in no_ptm_models:
    ptm_information = {"base_experiment":ptm}
    data = json.dumps(ptm_information)

    endpoint = f"{base_url}/experiments/{experiment_id}"

    response = requests.patch(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    print(response.json())

<Response [200]>
{'actions': ['train', 'evaluate', 'prune', 'retrain', 'export', 'gen_trt_engine', 'inference'], 'additional_id_info': None, 'automl_add_hyperparameters': '[]', 'automl_algorithm': None, 'automl_enabled': False, 'automl_remove_hyperparameters': '[]', 'base_experiment': ['cd5a8d37-132f-5811-b84c-c16ffe61b9e3'], 'calibration_dataset': '1c63db8b-5281-4aea-95dd-8f7b029039dc', 'checkpoint_choose_method': 'best_model', 'checkpoint_epoch_number': {}, 'created_on': '2024-08-21T06:43:41.391967', 'dataset_type': 'image_classification', 'description': 'My Experiments', 'docker_env_vars': {}, 'encryption_key': 'nvidia_tlt', 'eval_dataset': 'd79beccd-32d3-4c7b-a936-4574f52ec94b', 'id': 'd862a1b9-1305-4ac9-9d1f-a9e94e2b9360', 'inference_dataset': 'aa022042-11c4-44c7-968d-0a5301059500', 'is_ptm_backbone': True, 'jobs': [], 'last_modified': '2024-08-21T06:43:41.391984', 'logo': 'https://www.nvidia.com', 'metric': None, 'model_params': {}, 'name': 'My Experiment', 'network_arch': 'class

In [353]:
automl_specs

['train.reg_config.type',
 'train.optim_config.lr',
 'train.optim_config.beta_1',
 'train.optim_config.nesterov']

### View hyperparameters that are enabled for AutoML by default <a class="anchor" id="head-8"></a>

In [369]:
if automl_enabled:
    # Get default spec schema
    endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    assert "automl_default_parameters" in response.json().keys()
    automl_specs = response.json()["automl_default_parameters"]
    print(json.dumps(automl_specs, sort_keys=True, indent=4))

[
    "train.reg_config.type",
    "train.optim_config.lr",
    "train.optim_config.beta_1",
    "train.optim_config.nesterov"
]


### Actions <a class="anchor" id="head-10"></a>

For all actions:
1. Get default spec schema and derive the default values
2. Modify defaults if needed
3. Post spec dictionary to the service
4. Run model action
5. Monitor job using retrieve
6. Download results using job download endpoint (if needed)
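
Steps 1 to 4 can be sketched as one helper. This is a minimal sketch and not part of any TAO client: `run_action` and `apply_overrides` are hypothetical names, `http` is assumed to be any object with `requests`-style `get` and `post` methods (for example the `requests` module itself), and the endpoints are the ones used in the cells of this notebook.

```python
import json

def apply_overrides(specs, overrides):
    """Recursively merge `overrides` into `specs` in place and return `specs`."""
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(specs.get(key), dict):
            apply_overrides(specs[key], value)
        else:
            specs[key] = value
    return specs

def run_action(http, base_url, experiment_id, action, headers,
               overrides=None, parent_job_id=None):
    # Steps 1-2: fetch the default spec for the action and adjust it
    schema_url = f"{base_url}/experiments/{experiment_id}/specs/{action}/schema"
    response = http.get(schema_url, headers=headers)
    assert response.status_code in (200, 201)
    specs = apply_overrides(response.json()["default"], overrides or {})

    # Steps 3-4: post the spec together with the action name to start the job
    body = json.dumps({"parent_job_id": parent_job_id, "action": action, "specs": specs})
    jobs_url = f"{base_url}/experiments/{experiment_id}/jobs"
    response = http.post(jobs_url, data=body, headers=headers)
    assert response.status_code in (200, 201)
    return response.json()  # the job id, as printed by the train cell
```

With the notebook's variables, a train job could then be started with `run_action(requests, base_url, experiment_id, "train", headers, overrides={"train": {"num_epochs": 80}})`.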

### Train <a class="anchor" id="head-11"></a>

#### Set AutoML related configurations <a class="anchor" id="head-9"></a>
Refer to these links for the parameters each network supports. Add any parameters you need on top of the AutoML parameters enabled by default:

[Classification TF1](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/classification_tf1/classification_tf1%20-%20train.csv), 
[Classification TF2](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/classification_tf2/classification_tf2%20-%20train.csv), 
[Classification Pytorch](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/classification_pyt/classification_pyt%20-%20train.csv), 
[Multitask classification](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/multitask_classification/multitask_classification%20-%20train.csv)
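
As a rough illustration of how the add and remove lists in the next cell relate to the defaults, the sketch below computes the hyperparameters AutoML would end up sweeping. It assumes additions are applied first and removals second; that ordering and the helper name `effective_automl_parameters` are assumptions, not documented service behaviour. The default names are the ones returned by the schema endpoint earlier in this notebook.

```python
def effective_automl_parameters(defaults, additional, removed):
    """Return defaults plus additions, minus removals, preserving order."""
    merged = list(defaults)
    for name in additional:
        if name not in merged:
            merged.append(name)
    return [name for name in merged if name not in removed]

defaults = [
    "train.reg_config.type",
    "train.optim_config.lr",
    "train.optim_config.beta_1",
    "train.optim_config.nesterov",
]
# "train.optim_config.momentum" is a made-up name used only for illustration
swept = effective_automl_parameters(
    defaults,
    additional=["train.optim_config.momentum"],
    removed=["train.optim_config.nesterov"],
)
print(swept)
```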

In [370]:
if automl_enabled:
    # Choose any metric that is present in the kpi dictionary present in the model's status.json. 
    # Example status.json for each model can be found in the respective section in NVIDIA TAO DOCS here: https://docs.nvidia.com/tao/tao-toolkit/text/model_zoo/cv_models/index.html
    if model_name == "classification_pyt":
        metric = "loss"
    else:
        metric = "kpi" 

    additional_automl_parameters = [] #Refer to parameter list mentioned in the above links and add any extra parameter in addition to the default enabled ones
    remove_default_automl_parameters = [] #Remove any hyperparameters that are enabled by default for AutoML

    automl_information = {"automl_enabled":automl_enabled,
                          "automl_algorithm":automl_algorithm,
                          "metric":metric,
                          "automl_max_recommendations": 20, # Only for bayesian
                          "automl_R": 15, # Only for hyperband
                          "automl_nu": 4, # Only for hyperband
                          "epoch_multiplier": 0.3, # Only for hyperband
                          # Enable this if you want to add parameters to automl_add_hyperparameters below that are disabled by TAO in the automl_enabled column of the spec csv.
                          # Warning: The parameters that are disabled are not tested by TAO, so there might be unexpected behaviour in overriding this
                          "override_automl_disabled_params": False,
                          "automl_add_hyperparameters":str(additional_automl_parameters),
                          "automl_remove_hyperparameters":str(remove_default_automl_parameters)
                         }
    data = json.dumps(automl_information)

    endpoint = f"{base_url}/experiments/{experiment_id}"

    response = requests.patch(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)


    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))

<Response [200]>
{
    "actions": [
        "train",
        "evaluate",
        "prune",
        "retrain",
        "export",
        "gen_trt_engine",
        "inference"
    ],
    "additional_id_info": null,
    "automl_R": 15,
    "automl_add_hyperparameters": "[]",
    "automl_algorithm": "bayesian",
    "automl_enabled": true,
    "automl_max_recommendations": 20,
    "automl_nu": 4,
    "automl_remove_hyperparameters": "[]",
    "base_experiment": [
        "cd5a8d37-132f-5811-b84c-c16ffe61b9e3"
    ],
    "calibration_dataset": "1c63db8b-5281-4aea-95dd-8f7b029039dc",
    "checkpoint_choose_method": "best_model",
    "checkpoint_epoch_number": {},
    "created_on": "2024-08-21T06:43:41.391967",
    "dataset_type": "image_classification",
    "description": "My Experiments",
    "docker_env_vars": {},
    "encryption_key": "nvidia_tlt",
    "eval_dataset": "d79beccd-32d3-4c7b-a936-4574f52ec94b",
    "id": "d862a1b9-1305-4ac9-9d1f-a9e94e2b9360",
    "inference_dataset": "aa022042

In [371]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)


print(response)
#print(response.json()) ## Uncomment for verbose schema
train_specs = response.json()["default"]
print(json.dumps(train_specs, sort_keys=True, indent=4))

<Response [200]>
{
    "data_format": "channels_first",
    "dataset": {
        "augmentation": {
            "disable_horizontal_flip": false,
            "enable_center_crop": true,
            "enable_color_augmentation": false,
            "enable_random_crop": true,
            "mixup_alpha": 0.0
        },
        "image_mean": [
            103.939,
            116.779,
            123.68
        ],
        "num_classes": 20,
        "preprocess_mode": "caffe"
    },
    "gpus": 1,
    "model": {
        "activation_type": "None",
        "all_projections": false,
        "backbone": "efficientnet-b0",
        "dropout": 0.0,
        "freeze_bn": false,
        "input_channels": 3,
        "input_height": 224,
        "input_image_depth": 8,
        "input_width": 224,
        "resize_interpolation_method": "bilinear",
        "retain_head": false,
        "use_batch_norm": true,
        "use_bias": false,
        "use_pooling": true
    },
    "train": {
        "batch_size_pe

In [356]:
len(class_names)

5
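
Because the spec is a nested dictionary, overrides like the ones in the cell below can also be written with dotted paths, the same notation the AutoML parameter names use. A minimal sketch: `set_by_path` is a hypothetical helper, and the sample spec only mirrors a few keys from the schema output above.

```python
def set_by_path(specs, dotted_key, value):
    """Set specs[a][b][c] = value for dotted_key 'a.b.c', creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = specs
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return specs

sample_specs = {"dataset": {"num_classes": 20}, "gpus": 1, "train": {}}
set_by_path(sample_specs, "dataset.num_classes", 5)
set_by_path(sample_specs, "train.num_epochs", 80)
print(sample_specs)
```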

In [372]:
# Override any of the parameters listed in the previous cell as required
# Example for multitask-classification (for each network the parameter key might be different)
if model_name == "multitask_classification":
    train_specs["training_config"]["num_epochs"] = 5
    train_specs["gpus"] = 1
# Example for classification_pyt
elif model_name == "classification_pyt":
    train_specs["train"]["train_config"]["runner"]["max_epochs"] = 40
    train_specs["train"]["num_gpus"] = 1
    train_specs["gpus"] = 1
# Example for classification_tf1
elif model_name == "classification_tf1":
    train_specs["train_config"]["n_epochs"] = 80
    train_specs["gpus"] = 1
# Example for classification_tf2
elif model_name == "classification_tf2":
    train_specs["train"]["num_epochs"] = 80
    train_specs["gpus"] = 1
    train_specs["dataset"]["num_classes"] = len(class_names)

print(json.dumps(train_specs, sort_keys=True, indent=4))
print("Number of classes:", train_specs["dataset"]["num_classes"])

{
    "data_format": "channels_first",
    "dataset": {
        "augmentation": {
            "disable_horizontal_flip": false,
            "enable_center_crop": true,
            "enable_color_augmentation": false,
            "enable_random_crop": true,
            "mixup_alpha": 0.0
        },
        "image_mean": [
            103.939,
            116.779,
            123.68
        ],
        "num_classes": 5,
        "preprocess_mode": "caffe"
    },
    "gpus": 1,
    "model": {
        "activation_type": "None",
        "all_projections": false,
        "backbone": "efficientnet-b0",
        "dropout": 0.0,
        "freeze_bn": false,
        "input_channels": 3,
        "input_height": 224,
        "input_image_depth": 8,
        "input_width": 224,
        "resize_interpolation_method": "bilinear",
        "retain_head": false,
        "use_batch_norm": true,
        "use_bias": false,
        "use_pooling": true
    },
    "train": {
        "batch_size_per_gpu": 64,
      

In [373]:
# Run action
parent = None
action = "train"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":train_specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(response.json())

job_map["train_" + model_name] = response.json()
print(job_map)

<Response [201]>
6c79c4d5-5d44-4c6b-8b76-d74c9b43610a
{'train_classification_tf2': '6c79c4d5-5d44-4c6b-8b76-d74c9b43610a'}


In [375]:
# Monitor job status by repeatedly running this cell
# For automl: Training times for different models benchmarked on 1 GPU V100 machine can be found here: https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#results-of-automl-experiments

job_id = job_map["train_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:    
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)

    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    # assert response.status_code in (200, 201)
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
        break
    time.sleep(5)

<Response [200]>
{
    "action": "train",
    "created_on": "2024-08-21T06:49:50.252479",
    "description": null,
    "experiment_id": "d862a1b9-1305-4ac9-9d1f-a9e94e2b9360",
    "id": "6c79c4d5-5d44-4c6b-8b76-d74c9b43610a",
    "last_modified": "2024-08-21T08:30:11.913861",
    "name": null,
    "parent_id": null,
    "result": {
        "automl_result": [
            {
                "metric": "best_val_accuracy",
                "value": 0.662576775415039
            }
        ],
        "stats": [
            {
                "metric": "Estimated time for automl completion",
                "value": "323.89 minutes remaining approximately"
            },
            {
                "metric": "Current experiment number",
                "value": "6"
            },
            {
                "metric": "Number of epochs yet to start",
                "value": "1166"
            },
            {
                "metric": "Time per epoch in seconds",
                "value": "16

In [None]:
## To Stop an AutoML JOB
#    1. Stop the 'Monitor job status by repeatedly running this cell' cell (the cell right before this cell) manually
#    2. Uncomment the snippet in the next cell and run the cell

In [361]:
if automl_enabled:
    job_id = job_map["train_" + model_name]
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:cancel"

    response = requests.post(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    print(response.json())

<Response [200]>
{}


In [57]:
print(job_id)
print(model_name)

d2d15d6b-01ea-402b-b4c6-f0d307173654
multitask_classification


In [58]:
print(experiment_id)

e8132f13-d5da-42fd-a6fb-758d401cd1a0


In [59]:
print(user_id)

a1c02cba-b62b-52f9-9e49-e3de0e5b66ab


In [None]:
## Resume AutoML

In [None]:
# Uncomment the below snippet if you want to resume an already stopped AutoML job and then run the 'Monitor job status by repeatedly running this cell' cell above (4th cell above from this cell)
if automl_enabled:
    job_id = job_map["train_" + model_name]
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:resume"

    data = json.dumps({"parent_job_id":parent,"specs":train_specs})
    response = requests.post(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    print(response.json())

### Download train job artifacts <a class="anchor" id="head-12"></a>

In [65]:
# Example to list the files of the executed train job
job_id = job_map["train_" + model_name]
endpoint = f'{base_url}/experiments/{experiment_id}/jobs/{job_id}:list_files'

response = requests.get(endpoint, headers=headers)
print(json.dumps(response.json(), sort_keys=True, indent=4))

[
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/automl_metadata_lock.lock",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/controller_lock.lock",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/recommendation_3_lock.lock",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/recommendation_0.json",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/recommendation_0_lock.lock",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/recommendation_2.json",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/experiment_2",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/experiment_2/events",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/experiment_2/status.json",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/experiment_2/multitask_cls_training_log_resnet10.csv",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/experiment_2/class_mapping.json",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/experiment_2/weights",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/experiment_2/log.txt",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/experiment_0",
    "1e778a2d-1597-4b62-ac68-5f5ee28dda

In [66]:
# ## Before training, patch the model with the proper metric for this cell to work. By default 'loss' is used, but some models don't log that parameter under the name 'loss'

# # Download selective job contents once the above job shows "Done" status
# # Example to download selective files of train job (Note: will take time)
# endpoint = f'{base_url}/experiments/{experiment_id}/jobs/{job_id}:download_selective_files'

# file_lists = [] # Choose file names from the previous cell where all the files for this job were listed
# best_model = False # Enable this to download the checkpoint of the best performing model with respect to the metric chosen before starting training
# latest_model = True # Enable this to download the latest checkpoint of the training job; Disable best_model to use latest_model

# params = {"file_lists": file_lists, "best_model": best_model, "latest_model": latest_model}

# # Save
# temptar = f'{job_id}.tar.gz'
# with requests.get(endpoint, headers=headers, params=params, stream=True) as r:
#     r.raise_for_status()
#     with open(temptar, 'wb') as f:
#         for chunk in r.iter_content(chunk_size=8192):
#             f.write(chunk)

# print("Untarring")
# # Untar to destination
# tar_command = f'tar -xvf {temptar} -C {workdir}/'
# os.system(tar_command)
# os.remove(temptar)
# print(f"Results at {workdir}/{job_id}")
# model_downloaded_path = f"{workdir}/{job_id}"
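
The `file_lists` parameter in the cell above expects names taken from the `:list_files` output. A small sketch of picking them by suffix rather than by hand; `select_files` is a hypothetical helper and the sample paths are a shortened copy of the listing shown earlier:

```python
def select_files(listed_files, suffixes):
    """Keep entries of the :list_files output whose path ends with one of `suffixes`."""
    return [path for path in listed_files if path.endswith(tuple(suffixes))]

# Shortened sample of the listing printed by the list_files cell
listed = [
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/recommendation_0.json",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/experiment_2/status.json",
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd/experiment_2/log.txt",
]
file_lists = select_files(listed, ["status.json", "log.txt"])
print(file_lists)
```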

In [67]:
# Downloading the train job takes a long time; this cell runs only when download_jobs is set
if download_jobs:
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    expected_file_size = response.json().get("job_tar_stats", {}).get("file_size")
    print("expected_file_size: ", expected_file_size)

    !python3 -m pip install tqdm
    from tqdm import tqdm

    endpoint = f'{base_url}/experiments/{experiment_id}/jobs/{job_id}:download'
    temptar = f'{job_id}.tar.gz'

    with tqdm(total=expected_file_size, unit='B', unit_scale=True) as progress_bar:
        while True:
            # Check if the file already exists
            headers_download_job = dict(headers)
            if os.path.exists(temptar):
                # Get the current file size
                file_size = os.path.getsize(temptar)
                print(f"File size of downloaded content until now is {file_size}")

                # If the file size matches the expected size, break out of the loop
                if file_size >= (expected_file_size-1):
                    print("Download completed successfully.")
                    print("Untarring")
                    # Untar to destination
                    tar_command = f'tar -xf {temptar} -C {workdir}/'
                    os.system(tar_command)
                    os.remove(temptar)
                    print(f"Results at {workdir}/{job_id}")
                    model_downloaded_path = f"{workdir}/{job_id}"
                    break

                # Set the headers to resume the download from where it left off
                headers_download_job['Range'] = f'bytes={file_size}-'
            # Open the file for writing in binary mode
            with open(temptar, 'ab') as f:
                try:
                    response = requests.get(endpoint, headers=headers_download_job, stream=True)
                    print(response)
                    # Check if the request was successful
                    if response.status_code in [200, 206]:
                        # Iterate over the content in chunks
                        for chunk in response.iter_content(chunk_size=1024):
                            if chunk:
                                # Write the chunk to the file
                                f.write(chunk)
                                # Flush and sync the file to disk
                                f.flush()
                                os.fsync(f.fileno())
                            progress_bar.update(len(chunk))
                    else:
                        print(f"Failed to download file. Status code: {response.status_code}")
                except requests.exceptions.RequestException as e:
                    print("Connection interrupted during download, resuming download from breaking point")
                    time.sleep(5)  # Sleep for a while before retrying the request
                    continue  # Continue the loop to retry the request

expected_file_size:  26710944


  0%|                                      | 1.02k/26.7M [00:01<9:28:40, 783B/s]

<Response [200]>


100%|██████████████████████████████████████▉| 26.7M/26.7M [01:30<00:00, 289kB/s]

File size of downloaded content until now is 26710944
Download completed successfully.
Untarring


100%|███████████████████████████████████████| 26.7M/26.7M [01:31<00:00, 293kB/s]

Results at /home/zyw/tao-env/1e778a2d-1597-4b62-ac68-5f5ee28ddacd
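
The download loop above resumes an interrupted transfer by sending an HTTP `Range` header that starts at the size of the partial file. The header logic in isolation, as a minimal sketch; `resume_headers` is a hypothetical helper and the `X-Example` header is a placeholder for whatever the notebook's `headers` contain:

```python
import os
import tempfile

def resume_headers(base_headers, partial_path):
    """Copy base_headers and add a Range header if a partial file exists."""
    headers = dict(base_headers)  # never mutate the shared headers dict
    if os.path.exists(partial_path):
        offset = os.path.getsize(partial_path)
        headers["Range"] = f"bytes={offset}-"
    return headers

base = {"X-Example": "placeholder"}
with tempfile.TemporaryDirectory() as tmp:
    partial = os.path.join(tmp, "job.tar.gz")
    fresh = resume_headers(base, partial)      # nothing downloaded yet
    with open(partial, "wb") as f:
        f.write(b"12345")                      # pretend 5 bytes arrived
    resumed = resume_headers(base, partial)

print(fresh)
print(resumed)
```

A server that honours the range replies with status 206, which is why the loop accepts both 200 and 206.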





In [68]:
# View the checkpoints generated for the training job. For AutoML jobs, also view the best-performing model's config and the results of all AutoML experiments

if download_jobs:
    if automl_enabled:
        # !python3 -m pip install pandas==1.5.1
        import pandas as pd
        model_downloaded_path = f"{model_downloaded_path}/best_model"
        assert glob.glob(f"{model_downloaded_path}/*.protobuf") or glob.glob(f"{model_downloaded_path}/*.yaml")

    assert os.path.exists(model_downloaded_path)
    assert (glob.glob(model_downloaded_path + "/**/*.tlt", recursive=True) + glob.glob(model_downloaded_path + "/**/*.hdf5", recursive=True) + glob.glob(model_downloaded_path + "/**/*.pth", recursive=True))

    if os.path.exists(model_downloaded_path):        
        #List the binary model file
        print("\nCheckpoints for the training experiment")
        if os.path.exists(model_downloaded_path+"/train/weights") and len(os.listdir(model_downloaded_path+"/train/weights")) > 0:
            print(f"Folder: {model_downloaded_path}/train/weights")
            print("Files:", os.listdir(model_downloaded_path+"/train/weights"))
        elif os.path.exists(model_downloaded_path+"/weights") and len(os.listdir(model_downloaded_path+"/weights")) > 0:
            print(f"Folder: {model_downloaded_path}/weights")
            print("Files:", os.listdir(model_downloaded_path+"/weights"))
        else:
            print(f"Folder: {model_downloaded_path}")
            print("Files:", os.listdir(model_downloaded_path))

        if automl_enabled:
            assert glob.glob(f"{model_downloaded_path}/*.protobuf") or glob.glob(f"{model_downloaded_path}/*.yaml")
            with open(f"{model_downloaded_path}/controller.json", "r") as controller_file:
                experiment_artifacts = json.load(controller_file)
            data_frame = pd.DataFrame(experiment_artifacts)
            # Print experiment id/number and the corresponding result
            print("\nResults of all experiments")
            with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
                print(data_frame[["id","result"]])


Checkpoints for the training experiment
Folder: /home/zyw/tao-env/1e778a2d-1597-4b62-ac68-5f5ee28ddacd/best_model/weights
Files: ['multitask_cls_resnet10_epoch_012.hdf5']

Results of all experiments
   id     result
0   0  21.493010
1   1   2.616403
2   2  40.618683
3   3  28.198430


### Evaluate <a class="anchor" id="head-12"></a>

In [214]:
# Get model handler parameters
endpoint = f"{base_url}/experiments/{experiment_id}"
response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

model_parameters = response.json()
update_checkpoint_choosing = {}
update_checkpoint_choosing["checkpoint_choose_method"] = model_parameters["checkpoint_choose_method"]
update_checkpoint_choosing["checkpoint_epoch_number"] = model_parameters["checkpoint_epoch_number"]
print(update_checkpoint_choosing)

{'checkpoint_choose_method': 'best_model', 'checkpoint_epoch_number': {'best_model_9e9cc34b-a96e-4eab-a2b2-288e5bd941cc': 3, 'latest_model_9e9cc34b-a96e-4eab-a2b2-288e5bd941cc': 3}}
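
The dictionary printed above keys each epoch number by `<method>_<train_job_id>`. A sketch of building the PATCH body for any of the three methods named in the next cell; `build_checkpoint_choice` is a hypothetical helper, and the job id is the one from the output above:

```python
import json

def build_checkpoint_choice(method, train_job_id=None, epoch=None, epoch_numbers=None):
    """Build the PATCH body that selects which checkpoint later actions use."""
    allowed = ("best_model", "latest_model", "from_epoch_number")
    if method not in allowed:
        raise ValueError(f"method must be one of {allowed}")
    epochs = dict(epoch_numbers or {})  # keep any existing entries
    if method == "from_epoch_number":
        if train_job_id is None or epoch is None:
            raise ValueError("from_epoch_number needs train_job_id and epoch")
        epochs[f"from_epoch_number_{train_job_id}"] = epoch
    return {"checkpoint_choose_method": method, "checkpoint_epoch_number": epochs}

body = build_checkpoint_choice(
    "from_epoch_number",
    train_job_id="9e9cc34b-a96e-4eab-a2b2-288e5bd941cc",
    epoch=3,
)
print(json.dumps(body, indent=4))
```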


In [215]:
# Change the method by which checkpoint from the parent action is chosen, when parent action is a train/retrain action.
# Example for evaluate action below, can be applied in the same way for other actions too
update_checkpoint_choosing["checkpoint_choose_method"] = "latest_model" # Choose between best_model/latest_model/from_epoch_number
# If from_epoch_number is chosen, assign the epoch number to the dictionary key in the format 'from_epoch_number_{train_job_id}'
# update_checkpoint_choosing["checkpoint_epoch_number"]["from_epoch_number_28a2754e-50ef-43a8-9733-98913776dd90"] = 3
data = json.dumps(update_checkpoint_choosing)

endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.patch(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), sort_keys=True, indent=4))

<Response [200]>
{
    "actions": [
        "train",
        "evaluate",
        "prune",
        "retrain",
        "export",
        "gen_trt_engine",
        "inference"
    ],
    "additional_id_info": null,
    "automl_R": 15,
    "automl_add_hyperparameters": "[]",
    "automl_algorithm": "hyperband",
    "automl_enabled": true,
    "automl_max_recommendations": 20,
    "automl_nu": 4,
    "automl_remove_hyperparameters": "[]",
    "base_experiment": [
        "cd5a8d37-132f-5811-b84c-c16ffe61b9e3"
    ],
    "calibration_dataset": "7998bcf9-1e40-4e3c-9b6c-675b60d24e6d",
    "checkpoint_choose_method": "latest_model",
    "checkpoint_epoch_number": {
        "best_model_9e9cc34b-a96e-4eab-a2b2-288e5bd941cc": 3,
        "latest_model_9e9cc34b-a96e-4eab-a2b2-288e5bd941cc": 3
    },
    "created_on": "2024-08-20T08:52:14.162214",
    "dataset_type": "image_classification",
    "description": "My Experiments",
    "docker_env_vars": {},
    "encryption_key": "nvidia_tlt",
    "eval_d

In [279]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/evaluate/schema"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
#print(response.json()) ## Uncomment for verbose schema
assert "default" in response.json().keys()
eval_specs = response.json()["default"]
print(json.dumps(eval_specs, sort_keys=True, indent=4))

<Response [200]>
{
    "data_format": "channels_first",
    "dataset": {
        "augmentation": {
            "disable_horizontal_flip": false,
            "enable_center_crop": true,
            "enable_color_augmentation": false,
            "enable_random_crop": true,
            "mixup_alpha": 0.0
        },
        "image_mean": [
            103.939,
            116.779,
            123.68
        ],
        "num_classes": 20,
        "preprocess_mode": "caffe"
    },
    "evaluate": {
        "batch_size": 1,
        "n_workers": 1,
        "top_k": 3
    },
    "model": {
        "activation_type": "None",
        "all_projections": false,
        "backbone": "efficientnet-b0",
        "dropout": 0.0,
        "freeze_bn": false,
        "input_channels": 3,
        "input_height": 224,
        "input_image_depth": 8,
        "input_width": 224,
        "resize_interpolation_method": "bilinear",
        "retain_head": false,
        "use_batch_norm": true,
        "use_bias": f

In [282]:
eval_specs

{'data_format': 'channels_first',
 'dataset': {'augmentation': {'disable_horizontal_flip': False,
   'enable_center_crop': True,
   'enable_color_augmentation': False,
   'enable_random_crop': True,
   'mixup_alpha': 0.0},
  'image_mean': [103.939, 116.779, 123.68],
  'num_classes': 20,
  'preprocess_mode': 'caffe'},
 'evaluate': {'batch_size': 1, 'n_workers': 1, 'top_k': 3},
 'model': {'activation_type': 'None',
  'all_projections': False,
  'backbone': 'efficientnet-b0',
  'dropout': 0.0,
  'freeze_bn': False,
  'input_channels': 3,
  'input_height': 224,
  'input_image_depth': 8,
  'input_width': 224,
  'resize_interpolation_method': 'bilinear',
  'retain_head': False,
  'use_batch_norm': True,
  'use_bias': False,
  'use_pooling': True}}

In [283]:
eval_specs["dataset"]["num_classes"] = len(class_names)

In [284]:
eval_specs["dataset"]["num_classes"]

5
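
The default spec arrived with `num_classes` set to 20, while this dataset has 5 classes, so it can help to list exactly which values differ from the defaults before submitting a job. The following is a small sketch; `diff_specs` is a hypothetical helper, and the dictionaries are abbreviated copies of the spec shown above.

```python
import copy

def diff_specs(default, modified, prefix=""):
    """Yield (dotted_key, default_value, modified_value) for every difference.

    A key that is missing on one side is reported with None for that side.
    """
    for key in sorted(set(default) | set(modified)):
        path = f"{prefix}{key}"
        old = default.get(key)
        new = modified.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse into nested sections such as "dataset" or "evaluate"
            yield from diff_specs(old, new, prefix=f"{path}.")
        elif old != new:
            yield path, old, new

# Abbreviated copy of the default evaluate spec
default_specs = {
    "data_format": "channels_first",
    "dataset": {"num_classes": 20, "preprocess_mode": "caffe"},
    "evaluate": {"batch_size": 1, "n_workers": 1, "top_k": 3},
}
modified_specs = copy.deepcopy(default_specs)
modified_specs["dataset"]["num_classes"] = 5

changes = list(diff_specs(default_specs, modified_specs))
for path, old, new in changes:
    print(f"{path}: {old} -> {new}")
```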

In [285]:
# Modify specs dictionary to change any config parameters
print(json.dumps(eval_specs, sort_keys=True, indent=4))

{
    "data_format": "channels_first",
    "dataset": {
        "augmentation": {
            "disable_horizontal_flip": false,
            "enable_center_crop": true,
            "enable_color_augmentation": false,
            "enable_random_crop": true,
            "mixup_alpha": 0.0
        },
        "image_mean": [
            103.939,
            116.779,
            123.68
        ],
        "num_classes": 5,
        "preprocess_mode": "caffe"
    },
    "evaluate": {
        "batch_size": 1,
        "n_workers": 1,
        "top_k": 3
    },
    "model": {
        "activation_type": "None",
        "all_projections": false,
        "backbone": "efficientnet-b0",
        "dropout": 0.0,
        "freeze_bn": false,
        "input_channels": 3,
        "input_height": 224,
        "input_image_depth": 8,
        "input_width": 224,
        "resize_interpolation_method": "bilinear",
        "retain_head": false,
        "use_batch_norm": true,
        "use_bias": false,
        "use

In [74]:
# Run action
parent = job_map["train_" + model_name]
action = "evaluate"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":eval_specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(response.json())

job_map["evaluate_" + model_name] = response.json()
print(job_map)

<Response [201]>
a5122cd1-cd6e-45b7-a8d6-581dd7490384
{'train_multitask_classification': '1e778a2d-1597-4b62-ac68-5f5ee28ddacd', 'evaluate_multitask_classification': 'a5122cd1-cd6e-45b7-a8d6-581dd7490384'}
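
Every action in this notebook posts the same three fields (`parent_job_id`, `action`, `specs`) to the jobs endpoint. A hedged sketch of factoring that into one place is shown below; `build_job_payload` is a hypothetical helper, and the action names are the ones listed in the experiment metadata earlier in this notebook.

```python
import json

# Actions listed in the experiment metadata returned by the service
KNOWN_ACTIONS = ("train", "evaluate", "prune", "retrain",
                 "export", "gen_trt_engine", "inference")

def build_job_payload(parent_job_id, action, specs):
    """Serialize the request body for POST .../experiments/{id}/jobs."""
    if action not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if not isinstance(specs, dict):
        raise TypeError("specs must be a dictionary")
    return json.dumps({"parent_job_id": parent_job_id,
                       "action": action,
                       "specs": specs})

data = build_job_payload(
    "1e778a2d-1597-4b62-ac68-5f5ee28ddacd",  # train job id from the output above
    "evaluate",
    {"evaluate": {"batch_size": 1}},  # abbreviated spec, for illustration only
)
print(data)
```

The returned string can be passed as `data=` to `requests.post(endpoint, data=data, headers=headers)`, as in the cell above.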


In [76]:
# Poll the job status until it reaches a terminal state; re-run this cell if it is interrupted
job_id = job_map["evaluate_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:    
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    print(response)
    print(response.json())
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
        break
    time.sleep(2)

<Response [200]>
{'action': 'evaluate', 'created_on': '2024-08-20T06:12:02.815184', 'description': '', 'experiment_id': 'e8132f13-d5da-42fd-a6fb-758d401cd1a0', 'id': 'a5122cd1-cd6e-45b7-a8d6-581dd7490384', 'job_tar_stats': {'file_size': 2671, 'sha256_digest': '414c9aa1fde72dae77fd05a89b5690bbc1f072a51f6a0c927f8358f4ccc4ed7f'}, 'last_modified': '2024-08-20T06:14:36.952850', 'name': '', 'parent_id': '1e778a2d-1597-4b62-ac68-5f5ee28ddacd', 'result': {'categorical': [], 'cur_iter': None, 'detailed_status': {'date': '8/20/2024', 'message': 'Evalation finished successfully.', 'status': 'SUCCESS', 'time': '6:14:12'}, 'epoch': None, 'eta': None, 'graphical': [], 'key_metric': 0.0, 'kpi': [{'metric': 'base_color', 'values': {'0': 0.6744260204081632}}, {'metric': 'category', 'values': {'0': 0.8995535714285714}}, {'metric': 'season', 'values': {'0': 0.6552933673469388}}, {'metric': 'mean accuracy', 'values': {'0': 0.743090986394558}}], 'max_epoch': None, 'time_per_epoch': None, 'time_per_iter': N
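
The polling loop above runs until the job reaches a terminal state, with no upper bound on the waiting time. A minimal sketch of the same loop with a timeout follows. `wait_for_job` and its `fetch_status` callable are hypothetical names; in this notebook `fetch_status` would wrap the `requests.get(endpoint, headers=headers)` call and return the `status` field. The terminal states are the ones checked in the cell above, and the non-terminal status strings in the example are illustrative.

```python
import time

TERMINAL_STATES = ("Done", "Error", "Canceled")

def wait_for_job(fetch_status, interval=15.0, timeout=3600.0, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal state or time runs out.

    fetch_status: zero-argument callable returning the job's status string.
    Returns the final status; raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(interval)

# Stand-in for the REST call: two non-terminal statuses, then "Done"
fake_statuses = iter(["Pending", "Running", "Done"])
final_status = wait_for_job(lambda: next(fake_statuses), interval=0.01, timeout=5)
print(final_status)
```

Returning the status instead of asserting inside the loop lets the caller decide how to handle `Error` or `Canceled`.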

In [79]:
response.json()

{'action': 'evaluate',
 'created_on': '2024-08-20T06:12:02.815184',
 'description': '',
 'experiment_id': 'e8132f13-d5da-42fd-a6fb-758d401cd1a0',
 'id': 'a5122cd1-cd6e-45b7-a8d6-581dd7490384',
 'job_tar_stats': {'file_size': 2671,
  'sha256_digest': '414c9aa1fde72dae77fd05a89b5690bbc1f072a51f6a0c927f8358f4ccc4ed7f'},
 'last_modified': '2024-08-20T06:14:36.952850',
 'name': '',
 'parent_id': '1e778a2d-1597-4b62-ac68-5f5ee28ddacd',
 'result': {'categorical': [],
  'cur_iter': None,
  'detailed_status': {'date': '8/20/2024',
   'message': 'Evalation finished successfully.',
   'status': 'SUCCESS',
   'time': '6:14:12'},
  'epoch': None,
  'eta': None,
  'graphical': [],
  'key_metric': 0.0,
  'kpi': [{'metric': 'base_color', 'values': {'0': 0.6744260204081632}},
   {'metric': 'category', 'values': {'0': 0.8995535714285714}},
   {'metric': 'season', 'values': {'0': 0.6552933673469388}},
   {'metric': 'mean accuracy', 'values': {'0': 0.743090986394558}}],
  'max_epoch': None,
  'time_per_ep
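
The `kpi` list in the response above holds one entry per metric, each with a `values` dictionary. As a rough sketch, it can be flattened into a plain metric-to-value mapping. `flatten_kpis` is a hypothetical helper, and the keys of `values` are assumed to be step or epoch indices stored as strings.

```python
def flatten_kpis(job_json):
    """Map each KPI name to its last reported value."""
    kpis = {}
    for entry in job_json.get("result", {}).get("kpi", []):
        values = entry.get("values", {})
        if values:
            # Assumed: keys are integer indices stored as strings; take the highest
            last_key = max(values, key=int)
            kpis[entry["metric"]] = values[last_key]
    return kpis

# Abbreviated copy of the evaluate response shown above
job_json = {"result": {"kpi": [
    {"metric": "base_color", "values": {"0": 0.6744260204081632}},
    {"metric": "category", "values": {"0": 0.8995535714285714}},
    {"metric": "season", "values": {"0": 0.6552933673469388}},
    {"metric": "mean accuracy", "values": {"0": 0.743090986394558}},
]}}

kpis = flatten_kpis(job_json)
task_scores = [value for name, value in kpis.items() if name != "mean accuracy"]
task_mean = sum(task_scores) / len(task_scores)
for name, value in kpis.items():
    print(f"{name}: {value:.4f}")
print(f"unweighted mean of the three tasks: {task_mean:.4f}")
```

For this run, the reported `mean accuracy` agrees with the unweighted mean of the three per-task accuracies up to floating-point rounding.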

### Prune, Retrain and Evaluation <a class="anchor" id="head-13"></a>

- We optimize the trained model by pruning and retraining in the following cells

#### Prune <a class="anchor" id="head-14"></a>

In [83]:
if model_name != "classification_pyt":
    # Get default spec schema
    endpoint = f"{base_url}/experiments/{experiment_id}/specs/prune/schema"

    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    #print(response.json()) ## Uncomment for verbose schema
    assert "default" in response.json().keys()
    prune_specs = response.json()["default"]
    print(json.dumps(prune_specs, sort_keys=True, indent=4))

<Response [200]>
{
    "equalization_criterion": "union",
    "min_num_filters": 16,
    "normalizer": "max",
    "pruning_granularity": 8,
    "pruning_threshold": 0.1
}


In [84]:
if model_name != "classification_pyt":
    # Apply changes to specs dictionary if required here
    if model_name == "classification_tf2":
        prune_specs["prune"]["byom_model_path"] = ""
    print(json.dumps(prune_specs, sort_keys=True, indent=4))

{
    "equalization_criterion": "union",
    "min_num_filters": 16,
    "normalizer": "max",
    "pruning_granularity": 8,
    "pruning_threshold": 0.1
}


In [85]:
if model_name != "classification_pyt":
    # Run actions
    parent = job_map["train_" + model_name]
    action = "prune"
    data = json.dumps({"parent_job_id":parent,"action":action,"specs":prune_specs})

    endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

    response = requests.post(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)
    assert response.json()

    print(response)
    print(response.json())

    job_map["prune_" + model_name] = response.json()
    print(job_map)

<Response [201]>
4e99ea1e-d8ce-4ff6-b8f9-d834a5c82c14
{'train_multitask_classification': '1e778a2d-1597-4b62-ac68-5f5ee28ddacd', 'evaluate_multitask_classification': 'a5122cd1-cd6e-45b7-a8d6-581dd7490384', 'prune_multitask_classification': '4e99ea1e-d8ce-4ff6-b8f9-d834a5c82c14'}


In [86]:
if model_name != "classification_pyt":
    # Poll the prune job status until it reaches a terminal state; re-run this cell if it is interrupted
    job_id = job_map["prune_" + model_name]
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

    while True:    
        clear_output(wait=True)
        response = requests.get(endpoint, headers=headers)
        assert response.status_code in (200, 201)
        print(response)
        print(response.json())
        assert "status" in response.json().keys() and response.json().get("status") != "Error"
        if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
            break
        time.sleep(15)

<Response [200]>
{'action': 'prune', 'created_on': '2024-08-20T06:45:05.731442', 'description': '', 'experiment_id': 'e8132f13-d5da-42fd-a6fb-758d401cd1a0', 'id': '4e99ea1e-d8ce-4ff6-b8f9-d834a5c82c14', 'job_tar_stats': {'file_size': 25665906, 'sha256_digest': 'cbff0bb261137d5cf722e814971795704d6ed364b343e2c0bce373db714d552a'}, 'last_modified': '2024-08-20T06:45:57.551625', 'name': '', 'parent_id': '1e778a2d-1597-4b62-ac68-5f5ee28ddacd', 'result': {'categorical': [], 'cur_iter': None, 'detailed_status': {'date': '8/20/2024', 'message': 'Pruning finished successfully.', 'status': 'SUCCESS', 'time': '6:45:32'}, 'epoch': None, 'eta': None, 'graphical': [], 'key_metric': 0.0, 'kpi': [{'metric': 'pruning_ratio', 'values': {'0': 1.0}}, {'metric': 'size', 'values': {'0': 26.545745849609375}}, {'metric': 'param_count', 'values': {'0': 6.910489}}], 'max_epoch': None, 'time_per_epoch': None, 'time_per_iter': None}, 'specs': {'equalization_criterion': 'union', 'min_num_filters': 16, 'normalizer':
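
The prune job above reports a `pruning_ratio` of 1.0. Assuming this KPI follows the usual TAO convention of pruned model size divided by original model size (an assumption that this output alone does not confirm), a value of 1.0 would suggest that no filters were removed at `pruning_threshold` 0.1. The sketch below flags that case before moving on to retraining; `check_pruning` and its `max_ratio` default are hypothetical.

```python
def check_pruning(kpi_list, max_ratio=0.9):
    """Return (ratio, ok), where ok is True if the model shrank enough.

    kpi_list: the 'kpi' list from the prune job's 'result' field.
    max_ratio: largest acceptable pruned/original ratio (illustrative default).
    """
    for entry in kpi_list:
        if entry["metric"] == "pruning_ratio":
            # The prune job reports a single value per metric
            ratio = next(iter(entry["values"].values()))
            return ratio, ratio <= max_ratio
    raise KeyError("no pruning_ratio KPI found")

# KPI values copied from the prune job output above
prune_kpis = [
    {"metric": "pruning_ratio", "values": {"0": 1.0}},
    {"metric": "size", "values": {"0": 26.545745849609375}},
    {"metric": "param_count", "values": {"0": 6.910489}},
]
ratio, ok = check_pruning(prune_kpis)
if not ok:
    print(f"pruning_ratio={ratio}: consider raising pruning_threshold and re-running prune")
```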

#### Retrain <a class="anchor" id="head-15"></a>

In [87]:
if model_name != "classification_pyt":
    # Get default spec schema
    endpoint = f"{base_url}/experiments/{experiment_id}/specs/retrain/schema"

    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    #print(response.json()) ## Uncomment for verbose schema
    assert "default" in response.json().keys()
    retrain_specs = response.json()["default"]
    print(json.dumps(retrain_specs, sort_keys=True, indent=4))

<Response [200]>
{
    "gpus": 1,
    "model_config": {
        "all_projections": true,
        "arch": "resnet",
        "input_image_size": "3,80,60",
        "n_layers": 10,
        "use_batch_norm": true
    },
    "random_seed": 42,
    "training_config": {
        "batch_size_per_gpu": 32,
        "checkpoint_interval": 1,
        "learning_rate": {
            "soft_start_annealing_schedule": {
                "annealing": 0.7,
                "max_learning_rate": 0.01,
                "min_learning_rate": 1e-06,
                "soft_start": 0.1
            }
        },
        "num_epochs": 10,
        "optimizer": {
            "sgd": {
                "momentum": 0.9,
                "nesterov": false
            }
        },
        "regularizer": {
            "type": "__L1__",
            "weight": 9e-05
        }
    },
    "use_amp": false
}


In [89]:
if model_name != "classification_pyt":
    # Override any of the parameters listed in the previous cell as required
    # Example for multitask-classification (for each network the parameter key might be different)
    if model_name == "multitask_classification":
        retrain_specs["training_config"]["num_epochs"] = 5
        retrain_specs["gpus"] = 1
    # Example for classification_tf1
    elif model_name == "classification_tf1":
        retrain_specs["train_config"]["n_epochs"] = 80
        retrain_specs["gpus"] = 1
    # Example for classification_tf2
    elif model_name == "classification_tf2":
        retrain_specs["train"]["num_epochs"] = 80
        retrain_specs["gpus"] = 1
    print(json.dumps(retrain_specs, sort_keys=True, indent=4))

{
    "gpus": 1,
    "model_config": {
        "all_projections": true,
        "arch": "resnet",
        "input_image_size": "3,80,60",
        "n_layers": 10,
        "use_batch_norm": true
    },
    "random_seed": 42,
    "training_config": {
        "batch_size_per_gpu": 32,
        "checkpoint_interval": 1,
        "learning_rate": {
            "soft_start_annealing_schedule": {
                "annealing": 0.7,
                "max_learning_rate": 0.01,
                "min_learning_rate": 1e-06,
                "soft_start": 0.1
            }
        },
        "num_epochs": 5,
        "optimizer": {
            "sgd": {
                "momentum": 0.9,
                "nesterov": false
            }
        },
        "regularizer": {
            "type": "__L1__",
            "weight": 9e-05
        }
    },
    "use_amp": false
}
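
The epoch key differs between networks (`training_config.num_epochs`, `train_config.n_epochs`, `train.num_epochs`), so a mistyped leaf key would silently add a new field instead of overriding the intended one. A sketch of a strict setter that only overrides keys already present in the default spec is shown below; `set_existing` is a hypothetical helper, and the spec is an abbreviated copy of the one above.

```python
def set_existing(spec, dotted_key, value):
    """Set spec[a][b][c] for dotted_key 'a.b.c', failing if any part is missing."""
    *parents, leaf = dotted_key.split(".")
    node = spec
    for part in parents:
        if not isinstance(node.get(part), dict):
            raise KeyError(f"{dotted_key}: section {part!r} not in spec")
        node = node[part]
    if leaf not in node:
        raise KeyError(f"{dotted_key}: key {leaf!r} not in spec")
    node[leaf] = value

# Abbreviated copy of the multitask_classification retrain spec
retrain_specs = {
    "gpus": 1,
    "training_config": {"batch_size_per_gpu": 32, "num_epochs": 10},
}
set_existing(retrain_specs, "training_config.num_epochs", 5)

try:
    # Wrong key for this network: n_epochs belongs to classification_tf1
    set_existing(retrain_specs, "training_config.n_epochs", 5)
    rejected = False
except KeyError:
    rejected = True

print(retrain_specs["training_config"]["num_epochs"], rejected)
```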


In [90]:
if model_name != "classification_pyt":
    # Run actions
    parent = job_map["prune_" + model_name]
    action = "retrain"
    data = json.dumps({"parent_job_id":parent,"action":action,"specs":retrain_specs})

    endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

    response = requests.post(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)
    assert response.json()

    print(response)
    print(response.json())

    job_map["retrain_" + model_name] = response.json()
    print(job_map)

<Response [201]>
1f94b7bd-35fb-481e-87b2-f3fce170dd08
{'train_multitask_classification': '1e778a2d-1597-4b62-ac68-5f5ee28ddacd', 'evaluate_multitask_classification': 'a5122cd1-cd6e-45b7-a8d6-581dd7490384', 'prune_multitask_classification': '4e99ea1e-d8ce-4ff6-b8f9-d834a5c82c14', 'retrain_multitask_classification': '1f94b7bd-35fb-481e-87b2-f3fce170dd08'}


In [91]:
if model_name != "classification_pyt":
    # Poll the retrain job status until it reaches a terminal state; re-run this cell if it is interrupted
    job_id = job_map["retrain_" + model_name]
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

    while True:    
        clear_output(wait=True)
        response = requests.get(endpoint, headers=headers)
        assert response.status_code in (200, 201)
        print(response)
        print(response.json())
        assert "status" in response.json().keys() and response.json().get("status") != "Error"
        if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
            break
        time.sleep(15)

<Response [200]>
{'action': 'retrain', 'created_on': '2024-08-20T06:46:26.541774', 'description': '', 'experiment_id': 'e8132f13-d5da-42fd-a6fb-758d401cd1a0', 'id': '1f94b7bd-35fb-481e-87b2-f3fce170dd08', 'job_tar_stats': {'file_size': 261889330, 'sha256_digest': '917c6df3610bf8e49fc3d51ee5a3a1750266f8727004d084edf298f28c8a44c9'}, 'last_modified': '2024-08-20T06:51:01.277568', 'name': '', 'parent_id': '4e99ea1e-d8ce-4ff6-b8f9-d834a5c82c14', 'result': {'categorical': [], 'cur_iter': None, 'detailed_status': {'date': '8/20/2024', 'message': 'Multi-Task classification finished successfully', 'status': 'SUCCESS', 'time': '6:50:28'}, 'epoch': 5, 'eta': '0:00:00', 'graphical': [{'metric': 'validation_loss', 'units': None, 'values': {'1': 4.347505, '2': 3.6965528, '3': 3.4004142, '4': 2.847764, '5': 2.819906}, 'x_max': 5, 'x_min': 0, 'y_max': 4.347505, 'y_min': 0.0}, {'metric': 'val_base_color_loss', 'units': None, 'values': {'1': 1.067409, '2': 1.0198623, '3': 0.95671356, '4': 0.795387, '5':

In [92]:
# if model_name != "classification_pyt":
#     # Optional cancel job - for jobs that are pending/running (retrain)

#     job_id = job_map["retrain_" + model_name]
#     endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:cancel"

#     response = requests.post(endpoint, headers=headers)
#     assert response.status_code in (200, 201)

#     print(response)
#     print(response.json())

In [None]:
# if model_name != "classification_pyt":
#     # Optional delete job - for jobs that are error/done (retrain)

#     job_id = job_map["retrain_" + model_name]
#     endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

#     response = requests.delete(endpoint, headers=headers)
#     assert response.status_code in (200, 201)

#     print(response)
#     print(response.json())

#### Evaluate after retrain <a class="anchor" id="head-16"></a>

In [93]:
# Get default spec schema
if model_name != "classification_pyt":
    endpoint = f"{base_url}/experiments/{experiment_id}/specs/evaluate/schema"

    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    #print(response.json()) ## Uncomment for verbose schema
    assert "default" in response.json().keys()
    eval_retrain_specs = response.json()["default"]
    print(json.dumps(eval_retrain_specs, sort_keys=True, indent=4))

<Response [200]>
{
    "batch_size": 1,
    "model_config": {
        "all_projections": true,
        "arch": "resnet",
        "input_image_size": "3,80,60",
        "n_layers": 10,
        "use_batch_norm": true
    },
    "random_seed": 42,
    "training_config": {
        "batch_size_per_gpu": 32,
        "checkpoint_interval": 1,
        "learning_rate": {
            "soft_start_annealing_schedule": {
                "annealing": 0.7,
                "max_learning_rate": 0.01,
                "min_learning_rate": 1e-06,
                "soft_start": 0.1
            }
        },
        "num_epochs": 10,
        "optimizer": {
            "sgd": {
                "momentum": 0.9,
                "nesterov": false
            }
        },
        "regularizer": {
            "type": "__L1__",
            "weight": 9e-05
        }
    }
}


In [94]:
# Modify specs dictionary to change any config parameters
if model_name != "classification_pyt":
    print(json.dumps(eval_retrain_specs, sort_keys=True, indent=4))

{
    "batch_size": 1,
    "model_config": {
        "all_projections": true,
        "arch": "resnet",
        "input_image_size": "3,80,60",
        "n_layers": 10,
        "use_batch_norm": true
    },
    "random_seed": 42,
    "training_config": {
        "batch_size_per_gpu": 32,
        "checkpoint_interval": 1,
        "learning_rate": {
            "soft_start_annealing_schedule": {
                "annealing": 0.7,
                "max_learning_rate": 0.01,
                "min_learning_rate": 1e-06,
                "soft_start": 0.1
            }
        },
        "num_epochs": 10,
        "optimizer": {
            "sgd": {
                "momentum": 0.9,
                "nesterov": false
            }
        },
        "regularizer": {
            "type": "__L1__",
            "weight": 9e-05
        }
    }
}


In [95]:
if model_name != "classification_pyt":
    # Run actions
    parent = job_map["retrain_" + model_name]
    action = "evaluate"
    data = json.dumps({"parent_job_id":parent,"action":action,"specs":eval_retrain_specs})

    endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

    response = requests.post(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)
    assert response.json()

    print(response)
    print(response.json())

    job_map["eval_retrain_" + model_name] = response.json()
    print(job_map)

<Response [201]>
03c5383e-1b9d-4511-866a-d403eec83c5d
{'train_multitask_classification': '1e778a2d-1597-4b62-ac68-5f5ee28ddacd', 'evaluate_multitask_classification': 'a5122cd1-cd6e-45b7-a8d6-581dd7490384', 'prune_multitask_classification': '4e99ea1e-d8ce-4ff6-b8f9-d834a5c82c14', 'retrain_multitask_classification': '1f94b7bd-35fb-481e-87b2-f3fce170dd08', 'eval_retrain_multitask_classification': '03c5383e-1b9d-4511-866a-d403eec83c5d'}


In [96]:
if model_name != "classification_pyt":
    # Poll the evaluate job status until it reaches a terminal state; re-run this cell if it is interrupted
    job_id = job_map["eval_retrain_" + model_name]
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

    while True:    
        clear_output(wait=True)
        response = requests.get(endpoint, headers=headers)
        assert response.status_code in (200, 201)
        print(response)
        print(response.json())
        assert "status" in response.json().keys() and response.json().get("status") != "Error"
        if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
            break
        time.sleep(15)

<Response [200]>
{'action': 'evaluate', 'created_on': '2024-08-20T06:57:28.233371', 'description': '', 'experiment_id': 'e8132f13-d5da-42fd-a6fb-758d401cd1a0', 'id': '03c5383e-1b9d-4511-866a-d403eec83c5d', 'job_tar_stats': {'file_size': 2705, 'sha256_digest': '01169a9d9030b99a2bd23feeca95a875a94161e7fca44e45b9f17ca30615e2e0'}, 'last_modified': '2024-08-20T06:58:11.026204', 'name': '', 'parent_id': '1f94b7bd-35fb-481e-87b2-f3fce170dd08', 'result': {'categorical': [], 'cur_iter': None, 'detailed_status': {'date': '8/20/2024', 'message': 'Evalation finished successfully.', 'status': 'SUCCESS', 'time': '6:57:46'}, 'epoch': None, 'eta': None, 'graphical': [], 'key_metric': 0.0, 'kpi': [{'metric': 'base_color', 'values': {'0': 0.765625}}, {'metric': 'category', 'values': {'0': 0.9811862244897959}}, {'metric': 'season', 'values': {'0': 0.7146045918367347}}, {'metric': 'mean accuracy', 'values': {'0': 0.8204719387755102}}], 'max_epoch': None, 'time_per_epoch': None, 'time_per_iter': None}, 'sp

In [97]:
response.json()

{'action': 'evaluate',
 'created_on': '2024-08-20T06:57:28.233371',
 'description': '',
 'experiment_id': 'e8132f13-d5da-42fd-a6fb-758d401cd1a0',
 'id': '03c5383e-1b9d-4511-866a-d403eec83c5d',
 'job_tar_stats': {'file_size': 2705,
  'sha256_digest': '01169a9d9030b99a2bd23feeca95a875a94161e7fca44e45b9f17ca30615e2e0'},
 'last_modified': '2024-08-20T06:58:11.026204',
 'name': '',
 'parent_id': '1f94b7bd-35fb-481e-87b2-f3fce170dd08',
 'result': {'categorical': [],
  'cur_iter': None,
  'detailed_status': {'date': '8/20/2024',
   'message': 'Evalation finished successfully.',
   'status': 'SUCCESS',
   'time': '6:57:46'},
  'epoch': None,
  'eta': None,
  'graphical': [],
  'key_metric': 0.0,
  'kpi': [{'metric': 'base_color', 'values': {'0': 0.765625}},
   {'metric': 'category', 'values': {'0': 0.9811862244897959}},
   {'metric': 'season', 'values': {'0': 0.7146045918367347}},
   {'metric': 'mean accuracy', 'values': {'0': 0.8204719387755102}}],
  'max_epoch': None,
  'time_per_epoch': Non
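
Placing the two evaluate jobs side by side shows how the metrics changed between the originally trained model and the retrained one. A short sketch using the KPI values reported by the two evaluate outputs in this notebook (the variable names are illustrative):

```python
# Accuracy per task, copied from the evaluate outputs before and after retraining
before = {
    "base_color": 0.6744260204081632,
    "category": 0.8995535714285714,
    "season": 0.6552933673469388,
    "mean accuracy": 0.743090986394558,
}
after = {
    "base_color": 0.765625,
    "category": 0.9811862244897959,
    "season": 0.7146045918367347,
    "mean accuracy": 0.8204719387755102,
}

# Positive delta means the retrained model scored higher on that metric
deltas = {metric: after[metric] - before[metric] for metric in before}
for metric, delta in deltas.items():
    print(f"{metric:15s} {before[metric]:.4f} -> {after[metric]:.4f} ({delta:+.4f})")
```

In this run every metric is higher after retraining, with mean accuracy rising from about 0.743 to about 0.820.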

### Export <a class="anchor" id="head-17"></a>

In [98]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/export/schema"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
#print(response.json()) ## Uncomment for verbose schema
assert "default" in response.json().keys()
export_specs = response.json()["default"]
print(json.dumps(export_specs, sort_keys=True, indent=4))

<Response [200]>
{
    "backend": "onnx",
    "batch_size": 4,
    "batches": 10,
    "data_type": "fp32",
    "force_ptq": false,
    "gen_ds_config": true,
    "strict_type_constraints": false,
    "version": "1"
}


In [99]:
# Apply changes to spec dictionary if required
print(json.dumps(export_specs, sort_keys=True, indent=4))

{
    "backend": "onnx",
    "batch_size": 4,
    "batches": 10,
    "data_type": "fp32",
    "force_ptq": false,
    "gen_ds_config": true,
    "strict_type_constraints": false,
    "version": "1"
}


In [100]:
# Run action
parent = job_map["train_" + model_name]
action = "export"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":export_specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(response.json())

job_map["export_" + model_name] = response.json()
print(job_map)

<Response [201]>
1bc50b81-05c5-4000-b372-5e3a387d0f28
{'train_multitask_classification': '1e778a2d-1597-4b62-ac68-5f5ee28ddacd', 'evaluate_multitask_classification': 'a5122cd1-cd6e-45b7-a8d6-581dd7490384', 'prune_multitask_classification': '4e99ea1e-d8ce-4ff6-b8f9-d834a5c82c14', 'retrain_multitask_classification': '1f94b7bd-35fb-481e-87b2-f3fce170dd08', 'eval_retrain_multitask_classification': '03c5383e-1b9d-4511-866a-d403eec83c5d', 'export_multitask_classification': '1bc50b81-05c5-4000-b372-5e3a387d0f28'}


In [None]:
# Poll the export job status until it reaches a terminal state; re-run this cell if it is interrupted
job_id = job_map["export_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    print(response)
    print(response.json())
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
        break
    time.sleep(15)

<Response [200]>
{'action': 'export', 'created_on': '2024-08-20T07:05:37.785520', 'description': '', 'experiment_id': 'e8132f13-d5da-42fd-a6fb-758d401cd1a0', 'id': '1bc50b81-05c5-4000-b372-5e3a387d0f28', 'last_modified': '2024-08-20T07:05:47.585734', 'name': '', 'parent_id': '1e778a2d-1597-4b62-ac68-5f5ee28ddacd', 'result': {'categorical': [], 'cur_iter': None, 'detailed_status': {'date': '', 'message': '', 'status': '', 'time': ''}, 'epoch': None, 'eta': None, 'graphical': [], 'key_metric': 0.0, 'kpi': [], 'max_epoch': None, 'time_per_epoch': None, 'time_per_iter': None}, 'specs': {'backend': 'onnx', 'batch_size': 4, 'batches': 10, 'data_type': 'fp32', 'force_ptq': False, 'gen_ds_config': True, 'strict_type_constraints': False, 'version': '1'}, 'status': 'Running'}


### TRT Engine generation using TAO-Deploy <a class="anchor" id="head-19"></a>

- Here, we use the exported model to generate a TensorRT (TRT) engine

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/gen_trt_engine/schema"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
#print(response.json()) ## Uncomment for verbose schema
assert "default" in response.json().keys()
tao_deploy_specs = response.json()["default"]
print(json.dumps(tao_deploy_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes
if model_name == "classification_tf2":
    tao_deploy_specs["gen_trt_engine"]["tensorrt"]["data_type"] = "int8"
elif model_name == "classification_pyt":
    tao_deploy_specs["gen_trt_engine"]["tensorrt"]["data_type"] = "fp16"
else:
    tao_deploy_specs["data_type"] = "int8"
print(json.dumps(tao_deploy_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["export_" + model_name]
action = "gen_trt_engine"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":tao_deploy_specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(response.json())

job_map["model_gen_trt_engine_" + model_name] = response.json()
print(job_map)

In [None]:
# Poll the gen_trt_engine job status until it reaches a terminal state; re-run this cell if it is interrupted
job_id = job_map["model_gen_trt_engine_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    print(response)
    print(response.json())
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### TAO inference <a class="anchor" id="head-20"></a>

- Run inference on a set of images using the .tlt model created at the train step

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/inference/schema"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
#print(response.json()) ## Uncomment for verbose schema
assert "default" in response.json().keys()
tao_inference_specs = response.json()["default"]
print(json.dumps(tao_inference_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes to the specs you want to modify
print(json.dumps(tao_inference_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["train_" + model_name]
action = "inference"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":tao_inference_specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(response.json())

job_map["inference_tlt_" + model_name] = response.json()
print(job_map)

In [None]:
# Poll the inference job status until it reaches a terminal state; re-run this cell if it is interrupted
job_id = job_map["inference_tlt_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    print(response)
    print(response.json())
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
        break
    time.sleep(15)

In [None]:
# Download job contents once the above job shows "Done" status
if download_jobs:
    job_id = job_map["inference_tlt_" + model_name]
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    expected_file_size = response.json().get("job_tar_stats", {}).get("file_size")
    print("expected_file_size: ", expected_file_size)

    !python3 -m pip install tqdm
    from tqdm import tqdm

    endpoint = f'{base_url}/experiments/{experiment_id}/jobs/{job_id}:download'
    temptar = f'{job_id}.tar.gz'

    with tqdm(total=expected_file_size, unit='B', unit_scale=True) as progress_bar:
        while True:
            # Check if the file already exists
            headers_download_job = dict(headers)
            if os.path.exists(temptar):
                # Get the current file size
                file_size = os.path.getsize(temptar)
                print(f"File size of downloaded content until now is {file_size}")

                # If the file size matches the expected size, break out of the loop
                if file_size >= (expected_file_size-1):
                    print("Download completed successfully.")
                    print("Untarring")
                    # Untar to destination
                    tar_command = f'tar -xf {temptar} -C {workdir}/'
                    os.system(tar_command)
                    os.remove(temptar)
                    print(f"Results at {workdir}/{job_id}")
                    inference_out_path = f"{workdir}/{job_id}"
                    break

                # Set the headers to resume the download from where it left off
                headers_download_job['Range'] = f'bytes={file_size}-'
            # Open the file for writing in binary mode
            with open(temptar, 'ab') as f:
                try:
                    response = requests.get(endpoint, headers=headers_download_job, stream=True)
                    print(response)
                    # Check if the request was successful
                    if response.status_code in [200, 206]:
                        # Iterate over the content in chunks
                        for chunk in response.iter_content(chunk_size=1024):
                            if chunk:
                                # Write the chunk to the file
                                f.write(chunk)
                                # Flush and sync the file to disk
                                f.flush()
                                os.fsync(f.fileno())
                            progress_bar.update(len(chunk))
                    else:
                        print(f"Failed to download file. Status code: {response.status_code}")
                        time.sleep(5)  # Back off before the loop retries, to avoid hammering the server
                except requests.exceptions.RequestException as e:
                    print("Connection interrupted during download, resuming download from breaking point")
                    time.sleep(5)  # Sleep for a while before retrying the request
                    continue  # Continue the loop to retry the request
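
The download cell resumes an interrupted transfer by sending an HTTP `Range` header that starts at the size of the partial file. That step can be isolated as in the sketch below. The helper `resume_headers` and the `Authorization` value are illustrative placeholders. Note that a server that ignores `Range` answers `200` with the full body instead of `206`, in which case appending to the partial file would duplicate data; whether the TAO service always honors `Range` is not stated in this notebook.

```python
import os
import tempfile


def resume_headers(base_headers, partial_path):
    """Return (headers, offset) for continuing a download into `partial_path`."""
    headers = dict(base_headers)  # copy so the shared headers are not mutated
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    if offset > 0:
        # Ask the server for the remaining bytes only
        headers["Range"] = f"bytes={offset}-"
    return headers, offset


with tempfile.TemporaryDirectory() as tmp_dir:
    partial = os.path.join(tmp_dir, "job.tar.gz")
    base = {"Authorization": "Bearer <token>"}  # placeholder header

    # No file yet: download from the beginning
    fresh_headers, fresh_offset = resume_headers(base, partial)

    # Simulate 1024 bytes already written by an interrupted download
    with open(partial, "wb") as f:
        f.write(b"x" * 1024)
    resumed_headers, resumed_offset = resume_headers(base, partial)

print(fresh_headers, fresh_offset)
print(resumed_headers, resumed_offset)
```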

In [None]:
# Print Classification results
if download_jobs:
    if model_name == "classification_tf1":
        assert os.path.exists(f'{inference_out_path}/result.csv')
        !cat {inference_out_path}/result.csv
    elif "classification_" in model_name:
        assert os.path.exists(f'{inference_out_path}/inference/result.csv')
        !cat {inference_out_path}/inference/result.csv
    elif model_name == "multitask_classification":
        assert os.path.exists(f'{inference_out_path}/result.txt')
        !cat {inference_out_path}/result.txt
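
Printing the file with `cat` is fine for a quick look; for larger test sets a per-class summary may be easier to read. The sketch below assumes each row has the layout `image path, predicted label, score` with no header row. That layout is an assumption and should be checked against the actual `result.csv`.

```python
import csv
import io
from collections import Counter


def count_predictions(csv_file, label_column=1):
    """Count how many rows carry each predicted label."""
    counts = Counter()
    for row in csv.reader(csv_file):
        if len(row) > label_column:  # skip blank or malformed rows
            counts[row[label_column]] += 1
    return counts


# Made-up rows standing in for the contents of result.csv
sample = io.StringIO(
    "img_001.jpg,cat,0.97\n"
    "img_002.jpg,dog,0.88\n"
    "img_003.jpg,cat,0.64\n"
)
label_counts = count_predictions(sample)
print(label_counts.most_common())
```

With a real file, `count_predictions(open(path, newline=""))` would replace the `io.StringIO` stand-in.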

### TRT inference <a class="anchor" id="head-21"></a>

- There is no need to change the specs, since they were already uploaded at the TLT inference step.
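
Once `trt_inference_specs` has been fetched in the next cell, comparing it with `tao_inference_specs` is one way to confirm that both runs use the same settings. The sketch below is a small recursive diff; the function `diff_specs` and the sample keys are hypothetical.

```python
def diff_specs(left, right, prefix=""):
    """Return the dotted paths whose values differ between two nested dictionaries."""
    differences = []
    for key in sorted(set(left) | set(right)):
        path = f"{prefix}{key}"
        if key not in left or key not in right:
            # Key present on one side only
            differences.append(path)
        elif isinstance(left[key], dict) and isinstance(right[key], dict):
            # Descend into nested sections
            differences.extend(diff_specs(left[key], right[key], prefix=f"{path}."))
        elif left[key] != right[key]:
            differences.append(path)
    return differences


# Made-up spec dictionaries for illustration
specs_a = {"inference": {"batch_size": 1, "checkpoint": "a"}, "gpus": 1}
specs_b = {"inference": {"batch_size": 4, "checkpoint": "a"}}
changed = diff_specs(specs_a, specs_b)
print(changed)
```

An empty list from `diff_specs(tao_inference_specs, trt_inference_specs)` would indicate identical specs.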

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/inference/schema"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
#print(response.json()) ## Uncomment for verbose schema
assert "default" in response.json().keys()
trt_inference_specs = response.json()["default"]
print(json.dumps(trt_inference_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes to the specs you want to modify
print(json.dumps(trt_inference_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["model_gen_trt_engine_" + model_name]
action = "inference"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":trt_inference_specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(response.json())

job_map["inference_trt_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["inference_trt_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    print(response)
    print(response.json())
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
        break
    time.sleep(15)

In [None]:
# Download job contents once the above job shows "Done" status
if download_jobs:
    job_id = job_map["inference_trt_" + model_name]
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    expected_file_size = response.json().get("job_tar_stats", {}).get("file_size")
    print("expected_file_size: ", expected_file_size)

    !python3 -m pip install tqdm
    from tqdm import tqdm

    endpoint = f'{base_url}/experiments/{experiment_id}/jobs/{job_id}:download'
    temptar = f'{job_id}.tar.gz'

    with tqdm(total=expected_file_size, unit='B', unit_scale=True) as progress_bar:
        while True:
            # Check if the file already exists
            headers_download_job = dict(headers)
            if os.path.exists(temptar):
                # Get the current file size
                file_size = os.path.getsize(temptar)
                print(f"File size of downloaded content until now is {file_size}")

                # If the file size matches the expected size, break out of the loop
                if file_size >= (expected_file_size-1):
                    print("Download completed successfully.")
                    print("Untarring")
                    # Untar to destination
                    tar_command = f'tar -xf {temptar} -C {workdir}/'
                    os.system(tar_command)
                    os.remove(temptar)
                    print(f"Results at {workdir}/{job_id}")
                    inference_out_path = f"{workdir}/{job_id}"
                    break

                # Set the headers to resume the download from where it left off
                headers_download_job['Range'] = f'bytes={file_size}-'
            # Open the file for writing in binary mode
            with open(temptar, 'ab') as f:
                try:
                    response = requests.get(endpoint, headers=headers_download_job, stream=True)
                    print(response)
                    # Check if the request was successful
                    if response.status_code in [200, 206]:
                        # Iterate over the content in chunks
                        for chunk in response.iter_content(chunk_size=1024):
                            if chunk:
                                # Write the chunk to the file
                                f.write(chunk)
                                # Flush and sync the file to disk
                                f.flush()
                                os.fsync(f.fileno())
                            progress_bar.update(len(chunk))
                    else:
                        print(f"Failed to download file. Status code: {response.status_code}")
                        time.sleep(5)  # Back off before the loop retries, to avoid hammering the server
                except requests.exceptions.RequestException as e:
                    print("Connection interrupted during download, resuming download from breaking point")
                    time.sleep(5)  # Sleep for a while before retrying the request
                    continue  # Continue the loop to retry the request
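
The download cells unpack the archive by shelling out to `tar` through `os.system`, which ignores the command's exit status. An alternative using the standard-library `tarfile` module is sketched below; it raises on failure and rejects archive members that would land outside the destination. The helper `extract_archive` and the file names in the demo are illustrative, not part of the notebook.

```python
import io
import os
import tarfile
import tempfile


def extract_archive(archive_path, destination):
    """Extract a tar archive into `destination`, rejecting paths that escape it."""
    os.makedirs(destination, exist_ok=True)
    destination = os.path.realpath(destination)
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            target = os.path.realpath(os.path.join(destination, member.name))
            if os.path.commonpath([destination, target]) != destination:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions ship extraction filters; use the safe one
            archive.extractall(destination, filter="data")
        else:
            archive.extractall(destination)
    return destination


with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a tiny archive that mimics a job directory containing one file
    archive_path = os.path.join(tmp_dir, "job.tar.gz")
    payload = b"img_001.jpg,cat,0.97\n"
    with tarfile.open(archive_path, "w:gz") as archive:
        info = tarfile.TarInfo(name="job_id/result.csv")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

    out_dir = extract_archive(archive_path, os.path.join(tmp_dir, "out"))
    extracted_files = sorted(os.listdir(os.path.join(out_dir, "job_id")))

print(extracted_files)
```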

In [None]:
# Print Classification results
if download_jobs:
    if model_name in ("classification_tf1", "multitask_classification"):
        !cat {inference_out_path}/result.csv
    elif "classification_" in model_name:
        !cat {inference_out_path}/inference/result.csv
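
With both inference runs finished, it can be useful to check how often the TRT engine and the original model agree. The sketch below compares predicted labels per image, again assuming rows of the form `image path, predicted label, score`; that layout is an assumption. Note that the TRT download cell overwrites `inference_out_path`, so the path of the earlier results would have to be saved in a separate variable first.

```python
import csv
import io


def load_predictions(csv_file):
    """Map each image path (column 0) to its predicted label (column 1)."""
    return {row[0]: row[1] for row in csv.reader(csv_file) if len(row) >= 2}


def agreement(reference, candidate):
    """Return (fraction of shared images with equal labels, sorted mismatching images)."""
    shared = set(reference) & set(candidate)
    if not shared:
        return 0.0, []
    mismatches = sorted(name for name in shared if reference[name] != candidate[name])
    return 1.0 - len(mismatches) / len(shared), mismatches


# Made-up rows standing in for the two result files
tao_rows = io.StringIO("a.jpg,cat,0.97\nb.jpg,dog,0.88\nc.jpg,cat,0.64\nd.jpg,dog,0.91\n")
trt_rows = io.StringIO("a.jpg,cat,0.96\nb.jpg,dog,0.87\nc.jpg,dog,0.55\nd.jpg,dog,0.90\n")

rate, differing = agreement(load_predictions(tao_rows), load_predictions(trt_rows))
print(rate, differing)
```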

### Delete experiment <a class="anchor" id="head-22"></a>

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())

### Delete dataset <a class="anchor" id="head-23"></a>

#### Delete train dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{base_url}/datasets/{train_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())

#### Delete val dataset <a class="anchor" id="head-25"></a>

In [None]:
endpoint = f"{base_url}/datasets/{eval_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())

#### Delete test dataset <a class="anchor" id="head-26"></a>

In [None]:
endpoint = f"{base_url}/datasets/{test_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())
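
The four delete cells above each stop at the first failed assertion, which can leave later resources behind. A possible alternative that attempts every deletion and reports what failed is sketched below. The helper `delete_resources`, the `fake_delete` stand-in, and the URLs are hypothetical.

```python
def delete_resources(delete, urls):
    """Call `delete(url)` for every URL and collect the ones that did not succeed."""
    failed = []
    for url in urls:
        try:
            status_code = delete(url)
        except Exception as error:
            # Keep going so the remaining resources still get a delete attempt
            failed.append((url, repr(error)))
            continue
        if status_code not in (200, 201):
            failed.append((url, status_code))
    return failed


def fake_delete(url):
    """Stand-in for an HTTP DELETE call; returns a status code."""
    return 404 if url.endswith("/missing") else 200


leftovers = delete_resources(fake_delete, [
    "http://host/api/experiments/exp-1",  # placeholder URLs
    "http://host/api/datasets/train-1",
    "http://host/api/datasets/missing",
])
print(leftovers)
```

In the notebook, `delete` could be `lambda url: requests.delete(url, headers=headers).status_code`, with the URLs built from `experiment_id`, `train_dataset_id`, `eval_dataset_id`, and `test_dataset_id`.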