# TensorRT Introduction

TensorRT is NVIDIA's SDK for high-performance deep learning inference. It includes an inference optimizer and a runtime that together deliver low latency and high throughput for inference applications.


## Installing Required Libraries

To set up the environment for TensorRT, `torch2trt`, ONNX, and PyCUDA, follow these steps:


In [None]:
!git clone -b TensorRT https://github.com/yundogyeong/bootcamp.git
!mv ./bootcamp/*.py ./
!git clone https://github.com/NVIDIA-AI-IOT/torch2trt
%cd torch2trt
!pip install tensorrt
!python3 setup.py install
%cd /content/

fatal: destination path 'bootcamp' already exists and is not an empty directory.
mv: cannot stat './bootcamp/*.py': No such file or directory
fatal: destination path 'torch2trt' already exists and is not an empty directory.
/content/torch2trt
running install
!!

        ********************************************************************************
        Please avoid running ``setup.py`` directly.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
        ********************************************************************************

!!
  self.initialize_options()
!!

        ********************************************************************************
        Please avoid running ``setup.py`` and ``easy_install``.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://github.com/pypa/setuptools/is

In [None]:
!pip3 install onnx
!pip3 install pycuda



## Importing Libraries

Import the necessary libraries and modules for working with TensorRT and PyTorch.


In [None]:
import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Subset

import onnx
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

from model import cifar10_resnet20
from utils import module_wrapper, accuracy

## Exporting the PyTorch Model to ONNX

TensorRT's own parser does not read PyTorch models directly, so we first export the PyTorch model to the ONNX format. The ONNX model can then be converted to a TensorRT engine for optimized inference.


In [None]:
model = cifar10_resnet20(pretrained=True)
model.eval()

dummy_input = torch.randn(1, 3, 32, 32)
onnx_model_path = "resnet20.onnx"

# Export the model to ONNX
torch.onnx.export(model, dummy_input, onnx_model_path,
                  input_names=["input"], output_names=["output"],
                  opset_version=11)

## Converting ONNX Model to TensorRT Engine

The following cell converts the ONNX model to a TensorRT engine. It parses the ONNX file into a TensorRT network, enables FP16 precision in the builder configuration, builds a serialized engine, and writes the engine to disk.


In [None]:
# Create TensorRT builder, network, and parser
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

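# Parse the ONNX model; stop on failure instead of building from a broken network.
# The file handle is named `f` so it does not shadow the PyTorch `model` defined earlier.
with open(onnx_model_path, 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError(f"Failed to parse {onnx_model_path}")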

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # Set 16-bit floating point mode

# Create optimization profile (no shape ranges are set; the ONNX export above uses a fixed input shape)
profile = builder.create_optimization_profile()
config.add_optimization_profile(profile)

# Build TensorRT engine
serialized_engine = builder.build_serialized_network(network, config)

# Save the engine
engine_path = "resnet20_fp16.trt"
with open(engine_path, "wb") as f:
    f.write(serialized_engine)

print(f"TensorRT engine has been saved to {engine_path}.")

TensorRT engine has been saved to resnet20_fp16.trt.
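
To get a feel for what FP16 trades away, the sketch below rounds a few values to IEEE 754 half precision using only Python's standard `struct` module. It illustrates the number format only, not how TensorRT selects kernels, and the helper name `to_fp16` is made up for this example.

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The 'e' format code packs a 16-bit float; unpacking returns the rounded value.
    return struct.unpack("<e", struct.pack("<e", value))[0]

# 65504.0 is the largest finite half-precision value
for x in (0.1, 1.0, 1000.123, 65504.0):
    rounded = to_fp16(x)
    print(f"{x!r:>10} -> {rounded!r:<14} abs error {abs(x - rounded):.6g}")
```

Values that are exactly representable, such as 1.0, pass through unchanged, while others pick up a small rounding error that grows with magnitude.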


## TensorRT Inference and Evaluation

This section includes a class and functions to handle loading a TensorRT engine, performing inference, and evaluating the model on the CIFAR-10 dataset.


In [None]:
# Function to load TensorRT engine
def load_engine(engine_file_path):
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(TRT_LOGGER)
    with open(engine_file_path, "rb") as f:
        engine_data = f.read()
    engine = runtime.deserialize_cuda_engine(engine_data)
    return engine

In [None]:
class Inference:
    def __init__(self, engine_path):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.engine = load_engine(engine_path)
        self.engine_params = self.prepare_trt_engine()  # Prepare TensorRT engine
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
        ])

        self.testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=self.transform)
        self.indices = np.random.choice(len(self.testset), 500, replace=False)
        self.subset = Subset(self.testset, self.indices)
        self.testloader = DataLoader(self.subset, batch_size=1, shuffle=True, num_workers=2)

    # Function to prepare TensorRT engine
    def prepare_trt_engine(self):
        host_inputs = []
        cuda_inputs = []
        host_outputs = []
        cuda_outputs = []
        bindings = []

        for binding in self.engine:
            size = trt.volume(self.engine.get_tensor_shape(binding))
            dtype = trt.nptype(self.engine.get_tensor_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            cuda_mem = cuda.mem_alloc(host_mem.nbytes)

            bindings.append(int(cuda_mem))  # Add to bindings
            if self.engine.get_tensor_mode(binding) == trt.TensorIOMode.INPUT:  # If input tensor
                host_inputs.append(host_mem)
                cuda_inputs.append(cuda_mem)
            else:  # If output tensor
                host_outputs.append(host_mem)
                cuda_outputs.append(cuda_mem)

        return host_inputs, cuda_inputs, host_outputs, cuda_outputs, bindings, self.engine.create_execution_context()

    # Inference function
    def infer(self, input_data):
        host_inputs, cuda_inputs, host_outputs, cuda_outputs, bindings, context = self.engine_params

        np.copyto(host_inputs[0], input_data.ravel())  # Copy input data to host input buffer
        cuda.memcpy_htod(cuda_inputs[0], host_inputs[0])  # Copy host input buffer to CUDA input buffer
        context.execute_v2(bindings)  # Execute inference
        cuda.memcpy_dtoh(host_outputs[0], cuda_outputs[0])  # Copy CUDA output buffer to host output buffer

        return host_outputs[0]

    # Model evaluation function
    def evaluate(self):
        top1_acc = 0
        top5_acc = 0

        for images, targets in self.testloader:
            images = images.numpy()
            outputs = self.infer(images)
            outputs = torch.tensor(outputs)  # Convert outputs to tensor
            outputs = torch.reshape(outputs, (-1, 10))  # Reshape outputs

            acc1, acc5 = accuracy(outputs.data, targets.data, topk=(1, 5))
            top1_acc += acc1.item()
            top5_acc += acc5.item()

        avg_top1_acc = top1_acc / len(self.testloader)
        avg_top5_acc = top5_acc / len(self.testloader)

        return avg_top1_acc, avg_top5_acc
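
The allocation loop in `prepare_trt_engine` sizes each buffer as the number of elements in the tensor times the element size. The sketch below repeats that arithmetic in plain Python for the shapes used in this notebook, `(1, 3, 32, 32)` for the input and `(1, 10)` for the output, assuming 4-byte float32 I/O tensors. The helpers `volume` and `buffer_nbytes` are illustrative and are not TensorRT functions.

```python
from math import prod

def volume(shape):
    """Number of elements in a tensor of the given shape."""
    return prod(shape)

def buffer_nbytes(shape, itemsize=4):
    """Bytes needed for one buffer; itemsize=4 assumes float32 elements."""
    return volume(shape) * itemsize

print(buffer_nbytes((1, 3, 32, 32)))  # input buffer size in bytes
print(buffer_nbytes((1, 10)))         # output buffer size in bytes
```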

## Running Inference and Evaluating the Model (FP16)

With the FP16 TensorRT engine ready, we can now perform inference and evaluate the model's performance on the CIFAR-10 dataset.

In [None]:
engine_path = '/content/resnet20_fp16.trt'

inference = Inference(engine_path)
avg_top1_acc, avg_top5_acc = inference.evaluate()
print(avg_top1_acc, avg_top5_acc)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [00:03<00:00, 44007194.62it/s]


Extracting ./data/cifar-10-python.tar.gz to ./data
92.4 100.0
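
The two numbers printed are the top-1 and top-5 accuracy in percent. The `accuracy` helper comes from the course's `utils` module and is not shown in this notebook, so the following is only a plain-Python sketch of how such a metric is commonly computed. The function name `topk_accuracy` and the toy scores are assumptions for illustration.

```python
def topk_accuracy(scores, targets, k=1):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k largest scores in this row
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in topk
    return 100.0 * hits / len(targets)

scores = [
    [0.10, 0.70, 0.20],  # highest score: class 1
    [0.50, 0.30, 0.20],  # highest score: class 0
    [0.20, 0.30, 0.50],  # highest score: class 2
    [0.40, 0.35, 0.25],  # highest score: class 0
]
targets = [1, 1, 2, 2]

print(topk_accuracy(scores, targets, k=1))  # 50.0
print(topk_accuracy(scores, targets, k=2))  # 75.0
```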


In [None]:
# @title preparing calibration data (extract tar file to png)
!python3 data.py

Unpacking Train File 1/5
Unpacking Train File 2/5
Unpacking Train File 3/5
Unpacking Train File 4/5
Unpacking Train File 5/5
Unpacking Test File
Unpacking Finish


## INT8 Calibration and Building TensorRT Engine

To build a TensorRT engine with INT8 precision and calibration, use the following command. This command takes various arguments to specify the ONNX model, output engine, precision, calibration inputs, and other parameters.


<strong>NOTICE:</strong> If you need to create a new calibration engine, make sure to change the cache name.
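
As background, calibration exists to pick a scale that maps floating-point values onto the range of a signed 8-bit integer. The sketch below uses the simplest possible rule, the largest absolute value seen in the calibration samples. This rule is a stand-in chosen for brevity and is not the entropy-based calibrator that TensorRT provides; all names here are illustrative.

```python
def calibrate_scale(samples):
    """Pick a symmetric scale so the largest observed magnitude maps to 127."""
    max_abs = max(abs(v) for v in samples)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(value, scale):
    """Map a float to a signed 8-bit integer, clamping out-of-range values."""
    q = round(value / scale)
    return max(-127, min(127, q))

def dequantize(q, scale):
    """Map the 8-bit integer back to an approximate float."""
    return q * scale

calibration_samples = [-2.54, -0.3, 0.0, 0.8, 1.0]
scale = calibrate_scale(calibration_samples)

# 5.0 lies outside the calibrated range and is clamped to the largest level
for x in (0.8, 1.0, 5.0):
    q = quantize(x, scale)
    print(x, q, round(dequantize(q, scale), 4))
```

Values outside the range covered by the calibration data are clamped, which is one reason the choice and number of calibration images can affect accuracy.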



### Command Line Arguments for builder.py

Below are the command line arguments for the `builder.py` script, which are used to configure the process of building a TensorRT engine.

- **`-o, --onnx`**: The input ONNX model file to load.
- **`-e, --engine`**: The output path for the TensorRT engine.
- **`-p, --precision`**: The precision mode to build in: `int8`, `fp16`, or `mix`. Default: `int8`.
- **`--calib_input`**: The directory holding the images to use for calibration.
- **`--calib_cache`**: The file path of the INT8 calibration cache to use. Default: `./calibration.cache`.
- **`--calib_num_images`**: The maximum number of images to use for calibration. Default: 5000.
- **`--calib_batch_size`**: The batch size for the calibration process. Default: 8.
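
For orientation, a parser that accepts the arguments listed above could look roughly like the following. This is a sketch reconstructed from the descriptions and defaults in this section, not the actual contents of `builder.py`, which may differ in its details.

```python
import argparse

def make_parser():
    """Build an argument parser mirroring the options described for builder.py (sketch)."""
    parser = argparse.ArgumentParser(description="Build a TensorRT engine from an ONNX model")
    parser.add_argument("-o", "--onnx", help="The input ONNX model file to load")
    parser.add_argument("-e", "--engine", help="The output path for the TRT engine")
    parser.add_argument("-p", "--precision", default="int8", choices=["int8", "fp16", "mix"],
                        help="The precision mode to build in")
    parser.add_argument("--calib_input", help="The directory holding images to use for calibration")
    parser.add_argument("--calib_cache", default="./calibration.cache",
                        help="The file path for INT8 calibration cache to use")
    parser.add_argument("--calib_num_images", default=5000, type=int,
                        help="The maximum number of images to use for calibration")
    parser.add_argument("--calib_batch_size", default=8, type=int,
                        help="The batch size for the calibration process")
    return parser

# Parse an explicit argument list so the sketch runs without a real command line
args = make_parser().parse_args([
    "--onnx=resnet20.onnx", "--engine=resnet20_int8.trt",
    "--calib_num_images=1000", "--calib_batch_size=4",
])
print(args.precision, args.calib_cache, args.calib_num_images, args.calib_batch_size)
```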


TODO: Change the `--calib_input`, `--calib_num_images`, and `--calib_batch_size` parameters, then observe how the accuracy of the calibrated engine changes.

(As a further exercise, try modifying the image augmentation in the `preprocess_image` function of `image_batcher.py`.)
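
When trying several calibration settings, one way to follow the notice about cache names is to derive the cache file name from the settings themselves, so that each configuration gets its own cache. The helper `cache_name_for` below is hypothetical and is not part of `builder.py`.

```python
import hashlib

def cache_name_for(calib_input, num_images, batch_size, prefix="calib"):
    """Build a cache file name that changes whenever a calibration setting changes."""
    key = f"{calib_input}|{num_images}|{batch_size}"
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_n{num_images}_b{batch_size}_{digest}.cache"

a = cache_name_for("/content/data/cifar-10-batches-py/train", 1000, 4)
b = cache_name_for("/content/data/cifar-10-batches-py/train", 2000, 4)
print(a)
print(b)
```

The resulting string can then be passed as the value of `--calib_cache`.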

In [None]:
# If you need to create a new calibration engine, make sure to change the cache name.
!python3 builder.py --onnx=resnet20.onnx --engine=resnet20_int8.trt --precision=int8 --calib_input=/content/data/cifar-10-batches-py/train --calib_num_images=1000 --calib_cache=calib_test.cache --calib_batch_size=4

[08/04/2024-06:20:35] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 25, GPU 103 (MiB)
[08/04/2024-06:20:37] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +903, GPU +180, now: CPU 1081, GPU 283 (MiB)
INFO:EngineBuilder:Network Description
INFO:EngineBuilder:Input 'input' with shape (1, 3, 32, 32) and dtype DataType.FLOAT
INFO:EngineBuilder:Output 'output' with shape (1, 10) and dtype DataType.FLOAT
INFO:EngineBuilder:Building int8 Engine in /content/resnet20_int8.trt
  self.config.int8_calibrator = EngineCalibrator(calib_cache)
  self.config.int8_calibrator.set_image_batcher(ImageBatcher(calib_input,
[08/04/2024-06:20:37] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[08/04/2024-06:20:37] [TRT] [W] Heuristics has been ignored in this builder run. This feature is only supported on Ampere and beyond.
[08/04/2024-06:20:37] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.


## Running Inference and Evaluating the Model (INT8)

With the calibrated INT8 TensorRT engine ready, we can now perform inference and evaluate the model's performance on the CIFAR-10 dataset.

In [None]:
engine_path = '/content/resnet20_int8.trt'

inference = Inference(engine_path)
avg_top1_acc, avg_top5_acc = inference.evaluate()
print(avg_top1_acc, avg_top5_acc)

Files already downloaded and verified
92.8 99.8


## Measuring Inference Time for Different Precision Models

The following cell measures and prints the time of a single inference call for the INT8 and FP16 TensorRT engines. The function `measure_inference_time` takes an engine path and input data, runs one inference between two CUDA events, and prints the elapsed time in seconds.


In [None]:
def measure_inference_time(engine_path, input_data):
    inference = Inference(engine_path)

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    start_event.record()
    inference.infer(input_data)
    end_event.record()
    torch.cuda.synchronize()

    # Event.elapsed_time() returns milliseconds; it is converted to seconds when printed
    elapsed_time = start_event.elapsed_time(end_event)

    print(f"GPU inference time for engine {engine_path}: {elapsed_time / 1000:.6f} seconds")

# Define input data
input_data = torch.randn(3, 32, 32).numpy()

# Measure inference time for INT8 model
engine_path_int8 = '/content/resnet20_int8.trt'
measure_inference_time(engine_path_int8, input_data)

# Measure inference time for FP16 model
engine_path_fp16 = '/content/resnet20_fp16.trt'
measure_inference_time(engine_path_fp16, input_data)

Files already downloaded and verified
GPU inference time: 0.000620 seconds
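
A single timed call includes one-time start-up costs and tends to be noisy. A common refinement is to run a few warm-up iterations and then average over many calls. The helper below is a generic sketch that uses wall-clock time from the standard library; `benchmark` and its parameters are assumptions for illustration, and the stand-in workload would be replaced by something like `lambda: inference.infer(input_data)`.

```python
import time

def benchmark(fn, warmup=10, iters=100):
    """Return the mean wall-clock time per call of fn, in seconds."""
    for _ in range(warmup):  # warm-up runs are not timed
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Stand-in workload so the sketch runs on any machine
mean_s = benchmark(lambda: sum(range(1000)))
print(f"mean time per call: {mean_s * 1e3:.4f} ms")
```

Wall-clock timing is only meaningful here if the timed function returns after the GPU work has finished, which should hold for `infer` as written because it ends with a blocking device-to-host copy.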


# Torch2TRT Introduction

The `torch2trt` library simplifies the conversion of PyTorch models to TensorRT. Unlike the ONNX-based workflow above, `torch2trt` converts a PyTorch model to TensorRT without a separate ONNX export step. The converted module is called like a regular PyTorch module, so existing inference code can be reused with little or no modification.

## Advantages of torch2trt

- **Direct Conversion**: Converts PyTorch models to TensorRT directly without the need for ONNX as an intermediate format.
- **Ease of Use**: The conversion process is straightforward and integrated, requiring minimal changes to your workflow.
- **Unchanged Inference Code**: Allows you to continue using your existing PyTorch inference code, enabling an easy and quick adoption of TensorRT for optimized inference.

By leveraging `torch2trt`, you can achieve significant performance improvements for your deep learning models with minimal effort and changes to your codebase.


## INT8 Calibration using torch2trt

The following cell defines `ImageBatcher`, a dataset-like class that supplies batches of images for INT8 calibration. It collects image files from a directory, shuffles them, keeps at most `max_samples` of them, and returns each batch as a CUDA tensor. An instance of this class is later passed to `torch2trt` as the calibration dataset.


In [None]:
from torch2trt import torch2trt
from torch2trt import TRTModule
from model import cifar10_resnet20
import os
from PIL import Image
from torchvision.transforms import ToTensor, Compose, Normalize, Resize
import random

class ImageBatcher():

    def __init__(self, root="/content/data/cifar-10-batches-py/train", max_samples=1000, batch_size=4):
        self.input_root = root
        self.batch_size = batch_size
        self.image_paths = []
        self.num_samples = 0

        extensions = [".jpg", ".jpeg", ".png", ".bmp"]

        def is_image(path):
            return os.path.isfile(path) and os.path.splitext(path)[1].lower() in extensions

        if os.path.isdir(self.input_root):
            for root, _, files in os.walk(self.input_root):
                for file in files:
                    file_path = os.path.join(root, file)
                    if is_image(file_path):
                        self.image_paths.append(file_path)
                        self.num_samples += 1

        random.shuffle(self.image_paths)
        self.image_paths = self.image_paths[:max_samples]
        self.num_samples = len(self.image_paths)

        self.transform = Compose([
            Resize((32, 32)),
            ToTensor()
        ])

        self.num_samples = min(max_samples, self.num_samples)
        self.num_samples = self.batch_size * (self.num_samples // self.batch_size)
        self.image_paths = self.image_paths[:self.num_samples]

    def __len__(self):
        return (self.num_samples + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):
        start_idx = idx * self.batch_size
        end_idx = min(start_idx + self.batch_size, self.num_samples)

        batch_images = self.image_paths[start_idx:end_idx]
        batch_data = torch.zeros(self.batch_size, 3, 32, 32).to(dtype=torch.float32)

        for i, image_path in enumerate(batch_images):
            image = Image.open(image_path)
            batch_data[i] = self.transform(image)

        return batch_data.cuda()
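
The batch bookkeeping in `ImageBatcher` can be checked in isolation. The sketch below repeats the same arithmetic on plain integers: the sample count is capped at `max_samples`, rounded down to a multiple of the batch size, and the number of batches follows from it. The function name `plan_batches` is made up for this example.

```python
def plan_batches(num_found, max_samples, batch_size):
    """Return (samples_used, num_batches) following ImageBatcher's arithmetic."""
    num_samples = min(max_samples, num_found)
    # Round down to a multiple of batch_size, dropping the incomplete tail
    num_samples = batch_size * (num_samples // batch_size)
    # Ceiling division, as in ImageBatcher.__len__
    num_batches = (num_samples + batch_size - 1) // batch_size
    return num_samples, num_batches

print(plan_batches(50000, 1000, 4))  # plenty of images: capped by max_samples
print(plan_batches(10, 1000, 4))     # few images: tail of 2 is dropped
```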

In [None]:
def convert_to_trt(model, dummy_input, precision, calib_input, calib_num_images, calib_batch_size):
    if precision == "int8":
        dataset = ImageBatcher(calib_input, calib_num_images, calib_batch_size)
        return torch2trt(model, [dummy_input], int8_mode=True,
                         int8_calib_dataset=dataset, int8_calib_algorithm=trt.CalibrationAlgoType.ENTROPY_CALIBRATION_2)
    elif precision == "fp16":
        return torch2trt(model, [dummy_input], fp16_mode=True)
    # Fail loudly instead of silently returning None for an unknown precision
    raise ValueError(f"Unsupported precision: {precision!r} (expected 'int8' or 'fp16')")

def save_trt_model(model_trt, save_path):
    torch.save(model_trt.state_dict(), save_path)

## Building TensorRT Engine using torch2trt

The following script demonstrates how to use the `torch2trt` library to convert a PyTorch model to a TensorRT engine, specifically for INT8 precision, and then save the engine. This process includes setting up the model, preparing the input data, and performing the conversion and calibration.


In [None]:
model = cifar10_resnet20(pretrained=True)
model.eval().cuda()

dummy_input = torch.ones((1, 3, 32, 32)).cuda().to(torch.float32)

RT_model = convert_to_trt(model, dummy_input, "int8",
                          "/content/data/cifar-10-batches-py/train",
                          calib_num_images=1000, calib_batch_size=4)

save_trt_model(RT_model, "torch2trt_int8.trt")

print("TensorRT engine is created and saved.")

TensorRT engine is created and saved.


## Running Inference with TensorRT Engine using torch2trt

The following script demonstrates how to run inference using a TensorRT engine created with the `torch2trt` library. This process includes loading the TensorRT model, preparing the data loader, and evaluating the model's performance.


In [None]:
def evaluate(model, data_loader):
    model.eval()
    top1_acc = 0
    top5_acc = 0

    with torch.no_grad():
        for images, targets in data_loader:
            images, targets = images.cuda(), targets.cuda()

            # Forward pass
            outputs = model(images)

            # Compute accuracy
            acc1, acc5 = accuracy(outputs.data, targets.data, topk=(1, 5))
            top1_acc += acc1.item()
            top5_acc += acc5.item()

    avg_top1_acc = top1_acc / len(data_loader)
    avg_top5_acc = top5_acc / len(data_loader)

    return avg_top1_acc, avg_top5_acc

In [None]:
model_trt = TRTModule()
model_trt.load_state_dict(torch.load('torch2trt_int8.trt'))

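# NOTE: get_loader is not imported in any cell above; it is assumed to come from
# one of the bootcamp helper scripts and to return (train_loader, test_loader).
_, test_loader = get_loader()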

top1_acc, top5_acc = evaluate(model_trt, test_loader)

print(f'TensorRT INT8 Model - Top-1 Accuracy: {top1_acc:.2f}%, Top-5 Accuracy: {top5_acc:.2f}%')

Files already downloaded and verified
Files already downloaded and verified
TensorRT INT8 Model - Top-1 Accuracy: 81.56%, Top-5 Accuracy: 98.66%
