<div align="center"><a href="https://www.nvidia.com/en-us/deep-learning-ai/education/"><img src="./assets/DLI_Header.png"></a></div>

# Deploying a Model for Inference at Production Scale

## 02 - Simple PyTorch Model
-------

**Table of Contents**

* [Introduction](#introduction)
* [Create Model Directory Structure](#structure)
* [Define a Simple PyTorch Model](#model)
* [Trace Model with TorchScript](#torchscript)
* [Create Configuration File](#configuration)
* [Load Model in Triton Inference Server](#load)
* [Send Inference Request to Server](#infer)
* [Exercise](#exercise)
* [Conclusion](#conclusion)


<a id="introduction"></a>
### Introduction

In this notebook, we will create a PyTorch ResNet50 model, write it out as a native PyTorch model and in its ONNX representation, and deploy it using Triton Inference Server. We'll see how to create model directory structures and configuration files within Triton Inference Server, how to work with TorchScript and ONNX, and how to send inference requests to the models deployed within Triton Inference Server.

<a id="structure"></a>
### Create Model Directory Structure

Triton Inference Server serves models within a model repository. When you first run Triton Inference Server, you'll specify the model repository where the models reside:

```
tritonserver --model-repository=/models
```

Each model resides in its own model subdirectory within the model repository - i.e. each directory within `/models` represents a unique model. For example, in this notebook we'll be deploying two models: a `simple-onnx-model` and a `simple-pytorch-model`. 

All models typically follow a similar directory structure. Within each of these directories, we'll create a configuration file `config.pbtxt` that details information about the model - e.g. batch size, input shapes, deployment backend (PyTorch, ONNX, TensorFlow, TensorRT, etc.) and more. We'll explore the configuration file later in this notebook.

Additionally, we can create one or more versions of our model. Each version lives in a subdirectory named with the respective version number, starting with `1`. Our model files (e.g. `model.onnx`, `model.pt`) reside within this subdirectory.

```
root@server:/models$ tree
.
├── simple-onnx-model
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
├── simple-pytorch-model
│   ├── 1
│   │   └── model.pt
│   └── config.pbtxt

```

We can also add a file representing the names of the outputs. We have omitted this step in this notebook for the sake of brevity. For more details on how to work with model repositories and model directory structures in Triton Inference Server, please see the documentation here: https://github.com/triton-inference-server/server/blob/r20.12/docs/model_repository.md

Below, we'll create the model directory structure for each of our PyTorch and ONNX models.

In [None]:
!mkdir -p models/simple-pytorch-model
!mkdir -p models/simple-pytorch-model/1
!mkdir -p models/simple-onnx-model
!mkdir -p models/simple-onnx-model/1
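
The same layout can also be built from Python. The sketch below is only an illustration: it works in a temporary directory rather than the lab's `models/` folder, and the helper name `make_model_dirs` is hypothetical.

```python
import tempfile
from pathlib import Path


def make_model_dirs(repo_root, model_names, version="1"):
    """Create <repo_root>/<model>/<version>/ for each model (hypothetical helper)."""
    created = []
    for name in model_names:
        version_dir = Path(repo_root) / name / version
        # parents=True mirrors `mkdir -p`; exist_ok=True makes reruns harmless
        version_dir.mkdir(parents=True, exist_ok=True)
        created.append(version_dir)
    return created


# Use a temporary directory so the sketch does not touch the real repository
repo_root = tempfile.mkdtemp()
version_dirs = make_model_dirs(repo_root, ["simple-pytorch-model", "simple-onnx-model"])
for path in version_dirs:
    print(path.relative_to(repo_root))
```

Each printed path is a version directory; the model file and the `config.pbtxt` one level above it are written in later steps.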

<a id="model"></a>
### Define a Simple PyTorch Model

In this next section, we'll define a simple PyTorch ResNet50 model. We'll specify that we will use a pretrained model, which will instantiate the ResNet50 model with the weights learned from training on ImageNet. After defining our `Model` class, we will instantiate this model, set the model to evaluation mode with the `.eval()` method, and allocate the model on the GPU with the `.cuda()` method. Learn more about how to train PyTorch models on GPUs with CUDA with [this article](https://medium.com/ai%C2%B3-theory-practice-business/use-gpu-in-your-pytorch-code-676a67faed09).

In [None]:
import torch
from torch import nn
from torchvision import models


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.model = models.resnet50(pretrained=True)
        
    def forward(self, x):
        return self.model(x)

model = Model().eval().cuda()

Next, we'll load the ImageNet labels.

In [None]:
import json

with open('./imagenet-simple-labels.json') as file:
    labels = json.load(file)

print(labels[:5])

Before working with Triton Inference Server, let's confirm that our ResNet50 model pre-trained on ImageNet works on a sample image. We'll use an image of a goldfish - feel free to try this with your own images!

In [None]:
import numpy as np
from PIL import Image


image = Image.open('./assets/goldfish.jpg')
image

Below, we'll create a transformation pipeline to take an image, resize it to `(256, 256)`, take a center crop resulting in an image of size `(224, 224)`, convert it to a PyTorch Tensor, and then normalize the image using the means and standard deviations of the ImageNet dataset.

In [None]:
from torchvision import transforms


imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]  # per-channel ImageNet standard deviations

resize = transforms.Resize((256, 256))
center_crop = transforms.CenterCrop(224)
to_tensor = transforms.ToTensor()
normalize = transforms.Normalize(mean=imagenet_mean,
                                 std=imagenet_std)

transform = transforms.Compose([resize, center_crop, to_tensor, normalize])
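
To make the arithmetic behind the crop and normalization steps concrete, here is a small pure-Python sketch. It assumes the commonly published ImageNet statistics (means 0.485, 0.456, 0.406 and standard deviations 0.229, 0.224, 0.225); the helper names `center_crop_box` and `normalize_pixel` are hypothetical and are not torchvision functions.

```python
def center_crop_box(width, height, crop):
    """Return (left, top, right, bottom) of a centered square crop."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def normalize_pixel(rgb, mean, std):
    """Apply (value - mean) / std per channel to one RGB pixel scaled to [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]


mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# Cropping 224 x 224 out of a 256 x 256 image trims 16 pixels from every side
box = center_crop_box(256, 256, 224)
print(box)  # (16, 16, 240, 240)

# A pixel whose channels equal the channel means maps to zero
print(normalize_pixel(mean, mean, std))  # [0.0, 0.0, 0.0]
```

The real pipeline applies the same per-channel formula to every pixel of the tensor at once.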

Lastly, we'll apply our transformation pipeline to our image, add a batch dimension with the `.unsqueeze(0)` method, and allocate our image on the GPU with the `.cuda()` method. We'll pass our image through our model to get the `logits`.

We'll then use the `torch.topk` function to access the values and indices of the top 3 `logits`, and convert both to Python lists (which also moves them to the CPU). We see our top result is indeed a goldfish. Awesome!

In [None]:
image_tensor = transform(image).unsqueeze(0).cuda()
logits = model(image_tensor)

K = 3
values, indices = torch.topk(logits, K)

values = values.detach().tolist()[0]
indices = indices.detach().tolist()[0]

for i in range(K):
    print(values[i], indices[i], labels[indices[i]])
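
For intuition about what a top-k selection returns, the following sketch does the same thing on a plain Python list using the standard library. The scores are made up for illustration and the helper name `top_k` is hypothetical; this is not how `torch.topk` is implemented.

```python
import heapq


def top_k(scores, k):
    """Return (values, indices) of the k largest scores, largest first."""
    pairs = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    top_indices = [index for index, _ in pairs]
    top_values = [value for _, value in pairs]
    return top_values, top_indices


toy_logits = [0.5, 4.0, -1.0, 2.5]  # made-up scores for four classes
toy_values, toy_indices = top_k(toy_logits, 3)
print(toy_values, toy_indices)  # [4.0, 2.5, 0.5] [1, 3, 0]
```

The indices are what we use to look up class names in `labels`.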

<a id="torchscript"></a>
### Trace Model with TorchScript


We have defined our model and confirmed it works as expected. Before writing out our model as a `model.pt` file, we will trace our model using TorchScript. TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency. This is how we'll load our PyTorch model into Triton Inference Server (which uses the `libtorch` backend).

There are two ways to generate a model with TorchScript - using either the `torch.jit.script` function or the `torch.jit.trace` function. 

Using `torch.jit.script` on a function or `nn.Module` will inspect the source code, compile it as TorchScript code using the TorchScript compiler, and return a `ScriptModule` or `ScriptFunction`.

Using `torch.jit.trace` on a function will return an executable or `ScriptFunction` that will be optimized using just-in-time compilation.

It may not be immediately clear whether to use `torch.jit.script` or `torch.jit.trace`. Typically, `torch.jit.script` is more flexible: it compiles the model's source code, so data-dependent control flow is preserved. In contrast, `torch.jit.trace` requires you to pass in an example dummy input and records only the operations executed for that input, so logic that depends on the input can end up hard coded. In general, I recommend starting with `torch.jit.script`.

For more details on TorchScript, please see:

* The TorchScript documentation: https://pytorch.org/docs/stable/jit.html
* This very insightful blogpost: https://paulbridger.com/posts/mastering-torchscript/

Below, we'll define a wrapper around our model, set our model wrapper to evaluation mode, and allocate our model on the GPU. Next, we'll generate our TorchScript code with the `torch.jit.script` function and write out our model as `model.pt` in the version `1` subdirectory of our `simple-pytorch-model` model directory.

In [None]:
class PyTorch_to_TorchScript(nn.Module):
    def __init__(self, my_model):
        super(PyTorch_to_TorchScript, self).__init__()
        self.model = my_model.model
    
    def forward(self, x):
        return self.model(x)

torchscript_model = PyTorch_to_TorchScript(model).eval().cuda()
traced_script_module = torch.jit.script(torchscript_model)
traced_script_module.save('models/simple-pytorch-model/1/model.pt')

We'll also convert our model to an ONNX representation. Open Neural Network Exchange (ONNX) is an open ecosystem that empowers AI developers to choose the right tools as their project evolves. ONNX provides an open source format for AI models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types. ONNX currently focuses on the capabilities needed for inferencing (scoring).

Below, we'll create a Torch Tensor of random data in the shape of our input images and allocate it on the GPU. We'll also specify the input and output names of the model. We'll see in the next section where these values are used in our configuration files.

Lastly, we'll export our model in an ONNX representation as a `model.onnx` file in the version `1` subdirectory of our `simple-onnx-model` model directory, specifying the dummy input and the appropriate input and output names. We'll also pass in a dictionary mapping the input and output names to the dimensions that represent the batch size. This allows us to work with variable batch sizes - without the `dynamic_axes` parameter, our ONNX model would be hard coded to use whichever batch size we chose for our dummy input, which in this case is batch size 1.

In [None]:
dummy_input = torch.randn(1, 3, 224, 224).cuda()

input_names = ['actual_input_1'] + ['learned_%d' % i for i in range(16)]
output_names = ['output1']

torch.onnx.export(model, dummy_input, 
                  'models/simple-onnx-model/1/model.onnx', verbose=False, 
                  input_names=input_names, output_names=output_names, 
                  dynamic_axes={'actual_input_1': {0: 'batch_size'}, 'output1': {0: 'batch_size'}})

<a id="configuration"></a>
### Create Configuration File

With our models defined and written out in TorchScript and ONNX representations, we now turn our attention to creating configuration files for our models.

A minimal model configuration must specify the name of the model, the platform and/or backend properties, the max_batch_size property, and the input and output tensors of the model (name, data type, and shape).


For more details on how to create model configuration files within Triton Inference Server, please see the documentation: 
https://github.com/triton-inference-server/server/blob/r20.12/docs/model_configuration.md

In [None]:
configuration = """
name: "simple-pytorch-model"
platform: "pytorch_libtorch"
max_batch_size: 32
input [
 {
    name: "input__0"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
"""

with open('models/simple-pytorch-model/config.pbtxt', 'w') as file:
    file.write(configuration)

We'll also create a configuration file for the ONNX model. Note that the `name` attributes of our input and output tensors are different, since we specified the input and output names when we exported the ONNX model. Also note that the `platform` is now `onnxruntime_onnx`.

In [None]:
configuration = """
name: "simple-onnx-model"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
 {
    name: "actual_input_1"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output {
    name: "output1"
    data_type: TYPE_FP32
    dims: [ 1000]
  }
"""

with open('models/simple-onnx-model/config.pbtxt', 'w') as file:
    file.write(configuration)
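
Since the two configurations differ only in the model name, the platform, and the tensor names, one way to keep them consistent is to render both from a single template. The sketch below is a hypothetical convenience: `render_config` and `CONFIG_TEMPLATE` are not part of Triton, and the text is printed rather than written to disk.

```python
# Doubled braces are literal braces; single braces are filled in by str.format
CONFIG_TEMPLATE = """
name: "{name}"
platform: "{platform}"
max_batch_size: {max_batch_size}
input [
  {{
    name: "{input_name}"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }}
]
output [
  {{
    name: "{output_name}"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }}
]
"""


def render_config(name, platform, input_name, output_name, max_batch_size=32):
    """Fill in the template for one model (hypothetical helper)."""
    return CONFIG_TEMPLATE.format(
        name=name,
        platform=platform,
        max_batch_size=max_batch_size,
        input_name=input_name,
        output_name=output_name,
    )


pytorch_config = render_config(
    "simple-pytorch-model", "pytorch_libtorch", "input__0", "output__0")
onnx_config = render_config(
    "simple-onnx-model", "onnxruntime_onnx", "actual_input_1", "output1")
print(onnx_config)
```

Note that `dims` excludes the batch dimension, since `max_batch_size` is greater than zero in both configurations.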

<a id="load"></a>
### Load Model in Triton Inference Server


With our model directory structures created, models defined and exported, and configuration files created, we will now wait for Triton Inference Server to load our models. We have set up this lab to use Triton Inference Server in **polling** mode. This means that Triton Inference Server will continuously poll for modifications to our models or for newly created models - once every 30 seconds. Please run the cell below to allow time for Triton Inference Server to poll for new models/modifications before proceeding. Due to the asynchronous nature of this step, we have added 15 seconds to be safe.

In [None]:
!sleep 45

At this point, our models should be deployed and ready to use! To confirm Triton Inference Server is up and running, we can send a `curl` request to the URL below.

In [None]:
!curl -v triton:8000/v2/health/ready

The HTTP request returns status 200 if Triton is ready and non-200 if it is not ready.
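
As an alternative to a fixed sleep, the readiness endpoint can be polled until it answers with status 200. The sketch below uses only the standard library; the helper names `http_ready` and `wait_until_ready` are hypothetical, and the demonstration uses a stand-in probe so that it runs without a server.

```python
import time
import urllib.error
import urllib.request


def http_ready(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection failures and non-2xx replies both count as "not ready"
        return False


def wait_until_ready(probe, attempts=30, delay=2.0, sleep=time.sleep):
    """Call `probe()` until it returns True; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        sleep(delay)
    return None


# Stand-in probe that succeeds on the third call, so no server is needed here
answers = iter([False, False, True])
attempts_used = wait_until_ready(lambda: next(answers), attempts=5, delay=0.0)
print(attempts_used)  # 3
```

In this lab, the probe would presumably be `lambda: http_ready("http://triton:8000/v2/health/ready")`.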

We can also send a `curl` request to our model endpoints to confirm our models are deployed and ready to use. This `curl` request returns status 200 if the model is ready and non-200 if it is not ready. 

Additionally, we will see information about our models, such as:

* The name of our model,
* The versions available for our model,
* The backend platform (e.g. pytorch_libtorch, onnxruntime_onnx), 
* The inputs and outputs, with their respective names, data types, and shapes.


In [None]:
!curl -v triton:8000/v2/models/simple-pytorch-model

In [None]:
!curl -v triton:8000/v2/models/simple-onnx-model
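
The metadata comes back as JSON, so it is easy to inspect programmatically. The sketch below parses a hand-written sample that is assumed to follow the layout described above (name, versions, platform, inputs, outputs); it was not captured from a running server, and the exact shapes reported by your server may differ.

```python
import json

# Hand-written sample for illustration only; not real server output
sample_metadata = """
{
  "name": "simple-onnx-model",
  "versions": ["1"],
  "platform": "onnxruntime_onnx",
  "inputs": [
    {"name": "actual_input_1", "datatype": "FP32", "shape": [-1, 3, 224, 224]}
  ],
  "outputs": [
    {"name": "output1", "datatype": "FP32", "shape": [-1, 1000]}
  ]
}
"""


def describe_tensors(tensors):
    """Format each tensor entry as 'name: datatype shape'."""
    return [f"{t['name']}: {t['datatype']} {t['shape']}" for t in tensors]


metadata = json.loads(sample_metadata)
print(metadata["name"], metadata["versions"], metadata["platform"])
print(describe_tensors(metadata["inputs"]))
print(describe_tensors(metadata["outputs"]))
```

The tensor names printed here are the ones an inference request must use.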

<a id="infer"></a>
### Send Inference Request to Server

With our models deployed, it is now time to send inference requests to our models. 

First, we'll load the `tritonclient.http` module.

In [None]:
import tritonclient.http as tritonhttpclient

Next, we'll define the input and output names of our model, the name of our model, the URL where our models are deployed with Triton Inference Server (in this case the host `triton:8000`), and our model version.

In [None]:
VERBOSE = False
input_name = 'input__0'
input_shape = (1, 3, 224, 224)
input_dtype = 'FP32'
output_name = 'output__0'
model_name = 'simple-pytorch-model'
url = 'triton:8000'
model_version = '1'

We'll instantiate our client `triton_client` using the `tritonhttpclient.InferenceServerClient` class, access the model metadata with the `.get_model_metadata()` method, and get our model configuration with the `.get_model_config()` method.

In [None]:
triton_client = tritonhttpclient.InferenceServerClient(url=url, verbose=VERBOSE)
model_metadata = triton_client.get_model_metadata(model_name=model_name, model_version=model_version)
model_config = triton_client.get_model_config(model_name=model_name, model_version=model_version)

Next, we'll convert our previously defined goldfish image (currently a Torch Tensor on the GPU) to a NumPy array on the CPU.

In [None]:
image_numpy = image_tensor.cpu().numpy()
print(image_numpy.shape)

We'll instantiate a placeholder for our input data using the input name, shape, and data type expected. We'll set the data of the input to be the NumPy array representation of our goldfish image. We'll also instantiate a placeholder for our output data using just the output name.

Lastly, we'll submit our input to the Triton Inference Server using the `triton_client.infer()` method, specifying our model name, model version, inputs, and outputs and convert our result to a NumPy array.

In [None]:
input0 = tritonhttpclient.InferInput(input_name, input_shape, input_dtype)
input0.set_data_from_numpy(image_numpy, binary_data=False)

output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=False)
response = triton_client.infer(model_name, model_version=model_version, 
                               inputs=[input0], outputs=[output])
logits = response.as_numpy(output_name)
logits = np.asarray(logits, dtype=np.float32)
print(logits.shape)
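
With `binary_data=False`, the tensor travels as plain JSON. The sketch below assembles a request body by hand to show roughly what that looks like. It assumes the v2 HTTP/REST inference layout (an `inputs` list with `name`, `shape`, `datatype`, and flattened `data`) and the path `/v2/models/<name>/versions/<version>/infer`; the helper `build_infer_request` is hypothetical and a tiny made-up tensor stands in for the image.

```python
import json
import math


def build_infer_request(input_name, shape, datatype, flat_data, output_name):
    """Assemble a JSON-serializable inference request body (illustrative sketch)."""
    expected = math.prod(shape)
    if len(flat_data) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat_data)}")
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": flat_data,  # row-major, flattened tensor values
            }
        ],
        "outputs": [{"name": output_name}],
    }


# A tiny 1 x 3 x 2 x 2 tensor of zeros keeps the example readable
tiny_shape = (1, 3, 2, 2)
payload = build_infer_request("input__0", tiny_shape, "FP32", [0.0] * 12, "output__0")
body = json.dumps(payload)

# Assumed endpoint path for version 1 of the PyTorch model
endpoint = "/v2/models/simple-pytorch-model/versions/1/infer"
print(endpoint)
print(body)
```

The client library builds and sends an equivalent request for us, which is why the notebook only needs `InferInput`, `InferRequestedOutput`, and `infer()`.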

And that's all there is to it! We can identify the largest logit value and confirm that our model correctly inferred that our image is, indeed, a goldfish.

In [None]:
print(labels[np.argmax(logits)])

<a id="exercise"></a>
### Exercise #1 - Submit an Inference Request to the ONNX model

We leave it as an exercise for the participant to submit an inference request to the deployed ONNX model. If you get stuck (or want to confirm your answer), click the `...` to reveal the answer.

Hint: Just copying the inference code from above won't work - pay attention to the model name and the input and output names in the configuration file we defined for ONNX.

#### Step 1: Define names and shapes

**Hint**: Try looking at the ONNX `configuration` defined above.

In [None]:
VERBOSE = FIXME
input_name = FIXME
input_shape = FIXME
input_dtype = FIXME
output_name = FIXME
model_name = FIXME
url = FIXME
model_version = FIXME

In [None]:
VERBOSE = False
input_name = 'actual_input_1'
input_shape = (1, 3, 224, 224)
input_dtype = 'FP32'
output_name = 'output1'
model_name = 'simple-onnx-model'
url = 'triton:8000'
model_version = '1'

#### Step 2: Get model information from Triton

In [None]:
triton_client = tritonhttpclient.FIXME(url=url, verbose=VERBOSE)
model_metadata = triton_client.FIXME(model_name=model_name, model_version=model_version)
model_config = triton_client.FIXME(model_name=model_name, model_version=model_version)

In [None]:
triton_client = tritonhttpclient.InferenceServerClient(url=url, verbose=VERBOSE)
model_metadata = triton_client.get_model_metadata(model_name=model_name, model_version=model_version)
model_config = triton_client.get_model_config(model_name=model_name, model_version=model_version)

#### Step 3: Test an image

No `FIXME`s here; just run the cell to view the image shape.

In [None]:
image_numpy = image_tensor.cpu().numpy()
print(image_numpy.shape)

#### Step 4: Define inputs and outputs to get an inference response from Triton

In [None]:
input0 = FIXME

output = FIXME
response = FIXME

logits = response.as_numpy(output_name)
logits = np.asarray(logits, dtype=np.float32)

In [None]:
input0 = tritonhttpclient.InferInput(input_name, input_shape, input_dtype)
input0.set_data_from_numpy(image_numpy, binary_data=False)

output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=False)
response = triton_client.infer(model_name, model_version=model_version, 
                               inputs=[input0], outputs=[output])
logits = response.as_numpy(output_name)
logits = np.asarray(logits, dtype=np.float32)

#### Step 5: Verify the response

In [None]:
print(labels[np.argmax(logits)])

<a id="conclusion"></a>
### Conclusion

In this notebook, we showed how to create a PyTorch ResNet50 model, write it out as a native PyTorch model and in its ONNX representation, and deploy it using Triton Inference Server. We saw how to create model directory structures and configuration files within Triton Inference Server, how to work with TorchScript and ONNX, and how to send inference requests to the models deployed within Triton Inference Server.

We kindly ask that you do some cleanup by running the cell below. This will free up GPU memory for the other sections of the lab.

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)
