<div align="center"><a href="https://www.nvidia.com/en-us/deep-learning-ai/education/"><img src="./assets/DLI_Header.png"></a></div>

# Deploying a Model for Inference at Production Scale

## 03 - HuggingFace Model
-------

**Table of Contents**

* [Introduction](#introduction)
* [Define a HuggingFace Pre-Trained Model](#model)
* [Trace Model with TorchScript](#model)
* [Create Model Directory Structure](#structure)
* [Create Configuration File](#configuration)
* [Load Model in Triton Inference Server](#load)
* [Send Inference Request to Server](#infer)
* [Exercise](#exercise)
* [Conclusion](#conclusion)


<a id="introduction"></a>
### Introduction

In this notebook, we will create a PyTorch `XLMRobertaForSequenceClassification` model from HuggingFace, export it as a TorchScript module with `torch.jit.trace`, and deploy it using Triton Inference Server. XLM-RoBERTa is a multilingual variant of RoBERTa, which is itself an evolution of the BERT model architecture. More information about RoBERTa can be found [here](https://huggingface.co/docs/transformers/model_doc/roberta). Our goal is to see how Triton can handle a more complicated model.


<a id="structure"></a>
### Create Model Directory Structure

Below, we'll create our model directory structure. For more details on how to create model directory structures for PyTorch models, see the previous notebook [02_Simple_PyTorch_Model.ipynb](02_Simple_PyTorch_Model.ipynb).

```
root@server:/models$ tree
.
├── huggingface-model
│   ├── 1
│   │   └── model.pt
│   └── config.pbtxt
```

In [None]:
!mkdir -p models/huggingface-model
!mkdir -p models/huggingface-model/1
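
The same layout can also be created from Python. The snippet below is a minimal sketch, assuming a hypothetical helper name `create_model_dir`; it works in a temporary directory so that it does not touch the lab's real `models/` folder.

```python
import tempfile
from pathlib import Path


def create_model_dir(repo_root, model_name, version="1"):
    """Create <repo_root>/<model_name>/<version>/ and return the version directory."""
    version_dir = Path(repo_root) / model_name / version
    # parents=True creates the model directory too; exist_ok=True makes reruns harmless
    version_dir.mkdir(parents=True, exist_ok=True)
    return version_dir


# Use a throwaway directory so the sketch leaves the real model repository alone.
with tempfile.TemporaryDirectory() as scratch_root:
    create_model_dir(scratch_root, "huggingface-model")
    created = sorted(
        path.relative_to(scratch_root).as_posix()
        for path in Path(scratch_root).rglob("*")
    )

print(created)
```

The `model.pt` file and `config.pbtxt` are added to this layout in the following sections.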

<a id="model"></a>
### Define a HuggingFace Pre-Trained Model

In this section, we'll load a tokenizer to convert our text inputs into token IDs for the `XLMRobertaForSequenceClassification` model. We'll wrap the model in a small `torch.nn.Module`, set it to evaluation mode, and move it to the GPU. Lastly, we'll pass a dummy input to `torch.jit.trace` to generate the TorchScript module and save it as `model.pt`.

In [None]:
import torch
from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer


# tokenize a dummy premise/hypothesis pair, padded to the fixed sequence length of 256
R_tokenizer = XLMRobertaTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')
premise = "Jupiter's Biggest Moons Started as Tiny Grains of Hail"
hypothesis = 'This text is about space & cosmos'

input_ids = R_tokenizer.encode(premise, hypothesis, return_tensors='pt',
                               max_length=256, truncation=True, padding='max_length')

# attention mask: 1 for real tokens, 0 for padding (the pad token id is 1)
mask = input_ids != 1
mask = mask.long()


class PyTorch_to_TorchScript(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # return_dict=False makes the model return a plain tuple, which tracing requires
        self.model = XLMRobertaForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli', return_dict=False)
    def forward(self, data, attention_mask):
        # move both inputs to the GPU before calling the wrapped model
        return self.model(data.cuda(), attention_mask.cuda())

# evaluation mode disables dropout; .cuda() places the weights on the GPU
pt_model = PyTorch_to_TorchScript().eval().cuda()
# trace with the dummy inputs and save the result as version 1 of the model
traced_script_module = torch.jit.trace(pt_model, (input_ids, mask))
traced_script_module.save('models/huggingface-model/1/model.pt')
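
To see the tracing step in isolation, without downloading the full model, here is a minimal sketch on a toy module. `TinyClassifier` and its sizes are hypothetical stand-ins and not part of the lab; the calls being illustrated are `torch.jit.trace`, `torch.jit.save`, and `torch.jit.load`. The sketch runs on the CPU and saves to an in-memory buffer instead of `model.pt`.

```python
import io

import torch


class TinyClassifier(torch.nn.Module):
    """Hypothetical stand-in: embeds token ids, masks padding, outputs 3 logits."""

    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(10, 4)
        self.head = torch.nn.Linear(4, 3)

    def forward(self, ids, mask):
        # zero out the embeddings of padded positions, then pool over the sequence
        hidden = self.embed(ids) * mask.unsqueeze(-1).float()
        return self.head(hidden.sum(dim=1))


model = TinyClassifier().eval()
ids = torch.tensor([[2, 3, 1, 1]])   # made-up token ids, with 1 used as padding
mask = (ids != 1).long()

# Tracing records the operations executed for this example input.
traced = torch.jit.trace(model, (ids, mask))

# Round-trip through a buffer to mimic saving model.pt and loading it elsewhere.
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)
reloaded = torch.jit.load(buffer)

with torch.no_grad():
    outputs_match = torch.allclose(model(ids, mask), reloaded(ids, mask))
    output_shape = tuple(reloaded(ids, mask).shape)

print(outputs_match, output_shape)
```

Comparing the reloaded module's output against the original module is a cheap sanity check worth running before a traced model is handed to a server.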


<a id="configuration"></a>
### Create Configuration File

Next, we'll create our configuration file. For more details on how to create configuration files for PyTorch models, see the previous notebook [02_Simple_PyTorch_Model.ipynb](02_Simple_PyTorch_Model.ipynb).

In [None]:
configuration = """
name: "huggingface-model"
platform: "pytorch_libtorch"
max_batch_size: 1024
input [
  {
    name: "input__0"
    data_type: TYPE_INT32
    dims: [ 256 ]
  },
  {
    name: "input__1"
    data_type: TYPE_INT32
    dims: [ 256 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 3 ]
  }
]
"""

with open('models/huggingface-model/config.pbtxt', 'w') as file:
    file.write(configuration)
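
If several models share this layout, the configuration text can be generated instead of typed by hand. The following is a minimal sketch under that assumption; `render_tensor` and `render_config` are hypothetical helper names, and the output only covers the fields used in this notebook.

```python
def render_tensor(name, data_type, dims):
    """Render one input or output entry of a config.pbtxt file."""
    dims_text = ", ".join(str(dim) for dim in dims)
    return (
        "  {\n"
        f'    name: "{name}"\n'
        f"    data_type: {data_type}\n"
        f"    dims: [ {dims_text} ]\n"
        "  }"
    )


def render_config(name, platform, max_batch_size, inputs, outputs):
    """Render the small subset of config.pbtxt fields used in this notebook."""
    lines = [
        f'name: "{name}"',
        f'platform: "{platform}"',
        f"max_batch_size: {max_batch_size}",
        "input [",
        ",\n".join(render_tensor(*tensor) for tensor in inputs),
        "]",
        "output [",
        ",\n".join(render_tensor(*tensor) for tensor in outputs),
        "]",
    ]
    return "\n".join(lines) + "\n"


config_text = render_config(
    name="huggingface-model",
    platform="pytorch_libtorch",
    max_batch_size=1024,
    # dims exclude the batch dimension because max_batch_size is greater than 0
    inputs=[("input__0", "TYPE_INT32", [256]), ("input__1", "TYPE_INT32", [256])],
    outputs=[("output__0", "TYPE_FP32", [3])],
)
print(config_text)
```

Note that the input names follow the `input__<index>` pattern and the output names follow `output__<index>`, matching the positional arguments and return values of the traced module.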

<a id="load"></a>
### Load Model in Triton Inference Server


With our model directory structure created, our model defined and exported, and our configuration file written, we will now wait for Triton Inference Server to load the model. We have set up this lab to use Triton Inference Server in **polling** mode. This means that Triton Inference Server checks the model repository for new or modified models once every 30 seconds. Please run the cell below to allow time for Triton Inference Server to pick up the new model before proceeding.

In [None]:
!sleep 45

At this point, our model should be deployed and ready to use! To confirm Triton Inference Server is up and running, we can send a `curl` request to the URL below.

In [None]:
!curl -v triton:8000/v2/health/ready

The HTTP request returns status 200 if Triton is ready and non-200 if it is not ready.
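
As an alternative to a fixed sleep followed by a manual check, the readiness endpoint can be polled until it answers with status 200. The snippet below is a minimal sketch using only the standard library; `http_ok` and `wait_until` are hypothetical helper names, and the demonstration uses a stand-in check so that it runs without a server.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # connection refused, timeouts, and non-2xx answers all count as "not ready"
        return False


def wait_until(check, timeout_s=60.0, interval_s=2.0):
    """Call `check()` until it returns True or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Offline demonstration: a stand-in check that succeeds on the third attempt.
attempts = []


def fake_check():
    attempts.append(1)
    return len(attempts) >= 3


became_ready = wait_until(fake_check, timeout_s=5.0, interval_s=0.0)
print(became_ready, len(attempts))
```

Against the lab's server, the call would look like `wait_until(lambda: http_ok('http://triton:8000/v2/health/ready'))`, assuming the same host and port used in the `curl` command above.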

We can also send a `curl` request to our model endpoint to confirm our model is deployed and ready to use. This `curl` request returns status 200 if the model is ready and non-200 if it is not ready.

Additionally, we will also see information about our model, such as:

* The name of our model,
* The versions available for our model,
* The backend platform (e.g. `pytorch_libtorch`),
* The inputs and outputs, with their respective names, data types, and shapes.


In [None]:
!curl -v triton:8000/v2/models/huggingface-model
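
The response body is JSON, so it can be inspected programmatically. The snippet below is a minimal sketch: `sample_metadata_json` is a hand-written illustration of the response shape based on our configuration file, not output captured from the server, and `summarize_metadata` is a hypothetical helper.

```python
import json

# Hand-written illustration of the metadata shape; not captured server output.
sample_metadata_json = """
{
  "name": "huggingface-model",
  "versions": ["1"],
  "platform": "pytorch_libtorch",
  "inputs": [
    {"name": "input__0", "datatype": "INT32", "shape": [-1, 256]},
    {"name": "input__1", "datatype": "INT32", "shape": [-1, 256]}
  ],
  "outputs": [
    {"name": "output__0", "datatype": "FP32", "shape": [-1, 3]}
  ]
}
"""


def summarize_metadata(metadata):
    """Return one line for the model plus one line per input and output tensor."""
    versions = ", ".join(metadata["versions"])
    lines = [f"{metadata['name']} (platform: {metadata['platform']}, versions: {versions})"]
    for kind in ("inputs", "outputs"):
        for tensor in metadata[kind]:
            # kind[:-1] turns "inputs" into "input" and "outputs" into "output"
            lines.append(f"  {kind[:-1]} {tensor['name']}: {tensor['datatype']} {tensor['shape']}")
    return lines


summary = summarize_metadata(json.loads(sample_metadata_json))
print("\n".join(summary))
```

In this illustration the leading `-1` in each shape stands for the variable batch dimension, which is implied by `max_batch_size` rather than listed in `dims`.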

<a id="infer"></a>
### Send Inference Request to Server

With our HuggingFace model deployed, it is now time to send inference requests to it.

First, we'll do some housekeeping and restart our Jupyter notebook kernel. This will free up some of the GPU memory.

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

Next, we'll load the `tritonclient.http` module.

In [None]:
import tritonclient.http as tritonhttpclient

Next, we'll define the input and output names of our model, the name of our model, the URL where our model is served by Triton Inference Server (in this case the host `triton` on port 8000), and our model version.

In [None]:
VERBOSE = False
input_name = ['input__0', 'input__1']
input_dtype = 'INT32'
output_name = 'output__0'
model_name = 'huggingface-model'
url = 'triton:8000'
model_version = '1'

We'll instantiate our client `triton_client` using the `tritonhttpclient.InferenceServerClient` class, access the model metadata with the `get_model_metadata()` method, and get our model configuration with the `get_model_config()` method.

In [None]:
triton_client = tritonhttpclient.InferenceServerClient(url=url, verbose=VERBOSE)
model_metadata = triton_client.get_model_metadata(model_name=model_name, model_version=model_version)
model_config = triton_client.get_model_config(model_name=model_name, model_version=model_version)

Below, we'll create our tokenizer, apply our tokenizer to our premise and topic, and manipulate the resulting data to be passed to Triton Inference Server.

In [None]:
import numpy as np
from transformers import XLMRobertaTokenizer


# instantiate our tokenizer
R_tokenizer = XLMRobertaTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

# create our premise and topic to be passed into the model
premise = 'Jupiter’s Biggest Moons Started as Tiny Grains of Hail'
topic = 'This text is about space & cosmos'

# encode our inputs, convert to numpy arrays, create our mask, and do some reshaping
input_ids = R_tokenizer.encode(premise, topic, max_length=256, truncation=True, padding='max_length')
input_ids = np.array(input_ids, dtype=np.int32)
mask = input_ids != 1
mask = np.array(mask, dtype=np.int32)
mask = mask.reshape(1, 256) 
input_ids = input_ids.reshape(1, 256)
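
To make the padding, masking, and reshaping steps easier to follow, here is a minimal sketch on made-up token IDs with a shortened sequence length. `pad_and_mask` is a hypothetical helper, and its plain truncation is a simplification of what the tokenizer does; the pad ID of 1 matches the `input_ids != 1` comparison above.

```python
import numpy as np

PAD_ID = 1       # same value the notebook compares against in `input_ids != 1`
MAX_LENGTH = 8   # shortened from 256 to keep the printed arrays readable


def pad_and_mask(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Pad (or cut) token ids to max_length and build the matching attention mask."""
    truncated = list(token_ids)[:max_length]
    padded = truncated + [pad_id] * (max_length - len(truncated))
    # add the batch dimension so the shape is (1, max_length)
    input_ids = np.array(padded, dtype=np.int32).reshape(1, max_length)
    # 1 marks real tokens, 0 marks padding
    mask = (input_ids != pad_id).astype(np.int32)
    return input_ids, mask


toy_ids = [0, 47, 902, 2, 2, 15, 2]   # made-up ids, not real tokenizer output
toy_input_ids, toy_mask = pad_and_mask(toy_ids)
print(toy_input_ids)
print(toy_mask)
```

Both arrays end up with shape `(1, MAX_LENGTH)` and dtype `int32`, which corresponds to the `TYPE_INT32` inputs with `dims: [ 256 ]` declared in the configuration file.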

We'll instantiate a placeholder for our input data using the input name, shape, and data type expected. We'll set the data of the input to be the NumPy array representation of our text. We'll also instantiate a placeholder for our output data using just the output name.

Lastly, we'll submit our input to the Triton Inference Server using the `triton_client.infer()` method, specifying our model name, model version, inputs, and outputs and convert our result to a NumPy array.

In [None]:
input0 = tritonhttpclient.InferInput(input_name[0], (1, 256), input_dtype)
input0.set_data_from_numpy(input_ids, binary_data=False)
input1 = tritonhttpclient.InferInput(input_name[1], (1, 256), input_dtype)
input1.set_data_from_numpy(mask, binary_data=False)
output = tritonhttpclient.InferRequestedOutput(output_name,  binary_data=False)
response = triton_client.infer(model_name, model_version=model_version, inputs=[input0, input1], outputs=[output])
logits = response.as_numpy(output_name)
logits = np.asarray(logits, dtype=np.float32)
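
Because we pass `binary_data=False`, the tensors travel as JSON in the HTTP request body. The snippet below is a rough sketch of what such a request looks like under the KServe v2 inference protocol that Triton implements; `build_infer_request` is a hypothetical helper, the token values are made up, and the exact bytes produced by `tritonclient` may differ.

```python
import json


def build_infer_request(inputs, output_names):
    """Build a JSON inference request body from (name, shape, datatype, data) tuples."""
    body = {
        "inputs": [
            {"name": name, "shape": list(shape), "datatype": datatype, "data": list(flat_data)}
            for name, shape, datatype, flat_data in inputs
        ],
        # ask for the outputs as JSON too, rather than as a binary payload
        "outputs": [
            {"name": name, "parameters": {"binary_data": False}}
            for name in output_names
        ],
    }
    return json.dumps(body)


model_name = "huggingface-model"
model_version = "1"
path = f"/v2/models/{model_name}/versions/{model_version}/infer"

# Shortened to 4 tokens for readability; the deployed model expects 256.
request_body = build_infer_request(
    inputs=[
        ("input__0", (1, 4), "INT32", [0, 47, 2, 1]),
        ("input__1", (1, 4), "INT32", [1, 1, 1, 0]),
    ],
    output_names=["output__0"],
)
print(path)
print(request_body)
```

JSON is convenient for inspection and debugging; for large tensors the binary mode of the client is usually the more compact choice.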

Finally, we post-process our logits: we drop the "neutral" class (index 1), apply a softmax over the remaining "contradiction" (index 0) and "entailment" (index 2) classes, and take the resulting probability of "entailment" as the probability of the label being true. And that's all there is to it! Our model identifies that our premise is in fact about space and cosmos!

In [None]:
from scipy.special import softmax


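# keep only the "contradiction" (0) and "entailment" (2) logits, dropping "neutral" (1)
entail_contradiction_logits = logits[:,[0,2]]
# normalize across the two remaining classes of each row
probs = softmax(entail_contradiction_logits, axis=1)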
true_prob = probs[:,1].item() * 100
print(f'Probability that the label is true: {true_prob:0.2f}%')
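
The same post-processing extends to a batch of results when the softmax is taken per row. The snippet below is a minimal sketch with made-up logits; `softmax_rows` is a hypothetical NumPy helper standing in for `scipy.special.softmax` applied along axis 1.

```python
import numpy as np


def softmax_rows(values):
    """Numerically stable softmax applied independently to each row."""
    shifted = values - values.max(axis=1, keepdims=True)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum(axis=1, keepdims=True)


# Made-up logits for two requests, ordered [contradiction, neutral, entailment].
toy_logits = np.array([[-2.0, 0.5, 3.0],
                       [1.5, 0.0, -1.0]], dtype=np.float32)

# Drop the "neutral" column, then normalize each row over the remaining two classes.
contradiction_entailment = toy_logits[:, [0, 2]]
toy_probs = softmax_rows(contradiction_entailment)
entailment_prob = toy_probs[:, 1]

for probability in entailment_prob:
    print(f'Probability that the label is true: {probability * 100:0.2f}%')
```

Normalizing per row matters once more than one request is processed at a time, because a softmax over the whole array would make the probabilities of different requests depend on each other.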

<a id="conclusion"></a>
### Conclusion

In this notebook, we showed how to create a PyTorch XLMRobertaForSequenceClassification model from HuggingFace, write it out as a native PyTorch model using TorchScript generated code, and deploy it using Triton Inference Server.

We kindly ask that you do some cleanup and run the cells below. This will free up GPU memory for other sections of the lab.

In [None]:
!rm -rf models/huggingface-model

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)
