# Deploying NVIDIA Triton Inference Server in AI Platform Prediction Custom Container (Google Cloud SDK)

This notebook walks through deploying NVIDIA's Triton Inference Server to the AI Platform Prediction Custom Container service in Direct Model Server mode:

![](img/caip_triton_container_diagram_direct.jpg)


In [None]:
# Enter the same Project ID you used in the README.md instructions
PROJECT_ID='[enter project ID]'

# Create a model bucket where model artifacts will be stored
MODEL_BUCKET='gs://[enter GCS bucket name]'

# This is the AI Platform Service endpoint
ENDPOINT='https://ml.googleapis.com/v1'

# Region and Artifact Registry repository created in the README.md instructions
REGION='us-central1'
REPOSITORY='caipcustom'

### Prepare model Artifacts

Clone the NVIDIA Triton Inference Server repo.

In [None]:
!git clone https://github.com/NVIDIA/triton-inference-server.git

Create the GCS bucket where the model artifacts will be copied to.

In [None]:
!gsutil mb $MODEL_BUCKET

Stage model artifacts and copy to bucket.

In [None]:
!mkdir model_repository

In [None]:
!cp -R triton-inference-server/docs/examples/model_repository/* model_repository/

In [None]:
!./triton-inference-server/docs/examples/fetch_models.sh

In [None]:
!gsutil -m cp -R model_repository/ $MODEL_BUCKET

In [None]:
!gsutil ls $MODEL_BUCKET/model_repository

### Prepare request payload

To build the request payloads, we have included a utility, `get_request_body_simple.py`.  It requires the following library:

In [None]:
!pip3 install geventhttpclient

#### Prepare non-binary request payload

The first model illustrates a non-binary payload.  The following command creates a KF Serving v2 non-binary payload for the "simple" model:

In [None]:
!python3 get_request_body_simple.py -m simple
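
For orientation, here is a minimal sketch of a KF Serving v2 JSON request body for two [1,16] tensors. It is not the code inside `get_request_body_simple.py`. The tensor names `INPUT0` and `INPUT1`, the `INT32` datatype, and the helper name `build_simple_request` are assumptions for illustration.

```python
import json


def build_simple_request(input0, input1):
    """Build a KF Serving v2 JSON request body for two [1, N] INT32 tensors."""
    def tensor(name, values):
        values = list(values)
        return {
            "name": name,               # assumed tensor name
            "shape": [1, len(values)],  # one row of N values
            "datatype": "INT32",        # assumed datatype
            "data": values,             # non-binary: values live in the JSON body
        }

    return {"inputs": [tensor("INPUT0", input0), tensor("INPUT1", input1)]}


body = build_simple_request(range(16), [1] * 16)
payload_text = json.dumps(body)
```

Because the tensor values sit inside the JSON body itself, this kind of request can be sent with `Content-Type: application/json` and no extra headers.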

#### Prepare binary request payload

Triton's implementation of the KF Serving v2 protocol for binary data appends the binary data after the JSON body.  Triton requires an additional header giving the length of the JSON part, which is the offset at which the binary data starts:

`Inference-Header-Content-Length: [offset]`

We have provided a script that resizes the image to the input size expected by ResNet-50, [224, 224, 3], and calculates the offset.  The following command takes an image file and outputs the data structure to be used with the "resnet50_netdef" model.  Note the offset it reports; it is needed later for the prediction request.

In [None]:
!python3 get_request_body_simple.py -m image -f triton-inference-server/qa/images/mug.jpg
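
The layout described above (JSON header first, raw tensor bytes after it) can be sketched as follows. This is an illustrative sketch, not the provided script. The tensor name `INPUT_TENSOR`, the tiny [1, 4] tensor, and the helper name `build_binary_request` are hypothetical. The `binary_data_size` parameter is the field Triton's binary tensor data extension uses to say how many bytes belong to each input.

```python
import json
import struct


def build_binary_request(name, shape, values):
    """Return (payload_bytes, header_length) for one FP32 input tensor."""
    # Raw tensor bytes: little-endian 32-bit floats.
    raw = struct.pack("<%df" % len(values), *values)
    header = {
        "inputs": [{
            "name": name,
            "shape": shape,
            "datatype": "FP32",
            # No "data" field: the values follow the JSON as raw bytes.
            "parameters": {"binary_data_size": len(raw)},
        }]
    }
    header_bytes = json.dumps(header).encode("utf-8")
    # The offset sent in Inference-Header-Content-Length is the JSON length.
    return header_bytes + raw, len(header_bytes)


payload, offset = build_binary_request("INPUT_TENSOR", [1, 4], [0.0, 0.25, 0.5, 1.0])
```

Here `offset` is the value that would go into the `Inference-Header-Content-Length` header, and `payload` is what would be sent as the request body.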

## Create and deploy Model and Model Version

In this section, we will deploy two models:
1. Simple model with non-binary data.  The KF Serving v2 protocol specifies a JSON format with non-binary data in the JSON body itself.
2. ResNet-50 model with binary data.  This uses Triton's implementation of binary data for the KF Serving v2 protocol.


### Simple model (non-binary data)

#### Create Model

AI Platform Prediction uses a Model/Model Version hierarchy, where the Model is a logical grouping of Model Versions.  We will first create the Model.

Because the MODEL_NAME variable is used later to specify the predict route, and Triton uses that route to run prediction on a specific model, its value must be the name of a model that exists in the model repository.  For this section, we will use the "simple" model.
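
One way to check a candidate name before creating the Model is to list the model directories staged locally. The sketch below assumes each model is a subdirectory of the repository containing a `config.pbtxt` file, as in Triton's example repository. The function name `list_local_models` is hypothetical.

```python
import os


def list_local_models(repo_dir):
    """Return the names of subdirectories of repo_dir that hold a config.pbtxt."""
    names = []
    for entry in sorted(os.listdir(repo_dir)):
        # Assumption: a model directory is recognized by its config.pbtxt.
        if os.path.isfile(os.path.join(repo_dir, entry, "config.pbtxt")):
            names.append(entry)
    return names
```

For example, `'simple' in list_local_models('model_repository')` should be true after the staging steps earlier in this notebook.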

In [None]:
MODEL_NAME='simple'

In [None]:
!gcloud beta ai-platform models create $MODEL_NAME --regions us-central1 --enable-console-logging

In [None]:
!gcloud ai-platform models list

#### Create Model Version

Once the Model exists, we can create a Model Version under it.  Each Model Version needs a name that is unique within the Model.  In AI Platform Prediction Custom Container, a {Project}/{Model}/{ModelVersion} uniquely identifies the specific container and model artifact used for inference.
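
The same three identifiers also form the prediction URL used by the `curl` commands later in this notebook. A small sketch of that composition follows. The helper name `predict_url` and the project ID `my-project` are placeholders for illustration.

```python
def predict_url(endpoint, project_id, model_name, version_name):
    """Compose the :predict URL for one Model Version."""
    return "{}/projects/{}/models/{}/versions/{}:predict".format(
        endpoint.rstrip("/"), project_id, model_name, version_name
    )


# 'my-project' is a placeholder project ID.
url = predict_url("https://ml.googleapis.com/v1", "my-project", "simple", "v01")
```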

In [None]:
VERSION_NAME='v01'

The following config file will be used in the Model Version creation command.

In [None]:
import yaml

# Model Version settings: artifact location, serving container, and routes.
config_simple = {
    'deploymentUri': MODEL_BUCKET + '/model_repository',
    'container': {
        'image': REGION + '-docker.pkg.dev/' + PROJECT_ID + '/' + REPOSITORY + '/tritonserver:20.06-py3',
        'args': ['tritonserver', '--model-repository=$(AIP_STORAGE_URI)'],
        'env': [],
        'ports': {'containerPort': 8000},
    },
    'routes': {
        'predict': '/v2/models/' + MODEL_NAME + '/infer',
        'health': '/v2/models/' + MODEL_NAME,
    },
    'machineType': 'n1-standard-4',
    'autoScaling': {'minNodes': 1},
}

# yaml.dump writes to the file and returns None, so there is nothing to assign.
with open('config_simple.yaml', 'w') as file:
    yaml.dump(config_simple, file)
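
The config for the ResNet-50 model later in this notebook has the same structure and differs only in the model name used in the routes. As an optional sketch, a hypothetical helper (`triton_version_config` is not part of the notebook) could build the dictionary from its variable parts. The bucket and image strings in the example call are placeholders.

```python
def triton_version_config(model_bucket, image, model_name):
    """Build the Model Version config dictionary for one Triton model."""
    return {
        "deploymentUri": model_bucket + "/model_repository",
        "container": {
            "image": image,
            "args": ["tritonserver", "--model-repository=$(AIP_STORAGE_URI)"],
            "env": [],
            "ports": {"containerPort": 8000},
        },
        "routes": {
            # Both routes must name a model that exists in the repository.
            "predict": "/v2/models/" + model_name + "/infer",
            "health": "/v2/models/" + model_name,
        },
        "machineType": "n1-standard-4",
        "autoScaling": {"minNodes": 1},
    }


# Placeholder bucket and image values, for illustration only.
example = triton_version_config("gs://example-bucket", "example-image:tag", "simple")
```

The returned dictionary can be written out with `yaml.dump` in the same way as in the cell above.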

In [None]:
!gcloud beta ai-platform versions create $VERSION_NAME \
--model $MODEL_NAME \
--accelerator count=1,type=nvidia-tesla-t4 \
--config config_simple.yaml

#### To see details of the Model Version just created

In [None]:
!gcloud ai-platform versions describe $VERSION_NAME --model=$MODEL_NAME

#### To list all Model Versions and their states in this Model

In [None]:
!gcloud ai-platform versions list --model=$MODEL_NAME

#### Run Prediction

The "simple" model takes two tensors with shape [1,16] and performs a couple of basic arithmetic operations on them.

In [None]:
!curl -X POST $ENDPOINT/projects/$PROJECT_ID/models/$MODEL_NAME/versions/$VERSION_NAME:predict \
    -k -H "Content-Type: application/json" \
    -H "Authorization: Bearer `gcloud auth print-access-token`" \
    -d @simple.json
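
To sanity-check the response, it can help to compute the expected values locally. The sketch below assumes the two operations are element-wise addition and subtraction, as in Triton's example "simple" model. The function name `simple_model_reference` is hypothetical.

```python
def simple_model_reference(input0, input1):
    """Element-wise sum and difference of two equal-length integer lists."""
    # Assumption: the model returns input0 + input1 and input0 - input1.
    output0 = [a + b for a, b in zip(input0, input1)]
    output1 = [a - b for a, b in zip(input0, input1)]
    return output0, output1


sums, diffs = simple_model_reference(list(range(16)), [1] * 16)
```

If the assumption holds, the two output tensors in the prediction response should match the values this function gives for the inputs stored in `simple.json`.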

### ResNet-50 model (binary data)

#### Create Model

In [None]:
BINARY_MODEL_NAME='resnet50_netdef'

In [None]:
!gcloud beta ai-platform models create $BINARY_MODEL_NAME --regions us-central1 --enable-console-logging

#### Create Model Version

In [None]:
BINARY_VERSION_NAME='v01'

In [None]:
import yaml

# Same settings as the simple model; only the routes point at a different model.
config_binary = {
    'deploymentUri': MODEL_BUCKET + '/model_repository',
    'container': {
        # Same Artifact Registry image as the simple model.
        'image': REGION + '-docker.pkg.dev/' + PROJECT_ID + '/' + REPOSITORY + '/tritonserver:20.06-py3',
        'args': ['tritonserver', '--model-repository=$(AIP_STORAGE_URI)'],
        'env': [],
        'ports': {'containerPort': 8000},
    },
    'routes': {
        'predict': '/v2/models/' + BINARY_MODEL_NAME + '/infer',
        'health': '/v2/models/' + BINARY_MODEL_NAME,
    },
    'machineType': 'n1-standard-4',
    'autoScaling': {'minNodes': 1},
}

# Do not assign the result: yaml.dump returns None when given a file.
with open('config_binary.yaml', 'w') as file:
    yaml.dump(config_binary, file)

In [None]:
!gcloud beta ai-platform versions create $BINARY_VERSION_NAME \
--model $BINARY_MODEL_NAME \
--accelerator count=1,type=nvidia-tesla-t4 \
--config config_binary.yaml

#### To see details of the Model Version just created

In [None]:
!gcloud ai-platform versions describe $BINARY_VERSION_NAME --model=$BINARY_MODEL_NAME

#### To list all Model Versions and their states in this Model

In [None]:
!gcloud ai-platform versions list --model=$BINARY_MODEL_NAME

#### Run Prediction

Recall the offset value calculated above.  The binary case has an additional header carrying that offset; replace `138` in the command below if your offset differs:

`Inference-Header-Content-Length: [offset]`

In [None]:
!curl --request POST $ENDPOINT/projects/$PROJECT_ID/models/$BINARY_MODEL_NAME/versions/$BINARY_VERSION_NAME:predict \
    -k -H "Content-Type: application/octet-stream" \
    -H "Authorization: Bearer `gcloud auth print-access-token`" \
    -H "Inference-Header-Content-Length: 138" \
    --data-binary @payload.dat
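
If the request is rejected, one thing worth checking is whether the offset really marks the end of the JSON part of the payload. The sketch below splits a payload at a given offset and compares the trailing byte count with the `binary_data_size` values declared in the header. The function names are hypothetical, and the check assumes every input is sent as binary data.

```python
import json


def split_binary_payload(data, header_length):
    """Split payload bytes into (parsed JSON header, trailing binary bytes)."""
    header = json.loads(data[:header_length].decode("utf-8"))
    return header, data[header_length:]


def offset_is_consistent(data, header_length):
    """Return True if the header parses and accounts for all trailing bytes."""
    try:
        header, binary = split_binary_payload(data, header_length)
    except ValueError:
        # Covers malformed JSON and undecodable bytes: the offset is wrong.
        return False
    if not isinstance(header, dict):
        return False
    declared = sum(
        inp.get("parameters", {}).get("binary_data_size", 0)
        for inp in header.get("inputs", [])
    )
    return declared == len(binary)
```

For example, after reading the file with `data = open('payload.dat', 'rb').read()`, the call `offset_is_consistent(data, 138)` should return true when 138 is the right offset for that payload.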

## Clean up

In [None]:
!gcloud ai-platform versions delete $VERSION_NAME --model=$MODEL_NAME --quiet

In [None]:
!gcloud ai-platform models delete $MODEL_NAME --quiet

In [None]:
!gcloud ai-platform versions delete $BINARY_VERSION_NAME --model=$BINARY_MODEL_NAME --quiet

In [None]:
!gcloud ai-platform models delete $BINARY_MODEL_NAME --quiet

In [None]:
!gsutil -m rm -r -f $MODEL_BUCKET

In [None]:
!rm -rf model_repository triton-inference-server *.yaml *.dat *.json