# Deploying NVIDIA Triton Inference Server in AI Platform Prediction Custom Container

In this notebook, we walk through deploying NVIDIA's Triton Inference Server to the AI Platform Prediction Custom Container service in Direct Model Server mode:

![](img/caip_triton_container_diagram_direct.jpg)


In [None]:
%env PROJECT_ID=[enter project name]
%env MODEL_BUCKET=gs://[enter GCS bucket name]
%env ENDPOINT=https://alpha-ml.googleapis.com/v1


In [1]:
%env PROJECT_ID=tsaikevin-triton-2
%env MODEL_BUCKET=gs://tsaikevin-triton-2-models
%env ENDPOINT=https://alpha-ml.googleapis.com/v1


env: PROJECT_ID=tsaikevin-triton-2
env: MODEL_BUCKET=gs://tsaikevin-triton-2-models
env: ENDPOINT=https://alpha-ml.googleapis.com/v1


### Prepare model Artifacts

Clone the NVIDIA Triton Inference Server repo.

In [2]:
!git clone https://github.com/NVIDIA/triton-inference-server.git

Cloning into 'triton-inference-server'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 22796 (delta 16), reused 22 (delta 7), pack-reused 22750
Receiving objects: 100% (22796/22796), 13.11 MiB | 15.87 MiB/s, done.
Resolving deltas: 100% (16833/16833), done.


Create the GCS bucket to which the model artifacts will be copied.

In [None]:
!gsutil mb $MODEL_BUCKET

Stage model artifacts and copy to bucket.

In [5]:
!mkdir model_repository

In [6]:
!cp -R triton-inference-server/docs/examples/model_repository/* model_repository/

In [7]:
!./triton-inference-server/docs/examples/fetch_models.sh

+ mkdir -p model_repository/resnet50_netdef/1
+ wget -O model_repository/resnet50_netdef/1/model.netdef http://download.caffe2.ai.s3.amazonaws.com/models/resnet50/predict_net.pb
--2020-07-31 07:53:51--  http://download.caffe2.ai.s3.amazonaws.com/models/resnet50/predict_net.pb
Resolving download.caffe2.ai.s3.amazonaws.com (download.caffe2.ai.s3.amazonaws.com)... 52.216.28.44
Connecting to download.caffe2.ai.s3.amazonaws.com (download.caffe2.ai.s3.amazonaws.com)|52.216.28.44|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31649 (31K) [binary/octet-stream]
Saving to: ‘model_repository/resnet50_netdef/1/model.netdef’


2020-07-31 07:53:51 (1.03 MB/s) - ‘model_repository/resnet50_netdef/1/model.netdef’ saved [31649/31649]

+ wget -O model_repository/resnet50_netdef/1/init_model.netdef http://download.caffe2.ai.s3.amazonaws.com/models/resnet50/init_net.pb
--2020-07-31 07:53:51--  http://download.caffe2.ai.s3.amazonaws.com/models/resnet50/init_net.pb
Resolving downloa

In [None]:
!gsutil -m cp -R model_repository/ $MODEL_BUCKET

In [295]:
!gsutil ls $MODEL_BUCKET/model_repository

gs://tsaikevin-triton-2-models/model_repository/densenet_onnx/
gs://tsaikevin-triton-2-models/model_repository/inception_graphdef/
gs://tsaikevin-triton-2-models/model_repository/resnet50_netdef/
gs://tsaikevin-triton-2-models/model_repository/simple/
gs://tsaikevin-triton-2-models/model_repository/simple_string/


## Create and deploy Model and Model Version

In this section, we will deploy two models:
1. A simple model with non-binary data.  The KF Serving v2 protocol specifies a JSON format that carries non-binary tensor data in the JSON body itself.
2. A binary data model with ResNet-50.  This uses Triton's implementation of binary data for the KF Serving v2 protocol.
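
To make the non-binary format concrete, here is a minimal sketch that assembles a v2-style JSON request body in Python. The helper `build_v2_request` is hypothetical; the field names (`id`, `inputs`, `name`, `shape`, `datatype`, `data`) and the tensor names `INPUT0` and `INPUT1` mirror the request sent to the "simple" model later in this notebook.

```python
import json


def build_v2_request(inputs, request_id="0"):
    """Build a JSON request body from {name: (shape, datatype, flat_data)}."""
    body = {"id": request_id, "inputs": []}
    for name, (shape, datatype, data) in inputs.items():
        # The flat data list must hold exactly as many values as the shape implies.
        expected = 1
        for dim in shape:
            expected *= dim
        if len(data) != expected:
            raise ValueError(
                f"{name}: shape {shape} needs {expected} values, got {len(data)}"
            )
        body["inputs"].append(
            {"name": name, "shape": list(shape), "datatype": datatype, "data": list(data)}
        )
    return json.dumps(body)


request_json = build_v2_request({
    "INPUT0": ([1, 16], "INT32", range(16)),
    "INPUT1": ([1, 16], "INT32", [-1] * 16),
})
print(request_json)
```

Checking the element count up front catches shape mismatches locally instead of waiting for the server to reject the request.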

### Simple Model with non-binary data

#### Create Model

AI Platform Prediction uses a Model/Model Version Hierarchy, where the Model is a logical grouping of Model Versions.  We will first create the Model.

Because the MODEL_NAME variable is used later to build the predict route, and Triton uses that route to run prediction on a specific model, its value must be the name of a model in the model repository.  For this section, we will use the "simple" model.

In [8]:
%env MODEL_NAME=simple

env: MODEL_NAME=simple


In [306]:
!curl -X POST -v -k -H "Content-Type: application/json" \
    -d "{'name': '"$MODEL_NAME"'}" \
    -H "Authorization: Bearer `gcloud auth print-access-token`" \
    "${ENDPOINT}/projects/${PROJECT_ID}/models/"

Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 172.217.212.95:443...
* Connected to alpha-ml.googleapis.com (172.217.212.95) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /opt/conda/ssl/cacert.pem
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=California; L=Mountain View; O=Google LLC; CN=upload.video.google.com
*  start date: Jul  7 08:08:59 2020 GMT
*  expire date: Sep 29 08:08:59 2020 GMT
*  i

#### Create Model Version

After the Model is created, we can now create a Model Version under this Model.  Each Model Version will need a name that is unique within the Model.  In AI Platform Prediction Custom Container, a {Project}/{Model}/{ModelVersion} uniquely identifies the specific container and model artifact used for inference.
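
As an illustration of that naming scheme, the sketch below builds the resource path and the corresponding predict URL from their three components. The function names are hypothetical and `my-project` is a placeholder; the `projects/.../models/.../versions/...` path shape and the `:predict` suffix follow the requests and responses shown in this notebook.

```python
def version_resource_name(project, model, version):
    """Return the {Project}/{Model}/{ModelVersion} resource path."""
    return f"projects/{project}/models/{model}/versions/{version}"


def predict_url(endpoint, project, model, version):
    """Return the URL that receives prediction requests for one Model Version."""
    return f"{endpoint}/{version_resource_name(project, model, version)}:predict"


print(version_resource_name("my-project", "simple", "v20"))
print(predict_url("https://alpha-ml.googleapis.com/v1", "my-project", "simple", "v20"))
```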

In [10]:
%env VERSION_NAME=v20

env: VERSION_NAME=v20


The following specifications tell AI Platform how to create the Model Version.

In [11]:
import json
import os

triton_simple_version = {
  "name": os.getenv("VERSION_NAME"),
  "deployment_uri": os.getenv("MODEL_BUCKET")+"/model_repository",
  "container": {
    "image": "gcr.io/"+os.getenv("PROJECT_ID")+"/tritonserver:20.06-py3",
    "args": ["tritonserver",
             "--model-repository=$(AIP_STORAGE_URI)"
    ],
    "env": [
    ], 
    "ports": [
      { "containerPort": 8000 }
    ]
  },
  "routes": {
    "predict": "/v2/models/"+os.getenv("MODEL_NAME")+"/infer",
    "health": "/v2/models/"+os.getenv("MODEL_NAME")
  },
  "machine_type": "n1-standard-4",
  "acceleratorConfig": {
    "count":1,
    "type":"nvidia-tesla-t4"
  }
}

with open("triton_simple_version.json", "w") as f: 
  json.dump(triton_simple_version, f)

In [12]:
triton_simple_version

{'name': 'v20',
 'deployment_uri': 'gs://tsaikevin-triton-2-models/model_repository',
 'container': {'image': 'gcr.io/tsaikevin-triton-2/tritonserver:20.06-py3',
  'args': ['tritonserver', '--model-repository=$(AIP_STORAGE_URI)'],
  'env': [],
  'ports': [{'containerPort': 8000}]},
 'routes': {'predict': '/v2/models/simple/infer',
  'health': '/v2/models/simple'},
 'machine_type': 'n1-standard-4',
 'acceleratorConfig': {'count': 1, 'type': 'nvidia-tesla-t4'}}
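
Only the version name and the model name in the two routes are specific to this model, so the same specification can be produced by a small helper. The sketch below is one way to do that; `build_version_spec` is a hypothetical name, and the machine type, accelerator, port, and image tag are simply the values used in this notebook rather than requirements.

```python
def build_version_spec(version_name, model_name, project_id, model_bucket,
                       image_tag="20.06-py3"):
    """Return a Model Version spec whose routes point at one Triton model."""
    return {
        "name": version_name,
        "deployment_uri": f"{model_bucket}/model_repository",
        "container": {
            "image": f"gcr.io/{project_id}/tritonserver:{image_tag}",
            "args": ["tritonserver", "--model-repository=$(AIP_STORAGE_URI)"],
            "env": [],
            "ports": [{"containerPort": 8000}],
        },
        "routes": {
            # Both routes embed the Triton model name, not the version name.
            "predict": f"/v2/models/{model_name}/infer",
            "health": f"/v2/models/{model_name}",
        },
        "machine_type": "n1-standard-4",
        "acceleratorConfig": {"count": 1, "type": "nvidia-tesla-t4"},
    }


spec = build_version_spec("v20", "simple", "my-project", "gs://my-bucket")
print(spec["routes"])
```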

In [13]:
!curl --request \
    POST -v -k -H "Content-Type: application/json" \
    -d @triton_simple_version.json \
    -H "Authorization: Bearer `gcloud auth print-access-token`" \
    "${ENDPOINT}/projects/${PROJECT_ID}/models/${MODEL_NAME}/versions"

Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 172.217.214.95:443...
* Connected to alpha-ml.googleapis.com (172.217.214.95) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /opt/conda/ssl/cacert.pem
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=California; L=Mountain View; O=Google LLC; CN=upload.video.google.com
*  start date: Jul  7 08:08:59 2020 GMT
*  expire date: Sep 29 08:08:59 2020 GMT
*  i

#### Check the status of Model Version creation

Creating a Model Version may take several minutes.  You can check the status of this specific Model Version with the following request; a successful deployment shows:

`"state": "READY"`

In [14]:
!curl --request GET -k -H "Content-Type: application/json" \
    -H "Authorization: Bearer `gcloud auth print-access-token`" \
    "${ENDPOINT}/projects/${PROJECT_ID}/models/${MODEL_NAME}/versions/${VERSION_NAME}" 

{
  "name": "projects/tsaikevin-triton-2/models/simple/versions/v20",
  "deploymentUri": "gs://tsaikevin-triton-2-models/model_repository",
  "createTime": "2020-07-31T07:54:29Z",
  "state": "CREATING",
  "etag": "Be1xEHxSY7M=",
  "machineType": "n1-standard-4",
  "acceleratorConfig": {
    "count": "1",
    "type": "NVIDIA_TESLA_T4"
  },
  "container": {
    "image": "gcr.io/tsaikevin-triton-2/tritonserver:20.06-py3",
    "args": [
      "tritonserver",
      "--model-repository=$(AIP_STORAGE_URI)"
    ],
    "ports": [
      {
        "containerPort": 8000
      }
    ]
  },
  "routes": {
    "predict": "/v2/models/simple/infer",
    "health": "/v2/models/simple"
  }
}
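
Rather than re-running the status request by hand, the check can be wrapped in a polling loop. The following is a minimal sketch, assuming a caller-supplied `fetch_version` function (hypothetical) that performs the GET request above and returns the parsed JSON as a dict. The state names `CREATING`, `READY`, and `FAILED` and the `errorMessage` field are the ones that appear in the responses in this notebook; the interval and timeout are arbitrary.

```python
import time


def wait_for_ready(fetch_version, poll_seconds=30, timeout_seconds=1800,
                   sleep=time.sleep):
    """Poll fetch_version() until the version is READY, FAILED, or times out."""
    waited = 0
    while True:
        version = fetch_version()
        state = version.get("state")
        if state == "READY":
            return version
        if state == "FAILED":
            raise RuntimeError(version.get("errorMessage", "version creation failed"))
        if waited >= timeout_seconds:
            raise TimeoutError(f"still in state {state!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with a stand-in fetcher; a real one would issue the GET request above.
responses = iter([{"state": "CREATING"}, {"state": "READY"}])
print(wait_for_ready(lambda: next(responses), sleep=lambda seconds: None)["state"])
```

Passing `sleep` as a parameter keeps the loop easy to exercise without actually waiting.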


#### To list all Model Versions and their states in this Model:

In [323]:
!curl --request GET -k -H "Content-Type: application/json" \
    -H "Authorization: Bearer `gcloud auth print-access-token`" \
    "${ENDPOINT}/projects/${PROJECT_ID}/models/${MODEL_NAME}/versions/" 

{
  "versions": [
    {
      "name": "projects/tsaikevin-triton-2/models/simple/versions/v15",
      "deploymentUri": "gs://tsaikevin-triton-2-models",
      "createTime": "2020-07-30T10:25:03Z",
      "state": "FAILED",
      "errorMessage": "model server container terminated: exit_code: 1\nreason: \"Error\"\nstarted_at {\n  seconds: 1596106476\n}\nfinished_at {\n  seconds: 1596106485\n}\n",
      "etag": "7ExBP8Ff1Y8=",
      "machineType": "n1-standard-4",
      "acceleratorConfig": {
        "count": "1",
        "type": "NVIDIA_TESLA_T4"
      },
      "container": {
        "image": "gcr.io/tsaikevin-triton-2/tritonserver:20.06-py3",
        "args": [
          "tritonserver",
          "--model-repository=$(AIP_STORAGE_URI)"
        ],
        "ports": [
          {
            "containerPort": 8000
          }
        ]
      },
      "routes": {
        "predict": "/v2/models/simple/infer",
        "health": "/v2/models/simple"
      }
    },
    {
      "name": "projects/tsa

#### Run Prediction

The "simple" model takes two INT32 tensors of shape [1,16] and returns their element-wise sum (OUTPUT0) and element-wise difference (OUTPUT1).

In [244]:
!curl -X POST ${ENDPOINT}/projects/${PROJECT_ID}/models/${MODEL_NAME}/versions/${VERSION_NAME}:predict \
    -k -H "Content-Type: application/json" \
    -H "Authorization: Bearer `gcloud auth print-access-token`" \
    -d '{ \
            "id": "0", \
            "inputs": [ \
                { \
                    "name": "INPUT0", \
                    "shape": [1, 16], \
                    "datatype": "INT32", \
                    "parameters": {}, \
                    "data": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] \
                }, \
                { \
                    "name": "INPUT1", \
                    "shape": [1, 16], \
                    "datatype": "INT32", \
                    "parameters": {}, \
                    "data": [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1] \
                } \
            ] \
        }'

{"id":"0","model_name":"simple","model_version":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":[1,16],"data":[-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]},{"name":"OUTPUT1","datatype":"INT32","shape":[1,16],"data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}]}
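
Comparing the request with the response shows that OUTPUT0 is the element-wise sum of the two inputs and OUTPUT1 is their element-wise difference. The short sketch below reproduces both results locally in plain Python, which can be handy for checking a response.

```python
input0 = list(range(16))
input1 = [-1] * 16

output0 = [a + b for a, b in zip(input0, input1)]  # element-wise sum
output1 = [a - b for a, b in zip(input0, input1)]  # element-wise difference

print(output0)
print(output1)
```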

### ResNet-50 with binary data

#### Create Model

In [15]:
%env BINARY_MODEL_NAME=resnet50_netdef

env: BINARY_MODEL_NAME=resnet50_netdef


In [297]:
!curl -X POST -v -k -H "Content-Type: application/json" \
  -d "{'name': '"$BINARY_MODEL_NAME"'}" \
  -H "Authorization: Bearer `gcloud auth print-access-token`" \
  "${ENDPOINT}/projects/${PROJECT_ID}/models/"

Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 74.125.69.95:443...
* Connected to alpha-ml.googleapis.com (74.125.69.95) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /opt/conda/ssl/cacert.pem
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=California; L=Mountain View; O=Google LLC; CN=upload.video.google.com
*  start date: Jul  7 08:08:59 2020 GMT
*  expire date: Sep 29 08:08:59 2020 GMT
*  issue

#### Create Model Version

In [16]:
%env BINARY_VERSION_NAME=v2

env: BINARY_VERSION_NAME=v2


In [17]:
triton_binary_version = {
  "name": os.getenv("BINARY_VERSION_NAME"),
  "deployment_uri": os.getenv("MODEL_BUCKET")+"/model_repository",
  "container": {
    "image": "gcr.io/"+os.getenv("PROJECT_ID")+"/tritonserver:20.06-py3",
    "args": ["tritonserver",
             "--model-repository=$(AIP_STORAGE_URI)"
    ],
    "env": [
    ], 
    "ports": [
      { "containerPort": 8000 }
    ]
  },
  "routes": {
    "predict": "/v2/models/"+os.getenv("BINARY_MODEL_NAME")+"/infer",
    "health": "/v2/models/"+os.getenv("BINARY_MODEL_NAME")
  },
  "machine_type": "n1-standard-4",
  "acceleratorConfig": {
    "count":1,
    "type":"nvidia-tesla-t4"
  }
}

with open("triton_binary_version.json", "w") as f: 
  json.dump(triton_binary_version, f)

In [18]:
!curl -X POST -v -k -H "Content-Type: application/json" \
  -d @triton_binary_version.json \
  -H "Authorization: Bearer `gcloud auth print-access-token`" \
  ${ENDPOINT}/projects/${PROJECT_ID}/models/${BINARY_MODEL_NAME}/versions

Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 172.217.214.95:443...
* Connected to alpha-ml.googleapis.com (172.217.214.95) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /opt/conda/ssl/cacert.pem
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=California; L=Mountain View; O=Google LLC; CN=upload.video.google.com
*  start date: Jul  7 08:08:59 2020 GMT
*  expire date: Sep 29 08:08:59 2020 GMT
*  i

#### Check Model Version status

In [19]:
!curl --request GET -k -H "Content-Type: application/json" \
    -H "Authorization: Bearer `gcloud auth print-access-token`" \
    "${ENDPOINT}/projects/${PROJECT_ID}/models/${BINARY_MODEL_NAME}/versions/${BINARY_VERSION_NAME}" 

{
  "name": "projects/tsaikevin-triton-2/models/resnet50_netdef/versions/v2",
  "deploymentUri": "gs://tsaikevin-triton-2-models/model_repository",
  "createTime": "2020-07-31T07:54:57Z",
  "state": "CREATING",
  "etag": "3PLPMvxvNd0=",
  "machineType": "n1-standard-4",
  "acceleratorConfig": {
    "count": "1",
    "type": "NVIDIA_TESLA_T4"
  },
  "container": {
    "image": "gcr.io/tsaikevin-triton-2/tritonserver:20.06-py3",
    "args": [
      "tritonserver",
      "--model-repository=$(AIP_STORAGE_URI)"
    ],
    "ports": [
      {
        "containerPort": 8000
      }
    ]
  },
  "routes": {
    "predict": "/v2/models/resnet50_netdef/infer",
    "health": "/v2/models/resnet50_netdef"
  }
}


#### Prepare Binary Request Payload

Triton's implementation of the KF Serving v2 protocol for binary data appends the binary data after the JSON body.  Triton requires an additional header that gives the length of the JSON portion, which is the offset at which the binary data begins:

`Inference-Header-Content-Length: [offset]`

We have provided a script that resizes the image to the input size ResNet-50 expects (224 x 224 pixels, 3 channels) and calculates the proper offset.
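
To show how the offset relates to the payload, here is a minimal sketch that builds a binary request by hand. It is not the provided script: `build_binary_request` is a hypothetical helper, the zero-filled tensor stands in for a real image, and the shape is illustrative (the model configuration determines whether a batch dimension is also needed). It assumes Triton's binary data extension, in which each input declares its byte count in a `binary_data_size` parameter and the raw little-endian tensor bytes follow the JSON. The input name `gpu_0/data` is the one reported by the server for `resnet50_netdef`.

```python
import json
import struct


def build_binary_request(input_name, shape, values):
    """Return (payload_bytes, json_header_length) for one FP32 input."""
    raw = struct.pack(f"<{len(values)}f", *values)  # little-endian float32
    header = {
        "inputs": [{
            "name": input_name,
            "shape": list(shape),
            "datatype": "FP32",
            # Tells the server how many bytes after the JSON belong to this input.
            "parameters": {"binary_data_size": len(raw)},
        }]
    }
    header_bytes = json.dumps(header).encode("utf-8")
    return header_bytes + raw, len(header_bytes)


payload, header_length = build_binary_request(
    "gpu_0/data", [3, 224, 224], [0.0] * (3 * 224 * 224)
)
print("Inference-Header-Content-Length:", header_length)
print("total bytes:", len(payload))
```

The header value is the length of the JSON bytes only, so the server can split the body into the JSON part and the tensor bytes that follow it.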

In [247]:
!pip3 install geventhttpclient

Collecting geventhttpclient
  Downloading geventhttpclient-1.4.4-cp37-cp37m-manylinux2010_x86_64.whl (77 kB)
Installing collected packages: geventhttpclient
Successfully installed geventhttpclient-1.4.4


The following command takes an image file and outputs the necessary data structure for Triton.

In [3]:
!python3 get_request_body_simple.py -m image -f triton-inference-server/qa/images/mug.jpg

(3, 224, 224)
Add Header: Inference-Header-Content-Length: 138


#### Run Prediction

In [21]:
!curl --request POST ${ENDPOINT}/projects/${PROJECT_ID}/models/${BINARY_MODEL_NAME}/versions/${BINARY_VERSION_NAME}:predict \
    -k -H "Content-Type: application/json" \
    -H "Authorization: Bearer `gcloud auth print-access-token`" \
    -H "Inference-Header-Content-Length: 138" \
    -d @payload.dat

{
  "error": {
    "code": 400,
    "message": "{\"error\":\"unexpected size for input 'gpu_0/data', expecting 599327 bytes for model 'resnet50_netdef'\"}",
    "status": "INVALID_ARGUMENT"
  }
}
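
One plausible cause of a size mismatch like this is the way the payload is sent: when curl's `-d` option reads from a file, it strips carriage returns and newlines, which alters binary data, whereas `--data-binary` sends the file unmodified. As an alternative, the sketch below builds the same request in Python so that the payload bytes go out unchanged. `build_predict_request` is a hypothetical helper, the URL and token are placeholders, and the headers simply mirror the curl command above.

```python
import urllib.request


def build_predict_request(url, payload, header_length, access_token):
    """Build (but do not send) a POST request carrying the payload bytes as-is."""
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
            "Inference-Header-Content-Length": str(header_length),
        },
    )


# Stand-in values for illustration; a real payload would be read from
# payload.dat with open("payload.dat", "rb").read().
request = build_predict_request(
    "https://alpha-ml.googleapis.com/v1/projects/my-project/models/resnet50_netdef/versions/v2:predict",
    b'{"inputs": []}\r\n\x00\x01',
    14,
    "ACCESS_TOKEN",
)
print(request.get_method(), len(request.data))
# urllib.request.urlopen(request) would send it.
```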


In [339]:
!curl -X POST ${ENDPOINT}/projects/${PROJECT_ID}/models/${BINARY_MODEL_NAME}/versions/${BINARY_VERSION_NAME}:predict \
    -k -H "Content-Type: application/json" \
    -H "Authorization: Bearer `gcloud auth print-access-token`" \
    -d @payload.dat

{
  "error": {
    "code": 400,
    "message": "{\"error\":\"failed to parse the request JSON buffer: The document root must not be followed by other values. at 138\"}",
    "status": "INVALID_ARGUMENT"
  }
}


In [334]:
!ls -lR triton-inference-server/qa/images

triton-inference-server/qa/images:
total 1092
-rw-r--r-- 1 jupyter jupyter    7999 Jul 30 09:42 car.jpg
-rw-r--r-- 1 jupyter jupyter 1005970 Jul 30 09:42 mug.jpg
-rw-r--r-- 1 jupyter jupyter   99689 Jul 30 09:42 vulture.jpeg


