# Using Vertex AI for online serving with NVIDIA Triton

This notebook demonstrates serving an ensemble model, consisting of NVTabular preprocessing and a HugeCTR recommender, on Triton Inference Server.

The notebook compiles prescriptive guidance for the following tasks:

- Building a custom container derived from the NVIDIA NGC Merlin inference image and the model artifacts
- Creating a Vertex model using the custom container
- Creating a Vertex endpoint and deploying the model to that endpoint
- Getting inference on a sample dataset using the endpoint

## Model serving

[Triton Inference Server](https://github.com/triton-inference-server/server) provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.
Triton can load models from local storage or cloud platforms. As models are retrained with new data, developers can easily make updates without restarting the inference server or disrupting the application.

Triton runs multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, it automatically creates an instance of each model on each GPU to increase utilization without extra coding.

It supports real-time inferencing, batch inferencing to maximize GPU/CPU utilization, and streaming inference with built-in support for audio streaming input. It also supports model ensemble for use cases that require multiple models to perform end-to-end inference, such as conversational AI.

Users can also use shared memory. The inputs and outputs that pass to and from Triton are stored in shared memory, reducing HTTP/gRPC overhead and increasing performance.

The following figure shows the Triton Inference Server high-level architecture. The model repository is a file-system-based repository of the models that Triton makes available for inferencing. Inference requests arrive at the server via HTTP/REST, gRPC, or the C API and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model's scheduler optionally batches inference requests and then passes them to the backend corresponding to the model type. The backend performs inferencing using the inputs provided in the batched requests to produce the requested outputs, which are then returned to the caller.

<img src="./images/triton-architecture.png" alt="Triton Architecture" />
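The routing and batching steps described above can be illustrated with a toy sketch. This is not Triton's scheduler: the `schedule` function, the fixed `max_batch_size`, and the second model name `other_model` are hypothetical and exist only to show requests being routed to per-model queues and then grouped into batches.

```python
from collections import defaultdict


def schedule(requests, max_batch_size):
    """Route requests to per-model queues, then split each queue into batches."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")

    # Step 1: route every request to the queue of the model it targets.
    queues = defaultdict(list)
    for request in requests:
        queues[request["model"]].append(request)

    # Step 2: split each per-model queue into batches of bounded size.
    batches = {}
    for model, queue in queues.items():
        batches[model] = [
            queue[i:i + max_batch_size]
            for i in range(0, len(queue), max_batch_size)
        ]
    return batches


# Five requests for the ensemble and one for a hypothetical second model.
requests = [{"model": "deepfm_ens", "id": str(i)} for i in range(5)]
requests.append({"model": "other_model", "id": "5"})

plan = schedule(requests, max_batch_size=2)
print({model: [len(batch) for batch in model_batches]
       for model, model_batches in plan.items()})
```

In Triton itself, batching behavior is configured per model in the model configuration rather than in client code.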

Triton supports a backend C API that allows Triton to be extended with new functionality such as custom pre- and post-processing operations or even a new deep-learning framework.

The models being served by Triton can be queried and controlled through a dedicated model management API that is available over HTTP/REST, gRPC, or the C API.
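As a rough sketch of what the HTTP/REST side of the management API looks like, the helper below builds the route for each common operation. The paths follow Triton's documented model repository extension and the V2 model metadata routes. The helper name `model_management_routes` and the base URL `http://localhost:8000` are assumptions for illustration; the port matches the HTTP port used elsewhere in this notebook.

```python
def model_management_routes(model_name, base_url="http://localhost:8000"):
    """Return (HTTP method, URL) pairs for common model management calls."""
    base = base_url.rstrip("/")
    return {
        # List the models found in the model repository.
        "index": ("POST", f"{base}/v2/repository/index"),
        # Load or unload a single model (needs explicit model control mode).
        "load": ("POST", f"{base}/v2/repository/models/{model_name}/load"),
        "unload": ("POST", f"{base}/v2/repository/models/{model_name}/unload"),
        # Query model metadata and configuration.
        "metadata": ("GET", f"{base}/v2/models/{model_name}"),
        "config": ("GET", f"{base}/v2/models/{model_name}/config"),
    }


routes = model_management_routes("deepfm_ens")
for name, (method, url) in routes.items():
    print(f"{name:>8}: {method} {url}")
```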

Readiness and liveness health endpoints, together with utilization, throughput, and latency metrics, ease the integration of Triton into deployment frameworks such as Kubernetes.
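A minimal sketch of how these endpoints might be probed from Python, assuming a Triton server reachable on `localhost` with the default HTTP port 8000 and metrics port 8002. The helper names `health_urls` and `probe` are hypothetical; the URL paths are the standard V2 health routes and Triton's Prometheus metrics route.

```python
import urllib.request


def health_urls(model_name, host="localhost", http_port=8000, metrics_port=8002):
    """Build the health and metrics URLs for a Triton server."""
    base = f"http://{host}:{http_port}"
    return {
        "live": f"{base}/v2/health/live",
        "ready": f"{base}/v2/health/ready",
        "model_ready": f"{base}/v2/models/{model_name}/ready",
        "metrics": f"http://{host}:{metrics_port}/metrics",
    }


def probe(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError, HTTPError, and timeouts are all OSError subclasses
        return False


urls = health_urls("deepfm_ens")
print(urls["ready"])
# With a server running, probe(urls["ready"]) reports whether it is ready.
```

The `/v2/health/ready` path is the same route that is passed to Vertex AI as the container health route later in this notebook.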

Here we use Triton to serve an ensemble model that combines NVTabular data processing operations with a HugeCTR model trained on Criteo data. The model is deployed to Google's Vertex AI and served via a Vertex endpoint.

## Notebook flow

This notebook assumes that the ensemble model, containing the trained HugeCTR model and the NVTabular preprocessing workflow, has been created using the ... notebook.

As you walk through the notebook you will execute the following steps:
- Configure notebook environment settings like GCP project and compute region.
- Build a custom Vertex container based on the NVIDIA NGC Merlin inference container.
- Configure and submit the model based on the custom container.
- Create the endpoint.
- Configure the deployment of the model and submit the deployment job.

In [1]:
import json
import os
import random
import sys
import pandas as pd
import numpy as np

import google.auth
from google.cloud import aiplatform
from google.cloud.aiplatform import gapic as aip
from google.cloud.aiplatform import hyperparameter_tuning as hpt
from google.protobuf.json_format import MessageToDict

## Configure notebook settings

In [2]:
PROJECT_ID = 'merlin-on-gcp'
REGION = "us-central1"
BUCKET_NAME = "gs://workshop-datasets/merlin/"

## Create the ensemble model

### Copying the NVTabular workflow artifact created in step 1 to a temporary local path

In [3]:
!gsutil cp -r gs://workshop-datasets/merlin/criteo_processed_parquet_0.6/workflow/ ./src/tmp/

Copying gs://workshop-datasets/merlin/criteo_processed_parquet_0.6/workflow/categories/unique.C1.parquet...
Copying gs://workshop-datasets/merlin/criteo_processed_parquet_0.6/workflow/categories/unique.C10.parquet...
Copying gs://workshop-datasets/merlin/criteo_processed_parquet_0.6/workflow/categories/unique.C11.parquet...
Copying gs://workshop-datasets/merlin/criteo_processed_parquet_0.6/workflow/categories/unique.C12.parquet...
| [4 files][131.4 MiB/131.4 MiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://workshop-datasets/merlin/criteo_processed_parquet_0.6/workflow/categories/unique.C13.parquet...
Copying gs://workshop-datasets/merlin/criteo_processed_parquet_0.6/workflow/categories/unique.C14.parquet...
Copying gs://w

### Copying the HugeCTR model artifact created in step 2 to a temporary local path

In [4]:
!gsutil cp -r gs://workshop-datasets/merlin/model_21.09/ ./src/tmp/

Copying gs://workshop-datasets/merlin/model_21.09/deepfm.json...
Copying gs://workshop-datasets/merlin/model_21.09/deepfm0_opt_sparse_0.model... 
Copying gs://workshop-datasets/merlin/model_21.09/deepfm0_sparse_0.model/emb_vector...
==> NOTE: You are downloading one or more large file(s), which would            
run significantly faster if you enabled sliced object downloads. This
feature is enabled by default but requires that compiled crcmod be
installed (see "gsutil help crcmod").

Copying gs://workshop-datasets/merlin/model_21.09/deepfm0_sparse_0.model/key... 
\ [4 files][  3.7 GiB/  3.7 GiB]   18.1 MiB/s                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://workshop-datasets/merlin/model_21.09/deepfm0_sparse_0.model/slot_id...
Copying gs:/

### Exporting the ensemble model

In [5]:
# from src.serving import export

# workflow_path = "./src/tmp/criteo_processed_parquet_0.6/workflow/"
# saved_model_path = "./src/tmp/model_21.09"
# output_path = "./src/tmp/models"
# label_columns = ["label"]
# categorical_columns = ["C" + str(x) for x in range(1, 27)]
# continuous_columns = ["I" + str(x) for x in range(1, 14)]

# export.export_ensemble(workflow_path, saved_model_path, output_path, categorical_columns, continuous_columns, label_columns)

### Copying the ensemble model to the ensemble model location on GCS

In [4]:
# model_ensemble_location = f"{BUCKET_NAME}/models/"
# !gsutil cp -r {output_path} {model_ensemble_location}

## Deploy the model to Vertex AI

### Initialize Vertex AI SDK

In [5]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

### Build a custom prediction container

In [6]:
IMAGE_NAME = 'triton_deploy-hugectr'
IMAGE_URI = f"gcr.io/{PROJECT_ID}/{IMAGE_NAME}"
DOCKERFILE = 'src/Dockerfile.triton'

In [7]:
!docker build -t {IMAGE_URI} -f {DOCKERFILE} src
!docker push {IMAGE_URI}

Sending build context to Docker daemon  10.88GB
Step 1/9 : FROM gcr.io/merlin-on-gcp/dongm-merlin-inference-hugectr:v0.6.1
 ---> fb6f7db2d7fd
Step 2/9 : EXPOSE 8000
 ---> Using cache
 ---> 6483e4a811d5
Step 3/9 : EXPOSE 8001
 ---> Using cache
 ---> 36f81f5b7f47
Step 4/9 : EXPOSE 8002
 ---> Using cache
 ---> 541852b52454
Step 5/9 : RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg  add - && apt-get update -y && apt-get install google-cloud-sdk -y
 ---> Using cache
 ---> 0935960c0e99
Step 6/9 : WORKDIR /src
 ---> Using cache
 ---> f2a6be6fafb9
Step 7/9 : COPY serving/entrypoint.sh ./
 ---> Using cache
 ---> 0733b8a1e25d
Step 8/9 : RUN chmod +x entrypoint.sh
 ---> Using cache
 ---> 6b36dbe0f14e
Step 9/9 : ENTRYPOINT ["./entrypoint.sh"]
 ---> Using cache

### Configure a custom prediction job

In [11]:
VERSION = 3
model_display_name = f"{IMAGE_NAME}-deepfm-v{VERSION}"
model_description = "Serving with Triton inference server using a custom container"

health_route = "/v2/health/ready"
predict_route = "/v2/models/deepfm_ens/infer"
serving_container_ports = [8000]
in_container_model_repository = '/models' # this should match the paths in ps.json and config.pbtxt in the ensemble
serving_container_args = [in_container_model_repository]

model_ensemble_location = f"{BUCKET_NAME}/models/"
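One detail worth checking in the configuration above: `BUCKET_NAME` is defined with a trailing slash, so `f"{BUCKET_NAME}/models/"` evaluates to `gs://workshop-datasets/merlin//models/`. Cloud Storage treats object names literally, so a double slash names a different prefix than a single one. The notebook does not state which prefix the artifacts were actually copied to, so the sketch below only shows one way to build the path without doubling slashes. The helper `gcs_join` is hypothetical.

```python
def gcs_join(base, *parts, trailing_slash=False):
    """Join GCS path segments with exactly one slash between them."""
    pieces = [base.rstrip("/")]
    pieces += [part.strip("/") for part in parts if part.strip("/")]
    path = "/".join(pieces)
    return path + "/" if trailing_slash else path


BUCKET_NAME = "gs://workshop-datasets/merlin/"

naive = f"{BUCKET_NAME}/models/"
joined = gcs_join(BUCKET_NAME, "models", trailing_slash=True)

print(naive)   # the f-string keeps both slashes
print(joined)  # the helper collapses them into one
```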

### Create the model

In [12]:
model = aiplatform.Model.upload(
    display_name=model_display_name,
    description=model_description,
    serving_container_image_uri=IMAGE_URI,
    serving_container_predict_route=predict_route,
    serving_container_health_route=health_route,
    serving_container_ports=serving_container_ports,
    artifact_uri=model_ensemble_location,
    serving_container_args=serving_container_args,
)

model.wait()

print(model.display_name)
print(model.resource_name)

INFO:google.cloud.aiplatform.models:Creating Model
INFO:google.cloud.aiplatform.models:Create Model backing LRO: projects/659831510405/locations/us-central1/models/7127513758313742336/operations/7054833428377632768
INFO:google.cloud.aiplatform.models:Model created. Resource name: projects/659831510405/locations/us-central1/models/7127513758313742336
INFO:google.cloud.aiplatform.models:To use this Model in another session:
INFO:google.cloud.aiplatform.models:model = aiplatform.Model('projects/659831510405/locations/us-central1/models/7127513758313742336')
triton_deploy-hugectr-deepfm-v3
projects/659831510405/locations/us-central1/models/7127513758313742336


### Create the endpoint

In [13]:
endpoint_display_name = f"{IMAGE_NAME}-endpoint-v{VERSION}"
endpoint = aiplatform.Endpoint.create(display_name=endpoint_display_name)

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/659831510405/locations/us-central1/endpoints/5534369788177940480/operations/8858525079139516416
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/659831510405/locations/us-central1/endpoints/5534369788177940480
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/659831510405/locations/us-central1/endpoints/5534369788177940480')


### Set deployment configuration

In [16]:
traffic_percentage = 100
machine_type = "n1-standard-8"
accelerator_type="NVIDIA_TESLA_T4"
accelerator_count = 1

deployed_model_display_name = model_display_name
min_replica_count = 1
max_replica_count = 3
sync = True

### Deploying the model

In [None]:
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=deployed_model_display_name,
    machine_type=machine_type,
    min_replica_count=min_replica_count,
    max_replica_count=max_replica_count,
    traffic_percentage=traffic_percentage,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    sync=sync,
)

INFO:google.cloud.aiplatform.models:Deploying model to Endpoint : projects/659831510405/locations/us-central1/endpoints/5534369788177940480
INFO:google.cloud.aiplatform.models:Deploy Endpoint model backing LRO: projects/659831510405/locations/us-central1/endpoints/5534369788177940480/operations/1685522325761425408


## Getting inference

### Reading CSV data file, getting the input data, and generating the inference request

In [11]:
from src.serving import inference

data = pd.read_csv('gs://diman-criteo/data/criteo.csv', encoding='utf-8', index_col=[0])

# Whether to build the request body in binary (True) or JSON (False) format
binary_data = False

# Drop the label column and replace missing values with 0
data = data[[x for x in data.columns if x != "label"]].fillna(0)

# Convert the data into Triton's InferInput object format.
# The format matches the KFServing V2 protocol.
inputs = inference.get_inference_input(data, binary_data)

# Create the request body and write it to a local file for the curl commands.
# The `with` blocks close the files, so no explicit close() is needed.
if binary_data:
    request_body, json_size = inference.get_inference_request(inputs, '1')
    with open('criteo.dat', 'wb') as output_file:
        output_file.write(request_body)
else:
    infer_request, request_body, json_size = inference.get_inference_request(inputs, '1')
    json_obj = json.loads(request_body)
    with open('criteo.json', 'w') as output_file:
        json.dump(json_obj, output_file)
    

     I1   I2   I3    I4   I5   I6  I7   I8  I9  I10  ...  C17         C18  \
0   0.0  164  0.0  58.0  2.0  0.0   0   -1   0  0.0  ...  0.0 -1206449222   
1   1.0   13  0.0   7.0  1.0  0.0   0   24   0  0.0  ...  0.0 -1206449222   
2  15.0   31  2.0   0.0  0.0  0.0   0  131   4  0.0  ...  0.0 -1206449222   

          C19         C20        C21        C22  C23         C24         C25  \
0 -1793932789 -1947780979 -839651461   -8271857  0.0   282542550   -10139646   
1  1678260921   615989787  574842146 -268177989  0.0  -174708188 -1726799382   
2 -1793932789 -1400958208  -84150787  553047527  0.0 -1290703736   809724924   

          C26  
0 -1218975401  
1  -317696227  
2  -317696227  

[3 rows x 39 columns]
b'{"id": "1", "inputs": [{"name": "I1", "shape": [3, 1], "datatype": "INT32", "parameters": {"binary_data_size": 12}}, {"name": "I2", "shape": [3, 1], "datatype": "INT32", "parameters": {"binary_data_size": 12}}, {"name": "I3", "shape": [3, 1], "datatype": "INT32", "parameters": {"b
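To make the shape of the JSON request body concrete, here is a minimal standard-library sketch that builds a V2 inference request from a dictionary of columns. It is not the implementation of `inference.get_inference_input` or `inference.get_inference_request`, whose internals are not shown in this notebook. The function name `build_v2_request` and the fixed `INT32` datatype are assumptions; the sample values are taken from the first two columns of the printed rows.

```python
import json


def build_v2_request(columns, request_id="1", datatype="INT32"):
    """Build a V2 inference request body with one input tensor per column."""
    inputs = []
    for name, values in columns.items():
        inputs.append({
            "name": name,
            "shape": [len(values), 1],  # one row per sample, one feature each
            "datatype": datatype,
            "data": list(values),       # flattened tensor contents
        })
    return json.dumps({"id": request_id, "inputs": inputs})


sample_columns = {"I1": [0, 1, 15], "I2": [164, 13, 31]}
body = build_v2_request(sample_columns)
print(body)
```

A real request for this model would carry all 39 feature columns, each with the datatype expected by the ensemble's configuration.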

### Getting inference for a JSON input using curl

In [42]:

!curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json"  \
   https://us-central1-aiplatform.googleapis.com/v1/projects/merlin-on-gcp/locations/us-central1/endpoints/5806274615680434176:rawPredict \
  -d @criteo.json

{"id":"1","model_name":"deepfm_ens","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"OUTPUT0","datatype":"FP32","shape":[3],"data":[0.0293459240347147,0.03906842693686485,0.032647937536239627]}]}
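The response follows the V2 protocol: each entry in `outputs` carries a name, a datatype, a shape, and the flattened data. A short sketch of pulling the predictions out of the response shown above, using only the standard library (the helper name `extract_outputs` is hypothetical):

```python
import json

# The response body returned by the curl command above.
response_text = (
    '{"id":"1","model_name":"deepfm_ens","model_version":"1",'
    '"parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},'
    '"outputs":[{"name":"OUTPUT0","datatype":"FP32","shape":[3],'
    '"data":[0.0293459240347147,0.03906842693686485,0.032647937536239627]}]}'
)


def extract_outputs(text):
    """Map each output tensor name to its flattened data list."""
    response = json.loads(text)
    return {output["name"]: output["data"] for output in response["outputs"]}


predictions = extract_outputs(response_text)["OUTPUT0"]
for row, score in enumerate(predictions):
    print(f"row {row}: {score:.4f}")
```

Each value corresponds to one of the three input rows sent in the request.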

In [None]:
%%bash -s  $PROJECT_ID $REGION $endpoint_display_name

PROJECT_ID=$1
REGION=$2
endpoint_display_name=$3

# get endpoint id
echo "REGION = ${REGION}"
echo "ENDPOINT DISPLAY NAME = ${endpoint_display_name}"
endpoint_id=$(gcloud beta ai endpoints list --region ${REGION} --filter "display_name=${endpoint_display_name}" --format "value(ENDPOINT_ID)")
echo "ENDPOINT_ID = ${endpoint_id}"

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json"  \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints/${endpoint_id}:rawPredict \
  -d @criteo.json
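The same request can be prepared from Python. The sketch below only constructs the `rawPredict` URL and a request object; it does not send anything. The helper names are hypothetical, and the access token is a placeholder that would in practice come from `gcloud auth print-access-token` or from the credentials returned by `google.auth`.

```python
import urllib.request


def raw_predict_url(project_id, region, endpoint_id):
    """Build the Vertex AI rawPredict URL for an endpoint."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{region}/endpoints/{endpoint_id}:rawPredict"
    )


def build_raw_predict_request(url, body, access_token):
    """Create (but do not send) an authenticated POST request."""
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


url = raw_predict_url("merlin-on-gcp", "us-central1", "5806274615680434176")
request = build_raw_predict_request(url, b"{}", access_token="PLACEHOLDER_TOKEN")
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request).read()
```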

### Getting inference for a binary input using curl

In [None]:
# This request currently fails with the error shown below

#Infer-Header-Content-Length = json_size

# !curl \
# -X POST https://us-central1-aiplatform.googleapis.com/v1/projects/merlin-on-gcp/locations/us-central1/endpoints/5806274615680434176:rawPredict \
# -k -H "Content-Type: application/octet-stream" \
# -H "Authorization: Bearer $(gcloud auth print-access-token)" \
# -H "Infer-Header-Content-Length: 3812" \
# --data-binary "@criteo.dat"

{"error":"must specify valid 'Infer-Header-Content-Length' in request header and 'binary_data_size' when passing inputs in binary data format"}
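For reference, the layout of a binary request under Triton's binary tensor data extension is a JSON header followed directly by the raw tensor bytes. Each input declares its `binary_data_size`, and the length of the JSON header is passed in a request header. The sketch below builds such a body with the standard library. The function name `build_binary_request` is hypothetical, and only `INT32` inputs are handled.

One thing to verify when debugging the failure above: Triton's documentation names the header `Inference-Header-Content-Length`, while the commented command sends `Infer-Header-Content-Length`. That mismatch may be one cause of the error, and whether Vertex AI forwards this header to the container is not confirmed here.

```python
import json
import struct


def build_binary_request(columns, request_id="1"):
    """Return (body, json_size) for a binary-format V2 request of INT32 inputs."""
    inputs = []
    payload = b""
    for name, values in columns.items():
        # Triton expects little-endian raw bytes; each INT32 takes 4 bytes.
        raw = struct.pack(f"<{len(values)}i", *values)
        inputs.append({
            "name": name,
            "shape": [len(values), 1],
            "datatype": "INT32",
            "parameters": {"binary_data_size": len(raw)},
        })
        payload += raw  # tensors are appended in the same order as the inputs

    header = json.dumps({"id": request_id, "inputs": inputs}).encode("utf-8")
    return header + payload, len(header)


body, json_size = build_binary_request({"I1": [0, 1, 15]})
print("JSON header length:", json_size)
print("total body length:", len(body))
```

For three `INT32` values, `binary_data_size` is 12, which matches the value in the request header printed earlier in this notebook.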