# Using Vertex AI for online serving with NVIDIA Triton

This notebook demonstrates serving an ensemble model (NVTabular preprocessing followed by a HugeCTR recommender) on Triton Inference Server.

The notebook compiles prescriptive guidance for the following tasks:

- Building a custom container derived from the NVIDIA NGC Merlin inference image and the model artifacts
- Creating a Vertex model using the custom container
- Creating a Vertex endpoint and deploying the model to that endpoint
- Getting inference on a sample dataset using the endpoint

## Model serving

[Triton Inference Server](https://github.com/triton-inference-server/server) provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.
Triton can load models from local storage or cloud platforms. As models are retrained with new data, developers can easily make updates without restarting the inference server or disrupting the application.

Triton runs multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, it automatically creates an instance of each model on each GPU to increase utilization without extra coding.

It supports real-time inferencing, batch inferencing to maximize GPU/CPU utilization, and streaming inference with built-in support for audio streaming input. It also supports model ensemble for use cases that require multiple models to perform end-to-end inference, such as conversational AI.

Users can also use shared memory. The inputs and outputs that pass to and from Triton are stored in shared memory, reducing HTTP/gRPC overhead and increasing performance.

The following figure shows the Triton Inference Server high-level architecture. The model repository is a file-system based repository of the models that Triton makes available for inferencing. Inference requests arrive at the server via HTTP/REST, gRPC, or the C API and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model's scheduler optionally batches inference requests and then passes them to the backend corresponding to the model type. The backend performs inferencing on the inputs provided in the batched requests to produce the requested outputs, which are then returned to the client.

<img src="./images/triton-architecture.png" alt="Triton Architecture" />
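As a rough illustration of what an HTTP/REST inference request looks like, the sketch below builds a request body in the v2 inference protocol format that Triton accepts. The helper name `build_v2_request`, the input name `INPUT0`, its datatype, and the values are hypothetical placeholders and are not taken from the model served in this notebook.

```python
import json


def build_v2_request(request_id, inputs):
    """Build a v2 inference protocol request body.

    `inputs` maps an input tensor name to a (datatype, rows) pair,
    where rows is a list of equal-length lists.
    """
    tensors = []
    for name, (datatype, rows) in inputs.items():
        n_cols = len(rows[0]) if rows else 0
        if any(len(row) != n_cols for row in rows):
            raise ValueError(f"ragged rows for input {name!r}")
        tensors.append({
            "name": name,
            "shape": [len(rows), n_cols],
            "datatype": datatype,
            # JSON tensor data is sent here flattened in row-major order.
            "data": [value for row in rows for value in row],
        })
    return {"id": request_id, "inputs": tensors}


# Hypothetical input name and values, for illustration only.
body = build_v2_request("1", {"INPUT0": ("FP32", [[0.1, 0.2], [0.3, 0.4]])})
print(json.dumps(body, indent=2))
```

The real input names, datatypes, and shapes come from the model configuration in the model repository, so a request for a deployed model has to match those rather than the placeholders used here.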

Triton supports a backend C API that allows Triton to be extended with new functionality such as custom pre- and post-processing operations or even a new deep-learning framework.

The models served by Triton can be queried and controlled through a dedicated model management API that is available over HTTP/REST, gRPC, or the C API.

Readiness and liveness health endpoints, together with utilization, throughput, and latency metrics, ease the integration of Triton into deployment frameworks such as Kubernetes.
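A minimal sketch of probing those health endpoints over HTTP follows. It assumes a Triton server reachable at a base URL such as `http://localhost:8000` (an assumption, since this notebook only runs Triton inside Vertex AI); `health_url` and `is_healthy` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def health_url(base_url, kind="ready"):
    """Return the Triton health URL for kind 'live' or 'ready'."""
    if kind not in ("live", "ready"):
        raise ValueError("kind must be 'live' or 'ready'")
    return f"{base_url.rstrip('/')}/v2/health/{kind}"


def is_healthy(base_url, kind="ready", timeout=2.0):
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url, kind), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False


# Only builds and prints the URL; call is_healthy(...) against a running server.
print(health_url("http://localhost:8000"))
```

The readiness path is the same one that is later passed to Vertex AI as the container health route.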

Here we use Triton to serve an ensemble model that combines an NVTabular data preprocessing workflow with a HugeCTR model trained on Criteo data. The model is deployed to Google Cloud's Vertex AI and served through a Vertex endpoint.

## Notebook flow

This notebook assumes that the ensemble model, containing the trained HugeCTR model and the NVTabular preprocessing workflow, has been created using the ... notebook.

As you walk through the notebook you will execute the following steps:
- Configure notebook environment settings such as the GCP project and compute region
- Build a custom Vertex container based on the NVIDIA NGC Merlin inference container
- Configure and upload the model based on the custom container
- Create the endpoint
- Configure the deployment of the model and submit the deployment job
- Send an inference request to the endpoint using sample data

In [1]:
import base64
import json
import os
import random
import sys
import pandas as pd
import numpy as np

import google.auth
from google.cloud import aiplatform
from google.cloud.aiplatform import gapic as aip
from google.cloud.aiplatform import hyperparameter_tuning as hpt
from google.protobuf.json_format import MessageToDict

## Configure notebook settings

In [3]:
PROJECT_ID = 'merlin-on-gcp'
REGION = "us-central1"
BUCKET_NAME = "gs://cloud-ai-platform-61647b5e-05eb-4c08-b632-92067b616f37"

## Deploy the model to Vertex AI

### Initialize Vertex AI SDK

In [4]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

### Build a custom prediction container

In [5]:
IMAGE_NAME = 'triton_deploy-hugectr'
IMAGE_URI = f"gcr.io/{PROJECT_ID}/{IMAGE_NAME}"
DOCKERFILE = 'src/Dockerfile.triton-hugectr'

In [27]:
!gsutil cp -r gs://diman-criteo/models ./src/

Copying gs://diman-criteo/models/deepfm/1/.ipynb_checkpoints/deepfm-checkpoint.json...
Copying gs://diman-criteo/models/deepfm/1/deepfm.json...                        
Copying gs://diman-criteo/models/deepfm/1/deepfm0_opt_sparse_0.model...         
Copying gs://diman-criteo/models/deepfm/1/deepfm0_sparse_0.model/emb_vector...  
==> NOTE: You are downloading one or more large file(s), which would            
run significantly faster if you enabled sliced object downloads. This
feature is enabled by default but requires that compiled crcmod be
installed (see "gsutil help crcmod").

| [4 files][569.7 MiB/569.7 MiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://diman-criteo/models/deepfm/1/deepfm0_sparse_0.model/key...
Copying 

In [39]:
!mkdir -p ./src/models/deepfm_ens/1
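Triton expects every model directory in the repository to contain at least one numeric version subdirectory. This also holds for ensemble models, even though the directory stays empty, which is why `deepfm_ens/1` is created above. The following is a minimal sketch, not part of the original notebook, that lists model directories lacking a version subdirectory; the helper name `models_missing_version_dir` is hypothetical.

```python
from pathlib import Path


def models_missing_version_dir(repo_root):
    """Return names of model directories without a numeric version subdirectory."""
    missing = []
    for model_dir in sorted(Path(repo_root).iterdir()):
        if not model_dir.is_dir() or model_dir.name.startswith("."):
            continue  # skip files such as ps.json and hidden directories
        has_version = any(
            child.is_dir() and child.name.isdigit() for child in model_dir.iterdir()
        )
        if not has_version:
            missing.append(model_dir.name)
    return missing


# Check the local copy of the model repository if it is present.
repo = Path("./src/models")
if repo.is_dir():
    print(models_missing_version_dir(repo))
```

Running a check like this before `docker build` can catch a missing version directory earlier than a failed model load inside the container.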

In [28]:
!docker build -t {IMAGE_URI} -f {DOCKERFILE} src
!docker push {IMAGE_URI}

Sending build context to Docker daemon  1.133GB
Step 1/8 : FROM gcr.io/merlin-on-gcp/dongm-merlin-inference-hugectr:v0.6.1
 ---> fb6f7db2d7fd
Step 2/8 : EXPOSE 8000
 ---> Using cache
 ---> 6483e4a811d5
Step 3/8 : EXPOSE 8001
 ---> Using cache
 ---> 36f81f5b7f47
Step 4/8 : EXPOSE 8002
 ---> Using cache
 ---> 541852b52454
Step 5/8 : WORKDIR /
 ---> Using cache
 ---> b625e263b707
Step 6/8 : RUN mkdir /model
 ---> Using cache
 ---> 959204525114
Step 7/8 : COPY /models/ /model/models/
 ---> Using cache
 ---> b7fa7ce1fce7
Step 8/8 : CMD ["tritonserver", "--model-repository=/model/models/", "--backend-config=hugectr,ps=/model/models/ps.json"]
 ---> Using cache
 ---> 8af0879748fc
Successfully built 8af0879748fc
Successfully tagged gcr.io/merlin-on-gcp/triton_deploy-hugectr:latest
Using default tag: latest
The push refers to repository [gcr.io/merlin-on-gcp/triton_deploy-hugectr]

62eee2b8: Preparing 
6660d413: Preparing 
e58e8598: Preparing 
c9824aed: Preparing 
23fe2ec9: P
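The contents of `src/Dockerfile.triton-hugectr` are not shown in this notebook, but the eight build steps logged above imply a Dockerfile along the following lines. The sketch below reconstructs it from the log as a Python string. Treat it as an approximation: comments, blank lines, or anything else that does not produce a build step would not appear in the log.

```python
# Dockerfile text reconstructed from the "Step N/8" lines of the build log.
# Ports 8000, 8001, and 8002 are Triton's default HTTP, gRPC, and metrics ports.
DOCKERFILE_TEXT = """\
FROM gcr.io/merlin-on-gcp/dongm-merlin-inference-hugectr:v0.6.1
EXPOSE 8000
EXPOSE 8001
EXPOSE 8002
WORKDIR /
RUN mkdir /model
COPY /models/ /model/models/
CMD ["tritonserver", "--model-repository=/model/models/", "--backend-config=hugectr,ps=/model/models/ps.json"]
"""

# Count the instructions to confirm they match the eight logged build steps.
instructions = [line for line in DOCKERFILE_TEXT.splitlines() if line.strip()]
print(len(instructions))
print(DOCKERFILE_TEXT)
```

Because the build context is `src`, the `COPY /models/` instruction picks up the `./src/models` directory downloaded earlier, so the model artifacts are baked into the image.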

### Configure a custom prediction job

In [29]:
VERSION = 1
model_display_name = f"{IMAGE_NAME}-deepfm-v{VERSION}"
model_description = "Serving with Triton inference server using a custom container"

health_route = "/v2/health/ready"
predict_route = "/v2/models/deepfm_ens/infer"
serving_container_ports = [8000]
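The two routes tell Vertex AI which container paths to call: the health route maps to Triton's readiness endpoint, and the predict route maps to the v2 inference endpoint of one specific model. A minimal sketch of deriving both from the model name, so that the routes stay consistent if the ensemble is renamed, could look like this. `triton_routes` is a hypothetical helper and is not part of the Vertex AI SDK or Triton.

```python
def triton_routes(model_name, model_version=None):
    """Return Vertex AI health and predict routes for a Triton model."""
    if not model_name or "/" in model_name:
        raise ValueError("model_name must be a non-empty name without '/'")
    predict = f"/v2/models/{model_name}"
    if model_version is not None:
        # Pin a specific model version instead of letting Triton choose one.
        predict += f"/versions/{model_version}"
    return {
        "health_route": "/v2/health/ready",
        "predict_route": predict + "/infer",
    }


routes = triton_routes("deepfm_ens")
print(routes["health_route"], routes["predict_route"])
```

With the ensemble name used in this notebook, the helper yields the same two route strings that are set by hand in the cell above.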

### Create the model

In [30]:
model = aiplatform.Model.upload(
    display_name=model_display_name,
    description=model_description,
    serving_container_image_uri=IMAGE_URI,
    serving_container_predict_route=predict_route,
    serving_container_health_route=health_route,
    serving_container_ports=serving_container_ports,
)

model.wait()

print(model.display_name)
print(model.resource_name)

INFO:google.cloud.aiplatform.models:Creating Model
INFO:google.cloud.aiplatform.models:Create Model backing LRO: projects/659831510405/locations/us-central1/models/8979605910930325504/operations/8507813305471991808
INFO:google.cloud.aiplatform.models:Model created. Resource name: projects/659831510405/locations/us-central1/models/8979605910930325504
INFO:google.cloud.aiplatform.models:To use this Model in another session:
INFO:google.cloud.aiplatform.models:model = aiplatform.Model('projects/659831510405/locations/us-central1/models/8979605910930325504')
triton_deploy-hugectr-deepfm-v1
projects/659831510405/locations/us-central1/models/8979605910930325504


### Create the endpoint

In [17]:
endpoint_display_name = f"{IMAGE_NAME}-endpoint"
endpoint = aiplatform.Endpoint.create(display_name=endpoint_display_name)

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/659831510405/locations/us-central1/endpoints/5806274615680434176/operations/2118700770047033344
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/659831510405/locations/us-central1/endpoints/5806274615680434176
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/659831510405/locations/us-central1/endpoints/5806274615680434176')


### Set deployment configuration

In [18]:
traffic_percentage = 100
machine_type = "n1-standard-4"
accelerator_type = "NVIDIA_TESLA_T4"
accelerator_count = 1

deployed_model_display_name = model_display_name
min_replica_count = 1
max_replica_count = 3
sync = True

### Deploy the model

In [31]:
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=deployed_model_display_name,
    machine_type=machine_type,
    min_replica_count=min_replica_count,
    max_replica_count=max_replica_count,
    traffic_percentage=traffic_percentage,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    sync=sync,
)

INFO:google.cloud.aiplatform.models:Deploying model to Endpoint : projects/659831510405/locations/us-central1/endpoints/5806274615680434176
INFO:google.cloud.aiplatform.models:Deploy Endpoint model backing LRO: projects/659831510405/locations/us-central1/endpoints/5806274615680434176/operations/3871920439047487488
INFO:google.cloud.aiplatform.models:Endpoint model deployed. Resource name: projects/659831510405/locations/us-central1/endpoints/5806274615680434176


<google.cloud.aiplatform.models.Endpoint object at 0x7f98656d7550> 
resource name: projects/659831510405/locations/us-central1/endpoints/5806274615680434176

### Getting inference

### Reading CSV data file, getting the input data, and generating the inference request

In [4]:
from src.inference import infer_input

data = pd.read_csv('gs://diman-criteo/data/criteo.csv', encoding='utf-8')
#data = pd.read_csv('/data/criteo.csv', encoding='utf-8', index_col=[0])

binary_data = True

data = data[[x for x in data.columns if x != "label"]].fillna(0)

inputs = infer_input.get_inference_input(data, binary_data)

if binary_data:
    request_body, json_size = infer_input.get_inference_request(inputs, '1')
else:
    infer_request, request_body, json_size = infer_input.get_inference_request(inputs, '1')
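The `curl` call in the next cell reads its payload from `criteo.json`, but the notebook does not show how that file is produced. A minimal sketch of that step follows, assuming `request_body` is either `bytes` or a JSON `str`. The return type of `infer_input.get_inference_request` is not shown here, so this is an assumption, and `save_request_body` is a hypothetical helper.

```python
from pathlib import Path


def save_request_body(request_body, path="criteo.json"):
    """Write a request payload to disk and return the number of bytes written."""
    if isinstance(request_body, str):
        request_body = request_body.encode("utf-8")
    if not isinstance(request_body, (bytes, bytearray)):
        raise TypeError("request_body must be str, bytes, or bytearray")
    return Path(path).write_bytes(bytes(request_body))


# Usage, after the cell above has produced request_body (assumed str or bytes):
# save_request_body(request_body, "criteo.json")
```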
    

In [1]:
!curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json"  \
  https://us-central1-aiplatform.googleapis.com/v1/projects/merlin-on-gcp/locations/us-central1/endpoints/5806274615680434176:rawPredict \
  -d @criteo.json

{"id":"1","model_name":"deepfm_ens","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"OUTPUT0","datatype":"FP32","shape":[3],"data":[0.06609038263559342,0.07316402345895767,0.08091689646244049]}]}
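The same call can be made from Python. The sketch below builds the `rawPredict` URL used in the `curl` command, posts a request body with a bearer token, and extracts the prediction scores from a v2 response. The helper names are hypothetical, and obtaining the access token (for example from `gcloud auth print-access-token`) is left to the caller. The sample response is copied from the output above.

```python
import json
import urllib.request


def raw_predict_url(project, region, endpoint_id):
    """Build the Vertex AI rawPredict URL for an endpoint."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project}"
        f"/locations/{region}/endpoints/{endpoint_id}:rawPredict"
    )


def raw_predict(url, body, access_token, timeout=60):
    """POST a request body (bytes) to the endpoint and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


def extract_output(response, name="OUTPUT0"):
    """Return the data list of the named output tensor from a v2 response."""
    for output in response.get("outputs", []):
        if output["name"] == name:
            return output["data"]
    raise KeyError(name)


url = raw_predict_url("merlin-on-gcp", "us-central1", "5806274615680434176")

# Response copied verbatim from the curl output above.
sample = json.loads(
    '{"id":"1","model_name":"deepfm_ens","model_version":"1",'
    '"parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},'
    '"outputs":[{"name":"OUTPUT0","datatype":"FP32","shape":[3],'
    '"data":[0.06609038263559342,0.07316402345895767,0.08091689646244049]}]}'
)
scores = extract_output(sample)
print(url)
print(scores)
```

Each value in `OUTPUT0` corresponds to one input row, so the three scores here are the model outputs for the three rows sent in the request.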