# Using Vertex AI for online serving with NVIDIA Triton

This notebook demonstrates serving an ensemble model (NVTabular preprocessing + HugeCTR recommender) on Triton Inference Server.

The notebook compiles prescriptive guidance for the following tasks:

- Building a custom container derived from the NVIDIA NGC Merlin inference image and the model artifacts
- Creating a Vertex model using the custom container
- Creating a Vertex endpoint and deploying the model to that endpoint
- Getting inference on a sample dataset using the endpoint

## Model serving

[Triton Inference Server](https://github.com/triton-inference-server/server) provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.
Triton can load models from local storage or cloud platforms. As models are retrained with new data, developers can easily make updates without restarting the inference server or disrupting the application.

Triton runs multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, it automatically creates an instance of each model on each GPU to increase utilization without extra coding.

It supports real-time inferencing, batch inferencing to maximize GPU/CPU utilization, and streaming inference with built-in support for audio streaming input. It also supports model ensemble for use cases that require multiple models to perform end-to-end inference, such as conversational AI.

Users can also use shared memory. Inputs and outputs passed to and from Triton are stored in shared memory, reducing HTTP/gRPC overhead and increasing performance.

<img src="./images/triton-architecture.png" alt="Triton Architecture" />

## Notebook flow

This notebook assumes that the ensemble model, containing the trained HugeCTR model and the NVTabular preprocessing workflow, was created using the ... notebook.

As you walk through the notebook, you will execute the following steps:
- Configure notebook environment settings, such as the GCP project and compute region.
- Build a custom Vertex container based on the NVIDIA NGC Merlin inference container.
- Configure and submit the model based on the custom container.
- Create the endpoint.
- Configure the deployment of the model and submit the deployment job.

In [2]:
import base64
import json
import os
import random
import sys

import google.auth
from google.cloud import aiplatform
from google.cloud.aiplatform import gapic as aip
from google.cloud.aiplatform import hyperparameter_tuning as hpt
from google.protobuf.json_format import MessageToDict

## Configure notebook settings

In [3]:
PROJECT_ID = 'merlin-on-gcp'
REGION = "us-central1"

BUCKET_NAME = "gs://cloud-ai-platform-61647b5e-05eb-4c08-b632-92067b616f37"

## Deploy the model to Vertex AI Prediction

### Initialize Vertex AI SDK

In [4]:
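# Pass the region explicitly so the SDK does not silently fall back to its default location.
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)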

### Build a custom prediction container

In [5]:
IMAGE_NAME = 'triton_deploy-hugectr'
IMAGE_URI = f"gcr.io/{PROJECT_ID}/{IMAGE_NAME}"
DOCKERFILE = 'src/Dockerfile.triton-hugectr'

In [6]:
!docker build -t {IMAGE_URI} -f {DOCKERFILE} src
!docker push {IMAGE_URI}

Sending build context to Docker daemon  420.9kB
Step 1/4 : FROM nvcr.io/nvidia/merlin/merlin-training:21.09
21.09: Pulling from nvidia/merlin/merlin-training

### Configure a custom prediction job

In [None]:
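# Assumption: APP_NAME is not defined earlier in this notebook, so a placeholder
# prefix is set here; replace it with your own application name.
APP_NAME = "triton-hugectr"
VERSION = 1
model_display_name = f"{APP_NAME}-movielens-v{VERSION}"
model_description = "Serving with Triton inference server using a custom container"

# Triton HTTP endpoints: readiness check and inference for the "deepfm" model.
health_route = "/v2/health/ready"
predict_route = "/v2/models/deepfm/infer"
serving_container_ports = [8000]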
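
The two routes above follow Triton's HTTP/REST URL layout: one server readiness path and one inference path per model. As a minimal sketch, they could be derived from the model name instead of being typed by hand. The helper `triton_routes` is hypothetical (it is not part of the Vertex AI SDK or Triton) and only builds strings.

```python
def triton_routes(model_name, version=None):
    """Return (health_route, predict_route) for a Triton model.

    Hypothetical helper for illustration; it does not contact any server.
    """
    if not model_name or "/" in model_name:
        raise ValueError(f"invalid model name: {model_name!r}")
    health = "/v2/health/ready"
    predict = f"/v2/models/{model_name}"
    # An explicit model version is optional in the Triton URL layout.
    if version is not None:
        predict += f"/versions/{version}"
    return health, predict + "/infer"


health_route, predict_route = triton_routes("deepfm")
print(health_route, predict_route)
```

Deriving both values from one model name keeps the predict route in sync with the model name used in the ensemble.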

### Create the model

In [None]:
model = aiplatform.Model.upload(
    display_name=model_display_name,
    description=model_description,
    serving_container_image_uri=IMAGE_URI,
    serving_container_predict_route=predict_route,
    serving_container_health_route=health_route,
    serving_container_ports=serving_container_ports,
)

model.wait()

print(model.display_name)
print(model.resource_name)

### Create the endpoint

In [None]:
endpoint_display_name = f"{APP_NAME}-endpoint"
endpoint = aiplatform.Endpoint.create(display_name=endpoint_display_name)

### Set deployment configuration

In [None]:
traffic_percentage = 100
machine_type = "n1-standard-4"
accelerator_type = "NVIDIA_TESLA_T4"
accelerator_count = 1

deployed_model_display_name = model_display_name
min_replica_count = 1
max_replica_count = 3
sync = True
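
Because a deployment can take many minutes before it fails on a bad setting, it may help to check the values locally first. The sketch below collects the settings from this cell into keyword arguments that match the parameter names used in the `model.deploy` call. The helper `build_deploy_kwargs` and its validation rules are assumptions made for illustration, not checks taken from the Vertex AI SDK.

```python
def build_deploy_kwargs(machine_type, min_replica_count, max_replica_count,
                        accelerator_type=None, accelerator_count=0,
                        traffic_percentage=100):
    """Validate deployment settings and return them as a dict.

    Hypothetical helper; the rules below are illustrative assumptions.
    """
    if min_replica_count < 1 or max_replica_count < min_replica_count:
        raise ValueError("need 1 <= min_replica_count <= max_replica_count")
    if not 0 <= traffic_percentage <= 100:
        raise ValueError("traffic_percentage must be between 0 and 100")
    # An accelerator type and a positive count only make sense together.
    if bool(accelerator_type) != (accelerator_count > 0):
        raise ValueError("set accelerator_type and accelerator_count together")
    kwargs = {
        "machine_type": machine_type,
        "min_replica_count": min_replica_count,
        "max_replica_count": max_replica_count,
        "traffic_percentage": traffic_percentage,
    }
    if accelerator_type:
        kwargs["accelerator_type"] = accelerator_type
        kwargs["accelerator_count"] = accelerator_count
    return kwargs


deploy_kwargs = build_deploy_kwargs(
    "n1-standard-4", 1, 3,
    accelerator_type="NVIDIA_TESLA_T4", accelerator_count=1,
)
print(deploy_kwargs)
```

The resulting dictionary could then be unpacked into the deployment call, for example `model.deploy(endpoint=endpoint, **deploy_kwargs)`.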

### Deploying the model

In [None]:
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=deployed_model_display_name,
    machine_type=machine_type,
    min_replica_count=min_replica_count,
    max_replica_count=max_replica_count,
    traffic_percentage=traffic_percentage,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    sync=sync,
)

### Getting inference

In [None]:
!curl \
-X POST https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/5851653659582005248:rawPredict \
-k -H "Content-Type: application/octet-stream" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Inference-Header-Content-Length: 3710" \
--data-binary "@criteo.dat"
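
The request above sends a pre-built binary file together with a header that states a length of 3710. In Triton's binary tensor data extension, the `Inference-Header-Content-Length` header gives the size in bytes of the JSON part at the start of the body, and the raw tensor bytes follow it. The sketch below shows how such a body could be assembled. It assumes `criteo.dat` follows this layout. The helper `build_binary_infer_body`, the tensor name `DENSE_EXAMPLE`, and the tensor's shape and values are placeholders, not the actual inputs of the `deepfm` model.

```python
import json
import struct


def build_binary_infer_body(inputs):
    """Build a request body with a JSON header followed by raw tensor bytes.

    `inputs` is a list of (name, datatype, shape, raw_bytes) tuples.
    Returns (body, json_header_length). Hypothetical helper for illustration.
    """
    header = {"inputs": []}
    payload = b""
    for name, datatype, shape, raw in inputs:
        header["inputs"].append({
            "name": name,
            "datatype": datatype,
            "shape": list(shape),
            # Tells the server how many raw bytes belong to this tensor.
            "parameters": {"binary_data_size": len(raw)},
        })
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    return header_bytes + payload, len(header_bytes)


# Placeholder tensor: four little-endian float32 values (16 bytes).
dense = struct.pack("<4f", 0.1, 0.2, 0.3, 0.4)
body, header_len = build_binary_infer_body(
    [("DENSE_EXAMPLE", "FP32", (1, 4), dense)]
)
headers = {
    "Content-Type": "application/octet-stream",
    "Inference-Header-Content-Length": str(header_len),
}
print(header_len, len(body))
```

Computing the header length from the serialized JSON avoids hard-coding a value that becomes wrong as soon as the request metadata changes.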