# Serving Recommender Models for Online Prediction using NVIDIA Triton and Vertex AI

This notebook demonstrates serving a HugeCTR model with Triton Inference Server on Vertex AI Prediction.
The notebook compiles prescriptive guidance for the following tasks:

1. Exporting the Triton ensemble model consisting of an NVTabular preprocessing workflow and a HugeCTR model.
2. Uploading the model and its metadata to Vertex Models.
3. Building a custom container derived from NVIDIA NGC Merlin inference image.
4. Deploying the model to Vertex AI Prediction.
5. Getting inference for sample data points using the endpoint.

## Triton Inference Server Overview

[Triton Inference Server](https://github.com/triton-inference-server/server) provides an inferencing solution optimized for both CPUs and GPUs. Triton can run multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, it automatically creates an instance of each model on each GPU to increase utilization without extra coding. It supports real-time inferencing, batch inferencing to maximize GPU/CPU utilization, and streaming inference with built-in support for audio streaming input. It also supports model ensembles for use cases that require multiple models to perform end-to-end inference.

At a high level, the Triton Inference Server architecture works as follows:
- The model repository is a file-system based repository of the models that Triton will make available for inferencing.
- Inference requests arrive at the server via either HTTP/REST or gRPC and are then routed to the appropriate per-model scheduler.
- Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis.
- The backend performs inferencing using the inputs provided in the batched requests to produce the requested outputs.
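
To make the model repository from the list above more concrete, the following is a minimal sketch of its on-disk layout: one directory per model, each holding a `config.pbtxt` and one or more numeric version subdirectories. The model names (`deepfm`, `deepfm_nvt`, `deepfm_ens`) and the `list_models` helper are hypothetical and used only for illustration; the ensemble exported later in this notebook may use different names.

```python
import tempfile
from pathlib import Path


def list_models(repository: Path) -> dict:
    """Map each model directory to its sorted numeric version subdirectories."""
    models = {}
    for model_dir in sorted(p for p in repository.iterdir() if p.is_dir()):
        versions = sorted(
            int(v.name)
            for v in model_dir.iterdir()
            if v.is_dir() and v.name.isdigit()
        )
        models[model_dir.name] = versions
    return models


# Build a throwaway repository with hypothetical model names.
repo = Path(tempfile.mkdtemp())
for name in ("deepfm", "deepfm_nvt", "deepfm_ens"):
    (repo / name / "1").mkdir(parents=True)           # version directory
    (repo / name / "config.pbtxt").write_text(f'name: "{name}"\n')

layout = list_models(repo)
print(layout)
```

Triton scans a layout like this at startup and loads the versions selected by each model's version policy.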

Triton server provides readiness and liveness health endpoints, as well as utilization, throughput, and latency metrics, which enable the integration of Triton into deployment environments such as Vertex AI Prediction.
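
As a rough illustration of these endpoints, the sketch below builds the liveness, readiness, and metrics URLs for a Triton server and probes one of them. It assumes Triton's default ports (8000 for HTTP, 8002 for metrics) and a server reachable at `localhost`; the helper names `triton_urls` and `is_healthy` are hypothetical.

```python
import urllib.error
import urllib.request


def triton_urls(host: str, http_port: int = 8000, metrics_port: int = 8002) -> dict:
    """Return the health and metrics URLs, assuming Triton's default ports."""
    return {
        "live": f"http://{host}:{http_port}/v2/health/live",
        "ready": f"http://{host}:{http_port}/v2/health/ready",
        "metrics": f"http://{host}:{metrics_port}/metrics",
    }


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


urls = triton_urls("localhost")
print(urls["ready"])
# With a server running locally, is_healthy(urls["ready"]) reports readiness.
```

The `/v2/health/ready` path is the same route that is later passed to Vertex AI as the container health route.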

In this example, we use Triton to serve an ensemble model that contains a data preprocessing workflow and a HugeCTR model trained on Criteo data. The model is deployed to Vertex AI Prediction, as shown in the following figure:

<img src="./images/triton-vertex.png" alt="Triton Architecture" style="width:70%"/>

## Setup

In this section of the notebook you configure your environment settings, including a GCP project, a GCP compute region, and a GCS bucket.
You also set the locations of the saved NVTabular workflow, created in [01-dataset-preprocessing.ipynb](01-dataset-preprocessing.ipynb), and the trained HugeCTR model, created in [02-model-training-hugectr.ipynb](02-model-training-hugectr.ipynb).

Make sure to update the cells below with the values reflecting your environment.

In [1]:
import json
import os
from pathlib import Path
import time

from src.serving import export
from src import feature_utils
from src.input_config import InputOutputConfig

from google.cloud import aiplatform as vertex_ai

In [2]:
PROJECT_ID = 'merlin-on-gcp' # Change to your project.
REGION = 'us-central1'  # Change to your region.
STAGING_BUCKET = 'jk-merlin-staging' # Change to your bucket.
MODEL_REPOSITORY_BUCKET = 'merlin_model_repository'
LOCAL_WORKSPACE = '/tmp'

MODEL_NAME = 'deepfm'
MODEL_VERSION = 'v01'
MODEL_DISPLAY_NAME = f'{MODEL_NAME}-{MODEL_VERSION}'
EXPORTED_MODELS_DIR = f'gs://{MODEL_REPOSITORY_BUCKET}/hugectr_models'

IMAGE_NAME = 'triton-deploy-hugectr'
IMAGE_URI = f"gcr.io/{PROJECT_ID}/{IMAGE_NAME}"
DOCKERFILE = 'src/Dockerfile.triton'

WORKFLOW_MODEL_DIR = "gs://criteo-datasets/criteo_processed_parquet_0.6/workflow" # Change to GCS path of the nvt workflow.
HUGECTR_MODEL_DIR = "gs://merlin-models/hugectr_deepfm_21.09" # Change to GCS path of the hugectr trained model.

### Initialize Vertex AI SDK

In [3]:
vertex_ai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=STAGING_BUCKET
)

## 1. Exporting the Triton ensemble model

The Triton ensemble model consists of the NVTabular preprocessing workflow and the HugeCTR model. Both artifacts are first copied from GCS to the local workspace.

In [None]:
!gsutil cp -r {WORKFLOW_MODEL_DIR} {LOCAL_WORKSPACE}

In [None]:
!gsutil cp -r {HUGECTR_MODEL_DIR} {LOCAL_WORKSPACE}
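
Before exporting, it can help to confirm that the copied directories contain the files the export step will read. The following is a minimal sketch of such a check. The file names `metadata.json` and `workflow.pkl` are assumptions about what a saved NVTabular workflow contains, and `missing_files` is a hypothetical helper; the demonstration runs against a temporary directory rather than the real workspace.

```python
import tempfile
from pathlib import Path


def missing_files(directory, required):
    """Return the names in `required` that are not regular files in `directory`."""
    root = Path(directory)
    return [name for name in required if not (root / name).is_file()]


# Demonstration on a temporary stand-in for LOCAL_WORKSPACE.
workspace = Path(tempfile.mkdtemp())
workflow_dir = workspace / "workflow"
workflow_dir.mkdir()
(workflow_dir / "metadata.json").write_text("{}")

# Assumed file names; adjust to what your saved workflow actually contains.
missing = missing_files(workflow_dir, ["metadata.json", "workflow.pkl"])
print(missing)
```

In the notebook, the same check could be pointed at the workflow directory under `LOCAL_WORKSPACE`, failing early if the list is not empty.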

### Exporting the ensemble model

In [4]:
model_config = InputOutputConfig()
# The artifacts were copied into LOCAL_WORKSPACE above, so resolve them there.
local_workflow_path = os.path.join(LOCAL_WORKSPACE, Path(WORKFLOW_MODEL_DIR).parts[-1])
local_saved_model_path = os.path.join(LOCAL_WORKSPACE, Path(HUGECTR_MODEL_DIR).parts[-1])
local_exported_ensemble_path = f'triton-ensemble-{time.strftime("%Y%m%d%H%M%S")}'

export.export_ensemble(
    workflow_path=local_workflow_path,
    saved_model_path=local_saved_model_path,
    output_path=local_exported_ensemble_path,
    categorical_columns=model_config.categorical_columns,
    continuous_columns=model_config.continuous_columns,
    label_columns=model_config.label_columns)

## 2. Uploading the model and its metadata to Vertex Models.

In [None]:
!gsutil cp -r {local_exported_ensemble_path} {EXPORTED_MODELS_DIR}

In [None]:
health_route = "/v2/health/ready"
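# Assumes the exported ensemble is named f"{MODEL_NAME}_ens"; adjust if your export uses a different name.
predict_route = f"/v2/models/{MODEL_NAME}_ens/infer"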
serving_container_ports = [8000]
in_container_model_repository = '/models'
serving_container_args = [in_container_model_repository]

model_ensemble_location = os.path.join(EXPORTED_MODELS_DIR, local_exported_ensemble_path)

In [None]:
model = vertex_ai.Model.upload(
    display_name=MODEL_DISPLAY_NAME,
    # Example description; change as needed.
    description=f'Triton ensemble for the HugeCTR {MODEL_NAME} model',
    serving_container_image_uri=IMAGE_URI,
    serving_container_predict_route=predict_route,
    serving_container_health_route=health_route,
    serving_container_ports=serving_container_ports,
    artifact_uri=model_ensemble_location,
    serving_container_args=serving_container_args,
    sync=True
)

model.resource_name

## 3. Building a custom container derived from NVIDIA NGC Merlin inference image.

In [None]:
! gcloud builds submit --timeout "2h" --tag {IMAGE_URI} {DOCKERFILE} --machine-type=e2-highcpu-8

## 4. Deploying the model to Vertex AI Prediction.

### Create the Vertex Endpoint

In [None]:
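ENDPOINT_DISPLAY_NAME = f'{MODEL_DISPLAY_NAME}-endpoint'  # Example value; change to your endpoint display name.

# Keep a handle to the endpoint; it is passed to model.deploy() below.
endpoint = vertex_ai.Endpoint.create(
    display_name=ENDPOINT_DISPLAY_NAME
)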

### Deploy the model to Vertex Endpoint

In [None]:
traffic_percentage = 100
machine_type = "n1-standard-8"
accelerator_type = "NVIDIA_TESLA_T4"
accelerator_count = 1
min_replica_count = 1
max_replica_count = 3

In [None]:
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=MODEL_DISPLAY_NAME,
    machine_type=machine_type,
    min_replica_count=min_replica_count,
    max_replica_count=max_replica_count,
    traffic_percentage=traffic_percentage,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    sync=True,
)

## 5. Getting inference for sample data points using the endpoint.

In [None]:
from src.serving import inference
from src import feature_utils

is_binary = False

data = {
    'I1': [5, 32, 0], 
    'I2': [110, 3, 233], 
    'I3': [0, 5, 1], 
    'I4': [16, 0, 146], 
    'I5': [0, 1, 1], 
    'I6': [1, 0, 0], 
    'I7': [0, 0, 0], 
    'I8': [14, 61, 99], 
    'I9': [7, 5, 7], 
    'I10': [1, 0, 0], 
    'I11': [0, 1, 1], 
    'I12': [306, 3157, 3101], 
    'I13': [0, 5, 1], 
    'C1': [1651969401, -436994675, 1651969401], 
    'C2': [-501260968, -1599406170, -1382530557], 
    'C3': [-1343601617, 1873417685, 1656669709], 
    'C4': [-1805877297, -628476895, 946620910], 
    'C5': [951068488, 1020698403, -413858227], 
    'C6': [1875733963, 1875733963, 1875733963], 
    'C7': [897624609, -1424560767, -1242174622], 
    'C8': [679512323, 1128426537, -772617077], 
    'C9': [1189011366, 502653268, 776897055], 
    'C10': [771915201, 2112471209, 771915201], 
    'C11': [209470001, 1716706404, 209470001], 
    'C12': [-1785193185, -1712632281, 309420420], 
    'C13': [12976055, 12976055, 12976055], 
    'C14': [-1102125769, -1102125769, -1102125769], 
    'C15': [-1978960692, -205783399, -150008565], 
    'C16': [1289502458, 1289502458, 1289502458], 
    'C17': [-771205462, -771205462, -771205462], 
    'C18': [-1206449222, -1578429167, 1653545869], 
    'C19': [-1793932789, -1793932789, -1793932789], 
    'C20': [-1014091992, -20981661, -1014091992], 
    'C21': [351689309, -1556988767, 351689309], 
    'C22': [632402057, -924717482, 632402057], 
    'C23': [-675152885, 391309800, -675152885], 
    'C24': [2091868316, 1966410890, 883538181], 
    'C25': [809724924, -1726799382, -10139646], 
    'C26': [-317696227, -1218975401, -317696227]
}

inputs = inference.get_inference_input(data, is_binary)

# Creating the request body to be sent in the inference request
if is_binary:
    request_body, json_size = inference.get_inference_request(inputs, '1')
    with open('criteo.dat', 'wb') as output_file:
        output_file.write(request_body)
else:
    infer_request, request_body, json_size = inference.get_inference_request(inputs, '1')
    json_obj = json.loads(request_body)
    with open('criteo.json', 'w') as output_file:
        json.dump(json_obj, output_file)
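
For readers who want to see what the JSON request body roughly looks like, here is a minimal sketch of a request in the KServe v2 inference protocol that Triton's HTTP endpoint accepts. It is not the implementation of `inference.get_inference_request`, which lives in `src` and may differ. The helper `build_infer_request` is hypothetical, and the `INT32` datatype and the `[batch, 1]` shape are assumptions about the ensemble's input signature.

```python
import json


def build_infer_request(columns: dict, request_id: str = "1", datatype: str = "INT32") -> dict:
    """Build a v2-protocol request with one input tensor per column."""
    inputs = []
    for name, values in columns.items():
        inputs.append({
            "name": name,
            "shape": [len(values), 1],   # assumed: one value per row
            "datatype": datatype,        # assumed: all columns share one type
            "data": list(values),
        })
    return {"id": request_id, "inputs": inputs}


# Two of the columns from the sample above, three rows each.
sample = {"I1": [5, 32, 0], "C1": [1651969401, -436994675, 1651969401]}
body = build_infer_request(sample)
payload = json.dumps(body)
print(payload)
```

Inspecting `criteo.json` after running the cell above shows the exact names, shapes, and datatypes the real helper produces.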

### Getting inference for a JSON input using the curl command

In [None]:
%%bash -s  $PROJECT_ID $REGION $ENDPOINT_DISPLAY_NAME

PROJECT_ID=$1
REGION=$2
ENDPOINT_DISPLAY_NAME=$3

# get endpoint id
echo "REGION = ${REGION}"
echo "ENDPOINT DISPLAY NAME = ${ENDPOINT_DISPLAY_NAME}"
ENDPOINT_ID=$(gcloud beta ai endpoints list --region ${REGION} --filter "display_name=${ENDPOINT_DISPLAY_NAME}" --format "value(ENDPOINT_ID)")
echo "ENDPOINT_ID = ${ENDPOINT_ID}"

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json"  \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints/${ENDPOINT_ID}:rawPredict \
  -d @criteo.json
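
The same `rawPredict` call can also be sketched in Python using only the standard library. This is an illustrative equivalent of the curl command, not part of the original notebook: the helpers `raw_predict_url` and `raw_predict` are hypothetical, the project and endpoint ID values are placeholders, and the access token is obtained by shelling out to `gcloud auth print-access-token` exactly as the curl example does.

```python
import json
import subprocess
import urllib.request


def raw_predict_url(project_id: str, region: str, endpoint_id: str) -> str:
    """Build the Vertex AI rawPredict URL used by the curl command."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{region}/endpoints/{endpoint_id}:rawPredict"
    )


def raw_predict(url: str, body_path: str = "criteo.json") -> dict:
    """POST the JSON request file to the endpoint and return the parsed response."""
    token = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    with open(body_path, "rb") as body_file:
        data = body_file.read()
    request = urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Placeholder values; substitute your project, region, and endpoint ID.
url = raw_predict_url("my-project", "us-central1", "1234567890")
print(url)
# response = raw_predict(url)  # requires gcloud credentials and a deployed model
```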