# Serving Recommender Models for Online Prediction using NVIDIA Triton and Vertex AI

This notebook demonstrates serving a HugeCTR model using Triton server on Vertex AI Prediction.
The notebook compiles prescriptive guidance for the following tasks:

1. Exporting the Triton ensemble model, which consists of an NVTabular preprocessing workflow and a HugeCTR model.
2. Building a custom container derived from the NVIDIA NGC Merlin inference image.
3. Uploading the model and its metadata to Vertex Models.
4. Deploying the model to Vertex AI Prediction.
5. Getting inference results for sample data points using the endpoint.

## Triton Inference Server Overview

[Triton Inference Server](https://github.com/triton-inference-server/server) provides an inferencing solution optimized for both CPUs and GPUs. Triton can run multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, it automatically creates an instance of each model on each GPU to increase utilization without extra coding. It supports real-time inferencing, batch inferencing to maximize GPU/CPU utilization, and streaming inference with built-in support for audio streaming input. It also supports model ensembles for use cases that require multiple models to perform end-to-end inference.

At a high level, the Triton Inference Server architecture works as follows:
- The model repository is a file-system based repository of the models that Triton will make available for inferencing.
- Inference requests arrive at the server via either HTTP/REST or gRPC and are then routed to the appropriate per-model scheduler.
- Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis.
- The backend performs inferencing using the inputs provided in the batched requests to produce the requested outputs.

Triton server provides readiness and liveness health endpoints, as well as utilization, throughput, and latency metrics, which enable the integration of Triton into deployment environments such as Vertex AI Prediction.
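The health and inference endpoints can be exercised with nothing more than the standard library. The following is a minimal sketch, assuming a Triton server reachable over HTTP at a base URL such as `http://localhost:8000`; the helper names `infer_route` and `is_ready` are hypothetical and not part of this repository.

```python
import urllib.request

# Health routes exposed by Triton's HTTP endpoint.
HEALTH_LIVE_ROUTE = "/v2/health/live"
HEALTH_READY_ROUTE = "/v2/health/ready"


def infer_route(model_name: str) -> str:
    """Return the inference route for a model served by Triton."""
    return f"/v2/models/{model_name}/infer"


def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers 200 on its readiness route."""
    url = base_url.rstrip("/") + HEALTH_READY_ROUTE
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


# Route for the ensemble model used later in this notebook.
ensemble_route = infer_route("deepfm_ens")
print(ensemble_route)
```

A deployment platform typically polls the readiness route in the same way before sending traffic to the container.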

In this example, we use Triton to serve an ensemble model that contains a data preprocessing workflow and a HugeCTR model trained on Criteo data. The model is deployed to Vertex AI Prediction, as shown in the following figure:

<img src="./images/triton-vertex.png" alt="Triton Architecture" style="width:70%"/>

## Setup

In this section of the notebook you configure your environment settings, including a GCP project, a GCP compute region, and a GCS bucket.
You also set the locations of the saved NVTabular workflow, created in [01-dataset-preprocessing.ipynb](01-dataset-preprocessing.ipynb), and the trained HugeCTR model, created in [02-model-training-hugectr.ipynb](02-model-training-hugectr.ipynb).

Make sure to update the cells below with values that reflect your environment.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import json
import os
import shutil
import time

from pathlib import Path
from src.serving import export
from src.configs import EnsembleConfig

from google.cloud import aiplatform as vertex_ai

In [3]:
PROJECT_ID = 'jk-mlops-dev' # Change to your project.
REGION = 'us-central1'  # Change to your region.
STAGING_BUCKET = 'jk-merlin-dev' # Change to your bucket.
MODEL_REPOSITORY_BUCKET = 'jk-vertex-staging'
LOCAL_WORKSPACE = '/home/jupyter/staging'

MODEL_NAME = 'deepfm'
MODEL_VERSION = 'v01'
MODEL_DISPLAY_NAME = f'hugectr-{MODEL_NAME}-{MODEL_VERSION}'
MODEL_DESCRIPTION = 'HugeCTR DeepFM model'
ENDPOINT_DISPLAY_NAME = f'hugectr-{MODEL_NAME}-{MODEL_VERSION}'

EXPORTED_MODELS_DIR = f'gs://{MODEL_REPOSITORY_BUCKET}/hugectr_models'

IMAGE_NAME = 'triton-deploy-hugectr'
IMAGE_URI = f"gcr.io/{PROJECT_ID}/{IMAGE_NAME}"
DOCKERFILE = 'src/Dockerfile.triton'

WORKFLOW_MODEL_DIR = "gs://criteo-datasets/criteo_processed_parquet/workflow" # Change to GCS path of the nvt workflow.
HUGECTR_MODEL_DIR = "gs://merlin-models/hugectr_deepfm_21.09" # Change to GCS path of the hugectr trained model.

### Initialize Vertex AI SDK

In [4]:
vertex_ai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=STAGING_BUCKET
)

## 1. Exporting Triton ensemble model

The Triton ensemble model consists of an NVTabular preprocessing workflow and a HugeCTR model.

### Copy the HugeCTR saved model and the fitted NVTabular workflow to a local staging folder

In [5]:
if os.path.isdir(LOCAL_WORKSPACE):
    shutil.rmtree(LOCAL_WORKSPACE)
os.makedirs(LOCAL_WORKSPACE)

!gsutil -m cp -r {WORKFLOW_MODEL_DIR} {LOCAL_WORKSPACE}
!gsutil -m cp -r {HUGECTR_MODEL_DIR} {LOCAL_WORKSPACE}

Copying gs://criteo-datasets/criteo_processed_parquet/workflow/categories/unique.C1.parquet...
Copying gs://criteo-datasets/criteo_processed_parquet/workflow/categories/unique.C10.parquet...
Copying gs://criteo-datasets/criteo_processed_parquet/workflow/categories/unique.C11.parquet...
Copying gs://criteo-datasets/criteo_processed_parquet/workflow/categories/unique.C12.parquet...
Copying gs://criteo-datasets/criteo_processed_parquet/workflow/categories/unique.C13.parquet...
Copying gs://criteo-datasets/criteo_processed_parquet/workflow/categories/unique.C14.parquet...
Copying gs://criteo-datasets/criteo_processed_parquet/workflow/categories/unique.C15.parquet...
Copying gs://criteo-datasets/criteo_processed_parquet/workflow/categories/unique.C16.parquet...
Copying gs://criteo-datasets/criteo_processed_parquet/workflow/categories/unique.C17.parquet...
Copying gs://criteo-datasets/criteo_processed_parquet/workflow/categories/unique.C18.parquet...
Copying gs://criteo-datasets/criteo_proce

### Export the ensemble model

In [6]:
ensemble_config = EnsembleConfig()
local_workflow_path = Path(LOCAL_WORKSPACE) / Path(WORKFLOW_MODEL_DIR).parts[-1]
local_saved_model_path = Path(LOCAL_WORKSPACE) / Path(HUGECTR_MODEL_DIR).parts[-1]
local_ensemble_path = Path(LOCAL_WORKSPACE) / f'triton-ensemble-{time.strftime("%Y%m%d%H%M%S")}'
container_model_registry_path = '/models'

export.export_ensemble(
    model_name=MODEL_NAME,
    workflow_path=local_workflow_path,
    saved_model_path=local_saved_model_path,
    output_path=local_ensemble_path,
    categorical_columns=ensemble_config.categorical_columns,
    continuous_columns=ensemble_config.continuous_columns,
    label_columns=ensemble_config.label_columns,
    num_slots=ensemble_config.num_slots,
    max_nnz=ensemble_config.max_nnz,
    num_outputs=ensemble_config.num_outputs,
    embedding_vector_size=ensemble_config.embedding_vector_size,
    max_batch_size=ensemble_config.max_batch_size,
    model_registry_path=container_model_registry_path
    )

In [7]:
! ls -la {local_ensemble_path}

total 24
drwxr-xr-x 5 jupyter jupyter 4096 Oct 30 01:55 .
drwxr-xr-x 5 jupyter jupyter 4096 Oct 30 01:55 ..
drwxr-xr-x 3 jupyter jupyter 4096 Oct 30 01:55 deepfm
drwxr-xr-x 3 jupyter jupyter 4096 Oct 30 01:55 deepfm_ens
drwxr-xr-x 3 jupyter jupyter 4096 Oct 30 01:55 deepfm_nvt
-rw-r--r-- 1 jupyter jupyter  222 Oct 30 01:55 ps.json
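The listing above shows three model directories (`deepfm`, `deepfm_nvt`, `deepfm_ens`) and a `ps.json` file. A quick check that an export produced this layout could look like the sketch below. The helper `missing_ensemble_entries` is hypothetical, and the expected entries are inferred only from this listing, not from the `export_ensemble` implementation.

```python
import tempfile
from pathlib import Path


def missing_ensemble_entries(ensemble_dir, model_name):
    """Return the expected top-level entries that are absent from ensemble_dir."""
    root = Path(ensemble_dir)
    # Directory names inferred from the listing: model, NVTabular workflow, ensemble.
    expected_dirs = [model_name, f"{model_name}_nvt", f"{model_name}_ens"]
    missing = [name for name in expected_dirs if not (root / name).is_dir()]
    if not (root / "ps.json").is_file():
        missing.append("ps.json")
    return missing


# Demo on a throwaway directory that contains only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "deepfm").mkdir()
    (Path(tmp) / "deepfm_nvt").mkdir()
    incomplete = missing_ensemble_entries(tmp, "deepfm")

print(incomplete)  # ['deepfm_ens', 'ps.json']
```

In the notebook, calling `missing_ensemble_entries(local_ensemble_path, MODEL_NAME)` should return an empty list before you upload the ensemble.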


### Upload the ensemble to GCS

In [8]:
gcs_ensemble_path = '{}/{}'.format(EXPORTED_MODELS_DIR, Path(local_ensemble_path).parts[-1])

!gsutil -m cp -r {local_ensemble_path}/* {gcs_ensemble_path}/

Copying file:///home/jupyter/staging/triton-ensemble-20211030015518/deepfm/config.pbtxt [Content-Type=application/octet-stream]...
Copying file:///home/jupyter/staging/triton-ensemble-20211030015518/deepfm/1/deepfm0_opt_sparse_0.model [Content-Type=application/octet-stream]...
Copying file:///home/jupyter/staging/triton-ensemble-20211030015518/deepfm/1/deepfm.json [Content-Type=application/json]...
Copying file:///home/jupyter/staging/triton-ensemble-20211030015518/deepfm/1/deepfm_opt_dense_0.model [Content-Type=application/octet-stream]...
Copying file:///home/jupyter/staging/triton-ensemble-20211030015518/deepfm/1/deepfm_dense_0.model [Content-Type=application/octet-stream]...
Copying file:///home/jupyter/staging/triton-ensemble-20211030015518/deepfm/1/deepfm0_sparse_0.model/emb_vector [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature c
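If you prefer to stay in Python rather than shelling out to `gsutil`, the same copy can be sketched with the `google-cloud-storage` client. This is an illustration under stated assumptions, not the method used by this notebook: the helpers `plan_uploads` and `upload_with_client` are hypothetical, the client library must be installed, and application default credentials must be available.

```python
from pathlib import Path


def plan_uploads(local_dir, gcs_prefix):
    """Map every file under local_dir to a destination URI under gcs_prefix."""
    root = Path(local_dir)
    prefix = gcs_prefix.rstrip("/")
    return [
        (path, f"{prefix}/{path.relative_to(root).as_posix()}")
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]


def upload_with_client(local_dir, gcs_prefix):
    """Upload the planned files one by one (no parallelism, unlike `gsutil -m`)."""
    # Imported lazily so that plan_uploads works without the client library.
    from google.cloud import storage

    client = storage.Client()
    for path, uri in plan_uploads(local_dir, gcs_prefix):
        # Split "gs://bucket/some/blob" into bucket name and blob name.
        bucket_name, _, blob_name = uri[len("gs://"):].partition("/")
        client.bucket(bucket_name).blob(blob_name).upload_from_filename(str(path))
```

Calling `plan_uploads(local_ensemble_path, gcs_ensemble_path)` first lets you inspect the destination paths before any data is transferred.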

## 2. Building a custom container derived from the NVIDIA NGC Merlin inference image

In [None]:
! cp src/Dockerfile.triton src/Dockerfile
! gcloud builds submit --timeout "2h" --tag {IMAGE_URI} src --machine-type=e2-highcpu-8

## 3. Uploading the model and its metadata to Vertex Models

In [9]:
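# Triton's readiness route, used by Vertex AI for health checks.
health_route = "/v2/health/ready"
# Inference route of the ensemble model exported above.
predict_route = f"/v2/models/{MODEL_NAME}_ens/infer"
# Port on which the container serves HTTP requests.
serving_container_ports = [8000]
# Model repository path inside the container, passed as the container argument.
in_container_model_repository = '/models'
serving_container_args = [in_container_model_repository]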


model = vertex_ai.Model.upload(
    display_name=MODEL_DISPLAY_NAME,
    description=MODEL_DESCRIPTION,
    serving_container_image_uri=IMAGE_URI,
    serving_container_predict_route=predict_route,
    serving_container_health_route=health_route,
    serving_container_ports=serving_container_ports,
    artifact_uri=gcs_ensemble_path,
    serving_container_args=serving_container_args,
    sync=True
)

model.resource_name

INFO:google.cloud.aiplatform.models:Creating Model
INFO:google.cloud.aiplatform.models:Create Model backing LRO: projects/895222332033/locations/us-central1/models/8950064232515239936/operations/1915373857158463488
INFO:google.cloud.aiplatform.models:Model created. Resource name: projects/895222332033/locations/us-central1/models/8950064232515239936
INFO:google.cloud.aiplatform.models:To use this Model in another session:
INFO:google.cloud.aiplatform.models:model = aiplatform.Model('projects/895222332033/locations/us-central1/models/8950064232515239936')


'projects/895222332033/locations/us-central1/models/8950064232515239936'

## 4. Deploying the model to Vertex AI Prediction

### Create the Vertex Endpoint

In [10]:
endpoint = vertex_ai.Endpoint.create(
    display_name=ENDPOINT_DISPLAY_NAME
)

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/895222332033/locations/us-central1/endpoints/6144326062709932032/operations/1805598616241307648
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/895222332033/locations/us-central1/endpoints/6144326062709932032
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/895222332033/locations/us-central1/endpoints/6144326062709932032')


### Deploy the model to Vertex Endpoint

In [11]:
traffic_percentage = 100
machine_type = "n1-standard-8"
accelerator_type="NVIDIA_TESLA_T4"
accelerator_count = 1
min_replica_count = 1
max_replica_count = 3

In [12]:
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=MODEL_DISPLAY_NAME,
    machine_type=machine_type,
    min_replica_count=min_replica_count,
    max_replica_count=max_replica_count,
    traffic_percentage=traffic_percentage,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    sync=True,
)

INFO:google.cloud.aiplatform.models:Deploying model to Endpoint : projects/895222332033/locations/us-central1/endpoints/6144326062709932032
INFO:google.cloud.aiplatform.models:Deploy Endpoint model backing LRO: projects/895222332033/locations/us-central1/endpoints/6144326062709932032/operations/1566907835990671360
INFO:google.cloud.aiplatform.models:Endpoint model deployed. Resource name: projects/895222332033/locations/us-central1/endpoints/6144326062709932032


<google.cloud.aiplatform.models.Endpoint object at 0x7f5fba5a1730> 
resource name: projects/895222332033/locations/us-central1/endpoints/6144326062709932032

## 5. Invoking the model

In [13]:
payload = {
    'id': '1',
    'inputs': [
        {'name': 'I1','shape': [3, 1], 'datatype': 'INT32', 'data': [5, 32, 0]},
        {'name': 'I2', 'shape': [3, 1], 'datatype': 'INT32', 'data': [110, 3, 233]},
        {'name': 'I3', 'shape': [3, 1], 'datatype': 'INT32', 'data': [0, 5, 1]},
        {'name': 'I4', 'shape': [3, 1], 'datatype': 'INT32', 'data': [16, 0, 146]},
        {'name': 'I5', 'shape': [3, 1], 'datatype': 'INT32', 'data': [0, 1, 1]},
        {'name': 'I6', 'shape': [3, 1], 'datatype': 'INT32', 'data': [1, 0, 0]},
        {'name': 'I7', 'shape': [3, 1], 'datatype': 'INT32', 'data': [0, 0, 0]},
        {'name': 'I8', 'shape': [3, 1], 'datatype': 'INT32', 'data': [14, 61, 99]},
        {'name': 'I9', 'shape': [3, 1], 'datatype': 'INT32', 'data': [7, 5, 7]},
        {'name': 'I10', 'shape': [3, 1], 'datatype': 'INT32', 'data': [1, 0, 0]},
        {'name': 'I11', 'shape': [3, 1], 'datatype': 'INT32', 'data': [0, 1, 1]},
        {'name': 'I12', 'shape': [3, 1], 'datatype': 'INT32', 'data': [306, 3157, 3101]},
        {'name': 'I13', 'shape': [3, 1], 'datatype': 'INT32', 'data': [0, 5, 1]},
        {'name': 'C1', 'shape': [3, 1], 'datatype': 'INT32', 'data': [1651969401, -436994675, 1651969401]},
        {'name': 'C2', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-501260968, -1599406170, -1382530557]},
        {'name': 'C3', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1343601617, 1873417685, 1656669709]},
        {'name': 'C4', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1805877297, -628476895, 946620910]},
        {'name': 'C5', 'shape': [3, 1], 'datatype': 'INT32', 'data': [951068488, 1020698403, -413858227]},
        {'name': 'C6', 'shape': [3, 1], 'datatype': 'INT32', 'data': [1875733963, 1875733963, 1875733963]},
        {'name': 'C7', 'shape': [3, 1], 'datatype': 'INT32', 'data': [897624609, -1424560767, -1242174622]},
        {'name': 'C8', 'shape': [3, 1], 'datatype': 'INT32', 'data': [679512323, 1128426537, -772617077]},
        {'name': 'C9', 'shape': [3, 1], 'datatype': 'INT32', 'data': [1189011366, 502653268, 776897055]},
        {'name': 'C10', 'shape': [3, 1], 'datatype': 'INT32', 'data': [771915201, 2112471209, 771915201]},
        {'name': 'C11', 'shape': [3, 1], 'datatype': 'INT32', 'data': [209470001, 1716706404, 209470001]},
        {'name': 'C12', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1785193185, -1712632281, 309420420]},
        {'name': 'C13', 'shape': [3, 1], 'datatype': 'INT32', 'data': [12976055, 12976055, 12976055]},
        {'name': 'C14', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1102125769, -1102125769, -1102125769]},
        {'name': 'C15', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1978960692, -205783399, -150008565]},
        {'name': 'C16', 'shape': [3, 1], 'datatype': 'INT32', 'data': [1289502458, 1289502458, 1289502458]},
        {'name': 'C17', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-771205462, -771205462, -771205462]},
        {'name': 'C18', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1206449222, -1578429167, 1653545869]},
        {'name': 'C19', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1793932789, -1793932789, -1793932789]},
        {'name': 'C20', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1014091992, -20981661, -1014091992]},
        {'name': 'C21', 'shape': [3, 1], 'datatype': 'INT32', 'data': [351689309, -1556988767, 351689309]},
        {'name': 'C22', 'shape': [3, 1], 'datatype': 'INT32', 'data': [632402057, -924717482, 632402057]},
        {'name': 'C23', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-675152885, 391309800, -675152885]},
        {'name': 'C24', 'shape': [3, 1], 'datatype': 'INT32', 'data': [2091868316, 1966410890, 883538181]},
        {'name': 'C25', 'shape': [3, 1], 'datatype': 'INT32', 'data': [809724924, -1726799382, -10139646]},
        {'name': 'C26', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-317696227, -1218975401, -317696227]}]
}

with open('criteo_payload.json', 'w') as f:
    json.dump(payload, f)
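Every entry in the payload above has the same structure: a name, a `[batch_size, 1]` shape, a datatype, and a flat list of values. Rather than writing the entries by hand, the request body can be generated from a column-oriented dictionary. The sketch below assumes all inputs share one datatype, as they do in the example; `build_infer_payload` is a hypothetical helper.

```python
import json


def build_infer_payload(columns, request_id="1", datatype="INT32"):
    """Build an inference request body from {column name: list of values}."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of rows")
    batch_size = lengths.pop()
    return {
        "id": request_id,
        "inputs": [
            {
                "name": name,
                "shape": [batch_size, 1],
                "datatype": datatype,
                "data": list(values),
            }
            for name, values in columns.items()
        ],
    }


# Only three of the columns are shown here; a real request needs all of
# the continuous (I1..I13) and categorical (C1..C26) inputs.
sample_columns = {
    "I1": [5, 32, 0],
    "I2": [110, 3, 233],
    "C1": [1651969401, -436994675, 1651969401],
}
request_body = build_infer_payload(sample_columns)
print(json.dumps(request_body["inputs"][0]))
```

Validating the row count up front catches ragged inputs before they reach the server.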

In [15]:
uri = f'https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint.name}:rawPredict'

! curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json"  \
{uri} \
-d @criteo_payload.json

{"id":"1","model_name":"deepfm_ens","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"OUTPUT0","datatype":"FP32","shape":[3],"data":[0.07325490564107895,0.04518262296915054,0.04656011983752251]}]}
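The response carries one value per input row in the `OUTPUT0` tensor. A minimal sketch for pulling those values out of the response text follows. The helper `extract_output` is hypothetical, and reading the values as predicted click probabilities is an assumption based on the model being a DeepFM trained on Criteo data.

```python
import json


def extract_output(response_text, output_name="OUTPUT0"):
    """Return the flat data list of the named output tensor."""
    response = json.loads(response_text)
    for output in response.get("outputs", []):
        if output["name"] == output_name:
            return output["data"]
    raise KeyError(f"output {output_name!r} not found in response")


# Response text copied from the curl call above.
raw_response = (
    '{"id":"1","model_name":"deepfm_ens","model_version":"1",'
    '"parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},'
    '"outputs":[{"name":"OUTPUT0","datatype":"FP32","shape":[3],'
    '"data":[0.07325490564107895,0.04518262296915054,0.04656011983752251]}]}'
)
scores = extract_output(raw_response)
print(scores)
```

The list has three entries, one for each row sent in the request.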