# Pre-processing + XGBoost model inference pipeline with NVIDIA Triton Inference Server on Amazon SageMaker

With the 22.05 release of the [NVIDIA Triton](https://github.com/triton-inference-server/server/) container image on SageMaker, you can use Triton's [Forest Inference Library (FIL) backend](https://github.com/triton-inference-server/fil_backend) to serve tree-based ML models such as XGBoost for high-performance CPU and GPU inference. The FIL backend benefits from Triton performance optimizations such as dynamic batching and concurrent model execution, which help maximize GPU and CPU utilization and lower the cost of inference. Triton's multi-framework support also lets you deploy tree-based ML models alongside deep learning models in fast, unified inference pipelines.

Machine learning applications are complex and often require data pre-processing. In this notebook, we deploy a tree-based ML model (XGBoost) using the FIL backend in Triton on a SageMaker endpoint, and we also implement a Python-based data pre-processing inference pipeline for the model using Triton's ensemble feature. This lets us send raw data from the client and have both pre-processing and model inference happen inside the Triton SageMaker endpoint, which improves inference performance.

**Note:** This notebook was tested with the `conda_python3` kernel on an Amazon SageMaker notebook instance of type `g4dn`.

## Forest Inference Library (FIL)

RAPIDS Forest Inference Library (FIL) is a library that provides high-performance inference for tree-based models. Some important FIL features:

* Supports XGBoost, LightGBM, cuML RandomForest, and scikit-learn Random Forest models
* No conversion is needed for XGBoost and LightGBM. Pickled scikit-learn or cuML models must be converted to Treelite's binary checkpoint format
* scikit-learn Random Forest is supported for single-output regression and multi-class classification
* Both CPU and GPU are supported

The benchmark below highlights FIL's throughput against CPU XGBoost.

<img src="./images/fil_benchmark.png" alt="fil-benchmark" width="500" align="left"/>

## Triton FIL Backend
FIL is available as a backend in Triton with all of its features to allow for serving XGBoost, LightGBM and RandomForest models both on CPU and GPU with high performance. Here are some important features of the FIL Backend:

* **Shapley Value Support (GPU)**: GPU Shapley Values are supported for Model Explainability
* **Categorical Feature Support**: Models trained on categorical features are fully supported.
* **CPU Optimizations**: Optimized CPU mode offers faster execution than native XGBoost.

To learn more about the FIL backend's features, see the [FAQ Notebook](https://github.com/triton-inference-server/fil_backend/blob/fea-faq_nb/notebooks/faq/FAQs.ipynb) and the [Triton FIL Backend GitHub](https://github.com/triton-inference-server/fil_backend/tree/main) repository.

## Triton Model Ensemble Feature
Triton Inference Server simplifies deploying AI models at scale in production, including building pre-processing and post-processing pipelines. Its ensemble scheduler pipelines the models that participate in an inference request, passing tensors from one model to the next inside the server while optimizing throughput.

<img src="./images/triton-ensemble.png" alt="triton-ensemble" width="500" align="left"/>
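As a rough sketch of what an ensemble definition looks like, the snippet below assembles a `config.pbtxt` for a two-step pipeline. The model names `preprocessing` and `fil` match the folders used later in this notebook, and `input__0` / `output__0` match the FIL config created below. Everything else (the tensor names `RAW_INPUT`, `INPUT`, `OUTPUT`, `preprocessed`, `PREDICTION`, the data types, shapes, and batch size) is a placeholder chosen for illustration, so the ensemble config shipped with the notebook may differ.

```python
# Illustrative template only: tensor names, dtypes, and dims are assumptions.
ENSEMBLE_CONFIG_TEMPLATE = """name: "ensemble"
platform: "ensemble"
max_batch_size: {max_batch_size}
input [
  {{ name: "RAW_INPUT" data_type: TYPE_STRING dims: [ 1 ] }}
]
output [
  {{ name: "PREDICTION" data_type: TYPE_FP32 dims: [ 1 ] }}
]
ensemble_scheduling {{
  step [
    {{
      model_name: "preprocessing"
      model_version: -1
      input_map {{ key: "INPUT" value: "RAW_INPUT" }}
      output_map {{ key: "OUTPUT" value: "preprocessed" }}
    }},
    {{
      model_name: "fil"
      model_version: -1
      input_map {{ key: "input__0" value: "preprocessed" }}
      output_map {{ key: "output__0" value: "PREDICTION" }}
    }}
  ]
}}
"""

# Each input_map/output_map key is a tensor name of the step's model;
# each value is the name of a tensor inside the ensemble.
ensemble_config = ENSEMBLE_CONFIG_TEMPLATE.format(max_batch_size=8)
print(ensemble_config)
```

The intermediate tensor `preprocessed` is what connects the two steps: it is the output of the first step and the input of the second, and it never leaves the server.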

## Set up Environment

Install the dependencies required to package the model and run inference using Triton server.

Also define the IAM role that gives SageMaker access to the model artifacts and the NVIDIA Triton ECR image.

In [1]:
!pip install -qU pip awscli boto3 sagemaker 
!pip install nvidia-pyindex
!pip install tritonclient[http]

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.0.1 requires botocore<1.22.9,>=1.22.8, but you have botocore 1.27.28 which is incompatible.
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting nvidia-pyindex
  Downloading nvidia-pyindex-1.0.9.tar.gz (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: nvidia-pyindex
  Building wheel for nvidia-pyindex (setup.py) ... done
  Created wheel for nvidia-pyindex: filename=nvidia_pyindex-1.0.9-py3-none-any.whl size=8413 sha256=077d30080ee065966b189126fe527b082995e97f998d94e654e8a27ca5a5ef3c
  Stored in directory: /home/ec2-user/.cache/pip/wheels/e0/c2/fb/5cf4e1cfaf28007238362cb746fb38fc2dd76348331a748d54
Successfully built nvidia-pyindex
Installing collected packages: nvidia-pyindex
Succ

In [2]:
import boto3
import json
import sagemaker
import time
import os
from sagemaker import get_execution_role
import pandas as pd
import numpy as np

sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")

In [3]:
account_id_map = {
    'us-east-1': '785573368785',
    'us-east-2': '007439368137',
    'us-west-1': '710691900526',
    'us-west-2': '301217895009',
    'eu-west-1': '802834080501',
    'eu-west-2': '205493899709',
    'eu-west-3': '254080097072',
    'eu-north-1': '601324751636',
    'eu-south-1': '966458181534',
    'eu-central-1': '746233611703',
    'ap-east-1': '110948597952',
    'ap-south-1': '763008648453',
    'ap-northeast-1': '941853720454',
    'ap-northeast-2': '151534178276',
    'ap-southeast-1': '324986816169',
    'ap-southeast-2': '355873309152',
    'cn-northwest-1': '474822919863',
    'cn-north-1': '472730292857',
    'sa-east-1': '756306329178',
    'ca-central-1': '464438896020',
    'me-south-1': '836785723513',
    'af-south-1': '774647643957'
}

In [4]:
region = boto3.Session().region_name
# Raising a plain string is a TypeError in Python 3, so raise a real exception
if region not in account_id_map:
    raise ValueError(f"Unsupported region: {region}")

In [5]:
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.05-py3".format(
    account_id=account_id_map[region], region=region, base=base
)
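The region check and URI construction above can also be folded into one small function, which makes the logic easy to test without AWS credentials. This is only a sketch: `build_triton_image_uri` is a hypothetical helper name, and the two account IDs are copied from the map above.

```python
def build_triton_image_uri(region, account_ids, tag="22.05-py3"):
    """Build the SageMaker Triton ECR image URI for a region."""
    if region not in account_ids:
        raise ValueError(f"Unsupported region: {region}")
    # China regions use a different DNS suffix
    base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account_ids[region]}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:{tag}"


# Two entries copied from account_id_map
accounts = {"us-west-2": "301217895009", "cn-north-1": "472730292857"}
print(build_triton_image_uri("us-west-2", accounts))
# 301217895009.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:22.05-py3
```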

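## Package models and dependencies and upload to S3

The model repository contains three models: a Python preprocessing model, the XGBoost model served by the FIL backend, and the ensemble model that chains them together. The steps below create each model's configuration and artifacts, package the repository, and upload it to S3.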

### Create Config File for FIL Model

First we create the Triton config file for the XGBoost model being served by the FIL Backend.

In [6]:
USE_GPU = True
FIL_MODEL_DIR = './model_repository/fil'

# Maximum size in bytes for input and output arrays
MAX_MEMORY_BYTES = 60_000_000
NUM_FEATURES = 15
NUM_CLASSES = 2
bytes_per_sample = (NUM_FEATURES + NUM_CLASSES) * 4
max_batch_size = MAX_MEMORY_BYTES // bytes_per_sample

IS_CLASSIFIER = True
model_format = 'xgboost_json'

# Select deployment hardware (GPU or CPU)
if USE_GPU:
    instance_kind = 'KIND_GPU'
else:
    instance_kind = 'KIND_CPU'

# whether the model is doing classification or regression    
if IS_CLASSIFIER:
    classifier_string = 'true'
else:
    classifier_string = 'false'

# whether to predict probabilities or not
predict_proba = False

if predict_proba:
    predict_proba_string = 'true'
else:
    predict_proba_string = 'false'

config_text = f"""backend: "fil"
max_batch_size: {max_batch_size}
input [                                 
 {{  
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ {NUM_FEATURES} ]                    
  }} 
]
output [
 {{
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }}
]
instance_group [{{ kind: {instance_kind} }}]
parameters [
  {{
    key: "model_type"
    value: {{ string_value: "{model_format}" }}
  }},
  {{
    key: "predict_proba"
    value: {{ string_value: "{predict_proba_string}" }}
  }},
  {{
    key: "output_class"
    value: {{ string_value: "{classifier_string}" }}
  }},
  {{
    key: "threshold"
    value: {{ string_value: "0.5" }}
  }},
  {{
    key: "storage_type"
    value: {{ string_value: "AUTO" }}
  }}
]

dynamic_batching {{}}"""

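# Make sure the model directory exists before writing the config
os.makedirs(FIL_MODEL_DIR, exist_ok=True)
config_path = os.path.join(FIL_MODEL_DIR, 'config.pbtxt')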
with open(config_path, 'w') as file_:
    file_.write(config_text)
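To make the `max_batch_size` value in the config concrete, here is the same arithmetic worked through on its own. The factor of 4 is the size of one FP32 value in bytes. The estimate budgets for `NUM_CLASSES` output values per sample, which is conservative given that the config above declares a single output value.

```python
NUM_FEATURES = 15
NUM_CLASSES = 2
BYTES_PER_FLOAT32 = 4
MAX_MEMORY_BYTES = 60_000_000

# 15 input floats + 2 output floats per sample
bytes_per_sample = (NUM_FEATURES + NUM_CLASSES) * BYTES_PER_FLOAT32
# Largest batch whose input and output arrays fit in the memory budget
max_batch_size = MAX_MEMORY_BYTES // bytes_per_sample

print(bytes_per_sample)                   # 68
print(max_batch_size)                     # 882352
print(max_batch_size * bytes_per_sample)  # 59999936
```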

### Create Conda Env for Preprocessing Dependencies

The Python backend in Triton requires a conda environment for any additional dependencies. Here we use the Python backend to preprocess the raw data before feeding it to the XGBoost model running in the FIL backend. Although the data preprocessing was originally done with RAPIDS cuDF and cuML, at inference time we use Pandas and Scikit-learn as the preprocessing dependencies. We do this for three reasons:
* First, to show how to create a conda environment for your dependencies and package it in the [format expected](https://github.com/triton-inference-server/python_backend#2-packaging-the-conda-environment) by Triton's Python backend.
* Second, running the preprocessing model in the Python backend on the CPU while XGBoost runs on the GPU illustrates how each model in a Triton ensemble pipeline can use a different framework backend and different hardware.
* Third, it highlights how the RAPIDS libraries (cuDF, cuML) are compatible with their CPU counterparts (Pandas, Scikit-learn). For example, LabelEncoders created in cuML can be used in Scikit-learn and vice versa.

We follow the instructions [here](https://github.com/triton-inference-server/python_backend#2-packaging-the-conda-environment) to package the preprocessing dependencies (scikit-learn and pandas) as a conda environment file for the Python backend. The [create_prep_env.sh](./create_prep_env.sh) script creates the conda environment, and we then copy it into the Python model folder.
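The preprocessing model's `model.py` is not reproduced in this notebook, so the following is only a guess at the kind of logic it might contain, written as plain functions that run without Triton. It assumes raw fields such as a currency string and an `HH:MM` time string (the sample data used for inference later has columns of this form). The function names and the `-1` fallback for unseen categories are hypothetical. In a real Python backend model, code like this would be called from the `execute` method of the `TritonPythonModel` class.

```python
def parse_amount(amount):
    """Convert a currency string such as '$16.27' to a float."""
    return float(amount.replace("$", "").replace(",", ""))


def parse_time(time_str):
    """Split an 'HH:MM' string into integer hour and minute."""
    hour, minute = time_str.split(":")
    return int(hour), int(minute)


def encode_label(value, classes, unknown=-1):
    """Map a category to its index in `classes`, similar in spirit to a
    fitted label encoder; unseen values map to `unknown` (an assumption)."""
    try:
        return classes.index(value)
    except ValueError:
        return unknown


chip_classes = ["Chip Transaction", "Online Transaction", "Swipe Transaction"]
print(parse_amount("$16.27"))                            # 16.27
print(parse_time("08:15"))                               # (8, 15)
print(encode_label("Swipe Transaction", chip_classes))   # 2
```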

In [7]:
!bash create_prep_env.sh
!cp preprocessing_env.tar.gz model_repository/preprocessing/

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.4
  latest version: 4.13.0

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/ec2-user/anaconda3/envs/preprocessing_env

  added / updated specs:
    - python=3.8


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _openmp_mutex-4.5          |            2_gnu          23 KB  conda-forge
    ca-certificates-2022.6.15  |       ha878542_0         149 KB  conda-forge
    libgcc-ng-12.1.0           |      h8d9b700_16         940 KB  conda-forge
    libgomp-12.1.0             |      h8d9b700_16         459 KB  conda-forge
    libzlib-1.2.12             |       h166bdaf_1          63 KB  conda-forge
    openssl-3.0.5              |       h166bdaf_0         2.9 MB  conda-forge
    pip-22.1.2                 |  

### Set up Label Encoders and XGBoost model in Model Repository

In [8]:
# copy label encoders into the Python preprocessing directory
!cp label_encoders.pkl model_repository/preprocessing/1/

In [9]:
# copy the trained XGBoost model into the FIL model directory
!mkdir -p model_repository/fil/1
!cp xgboost.json model_repository/fil/1/

In [10]:
# create model version directory for ensemble model
!mkdir -p model_repository/ensemble/1

In [11]:
!tar --exclude='.ipynb_checkpoints' -czvf model.tar.gz -C model_repository .

./
./fil/
./fil/config.pbtxt
./fil/1/
./fil/1/xgboost.json
./preprocessing/
./preprocessing/preprocessing_env.tar.gz
./preprocessing/config.pbtxt
./preprocessing/1/
./preprocessing/1/model.py
./preprocessing/1/label_encoders.pkl
./ensemble/
./ensemble/config.pbtxt
./ensemble/1/
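If you prefer to stay in Python rather than shell out to `tar`, the standard library's `tarfile` module can produce an equivalent archive. This is a sketch under the assumption that the archive only needs the same layout as above (model folders at the archive root, notebook checkpoint folders excluded). `package_model_repository` is a hypothetical helper name.

```python
import tarfile


def package_model_repository(repo_dir, archive_path, exclude=(".ipynb_checkpoints",)):
    """Create a gzipped tarball whose root is the contents of repo_dir."""

    def skip_excluded(tarinfo):
        # Returning None drops the entry (and, for directories, their contents)
        parts = tarinfo.name.split("/")
        return None if any(part in exclude for part in parts) else tarinfo

    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname="." mirrors `tar -C model_repository .`
        tar.add(repo_dir, arcname=".", filter=skip_excluded)
    return archive_path
```

Calling `package_model_repository("model_repository", "model.tar.gz")` would then stand in for the `tar` command above. Keep the archive path outside the repository directory so the archive does not try to include itself.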


In [12]:
model_uri = sagemaker_session.upload_data(path="model.tar.gz", key_prefix="triton-fil-ensemble")

## Create SageMaker Endpoint

We start by creating a SageMaker model from the model files we uploaded to S3 in the previous step.

In this step we also set an additional environment variable, `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`, which specifies the name of the model to be loaded by Triton. **The value must match the folder name in the model package uploaded to S3.** This variable is optional for a single model. For ensemble models, it **must be specified** for Triton to start up in SageMaker.

Additionally, you can set `SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT` and `SAGEMAKER_TRITON_THREAD_COUNT` to tune the thread counts.
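As a small illustration of how these variables fit together, the helper below builds the `Environment` dictionary for the container definition. `build_triton_environment` is a hypothetical name, and the thread counts in the example are arbitrary placeholders rather than recommended values. The one firm constraint is that SageMaker expects environment values to be strings, so integers are converted.

```python
def build_triton_environment(default_model_name, buffer_manager_threads=None, thread_count=None):
    """Build the container Environment dict; all values must be strings."""
    env = {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": default_model_name}
    if buffer_manager_threads is not None:
        env["SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT"] = str(buffer_manager_threads)
    if thread_count is not None:
        env["SAGEMAKER_TRITON_THREAD_COUNT"] = str(thread_count)
    return env


# Placeholder thread counts, for illustration only
example_env = build_triton_environment("ensemble", buffer_manager_threads=2, thread_count=4)
print(example_env)
```

The result can be passed as the `"Environment"` entry of the container definition used when creating the model.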

In [13]:
sm_model_name = "triton-fil-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_uri,
    "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble"},
}

create_model_response = sm.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

Model Arn: arn:aws:sagemaker:us-west-2:354625738399:model/triton-fil-ensemble-2022-07-13-01-48-17


Using the model above, we create an endpoint configuration where we can specify the type and number of instances we want in the endpoint.

In [14]:
endpoint_config_name = "triton-fil-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g4dn.4xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Endpoint Config Arn: arn:aws:sagemaker:us-west-2:354625738399:endpoint-config/triton-fil-ensemble-2022-07-13-01-48-18


Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status changes to `InService` once the deployment succeeds.

In [15]:
endpoint_name = "triton-fil-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-west-2:354625738399:endpoint/triton-fil-ensemble-2022-07-13-01-48-18


In [16]:
resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:354625738399:endpoint/triton-fil-ensemble-2022-07-13-01-48-18
Status: InService
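The loop above exits on any status other than `Creating`, including `Failed`, and it has no upper bound on waiting time. A slightly more defensive variant is sketched below. `wait_for_endpoint` is a hypothetical helper, and the 30-minute timeout is an arbitrary default. The describe function and sleep function are passed in so the logic can be exercised without AWS.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    waited = 0
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            # FailureReason is included by DescribeEndpoint when creation fails
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook it would be called as `wait_for_endpoint(sm.describe_endpoint, endpoint_name)`. boto3 also ships a built-in waiter for this case, `sm.get_waiter("endpoint_in_service")`, which may be simpler if you do not need custom logging.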


## Run Inference

Once the endpoint is running, we can run inference on some sample raw data using JSON as the payload format. For the inference request format, Triton uses the KFServing community standard [inference protocols](https://github.com/triton-inference-server/server/blob/main/docs/protocol/README.md).

In [17]:
data_infer = pd.read_csv("data_infer.csv")
data_infer

Unnamed: 0,User,Card,Year,Month,Day,Time,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?
0,1904,3,2020,2,8,11:08,$16.27,Chip Transaction,-5162038175624867091,Elk Grove,CA,95624.0,5541,
1,1896,1,2008,11,7,15:27,$52.87,Online Transaction,-2088492411650162548,ONLINE,,,4784,
2,572,0,2016,10,8,17:08,$3.89,Chip Transaction,-2444278202958188094,Pearland,TX,77584.0,5912,
3,1325,0,2014,6,7,08:15,$73.97,Swipe Transaction,-5581123930363301609,Bowling Green,KY,42101.0,5311,
4,1946,3,2005,5,8,21:09,$46.95,Swipe Transaction,-3213879500583660539,Fresno,CA,93725.0,7995,Insufficient Balance


In [18]:
STR_COLUMNS = ['Time',
 'Amount',
 'Zip',
 'MCC',
 'Merchant Name',
 'Use Chip',
 'Merchant City',
 'Merchant State',
 'Errors?']

batch_size = len(data_infer)

payload = {}
payload["inputs"] = []
data_dict = {}
for col in data_infer.columns:
    data_dict[col] = {}
    data_dict[col]['name'] = col
    if col in STR_COLUMNS:
        data_dict[col]['data'] = data_infer[col].astype(str).tolist()
        data_dict[col]['datatype'] = 'BYTES'
    else:
        data_dict[col]['data'] = data_infer[col].astype('float32').tolist()
        data_dict[col]['datatype'] = 'FP32'
    data_dict[col]['shape'] = [batch_size, 1]
    payload["inputs"].append(data_dict[col])

In [19]:
payload

{'inputs': [{'name': 'User',
   'data': [1904.0, 1896.0, 572.0, 1325.0, 1946.0],
   'datatype': 'FP32',
   'shape': [5, 1]},
  {'name': 'Card',
   'data': [3.0, 1.0, 0.0, 0.0, 3.0],
   'datatype': 'FP32',
   'shape': [5, 1]},
  {'name': 'Year',
   'data': [2020.0, 2008.0, 2016.0, 2014.0, 2005.0],
   'datatype': 'FP32',
   'shape': [5, 1]},
  {'name': 'Month',
   'data': [2.0, 11.0, 10.0, 6.0, 5.0],
   'datatype': 'FP32',
   'shape': [5, 1]},
  {'name': 'Day',
   'data': [8.0, 7.0, 8.0, 7.0, 8.0],
   'datatype': 'FP32',
   'shape': [5, 1]},
  {'name': 'Time',
   'data': ['11:08', '15:27', '17:08', '08:15', '21:09'],
   'datatype': 'BYTES',
   'shape': [5, 1]},
  {'name': 'Amount',
   'data': ['$16.27', '$52.87', '$3.89', '$73.97', '$46.95'],
   'datatype': 'BYTES',
   'shape': [5, 1]},
  {'name': 'Use Chip',
   'data': ['Chip Transaction',
    'Online Transaction',
    'Chip Transaction',
    'Swipe Transaction',
    'Swipe Transaction'],
   'datatype': 'BYTES',
   'shape': [5, 1]},
  {

In [20]:
response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/octet-stream", Body=json.dumps(payload)
)

In [21]:
response_body = json.loads(response["Body"].read().decode("utf8"))
predictions = response_body['outputs'][0]['data']
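Indexing `outputs[0]` works here because the ensemble returns a single output tensor. For pipelines with several outputs, looking a tensor up by name is less fragile. The sketch below uses a hand-written response in the shape defined by the inference protocol (an `outputs` list whose entries carry `name`, `datatype`, `shape`, and `data`). The output name `PREDICTION` and the values are made up for illustration, and `get_output_data` is a hypothetical helper.

```python
def get_output_data(response_body, name=None):
    """Return the data of the named output, or of the first output if name is None."""
    outputs = response_body["outputs"]
    if name is None:
        return outputs[0]["data"]
    for output in outputs:
        if output["name"] == name:
            return output["data"]
    available = [output["name"] for output in outputs]
    raise KeyError(f"Output {name!r} not found; available outputs: {available}")


# Hand-written example response; names and values are illustrative only
example_response = {
    "model_name": "ensemble",
    "model_version": "1",
    "outputs": [
        {"name": "PREDICTION", "datatype": "FP32", "shape": [3, 1], "data": [0.0, 1.0, 0.0]}
    ],
}
print(get_output_data(example_response, "PREDICTION"))  # [0.0, 1.0, 0.0]
```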

In [22]:
CLASS_LABELS = ['NOT FRAUD', 'FRAUD']
predictions = [CLASS_LABELS[int(idx)] for idx in predictions]

In [23]:
print(predictions)

['NOT FRAUD', 'NOT FRAUD', 'NOT FRAUD', 'NOT FRAUD', 'NOT FRAUD']


## Terminate endpoint and clean up artifacts

In [24]:
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_model(ModelName=sm_model_name)

{'ResponseMetadata': {'RequestId': 'caed8352-bfaf-40d6-a3f2-4dc7fc63babd',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'caed8352-bfaf-40d6-a3f2-4dc7fc63babd',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Wed, 13 Jul 2022 01:54:19 GMT'},
  'RetryAttempts': 0}}
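If one of the delete calls above fails (for example because a resource was already removed), the remaining calls never run and resources can be left behind. One way to guard against that, sketched here with a hypothetical `cleanup_resources` helper, is to attempt every deletion and collect the errors instead of stopping at the first one.

```python
def cleanup_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Attempt all three deletions; return a dict of any errors by resource."""
    steps = [
        ("endpoint", sm_client.delete_endpoint, {"EndpointName": endpoint_name}),
        ("endpoint config", sm_client.delete_endpoint_config,
         {"EndpointConfigName": endpoint_config_name}),
        ("model", sm_client.delete_model, {"ModelName": model_name}),
    ]
    failures = {}
    for label, call, kwargs in steps:
        try:
            call(**kwargs)
        except Exception as exc:  # broad on purpose: keep going and report later
            failures[label] = exc
    return failures
```

With the names from this notebook, `cleanup_resources(sm, endpoint_name, endpoint_config_name, sm_model_name)` returns an empty dict when everything was deleted.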