# Triton on SageMaker - Ensemble + FIL Backend

## Set up Environment

Install the dependencies required to package the model and run inference against Triton Inference Server.

Also define the IAM role that gives SageMaker access to the model artifacts and the NVIDIA Triton ECR image.

In [None]:
!pip install -qU pip awscli boto3 sagemaker 
!pip install nvidia-pyindex
!pip install "tritonclient[http]"

In [2]:
import boto3
import json
import sagemaker
import time
import os
from sagemaker import get_execution_role

sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")

In [3]:
account_id_map = {
    'us-east-1': '785573368785',
    'us-east-2': '007439368137',
    'us-west-1': '710691900526',
    'us-west-2': '301217895009',
    'eu-west-1': '802834080501',
    'eu-west-2': '205493899709',
    'eu-west-3': '254080097072',
    'eu-north-1': '601324751636',
    'eu-south-1': '966458181534',
    'eu-central-1': '746233611703',
    'ap-east-1': '110948597952',
    'ap-south-1': '763008648453',
    'ap-northeast-1': '941853720454',
    'ap-northeast-2': '151534178276',
    'ap-southeast-1': '324986816169',
    'ap-southeast-2': '355873309152',
    'cn-northwest-1': '474822919863',
    'cn-north-1': '472730292857',
    'sa-east-1': '756306329178',
    'ca-central-1': '464438896020',
    'me-south-1': '836785723513',
    'af-south-1': '774647643957'
}

In [4]:
region = boto3.Session().region_name
if region not in account_id_map:
    # raising a bare string is a TypeError; raise a real exception instead
    raise ValueError(f"Unsupported region: {region}")

In [5]:
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.05-py3".format(
    account_id=account_id_map[region], region=region, base=base
)

In [None]:
#triton_image_uri = "354625738399.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:22.05-py"

In [6]:
triton_image_uri

'301217895009.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:22.05-py3'
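
The URI construction above can be wrapped in a small helper so that the region check and the China-partition domain switch live in one place. This is only a sketch: the function name `build_triton_image_uri` and its `tag` parameter are hypothetical names introduced here, and the account ID map is trimmed to two entries copied from the table above.

```python
# Hypothetical helper; only two regions from the notebook's account_id_map are copied here.
ACCOUNT_IDS = {
    "us-west-2": "301217895009",
    "cn-north-1": "472730292857",
}


def build_triton_image_uri(region, account_ids=ACCOUNT_IDS, tag="22.05-py3"):
    """Return the SageMaker Triton ECR image URI for the given region."""
    if region not in account_ids:
        raise ValueError(f"Unsupported region: {region}")
    # China regions use a different domain suffix
    base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account_ids[region]}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:{tag}"


print(build_triton_image_uri("us-west-2"))
```

For `us-west-2` this should reproduce the URI shown in the cell output above.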

## Package models and dependencies and upload to S3

First, create the Triton config file for the XGBoost model served by the FIL backend.

**TODO**: Note the reshape applied to the output here; it is specific to this example.

In [8]:
USE_GPU = True
MODEL_DIR = './triton-serve/fil'

# Maximum size in bytes for input and output arrays
MAX_MEMORY_BYTES = 60_000_000
NUM_FEATURES = 15
NUM_CLASSES = 2
bytes_per_sample = (NUM_FEATURES + NUM_CLASSES) * 4
max_batch_size = MAX_MEMORY_BYTES // bytes_per_sample

IS_CLASSIFIER = True
model_format = 'xgboost_json'

# Select deployment hardware (GPU or CPU)
if USE_GPU:
    instance_kind = 'KIND_GPU'
else:
    instance_kind = 'KIND_CPU'

# whether the model is doing classification or regression    
if IS_CLASSIFIER:
    classifier_string = 'true'
else:
    classifier_string = 'false'

# whether to predict probabilities or not
predict_proba = False

if predict_proba:
    predict_proba_string = 'true'
else:
    predict_proba_string = 'false'

config_text = f"""backend: "fil"
max_batch_size: {max_batch_size}
input [                                 
 {{  
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ {NUM_FEATURES} ]                    
  }} 
]
output [
 {{
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: {{ shape: [ 1 ] }}
  }}
]
instance_group [{{ kind: {instance_kind} }}]
parameters [
  {{
    key: "model_type"
    value: {{ string_value: "{model_format}" }}
  }},
  {{
    key: "predict_proba"
    value: {{ string_value: "{predict_proba_string}" }}
  }},
  {{
    key: "output_class"
    value: {{ string_value: "{classifier_string}" }}
  }},
  {{
    key: "threshold"
    value: {{ string_value: "0.5" }}
  }},
  {{
    key: "storage_type"
    value: {{ string_value: "AUTO" }}
  }}
]

dynamic_batching {{}}"""

os.makedirs(MODEL_DIR, exist_ok=True)  # the model directory may not exist yet
config_path = os.path.join(MODEL_DIR, 'config.pbtxt')
with open(config_path, 'w') as file_:
    file_.write(config_text)
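
The `max_batch_size` written into the config follows from the memory budget: each sample carries `NUM_FEATURES` input values plus `NUM_CLASSES` output values, each a 4-byte FP32. A quick check of that arithmetic, using the constants from the cell above (the function name is a hypothetical one introduced here):

```python
FP32_BYTES = 4  # size of one TYPE_FP32 value


def max_batch_size_for(max_memory_bytes, num_features, num_classes):
    """Largest batch whose input and output arrays fit in the byte budget."""
    bytes_per_sample = (num_features + num_classes) * FP32_BYTES
    return max_memory_bytes // bytes_per_sample


# (15 + 2) * 4 = 68 bytes per sample; 60_000_000 // 68 = 882_352
print(max_batch_size_for(60_000_000, 15, 2))
```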

In [9]:
# Download the RAPIDS 22.06 Conda env to be used in Python preprocessing
!wget -q -P model_repository/preprocessing https://rapidsai-data.s3.us-east-2.amazonaws.com/conda-pack/rapidsai/rapids22.06_cuda11.5_py3.8.tar.gz

In [12]:
!mv model_repository/preprocessing/rapids22.06_cuda11.5_py3.8.tar.gz ../../

In [None]:
# move label encoders into python preprocessing directory
!cp label_encoders.pkl triton-serve/preprocessing/1/

In [10]:
# move trained xgboost model into fil model directory
!mkdir -p triton-serve/fil/1
!cp xgboost.json triton-serve/fil/1/

In [11]:
# create model version directory for ensemble model
!mkdir -p triton-serve/ensemble_preprocess_fil_postprocess/1

In [None]:
!tar -cvzf model.tar.gz -C model_repository .

In [None]:
model_uri = sagemaker_session.upload_data(path="model.tar.gz", key_prefix="triton-serve")

## Create SageMaker Endpoint

We start by creating a SageMaker model from the model files uploaded to S3 in the previous step.

In this step we also set an additional environment variable, SAGEMAKER_TRITON_DEFAULT_MODEL_NAME, which specifies the name of the model Triton should load. Its value must match the folder name in the model package uploaded to S3. The variable is optional when the package contains a single model; for ensemble models it must be set for Triton to start up in SageMaker.

Additionally, customers can set SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT and SAGEMAKER_TRITON_THREAD_COUNT to tune the thread counts.
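
As a rough sketch of how these environment variables fit into the model definition, the helper below builds the container dictionary that boto3's `create_model` accepts as `PrimaryContainer`. The helper name, the placeholder image URI and S3 path, and the choice of `ensemble_preprocess_fil_postprocess` as the default model name (taken from the ensemble directory created earlier) are assumptions for illustration, not the notebook's confirmed values.

```python
def build_triton_container(image_uri, model_data_url, default_model_name,
                           thread_count=None, buffer_manager_thread_count=None):
    """Build a container definition for SageMaker's create_model call."""
    env = {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": default_model_name}
    # Optional tuning knobs; environment values must be strings.
    if thread_count is not None:
        env["SAGEMAKER_TRITON_THREAD_COUNT"] = str(thread_count)
    if buffer_manager_thread_count is not None:
        env["SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT"] = str(buffer_manager_thread_count)
    return {"Image": image_uri, "ModelDataUrl": model_data_url, "Environment": env}


container = build_triton_container(
    image_uri="<triton_image_uri>",  # placeholder: use triton_image_uri from above
    model_data_url="s3://<bucket>/triton-serve/model.tar.gz",  # placeholder: use model_uri
    default_model_name="ensemble_preprocess_fil_postprocess",  # assumed ensemble name
)
print(container["Environment"])

# With AWS credentials, and `sm` and `role` from the setup cells:
# sm.create_model(ModelName="<model-name>", ExecutionRoleArn=role, PrimaryContainer=container)
```

The resulting model would then be referenced from an endpoint configuration and an endpoint before any inference request can be sent.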

## Run Inference

In [10]:
import pandas as pd

In [11]:
data_infer = pd.read_csv("data_infer.csv")

In [18]:
type(sample_data)

numpy.ndarray

In [None]:
payload = {
    "inputs": [
        # .tolist() turns the ndarray into nested lists so the payload can be JSON-serialized
        {"name": "INPUT__0", "shape": list(sample_data.shape), "datatype": "STRING", "data": sample_data.tolist()},
    ]
}
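
A payload like this would typically be serialized to JSON and sent through the SageMaker runtime client. The sketch below uses plain nested lists (for a NumPy array, pass `sample_data.tolist()`); the function name, the toy rows, the endpoint name placeholder, and the content type in the commented call are assumptions, not values confirmed by this notebook.

```python
import json


def build_request_body(rows, input_name="INPUT__0", datatype="STRING"):
    """Serialize a 2-D list of values into a JSON inference request body."""
    shape = [len(rows), len(rows[0]) if rows else 0]
    payload = {
        "inputs": [
            {"name": input_name, "shape": shape, "datatype": datatype, "data": rows},
        ]
    }
    return json.dumps(payload)


body = build_request_body([["a", "b"], ["c", "d"]])  # toy rows for illustration
print(body)

# With a deployed endpoint, and `client` from the setup cells:
# response = client.invoke_endpoint(
#     EndpointName="<endpoint-name>",          # placeholder
#     ContentType="application/octet-stream",  # assumed content type
#     Body=body,
# )
# result = json.loads(response["Body"].read().decode("utf8"))
```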