## Deploying Multi-Framework Models on GPU

## Installs <a class="anchor" id="installs-and-set-up"></a>

Install required packages using pip

In [20]:
!pip install -qU pip boto3 sagemaker awscli tritonclient[http] transformers


#### Imports and variables

In [1]:
# imports
import boto3
import sagemaker
from sagemaker import get_execution_role
import time
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

#variables
prefix = "mme-gpu"
model_name = "xdistilbert"
pytorch_model_file_name = f"{model_name}_pt.tar.gz"
tensorrt_model_file_name = f"{model_name}_trt.tar.gz"
s3_client = boto3.client("s3")
ts = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

# sagemaker variables
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
bucket = sagemaker_session.default_bucket()

# endpoint variables
sm_model_name = f"{prefix}-mdl-{ts}"
endpoint_config_name = f"{prefix}-epc-{ts}"
endpoint_name = f"{prefix}-ep-{ts}"
model_data_url = f"s3://{bucket}/{prefix}/"
instance_type = "ml.g5.xlarge"

## Creating Model Artifacts <a class="anchor" id="pytorch-efficientnet-model"></a>



<div class="alert alert-info"><strong> Note </strong>
We are demonstrating deployment with
</div>

### Prepare PyTorch Model  <a class="anchor" id="create-pytorch-model"></a>

Run the cell below and check out the [pt_exporter.py](./workspace/pt_exporter.py) file for more details

In [2]:
!docker run --gpus=all --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
            -v `pwd`/workspace:/workspace -w /workspace nvcr.io/nvidia/pytorch:22.12-py3 \
            /bin/bash generate_model_pytorch.sh

Unable to find image 'nvcr.io/nvidia/pytorch:22.12-py3' locally
22.12-py3: Pulling from nvidia/pytorch



In [5]:
!mkdir -p model_repository/xdistilbert_pt/

In [6]:
%%writefile model_repository/xdistilbert_pt/config.pbtxt
backend: "pytorch"
max_batch_size: 224
input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "INPUT__1"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "INPUT__2"
    data_type: TYPE_INT32
    dims: [128]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [6]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}

Writing model_repository/xdistilbert_pt/config.pbtxt


### Prepare TensorRT Model <a class="anchor" id="create-tensorrt-model"></a>

- Load the pre-trained xdistilbert model in PyTorch.
- Convert it to an ONNX representation using the torch ONNX exporter.
- Use the TensorRT `trtexec` command to create the model plan to be hosted with Triton.
- The script for exporting this model can be found [here](./workspace/generate_model_trt.sh).

Run the cell below and check out the script for more details.

<div class="alert alert-info"><strong> Note </strong>
This step takes around 8 minutes to complete. While it runs, look at the logs in the cell below to understand the TensorRT optimizations.
</div>
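As a rough illustration of the `trtexec` step, the sketch below only assembles a command line for a dynamic-batch engine. The input names, sequence length, and maximum batch size come from the `config.pbtxt` used in this notebook. The file paths and the optimal batch size of 16 are assumptions, and the actual `generate_model_trt.sh` script may use different flags.

```python
import shlex

# Input tensor names and sequence length taken from the config.pbtxt in this notebook.
INPUT_NAMES = ["input_ids", "attention_mask", "token_type_ids"]
SEQ_LEN = 128


def shape_spec(batch_size, input_names=INPUT_NAMES, seq_len=SEQ_LEN):
    """Format shapes the way trtexec expects: name:BxS,name:BxS,..."""
    return ",".join(f"{name}:{batch_size}x{seq_len}" for name in input_names)


def build_trtexec_command(onnx_path="model.onnx", plan_path="model.plan",
                          min_batch=1, opt_batch=16, max_batch=224):
    """Return a trtexec argument list (paths and opt_batch are assumed values)."""
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={plan_path}",
        f"--minShapes={shape_spec(min_batch)}",
        f"--optShapes={shape_spec(opt_batch)}",
        f"--maxShapes={shape_spec(max_batch)}",
    ]


command = build_trtexec_command()
print(shlex.join(command))
```

The maximum shape should agree with `max_batch_size` in the Triton configuration, otherwise Triton may accept batches that the engine cannot run.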

In [7]:
!docker run --gpus=all --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
            -v `pwd`/workspace:/workspace -w /workspace nvcr.io/nvidia/pytorch:22.12-py3 \
            /bin/bash generate_model_trt.sh


== PyTorch ==

NVIDIA Release 22.12 (build 49968248)
PyTorch Version 1.14.0a0+410ce96


In [8]:
!mkdir -p model_repository/xdistilbert_trt/

In [9]:
%%writefile model_repository/xdistilbert_trt/config.pbtxt
name: "xdistilbert_trt"
backend: "tensorrt"
max_batch_size: 224
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT32
    dims: [128]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [6]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}

Writing model_repository/xdistilbert_trt/config.pbtxt


### Export model artifacts to S3 <a class="anchor" id="export-to-s3"></a>

SageMaker expects each model to be packaged as a `.tar.gz` archive, and the archive must also satisfy the Triton container requirements: a directory named after the model, a numbered version subdirectory, and a `config.pbtxt` file. `tar` the folder containing the model file and upload it to S3.
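To make the expected archive layout concrete, here is a minimal sketch that packages a model directory with Python's `tarfile` module and checks that the archive contains `<model>/config.pbtxt` and `<model>/1/<model file>`. The helper names `package_model` and `check_layout` are hypothetical, and the demo uses empty placeholder files rather than the real model.

```python
import os
import tarfile
import tempfile


def package_model(repo_dir, model_dir_name, out_path):
    """Tar model_dir_name (relative to repo_dir) and return the archive's member names."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(os.path.join(repo_dir, model_dir_name), arcname=model_dir_name)
    with tarfile.open(out_path, "r:gz") as tar:
        return sorted(tar.getnames())


def check_layout(names, model_dir_name, model_file):
    """Check for <model>/config.pbtxt and <model>/1/<model_file> in the archive."""
    required = {f"{model_dir_name}/config.pbtxt", f"{model_dir_name}/1/{model_file}"}
    return required.issubset(names)


# Demo with placeholder files so the sketch runs without the real model.
with tempfile.TemporaryDirectory() as tmp:
    repo = os.path.join(tmp, "model_repository")
    version_dir = os.path.join(repo, "xdistilbert_pt", "1")
    os.makedirs(version_dir)
    open(os.path.join(repo, "xdistilbert_pt", "config.pbtxt"), "w").close()
    open(os.path.join(version_dir, "model.pt"), "w").close()
    names = package_model(repo, "xdistilbert_pt", os.path.join(tmp, "xdistilbert_pt.tar.gz"))
    layout_ok = check_layout(names, "xdistilbert_pt", "model.pt")
    print(names, layout_ok)
```

A check like this before uploading can catch a missing version directory early, which otherwise only surfaces as a load failure at invocation time.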

In [3]:
!mkdir -p model_repository/xdistilbert_pt/1/
!cp -f workspace/model.pt model_repository/xdistilbert_pt/1/

In [4]:
!tar -C model_repository -czf $pytorch_model_file_name xdistilbert_pt
model_uri_pt = sagemaker_session.upload_data(path=pytorch_model_file_name, key_prefix=prefix)

In [5]:
print(f"PyTorch Model S3 location: {model_uri_pt}")

PyTorch Model S3 location: s3://sagemaker-us-west-2-354625738399/mme-gpu/xdistilbert_pt.tar.gz


In [None]:
!mkdir -p model_repository/xdistilbert_trt/1/
!cp -f workspace/model.plan model_repository/xdistilbert_trt/1/

In [6]:
!tar -C model_repository -czf $tensorrt_model_file_name xdistilbert_trt
model_uri_trt = sagemaker_session.upload_data(path=tensorrt_model_file_name, key_prefix=prefix)

In [7]:
print(f"TensorRT Model S3 location: {model_uri_trt}")

TensorRT Model S3 location: s3://sagemaker-us-west-2-354625738399/mme-gpu/xdistilbert_trt.tar.gz


### Deploy Models with MME <a class="anchor" id="deploy-models-with-mme"></a>

We will now deploy the xtreme distilBERT model with two different framework backends, PyTorch and TensorRT, to a SageMaker MME.


<div class="alert alert-info"> <strong> Note </strong>
You can deploy thousands of models. The models can use the same framework or, as shown in this notebook, different frameworks.
</div>

We will use the AWS SDK for Python (Boto) APIs [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model), [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) to create a multi-model endpoint.

### Define the serving container  <a class="anchor" id="define-container-def"></a>

In the container definition, set `ModelDataUrl` to the S3 directory that contains all the models the SageMaker multi-model endpoint will load and serve predictions from. Set `Mode` to `MultiModel` to indicate that SageMaker should create the endpoint with MME container specifications. We use a container image that supports deploying multi-model endpoints on GPU; see MME [container images](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html#multi-model-support) for more details.

### SageMaker Triton Container Image

In [8]:
# account mapping for SageMaker MME Triton Image
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}

region = boto3.Session().region_name
if region not in account_id_map:
    # Raising a bare string is a TypeError in Python 3; raise a real exception instead
    raise ValueError(f"Unsupported region: {region}")

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
mme_triton_image_uri = (
    "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.12-py3".format(
        account_id=account_id_map[region], region=region, base=base
    )
)

In [9]:
container = {"Image": mme_triton_image_uri, "ModelDataUrl": model_data_url, "Mode": "MultiModel"}

### Create a MME object <a class="anchor" id="create-mme-model-obj"></a>

Using the SageMaker boto3 client, create the model with the [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model) API. We pass the container definition to `create_model` along with `ModelName` and `ExecutionRoleArn`.


In [10]:
create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

Model Arn: arn:aws:sagemaker:us-west-2:354625738399:model/mme-gpu-mdl-2023-01-19-23-25-42


### Define configuration for the MME<a class="anchor" id="config-mme"></a>

Create a multi-model endpoint configuration using the [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) boto3 API. Specify an accelerated GPU computing instance in `InstanceType` (we will use the same instance type that we are using to host our SageMaker Notebook). For production use cases, we recommend configuring your endpoints with at least two instances. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.




In [11]:
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Endpoint Config Arn: arn:aws:sagemaker:us-west-2:354625738399:endpoint-config/mme-gpu-epc-2023-01-19-23-25-42


### Create MME  <a class="anchor" id="create-mme"></a>

Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to **InService** once the deployment is successful.

In [12]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-west-2:354625738399:endpoint/mme-gpu-ep-2023-01-19-23-25-42


### Describe MME <a class="anchor" id="describe-mme"></a>

Now, we check the status of the endpoint using `describe_endpoint`. This step takes about 5 minutes to complete, and you should see the "Status: InService" message before you proceed to the next cells.

In [13]:
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:354625738399:endpoint/mme-gpu-ep-2023-01-19-23-25-42
Status: InService
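The polling loop above runs for as long as the endpoint reports `Creating`. A minimal sketch of the same idea with an upper bound on the wait is shown below. `wait_for_status` is a hypothetical helper, the 30-minute timeout is an assumed value, and the status source is injected as a callable so the example runs offline with a stand-in.

```python
import time


def wait_for_status(get_status, pending=("Creating", "Updating"),
                    poll_seconds=60, timeout_seconds=1800, sleep=time.sleep):
    """Poll get_status() until it leaves the pending states or the timeout expires."""
    waited = 0
    status = get_status()
    while status in pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Stand-in for sm_client.describe_endpoint(...)["EndpointStatus"], so the sketch runs offline.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

Against the real endpoint, the callable would be `lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]`. boto3 also offers a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which covers the same need without custom code.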


## Helper functions to prepare Input Payload <a class="anchor" id="helper-functions"></a>

The following methods transform the sample text we will use for inference into a payload that can be sent to the Triton server. They are used for both the PyTorch and TensorRT xdistilbert models.

The `tritonclient` package provides utility methods to generate the payload without having to know the details of the specification. We'll use the following methods to convert our inference request into a binary format which provides lower latencies for inference.

In [20]:
import tritonclient.http as httpclient
import numpy as np
from transformers import AutoTokenizer

tokenizer_name = "bergum/xtremedistil-emotion"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

def tokenize_text(tokenizer, text):
    MAX_LEN = 128
    tokenized_text = tokenizer(text, padding='max_length', max_length=MAX_LEN, add_special_tokens=True, return_tensors='np')
    return tokenized_text.input_ids, tokenized_text.attention_mask, tokenized_text.token_type_ids

def _get_sample_tokenized_text_binary(text, input_names, output_names):
    inputs = []
    outputs = []
    input_ids, attention_mask, token_type_ids = tokenize_text(tokenizer, text)
    inputs.append(httpclient.InferInput(input_names[0], input_ids.shape, "INT32"))
    inputs.append(httpclient.InferInput(input_names[1], attention_mask.shape, "INT32"))
    inputs.append(httpclient.InferInput(input_names[2], token_type_ids.shape, "INT32"))

    inputs[0].set_data_from_numpy(input_ids.astype(np.int32), binary_data=True)
    inputs[1].set_data_from_numpy(attention_mask.astype(np.int32), binary_data=True)
    inputs[2].set_data_from_numpy(token_type_ids.astype(np.int32), binary_data=True)
    
    outputs.append(httpclient.InferRequestedOutput(output_names[0], binary_data=True))
    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
        inputs, outputs=outputs
    )
    return request_body, header_length

def get_sample_tokenized_text_binary_pt(text):
    return _get_sample_tokenized_text_binary(text, ["INPUT__0", "INPUT__1", "INPUT__2"], ["OUTPUT__0"])


def get_sample_tokenized_text_binary_trt(text):
    return _get_sample_tokenized_text_binary(text, ["input_ids", "attention_mask", "token_type_ids"], ["logits"])

def get_predictions(logits):
    CLASSES = ["SADNESS", "JOY", "LOVE", "ANGER", "FEAR", "SURPRISE"]
    predictions = []
    for i in range(len(logits)):
        pred_class_idx = np.argmax(logits[i])
        predictions.append(CLASSES[pred_class_idx])
    return predictions
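`get_predictions` returns only the top class. If a confidence score is also useful, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python for a single row of logits. The helper names are hypothetical and the example logits are made up for illustration, not produced by the deployed model.

```python
import math

CLASSES = ["SADNESS", "JOY", "LOVE", "ANGER", "FEAR", "SURPRISE"]


def softmax(logits):
    """Convert one row of logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_prediction(logits):
    """Return (class_name, probability) for one row of logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


# Made-up logits for illustration only; not output from the deployed model.
example_logits = [-1.2, 4.1, 0.3, -0.8, -2.0, 0.5]
label, confidence = top_prediction(example_logits)
print(label, round(confidence, 3))
```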

### Invoke target model on Multi Model Endpoint

Once the endpoint is successfully created, we can send inference requests to the multi-model endpoint using the `invoke_endpoint` API. We specify the `TargetModel` in the invocation call and pass in the payload for each model type. Sample invocations for the PyTorch model and the TensorRT model are shown below.
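`TargetModel` is a path relative to the `ModelDataUrl` prefix configured on the model, so the archive that gets loaded is the prefix joined with the target name. The sketch below is only a mental model of that lookup, which SageMaker performs internally. `resolve_target_model` is a hypothetical helper and the bucket name is made up.

```python
def resolve_target_model(model_data_url, target_model):
    """Join the endpoint's ModelDataUrl prefix with a TargetModel relative path."""
    if not model_data_url.startswith("s3://"):
        raise ValueError("ModelDataUrl must be an s3:// URI")
    return model_data_url.rstrip("/") + "/" + target_model.lstrip("/")


# Hypothetical bucket name, for illustration only.
example_url = "s3://example-bucket/mme-gpu/"
print(resolve_target_model(example_url, "xdistilbert_pt.tar.gz"))
print(resolve_target_model(example_url, "xdistilbert_trt.tar.gz"))
```

This is why both archives were uploaded under the same `prefix` that was used to build `model_data_url`.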

### Invoke PyTorch Model <a class="anchor" id="invoke-pytorch-model"></a>

In [21]:
sample_text = "I really enjoyed deploying thousands of NLP models using Triton on SageMaker Multi-Model Endpoint"
request_body, header_length = get_sample_tokenized_text_binary_pt(sample_text)

In [22]:
response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType='application/vnd.sagemaker-triton.binary+json;json-header-size={}'.format(header_length),
                                  Body=request_body,
                                  TargetModel='xdistilbert_pt.tar.gz')

# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response['ContentType'][len(header_length_prefix):]
output_name = "OUTPUT__0"
# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
    response['Body'].read(), header_length=int(header_length_str))
logits = result.as_numpy(output_name)
predictions = get_predictions(logits)

In [23]:
predictions

['JOY']
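The cell above recovers the JSON header size by slicing the response `ContentType` at a fixed prefix length. A slightly more defensive variant, sketched below with a hypothetical helper name, checks the media type and looks up the `json-header-size` parameter explicitly so that an unexpected content type fails with a clear error. The example header value is made up.

```python
def parse_json_header_size(content_type):
    """Extract json-header-size from a Triton binary+json ContentType value."""
    media_type, _, params = content_type.partition(";")
    if media_type.strip() != "application/vnd.sagemaker-triton.binary+json":
        raise ValueError(f"Unexpected content type: {content_type!r}")
    for param in params.split(";"):
        key, _, value = param.strip().partition("=")
        if key == "json-header-size":
            return int(value)
    raise ValueError("json-header-size parameter is missing")


# Example header value; the number is made up for illustration.
example = "application/vnd.sagemaker-triton.binary+json;json-header-size=123"
print(parse_json_header_size(example))
```

The returned integer can be passed as `header_length` to `parse_response_body` in place of `int(header_length_str)`.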

### Invoke TensorRT Model <a class="anchor" id="invoke-tensorrt-model"></a>

In [24]:
request_body, header_length = get_sample_tokenized_text_binary_trt(sample_text)

In [25]:
response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType='application/vnd.sagemaker-triton.binary+json;json-header-size={}'.format(header_length),
                                  Body=request_body,
                                  TargetModel='xdistilbert_trt.tar.gz')

# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response['ContentType'][len(header_length_prefix):]
output_name = "logits"
# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
    response['Body'].read(), header_length=int(header_length_str))
logits = result.as_numpy(output_name)
predictions = get_predictions(logits)

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{"error":"load failed for model 'c81c0b68c414c80f8b64d39f33e1f541': version 1 is at UNAVAILABLE state: Internal: unable to load plan file to auto complete config: /opt/ml/models/c81c0b68c414c80f8b64d39f33e1f541/model/xdistilbert_trt/1/model.plan;\n"}". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/mme-gpu-ep-2023-01-19-23-25-42 in account 354625738399 for more information.

In [None]:
predictions

# Deploying a Thousand Models to GPUs using MME

Let's say you have a thousand customer-specific distilBERT models, a mixture of frequently and infrequently accessed models coming from different frameworks (PyTorch, TensorFlow, ONNX, TensorRT), and assume most of them have stringent latency requirements.

Deploying these 1000 models on GPU instances like `g5.xlarge` (price: `$1/hr`) using single-model endpoints would take ~1000 instances, costing you `$1,000`/hour.

By leveraging Triton on SageMaker MME, you can deploy these models behind one multi-model endpoint that autoscales the number of GPU instances, ending up with ~100x fewer instances and therefore ~100x lower cost.
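The arithmetic behind this estimate is worked through below, using the rounded `$1/hr` price and the ~100x consolidation factor stated above. Both are the text's approximations, not measured values.

```python
import math

NUM_MODELS = 1000
PRICE_PER_INSTANCE_HOUR = 1.0   # USD, the rounded g5.xlarge price quoted above
CONSOLIDATION_FACTOR = 100      # the ~100x reduction claimed above

single_model_instances = NUM_MODELS  # one instance per single-model endpoint
mme_instances = math.ceil(NUM_MODELS / CONSOLIDATION_FACTOR)

single_model_cost = single_model_instances * PRICE_PER_INSTANCE_HOUR
mme_cost = mme_instances * PRICE_PER_INSTANCE_HOUR

print(f"Single-model endpoints: {single_model_instances} instances, ${single_model_cost:,.0f}/hour")
print(f"MME: ~{mme_instances} instances, ~${mme_cost:,.0f}/hour")
```

The actual instance count depends on model size, traffic, and how many models fit in GPU memory at once, so the factor should be treated as an order-of-magnitude figure.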

## Create 1000 models to be loaded to SageMaker MME

This step will take a few minutes to complete, as we are copying 1000 files to S3.

In [None]:
NUM_MODELS = 1000
for i in range(1, NUM_MODELS+1):
    customer_model_name = f"xdistilbert_customer{i}.tar.gz"
    model_copy = f"{model_data_url}{customer_model_name}"
    !aws s3 cp $model_data_url$tensorrt_model_file_name $model_copy

In [None]:
!aws s3 ls $model_data_url
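The loop above starts a new `aws` CLI process for every copy. As an alternative sketch, the same copies can be issued through boto3's `copy_object`, which performs the copy on the S3 side. `plan_copies` and `copy_models` are hypothetical helpers, and the example runs against a recording stand-in rather than a real S3 client, with a made-up bucket name.

```python
def plan_copies(prefix, source_name, num_models):
    """Return (source_key, destination_key) pairs for the per-customer copies."""
    source_key = f"{prefix}/{source_name}"
    return [
        (source_key, f"{prefix}/xdistilbert_customer{i}.tar.gz")
        for i in range(1, num_models + 1)
    ]


def copy_models(s3_client, bucket, copies):
    """Issue one server-side copy_object call per planned copy."""
    for source_key, dest_key in copies:
        s3_client.copy_object(
            Bucket=bucket,
            Key=dest_key,
            CopySource={"Bucket": bucket, "Key": source_key},
        )
    return len(copies)


class RecordingS3Client:
    """Offline stand-in for boto3's S3 client; it only records the calls."""

    def __init__(self):
        self.calls = []

    def copy_object(self, **kwargs):
        self.calls.append(kwargs)


fake_client = RecordingS3Client()
copies = plan_copies("mme-gpu", "xdistilbert_trt.tar.gz", 3)
copied = copy_models(fake_client, "example-bucket", copies)
print(copied, fake_client.calls[0]["Key"])
```

In the notebook, the call would be `copy_models(s3_client, bucket, plan_copies(prefix, tensorrt_model_file_name, NUM_MODELS))`.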

## Set up MME AutoScaling

In [None]:
auto_scaling_client = boto3.client('application-autoscaling')

# endpoint_name and the 'AllTraffic' variant come from the endpoint created above
resource_id = 'endpoint/' + endpoint_name + '/variant/' + 'AllTraffic'
response = auto_scaling_client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=13
)

# Target-tracking policy that keeps average GPU memory utilization near 85%
response = auto_scaling_client.put_scaling_policy(
    PolicyName='GPUUtil-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 85.0,
        'CustomizedMetricSpecification':
        {
            'MetricName': 'GPUMemoryUtilization',
            'Namespace': '/aws/sagemaker/Endpoints',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': endpoint_name},
                {'Name': 'VariantName', 'Value': 'AllTraffic'}
            ],
            'Statistic': 'Average',
            'Unit': 'Percent'
        },
        'ScaleInCooldown': 600,   # seconds to wait before removing instances
        'ScaleOutCooldown': 1     # seconds to wait before adding more instances
    }
)

## Put Load on MME Endpoint to see it scale

In [None]:

def predict_model(text, model_name):
    print(f"Using model {model_name} to predict")
    
    request_body, header_length = get_sample_tokenized_text_binary_trt(text)
    
    start_time = time.time()
    
    response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType='application/vnd.sagemaker-triton.binary+json;json-header-size={}'.format(header_length),
                                  Body=request_body,
                                  TargetModel=model_name)
    
    duration = time.time() - start_time
    
    # Parse json header size length from the response
    header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
    header_length_str = response['ContentType'][len(header_length_prefix):]
    output_name = "logits"
    # Read response body
    result = httpclient.InferenceServerClient.parse_response_body(
        response['Body'].read(), header_length=int(header_length_str))
    logits = result.as_numpy(output_name)
    predictions = get_predictions(logits)
    
    print(f"prediction: {predictions}, took {int(duration * 1000)} ms\n")

In [None]:
import random
for _ in range(10):
    i = random.randint(1, NUM_MODELS)  # randint needs both bounds; models are numbered 1..NUM_MODELS
    customer_model_name = f"xdistilbert_customer{i}.tar.gz"
    predict_model(sample_text, customer_model_name)
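The loop above sends ten requests one after another, which is unlikely to put enough pressure on the endpoint to trigger scaling. A minimal sketch of a concurrent load generator is shown below. `run_load` and `timed_call` are hypothetical helpers, the request and worker counts are assumed values, and a stand-in `fake_predict` replaces the real endpoint call so the example runs offline.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(predict, model_name):
    """Call predict(model_name) and return the elapsed time in milliseconds."""
    start = time.perf_counter()
    predict(model_name)
    return (time.perf_counter() - start) * 1000


def run_load(predict, num_models, num_requests=50, workers=8, seed=0):
    """Send num_requests to randomly chosen customer models from a thread pool."""
    rng = random.Random(seed)
    names = [
        f"xdistilbert_customer{rng.randint(1, num_models)}.tar.gz"
        for _ in range(num_requests)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda name: timed_call(predict, name), names))
    return {
        "requests": len(latencies),
        "median_ms": statistics.median(latencies),
        "max_ms": max(latencies),
    }


# Stand-in for the real endpoint call so the sketch runs offline.
def fake_predict(model_name):
    time.sleep(0.001)


summary = run_load(fake_predict, num_models=1000, num_requests=20, workers=4)
print(summary)
```

Against the real endpoint, `predict` would be `lambda name: predict_model(sample_text, name)`. The gap between the median and the maximum latency gives a rough view of cold loads, where a model is first fetched from S3, versus warm invocations.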

```
xdistilbert_english
.
.
.
xdistilbert_chinese 
xdistilbert_APAC
.
.
.
xdistilbert_NALA
```

# Clean Up

In [38]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)

{'ResponseMetadata': {'RequestId': '70ec55f8-d2d6-4ac4-a3ab-94e073748b0f',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '70ec55f8-d2d6-4ac4-a3ab-94e073748b0f',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Thu, 19 Jan 2023 23:19:17 GMT'},
  'RetryAttempts': 0}}