# Deploying Models from Multiple Frameworks on GPU using MME

## Installs <a class="anchor" id="installs-and-set-up"></a>

Install required packages using pip

In [None]:
!pip install -qU pip boto3 sagemaker awscli tritonclient[http] transformers datasets

#### Imports and variables

In [1]:
# imports
import boto3
import sagemaker
from sagemaker import get_execution_role
import time
import random
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

#variables
prefix = "nlp-mme-gpu"
model_name = "xdistilbert"
pytorch_model_file_name = f"{model_name}_pt.tar.gz"
tensorrt_model_file_name = f"{model_name}_trt.tar.gz"
s3_client = boto3.client("s3")
ts = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

# sagemaker variables
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
bucket = sagemaker_session.default_bucket()

# endpoint variables
sm_model_name = f"{prefix}-mdl-{ts}"
endpoint_config_name = f"{prefix}-epc-{ts}"
endpoint_name = f"{prefix}-ep-{ts}"
model_data_url = f"s3://{bucket}/{prefix}/"
instance_type = "ml.g5.xlarge"

## Creating Model Artifacts <a class="anchor" id="pytorch-efficientnet-model"></a>


### Prepare PyTorch Model  <a class="anchor" id="create-pytorch-model"></a>

Run the cell below and check out the [pt_exporter.py](./workspace/pt_exporter.py) file for more details

In [None]:
!docker run --gpus=all --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
            -v `pwd`/workspace:/workspace -w /workspace nvcr.io/nvidia/pytorch:22.12-py3 \
            /bin/bash generate_model_pytorch.sh

In [None]:
!mkdir -p model_repository/xdistilbert_pt/

In [None]:
%%writefile model_repository/xdistilbert_pt/config.pbtxt
backend: "pytorch"
max_batch_size: 224
input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "INPUT__1"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "INPUT__2"
    data_type: TYPE_INT32
    dims: [128]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [6]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}

### Prepare TensorRT Model <a class="anchor" id="create-tensorrt-model"></a>

- Load the pre-trained xdistilbert PyTorch model from Hugging Face.
- Convert it to an ONNX representation using the torch ONNX exporter.
- Use the TensorRT `trtexec` command to create the model plan to be hosted with Triton.
- The script for exporting this model can be found [here](./workspace/generate_model_trt.sh).

Execute the cell below and check out the script for more details.

<div class="alert alert-info"><strong> Note </strong>
This step takes around 10 minutes to complete. While the step is running, please take a look at the logs in the below cell to understand TensorRT optimizations
</div>

In [None]:
!docker run --gpus=all --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
            -v `pwd`/workspace:/workspace -w /workspace nvcr.io/nvidia/pytorch:22.12-py3 \
            /bin/bash generate_model_trt.sh

In [None]:
!mkdir -p model_repository/xdistilbert_trt/

In [None]:
%%writefile model_repository/xdistilbert_trt/config.pbtxt
name: "xdistilbert_trt"
backend: "tensorrt"
max_batch_size: 224
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT32
    dims: [128]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [6]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}

### Export model artifacts to S3 <a class="anchor" id="export-to-s3"></a>

SageMaker expects each model to be packaged as a `.tar.gz` archive whose contents also satisfy the Triton container requirements: a directory named after the model that holds the `config.pbtxt` file and a numbered version subdirectory with the model file. `tar` the folder containing the model and upload it to S3.
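
To make the expected layout concrete, here is a minimal sketch that builds an archive with the same structure from placeholder files and lists its members. The helper name `build_model_archive` and the temporary directory are illustrative assumptions, not part of the notebook; the real archives are produced by the `tar` commands in the cells that follow.

```python
import tarfile
import tempfile
from pathlib import Path


def build_model_archive(repo_root, model_name, archive_path):
    """Tar one Triton model directory and return the member names in the archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the model directory as the top-level entry of the archive
        tar.add(repo_root / model_name, arcname=model_name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Placeholder files stand in for the real config and exported model (assumption).
tmp = Path(tempfile.mkdtemp())
model_dir = tmp / "model_repository" / "xdistilbert_pt"
(model_dir / "1").mkdir(parents=True)
(model_dir / "config.pbtxt").write_text('backend: "pytorch"\n')
(model_dir / "1" / "model.pt").write_bytes(b"placeholder")

members = build_model_archive(
    tmp / "model_repository", "xdistilbert_pt", tmp / "xdistilbert_pt.tar.gz"
)
for name in members:
    print(name)
```

The archive has the model directory at its top level, with `config.pbtxt` next to the version folder `1/`, which is the same shape the `tar -C model_repository ...` commands produce.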

In [None]:
!mkdir -p model_repository/xdistilbert_pt/1/
!cp -f workspace/model.pt model_repository/xdistilbert_pt/1/

In [2]:
!tar -C model_repository -czf $pytorch_model_file_name xdistilbert_pt
model_uri_pt = sagemaker_session.upload_data(path=pytorch_model_file_name, key_prefix=prefix)

In [3]:
print(f"PyTorch Model S3 location: {model_uri_pt}")

PyTorch Model S3 location: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_pt.tar.gz


In [None]:
!mkdir -p model_repository/xdistilbert_trt/1/
!cp -f workspace/model.plan model_repository/xdistilbert_trt/1/

In [4]:
!tar -C model_repository -czf $tensorrt_model_file_name xdistilbert_trt
model_uri_trt = sagemaker_session.upload_data(path=tensorrt_model_file_name, key_prefix=prefix)

In [5]:
print(f"TensorRT Model S3 location: {model_uri_trt}")

TensorRT Model S3 location: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_trt.tar.gz


### Deploy Models with MME <a class="anchor" id="deploy-models-with-mme"></a>

We will now deploy the xtreme distilBERT model with two different framework backends, PyTorch and TensorRT, to a SageMaker MME.


<div class="alert alert-info"> <strong> Note </strong>
You can deploy thousands of models. The models can all use the same framework, or they can use different frameworks as shown in this notebook.
</div>

We will use the AWS SDK for Python (Boto3) APIs [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model), [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) to create a multi-model endpoint.

### Define the serving container  <a class="anchor" id="define-container-def"></a>

In the container definition, set `ModelDataUrl` to the S3 directory that contains all the models that the SageMaker multi-model endpoint will load and serve predictions from. Set `Mode` to `MultiModel` to indicate that SageMaker should create the endpoint with MME container specifications. We set the container with an image that supports deploying multi-model endpoints with GPU; see MME [container images](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html#multi-model-support) for more details.

### SageMaker Triton Container Image

In [42]:
# account mapping for SageMaker MME Triton Image
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}

region = boto3.Session().region_name
if region not in account_id_map:
    # Raising a plain string is a TypeError in Python 3; raise a real exception instead
    raise ValueError(f"Unsupported region: {region}")

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
mme_triton_image_uri = (
    "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.12-py3".format(
        account_id=account_id_map[region], region=region, base=base
    )
)

In [43]:
container = {"Image": mme_triton_image_uri, "ModelDataUrl": model_data_url, "Mode": "MultiModel"}

### Create a MME object <a class="anchor" id="create-mme-model-obj"></a>

Using the SageMaker boto3 client, create the model using [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model) API. We will pass the container definition to the create model API along with ModelName and ExecutionRoleArn.


In [44]:
create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

Model Arn: arn:aws:sagemaker:us-west-2:354625738399:model/nlp-mme-gpu-mdl-2023-01-27-02-53-04


### Define configuration for the MME<a class="anchor" id="config-mme"></a>

Create a multi-model endpoint configuration using the [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) boto3 API. Specify an accelerated GPU computing instance in `InstanceType` (we will use the same instance type that we are using to host our SageMaker Notebook). For real-life use cases, we recommend configuring your endpoints with at least two instances. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.




In [45]:
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Endpoint Config Arn: arn:aws:sagemaker:us-west-2:354625738399:endpoint-config/nlp-mme-gpu-epc-2023-01-27-02-53-04


### Create MME  <a class="anchor" id="create-mme"></a>

Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to **InService** once the deployment is successful.

In [46]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-west-2:354625738399:endpoint/nlp-mme-gpu-ep-2023-01-27-02-53-04


### Describe MME <a class="anchor" id="describe-mme"></a>

Now, we check the status of the endpoint using `describe_endpoint`. This step takes about 7 minutes to complete; you should see the `Status: InService` message before you proceed to the next cells.

In [47]:
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:354625738399:endpoint/nlp-mme-gpu-ep-2023-01-27-02-53-04
Status: InService
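
The polling loop above stops whenever the status is no longer `Creating`, so a `Failed` endpoint would pass through silently. A minimal sketch of a stricter variant follows. The helper name `wait_for_endpoint`, the timeout values, and the injected `describe` callable are assumptions made so the example runs without AWS access.

```python
import time
from typing import Callable


def wait_for_endpoint(
    describe: Callable[[], dict],
    poll_seconds: float = 60,
    timeout_seconds: float = 1800,
) -> str:
    """Poll until the endpoint is InService; raise on any other final status or on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe()["EndpointStatus"]
        print("Status: " + status)
        if status == "InService":
            return status
        if status != "Creating":
            # Any status other than Creating or InService is treated as an error here
            raise RuntimeError(f"Endpoint ended in status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("Endpoint did not reach InService in time")
        time.sleep(poll_seconds)


# Simulated responses stand in for sm_client.describe_endpoint (assumption for the demo).
fake_responses = iter([{"EndpointStatus": "Creating"}, {"EndpointStatus": "InService"}])
final_status = wait_for_endpoint(lambda: next(fake_responses), poll_seconds=0)
```

In the notebook this would be called as `wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`. The boto3 SageMaker client also offers a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be preferable when custom logging is not needed.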


## Helper functions <a class="anchor" id="helper-functions"></a>

The following helper functions transform a sample text into the payload that can be sent for inference to the Triton server. They will be used by both the PyTorch and TensorRT distilbert models.

The `tritonclient` package provides utility methods to generate the payload without having to know the details of the specification. We'll use the following methods to convert our inference request into a binary format which provides lower latencies for inference.

In [57]:
%%capture
import tritonclient.http as httpclient
import numpy as np
import random
from transformers import AutoTokenizer
from datasets import load_dataset

dataset = load_dataset("emotion")
tokenizer_name = "bergum/xtremedistil-emotion"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

def tokenize_text(tokenizer, text):
    MAX_LEN = 128
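    # truncation=True keeps inputs at MAX_LEN so they always match the dims declared in config.pbtxt
    tokenized_text = tokenizer(text, padding='max_length', truncation=True, max_length=MAX_LEN, add_special_tokens=True, return_tensors='np')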
    return tokenized_text.input_ids, tokenized_text.attention_mask, tokenized_text.token_type_ids

def _get_tokenized_text_binary(text, input_names, output_names):
    inputs = []
    outputs = []
    input_ids, attention_mask, token_type_ids = tokenize_text(tokenizer, text)
    inputs.append(httpclient.InferInput(input_names[0], input_ids.shape, "INT32"))
    inputs.append(httpclient.InferInput(input_names[1], attention_mask.shape, "INT32"))
    inputs.append(httpclient.InferInput(input_names[2], token_type_ids.shape, "INT32"))

    inputs[0].set_data_from_numpy(input_ids.astype(np.int32), binary_data=True)
    inputs[1].set_data_from_numpy(attention_mask.astype(np.int32), binary_data=True)
    inputs[2].set_data_from_numpy(token_type_ids.astype(np.int32), binary_data=True)
    
    outputs.append(httpclient.InferRequestedOutput(output_names[0], binary_data=True))
    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
        inputs, outputs=outputs
    )
    return request_body, header_length

def get_random_text():
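    # randrange excludes the upper bound, so the index is always valid for the test split
    rand_i = random.randrange(len(dataset["test"]))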
    text = dataset["test"]["text"][rand_i]
    return text

def get_tokenized_text_binary_pt(text):
    # Tensor names match the PyTorch model's config.pbtxt
    return _get_tokenized_text_binary(text, ["INPUT__0", "INPUT__1", "INPUT__2"], ["OUTPUT__0"])


def get_tokenized_text_binary_trt(text):
    # Tensor names match the TensorRT model's config.pbtxt
    return _get_tokenized_text_binary(text, ["input_ids", "attention_mask", "token_type_ids"], ["logits"])

def logits2predictions(logits):
    CLASSES = ["SADNESS", "JOY", "LOVE", "ANGER", "FEAR", "SURPRISE"]
    predictions = []
    for i in range(len(logits)):
        pred_class_idx = np.argmax(logits[i])
        predictions.append(CLASSES[pred_class_idx])
    return predictions
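
To illustrate what the binary payload produced by `generate_request_body` roughly looks like, here is a simplified, standard-library-only sketch: a JSON header describing the tensor, followed by the raw little-endian tensor bytes, with the header length reported separately. The function names are hypothetical and the header carries only the fields needed for the illustration; real requests should still be built with `tritonclient`.

```python
import json
import struct


def build_binary_request(name, values, shape):
    """Build a JSON header followed by raw little-endian INT32 tensor bytes."""
    raw = struct.pack(f"<{len(values)}i", *values)
    header = {
        "inputs": [
            {
                "name": name,
                "shape": shape,
                "datatype": "INT32",
                # Tells the server how many raw bytes belong to this input
                "parameters": {"binary_data_size": len(raw)},
            }
        ]
    }
    header_bytes = json.dumps(header).encode("utf-8")
    return header_bytes + raw, len(header_bytes)


def split_binary_body(body, header_length):
    """Split a body back into its parsed JSON header and the raw tensor bytes."""
    return json.loads(body[:header_length]), body[header_length:]


# Four made-up token ids instead of a full 128-token sequence (assumption).
body, header_length = build_binary_request("INPUT__0", [101, 2023, 102, 0], [1, 4])
header, raw = split_binary_body(body, header_length)
print(header["inputs"][0]["shape"], struct.unpack("<4i", raw))
```

The header length is what ends up in the `json-header-size` part of the content type, which lets the receiver find where the JSON stops and the tensor bytes begin.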

## Invoke Target Models on Multi-Model Endpoint

Once the endpoint is successfully created, we can send inference requests to the multi-model endpoint using the `invoke_endpoint` API. We specify the `TargetModel` in the invocation call and pass in the payload for each model type. Sample invocations for the PyTorch model and the TensorRT model are shown below.

### Invoke PyTorch Model <a class="anchor" id="invoke-pytorch-model"></a>

In [58]:
sample_text = get_random_text()
request_body, header_length = get_tokenized_text_binary_pt(sample_text)
response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType='application/vnd.sagemaker-triton.binary+json;json-header-size={}'.format(header_length),
                                  Body=request_body,
                                  TargetModel='xdistilbert_pt.tar.gz')

# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response['ContentType'][len(header_length_prefix):]
output_name = "OUTPUT__0"
# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
    response['Body'].read(), header_length=int(header_length_str))
logits = result.as_numpy(output_name)
predictions = logits2predictions(logits)
print(predictions)

['ANGER']
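
The response parsing above slices the content type by prefix length, which assumes the response always has the binary Triton content type. A small sketch of a more defensive parser follows; the helper name `parse_json_header_size` is an assumption, not part of the notebook.

```python
HEADER_SIZE_PREFIX = "application/vnd.sagemaker-triton.binary+json;json-header-size="


def parse_json_header_size(content_type):
    """Return the JSON header size encoded in a SageMaker Triton binary content type."""
    if not content_type.startswith(HEADER_SIZE_PREFIX):
        # Fail with a clear message instead of a confusing int() error later on
        raise ValueError(f"Unexpected content type: {content_type!r}")
    return int(content_type[len(HEADER_SIZE_PREFIX):])


size = parse_json_header_size(HEADER_SIZE_PREFIX + "123")
print(size)
```

In the cells here it could replace the two parsing lines, for example `header_length=parse_json_header_size(response['ContentType'])`.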


### Invoke TensorRT Model <a class="anchor" id="invoke-tensorrt-model"></a>

In [61]:
sample_text = get_random_text()
request_body, header_length = get_tokenized_text_binary_trt(sample_text)
response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType='application/vnd.sagemaker-triton.binary+json;json-header-size={}'.format(header_length),
                                  Body=request_body,
                                  TargetModel='xdistilbert_trt.tar.gz')

# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response['ContentType'][len(header_length_prefix):]
output_name = "logits"
# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
    response['Body'].read(), header_length=int(header_length_str))
logits = result.as_numpy(output_name)
predictions = logits2predictions(logits)
print(predictions)

['JOY']


# Deploying Numerous Models to GPUs using MME

Let's say we are trying to deploy 1000 customer-specific distilBERT models: a mixture of frequently and infrequently accessed models coming from different frameworks such as PyTorch, TensorFlow, ONNX, and TensorRT.

Deploying these 1000 models on GPU instances like `g5.xlarge` using dedicated single-model endpoints would take ~1000 instances.

By leveraging MME on GPU, you can deploy these models behind a single multi-model endpoint and use up to 100x fewer instances, reducing instance costs by up to **100x**.
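
The arithmetic behind the 100x figure can be sketched as follows. The density of 100 models per instance is an assumption implied by that figure, not a measured value; the real number depends on model size, GPU memory, and traffic patterns.

```python
import math


def instances_needed(num_models, models_per_instance):
    """Smallest number of instances that can host num_models at the given density."""
    return math.ceil(num_models / models_per_instance)


total_models = 1000
dedicated = instances_needed(total_models, models_per_instance=1)  # one endpoint per model
shared = instances_needed(total_models, models_per_instance=100)   # assumed MME density
print(f"dedicated={dedicated}, shared={shared}, reduction={dedicated // shared}x")
```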

## Dynamically adding models to an existing endpoint

It’s easy to deploy a new model to an existing multi-model endpoint. With the endpoint already running, copy a new set of model artifacts to the same S3 location you set up earlier. Client applications are then free to request predictions from that target model, and Amazon SageMaker handles the rest. 

With multi-model endpoints, you don’t need to go through a full endpoint update just to deploy a new model, and you avoid the cost of a separate endpoint for each new model. An S3 copy is all that is needed to deploy.

This step will take around 5 minutes to complete as we are copying 400 files to S3

In [None]:
num_models = 400
for i in range(1, num_models+1):
    customer_model_name = f"xdistilbert_customer{i}.tar.gz"
    model_copy = f"{model_data_url}{customer_model_name}"
    !aws s3 cp $model_data_url$pytorch_model_file_name $model_copy
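
The same copies can also be made from Python instead of shelling out to the AWS CLI. The sketch below uses the boto3 S3 client's `copy_object` call for a server-side copy; the helper name `copy_model_artifacts` and its parameters are illustrative assumptions.

```python
def copy_model_artifacts(s3_client, bucket, source_key, prefix, num_copies):
    """Server-side copy one model archive to num_copies customer-specific keys."""
    new_keys = []
    for i in range(1, num_copies + 1):
        target_key = f"{prefix}/xdistilbert_customer{i}.tar.gz"
        # copy_object copies within S3, so the archive is not downloaded locally
        s3_client.copy_object(
            Bucket=bucket,
            Key=target_key,
            CopySource={"Bucket": bucket, "Key": source_key},
        )
        new_keys.append(target_key)
    return new_keys


# Example call with the notebook's variables (not executed here):
# copy_model_artifacts(s3_client, bucket, f"{prefix}/{pytorch_model_file_name}", prefix, 400)
```

Each new key under the `ModelDataUrl` prefix becomes invocable by passing its file name as `TargetModel`.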

## Invoking Models

In [74]:
def predict_model(text, model_name, show_latency=False):
    print(f"Using model {model_name} to predict")
    
    request_body, header_length = get_tokenized_text_binary_pt(text)
    
    start_time = time.time()
    
    response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType='application/vnd.sagemaker-triton.binary+json;json-header-size={}'.format(header_length),
                                  Body=request_body,
                                  TargetModel=model_name)
    
    duration = time.time() - start_time
    
    # Parse json header size length from the response
    header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
    header_length_str = response['ContentType'][len(header_length_prefix):]
    output_name = "OUTPUT__0"
    # Read response body
    result = httpclient.InferenceServerClient.parse_response_body(
        response['Body'].read(), header_length=int(header_length_str))
    logits = result.as_numpy(output_name)
    predictions = logits2predictions(logits)
    
    if show_latency:
        print(f"prediction: {predictions}, took {int(duration * 1000)} ms\n")
    else:
        print(f"prediction: {predictions}\n")

In [71]:
for i in range(1, 6):
    customer_model_name = f"xdistilbert_customer{i}.tar.gz"
    predict_model(get_random_text(), customer_model_name, show_latency=True)

Using model xdistilbert_customer1.tar.gz to predict
prediction: ['JOY'], took 49 ms

Using model xdistilbert_customer2.tar.gz to predict
prediction: ['JOY'], took 10 ms

Using model xdistilbert_customer3.tar.gz to predict
prediction: ['JOY'], took 9 ms

Using model xdistilbert_customer4.tar.gz to predict
prediction: ['JOY'], took 10 ms

Using model xdistilbert_customer5.tar.gz to predict
prediction: ['FEAR'], took 9 ms



[Show ModelCacheHit]

# Dynamic Model Unloading Behavior

In [75]:
for i in range(1, 300):
    customer_model_name = f"xdistilbert_customer{i}.tar.gz"
    predict_model(text=get_random_text(), model_name=customer_model_name)

Using model xdistilbert_customer1.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer2.tar.gz to predict
prediction: ['LOVE']

Using model xdistilbert_customer3.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer4.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer5.tar.gz to predict
prediction: ['ANGER']

Using model xdistilbert_customer6.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer7.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer8.tar.gz to predict
prediction: ['LOVE']

Using model xdistilbert_customer9.tar.gz to predict
prediction: ['JOY']

Using model xdistilbert_customer10.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer11.tar.gz to predict
prediction: ['LOVE']

Using model xdistilbert_customer12.tar.gz to predict
prediction: ['LOVE']

Using model xdistilbert_customer13.tar.gz to predict
prediction: ['SADNESS']

Using model x

[Show logs to show unloading]

[Show LoadedModelCount and GPUMemoryUtilization]
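
As a mental model for the loading and unloading seen when many models are invoked in sequence, here is a toy cache. It assumes a fixed capacity and least-recently-used eviction purely for illustration; the actual unloading is managed by SageMaker based on memory pressure, and this class is not part of any SageMaker API.

```python
from collections import OrderedDict


class ToyModelCache:
    """Toy LRU cache standing in for the set of models currently loaded on an instance."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.loaded = OrderedDict()

    def invoke(self, model_name):
        if model_name in self.loaded:
            # Already loaded: mark it as most recently used
            self.loaded.move_to_end(model_name)
            return "hit"
        if len(self.loaded) >= self.capacity:
            # No room left: unload the least recently used model first
            self.loaded.popitem(last=False)
        self.loaded[model_name] = True
        return "miss"


cache = ToyModelCache(capacity=2)
results = [cache.invoke(m) for m in ["customer1", "customer2", "customer1", "customer3", "customer2"]]
print(results, list(cache.loaded))
```

In this toy run, invoking `customer3` pushes out `customer2`, so the next request for `customer2` is a miss again, which is the kind of effect a cold invocation shows as higher latency.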

# Autoscaling Behavior

## Set up AutoScaling Policy

In [21]:
auto_scaling_client = boto3.client('application-autoscaling')

resource_id='endpoint/' + endpoint_name + '/variant/' + 'AllTraffic' 
response = auto_scaling_client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity = 1,
    MaxCapacity = 2
)

response = auto_scaling_client.put_scaling_policy(
    PolicyName='GPUMemUtil-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', 
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 80, 
        'CustomizedMetricSpecification':
        {
            'MetricName': 'GPUMemoryUtilization',
            'Namespace': '/aws/sagemaker/Endpoints',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': endpoint_name},
                {'Name': 'VariantName','Value': 'AllTraffic'}
            ],
            'Statistic': 'Average',
            'Unit': 'Percent'
        },
        'ScaleInCooldown': 600,
        'ScaleOutCooldown': 100
    }
)

print("Autoscaling policy for GPU MME endpoint has been set up")

[Show CW alarm being triggered and Endpoint entering Updating state]

### The endpoint remains active while autoscaling

In [None]:
i = 200
customer_model_name = f"xdistilbert_customer{i}.tar.gz"
predict_model(sample_text, customer_model_name)

[Show Endpoint autoscaling to 2 instances]

## Invoke More Models

In [40]:
for i in range(1, 400):
    customer_model_name = f"xdistilbert_customer{i}.tar.gz"
    predict_model(sample_text, customer_model_name)

Using model xdistilbert_customer1.tar.gz to predict


ReadTimeoutError: Read timeout on endpoint URL: "https://runtime.sagemaker.us-west-2.amazonaws.com/endpoints/nlp-mme-gpu-ep-2023-01-27-02-53-04/invocations"
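
One client-side way to tolerate a transient timeout like the one above is to retry the call with exponential backoff. The sketch below is a generic wrapper under that assumption; the helper name `call_with_retries` and its defaults are hypothetical, and it does not address whatever caused the timeout on the endpoint side.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying on the given exception types with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))


# Simulated flaky call: fails twice, then succeeds (stand-in for an endpoint invocation).
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated read timeout")
    return "ok"


outcome = call_with_retries(flaky_call, base_delay=0, retry_on=(TimeoutError,))
```

With the notebook's helpers this could wrap a prediction, for example `call_with_retries(lambda: predict_model(sample_text, customer_model_name), retry_on=(botocore.exceptions.ReadTimeoutError,))`, after `import botocore.exceptions`.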

# Clean Up

## Terminate endpoint and clean up artifacts

In [77]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)

{'ResponseMetadata': {'RequestId': '8eec899c-6253-440e-bf1a-68cd088616ef',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '8eec899c-6253-440e-bf1a-68cd088616ef',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Fri, 27 Jan 2023 05:57:58 GMT'},
  'RetryAttempts': 0}}