# Deploying Models from Multiple Frameworks on GPU using MME

Amazon SageMaker multi-model endpoints(MME) provide a scalable and cost-effective way to deploy large number of deep learning models. Previously, customers had limited options to deploy 100s of deep learning models that need accelerated compute with GPUs. Now, customers can deploy 1000s of deep learning models on GPUs behind one SageMaker endpoint. MME will run multiple models on a GPU, share GPU instances behind an endpoint across multiple models and dynamically load/unload models based on the incoming traffic. With this, customers can significantly save cost and achieve the best price performance.

In this section we show how MME on GPU allows you to deploy ML models from different frameworks like PyTorch, TensorRT, TensorFlow, ONNX, etc. In this example, we show the deployment of PyTorch and TensorRT DistilBERT models on same GPU using SageMaker MME

<div class="alert alert-info"> 💡 <strong> Note </strong>
Set conda_python3 kernel when prompted to set the kernel for this notebook. This notebook was tested with the `conda_python3` kernel on an Amazon SageMaker notebook instance of type `g5.xlarge`.
</div>

### Installs <a class="anchor" id="installs-and-set-up"></a>

Install required packages

In [1]:
!pip install -qU pip boto3 sagemaker awscli tritonclient[http] transformers datasets

### Imports and variables

In [96]:
# imports
import boto3
import sagemaker
from sagemaker import get_execution_role
import time
import random
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from utils import print_safe

# sagemaker variables
prefix = "nlp-mme-gpu"
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
bucket = sagemaker_session.default_bucket()

## Creating Model Artifacts <a class="anchor" id="pytorch-efficientnet-model"></a>


This section presents overview of steps to prepare HuggingFace DistilBERT models to be deployed on SageMaker MME using Triton Inference server model configurations. 

### Prepare PyTorch Model  <a class="anchor" id="create-pytorch-model"></a>


* We load pre-trained PyTorch [Huggingface DistilBERT model](https://huggingface.co/bergum/xtremedistil-emotion) that was finetuned on emotion classification task. 
* Export it to Torchscript serialized format

To perform these steps, we will be using our [pt_exporter.py](./workspace/pt_exporter.py) script and running it within the [PyTorch NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) container.

In [5]:
%%writefile ./workspace/pt_exporter.py
import torch
from transformers import AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForSequenceClassification.from_pretrained("bergum/xtremedistil-emotion", torchscript=True)

model = model.eval()
model = model.to(device)

bs = 224
seq_len = 128
dummy_inputs = [
    torch.randint(1000, (bs, seq_len)).to(device),
    torch.ones(bs, seq_len, dtype=torch.int).to(device),
    torch.zeros(bs, seq_len, dtype=torch.int).to(device),
]

traced_model = torch.jit.trace(model, dummy_inputs)
torch.jit.save(traced_model, "model.pt")

Overwriting ./workspace/pt_exporter.py


Run the cell below to finish preparing the PyTorch DistilBERT model.

In [None]:
!docker run --gpus=all --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
            -v `pwd`/workspace:/workspace -w /workspace nvcr.io/nvidia/pytorch:22.12-py3 \
            /bin/bash generate_model_pytorch.sh

#### Setup PyTorch Model Repository

Now that model artifact is ready we need to set up a model repository containing the model artifact we want to serve along with a [model configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md) file i.e. `config.pbtxt`. This is the expected structure of the model repository:
```
xdistilbert_pt
├── 1
│   └── model.pt
└── config.pbtxt
```

#### PyTorch Model configuration <a class="anchor" id="create-pytorch-model-config"></a>

Model configuration file `config.pbtxt` contains the following:  
- `name`: xdistilbert_pt
- `backend`: pytorch
- `max_batch_size`: maximum batch size 224 that the model supports
- `input` and `output` tensor shapes with the `data_type` 

Additionally, you can specify [instance_group](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups) and [dynamic_batching](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher) properties to achieve high performance inference. 

In [None]:
!mkdir -p model_repository/xdistilbert_pt/1
!cp workspace/model.pt model_repository/xdistilbert_pt/1/

In [97]:
%%writefile model_repository/xdistilbert_pt/config.pbtxt
backend: "pytorch"
max_batch_size: 224
input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "INPUT__1"
    data_type: TYPE_INT32
    dims: [128]
  },
    {
    name: "INPUT__2"
    data_type: TYPE_INT32
    dims: [128]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [6]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}

Overwriting model_repository/xdistilbert_pt/config.pbtxt


### Prepare TensorRT Model <a class="anchor" id="create-tensorrt-model"></a>

- We load pre-trained xdistilbert PyTorch model from Huggingface
- Convert to onnx representation using torch onnx exporter.
- Use TensorRT `trtexec` bash command to create the TensorRT model plan

To perform these steps, we will be running the [generate_model_trt.sh](./workspace/generate_model_trt.sh) script inside the [PyTorch NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) container.

In [6]:
%%writefile ./workspace/generate_model_trt.sh

echo "Installing Transformers..."
pip -q install transformers[onnx]

echo "Exporting model to ONNX..."
python -m transformers.onnx --model=bergum/xtremedistil-emotion \
                            --feature=sequence-classification /workspace/onnx/

export CUDA_MODULE_LOADING=LAZY
echo "Converting ONNX Model to TensorRT FP16 Plan..."
trtexec --onnx=/workspace/onnx/model.onnx \
        --saveEngine=/workspace/model.plan \
        --minShapes=input_ids:1x128,attention_mask:1x128,token_type_ids:1x128 \
        --optShapes=input_ids:16x128,attention_mask:16x128,token_type_ids:16x128 \
        --maxShapes=input_ids:224x128,attention_mask:224x128,token_type_ids:224x128 \
        --fp16 \
        --verbose \
        --memPoolSize=workspace:14000 | tee conversion_trt.txt

echo "Finished exporting all models..."

Overwriting ./workspace/generate_model_trt.sh


Execute the cell below to finish preparing the TensorRT DistilBERT model.

**Note**:
This TensorRT optimization step takes around 10 minutes to complete. While the step is running, feel free to watch this [video](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41306/?start=349) or take a look at the logs in the below cell to see the various TensorRT optimizations like Layer & Tensor Fusion and Reduced Mixed Precision in action.

In [None]:
!docker run --gpus=all --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
            -v `pwd`/workspace:/workspace -w /workspace nvcr.io/nvidia/pytorch:22.12-py3 \
            /bin/bash generate_model_trt.sh

#### Setup TensorRT Model Repository

Similar to PyTorch model, this is the expected structure of the TensorRT model repository:
```
xdistilbert_trt
├── 1
│   └── model.plan
└── config.pbtxt
```

#### TensorRT Model configuration <a class="anchor" id="create-pytorch-model-config"></a>

Model configuration file `config.pbtxt` contains the following  
- `name`: xdistilbert_trt
- `backend`: tensorrt
- `max_batch_size`: maximum batch size 224 that the model supports
- `input` and `output` tensor shapes with the `data_types`

Additionally, you can specify [instance_group](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups) and [dynamic_batching](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher) properties to achieve high performance inference. 

In [None]:
!mkdir -p model_repository/xdistilbert_trt/1/
!cp workspace/model.plan model_repository/xdistilbert_trt/1/

In [None]:
%%writefile model_repository/xdistilbert_trt/config.pbtxt
name: "xdistilbert_trt"
backend: "tensorrt"
max_batch_size: 224
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT32
    dims: [128]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [6]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}

## Export model artifacts to S3 <a class="anchor" id="export-to-s3"></a>

Next, we will package our models as `*.tar.gz` files for uploading to S3. 

In [98]:
pytorch_model_file_name = "xdistilbert_pt.tar.gz"
!tar -C model_repository -czf $pytorch_model_file_name xdistilbert_pt
model_uri_pt = sagemaker_session.upload_data(path=pytorch_model_file_name, key_prefix=prefix)
print_safe(f"PyTorch Model S3 location: {model_uri_pt}")

In [12]:
tensorrt_model_file_name = "xdistilbert_trt.tar.gz"
!tar -C model_repository -czf $tensorrt_model_file_name xdistilbert_trt
model_uri_trt = sagemaker_session.upload_data(path=tensorrt_model_file_name, key_prefix=prefix)
print_safe(f"TensorRT Model S3 location: {model_uri_trt}")

## Setup Multi-Model Endpoint <a class="anchor" id="deploy-models-with-mme"></a>

We will now setup Multi-Model Endpoint on GPU where we can deploy our DistilBERT PyTorch and TensorRT models.

### SageMaker Triton Container Image

First we define that we will be using SageMaker Triton container image which supports deploying multi-model endpoints with GPU.

In [100]:
# account mapping for SageMaker MME Triton Image
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}

region = boto3.Session().region_name
if region not in account_id_map.keys():
    raise ("UNSUPPORTED REGION")

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
mme_triton_image_uri = (
    "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.12-py3".format(
        account_id=account_id_map[region], region=region, base=base
    )
)

### Define the serving container  <a class="anchor" id="define-container-def"></a>

Next, we define the serving container
* In the container definition, define the `ModelDataUrl` to specify the S3 directory that contains all the models that SageMaker multi-model endpoint will use to load  and serve predictions. 
* Set `Mode` to `MultiModel` to indicate SageMaker should create the endpoint with MME specifications.

In [101]:
model_data_url = f"s3://{bucket}/{prefix}/"
container = {"Image": mme_triton_image_uri, "ModelDataUrl": model_data_url, "Mode": "MultiModel"}

### Create SageMaker model <a class="anchor" id="create-mme-model-obj"></a>

We start off by creating a sagemaker model from the model files we uploaded to s3 in the previous step. We do this using the SageMaker boto3 client and [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model) API. We will pass the container definition to the `create model` API along with ModelName and ExecutionRoleArn.


In [102]:
ts = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
sm_model_name = f"{prefix}-model-{ts}"
create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print_safe("Model Arn: " + create_model_response["ModelArn"])

Model Arn: arn:aws:sagemaker:us-west-2:############:model/nlp-mme-gpu-mdl-2023-02-01-04-05-24


### Define configuration for the MME<a class="anchor" id="config-mme"></a>

Next, we create a multi-model endpoint configuration using [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) boto3 API. Specify an accelerated GPU computing instance in InstanceType **(we will use the same instance type that we are using to host our SageMaker Notebook)**. We recommend configuring your endpoints with at least two instances with real-life use-cases. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.




In [103]:
endpoint_config_name = f"{prefix}-epc-{ts}"
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": ml.g5.xlarge,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print_safe("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Endpoint Config Arn: arn:aws:sagemaker:us-west-2:############:endpoint-config/nlp-mme-gpu-epc-2023-02-01-04-05-24


### Create MME  <a class="anchor" id="create-mme"></a>

Using the above endpoint configuration we create a new sagemaker endpoint and wait for the deployment to finish. The status will change to **InService** once the deployment is successful.

In [104]:
endpoint_name = f"{prefix}-ep-{ts}"
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print_safe("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-west-2:############:endpoint/nlp-mme-gpu-ep-2023-02-01-04-05-24


### Describe MME <a class="anchor" id="describe-mme"></a>

Now, we check the status of the endpoint using `describe_endpoint`. This step will take about 7 mins to complete and you should see `Status: InService` message before you proceed to next cells.

In [105]:
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print_safe("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:############:endpoint/nlp-mme-gpu-ep-2023-02-01-04-05-24
Status: InService


## Preparing inference payload <a class="anchor" id="helper-functions"></a>

The following methods help us tokenize text and perform some postprocessing on logits to get final classification prediction. 

In [106]:
import tritonclient.http as httpclient
import numpy as np
import random
from transformers import AutoTokenizer
from datasets import load_dataset

dataset = load_dataset("emotion")
tokenizer_name = "bergum/xtremedistil-emotion"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

def tokenize_text(tokenizer, text):
    MAX_LEN = 128
    tokenized_text = tokenizer(text, padding='max_length', max_length=MAX_LEN, add_special_tokens=True, return_tensors='np')
    return tokenized_text.input_ids, tokenized_text.attention_mask, tokenized_text.token_type_ids

def get_random_text():
    rand_i = random.randint(0, 2000)
    text = dataset["test"]["text"][rand_i]
    return text

def logits2prediction(logits):
    CLASSES = ["SADNESS", "JOY", "LOVE", "ANGER", "FEAR", "SURPRISE"]
    predictions = []
    for i in range(len(logits)):
        pred_class_idx = np.argmax(logits[i])
        predictions.append(CLASSES[pred_class_idx])
    return predictions

No config specified, defaulting to: emotion/split
Found cached dataset emotion (/home/ec2-user/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd)


  0%|          | 0/3 [00:00<?, ?it/s]

While Triton on SageMaker supports json payload format Here we use `binary+json` as the payload format to get better performance for the inference call. The specification of this format is provided [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md).

**Note:** With the `binary+json` format, we have to specify the length of the request metadata in the header to allow Triton to correctly parse the binary payload. This is done using a custom Content-Type header `application/vnd.sagemaker-triton.binary+json;json-header-size={}`.

In [107]:
def _get_tokenized_text_binary(text, input_names, output_names):
    inputs = []
    outputs = []
    input_ids, attention_mask, token_type_ids = tokenize_text(tokenizer, text)
    inputs.append(httpclient.InferInput(input_names[0], input_ids.shape, "INT32"))
    inputs.append(httpclient.InferInput(input_names[1], attention_mask.shape, "INT32"))
    inputs.append(httpclient.InferInput(input_names[2], token_type_ids.shape, "INT32"))

    inputs[0].set_data_from_numpy(input_ids.astype(np.int32), binary_data=True)
    inputs[1].set_data_from_numpy(attention_mask.astype(np.int32), binary_data=True)
    inputs[2].set_data_from_numpy(token_type_ids.astype(np.int32), binary_data=True)
    
    outputs.append(httpclient.InferRequestedOutput(output_names[0], binary_data=True))
    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
        inputs, outputs=outputs
    )
    return request_body, header_length

def get_tokenized_text_binary_pt(text):
    return _get_tokenized_text_binary(text, ["INPUT__0", "INPUT__1", "INPUT__2"], ["OUTPUT__0"])


def get_tokenized_text_binary_trt(text):
    return _get_tokenized_text_binary(text, ["input_ids", "attention_mask", "token_type_ids"], ["logits"])

def read_response(response, output_name):
    # Parse json header size length from the response
    header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
    header_length_str = response['ContentType'][len(header_length_prefix):]
    # Read response body
    result = httpclient.InferenceServerClient.parse_response_body(
        response['Body'].read(), header_length=int(header_length_str))
    logits = result.as_numpy(output_name)
    prediction = logits2prediction(logits)
    return prediction

## Invoking Models on Multi-Model Endpoint

Once the endpoint is successfully created, we can send inference request to multi-model endpoint using `invoke_enpoint` API. We specify the TargetModel in the invocation call and pass in the payload for each model type. Sample invocation for PyTorch model and TensorRT model is shown below

### Invoke TensorRT Model <a class="anchor" id="invoke-tensorrt-model"></a>

Notice the higher latencies on the first invocation of any given model. This is due to the time it takes SageMaker to download the model to the Endpoint instance and then load the model into the inference container. Subsequent invocations of the same model take advantage of the model already being loaded into the inference container and so are fast.

In [84]:
sample_text = get_random_text()
request_body, header_length = get_tokenized_text_binary_trt(sample_text)

start_time = time.time()

response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType=f"application/vnd.sagemaker-triton.binary+json;json-header-size={header_length}",
                                  Body=request_body,
                                  TargetModel='xdistilbert_trt.tar.gz')
duration = time.time() - start_time

output_name = 'logits'
prediction = read_response(response, output_name)
print(f"text: {sample_text}\n")
print(f"prediction: {prediction}, took {int(duration * 1000)} ms\n")

text: i suspect his reasoning may simply be to lull apple into feeling complacent

prediction: ['JOY'], took 2857 ms



In [85]:
sample_text = get_random_text()
request_body, header_length = get_tokenized_text_binary_trt(sample_text)

start_time = time.time()

response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType=f"application/vnd.sagemaker-triton.binary+json;json-header-size={header_length}",
                                  Body=request_body,
                                  TargetModel='xdistilbert_trt.tar.gz')
duration = time.time() - start_time

output_name = 'logits'
prediction = read_response(response, output_name)
print(f"text: {sample_text}\n")
print(f"prediction: {prediction}, took {int(duration * 1000)} ms\n")

text: i sat up to embrace them and realised that two hours spent shaking my thang in an eighties bar celebrating the fact i am one year closer to death had left my ageing body feeling punished and my normally pink feet blackened

prediction: ['SADNESS'], took 12 ms



### Invoke PyTorch Model <a class="anchor" id="invoke-pytorch-model"></a>

In [108]:
sample_text = get_random_text()
request_body, header_length = get_tokenized_text_binary_pt(sample_text)

response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType=f"application/vnd.sagemaker-triton.binary+json;json-header-size={header_length}",
                                  Body=request_body,
                                  TargetModel='xdistilbert_pt.tar.gz')

output_name = 'OUTPUT__0'
prediction = read_response(response, output_name)
print(f"text: {sample_text}\n")
print(f"prediction: {prediction}, took {int(duration * 1000)} ms\n")

text: i feel reassured that the county government in my county takes the murder of an illegal immigrant in a back alley seriously enough to prosecute someone years later

prediction: ['JOY'], took 12 ms



In [109]:
sample_text = get_random_text()
request_body, header_length = get_tokenized_text_binary_pt(sample_text)

response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType=f"application/vnd.sagemaker-triton.binary+json;json-header-size={header_length}",
                                  Body=request_body,
                                  TargetModel='xdistilbert_pt.tar.gz')

output_name = 'OUTPUT__0'
prediction = read_response(response, output_name)
print(f"text: {sample_text}\n")
print(f"prediction: {prediction}, took {int(duration * 1000)} ms\n")

text: i feel so blessed to be able to share it with you all

prediction: ['LOVE'], took 12 ms



# Deploying Hundreds of Models to GPUs using MME

Let's say you are trying to deploy 1000 customer-specific distilBERT models which are a mixture of frequently and infrequently accessed models coming from different frameworks like PyTorch, TensorFlow, ONNX, TensorRT.

Deploying these 1000 models on GPU instances like `g5.xlarge` using dedicated Single-Model Endpoints would take ~1000 instances.

By deploying these models behind a Multi-Model endpoint on GPUs you can end up using 100x less instances. Thus reducing costs by **100x**. 

## Dynamically adding models to an existing endpoint

It’s easy to deploy a new model to an existing multi-model endpoint. With the endpoint already running, copy a new set of model artifacts to the same S3 location you set up earlier. Client applications are then free to request predictions from that target model, and Amazon SageMaker handles the rest. 

This step below will take around few minutes to complete as we are copying 300 files to S3

In [86]:
customer_model_name = f"xdistilbert_customer302.tar.gz"
model_copy = f"{model_data_url}{customer_model_name}"
!aws s3 cp $model_data_url$pytorch_model_file_name $model_copy

copy: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_pt.tar.gz to s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer302.tar.gz


In [110]:
%%capture
num_models = 100
for i in range(num_models):
    customer_model_name = f"xdistilbert_customer{i}.tar.gz"
    model_copy = f"{model_data_url}{customer_model_name}"
    !aws s3 cp $model_data_url$pytorch_model_file_name $model_copy

In [73]:
model_name="xdistilbert_pt.tar.gz"
!aws s3 rm $model_data_url$model_name

delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_pt.tar.gz


In [94]:
!aws s3 rm $model_data_url --recursive

delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer0.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer10.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer11.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer1.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer18.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer19.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer14.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer16.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer12.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer13.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer15.tar.gz
delete: s3://sagemaker-us-west-2-3

delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer63.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer65.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer78.tar.gz
delete: s3://sagemaker-us-west-2-354625738399/nlp-mme-gpu/xdistilbert_customer91.tar.gz


In [111]:
!aws s3 ls $model_data_url

2023-02-01 04:16:56   46621941 xdistilbert_customer0.tar.gz
2023-02-01 04:16:57   46621941 xdistilbert_customer1.tar.gz
2023-02-01 04:17:06   46621941 xdistilbert_customer10.tar.gz
2023-02-01 04:17:07   46621941 xdistilbert_customer11.tar.gz
2023-02-01 04:17:08   46621941 xdistilbert_customer12.tar.gz
2023-02-01 04:17:09   46621941 xdistilbert_customer13.tar.gz
2023-02-01 04:17:11   46621941 xdistilbert_customer14.tar.gz
2023-02-01 04:17:12   46621941 xdistilbert_customer15.tar.gz
2023-02-01 04:17:13   46621941 xdistilbert_customer16.tar.gz
2023-02-01 04:17:13   46621941 xdistilbert_customer17.tar.gz
2023-02-01 04:17:15   46621941 xdistilbert_customer18.tar.gz
2023-02-01 04:17:16   46621941 xdistilbert_customer19.tar.gz
2023-02-01 04:16:58   46621941 xdistilbert_customer2.tar.gz
2023-02-01 04:17:17   46621941 xdistilbert_customer20.tar.gz
2023-02-01 04:17:18   46621941 xdistilbert_customer21.tar.gz
2023-02-01 04:17:19   46621941 xdistilbert_customer22.tar.gz
2023-02-01 

# Dynamic Model Unloading Behavior

In [None]:
def predict_model(text, model_name, show_latency=False):
    print(f"Using model {model_name} to predict")
    
    request_body, header_length = get_tokenized_text_binary_pt(text)
    
    start_time = time.time()
    
    response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType='application/vnd.sagemaker-triton.binary+json;json-header-size={}'.format(header_length),
                                  Body=request_body,
                                  TargetModel=model_name)
    
    duration = time.time() - start_time
    
    prediction = read_response(response, output_name="OUTPUT__0")
    
    if show_latency:
        print(f"prediction: {prediction}, took {int(duration * 1000)} ms\n")
    else:
        print(f"prediction: {prediction}\n")

In [60]:
for i in range(1, 300):
    customer_model_name = f"xdistilbert_customer{i}.tar.gz"
    predict_model(text=get_random_text(), model_name=customer_model_name)

Using model xdistilbert_customer1.tar.gz to predict
prediction: ['LOVE']

Using model xdistilbert_customer2.tar.gz to predict
prediction: ['JOY']

Using model xdistilbert_customer3.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer4.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer5.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer6.tar.gz to predict
prediction: ['ANGER']

Using model xdistilbert_customer7.tar.gz to predict
prediction: ['ANGER']

Using model xdistilbert_customer8.tar.gz to predict
prediction: ['ANGER']

Using model xdistilbert_customer9.tar.gz to predict
prediction: ['ANGER']

Using model xdistilbert_customer10.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer11.tar.gz to predict
prediction: ['JOY']

Using model xdistilbert_customer12.tar.gz to predict
prediction: ['SURPRISE']

Using model xdistilbert_customer13.tar.gz to predict
prediction: ['FEAR']

Using model xdis

prediction: ['SADNESS']

Using model xdistilbert_customer110.tar.gz to predict
prediction: ['LOVE']

Using model xdistilbert_customer111.tar.gz to predict
prediction: ['JOY']

Using model xdistilbert_customer112.tar.gz to predict
prediction: ['SURPRISE']

Using model xdistilbert_customer113.tar.gz to predict
prediction: ['JOY']

Using model xdistilbert_customer114.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer115.tar.gz to predict
prediction: ['JOY']

Using model xdistilbert_customer116.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer117.tar.gz to predict
prediction: ['JOY']

Using model xdistilbert_customer118.tar.gz to predict
prediction: ['ANGER']

Using model xdistilbert_customer119.tar.gz to predict
prediction: ['JOY']

Using model xdistilbert_customer120.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer121.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer122.tar.gz to predi

prediction: ['LOVE']

Using model xdistilbert_customer217.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer218.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer219.tar.gz to predict
prediction: ['JOY']

Using model xdistilbert_customer220.tar.gz to predict
prediction: ['LOVE']

Using model xdistilbert_customer221.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer222.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer223.tar.gz to predict
prediction: ['JOY']

Using model xdistilbert_customer224.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer225.tar.gz to predict
prediction: ['SADNESS']

Using model xdistilbert_customer226.tar.gz to predict
prediction: ['JOY']

Using model xdistilbert_customer227.tar.gz to predict
prediction: ['LOVE']

Using model xdistilbert_customer228.tar.gz to predict
prediction: ['ANGER']

Using model xdistilbert_customer229.tar.gz to pred

[Show logs to show unloading]

[Show LoadedModelCount and GPUMemoryUtilization]

# Autoscaling Behavior

## Set up AutoScaling Policy

In [61]:
auto_scaling_client = boto3.client('application-autoscaling')

resource_id='endpoint/' + endpoint_name + '/variant/' + 'AllTraffic' 
response = auto_scaling_client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity = 1,
    MaxCapacity = 2
)

response = auto_scaling_client.put_scaling_policy(
    PolicyName='GPUMemUtil-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', 
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 75, 
        'CustomizedMetricSpecification':
        {
            'MetricName': 'GPUMemoryUtilization',
            'Namespace': '/aws/sagemaker/Endpoints',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': endpoint_name},
                {'Name': 'VariantName','Value': 'AllTraffic'}
            ],
            'Statistic': 'Average',
            'Unit': 'Percent'
        },
        'ScaleInCooldown': 600,
        'ScaleOutCooldown': 100
    }
)

print("Autoscaling policy for GPU MME endpoint has been set up")

Autoscaling policy for GPU MME endpoint has been set up


[Show CW alarm being triggered and Endpoint entering Updating state]

### While Autoscaling the endpoint is still active

In [67]:
predict_model(text=get_random_text(), model_name="xdistilbert_customer200.tar.gz")

Using model xdistilbert_customer200.tar.gz to predict
prediction: ['SADNESS']



[Show Endpoint autoscaling to 2 instances]

## Invoke Models Again

In [112]:
for i in range(1, 300):
    customer_model_name = f"xdistilbert_customer{i}.tar.gz"
    predict_model(sample_text, customer_model_name, show_latency=True)

Using model xdistilbert_customer1.tar.gz to predict
prediction: ['LOVE'], took 935 ms

Using model xdistilbert_customer2.tar.gz to predict
prediction: ['LOVE'], took 774 ms

Using model xdistilbert_customer3.tar.gz to predict
prediction: ['LOVE'], took 874 ms

Using model xdistilbert_customer4.tar.gz to predict
prediction: ['LOVE'], took 774 ms

Using model xdistilbert_customer5.tar.gz to predict
prediction: ['LOVE'], took 824 ms

Using model xdistilbert_customer6.tar.gz to predict
prediction: ['LOVE'], took 874 ms

Using model xdistilbert_customer7.tar.gz to predict
prediction: ['LOVE'], took 800 ms

Using model xdistilbert_customer8.tar.gz to predict
prediction: ['LOVE'], took 773 ms

Using model xdistilbert_customer9.tar.gz to predict
prediction: ['LOVE'], took 973 ms

Using model xdistilbert_customer10.tar.gz to predict
prediction: ['LOVE'], took 674 ms

Using model xdistilbert_customer11.tar.gz to predict
prediction: ['LOVE'], took 824 ms

Using model xdistilbert_customer12.tar.gz

prediction: ['LOVE'], took 772 ms

Using model xdistilbert_customer95.tar.gz to predict
prediction: ['LOVE'], took 998 ms

Using model xdistilbert_customer96.tar.gz to predict
prediction: ['LOVE'], took 849 ms

Using model xdistilbert_customer97.tar.gz to predict
prediction: ['LOVE'], took 800 ms

Using model xdistilbert_customer98.tar.gz to predict
prediction: ['LOVE'], took 801 ms

Using model xdistilbert_customer99.tar.gz to predict
prediction: ['LOVE'], took 747 ms

Using model xdistilbert_customer100.tar.gz to predict


ValidationError: An error occurred (ValidationError) when calling the InvokeEndpoint operation: Failed to download model data(bucket: sagemaker-us-west-2-354625738399, key: nlp-mme-gpu/xdistilbert_customer100.tar.gz). Please ensure that there is an object located at the URL and that the role passed to CreateModel has permissions to download the model.


In [114]:
for i in range(1, 100):
    customer_model_name = f"xdistilbert_customer{i}.tar.gz"
    predict_model(sample_text, customer_model_name, show_latency=True)

Using model xdistilbert_customer1.tar.gz to predict
prediction: ['LOVE'], took 11 ms

Using model xdistilbert_customer2.tar.gz to predict
prediction: ['LOVE'], took 8 ms

Using model xdistilbert_customer3.tar.gz to predict
prediction: ['LOVE'], took 8 ms

Using model xdistilbert_customer4.tar.gz to predict
prediction: ['LOVE'], took 8 ms

Using model xdistilbert_customer5.tar.gz to predict
prediction: ['LOVE'], took 9 ms

Using model xdistilbert_customer6.tar.gz to predict
prediction: ['LOVE'], took 8 ms

Using model xdistilbert_customer7.tar.gz to predict
prediction: ['LOVE'], took 9 ms

Using model xdistilbert_customer8.tar.gz to predict
prediction: ['LOVE'], took 9 ms

Using model xdistilbert_customer9.tar.gz to predict
prediction: ['LOVE'], took 9 ms

Using model xdistilbert_customer10.tar.gz to predict
prediction: ['LOVE'], took 8 ms

Using model xdistilbert_customer11.tar.gz to predict
prediction: ['LOVE'], took 8 ms

Using model xdistilbert_customer12.tar.gz to predict
predictio

# Clean Up

## Terminate endpoint and clean up artifacts

In [95]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)

{'ResponseMetadata': {'RequestId': 'b3c5d7dc-8ffd-4bc3-9fb3-16795c8476d8',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b3c5d7dc-8ffd-4bc3-9fb3-16795c8476d8',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Wed, 01 Feb 2023 04:05:12 GMT'},
  'RetryAttempts': 0}}