# Serve Multiple DL models on GPU with Amazon SageMaker Multi-model endpoints (MME)



Amazon SageMaker multi-model endpoints (MME) provide a scalable and cost-effective way to deploy a large number of deep learning models. Previously, customers had limited options for deploying hundreds of deep learning models that need accelerated compute with GPUs. Now customers can deploy thousands of deep learning models behind one SageMaker endpoint. MME runs multiple models on a GPU core, shares GPU instances behind an endpoint across multiple models, and dynamically loads and unloads models based on incoming traffic. This lets customers reduce cost significantly and improve price performance.



<div class="alert alert-info"> 💡 <strong> Note </strong>
This notebook was tested with the `conda_python3` kernel on an Amazon SageMaker notebook instance of type `ml.g5.xlarge`.
</div>

In this notebook, we walk through how to use NVIDIA Triton Inference Server on Amazon SageMaker MME with GPU support to deploy two different NLP models (**DistilBERT** and **T5**) for two different use cases (**Classification** and **Summarization**) in two different frameworks (**TensorFlow** and **PyTorch**) on the same GPU.

## Installs

Install the dependencies required to package the models and run inference with Triton Inference Server, and update SageMaker, boto3, and awscli to their latest versions.

In [None]:
!pip install -qU pip awscli boto3 sagemaker
!pip install nvidia-pyindex --quiet
!pip install tritonclient[http] --quiet
!pip install transformers[sentencepiece] --quiet

## Imports and variables

In [None]:
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import numpy as np
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
s3_client = boto3.client("s3")
# IAM role that SageMaker assumes; passed to create_model later in the notebook
role = get_execution_role()
bucket = sagemaker_session.default_bucket()
prefix = "nlp-mme-gpu"

# account mapping for SageMaker MME Triton Image
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}

region = boto3.Session().region_name
if region not in account_id_map:
    raise ValueError(f"Unsupported region: {region}")

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
mme_triton_image_uri = (
    "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.09-py3".format(
        account_id=account_id_map[region], region=region, base=base
    )
)

## Workflow Overview

This section gives an overview of the main steps for preparing the DistilBERT TensorFlow model (served using the TensorFlow backend) and the T5 PyTorch model (served using the Python backend) for Triton Inference Server.

### 1. Generate Model Artifacts

#### DistilBERT TensorFlow model

First, we use HuggingFace transformers to load a pre-trained DistilBERT TensorFlow model that has been fine-tuned for binary sentiment classification. Then we save the model in the SavedModel serialized format. The `generate_distilbert_tf.sh` bash script performs these steps inside the NGC TensorFlow container. Run the command below to generate the DistilBERT TensorFlow model. It can take a few minutes.

In [None]:
!docker run --gpus=all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -it \
            -v `pwd`/workspace:/workspace nvcr.io/nvidia/tensorflow:22.09-tf2-py3 \
            /bin/bash generate_distilbert_tf.sh

#### T5 PyTorch Model

The T5-small HuggingFace PyTorch model is served using Triton's [Python backend](https://github.com/triton-inference-server/python_backend#usage), so we provide a Python script, [model.py](./workspace/model.py), which implements the logic to initialize the T5 model and run inference for the summarization task.
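
The contents of `model.py` are not reproduced in this notebook. As a rough sketch of the shape a Python backend model usually takes, the class below assumes the tensor names from the T5 `config.pbtxt` defined later in this notebook (`input_ids`, `attention_mask`, `output`) and a hypothetical `max_length` of 128. It is an illustration under those assumptions, not the script shipped in `workspace/`. The `triton_python_backend_utils` module is only available inside the Triton container, so its import is guarded here.

```python
try:
    # Provided by Triton at runtime inside the container; not a PyPI package.
    import triton_python_backend_utils as pb_utils
except ImportError:
    pb_utils = None


class TritonPythonModel:
    """Sketch of a Python backend model wrapping HuggingFace T5 (assumed layout)."""

    def initialize(self, args):
        # Heavy imports live here so the module can be inspected without them.
        import torch
        from transformers import T5ForConditionalGeneration

        self.torch = torch
        use_gpu = args["model_instance_kind"] == "GPU" and torch.cuda.is_available()
        self.device = torch.device("cuda" if use_gpu else "cpu")
        self.model = T5ForConditionalGeneration.from_pretrained("t5-small").to(self.device)
        self.model.eval()

    def execute(self, requests):
        # Triton passes a list of requests; return exactly one response per request.
        responses = []
        for request in requests:
            input_ids = pb_utils.get_input_tensor_by_name(request, "input_ids").as_numpy()
            attention_mask = pb_utils.get_input_tensor_by_name(request, "attention_mask").as_numpy()
            with self.torch.no_grad():
                generated = self.model.generate(
                    input_ids=self.torch.as_tensor(input_ids, device=self.device).long(),
                    attention_mask=self.torch.as_tensor(attention_mask, device=self.device).long(),
                    max_length=128,  # hypothetical cap on the summary length
                )
            # The config declares the output as TYPE_INT32, so cast the token ids.
            output = pb_utils.Tensor("output", generated.cpu().numpy().astype("int32"))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses

    def finalize(self):
        # Called once when Triton unloads the model.
        self.model = None
```

Triton calls `initialize` once when the model is loaded, `execute` for each batch of requests, and `finalize` when the model is unloaded, which matters on MME because models are loaded and unloaded dynamically.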

### 2. Build Model Repository

Using Triton on SageMaker requires us to first set up a [model repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md) folder containing the models we want to serve. For each model, we create a model directory containing the model artifact and define a `config.pbtxt` file specifying the [model configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md) that Triton uses to load and serve the model.
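
Before packaging, it can help to confirm that a model directory has this shape. The helper below is a minimal sketch, not part of Triton or SageMaker; `check_model_dir` is a hypothetical name, and it only checks the layout used in this notebook (a `config.pbtxt` plus at least one non-empty, numerically named version directory).

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of layout problems for one Triton model directory."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return [f"{model_dir} is not a directory"]
    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")
    # Version directories are numerically named and must not start with zero.
    versions = sorted(
        p for p in model_dir.iterdir()
        if p.is_dir() and p.name.isdigit() and not p.name.startswith("0")
    )
    if not versions:
        problems.append("no numeric version directory")
    for version in versions:
        if not any(version.iterdir()):
            problems.append(f"version directory {version.name} is empty")
    return problems
```

For example, `check_model_dir("model_repository/distilbert_tf")` should return an empty list once the directory has been populated as described in the following subsections.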



#### DistilBERT TensorFlow Model

Model repository structure for DistilBERT TensorFlow Model.

```
distilbert_tf
├── 1
│   └── model.savedmodel
└── config.pbtxt
```

The model configuration must specify the `platform` or `backend` property, the `max_batch_size` property, and the input and output tensors of the model. Additionally, you can specify the `instance_group` and `dynamic_batching` properties to tune inference performance in terms of latency and concurrency.
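
One detail that is easy to miss: when `max_batch_size` is greater than zero, the `dims` in the configuration exclude the batch axis, so a tensor declared with `dims: [ -1 ]` is sent with shape `[batch, sequence_length]`. The function below is a small sketch of that rule; `request_shape_is_valid` is a hypothetical helper for illustration, not a Triton API.

```python
def request_shape_is_valid(shape, dims, max_batch_size):
    """Check a request tensor shape against config `dims` and `max_batch_size`.

    With max_batch_size > 0 the first axis is an implicit batch dimension,
    so the full shape is [batch] + dims. A dim of -1 accepts any size.
    """
    shape = list(shape)
    if max_batch_size > 0:
        if not shape or not (1 <= shape[0] <= max_batch_size):
            return False
        shape = shape[1:]
    if len(shape) != len(dims):
        return False
    return all(d == -1 or d == s for d, s in zip(dims, shape))


# Three sequences of 256 tokens fit a config with dims [-1] and max_batch_size 8.
print(request_shape_is_valid((3, 256), [-1], 8))
# A batch of nine exceeds max_batch_size.
print(request_shape_is_valid((9, 256), [-1], 8))
```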

Below we set up the DistilBERT TensorFlow Model in the model repository:

In [None]:
!mkdir -p model_repository/distilbert_tf/1
!cp -r workspace/hf_distilbert/saved_model/1 workspace/model.savedmodel
!cp -r workspace/model.savedmodel model_repository/distilbert_tf/1/

Then we define its config file:

In [None]:
%%writefile model_repository/distilbert_tf/config.pbtxt
name: "distilbert_tf"
platform: "tensorflow_savedmodel"
max_batch_size: 8
input: [
    {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: [ -1 ]
    },
    {
        name: "attention_mask"
        data_type: TYPE_INT32
        dims: [ -1 ]
    }
]
output: [
    {
        name: "logits"
        data_type: TYPE_FP32
        dims: [ 2 ]
    }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
  preferred_batch_size: 4
}

#### T5 Python Backend Model

Model repository structure for T5 Model.

```
t5_pytorch
├── 1
│   └── model.py
├── hf_env.tar.gz
└── config.pbtxt
```


Next we set up the T5 PyTorch Python Backend Model in the model repository:

In [None]:
!mkdir -p model_repository/t5_pytorch/1
!cp workspace/model.py model_repository/t5_pytorch/1/

##### Create Conda Environment for Dependencies

For serving the HuggingFace T5 PyTorch Model using Triton's Python backend we have PyTorch and HuggingFace transformers as dependencies.

We follow the instructions in the [Triton documentation for packaging dependencies](https://github.com/triton-inference-server/python_backend#2-packaging-the-conda-environment) to be used in the Python backend as a conda environment tar file. The bash script [create_hf_env.sh](./workspace/create_hf_env.sh) creates the conda environment containing PyTorch and HuggingFace transformers and packages it as a tar file, which we then move into the `t5_pytorch` model directory. This can take a few minutes.

In [None]:
!bash workspace/create_hf_env.sh
!mv hf_env.tar.gz model_repository/t5_pytorch/

After creating the tar file from the conda environment and placing it in the model folder, you need to tell the Python backend to use that environment for your model. We do this by including the lines below in the model's `config.pbtxt` file:

```
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz"}
}
```
Here, `$$TRITON_MODEL_DIRECTORY` makes the environment path relative to the model's folder in the model repository; in this notebook it resolves to `$pwd/model_repository/t5_pytorch`. `hf_env.tar.gz` is the name we gave to the packaged conda environment file.

Now we are ready to define the config file for the T5 PyTorch model served through Triton's Python backend:

In [None]:
%%writefile model_repository/t5_pytorch/config.pbtxt
name: "t5_pytorch"
backend: "python"
max_batch_size: 8
input: [
    {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: [ -1 ]
    },
    {
        name: "attention_mask"
        data_type: TYPE_INT32
        dims: [ -1 ]
    }
]
output [
  {
    name: "output"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
  preferred_batch_size: 4
}
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz"}
}

### 3. Package models and upload to S3

Next, we will package our models as `*.tar.gz` files for uploading to S3. 

In [None]:
!tar -C model_repository/ -czf distilbert_tf.tar.gz distilbert_tf
model_uri_distilbert_tf = sagemaker_session.upload_data(path="distilbert_tf.tar.gz", key_prefix=prefix)

In [None]:
!tar -C model_repository/ -czf t5_pytorch.tar.gz t5_pytorch
model_uri_t5_pytorch = sagemaker_session.upload_data(path="t5_pytorch.tar.gz", key_prefix=prefix)
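
The `tar -C model_repository/ ...` commands above produce archives whose single top-level entry is the model directory. As a sketch of the same packaging in pure Python, with a check of the resulting layout, the standard `tarfile` module can be used; `package_model` and `top_level_entries` are hypothetical helper names.

```python
import tarfile
from pathlib import Path


def package_model(repository, model_name, out_dir="."):
    """Create <model_name>.tar.gz with the model directory at the archive root."""
    archive = Path(out_dir) / f"{model_name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps "<model_name>/..." as the top-level entry, like `tar -C`.
        tar.add(Path(repository) / model_name, arcname=model_name)
    return archive


def top_level_entries(archive):
    """Return the set of top-level names inside a gzip-compressed tar archive."""
    with tarfile.open(archive, "r:gz") as tar:
        return {member.name.split("/")[0] for member in tar.getmembers()}
```

For example, `top_level_entries("t5_pytorch.tar.gz")` returning exactly `{"t5_pytorch"}` indicates the archive was built relative to `model_repository/` rather than including that folder in the paths.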

### 4. Create SageMaker Endpoint

Now that we have uploaded the model artifacts to S3, we can create a SageMaker multi-model endpoint.

#### Define the serving container
In the container definition, set `ModelDataUrl` to the S3 directory that contains all the models the SageMaker multi-model endpoint will load and serve predictions from. Set `Mode` to `MultiModel` so that SageMaker creates the endpoint with MME container specifications. We use an image that supports deploying multi-model endpoints with GPU.

In [None]:
model_data_url = f"s3://{bucket}/{prefix}/"

container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
}

#### Create a multi-model object

Once the image and data location are set, we create the model using `create_model`, specifying the `ModelName` and the container definition.

In [None]:
ts = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
sm_model_name = f"{prefix}-mdl-{ts}"

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

#### Define configuration for the multi-model endpoint

Using the model above, we create an [endpoint configuration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) where we specify the type and number of instances we want in the endpoint. Here we deploy to an `ml.g5.xlarge` NVIDIA GPU instance.

In [None]:
endpoint_config_name = f"{prefix}-epc-{ts}"

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

#### Create Multi-Model Endpoint

Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status changes to **InService** once the deployment is successful.

In [None]:
endpoint_name = f"{prefix}-ep-{ts}"

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

In [None]:
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
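
The loop above keeps polling while the status is `Creating` and then prints whatever status follows, including `Failed`. A slightly more defensive variant, sketched below, raises on failure and gives up after a timeout. `wait_for_endpoint` is a hypothetical helper; it takes the describe call as a parameter so it can be exercised without AWS access, and the default timeout is an arbitrary assumption.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout.

    `describe` is any callable with the calling convention of
    `sm_client.describe_endpoint` (keyword argument `EndpointName`).
    """
    waited = 0
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            # For example "Failed"; FailureReason is reported for failed endpoints.
            reason = resp.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook it would be called as `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. boto3 also ships a built-in waiter for this, available through `sm_client.get_waiter("endpoint_in_service")`.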

### 5. Run Inference

Once the endpoint is running, we can use some sample raw data to run inference with JSON as the payload format. For the inference request format, Triton uses the KFServing community standard [inference protocols](https://github.com/triton-inference-server/server/blob/main/docs/protocol/README.md).

#### Add utility methods for preparing JSON request payload



We'll use the following utility methods to convert our inference requests for the DistilBERT and T5 models into a JSON payload.

In [None]:
from transformers import AutoTokenizer

def get_tokenizer(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.model_max_length = 256
    return tokenizer

def tokenize_text(model_name, text):
    tokenizer = get_tokenizer(model_name)
    tokenized_text = tokenizer(text, padding="max_length", return_tensors="np")
    return tokenized_text.input_ids, tokenized_text.attention_mask

def get_text_payload(model_name, text):
    input_ids, attention_mask = tokenize_text(model_name, text)
    payload = {}
    payload["inputs"] = []
    payload["inputs"].append({"name": "input_ids", "shape": input_ids.shape, "datatype": "INT32", "data": input_ids.tolist()})
    payload["inputs"].append({"name": "attention_mask", "shape": attention_mask.shape, "datatype": "INT32", "data": attention_mask.tolist()})
    return payload

#### Invoke target model on Multi Model Endpoint

We can send an inference request to the multi-model endpoint using the `invoke_endpoint` API. We specify the `TargetModel` in the invocation call and pass in the payload for each model type.
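
The `TargetModel` value is a path relative to the `ModelDataUrl` set in the container definition. The sketch below illustrates that relationship with two hypothetical helpers; the bucket name in the example is a placeholder.

```python
def target_model_s3_uri(model_data_url, target_model):
    """Join the endpoint's ModelDataUrl prefix with a TargetModel value."""
    return model_data_url.rstrip("/") + "/" + target_model.lstrip("/")


def available_targets(keys, prefix):
    """Given S3 object keys, return the TargetModel names found under `prefix`."""
    prefix = prefix.strip("/") + "/"
    return sorted(
        key[len(prefix):]
        for key in keys
        if key.startswith(prefix) and key.endswith(".tar.gz")
    )


# "my-bucket" is a placeholder; the notebook uses the session's default bucket.
print(target_model_s3_uri("s3://my-bucket/nlp-mme-gpu/", "t5_pytorch.tar.gz"))
```

Because models are resolved at invocation time, uploading another `*.tar.gz` archive under the same S3 prefix makes it addressable through `TargetModel` without redeploying the endpoint.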

#### DistilBERT TensorFlow Model

##### Sample DistilBERT Inference using JSON Payload

First, we show a sample inference on the DistilBERT TensorFlow binary classification model deployed on Triton's TensorFlow SavedModel backend behind the SageMaker MME GPU endpoint.

In [None]:
texts_to_classify = ["Many critics thought the sequel film was unnecessary",
                     "The movie received praise for the visuals and cast",
                     "Spectacular! Great history and amazing architecture!"]
batch_size = len(texts_to_classify)

distilbert_model = "distilbert-base-uncased-finetuned-sst-2-english"
distilbert_payload = get_text_payload(distilbert_model, texts_to_classify)

In [None]:
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(distilbert_payload),
    TargetModel="distilbert_tf.tar.gz",
)

response_body = json.loads(response["Body"].read().decode("utf8"))
logits = np.array(response_body["outputs"][0]["data"]).reshape(batch_size, -1)
CLASSES = ["NEGATIVE", "POSITIVE"]
predictions = []

for i in range(batch_size):
    pred_class_idx = np.argmax(logits[i])
    predictions.append(CLASSES[pred_class_idx])
print(predictions)
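
The cell above reports only the predicted class. If a confidence score is also useful, the two logits per example can be passed through a softmax. The sketch below uses only the standard library; `softmax` and `label_with_confidence` are hypothetical helpers, and the class order is assumed to match the `CLASSES` list above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits, classes=("NEGATIVE", "POSITIVE")):
    """Return (predicted class, probability) for one row of logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs[best]
```

With the `logits` array from the cell above, `[label_with_confidence(row) for row in logits.tolist()]` yields one `(label, probability)` pair per input text.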

##### Sample DistilBERT Inference using Binary + JSON Payload

We can also use `binary+json` as the payload format to get better performance for the inference call. The specification of this format is provided [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md).

**Note:** With the binary+json format, we have to specify the length of the request metadata in the header to allow Triton to correctly parse the binary payload. This is done using a custom Content-Type header `application/vnd.sagemaker-triton.binary+json;json-header-size={}`.

Please note, this is different from using the `Inference-Header-Content-Length` header on a stand-alone Triton server, since custom headers are not allowed in SageMaker.
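
To make the layout concrete, the sketch below builds such a body by hand for INT32 tensors: a JSON header describing each input (with a `binary_data_size` parameter), followed by the raw little-endian tensor bytes in the same order. It is a simplified illustration of the linked specification using only the standard library, and the helper names are hypothetical.

```python
import json
import struct

CONTENT_TYPE_PREFIX = "application/vnd.sagemaker-triton.binary+json;json-header-size="


def build_binary_request(tensors):
    """Build a binary+json body from {name: (shape, flat list of int32 values)}.

    Returns (body, json_header_size).
    """
    inputs, blobs = [], []
    for name, (shape, values) in tensors.items():
        raw = struct.pack(f"<{len(values)}i", *values)  # little-endian int32
        inputs.append({
            "name": name,
            "shape": list(shape),
            "datatype": "INT32",
            "parameters": {"binary_data_size": len(raw)},
        })
        blobs.append(raw)
    header = json.dumps({"inputs": inputs}).encode("utf-8")
    return header + b"".join(blobs), len(header)


def content_type_for(header_size):
    """Content-Type value that tells SageMaker where the JSON header ends."""
    return f"{CONTENT_TYPE_PREFIX}{header_size}"


def parse_header_size(content_type):
    """Extract the JSON header size from a response Content-Type value."""
    if not content_type.startswith(CONTENT_TYPE_PREFIX):
        raise ValueError(f"unexpected content type: {content_type}")
    return int(content_type[len(CONTENT_TYPE_PREFIX):])
```

The response uses the same convention in reverse: the first `json-header-size` bytes of the body are JSON metadata, and the remaining bytes are the output tensors.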

The `tritonclient` package in Triton provides utility methods to generate the payload without having to know the details of the specification. We'll use the following method to convert our inference request for DistilBERT and T5 models into a binary format which provides lower latencies for inference.

In [None]:
import tritonclient.http as httpclient
import numpy as np

def get_text_payload_binary(model_name, text):
    inputs = []
    outputs = []
    input_ids, attention_mask = tokenize_text(model_name, text)
    inputs.append(httpclient.InferInput("input_ids", input_ids.shape, "INT32"))
    inputs.append(httpclient.InferInput("attention_mask", attention_mask.shape, "INT32"))

    inputs[0].set_data_from_numpy(input_ids.astype(np.int32), binary_data=True)
    inputs[1].set_data_from_numpy(attention_mask.astype(np.int32), binary_data=True)
    
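    # Request the model's output tensor by name, also in binary form
    output_name = "output" if model_name == "t5-small" else "logits"
    outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True))
    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
        inputs, outputs=outputs
    )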
    return request_body, header_length

In [None]:
request_body, header_length = get_text_payload_binary(distilbert_model, texts_to_classify)

response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType='application/vnd.sagemaker-triton.binary+json;json-header-size={}'.format(header_length),
                                  Body=request_body,
                                  TargetModel='distilbert_tf.tar.gz')

# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response['ContentType'][len(header_length_prefix):]
output_name = "logits"
# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
    response['Body'].read(), header_length=int(header_length_str))
logits = result.as_numpy(output_name)
CLASSES = ["NEGATIVE", "POSITIVE"]
predictions = []

for i in range(batch_size):
    pred_class_idx = np.argmax(logits[i])
    predictions.append(CLASSES[pred_class_idx])
print(predictions)

#### T5 PyTorch Model

Next, we show a sample inference for summarization on the T5 PyTorch model deployed on Triton's Python backend behind the SageMaker MME GPU endpoint.

In [None]:
texts_to_summarize = [
    "summarize: SageMaker enables customers to deploy a model using custom code with NVIDIA Triton Inference Server. This functionality is available through the development of Triton Inference Server Containers. These containers include NVIDIA Triton Inference Server, support for common ML frameworks, and useful environment variables that let you optimize performance on SageMaker. For a list of all available Deep Learning Containers images, see Available Deep Learning Containers Images. Deep Learning Containers images are maintained and regularly updated with security patches",
    "summarize: SageMaker MMEs with GPU work using NVIDIA Triton Inference Server. NVIDIA Triton Inference Server is an open-source inference serving software that simplifies the inference serving process and provides high inference performance. Triton supports all major training and inference frameworks, such as TensorFlow, NVIDIA® TensorRT™, PyTorch, MXNet, Python, ONNX, XGBoost, Scikit-learn, RandomForest, OpenVINO, custom C++, and more. It offers dynamic batching, concurrent runs, post-training quantization, and optimal model configuration to achieve high-performance inference."
]
batch_size = len(texts_to_summarize)

##### Sample T5 Inference using JSON Payload

In [None]:
t5_payload = get_text_payload("t5-small", texts_to_summarize)

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(t5_payload),
    TargetModel="t5_pytorch.tar.gz",
)

response_body = json.loads(response["Body"].read().decode("utf8"))
output_ids = np.array(response_body["outputs"][0]["data"]).reshape(batch_size, -1)
t5_tokenizer = get_tokenizer("t5-small")
decoded_outputs = t5_tokenizer.batch_decode(output_ids, skip_special_tokens=True)
for text in decoded_outputs:
    print(text, "\n")

##### Sample T5 Inference using Binary + JSON Payload

In [None]:
request_body, header_length = get_text_payload_binary("t5-small", texts_to_summarize)

response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType='application/vnd.sagemaker-triton.binary+json;json-header-size={}'.format(header_length),
                                  Body=request_body,
                                 TargetModel='t5_pytorch.tar.gz')

# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response['ContentType'][len(header_length_prefix):]
output_name = "output"
# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
    response['Body'].read(), header_length=int(header_length_str))
output_ids = result.as_numpy(output_name)
decoded_outputs = t5_tokenizer.batch_decode(output_ids, skip_special_tokens=True)
for text in decoded_outputs:
    print(text, "\n")

### Terminate endpoint and clean up artifacts

In [None]:
# Delete the endpoint first, then the resources it references
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)
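
The cell above removes the SageMaker resources but leaves the uploaded model archives in S3. A possible extension is sketched below; `cleanup` is a hypothetical helper that takes the boto3 clients as parameters and additionally deletes the listed S3 objects.

```python
def cleanup(sm_client, s3_client, endpoint_name, endpoint_config_name,
            model_name, bucket, keys):
    """Delete the endpoint, its config, the model, then the given S3 objects."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
    for key in keys:
        # Each key is the object path inside the bucket, e.g. "<prefix>/<archive>"
        s3_client.delete_object(Bucket=bucket, Key=key)
```

In this notebook the keys would be `f"{prefix}/distilbert_tf.tar.gz"` and `f"{prefix}/t5_pytorch.tar.gz"`, since `upload_data` stored the archives under `key_prefix=prefix`.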