# Hugging Face Transformers with Amazon SageMaker and Multi-Container Endpoints
### Deploy multiple Transformer models to the same Amazon SageMaker Infrastructure


Welcome to this getting started guide. We will use the Hugging Face Inference DLCs and Amazon SageMaker to deploy multiple transformer models as a [Multi-Container Endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-container-endpoints.html). 
Amazon SageMaker Multi-Container Endpoint is an inference option that deploys multiple containers (multiple models) to the same SageMaker real-time endpoint. These models/containers can be accessed individually or in a pipeline. Amazon SageMaker [Multi-Container Endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-container-endpoints.html) can be used to improve endpoint utilization and optimize costs. One example is **time zone differences**: if the workload for model A (U.S.) occurs mostly during the day and the workload for model B (Germany) occurs mostly during the night, you can deploy model A and model B to the same SageMaker endpoint and optimize your costs. 

_**NOTE:** At the time of writing, only `CPU` instances are supported for Multi-Container Endpoints._


![mce](imgs/mce.png)



## Development Environment and Permissions

_NOTE: You can run this demo in SageMaker Studio, on your local machine, or in SageMaker Notebook Instances._

In [None]:
%pip install sagemaker --upgrade

In [None]:
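import sagemaker
from packaging.version import Version  # packaging is installed alongside sagemaker

# compare parsed versions; a plain string comparison would rank "2.100.0" below "2.75.0"
assert Version(sagemaker.__version__) >= Version("2.75.0")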

### Permissions

_If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
region = sess.boto_region_name

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {region}")

## Multi-Container Endpoint creation

At the time of writing, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) does not support Multi-Container Endpoint deployments. That's why we are going to use `boto3` to create the endpoint.

The first step, though, is to use the SDK to get the container URI for the Hugging Face Inference DLC.

In [29]:
from sagemaker import image_uris

hf_inference_dlc = image_uris.retrieve(framework='huggingface', 
                                region=region, 
                                version='4.12.3', 
                                image_scope='inference', 
                                base_framework_version='pytorch1.9.1', 
                                py_version='py38', 
                                container_version='ubuntu20.04', 
                                instance_type='ml.c5.xlarge')
hf_inference_dlc

'763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.9.1-transformers4.12.3-gpu-py38-cu111-ubuntu20.04'
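
Because Multi-Container Endpoints are limited to CPU instances at the time of writing, it can be worth checking which image variant a URI points to before deploying. The following is a minimal sketch that assumes the DLC tag convention visible in the URI above, where the tag contains a `cpu` or `gpu` segment. `image_device` is a hypothetical helper written for this guide, not part of the SageMaker SDK.

```python
def image_device(image_uri: str) -> str:
    """Return 'cpu', 'gpu', or 'unknown' based on the image tag.

    Assumes tags shaped like '<version>-transformers<version>-<device>-py<xy>-...'.
    """
    # The tag is everything after the last ':' in the URI.
    tag = image_uri.rsplit(":", 1)[-1]
    segments = tag.split("-")
    for device in ("cpu", "gpu"):
        if device in segments:
            return device
    return "unknown"


example_uri = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "huggingface-pytorch-inference:1.9.1-transformers4.12.3-gpu-py38-cu111-ubuntu20.04"
)
print(image_device(example_uri))
```

In the notebook, the same check could be applied to `hf_inference_dlc` before it is used in the container definitions.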

### Define Hugging Face models

As a next step, we need to define the models we want to deploy to our multi-container endpoint. To stick with our example from the introduction, we are going to deploy an English sentiment-classification model and a German sentiment-classification model. For the English model we will use [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) and for the German model we will use [oliverguhr/german-sentiment-bert](https://huggingface.co/oliverguhr/german-sentiment-bert). 
As with endpoint creation through the SageMaker SDK, we need to provide the "Hub" configurations for the models as `HF_MODEL_ID` and `HF_TASK`. 

In [30]:
# english model
english_model = {
    'Image': hf_inference_dlc,  # container URI retrieved above
    'ContainerHostname': 'english_model',
    'Environment': {
        'HF_MODEL_ID': 'distilbert-base-uncased-finetuned-sst-2-english',
        'HF_TASK': 'text-classification'
    }
}

# german model
german_model = {
    'Image': hf_inference_dlc,  # container URI retrieved above
    'ContainerHostname': 'german_model',
    'Environment': {
        'HF_MODEL_ID': 'oliverguhr/german-sentiment-bert',
        'HF_TASK': 'text-classification'
    }
}

# Set the Mode parameter of the InferenceExecutionConfig field to Direct for direct invocation of each container,
# or Serial to use containers as an inference pipeline. The default mode is Serial.
inferenceExecutionConfig = {"Mode": "Direct"}
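
The two container definitions differ only in hostname and model ID, so they can also be generated by a small function. This is a sketch under assumptions: `hub_container` is a hypothetical helper, and the image URI below is a placeholder standing in for the DLC URI retrieved earlier.

```python
def hub_container(image_uri: str, hostname: str, model_id: str, task: str) -> dict:
    """Build one entry for the `Containers` list passed to `create_model`."""
    return {
        "Image": image_uri,
        "ContainerHostname": hostname,
        "Environment": {"HF_MODEL_ID": model_id, "HF_TASK": task},
    }


# Placeholder for illustration; in the notebook this would be `hf_inference_dlc`.
image = "<hugging-face-inference-dlc-uri>"

containers = [
    hub_container(image, "english_model",
                  "distilbert-base-uncased-finetuned-sst-2-english", "text-classification"),
    hub_container(image, "german_model",
                  "oliverguhr/german-sentiment-bert", "text-classification"),
]

# With direct invocation each container is addressed by its hostname,
# so the hostnames should be distinct.
hostnames = [c["ContainerHostname"] for c in containers]
assert len(hostnames) == len(set(hostnames)), "container hostnames must be distinct"
print(hostnames)
```

Adding a third model then comes down to one more `hub_container(...)` call.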


## Create Multi-Container Endpoint

Now that we have defined our model configuration, we can deploy our endpoint. To create/deploy a real-time endpoint with `boto3` you need to create a "SageMaker Model", a "SageMaker Endpoint Configuration" and a "SageMaker Endpoint". The "SageMaker Model" contains our multi-container configuration including our two models. The "SageMaker Endpoint Configuration" contains the configuration for the endpoint, such as the instance type and instance count. The "SageMaker Endpoint" is the actual endpoint.

In [33]:
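import boto3

# low-level SageMaker client used for the create_* calls below
sm_client = boto3.client("sagemaker")

deployment_name = "multi-container-sentiment"
instance_type = "ml.c5.4xlarge"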


# create SageMaker Model
sm_client.create_model(
    ModelName        = f"{deployment_name}-model",
    InferenceExecutionConfig = inferenceExecutionConfig,
    ExecutionRoleArn = role,
    Containers       = [english_model, german_model]
    )

# create SageMaker Endpoint configuration
sm_client.create_endpoint_config(
    EndpointConfigName= f"{deployment_name}-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName":  f"{deployment_name}-model",
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        },
    ],
)

# create SageMaker Endpoint
endpoint = sm_client.create_endpoint(
    EndpointName= f"{deployment_name}-ep", EndpointConfigName=f"{deployment_name}-config"
)

This will take a few minutes to deploy. You can check the console to see whether the endpoint is in service.
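
Instead of watching the console, the notebook can also block until the endpoint is ready. The sketch below relies on the `endpoint_in_service` waiter of the `boto3` SageMaker client; `wait_until_in_service` is a hypothetical wrapper name, and the client is passed in as an argument so the function makes no assumption about how it was created.

```python
def wait_until_in_service(sm_client, endpoint_name: str) -> str:
    """Block until the endpoint is in service and return its status.

    The waiter polls DescribeEndpoint and raises if creation fails
    or the polling limit is exceeded.
    """
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(EndpointName=endpoint_name)
    return sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]


# Usage in the notebook (requires AWS credentials):
#   import boto3
#   status = wait_until_in_service(boto3.client("sagemaker"), "multi-container-sentiment-ep")
#   print(status)
```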

## Invoke Multi-Container Endpoint

To invoke our multi-container endpoint we can use `boto3`, any other AWS SDK, or the Amazon SageMaker SDK. We are going to test both `boto3` and the SageMaker SDK, and also do some light load testing to take a look at the performance of our endpoint in CloudWatch.

In [None]:
english_payload={"inputs":"This is a great way for saving money and optimizing my resources."}

german_payload={"inputs":"Diese Methode wird mir in Zukunft helfen Kosten zu sparen und meine Ressourcen zu optimieren."}

### Sending requests with `boto3`

To send requests to our models we will use the `sagemaker-runtime` client with the `invoke_endpoint` method. Compared to sending regular requests to a single-container endpoint, we pass `TargetContainerHostname` as additional information to point to the container that should receive the request. In our case this is either `english_model` or `german_model`. 

#### `english_model`

In [15]:
import json
import boto3

# create client
invoke_client = boto3.client('sagemaker-runtime')

# send request to the English model container
response = invoke_client.invoke_endpoint(
    EndpointName=f"{deployment_name}-ep",
    ContentType="application/json",
    Accept="application/json",
    TargetContainerHostname="english_model",
    Body=json.dumps(english_payload),
)
result = json.loads(response['Body'].read().decode())
result

#### `german_model`

In [21]:
import json
import boto3

# create client
invoke_client = boto3.client('sagemaker-runtime')

# send request to the German model container
response = invoke_client.invoke_endpoint(
    EndpointName=f"{deployment_name}-ep",
    ContentType="application/json",
    Accept="application/json",
    TargetContainerHostname="german_model",
    Body=json.dumps(german_payload),
)
result = json.loads(response['Body'].read().decode())
result

[{'label': 'POSITIVE', 'score': 0.9996057152748108}]
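
The two cells above differ only in the target hostname and the payload, so the call can be wrapped in one function. This is a minimal sketch: `invoke_container` is a hypothetical helper name, and the runtime client is passed in rather than created inside the function.

```python
import json


def invoke_container(runtime_client, endpoint_name: str, hostname: str, payload: dict):
    """Send a JSON payload to one container of a multi-container endpoint."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        # selects which container of the endpoint handles the request
        TargetContainerHostname=hostname,
        Body=json.dumps(payload),
    )
    # the response body is a stream of JSON-encoded bytes
    return json.loads(response["Body"].read().decode())


# Usage in the notebook (requires AWS credentials and a running endpoint):
#   import boto3
#   client = boto3.client("sagemaker-runtime")
#   invoke_container(client, "multi-container-sentiment-ep", "german_model", german_payload)
```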

### Sending requests with `HuggingFacePredictor`

The SageMaker Python SDK cannot be used to deploy Multi-Container Endpoints, but it can be used to send requests to them. We will use the `HuggingFacePredictor` to send requests to the endpoint, again passing `TargetContainerHostname` as additional information to point to the container that should receive the request. In our case this is either `english_model` or `german_model`.

In [25]:
from sagemaker.huggingface import HuggingFacePredictor

# predictor
predictor = HuggingFacePredictor(f"{deployment_name}-ep")

# english request
en_res = predictor.predict(english_payload, initial_args={"TargetContainerHostname":"english_model"})
print(en_res)

# german request
de_res = predictor.predict(german_payload, initial_args={"TargetContainerHostname":"german_model"})
print(de_res)

### Load testing the multi-container endpoint

As mentioned, we are doing some light load testing: we send alternating requests to the two containers and then look at the latency in CloudWatch. 

In [None]:
for _ in range(1000):
    predictor.predict(english_payload, initial_args={"TargetContainerHostname": "english_model"})
    predictor.predict(german_payload, initial_args={"TargetContainerHostname": "german_model"})

print("Load test finished; check the endpoint's latency metrics in CloudWatch.")
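
CloudWatch reports latency as measured on the endpoint side. To get a rough client-side view as well (which includes network time), the requests can be timed locally. The sketch below is an illustration only: `measure_latency` is a hypothetical helper, the 95th percentile uses the simple nearest-rank method, and a short `time.sleep` stands in for the real request so the example runs without an endpoint.

```python
import math
import statistics
import time


def measure_latency(send, n_requests: int) -> dict:
    """Call `send()` n_requests times and return client-side latency stats in ms."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        send()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    latencies_ms.sort()
    # nearest-rank 95th percentile on the sorted samples
    p95_index = math.ceil(0.95 * len(latencies_ms)) - 1
    return {
        "count": len(latencies_ms),
        "mean_ms": statistics.mean(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
        "max_ms": latencies_ms[-1],
    }


# Stand-in workload; in the notebook, pass something like
#   lambda: predictor.predict(english_payload,
#                             initial_args={"TargetContainerHostname": "english_model"})
stats = measure_latency(lambda: time.sleep(0.001), 20)
print(stats)
```

Running it once per container makes it easier to see whether one model is consistently slower than the other.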

## Delete the Multi-Container Endpoint

In [28]:
predictor.delete_model()
predictor.delete_endpoint()
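
As an alternative to the predictor methods, the three resources can be removed with `boto3` directly, for example when no predictor object is at hand. This is a sketch that assumes the `-ep`, `-config` and `-model` naming pattern used when the resources were created; `delete_deployment` is a hypothetical helper name.

```python
def delete_deployment(sm_client, deployment_name: str) -> None:
    """Delete the endpoint, its configuration and the model of one deployment."""
    # Names follow the f"{deployment_name}-..." pattern used at creation time.
    sm_client.delete_endpoint(EndpointName=f"{deployment_name}-ep")
    sm_client.delete_endpoint_config(EndpointConfigName=f"{deployment_name}-config")
    sm_client.delete_model(ModelName=f"{deployment_name}-model")


# Usage in the notebook (requires AWS credentials); use this instead of,
# not in addition to, the predictor-based cleanup:
#   import boto3
#   delete_deployment(boto3.client("sagemaker"), "multi-container-sentiment")
```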