<h3 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049; text-align: center; border-radius: 5px 5px; padding: 5px"> Hugging Face Transformers with Amazon SageMaker and Multi-Model Endpoints </h3>

<img src = "img/multi-model.jpg">

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Deploy multiple Transformer models to the same Amazon SageMaker Infrastructure </h2>

We can host thousands of models behind a single endpoint with [Amazon SageMaker multi-model endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html) (MMEs).

**Use cases of multi-model endpoints:**

- Multiple models that can be served from a common inference container and invoked on demand, where some additional latency for infrequently invoked models is acceptable.
- A/B/n testing, in cases where variable latency is tolerable and cost optimization matters more, in place of the more typical production-variant-based strategy.

**How do multi-model endpoints work?**

Amazon SageMaker takes care of loading models into, and unloading them from, the container’s memory as they are invoked. When SageMaker receives an inference request for a particular model, it does the following:

1. Routes the request to an instance assigned to that model.
2. Downloads only the invoked model from the S3 bucket to that instance’s storage volume, rather than downloading every model up front.
3. Loads the model into the container’s memory on that instance. If the model is already loaded in the container’s memory, SageMaker skips the download and load steps and performs the invocation immediately.

Suppose SageMaker has already loaded 100 models and needs to load another one into memory, but the instance’s memory utilization is high. What happens next?

1. SageMaker unloads unused models from that instance’s container memory to ensure that there is enough memory to load the model.
2. Models which are unloaded remain on the instance’s storage volume and can be loaded into the container’s memory later without being downloaded again from the S3 bucket.

What if the instance’s storage volume reaches its capacity?

SageMaker then deletes all unused models from the instance’s storage volume. As we can see, SageMaker manages loading and unloading models in the container’s memory for us.
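
The loading and unloading behaviour described above can be illustrated with a small simulation. This is only a toy sketch, not SageMaker's implementation: the class name `ToyModelCache`, the slot-based capacities, and the least-recently-used choice of which model to unload are all assumptions made for illustration (the text above only says that "unused" models are unloaded).

```python
from collections import OrderedDict


class ToyModelCache:
    """Toy two-tier cache: container memory backed by the instance storage volume.

    Capacities are counted in models rather than bytes, and memory eviction is
    least-recently-used. Both are simplifying assumptions.
    """

    def __init__(self, memory_slots, disk_slots):
        if memory_slots < 1 or disk_slots < memory_slots:
            raise ValueError("need 1 <= memory_slots <= disk_slots")
        self.memory_slots = memory_slots
        self.disk_slots = disk_slots
        self.memory = OrderedDict()  # loaded models, least recently used first
        self.disk = OrderedDict()    # downloaded artifacts on the storage volume
        self.downloads = 0           # number of simulated S3 downloads

    def invoke(self, name):
        """Serve a request and report where the model came from."""
        if name in self.memory:
            # Already loaded: the invocation happens immediately.
            self.memory.move_to_end(name)
            return "memory"

        # Unload least recently used models until there is room in memory.
        while len(self.memory) >= self.memory_slots:
            self.memory.popitem(last=False)

        if name in self.disk:
            # Unloaded earlier but still on the volume: no new download.
            source = "disk"
        else:
            source = "s3"
            self.downloads += 1
            if len(self.disk) >= self.disk_slots:
                # Volume is full: delete every artifact that is not loaded.
                for cached in [m for m in self.disk if m not in self.memory]:
                    del self.disk[cached]
            self.disk[name] = True

        self.memory[name] = True
        return source


cache = ToyModelCache(memory_slots=2, disk_slots=3)
history = [cache.invoke(m) for m in ["a", "b", "a", "c", "b", "d", "a"]]
print(history, cache.downloads)
```

In this run, the second request for `b` is served from the storage volume without a new download, while the last request for `a` has to go back to S3 because its artifact was deleted when the volume filled up.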

To add a new model to a multi-model endpoint, we upload it to S3 and invoke it. To delete an existing model from a multi-model endpoint, we stop sending requests to it and delete it from S3.

We will use the Hugging Face Inference DLCs (Deep Learning Containers) and Amazon SageMaker to deploy multiple transformer models behind a multi-model endpoint, which can improve endpoint utilization and optimize costs.

This notebook demonstrates how to host two pretrained transformer models in one container behind one endpoint:
1. A BERT model that extracts embeddings from text.
2. A GPT-2 model that generates synthetic text from a given prompt.

**NOTE**: At the time of writing, only `CPU` instances are supported for multi-model endpoints.

For more background, please refer to this [Medium article](https://medium.com/@vinayakshanawad/multi-model-endpoints-with-hugging-face-transformers-and-amazon-sagemaker-c0e5a3693fac).

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Development Environment and Permissions </h2>

NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances.

If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).
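
When running locally, `get_execution_role()` may not resolve to a usable role, so one option is to look the role up by name through IAM. The helper below is a minimal sketch under that assumption: the function name `resolve_role_arn` and the role name in the usage comment are hypothetical, and the client is passed in so the function can be exercised without AWS credentials.

```python
def resolve_role_arn(iam_client, role_name):
    """Look up the ARN of an existing IAM role by its name.

    `iam_client` is expected to behave like boto3.client("iam").
    """
    response = iam_client.get_role(RoleName=role_name)
    return response["Role"]["Arn"]


# Possible use on a local machine (requires configured AWS credentials;
# the role name below is a placeholder, not a role this notebook creates):
#   import boto3
#   role = resolve_role_arn(boto3.client("iam"), "my-sagemaker-execution-role")
```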

In [None]:
from sagemaker import get_execution_role
import boto3
import sagemaker

role = get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'huggingface-multimodel-deploy'
sm_client = boto3.client("sagemaker")


print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Retrieve Model Artifacts </h2>

#### `BERT model`

First we will download the model artifacts for the pretrained [BERT](https://arxiv.org/abs/1810.04805) model. BERT is a popular natural language processing (NLP) model that extracts meaning and context from text.

In [2]:
!pip install transformers==4.17.0 --quiet

In [3]:
import os
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

model_path = 'models/bertmodel/model'

# Create the target directory, including any missing parent directories.
os.makedirs(model_path, exist_ok=True)
    
model.save_pretrained(save_directory=model_path)
tokenizer.save_pretrained(save_directory=model_path)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


('models/bertmodel/model/tokenizer_config.json',
 'models/bertmodel/model/special_tokens_map.json',
 'models/bertmodel/model/vocab.txt',
 'models/bertmodel/model/added_tokens.json')

#### `GPT-2 model`

Next, we will download the model artifacts for the pretrained [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) model. GPT-2 is a popular text generation model developed by OpenAI. Given a text prompt, it can generate synthetic text that may follow.

In [4]:
import os
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

model_path = 'models/gptmodel/model'

# Create the target directory, including any missing parent directories.
os.makedirs(model_path, exist_ok=True)
    
model.save_pretrained(save_directory=model_path)
tokenizer.save_vocabulary(save_directory=model_path)

('models/gptmodel/model/vocab.json', 'models/gptmodel/model/merges.txt')

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Write the Inference Script </h2> 

#### `BERT model`

Since we are bringing our own model to SageMaker, we must create an inference script. The script runs inside our HuggingFace container. It should include a function for loading the model and, optionally, functions for generating predictions and for input/output processing. The HuggingFace container provides default implementations for generating a prediction and for input/output processing; by including these functions in your script, you override the defaults. You can find additional [details here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#serve-a-pytorch-model).
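
As a rough illustration of how these handler functions fit together, here is a minimal skeleton. It is not the script used in this notebook: the word-counting "model" is a stand-in so the example runs without any model files, and the `inputs` key for JSON payloads is an assumption.

```python
import json


def model_fn(model_dir):
    """Load the model; called when the container loads this archive."""
    # Placeholder "model": a real script would load weights from model_dir.
    return {"model_dir": model_dir, "predict": lambda text: len(text.split())}


def input_fn(request_body, request_content_type):
    """Deserialize the request payload into a Python object."""
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")
    if request_content_type == "application/json":
        return json.loads(request_body)["inputs"]  # assumed payload shape
    if request_content_type == "text/csv":
        return request_body
    raise ValueError(f"Unsupported content type: {request_content_type}")


def predict_fn(input_data, model):
    """Run the loaded model on the deserialized input."""
    return {"word_count": model["predict"](input_data)}


def output_fn(prediction, accept):
    """Serialize the prediction for the response body."""
    return json.dumps(prediction)


# Local dry run of the request path: load, deserialize, predict, serialize.
model = model_fn("/opt/ml/model")
data = input_fn(b"hello multi model", "text/csv")
print(output_fn(predict_fn(data, model), "application/json"))
```

Running the four functions in this order locally is a cheap way to catch serialization mistakes before packaging the script.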

**NOTE**:

1. Single-model deployment: To install additional libraries at container startup, we can add a requirements.txt file that lists the libraries to be installed with pip. Within the archive, the HuggingFace container expects all inference code and the requirements.txt file to be inside the code/ directory.

2. Multi-model deployment: To install additional libraries on the container, the libraries that would otherwise be listed in requirements.txt need to be installed with pip from within the inference script. Within the archive, the HuggingFace container expects all inference code to be inside the code/ directory.

In the next cell we'll see our inference script for the BERT model, which extracts embeddings from text.

You will notice that it uses the [transformers library from Hugging Face](https://huggingface.co/docs/transformers/index), which the script installs with pip. Any additional libraries we need must be installed in the same way.

In [6]:
!mkdir -p models/bertmodel/code

! cp source_dir/model1/inference.py models/bertmodel/code/inference.py

In [7]:
!pygmentize models/bertmodel/code/inference.py

import subprocess
subprocess.call(["pip", "install", "transformers==4.17.0"])
import os
import json
from transformers import BertTokenizer, BertModel

def model_fn(model_dir):
    """
    Load the model for inference
    """

    model_path = os.path.join(model_dir, 'model/')

    # Load BERT tokenizer from disk.
    tokenizer = BertTokenizer.from_pretrained(model_path)

    # Load BERT model from disk.
    model = BertModel.from_pretrained(model_path)

    model_dict = {'model'

#### `GPT-2 model`

In the next cell we'll see our inference script for the GPT-2 model, which generates synthetic text from a given prompt.

In [8]:
!mkdir -p models/gptmodel/code

! cp source_dir/model2/inference.py models/gptmodel/code/inference.py

In [9]:
!pygmentize models/gptmodel/code/inference.py

import subprocess
subprocess.call(["pip", "install", "transformers==4.17.0"])
import os
import json
from transformers import GPT2Tokenizer, TextGenerationPipeline, GPT2LMHeadModel

def model_fn(model_dir):
    """
    Load the model for inference
    """

    # Load GPT2 tokenizer from disk.
    vocab_path = os.path.join(model_dir, 'model/vocab.json')
    merges_path = os.path.join(model_dir, 'model/merges.txt')

    tokenizer = GPT2Tokenizer(vocab_file=vocab_path,

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Package Models </h2> 

For hosting, SageMaker requires that the deployment package be structured in a compatible format. It expects all files to be packaged in a gzip-compressed tar archive, conventionally named "model.tar.gz" (here we name the archives bertmodel.tar.gz and gptmodel.tar.gz). Within the archive, the HuggingFace container expects all inference code files to be inside the code/ directory. See the SageMaker documentation for a thorough explanation of the required directory structure.
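
The same archive layout can also be produced from Python with the standard-library `tarfile` module. The sketch below is an alternative to the shell command, not what this notebook runs; the function name `package_model` and the choice to skip `.ipynb_checkpoints` directories are assumptions.

```python
import os
import tarfile


def package_model(source_dir, archive_path, skip=(".ipynb_checkpoints",)):
    """Write a gzip-compressed tar with source_dir's contents at the archive root."""

    def _filter(member):
        # Drop any entry whose path contains a skipped directory name.
        parts = member.name.split("/")
        return None if any(part in skip for part in parts) else member

    with tarfile.open(archive_path, "w:gz") as archive:
        for entry in sorted(os.listdir(source_dir)):
            # arcname keeps code/ and model/ at the top level of the archive.
            archive.add(os.path.join(source_dir, entry), arcname=entry, filter=_filter)
    return archive_path


# Possible use with the directories prepared earlier in this notebook:
#   package_model("models/bertmodel", "models/bertmodel.tar.gz")
#   package_model("models/gptmodel", "models/gptmodel.tar.gz")
```

Writing the archive outside `source_dir`, as in the usage comment, avoids packaging the archive into itself.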

In [10]:
!tar -czvf models/bertmodel.tar.gz -C models/bertmodel/ .
!tar -czvf models/gptmodel.tar.gz -C models/gptmodel/ .

./
./code/
./code/inference.py
./.ipynb_checkpoints/
./model/
./model/config.json
./model/tokenizer_config.json
./model/pytorch_model.bin
./model/vocab.txt
./model/special_tokens_map.json
./
./code/
./code/inference.py
./.ipynb_checkpoints/
./model/
./model/config.json
./model/merges.txt
./model/vocab.json
./model/pytorch_model.bin


<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Upload multiple HuggingFace models to S3 </h2> 

In [None]:
from sagemaker.s3 import S3Uploader

models_path = 's3://{0}/{1}/models'.format(bucket,prefix)

S3Uploader.upload('models/bertmodel.tar.gz', models_path)
S3Uploader.upload('models/gptmodel.tar.gz', models_path)

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Create Multi-Model Endpoint </h2> 

After we upload the BERT and GPT-2 models to S3, we can deploy our endpoint. To create and deploy a real-time endpoint with boto3, you need to create a "SageMaker Model", a "SageMaker Endpoint Configuration" and a "SageMaker Endpoint". The "SageMaker Model" contains our multi-model configuration, including the S3 path where we uploaded the Hugging Face model archives. The "SageMaker Endpoint Configuration" contains the configuration for the endpoint. The "SageMaker Endpoint" is the actual endpoint.

To check whether a pre-built container is capable of loading and serving multiple models concurrently, verify that its Dockerfile sets the `multi-models` LABEL:

LABEL com.amazonaws.sagemaker.capabilities.multi-models=true

In [None]:
# create SageMaker Model
image_uri = "763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04"
# S3 prefix that holds the uploaded model archives (same location as models_path above).
multimodels_path = f's3://{bucket}/{prefix}/models/'
deployment_name = "huggingface-multi-model"

primary_container = {
    'Image': image_uri,
    'Mode': 'MultiModel',
    'ModelDataUrl': multimodels_path,
    'Environment': {
        'SAGEMAKER_PROGRAM': 'inference.py',
        'SAGEMAKER_REGION': region,
        'SAGEMAKER_SUBMIT_DIRECTORY': multimodels_path
    }
}

create_model_response = sm_client.create_model(ModelName = deployment_name,
                                              ExecutionRoleArn = get_execution_role(),
                                              PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

# create SageMaker Endpoint configuration
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = f"{deployment_name}-epc",
    ProductionVariants=[
        {
        'InstanceType':'ml.m5.4xlarge',
        'InitialInstanceCount':1,
        'ModelName': deployment_name,
        'VariantName':'AllTraffic',
        'InitialVariantWeight':1
        }
    ])

print('Endpoint configuration arn:  {}'.format(endpoint_config_response['EndpointConfigArn']))

# create SageMaker Endpoint
endpoint_params = {
    'EndpointName': f"{deployment_name}-ep",
    'EndpointConfigName': f"{deployment_name}-epc",
}
endpoint_response = sm_client.create_endpoint(**endpoint_params)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))
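
Endpoint creation is asynchronous, so invoking the endpoint immediately after `create_endpoint` returns is likely to fail. A minimal sketch for blocking until the endpoint is ready is shown below; it relies on the boto3 `endpoint_in_service` waiter, and the helper name `wait_until_in_service` is hypothetical.

```python
def wait_until_in_service(sagemaker_client, endpoint_name):
    """Block until the endpoint is InService, then return its reported status.

    `sagemaker_client` is expected to behave like boto3.client("sagemaker").
    The waiter raises an error if the endpoint fails or the wait times out.
    """
    waiter = sagemaker_client.get_waiter("endpoint_in_service")
    waiter.wait(EndpointName=endpoint_name)
    description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    return description["EndpointStatus"]


# Possible use with the names defined in the cell above:
#   status = wait_until_in_service(sm_client, f"{deployment_name}-ep")
#   print(status)
```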

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Get Predictions </h2> 

#### `BERT model`
Now that our API endpoint is deployed, we can send it text to get predictions from our BERT model. You can use the SageMaker SDK or the SageMaker Runtime API to invoke the endpoint.

In [13]:
import boto3

invoke_client = boto3.client('sagemaker-runtime')

prompt = "The best part of Amazon SageMaker is that it makes machine learning easy."

response = invoke_client.invoke_endpoint(EndpointName=f"{deployment_name}-ep",
                              TargetModel='bertmodel.tar.gz',
                              Body=prompt.encode(encoding='UTF-8'),
                              ContentType='text/csv')

response['Body'].read()



b'BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.2462, -0.0988,  0.1747,  ..., -0.4059,  0.0966,  0.6564],\n         [-0.1352, -0.5824, -0.0728,  ..., -0.1726,  0.5765,  0.1273],\n         [-0.1491, -0.4218,  0.2821,  ...,  0.1332,  0.5053, -0.2813],\n         ...,\n         [-0.8054, -0.3126,  0.6776,  ..., -0.0572,  0.0806, -0.0318],\n         [ 0.7608,  0.1367, -0.2650,  ...,  0.1246, -0.5977, -0.2397],\n         [ 0.4660,  0.2762,  0.0636,  ...,  0.1112, -0.5502, -0.2997]]],\n       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-7.0429e-01, -4.2229e-01, -9.7203e-01,  6.3414e-01,  8.6010e-01,\n         -3.5008e-01,  3.7999e-02,  2.2652e-01, -8.5239e-01, -9.9980e-01,\n         -6.4649e-01,  8.4232e-01,  8.9319e-01,  6.2476e-01,  4.8914e-01,\n         -3.7195e-01,  1.0597e-01, -5.0569e-01,  3.3702e-01,  7.3767e-01,\n          6.0322e-01,  1.0000e+00, -3.2281e-01,  4.7648e-01,  4.4296e-01,\n          9.4813e-01, -6.6813e-01,  7.4915e-01,  8.22

#### `GPT-2 model`

Now that our RESTful API endpoint is deployed, we can send it text to get predictions from our GPT-2 model. You can use the SageMaker Python SDK or the SageMaker Runtime API to invoke the endpoint.

In [14]:
import boto3
import json

invoke_client = boto3.client('sagemaker-runtime')

prompt = "Working with SageMaker makes machine learning "

response = invoke_client.invoke_endpoint(EndpointName=f"{deployment_name}-ep",
                              TargetModel='gptmodel.tar.gz', 
                              Body=json.dumps(prompt),
                              ContentType='text/csv')

response['Body'].read().decode('utf-8')

'[{\'generated_text\': \'"Working with SageMaker makes machine learning "a lot easier" than it used to be.\\n\'}]'
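
Both invocations differ only in the `TargetModel` value, so they can be wrapped in one small helper. This is a sketch rather than part of the original notebook: the function name `invoke_target_model` is hypothetical, and it assumes the inference script accepts plain UTF-8 text with the `text/csv` content type, as in the BERT cell above.

```python
def invoke_target_model(runtime_client, endpoint_name, target_model, text,
                        content_type="text/csv"):
    """Send text to one model behind a multi-model endpoint and return the reply.

    `runtime_client` is expected to behave like boto3.client("sagemaker-runtime").
    `target_model` is the archive name relative to the endpoint's S3 prefix.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        TargetModel=target_model,
        Body=text.encode("utf-8"),
        ContentType=content_type,
    )
    return response["Body"].read().decode("utf-8")


# Possible use with the client and names defined above:
#   invoke_target_model(invoke_client, f"{deployment_name}-ep",
#                       "gptmodel.tar.gz", "Working with SageMaker makes machine learning ")
```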

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Dynamically deploying models and Updating a model to the endpoint </h2> 

To deploy a new model dynamically, or to update an existing one, follow the same approach as above and add it as a new model. For example, if you have retrained the bertmodel.tar.gz model and want to start invoking it, upload the updated model artifacts behind the following S3 prefix under a new name such as bertmodel_v2.tar.gz, and then change the TargetModel field to invoke bertmodel_v2.tar.gz instead of bertmodel.tar.gz.

multimodels_path = 's3://sagemaker-us-east-2-(account-id)/huggingface-multimodel-deploy/models/'

You should avoid overwriting model artifacts in Amazon S3, because the old version of the model might still be loaded in the endpoint's running container(s) or stored on the storage volume of the endpoint's instances. Invocations would then continue to use the old version of the model.
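
One way to avoid overwriting an artifact is to derive a fresh archive name before each upload. The helper below is a sketch that follows the `_v2` naming used in the example above; the function name `next_versioned_name` and the rule that an unversioned archive counts as version 1 are assumptions.

```python
import re


def next_versioned_name(current_name, existing_names):
    """Return the next unused '<stem>_v<N>.tar.gz' name after current_name."""
    match = re.fullmatch(r"(?P<stem>.+?)(?:_v(?P<version>\d+))?\.tar\.gz", current_name)
    if match is None:
        raise ValueError(f"Expected a .tar.gz archive name, got: {current_name}")

    stem = match.group("stem")
    # An archive without a suffix is treated as version 1 (assumption).
    version = int(match.group("version") or 1) + 1
    candidate = f"{stem}_v{version}.tar.gz"
    while candidate in existing_names:
        # Skip names that are already present under the S3 prefix.
        version += 1
        candidate = f"{stem}_v{version}.tar.gz"
    return candidate


print(next_versioned_name("bertmodel.tar.gz", {"bertmodel.tar.gz", "gptmodel.tar.gz"}))
```

The returned name is what you would upload under the S3 prefix and then pass as `TargetModel`.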

Alternatively, you could stop the endpoint and re-deploy a fresh set of models.

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Delete the Multi-Model Endpoint </h2> 

In [None]:
sm_client.delete_model(ModelName=deployment_name)
sm_client.delete_endpoint_config(EndpointConfigName=f"{deployment_name}-epc")
sm_client.delete_endpoint(EndpointName=f"{deployment_name}-ep")

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Conclusion </h2> 

We successfully deployed two Hugging Face Transformers models to Amazon SageMaker for inference using a multi-model endpoint. Multi-model endpoints are a great option for optimizing compute utilization and costs, especially when you have independent inference workloads due to use-case differences.

Thanks for reading! If you have any questions, feel free to contact me.