# SageMaker Inference Recommender for HuggingFace BERT Sentiment Analysis



## Contents
[1. Introduction](#1.-Introduction)  
[2. Download a pre-trained Model](#2.-Download-a-pre-trained-Model)  
[3. Machine Learning model details](#3.-Machine-Learning-model-details)  
[4. Create SageMaker Model](#4.-Create-SageMaker-Model)  
[5. Create a SageMaker Inference Recommender Default Job](#5.-Create-a-SageMaker-Inference-Recommender-Default-Job)  
[6. Instance Recommendation Results](#6.-Instance-Recommendation-Results)  
[7. Create an Endpoint for lowest latency real-time inference](#7.-Create-an-Endpoint-for-lowest-latency-real-time-inference)  
[8. Clean up](#8.-Clean-up)  
[9. Conclusion](#9.-Conclusion)


## 1. Introduction

SageMaker Inference Recommender is a capability of SageMaker that reduces the time required to get machine learning (ML) models into production by automating performance benchmarking and load testing of models across SageMaker ML instances. You can use Inference Recommender to deploy your model to a real-time inference endpoint that delivers the best performance at the lowest cost.

You can get started with Inference Recommender in minutes when selecting an instance, and obtain an optimized endpoint configuration within hours, avoiding weeks of manual testing and tuning.


To begin, let's update the required packages: the SageMaker Python SDK, `boto3`, `botocore`, `awscli`, and `transformers`.

In [6]:
import sys

!{sys.executable} -m pip install --upgrade --quiet pip sagemaker botocore boto3 awscli transformers


If you run this notebook in SageMaker Studio, make sure `ipywidgets` is installed and restart the kernel by running the next cell.


In [7]:
%%capture
import IPython

!{sys.executable} -m pip install ipywidgets
IPython.Application.instance().kernel.do_shutdown(True)  # has to restart kernel so changes are used

## 2. Download a pre-trained Model

In this example, we are using a `Huggingface` pre-trained `sentiment-analysis` model.

You can learn more about it in the 🤗 Transformers library Quick tour: https://huggingface.co/docs/transformers/quicktour

In [1]:
from sagemaker import get_execution_role, Session, image_uris
import sagemaker
import pandas as pd
import boto3
import datetime
import time
import os
import re
import copy
from time import gmtime, strftime
import pprint
import utils


region = boto3.Session().region_name
role = get_execution_role()
sagemaker_session = Session()

payload_archive_name = "hf_payload.tar.gz"
print(region)
%store payload_archive_name



us-east-1
Stored 'payload_archive_name' (str)


In [2]:
%store
%store -r

Stored variables and their in-db values:
deploy_instance_type                  -> 'ml.m5.xlarge'
distilbert_model_name                 -> 'hf-pytorch-model-distilbert-2023-08-25-03-36-14'
model_data_path                       -> 's3://sagemaker-us-east-1-805087355833/sagemaker/h
model_distilbert_uri                  -> 's3://sagemaker-us-east-1-805087355833/sagemaker/h
model_package_group_name              -> 'HuggingFaceModels'
model_roberta_script_uri              -> 's3://sagemaker-us-east-1-805087355833/sagemaker/h
model_roberta_uri                     -> 's3://sagemaker-us-east-1-805087355833/sagemaker/h
payload_archive_name                  -> 'hf_payload.tar.gz'
roberta_model_name                    -> 'hf-pytorch-model-roberta-2023-08-25-03-36-14'
roberta_script_model_name             -> 'hf-pytorch-model-roberta-script-2023-08-25-03-36-


### Tar the payload

In [3]:
!cd ./sample_payload/ && tar czvf ../{payload_archive_name} test_data.csv

test_data.csv


### Upload the model and payload to S3

The model archive and the payload archive are now ready. Inference Recommender reads both from S3. The model archive is already there (its URI was restored into `model_distilbert_uri`), so we use the SageMaker Python SDK to upload the payload archive.

The payload archive contains individual files that Inference Recommender can send to your SageMaker endpoints. Inference Recommender randomly samples files from this archive, so make sure it contains a distribution of payloads similar to what you expect in production. Note that your inference code must be able to read the file formats in the sample payload.

In [4]:
%%time

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket and prefix
bucket = sagemaker_session.default_bucket()

prefix = "sagemaker/huggingface-pytorch-inference-recommender"

sample_payload_url = sagemaker_session.upload_data(
    payload_archive_name, bucket=bucket, key_prefix=prefix + "/inference"
)
# The model archive is already in S3; reuse the stored URI
model_url = model_distilbert_uri

print(sample_payload_url)
print(model_url)

s3://sagemaker-us-east-1-805087355833/sagemaker/huggingface-pytorch-inference-recommender/inference/hf_payload.tar.gz
s3://sagemaker-us-east-1-805087355833/sagemaker/huggingface-pytorch-sentiment-analysis/models/model_distilbert.tar.gz
CPU times: user 129 ms, sys: 3.84 ms, total: 133 ms
Wall time: 320 ms
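
The shell `tar` command used earlier can also be expressed in Python, which helps when the archive should hold several payload files. The following is a minimal sketch rather than part of the original notebook: `build_payload_archive` and `list_archive_members` are hypothetical helper names, and the only assumption is that each payload file should sit at the top level of the archive.

```python
import tarfile
from pathlib import Path


def build_payload_archive(payload_files, archive_path):
    """Pack sample payload files into a .tar.gz, one top-level member per file."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for payload_file in payload_files:
            payload_file = Path(payload_file)
            # arcname drops the directory part, mirroring `cd dir && tar czvf ...`
            archive.add(payload_file, arcname=payload_file.name)
    return archive_path


def list_archive_members(archive_path):
    """Return the member names, useful as a sanity check before uploading."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

For this notebook the call would be `build_payload_archive(["./sample_payload/test_data.csv"], payload_archive_name)`.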


## 3. Machine Learning model details

Inference Recommender uses information about your ML model to recommend the best instance types and endpoint configurations for deployment. You can provide as much or as little information as you'd like and Inference Recommender will use that to provide recommendations.

Example ML Domains: `COMPUTER_VISION`, `NATURAL_LANGUAGE_PROCESSING`, `MACHINE_LEARNING`

Example ML Tasks: `CLASSIFICATION`, `REGRESSION`, `OBJECT_DETECTION`, `OTHER`

Note: Select the task that is the closest match to your model. Choose `OTHER` if none apply.

Example Model names: `resnet50`, `yolov4`, `xgboost`, etc.

Use the `list_model_metadata` API to fetch the list of available models. This helps you pick the closest model for a better recommendation.

In [6]:
client = boto3.client("sagemaker", region_name=region)

list_model_metadata_response = client.list_model_metadata()

domains = []
frameworks = []
framework_versions = []
tasks = []
models = []

for model_summary in list_model_metadata_response["ModelMetadataSummaries"]:
    domains.append(model_summary["Domain"])
    tasks.append(model_summary["Task"])
    models.append(model_summary["Model"])
    frameworks.append(model_summary["Framework"])
    framework_versions.append(model_summary["FrameworkVersion"])

data = {
    "Domain": domains,
    "Task": tasks,
    "Framework": frameworks,
    "FrameworkVersion": framework_versions,
    "Model": models,
}

df = pd.DataFrame(data)

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 1000)
pd.set_option("display.colheader_justify", "center")
pd.set_option("display.precision", 3)


display(df.sort_values(by=["Domain", "Task", "Framework", "FrameworkVersion"]))

| | Domain | Task | Framework | FrameworkVersion | Model |
|---|---|---|---|---|---|
| 9 | COMPUTER_VISION | IMAGE_CLASSIFICATION | MXNET | 1.8.0 | densenet201-gluon |
| 10 | COMPUTER_VISION | IMAGE_CLASSIFICATION | MXNET | 1.8.0 | resnet18v2-gluon |
| 14 | COMPUTER_VISION | IMAGE_CLASSIFICATION | PYTORCH | 1.6.0 | resnet152 |
| 0 | COMPUTER_VISION | IMAGE_CLASSIFICATION | TENSORFLOW | 1.15.5 | efficientnetb7 |
| 4 | COMPUTER_VISION | IMAGE_CLASSIFICATION | TENSORFLOW | 1.15.5 | nasnetlarge |
| 5 | COMPUTER_VISION | IMAGE_CLASSIFICATION | TENSORFLOW | 1.15.5 | vgg16 |
| 6 | COMPUTER_VISION | IMAGE_CLASSIFICATION | TENSORFLOW | 1.15.5 | inception-v3 |
| 11 | COMPUTER_VISION | IMAGE_CLASSIFICATION | TENSORFLOW | 1.15.5 | xception |
| 12 | COMPUTER_VISION | IMAGE_CLASSIFICATION | TENSORFLOW | 1.15.5 | densenet201 |
| 17 | COMPUTER_VISION | IMAGE_CLASSIFICATION | TENSORFLOW | 1.15.5 | xceptionV1-keras |
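
`list_model_metadata` is a paginated API, so a single call may return only the first page of models. The following is a minimal sketch of collecting every page, assuming the response carries a `NextToken` field whenever more results exist; `collect_model_metadata` is a hypothetical helper name, not part of the notebook or of boto3.

```python
def collect_model_metadata(sm_client):
    """Gather every ModelMetadataSummaries entry by following NextToken."""
    summaries = []
    kwargs = {}
    while True:
        page = sm_client.list_model_metadata(**kwargs)
        summaries.extend(page.get("ModelMetadataSummaries", []))
        next_token = page.get("NextToken")
        if not next_token:
            # No token means this was the last page
            return summaries
        kwargs["NextToken"] = next_token
```

The returned list can be passed to `pd.DataFrame(...)` directly, since each summary is a dictionary with the `Domain`, `Task`, `Framework`, `FrameworkVersion`, and `Model` keys used in the cell above.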


In this example, we run sentiment analysis with `HuggingFace` `BERT`, so we select `NATURAL_LANGUAGE_PROCESSING` as the Domain, `FILL_MASK` as the Task, `PYTORCH` as the Framework, and `bert-base-uncased` as the Model.

In [7]:
ml_domain = "NATURAL_LANGUAGE_PROCESSING"
ml_task = "FILL_MASK"
ml_framework = "PYTORCH"
framework_version = "1.6.0"
model = "bert-base-uncased"

### Container image URL

If you don’t have an inference container image, you can use [Prebuilt SageMaker Docker Images for Deep Learning](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) provided by AWS to serve your ML model.

In [8]:
# ML model details
model_name = "huggingface-pytorch-" + datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

inference_image = image_uris.retrieve(
    framework="pytorch",
    region=region,
    version="1.10.2",
    py_version="py38",
    instance_type="ml.m5.xlarge",
    image_scope="inference",
)

print(inference_image)

763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.10.2-cpu-py38


## 4. Create SageMaker Model
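Create a SageMaker model from the inference container image and the model archive in S3. The Inference Recommender job in the next section references this model by name.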


In [9]:
model_name="HuggingFace-PyTorch-Inference-Recommender-Demo"

In [10]:
response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            "ContainerHostname": "huggingface-pytorch",
            "Image": inference_image,
            "Mode": "SingleModel",
            "ModelDataUrl": model_url,  # S3 path to the model archive
            "Environment": {
                "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                "SAGEMAKER_PROGRAM": "inference.py",
                "SAGEMAKER_REGION": region,
                "SAGEMAKER_SUBMIT_DIRECTORY": model_url,  # S3 path to the model archive
            },
        },
    ],
    ExecutionRoleArn=role,
    EnableNetworkIsolation=False,
)

## 5. Create a SageMaker Inference Recommender Default Job

Now with your model you can kick off a 'Default' job to get instance recommendations.

The `SamplePayloadUrl` and `SupportedContentTypes` parameters are essential for benchmarking the endpoint. We also highly recommend that you specify `Domain`, `Task`, `Framework`, `FrameworkVersion`, and `NearestModelName` for better recommendations.

The output is a list of instance type recommendations with associated environment variables, cost, throughput and latency metrics.

In [11]:
default_job = "huggingface-pytorch-basic-recommender-job-" + datetime.datetime.now().strftime(
    "%Y-%m-%d-%H-%M-%S"
)
default_response = client.create_inference_recommendations_job(
    JobName=str(default_job),
    JobDescription="HuggingFace PyTorch Inference Basic Recommender Job",
    JobType="Default",
    RoleArn=role,
    InputConfig={
        "ModelName": model_name,
        "ContainerConfig": {
            "Domain": ml_domain,
            "Task": ml_task,
            "Framework": ml_framework,
            "FrameworkVersion": framework_version,
            "PayloadConfig": {
                "SamplePayloadUrl": sample_payload_url,
                "SupportedContentTypes": ["text/csv"],
            },
            "NearestModelName": model,
            "SupportedInstanceTypes": ["ml.c5.xlarge", "ml.c5.2xlarge"],
            "SupportedResponseMIMETypes": ["text/csv"],
        },
    },
)

print(default_response)

{'JobArn': 'arn:aws:sagemaker:us-east-1:805087355833:inference-recommendations-job/huggingface-pytorch-basic-recommender-job-2023-09-07-01-08-45', 'ResponseMetadata': {'RequestId': '67f2226c-fb1d-405f-b450-0ab28ebfab23', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '67f2226c-fb1d-405f-b450-0ab28ebfab23', 'content-type': 'application/x-amz-json-1.1', 'content-length': '145', 'date': 'Thu, 07 Sep 2023 01:08:49 GMT'}, 'RetryAttempts': 0}}


## 6. Instance Recommendation Results

The Inference Recommender job returns multiple endpoint recommendations. Each recommendation includes `InstanceType`, `InitialInstanceCount`, and `EnvironmentParameters`, which holds parameters tuned for better performance. It also includes benchmarking results such as `MaxInvocations`, `ModelLatency`, `CostPerHour`, and `CostPerInference` for deeper analysis. This information helps you narrow down to an endpoint configuration that suits your use case.

Example:   

If your motivation is overall price-performance, focus on the `CostPerInference` metric.  
If your motivation is latency or throughput, focus on the `ModelLatency` or `MaxInvocations` metrics.

Running the Inference Recommender job takes about 25 minutes.

In [None]:
%%time

ended = False
while not ended:
    inference_recommender_job = client.describe_inference_recommendations_job(
        JobName=str(default_job)
    )
    if inference_recommender_job["Status"] in ["COMPLETED", "STOPPED", "FAILED"]:
        ended = True
    else:
        print("Inference recommender job in progress")
        time.sleep(60)

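if inference_recommender_job["Status"] == "FAILED":
    print("Inference recommender job failed")
    print("Failed Reason: {}".format(inference_recommender_job.get("FailureReason")))
elif inference_recommender_job["Status"] == "STOPPED":
    print("Inference recommender job stopped")
else:
    print("Inference recommender job completed")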

Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job in progress
Inference recommender job completed
CPU times: user 447 ms, sys: 41 ms, total: 488 ms
Wa
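
The polling loop above runs until the job reaches a terminal status, however long that takes. A variant with an upper bound on the wait can be sketched as follows. This is an illustration under stated assumptions, not the notebook's own code: `wait_for_recommendation_job` is a hypothetical helper, the two-hour default timeout is arbitrary, and the `sleep` parameter exists only so the function can be exercised without real delays.

```python
import time


def wait_for_recommendation_job(
    sm_client, job_name, poll_seconds=60, timeout_seconds=7200, sleep=time.sleep
):
    """Poll until the job reaches a terminal status or the timeout elapses."""
    terminal_statuses = {"COMPLETED", "STOPPED", "FAILED"}
    waited = 0
    while True:
        job = sm_client.describe_inference_recommendations_job(JobName=job_name)
        if job["Status"] in terminal_statuses:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"{job_name} still {job['Status']} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

In the notebook this would be called as `inference_recommender_job = wait_for_recommendation_job(client, default_job)`, after which the status check proceeds as before.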

You can check the job status from the Inference Recommender console. In the navigation pane, under **Home**, **Deployments**, select **Inference recommender**.
<div>
    <img src="./images/IR1.png" alt="Image IR1" width="300" style="display:inline-block">
</div>
<br>

This will open a new tab showing all the Inference Recommender (IR) jobs.
<div>
    <img src="./images/IR2.png" alt="Image IR2" width="800" style="display:inline-block">
</div>
<br>

Once the job finishes, you can click into the job and see the summary of model performance using different instances and the job details.
<div>
    <img src="./images/IR3.png" alt="Image IR3" width="800" style="display:inline-block">
</div>
<br>

### Detailing out the result

In [13]:
data = [
    {**x["EndpointConfiguration"], **x["ModelConfiguration"], **x["Metrics"]}
    for x in inference_recommender_job["InferenceRecommendations"]
]
df = pd.DataFrame(data)
# Drop the VariantName column if it is present
df = df.drop(columns=["VariantName"], errors="ignore")
pd.set_option("display.max_colwidth", 400)

Let's sort the result `dataframe` in descending order by `MaxInvocations`, the maximum number of requests per minute expected for the endpoint.

In [14]:
df.sort_values(by=["MaxInvocations"], ascending=False).head()

Unnamed: 0,EndpointName,ServerlessConfig,EnvironmentParameters,CostPerHour,CostPerInference,MaxInvocations,ModelLatency,MemoryUtilization,InstanceType,InitialInstanceCount,CpuUtilization
3,huggingface-pytorch-basic-recommender-jo-WVxO9dO0uFSNN7JaGLRf,,"[{'Key': 'SAGEMAKER_MODEL_SERVER_WORKERS', 'ValueType': 'String', 'Value': '8'}, {'Key': 'OMP_NUM_THREADS', 'ValueType': 'String', 'Value': '1'}, {'Key': 'SAGEMAKER_SUBMIT_DIRECTORY', 'ValueType': 'String', 'Value': 's3://sagemaker-us-east-1-805087355833/sagemaker/huggingface-pytorch-sentiment-analysis/models/model_distilbert.tar.gz'}, {'Key': 'SAGEMAKER_CONTAINER_LOG_LEVEL', 'ValueType': 'Str...",0.408,4.613e-06,1474,416,27.775,ml.c5.2xlarge,1.0,773.462
1,huggingface-pytorch-basic-recommender-jo-RUHE9zc7rHHFOC0aGLre,,"[{'Key': 'SAGEMAKER_MODEL_SERVER_WORKERS', 'ValueType': 'String', 'Value': '4'}, {'Key': 'OMP_NUM_THREADS', 'ValueType': 'String', 'Value': '1'}, {'Key': 'SAGEMAKER_SUBMIT_DIRECTORY', 'ValueType': 'String', 'Value': 's3://sagemaker-us-east-1-805087355833/sagemaker/huggingface-pytorch-sentiment-analysis/models/model_distilbert.tar.gz'}, {'Key': 'SAGEMAKER_CONTAINER_LOG_LEVEL', 'ValueType': 'Str...",0.204,4.404e-06,772,605,30.156,ml.c5.xlarge,1.0,384.154
2,huggingface-pytorch-basic-recommender-jo-YAREG1RJh4k1GLaeaE0q,"{'MemorySizeInMB': 6144, 'MaxConcurrency': 1}","[{'Key': 'SAGEMAKER_SUBMIT_DIRECTORY', 'ValueType': 'String', 'Value': 's3://sagemaker-us-east-1-805087355833/sagemaker/huggingface-pytorch-sentiment-analysis/models/model_distilbert.tar.gz'}, {'Key': 'SAGEMAKER_CONTAINER_LOG_LEVEL', 'ValueType': 'String', 'Value': '20'}, {'Key': 'SAGEMAKER_REGION', 'ValueType': 'String', 'Value': 'us-east-1'}, {'Key': 'SAGEMAKER_PROGRAM', 'ValueType': 'String...",0.432,1.452e-05,496,95,13.412,,,
0,huggingface-pytorch-basic-recommender-jo-K3RXOyxuTuGifLjbfjAj,"{'MemorySizeInMB': 5120, 'MaxConcurrency': 1}","[{'Key': 'SAGEMAKER_SUBMIT_DIRECTORY', 'ValueType': 'String', 'Value': 's3://sagemaker-us-east-1-805087355833/sagemaker/huggingface-pytorch-sentiment-analysis/models/model_distilbert.tar.gz'}, {'Key': 'SAGEMAKER_CONTAINER_LOG_LEVEL', 'ValueType': 'String', 'Value': '20'}, {'Key': 'SAGEMAKER_REGION', 'ValueType': 'String', 'Value': 'us-east-1'}, {'Key': 'SAGEMAKER_PROGRAM', 'ValueType': 'String...",0.36,1.274e-05,471,91,15.187,,,


This time, let's sort the result `dataframe` by `ModelLatency`, the interval of time taken by a model to respond as viewed from SageMaker. The interval includes the local communication time taken to send the request and to fetch the response from the model's container, plus the time taken to complete the inference in the container.

In [15]:
df.sort_values(by=["ModelLatency"]).head()

Unnamed: 0,EndpointName,ServerlessConfig,EnvironmentParameters,CostPerHour,CostPerInference,MaxInvocations,ModelLatency,MemoryUtilization,InstanceType,InitialInstanceCount,CpuUtilization
0,huggingface-pytorch-basic-recommender-jo-K3RXOyxuTuGifLjbfjAj,"{'MemorySizeInMB': 5120, 'MaxConcurrency': 1}","[{'Key': 'SAGEMAKER_SUBMIT_DIRECTORY', 'ValueType': 'String', 'Value': 's3://sagemaker-us-east-1-805087355833/sagemaker/huggingface-pytorch-sentiment-analysis/models/model_distilbert.tar.gz'}, {'Key': 'SAGEMAKER_CONTAINER_LOG_LEVEL', 'ValueType': 'String', 'Value': '20'}, {'Key': 'SAGEMAKER_REGION', 'ValueType': 'String', 'Value': 'us-east-1'}, {'Key': 'SAGEMAKER_PROGRAM', 'ValueType': 'String...",0.36,1.274e-05,471,91,15.187,,,
2,huggingface-pytorch-basic-recommender-jo-YAREG1RJh4k1GLaeaE0q,"{'MemorySizeInMB': 6144, 'MaxConcurrency': 1}","[{'Key': 'SAGEMAKER_SUBMIT_DIRECTORY', 'ValueType': 'String', 'Value': 's3://sagemaker-us-east-1-805087355833/sagemaker/huggingface-pytorch-sentiment-analysis/models/model_distilbert.tar.gz'}, {'Key': 'SAGEMAKER_CONTAINER_LOG_LEVEL', 'ValueType': 'String', 'Value': '20'}, {'Key': 'SAGEMAKER_REGION', 'ValueType': 'String', 'Value': 'us-east-1'}, {'Key': 'SAGEMAKER_PROGRAM', 'ValueType': 'String...",0.432,1.452e-05,496,95,13.412,,,
3,huggingface-pytorch-basic-recommender-jo-WVxO9dO0uFSNN7JaGLRf,,"[{'Key': 'SAGEMAKER_MODEL_SERVER_WORKERS', 'ValueType': 'String', 'Value': '8'}, {'Key': 'OMP_NUM_THREADS', 'ValueType': 'String', 'Value': '1'}, {'Key': 'SAGEMAKER_SUBMIT_DIRECTORY', 'ValueType': 'String', 'Value': 's3://sagemaker-us-east-1-805087355833/sagemaker/huggingface-pytorch-sentiment-analysis/models/model_distilbert.tar.gz'}, {'Key': 'SAGEMAKER_CONTAINER_LOG_LEVEL', 'ValueType': 'Str...",0.408,4.613e-06,1474,416,27.775,ml.c5.2xlarge,1.0,773.462
1,huggingface-pytorch-basic-recommender-jo-RUHE9zc7rHHFOC0aGLre,,"[{'Key': 'SAGEMAKER_MODEL_SERVER_WORKERS', 'ValueType': 'String', 'Value': '4'}, {'Key': 'OMP_NUM_THREADS', 'ValueType': 'String', 'Value': '1'}, {'Key': 'SAGEMAKER_SUBMIT_DIRECTORY', 'ValueType': 'String', 'Value': 's3://sagemaker-us-east-1-805087355833/sagemaker/huggingface-pytorch-sentiment-analysis/models/model_distilbert.tar.gz'}, {'Key': 'SAGEMAKER_CONTAINER_LOG_LEVEL', 'ValueType': 'Str...",0.204,4.404e-06,772,605,30.156,ml.c5.xlarge,1.0,384.154
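
The guidance on choosing between cost, latency, and throughput can be written as a small selection helper over the raw `InferenceRecommendations` list. This is a minimal sketch: `pick_recommendation` and the objective names are hypothetical, and the sample records carry only the metric values shown in the tables above, with every other field omitted.

```python
def pick_recommendation(recommendations, objective):
    """Return the recommendation that best matches the stated objective.

    objective: "cost" (lowest CostPerInference), "latency" (lowest ModelLatency),
    or "throughput" (highest MaxInvocations).
    """
    selectors = {
        "cost": lambda rec: rec["Metrics"]["CostPerInference"],
        "latency": lambda rec: rec["Metrics"]["ModelLatency"],
        # Negate so that min() picks the highest throughput
        "throughput": lambda rec: -rec["Metrics"]["MaxInvocations"],
    }
    if objective not in selectors:
        raise ValueError(f"unknown objective: {objective!r}")
    return min(recommendations, key=selectors[objective])


# Metric values copied from the result tables above; other fields omitted.
sample_recommendations = [
    {
        "EndpointConfiguration": {"ServerlessConfig": {"MemorySizeInMB": 5120, "MaxConcurrency": 1}},
        "Metrics": {"CostPerInference": 1.274e-05, "MaxInvocations": 471, "ModelLatency": 91},
    },
    {
        "EndpointConfiguration": {"InstanceType": "ml.c5.xlarge", "InitialInstanceCount": 1},
        "Metrics": {"CostPerInference": 4.404e-06, "MaxInvocations": 772, "ModelLatency": 605},
    },
    {
        "EndpointConfiguration": {"ServerlessConfig": {"MemorySizeInMB": 6144, "MaxConcurrency": 1}},
        "Metrics": {"CostPerInference": 1.452e-05, "MaxInvocations": 496, "ModelLatency": 95},
    },
    {
        "EndpointConfiguration": {"InstanceType": "ml.c5.2xlarge", "InitialInstanceCount": 1},
        "Metrics": {"CostPerInference": 4.613e-06, "MaxInvocations": 1474, "ModelLatency": 416},
    },
]

best_for_cost = pick_recommendation(sample_recommendations, "cost")
best_for_latency = pick_recommendation(sample_recommendations, "latency")
best_for_throughput = pick_recommendation(sample_recommendations, "throughput")
```

With these values, the cost objective selects `ml.c5.xlarge`, the throughput objective selects `ml.c5.2xlarge`, and the latency objective selects the 5120 MB serverless configuration.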


In [20]:
# If serverless is the best option, execute this cell.
# Use .iloc[0] so the first row of the sorted frame is taken regardless of its index label.
best_by_latency = df.sort_values(by=["ModelLatency"]).iloc[0]
ServerlessConfig = best_by_latency["ServerlessConfig"]
ServerlessMem = ServerlessConfig["MemorySizeInMB"]
ServerlessConc = ServerlessConfig["MaxConcurrency"]
ServerlessConfig

{'MemorySizeInMB': 5120, 'MaxConcurrency': 1}

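Let's choose the instance with the lowest `ModelLatency` by taking the first record of the result `dataframe`, sorted in ascending order. When that record is a serverless configuration it has no `InstanceType`, and the value below is the string `'NaN'`.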

In [21]:
instance_type = (
    df.sort_values(by=["ModelLatency"]).head(1)["InstanceType"].to_string(index=False).strip()
)
instance_type

'NaN'
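
Detecting a serverless recommendation through the string `'NaN'` works for this run but depends on how pandas renders a missing value. A sketch of an alternative is to read the recommendation's `EndpointConfiguration` dictionary directly. `deployment_target` is a hypothetical helper, and it assumes a recommendation carries either a `ServerlessConfig` or an `InstanceType`, as the results above do.

```python
def deployment_target(endpoint_configuration):
    """Classify an EndpointConfiguration as serverless or instance-based."""
    serverless_config = endpoint_configuration.get("ServerlessConfig")
    if serverless_config:
        return "serverless", dict(serverless_config)
    instance_type = endpoint_configuration.get("InstanceType")
    if instance_type:
        return "instance", {
            "InstanceType": instance_type,
            "InitialInstanceCount": endpoint_configuration.get("InitialInstanceCount", 1),
        }
    raise ValueError("recommendation has neither ServerlessConfig nor InstanceType")
```

For example, `deployment_target(inference_recommender_job["InferenceRecommendations"][0]["EndpointConfiguration"])` returns a `(kind, settings)` pair that can drive the choice between a serverless and an instance-based production variant.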

### Optional: ListInferenceRecommendationsJobSteps
To see the list of subtasks for an Inference Recommender job, simply provide the `JobName` to the `ListInferenceRecommendationsJobSteps` API. 

To see more information for the API, please refer to the doc here: [ListInferenceRecommendationsJobSteps](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListInferenceRecommendationsJobSteps.html)

In [22]:
list_job_steps_response = client.list_inference_recommendations_job_steps(JobName=str(default_job))
print(list_job_steps_response)

{'Steps': [{'StepType': 'BENCHMARK', 'JobName': 'huggingface-pytorch-basic-recommender-job-2023-09-07-01-08-45', 'Status': 'FAILED', 'InferenceBenchmark': {'EndpointConfiguration': {'EndpointName': 'huggingface-pytorch-basic-recommender-jo-K1Zffkn3jUchQQsPU8dj', 'VariantName': 'huggingface-pytorch-basic-recommender-jo-K1Zffkn3jUchQQsPU8dj', 'ServerlessConfig': {'MemorySizeInMB': 4096, 'MaxConcurrency': 1}}, 'ModelConfiguration': {'EnvironmentParameters': [{'Key': 'SAGEMAKER_SUBMIT_DIRECTORY', 'ValueType': 'String', 'Value': 's3://sagemaker-us-east-1-805087355833/sagemaker/huggingface-pytorch-sentiment-analysis/models/model_distilbert.tar.gz'}, {'Key': 'SAGEMAKER_CONTAINER_LOG_LEVEL', 'ValueType': 'String', 'Value': '20'}, {'Key': 'SAGEMAKER_REGION', 'ValueType': 'String', 'Value': 'us-east-1'}, {'Key': 'SAGEMAKER_PROGRAM', 'ValueType': 'String', 'Value': 'inference.py'}]}, 'FailureReason': 'An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error 

## 7. Create an Endpoint for lowest latency real-time inference

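Next we will create a SageMaker endpoint using the lowest-latency configuration found by the Inference Recommender Default Job that was run previously. The cell below creates a serverless endpoint configuration when the best record has no instance type, and a real-time, instance-based configuration otherwise.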

### Create an Endpoint Config from the model

This will create an endpoint configuration that Amazon SageMaker hosting services uses to deploy models. In the configuration, you identify one or more models, created using the `CreateModel` API, to deploy and the resources that you want Amazon SageMaker to provision. Then you call the `CreateEndpoint` API.

More info on `create_endpoint_config` can be found on the [Boto3 SageMaker documentation page](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config).

In [23]:
endpoint_config_name = "huggingface-pytorch-endpoint-config-" + datetime.datetime.now().strftime(
    "%Y-%m-%d-%H-%M-%S"
)

# Serverless deployment
if instance_type == "NaN":
    endpoint_config_response = client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "ModelName": model_name,
                "VariantName": "AllTraffic",
                "ServerlessConfig": {
                    "MemorySizeInMB": ServerlessMem,
                    "MaxConcurrency": ServerlessConc,
                },
            }
        ],
    )
# Real-time deployment
else:
    endpoint_config_response = client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "VariantName": "AllTrafficVariant",
                "ModelName": model_name,
                "InitialInstanceCount": 1,
                "InstanceType": instance_type,
                "InitialVariantWeight": 1,
            },
        ],
    )

endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:805087355833:endpoint-config/huggingface-pytorch-endpoint-config-2023-09-07-03-52-55',
 'ResponseMetadata': {'RequestId': '8c659222-6d0a-4852-b153-b89962cf4aaf',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '8c659222-6d0a-4852-b153-b89962cf4aaf',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '136',
   'date': 'Thu, 07 Sep 2023 03:52:55 GMT'},
  'RetryAttempts': 0}}

### Deploy the Endpoint Config to a real-time endpoint

This will create an endpoint using the endpoint configuration specified in the request. Amazon SageMaker uses the endpoint to provision resources and deploy models. Note that you have already created the endpoint configuration with the `CreateEndpointConfig` API in the previous step.

More info on `create_endpoint` can be found on the [Boto3 SageMaker documentation page](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint).


In [24]:
endpoint_name = "huggingface-pytorch-endpoint-" + datetime.datetime.now().strftime(
    "%Y-%m-%d-%H-%M-%S"
)

create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

create_endpoint_response

{'EndpointArn': 'arn:aws:sagemaker:us-east-1:805087355833:endpoint/huggingface-pytorch-endpoint-2023-09-07-03-53-01',
 'ResponseMetadata': {'RequestId': '9d866d1d-dfce-42d7-bcbe-7a379782932a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '9d866d1d-dfce-42d7-bcbe-7a379782932a',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '116',
   'date': 'Thu, 07 Sep 2023 03:53:01 GMT'},
  'RetryAttempts': 0}}

### Wait for Endpoint to be ready

In [25]:
%%time

utils.endpoint_creation_wait(endpoint_name=endpoint_name)

Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
InService
CPU times: user 53.4 ms, sys: 11.2 ms, total: 64.6 ms
Wall time: 2min 47s


{'EndpointName': 'huggingface-pytorch-endpoint-2023-09-07-03-53-01',
 'EndpointArn': 'arn:aws:sagemaker:us-east-1:805087355833:endpoint/huggingface-pytorch-endpoint-2023-09-07-03-53-01',
 'EndpointConfigName': 'huggingface-pytorch-endpoint-config-2023-09-07-03-52-55',
 'ProductionVariants': [{'VariantName': 'AllTraffic',
   'DeployedImages': [{'SpecifiedImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.10.2-cpu-py38',
     'ResolvedImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference@sha256:ad5678a4a59d69692e6dff42ce268436f13d0cbcf4dc96b910a9e44949b3c596',
     'ResolutionTime': datetime.datetime(2023, 9, 7, 3, 53, 2, 504000, tzinfo=tzlocal())}],
   'CurrentWeight': 1.0,
   'DesiredWeight': 1.0,
   'CurrentInstanceCount': 0,
   'CurrentServerlessConfig': {'MemorySizeInMB': 5120, 'MaxConcurrency': 1}}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2023, 9, 7, 3, 53, 1, 652000, tzinfo=tzlocal()),
 'LastModifiedTime': dateti
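
`utils.endpoint_creation_wait` comes from a helper module whose code is not shown in this notebook. A comparable wait can be sketched with the `endpoint_in_service` waiter that boto3 provides for SageMaker. `wait_until_in_service` is a hypothetical name, and the delay and attempt counts are illustrative values rather than recommendations.

```python
def wait_until_in_service(sm_client, endpoint_name, delay_seconds=30, max_attempts=60):
    """Block until the endpoint is InService, then return its description."""
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(
        EndpointName=endpoint_name,
        # Poll every delay_seconds, giving up after max_attempts polls
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
    return sm_client.describe_endpoint(EndpointName=endpoint_name)
```

Called as `wait_until_in_service(client, endpoint_name)`, this returns the same kind of `describe_endpoint` dictionary shown in the output above. The waiter raises an exception if the endpoint does not come into service within the allotted attempts.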

### Invoke Endpoint with `boto3`

After you deploy a model into production using Amazon SageMaker hosting services, your client applications use this API to get inferences from the model hosted at the specified endpoint.

For an overview of Amazon SageMaker, [see How It Works](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works.html).

Amazon SageMaker strips all POST headers except those supported by the API. Amazon SageMaker might add additional headers. You should not rely on the behavior of headers outside those enumerated in the request syntax.

Calls to `InvokeEndpoint` are authenticated by using AWS Signature Version 4. For information, see Authenticating Requests (AWS Signature Version 4) in the Amazon S3 API Reference.

A customer's model containers must respond to requests within 60 seconds. The model itself can have a maximum processing time of 60 seconds before responding to invocations. If your model is going to take 50-60 seconds of processing time, the SDK socket timeout should be set to be 70 seconds.

More info on `invoke_endpoint` can be found on the [Boto3 `SageMakerRuntime` documentation page](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint).

In [None]:
test_data = pd.read_csv("./sample_payload/test_data.csv", header=None)
test_data

In [None]:
runtime = boto3.client("sagemaker-runtime")

In [None]:
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=test_data.to_csv(header=False, index=False),
    ContentType="text/csv",
)

print(response["Body"].read())

## 8. Clean up

Endpoints should be deleted when no longer in use, since (per the [SageMaker pricing page](https://aws.amazon.com/sagemaker/pricing/)) they're billed by time deployed.

In [40]:
client.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': '84d8cc36-ca00-4c72-880e-3db75b3b4420',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '84d8cc36-ca00-4c72-880e-3db75b3b4420',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Wed, 30 Aug 2023 00:32:16 GMT'},
  'RetryAttempts': 0}}
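
Deleting the endpoint leaves the endpoint configuration and the SageMaker model created earlier in place. If you also want to remove them, a minimal sketch follows; `delete_endpoint_config_and_model` is a hypothetical helper name, and it assumes the endpoint has already been deleted as in the cell above.

```python
def delete_endpoint_config_and_model(sm_client, endpoint_config_name, model_name):
    """Remove the endpoint configuration and the model left behind by the endpoint."""
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

In this notebook the call would be `delete_endpoint_config_and_model(client, endpoint_config_name, model_name)`.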

## 9. Conclusion

In this notebook you successfully downloaded a `Huggingface` pre-trained `sentiment-analysis` model, compressed the `model` and the payload, and uploaded them to Amazon S3. 
Then you created a SageMaker model, and triggered a SageMaker Inference Recommender Default Job.

You then browsed the results, sorted by `MaxInvocations` and by `ModelLatency`, and decided to create an Endpoint for the lowest latency real-time inference.
After deploying the model to an endpoint, you invoked the Endpoint with a sample payload of a few sentences, using `boto3`, and got the prediction results.

As next steps, you can try running SageMaker Inference Recommender on your own models, to select an instance with the best price performance for your needs.