## Deploying a Hugging Face model in SageMaker

In this notebook, we will practice deploying a pre-trained Hugging Face model from the Hugging Face Hub.

We will use a pre-trained model for grammar correction: [vennify/t5-base-grammar-correction](https://huggingface.co/vennify/t5-base-grammar-correction).

In [None]:
!pip install -U sagemaker

In [1]:
## Importing necessary libraries and defining the role

import sagemaker
import boto3
import json
import ast
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

**To deploy a model directly from the Hugging Face Model Hub to Amazon SageMaker, we need to define two environment variables when creating the HuggingFaceModel.**

We need to define:

- HF_MODEL_ID: defines the model ID, which is automatically loaded from huggingface.co/models when the SageMaker endpoint is created.
- HF_TASK: defines the task for the used 🤗 Transformers pipeline.

In [2]:
# Hub Model configuration. https://huggingface.co/models
hub = {
  'HF_MODEL_ID':'vennify/t5-base-grammar-correction',
  'HF_TASK':'text2text-generation'
}

## Defining the HuggingFace Model and deploying the model

In [3]:
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1, # number of instances
    instance_type='ml.m5.xlarge' # ec2 instance type
)

-----!

Now, let's test our model with a few sample sentences.

In [4]:
# Sample sentences: uncomment the one you want to send to the endpoint
#text = 'I will go not home today'
text = 'Our teacher was the kind person that ever existed.'
#text = 'Once upon a time, there is a king with a big empire'

In [5]:
corrected_sent = predictor.predict({"inputs": "grammar:"+ text})
corrected_sent

[{'generated_text': 'Our teacher was the kindest person that ever existed.'}]

Now, let's test with a paragraph.

In [6]:
paragraph = "On cold, wet morning, my class was filled with excitement. Someone have discover that the next day was our teacher's birthday. Our teacher was the kindest person that ever exist. Thus it is no surprise she was the favourite teacher to the pupils. Everyone want to get her a present. I, very much wanted to shown any appreciation too. That afternoon, I spends the whole afternoon shop for a present. After a long search, I finally made on my mind. The next day I gived her a bouquet of beautiful roses and she exclaimed with pleasure"

In [10]:
## For longer text, pass the max_length parameter so that generation is not cut off before the whole text is corrected.
## max_length is counted in tokens, not characters, so the character count used here is a generous limit.

payload = {
    "inputs": "grammar:" + paragraph,
    "parameters": {"max_length": len(paragraph) + 10},
}
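The same request shape is built for both short and long inputs in this notebook, so a small helper can keep it consistent. The following is a minimal sketch: `build_grammar_payload` and its arguments are hypothetical names for illustration and not part of the SageMaker SDK; the `grammar:` prefix is the one used in the cells above.

```python
def build_grammar_payload(text, max_length=None, prefix="grammar:"):
    """Build the request dict for the grammar-correction endpoint.

    Hypothetical helper: `prefix` mirrors the task prefix used in this
    notebook, and `max_length` is only sent when it is provided.
    """
    payload = {"inputs": prefix + text}
    if max_length is not None:
        # Forwarded to the generation call; counted in tokens.
        payload["parameters"] = {"max_length": int(max_length)}
    return payload
```

With it, the long-text payload could be written as `build_grammar_payload(paragraph, max_length=len(paragraph) + 10)`.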

In [13]:
corrected_sent = predictor.predict(payload)
print(corrected_sent[0]['generated_text'])

On a cold, wet morning, my class was filled with excitement. Someone discovered that the next day was our teacher's birthday. Our teacher was the kindest person that ever existed. Thus it is no surprise that everyone wanted to give her a present. I, very much wanted to show appreciation too. That afternoon, I spent the whole afternoon shopping for a present. After a long search, I finally made my mind. The next day I gave her a bouquet of beautiful roses and she exclaimed with pleasure.


## Deploying serverless inference in SageMaker

Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless Inference integrates with AWS Lambda to offer you high availability, built-in fault tolerance and automatic scaling.

With a pay-per-use model, Serverless Inference is a cost-effective option if you have an infrequent or unpredictable traffic pattern. During times when there are no requests, Serverless Inference scales your endpoint down to 0, helping you to minimize your costs. 

### Endpoint Configuration Creation

This is where you can adjust the serverless configuration for your endpoint. You define the maximum number of concurrent invocations for a single endpoint, known as **MaxConcurrency**, and the memory size, **MemorySizeInMB**.

Your serverless endpoint has a minimum RAM size of 1024 MB (1 GB), and the maximum RAM size you can choose is 6144 MB (6 GB). **The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB**. Serverless Inference auto-assigns compute resources proportional to the memory you select. If you choose a larger memory size, your container has access to more vCPUs. 
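As a small illustration of the sizes listed above, this sketch picks the smallest allowed memory size that covers an estimated requirement. The function name is hypothetical, and the estimate itself is an assumption: how much memory a given model actually needs has to be measured.

```python
# Allowed serverless memory sizes in MB, as listed in the text above.
ALLOWED_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def smallest_memory_size(required_mb):
    """Return the smallest allowed memory size that is >= required_mb."""
    for size in ALLOWED_MEMORY_SIZES_MB:
        if size >= required_mb:
            return size
    raise ValueError(
        f"{required_mb} MB exceeds the largest allowed size "
        f"({ALLOWED_MEMORY_SIZES_MB[-1]} MB)"
    )
```

For example, `smallest_memory_size(3500)` returns `4096`.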

Serverless endpoints have a quota for how many concurrent invocations can be processed at the same time. If the endpoint is invoked before it finishes processing the first request, then it handles the second request concurrently. **You can set the maximum concurrency for a single endpoint up to 200, and the total number of serverless endpoints you can host in a Region is 50**. The maximum concurrency for an individual endpoint prevents that endpoint from taking up all of the invocations allowed for your account, and any endpoint invocations beyond the maximum are throttled.
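Since invocations beyond the maximum concurrency are throttled, a client may want to retry with exponential backoff. The following is a generic sketch, not an AWS-provided utility: `invoke_with_backoff` is a hypothetical name, and how throttling surfaces in your client is left to the `is_throttled` predicate (with boto3 you would typically inspect the error code on the raised `ClientError`; the exact code is not assumed here).

```python
import time


def invoke_with_backoff(invoke, is_throttled, max_attempts=5,
                        base_delay=0.5, sleep=time.sleep):
    """Call invoke() and retry with exponential backoff while throttled.

    invoke:       zero-argument callable that performs the request.
    is_throttled: callable taking the raised exception and returning True
                  if the failure was a throttle (caller-supplied assumption).
    """
    for attempt in range(max_attempts):
        try:
            return invoke()
        except Exception as error:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_throttled(error):
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.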

If your endpoint does not receive traffic for a while and then suddenly receives new requests, it can take some time to spin up the compute resources needed to process them. This is called a **cold start**. Since serverless endpoints provision compute resources on demand, your endpoint may experience cold starts. A cold start can also occur when your concurrent requests exceed the endpoint's current concurrent request usage, because additional compute resources must be provisioned. The cold start time depends on your model size, how long it takes to download your model, and the start-up time of your container.
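One way to observe a cold start is to time several consecutive invocations and compare the first with the rest. This is a minimal sketch under that assumption; `time_calls` is a hypothetical helper, and `invoke` stands for whatever zero-argument callable sends a request to your endpoint.

```python
import time


def time_calls(invoke, n_calls=3, clock=time.perf_counter):
    """Call invoke() n_calls times and return each call's wall time in seconds."""
    durations = []
    for _ in range(n_calls):
        start = clock()
        invoke()
        durations.append(clock() - start)
    return durations
```

If the endpoint was idle, the first duration is expected to be noticeably larger than the following ones.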

In [16]:
# Import necessary libraries

from sagemaker.serverless.serverless_inference_config import ServerlessInferenceConfig
from time import gmtime, strftime

# Define SageMaker client and SageMaker runtime

client = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

In [None]:
# Let's get the model name from the SageMaker console. The model name will start with huggingface-pytorch-inference*

model_name='huggingface-pytorch-inference-2022-11-22-09-57-17-455'
rcf_serverless_config = 'hf-serverless-grammarcorrection-epc'+ strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# Serverless Endpoint config creation using the API "create_endpoint_config"

endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName=rcf_serverless_config,
    ProductionVariants=[
        {
            "VariantName": "HFServerlessVariant",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 1,
            },
        },
    ],
)

print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])
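The model name in the cell above is copied by hand from the console. As an alternative, the name could be looked up with the `list_models` API. The sketch below assumes that the most recently created model whose name contains `huggingface-pytorch-inference` is the one deployed earlier; `latest_model_name` is a hypothetical helper, and `sm_client` is expected to be a boto3 SageMaker client such as `client`.

```python
def latest_model_name(sm_client, name_contains="huggingface-pytorch-inference"):
    """Return the name of the newest SageMaker model matching name_contains."""
    response = sm_client.list_models(
        NameContains=name_contains,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    models = response["Models"]
    if not models:
        raise LookupError(f"No model found with a name containing {name_contains!r}")
    return models[0]["ModelName"]
```

Usage would then be `model_name = latest_model_name(client)`.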

### Creating the serverless endpoint

In [None]:
endpoint_name = "HF-serverless-grammar-correction" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=rcf_serverless_config,
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])
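Endpoint creation is asynchronous, so the endpoint cannot be invoked until its status is `InService`. A minimal polling sketch follows; `wait_for_endpoint` is a hypothetical helper, the polling interval and timeout are arbitrary assumptions, and `sm_client` is expected to be a boto3 SageMaker client such as `client`. boto3 also ships a built-in waiter, `client.get_waiter("endpoint_in_service")`, which can be used instead.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=15,
                      timeout_seconds=900, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService.

    Raises RuntimeError if creation fails and TimeoutError if the
    endpoint is still not ready after roughly timeout_seconds.
    """
    waited = 0
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return description
        if status == "Failed":
            reason = description.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} is still {status}")
        sleep(poll_seconds)
        waited += poll_seconds
```

Usage would be `wait_for_endpoint(client, endpoint_name)` right after `create_endpoint`.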

### Invoking the serverless endpoint using boto3 SageMaker runtime

In [17]:
text = "Our teacher was the kind person that ever existed"
payload = json.dumps({"inputs": "grammar:"+ text})

In [18]:
%%time

## Get the endpoint name from the SageMaker console under "Inference" > "Endpoints"

endpoint_name='HF-serverless-grammar-correction2022-11-23-01-09-25'

## Let's invoke the endpoint to get the response 

response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='application/json',
                                       Body=payload)
text_op = response['Body'].read().decode()

# The response body is JSON, so parse it with json.loads and get the corrected sentence
text_corrected = json.loads(text_op)
text_corrected[0]['generated_text']

CPU times: user 10.6 ms, sys: 3.59 ms, total: 14.2 ms
Wall time: 23.7 s


'Our teacher was the kindest person that ever existed.'
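The endpoint returns a JSON body containing a list of objects with a `generated_text` field, as the output above shows. Parsing it with `json.loads` is a safer general choice than `ast.literal_eval`, because JSON literals such as `true` or `null` are not valid Python literals. A minimal sketch; `extract_generated_text` is a hypothetical helper name.

```python
import json


def extract_generated_text(body):
    """Return the generated_text values from a response body (bytes or str)."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    results = json.loads(body)
    return [item["generated_text"] for item in results]
```

With the runtime response, this could be called as `extract_generated_text(response['Body'].read())`.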

Now, let's test the same paragraph again using the serverless endpoint to make sure that we get the same response.

In [19]:
paragraph

"On cold, wet morning, my class was filled with excitement. Someone have discover that the next day was our teacher's birthday. Our teacher was the kindest person that ever exist. Thus it is no surprise she was the favourite teacher to the pupils. Everyone want to get her a present. I, very much wanted to shown any appreciation too. That afternoon, I spends the whole afternoon shop for a present. After a long search, I finally made on my mind. The next day I gived her a bouquet of beautiful roses and she exclaimed with pleasure"

In [20]:
payload = {
    "inputs": "grammar:" + paragraph,
    "parameters": {"max_length": len(paragraph) + 10},
}

In [21]:
response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType='application/json',
                                   Body=json.dumps(payload))

text_op = response['Body'].read().decode()

# The response body is JSON, so parse it with json.loads
text_corrected = json.loads(text_op)
print(text_corrected[0]['generated_text'])

On a cold, wet morning, my class was filled with excitement. Someone discovered that the next day was our teacher's birthday. Our teacher was the kindest person that ever existed. Thus it is no surprise that everyone wanted to give her a present. I, very much wanted to show appreciation too. That afternoon, I spent the whole afternoon shopping for a present. After a long search, I finally made my mind. The next day I gave her a bouquet of beautiful roses and she exclaimed with pleasure.
