## Deploying a Hugging Face model in SageMaker

In [None]:
!pip install sagemaker -U

In [None]:
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

In [4]:
# Hub Model configuration. https://huggingface.co/models
hub = {
  'HF_MODEL_ID':'vennify/t5-base-grammar-correction',
  'HF_TASK':'text2text-generation'
}

## Defining and deploying the Hugging Face model

In [5]:
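# Create the Hugging Face model object; the three versions below select the inference container image
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,    # model ID and task are passed to the container as environment variables
    role=role,  # IAM role with SageMaker permissions
)

# Deploy the model to a real-time SageMaker endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,     # number of instances
    instance_type='ml.m5.xlarge'  # EC2 instance type
)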

-----!

Now, let's test our model with a few sample sentences.

In [38]:
text = 'I will go not home today'
#text = 'he like to eat noodles'
#text = 'Once upon a time, there is a king with a big empire'

In [39]:
corrected_sent = predictor.predict({"inputs": "grammar:" + text})
corrected_sent

[{'generated_text': 'I will not go home today.'}]
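The response shown above is a list containing one dictionary with a `generated_text` key. A small helper can wrap the prefixing and unpacking in one place. This is a minimal sketch: `correct_sentence` is a hypothetical name, the `prefix` default simply mirrors the string used in this notebook, and the response shape is assumed to match the output above.

```python
def correct_sentence(predictor, text, prefix="grammar:"):
    """Send one sentence to the endpoint and return the corrected string.

    Assumes the response looks like [{'generated_text': '...'}],
    as in the output printed above.
    """
    response = predictor.predict({"inputs": prefix + text})
    return response[0]["generated_text"]
```

With the deployed endpoint, `correct_sentence(predictor, text)` would return the corrected sentence as a plain string instead of the nested list.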

Now, let's test with a paragraph. We split the paragraph on periods and send each sentence to the endpoint separately.

In [None]:
paragraph = "On cold, wet morning, my class was filled with excitement. Someone have discover that the next day was our teacher's birthday. Our teacher was the kindest person that ever exist. Thus it is no surprise she was the favourite teacher to the pupils. Everyone want to get her a present. I, very much wanted to shown any appreciation too. That afternoon, I spends the whole afternoon shop for a present. After a long search, I finally made on my mind. The next day I gived her a bouquet of beautiful roses and she exclaimed with pleasure"

In [47]:
# Preview the pieces produced by splitting on periods
for sentence in paragraph.split("."):
    print(sentence)

On cold, wet morning, my class was filled with excitement
 Someone have discover that the next day was our teacher's birthday
 Our teacher was the kindest person that ever exist
 Thus it is no surprise she was the favourite teacher to the pupils
 Everyone want to get her a present
 I, very much wanted to shown any appreciation too
 That afternoon, I spends the whole afternoon shop for a present
 After a long search, I finally made on my mind
 The next day I gived her a bouquet of beautiful roses and she exclaimed with pleasure


In [46]:
# Correct each sentence with a separate request to the endpoint
for sentence in paragraph.split("."):
    corrected_sent = predictor.predict({"inputs": "grammar:" + sentence})
    print(corrected_sent)

[{'generated_text': 'On a cold, wet morning, my class was filled with excitement.'}]
[{'generated_text': "Someone has discovered that the next day was our teacher's birthday."}]
[{'generated_text': 'Our teacher was the kindest person that ever existed.'}]
[{'generated_text': 'Thus it is no surprise that she was the favourite teacher to the pupils.'}]
[{'generated_text': 'Everyone wanted to get her a present.'}]
[{'generated_text': 'I, very much wanted to show any appreciation too.'}]
[{'generated_text': 'That afternoon, I spend the whole afternoon shopping for a present.'}]
[{'generated_text': 'After a long search, I finally made up my mind.'}]
[{'generated_text': 'The next day I gave her a bouquet of beautiful roses and she exclaimed with pleasure'}]
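Splitting on `"."` leaves a leading space on every piece after the first, drops the period, and would yield an empty trailing string if the paragraph ended with a period. The sketch below is one possible alternative, not what the notebook ran: it splits on sentence-ending punctuation followed by whitespace, keeps the punctuation, and rejoins the corrected sentences. The names `split_sentences`, `correct_paragraph`, and `correct_fn` are hypothetical, and the regular expression is deliberately naive (it would also split after abbreviations such as "Dr.").

```python
import re


def split_sentences(paragraph):
    """Split after '.', '!' or '?' when followed by whitespace; drop empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def correct_paragraph(paragraph, correct_fn):
    """Apply correct_fn (any str -> str callable) to each sentence and rejoin.

    With the endpoint above, correct_fn could be:
        lambda s: predictor.predict({"inputs": "grammar:" + s})[0]["generated_text"]
    """
    return " ".join(correct_fn(s) for s in split_sentences(paragraph))
```

Because the sentences sent to the model now include their final punctuation, the corrections may differ slightly from the outputs shown above.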


## Deploying serverless inference in SageMaker

Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless Inference integrates with AWS Lambda to offer you high availability, built-in fault tolerance and automatic scaling.

With a pay-per-use model, Serverless Inference is a cost-effective option if you have an infrequent or unpredictable traffic pattern. During times when there are no requests, Serverless Inference scales your endpoint down to 0, helping you to minimize your costs. 

### Endpoint Configuration Creation

This is where you adjust the serverless configuration for your endpoint. You define two settings: the maximum number of concurrent invocations for a single endpoint (**MaxConcurrency**) and the memory size (**MemorySizeInMB**).

Your serverless endpoint has a minimum RAM size of 1024 MB (1 GB), and the maximum RAM size you can choose is 6144 MB (6 GB). **The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB**. Serverless Inference auto-assigns compute resources proportional to the memory you select. If you choose a larger memory size, your container has access to more vCPUs. 

Serverless endpoints have a quota on how many invocations can be processed concurrently. If the endpoint is invoked before it finishes processing the first request, it handles the second request concurrently. **You can set the maximum concurrency for a single endpoint up to 200, and the total number of serverless endpoints you can host in a Region is 50**. The maximum concurrency for an individual endpoint prevents that endpoint from taking up all of the invocations allowed for your account, and any endpoint invocations beyond the maximum are throttled.
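The limits described above can be checked before any endpoint configuration is created. The following is a minimal sketch: `build_serverless_config` is a hypothetical helper, the lower bound of 1 for concurrency is an assumption, and the dictionary keys follow the `ServerlessConfig` structure of the SageMaker `CreateEndpointConfig` API as I understand it.

```python
# Memory sizes and concurrency ceiling taken from the text above
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)
MAX_CONCURRENCY_LIMIT = 200


def build_serverless_config(memory_size_mb, max_concurrency):
    """Validate the two serverless settings and return them as a dict."""
    if memory_size_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(
            f"memory_size_mb must be one of {ALLOWED_MEMORY_MB}, got {memory_size_mb}"
        )
    # Assumption: at least one concurrent invocation must be allowed
    if not 1 <= max_concurrency <= MAX_CONCURRENCY_LIMIT:
        raise ValueError(
            f"max_concurrency must be between 1 and {MAX_CONCURRENCY_LIMIT}, "
            f"got {max_concurrency}"
        )
    return {"MemorySizeInMB": memory_size_mb, "MaxConcurrency": max_concurrency}
```

In the SageMaker Python SDK, the same two values would typically be passed as `ServerlessInferenceConfig(memory_size_in_mb=..., max_concurrency=...)` from `sagemaker.serverless`, and that object given to `deploy(serverless_inference_config=...)` in place of an instance type and count.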

If your endpoint receives no traffic for a while and then suddenly receives new requests, it can take some time to spin up the compute resources needed to process them. This is called a **cold start**. Because serverless endpoints provision compute resources on demand, your endpoint may experience cold starts. A cold start can also occur when the number of concurrent requests rises above what the endpoint is currently handling, since additional compute must be provisioned. The cold start time depends on your model size, how long it takes to download your model, and the start-up time of your container.
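One rough way to observe a cold start is to time several consecutive invocations and compare the first with the rest. This is only a sketch: `time_invocations` is a hypothetical helper, `invoke` can be any zero-argument callable, and the measured latency includes network time as well as endpoint processing.

```python
import time


def time_invocations(invoke, n=5):
    """Call invoke() n times in sequence and return each call's latency in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        invoke()
        latencies.append(time.perf_counter() - start)
    return latencies
```

For example, `time_invocations(lambda: predictor.predict({"inputs": "grammar:" + text}))` run against an idle serverless endpoint would be expected to show a noticeably larger first value if a cold start occurred.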