## Deploying a Hugging Face model in SageMaker for Semantic Similarity [Long Documents]

In [None]:
! pip install sagemaker -U --quiet
! pip install transformers --quiet

In [3]:
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel
import numpy as np
import os
import json
import re
import ast

role = sagemaker.get_execution_role()
# Runtime client used to invoke deployed endpoints
runtime = boto3.client('sagemaker-runtime')

In [4]:
# This Hugging Face model is suited to long documents (inputs of up to 4,096 tokens)
hub = {
    'HF_MODEL_ID':'allenai/longformer-base-4096',
    'HF_TASK':'feature-extraction'
}

In [5]:
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1, # number of instances
    instance_type='ml.m5.xlarge' # ec2 instance type
)

-----!

## Invoking a real time endpoint in SageMaker using boto3 SageMaker runtime

Once the endpoint is created, you can find its name in the SageMaker console under "Inference" > "Endpoints".
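If you would rather not copy the name from the console, it can also be looked up programmatically. The sketch below assumes a boto3 `sagemaker` client and uses its `list_endpoints` call; the helper name `latest_endpoint_name` and the default name filter are illustrative choices, not part of the original notebook. When the `predictor` object from the deployment cell is still in memory, `predictor.endpoint_name` gives the same value directly.

```python
def latest_endpoint_name(sm_client, name_contains="huggingface-pytorch-inference"):
    """Return the name of the most recently created endpoint matching a filter.

    `sm_client` is expected to be a boto3 SageMaker client, for example
    boto3.client("sagemaker"). The helper name and default filter are
    hypothetical and only serve this illustration.
    """
    response = sm_client.list_endpoints(
        NameContains=name_contains,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    endpoints = response.get("Endpoints", [])
    if not endpoints:
        raise LookupError(f"No endpoint found with a name containing {name_contains!r}")
    return endpoints[0]["EndpointName"]


# Usage (requires AWS credentials and an existing endpoint):
# import boto3
# endpoint_name = latest_endpoint_name(boto3.client("sagemaker"))
```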

In [9]:
endpoint_name = 'huggingface-pytorch-inference-2023-01-21-11-14-27-348'

In [10]:
payload1 = {"inputs": "what is S3"}
payload2 = {"inputs": "what does S3 do"}

In [11]:
response1 = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='application/json',
                                       Body=json.dumps(payload1))
sent1 = response1['Body'].read().decode()
# The response body is JSON, so parse it with json.loads
sent1_embedding = np.array(json.loads(sent1))

In [12]:
response2 = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='application/json',
                                       Body=json.dumps(payload2))
sent2 = response2['Body'].read().decode()
# The response body is JSON, so parse it with json.loads
sent2_embedding = np.array(json.loads(sent2))
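The two invocation cells repeat the same steps: serialize the payload, call `invoke_endpoint`, and parse the response body. A minimal sketch of a helper that wraps those steps is shown here. The function name `embed_text` is hypothetical, and it assumes the endpoint returns a JSON-encoded nested list, as the `feature-extraction` task does in this notebook.

```python
import json


def embed_text(runtime_client, endpoint_name, text):
    """Send one text to the endpoint and return the parsed JSON response.

    `runtime_client` is expected to be a boto3 "sagemaker-runtime" client.
    The return value is whatever the endpoint sends back, here assumed to
    be a nested list of floats.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),
    )
    # The body is a streaming object; read it fully and decode before parsing
    return json.loads(response["Body"].read().decode("utf-8"))


# Usage (requires a deployed endpoint):
# sent1_embedding = np.array(embed_text(runtime, endpoint_name, "what is S3"))
```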

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

def pad_to_length(x, arraysize):
    # Right-pad a (1, n) array with zeros so that it has `arraysize` columns
    return np.pad(x, ((0, 0), (0, arraysize - x.shape[1])), mode='constant')


def cos_similarity_vectors_diff_size(vec1, vec2):
    # Flatten each embedding into a single row vector
    vec1_embed_np = vec1.reshape(1, -1)
    vec2_embed_np = vec2.reshape(1, -1)

    # Zero-pad the shorter vector so both have the same length
    maxsize = max(i.shape[1] for i in [vec1_embed_np, vec2_embed_np])

    padded_vec1 = pad_to_length(vec1_embed_np, maxsize)
    padded_vec2 = pad_to_length(vec2_embed_np, maxsize)

    return cosine_similarity(padded_vec1, padded_vec2)[0][0]

In [14]:
cos_similarity_vectors_diff_size(sent1_embedding,sent2_embedding)

0.8763233750811223
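The padding approach above compares the flattened token embeddings position by position. A commonly used alternative, which is not what this notebook does, is to average the token embeddings into one fixed-size vector per document (mean pooling) and compare those vectors, so no padding is needed. The sketch below assumes the endpoint response has the shape `(1, tokens, hidden_size)`; the function names are illustrative.

```python
import numpy as np


def mean_pool(token_embeddings):
    """Average token embeddings into a single vector of length hidden_size."""
    arr = np.asarray(token_embeddings, dtype=float)
    # Collapse any leading batch dimension: (1, tokens, hidden) -> (tokens, hidden)
    arr = arr.reshape(-1, arr.shape[-1])
    return arr.mean(axis=0)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors of equal length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Usage with the embeddings computed above:
# cosine(mean_pool(sent1_embedding), mean_pool(sent2_embedding))
```

Scores from the two approaches are not directly comparable, so any threshold tuned for one would need to be re-checked for the other.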

## Deploying serverless inference in SageMaker

Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless Inference integrates with AWS Lambda to offer you high availability, built-in fault tolerance and automatic scaling.

With a pay-per-use model, Serverless Inference is a cost-effective option if you have an infrequent or unpredictable traffic pattern. During times when there are no requests, Serverless Inference scales your endpoint down to 0, helping you to minimize your costs. 

### Endpoint Configuration Creation

This is where you can adjust the Serverless Configuration for your endpoint. You will define the max concurrent invocations for a single endpoint, known as **MaxConcurrency**, and the **Memory size**. 

Your serverless endpoint has a minimum RAM size of 1024 MB (1 GB), and the maximum RAM size you can choose is 6144 MB (6 GB). **The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB**. Serverless Inference auto-assigns compute resources proportional to the memory you select. If you choose a larger memory size, your container has access to more vCPUs. 
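Because only a fixed set of memory sizes is accepted, it can help to round an estimated requirement up to the next allowed value before creating the endpoint configuration. This is a small sketch based on the sizes listed above; the helper name `pick_memory_size` is hypothetical, and how you estimate the required memory for a given model is left open.

```python
# Memory sizes accepted for serverless endpoints, as listed in this section
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def pick_memory_size(required_mb):
    """Return the smallest allowed memory size that is >= required_mb."""
    for size in ALLOWED_MEMORY_MB:
        if size >= required_mb:
            return size
    raise ValueError(
        f"{required_mb} MB exceeds the serverless maximum of {ALLOWED_MEMORY_MB[-1]} MB"
    )


# Example: a model estimated to need about 2.5 GB is assigned 3072 MB
memory_size = pick_memory_size(2500)
```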

Serverless endpoints have a quota for how many concurrent invocations can be processed at the same time. If the endpoint is invoked before it finishes processing the first request, then it handles the second request concurrently. **You can set the maximum concurrency for a single endpoint up to 200, and the total number of serverless endpoints you can host in a Region is 50**. The maximum concurrency for an individual endpoint prevents that endpoint from taking up all of the invocations allowed for your account, and any endpoint invocations beyond the maximum are throttled.
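Since invocations beyond the maximum concurrency are throttled, client code often retries with an increasing delay. The following is a generic sketch, not a SageMaker API: `call_with_backoff` and its parameters are hypothetical, and the caller supplies the `is_throttled` predicate that decides which exceptions count as throttling.

```python
import time


def call_with_backoff(call, is_throttled, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run `call()`, retrying with exponential backoff while it is throttled.

    Exceptions that `is_throttled` rejects are re-raised immediately, and the
    last throttling error is re-raised once `max_attempts` is reached.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if not is_throttled(exc) or attempt == max_attempts - 1:
                raise
            # Wait 0.5 s, 1 s, 2 s, ... before the next attempt
            sleep(base_delay * 2 ** attempt)


# Usage with boto3. The error code checked here is an assumption; verify it
# against the errors your endpoint actually returns:
# def is_throttled(exc):
#     return getattr(exc, "response", {}).get("Error", {}).get("Code") == "ThrottlingException"
# result = call_with_backoff(lambda: runtime.invoke_endpoint(...), is_throttled)
```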

If your endpoint does not receive traffic for a while and then suddenly receives new requests, it can take some time to spin up the compute resources needed to process them. This is called a **cold start**. Since serverless endpoints provision compute resources on demand, your endpoint may experience cold starts. A cold start can also occur when the number of concurrent requests grows beyond the concurrency the endpoint is currently serving, because additional compute resources must be provisioned. The cold start time depends on your model size, how long it takes to download your model, and the start-up time of your container.
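One way to see the effect of a cold start is to time several consecutive invocations and compare the first with the rest. The sketch below is a plain timing helper under that assumption; `time_calls` is a hypothetical name, and `invoke` stands for any zero-argument function that calls the endpoint.

```python
import time


def time_calls(invoke, repeats=3, clock=time.perf_counter):
    """Call `invoke()` `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = clock()
        invoke()
        durations.append(clock() - start)
    return durations


# Usage (requires a deployed serverless endpoint):
# durations = time_calls(lambda: runtime.invoke_endpoint(
#     EndpointName=endpoint_name,
#     ContentType="application/json",
#     Body=json.dumps({"inputs": "what is S3"}),
# ))
# After an idle period, durations[0] would be expected to include the cold start.
```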

In [15]:
from sagemaker.serverless.serverless_inference_config import ServerlessInferenceConfig
from time import gmtime, strftime

client = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Model name: created automatically when the real-time endpoint was deployed; you can find it in the SageMaker console
model_name='huggingface-pytorch-inference-2022-11-22-13-59-51-293'

# Name of the endpoint configuration; memory size and max concurrency are set in ServerlessConfig below
rcf_serverless_config = 'hf-serverless-epc-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName=rcf_serverless_config,
    ProductionVariants=[
        {
            "VariantName": "HFVariant",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 6144,
                "MaxConcurrency": 1,
            },
        },
    ],
)

# Create the serverless endpoint; choose a name that suits your use case
endpoint_name = "HF-Serverless-semanticSimilarity-ep-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=rcf_serverless_config,
)

Now, let's check the endpoint creation status.

In [16]:
import time
from datetime import datetime

max_time = time.time() + 15*60 # 15 min

while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    
    response = client.describe_endpoint(EndpointName=endpoint_name)
    status = response['EndpointStatus']
    print(f"{current_time} : Endpoint: {status}")

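    if status == 'InService':
        break
    if status == 'Failed':
        # Stop polling as soon as creation fails instead of waiting out the timeout
        raise RuntimeError(f"Endpoint creation failed: {response.get('FailureReason', 'unknown reason')}")
    time.sleep(60)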

11:22:07 : Endpoint: Creating
11:23:07 : Endpoint: Creating
11:24:07 : Endpoint: InService
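As an alternative to the manual polling loop, boto3 provides a built-in waiter for SageMaker endpoints. The sketch below wraps the `endpoint_in_service` waiter with the same 60-second interval and 15-minute budget used above; the wrapper name `wait_until_in_service` is hypothetical. The waiter raises an exception if the endpoint does not reach `InService` within the allowed attempts.

```python
def wait_until_in_service(sm_client, endpoint_name, delay_seconds=60, max_attempts=15):
    """Block until the endpoint is InService, then return its status.

    `sm_client` is expected to be a boto3 SageMaker client. The defaults
    poll once a minute for up to 15 minutes.
    """
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
    return sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]


# Usage (requires AWS credentials):
# status = wait_until_in_service(client, endpoint_name)
```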


### Invoking the serverless endpoint using boto3 SageMaker runtime

In [17]:
with open('FinancialAdvisor_JD.txt', 'r') as file:
    JD = file.read().replace("\n", "").replace("\ufeff","")

In [18]:
with open('sample_resume.txt', 'r') as file:
    finance_advisor_resume = file.read().replace("\n", "").replace("\ufeff","")
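Replacing `"\n"` with an empty string joins the last word of each line to the first word of the next one. If that is not intended, a possible variant is to collapse every run of whitespace into a single space instead. The helper below is a sketch of that idea; the name `normalize_text` is hypothetical, and using it may change the similarity score reported later in this notebook.

```python
def normalize_text(raw):
    """Strip the byte-order mark and collapse all whitespace runs to one space."""
    # "\ufeff" is not treated as whitespace by str.split, so remove it explicitly
    return " ".join(raw.replace("\ufeff", "").split())


# Usage:
# with open('FinancialAdvisor_JD.txt', 'r') as file:
#     JD = normalize_text(file.read())
```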

In [19]:
payload1 = {"inputs": JD}
payload2 = {"inputs": finance_advisor_resume}

In [20]:
response1 = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='application/json',
                                       Body=json.dumps(payload1))

In [21]:
sent1 = response1['Body'].read().decode()
# The response body is JSON, so parse it with json.loads
sent1_embedding = np.array(json.loads(sent1))

In [22]:
response2 = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='application/json',
                                       Body=json.dumps(payload2))

In [23]:
sent2 = response2['Body'].read().decode()
# The response body is JSON, so parse it with json.loads
sent2_embedding = np.array(json.loads(sent2))

In [25]:
cos_similarity_vectors_diff_size(sent1_embedding, sent2_embedding)

0.16094102746989233

**Please note that this model accepts at most 4,096 tokens per document, not 4,096 words. A word is often split into several tokens, so the effective word limit is lower.**
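For documents that exceed the limit, one option is to split the token sequence into windows that each fit the model and embed them separately. The sketch below works on a plain list of token IDs; it assumes you obtain those IDs from the model's own tokenizer (for example via the `transformers` library), which is not shown here. The function name `split_into_windows` and the `overlap` parameter are illustrative, and in practice you may need to leave room for the special tokens the tokenizer adds.

```python
def split_into_windows(token_ids, max_tokens=4096, overlap=0):
    """Split a list of token IDs into windows of at most `max_tokens` items.

    Consecutive windows share `overlap` tokens so that context at the
    boundaries is not lost entirely.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in the range [0, max_tokens)")
    if not token_ids:
        return []
    step = max_tokens - overlap
    # Stop once the remaining tokens are already covered by the previous window
    last_start = max(len(token_ids) - overlap, 1)
    return [token_ids[i:i + max_tokens] for i in range(0, last_start, step)]


# Example with a tiny limit so the windows are easy to read
windows = split_into_windows(list(range(10)), max_tokens=4, overlap=1)
```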