# Deploy Embedding Models on AWS Inferentia2 with Amazon SageMaker

[BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) is a fine-tuned BERT model that maps any text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, or semantic search. It is a good fit for the vector databases that back LLM applications.

In this end-to-end tutorial, you will learn how to deploy an embedding model and speed up its inference using AWS Inferentia2 with Hugging Face Optimum Neuron on Amazon SageMaker. We are going to use the Hugging Face Inference Neuron Container, a purpose-built inference container that makes it easy to deploy transformers and diffusers models on AWS Inferentia2 and Trainium.

You will learn how to: 
1. Set up the development environment
2. Pull from the Hub the pre-compiled model for AWS Neuron (Inferentia2)
3. Create a custom `inference.py` script for `embeddings`
4. Upload the neuron model and inference script to Amazon S3
5. Deploy a Real-time Inference Endpoint on Amazon SageMaker
6. Run and evaluate Inference performance of Embeddings Model on Inferentia2

Let's get started! 🚀

## 1. Set up the development environment

For this tutorial, we are going to use a Notebook Instance in Amazon SageMaker with the Python 3 (ipykernel) kernel and the SageMaker Python SDK to deploy [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) to a SageMaker inference endpoint.

As a first step, make sure you have the latest version of `optimum-neuron` and the SageMaker SDK installed.

In [None]:
# Install the required packages
%pip install sagemaker --upgrade --quiet
%pip install optimum-neuron --upgrade --quiet
# restart your kernel

Then, instantiate the sagemaker role and session.

In [None]:
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

Finally, you need to log in to the Hugging Face Hub to access the model artifacts, using a [User Access Token](https://huggingface.co/docs/hub/en/security-tokens) with read access.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## 2. Pull from the Hub the pre-compiled model for AWS Neuron (Inferentia2)

At the time of writing, [AWS Inferentia2 does not support dynamic shapes for inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/dynamic-shapes.html?highlight=dynamic%20shapes#), which means that the input size needs to be static for compilation and inference.

In simpler terms, this means we need to define the input shapes up front. For an embedding model, these are the sequence length and the batch size.

We precompiled the model with the following parameters and pushed it to the Hugging Face Hub [here](https://huggingface.co/aws-neuron/bge-base-en-v1-5-seqlen-384-bs-1): 
* `sequence_length`: 384
* `batch_size`: 1
* `neuron`: 2.21.1

_Note: If you want to compile your own model or with other parameters, follow the guide on how to [export a model to Inferentia](https://huggingface.co/docs/optimum-neuron/en/guides/export_model)._
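To make the static-shape constraint concrete, here is a minimal sketch in plain Python of what fitting an input to a fixed `sequence_length` means: shorter inputs are padded and longer inputs are truncated. The `fit_to_static_length` helper and the pad id `0` are illustrative assumptions and are not part of Optimum Neuron. In this tutorial, the tokenizer and the compiled model take care of this for you.

```python
def fit_to_static_length(token_ids, sequence_length=384, pad_id=0):
    """Pad or truncate a list of token ids to exactly `sequence_length`.

    Hypothetical helper for illustration; pad_id=0 is an assumption.
    """
    truncated = token_ids[:sequence_length]
    attention_mask = [1] * len(truncated)
    padding = sequence_length - len(truncated)
    # Padded positions get attention mask 0 so they can be ignored
    return truncated + [pad_id] * padding, attention_mask + [0] * padding


# A 300-token input is padded up to 384 tokens
short_ids, short_mask = fit_to_static_length(list(range(1, 301)))
# A 500-token input is truncated down to 384 tokens
long_ids, long_mask = fit_to_static_length(list(range(1, 501)))

print(len(short_ids), sum(short_mask))  # 384 300
print(len(long_ids), sum(long_mask))    # 384 384
```

The padded positions still cost compute, which is why matching the compiled sequence length to your typical input length matters for latency.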

Let's download it locally.

In [None]:
from huggingface_hub import snapshot_download
 
# compiled model id
compiled_model_id = "aws-neuron/bge-base-en-v1-5-seqlen-384-bs-1"
 
# save compiled model to local directory
save_directory = "/tmp/embedding_model"
# Download our compiled model from the Hugging Face Hub,
# using the revision as the Neuron version reference,
# and exclude the symlink files and "hidden" files, like .DS_Store, .gitignore, etc.
snapshot_download(
    compiled_model_id,
    revision="2.21.1",
    local_dir=save_directory,
    allow_patterns=["[!.]*.*"]
)
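To see which files a pattern like `[!.]*.*` lets through, the sketch below applies it to a few file names with Python's `fnmatch` module. It assumes `allow_patterns` uses `fnmatch`-style glob matching, and the file names are hypothetical examples, not a listing of the actual repository.

```python
from fnmatch import fnmatchcase

pattern = "[!.]*.*"

# Hypothetical file names, for illustration only
candidates = [
    "config.json",
    "model.neuron",
    "tokenizer.json",
    ".gitattributes",
    ".DS_Store",
    "LICENSE",
]

# Keep names that do not start with a dot and that contain a dot somewhere
kept = [name for name in candidates if fnmatchcase(name, pattern)]
print(kept)  # ['config.json', 'model.neuron', 'tokenizer.json']
```

Note that under this assumption the pattern also skips files without an extension, such as `LICENSE`.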

## 3. Create a custom `inference.py` script for `embeddings`

We need to provide a custom `inference.py` script that overrides the default inference handler used in the endpoint. We will override `model_fn` to load the Neuron-compiled model and tokenizer, and `predict_fn` to generate and normalize sentence embeddings from the input text with the Neuron model.

In [14]:
!mkdir -p {save_directory}/code

We set `NEURON_RT_NUM_CORES=1` so that each HTTP worker uses exactly one Neuron core, which maximizes throughput.

In [None]:
%%writefile {save_directory}/code/inference.py
import os
# To use one neuron core per worker
os.environ["NEURON_RT_NUM_CORES"] = "1"
from optimum.neuron import NeuronModelForFeatureExtraction
from transformers import AutoTokenizer
import torch
import torch_neuronx

def model_fn(model_dir):
    # load local converted model and  tokenizer
    model = NeuronModelForFeatureExtraction.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return model, tokenizer


def predict_fn(data, pipeline):
    model, tokenizer = pipeline
  
    # extract body 
    inputs = data.pop("inputs", data)
    
    # Tokenize sentences, truncating to the sequence length the model was compiled with
    encoded_input = tokenizer(
        inputs,
        return_tensors="pt",
        truncation=True,
        max_length=model.config.neuron["static_sequence_length"],
    )

    # Compute embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
    # normalize embeddings
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)    
    
    return {"embeddings":sentence_embeddings[0].tolist()}
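The pooling and normalization steps in `predict_fn` can be illustrated without any model. The sketch below reimplements them in plain Python on a toy output; `cls_pool_and_normalize` and the toy values are assumptions for illustration, and the real script uses `torch` as shown above.

```python
import math


def cls_pool_and_normalize(token_embeddings):
    """CLS pooling followed by L2 normalization.

    token_embeddings: nested lists shaped [batch][tokens][hidden].
    """
    pooled = []
    for sequence in token_embeddings:
        # CLS pooling: the first token's vector represents the whole sentence
        cls_vector = sequence[0]
        norm = math.sqrt(sum(x * x for x in cls_vector))
        # Guard against division by zero (torch's normalize defaults to eps=1e-12)
        norm = max(norm, 1e-12)
        pooled.append([x / norm for x in cls_vector])
    return pooled


# Toy model output: 1 sentence, 3 tokens, hidden size 2
toy_output = [[[3.0, 4.0], [1.0, 1.0], [0.5, 0.5]]]
embeddings = cls_pool_and_normalize(toy_output)
print(embeddings)  # [[0.6, 0.8]]
```

After normalization every embedding has unit length, so downstream similarity search can use a plain dot product.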

## 4. Upload the neuron model and inference script to Amazon S3

Before we can deploy our Neuron model to Amazon SageMaker, we need to create a `model.tar.gz` archive containing all our model artifacts, e.g. `model.neuron`, and upload it to Amazon S3.

Make sure the SageMaker role has sufficient permissions to upload to an S3 bucket.

In [None]:
# create a model.tar.gz archive with all the model artifacts and the inference.py script.
%cd {save_directory}
!tar zcvf model.tar.gz *
%cd ..
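If you prefer to build the archive from Python rather than the shell, the `tarfile` module can produce the same layout. The sketch below uses placeholder files in a temporary directory, so the file list is an assumption for illustration. The point it shows is that the model files and `code/inference.py` sit at the archive root, with no parent folder.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "embedding_model"
    (model_dir / "code").mkdir(parents=True)
    # Placeholder files standing in for the real artifacts
    for relative in ["config.json", "model.neuron", "code/inference.py"]:
        (model_dir / relative).write_text("placeholder")

    archive_path = Path(tmp) / "model.tar.gz"
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(model_dir.iterdir()):
            # arcname keeps entries relative to the archive root
            archive.add(path, arcname=path.name)

    # Read the archive back to check its layout
    with tarfile.open(archive_path, "r:gz") as archive:
        members = sorted(archive.getnames())

print(members)
```

Listing the members before uploading is a cheap way to catch an accidental top-level directory inside the archive.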

Now we can upload our `model.tar.gz` to our session S3 bucket with `sagemaker`.

In [None]:
from sagemaker.s3 import S3Uploader

# create s3 uri
s3_model_path = f"s3://{sess.default_bucket()}/neuronx/embeddings"

# upload model.tar.gz
s3_model_uri = S3Uploader.upload(local_path=f"{save_directory}/model.tar.gz", desired_s3_uri=s3_model_path)
print(f"model artifacts uploaded to {s3_model_uri}")

## 5. Deploy a Real-time Inference Endpoint on Amazon SageMaker

After we have uploaded our `model.tar.gz` to Amazon S3, we can create a custom `HuggingFaceModel`. This class will be used to create and deploy our real-time inference endpoint on Amazon SageMaker.

The `inf2.xlarge` instance type is the smallest instance type with AWS Inferentia2 support. It comes with 1 Inferentia2 chip with 2 Neuron cores. This means we can use 2 model server workers, one per core, to maximize throughput and run 2 inferences in parallel.

In [None]:
from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=s3_model_uri,        # path to your model.tar.gz on s3
   role=role,                      # iam role with permissions to create an Endpoint
#    transformers_version="4.34.1",  # transformers version used
#    pytorch_version="1.13.1",       # pytorch version used
#    py_version='py310',             # python version used
   image_uri="",                   # TODO: set the Hugging Face Neuron inference container image URI
   model_server_workers=2,         # number of workers for the model server
)

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,        # number of instances
    instance_type="ml.inf2.xlarge",  # AWS Inferentia2 instance
    volume_size=100,                 # size of the attached storage volume in GB
)
# ignore the "Your model is not compiled. Please compile your model before using Inferentia." warning, we already compiled our model.

## 6. Run and evaluate Inference performance of Embeddings Model on Inferentia2

The `.deploy()` method returns a `HuggingFacePredictor` object, which can be used to request inference.

In [36]:
data = {
  "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = predictor.predict(data=data)


# print some results 
print(f"length of embeddings: {len(res['embeddings'])}")
print(f"first 10 elements of embeddings: {res['embeddings'][:10]}")

length of embeddings: 768
first 10 elements of embeddings: [-0.03262409567832947, -0.03421149030327797, 0.041609905660152435, -0.0013438157038763165, 0.027236895635724068, -0.05484848469495773, 0.02483028545975685, -0.029165517538785934, -0.02704770117998123, 0.004101182334125042]
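Because `predict_fn` L2-normalizes the embeddings, comparing two of them for retrieval or semantic search reduces to a dot product. Here is a minimal sketch with toy 2-dimensional unit vectors standing in for real 768-dimensional embeddings; `query`, `documents`, and `cosine_similarity` are illustrative names, not part of the endpoint.

```python
def cosine_similarity(a, b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))


# Toy unit vectors standing in for embeddings returned by the endpoint
query = [0.6, 0.8]
documents = {"doc_a": [0.8, 0.6], "doc_b": [-0.6, -0.8]}

# Rank documents from most to least similar to the query
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)  # ['doc_a', 'doc_b']
```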


Awesome, we can now generate embeddings with our model. Let's test its performance.

For the load test, we will send 10,000 requests to our endpoint using 10 concurrent threads, and measure the average latency and throughput of the endpoint. Each request carries an input of 300 tokens, for a total of 3 million tokens. Remember that the model is compiled with a `sequence_length` of 384, so every input is padded to 384 tokens, which increases the latency a bit.

> We decided to use 300 tokens as the input length to strike a balance between shorter inputs, which are padded, and longer inputs, which are truncated. If you know your chunk size, we recommend compiling the model with that length to get maximum performance.

_Note: When running the load test, the requests are sent from Europe and the endpoint is deployed in us-east-2. This adds network overhead to the measured time._

In [37]:
import threading
import time 
number_of_threads = 10
number_of_requests = int(10000 // number_of_threads)
print(f"number of threads: {number_of_threads}")
print(f"number of requests per thread: {number_of_requests}")

def send_requests():
    for _ in range(number_of_requests):
        # token count of this input was measured with https://huggingface.co/spaces/Xenova/the-tokenizer-playground
        predictor.predict(data={"inputs": "Hugging Face is a company and a popular platform in the field of natural language processing (NLP) and machine learning. They are known for their contributions to the development of state-of-the-art models for various NLP tasks and for providing a platform that facilitates the sharing and usage of pre-trained models. One of the key offerings from Hugging Face is the Transformers library, which is an open-source library for working with a variety of pre-trained transformer models, including those for text generation, translation, summarization, question answering, and more. The library is widely used in the research and development of NLP applications and is supported by a large and active community. Hugging Face also provides a model hub where users can discover, share, and download pre-trained models. Additionally, they offer tools and frameworks to make it easier for developers to integrate and use these models in their own projects. The company has played a significant role in advancing the field of NLP and making cutting-edge models more accessible to the broader community."})

# Create multiple threads
threads = [threading.Thread(target=send_requests) for _ in range(number_of_threads)]
# start all threads
start = time.time()
for t in threads:
    t.start()
# wait for all threads to finish
for t in threads:
    t.join()
print(f"total time: {round(time.time() - start)} seconds")

number of threads: 10
number of requests per thread: 1000
total time: 50 seconds
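The load test above reports only the total wall-clock time. If you also want client-side latency percentiles, one possible approach is to time each request individually. The sketch below uses `concurrent.futures` and a hypothetical `fake_predict` stub in place of `predictor.predict`, so it runs without an endpoint. Swap in the real call to measure your own deployment.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_predict(payload):
    """Hypothetical stand-in for predictor.predict; sleeps to mimic a request."""
    time.sleep(0.001)
    return {"embeddings": [0.0]}


def timed_request(payload):
    """Return the client-side latency of a single request in seconds."""
    start = time.perf_counter()
    fake_predict(payload)
    return time.perf_counter() - start


payloads = [{"inputs": "hello"}] * 200

wall_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    latencies = list(pool.map(timed_request, payloads))
wall_time = time.perf_counter() - wall_start

# quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95
cuts = statistics.quantiles(latencies, n=100)
print(f"requests: {len(latencies)}")
print(f"throughput: {len(latencies) / wall_time:.0f} req/s")
print(f"p50: {cuts[49] * 1000:.2f} ms, p95: {cuts[94] * 1000:.2f} ms")
```

Client-side numbers include network round trips, so expect them to be higher than the model latency reported by the endpoint.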


Sending 10,000 requests (3 million tokens) took around 50 seconds, which means we can run around 200 inferences per second end to end.
When we inspect the endpoint latency in CloudWatch, we can see that the average model latency is around 9ms. With 2 HTTP workers, this means the endpoint can serve around 222 inferences per second (2 / 0.009 s) without network overhead.

In [None]:
print(f"https://console.aws.amazon.com/cloudwatch/home?region={sess.boto_region_name}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'{predictor.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{sess.boto_region_name}~start~'-PT5M~end~'P0D~stat~'Average~period~30);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{predictor.endpoint_name}")

The average latency for our Embeddings model is `9ms`.
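The two throughput figures quoted above come from simple arithmetic, reproduced here as a worked example. The variable names are illustrative; the inputs are the numbers reported in this section.

```python
# End-to-end throughput measured by the load test (includes network overhead)
total_requests = 10_000
total_seconds = 50
measured_throughput = total_requests / total_seconds

# Upper bound from model latency: each worker handles one request at a time
model_latency_seconds = 0.009
workers = 2
latency_bound_throughput = workers / model_latency_seconds

print(f"measured: {measured_throughput:.0f} req/s")            # measured: 200 req/s
print(f"latency-bound: {latency_bound_throughput:.0f} req/s")  # latency-bound: 222 req/s
```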

**Price / performance ratio**

In this post, we deployed a top open source embedding model (BGE) on a single `inf2.xlarge` instance costing $0.76/hour on Amazon SageMaker using Optimum Neuron. We are able to run 2 replicas of the model on a single instance, with an average model latency of 9ms for inputs of 300 tokens (max sequence length of 384) and a throughput of 222 inferences per second without network overhead.

This means we can process 66,600 tokens per second (300 tokens * 222 requests), 3,996,000 tokens per minute, and 239,760,000 tokens per hour. This leads to a cost of `~$0.003 per 1M tokens` if the endpoint is fully utilized.
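The cost estimate can be reproduced with a few lines of arithmetic. This is a worked example using the figures from this section (300 tokens per request, 222 requests per second, $0.76 per hour); it assumes the endpoint is fully utilized for the whole hour.

```python
tokens_per_request = 300
requests_per_second = 222
price_per_hour_usd = 0.76

tokens_per_second = tokens_per_request * requests_per_second
tokens_per_hour = tokens_per_second * 3600
# Price for one hour divided by the millions of tokens processed in that hour
cost_per_million_tokens = price_per_hour_usd / (tokens_per_hour / 1_000_000)

print(f"{tokens_per_second:,} tokens/s")                    # 66,600 tokens/s
print(f"{tokens_per_hour:,} tokens/h")                      # 239,760,000 tokens/h
print(f"${cost_per_million_tokens:.4f} per 1M tokens")      # $0.0032 per 1M tokens
```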

For startups and companies looking for a GPU alternative for generating embeddings, Inferentia2 is a great option for inference that is not only efficient and fast but also cost-effective.

### Delete model and endpoint

To clean up, we can delete the model and endpoint.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()