# Deploy and benchmark reranker models on Inferentia2 using Amazon SageMaker

In information retrieval and natural language processing applications, rerankers have emerged as powerful tools to enhance the accuracy and relevance of search results. Rerankers are specialized techniques or machine learning models designed to optimize the ordering of a set of retrieved items to improve the overall quality of information retrieval systems.

This notebook demonstrates how to deploy and scale reranker models on AWS Inferentia2 using Amazon SageMaker.

## Setup

Upgrade the necessary libraries

In [None]:
! pip install -U transformers sagemaker

In [None]:
! python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
! python -m pip install --upgrade-strategy eager optimum[neuronx]

Instantiate the necessary session parameters

In [None]:
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
from sagemaker.djl_inference.model import DJLModel
from sagemaker.jumpstart.model import JumpStartModel

import os
import time
import concurrent.futures
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import boto3
from botocore.config import Config

sagemaker_session = sagemaker.session.Session(
    sagemaker_runtime_client=boto3.client(
        "sagemaker-runtime",
        config=Config(
            connect_timeout=10, retries={"mode": "standard", "total_max_attempts": 20}
        ),
    )
)
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()

## Download and compile the model

In this section, we will download and compile the model using a SageMaker training job. We will implement the following steps:
1. Write the compilation script
2. Create a Hugging Face SageMaker Estimator to instantiate the compilation job
3. Run the compilation job

The compilation script uses the Hugging Face Optimum Neuron library, which is the interface between the 🤗 Transformers library and AWS accelerators, including AWS Trainium and AWS Inferentia.

In [None]:
! mkdir src

In [None]:
%%writefile src/compile_reranker.py
import os
import tarfile
import torch
import torch_neuronx
from optimum.neuron import NeuronModelForSequenceClassification


model_name = 'BAAI/bge-reranker-v2-m3'


if __name__=='__main__':
    # Export the model and compile it for Neuron with static input shapes
    model = NeuronModelForSequenceClassification.from_pretrained(
        model_name,
        export=True,
        batch_size=2,
        dynamic_batch_size=True,
        sequence_length=2048,
        auto_cast_type="fp16"
    )
    
    # Save the compiled model artifacts to the SageMaker model directory
    model.save_pretrained("/opt/ml/model/")
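
The export above fixes `batch_size` and `sequence_length` at compile time because the compiled Neuron graph works on static input shapes. As a rough illustration of what a fixed sequence length means for an input (this is not the library's internal implementation), the sketch below truncates or right-pads a list of token ids to a fixed length. The helper `pad_to_static_length` and the `pad_id` value are hypothetical placeholders.

```python
def pad_to_static_length(token_ids, static_length, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly `static_length`.

    `pad_id=0` is a placeholder; a real tokenizer defines its own pad token id.
    """
    clipped = token_ids[:static_length]
    return clipped + [pad_id] * (static_length - len(clipped))


# A short input is padded, a long input is truncated.
short = pad_to_static_length([101, 7, 8, 102], static_length=6)
long = pad_to_static_length(list(range(10)), static_length=6)
print(short)
print(long)
```

Longer static sequence lengths accept longer query-passage pairs, but every request then pays for the padded length, so the value is a trade-off between coverage and latency.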

Now, let's instantiate a Hugging Face SageMaker estimator referencing the compilation script, and the Deep Learning Container (DLC) to use.

In [None]:
from sagemaker.huggingface import HuggingFace

instance_type = "ml.trn1.2xlarge"
model_name = "BAAI/bge-reranker-v2-m3"
save_directory = "bge-reranker-v2-m3"

s3_model_path = f"s3://{bucket}/compiled_models/{model_name}"


estimator = HuggingFace(
    entry_point="compile_reranker.py",
    source_dir="src",
    role=role,
    sagemaker_session=sagemaker_session,
    instance_count=1,
    instance_type=instance_type,
    output_path=s3_model_path,
    disable_profiler=True,
    disable_output_compression=True,
    image_uri=f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-training-neuronx:1.13.1-transformers4.36.2-neuronx-py310-sdk2.18.0-ubuntu20.04",
    volume_size=128,
    py_version="py310",
)

Run the compilation job

In [None]:
estimator.fit()

In [None]:
from sagemaker.s3 import S3Downloader

s3_model_uri = S3Downloader.download(
    s3_uri=f"{s3_model_path}/{estimator._current_job_name}/output/model/",
    local_path=save_directory,
)
print(f"model artifacts downloaded to {save_directory}")
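
Before packaging, it can help to confirm what was actually downloaded. The following is a minimal sketch that lists every file under a directory. The helper name `list_artifacts` is hypothetical, and the demo runs on a throwaway directory with placeholder file names rather than on the real compiled artifacts.

```python
import tempfile
from pathlib import Path


def list_artifacts(directory):
    """Return the sorted relative paths of all files under `directory`."""
    root = Path(directory)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


# Demo on a temporary directory; the file names here are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "weights.bin").write_bytes(b"\x00")
    demo_listing = list_artifacts(tmp)

print(demo_listing)
```

In the notebook, calling `list_artifacts(save_directory)` shows whether the compiled model, its configuration, and any tokenizer files the inference script needs are present.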

## Prepare the inference script

In this section, we provide an inference script that customizes the preprocessing and postprocessing of reranking requests.

In [None]:
! mkdir {save_directory}/code

In [None]:
%%writefile {save_directory}/code/inference.py
import os

os.environ["NEURON_RT_NUM_CORES"] = "1"
from optimum.neuron import NeuronModelForSequenceClassification
from transformers import AutoTokenizer
import torch
import torch_neuronx
 
def model_fn(model_dir, temp=None):
    # Load the compiled model and the tokenizer from the model directory
    model = NeuronModelForSequenceClassification.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return model, tokenizer
 
 
def predict_fn(data, pipeline):
    model, tokenizer = pipeline
 
    # extract body
    inputs = data.pop("inputs", data)
    print(inputs)
    # Tokenize sentences
    encoded_input = tokenizer(inputs,return_tensors="pt", padding=True, truncation=True, max_length=model.config.neuron["static_sequence_length"])
 
    # Compute relevance scores (one logit per query-passage pair)
    with torch.no_grad():
        scores = model(**encoded_input, return_dict=True).logits.view(-1, ).float()
        scores = torch.sigmoid(scores).tolist()
 
    return scores
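
The post-processing step in `predict_fn` maps each raw logit to a score between 0 and 1 with a sigmoid. The sketch below reproduces that mapping in plain Python, without `torch`, to show how logits translate into scores. The logit values are made up for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Illustrative logits only; real values come from the reranker model.
logits = [-4.0, 0.0, 4.0]
scores = [sigmoid(x) for x in logits]
print(scores)
```

Because the sigmoid is monotonic, it does not change the ranking of passages. It only rescales the scores so they are easier to compare against a threshold.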

## Upload the model artefacts to Amazon S3

First, compress the model artefacts and inference code.

In [None]:
%cd {save_directory}
!tar zcvf model.tar.gz *
%cd ..
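
A custom inference script is expected under `code/inference.py` at the top level of the archive, which is where the cells above place it. As a hedged sanity check, the sketch below lists the files in a `model.tar.gz` and tests for that path. The helper names are hypothetical, and the demo builds a throwaway archive with placeholder contents instead of reading the real one.

```python
import io
import os
import tarfile
import tempfile


def archive_members(tar_path):
    """Return the sorted names of regular files in a gzip-compressed tar archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


def has_inference_script(tar_path, script="code/inference.py"):
    """Check that the archive contains the inference script at the expected path."""
    return script in archive_members(tar_path)


# Demo on a temporary archive with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    demo_tar = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(demo_tar, "w:gz") as tar:
        for name in ("config.json", "code/inference.py"):
            data = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    demo_members = archive_members(demo_tar)
    demo_ok = has_inference_script(demo_tar)

print(demo_members, demo_ok)
```

In the notebook, `has_inference_script(f"{save_directory}/model.tar.gz")` should return `True` before the upload step.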

Upload to Amazon S3.

In [None]:
from sagemaker.s3 import S3Uploader

# create s3 uri
s3_model_path = f"s3://{bucket}/inference_artefacts/{model_name}"

# upload model.tar.gz
s3_model_uri = S3Uploader.upload(
    local_path=f"{save_directory}/model.tar.gz", desired_s3_uri=s3_model_path
)
print(f"model artifacts uploaded to {s3_model_uri}")

## Create Hugging Face SageMaker Model Objects

In this section, we instantiate a Hugging Face SageMaker Model object and reference the model artifacts in Amazon S3. We also choose the appropriate Hugging Face Deep Learning Container image for inference.

In [None]:
# create Hugging Face Model Class
model_reranker = HuggingFaceModel(
    model_data=s3_model_uri,  # path to your model.tar.gz on s3
    role=role,  # iam role with permissions to create an Endpoint
    image_uri=f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-inference-neuronx:2.1.2-transformers4.36.2-neuronx-py310-sdk2.18.0-ubuntu20.04",
)

## Deploy the Model to an endpoint

In [None]:
model_reranker._is_compiled_model = True

model_reranker_predictor = model_reranker.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    volume_size=100,
    wait=False,
)
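
Because the deployment is started with `wait=False`, the call returns before the endpoint is ready to serve requests. One way to block until it is ready is the `endpoint_in_service` waiter of the boto3 SageMaker client, sketched below. The helper name `wait_until_in_service` and the polling values are assumptions, not part of the original notebook.

```python
def wait_until_in_service(sm_client, endpoint_name, delay=30, max_attempts=60):
    """Block until the endpoint is InService, then return its status.

    `sm_client` is expected to be a boto3 SageMaker client, for example
    boto3.client("sagemaker"). The polling interval and attempt count are
    illustrative defaults.
    """
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
    return sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]


# Usage in the notebook (requires AWS credentials, so it is left commented out):
# import boto3
# status = wait_until_in_service(
#     boto3.client("sagemaker"), model_reranker_predictor.endpoint_name
# )
```

The waiter raises an error if the endpoint fails to create or the attempts run out, which surfaces deployment problems before any invocation is attempted.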

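Once the endpoint is in service, test the model invocation. The deployment above was started with `wait=False`, so the call returns before the endpoint is ready.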

In [None]:
payload = {
    "inputs": [
        ["what is panda?", "hi"],
        [
            "what is panda?",
            "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.",
        ],
    ]
}

In [None]:
model_reranker_predictor.predict(payload)
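
Assuming the endpoint returns one score per query-passage pair, in input order, as `predict_fn` does, the scores can be paired with their passages and sorted to produce the reranked list. This is a minimal sketch: `rank_passages` is a hypothetical helper and the scores below are made up, not real model output.

```python
def rank_passages(pairs, scores):
    """Return (passage, score) tuples sorted from most to least relevant."""
    if len(pairs) != len(scores):
        raise ValueError("expected exactly one score per query-passage pair")
    ranked = sorted(zip(pairs, scores), key=lambda item: item[1], reverse=True)
    return [(pair[1], score) for pair, score in ranked]


# Illustrative inputs and scores only.
example_pairs = [
    ["what is panda?", "hi"],
    ["what is panda?", "The giant panda is a bear species endemic to China."],
]
example_scores = [0.02, 0.97]
ranking = rank_passages(example_pairs, example_scores)
print(ranking[0][0])
```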

## Benchmark the endpoint

Create a benchmark script that sends concurrent requests, records latency and throughput at each concurrency level, and plots the results.

In [None]:
# Assuming predictor is already defined and initialized
# and predictor.predict(data=payload) is the method to be benchmarked


def benchmark_predictor(predictor, payload, steps, iterations=5):
    """
    Benchmarks a predictor's performance by measuring latency and throughput
    under varying levels of concurrent requests.

    Args:
        predictor (object): The predictor object with a `predict` method.
        payload (any): The input data to be sent to the predictor.
        steps (list): A list of different numbers of concurrent requests to test.
        iterations (int, optional): The number of iterations for each concurrency level. Default is 5.

    Returns:
        tuple: Three lists containing the request counts, latencies, and throughputs.
    """
    latencies = []
    throughputs = []
    request_counts = []

    def send_request():
        """Sends a single request to the predictor and measures its latency."""
        start_time = time.time()
        resp = predictor.predict(data=payload)
        latency = time.time() - start_time
        return latency

    for num_requests in steps:
        iter_latencies = []
        iter_throughputs = []

        for _ in range(iterations):
            start_time = time.time()

            # Use ThreadPoolExecutor to send concurrent requests
            with concurrent.futures.ThreadPoolExecutor(
                max_workers=num_requests
            ) as executor:
                futures = [executor.submit(send_request) for _ in range(num_requests)]
                latencies_batch = [
                    future.result()
                    for future in concurrent.futures.as_completed(futures)
                ]

            total_time = time.time() - start_time

            # Calculate average latency for this iteration
            latency = np.mean(latencies_batch)
            # Calculate throughput for this iteration
            throughput = num_requests / total_time

            iter_latencies.append(latency)
            iter_throughputs.append(throughput)

        # Calculate average latency and throughput over all iterations
        avg_latency = np.mean(iter_latencies)
        avg_throughput = np.mean(iter_throughputs)

        latencies.append(avg_latency)
        throughputs.append(avg_throughput)
        request_counts.append(num_requests)

        # Print results for the current number of requests
        print(
            f"Requests: {num_requests}, Average Latency: {avg_latency:.4f}s, Average Throughput: {avg_throughput:.2f} req/s"
        )

    return request_counts, latencies, throughputs


def plot_metrics(request_counts, latencies, throughputs):
    """
    Plots the benchmarking results, showing the average latency and throughput
    as a function of the number of concurrent requests.

    Args:
        request_counts (list): The list of different numbers of concurrent requests tested.
        latencies (list): The list of average latencies corresponding to the request counts.
        throughputs (list): The list of average throughputs corresponding to the request counts.
    """
    fig, ax1 = plt.subplots()

    color = "tab:blue"
    ax1.set_xlabel("Number of Concurrent Requests")
    ax1.set_ylabel("Average Latency (s)", color=color)
    ax1.plot(request_counts, latencies, color=color)
    ax1.tick_params(axis="y", labelcolor=color)

    ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

    color = "tab:green"
    ax2.set_ylabel(
        "Throughput (requests/s)", color=color
    )  # we already handled the x-label with ax1
    ax2.plot(request_counts, throughputs, color=color)
    ax2.tick_params(axis="y", labelcolor=color)

    fig.tight_layout()  # otherwise the right y-label is slightly clipped
    plt.title("Latency and Throughput Benchmarking")
    plt.show()


def plot_latency_vs_throughput(latencies, throughputs, request_counts):
    """
    Plots latency against throughput, with annotations for the number of concurrent requests.

    Args:
        latencies (list): The list of average latencies.
        throughputs (list): The list of average throughputs.
        request_counts (list): The list of different numbers of concurrent requests tested.
    """
    plt.figure()
    plt.plot(throughputs, latencies, "o-")
    plt.xlabel("Throughput (requests/s)")
    plt.ylabel("Average Latency (s)")
    plt.title("Latency vs Throughput")
    plt.grid(True)

    # Label each point with the request count
    for i, request_count in enumerate(request_counts):
        plt.annotate(
            request_count,
            (throughputs[i], latencies[i]),
            textcoords="offset points",
            xytext=(0, 10),
            ha="center",
        )

    plt.show()
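
The benchmark above reports mean latency per concurrency level, which can hide slow outliers. If the per-request latencies are kept (for example the values collected in `latencies_batch`), tail percentiles can be computed as in the sketch below. The helper `latency_percentiles` is hypothetical and the sample data is synthetic.

```python
import statistics


def latency_percentiles(latencies, percentiles=(50, 90, 99)):
    """Return the requested percentiles of a list of per-request latencies.

    Requires at least two data points.
    """
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in percentiles}


# Synthetic latencies in milliseconds: 1, 2, ..., 101.
sample_latencies_ms = list(range(1, 102))
summary = latency_percentiles(sample_latencies_ms)
print(summary)
```

Reporting p90 or p99 alongside the mean gives a clearer picture of how the endpoint behaves as concurrency increases.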

In [None]:
df_benchmark = pd.DataFrame(
    columns=[
        "client_batch_size",
        "concurrent_request_counts",
        "latencies",
        "throughputs",
    ]
)

min_requests = 0
max_requests = 5
step_size = 1
iterations = 5
client_batch_size = 8

steps = list(map(lambda x: 2**x, range(min_requests, max_requests, step_size)))


payload = {
    "inputs": [[
        "what is panda?",
        "A panda is a type of bear that is known for its distinctive black and white coloring.",
    ]]
    * client_batch_size
}


request_counts, latencies, throughputs = benchmark_predictor(
    model_reranker_predictor, payload, steps, iterations
)
plot_metrics(request_counts, latencies, throughputs)
plot_latency_vs_throughput(latencies, throughputs, request_counts)

new_data = {
    "client_batch_size": [client_batch_size] * len(request_counts),
    "concurrent_request_counts": request_counts,
    "latencies": latencies,
    "throughputs": throughputs,
}

# DataFrame.append was removed in pandas 2.0, so use pd.concat to add the new rows
df_benchmark = pd.concat([df_benchmark, pd.DataFrame(new_data)], ignore_index=True)
df_benchmark

## Cleanup

In [None]:
model_reranker_predictor.delete_endpoint()