# SageMaker JumpStart Foundation Models - Benchmark Latency and Throughput

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

---

***
Welcome to Amazon [SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html)! You can use SageMaker JumpStart to solve many machine learning tasks with one click in SageMaker Studio or through the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#use-prebuilt-models-with-sagemaker-jumpstart).

When testing a large language model for production use cases, common questions arise, such as: 
- What is the inference latency for my expected payloads?
- How much throughput does this model configuration provide for my expected payloads?
- What is the inference throughput and latency for my expected concurrency load, i.e., the number of concurrent requests that have invoked the endpoint?
- How much does it cost to generate 1 million tokens?
- How does instance type selection (e.g., `ml.g5.2xlarge`) affect latency and throughput?
- How does modification of the deployment configuration (e.g., tensor parallel degree) affect latency and throughput?

Given these questions, you may notice that inference latency and throughput depend on numerous factors, including payload, number of concurrent requests, instance type, and deployment configuration. In this notebook, we demonstrate how you can run your own latency and throughput benchmark for SageMaker JumpStart endpoints. This benchmarking process involves running load tests with various concurrent request values for each payload and deployed endpoint.
***
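To make the idea of a load test at a fixed concurrency concrete, here is a minimal sketch in plain Python. It is not the implementation used by the `benchmarking` package in this notebook: `fake_invoke`, `timed_invoke`, and `run_load_test` are hypothetical names, and the endpoint call is replaced by a short sleep so the snippet runs without AWS resources.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def fake_invoke(payload: dict) -> int:
    """Hypothetical stand-in for an endpoint invocation; returns the number of generated tokens."""
    time.sleep(0.01)  # simulate model latency
    return payload["max_new_tokens"]


def timed_invoke(payload: dict) -> tuple:
    """Invoke once and return (request latency in seconds, generated tokens)."""
    start = time.perf_counter()
    tokens = fake_invoke(payload)
    return time.perf_counter() - start, tokens


def run_load_test(payload: dict, concurrent_requests: int, num_invocations: int) -> dict:
    """Send num_invocations requests with at most concurrent_requests in flight at once."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrent_requests) as pool:
        results = list(pool.map(timed_invoke, [payload] * num_invocations))
    wall_seconds = time.perf_counter() - wall_start

    latencies = [latency for latency, _ in results]
    total_tokens = sum(tokens for _, tokens in results)
    return {
        "total_tokens": total_tokens,
        # Throughput counts all generated tokens over the wall-clock duration of the test.
        "throughput_tokens_per_s": total_tokens / wall_seconds,
        # The last of the nine decile cut points is the 90th percentile.
        "p90_latency_s": quantiles(latencies, n=10)[-1],
    }


result = run_load_test({"max_new_tokens": 128}, concurrent_requests=4, num_invocations=20)
print(result)
```

Repeating such a test over a range of `concurrent_requests` values gives the latency and throughput curves that the rest of this notebook collects against a real endpoint.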

In [1]:
# !pip install --upgrade --quiet sagemaker transformers

***
The primary inputs to this benchmarking tool include the models to benchmark and the payloads used to invoke endpoints.
- **`MODELS`**: A dictionary mapping a unique name to a benchmarking configuration. Each value is a dictionary that defines the model in one of three ways (`jumpstart_model_specs`, `model_specs`, or `endpoint_name`), plus a `huggingface_model_id` for token-based metrics:
  - **`jumpstart_model_specs` key**: requires `model_args` and optionally `deploy_args` definitions to use with a SageMaker SDK `JumpStartModel` constructor and deploy methods, respectively. This should be used to deploy and benchmark a JumpStart model.
  - **`model_specs` key**: requires `image_uri_args`, `model_args`, and `deploy_args` definitions to use with a SageMaker SDK `Model` constructor and deploy methods. This should be used to deploy and benchmark a non-JumpStart model.
  - **`endpoint_name` key**: provide the endpoint name of a pre-deployed model to benchmark.
  - **`huggingface_model_id` key**: to compute metrics with respect to model tokens, provide the HuggingFace Model ID with an appropriate tokenizer to use.
- **`PAYLOADS`**: A dictionary mapping a unique name to a payload of interest. The benchmarking tool will serially run a concurrency probe against each payload.

For this notebook, we deploy Llama 2 7B using `JumpStartModel`.
***
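As a quick sanity check before starting a long benchmark, the sketch below verifies that every `MODELS` entry uses at least one of the three definition keys listed above. The helper `check_models` and the endpoint name `my-existing-endpoint` are hypothetical and exist only for illustration; whether the `benchmarking` package performs a similar validation itself is not shown in this notebook.

```python
DEFINITION_KEYS = ("jumpstart_model_specs", "model_specs", "endpoint_name")


def check_models(models: dict) -> None:
    """Raise ValueError if an entry uses none of the three definition keys."""
    for name, config in models.items():
        used = [key for key in DEFINITION_KEYS if key in config]
        if not used:
            raise ValueError(f"Model '{name}' must set one of {DEFINITION_KEYS}.")


EXAMPLE_MODELS = {
    # Deploy a JumpStart model as part of the benchmark.
    "llama2-7b-jumpstart": {
        "jumpstart_model_specs": {
            "model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*"}
        },
        "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
    },
    # Benchmark a pre-deployed endpoint instead of deploying a new one.
    "llama2-7b-existing-endpoint": {
        "endpoint_name": "my-existing-endpoint",  # hypothetical endpoint name
        "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
    },
}

check_models(EXAMPLE_MODELS)
```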

In [7]:
from benchmarking.payload import create_test_payload


PAYLOADS = {
    "input_128_output_128": create_test_payload(input_words=128, output_tokens=128),
    # "input_512_output_128": create_test_payload(input_words=512, output_tokens=128),
}

# instance types: ml.g5.2xlarge, ml.g5.4xlarge, ml.g5.8xlarge, ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge, ml.p4d.24xlarge (p4 quota increase requires manual approval)

# # ALL DEFAULT PARAMETERS
# MODELS = {
#     "llama2-7b-jumpstart-g5-2xlarge": {
#         "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.g5.2xlarge"}},
#         "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
#     },
#     "llama2-7b-jumpstart-g5-4xlarge": {
#         "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.g5.4xlarge"}},
#         "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
#     },
#     "llama2-7b-jumpstart-g5-8xlarge": {
#         "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.g5.8xlarge"}},
#         "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
#     },
#     "llama2-7b-jumpstart-g5-12xlarge": {
#         "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.g5.12xlarge"}},
#         "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
#     },
#     "llama2-7b-jumpstart-g5-24xlarge": {
#         "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.g5.24xlarge"}},
#         "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
#     },
#     "llama2-7b-jumpstart-g5-48xlarge": {
#         "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.g5.48xlarge"}},
#         "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
#     },
#     # "llama2-7b-jumpstart-p4d-24xlarge": {
#     #     "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.p4d.24xlarge"}},
#     #     "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
#     # },
# }

# LIMIT NUMBER OF INPUT AND OUTPUT TOKENS
MODELS = {
    # "llama2-7b-jumpstart-g5-2xlarge": {
    #     "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.g5.2xlarge"}},
    #     "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
    # },
    # "llama2-7b-jumpstart-g5-4xlarge": {
    #     "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.g5.4xlarge"}},
    #     "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
    # },
    # "llama2-7b-jumpstart-g5-8xlarge": {
    #     "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.g5.8xlarge"}},
    #     "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
    # },
    "llama2-7b-jumpstart-g5-12xlarge": {
        "jumpstart_model_specs": {
            "model_args": {
                "model_id": "meta-textgeneration-llama-2-7b-f",
                "model_version": "3.*",
                "instance_type": "ml.g5.12xlarge",
                "env": {
                    "MAX_INPUT_TOKENS": "192",
                    "MAX_TOTAL_TOKENS": "512",
                    "MAX_INPUT_LENGTH": "192",
                },
            }
        },
        "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
        # "endpoint_name": "abcdefg-NPUcqgcFgDqT"
    },
    # "llama2-7b-jumpstart-g5-24xlarge": {
    #     "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.g5.24xlarge"}},
    #     "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
    # },
    # "llama2-7b-jumpstart-g5-48xlarge": {
    #     "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.g5.48xlarge"}},
    #     "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
    # },
    # "llama2-7b-jumpstart-p4d-24xlarge": {
    #     "jumpstart_model_specs": {"model_args": {"model_id": "meta-textgeneration-llama-2-7b-f", "model_version": "3.*", "instance_type": "ml.p4d.24xlarge"}},
    #     "huggingface_model_id": "meta-llama/Llama-2-7b-chat",
    # },
}


***
The default concurrency probe iteratively loads the endpoint with concurrent request values of $2^x$ for $x\ge 0$ and stops once the endpoint produces an error, most often a SageMaker 60s endpoint invocation timeout. Here, we show how to create a custom concurrency probe iterator with a different concurrent request schedule and an additional stop criterion that ends the probe when latency rises above an undesirable threshold.
***
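The default schedule described above can be pictured with the following standalone sketch. It is an illustration only, not the code of the `benchmarking` package: `power_of_two_schedule`, `probe`, and `fake_load_test` are hypothetical names, and the load test is simulated so that it fails above 40 concurrent requests.

```python
from itertools import count
from typing import Callable, Iterator, List


def power_of_two_schedule() -> Iterator[int]:
    """Yield 1, 2, 4, 8, ... (2**x for x >= 0) without an upper bound."""
    for exponent in count(0):
        yield 2**exponent


def probe(run_load_test: Callable[[int], None]) -> List[int]:
    """Run load tests along the schedule until one raises; return the values that succeeded."""
    completed = []
    for concurrent_requests in power_of_two_schedule():
        try:
            run_load_test(concurrent_requests)
        except Exception:
            break  # the first error ends the probe
        completed.append(concurrent_requests)
    return completed


def fake_load_test(concurrent_requests: int) -> None:
    """Hypothetical stand-in for a real load test: fails above 40 concurrent requests."""
    if concurrent_requests > 40:
        raise TimeoutError("simulated invocation timeout")


completed = probe(fake_load_test)
print(completed)
```

With this simulated limit, the probe completes the values 1 through 32 and stops when 64 concurrent requests fail.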

In [8]:
from benchmarking.concurrency_probe import ConcurrentProbeIteratorBase


class CustomConcurrencyProbeIterator(ConcurrentProbeIteratorBase):
    """A custom concurrency probe iterator to explore concurrent request multiples with max latency stop criteria."""

    def __iter__(self):
        self.concurrent_requests = 1
        self.increment_value = 10
        self.max_latency_per_token_ms = 100.0
        return self

    def __next__(self) -> int:
        # Stop if an exception was recorded for the previous load test.
        if self.exception is not None:
            e = self.exception
            self.stop_reason = "".join([type(e).__name__, f": {e}" if str(e) else ""])
            raise StopIteration

        # No result recorded yet: start with the initial concurrent request value.
        if self.result is None:
            return self.concurrent_requests

        # Stop once the last p90 latency per token exceeds the configured threshold.
        last_latency_per_token_ms = self.result["LatencyPerToken"]["p90"]
        if last_latency_per_token_ms > self.max_latency_per_token_ms:
            self.stop_reason = (
                f"Last p90 latency = {last_latency_per_token_ms} > {self.max_latency_per_token_ms}."
            )
            raise StopIteration

        # Otherwise, advance to the next value in the schedule (1, 11, 21, ...).
        self.concurrent_requests = self.concurrent_requests + self.increment_value

        return self.concurrent_requests


def num_invocation_scaler_with_minimum(
    concurrent_requests: int, factor: int = 5, max_invocations: int = 200
) -> int:
    """Scale invocations as `concurrent_requests * factor`, capped at `max_invocations`."""
    return min(concurrent_requests * factor, max_invocations)
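
To see how the invocation hook above behaves along the custom schedule (start at 1, increment by 10), the short sketch below evaluates it for the first few concurrent request values. The function is repeated so that the snippet runs on its own. How `Benchmarker` consumes the returned value is not shown in this notebook; the sketch assumes it is the number of invocations per load test.

```python
def num_invocation_scaler_with_minimum(
    concurrent_requests: int, factor: int = 5, max_invocations: int = 200
) -> int:
    return min(concurrent_requests * factor, max_invocations)


# First five values of the custom schedule: 1, 11, 21, 31, 41.
schedule = [1 + 10 * step for step in range(5)]
invocations = {c: num_invocation_scaler_with_minimum(c) for c in schedule}
print(invocations)
```

Despite the `_with_minimum` suffix, the `min` call acts as an upper bound: from 41 concurrent requests onward, the result stays at 200.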

***
Now create a `Benchmarker` object and run benchmarking for all models. This first creates a `Predictor` for every model concurrently. If `endpoint_name` is specified in the `MODELS` definition or provided in the JSON metrics file of a previous run, the existing endpoint is attached to a `Predictor`. Otherwise, a new endpoint is deployed. Once an endpoint is in service, the load test concurrency probe for that model begins, and the probes for all models run concurrently. For each model, the probe sweeps concurrent request values, performing a load test at each unique value, until an error occurs. These errors may be validation checks (e.g., endpoint is overloaded, input sequence length unsupported, etc.), SageMaker invocation timeouts, or any other model error. Within a model, the concurrency probes for the specified payloads run serially. When the probes have completed, all computed metrics are saved in a JSON file for downstream analysis.

***

In [9]:
# add policy to the sagemaker role: AWSPriceListServiceFullAccess
# request access to model from HF: https://huggingface.co/meta-llama/Llama-2-7b-chat
from huggingface_hub import notebook_login
# notebook_login() # only once

In [None]:
from benchmarking.runner import Benchmarker

benchmarker = Benchmarker(
    payloads=PAYLOADS,
    run_concurrency_probe=True,
    concurrency_probe_concurrent_request_iterator_cls=CustomConcurrencyProbeIterator,
    concurrency_probe_num_invocation_hook=num_invocation_scaler_with_minimum,
)
metrics = benchmarker.run_multiple_models(models=MODELS)

2024-10-01 20:06:30,456 | INFO : (Model 'llama2-7b-jumpstart-g5-12xlarge'): Deploying endpoint bm-llama2-7b-jumpstart-g5-12xlarge-2024-10-01-20-06-30-456 ...


Model 'meta-textgeneration-llama-2-7b-f' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llamaEula.txt for terms of use.


Using model 'meta-textgeneration-llama-2-7b-f' with version '3.2.0'. You can upgrade to version '4.7.0' to get the latest model specifications. Note that models may have different input/output signatures after a major version upgrade.


Using vulnerable JumpStart model 'meta-textgeneration-llama-2-7b-f' and version '3.2.0'.




Using model 'meta-textgeneration-llama-2-7b-f' with wildcard version identifier '3.*'. You can pin to version '3.2.0' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


2024-10-01 20:06:30,890 | INFO : Creating model with name: meta-textgeneration-llama-2-7b-f-2024-10-01-20-06-30-885
2024-10-01 20:06:31,646 | INFO : Creating endpoint-config with name bm-llama2-7b-jumpstart-g5-12xlarge-2024-10-01-20-06-30-456
2024-10-01 20:06:32,070 | INFO : Creating endpoint with name bm-llama2-7b-jumpstart-g5-12xlarge-2024-10-01-20-06-30-456
------------------------

***
Now that benchmarking is complete, let's load the results into a Pandas DataFrame and create a pivot table that shows throughput, p90 latency, and cost to generate one million tokens. Many variations of these metrics are recorded in the DataFrame, so please extract any information relevant to your benchmarking effort.
***

In [6]:
%load_ext autoreload
%autoreload 2
    
import pandas as pd
from benchmarking.runner import Benchmarker


try:
    df = Benchmarker.load_metrics_pandas()
    df_pivot = Benchmarker.create_concurrency_probe_pivot_table(df)

    pd.set_option("display.max_columns", None)
    pd.set_option("display.max_colwidth", 0)
    pd.set_option("display.max_rows", 500)
    display(df_pivot)
except Exception as e:
    print("Exception Error:", e)

Concurrency probe results for model ID `llama2-7b-jumpstart-g5-12xlarge`, instance type `ml.g5.12xlarge`, payload `input_128_output_128`:

| concurrent requests | throughput (tokens/s) | p90 latency (ms/token) | p90 request latency (ms) | cost to generate 1M tokens ($) |
|---:|---:|---:|---:|---:|
| 1 | 59.97 | 17 | 2276 | 32.84 |
| 11 | 521.64 | 21 | 2929 | 3.78 |
| 21 | 758.99 | 30 | 4080 | 2.59 |
| 31 | 900.73 | 35 | 4779 | 2.19 |
| 41 | 978.79 | 45 | 6007 | 2.01 |
| 51 | 1061.39 | 51 | 6838 | 1.86 |
| 61 | 1070.8 | 59 | 7941 | 1.84 |
| 71 | 1130.62 | 68 | 9146 | 1.74 |
| 81 | 1146.87 | 75 | 9905 | 1.72 |
| 91 | 1172.8 | 80 | 10784 | 1.68 |
| 101 | 1214.77 | 88 | 11663 | 1.62 |
| 111 | 1236.57 | 101 | 14064 | 1.59 |
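
The cost column can be related to throughput with a simple calculation: an instance billed per hour generates `throughput * 3600` tokens in that hour. The sketch below applies this to three rows of the results above. Both the helper `cost_per_million_tokens` and the hourly price of 7.09 USD are assumptions for illustration; the notebook does not state the price, nor the exact formula the `benchmarking` package uses.

```python
def cost_per_million_tokens(hourly_price_usd: float, throughput_tokens_per_s: float) -> float:
    """Cost in USD to generate one million tokens at a sustained throughput."""
    tokens_per_hour = throughput_tokens_per_s * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


ASSUMED_HOURLY_PRICE_USD = 7.09  # assumption, not taken from this notebook

# (concurrent requests, throughput in tokens/s) taken from the results above.
for concurrent_requests, throughput in [(1, 59.97), (11, 521.64), (111, 1236.57)]:
    cost = cost_per_million_tokens(ASSUMED_HOURLY_PRICE_USD, throughput)
    print(f"{concurrent_requests:>3} concurrent requests: {cost:.2f} USD per 1M tokens")
```

Under this assumed price, the three computed values round to the corresponding entries in the cost column, which shows why cost per token falls as throughput rises with concurrency.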


***
Finally, please remember to clean up all model and endpoint resources to avoid incurring additional costs after your benchmarking is complete.
***

In [7]:
benchmarker.clean_up_resources()

2024-10-01 19:49:23,332 | INFO : (Model 'llama2-7b-jumpstart-g5-12xlarge'): Cleaning up resources ...
2024-10-01 19:49:23,558 | INFO : Deleting model with name: meta-textgeneration-llama-2-7b-f-2024-10-01-19-30-51-331
2024-10-01 19:49:23,803 | INFO : Deleting endpoint configuration with name: bm-llama2-7b-jumpstart-g5-12xlarge-2024-10-01-19-30-51-187
2024-10-01 19:49:24,082 | INFO : Deleting endpoint with name: bm-llama2-7b-jumpstart-g5-12xlarge-2024-10-01-19-30-51-187


In [None]:
# df.columns

In [None]:
# df[["ModelID","ConcurrentRequests","Latency.p95"]]

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|text-generation-benchmarking|inference-benchmarking-customization-options-example.ipynb)
