# Up to 3x higher Throughput for Llama using new Amazon SageMaker multi-replica endpoints

One of the key announcements at this year's re:Invent (2023) was the new Hardware Requirements object for Amazon SageMaker endpoints. This provides granular control over the compute resources for each model replica, including minimum CPU, GPU, memory, and number of replicas. This allows you to optimize your model's throughput and cost by matching the compute resources to the model's requirements. Previously it was not possible to deploy multiple replicas of a model on a single endpoint, which limited the throughput of models that were not compute bound, e.g. open LLMs like Llama 13B on p4d.24xlarge instances. 

In this post, we show how to use the new hardware requirements with the `sagemaker` SDK and its `ResourceRequirements` object to optimize the deployment of Llama 13B for maximum throughput and cost performance on Amazon SageMaker on a `p4d.24xlarge` instance. The `p4d.24xlarge` instance has 8x NVIDIA A100 40GB GPUs, which allows us to deploy 8 replicas of Llama 13B on a single instance. You can also use this example to deploy other open LLMs like Mistral, T5 or StarCoder. Additionally, it is possible to deploy multiple models on a single instance, e.g. 4x Llama 13B and 4x Mistral 7B. Check out the amazing [blog post from Antje for this](https://aws.amazon.com/de/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/).

We are going to use the Hugging Face LLM DLC, a new purpose-built inference container to easily deploy LLMs in a secure and managed environment. The DLC is powered by [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference), a scalable, optimized solution for deploying and serving Large Language Models (LLMs). The blog post also includes hardware requirements for the different model sizes.

In this blog post, we will cover how to:
1. [Setup development environment](#1-setup-development-environment)
2. [Retrieve the new Hugging Face LLM DLC](#2-retrieve-the-new-hugging-face-llm-dlc)
3. [Configure Hardware requirements per replica](#3-configure-hardware-requirements-per-replica)
4. [Deploy and Test Llama 2 on Amazon SageMaker](#4-deploy-llama-2-to-amazon-sagemaker)
5. [Run a simple performance benchmark](#5-benchmark-multi-replica-endpoint)

Let's get started!

## 1. Setup development environment

We are going to use the `sagemaker` python SDK to deploy Llama 2 to Amazon SageMaker. We need to make sure to have an AWS account configured and the `sagemaker` python SDK installed. 

In [None]:
!pip install "sagemaker>=2.199.0" "transformers==4.35.2" --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).


In [1]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")


sagemaker role arn: arn:aws:iam::558105141721:role/sagemaker_execution_role
sagemaker session region: us-east-1


## 2. Retrieve the new Hugging Face LLM DLC

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our `HuggingFaceModel` model class with an `image_uri` pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the `get_huggingface_llm_image_uri` method provided by the `sagemaker` SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified `backend`, `session`, `region`, and `version`. You can find the available versions [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-text-generation-inference-containers).


In [2]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="1.1.0"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04


## 3. Configure Hardware requirements per replica

Llama 2 comes in 3 different sizes: 7B, 13B and 70B parameters. The hardware requirements vary based on the model size deployed to SageMaker. Below is an example configuration for Llama 13B. In addition, the table gives a high-level overview of the hardware requirements for the different model sizes. To keep it simple, we only looked at the `p4d.24xlarge` instance type and AWQ/GPTQ quantization.

| Model                                                              | Instance Type       | Quantization | # replica |
|--------------------------------------------------------------------|---------------------|--------------|-----------|
| [Llama 7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)   | `(ml.)p4d.24xlarge` | `-`          | 8         |
| [Llama 7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)   | `(ml.)p4d.24xlarge` | `GPTQ/AWQ`   | 8         |
| [Llama 13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) | `(ml.)p4d.24xlarge` | `-`          | 8         |
| [Llama 13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) | `(ml.)p4d.24xlarge` | `GPTQ/AWQ`   | 8         |
| [Llama 70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | `(ml.)p4d.24xlarge` | `-`          | 2         |
| [Llama 70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | `(ml.)p4d.24xlarge` | `GPTQ/AWQ`   | 4         |


_We didn't test the configurations yet. If you run into errors please let me know and I will update the blog post._
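
As a rough sanity check for the table above, the memory needed for the model weights alone can be estimated from the parameter count and the bits per parameter. The sketch below is a back-of-the-envelope lower bound, not how the table was derived. The helper names are hypothetical, 4 bits per parameter is an assumption for GPTQ/AWQ checkpoints, and the KV cache and activations (which need additional GPU memory) are ignored.

```python
import math


def weight_memory_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def min_gpus_for_weights(
    params_billion: float, bits_per_param: int = 16, gpu_memory_gb: float = 40.0
) -> int:
    """Lower bound on GPUs per replica: weights only, no KV cache or activations."""
    needed_gb = weight_memory_gb(params_billion, bits_per_param)
    return max(1, math.ceil(needed_gb / gpu_memory_gb))


# Assumption: 16 bit for unquantized weights, 4 bit for GPTQ/AWQ checkpoints.
for size in (7, 13, 70):
    for bits in (16, 4):
        print(
            f"Llama {size}B @ {bits} bit: ~{weight_memory_gb(size, bits):.1f} GB weights, "
            f"at least {min_gpus_for_weights(size, bits)} x A100 40GB"
        )
```

For Llama 70B in 16 bit this gives at least 4 GPUs per replica, so at most 2 replicas fit on the 8 GPUs, which is consistent with the table. The 4-bit 70B weights would nominally fit on one GPU, but with very little room left for the KV cache, so planning for more than the lower bound is advisable.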


In [3]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

llama2_13b_resource_config = ResourceRequirements(
    requests = {
        "copies": 8, # Number of replicas
        "num_accelerators": 1, # Number of GPUs per replica
        "num_cpus": 10,  # CPU cores per replica: below 96 // num_replica to leave some for management
        "memory": 100 * 1024,  # Minimum memory per replica in MB: below 1152 GB // num_replica to leave some for management
    },
)
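
The per-replica values above follow from splitting the instance resources across the replicas while leaving some headroom. A minimal sketch of that arithmetic is shown below, assuming the `p4d.24xlarge` totals of 96 vCPUs, 1152 GiB memory and 8 GPUs. The function name and the 15% reserve are hypothetical choices for illustration, not values prescribed by SageMaker.

```python
def per_replica_requests(
    copies: int,
    gpus_per_copy: int = 1,
    total_gpus: int = 8,            # p4d.24xlarge: 8 GPUs
    total_cpus: int = 96,           # p4d.24xlarge: 96 vCPUs
    total_memory_gib: int = 1152,   # p4d.24xlarge: 1152 GiB
    reserve_percent: int = 15,      # assumption: headroom kept for management processes
) -> dict:
    """Split instance resources evenly across replicas, keeping a reserve."""
    if copies * gpus_per_copy > total_gpus:
        raise ValueError(
            f"{copies} copies x {gpus_per_copy} GPUs exceed the {total_gpus} GPUs available"
        )
    usable_cpus = total_cpus * (100 - reserve_percent) // 100
    usable_memory_mb = total_memory_gib * 1024 * (100 - reserve_percent) // 100
    return {
        "copies": copies,
        "num_accelerators": gpus_per_copy,
        "num_cpus": usable_cpus // copies,
        "memory": usable_memory_mb // copies,  # in MB
    }


requests = per_replica_requests(copies=8)
print(requests)
```

The resulting dictionary has the same keys as the `requests` argument used above. With a 15% reserve it yields 10 CPU cores per replica, matching the hand-picked value, while the memory value in the configuration above (`100 * 1024` MB) is more conservative.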

## 4. Deploy Llama 2 to Amazon SageMaker

To deploy [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/models?other=llama-2) to Amazon SageMaker we create a `HuggingFaceModel` model class and define our endpoint configuration including the `hf_model_id`, `instance_type` and then add our `ResourceRequirements` object to the `deploy` method. 

_Note: Llama 2 is a gated model on Hugging Face. To use the original weights, you need to request access through the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads) and accept the license terms and acceptable use policy. Requests will be processed in 1-2 days. As an alternative, we use the ungated weights from `NousResearch/Llama-2-13b-chat-hf`._

In [4]:
import json
import uuid
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.enums import EndpointType

# sagemaker config
instance_type = "ml.p4d.24xlarge"
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "NousResearch/Llama-2-13b-chat-hf", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(1), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(2048),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(4096),  # Max length of the generation (including input text)
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(16384),  # Limits the number of tokens that can be processed in parallel during the generation
  # 'HUGGING_FACE_HUB_TOKEN': "<REPLACE WITH YOUR TOKEN>" # uncomment when using a private model
  # 'HF_MODEL_QUANTIZE': "gptq", # uncomment when using a quantized checkpoint; the value should match its quantization (e.g. "gptq" for GPTQ)

}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config,
)
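
The three token limits in the configuration depend on each other: as far as we understand TGI's launcher, `MAX_INPUT_LENGTH` has to be smaller than `MAX_TOTAL_TOKENS`, and `MAX_BATCH_TOTAL_TOKENS` has to be at least `MAX_TOTAL_TOKENS`. The following sketch checks these relations locally before deploying. `check_token_limits` is a hypothetical helper and not part of the `sagemaker` SDK or TGI.

```python
import json


def check_token_limits(env: dict) -> None:
    """Raise ValueError if the token limits are inconsistent (assumed TGI constraints)."""
    max_input = int(json.loads(env["MAX_INPUT_LENGTH"]))
    max_total = int(json.loads(env["MAX_TOTAL_TOKENS"]))
    max_batch = int(json.loads(env["MAX_BATCH_TOTAL_TOKENS"]))
    if max_input >= max_total:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_batch < max_total:
        raise ValueError("MAX_BATCH_TOTAL_TOKENS must be at least MAX_TOTAL_TOKENS")


# Same values as in the endpoint configuration above
example_env = {
    "MAX_INPUT_LENGTH": json.dumps(2048),
    "MAX_TOTAL_TOKENS": json.dumps(4096),
    "MAX_BATCH_TOTAL_TOKENS": json.dumps(16384),
}
check_token_limits(example_env)

# Tokens left for generation when the input uses the full MAX_INPUT_LENGTH
print("new tokens available for a maximum-length input:", 4096 - 2048)
```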



After we have created the `HuggingFaceModel`, we can deploy it to Amazon SageMaker using the `deploy` method together with the `ResourceRequirements` object.

In [5]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1, # number of instances
  instance_type=instance_type, # base instance type
  resources=llama2_13b_resource_config, # resource config for multi-replica
  container_startup_health_check_timeout=health_check_timeout, # 5 minutes (300s) to be able to load the model
  endpoint_name=f"llama-2-13b-chat-{str(uuid.uuid4())}", # name needs to be unique
  endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED # needed to use resource config
  
)


-----!----------------------------------------!

SageMaker will now create our endpoint and deploy the model to it. This can take 15-25 minutes, since the replicas are deployed one after another. After the endpoint is created, we can use the `predict` method to send a request to our endpoint. To make it easier, we will use the `apply_chat_template` method of the transformers tokenizer. This allows us to send OpenAI-like conversations to our model.

In [16]:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(config["HF_MODEL_ID"])

# OpenAI like conversational messages
messages = [
  {"role": "system", "content": "You are an helpful AWS Expert Assistant. Respond only with 1-2 sentences."},
  {"role": "user", "content": "What is Amazon SageMaker?"},
]

# generation parameters
parameters = {
    "do_sample" : True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 50,
    "repetition_penalty": 1.03,
    "return_full_text": False,
}

res = llm.predict(
  {
    "inputs": tok.apply_chat_template(messages, tokenize=False),
    "parameters": parameters
   })

print(res[0]['generated_text'].strip())

Sure, I'd be happy to help! Amazon SageMaker is a fully managed service that provides a range of machine learning (ML) capabilities, including data preparation, training, and deployment, to help you build, train,
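
To make it clearer what is actually sent to the endpoint, the sketch below builds the Llama 2 chat prompt by hand for the simple case of one optional system message plus one user message. It is an approximation of what `apply_chat_template` is expected to produce for Llama 2 chat models, not the tokenizer's implementation. `build_llama2_prompt` is a hypothetical helper, and multi-turn conversations are not handled.

```python
def build_llama2_prompt(messages: list, bos_token: str = "<s>") -> str:
    """Format one optional system message plus one user message in Llama 2 chat style."""
    system_block = ""
    if messages and messages[0]["role"] == "system":
        system_block = f"<<SYS>>\n{messages[0]['content']}\n<</SYS>>\n\n"
        messages = messages[1:]
    if len(messages) != 1 or messages[0]["role"] != "user":
        raise ValueError("this sketch only supports a single user message")
    user_content = messages[0]["content"].strip()
    return f"{bos_token}[INST] {system_block}{user_content} [/INST]"


example_messages = [
    {"role": "system", "content": "Respond only with 1-2 sentences."},
    {"role": "user", "content": "What is Amazon SageMaker?"},
]
print(build_llama2_prompt(example_messages))
```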


In [17]:
import threading
import time

# simple load test: 20 threads send 1000 requests in total
number_of_threads = 20
number_of_requests = 1000 // number_of_threads
print(f"number of threads: {number_of_threads}")
print(f"number of requests per thread: {number_of_requests}")

def send_requests():
    for _ in range(number_of_requests):
        # input counted at https://huggingface.co/spaces/Xenova/the-tokenizer-playground for 100 tokens
        llm.predict({
            "inputs": tok.apply_chat_template(messages, tokenize=False),
            "parameters": parameters,
        })

# create multiple threads
threads = [threading.Thread(target=send_requests) for _ in range(number_of_threads)]
# start all threads
start = time.time()
for t in threads:
    t.start()
# wait for all threads to finish
for t in threads:
    t.join()
print(f"total time: {round(time.time() - start)} seconds")

number of threads: 20
number of requests per thread: 50
total time: 76 seconds
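
The total time can be turned into a rough throughput figure. The sketch below does that arithmetic for the run above (1000 requests in 76 seconds, `max_new_tokens` of 50). The token rate is only an upper bound, since it assumes every request generates the full 50 new tokens, and `load_test_summary` is a hypothetical helper.

```python
def load_test_summary(total_requests: int, total_seconds: float, max_new_tokens: int) -> dict:
    """Requests per second and an upper bound on generated tokens per second."""
    requests_per_second = total_requests / total_seconds
    return {
        "requests_per_second": round(requests_per_second, 2),
        # upper bound: assumes each request generates max_new_tokens tokens
        "max_tokens_per_second": round(requests_per_second * max_new_tokens, 1),
    }


summary = load_test_summary(total_requests=20 * 50, total_seconds=76, max_new_tokens=50)
print(summary)
```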


## 5. Benchmark multi-replica endpoint

To benchmark our new endpoint, we will use the same code as for the ["Llama 2 on Amazon SageMaker a Benchmark"](https://huggingface.co/blog/llama-sagemaker-benchmark) from [text-generation-inference-tests](https://github.com/philschmid/text-generation-inference-tests/tree/master/sagemaker_llm_container). When running the benchmark back then, it was not possible to deploy multiple replicas of a model on a single endpoint. This limited the throughput of models that were not compute bound, e.g. open LLMs like Llama 13B on p4d.24xlarge instances.

To run the benchmark, we need to clone the [text-generation-inference-tests](https://github.com/philschmid/text-generation-inference-tests) repository and make sure we meet all the `Prerequisites` from the [README](https://github.com/philschmid/text-generation-inference-tests/tree/master/sagemaker_llm_container), including the installation of [k6](https://k6.io/), which is used to run the benchmark.

In [7]:
!git clone https://github.com/philschmid/text-generation-inference-tests.git

Cloning into 'text-generation-inference-tests'...
remote: Enumerating objects: 176, done.
remote: Counting objects: 100% (176/176), done.
remote: Compressing objects: 100% (88/88), done.
remote: Total 176 (delta 98), reused 160 (delta 84), pack-reused 0
Receiving objects: 100% (176/176), 670.64 KiB | 5.12 MiB/s, done.
Resolving deltas: 100% (98/98), done.


Since we already have a deployed endpoint, we can provide the `endpoint_name` and our hardware requirements as `inference_component`. The name of the `inference_component` can currently only be retrieved using `boto3`.

In [8]:
inference_component = llm_model.sagemaker_session.list_inference_components(endpoint_name_equals=llm.endpoint_name).get("InferenceComponents")[0].get("InferenceComponentName")
endpoint_name = llm.endpoint_name

In [14]:
inference_component

'huggingface-pytorch-tgi-inference-2023-12-08-08-1702025443-7ff6'
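
The same lookup can also be done with a plain `boto3` SageMaker client instead of the `sagemaker` session. The sketch below assumes the `list_inference_components` call with its `EndpointNameEquals` filter and takes the client as an argument, so the function itself needs no AWS access. `first_inference_component_name` is a hypothetical helper name.

```python
def first_inference_component_name(sm_client, endpoint_name: str) -> str:
    """Return the name of the first inference component attached to an endpoint."""
    response = sm_client.list_inference_components(EndpointNameEquals=endpoint_name)
    components = response.get("InferenceComponents", [])
    if not components:
        raise LookupError(f"no inference component found for endpoint {endpoint_name!r}")
    return components[0]["InferenceComponentName"]


# Usage (requires boto3 and AWS credentials):
# import boto3
# inference_component = first_inference_component_name(
#     boto3.client("sagemaker"), endpoint_name
# )
```

Raising an error for an empty result avoids the `IndexError` that indexing with `[0]` would otherwise produce when the endpoint has no inference component yet.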

Change the directory to `sagemaker_llm_container` and run the benchmark. The command below will run a load test for 90s with 20 concurrent users (`--vu 20`). The result will be saved to `text-generation-inference-tests/sagemaker_llm_container/{CONFIG_NAME}.json`.

In [12]:
!cd text-generation-inference-tests/sagemaker_llm_container && python benchmark.py \
  --endpoint_name {endpoint_name} \
  --inference_component {inference_component} \
  --model_id {config["HF_MODEL_ID"]} \
  --tp_degree 1 \
  --vu 20

sagemaker role arn: arn:aws:iam::558105141721:role/sagemaker_execution_role
token: None
model id: NousResearch/Llama-2-13b-chat-hf
instance type: None
tp_degree: 1
vu: 20
quantize: None
endpoint_name: llama-2-13b-chat-259c22c9-e25d-4661-94f1-0a1f891d8adb
inference_component: huggingface-pytorch-tgi-inference-2023-12-08-08-1702025443-7ff6

We ran multiple benchmarks for different numbers of concurrent users (20, 50, 75 and 100) and different values of `MAX_BATCH_TOTAL_TOKENS` (8192, 16384). The table below includes the throughput (tokens/second), the median latency (ms/token) and the queue time in ms (the time a request spent in the queue before it was processed).

| Model     | Concurrent Users | Max Batch Total Tokens | Throughput (tokens/second) | Median Latency (ms/token) | Queue Time (ms) |
|-----------|------------------|------------------------|----------------------------|---------------------------|-----------------|
| Llama 13B | 50               | 8192                   | 1419                       | 25.14                     | 404             |
| Llama 13B | 75               | 8192                   | 1561                       | 25.21                     | 844             |
| Llama 13B | 100              | 8192                   | 1634                       | 25.22                     | 1553            |
| Llama 13B | 20               | 16384                  | 757                        | 24.15                     | 21              |
| Llama 13B | 50               | 16384                  | 1400                       | 25.15                     | 403             |
| Llama 13B | 75               | 16384                  | 1540                       | 25.22                     | 919             |
| Llama 13B | 100              | 16384                  | 1622                       | 25.22                     | 1521            |

With 50 to 100 concurrent users, we achieved a throughput between 1400 and 1634 tokens/second with a median latency between 25.14 and 25.22 ms/token, which is fast enough for real-time chat applications. We also see that the queue time increases with the number of concurrent users, from roughly 400ms at 50 users to more than 1500ms at 100 users. This hints that we reached the limit of the instance and should consider scaling out the number of instances.

Comparing those results with the results from the ["Llama 2 on Amazon SageMaker a Benchmark"](https://huggingface.co/blog/llama-sagemaker-benchmark), we see that we were able to increase the throughput by 2-3x (roughly 2.4x), from 668 tokens/second to ~1600 tokens/second, while keeping the latency the same. Further improvements can be achieved by using quantization techniques like GPTQ or AWQ.
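
The improvement can be recomputed directly from the numbers above. The sketch below divides the measured throughput (rows with 8192 max batch total tokens) by the 668 tokens/second baseline from the earlier benchmark and also derives the throughput per replica. The variable names are illustrative, and the comparison assumes the baseline was measured under a comparable load.

```python
baseline_tps = 668   # tokens/second from the earlier single-replica benchmark
num_replicas = 8     # replicas on the p4d.24xlarge in this post

# concurrent users -> tokens/second (MAX_BATCH_TOTAL_TOKENS = 8192)
multi_replica_tps = {50: 1419, 75: 1561, 100: 1634}

speedups = {users: round(tps / baseline_tps, 2) for users, tps in multi_replica_tps.items()}
per_replica_tps = {users: tps / num_replicas for users, tps in multi_replica_tps.items()}

for users in multi_replica_tps:
    print(
        f"{users} users: {speedups[users]}x over baseline, "
        f"{per_replica_tps[users]:.1f} tokens/second per replica"
    )
```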

In [67]:
!rm -rf text-generation-inference-tests

## 6. Clean up

To clean up, we can delete the model, endpoint and inference component for the hardware requirements. 

_Note: If you have issues deleting an endpoint with an attached inference component, see: https://repost.aws/es/questions/QUEiuS2we2TEKe9GUUYm67kQ/error-when-deleting-and-inference-endpoint-in-sagemaker_

In [None]:
# Delete Inference Component
llm_model.sagemaker_session.delete_inference_component(inference_component_name=inference_component)

We have to wait until the component is deleted before we can delete the endpoint. This can take about 2 minutes.
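
Instead of waiting manually, the deletion can be polled. The sketch below is a generic helper that repeatedly calls a `describe` function until it raises, which is assumed here to mean that the component no longer exists. The exact exception raised by SageMaker for a deleted inference component is not checked, and `wait_until_gone` is a hypothetical helper.

```python
import time


def wait_until_gone(describe, timeout_s: float = 300.0, poll_s: float = 10.0,
                    sleep=time.sleep) -> bool:
    """Poll `describe()` until it raises (assumed: resource deleted) or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            describe()
        except Exception:
            # assumption: any error from describe means the resource is gone
            return True
        sleep(poll_s)
    return False


# Usage (requires boto3 and AWS credentials):
# import boto3
# sm = boto3.client("sagemaker")
# wait_until_gone(
#     lambda: sm.describe_inference_component(InferenceComponentName=inference_component)
# )
```

Catching every exception keeps the sketch short. In real code it would be safer to catch only the error that signals a missing component, so that permission or network errors are not mistaken for a successful deletion.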

In [None]:
# Delete Model and Endpoint
llm.delete_model()
llm.delete_endpoint()