# How to deploy BLOOM to Amazon SageMaker using Text-Generation-Inference

This example shows how to deploy open-source LLMs, like [BLOOM](https://huggingface.co/bigscience/bloom), to Amazon SageMaker for inference. We will deploy BLOOM 176B to Amazon SageMaker for real-time inference using Hugging Face's new LLM solution, [text-generation-inference](https://github.com/huggingface/text-generation-inference).

The example covers:
1. Setup development environment
2. Create `HuggingFace` model with TGI container
3. Deploy BLOOM to Amazon SageMaker
4. Run inference on BLOOM with different parameters
5. Run token streaming on BLOOM
6. Clean up

## What is Text Generation Inference?

[Text Generation Inference](https://github.com/huggingface/text-generation-inference) is a library built by Hugging Face to offer an end-to-end optimized solution to run inference on open source LLMs, already powering Hugging Face services running at scale such as the Hugging Face Inference API for BLOOM, GPT-NeoX, SantaCoder, and many more LLMs. In addition, Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative. \
Text Generation Inference implements optimizations for all supported model architectures, including:
* Tensor parallelism and custom CUDA kernels
* Quantization
* Dynamic batching of incoming requests for increased total throughput 
* Accelerated weight loading (start-up time) with safetensors
* Logits warpers (temperature scaling, top-k, repetition penalty, ...)
* Stop sequences, Log probabilities
* Token streaming using Server-Sent Events (SSE)

Officially supported model architectures are currently: 
* [BLOOM](https://huggingface.co/bigscience/bloom) / [BLOOMZ](https://huggingface.co/bigscience/bloomz)
* [MT0-XXL](https://huggingface.co/bigscience/mt0-xxl)
* [Galactica](https://huggingface.co/facebook/galactica-120b)
* [SantaCoder](https://huggingface.co/bigcode/santacoder)
* [GPT-NeoX 20B](https://huggingface.co/EleutherAI/gpt-neox-20b) (joi, pythia, lotus, rosey, chip)
* [FLAN-T5-XXL](https://huggingface.co/google/flan-t5-xxl) (T5-11B)

## 1. Setup development environment

We are going to use the `sagemaker` Python SDK to deploy BLOOM to Amazon SageMaker. Make sure you have an AWS account configured and the `sagemaker` Python SDK installed.

In [1]:
!pip install sagemaker --upgrade --quiet


If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).


In [19]:
import os 

os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'
os.environ["AWS_PROFILE"] = "hf-sm"

In [20]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")


Couldn't call 'get_role' to get Role ARN from role name philippschmid to get Role path.


sagemaker role arn: arn:aws:iam::558105141721:role/sagemaker_execution_role
sagemaker session region: us-east-1



## 2. Create `HuggingFace` model with TGI container

As of today, the text-generation-inference container is not yet available natively inside `sagemaker`. We will use the `HuggingFaceModel` model class with a custom `image_uri` pointing to the registry image of the text-generation-inference container. The container is published as a package in the GitHub repository. You can find more information about the container [here](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference).

To make it easier to use with SageMaker, we uploaded a version of the container to a public ECR repository. If you want to migrate the container yourself, you can use the `create_ecr_container.sh` script we created to migrate the container to ECR.
_Note: make sure you have permissions to create ECR repositories and that Docker is running._
```python
image_uri = "ghcr.io/huggingface/text-generation-inference:sagemaker-sha-4b5e36d"
account_id = sess.account_id()
region = sess.boto_region_name

!chmod +x create_ecr_container.sh
!./create_ecr_container.sh {image_uri} {account_id} {region}

image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-text-generation-inference:latest"
```
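
The last line of the snippet above builds the image URI following the ECR naming scheme. As a minimal sketch of that scheme, the helpers below assemble such a URI and check whether a given URI looks like an ECR image. The names `ecr_image_uri` and `is_ecr_uri` are hypothetical and not part of the `sagemaker` SDK, the account ID is a dummy value, and the pattern only covers the standard `amazonaws.com` domain.

```python
import re

# Hypothetical helpers for illustration; not part of the sagemaker SDK.
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Assemble an image URI in the ECR naming scheme."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def is_ecr_uri(image_uri: str) -> bool:
    """Return True if the URI matches the ECR naming scheme."""
    return ECR_URI_PATTERN.match(image_uri) is not None


# Dummy account ID, for illustration only
example_uri = ecr_image_uri("123456789012", "us-east-1", "sagemaker-text-generation-inference")
print(example_uri)
print(is_ecr_uri(example_uri))
print(is_ecr_uri("ghcr.io/huggingface/text-generation-inference:sagemaker-sha-4b5e36d"))
```

A check like this can be run before creating the model to catch a registry URI that still points to `ghcr.io`.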




In [34]:
# text-generation-inference container image URI
# replace the empty string with the ECR URI of the container image
image_uri=""

Pulling base image ghcr.io/huggingface/text-generation-inference:sagemaker-sha-4b5e36d...
Logging in to Amazon ECR...
Login Succeeded

Logging in with your password grants your terminal complete access to your account. 
For better security, log in with a limited-privilege personal access token. Learn more at https://docs.docker.com/go/access-tokens/
Creating ECR repository sagemaker-text-generation-inference...

An error occurred (RepositoryNotFoundException) when calling the DescribeRepositories operation: The repository with name 'sagemaker-text-generation-inference' does not exist in the registry with id '558105141721'
{
    "repository": {
        "repositoryArn": "arn:aws:ecr:us-east-1:558105141721:repository/sagemaker-text-generation-inference",
        "registryId": "558105141721",
        "repositoryName": "sagemaker-text-generation-inference",
        "repositoryUri": "558105141721.dkr.ecr.us-east-1.amazonaws.com/sagemaker-text-generation-inference",
        "createdAt": "2023

To deploy BLOOM to Amazon SageMaker, we need to create a `HuggingFaceModel` model class and define our endpoint configuration, including the `hf_model_id`, `instance_type`, etc. We will use an `ml.p4d.24xlarge` instance type with quantization enabled to deploy BLOOM.

In [25]:
from sagemaker.huggingface import HuggingFaceModel

# Define model and endpoint configuration parameters
hf_model_id = "bigscience/bloom" # model id from huggingface.co/models
use_quantized_model = True # whether to use quantization or not
instance_type = "ml.p4d.24xlarge" # instance type to use for deployment
number_of_gpu = 8 # number of gpus to use for inference and tensor parallelism
volume_size = 400 # Adds EBS volume to the instance with the given size in GB
health_check_timeout = 600 # Increase the timeout for the health check to 10 minutes

# create HuggingFaceModel with the image uri
bloom_model = HuggingFaceModel(
  role=role,
  image_uri=image_uri,
  env={
    'HF_MODEL_ID': hf_model_id,
    'HF_MODEL_QUANTIZE': str(use_quantized_model),
    'SM_NUM_GPUS': str(number_of_gpu),
  }
)  
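
The choice of `ml.p4d.24xlarge` with quantization enabled can be motivated with a rough, weights-only memory estimate. The sketch below assumes 176 billion parameters, 2 bytes per parameter for 16-bit weights and 1 byte for int8, and 8 A100 GPUs with 40 GB each on the instance. It ignores activations, the attention cache, and framework overhead, so real memory use is higher than these numbers.

```python
# Back-of-the-envelope estimate of the GPU memory needed for the model weights alone.
NUM_PARAMETERS = 176e9                       # BLOOM has roughly 176 billion parameters
BYTES_PER_PARAMETER = {"16-bit": 2, "int8": 1}
GPU_MEMORY_GB = 40                           # assumption: A100 40 GB GPUs
NUM_GPUS = 8                                 # matches `number_of_gpu` above


def weight_memory_gb(num_parameters: float, precision: str) -> float:
    """Memory for the weights in GB (1 GB = 1e9 bytes)."""
    return num_parameters * BYTES_PER_PARAMETER[precision] / 1e9


total_gpu_memory_gb = GPU_MEMORY_GB * NUM_GPUS
for precision in ("16-bit", "int8"):
    needed_gb = weight_memory_gb(NUM_PARAMETERS, precision)
    fits = needed_gb < total_gpu_memory_gb
    print(f"{precision}: ~{needed_gb:.0f} GB of weights, fits in {total_gpu_memory_gb} GB: {fits}")
```

Under these assumptions the 16-bit weights (about 352 GB) exceed the 320 GB of total GPU memory, while the int8 weights (about 176 GB) leave headroom for inference.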

## 3. Deploy BLOOM to Amazon SageMaker

After creating the `HuggingFaceModel`, we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model on the `ml.p4d.24xlarge` instance type, which is required to run BLOOM 176B using int8 quantization. You can find more information about the instance types [here](https://aws.amazon.com/sagemaker/pricing/instance-types/).

In [26]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
predictor = bloom_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  volume_size=volume_size, 
  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
)


ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Using non-ECR image "ghcr.io/huggingface/text-generation-inference:sagemaker-sha-4b5e36d" without Vpc repository access mode is not supported.

SageMaker will now create our endpoint and deploy the model to it. This can take 15-30 minutes.
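
By default, `deploy` blocks until the endpoint is ready. If the deployment is started with `wait=False` instead, the endpoint status can be polled separately. The following is a minimal sketch of such a polling loop; `wait_for_endpoint` is a hypothetical helper, and the `describe` callable is assumed to return a dictionary with an `EndpointStatus` key, as the `describe_endpoint` call of the boto3 SageMaker client does. The demonstration at the bottom uses fake responses so it runs without AWS access.

```python
import time
from typing import Callable


def wait_for_endpoint(
    describe: Callable[[], dict],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 45 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `describe()` until the endpoint is in service, fails, or the timeout elapses."""
    waited = 0.0
    while True:
        status = describe()["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("endpoint creation failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"endpoint still in status {status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage against a real endpoint (requires AWS credentials), assuming `predictor` from above:
#   sm = boto3.client("sagemaker")
#   wait_for_endpoint(lambda: sm.describe_endpoint(EndpointName=predictor.endpoint_name))

# Offline demonstration with a fake sequence of responses
fake_responses = iter(
    [{"EndpointStatus": "Creating"}, {"EndpointStatus": "Creating"}, {"EndpointStatus": "InService"}]
)
final_status = wait_for_endpoint(lambda: next(fake_responses), poll_seconds=1.0, sleep=lambda s: None)
print(final_status)
```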

## 4. Run inference on BLOOM with different parameters

After our endpoint is deployed, we can run inference on it. We will use the `predict` method from the `predictor` to run inference on our endpoint. We can run inference with different parameters to influence the generation. Parameters are defined in the `parameters` attribute of the payload. As of today, the text-generation-inference container supports the following parameters:
* `temperature`: Controls randomness in the model. Lower values will make the model more deterministic and higher values will make the model more random. Default value is 1.0.
* `max_new_tokens`: The maximum number of tokens to generate. Default value is 20, max value is 512.
* `repetition_penalty`: Controls the likelihood of repetition.
* `seed`: The seed to use for random generation.
* `stop`: A list of stop sequences. The generation will stop as soon as one of them is generated.
* `top_k`: The number of highest probability vocabulary tokens to keep for top-k filtering. Default value is 0, which disables top-k filtering.
* `top_p`: The cumulative probability threshold for the highest probability vocabulary tokens to keep for nucleus sampling.
* `do_sample`: Whether or not to use sampling; use greedy decoding otherwise. Default value is True.

You can find the OpenAPI specification of the text-generation-inference container [here](https://huggingface.github.io/text-generation-inference/).
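
To build some intuition for how `temperature`, `top_k`, and `top_p` reshape the next-token distribution, here is a small pure-Python sketch on toy logits. It is an illustration of the general technique (temperature scaling, then top-k filtering, then nucleus filtering), not the container's actual implementation; the function name `filtered_distribution` and the logits are made up for this example.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling: values below 1.0 sharpen the distribution, above 1.0 flatten it
    probs = softmax([x / temperature for x in logits])
    # Sort token indices from most to least likely
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most likely tokens (0 disables the filter)
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)
    # Top-p: keep the smallest prefix whose renormalized cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]
print(filtered_distribution(toy_logits))                              # all four tokens
print(filtered_distribution(toy_logits, temperature=0.7, top_k=2))    # two most likely tokens
print(filtered_distribution(toy_logits, top_p=0.5))                   # nucleus of the distribution
```

Sampling then draws the next token from the remaining, renormalized probabilities.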

In [None]:
predictor.predict({
	"inputs": "Can you please let us know more details about your "
})

Next, we run inference with custom parameters to influence the generation. Parameters are passed in the `parameters` attribute of the payload.

In [None]:
# define payload
payload = {
  "inputs": "Can you please let us know more details about your ",
  "parameters": {
    "temperature": 0.7,
    "stop": ["."],
    "top_k": 50,
  }
}

# send request to endpoint
response = predictor.predict(payload)

print(response)
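
Typos in parameter names are easy to make when payloads are written by hand. The following sketch builds the payload through a small helper that rejects unknown parameter names and enforces the `max_new_tokens` limit listed above. `build_payload` is a hypothetical helper for this example, not part of the `sagemaker` SDK or the container.

```python
# Parameter names taken from the list of supported parameters above
ALLOWED_PARAMETERS = {
    "temperature", "max_new_tokens", "repetition_penalty", "seed",
    "stop", "top_k", "top_p", "do_sample",
}
MAX_NEW_TOKENS_LIMIT = 512


def build_payload(inputs: str, **parameters) -> dict:
    """Build a request payload, failing early on unknown or out-of-range parameters."""
    unknown = set(parameters) - ALLOWED_PARAMETERS
    if unknown:
        raise ValueError(f"unknown generation parameters: {sorted(unknown)}")
    if parameters.get("max_new_tokens", 0) > MAX_NEW_TOKENS_LIMIT:
        raise ValueError(f"max_new_tokens must be at most {MAX_NEW_TOKENS_LIMIT}")
    payload = {"inputs": inputs}
    if parameters:
        payload["parameters"] = parameters
    return payload


example_payload = build_payload(
    "Can you please let us know more details about your ",
    temperature=0.7,
    stop=["."],
    top_k=50,
)
print(example_payload)
```

The resulting dictionary can be passed to `predictor.predict` in the same way as a handwritten payload.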

## 5. Run token streaming on BLOOM

Text Generation Inference supports token streaming using Server-Sent Events (SSE). This means that the text-generation-inference container will stream the generated tokens back to the client, while the generation is still running. This is useful for long generation tasks where the client wants to see the generation in real-time and gives a better user experience. 

To use token streaming, we need to pass the `stream` parameter in our payload and, in Python, use the `sseclient-py` library to read the stream. We cannot use the `predict` method of the `predictor` for this. Instead, we need to use the `requests` library to send the request to the endpoint with a manually created AWS Signature Version 4. You can find more information about AWS Signature Version 4 [here](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html).
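
For orientation, the core of Signature Version 4 is a signing key derived from the secret access key through a chain of HMAC-SHA256 operations over the date, region, and service name. The sketch below shows only that derivation step, with dummy credential values. In practice `botocore` handles this together with the canonical request and the string to sign, so this is an illustration rather than something to use in place of the library.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """HMAC-SHA256 of a text message with a binary key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key; `date_stamp` is the UTC date as YYYYMMDD."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


# Dummy values for illustration only; never hard-code real credentials
signing_key = derive_signing_key("dummy-secret-key", "20230101", "us-east-1", "sagemaker")
print(len(signing_key))
```

Because the date, region, and service are baked into the key, a signature created for one region or service is not valid for another.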

In [29]:
!pip install sseclient-py --quiet

Now we will run inference with token streaming. We will use the `sseclient-py` library to read the stream and print the generated tokens. 

In [None]:
import json

import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest


# https://gist.github.com/marcogrcr/6f0645b20847be4ef9cd6742427fc97b
# https://github.com/andrewjroth/requests-auth-aws-sigv4
# https://github.com/boto/botocore/issues/1784#issuecomment-659132830
session = boto3.Session()
region = session.region_name
creds = session.get_credentials().get_frozen_credentials()

def signed_request(method, url, data=None, params=None, headers=None):
    """Sign an HTTP request with AWS Signature Version 4 and send it with `requests`."""
    request = AWSRequest(method=method, url=url, data=data, params=params, headers=headers)
    # requests to the SageMaker runtime are signed with the service name "sagemaker"
    SigV4Auth(creds, "sagemaker", region).add_auth(request)
    # send the same params that were signed, together with the signed headers
    return requests.request(method=method, url=url, params=params, headers=dict(request.headers), data=data)

def main():
    # invocation URL of the endpoint deployed above
    url = f"https://runtime.sagemaker.{region}.amazonaws.com/endpoints/{predictor.endpoint_name}/invocations"
    # serialize the body ourselves so the signed payload matches what is sent
    data = json.dumps({"inputs": "Can you please let us know more details about your "})
    headers = {"Content-Type": "application/json"}
    response = signed_request(method="POST", url=url, data=data, headers=headers)
    print(response.text)

In [17]:
import json

import requests
import sseclient

prompt = "My name is Olivier and I"

# stream tokens from the `/generate_stream` route of a text-generation-inference server
r = requests.post("https://flan-t5-xxl.ngrok.io/generate_stream", stream=True, json={"inputs": prompt})

sse_client = sseclient.SSEClient(r)

# each event carries one generated token as JSON
for event in sse_client.events():
    token = json.loads(event.data)["token"]["text"]
    print(token, end=" ")

am  a French student . I am  a fan of the French football team , FC Nantes . 
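
To make the wire format less opaque, here is a minimal sketch of what reading a Server-Sent Events stream involves: events are separated by blank lines, and the payload of each event is carried in `data:` lines. The parser below handles only the `data` field and ignores everything else, so it is not a replacement for `sseclient-py`. The sample lines are made up and mirror the `{"token": {"text": ...}}` shape read in the cell above.

```python
import json


def iter_sse_data(lines):
    """Yield the data payload of each complete SSE event from an iterable of text lines."""
    buffer = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "":
            # a blank line terminates the current event
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # a single leading space after the colon is not part of the value
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        # other fields (event:, id:, retry:) and comment lines are ignored in this sketch


# Made-up sample stream for illustration
sample_stream = [
    'data: {"token": {"text": "am"}}',
    "",
    'data: {"token": {"text": " a"}}',
    "",
    'data: {"token": {"text": " student"}}',
    "",
]
tokens = [json.loads(data)["token"]["text"] for data in iter_sse_data(sample_stream)]
print("".join(tokens))
```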

## 6. Clean up

To clean up, we can delete the model and endpoint.


In [None]:

predictor.delete_model()
predictor.delete_endpoint()