# How to deploy BLOOM to Amazon SageMaker using Text-Generation-Inference

This example shows how to deploy open-source LLMs, such as [BLOOM](https://huggingface.co/bigscience/bloom), to Amazon SageMaker for inference. We will deploy BLOOM 176B to Amazon SageMaker for real-time inference using Hugging Face's new LLM solution, [text-generation-inference](https://github.com/huggingface/text-generation-inference).

The example covers:
1. Setup development environment
2. Create `HuggingFace` model with TGI container
3. Deploy BLOOM to Amazon SageMaker
4. Run inference on BLOOM with different parameters
5. Run token streaming on BLOOM
6. Clean up

## What is Text Generation Inference?

[Text Generation Inference](https://github.com/huggingface/text-generation-inference) is a library built by Hugging Face to offer an end-to-end optimized solution to run inference on open source LLMs, already powering Hugging Face services running at scale such as the Hugging Face Inference API for BLOOM, GPT-NeoX, SantaCoder, and many more LLMs. In addition, Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative. \
Text Generation Inference implements optimizations for all supported model architectures, including:
* Tensor Parallelism and custom CUDA kernels
* Quantization
* Dynamic batching of incoming requests for increased total throughput 
* Accelerated weight loading (start-up time) with safetensors
* Logits warpers (temperature scaling, top-k, repetition penalty, ...)
* Stop sequences, Log probabilities
* Token streaming using Server-Sent Events (SSE)

Officially supported model architectures are currently: 
* [BLOOM](https://huggingface.co/bigscience/bloom) / [BLOOMZ](https://huggingface.co/bigscience/bloomz)
* [MT0-XXL](https://huggingface.co/bigscience/mt0-xxl)
* [Galactica](https://huggingface.co/facebook/galactica-120b)
* [SantaCoder](https://huggingface.co/bigcode/santacoder)
* [GPT-NeoX 20B](https://huggingface.co/EleutherAI/gpt-neox-20b) (joi, pythia, lotus, rosey, chip)
* [FLAN-T5-XXL](https://huggingface.co/google/flan-t5-xxl) (T5-11B)

## 1. Setup development environment

We are going to use the `sagemaker` Python SDK to deploy BLOOM to Amazon SageMaker. We need an AWS account configured and the `sagemaker` Python SDK installed.

In [None]:
!pip install sagemaker --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).


In [2]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")


Couldn't call 'get_role' to get Role ARN from role name philippschmid to get Role path.


sagemaker role arn: arn:aws:iam::558105141721:role/sagemaker_execution_role
sagemaker session region: us-east-1


## 2. Create `HuggingFace` model with TGI container

As of today, the text-generation-inference container is not yet natively available in the `sagemaker` SDK. We will use the `HuggingFaceModel` model class with a custom `image_uri` pointing to the registry image of the text-generation-inference container. The container is published as a package in the GitHub repository. You can find more information about the container [here](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference). 

To make it easier to use with SageMaker, we uploaded a version of the container to a public ECR repository. If you want to migrate the container yourself, we created a `create_ecr_container.sh` script that you can use to migrate the container to ECR.
_Note: make sure you have permissions to create ECR repositories and docker running._
```python
image_uri = "ghcr.io/huggingface/text-generation-inference:sagemaker-sha-631c4c8"
account_id = sess.account_id()
region = sess.boto_region_name

!chmod +x create_ecr_container.sh
!./create_ecr_container.sh {image_uri} {account_id} {region}

image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-text-generation-inference:latest"
```



In [20]:
# text-generation-inference container image uri
image_uri="558105141721.dkr.ecr.us-east-1.amazonaws.com/sagemaker-text-generation-inference:latest"

To deploy BLOOM to Amazon SageMaker, we need to create a `HuggingFaceModel` model class and define our endpoint configuration, including the `hf_model_id`, `instance_type`, etc. We will use an `ml.p4d.24xlarge` instance type with quantization enabled to deploy BLOOM.

In [21]:
import json
from sagemaker.huggingface import HuggingFaceModel

# Define Model and Endpoint configuration parameter
hf_model_id = "bigscience/bloom" # model id from huggingface.co/models
use_quantized_model = True # whether to use quantization or not
instance_type = "ml.p4d.24xlarge" # instance type to use for deployment
number_of_gpu = 8 # number of gpus to use for inference and tensor parallelism
health_check_timeout = 900 # Increase the timeout for the health check to 15 minutes for downloading bloom

# create HuggingFaceModel with the image uri
bloom_model = HuggingFaceModel(
  role=role,
  image_uri=image_uri,
  env={
    'HF_MODEL_ID': hf_model_id,
    'HF_MODEL_QUANTIZE': json.dumps(use_quantized_model),
    'SM_NUM_GPUS': json.dumps(number_of_gpu)
  }
)  
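
Container environment variables are passed as strings, which is why the cell above wraps the boolean and integer settings in `json.dumps`. The following minimal sketch only illustrates what those calls produce. It reuses the variable names from the cell above and does not talk to SageMaker.

```python
import json

use_quantized_model = True
number_of_gpu = 8

# json.dumps turns Python values into their JSON string form
env = {
    "HF_MODEL_ID": "bigscience/bloom",
    "HF_MODEL_QUANTIZE": json.dumps(use_quantized_model),  # lowercase JSON boolean
    "SM_NUM_GPUS": json.dumps(number_of_gpu),              # digits as a string
}

# every value is now a plain string
print(env)
```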

After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.p4d.24xlarge` instance type. This instance type is required to run BLOOM 176B using int8 quantization. You can find more information about the instance types [here](https://aws.amazon.com/sagemaker/pricing/instance-types/).
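
A rough, weights-only estimate shows why int8 quantization matters on this instance. The sketch below assumes 1 byte per parameter for int8, 2 bytes for fp16, and 8 A100 GPUs with 40 GB each on an `ml.p4d.24xlarge`. It ignores activations, the KV cache, and framework overhead, so treat the numbers as a lower bound rather than a sizing guide.

```python
# Back-of-the-envelope estimate: memory needed for the weights alone.
PARAMS = 176e9        # BLOOM parameter count (176B)
GPUS = 8              # GPUs on an ml.p4d.24xlarge
GPU_MEMORY_GB = 40    # assumed memory per A100 GPU on p4d

def weights_gb(num_params: float, bytes_per_param: int) -> float:
    """Size of the weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

total_gpu_memory_gb = GPUS * GPU_MEMORY_GB
fp16_gb = weights_gb(PARAMS, 2)
int8_gb = weights_gb(PARAMS, 1)

print(f"GPU memory available: {total_gpu_memory_gb} GB")
print(f"fp16 weights: {fp16_gb:.0f} GB, fits: {fp16_gb < total_gpu_memory_gb}")
print(f"int8 weights: {int8_gb:.0f} GB, fits: {int8_gb < total_gpu_memory_gb}")
```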

In [22]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
predictor = bloom_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=health_check_timeout, # 15 minutes to be able to download and load the model
)


-------------------------!

SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.

## 4. Run inference on BLOOM with different parameters

After our endpoint is deployed, we can run inference on it using the `predict` method of the `predictor`. We will run inference with different parameters to influence the generation. Parameters are defined in the `parameters` attribute of the payload. As of today, the text-generation-inference container supports the following parameters:
* `temperature`: Controls randomness in the model. Lower values will make the model more deterministic and higher values will make the model more random. Default value is 1.0.
* `max_new_tokens`: The maximum number of tokens to generate. Default value is 20, max value is 512.
* `repetition_penalty`: Controls the likelihood of repetition.
* `seed`: The seed to use for random generation.
* `stop`: A list of tokens to stop the generation. The generation will stop when one of the tokens is generated.
* `top_k`: The number of highest probability vocabulary tokens to keep for top-k-filtering. Default value is 0, which disables top-k-filtering.
* `top_p`: The cumulative probability threshold for nucleus sampling. Only the smallest set of highest-probability vocabulary tokens whose cumulative probability reaches `top_p` is kept.
* `do_sample`: Whether or not to use sampling; greedy decoding is used otherwise. Default value is False.

You can find the OpenAPI specification of the text-generation-inference container [here](https://huggingface.github.io/text-generation-inference/).
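
To make the sampling parameters listed above more concrete, here is a small pure-Python sketch of temperature scaling, top-k filtering, and nucleus (top-p) filtering on a toy list of logits. It illustrates the general techniques only and is not how the text-generation-inference container implements them. The function names are made up for this example, and `repetition_penalty` and `stop` are left out.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Return {token_index: renormalized probability} for the surviving tokens."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if top_k > 0:  # top_k == 0 disables top-k filtering
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:  # smallest set reaching the threshold
            break
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, do_sample=True, seed=None):
    """Pick one token index, greedily or by sampling from the filtered set."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    rng = random.Random(seed)
    indices = list(candidates)
    weights = [candidates[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]
greedy = sample(toy_logits, do_sample=False)
nucleus = filter_top_k_top_p(softmax(toy_logits), top_p=0.9)
print("greedy choice:", greedy)
print("tokens kept by top_p=0.9:", sorted(nucleus))
```

With these toy logits, lowering `top_p` or setting a small `top_k` shrinks the candidate set, while a higher `temperature` flattens the distribution before filtering.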

In [32]:
predictor.predict({
	"inputs": "Can you please let us know more details about your"
})

[{'generated_text': 'Can you please let us know more details about your problem? What is the error message you are getting? What is the exact code you are using?'}]
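
Because only the parameters listed above are supported, it can help to check a payload before sending it. The helper below is a hypothetical convenience function, not part of the `sagemaker` SDK or the container. It builds the same `inputs`/`parameters` structure used in this notebook and rejects unknown parameter names.

```python
# Parameter names taken from the list of supported parameters above.
SUPPORTED_PARAMETERS = {
    "temperature", "max_new_tokens", "repetition_penalty", "seed",
    "stop", "top_k", "top_p", "do_sample",
}

def build_payload(prompt, **parameters):
    """Hypothetical helper: assemble a request payload and catch misspelled names."""
    unknown = set(parameters) - SUPPORTED_PARAMETERS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    payload = {"inputs": prompt}
    if parameters:
        payload["parameters"] = parameters
    return payload

example_payload = build_payload("Hello", do_sample=True, top_p=0.9)
print(example_payload)
```

The resulting dictionary can be passed to `predictor.predict` in the same way as the hand-written payloads in this notebook.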

Now we will run inference with different parameters to influence the generation. Parameters are defined in the `parameters` attribute of the payload.

In [125]:
# define payload
prompt="""Do a hello world in different languages:
Python: print("hello world")
R:"""

payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
  }
}

# send request to endpoint
response = predictor.predict(payload)

print(response[0]["generated_text"])

Do a hello world in different languages:
Python: print("hello world")
R: print("Hello world!")
Lisp: (format nil "Hello world!")
Scheme:


## 5. Run token streaming on BLOOM

Text Generation Inference supports token streaming using Server-Sent Events (SSE). This means that the text-generation-inference container will stream the generated tokens back to the client, while the generation is still running. This is useful for long generation tasks where the client wants to see the generation in real-time and gives a better user experience. 
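
For intuition, an SSE stream is plain text: each event consists of `field: value` lines and ends with a blank line. The sketch below is a deliberately minimal parser that handles only `data` fields and comment lines. The sample events are made up and reduced to the `token.text` field that the helper code later in this notebook reads. In practice, a library such as `sseclient-py` does this parsing.

```python
import json

def iter_sse_data(lines):
    """Yield the data payload of each complete event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates the current event
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):  # lines starting with a colon are comments
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):  # a single leading space is not part of the value
            value = value[1:]
        if field == "data":
            buffer.append(value)
    # an event without a terminating blank line is incomplete and is dropped

# Made-up sample stream, simplified to the token text only
sample_stream = [
    'data: {"token": {"text": "Hello"}}\n',
    "\n",
    ": keep-alive comment\n",
    'data: {"token": {"text": " world"}}\n',
    "\n",
]

tokens = [json.loads(data)["token"]["text"] for data in iter_sse_data(sample_stream)]
text = "".join(tokens)
print(text)
```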

To use token streaming, we need to pass the `stream` parameter in our payload and, in Python, use the `sseclient-py` library to read the stream. We cannot use the `predict` method of the `predictor` for this. Instead, we use the `requests` library to send the request to the endpoint, signed with a manually created AWS Signature Version 4. You can find more information about AWS Signature Version 4 [here](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html).
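
As background on what the signing involves, the sketch below shows only one step of Signature Version 4: deriving the signing key through a chain of HMAC-SHA256 operations over the date, region, and service. Building the canonical request and the string to sign is omitted. The secret key and date are placeholders, and the function names are made up for this example. In this notebook, `botocore`'s `SigV4Auth` performs the complete process.

```python
import hashlib
import hmac

def _hmac_sha256(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()

def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key; date_stamp uses the YYYYMMDD format."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")

# Placeholder credentials, for illustration only
signing_key = derive_signing_key("EXAMPLE-SECRET-KEY", "20230101", "us-east-1", "sagemaker")
print(len(signing_key), "byte signing key")
```

Because the key depends on the date, region, and service, a signature created for one region or service is not valid for another.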

In [28]:
!pip install sseclient-py --quiet

Now we will run inference with token streaming. We wrote two helper functions for this. The first, `http_request`, creates an HTTP request signed with AWS Signature Version 4. The second, `stream_request`, uses `http_request` to send the request to the endpoint and the `sseclient-py` library to read the stream and print the generated tokens.

In [123]:
import json
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
import requests
import sseclient

session = boto3.Session()
credentials = session.get_credentials()
creds = credentials.get_frozen_credentials()

# HTTP Request method with AWS SigV4 signing
def http_request(url, data, method="POST", is_aws=True):
    # copy the payload so the caller's dict is not mutated, then set the stream attribute
    body = json.dumps({**data, "stream": True})
    # define headers
    headers = {"Content-Type": "application/json", "Accept": "text/event-stream", "Connection": "keep-alive"}
    # sign request
    if is_aws:
        request = AWSRequest(method=method, url=url, data=body, headers=headers)
        SigV4Auth(creds, "sagemaker", session.region_name).add_auth(request)
        headers = dict(request.headers)
    # send request with the same HTTP method that was signed
    return requests.request(method=method, url=url, headers=headers, data=body, stream=True)

# Stream request for token streaming using SSE
def stream_request(url, data, is_aws=True, split_token=""):
    # send request
    res = http_request(url=url, data=data, is_aws=is_aws)
    # create sse client
    sse_client = sseclient.SSEClient(res)
    # print the prompt from the payload first, then each streamed token
    print(data["inputs"], end="")
    for event in sse_client.events():
        token = json.loads(event.data)["token"]["text"]
        print(token, end=split_token)

Let's test it and stream some tokens from SageMaker.

### _NOTE: SageMaker does not yet appear to support streaming via Server-Sent Events. Unlike a working Hugging Face endpoint, SageMaker waits until the whole request has been processed and does not send chunks back. A second cell below runs the same code against an HF endpoint, where streaming works._

In [124]:
# sagemaker endpoint url
url = f"https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/{predictor.endpoint_name}/invocations"

prompt = "Write a adventures story about a John a middle age farmer: "
request_payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "max_new_tokens": 100,
  }
}

# stream request
# TODO streaming not working with SageMaker
stream_request(url=url, data=request_payload)

Write a adventures story about a John a middle age farmer:  something happened that disturbed him very much and he was depressed. The same day he was called to bring some land owned by the grandfather. But before he go, he found a hidden treasure by chance. But the landowner comes first to find the treasure and tried to shot him. John was left with a seriously injury in the brain.
The landowner was taken to jail. The whole village welcome John as a hero. But this does not satisfy him.  In the night John was talking to himself (

In [122]:
# Example for FLAN hosted on HF, which works
url="https://flan-t5-xxl.ngrok.io/generate_stream"

# stream request
stream_request(url=url, data=request_payload,is_aws=False,split_token=" ")

Write a adventures story about a John a middle age farmer: John is  a middle age farmer who sell s his dairy products in town . One day John decided to pick up  a few goat s that were being sold near his dairy . He called the owner of the goat s and inquire d about their prices . The owner of the goat s started to  y ell at John and refused to sell them to him . John was determined to protect his business , so  he grabbe d the goat s and started to ride them out of town . The goat s 

## 6. Clean up

To clean up, we can delete the model and endpoint.


In [126]:

predictor.delete_model()
predictor.delete_endpoint()