# Deploy open-source Large Language Models on Amazon SageMaker

In this notebook, we show how to deploy open-source LLMs from Hugging Face on Amazon SageMaker.

### Deploy text-to-text LLM on SageMaker
In this section, we will deploy the open-source [Falcon 40b instruct model](https://huggingface.co/tiiuae/falcon-40b-instruct) on SageMaker for real-time inference. 

This example shows how to deploy open-source LLMs, such as [BLOOM](https://huggingface.co/bigscience/bloom), to Amazon SageMaker for inference using the new Hugging Face LLM Inference Container. We will deploy [Falcon 40B Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct), an open-source chat LLM trained by TII.

The example covers:
1. [Setup development environment](#1-setup-development-environment)
2. [Retrieve the new Hugging Face LLM DLC](#2-retrieve-the-new-hugging-face-llm-dlc)
3. [Deploy Falcon to Amazon SageMaker](#3-deploy-falcon-to-amazon-sagemaker)
4. [Run inference and chat with our model](#4-run-inference-and-chat-with-our-model)

## What is Hugging Face LLM Inference DLC?

Hugging Face LLM DLC is a new purpose-built Inference Container to easily deploy LLMs in a secure and managed environment. The DLC is powered by [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference), an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. 
Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative. It implements optimizations for all supported model architectures, including:
* Tensor Parallelism and custom CUDA kernels
* Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures
* Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
* [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
* Accelerated weight loading (start-up time) with [safetensors](https://github.com/huggingface/safetensors)
* Logits warpers (temperature scaling, top-k, repetition penalty ...)
* Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
* Stop sequences, Log probabilities
* Token streaming using Server-Sent Events (SSE)

Officially supported model architectures are currently: 
* [BLOOM](https://huggingface.co/bigscience/bloom) / [BLOOMZ](https://huggingface.co/bigscience/bloomz)
* [MT0-XXL](https://huggingface.co/bigscience/mt0-xxl)
* [Galactica](https://huggingface.co/facebook/galactica-120b)
* [SantaCoder](https://huggingface.co/bigcode/santacoder)
* [GPT-NeoX 20B](https://huggingface.co/EleutherAI/gpt-neox-20b) (joi, pythia, lotus, rosey, chip, RedPajama, open assistant)
* [FLAN-T5-XXL](https://huggingface.co/google/flan-t5-xxl) (T5-11B)
* [Llama](https://github.com/facebookresearch/llama) (vicuna, alpaca, koala)
* [StarCoder](https://huggingface.co/bigcode/starcoder)
* [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) / [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)

With the new Hugging Face LLM Inference DLCs on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly concurrent, low latency LLM experiences like [HuggingChat](https://hf.co/chat), [OpenAssistant](https://open-assistant.io/), and Inference API for LLM models on the Hugging Face Hub. 

Let's get started!


#### 1. Setup development environment

We are going to use the `sagemaker` Python SDK to deploy Falcon to Amazon SageMaker. We need to make sure to have an AWS account configured and the `sagemaker` Python SDK installed.

In [None]:
!pip install "sagemaker==2.163.0" --upgrade --quiet

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

#### 2. Retrieve the new Hugging Face LLM DLC

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our `HuggingFaceModel` model class with an `image_uri` pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the `get_huggingface_llm_image_uri` method provided by the `sagemaker` SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified `backend`, `session`, `region`, and `version`. You can find the available versions [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-text-generation-inference-containers).

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

#### 3. Deploy Falcon to Amazon SageMaker

To deploy [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) to Amazon SageMaker we create a `HuggingFaceModel` model class and define our endpoint configuration including the `hf_model_id`, `instance_type` etc. We will use a `g5.12xlarge` instance type, which has 4 NVIDIA A10G GPUs and 96GB of GPU memory.
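
As a rough sanity check on the instance choice, the arithmetic below estimates the memory needed for the model weights alone. It assumes about 40 billion parameters stored in 16-bit precision (2 bytes each) and ignores the KV cache and activations, so treat it as a lower bound rather than a sizing guide. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

total_gb = weight_memory_gb(40e9)  # assumption: ~40B parameters, 16-bit weights
gpus = 4                           # ml.g5.12xlarge has 4 A10G GPUs
gpu_memory_gb = 96 / gpus          # 96 GB in total, so 24 GB per GPU

print(f"weights: ~{total_gb:.0f} GB, ~{total_gb / gpus:.0f} GB per GPU when sharded over {gpus} GPUs")
print(f"fits on a single {gpu_memory_gb:.0f} GB GPU: {total_gb <= gpu_memory_gb}")
```

Under these assumptions the weights only fit when they are sharded across all four GPUs, which is what tensor parallelism is used for here.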


In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel

# Define Model and Endpoint configuration parameter
hf_model_id = "tiiuae/falcon-40b-instruct" # model id from huggingface.co/models
instance_type = "ml.g5.12xlarge" # instance type to use for deployment
number_of_gpu = 4 # number of gpus to use for inference and tensor parallelism (ml.g5.12xlarge has 4)
health_check_timeout = 600 # Increase the timeout for the health check to 10 minutes for downloading the model

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env={
    'HF_MODEL_ID': hf_model_id,
    # 'HF_MODEL_QUANTIZE': "bitsandbytes", # uncomment to quantize
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
    'MAX_INPUT_LENGTH': json.dumps(1900),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
  }
)
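
Container environment variables have to be strings, which is why the numeric values above go through `json.dumps`. The following is a minimal sketch of a helper that builds the same `env` dictionary and checks that the input limit leaves room for generated tokens. `build_tgi_env` is a hypothetical helper, not part of the `sagemaker` SDK, and the check reflects the assumption that the total token budget includes the input.

```python
import json

def build_tgi_env(model_id, num_gpus, max_input_length, max_total_tokens):
    """Build the container env dict with string values (hypothetical helper)."""
    # Assumption: MAX_TOTAL_TOKENS counts input plus generated tokens,
    # so the input limit must be strictly smaller.
    if max_input_length >= max_total_tokens:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    return {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }

env = build_tgi_env("tiiuae/falcon-40b-instruct", 4, 1900, 2048)
print(env)
```

The resulting dictionary can be passed as `env=` when constructing the `HuggingFaceModel`.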

After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.g5.12xlarge` instance type. 

In [None]:
model_name = hf_model_id.split("/")[-1].replace(".", "-")
endpoint_name = model_name + "-12xl"
endpoint_name
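
Endpoint names must follow SageMaker's naming rules: at most 63 characters, using only alphanumeric characters and hyphens. The `replace(".", "-")` call above handles dots. A slightly more defensive sketch follows, where `to_endpoint_name` is a hypothetical helper introduced for illustration:

```python
import re

def to_endpoint_name(model_id: str, suffix: str = "") -> str:
    """Derive an endpoint name from a Hub model id (hypothetical helper)."""
    name = model_id.split("/")[-1] + suffix
    name = re.sub(r"[^a-zA-Z0-9-]", "-", name)  # replace characters that are not allowed
    name = re.sub(r"-+", "-", name).strip("-")  # collapse repeated hyphens, trim the ends
    return name[:63].rstrip("-")                # enforce the 63-character limit

print(to_endpoint_name("tiiuae/falcon-40b-instruct", "-12xl"))
```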

In [None]:
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
  endpoint_name=endpoint_name,
)

SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.

#### 4. Run inference and chat with our model

After our endpoint is deployed we can run inference on it. We will use the `predict` method from the `predictor` to run inference on our endpoint. We can pass different parameters to influence the generation. Parameters are defined in the `parameters` attribute of the payload. As of today, TGI supports the following parameters:
* `temperature`: Controls randomness in the model. Lower values will make the model more deterministic and higher values will make the model more random. Default value is 1.0.
* `max_new_tokens`: The maximum number of tokens to generate. Default value is 20, max value is 512.
* `repetition_penalty`: Controls the likelihood of repetition, defaults to `null`.
* `seed`: The seed to use for random generation, default is `null`.
* `stop`: A list of tokens to stop the generation. The generation will stop when one of the tokens is generated.
* `top_k`: The number of highest probability vocabulary tokens to keep for top-k-filtering. Default value is `null`, which disables top-k-filtering.
* `top_p`: The cumulative probability threshold for nucleus sampling: only the smallest set of highest-probability vocabulary tokens whose probabilities add up to `top_p` is kept. Defaults to `null`.
* `do_sample`: Whether or not to use sampling; use greedy decoding otherwise. Default value is `false`.
* `best_of`: Generate `best_of` sequences and return the one with the highest token logprobs. Defaults to `null`.
* `details`: Whether or not to return details about the generation. Default value is `false`.
* `return_full_text`: Whether or not to return the full text or only the generated part. Default value is `false`.
* `truncate`: Whether or not to truncate the input to the maximum length of the model. Default value is `true`.
* `typical_p`: The typical probability of a token. Default value is `null`.
* `watermark`: Whether or not to watermark the generation. Default value is `false`.

You can find the OpenAPI specification of TGI in the [swagger documentation](https://huggingface.github.io/text-generation-inference/).
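
To make the payload shape concrete, here is a minimal sketch that builds a request body with a `parameters` attribute and rejects names that are not in the list above. `build_payload` is a hypothetical helper, and the prompt and stop sequence are chosen purely for illustration.

```python
import json

# Parameter names taken from the list above.
TGI_PARAMETERS = {
    "temperature", "max_new_tokens", "repetition_penalty", "seed", "stop",
    "top_k", "top_p", "do_sample", "best_of", "details",
    "return_full_text", "truncate", "typical_p", "watermark",
}

def build_payload(prompt: str, **parameters) -> dict:
    """Build a request body, rejecting parameter names not listed above."""
    unknown = set(parameters) - TGI_PARAMETERS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"inputs": prompt, "parameters": parameters}

payload = build_payload(
    "What is Amazon SageMaker?",
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=256,
    stop=["\nUser:"],  # example stop sequence, an assumption for illustration
)
print(json.dumps(payload, indent=2))
```

With a deployed endpoint, the dictionary would be sent as `llm.predict(payload)`.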

In [None]:
chat = llm.predict({
    "inputs": """Hello, how are you?"""
})

print(chat[0]["generated_text"])

### Deploy speech-to-text LLM on SageMaker
In this section, we will deploy the open-source [Whisper model](https://huggingface.co/openai/whisper-large-v2) to SageMaker real-time hosting. 

In [None]:
import json

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# runtime client, used later to invoke the endpoint directly
client = boto3.client('sagemaker-runtime')

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'openai/whisper-large-v2',
    'HF_TASK': 'automatic-speech-recognition',
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub,
    role=role,
)

endpoint_name = "whisper-large-v2"
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,  # number of instances
    instance_type='ml.g5.xlarge',  # ec2 instance type
)



In [None]:
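from sagemaker.serializers import DataSerializer

# attach to the existing endpoint and send raw audio bytes
predictor = sagemaker.predictor.Predictor(endpoint_name=endpoint_name,
                                          serializer=DataSerializer(content_type='audio/x-audio'),
                                         )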

In [None]:
file = "test/test.webm"
with open(file, "rb") as f:
	data = f.read()

In [None]:
# option 1: using SageMaker python SDK
from sagemaker.serializers import DataSerializer

predictor.serializer = DataSerializer(content_type='audio/x-audio') # change to audio/x-audio for audio
predictor.predict(data)

In [None]:
# option 2: using boto3 invoke_endpoint api
response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='audio/x-audio', Body=data)
json.loads(response['Body'].read())
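
The boto3 call returns a streaming body that has to be read and parsed. The sketch below wraps that in a small function and runs it against a fake client so it works offline. It assumes the endpoint answers with JSON of the form `{"text": "..."}`; both `transcribe` and `FakeRuntimeClient` are hypothetical names, and the response shape should be checked against your own endpoint.

```python
import io
import json

def transcribe(runtime_client, endpoint_name: str, audio_bytes: bytes) -> str:
    """Send audio bytes to the endpoint and return the transcribed text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="audio/x-audio",
        Body=audio_bytes,
    )
    result = json.loads(response["Body"].read())
    return result["text"]  # assumption: the response looks like {"text": "..."}

class FakeRuntimeClient:
    """Stand-in for the boto3 runtime client so the sketch runs offline."""
    def invoke_endpoint(self, EndpointName, ContentType, Body):
        return {"Body": io.BytesIO(json.dumps({"text": "hello world"}).encode())}

print(transcribe(FakeRuntimeClient(), "my-endpoint", b"fake audio bytes"))
```

Against the real endpoint, the first argument would be the boto3 runtime client and the bytes would come from the audio file read earlier.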

### Deploy image-to-text LLM on SageMaker
In this section, we will show you how to deploy the [BLIP-2 model](https://huggingface.co/Salesforce/blip2-opt-6.7b) on SageMaker.

#### Setup
We retrieve the DLC image URI for djl-deepspeed 0.21.0 and configure the SageMaker settings.

In [None]:
from sagemaker import image_uris
import time

session = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = session.boto_region_name  # public attribute instead of the private _region_name
bucket = session.default_bucket()  # bucket to house artifacts

img_uri = image_uris.retrieve(framework="djl-deepspeed", region=region, version="0.21.0")
instance_type = "ml.g5.4xlarge"
s3_location = f"s3://{bucket}/djl-serving/"

#### Prepare model file.
We can update the configuration for deployment by modifying the [serving.properties](blip2/serving.properties).
```properties
engine=DeepSpeed
option.tensor_parallel_degree=1
option.model_id=Salesforce/blip2-opt-6.7b
```

The code below packages the model directory into a tarball (`blip2.tar.gz`) and uploads it to S3.

In [None]:
!tar -czvf blip2.tar.gz blip2/
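
If shell commands are not available, the same archive can be produced with Python's standard `tarfile` module. The sketch below packs a directory and lists the archive members. It runs on a throwaway directory containing only a `serving.properties` file, which is an assumption made for the demo; the real `blip2/` folder may contain more files. `make_model_tarball` is a hypothetical helper.

```python
import tarfile
import tempfile
from pathlib import Path

def make_model_tarball(model_dir: str, output_path: str) -> list:
    """Pack model_dir into a gzip-compressed tarball and return the member names."""
    model_dir = Path(model_dir)
    with tarfile.open(output_path, "w:gz") as tar:
        # arcname keeps the top-level folder, like `tar -czvf blip2.tar.gz blip2/`
        tar.add(str(model_dir), arcname=model_dir.name)
    with tarfile.open(output_path, "r:gz") as tar:
        return tar.getnames()

# Demo on a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "blip2"
    demo_dir.mkdir()
    (demo_dir / "serving.properties").write_text(
        "engine=DeepSpeed\n"
        "option.tensor_parallel_degree=1\n"
        "option.model_id=Salesforce/blip2-opt-6.7b\n"
    )
    members = make_model_tarball(str(demo_dir), str(Path(tmp) / "blip2.tar.gz"))
    print(members)
```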

In [None]:
model_tar_url = sagemaker.s3.S3Uploader.upload("blip2.tar.gz", s3_location)

#### Create SageMaker endpoint
Now we create our SageMaker model. Make sure your execution role has access to your model artifacts and ECR image. Please check out our SageMaker Roles documentation for more details.

In [None]:
from datetime import datetime

sm_client = boto3.client("sagemaker")

time_stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
model_name = "blip2-" + time_stamp

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": img_uri, "ModelDataUrl": model_tar_url},
)

In [None]:
initial_instance_count = 1
variant_name = "AllTraffic"
endpoint_config_name = "blip2-config-" + time_stamp

production_variants = [
    {
        "VariantName": variant_name,
        "ModelName": model_name,
        "InitialInstanceCount": initial_instance_count,
        "InstanceType": instance_type,
        "ModelDataDownloadTimeoutInSeconds": 1200,
        "ContainerStartupHealthCheckTimeoutInSeconds": 1800
    }
]

endpoint_config = {
    "EndpointConfigName": endpoint_config_name,
    "ProductionVariants": production_variants
}

ep_conf_res = sm_client.create_endpoint_config(**endpoint_config)

In [None]:
endpoint_name = "blip2-" + time_stamp
ep_res = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

In [None]:
describe_endpoint_response = sm_client.describe_endpoint(EndpointName=endpoint_name)

# poll until the endpoint leaves the "Creating" state
while describe_endpoint_response["EndpointStatus"] == "Creating":
    time.sleep(15)
    describe_endpoint_response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    print(describe_endpoint_response["EndpointStatus"])

# the endpoint may have ended up "InService" or "Failed", so report the actual status
print(f'endpoint {endpoint_name} status: {describe_endpoint_response["EndpointStatus"]}')
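
A polling loop like this is easier to reuse when it also handles failure and timeouts. The sketch below takes the describe function as an argument so it can be exercised with a fake; `wait_for_endpoint` and `fake_describe` are hypothetical names, and the timeout default is an arbitrary choice.

```python
import time

def wait_for_endpoint(describe, endpoint_name, poll_seconds=15, timeout_seconds=1800):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status != "Creating":
            raise RuntimeError(f"endpoint {endpoint_name} ended in status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint {endpoint_name} still creating after {timeout_seconds}s")
        time.sleep(poll_seconds)

# Offline demo with a fake describe function; with boto3 you would pass
# sm_client.describe_endpoint instead.
_statuses = iter(["Creating", "Creating", "InService"])

def fake_describe(EndpointName):
    return {"EndpointStatus": next(_statuses)}

print(wait_for_endpoint(fake_describe, "blip2-demo", poll_seconds=0))
```

boto3 also ships a built-in waiter for this case, `sm_client.get_waiter("endpoint_in_service")`, which may be simpler when custom status handling is not needed.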

#### Test inference endpoint

In [None]:
import base64
import json

def encode_image(img_file):
    with open(img_file, "rb") as image_file:
        img_str = base64.b64encode(image_file.read())
        base64_string = img_str.decode("latin1")
    return base64_string
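
JSON cannot carry raw bytes, which is why the image is base64-encoded into text before it is placed in the request body. The round trip below uses stand-in bytes instead of a real image file and shows that decoding recovers the original data. The `prompt` and `image` keys mirror the payload used in this notebook, and the prompt text is illustrative.

```python
import base64
import json

raw = bytes(range(256))  # stand-in for the bytes of an image file
encoded = base64.b64encode(raw).decode("ascii")  # base64 output is plain ASCII

body = json.dumps({"prompt": "Question: what is in the picture? Answer:", "image": encoded})

# What a server-side handler would do to get the image bytes back.
decoded = base64.b64decode(json.loads(body)["image"])

print(len(raw), len(encoded), decoded == raw)
```

Base64 inflates the data by roughly a third, which matters for large images because real-time endpoints limit the request payload size.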

In [None]:
base64_string = encode_image('carcrash-ai.jpeg')
inputs = {"prompt": "Question: is the car damaged? and if yes, which parts of this car are damaged?\n Answer:", "image": base64_string}

smr_client = boto3.client("sagemaker-runtime")

response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(inputs)
)
print(response["Body"].read())

## Clean up

In [None]:
# sm_client.delete_endpoint(EndpointName=endpoint_name)
# sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
# sm_client.delete_model(ModelName=model_name)