## Deploy Llama 3.2 1B on AWS Inferentia and SageMaker using HuggingFace TGI

This notebook demonstrates how to deploy the [Meta Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B) model using the [Hugging Face Text Generation Inference (TGI) Deep Learning Container on Amazon SageMaker](https://huggingface.co/docs/optimum-neuron/en/guides/neuronx_tgi).

TGI is an open-source, high-performance inference library that can be used to deploy large language models from the Hugging Face Hub in minutes. The library includes advanced functionality such as model parallelism and continuous batching to simplify production inference with large language models like Mistral, Llama, StableLM, and GPT-NeoX.



### Setup
First, install the required versions of the SageMaker Python SDK, `huggingface_hub`, and `transformers`.


In [None]:
!pip install "sagemaker>=2.216.0" --upgrade --quiet
!pip install "huggingface_hub==0.24.6" --upgrade --quiet
!pip install "transformers==4.45.2" --upgrade --quiet
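
After the install cell has run, it can be useful to confirm which versions were actually picked up by the kernel. The following is a minimal sketch, assuming the pins from the `pip` commands above; the helper names (`installed_version`, `status`, `report_versions`) are illustrative and not part of any library.

```python
from importlib import metadata

# Exact pins taken from the pip commands above; sagemaker only has a lower bound,
# so it is reported without an exact expected version.
PINNED = {"huggingface_hub": "0.24.6", "transformers": "4.45.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def status(installed, expected):
    """Classify one package as 'missing', 'mismatch', or 'ok'."""
    if installed is None:
        return "missing"
    if expected is not None and installed != expected:
        return "mismatch"
    return "ok"


def report_versions(pinned=PINNED, extra=("sagemaker",)):
    """Collect (package, installed, expected) tuples for a quick sanity check."""
    return [
        (package, installed_version(package), pinned.get(package))
        for package in list(extra) + list(pinned)
    ]


for package, installed, expected in report_versions():
    print(f"{package}: installed={installed} pinned={expected} [{status(installed, expected)}]")
```

If a package is reported as `mismatch`, restarting the kernel after the install is usually the first thing to try.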


### Setup account and role

Next, we import the SageMaker Python SDK and instantiate a SageMaker session (`sess`), which we use to determine the current region and the default S3 bucket. We also look up the execution role.


In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()

sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
 
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
 
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
 
print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

The following code defines the Docker image URI (`llm_image`) and the `model_id`.

`llm_image` is set to the image URI for the Hugging Face Large Language Model (LLM) inference container.

`model_id` is a Hugging Face model repository that has the model weights. In this case, it contains pre-compiled neuron artifacts. This is described in the next section.

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri
 
# Define the llm image uri
llm_image="763104351884.dkr.ecr."+sess.boto_region_name+".amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.25-neuronx-py310-ubuntu22.04-v1.0"

# To save deployment time, we used a pre-compiled Meta-Llama-3.2-1B model
model_id="cszhzleo/Meta-Llama-3.2-1B-Instruct-nc2-bs1-token1024-neuron-220"
# print ecr image uri
print(f"llm image uri: {llm_image}")

To deploy models on [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/?nc1=h_ls) and [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/?nc1=h_ls), the model must first be compiled to a NEFF (Neuron Executable File Format) file, which is then loaded onto the Neuron devices.

To save time in this workshop, we have pre-compiled [Meta Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B) (Instruct variant) and uploaded the resulting artifacts to the [Hugging Face model hub](https://huggingface.co/cszhzleo/Meta-Llama-3.2-1B-Instruct-nc2-bs1-token1024-neuron-220).

Here, we use [Optimum Neuron](https://huggingface.co/docs/optimum-neuron/en/index). Optimum Neuron is the interface between the HuggingFace Transformers library and AWS Accelerators including [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/?nc1=h_ls) and [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/?nc1=h_ls).

We use the following command to compile the model and export it to `OUTPUT_PATH`:

```
optimum-cli export neuron  --model meta-llama/Llama-3.2-1B-Instruct --batch_size 1 --sequence_length 1024 --num_cores 2 --auto_cast_type fp16  <OUTPUT_PATH>
```

To upload compiled model to HuggingFace, run the following commands:

```
huggingface-cli login # Log in to Hugging Face; requires a user access token with write permission

huggingface-cli upload <HF_MODEL_ID> <OUTPUT_PATH> # Upload the compiled model files

```

For more details, refer to the [guide on exporting models to AWS Trainium and Inferentia using Optimum Neuron](https://huggingface.co/docs/optimum-neuron/en/guides/export_model).


In practice, you can often skip this step and just use the `model_id` of the original model on the Hugging Face model hub. The framework maintains a cache of pre-compiled artifacts (NEFF files) for popular models and configurations in a public repository. When you specify a `model_id`, TGI looks up the [Neuron Model Cache](https://huggingface.co/aws-neuron/optimum-neuron-cache) and uses the cached artifacts to deploy the model.

A cached configuration is defined by the model architecture (for example, Llama3), model size (8B), Neuron version (2.18), number of Inferentia cores (2), batch size (4), and sequence length (4096).

This means that when you deploy a model whose architecture is based on Llama3 with a configuration for which Neuron-compiled artifacts already exist, there is no need to recompile the model.
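
To make the idea of a cached configuration concrete, the sketch below models it as a plain record and treats a cache hit as an exact match on every field. This is only an illustration of the concept: `NeuronCacheKey` and `needs_compilation` are hypothetical names, and the real lookup performed by Optimum Neuron and TGI may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeuronCacheKey:
    """Hypothetical record of the fields that define a cached configuration."""
    architecture: str
    model_size: str
    neuron_version: str
    num_cores: int
    batch_size: int
    sequence_length: int


def needs_compilation(requested, cached_keys):
    """Return True when no cached entry matches the requested configuration exactly."""
    return requested not in set(cached_keys)


# Example values taken from the text above.
cached = [NeuronCacheKey("Llama3", "8B", "2.18", 2, 4, 4096)]

same = NeuronCacheKey("Llama3", "8B", "2.18", 2, 4, 4096)
longer = NeuronCacheKey("Llama3", "8B", "2.18", 2, 4, 8192)

print(needs_compilation(same, cached))    # False: every field matches
print(needs_compilation(longer, cached))  # True: the sequence length differs
```

The point of the sketch is that changing any single field, such as the sequence length, yields a different configuration and therefore a cache miss.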

### Configure variables for `TGI`

The next step is to define certain configurations that will be used to deploy the model.
- `HF_NUM_CORES`: the number of Neuron cores across which the model will be partitioned (tensor parallel degree)
- `HF_BATCH_SIZE`: the batch size to be used for inference
- `HF_SEQUENCE_LENGTH`: the total sequence length (input + output) of requests to the model

In [None]:
from huggingface_hub import HfFolder
from sagemaker.huggingface import HuggingFaceModel
 
# sagemaker config
instance_type = "ml.inf2.xlarge"
health_check_timeout=2400 # additional time to load the model
volume_size=100 # size in GB of the EBS volume

# Model and endpoint configuration parameters
config = {
    "HF_MODEL_ID": model_id,
    "HF_NUM_CORES": "2", # number of neuron cores
    "HF_BATCH_SIZE": "1", # batch size used to compile the model
    "HF_SEQUENCE_LENGTH": "1024", # sequence length used to compile the model
    "HF_AUTO_CAST_TYPE": "fp16",  # dtype of the model
    "MAX_BATCH_SIZE": "1", # max batch size for the model
    "MAX_INPUT_LENGTH": "256", # max number of input tokens per request
    "MAX_TOTAL_TOKENS": "1024", # max number of input plus generated tokens per request
    #"HF_TOKEN": HfFolder.get_token(), # pass the Hugging Face token (only needed for gated or private repositories)
}
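
Because the model was compiled for fixed shapes, the serving limits in `config` have to be consistent with the compile-time values. The following is a minimal sketch of such a consistency check. It assumes that the `MAX_*` limits must not exceed the compiled batch size and sequence length, and that the input limit must leave room for at least one generated token; `check_tgi_config` is a hypothetical helper, and `example_config` simply repeats the values used in this notebook.

```python
def check_tgi_config(config):
    """Return a list of human-readable problems; an empty list means none were found."""
    batch = int(config["HF_BATCH_SIZE"])
    seq_len = int(config["HF_SEQUENCE_LENGTH"])
    max_batch = int(config["MAX_BATCH_SIZE"])
    max_input = int(config["MAX_INPUT_LENGTH"])
    max_total = int(config["MAX_TOTAL_TOKENS"])

    problems = []
    if max_input >= max_total:
        problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_total > seq_len:
        problems.append("MAX_TOTAL_TOKENS exceeds HF_SEQUENCE_LENGTH")
    if max_batch > batch:
        problems.append("MAX_BATCH_SIZE exceeds HF_BATCH_SIZE")
    return problems


# Same values as the notebook's config, repeated here so the sketch runs on its own.
example_config = {
    "HF_BATCH_SIZE": "1",
    "HF_SEQUENCE_LENGTH": "1024",
    "MAX_BATCH_SIZE": "1",
    "MAX_INPUT_LENGTH": "256",
    "MAX_TOTAL_TOKENS": "1024",
}

print(check_tgi_config(example_config))  # an empty list: no problems found
```

Running the check before creating the model object is cheaper than discovering a mismatch after a failed endpoint start.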



### Create the Hugging Face Model

Next we configure the model object by specifying the `image_uri` of the managed TGI container, and the execution `role` for the endpoint. Additionally, we specify a number of environment variables defined above, including the `HF_MODEL_ID` which corresponds to the model from the HuggingFace Hub that will be deployed.

In [None]:
# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)


### Creating a SageMaker Endpoint

Next we deploy the model by invoking the `deploy()` function.

To deploy and run large language models efficiently, it is important to choose an instance type that can handle the computational requirements. Here we use an `ml.inf2.xlarge` instance, which comes with 2 Neuron cores and therefore matches the `HF_NUM_CORES` value of 2 used above.
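
Before deploying, one can check that the number of Neuron cores the model was compiled for fits on the chosen instance. The sketch below is illustrative only: the core counts in `NEURON_CORES` reflect my understanding of the Inf2 family (two Neuron cores per Inferentia2 chip) and should be verified against the AWS documentation, and `fits_on_instance` is a hypothetical helper.

```python
# Assumed Neuron core counts per SageMaker Inf2 instance type; verify against AWS docs.
NEURON_CORES = {
    "ml.inf2.xlarge": 2,
    "ml.inf2.8xlarge": 2,
    "ml.inf2.24xlarge": 12,
    "ml.inf2.48xlarge": 24,
}


def fits_on_instance(instance_type, num_cores):
    """Return True if a model compiled for `num_cores` fits on `instance_type`."""
    available = NEURON_CORES.get(instance_type)
    if available is None:
        raise ValueError(f"Unknown instance type: {instance_type}")
    return num_cores <= available


print(fits_on_instance("ml.inf2.xlarge", 2))  # True: the compiled model uses 2 cores
print(fits_on_instance("ml.inf2.xlarge", 4))  # False: would need a larger instance
```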

In [None]:
# Private SDK attribute: tells the SageMaker SDK that the model is already compiled for Neuron
llm_model._is_compiled_model = True
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
  volume_size=volume_size,
  endpoint_name="llama-32-1b-endpoint"
)

**The above step takes about 10-15 minutes.**
While you wait for the model to be deployed, you can read the following resources:

- [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html)
- [AWS Inferentia2](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inf2-arch.html)
- [Amazon SageMaker Realtime Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deploy-models.html)
- [Amazon SageMaker with HuggingFace Optimum Neuron](https://huggingface.co/docs/optimum-neuron/en/guides/sagemaker)


### Running Inference

Once the endpoint is up and running, we can query the model using the `predict()` function.


In [None]:
from transformers import AutoTokenizer
 
# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
 
# Prompt to generate
messages = [
    {"role": "user", "content": "Can you tell me an interesting fact about AWS?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 
# Generation arguments
parameters = {
    "do_sample": True,
    "top_p": 0.6,
    "temperature": 0.9,
    "return_full_text": False,
    "max_new_tokens": 768,
}
 
res = llm.predict({"inputs": prompt, "parameters": parameters})
print(res[0]["generated_text"].strip().replace("</s>", ""))
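
The generation arguments interact with the limits set in the endpoint configuration: the prompt must fit within `MAX_INPUT_LENGTH`, and the prompt plus `max_new_tokens` must fit within `MAX_TOTAL_TOKENS`. The sketch below shows one way to pick a safe `max_new_tokens` for a given prompt length. `clamp_max_new_tokens` is a hypothetical helper, and the default limits are the values used in this notebook's `config`.

```python
def clamp_max_new_tokens(prompt_tokens, requested_new_tokens,
                         max_input_length=256, max_total_tokens=1024):
    """Return the largest allowed max_new_tokens that does not exceed the request."""
    if prompt_tokens > max_input_length:
        raise ValueError(
            f"Prompt has {prompt_tokens} tokens, but the endpoint accepts at most {max_input_length}"
        )
    # The remaining budget is whatever the prompt leaves free within the total limit.
    return min(requested_new_tokens, max_total_tokens - prompt_tokens)


print(clamp_max_new_tokens(40, 768))   # 768: the request already fits
print(clamp_max_new_tokens(256, 900))  # 768: clamped to 1024 - 256
```

In the notebook, `prompt_tokens` could be obtained by tokenizing `prompt` with the tokenizer loaded above and counting the resulting input IDs.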

### Streaming
Streaming responses from LLMs can significantly improve the user experience by reducing perceived wait times and providing real-time feedback. The following cell runs inference in streaming mode.

In [None]:
import json
import boto3

sm_client = boto3.client('sagemaker-runtime')

body = json.dumps({"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":768}, "stream": True})

resp = sm_client.invoke_endpoint_with_response_stream(
    EndpointName=llm.endpoint_name,
    Body=body,
    ContentType='application/json',
    Accept='application/json',
)
text = ""
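for event in resp['Body']:
    chunk = event['PayloadPart']['Bytes'].decode('utf-8')
    if chunk.startswith('data'):
        try:
            payload = json.loads(chunk[5:])  # strip the leading "data:" prefix
            print(payload['token']['text'], end='')
        except (json.JSONDecodeError, KeyError):
            # skip chunks that do not hold one complete token event
            pass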
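
The loop above assumes that each payload part contains exactly one complete `data:` event. If an event is split across two parts, or several events arrive in one part, some tokens are silently dropped. A more defensive variant buffers the bytes and only parses complete lines. The sketch below demonstrates the idea on hand-made byte chunks; `iter_stream_tokens` and `fake_chunks` are illustrative names, and the `data:{json}` line format is assumed from the cell above.

```python
import json


def iter_stream_tokens(chunks):
    """Yield token texts from an iterable of byte chunks holding `data:{json}` lines."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Only parse complete lines; keep any trailing partial line in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if not line.startswith(b"data:"):
                continue  # blank separator lines and anything that is not an event
            payload = json.loads(line[len(b"data:"):])
            yield payload["token"]["text"]


# Hand-made chunks: the second event is split across two chunks on purpose.
fake_chunks = [
    b'data:{"token": {"text": "Deep"}}\n\n',
    b'data:{"token": {"te',
    b'xt": " learning"}}\n\ndata:{"token": {"text": "!"}}\n\n',
]

print("".join(iter_stream_tokens(fake_chunks)))  # Deep learning!
```

With a real endpoint, the chunks could be supplied as `(event["PayloadPart"]["Bytes"] for event in resp["Body"])`, assuming every event in the stream is a payload part.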



### Cleaning Up

After you've finished using the endpoint, it's important to delete it to avoid incurring unnecessary costs.


In [None]:
llm.delete_model()
llm.delete_endpoint()