Copyright (C) 2024 Intel Corporation

SPDX-License-Identifier: Apache-2.0

# SageMaker inference with Intel optimizations

## Agenda
0. Prerequisites
1. Build a Deep Learning Container and push it to AWS ECR
2. Create a TorchServe file and put it in an S3 bucket
3. Create an AWS SageMaker endpoint
4. Invoke the endpoint
5. Clean up

### Prerequisites

Install all libraries required to run the example.

In [None]:
!pip install "sagemaker>=2.175.0" --upgrade --quiet
!pip install awscli boto3 s3transfer torch-model-archiver torchserve --upgrade --quiet
!pip install huggingface_hub --upgrade --quiet

Also make sure that your AWS role has all the required permissions. To run this example, the role needs the following policies:
- AmazonSageMakerFullAccess
- AmazonEC2ContainerRegistryFullAccess
- AmazonS3FullAccess
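
As a quick sanity check, the list above can be compared against the managed policies attached to your role. The sketch below is only a heuristic: it assumes the permissions come from AWS managed policies with exactly these names (equivalent custom or inline policies would not be detected), and `missing_policies` and `role_name` are illustrative names, not part of this example's code.

```python
REQUIRED_POLICIES = {
    "AmazonSageMakerFullAccess",
    "AmazonEC2ContainerRegistryFullAccess",
    "AmazonS3FullAccess",
}

def missing_policies(attached_policy_names):
    """Return the required managed policies that are absent from the given names, sorted."""
    return sorted(REQUIRED_POLICIES - set(attached_policy_names))

# With AWS credentials available, the attached names could be fetched with boto3
# (role_name is a placeholder for the name of your role):
#   iam = boto3.client("iam")
#   resp = iam.list_attached_role_policies(RoleName=role_name)
#   attached = [p["PolicyName"] for p in resp["AttachedPolicies"]]
attached = ["AmazonSageMakerFullAccess", "AmazonS3FullAccess"]  # example input
print(missing_policies(attached))
```

An empty list means all three managed policies are attached.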

**Also define the following variables.** They are needed to build the Deep Learning Container image with Docker and push it to AWS ECR.

In [None]:
from datetime import datetime

current_datetime = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
current_datetime

In [None]:
ACCOUNT_ID = ""
REPOSITORY_NAME = ""
REGION = ""
# modify this based on your S3 Bucket name
S3_BUCKET_NAME = "" # s3://<s3 bucket name>/

In [None]:
# Derive the image tag, ECR URL and model archive location from the values above
tag = f"2.3.1-cpu-intel-py310-ubuntu20.04-sagemaker-llm-{current_datetime}"
ECR_URL = f"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com/{REPOSITORY_NAME}:{tag}"
# Build the S3 URI with string formatting; os.path.join would insert backslashes on Windows
S3_URL = f"{S3_BUCKET_NAME.rstrip('/')}/llm.tar.gz"
endpoint_name = "llm-ipex"
ECR_URL
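
Because the variables above start out empty, it may help to validate them before building anything. The following is a minimal sketch, not part of the original example: `check_settings` is a hypothetical helper, and the region pattern is a rough heuristic rather than an authoritative list of AWS regions.

```python
import re

def check_settings(account_id, repository_name, region, s3_bucket_name):
    """Return a list of problems found; an empty list means the values look plausible."""
    problems = []
    if not re.fullmatch(r"\d{12}", account_id):
        problems.append("ACCOUNT_ID should be a 12-digit AWS account number")
    if not repository_name:
        problems.append("REPOSITORY_NAME is empty")
    # Heuristic shape check, e.g. us-east-1 or ap-southeast-2
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d", region):
        problems.append("REGION does not look like an AWS region such as us-east-1")
    if not s3_bucket_name.startswith("s3://"):
        problems.append("S3_BUCKET_NAME should start with s3://")
    return problems

# Example with placeholder values; in the notebook you would pass
# ACCOUNT_ID, REPOSITORY_NAME, REGION and S3_BUCKET_NAME instead.
print(check_settings("123456789012", "my-repo", "us-east-1", "s3://my-bucket/"))
```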

### Build a Docker container and push it to AWS ECR

If you don't have a Docker image prepared beforehand, build the image with all required Intel optimizations.

In [None]:
# Review the Dockerfile
!cat docker/Dockerfile

In [None]:
# build docker image
!docker build -t $ECR_URL docker

In [None]:
# Authenticate to ECR
!aws ecr get-login-password --region {REGION} | docker login --username AWS --password-stdin {ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com
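
Note that `docker push` fails if the target repository does not exist in ECR yet. A minimal sketch of creating it on demand is shown below; `ensure_repository` is a hypothetical helper that expects a boto3 ECR client, and it assumes your role is allowed to describe and create repositories.

```python
def ensure_repository(ecr_client, repository_name):
    """Create the ECR repository if it is missing; return True if it was created."""
    try:
        ecr_client.describe_repositories(repositoryNames=[repository_name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=repository_name)
        return True

# Usage, assuming AWS credentials are configured:
#   import boto3
#   ensure_repository(boto3.client("ecr", region_name=REGION), REPOSITORY_NAME)
```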

In [None]:
# Push docker image
!docker push $ECR_URL

### Create a TorchServe file and put it in an S3 bucket

All information about the model is stored in `model/model-config.yaml`. Put the name of the model you'd like to run in the `model_name` parameter. The endpoint has been tested with `Salesforce/codegen25-7b-multi`. The steps below create a TorchServe file and upload it to the S3 bucket, which is required to run the endpoint with the container.

To change the batch size, max length or max new tokens of the model, modify the corresponding fields in `model-config.yaml` before creating the TorchServe file.

In [None]:
!cd model && cat model-config.yaml

To generate a TorchServe file, use the following command:

In [None]:
import yaml

with open("model/model-config.yaml") as stream:
    model_name = yaml.safe_load(stream)["handler"]["model_name"]

# Fail early if model_name is empty or missing
if not model_name:
    raise ValueError("Specify model_name in model/model-config.yaml")

# Create torchserve model archive
!cd model && torch-model-archiver --force --model-name llm --version 1.0 --handler llm_handler.py --config-file model-config.yaml --archive-format tgz
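
Before uploading, it can be worth confirming that the archive contains what you expect. The sketch below lists the files in the `.tar.gz` using only the standard library. `archive_members` and `missing_files` are hypothetical helpers, and the check assumes the archiver stores the handler and config under their original file names.

```python
import tarfile

def archive_members(archive_path):
    """Return the names of the regular files stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [member.name for member in tar.getmembers() if member.isfile()]

def missing_files(archive_path, expected=("llm_handler.py", "model-config.yaml")):
    """Return the expected file names that do not appear anywhere in the archive."""
    basenames = {name.rsplit("/", 1)[-1] for name in archive_members(archive_path)}
    return [name for name in expected if name not in basenames]

# Usage after the archiver has run:
#   print(archive_members("model/llm.tar.gz"))
#   print(missing_files("model/llm.tar.gz"))
```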

Next, copy the model into an S3 bucket of your choice:

In [None]:
!cd model && aws s3 cp llm.tar.gz $S3_BUCKET_NAME

### Create an AWS SageMaker endpoint

The next step is to deploy the model to AWS SageMaker and create an endpoint to run inference.

In [None]:
import sagemaker
import boto3

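boto3_session = boto3.session.Session(region_name=REGION)
# Create both clients from the session so that they use REGION
smr = boto3_session.client('sagemaker-runtime')
sm = boto3_session.client('sagemaker')
role = sagemaker.get_execution_role()
sess = sagemaker.session.Session(boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr)
# Use the public attribute instead of the private _region_name
region = sess.boto_region_name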
account = sess.account_id()

bucket_name = sess.default_bucket()
prefix = "torchserve"
output_path = f"s3://{bucket_name}/{prefix}"
print(f'account={account}, region={region}, role={role}, output_path={output_path}')

In [None]:
from sagemaker import Model

instance_type = "ml.m7i.8xlarge"
sagemaker_name = sagemaker.utils.name_from_base(endpoint_name)

model = Model(
    name="torchserve-llm-ipex" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
    # S3 location of the TorchServe archive (llm.tar.gz) uploaded earlier
    model_data=S3_URL,
    image_uri=ECR_URL,
    role=role,
    sagemaker_session=sess,
    env={"TS_INSTALL_PY_DEP_PER_MODEL": "true",
         "SAGEMAKER_CONTAINER_LOG_LEVEL": "0",
         "SAGEMAKER_REGION": region},
)
print(sagemaker_name)
print(model)

In [None]:
model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=sagemaker_name,
    #volume_size=32, # increase the size to store large model
    model_data_download_timeout=3600, # increase the timeout to download large model
    container_startup_health_check_timeout=600, # increase the timeout to load large model
)

You can inspect the endpoint logs in Amazon CloudWatch to check whether the model has been deployed successfully.
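
Besides reading the logs, the endpoint status can be polled programmatically. Below is a minimal sketch, assuming a boto3 SageMaker client such as `sm`; `wait_for_endpoint` is a hypothetical helper that treats `InService` and `Failed` as final states and keeps polling for anything else until the timeout expires.

```python
import time

def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=3600):
    """Poll describe_endpoint until the status is InService or Failed; return that status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status in ("InService", "Failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{endpoint_name} is still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)

# Usage, assuming the objects created in the cells above:
#   print(wait_for_endpoint(sm, sagemaker_name))
```

If the status ends up as `Failed`, the container output in the CloudWatch log group `/aws/sagemaker/Endpoints/<endpoint name>` is usually the first place to look.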

### Invoke the endpoint

Once the model is deployed, invoke the endpoint and print a sample response with the following code.

In [None]:
import time, json

client = boto3.client('sagemaker-runtime')
task = "Write a python function to compute the factorial of an integer."

custom_attributes = "c000b4f9-df62-4c85-a0bf-7c525f9104a4"  # An example of a trace ID.
content_type = "text/plain"                           # The MIME type of the input data in the request body.
accept = "*/*"                                              # The desired MIME type of the inference in the response.

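import io

class Parser:
    """Buffer streamed bytes and yield only complete, newline-terminated lines."""
    def __init__(self):
        self.buff = io.BytesIO()
        self.read_pos = 0

    def write(self, content):
        # Append the new bytes at the end of the buffer
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    def scan_lines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            # Indexing bytes returns an int, so line[-1] != b'\n' was always True;
            # endswith checks for the trailing newline correctly
            if line.endswith(b'\n'):
                self.read_pos += len(line)
                yield line[:-1]

    def reset(self):
        self.read_pos = 0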

start_time = time.time()
response = client.invoke_endpoint_with_response_stream(
    EndpointName=sagemaker_name, 
    CustomAttributes=custom_attributes, 
    ContentType=content_type,
    Accept=accept,
    Body=task)
print("--- %s seconds ---" % (time.time() - start_time))

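# Use .get so that a missing transfer-encoding header does not raise KeyError
if response['ResponseMetadata']['HTTPHeaders'].get('transfer-encoding') == 'chunked':
    for event in response['Body']:
        print(json.loads(event['PayloadPart']['Bytes'].decode("utf-8"))["text"], end="")
else:
    parser = Parser()
    for event in response['Body']:
        parser.write(event['PayloadPart']['Bytes'])
        for line in parser.scan_lines():
            print(line.decode("utf-8"), end="")
    # Print any bytes still buffered after the last complete line
    parser.buff.seek(parser.read_pos)
    print(parser.buff.read().decode("utf-8"), end="")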

print("\n--- %s seconds ---" % (time.time() - start_time))
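
The line-buffering idea used for the non-chunked branch can be tried offline, without an endpoint. The sketch below is a simplified stand-in rather than the class from the cell above: `iter_complete_lines` is a hypothetical helper, and the sample chunks only imitate a stream in which a JSON line is split across two payload parts.

```python
def iter_complete_lines(chunks):
    """Yield lines (without the newline) from byte chunks, then any unterminated tail."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        # Everything before the last newline is complete; the rest waits for more data.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line
    if pending:
        # Trailing data that never received a newline
        yield pending

chunks = [b'{"text": "de', b'f"}\n{"text": " fact"}\n', b'{"text": "orial"}']
print([line.decode("utf-8") for line in iter_complete_lines(chunks)])
```

The first JSON object arrives in two pieces and is only emitted once its newline has been seen, which is why the stream is buffered instead of decoded chunk by chunk.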

### Clean up

Once you are done with the endpoint, you can delete it, together with its endpoint configuration and the model, using the following calls.

In [None]:
sm.delete_endpoint(EndpointName=sagemaker_name)
sm.delete_endpoint_config(EndpointConfigName=sagemaker_name)
sm.delete_model(ModelName=model.name)
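
If one of these resources has already been deleted, the corresponding call raises an error and the remaining calls are skipped. A more tolerant variant could look like the sketch below; `cleanup` is a hypothetical helper that expects a boto3 SageMaker client and assumes, as in the calls above, that the endpoint configuration shares the endpoint's name.

```python
def cleanup(sm_client, endpoint_name, model_name):
    """Delete the endpoint, its config and the model; return the steps that failed."""
    steps = [
        ("endpoint", lambda: sm_client.delete_endpoint(EndpointName=endpoint_name)),
        ("endpoint config", lambda: sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)),
        ("model", lambda: sm_client.delete_model(ModelName=model_name)),
    ]
    failed = []
    for label, delete in steps:
        try:
            delete()
        except sm_client.exceptions.ClientError as err:
            # Typically raised when the resource is already gone; keep going.
            failed.append((label, str(err)))
    return failed

# Usage, assuming the objects created in the cells above:
#   print(cleanup(sm, sagemaker_name, model.name))
```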