# CodeGen SageMaker inference with Intel optimizations

## Agenda
0. Prerequisites
1. Build the Deep Learning Container and push it to AWS ECR
2. Create a TorchServe archive and upload it to an S3 bucket
3. Create an AWS SageMaker endpoint
4. Invoke the endpoint
5. Clean up

### Prerequisites

Install all libraries required to run the example.

In [31]:
!pip install "sagemaker>=2.175.0" --upgrade --quiet
! pip install awscli boto3 botocore numpy s3transfer torch-model-archiver==0.8.1 torchserve==0.8.2 --upgrade --quiet

Also make sure that your AWS role has all the required permissions. To run this example, the role needs the following managed policies:
- AmazonSageMakerFullAccess
- AmazonEC2ContainerRegistryFullAccess
- AmazonS3FullAccess
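
One way to confirm that the role carries these policies is to list its attached managed policies. The helper below is a minimal sketch: `missing_policies` is a hypothetical name, it accepts any object that exposes boto3's IAM `list_attached_role_policies` call, and it ignores pagination and inline policies.

```python
REQUIRED_POLICIES = {
    "AmazonSageMakerFullAccess",
    "AmazonEC2ContainerRegistryFullAccess",
    "AmazonS3FullAccess",
}

def missing_policies(iam_client, role_name):
    """Return the required managed policies that are not attached to the role."""
    response = iam_client.list_attached_role_policies(RoleName=role_name)
    attached = {policy["PolicyName"] for policy in response["AttachedPolicies"]}
    return sorted(REQUIRED_POLICIES - attached)

# Example usage (needs AWS credentials that may read IAM):
# import boto3
# print(missing_policies(boto3.client("iam"), "<your role name>"))
```

An empty list means all three policies are attached directly to the role.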

**Also define the following variables.** They are used to build the Docker image of the Deep Learning Container, push it to AWS ECR, and locate the model archive on S3.

In [47]:
from datetime import datetime

current_datetime = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
current_datetime

'2024-03-08-16-19-50'

In [48]:
ACCOUNT_ID = ""
REPOSITORY_NAME = "pytorch_inference"
REGION = "us-west-2"
# modify this based on your S3 Bucket name
S3_BUCKET_NAME = "" # s3://<s3 bucket name>/

In [None]:
# Derive the image tag, ECR URL and S3 URL from the variables defined above
tag = f"2.2.0-cpu-intel-py310-ubuntu20.04-sagemaker-codegen-{current_datetime}"
ECR_URL = f"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com/{REPOSITORY_NAME}:{tag}"
# Join with "/" explicitly; os.path.join would use the OS path separator, which is wrong for URLs on Windows
S3_URL = S3_BUCKET_NAME.rstrip("/") + "/codegen25.tar.gz"
endpoint_name = "codegen-ipex"
ECR_URL
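
If `ACCOUNT_ID` or `S3_BUCKET_NAME` is left empty, the cell above still produces URLs, but they are invalid and the failure only shows up later at push or deploy time. The sketch below builds the same two URLs and fails early; `build_urls` is a hypothetical helper, and the checks (a 12-digit account ID, an `s3://` prefix) are assumptions about what a valid configuration looks like.

```python
import re

def build_urls(account_id, region, repository, tag, s3_bucket):
    """Build the ECR image URL and the S3 URL of the model archive, failing early on bad input."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("ACCOUNT_ID must be a 12-digit AWS account ID")
    if not s3_bucket.startswith("s3://"):
        raise ValueError("S3_BUCKET_NAME must look like s3://<s3 bucket name>/")
    ecr_url = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"
    s3_url = s3_bucket.rstrip("/") + "/codegen25.tar.gz"
    return ecr_url, s3_url

# Example usage with the notebook variables:
# ECR_URL, S3_URL = build_urls(ACCOUNT_ID, REGION, REPOSITORY_NAME, tag, S3_BUCKET_NAME)
```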

### Build the Deep Learning Container and push it to AWS ECR

If you do not already have a Docker image, build one that includes all the required Intel optimizations.

In [None]:
# Review the Dockerfile
!cat docker/Dockerfile

In [None]:
# build docker image
!docker build -t $ECR_URL docker

In [37]:
# Authenticate to ECR
!aws ecr get-login-password --region {REGION} | docker login --username AWS --password-stdin {ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com

Login Succeeded


In [None]:
# Push docker image
!docker push $ECR_URL
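
`docker push` fails if the target ECR repository does not exist yet. A minimal sketch for creating it on demand is shown below; `ensure_repository` is a hypothetical helper that expects a boto3 ECR client (for example `boto3.client("ecr", region_name=REGION)`) and relies on the client's `create_repository` call and `RepositoryAlreadyExistsException`.

```python
def ensure_repository(ecr_client, repository_name):
    """Create the ECR repository if it is missing; return True if it was created."""
    try:
        ecr_client.create_repository(repositoryName=repository_name)
        return True
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        # The repository is already there, so there is nothing to do
        return False

# Example usage, to be run before the push:
# import boto3
# ensure_repository(boto3.client("ecr", region_name=REGION), REPOSITORY_NAME)
```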

### Create a TorchServe archive and upload it to an S3 bucket

The endpoint has been tested with the `Salesforce/codegen25-7b-multi` model. The steps below create the TorchServe model archive and upload it to the S3 bucket, which the Deep Learning Container needs in order to serve the model.

To change the batch size, maximum length, or maximum number of new tokens, edit the corresponding fields (`batch_size`, `max_length`, `max_new_tokens`) in `model-config.yaml` before creating the archive.

In [49]:
!cd codegen_model && cat model-config.yaml

minWorkers: 1
maxWorkers: 1
responseTimeout: 1500

handler:
    model_name: "Salesforce/codegen25-7b-multi"
    batch_size: 1
    max_length: 1024 
    max_new_tokens: 128
    ipex_weight_only_quantization: true
    woq_dtype: "INT8"
    lowp_mode: "BF16"
    act_quant_mode: "PER_IC_BLOCK"
    group_size: -1
    token_latency: true
    benchmark: true 
    num_warmup: 2
    num_iter: 8
    greedy: true
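
After editing these fields, a quick sanity check can catch typos before the archive is built. The sketch below assumes the `handler` section is available as a plain dictionary (inside a TorchServe handler it is typically reachable through `ctx.model_yaml_config["handler"]`). `check_handler_config` is a hypothetical helper, and the rule it enforces (positive integers) is an assumption rather than a constraint documented by the handler.

```python
def check_handler_config(handler_cfg):
    """Return a list of problems found in the numeric handler settings."""
    problems = []
    for key in ("batch_size", "max_length", "max_new_tokens"):
        value = handler_cfg.get(key)
        # bool is a subclass of int, so exclude it explicitly
        if not isinstance(value, int) or isinstance(value, bool) or value < 1:
            problems.append(f"{key} must be a positive integer, got {value!r}")
    return problems

example = {"batch_size": 1, "max_length": 1024, "max_new_tokens": 128}
print(check_handler_config(example))  # prints []
```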
    


To generate the TorchServe archive, use the following command:

In [40]:
!cd codegen_model && torch-model-archiver --force --model-name codegen25 --version 1.0 --handler codegen_handler.py --config-file model-config.yaml --extra-files codegen25.py --archive-format tgz
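
With `--archive-format tgz` the archiver writes `codegen25.tar.gz`. Listing its members is a cheap way to confirm that the handler, the extra file, and the config file were packaged. The sketch below uses only the standard library; `list_archive` is a hypothetical helper, and the path in the usage comment assumes the archive was written into `codegen_model`.

```python
import tarfile

def list_archive(path):
    """Return the member names stored in a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()

# Example usage:
# print(list_archive("codegen_model/codegen25.tar.gz"))
```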



Next, copy the model into an S3 bucket of your choice:

In [None]:
!cd codegen_model && aws s3 cp codegen25.tar.gz $S3_URL
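
To confirm that the upload landed where `S3_URL` points, the object can be looked up before deploying. The sketch below is an assumption-based check: `split_s3_url` and `archive_size` are hypothetical helpers, and `archive_size` expects a boto3 S3 client and uses its `head_object` call.

```python
from urllib.parse import urlparse

def split_s3_url(s3_url):
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {s3_url!r}")
    return parsed.netloc, parsed.path.lstrip("/")

def archive_size(s3_client, s3_url):
    """Return the size in bytes of the object behind s3_url (raises if it does not exist)."""
    bucket, key = split_s3_url(s3_url)
    return s3_client.head_object(Bucket=bucket, Key=key)["ContentLength"]

# Example usage:
# import boto3
# print(archive_size(boto3.client("s3", region_name=REGION), S3_URL))
```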

### Create an AWS SageMaker endpoint

The next step is to deploy the model to AWS SageMaker and create an endpoint to run inference.

In [None]:
import sagemaker
import boto3

boto3_session = boto3.session.Session(region_name=REGION)
# Create both clients from the session so that they use the same region as the rest of the notebook
smr = boto3_session.client('sagemaker-runtime')
sm = boto3_session.client('sagemaker')
role = sagemaker.get_execution_role()
sess = sagemaker.session.Session(boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr)
region = sess.boto_region_name
account = sess.account_id()

bucket_name = sess.default_bucket()
prefix = "torchserve"
output_path = f"s3://{bucket_name}/{prefix}"
print(f'account={account}, region={region}, role={role}, output_path={output_path}')

In [None]:
from sagemaker import Model

instance_type = "ml.m7i.8xlarge"
sagemaker_name = sagemaker.utils.name_from_base(endpoint_name)

model = Model(
    name="torchserve-codegen-ipex-" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
    # S3 URL of the TorchServe archive (codegen25.tar.gz) uploaded earlier
    model_data=S3_URL,
    image_uri=ECR_URL,
    role=role,
    sagemaker_session=sess,
    env={"TS_INSTALL_PY_DEP_PER_MODEL": "true",
         "SAGEMAKER_CONTAINER_LOG_LEVEL": "0",
         "SAGEMAKER_REGION": region},
)
print(sagemaker_name)
print(model)

In [44]:
model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=sagemaker_name,
    #volume_size=32, # increase the size to store large model
    model_data_download_timeout=3600, # increase the timeout to download large model
    container_startup_health_check_timeout=600, # increase the timeout to load large model
)

-----!

You can inspect the endpoint's CloudWatch logs (log group `/aws/sagemaker/Endpoints/<endpoint name>`) to check whether the model has been deployed and loaded successfully.
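
Besides the logs, the endpoint status can be polled directly, for example when re-checking an endpoint from a later session. The sketch below is a hypothetical helper (`wait_for_endpoint`) around the SageMaker client's `describe_endpoint` call; the polling interval and timeout defaults are arbitrary assumptions.

```python
import time

def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=3600):
    """Poll the endpoint until it is InService; raise if it fails or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(description.get("FailureReason", "endpoint creation failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint {endpoint_name} is still {status}")
        time.sleep(poll_seconds)

# Example usage with the client created earlier:
# wait_for_endpoint(sm, sagemaker_name)
```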

### Invoke the endpoint

Once the model is deployed, invoke the endpoint with the following code, which streams the response and prints it line by line.

In [None]:
import io
import time

client = boto3.client('sagemaker-runtime', region_name=REGION)
task = "Write a python function to compute the factorial of an integer."

custom_attributes = "c000b4f9-df62-4c85-a0bf-7c525f9104a4"  # An example of a trace ID.
content_type = "text/plain"                                 # The MIME type of the input data in the request body.
accept = "*/*"                                              # The desired MIME type of the inference in the response.

class Parser:
    """Buffers streamed bytes and yields complete, newline-terminated lines."""
    def __init__(self):
        self.buff = io.BytesIO()
        self.read_pos = 0

    def write(self, content):
        # Append the new chunk at the end of the buffer
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    def scan_lines(self):
        # Yield only complete lines; a trailing partial line stays buffered
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            if line.endswith(b'\n'):
                self.read_pos += len(line)
                yield line[:-1]

    def flush(self):
        # Return any bytes left after the last newline once the stream has ended
        self.buff.seek(self.read_pos)
        rest = self.buff.read()
        self.read_pos += len(rest)
        return rest

    def reset(self):
        self.read_pos = 0

start_time = time.time()
response = client.invoke_endpoint_with_response_stream(
    EndpointName=sagemaker_name,
    CustomAttributes=custom_attributes,
    ContentType=content_type,
    Accept=accept,
    Body=task)
# Time until the call returns; the streamed body is consumed in the loop below
print("--- %s seconds ---" % (time.time() - start_time))

parser = Parser()
for event in response['Body']:
    parser.write(event['PayloadPart']['Bytes'])
    for line in parser.scan_lines():
        print("\n", line.decode("utf-8"), end=' \n')

# Print whatever remains if the stream did not end with a newline
rest = parser.flush()
if rest:
    print("\n", rest.decode("utf-8"), end=' \n')
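
When the whole completion is needed as one string, for example to save it or post-process it, the payload parts can be joined before decoding. The sketch below is a hypothetical helper (`collect_stream`) that takes the same event iterable as `response['Body']`. Joining the raw bytes first avoids decode errors when a multi-byte UTF-8 character is split across two chunks.

```python
def collect_stream(events):
    """Concatenate the streamed payload parts and decode them as one UTF-8 string."""
    chunks = []
    for event in events:
        part = event.get("PayloadPart")
        if part is not None:
            chunks.append(part["Bytes"])
    return b"".join(chunks).decode("utf-8")

# Example usage (the event stream can be consumed only once):
# text = collect_stream(response['Body'])
# print(text)
```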

### Clean up

Once you have finished using the endpoint, delete it with the following call so that it stops incurring charges.

In [46]:
sm.delete_endpoint(EndpointName=sagemaker_name)

{'ResponseMetadata': {'RequestId': '7d244efc-c87a-494b-b095-ebf9c4983b10',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '7d244efc-c87a-494b-b095-ebf9c4983b10',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Fri, 08 Mar 2024 14:36:38 GMT'},
  'RetryAttempts': 0}}
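
Deleting the endpoint leaves its endpoint configuration and the SageMaker model in the account. The sketch below removes all three and could be run instead of the single call above; `delete_endpoint_resources` is a hypothetical helper that expects a boto3 SageMaker client and assumes each production variant references its model by `ModelName`.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint together with its endpoint configuration and models."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    # Look up the model names before anything is deleted
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)
    return config_name, model_names

# Example usage with the client created earlier:
# delete_endpoint_resources(sm, sagemaker_name)
```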