# Deploying the fine-tuned model on Inf2

Before running the notebook, make sure that:

- Your fine-tuned model has been saved to an S3 bucket
- You have SageMaker access

## Step 1: Upgrade the SageMaker SDK and import dependencies

In [1]:
%pip install sagemaker --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers
import jinja2
from pathlib import Path
import json

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

jinja_env = jinja2.Environment()
code_dir = "llama2_13b_inf2_src"

smr_client = boto3.client("sagemaker-runtime")

#load saved parameters
%store -r model_data_s3_location
%store -r model_name

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## Step 2: Start preparing model artifacts
The LMI (Large Model Inference) container expects a few artifacts that tell it how to set up the model. `serving.properties` defines the key parameters, such as `tensor_parallel_degree` and `model_id`. In our case, `model_id` is the S3 location of our fine-tuned model. Keep in mind that [compilation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/index.html) can take a long time for large models. To avoid a SageMaker hosting timeout error, we recommend precompiling the model for Inf2 and saving the compiled model to S3.



In [5]:
!rm -rf {code_dir}
!mkdir -p {code_dir}

In [6]:
%%writefile {code_dir}/serving.properties
engine=Python
option.entryPoint=djl_python.transformers_neuronx
option.model_id={{s3_url}}
option.batch_size=4
option.neuron_optimize_level=2
option.tensor_parallel_degree=8
option.n_positions=512
option.rolling_batch=auto
option.dtype=fp16
option.model_loading_timeout=1500

Writing llama2_13b_inf2_src/serving.properties
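
Before packaging the file, it can help to read the settings back and confirm that the numeric options parse as expected. The sketch below is a minimal `key=value` parser written for illustration. `parse_properties` and `SAMPLE` are hypothetical names, the sample only mirrors a few of the options written above, and the parser does not attempt to cover the full Java properties syntax.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


# A few of the options from the cell above, copied for illustration only.
SAMPLE = """\
engine=Python
option.model_id={{s3_url}}
option.batch_size=4
option.tensor_parallel_degree=8
option.n_positions=512
"""

props = parse_properties(SAMPLE)
tp_degree = int(props["option.tensor_parallel_degree"])
batch_size = int(props["option.batch_size"])
print(f"tensor_parallel_degree={tp_degree}, batch_size={batch_size}")
```

In the notebook itself you would pass `Path(f"{code_dir}/serving.properties").read_text()` instead of `SAMPLE`.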


Substitute the model location into the `serving.properties` template. The S3 URL comes from `model_data_s3_location`, restored in Step 1, and depends on the region in which the notebook is executed.

In [7]:
# Render the {{s3_url}} placeholder in place; read_text/write_text close the file handles for us
properties_path = Path(code_dir) / "serving.properties"
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(s3_url=model_data_s3_location))
!pygmentize {code_dir}/serving.properties | cat -n

     1	engine=Python
     2	option.entryPoint=djl_python.transformers_neuronx
     3	option.model_id=s3://sagemaker-us-west-2-376678947624/NousResearch/Llama-2-13b-hf-qlora/models
     4	option.batch_size=4
     5	option.neuron_optimize_level=2
     6	option.tensor_parallel_degree=8
     7	option.n_positions=512
     8	option.rolling_batch=auto
     9	option.dtype=fp16
    10	option.model_loading_timeout=1500
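
A rendered file that still contains a `{{...}}` placeholder, or whose `model_id` is not an S3 URL, will only fail later, when the endpoint starts. The check below is a small sketch of how to catch that early. `unrendered_placeholders` and `check_rendered` are hypothetical helpers, and the bucket name in the example is made up.

```python
import re

# Matches any Jinja-style {{ ... }} expression left in the text.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")


def unrendered_placeholders(text: str) -> list[str]:
    """Return every {{...}} placeholder that is still present in text."""
    return PLACEHOLDER.findall(text)


def check_rendered(text: str) -> None:
    """Raise if the properties text is not ready to be packaged."""
    leftovers = unrendered_placeholders(text)
    if leftovers:
        raise ValueError(f"unrendered placeholders: {leftovers}")
    for line in text.splitlines():
        if line.startswith("option.model_id=") and not line.split("=", 1)[1].startswith("s3://"):
            raise ValueError(f"model_id is not an S3 URL: {line!r}")


before = "engine=Python\noption.model_id={{s3_url}}\n"
after = "engine=Python\noption.model_id=s3://example-bucket/path/to/model\n"  # hypothetical bucket

print(unrendered_placeholders(before))
check_rendered(after)  # passes silently
```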


### Create a model.tar.gz with the model artifacts

In [8]:
code_file_name = "llama2_13b_inf2_code.tar.gz"
!tar czvf {code_file_name} {code_dir}/

llama2_13b_inf2_src/
llama2_13b_inf2_src/serving.properties
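
If you prefer to stay in Python rather than shell out to `tar`, the standard-library `tarfile` module can build the same kind of archive. The sketch below runs against a temporary directory so that it is self-contained. `make_code_archive` is a hypothetical helper, and the one-line `serving.properties` is a stand-in for the real file.

```python
import tarfile
import tempfile
from pathlib import Path


def make_code_archive(src_dir: Path, archive_path: Path) -> list[str]:
    """Gzip-tar src_dir under its own directory name and return the member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=src_dir.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "llama2_13b_inf2_src"
    src.mkdir()
    (src / "serving.properties").write_text("engine=Python\n")  # stand-in content
    names = make_code_archive(src, Path(tmp) / "llama2_13b_inf2_code.tar.gz")

print(names)
```

Listing the members after writing mirrors the verbose output of `tar czvf` and confirms that `serving.properties` sits inside the source directory.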


### Upload the artifact to S3 and create the SageMaker model

In [9]:
s3_code_prefix = f"{model_name}/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data(code_file_name, bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-376678947624/NousResearch/Llama-2-13b-hf/code/llama2_13b_inf2_code.tar.gz
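
`upload_data` returns a full `s3://` URI. Some boto3 calls want the bucket and key separately, and splitting the URI is straightforward with the standard library. This is an illustrative sketch: `split_s3_uri` is a hypothetical helper and the URI below uses a made-up bucket.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI with the same shape as the one printed above.
bucket_name, key = split_s3_uri(
    "s3://example-bucket/some-model/code/llama2_13b_inf2_code.tar.gz"
)
print(bucket_name)
print(key)
```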


## Step 3: Start building SageMaker endpoint
In this step, we build a SageMaker endpoint from scratch.

### Getting the container image URI

The available Large Model Inference (LMI) deep learning containers are listed here: [Large Model Inference available DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers)


In [10]:
image_uri = image_uris.retrieve(
        framework="djl-neuronx",
        region=sess.boto_session.region_name,
        version="0.24.0"
    )

### Create the SageMaker endpoint

Specify the instance type and the endpoint name. The LMI container automatically compiles your model using the Neuron SDK, which may take up to 30 minutes.

In [11]:
instance_type = "ml.inf2.48xlarge"
endpoint_name = sagemaker.utils.name_from_base(f"{model_name.split('/')[-1]}")

print(endpoint_name)

# Create a Model object with the image and model data
model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             container_startup_health_check_timeout=1500,
             volume_size=256,
             endpoint_name=endpoint_name)

# our requests and responses will be in json format so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

Your model is not compiled. Please compile your model before using Inferentia.


Llama-2-13b-hf-2023-12-20-00-42-38-537
------------------------------------------------------!
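
`model.deploy` blocks until the endpoint is ready, which is what the row of dashes shows. If you deploy without waiting, or come back to the notebook later, you can poll the endpoint status yourself. The loop below is a sketch under a few assumptions: `wait_for_endpoint` is a hypothetical helper, and it expects a callable shaped like the boto3 SageMaker client's `describe_endpoint`, which returns a dict containing `EndpointStatus`. A fake callable is used here so that the example runs without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=30.0,
                      timeout_seconds=3600.0, sleep=time.sleep):
    """Poll describe(EndpointName=...) until the endpoint is InService."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint {endpoint_name} still {status} after timeout")
        sleep(poll_seconds)


# Fake describe call that reports two 'Creating' states before 'InService'.
_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointName": EndpointName, "EndpointStatus": next(_statuses)}


final_status = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda s: None)
print(final_status)
```

Against a real account you would pass `boto3.client("sagemaker").describe_endpoint` as `describe`. The boto3 client also ships an `endpoint_in_service` waiter that covers the same need.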

## Step 4: Test a Llama 2 instruction prompt

### Real-time invocation

In [12]:
def get_realtime_response(sagemaker_runtime, endpoint_name, payload):
    """Query the endpoint and return the raw response"""

    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
        CustomAttributes='accept_eula=true'
    )
    
    return response

In [13]:
def build_llama2_prompt(instructions):
    stop_token = "</s>"
    start_token = "<s>"
    startPrompt = f"{start_token}[INST] "
    endPrompt = " [/INST]"
    conversation = []
    for index, instruction in enumerate(instructions):
        if instruction["role"] == "system" and index == 0:
            conversation.append(f"<<SYS>>\n{instruction['content']}\n<</SYS>>\n\n")
        elif instruction["role"] == "user":
            conversation.append(instruction["content"].strip())
        else:
            conversation.append(f"{endPrompt} {instruction['content'].strip()} {stop_token}{startPrompt}")

    return startPrompt + "".join(conversation) + endPrompt

def get_instructions(user_content):
    
    '''
    Note: We build a fresh instruction list for every user_content.
    This keeps earlier user_content out of the prompt when you run inference several times with a new ask each time.
    ''' 
    
    system_content = '''
    You are a friendly assistant. Your goal is to answer user questions.'''

    instructions = [
        { "role": "system","content": f"{system_content} "},
    ]
    
    instructions.append({"role": "user", "content": f"{user_content}"})
    
    return instructions
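
For a single system message followed by a single user message, the prompt built by the helpers above reduces to one fixed template. The sketch below spells that template out so the expected string is easy to inspect. `single_turn_prompt` is a hypothetical helper, and the system text is shortened for readability.

```python
def single_turn_prompt(system: str, user: str) -> str:
    """Llama 2 chat format for one system message and one user message."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user.strip()} [/INST]"


example = single_turn_prompt(
    "You are a friendly assistant.",
    "What is machine learning?",
)
print(example)
```

Multi-turn conversations add an assistant reply followed by `</s><s>[INST] ` before the next user message, which is what the `else` branch of `build_llama2_prompt` handles.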

In [14]:
user_ask = "What is machine learning?"
instructions = get_instructions(user_ask)
prompt = build_llama2_prompt(instructions)


inference_params = {
        "do_sample": True,
        "top_p": 0.6,
        "temperature": 1.0,
        "top_k": 50,
        "max_new_tokens": 100,
        "repetition_penalty": 1.03,
        "stop": ["</s>"],
        "return_full_text": False
    }


payload = {
    "inputs":  prompt,
    "parameters": inference_params,
}

In [15]:
%%time
response = get_realtime_response(smr_client, endpoint_name, payload)


generated_text = response["Body"].read().decode("utf8")
print(generated_text)

{"generated_text": "\n\n[/INST]\n\n[SYS] <</SYS>>\n\nThe term “machine learning” refers to a type of artificial intelligence (AI) that provides computers with the ability to learn and improve from experience without being explicitly programmed.\n\nMachine learning focuses on the development of computer programs that can access data and use it learn for themselves.\n\nMachine learning algorithms are used in a wide range of applications, including:\n\n- Speech recognition\n-"}
CPU times: user 17 ms, sys: 0 ns, total: 17 ms
Wall time: 2.69 s
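
The response body is a JSON document with a `generated_text` field, so the text can be pulled out with `json.loads` rather than printed raw. A minimal sketch follows; `extract_generated_text` is a hypothetical helper and `sample_body` is a shortened, made-up body with the same shape as the output above.

```python
import json


def extract_generated_text(body: str) -> str:
    """Return the generated_text field of a JSON response body, trimmed."""
    data = json.loads(body)
    return data["generated_text"].strip()


# Made-up body; the doubled backslashes keep the \n escapes inside the JSON string.
sample_body = '{"generated_text": "\\n\\nMachine learning is a type of AI.\\n"}'

text = extract_generated_text(sample_body)
print(text)
```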


### Stream Response

In [None]:
import sys, os
module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils.LineIterator import LineIterator

def print_response_stream(response_stream):
    event_stream = response_stream.get('Body')
    for line in LineIterator(event_stream):
        print(line, end='')

def get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload):
    response_stream = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload), 
        ContentType="application/json",
        CustomAttributes='accept_eula=true'
    )
    return response_stream

In [None]:
%%time
resp = get_realtime_response_stream(smr_client, endpoint_name, payload)
print_response_stream(resp)
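
The `utils.LineIterator` module is not shown in this notebook. As a rough idea of what such a helper has to do, the sketch below buffers the `PayloadPart` bytes from the event stream and yields one decoded line at a time, because a single line may be split across several events. `iter_lines` is a hypothetical stand-in, not the actual `LineIterator` implementation, and the `token` field in the fake events is invented for the demonstration.

```python
def iter_lines(event_stream):
    """Yield complete newline-terminated lines from a stream of PayloadPart events."""
    buffer = b""
    for event in event_stream:
        part = event.get("PayloadPart")
        if not part:
            continue  # skip events that carry no payload bytes
        buffer += part["Bytes"]
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line.decode("utf-8")
    if buffer:
        # flush a final line that was not newline-terminated
        yield buffer.decode("utf-8")


# Fake events: the first JSON line is split across two chunks.
fake_events = [
    {"PayloadPart": {"Bytes": b'{"token": "Mach'}},
    {"PayloadPart": {"Bytes": b'ine"}\n{"token": " learn'}},
    {"PayloadPart": {"Bytes": b'ing"}\n'}},
]

lines = list(iter_lines(fake_events))
for line in lines:
    print(line)
```

Splitting on the newline byte before decoding is safe for UTF-8, since the byte `0x0A` never appears inside a multi-byte character.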

## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()