# Deploy a fine-tuned TinyLlama-1.1B model for generative AI inference

## Introduction

In this workshop module, you will learn how to deploy an LLM to an [Amazon EC2 Inf2 instance](https://aws.amazon.com/ec2/instance-types/inf2/) for generative AI inference.
You will use Amazon SageMaker with [Deep learning containers for large model inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html) to deploy the model fine-tuned in the previous workshop module. SageMaker provides fully managed options for deploying models in real-time or batch mode, and AWS Inferentia accelerators are purpose-built to lower the cost per inference.

## Prerequisites

This notebook uses the SageMaker Python SDK to deploy a fine-tuned model with the SageMaker hosting service. Before getting started, upgrade the SageMaker SDK so that you are using the latest version. Run the next two cells to upgrade the SDK and set up your session.

In [None]:
# Upgrade SageMaker SDK to the latest version
%pip install -U sagemaker -q

In [None]:
import logging

# Only show warnings and errors from the SageMaker config logger
sagemaker_config_logger = logging.getLogger("sagemaker.config")
sagemaker_config_logger.setLevel(logging.WARNING)

# Import SageMaker SDK, setup our session
import sagemaker
from sagemaker import Model, image_uris, serializers

sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
role = sagemaker.get_execution_role()  # execution role for the endpoint

In [None]:
role

## Specify the LMI container image

[SageMaker LMI containers](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-dlc.html) use [DJLServing](https://github.com/deepjavalibrary/djl-serving), a model server that is integrated with the [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx) library to support tensor parallelism across NeuronCores. The DJL model server and transformers-neuronx library serve as core components of the container, which also includes the [Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/). This setup facilitates the loading of models onto [AWS Inferentia2](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inferentia2.html) accelerators, parallelizes the model across multiple [NeuronCores](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html#neuroncores-v2-arch), and enables serving via HTTP endpoints.

In [None]:
image_uri = image_uris.retrieve(
    framework="djl-neuronx",
    region=sess.boto_session.region_name,
    version="0.24.0",
)
image_uri

## Prepare Model Serving Artifacts

The LMI container supports loading models from an Amazon Simple Storage Service (Amazon S3) bucket or from the Hugging Face Hub. The parameters required to load and host the model are defined in the *`serving.properties`* file.

In the following cell, update *`option.model_id`* with the S3 path you copied from the previous workshop module, where the fine-tuned model artifact is stored. It should look something like this:
```
option.model_id=s3://sagemaker-us-west-2-xxxxxxxxxxxx/reinvent2023/trn1-tinyllama-2023-11-xx-xx-xx-xx-xxx/output/model
```
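Before pasting the path into *`serving.properties`*, it can help to sanity-check its shape. The sketch below is a hypothetical helper (`split_s3_uri` is not part of the SageMaker SDK) that splits an `s3://` URI into its bucket and key prefix using only the standard library. The path used is the placeholder shown above, not a real location.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key_prefix). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder path in the same shape as the example above
example_uri = (
    "s3://sagemaker-us-west-2-xxxxxxxxxxxx/reinvent2023/"
    "trn1-tinyllama-2023-11-xx-xx-xx-xx-xxx/output/model"
)
bucket_name, key_prefix = split_s3_uri(example_uri)
print(bucket_name)
print(key_prefix)
```

A value that raises `ValueError` here is most likely a Hugging Face model name or a mistyped path rather than an S3 location.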

In [None]:
%%writefile serving.properties
engine=Python
option.entryPoint=djl_python.transformers_neuronx
option.model_id=s3://sagemaker-us-east-2-161949406237/reinvent2023/trn1-tinyllama-2023-12-07-11-34-34-726/output/model
option.batch_size=1
option.neuron_optimize_level=1
option.tensor_parallel_degree=2
option.load_in_8bit=false
option.n_positions=512
option.rolling_batch=auto
option.dtype=fp16
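If you expect to switch `option.model_id` between runs, the same file content can be rendered from a Python dictionary instead of being edited by hand. This is a minimal sketch: `render_properties` is a hypothetical helper, and the `model_id` value is a placeholder that you would replace with your own path. The other keys and values are copied from the cell above.

```python
def render_properties(options: dict) -> str:
    """Render key=value lines in insertion order. Hypothetical helper."""
    lines = []
    for key, value in options.items():
        # The file above writes booleans in lowercase (for example, false)
        if isinstance(value, bool):
            value = str(value).lower()
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


options = {
    "engine": "Python",
    "option.entryPoint": "djl_python.transformers_neuronx",
    # Placeholder: replace with the S3 path of your fine-tuned model
    "option.model_id": "s3://<your-bucket>/<your-prefix>/output/model",
    "option.batch_size": 1,
    "option.neuron_optimize_level": 1,
    "option.tensor_parallel_degree": 2,
    "option.load_in_8bit": False,
    "option.n_positions": 512,
    "option.rolling_batch": "auto",
    "option.dtype": "fp16",
}

print(render_properties(options))
```

The rendered text can then be written to `serving.properties` with `open("serving.properties", "w")` in place of the `%%writefile` cell.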

Construct the tarball containing *`serving.properties`* and upload it to an S3 bucket. 

In [None]:
%%sh
mkdir -p mycode
mv serving.properties mycode/
tar czvf mycode.tar.gz mycode/
rm -rf mycode
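The shell commands above can also be expressed with Python's standard `tarfile` module, which avoids the temporary `mycode` directory in the working folder. This is a sketch under assumptions: `make_code_tarball` is a hypothetical helper, and unlike the `tar` command it stores only the file entry, not a separate entry for the `mycode/` directory.

```python
import tarfile
import tempfile
from pathlib import Path


def make_code_tarball(properties_text: str, out_path: Path) -> list:
    """Package serving.properties under mycode/ and return the archive names."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "serving.properties"
        src.write_text(properties_text)
        with tarfile.open(out_path, "w:gz") as tar:
            # arcname reproduces the mycode/serving.properties layout
            tar.add(src, arcname="mycode/serving.properties")
    # Re-open the archive to confirm what was written
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as workdir:
    names = make_code_tarball("engine=Python\n", Path(workdir) / "mycode.tar.gz")
    print(names)
```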

In [None]:
s3_code_prefix = "reinvent2023/large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mycode.tar.gz", bucket, s3_code_prefix)
print(f"Code uploaded to --- > {code_artifact}")

## Create SageMaker Endpoint
Next, we create the SageMaker endpoint with the model configuration defined earlier. We use the `ml.inf2.xlarge` instance, which contains a single Inferentia2 accelerator with 2 NeuronCores. Deployment usually takes 4-5 minutes because the model is compiled during the process.

In [None]:
instance_type = "ml.inf2.xlarge"
endpoint_name = sagemaker.utils.name_from_base("tinyllama-finetuned-model")

In [None]:
model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

# Private SDK attribute: marks the model as already compiled for Inferentia
model._is_compiled_model = True

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=500,  # seconds
    model_data_download_timeout=1800,  # seconds
    volume_size=256,  # GB
    endpoint_name=endpoint_name,
)
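`deploy()` blocks until the endpoint is ready by default, but if you check on the endpoint from another session, a small polling loop can be useful. The sketch below is one possible approach: `wait_for_endpoint` and `FakeClient` are hypothetical names, and the fake client only exists so the example runs without AWS credentials. The `describe_endpoint` call and its `EndpointStatus` field belong to the SageMaker API.

```python
import time


def wait_for_endpoint(client, name, poll_seconds=30, timeout_seconds=3600):
    """Poll describe_endpoint until the status is InService or Failed."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.describe_endpoint(EndpointName=name)["EndpointStatus"]
        if status in ("InService", "Failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {name!r} still {status} after timeout")
        time.sleep(poll_seconds)


class FakeClient:
    """Stand-in for a SageMaker client so the sketch runs offline."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointStatus": next(self._statuses)}


fake = FakeClient(["Creating", "Creating", "InService"])
print(wait_for_endpoint(fake, "demo-endpoint", poll_seconds=0))
```

In the notebook, you could pass `sess.sagemaker_client` and `endpoint_name` in place of the fake client and the demo name.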

## Inference tests
After the SageMaker endpoint has been created, we can make real-time predictions against it using the Predictor object:
- Create a predictor to submit inference requests and receive responses
- Requests and responses are in JSON format

In [None]:
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer()
)

Let's submit an inference request to the model server and receive the inference result.

In [None]:
review_text = "I couldn't believe this was the same director as Antonia's Line.<br /><br />This film has it all, \
a boring plot, disjointed flashbacks, a subplot that has nothing to do with the main plot what so ever, \
and totally uninteresting characters.It was painful to watch. Soooo, painful."

In [None]:
prompt = f"###Query: Classify the following movie review as positive or negative\n \
###Review: {review_text}\n \
###Classification:"
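Because the same template is reused for every review, it can be wrapped in a function. Note that the backslash line continuations in the cell above leave a single space after each `\n`; the sketch below keeps that space, since the fine-tuned model presumably saw the same layout during training. `build_classification_prompt` is a hypothetical helper name.

```python
def build_classification_prompt(review_text: str) -> str:
    """Return the prompt in the same layout as the cell above."""
    return (
        "###Query: Classify the following movie review as positive or negative\n"
        f" ###Review: {review_text}\n"
        " ###Classification:"
    )


# repr() makes the newlines and the spaces that follow them visible
print(repr(build_classification_prompt("Great movie!")))
```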

In [None]:
result = predictor.predict(
    {"inputs": prompt, "parameters": {"max_new_tokens":32, "do_sample":"true"}}
)
result
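The predictor is created without a deserializer, so `result` is the raw response body. A minimal sketch for pulling out the generated text is shown below. `parse_generated_text` is a hypothetical helper, it assumes the response is a JSON object with a `generated_text` field, and the sample bytes are made up for illustration rather than real model output.

```python
import json


def parse_generated_text(raw) -> str:
    """Decode a JSON response body and return its generated_text field."""
    if isinstance(raw, (bytes, bytearray)):
        raw = raw.decode("utf-8")
    return json.loads(raw)["generated_text"]


# Made-up response body, used only to exercise the helper
sample_response = b'{"generated_text": " negative"}'
print(parse_generated_text(sample_response))
```

In the notebook, `parse_generated_text(result)` would take the place of reading the raw bytes by eye.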

In [None]:
review_text = "This movie is one of my all-time favorites. I think that Sean Penn did a great job acting. \
It is one of the few true stories that made it to film that I really like. It is in my top 10 films of all-time. \
I watch it over and over and never get tired of it. Great movie!"

In [None]:
prompt = f"###Query: Classify the following movie review as positive or negative\n \
###Review: {review_text}\n \
###Classification:"

In [None]:
result = predictor.predict(
    {"inputs": prompt, "parameters": {"max_new_tokens":32, "do_sample":"true"}}
)
result
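With `max_new_tokens` set to 32, the model may keep generating after it has produced the label. One way to reduce the output to a single label is to take the first occurrence of either word. This is a sketch under that assumption: `extract_label` is a hypothetical helper, and the example strings are invented, not real model output.

```python
import re
from typing import Optional


def extract_label(generated_text: str) -> Optional[str]:
    """Return the first 'positive' or 'negative' found, or None."""
    match = re.search(r"\b(positive|negative)\b", generated_text.lower())
    return match.group(1) if match else None


# Invented examples of what a completion might look like
print(extract_label(" Positive\n###Query: Classify"))
print(extract_label("I am not sure"))
```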

## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()

Congratulations on completing the LLM deployment for the inference module!

## (Optional) Deploy original TinyLlama model from Hugging Face hub

If you have spare time, you can also try the optional step of deploying the original TinyLlama model from the [Hugging Face hub](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.4) for even more fun!

In this scenario, you set the *`model_id`* parameter to the name of the Hugging Face model so that it is downloaded directly from the Hugging Face repo. The remaining steps are the same as before.

In [None]:
image_uri = image_uris.retrieve(
    framework="djl-neuronx",
    region=sess.boto_session.region_name,
    version="0.24.0",
)
image_uri

In [None]:
%%writefile serving.properties
engine=Python
option.entryPoint=djl_python.transformers_neuronx
option.model_id=TinyLlama/TinyLlama-1.1B-Chat-v0.4
option.batch_size=1
option.neuron_optimize_level=1
option.tensor_parallel_degree=2
option.load_in_8bit=false
option.n_positions=512
option.rolling_batch=auto
option.dtype=fp16

In [None]:
%%sh
mkdir mycode
mv serving.properties mycode/
tar czvf mycode.tar.gz mycode/
rm -rf mycode

In [None]:
s3_code_prefix = "reinvent2023/large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mycode.tar.gz", bucket, s3_code_prefix)
print(f"Code uploaded to --- > {code_artifact}")

In [None]:
instance_type = "ml.inf2.xlarge"
endpoint_name = sagemaker.utils.name_from_base("tinyllama-original-model")

In [None]:
model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

model._is_compiled_model = True

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             container_startup_health_check_timeout=500,
             volume_size=256,
             endpoint_name=endpoint_name)

In [None]:
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer()
)

In [None]:
prompt = "How to get in a good university?"
formatted_prompt = (
    f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
)
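The template above wraps a single user turn in `<|im_start|>` and `<|im_end|>` markers. If you want to experiment with several turns, the same pattern can be generalized as sketched below. `format_chat_prompt` is a hypothetical helper, and the multi-turn layout is an extrapolation from the single-turn template, not something this notebook confirms for the model.

```python
from typing import List, Tuple


def format_chat_prompt(turns: List[Tuple[str, str]]) -> str:
    """Wrap each (role, content) turn in chat markers and open an assistant turn."""
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in turns]
    # The trailing open assistant turn asks the model to produce the reply
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


single_turn = format_chat_prompt([("user", "How to get in a good university?")])
print(repr(single_turn))
```

For a single user turn, this produces the same string as the `formatted_prompt` cell above.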

In [None]:
result = predictor.predict(
    {"inputs": formatted_prompt, "parameters": {"max_new_tokens":512, "do_sample":"true"}}
)
result

In [None]:
prompt = f"###Query: Classify the following movie review as positive or negative\n \
###Review: {review_text}\n \
###Classification:"

In [None]:
result = predictor.predict(
    {"inputs": prompt, "parameters": {"max_new_tokens":100, "do_sample":"true"}}
)

In [None]:
import json
print(json.loads(result)["generated_text"])

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()