# Serve CodeLlama-7b on SageMaker using the LMI container
In this notebook, we deploy the [CodeLlama-7b](https://huggingface.co/codellama/CodeLlama-7b-hf) model on SageMaker by leveraging the [SageMaker Large Model Inference Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers). 

Code Llama is a family of large language models (LLMs) released by Meta that accept text prompts and can generate and discuss code. The release also includes two other variants (Code Llama Python and Code Llama Instruct) and several sizes (7B, 13B, 34B, and 70B).

For the purpose of this notebook, we'll use the weights from the following source:
https://huggingface.co/codellama/CodeLlama-7b-hf

However, you can use the same approach to deploy any other Code Llama weights.

For more information on Code Llama, refer to [the Code Llama publication page](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/).

This notebook explains how to deploy the model optimized for latency and throughput. For tuning guidance, see the [LLM Tuning Guide](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).

SageMaker provides two LMI containers, introduced as 0.25.0-deepspeed and 0.25.0-tensorrtllm (the code cells in this notebook request version 0.29.0). The DeepSpeed container includes DeepSpeed and the LMI Distributed Inference Library. The TensorRT-LLM container includes NVIDIA’s TensorRT-LLM library to accelerate LLM inference.

We recommend the deployment configuration illustrated in the following diagram.

![container](./images/container.png)


For additional background, see [this AWS blog post](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/).




## Install, import the required libraries; set some variables

In [1]:
%pip install sagemaker --upgrade  --quiet


Note: you may need to restart the kernel to use updated packages.


In [2]:
import json
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment (public attribute)
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


## Select the appropriate configuration parameters and container
To optimize the deployment of large language models (LLMs), you need to choose an appropriate model partitioning framework, batching technique, batch size, tensor parallelism degree, and so on. The right configuration depends on the use case.

Based on the use case, you need to:
1. Set the configuration parameters for the container.
2. Select the appropriate container image to be used for inference.
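
The two choices are coupled: the batching backend determines which container image is needed. A minimal sketch of that decision follows, using the rule of thumb this notebook applies later (more than 1024 output tokens favors `lmi-dist` on the DeepSpeed container). The helper name `choose_lmi_backend` and the `LmiChoice` type are hypothetical and exist only for illustration.

```python
from typing import NamedTuple


class LmiChoice(NamedTuple):
    rolling_batch: str  # value for OPTION_ROLLING_BATCH
    framework: str      # framework name passed to image_uris.retrieve


def choose_lmi_backend(max_new_tokens: int, long_output_threshold: int = 1024) -> LmiChoice:
    """Pick a batching backend and container from the expected output length."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    if max_new_tokens > long_output_threshold:
        # Long generations: the notebook's alternative configuration.
        return LmiChoice("lmi-dist", "djl-deepspeed")
    # Default configuration used in this notebook.
    return LmiChoice("trtllm", "djl-tensorrtllm")


choice = choose_lmi_backend(256)
print(choice.rolling_batch, choice.framework)
```

The threshold is a parameter because the 1024-token cutoff is a guideline, not a hard limit; benchmark with your own prompts before settling on a backend.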

### Set the configuration parameters using environment variables
1. `SERVING_LOAD_MODELS`: Specifies the engine used for this workload. In this case we host the model using the **MPI** engine, which allows the model server to start distributed processes to load and serve the model.

2. `OPTION_MODEL_ID`: Set this to the URI of the Amazon S3 location that contains the model. When this is set, the container uses [s5cmd](https://github.com/peak/s5cmd) to download the model from S3. This speeds up deployment by using an optimized transfer path within the DJL inference container to copy the model from S3 to the hosting instance.
To download the model from huggingface.co instead, set `OPTION_MODEL_ID` to the model id of a pre-trained model hosted in a model repository on huggingface.co (https://huggingface.co/models). The container uses this model id to download the corresponding model repository.

3. `OPTION_TENSOR_PARALLEL_DEGREE`: Set to the number of GPU devices over which the model is partitioned. This parameter also controls the number of workers per model that are started when DJL Serving runs. In this example we use the `ml.g5.4xlarge` instance, which has 1 GPU; the value `max` uses all the GPUs on the instance.

4. `OPTION_ROLLING_BATCH`: Enables a particular technique for continuous (iteration-level) batching, which merges concurrent requests that arrive at different times into the same inference batch. [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is a TensorRT toolbox for optimized large language model inference on NVIDIA GPUs. To use it, we set this parameter to `trtllm`.

5. `OPTION_MAX_ROLLING_BATCH_SIZE`: The maximum number of concurrent requests the model server batches together for inference. Clients can still send more requests to the endpoint; the excess requests are queued.


For more information on the available options, refer to [DJL Serving - SageMaker Large Model Inference Configurations](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/configurations_large_model_inference_containers.md).
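
To make the interaction between rolling batching and the maximum batch size concrete, here is a toy simulation. It is not how the container is implemented; it only assumes that every active request produces one token per iteration and that a queued request takes a slot as soon as one frees up. The function name `simulate_rolling_batch` is hypothetical.

```python
from collections import deque


def simulate_rolling_batch(tokens_per_request, max_batch_size):
    """Return {request index: iteration at which that request finishes}."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    queue = deque(enumerate(tokens_per_request))  # (request index, tokens still needed)
    active = {}                                    # request index -> tokens still needed
    finished_at = {}
    iteration = 0
    while queue or active:
        # Fill free slots from the queue before generating the next token.
        while queue and len(active) < max_batch_size:
            req_id, needed = queue.popleft()
            active[req_id] = needed
        iteration += 1
        # Every active request advances by one token in this iteration.
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] <= 0:
                finished_at[req_id] = iteration
                del active[req_id]
    return finished_at


finish = simulate_rolling_batch([3, 1, 2], max_batch_size=2)
print(finish)
```

In this run the third request joins the batch as soon as the one-token request completes, so it finishes at iteration 3. With a fixed batch of two it would have to wait for both earlier requests to finish first.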


In [3]:
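env_trtllm = {
    "HUGGINGFACE_HUB_CACHE": "/tmp",  # cache Hugging Face Hub downloads under /tmp
    "TRANSFORMERS_CACHE": "/tmp",
    "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",  # serve with the MPI engine
    "OPTION_MODEL_ID": "codellama/CodeLlama-7b-hf",  # Hugging Face Hub model id
    "OPTION_TRUST_REMOTE_CODE": "true",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",  # use every GPU on the instance
    "OPTION_ROLLING_BATCH": "trtllm",  # continuous batching with TensorRT-LLM
    "OPTION_MAX_ROLLING_BATCH_SIZE": "32",  # at most 32 concurrent requests per batch
    "OPTION_DTYPE": "fp16",
}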
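
As noted in the description of `OPTION_MODEL_ID`, the same variable can point at an S3 location instead of a Hub model id. The sketch below shows one way to derive such a configuration without mutating the original dictionary. The helper `with_s3_model` and the bucket name are hypothetical; substitute the location of your own copy of the weights.

```python
def with_s3_model(env: dict, s3_uri: str) -> dict:
    """Return a copy of env whose OPTION_MODEL_ID points at an S3 prefix."""
    if not s3_uri.startswith("s3://") or len(s3_uri) <= len("s3://"):
        raise ValueError(f"expected an s3:// URI, got {s3_uri!r}")
    updated = dict(env)  # leave the original configuration untouched
    updated["OPTION_MODEL_ID"] = s3_uri
    return updated


base_env = {
    "OPTION_MODEL_ID": "codellama/CodeLlama-7b-hf",
    "OPTION_ROLLING_BATCH": "trtllm",
}
# Hypothetical bucket and prefix, used only for illustration.
s3_env = with_s3_model(base_env, "s3://my-example-bucket/codellama-7b-hf/")
print(s3_env["OPTION_MODEL_ID"])
```

The endpoint's execution role needs read access to that S3 prefix for the download to succeed.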

We use the TensorRT-LLM container. For other containers, refer to the [Large Model Inference available DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) list.

In [4]:
trtllm_image_uri = image_uris.retrieve(
    framework="djl-tensorrtllm",
    region=sess.boto_session.region_name,
    version="0.29.0"
)

### When generating a large number of output tokens (> 1024), use the following configuration

The commented-out cell below sets `OPTION_ROLLING_BATCH` to `lmi-dist` and retrieves the DeepSpeed container image instead of the TensorRT-LLM one. Uncomment it, together with the matching lines in the selection cell that follows, to use this configuration.


In [5]:
# env_lmidist = {"HUGGINGFACE_HUB_CACHE": "/tmp",
#                "TRANSFORMERS_CACHE": "/tmp",
#                "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
#                "OPTION_MODEL_ID": "codellama/CodeLlama-7b-hf",
#                "OPTION_TRUST_REMOTE_CODE": "true",
#                "OPTION_TENSOR_PARALLEL_DEGREE": "max",
#                "OPTION_ROLLING_BATCH": "lmi-dist",
#                "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
#                "OPTION_DTYPE":"fp16"
#               }

# deepspeed_image_uri = image_uris.retrieve(
#     framework="djl-deepspeed", 
#     region=sess.boto_session.region_name, 
#     version="0.29.0"
# )

In [6]:
# Select the environment variables that tune the deployment server.
env = env_trtllm
# env = env_lmidist  # use this when generating more than 1024 tokens

# Select the matching container image.
# inference_image_uri = deepspeed_image_uri  # use this when generating more than 1024 tokens
inference_image_uri = trtllm_image_uri

print(f"Environment variables are ---- > {env}")
print(f"Image going to be used is ---- > {inference_image_uri}")

Environment variables are ---- > {'HUGGINGFACE_HUB_CACHE': '/tmp', 'TRANSFORMERS_CACHE': '/tmp', 'SERVING_LOAD_MODELS': 'test::MPI=/opt/ml/model', 'OPTION_MODEL_ID': 'codellama/CodeLlama-7b-hf', 'OPTION_TRUST_REMOTE_CODE': 'true', 'OPTION_TENSOR_PARALLEL_DEGREE': 'max', 'OPTION_ROLLING_BATCH': 'trtllm', 'OPTION_MAX_ROLLING_BATCH_SIZE': '32', 'OPTION_DTYPE': 'fp16'}
Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124
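
It can be useful to confirm that the retrieved image lives in the same region as the endpoint and carries the expected version tag. The sketch below splits an ECR image URI of the shape printed above into its parts. The regular expression is an assumption based on that one example and may not cover every partition or URI variant.

```python
import re

# Assumed layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_image_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repository, and tag."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognized ECR image URI: {uri!r}")
    return match.groupdict()


parts = parse_image_uri(
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124"
)
print(parts["region"], parts["repository"], parts["tag"])
```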


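To create the endpoint, the steps are:
- Create the model using the inference container image and the environment variables.

- Create the endpoint configuration, specifying the instance type, the instance count, and the container startup health check timeout.

- Create the endpoint from the endpoint configuration.

In this notebook we use the boto3 SDK. You can also use the [SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/).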

### Create the Model
Leverage the `inference_image_uri` to create a model object.

In [7]:
model_name = sagemaker.utils.name_from_base("lmi-codellama-7b-trtllm")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env,
    }
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

lmi-codellama-7b-trtllm-2024-10-17-11-30-00-638
Created Model: arn:aws:sagemaker:us-east-1:777200923596:model/lmi-codellama-7b-trtllm-2024-10-17-11-30-00-638
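
The endpoint configuration and endpoint names in the following cells are derived from the model name by appending a suffix, so a long base name can push them past the service's name limit. The sketch below checks the derived names up front. It assumes the 63-character, alphanumeric-and-hyphen constraint that the SageMaker API reference gives for these names; verify it against the current documentation. The helper `derived_names` is hypothetical.

```python
import re

# Assumed constraint: starts and ends with an alphanumeric character, hyphens in between.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_NAME_LENGTH = 63  # assumed limit for model, endpoint config, and endpoint names


def derived_names(model_name: str) -> dict:
    """Build the three resource names and validate each one."""
    names = {
        "model": model_name,
        "endpoint_config": f"{model_name}-config",
        "endpoint": f"{model_name}-endpoint",
    }
    for kind, name in names.items():
        if len(name) > MAX_NAME_LENGTH or not NAME_PATTERN.match(name):
            raise ValueError(f"invalid {kind} name ({len(name)} chars): {name}")
    return names


names = derived_names("lmi-codellama-7b-trtllm-2024-10-17-11-30-00-638")
print(names["endpoint"], len(names["endpoint"]))
```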


### Create an endpoint config
Create an endpoint configuration using the appropriate instance type. Set `ContainerStartupHealthCheckTimeoutInSeconds` to cover both the time taken to download the LLM weights from S3 or the model hub and the time taken to load the model onto the GPUs.

In [8]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.4xlarge",
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:777200923596:endpoint-config/lmi-codellama-7b-trtllm-2024-10-17-11-30-00-638-config',
 'ResponseMetadata': {'RequestId': 'd9171c23-c152-491c-ac0e-a2de724c5f02',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'd9171c23-c152-491c-ac0e-a2de724c5f02',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '135',
   'date': 'Thu, 17 Oct 2024 11:30:01 GMT'},
  'RetryAttempts': 0}}
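
As a rough check on the instance type chosen above, the arithmetic below estimates the size of the weights alone. It assumes 7 billion parameters (the nominal "7B"), the `fp16` setting from `OPTION_DTYPE`, and about 24 GB of memory on the instance's single GPU. The GPU memory figure is an assumption not stated in this notebook, so check it against the instance specifications.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_footprint_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the model weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumptions: nominal parameter count and GPU memory size (see the text above).
weights_gb = weight_footprint_gb(7e9, "fp16")
gpu_memory_gb = 24
headroom_gb = gpu_memory_gb - weights_gb
print(f"weights ~{weights_gb:.0f} GB, headroom ~{headroom_gb:.0f} GB")
```

The headroom is what remains for the KV cache, activations, and runtime overhead, which grow with `OPTION_MAX_ROLLING_BATCH_SIZE` and sequence length. The estimate is a lower bound on memory use, not a guarantee that a given batch size fits.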

### Create an endpoint using the model and endpoint config

In [9]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-east-1:777200923596:endpoint/lmi-codellama-7b-trtllm-2024-10-17-11-30-00-638-endpoint


#### This step can take about 15 minutes or longer

In [10]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:777200923596:endpoint/lmi-codellama-7b-trtllm-2024-10-17-11-30-00-638-endpoint
Status: InService
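
The loop above exits on any status other than `Creating`, so a failed deployment ends silently with `Failed` printed as the last status. A slightly more defensive variant is sketched below: it raises on failure and gives up after a deadline. The function `wait_for_endpoint` is hypothetical, and a stand-in replaces `sm_client.describe_endpoint` so the sketch runs without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, timeout_s=3600, poll_s=60, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            reason = resp.get("FailureReason", "no reason reported")
            raise RuntimeError(f"endpoint entered status {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint still Creating after {timeout_s} seconds")
        sleep(poll_s)


# Stand-in for sm_client.describe_endpoint, returning a scripted status sequence.
_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointName": EndpointName, "EndpointStatus": next(_statuses)}


result = wait_for_endpoint(fake_describe, "demo-endpoint", poll_s=0)
print(result["EndpointStatus"])
```

In the notebook, the call would be `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`.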


### Invoke the endpoint with a sample prompt

In [11]:
prompt = """import socket \n def ping_exponential_backoff(host: str):"""
params = {"max_new_tokens": 256, "temperature": 0.1}

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompt,
            "parameters": params
        }
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

'{"generated_text": "\\n    \\"\\"\\"\\n    Ping a host with exponential backoff.\\n    \\"\\"\\"\\n    # Setup the socket\\n    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)\\n    s.settimeout(1)\\n\\n    # Setup the exponential backoff\\n    backoff = 1\\n    max_backoff = 10\\n\\n    # Try to connect to the host\\n    while True:\\n        try:\\n            s.connect((host, 80))\\n            break\\n        except socket.error:\\n            print(\\"Connection failed. Retrying in {} seconds\\".format(backoff))\\n            time.sleep(backoff)\\n            backoff *= 2\\n            if backoff > max_backoff:\\n                backoff = max_backoff\\n\\n    # Close the socket\\n    s.close()\\n\\n    # Print a success message\\n    print(\\"Connection successful!\\")\\n\\n\\ndef ping_exponential_backoff_with_timeout(host: str, timeout: int):\\n    \\"\\"\\"\\n    Ping a host with exponential backoff and a timeout.\\n    \\"\\"\\"\\n"}'
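
The raw response is a JSON string, so the generated code is easier to read once it is decoded. The sketch below assumes the response shape shown above, a single object with a `generated_text` field, and uses a shortened sample body rather than the real endpoint output. The helper `extract_completion` is hypothetical.

```python
import json


def extract_completion(raw_body: str) -> str:
    """Decode the response body and return the generated text."""
    payload = json.loads(raw_body)
    return payload["generated_text"]


# Shortened sample body in the same shape as the endpoint response.
raw = '{"generated_text": "\\n    return 42\\n"}'
sample_prompt = "def answer():"
completion = extract_completion(raw)
print(sample_prompt + completion)
```

Because the body stream can only be read once, store the result of `response_model["Body"].read()` in a variable before decoding it.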

## Clean up

In [12]:
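# Uncomment the lines below to delete the endpoint, endpoint configuration, and model
# when you are done, so that the instance stops incurring charges.
# sm_client.delete_endpoint(EndpointName=endpoint_name)
# sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
# sm_client.delete_model(ModelName=model_name)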
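
If one of the delete calls fails, the remaining resources are easy to forget. The sketch below attempts all three deletions in order and reports which ones failed instead of stopping at the first error. The helper `delete_resources` is hypothetical, and a recording stand-in replaces the boto3 client so the sketch runs without AWS access.

```python
def delete_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete endpoint, endpoint config, and model; return a list of (label, error)."""
    steps = [
        ("endpoint", sm_client.delete_endpoint, {"EndpointName": endpoint_name}),
        ("endpoint config", sm_client.delete_endpoint_config,
         {"EndpointConfigName": endpoint_config_name}),
        ("model", sm_client.delete_model, {"ModelName": model_name}),
    ]
    failures = []
    for label, call, kwargs in steps:
        try:
            call(**kwargs)
        except Exception as exc:  # keep going so later resources are still removed
            failures.append((label, exc))
    return failures


class RecordingClient:
    """Stand-in for boto3.client("sagemaker") that only records the calls."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


client = RecordingClient()
failures = delete_resources(client, "demo-endpoint", "demo-config", "demo-model")
print(client.calls, failures)
```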