# Deploying Encoder-Decoder Models with Triton TensorRT-LLM on Amazon SageMaker

This notebook shows how to optimize encoder-decoder models such as T5 and BART with NVIDIA TensorRT-LLM and then deploy them with Triton Inference Server on Amazon SageMaker. The TensorRT-LLM library accelerates LLM inference on NVIDIA GPUs. The Triton Inference Server backend for TensorRT-LLM uses the TensorRT-LLM C++ runtime for high-performance inference execution and includes techniques such as in-flight batching and paged KV caching, which provide high throughput at low latency. The TensorRT-LLM backend is bundled with Triton Inference Server and is available as a pre-built container (`xx.yy-trtllm-python-py3`) on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags).

This notebook was tested with the `conda_python3` kernel on an Amazon SageMaker notebook instance of type `g5.xlarge`.

## Set up the environment
Install the dependencies required to package the model and run inference against the Triton server.

Also define the IAM role that gives SageMaker access to the model artifacts and the NVIDIA Triton TRT-LLM ECR image.

In [1]:
!pip install -qU awscli boto3 sagemaker --quiet
!pip install tritonclient[http] --quiet

In [2]:
import boto3, json, sagemaker, time
from sagemaker import get_execution_role

sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")


Next, install `git-lfs`, which is needed to download the Hugging Face model.

In [None]:
!sudo amazon-linux-extras install epel -y
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
!sudo yum install git-lfs -y

In [3]:
MODEL_NAME="t5-small"
MODEL_TYPE="t5"

# For BART
# MODEL_NAME="bart-base"
# MODEL_TYPE="bart"

In [None]:
!git lfs install

Download the model from Hugging Face. If you have your own trained Hugging Face model, place it in `workspace/hf_models` instead.

In [None]:
!git clone https://huggingface.co/google-t5/t5-small workspace/hf_models/
# !git clone https://huggingface.co/facebook/bart-base workspace/hf_models/

The [generate_trtllm_triton_model_repo.sh](workspace/generate_trtllm_triton_model_repo.sh) script builds the TensorRT-LLM engine for the encoder-decoder T5/BART model and prepares the Triton model repository. In this example we build a single-GPU engine (TP size = 1) with beam search (max beam width = 2), input length = 1024, and output length = 200. To change these settings, edit the [generate_trtllm_triton_model_repo.sh](workspace/generate_trtllm_triton_model_repo.sh) script.

In [4]:
!docker run --gpus all --ulimit memlock=-1 --shm-size=12g -v ${PWD}/workspace:/workspace -w /workspace nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3 /bin/bash generate_encdec_triton_model_repo.sh

Unable to find image 'nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3' locally
24.08-trtllm-python-py3: Pulling from nvidia/tritonserver

First, we must create a model repository so that Triton Inference Server can read the model and any associated metadata.

The `tensorrtllm_backend` repository provides a template model repository under `all_models/inflight_batcher_llm/` that we can replicate.

This directory contains four subfolders holding artifacts for different parts of the model execution process. The `preprocessing/` and `postprocessing/` folders contain scripts for the Triton Inference Server Python backend. These scripts tokenize the text inputs and de-tokenize the model outputs, converting between strings and the token IDs that the model operates on.

The `tensorrt_llm` folder is where we place the model engine we compiled earlier. Finally, the `ensemble` folder defines a model ensemble that links the previous three components together and tells Triton Inference Server how data flows through them.

This is the directory structure of the `triton_model_repo` that we created:

```
triton_model_repo/
├── ensemble
│   ├── 1
│   └── config.pbtxt
├── postprocessing
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
├── preprocessing
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
└── tensorrt_llm
    ├── 1
    │   ├── engines
    │   │   └── t5-small
    │   │       ├── decoder
    │   │       └── encoder
    │   ├── hf_models
    │   │   └── t5-small
    │   │       ├── config.json
    │   │       ├── flax_model.msgpack
    │   │       ├── generation_config.json
    │   │       ├── model.safetensors
    │   │       ├── onnx
    │   │       ├── pytorch_model.bin
    │   │       ├── README.md
    │   │       ├── rust_model.ot
    │   │       ├── spiece.model
    │   │       ├── tf_model.h5
    │   │       ├── tokenizer_config.json
    │   │       └── tokenizer.json
    │   └── model.py
    └── config.pbtxt
```
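Before packaging, it can help to confirm that the generated repository matches the layout above. The following is a minimal sketch, not part of the original notebook: the helper name `check_model_repo` is hypothetical, and it only checks for a `config.pbtxt` and a version directory `1` in each of the four model folders shown in the listing.

```python
from pathlib import Path

# Model folder names taken from the directory listing above.
EXPECTED_MODELS = ("ensemble", "preprocessing", "postprocessing", "tensorrt_llm")


def check_model_repo(repo_dir):
    """Return a list of human-readable problems found in a Triton model repository."""
    repo = Path(repo_dir)
    problems = []
    for name in EXPECTED_MODELS:
        model_dir = repo / name
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{name}: missing config.pbtxt")
        if not (model_dir / "1").is_dir():
            problems.append(f"{name}: missing version directory '1'")
    return problems


# An empty list means the basic layout looks as expected.
print(check_model_repo("workspace/triton_model_repo"))
```

This does not validate the engines or tokenizer files; it is only a quick guard against packaging an incomplete repository.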

Next, we push the Triton TensorRT-LLM image to Amazon ECR so that SageMaker can pull it.

In [None]:
!docker tag nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3 triton-trtllm
!bash push_ecr.sh triton-trtllm

Set `triton_image_uri` to the image URI printed by the cell above.

In [None]:
triton_image_uri = "<ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/triton-trtllm:latest"
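If you prefer to assemble the URI rather than copy it, the sketch below builds it from its parts. The helper name `ecr_image_uri` and the account ID `123456789012` are placeholders for illustration, and the format assumes a private ECR repository in a standard commercial AWS region.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build a private ECR image URI from its parts."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder account ID; substitute your own before using the result.
example_uri = ecr_image_uri("123456789012", "us-east-1", "triton-trtllm")
print(example_uri)
```

In a notebook with AWS credentials, the account ID could be looked up with `boto3.client("sts").get_caller_identity()["Account"]` instead of being typed by hand.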

## Packaging model files and uploading to S3

In [None]:
!tar --exclude='.ipynb_checkpoints' --exclude='*.bin' \
--exclude='*.h5' --exclude='*.safetensors' --exclude="onnx" \
--exclude='.git*' --exclude='.gitignore' --exclude='.gitattributes' --exclude='.gitmodules' \
-czvf model.tar.gz -C workspace/triton_model_repo/ .
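The same packaging step can be sketched in Python with the standard `tarfile` module, which may be convenient where the `tar` CLI is unavailable. This is an illustrative equivalent rather than the notebook's method: `package_model_repo` and `EXCLUDE_PATTERNS` are hypothetical names, and patterns are matched against each member's base name only.

```python
import fnmatch
import tarfile
from pathlib import Path

# Patterns mirroring the --exclude flags of the tar command above.
EXCLUDE_PATTERNS = (".ipynb_checkpoints", "*.bin", "*.h5", "*.safetensors", "onnx", ".git*")


def _skip_excluded(tarinfo):
    """tarfile filter: drop any member whose base name matches an exclude pattern."""
    base = Path(tarinfo.name).name
    if any(fnmatch.fnmatch(base, pattern) for pattern in EXCLUDE_PATTERNS):
        return None  # None omits the member; an omitted directory is not descended into
    return tarinfo


def package_model_repo(repo_dir, out_path="model.tar.gz"):
    """Archive the contents of repo_dir at the root of a gzip-compressed tarball."""
    with tarfile.open(out_path, "w:gz") as tar:
        for child in sorted(Path(repo_dir).iterdir()):
            tar.add(child, arcname=child.name, filter=_skip_excluded)
    return out_path


# Usage (assumes the repository was generated in the earlier step):
# package_model_repo("workspace/triton_model_repo")
```

Adding each child with `arcname=child.name` places the model folders at the archive root, similar to running `tar` with `-C workspace/triton_model_repo/ .`.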

In [None]:
model_uri = sagemaker_session.upload_data(path="model.tar.gz", key_prefix="triton-trtllm-model")

In [None]:
model_uri

## Create SageMaker Endpoint

We start by creating a SageMaker model from the model files uploaded to S3 in the previous step.

In this step we also set the environment variable `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`, which specifies the name of the model that Triton loads. For ensemble models, this variable must be set for Triton to start up in SageMaker.

Additionally, you can set `SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT` and `SAGEMAKER_TRITON_THREAD_COUNT` to tune the thread counts.

In [None]:
sm_model_name = "triton-trtllm-model-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_uri,
    "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble"},
}

create_model_response = sm.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])
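To illustrate the optional thread-count variables mentioned above, here is a small sketch of a helper that builds the container definition. The function name `build_container` and the example values are assumptions made for illustration; the only firm constraint used is that SageMaker environment values are strings.

```python
def build_container(image_uri, model_data_url, default_model="ensemble",
                    thread_count=None, buffer_manager_thread_count=None):
    """Build a PrimaryContainer dict, adding thread-count variables only when given."""
    env = {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": default_model}
    if thread_count is not None:
        # Environment values must be strings, so convert the integers.
        env["SAGEMAKER_TRITON_THREAD_COUNT"] = str(thread_count)
    if buffer_manager_thread_count is not None:
        env["SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT"] = str(buffer_manager_thread_count)
    return {"Image": image_uri, "ModelDataUrl": model_data_url, "Environment": env}


# Illustrative values only; suitable thread counts depend on the instance and workload.
example_container = build_container(
    "<image-uri>", "s3://<bucket>/triton-trtllm-model/model.tar.gz",
    thread_count=8, buffer_manager_thread_count=4,
)
print(example_container["Environment"])
```

The returned dict has the same shape as the `container` variable passed to `sm.create_model` above.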

Using the model above, we create an endpoint configuration where we can specify the type and number of instances we want in the endpoint.

In [None]:
endpoint_config_name = "triton-trtllm-model-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Using the endpoint configuration above, we create a new SageMaker endpoint and wait for the deployment to finish. The status changes to `InService` once the deployment succeeds.

In [None]:
endpoint_name = "triton-trtllm-model-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

In [None]:
resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
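The polling loop above waits indefinitely and does not distinguish a failed deployment from a successful one. A possible variant with a timeout and explicit failure handling is sketched below; `wait_for_endpoint` is a hypothetical helper, and the `describe` and `sleep` parameters are injected so the logic can be exercised without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint leaves 'Creating'; return 'InService' or raise.

    `describe` is any callable that accepts EndpointName=... and returns a dict
    shaped like the response of the SageMaker client's describe_endpoint.
    """
    waited = 0
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return status
        if status != "Creating":
            # For example "Failed"; FailureReason is included when creation fails.
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint ended in status {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still 'Creating' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage in this notebook (requires the endpoint created above):
# status = wait_for_endpoint(sm.describe_endpoint, endpoint_name)
```

boto3 also provides a built-in waiter, `sm.get_waiter("endpoint_in_service")`, which may be sufficient if custom timeout handling is not needed.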

## Run inference
Once the endpoint is running, we can send a sample text for inference, using JSON as the payload format.

In [None]:
from tritonclient.utils import np_to_triton_dtype
import numpy as np

In [None]:
payload = {}
text_input = "translate English to German: This is Good."
beam_width=2
max_tokens=30
payload["inputs"] = [{"name" : "text_input", "data" : [text_input], "datatype" : "BYTES", "shape" : [1,1]},
    {"name" : "beam_width", "data" : [beam_width], "datatype" : np_to_triton_dtype(np.int32), "shape" : [1,1]}, 
    {"name" : "max_tokens", "data" : [max_tokens], "datatype" : np_to_triton_dtype(np.int32), "shape" : [1,1]},
    ]
response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)
response_str = response["Body"].read().decode()
json_object = json.loads(response_str)
json_object['outputs']

In [None]:
payload = {}
text_input = "translate English to German: This is good."
beam_width=2
max_tokens=50
payload["inputs"] = [{"name" : "text_input", "data" : [text_input], "datatype" : "BYTES", "shape" : [1,1]},
    {"name" : "beam_width", "data" : [beam_width], "datatype" : np_to_triton_dtype(np.int32), "shape" : [1,1]}, 
    {"name" : "max_tokens", "data" : [max_tokens], "datatype" : np_to_triton_dtype(np.int32), "shape" : [1,1]},
    {"name" : "return_log_probs", "data" : [True], "datatype" : "BOOL", "shape" : [1,1]},
    ]
response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)
response_str = response["Body"].read().decode()
json_object = json.loads(response_str)
json_object['outputs']
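The two requests above differ only in a few fields, so the payload construction can be factored into a helper. The sketch below is one way to do that under stated assumptions: `build_payload` is a hypothetical name, and the datatype strings are written out directly (`"INT32"` is the value `np_to_triton_dtype(np.int32)` returns), which avoids the `tritonclient` and NumPy imports.

```python
import json


def build_payload(text_input, beam_width=2, max_tokens=30, return_log_probs=None):
    """Serialize one generation request as a JSON string for invoke_endpoint."""

    def tensor(name, datatype, value):
        # Every input is sent as a [1, 1] tensor, as in the cells above.
        return {"name": name, "datatype": datatype, "shape": [1, 1], "data": [value]}

    inputs = [
        tensor("text_input", "BYTES", text_input),
        tensor("beam_width", "INT32", int(beam_width)),
        tensor("max_tokens", "INT32", int(max_tokens)),
    ]
    if return_log_probs is not None:
        inputs.append(tensor("return_log_probs", "BOOL", bool(return_log_probs)))
    return json.dumps({"inputs": inputs})


body = build_payload("translate English to German: This is good.", max_tokens=50)
print(body)
```

The resulting string can be passed as `Body=body` to `client.invoke_endpoint` with `ContentType="application/json"`.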

In [None]:
def invoke_endpoint_test(text_input, max_tokens, beam_width, temperature, repetition_penalty, min_length, bad_words, stop_words, endpoint_name):
    payload = {}
    payload["inputs"] = [{"name" : "text_input", "data" : [text_input], "datatype" : "BYTES", "shape" : [1,1]},
        {"name" : "beam_width", "data" : [beam_width], "datatype" : np_to_triton_dtype(np.int32), "shape" : [1,1]}, 
        {"name" : "max_tokens", "data" : [max_tokens], "datatype" : np_to_triton_dtype(np.int32), "shape" : [1,1]},
        {"name" : "temperature", "data" : [temperature], "datatype" : np_to_triton_dtype(np.float32), "shape" : [1,1]},
        {"name" : "repetition_penalty", "data" : [repetition_penalty], "datatype" : np_to_triton_dtype(np.float32), "shape" : [1,1]},
        # min_length is a token count, so it is sent as an integer tensor.
        {"name" : "min_length", "data" : [min_length], "datatype" : np_to_triton_dtype(np.int32), "shape" : [1,1]},
        {"name" : "bad_words", "data" : [bad_words], "datatype" : "BYTES", "shape" : [1,1]},
        {"name" : "stop_words", "data" : [stop_words], "datatype" : "BYTES", "shape" : [1,1]},
        ]
    # Use the sagemaker-runtime client created at the top of the notebook.
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
    )
    response_str = response["Body"].read().decode()
    json_object = json.loads(response_str)
    return json_object['outputs']
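The `outputs` field returned by the endpoint is a list of tensors, each with a `name` and a flat `data` list. The sketch below shows one way to index that list by name. The helper `outputs_by_name` is hypothetical, the sample response is made up purely for illustration, and the output name `text_output` is an assumption about the ensemble's configuration that should be checked against `ensemble/config.pbtxt`.

```python
import json


def outputs_by_name(response_str):
    """Map each output tensor name to its flat `data` list."""
    response_json = json.loads(response_str)
    return {out["name"]: out["data"] for out in response_json.get("outputs", [])}


# Made-up response for illustration only; real text and shapes will differ.
sample_response = json.dumps({
    "model_name": "ensemble",
    "model_version": "1",
    "outputs": [
        {"name": "text_output", "datatype": "BYTES", "shape": [1, 2],
         "data": ["beam one text", "beam two text"]},
    ],
})

for beam_index, text in enumerate(outputs_by_name(sample_response)["text_output"]):
    print(beam_index, text)
```

With a live endpoint, `response["Body"].read().decode()` from the cells above would take the place of `sample_response`.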

## Terminate endpoint and clean up artifacts

In [None]:
# Delete the endpoint first, then the configuration and model it depends on.
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_model(ModelName=sm_model_name)