# Deploying Mistral Model with Triton TensorRT-LLM on Amazon SageMaker

This notebook shows how to optimize decoder-only models such as Mistral and LLaMA using NVIDIA TensorRT-LLM and then deploy them with Triton Inference Server on Amazon SageMaker. The TensorRT-LLM library accelerates inference for the latest LLMs on NVIDIA GPUs. The Triton Inference Server backend for TensorRT-LLM uses the TensorRT-LLM C++ runtime for highly performant inference execution. It includes techniques like in-flight batching and paged KV caching that provide high throughput at low latency. The TensorRT-LLM backend is bundled with Triton Inference Server and is available as a pre-built container (`xx.yy-trtllm-python-py3`) on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags).

**NOTE:** This notebook was tested with the `conda_python3` kernel in the `us-east-1` region on an Amazon SageMaker notebook instance of type `g5.xlarge`.

## Set up the environment
Install the dependencies required to package the model and run inference using Triton server.

> You can ignore the pip dependency resolver errors

In [7]:
!pip install -qU awscli boto3 sagemaker 
!pip install -q tritonclient[http] 
!pip install -q huggingface_hub[cli]

Also define the SageMaker client and IAM role that will give SageMaker access to the model artifacts and the NVIDIA Triton TRT-LLM ECR image.

In [11]:
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
from tritonclient.utils import np_to_triton_dtype
import numpy as np
from huggingface_hub import notebook_login

sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()['Account']

## Download model

In this example we use the `Mistral-7B-Instruct-v0.2` model, which we download from Hugging Face. If you have your own custom-trained Hugging Face model, you can place it in `workspace/hf_models` instead.

Make sure to log in to the Hugging Face Hub using your [HF token](https://huggingface.co/docs/transformers.js/en/guides/private).

In [12]:
notebook_login()


In [14]:
!huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2 --local-dir workspace/hf_models/ --local-dir-use-symlinks False 

Fetching 16 files:   0%|                                 | 0/16 [00:00<?, ?it/s]Downloading '.gitattributes' to 'workspace/hf_models/.cache/huggingface/download/.gitattributes.a6344aac8c09253b3b630fb776ae94478aa0275b.incomplete'
Downloading 'model.safetensors.index.json' to 'workspace/hf_models/.cache/huggingface/download/model.safetensors.index.json.361fa9d25a7f791e18ab531b3468ff8f2010642e.incomplete'
Downloading 'model-00001-of-00003.safetensors' to 'workspace/hf_models/.cache/huggingface/download/model-00001-of-00003.safetensors.63654d601820b88b1fa8b4a98df5714f700fbc5b3df2cc4ecbabdced35096d31.incomplete'
Downloading 'generation_config.json' to 'workspace/hf_models/.cache/huggingface/download/generation_config.json.cb0c9b6c64cf786052efdd1a4ae597337b2f2708.incomplete'
Downloading 'model-00003-of-00003.safetensors' to 'workspace/hf_models/.cache/huggingface/download/model-00003-of-00003.safetensors.5f86e15cb3ed9078e30ae6e72445e109d0e337d9cde59b9aeea4ce8e44e54a5d.incomplete'
Downloading

## Optimize model with TRT-LLM and Setup Triton Model Repo

We use the Triton TRT-LLM NGC container both to optimize the model with TRT-LLM and to serve it. We first pull the [Triton TRT-LLM image from NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags) and then, for the SageMaker endpoint deployment, push it to a private ECR repository using the [push_ecr.sh](./push_ecr.sh) script.

In [None]:
!docker pull nvcr.io/nvidia/tritonserver:24.09-trtllm-python-py3
!docker tag nvcr.io/nvidia/tritonserver:24.09-trtllm-python-py3 triton-trtllm
!bash push_ecr.sh triton-trtllm

In [15]:
triton_image_uri = f"{account_id}.dkr.ecr.us-east-1.amazonaws.com/triton-trtllm:latest"
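
The image URI above hard-codes `us-east-1`. If you run the notebook in another region, the URI can be built from the region instead. The following is a minimal sketch: `ecr_image_uri` is a hypothetical helper, and it assumes the repository name `triton-trtllm`, the `latest` tag, and the standard `amazonaws.com` domain suffix.

```python
def ecr_image_uri(account_id: str, region: str,
                  repo: str = "triton-trtllm", tag: str = "latest") -> str:
    """Build a private ECR image URI (assumes the standard amazonaws.com suffix)."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


# Placeholder values for illustration; in the notebook you would pass
# `account_id` and `sess.region_name` instead.
example_uri = ecr_image_uri("123456789012", "us-west-2")
print(example_uri)
```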

Next we will use the [generate_dec_triton_model_repo.sh](workspace/generate_dec_triton_model_repo.sh) script to build the TRT-LLM engine for the decoder-only Mistral model and prepare the Triton model repository. In this example, we build a single-GPU engine (TP size = 1) with a maximum beam width of 1 and a maximum input length of 32768 tokens. To change these settings, edit the [generate_dec_triton_model_repo.sh](workspace/generate_dec_triton_model_repo.sh) script.

We run [generate_dec_triton_model_repo.sh](workspace/generate_dec_triton_model_repo.sh) inside the Triton TRT-LLM NGC container. While the docker command below is running, feel free to read the cells that follow for a description of what the script does.

In [16]:
!docker run --gpus all --ulimit memlock=-1 --shm-size=12g -v ${PWD}/workspace:/workspace \
-w /workspace nvcr.io/nvidia/tritonserver:24.09-trtllm-python-py3 \
/bin/bash generate_dec_triton_model_repo.sh


== Triton Inference Server ==

NVIDIA Release 24.09 (build 112826551)
Triton Server Version 2.50.0

Copyright (c) 2018-2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.5 driver version 555.42.06 with kernel driver version 535.183.01.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

Git cloning the TRT-LLM Backend repo from GitHub and setting it up...
Cloning into 'tensorrtllm_backend'...
Note: switching to 'f80395e67a464e229b6595acd8f9305b344a5d54'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commi

First, we clone the `tensorrtllm_backend` repo, which contains example Triton model repository config files under [`all_models/inflight_batcher_llm/`](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm) that we can use. We use an ensemble model instead of BLS in this example, so we omit copying the `tensorrt_llm_bls` folder. To learn more about ensemble and BLS models, please see the [Ensemble Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) and [Business Logic Scripting documentation](https://github.com/triton-inference-server/python_backend#business-logic-scripting).

```
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git -b v0.13.0 
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
cd /workspace

rsync -av --exclude='tensorrt_llm_bls' tensorrtllm_backend/all_models/inflight_batcher_llm/ triton_model_repo/
```

Next we define the engine build parameters such as `max_beam_width`, `max_batch_size`, `max_input_len`, and `max_output_len`. Here we build a TP size = 1 engine for the Mistral model with a maximum input length of 32768 tokens and a maximum output length of roughly 200 tokens (`MAX_OUTPUT_LEN=201` in the script). We also define the model checkpoint and engine paths, including the paths they will have after SageMaker deployment (relative to `/opt/ml/model`).

```
export HF_MODEL_PATH=/workspace/hf_models/
export UNIFIED_CKPT_PATH=/workspace/ckpt/
export ENGINE_PATH=/workspace/triton_model_repo/tensorrt_llm/1/engines/
rsync -av --exclude="*.ot" --exclude="*onnx" --exclude="*.bin" --exclude="*.safetensors" --exclude="*.h5" --exclude="*.msgpack" ${HF_MODEL_PATH} /workspace/triton_model_repo/tensorrt_llm/1/hf_models/
export SAGEMAKER_ENGINE_PATH=/opt/ml/model/tensorrt_llm/1/engines
export SAGEMAKER_TOKENIZER_PATH=/opt/ml/model/tensorrt_llm/1/hf_models/


INFERENCE_PRECISION=float16
TP_SIZE=1
MAX_BEAM_WIDTH=1
MAX_BATCH_SIZE=32
MAX_INPUT_LEN=32768
MAX_OUTPUT_LEN=201
```
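
To get a feel for what a 32768-token input length means for GPU memory, the arithmetic below estimates the KV cache size per token and per full-length sequence. This is a rough sketch, not a measurement: it assumes the Mistral 7B architecture values of 32 layers, 8 key/value heads, and a head dimension of 128 (check `config.json` in `hf_models`), a float16 KV cache, and it ignores any paging overhead in the runtime.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache needed for one token across all layers."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Mistral 7B values; float16 uses 2 bytes per value.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8,
                                     head_dim=128, bytes_per_value=2)
max_input_len = 32768
per_sequence_gib = per_token * max_input_len / 2**30

print(f"{per_token // 1024} KiB per token")
print(f"{per_sequence_gib:.1f} GiB for one {max_input_len}-token sequence")
```

Under these assumptions a single full-length sequence needs about 4 GiB of KV cache, which is worth keeping in mind when choosing `MAX_INPUT_LEN` and `MAX_BATCH_SIZE` for a single-GPU instance.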

We then convert the Huggingface model checkpoint to TRT-LLM format.

```
python3 tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir ${HF_MODEL_PATH} \
                             --output_dir ${UNIFIED_CKPT_PATH} \
                             --dtype ${INFERENCE_PRECISION}
```

Next we build a single TRT-LLM engine for the decoder-only Mistral model.

```
                             
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
             --output_dir ${ENGINE_PATH} \
             --gemm_plugin auto \
             --max_input_len ${MAX_INPUT_LEN} \
             --max_beam_width ${MAX_BEAM_WIDTH} 
```

Finally, we prepare our Triton model repository `triton_model_repo` by editing the `config.pbtxt` files. The directory has four subfolders holding artifacts for different parts of the model execution process. The `preprocessing/` and `postprocessing/` folders contain scripts for the Triton Inference Server Python backend. These scripts tokenize the text inputs and de-tokenize the model outputs, converting between strings and the token IDs that the model operates on. They need access to the original Hugging Face model's tokenizer files, which we have placed in `tensorrt_llm/1/hf_models` in this example.

The `tensorrt_llm/1/engines` folder is where we place the model engine we compiled. Finally, the `ensemble` folder defines a model ensemble that links the previous three components together and tells the Triton Inference Server how to flow data through them. For more details please see [here](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main#prepare-the-model-repository).

```
python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,max_beam_width:${MAX_BEAM_WIDTH},engine_dir:${SAGEMAKER_ENGINE_PATH},kv_cache_free_gpu_mem_fraction:0.95,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,enable_chunked_context:False,max_queue_size:0

python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:${SAGEMAKER_TOKENIZER_PATH},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1

python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:${SAGEMAKER_TOKENIZER_PATH},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1

python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
```
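
Each `fill_template.py` call above takes a comma-separated list of `key:value` pairs and writes the values into the matching `${key}` placeholders of a `config.pbtxt` file. The sketch below illustrates that substitution idea on an in-memory string. It is not the tool's actual implementation: `fill_placeholders` is a hypothetical helper and the template text is a made-up fragment.

```python
from string import Template


def fill_placeholders(template_text: str, assignments: str) -> str:
    """Replace ${key} placeholders using a 'key:value,key:value' string."""
    values = {}
    for pair in assignments.split(","):
        # Split on the first colon only, so values may themselves contain colons.
        key, _, value = pair.partition(":")
        values[key] = value
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(template_text).safe_substitute(values)


# Made-up config fragment for illustration only.
fragment = (
    "max_batch_size: ${triton_max_batch_size}\n"
    'string_value: "${engine_dir}"\n'
    "other: ${unset_key}"
)
filled = fill_placeholders(
    fragment,
    "triton_max_batch_size:32,engine_dir:/opt/ml/model/tensorrt_llm/1/engines",
)
print(filled)
```

In this sketch, placeholders without a matching key stay in the text, which makes it easy to search a filled config for any `${...}` that was missed.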

After the docker run command that executes the [generate_dec_triton_model_repo.sh](workspace/generate_dec_triton_model_repo.sh) script completes, we end up with a `triton_model_repo` that has the following directory structure:
```
triton_model_repo/
├── ensemble
│   ├── 1
│   └── config.pbtxt
├── postprocessing
│   ├── 1
│   │   ├── model.py
│   └── config.pbtxt
├── preprocessing
│   ├── 1
│   │   ├── model.py
│   └── config.pbtxt
└── tensorrt_llm
    ├── 1
    │   ├── engines
    │   │   ├── config.json
    │   │   └── rank0.engine
    │   ├── hf_models
    │   │   ├── config.json
    │   │   ├── generation_config.json
    │   │   ├── README.md
    │   │   ├── special_tokens_map.json
    │   │   ├── tokenizer_config.json
    │   │   ├── tokenizer.json
    │   │   └── tokenizer.model
    │   └── model.py
    └── config.pbtxt
```
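
Before packaging, it can be useful to confirm that the generated repository matches this layout. The following is a minimal sketch of such a check. `missing_repo_entries` and the `EXPECTED` list are illustrative names, and the list covers only a subset of the files shown in the tree above. The demo runs against a temporary directory, so it does not touch your workspace.

```python
import tempfile
from pathlib import Path

# A subset of the files shown in the directory tree above.
EXPECTED = [
    "ensemble/config.pbtxt",
    "preprocessing/config.pbtxt",
    "preprocessing/1/model.py",
    "postprocessing/config.pbtxt",
    "postprocessing/1/model.py",
    "tensorrt_llm/config.pbtxt",
    "tensorrt_llm/1/engines/config.json",
    "tensorrt_llm/1/engines/rank0.engine",
    "tensorrt_llm/1/hf_models/tokenizer_config.json",
]


def missing_repo_entries(repo_root):
    """Return the expected relative paths that are not files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


# Demo: create every expected file except the last one in a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    for rel in EXPECTED[:-1]:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    demo_missing = missing_repo_entries(tmp)

print(demo_missing)
```

In the notebook, calling `missing_repo_entries("workspace/triton_model_repo")` should return an empty list when the script has completed successfully.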

## Packaging model files and uploading to S3

Next, we package the Triton model repository `triton_model_repo` in the `model.tar.gz` format that SageMaker expects and upload it to an S3 bucket. During packaging we retain only the tokenizer files from the original model checkpoint and exclude weight files such as `.safetensors`, `.bin`, and `.h5`.

In [None]:
!tar --exclude='.ipynb_checkpoints' --exclude='*.bin' \
--exclude='*.h5' --exclude='*.safetensors' --exclude="onnx" \
--exclude='.git*' --exclude='.gitignore' --exclude='.gitattributes' \
--exclude='.gitmodules' --exclude='*.msgpack' --exclude="*.ot" --exclude=".cache" \
-czvf model.tar.gz -C workspace/triton_model_repo/ .

./
./postprocessing/
./postprocessing/config.pbtxt
./postprocessing/1/
./postprocessing/1/model.py
./ensemble/
./ensemble/config.pbtxt
./ensemble/1/
./ensemble/1/.tmp
./preprocessing/
./preprocessing/config.pbtxt
./preprocessing/1/
./preprocessing/1/model.py
./tensorrt_llm/
./tensorrt_llm/config.pbtxt
./tensorrt_llm/1/
./tensorrt_llm/1/hf_models/
./tensorrt_llm/1/hf_models/tokenizer.model
./tensorrt_llm/1/hf_models/pytorch_model.bin.index.json
./tensorrt_llm/1/hf_models/special_tokens_map.json
./tensorrt_llm/1/hf_models/config.json
./tensorrt_llm/1/hf_models/README.md
./tensorrt_llm/1/hf_models/tokenizer_config.json
./tensorrt_llm/1/hf_models/tokenizer.json
./tensorrt_llm/1/hf_models/.cache/
./tensorrt_llm/1/hf_models/.cache/huggingface/
./tensorrt_llm/1/hf_models/.cache/huggingface/download/
./tensorrt_llm/1/hf_models/.cache/huggingface/download/special_tokens_map.json.lock
./tensorrt_llm/1/hf_models/.cache/huggingface/download/tokenizer.model.metadata
./tensorrt_llm/1/hf_models/.cach
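
Since the archive should carry the engine and tokenizer files but none of the original checkpoint weights, a quick inspection before uploading can catch an exclude pattern that did not match. The sketch below lists archive members that end with one of the excluded weight suffixes. `find_weight_files` and `WEIGHT_SUFFIXES` are hypothetical names, and the demo builds a tiny throwaway archive rather than reading the real `model.tar.gz`.

```python
import io
import tarfile
import tempfile
from pathlib import Path

# Suffixes of checkpoint weight files that the tar command above excludes.
WEIGHT_SUFFIXES = (".safetensors", ".bin", ".h5", ".msgpack", ".ot")


def find_weight_files(tar_path):
    """Return names of regular files in a .tar.gz that look like weight files."""
    with tarfile.open(str(tar_path), "r:gz") as tar:
        return [
            member.name
            for member in tar.getmembers()
            if member.isfile() and member.name.endswith(WEIGHT_SUFFIXES)
        ]


# Demo: a small archive with one tokenizer file and one stray weight file.
with tempfile.TemporaryDirectory() as tmp:
    demo_tar = Path(tmp) / "demo.tar.gz"
    with tarfile.open(str(demo_tar), "w:gz") as tar:
        for name in ("hf_models/tokenizer.json", "hf_models/model.bin"):
            data = b"x"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    demo_hits = find_weight_files(demo_tar)

print(demo_hits)
```

Note that an index file such as `pytorch_model.bin.index.json` is not flagged by this check, because it ends in `.json` and contains no weights.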

In [None]:
model_uri = sagemaker_session.upload_data(path="model.tar.gz", key_prefix="triton-trtllm-model")

## Create SageMaker Endpoint

We start by creating a SageMaker model from the Triton image URI and the Triton model repository archive we uploaded to S3 in the previous steps.

In this step we also provide an additional environment variable, `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`, which specifies the name of the model to be loaded by Triton. For ensemble models, this key must be set for Triton to start up in SageMaker. We are deploying the TRT-LLM ensemble model, so we specify `"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble"`.

Additionally, you can set `SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT` and `SAGEMAKER_TRITON_THREAD_COUNT` to tune the thread counts.

In [None]:
sm_model_name = "triton-trtllm-model-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_uri,
    "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble"},
}

create_model_response = sm.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])
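
If you want to set the optional thread-count variables mentioned above, they go into the same `Environment` map. The sketch below builds such a map. `triton_environment` is a hypothetical helper and the thread counts are arbitrary illustrative values, not tuned recommendations. SageMaker expects environment values to be strings, so the integers are converted.

```python
from typing import Dict, Optional


def triton_environment(default_model: str,
                       thread_count: Optional[int] = None,
                       buffer_manager_thread_count: Optional[int] = None) -> Dict[str, str]:
    """Build the container Environment map; all values must be strings."""
    env = {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": default_model}
    if thread_count is not None:
        env["SAGEMAKER_TRITON_THREAD_COUNT"] = str(thread_count)
    if buffer_manager_thread_count is not None:
        env["SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT"] = str(buffer_manager_thread_count)
    return env


# Illustrative values only.
example_env = triton_environment("ensemble", thread_count=8,
                                 buffer_manager_thread_count=2)
print(example_env)
```

The resulting dictionary could replace the `Environment` value in the `container` definition above.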

Using the sagemaker model above, we create an endpoint configuration where we can specify the type and number of instances we want in the endpoint.

In [None]:
endpoint_config_name = "triton-trtllm-model-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Using the above endpoint configuration we create a new sagemaker endpoint and wait for the deployment to finish. The status will change to InService once the deployment is successful.

In [None]:
endpoint_name = "triton-trtllm-model-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint creation can take about 10 minutes for the Mistral model in this example.

In [None]:
resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
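
The loop above stops as soon as the status is no longer `Creating`, so a `Failed` endpoint is only visible in the final printed status. A variant that raises on failure and gives up after a timeout could look like the sketch below. `wait_for_endpoint` is a hypothetical helper, the timeout default is an arbitrary choice, and the demo uses a small fake client so that it runs without AWS access.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll describe_endpoint until InService; raise on Failed or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status == "Failed":
            reason = resp.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after timeout")
        sleep(poll_seconds)


class FakeSageMakerClient:
    """Stand-in for the boto3 client that replays a fixed list of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointStatus": next(self._statuses), "EndpointArn": "arn:demo"}


demo_resp = wait_for_endpoint(
    FakeSageMakerClient(["Creating", "Creating", "InService"]),
    "demo-endpoint",
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(demo_resp["EndpointStatus"])
```

In the notebook you would call `wait_for_endpoint(sm, endpoint_name)`. boto3 also ships a built-in waiter, `sm.get_waiter("endpoint_in_service")`, which covers the same need.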

## Run inference
Once the endpoint is running, we can send it a sample text for inference using JSON as the payload format. In this example request we use `beam_width=1` and set `return_log_probs` so that TRT-LLM returns `output_log_probs` (log probabilities for each output token) as well as `cum_log_probs` (cumulative log probabilities for each output).

In [None]:
payload = {}
text_input = "what is machine learning?"
beam_width=1
max_tokens=40
payload["inputs"] = [{"name" : "text_input", "data" : [text_input], "datatype" : "BYTES", "shape" : [1,1]},
    {"name" : "beam_width", "data" : [beam_width], "datatype" : np_to_triton_dtype(np.int32), "shape" : [1,1]}, 
    {"name" : "max_tokens", "data" : [max_tokens], "datatype" : np_to_triton_dtype(np.int32), "shape" : [1,1]},
    {"name" : "return_log_probs", "data" : [True], "datatype" : "BOOL", "shape" : [1,1]},
    ]
response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)
response_str = response["Body"].read().decode()
json_object = json.loads(response_str)
json_object['outputs']

In [None]:
print("Text Output response from model is", json_object['outputs'][-1]['data'])
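
Indexing with `[-1]` relies on the text output being the last entry in the response. Looking the output up by name is a little more robust. The sketch below assumes the text tensor is named `text_output`. `output_by_name` is a hypothetical helper, and the sample response is hand-written for illustration, not real endpoint output.

```python
def output_by_name(response_json: dict, name: str):
    """Return the data list of the output tensor with the given name."""
    for output in response_json.get("outputs", []):
        if output.get("name") == name:
            return output["data"]
    raise KeyError(f"no output named {name!r}")


# Hand-written sample with placeholder values, shaped like the JSON response.
sample_response = {
    "outputs": [
        {"name": "cum_log_probs", "datatype": "FP32", "shape": [1, 1], "data": [0.0]},
        {"name": "text_output", "datatype": "BYTES", "shape": [1], "data": ["placeholder text"]},
    ]
}
sample_text = output_by_name(sample_response, "text_output")
print(sample_text)
```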

We can also send requests with other TRT-LLM supported inputs such as `temperature`, `repetition_penalty`, `min_length`, `bad_words`, and `stop_words`. For more details on the supported inputs and outputs and how to set them, please see the [docs here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/model_config.md#model-input-and-output).

In [None]:
payload = {}
text_input = "what is machine learning?"
beam_width=1
max_tokens=40
temperature=0.9
repetition_penalty=1.0
min_length=1
bad_words=[""]
stop_words=[""]

payload["inputs"] = [{"name" : "text_input", "data" : [text_input], "datatype" : "BYTES", "shape" : [1,1]},
    {"name" : "beam_width", "data" : [beam_width], "datatype" : np_to_triton_dtype(np.int32), "shape" : [1,1]}, 
    {"name" : "max_tokens", "data" : [max_tokens], "datatype" : np_to_triton_dtype(np.int32), "shape" : [1,1]},
    {"name" : "temperature", "data" : [temperature], "datatype" : np_to_triton_dtype(np.float32), "shape" : [1,1]},
    {"name" : "repetition_penalty", "data" : [repetition_penalty], "datatype" : np_to_triton_dtype(np.float32), "shape" : [1,1]},
    {"name" : "min_length", "data" : [min_length], "datatype" : np_to_triton_dtype(np.int32), "shape" : [1,1]},
    {"name" : "bad_words", "data" : bad_words, "datatype" : "BYTES", "shape" : [1,1]},
    {"name" : "stop_words", "data" : stop_words, "datatype" : "BYTES", "shape" : [1,1]},
    {"name" : "return_log_probs", "data" : [True], "datatype" : "BOOL", "shape" : [1,1]},
    ]
response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)
response_str = response["Body"].read().decode()
json_object = json.loads(response_str)
json_object['outputs']
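
Both request cells repeat the same dictionary shape for every input. A small builder can cut that repetition. The sketch below is one possible approach, with hypothetical helper names: it infers the Triton datatype from the Python type of each scalar value and always uses the `[1, 1]` shape from the cells above.

```python
import json


def triton_datatype(value) -> str:
    """Map a Python scalar to the Triton datatype string used in the payload."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOL"
    if isinstance(value, int):
        return "INT32"
    if isinstance(value, float):
        return "FP32"
    if isinstance(value, str):
        return "BYTES"
    raise TypeError(f"unsupported input type: {type(value).__name__}")


def build_payload(**inputs) -> dict:
    """Build a JSON-serializable request body from scalar keyword arguments."""
    return {
        "inputs": [
            {"name": name, "data": [value],
             "datatype": triton_datatype(value), "shape": [1, 1]}
            for name, value in inputs.items()
        ]
    }


example_payload = build_payload(
    text_input="what is machine learning?",
    beam_width=1,
    max_tokens=40,
    temperature=0.9,
    return_log_probs=True,
)
print(json.dumps(example_payload, indent=2))
```

The result can be passed to `client.invoke_endpoint` as `Body=json.dumps(example_payload)`, exactly like the hand-built payloads above.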

## Terminate endpoint and clean up artifacts

Once you are done with the endpoint, you can delete it along with other artifacts like sagemaker model and endpoint config.

In [None]:
# Delete the endpoint first, then the endpoint config and model it references.
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_model(ModelName=sm_model_name)

## Conclusion

In this example, you have seen how to optimize decoder-only models such as Mistral using TRT-LLM and Triton and deploy them on SageMaker. To learn more about running LLaMA-family decoder-only models with the Triton TRT-LLM backend, please see the [TRT-LLM backend docs](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md). For best practices on tuning the performance of TensorRT-LLM and Triton, please see this [guide](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md). To learn about deploying other TRT-LLM models, check out the [llama example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) and other [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples).