# TensorRT-LLM: Llama-3-Taiwan-8B-Instruct
The objective of this notebook is to demonstrate how to use TensorRT-LLM to optimize Llama-3-Taiwan-8B-Instruct, run inference, and examine various advanced optimization techniques.

## Overview of TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server. Models built with TensorRT-LLM can be executed on a wide range of configurations going from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism).

In [None]:
! pip install datasets==2.19 rouge_score
! git clone https://github.com/triton-inference-server/tensorrtllm_backend.git -b v0.9.0 --single-branch
%cd tensorrtllm_backend/
! git lfs install
! git submodule update --init --recursive
%cd /workspace

## 1. Download model from Hugging Face

In [None]:
!huggingface-cli download yentinglin/Llama-3-Taiwan-8B-Instruct --local-dir Llama-3-Taiwan-8B-Instruct --local-dir-use-symlinks=False 
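
Before converting the checkpoint, it can help to confirm that the download finished. The following is a minimal sketch, assuming a Hugging Face Llama checkpoint directory contains a `config.json`, at least one weight file (`*.safetensors` or `*.bin`), and a tokenizer file. The helper name `missing_model_files` is hypothetical and not part of any library.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """Return descriptions of expected items that are absent from model_dir."""
    root = Path(model_dir)
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json")
    # Assumption: weights are stored as *.safetensors or *.bin shards.
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        problems.append("weight files (*.safetensors or *.bin)")
    # Assumption: the tokenizer ships as tokenizer.json or tokenizer.model.
    if not any((root / name).is_file() for name in ("tokenizer.json", "tokenizer.model")):
        problems.append("tokenizer file")
    return problems


# An empty list means nothing obvious is missing.
print(missing_model_files("Llama-3-Taiwan-8B-Instruct"))
```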

## 2. Building TensorRT-LLM engine(s) for Llama-3-Taiwan-8B-Instruct

This section shows how to build TensorRT engine(s) from a Hugging Face model.
Before building the engine, note the features supported for Llama-3, listed below:

- FP16
- FP8
- INT8 & INT4 Weight-Only
- SmoothQuant
- Groupwise quantization (AWQ/GPTQ)
- FP8 KV cache
- INT8 KV cache (+ AWQ/per-channel weight-only)
- Tensor Parallel

### 2.1 Build TensorRT-LLM engines - FP16

**TensorRT-LLM** builds TensorRT engine(s) from a Hugging Face (HF) checkpoint. First, we use the `convert_checkpoint.py` script to convert Llama-3-Taiwan-8B-Instruct into the TensorRT-LLM checkpoint format. Then we use the `trtllm-build` command to build the TensorRT engine.

The `trtllm-build` command builds TensorRT-LLM engines from TensorRT-LLM checkpoints. The checkpoint directory provides the model's weights and architecture configuration. The number of engine files equals the number of GPUs used to run inference.
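
As a quick check of the one-engine-per-GPU relationship, the sketch below counts the engine files in an output directory. It assumes the engines are written as `rank<N>.engine` files, a naming convention that is not stated in this notebook. `count_engine_files` is a hypothetical helper.

```python
import re
from pathlib import Path


def count_engine_files(engine_dir):
    """Count files named rank<N>.engine (assumed naming) directly inside engine_dir."""
    pattern = re.compile(r"^rank\d+\.engine$")
    root = Path(engine_dir)
    if not root.is_dir():
        return 0
    return sum(1 for p in root.iterdir() if p.is_file() and pattern.match(p.name))


# For the single-GPU FP16 build in this notebook, one engine file would be expected.
print(count_engine_files("engines/llama/8b/fp16"))
```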

The `trtllm-build` command has a variety of options. The plugin-related options fall into two categories:

- Plugin options that require a data type (e.g., `gpt_attention_plugin`). You can
    - explicitly specify `float16`/`bfloat16`/`float32`, so that the plugin is enabled with the specified precision;
    - specify `auto`, so that the plugin is enabled with the precision inferred automatically from the model dtype (i.e., the dtype specified during weight conversion); or
    - disable the plugin with `disable`.

- Other features that require a boolean (e.g., `context_fmha`, `paged_kv_cache`, `remove_input_padding`). You can enable or disable the feature by specifying `enable`/`disable`.
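
The two categories can be expressed as a small validation step before calling `trtllm-build`. This is only a sketch: the option names are the ones that appear in this notebook, and `build_plugin_args` is a hypothetical helper rather than part of TensorRT-LLM.

```python
# Values accepted by each category, as described above.
DTYPE_PLUGIN_VALUES = {"float16", "bfloat16", "float32", "auto", "disable"}
BOOLEAN_FEATURE_VALUES = {"enable", "disable"}

# Option names taken from the text and commands in this notebook.
DTYPE_PLUGINS = {"gpt_attention_plugin", "gemm_plugin"}
BOOLEAN_FEATURES = {"context_fmha", "paged_kv_cache", "remove_input_padding"}


def build_plugin_args(options):
    """Turn {"option": "value"} into a flat list of CLI arguments, validating each value."""
    args = []
    for name, value in options.items():
        if name in DTYPE_PLUGINS:
            allowed = DTYPE_PLUGIN_VALUES
        elif name in BOOLEAN_FEATURES:
            allowed = BOOLEAN_FEATURE_VALUES
        else:
            raise KeyError(f"unknown option: {name}")
        if value not in allowed:
            raise ValueError(f"{name} does not accept {value!r}")
        args += [f"--{name}", value]
    return args


print(build_plugin_args({"gpt_attention_plugin": "float16", "remove_input_padding": "enable"}))
```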

Normally `trtllm-build` requires only a single GPU, but if you already have all the GPUs needed for inference, you can enable parallel building with the `--workers` argument to speed up the engine build. Note that the workers feature currently supports only a single node.

The last step is to run inference using the `run.py` and `summarize.py` scripts.

In [None]:
# Define model weight path, output checkpoint path and output engine path
%env HF_LLAMA_MODEL=/workspace/Llama-3-Taiwan-8B-Instruct
%env UNIFIED_CKPT_PATH=ckpt/llama/8b/fp16
%env ENGINE_PATH=engines/llama/8b/fp16

!python tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir ${HF_LLAMA_MODEL} \
--output_dir ${UNIFIED_CKPT_PATH} \
--dtype float16

!trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
             --remove_input_padding enable \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16 \
             --output_dir ${ENGINE_PATH}

#### Flag descriptions for `convert_checkpoint.py`:
- `model_dir`: path to the Hugging Face model directory
- `output_dir`: path to the directory in which to store the TensorRT-LLM checkpoint
- `dtype`: data type to use when converting the model to a TensorRT-LLM checkpoint

#### Flag descriptions for `trtllm-build`:
- `checkpoint_dir`: path to the directory containing the TensorRT-LLM checkpoint from which to build the engine
- `gpt_attention_plugin`: precision of the GPT attention plugin
- `gemm_plugin`: precision of the GEMM plugin, which is required to prevent accuracy issues
- `output_dir`: path to the directory in which to store the TensorRT engine

### Run FP16 engine Inference
Run the TensorRT-LLM LLaMA model with `run.py`, using the engine generated by `trtllm-build`:

In [None]:
!python3 tensorrtllm_backend/tensorrt_llm/examples/run.py \
--max_output_len 50 \
--tokenizer_dir ${HF_LLAMA_MODEL} \
--engine_dir ${ENGINE_PATH}

### 2.2 Build TensorRT-LLM engines - INT8 KV cache + per-channel weight-only quantization
To maximize performance and reduce memory footprint, TensorRT-LLM can execute models in different quantization modes. It supports INT4 or INT8 weights with FP16 activations (a.k.a. INT4/INT8 weight-only) as well as a complete implementation of the SmoothQuant technique. The cell below applies INT8 weight-only quantization through `--use_weight_only` and `--weight_only_precision int8`; it does not pass an option for the INT8 KV cache.
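
To get a feel for the memory savings, the following back-of-the-envelope sketch estimates weight storage per mode. It assumes a round figure of 8e9 parameters and ignores quantization scales, the KV cache, and activations, so the numbers are rough lower bounds rather than measured engine sizes.

```python
# Bytes needed to store one weight in each mode.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, mode):
    """Approximate weight storage in GiB, ignoring scales, KV cache, and activations."""
    return num_params * BYTES_PER_WEIGHT[mode] / 2**30


# Assumption: a round figure of 8e9 parameters for an 8B model.
for mode in ("fp16", "int8", "int4"):
    print(f"{mode}: {weight_memory_gib(8e9, mode):.1f} GiB")
```

Under these assumptions the weights take roughly 14.9 GiB in FP16, 7.5 GiB in INT8, and 3.7 GiB in INT4.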

In [None]:
# Define model weight path, output checkpoint path and output engine path
%env HF_LLAMA_MODEL=/workspace/Llama-3-Taiwan-8B-Instruct
%env UNIFIED_CKPT_PATH=ckpt/llama/8b/int8
%env ENGINE_PATH=engines/llama/8b/int8

!python tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir ${HF_LLAMA_MODEL} \
--output_dir ${UNIFIED_CKPT_PATH} \
--dtype float16 \
--use_weight_only \
--weight_only_precision int8

!trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
             --remove_input_padding enable \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16 \
             --output_dir ${ENGINE_PATH}

### Summarization using the LLaMA model

In [None]:
!python tensorrtllm_backend/tensorrt_llm/examples/summarize.py \
--data_type fp16 \
--test_hf \
--hf_model_dir ${HF_LLAMA_MODEL} \
--test_trt_llm \
--engine_dir ${ENGINE_PATH} \
--max_ite 10
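
The `rouge_score` package installed at the top of this notebook suggests that the generated summaries are scored with ROUGE. As an illustration of what such a score measures, here is a simplified unigram-overlap F1 (ROUGE-1 style) in plain Python. It is a sketch under that assumption and not the implementation used by `summarize.py` or `rouge_score`; it tokenizes on whitespace only.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
```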

### 2.3 Build TensorRT-LLM engines - FP8 Post-Training Quantization [Optional]

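The example below uses the NVIDIA Modelopt toolkit (previously named AMMO, AlgorithMic Model Optimization) for model quantization. FP8 requires GPU hardware support, such as the Ada Lovelace or Hopper architectures. The V100 does not support the FP8 data type, so the cell is commented out and included only as a reference.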

In [None]:
# Define model weight path, output checkpoint path and output engine path
# %env HF_LLAMA_MODEL=/workspace/Llama-3-Taiwan-8B-Instruct
# %env UNIFIED_CKPT_PATH=ckpt/llama/8b/fp8
# %env ENGINE_PATH=engines/llama/8b/fp8

# !python tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py \
# --model_dir ${HF_LLAMA_MODEL} \
# --dtype float16 \
# --qformat fp8 \
# --kv_cache_dtype fp8 \
# --output_dir ${UNIFIED_CKPT_PATH} \
# --calib_size 512


# !trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
#              --remove_input_padding enable \
#              --gpt_attention_plugin float16 \
#              --gemm_plugin float16 \
#              --output_dir ${ENGINE_PATH}

### 2.4 Build TensorRT-LLM engines - Groupwise quantization (AWQ/GPTQ)
AWQ/GPTQ INT4 weight-only quantization can be enabled when building engines with `trtllm-build`.
The NVIDIA Modelopt toolkit is used for AWQ weight quantization. Please see [examples/quantization/README.md](tensorrtllm_backend/tensorrt_llm/examples/quantization/README.md) for Modelopt installation instructions. The cell below quantizes the checkpoint to INT4 AWQ with a block size of 128.

In [None]:
# Define model weight path, output checkpoint path and output engine path
%env HF_LLAMA_MODEL=/workspace/Llama-3-Taiwan-8B-Instruct
%env UNIFIED_CKPT_PATH=ckpt/llama/8b/int4
%env ENGINE_PATH=engines/llama/8b/int4

# Quantize HF LLaMA 8B checkpoint into INT4 AWQ format
!python tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py \
--model_dir ${HF_LLAMA_MODEL} \
--dtype float16 \
--qformat int4_awq \
--awq_block_size 128 \
--output_dir ${UNIFIED_CKPT_PATH} \
--calib_size 4

!trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
             --remove_input_padding enable \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16 \
             --output_dir ${ENGINE_PATH} \
             --paged_kv_cache enable
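
To illustrate what "groupwise" means for `--awq_block_size 128`, the sketch below quantizes a list of weights to signed INT4 with one scale per block. It is plain symmetric round-to-nearest quantization, chosen for illustration; AWQ additionally rescales weights using activation statistics gathered during calibration, which is not shown here. Both function names are hypothetical.

```python
def quantize_groupwise_int4(weights, block_size):
    """Symmetric round-to-nearest INT4 quantization with one scale per block of weights."""
    q_values, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to 7, the top of the signed range [-8, 7].
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        q_values.extend(max(-8, min(7, round(w / scale))) for w in block)
    return q_values, scales


def dequantize_groupwise(q_values, scales, block_size):
    """Recover approximate weights by multiplying each value by its block's scale."""
    return [q * scales[i // block_size] for i, q in enumerate(q_values)]


weights = [0.02 * i - 1.0 for i in range(256)]  # toy data: two blocks of 128
q, s = quantize_groupwise_int4(weights, block_size=128)
restored = dequantize_groupwise(q, s, block_size=128)
# Largest reconstruction error across all weights.
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

A smaller block size gives each scale fewer weights to cover, which tends to reduce the reconstruction error at the cost of storing more scales.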

# Triton Inference Server with TensorRT-LLM backend: Llama-3-Taiwan-8B-Instruct Deployment

This part uses the Triton backend for [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). You can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main). The goal of the TensorRT-LLM Backend is to let you serve TensorRT-LLM models with Triton Inference Server.

## Using the TensorRT-LLM Backend
We will go through four steps to serve a TensorRT-LLM model with the Triton TensorRT-LLM Backend in a single-GPU environment: build the engines, prepare the inference configs, launch the Triton server, and query it. The example uses [Llama-3-Taiwan-8B-Instruct](https://huggingface.co/yentinglin/Llama-3-Taiwan-8B-Instruct) from Hugging Face.

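### 1. Build TensorRT-LLM engines

The engines built in section 2 above are reused here. This example serves the INT8 weight-only engine in `engines/llama/8b/int8`, as set in `ENGINE_PATH` below.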

### 2. Prepare inference configs

This example uses four models from the `all_models/inflight_batcher_llm` directory. The ensemble chains the other three in the order preprocessing -> tensorrt_llm -> postprocessing. The directory also contains `tensorrt_llm_bls`, whose config is filled in below as well.

- **preprocessing**: This model is used for tokenizing, meaning the conversion from prompts (string) to input_ids (list of ints).
- **tensorrt_llm**: This model is a wrapper of your TensorRT-LLM model and is used for inference.
- **postprocessing**: This model is used for de-tokenizing, meaning the conversion from output_ids (list of ints) to outputs (string).
- **ensemble**: This model is used to chain the three models above together.

<div><center>
<img src="./images/ensemble.png" width="1000"/>
</center></div>
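
The data flow through the ensemble can be sketched with toy stand-ins for each stage. Everything below is hypothetical: the vocabulary, the word-level tokenizer, and the "generation" step (which merely reverses its input) only illustrate the types passed between stages, not the behavior of the real models.

```python
def preprocess(prompt, vocab):
    """Toy tokenizer: map each whitespace-separated word to an integer ID."""
    return [vocab[word] for word in prompt.split()]


def toy_generate(input_ids, max_tokens):
    """Stand-in for the tensorrt_llm model: reverse the input IDs, capped at max_tokens."""
    return list(reversed(input_ids))[:max_tokens]


def postprocess(output_ids, vocab):
    """Toy de-tokenizer: map IDs back to words and join them."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[i] for i in output_ids)


def ensemble(prompt, vocab, max_tokens):
    """Chain the three stages in the same order as the ensemble model."""
    return postprocess(toy_generate(preprocess(prompt, vocab), max_tokens), vocab)


vocab = {"what": 0, "is": 1, "machine": 2, "learning": 3}
print(ensemble("what is machine learning", vocab, max_tokens=3))
```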

In [None]:
%env HF_LLAMA_MODEL=/workspace/Llama-3-Taiwan-8B-Instruct
%env ENGINE_PATH=engines/llama/8b/int8

In [None]:
%env BS=64
!rm -rf tensorrtllm_backend/llama_ifb
!cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/ tensorrtllm_backend/llama_ifb

!python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:$BS,preprocessing_instance_count:1
!python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:$BS,postprocessing_instance_count:1
!python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$BS,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
!python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/llama_ifb/ensemble/config.pbtxt triton_max_batch_size:$BS
!python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/llama_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:$BS,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
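
Each `fill_template.py` call above takes a `key:value,key:value` string and writes the values into a `config.pbtxt` template. The sketch below shows the idea on an in-memory string, assuming the templates use `${name}` placeholders. It is an illustration rather than the script's actual source, and `fill_template_text` is a hypothetical name.

```python
from string import Template


def fill_template_text(template_text, substitutions):
    """Replace ${key} placeholders using a 'key:value,key:value' string.

    Placeholders without a matching key are left untouched.
    """
    values = {}
    for pair in substitutions.split(","):
        # Split on the first colon only, so values may contain colons.
        key, value = pair.split(":", 1)
        values[key] = value
    return Template(template_text).safe_substitute(values)


snippet = 'max_batch_size: ${triton_max_batch_size}\nstring_value: "${tokenizer_dir}"'
print(fill_template_text(
    snippet,
    "triton_max_batch_size:64,tokenizer_dir:/workspace/Llama-3-Taiwan-8B-Instruct",
))
```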

You can look at the `config.pbtxt` files for your reference and also learn more about the [model configuration parameters](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#modify-the-model-configuration).



- a) View changes in the **Pre-processing config file** *[tensorrtllm_backend/llama_ifb/preprocessing/config.pbtxt](./tensorrtllm_backend/llama_ifb/preprocessing/config.pbtxt)*


|  Line   |Parameters | Value | 
|-|-|-| 
|   124   | `tokenizer_dir` | **`/workspace/Llama-3-Taiwan-8B-Instruct`**|
|   29    | `triton_max_batch_size` |64|
|   137   | `preprocessing_instance_count` | 1|

---

- b) View changes in the **Post-processing config file**  *[tensorrtllm_backend/llama_ifb/postprocessing/config.pbtxt](./tensorrtllm_backend/llama_ifb/postprocessing/config.pbtxt)*



|  Line   | Parameters | Value | 
|-|-|-| 
|   97    | `tokenizer_dir` | **`/workspace/Llama-3-Taiwan-8B-Instruct`**|
|   29    | `triton_max_batch_size` |64|
|   110   | `postprocessing_instance_count` | 1|

---

- c) View changes in the **tensorrt_llm_bls config file**  *[tensorrtllm_backend/llama_ifb/tensorrt_llm_bls/config.pbtxt](./tensorrtllm_backend/llama_ifb/tensorrt_llm_bls/config.pbtxt)*


|  Line   | Parameters | Value | 
|-|-|-| 
|   29    | `triton_max_batch_size` | 64|
|   32    | `decoupled_mode` |False|
|   244   | `bls_instance_count` | 1|
|   226   | `accumulate_tokens` |False|


---

- d) View changes in the **Ensemble config file**  *[tensorrtllm_backend/llama_ifb/ensemble/config.pbtxt](./tensorrtllm_backend/llama_ifb/ensemble/config.pbtxt)*

 
|  Line   | Parameters | Value|
|-|-|-|
|    29   | `triton_max_batch_size` |64 |

---

- e)  View changes in the **tensorrt_llm config file**  *[tensorrtllm_backend/llama_ifb/tensorrt_llm/config.pbtxt](./tensorrtllm_backend/llama_ifb/tensorrt_llm/config.pbtxt)*


|  Line   | Parameters | Value |
|-|-|-|
|   28    | `triton_backend` | "tensorrtllm" |
|   29    | `triton_max_batch_size` | 64 |
|   32    | `decoupled_mode` | False |
|   350   | `max_beam_width` | 1 |
|   368   | `engine_dir` | **`engines/llama/8b/int8`** |
|   374   | `max_tokens_in_paged_kv_cache` | 2560 |
|   380   | `max_attention_window_size` | 2560 |
|   398   | `kv_cache_free_gpu_mem_fraction` | 0.5 |
|   423   | `exclude_input_in_output` | True |
|   453   | `enable_kv_cache_reuse` | False |
|   362   | `batching_strategy` | inflight_fused_batching |
|    37   | `max_queue_delay_microseconds` | 0 |
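
To double-check the filled configs without opening each file, one option is to scan the text for placeholders that were not substituted. The sketch below assumes `${name}` placeholders and works on a string. `unfilled_placeholders` is a hypothetical helper, and a leftover placeholder is not necessarily an error, since some parameters may be optional.

```python
import re

# Matches ${name} where name is a simple identifier.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def unfilled_placeholders(config_text):
    """Return the sorted, de-duplicated names of ${name} placeholders left in config_text."""
    return sorted(set(PLACEHOLDER.findall(config_text)))


sample = 'max_batch_size: 64\nstring_value: "${engine_dir}"\ncount: ${bls_instance_count}'
print(unfilled_placeholders(sample))
```

To apply it to a real file, read the file first, for example with `open(path).read()`, and pass the resulting text.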

### 3. Launch Triton server

Open a terminal and run the following code:

- On the terminal, navigate to the launch script folder by running this command:

`cd /workspace/`

- Start the Triton Server with this command:

`python /workspace/tensorrtllm_backend/scripts/launch_triton_server.py --world_size 1 --model_repo=/workspace/tensorrtllm_backend/llama_ifb/`

<center><img src="./images/terminal.png" alt="terminal"/></center>
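
Loading the engine can take a while, so it is useful to wait until the server reports readiness before sending requests. The sketch below polls Triton's standard `/v2/health/ready` endpoint using only the standard library. The port 8000 matches the queries in this notebook; the function names, retry count, and interval are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def http_ready(url="http://localhost:8000/v2/health/ready", timeout_s=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe=http_ready, attempts=30, interval_s=2.0):
    """Call probe() up to `attempts` times, sleeping between tries; return True once it succeeds."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(interval_s)
    return False
```

Calling `wait_until_ready()` in a notebook cell after starting the server returns `True` once the server is ready, or `False` if it does not become ready within the allotted attempts.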

## Query the server with the Triton generate endpoint
You can query the server through Triton's generate endpoint with a curl command of the following general form, run from your client environment or container:

In [None]:
!curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{"text_input": "What is machine learning?", \
"max_tokens": 20, \
"bad_words": "", \
"stop_words": "", \
"pad_id": 2, \
"end_id": 2}'

## Querying and Formatting using Python
The raw JSON response is not easy to read. The Python snippet below sends the same request and formats the output:

In [None]:
import os
import requests

# Read the Triton HTTP port from the HTTP_PORT environment variable (default: 8000)
http_port = os.environ.get("HTTP_PORT", "8000")

# Set the URL with the HTTP port
url = f'http://localhost:{http_port}/v2/models/ensemble/generate'

In [None]:
# Define the payload
input_text = "What is machine learning?"
payload = {
    "text_input": input_text,
    "max_tokens": 1024,
    "bad_words": "",
    "stop_words": "<|eot_id|>"
}

# Make a POST request
response = requests.post(url, json=payload)

# Check if the request was successful
if response.status_code == 200:
    # Parse the response
    data = response.json()
    output_text = data.get('text_output')

    # Format and print the output
    print(f"Input: {input_text}")
    print(f"Output: {output_text}")
else:
    print(f"Error: {response.status_code}")

In [None]:
# Define the payload
# Llama 3 chat format: each header is followed by a blank line ("\n\n")
input_text = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n\
You are a helpful AI assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n\
What is machine learning?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

payload = {
    "text_input": input_text,
    "max_tokens": 1024,
    "bad_words": "",
    "stop_words": "<|eot_id|>"
}

# Make a POST request
response = requests.post(url, json=payload)

# Check if the request was successful
if response.status_code == 200:
    # Parse the response
    data = response.json()
    output_text = data.get('text_output')

    # Format and print the output
    print(f"Input: {input_text}")
    print(f"Output: {output_text}")
else:
    print(f"Error: {response.status_code}")
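
Writing the chat template by hand is error-prone, so a small helper can assemble it from a list of messages. The sketch below follows the Llama 3 chat format, in which each header is followed by a blank line and each turn ends with `<|eot_id|>`. `build_llama3_prompt` is a hypothetical helper, and the payload fields are the ones used in the cells above.

```python
def build_llama3_prompt(messages):
    """Format a list of {"role", "content"} dicts with Llama 3 header and end-of-turn tokens."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    # Leave an open assistant header so the model writes the next turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "What is machine learning?"},
]
payload = {
    "text_input": build_llama3_prompt(messages),
    "max_tokens": 1024,
    "bad_words": "",
    "stop_words": "<|eot_id|>",
}
print(payload["text_input"])
```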

## Kill the server

In [None]:
!pgrep mpirun | xargs kill

## Clear your data

In [None]:
!rm -rf Llama-3-Taiwan-8B-Instruct/ ckpt engines tensorrtllm_backend