<p> <center> <a href="../../LLM-Application.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="trt-custom-model.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 33%; text-align: center;">
        <a href="llama-chat-finetune.ipynb">1</a>
        <a href="trt-llama-chat.ipynb">2</a>
        <a href="trt-custom-model.ipynb">3</a>
        <a>4</a>
        <a href="challenge.ipynb">5</a>
    </span>
    <span style="float: left; width: 33%; text-align: right;"><a href="challenge.ipynb">Next Notebook</a></span>
</div>

# Deploying A Finetuned Model Using Triton Inference Server (TensorRT-LLM Backend)
--- 
<div style="text-align:left; color:#FF0000; height:80px; text-color:red; font-size:20px">Please note that you can run this lab only by using the TRT-LLM Container </div>

In this notebook, we use the TensorRT-LLM Backend to deploy the TensorRT engine built in the previous notebook. The TensorRT-LLM Backend lets you serve [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) models with Triton Inference Server. You can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/backend).


## Using the TensorRT-LLM Backend

To use the TensorRT-LLM Backend, follow the steps below:

- Clone the TensorRT-LLM Backend repo ([tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend)). This has already been done within the container running the lab.
   <img src="images/trt-clone.png" />
- Navigate into the cloned repository and create the model repository `triton_model_repo`
   <img src="images/trt-model-repo.png" />
- Copy all files in the `all_models/inflight_batcher_llm` directory into `triton_model_repo`
   <img src="images/trt-copy-folder5.png" />

- The `all_models/inflight_batcher_llm` directory contains five model directories:
    - **preprocessing**: Tokenizes the input, converting prompts (strings) to input_ids (lists of ints).
    - **tensorrt_llm**: Wraps your TensorRT-LLM model and performs inference.
    - **postprocessing**: De-tokenizes the output, converting output_ids (lists of ints) to outputs (strings).
    - **ensemble**: Chains the three models above together as: `preprocessing -> tensorrt_llm -> postprocessing`
    - **tensorrt_llm_bls**: Can also be used to chain the preprocessing, tensorrt_llm and postprocessing models together.



Learn more about [ensemble model](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) and [tensorrt_llm_bls model](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#tensorrt_llm_bls).
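
To make the data flow concrete, here is a minimal, self-contained sketch of the `preprocessing -> tensorrt_llm -> postprocessing` chain. It is only a toy: the whitespace "tokenizer", `fake_generate`, and every function name below are hypothetical stand-ins and are not part of the TensorRT-LLM Backend.

```python
# Toy illustration of the ensemble data flow; every name here is hypothetical.

def build_vocab(words):
    """Map each distinct word to an integer id (stand-in for a real tokenizer)."""
    vocab = {}
    for word in words:
        vocab.setdefault(word, len(vocab))
    return vocab

def preprocess(prompt, vocab):
    """prompt (string) -> input_ids (list of ints)."""
    return [vocab[word] for word in prompt.split()]

def fake_generate(input_ids, max_tokens):
    """Stand-in for the engine: returns the last `max_tokens` ids as output_ids."""
    return input_ids[-max_tokens:]

def postprocess(output_ids, vocab):
    """output_ids (list of ints) -> output text (string)."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[idx] for idx in output_ids)

def ensemble(prompt, vocab, max_tokens=2):
    """Chain the three steps in the same order as the ensemble model."""
    input_ids = preprocess(prompt, vocab)
    output_ids = fake_generate(input_ids, max_tokens)
    return postprocess(output_ids, vocab)

vocab = build_vocab("explain what is astrophotography".split())
result = ensemble("explain what is astrophotography", vocab)
print(result)
```

In the real deployment, Triton performs this chaining on the server, so the client only sends text and receives text.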

- Copy the TensorRT engine files from `../../model/trt_engines/fp16/1-gpu` into the `triton_model_repo/tensorrt_llm/1` directory
   <img src="images/trt-copy-engine.png" />

Now, let's execute the above steps. But before that, we have to copy the `tensorrtllm_backend` repo from within our container to the `source_code` directory for ease of access.

In [None]:
# copy tensorrtllm_backend to source_code if not already present
!cp -r /workspace/tensorrtllm_backend ../../source_code

Next, we navigate into `tensorrtllm_backend` and execute the steps listed above for the FP16 TensorRT engine.

In [None]:
%%bash
# Create the model repository that will be used by the Triton server
cd ../../source_code/tensorrtllm_backend/
# -p avoids an error if the directory already exists (e.g. when re-running this cell)
mkdir -p triton_model_repo

# Copy the example models to the model repository
cp -r all_models/inflight_batcher_llm/* triton_model_repo/

# Copy the TRT engine to triton_model_repo/tensorrt_llm/1/
cp  ../../model/trt_engines/fp16/1-gpu/* triton_model_repo/tensorrt_llm/1

### Modify the model configuration

The following table shows the fields that need to be modified before deployment:

<div><center>
<img src="images/ensemble.png" width="1000"/>
</center></div>


**Run the script cell below to set these key-value pairs to the correct values.**

In [None]:
%%bash
# Run this script to set the right key-value pairs automatically.

cd ../../source_code/tensorrtllm_backend/

export HF_LLAMA_MODEL="../../model/Llama-2-7b-chat-hf-merged"
export ENGINE_PATH="../../source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
export BACKEND="tensorrtllm"


python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:V1,max_queue_delay_microseconds:0

You can look at the `config.pbtxt` files for your reference and also learn more about the [model configuration parameters](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#modify-the-model-configuration).
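
To illustrate the idea behind filling a template, the sketch below substitutes `key:value` pairs into `${...}` placeholders using Python's standard `string.Template`. It is a simplified illustration, not the code of `tools/fill_template.py`; the config fragment and the `fill` helper are hypothetical.

```python
from string import Template

# Hypothetical, shortened config fragment with ${...} placeholders.
template_text = """max_batch_size: ${triton_max_batch_size}
parameters {
  key: "tokenizer_dir"
  value { string_value: "${tokenizer_dir}" }
}
instance_group [ { count: ${preprocessing_instance_count} } ]
"""

def fill(text, substitutions):
    """Parse 'key:value,key:value' and substitute the values into the text."""
    mapping = {}
    for pair in substitutions.split(","):
        key, value = pair.split(":", 1)  # split once so values may contain ':'
        mapping[key] = value
    # safe_substitute leaves unknown placeholders untouched instead of raising
    return Template(text).safe_substitute(mapping)

filled = fill(
    template_text,
    "tokenizer_dir:/path/to/tokenizer,triton_max_batch_size:64,preprocessing_instance_count:1",
)
print(filled)
```

Because the pairs are separated by commas and colons, a value that itself contains a comma would need different handling in this sketch.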



- a) View changes in the **Pre-processing config file** *[triton_model_repo/preprocessing/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/preprocessing/config.pbtxt)*


|  Line   |Parameters | Value | 
|-|-|-| 
|   124   | `tokenizer_dir` | **`/workspace/app/model/Llama-2-7b-chat-hf-merged`**|
|   29    | `triton_max_batch_size` |64|
|   137   | `preprocessing_instance_count` | 1|

---

- b) View changes in the **Post-processing config file**  *[triton_model_repo/postprocessing/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/postprocessing/config.pbtxt)*



|  Line   | Parameters | Value | 
|-|-|-| 
|   97    | `tokenizer_dir` | **`/workspace/app/model/Llama-2-7b-chat-hf-merged`**|
|   29    | `triton_max_batch_size` |64|
|   110   | `postprocessing_instance_count` | 1|

---

- c) View changes in the **tensorrt_llm_bls config file**  *[triton_model_repo/tensorrt_llm_bls/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/config.pbtxt)*


|  Line   | Parameters | Value | 
|-|-|-| 
|   29    | `triton_max_batch_size` | 64|
|   32    | `decoupled_mode` |False|
|   244   | `bls_instance_count` | 1|
|   226   | `accumulate_tokens` |False|


---

- d) View changes in the **Ensemble config file**  *[triton_model_repo/ensemble/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/ensemble/config.pbtxt)*

 
|  Line   | Parameters | Value|
|-|-|-|
|    29   | `triton_max_batch_size` |64 |

---

- e)  View changes in the **tensorrt_llm config file**  *[triton_model_repo/tensorrt_llm/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm/config.pbtxt)*


|  Line   | Parameters | Value|
|-|-|-|
|   28    | `triton_backend` | "tensorrtllm" |
|   29    | `triton_max_batch_size` | 64 |
|   32    | `decoupled_mode` | False |
|   350   | `max_beam_width` | 1 |
|   368   | `engine_dir` | **`/workspace/app/source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1`** |
|   374   | `max_tokens_in_paged_kv_cache` | 2560 |
|   380   | `max_attention_window_size` | 2560 |
|   398   | `kv_cache_free_gpu_mem_fraction` | 0.5 |
|   423   | `exclude_input_in_output` | True |
|   453   | `enable_kv_cache_reuse` | False |
|   362   | `batching_strategy` | V1 |
|    37   | `max_queue_delay_microseconds` | 0 |
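
As a quick sanity check after filling the templates, you could scan a config file's text for `${...}` placeholders that were not substituted. The helper below is a hypothetical sketch using only the standard library; the sample text is made up for illustration.

```python
import re

# Matches ${name} where name is a valid identifier-like key.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def unfilled_placeholders(config_text):
    """Return the sorted, de-duplicated names of ${...} placeholders left in the text."""
    return sorted(set(PLACEHOLDER.findall(config_text)))

# Hypothetical fragment: one value filled, one placeholder left over.
sample = 'max_batch_size: 64\nstring_value: "${engine_dir}"\n'
leftover = unfilled_placeholders(sample)
print(leftover)
```

To check a real file, pass in its contents, for example `pathlib.Path("triton_model_repo/tensorrt_llm/config.pbtxt").read_text()`. A leftover placeholder is not necessarily an error, since optional parameters may be left unset; treat the list as a prompt for inspection.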



### Launch Triton server

Launch the Triton server with the following steps:

- Press `Ctrl+Shift+L` and open a new terminal
  <center><img src="images/terminal.png"  alt-text="terminal"/></center>
- On the terminal, navigate to the launch script folder by running this command: `cd ../../source_code/tensorrtllm_backend`
- Start the Triton Server with this command: `python3 scripts/launch_triton_server.py  --world_size=1  --model_repo=triton_model_repo`


Startup takes a few minutes. Once the models are deployed successfully, the server produces logs similar to the screenshot below.

<center><img src="images/triton-server.png"  alt-text="server"/></center>
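
Instead of watching the logs, you could also poll the server's readiness endpoint from Python. The sketch below assumes the standard Triton HTTP health route `/v2/health/ready` on port 8000; the helper names `http_ready` and `wait_until_ready` are hypothetical. The demonstration at the end uses a fake check, so it runs without a server.

```python
import time
import urllib.error
import urllib.request

def http_ready(url="http://localhost:8000/v2/health/ready"):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet
        return False

def wait_until_ready(check=http_ready, timeout_s=300.0, interval_s=5.0, sleep=time.sleep):
    """Call check() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)

# Offline demonstration with a fake check that succeeds on the third call.
attempts = iter([False, False, True])
ready = wait_until_ready(check=lambda: next(attempts), timeout_s=10,
                         interval_s=0, sleep=lambda s: None)
print(ready)
```

Against the running server, calling `wait_until_ready()` with its defaults would poll every five seconds for up to five minutes.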

## Query the server with the Triton generate endpoint

You can query the server using Triton's
[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md)
with a curl command based on the following general format within your client
environment/container:

```bash
curl -X POST localhost:$HTTP_PORT/v2/models/${MODEL_NAME}/generate -d '{"{PARAM1_KEY}": "{PARAM1_VALUE}", ... }'
```

In the case of the models used in this example, you can replace MODEL_NAME with `ensemble`. Examining the
ensemble model's config.pbtxt file, you can see that 4 parameters are required to generate a response
for this model:

- "text_input": Input text to generate a response from
- "max_tokens": The number of requested output tokens
- "bad_words": A list of bad words (can be empty)
- "stop_words": A list of stop words (can be empty)

Therefore, we can query the server in the following way:

```bash
curl -X POST localhost:$HTTP_PORT/v2/models/ensemble/generate -d '{"text_input": "explain what is astrophotography?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
```
*Note: The value of HTTP_PORT is already set within the docker container to 8000.*
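
The same request can be assembled programmatically. The sketch below builds the URL and JSON body for the four parameters listed above; `build_generate_request` is a hypothetical helper, and the default port of 8000 follows the note above. It only constructs the request and does not send it.

```python
import json

def build_generate_request(text_input, max_tokens, bad_words="", stop_words="",
                           model_name="ensemble", http_port=8000):
    """Return (url, json_body) for a generate request to the given model."""
    if not isinstance(max_tokens, int) or max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")
    payload = {
        "text_input": text_input,   # input text to generate a response from
        "max_tokens": max_tokens,   # number of requested output tokens
        "bad_words": bad_words,     # may be empty
        "stop_words": stop_words,   # may be empty
    }
    url = f"http://localhost:{http_port}/v2/models/{model_name}/generate"
    return url, json.dumps(payload)

url, body = build_generate_request("explain what is astrophotography?", 20)
print(url)
print(body)
```

The returned body can be passed to curl with `-d`, or to any HTTP client as the POST data.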

In [None]:
!curl -X POST localhost:$HTTP_PORT/v2/models/ensemble/generate -d '{"text_input": "explain what is astrophotography?", "max_tokens": 200, "bad_words": "", "stop_words": ""}'

Which should return a result similar to (formatted for readability):
```json
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[
...
  "sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"[/INST] Astrophotography is a branch of photography that deals with the capture of images of celestial objects such as stars, planets, and galaxies. Astrophotography involves the use of specialized equipment such as telescopes, cameras, and software to capture high-quality images of these objects. Astrophotography can be used to study the structure and behavior of celestial objects, as well as to create beautiful and awe-inspiring images that can be appreciated by people of all ages and backgrounds. \n\nAstrophotography is a challenging and rewarding hobby that requires a combination of technical knowledge, creativity, and patience. It is a great way to learn about the universe and to appreciate its beauty and complexity. \n\nIf you are interested in astrophotography, there are many resources available to help you get started. You can find books, websites, and online communities"}
```

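You can also send a follow-up prompt. Note that each request to the generate endpoint is independent: the server keeps no conversation history, so the model only sees the text in the current request.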

In [None]:
!curl -X POST localhost:$HTTP_PORT/v2/models/ensemble/generate -d '{"text_input": " I want you to explain further on astrophotography ", "max_tokens":200 , "bad_words": "", "stop_words": ""}'

Likely output:


```json
...
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"and how to take a picture of the milky way.  I want you to explain how to use a tripod, how to use a camera, how to use a lens, and how to use a remote shutter release. I want you to explain how to use a tripod, how to use a camera, how to use a lens, and how to use a remote shutter release.  I want you to explain how to use aperture, shutter speed, and ISO.  I want you to explain how to use a polarizing filter.  I want you to explain how to use a star tracker.  I want you to explain how to use a camera with a mirror lock-up.  I want you to explain how to use a camera with a mirror lock-up and a remote shutter release.  I want you to explain how to use a camera with a mirror lock-up, a remote shutter release, and a self-timer.  I want you to explain how to use a camera with a mirror lock-up, a remote shutter release, a self-timer, and a flash"}

```
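
Because each request is independent, a follow-up question only has context if the earlier turns are included in `text_input`. The sketch below builds such a prompt. It assumes the fine-tuned model follows the Llama-2 chat convention of `[INST] ... [/INST]` turns separated by `</s><s>`, which the `[/INST]` markers in the outputs suggest but do not confirm. The leading `<s>` is omitted on the assumption that the tokenizer adds it, since sample outputs later in this notebook begin with `<s>`. `build_chat_prompt` is a hypothetical helper.

```python
def build_chat_prompt(history, question):
    """Build one prompt string from prior (user, assistant) turns plus a new question."""
    parts = []
    for user_msg, assistant_msg in history:
        # Each completed turn is closed with </s> and the next one opened with <s>
        parts.append(f"[INST] {user_msg.strip()} [/INST] {assistant_msg.strip()} </s><s>")
    parts.append(f"[INST] {question.strip()} [/INST]")
    return "".join(parts)

history = [("explain what is astrophotography?",
            "Astrophotography is photography of celestial objects.")]
prompt = build_chat_prompt(history, "I want you to explain further on astrophotography")
print(prompt)
```

The resulting string would be sent as the `text_input` field; whether it improves the answer depends on how the model was fine-tuned.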

### Querying and Formatting using Python

The raw JSON response is hard to read. Let us send the same request from Python instead and print only the generated text:

In [None]:
import os

import requests

# Retrieve the HTTP port from environment variables
http_port = os.getenv('HTTP_PORT')

# Check if HTTP_PORT is set; raise instead of calling exit(),
# which would end the notebook kernel session
if http_port is None:
    raise RuntimeError("HTTP_PORT environment variable is not set.")

# Set the URL with the HTTP port
url = f'http://localhost:{http_port}/v2/models/ensemble/generate'

In [None]:
# Define the payload
input_text = "explain what is astrophotography?"
payload = {
    "text_input": input_text,
    "max_tokens": 200,
    "bad_words": "",
    "stop_words": ""
}

# Make a POST request
response = requests.post(url, json=payload)

# Check if the request was successful
if response.status_code == 200:
    # Parse the response
    data = response.json()
    output_text = data.get('text_output')

    # Format and print the output
    print(f"Input: {input_text}")
    print(f"Output: {output_text}")
else:
    print(f"Error: {response.status_code}")


Likely output is as follows:

```text
Input: explain what is astrophotography?
Output: <s>explain what is astrophotography? [/INST] Astrophotography is a type of photography that focuses on capturing images of celestial objects such as stars, planets, and galaxies.

Astrophotography is a challenging form of photography as it requires the use of specialized equipment such as telescopes, cameras, and mounts. The images are often captured in low light conditions and require long exposure times to capture the details of the celestial objects.

Astrophotography is used to capture images of celestial objects for scientific research, educational purposes, and for the general public to appreciate the beauty of the night sky.  [/INST] Astrophotography is a type of photography that focuses on capturing images of celestial objects such as stars, planets, and galaxies. It is a challenging form of photography as it requires the use of specialized equipment such as telescopes, cameras

```

The output repeats itself: the answer starts over after a second `[/INST]` tag. Let us truncate the repetition with a Python function and try again.

In [None]:
def truncate_repetitive_text(text, n_words=10):
    """Cut `text` just before the first repeated run of `n_words` consecutive words."""
    words = text.split()
    seen_phrases = set()

    for i in range(len(words) - n_words + 1):
        phrase = ' '.join(words[i:i + n_words])
        if phrase in seen_phrases:
            # Once a repetition is found, return the text up to that point
            return ' '.join(words[:i])
        seen_phrases.add(phrase)

    # If no repetition is found, return the entire text (including the final words)
    return ' '.join(words)

In [None]:
# Define the payload (same request as before)
input_text = "explain what is astrophotography?"
payload = {
    "text_input": input_text,
    "max_tokens": 200,
    "bad_words": "",
    "stop_words": ""
}

# Make a POST request
response = requests.post(url, json=payload)

# Check if the request was successful
if response.status_code == 200:
    # Parse the response
    data = response.json()
    output_text = data.get('text_output')

    # Truncate repetitive text
    output_text = truncate_repetitive_text(output_text)

    # Format and print the output
    print(f"Input: {input_text}")
    print(f"Output: {output_text}")
else:
    print(f"Error: {response.status_code}")


Likely output:

```text
Input: explain what is astrophotography?
Output: <s>explain what is astrophotography? [/INST] Astrophotography is a type of photography that focuses on capturing images of celestial objects such as stars, planets, and galaxies. Astrophotography is a challenging form of photography as it requires the use of specialized equipment such as telescopes, cameras, and mounts. The images are often captured in low light conditions and require long exposure times to capture the details of the celestial objects. Astrophotography is used to capture images of celestial objects for scientific research, educational purposes, and for the general public to appreciate the beauty of the night sky.
```

The output is much cleaner. With a few more simple functions like this one, we can build a complete post-processing wrapper that cleans the results we get.
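
As one example of such a function, the sketch below strips the leading `<s>` marker, the echoed prompt, and the `[/INST]` tag, and cuts the text at any later `[/INST]` tag, where the model starts repeating itself. `clean_output` is a hypothetical helper written against the output format shown above; other models may need different rules.

```python
def clean_output(text, prompt):
    """Strip the <s> marker, the echoed prompt, and [/INST] tags from the output."""
    tag = "[/INST]"
    cleaned = text.strip()
    if cleaned.startswith("<s>"):
        cleaned = cleaned[len("<s>"):].lstrip()
    if cleaned.startswith(prompt):
        cleaned = cleaned[len(prompt):].lstrip()
    if cleaned.startswith(tag):
        cleaned = cleaned[len(tag):].lstrip()
    # A later [/INST] marks the start of a repeated answer: keep only the first part
    return cleaned.split(tag, 1)[0].rstrip()

raw = ("<s>explain what is astrophotography? [/INST] Astrophotography is a type "
       "of photography.  [/INST] Astrophotography is a type")
answer = clean_output(raw, "explain what is astrophotography?")
print(answer)
```

This can be combined with `truncate_repetitive_text` by applying one function to the result of the other.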

## Benchmark Test 

- **End-to-End test**: The testing script sends requests to the deployed `ensemble` model, which chains three models: preprocessing, tensorrt_llm, and postprocessing. The test reports the total latency across all three stages.


#### End-to-End Test Using FP16 TRT Engine

In [None]:
!python3 ../../source_code/tensorrtllm_backend/tools/inflight_batcher_llm/end_to_end_test.py  \
         --dataset  ../../data/simple_data.json  \
         --max-input-len 500

Likely output:
```text
...
context_logits.shape: (1, 1, 1)
generation_logits.shape: (1, 1, 1, 1)
[INFO] Start testing on 13 prompts.
[INFO] Functionality test succeed.
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 13 prompts.
[INFO] Total Latency: 4769.683 ms
```
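
To show what a total-latency figure like the one above means, here is a minimal timing sketch. It sends prompts one at a time through an injectable `send` function and sums the elapsed time. The helper `measure_total_latency_ms` is hypothetical, and the actual `end_to_end_test.py` script may issue and time its requests differently. The demonstration uses a stand-in for the HTTP request, so it runs without a server.

```python
import time

def measure_total_latency_ms(prompts, send):
    """Send prompts one at a time and return (total_ms, per_prompt_ms)."""
    per_prompt_ms = []
    for prompt in prompts:
        start = time.perf_counter()
        send(prompt)
        per_prompt_ms.append((time.perf_counter() - start) * 1000.0)
    return sum(per_prompt_ms), per_prompt_ms

def fake_send(prompt):
    """Stand-in for an HTTP request to the ensemble model."""
    time.sleep(0.01)

total_ms, per_prompt_ms = measure_total_latency_ms(["a", "b", "c"], fake_send)
print(f"Total latency: {total_ms:.3f} ms over {len(per_prompt_ms)} prompts")
```

To time the real server, `send` could wrap a `requests.post` call to the generate endpoint.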

### Shutdown Triton Server

Run the cell below to shut down the Triton server; otherwise, you will get an error when you run the next section.

In [None]:
!kill $(ps aux | grep '[t]ritonserver' | awk '{print $2}')

---
### Benchmark Test Using INT8 TRT Engine

In [None]:
%%bash
# Create the model repository that will be used by the Triton server
cd ../../source_code/tensorrtllm_backend/

# make TRT engine folder for triton 
mkdir -p triton_model_repo/tensorrt_llm/2

# Copy the TRT engine to triton_model_repo/tensorrt_llm/2
cp  ../../model/trt_engines/weight_only/1-gpu/* triton_model_repo/tensorrt_llm/2

- Open the **tensorrt_llm config file** and make the change listed in the table below *[triton_model_repo/tensorrt_llm/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm/config.pbtxt)*

| Line # | Name | Value|
| :----------------------: | :----------------------: | :-----------------------------: |
| 368|`engine_dir` | **`/workspace/app/source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm/2`** |

- Launch Triton server
    - Press `Ctrl+Shift+L` and open a new terminal
    - On the terminal, navigate to the launch script folder by running this command: `cd ../../source_code/tensorrtllm_backend`
    - Start the Triton Server with this command: `python3 scripts/launch_triton_server.py  --world_size=1  --model_repo=triton_model_repo`

In [None]:
!curl -X POST localhost:$HTTP_PORT/v2/models/ensemble/generate -d '{"text_input": "explain what is astrophotography?", "max_tokens": 200, "bad_words": "", "stop_words": ""}'

#### End-to-End Test

In [None]:
!python3 ../../source_code/tensorrtllm_backend/tools/inflight_batcher_llm/end_to_end_test.py  \
         --dataset  ../../data/simple_data.json \
         --max-input-len 500

Likely output:

```text
...
context_logits.shape: (1, 1, 1)
generation_logits.shape: (1, 1, 1, 1)
[INFO] Start testing on 13 prompts.
[INFO] Functionality test succeed.
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 13 prompts.
[INFO] Total Latency: 3556.254 ms
```
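
The two sample results can be compared with a little arithmetic. The numbers below are copied from the sample outputs in this notebook; your own measurements will differ from run to run and between GPUs.

```python
# Total latencies copied from the sample outputs above (13 prompts each).
fp16_ms = 4769.683   # FP16 engine
int8_ms = 3556.254   # INT8 (weight-only) engine
num_prompts = 13

speedup = fp16_ms / int8_ms
reduction_pct = (fp16_ms - int8_ms) / fp16_ms * 100.0

print(f"FP16 average per prompt: {fp16_ms / num_prompts:.1f} ms")
print(f"INT8 average per prompt: {int8_ms / num_prompts:.1f} ms")
print(f"Speedup: {speedup:.2f}x, latency reduction: {reduction_pct:.1f}%")
```

With these sample numbers, the INT8 engine is roughly 1.34x faster, which is about 25% lower total latency. A single run is not a rigorous benchmark, so repeat the test several times before drawing conclusions.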

### Shutdown Triton Server

Run the cell below to shut down the Triton server.

In [None]:
!kill $(ps aux | grep '[t]ritonserver' | awk '{print $2}')

Congratulations, you have successfully deployed the TensorRT engine and sent inference requests to the server!

There are more features available in [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [Triton Server](https://github.com/triton-inference-server/tensorrtllm_backend) that benefit different use cases. Refer to the respective GitHub repositories for the latest releases and features.

---
## Licensing
Copyright © 2023 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
