<p> <center> <a href="../../LLM-Application.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="trt-llama-chat.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 33%; text-align: center;">
        <a href="llama-chat-finetune.ipynb">1</a>
        <a href="trt-llama-chat.ipynb">2</a>
        <a>3</a>
        <a href="challenge.ipynb">4</a>
    </span>
    <span style="float: left; width: 33%; text-align: right;"><a href="challenge.ipynb">Next Notebook</a></span>
</div>

# Deploying the Fine-tuned Model Using Triton Inference Server
--- 

In this notebook, we focus on using the TensorRT-LLM Backend to deploy the TensorRT engine built in the previous notebook. The goal of the TensorRT-LLM Backend is to let you serve [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) models with Triton Inference Server. You can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/backend).


## Using the TensorRT-LLM Backend

To use the TensorRT-LLM Backend, follow the steps below:

- Clone the TensorRT-LLM Backend repo ([tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend)). This has already been done inside the container used for this lab.
   <img src="images/trt-clone.png" />
- Navigate into the cloned repo and create the model repository `triton_model_repo`
   <img src="images/trt-model-repo.png" />
- Copy all files in the `all_models/inflight_batcher_llm` directory into `triton_model_repo`
   <img src="images/trt-copy-folders.png" />

- There are four model directories within `all_models/inflight_batcher_llm`:
    - **preprocessing**: This model tokenizes the input, converting prompts (string) to input_ids (list of ints).
    - **tensorrt_llm**: This model is a wrapper around your TensorRT-LLM model and is used for inference.
    - **postprocessing**: This model de-tokenizes the output, converting output_ids (list of ints) to outputs (string).
    - **ensemble**: This model chains the three models above together as: `preprocessing -> tensorrt_llm -> postprocessing`

To learn more about ensemble models, see the
[Triton architecture guide](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models).

- Copy the model tensorrt engine from `../../model/trt_engines/fp16/1-gpu` into the `triton_model_repo/tensorrt_llm/1` directory
   <img src="images/trt-copy-engine.png" />

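To make the `preprocessing -> tensorrt_llm -> postprocessing` chain concrete, here is a toy sketch in plain Python. It is not how Triton implements ensembles: the whitespace "tokenizer", the on-the-fly vocabulary, and the echoing "model" are stand-ins invented purely to show how data flows between the three steps.

```python
# Toy stand-ins for the three models chained by the ensemble.
# None of this is Triton or TensorRT-LLM code; it only mirrors the data flow.

def preprocessing(prompt, vocab):
    """Tokenize: prompt (string) -> input_ids (list of ints)."""
    # setdefault assigns the next free id to words not seen before
    return [vocab.setdefault(word, len(vocab)) for word in prompt.split()]

def tensorrt_llm(input_ids, max_tokens):
    """Placeholder 'model': echoes at most max_tokens of the input ids."""
    return input_ids[:max_tokens]

def postprocessing(output_ids, vocab):
    """De-tokenize: output_ids (list of ints) -> output (string)."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[i] for i in output_ids)

def ensemble(prompt, max_tokens):
    """Chain the steps: preprocessing -> tensorrt_llm -> postprocessing."""
    vocab = {}
    input_ids = preprocessing(prompt, vocab)
    output_ids = tensorrt_llm(input_ids, max_tokens)
    return postprocessing(output_ids, vocab)

print(ensemble("explain what is astrophotography?", max_tokens=2))
```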
Now let's execute these steps. First, copy the `tensorrtllm_backend` repo from within the container to the `source_code` directory for ease of access.

In [None]:
# copy tensorrtllm_backend to source_code
!cp -r /workspace/tensorrtllm_backend /workspace/app/source_code

Next, we navigate into `tensorrtllm_backend` and execute the steps above for the FP16 TensorRT engine.

In [None]:
%%bash
# Create the model repository that will be used by the Triton server
cd /workspace/app/source_code/tensorrtllm_backend/
# -p keeps the cell from failing if the directory already exists
mkdir -p triton_model_repo

# Copy the example models to the model repository
cp -r all_models/inflight_batcher_llm/* triton_model_repo/

# Copy the TRT engine to triton_model_repo/tensorrt_llm/1/
cp  /workspace/app/model/trt_engines/fp16/1-gpu/* triton_model_repo/tensorrt_llm/1

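After the copy, it can be worth confirming that the model repository has the layout described above. The sketch below is a hypothetical helper, not part of the backend repo: it only checks that each of the four model directories contains a `config.pbtxt`, and the repository path is the one used in this lab.

```python
from pathlib import Path

# Directory names come from the list above; the helper itself is hypothetical.
EXPECTED_MODELS = ("preprocessing", "tensorrt_llm", "postprocessing", "ensemble")

def missing_model_files(repo_dir):
    """Return the expected config.pbtxt paths that are absent from repo_dir."""
    repo = Path(repo_dir)
    expected = [repo / name / "config.pbtxt" for name in EXPECTED_MODELS]
    return [str(path) for path in expected if not path.is_file()]

repo = "/workspace/app/source_code/tensorrtllm_backend/triton_model_repo"
missing = missing_model_files(repo)
print("Model repository looks complete." if not missing else f"Missing: {missing}")
```

An empty list means all four configuration files are in place; the engine files under `tensorrt_llm/1` are not checked here.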
### Modify the model configuration

The following table shows the fields that need to be modified before deployment:

<div><center>
<img src="images/ensemble.png" width="1000"/>
</center></div>


**Kindly change the key-value pair using the below table to set the correct configuration, and do not forget to save it.**

- a) Kindly open the **Pre-processing config file** and make the changes listed in the table below *[triton_model_repo/preprocessing/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/preprocessing/config.pbtxt)*

| Line # | Name | Description | 
| :----------------------:| :----------------------: | :-----------------------------: | 
| 83 | `tokenizer_dir` | The path to the tokenizer for the model. Please set the path to **`/workspace/app/model/Llama-2-7b-chat-hf-merged`**|
| 90 | `tokenizer_type` | The type of the tokenizer for the model, `t5`, `auto` and `llama` are supported. Please set the type to **`llama`** |

<center><img src="images/trt-preconfig.png"  alt="server"/></center>

---

- b) Kindly open the **tensorrt llm config file** and make the changes listed in the table below *[triton_model_repo/tensorrt_llm/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm/config.pbtxt)*

| Line # | Name | Description
| :----------------------: | :----------------------: | :-----------------------------: |
| 32|`decoupled` | Controls streaming. Decoupled mode must be set to `True` if using the streaming option from the client. Please set Decoupled to **`False`** |
| 170|`gpt_model_type` | Set to `inflight_fused_batching` when enabling in-flight batching support. To disable in-flight batching, set to `V1` , Please set as **`V1`**|
| 176|`gpt_model_path` | Path to the TensorRT-LLM engines for deployment. Please set the path to **`/workspace/app/source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1`** |

<center><img src="images/trt-llmconfig.png"  alt="server"/></center>

---
- c) Kindly open the **Post-processing config file** and make the changes listed in the table below *[triton_model_repo/postprocessing/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/postprocessing/config.pbtxt)*

| Line # | Name | Description
| :----------------------:| :----------------------: | :-----------------------------: |
| 48 | `tokenizer_dir` | The path to the tokenizer for the model. Please set the path to **`/workspace/app/model/Llama-2-7b-chat-hf-merged`** |
| 55 | `tokenizer_type` | The type of the tokenizer for the model, `t5`, `auto` and `llama` are supported. Please set the type to **`llama`** |

<center><img src="images/postconfig.png"  alt="postconfig"/></center>

---

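If you prefer to script these edits rather than make them by hand, the following sketch shows one possible approach. It is an illustration under assumptions, not a tool shipped with the backend: the helper name is hypothetical, and it assumes each setting appears in `config.pbtxt` as a `parameters` entry of the form `key: "<name>"` followed by `value: { string_value: "<value>" }`. It does not handle fields such as `decoupled`, which are not string parameters.

```python
import re

def set_string_parameter(config_text, key, value):
    """Set the string_value of the `parameters` entry named `key`.

    Assumes the entry has the form
        key: "<key>"  value: { string_value: "<old>" }
    and raises KeyError if no such entry is found.
    """
    pattern = re.compile(
        r'(key:\s*"' + re.escape(key) + r'"\s*value:\s*\{\s*string_value:\s*")[^"]*(")'
    )
    # A function replacement avoids backslash-escape surprises in `value`
    new_text, count = pattern.subn(
        lambda m: m.group(1) + value + m.group(2), config_text
    )
    if count == 0:
        raise KeyError(f"parameter {key!r} not found")
    return new_text

# Small in-memory sample standing in for a real config.pbtxt
sample = '''
parameters {
  key: "tokenizer_type"
  value: {
    string_value: "auto"
  }
}
'''
print(set_string_parameter(sample, "tokenizer_type", "llama"))
```

To apply it to a real file, read the file into a string, call the helper once per parameter, and write the result back. Check the edited file against the tables above before launching the server.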

### Launch Triton server

We can launch the Triton server with the following steps:

- Press `Ctrl+Shift+L` and open a new terminal
  <center><img src="images/terminal.png"  alt="terminal"/></center>
- On the terminal, navigate to the launch script folder by running this command: `cd /workspace/app/source_code/tensorrtllm_backend`
- Start the Triton Server with this command: `python3 scripts/launch_triton_server.py --tritonserver "/opt/tritonserver/bin/tritonserver --http-port $HTTP_PORT --allow-grpc False --allow-metrics False"  --world_size=1  --model_repo=/workspace/app/source_code/tensorrtllm_backend/triton_model_repo`


Startup takes a few minutes. Once the models are deployed successfully, the server prints logs similar to the screenshot below.

<center><img src="images/triton-server.png"  alt="server"/></center>

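Instead of watching the logs, you can poll Triton's HTTP health endpoint, `GET /v2/health/ready`, until it answers with status 200. The sketch below uses only the standard library; the function names, retry count, and delay are illustrative choices, not values from this lab.

```python
import time
import urllib.error
import urllib.request

def server_is_ready(port, host="localhost"):
    """Return True if GET /v2/health/ready answers with HTTP 200."""
    url = f"http://{host}:{port}/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet
        return False

def wait_until_ready(probe, attempts=30, delay_s=10.0):
    """Call probe() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay_s)
    return False

# Example (needs a running server and HTTP_PORT set):
# import os
# wait_until_ready(lambda: server_is_ready(os.environ["HTTP_PORT"]))
```

Passing the probe in as a callable keeps the retry loop independent of the network call, which makes it easy to reuse or test.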
## Query the server with Triton's generate endpoint

You can query the server using Triton's
[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md)
with a curl command based on the following general format within your client
environment/container:

```bash
curl -X POST localhost:$HTTP_PORT/v2/models/${MODEL_NAME}/generate -d '{"{PARAM1_KEY}": "{PARAM1_VALUE}", ... }'
```

For the models used in this example, replace `MODEL_NAME` with `ensemble`. Examining the
ensemble model's `config.pbtxt` file, you can see that four parameters are required to generate a response
for this model:

- "text_input": Input text to generate a response from
- "max_tokens": The number of requested output tokens
- "bad_words": A list of bad words (can be empty)
- "stop_words": A list of stop words (can be empty)

Therefore, we can query the server in the following way:

```bash
curl -X POST localhost:$HTTP_PORT/v2/models/ensemble/generate -d '{"text_input": "explain what is astrophotography?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
```
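The same request body can be assembled in Python before it is sent. This is a minimal sketch: the helper name and its two sanity checks are hypothetical client-side conveniences, not something the server requires.

```python
import json

def build_generate_payload(text_input, max_tokens, bad_words="", stop_words=""):
    """Assemble the four fields the ensemble model expects."""
    # Client-side sanity checks (an assumption of this sketch)
    if not text_input:
        raise ValueError("text_input must not be empty")
    if max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    return {
        "text_input": text_input,
        "max_tokens": max_tokens,
        "bad_words": bad_words,
        "stop_words": stop_words,
    }

payload = build_generate_payload("explain what is astrophotography?", 20)
print(json.dumps(payload))
```

The printed JSON matches the `-d` argument of the curl command above.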

In [None]:
!curl -X POST localhost:$HTTP_PORT/v2/models/ensemble/generate -d '{"text_input": "explain what is astrophotography?", "max_tokens": 200, "bad_words": "", "stop_words": ""}'

Which should return a result similar to (formatted for readability):
```json
{"model_name":"ensemble",
 "model_version":"1",
 "sequence_end":false,
 "sequence_id":0,
 "sequence_start":false,
  "text_output":"<s>explain what is astrophotography? [/INST] Astrophotography is a type of photography that focuses on capturing images of celestial objects such as stars, planets, and galaxies.\n\nAstrophotography is a challenging form of photography as it requires the use of specialized equipment such as telescopes, cameras, and mounts. The images are often captured in low light conditions and require long exposure times to capture the details of the celestial objects.\n\nAstrophotography is used to capture images of celestial objects for scientific research, educational purposes, and for the general public to appreciate the beauty of the night sky.  [/INST] Astrophotography is a type of photography that focuses on capturing images of celestial objects such as stars, planets, and galaxies. It is a challenging form of photography as it requires the use of specialized equipment such as telescopes, cameras"
}
```

You can also ask a follow-up question. Each request is handled independently, so the prompt must carry any context the model needs:

In [None]:
!curl -X POST localhost:$HTTP_PORT/v2/models/ensemble/generate -d '{"text_input": " I want you to explain further on astrophotography ", "max_tokens":200 , "bad_words": "", "stop_words": ""}'

Likely output:


```json

{"model_name":"ensemble",
 "model_version":"1",
 "sequence_end":false,
 "sequence_id":0,
 "sequence_start":false,
 "text_output":"<s> I want you to explain further on astrophotography  [/INST] Astrophotography is a type of photography that involves capturing images of celestial objects such as stars, planets, and galaxies.Ъ It is a challenging and rewarding form of photography that requires specialized equipment and techniques.\n\nSome of the key concepts in astrophotography include:\n\n1. Telescopes: Astrophotography requires a telescope to capture images of celestial objects. There are several types of telescopes available, including reflecting telescopes, refracting telescopes, and catadioptric telescopes.\n\n2. Mounts: A mount is required to hold the telescope steady and track the movement of celestial objects. There are several types of mounts available, including altazimuth mounts, equatorial mounts, and altazimuth-equatorial mounts.\n\n3. Cameras: Astrophotography"}

```
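Because each request stands alone, one way to give the model context is to fold earlier turns into the prompt. The sketch below is an assumption-laden illustration: it presumes the fine-tuned model follows the `[INST] ... [/INST]` convention visible in the outputs above, and the helper name is hypothetical. It leaves out the `<s>` marker, since the outputs above suggest the tokenizer adds it.

```python
def build_chat_prompt(history, question):
    """Fold earlier (question, answer) turns and a new question into one prompt."""
    parts = [f"[INST] {q.strip()} [/INST] {a.strip()}" for q, a in history]
    parts.append(f"[INST] {question.strip()} [/INST]")
    return " ".join(parts)

# Shortened answer text, made up for illustration
history = [(
    "explain what is astrophotography?",
    "Astrophotography is a type of photography that captures celestial objects.",
)]
print(build_chat_prompt(history, "I want you to explain further on astrophotography"))
```

The resulting string can be sent as `text_input`. Whether it improves the answer depends on how the model was fine-tuned.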

### Querying and Formatting using Python

The raw JSON response is hard to read. The following Python snippet sends the same request and prints only the generated text:

In [None]:
import os

import requests

# Retrieve the HTTP port from environment variables
http_port = os.getenv('HTTP_PORT')

# Check if HTTP_PORT is set; raise instead of calling exit()
# so that the notebook kernel keeps running
if http_port is None:
    raise RuntimeError("HTTP_PORT environment variable is not set.")

# Set the URL with the HTTP port
url = f'http://localhost:{http_port}/v2/models/ensemble/generate'

In [None]:
# Define the payload
input_text = "explain what is astrophotography?"
payload = {
    "text_input": input_text,
    "max_tokens": 200,
    "bad_words": "",
    "stop_words": "\n"
}

# Make a POST request
response = requests.post(url, json=payload)

# Check if the request was successful
if response.status_code == 200:
    # Parse the response
    data = response.json()
    output_text = data.get('text_output')

    # Format and print the output
    print(f"Input: {input_text}")
    print(f"Output: {output_text}")
else:
    print(f"Error: {response.status_code}")


Likely output is as follows:

```text
Input: explain what is astrophotography?
Output: <s>explain what is astrophotography? [/INST] Astrophotography is a type of photography that focuses on capturing images of celestial objects such as stars, planets, and galaxies.

Astrophotography is a challenging form of photography as it requires the use of specialized equipment such as telescopes, cameras, and mounts. The images are often captured in low light conditions and require long exposure times to capture the details of the celestial objects.

Astrophotography is used to capture images of celestial objects for scientific research, educational purposes, and for the general public to appreciate the beauty of the night sky.  [/INST] Astrophotography is a type of photography that focuses on capturing images of celestial objects such as stars, planets, and galaxies. It is a challenging form of photography as it requires the use of specialized equipment such as telescopes, cameras

```

Much of the output is repeated. Let's truncate the repetition with a small Python function and try again.

In [None]:
def truncate_repetitive_text(text, n_words=10):
    """Cut `text` at the first n_words-long phrase that has already appeared."""
    words = text.split()
    seen_phrases = set()

    for i in range(len(words) - n_words + 1):
        phrase = ' '.join(words[i:i + n_words])
        if phrase in seen_phrases:
            # Repetition starts at word i: keep everything before it
            return ' '.join(words[:i])
        seen_phrases.add(phrase)

    # No repetition found (or text shorter than n_words): keep every word
    return ' '.join(words)

In [None]:
# Define the payload
input_text = "explain what is astrophotography?"
payload = {
    "text_input": input_text,
    "max_tokens": 200,  # Same token budget as the previous request
    "bad_words": "",
    "stop_words": ""
}

# Make a POST request
response = requests.post(url, json=payload)

# Check if the request was successful
if response.status_code == 200:
    # Parse the response
    data = response.json()
    output_text = data.get('text_output')

    # Truncate repetitive text
    output_text = truncate_repetitive_text(output_text)

    # Format and print the output
    print(f"Input: {input_text}")
    print(f"Output: {output_text}")
else:
    print(f"Error: {response.status_code}")


Likely output:

```text
Input: explain what is astrophotography?
Output: <s>explain what is astrophotography? [/INST] Astrophotography is a type of photography that focuses on capturing images of celestial objects such as stars, planets, and galaxies. Astrophotography is a challenging form of photography as it requires the use of specialized equipment such as telescopes, cameras, and mounts. The images are often captured in low light conditions and require long exposure times to capture the details of the celestial objects. Astrophotography is used to capture images of celestial objects for scientific research, educational purposes, and for the general public to appreciate the beauty of the night sky.
```

The output is much cleaner. With a few more simple functions like this, we can build a complete post-processing wrapper that cleans the results we get.

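As one example of such a function, the sketch below removes the leading `<s>` marker, the echoed prompt, and anything after a second `[/INST]` marker. The helper name is hypothetical and the rules are inferred only from the sample outputs in this notebook, so treat it as a starting point rather than a general solution.

```python
def clean_generation(raw_text, prompt):
    """Strip the leading <s>, the echoed prompt, and any repeated answer."""
    text = raw_text.strip()
    if text.startswith("<s>"):
        text = text[len("<s>"):].lstrip()
    echoed = prompt.strip()
    if text.startswith(echoed):
        text = text[len(echoed):].lstrip()
    if text.startswith("[/INST]"):
        text = text[len("[/INST]"):]
    # Anything after a further [/INST] marker is treated as a repeated answer
    answer = text.split("[/INST]", 1)[0]
    return answer.strip()

# Shortened, made-up response in the same shape as the outputs above
raw = ("<s>explain what is astrophotography? [/INST] It is photography of the "
       "night sky.  [/INST] It is photography")
print(clean_generation(raw, "explain what is astrophotography?"))
```

It can be applied to `output_text` before or after `truncate_repetitive_text`.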
### Benchmark Test

There are two types of benchmark test for the TensorRT-LLM backend:
- **End-to-End Test**: The testing script sends requests to the deployed `ensemble` model, which is composed of three models: preprocessing, tensorrt_llm, and postprocessing. The test measures the total latency across all three parts of the ensemble.

- **Identity Test**: The testing script sends requests directly to the deployed `tensorrt_llm` model. The identity test latency is the TensorRT-LLM inference latency, excluding the pre/post-processing latency.

#### End-to-End Test 

In [None]:
!python3 /workspace/app/source_code/tensorrtllm_backend/tools/inflight_batcher_llm/end_to_end_test.py  \
         --dataset  /workspace/app/source_code/tensorrtllm_backend/ci/L0_backend_trtllm/simple_data.json \
         --max_input_len 200

Likely output:
```text
[INFO] Start testing on 13 prompts.
[INFO] Functionality test succeed.
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 13 prompts.
[INFO] Total Latency: 833.867 ms
```

#### Identity Test 

In [None]:
!python3 /workspace/app/source_code/tensorrtllm_backend/tools/inflight_batcher_llm/identity_test.py \
         --dataset /workspace/app/source_code/tensorrtllm_backend/ci/L0_backend_trtllm/simple_data.json \
         --max_input_len 200 \
         --tokenizer_dir /workspace/app/model/Llama-2-7b-chat-hf-merged

Likely output:
```text
Tokens per word:  1.496
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 13 prompts.
[INFO] Total Latency: 801.281 ms
Expected op tokens 20.0
```
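Subtracting the identity latency from the end-to-end latency gives a rough estimate of the time spent in pre- and post-processing. The sketch below uses the sample figures shown above; the helper name is hypothetical. The two numbers come from separate runs, so treat the result as indicative only.

```python
def processing_overhead_ms(end_to_end_ms, identity_ms):
    """Return (overhead in ms, overhead as a share of end-to-end latency)."""
    overhead = end_to_end_ms - identity_ms
    return overhead, overhead / end_to_end_ms

# Figures taken from the sample outputs above; your numbers will differ.
overhead, share = processing_overhead_ms(833.867, 801.281)
print(f"Pre/post-processing overhead: {overhead:.3f} ms ({share:.1%} of end-to-end)")
```

With these sample figures, tokenization and de-tokenization account for only a small share of the total, so the engine itself dominates the latency.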

### Shutdown Triton Server

Run the cell below to shut down the Triton server.

In [None]:
!kill $(ps aux | grep '[t]ritonserver' | awk '{print $2}')

Congratulations! We have successfully deployed the TensorRT engine and sent inference requests to the server.

There are more features available in [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [Triton Server](https://github.com/triton-inference-server/tensorrtllm_backend) that benefit different use cases. Refer to the respective GitHub repositories for the latest releases and features.

---
## Acknowledgment

This notebook is adapted from NVIDIA's [TensorRT-LLM Backend GitHub repository](https://github.com/triton-inference-server/tensorrtllm_backend).

## Licensing
Copyright © 2023 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
