<p> <center> <a href="../../Start_Here.ipynb">Home Page</a> </center> </p>
 
<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="TRT-LLM-Part1.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 33%; text-align: center;">
        <a href="TRT-LLM-Part1.ipynb">1</a>
        <a>2</a>
    </span>
</div>

# TensorRT-LLM Backend: Llama-2-7B Deployment using Triton Inference Server 
 


This notebook covers the Triton backend for [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
You can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/backend).
The goal of TensorRT-LLM Backend is to let you serve [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) models with Triton Inference Server. 

## Building the TensorRT-LLM Backend

**NOTE: The TensorRT-LLM Backend is already installed if you are working in the provided container environment, so no separate installation is needed.** For other deployment scenarios, there are several ways to obtain the TensorRT-LLM Backend:

#### Option 1. Run the Docker Container

Starting with the Triton 23.10 release, Triton includes a container with the TensorRT-LLM
Backend and Python Backend. This container should have everything needed to run a
TensorRT-LLM model. You can find this container on the
[Triton NGC page](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).

#### Option 2. Build via the build.py Script in Server Repo

Starting with the Triton 23.10 release, you can follow the steps described in the
[Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker)
guide and use the
[build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
script.

A sample command to build a Triton Server container with all options enabled is
shown below. It builds the same TRT-LLM container as the one on NGC.

```bash
BASE_CONTAINER_IMAGE_NAME=nvcr.io/nvidia/tritonserver:23.10-py3-min
TENSORRTLLM_BACKEND_REPO_TAG=release/0.5.0
PYTHON_BACKEND_REPO_TAG=r23.10

# Run the build script. The flags for some features or endpoints can be removed if not needed.
./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
              --enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
              --filesystem=gcs --filesystem=s3 --filesystem=azure_storage \
              --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
              --backend=ensemble --enable-gpu --endpoint=http --endpoint=grpc \
              --image=base,${BASE_CONTAINER_IMAGE_NAME} \
              --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
              --backend=python:${PYTHON_BACKEND_REPO_TAG}
```

`BASE_CONTAINER_IMAGE_NAME` is the base image that will be used to build the
container. By default it is set to the most recent min image of Triton on NGC
that matches the Triton release you are building for. You can change it to a
different image if needed by setting the `--image` flag, as in the command above.
The `TENSORRTLLM_BACKEND_REPO_TAG` and `PYTHON_BACKEND_REPO_TAG` are the tags of
the TensorRT-LLM backend and Python backend repositories that will be used
to build the container. You can also remove the features or endpoints that you
don't need by removing the corresponding flags.
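
If you script the build, the flag list can be assembled programmatically so that unneeded features or endpoints are simply left out. The following is a minimal sketch, not part of the Triton tooling: `build_command` is a hypothetical helper, and it only emits flags that appear in the sample command above.

```python
import shlex

def build_command(base_image, trtllm_tag, python_tag,
                  features, filesystems, endpoints):
    """Assemble the build.py argument list from the pieces you want to keep."""
    args = ["./build.py", "-v", "--no-container-interactive"]
    args += [f"--enable-{name}" for name in features]
    args += [f"--filesystem={name}" for name in filesystems]
    args += [f"--endpoint={name}" for name in endpoints]
    args += [
        "--backend=ensemble",
        f"--image=base,{base_image}",
        f"--backend=tensorrtllm:{trtllm_tag}",
        f"--backend=python:{python_tag}",
    ]
    return args

# Example: keep only the HTTP and gRPC endpoints and drop the cloud filesystems.
command = build_command(
    base_image="nvcr.io/nvidia/tritonserver:23.10-py3-min",
    trtllm_tag="release/0.5.0",
    python_tag="r23.10",
    features=["logging", "stats", "metrics", "gpu-metrics", "gpu"],
    filesystems=[],
    endpoints=["http", "grpc"],
)
print(shlex.join(command))
```

The printed line can be pasted into a shell from the root of the server repository.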


## Using the TensorRT-LLM Backend

We will go through four steps to serve the TensorRT-LLM model with the Triton TensorRT-LLM Backend in a single-GPU environment. The example uses the LLaMA-v2 7B model from
the [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama).

### 1. Prepare TensorRT-LLM engines

We can skip this step because the engines are already prepared. To deploy a different model, follow the [guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/) in the
TensorRT-LLM repository for details on how to prepare the corresponding engines.


### 2. Create the model repository

There are four models in the `all_models/inflight_batcher_llm`
directory that will be used in this example:
- "preprocessing": Tokenizes the input, converting prompts (strings) to `input_ids` (lists of ints).
- "tensorrt_llm": Wraps your TensorRT-LLM model and runs inference.
- "postprocessing": De-tokenizes the output, converting `output_ids` (lists of ints) to outputs (strings).
- "ensemble": Chains the three models above together:
preprocessing -> tensorrt_llm -> postprocessing

To learn more about ensemble models, see the
[Triton architecture guide](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models).
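
To make the chaining concrete, here is a toy sketch of the data flow through the ensemble. It does not use Triton or a real tokenizer: the vocabulary is made up, and the `tensorrt_llm` function is a placeholder that only echoes its input and repeats the last token, so treat it purely as an illustration of how the three stages hand data to each other.

```python
from typing import List

# Made-up vocabulary for illustration only; the real model uses the Llama tokenizer.
VOCAB = {"what": 0, "is": 1, "triton": 2}
ID_TO_WORD = {token_id: word for word, token_id in VOCAB.items()}

def preprocessing(prompt: str) -> List[int]:
    """Tokenize: prompt (string) -> input_ids (list of ints)."""
    return [VOCAB[word] for word in prompt.lower().split()]

def tensorrt_llm(input_ids: List[int], max_tokens: int) -> List[int]:
    """Placeholder for the engine: echo the input and append max_tokens new ids."""
    return input_ids + [input_ids[-1]] * max_tokens

def postprocessing(output_ids: List[int]) -> str:
    """De-tokenize: output_ids (list of ints) -> output (string)."""
    return " ".join(ID_TO_WORD[token_id] for token_id in output_ids)

def ensemble(prompt: str, max_tokens: int) -> str:
    """Chain the stages: preprocessing -> tensorrt_llm -> postprocessing."""
    return postprocessing(tensorrt_llm(preprocessing(prompt), max_tokens))

print(ensemble("what is triton", max_tokens=2))
```

In the real deployment, Triton performs this routing itself based on the ensemble model's configuration, so the client only ever talks to the `ensemble` model.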

In [None]:
%%bash
# Create the model repository that will be used by the Triton server
cd ../../source_code/tensorrtllm_backend 
mkdir -p triton_model_repo

# Copy the example models to the model repository
cp -r all_models/inflight_batcher_llm/* triton_model_repo/

# Copy the TRT engine to triton_model_repo/tensorrt_llm/1/
cp ../models/llama-engines/fp16/1-gpu/* triton_model_repo/tensorrt_llm/1
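
Before editing the configuration files, it can help to confirm that the copy produced the expected layout. The check below is a small sketch: `missing_configs` is a hypothetical helper, and it assumes each of the four models keeps a `config.pbtxt` directly inside its directory, as the file paths in the next step suggest.

```python
from pathlib import Path

# The four model directories named in this section.
EXPECTED_MODELS = ("preprocessing", "tensorrt_llm", "postprocessing", "ensemble")

def missing_configs(repo_dir):
    """Return the expected models whose config.pbtxt is not present under repo_dir."""
    repo = Path(repo_dir)
    return [
        name for name in EXPECTED_MODELS
        if not (repo / name / "config.pbtxt").is_file()
    ]

# Path relative to the notebook, matching the cell above.
missing = missing_configs("../../source_code/tensorrtllm_backend/triton_model_repo")
print("Missing config.pbtxt for:", missing if missing else "none")
```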

### 3. Modify the model configuration

The tables below list the fields that need to be modified before deployment.

<div><center>
<img src="images/ensemble.png" width="1000"/>
</center></div>


**Update the key-value pairs listed in the tables below to set the correct configuration, and remember to save each file.**

- a) Open the pre-processing config file *[triton_model_repo/preprocessing/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/preprocessing/config.pbtxt)* and make the changes listed in the table below:

| Line # | Name | Description | 
| :----------------------:| :----------------------: | :-----------------------------: | 
| 83 | `tokenizer_dir` | The path to the tokenizer for the model. In this example, set the path to `/code/tensorrt_llm/source_code/models/llama-2-7b`. |
| 90 | `tokenizer_type` | The type of the tokenizer for the model; `t5`, `auto`, and `llama` are supported. In this example, set the type to `llama`. |

- b) Open the tensorrt_llm config file *[triton_model_repo/tensorrt_llm/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm/config.pbtxt)* and make the changes listed in the table below:

| Line # | Name | Description |
| :----------------------: | :----------------------: | :-----------------------------: |
| 32 | `decoupled` | Controls streaming. Decoupled mode must be set to `True` if using the streaming option from the client. In this example, we set it to `False`. |
| 170 | `gpt_model_type` | Set to `inflight_fused_batching` to enable in-flight batching support. To disable in-flight batching, set to `V1`. In this example, set it to `V1`. |
| 176 | `gpt_model_path` | Path to the TensorRT-LLM engines for deployment. In this example, set the path to `/code/tensorrt_llm/source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1`. |

- c) Open the post-processing config file *[triton_model_repo/postprocessing/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/postprocessing/config.pbtxt)* and make the changes listed in the table below:

| Line # | Name | Description |
| :----------------------:| :----------------------: | :-----------------------------: |
| 48 | `tokenizer_dir` | The path to the tokenizer for the model. In this example, set the path to `/code/tensorrt_llm/source_code/models/llama-2-7b`. |
| 55 | `tokenizer_type` | The type of the tokenizer for the model; `t5`, `auto`, and `llama` are supported. In this example, set the type to `llama`. |

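If you prefer not to edit the files by hand, the same changes can be scripted. The snippet below is only a sketch under assumptions: `set_parameter` is a hypothetical helper, and the `SAMPLE` text imitates the usual `parameters { key ... value { string_value ... } }` layout of a Triton `config.pbtxt`; the files in your repository may be laid out differently, so check the result before launching the server. It does not handle fields outside `parameters` blocks, such as `decoupled`.

```python
import re

def set_parameter(config_text, key, value):
    """Replace the string_value of the parameter named `key` in pbtxt text."""
    pattern = re.compile(
        r'(key:\s*"' + re.escape(key) + r'"\s*value:?\s*\{\s*string_value:\s*")'
        r'[^"]*(")'
    )
    # A function replacement avoids any backslash interpretation in `value`.
    new_text, count = pattern.subn(
        lambda match: match.group(1) + value + match.group(2), config_text
    )
    if count != 1:
        raise KeyError(f"expected exactly one parameter named {key!r}, found {count}")
    return new_text

# Assumed layout, for illustration only.
SAMPLE = '''parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "${tokenizer_dir}"
  }
}
parameters {
  key: "tokenizer_type"
  value: {
    string_value: "${tokenizer_type}"
  }
}
'''

updated = set_parameter(SAMPLE, "tokenizer_dir",
                        "/code/tensorrt_llm/source_code/models/llama-2-7b")
updated = set_parameter(updated, "tokenizer_type", "llama")
print(updated)
```

To apply this to a real file, read it with `Path(...).read_text()`, pass the text through `set_parameter`, and write the result back.
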
### 4. Launch Triton server

Launch the Triton server as follows:

- **1. Press `Ctrl+Shift+L`**
- **2. Open a new terminal**
- **3. Run the following commands:**
```bash
# Move to the correct folder of the launch script
cd /code/tensorrt_llm/source_code/tensorrtllm_backend
# Run the Triton server 
python3 scripts/launch_triton_server.py --tritonserver "/opt/tritonserver/bin/tritonserver --http-port $HTTP_PORT --allow-grpc False --allow-metrics False" \
                                        --world_size=1 \
                                        --model_repo=/code/tensorrt_llm/source_code/tensorrtllm_backend/triton_model_repo
```

The server takes a few minutes to start. Once it is successfully deployed, it produces logs similar to the following:
```
I1117 17:38:57.345404 1954858 http_server.cc:4497] Started HTTPService at 0.0.0.0:9876
```
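
Instead of watching the logs, you can poll Triton's HTTP readiness endpoint (`/v2/health/ready`) from the notebook until the server answers. This is a minimal sketch: `http_ready` and `wait_until_ready` are hypothetical helper names, and the number of attempts and the delay are arbitrary values you may want to tune.

```python
import time
import urllib.error
import urllib.request

def http_ready(url, timeout=2.0):
    """Return True if a GET on the readiness URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False

def wait_until_ready(url, attempts=60, delay=5.0, probe=http_ready, sleep=time.sleep):
    """Call probe(url) up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False

# Example (requires a running server and the HTTP_PORT variable used above):
# import os
# wait_until_ready(f"http://localhost:{os.environ['HTTP_PORT']}/v2/health/ready")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.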


## Query the server with the Triton-generated endpoint

You can query the server using Triton's
[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md)
with a curl command based on the following general format within your client
environment/container:

```bash
curl -X POST localhost:$HTTP_PORT/v2/models/${MODEL_NAME}/generate -d '{"{PARAM1_KEY}": "{PARAM1_VALUE}", ... }'
```

For the models used in this example, you can replace `MODEL_NAME` with `ensemble`. Examining the
ensemble model's config.pbtxt file, you can see that four parameters are required to generate a response
for this model:

- "text_input": Input text to generate a response from
- "max_tokens": The number of requested output tokens
- "bad_words": A list of bad words (can be empty)
- "stop_words": A list of stop words (can be empty)

Therefore, we can query the server in the following way:

```bash
curl -X POST localhost:$HTTP_PORT/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
```
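
The same request body can be assembled in Python, which avoids hand-writing JSON inside a shell string. This is a small sketch: `build_generate_payload` is a hypothetical helper that only fills in the four fields listed above and leaves both word lists empty by default.

```python
import json

def build_generate_payload(text_input, max_tokens, bad_words="", stop_words=""):
    """Return the request body for the ensemble model's generate endpoint."""
    if not isinstance(max_tokens, int) or max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    return {
        "text_input": text_input,
        "max_tokens": max_tokens,
        "bad_words": bad_words,
        "stop_words": stop_words,
    }

payload = build_generate_payload("What is machine learning?", 20)
body = json.dumps(payload)  # JSON string suitable for curl's -d option
print(body)
```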

In [None]:
!curl -X POST localhost:$HTTP_PORT/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'

This should return a result similar to the following (formatted for readability):
```json
{
  "model_name": "ensemble",
  "model_version": "1",
  "sequence_end": false,
  "sequence_id": 0,
  "sequence_start": false,
  "text_output": "What is machine learning? Machine learning is a branch of artificial intelligence that allows computers to learn without being explicitly programmed. Machine"
}
```

You can now try editing the command above and sending your own inference request to the deployed server!

In [None]:
!curl -X POST localhost:$HTTP_PORT/v2/models/ensemble/generate -d '{"text_input": " xxxxxxxxxxxxxxxx ", "max_tokens": xxxx , "bad_words": "", "stop_words": ""}'

### Querying and Formatting using Python

The raw curl output is not easy to read, so let us send the same request from Python and format the result. Run the following cells:

In [None]:
import requests
import json
import os

# Retrieve the HTTP port from environment variables
http_port = os.getenv('HTTP_PORT')

# Check if HTTP_PORT is set; raising keeps the notebook kernel alive, unlike exit()
if http_port is None:
    raise RuntimeError("HTTP_PORT environment variable is not set.")

# Set the URL with the HTTP port
url = f'http://localhost:{http_port}/v2/models/ensemble/generate'

In [None]:
# Define the payload
input_text = "What is machine learning?"
payload = {
    "text_input": input_text,
    "max_tokens": 100,
    "bad_words": "",
    "stop_words": "\n"
}

# Make a POST request
response = requests.post(url, json=payload)

# Check if the request was successful
if response.status_code == 200:
    # Parse the response
    data = response.json()
    output_text = data.get('text_output')

    # Format and print the output
    print(f"Input: {input_text}")
    print(f"Output: {output_text}")
else:
    print(f"Error: {response.status_code}")

The expected output is as follows:

```text
Input: What is machine learning?
Output: <s>What is machine learning? Machine learning is a branch of artificial intelligence that allows computers to learn without being explicitly programmed. Machine learning is a subset of artificial intelligence. Machine learning is a branch of artificial intelligence that allows computers to learn without being explicitly programmed. Machine learning is a subset of artificial intelligence. Machine learning is a branch of artificial intelligence that allows computers to learn without being explicitly programmed. Machine learning is a subset of artificial intelligence. Machine learning is a branch of artificial intelligence that allows computers to learn without being explicitly
```

Many sentences are repeated in the output. Let us truncate the repetition with a Python function and try again.

In [None]:
def truncate_repetitive_text(text, n_words=10):
    """Cut the text at the first n_words-long phrase that has already appeared."""
    words = text.split()
    seen_phrases = set()

    for i in range(len(words) - n_words + 1):
        phrase = ' '.join(words[i:i + n_words])
        if phrase in seen_phrases:
            # Once a repetition is found, return the text up to that point
            return ' '.join(words[:i])
        seen_phrases.add(phrase)

    # If no repetition is found, return the entire text
    # (the original version dropped the last n_words - 1 words here)
    return ' '.join(words)

In [None]:
# Define the payload
input_text = "What is machine learning?"
payload = {
    "text_input": input_text,
    "max_tokens": 100,  # Same token budget as the previous request
    "bad_words": "",
    "stop_words": ""
}

# Make a POST request
response = requests.post(url, json=payload)

# Check if the request was successful
if response.status_code == 200:
    # Parse the response
    data = response.json()
    output_text = data.get('text_output')

    # Truncate repetitive text
    output_text = truncate_repetitive_text(output_text)

    # Format and print the output
    print(f"Input: {input_text}")
    print(f"Output: {output_text}")
else:
    print(f"Error: {response.status_code}")


Expected output:

```text
Input: What is machine learning?
Output: <s>What is machine learning? Machine learning is a branch of artificial intelligence that allows computers to learn without being explicitly programmed. Machine learning is a subset of artificial intelligence.
```
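
The truncated output still begins with the `<s>` token and an echo of the prompt. As one more example of a simple clean-up step, the sketch below removes both. `strip_prompt` is a hypothetical helper, and it assumes the response starts with `<s>` followed by the exact prompt text, as in the output shown above.

```python
def strip_prompt(output_text, prompt, bos_token="<s>"):
    """Remove a leading BOS token and the echoed prompt from the model output."""
    text = output_text.strip()
    if text.startswith(bos_token):
        text = text[len(bos_token):].lstrip()
    if text.startswith(prompt):
        text = text[len(prompt):].lstrip()
    return text

raw = ("<s>What is machine learning? Machine learning is a branch of artificial "
       "intelligence that allows computers to learn without being explicitly programmed.")
answer = strip_prompt(raw, "What is machine learning?")
print(answer)
```

If the output does not start with the token or the prompt, the function leaves that part of the text untouched.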

The output is much better. With a few more simple functions, we could build a complete post-processing wrapper that cleans the results. Let us now go ahead and shut down the Triton server.

### Shutdown Triton Server

Run the cell below to shut down the Triton server:

In [None]:
!kill $(ps aux | grep '[t]ritonserver' | awk '{print $2}')

Congratulations, we have successfully deployed the TensorRT-LLM engine and sent an inference request to the server!

There are more features available in [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [Triton Server](https://github.com/triton-inference-server/tensorrtllm_backend) that can benefit different use cases. Refer to the respective GitHub repositories to make use of the latest releases and features.

## Licensing
Copyright © 2023 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
