<p> <center> <a href="../../LLM-Application.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="llama-chat-finetune.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 33%; text-align: center;">
        <a href="llama-chat-finetune.ipynb">1</a>
         <a>2</a>
        <a href="triton-llama.ipynb">3</a>
        <a href="challenge.ipynb">4</a>
    </span>
    <span style="float: left; width: 33%; text-align: right;"><a href="triton-llama.ipynb">Next Notebook</a></span>
</div>

# Building a TensorRT Engine With the Fine-Tuned Model
---

The objective of this notebook is to demonstrate how to use TensorRT-LLM to optimize our fine-tuned Llama-2-7b-chat model (`../../model/Llama-2-7b-chat-hf-merged`) from the previous notebook, run inference, and examine various advanced optimization techniques.

### Overview of TensorRT-LLM 

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main) is a toolkit to assemble optimized solutions to perform Large Language Model (LLM) inference. It offers a Python API to define models and compile efficient [TensorRT](https://developer.nvidia.com/tensorrt) engines for NVIDIA GPUs. It also contains Python and C++ components to build runtimes to execute those engines as well as [backends](https://github.com/triton-inference-server/tensorrtllm_backend) for the [Triton Inference Server](https://developer.nvidia.com/triton-inference-server) to easily create web-based services for LLMs. TensorRT-LLM supports single GPU, multi-GPU and multi-node configurations (using [Tensor Parallelism](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/nemo_megatron/parallelisms.html#tensor-parallelism) and/or [Pipeline Parallelism](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/nemo_megatron/parallelisms.html#pipeline-parallelism)). TensorRT-LLM wraps TensorRT’s deep learning compiler—which includes optimized kernels from FasterTransformer, pre- and post-processing, and multi-GPU and multi-node communication—in a simple open-source Python API for defining, optimizing, and executing LLMs for inference in production.

The Python API of TensorRT-LLM is designed to look similar to the PyTorch API. It provides users with a functional module containing functions like `einsum`, `softmax`, `matmul` or `view`. TensorRT-LLM maximizes performance and reduces memory footprint by allowing models to be executed using different quantization modes. It supports INT4 or INT8 weights with FP16 activations (a.k.a. INT4/INT8 weight-only) as well as a complete implementation of the [SmoothQuant technique](https://arxiv.org/abs/2211.10438).

### Key Features of TensorRT-LLM

TensorRT-LLM contains examples that implement the following features.

* Multi-head Attention ([MHA](https://arxiv.org/abs/1706.03762))
* Multi-query Attention ([MQA](https://arxiv.org/abs/1911.02150))
* Group-query Attention ([GQA](https://arxiv.org/abs/2307.09288))
* In-flight Batching
* Paged KV Cache for the Attention
* Tensor Parallelism
* Pipeline Parallelism
* INT4/INT8 Weight-Only Quantization (W4A16 & W8A16)
* [SmoothQuant](https://arxiv.org/abs/2211.10438)
* [GPTQ](https://arxiv.org/abs/2210.17323)
* [AWQ](https://arxiv.org/abs/2306.00978)
* [FP8](https://arxiv.org/abs/2209.05433)
* Greedy-search
* Beam-search
* RoPE

Some of the features are not enabled for all of the [models](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) in the examples folder. The list of models supported by TensorRT-LLM is available [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main?tab=readme-ov-file#models).

### Supported Devices

TensorRT-LLM is rigorously tested on the following GPUs:

- H100
- L40S
- A100
- A30
- V100 (experimental)

## Building TensorRT engine(s) for Llama-2

In this section, we show how to build TensorRT engine(s) using our merged model. First, we use the `build.py` script to build a TensorRT engine on a single GPU in FP16. Then we run inference using the `run.py` script. Before we proceed to build our engine, it is important to be aware of the support matrix for Llama-2, listed below:

- FP16
- FP8
- INT8 & INT4 Weight-Only
- SmoothQuant
- Groupwise quantization (AWQ/GPTQ)
- FP8 KV CACHE
- INT8 KV CACHE (+ AWQ/per-channel weight-only)
- Tensor Parallel
- STRONGLY TYPED

**Flag description**:
- **model_dir**: path to the model directory
- **output_dir**: path to the directory in which to store the TensorRT-LLM checkpoint or the TensorRT engine
- **dtype**: data type to use when converting the model to the TensorRT-LLM checkpoint format
- **checkpoint_dir**: path to the directory from which to load the TensorRT-LLM checkpoint needed to build the TensorRT engine
- **use_gemm_plugin**: GEMM plugin, required to prevent accuracy issues
- **gpt_attention_plugin**: GPT attention plugin
- **weight_only_precision**: weight precision to use when building the TensorRT engine
- **enable_context_fmha**: enables the fused multi-head attention (FMHA) kernel in the context phase, which reduces the memory footprint significantly
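
The build cells in this notebook pass their flags on the command line. As a convenience sketch, the same invocation could be assembled in Python. `make_build_command` is a hypothetical helper introduced here for illustration, and the flag names are copied from the FP16 build cell in this notebook rather than from a verified flag reference, so check them against the `build.py` version you have installed.

```python
from typing import Dict, List, Union

# Script path copied from the build cells in this notebook.
BUILD_SCRIPT = "../../../tensorrtllm_backend/tensorrt_llm/examples/llama/build.py"


def make_build_command(options: Dict[str, Union[str, bool]]) -> List[str]:
    """Turn an options dict into an argument list for build.py (hypothetical helper).

    String values become `--name value`, True becomes a bare `--name` switch,
    and False entries are skipped.
    """
    cmd = ["python3", BUILD_SCRIPT]
    for name, value in options.items():
        if value is True:
            cmd.append(f"--{name}")
        elif value is False:
            continue
        else:
            cmd.extend([f"--{name}", str(value)])
    return cmd


# Same flags as the single-GPU FP16 build cell.
fp16_options = {
    "model_dir": "../../model/Llama-2-7b-chat-hf-merged",
    "dtype": "float16",
    "remove_input_padding": True,
    "use_gpt_attention_plugin": "float16",
    "enable_context_fmha": True,
    "use_gemm_plugin": "float16",
    "output_dir": "../../model/trt_engines/fp16/1-gpu/",
}

fp16_cmd = make_build_command(fp16_options)
print(" ".join(fp16_cmd))
# To actually launch the build: subprocess.run(fp16_cmd, check=True)
```

Keeping the options in a dict makes it easy to derive the BF16 or weight-only variants by changing one or two entries instead of editing a long shell command.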

#### Build the LLaMA 7B model using a single GPU and FP16 

- Build TensorRT Engine

In [None]:
!python3  ../../../tensorrtllm_backend/tensorrt_llm/examples/llama/build.py \
                --model_dir ../../model/Llama-2-7b-chat-hf-merged \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ../../model/trt_engines/fp16/1-gpu/

Expected output: 

```python
...
[02/23/2024-22:04:07] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +12853, now: CPU 0, GPU 12853 (MiB)
[02/23/2024-22:04:11] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 32216 MiB
[02/23/2024-22:04:11] [TRT-LLM] [I] Total time of building llama_float16_tp1_rank0.engine: 00:00:35
[02/23/2024-22:04:11] [TRT-LLM] [I] Config saved to ../../model/trt_engines/fp16/1-gpu/config.json.
[02/23/2024-22:04:11] [TRT-LLM] [I] Serializing engine to ../../model/trt_engines/fp16/1-gpu/llama_float16_tp1_rank0.engine...
[02/23/2024-22:04:26] [TRT-LLM] [I] Engine serialized. Total time: 00:00:14
[02/23/2024-22:04:26] [TRT-LLM] [I] Timing cache serialized to ../../model/trt_engines/fp16/1-gpu/model.cache
[02/23/2024-22:04:26] [TRT-LLM] [I] Total time of building all 1 engines: 00:01:22

```
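
The `GPU +12853` MiB allocation in the log above can be sanity-checked with a back-of-the-envelope calculation. The sketch below counts the parameters of the standard Llama-2-7B architecture and converts them to MiB at two bytes per weight. It assumes that fine-tuning and merging left the architecture unchanged, and it only estimates weight storage, not activations or the KV cache.

```python
# Llama-2-7B architecture values (assumed unchanged by fine-tuning and merging).
hidden_size = 4096
intermediate_size = 11008
num_layers = 32
vocab_size = 32000

# Per-layer parameters: Q, K, V, O projections, three MLP matrices, two RMSNorm vectors.
attention_params = 4 * hidden_size * hidden_size
mlp_params = 3 * hidden_size * intermediate_size
norm_params = 2 * hidden_size
layer_params = attention_params + mlp_params + norm_params

# Token embedding, untied output head, and the final RMSNorm.
embedding_params = 2 * vocab_size * hidden_size
total_params = num_layers * layer_params + embedding_params + hidden_size


def weight_mib(num_params: int, bytes_per_param: float) -> float:
    """Size of the weights in MiB for a given storage width."""
    return num_params * bytes_per_param / 2**20


fp16_mib = weight_mib(total_params, 2)  # FP16 and BF16 both use 2 bytes
int8_mib = weight_mib(total_params, 1)  # idealized: every weight stored in 1 byte

print(f"parameters: {total_params:,}")
print(f"FP16 weights: {fp16_mib:,.0f} MiB")
print(f"INT8 weights (idealized): {int8_mib:,.0f} MiB")
```

The FP16 estimate lands within one MiB of the figure reported by the builder. The INT8 number is an idealized lower bound, since a real weight-only engine may keep some tensors in higher precision.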


- Run Inference

In [None]:
!python3  ../../../tensorrtllm_backend/tensorrt_llm/examples/llama/run.py \
               --max_output_len=200 \
               --tokenizer_dir ../../model/Llama-2-7b-chat-hf-merged \
               --engine_dir=../../model/trt_engines/fp16/1-gpu/ \
               --input_text "explain what is astrophotography?"

Expected output: 

```python
...

Running the float16 engine ...
Input: "explain what is astrophotography?"
Output: "

Astrophotography is the branch of photography that deals with the capture of images of celestial objects such as stars, planets, and galaxies. It involves the use of specialized equipment such as telescopes, cameras, and lenses to capture images of these objects. Astrophotography is a challenging and rewarding field that requires a combination of technical knowledge and creativity.

Astrophotography can be used to study the properties of celestial objects, such as their size, shape, and color. It can also be used to create artistic images that capture the beauty and wonder of the night sky. Astrophotography has been used to capture some of the most iconic images of the universe, including the Pillars of Creation in the Eagle Nebula and the Orion Nebula.

Astrophotography is a popular hobby and profession, with many amateur astronomers and"

```

#### Build the LLaMA 7B model using a single GPU and BF16

- Build TensorRT Engine

In [None]:
!python3  ../../../tensorrtllm_backend/tensorrt_llm/examples/llama/build.py \
                --model_dir ../../model/Llama-2-7b-chat-hf-merged \
                --dtype bfloat16 \
                --remove_input_padding \
                --use_gpt_attention_plugin bfloat16 \
                --enable_context_fmha \
                --use_gemm_plugin bfloat16 \
                --output_dir ../../model/trt_engines/bf16/1-gpu/

Expected output:

```python
...
[02/23/2024-22:15:37] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 42037 MiB
[02/23/2024-22:15:37] [TRT-LLM] [I] Total time of building llama_bfloat16_tp1_rank0.engine: 00:00:29
[02/23/2024-22:15:37] [TRT-LLM] [I] Config saved to ../../model/trt_engines/bf16/1-gpu/config.json.
[02/23/2024-22:15:37] [TRT-LLM] [I] Serializing engine to ../../model/trt_engines/bf16/1-gpu/llama_bfloat16_tp1_rank0.engine...
[02/23/2024-22:15:52] [TRT-LLM] [I] Engine serialized. Total time: 00:00:15
[02/23/2024-22:15:52] [TRT-LLM] [I] Timing cache serialized to ../../model/trt_engines/bf16/1-gpu/model.cache
[02/23/2024-22:15:53] [TRT-LLM] [I] Total time of building all 1 engines: 00:00:59

```

- Run Inference

In [None]:
!python3  ../../../tensorrtllm_backend/tensorrt_llm/examples/llama/run.py \
               --max_output_len=200 \
               --tokenizer_dir ../../model/Llama-2-7b-chat-hf-merged \
               --engine_dir=../../model/trt_engines/bf16/1-gpu/ \
               --input_text "explain what is astrophotography?"

Expected output:

```python

Running the bfloat16 engine ...
Input: "explain what is astrophotography?"
Output: "

Astrophotography is the branch of photography that deals with the capture of images of celestial objects such as stars, planets, and galaxies. It involves the use of specialized equipment such as telescopes, cameras, and lenses to capture images of these objects. Astrophotography is a challenging and rewarding field that requires a combination of technical knowledge and creativity.

Astrophotography can be used to study the properties of celestial objects, such as their size, shape, and color. It can also be used to create artistic images that capture the beauty and wonder of the night sky. Astrophotography has been used to capture some of the most iconic images of the universe, including the Pillars of Creation in the Eagle Nebula and the Orion Nebula.

Astrophotography is a popular hobby and profession, with many amateur astronomers and"
```

#### Build the LLaMA 7B model using a single GPU and apply INT8 weight-only quantization

- Build TensorRT Engine

In [None]:
!python3  ../../../tensorrtllm_backend/tensorrt_llm/examples/llama/build.py \
                --model_dir ../../model/Llama-2-7b-chat-hf-merged \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --output_dir ../../model/trt_engines/weight_only/1-gpu/

Expected output:

```python
...
[02/23/2024-22:23:17] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 29558 MiB
[02/23/2024-22:23:17] [TRT-LLM] [I] Total time of building llama_float16_tp1_rank0.engine: 00:00:26
[02/23/2024-22:23:17] [TRT-LLM] [I] Config saved to ../../model/trt_engines/weight_only/1-gpu/config.json.
[02/23/2024-22:23:17] [TRT-LLM] [I] Serializing engine to ../../model/trt_engines/weight_only/1-gpu/llama_float16_tp1_rank0.engine...
[02/23/2024-22:23:25] [TRT-LLM] [I] Engine serialized. Total time: 00:00:07
[02/23/2024-22:23:25] [TRT-LLM] [I] Timing cache serialized to ../../model/trt_engines/weight_only/1-gpu/model.cache
[02/23/2024-22:23:25] [TRT-LLM] [I] Total time of building all 1 engines: 00:02:48

```

- Run Inference

In [None]:
!python3  ../../../tensorrtllm_backend/tensorrt_llm/examples/llama/run.py \
               --max_output_len=200 \
               --tokenizer_dir ../../model/Llama-2-7b-chat-hf-merged \
               --engine_dir=../../model/trt_engines/weight_only/1-gpu/ \
               --input_text "explain what is astrophotography?"

Expected output:

```python
Running the float16 engine ...
Input: "explain what is astrophotography?"
Output: "

Astrophotography is the branch of photography that deals with the capture of images of celestial objects such as stars, planets, and galaxies. It involves the use of specialized equipment such as telescopes, cameras, and lenses to capture images of these objects. Astrophotography is a challenging and rewarding field that requires a combination of technical knowledge and creativity.

Astrophotography can be used to study the properties of celestial objects, such as their size, shape, and color. It can also be used to create artistic images that capture the beauty and wonder of the night sky. Astrophotography has been used to capture some of the most iconic images of the universe, including the Pillars of Creation in the Eagle Nebula and the Orion Nebula.

Astrophotography is a popular hobby and profession, with many amateur astronomers and"
```
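
To build intuition for what the `--use_weight_only` build above does conceptually, the following pure-Python sketch stores a toy weight matrix as INT8 integers plus one scale per output channel, then dequantizes it before a floating-point matrix-vector product. It illustrates the W8A16 idea only. The scale choice (largest magnitude maps to 127) is an assumption made for this example, and the code does not reflect how TensorRT-LLM's kernels are implemented.

```python
import random
from typing import List

rng = random.Random(0)
in_features, out_features = 8, 4

# Toy floating-point weights [in_features][out_features] and one activation vector.
weight = [[rng.uniform(-1.0, 1.0) for _ in range(out_features)]
          for _ in range(in_features)]
activation = [rng.uniform(-1.0, 1.0) for _ in range(in_features)]

# One scale per output channel so that the largest magnitude maps to 127 (assumption).
scales = [max(abs(weight[i][j]) for i in range(in_features)) / 127.0
          for j in range(out_features)]

# Stored form: small integers in [-127, 127] plus the per-channel scales.
q_weight = [[round(weight[i][j] / scales[j]) for j in range(out_features)]
            for i in range(in_features)]


def matvec(x: List[float], w: List[List[float]]) -> List[float]:
    """Compute x @ w for a vector x and a matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


# At run time the weights are dequantized; the activations are never quantized.
deq_weight = [[q_weight[i][j] * scales[j] for j in range(out_features)]
              for i in range(in_features)]

reference = matvec(activation, weight)
approx = matvec(activation, deq_weight)
max_abs_error = max(abs(r - a) for r, a in zip(reference, approx))
print(f"max abs error: {max_abs_error:.5f}")
```

Because each weight is off by at most half a quantization step, the output error stays small, which is consistent with the weight-only engine above producing the same text as the FP16 engine for this prompt.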

#### Other methods to Build and Run LLaMA-2 7B, 30B, and 70B

- 2-way tensor parallelism.
- 2-way tensor parallelism and 2-way pipeline parallelism
- 8-way tensor parallelism for 70B
- 4-way tensor parallelism and 2-way pipeline parallelism for 70B
- Build LLaMA 70B TP=8 using Meta checkpoints directly.

Examples for the listed methods are available [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama).
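
As a rough illustration of the idea behind 2-way tensor parallelism, the sketch below splits a weight matrix by columns across two simulated ranks, lets each rank multiply the same input by its shard, and concatenates the partial results. It is a single-process toy under stated assumptions: `matmul` and `split_columns` are hypothetical helpers written for this example, and real multi-GPU engines exchange results through collective communication rather than Python lists.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` equally sized column shards, one per simulated rank."""
    step = len(w[0]) // parts
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(parts)]


x = [[1.0, 2.0],
     [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]

# Each simulated rank holds half of the columns and computes its partial output.
shards = split_columns(w, parts=2)
partials = [matmul(x, shard) for shard in shards]

# Gather step: concatenate the partial outputs along the column dimension.
gathered = [[value for partial in partials for value in partial[i]]
            for i in range(len(x))]

print(gathered)
print(gathered == matmul(x, w))
```

Each rank stores only half of the weight matrix, which is what allows models such as the 70B variant to fit across several GPUs.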

### Advanced Optimization Techniques

- **Quantization** 

TensorRT-LLM implements different quantization methods, with a support matrix that varies by model. Given a matrix (2D tensor) of shape M x N (M rows and N columns), where M is the number of tokens and N is the number of channels, TensorRT-LLM has the following three modes to quantize and dequantize the elements of the tensor:

- Per-tensor: it uses a single scaling factor for all the elements.
- Per-token: it uses a different scaling factor for each token. There are M scaling factors in that case.
- Per-channel: it uses a different scaling factor for each channel. There are N scaling factors in that case.

```python

# Per-tensor scaling.
for mi in range(M):
    for ni in range(N):
        q[mi][ni] = int8.satfinite(x[mi][ni] * s)

# Per-token scaling.
for mi in range(M):
    for ni in range(N):
        q[mi][ni] = int8.satfinite(x[mi][ni] * s[mi])

# Per-channel scaling.
for mi in range(M):
    for ni in range(N):
        q[mi][ni] = int8.satfinite(x[mi][ni] * s[ni])

```
See the [precision documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/precision.md) for more topics, including INT8 SmoothQuant, INT4 and INT8 Weight-Only, GPTQ, and AWQ.
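
The pseudocode above relies on an `int8.satfinite` primitive that plain Python does not have. A runnable sketch of the three scaling modes is shown below. `sat_int8` is a hypothetical stand-in that rounds and clamps to the INT8 range, and the scaling factors are chosen as 127 divided by the largest magnitude in the relevant slice, which is an assumption made for illustration rather than the calibration TensorRT-LLM performs.

```python
from typing import List

Matrix = List[List[float]]


def sat_int8(value: float) -> int:
    """Round to the nearest integer and saturate to the INT8 range [-128, 127]."""
    return max(-128, min(127, round(value)))


def quantize_per_tensor(x: Matrix, s: float) -> List[List[int]]:
    """One scaling factor for every element."""
    return [[sat_int8(v * s) for v in row] for row in x]


def quantize_per_token(x: Matrix, s: List[float]) -> List[List[int]]:
    """One scaling factor per row (token): M factors."""
    return [[sat_int8(v * s[mi]) for v in row] for mi, row in enumerate(x)]


def quantize_per_channel(x: Matrix, s: List[float]) -> List[List[int]]:
    """One scaling factor per column (channel): N factors."""
    return [[sat_int8(v * s[ni]) for ni, v in enumerate(row)] for row in x]


# M = 2 tokens, N = 3 channels.
x = [[0.5, -2.0, 1.5],
     [3.0, 0.25, -8.0]]
num_channels = len(x[0])

# Assumed scale choice: map the largest magnitude in each slice to 127.
s_tensor = 127 / max(abs(v) for row in x for v in row)
s_token = [127 / max(abs(v) for v in row) for row in x]
s_channel = [127 / max(abs(row[ni]) for row in x) for ni in range(num_channels)]

q_tensor = quantize_per_tensor(x, s_tensor)
q_token = quantize_per_token(x, s_token)
q_channel = quantize_per_channel(x, s_channel)

print("per-tensor: ", q_tensor)
print("per-token:  ", q_token)
print("per-channel:", q_channel)
```

With per-tensor scaling, the single large value `-8.0` forces small entries such as `0.5` into a handful of integer levels, while the per-token and per-channel modes give the first row and the first two columns their own, finer scales.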

- **In-flight Batching**

In-flight batching is also known as continuous batching or iteration-level batching. The technique aims to reduce wait times in queues, eliminate the need for padding requests, and allow for higher GPU utilization. TensorRT-LLM relies on a component called the Batch Manager to support in-flight batching of requests. More on the Batch Manager API can be found [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/batch_manager.md).
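
The scheduling benefit can be illustrated with a toy simulation that counts decoding iterations. It is not the Batch Manager API: the two functions are hypothetical, every iteration is assumed to cost the same regardless of how many slots are occupied, the context (prefill) phase is ignored, and output lengths are assumed to be known in advance purely so the simulation can run.

```python
from collections import deque
from typing import Deque, List


def static_batching_steps(output_lengths: List[int], max_batch_size: int) -> int:
    """Each batch runs until its longest request finishes before the next batch starts."""
    steps = 0
    for start in range(0, len(output_lengths), max_batch_size):
        batch = output_lengths[start:start + max_batch_size]
        steps += max(batch)
    return steps


def inflight_batching_steps(output_lengths: List[int], max_batch_size: int) -> int:
    """Finished requests free their slot immediately and queued requests join mid-flight."""
    queue: Deque[int] = deque(output_lengths)
    active: List[int] = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Fill any free slots before the next iteration.
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        steps += 1  # one decoding iteration produces one token per active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Two long requests mixed with six short ones, four slots available.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, max_batch_size=4)
inflight_steps = inflight_batching_steps(lengths, max_batch_size=4)
print(f"static batching:    {static_steps} iterations")
print(f"in-flight batching: {inflight_steps} iterations")
```

In the static schedule, the short requests in each batch sit idle while the long request finishes. With in-flight batching, their slots are reused right away, so the same workload completes in fewer iterations.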

- **Multi-head, Multi-query and Group-query Attention**

Multi-head (MHA), multi-query (MQA), and group-query attention (GQA) are variants of the attention mechanism found in most Large Language Models, and all are implemented and optimized in TensorRT-LLM. [MHA](https://arxiv.org/abs/1706.03762) is a sequence of a batched matmul, a softmax, and another batched matmul, while [MQA](https://arxiv.org/abs/1911.02150) and [GQA](https://arxiv.org/abs/2307.09288) are variants of MHA that use fewer so-called K/V heads than query heads. This [document](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_attention.md) summarizes those implementations in TensorRT-LLM.
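
The relationship between the three variants can be sketched in a few lines of pure Python. The functions below are hypothetical illustrations, with no batch dimension and no causal mask for brevity, and are unrelated to the optimized kernels in TensorRT-LLM. Each query head is mapped to a K/V head by integer division, so using as many K/V heads as query heads gives MHA, a single K/V head gives MQA, and anything in between gives GQA.

```python
import math
import random
from typing import List

Matrix = List[List[float]]  # one head: [sequence_length][head_dim]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """softmax(q @ k^T / sqrt(d)) @ v for a single head."""
    d = len(q[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d) for k_row in k]
        probs = softmax(scores)
        out.append([sum(p * v_row[j] for p, v_row in zip(probs, v)) for j in range(d)])
    return out


def grouped_query_attention(q_heads: List[Matrix],
                            k_heads: List[Matrix],
                            v_heads: List[Matrix]) -> List[Matrix]:
    """Each group of query heads shares one K/V head."""
    num_q, num_kv = len(q_heads), len(k_heads)
    assert num_q % num_kv == 0, "query heads must divide evenly into K/V heads"
    group = num_q // num_kv
    return [attention_head(q_heads[h], k_heads[h // group], v_heads[h // group])
            for h in range(num_q)]


rng = random.Random(0)


def random_heads(num_heads: int, seq_len: int, head_dim: int) -> List[Matrix]:
    return [[[rng.uniform(-1.0, 1.0) for _ in range(head_dim)] for _ in range(seq_len)]
            for _ in range(num_heads)]


q_heads = random_heads(4, seq_len=3, head_dim=2)
k_heads = random_heads(2, seq_len=3, head_dim=2)
v_heads = random_heads(2, seq_len=3, head_dim=2)

gqa_out = grouped_query_attention(q_heads, k_heads, v_heads)          # 4 query, 2 K/V heads
mqa_out = grouped_query_attention(q_heads, k_heads[:1], v_heads[:1])  # 4 query, 1 K/V head
print(len(gqa_out), len(gqa_out[0]), len(gqa_out[0][0]))
```

Fewer K/V heads mean fewer key and value tensors to keep in the KV cache, while the output still has one result per query head.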

## Performance of TensorRT-LLM


The data in the following table is provided as a reference point to help users validate observed performance for TensorRT-LLM on H100 (Hopper). It should not be considered the peak performance that TensorRT-LLM can deliver. The performance numbers were collected using a single GPU, a single node with multiple GPUs, or multiple nodes with multiple GPUs for GPT, GPT-like (LLaMA/OPT/GPT-J/SmoothQuant-GPT), BERT, and encoder-decoder models, for both peak throughput and low latency. Below is the table for H100 GPUs (FP8).
 
 
| Model | Batch Size | Tensor Parallelism (TP) | Input Length | Output Length | Throughput (out tok/s/GPU) |
|:-----:|:----------:|:-----------------------:|:------------:|:-------------:|:--------------------------:|
| GPT-J 6B | 1024 | 1 | 128 | 128 | 26,150 |
| GPT-J 6B | 120 | 1 | 128 | 2048 | 8,011 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,551 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,327 |
| LLaMA 7B | 768 | 1 | 128 | 128 | 19,694 |
| LLaMA 7B | 112 | 1 | 128 | 2048 | 6,818 |
| LLaMA 7B | 80 | 1 | 2048 | 128 | 2,244 |
| LLaMA 7B | 48 | 1 | 2048 | 2048 | 2,740 |
| LLaMA 70B | 1024 | 2 | 128 | 128 | 2,657 |
| LLaMA 70B | 480 | 4 | 128 | 2048 | 1,486 |
| LLaMA 70B | 96 | 2 | 2048 | 128 | 306 |
| LLaMA 70B | 64 | 2 | 2048 | 2048 | 547 |
| Falcon 180B | 1024 | 4 | 128 | 128 | 987 |
| Falcon 180B | 1024 | 8 | 128 | 2048 | 724 |
| Falcon 180B | 64 | 4 | 2048 | 128 | 112 |
| Falcon 180B | 64 | 4 | 2048 | 2048 | 264 |
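
Because the throughput column is reported per GPU, a rough aggregate figure for a multi-GPU configuration can be obtained by multiplying by the TP degree. The short sketch below does this for three rows of the table. It assumes that the TP degree equals the total number of GPUs used (no pipeline parallelism), which the table does not state explicitly.

```python
from typing import Dict, List, Tuple

# (model, TP degree, output tokens per second per GPU), copied from the table above.
rows: List[Tuple[str, int, int]] = [
    ("LLaMA 7B", 1, 19694),
    ("LLaMA 70B", 2, 2657),
    ("Falcon 180B", 8, 724),
]


def aggregate_throughput(per_gpu_tok_s: int, tp: int) -> int:
    """Total output tokens per second, assuming TP equals the number of GPUs used."""
    return per_gpu_tok_s * tp


aggregate: Dict[str, int] = {model: aggregate_throughput(tok_s, tp)
                             for model, tp, tok_s in rows}

for model, total in aggregate.items():
    print(f"{model}: {total:,} out tok/s across all GPUs")
```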
 

See the [Performance](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance.md) and [Benchmark](https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/python/README.md) pages for detailed throughput and low-latency tables for H100, L40S (Ada), and A100 (Ampere).

---
## Acknowledgment

This notebook is adapted from NVIDIA's [TensorRT-LLM GitHub repository](https://github.com/NVIDIA/TensorRT-LLM/tree/main).

## References

- https://nvidia.github.io/TensorRT-LLM/architecture.html
- https://github.com/NVIDIA/TensorRT-LLM

## Licensing
Copyright © 2023 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
