<p> <center> <a href="../../LLM-Application.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="trt-llama-chat.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 33%; text-align: center;">
        <a href="llama-chat-finetune.ipynb">1</a>
         <a href="trt-llama-chat.ipynb">2</a>
          <a>3</a>
        <a href="triton-llama.ipynb">4</a>
        <a href="LangChain-with-Guardrails.ipynb">5</a>
        <a href="challenge.ipynb">6</a>
    </span>
    <span style="float: left; width: 33%; text-align: right;"><a href="triton-llama.ipynb">Next Notebook</a></span>
</div>

# TensorRT-LLM: Adding Custom Model
---

### Setup TRT-LLM Environment

To achieve ideal performance, TensorRT-LLM must be built from source for the hardware environment in which it will run.

**Note: Skip this step if you have already built TensorRT-LLM via the Dockerfile or Singularity recipe, or if you are running this notebook in a bootcamp environment.**

There are multiple ways to set up TensorRT-LLM; one recommended approach is as follows: 


#### 1). Download sources 

```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs cmake

# Clone Repository and set it up
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs install
git lfs pull
```

#### 2). Using Docker 

```bash
# Build the TensorRT-LLM release image with Docker, driven by make
make -C docker release_build
```

For a complete list of options to set up TRT-LLM, refer to the [**installation guide**](https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/docs/source/installation.md).

### Workflow of Integrating New Models using TensorRT-LLM: 

Let us start by understanding the TensorRT-LLM workflow, so that we can see how new models are integrated into it:

- **Step 1** : Convert weights from different source frameworks into TensorRT-LLM checkpoint

- **Step 2** : Build the TensorRT-LLM checkpoint into TensorRT engine(s) with a unified build command

- **Step 3** : Load the engine(s) to the TensorRT-LLM model runner and evaluate with different evaluation tasks

<div><center>
<img src="images/workflow.png" width="1000"/>
</center></div>  


The image above illustrates the workflow just described. Let us go through the steps in detail using the LLaMA 7B model, as we will use the same model in the upcoming notebooks.
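
As a rough sketch, the three steps correspond to three command lines, assembled here in Python so they can be inspected before anything is executed. The script names and flags follow the LLaMA example in the TensorRT-LLM repository at the time of writing and may differ between releases; every directory path and the prompt text are hypothetical placeholders.

```python
import shlex

# Hypothetical placeholder paths: replace them with your own locations.
META_CKPT_DIR = "./llama-2-7b"      # original Meta checkpoint
TOKENIZER_DIR = "./llama-2-7b-hf"   # directory holding the tokenizer files
CKPT_DIR = "./tllm_checkpoint"      # TensorRT-LLM checkpoint (Step 1 output)
ENGINE_DIR = "./tllm_engine"        # TensorRT engine(s) (Step 2 output)


def workflow_commands():
    """Return the three workflow commands as argument lists (nothing is executed)."""
    convert = ["python", "examples/llama/convert_checkpoint.py",
               "--meta_ckpt_dir", META_CKPT_DIR,
               "--output_dir", CKPT_DIR,
               "--dtype", "float16"]
    build = ["trtllm-build",
             "--checkpoint_dir", CKPT_DIR,
             "--output_dir", ENGINE_DIR,
             "--gemm_plugin", "float16"]
    run = ["python", "examples/run.py",
           "--engine_dir", ENGINE_DIR,
           "--tokenizer_dir", TOKENIZER_DIR,
           "--max_output_len", "64",
           "--input_text", "What is TensorRT-LLM?"]
    return [convert, build, run]


for step, command in enumerate(workflow_commands(), start=1):
    print(f"Step {step}: {shlex.join(command)}")
    # To actually run a step: subprocess.run(command, check=True)
```

Note that the output directory of each step is the input directory of the next one, which is what ties the three stages together.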

#### **Step 1** :  Convert to TensorRT-LLM Checkpoint format

First, we convert checkpoints from other sources to the TensorRT-LLM checkpoint format. Continuing with our example, let us look at the format of the LLaMA 7B model that we currently have.

We see the following files: 

- **README.md**: This is a markdown file typically used to provide information about the software or project, including how to install, configure, and use it, as well as any other relevant details or documentation.
- **checklist.chk**: This file is an MD5 checksum file used to verify the integrity of the other files in the directory. It ensures that the files are not corrupted or altered from their original state.
- **consolidated.00.pth**: This is a PyTorch model file that contains the trained model weights. PyTorch uses the .pth extension for saving model checkpoints.
- **params.json**: A JSON file containing model parameters. For the 7B model, it includes details such as the dimensionality of the model, the number of heads, layers, and other configuration details that are necessary for initializing and running the model.
- **tokenizer.model**: This file is associated with the tokenizer used by the LLaMA model. A tokenizer is responsible for converting text into a format that the model can understand, typically by breaking text into tokens and converting these tokens into numerical representations.
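
Two of the files listed above can be inspected with the standard library alone. The sketch below assumes that `checklist.chk` uses the usual `md5sum` layout (a hex digest, whitespace, then a file name) and that `params.json` is a flat JSON object; the helper names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checklist(model_dir):
    """Check each file listed in checklist.chk; return {file name: matches}.

    Assumes md5sum-style lines: '<hex digest>  <file name>'.
    """
    model_dir = Path(model_dir)
    results = {}
    for line in (model_dir / "checklist.chk").read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
        results[name] = md5_of(model_dir / name) == expected.lower()
    return results


def load_params(model_dir):
    """Read params.json into a dictionary."""
    with open(Path(model_dir) / "params.json") as handle:
        return json.load(handle)


# Usage, with a hypothetical checkpoint directory:
# print(verify_checklist("./llama-2-7b"))
# print(load_params("./llama-2-7b"))
```

Reading the checkpoint in chunks keeps memory use flat even though `consolidated.00.pth` is several gigabytes in size.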


Since the model was trained in PyTorch, we can load it and see the weight tensors below:

<div><center>
<img src="images/meta-ckpt.png" width="1500"/>
</center></div>  
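
A listing like the one in the image can be produced with a few lines of Python. This is a minimal sketch: `summarize_weights` and `load_meta_checkpoint` are hypothetical helper names, the checkpoint path is a placeholder, and the loader assumes that the `.pth` file holds a plain mapping from tensor names to tensors.

```python
def summarize_weights(state_dict, limit=None):
    """Return (name, shape, dtype) for each tensor in a name -> tensor mapping."""
    rows = []
    for index, (name, tensor) in enumerate(state_dict.items()):
        if limit is not None and index >= limit:
            break
        rows.append((name, tuple(tensor.shape), str(tensor.dtype)))
    return rows


def load_meta_checkpoint(path):
    """Load a .pth checkpoint onto the CPU (requires PyTorch)."""
    import torch  # imported lazily so summarize_weights works without PyTorch

    return torch.load(path, map_location="cpu")


# Usage, with a hypothetical path to the file listed earlier:
# weights = load_meta_checkpoint("./llama-2-7b/consolidated.00.pth")
# for name, shape, dtype in summarize_weights(weights, limit=10):
#     print(f"{name:45s} {shape} {dtype}")
```

Loading with `map_location="cpu"` avoids needing a GPU just to inspect tensor names and shapes.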

Before we convert this specific Meta checkpoint to the TensorRT-LLM format, let us look at what that checkpoint format is.

TensorRT-LLM defines its own checkpoint format. A checkpoint directory includes:

- One **config JSON** file, which contains several model hyper-parameters
- **One or several rank weights** files, each containing a dictionary of tensors (weights). In a multi-GPU (multi-process) scenario, different ranks load different files.
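
The directory layout described above can be checked with a small helper. This sketch assumes the `config.json` and `rank<N>.safetensors` file names used in the TensorRT-LLM documentation; the function names are hypothetical.

```python
from pathlib import Path


def expected_checkpoint_files(world_size):
    """List the file names expected in a checkpoint directory for `world_size` ranks."""
    return ["config.json"] + [f"rank{rank}.safetensors" for rank in range(world_size)]


def missing_checkpoint_files(checkpoint_dir, world_size):
    """Return the expected files that are absent from `checkpoint_dir`."""
    checkpoint_dir = Path(checkpoint_dir)
    return [name for name in expected_checkpoint_files(world_size)
            if not (checkpoint_dir / name).is_file()]


print(expected_checkpoint_files(world_size=2))
# ['config.json', 'rank0.safetensors', 'rank1.safetensors']
```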

##### **Config**

The `config.json` contains important hyper-parameters of the model. A complete hyper-parameter list can be found [here](https://nvidia.github.io/TensorRT-LLM/new_workflow.html#config). Let us look at an example of the `config.json` :

```json 

{
    "architecture": "OPTForCausalLM",
    "dtype": "float16",
    "logits_dtype": "float32",
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_size": 768,
    "vocab_size": 50272,
    "position_embedding_type": "learned_absolute",
    "max_position_embeddings": 2048,
    "hidden_act": "relu",
    "quantization": {
        "use_weight_only": false,
        "weight_only_precision": "int8"
    },
    "mapping": {
        "world_size": 2,
        "tp_size": 2
    },
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "do_layer_norm_before": true,
    "use_prompt_tuning": false
}

```
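
A few useful quantities can be derived from these fields. The sketch below reads a trimmed copy of the example config and applies sanity checks that tensor parallelism typically relies on; the `check_config` helper is hypothetical, and treating a missing `pp_size` as 1 is an assumption.

```python
import json

# Trimmed copy of the example above, keeping only the fields used here.
CONFIG_TEXT = """
{
    "num_attention_heads": 12,
    "hidden_size": 768,
    "mapping": {"world_size": 2, "tp_size": 2}
}
"""


def check_config(config):
    """Derive per-head and per-rank sizes from a config and sanity-check them."""
    mapping = config["mapping"]
    tp_size = mapping.get("tp_size", 1)
    pp_size = mapping.get("pp_size", 1)  # assumption: defaults to 1 when absent
    if mapping["world_size"] != tp_size * pp_size:
        raise ValueError("world_size must equal tp_size * pp_size")

    heads = config["num_attention_heads"]
    hidden = config["hidden_size"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    if heads % tp_size != 0:
        raise ValueError("num_attention_heads must be divisible by tp_size")

    return {
        "head_size": hidden // heads,
        "heads_per_rank": heads // tp_size,
        "rank_files": mapping["world_size"],
    }


print(check_config(json.loads(CONFIG_TEXT)))
# {'head_size': 64, 'heads_per_rank': 6, 'rank_files': 2}
```

With `tp_size` set to 2, each rank holds 6 of the 12 attention heads, which is why the checkpoint contains one weights file per rank.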

##### **Rank Weights**

As in PyTorch, each tensor (weight) name is a string containing hierarchical information, and it maps uniquely to a particular parameter of a TensorRT-LLM model.

Let us look at an example through the Attention weights of the 0-th transformer layer.

The Attention layer contains two Linear layers, `qkv` and `dense`; each Linear layer contains one weight and one bias. So, there are four tensors (weights) in total, whose names are:

```python
transformer.layers.0.attention.qkv.weight
transformer.layers.0.attention.qkv.bias
transformer.layers.0.attention.dense.weight
transformer.layers.0.attention.dense.bias
```

where `transformer.layers.0.attention` is the prefix name, indicating that the weights/biases are in the attention module of the 0-th transformer layer. A complete example of converting various layers can be found [here](https://nvidia.github.io/TensorRT-LLM/new_workflow.html#rank-weights).
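
The naming scheme is regular enough to generate programmatically. The following sketch builds the four attention names for any layer index and splits a name back into its module prefix and parameter kind; both helper functions are hypothetical and only illustrate the convention.

```python
def attention_weight_names(layer_index):
    """Build the weight and bias names of the attention module in one layer."""
    prefix = f"transformer.layers.{layer_index}.attention"
    return [f"{prefix}.{linear}.{kind}"
            for linear in ("qkv", "dense")
            for kind in ("weight", "bias")]


def split_weight_name(name):
    """Split a tensor name into (module prefix, parameter kind)."""
    prefix, _, kind = name.rpartition(".")
    return prefix, kind


for name in attention_weight_names(0):
    print(name, "->", split_weight_name(name))
```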

Having covered Step 1, converting a checkpoint to the TensorRT-LLM format, let us now look at Step 2 of the process, where we integrate the model and its architecture.

#### **Step 2** : Workflow of LLaMA - Implementing the Model architecture

**TensorRT-LLM Python API :**

Before we define our architecture, note that TensorRT-LLM has a Python API that can be used to define Large Language Models. This API is built on top of the TensorRT Python API, which creates graph representations of deep neural networks in TensorRT. The [documentation](https://nvidia.github.io/TensorRT-LLM/python-api/tensorrt_llm.layers.html) lists the Python APIs that help accelerate building the model. Below is a snippet from the documentation showing some of the Python APIs available to us via TensorRT-LLM.

<div><center>
<img src="images/pythonapi.png" width="1000"/>
</center></div>

Let us now go back to our LLaMA example. The LLaMA papers, which describe their modifications to the transformer, can be found here: [LLaMA 1](https://arxiv.org/abs/2302.13971), [LLaMA 2](https://arxiv.org/abs/2307.09288).

We now visualise this architecture and compare it to the `model.py` that is currently part of the TensorRT-LLM repository.

The three primary components that we call are:

- **Embedding** : The embedding layer translates tokens into a format the model can work with.
- **Decoder Layer** : Decoder layers process and refine these representations through attention mechanisms and neural networks. We will look at them in detail below.
- **RMS Norm** : RMSNorm ensures that the output activations are normalized in a manner that promotes stability and efficiency.
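
To make the last item concrete, RMSNorm divides each activation by the root mean square of the vector and then applies a learned per-element scale. The sketch below is a framework-free illustration of that formula, not the TensorRT-LLM layer itself; the `eps` default is an assumed value.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Apply RMSNorm to one vector: x / sqrt(mean(x^2) + eps), scaled by weight."""
    mean_square = sum(value * value for value in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [value * scale * w for value, w in zip(x, weight)]


hidden = [3.0, 4.0]
normalized = rms_norm(hidden, weight=[1.0, 1.0])
print(normalized)

# With unit weights, the mean of the squared outputs is close to 1.
print(sum(value * value for value in normalized) / len(normalized))
```

Unlike LayerNorm, this formula does not subtract the mean, so it needs only one reduction over the vector.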



<div><center>
<img src="images/arch.png" width="1000"/>
</center></div>  

##### **Decoder Layer**


The Decoder layer in LLaMA 2 is a fundamental component of the model's architecture, primarily responsible for processing and generating language. Each Decoder layer, or transformer block, is constructed from a self-attention layer and a feed-forward neural network. The decoder is built of three fundamental building blocks:

- RMS Norm 
- Attention
- MLP 

<div><center>
<img src="images/decode-init.png" width="1300"/>
</center></div>  


While the `__init__` method shows which components are initialised, we get a better idea of the flow by looking at the `forward` method of the decoder class.

<div><center>
<img src="images/decode-forward.png" width="1200"/>
</center></div> 
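
The data flow of a LLaMA-style decoder layer can be summarised as two sub-layers, each preceded by a normalisation and wrapped in a residual connection. The sketch below expresses that flow with plain Python callables; it is a simplified illustration, not the code from `model.py`, and the toy stand-in functions are hypothetical.

```python
def decoder_layer(hidden, input_norm, attention, post_norm, mlp):
    """Pre-norm transformer block: norm -> attention -> add, then norm -> MLP -> add."""
    residual = hidden
    attended = attention(input_norm(hidden))
    hidden = [r + a for r, a in zip(residual, attended)]

    residual = hidden
    transformed = mlp(post_norm(hidden))
    return [r + t for r, t in zip(residual, transformed)]


# Toy stand-ins so the data flow can be traced by hand.
def identity(values):
    return values


def double(values):
    return [2.0 * value for value in values]


print(decoder_layer([1.0, 2.0], identity, double, identity, double))
# [9.0, 18.0]
```

Tracing the toy example: the attention stand-in turns `[1.0, 2.0]` into `[2.0, 4.0]`, the first residual sum gives `[3.0, 6.0]`, the MLP stand-in gives `[6.0, 12.0]`, and the second residual sum gives `[9.0, 18.0]`.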

#### **Step 3** : Running the engine through ModelRunner

The Model Runner is designed to work with the TensorRT-LLM Python API, allowing users to load the engines and evaluate different tasks quickly. It supports multi-GPU and multi-node configurations, enabling efficient execution of Large Language Models (LLMs) across different hardware setups. The flow of [`run.py`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/run.py), which can be used with all the models, is shown below.


<div><center>
<img src="images/modelrunner.png" width="1500"/>
</center></div> 
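
A condensed version of that flow might look like the sketch below. It assumes the `ModelRunner.from_dir` and `generate` calls as they are used in `run.py` at the time of writing, a Hugging Face tokenizer directory, and output ids indexed as `[batch][beam][token]`; the helper names, paths, and the choice of reusing the EOS id for padding are hypothetical.

```python
def strip_prompt(output_ids, input_lengths, beam=0):
    """Drop the prompt tokens from each generated sequence.

    Assumes output_ids is indexed as [batch][beam][token] and that every
    sequence starts with a copy of its prompt.
    """
    return [list(sequence[beam][length:])
            for sequence, length in zip(output_ids, input_lengths)]


def generate(engine_dir, tokenizer_dir, prompt, max_new_tokens=64):
    """Load an engine with ModelRunner and generate a completion for one prompt."""
    # Imported lazily: these packages are only needed when an engine is run.
    import torch
    from transformers import AutoTokenizer
    from tensorrt_llm.runtime import ModelRunner

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    input_ids = torch.tensor(tokenizer.encode(prompt), dtype=torch.int32)

    runner = ModelRunner.from_dir(engine_dir=engine_dir, rank=0)
    output_ids = runner.generate(
        [input_ids],
        max_new_tokens=max_new_tokens,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,  # assumption: reuse EOS when no pad token exists
    )
    new_tokens = strip_prompt(output_ids.tolist(), [len(input_ids)])[0]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Usage, with hypothetical directories:
# print(generate("./tllm_engine", "./llama-2-7b-hf", "What is TensorRT-LLM?"))
```

Stripping the prompt is needed because the generated sequence begins with the input tokens, so decoding it unchanged would repeat the prompt in the output.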

### Documentation is your friend

Documentation plays a pivotal role in implementing custom models in TensorRT-LLM (TRT-LLM), acting as a comprehensive guide through the intricacies of the process. For developers looking to optimize and execute Large Language Models (LLMs) efficiently on NVIDIA GPUs, the TRT-LLM documentation provides essential insights into the architecture, APIs, and best practices for performance tuning. It covers the Python API for defining models, compiling efficient engines, and building runtimes for executing those engines alongside C++ components for more advanced use cases. Moreover, the documentation outlines the new workflow for converting model weights from various source frameworks into a TRT-LLM compatible format, building these into TensorRT engines, and finally, running inference with these optimized engines.

The documentation can be found [here](https://nvidia.github.io/TensorRT-LLM/index.html).

---
## Acknowledgment

This notebook is adapted from NVIDIA's [TensorRT-LLM GitHub repository](https://github.com/NVIDIA/TensorRT-LLM/tree/main).

## References

- https://nvidia.github.io/TensorRT-LLM/architecture.html
- https://github.com/NVIDIA/TensorRT-LLM

## Licensing
Copyright © 2023 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
