# NVIDIA TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
<br>
[TensorRT-LLM GitHub](https://github.com/NVIDIA/TensorRT-LLM)

## TensorRT-LLM environment setup
Because TensorRT-LLM is an SDK that runs local models in-process, a few environment setup steps are required before it can be used.
<br>
1. Install `tensorrt_llm` by following the instructions on [TensorRT-LLM GitHub](https://github.com/NVIDIA/TensorRT-LLM). This connector supports Llama 2 and Mistral models; the steps below use Llama 2.
2. Ensure you have access to the Llama 2 [repository on Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).

## Langchain-nvidia-trt setup
To install from source:
1. `cd langchain-nvidia/libs/trt`
2. `!pip install -e .`
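
Before continuing, it can help to confirm that both packages are visible to the current Python environment. The following is a minimal sketch using only the standard library; the module names `tensorrt_llm` and `langchain_nvidia_trt` are taken from the import statements used later in this notebook, and the helper `missing_modules` is a hypothetical name introduced here for illustration.

```python
from importlib.util import find_spec

# Module names taken from the imports used later in this notebook.
REQUIRED_MODULES = ("tensorrt_llm", "langchain_nvidia_trt")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level modules from `names` that cannot be located.

    Hypothetical helper: find_spec only looks the module up,
    it does not import it, so this check is cheap.
    """
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, revisit the installation steps above before building the engine.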

## Use TensorRT-LLM to create engine files for the model
Llama 2 and Mistral models are supported. The following steps use Llama 2.

In [None]:
from tensorrt_llm import LLM, ModelConfig
from huggingface_hub import snapshot_download

# Download the Llama 2 model (replace <hf_token> with a Hugging Face token that has access to the repository)
model_dir = snapshot_download(repo_id="meta-llama/Llama-2-7b-chat-hf", token="<hf_token>")

# Load the model via LLM and save the .engine file.
# Restart the kernel after saving the .engine file
# to prevent OOM errors from having both torch and the engine loaded.
config = ModelConfig(model_dir=model_dir)
llm = LLM(config)
llm.save("./model")
# Pass this path to TrtLlmAPI as model_path
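
After restarting the kernel, a quick look at the output directory can confirm that the build produced something before the path is handed to `TrtLlmAPI`. The sketch below assumes the engine is written as one or more files with an `.engine` extension directly inside `./model`, as the comment above suggests; the exact file layout may differ between TensorRT-LLM versions, and `list_engine_files` is a hypothetical helper, not part of any library.

```python
from pathlib import Path


def list_engine_files(engine_dir):
    """Return the sorted names of `*.engine` files directly inside `engine_dir`.

    Hypothetical helper. Assumes engines use the `.engine` extension;
    returns an empty list if the directory does not exist.
    """
    path = Path(engine_dir)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.glob("*.engine"))


engines = list_engine_files("./model")
print(engines if engines else "No .engine files found in ./model")
```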

### Building engine files for Windows users
Instead of using the steps above, build the engine files as follows:
1. Clone the [TensorRT-LLM GitHub](https://github.com/NVIDIA/TensorRT-LLM) repository.
2. Change directory to [examples/llama](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) for Llama models.
3. Convert the model to checkpoint format and build the engine using the following commands:

In [None]:
# Build the LLaMA 7B model using a single GPU and FP16.
python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \
                              --output_dir ./tllm_checkpoint_1gpu_fp16 \
                              --dtype float16

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
            --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu \
            --gemm_plugin float16 \
            --context_fmha disable \
            --context_fmha_fp32_acc enable

## Create the TrtLlmAPI instance
When creating the `TrtLlmAPI` object, provide `model_path` (the directory that contains the built engine), `tokenizer_dir` (the path to the tokenizer files from the cloned Hugging Face repository), and `temperature` (which controls the randomness of the responses; lower values make them more deterministic). Then call `invoke` with a prompt.

In [None]:
from langchain_nvidia_trt.llms import TrtLlmAPI
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate.from_template(template)

llm = TrtLlmAPI(
    model_path="./model",
    tokenizer_dir="./model",
    temperature=1.0
)
chain = prompt | llm
print(chain.invoke({"question": "Who is Paul Graham?"}))
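
To make the effect of `temperature` more concrete, here is a minimal, self-contained sketch of temperature-scaled softmax, the usual way a sampling temperature reshapes next-token probabilities. It illustrates the general idea only: it is not taken from TensorRT-LLM or `TrtLlmAPI`, the function name `softmax_with_temperature` is hypothetical, and the logits are made-up numbers.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by `temperature`.

    Illustrative only: temperatures below 1 sharpen the distribution
    (more deterministic), temperatures above 1 flatten it (more random).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

Running it shows the probability of the top-scoring token shrinking as the temperature rises, which is why lower temperatures give more repeatable answers.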
