# NVIDIA TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
<br>
[TensorRT-LLM GitHub](https://github.com/NVIDIA/TensorRT-LLM)

## TensorRT-LLM Environment Setup
Because TensorRT-LLM is an SDK that runs local models in-process, a few environment setup steps are required before it can be used.
<br>
1. NVIDIA CUDA 12.2 or higher is currently required to run TensorRT-LLM.
2. Install `tensorrt_llm` via pip with `pip install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/`
3. This example uses Llama2. Build the Llama2 model files with the scripts described in the [TensorRT-LLM Llama example instructions](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama). The build creates the following files:
   * `rank0.engine`: The main output of the build script, containing the executable graph of operations with the model weights embedded
   * `config.json`: Detailed information about the model, such as its general structure and precision, as well as which plug-ins were incorporated into the engine
4. Create a model directory: `mkdir model`
5. Move all of the files mentioned above into the `model` directory.
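
The last two steps can also be scripted. The following is a minimal sketch, not part of TensorRT-LLM: the helper name `stage_engine_files` and the `build_dir` argument are hypothetical, and it assumes the build script wrote `rank0.engine` and `config.json` into a single output directory.

```python
import shutil
from pathlib import Path

# File names come from the list above; where the build script writes them
# is an assumption, so the caller passes that directory in.
REQUIRED_FILES = ("rank0.engine", "config.json")


def stage_engine_files(build_dir, model_dir="model"):
    """Move the engine files from build_dir into model_dir and return the new paths."""
    build_dir, model_dir = Path(build_dir), Path(model_dir)

    # Fail early if the build step did not produce every expected file.
    missing = [name for name in REQUIRED_FILES if not (build_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {build_dir}: {', '.join(missing)}")

    # Equivalent of `mkdir model`; does nothing if the directory already exists.
    model_dir.mkdir(parents=True, exist_ok=True)

    staged = []
    for name in REQUIRED_FILES:
        target = model_dir / name
        shutil.move(str(build_dir / name), str(target))
        staged.append(target)
    return staged
```

Calling `stage_engine_files("<your build output directory>")` leaves both files under `./model`, which is the directory the example below passes as `model_path`.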

In [None]:
!pip install langchain-nvidia-trt

## Create the TrtLlmAPI instance
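Create a `TrtLlmAPI` instance that points at the model directory and tokenizer, chain it with a prompt template, and call `invoke` with a question.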

In [None]:
from langchain_nvidia_trt.llms import TrtLlmAPI
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate.from_template(template)

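# model_path is the directory created during environment setup
llm = TrtLlmAPI(
    model_path="./model",
    tokenizer_dir="meta-llama/Llama-2-7b-chat",
)
# Pipe the formatted prompt into the LLM, then run the chain with a question
chain = prompt | llm
print(chain.invoke({"question": "What is important about Half Life 2 RTX?"}))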
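
To see the text the model receives without loading an engine, the template can be filled in with plain Python. This is only an approximation: `str.format` stands in for `PromptTemplate` here, on the assumption that the template's single `{question}` placeholder is substituted the same way, and no model is called.

```python
# Plain-Python approximation of how the template is filled in.
# str.format stands in for PromptTemplate; nothing is sent to a model.
template = """Question: {question}

Answer: Let's think step by step."""

rendered = template.format(question="What is important about Half Life 2 RTX?")
print(rendered)
```

The printed text is the question followed by the "Let's think step by step." cue, which the chain would pass to the LLM as its prompt.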
