# Nvidia Triton+TRT-LLM

Nvidia's Triton is an inference server that provides API-style access to hosted LLMs. Nvidia TensorRT-LLM, often abbreviated as TRT-LLM, is a GPU-accelerated SDK for optimizing LLMs and running inference on them. This connector allows LangChain to interact remotely with a Triton inference server over gRPC or HTTP to perform accelerated inference operations.

[Triton Inference Server GitHub](https://github.com/triton-inference-server/server)

## Install tritonclient
Since we are interacting with the Triton inference server, we need to install the `tritonclient` package, which contains both the gRPC and HTTP client implementations.

`tritonclient` can be easily installed using `pip3 install tritonclient[all]`.

In [None]:
!pip3 install tritonclient[all]


## Imports

In [None]:
import os

from langchain.chains import LLMChain
from langchain.llms import TritonTensorRT
from langchain.prompts import PromptTemplate


## Create the Triton+TRT-LLM instance
Remember that a Triton instance represents a running server instance. Make sure you have a server running with a valid configuration, and change `localhost:8001` to the correct IP/hostname:port combination for your server.

An example of setting up this environment can be found in Nvidia's [GenerativeAIExamples GitHub repo](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/RetrievalAugmentedGeneration).
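
Before creating the connector, it can help to confirm that the `hostname:port` value you plan to use actually accepts connections. The following is a minimal sketch using only the Python standard library. The helper names `split_server_url` and `is_port_open` are hypothetical and are not part of LangChain or `tritonclient`. The check only tests TCP reachability; it does not confirm that Triton is healthy or that a model is loaded. Port 8001 is Triton's default gRPC port (8000 is the default for HTTP).

```python
import socket
from typing import Tuple


def split_server_url(server_url: str) -> Tuple[str, int]:
    """Split a 'host:port' string (hostname or IPv4) into (host, port).

    Hypothetical helper for illustration; raises ValueError if malformed.
    """
    host, sep, port = server_url.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {server_url!r}")
    return host, int(port)


def is_port_open(server_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    This only shows that something is listening on the port, not that it is
    a Triton server or that a model is ready.
    """
    host, port = split_server_url(server_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `is_port_open("localhost:8001")` returns `True` only when something is listening on that port, so a `False` result suggests the server is not running or the address is wrong.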

In [None]:
from langchain.callbacks import streaming_stdout

callbacks = [streaming_stdout.StreamingStdOutCallbackHandler()]

# Connect to the TRT-LLM Llama-2 model running on the Triton server at the url below
triton_llm = TritonTensorRT(
    server_url="localhost:8001",
    model_name="ensemble",
    callbacks=callbacks,
    tokens=500,
)

# Here we perform a single one-shot prompt, since we are only demonstrating the connector's capability.
# All other LangChain functionality still applies to the TritonTensorRT class.
triton_llm("What is the tallest building in the world?")
