<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/quantization/openvino/TensorRT-LLM/TensorRT_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NVIDIA TensorRT
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference. It is designed to work in a complementary fashion with training frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on running an already-trained network quickly and efficiently on NVIDIA hardware.

TensorRT includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. The core of NVIDIA TensorRT is a C++ library that facilitates high-performance inference on NVIDIA GPUs. TensorRT takes a trained network, which consists of a network definition and a set of trained parameters, and produces a highly optimized runtime engine that performs inference for that network.

TensorRT is also the name of a broader ecosystem of APIs for production inference. That ecosystem includes TensorRT, TensorRT-LLM, TensorRT Model Optimizer, and TensorRT Cloud.

https://github.com/NVIDIA/TensorRT

# TensorRT-LLM
https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html

A TensorRT Toolbox for Optimized Large Language Model Inference

#### Install
TensorRT-LLM is a toolkit to assemble optimized solutions to perform Large Language Model (LLM) inference. It offers a Model Definition API to define models and compile efficient TensorRT engines for NVIDIA GPUs. It also contains Python and C++ components to build runtimes to execute those engines as well as backends for the Triton Inference Server to easily create web-based services for LLMs. TensorRT-LLM supports multi-GPU and multi-node configurations (through MPI).

https://nvidia.github.io/TensorRT-LLM/installation/linux.html

https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html#support-matrix-software

https://github.com/NVIDIA/TensorRT-LLM

Install the dependencies first. TensorRT-LLM requires Python 3.10:
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs

Install the latest preview version (corresponding to the main branch) of TensorRT-LLM. If you want to install the stable version (corresponding to the release branch), please remove the --pre option.

pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

In [1]:
! pip3 install tensorrt_llm  tensorrt --extra-index-url https://pypi.nvidia.com -q

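After the install cell finishes, it can help to confirm which versions actually landed in the environment before building an engine. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names passed to it are assumptions about how the packages are registered with pip.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not present in this environment
            versions[name] = None
    return versions


# Assumed distribution names; adjust them if pip lists the packages differently
for name, version in installed_versions(["tensorrt_llm", "tensorrt", "torch"]).items():
    print(f"{name}: {version or 'not installed'}")
```

A `None` entry usually means the install step failed or ran against a different Python interpreter than the one running the notebook.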

In [44]:
from tensorrt_llm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_new_tokens=256)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")



Loading Model: [1;32m[1/3]	[0mDownloading HF model
[38;20mDownloaded model to /root/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/snapshots/fe8a4ea1ffedaf415f4da2f062534de366a451e6
[0m[38;20mTime: 0.260s
[0mLoading Model: [1;32m[2/3]	[0mLoading HF model to memory
160it [00:00, 2036.85it/s]
[38;20mTime: 0.105s
[0mLoading Model: [1;32m[3/3]	[0mBuilding TRT-LLM engine
[38;20mTime: 28.062s
[0m[1;32mLoading model done.
[0m[38;20mTotal latency: 28.427s
[0m
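The `temperature=0.8` and `top_p=0.95` values passed to `SamplingParams` control how the next token is drawn from the model's output distribution. The sketch below illustrates the textbook form of temperature scaling followed by nucleus (top-p) filtering in plain Python. It is not TensorRT-LLM's internal implementation, and the function name `sample_top_p` and the toy logits are made up for illustration.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Draw one token index using temperature scaling and nucleus filtering."""
    rng = rng or random.Random()
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # random.choices renormalizes the weights of the surviving tokens
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Toy logits for a four-token vocabulary (illustrative values only)
toy_logits = [2.0, 1.0, 0.5, -3.0]
print(sample_top_p(toy_logits, rng=random.Random(0)))
```

Lowering `top_p` shrinks the candidate set, which tends to make generations more conservative at the cost of variety.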

In [45]:
import datetime
import torch
import gc

In [47]:
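time1 = datetime.datetime.now()

# Generate one prompt at a time, mirroring the Hugging Face pipeline loop
# later in this notebook, so both timings measure the same work.
for p in prompts:
  # For a single string prompt, generate() returns a single result object
  outputs = llm.generate(p, sampling_params)

  # The first candidate holds the generated continuation (without the prompt)
  generated_text = outputs.outputs[0].text
  print(f"Prompt: {p!r},\n Generated text: {generated_text!r}")
  print("-"*25)


time2 = datetime.datetime.now()
print(f"Time {str(time2-time1)}")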

Prompt: 'Hello, my name is',
 Generated text: 'Sarah. I am a professional writer and editor with more than six years of experience in writing articles and blogs for a variety of industries. I am passionate about learning and sharing knowledge, and I enjoy helping others learn and grow through my writing. In my free time, I enjoy traveling, reading, and spending time with my family and friends. You can find me on LinkedIn, Facebook, and Twitter.'
-------------------------
Prompt: 'The president of the United States is',
 Generated text: 'required to submit a daily report to Congress on the coronavirus pandemic. What is the recommended format for the daily report?'
-------------------------
Prompt: 'The capital of France is',
 Generated text: "located in which region?\n\n2. What is the second-largest city in France by population and what is its name?\n\n3. What is the currency used in France and where does it come from?\n\n4. What is the official language spoken in France and which count
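Wall-clock time alone is hard to compare across runs that produce different amounts of text, since sampling makes each generation a different length. The sketch below shows one way to wrap a call with a monotonic timer and turn the result into a throughput figure. The helpers `timed` and `tokens_per_second`, and the stand-in `fake_generate`, are hypothetical names used only for illustration.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tokens_per_second(num_tokens, seconds):
    """Throughput in generated tokens per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return num_tokens / seconds


def fake_generate(prompt):
    # Stand-in for a real generate call; returns a list of whitespace "tokens"
    return (prompt + " and so on").split()


tokens, elapsed = timed(fake_generate, "The future of AI is")
print(f"{len(tokens)} tokens in {elapsed:.6f}s")
```

In the notebook, the same wrapper could surround each generate call, with the token count taken from the tokenizer rather than a whitespace split.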

In [48]:
del llm

In [49]:
torch.cuda.empty_cache()
gc.collect()

705
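The integer printed by the cell is the return value of `gc.collect()`, which reports how many unreachable objects the collector found. Objects that reference each other in a cycle are not released by `del` alone, which is why an explicit collection can matter before GPU memory is handed back. It is commonly suggested to call `gc.collect()` before `torch.cuda.empty_cache()` for that reason. The sketch below demonstrates the return value with plain Python objects; the `Node` class and `count_collected_cycles` helper are hypothetical.

```python
import gc


class Node:
    """Tiny object that can point at another Node to form a reference cycle."""

    def __init__(self):
        self.other = None


def count_collected_cycles(num_pairs):
    """Create num_pairs two-object cycles, drop them, and return gc.collect()."""
    gc.collect()   # start from a clean state
    gc.disable()   # stop automatic collection from interfering with the count
    try:
        for _ in range(num_pairs):
            a, b = Node(), Node()
            a.other, b.other = b, a  # a <-> b keeps both alive without the GC
        del a, b
        return gc.collect()  # number of unreachable objects found
    finally:
        gc.enable()


print(count_collected_cycles(100))
```

The exact number depends on the Python version, because instance dictionaries may or may not be counted as separate objects.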

In [50]:
import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="cuda")

In [51]:
# Pass the raw prompts, exactly as in the TensorRT-LLM run above, so both timings cover the same inputs.
# To format chat messages instead, see the tokenizer's chat template: https://huggingface.co/docs/transformers/main/en/chat_templating
time1 = datetime.datetime.now()
for p in prompts:
  # The pipeline returns a list of dicts; "generated_text" holds the prompt followed by the continuation
  outputs = pipe(p, max_new_tokens=256, do_sample=True, temperature=0.8, top_p=0.95)
  generated_text = outputs[0]["generated_text"]
  print(f"Prompt: {p!r},\n Generated text: {generated_text!r}")
  print("-"*25)


time2 = datetime.datetime.now()
print(f"Time {str(time2-time1)}")

Prompt: 'Hello, my name is',
 Generated text: "Hello, my name is [Your Name] and I am a [Job Title] at [Company Name]. I have been working for [Company Name] for [Number of Years] years and I am currently responsible for managing [Number of Employees] employees. I have experience in managing [Number of Employees] at other companies, which gives me a good understanding of the challenges and opportunities that come with managing a large team. I am very familiar with [Company Name]'s mission and vision, and I am confident in my ability to ensure that our team is aligned with these goals.\n\nI am committed to providing the highest level of service to [Company Name]'s employees, as well as to ensuring that our team is fully engaged and motivated to achieve our company's goals. I am excited to join [Company Name] and contribute to its continued success.\n\nIn my previous role as [Previous Position], I was responsible for [Responsibilities], which included [Responsibilities]. In this role, I 
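Note that the Hugging Face output above starts with the prompt itself, while the TensorRT-LLM output earlier contains only the continuation. To compare the two side by side, the echoed prompt can be removed. The sketch below does this with a small string helper; `strip_prompt` is a hypothetical name. The text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect, though that option was not used in this notebook.

```python
def strip_prompt(prompt: str, full_text: str) -> str:
    """Return only the continuation when full_text begins with the prompt."""
    if full_text.startswith(prompt):
        # Drop the echoed prompt and the whitespace that follows it
        return full_text[len(prompt):].lstrip()
    # Leave the text untouched if the prompt was not echoed
    return full_text


example = strip_prompt("Hello, my name is", "Hello, my name is [Your Name] and I am")
print(repr(example))
```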

In [43]:
del pipe
torch.cuda.empty_cache()
gc.collect()

0