# TensorRT-LLM - Llama 3 1M Context

##### https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#1m-long-context-test-case

### 15 July 2024
### nvcr.io/nvidia/rapidsai/notebooks:24.04-cuda12.0-py3.10

### VM Specs

In [1]:
!uname -a

Linux verb-workspace 6.2.0-37-generic #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov  2 18:01:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux


In [2]:
!cat /etc/lsb-release

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"


In [3]:
!nvidia-smi

Mon Jul 15 22:30:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0              42W / 400W |      7MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [4]:
!free -h

               total        used        free      shared  buff/cache   available
Mem:           216Gi       2.0Gi        25Gi       5.0Mi       188Gi       212Gi
Swap:             0B          0B          0B


In [5]:
!nproc

30


In [6]:
!python -V

Python 3.10.14


# Install System & Python Dependencies   
https://nvidia.github.io/TensorRT-LLM/installation/linux.html

In [7]:
%%time 

!apt-get -y install python3.10 python3-pip python3.10-dev  openmpi-bin libopenmpi-dev git git-lfs python3-mpi4py

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libopenmpi-dev is already the newest version (4.1.2-2ubuntu1).
openmpi-bin is already the newest version (4.1.2-2ubuntu1).
python3-mpi4py is already the newest version (3.1.3-1build2).
git is already the newest version (1:2.34.1-1ubuntu1.11).
python3.10 is already the newest version (3.10.12-1~22.04.4).
python3.10-dev is already the newest version (3.10.12-1~22.04.4).
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
python3-pip is already the newest version (22.0.2+dfsg-1ubuntu0.4).
0 upgraded, 0 newly installed, 0 to remove and 23 not upgraded.
CPU times: user 21.9 ms, sys: 11.3 ms, total: 33.2 ms
Wall time: 1.28 s


# Install TensorRT-LLM
### For the latest versions, use the '--pre' command-line option
##### This is needed for the 1M Context use case provided in this notebook

In [8]:
%%time
# 4+ minutes

# latest version -> add --pre after '-U'
# stable version -> no `--pre` option; will be v0.10 (15 Jul '24)

!pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
CPU times: user 58.9 ms, sys: 27.1 ms, total: 86 ms
Wall time: 5.23 s


### Test TensorRT-LLM Installation

In [9]:
import tensorrt_llm

[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024070900


In [10]:
tensorrt_llm.__version__

'0.12.0.dev2024070900'
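
Since the 1M-context example depends on a pre-release build, it can help to confirm programmatically that the installed version is a dev/pre-release one. The following is a minimal sketch, not part of TensorRT-LLM: `is_prerelease` is a hypothetical helper that only looks for common PEP 440 markers (`.devN`, `aN`, `bN`, `rcN`) in the version string.

```python
import re

def is_prerelease(version: str) -> bool:
    """Return True when the version string carries a dev or pre-release marker.

    Hypothetical helper: it only recognises the common PEP 440 markers
    ".devN", "aN", "bN" and "rcN".
    """
    return re.search(r"(\.dev\d+|(a|b|rc)\d+)", version) is not None

# The version reported by this notebook, and a stable-style string for contrast.
print(is_prerelease("0.12.0.dev2024070900"))  # True
print(is_prerelease("0.10.0"))                # False
```

In the notebook, `is_prerelease(tensorrt_llm.__version__)` would run the same check against the installed package.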

# git clone TensorRT-LLM code
### 13 seconds

In [12]:
%%time 
# 13 seconds

!git clone https://github.com/NVIDIA/TensorRT-LLM.git

Cloning into 'TensorRT-LLM'...
remote: Enumerating objects: 19939, done.
remote: Counting objects: 100% (9518/9518), done.
remote: Compressing objects: 100% (2362/2362), done.
remote: Total 19939 (delta 7631), reused 8358 (delta 7118), pack-reused 10421
Receiving objects: 100% (19939/19939), 298.50 MiB | 56.48 MiB/s, done.
Resolving deltas: 100% (14668/14668), done.
Updating files: 100% (2422/2422), done.
Filtering content: 100% (14/14), 212.51 MiB | 113.75 MiB/s, done.
CPU times: user 151 ms, sys: 66.8 ms, total: 218 ms
Wall time: 13.5 s


### Install dependencies

In [14]:
!cat TensorRT-LLM/requirements-dev.txt

-r requirements.txt
datasets==2.19.2
einops
graphviz
mypy
parameterized
pre-commit
pybind11
pybind11-stubgen
pytest-cov
pytest-forked
pytest-xdist
rouge_score
cloudpickle
typing-extensions==4.8.0
bandit==1.7.7
jsonlines==4.0.0
jieba==0.42.1
rouge==1.0.1


In [15]:
%%time 
# 2.5 minutes
!pip install -r TensorRT-LLM/requirements-dev.txt

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Ignoring tensorrt: markers 'platform_machine == "aarch64"' don't match your environment
Collecting accelerate>=0.25.0 (from -r TensorRT-LLM/requirements.txt (line 2))
  Using cached accelerate-0.32.1-py3-none-any.whl.metadata (18 kB)
Collecting build (from -r TensorRT-LLM/requirements.txt (line 3))
  Using cached build-1.2.1-py3-none-any.whl.metadata (4.3 kB)
Collecting colored (from -r TensorRT-LLM/requirements.txt (line 4))
  Using cached colored-2.2.4-py3-none-any.whl.metadata (3.6 kB)
Collecting cuda-python (from -r TensorRT-LLM/requirements.txt (line 5))
  Using cached cuda_python-12.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting diffusers>=0.27.0 (from -r TensorRT-LLM/requirements.txt (line 6))
  Using cached diffusers-0.29.2-py3-none-any.whl.metadata (19 kB)
Collecting lark (from -r TensorRT-LLM/requirements.txt (line 7))
  Using cached lark-1.1.9-py3-none-any.whl.met

# Download Llama-3 8B Instruct Gradient 1048K
### 2.5 minutes
##### https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k

In [16]:
%%time 
# 2.5 minutes

!git-lfs clone https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k/

          with new flags from 'git clone'

'git clone' has been updated in upstream Git to have comparable
speeds to 'git lfs clone'.
Cloning into 'Llama-3-8B-Instruct-Gradient-1048k'...
remote: Enumerating objects: 96, done.
remote: Counting objects: 100% (93/93), done.
remote: Compressing objects: 100% (93/93), done.
remote: Total 96 (delta 49), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (96/96), 2.27 MiB | 8.39 MiB/s, done.
CPU times: user 1.33 s, sys: 388 ms, total: 1.72 s
Wall time: 2min 39s
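
If the LFS download is interrupted, the weight files can be left as small Git LFS pointer stubs instead of the real tensors. The sketch below is one way to detect that. It assumes the weights are stored as `*.safetensors` files at the top level of the model directory. `is_lfs_pointer` and `find_unfetched_weights` are hypothetical helper names. The pointer prefix is the standard first line of a Git LFS pointer file.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path: Path) -> bool:
    """True if the file is an LFS pointer stub rather than the real payload."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX

def find_unfetched_weights(model_dir: str) -> list[str]:
    """Names of *.safetensors files in model_dir that are still LFS pointers.

    Assumption: weights live as *.safetensors directly under model_dir.
    """
    return sorted(
        p.name for p in Path(model_dir).glob("*.safetensors") if is_lfs_pointer(p)
    )
```

Calling `find_unfetched_weights("Llama-3-8B-Instruct-Gradient-1048k")` should return an empty list once all weight files have been fetched.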


# Convert Checkpoint 
### 12 Minutes
##### output dir --> /home/rapids/notebooks/Llama-3-8B-Instruct-Gradient-1048k/trt_ckpts

In [None]:
%%time

# 12 minutes
!python TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir  Llama-3-8B-Instruct-Gradient-1048k/ \
    --output_dir Llama-3-8B-Instruct-Gradient-1048k/trt_ckpts \
    --dtype float16 \
    --tp_size 4

[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024070900
0.12.0.dev2024070900
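
Before building the engine, it may be worth checking that the conversion wrote one weight shard per tensor-parallel rank. The sketch below assumes the checkpoint directory holds a `config.json` plus `rank0.safetensors` through `rank{tp_size-1}.safetensors`. That layout is an assumption about the checkpoint format, and `missing_rank_files` is a hypothetical helper, not a TensorRT-LLM API.

```python
from pathlib import Path

def missing_rank_files(ckpt_dir: str, tp_size: int) -> list[str]:
    """Return the expected checkpoint files that are absent from ckpt_dir.

    Assumed layout: config.json plus rank{r}.safetensors for each
    tensor-parallel rank r in [0, tp_size).
    """
    expected = ["config.json"] + [f"rank{r}.safetensors" for r in range(tp_size)]
    root = Path(ckpt_dir)
    return [name for name in expected if not (root / name).is_file()]
```

With the settings above, `missing_rank_files("Llama-3-8B-Instruct-Gradient-1048k/trt_ckpts", 4)` returning an empty list would indicate that all four shards are present.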


# Build TensorRT-LLM Engine
### 6 minutes
##### output dir --> /home/rapids/notebooks/Llama-3-8B-Instruct-Gradient-1048k/trt_engines

In [None]:
%%time 
# 6 minutes

!python -m tensorrt_llm.commands.build \
            --checkpoint_dir Llama-3-8B-Instruct-Gradient-1048k/trt_ckpts \
            --output_dir Llama-3-8B-Instruct-Gradient-1048k/trt_engines \
            --gemm_plugin float16 \
            --max_num_tokens 4096 \
            --max_input_len 1048566 \
            --max_seq_len 1048576 \
            --use_paged_context_fmha enable \
            --workers 4
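
The relationship between the three length flags above can be worked out with a little arithmetic. This is a rough sketch under two assumptions: the generation budget is `max_seq_len - max_input_len`, and a chunked prefill processes at most `max_num_tokens` prompt tokens per step. The actual scheduler may behave differently. The function names are illustrative only.

```python
import math

MAX_INPUT_LEN = 1_048_566   # --max_input_len
MAX_SEQ_LEN = 1_048_576     # --max_seq_len
MAX_NUM_TOKENS = 4_096      # --max_num_tokens

def output_budget(max_seq_len: int, max_input_len: int) -> int:
    """Tokens left for generation after a maximum-length prompt."""
    return max_seq_len - max_input_len

def context_chunks(input_len: int, max_num_tokens: int) -> int:
    """Prefill steps needed if each step handles at most max_num_tokens tokens."""
    return math.ceil(input_len / max_num_tokens)

print(output_budget(MAX_SEQ_LEN, MAX_INPUT_LEN))      # 10
print(context_chunks(MAX_INPUT_LEN, MAX_NUM_TOKENS))  # 256
```

Under these assumptions, a full-length prompt leaves room for 10 generated tokens, which is enough for a short passkey answer.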

# Prepare 1M needle-in-a-haystack datasets
### 8 seconds

In [None]:
%%time 
# 8 seconds

!python ./TensorRT-LLM/examples/infinitebench/construct_synthetic_dataset.py \
    --test_case build_passkey \
    --test_level 7

### Inspect Synthetic Data

In [None]:
!wc -c passkey.jsonl
!wc -w passkey.jsonl
!head -c 150 passkey.jsonl && printf '\n.....\n' && tail -c 250 passkey.jsonl
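
The shell commands above show raw bytes. A structured look at the records can be sketched in Python as follows. The field names inside `passkey.jsonl` are not assumed here: the hypothetical helper `summarize_jsonl` reports whatever keys each record has and replaces long string values with their character counts.

```python
import json

def summarize_jsonl(path: str, max_records: int = 3) -> list[dict]:
    """Summarise the first few JSONL records: string values become lengths."""
    summaries = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            if len(summaries) >= max_records:
                break
            record = json.loads(line)
            summaries.append(
                {k: (len(v) if isinstance(v, str) else v) for k, v in record.items()}
            )
    return summaries
```

For example, `summarize_jsonl("passkey.jsonl", max_records=1)` gives a quick view of the keys and the prompt size without printing the whole haystack.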

### Run Inference

In [None]:
!mkdir -p 1M_context

In [None]:
%%time
# <1 second

!mpirun -n 4   --allow-run-as-root python3 TensorRT-LLM/examples/eval_long_context.py \
               --task passkey \
               --engine_dir /home/rapids/notebooks/Llama-3-8B-Instruct-Gradient-1048k/trt_engines \
               --tokenizer_dir ./Llama-3-8B-Instruct-Gradient-1048k/ \
               --stop_idx 1 \
               --max_input_length 1048566 \
               --enable_chunked_context \
               --max_tokens_in_paged_kv_cache 1100000 \
               --output_dir ./1M_context
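
A back-of-the-envelope estimate shows why the run is split across four ranks. The sketch below assumes the published Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128), an FP16 KV cache, and an even split of the cache across the 4 tensor-parallel ranks. These figures are assumptions for illustration, not values read from the engine.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed Llama 3 8B shape with FP16 (2-byte) cache entries.
PER_TOKEN = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8,
                                     head_dim=128, bytes_per_value=2)

MAX_KV_TOKENS = 1_100_000   # --max_tokens_in_paged_kv_cache
TP_SIZE = 4                 # mpirun -n 4 / --tp_size 4

total_gib = MAX_KV_TOKENS * PER_TOKEN / 2**30
per_gpu_gib = total_gib / TP_SIZE   # assumes an even split across ranks

print(PER_TOKEN)                # 131072 bytes (128 KiB) per token
print(round(total_gib, 1))      # 134.3
print(round(per_gpu_gib, 1))    # 33.6
```

Under these assumptions the full KV budget is far larger than a single 40 GB A100 can hold, while the per-rank share of roughly 33.6 GiB is within reach once the cache is sharded four ways.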