
# Understanding TensorRT-LLM: Kernel Fusion, Memory Efficiency, and High-Throughput LLM Inference

**Author:** Marco Punio  

This notebook is a conceptual and light-code explainer of why **NVIDIA TensorRT-LLM (TRT-LLM)** can deliver **2–4× faster LLM inference** compared to framework baselines like vanilla PyTorch, while also being more efficient than many general-purpose runtimes.

> 🔎 Note: This notebook is **CPU-only** and does **not** require a GPU or TensorRT-LLM installation. The goal is to understand **architecture and performance reasoning**, not to run real TRT-LLM code.



## 1. The LLM Inference Bottleneck: Prefill vs Decode

LLM inference is usually split into two phases:

- **Prefill (context ingestion)**: process the full input sequence once.
- **Decode (token generation)**: generate tokens one-by-one while reusing a **KV cache**.

### 1.1 Prefill Phase

- Long sequence length (e.g., 1K–8K tokens).
- Large matrix multiplications (GEMMs) dominate.
- GPU can reach high Tensor Core utilization.
- Often **compute-bound**.

### 1.2 Decode Phase

- We generate **one token per step per sequence**.
- Each new token has to attend over all previous tokens using the KV cache.
- Much smaller effective batch size at each step.
- Often **memory-bound** because we keep re-reading the KV cache.

This means **prefill and decode have different bottlenecks**, which matters when thinking about why TRT-LLM is faster.
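One way to make the compute-bound vs memory-bound distinction concrete is arithmetic intensity, the number of FLOPs performed per byte moved. The sketch below estimates it for a single weight-matrix multiply. The helper name `gemm_arithmetic_intensity`, the hidden size of 4096, and the FP16 (2-byte) element size are assumptions chosen for illustration, not measurements.

```python
def gemm_arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for multiplying a (num_tokens, d_in) activation
    by a (d_in, d_out) weight matrix."""
    flops = 2 * num_tokens * d_in * d_out  # one multiply + one add per term
    bytes_moved = bytes_per_elem * (
        num_tokens * d_in       # read activations
        + d_in * d_out          # read weights
        + num_tokens * d_out    # write outputs
    )
    return flops / bytes_moved

# Assumed sizes: hidden dimension 4096, FP16 elements.
prefill_ai = gemm_arithmetic_intensity(num_tokens=2048, d_in=4096, d_out=4096)
decode_ai = gemm_arithmetic_intensity(num_tokens=1, d_in=4096, d_out=4096)

print(f"Prefill (2048 tokens): {prefill_ai:.1f} FLOPs/byte")
print(f"Decode  (1 token):     {decode_ai:.2f} FLOPs/byte")
```

Under these assumptions the prefill multiply does roughly a thousand times more math per byte than the single-token decode multiply, which is the intuition behind calling prefill compute-bound and decode memory-bound.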


In [4]:

import time
import numpy as np

def naive_attention_step(seq_len=2048, d_model=128):
    """CPU-only simulation of a single attention step.
    This is NOT optimized and is just for timing intuition."""
    Q = np.random.randn(1, d_model)
    K = np.random.randn(seq_len, d_model)
    V = np.random.randn(seq_len, d_model)

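    scores = Q @ K.T                         # (1, seq_len)
    weights = np.exp(scores - scores.max())  # subtract the max for numerical stability
    probs = weights / weights.sum()          # softmax over all cached positions
    out = probs @ V                          # (1, d_model)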
    return out

for L in [256, 512, 1024, 2048, 4096]:
    t0 = time.perf_counter()  # monotonic, high-resolution timer
    _ = naive_attention_step(seq_len=L)
    dt = time.perf_counter() - t0
    print(f"Seq len = {L:4d}, simulated attention latency = {dt:.4f} s")

Seq len =  256, simulated attention latency = 0.0026 s
Seq len =  512, simulated attention latency = 0.0043 s
Seq len = 1024, simulated attention latency = 0.0056 s
Seq len = 2048, simulated attention latency = 0.0333 s
Seq len = 4096, simulated attention latency = 0.0165 s


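### Key takeaways from the first cell

- For a single query (one decode step), attention cost grows as O(N) in the sequence length N, because `Q @ K.T` touches every cached key once. Prefill, with N queries, grows as O(N²).
- Single-shot timings at this scale are noisy: in the run above, 2048 tokens took longer than 4096. Treat the numbers as a rough trend, not a benchmark.
- The first timing measurement is the least reliable because it includes library warmup.
- Each call also times the random generation of K and V, and BLAS kernel selection and cache effects can make a longer sequence appear faster than a shorter one.
- This experiment still builds useful transformer intuition without a GPU.
- The same linear growth in per-token work is what GPU libraries such as NVIDIA cuBLAS and TensorRT-LLM have to execute efficiently.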
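Because single-shot timings are so noisy, a slightly more careful harness can help. The sketch below builds the inputs outside the timed region, runs one warmup call, and reports the median of several repeats using `time.perf_counter`. The helper names (`attention_step`, `median_latency`) and the repeat count are assumptions made for illustration.

```python
import statistics
import time

import numpy as np

def attention_step(Q, K, V):
    """Single-query attention with a numerically stable softmax."""
    scores = Q @ K.T
    weights = np.exp(scores - scores.max())
    return (weights / weights.sum()) @ V

def median_latency(seq_len, d_model=128, repeats=20):
    """Median wall-clock time of one attention step, excluding data generation."""
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((1, d_model))
    K = rng.standard_normal((seq_len, d_model))
    V = rng.standard_normal((seq_len, d_model))

    attention_step(Q, K, V)  # warmup call, not timed

    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        attention_step(Q, K, V)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

for L in [256, 512, 1024, 2048, 4096]:
    print(f"Seq len = {L:4d}, median latency = {median_latency(L) * 1e6:.1f} us")
```

The median is used rather than the mean so that one slow outlier (for example, an OS scheduling hiccup) does not dominate the result.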

### Using NumPy to understand algorithmic intuition

The NumPy example is intentionally CPU-only. Its purpose is to illustrate ***what attention computes*** and why one decode step of full attention scales as **O(sequence_length)**.
NumPy is well suited to this because it exposes the math directly:
- Q @ Kᵀ
- softmax
- attention × V

This demo helps us understand *why* attention becomes expensive as context grows.

However, NumPy does **not** reflect how real frameworks execute attention on GPUs.



## 2. Why Framework Baselines Struggle (Vanilla PyTorch/Eager)

A typical transformer layer in an LLM involves the following operations:

- Linear projections for Q, K, V
- Attention (softmax over QKᵀ)
- Feed-forward network (MLP)
- Residual connections
- LayerNorm

In a naive framework execution (e.g., vanilla PyTorch eager mode):

- Each operation often launches a **separate kernel** on the GPU.
- Each kernel reads/writes from global memory.
- The CPU runtime / dispatcher adds overhead between kernel launches.

This leads to:

- Many small kernels per token during **decode**.
- High **kernel launch overhead**.
- High **memory traffic** (read/write between each op).
- Inability to fully saturate Tensor Cores, especially in decode.
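To get a feel for how launch overhead adds up during decode, the sketch below multiplies a per-launch cost by the number of kernels needed for one token. All numbers (32 layers, 20 vs 6 kernels per layer, 5 µs per launch) are hypothetical values picked for illustration, and the model pessimistically assumes launches are not overlapped with GPU work.

```python
def launch_overhead_per_token_ms(num_layers, kernels_per_layer, launch_overhead_us):
    """Milliseconds per generated token spent purely on kernel launch overhead,
    assuming launches are serialized and not hidden behind GPU execution."""
    total_launches = num_layers * kernels_per_layer
    return total_launches * launch_overhead_us / 1000.0

NUM_LAYERS = 32   # assumed model depth
LAUNCH_US = 5.0   # assumed cost of one launch, in microseconds

for label, kernels_per_layer in [("unfused (eager-style)", 20), ("fused", 6)]:
    ms = launch_overhead_per_token_ms(NUM_LAYERS, kernels_per_layer, LAUNCH_US)
    print(f"{label:22s}: {NUM_LAYERS * kernels_per_layer:4d} launches, "
          f"{ms:.2f} ms of launch overhead per token")
```

In practice launches are asynchronous, so this overhead only becomes visible when individual kernels finish faster than the CPU can issue the next one, which is exactly the situation small decode kernels tend to create.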



## 3. What TensorRT-LLM Does Differently (High Level)

**TensorRT-LLM** is a specialized runtime and compiler stack for transformer inference that focuses on:

1. **Graph-level optimization**  
   - Capture the LLM graph.
   - Fuse compatible ops into bigger kernels.
   - Reorder operations for better memory access patterns.

2. **Fused attention and MLP kernels**  
   - Combine multiple operations into a single kernel launch.
   - Greatly reduce memory round-trips.

3. **Efficient KV cache management (paged KV cache)**  
   - Avoid large, contiguous allocations that fragment memory.
   - Use block-based layouts that scale better with sequence length and batch size.

4. **Dynamic batching and scheduling**  
   - Batch tokens across multiple sequences at runtime.
   - Maximize throughput without blowing up latency.

5. **Lower precision (FP8/INT8) support**  
   - Use Tensor Cores more effectively.
   - Reduce memory bandwidth and storage per parameter.
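The effect of lower precision on decode can be sketched with a simple bandwidth bound: at batch size 1, every weight has to be read from GPU memory at least once per generated token. The parameter count (7B), the memory bandwidth (2000 GB/s), and the function name below are assumptions chosen for illustration only.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0}

def min_decode_ms_per_token(num_params, bytes_per_param, bandwidth_gb_s):
    """Lower bound on decode time per token if every weight is read once
    from GPU memory (batch size 1, KV cache traffic ignored)."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e3

NUM_PARAMS = 7e9          # assumed model size
BANDWIDTH_GB_S = 2000.0   # assumed GPU memory bandwidth

for name, nbytes in BYTES_PER_PARAM.items():
    ms = min_decode_ms_per_token(NUM_PARAMS, nbytes, BANDWIDTH_GB_S)
    gib = NUM_PARAMS * nbytes / 2**30
    print(f"{name}: {gib:5.1f} GiB of weights, >= {ms:.2f} ms per token")
```

Halving the bytes per parameter halves this lower bound, which is why lower precision helps a memory-bound decode phase even before any Tensor Core speedup is considered.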


TensorRT-LLM is similar in spirit to TorchDynamo and TorchInductor in that it
captures the model graph, fuses operations, and generates optimized kernels.
However, TRT-LLM goes much further: it also provides a full inference runtime,
paged KV cache management, dynamic batching, FP8 Tensor Core kernels, and
serving-level optimizations. It is not just a compiler; it is a complete
LLM inference engine.
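The benefit of dynamic batching can be illustrated with a toy scheduler that only counts decode steps. The sketch below compares a static policy (each batch runs until its longest sequence finishes) with a continuous policy (a finished sequence's slot is refilled immediately). Both function names and the request lengths are hypothetical; a real scheduler also has to handle prefill, memory limits, and priorities.

```python
def static_batching_steps(output_lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled right away."""
    pending = list(output_lengths)
    active = []  # remaining tokens for each sequence currently in the batch
    steps = 0
    while pending or active:
        # Fill any free slots before running the next decode step.
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1
        # Every active sequence emits one token; drop the ones that are done.
        active = [n - 1 for n in active if n > 1]
    return steps

# Hypothetical output lengths (tokens to generate) for eight requests.
lengths = [10, 200, 15, 180, 20, 190, 12, 170]
static = static_batching_steps(lengths, batch_size=2)
continuous = continuous_batching_steps(lengths, batch_size=2)
print(f"Static batching:     {static} decode steps")
print(f"Continuous batching: {continuous} decode steps")
```

The gap grows when short and long requests are mixed, because static batching leaves slots idle while it waits for the longest sequence in each batch.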




## 4. KV Cache and Paged Attention (Conceptual)

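During decode, we reuse the previously computed key/value tensors (K, V) so that generating token *t* does **not** require recomputing the keys and values for all tokens *< t*. The new token still attends over every cached position, but the projections for earlier tokens are computed only once. These cached tensors form the **KV cache**.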

### 4.1 Naive KV Cache Layout

In a naive implementation:

- Each sequence allocates one large contiguous KV buffer.
- As sequences grow at different rates, memory becomes fragmented.
- If the buffer needs to grow, frameworks may need to `realloc` or copy large amounts of memory.
- This becomes expensive when serving many concurrent sequences.

This layout is simple but does not scale well for real-time multi-user serving workloads.
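A quick size estimate shows why contiguous, worst-case allocation is costly. The sketch below computes the KV cache footprint for one sequence under an assumed model configuration (32 layers, 32 KV heads, head dimension 128, FP16); these values and the function name are illustrative, not taken from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to store K and V for one sequence across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed configuration: 32 layers, 32 KV heads, head_dim 128, FP16 elements.
CONFIG = dict(num_layers=32, num_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **CONFIG)
reserved = kv_cache_bytes(seq_len=4096, **CONFIG)  # buffer sized for the max length
used = kv_cache_bytes(seq_len=300, **CONFIG)       # what a short request really needs

print(f"KV cache per token:          {per_token / 2**20:.2f} MiB")
print(f"Reserved for 4096 tokens:    {reserved / 2**30:.2f} GiB")
print(f"Actually used by 300 tokens: {used / 2**20:.0f} MiB")
print(f"Wasted by over-reservation:  {100 * (1 - used / reserved):.1f}%")
```

With many concurrent requests, this kind of over-reservation limits how many sequences fit on the GPU at once.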

### 4.2 Paged KV Cache (High Level)

Paged KV cache avoids these problems by splitting the KV memory into **fixed-size blocks ("pages")**, typically 16-128 tokens each.

A small **indirection table** maps logical sequence positions to physical memory blocks.

This gives several benefits:

- **No large reallocations:** Growing a sequence only means attaching another page, not resizing a giant buffer.
- **Memory reuse:** Freed pages can be recycled for other sequences.
- **Reduced fragmentation:** The system only manages uniform blocks.
- **Efficient batching:** Because every sequence draws uniform pages from one shared pool, sequences of different lengths can be packed into the same batch.

Paged KV cache is one of the key enablers behind high-throughput serving systems like vLLM and TensorRT-LLM.
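The indirection table can be sketched as a small allocator that hands out fixed-size pages from a free list. The class below is a toy written for this notebook (`PagedKVAllocator` and its methods are hypothetical names, not the vLLM or TensorRT-LLM API), and it only tracks page bookkeeping, not the K/V tensors themselves.

```python
class PagedKVAllocator:
    """Toy page allocator mapping each sequence's logical token positions
    to physical pages drawn from a shared pool."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; returns (physical_page, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.page_size == 0:  # first token, or current page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def free_sequence(self, seq_id):
        """Return all of a finished sequence's pages to the pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVAllocator(num_pages=8, page_size=4)
for _ in range(6):
    alloc.append_token("seq-A")  # 6 tokens -> 2 pages
for _ in range(3):
    alloc.append_token("seq-B")  # 3 tokens -> 1 page
print("Block tables:", alloc.block_tables)

alloc.free_sequence("seq-A")     # pages become reusable immediately
print("Free pages after seq-A finishes:", len(alloc.free_pages))
```

Growing a sequence never moves existing data: it only appends a page id to the table, and freed pages can be handed to any other sequence.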

### Important Clarification: Paged Attention Does *Not* Reduce Compute

Even with paging, attention still computes against **all previous tokens**.
Splitting K/V into blocks changes **how we access memory**, but it does *not*
change the algorithmic cost.

This is why the simulated paged-attention NumPy cell below shows latencies in
the same range as the full-attention version: the amount of computation is the
same, and only the memory layout is different.

On real GPUs, however, the paged layout:

- Avoids reserving memory for tokens that are never generated,
- Lets more sequences fit in GPU memory at once, which allows larger batches,
- Reduces allocation overhead under high concurrency,
- Greatly reduces fragmentation,

all of which significantly improve **system-level performance**, even though the
core math is unchanged.

In [6]:

import math
import time
import numpy as np

BLOCK = 64  # tokens per "page"

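def paged_attention_step(seq_len=2048, d_model=128, block_size=BLOCK):
    """CPU-only simulation of a block-wise attention pattern.
    Still not optimized, but mirrors the idea of processing in chunks.
    A running max and running sum keep the softmax normalized across
    all blocks, so the math matches full attention over the same K/V."""
    Q = np.random.randn(1, d_model)
    num_blocks = math.ceil(seq_len / block_size)
    acc = np.zeros((1, d_model))   # unnormalized weighted sum of V
    running_max = -np.inf          # largest score seen so far
    running_sum = 0.0              # softmax denominator so far

    for b in range(num_blocks):
        this_block_size = min(block_size, seq_len - b * block_size)
        K_block = np.random.randn(this_block_size, d_model)
        V_block = np.random.randn(this_block_size, d_model)
        scores = Q @ K_block.T
        new_max = max(running_max, scores.max())
        rescale = np.exp(running_max - new_max)  # 0.0 on the first block
        weights = np.exp(scores - new_max)
        acc = acc * rescale + weights @ V_block
        running_sum = running_sum * rescale + weights.sum()
        running_max = new_max

    return acc / running_sum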

for L in [256, 512, 1024, 2048, 4096]:
    t0 = time.perf_counter()  # monotonic, high-resolution timer
    _ = paged_attention_step(seq_len=L)
    dt = time.perf_counter() - t0
    print(f"Seq len = {L:4d}, simulated paged attention = {dt:.4f} s")

Seq len =  256, simulated paged attention = 0.0152 s
Seq len =  512, simulated paged attention = 0.0020 s
Seq len = 1024, simulated paged attention = 0.0088 s
Seq len = 2048, simulated paged attention = 0.0077 s
Seq len = 4096, simulated paged attention = 0.0163 s
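To check the claim that block-wise processing leaves the math unchanged, the sketch below runs full attention and a block-wise version over the same K and V and compares the outputs. It assumes the softmax statistics (a running max and a running sum) are carried across blocks, which is similar in spirit to what fused attention kernels do. The function names are made up for this illustration.

```python
import numpy as np

def full_attention(Q, K, V):
    """Single-query attention over the whole sequence at once."""
    scores = Q @ K.T
    weights = np.exp(scores - scores.max())
    return (weights / weights.sum()) @ V

def blockwise_attention(Q, K, V, block_size=64):
    """Same result, computed one block ("page") of K/V at a time."""
    acc = np.zeros((1, V.shape[1]))  # unnormalized weighted sum of V
    running_max = -np.inf            # largest score seen so far
    running_sum = 0.0                # softmax denominator so far
    for start in range(0, K.shape[0], block_size):
        K_block = K[start:start + block_size]
        V_block = V[start:start + block_size]
        scores = Q @ K_block.T
        new_max = max(running_max, scores.max())
        rescale = np.exp(running_max - new_max)  # rescales earlier blocks
        weights = np.exp(scores - new_max)
        acc = acc * rescale + weights @ V_block
        running_sum = running_sum * rescale + weights.sum()
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
Q = rng.standard_normal((1, 128))
K = rng.standard_normal((1000, 128))  # length not a multiple of the block size
V = rng.standard_normal((1000, 128))

same = np.allclose(full_attention(Q, K, V), blockwise_attention(Q, K, V))
print("Block-wise result matches full attention:", same)
```

Applying an independent softmax inside each block and summing the results would not match full attention, so carrying the normalization across blocks is the essential step.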



## 5. Kernel Fusion: Fewer, Heavier Kernels

A simplified unfused pipeline for a feed-forward block might look like:

1. Linear layer (GEMM)
2. Add bias
3. Apply activation (e.g., GELU)
4. Another linear layer
5. Add residual connection

Each of these steps might be a separate kernel launch in a naive implementation.

### 5.1 Unfused vs Fused (Conceptual)

- **Unfused**:  
  - Each step reads from and writes to global memory.  
  - Many small kernels, each doing relatively little work.

- **Fused**:  
  - Combine multiple steps into one kernel.  
  - Data stays in registers or shared memory longer.  
  - Far fewer global memory accesses.

On GPUs, **global memory bandwidth** is a major bottleneck. Fusing kernels reduces trips to global memory, which directly boosts throughput.
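The reduction in memory traffic can be estimated by counting bytes for a chain of elementwise steps. The sketch below assumes a chain of three ops (bias add, activation, residual add) on an FP16 activation of shape (8, 16384), and it ignores the extra operands such as the bias vector and the residual input. The function name and sizes are hypothetical.

```python
def elementwise_chain_traffic(num_elems, num_ops, bytes_per_elem=2, fused=False):
    """Bytes moved through global memory by a chain of elementwise ops.

    Unfused: every op reads its input and writes its output.
    Fused:   one read at the start and one write at the end."""
    passes = 1 if fused else num_ops
    return passes * 2 * num_elems * bytes_per_elem

# Assumed activation shape (batch=8, hidden=16384) in FP16.
num_elems = 8 * 16384
unfused_bytes = elementwise_chain_traffic(num_elems, num_ops=3)
fused_bytes = elementwise_chain_traffic(num_elems, num_ops=3, fused=True)

print(f"Unfused traffic: {unfused_bytes / 2**20:.2f} MiB")
print(f"Fused traffic:   {fused_bytes / 2**20:.2f} MiB")
print(f"Reduction:       {unfused_bytes / fused_bytes:.1f}x")
```

In this simplified model the saving equals the number of ops in the chain, since intermediate results never leave registers or shared memory.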


In [7]:

import numpy as np

def unfused_mlp(x, W1, b1, W2, b2):
    # Many conceptual "kernels"
    h = x @ W1       # GEMM
    h = h + b1       # add bias
    h = np.tanh(h)   # activation
    out = h @ W2     # GEMM
    out = out + b2   # add bias
    return out

def fused_mlp(x, W1, b1, W2, b2):
    # Same math, but conceptually in one kernel
    h = x @ W1
    h = np.tanh(h + b1)
    out = h @ W2 + b2
    return out

# Just to show they're numerically close
np.random.seed(0)
x = np.random.randn(2, 4)
W1 = np.random.randn(4, 8)
b1 = np.random.randn(8)
W2 = np.random.randn(8, 4)
b2 = np.random.randn(4)

u = unfused_mlp(x, W1, b1, W2, b2)
f = fused_mlp(x, W1, b1, W2, b2)
print("Max difference between unfused and fused:", np.max(np.abs(u - f)))

Max difference between unfused and fused: 0.0


### Note:

The unfused and fused versions produce exactly the same output because fusion does *not* change the math. In this NumPy CPU example they also run at about the same speed, because NumPy does not launch GPU kernels; everything runs as plain CPU ops.

The goal of this example is just to show that fusion does not affect correctness. The *performance* benefits of fusion show up mainly on GPUs, where unfused execution would launch many small kernels and bounce data through global memory. TensorRT-LLM can fold steps such as bias, activation, and residual add into the neighboring GEMM kernels, which is why fused MLPs are typically faster in practice.


## 6. TRT-LLM vs PyTorch vs vLLM (Conceptual Comparison)

We can compare three broad approaches to LLM inference:

| Feature                         | Vanilla PyTorch | vLLM                         | TensorRT-LLM                     |
|---------------------------------|-----------------|-----------------------------|----------------------------------|
| Execution mode                  | Eager (dynamic) | Runtime optimized for LLMs  | Ahead-of-time + runtime optimized |
| Kernel fusion                   | Limited         | Some fusion & optimizations | Aggressive fusion via TensorRT   |
| KV cache layout                 | Basic/naive     | PagedAttention              | Paged KV + TensorRT-optimized   |
| Dynamic batching                | Manual/custom   | Built-in                    | Built-in                         |
| Precision support (FP8/INT8)    | Limited         | Limited / external          | First-class, Tensor Core aware   |
| Target use-case                 | Research/dev    | High-throughput serving     | High-throughput, low-latency prod |

The key takeaway is that **TRT-LLM** is built to **minimize memory traffic and kernel overhead** while maximizing **Tensor Core utilization**, especially during **decode**.



## 7. What Happens When We Generate a Single Token? (Conceptual Walkthrough)

For a single decode step on a batch of sequences, a very simplified view:

### 7.1 Vanilla PyTorch

1. Launch kernel for Q, K, V projections.
2. Launch kernel for QKᵀ.
3. Launch kernel for scaling + masking.
4. Launch kernel for softmax.
5. Launch kernel for attention * V.
6. Launch kernels for MLP and residuals.
7. Launch kernel for LayerNorm.

➡️ Many small kernels, many reads/writes to global memory.

### 7.2 vLLM

- Uses **paged attention** and runtime tactics to:
  - Reuse KV cache efficiently.
  - Batch multiple sequences.
  - Reduce some overhead compared to pure eager mode.

Still, many operations remain distinct kernels.

### 7.3 TensorRT-LLM

- Ahead-of-time optimized graph.
- **Fused attention kernels** that combine several of the above steps.
- Fused MLPs.
- Layouts chosen to best align with Tensor Cores.
- Lower precision (e.g., FP8) paths where appropriate.

➡️ **Far fewer kernels per token**, and each kernel does more useful work per byte of memory accessed.



## 8. Performance Reasoning (Back-of-the-Envelope)

We can think of tokens/sec as being limited by:

- **Compute throughput** (FLOPs / second).
- **Memory bandwidth** (bytes / second).
- **Kernel launch overhead** (CPU ↔ GPU coordination).

During **decode**, memory bandwidth and launch overhead often dominate.

### 8.1 Very Rough Toy Model

Let:

- `T_py` = time per token in vanilla PyTorch.
- `T_trt` = time per token in TRT-LLM.

If TRT-LLM:

- Reduces global memory reads/writes by ~2× through fusion.
- Reduces kernel launches per token from ~N to ~N/2 or less.
- Uses lower precision to double effective bandwidth.

Then it's entirely plausible for:

\[ T_{trt} \approx \frac{1}{2} T_{py} \quad \text{to} \quad \frac{1}{4} T_{py} \]

which matches the **2–4× throughput improvement** often reported in practice.
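The toy model above can be turned into a few lines of arithmetic. The sketch below adds a memory-transfer term and a launch-overhead term, assuming they do not overlap. Every number (20 GB moved per token for the baseline, 640 launches, 5 µs per launch, 2000 GB/s bandwidth) is a made-up illustration, not a measurement of PyTorch or TRT-LLM.

```python
def time_per_token_ms(bytes_moved, bandwidth_gb_s, num_launches, launch_us):
    """Toy per-token latency: memory transfer time plus launch overhead,
    assuming the two do not overlap."""
    memory_ms = bytes_moved / (bandwidth_gb_s * 1e9) * 1e3
    launch_ms = num_launches * launch_us / 1e3
    return memory_ms + launch_ms

BANDWIDTH_GB_S = 2000.0  # assumed GPU memory bandwidth
LAUNCH_US = 5.0          # assumed cost per kernel launch

# Baseline: assumed 20 GB of global memory traffic and 640 launches per token.
t_py = time_per_token_ms(20e9, BANDWIDTH_GB_S, 640, LAUNCH_US)

# Optimized: fusion halves the traffic, lower precision halves the bytes again,
# and the number of launches is cut in half.
t_trt = time_per_token_ms(20e9 / 2 / 2, BANDWIDTH_GB_S, 640 // 2, LAUNCH_US)

print(f"T_py  = {t_py:.2f} ms/token")
print(f"T_trt = {t_trt:.2f} ms/token")
print(f"Speedup = {t_py / t_trt:.2f}x")
```

With these inputs the speedup lands inside the 2–4× range, but the result is only as good as the assumptions; real gains depend on the model, batch size, and hardware.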


In [8]:

def estimate_cost_per_million_tokens(tokens_per_second, gpu_hourly_cost):
    """Very rough cost-per-million-tokens model."""
    seconds_per_million = 1_000_000 / tokens_per_second
    hours = seconds_per_million / 3600
    return hours * gpu_hourly_cost

tps_pytorch = 500   # toy numbers
tps_trtllm  = 1500  # 3x faster

gpu_cost = 3.0  # $/GPU-hour, example

cost_pytorch = estimate_cost_per_million_tokens(tps_pytorch, gpu_cost)
cost_trtllm  = estimate_cost_per_million_tokens(tps_trtllm, gpu_cost)

print(f"PyTorch cost per 1M tokens:  ${cost_pytorch:.2f}")
print(f"TRT-LLM  cost per 1M tokens: ${cost_trtllm:.2f}")
print(f"Cost reduction factor: {cost_pytorch / cost_trtllm:.2f}x")

PyTorch cost per 1M tokens:  $1.67
TRT-LLM  cost per 1M tokens: $0.56
Cost reduction factor: 3.00x


### What this cost model shows

This toy example illustrates a simple but important point: if a model produces more tokens per second, the cost of serving each token goes down. The cost per million tokens is proportional to the number of GPU-hours needed. If TensorRT-LLM delivers 3x higher throughput than PyTorch (1500 vs 500 tokens/sec in this example), then the GPU needs only one third of the time to generate the same number of tokens. Since GPU pricing is based on time, the cost per million tokens drops by the same factor.

This is why throughput optimizations translate directly into cost savings in production LLM workloads.


## 9. When to Use TensorRT-LLM vs vLLM (Customer Guidance)

### TensorRT-LLM is a great fit when:

- You need **maximum throughput** and **low latency** in production.
- You're serving **large models** (e.g., 13B, 34B, 70B+).
- You care about **cost-per-token** at scale.
- You can invest engineering time to integrate a more specialized runtime.
- You want to exploit **FP8/INT8** and advanced kernel fusion.

### vLLM is a great fit when:

- You want a simpler runtime with strong performance.
- You prioritize ease-of-use and fast experimentation.
- You may still be iterating on models and prompts frequently.
- You don't need the absolute maximum throughput that TRT-LLM can offer.

As a Solutions Architect, the key is to:

- Understand the **customer's constraints** (latency, throughput, budget, model size).
- Understand the **technical bottlenecks** (compute vs memory vs networking).
- Recommend the stack that aligns with both.



## 10. Summary and Talking Points

By now, you should be able to explain:

- Why LLM **decode** tends to be **memory-bound**.
- How **kernel fusion** in TensorRT-LLM reduces memory traffic and kernel launch overhead.
- What **paged KV cache** is and why it helps under high concurrency.
- How **FP8/INT8** and Tensor Cores contribute to higher throughput.
- Conceptual differences between **PyTorch**, **vLLM**, and **TensorRT-LLM**.

### Handy interview / customer-facing bullets

- *"TRT-LLM reduces the number of kernels per token and the amount of global memory traffic, which is why it often delivers 2–4× higher decode throughput than vanilla PyTorch."*
- *"Paged KV cache and dynamic batching are crucial for high-concurrency LLM serving because they keep memory usage predictable and avoid fragmentation."*
- *"The real value to customers isn't just speed in isolation, it's **cost per token** and **latency under load**, which is where TRT-LLM really shines."*

You can now expand this notebook by:

- Adding simple benchmark experiments on CPU (for educational purposes).
- Sketching diagrams of execution graphs.
- Linking to real TRT-LLM docs or repos in Markdown cells.
