<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>


# 0.1 Accelerating AI Inference: A Deep Dive into NVIDIA Dynamo on a K8s Cluster
Welcome to Accelerating AI Inference: A Deep Dive into NVIDIA Dynamo on a Kubernetes Cluster!<br>

In this course you'll learn how to deploy, configure, and optimize NVIDIA Dynamo, a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. This course will demonstrate the complete lifecycle of Dynamo deployment on Kubernetes, from basic setup to advanced disaggregated serving and performance optimization.


---


## 0.1.1 What is NVIDIA Dynamo?

NVIDIA Dynamo is a high-throughput, low-latency inference framework specifically designed for serving generative AI and reasoning models in multi-node distributed environments. It provides several key innovations:

### Key Features:
- **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps balance throughput and latency
- **Dynamic GPU scheduling**: Optimizes performance based on real-time demand  
- **LLM-aware request routing**: Eliminates unnecessary KV cache recomputation
- **Accelerated data transfer**: Reduces inference response time through NVIDIA Inference Transfer Library (NIXL)
- **Inference engine agnostic**: Supports TensorRT-LLM (TRT-LLM), vLLM, SGLang, and others

### Architecture Benefits:
- **40% improvement in TTFT (Time To First Token)** with KV cache system memory offloading
- **Efficient multi-node scaling** with disaggregated serving architecture
- **Reduced latency** through intelligent request routing and caching
- **Resource optimization** via dynamic GPU scheduling and workload distribution


---

# NVIDIA Dynamo Architecture & Concepts


## 0.1.2 High-Level Architecture

NVIDIA Dynamo is designed around several key architectural principles that address the unique challenges of serving large language models (LLMs) and generative AI workloads at scale.

### Core Design Principles:

1. **Inference Engine Agnostic**: Supports multiple backends including TensorRT-LLM, vLLM, and SGLang
2. **Distributed by Design**: Built for multi-node, multi-GPU environments from the ground up
3. **Performance Optimized**: Focus on both throughput and latency optimization
4. **Resource Efficient**: Smart resource allocation and utilization


The following diagram illustrates the architecture of the Dynamo platform:

<center><img src="images/dynamo/dynamo_architecture.png"></center>


### Key Components:

- **Dynamo Planner**: Intelligent workload distribution and GPU scheduling
- **KV Block Manager**: Efficient key-value cache management across memory tiers
- **Request Router**: LLM-aware routing to minimize recomputation
- **NIXL (NVIDIA Inference Transfer Library)**: Accelerated data transfer
- **Disaggregated Serving**: Separate prefill and decode operations


## 0.1.3 Disaggregated Serving

One of Dynamo's most powerful features is its ability to disaggregate the inference process into separate **prefill** and **decode** operations. This architectural choice provides significant performance benefits.

### Traditional vs. Disaggregated Serving

**Traditional Approach:**
The prefill phase processes user input to generate the first output token and is compute-bound, while the decode phase generates subsequent tokens and is memory-bound. Co-locating these phases on the same GPU or GPU node leads to inefficient resource use, especially for long input sequences. Additionally, the distinct hardware needs of each phase limit model parallelism flexibility, causing missed performance opportunities.

- Single GPU/node handles both prefill (processing input prompt) and decode (generating tokens)
- Resource contention between different phases of inference
- Suboptimal GPU utilization patterns

**Dynamo's Disaggregated Approach:**
To address these issues, disaggregated serving separates the prefill and decode phases onto different GPUs or nodes. This enables developers to optimize each phase independently, applying different model parallelism strategies and assigning different GPU devices to each phase.

- **Prefill nodes**: Optimized for high-throughput batch processing of input prompts
- **Decode nodes**: Optimized for low-latency token generation
- Independent scaling of each phase based on workload characteristics

<center><img src="images/dynamo/disaggregated-serving-traditional-serving-comparison-2.png" width="800" alt="Disaggregated Serving Architecture"></center>
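
To make the separation concrete, here is a minimal Python sketch of one request passing through separate prefill and decode worker pools. Everything in it is hypothetical: `PrefillWorker`, `DecodeWorker`, `KVCache`, and `serve` are illustrative names rather than Dynamo APIs, tokens are placeholder strings, and the KV cache hand-off is a plain Python object reference instead of a real GPU-to-GPU transfer.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class KVCache:
    """Stand-in for the attention state produced during prefill."""
    tokens: list = field(default_factory=list)


class PrefillWorker:
    """Processes the whole prompt in one pass (the compute-bound phase)."""

    def __init__(self, name):
        self.name = name

    def prefill(self, prompt):
        cache = KVCache(tokens=prompt.split())
        first_token = "tok0"  # placeholder for the first generated token
        return cache, first_token


class DecodeWorker:
    """Generates the remaining tokens one at a time (the memory-bound phase)."""

    def __init__(self, name):
        self.name = name

    def decode(self, cache, first_token, max_new_tokens):
        output = [first_token]
        cache.tokens.append(first_token)
        for i in range(1, max_new_tokens):
            token = f"tok{i}"           # placeholder for a sampled token
            cache.tokens.append(token)  # each step extends the KV cache
            output.append(token)
        return output


def serve(prompt, prefill_pool, decode_pool, max_new_tokens=4):
    """Route one request through a prefill worker, then a decode worker."""
    prefill_worker = next(prefill_pool)
    decode_worker = next(decode_pool)
    cache, first_token = prefill_worker.prefill(prompt)
    # Assumption: in a real deployment the KV cache would be transferred
    # between workers here; this sketch simply passes the object along.
    tokens = decode_worker.decode(cache, first_token, max_new_tokens)
    return {
        "prefill": prefill_worker.name,
        "decode": decode_worker.name,
        "tokens": tokens,
        "cache_len": len(cache.tokens),
    }


if __name__ == "__main__":
    # The pools are sized independently: one prefill worker, two decode workers.
    prefill_pool = cycle([PrefillWorker("prefill-0")])
    decode_pool = cycle([DecodeWorker("decode-0"), DecodeWorker("decode-1")])
    print(serve("What is NVIDIA Dynamo?", prefill_pool, decode_pool))
```

Because the two pools are separate iterators, adding a decode worker does not touch the prefill side, which is the independent-scaling property described above.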

## 0.1.4 KV Cache Management

The Key-Value (KV) cache is critical for LLM performance, storing computed attention states to avoid recomputation. Dynamo introduces sophisticated KV cache management that spans multiple memory tiers.

### KV Block Manager (KVBM) Architecture


The KVBM serves as a critical infrastructure component for scaling LLM inference workloads efficiently. By cleanly separating runtime logic from memory management, and by enabling distributed block sharing, KVBM lays the foundation for high-throughput, multi-node, and memory-disaggregated AI systems.

<center><img src="images/dynamo/kvbm-arch.png" width="700" alt="KV Block Manager Architecture"></center>

The KVBM has three primary logical layers. The top layer, the LLM inference runtimes (TensorRT-LLM, vLLM, and SGLang), integrates with the Dynamo KVBM module through dedicated connector modules. These connectors act as translation layers, mapping runtime-specific operations and events into the KVBM’s block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and memory tiering.

The middle layer, the KVBM layer, encapsulates the core logic of the KV block manager and serves as the runtime substrate for managing block memory. The KVBM adapter layer normalizes the representations and data layout of incoming requests across runtimes and forwards them to the core memory manager. The KVBM and core modules implement the required internal functionality, such as table lookups, memory allocation, block layout management, lifecycle and state transitions, and policy-based block reuse or eviction. The KVBM layer also exposes abstractions that let external components override or augment its behavior.

The last layer, the NIXL layer, provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, and dynamic block registration and metadata exchange, and it provides a plugin interface for storage backends.

NIXL integrates with several backends:

- Block memory (for example, GPU HBM, host DRAM, remote DRAM, or local SSD when exposed as a block device)
- Local file system (for example, POSIX)
- Remote file system (for example, NFS)
- Object stores (for example, S3-compatible)
- Cloud storage (for example, blob storage APIs)

More reading here: https://docs.nvidia.com/dynamo/latest/architecture/kvbm_architecture.html
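
As a rough illustration of memory tiering, the sketch below models a small "GPU" tier backed by a larger "host" tier. Least-recently-used blocks are offloaded from the GPU tier instead of being discarded, and a hit in the host tier promotes the block back. `TieredBlockStore` and its methods are hypothetical names, and the LRU policy is an assumption chosen for simplicity; none of this is the KVBM interface or its actual eviction policy.

```python
from collections import OrderedDict


class TieredBlockStore:
    """Toy two-tier KV block store: a small 'gpu' tier backed by a 'host' tier."""

    def __init__(self, gpu_capacity, host_capacity):
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity
        self.gpu = OrderedDict()   # block_id -> data, least recently used first
        self.host = OrderedDict()  # same ordering convention

    def put(self, block_id, data):
        """Place a block in the GPU tier, offloading LRU blocks if it is full."""
        self.host.pop(block_id, None)  # keep each block in exactly one tier
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            victim_id, victim = self.gpu.popitem(last=False)
            self._offload(victim_id, victim)

    def _offload(self, block_id, data):
        """Move a block to the host tier, dropping the oldest host block if full."""
        self.host[block_id] = data
        self.host.move_to_end(block_id)
        while len(self.host) > self.host_capacity:
            self.host.popitem(last=False)  # lost: would need recomputation

    def get(self, block_id):
        """Return (data, tier); tier is 'gpu', 'host', or None on a miss."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id], "gpu"
        if block_id in self.host:
            data = self.host.pop(block_id)
            self.put(block_id, data)  # promote back into the GPU tier
            return data, "host"
        return None, None


if __name__ == "__main__":
    store = TieredBlockStore(gpu_capacity=2, host_capacity=2)
    for block_id in ("a", "b", "c"):
        store.put(block_id, block_id.encode())
    print(store.get("a"))  # served from the host tier, then promoted
    print(store.get("a"))  # now served from the GPU tier
```

A block found in the host tier costs a copy instead of a recomputation, which is the trade-off that makes offloading worthwhile.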

## 0.1.5 LLM-Aware Request Routing

Dynamo's request router is designed specifically for LLM workloads, understanding the unique characteristics of language model inference to optimize routing decisions.

### Traditional vs. LLM-Aware Routing

**Traditional Load Balancing:**
- Round-robin or random distribution
- No awareness of model state or cache content
- Higher cache misses and recomputation

**Dynamo's LLM-Aware Routing:**
- Routes requests to nodes with relevant KV cache data
- Minimizes cache misses and recomputation
- Optimizes for conversation continuity and prefix matching

### Routing Strategies:

1. **Prefix-Aware Routing**: Routes requests with common prefixes to the same node
2. **Cache-Aware Routing**: Considers existing KV cache when making routing decisions
3. **Load-Balanced Routing**: Balances load while maintaining cache efficiency
4. **Conversation Affinity**: Keeps multi-turn conversations on the same node when beneficial
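
The strategies above can be combined into a single scoring function. The following is a minimal sketch, assuming that each worker advertises the hashes of the KV blocks it holds and its number of active requests; the router then prefers the worker with the longest cached prefix, minus a load penalty. The names (`Worker`, `block_hashes`, `pick_worker`), the block size of 4, and the `load_weight` value are all hypothetical and are not taken from Dynamo's router.

```python
from dataclasses import dataclass, field


def block_hashes(tokens, block_size=4):
    """Hash each full block of tokens chained with everything before it."""
    hashes = []
    prev = 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        prev = hash((prev, tuple(tokens[start:start + block_size])))
        hashes.append(prev)
    return hashes


@dataclass
class Worker:
    name: str
    cached_blocks: set = field(default_factory=set)
    active_requests: int = 0


def prefix_overlap(worker, hashes):
    """Count the leading request blocks that the worker already has cached."""
    count = 0
    for h in hashes:
        if h not in worker.cached_blocks:
            break
        count += 1
    return count


def pick_worker(workers, tokens, load_weight=0.5):
    """Choose the worker with the best cache overlap after a load penalty."""
    hashes = block_hashes(tokens)

    def score(worker):
        return prefix_overlap(worker, hashes) - load_weight * worker.active_requests

    best = max(workers, key=score)  # ties go to the first worker in the list
    best.cached_blocks.update(hashes)  # the chosen worker will now hold these blocks
    best.active_requests += 1
    return best


if __name__ == "__main__":
    workers = [Worker("worker-a"), Worker("worker-b")]
    system_prompt = list(range(8))
    print(pick_worker(workers, system_prompt).name)
    print(pick_worker(workers, system_prompt + [100, 101, 102, 103]).name)
    print(pick_worker(workers, list(range(50, 58))).name)
```

Chaining each block hash with the previous one means two requests only match on a block if everything before that block also matches, which is what makes the overlap a true prefix match. The load penalty keeps a popular prefix from pinning all traffic to one worker.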

### Performance Impact:

- **Reduced latency** through cache hit optimization
- **Lower compute requirements** due to reduced recomputation
- **Better resource utilization** across the cluster
- **Improved user experience** with faster response times


## 0.1.6 Dynamic GPU Scheduling and Planning

The Dynamo Planner is responsible for intelligent workload distribution and resource allocation across the cluster. It provides both load-based and SLA-based planning strategies. The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.

Currently, the planner can scale the number of vLLM workers up and down based on the KV cache load and prefill queue size. Key features include:
- Load-based scaling that monitors KV cache utilization and prefill queue size to make scaling decisions
- SLA-based scaling that uses predictive modeling and performance interpolation to proactively meet TTFT and inter-token latency (ITL) targets
- Multi-backend support for both local (Circus) and Kubernetes environments
- Graceful scaling that ensures no requests are dropped during scale-down operations
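
A load-based scaling decision of the kind described above might look like the following sketch. It assumes that the prefill queue size drives the prefill worker count and that KV cache utilization drives the decode worker count; that mapping, the threshold values, and the names `Metrics`, `Replicas`, and `plan` are illustrative assumptions rather than the planner's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0 to 1.0
    prefill_queue_size: int      # requests waiting for prefill


@dataclass
class Replicas:
    prefill: int
    decode: int


def plan(metrics, current, *, kv_high=0.9, kv_low=0.5,
         queue_high=10, queue_low=2, min_replicas=1, max_replicas=8):
    """Return the desired replica counts for one adjustment interval."""
    prefill, decode = current.prefill, current.decode

    # Assumption: a long prefill queue means prefill workers are the bottleneck.
    if metrics.prefill_queue_size > queue_high:
        prefill += 1
    elif metrics.prefill_queue_size < queue_low:
        prefill -= 1

    # Assumption: a nearly full KV cache means decode workers are the bottleneck.
    if metrics.kv_cache_utilization > kv_high:
        decode += 1
    elif metrics.kv_cache_utilization < kv_low:
        decode -= 1

    def clamp(n):
        return max(min_replicas, min(max_replicas, n))

    return Replicas(prefill=clamp(prefill), decode=clamp(decode))


if __name__ == "__main__":
    busy = Metrics(kv_cache_utilization=0.95, prefill_queue_size=12)
    print(plan(busy, Replicas(prefill=1, decode=1)))
```

The gap between the high and low thresholds acts as a dead band, so small fluctuations in load do not cause workers to be added and removed on every interval.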


## 0.1.7 NVIDIA Inference Transfer Library (NIXL)

Large-scale distributed inference relies on model parallelism techniques such as tensor, pipeline, and expert parallelism, which in turn depend on low-latency, high-throughput internode and intranode communication over GPUDirect RDMA. These systems also require rapid KV cache transfer between prefill and decode GPU workers in disaggregated serving environments.

Additionally, they need accelerated communication libraries that are both hardware- and network-agnostic, that can move data efficiently across GPUs and across memory and storage tiers (CPU memory as well as block, file, and object storage), and that are compatible with a range of networking protocols.


<center><img src="images/dynamo/nvidia-inference-transfer-library.png" width="700" alt="NVIDIA Inference Transfer Library"></center>


NVIDIA Inference Transfer Library (NIXL) is a high-throughput, low-latency point-to-point communication library that provides a consistent data movement API to move data rapidly and asynchronously across different tiers of memory and storage using the same semantics. It is specifically optimized for inference data movement, supporting nonblocking and noncontiguous data transfers between various types of memory and storage. 

NIXL supports heterogeneous data paths as well as different types of memory and local SSDs, plus networked storage from key NVIDIA storage partners.  

NIXL enables NVIDIA Dynamo to interface with other communications libraries such as GPUDirect Storage, UCX, and S3 with a common API regardless of whether the transfer is over NVLink (C2C or NVSwitch), InfiniBand, RoCE, or Ethernet. NIXL, in conjunction with the NVIDIA Dynamo policy engine, automatically chooses the best backend connection and abstracts away the differences between multiple types of memory and storage. This is accomplished through generalized “memory sections” which can be HBM, DRAM, local SSD, or networked storage (Block, Object, or File).
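
The idea of one transfer call that works the same way for any kind of memory section can be sketched in a few lines of Python. This is not the NIXL API: `MemorySection`, `transfer`, and the region tuple format are hypothetical, the "sections" are ordinary byte buffers with a label, and `asyncio` stands in for a nonblocking transfer engine.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class MemorySection:
    kind: str          # label only in this sketch, e.g. "HBM", "DRAM", "SSD"
    buffer: bytearray  # stand-in for the underlying memory or storage


async def transfer(src, dst, regions):
    """Copy noncontiguous regions given as (src_offset, dst_offset, length)."""
    copied = 0
    for src_off, dst_off, length in regions:
        dst.buffer[dst_off:dst_off + length] = src.buffer[src_off:src_off + length]
        copied += length
        await asyncio.sleep(0)  # yield so that other tasks can make progress
    return copied


async def main():
    hbm = MemorySection("HBM", bytearray(b"AAAABBBBCCCCDDDD"))
    dram = MemorySection("DRAM", bytearray(16))
    # Start the transfer without blocking; the caller could do other work here.
    task = asyncio.create_task(transfer(hbm, dram, [(0, 0, 4), (8, 4, 4)]))
    copied = await task
    return copied, bytes(dram.buffer[:8])


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The caller describes only what to move and where. The `kind` field never appears in the calling code, which mirrors the point that differences between memory and storage types are hidden behind the common interface.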


---

## Structure of the Course

The objective of this section is to give you a high-level understanding of the course structure. This course is self-contained and provides hands-on experience with NVIDIA Dynamo deployed on Azure Kubernetes Service (AKS). 

### Lab Environment
This course uses a pre-configured **Azure Kubernetes Service (AKS)** cluster with:
- **CPU Node Pool**: Standard_D16ds_v4 (16 vCPUs, 64 GB RAM) for control plane operations
- **GPU Node Pool**: Standard_NC40ads_H100_v5 with 2x H100 GPUs per node for inference workloads
- **NVIDIA GPU Operator** pre-installed for GPU resource management
- **Nginx Ingress Controller** for external access
- **Dynamic scaling**: GPU nodes scale automatically based on workload demand

### Prerequisites
- Basic understanding of Kubernetes concepts (pods, services, deployments)
- Familiarity with container technologies
- Basic knowledge of machine learning inference concepts
- **Hugging Face token** (for accessing models)

### Table of Contents

0. [**Course Overview**](Dynamo_00_Course_Overview.ipynb) (*this* notebook)<br>
    - Introduction to NVIDIA Dynamo
    - Course structure and objectives
    - Lab environment setup
    - Prerequisites and requirements
<br><br>

1. [**Architecture & Lab Overview**](Dynamo_01_Architecture_and_Lab_Overview.ipynb)<br>
    - **Dynamo Architecture**
      - Core components and design principles
      - Disaggregated serving concepts
      - KV Cache Management (KVBM)
      - Request routing and scheduling
    - **Lab Infrastructure**
      - AKS cluster configuration
      - GPU resource management
      - Monitoring setup
<br><br>

2. [**vLLM Aggregated Deployment**](Dynamo_02_vLLM_Agg_Deployment.ipynb)<br>
    - **Introduction & Setup**
      - Environment configuration
      - Deployment prerequisites
    - **Standard Deployment**
      - Basic configuration
      - Service deployment
      - Testing and validation
    - **Router Deployment**
      - Cache-aware routing
      - Performance optimization
    - **Performance Analysis**
      - Benchmarking setup
      - Results visualization
<br><br>

3. [**vLLM Disaggregated Deployment**](Dynamo_03_vLLM_disAgg_Deployment.ipynb)<br>
    - **Introduction & Setup**
      - Disaggregated architecture
      - Environment configuration
    - **Standard Deployment**
      - Deployment configuration
      - Service deployment
      - Performance monitoring
    - **Router Deployment**
      - Router architecture
      - Configuration & testing
    - **Performance Analysis**
      - Benchmark execution
      - Results comparison
<br><br>

### Expected Learning Outcomes

By completing this course, you will:
- Master NVIDIA Dynamo architecture concepts and implementation patterns
- Deploy and manage production-ready AI inference infrastructure on Kubernetes
- Optimize LLM serving performance using advanced techniques like disaggregated serving
- Implement comprehensive monitoring and observability for GPU workloads
- Benchmark and analyze different deployment strategies for your specific use cases




---
