<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a> 
 
# 3.0 vLLM Disaggregated Deployment with Dynamo

In this notebook, you'll learn how to deploy Large Language Models (LLMs) using NVIDIA Dynamo with the vLLM disaggregated backend. We'll explore disaggregated deployment patterns, in which prefill and decode operations run on separate, specialized workers to improve performance.

## Learning Objectives

By the end of this notebook, you will be able to:
- Deploy vLLM models using Dynamo's disaggregated configuration
- Understand prefill/decode worker separation for improved performance
- Implement disaggregated routing with specialized workers
- Monitor deployment status and troubleshoot issues
- Conduct performance benchmarking and analysis
- Compare disaggregated vs aggregated deployment strategies

---

## Table of Contents

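**[3.1 Introduction & Setup](#31-introduction--setup)**
- [3.1.1 Deployment Configurations](#311-deployment-configurations)

**[3.2 Environment Setup](#32-environment-setup)**
- [3.2.1 Prerequisites Check](#321-prerequisites-check)
- [3.2.2 HF Token Check](#322-hf-token-check)
- [3.2.3 Network Configuration](#323-network-configuration)
- [3.2.4 Delete Existing Deployments](#324-delete-existing-deployments)

**[3.3 Standard Disaggregated Deployment](#33-standard-disaggregated-deployment)**
- [3.3.1 Deploy Persistent Volume Claim](#331-deploy-persistent-volume-claim)
- [3.3.2 Deploying vLLM disaggregated](#332-deploying-vllm-disaggregated)
- [3.3.3 Setting up External Access](#333-setting-up-external-access)
- [3.3.4 Testing the Standard vLLM Deployment](#334-testing-the-standard-vllm-deployment)

**[3.4 Performance Analysis](#34-performance-analysis)**
- [3.4.1 Generating Test Dataset](#341-generating-test-dataset)
- [3.4.2 Benchmarking Standard vLLM Deployment](#342-benchmarking-standard-vllm-deployment)
- [3.4.3 Cleanup disaggregated deployments](#343-cleanup-disaggregated-deployments)

**[3.5 Disaggregated Router Deployment](#35-disaggregated-router-deployment)**
- [3.5.1 Router Architecture & Benefits](#351-router-architecture--benefits)
- [3.5.2 Router Storage Configuration](#352-router-storage-configuration)
- [3.5.3 Deploying the Router Service](#353-deploying-the-router-service)
- [3.5.4 Router External Access Configuration](#354-router-external-access-configuration)
- [3.5.5 Router Testing & Validation](#355-router-testing--validation)

**[3.6 Benchmark Execution](#36-benchmark-execution)**
- [3.6.1 Benchmark vLLM Disaggregated Router](#361-benchmark-vllm-disaggregated-router)
- [3.6.2 Results & Visualization](#362-results--visualization)

**[Key Takeaways: Disaggregated Deployment](#key-takeaways-disaggregated-deployment)**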
 
--- 
 
## 3.1 Introduction & Setup
 
### 3.1.1 Deployment Configurations 
 
In this notebook, we'll implement and compare two deployment strategies: 
 
#### 3.1.1.1 **Standard Disaggregated Deployment** (`vllm-disagg`) 
- **Use Case**: Development and moderate-scale inference 
- **Architecture**: One frontend in front of separate prefill and decode workers 
- **Benefits**: Each phase runs on workers specialized for it, with a simple deployment to manage 
 
#### 3.1.1.2 **KV Cache-Aware Router Deployment** (`vllm-disagg-router`) 
- **Use Case**: Production environments requiring intelligent routing 
- **Architecture**: Multiple workers with cache-aware load balancing 
- **Benefits**: Improved cache efficiency and reduced latency 

--- 

[Open Prometheus!](/prom/graph) 
 
[Open Grafana!](/grafana/)

 
## 3.2 Environment Setup 
 
### 3.2.1 Prerequisites Check 
 
Before we begin, let's ensure our environment is properly configured:

In [None]:
# Import utility functions
from huggingface_hub import HfApi, login
from IPython.display import Markdown, display
from utils.aiperf import create_benchmark_comparison_visualization
from utils.hf_api import check_model_access, check_token_validity, get_hf_token
from utils.k8s_helpers import wait_for_pods_ready
from utils.paths import CONFIGS_DIR, DEPLOYMENT_DIR, ensure_dirs_exist
from utils.util import image_model_replacement

# Configure deployment parameters
NAMESPACE = "dynamo-cloud"
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"
VLLM_IMAGE = "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0"

# Ensure required directories exist
ensure_dirs_exist()

### 3.2.2 HF Token Check 
We use a Hugging Face token to download the tokenizer and the model.

To create a token, see: https://huggingface.co/docs/hub/en/security-tokens

In [None]:
api = HfApi()
token = get_hf_token()

# Step 1: Validate token
if not check_token_validity(api, token):
    token = input("Please re-enter a valid Hugging Face token: ").strip()
    if not check_token_validity(api, token):
        # Stop here instead of continuing with an invalid token
        raise SystemExit("Exiting: Invalid token provided twice.")

In [None]:
# Step 2: Check model access
check_model_access(api, token, MODEL_NAME)

# Step 3: Hugging Face login
login(token=token)
print("Logged in to Hugging Face Hub via `login()`.")

# Step 4: Create K8s secret
!kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=$token -n $NAMESPACE

### 3.2.3 Network Configuration 
 
Let's get the load balancer IP address that we'll use to access our deployed services. This IP will be used for external access to our vLLM deployments. 

In [None]:
# Get the load balancer IP
load_balancer_ip_result = !kubectl get svc ingress-nginx-controller -n nginx-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
load_balancer_ip = load_balancer_ip_result[0]
print("\nüìã Configuration Summary:")
print(f"   Namespace: {NAMESPACE}")
print(f"   Load Balancer IP: {load_balancer_ip}")
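
The cell above indexes the `kubectl` output directly, which can fail if the load balancer has not been assigned an external IP yet. As a minimal sketch, a small guard could validate the value before it is used in URLs. The helper name `require_ip` is hypothetical and the address shown is a documentation-only example, not a value from this lab:

```python
import ipaddress


def require_ip(kubectl_lines):
    """Return the first output line as an IP string, or raise if missing/invalid."""
    if not kubectl_lines or not kubectl_lines[0].strip():
        raise RuntimeError("Load balancer has no external IP yet; retry shortly.")
    # kubectl may echo the quotes used around the jsonpath expression
    candidate = kubectl_lines[0].strip().strip("'")
    ipaddress.ip_address(candidate)  # raises ValueError if this is not an IP
    return candidate


# Hypothetical example value (reserved documentation address)
print(require_ip(["203.0.113.10"]))
```

In the notebook, this could be called as `load_balancer_ip = require_ip(load_balancer_ip_result)`.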

### 3.2.4 Delete Existing Deployments
 
Before we proceed with the deployments, let's clean up any existing aggregated deployments:

In [None]:
!kubectl delete -f {CONFIGS_DIR}/vllm-agg.yaml -n $NAMESPACE

In [None]:
!kubectl delete -f {CONFIGS_DIR}/vllm-agg-router.yaml -n $NAMESPACE

---

## 3.3 Standard Disaggregated Deployment

### 3.3.1 Deploy Persistent Volume Claim 
 
Let's create the PVC that will store our model files: 

In [None]:
!cat {PVC_DIR}/vllm-disagg.yaml

In [None]:
# Apply the PVC configuration
!kubectl apply -f {PVC_DIR}/vllm-disagg.yaml -n $NAMESPACE
!kubectl get pvc vllm-disagg-pvc -n $NAMESPACE

### 3.3.2 Deploying vLLM disaggregated

Now let's deploy our first configuration - the standard vLLM disaggregated deployment. This setup places separate prefill and decode workers behind a single frontend and suits development and moderate-scale inference workloads. 

In disaggregated deployment, we separate the inference pipeline into specialized workers: 
 
#### **VllmPrefillWorker** (Input Processing) 
- **Purpose**: Handles input token processing and attention computation 
- **Workload**: Compute-intensive operations for prompt encoding 
- **Resource Usage**: High GPU compute, moderate memory 
- **Command**: `--is-prefill-worker` flag enables prefill specialization 
 
#### **VllmDecodeWorker** (Token Generation) 
- **Purpose**: Handles output token generation and autoregressive decoding 
- **Workload**: Memory-intensive operations with KV cache management 
- **Resource Usage**: Moderate GPU compute, high memory for cache 
- **Command**: Standard worker mode optimized for decode operations 
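
To make the split concrete, here is a toy sketch of the two phases. It is not how vLLM or Dynamo implement them: the `KVCache` class and both functions are hypothetical stand-ins, and the "model" only emits placeholder tokens. It shows the handoff: prefill processes the whole prompt once and produces a cache, and decode extends that cache one token at a time.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for the per-request key/value cache produced by prefill."""
    tokens: list = field(default_factory=list)


def prefill(prompt_tokens):
    # Prefill handles the full prompt in a single pass (compute-heavy)
    # and returns the cache that the decode phase will extend.
    return KVCache(tokens=list(prompt_tokens))


def decode(cache, max_new_tokens):
    # Decode produces one token per step, growing the cache each time
    # (memory-heavy). Placeholder tokens stand in for model output.
    generated = []
    for step in range(max_new_tokens):
        next_token = f"<tok{step}>"
        cache.tokens.append(next_token)
        generated.append(next_token)
    return generated


cache = prefill(["What", "is", "the", "capital", "of", "Germany", "?"])
output = decode(cache, max_new_tokens=3)
print(output)
print(len(cache.tokens))  # 7 prompt tokens + 3 generated tokens
```

In a disaggregated deployment, the two functions would run on different workers, so the cache has to be transferred between them.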

 
#### High-level flow: 
<center><img src="images/dynamo/flows/vllm-disagg.png" width="500px"></center> 
 
 
#### Deployment Architecture: 
<center><img src="images/dynamo/flows/vllm-disagg-aks.png" width="1000px"></center> 

In [None]:
# Create configured YAML for vLLM deployment

# Get source and destination paths
image_model_replacement(
    source_yaml_path=f"{DEPLOYMENT_DIR}/vllm-disagg.yaml",
    destination_directory=CONFIGS_DIR,
    vllm_image=VLLM_IMAGE,
    model_name=MODEL_NAME,
)

In [None]:
# Show the generated configuration
print("\nüìÑ Generated Configuration:")
print("=" * 60)
!cat {CONFIGS_DIR}/'vllm-disagg.yaml'

In [None]:
# Deploy the vLLM configuration
!kubectl apply -f {CONFIGS_DIR}/vllm-disagg.yaml -n $NAMESPACE

In [None]:
# Show deployment status
!kubectl get dynamographdeployment vllm-disagg -n $NAMESPACE

Sample output of `kubectl get pods` once all pods are running (one frontend, one decode worker, and three prefill workers): 
```
NAME                                             READY   STATUS    RESTARTS   AGE
vllm-disagg-frontend-58d854b979-4hdvk            1/1     Running   0          2m9s
vllm-disagg-vllmdecodeworker-7bdb9bbb59-w7dq8    1/1     Running   0          2m8s
vllm-disagg-vllmprefillworker-59589c4c59-8pmzr   1/1     Running   0          2m8s
vllm-disagg-vllmprefillworker-59589c4c59-bg5c9   1/1     Running   0          2m8s
vllm-disagg-vllmprefillworker-59589c4c59-qgmc4   1/1     Running   0          2m8s
```
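
As an illustration of how to read this listing programmatically, the sketch below checks that every row shows a matching `READY` count and a `Running` status. The function name `all_pods_ready` is hypothetical and is not the implementation of `wait_for_pods_ready`. The second sample is made up to show a pod that is still pending.

```python
def all_pods_ready(kubectl_output):
    """Return True if every pod row shows READY n/n and STATUS Running."""
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    rows = [ln.split() for ln in lines[1:]]  # skip the header row
    if not rows:
        return False  # no pods listed yet
    for name, ready, status, *rest in rows:
        have, want = ready.split("/")
        if have != want or status != "Running":
            return False
    return True


ready_sample = """
NAME                                             READY   STATUS    RESTARTS   AGE
vllm-disagg-frontend-58d854b979-4hdvk            1/1     Running   0          2m9s
vllm-disagg-vllmdecodeworker-7bdb9bbb59-w7dq8    1/1     Running   0          2m8s
"""

# Hypothetical listing in which one pod has not started yet
pending_sample = """
NAME                                             READY   STATUS    RESTARTS   AGE
vllm-disagg-frontend-58d854b979-4hdvk            1/1     Running   0          2m9s
vllm-disagg-vllmprefillworker-59589c4c59-8pmzr   0/1     Pending   0          2m8s
"""

print(all_pods_ready(ready_sample))
print(all_pods_ready(pending_sample))
```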

In [None]:
!kubectl get pods -l nvidia.com/dynamo-namespace=vllm-disagg -n $NAMESPACE

In [None]:
vllm_disagg_vllmprefillworker_name = !kubectl get pods -n $NAMESPACE | grep vllm-disagg-vllmprefillworker | cut -d' ' -f1

!kubectl describe pods {vllm_disagg_vllmprefillworker_name[0]} -n $NAMESPACE

In [None]:
# Wait for vLLM pods to be ready
wait_for_pods_ready("vllm-disagg", NAMESPACE)

In [None]:
!kubectl get pods -l nvidia.com/dynamo-namespace=vllm-disagg -n $NAMESPACE

In [None]:
!kubectl expose deployment vllm-disagg-frontend --type=ClusterIP --port=8000 --target-port=8000 -n $NAMESPACE
!kubectl get svc -n $NAMESPACE

### 3.3.3 Setting up External Access 
 
Now that our vLLM service is running, we need to configure ingress to allow external access. The ingress controller will route HTTP requests to our vLLM frontend service. 

In [None]:
!cat {INGRESS_DIR}/vllm-disagg-ingress.yaml

In [None]:
print("Deploying Ingress for External Access...")

# Deploy the ingress configuration
!kubectl apply -f {INGRESS_DIR}/vllm-disagg-ingress.yaml -n $NAMESPACE

### 3.3.4 Testing the Standard vLLM Deployment 
 
Now that our vLLM service is deployed and accessible, let's test it to ensure everything is working correctly. We'll start with basic connectivity tests and then move on to inference testing. 
 
#### 3.3.4.1 Basic Connectivity Tests 
 
First, let's verify that our service is accessible and responding to requests: 

In [None]:
!curl "http://{load_balancer_ip}/vllm-disagg/v1/models" | jq

#### 3.3.4.2 Chat Completion Test 
 
Now let's test the actual inference capability by sending a chat completion request to our vLLM service: 

In [None]:
import requests

# Configure the chat completion request
endpoint = f"http://{load_balancer_ip}/vllm-disagg/v1/chat/completions"
headers = {"accept": "application/json", "Content-Type": "application/json"}

data = {
    "messages": [
        {
            "content": "You are a polite and respectful chatbot helping people plan a vacation.",
            "role": "system",
        },
        {"content": "What is the capital of Germany?", "role": "user"},
    ],
    "model": MODEL_NAME,
    "temperature": 0.5,
    "max_tokens": 150,
    "top_p": 1,
    "stream": False,
}

llm_response = requests.post(endpoint, headers=headers, json=data, timeout=60)

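if llm_response.status_code == 200:
    response_data = llm_response.json()
    print(response_data)
else:
    # Surface the failure instead of finishing silently
    print(f"Request failed with status {llm_response.status_code}: {llm_response.text}")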
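
The raw JSON can be hard to scan. Assuming the frontend returns an OpenAI-style chat completion body (an assumption based on the `/v1/chat/completions` path, not something this notebook confirms), a small helper could pull out just the assistant's text. The name `extract_reply` and the example body are hypothetical:

```python
def extract_reply(response_data):
    """Return the assistant text from an OpenAI-style chat completion body."""
    choices = response_data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hypothetical response body, shaped like an OpenAI chat completion
example = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Berlin."}}
    ]
}
print(extract_reply(example))
```

In the notebook, this could be called as `extract_reply(response_data)` after a successful request.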

#### 3.3.4.3 Metrics Monitoring 
 
Let's check the metrics endpoint to see performance statistics for our vLLM service:

In [None]:
!curl "http://{load_balancer_ip}/vllm-disagg/metrics"
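
The endpoint is scraped by Prometheus, so its output is expected to be in the Prometheus text exposition format. As a rough sketch, the samples can be loaded into a dictionary for quick inspection. The metric names below are hypothetical placeholders, not the names Dynamo exports, and the parser ignores edge cases such as spaces before the label braces.

```python
def parse_metrics(text):
    """Parse Prometheus text-format samples into {name_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        if "}" in line:
            # Keep the label set as part of the key: name{label="x"}
            end = line.rindex("}") + 1
            key, rest = line[:end], line[end:].split()
        else:
            key, *rest = line.split()
        samples[key] = float(rest[0])  # rest[1], if present, is a timestamp
    return samples


# Hypothetical metric names, for illustration only
sample_text = """\
# HELP example_requests_total Total requests served.
# TYPE example_requests_total counter
example_requests_total{model="demo"} 42
example_queue_depth 3
"""

metrics = parse_metrics(sample_text)
print(metrics)
```

To use it on live data, fetch the endpoint with `requests.get(...).text` and pass the result to `parse_metrics`.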

---

## 3.4 Performance Analysis
 
### 3.4.1 Generating Test Dataset 
 
The benchmarks replay a test dataset that simulates realistic LLM inference patterns. 

```
We will not generate a new dataset here; we reuse the one created in Dynamo_02_vLLM_Agg_Deployment.ipynb. If you have not created it there, uncomment the cell below. 
```

In [None]:
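# # Generate test dataset for benchmarking
# # Note: generate_mooncake_dataset is not imported in this notebook;
# # import it first (as done in Dynamo_02_vLLM_Agg_Deployment.ipynb) before uncommenting.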
# success = generate_mooncake_dataset(
#     time_duration=60,
#     request_rate_min=3,
#     request_rate_max=5,
#     request_rate_period=30,
#     isl1=250,
#     osl1=50,
#     isl2=500,
#     osl2=100,
#     output_file="results/dataset.jsonl"
# )

# if not success:
#     print("Failed to generate test dataset")
#     raise Exception("Dataset generation failed")

### 3.4.2 Benchmarking Standard vLLM Deployment 
 
Let's start with benchmarking our standard deployment to establish baseline performance. 

<p style="color:red">Copy the command printed below and run it in the terminal:</p>

In [None]:
command = f'aiperf profile -m "{MODEL_NAME}" --endpoint-type "chat" --url "http://{load_balancer_ip}/vllm-disagg" --input-file "/dli/task/results/dataset.jsonl" --custom-dataset-type "mooncake_trace" --fixed-schedule --artifact-dir "/dli/task/results/standard-benchmark-disagg" --streaming'


# Display the command for reference
print(
    "Please make sure you have run the following command in terminal before proceeding:\n"
)
# Display as shell block
display(Markdown(f"```bash\n{command}\n```"))

print("\n")

# Ask for confirmation
response = input("Have you run this command? (yes/no): ").strip().lower()

# Conditional execution
if response == "yes":
    print("Great! You can proceed.")
else:
    print("Please run the command in terminal first before proceeding.")
    # Optionally, stop execution
    import sys

    sys.exit("Execution stopped. Run the command first.")

#### Check the Grafana Dashboard
[Open Grafana!](/grafana/) -> Dynamo Dashboard

<center><img src="images/dynamo-grafana-dashboard.png" width="700px"></center>

### 3.4.3 Cleanup disaggregated deployments

In [None]:
!kubectl delete -f {CONFIGS_DIR}/vllm-disagg.yaml -n $NAMESPACE

---

## 3.5 Disaggregated Router Deployment

### 3.5.1 Router Architecture & Benefits

Now let's deploy our second configuration - the disaggregated router deployment. This advanced setup provides intelligent routing with multiple specialized workers for enhanced performance and scalability.

Disaggregated routing is an optimization technique that:
- **Separates prefill and decode operations** across specialized worker instances
- **Routes requests intelligently** to appropriate worker types based on operation
- **Optimizes resource allocation** by matching workload characteristics to worker capabilities
- **Improves throughput** through parallel processing of different operation types 

### 3.5.2 Router Storage Configuration 
 
The router deployment requires its own storage volume for model files. Let's create a dedicated PVC for the router configuration: 

In [None]:
!cat {PVC_DIR}/vllm-disagg-router.yaml

In [None]:
# Apply the router PVC configuration
!kubectl apply -f {PVC_DIR}/vllm-disagg-router.yaml -n $NAMESPACE
!kubectl get pvc -n $NAMESPACE

### 3.5.3 Deploying the Router Service 
 
Let's visualize how KV cache-aware routing works compared to standard routing: 

 
#### High-level flow: 
<center><img src="images/dynamo/flows/vllm-disagg-router.png" width="500px"></center> 
 
 
#### Deployment Architecture: 
<center><img src="images/dynamo/flows/vllm-disagg-router-aks.png" width="1000px"></center> 
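
As a toy illustration of the idea behind KV cache-aware routing, the sketch below scores each worker by how much of the request's prompt prefix it already has cached, minus a penalty for its current load. This is not Dynamo's actual routing algorithm: the function names, the scoring formula, and the worker state are all hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading items that two sequences have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def pick_worker(request_blocks, cached_blocks, active_requests, load_weight=1.0):
    """Pick the worker with the best (cache overlap - weighted load) score."""
    best_worker, best_score = None, float("-inf")
    for worker in sorted(cached_blocks):  # sorted for deterministic tie-breaks
        overlap = shared_prefix_len(request_blocks, cached_blocks[worker])
        score = overlap - load_weight * active_requests.get(worker, 0)
        if score > best_score:
            best_worker, best_score = worker, score
    return best_worker


# Hypothetical state: block IDs stand in for hashed chunks of prompt tokens
cached_blocks = {"worker-a": [1, 2, 3, 4], "worker-b": [1, 2], "worker-c": []}
active_requests = {"worker-a": 1, "worker-b": 0, "worker-c": 0}

# Long shared prefix outweighs worker-a's load
print(pick_worker([1, 2, 3, 4, 5], cached_blocks, active_requests))
# No cache overlap anywhere, so the choice falls back to load
print(pick_worker([9, 9, 9], cached_blocks, active_requests))
```

Raising `load_weight` shifts the balance toward idle workers, which is the trade-off a cache-aware router has to tune: reusing warm caches versus spreading load evenly.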

 
Now let's deploy the disaggregated router configuration: 

In [None]:
# Create configured YAML for router deployment
success = image_model_replacement(
    source_yaml_path=f"{DEPLOYMENT_DIR}/vllm-disagg-router.yaml",
    destination_directory=CONFIGS_DIR,
    vllm_image=VLLM_IMAGE,
    model_name=MODEL_NAME,
)

In [None]:
!cat {CONFIGS_DIR}/vllm-disagg-router.yaml

In [None]:
# Deploy the router configuration
!kubectl apply -f {CONFIGS_DIR}/vllm-disagg-router.yaml -n $NAMESPACE

In [None]:
# Show deployment status
print("\nDeployment Status:")
!kubectl get dynamographdeployment -n $NAMESPACE

print("\nRouter Pod Status:")
!kubectl get pods -l nvidia.com/dynamo-namespace=vllm-disagg-router -n $NAMESPACE

In [None]:
# Wait for router pods to be ready
wait_for_pods_ready("vllm-disagg-router", NAMESPACE)

In [None]:
!kubectl get pods -l nvidia.com/dynamo-namespace=vllm-disagg-router -n $NAMESPACE

In [None]:
!kubectl expose deployment vllm-disagg-router-frontend --type=ClusterIP --port=8000 --target-port=8000 -n $NAMESPACE
!kubectl get svc -n $NAMESPACE

### 3.5.4 Router External Access Configuration 
 
Now let's configure ingress for the router deployment to enable external access with the `/vllm-disagg-router` path: 

In [None]:
print("üåê Deploying Router Ingress...")

# Deploy the router ingress configuration
!kubectl apply -f {INGRESS_DIR}/vllm-disagg-router-ingress.yaml -n $NAMESPACE

### 3.5.5 Router Testing & Validation

Let's test our disaggregated router deployment to ensure it's working correctly: 

#### 3.5.5.1 Basic Connectivity Tests 
 
First, let's verify that our service is accessible and responding to requests: 

In [None]:
!curl "http://{load_balancer_ip}/vllm-disagg-router/v1/models" | jq

#### 3.5.5.2 Router Chat Completion Test 
 
Let's test the inference capability of our router deployment: 

In [None]:
endpoint = f"http://{load_balancer_ip}/vllm-disagg-router/v1/chat/completions"
router_response = requests.post(endpoint, headers=headers, json=data, timeout=60)
response_data = router_response.json()
print(response_data)

## 3.6 Benchmark Execution

### 3.6.1 Benchmark vLLM Disaggregated Router

Now let's benchmark the router deployment so that we can compare it with the standard disaggregated deployment. 

<p style="color:red">Copy the command printed below and run it in the terminal:</p>

In [None]:
from IPython.display import display

command = f'aiperf profile -m "{MODEL_NAME}" --endpoint-type "chat" --url "http://{load_balancer_ip}/vllm-disagg-router" --input-file "/dli/task/results/dataset.jsonl" --custom-dataset-type "mooncake_trace" --fixed-schedule --artifact-dir "/dli/task/results/router-benchmark-disagg" --streaming'


# Display the command for reference
print(
    "Please make sure you have run the following command in terminal before proceeding:\n"
)
# Display as shell block
display(Markdown(f"```bash\n{command}\n```"))

print("\n")

# Ask for confirmation
response = input("Have you run this command? (yes/no): ").strip().lower()

# Conditional execution
if response == "yes":
    print("Great! You can proceed.")
else:
    print("Please run the command in terminal first before proceeding.")
    # Optionally, stop execution
    import sys

    sys.exit("Execution stopped. Run the command first.")

#### Check the Grafana Dashboard
[Open Grafana!](/grafana/) -> Dynamo Dashboard

<center><img src="images/dynamo-grafana-dashboard.png" width="700px"></center>


### 3.6.2 Results & Visualization

Now let's analyze the benchmark results from both deployments and create comprehensive visualizations comparing their performance characteristics. 

In [None]:
# Create visualization for disaggregated deployments
comparison_fig = create_benchmark_comparison_visualization(
    standard_dir="results/standard-benchmark-disagg",
    router_dir="results/router-benchmark-disagg",
    deployment_labels=("Standard Disagg", "Router Disagg"),
    output_filename="disagg_benchmark_comparison.png",
)

---


<h2 style="color:green;">Congratulations!</h2>

## Key Takeaways: Disaggregated Deployment

**What is Disaggregated Deployment?**
- Separates the **prefill** (prompt processing) and **decode** (token generation) phases into independent components
- The router variant (`vllm-disagg-router`) adds a **KV Cache-aware router** that directs requests to backends with warm caches
- Enables **independent scaling** of each component based on workload characteristics

**Performance Benefits:**
- **Reduced Latency**: Router directs requests to backends with relevant cached data, minimizing redundant computations
- **Higher Throughput**: Optimized resource allocation for prefill vs. decode operations
- **Better Resource Utilization**: Scale prefill and decode independently based on actual demand

**When to Use:**
- Production environments requiring consistent low latency
- Applications with repeated or similar prompts (e.g., chatbots, RAG systems)
- High-traffic scenarios where intelligent routing provides measurable gains

---

### Feedback and Questions

We hope this lab provided valuable insights into deploying high-performance LLM inference systems. For questions, issues, or suggestions, please reach out to your instructor or consult the course materials.
