<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a> 
 
# 2.0 vLLM Aggregated Deployment with Dynamo 
 
In this notebook, you'll learn how to deploy Large Language Models (LLMs) using NVIDIA Dynamo with the vLLM backend. We'll explore two deployment patterns, benchmark both, and compare the results. 
 
## Learning Objectives 
 
By the end of this notebook, you will be able to: 
- Deploy vLLM models using Dynamo's aggregated configuration 
- Implement KV Cache-aware routing for improved performance 
- Monitor deployment status and troubleshoot issues 
- Conduct performance benchmarking and analysis 
- Compare different deployment strategies 
 
--- 
 
## Table of Contents 
 
**[2.1 Introduction](#21-introduction)** 

**[2.2 Environment Setup](#22-environment-setup)** 

**[2.3 vLLM Aggregated Deployment](#23-vllm-aggregated-deployment)** 

**[2.4 KV Cache-Aware Router Deployment](#24-part-2-kv-cache-aware-router-deployment)** 

**[2.5 Performance Benchmarking](#25-performance-benchmarking)** 

**[2.6 Clean Up](#26-clean-up)** 
 
--- 
 
## 2.1 Introduction 
 
### 2.1.1 What are Dynamo Inference Graphs? 
 
Dynamo inference graphs represent complete LLM serving pipelines with interconnected components: 
 
| Component | Description | Role | 
|-----------|-------------|------| 
| **Frontend** | HTTP request handler | Routes requests and manages responses | 
| **Prefill Workers** | Input token processors | Handle compute-intensive prefill operations | 
| **Decode Workers** | Output token generators | Manage memory-intensive decode operations | 
| **Load Balancers** | Request distributors | Optimize workload distribution | 
 
### 2.1.2 Deployment Configurations 
 
In this notebook, we'll implement and compare two deployment strategies: 
 
#### 2.1.2.1 **Standard Aggregated Deployment** (`vllm-agg`) 
- **Use Case**: Development and moderate-scale inference 
- **Architecture**: Single-node serving with unified frontend 
- **Benefits**: Simplified deployment and management 

#### 2.1.2.2 **KV Cache-Aware Router Deployment** (`vllm-agg-router`) 
- **Use Case**: Production environments requiring intelligent routing 
- **Architecture**: Multiple workers with cache-aware load balancing 
- **Benefits**: Improved cache efficiency and reduced latency 

--- 

[Open Prometheus!](/prom/graph) 
 
[Open Grafana!](/grafana/)

 
## 2.2 Environment Setup 
 
### 2.2.1 Prerequisites Check 
 
Before we begin, let's ensure our environment is properly configured:

In [None]:
# Import utility functions
from huggingface_hub import HfApi, login
from IPython.display import Markdown, display
from utils.aiperf import create_benchmark_comparison_visualization
from utils.data_generation import generate_mooncake_dataset
from utils.hf_api import check_model_access, check_token_validity, get_hf_token
from utils.k8s_helpers import wait_for_pods_ready
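# PVC_DIR and INGRESS_DIR are referenced by later cells; they are assumed
# to be defined in utils.paths alongside CONFIGS_DIR and DEPLOYMENT_DIR.
from utils.paths import (
    CONFIGS_DIR,
    DEPLOYMENT_DIR,
    INGRESS_DIR,
    PVC_DIR,
    ensure_dirs_exist,
)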
from utils.util import image_model_replacement

# Configure deployment parameters
NAMESPACE = "dynamo-cloud"
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"
VLLM_IMAGE = "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0"

# Ensure required directories exist
ensure_dirs_exist()

### 2.2.2 HF Token Check 
We use a Hugging Face token to download the tokenizer and the model.

For instructions on creating a token, see https://huggingface.co/docs/hub/en/security-tokens

In [None]:
api = HfApi()
token = get_hf_token()

# Step 1: Validate token
if not check_token_validity(api, token):
    token = input("Please re-enter a valid Hugging Face token: ").strip()
    if not check_token_validity(api, token):
        # Stop here so later cells do not run with an invalid token
        raise SystemExit("Exiting: Invalid token provided twice.")

In [None]:
# Step 2: Check model access
check_model_access(api, token, MODEL_NAME)

# Step 3: Hugging Face login
login(token=token)
print("Logged in to Hugging Face Hub via `login()`.")

# Step 4: Create K8s secret
!kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=$token -n $NAMESPACE
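
Re-running the cell above fails once the secret exists, because `kubectl create` is not idempotent. One common workaround, sketched below under the assumption that `kubectl` is on the PATH, is to render the secret with `--dry-run=client -o yaml` and pipe the result into `kubectl apply`. The helper names `build_secret_commands` and `apply_secret` are illustrative and not part of the course utilities.

```python
import subprocess


def build_secret_commands(name, key, value, namespace):
    """Return (render_cmd, apply_cmd) argv lists for an idempotent secret update."""
    render_cmd = [
        "kubectl", "create", "secret", "generic", name,
        f"--from-literal={key}={value}",
        "-n", namespace,
        "--dry-run=client", "-o", "yaml",  # render the manifest only, create nothing
    ]
    # Read the rendered manifest from stdin and create or update the secret.
    apply_cmd = ["kubectl", "apply", "-n", namespace, "-f", "-"]
    return render_cmd, apply_cmd


def apply_secret(name, key, value, namespace):
    """Create the secret, or update it in place if it already exists."""
    render_cmd, apply_cmd = build_secret_commands(name, key, value, namespace)
    rendered = subprocess.run(render_cmd, check=True, capture_output=True, text=True)
    subprocess.run(apply_cmd, input=rendered.stdout, check=True, text=True)


# Inspect the commands without touching the cluster (placeholder token value).
render_cmd, apply_cmd = build_secret_commands(
    "hf-token-secret", "HF_TOKEN", "hf_example", "dynamo-cloud"
)
print(" ".join(render_cmd))
print(" ".join(apply_cmd))
```

In the notebook this would be called as `apply_secret("hf-token-secret", "HF_TOKEN", token, NAMESPACE)`.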

### 2.2.3 Network Configuration 
 
Let's get the load balancer IP address that we'll use to access our deployed services. This IP will be used for external access to our vLLM deployments. 

In [None]:
# Get the load balancer IP
load_balancer_ip_result = !kubectl get svc ingress-nginx-controller -n nginx-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
load_balancer_ip = load_balancer_ip_result[0]
print("\nüìã Configuration Summary:")
print(f"   Namespace: {NAMESPACE}")
print(f"   Load Balancer IP: {load_balancer_ip}")
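
If the ingress controller has not been assigned an address yet, `load_balancer_ip_result` is empty and indexing it raises an unhelpful `IndexError`. The sketch below shows one way to validate the output first. The function name `parse_load_balancer_ip` is hypothetical, and the check assumes the service reports an IP address rather than a hostname.

```python
import ipaddress


def parse_load_balancer_ip(lines):
    """Return the first non-empty line of kubectl output as a validated IP string."""
    candidates = [line.strip() for line in lines if line.strip()]
    if not candidates:
        raise RuntimeError(
            "No load balancer IP reported yet; the ingress controller may still be provisioning."
        )
    ipaddress.ip_address(candidates[0])  # raises ValueError if this is not an IP address
    return candidates[0]


# Example with a placeholder address standing in for the kubectl output.
print(parse_load_balancer_ip(["203.0.113.10"]))
```

In the notebook, `load_balancer_ip = parse_load_balancer_ip(load_balancer_ip_result)` would replace the direct `[0]` lookup.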

--- 
 
## 2.3 vLLM Aggregated Deployment 
 
Now let's deploy our first configuration - the standard vLLM aggregated deployment. This setup provides a simple, unified serving architecture ideal for development and moderate-scale inference workloads. 


### 2.3.1 Persistent Volume Configuration 
 
Before deploying our vLLM service, we need to create persistent storage for model files. This PersistentVolumeClaim (PVC) will store the downloaded model weights and be shared across our deployment. 

In [None]:
!cat {PVC_DIR}/vllm-agg.yaml

Let's create the PVC that will store our model files: 

In [None]:
# Apply the PVC configuration
!kubectl apply -f {PVC_DIR}/vllm-agg.yaml -n $NAMESPACE
!kubectl get pvc vllm-agg-pvc -n $NAMESPACE
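
To check the claim's state programmatically rather than by reading the table, the JSON form of `kubectl get pvc` can be parsed. The sketch below is a minimal example with a hypothetical helper name, `pvc_phases`, and a hand-written sample payload. A claim can legitimately stay `Pending` until a pod mounts it when its storage class uses `WaitForFirstConsumer` binding.

```python
import json


def pvc_phases(pvc_list_json):
    """Map PVC name -> phase from `kubectl get pvc -o json` output."""
    items = json.loads(pvc_list_json).get("items", [])
    return {
        item["metadata"]["name"]: item.get("status", {}).get("phase", "Unknown")
        for item in items
    }


# Hand-written sample shaped like the kubectl JSON output (not real cluster data).
sample = json.dumps(
    {"items": [{"metadata": {"name": "vllm-agg-pvc"}, "status": {"phase": "Bound"}}]}
)
print(pvc_phases(sample))
```

In the notebook, the input could come from `raw = !kubectl get pvc -n $NAMESPACE -o json` followed by `pvc_phases("\n".join(raw))`.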

### 2.3.2 Deploying the vLLM Service 

 
#### High-level flow: 
<center><img src="images/dynamo/flows/vllm-agg.png" width="350px"></center> 
 
 
#### Deployment Architecture: 
<center><img src="images/dynamo/flows/vllm-agg-aks.png" width="1000px"></center> 

The following configuration creates a simple aggregated deployment with: 
- **Frontend**: Handles HTTP requests and routes them to workers 
- **VllmDecodeWorker**: Processes both prefill and decode operations 
- **Model**: Uses Qwen2.5-7B-Instruct-GPTQ-Int4 (the `MODEL_NAME` configured above) as our example LLM 

In [None]:
# Create the configured YAML for the vLLM deployment:
# substitute the image and model name into the template and write it to CONFIGS_DIR
image_model_replacement(
    source_yaml_path=f"{DEPLOYMENT_DIR}/vllm-agg.yaml",
    destination_directory=CONFIGS_DIR,
    vllm_image=VLLM_IMAGE,
    model_name=MODEL_NAME,
)

In [None]:
# Show the generated configuration
print("\n📄 Generated Configuration:")
print("=" * 60)
!cat {CONFIGS_DIR}/vllm-agg.yaml

In [None]:
# Deploy the vLLM configuration
!kubectl apply -f {CONFIGS_DIR}/vllm-agg.yaml -n $NAMESPACE

In [None]:
# Show deployment status
!kubectl get dynamographdeployment vllm-agg -n $NAMESPACE

In [None]:
# List the pods that belong to the deployment
!kubectl get pods -l nvidia.com/dynamo-namespace=vllm-agg -n $NAMESPACE

Sample output:
``` 
NAME                                         READY   STATUS    RESTARTS   AGE
vllm-agg-frontend-5dbbfb4c85-r8m85           0/1     Pending   0          7s
vllm-agg-vllmdecodeworker-85fc89dcf5-8kw59   0/1     Pending   0          7s
 
```

The pods may stay in `Pending` for a few minutes while the cluster autoscaler adds a GPU node. Events like the following in the `kubectl describe` output are expected during that time: 
``` 
 Warning FailedScheduling 3m24s default-scheduler 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling. 
 Normal TriggeredScaleUp 3m14s cluster-autoscaler pod triggered scale-up: [{aks-h100-22443994-vmss 0->1 (max: 2)}] 
```

In [None]:
vllm_agg_vllmdecodeworker_name = !kubectl get pods -n $NAMESPACE | grep vllm-agg-vllmdecodeworker | cut -d' ' -f1

!kubectl describe pods {vllm_agg_vllmdecodeworker_name[0]} -n $NAMESPACE

In [None]:
# Wait for vLLM pods to be ready
wait_for_pods_ready("vllm-agg", NAMESPACE)
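
`wait_for_pods_ready` is a course helper whose implementation is not shown in this notebook. As a rough illustration of the kind of check such a helper could perform, the sketch below inspects the `Ready` condition in `kubectl get pods -o json` output. The function name `unready_pods` and the sample payload are assumptions made for this example.

```python
import json


def unready_pods(pod_list_json):
    """Return {pod_name: phase} for pods whose Ready condition is not True."""
    pending = {}
    for pod in json.loads(pod_list_json).get("items", []):
        status = pod.get("status", {})
        conditions = status.get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
        )
        if not ready:
            pending[pod["metadata"]["name"]] = status.get("phase", "Unknown")
    return pending


# Hand-written sample: the frontend is ready, the worker is still pending.
sample = json.dumps({"items": [
    {"metadata": {"name": "vllm-agg-frontend-5dbbfb4c85-r8m85"},
     "status": {"phase": "Running",
                "conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "vllm-agg-vllmdecodeworker-85fc89dcf5-8kw59"},
     "status": {"phase": "Pending", "conditions": []}},
]})
print(unready_pods(sample))
```

From a terminal, `kubectl wait --for=condition=Ready pods -l nvidia.com/dynamo-namespace=vllm-agg -n dynamo-cloud --timeout=600s` performs a comparable wait without any Python.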

In [None]:
!kubectl get pods -l nvidia.com/dynamo-namespace=vllm-agg -n $NAMESPACE

In [None]:
!kubectl expose deployment vllm-agg-frontend --type=ClusterIP --port=8000 --target-port=8000 -n $NAMESPACE
!kubectl get svc -n $NAMESPACE

### 2.3.3 Setting up External Access 
 
Now that our vLLM service is running, we need to configure ingress to allow external access. The ingress controller will route HTTP requests to our vLLM frontend service. 

In [None]:
!cat {INGRESS_DIR}/vllm-agg-ingress.yaml

In [None]:
# Deploy the ingress configuration
!kubectl apply -f {INGRESS_DIR}/vllm-agg-ingress.yaml -n $NAMESPACE

Wait a moment for ingress to be processed. 

### 2.3.4 Testing the Standard vLLM Deployment 
 
Now that our vLLM service is deployed and accessible, let's test it to ensure everything is working correctly. We'll start with basic connectivity tests and then move on to inference testing. 
 
#### 2.3.4.1 Basic Connectivity Tests 
 
First, let's verify that our service is accessible and responding to requests: 

In [None]:
!curl "http://{load_balancer_ip}/vllm-agg/v1/models" | jq
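
The ingress can take a little while to start routing, so the first `curl` may fail even though the deployment is healthy. A small polling loop makes this step repeatable. The sketch below uses only the standard library and assumes the endpoint returns an OpenAI-style model list of the form `{"data": [{"id": ...}]}`. The names `fetch_json` and `wait_for_models` are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_models(url, fetch=fetch_json, attempts=30, delay=10.0):
    """Poll `url` until it lists at least one model; return the model ids."""
    last_error = None
    for _ in range(attempts):
        try:
            payload = fetch(url)
            models = [m["id"] for m in payload.get("data", [])]
            if models:
                return models
        except (urllib.error.URLError, OSError, ValueError) as exc:
            last_error = exc  # not reachable yet, or the body was not valid JSON
        time.sleep(delay)
    raise TimeoutError(f"No models served at {url} (last error: {last_error})")


# Offline demonstration with a fake fetcher: the first reply is empty, the second lists a model.
_replies = iter([{"data": []}, {"data": [{"id": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"}]}])
print(wait_for_models("http://example.invalid/vllm-agg/v1/models",
                      fetch=lambda url: next(_replies), delay=0))
```

Against the real deployment the call would be `wait_for_models(f"http://{load_balancer_ip}/vllm-agg/v1/models")`.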

#### 2.3.4.2 Chat Completion Test 
 
Now let's test the actual inference capability by sending a chat completion request to our vLLM service: 

In [None]:
import requests

# Configure the chat completion request
endpoint = f"http://{load_balancer_ip}/vllm-agg/v1/chat/completions"
headers = {"accept": "application/json", "Content-Type": "application/json"}

data = {
    "messages": [
        {
            "content": "You are a polite and respectful chatbot helping people plan a vacation.",
            "role": "system",
        },
        {"content": "What is the capital of Germany?", "role": "user"},
    ],
    "model": MODEL_NAME,
    "temperature": 0.5,
    "max_tokens": 150,
    "top_p": 1,
    "stream": False,
}

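llm_response = requests.post(endpoint, headers=headers, json=data, timeout=60)

if llm_response.status_code == 200:
    response_data = llm_response.json()
    print(response_data)
else:
    # Surface failures instead of finishing silently
    print(f"Request failed with status {llm_response.status_code}: {llm_response.text}")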
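
Printing the whole response dictionary is noisy. Assuming the endpoint follows the OpenAI chat completions response shape (a `choices` list with a `message`, plus an optional `usage` block), the reply text and token counts can be pulled out as sketched below. The helper name `extract_reply` and the sample payload are illustrative.

```python
def extract_reply(response_data):
    """Return (reply_text, prompt_tokens, completion_tokens) from a chat completion."""
    choices = response_data.get("choices") or []
    if not choices:
        raise ValueError("Response contains no choices")
    text = choices[0]["message"]["content"]
    usage = response_data.get("usage") or {}
    return text, usage.get("prompt_tokens"), usage.get("completion_tokens")


# Hand-written sample payload in the assumed response shape (not real model output).
sample = {
    "choices": [{"message": {"role": "assistant", "content": "Berlin."}}],
    "usage": {"prompt_tokens": 30, "completion_tokens": 3},
}
reply, prompt_tokens, completion_tokens = extract_reply(sample)
print(reply, prompt_tokens, completion_tokens)
```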

#### 2.3.4.3 Metrics Monitoring 
 
Let's check the metrics endpoint to see performance statistics for our vLLM service:

In [None]:
!curl "http://{load_balancer_ip}/vllm-agg/metrics"
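
The metrics endpoint returns text in the Prometheus exposition format, which is easy to scan for a few values of interest. The sketch below is a deliberately small parser that assumes samples carry no trailing timestamp. The function name and the metric names in the sample are made up for the example and are not the metrics Dynamo exports.

```python
def parse_prometheus_text(text):
    """Parse exposition-format text into {sample_name_with_labels: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Label values may contain spaces, so split on the last space only.
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


# Hypothetical metric names, used only to exercise the parser.
sample_text = """\
# HELP example_requests_total Total requests
# TYPE example_requests_total counter
example_requests_total{model="demo"} 12
example_queue_depth 3
"""
print(parse_prometheus_text(sample_text))
```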

--- 
 
## 2.4 Part 2: KV Cache-Aware Router Deployment 
 
Now let's deploy our second configuration - the KV Cache-aware router deployment. This advanced setup provides intelligent routing that considers Key-Value cache states to minimize cache misses and improve overall performance. 
 
### 2.4.1 What is KV Cache-Aware Routing? 
 
KV Cache-aware routing is an optimization technique that: 
- **Tracks cache states** across multiple worker instances 
- **Routes requests intelligently** to workers with relevant cached data 
- **Reduces cache misses** by leveraging previously computed key-value pairs 
- **Improves latency** for similar or continuing conversations 
 
#### Key Benefits: 
- **Better Cache Utilization**: Maximizes reuse of cached computations 
- **Reduced Latency**: Faster response times for cache hits 
- **Cost Efficiency**: Lower compute requirements through cache reuse 
- **Scalability**: Better performance under high load 
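
The idea can be illustrated with a toy model. The sketch below is not Dynamo's router implementation: it is a simplified illustration in which prompts are split into fixed-size token blocks, each worker remembers the block hashes it has seen, and a request goes to the worker with the lowest cost, defined here as uncached blocks plus a load penalty. `BLOCK_SIZE`, `Worker`, `pick_worker`, and the cost formula are all assumptions chosen for clarity.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (illustrative value)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block chained with its prefix, so equal hashes imply equal prefixes."""
    hashes = []
    prev = 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        prev = hash((prev, tuple(tokens[start:start + block_size])))
        hashes.append(prev)
    return hashes


@dataclass
class Worker:
    name: str
    cached: set = field(default_factory=set)
    active_requests: int = 0


def overlap(worker, hashes):
    """Count the leading blocks of a request that the worker already holds."""
    count = 0
    for h in hashes:
        if h not in worker.cached:
            break
        count += 1
    return count


def pick_worker(workers, tokens, load_weight=1.0):
    """Route to the worker with the fewest uncached blocks, penalizing busy workers."""
    hashes = block_hashes(tokens)

    def cost(worker):
        return (len(hashes) - overlap(worker, hashes)) + load_weight * worker.active_requests

    best = min(workers, key=cost)
    best.cached.update(hashes)  # the chosen worker now holds these blocks
    best.active_requests += 1
    return best


demo_workers = [Worker("worker-0"), Worker("worker-1")]
system_prompt = list(range(12))  # shared prefix: three full blocks
first = pick_worker(demo_workers, system_prompt + [100, 101, 102, 103])
second = pick_worker(demo_workers, system_prompt + [200, 201, 202, 203])
third = pick_worker(demo_workers, list(range(500, 516)))
print(first.name, second.name, third.name)
```

The second request shares its first three blocks with the first one, so it lands on the same worker despite that worker being busier. The third request shares nothing, so load decides and it goes to the idle worker.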
 
### 2.4.2 Configuration Comparison 

The key difference from the standard deployment is the following argument, which enables KV cache-aware routing logic:
```
 "--router-mode kv"
```

### 2.4.3 Router Storage Configuration 
 
The router deployment requires its own storage volume for model files. Let's create a dedicated PVC for the router configuration: 

In [None]:
!cat {PVC_DIR}/vllm-agg-router.yaml

In [None]:
# Apply the router PVC configuration
!kubectl apply -f {PVC_DIR}/vllm-agg-router.yaml -n $NAMESPACE

In [None]:
!kubectl get pvc -n $NAMESPACE

### 2.4.4 KV Cache-Aware Routing Architecture 
 
Let's visualize how KV cache-aware routing works compared to standard routing: 

 
#### High-level flow: 
<center><img src="images/dynamo/flows/vllm-agg-router.png" width="500px"></center> 
 
 
#### Deployment Architecture: 
<center><img src="images/dynamo/flows/vllm-agg-router-aks.png" width="1000px"></center> 

### 2.4.5 Deploying the Router Service 
 
Now let's deploy the KV Cache-aware router configuration: 

In [None]:
# Create configured YAML for router deployment
success = image_model_replacement(
    source_yaml_path=f"{DEPLOYMENT_DIR}/vllm-agg-router.yaml",
    destination_directory=CONFIGS_DIR,
    vllm_image=VLLM_IMAGE,
    model_name=MODEL_NAME,
)

In [None]:
!cat {CONFIGS_DIR}/vllm-agg-router.yaml

In [None]:
# Deploy the router configuration
!kubectl apply -f {CONFIGS_DIR}/vllm-agg-router.yaml -n $NAMESPACE

In [None]:
# Show deployment status
!kubectl get dynamographdeployment -n $NAMESPACE
!kubectl get pods -l nvidia.com/dynamo-namespace=vllm-agg-router -n $NAMESPACE

In [None]:
# Wait for router pods to be ready
wait_for_pods_ready("vllm-agg-router", NAMESPACE)

In [None]:
!kubectl get pods -l nvidia.com/dynamo-namespace=vllm-agg-router -n $NAMESPACE

In [None]:
!kubectl expose deployment vllm-agg-router-frontend --type=ClusterIP --port=8000 --target-port=8000 -n $NAMESPACE
!kubectl get svc -n $NAMESPACE

### 2.4.6 Router External Access Configuration 
 
Now let's configure ingress for the router deployment to enable external access with the `/vllm-agg-router` path: 

In [None]:
print("üåê Deploying Router Ingress...")

# Deploy the router ingress configuration
!kubectl apply -f {INGRESS_DIR}/vllm-agg-router-ingress.yaml -n $NAMESPACE

Wait a moment for ingress to be processed.

### 2.4.7 Testing the Router Deployment 
 
Let's test our KV Cache-aware router deployment to ensure it's working correctly: 
 
#### 2.4.7.1 Router Connectivity Test 

In [None]:
!curl "http://{load_balancer_ip}/vllm-agg-router/v1/models" | jq

#### 2.4.7.2 Router Chat Completion Test 
 
Let's test the inference capability of our router deployment: 

In [None]:
data

In [None]:
endpoint = f"http://{load_balancer_ip}/vllm-agg-router/v1/chat/completions"
router_response = requests.post(endpoint, headers=headers, json=data, timeout=60)
router_response.raise_for_status()  # fail loudly on a non-2xx response
response_data = router_response.json()
print(response_data)

## 2.5 Performance Benchmarking

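### 2.5.1 Generating the Benchmark Dataset 

Both deployments are benchmarked against the same generated request trace, so we create it once up front:
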
In [None]:
# Generate test dataset for benchmarking
success = generate_mooncake_dataset(
    time_duration=60,
    request_rate_min=3,
    request_rate_max=5,
    request_rate_period=30,
    isl1=250,
    osl1=50,
    isl2=500,
    osl2=100,
    output_file="results/dataset.jsonl",
)

if not success:
    print("Failed to generate test dataset")
    raise RuntimeError("Dataset generation failed")
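
Before benchmarking, it can help to sanity-check the generated file. The sketch below summarizes any JSONL file by counting records and averaging its numeric fields. The field names in the sample (`timestamp`, `input_length`, `output_length`, `hash_ids`) follow the public Mooncake trace layout and are an assumption about what `generate_mooncake_dataset` writes.

```python
import json
import os
import tempfile
from statistics import mean


def summarize_jsonl(path):
    """Return (record_count, {field: mean}) over the numeric fields of a JSONL file."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                records.append(json.loads(line))
    numeric = {}
    for record in records:
        for key, value in record.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                numeric.setdefault(key, []).append(value)
    return len(records), {key: mean(values) for key, values in numeric.items()}


# Demonstration on a temporary file with two hand-written records.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write(json.dumps({"timestamp": 0, "input_length": 250, "output_length": 50, "hash_ids": [0, 1]}) + "\n")
    tmp.write(json.dumps({"timestamp": 300, "input_length": 500, "output_length": 100, "hash_ids": [0, 2]}) + "\n")
    demo_path = tmp.name

print(summarize_jsonl(demo_path))
os.remove(demo_path)
```

For the real dataset, call `summarize_jsonl("results/dataset.jsonl")`.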

### 2.5.2 Benchmarking Standard vLLM Deployment 
 
Let's start with benchmarking our standard deployment to establish baseline performance.

<p style="color:red">Copy the command printed below and run it in a terminal:</p>

In [None]:
command = f'aiperf profile -m "{MODEL_NAME}" --endpoint-type "chat" --url "http://{load_balancer_ip}/vllm-agg" --input-file "/dli/task/results/dataset.jsonl" --custom-dataset-type "mooncake_trace" --fixed-schedule --artifact-dir "/dli/task/results/standard-benchmark" --streaming'


# Display the command for reference
print(
    "Please make sure you have run the following command in terminal before proceeding:\n"
)
# Display as shell block
display(Markdown(f"```bash\n{command}\n```"))

print("\n")

# Ask for confirmation
response = input("Have you run this command? (yes/no): ").strip().lower()

# Conditional execution
if response == "yes":
    print("Great! You can proceed.")
else:
    print("Please run the command in terminal first before proceeding.")
    # Optionally, stop execution
    import sys

    sys.exit("Execution stopped. Run the command first.")

You should see a screen like the one below: 
<center><img src="images/ai-perf.png" width="700px"></center>
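
The two benchmark commands in this section differ only in the URL and the artifact directory. A small builder keeps them consistent and lets `shlex.join` handle shell quoting. The sketch below uses only the flags that appear in the command above, and the function name `build_aiperf_command` is hypothetical.

```python
import shlex


def build_aiperf_command(model, base_url, input_file, artifact_dir):
    """Assemble the aiperf invocation as a single shell-safe string."""
    argv = [
        "aiperf", "profile",
        "-m", model,
        "--endpoint-type", "chat",
        "--url", base_url,
        "--input-file", input_file,
        "--custom-dataset-type", "mooncake_trace",
        "--fixed-schedule",
        "--artifact-dir", artifact_dir,
        "--streaming",
    ]
    return shlex.join(argv)


# Placeholder address; in the notebook this would be the load balancer IP.
print(build_aiperf_command(
    "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    "http://203.0.113.10/vllm-agg",
    "/dli/task/results/dataset.jsonl",
    "/dli/task/results/standard-benchmark",
))
```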

### 2.5.3 Benchmarking KV Cache-Aware Router Deployment 
 
Now let's benchmark our KV Cache-aware router deployment to compare performance.


<p style="color:red">Copy the command printed below and run it in a terminal:</p>

In [None]:
from IPython.display import display

command = f'aiperf profile -m "{MODEL_NAME}" --endpoint-type "chat" --url "http://{load_balancer_ip}/vllm-agg-router" --input-file "/dli/task/results/dataset.jsonl" --custom-dataset-type "mooncake_trace" --fixed-schedule --artifact-dir "/dli/task/results/router-benchmark" --streaming'


# Display the command for reference
print(
    "Please make sure you have run the following command in terminal before proceeding:\n"
)
# Display as shell block
display(Markdown(f"```bash\n{command}\n```"))

print("\n")

# Ask for confirmation
response = input("Have you run this command? (yes/no): ").strip().lower()

# Conditional execution
if response == "yes":
    print("Great! You can proceed.")
else:
    print("Please run the command in terminal first before proceeding.")
    # Optionally, stop execution
    import sys

    sys.exit("Execution stopped. Run the command first.")

You should see a screen like the one below: 
<center><img src="images/ai-perf-vllm-agg-router.png" width="700px"></center>
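
The yes/no prompts above rely on the reader's answer. A complementary check is to confirm that each artifact directory actually contains files before moving on to the analysis. The helper below is a hypothetical sketch that makes no assumption about which files `aiperf` writes, only that it writes at least one.

```python
import tempfile
from pathlib import Path


def has_artifacts(artifact_dir):
    """Return True if the directory exists and contains at least one file."""
    path = Path(artifact_dir)
    return path.is_dir() and any(p.is_file() for p in path.rglob("*"))


# Demonstration on a temporary directory: empty at first, then with one file.
with tempfile.TemporaryDirectory() as tmp_dir:
    print(has_artifacts(tmp_dir))
    (Path(tmp_dir) / "example.json").write_text("{}", encoding="utf-8")
    print(has_artifacts(tmp_dir))
```

In the notebook, checking `has_artifacts("results/standard-benchmark")` and `has_artifacts("results/router-benchmark")` before the visualization cell catches a skipped benchmark early.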

#### Check the Grafana Dashboard
[Open Grafana!](/grafana/) -> Dynamo Dashboard

<center><img src="images/dynamo-grafana-dashboard.png" width="700px"></center>


### 2.5.4 Performance Data Analysis & Visualization 
 
Now let's analyze the benchmark results from both deployments and create comprehensive visualizations comparing their performance characteristics. 

In [None]:
# Create visualization for aggregated deployments
comparison_fig = create_benchmark_comparison_visualization(
    standard_dir="results/standard-benchmark",
    router_dir="results/router-benchmark",
    deployment_labels=("Standard Agg", "Router Agg"),
    output_filename="agg_benchmark_comparison.png",
)
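
Alongside the chart, a relative difference per metric makes the comparison easier to state in words. The sketch below computes the percentage change of the router deployment against the standard one. The metric names and numbers are placeholders chosen for the example, not measured results, and `compare_metrics` is a hypothetical helper.

```python
def percent_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("Baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def compare_metrics(standard, router):
    """Percent change for every metric present in both result dictionaries."""
    shared = sorted(standard.keys() & router.keys())
    return {name: percent_change(standard[name], router[name]) for name in shared}


# Placeholder values for illustration only.
standard_metrics = {"ttft_ms": 200.0, "output_tokens_per_s": 100.0}
router_metrics = {"ttft_ms": 150.0, "output_tokens_per_s": 110.0}
for name, change in compare_metrics(standard_metrics, router_metrics).items():
    print(f"{name}: {change:+.1f}%")
```

A negative change is an improvement for latency metrics, while a positive change is an improvement for throughput metrics.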

## 2.6 Clean Up

### 2.6.1 Delete Existing Deployments
 
Before we proceed with the next notebook, let's clean up any existing aggregated deployments:

In [None]:
!kubectl delete -f {CONFIGS_DIR}/vllm-agg.yaml -n $NAMESPACE

In [None]:
!kubectl delete -f {CONFIGS_DIR}/vllm-agg-router.yaml -n $NAMESPACE
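
The cells above remove the graph deployments. The Services created earlier with `kubectl expose` are separate objects, so they may remain afterwards. Whether the next notebook expects them is not stated here, so treat the following as an optional sketch. It builds a delete command using `--ignore-not-found` so that it is safe to re-run. The function name is hypothetical, and the service names assume `kubectl expose` used its default of naming the Service after the deployment.

```python
import shlex


def build_service_cleanup(namespace, services=("vllm-agg-frontend", "vllm-agg-router-frontend")):
    """Return argv for deleting the exposed frontend Services, tolerating missing ones."""
    return ["kubectl", "delete", "service", *services, "-n", namespace, "--ignore-not-found"]


# Print the command for review instead of running it.
print(shlex.join(build_service_cleanup("dynamo-cloud")))
```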

--- 
<h2 style="color:green;">Congratulations!</h2> 
 
## Key Takeaways: Aggregated Deployment

**What is Aggregated Deployment?**
- **All-in-one architecture**: Prefill (prompt processing) and decode (token generation) run together in the same pods
- **Unified scaling**: All components scale together as a single unit
- **Simpler configuration**: Fewer components to manage and monitor

**Benefits:**
- **Simplicity**: Easier to deploy, configure, and troubleshoot
- **Lower Overhead**: No router or coordination between separate components
- **Resource Efficiency**: Good for moderate traffic workloads where separation isn't needed
- **Faster Time-to-Deploy**: Fewer moving parts means quicker setup

**When to Use:**
- Development and testing environments
- Low to moderate traffic applications
- Scenarios without cache reuse patterns
- When operational simplicity is prioritized over maximum performance

**Trade-offs:**
- Less flexibility in scaling individual components
- No intelligent cache-aware routing unless the KV cache-aware router (`vllm-agg-router`) is enabled
- May not maximize resource utilization under high load

### Next Steps:

- In the next notebook, you'll deploy a **disaggregated architecture** and compare the performance differences.


**Continue to**: [vLLM Disagg Deployment](Dynamo_03_vLLM_disAgg_Deployment.ipynb)

