# Lab 1: Introduction and Kubernetes-Based Deployment

## Overview

In this lab, you'll deploy a **disaggregated serving** model on Kubernetes using NVIDIA Dynamo. This architecture separates the inference workload into two specialized workers: one handles the initial prompt processing (prefill), and another generates the response tokens (decode). By splitting these tasks, each worker can be optimized for its specific job, leading to better GPU utilization and more predictable performance.

You'll deploy a small language model (`Qwen/Qwen2.5-1.5B-Instruct`) that runs on a single node with 2 GPUs: one GPU for the prefill worker and one for the decode worker. After deployment, you'll test the model using OpenAI-compatible APIs (the same format as ChatGPT's API) and run performance benchmarks to measure throughput and latency.

This is the foundation for understanding Dynamo's capabilities. In Lab 2, you'll add monitoring to observe what's happening inside your deployment. In Lab 3, you'll see a different architecture where multiple workers share their work more dynamically.

**Duration**: ~90 minutes

---

## Section 1: Environment Setup

### Objectives
- Verify Kubernetes access 
- Install Dynamo dependencies
- Set up prerequisites (kubectl, helm)

### Prerequisites
Before starting, ensure you have:
- ✅ Kubernetes cluster access (kubeconfig provided by instructor)
- ✅ `kubectl` installed (version 1.24+) or `microk8s kubectl`
- ✅ `helm` 3.x installed
- ✅ HuggingFace token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

### Step 2: Set Configuration Variables

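Run this to set up your environment. The defaults work for most users. Each `%%bash` cell runs in its own shell, so variables exported here do not carry over to later cells; that is why later cells re-apply the same defaults: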


In [None]:
%%bash
# Set environment variables (these defaults work for most setups)
export RELEASE_VERSION=${RELEASE_VERSION:-0.8.0}
export NAMESPACE=${NAMESPACE:-dynamo}
export CACHE_PATH=${CACHE_PATH:-/data/huggingface-cache}

# Get node IP
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "🎓 Lab 1: Environment Configuration"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "  Release Version:  $RELEASE_VERSION"
echo "  Namespace:        $NAMESPACE"
echo "  Cache Path:       $CACHE_PATH"
echo "  Node IP:          $NODE_IP"
echo ""
echo "📌 Service Ports (after deployment):"
echo "  Frontend API:     http://$NODE_IP:30100"
echo "  Grafana:          http://$NODE_IP:30080"
echo ""
echo "💡 Note: Frontend will be accessible after deploying in Section 3"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
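
The same defaulting logic can be expressed in Python, which is handy when a later Python cell needs these settings. This is a minimal sketch: `load_config` and `service_urls` are hypothetical helper names, and the values are copied from the bash cell above.

```python
import os

# Defaults copied from the bash cell above.
DEFAULTS = {
    "RELEASE_VERSION": "0.8.0",
    "NAMESPACE": "dynamo",
    "CACHE_PATH": "/data/huggingface-cache",
}
FRONTEND_NODE_PORT = 30100
GRAFANA_NODE_PORT = 30080


def load_config(env=None):
    """Mimic bash ${VAR:-default}: an unset or empty variable falls back to the default."""
    env = os.environ if env is None else env
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}


def service_urls(node_ip):
    """Build the NodePort URLs that the bash cell prints."""
    return {
        "frontend": f"http://{node_ip}:{FRONTEND_NODE_PORT}",
        "grafana": f"http://{node_ip}:{GRAFANA_NODE_PORT}",
    }


# Example with an explicit environment mapping and a placeholder node IP.
print(load_config({"NAMESPACE": "my-team"}))
print(service_urls("10.0.0.5"))
```

Because the notebook kernel's environment is inherited by every `%%bash` cell, setting `os.environ["NAMESPACE"]` in a Python cell is one way to override a default for the rest of the lab.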

### Step 3: Verify Kubernetes Access


In [None]:
%%bash
# Verify kubectl is installed and configured
echo "=== kubectl version ==="
kubectl version --client

echo ""
echo "=== Cluster info ==="
kubectl cluster-info

echo ""
echo "=== GPU nodes ==="
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.nvidia\\.com/gpu
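
This lab needs two GPUs on one node, so it can help to check that programmatically. The sketch below reads the same `status.capacity["nvidia.com/gpu"]` field that the `custom-columns` query prints. The helper names and the sample node are illustrative assumptions; in practice the input would be the parsed output of `kubectl get nodes -o json`.

```python
def gpu_capacity(nodes):
    """Map node name -> GPU count from parsed `kubectl get nodes -o json` output."""
    result = {}
    for node in nodes.get("items", []):
        name = node["metadata"]["name"]
        capacity = node.get("status", {}).get("capacity", {})
        # Nodes without the NVIDIA device plugin do not report this key at all.
        result[name] = int(capacity.get("nvidia.com/gpu", 0))
    return result


def has_enough_gpus(nodes, required=2):
    """True if at least one node offers `required` GPUs (prefill + decode)."""
    return any(count >= required for count in gpu_capacity(nodes).values())


# Hypothetical sample standing in for real kubectl output.
sample_nodes = {
    "items": [
        {"metadata": {"name": "gpu-node-0"},
         "status": {"capacity": {"cpu": "32", "nvidia.com/gpu": "2"}}},
        {"metadata": {"name": "cpu-node-0"},
         "status": {"capacity": {"cpu": "16"}}},
    ]
}

print(gpu_capacity(sample_nodes))
print(has_enough_gpus(sample_nodes))
```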

### Step 4: Set Up NGC Authentication

To access NVIDIA's Dynamo container images, you need to authenticate with NGC.

#### Get Your NGC API Key

1. Go to [NGC](https://ngc.nvidia.com/)
2. Sign in or create an account
3. Click on your profile in the top right corner
4. Select **"Setup"** → **"Generate API Key"**
5. Copy your API key (it will only be shown once!)

#### Set NGC API Key

**Get your NGC API Key from [ngc.nvidia.com](https://ngc.nvidia.com/)** (Go to Profile > Setup > Generate API Key)

In [None]:
import os
import getpass

# Get NGC API key from user
print("Enter your NGC API Key from https://ngc.nvidia.com/")
print("(Go to Profile > Setup > Generate API Key)")
print("")
NGC_API_KEY = getpass.getpass("NGC API Key: ")

# Save it for later use (creating pull secrets)
os.environ['NGC_API_KEY'] = NGC_API_KEY

print("")
print("✓ NGC API key saved")
print("  You can now use it to login and create pull secrets")

#### Set HuggingFace Token

**HuggingFace token is required to download models.** Get yours from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (Create a 'Read' token if you don't have one)

In [None]:
import os
import getpass

# Get HuggingFace token from user
print("Enter your HuggingFace Token from https://huggingface.co/settings/tokens")
print("(Create a 'Read' token if you don't have one)")
print("")
HF_TOKEN = getpass.getpass("HF Token: ")

# Save it for later use
os.environ['HF_TOKEN'] = HF_TOKEN

print("")
print("✓ HuggingFace token saved to environment")
print("  Available as $HF_TOKEN in bash cells")

#### Login to NGC Registry

In [None]:
%%bash
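# Log in to the NGC container registry; report success only if the login worked
if echo "$NGC_API_KEY" | helm registry login nvcr.io --username '$oauthtoken' --password-stdin; then
    echo ""
    echo "✓ NGC authentication complete"
    echo "  You can now pull Dynamo container images"
else
    echo "✗ NGC login failed. Check that NGC_API_KEY is set correctly"
fi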

### Step 5: Create Your Namespace

In [None]:
%%bash
# Create the namespace
export NAMESPACE=${NAMESPACE:-dynamo}

kubectl create namespace $NAMESPACE 2>&1 | grep -v "AlreadyExists" || true

# Verify namespace was created
echo ""
echo "Verifying namespace:"
kubectl get namespace $NAMESPACE

### Step 6: Create NGC Pull Secret

Create a Kubernetes secret so that pods can pull images from NGC.

In [None]:
%%bash
# Get variables
export NAMESPACE=${NAMESPACE:-dynamo}

# Create NGC image pull secret
kubectl create secret docker-registry ngc-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY" \
    --namespace $NAMESPACE \
    2>&1 | grep -v "AlreadyExists" || true

# Verify secret was created (the success message prints only if the secret exists)
echo ""
echo "Verifying NGC secret:"
kubectl get secret ngc-secret -n $NAMESPACE && \
    echo "✓ NGC pull secret created in namespace: $NAMESPACE"
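
To see what this secret contains, the sketch below rebuilds the JSON payload that a `docker-registry` secret stores under its `.dockerconfigjson` key. `build_dockerconfigjson` is a hypothetical helper and `example-api-key` is a placeholder; `kubectl` does this for you, so the code is only for understanding and debugging.

```python
import base64
import json


def build_dockerconfigjson(server, username, password):
    """Return the JSON document stored in a kubernetes.io/dockerconfigjson secret."""
    # The `auth` field is base64("username:password").
    auth = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    config = {
        "auths": {
            server: {"username": username, "password": password, "auth": auth}
        }
    }
    return json.dumps(config)


# NGC expects the literal username "$oauthtoken", which is why the bash
# cell single-quotes it to prevent shell expansion.
payload = build_dockerconfigjson("nvcr.io", "$oauthtoken", "example-api-key")
print(payload)
```

If image pulls fail with `ImagePullBackOff`, decoding the secret's `.dockerconfigjson` value and comparing it with this shape is a quick way to spot an empty or mistyped key.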

### Step 7: Create HuggingFace Token Secret

In [None]:
%%bash
# Get variables
export NAMESPACE=${NAMESPACE:-dynamo}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
    --from-literal=HF_TOKEN="$HF_TOKEN" \
    --namespace $NAMESPACE \
    2>&1 | grep -v "AlreadyExists" || true

# Verify secret was created (the success message prints only if the secret exists)
echo ""
echo "Verifying HuggingFace secret:"
kubectl get secret hf-token-secret -n $NAMESPACE && \
    echo "✓ HuggingFace token secret created"

## Section 2: Install Dynamo Platform

### Objectives
- Install Dynamo CRDs (Custom Resource Definitions) with validation webhooks
- Install Dynamo platform (operator with K8s-native discovery)
- Verify platform components are running

### Architecture (v0.8.0 Simplified)

```
Client Request
      ↓
Frontend (OpenAI API + Disaggregated Router)
      ↓
Prefill Worker (GPU 0) → Processes prompt → Generates KV cache
      ↓
Decode Worker (GPU 1) → Uses KV cache → Generates tokens
      ↓
Response to Client

Infrastructure:
- Kubernetes EndpointSlices (service discovery)
- TCP Transport (direct worker connections)
- Dynamo Operator (manages deployments)
- Validation Webhooks (catch errors early)

Note: This basic deployment doesn't use NATS or etcd.
      Lab 3 covers distributed deployments
```

### Deployment Mode

The [official Dynamo documentation](https://github.com/ai-dynamo/dynamo/blob/main/deploy/helm/charts/platform/README.md) recommends one cluster-wide operator per cluster, which is the default and the standard deployment for single-node and production clusters. This lab instead installs a **namespace-scoped** operator:

- ✅ **Documented default**: One cluster-wide operator per cluster
- **This lab**: A namespace-scoped Dynamo operator that only manages resources in your namespace (`dynamo-operator.namespaceRestriction.enabled=true`)
- The CRDs are cluster-wide and may already be installed (Step 1 checks first)

### Step 1: Install Dynamo CRDs

**Note:** CRDs are cluster-wide resources and only need to be installed **once per cluster**. This step checks if they exist and installs them if needed.


In [None]:
%%bash
# Set default RELEASE_VERSION if not already set
export RELEASE_VERSION=${RELEASE_VERSION:-0.8.0}

# Check if CRDs already exist, install if not
if kubectl get crd dynamographdeployments.nvidia.com &>/dev/null && \
   kubectl get crd dynamocomponentdeployments.nvidia.com &>/dev/null; then
    echo "✓ CRDs already installed"
    kubectl get crd | grep nvidia.com
else
    echo "Installing Dynamo CRDs v$RELEASE_VERSION..."
    helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-$RELEASE_VERSION.tgz
    helm install dynamo-crds dynamo-crds-$RELEASE_VERSION.tgz --namespace default

    echo ""
    echo "Verifying CRD installation:"
    kubectl get crd | grep nvidia.com
    echo ""
    echo "✓ CRDs include validation webhooks for early error detection"
fi

### Step 2: Install Dynamo Platform

**Simplified in v0.8.0:** NATS and etcd are now **optional**. Dynamo uses Kubernetes-native service discovery (EndpointSlices) and TCP transport by default, making deployment simpler and reducing infrastructure dependencies.


In [None]:
%%bash
# Set defaults if not already set
export RELEASE_VERSION=${RELEASE_VERSION:-0.8.0}
export NAMESPACE=${NAMESPACE:-dynamo}

echo "Using configuration:"
echo "  RELEASE_VERSION: $RELEASE_VERSION"
echo "  NAMESPACE: $NAMESPACE"
echo ""

# Download platform chart
echo "Downloading Dynamo platform chart v$RELEASE_VERSION..."
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$RELEASE_VERSION.tgz

# Install Dynamo platform (namespace-scoped, K8s-native discovery)
echo "Installing Dynamo platform in namespace: $NAMESPACE"
echo "Using K8s-native discovery (no NATS/etcd required)"
helm install dynamo-platform \
    dynamo-platform-$RELEASE_VERSION.tgz \
    --namespace $NAMESPACE \
    --set dynamo-operator.namespaceRestriction.enabled=true

echo ""
echo "✓ Platform installation initiated"
echo "  Discovery: Kubernetes EndpointSlices (native)"
echo "  Transport: TCP (default in v0.8.0)"
echo ""
echo "Next: run Step 3 to wait for the platform pods to become ready"

### Step 3: Wait for Platform Pods to Be Ready

The following cell polls every 5 seconds until all pods report `Running` or `Completed`, and gives up after 5 minutes. Re-run it if it times out.


In [None]:
%%bash
export NAMESPACE=${NAMESPACE:-dynamo}

echo "Waiting for platform pods to be ready..."
echo ""

# Wait for pods to be ready (timeout after 5 minutes)
TIMEOUT=300
ELAPSED=0
INTERVAL=5

while [ $ELAPSED -lt $TIMEOUT ]; do
    # Get pod status
    NOT_READY=$(kubectl get pods -n $NAMESPACE --no-headers 2>/dev/null | grep -v "Running\|Completed" | wc -l)
    TOTAL=$(kubectl get pods -n $NAMESPACE --no-headers 2>/dev/null | wc -l)
    READY=$((TOTAL - NOT_READY))
    
    echo "[$ELAPSED s] Pods ready: $READY/$TOTAL"
    kubectl get pods -n $NAMESPACE
    
    # Check if all pods are ready
    if [ $NOT_READY -eq 0 ] && [ $TOTAL -gt 0 ]; then
        echo ""
        echo "✓ All platform pods are ready!"
        break
    fi
    
    echo ""
    echo "Waiting for pods to be ready... (checking again in ${INTERVAL}s)"
    echo ""
    sleep $INTERVAL
    ELAPSED=$((ELAPSED + INTERVAL))
done

if [ $ELAPSED -ge $TIMEOUT ]; then
    echo "⚠️  Timeout waiting for pods to be ready"
    echo "Please check pod status manually: kubectl get pods -n $NAMESPACE"
fi
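
The loop above counts any pod whose status is `Running` as ready, even when its READY column still shows `0/1`. A stricter check can be sketched in Python as follows. `summarize_pods` is a hypothetical helper that parses the text of `kubectl get pods --no-headers`, and the pod names in the sample are made up.

```python
def summarize_pods(kubectl_output):
    """Return (ready, total) from `kubectl get pods --no-headers` text.

    A pod counts as ready if it is Completed, or Running with all of its
    containers ready (READY column of the form n/n).
    """
    total = ready = 0
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        _name, ready_col, status = fields[:3]
        total += 1
        if status == "Completed":
            ready += 1
            continue
        have, _, want = ready_col.partition("/")
        if status == "Running" and have == want:
            ready += 1
    return ready, total


# Hypothetical sample output.
sample = """\
dynamo-platform-operator-abc12   2/2   Running     0   2m
example-worker-def34             0/1   Running     0   30s
example-job-ghi56                0/1   Completed   0   5m
"""

print(summarize_pods(sample))
```

In this sample the second pod is `Running` but not yet ready, so the stricter count is 2 of 3, whereas the bash loop would report 3 of 3.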

## Section 3: Deploy Your First Model with Disaggregated Serving

### Objectives
- Understand disaggregated serving architecture
- Configure and deploy a model using vLLM backend with separate prefill and decode workers
- Use Kubernetes manifests to deploy Dynamo resources

### Available Backends
In this lab, we'll use **vLLM** with disaggregated serving:
- **vLLM**: High-throughput serving with PagedAttention
- Model: `Qwen/Qwen2.5-1.5B-Instruct` (small, fast to download)
- Architecture: Disaggregated serving with separate prefill and decode workers

**Other backends** (for exploration):
- **SGLang**: Optimized for complex prompting and structured generation
- **TensorRT-LLM**: Maximum performance on NVIDIA GPUs

### What is Disaggregated Serving?

Disaggregated serving separates the inference pipeline into specialized workers:

**Prefill Worker** (GPU 0):
- Processes input prompts (compute-intensive)
- Converts tokens into KV cache
- Passes KV cache to decode workers

**Decode Worker** (GPU 1):
- Generates output tokens (memory-intensive)
- Uses KV cache from prefill worker
- Produces the final response

**Benefits:**
- ✅ **Independent scaling**: Scale prefill and decode separately based on workload
- ✅ **Resource optimization**: Each worker optimized for its specific task
- ✅ **Better throughput**: Specialized workers can handle more requests

**Architecture:**
```
Client Request
    ↓
Frontend (Router)
    ↓
Prefill Worker (GPU 0) → processes prompt → generates KV cache
    ↓
Decode Worker (GPU 1) → receives KV cache → generates tokens
    ↓
Response to Client
```
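
The handoff in the diagram can be illustrated with a toy Python sketch. It is not Dynamo's or vLLM's implementation: the `KVCache` class, the `prefill` and `decode` functions, and the fake "next token" rule are all assumptions made only to show that prefill processes the whole prompt once, while decode extends the received cache one token at a time.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    # One entry per processed token; a real cache holds attention keys/values.
    entries: list = field(default_factory=list)


def prefill(prompt_tokens):
    """Prefill worker: process the entire prompt in one pass and build the cache."""
    return KVCache(entries=[("kv", token) for token in prompt_tokens])


def decode(cache, max_new_tokens):
    """Decode worker: generate tokens one by one, extending the received cache."""
    generated = []
    for _ in range(max_new_tokens):
        # Stand-in for a real model's sampling step: use the cache length.
        next_token = len(cache.entries)
        cache.entries.append(("kv", next_token))
        generated.append(next_token)
    return generated


cache = prefill([11, 12, 13])   # would run on the prefill GPU
tokens = decode(cache, 4)       # would run on the decode GPU after the cache transfer
print(tokens, len(cache.entries))
```

The sketch makes the cost asymmetry visible: prefill is a single large pass over the prompt, while decode is a loop of small steps that each read the growing cache.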

### Deployment Configuration

We'll use a `DynamoGraphDeployment` resource that defines:
- **Frontend**: OpenAI-compatible API endpoint with disaggregated routing
- **VllmPrefillWorker**: 1 replica on GPU 0 for prompt processing
- **VllmDecodeWorker**: 1 replica on GPU 1 for token generation

### Step 1: Create Deployment Manifest

We will create the `disagg_router.yaml` file dynamically with your specific configuration variables:

In [None]:
%%bash
# Set defaults if not already set
export RELEASE_VERSION=${RELEASE_VERSION:-0.8.0}
export CACHE_PATH=${CACHE_PATH:-/data/huggingface-cache}

# Create the deployment YAML with environment variables
cat <<EOF > disagg_router.yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-disagg-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:$RELEASE_VERSION
      envs:
        - name: DYN_ROUTER_MODE
          value: disaggregated
    VllmPrefillWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-disagg-router
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      envs:
        - name: DYN_LOG
          value: "info"
      extraPodSpec:
        volumes:
        - name: local-model-cache
          hostPath:
            path: $CACHE_PATH
            type: DirectoryOrCreate
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:$RELEASE_VERSION
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
          volumeMounts:
          - name: local-model-cache
            mountPath: /root/.cache
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct --tensor-parallel-size 1
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-disagg-router
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      envs:
        - name: DYN_LOG
          value: "info"
      extraPodSpec:
        volumes:
        - name: local-model-cache
          hostPath:
            path: $CACHE_PATH
            type: DirectoryOrCreate
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:$RELEASE_VERSION
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
          volumeMounts:
          - name: local-model-cache
            mountPath: /root/.cache
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct --tensor-parallel-size 1
EOF

echo "✓ Deployment manifest created: disagg_router.yaml"
echo "  Using Image Version: $RELEASE_VERSION"
echo "  Using Cache Path:    $CACHE_PATH"
echo ""
echo "Verify the configuration:"
grep "image:" disagg_router.yaml

### Step 2: Deploy the Model


In [None]:
%%bash
# Set defaults if not already set
export NAMESPACE=${NAMESPACE:-dynamo}

# Apply the deployment
kubectl apply -f disagg_router.yaml --namespace $NAMESPACE

echo ""
echo "✓ Deployment created. This will take 4-6 minutes for first run."
echo "  - Pulling container images"
echo "  - Downloading model from HuggingFace"
echo "  - Loading model into GPU memory"

### Step 2b: Expose Frontend Service via NodePort

**CRITICAL**: By default, the deployment is internal-only. We must expose it via a NodePort Service to access it on port `30100`.

In [None]:
%%bash
export NAMESPACE=${NAMESPACE:-dynamo}

# Create NodePort service to expose the frontend
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: vllm-frontend-nodeport
  namespace: $NAMESPACE
spec:
  type: NodePort
  selector:
    nvidia.com/dynamo-component: Frontend
    nvidia.com/dynamo-graph-deployment-name: vllm-disagg-router
  ports:
  - port: 8000
    targetPort: 8000
    nodePort: 30100
    protocol: TCP
    name: http
EOF

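# Look up the node IP in this cell: each %%bash cell runs in its own shell,
# so NODE_IP from the earlier configuration cell is not available here
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

echo ""
echo "✓ Service exposed on NodePort 30100"
echo "  Access URL: http://$NODE_IP:30100"
echo ""
echo "Note: The service will be accessible once the frontend pod is running."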
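
The same Service can be generated from Python, which makes it easy to validate the port before applying. This is a sketch under stated assumptions: `nodeport_service` is a hypothetical helper, and the range check uses the Kubernetes default NodePort range (30000-32767), which a cluster administrator may have changed.

```python
import json

# Kubernetes default NodePort range; clusters can be configured differently.
NODE_PORT_RANGE = range(30000, 32768)


def nodeport_service(name, namespace, selector, port, node_port):
    """Build a NodePort Service manifest as a plain dict."""
    if node_port not in NODE_PORT_RANGE:
        raise ValueError(f"nodePort {node_port} is outside the default range 30000-32767")
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "NodePort",
            "selector": selector,
            "ports": [{
                "name": "http",
                "protocol": "TCP",
                "port": port,
                "targetPort": port,
                "nodePort": node_port,
            }],
        },
    }


manifest = nodeport_service(
    name="vllm-frontend-nodeport",
    namespace="dynamo",
    selector={
        "nvidia.com/dynamo-component": "Frontend",
        "nvidia.com/dynamo-graph-deployment-name": "vllm-disagg-router",
    },
    port=8000,
    node_port=30100,
)
print(json.dumps(manifest, indent=2))
```

`kubectl apply -f -` accepts JSON as well as YAML, so the printed manifest could be piped to it directly.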

### Step 3: Monitor Deployment Progress


In [None]:
%%bash
export NAMESPACE=${NAMESPACE:-dynamo}

echo "Expected pods:"
echo "  - vllm-disagg-router-frontend-xxxxx     (Frontend)"
echo "  - vllm-disagg-router-vllmprefillworker-xxxxx (Prefill Worker on GPU 0)"
echo "  - vllm-disagg-router-vllmdecodeworker-xxxxx  (Decode Worker on GPU 1)"
echo ""
echo "Waiting for deployment pods to be ready (this may take 4-6 minutes for first run)..."
echo ""

# Wait for pods to be ready (timeout after 10 minutes for model download)
TIMEOUT=600
ELAPSED=0
INTERVAL=10

while [ $ELAPSED -lt $TIMEOUT ]; do
    # Get vllm pod status
    VLLM_PODS=$(kubectl get pods -n $NAMESPACE 2>/dev/null | grep vllm || true)
    NOT_READY=$(echo "$VLLM_PODS" | grep -v "1/1.*Running" | grep -v "^$" | wc -l)
    TOTAL=$(echo "$VLLM_PODS" | grep -v "^$" | wc -l)
    READY=$((TOTAL - NOT_READY))
    
    echo "[$ELAPSED s] VLLM Pods ready: $READY/$TOTAL"
    kubectl get pods -n $NAMESPACE | grep -E '(NAME|vllm)'
    
    # Check if all vllm pods are ready
    if [ $NOT_READY -eq 0 ] && [ $TOTAL -ge 3 ]; then
        echo ""
        echo "✓ All deployment pods are ready!"
        echo ""
        echo "DynamoGraphDeployment status:"
        kubectl get dynamographdeployment -n $NAMESPACE
        break
    fi
    
    echo ""
    echo "Waiting for pods to be ready... (checking again in ${INTERVAL}s)"
    echo "💡 Tip: Model download happens on first run and may take 3-5 minutes"
    echo ""
    sleep $INTERVAL
    ELAPSED=$((ELAPSED + INTERVAL))
done

if [ $ELAPSED -ge $TIMEOUT ]; then
    echo "⚠️  Timeout waiting for pods to be ready"
    # Label key matches the selector used by the NodePort Service in Step 2b
    echo "Check logs: kubectl logs -l nvidia.com/dynamo-component=VllmPrefillWorker -n $NAMESPACE --tail=50"
fi

### Step 4: View Worker Logs (Optional)

While waiting for the deployment, you can watch the model loading progress in both workers.

**Note**: In disaggregated serving, both the prefill and decode workers load the model separately.


In [None]:
%%bash
export NAMESPACE=${NAMESPACE:-dynamo}

# Get logs from worker pods
PREFILL_POD=$(kubectl get pods -n $NAMESPACE | grep vllmprefillworker | awk '{print $1}' | head -1)
DECODE_POD=$(kubectl get pods -n $NAMESPACE | grep vllmdecodeworker | awk '{print $1}' | head -1)

if [ -n "$PREFILL_POD" ]; then
    echo "=== Prefill Worker Logs (GPU 0): $PREFILL_POD ==="
    echo "Look for:"
    echo "  - 'Loading model weights...' (downloading)"
    echo "  - 'Model loading took X.XX GiB' (loaded)"
    echo ""
    kubectl logs $PREFILL_POD -n $NAMESPACE --tail=30
    echo ""
fi

if [ -n "$DECODE_POD" ]; then
    echo "=== Decode Worker Logs (GPU 1): $DECODE_POD ==="
    echo "Look for:"
    echo "  - 'Loading model weights...' (downloading)"
    echo "  - 'Model loading took X.XX GiB' (loaded)"
    echo ""
    kubectl logs $DECODE_POD -n $NAMESPACE --tail=30
fi

if [ -z "$PREFILL_POD" ] && [ -z "$DECODE_POD" ]; then
    echo "Worker pods not found yet, please wait and try again"
fi

## Section 4: Testing and Validation

### Objectives
- Reach the frontend through the NodePort service created in Section 3
- Send test requests to the deployment
- Verify OpenAI API compatibility
- Test streaming and non-streaming responses

### Testing Strategy
Once your deployment is running (`1/1 Ready`), you'll:
1. Connect to the frontend via NodePort (already exposed on port 30100)
2. Test with curl commands
3. Verify response format and functionality

### Step 1: Get Frontend URL

The frontend is already exposed via NodePort on port 30100 (set up in Step 2b). Get the connection URL:

In [None]:
%%bash
# Get the node IP
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

echo "Node IP: $NODE_IP"
echo ""
echo "Frontend URL: http://$NODE_IP:30100"
echo ""
echo "✓ Access the frontend at: http://$NODE_IP:30100"

### Step 2: Test the `/v1/models` Endpoint


In [None]:
%%bash
# Get node IP
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

# Test the /v1/models endpoint
curl http://$NODE_IP:30100/v1/models
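
The same check can be made from Python using only the standard library. The sketch assumes the frontend returns the usual OpenAI-style list response (`{"object": "list", "data": [{"id": ...}]}`); `fetch_models` and `model_ids` are hypothetical helper names, and the sample payload is illustrative rather than captured output.

```python
import json
import urllib.request


def fetch_models(base_url, timeout=10.0):
    """GET {base_url}/v1/models and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as response:
        return json.load(response)


def model_ids(payload):
    """Extract model ids from an OpenAI-style model list."""
    return [model["id"] for model in payload.get("data", [])]


# Illustrative payload in the assumed response shape.
sample_payload = {
    "object": "list",
    "data": [{"id": "Qwen/Qwen2.5-1.5B-Instruct", "object": "model"}],
}
print(model_ids(sample_payload))
```

Against a live deployment, `model_ids(fetch_models("http://<node-ip>:30100"))` should include the model name used in the chat requests that follow.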

### Step 3: Simple Non-Streaming Chat Completion


In [None]:
%%bash
# Get node IP
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

# Test non-streaming chat completion
curl http://$NODE_IP:30100/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "stream": false,
    "max_tokens": 50
  }'
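
For use from Python cells, the curl call above maps to a small `urllib` request. This is a minimal sketch: `build_chat_request` and `first_reply` are hypothetical helpers, and the response shape (`choices[0].message.content`) is the standard OpenAI chat completion format, which the frontend is assumed to follow.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=50):
    """Build a non-streaming POST request for /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response):
    """Return the assistant text from an OpenAI-style chat completion."""
    return response["choices"][0]["message"]["content"]


# Placeholder address; substitute your node IP.
request = build_chat_request(
    "http://10.0.0.5:30100", "Qwen/Qwen2.5-1.5B-Instruct", "Hello! How are you?"
)
print(request.full_url, request.get_method())
```

To send it, pass the request to `urllib.request.urlopen(request)`, parse the body with `json.load`, and hand the result to `first_reply`.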

### Step 4: Test Streaming Response


In [None]:
%%bash
# Get node IP
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

# Test streaming chat completion
curl http://$NODE_IP:30100/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Write a short poem about AI"}],
    "stream": true,
    "max_tokens": 100
  }'
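
With `"stream": true`, the response arrives as server-sent events: lines of the form `data: {json}` and a final `data: [DONE]`. The sketch below reassembles the text from such lines. It assumes the OpenAI streaming chunk format (`choices[].delta.content`); `iter_stream_text` is a hypothetical helper and the sample lines are illustrative, not captured output.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Illustrative stream in the assumed format.
sample_stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]

print("".join(iter_stream_text(sample_stream)))
```

The arrival time of the first fragment is the basis of the TTFT metric measured in Section 5.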

### Step 5: Test with Different Parameters


In [None]:
%%bash
# Get node IP
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

# Test with different parameters
curl http://$NODE_IP:30100/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Explain quantum computing in one sentence"}],
    "stream": false,
    "temperature": 0.7,
    "max_tokens": 100,
    "top_p": 0.9
  }'

## Section 5: Benchmarking with AI-Perf

### Objectives
- Install and configure AI-Perf benchmarking tool
- Run performance benchmarks against your Kubernetes deployment
- Analyze throughput, latency, and token metrics
- Compare performance across different configurations

### Metrics to Measure
- Throughput (requests/second, tokens/second)
- Latency (TTFT - Time To First Token, TPOT - Time Per Output Token, end-to-end)
- GPU utilization
- KV cache efficiency

### Benchmarking Setup
You'll run AI-Perf from a terminal against the NodePort-exposed frontend, simulating:
- Different concurrency levels (fixed concurrent requests)
- Request rate patterns (requests per second)
- Various workload characteristics

### Step 1: Install AI-Perf (if not already installed)


In [None]:
import subprocess
import sys

# Ensure pip is available in the venv
print("Setting up pip in venv...")
subprocess.run([sys.executable, "-m", "ensurepip", "--default-pip"], 
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# Install AI-Perf in the venv
print("Installing AI-Perf...")
result = subprocess.run([sys.executable, "-m", "pip", "install", "aiperf", "-q"])

if result.returncode == 0:
    print("✓ AI-Perf installed successfully")
    # Verify aiperf can be imported
    verify = subprocess.run([sys.executable, "-c", "import aiperf"], capture_output=True)
    if verify.returncode == 0:
        print("  aiperf is ready to use")
else:
    print("⚠️  pip exited with a non-zero status; check the output above before running benchmarks")

### Step 2: Run Baseline Benchmark (Low Concurrency)

**⚠️ IMPORTANT: Run benchmarks in a TERMINAL, not in notebook cells (aiperf can crash the kernel).**

**To run this benchmark:**

1. Open a new terminal (File → New → Terminal in JupyterLab)
2. Copy and paste this command:

```
cd ~/dynamo-grove-brev/resources && ./run-benchmark.sh baseline
```

This will run a low concurrency benchmark (1 concurrent request, 100 total requests) and display metrics including:
- Time to First Token (TTFT)
- Token throughput
- Request latency
- Percentile distributions (p50, p90, p99)
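
To make the percentile columns concrete, here is one common way to compute them, the nearest-rank method. This is an assumption for illustration: AI-Perf may use a different interpolation, and the latency values are made up.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up request latencies in milliseconds; one slow outlier.
latencies_ms = [120, 95, 110, 400, 105, 98, 101, 130, 99, 102]

for p in (50, 90, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

The single 400 ms outlier leaves p50 and p90 unaffected but dominates p99, which is why tail percentiles are reported alongside the median.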

### Step 3: Run Benchmark with Higher Concurrency

**⚠️ IMPORTANT: Run benchmarks in a TERMINAL, not in notebook cells.**

**To run this benchmark:**

1. Open a new terminal (File → New → Terminal in JupyterLab)
2. Copy and paste this command:

```
cd ~/dynamo-grove-brev/resources && ./run-benchmark.sh high
```

This will run a high concurrency benchmark (4 concurrent requests, 200 total requests) to stress test the system and see how it handles multiple simultaneous users.

### Step 4: Run Benchmark with Request Rate

**⚠️ IMPORTANT: Run benchmarks in a TERMINAL, not in notebook cells.**

**To run this benchmark:**

1. Open a new terminal (File → New → Terminal in JupyterLab)
2. Copy and paste this command:

```
cd ~/dynamo-grove-brev/resources && ./run-benchmark.sh rate
```

This will run a request rate benchmark (10 requests per second, 200 total requests) to simulate a steady stream of users hitting the API at a controlled rate.

### Step 5: Analyze Results

Review the benchmark outputs from your terminal runs. Key metrics to look for:
- **Throughput**: requests/second and tokens/second
- **TTFT (Time To First Token)**: How quickly does the first token appear?
- **TPOT (Time Per Output Token)**: Generation speed
- **End-to-end latency**: Total request time
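
The relationship between these metrics can be written down in a few lines. The sketch uses common definitions, stated here as assumptions: TTFT is the time from sending the request to the first token, and TPOT is the remaining generation time divided by the number of tokens after the first. AI-Perf's exact formulas may differ, and the timestamps are made up.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    sent_at: float         # seconds
    first_token_at: float  # seconds
    finished_at: float     # seconds
    output_tokens: int


def ttft(timing):
    """Time To First Token: dominated by queueing and prefill."""
    return timing.first_token_at - timing.sent_at


def e2e_latency(timing):
    """End-to-end latency: total request time."""
    return timing.finished_at - timing.sent_at


def tpot(timing):
    """Time Per Output Token: average gap between tokens after the first (decode speed)."""
    if timing.output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (timing.finished_at - timing.first_token_at) / (timing.output_tokens - 1)


# Made-up example: first token after 0.25 s, 101 tokens finished at 2.25 s.
example = RequestTiming(sent_at=0.0, first_token_at=0.25, finished_at=2.25, output_tokens=101)
print(ttft(example), tpot(example), e2e_latency(example))
```

In a disaggregated deployment, TTFT mostly reflects the prefill worker and TPOT mostly reflects the decode worker, so the two numbers point to different bottlenecks.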


---

**Next Steps:**
- Continue to **Lab 2: Monitoring and Observability** to add dashboards and metrics
- Or skip to **Lab 3: Distributed Serving** for multi-GPU deployments
- See **Appendix** for cleanup commands (only if completely done with all labs)

## Troubleshooting

### Check Pod Status


In [None]:
%%bash
export NAMESPACE=${NAMESPACE:-dynamo}

# Check all pods in your namespace
kubectl get pods -n $NAMESPACE

echo ""
echo "# To describe a specific pod to see errors:"
echo "# kubectl describe pod <pod-name> -n $NAMESPACE"

### View Pod Logs


In [None]:
%%bash
export NAMESPACE=${NAMESPACE:-dynamo}

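# View logs from a specific component
# (label key matches the selector used by the NodePort Service in Section 3, Step 2b)
echo "Frontend logs:"
kubectl logs -l nvidia.com/dynamo-component=Frontend -n $NAMESPACE --tail=50

echo ""
echo "Worker logs:"
kubectl logs -l nvidia.com/dynamo-component=VllmDecodeWorker -n $NAMESPACE --tail=50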

### Check Deployment Status


In [None]:
%%bash
export NAMESPACE=${NAMESPACE:-dynamo}

# Check DynamoGraphDeployment status
echo "DynamoGraphDeployment status:"
kubectl describe dynamographdeployment vllm-disagg-router -n $NAMESPACE

echo ""
echo "Operator logs:"
kubectl logs -l app.kubernetes.io/name=dynamo-operator -n $NAMESPACE --tail=50

### Check Recent Events


In [None]:
%%bash
export NAMESPACE=${NAMESPACE:-dynamo}

# View recent events in your namespace
kubectl get events -n $NAMESPACE --sort-by=.lastTimestamp | tail -20

### Common Issues

1. **ImagePullBackOff**: Check if you have access to NGC containers. Verify image version is correct.
2. **Pods stuck in Pending**: Check if GPU resources are available: `kubectl describe pod <pod-name> -n $NAMESPACE`
3. **Model download slow**: First run takes longer due to model download. Check worker logs for progress.
4. **Frontend not reachable on port 30100**: Make sure the frontend pod is `1/1 Running` and that the NodePort service from Section 3, Step 2b exists: `kubectl get svc vllm-frontend-nodeport -n $NAMESPACE`

---

## Known Issues (v0.8.0)

**⚠️ Important Notes for Dynamo v0.8.0:**

1. **Validation Webhook Timing**: In rare cases, validation webhooks may reject valid configurations during high cluster load. If deployment fails with validation errors, wait 30 seconds and retry.

2. **K8s-native Discovery Cold Start**: First request after deployment may take 5-10s longer as EndpointSlices propagate. Subsequent requests are fast.

3. **TCP Transport Port Conflicts**: If using custom ports, ensure they don't conflict with existing services. Default ports (8000, 8001) are usually safe.

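4. **Model Loading on Multiple GPUs**: For multi-GPU setups, ensure sufficient shared memory (`/dev/shm`) is available. Note that `--shm-size` is a Docker flag, not a pod spec field; in Kubernetes, mount an `emptyDir` volume with `medium: Memory` (optionally with a `sizeLimit` such as `2Gi`) at `/dev/shm` in the worker pod if needed.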

**Workarounds:**
- For webhook issues: Add `--wait --timeout=5m` to helm installs
- For discovery delays: Add readiness probe with longer `initialDelaySeconds`
- Check [GitHub Issues](https://github.com/ai-dynamo/dynamo/issues) for latest updates

**Fixed in v0.8.1+:** Many of these issues are addressed in patch releases. Check release notes for updates.

---

## Summary

### What You Learned
- ✅ How to set up a namespace-scoped Dynamo deployment on Kubernetes
- ✅ Kubernetes-based disaggregated deployment architecture
- ✅ Creating and managing DynamoGraphDeployment resources
- ✅ Backend engine deployment (vLLM)
- ✅ Testing with OpenAI-compatible API
- ✅ Performance benchmarking with AI-Perf

### Key Takeaways
- Namespace-scoped operators enable safe multi-tenant deployments
- Disaggregated serving separates prefill and decode for optimized resource utilization
- KV-cache routing provides intelligent load balancing across replicas
- DynamoGraphDeployment CRD simplifies complex inference deployments
- AI-Perf provides comprehensive performance insights

### Next Steps
- **(Optional)** Complete the **Monitoring Extension** (`lab1-monitoring.md`) to set up Prometheus and Grafana for observability
- In **Lab 2**, you'll explore advanced optimizations and use AIConfigurator to optimize configurations for larger models

---

## Appendix: Step-by-Step Commands

This appendix provides complete commands for each section. Use these as a reference during the lab.

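**Note for MicroK8s users:** Replace `kubectl` with `microk8s kubectl` in all commands below, or define an alias in your terminal shell. An alias set inside a `%%bash` cell applies only to that cell's shell, and non-interactive bash does not expand aliases by default, so the cell below will not affect other notebook cells: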


In [None]:
%%bash
alias kubectl='microk8s kubectl'

### A0. Troubleshooting Pod Issues

If your pods are in `Error` or `CrashLoopBackOff` state, use this comprehensive diagnostic:

In [None]:
%%bash
export NAMESPACE=${NAMESPACE:-dynamo}

echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE | grep vllm
echo ""

echo "=== GPU Availability ==="
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.nvidia\\.com/gpu,GPU-Allocatable:.status.allocatable.nvidia\\.com/gpu
echo ""

echo "=== Checking Secrets ==="
kubectl get secret hf-token-secret -n $NAMESPACE &>/dev/null && echo "✓ HF token secret exists" || echo "✗ HF token secret missing!"
kubectl get secret ngc-secret -n $NAMESPACE &>/dev/null && echo "✓ NGC secret exists" || echo "✗ NGC secret missing!"
echo ""

echo "=== Prefill Worker Logs (last 30 lines) ==="
PREFILL_POD=$(kubectl get pods -n $NAMESPACE | grep vllmprefillworker | awk '{print $1}' | head -1)
if [ -n "$PREFILL_POD" ]; then
    kubectl logs $PREFILL_POD -n $NAMESPACE --tail=30
else
    echo "No prefill pod found"
fi
echo ""

echo "=== Decode Worker Logs (last 30 lines) ==="
DECODE_POD=$(kubectl get pods -n $NAMESPACE | grep vllmdecodeworker | awk '{print $1}' | head -1)
if [ -n "$DECODE_POD" ]; then
    kubectl logs $DECODE_POD -n $NAMESPACE --tail=30
else
    echo "No decode pod found"
fi
echo ""

echo "=== Common Issues ==="
echo "1. If 'insufficient gpu' error: You need 2 GPUs for disaggregated serving"
echo "2. If 'HF_TOKEN' error: Make sure you created the hf-token-secret"
echo "3. If 'ImagePullBackOff': Check NGC secret and credentials"
echo "4. If model download errors: Check network connectivity to huggingface.co"
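
To narrow the pod listing to the pods that need attention, the status column can be filtered. This is a minimal sketch, assuming the default `kubectl get pods` column order (NAME, READY, STATUS, ...); `unhealthy_pods` is a hypothetical helper name, not something the lab provides.

```bash
# unhealthy_pods: read `kubectl get pods --no-headers` output on stdin and
# print "<name> <status>" for every pod that is not Running or Completed.
unhealthy_pods() {
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

A possible invocation is `kubectl get pods -n "$NAMESPACE" --no-headers | unhealthy_pods`. A pod can report `Running` without being ready, so the READY column is still worth checking.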

### A1. Environment Setup


In [None]:
%%bash
# Verify kubectl is installed and configured
kubectl version --client
kubectl cluster-info

# Set your configuration
export NAMESPACE="dynamo"
export RELEASE_VERSION="0.8.0"     # Dynamo version
export HF_TOKEN="your_hf_token"    # Your HuggingFace token
export CACHE_PATH="/data/huggingface-cache"  # Shared cache path

# Create your personal namespace
kubectl create namespace ${NAMESPACE}

# Verify namespace was created
kubectl get namespace ${NAMESPACE}

# Check GPU nodes are available (optional)
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.nvidia\\.com/gpu
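
Because `HF_TOKEN` above is a placeholder, it is easy to create the secret later with a value that can never authenticate. A small guard run before the secret step can catch that early. This is a sketch: `check_hf_token` is a hypothetical helper, and it only checks for an empty value and for the literal placeholder `your_hf_token`.

```bash
# check_hf_token: return non-zero if the token is empty or still the
# placeholder from the setup cell. It does not contact HuggingFace.
check_hf_token() {
    local token="$1"
    if [ -z "$token" ]; then
        echo "HF_TOKEN is empty" >&2
        return 1
    fi
    if [ "$token" = "your_hf_token" ]; then
        echo "HF_TOKEN is still the placeholder value" >&2
        return 1
    fi
    return 0
}
```

For example: `check_hf_token "${HF_TOKEN:-}" || echo "Fix HF_TOKEN before creating hf-token-secret"`.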

### A2. Install Dynamo Platform (Namespace-Scoped)


In [None]:
%%bash
# Each %%bash cell starts a fresh shell, so restore the defaults used below
export RELEASE_VERSION=${RELEASE_VERSION:-0.8.0}
export NAMESPACE=${NAMESPACE:-dynamo}

# Step 1: Check if CRDs are already installed (cluster-wide)
if kubectl get crd dynamographdeployments.nvidia.com &>/dev/null && \
   kubectl get crd dynamocomponentdeployments.nvidia.com &>/dev/null; then
    echo "✓ CRDs already installed"
else
    echo "⚠️  CRDs not found. Ask instructor to install them, or run:"
    echo "helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz"
    echo "helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default"
fi

# Step 2: Download Dynamo platform helm chart
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz

# Step 3: Install namespace-scoped Dynamo platform
# IMPORTANT: --set dynamo-operator.namespaceRestriction.enabled=true restricts operator to this namespace
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE} \
  --set dynamo-operator.namespaceRestriction.enabled=true

# Step 4: Wait for platform pods to be ready (~2-3 minutes)
echo "Waiting for platform pods to be ready..."
kubectl wait --for=condition=ready pod \
  --all \
  --namespace ${NAMESPACE} \
  --timeout=300s

# Step 5: Verify platform is running
kubectl get pods -n ${NAMESPACE}
# You should see: dynamo-operator, etcd, and nats pods in Running state

# Step 6: Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  --namespace ${NAMESPACE}

# Verify secret was created
kubectl get secret hf-token-secret -n ${NAMESPACE}

### A3. Deploy Your First Model

Create a deployment YAML file `disagg_router.yaml`:

```yaml
# disagg_router.yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-disagg-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${RELEASE_VERSION}
      envs:
        - name: DYN_ROUTER_MODE
          value: disaggregated
    VllmPrefillWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-disagg-router
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      envs:
        - name: DYN_LOG
          value: "info"
      extraPodSpec:
        volumes:
        - name: local-model-cache
          hostPath:
            path: ${CACHE_PATH}  # Defaults to /data/huggingface-cache
            type: DirectoryOrCreate
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${RELEASE_VERSION}
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
          volumeMounts:
          - name: local-model-cache
            mountPath: /root/.cache
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct --tensor-parallel-size 1
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-disagg-router
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      envs:
        - name: DYN_LOG
          value: "info"
      extraPodSpec:
        volumes:
        - name: local-model-cache
          hostPath:
            path: ${CACHE_PATH}  # Defaults to /data/huggingface-cache
            type: DirectoryOrCreate
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${RELEASE_VERSION}
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
          volumeMounts:
          - name: local-model-cache
            mountPath: /root/.cache
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct --tensor-parallel-size 1
```
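
Note that `kubectl apply` does not expand shell variables inside a manifest. If the file is created by hand with `${RELEASE_VERSION}` and `${CACHE_PATH}` left in, those placeholders reach the cluster literally. One way to handle this is to render the file first. The sketch below assumes both variables are set and that their values contain no `|`, `&`, or backslash characters; `render_manifest` is a hypothetical helper name. Where GNU gettext is installed, `envsubst` does the same job.

```bash
# render_manifest: print the manifest with the two placeholders it uses
# replaced by the current values of RELEASE_VERSION and CACHE_PATH.
render_manifest() {
    local file="$1"
    if [ -z "${RELEASE_VERSION:-}" ] || [ -z "${CACHE_PATH:-}" ]; then
        echo "RELEASE_VERSION and CACHE_PATH must be set" >&2
        return 1
    fi
    # [$] matches a literal dollar sign; '|' is the sed delimiter because
    # CACHE_PATH contains slashes.
    sed -e 's|[$]{RELEASE_VERSION}|'"${RELEASE_VERSION}"'|g' \
        -e 's|[$]{CACHE_PATH}|'"${CACHE_PATH}"'|g' \
        "$file"
}
```

The rendered output can be piped straight to the cluster, for example `render_manifest disagg_router.yaml | kubectl apply -f - --namespace "${NAMESPACE}"`.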

Deploy the model:


In [None]:
%%bash
# Each %%bash cell starts a fresh shell, so restore the default namespace
export NAMESPACE=${NAMESPACE:-dynamo}

# Apply the deployment
kubectl apply -f disagg_router.yaml --namespace ${NAMESPACE}

# Monitor deployment progress
kubectl get dynamographdeployment -n ${NAMESPACE}

# Watch pods starting up (this takes 4-6 minutes for first run)
kubectl get pods -n ${NAMESPACE} -w
# Press Ctrl+C to stop watching

# Check specific pod status
kubectl get pods -n ${NAMESPACE} | grep vllm

# View worker logs to see model loading progress
WORKER_POD=$(kubectl get pods -n ${NAMESPACE} | grep vllmdecodeworker | head -1 | awk '{print $1}')
kubectl logs ${WORKER_POD} -n ${NAMESPACE} --tail=50 --follow

### A4. Test the Deployment


In [None]:
%%bash
# The frontend is exposed via NodePort on port 30100
# Get the node IP
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

echo "Frontend URL: http://$NODE_IP:30100"
echo ""
echo "Quick test commands (run in terminal):"
echo ""
echo "# Test 1: Check available models"
echo "curl http://$NODE_IP:30100/v1/models"
echo ""
echo "# Test 2: Simple chat completion"
echo "curl http://$NODE_IP:30100/v1/chat/completions -H 'Content-Type: application/json' -d '{\"model\": \"Qwen/Qwen2.5-1.5B-Instruct\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}], \"stream\": false, \"max_tokens\": 50}'"
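
Since the workers can take several minutes to load the model on a first run, the test requests may fail simply because the frontend is not serving yet. A polling loop is one way to wait for it. This is a sketch under assumptions: `wait_for_url` is a hypothetical helper, and the defaults of 30 attempts spaced 10 seconds apart are illustrative values, not lab requirements.

```bash
# wait_for_url: poll a URL with curl until it returns a successful HTTP
# status, or give up after a fixed number of attempts.
#   $1 = URL, $2 = max attempts (default 30), $3 = delay in seconds (default 10)
wait_for_url() {
    local url="$1" attempts="${2:-30}" delay="${3:-10}"
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # -s: quiet, -f: treat HTTP errors as failure, --max-time: per-request cap
        if curl -sf -o /dev/null --max-time 5 "$url"; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "not ready after $attempts attempt(s)" >&2
    return 1
}
```

For example, `wait_for_url "http://$NODE_IP:30100/v1/models"` blocks until the models endpoint answers, after which the chat completion request can be sent.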

### A5. Benchmark with AI-Perf


In [None]:
%%bash
# Get node IP
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
FRONTEND_URL="http://$NODE_IP:30100"

echo "Benchmarking frontend at: $FRONTEND_URL"
echo ""

# Install AI-Perf (if not already installed)
pip install aiperf -q

echo "=== Running benchmarks ==="
echo ""

# Run a simple benchmark (adjust parameters as needed)
echo "1. Low concurrency benchmark..."
aiperf profile \
    --log-level warning \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --url $FRONTEND_URL \
    --endpoint-type chat \
    --streaming \
    --concurrency 1 \
    --request-count 100

# Run with higher concurrency
echo ""
echo "2. High concurrency benchmark..."
aiperf profile \
    --log-level warning \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --url $FRONTEND_URL \
    --endpoint-type chat \
    --streaming \
    --concurrency 4 \
    --request-count 200

# Run with request rate
echo ""
echo "3. Request rate benchmark..."
aiperf profile \
    --log-level warning \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --url $FRONTEND_URL \
    --endpoint-type chat \
    --streaming \
    --request-rate 10 \
    --request-count 200
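
The low and high concurrency runs above differ only in their `--concurrency` and `--request-count` values, so a sweep over several concurrency levels can be written as a loop. A minimal sketch, reusing only the `aiperf` flags already shown in this section; `sweep_concurrency` is a hypothetical helper, and using one request count for every level is a simplification.

```bash
# sweep_concurrency: run the same aiperf profile at several concurrency levels.
#   $1 = frontend URL, $2 = request count per run, remaining args = concurrency levels
sweep_concurrency() {
    local url="$1" requests="$2"
    shift 2
    local c
    for c in "$@"; do
        echo "=== concurrency $c ==="
        aiperf profile \
            --log-level warning \
            --model Qwen/Qwen2.5-1.5B-Instruct \
            --url "$url" \
            --endpoint-type chat \
            --streaming \
            --concurrency "$c" \
            --request-count "$requests"
    done
}
```

For example, `sweep_concurrency "$FRONTEND_URL" 100 1 2 4 8` produces four runs whose latency and throughput can be compared side by side.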

### A6. Scale Your Deployment


In [None]:
%%bash
# Each %%bash cell starts a fresh shell, so restore the default namespace
export NAMESPACE=${NAMESPACE:-dynamo}

# Edit your disagg_router.yaml and change replicas from 1 to 2
# Then reapply:
kubectl apply -f disagg_router.yaml --namespace ${NAMESPACE}

# Watch the new worker come online
kubectl get pods -n ${NAMESPACE} -w

# Test that load is distributed (KV-cache routing should work)
# Run multiple requests and check logs from both workers
kubectl logs -l component=VllmDecodeWorker -n ${NAMESPACE} --tail=20
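
As an alternative to editing the file, the replica count can be changed with a JSON merge patch against the field path used in `disagg_router.yaml` (`spec.services.<ServiceName>.replicas`). This is a sketch, assuming the operator reconciles a patched resource the same way it reconciles a re-applied manifest; `replica_patch` is a hypothetical helper that only builds the patch body.

```bash
# replica_patch: print a JSON merge patch that sets the replica count of
# one service in a DynamoGraphDeployment.
#   $1 = service name as written in the manifest, $2 = desired replicas
replica_patch() {
    local service="$1" replicas="$2"
    printf '{"spec":{"services":{"%s":{"replicas":%d}}}}\n' "$service" "$replicas"
}
```

A possible invocation is `kubectl patch dynamographdeployment vllm-disagg-router -n "${NAMESPACE}" --type merge -p "$(replica_patch VllmDecodeWorker 2)"`. Each worker in the manifest requests one GPU, so every extra replica needs a free GPU, and re-applying the unedited file later resets the count to 1.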

### A7. Cleanup (⚠️ ONLY AFTER COMPLETING ALL LABS)

**⚠️ WARNING: DO NOT RUN THIS DURING THE WORKSHOP**

This cleanup is **ONLY** for when you're completely done with:
- ✅ Lab 1: Deployment
- ✅ Lab 2: Monitoring (requires Lab 1 deployment running)
- ✅ Lab 3: Distributed Serving (requires platform installed)

**If you're continuing to Lab 2 or Lab 3, DO NOT run these commands.**

In [None]:
%%bash
# Step 1: Delete the Lab 1 deployment (only after Lab 2 is done)
export NAMESPACE=${NAMESPACE:-dynamo}
kubectl delete dynamographdeployment vllm-disagg-router -n ${NAMESPACE}
kubectl delete svc vllm-frontend-nodeport -n ${NAMESPACE}

# Step 2: Verify pods are terminating
kubectl get pods -n ${NAMESPACE}

# Step 3: (Optional) Complete cleanup after ALL labs
# This removes everything, including the platform.
# Uninstall the Helm release before deleting the namespace it lives in:
# helm uninstall dynamo-platform -n ${NAMESPACE}
# kubectl delete namespace ${NAMESPACE}
# helm uninstall dynamo-crds -n default

### A8. Troubleshooting


In [None]:
%%bash
# Each %%bash cell starts a fresh shell, so restore the default namespace
export NAMESPACE=${NAMESPACE:-dynamo}

# Check pod status
kubectl get pods -n ${NAMESPACE}

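# Describe a pod to see errors
# (replace <pod-name> with a name from the output above; left as-is, bash
# treats the angle brackets as redirections and the command fails)
kubectl describe pod <pod-name> -n ${NAMESPACE}

# View logs from a specific pod (same substitution applies)
kubectl logs <pod-name> -n ${NAMESPACE}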

# Check DynamoGraphDeployment status
kubectl describe dynamographdeployment vllm-disagg-router -n ${NAMESPACE}

# Check operator logs
kubectl logs -l app.kubernetes.io/name=dynamo-operator -n ${NAMESPACE}

# Check if image pull is working
kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp'