# vLLM Profiler Demo

This notebook demonstrates the complete workflow of the vLLM profiler system using a Kubernetes mutating admission webhook.

## Overview

The vLLM profiler system automatically instruments vLLM pods with PyTorch profiler capabilities:

1. **Mutating Webhook** intercepts pod creation
2. **Injects profiler code** via ConfigMap and environment variables
3. **Auto-loads profiler** when Python starts using `sitecustomize.py`
4. **Instruments vLLM** using import hooks to wrap `Worker.execute_model`
5. **Captures traces** of CPU+CUDA activity and exports profiler output
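
The auto-load and instrumentation steps can be illustrated with a minimal sketch. This is not the project's actual `sitecustomize.py`: it assumes a stand-in module named `fake_worker` (hypothetical) in place of vLLM's worker module, and shows one way a meta-path finder can wrap a method as soon as its module is imported.

```python
import functools
import importlib
import importlib.abc
import pathlib
import sys
import tempfile

TARGET_MODULE = "fake_worker"   # hypothetical stand-in for the vLLM worker module
calls = []                      # records each wrapped call, for illustration only


def wrap_execute_model(module):
    """Replace Worker.execute_model with a wrapper that records each call."""
    original = module.Worker.execute_model

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        calls.append(len(calls) + 1)   # a real profiler would start/stop here
        return original(self, *args, **kwargs)

    module.Worker.execute_model = wrapper


class PatchingLoader(importlib.abc.Loader):
    """Delegates to the real loader, then patches the freshly loaded module."""

    def __init__(self, inner):
        self.inner = inner

    def create_module(self, spec):
        return self.inner.create_module(spec)

    def exec_module(self, module):
        self.inner.exec_module(module)
        wrap_execute_model(module)


class PatchingFinder(importlib.abc.MetaPathFinder):
    """Intercepts the import of TARGET_MODULE only."""

    def find_spec(self, fullname, path, target=None):
        if fullname != TARGET_MODULE:
            return None
        # Ask the remaining finders for the real spec, then swap in our loader.
        for finder in sys.meta_path:
            if finder is self or not hasattr(finder, "find_spec"):
                continue
            spec = finder.find_spec(fullname, path, target)
            if spec is not None and spec.loader is not None:
                spec.loader = PatchingLoader(spec.loader)
                return spec
        return None


# Create the stand-in module on disk so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
pathlib.Path(tmp_dir, "fake_worker.py").write_text(
    "class Worker:\n"
    "    def execute_model(self, x):\n"
    "        return x * 2\n"
)
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
sys.meta_path.insert(0, PatchingFinder())   # what a sitecustomize.py could do

import fake_worker

result = fake_worker.Worker().execute_model(21)
print(result, calls)
```

Because Python imports `sitecustomize` automatically at interpreter startup when it is found on the module search path, placing a hook like this in a directory listed in `PYTHONPATH` is enough to install it without touching the application's code.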

## What We'll Do

This demo follows the same workflow as `test-vllm-integration.sh`:

1. Check prerequisites
2. Deploy the profiler webhook and ConfigMap
3. Create a vLLM pod with profiler instrumentation
4. Wait for the vLLM server to start
5. Send an inference request that generates 200 tokens
6. Verify profiler output in logs
7. Optionally retrieve the trace file
8. Clean up resources

## Step 1: Check Prerequisites

Verify we have access to the Kubernetes cluster and required namespaces.
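
A small preflight check can make a missing CLI obvious before any cluster call is attempted. The sketch below assumes only that the `oc` command used throughout this notebook must be on `PATH`; the helper name `missing_tools` is illustrative.

```python
import shutil


def missing_tools(tools=("oc",)):
    """Return the subset of required command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print(f"Missing required tools: {', '.join(missing)}")
else:
    print("All required tools found")
```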

In [None]:
# Configuration
import os
NAMESPACE = "downstream-llm-d"
POD_NAME = "demo-vllm-profiler"
MODEL = "facebook/opt-125m"  # Small model for testing
VLLM_IMAGE = "vllm/vllm-openai:latest"
MAX_MODEL_LEN = "2048"

os.environ["KUBECONFIG"] = "/home/michey/llmd_aug2025/kubeconfig.llmd.8xh200"
os.environ["NAMESPACE"] = NAMESPACE
os.environ["POD_NAME"] = POD_NAME
os.environ["MODEL"] = MODEL
os.environ["VLLM_IMAGE"] = VLLM_IMAGE
os.environ["MAX_MODEL_LEN"] = MAX_MODEL_LEN

print("Configuration:")
print(f"  Namespace: {NAMESPACE}")
print(f"  Pod name: {POD_NAME}")
print(f"  Model: {MODEL}")
print(f"  Image: {VLLM_IMAGE}")

In [None]:
# Check cluster access
!oc whoami
!oc cluster-info | head -1

In [None]:
# Check if namespace exists
!oc get namespace {NAMESPACE} 2>/dev/null || echo "Namespace {NAMESPACE} not found - will be created during deployment"

## Step 2: Deploy the Profiler

Deploy the mutating webhook and ConfigMap containing the profiler code.

In [None]:
# Deploy profiler (this runs deploy.sh)
!./deploy.sh --skip-build

In [None]:
# Verify webhook is running
!oc get pods -n vllm-profiler

In [None]:
# Check webhook configuration
!oc get mutatingwebhookconfiguration env-injector-webhook -o jsonpath='{.webhooks[0].clientConfig.service.name}' && echo " (webhook service)"
!echo "Target namespace: $(oc get deployment env-injector -n vllm-profiler -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="TARGET_NAMESPACE")].value}')"
!echo "Target labels: $(oc get deployment env-injector -n vllm-profiler -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="TARGET_LABELS")].value}')"

In [None]:
%%bash
oc get configmap env-injector-files -n "$NAMESPACE"
echo "ConfigMap contains:"
oc get configmap env-injector-files -n "$NAMESPACE" -o yaml

## Step 3: Create vLLM Pod

Create a vLLM pod with the label `llm-d.ai/inferenceServing=true` so the webhook will instrument it.

The webhook will automatically:
- Inject `PYTHONPATH=/home/vllm/profiler`
- Mount `sitecustomize.py` and `profiler_config.yaml` from ConfigMap
- Set `VLLM_RPC_TIMEOUT=1800000`
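
A mutating admission webhook returns its changes as a base64-encoded JSON Patch inside an `AdmissionReview` response. The sketch below shows what a patch for the three injections above might look like. It is an illustration, not the webhook's actual code: the volume name `profiler-files` is hypothetical, and it assumes the ConfigMap is mounted at the same directory that `PYTHONPATH` points to.

```python
import base64
import json

PROFILER_DIR = "/home/vllm/profiler"
VOLUME_NAME = "profiler-files"          # hypothetical volume name
CONFIGMAP_NAME = "env-injector-files"   # ConfigMap name used in this notebook


def add_items(ops, obj, key, base_path, items):
    """Append items to obj[key] if the list exists, otherwise create the list."""
    if key in obj:
        ops.extend(
            {"op": "add", "path": f"{base_path}/{key}/-", "value": item}
            for item in items
        )
    else:
        ops.append({"op": "add", "path": f"{base_path}/{key}", "value": items})


def build_patch(pod):
    """Build JSON Patch operations for the first container of a pod."""
    spec = pod["spec"]
    container = spec["containers"][0]
    ops = []
    add_items(ops, container, "env", "/spec/containers/0", [
        {"name": "PYTHONPATH", "value": PROFILER_DIR},
        {"name": "VLLM_RPC_TIMEOUT", "value": "1800000"},
    ])
    add_items(ops, container, "volumeMounts", "/spec/containers/0", [
        {"name": VOLUME_NAME, "mountPath": PROFILER_DIR},
    ])
    add_items(ops, spec, "volumes", "/spec", [
        {"name": VOLUME_NAME, "configMap": {"name": CONFIGMAP_NAME}},
    ])
    return ops


def admission_response(uid, ops):
    """Wrap the patch in the AdmissionReview response the API server expects."""
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(ops).encode()).decode(),
        },
    }


# A pod shaped like the one in this step: it has env entries but no volumes.
pod = {"spec": {"containers": [{"name": "vllm", "env": [{"name": "HOME", "value": "/tmp"}]}]}}
ops = build_patch(pod)
review = admission_response("example-uid", ops)
print(json.dumps(ops, indent=2))
```

A production webhook would also need to handle a pod that already defines `PYTHONPATH`, for example by prepending to the existing value instead of adding a second entry.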

In [None]:
# First, clean up any existing test pod
!oc delete pod {POD_NAME} -n {NAMESPACE} --ignore-not-found=true --wait=false
!sleep 5

In [None]:
%%bash -s "$POD_NAME" "$NAMESPACE" "$VLLM_IMAGE" "$MODEL" "$MAX_MODEL_LEN"
# Create vLLM pod with profiler instrumentation
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: $1
  namespace: $2
  labels:
    llm-d.ai/inferenceServing: "true"
    demo: "vllm-profiler"
spec:
  containers:
  - name: vllm
    image: $3
    env:
    - name: HOME
      value: /tmp
    - name: HF_HOME
      value: /tmp/huggingface
    - name: TRANSFORMERS_CACHE
      value: /tmp/huggingface
    - name: XDG_CACHE_HOME
      value: /tmp/cache
    - name: FLASHINFER_WORKSPACE_DIR
      value: /tmp/flashinfer
    command:
    - python3
    - -m
    - vllm.entrypoints.openai.api_server
    - --model
    - $4
    - --max-model-len
    - "$5"
    - --host
    - "0.0.0.0"
    - --port
    - "8000"
    ports:
    - containerPort: 8000
      name: http
    resources:
      requests:
        memory: "4Gi"
      limits:
        memory: "8Gi"
  restartPolicy: Never
EOF

### Verify Webhook Injection

Check that the webhook successfully injected the profiler configuration:

In [None]:
%%bash
echo "Environment variables:"
oc get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].env[*].name}' | tr ' ' '\n' | grep -E 'PYTHON|VLLM' || echo "  (none found: pod not created yet, or webhook did not inject)"

In [None]:
%%bash
echo "Volume mounts:"
oc get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].volumeMounts[*].mountPath}' | tr ' ' '\n' | grep profiler || echo "  (none found: pod not created yet, or webhook did not inject)"

## Step 4: Wait for vLLM Server to Start

Wait for the pod to be running and the vLLM server to be ready to accept requests.

In [None]:
# Wait for pod to be running (may take a few minutes)
!oc wait --for=condition=Ready pod/{POD_NAME} -n {NAMESPACE} --timeout=300s

In [None]:
# Check for profiler installation in logs
!echo "Checking if profiler was loaded..."
!oc logs {POD_NAME} -n {NAMESPACE} 2>&1 | grep -m 5 -E '\[profiler\]' || echo "Profiler messages not yet visible"

In [None]:
# Wait for vLLM server to respond to /v1/models endpoint
import time
import subprocess

print("Waiting for vLLM server to be ready (checking /v1/models endpoint)...")
max_wait = 600  # 10 minutes
elapsed = 0

while elapsed < max_wait:
    # Probe the server from inside the pod; curl -f fails on HTTP errors
    result = subprocess.run(
        ["oc", "exec", POD_NAME, "-n", NAMESPACE, "--",
         "curl", "-sf", "http://localhost:8000/v1/models"],
        capture_output=True
    )
    if result.returncode == 0:
        print(f"\n✓ vLLM server is ready! (took {elapsed}s)")
        break

    time.sleep(5)
    elapsed += 5
    if elapsed % 30 == 0:
        print(f"  Still waiting... ({elapsed}s elapsed)")
    else:
        print(".", end="", flush=True)
else:
    # The loop ended without a break: the server never answered within max_wait
    print("\n✗ Timeout waiting for server to start")

In [None]:
# Get pod IP for reference
POD_IP = !oc get pod {POD_NAME} -n {NAMESPACE} -o jsonpath='{{.status.podIP}}'
POD_IP = POD_IP[0]
os.environ["POD_IP"] = POD_IP
print(f"Pod IP: {POD_IP}")

## Step 5: Send Inference Request

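Send a single inference request that generates 200 tokens. The profiler is configured for calls 100-150 of `Worker.execute_model`, so the request must drive enough model executions to pass through that range; generating 200 tokens is intended to do so.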
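
The call-range trigger can be sketched as a counter that opens and closes a profiling window. This is an illustration under stated assumptions, not the profiler's real logic: the names `parse_ranges` and `CallRangeGate` are hypothetical, ranges are treated as inclusive at both ends, and the simulation assumes one `execute_model` call per generated token.

```python
def parse_ranges(spec):
    """Parse '100-150' or '50-100,200-300' into a list of (start, end) tuples."""
    ranges = []
    for part in spec.split(","):
        start, end = part.strip().split("-")
        ranges.append((int(start), int(end)))
    return ranges


class CallRangeGate:
    """Counts calls and reports when a profiling window opens or closes."""

    def __init__(self, spec):
        self.ranges = parse_ranges(spec)
        self.count = 0
        self.active = False

    def on_call(self):
        """Advance the counter; return 'start', 'stop', or None."""
        self.count += 1
        inside = any(lo <= self.count <= hi for lo, hi in self.ranges)
        if inside and not self.active:
            self.active = True
            return "start"
        if not inside and self.active:
            self.active = False
            return "stop"
        return None


# Simulate 200 calls against the range used in this demo.
gate = CallRangeGate("100-150")
events = []
for _ in range(200):
    event = gate.on_call()
    if event:
        events.append((gate.count, event))
print(events)
```

With this gating, a request that produces fewer than 100 calls never opens the window, which is why a short completion would leave the profiler silent.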

In [None]:
%%bash
set -x
# Create temporary client pod to send request
echo "Sending inference request to generate 200 tokens..."

oc run demo-curl-client -n $NAMESPACE --image=curlimages/curl:latest --rm -i --restart=Never -- /bin/sh -c "
set -e
echo 'Sending inference request...'
curl -X POST http://$POD_IP:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{
        \"model\": \"$MODEL\",
        \"prompt\": \"Write a detailed story about a brave knight who goes on an adventure:\",
        \"max_tokens\": 200,
        \"temperature\": 0.7
    }'
echo ''
echo 'Request completed successfully'
" || echo "Request may have failed, continuing to check logs..."

## Step 6: Verify Profiler Output

Check the pod logs to see the profiler output with CPU and CUDA timing information.
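
When the logs need to be processed in Python instead of with `sed`, the same marker-based extraction can be sketched as follows. The marker strings match the ones used by the `sed` command in this step; the sample log lines are synthetic placeholders, not real profiler output.

```python
def extract_profiler_blocks(log_text,
                            begin="===== begin profiler output",
                            end="===== end profiler output"):
    """Return every block of lines enclosed by the begin/end markers."""
    blocks = []
    current = None          # None means we are outside a block
    for line in log_text.splitlines():
        if current is None:
            if begin in line:
                current = [line]
        else:
            current.append(line)
            if end in line:
                blocks.append("\n".join(current))
                current = None
    return blocks           # an unterminated block is dropped


# Synthetic log text for illustration only.
sample_log = "\n".join([
    "server starting",
    "===== begin profiler output =====",
    "row 1",
    "row 2",
    "===== end profiler output =====",
    "server still running",
])
blocks = extract_profiler_blocks(sample_log)
print(blocks[0])
```

Unlike `sed`, which prints an unterminated range through to the end of the input, this sketch discards a block whose end marker never appears, so a truncated log yields no partial table.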

In [None]:
# Wait a few seconds for profiler to write output
import time
time.sleep(5)

In [None]:
# Check for profiler installation message
!echo "=== Profiler Installation ==="
!oc logs {POD_NAME} -n {NAMESPACE} 2>&1 | grep "\[profiler\] vLLM profiler installed" || echo "✗ Profiler not loaded"

In [None]:
# Check for profiler start message
!echo "=== Profiler Start ==="
!oc logs {POD_NAME} -n {NAMESPACE} 2>&1 | grep "\[profiler\] Starting profiler" || echo "△ Profiler start message not found (may need more requests)"

In [None]:
# Extract profiler output (first 60 lines)
!echo "=== Profiler Output (Top Operations by CUDA Time) ==="
!oc logs {POD_NAME} -n {NAMESPACE} 2>&1 | sed -n '/===== begin profiler output/,/===== end profiler output/p' | head -60

In [None]:
# Check for trace export message
!echo "=== Trace Export Status ==="
!oc logs {POD_NAME} -n {NAMESPACE} 2>&1 | grep -E "Exported trace to:|Chrome trace export disabled" | tail -1

## Step 7: Optional - Retrieve Trace File

If trace export is enabled, retrieve the Chrome trace JSON file for visualization.
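
Once a trace has been copied locally, it can also be summarized without a browser. The sketch below relies on the Chrome Trace Event Format, in which complete events carry `"ph": "X"` and a duration `dur` in microseconds. The helper name `top_events` and the sample events are illustrative; real traces contain many more fields and can be large.

```python
import json
from collections import defaultdict


def top_events(trace, n=5):
    """Return the n event names with the largest total duration (microseconds)."""
    # The format allows either {"traceEvents": [...]} or a bare list of events.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X" and "dur" in event:
            totals[event["name"]] += event["dur"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Synthetic trace for illustration only.
sample_trace = json.loads("""
{"traceEvents": [
  {"name": "matmul", "ph": "X", "ts": 0,  "dur": 10},
  {"name": "copy",   "ph": "X", "ts": 10, "dur": 5},
  {"name": "matmul", "ph": "X", "ts": 15, "dur": 20},
  {"name": "marker", "ph": "i", "ts": 35}
]}
""")
summary = top_events(sample_trace)
print(summary)
```

To apply it to a retrieved file, load the file with `json.load` and pass the result to `top_events`.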

In [None]:
%%bash
echo "Trace files in pod:"
# Use sh -c so the wildcard is expanded by the shell inside the pod, not locally
oc exec $POD_NAME -n $NAMESPACE -- /bin/sh -c "ls -lh /tmp/trace*.json 2>/dev/null" || echo "No trace files found"

In [None]:
# Retrieve trace file (if exists)
import subprocess
import os

# Get trace filename from pod
result = subprocess.run(
    [
        "oc", "exec", POD_NAME, "-n", NAMESPACE,
        "--", "/bin/sh", "-c", "ls /tmp/trace*.json 2>/dev/null"
    ],
    capture_output=True,
    text=True
)

if result.returncode == 0 and result.stdout.strip():
    trace_file = result.stdout.strip().split('\n')[0]
    local_trace = "./demo-trace.json"
    
    # Pass arguments as a list so paths are not re-parsed by a local shell
    copy_result = subprocess.run(
        ["oc", "cp", f"{NAMESPACE}/{POD_NAME}:{trace_file}", local_trace],
        capture_output=True
    )
    
    if copy_result.returncode == 0:
        print(f"✓ Trace file retrieved: {local_trace}")
        print(f"  Size: {os.path.getsize(local_trace)} bytes")
        print(f"\nTo visualize:")
        print(f"  1. Open Chrome browser")
        print(f"  2. Navigate to chrome://tracing")
        print(f"  3. Click 'Load' and select {local_trace}")
    else:
        print(f"✗ Failed to copy trace file")
else:
    print("No trace files found (trace export may be disabled)")

## Step 8: Cleanup

Remove the test pod (keep the profiler webhook deployed for future use).

In [None]:
# Delete the demo pod (uncomment the two lines below to run)
#!oc delete pod {POD_NAME} -n {NAMESPACE} --ignore-not-found=true
#print(f"\n✓ Demo pod {POD_NAME} deleted")

In [None]:
# Optionally, delete the entire profiler system
# Uncomment the line below to remove webhook and ConfigMap

# !./teardown.sh --force

## Summary

This demo showed the complete vLLM profiler workflow:

✓ **Deployed** mutating webhook and ConfigMap  
✓ **Created** vLLM pod with automatic profiler instrumentation  
✓ **Verified** webhook injected PYTHONPATH and mounted profiler code  
✓ **Sent** inference request to trigger profiling  
✓ **Captured** profiler output with CPU/CUDA timing data  

### Key Features Demonstrated

- **Zero Code Changes**: vLLM source code unchanged
- **Transparent Injection**: Webhook automatically instruments matching pods
- **Import Hook Magic**: `sitecustomize.py` wraps `Worker.execute_model` at import time
- **Configurable Ranges**: Profile specific call ranges (e.g., 100-150)
- **Multi-source Config**: Environment variables override YAML config
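
The precedence rule in the last bullet can be sketched as a layered merge. Everything named here is an assumption for illustration: the `DEMO_PROFILER_` prefix and the keys `ranges` and `export_trace` are hypothetical, and the YAML file is represented by an already-parsed dictionary so that the example needs no third-party parser.

```python
DEFAULTS = {"ranges": "100-150", "export_trace": False}
ENV_PREFIX = "DEMO_PROFILER_"   # hypothetical prefix, not the project's real one


def to_bool(text):
    """Interpret common truthy strings from environment variables."""
    return text.strip().lower() in ("1", "true", "yes", "on")


def load_config(file_config, environ):
    """Merge defaults, file values, and environment overrides, in that order."""
    config = {**DEFAULTS, **file_config}
    for key, current in list(config.items()):
        env_name = ENV_PREFIX + key.upper()
        if env_name in environ:
            raw = environ[env_name]
            # Keep booleans as booleans; everything else stays a string.
            config[key] = to_bool(raw) if isinstance(current, bool) else raw
    return config


yaml_values = {"ranges": "50-100", "export_trace": False}   # as if parsed from YAML
env = {"DEMO_PROFILER_EXPORT_TRACE": "true"}                 # as if from os.environ
config = load_config(yaml_values, env)
print(config)
```

Here the file sets the range and the environment turns on trace export, so the result combines both sources with the environment winning wherever the two overlap.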

### Next Steps

- Customize profiling ranges via annotations: `vllm.profiler/ranges="50-100,200-300"`
- Enable trace export: `vllm.profiler/export-trace="true"`
- Visualize traces in Chrome: `chrome://tracing`
- Profile production workloads: Add label `llm-d.ai/inferenceServing=true` to any vLLM pod