# Lab 8 – Production LLM Deployment with vLLM (OpenAI-Compatible)

In this lab, you will deploy an optimized LLM using **vLLM**, a widely used open-source inference server for production workloads. vLLM provides OpenAI-compatible endpoints along with continuous batching, PagedAttention, and efficient memory management.

## Objectives

- Export an Unsloth-optimized model to HuggingFace format for vLLM deployment
- Deploy the model using vLLM's production-grade OpenAI-compatible server
- Test the endpoint with the OpenAI SDK
- Understand production deployment best practices (batching, GPU utilization, scaling)

## Why vLLM for Production?

vLLM is a widely adopted open-source engine for production LLM serving:
- ⚡ **High throughput**: Continuous batching + PagedAttention
- 🎯 **Low latency**: Optimized CUDA kernels
- 📊 **Resource efficient**: Up to 24x higher throughput than HuggingFace Transformers in the vLLM authors' published benchmarks
- 🔌 **OpenAI-compatible**: Drop-in replacement for OpenAI API
- 📈 **Production-ready**: Request queuing, multi-GPU support, metrics

This is what you'd actually use in production, not a custom FastAPI wrapper.


In [None]:
# Install Unsloth for model optimization and vLLM for production serving
!wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

# Install vLLM - the production inference server
%pip install vllm openai

print("✅ Unsloth and vLLM installation complete!")
print("⚠️ IMPORTANT: Requires GPU runtime (T4 or better)")
print("🔄 Now restart runtime before proceeding.")

In [None]:
# 1️⃣ Prepare model for vLLM deployment

from unsloth import FastLanguageModel
import torch

# Load and prepare your optimized model
# In a real scenario, this would be your fine-tuned model from Labs 5, 6, or 7
model_name = "unsloth/Qwen2.5-1.5B-Instruct"  # Using smaller model for Colab
save_directory = "./vllm_model"

print("📦 Loading optimized model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    dtype=torch.float16,
    load_in_4bit=False,  # Load FP16 weights so the merged 16-bit export below can be served by vLLM as-is
)

FastLanguageModel.for_inference(model)

# Export to HuggingFace format for vLLM
print(f"\n💾 Exporting model to {save_directory}...")
model.save_pretrained_merged(save_directory, tokenizer, save_method="merged_16bit")

print("\n✅ Model exported successfully!")
print(f"📁 Model location: {save_directory}")

## Model Ready for vLLM

**Next steps:**
1. vLLM will load this model
2. vLLM provides production-grade features:
   - Continuous batching (processes requests as they arrive)
   - PagedAttention (efficient KV cache management)
   - OpenAI-compatible API (drop-in replacement)
   - Request queuing and prioritization
   - Multi-GPU support (tensor parallelism)
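
Before starting the server, it can help to confirm that the export directory looks complete. The following is a minimal sketch, not part of Unsloth or vLLM: `check_export` is a hypothetical helper, and the file patterns it looks for (`config.json`, weight shards, tokenizer files) are assumptions about what a typical HuggingFace-format export contains.

```python
from pathlib import Path


def check_export(model_dir):
    """Return a list of problems found in an exported model directory."""
    path = Path(model_dir)
    if not path.is_dir():
        return [f"{model_dir} is not a directory"]

    problems = []
    # config.json describes the model architecture
    if not (path / "config.json").is_file():
        problems.append("missing config.json")
    # Weights are assumed to be stored as .safetensors (or older .bin) files
    if not list(path.glob("*.safetensors")) and not list(path.glob("*.bin")):
        problems.append("no weight files (*.safetensors or *.bin)")
    # At least one tokenizer file is assumed to be present
    if not list(path.glob("tokenizer*")):
        problems.append("no tokenizer files")
    return problems


issues = check_export("./vllm_model")
print("Export looks complete" if not issues else f"Problems: {issues}")
```

An empty list means none of the assumed files are missing. It does not prove the weights are valid, only that the directory has the expected shape.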

In [None]:
# 2️⃣ Start vLLM server in background

import subprocess
import sys
import time

import requests

print("🚀 Starting vLLM server (v1 engine - auto-optimized)...")
print("⏱️  Server startup takes ~30-60 seconds (loading model into GPU)...\n")

# Send server output to a log file. Pipes that are never read can fill up
# and block the server process.
vllm_log = open("vllm_server.log", "w")

vllm_process = subprocess.Popen([
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model", save_directory,
    "--host", "0.0.0.0",
    "--port", "8000",
    "--dtype", "float16",
    "--max-model-len", "2048",
    "--gpu-memory-utilization", "0.8",  # Use 80% of GPU memory
    # Note: v1 engine auto-handles chunked prefill and scheduler steps
    # Don't add --enable-chunked-prefill or --num-scheduler-steps
    # (on some vLLM versions these force a v0 fallback)
],
    stdout=vllm_log,
    stderr=subprocess.STDOUT,
)

# Wait for server to start
print("⏳ Waiting for server to become ready...")
max_wait = 90
start = time.time()

while time.time() - start < max_wait:
    # Stop waiting if the server process has already exited
    if vllm_process.poll() is not None:
        print("❌ vLLM server exited during startup - see vllm_server.log")
        break
    try:
        response = requests.get("http://localhost:8000/health", timeout=1)
        if response.status_code == 200:
            print(f"✅ vLLM server ready! (took {time.time() - start:.1f}s)")
            break
    except requests.exceptions.RequestException:
        pass  # Server is not accepting connections yet
    time.sleep(2)
else:
    print("⚠️  Server startup timeout - may need more time (see vllm_server.log)")

## vLLM Production Server Running (v1 Engine)!

**Server capabilities:**
- OpenAI-compatible API at `http://localhost:8000/v1`
- Continuous batching (automatically batches concurrent requests)
- PagedAttention (efficient memory management)
- Streaming support
- Token usage tracking
- Health monitoring at `/health`
- Metrics at `/metrics` (Prometheus-compatible)

**Production features enabled:**
- vLLM v1 engine: Auto-optimized chunked prefill & scheduling
- GPU memory utilization: 80%
- Max sequence length: 2048 tokens
- Dtype: float16 (the T4 GPU used in this lab does not support bfloat16)

**v1 Engine improvements over v0:**
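- Higher throughput in the vLLM team's own benchmarks (the original figure quoted here is 1.5-2x; actual gains depend on model and workload)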
- Automatic parameter tuning
- Better memory efficiency
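
The server keeps running in the background and holds GPU memory until it is stopped. Once you have finished the tests in this lab, a helper along the following lines could shut it down. This is a sketch: `stop_server` is a hypothetical name, and it assumes the server cleans up its worker processes when it receives a termination signal.

```python
import subprocess


def stop_server(process, timeout=30):
    """Stop a subprocess.Popen server: terminate politely, then kill if needed."""
    if process.poll() is not None:
        return process.returncode  # already exited
    process.terminate()  # sends SIGTERM on Linux/macOS
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()  # force-stop a server that ignored the request
        process.wait()
    return process.returncode
```

In this notebook, the call would be `stop_server(vllm_process)`, run after the last test cell.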

In [None]:
# 3️⃣ Test vLLM server with OpenAI SDK

from openai import OpenAI
import time

# Initialize OpenAI client pointing to our vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none"  # vLLM doesn't require API key by default
)

print("🧪 Testing vLLM Production Server...\n")
print("="*70)

# Test 1: Simple completion
print("\n📋 TEST 1: Simple Chat Completion")
print("-"*70)

start_time = time.time()
response = client.chat.completions.create(
    model=save_directory,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain quantization in LLMs in one sentence."}
    ],
    temperature=0.7,
    max_tokens=100
)
elapsed = time.time() - start_time

print(f"⏱️  Response time: {elapsed:.2f}s")
print(f"\n💬 User: Explain quantization in LLMs in one sentence.")
print(f"🤖 Assistant: {response.choices[0].message.content}")
print(f"\n📊 Token usage:")
print(f"  - Prompt tokens: {response.usage.prompt_tokens}")
print(f"  - Completion tokens: {response.usage.completion_tokens}")
print(f"  - Total tokens: {response.usage.total_tokens}")

# Test 2: Streaming response
print("\n" + "="*70)
print("📋 TEST 2: Streaming Response (Production Feature)")
print("-"*70)

print("\n💬 User: What are the benefits of using vLLM for production? List 3 briefly.")
print("🤖 Assistant (streaming): ", end="", flush=True)

start_time = time.time()
stream = client.chat.completions.create(
    model=save_directory,
    messages=[
        {"role": "user", "content": "What are the benefits of using vLLM for production? List 3 briefly."}
    ],
    temperature=0.5,
    max_tokens=150,
    stream=True  # Enable streaming
)

full_response = ""
for chunk in stream:
    # Some chunks may carry no choices or no text, so guard before reading
    if chunk.choices and chunk.choices[0].delta.content:
        content = chunk.choices[0].delta.content
        full_response += content
        print(content, end="", flush=True)

elapsed = time.time() - start_time
print(f"\n\n⏱️  Streaming response time: {elapsed:.2f}s")

# Test 3: Concurrent requests (demonstrates batching)
print("\n" + "="*70)
print("📋 TEST 3: Concurrent Requests (Continuous Batching)")
print("-"*70)

import concurrent.futures

questions = [
    "What is machine learning?",
    "What is deep learning?",
    "What is a neural network?"
]

def send_request(question):
    """Send one chat request and return its latency in seconds."""
    start = time.time()
    client.chat.completions.create(
        model=save_directory,
        messages=[{"role": "user", "content": question}],
        max_tokens=50
    )
    return time.time() - start

print(f"\n🚀 Sending {len(questions)} concurrent requests...")
start_time = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(send_request, questions))

total_time = time.time() - start_time

print(f"✅ All {len(questions)} requests completed in {total_time:.2f}s")
print(f"📊 Avg latency: {sum(results)/len(results):.2f}s per request")
print(f"\n💡 vLLM's continuous batching processes concurrent requests efficiently!")

print("\n" + "="*70)
print("✅ All tests complete!")
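
Test 3 reports only the concurrent timing, so the benefit of batching is not visible on its own. One way to see it is to time the same requests sequentially and concurrently. The sketch below assumes a hypothetical helper, `compare_sequential_vs_concurrent`, that works with any callable taking one item.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compare_sequential_vs_concurrent(fn, items, max_workers=3):
    """Time fn over items one at a time, then in parallel threads."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    sequential = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(fn, items))
    concurrent = time.perf_counter() - start

    return {
        "sequential_s": sequential,
        "concurrent_s": concurrent,
        # Guard against a zero denominator for trivially fast functions
        "speedup": sequential / max(concurrent, 1e-9),
    }
```

With the server running, `compare_sequential_vs_concurrent(send_request, questions)` would return both timings. A speedup well above 1 suggests the server is overlapping the requests rather than processing them one after another.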


## Production Deployment with Docker + vLLM

### Production-Grade Dockerfile

```dockerfile
# Use official vLLM image (includes CUDA, PyTorch, and all dependencies)
FROM vllm/vllm-openai:latest

# Set working directory
WORKDIR /app

# Copy your fine-tuned model
COPY ./vllm_model /app/model

# Expose vLLM port
EXPOSE 8000

# Health check for container orchestration
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

# Arguments for the vLLM OpenAI-compatible server (v1 engine).
# The official image already sets the API server as its ENTRYPOINT,
# so CMD supplies only the arguments; repeating "python -m ..." here
# would pass those words to the server as extra arguments.
# Note: Don't add --enable-chunked-prefill or --num-scheduler-steps
# These force v0 fallback. v1 engine (default) auto-optimizes these.
CMD ["--model", "/app/model", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--dtype", "float16", \
     "--max-model-len", "4096", \
     "--gpu-memory-utilization", "0.9"]
```

### Build and Deploy

```bash
# Build Docker image
docker build -t my-llm-service:latest .

# Test locally with GPU
docker run --gpus all -p 8000:8000 my-llm-service:latest

# Push to AWS ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account>.dkr.ecr.us-east-1.amazonaws.com
docker tag my-llm-service:latest <account>.dkr.ecr.us-east-1.amazonaws.com/my-llm-service:latest
docker push <account>.dkr.ecr.us-east-1.amazonaws.com/my-llm-service:latest
```

### Production Scaling Strategy

1. **Horizontal Scaling**: Multiple ECS tasks behind Application Load Balancer
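2. **Vertical Scaling**: Move to instances with more GPUs and use tensor parallelism (e.g. g5.xlarge with 1 A10G → g5.12xlarge with 4 A10Gs; g5.2xlarge adds CPU and RAM but still has a single GPU)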
3. **Multi-Model Serving**: vLLM can serve multiple LoRA adapters efficiently
4. **Auto-scaling**: Based on GPU utilization and request queue depth
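
As an illustration of queue-based auto-scaling, the sketch below turns the number of running and waiting requests (summed across all instances) into a target replica count. The function name, the per-replica capacity, and the replica limits are all hypothetical values chosen for the example; real thresholds need to come from load testing your own model and hardware.

```python
import math


def desired_replicas(waiting, running, per_replica_capacity,
                     min_replicas=1, max_replicas=8):
    """Replica count needed to serve all in-flight requests without queueing."""
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    in_flight = waiting + running
    needed = math.ceil(in_flight / per_replica_capacity)
    # Clamp to the allowed range so the service never scales to zero
    # or beyond its budget
    return max(min_replicas, min(max_replicas, needed))


print(desired_replicas(waiting=40, running=24, per_replica_capacity=16))  # 4
print(desired_replicas(waiting=0, running=3, per_replica_capacity=16))    # 1
```

In practice a rule like this would sit behind a cooldown period so that short bursts do not cause replicas to be added and removed repeatedly.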

### Monitoring & Observability

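vLLM exposes Prometheus metrics at `/metrics`. Exact metric names vary between vLLM versions, so check your own server's `/metrics` output; typical examples: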
- `vllm:num_requests_running` - Active requests
- `vllm:num_requests_waiting` - Queued requests (use for autoscaling)
- `vllm:gpu_cache_usage_perc` - KV cache utilization  
- `vllm:time_to_first_token_seconds` - TTFT latency
- `vllm:time_per_output_token_seconds` - Token generation speed
- `vllm:avg_generation_throughput_toks_per_s` - Overall throughput
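
To read these values from a script rather than a Prometheus server, the text output of `/metrics` can be parsed directly. The following is a minimal sketch that suits simple gauge metrics: `parse_vllm_metrics` is a hypothetical helper, it ignores labels, and the sample text is made up for illustration rather than captured from a real server.

```python
def parse_vllm_metrics(text, prefix="vllm:"):
    """Parse Prometheus text-format output into {metric_name: value}.

    Labels are dropped; if a metric appears with several label sets,
    the values are summed.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The sample value is the last space-separated field on the line
        name_part, _, value_part = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        if not name.startswith(prefix):
            continue
        try:
            value = float(value_part)
        except ValueError:
            continue
        metrics[name] = metrics.get(name, 0.0) + value
    return metrics


# Made-up sample in Prometheus text format
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="./vllm_model"} 2.0
vllm:num_requests_waiting{model_name="./vllm_model"} 5.0
"""
print(parse_vllm_metrics(sample))
# {'vllm:num_requests_running': 2.0, 'vllm:num_requests_waiting': 5.0}
```

Against the lab server, the input would be `requests.get("http://localhost:8000/metrics").text`. Histogram metrics such as time to first token are exported as `_bucket`, `_sum`, and `_count` series and need different handling than this sketch provides.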

## Reflection

### Production Deployment Learnings

- **Why vLLM over custom FastAPI?** vLLM provides continuous batching, PagedAttention, and optimized CUDA kernels; its authors report up to 24x higher throughput than HuggingFace Transformers in their published benchmarks

- **Latency considerations**: The first request after server startup may see roughly 50-100 ms of extra latency from warm-up (measure this on your own hardware), but subsequent requests benefit from efficient batching and KV cache management

- **Production hardening checklist**:
  - ✅ Authentication: Add API key validation or OAuth2
  - ✅ Rate limiting: Use AWS API Gateway or NGINX rate limiting
  - ✅ Monitoring: Prometheus + Grafana for metrics, CloudWatch for logs
  - ✅ Auto-scaling: CloudWatch alarms based on GPU utilization
  - ✅ Cost optimization: Use Spot instances for non-critical workloads

- **vLLM vs alternatives**:
  - **TGI (Text Generation Inference)**: Similar features, HuggingFace ecosystem
  - **TensorRT-LLM**: Lower latency, requires more setup
  - **Ollama**: Great for local/edge deployment, less for cloud scale
  
- **When to use what** (rough rules of thumb; real limits depend on model size, hardware, and sequence lengths):
  - Development/prototyping: Direct model loading (what we started with)
  - Production < 100 QPS: vLLM single instance
  - Production > 100 QPS: vLLM with horizontal scaling + load balancer
  - Ultra-low latency: TensorRT-LLM with optimized kernels
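
A QPS threshold only makes sense together with request latency. By Little's law, the average number of in-flight requests equals arrival rate times average latency, which gives a quick way to estimate instance counts. The sketch below uses a hypothetical function name and illustrative numbers; the per-instance concurrency limit is an assumption you would measure by load testing.

```python
import math


def instances_needed(qps, avg_latency_s, max_concurrent_per_instance):
    """Estimate instance count via Little's law: in-flight = QPS x latency."""
    in_flight = qps * avg_latency_s
    return max(1, math.ceil(in_flight / max_concurrent_per_instance))


# 100 QPS at 2 s per request means about 200 requests in flight
print(instances_needed(qps=100, avg_latency_s=2.0, max_concurrent_per_instance=64))  # 4
```

The same 100 QPS with 0.5 s responses would need only one instance under these assumptions, which is why the thresholds above are approximate.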
