# Inference Optimization Challenge — Qwen2.5-0.5B on Tesla T4

**Goal:** Maximize output token throughput (tok/s) for Qwen2.5-0.5B on a Tesla T4 GPU.  
**Baseline to beat:** 3,332 tok/s

## Scoring

| Metric | Weight | Constraint |
|--------|--------|------------|
| Output throughput (tok/s) | 40% | Higher is better |
| P99 TPOT | 20% | Must stay under 50 ms |
| P99 TTFT | 15% | Must stay under 2000 ms |
| Request success rate | 10% | Must be 100% |
| Code quality & documentation | 15% | Clean, reproducible, explained |

## 1. Setup

In [None]:
!pip install -q vllm

In [None]:
!nvidia-smi

## 2. Launch vLLM Server (Baseline — Default Config)

In [None]:
import subprocess, time, requests

MODEL = "Qwen/Qwen2.5-0.5B"

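# Launch the vLLM server in the background, sending its output to a log file
with open("server_baseline.log", "w") as log_file:
    server_proc = subprocess.Popen(
        ["vllm", "serve", MODEL],
        stdout=log_file,
        stderr=subprocess.STDOUT,
    )
print(f"Server PID: {server_proc.pid}")

# Poll the health endpoint every 2 s, for up to roughly 240 s
ready = False
for i in range(120):
    if server_proc.poll() is not None:
        # The server process died, so there is no point in waiting further
        print(f"ERROR: Server exited early with code {server_proc.returncode}")
        break
    try:
        r = requests.get("http://localhost:8000/health", timeout=2)
        if r.status_code == 200:
            ready = True
            print(f"Server ready after ~{i*2}s")
            break
    except requests.exceptions.RequestException:
        pass  # Server is not accepting connections yet
    time.sleep(2)
else:
    print("ERROR: Server not ready after 240s")

if not ready:
    # Show the tail of the log to help diagnose the failure
    with open("server_baseline.log") as f:
        print(f.read()[-2000:])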

## 3. Run Benchmark (Baseline)

In [None]:
!mkdir -p results

!vllm bench serve \
  --backend openai \
  --base-url http://localhost:8000/v1 \
  --endpoint /completions \
  --model Qwen/Qwen2.5-0.5B \
  --tokenizer Qwen/Qwen2.5-0.5B \
  --dataset-name random \
  --max-concurrency 50 \
  --num-prompts 200 \
  --ignore-eos \
  --random-input-len 512 \
  --random-output-len 512 \
  --save-result \
  --result-dir ./results \
  --result-filename baseline.json \
  --label baseline
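
To put the benchmark flags in perspective, the arithmetic below estimates the size of the workload. It is a rough sketch under two assumptions: that `--ignore-eos` makes every request generate exactly `--random-output-len` tokens, and that the server sustains the 3,332 tok/s baseline figure quoted at the top of this notebook. The function name `estimate_workload` is illustrative only and is not part of vLLM.

```python
def estimate_workload(num_prompts, output_len, throughput_tok_s):
    """Estimate total output tokens and run time for a fixed-length benchmark.

    Assumes every request produces exactly `output_len` tokens and that
    throughput stays constant for the whole run.
    """
    total_output_tokens = num_prompts * output_len
    est_duration_s = total_output_tokens / throughput_tok_s
    return total_output_tokens, est_duration_s


# Values taken from the benchmark command and the stated baseline.
tokens, seconds = estimate_workload(
    num_prompts=200, output_len=512, throughput_tok_s=3332.0
)
print(f"Total output tokens: {tokens}")
print(f"Estimated duration at baseline throughput: {seconds:.1f} s")
```

A real run will likely differ from this estimate, since throughput ramps up at the start and tails off as the last requests drain.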

## 4. Report Baseline Results

In [None]:
import json

with open("results/baseline.json") as f:
    data = json.load(f)

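# Prefer the requested prompt count when the result file records it, so that
# failures are still counted even if no 'failed' key is present
total = data.get('num_prompts', data['completed'] + data.get('failed', 0))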

print("=" * 60)
print("BASELINE RESULTS")
print("=" * 60)
print(f"  Output throughput:  {data['output_throughput']:.2f} tok/s")
print(f"  Mean TPOT:          {data['mean_tpot_ms']:.2f} ms")
print(f"  P99 TPOT:           {data['p99_tpot_ms']:.2f} ms  (limit: 50 ms)")
print(f"  Mean TTFT:          {data['mean_ttft_ms']:.2f} ms")
print(f"  P99 TTFT:           {data['p99_ttft_ms']:.2f} ms  (limit: 2000 ms)")
print(f"  Completed requests: {data['completed']}/{total}")
print(f"  Failed requests:    {data.get('failed', 0)}")
print("=" * 60)
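
The hard limits in the Scoring table can also be checked programmatically rather than by eye. The helper below is a minimal sketch: `check_constraints` is a hypothetical name, the result keys are the ones read in the cell above, the optional `num_prompts` key is an assumption about the result file, and the `example` dictionary holds made-up placeholder values, not measured results.

```python
BASELINE_TOK_S = 3332.0  # Baseline to beat, from the top of this notebook


def check_constraints(result, tpot_limit_ms=50.0, ttft_limit_ms=2000.0):
    """Return pass/fail flags for the constraints in the Scoring table."""
    completed = result["completed"]
    # Assumption: fall back to completed + failed if 'num_prompts' is absent.
    total = result.get("num_prompts", completed + result.get("failed", 0))
    return {
        "beats_baseline": result["output_throughput"] > BASELINE_TOK_S,
        "p99_tpot_ok": result["p99_tpot_ms"] < tpot_limit_ms,
        "p99_ttft_ok": result["p99_ttft_ms"] < ttft_limit_ms,
        "all_requests_ok": total > 0 and completed == total,
    }


# Hypothetical numbers for illustration only; these are NOT benchmark results.
example = {
    "output_throughput": 3500.0,
    "p99_tpot_ms": 14.0,
    "p99_ttft_ms": 900.0,
    "completed": 200,
    "failed": 0,
}

checks = check_constraints(example)
for name, passed in checks.items():
    print(f"  {name:16s} {'PASS' if passed else 'FAIL'}")
```

In this notebook, calling `check_constraints(data)` after loading `results/baseline.json` would apply the same checks to the measured run.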

## 5. Cleanup

In [None]:
server_proc.terminate()
server_proc.wait()
print("Server stopped.")