
---

# 📈 **Deployments · HPA · KEDA**

---

## 🌀 Deployment (baseline)

* Keeps **N replicas alive** + rolls updates safely.
* Scale manually:

```bash
kubectl scale deploy/vllm --replicas=3 -n llm
```

**Throughput knobs (tune inside pod first):**

* ⚡ vLLM → `--max-num-seqs`, `--gpu-memory-utilization`
* ⚡ TGI → `--max-batch-size`, `--max-concurrent-requests`
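As a rough sketch, the vLLM flags are passed as container args in the Deployment's pod spec. The image tag, model name, and values below are illustrative assumptions, not recommendations:

```yaml
# Fragment of the vllm Deployment (spec.template.spec); all values are placeholders
containers:
  - name: vllm
    image: vllm/vllm-openai:latest        # assumed image; pin a real version tag
    args:
      - --model=my-org/my-model           # hypothetical model name
      - --max-num-seqs=64                 # max sequences batched per iteration
      - --gpu-memory-utilization=0.90     # fraction of GPU memory vLLM may use
    resources:
      limits:
        nvidia.com/gpu: 1                 # assumes the NVIDIA device plugin is installed
```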

---

## 📊 HPA (Horizontal Pod Autoscaler)

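* Auto-scales on **CPU/memory** utilization, measured against each container's resource **requests**.
* Good **baseline autoscaling** when no custom metrics are available; for GPU-bound inference, CPU is only a rough proxy for load.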

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: vllm-hpa, namespace: llm }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: vllm }
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
```
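Utilization targets are computed as a percentage of each container's resource request, so the HPA above only works if the target Deployment sets `resources.requests.cpu`. A minimal fragment, with placeholder values:

```yaml
# Fragment of the vllm Deployment (spec.template.spec.containers[])
- name: vllm
  resources:
    requests:
      cpu: "4"        # placeholder; 70% utilization then means ~2.8 cores averaged across pods
      memory: 16Gi    # placeholder
```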

---

## 🚀 KEDA (advanced autoscaling)

* Scale on **real LLM signals** → requests/sec, queue depth, tokens/sec.
* Works with **Prometheus**, **Redis**, **Kafka**, etc.

🔹 **Prometheus (requests in flight):**

```yaml
triggers:
  - type: prometheus
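    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090  # assumed address; point at your Prometheus
      query: sum(http_requests_in_flight{app="vllm"})       # in-flight is a gauge, so no rate()
      threshold: "20"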
```
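A trigger only takes effect inside a `ScaledObject` that points at the workload. A minimal sketch, assuming KEDA is installed in the cluster; the object name, Prometheus address, and replica bounds are placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: vllm-scaler, namespace: llm }   # hypothetical name
spec:
  scaleTargetRef: { name: vllm }                  # Deployment in the same namespace
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # assumed address
        query: sum(http_requests_in_flight{app="vllm"})
        threshold: "20"                           # target in-flight requests per replica
```

KEDA creates and manages its own HPA for the target, so remove `vllm-hpa` first to avoid two autoscalers fighting over the same Deployment.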

🔹 **Redis (embedding workers):**

```yaml
triggers:
  - type: redis
    metadata:
      address: redis.llm.svc:6379  # assumed host:port; adjust to your Redis
      listName: embed-jobs
      listLength: "100"
```

---

## 🛡️ Availability guardrails

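* **PDB (PodDisruptionBudget)** → keep ≥1 pod during voluntary disruptions (node drains, upgrades).
* **Tip:** set `maxUnavailable: 0` (with `maxSurge: 1`) in the Deployment's `rollingUpdate` strategy for zero-downtime rollouts. Don't set it on the PDB, or node drains will block.
* Always use a **readinessProbe** → avoid routing to cold pods that are still loading the model.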
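The guardrails above can be sketched as two manifests. The object name, the `app: vllm` pod label, and the `/health` endpoint on port 8000 are assumptions; adjust them to match your Deployment:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: vllm-pdb, namespace: llm }   # hypothetical name
spec:
  minAvailable: 1
  selector:
    matchLabels: { app: vllm }                 # assumed pod label
```

```yaml
# Fragment of the vllm Deployment (spec)
strategy:
  type: RollingUpdate
  rollingUpdate: { maxUnavailable: 0, maxSurge: 1 }   # surge pod needs a free GPU
template:
  spec:
    containers:
      - name: vllm
        readinessProbe:
          httpGet: { path: /health, port: 8000 }       # assumed endpoint and port
          periodSeconds: 10
          failureThreshold: 3
```

With a single replica, `minAvailable: 1` blocks node drains entirely, so run at least two replicas when relying on this PDB.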

---

## ⚡ Quick ops

```bash
kubectl top pods -n llm                   # needs metrics-server
kubectl rollout status deploy/vllm -n llm # watch rollout
kubectl describe hpa vllm-hpa -n llm
```

---

## ✅ Rule of thumb

1. Tune **batch/concurrency flags**.
2. Start with fixed **replicas**.
3. Add **HPA** (CPU/RAM).
4. Upgrade to **KEDA** (real workload metrics).

---
