
---

# 🧠 Why Kubernetes for LLMs

---

## 🚀 When to use Kubernetes (LLMs)

* **⚡ Scale & uptime** → self-healing restarts, rolling updates without downtime.
* **🎮 GPU scheduling** → pool GPU nodes; the scheduler places pods that request a GPU onto them.
* **🔗 Multi-service graphs** → gateway ↔ LLM ↔ embeddings ↔ vector DB ↔ cache.

---

## 🎯 When K8s helps (LLM scenarios)

* **📈 Bursty traffic** → replicas scale out and back in with load.
* **🖥️ GPU pooling** → taints keep non-GPU pods off GPU nodes; tolerations plus a nodeSelector/affinity steer GPU pods onto them.
* **🔀 Multi-service pipelines** → chain gateway ↔ LLM ↔ rewriter ↔ embeddings ↔ vector DB.
* **🛡️ Ops guardrails** → probes, limits, rollbacks, disruption budgets.

---

## 🏗️ Minimal Reference Architecture

```
Client → Ingress → Gateway(API) → LLM Inference (GPU)
                         │
                         ├─ Embeddings Service (CPU)
                         ├─ Vector DB (Milvus/Qdrant/pgvector)
                         └─ Cache/Queue (Redis/Kafka)
```

* **📂 Artifacts** → model weights on a PVC (RWX or ROX preferred, so every replica mounts the same copy) or downloaded by an initContainer at pod start.
* **🖧 Node pools** → GPU (tainted) + CPU (general).
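
The GPU/CPU pool split above could be sketched as a Deployment that tolerates the GPU taint, selects GPU nodes, and asks for one GPU. This is a minimal sketch, not a production manifest. It assumes the NVIDIA device plugin is installed (it exposes the `nvidia.com/gpu` resource). The taint key, the `pool: gpu` node label, the image name, and port 8000 are all hypothetical.

```bash
# Write a Deployment that lands only on the (assumed) tainted GPU pool.
cat > llm-inference.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: llm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      nodeSelector:
        pool: gpu                      # assumed label on GPU nodes
      tolerations:
        - key: nvidia.com/gpu          # assumed taint key on GPU nodes
          operator: Exists
          effect: NoSchedule
      containers:
        - name: server
          image: registry.example.com/llm-server:1.0.0   # hypothetical image
          ports:
            - containerPort: 8000      # assumed serving port
          resources:
            limits:
              nvidia.com/gpu: 1        # request defaults to the limit
EOF
# Apply with: kubectl apply -f llm-inference.yaml
```

The matching taint on a GPU node would look like `kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule`. The toleration only permits scheduling there; the nodeSelector is what keeps the pod off CPU nodes.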

---

## 🔑 K8s pieces (LLM mapping)

* **📦 Pod** → one inference instance.
* **🌀 Deployment** → manages replicas + rolling updates.
* **🌐 Service** → stable in-cluster endpoint.
* **🚪 Ingress/LB** → public entry to the cluster.
* **💾 PVC** → persistent model weights/cache.
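* **📊 HPA/KEDA** → HPA autoscales on CPU/RAM (or custom metrics via an adapter); KEDA autoscales on external signals such as queue length, RPS, or tokens/sec.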
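
Two of the pieces above, sketched as manifests: a Service in front of the inference pods and an HPA that scales them on CPU. This assumes a Deployment named `llm-inference` with the label `app: llm-inference` serving on port 8000 (all hypothetical), plus metrics-server in the cluster. The replica bounds and the 70% target are arbitrary. CPU is often a weak signal for GPU-bound inference, so treat this as the simplest starting point.

```bash
# Write a Service plus a CPU-based HPA (names and thresholds are illustrative).
cat > llm-svc-hpa.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: llm-inference
  namespace: llm
spec:
  selector:
    app: llm-inference                 # must match the pod labels
  ports:
    - port: 80
      targetPort: 8000                 # assumed container port
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference
  namespace: llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization            # needs CPU requests set on the pods
          averageUtilization: 70
EOF
# Apply with: kubectl apply -f llm-svc-hpa.yaml
```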

---

## 📂 Model storage options

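* **🗄️ RWX PVC** → one volume shared by all replicas; mount it `readOnly: true` (or use ROX) so pods cannot modify the weights. Best for models.
* **📥 InitContainer + RWO PVC** → download model at pod startup; an RWO volume attaches to one node, so replicas on other nodes need their own PVC.
* **📦 Warm image** → bake weights in image (fast start once the image is cached on the node; large image, heavy build and pull).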
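
The initContainer option above might look like the sketch below: a PVC plus a pod whose initContainer downloads the weights only if they are not already on the volume. The model URL, file name, volume size, and server image are hypothetical. `curlimages/curl` is used here only as a small image that ships curl, and its tag is an assumption.

```bash
# Write a PVC and a pod that fetches the model before the server starts.
cat > llm-model-init.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: llm
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi                    # assumed size; fit it to the model
---
apiVersion: v1
kind: Pod
metadata:
  name: llm-init-demo
  namespace: llm
spec:
  volumes:
    - name: model
      persistentVolumeClaim:
        claimName: llm-model
  initContainers:
    - name: fetch-model
      image: curlimages/curl:8.8.0     # assumed tag; any image with curl works
      securityContext:
        runAsUser: 0                   # root, so the download can write to the volume
      command: ["sh", "-c"]
      args:
        - >-
          [ -f /models/model.safetensors ] ||
          curl -fL -o /models/model.safetensors
          https://models.example.com/model.safetensors
      volumeMounts:
        - name: model
          mountPath: /models
  containers:
    - name: server
      image: registry.example.com/llm-server:1.0.0   # hypothetical image
      volumeMounts:
        - name: model
          mountPath: /models
          readOnly: true               # the server only reads the weights
EOF
# Apply with: kubectl apply -f llm-model-init.yaml
```

Because the PVC outlives the pod, a restart skips the download and goes straight to serving.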

---

## 📡 Ops signals to watch

* **⚡ Throughput** → tokens/sec.
* **⏱️ Latency** → p95 end-to-end response time; for streaming, also time to first token.
* **🎮 GPU util** → compute % + VRAM usage.
* **📌 Queue depth** → requests waiting for a slot (track in-flight alongside).
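
To make the p95 signal above concrete, here is a small sketch that computes a nearest-rank 95th percentile from a file of per-request latencies. The file name and the numbers are made up. A real setup would more likely read this from a metrics system than from a text file.

```bash
# Sample per-request latencies in milliseconds (made-up numbers).
printf '%s\n' 120 95 310 150 101 99 480 130 110 105 > latencies_ms.txt

# Nearest-rank p95: sort ascending, take the value at rank ceil(0.95 * N).
p95=$(sort -n latencies_ms.txt | awk '
  { v[NR] = $1 }
  END {
    rank = int((95 * NR + 99) / 100)   # integer ceil of 0.95 * NR
    print v[rank]
  }')
echo "p95 latency: ${p95} ms"          # prints "p95 latency: 480 ms"
```

With only ten samples the p95 is simply the slowest request, which is why tail latency needs a reasonably large window to be meaningful.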

---

## 🛡️ Safety basics

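* **🩺 Probes** → readiness gates traffic, liveness restarts hung containers, startup covers slow model loads.
* **📐 Requests/Limits** → CPU and RAM requests + limits; GPU counts are whole numbers set under `limits` (any request must equal the limit).
* **🔒 Immutable images** → avoid `:latest`; pin a version tag or, since tags can be re-pushed, a digest.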
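
The basics above could be combined in one pod spec, sketched here. The `/healthz` path, port 8000, the image name, and every timing and resource value are assumptions; tune them to how long the model actually takes to load.

```bash
# Write a pod with all three probes plus resource guardrails.
cat > llm-probes.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: llm-probe-demo
  namespace: llm
spec:
  containers:
    - name: server
      # Hypothetical image; in practice pin by digest (image@sha256:...).
      image: registry.example.com/llm-server:1.0.0
      ports:
        - containerPort: 8000
      startupProbe:                    # up to 60 x 10s = 10 min for model load
        httpGet:
          path: /healthz
          port: 8000
        periodSeconds: 10
        failureThreshold: 60
      readinessProbe:                  # gates traffic from the Service
        httpGet:
          path: /healthz
          port: 8000
        periodSeconds: 5
      livenessProbe:                   # restarts the container if it hangs
        httpGet:
          path: /healthz
          port: 8000
        periodSeconds: 10
        failureThreshold: 3
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
        limits:
          memory: 8Gi
          nvidia.com/gpu: 1
EOF
# Apply with: kubectl apply -f llm-probes.yaml
```

Readiness and liveness checks do not run until the startup probe succeeds, so a slow model load is not mistaken for a hung container.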

---

## 🧪 60-sec local smoke (CPU)

```bash
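kind create cluster
kubectl create ns llm
kubectl -n llm create deploy echo --image=ealen/echo-server --port=80
kubectl -n llm expose deploy echo --port=80
# wait until the pod is Ready, otherwise port-forward can fail
kubectl -n llm rollout status deploy/echo --timeout=120s
# port-forward blocks; run the test from a second terminal
kubectl -n llm port-forward svc/echo 8080:80
# test: curl http://localhost:8080
# cleanup: kind delete cluster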
```

---
