
# 01\_Serving\_Local\_K8s\_Cloud

## 🎯 Goal

Serve your LLM/RAG **reliably, securely, and cheaply** with the same interface in dev → prod.

---

## 🧭 Options at a glance

| Scenario         | How to serve                     | When to use                |
| ---------------- | -------------------------------- | -------------------------- |
| 🧪 Local         | `mlflow models serve` or FastAPI | Solo dev, quick tests      |
| 📦 Docker        | `mlflow models build-docker`     | Reproducible env, CI smoke |
| ☸️ Kubernetes    | Deployment + HPA + Ingress       | Team/prod, autoscaling     |
| ☁️ Managed Cloud | SageMaker/Vertex/ECS/Cloud Run   | Faster ops, pay-as-you-go  |

---

## 📦 Model interface (keep it stable)

* 🔌 **HTTP/JSON** endpoint exposing `predict()` (MLflow's scoring server serves it at `POST /invocations`).
* 🧾 **Signature**: document input/output schema.
* 🧪 **Smoke payload**: tiny request you can run in CI.
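
A smoke payload can double as a check on the signature. The sketch below assumes a hypothetical two-field input schema (`question`, `top_k`) and wraps the record in the `dataframe_records` key that MLflow 2.x scoring servers accept; swap in your model's real signature.

```python
import json

# Hypothetical input schema for illustration: field name -> required Python type.
INPUT_SCHEMA = {"question": str, "top_k": int}


def validate_record(record: dict, schema: dict = INPUT_SCHEMA) -> list[str]:
    """Return a list of schema violations (an empty list means the record is valid)."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


def build_smoke_payload(record: dict) -> str:
    """Validate one record and wrap it in an MLflow-style JSON request body."""
    errors = validate_record(record)
    if errors:
        raise ValueError("; ".join(errors))
    return json.dumps({"dataframe_records": [record]})


# The tiny request CI sends on every build.
SMOKE_RECORD = {"question": "ping", "top_k": 1}
```

Keeping the schema and the smoke record in one module means a signature change breaks CI before it breaks a client.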

---

## 💻 Local (fast feedback)

* ▶️ **Serve**: `mlflow models serve -m runs:/... --port 5000`
* 🧰 Alt: FastAPI/uvicorn wrapper if you need custom routing/headers.
* ✅ Pros: zero infra; ⚠️ Cons: not shareable, no HA.
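
To exercise a locally served model, a standard-library client is enough. This is a minimal sketch: it assumes the server from the `mlflow models serve` command above is listening on port 5000, and the `{"inputs": [...]}` body is only an example shape that your model's signature may not accept.

```python
import json
import urllib.request


def build_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST to the MLflow scoring route (/invocations)."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url: str, payload: dict, timeout: float = 10.0) -> dict:
    """Send the payload and return the decoded JSON response.

    urlopen raises HTTPError on 4xx/5xx responses, so a failing
    server fails the smoke test without extra status checks.
    """
    request = build_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage, with the server running: `smoke_test("http://127.0.0.1:5000", {"inputs": ["ping"]})`.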

---

## 🐳 Docker (portable)

* 🧱 **Build**: `mlflow models build-docker -m runs:/... -n my-llm:latest`
* 🔐 **Secrets via env**: never bake keys into image.
* 🧪 Use in CI to **smoke-test** before promotion.
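
On the application side, "secrets via env" can be sketched as a settings loader that fails fast at startup. The variable names `LLM_API_KEY` and `MODEL_URI` are hypothetical; use whatever your deployment injects.

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    # repr=False keeps the key out of logs if the object is ever printed.
    llm_api_key: str = field(repr=False)
    model_uri: str


def load_settings(env=os.environ) -> Settings:
    """Read configuration injected at container start; fail fast if anything is missing."""
    # Hypothetical variable names, chosen for this example only.
    required = ("LLM_API_KEY", "MODEL_URI")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    return Settings(llm_api_key=env["LLM_API_KEY"], model_uri=env["MODEL_URI"])
```

The image stays secret-free; values arrive at run time, for example through `docker run -e` or `--env-file`.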

---

## ☸️ Kubernetes (prod control)

* 🧩 **Deployment**: set **resources** (requests/limits) + **replicas**.
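* 📈 **HPA**: scales on CPU out of the box; QPS or latency need custom/external metrics (e.g. via a Prometheus adapter). Add a **PodDisruptionBudget (PDB)** so a node drain never evicts every replica at once.
* ❤️ **Probes**: `readiness` (gate traffic until dependencies are OK) & `liveness` (restart hung containers).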
* 🔐 **Config**: `Secrets` for API keys, `ConfigMaps` for non-secrets.
* 🗄️ **State**: keep indices/artifacts in **S3/GCS/Blob**; mount only if needed.
* 🌉 **Ingress/Gateway**: NGINX/Istio + **TLS** termination.
* ⚙️ **GPU** (if self-hosting models): node selectors + tolerations.
* 🔁 **Rollouts**: canary/shadow via Istio or progressive delivery.
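
The Deployment, HPA, and PDB bullets above can be sketched as manifests built in Python and emitted as JSON, which `kubectl apply -f -` accepts. The resource sizes, the `api-key` Secret key, and the `LLM_API_KEY` variable are placeholder assumptions; port 8080 and the `/ping` route are assumed to match what an MLflow-built image exposes, so verify them against your container.

```python
import json


def deployment(name: str, image: str, secret_name: str, replicas: int = 2) -> dict:
    """Deployment with requests/limits, probes, and a Secret-backed env var."""
    labels = {"app": name}
    probe = {"httpGet": {"path": "/ping", "port": 8080}}  # assumed health route
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": 8080}],
        "resources": {
            "requests": {"cpu": "500m", "memory": "1Gi"},  # placeholder sizes
            "limits": {"cpu": "1", "memory": "2Gi"},
        },
        # Readiness gates traffic; liveness restarts a hung container.
        "readinessProbe": {**probe, "periodSeconds": 5},
        "livenessProbe": {**probe, "initialDelaySeconds": 30, "periodSeconds": 10},
        "env": [{
            "name": "LLM_API_KEY",  # hypothetical variable name
            "valueFrom": {"secretKeyRef": {"name": secret_name, "key": "api-key"}},
        }],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": {"containers": [container]}},
        },
    }


def cpu_hpa(name: str, min_replicas: int = 2, max_replicas: int = 10, target_cpu: int = 70) -> dict:
    """HorizontalPodAutoscaler targeting average CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {"name": "cpu", "target": {"type": "Utilization", "averageUtilization": target_cpu}},
            }],
        },
    }


def pdb(name: str, min_available: int = 1) -> dict:
    """PodDisruptionBudget so voluntary evictions keep at least one pod serving."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {"minAvailable": min_available, "selector": {"matchLabels": {"app": name}}},
    }


def render_bundle(name: str, image: str, secret_name: str) -> str:
    """Bundle all three objects into one `kind: List` JSON document."""
    items = [deployment(name, image, secret_name), cpu_hpa(name), pdb(name)]
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2)
```

Printing `render_bundle("my-llm", "my-llm:latest", "llm-secrets")` and piping it to `kubectl apply -f -` would create all three objects; most teams keep the same content as YAML or a Helm chart instead.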

---

## ☁️ Cloud patterns

* 🧰 **Managed endpoints**: SageMaker/Vertex/Databricks Model Serving → autoscale, metrics, IAM.
* 🧩 **Containers**: ECS/EKS/GKE/Cloud Run/Azure Container Apps → bring your Docker image; Cloud Run and Azure Container Apps can scale to zero, which suits API-wrapper use cases.
* ⚡ **Serverless edge** (if calling external LLMs): API Gateway + Lambda/Cloud Functions for bursty workloads.
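
For the serverless pattern, a thin handler in front of an external LLM might look like the sketch below. It assumes the API Gateway proxy event shape (a JSON string in `event["body"]`, a `statusCode`/`headers`/`body` response); `call_llm` is a hypothetical injected function standing in for your provider's client.

```python
import json
from typing import Callable


def _response(status: int, payload: dict) -> dict:
    """Format a proxy-integration response."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def make_handler(call_llm: Callable[[str], str]):
    """Build a Lambda-style handler around an injected LLM client function."""

    def handler(event: dict, context: object) -> dict:
        try:
            body = json.loads(event.get("body") or "{}")
            prompt = body["prompt"]
        except (json.JSONDecodeError, KeyError, TypeError):
            return _response(400, {"error": "body must be JSON with a 'prompt' field"})
        if not isinstance(prompt, str) or not prompt.strip():
            return _response(400, {"error": "'prompt' must be a non-empty string"})
        return _response(200, {"completion": call_llm(prompt)})

    return handler
```

Injecting `call_llm` keeps the handler testable without network access and lets the same code wrap different providers.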

---

## 📊 Observability & SLOs

* ⏱️ **Metrics**: p50/p95 latency, QPS, tokens, **cost**.
* 🧯 **Safety**: toxicity/PII block rates as counters.
* 🔍 **Tracing**: OpenTelemetry trace IDs across retrieval → LLM.
* 🔔 **Alerts**: on SLO breach (latency/cost/error rate).
* 🪵 **Logs**: structured; never log raw secrets/PII.
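
As a rough illustration of the metrics and alert bullets, the sketch below computes nearest-rank percentiles, a per-request token cost, and a simple SLO check. The function names, prices, and thresholds are all illustrative assumptions; in production these numbers usually come from your metrics backend rather than in-process lists.

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def request_cost(prompt_tokens: int, completion_tokens: int,
                 usd_per_1k_prompt: float, usd_per_1k_completion: float) -> float:
    """Token cost of one request, given per-1k-token prices."""
    return (prompt_tokens / 1000 * usd_per_1k_prompt
            + completion_tokens / 1000 * usd_per_1k_completion)


def slo_breaches(latencies_ms: list[float], errors: int, total: int,
                 p95_budget_ms: float, max_error_rate: float) -> list[str]:
    """Return the names of the SLOs that are currently breached."""
    breaches = []
    if percentile(latencies_ms, 95) > p95_budget_ms:
        breaches.append("p95 latency")
    if total and errors / total > max_error_rate:
        breaches.append("error rate")
    return breaches
```

An alert rule is then just "page when `slo_breaches(...)` is non-empty for N consecutive windows".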

---

## 🔐 Security musts

* 🔑 **IAM roles** over static keys; short-lived tokens.
* 🚧 **Rate limits** + WAF; per-tenant quotas.
* 🕳️ **Egress control**: restrict outbound to LLM providers only.
* 🧽 **Redaction**: strip PII before logging & tracing.
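
Two of the bullets above, redaction and per-tenant quotas, can be sketched in a few lines. The regexes are deliberately naive (emails and phone-like digit runs only) and the token bucket is in-memory; treat both as illustrations, not as a substitute for a real PII detector or a gateway-level rate limiter.

```python
import re
import time

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like numbers before the text reaches logs or traces."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


class TenantRateLimiter:
    """Token bucket per tenant: `rate_per_sec` refill, `burst` capacity."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._buckets = {}  # tenant -> (tokens, last_refill_time)

    def allow(self, tenant: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(tenant, (float(self.burst), now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1
        self._buckets[tenant] = (tokens - 1 if allowed else tokens, now)
        return allowed
```

The injectable `clock` exists so the limiter can be tested deterministically; across multiple replicas the bucket state would need to live in a shared store.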

---

## 💰 Cost levers

* 🧠 **Caching**: prompt/result cache; reuse embeddings.
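* 📦 **Batching** to raise throughput per replica; **streaming** responses to cut perceived latency (streaming alone does not lower token cost).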
* ✂️ **Context discipline**: trim docs; smart reranking before LLM.
* 💤 **Scale-to-zero** for spiky traffic (Cloud Run/Lambda).
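
A prompt/result cache can be as small as the sketch below: an in-memory LRU keyed by a hash of the model name and the whitespace-normalized prompt. `generate` is a hypothetical callable standing in for the real LLM call; a shared cache (for example Redis) would replace the dict in a multi-replica setup.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class PromptCache:
    """Small LRU cache keyed by a hash of (model, whitespace-normalized prompt)."""

    def __init__(self, generate: Callable[[str, str], str], max_entries: int = 1024):
        self._generate = generate
        self._max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str) -> str:
        normalized = " ".join(prompt.split())
        return hashlib.sha256(f"{model}\x00{normalized}".encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str) -> str:
        k = self.key(model, prompt)
        if k in self._entries:
            self.hits += 1
            self._entries.move_to_end(k)  # mark as most recently used
            return self._entries[k]
        self.misses += 1
        result = self._generate(model, prompt)
        self._entries[k] = result
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return result
```

Exact-match caching only pays off when repeated prompts are common and a reused answer is acceptable; the `hits`/`misses` counters make that measurable before committing to it.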

---

## 🚦 Rollout playbook

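1. 🧪 **Shadow** the new model behind the current one: mirror live traffic to it, log its outputs, return only the current model's responses.
2. 🐤 **Canary** 1–5% traffic; watch latency, cost, safety.
3. ⚖️ **A/B** until the win rate is clear → promote alias to **champion**.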
4. 🔙 **Rollback** = flip alias; no redeploy.
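
The canary and alias steps above can be sketched as a tiny in-memory router. `AliasRouter` and its method names are hypothetical; in practice the alias would live in a model registry (MLflow's registry supports aliases such as `champion`) and the traffic split in your gateway or mesh.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Stable bucket in [0, 100) so a given user always lands on the same variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


class AliasRouter:
    def __init__(self, champion: str):
        self.aliases = {"champion": champion}
        self.canary_percent = 0
        self._previous = None

    def start_canary(self, version: str, percent: int) -> None:
        """Send `percent`% of users to the challenger version."""
        self.aliases["challenger"] = version
        self.canary_percent = percent

    def route(self, user_id: str) -> str:
        """Return the model version that should serve this user."""
        if "challenger" in self.aliases and bucket(user_id) < self.canary_percent:
            return self.aliases["challenger"]
        return self.aliases["champion"]

    def promote(self) -> None:
        """Make the challenger the new champion, remembering the old one."""
        self._previous = self.aliases["champion"]
        self.aliases["champion"] = self.aliases.pop("challenger")
        self.canary_percent = 0

    def rollback(self) -> None:
        """Flip the champion alias back; nothing is redeployed."""
        if self._previous is None:
            raise RuntimeError("nothing to roll back to")
        self.aliases["champion"] = self._previous
```

Hashing the user ID rather than sampling per request keeps each user on one variant, which makes A/B comparisons cleaner.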

---

## ✅ Pre-flight checklist

* 🔒 Secrets via env/secret manager
* 🧾 Signature locked & validated
* 🧪 CI smoke test with sample payload
* 📈 Dashboards + alerts ready
* 🗂️ Artifacts & indices versioned (content-hash fingerprints)
* 🚀 Rollout strategy (shadow/canary) defined

---

## 🗣️ One-liner

**“Package once, serve anywhere—use Docker for portability, K8s/Cloud for autoscale, and enforce SLOs with strong observability and gates.”**
