
---

# 📦 **Pods · Deployments · Services (LLM lens)**

---

## 🔑 TL;DR

* **🧩 Pod** → one inference instance (vLLM/TGI).
* **🌀 Deployment** → manages replicas & rolling updates.
* **🌐 Service** → stable endpoint; add Ingress/LB for public access.

---

## 🧩 Pod

* Runs one or more containers that share one IP address and any mounted volumes.
* **Ephemeral** → anything written to the container filesystem is lost when the Pod is replaced; mount a **PVC** for models/cache.
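
A claim for model storage might look like the sketch below. The name `models-pvc` matches the claim the Deployment in this page mounts; the storage class `shared-nfs`, the size, and the access mode are assumptions to adapt to your cluster.

```yaml
# Illustrative PVC for model weights; class and size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata: { name: models-pvc, namespace: llm }
spec:
  accessModes: [ReadWriteMany]     # lets Pods on several nodes mount the same volume
  storageClassName: shared-nfs     # assumed name; must be a class that supports ReadWriteMany
  resources:
    requests: { storage: 100Gi }   # assumed size; budget for every model you plan to serve
```

With a single replica on one node, `ReadWriteOnce` is usually enough; a shared access mode only matters once replicas land on different nodes.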

---

## 🌀 Deployment

* Keeps **N replicas** running and replaces Pods gradually during rolling updates.
* Scale with:

```bash
kubectl scale deploy/vllm --replicas=3 -n llm
```
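
Rolling-update behavior is configurable on the Deployment itself. The fragment below is a sketch with illustrative values, not a complete manifest; it would merge into the `spec` of the `vllm` Deployment.

```yaml
# Fragment of a Deployment spec; values are examples only.
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most one extra Pod is created during an update
      maxUnavailable: 0  # never drop below the desired replica count
```

Note that `maxSurge: 1` needs spare capacity for one additional Pod (a free GPU, if the Pods request GPUs). On a full cluster, `maxSurge: 0` with `maxUnavailable: 1` trades brief reduced capacity for not needing the extra slot.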

---

## 🌐 Service

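* **ClusterIP** → internal only (the default type).
* **LoadBalancer** → provisions a cloud load balancer with an external (usually public) IP.
* **Ingress** → not a Service type but a separate resource; adds HTTP routing + TLS in front of a Service.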
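
As a sketch of the Ingress option: the manifest below routes HTTPS traffic to the `vllm-svc` Service defined in this page. The host `llm.example.com`, the Secret `llm-tls`, the name `vllm-ing`, and the `nginx` ingress class are all assumptions; an ingress controller must already be installed for it to take effect.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata: { name: vllm-ing, namespace: llm }   # hypothetical name
spec:
  ingressClassName: nginx              # assumed controller class
  tls:
    - hosts: [llm.example.com]         # placeholder host
      secretName: llm-tls              # assumed pre-existing TLS Secret
  rules:
    - host: llm.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-svc
                port: { number: 80 }   # the Service port, not the container port
```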

---

## 🛠 Minimal Example

**Deployment**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm, namespace: llm }
spec:
  replicas: 1
  selector: { matchLabels: { app: vllm } }
  template:
    metadata: { labels: { app: vllm } }
    spec:
      containers:
        - name: api
          image: vllm/vllm-openai:v0.6.3  # example pinned tag; use the version you tested, not :latest
          args: ["--model","/models/Llama-3-8B-Instruct","--max-num-seqs","8"]
          ports: [{ containerPort: 8000 }]
          volumeMounts:
            - { name: model, mountPath: /models, readOnly: true }
      volumes:
        - name: model
          persistentVolumeClaim: { claimName: models-pvc }
```

**Service**

```yaml
apiVersion: v1
kind: Service
metadata: { name: vllm-svc, namespace: llm }
spec:
  selector: { app: vllm }
  ports: [{ port: 80, targetPort: 8000 }]
  type: ClusterIP
```

👉 Local test:

```bash
kubectl -n llm port-forward svc/vllm-svc 8080:80
curl http://localhost:8080/health
```

---

## 💡 LLM Tips

* Mount models via **PVC (read-only)**.
* Use **immutable image tags** (no `:latest`).
* Add **probes & resources** early; autoscale later with HPA/KEDA.
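
A possible starting point for probes and resources is sketched below as a fragment for the `api` container. It assumes the server answers `/health` on port 8000 (the endpoint used in the local test above), that the NVIDIA device plugin exposes `nvidia.com/gpu`, and that the CPU/memory figures, which are placeholders, get tuned to your model.

```yaml
# Fragment: merge into the Deployment's pod template; values are illustrative.
containers:
  - name: api
    startupProbe:                      # gives the model time to load before other probes apply
      httpGet: { path: /health, port: 8000 }
      periodSeconds: 10
      failureThreshold: 60             # up to ~10 minutes (60 x 10s) before the container is restarted
    readinessProbe:                    # removes the Pod from the Service while it is unhealthy
      httpGet: { path: /health, port: 8000 }
      periodSeconds: 10
    resources:
      requests: { cpu: "4", memory: 32Gi }   # assumed sizing
      limits:
        memory: 32Gi
        nvidia.com/gpu: 1              # one GPU per replica; needs the NVIDIA device plugin
```

A generous startup probe matters for LLM servers because loading weights can take minutes; without it, a tight liveness or readiness setting may restart the Pod before it ever serves a request.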

---

