
---

# 📊 **Metrics · Tracing · GPU Util**

---

## 🎯 What to measure (LLM apps)

* ⚡ **Throughput** → tokens/sec, requests/sec, queue depth
* ⏱ **Latency** → p50/p95
* ❌ **Errors** → 4xx/5xx, timeouts
* 🎮 **GPU** → util %, VRAM used
* 📂 **Cache/IO** → HF cache hit %, PV read throughput

---

## 🛠 Quick setup

* 🟢 **metrics-server** → `kubectl top pods`
* 📈 **Prometheus + Grafana** → dashboards
* 🎮 **DCGM exporter** → GPU metrics

👉 Start with Prometheus + Grafana, then add the DCGM exporter for GPU metrics.

---

## 🧩 App metrics (FastAPI example)

Define a Prometheus counter and histogram in the app:

```python
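from prometheus_client import Counter, Histogram

TOKENS = Counter("tokens_total", "tokens emitted")      # monotonically increasing
LAT = Histogram("latency_seconds", "request latency")   # bucketed, feeds p95 queries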
```

👉 Mount `prometheus_client.make_asgi_app()` at `/metrics` so Prometheus can scrape it.

---

## 📡 Prometheus scrape (sketch)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  selector: { matchLabels: { app: gateway } }
  endpoints:
    - port: metrics
      interval: 15s
```

---

## 📊 Grafana queries

* Tokens/sec → `sum(rate(tokens_total[1m]))`
* Error rate → `rate(5xx) / rate(total)` (pseudo-PromQL, substitute your request counter)
* p95 latency → `histogram_quantile(0.95, … )`
* GPU util → `avg(DCGM_FI_DEV_GPU_UTIL)` (percent)
* VRAM used → `avg(DCGM_FI_DEV_FB_USED)` (MiB)
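
The same queries can be run outside Grafana through the Prometheus HTTP API (`/api/v1/query`). A rough sketch using only the standard library; `PROM_URL` is an assumption (for example a local port-forward), and `instant_query` is a hypothetical helper name.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable here, e.g. via kubectl port-forward.
PROM_URL = "http://localhost:9090"

QUERIES = {
    "tokens_per_sec": "sum(rate(tokens_total[1m]))",
    "gpu_util": "avg(DCGM_FI_DEV_GPU_UTIL)",
    "vram_used": "avg(DCGM_FI_DEV_FB_USED)",
}


def build_query_url(promql: str, base_url: str = PROM_URL) -> str:
    """URL-encode a PromQL expression for an instant query."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def instant_query(promql: str, base_url: str = PROM_URL) -> list[float]:
    """Run an instant query and return the sample values as floats."""
    with urllib.request.urlopen(build_query_url(promql, base_url), timeout=5) as resp:
        payload = json.load(resp)
    # Each vector result carries a [timestamp, "value"] pair; the value is a string.
    return [float(r["value"][1]) for r in payload["data"]["result"]]
```

Calling `instant_query(QUERIES["gpu_util"])` needs a live Prometheus; `build_query_url` works offline and is handy for checking the encoded query.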

---

## 🔎 Tracing (optional)

* Add **OpenTelemetry** → start with spans in the gateway only.
* Export to **Tempo/Jaeger** for debugging long-tail latency.

---

## 📈 Scaling signals (KEDA)

Use **requests in flight**, **queue depth**, or **tokens/sec per replica**.

```yaml
- type: prometheus
  metadata:
    serverAddress: http://<prometheus-host>:9090  # required by the scaler; placeholder
    metricName: requests_in_flight
    query: sum(app_inflight)
    threshold: "20"
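
Roughly, the autoscaler aims for `threshold` units of the metric per replica, so the target replica count is the query result divided by the threshold, rounded up. A simplified sketch that ignores tolerance and stabilization windows; the min/max bounds are assumed values, not part of the trigger above.

```python
import math


def desired_replicas(metric_total: float, threshold: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Approximate the replica count for a given total metric value."""
    raw = math.ceil(metric_total / threshold)
    # Clamp to the configured bounds (assumed here).
    return max(min_replicas, min(max_replicas, raw))


# 65 requests in flight with threshold 20 needs 4 replicas.
print(desired_replicas(65, 20))
```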

---

## 💰 Cost formula

```
cost_per_1k_tokens ≈ gpu_hourly_cost / (tokens_per_sec * 3.6)
```

👉 Goal: high GPU util + good p95 latency.
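
The `3.6` comes from 3600 seconds per hour divided by 1000 tokens. A short worked sketch; the price and throughput are made-up illustrative numbers.

```python
def cost_per_1k_tokens(gpu_hourly_cost: float, tokens_per_sec: float) -> float:
    """Cost of generating 1,000 tokens on one GPU at a steady throughput."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    # Equivalent to gpu_hourly_cost / (tokens_per_sec * 3.6).
    return gpu_hourly_cost / tokens_per_hour * 1000


# Hypothetical numbers: a $2.50/hour GPU sustaining 500 tokens/sec.
print(f"{cost_per_1k_tokens(2.50, 500):.4f}")  # 0.0014
```

Doubling sustained throughput on the same GPU halves the cost per 1k tokens, which is why utilization matters as much as the hourly price.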

---

## 🚨 Alerts (tiny start)

* Error rate > 2%
* p95 latency > SLO
* GPU util < 30% (wasted) or > 98% (hot)
* VRAM > 90% (OOM risk)
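
The thresholds above, written out as a plain check. This is only an illustration of the rules; in practice they would live in Prometheus alerting rules. `Snapshot` and `firing_alerts` are hypothetical names, and the SLO is passed in because the list leaves it open.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    error_rate: float      # fraction of requests, 0.02 == 2%
    p95_latency_s: float   # seconds
    gpu_util_pct: float    # 0-100
    vram_used_pct: float   # 0-100


def firing_alerts(s: Snapshot, slo_p95_s: float) -> list[str]:
    """Return the alerts from the list above that the snapshot triggers."""
    alerts = []
    if s.error_rate > 0.02:
        alerts.append("error rate > 2%")
    if s.p95_latency_s > slo_p95_s:
        alerts.append("p95 latency > SLO")
    if s.gpu_util_pct < 30:
        alerts.append("GPU util < 30% (wasted)")
    elif s.gpu_util_pct > 98:
        alerts.append("GPU util > 98% (hot)")
    if s.vram_used_pct > 90:
        alerts.append("VRAM > 90% (OOM risk)")
    return alerts
```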

---

## 🔍 Quick ops

```bash
kubectl -n monitoring get pods       # prom/grafana
kubectl -n llm logs deploy/gateway   # gateway logs
kubectl -n monitoring port-forward svc/grafana 3000:80   # Grafana on localhost:3000
```

---
