
---

# 📊 **Metrics · Tracing · GPU Util**

---

## 🎯 What to measure (LLM apps)

* ⚡ **Throughput** → tokens/sec, requests/sec, queue depth
* ⏱ **Latency** → p50/p95
* ❌ **Errors** → 4xx/5xx, timeouts
* 🎮 **GPU** → util %, VRAM used
* 📂 **Cache/IO** → HF cache hit %, PV read

---

## 🛠 Quick setup

* 🟢 **metrics-server** → `kubectl top pods`
* 📈 **Prometheus + Grafana** → dashboards
* 🎮 **DCGM exporter** → GPU metrics

👉 Start Prometheus+Grafana → add DCGM for GPU.

---

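## 🧩 App metrics (FastAPI example)

Expose `/metrics` with a Prometheus counter and histogram:

```python
from prometheus_client import Counter, Histogram  # pip install prometheus-client

TOKENS = Counter("tokens_total", "tokens emitted")        # only ever increases
LAT    = Histogram("latency_seconds", "request latency")  # bucketed durations
```

👉 Mount the metrics endpoint at `/metrics` for scraping.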

---

## 📡 Prometheus scrape (sketch)

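```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gateway        # placeholder name; required by the API
spec:
  selector: { matchLabels: { app: gateway } }   # matches labels on the Service
  endpoints:
    - port: metrics    # name of the Service port that serves /metrics
      interval: 15s
```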

---

## 📊 Grafana queries

* Tokens/sec → `sum(rate(tokens_total[1m]))`
* Error rate → `5xx / total`
* p95 latency → `histogram_quantile(0.95, … )`
* GPU util → `avg(DCGM_FI_DEV_GPU_UTIL)`
* VRAM used → `avg(DCGM_FI_DEV_FB_USED)`
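
As a rough illustration of what `histogram_quantile` computes for the p95 query, the sketch below estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The bucket bounds and counts are made-up sample data, and `estimate_quantile` is a hypothetical helper rather than part of any Prometheus client; the real function handles more edge cases.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with (math.inf, total_count), like Prometheus `le` buckets.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # report the highest known finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Sample data (assumed): 100 requests, bucket bounds in seconds.
sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample)
print(round(p95, 4))
```

👉 The estimate is only as precise as the bucket layout: put bucket bounds close to your SLO so p95 does not land in a very wide bucket.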

---

## 🔎 Tracing (optional)

* Add **OpenTelemetry** → gateway spans only.
* Export to **Tempo/Jaeger** for debugging long-tail latency.

---

## 📈 Scaling signals (KEDA)

Use **requests in flight**, **queue depth**, or **tokens/sec per replica**.

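```yaml
- type: prometheus
  metadata:
    serverAddress: http://prometheus.monitoring.svc:9090  # placeholder; use your Prometheus URL
    metricName: requests_in_flight
    query: sum(app_inflight)
    threshold: "20"   # target value per replica
```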
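
To see how a threshold turns into a replica count: with an average-value target, the autoscaler aims for roughly `ceil(metric / threshold)` replicas, clamped to the configured bounds. The sketch below assumes that behaviour and ignores HPA tolerance and stabilization windows; `desired_replicas` and the min/max defaults are hypothetical.

```python
import math

def desired_replicas(metric_value, threshold, min_replicas=1, max_replicas=10):
    """Approximate replica count for an average-value scaling target."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    wanted = math.ceil(metric_value / threshold)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, wanted))

# Total in-flight requests across the deployment (assumed values).
for inflight in (5, 45, 400):
    print(inflight, "->", desired_replicas(inflight, threshold=20))
```

👉 With `threshold: "20"`, 45 requests in flight asks for 3 replicas; the cap keeps a traffic spike from scaling past what the GPU pool can hold.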

---

## 💰 Cost formula

```
cost_per_1k_tokens ≈ gpu_hourly_cost / (tokens_per_sec * 3.6)
```

👉 Goal: high GPU util + good p95 latency.
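
A quick worked example of the formula: an hour has 3,600 seconds, so a GPU sustaining `tokens_per_sec` produces `tokens_per_sec * 3.6` thousand tokens per hour. The price and throughput figures below are illustrative numbers, not benchmarks.

```python
def cost_per_1k_tokens(gpu_hourly_cost, tokens_per_sec):
    """Approximate serving cost per 1,000 tokens for one fully billed GPU."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    thousand_tokens_per_hour = tokens_per_sec * 3600 / 1000  # same as * 3.6
    return gpu_hourly_cost / thousand_tokens_per_hour

# Hypothetical inputs: a $2.50/hour GPU at two sustained throughputs.
low = cost_per_1k_tokens(2.50, 500)
high = cost_per_1k_tokens(2.50, 1500)
print(f"{low:.6f} {high:.6f}")
```

👉 Tripling sustained throughput on the same GPU cuts cost per 1k tokens to a third, which is why idle GPU time is the first thing to chase.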

---

## 🚨 Alerts (tiny start)

* Error rate > 2%
* p95 latency > SLO
* GPU util <30% (wasted) or >98% (hot)
* VRAM >90% (OOM risk)
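
The thresholds above can be sketched as a plain check, for example to sanity-test values before encoding them as alert rules. The `Snapshot` fields and the 2-second SLO default are assumptions for illustration, and the error rate here counts 5xx responses only; in practice these conditions would normally live in Prometheus alerting rules.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """Hypothetical point-in-time readings, already aggregated."""
    requests: int
    errors_5xx: int
    p95_latency_s: float
    gpu_util_pct: float
    vram_used_pct: float

def firing_alerts(s, p95_slo_s=2.0):
    """Return the names of the starter alerts that would fire."""
    alerts = []
    error_rate = s.errors_5xx / s.requests if s.requests else 0.0
    if error_rate > 0.02:
        alerts.append("error_rate")
    if s.p95_latency_s > p95_slo_s:
        alerts.append("p95_latency")
    if s.gpu_util_pct < 30:
        alerts.append("gpu_underused")   # paying for idle GPU
    elif s.gpu_util_pct > 98:
        alerts.append("gpu_hot")         # no headroom left
    if s.vram_used_pct > 90:
        alerts.append("vram_high")       # OOM risk
    return alerts

healthy = Snapshot(1000, 5, 1.2, 75.0, 60.0)
stressed = Snapshot(1000, 40, 3.5, 99.0, 95.0)
print(firing_alerts(healthy))
print(firing_alerts(stressed))
```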

---

## 🔍 Quick ops

```bash
kubectl -n monitoring get pods       # prom/grafana
kubectl -n llm logs deploy/gateway   # gateway logs
kubectl -n monitoring port-forward svc/grafana 3000:80   # Grafana on localhost:3000
```

---
