
---

# 📊 Metrics (Prometheus) & Dashboards

> **Intent** → Expose **quantitative signals** (latency, errors, throughput, resources) to detect issues early and guide performance work.

---

## 🎯 What to Measure (RED + USE)

* **Requests (RED)**: rate, errors, duration (p50/p90/p95/p99)
* **Resources (USE)**: utilization, saturation, and errors for CPU, memory, file descriptors, GC, event loop lag
* **DB/HTTP**: query counts, pool usage, timeouts/retries, upstream status codes
* **Queues/Jobs**: enqueued/running/failed, retry counts, age
* **Custom**: business KPIs (signups, orders, cache hit rate)

---

## 🧭 Metric Types

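* **Counter** → monotonically increasing; resets only on process restart (`requests_total`, `errors_total`)
* **Gauge** → current value that can go up or down (`in_flight_requests`, `pool_size`)
* **Histogram** → distribution in cumulative buckets; quantiles are computed at query time and aggregate across instances (`request_duration_seconds`)
* **Summary** → client-side quantiles that cannot be aggregated across instances (use sparingly; prefer histograms)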
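
To make the histogram semantics concrete, here is a minimal stdlib-only sketch of cumulative buckets and a linearly interpolated quantile estimate, similar in spirit to what a query-time quantile function does. The `Histogram` class and its method names are illustrative assumptions, not the API of any Prometheus client library.

```python
import bisect
import math


class Histogram:
    """Toy histogram with Prometheus-style cumulative buckets (illustrative only)."""

    def __init__(self, bounds):
        # Upper bounds ("le" values); +Inf is always the last bucket.
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket counts, summed on read
        self.sum = 0.0

    def observe(self, value):
        # A value lands in the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value

    def cumulative(self):
        total, out = 0, []
        for bound, count in zip(self.bounds, self.counts):
            total += count
            out.append((bound, total))
        return out

    def quantile(self, q):
        """Estimate a quantile by linear interpolation inside the target bucket."""
        buckets = self.cumulative()
        total = buckets[-1][1]
        if total == 0:
            return math.nan
        rank = q * total
        lower, prev = 0.0, 0
        for bound, cum in buckets:
            if cum >= rank and cum > prev:
                if math.isinf(bound):
                    # Cannot interpolate into +Inf; report the highest finite bound.
                    return lower
                return lower + (bound - lower) * (rank - prev) / (cum - prev)
            lower, prev = bound, cum
        return lower


h = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.2, 0.3, 0.7):
    h.observe(seconds)
print(h.cumulative())
print(h.quantile(0.5))
```

The estimate is only as precise as the bucket layout, so bucket bounds are best chosen around the latency thresholds you alert on.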

---

## 🌐 Prometheus Scraping

* Expose `/metrics` endpoint (text format).
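* Label consistently: `service`, `env`, `route`, `method`, `status`, `version`, and optionally `tenant` (only if the tenant set is small and bounded).
* Avoid **high-cardinality** labels (`user_id`, `request_id`); every distinct label combination creates a new time series.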
* Set **scrape intervals** (e.g., 15s) and **retention** per environment.
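
As a rough illustration of what a `/metrics` response body looks like, the sketch below renders one counter family in the text exposition format using only the standard library. The helper names (`render_sample`, `render_counter`) and the sample values are hypothetical; in a real service a Prometheus client library would normally produce this output, and the sketch assumes the help text contains no characters that need escaping.

```python
def escape_label_value(value):
    # The text format escapes backslash, double quote, and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample line: name{key="value",...} value."""
    if labels:
        pairs = ",".join(
            f'{key}="{escape_label_value(str(val))}"'
            for key, val in sorted(labels.items())
        )
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


def render_counter(name, help_text, samples):
    """Render a counter family: HELP and TYPE lines followed by its samples."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    lines += [render_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "route": "/users/{id}", "status": "200"}, 42)],
)
print(body, end="")
```

Sorting the label keys keeps the output stable between scrapes, which makes diffs and tests easier to read.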

---

## 📈 Dashboards (Grafana)

* **Service overview**: RPS, error rate, p95/p99 latency, saturation (CPU/mem), top 5 slow routes.
* **Dependency health**: DB pool saturation, query latency, external API error/timeout rates.
* **SLO board**: availability %, latency SLO burn rate, error budgets.
* **Release watch**: metrics by `version` label during/after deploys.
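
The numbers behind an SLO board reduce to simple arithmetic. The sketch below assumes a request-based availability SLO where `good` and `total` are request counts over the SLO window; the function names are illustrative and not tied to any dashboard tool.

```python
def availability_pct(good, total):
    """Availability as a percentage; an empty window counts as fully available."""
    return 100.0 if total == 0 else 100.0 * good / total


def error_budget_remaining(good, total, slo_target):
    """Fraction of the window's error budget still unspent (negative means overspent)."""
    allowed_bad = (1.0 - slo_target) * total
    if allowed_bad == 0:
        # No traffic or a 100% target: there is no budget to spend.
        return 1.0 if total == good else float("-inf")
    return 1.0 - (total - good) / allowed_bad


# Example: 1,000,000 requests, 500 failures, 99.9% target.
print(availability_pct(999_500, 1_000_000))
print(error_budget_remaining(999_500, 1_000_000, 0.999))
```

With a 99.9% target, 1,000 failures per million requests are allowed, so 500 failures leave roughly half of the budget.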

---

## 🧪 Alerts (SLO-Based)

* **Multi-window, multi-burn** alerts (fast + slow burn) for:

  * Error budget exhaustion (5xx, 429, timeouts)
  * Latency SLO breach (p95 route latency)
  * DB pool saturation > N% sustained
  * Queue backlog age > threshold
* Page humans only for **user-impacting** issues; route noisy alerts to tickets.
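
The multi-window, multi-burn idea can be sketched as follows. Burn rate is the observed error ratio divided by the ratio the SLO allows, and an alert fires only when both a long and a short window exceed the threshold. The window pairs and thresholds used here (14.4 over 1h/5m, 6 over 6h/30m) are a commonly cited convention for a 30-day SLO and are an assumption, not values taken from this guide; tune them to your own SLO period.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def both_windows_burning(long_ratio, short_ratio, slo_target, threshold):
    # The long window shows the burn is significant; the short window shows
    # it is still happening, so the alert clears soon after recovery.
    return (
        burn_rate(long_ratio, slo_target) >= threshold
        and burn_rate(short_ratio, slo_target) >= threshold
    )


def classify(ratios, slo_target=0.999):
    """ratios maps window name to error ratio, e.g. {"5m": 0.02, "1h": 0.016, ...}."""
    if both_windows_burning(ratios["1h"], ratios["5m"], slo_target, 14.4):
        return "fast_burn"
    if both_windows_burning(ratios["6h"], ratios["30m"], slo_target, 6.0):
        return "slow_burn"
    return "ok"


print(classify({"5m": 0.02, "30m": 0.02, "1h": 0.016, "6h": 0.004}))
print(classify({"5m": 0.0, "30m": 0.008, "1h": 0.008, "6h": 0.007}))
print(classify({"5m": 0.05, "30m": 0.01, "1h": 0.005, "6h": 0.001}))
```

The third call shows why two windows matter: a brief spike in the 5-minute window alone does not trigger either alert.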

---

## 🧹 Cardinality & Cost Control

* Keep label sets **small and bounded**.
* Use **route templates** (e.g., `/users/{id}` not `/users/123`).
* Sample high-volume metrics if needed; link to traces with **exemplars** rather than putting trace IDs in labels.
* Drop or aggregate rarely used metrics.
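
One way to keep the `route` label bounded is to collapse ID-like path segments before recording the metric. The sketch below is a fallback heuristic that assumes IDs are either purely numeric or UUIDs; where the web framework exposes the matched route pattern, using that directly is usually more reliable. The function name `route_template` is hypothetical.

```python
import re

_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def route_template(path):
    """Collapse ID-like path segments so the route label stays bounded."""
    path = path.split("?", 1)[0]  # the query string never belongs in a label
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.match(segment) or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)


print(route_template("/users/123"))
print(route_template("/users/123/orders/550e8400-e29b-41d4-a716-446655440000?expand=1"))
```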

---

## 🔗 Correlating Signals

* Link dashboards to **logs** (`request_id`) and **traces** (`trace_id`).
* Use **exemplars** in histograms to jump to trace samples.
* Compare **before/after** metrics around deploys and feature flags.
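
For a sense of how an exemplar attaches a trace to a histogram bucket, the sketch below renders a single bucket line in the OpenMetrics style, where the exemplar follows a `#` after the sample value. Exemplars belong to the OpenMetrics exposition format rather than the classic text format, and the helper name, metric name, and trace ID here are made up for illustration.

```python
def render_bucket_with_exemplar(name, le, count, exemplar=None):
    """Render one histogram bucket line, optionally with a trace exemplar."""
    line = f'{name}_bucket{{le="{le}"}} {count}'
    if exemplar is not None:
        trace_id, observed_value = exemplar
        # The exemplar carries its own label set plus the observed value.
        line += f' # {{trace_id="{trace_id}"}} {observed_value}'
    return line


print(
    render_bucket_with_exemplar(
        "http_request_duration_seconds",
        "0.5",
        3,
        exemplar=("4bf92f3577b34da6a3ce929d0e0e4736", 0.43),
    )
)
```

Because the trace ID lives in the exemplar and not in a label, it adds no new time series.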

---

## 🧭 Runbook Essentials

* Document **meaning** of key metrics and common failure modes.
* Include **expected ranges** and **action steps** for alerts.
* Keep **owner** and **dashboard links** visible.

---

## ✅ Outcome

You get **actionable visibility**: clear dashboards, SLO-driven alerts, and low-noise metrics that quantify performance and reliability, so you can **detect, diagnose, and improve** quickly.

---
