
---

# 🕸️ Traces (OpenTelemetry) & Trace IDs

> **Intent** → See a single request’s **journey** across services (API → DB → HTTP calls → queue) with **end-to-end timing and context**.

---

## 🧭 What Tracing Gives You

* **Causality & timing** across hops (who called what, how long).
* **Bottleneck detection** (slow DB query, external API, serialization).
* **Error localization** (which span failed, with attributes).
* **Correlation** with logs/metrics via **trace\_id**/**span\_id**.

---

## 🔗 Core Concepts

* **Trace**: a tree of **spans** representing one request flow.
* **Span**: timed operation with attributes (route, DB table, status).
* **Context propagation**: pass the W3C **traceparent** header (plus optional `tracestate`) on every outbound call so downstream spans join the same trace.
* **Resources**: service-level attributes (name, version, env, region).
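
To make context propagation concrete, here is a minimal standard-library sketch of building and parsing a W3C `traceparent` header (`00-<32 hex trace-id>-<16 hex span-id>-<2 hex flags>`). The function names are hypothetical and only version `00` is handled; in practice an OpenTelemetry propagator does this for you.

```python
import re
import secrets

# W3C Trace Context, version "00": lowercase hex trace-id, parent span-id, flags.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with a random trace-id and span-id."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 bytes  -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero ids are invalid and must not be propagated.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming: str) -> str:
    """Continue the caller's trace with a fresh span-id for the next hop."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return new_traceparent()  # nothing valid to continue: start a new trace
    trace_id, _parent_id, sampled = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{'01' if sampled else '00'}"
```

Every hop keeps the trace-id and sampled flag but mints its own span-id, which is what lets a backend reassemble the tree.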

---

## 🎯 What to Instrument

* **Inbound requests** (FastAPI middleware) → root span: route + status
* **DB calls** (SQLAlchemy, drivers) → query duration, rows, pool waits
* **HTTP clients** (httpx) → URL template, method, status, retries
* **Queues/jobs** (Celery/RQ/Arq) → link enqueue span ↔ worker span
* **Custom spans** for business steps (validation, pricing, rendering)
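
The shape of custom spans around business steps can be sketched with a toy in-memory tracer. This is not the OpenTelemetry SDK: `span`, `FINISHED_SPANS`, and `handle_order` are hypothetical names, and the example only shows how nested spans share a trace id and link to their parent.

```python
import contextvars
import secrets
import time
from contextlib import contextmanager

# Hypothetical in-memory "exporter": finished spans are appended here.
FINISHED_SPANS = []
_current_span = contextvars.ContextVar("current_span", default=None)


@contextmanager
def span(name, **attributes):
    """Open a timed span; it becomes the parent of spans opened inside it."""
    parent = _current_span.get()
    record = {
        "name": name,
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_id": parent["span_id"] if parent else None,
        "attributes": dict(attributes),
    }
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        # Runs on success and on exceptions, so failed steps are still timed.
        record["duration_ms"] = (time.perf_counter() - start) * 1000.0
        _current_span.reset(token)
        FINISHED_SPANS.append(record)


def handle_order(order):
    """Root span for the request, child spans for each business step."""
    with span("POST /orders", **{"http.route": "/orders"}) as root:
        with span("validate_order"):
            if not order.get("items"):
                raise ValueError("empty order")
        with span("price_order") as pricing:
            total = sum(item["price"] * item["qty"] for item in order["items"])
            pricing["attributes"]["order.item_count"] = len(order["items"])
        root["attributes"]["http.status_code"] = 201
        return total
```

Child spans finish before the root, so `FINISHED_SPANS` ends up ordered innermost-first; a waterfall view re-sorts them by start time and parent id.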

---

## 🧩 Attributes That Matter

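* `http.route`, `http.method`, `http.status_code` (the last two are named `http.request.method` and `http.response.status_code` in newer OpenTelemetry semantic conventions)
* `db.system`, `db.statement` (optional; redact PII), `db.sql.table`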
* `peer.service`, `net.peer.name`, `retry.count`
* `user.id`, `tenant.id` (optional; hashed/anonymized if needed)
* Always **redact secrets/PII** at the source
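
One way to redact at the source is to scrub the attribute dictionary before it is attached to a span. The sketch below is an assumption-laden illustration: the key lists, the `scrub_attributes` name, and the salt handling are hypothetical, and a keyed hash (HMAC) is used so identifiers stay correlatable without being readable.

```python
import hashlib
import hmac

# Assumed policy for illustration: adjust both sets to your own attribute names.
DROP_KEYS = {"http.request.header.authorization", "db.password", "api_key"}
HASH_KEYS = {"user.id", "tenant.id"}


def scrub_attributes(attributes: dict, salt: bytes) -> dict:
    """Return a copy that is safe to export: secrets dropped, ids hashed."""
    clean = {}
    for key, value in attributes.items():
        if key in DROP_KEYS:
            continue  # never export secrets, not even hashed
        if key in HASH_KEYS:
            digest = hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256)
            clean[key] = digest.hexdigest()[:16]  # short, stable pseudonym
        else:
            clean[key] = value
    return clean
```

The same input and salt always yield the same pseudonym, so traces for one user can still be grouped; rotating the salt breaks that link on purpose.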

---

## 🧪 Sampling Strategy

* **Head sampling** (at request start): keep X% of traces globally.
* **Tail sampling** (in the collector, once the trace is complete): keep **slow or failed** traces at a higher rate.
* **Dynamic** rates: raise sampling during incidents or for hot routes, keyed on tags such as `env=prod`.
* Keep storage costs sane while preserving **useful** traces.
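
The difference between the two decisions can be sketched in a few lines. This is a simplified illustration, not the SDK's or the collector's actual algorithm; `head_sample`, `tail_keep`, the 500 ms threshold, and the 1% baseline are assumptions.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide at request start, from the trace id alone.

    Deriving the decision from the id means every service that sees the
    same trace reaches the same answer without coordination.
    """
    if ratio >= 1.0:
        return True
    if ratio <= 0.0:
        return False
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < int(ratio * (1 << 64))


def tail_keep(spans: list, slow_ms: float = 500.0, baseline_ratio: float = 0.01) -> bool:
    """Decide after the trace is complete, when errors and latency are known."""
    if not spans:
        return False
    if any(s.get("status") == "ERROR" for s in spans):
        return True  # always keep failures
    if max(s["duration_ms"] for s in spans) >= slow_ms:
        return True  # always keep slow traces
    # Healthy, fast traces: keep only a small baseline share.
    return head_sample(spans[0]["trace_id"], baseline_ratio)
```

Head sampling is cheap but blind to outcomes; tail sampling sees the outcome but requires buffering whole traces, which is why it usually lives in the collector.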

---

## 🔗 Correlating with Logs & Metrics

* Inject **trace\_id / span\_id** into log records (structured logging).
* Add **exemplars** to latency histograms for quick trace jump.
* From a 5xx spike → jump to sample traces → root cause fast.
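
Injecting ids into log records can be done with a standard `logging.Filter`. In this sketch the `current_trace` context variable is a hypothetical stand-in for wherever your tracer keeps the active span context; the logger name and JSON fields are likewise illustrative.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active span context; a real tracer keeps its own.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace/span ids onto every log record."""

    def filter(self, record):
        ctx = current_trace.get() or {}
        record.trace_id = ctx.get("trace_id", "")
        record.span_id = ctx.get("span_id", "")
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so log backends can index the ids."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


def build_logger(stream):
    """Wire the filter and formatter onto a single stream handler."""
    logger = logging.getLogger("orders")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With the ids on every line, a log search for one `trace_id` returns exactly the records written while that request was in flight.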

---

## 🧯 Error Handling

* Mark spans with **status=ERROR** and **exception** events.
* Record **type**, **message**, **stack** (sanitized).
* Avoid double-reporting: log once at the edge, annotate in spans.
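
A rough sketch of annotating a span on failure, using a plain dict as a stand-in for a real span object. `record_exception` and `run_step` are hypothetical helpers; the stack is reduced to function names, file basenames, and line numbers so no local values or source text leak. Exception messages can still carry PII and may need their own scrubbing.

```python
import os
import time
import traceback


def record_exception(span: dict, exc: BaseException, max_frames: int = 5) -> None:
    """Mark the span as failed and attach one sanitized exception event."""
    frames = traceback.extract_tb(exc.__traceback__)[-max_frames:]
    stack = "\n".join(
        f"{frame.name} ({os.path.basename(frame.filename)}:{frame.lineno})"
        for frame in frames
    )
    span["status"] = "ERROR"
    span.setdefault("events", []).append({
        "name": "exception",
        "timestamp": time.time(),
        "attributes": {
            "exception.type": type(exc).__name__,
            "exception.message": str(exc),
            "exception.stacktrace": stack,
        },
    })


def run_step(span: dict, func):
    """Run one step; annotate the span on failure, then re-raise.

    Re-raising without logging here leaves the single log line to the
    edge handler, which avoids double-reporting.
    """
    try:
        return func()
    except Exception as exc:
        record_exception(span, exc)
        raise
```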

---

## 📊 Visualizations (Backends)

* Use **OTel Collector** → export to Jaeger/Tempo/Zipkin/New Relic/Datadog.
* Dashboards: **waterfall views**, **service maps**, **top slow routes**.
* Compare traces **before/after** deploys or feature flags.

---

## 🧱 Operational Hygiene

* Standardize **service.name**, **deployment.environment**, **service.version**.
* Keep attribute **cardinality** bounded; don’t tag with raw IDs everywhere.
* Ensure **propagation** through gateways/proxies; don’t strip headers.
* Budget telemetry cost: sample, drop noisy spans, compress exports.
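
Bounding cardinality often comes down to recording a route template instead of the raw path. The helper below is a hypothetical fallback for cases where the framework does not supply the matched route; it only recognizes numeric and UUID segments and drops the query string.

```python
import re

_UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def template_path(path: str) -> str:
    """Collapse raw ids in a URL path into placeholders."""
    path_only = path.split("?", 1)[0]  # query strings are unbounded: drop them
    parts = []
    for segment in path_only.split("/"):
        if segment.isdigit():
            parts.append("{id}")
        elif _UUID_RE.match(segment):
            parts.append("{uuid}")
        else:
            parts.append(segment)
    return "/".join(parts)
```

For example, `template_path("/users/42/orders/1001?page=2")` yields `/users/{id}/orders/{id}`, so millions of distinct URLs map to one attribute value.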

---

## ✅ Outcome

You gain **end-to-end visibility**: a request’s full path with **timings, errors, and context**, linked to logs and metrics, so you can **diagnose performance issues** and **resolve incidents** quickly.
