
---

# 🧪 LangSmith (Eval) & Hooks

> **Intent** → Measure and improve LLM quality with **traces, datasets, evaluations, and CI gates** wired into your FastAPI flows.

---

## 🧭 What LangSmith Gives You

* **Tracing**: step-by-step runs (prompts, tool calls, latencies, errors).
* **Datasets**: curated inputs + expected outputs (goldens).
* **Evaluations**: automatic metrics (accuracy, BLEU/ROUGE), LLM-as-judge, custom scorers.
* **Comparisons**: run A/B on prompt/model/config changes.

---

## 🔗 Where to Hook in FastAPI

* **Request boundary**: log inputs, `request_id`, tenant, version.
* **Agent/tool layer**: capture tool calls, retries, errors, timings.
* **Output stage**: record final answer, tokens, cost, confidence.
* **Background jobs**: trace long-running evals asynchronously.
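
A minimal sketch of the hook points above, using only the standard library. `TRACE_LOG`, `traced`, `lookup`, and `handle_request` are hypothetical names for illustration; in a real service the LangSmith SDK's `traceable` decorator would likely fill the role of `traced`, and the records would go to a tracing backend rather than a list.

```python
import functools
import time
import uuid

# Hypothetical in-memory sink; a real app would ship these records to a tracing backend.
TRACE_LOG = []


def traced(step):
    """Record inputs, output or error, and latency for one step of a request."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {
                "step": step,
                "run_id": str(uuid.uuid4()),
                "inputs": {"args": args, "kwargs": kwargs},
            }
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                record["output"] = result
                return result
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and failure, so every step leaves a record.
                record["latency_s"] = time.perf_counter() - start
                TRACE_LOG.append(record)
        return wrapper
    return decorator


@traced("tool:lookup")
def lookup(term):
    # Stand-in for a tool call made by the agent layer.
    return {"term": term, "hits": 1}


@traced("request")
def handle_request(question, request_id, tenant):
    # Stand-in for a FastAPI endpoint body: request metadata travels with the trace.
    return {"answer": lookup(question), "request_id": request_id, "tenant": tenant}
```

Calling `handle_request("refund policy", request_id="r-1", tenant="acme")` appends the tool record first and the request record second, because the inner step finishes before the outer one.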

---

## 📚 Datasets & Goldens

* Store real user-like prompts with **expected outputs** or scoring rules.
* Keep **edge cases** (long context, multilingual, safety triggers).
* Version datasets; tag by **domain** and **difficulty**.
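
One possible shape for a versioned, tagged golden set, sketched with dataclasses. The class and field names (`GoldenExample`, `GoldenDataset`, `select`) are assumptions for illustration, not a LangSmith schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GoldenExample:
    prompt: str
    expected: str
    domain: str          # e.g. "billing", "legal"
    difficulty: str      # e.g. "easy", "hard"
    tags: tuple = ()     # edge-case markers such as "multilingual" or "long-context"


@dataclass
class GoldenDataset:
    name: str
    version: str
    examples: list = field(default_factory=list)

    def select(self, domain=None, difficulty=None, tag=None):
        """Return the examples matching every filter that was given."""
        return [
            ex for ex in self.examples
            if (domain is None or ex.domain == domain)
            and (difficulty is None or ex.difficulty == difficulty)
            and (tag is None or tag in ex.tags)
        ]


goldens = GoldenDataset(
    name="support-faq",
    version="2024-01",
    examples=[
        GoldenExample("How do I reset my password?", "Use the reset link.", "account", "easy"),
        GoldenExample("¿Cómo cancelo mi plan?", "Go to Billing > Cancel.", "billing", "hard",
                      tags=("multilingual",)),
    ],
)
```

Filtering by tag, for example `goldens.select(tag="multilingual")`, makes it easy to run the edge cases as their own slice.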

---

## 📏 Scoring & Metrics

* **Exact vs semantic match**: string equality for exact answers, embedding similarity for paraphrases.
* **Task-specific**: extraction accuracy, field-level F1, hallucination rate.
* **LLM-as-judge**: rubric-based scoring with **bias controls** (multiple judges, consensus).
* **Cost/latency**: tokens, wall time per step.
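
Three of these scorers are small enough to sketch directly. The function names are hypothetical, and using the median as the judge-consensus rule is an assumption; a majority vote or a mean would be equally reasonable.

```python
from statistics import median


def exact_match(predicted: str, expected: str) -> float:
    """1.0 if the strings match after trimming whitespace and ignoring case, else 0.0."""
    return float(predicted.strip().casefold() == expected.strip().casefold())


def field_f1(predicted: dict, expected: dict) -> float:
    """Field-level F1 for extraction: a field counts only if key and value both match."""
    if not predicted and not expected:
        return 1.0
    correct = sum(1 for k, v in predicted.items() if k in expected and expected[k] == v)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(expected)
    return 2 * precision * recall / (precision + recall)


def judge_consensus(scores: list) -> float:
    """Combine rubric scores from several judges; the median damps a single outlier."""
    return median(scores)
```

For example, predicting `{"name": "Ada", "city": "Paris"}` against an expected `{"name": "Ada", "city": "London", "year": "1843"}` gives precision 1/2 and recall 1/3, so F1 is 0.4.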

---

## 🔁 Regression & A/B

* Run **candidate vs baseline** across the same dataset.
* Flag **quality regressions** (thresholds per metric).
* Record **diff artifacts** (where candidate failed/won).
* Promote only if **guardrails + metrics** pass.
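
The comparison above can be sketched as a function over per-example scores. The data layout (`{metric: {example_id: score}}`), the `max_drop` thresholds, and the `guardrails_ok` flag are assumptions chosen for illustration.

```python
from statistics import mean


def compare_runs(baseline, candidate, max_drop, guardrails_ok=True):
    """Compare candidate vs baseline on the same examples.

    baseline / candidate: {metric: {example_id: score}}
    max_drop: {metric: largest tolerated drop in the mean score}
    """
    report = {"regressions": {}, "won": [], "lost": []}
    for metric, allowed in max_drop.items():
        base_mean = mean(baseline[metric].values())
        cand_mean = mean(candidate[metric].values())
        if base_mean - cand_mean > allowed:
            report["regressions"][metric] = {"baseline": base_mean, "candidate": cand_mean}
        # Diff artifacts: which examples the candidate improved or broke.
        for example_id, base_score in baseline[metric].items():
            cand_score = candidate[metric][example_id]
            if cand_score > base_score:
                report["won"].append((metric, example_id))
            elif cand_score < base_score:
                report["lost"].append((metric, example_id))
    # Promote only when both the guardrails and every metric threshold pass.
    report["promote"] = guardrails_ok and not report["regressions"]
    return report
```

Keeping the `won` and `lost` lists alongside the aggregate means a flat average cannot hide examples that flipped from pass to fail.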

---

## 🧯 Guardrails with Hooks

* **Pre-exec**: input policy checks (PII, prompt injection).
* **Mid-exec**: tool allowlists, depth/step limits, rate caps.
* **Post-exec**: output validators (schema, toxicity, safety tags).
* On violation: **abort the run or downgrade** the model, and log the evidence.
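
A rough sketch of the three hook stages as plain functions that raise on violation. The regex, the injection marker list, the tool allowlist, and the step limit are illustrative placeholders, not production-grade detectors.

```python
import json
import re


class GuardrailViolation(Exception):
    def __init__(self, stage, reason):
        super().__init__(f"{stage}: {reason}")
        self.stage = stage
        self.reason = reason


# Illustrative policies only; real checks would be far more thorough.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
INJECTION_MARKERS = ("ignore previous instructions",)
ALLOWED_TOOLS = {"search", "calculator"}
MAX_STEPS = 5


def pre_exec(user_input):
    """Input policy: reject obvious PII and known injection phrases."""
    if EMAIL.search(user_input):
        raise GuardrailViolation("pre", "input contains an email address")
    lowered = user_input.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        raise GuardrailViolation("pre", "possible prompt injection")


def mid_exec(tool_name, step):
    """Execution policy: tool allowlist and a cap on agent steps."""
    if tool_name not in ALLOWED_TOOLS:
        raise GuardrailViolation("mid", f"tool not allowed: {tool_name}")
    if step > MAX_STEPS:
        raise GuardrailViolation("mid", f"step limit exceeded: {step}")


def post_exec(raw_output, required_keys=("answer",)):
    """Output policy: must be a JSON object containing the required keys."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        raise GuardrailViolation("post", "output is not valid JSON") from None
    if not isinstance(data, dict):
        raise GuardrailViolation("post", "output is not a JSON object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise GuardrailViolation("post", f"missing keys: {missing}")
    return data
```

The caller catches `GuardrailViolation`, logs `stage` and `reason` as evidence, and then decides whether to abort or retry with a different model.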

---

## 🧪 CI Integration

* PR pipeline: run **smoke eval** (small dataset) → block on regressions.
* Nightly: **full suite** across domains; publish dashboards.
* Store **run IDs** and link to PR/commit for traceability.
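
A CI gate can be as small as a function that turns eval scores into an exit code. This is a sketch under assumed inputs: `scores`, `thresholds`, `run_id`, and `commit` are hypothetical parameters that a pipeline would supply from its eval run and its environment.

```python
def ci_gate(scores, thresholds, run_id, commit):
    """Return (exit_code, summary); a non-zero exit code blocks the PR."""
    failures = {}
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        # A missing metric counts as a failure so a broken eval cannot pass silently.
        if value is None or value < minimum:
            failures[metric] = {"value": value, "minimum": minimum}
    summary = {"run_id": run_id, "commit": commit, "failures": failures}
    return (1 if failures else 0), summary
```

A pipeline script would print `summary` for the PR log and finish with `sys.exit(exit_code)`, so the stored run ID and commit stay linked to the pass or fail decision.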

---

## 🧬 Versioning & Reproducibility

* Pin **model**, **prompt**, **tools**, **temperature**, and **system message** per run.
* Log **config hashes**; attach to each trace.
* Keep **prompt diffs** human-readable for review.
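
One way to derive a config hash is to serialize the pinned settings canonically and hash the result. The `config_hash` helper and the example config values (including the model name) are hypothetical.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not change the hash."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_config = {
    "model": "example-model-v1",          # placeholder model name
    "temperature": 0.0,
    "system_message": "You are a helpful support agent.",
    "tools": ["search", "calculator"],
    "prompt_version": "v3",
}
run_hash = config_hash(run_config)
```

Attaching `run_hash` to each trace lets two runs be compared at a glance: equal hashes mean the same pinned settings, and any change to a pinned value produces a different hash.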

---

## 🔐 Privacy & Safety

* **Redact** sensitive inputs/outputs before logging.
* Use **tenant-aware** storage; restrict dataset access.
* Align with **compliance** (retention, deletion, audit trails).
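
Redaction can run as a last step before any payload is logged. The sketch below assumes two illustrative patterns (email and US-style phone numbers); real deployments need a broader, tested set of detectors.

```python
import re

# Illustrative patterns only; they will miss many real-world PII formats.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")


def redact_text(text: str) -> str:
    """Replace matched spans with fixed placeholders."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))


def redact(value):
    """Walk strings, dicts, and lists; leave other types (numbers, None) untouched."""
    if isinstance(value, str):
        return redact_text(value)
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [redact(item) for item in value]
    return value
```

Applying `redact` to both the inputs and the outputs of a trace record keeps raw PII out of stored runs and out of any dataset later built from them.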

---

## 📊 What to Dashboard

* Pass rate by dataset, latency distributions, token costs.
* Top failing cases & common error tags (format, hallucination, safety).
* Trend lines across releases; burn-down of known failure modes.
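
The dashboard numbers above reduce to a few aggregations over run records. The record fields (`dataset`, `passed`, `latency_s`, `error_tag`) are an assumed shape for illustration.

```python
from collections import Counter, defaultdict
from statistics import median


def summarize(runs):
    """Aggregate run records into pass rates, median latency, and top error tags."""
    by_dataset = defaultdict(list)
    for run in runs:
        by_dataset[run["dataset"]].append(run["passed"])
    return {
        "pass_rate": {name: sum(flags) / len(flags) for name, flags in by_dataset.items()},
        "median_latency_s": median(run["latency_s"] for run in runs),
        "top_error_tags": Counter(
            run["error_tag"] for run in runs if run["error_tag"]
        ).most_common(3),
    }


runs = [
    {"dataset": "faq", "passed": True, "latency_s": 0.2, "error_tag": None},
    {"dataset": "faq", "passed": False, "latency_s": 0.4, "error_tag": "format"},
    {"dataset": "legal", "passed": False, "latency_s": 1.0, "error_tag": "hallucination"},
    {"dataset": "legal", "passed": False, "latency_s": 0.6, "error_tag": "format"},
]
report = summarize(runs)
```

Computing the same summary per release and storing it gives the trend lines directly.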

---

## ✅ Outcome

Your LLM features become **measurable and reliable**: end-to-end traces, reproducible evals, CI gates, and safety hooks that **prevent regressions** and guide **prompt/model improvements**.
