
---

# 🧭 LangSmith — Ultimate Cheatsheet

> **Mental model:** LangSmith = **observability + evaluations** for LLM apps. It records **runs/spans/traces**, manages **datasets & experiments**, collects **feedback**, supports **A/B testing**, and exports **OpenTelemetry** to your own stack. Works standalone or with **LangChain/LangGraph**.

---

## 0) Setup & SDKs ⚙️

| What                     | How (Python/JS)                                             | Notes                                    |
| ------------------------ | ----------------------------------------------------------- | ---------------------------------------- |
| Create account & API key | Settings → **API Keys**                                     | Use separate keys per env (dev/stg/prod) |
| Enable tracing (JS)      | Set `LANGSMITH_TRACING=true` and `LANGSMITH_API_KEY`, or pass `LangChainTracer` | Works with JS LangChain out of the box   |
| Python SDK               | `pip install -U langsmith` → `from langsmith import Client` | Use `traceable` decorator for quick wins |
| Framework-agnostic       | Toggle via env vars or thin wrappers                        | Works with LC/LG, or any custom stack    |

**Soundbite:** *“API key + env var → instant runs/spans across Python & JS.”*

---

## 1) Tracing & Observability 🔎

| Concept             | What to say                                                          | Practical tip                                       |
| ------------------- | -------------------------------------------------------------------- | --------------------------------------------------- |
| Runs/Spans/Traces   | Every model/tool call = **run** (span); nested runs form a **trace** | Expand run tree to debug latency/tokens/errors      |
| Current span access | Log custom breadcrumbs/IDs inside a traced function                  | Attach request IDs, user IDs (hashed), release tags |
| Run schema          | IDs, inputs/outputs, timings, type, error fields                     | Keep inputs minimal; mask sensitive data            |
| Tags & metadata     | Add `tags` + `metadata` for env, release, feature                    | Make filtering & dashboards effortless              |
| Sampling & PII      | Tune sampling; redact before upload                                  | Define redaction policy centrally                   |
| OpenTelemetry       | Export to Prometheus/Grafana/Jaeger                                  | Unify LLM metrics with app/infra SLOs               |

**Quick Python (enable + tag runs):**

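```python
import os

os.environ["LANGSMITH_TRACING"] = "true"  # enable tracing
# LANGSMITH_API_KEY should come from your secret store, not from source code.

from langsmith import Client, traceable

# Sample 20% of traces. Assumes an SDK version whose Client accepts
# `tracing_sampling_rate`; otherwise set LANGSMITH_TRACING_SAMPLING_RATE=0.2.
client = Client(tracing_sampling_rate=0.2)

# Tags and metadata belong on the traced function, not on the Client.
@traceable(
    client=client,
    tags=["prod", "chat"],
    metadata={"service": "chat-api", "env": "prod", "release": "2025.10.1"},
)
def answer(q: str) -> str:
    ...  # call your model or chain here
```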
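
The sampling advice above can be kept in one place rather than scattered across services. The sketch below is a minimal illustration, assuming the SDK reads the `LANGSMITH_TRACING_SAMPLING_RATE` environment variable when the client is created. `SAMPLING_BY_ENV` and `configure_sampling` are hypothetical names, and the rates are placeholders, not recommendations.

```python
import os

# Hypothetical per-environment rates; tune for your own traffic and budget.
SAMPLING_BY_ENV = {"dev": 1.0, "staging": 0.5, "prod": 0.05}

def configure_sampling(env: str, environ=os.environ) -> float:
    """Pick a sampling rate for `env` and export it for the tracing SDK."""
    rate = SAMPLING_BY_ENV.get(env, 1.0)  # unknown envs trace everything
    if not 0.0 < rate <= 1.0:
        raise ValueError(f"sampling rate for {env!r} must be in (0, 1], got {rate}")
    # Assumption: the SDK picks this variable up when the client is created.
    environ["LANGSMITH_TRACING_SAMPLING_RATE"] = str(rate)
    return rate
```

Call `configure_sampling("prod")` once at process start, before any client is constructed. The range check rejects a 0% rate, so a misconfigured table fails loudly instead of silently disabling tracing.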

---

## 2) Datasets & Evaluations 📊

| Task                          | How                                                     | Notes                                          |
| ----------------------------- | ------------------------------------------------------- | ---------------------------------------------- |
| Create/manage datasets        | UI (Datasets → New) or SDK (`create_examples`)          | Keep dataset immutable; use versions           |
| Run evaluations               | `from langsmith.evaluation import evaluate`             | Name experiments clearly (`experiment_prefix`) |
| Compare experiments (A/B)     | Run on same dataset; compare metrics in Experiments     | Track deltas: quality, latency, cost           |
| Feedback-driven evals         | Log feedback on runs; slice by tag/feature/user         | Map thumbs-up/down to numeric scores           |
| Built-in vs custom evaluators | Use prebuilt evaluators (QA/faithfulness/context) or write custom functions; pass them to `evaluate` as callables | Start with prebuilt; add domain checks later  |

**Python: eval on a dataset**

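```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Custom evaluator. Assumes a recent SDK that accepts the
# (outputs, reference_outputs) signature; older versions use (run, example).
def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    return outputs == reference_outputs

report = evaluate(
    my_chain,                          # callable: example inputs -> outputs
    data="my_rag_dataset",             # dataset name or ID
    evaluators=[exact_match],          # callables, not string names
    experiment_prefix="rag-oct16",
    client=client,
)
```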
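
When a domain check is needed, a custom evaluator can be a plain function. The sketch below is one possible shape, assuming the evaluator receives the target's outputs and the example's reference outputs as dicts and may return a dict with `key` and `score`. The `answer` field name and the `token_overlap` helper are hypothetical.

```python
import re

def _tokens(text: str) -> set:
    """Lowercase alphanumeric tokens in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_overlap(outputs: dict, reference_outputs: dict) -> dict:
    """Fraction of reference tokens that also appear in the predicted answer."""
    predicted = _tokens(outputs.get("answer", ""))
    expected = _tokens(reference_outputs.get("answer", ""))
    score = len(predicted & expected) / len(expected) if expected else 0.0
    return {"key": "token_overlap", "score": score}
```

A graded score like this is easier to threshold in CI than a pass/fail flag, at the cost of being a crude proxy for correctness.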

---

## 3) Feedback & Annotations 📝

| Need                   | API / UX                                                        | Notes                                          |
| ---------------------- | --------------------------------------------------------------- | ---------------------------------------------- |
| User thumbs-up/down    | `Client().create_feedback(run_id, key="user-score", score=1.0)` | Use consistent keys (`"user-score"`, `"csat"`) |
| Frontend-safe feedback | Presigned token for browser submissions                         | Avoid exposing API key; expire tokens          |
| Inline annotations     | Annotate a run or queue tasks for reviewers                     | Great for agent trajectories                   |
| Feedback schema        | Human, LLM, or system feedback                                  | Store reason text + score for analytics        |

**One-liner:** *“User feedback is first-class: it is stored per run and can be sliced alongside cost and latency over time.”*
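
Mapping thumbs-up/down to a numeric score under a consistent key is easiest to enforce with one small helper. The sketch below is illustrative only: `THUMB_SCORES` and `feedback_kwargs` are hypothetical names, and it assumes the feedback call accepts `key`, `score`, and `comment` keyword arguments.

```python
from typing import Optional

# Hypothetical mapping from UI signals to numeric scores.
THUMB_SCORES = {"up": 1.0, "down": 0.0}

def feedback_kwargs(thumb: str, reason: Optional[str] = None,
                    key: str = "user-score") -> dict:
    """Build keyword arguments for a feedback call from a thumbs signal."""
    if thumb not in THUMB_SCORES:
        raise ValueError(f"unknown thumb value: {thumb!r}")
    kwargs = {"key": key, "score": THUMB_SCORES[thumb]}
    if reason and reason.strip():
        kwargs["comment"] = reason.strip()  # keep the reason text for analytics
    return kwargs
```

Usage would look like `Client().create_feedback(run_id, **feedback_kwargs("down", "wrong citation"))`.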

---

## 4) Prompt Management 🧱

| Capability             | How                                           | Notes                                            |
| ---------------------- | --------------------------------------------- | ------------------------------------------------ |
| Version & push prompts | Programmatically push named prompts with tags | Treat prompts like code: description + changelog |
| A/B prompt comparisons | Run variants on same dataset & compare        | Pair with snapshot tests to catch regressions    |
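
A snapshot test for prompts can be as simple as fingerprinting the template and comparing it with a recorded value. The sketch below is a stdlib-only illustration and does not use the LangSmith prompt API. `prompt_fingerprint` and `check_snapshot` are hypothetical helpers, and the in-memory `snapshots` dict stands in for whatever file or store you keep snapshots in.

```python
import hashlib
import json

def prompt_fingerprint(template: str, variables: list) -> str:
    """Stable SHA-256 fingerprint of a prompt template and its input variables."""
    payload = json.dumps({"template": template, "variables": sorted(variables)},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def check_snapshot(name: str, fingerprint: str, snapshots: dict) -> bool:
    """Return True if the prompt matches its recorded snapshot.

    A prompt with no recorded snapshot is stored and treated as a match.
    """
    recorded = snapshots.setdefault(name, fingerprint)
    return recorded == fingerprint
```

A failing check does not mean the change is wrong, only that the prompt changed and the evaluation suite should be rerun before the snapshot is updated.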

---

## 5) Integrations 🔌

| With                 | How                                              | Notes                                    |
| -------------------- | ------------------------------------------------ | ---------------------------------------- |
| LangChain (Py/JS)    | Env-based tracing or `traceable`; wrap providers | Add tags per feature/route               |
| LangGraph            | Use `RunnableConfig` to pass tags/metadata       | Name nodes consistently to map hot paths |
| OpenTelemetry stacks | Export traces/metrics to your collector          | Build cost/latency dashboards in Grafana |

---

## 6) Production Workflows 🏭

| Theme             | Checklist                                                                    |
| ----------------- | ---------------------------------------------------------------------------- |
| Debug & replay    | Reproduce with same inputs; diff runs; inspect error paths; attach snapshots |
| Canary & rollouts | Compare new vs baseline; gate on quality deltas and p95 latency              |
| Monitoring        | SLOs for latency/cost; rate limits/backpressure; OTel export enabled         |
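
Gating a canary on p95 latency comes down to a percentile and a comparison. The sketch below uses the nearest-rank method on raw latency samples. `p95`, `canary_passes`, and the 10% regression budget are illustrative assumptions, not LangSmith features.

```python
def p95(latencies_ms: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based rank
    return ordered[rank - 1]

def canary_passes(baseline_ms: list, canary_ms: list,
                  max_regression: float = 0.10) -> bool:
    """True if the canary's p95 is within `max_regression` of the baseline's."""
    return p95(canary_ms) <= p95(baseline_ms) * (1 + max_regression)
```

The rank is computed with integer arithmetic so that floating-point rounding cannot shift the selected sample.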

---

## 7) Security & Privacy 🔒

| Concern       | What you do                               | Notes                                  |
| ------------- | ----------------------------------------- | -------------------------------------- |
| PII in traces | Client-side masking/redaction/anonymizers | Hash emails/IDs; remove secrets        |
| Sampling      | Set `tracing_sampling_rate` per env       | Higher in dev, lower in prod (but >0%) |
| Data paths    | Use tags/metadata to isolate envs/tenants | Tie to retention policies              |
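
Client-side masking can be sketched as a pure function applied to a payload before it is traced. The example below is a minimal illustration under stated assumptions: the `SECRET_KEYS` field names, the `user_id` field, and the `[EMAIL]` placeholder are hypothetical, and the email pattern is deliberately simple rather than exhaustive.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_KEYS = {"api_key", "password", "token"}  # hypothetical field names

def hash_id(value: str, salt: str = "") -> str:
    """Short, stable pseudonym for an identifier."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

def redact(payload: dict) -> dict:
    """Return a copy of `payload` with secrets dropped, IDs hashed, emails masked."""
    clean = {}
    for key, value in payload.items():
        if key in SECRET_KEYS:
            continue  # never upload secrets
        if key == "user_id":
            clean["user_id_hash"] = hash_id(str(value))
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("[EMAIL]", value)
        elif isinstance(value, dict):
            clean[key] = redact(value)  # recurse into nested payloads
        else:
            clean[key] = value
    return clean
```

One place to plug this in may be the `process_inputs` and `process_outputs` arguments of `traceable`; confirm that your SDK version supports them. Use a non-empty salt in practice so that hashes cannot be matched against a precomputed table of known IDs.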

---

## 8) CI/CD & Automation 🤖

| Goal                  | How                                           | Notes                                        |
| --------------------- | --------------------------------------------- | -------------------------------------------- |
| Gate releases on eval | Run `evaluate(...)` in CI; fail on regression | Thresholds per metric (quality/latency/cost) |
| Example pipelines     | Use GH Actions or your CI to run eval suites  | Store reports as artifacts; link to runs     |
| OTel in prod          | Pipe metrics/traces to Prometheus/Grafana     | Alert on error spikes and p95 surges         |
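
A CI gate needs a rule that turns two sets of summary metrics into a pass or fail. The sketch below is one hypothetical way to express per-metric thresholds. `find_regressions`, the metric names, and the threshold format are assumptions for illustration, and how you extract summary metrics from an experiment is left out.

```python
def find_regressions(baseline: dict, candidate: dict, thresholds: dict) -> list:
    """Compare summary metrics and describe every threshold violation.

    `thresholds` maps metric -> (direction, max_change), where direction is
    "higher" (bigger is better) or "lower" (smaller is better).
    """
    failures = []
    for metric, (direction, max_change) in thresholds.items():
        delta = candidate[metric] - baseline[metric]
        worse_by = -delta if direction == "higher" else delta
        if worse_by > max_change:
            failures.append(f"{metric}: {baseline[metric]} -> {candidate[metric]}")
    return failures
```

In a pipeline, a non-empty result would typically be printed and followed by a non-zero exit code so the job fails.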

---

## 9) “Answer-in-a-Minute” Snippets ⚡

**A) Minimal tracing (Python)**

```python
from langsmith import traceable
@traceable(tags=["api"], metadata={"endpoint":"/chat"})
def chat(q): ...
```

**B) Attach user feedback**

```python
from langsmith import Client
# run_id is the ID of the traced run the user is rating.
Client().create_feedback(run_id, key="user-score", score=1.0)
```

**C) Create dataset & run eval**

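```python
from langsmith import Client
from langsmith.evaluation import evaluate

c = Client()
c.create_dataset("faq")  # the dataset must exist before examples are added
c.create_examples(dataset_name="faq",
                  inputs=[{"q": "hi"}],
                  outputs=[{"a": "hello"}])

def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    return outputs == reference_outputs  # evaluators are callables, not names

report = evaluate(my_chain, data="faq", evaluators=[exact_match])
```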

**D) OTel export (high level)**

* Enable LangSmith OTel → forward to collector → build Grafana dashboards.

---

## 10) Checklists ✅

**Prod Readiness**

* 🔖 Tags/metadata (env, release, user) • 🧰 Sampling set • 🕵️ PII redaction on inputs/outputs
* 🧪 Dataset + baseline experiment • 🧯 Error paths traced • 📊 OTel → Grafana/Prometheus
* 🚦 CI gates on eval deltas • 🧾 Cost/latency dashboards

**RAG/Agent Eval**

* 📚 Dataset coverage & slices • 🧪 Faithfulness/context precision • 🧭 Agent step/trajectory checks
* 🧉 Human feedback pipeline (inline/queue) • 🔁 Compare experiments before rollout

---

## 11) Common Pitfalls 🚫 → Fix ✅

| Pitfall                | Fix                                                     |
| ---------------------- | ------------------------------------------------------- |
| No evaluators or output contract | Start with prebuilt evaluators plus snapshot tests |
| Missing tags/metadata  | Standardize `env`, `release`, `feature`, `user_id_hash` |
| Logging raw PII        | Apply redaction masks before tracing                    |
| 0% sampling in prod    | Set a small but nonzero sampling rate (e.g., 1–5%)      |
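
Standardized metadata is easier to enforce when every service builds it through the same helper. The sketch below is a hypothetical example: `run_metadata` and `ALLOWED_ENVS` are illustrative names, and the 16-character hash length is an arbitrary choice.

```python
import hashlib
from typing import Optional

ALLOWED_ENVS = {"dev", "staging", "prod"}  # hypothetical environment names

def run_metadata(env: str, release: str, feature: str,
                 user_id: Optional[str] = None) -> dict:
    """Build the standard metadata dict attached to every traced run."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"unknown env: {env!r}")
    metadata = {"env": env, "release": release, "feature": feature}
    if user_id is not None:
        # Store a hash, never the raw identifier.
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        metadata["user_id_hash"] = digest[:16]
    return metadata
```

The result can be passed wherever the document attaches metadata, for example `@traceable(metadata=run_metadata("prod", "2025.10.1", "chat"))`.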

---

## 12) Quick Talking Points 🎤

* *“Each LLM/tool call is a **run**; nested runs form **traces** filterable by tags/metadata.”*
* *“We A/B **prompts/models** on consistent datasets and **gate releases** on eval deltas.”*
* *“**User feedback** is logged per run and tracked against **cost/latency**.”*
* *“We export **OTel** to Prometheus/Grafana for unified SLO dashboards.”*

---
