Add freshness, fleet, chaos, goal-drift, retrieval-lineage, and root-cause modules by hidai25 · Pull Request #233 · hidai25/eval-view

hidai25 · 2026-05-15T09:32:59Z

Description

This PR introduces seven new pure-function modules and the accompanying CLI commands, plus comprehensive test coverage and contributor recipes. These modules extend EvalView's observability and testing capabilities:

Freshness (evalview.core.freshness + evalview.commands.freshness_cmd): Detects production-query coverage gaps in the test suite using Jaccard token similarity. Identifies uncovered production queries, clusters them by similarity, and synthesizes new test cases to close gaps—even when nothing has failed yet.
Fleet (evalview.core.fleet + evalview.commands.fleet_cmd): Aggregates monitor history from multiple instances (canary, prod, regional pods) into a unified view. Detects anomalies (instances deviating from fleet mean) and fleet-wide failures.
Root-Cause Hints (evalview.core.root_cause_hint): Narrates coordinated test failures with structured diagnostics. Uses a pluggable hinter architecture: each hinter looks for one specific cross-test pattern and returns a RootCauseHint with cause ID, narrative, evidence, and suggested actions.
Goal Drift (evalview.core.goal_drift): Detects when an agent trajectory diverges from the original user intent using Jaccard token overlap (deterministic baseline) plus a pluggable judge interface for smarter implementations.
Retrieval Lineage (evalview.core.retrieval_lineage): Attributes which retrieved chunks actually influenced the agent's output. Ships a deterministic token-overlap baseline plus pluggable attribution judges for embedding-based or LLM-based methods.
Chaos Injection (evalview.core.chaos): Defines controlled disruption modes (tool failure, latency spike, goal interruption) for agent simulation. Deterministic scenario planning with seed-based reproducibility.
OTel Semantic Conventions (evalview.core.otel_semconv): Portable OpenTelemetry vocabulary for agent-layer instrumentation (turns, tool selections, handoffs, memory reads, retrieval). Enables traces from any adapter to be consumed by any observability tool.

All modules are pure (no I/O, no network, no LLM by default) and follow EvalView's design patterns:

Deterministic, testable, CI-friendly
Pluggable interfaces for smarter implementations
Comprehensive docstrings and contributor recipes

Related Issue

Closes foundational work for production-traffic analysis, multi-instance monitoring, failure diagnosis, and agent-behavior observability.

Type of Change

New feature (non-breaking change which adds functionality)
Documentation update
Test improvements

Changes Made

Core Modules (Pure Functions)

evalview/core/freshness.py (500 lines): Query normalization, Jaccard similarity, coverage computation, query clustering, test synthesis
evalview/core/fleet.py (442 lines): Instance summarization, anomaly detection, fleet-wide failure detection
evalview/core/root_cause_hint.py (485 lines): Hinter registry, five shipped hinters (provider_rollout, runtime_fingerprint_shift, coordinated_tool_addition, coordinated_tool_removal, coordinated_output_drift)
evalview/core/goal_drift.py (259 lines): Trajectory summarization, Jaccard-based drift detection, pluggable judge interface
evalview/core/retrieval_lineage.py (318 lines): Token-overlap attribution baseline, pluggable judges, memory lineage, stale-memory detection
evalview/core/chaos.py (294 lines): Disruption modes (tool failure, latency spike, goal interruption), scenario planning, deterministic seed-based generation
evalview/core/otel_semconv.py (307 lines): Portable OTel span names, attribute constants, enum values for agent-layer instrumentation

CLI Commands

evalview/commands/freshness_cmd.py (418 lines): Load production queries from incidents/JSONL, compute coverage, propose new

https://claude.ai/code/session_011waTjdqtPP891LzYG9xknk

`evalview autopr` already closes the loop from a failing production trace to a pinned regression test. `evalview freshness` closes the complementary loop, from drifted traffic to a new capability test, *even when nothing has failed yet*. It directly attacks the "eval sets go stale, silent failures slip through" complaint that's been dominating r/AI_Agents.

Pipeline:

    evalview monitor ──▶ .evalview/incidents.jsonl ──┐
                                                     ├──▶ freshness ──▶ tests/coverage/*.yaml
    (optional) --from-log prod-queries.jsonl ────────┘

Algorithm (pure Jaccard, no network, no LLM, no embeddings; matches the regression_synth contract):

1. Pull production queries from incidents.jsonl + optional --from-log.
2. Normalize (lowercase, strip punctuation, collapse digit runs to <num> so order IDs don't shred clusters, drop a small stoplist).
3. Score each prod query's max Jaccard against the existing suite; below --threshold = uncovered.
4. Greedy single-linkage cluster the uncovered queries.
5. Drop clusters below --min-cluster-size (likely one-offs).
6. Print a ranked report; with --propose, write one YAML stub per cluster to tests/coverage/, idempotent on meta.coverage.slug.

Stubs are deliberately minimal: just the representative query, no expected assertions. We don't know what "correct" looks like for an uncovered query, and pretending we do creates false confidence. The reviewer's flow is review → snapshot → that becomes the baseline.

Wired into the CLI under the Production section next to monitor/autopr.

Coverage: 33 new tests; full suite stays green (1909 passed).
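The normalize, score, and cluster steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `evalview/core/freshness.py`: the function names, the stoplist contents, the separate `link_threshold` parameter, and the choice to score two empty token sets as 0.0 are all assumptions made for the example.

```python
import re

# Hypothetical stoplist; the commit message only says "a small stoplist".
STOPLIST = frozenset({"a", "an", "the", "my", "is", "to", "of", "for", "and", "please"})


def normalize(query: str) -> frozenset[str]:
    """Lowercase, strip punctuation, collapse digit runs to <num>, drop stopwords."""
    text = re.sub(r"[^\w\s]", " ", query.lower())
    text = re.sub(r"\d+", "<num>", text)
    return frozenset(tok for tok in text.split() if tok not in STOPLIST)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Size of the intersection over size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def uncovered(prod: list[str], suite: list[str], threshold: float) -> list[str]:
    """Production queries whose best Jaccard match in the suite is below threshold."""
    suite_tokens = [normalize(q) for q in suite]
    result = []
    for query in prod:
        tokens = normalize(query)
        best = max((jaccard(tokens, s) for s in suite_tokens), default=0.0)
        if best < threshold:
            result.append(query)
    return result


def cluster(queries: list[str], link_threshold: float, min_cluster_size: int) -> list[list[str]]:
    """Greedy single-linkage: join the first cluster with any sufficiently similar member."""
    clusters: list[list[str]] = []
    for query in queries:
        tokens = normalize(query)
        for members in clusters:
            if any(jaccard(tokens, normalize(m)) >= link_threshold for m in members):
                members.append(query)
                break
        else:
            clusters.append([query])
    # Drop likely one-offs, then rank the largest clusters first.
    kept = [c for c in clusters if len(c) >= min_cluster_size]
    return sorted(kept, key=len, reverse=True)
```

Because digit runs collapse to `<num>`, "Cancel order 12345, please!" and "cancel order 987" normalize to the same token set and land in one cluster, which is the behaviour step 2 is designed to produce.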

`detect_coordinated_incident` already answers "are these failures correlated?". The follow-up question users actually ask in the war room is "correlated *how*, and what do I do about it?". This commit ships a pluggable hinter layer that synthesizes existing diff signals into a narrated diagnosis with concrete next-step commands.

Architecture:

    detect_coordinated_incident
        ↓ passes diffs to
    analyze_root_cause_hint → picks the highest-priority matching RootCauseHint
        ↓ and attaches it to
    Incident.hint (new optional field, backward-compatible)
        ↓ consumed by
    Slack / Discord notifiers → render narrative + top action below the existing headline

Five hinters ship in v1, each a small pure function in `evalview/core/root_cause_hint.py`:
- provider_rollout (priority 100, high conf): ≥N tests with model_changed=True → likely silent provider rollout, surfaces the actual model-ID transition and suggests `evalview model-check --pin` first.
- runtime_fingerprint_shift (priority 80, medium conf): ≥N tests share a new fingerprint different from baseline.
- coordinated_tool_addition (priority 70, high conf): ≥N failing tests all newly call the same tool → prompt or tool-description edit, suggests `git log` + `grep` recipes.
- coordinated_tool_removal (priority 70, high conf): ≥N failing tests all stopped calling the same tool.
- coordinated_output_drift (priority 50, medium conf): ≥N tests with similarity < 0.7 and no tool change → output-only drift (model got chattier / terser / refused).

Selection is deterministic: highest (priority, confidence_rank) wins, ties broken by HINTERS registration order. Hinters are conservative by contract: when in doubt they return None and the incident still gets its basic classification.

Contributor side-work:
- New recipe `docs/agent-recipes/add-root-cause-hint.md` walks contributors through writing a new hinter as a single pure function, with explicit done-criteria and pitfalls.
- `HINTERS_ROADMAP` in the module lists six concrete good-first-issue hinters with sized scope (coordinated_cost_spike, coordinated_latency_spike, coordinated_refusal, coordinated_parameter_drift, coordinated_decision_drift, coordinated_retrieval_drop). Each is one function + one test file away from shipping.

Backward compatibility: `Incident.hint` is `Optional[RootCauseHint] = None`; every existing call site still constructs Incidents cleanly, notifiers fall back to the headline-only card when hint is None, and the full test suite (1929 passed) remains green.

Coverage: 20 new tests covering each hinter's positive case, negative case, and selection logic; integration test pins that `detect_coordinated_incident` attaches a hint when a hinter matches.
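The selection rule described above (highest priority, then confidence rank, then registration order) can be illustrated with a small sketch. Everything here is hypothetical scaffolding rather than the module's real API: the `Hint` dataclass, the dict-shaped diffs with `model_changed`, `similarity`, and `tools_changed` keys, the `MIN_TESTS` value standing in for the unspecified N, and the `select_hint` name are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}
MIN_TESTS = 2  # hypothetical stand-in for the "≥N tests" cutoff


@dataclass(frozen=True)
class Hint:
    cause_id: str
    priority: int
    confidence: str
    narrative: str


Hinter = Callable[[Sequence[dict]], Optional[Hint]]


def provider_rollout(diffs: Sequence[dict]) -> Optional[Hint]:
    """Match when enough tests report a changed model ID."""
    changed = [d for d in diffs if d.get("model_changed")]
    if len(changed) < MIN_TESTS:
        return None  # conservative: no match when in doubt
    return Hint("provider_rollout", 100, "high", f"{len(changed)} tests saw a model change")


def coordinated_output_drift(diffs: Sequence[dict]) -> Optional[Hint]:
    """Match when enough tests drifted in output without any tool change."""
    drifted = [
        d for d in diffs
        if d.get("similarity", 1.0) < 0.7 and not d.get("tools_changed")
    ]
    if len(drifted) < MIN_TESTS:
        return None
    return Hint("coordinated_output_drift", 50, "medium", f"{len(drifted)} tests drifted")


# Registration order doubles as the tie-breaker.
HINTERS: list[Hinter] = [provider_rollout, coordinated_output_drift]


def select_hint(diffs: Sequence[dict], hinters: Sequence[Hinter] = HINTERS) -> Optional[Hint]:
    """Run every hinter and keep the best match by (priority, confidence rank)."""
    matches = [h for h in (hinter(diffs) for hinter in hinters) if h is not None]
    if not matches:
        return None
    # max() returns the first maximal element, so ties fall to registration order.
    return max(matches, key=lambda h: (h.priority, CONFIDENCE_RANK[h.confidence]))
```

When both hinters match the same set of diffs, `provider_rollout` wins on priority, which mirrors the intent that a silent provider rollout is the first thing to rule out.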

`evalview since`, `progress`, and `drift` each tell you what happened in one monitor session. The moment you run more than one (canary + prod, a regional fleet, dev + CI both watching), those single-history commands start hiding regional failures inside fleet averages and vice versa. The "observing 1 → N agents" complaint is exactly this gap. `evalview fleet` is the cross-instance synthesizer.

    evalview fleet --dir .evalview/history/
    evalview fleet --history monitor-eu.jsonl --history monitor-us.jsonl
    evalview fleet --json > fleet.json
    evalview fleet --anomalies-only   # only show pods deviating from mean
    evalview fleet --require-clean    # CI gate

What it produces:
- Fleet header: instance count, total cycles, fleet pass rate, cost.
- Per-instance table sorted by pass rate (worst first).
- Anomaly callouts via Z-score against fleet pass-rate distribution (default ≥2σ). Refuses to call anything anomalous below n=3 instances: the math isn't meaningful and false anomalies are worse than none.
- Fleet-wide failures: tests failing in ≥40% of instances surface separately so operators can tell "everywhere" from "one bad pod".

Pure analytics: no network, no LLM. Reads JSONL written by `evalview monitor --history`. Forward-compatible: silently skips record shapes it doesn't recognize so future monitor record types don't break old rollups.

Architecture:

    evalview/core/fleet.py           # pure data + math (frozen dataclasses)
    evalview/commands/fleet_cmd.py   # Click CLI + Rich rendering

Coverage: 23 new tests. Full suite stays green (1952 passed).

CLI registered in the Production section next to monitor / autopr / freshness so all the cross-cutting "what's happening across my agents" commands live together.
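The two fleet-level checks can be sketched as follows. This is an illustration under stated assumptions, not the contents of `evalview/core/fleet.py`: the function names and input shapes are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is a choice made here because the commit message does not say which estimator the module uses.

```python
from collections import Counter
from statistics import mean, pstdev


def pass_rate_anomalies(
    pass_rates: dict[str, float],
    z_threshold: float = 2.0,
    min_instances: int = 3,
) -> dict[str, float]:
    """Return {instance: z-score} for instances at least z_threshold sigmas from the mean."""
    if len(pass_rates) < min_instances:
        return {}  # too few instances for the statistic to mean anything
    values = list(pass_rates.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return {}  # every instance agrees, so nothing deviates
    scores = {name: (rate - mu) / sigma for name, rate in pass_rates.items()}
    return {name: z for name, z in scores.items() if abs(z) >= z_threshold}


def fleet_wide_failures(
    failing_by_instance: dict[str, set[str]],
    min_fraction: float = 0.4,
) -> list[str]:
    """Tests that fail in at least min_fraction of the instances."""
    n = len(failing_by_instance)
    if n == 0:
        return []
    counts = Counter(test for tests in failing_by_instance.values() for test in tests)
    return sorted(test for test, c in counts.items() if c / n >= min_fraction)
```

One consequence of the population estimator is worth knowing: with n instances the largest possible |z| is sqrt(n - 1), so a 2σ threshold cannot fire on very small fleets regardless of the explicit n=3 guard.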

Adds a focused "Threat Model: Untrusted Inputs" section motivated by the May 2026 Langfuse RCE (single OTel trace request → prototype pollution → cross-project access). The new section enumerates every external input EvalView accepts, the parser used for each, the sandbox posture, and what EvalView explicitly does *not* do (no pickle, no eval/exec, no unsafe yaml.load, no code execution from cassettes).

Also documents:
- Cloud sync isolation contract (opt-in, OAuth, server-enforced per-project boundary).
- MCP server attack surface (stdio by default; network is out-of-band and unsupported).
- Where to report parser bugs (Security Advisory, not public issue).

No code changes; doc-only commit.

…ntions (draft v0)

Upstream OTel's gen_ai.* covers model-call instrumentation. The agent layer above it (turns, tool selections, handoffs, memory reads, retrieval, human interventions) has no shared vocabulary, so traces from one stack are useless to another. Every observability vendor invents its own attribute names. The "make traces portable" complaint in the agent-eval discourse is exactly this gap.

This commit ships a portable spec EvalView can both emit and consume, and that any other tool can adopt without depending on EvalView.

Coverage:
- 10 span kinds: agent.run, agent.turn, agent.tool_choice, agent.tool_call, agent.handoff, agent.memory.{read,write}, agent.retrieval, agent.intervention, agent.plan.
- ~30 attributes covering identity, state fingerprints, goal text + drift delta, tool choice rationale + alternatives, parameter fingerprints, retrieval chunks + influence scores, memory keys + age, handoff targets, intervention outcomes, cost, verdict.
- Validation helpers: is_known_span(), is_known_attribute(), attributes_for_span(span) returning the recommended attribute set.

Status: draft v0. OTEL_SEMCONV_VERSION = "0.1.0"; bumped on breaking change so consumers can branch on rev. EvalView root spans will stamp the version so consumers know how to parse.

No new dependencies: constants are plain strings. Adapters with the OTel SDK emit real spans; adapters without it can attach the same attributes to their existing trace dicts.

Contributor side-work:
- Recipe `docs/agent-recipes/add-otel-emission.md` walks adapter authors through wiring the constants into their emitter, with pitfalls (don't invent names; fingerprint, don't dump; preserve parent_run.id across handoffs).
- Recipe lists 4 sized roadmap items as good first issues: LangGraph wiring, `evalview validate-trace` command, JSON-Schema export, verdict_at_step extension.

Coverage: 14 tests pinning constants↔index round-trip, naming convention, validation helpers, and the recommendation table. Full suite stays green.
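A plain-string vocabulary with validation helpers might be laid out roughly like this. The span names, the three attribute names, the helper names, and the version string come from the commit messages in this PR; the container names (`SPAN_KINDS`, `RECOMMENDED`), the exact return types, and the deliberately tiny recommendation table are assumptions, and the real `evalview/core/otel_semconv.py` covers about 30 attributes.

```python
OTEL_SEMCONV_VERSION = "0.1.0"

# The 10 span kinds named in the commit message.
SPAN_KINDS = frozenset({
    "agent.run", "agent.turn", "agent.tool_choice", "agent.tool_call",
    "agent.handoff", "agent.memory.read", "agent.memory.write",
    "agent.retrieval", "agent.intervention", "agent.plan",
})

# Partial, illustrative recommendation table: only the span/attribute
# pairings that this PR's commit messages mention explicitly.
RECOMMENDED = {
    "agent.turn": frozenset({"agent.goal.drift_delta"}),
    "agent.retrieval": frozenset({"agent.retrieval.influence_scores"}),
    "agent.memory.read": frozenset({"agent.memory.age_seconds"}),
}

KNOWN_ATTRIBUTES = frozenset().union(*RECOMMENDED.values())


def is_known_span(name: str) -> bool:
    return name in SPAN_KINDS


def is_known_attribute(name: str) -> bool:
    return name in KNOWN_ATTRIBUTES


def attributes_for_span(span: str) -> frozenset:
    """Recommended attributes for a span kind (empty if none are listed here)."""
    return RECOMMENDED.get(span, frozenset())


# An adapter without the OTel SDK can attach the same keys to a plain trace dict.
turn_record = {"span": "agent.turn", "attributes": {"agent.goal.drift_delta": 0.42}}
unknown = [k for k in turn_record["attributes"] if not is_known_attribute(k)]
```

Keeping the constants as plain strings is what lets a consumer validate `turn_record` without importing any tracing library.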

…istic + judge slot)

Decision rationale logging answers *what* the agent chose at each step. This module answers the question one layer above: *is the trajectory still about the original ask, or did the agent quietly wander?*

The classic failure: user asks "cancel my subscription and refund the last charge"; agent does account lookup, plan check, terms review, then answers a pricing question. By step 12, the trajectory looks nothing like the goal. A drift signal at step 6 ("the agent is no longer working on a 'cancel + refund' goal") catches it cheaply before the user is unhappy.

What ships:
- GoalEvent + GoalDriftAnalysis dataclasses (frozen, JSON-friendly).
- Deterministic baseline: Jaccard token overlap between stated goal and a trajectory summary (last 8 events, weighted toward current intent), with the same digit-collapse-to-<num> normalization freshness uses so order IDs don't shred similarity.
- GoalDriftJudge type: pluggable callable for smarter judges (embedding cosine, LLM, whatever). A judge that returns None falls back to the baseline; a judge that raises is caught silently and also falls back. Score is clamped to [0, 1] for safety.
- analyze_per_step() for "when did drift start?" sparkline rendering.

The deterministic baseline is intentionally crude: it fires on the obviously-wandered cases without hitting an LLM. The judge slot is where smarter contributors can drop in something better; the recipe ships next to the module.

Connects to the OTel semconv: hinters + adapters that compute drift should set agent.goal.drift_delta on agent.turn spans.

Contributor side-work:
- Recipe `docs/agent-recipes/add-goal-drift-judge.md` walks through the type contract, fail-soft requirements, and four sized roadmap items (embedding-cosine judge, bag-of-bigrams baseline, --drift flag on `evalview replay`, OTel attribute emission).

Coverage: 17 tests covering baseline correctness, severity buckets, empty-input safety, digit normalization, judge override, judge fallback paths (None / raise), score clamping, per-step monotonicity on a wandering trajectory.
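The baseline-plus-judge contract can be sketched as below. It is a simplified illustration, not the code in `evalview/core/goal_drift.py`: events are plain strings rather than GoalEvent objects, the recency weighting is omitted, drift is taken to be one minus the Jaccard overlap, and the judge signature and the 0.0 result for empty inputs are assumptions of the example.

```python
import re
from typing import Callable, Optional, Sequence

# Assumed signature: (goal, events) -> drift score, or None to defer to the baseline.
GoalDriftJudge = Callable[[str, Sequence[str]], Optional[float]]


def tokens(text: str) -> frozenset:
    """Lowercase, strip punctuation, collapse digit runs to <num>."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return frozenset(re.sub(r"\d+", "<num>", text).split())


def baseline_drift(goal: str, events: Sequence[str], window: int = 8) -> float:
    """1 - Jaccard(goal tokens, tokens of the last `window` events)."""
    goal_tokens = tokens(goal)
    summary_tokens = tokens(" ".join(events[-window:]))
    union = goal_tokens | summary_tokens
    if not union:
        return 0.0  # assumption: nothing to compare means no measurable drift
    return 1.0 - len(goal_tokens & summary_tokens) / len(union)


def goal_drift(goal: str, events: Sequence[str], judge: Optional[GoalDriftJudge] = None) -> float:
    """Prefer the judge's score; fall back to the baseline on None or on any exception."""
    if judge is not None:
        try:
            score = judge(goal, events)
        except Exception:
            score = None  # fail soft: a broken judge must not break the analysis
        if score is not None:
            return min(1.0, max(0.0, float(score)))  # clamp to [0, 1]
    return baseline_drift(goal, events)
```

On the subscription example, a trajectory of "cancel subscription" then "refund last charge" shares five of the goal's eight tokens, while "compare pricing tiers" shares none, so the second trajectory scores the maximum drift of 1.0.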

…judge slot

For RAG agents, observability tools mostly fail to answer the question that matters: of N chunks I retrieved, which ones did the agent actually use? Without that you can't tell dead-weight chunks (drop from index) from dominant ones (overfit), and you can't notice when retrieval quality silently degrades.

What ships:
- RetrievedChunk + ChunkAttribution + RetrievalLineage dataclasses.
- MemoryEntry + StaleMemoryFlag for memory store reads.
- Deterministic baseline: chunk-token recall in the output (NOT Jaccard, because chunks shouldn't be penalized for the output containing extra material), normalized across the chunk set so scores sum to 1.0 and are comparable cross-run.
- AttributionJudge type: pluggable callable for smarter scorers (embedding cosine, mechanistic, LLM). A judge that returns None falls back to the baseline; a judge that raises is caught silently and also falls back. Scores clamped to [0, 1].
- influential / dead_weight surfaces with separate thresholds: "this chunk added nothing measurable" is the actionable signal for index-pruning workflows, distinct from "barely used".
- attribute_memory_reads() reuses the chunk machinery for memory stores (working / episodic / semantic / profile).
- detect_stale_memory() flags entries older than a configurable age (default 7 days), oldest-first for digest rendering.

Pure module: no I/O, no network, no LLM by default.

Connects to the OTel semconv: adapters that compute lineage should set agent.retrieval.influence_scores on agent.retrieval spans, and agent.memory.age_seconds on agent.memory.read spans.

Contributor side-work:
- Recipe `docs/agent-recipes/add-retrieval-attribution.md` walks through the judge type contract, the recall-vs-Jaccard rationale, and four sized roadmap items (embedding-cosine judge, entity-overlap baseline, `evalview retrieval-stats` aggregator, OTel emission).

Coverage: 16 tests covering baseline correctness, normalization invariant, judge override, judge fallback (None / raise), score clamping, influential / dead_weight thresholds, memory delegation, stale-memory detection (default + custom thresholds), and the type-alias signature.
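The recall-based baseline and the two threshold surfaces can be illustrated with a short sketch. The function names, the dict-of-strings input, and both default thresholds are assumptions for the example; the real module works with RetrievedChunk and ChunkAttribution dataclasses whose fields are not shown in this PR description.

```python
import re


def tokens(text: str) -> frozenset:
    """Lowercase word tokens with punctuation stripped."""
    return frozenset(re.sub(r"[^\w\s]", " ", text.lower()).split())


def recall_scores(chunks: dict[str, str], output: str) -> dict[str, float]:
    """Per-chunk token recall in the output, normalized so the scores sum to 1.0."""
    out = tokens(output)
    raw = {}
    for chunk_id, text in chunks.items():
        chunk_tokens = tokens(text)
        # Recall, not Jaccard: extra material in the output does not lower the score.
        raw[chunk_id] = len(chunk_tokens & out) / len(chunk_tokens) if chunk_tokens else 0.0
    total = sum(raw.values())
    if total == 0:
        return {chunk_id: 0.0 for chunk_id in chunks}  # no chunk is reflected in the output
    return {chunk_id: score / total for chunk_id, score in raw.items()}


def influential(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Chunks at or above the (hypothetical) influence threshold."""
    return sorted(c for c, s in scores.items() if s >= threshold)


def dead_weight(scores: dict[str, float], threshold: float = 0.0) -> list[str]:
    """Chunks at or below the (hypothetical) dead-weight threshold."""
    return sorted(c for c, s in scores.items() if s <= threshold)
```

With Jaccard, a long answer that quotes one chunk in full would still score that chunk low because the union grows with the answer; recall keeps the score at its maximum, which is the rationale the commit message gives.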

…hipped)

Static prompts/tasks miss the production failures users actually encounter: tools flaking, users changing their mind mid-task, contexts being corrupted, schemas drifting. Chaos-Monkey-style injection is the asked-for answer in the agent-eval discourse.

This commit ships the *plan* layer: deterministic disruption scenarios that can be reused across runs and across machines. The simulator-side wiring (the handler that applies a planned disruption to a running simulation) is staged as contributor work via the recipe.

What ships:
- ChaosDisruption + ChaosScenario (frozen dataclasses, JSON-serializable).
- 3 modes:
  * tool_failure: Nth invocation of a named tool returns an error.
  * latency_spike: Nth tool call sleeps an extra delay_ms.
  * goal_interruption: synthetic user message after step N (pairs naturally with goal_drift detection).
- build_scenario() enforces "one disruption per step" so the simulation stays easy to reason about; raises on conflicts.
- random_scenario(seed=...): deterministic synthesis from a seed. Same (seed, tools, max_steps, n_disruptions, modes) → same plan every time, across CI runs and across machines. All randomness flows through hashlib-seeded selection, never random.random().
- Public CHAOS_MODES_ROADMAP listing 6 sized contributor PRs: info_drift, rate_limit, partial_handoff, memory_corruption, schema_drift, user_typo.

Pure module: no I/O, no network. Adapter-agnostic by design.

Contributor side-work:
- Recipe `docs/agent-recipes/add-chaos-mode.md` walks contributors through adding a mode end-to-end: constant + builder + random_scenario branch + simulate_cmd handler + 4 tests, with the determinism-via-_seeded_choice rule called out.
- Roadmap-shape test pins the CHAOS_MODES_ROADMAP surface across refactors.

Coverage: 14 new tests covering builders, scenario invariants, JSON round-trip, determinism (same seed = same scenario, different seed = different scenario), step-bound respect, registry consistency, and frozen-dataclass immutability.
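Hash-seeded selection of the kind described above could look like the following sketch. The `_seeded_choice` name is borrowed from the recipe text, but its signature, the SHA-256 labelling scheme, the `PlannedDisruption` dataclass, and this simplified `random_scenario` are assumptions for illustration and are not the API of `evalview/core/chaos.py`.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence

MODES = ("tool_failure", "latency_spike", "goal_interruption")


@dataclass(frozen=True)
class PlannedDisruption:  # hypothetical stand-in for ChaosDisruption
    step: int
    mode: str
    tool: Optional[str] = None


def _seeded_choice(seed: int, label: str, options: Sequence):
    """Pick an option from a SHA-256 digest of (seed, label); no global RNG state."""
    digest = hashlib.sha256(f"{seed}:{label}".encode("utf-8")).digest()
    return options[int.from_bytes(digest[:8], "big") % len(options)]


def random_scenario(
    seed: int,
    tools: Sequence[str],
    max_steps: int,
    n_disruptions: int,
    modes: Sequence[str] = MODES,
) -> tuple:
    """Deterministically plan n_disruptions, at most one per step."""
    if n_disruptions > max_steps:
        raise ValueError("one disruption per step: n_disruptions cannot exceed max_steps")
    if not tools:
        raise ValueError("at least one tool name is required")
    free_steps = list(range(1, max_steps + 1))
    plan = []
    for i in range(n_disruptions):
        step = _seeded_choice(seed, f"step:{i}", free_steps)
        free_steps.remove(step)  # keeps the one-disruption-per-step invariant
        mode = _seeded_choice(seed, f"mode:{i}", modes)
        # goal_interruption targets the conversation, not a tool.
        tool = None if mode == "goal_interruption" else _seeded_choice(seed, f"tool:{i}", tools)
        plan.append(PlannedDisruption(step, mode, tool))
    return tuple(sorted(plan, key=lambda d: d.step))
```

Deriving every choice from a digest rather than from `random` means the plan depends only on the arguments, so the same call yields the same scenario on any machine and any Python version.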

claude added 8 commits May 12, 2026 15:00

hidai25 merged commit 4253ad5 into main May 15, 2026
8 checks passed

hidai25 mentioned this pull request May 27, 2026

refactor(core): dedupe stoplist and normalize generics to typing.* #247

Merged

3 tasks

hidai25 merged 8 commits into main from claude/agent-evals-observability-XjLi0