Add freshness, fleet, chaos, goal-drift, retrieval-lineage, and root-cause modules #233
Merged
`evalview autopr` already closes the loop from a failing production trace
to a pinned regression test. `evalview freshness` closes the complementary
loop — from drifted traffic to a new capability test — *even when nothing
has failed yet*. It directly attacks the "eval sets go stale, silent
failures slip through" complaint that's been dominating r/AI_Agents.
Pipeline:
evalview monitor ──▶ .evalview/incidents.jsonl ─┐
├──▶ freshness ──▶ tests/coverage/*.yaml
(optional) --from-log prod-queries.jsonl ────────┘
Algorithm (pure Jaccard, no network, no LLM, no embeddings — matches the
regression_synth contract):
1. Pull production queries from incidents.jsonl + optional --from-log.
2. Normalize (lowercase, strip punctuation, collapse digit runs to
<num> so order IDs don't shred clusters, drop a small stoplist).
3. Score each prod query's max Jaccard against the existing suite —
below --threshold = uncovered.
4. Greedy single-linkage cluster the uncovered queries.
5. Drop clusters below --min-cluster-size (likely one-offs).
6. Print a ranked report; with --propose, write one YAML stub per
cluster to tests/coverage/, idempotent on meta.coverage.slug.
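The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `evalview/core/freshness.py`: the function names, the stoplist contents, and the default threshold values are all assumptions made for the example, and the clustering shown is one simple reading of "greedy single-linkage".

```python
from __future__ import annotations
import re

# Assumed stoplist; the real module's list is not shown in this PR text.
STOPLIST = {"a", "an", "the", "my", "is", "to", "for", "of", "please"}

def normalize(query: str) -> frozenset[str]:
    """Lowercase, collapse digit runs to <num>, strip punctuation, drop stopwords."""
    text = query.lower()
    text = re.sub(r"\d+", " <num> ", text)   # order IDs all become <num>
    text = re.sub(r"[^\w<>\s]", " ", text)   # strip punctuation but keep <num>
    return frozenset(t for t in text.split() if t not in STOPLIST)

def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Token-set Jaccard similarity; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def find_uncovered(prod: list[str], suite: list[str], threshold: float = 0.5) -> list[str]:
    """A prod query is uncovered when its best match in the suite is below threshold."""
    suite_tokens = [normalize(q) for q in suite]
    uncovered = []
    for query in prod:
        tokens = normalize(query)
        best = max((jaccard(tokens, s) for s in suite_tokens), default=0.0)
        if best < threshold:
            uncovered.append(query)
    return uncovered

def cluster(queries: list[str], threshold: float = 0.5, min_cluster_size: int = 2) -> list[list[str]]:
    """Greedy pass: join the first cluster with any sufficiently similar member."""
    clusters: list[list[str]] = []
    for query in queries:
        tokens = normalize(query)
        for members in clusters:
            if any(jaccard(tokens, normalize(m)) >= threshold for m in members):
                members.append(query)
                break
        else:
            clusters.append([query])
    kept = [c for c in clusters if len(c) >= min_cluster_size]  # drop likely one-offs
    return sorted(kept, key=len, reverse=True)                  # largest gap first

suite = ["reset my password"]
prod = ["Where is order #12345?", "where is order 987", "reset password", "tell me a joke"]
gaps = cluster(find_uncovered(prod, suite))
```

In this toy run the two order-status queries normalize to the same token set and form one cluster, "reset password" is already covered by the suite, and the joke request is dropped as a one-off.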
Stubs are deliberately minimal: just the representative query, no
expected assertions. We don't know what "correct" looks like for an
uncovered query — pretending we do creates false confidence. The
reviewer's flow is review → snapshot → that becomes the baseline.
Wired into the CLI under the Production section next to monitor/autopr.
Coverage: 33 new tests; full suite stays green (1909 passed).
`detect_coordinated_incident` already answers "are these failures
correlated?". The follow-up question users actually ask in the war room
is "correlated *how*, and what do I do about it?". This commit ships a
pluggable hinter layer that synthesizes existing diff signals into a
narrated diagnosis with concrete next-step commands.
Architecture:
detect_coordinated_incident
↓ passes diffs to
analyze_root_cause_hint → picks the highest-priority matching
↓ RootCauseHint and attaches it to
Incident.hint (new optional field, backward-compatible)
↓ consumed by
Slack / Discord notifiers → render narrative + top action below
the existing headline
Five hinters ship in v1, each a small pure function in
`evalview/core/root_cause_hint.py`:
- provider_rollout (priority 100, high conf):
≥N tests with model_changed=True → likely silent provider rollout,
surfaces the actual model-ID transition and suggests
`evalview model-check --pin` first.
- runtime_fingerprint_shift (priority 80, medium conf):
≥N tests share a new fingerprint different from baseline.
- coordinated_tool_addition (priority 70, high conf):
≥N failing tests all newly call the same tool → prompt or tool-
description edit, suggests `git log` + `grep` recipes.
- coordinated_tool_removal (priority 70, high conf):
≥N failing tests all stopped calling the same tool.
- coordinated_output_drift (priority 50, medium conf):
≥N tests with similarity < 0.7 and no tool change → output-only
drift (model got chattier / terser / refused).
Selection is deterministic: highest (priority, confidence_rank) wins,
ties broken by HINTERS registration order. Hinters are conservative
by contract — when in doubt they return None and the incident still
gets its basic classification.
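A minimal sketch of that selection rule, assuming diffs are plain dicts with `model_changed`, `tool_changed`, and `similarity` keys (the real diff type is not shown here). The `RootCauseHint` fields, the confidence ranking table, and the `min_tests` default are likewise assumptions for illustration; only the hinter names, priorities, and the 0.7 similarity cut-off come from the list above.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}  # assumed ordering

@dataclass(frozen=True)
class RootCauseHint:  # simplified stand-in for the real dataclass
    cause_id: str
    priority: int
    confidence: str
    narrative: str

Hinter = Callable[[Sequence[dict]], Optional[RootCauseHint]]

def provider_rollout(diffs: Sequence[dict], min_tests: int = 2) -> Optional[RootCauseHint]:
    changed = [d for d in diffs if d.get("model_changed")]
    if len(changed) < min_tests:
        return None  # conservative: no hint when in doubt
    return RootCauseHint("provider_rollout", 100, "high",
                         f"{len(changed)} tests saw a model change")

def coordinated_output_drift(diffs: Sequence[dict], min_tests: int = 2) -> Optional[RootCauseHint]:
    drifted = [d for d in diffs
               if d.get("similarity", 1.0) < 0.7 and not d.get("tool_changed")]
    if len(drifted) < min_tests:
        return None
    return RootCauseHint("coordinated_output_drift", 50, "medium",
                         f"{len(drifted)} tests drifted in output only")

HINTERS: list[Hinter] = [provider_rollout, coordinated_output_drift]

def pick_hint(diffs: Sequence[dict], hinters: Sequence[Hinter] = HINTERS) -> Optional[RootCauseHint]:
    """Highest (priority, confidence_rank) wins; ties keep the earlier registration."""
    best, best_key = None, None
    for hinter in hinters:
        hint = hinter(diffs)
        if hint is None:
            continue
        key = (hint.priority, CONFIDENCE_RANK[hint.confidence])
        if best_key is None or key > best_key:  # strict > preserves registration order
            best, best_key = hint, key
    return best

diffs = [
    {"test": "refund", "model_changed": True, "similarity": 0.5},
    {"test": "cancel", "model_changed": True, "similarity": 0.6},
]
winner = pick_hint(diffs)
```

Both hinters match these diffs, and `provider_rollout` wins on priority. Using a strict comparison is what makes registration order the tie-breaker without any extra bookkeeping.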
Contributor side-work:
- New recipe `docs/agent-recipes/add-root-cause-hint.md` walks
contributors through writing a new hinter as a single pure
function, with explicit done-criteria and pitfalls.
- `HINTERS_ROADMAP` in the module lists six concrete good-first-issue
hinters with sized scope (coordinated_cost_spike,
coordinated_latency_spike, coordinated_refusal,
coordinated_parameter_drift, coordinated_decision_drift,
coordinated_retrieval_drop). Each is one function + one test file
away from shipping.
Backward compatibility: `Incident.hint` is `Optional[RootCauseHint] =
None`; every existing call site still constructs Incidents cleanly,
notifiers fall back to the headline-only card when hint is None, and
the full test suite (1929 passed) remains green.
Coverage: 20 new tests covering each hinter's positive case, negative
case, and selection logic; integration test pins that
`detect_coordinated_incident` attaches a hint when a hinter matches.
`evalview since`, `progress`, and `drift` each tell you what happened
in one monitor session. The moment you run more than one (canary + prod,
a regional fleet, dev + CI both watching), those single-history commands
start hiding regional failures inside fleet averages and vice versa.
The "observing 1 → N agents" complaint is exactly this gap.
`evalview fleet` is the cross-instance synthesizer.
    evalview fleet --dir .evalview/history/
    evalview fleet --history monitor-eu.jsonl --history monitor-us.jsonl
    evalview fleet --json > fleet.json
    evalview fleet --anomalies-only   # only show pods deviating from mean
    evalview fleet --require-clean    # CI gate
What it produces:
- Fleet header: instance count, total cycles, fleet pass rate, cost.
- Per-instance table sorted by pass rate (worst first).
- Anomaly callouts via Z-score against fleet pass-rate distribution
  (default ≥2σ). Refuses to call anything anomalous below n=3
  instances: the math isn't meaningful and false anomalies are worse
  than none.
- Fleet-wide failures: tests failing in ≥40% of instances surface
  separately so operators can tell "everywhere" from "one bad pod".
Pure analytics: no network, no LLM. Reads JSONL written by
`evalview monitor --history`. Forward-compatible: silently skips
record shapes it doesn't recognize so future monitor record types
don't break old rollups.
Architecture:
    evalview/core/fleet.py          # pure data + math (frozen dataclasses)
    evalview/commands/fleet_cmd.py  # Click CLI + Rich rendering
Coverage: 23 new tests. Full suite stays green (1952 passed).
CLI registered in the Production section next to monitor / autopr /
freshness so all the cross-cutting "what's happening across my agents"
commands live together.
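The anomaly and fleet-wide-failure rules can be illustrated with a short sketch. The function names and input shapes are hypothetical, and the use of population standard deviation is an assumption (the PR text does not say which estimator `evalview/core/fleet.py` uses); the 2σ default, the n=3 floor, and the 40% share come from the description above.

```python
from __future__ import annotations
import statistics
from collections import Counter

def pass_rate_anomalies(pass_rates: dict[str, float],
                        sigma: float = 2.0,
                        min_instances: int = 3) -> list[tuple[str, float]]:
    """Return (instance, z_score) pairs whose |z| meets the sigma threshold."""
    if len(pass_rates) < min_instances:
        return []                      # too few instances for a meaningful z-score
    rates = list(pass_rates.values())
    mean = statistics.fmean(rates)
    stdev = statistics.pstdev(rates)   # assumption: population stdev
    if stdev == 0:
        return []                      # identical pass rates, nothing deviates
    scored = [(name, (rate - mean) / stdev) for name, rate in pass_rates.items()]
    return sorted((s for s in scored if abs(s[1]) >= sigma), key=lambda s: s[1])

def fleet_wide_failures(failing: dict[str, set[str]], min_share: float = 0.4) -> list[str]:
    """Tests that fail in at least min_share of instances."""
    n = len(failing)
    if n == 0:
        return []
    counts = Counter(test for tests in failing.values() for test in tests)
    return sorted(test for test, c in counts.items() if c / n >= min_share)

rates = {f"us-{i}": 0.95 for i in range(9)}
rates["eu-3"] = 0.40
anomalies = pass_rate_anomalies(rates)

failing = {
    "us-1": {"refund"}, "us-2": {"refund"}, "eu-1": {"refund", "login"},
    "eu-2": set(), "eu-3": set(),
}
everywhere = fleet_wide_failures(failing)
```

One design consequence worth noting: with a single outlier among otherwise identical instances, the outlier's population z-score is sqrt(n - 1), so small fleets can only just reach a 2σ threshold.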
Adds a focused "Threat Model: Untrusted Inputs" section motivated by
the May 2026 Langfuse RCE (single OTel trace request → prototype
pollution → cross-project access). The new section enumerates every
external input EvalView accepts, the parser used for each, the
sandbox posture, and what EvalView explicitly does *not* do
(no pickle, no eval/exec, no unsafe yaml.load, no code execution from
cassettes).
Also documents:
- Cloud sync isolation contract (opt-in, OAuth, server-enforced
per-project boundary).
- MCP server attack surface (stdio by default; network is out-of-band
and unsupported).
- Where to report parser bugs (Security Advisory, not public issue).
No code changes; doc-only commit.
…ntions (draft v0)
Upstream OTel's gen_ai.* covers model-call instrumentation. The agent
layer above it — turns, tool selections, handoffs, memory reads,
retrieval, human interventions — has no shared vocabulary, so traces
from one stack are useless to another. Every observability vendor
invents its own attribute names. The "make traces portable" complaint
in the agent-eval discourse is exactly this gap.
This commit ships a portable spec EvalView can both emit and consume,
and that any other tool can adopt without depending on EvalView.
Coverage:
- 10 span kinds: agent.run, agent.turn, agent.tool_choice,
agent.tool_call, agent.handoff, agent.memory.{read,write},
agent.retrieval, agent.intervention, agent.plan.
- ~30 attributes covering identity, state fingerprints, goal text +
drift delta, tool choice rationale + alternatives, parameter
fingerprints, retrieval chunks + influence scores, memory keys +
age, handoff targets, intervention outcomes, cost, verdict.
- Validation helpers: is_known_span(), is_known_attribute(),
attributes_for_span(span) returning the recommended attribute set.
Status: draft v0. OTEL_SEMCONV_VERSION = "0.1.0"; bumped on breaking
change so consumers can branch on rev. EvalView root spans will stamp
the version so consumers know how to parse.
No new dependencies — constants are plain strings. Adapters with the
OTel SDK emit real spans; adapters without it can attach the same
attributes to their existing trace dicts.
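As a rough sketch of what "constants are plain strings" can look like, the snippet below uses the ten span kinds listed above and only the three attribute names that this PR text mentions explicitly. The container names, the shape of the recommendation table, and the return types of the helpers are assumptions; the real module covers roughly 30 attributes.

```python
from __future__ import annotations

OTEL_SEMCONV_VERSION = "0.1.0"

SPAN_KINDS = frozenset({
    "agent.run", "agent.turn", "agent.tool_choice", "agent.tool_call",
    "agent.handoff", "agent.memory.read", "agent.memory.write",
    "agent.retrieval", "agent.intervention", "agent.plan",
})

# Assumed table layout; only attributes named in this PR text are included.
RECOMMENDED: dict[str, tuple[str, ...]] = {
    "agent.turn": ("agent.goal.drift_delta",),
    "agent.retrieval": ("agent.retrieval.influence_scores",),
    "agent.memory.read": ("agent.memory.age_seconds",),
}
KNOWN_ATTRIBUTES = frozenset(a for attrs in RECOMMENDED.values() for a in attrs)

def is_known_span(name: str) -> bool:
    return name in SPAN_KINDS

def is_known_attribute(name: str) -> bool:
    return name in KNOWN_ATTRIBUTES

def attributes_for_span(span: str) -> tuple[str, ...]:
    """Recommended attributes for a span kind; empty when none are listed."""
    return RECOMMENDED.get(span, ())

# An adapter without the OTel SDK can attach the same keys to a plain dict.
trace_event = {"name": "agent.turn", "attributes": {"agent.goal.drift_delta": 0.12}}
unknown = [k for k in trace_event["attributes"] if not is_known_attribute(k)]
```

Because the vocabulary is just strings, the same validation works on SDK spans and on ad hoc trace dicts.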
Contributor side-work:
- Recipe `docs/agent-recipes/add-otel-emission.md` walks adapter
authors through wiring the constants into their emitter, with
pitfalls (don't invent names; fingerprint, don't dump; preserve
parent_run.id across handoffs).
- Recipe lists 4 sized roadmap items as good first issues:
LangGraph wiring, `evalview validate-trace` command,
JSON-Schema export, verdict_at_step extension.
Coverage: 14 tests pinning constants↔index round-trip,
naming convention, validation helpers, and the recommendation table.
Full suite stays green.
…istic + judge slot)
Decision rationale logging answers *what* the agent chose at each step.
This module answers the question one layer above: *is the trajectory
still about the original ask, or did the agent quietly wander?*
The classic failure: user asks "cancel my subscription and refund the
last charge"; agent does account lookup, plan check, terms review,
then answers a pricing question. By step 12, the trajectory looks
nothing like the goal. A drift signal at step 6 ("the agent is no
longer working on a 'cancel + refund' goal") catches it cheaply
before the user is unhappy.
What ships:
- GoalEvent + GoalDriftAnalysis dataclasses (frozen, JSON-friendly).
- Deterministic baseline: Jaccard token overlap between stated goal
and a trajectory summary (last 8 events, weighted toward current
intent), with the same digit-collapse-to-<num> normalization
freshness uses so order IDs don't shred similarity.
- GoalDriftJudge type: a pluggable callable for smarter judges
(embedding cosine, LLM, whatever). A judge returns None to fall
back to the baseline; if it raises, the exception is swallowed and
the baseline is used. Score is clamped to [0, 1] for safety.
- analyze_per_step() for "when did drift start?" sparkline rendering.
The deterministic baseline is intentionally crude — fires on the
obviously-wandered cases without hitting an LLM. The judge slot is
where smarter contributors can drop in something better; the recipe
ships next to the module.
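The baseline-plus-judge contract might look like the sketch below. The judge signature, the function names, and the choice to return 0.0 for empty input are assumptions; the sketch also omits the weighting toward current intent that the real summary applies. It returns an alignment score, where lower values suggest more drift.

```python
from __future__ import annotations
import re
from typing import Callable, Optional, Sequence

# Assumed signature: (goal, events) -> score, or None to defer to the baseline.
GoalDriftJudge = Callable[[str, Sequence[str]], Optional[float]]

def _tokens(text: str) -> frozenset[str]:
    text = re.sub(r"\d+", " <num> ", text.lower())  # collapse digit runs
    return frozenset(re.findall(r"<num>|[a-z]+", text))

def baseline_alignment(goal: str, events: Sequence[str], window: int = 8) -> float:
    """Jaccard overlap between the goal and the last `window` events."""
    goal_toks = _tokens(goal)
    summary_toks = _tokens(" ".join(events[-window:]))
    if not goal_toks or not summary_toks:
        return 0.0  # assumption: empty input scores as no overlap
    return len(goal_toks & summary_toks) / len(goal_toks | summary_toks)

def alignment(goal: str, events: Sequence[str],
              judge: Optional[GoalDriftJudge] = None) -> float:
    """Use the judge when it yields a score; otherwise fall back to the baseline."""
    if judge is not None:
        try:
            score = judge(goal, events)
        except Exception:
            score = None                 # fail soft: a broken judge never breaks analysis
        if score is not None:
            return min(1.0, max(0.0, float(score)))  # clamp to [0, 1]
    return baseline_alignment(goal, events)

def per_step_alignment(goal: str, events: Sequence[str]) -> list[float]:
    """Score each prefix of the trajectory, e.g. for a sparkline."""
    return [alignment(goal, events[: i + 1]) for i in range(len(events))]

goal = "cancel subscription and refund charge"
events = ["lookup subscription", "check pricing tiers", "answer pricing question"]
scores = per_step_alignment(goal, events)

def broken_judge(goal: str, events: Sequence[str]) -> Optional[float]:
    raise RuntimeError("embedding service down")
```

On this wandering trajectory the per-step scores fall at every step, which is the shape a "when did drift start?" view would surface.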
Connects to the OTel semconv: hinters + adapters that compute drift
should set agent.goal.drift_delta on agent.turn spans.
Contributor side-work:
- Recipe `docs/agent-recipes/add-goal-drift-judge.md` walks
through the type contract, fail-soft requirements, and four
sized roadmap items (embedding-cosine judge, bag-of-bigrams
baseline, --drift flag on `evalview replay`, OTel attribute
emission).
Coverage: 17 tests covering baseline correctness, severity buckets,
empty-input safety, digit normalization, judge override, judge
fallback paths (None / raise), score clamping, per-step monotonicity
on a wandering trajectory.
…judge slot
For RAG agents, observability tools mostly fail to answer the question
that matters: of N chunks I retrieved, which ones did the agent
actually use? Without that you can't tell dead-weight chunks (drop
from index) from dominant ones (overfit), and you can't notice when
retrieval quality silently degrades.
What ships:
- RetrievedChunk + ChunkAttribution + RetrievalLineage dataclasses.
- MemoryEntry + StaleMemoryFlag for memory store reads.
- Deterministic baseline: chunk-token recall in the output (NOT
Jaccard — chunks shouldn't be penalized for the output containing
extra material), normalized across the chunk set so scores sum to
1.0 and are comparable cross-run.
- AttributionJudge type: a pluggable callable for smarter scorers
(embedding cosine, mechanistic, LLM). A judge returns None to fall
back to the baseline; if it raises, the exception is swallowed and
the baseline is used. Scores are clamped to [0, 1].
- influential / dead_weight surfaces with separate thresholds —
"this chunk added nothing measurable" is the actionable signal
for index-pruning workflows, distinct from "barely used".
- attribute_memory_reads() reuses the chunk machinery for memory
stores (working / episodic / semantic / profile).
- detect_stale_memory() flags entries older than a configurable
age (default 7 days), oldest-first for digest rendering.
Pure module — no I/O, no network, no LLM by default.
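To make the recall-versus-Jaccard distinction concrete, here is a small sketch of the deterministic baseline and the stale-memory check. The function names, the dict-based inputs, and the tokenizer are assumptions for illustration; the real module works on the `RetrievedChunk` and `MemoryEntry` dataclasses listed above.

```python
from __future__ import annotations
import re
from datetime import datetime, timedelta, timezone

def _tokens(text: str) -> frozenset[str]:
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))

def attribute_chunks(chunks: dict[str, str], output: str) -> dict[str, float]:
    """Chunk-token recall in the output, normalized so the scores sum to 1.0."""
    out_toks = _tokens(output)
    recall = {}
    for chunk_id, text in chunks.items():
        toks = _tokens(text)
        # Recall divides by the chunk's own size, so extra material in the
        # output does not penalize the chunk (unlike Jaccard).
        recall[chunk_id] = len(toks & out_toks) / len(toks) if toks else 0.0
    total = sum(recall.values())
    if total == 0:
        return {chunk_id: 0.0 for chunk_id in recall}  # nothing was used
    return {chunk_id: r / total for chunk_id, r in recall.items()}

def detect_stale(written_at: dict[str, datetime], now: datetime,
                 max_age: timedelta = timedelta(days=7)) -> list[tuple[str, timedelta]]:
    """Entries older than max_age, oldest first."""
    stale = [(key, now - ts) for key, ts in written_at.items() if now - ts > max_age]
    return sorted(stale, key=lambda item: item[1], reverse=True)

chunks = {
    "policy": "Refund policy: thirty days",
    "fees": "Refund fees apply",
    "shipping": "Shipping rates to Canada",
}
scores = attribute_chunks(chunks, "Our refund policy allows thirty days.")

now = datetime(2026, 1, 15, tzinfo=timezone.utc)
stale = detect_stale({
    "old": datetime(2026, 1, 1, tzinfo=timezone.utc),
    "older": datetime(2025, 12, 1, tzinfo=timezone.utc),
    "fresh": datetime(2026, 1, 14, tzinfo=timezone.utc),
}, now)
```

Here the "shipping" chunk scores zero, which is the dead-weight signal, while "policy" dominates and "fees" is barely used.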
Connects to the OTel semconv: adapters that compute lineage should
set agent.retrieval.influence_scores on agent.retrieval spans, and
agent.memory.age_seconds on agent.memory.read spans.
Contributor side-work:
- Recipe `docs/agent-recipes/add-retrieval-attribution.md` walks
through the judge type contract, the recall-vs-Jaccard rationale,
and four sized roadmap items (embedding-cosine judge, entity-overlap
baseline, `evalview retrieval-stats` aggregator, OTel emission).
Coverage: 16 tests covering baseline correctness, normalization
invariant, judge override, judge fallback (None / raise), score
clamping, influential / dead_weight thresholds, memory delegation,
stale-memory detection (default + custom thresholds), and the
type-alias signature.
…hipped)
Static prompts/tasks miss the production failures users actually
encounter: tools flaking, users changing their mind mid-task, contexts
being corrupted, schemas drifting. Chaos-Monkey-style injection is the
asked-for answer in the agent-eval discourse.
This commit ships the *plan* layer — deterministic disruption
scenarios that can be reused across runs and across machines. The
simulator-side wiring (the handler that applies a planned disruption
to a running simulation) is staged as contributor work via the recipe.
What ships:
- ChaosDisruption + ChaosScenario (frozen dataclasses, JSON-serializable).
- 3 modes:
* tool_failure — Nth invocation of a named tool returns an error.
* latency_spike — Nth tool call sleeps an extra delay_ms.
* goal_interruption — synthetic user message after step N (pairs
naturally with goal_drift detection).
- build_scenario() enforces "one disruption per step" so the
simulation stays easy to reason about; raises on conflicts.
- random_scenario(seed=...) — deterministic synthesis from a seed.
Same (seed, tools, max_steps, n_disruptions, modes) → same plan
every time, across CI runs and across machines. All randomness
flows through hashlib-seeded selection, never random.random().
- Public CHAOS_MODES_ROADMAP listing 6 sized contributor PRs:
info_drift, rate_limit, partial_handoff, memory_corruption,
schema_drift, user_typo.
Pure module — no I/O, no network. Adapter-agnostic by design.
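The determinism rule can be sketched as below. The `_seeded_choice` name appears in the recipe description, but its signature here, the `Disruption` fields, and the labelling scheme are assumptions; the point is only that every pick is derived from a hash of the seed, so the plan does not depend on interpreter state or platform.

```python
from __future__ import annotations
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence

MODES = ("tool_failure", "latency_spike", "goal_interruption")

@dataclass(frozen=True)
class Disruption:  # simplified stand-in for ChaosDisruption
    step: int
    mode: str
    tool: Optional[str]

def _seeded_choice(seed: int, label: str, options: Sequence):
    """Pick one option from a SHA-256 digest of (seed, label); no global RNG."""
    digest = hashlib.sha256(f"{seed}:{label}".encode("utf-8")).digest()
    return options[int.from_bytes(digest[:8], "big") % len(options)]

def random_scenario(seed: int, tools: Sequence[str], max_steps: int,
                    n_disruptions: int) -> tuple[Disruption, ...]:
    if not tools:
        raise ValueError("at least one tool name is required")
    if n_disruptions > max_steps:
        raise ValueError("cannot place more disruptions than steps")
    free_steps = list(range(1, max_steps + 1))
    plan = []
    for i in range(n_disruptions):
        step = _seeded_choice(seed, f"step:{i}", free_steps)
        free_steps.remove(step)  # enforce one disruption per step
        mode = _seeded_choice(seed, f"mode:{i}", MODES)
        # goal_interruption is a synthetic user message, so it targets no tool
        tool = None if mode == "goal_interruption" else _seeded_choice(seed, f"tool:{i}", tools)
        plan.append(Disruption(step, mode, tool))
    return tuple(sorted(plan, key=lambda d: d.step))

plan_a = random_scenario(42, ["search", "refund"], max_steps=10, n_disruptions=3)
plan_b = random_scenario(42, ["search", "refund"], max_steps=10, n_disruptions=3)
```

Hashing a labelled string per decision, instead of drawing from a shared generator, means adding a new mode or an extra decision does not shift the values chosen for unrelated labels.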
Contributor side-work:
- Recipe `docs/agent-recipes/add-chaos-mode.md` walks contributors
through adding a mode end-to-end: constant + builder +
random_scenario branch + simulate_cmd handler + 4 tests, with the
determinism-via-_seeded_choice rule called out.
- Roadmap-shape test pins the CHAOS_MODES_ROADMAP surface across
refactors.
Coverage: 14 new tests covering builders, scenario invariants,
JSON round-trip, determinism (same seed = same scenario, different
seed = different scenario), step-bound respect, registry
consistency, and frozen-dataclass immutability.
Description
This PR introduces new pure-function modules and their corresponding CLI commands, plus test coverage and contributor recipes. These modules extend EvalView's observability and testing capabilities:
- Freshness (`evalview.core.freshness` + `evalview.commands.freshness_cmd`): Detects production-query coverage gaps in the test suite using Jaccard token similarity. Identifies uncovered production queries, clusters them by similarity, and synthesizes new test cases to close gaps, even when nothing has failed yet.
- Fleet (`evalview.core.fleet` + `evalview.commands.fleet_cmd`): Aggregates monitor history from multiple instances (canary, prod, regional pods) into a unified view. Detects anomalies (instances deviating from fleet mean) and fleet-wide failures.
- Root-Cause Hints (`evalview.core.root_cause_hint`): Narrates coordinated test failures with structured diagnostics. Pluggable hinter architecture where each hinter looks for one specific cross-test pattern and returns a `RootCauseHint` with cause ID, narrative, evidence, and suggested actions.
- Goal Drift (`evalview.core.goal_drift`): Detects when an agent trajectory diverges from the original user intent using Jaccard token overlap (deterministic baseline) plus a pluggable judge interface for smarter implementations.
- Retrieval Lineage (`evalview.core.retrieval_lineage`): Attributes which retrieved chunks actually influenced the agent's output. Ships a deterministic token-overlap baseline plus pluggable attribution judges for embedding-based or LLM-based methods.
- Chaos Injection (`evalview.core.chaos`): Defines controlled disruption modes (tool failure, latency spike, goal interruption) for agent simulation. Deterministic scenario planning with seed-based reproducibility.
- OTel Semantic Conventions (`evalview.core.otel_semconv`): Portable OpenTelemetry vocabulary for agent-layer instrumentation (turns, tool selections, handoffs, memory reads, retrieval). Enables traces from any adapter to be consumed by any observability tool.
All modules are pure (no I/O, no network, no LLM by default) and follow EvalView's design patterns.
Related Issue
Closes foundational work for production-traffic analysis, multi-instance monitoring, failure diagnosis, and agent-behavior observability.
Type of Change
Changes Made
Core Modules (Pure Functions)
- `evalview/core/freshness.py` (500 lines): Query normalization, Jaccard similarity, coverage computation, query clustering, test synthesis
- `evalview/core/fleet.py` (442 lines): Instance summarization, anomaly detection, fleet-wide failure detection
- `evalview/core/root_cause_hint.py` (485 lines): Hinter registry, six shipped hinters (provider rollout, tool changes, output drift, runtime fingerprint, coordinated output drift, coordinated tool addition/removal)
- `evalview/core/goal_drift.py` (259 lines): Trajectory summarization, Jaccard-based drift detection, pluggable judge interface
- `evalview/core/retrieval_lineage.py` (318 lines): Token-overlap attribution baseline, pluggable judges, memory lineage, stale-memory detection
- `evalview/core/chaos.py` (294 lines): Disruption modes (tool failure, latency spike, goal interruption), scenario planning, deterministic seed-based generation
- `evalview/core/otel_semconv.py` (307 lines): Portable OTel span names, attribute constants, enum values for agent-layer instrumentation
CLI Commands
- `evalview/commands/freshness_cmd.py` (418 lines): Load production queries from incidents/JSONL, compute coverage, propose new…
https://claude.ai/code/session_011waTjdqtPP891LzYG9xknk