Add freshness, fleet, chaos, goal-drift, retrieval-lineage, and root-cause modules #233

Merged
hidai25 merged 8 commits into main from claude/agent-evals-observability-XjLi0 on May 15, 2026
@hidai25 (Owner) commented May 15, 2026

Description

This PR introduces seven new pure-function modules (the six named in the title plus OTel semantic conventions), plus the `evalview freshness` and `evalview fleet` CLI commands, test coverage, and contributor recipes. These modules extend EvalView's observability and testing capabilities:

  1. Freshness (evalview.core.freshness + evalview.commands.freshness_cmd): Detects production-query coverage gaps in the test suite using Jaccard token similarity. Identifies uncovered production queries, clusters them by similarity, and synthesizes new test cases to close gaps—even when nothing has failed yet.

  2. Fleet (evalview.core.fleet + evalview.commands.fleet_cmd): Aggregates monitor history from multiple instances (canary, prod, regional pods) into a unified view. Detects anomalies (instances deviating from fleet mean) and fleet-wide failures.

  3. Root-Cause Hints (evalview.core.root_cause_hint): Narrates coordinated test failures with structured diagnostics. Pluggable hinter architecture where each hinter looks for one specific cross-test pattern and returns a RootCauseHint with cause ID, narrative, evidence, and suggested actions.

  4. Goal Drift (evalview.core.goal_drift): Detects when an agent trajectory diverges from the original user intent using Jaccard token overlap (deterministic baseline) plus a pluggable judge interface for smarter implementations.

  5. Retrieval Lineage (evalview.core.retrieval_lineage): Attributes which retrieved chunks actually influenced the agent's output. Ships a deterministic token-overlap baseline plus pluggable attribution judges for embedding-based or LLM-based methods.

  6. Chaos Injection (evalview.core.chaos): Defines controlled disruption modes (tool failure, latency spike, goal interruption) for agent simulation. Deterministic scenario planning with seed-based reproducibility.

  7. OTel Semantic Conventions (evalview.core.otel_semconv): Portable OpenTelemetry vocabulary for agent-layer instrumentation (turns, tool selections, handoffs, memory reads, retrieval). Enables traces from any adapter to be consumed by any observability tool.

All modules are pure (no I/O, no network, no LLM by default) and follow EvalView's design patterns:

  • Deterministic, testable, CI-friendly
  • Pluggable interfaces for smarter implementations
  • Comprehensive docstrings and contributor recipes

Related Issue

Lays the foundation for production-traffic analysis, multi-instance monitoring, failure diagnosis, and agent-behavior observability.

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Documentation update
  • Test improvements

Changes Made

Core Modules (Pure Functions)

  • evalview/core/freshness.py (500 lines): Query normalization, Jaccard similarity, coverage computation, query clustering, test synthesis
  • evalview/core/fleet.py (442 lines): Instance summarization, anomaly detection, fleet-wide failure detection
  • evalview/core/root_cause_hint.py (485 lines): Hinter registry and five shipped hinters (provider_rollout, runtime_fingerprint_shift, coordinated_tool_addition, coordinated_tool_removal, coordinated_output_drift)
  • evalview/core/goal_drift.py (259 lines): Trajectory summarization, Jaccard-based drift detection, pluggable judge interface
  • evalview/core/retrieval_lineage.py (318 lines): Token-overlap attribution baseline, pluggable judges, memory lineage, stale-memory detection
  • evalview/core/chaos.py (294 lines): Disruption modes (tool failure, latency spike, goal interruption), scenario planning, deterministic seed-based generation
  • evalview/core/otel_semconv.py (307 lines): Portable OTel span names, attribute constants, enum values for agent-layer instrumentation

CLI Commands

  • evalview/commands/freshness_cmd.py (418 lines): Load production queries from incidents/JSONL, compute coverage, propose new

https://claude.ai/code/session_011waTjdqtPP891LzYG9xknk

claude added 8 commits May 12, 2026 15:00
`evalview autopr` already closes the loop from a failing production trace
to a pinned regression test. `evalview freshness` closes the complementary
loop — from drifted traffic to a new capability test — *even when nothing
has failed yet*. It directly attacks the "eval sets go stale, silent
failures slip through" complaint that's been dominating r/AI_Agents.

Pipeline:

    evalview monitor  ──▶  .evalview/incidents.jsonl  ─┐
                                                       ├──▶  freshness ──▶ tests/coverage/*.yaml
    (optional)  --from-log prod-queries.jsonl  ────────┘

Algorithm (pure Jaccard, no network, no LLM, no embeddings — matches the
regression_synth contract):

  1. Pull production queries from incidents.jsonl + optional --from-log.
  2. Normalize (lowercase, strip punctuation, collapse digit runs to
     <num> so order IDs don't shred clusters, drop a small stoplist).
  3. Score each prod query's max Jaccard against the existing suite —
     below --threshold = uncovered.
  4. Greedy single-linkage cluster the uncovered queries.
  5. Drop clusters below --min-cluster-size (likely one-offs).
  6. Print a ranked report; with --propose, write one YAML stub per
     cluster to tests/coverage/, idempotent on meta.coverage.slug.
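
Steps 2 to 5 can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code in `evalview/core/freshness.py`: the function names, the stoplist contents, and the 0.5 default threshold are all assumptions, and the clustering is a single greedy pass that never merges clusters after the fact.

```python
import re

# Hypothetical stoplist; the real module's list is not shown in this PR.
STOPWORDS = frozenset({"a", "an", "the", "my", "to", "of", "is", "for", "please"})


def normalize(query: str) -> frozenset[str]:
    """Lowercase, collapse digit runs to <num>, strip punctuation, drop stopwords."""
    text = query.lower()
    text = re.sub(r"\d+", " <num> ", text)   # order IDs all become the same token
    text = re.sub(r"[^\w<>\s]", " ", text)   # strip punctuation but keep <num>
    return frozenset(t for t in text.split() if t not in STOPWORDS)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def uncovered_queries(prod: list[str], suite: list[str], threshold: float = 0.5) -> list[str]:
    """Return prod queries whose best Jaccard match in the suite is below threshold."""
    suite_tokens = [normalize(q) for q in suite]
    result = []
    for query in prod:
        tokens = normalize(query)
        best = max((jaccard(tokens, s) for s in suite_tokens), default=0.0)
        if best < threshold:
            result.append(query)
    return result


def cluster(queries: list[str], threshold: float = 0.5, min_cluster_size: int = 2) -> list[list[str]]:
    """Greedy single-linkage: join the first cluster containing any close-enough member."""
    clusters: list[list[str]] = []
    for query in queries:
        tokens = normalize(query)
        for members in clusters:
            if any(jaccard(tokens, normalize(m)) >= threshold for m in members):
                members.append(query)
                break
        else:
            clusters.append([query])
    kept = [c for c in clusters if len(c) >= min_cluster_size]  # drop likely one-offs
    return sorted(kept, key=len, reverse=True)                  # largest gap first
```

With a suite containing only "Cancel my subscription", two order-status queries with different order IDs land in one cluster because both normalize to the same token set, which is the point of the `<num>` collapse.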

Stubs are deliberately minimal: just the representative query, no
expected assertions. We don't know what "correct" looks like for an
uncovered query — pretending we do creates false confidence. The
reviewer's flow is review → snapshot → that becomes the baseline.

Wired into the CLI under the Production section next to monitor/autopr.

Coverage: 33 new tests; full suite stays green (1909 passed).

`detect_coordinated_incident` already answers "are these failures
correlated?". The follow-up question users actually ask in the war room
is "correlated *how*, and what do I do about it?". This commit ships a
pluggable hinter layer that synthesizes existing diff signals into a
narrated diagnosis with concrete next-step commands.

Architecture:

  detect_coordinated_incident
        ↓ passes diffs to
  analyze_root_cause_hint  →  picks the highest-priority matching
        ↓                     RootCauseHint and attaches it to
  Incident.hint              (new optional field, backward-compatible)
        ↓ consumed by
  Slack / Discord notifiers  →  render narrative + top action below
                                the existing headline

Five hinters ship in v1, each a small pure function in
`evalview/core/root_cause_hint.py`:

  - provider_rollout            (priority 100, high conf):
      ≥N tests with model_changed=True → likely silent provider rollout,
      surfaces the actual model-ID transition and suggests
      `evalview model-check --pin` first.
  - runtime_fingerprint_shift   (priority 80, medium conf):
      ≥N tests share a new fingerprint different from baseline.
  - coordinated_tool_addition   (priority 70, high conf):
      ≥N failing tests all newly call the same tool → prompt or tool-
      description edit, suggests `git log` + `grep` recipes.
  - coordinated_tool_removal    (priority 70, high conf):
      ≥N failing tests all stopped calling the same tool.
  - coordinated_output_drift    (priority 50, medium conf):
      ≥N tests with similarity < 0.7 and no tool change → output-only
      drift (model got chattier / terser / refused).

Selection is deterministic: highest (priority, confidence_rank) wins,
ties broken by HINTERS registration order. Hinters are conservative
by contract — when in doubt they return None and the incident still
gets its basic classification.
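
The selection rule can be illustrated with a small sketch. The names `RootCauseHint`, `HINTERS`, and `analyze_root_cause_hint` come from this commit, but everything else here is assumed for illustration: the dataclass fields, the dict-shaped diffs, the `min_tests` default, and the two simplified hinters are not the module's real signatures.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}  # assumed ordering


@dataclass(frozen=True)
class RootCauseHint:
    cause_id: str
    priority: int
    confidence: str
    narrative: str


# A hinter inspects the diffs and returns a hint, or None when unsure.
Hinter = Callable[[Sequence[dict]], Optional[RootCauseHint]]


def provider_rollout(diffs: Sequence[dict], min_tests: int = 2) -> Optional[RootCauseHint]:
    changed = [d for d in diffs if d.get("model_changed")]
    if len(changed) < min_tests:
        return None  # conservative by contract
    return RootCauseHint("provider_rollout", 100, "high",
                         f"{len(changed)} tests saw a model change")


def coordinated_output_drift(diffs: Sequence[dict], min_tests: int = 2) -> Optional[RootCauseHint]:
    drifted = [d for d in diffs
               if d.get("similarity", 1.0) < 0.7 and not d.get("tools_changed")]
    if len(drifted) < min_tests:
        return None
    return RootCauseHint("coordinated_output_drift", 50, "medium",
                         f"{len(drifted)} tests drifted in output only")


HINTERS: list[Hinter] = [provider_rollout, coordinated_output_drift]


def analyze_root_cause_hint(diffs: Sequence[dict]) -> Optional[RootCauseHint]:
    """Pick the hint with the highest (priority, confidence rank)."""
    best: Optional[RootCauseHint] = None
    best_key: Optional[tuple[int, int]] = None
    for hinter in HINTERS:
        hint = hinter(diffs)
        if hint is None:
            continue
        key = (hint.priority, CONFIDENCE_RANK[hint.confidence])
        # Strict ">" means an earlier-registered hinter wins a tie.
        if best_key is None or key > best_key:
            best, best_key = hint, key
    return best
```

When both hinters match the same diffs, `provider_rollout` wins on priority; when none match, the function returns `None` and the incident keeps its basic classification.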

Contributor side-work:

  - New recipe `docs/agent-recipes/add-root-cause-hint.md` walks
    contributors through writing a new hinter as a single pure
    function, with explicit done-criteria and pitfalls.
  - `HINTERS_ROADMAP` in the module lists six concrete good-first-issue
    hinters with sized scope (coordinated_cost_spike,
    coordinated_latency_spike, coordinated_refusal,
    coordinated_parameter_drift, coordinated_decision_drift,
    coordinated_retrieval_drop). Each is one function + one test file
    away from shipping.

Backward compatibility: `Incident.hint` is `Optional[RootCauseHint] =
None`; every existing call site still constructs Incidents cleanly,
notifiers fall back to the headline-only card when hint is None, and
the full test suite (1929 passed) remains green.

Coverage: 20 new tests covering each hinter's positive case, negative
case, and selection logic; integration test pins that
`detect_coordinated_incident` attaches a hint when a hinter matches.

`evalview since`, `progress`, and `drift` each tell you what happened in
one monitor session. The moment you run more than one — canary + prod, a
regional fleet, dev + CI both watching — those single-history commands
start hiding regional failures inside fleet averages and vice-versa. The
"observing 1 → N agents" complaint is exactly this gap.

`evalview fleet` is the cross-instance synthesizer.

  evalview fleet --dir .evalview/history/
  evalview fleet --history monitor-eu.jsonl --history monitor-us.jsonl
  evalview fleet --json > fleet.json
  evalview fleet --anomalies-only      # only show pods deviating from mean
  evalview fleet --require-clean       # CI gate

What it produces:

- Fleet header: instance count, total cycles, fleet pass rate, cost.
- Per-instance table sorted by pass rate (worst first).
- Anomaly callouts via Z-score against fleet pass-rate distribution
  (default ≥2σ). Refuses to call anything anomalous below n=3 instances
  — the math isn't meaningful and false anomalies are worse than none.
- Fleet-wide failures: tests failing in ≥40% of instances surface
  separately so operators can tell "everywhere" from "one bad pod".
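
As a rough sketch of the two detectors described above, assuming pass rates and failing-test sets have already been summarized per instance. The function names, the input shapes, and the use of the population standard deviation are assumptions, not the API of `evalview/core/fleet.py`.

```python
from collections import Counter
from statistics import mean, pstdev


def pass_rate_anomalies(
    pass_rates: dict[str, float],
    z_threshold: float = 2.0,
    min_instances: int = 3,
) -> dict[str, float]:
    """Return {instance: z-score} for instances at least z_threshold sigma from the fleet mean."""
    if len(pass_rates) < min_instances:
        return {}  # too few instances for the statistic to mean anything
    rates = list(pass_rates.values())
    mu = mean(rates)
    sigma = pstdev(rates)  # population std dev; an assumption of this sketch
    if sigma == 0:
        return {}  # every instance agrees, nothing to flag
    z_scores = {name: (rate - mu) / sigma for name, rate in pass_rates.items()}
    return {name: z for name, z in z_scores.items() if abs(z) >= z_threshold}


def fleet_wide_failures(
    failing_by_instance: dict[str, set[str]],
    min_fraction: float = 0.4,
) -> list[str]:
    """Return tests that fail in at least min_fraction of instances."""
    n = len(failing_by_instance)
    if n == 0:
        return []
    counts = Counter(t for tests in failing_by_instance.values() for t in tests)
    return sorted(t for t, c in counts.items() if c / n >= min_fraction)
```

In a six-instance fleet where five pods pass 95% of tests and one passes 50%, only the 50% pod clears the 2σ bar. A test failing in two of three instances is reported as fleet-wide, while one failing in a single instance is not.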

Pure analytics — no network, no LLM. Reads JSONL written by
`evalview monitor --history`. Forward-compatible: silently skips record
shapes it doesn't recognize so future monitor record types don't break
old rollups.
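
One way to get that forward compatibility is a tolerant reader that yields only the record shapes it knows. The sketch below is hypothetical: the `"type"` field, the `"cycle"` value, and the choice to also skip malformed JSON lines are assumptions, since the monitor's JSONL schema is not shown in this PR.

```python
import json
from typing import Iterable, Iterator

# Hypothetical set of record types this reader understands.
KNOWN_RECORD_TYPES = frozenset({"cycle"})


def iter_known_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield recognized records; silently skip everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skipped in this sketch, not raised
        if not isinstance(record, dict):
            continue  # valid JSON but not an object
        if record.get("type") not in KNOWN_RECORD_TYPES:
            continue  # a newer record type this reader predates
        yield record
```

Because the reader takes any iterable of strings, it works the same on an open file handle and on an in-memory list in tests.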

Architecture:

  evalview/core/fleet.py        # pure data + math (frozen dataclasses)
  evalview/commands/fleet_cmd.py # Click CLI + Rich rendering

Coverage: 23 new tests. Full suite stays green (1952 passed).

CLI registered in the Production section next to monitor / autopr /
freshness so all the cross-cutting "what's happening across my agents"
commands live together.

Adds a focused "Threat Model: Untrusted Inputs" section motivated by
the May 2026 Langfuse RCE (single OTel trace request → prototype
pollution → cross-project access). The new section enumerates every
external input EvalView accepts, the parser used for each, the
sandbox posture, and what EvalView explicitly does *not* do
(no pickle, no eval/exec, no unsafe yaml.load, no code execution from
cassettes).

Also documents:

  - Cloud sync isolation contract (opt-in, OAuth, server-enforced
    per-project boundary).
  - MCP server attack surface (stdio by default; network is out-of-band
    and unsupported).
  - Where to report parser bugs (Security Advisory, not public issue).

No code changes; doc-only commit.
…ntions (draft v0)

Upstream OTel's gen_ai.* covers model-call instrumentation. The agent
layer above it — turns, tool selections, handoffs, memory reads,
retrieval, human interventions — has no shared vocabulary, so traces
from one stack are useless to another. Every observability vendor
invents its own attribute names. The "make traces portable" complaint
in the agent-eval discourse is exactly this gap.

This commit ships a portable spec EvalView can both emit and consume,
and that any other tool can adopt without depending on EvalView.

Coverage:

  - 10 span kinds: agent.run, agent.turn, agent.tool_choice,
    agent.tool_call, agent.handoff, agent.memory.{read,write},
    agent.retrieval, agent.intervention, agent.plan.
  - ~30 attributes covering identity, state fingerprints, goal text +
    drift delta, tool choice rationale + alternatives, parameter
    fingerprints, retrieval chunks + influence scores, memory keys +
    age, handoff targets, intervention outcomes, cost, verdict.
  - Validation helpers: is_known_span(), is_known_attribute(),
    attributes_for_span(span) returning the recommended attribute set.

Status: draft v0. OTEL_SEMCONV_VERSION = "0.1.0"; bumped on breaking
change so consumers can branch on rev. EvalView root spans will stamp
the version so consumers know how to parse.

No new dependencies — constants are plain strings. Adapters with the
OTel SDK emit real spans; adapters without it can attach the same
attributes to their existing trace dicts.
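
A minimal sketch of how an adapter without the OTel SDK might use plain-string constants on its own trace dicts. The span names, the three attribute names, the helper names, and the version string are taken from this PR; the recommendation table contents and `make_span_record` are invented for illustration and cover only a fraction of the roughly 30 attributes.

```python
OTEL_SEMCONV_VERSION = "0.1.0"

SPAN_NAMES = frozenset({
    "agent.run", "agent.turn", "agent.tool_choice", "agent.tool_call",
    "agent.handoff", "agent.memory.read", "agent.memory.write",
    "agent.retrieval", "agent.intervention", "agent.plan",
})

# Illustrative subset of the recommendation table (assumed shape).
RECOMMENDED: dict[str, tuple[str, ...]] = {
    "agent.turn": ("agent.goal.drift_delta",),
    "agent.retrieval": ("agent.retrieval.influence_scores",),
    "agent.memory.read": ("agent.memory.age_seconds",),
}
KNOWN_ATTRIBUTES = frozenset(a for attrs in RECOMMENDED.values() for a in attrs)


def is_known_span(name: str) -> bool:
    return name in SPAN_NAMES


def is_known_attribute(name: str) -> bool:
    return name in KNOWN_ATTRIBUTES


def attributes_for_span(span: str) -> tuple[str, ...]:
    """Recommended attributes for a span kind; empty when none are listed."""
    return RECOMMENDED.get(span, ())


def make_span_record(span: str, attributes: dict[str, object]) -> dict:
    """Hypothetical helper: build a trace dict, rejecting invented names."""
    if not is_known_span(span):
        raise ValueError(f"unknown span name: {span}")
    unknown = [key for key in attributes if not is_known_attribute(key)]
    if unknown:
        raise ValueError(f"unknown attributes: {unknown}")
    return {"name": span, "attributes": dict(attributes)}
```

Rejecting unknown names at construction time is one way to enforce the "no inventing names" rule before a trace ever leaves the adapter.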

Contributor side-work:

  - Recipe `docs/agent-recipes/add-otel-emission.md` walks adapter
    authors through wiring the constants into their emitter, with
    pitfalls (don't invent names, fingerprint values rather than
    dumping them, preserve parent_run.id across handoffs).
  - Recipe lists 4 sized roadmap items as good first issues:
    LangGraph wiring, `evalview validate-trace` command,
    JSON-Schema export, verdict_at_step extension.

Coverage: 14 tests pinning constants↔index round-trip,
naming convention, validation helpers, and the recommendation table.
Full suite stays green.
…istic + judge slot)

Decision rationale logging answers *what* the agent chose at each step.
This module answers the question one layer above: *is the trajectory
still about the original ask, or did the agent quietly wander?*

The classic failure: user asks "cancel my subscription and refund the
last charge"; agent does account lookup, plan check, terms review,
then answers a pricing question. By step 12, the trajectory looks
nothing like the goal. A drift signal at step 6 ("the agent is no
longer working on a 'cancel + refund' goal") catches it cheaply
before the user is unhappy.

What ships:

  - GoalEvent + GoalDriftAnalysis dataclasses (frozen, JSON-friendly).
  - Deterministic baseline: Jaccard token overlap between stated goal
    and a trajectory summary (last 8 events, weighted toward current
    intent), with the same digit-collapse-to-<num> normalization
    freshness uses so order IDs don't shred similarity.
  - GoalDriftJudge type: a pluggable callable for smarter judges
    (embedding cosine, LLM, whatever). A judge that returns None falls
    back to the baseline; a judge that raises is caught silently and
    also falls back. Scores are clamped to [0, 1] for safety.
  - analyze_per_step() for "when did drift start?" sparkline rendering.

The deterministic baseline is intentionally crude — fires on the
obviously-wandered cases without hitting an LLM. The judge slot is
where smarter contributors can drop in something better; the recipe
ships next to the module.

Connects to the OTel semconv: hinters + adapters that compute drift
should set agent.goal.drift_delta on agent.turn spans.

Contributor side-work:

  - Recipe `docs/agent-recipes/add-goal-drift-judge.md` walks
    through the type contract, fail-soft requirements, and four
    sized roadmap items (embedding-cosine judge, bag-of-bigrams
    baseline, --drift flag on `evalview replay`, OTel attribute
    emission).

Coverage: 17 tests covering baseline correctness, severity buckets,
empty-input safety, digit normalization, judge override, judge
fallback paths (None / raise), score clamping, per-step monotonicity
on a wandering trajectory.
…judge slot

For RAG agents, observability tools mostly fail to answer the question
that matters: of N chunks I retrieved, which ones did the agent
actually use? Without that you can't tell dead-weight chunks (drop
from index) from dominant ones (overfit), and you can't notice when
retrieval quality silently degrades.

What ships:

  - RetrievedChunk + ChunkAttribution + RetrievalLineage dataclasses.
  - MemoryEntry + StaleMemoryFlag for memory store reads.
  - Deterministic baseline: chunk-token recall in the output (NOT
    Jaccard — chunks shouldn't be penalized for the output containing
    extra material), normalized across the chunk set so scores sum to
    1.0 and are comparable cross-run.
  - AttributionJudge type: a pluggable callable for smarter scorers
    (embedding cosine, mechanistic, LLM). A judge that returns None
    falls back to the baseline; a judge that raises is caught silently
    and also falls back. Scores are clamped to [0, 1].
  - influential / dead_weight surfaces with separate thresholds —
    "this chunk added nothing measurable" is the actionable signal
    for index-pruning workflows, distinct from "barely used".
  - attribute_memory_reads() reuses the chunk machinery for memory
    stores (working / episodic / semantic / profile).
  - detect_stale_memory() flags entries older than a configurable
    age (default 7 days), oldest-first for digest rendering.
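
To make the recall-versus-Jaccard distinction concrete, here is a small sketch of a recall baseline and a stale-entry check. The function names, the dict-shaped inputs, and the tokenizer are assumptions for illustration and do not mirror the dataclasses in `evalview/core/retrieval_lineage.py`.

```python
import re
from datetime import datetime, timedelta


def _tokens(text: str) -> frozenset[str]:
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def recall_attribution(chunks: dict[str, str], output: str) -> dict[str, float]:
    """Score each chunk by the share of its tokens found in the output,
    then normalize so the scores across the chunk set sum to 1.0."""
    output_tokens = _tokens(output)
    raw: dict[str, float] = {}
    for chunk_id, text in chunks.items():
        chunk_tokens = _tokens(text)
        # Recall divides by the chunk's size only, so extra material in
        # the output does not lower the chunk's score (unlike Jaccard).
        raw[chunk_id] = (
            len(chunk_tokens & output_tokens) / len(chunk_tokens)
            if chunk_tokens else 0.0
        )
    total = sum(raw.values())
    if total == 0:
        return {chunk_id: 0.0 for chunk_id in raw}  # nothing was used
    return {chunk_id: score / total for chunk_id, score in raw.items()}


def stale_keys(
    written_at: dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=7),
) -> list[str]:
    """Keys older than max_age, oldest first."""
    stale = [(ts, key) for key, ts in written_at.items() if now - ts > max_age]
    return [key for _, key in sorted(stale)]
```

A chunk scoring exactly 0.0 after normalization contributed no tokens at all, which is the dead-weight case; a chunk with a small nonzero score is merely lightly used.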

Pure module — no I/O, no network, no LLM by default.

Connects to the OTel semconv: adapters that compute lineage should
set agent.retrieval.influence_scores on agent.retrieval spans, and
agent.memory.age_seconds on agent.memory.read spans.

Contributor side-work:

  - Recipe `docs/agent-recipes/add-retrieval-attribution.md` walks
    through the judge type contract, the recall-vs-Jaccard rationale,
    and four sized roadmap items (embedding-cosine judge, entity-overlap
    baseline, `evalview retrieval-stats` aggregator, OTel emission).

Coverage: 16 tests covering baseline correctness, normalization
invariant, judge override, judge fallback (None / raise), score
clamping, influential / dead_weight thresholds, memory delegation,
stale-memory detection (default + custom thresholds), and the
type-alias signature.
…hipped)

Static prompts/tasks miss the production failures users actually
encounter: tools flaking, users changing their mind mid-task, contexts
being corrupted, schemas drifting. Chaos-Monkey-style injection is the
asked-for answer in the agent-eval discourse.

This commit ships the *plan* layer — deterministic disruption
scenarios that can be reused across runs and across machines. The
simulator-side wiring (the handler that applies a planned disruption
to a running simulation) is staged as contributor work via the recipe.

What ships:

  - ChaosDisruption + ChaosScenario (frozen dataclasses, JSON-serializable).
  - 3 modes:
    * tool_failure — Nth invocation of a named tool returns an error.
    * latency_spike — Nth tool call sleeps an extra delay_ms.
    * goal_interruption — synthetic user message after step N (pairs
      naturally with goal_drift detection).
  - build_scenario() enforces "one disruption per step" so the
    simulation stays easy to reason about; raises on conflicts.
  - random_scenario(seed=...) — deterministic synthesis from a seed.
    Same (seed, tools, max_steps, n_disruptions, modes) → same plan
    every time, across CI runs and across machines. All randomness
    flows through hashlib-seeded selection, never random.random().
  - Public CHAOS_MODES_ROADMAP listing 6 sized contributor PRs:
    info_drift, rate_limit, partial_handoff, memory_corruption,
    schema_drift, user_typo.
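
Hash-seeded selection can be sketched as follows. This illustrates the determinism idea only: `Disruption`, `_seeded_index`, and `plan` are hypothetical stand-ins, and the way the seed and labels are hashed is an assumption rather than the scheme used by `random_scenario`.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence

MODES = ("tool_failure", "latency_spike", "goal_interruption")


@dataclass(frozen=True)
class Disruption:
    step: int
    mode: str
    tool: Optional[str]  # None for goal_interruption, which targets no tool


def _seeded_index(seed: int, label: str, n: int) -> int:
    """Map (seed, label) to an index in range(n) using SHA-256 only."""
    digest = hashlib.sha256(f"{seed}:{label}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def plan(seed: int, tools: Sequence[str], max_steps: int, n_disruptions: int) -> tuple[Disruption, ...]:
    """Build a reproducible plan with at most one disruption per step."""
    if not tools:
        raise ValueError("at least one tool is required")
    if n_disruptions > max_steps:
        raise ValueError("cannot place more disruptions than steps")
    free_steps = list(range(1, max_steps + 1))
    disruptions = []
    for i in range(n_disruptions):
        # Popping from the free list guarantees steps never collide.
        step = free_steps.pop(_seeded_index(seed, f"step:{i}", len(free_steps)))
        mode = MODES[_seeded_index(seed, f"mode:{i}", len(MODES))]
        tool = None if mode == "goal_interruption" else tools[_seeded_index(seed, f"tool:{i}", len(tools))]
        disruptions.append(Disruption(step, mode, tool))
    return tuple(sorted(disruptions, key=lambda d: d.step))
```

Because every choice is a pure function of the seed and a label, the same arguments produce the same plan on any machine and any Python version, with no dependence on the global `random` state.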

Pure module — no I/O, no network. Adapter-agnostic by design.

Contributor side-work:

  - Recipe `docs/agent-recipes/add-chaos-mode.md` walks contributors
    through adding a mode end-to-end: constant + builder +
    random_scenario branch + simulate_cmd handler + 4 tests, with the
    determinism-via-_seeded_choice rule called out.
  - Roadmap-shape test pins the CHAOS_MODES_ROADMAP surface across
    refactors.

Coverage: 14 new tests covering builders, scenario invariants,
JSON round-trip, determinism (same seed = same scenario, different
seed = different scenario), step-bound respect, registry
consistency, and frozen-dataclass immutability.
@hidai25 hidai25 merged commit 4253ad5 into main May 15, 2026
8 checks passed

2 participants