Refocus the bench on the five LinkedIn product claims (Visibility, Stages,
Dependencies, Approvals, Artifacts) instead of generic throughput. Each
claim now has a corresponding visibility scenario with a known answer key.
Cross-scenario N=2 result:
Scenario Claim naive agent-tasks
csv-export Visibility 4.0 / 10 10.0 / 10
audit-recall Artifacts 2.8 / 10 10.0 / 10
dep-aware-mgmt Dependencies 0.0 / 10 7.5 / 10
gates-and-approvals Approvals 0.5 / 10 9.75 / 10
AGGREGATE 1.83 / 10 9.31 / 10 (5x)
agent-tasks gives a manager observing a fleet of agents 5x better answer
correctness on standard 'whats going on' questions, and is the only way
to answer Dependencies or Approvals questions at all (naive scores literal
zero on those — the data physically does not exist in the file system).
Added: dep-aware-mgmt and gates-and-approvals visibility scenarios.
Removed: three v1.10.0 throughput pilots (task-claim-race, dependency-graph,
cross-session-pipeline) that produced no signal — their work units were
too small to amortize the per-task ~$0.15 MCP protocol overhead.
realistic-funcs remains as the only throughput pilot (it wins clearly).
Rewrote bench/README.md to lead with a single bottom-line table mapping
each LinkedIn claim to its bench scenario and result.
Cumulative bench evaluation cost across v1.10.x: ~$10.