v1.10.1

@keshrath keshrath tagged this 09 Apr 11:31
Refocus the bench on the five LinkedIn product claims (Visibility, Stages,
Dependencies, Approvals, Artifacts) instead of generic throughput. Each
claim now has a corresponding visibility scenario with a known answer key.

Cross-scenario N=2 result:
  Scenario              Claim          naive       agent-tasks
  csv-export            Visibility     4.0 / 10    10.0 / 10
  audit-recall          Artifacts      2.8 / 10    10.0 / 10
  dep-aware-mgmt        Dependencies   0.0 / 10    7.5 / 10
  gates-and-approvals   Approvals      0.5 / 10    9.75 / 10
  AGGREGATE                            1.83 / 10   9.31 / 10  (5x)

agent-tasks gives a manager observing a fleet of agents roughly 5x better
answer correctness on standard "what's going on" questions (9.31 vs 1.83
out of 10), and is the only practical way to answer Dependencies or
Approvals questions: naive scores 0.0 and 0.5 on those, because the data
does not exist in the file system.
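
The AGGREGATE row can be reproduced from the per-scenario scores in the
table above. This is a minimal sketch that assumes the aggregate is an
unweighted mean across the four scenarios; the bench's own aggregation code
is not shown here, and the names `scores` and `aggregate` are illustrative
only.

```python
# Per-scenario scores out of 10, copied from the table: (naive, agent-tasks).
scores = {
    "csv-export": (4.0, 10.0),
    "audit-recall": (2.8, 10.0),
    "dep-aware-mgmt": (0.0, 7.5),
    "gates-and-approvals": (0.5, 9.75),
}


def aggregate(scores):
    """Return (naive mean, agent-tasks mean, ratio).

    Assumption: every scenario carries equal weight.
    """
    count = len(scores)
    naive_mean = sum(naive for naive, _ in scores.values()) / count
    agent_mean = sum(agent for _, agent in scores.values()) / count
    return naive_mean, agent_mean, agent_mean / naive_mean


naive_mean, agent_mean, ratio = aggregate(scores)
print(f"naive={naive_mean:.4f}  agent-tasks={agent_mean:.4f}  ratio={ratio:.1f}x")
```

Under that assumption the means come out at about 1.825 and 9.3125, which
match the reported 1.83 and 9.31 after rounding, and the ratio is slightly
above 5.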

Added: dep-aware-mgmt and gates-and-approvals visibility scenarios.
Removed: three v1.10.0 throughput pilots (task-claim-race, dependency-graph,
cross-session-pipeline) that produced no signal — their work units were
too small to amortize the per-task ~$0.15 MCP protocol overhead.
realistic-funcs remains as the only throughput pilot (it wins clearly).
Rewrote bench/README.md to lead with a single bottom-line table mapping
each LinkedIn claim to its bench scenario and result.

Cumulative bench evaluation cost across v1.10.x: ~$10.