Release v1.10.1: bench focused on LinkedIn product claims (visibility scenarios) · keshrath/agent-tasks

v1.10.1
7184dae

v1.10.1: bench focused on LinkedIn product claims (visibility scenarios)

keshrath tagged this 09 Apr 11:31

Refocus the bench on the five LinkedIn product claims (Visibility, Stages,
Dependencies, Approvals, Artifacts) instead of generic throughput. Each
claim now has a corresponding visibility scenario with a known answer key.

Cross-scenario N=2 result:
  Scenario              Claim          naive       agent-tasks
  csv-export            Visibility     4.0 / 10    10.0 / 10
  audit-recall          Artifacts      2.8 / 10    10.0 / 10
  dep-aware-mgmt        Dependencies   0.0 / 10    7.5 / 10
  gates-and-approvals   Approvals      0.5 / 10    9.75 / 10
  AGGREGATE                            1.83 / 10   9.31 / 10  (5x)
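The AGGREGATE row can be reproduced from the four scenario rows, assuming it is the unweighted mean of the per-scenario scores (the release notes do not state the aggregation method, so treat this as a sketch). `Decimal` with half-up rounding is used because the naive mean is exactly 1.825, which binary floats would round down to 1.82. The names `SCORES`, `mean`, and `round2` are illustrative and not part of agent-tasks.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores out of 10, copied from the table above: (naive, agent-tasks).
SCORES = {
    "csv-export": (Decimal("4.0"), Decimal("10.0")),
    "audit-recall": (Decimal("2.8"), Decimal("10.0")),
    "dep-aware-mgmt": (Decimal("0.0"), Decimal("7.5")),
    "gates-and-approvals": (Decimal("0.5"), Decimal("9.75")),
}


def mean(values):
    """Unweighted arithmetic mean of an iterable of Decimals."""
    values = list(values)
    return sum(values, Decimal(0)) / len(values)


def round2(x):
    """Round to two decimal places, with halves rounded up."""
    return x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Assumption: AGGREGATE is the plain mean over the four scenarios.
naive_mean = mean(naive for naive, _ in SCORES.values())
tasks_mean = mean(tasks for _, tasks in SCORES.values())
ratio = tasks_mean / naive_mean

print(round2(naive_mean), round2(tasks_mean), round2(ratio))
```

Under that assumption the means come out to 1.83 and 9.31, and the ratio to about 5.1, which matches the reported 5x.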

agent-tasks gives a manager observing a fleet of agents roughly 5x better
answer correctness on standard "what's going on" questions (9.31 vs 1.83
out of 10), and is effectively the only way to answer Dependencies or
Approvals questions (naive scores 0.0 and 0.5 out of 10 on those, because
the data does not exist in the file system).

Added: dep-aware-mgmt and gates-and-approvals visibility scenarios.
Removed: three v1.10.0 throughput pilots (task-claim-race, dependency-graph,
cross-session-pipeline) that produced no signal because their work units
were too small to amortize the per-task ~$0.15 MCP protocol overhead.
Kept: realistic-funcs remains as the only throughput pilot (it wins clearly).
Changed: rewrote bench/README.md to lead with a single bottom-line table
mapping each LinkedIn claim to its bench scenario and result.
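To illustrate why small work units cannot amortize a fixed per-task overhead, here is a minimal sketch of the overhead share as a function of work-unit cost. Only the ~$0.15 overhead figure comes from these notes. The work-unit costs in the loop are hypothetical values chosen for illustration, not measurements from the bench, and `overhead_share` is not an agent-tasks function.

```python
# Approximate per-task MCP protocol overhead quoted in the notes above.
PER_TASK_OVERHEAD_USD = 0.15


def overhead_share(work_cost_usd, overhead_usd=PER_TASK_OVERHEAD_USD):
    """Fraction of a task's total cost that is fixed protocol overhead."""
    if work_cost_usd < 0:
        raise ValueError("work cost must be non-negative")
    return overhead_usd / (overhead_usd + work_cost_usd)


# Hypothetical work-unit costs (assumptions, not bench data).
for work_cost in (0.05, 0.15, 1.35):
    share = overhead_share(work_cost)
    print(f"work=${work_cost:.2f}  overhead share={share:.0%}")
```

When the work itself costs about as much as the overhead, half of the spend is protocol cost, so any throughput gain has to be large before it becomes visible in the results.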

Cumulative bench evaluation cost across v1.10.x: ~$10.
