v1.3.3

@keshrath keshrath tagged this 08 Apr 21:00
Adds the bench pilot that directly simulates the user's real-world scenario:
two Claude sessions work in the same project at different times, and the
conflict surfaces at git commit. v1.3.2 shipped the workspace-awareness and
bash-guard hooks, but they were only unit-tested in isolation. This pilot is
the end-to-end check.

NEW bench/workloads/multi-term/

  4 source files (foo, bar, baz, qux), each a stub function. The driver
  inits a git repo around them at run time via the new gitInit: true option.

NEW driver options

  gitInit: true           run git init/add/commit in the shared dir before
                          agents spawn (used by multi-term-commit)
  installBashGuard: true  add bash-guard PreToolUse(Bash) hook to the
                          per-agent settings JSON alongside file-coord
  needDashboard now triggers on installBashGuard too (bash-guard talks to
  the REST API)
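
As a rough sketch of how the two bench conditions described in these notes might differ only in installBashGuard: the option names gitInit, installBashGuard and sequentialAgents come from these notes, but the object shape and the optionsFor helper are assumptions made for illustration, not the driver's actual API.

```javascript
// Hypothetical helper: builds driver options for one bench condition.
// Only the option names are taken from the release notes; the shape is assumed.
function optionsFor(condition) {
  if (condition !== "naive" && condition !== "hooked") {
    throw new Error(`unknown condition: ${condition}`);
  }
  return {
    sequentialAgents: true, // agents run one after another in the shared dir
    gitInit: true, // git init/add/commit before agents spawn
    installBashGuard: condition === "hooked", // guard only in the hooked run
  };
}
```

Under this assumption the naive and hooked runs share the same repo setup, so any difference in the resulting commit can be attributed to the guard.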

NEW runMultiTerminalCommit() pilot

  2 sequential agents in the SAME shared dir (not parallel; this uses the
  sequentialAgents mode added in v1.3.1's async-handoff work).
  - Session A: implement add() in foo.js and subtract() in bar.js, do NOT
    commit, just edit and stop
  - Session B: implement multiply() in baz.js and divide() in qux.js, then
    run `git commit -am 'session-B: baz+qux'`

  Naive condition (no bash-guard): B's commit will likely include A's
  foo+bar files because git commit -am stages every modified tracked file,
  not just the ones B edited.

  Hooked condition (bash-guard installed): B's commit is BLOCKED with a
  'held by session-A' message; B has to react (selective stage, restore A's
  files, coordinate, etc.).
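
The difference between the two conditions can be sketched as a single decision: would this command commit files that another session holds? The code below is an illustrative stand-in, not the project's bash-guard. The function names, the Map of file holders, and the flag-matching regex are all assumptions.

```javascript
// Hypothetical: detects `git commit` with -a (alone or clustered, e.g. -am)
// or --all. Simplified; a commit message containing a token like " -a "
// would be misread as the flag.
function isCommitAll(command) {
  const isCommit = /\bgit\s+commit\b/.test(command);
  const hasAllFlag = /(^|\s)(-[a-zA-Z]*a[a-zA-Z]*|--all)(\s|$)/.test(command);
  return isCommit && hasAllFlag;
}

// Hypothetical: modifiedFiles are the tracked files with uncommitted edits;
// heldByOthers maps a file to the other session that holds it.
function guardDecision(command, modifiedFiles, heldByOthers) {
  const allow = { blocked: false, conflicts: [], message: "" };
  if (!isCommitAll(command)) return allow;
  const conflicts = modifiedFiles.filter((file) => heldByOthers.has(file));
  if (conflicts.length === 0) return allow;
  const owners = [...new Set(conflicts.map((file) => heldByOthers.get(file)))];
  return {
    blocked: true,
    conflicts,
    message: `${conflicts.join(", ")} held by ${owners.join(", ")}`,
  };
}
```

With an empty holder map (the naive condition) nothing is blocked, so `git commit -am` sweeps in foo.js and bar.js. With session-A recorded as holder of those two files, the same command is blocked, while a selective `git commit -m '...' baz.js qux.js` passes this check.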

  Headline metric: commit_purity = does B's commit contain ONLY baz+qux?
  Post-run analyzer parses git log + git show --name-only from the run dir
  to compute it.
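
A minimal sketch of how commit_purity could be computed from the file list of B's commit. It assumes the input is the output of something like `git show --name-only --format= HEAD` (one path per line, no commit header); the commitPurity name, its return shape, and the choice to treat an empty commit as impure are illustrative assumptions, not the analyzer's actual code.

```javascript
// Hypothetical analyzer step: compares the files in a commit against the
// files the session was expected to touch.
function commitPurity(nameOnlyOutput, expectedFiles) {
  const committed = nameOnlyOutput
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const expected = new Set(expectedFiles);
  // Files in the commit that the session was not supposed to include.
  const extra = committed.filter((file) => !expected.has(file));
  // Expected files that did not make it into the commit.
  const missing = expectedFiles.filter((file) => !committed.includes(file));
  return {
    pure: committed.length > 0 && extra.length === 0,
    extra,
    missing,
  };
}
```

For session B, expectedFiles would be ["baz.js", "qux.js"]; a commit that also lists foo.js or bar.js reports them under extra and is counted as impure.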

  Run with: npm run bench:run -- --real --pilot=multi-term

NOT YET RUN: rate-limited until 11pm Vienna time. The pilot is validated as
code (typecheck + lint + 288/288 tests pass), but the end-to-end validation
against real Claude subagents requires API budget, which resets at 11pm.
One command to validate when budget returns:

  npm run bench:run -- --real --pilot=multi-term

Cost: ~$3-4 per run, ~3 minutes wall-clock.