Adds the bench pilot that directly simulates the user's real-world scenario:
two Claude sessions work in the same project at different times, and the
conflict surfaces at git commit. v1.3.2 shipped the workspace-awareness and
bash-guard hooks, but they were only unit-tested in isolation. This pilot is
the end-to-end check.
NEW bench/workloads/multi-term/
4 source files (foo, bar, baz, qux), each a stub function. The driver
inits a git repo around them at run time via the new gitInit: true option.
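A minimal sketch of what the gitInit step could look like, assuming a Node
driver. The function name gitInitSharedDir, the GitRunner callback, the bench
identity, and the commit message are all hypothetical; only the
init/add/commit sequence comes from this change.

```typescript
// Hypothetical runner type: executes `git <args>` inside `cwd`.
type GitRunner = (args: string[], cwd: string) => void;

// Sketch of the gitInit: true step: turn the shared dir into a repo whose
// first commit contains the stub files, before any agent is spawned.
function gitInitSharedDir(dir: string, run: GitRunner): void {
  // Assumed identity, passed with -c so the commit also works on machines
  // that have no global git user configured.
  const identity = ["-c", "user.name=bench", "-c", "user.email=bench@example.invalid"];
  run(["init"], dir);
  run(["add", "-A"], dir);
  run(identity.concat(["commit", "-m", "initial stubs"]), dir);
}
```

In a real driver the runner would most likely wrap
child_process.execFileSync("git", args, { cwd }); injecting it here keeps the
sketch checkable without a git binary.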
NEW driver options
  gitInit: true            run git init/add/commit in the shared dir before
                           agents spawn (used by multi-term-commit)
  installBashGuard: true   add bash-guard PreToolUse(Bash) hook to the
                           per-agent settings JSON alongside file-coord
needDashboard now triggers on installBashGuard too (bash-guard talks to
the REST API)
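The two options above could feed the per-agent settings roughly as follows.
This is a sketch under assumptions, not the driver's actual code: the
installFileCoord flag, the "Edit|Write" matcher for file-coord, and both hook
script paths are hypothetical. The hooks.PreToolUse shape is meant to mirror
Claude Code's settings format.

```typescript
interface DriverOptions {
  gitInit?: boolean;
  installFileCoord?: boolean; // assumed name for the existing file-coord toggle
  installBashGuard?: boolean;
}

interface HookEntry {
  matcher: string;
  hooks: { type: "command"; command: string }[];
}

interface AgentSettings {
  hooks: { PreToolUse: HookEntry[] };
}

// The dashboard serves the REST API, so it is needed whenever a hook that
// talks to that API is installed.
function needDashboard(opts: DriverOptions): boolean {
  return opts.installFileCoord === true || opts.installBashGuard === true;
}

// Build the per-agent settings object; bash-guard is added alongside
// file-coord rather than replacing it.
function buildAgentSettings(opts: DriverOptions): AgentSettings {
  const preToolUse: HookEntry[] = [];
  if (opts.installFileCoord) {
    preToolUse.push({
      matcher: "Edit|Write", // assumed matcher
      hooks: [{ type: "command", command: "node hooks/file-coord.js" }], // assumed path
    });
  }
  if (opts.installBashGuard) {
    preToolUse.push({
      matcher: "Bash",
      hooks: [{ type: "command", command: "node hooks/bash-guard.js" }], // assumed path
    });
  }
  return { hooks: { PreToolUse: preToolUse } };
}
```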
NEW runMultiTerminalCommit() pilot
2 sequential agents in the SAME shared dir (not parallel; this uses the
sequentialAgents mode added in v1.3.1's async-handoff work).
- Session A: implement add() in foo.js and subtract() in bar.js, do NOT
commit, just edit and stop
- Session B: implement multiply() in baz.js and divide() in qux.js, then
run `git commit -am 'session-B: baz+qux'`
Naive condition (no bash-guard): B's commit will likely include A's
foo+bar edits, because git commit -am stages every modified tracked file
and gitInit has already committed all four stubs.
Hooked condition (bash-guard installed): B's commit is BLOCKED with a
'held by session-A' message; B has to react (selective stage, restore A's
files, coordinate, etc.).
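The hooked condition can be illustrated with a small decision function. This
is a hedged sketch, not bash-guard's real logic: the HeldFile shape, the
guardBash and isCommitAll names, and the token-based flag detection are
assumptions, and in practice the held-file list would come from the REST API.

```typescript
interface HeldFile {
  path: string;
  session: string;
}

interface GuardDecision {
  allow: boolean;
  reason?: string;
}

// Rough check for `git commit` with the -a flag (-a, -am, --all).
// Splits on whitespace only, so it does not understand shell quoting.
function isCommitAll(command: string): boolean {
  const tokens = command.trim().split(/\s+/);
  const commitIndex = tokens.indexOf("commit");
  if (tokens[0] !== "git" || commitIndex < 0) return false;
  return tokens
    .slice(commitIndex + 1)
    .some((t) => t === "--all" || /^-[a-zA-Z]*a[a-zA-Z]*$/.test(t));
}

// Block a commit-all when it would sweep in modified files that another
// session currently holds.
function guardBash(
  command: string,
  self: string,
  modified: string[],
  held: HeldFile[]
): GuardDecision {
  if (!isCommitAll(command)) return { allow: true };
  const foreign = held.filter(
    (h) => h.session !== self && modified.indexOf(h.path) >= 0
  );
  if (foreign.length === 0) return { allow: true };
  const list = foreign.map((h) => `${h.path} held by ${h.session}`).join(", ");
  return { allow: false, reason: `commit blocked: ${list}` };
}
```

With A holding foo.js and bar.js, B's `git commit -am ...` is rejected under
this sketch, while a plain `git commit -m ...` after a selective `git add`
passes.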
Headline metric: commit_purity = does B's commit contain ONLY baz+qux?
Post-run analyzer parses git log + git show --name-only from the run dir
to compute it.
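One way the commit_purity check could be computed from the file list of B's
commit, sketched under assumptions: the input is taken to be the output of
something like `git show --name-only --format= <sha>` (one path per line), and
the commitPurity name and PurityResult shape are hypothetical. Treating an
empty commit as impure and reporting missing files separately are choices made
for this sketch only.

```typescript
interface PurityResult {
  pure: boolean;        // true when the commit touches only expected files
  files: string[];      // every path found in the commit
  unexpected: string[]; // paths that belong to another session
  missing: string[];    // expected paths the commit did not include
}

function commitPurity(nameOnlyOutput: string, expected: string[]): PurityResult {
  const files = nameOnlyOutput
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const unexpected = files.filter((f) => expected.indexOf(f) < 0);
  const missing = expected.filter((f) => files.indexOf(f) < 0);
  return {
    pure: files.length > 0 && unexpected.length === 0,
    files,
    unexpected,
    missing,
  };
}
```

For the naive condition, a commit listing foo.js, bar.js, baz.js, and qux.js
yields pure = false with foo.js and bar.js as unexpected; a commit listing only
baz.js and qux.js yields pure = true.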
Run with: npm run bench:run -- --real --pilot=multi-term
NOT YET RUN: rate-limited until 11pm Vienna time, when the API budget
resets. The pilot is validated as code (typecheck + lint + 288/288 tests
pass), but end-to-end validation against real Claude subagents has to wait
for that reset. One command to validate when budget returns:
npm run bench:run -- --real --pilot=multi-term
Estimated cost: ~$3-4 per run, ~3 minutes wall-clock.