Proctor makes a coding agent prove it manually tested its own work before it can say "done".
The point is not browser automation by itself. Agents can already click buttons, run curl, and take screenshots. The missing piece is the contract:
- what had to be tested
- what counted as proof
- what blocked completion
Proctor creates that contract, records evidence against it, and refuses completion until the required proof exists.
The CLI is intentionally long-form. A fresh agent should be able to start with proctor --help, learn the workflow, and complete a run without reading Proctor's source.
brew tap nclandrei/tap
brew install nclandrei/tap/proctor
Tagged GitHub releases publish prebuilt Homebrew archives from
nclandrei/proctor and refresh nclandrei/homebrew-tap/Formula/proctor.rb
automatically.
The release workflow expects a HOMEBREW_TAP_TOKEN GitHub secret with push
access to nclandrei/homebrew-tap.
Proctor is agent-agnostic. It is meant to work from:
- Codex
- Claude Code
- any other coding agent with shell access
It does not assume one agent runtime, one browser driver, or one editor.
Proctor is not:
- a browser automation framework
- an iOS automation framework
- a hosted QA platform
Proctor is:
- a manual-test contract generator
- an evidence recorder
- a completion gate
- a shareable reporting layer
This is the kind of prompt Proctor is designed for:
We just implemented the new authentication flow.
Use proctor --help to manually test it.
That prompt should be enough. The agent should not need extra explanation from the human.
Reading proctor --help is not the task. It is the entry point. The agent is
still expected to inspect the current diff, identify the user-visible change,
create the right contract, run the manual checks, and record real evidence.
For user-visible web work, start here:
proctor start \
--platform web \
--feature "new authentication flow" \
--url http://127.0.0.1:3000/login \
--curl scenario \
--curl-endpoint "happy-path=POST /api/login" \
--curl-endpoint "failure-path=POST /api/login" \
--curl-endpoint "Already signed-in users are redirected away from /login=GET /api/session" \
--happy-path "Valid credentials redirect to the dashboard." \
--failure-path "Invalid credentials show an error and keep the user on /login." \
--edge-case "validation and malformed input=Bad email shows inline validation" \
--edge-case "empty or missing input=Empty email and password show required-field errors" \
--edge-case "retry or double-submit=Second submit does not create duplicate requests" \
--edge-case "loading, latency, and race conditions=Button stays disabled while the request is pending" \
--edge-case "network or server failure=500 response shows a retryable error state" \
--edge-case "auth and session state=Already signed-in users are redirected away from /login" \
--edge-case "refresh, back-navigation, and state persistence=Refresh preserves the authenticated state" \
--edge-case "mobile or responsive behavior=Login form stays usable at mobile width" \
--edge-case "accessibility and keyboard behavior=Enter submits from the password field; tab order stays correct" \
--edge-case "any feature-specific risks=N/A: no extra feature-specific risks"
curl is decided per scenario. --curl scenario is the explicit risk-based mode, and each --curl-endpoint entry binds one or more endpoints to a named scenario. --curl required remains as a shorthand for requiring curl on both the happy path and failure path.
If the flow is mostly client-side and there is no meaningful backend or protocol risk, skip curl with an explicit reason:
proctor start \
--platform web \
--feature "What Did I Just Watch finder" \
--url http://127.0.0.1:4174/kimarite \
--curl skip \
--curl-skip-reason "Static client-side filter UI with no separate backend contract." \
--happy-path "Selecting plain-language finish and approach clues narrows the library and updates the URL." \
--failure-path "A user can back out with Not sure yet and return to the broad library without broken state." \
--edge-case "validation and malformed input=N/A: no freeform input, only preset links" \
--edge-case "empty or missing input=Starting with no clue selected still shows the broad library and finder." \
--edge-case "retry or double-submit=N/A: idempotent client-side link navigation only" \
--edge-case "loading, latency, and race conditions=N/A: static client-side filter state with no async mutation" \
--edge-case "network or server failure=N/A: no feature-specific backend dependency" \
--edge-case "auth and session state=N/A: public catalog page" \
--edge-case "refresh, back-navigation, and state persistence=Direct filtered URL preserves the selected clue state on load." \
--edge-case "mobile or responsive behavior=Filtered finder state remains readable and usable on mobile." \
--edge-case "accessibility and keyboard behavior=N/A: this pass is visual only" \
--edge-case "any feature-specific risks=N/A: reset behavior is covered by the main failure path"
For iOS work, create an iOS contract instead:
proctor start \
--platform ios \
--feature "reader library relaunch" \
--ios-scheme Pagena \
--ios-bundle-id com.example.pagena \
--ios-simulator "iPhone 16 Pro" \
--curl skip \
--curl-skip-reason "UI-only iOS verification for this pass." \
--happy-path "Launching the app lands on the library screen." \
--failure-path "Missing content shows a visible recovery state instead of a blank screen." \
--edge-case "validation and malformed input=N/A: no freeform input in this flow" \
--edge-case "empty or missing input=N/A: no required input in this flow" \
--edge-case "retry or double-submit=N/A: no repeated mutation in this flow" \
--edge-case "loading, latency, and race conditions=Loading placeholder settles once without duplicate content." \
--edge-case "network or server failure=Offline launch shows a recoverable empty state." \
--edge-case "auth and session state=N/A: anonymous browsing only" \
--edge-case "app lifecycle, relaunch, and state persistence=Foregrounding the app keeps the same selected title." \
--edge-case "device traits, orientation, and layout=Library remains readable on the target simulator." \
--edge-case "accessibility, dynamic type, and keyboard behavior=N/A: this pass is visual only" \
--edge-case "any feature-specific risks=N/A: no extra feature-specific risks"
For CLI and TUI work, create a CLI contract instead:
proctor start \
--platform cli \
--feature "magellan prompt inspection flow" \
--cli-command "magellan prompts inspect onboarding" \
--happy-path "Inspecting a known prompt shows the body and metadata in a readable terminal layout." \
--failure-path "Inspecting an unknown prompt exits non-zero and prints a clear error." \
--edge-case "invalid or malformed input=Broken prompt syntax shows a validation error without a panic" \
--edge-case "missing required args, files, config, or env=Missing prompt slug explains what argument is required" \
--edge-case "retry, rerun, and idempotency=Running the same inspect command twice gives the same result" \
--edge-case "long-running output, streaming, or progress state=N/A: single-shot command with immediate output" \
--edge-case "interrupts, cancellation, and signals=N/A: command exits immediately" \
--edge-case "tty, pipe, and non-interactive behavior=Piped output still renders the inspected prompt body without ANSI garbage" \
--edge-case "terminal layout, wrapping, and resize behavior=The inspected prompt still wraps cleanly in a narrow terminal" \
--edge-case "keyboard navigation and shortcut behavior=N/A: single-shot command with no in-app key handling" \
--edge-case "state, config, and persistence across reruns=N/A: read-only inspection command" \
--edge-case "stderr, exit codes, and partial failure reporting=Unknown prompt returns a non-zero exit code and prints the error on stderr" \
--edge-case "any feature-specific risks=N/A: no extra feature-specific risks"
Proctor does not drive the browser for you. Use your own browser tooling to produce:
- a desktop screenshot
- a mobile screenshot
- a report.json file with desktop and mobile final URLs and issue counts
Proctor only needs a small report shape:
{
"desktop": {
"finalUrl": "http://127.0.0.1:3000/dashboard",
"issues": {
"consoleErrors": 0,
"consoleWarnings": 0,
"pageErrors": 0,
"failedRequests": 0,
"httpErrors": 0
}
},
"mobile": {
"finalUrl": "http://127.0.0.1:3000/dashboard",
"issues": {
"consoleErrors": 0,
"consoleWarnings": 0,
"pageErrors": 0,
"failedRequests": 0,
"httpErrors": 0
}
}
}
consoleWarnings is part of the browser report schema so the run keeps the full
browser-health picture. By default, though, Proctor only blocks completion on
console errors, page errors, failed requests, and HTTP errors. Add an explicit
assertion such as console_warnings = 0 when warnings should fail the run too.
If your browser tool does not emit this exact file, that is still fine. Capture
the real browser session data, then write a tiny report.json file with this
shape and attach that to Proctor.
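One way to produce that file by hand is a pair of small shell helpers. This is a minimal sketch, not part of Proctor: viewport_json and browser_report_json are hypothetical names, the counts are placeholders to be replaced with numbers observed in the real browser session, and the helpers assume the URLs contain no double quotes or backslashes.

```shell
#!/bin/sh
# Print one viewport object in the report shape shown above.
# Args: final URL, consoleErrors, consoleWarnings, pageErrors, failedRequests, httpErrors
viewport_json() {
  printf '{"finalUrl": "%s", "issues": {"consoleErrors": %d, "consoleWarnings": %d, "pageErrors": %d, "failedRequests": %d, "httpErrors": %d}}' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}

# Combine a desktop and a mobile viewport object into one report document.
# Args: desktop object, mobile object
browser_report_json() {
  printf '{"desktop": %s, "mobile": %s}\n' "$1" "$2"
}

# Placeholder values: replace them with what the browser session actually showed.
desktop=$(viewport_json "http://127.0.0.1:3000/dashboard" 0 2 0 0 0)
mobile=$(viewport_json "http://127.0.0.1:3000/dashboard" 0 0 0 0 0)
browser_report_json "$desktop" "$mobile" > report.json
```

The resulting report.json can then be attached with the --report flag of proctor record browser.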
For CLI and TUI work, Proctor expects a real terminal session.
Preferred, not required: use a real terminal app plus tmux or an equivalent
persistent multiplexer so the agent can keep one session alive, drive keyboard
input deterministically, capture pane output, and take screenshots.
- run the CLI in a real terminal session
- capture at least one screenshot
- capture the terminal transcript from that session
- record the actual command you exercised
Each record browser command attaches one browser run to one scenario:
proctor record browser \
--scenario happy-path \
--session auth-browser-1 \
--report /abs/path/report.json \
--screenshot desktop=/abs/path/desktop.png \
--screenshot mobile=/abs/path/mobile.png \
--assert 'final_url contains /dashboard' \
--assert 'desktop_screenshot = true' \
--assert 'mobile_screenshot = true'
You can reuse one browser report for multiple scenarios if it genuinely proves each one.
Proctor does not boot the simulator for you. Use your own simulator tooling to
build, launch, screenshot, and inspect logs. Proctor only needs a screenshot
plus a small ios-report.json file:
{
"simulator": {
"name": "iPhone 16 Pro",
"runtime": "iOS 18.2"
},
"app": {
"bundleId": "com.example.pagena",
"screen": "Library",
"state": "foreground"
},
"issues": {
"launchErrors": 0,
"crashes": 0,
"fatalLogs": 0
}
}
Then record that evidence against the scenario:
proctor record ios \
--scenario happy-path \
--session pagena-library-1 \
--report /abs/path/ios-report.json \
--screenshot library=/abs/path/library.png \
--assert 'screen contains Library' \
--assert 'bundle_id = com.example.pagena' \
--assert 'app_launch = true'
One simulator report can be reused for multiple scenarios if it genuinely proves each one.
For CLI and TUI work, record the terminal evidence against the scenario:
proctor record cli \
--scenario happy-path \
--session magellan-cli-1 \
--command "magellan prompts inspect onboarding" \
--transcript /abs/path/pane.txt \
--screenshot terminal=/abs/path/terminal.png \
--exit-code 0 \
--assert 'output contains onboarding' \
--assert 'exit_code = 0' \
--assert 'screenshot = true'
When a scenario requires curl, wrap the real command:
proctor record curl \
--scenario failure-path \
--assert 'status = 401' \
--assert 'body contains invalid' \
--assert 'header.content-type contains application/json' \
-- \
curl -si -X POST http://127.0.0.1:3000/api/login \
-H 'content-type: application/json' \
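-d '{"email":"demo@example.com","password":"wrong"}'
Once the evidence is recorded, check progress, run the completion gate, and print the report paths:
proctor status
proctor done
proctor report
proctor done is the real completion gate. If it fails, the run is not complete.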
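A wrapper script can treat that gate as its own exit condition. The sketch below assumes proctor done exits non-zero while the contract is unsatisfied, which the text implies but does not state outright. finish_run and the PROCTOR override variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Command used to invoke Proctor; overridable so the wrapper can be
# dry-run against a stub.
PROCTOR=${PROCTOR:-proctor}

# Run the completion gate. On success print the report paths;
# on failure show what is still outstanding and return non-zero.
finish_run() {
  if "$PROCTOR" done; then
    "$PROCTOR" report
  else
    echo "run is not complete; remaining work:" >&2
    "$PROCTOR" status
    return 1
  fi
}
```

Calling finish_run at the end of an agent script keeps a failed gate from being reported as success.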
Freehand notes do not count.
For browser evidence, Proctor expects:
- a session id string
- desktop and mobile screenshots across the run
- a report JSON artifact
- at least one passing assertion
The report JSON can be synthesized from real browser-session output. It does not have to come from one specific browser helper.
For web runs, mobile proof is mandatory. Even when the primary scenario is
desktop-first, proctor done still requires at least one desktop screenshot and
at least one mobile screenshot somewhere in the recorded browser evidence.
For iOS evidence, Proctor expects:
- a simulator session id string
- at least one simulator screenshot across the run
- an ios-report.json artifact
- at least one passing assertion
The iOS report can be synthesized from real simulator-session output. It does not have to come from one specific helper.
For CLI evidence, Proctor expects:
- a terminal session id string
- at least one terminal screenshot across the run
- a transcript artifact from that session
- the actual exercised command
- at least one passing assertion
For curl evidence, Proctor expects:
- a real wrapped command
- the captured transcript
- at least one passing assertion
Provenance alone is not enough. Evidence must also include scenario-specific assertions.
curl is gated per scenario, not per endpoint. Endpoints are recorded on each scenario so the contract can say which HTTP surfaces carry risk, but proctor done still evaluates evidence scenario-by-scenario.
Browser assertion examples:
- final_url contains /dashboard
- final_url = http://127.0.0.1:3000/login
- console_errors = 0
- console_warnings = 0
- failed_requests = 0
- http_errors = 1
- desktop_screenshot = true
- mobile_screenshot = true
- mobile.final_url contains /login
If you do not explicitly assert browser health counts, Proctor adds implicit zero-issue assertions for the blocking browser-health metrics:
- console errors
- page errors
- failed requests
- HTTP errors
Console warnings are deliberately excluded from that default gate. Proctor still
records consoleWarnings in the report so you can inspect them later or make
them blocking with an explicit assertion such as console_warnings = 0.
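The default gate described above can be expressed as a small sum. This is a sketch of the rule as documented, not Proctor's implementation: blocking_issue_total and strict_issue_total are hypothetical helpers, their five arguments follow the order of the issues object in the browser report, and explicit overrides such as http_errors = 1 are not modeled.

```shell
#!/bin/sh
# Args: consoleErrors consoleWarnings pageErrors failedRequests httpErrors
# Prints the number of issues that block completion by default.
# consoleWarnings ($2) is intentionally left out of the sum.
blocking_issue_total() {
  echo $(( $1 + $3 + $4 + $5 ))
}

# Same counts with warnings promoted to blocking, which mirrors
# adding an explicit console_warnings = 0 assertion.
strict_issue_total() {
  echo $(( $1 + $2 + $3 + $4 + $5 ))
}

# Three warnings and nothing else: passes the default gate, fails the strict one.
blocking_issue_total 0 3 0 0 0   # prints 0
strict_issue_total 0 3 0 0 0     # prints 3
```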
iOS assertion examples:
- screen contains Library
- bundle_id = com.example.pagena
- simulator contains iPhone 16 Pro
- runtime contains iOS
- state = foreground
- app_launch = true
- launch_errors = 0
- crashes = 0
- fatal_logs = 0
- screenshot = true
If you do not explicitly assert iOS health counts, Proctor adds implicit zero-issue assertions for:
- launch errors
- crashes
- fatal logs
CLI assertion examples:
- output contains onboarding
- output contains prompt not found
- command contains magellan
- session contains cli-session
- tool = terminal-session
- exit_code = 0
- screenshot = true
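Every assertion shown in this document uses one of two operators, = or contains. The sketch below shows one way such an assertion string could be checked against an observed value. Proctor's real parser is not shown here, so check_assertion and its argument order are assumptions made for illustration; the caller is expected to look up the observed value for the key itself.

```shell
#!/bin/sh
# Args: observed value, assertion text such as 'exit_code = 0'.
# Returns 0 when the assertion holds, 1 when it does not,
# and 2 when the operator is neither ' = ' nor ' contains '.
check_assertion() {
  actual=$1
  assertion=$2
  case $assertion in
    *' contains '*)
      # Everything after the first ' contains ' is the expected substring.
      expected=${assertion#* contains }
      case $actual in
        *"$expected"*) return 0 ;;
      esac
      return 1
      ;;
    *' = '*)
      # Everything after the first ' = ' must match the observed value exactly.
      expected=${assertion#* = }
      [ "$actual" = "$expected" ]
      ;;
    *)
      return 2
      ;;
  esac
}

check_assertion "http://127.0.0.1:3000/dashboard" 'final_url contains /dashboard' && echo pass
```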
Proctor does not accept "give me two edge cases".
Each category must be covered either by:
- one or more concrete scenarios
- or N/A with a reason
Current categories:
Web:
- validation and malformed input
- empty or missing input
- retry or double-submit
- loading, latency, and race conditions
- network or server failure
- auth and session state
- refresh, back-navigation, and state persistence
- mobile or responsive behavior
- accessibility and keyboard behavior
- any feature-specific risks
iOS:
- validation and malformed input
- empty or missing input
- retry or double-submit
- loading, latency, and race conditions
- network or server failure
- auth and session state
- app lifecycle, relaunch, and state persistence
- device traits, orientation, and layout
- accessibility, dynamic type, and keyboard behavior
- any feature-specific risks
CLI:
- invalid or malformed input
- missing required args, files, config, or env
- retry, rerun, and idempotency
- long-running output, streaming, or progress state
- interrupts, cancellation, and signals
- tty, pipe, and non-interactive behavior
- terminal layout, wrapping, and resize behavior
- keyboard navigation and shortcut behavior
- state, config, and persistence across reruns
- stderr, exit codes, and partial failure reporting
- any feature-specific risks
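The coverage rule can be checked mechanically before calling proctor start. The sketch below is an assumption about how such a pre-check might look, not Proctor's own validation. check_edge_cases is a hypothetical helper: it takes a newline-separated category list plus newline-separated category=description entries in the same form as the --edge-case flag values, and it treats a bare N/A with no reason as uncovered.

```shell
#!/bin/sh
# Args: $1 = required categories, one per line
#       $2 = entries, one "category=description" per line
# Prints one line per problem and returns non-zero if any category is uncovered.
check_edge_cases() {
  categories=$1
  entries=$2
  rc=0
  while IFS= read -r category; do
    # Take the text after "category=" from the first entry with that exact prefix.
    value=$(printf '%s\n' "$entries" |
      awk -v c="$category" 'index($0, c "=") == 1 { print substr($0, length(c) + 2); exit }')
    case $value in
      '') echo "missing: $category"; rc=1 ;;
      'N/A'|'N/A:'|'N/A: ') echo "N/A without reason: $category"; rc=1 ;;
    esac
  done <<EOF
$categories
EOF
  return $rc
}

# Two of the web categories, with one entry that lacks an N/A reason.
web_subset='retry or double-submit
auth and session state'
entries='retry or double-submit=N/A
auth and session state=Already signed-in users are redirected away from /login'
check_edge_cases "$web_subset" "$entries" || echo "contract is incomplete"
```

Passing the full ten- or eleven-item list for the target platform in place of web_subset gives a complete pre-check.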
- proctor --help: The long-form agent onboarding surface.
- proctor start: Creates the verification contract.
- proctor status: Shows what still passes or fails.
- proctor record browser: Attaches browser evidence to one scenario.
- proctor record cli: Attaches terminal evidence to one scenario.
- proctor record ios: Attaches iOS simulator evidence to one scenario.
- proctor record curl: Wraps and records one real HTTP command for one scenario.
- proctor done: Fails until the contract is satisfied.
- proctor report: Prints the generated output paths.
Use subcommand help for exact flags:
proctor start --help
proctor record browser --help
proctor record cli --help
proctor record ios --help
proctor record curl --help
proctor done --help
Artifacts live outside the repo by default:
~/.proctor/runs/<repo-slug>/<run-id>/
Important files:
- run.json
- evidence.jsonl
- contract.md
- report.html
- artifacts/
contract.md and report.html are derived from the recorded evidence. They are human-facing outputs, not the source of truth.
report.html is always rendered in dark mode, keeps its styles, screenshot previews, and embedded log transcripts self-contained, lets readers enlarge screenshots inline, and keeps logs collapsed until the reader expands them.
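A quick way to confirm that a run directory has the layout listed above is a small existence check. This is a sketch only: check_run_dir is a hypothetical helper, and it assumes all five entries are present once a run has finished, which this document does not guarantee for runs still in progress.

```shell
#!/bin/sh
# Args: path to one run directory, for example a directory under ~/.proctor/runs/.
# Prints each expected entry that is absent and returns non-zero if any is missing.
check_run_dir() {
  run_dir=$1
  rc=0
  for name in run.json evidence.jsonl contract.md report.html; do
    [ -f "$run_dir/$name" ] || { echo "missing file: $name"; rc=1; }
  done
  [ -d "$run_dir/artifacts" ] || { echo "missing directory: artifacts/"; rc=1; }
  return $rc
}
```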
Current supported surfaces:
- web browser evidence with desktop and mobile proof
- CLI and TUI evidence with screenshots plus transcripts
- iOS simulator evidence with screenshots plus simulator/app report metadata
- risk-based curl evidence when backend or protocol verification matters
curl risk is modeled per scenario, with scenario-level endpoint lists and scenario-level completion gates.
go test ./...
go run . --help
If you are changing the browser reporting or CLI help, rerun a fresh-agent test. The target bar is simple:
- a new agent should start with proctor --help
- it should not need to read Proctor's source
- it should be able to create a run, record evidence, and finish with proctor done