Proctor

Proctor makes a coding agent prove it manually tested its own work before it can say "done".

The point is not browser automation by itself. Agents can already click buttons, run curl, and take screenshots. The missing piece is the contract:

  • what had to be tested
  • what counted as proof
  • what blocked completion

Proctor creates that contract, records evidence against it, and refuses completion until the required proof exists.

The CLI is intentionally long-form. A fresh agent should be able to start with proctor --help, learn the workflow, and complete a run without reading Proctor's source.

Install

brew tap nclandrei/tap
brew install nclandrei/tap/proctor

Tagged GitHub releases publish prebuilt Homebrew archives from nclandrei/proctor and refresh nclandrei/homebrew-tap/Formula/proctor.rb automatically. The release workflow expects a HOMEBREW_TAP_TOKEN GitHub secret with push access to nclandrei/homebrew-tap.

Who This Is For

Proctor is agent-agnostic. It is meant to work from:

  • Codex
  • Claude Code
  • any other coding agent with shell access

It does not assume one agent runtime, one browser driver, or one editor.

What Proctor Does

Proctor is not:

  • a browser automation framework
  • an iOS automation framework
  • a hosted QA platform

Proctor is:

  • a manual-test contract generator
  • an evidence recorder
  • a completion gate
  • a shareable reporting layer

The Human Prompt

This is the kind of prompt Proctor is designed for:

We just implemented the new authentication flow.
Use proctor --help to manually test it.

That prompt should be enough. The agent should not need extra explanation from the human.

Reading proctor --help is not the task. It is the entry point. The agent is still expected to inspect the current diff, identify the user-visible change, create the right contract, run the manual checks, and record real evidence.

Quick Start

1. Create the Contract

For user-visible web work, start here:

proctor start \
  --platform web \
  --feature "new authentication flow" \
  --url http://127.0.0.1:3000/login \
  --curl scenario \
  --curl-endpoint "happy-path=POST /api/login" \
  --curl-endpoint "failure-path=POST /api/login" \
  --curl-endpoint "Already signed-in users are redirected away from /login=GET /api/session" \
  --happy-path "Valid credentials redirect to the dashboard." \
  --failure-path "Invalid credentials show an error and keep the user on /login." \
  --edge-case "validation and malformed input=Bad email shows inline validation" \
  --edge-case "empty or missing input=Empty email and password show required-field errors" \
  --edge-case "retry or double-submit=Second submit does not create duplicate requests" \
  --edge-case "loading, latency, and race conditions=Button stays disabled while the request is pending" \
  --edge-case "network or server failure=500 response shows a retryable error state" \
  --edge-case "auth and session state=Already signed-in users are redirected away from /login" \
  --edge-case "refresh, back-navigation, and state persistence=Refresh preserves the authenticated state" \
  --edge-case "mobile or responsive behavior=Login form stays usable at mobile width" \
  --edge-case "accessibility and keyboard behavior=Enter submits from the password field; tab order stays correct" \
  --edge-case "any feature-specific risks=N/A: no extra feature-specific risks"

Whether curl evidence is required is decided per scenario. --curl scenario is the explicit risk-based mode: each --curl-endpoint entry binds one or more endpoints to a named scenario. --curl required remains available as shorthand for requiring curl on both the happy path and the failure path.
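
To make the scenario binding concrete, here is a minimal Python sketch that groups --curl-endpoint values by scenario name. It is an illustration, not Proctor's parser: the function name parse_curl_endpoints is hypothetical, and the sketch assumes each value holds a single endpoint and is split at the first "=".

```python
def parse_curl_endpoints(values):
    """Group 'scenario=METHOD /path' strings by scenario name.

    Assumption: one endpoint per value, split at the first '='.
    """
    by_scenario = {}
    for value in values:
        scenario, sep, endpoint = value.partition("=")
        if not sep or not scenario.strip() or not endpoint.strip():
            raise ValueError(f"expected 'scenario=METHOD /path', got {value!r}")
        by_scenario.setdefault(scenario.strip(), []).append(endpoint.strip())
    return by_scenario


# Values copied from the web example above.
endpoints = parse_curl_endpoints([
    "happy-path=POST /api/login",
    "failure-path=POST /api/login",
    "Already signed-in users are redirected away from /login=GET /api/session",
])
for scenario, bound in endpoints.items():
    print(f"{scenario}: {bound}")
```

Splitting at the first "=" keeps endpoints with query strings intact, at the cost of not supporting "=" inside a scenario name.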

If the flow is mostly client-side and there is no meaningful backend or protocol risk, skip curl with an explicit reason:

proctor start \
  --platform web \
  --feature "What Did I Just Watch finder" \
  --url http://127.0.0.1:4174/kimarite \
  --curl skip \
  --curl-skip-reason "Static client-side filter UI with no separate backend contract." \
  --happy-path "Selecting plain-language finish and approach clues narrows the library and updates the URL." \
  --failure-path "A user can back out with Not sure yet and return to the broad library without broken state." \
  --edge-case "validation and malformed input=N/A: no freeform input, only preset links" \
  --edge-case "empty or missing input=Starting with no clue selected still shows the broad library and finder." \
  --edge-case "retry or double-submit=N/A: idempotent client-side link navigation only" \
  --edge-case "loading, latency, and race conditions=N/A: static client-side filter state with no async mutation" \
  --edge-case "network or server failure=N/A: no feature-specific backend dependency" \
  --edge-case "auth and session state=N/A: public catalog page" \
  --edge-case "refresh, back-navigation, and state persistence=Direct filtered URL preserves the selected clue state on load." \
  --edge-case "mobile or responsive behavior=Filtered finder state remains readable and usable on mobile." \
  --edge-case "accessibility and keyboard behavior=N/A: this pass is visual only" \
  --edge-case "any feature-specific risks=N/A: reset behavior is covered by the main failure path"

For iOS work, create an iOS contract instead:

proctor start \
  --platform ios \
  --feature "reader library relaunch" \
  --ios-scheme Pagena \
  --ios-bundle-id com.example.pagena \
  --ios-simulator "iPhone 16 Pro" \
  --curl skip \
  --curl-skip-reason "UI-only iOS verification for this pass." \
  --happy-path "Launching the app lands on the library screen." \
  --failure-path "Missing content shows a visible recovery state instead of a blank screen." \
  --edge-case "validation and malformed input=N/A: no freeform input in this flow" \
  --edge-case "empty or missing input=N/A: no required input in this flow" \
  --edge-case "retry or double-submit=N/A: no repeated mutation in this flow" \
  --edge-case "loading, latency, and race conditions=Loading placeholder settles once without duplicate content." \
  --edge-case "network or server failure=Offline launch shows a recoverable empty state." \
  --edge-case "auth and session state=N/A: anonymous browsing only" \
  --edge-case "app lifecycle, relaunch, and state persistence=Foregrounding the app keeps the same selected title." \
  --edge-case "device traits, orientation, and layout=Library remains readable on the target simulator." \
  --edge-case "accessibility, dynamic type, and keyboard behavior=N/A: this pass is visual only" \
  --edge-case "any feature-specific risks=N/A: no extra feature-specific risks"

For CLI and TUI work, create a CLI contract instead:

proctor start \
  --platform cli \
  --feature "magellan prompt inspection flow" \
  --cli-command "magellan prompts inspect onboarding" \
  --happy-path "Inspecting a known prompt shows the body and metadata in a readable terminal layout." \
  --failure-path "Inspecting an unknown prompt exits non-zero and prints a clear error." \
  --edge-case "invalid or malformed input=Broken prompt syntax shows a validation error without a panic" \
  --edge-case "missing required args, files, config, or env=Missing prompt slug explains what argument is required" \
  --edge-case "retry, rerun, and idempotency=Running the same inspect command twice gives the same result" \
  --edge-case "long-running output, streaming, or progress state=N/A: single-shot command with immediate output" \
  --edge-case "interrupts, cancellation, and signals=N/A: command exits immediately" \
  --edge-case "tty, pipe, and non-interactive behavior=Piped output still renders the inspected prompt body without ANSI garbage" \
  --edge-case "terminal layout, wrapping, and resize behavior=The inspected prompt still wraps cleanly in a narrow terminal" \
  --edge-case "keyboard navigation and shortcut behavior=N/A: single-shot command with no in-app key handling" \
  --edge-case "state, config, and persistence across reruns=N/A: read-only inspection command" \
  --edge-case "stderr, exit codes, and partial failure reporting=Unknown prompt returns a non-zero exit code and prints the error on stderr" \
  --edge-case "any feature-specific risks=N/A: no extra feature-specific risks"

2. Capture Real Evidence

Proctor does not drive the browser for you. Use your own browser tooling to produce:

  • a desktop screenshot
  • a mobile screenshot
  • a report.json file with the final URL and issue counts for both desktop and mobile

Proctor only needs a small report shape:

{
  "desktop": {
    "finalUrl": "http://127.0.0.1:3000/dashboard",
    "issues": {
      "consoleErrors": 0,
      "consoleWarnings": 0,
      "pageErrors": 0,
      "failedRequests": 0,
      "httpErrors": 0
    }
  },
  "mobile": {
    "finalUrl": "http://127.0.0.1:3000/dashboard",
    "issues": {
      "consoleErrors": 0,
      "consoleWarnings": 0,
      "pageErrors": 0,
      "failedRequests": 0,
      "httpErrors": 0
    }
  }
}

consoleWarnings is part of the browser report schema so the run keeps the full browser-health picture. By default, though, Proctor only blocks completion on console errors, page errors, failed requests, and HTTP errors. Add an explicit assertion such as console_warnings = 0 when warnings should fail the run too.

If your browser tool does not emit this exact file, that is still fine. Capture the real browser session data, then write a tiny report.json file with this shape and attach that to Proctor.
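
If you need to write that file by hand, a small helper along these lines is enough. This is a sketch, not part of Proctor: viewport_entry and write_report are hypothetical names, and the counts must come from your real browser session. The helper refuses to default a missing count to zero so the file only contains values you actually observed.

```python
import json
from pathlib import Path

# Issue keys taken from the report shape shown above.
ISSUE_KEYS = ("consoleErrors", "consoleWarnings", "pageErrors",
              "failedRequests", "httpErrors")


def viewport_entry(final_url, issues):
    """Build one 'desktop' or 'mobile' entry; every count must be supplied."""
    missing = [key for key in ISSUE_KEYS if key not in issues]
    if missing:
        raise ValueError(f"missing observed counts: {missing}")
    return {"finalUrl": final_url,
            "issues": {key: int(issues[key]) for key in ISSUE_KEYS}}


def write_report(path, report):
    """Write the report as JSON and return the path."""
    Path(path).write_text(json.dumps(report, indent=2))
    return Path(path)


# Placeholder counts for illustration; use the numbers from your own session.
observed = {"consoleErrors": 0, "consoleWarnings": 2, "pageErrors": 0,
            "failedRequests": 0, "httpErrors": 0}
report = {
    "desktop": viewport_entry("http://127.0.0.1:3000/dashboard", observed),
    "mobile": viewport_entry("http://127.0.0.1:3000/dashboard", observed),
}
report_json = json.dumps(report, indent=2)
print(report_json)
```

Call write_report with an absolute path when you are ready to attach the file.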

For CLI and TUI work, Proctor expects evidence from a real terminal session. Using a real terminal app plus tmux (or an equivalent persistent multiplexer) is preferred but not required: it lets the agent keep one session alive, drive keyboard input deterministically, capture pane output, and take screenshots. Whichever tooling you use:

  • run the CLI in a real terminal session
  • capture at least one screenshot
  • capture the terminal transcript from that session
  • record the actual command you exercised
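
If you do use tmux, the transcript can be captured with tmux capture-pane. The Python sketch below only covers the transcript, not the screenshot. It assumes tmux is installed; the pane target proctor-demo:0.0 and the helper names are placeholders for illustration.

```python
import subprocess
from pathlib import Path


def capture_pane_argv(target, include_history=True):
    """Build the tmux command that prints a pane's contents to stdout."""
    argv = ["tmux", "capture-pane", "-p", "-t", target]
    if include_history:
        # "-S -" starts the capture at the beginning of the scrollback history.
        argv += ["-S", "-"]
    return argv


def save_transcript(target, out_path):
    """Run tmux and save the pane output as the transcript artifact."""
    result = subprocess.run(capture_pane_argv(target), check=True,
                            capture_output=True, text=True)
    Path(out_path).write_text(result.stdout)
    return Path(out_path)


# Only builds the command here; save_transcript needs a live tmux session.
argv = capture_pane_argv("proctor-demo:0.0")
print(" ".join(argv))
```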

3. Attach Browser Evidence

Each record browser command attaches one browser run to one scenario:

proctor record browser \
  --scenario happy-path \
  --session auth-browser-1 \
  --report /abs/path/report.json \
  --screenshot desktop=/abs/path/desktop.png \
  --screenshot mobile=/abs/path/mobile.png \
  --assert 'final_url contains /dashboard' \
  --assert 'desktop_screenshot = true' \
  --assert 'mobile_screenshot = true'

You can reuse one browser report for multiple scenarios if it genuinely proves each one.

4. Capture And Attach Real iOS Evidence

Proctor does not boot the simulator for you. Use your own simulator tooling to build, launch, screenshot, and inspect logs. Proctor only needs a screenshot plus a small ios-report.json file:

{
  "simulator": {
    "name": "iPhone 16 Pro",
    "runtime": "iOS 18.2"
  },
  "app": {
    "bundleId": "com.example.pagena",
    "screen": "Library",
    "state": "foreground"
  },
  "issues": {
    "launchErrors": 0,
    "crashes": 0,
    "fatalLogs": 0
  }
}

Then record that evidence against the scenario:

proctor record ios \
  --scenario happy-path \
  --session pagena-library-1 \
  --report /abs/path/ios-report.json \
  --screenshot library=/abs/path/library.png \
  --assert 'screen contains Library' \
  --assert 'bundle_id = com.example.pagena' \
  --assert 'app_launch = true'

One simulator report can be reused for multiple scenarios if it genuinely proves each one.
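
Before attaching an ios-report.json, it can help to check that it has the fields shown in the example above. The sketch below does only that. Whether Proctor itself requires every one of these fields is not stated here, so treat the field list as an assumption drawn from the example; missing_fields is a hypothetical helper.

```python
# Field paths taken from the ios-report.json example above.
EXPECTED_FIELDS = (
    ("simulator", "name"), ("simulator", "runtime"),
    ("app", "bundleId"), ("app", "screen"), ("app", "state"),
    ("issues", "launchErrors"), ("issues", "crashes"), ("issues", "fatalLogs"),
)


def missing_fields(report):
    """Return dotted paths of example fields that the report lacks."""
    missing = []
    for section, key in EXPECTED_FIELDS:
        if key not in report.get(section, {}):
            missing.append(f"{section}.{key}")
    return missing


# Same values as the example, with app.state left out on purpose.
draft = {
    "simulator": {"name": "iPhone 16 Pro", "runtime": "iOS 18.2"},
    "app": {"bundleId": "com.example.pagena", "screen": "Library"},
    "issues": {"launchErrors": 0, "crashes": 0, "fatalLogs": 0},
}
gaps = missing_fields(draft)
print(gaps)
```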

5. Attach Real CLI Evidence

After capturing the terminal session described in step 2, record the terminal evidence against the scenario:

proctor record cli \
  --scenario happy-path \
  --session magellan-cli-1 \
  --command "magellan prompts inspect onboarding" \
  --transcript /abs/path/pane.txt \
  --screenshot terminal=/abs/path/terminal.png \
  --exit-code 0 \
  --assert 'output contains onboarding' \
  --assert 'exit_code = 0' \
  --assert 'screenshot = true'

6. Attach HTTP Evidence When Required

When a scenario requires curl, wrap the real command:

proctor record curl \
  --scenario failure-path \
  --assert 'status = 401' \
  --assert 'body contains invalid' \
  --assert 'header.content-type contains application/json' \
  -- \
  curl -si -X POST http://127.0.0.1:3000/api/login \
    -H 'content-type: application/json' \
    -d '{"email":"demo@example.com","password":"wrong"}'
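
The three assertions above target the status, the body, and one header of the response. As an illustration of where those values sit in curl -si output, here is a minimal Python sketch that splits such a transcript. It is not Proctor's parser, it handles a single response block only (no redirects or interim responses), and the sample response is invented for the example.

```python
def parse_curl_si(transcript):
    """Split `curl -si` output into (status, headers, body)."""
    text = transcript.replace("\r\n", "\n")
    # Headers and body are separated by the first blank line.
    head, _, body = text.partition("\n\n")
    lines = head.split("\n")
    # Status line looks like "HTTP/1.1 401 Unauthorized".
    status = int(lines[0].split()[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


# Invented sample, not captured from a real server.
sample = (
    "HTTP/1.1 401 Unauthorized\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"error":"invalid credentials"}'
)
status, headers, body = parse_curl_si(sample)
print(status, headers["content-type"], body)
```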

7. Check Coverage And Finish

proctor status
proctor done
proctor report

proctor done is the real completion gate. If it fails, the run is not complete.

What Counts As Proof

Freehand notes do not count.

For browser evidence, Proctor expects:

  • a session id string
  • desktop and mobile screenshots across the run
  • a report JSON artifact
  • at least one passing assertion

The report JSON can be synthesized from real browser-session output. It does not have to come from one specific browser helper.

For web runs, mobile proof is mandatory. Even when the primary scenario is desktop-first, proctor done still requires at least one desktop screenshot and at least one mobile screenshot somewhere in the recorded browser evidence.
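
The rule is run-wide rather than per scenario, which the following sketch tries to capture. The record layout (a "screenshots" mapping keyed by the labels passed to --screenshot) and the name missing_viewports are assumptions for illustration; Proctor's stored evidence format may differ.

```python
def missing_viewports(evidence_records, required=("desktop", "mobile")):
    """Return required screenshot labels not seen anywhere in the run."""
    seen = set()
    for record in evidence_records:
        seen.update(record.get("screenshots", {}))
    return [label for label in required if label not in seen]


# Hypothetical records: each scenario contributes one viewport.
records = [
    {"scenario": "happy-path",
     "screenshots": {"desktop": "/abs/path/desktop.png"}},
    {"scenario": "failure-path",
     "screenshots": {"mobile": "/abs/path/mobile.png"}},
]
gaps = missing_viewports(records)
gaps_desktop_only = missing_viewports(records[:1])
print(gaps, gaps_desktop_only)
```

Together the two records satisfy the rule even though neither has both viewports; the first record alone does not.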

For iOS evidence, Proctor expects:

  • a simulator session id string
  • at least one simulator screenshot across the run
  • an ios-report.json artifact
  • at least one passing assertion

The iOS report can be synthesized from real simulator-session output. It does not have to come from one specific helper.

For CLI evidence, Proctor expects:

  • a terminal session id string
  • at least one terminal screenshot across the run
  • a transcript artifact from that session
  • the actual exercised command
  • at least one passing assertion

For curl evidence, Proctor expects:

  • a real wrapped command
  • the captured transcript
  • at least one passing assertion

Provenance alone is not enough. Evidence must also include scenario-specific assertions.

curl is gated per scenario, not per endpoint. Endpoints are recorded on each scenario so the contract can say which HTTP surfaces carry risk, but proctor done still evaluates evidence scenario-by-scenario.
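
One way to read that rule is sketched below: a scenario that lists endpoints needs at least one passing curl record for the scenario as a whole, not one per endpoint. The data shapes and the name scenarios_missing_curl are hypothetical, and this is an interpretation of the rule rather than Proctor's implementation.

```python
def scenarios_missing_curl(contract_endpoints, passing_curl_records):
    """List scenarios that carry endpoints but have no passing curl record.

    contract_endpoints: scenario name -> endpoints listed in the contract
    passing_curl_records: scenario name -> number of passing curl records
    """
    return sorted(
        scenario
        for scenario, endpoints in contract_endpoints.items()
        if endpoints and passing_curl_records.get(scenario, 0) == 0
    )


contract_endpoints = {
    "happy-path": ["POST /api/login"],
    "failure-path": ["POST /api/login"],
}
passing_curl_records = {"failure-path": 1}
still_missing = scenarios_missing_curl(contract_endpoints, passing_curl_records)
print(still_missing)
```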

Browser Assertions

Examples:

  • final_url contains /dashboard
  • final_url = http://127.0.0.1:3000/login
  • console_errors = 0
  • console_warnings = 0
  • failed_requests = 0
  • http_errors = 1
  • desktop_screenshot = true
  • mobile_screenshot = true
  • mobile.final_url contains /login
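
All of these follow a "key operator value" pattern with = and contains as the operators. The Python sketch below evaluates that pattern against a flat dictionary of facts. It illustrates the pattern only: how Proctor maps keys such as final_url onto the report is not specified here, so the facts dictionary is a hypothetical stand-in.

```python
def _as_text(value):
    """Render a fact the way it would appear in an assertion string."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def evaluate(assertion, facts):
    """Evaluate '<key> = <value>' or '<key> contains <value>'."""
    key, op, expected = assertion.split(" ", 2)
    actual = _as_text(facts[key])
    if op == "=":
        return actual == expected
    if op == "contains":
        return expected in actual
    raise ValueError(f"unsupported operator: {op!r}")


# Hypothetical facts gathered from one browser run.
facts = {
    "final_url": "http://127.0.0.1:3000/dashboard",
    "console_errors": 0,
    "desktop_screenshot": True,
}
results = {
    assertion: evaluate(assertion, facts)
    for assertion in (
        "final_url contains /dashboard",
        "console_errors = 0",
        "desktop_screenshot = true",
        "final_url contains /login",
    )
}
for assertion, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {assertion}")
```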

If you do not explicitly assert browser health counts, Proctor adds implicit zero-issue assertions for the blocking browser-health metrics:

  • console errors
  • page errors
  • failed requests
  • HTTP errors

Console warnings are deliberately excluded from that default gate. Proctor still records consoleWarnings in the report so you can inspect them later or make them blocking with an explicit assertion such as console_warnings = 0.
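
The default gate can be pictured as filling in whatever health assertions you left out. The sketch below assumes the fill-in happens metric by metric and that page errors use the key page_errors by analogy with the other examples; both are assumptions, and with_implicit_assertions is a hypothetical name.

```python
# Blocking metrics from the list above; page_errors is an assumed key name.
BLOCKING_METRICS = ("console_errors", "page_errors",
                    "failed_requests", "http_errors")


def with_implicit_assertions(explicit):
    """Append '<metric> = 0' for each blocking metric not asserted explicitly."""
    asserted = {assertion.split(" ", 1)[0] for assertion in explicit}
    implicit = [f"{metric} = 0" for metric in BLOCKING_METRICS
                if metric not in asserted]
    return list(explicit) + implicit


# A failure-path run that expects exactly one HTTP error.
effective = with_implicit_assertions([
    "final_url contains /login",
    "http_errors = 1",
])
for assertion in effective:
    print(assertion)
```

console_warnings never appears in the output unless you assert it yourself.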

iOS Assertions

Examples:

  • screen contains Library
  • bundle_id = com.example.pagena
  • simulator contains iPhone 16 Pro
  • runtime contains iOS
  • state = foreground
  • app_launch = true
  • launch_errors = 0
  • crashes = 0
  • fatal_logs = 0
  • screenshot = true

If you do not explicitly assert iOS health counts, Proctor adds implicit zero-issue assertions for:

  • launch errors
  • crashes
  • fatal logs

CLI Assertions

Examples:

  • output contains onboarding
  • output contains prompt not found
  • command contains magellan
  • session contains cli-session
  • tool = terminal-session
  • exit_code = 0
  • screenshot = true

Edge Cases Are First-Class

Proctor does not accept "give me two edge cases".

Each category must be covered by either:

  • one or more concrete scenarios
  • or N/A with a reason

Current categories:

Web:

  • validation and malformed input
  • empty or missing input
  • retry or double-submit
  • loading, latency, and race conditions
  • network or server failure
  • auth and session state
  • refresh, back-navigation, and state persistence
  • mobile or responsive behavior
  • accessibility and keyboard behavior
  • any feature-specific risks

iOS:

  • validation and malformed input
  • empty or missing input
  • retry or double-submit
  • loading, latency, and race conditions
  • network or server failure
  • auth and session state
  • app lifecycle, relaunch, and state persistence
  • device traits, orientation, and layout
  • accessibility, dynamic type, and keyboard behavior
  • any feature-specific risks

CLI:

  • invalid or malformed input
  • missing required args, files, config, or env
  • retry, rerun, and idempotency
  • long-running output, streaming, or progress state
  • interrupts, cancellation, and signals
  • tty, pipe, and non-interactive behavior
  • terminal layout, wrapping, and resize behavior
  • keyboard navigation and shortcut behavior
  • state, config, and persistence across reruns
  • stderr, exit codes, and partial failure reporting
  • any feature-specific risks
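
Before running proctor start, you can check a draft set of --edge-case values against the category list. The sketch below uses the web categories and the "category=description" form from the Quick Start examples. uncovered_categories is a hypothetical helper, and the N/A handling is an assumption about what counts as "N/A with a reason".

```python
WEB_CATEGORIES = [
    "validation and malformed input",
    "empty or missing input",
    "retry or double-submit",
    "loading, latency, and race conditions",
    "network or server failure",
    "auth and session state",
    "refresh, back-navigation, and state persistence",
    "mobile or responsive behavior",
    "accessibility and keyboard behavior",
    "any feature-specific risks",
]


def uncovered_categories(values, categories):
    """Return categories with neither a scenario nor an N/A reason."""
    covered = set()
    for value in values:
        category, sep, detail = value.partition("=")
        category, detail = category.strip(), detail.strip()
        if not sep or category not in categories:
            raise ValueError(f"unknown or malformed edge case: {value!r}")
        if detail.upper().startswith("N/A"):
            # "N/A" alone is rejected; a reason must follow it.
            if not detail[3:].lstrip(": ").strip():
                raise ValueError(f"N/A needs a reason: {value!r}")
        elif not detail:
            raise ValueError(f"empty scenario: {value!r}")
        covered.add(category)
    return [category for category in categories if category not in covered]


draft = [
    "auth and session state=Already signed-in users are redirected away from /login",
    "any feature-specific risks=N/A: no extra feature-specific risks",
]
missing = uncovered_categories(draft, WEB_CATEGORIES)
print(f"{len(missing)} categories still uncovered")
```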

Commands

  • proctor --help: The long-form agent onboarding surface.
  • proctor start: Creates the verification contract.
  • proctor status: Shows what passes and what still fails.
  • proctor record browser: Attaches browser evidence to one scenario.
  • proctor record cli: Attaches terminal evidence to one scenario.
  • proctor record ios: Attaches iOS simulator evidence to one scenario.
  • proctor record curl: Wraps and records one real HTTP command for one scenario.
  • proctor done: Fails until the contract is satisfied.
  • proctor report: Prints the generated output paths.

Use subcommand help for exact flags:

proctor start --help
proctor record browser --help
proctor record cli --help
proctor record ios --help
proctor record curl --help
proctor done --help

Outputs

Artifacts live outside the repo by default:

~/.proctor/runs/<repo-slug>/<run-id>/

Important files:

  • run.json
  • evidence.jsonl
  • contract.md
  • report.html
  • artifacts/
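
For scripting around a run, the layout above can be expressed as paths. In this sketch the repo slug and run id are placeholders, as is the home directory; proctor report remains the reliable way to get the real paths for a run.

```python
from pathlib import Path

# File names from the list above.
RUN_FILES = ("run.json", "evidence.jsonl", "contract.md", "report.html")


def run_paths(repo_slug, run_id, home=None):
    """Map each run file (plus 'artifacts') to its path under ~/.proctor/runs."""
    root = (home or Path.home()) / ".proctor" / "runs" / repo_slug / run_id
    paths = {name: root / name for name in RUN_FILES}
    paths["artifacts"] = root / "artifacts"
    return paths


# Placeholder slug, run id, and home directory for illustration.
paths = run_paths("my-repo", "run-001", home=Path("/home/agent"))
for name, path in paths.items():
    print(f"{name}: {path.as_posix()}")
```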

contract.md and report.html are derived from the recorded evidence. They are human-facing outputs, not the source of truth.

report.html is always rendered in dark mode and is self-contained: its styles, screenshot previews, and log transcripts are embedded in the file. Readers can enlarge screenshots inline, and logs stay collapsed until the reader expands them.

Current Scope

Current supported surfaces:

  • web browser evidence with desktop and mobile proof
  • CLI and TUI evidence with screenshots plus transcripts
  • iOS simulator evidence with screenshots plus simulator/app report metadata
  • risk-based curl evidence when backend or protocol verification matters
  • curl risk is modeled per scenario, with scenario-level endpoint lists and scenario-level completion gates

Development

go test ./...
go run . --help

If you are changing the browser reporting or CLI help, rerun a fresh-agent test. The target bar is simple:

  • a new agent should start with proctor --help
  • it should not need to read Proctor's source
  • it should be able to create a run, record evidence, and finish with proctor done
