Adds two new commands that close the gap between "the agent claims it looked at the screenshot" and "an AI model actually examined it":

- `proctor log` records step-by-step verification actions with screenshots, building a chronological visual diary in screenshot-log.jsonl.
- `proctor analyze` sends screenshots to the Claude vision API, which describes what it sees, compares against scenario requirements, lists findings/concerns, and judges whether the visual state matches the scenario's intent. Results are stored in analysis.jsonl.

Both commands are optional and do not gate `proctor done`. They enrich the audit trail and make verification evidence machine-readable rather than relying solely on the agent's freeform text notes.

Includes comprehensive help text, unit tests with mock HTTP server for the Claude API, screenshot log ledger tests, and CLI integration tests.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
This reverts commit 918ee6d.
New command: proctor log enforces the Showboat pattern where the agent
narrates its own verification at every step. For each step, the agent
must provide:
  --action       what it did (navigated, clicked, typed, etc.)
  --screenshot   the screenshot it took of the result
  --observation  what it actually sees in the screenshot (the agent
                 looks at it with its own vision and describes it)
  --comparison   how what it sees compares to the scenario requirements
All three text fields enforce a 20-character minimum, same as pre-notes
and verify observations. No external API calls - the agent provides the
eyes. Proctor enforces the structure.
Steps are stored in screenshot-log.jsonl as an append-only ledger with
file locking, following the same pattern as notes.jsonl and captures.jsonl.
Steps auto-increment per (scenario, session) pair.
The log is optional for proctor done to pass, but produces a richer
audit trail than record+verify alone.
https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
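The per-(scenario, session) step numbering described above could be sketched roughly as follows. This is an illustration only: the `LogEntry` struct, its JSON field names, and the `nextStep` and `appendEntry` helpers are hypothetical and not taken from the PR, and the file locking mentioned in the commit message is left out so the sketch can run on an in-memory ledger.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
)

// LogEntry is a hypothetical ledger row; the real entry type is not shown in this PR text.
type LogEntry struct {
	Scenario string `json:"scenario"`
	Session  string `json:"session"`
	Step     int    `json:"step"`
	Action   string `json:"action"`
}

// nextStep scans the JSONL ledger and returns one more than the highest
// step already recorded for the given (scenario, session) pair.
func nextStep(ledger []byte, scenario, session string) (int, error) {
	highest := 0
	sc := bufio.NewScanner(bytes.NewReader(ledger))
	for sc.Scan() {
		line := bytes.TrimSpace(sc.Bytes())
		if len(line) == 0 {
			continue
		}
		var e LogEntry
		if err := json.Unmarshal(line, &e); err != nil {
			return 0, fmt.Errorf("bad ledger line: %w", err)
		}
		if e.Scenario == scenario && e.Session == session && e.Step > highest {
			highest = e.Step
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return highest + 1, nil
}

// appendEntry assigns the next step number and appends one row.
// Existing rows are never rewritten, which keeps the ledger append-only.
func appendEntry(ledger []byte, e LogEntry) ([]byte, error) {
	step, err := nextStep(ledger, e.Scenario, e.Session)
	if err != nil {
		return nil, err
	}
	e.Step = step
	row, err := json.Marshal(e)
	if err != nil {
		return nil, err
	}
	return append(append(ledger, row...), '\n'), nil
}

func main() {
	var ledger []byte
	for _, e := range []LogEntry{
		{Scenario: "login", Session: "s1", Action: "navigated to /login"},
		{Scenario: "login", Session: "s1", Action: "clicked Sign In"},
		{Scenario: "login", Session: "s2", Action: "navigated to /login"},
	} {
		var err error
		if ledger, err = appendEntry(ledger, e); err != nil {
			panic(err)
		}
	}
	fmt.Print(string(ledger))
}
```

In a real implementation the scan and the append would need to happen under the same file lock, otherwise two concurrent writers could be handed the same step number.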
The VerifyEvidence doc comment got partially overwritten when LogStepOptions was inserted. Remove the dangling fragment. https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
Log steps now appear in both the markdown contract and the HTML report under each scenario, between pre-notes and evidence. Each step shows the action, observation, comparison, and an inline screenshot thumbnail with lightbox zoom. The writeReports function now loads screenshot-log.jsonl and passes entries through to RenderReports and buildScenarioHTMLReports. https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
The Evaluate function now loads screenshot-log.jsonl and checks that every scenario has at least one log entry. Scenarios without log steps get LogOK=false with an issue message directing the agent to run proctor log. The done gate blocks and status shows the gap. All existing tests updated to call logStepsForAll or the CLI equivalent before done, matching the new mandatory flow. https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
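As a rough illustration of the completeness check described above, the sketch below marks a scenario as `LogOK` only when at least one log entry references it. Only the `LogOK` name comes from the commit message; the `ScenarioStatus` struct, the `evaluateLogCoverage` and `doneAllowed` helpers, and the issue wording are assumptions made for the example.

```go
package main

import "fmt"

// ScenarioStatus is a hypothetical per-scenario result.
// Only the LogOK field name is taken from the commit message.
type ScenarioStatus struct {
	ScenarioID string
	LogOK      bool
	Issues     []string
}

// evaluateLogCoverage sets LogOK for each scenario that has at least one
// log entry and attaches an issue message to each scenario that has none.
func evaluateLogCoverage(scenarioIDs, loggedScenarioIDs []string) []ScenarioStatus {
	counts := make(map[string]int, len(loggedScenarioIDs))
	for _, id := range loggedScenarioIDs {
		counts[id]++
	}
	statuses := make([]ScenarioStatus, 0, len(scenarioIDs))
	for _, id := range scenarioIDs {
		st := ScenarioStatus{ScenarioID: id, LogOK: counts[id] > 0}
		if !st.LogOK {
			st.Issues = append(st.Issues,
				fmt.Sprintf("scenario %s has no log steps; run proctor log", id))
		}
		statuses = append(statuses, st)
	}
	return statuses
}

// doneAllowed reports whether every scenario passed the log check.
func doneAllowed(statuses []ScenarioStatus) bool {
	for _, st := range statuses {
		if !st.LogOK {
			return false
		}
	}
	return true
}

func main() {
	statuses := evaluateLogCoverage(
		[]string{"login", "checkout"},
		[]string{"login", "login"},
	)
	for _, st := range statuses {
		fmt.Println(st.ScenarioID, st.LogOK, st.Issues)
	}
	fmt.Println("done allowed:", doneAllowed(statuses))
}
```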
LogStep now validates screenshots with the same gates as record:

- Minimum 10KB size (rejects placeholder files)
- Maximum 30 minutes old (rejects stale screenshots)
- SHA256 duplicate detection across scenarios

Tests added for tiny and stale screenshot rejection.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
All screenshot validation paths (RecordBrowser, RecordIOS, RecordCLI, RecordDesktop, LogStep) now check file headers for known image format magic bytes before accepting the file:

- PNG: \x89PNG
- JPEG: \xFF\xD8\xFF
- GIF: GIF8
- WebP: RIFF....WEBP

Text files renamed to .png are rejected. Tests updated to write PNG magic bytes in all fixture screenshots. Added tests for format validation including isImageHeader unit tests.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
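A minimal sketch of a header check along these lines is shown below. The function name `isImageHeader` appears in the commit message, but its real signature and body are not shown in this PR text, so the version here is an assumption built from the four signatures listed above.

```go
package main

import (
	"bytes"
	"fmt"
)

// isImageHeader reports whether b starts with one of the four signatures
// named in the commit message. The signature of this function is assumed.
func isImageHeader(b []byte) bool {
	switch {
	case bytes.HasPrefix(b, []byte("\x89PNG")):
		return true
	case bytes.HasPrefix(b, []byte("\xFF\xD8\xFF")):
		return true
	case bytes.HasPrefix(b, []byte("GIF8")):
		return true
	case len(b) >= 12 &&
		bytes.Equal(b[0:4], []byte("RIFF")) &&
		bytes.Equal(b[8:12], []byte("WEBP")):
		// Bytes 4..7 hold the RIFF chunk size and are not compared.
		return true
	}
	return false
}

func main() {
	fmt.Println(isImageHeader([]byte("\x89PNG\r\n\x1a\n")))
	fmt.Println(isImageHeader([]byte("just some text renamed to .png")))
}
```

Checking only the leading bytes does not prove the file decodes as an image; it only rules out the cheap case of a text file given an image extension.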
…ase rejection

Observations (proctor log --observation, --comparison) and verify notes now enforce stricter quality gates:

- Minimum length raised from 20 to 40 characters (pre-notes stay at 20)
- Must contain at least 4 distinct words (blocks "aaa aaa aaa..." padding)
- Exact vague filler phrases rejected: "looks good", "as expected", "no issues", "seems fine", "everything works", "lgtm", etc.

Good: "login form with email input, password field, and blue Sign In button"
Bad: "looks good" / "as expected" / "the page looks correct"

Help text updated with examples of good vs bad observations. Tests added for validateObservationQuality, distinctWords, vague phrase rejection, and low word count rejection.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 121e6a7943
        return ScreenshotLogEntry{}, fmt.Errorf("unknown scenario: %s", scenarioID)
    }

    artifact, err := store.CopyArtifact(run, surface, scenarioID, "log-step", screenshotPath)
Validate log surface before writing screenshot artifacts
LogStep forwards opts.Surface directly into store.CopyArtifact, and that value is used as a filesystem path segment. Because there is no whitelist/sanitization here, a --surface containing traversal components (for example ../...) can make createArtifactFile write outside the intended artifacts/<surface>/<scenario> tree, which can pollute other run directories and store traversal paths in ledger entries. Restrict surface to known constants (browser, ios, cli, desktop) before calling CopyArtifact.
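One way to implement the suggested restriction is an allowlist check run before the surface value reaches any path-building code. The four surface names come from the review comment; the `validateSurface` helper and its error wording are hypothetical and not part of the PR.

```go
package main

import "fmt"

// knownSurfaces lists the surface names given in the review comment.
var knownSurfaces = map[string]struct{}{
	"browser": {},
	"ios":     {},
	"cli":     {},
	"desktop": {},
}

// validateSurface rejects anything outside the allowlist, which also
// rules out traversal segments such as "../" without path inspection.
func validateSurface(surface string) error {
	if _, ok := knownSurfaces[surface]; !ok {
		return fmt.Errorf("unknown surface %q: want browser, ios, cli, or desktop", surface)
	}
	return nil
}

func main() {
	for _, s := range []string{"browser", "../../other-run", ""} {
		fmt.Printf("%q -> %v\n", s, validateSurface(s))
	}
}
```

An exact-match allowlist is simpler to reason about than sanitizing the string, because there is no need to enumerate every traversal or separator variant.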
    The log is optional for proctor done to pass, but agents that log their
    steps produce richer evidence and more trustworthy verification.
Fix log help text to reflect required completion gate
This help text says log entries are optional for proctor done, but the new evaluation logic now marks a scenario incomplete when no log entry exists. Users following this guidance will hit an unexpected done failure even though they followed documented steps, so the CLI documentation should state that at least one proctor log entry per scenario is required.
Summary
Adds the Showboat-style verification pattern: agents must now narrate each step of their verification with screenshots, observations, and comparisons — and proctor enforces the structure, quality, and completeness.
New command: `proctor log`

Records step-by-step verification with three mandatory fields per step:

- `--action`: what the agent did
- `--observation`: what the agent sees in the screenshot (it looks at it with its own vision)
- `--comparison`: how what it sees compares to the scenario requirements

Verification gates added

- `done`: every scenario needs at least 1 log step
- … `record`)

Report integration
Log entries now appear in both HTML and markdown reports under each scenario, between pre-notes and evidence, with inline screenshot thumbnails and lightbox zoom.
Enforced flow
Test plan
- `go test ./...` passes (all packages)
- `gofmt -l .` clean
- `validateObservationQuality`, `distinctWords`, vague phrase rejection
- `isImageHeader` magic bytes (PNG, JPEG, GIF, WebP, text, empty)
- `proctor log` (step numbering, flag validation, short observation rejection)

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo