Add screenshot logging, verification gates, and quality enforcement by nclandrei · Pull Request #1 · nclandrei/proctor

nclandrei · 2026-04-08T12:36:36Z

Summary

Adds the Showboat-style verification pattern: agents must now narrate each step of their verification with screenshots, observations, and comparisons, and proctor enforces the structure, quality, and completeness of that narration.

New command: `proctor log`

Records step-by-step verification. Each step takes a screenshot (--screenshot) plus three mandatory text fields:

--action — what the agent did
--observation — what the agent sees in the screenshot (it looks at it with its own vision)
--comparison — how what it sees compares to the scenario requirements

Verification gates added

Log entries mandatory at done — every scenario needs at least 1 log step
Screenshot format validation — PNG/JPEG/GIF/WebP magic bytes checked on all screenshots (record + log); text files renamed to .png are rejected
Screenshot freshness/size/duplicate checks on log step screenshots (same gates as record)
Observation quality enforcement — 40-char minimum, 4+ distinct words, vague filler phrases rejected ("looks good", "as expected", "lgtm", etc.)

Report integration

Log entries now appear in both HTML and markdown reports under each scenario, between pre-notes and evidence, with inline screenshot thumbnails and lightbox zoom.

Enforced flow

proctor start    → define scenarios + edge cases
proctor note     → commit intent (20+ chars)
proctor log      → each step: action + screenshot + observation + comparison (mandatory)
proctor record   → final evidence with assertions
proctor verify   → re-read screenshot, write observation (40+ chars, 4+ words)
proctor done     → blocks unless ALL gates pass

Test plan

go test ./... passes (all packages)
gofmt -l . clean
Unit tests for LogStep, ledger CRUD, step auto-increment, field validation
Unit tests for observation quality: validateObservationQuality, distinctWords, vague phrase rejection
Unit tests for isImageHeader magic bytes (PNG, JPEG, GIF, WebP, text, empty)
Unit tests for screenshot size/freshness rejection in log steps
CLI integration tests for proctor log (step numbering, flag validation, short observation rejection)
All existing integration tests updated with log steps (CLI, iOS, desktop, verify flows)

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo

Adds two new commands that close the gap between "the agent claims it looked at the screenshot" and "an AI model actually examined it":

- `proctor log` records step-by-step verification actions with screenshots, building a chronological visual diary in screenshot-log.jsonl.
- `proctor analyze` sends screenshots to the Claude vision API, which describes what it sees, compares against scenario requirements, lists findings/concerns, and judges whether the visual state matches the scenario's intent. Results are stored in analysis.jsonl.

Both commands are optional and do not gate `proctor done`. They enrich the audit trail and make verification evidence machine-readable rather than relying solely on the agent's freeform text notes.

Includes comprehensive help text, unit tests with mock HTTP server for the Claude API, screenshot log ledger tests, and CLI integration tests.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo

This reverts commit 918ee6d.

New command: proctor log enforces the Showboat pattern where the agent narrates its own verification at every step.

For each step, the agent must provide:

--action        what it did (navigated, clicked, typed, etc.)
--screenshot    the screenshot it took of the result
--observation   what it actually sees in the screenshot (the agent looks at it with its own vision and describes it)
--comparison    how what it sees compares to the scenario requirements

All three text fields enforce a 20-character minimum, same as pre-notes and verify observations. No external API calls - the agent provides the eyes. Proctor enforces the structure.

Steps are stored in screenshot-log.jsonl as an append-only ledger with file locking, following the same pattern as notes.jsonl and captures.jsonl. Steps auto-increment per (scenario, session) pair.

The log is optional for proctor done to pass, but produces a richer audit trail than record+verify alone.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
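The append-only ledger and per-(scenario, session) step numbering described in this commit can be sketched roughly as follows. This is a minimal illustration, not proctor's code: the `logEntry` struct, its JSON field names, and the `nextStep` and `appendEntry` helpers are all hypothetical, and file locking is left out to keep the sketch portable.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// logEntry is a hypothetical shape for one line of screenshot-log.jsonl.
// The field names are assumptions, not proctor's actual schema.
type logEntry struct {
	ScenarioID  string `json:"scenario_id"`
	SessionID   string `json:"session_id"`
	Step        int    `json:"step"`
	Action      string `json:"action"`
	Screenshot  string `json:"screenshot"`
	Observation string `json:"observation"`
	Comparison  string `json:"comparison"`
}

// nextStep returns the next step number for a (scenario, session) pair:
// one more than the highest step already recorded for that pair.
func nextStep(entries []logEntry, scenarioID, sessionID string) int {
	highest := 0
	for _, e := range entries {
		if e.ScenarioID == scenarioID && e.SessionID == sessionID && e.Step > highest {
			highest = e.Step
		}
	}
	return highest + 1
}

// appendEntry writes one entry as a single JSON line. Encoder.Encode
// terminates each value with a newline, which gives the JSONL framing.
func appendEntry(w io.Writer, e logEntry) error {
	return json.NewEncoder(w).Encode(e)
}

func main() {
	var ledger bytes.Buffer // stands in for a file opened in append mode
	var entries []logEntry

	for _, action := range []string{"navigated to /login", "clicked Sign In"} {
		e := logEntry{
			ScenarioID: "login",
			SessionID:  "session-1",
			Step:       nextStep(entries, "login", "session-1"),
			Action:     action,
		}
		if err := appendEntry(&ledger, e); err != nil {
			panic(err)
		}
		entries = append(entries, e)
	}
	fmt.Print(ledger.String())
}
```

In a real ledger the writer would be a file opened with `os.O_APPEND`, and the read-then-append sequence would need to run under the file lock so that two concurrent writers cannot claim the same step number.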

The VerifyEvidence doc comment got partially overwritten when LogStepOptions was inserted. Remove the dangling fragment. https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo

Log steps now appear in both the markdown contract and the HTML report under each scenario, between pre-notes and evidence. Each step shows the action, observation, comparison, and an inline screenshot thumbnail with lightbox zoom. The writeReports function now loads screenshot-log.jsonl and passes entries through to RenderReports and buildScenarioHTMLReports. https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo

The Evaluate function now loads screenshot-log.jsonl and checks that every scenario has at least one log entry. Scenarios without log steps get LogOK=false with an issue message directing the agent to run proctor log. The done gate blocks and status shows the gap. All existing tests updated to call logStepsForAll or the CLI equivalent before done, matching the new mandatory flow. https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
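As a rough illustration of the gate this commit describes, the sketch below marks each scenario with a `LogOK` flag and blocks completion when any scenario has no log entries. Only the `LogOK` name comes from the commit message; the `scenarioStatus` type, `evaluateLogGate`, `doneAllowed`, and the issue wording are hypothetical.

```go
package main

import "fmt"

// scenarioStatus is a hypothetical per-scenario result. LogOK mirrors the
// flag named in the commit message; the other fields are assumptions.
type scenarioStatus struct {
	ID     string
	LogOK  bool
	Issues []string
}

// evaluateLogGate checks that every scenario has at least one log entry.
// logCounts maps scenario ID to the number of log steps found for it.
func evaluateLogGate(scenarioIDs []string, logCounts map[string]int) []scenarioStatus {
	statuses := make([]scenarioStatus, 0, len(scenarioIDs))
	for _, id := range scenarioIDs {
		st := scenarioStatus{ID: id, LogOK: logCounts[id] > 0}
		if !st.LogOK {
			st.Issues = append(st.Issues,
				fmt.Sprintf("scenario %s has no log steps; run proctor log", id))
		}
		statuses = append(statuses, st)
	}
	return statuses
}

// doneAllowed reports whether the log gate passes for every scenario.
func doneAllowed(statuses []scenarioStatus) bool {
	for _, st := range statuses {
		if !st.LogOK {
			return false
		}
	}
	return true
}

func main() {
	statuses := evaluateLogGate(
		[]string{"login", "logout"},
		map[string]int{"login": 2},
	)
	for _, st := range statuses {
		fmt.Println(st.ID, st.LogOK, st.Issues)
	}
	fmt.Println("done allowed:", doneAllowed(statuses))
}
```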

LogStep now validates screenshots with the same gates as record:

- Minimum 10KB size (rejects placeholder files)
- Maximum 30 minutes old (rejects stale screenshots)
- SHA256 duplicate detection across scenarios

Tests added for tiny and stale screenshot rejection.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
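The three gates listed above could look something like the following sketch. It is written under several assumptions: `checkScreenshot` and its parameters are hypothetical, the function takes the file's bytes and age directly rather than a path so that it stays free of filesystem access, and a digest reused within the same scenario is allowed because the commit only mentions duplicates across scenarios.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

const (
	minScreenshotBytes = 10 * 1024        // 10KB minimum from the commit message
	maxScreenshotAge   = 30 * time.Minute // 30-minute freshness window
)

// checkScreenshot applies size, freshness, and duplicate checks.
// seen maps a SHA-256 hex digest to the scenario that first used it.
// Assumption: reusing a screenshot within one scenario is permitted.
func checkScreenshot(data []byte, age time.Duration, scenarioID string, seen map[string]string) error {
	if len(data) < minScreenshotBytes {
		return fmt.Errorf("screenshot is %d bytes, below the %d byte minimum", len(data), minScreenshotBytes)
	}
	if age > maxScreenshotAge {
		return fmt.Errorf("screenshot is %s old, above the %s maximum", age, maxScreenshotAge)
	}
	sum := sha256.Sum256(data)
	digest := hex.EncodeToString(sum[:])
	if owner, ok := seen[digest]; ok && owner != scenarioID {
		return fmt.Errorf("screenshot duplicates one already used by scenario %s", owner)
	}
	seen[digest] = scenarioID
	return nil
}

func main() {
	seen := map[string]string{}
	shot := make([]byte, 20*1024) // placeholder bytes standing in for image data

	fmt.Println(checkScreenshot(shot, 2*time.Minute, "login", seen))
	fmt.Println(checkScreenshot(shot, 2*time.Minute, "logout", seen))
	fmt.Println(checkScreenshot(shot[:512], 2*time.Minute, "login", seen))
	fmt.Println(checkScreenshot(shot, 45*time.Minute, "login", seen))
}
```

A caller would typically derive `age` from the file's modification time, for example `time.Since(info.ModTime())` after an `os.Stat` call.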

All screenshot validation paths (RecordBrowser, RecordIOS, RecordCLI, RecordDesktop, LogStep) now check file headers for known image format magic bytes before accepting the file:

- PNG: \x89PNG
- JPEG: \xFF\xD8\xFF
- GIF: GIF8
- WebP: RIFF....WEBP

Text files renamed to .png are rejected. Tests updated to write PNG magic bytes in all fixture screenshots. Added tests for format validation including isImageHeader unit tests.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
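A magic-byte check over the four signatures listed above might be written along these lines. The function name `isImageHeader` appears in the commit message, but the signature and body here are assumptions, not the code from this PR.

```go
package main

import (
	"bytes"
	"fmt"
)

// isImageHeader reports whether the leading bytes match one of the four
// signatures from the commit message: PNG, JPEG, GIF, or WebP.
func isImageHeader(header []byte) bool {
	switch {
	case bytes.HasPrefix(header, []byte("\x89PNG")):
		return true
	case bytes.HasPrefix(header, []byte("\xFF\xD8\xFF")):
		return true
	case bytes.HasPrefix(header, []byte("GIF8")):
		return true
	case len(header) >= 12 &&
		bytes.Equal(header[0:4], []byte("RIFF")) &&
		bytes.Equal(header[8:12], []byte("WEBP")):
		// WebP is a RIFF container: bytes 4..7 hold the chunk size,
		// so only the first and third four-byte groups are compared.
		return true
	}
	return false
}

func main() {
	fmt.Println(isImageHeader([]byte("\x89PNG\r\n\x1a\n"))) // PNG signature
	fmt.Println(isImageHeader([]byte("hello, world")))      // text renamed to .png
}
```

Reading only the first 12 bytes of the file is enough to feed this check, so the validation cost does not depend on the size of the screenshot.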

…ase rejection

Observations (proctor log --observation, --comparison) and verify notes now enforce stricter quality gates:

- Minimum length raised from 20 to 40 characters (pre-notes stay at 20)
- Must contain at least 4 distinct words (blocks "aaa aaa aaa..." padding)
- Exact vague filler phrases rejected: "looks good", "as expected", "no issues", "seems fine", "everything works", "lgtm", etc.

Good: "login form with email input, password field, and blue Sign In button"
Bad: "looks good" / "as expected" / "the page looks correct"

Help text updated with examples of good vs bad observations. Tests added for validateObservationQuality, distinctWords, vague phrase rejection, and low word count rejection.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
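The quality gates in this commit can be approximated with the sketch below. The names `validateObservationQuality` and `distinctWords` come from the commit message, but their signatures and bodies are assumptions, as are the case-insensitive comparison and the choice to count characters as runes. The phrase list holds only the six phrases quoted above; the real list is longer ("etc.").

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// vaguePhrases holds only the phrases quoted in the commit message.
var vaguePhrases = map[string]bool{
	"looks good":       true,
	"as expected":      true,
	"no issues":        true,
	"seems fine":       true,
	"everything works": true,
	"lgtm":             true,
}

// distinctWords counts unique whitespace-separated words, ignoring case.
func distinctWords(s string) int {
	seen := map[string]struct{}{}
	for _, w := range strings.Fields(strings.ToLower(s)) {
		seen[w] = struct{}{}
	}
	return len(seen)
}

// validateObservationQuality rejects exact filler phrases, text shorter
// than 40 characters, and text with fewer than 4 distinct words.
func validateObservationQuality(text string) error {
	trimmed := strings.TrimSpace(text)
	if vaguePhrases[strings.ToLower(trimmed)] {
		return errors.New("observation is a vague filler phrase; describe what is visible")
	}
	if n := len([]rune(trimmed)); n < 40 {
		return fmt.Errorf("observation has %d characters, need at least 40", n)
	}
	if n := distinctWords(trimmed); n < 4 {
		return fmt.Errorf("observation has %d distinct words, need at least 4", n)
	}
	return nil
}

func main() {
	samples := []string{
		"login form with email input, password field, and blue Sign In button",
		"looks good",
		"the page looks correct",
		"aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa",
	}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s, validateObservationQuality(s))
	}
}
```

In this sketch "the page looks correct" is rejected by the length gate rather than the phrase list, since the commit describes phrase matching as exact.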

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 121e6a7943

chatgpt-codex-connector · 2026-04-08T12:42:33Z

+		return ScreenshotLogEntry{}, fmt.Errorf("unknown scenario: %s", scenarioID)
+	}
+
+	artifact, err := store.CopyArtifact(run, surface, scenarioID, "log-step", screenshotPath)


Validate log surface before writing screenshot artifacts

LogStep forwards opts.Surface directly into store.CopyArtifact, and that value is used as a filesystem path segment. Because there is no whitelist/sanitization here, a --surface containing traversal components (for example ../...) can make createArtifactFile write outside the intended artifacts/<surface>/<scenario> tree, which can pollute other run directories and store traversal paths in ledger entries. Restrict surface to known constants (browser, ios, cli, desktop) before calling CopyArtifact.
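One way to apply the suggested restriction is an allow-list check that runs before any path is built. The sketch below is an assumption about how that could look: `validateSurface` and `knownSurfaces` are hypothetical names, and only the four surface values come from the review comment.

```go
package main

import "fmt"

// knownSurfaces lists the four surface constants named in the review.
var knownSurfaces = map[string]bool{
	"browser": true,
	"ios":     true,
	"cli":     true,
	"desktop": true,
}

// validateSurface rejects anything outside the allow-list, which also
// rules out traversal components such as "../" in the path segment.
func validateSurface(surface string) error {
	if !knownSurfaces[surface] {
		return fmt.Errorf("unknown surface %q: expected one of browser, ios, cli, desktop", surface)
	}
	return nil
}

func main() {
	for _, s := range []string{"browser", "../other-run", ""} {
		fmt.Printf("%q -> %v\n", s, validateSurface(s))
	}
}
```

An allow-list is preferable to stripping or escaping path separators here, since the set of valid surfaces is small and fixed.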


chatgpt-codex-connector · 2026-04-08T12:42:33Z

+The log is optional for proctor done to pass, but agents that log their
+steps produce richer evidence and more trustworthy verification.


Fix log help text to reflect required completion gate

This help text says log entries are optional for proctor done, but the new evaluation logic now marks a scenario incomplete when no log entry exists. Users following this guidance will hit an unexpected done failure even though they followed documented steps, so the CLI documentation should state that at least one proctor log entry per scenario is required.


claude added 9 commits April 8, 2026 11:08

Revert "Add screenshot logging and AI vision analysis commands"

db45362


Fix mangled comment in engine.go from earlier edit

878a644


nclandrei merged commit f89236f into main Apr 8, 2026

chatgpt-codex-connector Bot reviewed Apr 8, 2026

nclandrei merged 9 commits into main from claude/proctor-logging-screenshots-fO4W2
