Add screenshot logging, verification gates, and quality enforcement #1

Merged
nclandrei merged 9 commits into main from claude/proctor-logging-screenshots-fO4W2
Apr 8, 2026

Conversation

@nclandrei
Owner

Summary

Adds the Showboat-style verification pattern: agents must now narrate each step of their verification with screenshots, observations, and comparisons, and proctor enforces the structure, quality, and completeness of that narration.

New command: proctor log

Records step-by-step verification with a screenshot and three mandatory text fields per step:

  • --action — what the agent did
  • --observation — what the agent sees in the screenshot (it looks at it with its own vision)
  • --comparison — how what it sees compares to the scenario requirements

Verification gates added

  • Log entries mandatory at done — every scenario needs at least 1 log step
  • Screenshot format validation — PNG/JPEG/GIF/WebP magic bytes checked on all screenshots (record + log); text files renamed to .png are rejected
  • Screenshot freshness/size/duplicate checks on log step screenshots (same gates as record)
  • Observation quality enforcement — 40-char minimum, 4+ distinct words, vague filler phrases rejected ("looks good", "as expected", "lgtm", etc.)

Report integration

Log entries now appear in both HTML and markdown reports under each scenario, between pre-notes and evidence, with inline screenshot thumbnails and lightbox zoom.

Enforced flow

proctor start    → define scenarios + edge cases
proctor note     → commit intent (20+ chars)
proctor log      → each step: action + screenshot + observation + comparison (mandatory)
proctor record   → final evidence with assertions
proctor verify   → re-read screenshot, write observation (40+ chars, 4+ words)
proctor done     → blocks unless ALL gates pass

Test plan

  • go test ./... passes (all packages)
  • gofmt -l . clean
  • Unit tests for LogStep, ledger CRUD, step auto-increment, field validation
  • Unit tests for observation quality: validateObservationQuality, distinctWords, vague phrase rejection
  • Unit tests for isImageHeader magic bytes (PNG, JPEG, GIF, WebP, text, empty)
  • Unit tests for screenshot size/freshness rejection in log steps
  • CLI integration tests for proctor log (step numbering, flag validation, short observation rejection)
  • All existing integration tests updated with log steps (CLI, iOS, desktop, verify flows)

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo

claude added 9 commits April 8, 2026 11:08
Adds two new commands that close the gap between "the agent claims it
looked at the screenshot" and "an AI model actually examined it":

- `proctor log` records step-by-step verification actions with
  screenshots, building a chronological visual diary in
  screenshot-log.jsonl.

- `proctor analyze` sends screenshots to the Claude vision API,
  which describes what it sees, compares against scenario requirements,
  lists findings/concerns, and judges whether the visual state matches
  the scenario's intent. Results are stored in analysis.jsonl.

Both commands are optional and do not gate `proctor done`. They enrich
the audit trail and make verification evidence machine-readable rather
than relying solely on the agent's freeform text notes.

Includes comprehensive help text, unit tests with mock HTTP server for
the Claude API, screenshot log ledger tests, and CLI integration tests.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
New command: proctor log enforces the Showboat pattern where the agent
narrates its own verification at every step. For each step, the agent
must provide:

  --action       what it did (navigated, clicked, typed, etc.)
  --screenshot   the screenshot it took of the result
  --observation  what it actually sees in the screenshot (the agent
                 looks at it with its own vision and describes it)
  --comparison   how what it sees compares to the scenario requirements

All three text fields enforce a 20-character minimum, same as pre-notes
and verify observations. No external API calls - the agent provides the
eyes. Proctor enforces the structure.

Steps are stored in screenshot-log.jsonl as an append-only ledger with
file locking, following the same pattern as notes.jsonl and captures.jsonl.
Steps auto-increment per (scenario, session) pair.
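A minimal sketch of the per-pair step numbering described above. The `logEntry` type, its JSON field names, and the `nextStep` helper are hypothetical stand-ins for illustration; the actual ledger schema and function names in proctor may differ, and file locking is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// logEntry is a hypothetical, trimmed-down ledger row; the real
// screenshot-log.jsonl schema may carry more fields.
type logEntry struct {
	ScenarioID string `json:"scenario_id"`
	SessionID  string `json:"session_id"`
	Step       int    `json:"step"`
	Action     string `json:"action"`
}

// nextStep returns the next step number for a (scenario, session) pair:
// one more than the highest step already recorded for that pair, or 1
// when the pair has no entries yet.
func nextStep(entries []logEntry, scenarioID, sessionID string) int {
	highest := 0
	for _, e := range entries {
		if e.ScenarioID == scenarioID && e.SessionID == sessionID && e.Step > highest {
			highest = e.Step
		}
	}
	return highest + 1
}

func main() {
	ledger := []logEntry{
		{ScenarioID: "login", SessionID: "s1", Step: 1, Action: "opened the login page"},
	}
	entry := logEntry{ScenarioID: "login", SessionID: "s1", Action: "submitted the form"}
	entry.Step = nextStep(ledger, entry.ScenarioID, entry.SessionID)

	// Each ledger row is one JSON object per line (JSONL).
	line, err := json.Marshal(entry)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(line))
}
```

Numbering from the highest existing step, rather than from the count of matching rows, keeps step numbers increasing even if the ledger has gaps.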

The log is optional for proctor done to pass, but produces a richer
audit trail than record+verify alone.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
The VerifyEvidence doc comment got partially overwritten when
LogStepOptions was inserted. Remove the dangling fragment.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
Log steps now appear in both the markdown contract and the HTML report
under each scenario, between pre-notes and evidence. Each step shows
the action, observation, comparison, and an inline screenshot thumbnail
with lightbox zoom.

The writeReports function now loads screenshot-log.jsonl and passes
entries through to RenderReports and buildScenarioHTMLReports.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
The Evaluate function now loads screenshot-log.jsonl and checks that
every scenario has at least one log entry. Scenarios without log steps
get LogOK=false with an issue message directing the agent to run
proctor log. The done gate blocks and status shows the gap.
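As a rough illustration of this gate, the sketch below marks each scenario with a `LogOK` flag and an issue message when it has no log steps. Only the `LogOK` name comes from this change; the `scenarioStatus` type, the `evaluateLogGate` function, and the message wording are assumptions made for the example.

```go
package main

import "fmt"

// scenarioStatus is a hypothetical slice of the per-scenario evaluation
// result; only the LogOK field name is taken from the commit message.
type scenarioStatus struct {
	ScenarioID string
	LogOK      bool
	Issues     []string
}

// evaluateLogGate checks that every scenario has at least one log entry.
// logCounts maps a scenario ID to the number of log steps recorded for it.
// The second return value is false if any scenario is missing log steps.
func evaluateLogGate(scenarioIDs []string, logCounts map[string]int) ([]scenarioStatus, bool) {
	allOK := true
	statuses := make([]scenarioStatus, 0, len(scenarioIDs))
	for _, id := range scenarioIDs {
		st := scenarioStatus{ScenarioID: id, LogOK: logCounts[id] > 0}
		if !st.LogOK {
			allOK = false
			st.Issues = append(st.Issues,
				fmt.Sprintf("scenario %s has no log steps; run proctor log", id))
		}
		statuses = append(statuses, st)
	}
	return statuses, allOK
}

func main() {
	statuses, ok := evaluateLogGate(
		[]string{"login", "logout"},
		map[string]int{"login": 2},
	)
	fmt.Println("done allowed:", ok)
	for _, st := range statuses {
		fmt.Println(st.ScenarioID, st.LogOK, st.Issues)
	}
}
```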

All existing tests updated to call logStepsForAll or the CLI equivalent
before done, matching the new mandatory flow.

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
LogStep now validates screenshots with the same gates as record:
- Minimum 10KB size (rejects placeholder files)
- Maximum 30 minutes old (rejects stale screenshots)
- SHA256 duplicate detection across scenarios

Tests added for tiny and stale screenshot rejection.
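The three gates listed above could be sketched as a single pure function. This is not the LogStep implementation: `checkScreenshot`, the constants, and the `seen` map are hypothetical, 10KB is assumed to mean 10 * 1024 bytes, and reusing a screenshot within the same scenario is assumed to be allowed.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

const (
	minScreenshotBytes = 10 * 1024        // assumed reading of "10KB"
	maxScreenshotAge   = 30 * time.Minute // older files count as stale
)

// checkScreenshot applies the size, freshness, and duplicate gates.
// seen maps a SHA-256 hex digest to the scenario that first used it.
// On success it records the digest in seen and returns it.
func checkScreenshot(data []byte, age time.Duration, scenarioID string, seen map[string]string) (string, error) {
	if len(data) < minScreenshotBytes {
		return "", fmt.Errorf("screenshot is %d bytes, below the %d byte minimum", len(data), minScreenshotBytes)
	}
	if age > maxScreenshotAge {
		return "", fmt.Errorf("screenshot is %s old, above the %s limit", age, maxScreenshotAge)
	}
	sum := sha256.Sum256(data)
	digest := hex.EncodeToString(sum[:])
	if other, ok := seen[digest]; ok && other != scenarioID {
		return "", fmt.Errorf("screenshot duplicates one already used by scenario %s", other)
	}
	seen[digest] = scenarioID
	return digest, nil
}

func main() {
	seen := map[string]string{}
	data := make([]byte, minScreenshotBytes)
	modTime := time.Now().Add(-5 * time.Minute) // stand-in for the file's mtime

	_, err := checkScreenshot(data, time.Since(modTime), "login", seen)
	fmt.Println("first use:", err)

	_, err = checkScreenshot(data, time.Since(modTime), "logout", seen)
	fmt.Println("reuse in another scenario:", err)
}
```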

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
All screenshot validation paths (RecordBrowser, RecordIOS, RecordCLI,
RecordDesktop, LogStep) now check file headers for known image format
magic bytes before accepting the file:

  - PNG: \x89PNG
  - JPEG: \xFF\xD8\xFF
  - GIF: GIF8
  - WebP: RIFF....WEBP

Text files renamed to .png are rejected. Tests updated to write PNG
magic bytes in all fixture screenshots. Added tests for format
validation including isImageHeader unit tests.
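For reference, a header check along these lines might look like the sketch below. The `isImageHeader` name matches the tests mentioned above, but the signature and body here are assumptions rather than the code in this PR.

```go
package main

import (
	"bytes"
	"fmt"
)

// isImageHeader reports whether data starts with the magic bytes of a
// PNG, JPEG, GIF, or WebP file. Hypothetical sketch; the real function
// may take different arguments.
func isImageHeader(data []byte) bool {
	switch {
	case bytes.HasPrefix(data, []byte("\x89PNG")):
		return true
	case bytes.HasPrefix(data, []byte("\xFF\xD8\xFF")):
		return true
	case bytes.HasPrefix(data, []byte("GIF8")):
		return true
	case len(data) >= 12 &&
		bytes.Equal(data[0:4], []byte("RIFF")) &&
		bytes.Equal(data[8:12], []byte("WEBP")):
		// WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
		return true
	}
	return false
}

func main() {
	fmt.Println(isImageHeader([]byte("\x89PNG\r\n\x1a\n")))
	fmt.Println(isImageHeader([]byte("plain text renamed to .png")))
}
```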

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
…ase rejection

Observations (proctor log --observation, --comparison) and verify notes
now enforce stricter quality gates:

- Minimum length raised from 20 to 40 characters (pre-notes stay at 20)
- Must contain at least 4 distinct words (blocks "aaa aaa aaa..." padding)
- Exact vague filler phrases rejected: "looks good", "as expected",
  "no issues", "seems fine", "everything works", "lgtm", etc.

Good: "login form with email input, password field, and blue Sign In button"
Bad:  "looks good" / "as expected" / "the page looks correct"

Help text updated with examples of good vs bad observations. Tests added
for validateObservationQuality, distinctWords, vague phrase rejection,
and low word count rejection.
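A hedged sketch of how these three gates could fit together. The function names follow the tests mentioned above, but the bodies, the constants, the error wording, and the normalization (lowercasing and trimming trailing punctuation before the exact-phrase match) are assumptions; the phrase list covers only the examples quoted in this message.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const (
	minObservationChars = 40
	minDistinctWords    = 4
)

// vaguePhrases holds only the phrases quoted in the commit message;
// the real list is longer.
var vaguePhrases = []string{
	"looks good", "as expected", "no issues",
	"seems fine", "everything works", "lgtm",
}

// distinctWords counts unique words, ignoring case and edge punctuation.
func distinctWords(s string) int {
	seen := map[string]struct{}{}
	for _, w := range strings.Fields(strings.ToLower(s)) {
		w = strings.Trim(w, ".,;:!?\"'()")
		if w != "" {
			seen[w] = struct{}{}
		}
	}
	return len(seen)
}

// validateObservationQuality rejects exact filler phrases, text shorter
// than 40 characters, and text with fewer than 4 distinct words.
func validateObservationQuality(text string) error {
	trimmed := strings.TrimSpace(text)
	normalized := strings.ToLower(strings.TrimRight(trimmed, ".!"))
	for _, phrase := range vaguePhrases {
		if normalized == phrase {
			return fmt.Errorf("observation %q is a vague filler phrase", trimmed)
		}
	}
	if n := utf8.RuneCountInString(trimmed); n < minObservationChars {
		return fmt.Errorf("observation has %d characters, need at least %d", n, minObservationChars)
	}
	if n := distinctWords(trimmed); n < minDistinctWords {
		return fmt.Errorf("observation has %d distinct words, need at least %d", n, minDistinctWords)
	}
	return nil
}

func main() {
	fmt.Println(validateObservationQuality(
		"login form with email input, password field, and blue Sign In button"))
	fmt.Println(validateObservationQuality("looks good"))
	fmt.Println(validateObservationQuality("the page looks correct"))
}
```

Under these assumptions, "the page looks correct" is not on the filler list and fails on the length gate instead.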

https://claude.ai/code/session_01BQHtYiAaSRvACcqQsiHWYo
@nclandrei nclandrei merged commit f89236f into main Apr 8, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 121e6a7943

return ScreenshotLogEntry{}, fmt.Errorf("unknown scenario: %s", scenarioID)
}

artifact, err := store.CopyArtifact(run, surface, scenarioID, "log-step", screenshotPath)

P1: Validate log surface before writing screenshot artifacts

LogStep forwards opts.Surface directly into store.CopyArtifact, and that value is used as a filesystem path segment. Because there is no whitelist/sanitization here, a --surface containing traversal components (for example ../...) can make createArtifactFile write outside the intended artifacts/<surface>/<scenario> tree, which can pollute other run directories and store traversal paths in ledger entries. Restrict surface to known constants (browser, ios, cli, desktop) before calling CopyArtifact.
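One way to apply this suggestion is an allow-list check that runs before the artifact copy. The sketch below is illustrative only: `validateSurface` is a hypothetical helper, and the four surface names are the ones listed in the comment above.

```go
package main

import "fmt"

// validateSurface accepts only the known surface names, so the value is
// safe to use as a path segment. Hypothetical helper, not part of the PR.
func validateSurface(surface string) error {
	switch surface {
	case "browser", "ios", "cli", "desktop":
		return nil
	}
	return fmt.Errorf("unknown surface %q: must be one of browser, ios, cli, desktop", surface)
}

func main() {
	for _, s := range []string{"browser", "../other-run", ""} {
		if err := validateSurface(s); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", s)
	}
}
```

An exact-match allow-list sidesteps path cleaning altogether: anything containing a separator or `..` fails because it is not one of the four names.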

Comment thread help.go
Comment on lines +1165 to +1166
The log is optional for proctor done to pass, but agents that log their
steps produce richer evidence and more trustworthy verification.

P2: Fix log help text to reflect required completion gate

This help text says log entries are optional for proctor done, but the new evaluation logic now marks a scenario incomplete when no log entry exists. Users following this guidance will hit an unexpected done failure even though they followed documented steps, so the CLI documentation should state that at least one proctor log entry per scenario is required.
