Background
While verifying visual-judge mode in pilo-evals-judge PR #73 against the eval cluster:
- Local Pilo builds (fresh from main, with a .dockerignore to keep host node_modules/ out of the build context) emit event.screenshot base64 strings inside agent events in the NDJSON stdout stream. The downstream judge parses these via PiloEvent.screenshot and feeds them to a multimodal LLM judge.
- The cluster-published image at gcr.io/moz-fx-tabs-nonprod/pilo-cli:latest does NOT appear to emit screenshots in the NDJSON. The judge sees 0 screenshots and falls back to action-summary-only reasoning. Same Pilo agent code in principle, same task, same downstream judge; only the build artifact differs.
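To confirm the difference independently of the judge, the NDJSON stdout of each build can be captured to a file and scanned for screenshot payloads. The following is a minimal sketch, not the judge's actual parser. It assumes each stdout line is one JSON object and that the base64 string lives under a key named screenshot somewhere inside the event; the exact nesting is not known here, so the helper searches recursively. The function names are hypothetical.

```python
import json
from typing import Any, Iterable


def has_screenshot(node: Any) -> bool:
    """Return True if any 'screenshot' key under node holds a non-empty string."""
    if isinstance(node, dict):
        value = node.get("screenshot")
        if isinstance(value, str) and value:
            return True
        return any(has_screenshot(child) for child in node.values())
    if isinstance(node, list):
        return any(has_screenshot(child) for child in node)
    return False


def count_screenshot_events(lines: Iterable[str]) -> tuple[int, int]:
    """Return (parsed_events, events_with_screenshot) for an NDJSON stream."""
    total = 0
    with_shot = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON log lines mixed into stdout
        total += 1
        if has_screenshot(event):
            with_shot += 1
    return total, with_shot
```

Calling count_screenshot_events(open(path)) on the captured output of a local build and of the cluster image should show a non-zero second value only for the local build if the observation above holds.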
Repro
- pilo-evals-judge worktree, branch worktree-online-mind2web-benchmark
- Submit: argo submit --from workflowtemplate/pilo-batch-eval-from-file -p evaluations-gcs-key=eval-inputs/online-mind2web-smoke.jsonl -p judge-mode=visual -p judge-version=pr73-4c798a4 -p agent-version=latest
- Workflow pilo-batch-eval-from-file-qqqwc logs show: Visual judge: 0 screenshots, action summary len 283
- Compare to a local make eval TASKS_FILE=evals-data/online-mind2web/smoke.jsonl JUDGE_MODE=visual after make build-agent PILO_REPO=/path/to/pilo, which produces ~97 events including base64 screenshots
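If both runs are captured as NDJSON files, a field-level comparison can show exactly which keys the cluster events lack. This is a rough sketch under stated assumptions: that events are JSON objects, and that they carry a discriminator field named type. That field name is a guess and is exposed as a parameter. All function names are hypothetical.

```python
import json
from collections import defaultdict
from typing import Any, Iterable, Iterator


def key_paths(node: Any, prefix: str = "") -> Iterator[str]:
    """Yield dotted paths for every key in nested dicts, e.g. 'data.screenshot'."""
    if not isinstance(node, dict):
        return
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        yield from key_paths(value, path)


def keys_by_type(lines: Iterable[str], type_field: str = "type") -> dict[str, set[str]]:
    """Map each event type to the union of key paths seen for that type."""
    seen: dict[str, set[str]] = defaultdict(set)
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore blank or non-JSON lines
        if isinstance(event, dict):
            event_type = str(event.get(type_field, "<untyped>"))
            seen[event_type].update(key_paths(event))
    return dict(seen)


def missing_keys(local: Iterable[str], cluster: Iterable[str],
                 type_field: str = "type") -> dict[str, list[str]]:
    """For each event type, list key paths present locally but absent on the cluster."""
    local_keys = keys_by_type(local, type_field)
    cluster_keys = keys_by_type(cluster, type_field)
    result: dict[str, list[str]] = {}
    for event_type, keys in local_keys.items():
        diff = keys - cluster_keys.get(event_type, set())
        if diff:
            result[event_type] = sorted(diff)
    return result
```

An event type that lists all of its keys as missing is one the cluster run never emitted, which points at a version gap. A type missing only its screenshot path points more toward a flag or configuration difference.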
Suspected cause
Either:
- The published image was built from a Pilo version that pre-dates event.screenshot being emitted in the JSONL event stream
- There's a build-time flag or env var (e.g. PILO_VISION=true?) that controls whether screenshots are serialized into NDJSON vs just attached to internal observability, and the cluster image isn't configured for it
Worth grepping packages/core/src/webAgent.ts for where screenshot ends up in the event payload, plus checking what pilo-vision: true actually controls.
Why it matters
Online-Mind2Web visual judge mode is materially weakened without screenshots: the LLM is asked to "evaluate from screenshots and actions" and has to confabulate the screenshot dimension. The local pilo-evals-judge#73 verification got real screenshots and grounded the judge verdicts in visible UI. The cluster path didn't. Closing this gap would make the cluster a viable place to run Online-Mind2Web (OM2W) as a routine eval.
Out of scope
Not blocking the pilo-evals-judge PR; that PR's plumbing is verified working end-to-end. This is a separate Pilo-build observation worth a quick audit.