pilo-cli published image not emitting event.screenshot base64 in NDJSON stream

## Background

While verifying visual-judge mode in pilo-evals-judge PR #73 against the eval cluster:

- Local Pilo builds (fresh from main, with a `.dockerignore` to keep host `node_modules/` out of the build context) emit `event.screenshot` base64 strings inside agent events in the NDJSON stdout stream. The downstream judge parses these via `PiloEvent.screenshot` and feeds them to a multimodal LLM judge.

- The cluster-published image at `gcr.io/moz-fx-tabs-nonprod/pilo-cli:latest` does not appear to emit screenshots in the NDJSON stream. The judge reports `0 screenshots` and falls back to action-summary-only reasoning. The Pilo agent code (in principle), the task, and the downstream judge are the same; only the build artifact differs.

## Repro

- pilo-evals-judge worktree, branch `worktree-online-mind2web-benchmark`
- Run `argo submit --from workflowtemplate/pilo-batch-eval-from-file -p evaluations-gcs-key=eval-inputs/online-mind2web-smoke.jsonl -p judge-mode=visual -p judge-version=pr73-4c798a4 -p agent-version=latest`
- Workflow `pilo-batch-eval-from-file-qqqwc` logs show `Visual judge: 0 screenshots, action summary len 283`
- Compare to a local `make eval TASKS_FILE=evals-data/online-mind2web/smoke.jsonl JUDGE_MODE=visual` after `make build-agent PILO_REPO=/path/to/pilo`, which produces ~97 events including base64 screenshots
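
To put numbers on the comparison, the two NDJSON captures can be summarized with a short script. This is a minimal sketch, not part of either repo: `has_screenshot` and `summarize` are hypothetical helper names, and because the exact event shape is not confirmed here, the check looks for a non-empty string under any key named `screenshot` at any nesting depth rather than assuming `event.screenshot` sits at a fixed path.

```python
import json
from typing import Any, Iterable


def has_screenshot(value: Any) -> bool:
    """Return True if any nested dict holds a non-empty string under the key "screenshot"."""
    if isinstance(value, dict):
        for key, child in value.items():
            if key == "screenshot" and isinstance(child, str) and child:
                return True
            if has_screenshot(child):
                return True
    elif isinstance(value, list):
        return any(has_screenshot(item) for item in value)
    return False


def summarize(lines: Iterable[str]) -> dict:
    """Count parsed events and screenshot-bearing events in an NDJSON stream."""
    summary = {"events": 0, "with_screenshot": 0, "unparsed": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are not events
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Assumption: stdout may carry non-JSON log lines mixed in with events.
            summary["unparsed"] += 1
            continue
        summary["events"] += 1
        if has_screenshot(event):
            summary["with_screenshot"] += 1
    return summary
```

Calling `summarize(open(path, encoding="utf-8"))` on a saved local capture and on a saved cluster capture should show whether `with_screenshot` is zero only for the cluster image, or whether the cluster stream is also missing events in general.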

## Suspected cause

Either:
1. The published image was built from a Pilo version that predates `event.screenshot` being emitted in the NDJSON event stream.
2. A build-time flag or env var (e.g. `PILO_VISION=true`?) controls whether screenshots are serialized into the NDJSON stream or only attached to internal observability, and the cluster image isn't configured for it.

Worth grepping `packages/core/src/webAgent.ts` for where `screenshot` ends up in the event payload, plus checking what `pilo-vision: true` actually controls.
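
Alongside the source grep, it may help to look at where a `screenshot` key shows up in each captured stream, since "key absent everywhere" and "key present but null or empty" point at different causes. The sketch below is an assumption-laden helper (the names `screenshot_paths` and `path_histogram` are made up for illustration) that tallies the dotted path of every `screenshot` key regardless of its value.

```python
import json
from collections import Counter
from typing import Any, Iterable, Iterator


def screenshot_paths(value: Any, prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every key named "screenshot" in a decoded JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            if key == "screenshot":
                yield path  # reported even when the value is null or empty
            yield from screenshot_paths(child, path)
    elif isinstance(value, list):
        for item in value:
            # "[]" marks a list level so array elements share one path.
            yield from screenshot_paths(item, f"{prefix}[]")


def path_histogram(lines: Iterable[str]) -> Counter:
    """Tally screenshot key paths across all parseable lines of an NDJSON stream."""
    counts: Counter = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        counts.update(screenshot_paths(event))
    return counts
```

An empty histogram for the cluster capture would be consistent with the field never being serialized, while a non-empty histogram combined with a zero screenshot count would suggest the key is emitted but left unpopulated.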

## Why it matters

Online-Mind2Web visual judge mode is materially weakened without screenshots: the LLM is asked to "evaluate from screenshots and actions" and has to confabulate the screenshot side. The local pilo-evals-judge#73 verification got real screenshots, so the judge verdicts were grounded in visible UI; the cluster run did not. Closing this gap would make the cluster a viable place to run Online-Mind2Web (OM2W) as a routine eval.

## Out of scope

Not blocking the pilo-evals-judge PR; that PR's plumbing is verified working end-to-end. This is a separate observation about the Pilo build that is worth a quick audit.


pilo-cli published image not emitting event.screenshot base64 in NDJSON stream #466
