pilo-cli published image not emitting event.screenshot base64 in NDJSON stream #466

@lmorchard

Background

While verifying visual-judge mode in pilo-evals-judge PR #73 against the eval cluster:

  • Local Pilo builds (fresh from main, with a .dockerignore to keep host node_modules/ out of the build context) emit event.screenshot base64 strings inside agent events in the NDJSON stdout stream. The downstream judge parses these via PiloEvent.screenshot and feeds them to a multimodal LLM judge.

  • The cluster-published image at gcr.io/moz-fx-tabs-nonprod/pilo-cli:latest does NOT appear to emit screenshots in the NDJSON. The judge sees 0 screenshots and falls back to action-summary-only reasoning. Same Pilo agent code in principle, same task, same downstream judge — only the build artifact differs.
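A quick way to quantify the difference is to capture the NDJSON stdout from each image and count how many events carry a decodable screenshot. The sketch below is illustrative only: it assumes the base64 string sits in a top-level `screenshot` key on each event, which is how the judge reads it via PiloEvent.screenshot but has not been checked against the Pilo source. The `type` field in the sample data is made up.

```python
import base64
import binascii
import json
from typing import Dict, Iterable


def count_screenshots(lines: Iterable[str]) -> Dict[str, int]:
    """Count events and decodable base64 screenshots in an NDJSON stream."""
    stats = {"events": 0, "screenshots": 0, "invalid_lines": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are not events
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            stats["invalid_lines"] += 1
            continue
        if not isinstance(event, dict):
            stats["invalid_lines"] += 1
            continue
        stats["events"] += 1

        # ASSUMPTION: the screenshot is a top-level "screenshot" string.
        shot = event.get("screenshot")
        if not isinstance(shot, str) or not shot:
            continue
        # Tolerate a data-URL prefix such as "data:image/png;base64,...".
        payload = shot.split(",", 1)[1] if shot.startswith("data:") and "," in shot else shot
        try:
            base64.b64decode(payload, validate=True)
        except (binascii.Error, ValueError):
            continue  # present but not valid base64; do not count it
        stats["screenshots"] += 1
    return stats


# Hypothetical sample stream: one event with a screenshot, one without,
# and one line that is not JSON at all.
sample = [
    json.dumps({"type": "action", "screenshot": base64.b64encode(b"fake-png").decode()}),
    json.dumps({"type": "action"}),
    "not json",
]
print(count_screenshots(sample))
```

Running this against the captured stdout of the local build and of the published image should show whether the cluster stream has zero screenshot-bearing events or whether the judge is dropping them downstream.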

Repro

  • pilo-evals-judge worktree, branch worktree-online-mind2web-benchmark
  • Submit the workflow with argo submit --from workflowtemplate/pilo-batch-eval-from-file -p evaluations-gcs-key=eval-inputs/online-mind2web-smoke.jsonl -p judge-mode=visual -p judge-version=pr73-4c798a4 -p agent-version=latest
  • Workflow pilo-batch-eval-from-file-qqqwc logs show Visual judge: 0 screenshots, action summary len 283
  • Compare to a local make eval TASKS_FILE=evals-data/online-mind2web/smoke.jsonl JUDGE_MODE=visual after make build-agent PILO_REPO=/path/to/pilo, which produces ~97 events including base64 screenshots
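If both runs' NDJSON output is saved to files, a key-level diff can show which fields the cluster stream is missing without assuming any particular event schema. This is a rough sketch: it only inspects top-level keys, so a screenshot nested inside another object would not show up, and the sample lines are invented for illustration.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Tuple


def key_counts(lines: Iterable[str]) -> Counter:
    """Count how many events contain each top-level key."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise such as stray log lines
        if isinstance(event, dict):
            counts.update(event.keys())
    return counts


def diff_keys(local_lines: Iterable[str], cluster_lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Return {key: (local_count, cluster_count)} for keys whose counts differ."""
    local = key_counts(local_lines)
    cluster = key_counts(cluster_lines)
    return {
        key: (local.get(key, 0), cluster.get(key, 0))
        for key in sorted(set(local) | set(cluster))
        if local.get(key, 0) != cluster.get(key, 0)
    }


# Invented example streams; real input would be the two captured stdout files.
local_run = ['{"type": "action", "screenshot": "aGk="}', '{"type": "action"}']
cluster_run = ['{"type": "action"}', '{"type": "action"}']
print(diff_keys(local_run, cluster_run))
```

A key that appears in the local run but never in the cluster run points at the published image; identical key counts would instead point at how the judge consumes the stream.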

Suspected cause

Either:

  1. The published image was built from a Pilo version that pre-dates event.screenshot being emitted in the NDJSON event stream
  2. There's a build-time flag or env var (e.g. PILO_VISION=true?) that controls whether screenshots are serialized into NDJSON vs just attached to internal observability — and the cluster image isn't configured for it

Worth grepping packages/core/src/webAgent.ts for where screenshot ends up in the event payload, plus checking what pilo-vision: true actually controls.
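For the grep itself, a small script that lists every mention with file and line number makes it easier to compare two checkouts (current main versus whatever commit the published image was built from). This is a generic sketch, not tied to Pilo internals: the default pattern, the `.ts` suffix filter, and the `node_modules` exclusion are my choices, and the checkout path is a placeholder.

```python
import re
from pathlib import Path
from typing import List, Pattern, Tuple


def grep_text(text: str, regex: Pattern[str]) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) for every line matching regex."""
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


def grep_tree(root: str, pattern: str = r"screenshot",
              suffixes: Tuple[str, ...] = (".ts",)) -> List[Tuple[str, int, str]]:
    """Search source files under root, skipping node_modules."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits: List[Tuple[str, int, str]] = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        if "node_modules" in path.parts:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        hits.extend((str(path), lineno, line) for lineno, line in grep_text(text, regex))
    return hits


# Usage (path is a placeholder for a local Pilo checkout):
#   for path, lineno, line in grep_tree("/path/to/pilo/packages/core/src"):
#       print(f"{path}:{lineno}: {line}")
```

Running it once per checkout and diffing the two outputs should show whether the screenshot serialization was added after the published image was built.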

Why it matters

Online-Mind2Web visual judge mode is materially weakened without screenshots — the LLM is asked to "evaluate from screenshots and actions" and has to confabulate the screenshot dimension. The local pilo-evals-judge#73 verification got real screenshots and grounded the judge verdicts in visible UI. The cluster path didn't. Bridging this would make the cluster a viable place to run OM2W as a routine eval.

Out of scope

Not blocking the pilo-evals-judge PR — that PR's plumbing is verified working end-to-end. This is a separate Pilo-build observation worth a quick audit.
