Vision Verification

nick3 edited this page May 28, 2026 · 1 revision

Browser automation goes wrong in ways DOM checks can't catch: a cookie banner overlays your target button, an error toast appears that the DOM scan missed, or the page renders but is visually broken. Vision verification closes that loop by taking a screenshot and asking a vision model whether the expected thing actually happened.

Source: src/main/ai-tools/browser/vision.ts + the vision helpers in src/main/ai-manager.ts:buildVisionHelpers().


Two tools

browser_verify_visual_state(pane_id, expected_state)

Take a screenshot, ask a vision model whether expected_state is visible. Returns:

{
  success: true,
  verdict: 'yes' | 'no' | 'unclear',
  explanation: string,
  screenshot_path: string
}

Use after an action when DOM alone won't tell you: after a click that should open a modal, after a form submit that should redirect, after a hover that should reveal a tooltip.

Example AI usage:

[step 23] browser_click(selector="#submit")
[step 24] browser_verify_visual_state(expected_state="a success toast appears at the top of the page with the text 'Saved'")
          → {verdict: 'no', explanation: "I see a red error toast saying 'Username already taken'"}
[step 25] AI now knows the click "worked" mechanically but the action failed, so it adjusts its plan
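
The branching in the trace above can be sketched as a small decision function. This is only an illustration: `decideNextStep` and the `NextStep` labels are hypothetical names, not part of the tool API. The success shape follows the return value shown above, and the failure shape follows the `{success: false, error: ...}` form described under "Vision model resolution".

```typescript
// Result shapes as documented on this page.
type VerifyResult =
  | { success: true; verdict: 'yes' | 'no' | 'unclear'; explanation: string; screenshot_path: string }
  | { success: false; error: string };

// Hypothetical labels for what the agent might do next.
type NextStep = 'continue' | 'recover' | 'reinspect' | 'fallback_to_dom';

// Hypothetical helper: maps a verification result to the agent's next move.
function decideNextStep(result: VerifyResult): NextStep {
  if (!result.success) return 'fallback_to_dom'; // vision unavailable, use DOM checks instead
  switch (result.verdict) {
    case 'yes':
      return 'continue'; // expected state is visible
    case 'no':
      return 'recover'; // the action ran but the outcome is wrong; read the explanation
    case 'unclear':
      return 'reinspect'; // e.g. call browser_describe_screen or rephrase expected_state
  }
}
```

In the trace above, the `'no'` verdict would map to `'recover'`, with the explanation ("Username already taken") telling the AI what to fix.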

browser_describe_screen(pane_id, prompt?)

Free-form vision description. The model names the page, lists visible interactive elements with labels, notes modals/loading states. Useful when the page doesn't match expectations and the AI needs to figure out where it is before acting.

{
  success: true,
  description: string,
  screenshot_path: string
}

Example AI usage:

[step 30] AI: "I expected the dashboard but the page looks wrong. Let me check."
[step 31] browser_describe_screen()
          → {description: "Login page. Visible: email input (focused), password input (empty), 
             'Forgot password?' link, 'Sign in' button (disabled), Google SSO button. Top-right: 
             'Sign up' link. Footer: privacy policy link, terms link."}
[step 32] AI: "I was logged out. Need to re-authenticate first."

How the vision call works

The vision helpers in AIManager.buildVisionHelpers():

  1. Read the screenshot file as Buffer
  2. Encode as data:image/png;base64,…
  3. Build an OpenAI-compat vision message: [{type: 'text', text: '...'}, {type: 'image_url', image_url: {url: 'data:image/png;base64,...'}}]
  4. POST to the vision provider (active provider with model swapped for visionModel if set)
  5. Strict prompt for verify: "Reply on the first line: YES / NO / UNCLEAR. Second line: one-sentence reason."
  6. Strict prompt for describe: "Name the page, list visible interactive elements, note modals/loading."
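
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `buildVisionHelpers()`: the function names (`toPngDataUrl`, `buildVerifyMessage`, `parseVerifyReply`), the `role: 'user'` wrapper, and the way `expected_state` is combined with the strict prompt are all assumptions. Only the content-part shape and the prompt text come from this page.

```typescript
import { Buffer } from 'node:buffer';

type VisionContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } };

interface VisionMessage {
  role: 'user';
  content: VisionContentPart[];
}

type Verdict = 'yes' | 'no' | 'unclear';

// Prompt text quoted from step 5 above.
const VERIFY_INSTRUCTIONS =
  'Reply on the first line: YES / NO / UNCLEAR. Second line: one-sentence reason.';

// Step 2: encode the screenshot bytes as a PNG data URL.
function toPngDataUrl(png: Buffer): string {
  return `data:image/png;base64,${png.toString('base64')}`;
}

// Step 3: build the OpenAI-compatible message (text part, then image part).
// How the expected state is phrased in the text part is an assumption.
function buildVerifyMessage(png: Buffer, expectedState: string): VisionMessage {
  return {
    role: 'user',
    content: [
      { type: 'text', text: `Expected state: ${expectedState}\n${VERIFY_INSTRUCTIONS}` },
      { type: 'image_url', image_url: { url: toPngDataUrl(png) } },
    ],
  };
}

// Step 5: turn the two-line reply into a verdict plus explanation.
// Anything that does not start with YES / NO / UNCLEAR is treated as 'unclear'.
function parseVerifyReply(reply: string): { verdict: Verdict; explanation: string } {
  const lines = reply
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const match = (lines[0] ?? '').toUpperCase().match(/^(YES|NO|UNCLEAR)\b/);
  if (!match) {
    return { verdict: 'unclear', explanation: lines.join(' ') };
  }
  return {
    verdict: match[1].toLowerCase() as Verdict,
    explanation: lines.slice(1).join(' '),
  };
}
```

Defaulting malformed replies to `'unclear'` is a design choice for this sketch: it keeps a chatty or off-format model from being read as a confident yes or no.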

Vision model resolution

In buildVisionHelpers:

const visionProvider: AIProviderConfig = {
  ...baseProvider,
  model: baseProvider.visionModel || baseProvider.model
}

If visionModel is set on the active provider (e.g., a separate vision-capable model), vision calls use it. Otherwise they fall back to the main model, which works for dual-purpose models such as Claude (Sonnet, Opus, Haiku) and GPT-4o that handle text and images in the same model.

When neither the main nor vision model can handle images, the vision tool calls will get a model-error response. The tool wraps it and returns {success: false, error: ...} so the AI knows to fall back to DOM-based verification.
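
A rough sketch of that wrapping behavior, assuming the provider request is injected as a function. `ProviderModels`, `VisionCall`, and `describeScreenSafely` are hypothetical names for illustration; the real `AIProviderConfig` has fields this page does not show, and the real helper may be structured differently.

```typescript
// Minimal stand-in for the two AIProviderConfig fields used on this page.
interface ProviderModels {
  model: string;
  visionModel?: string;
}

type DescribeResult =
  | { success: true; description: string; screenshot_path: string }
  | { success: false; error: string };

// Hypothetical signature for whatever performs the actual provider request.
type VisionCall = (model: string, screenshotPath: string) => Promise<string>;

// Same resolution rule as the snippet above: visionModel wins, else the main model.
function resolveVisionModel(provider: ProviderModels): string {
  return provider.visionModel || provider.model;
}

// Never throws: provider errors come back as {success: false, error}
// so the caller can switch to DOM-based verification.
async function describeScreenSafely(
  provider: ProviderModels,
  screenshotPath: string,
  callVision: VisionCall,
): Promise<DescribeResult> {
  try {
    const description = await callVision(resolveVisionModel(provider), screenshotPath);
    return { success: true, description, screenshot_path: screenshotPath };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { success: false, error: `Vision call failed: ${reason}` };
  }
}
```

Returning a result object instead of throwing keeps the tool loop simple: the AI inspects `success` and decides whether to retry with DOM tools.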


Per-provider configuration

In AI-Providers:

| Provider | model | visionModel |
| --- | --- | --- |
| Claude | claude-sonnet-4-5 | leave blank |
| OpenAI | gpt-4o | leave blank |
| OpenAI cost-optimized | gpt-4o-mini | gpt-4o (small text model + vision fallback) |
| Ollama with text-only model | mistral-nemo | llava (separate vision model) |
| LM Studio | (your loaded text model) | (vision model if loaded separately) |

For most users, Claude or GPT-4o "just works" — vision is built into the same model.


Cost

Each vision call sends one image and gets back roughly 50 tokens of text. On hosted providers (Claude, GPT-4o), that works out to roughly 1-2 cents per call. Local vision models such as llava carry no per-call charge.

If you're cost-conscious, set visionModel to a cheaper vision-capable model (e.g., gpt-4o-mini) and reserve the main model for reasoning.
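
For budgeting, the per-call figure above can be turned into a quick range estimate. This is back-of-the-envelope arithmetic only: `estimateVisionCostCents` is a hypothetical helper, and the default of 1-2 cents per call is the rough figure quoted on this page, not a provider's published price.

```typescript
interface CostRangeCents {
  low: number;
  high: number;
}

// Multiplies the number of vision calls by a per-call cost range (in cents).
function estimateVisionCostCents(
  calls: number,
  perCall: CostRangeCents = { low: 1, high: 2 },
): CostRangeCents {
  if (!Number.isInteger(calls) || calls < 0) {
    throw new RangeError('calls must be a non-negative integer');
  }
  return { low: calls * perCall.low, high: calls * perCall.high };
}

// A 40-step run that verifies every step: somewhere between 40 and 80 cents.
const runEstimate = estimateVisionCostCents(40);
```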


When to use vision verification

Use it when:

  • DOM is misleading (modals, overlays, late JS render)
  • You want to confirm visual state matches expectations
  • You're navigating a page you've never seen before (describe_screen first, then act)
  • You're verifying success of a UI flow before claiming a goal complete

Don't use it for:

  • Things wait_for_selector can solve faster
  • Reading text on a page (get_axtree or get_content is cheaper)
  • Verifying URL or title changes (the navigated field returned by click tools already reports this)
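
The guidance above amounts to a lookup from "what am I trying to confirm" to "the cheapest check that answers it". A small sketch of that mapping follows; the `CheckKind` categories and both helper functions are hypothetical, and the tool names are the ones used on this page.

```typescript
// Hypothetical categories for the kinds of checks listed above.
type CheckKind =
  | 'element_appears'
  | 'read_text'
  | 'url_or_title_changed'
  | 'visual_state'
  | 'unfamiliar_page';

// Cheapest suitable check for each kind, per the two lists above.
const PREFERRED_CHECK: Record<CheckKind, string> = {
  element_appears: 'wait_for_selector',
  read_text: 'get_axtree or get_content',
  url_or_title_changed: 'navigated field on the click result',
  visual_state: 'browser_verify_visual_state',
  unfamiliar_page: 'browser_describe_screen',
};

function preferredCheck(kind: CheckKind): string {
  return PREFERRED_CHECK[kind];
}

// Only these two kinds justify paying for a vision call.
function needsVision(kind: CheckKind): boolean {
  return kind === 'visual_state' || kind === 'unfamiliar_page';
}
```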

Where the screenshots go

Vision tools save screenshots to:

<userData>/clusterspace-data/browser-screenshots/shot-<timestamp>-<random>.png

The tool result returns screenshot_path so the AI can reference it later. ClusterSpace doesn't auto-clean this directory — periodically wipe it if it grows large.
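
Since nothing cleans this directory automatically, a small maintenance script can do it. The sketch below is not part of ClusterSpace; `pruneScreenshots` is a hypothetical helper that deletes `shot-*.png` files older than a given age from a directory you pass in. It uses only Node's `fs` and `path` modules.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Deletes shot-*.png files in `dir` whose modification time is older than
// `maxAgeMs` relative to `now`. Returns the names of the deleted files.
function pruneScreenshots(dir: string, maxAgeMs: number, now: number = Date.now()): string[] {
  const removed: string[] = [];
  for (const name of fs.readdirSync(dir)) {
    // Only touch files that follow the screenshot naming pattern.
    if (!/^shot-.*\.png$/.test(name)) continue;
    const file = path.join(dir, name);
    const stat = fs.statSync(file);
    if (stat.isFile() && now - stat.mtimeMs > maxAgeMs) {
      fs.unlinkSync(file);
      removed.push(name);
    }
  }
  return removed;
}
```

To use it, pass the resolved screenshot directory and an age limit, for example `pruneScreenshots(screenshotDir, 7 * 24 * 60 * 60 * 1000)` for a one-week retention window. In an Electron main process, `<userData>` most likely corresponds to `app.getPath('userData')`, though that mapping is an assumption here.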


Failure modes

| Symptom | Likely cause |
| --- | --- |
| All verdicts are unclear | Vision model is poor; try Claude or GPT-4o |
| Verdicts disagree with what you see | Prompt was vague; be specific about what "expected state" means visually |
| success: false, error: "No vision model available" | No active provider configured; check AI-Providers |
| success: false, error: "Vision call failed: ..." | Provider returned a model error; check if the model supports images |
| Screenshot is blank / mostly white | Browser pane wasn't fully rendered; precede with wait_for_navigation or wait_for_selector |
