Vision Verification

Browser automation goes wrong in ways DOM checks can't catch: a cookie banner overlays your target button, an error toast appears that the DOM scan missed, the page renders but visually breaks. Vision verification closes that loop by asking a vision model "did the expected thing happen?"

Source: src/main/ai-tools/browser/vision.ts + the vision helpers in src/main/ai-manager.ts:buildVisionHelpers().

Two tools

`browser_verify_visual_state(pane_id, expected_state)`

Takes a screenshot and asks a vision model whether expected_state is visible. Returns:

{
  success: true,
  verdict: 'yes' | 'no' | 'unclear',
  explanation: string,
  screenshot_path: string
}

Use it after an action when the DOM alone won't tell you whether the action worked: after a click that should open a modal, after a form submit that should redirect, after a hover that should reveal a tooltip.

Example AI usage:

[step 23] browser_click(selector="#submit")
[step 24] browser_verify_visual_state(expected_state="a success toast appears at the top of the page with the text 'Saved'")
          → {verdict: 'no', explanation: "I see a red error toast saying 'Username already taken'"}
[step 25] AI now knows the click "worked" mechanically but the action failed — adjusts
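
A caller can branch on the result roughly as follows. This is a minimal sketch: VerifyResult mirrors the fields shown above plus the {success: false, error} failure shape described under "Vision model resolution", while NextStep and nextStepFor are hypothetical names and the policy itself is only an illustration, not ClusterSpace's actual behavior.

```typescript
// Result shape: success fields from the doc, failure shape from the error path.
type VerifyResult =
  | { success: true; verdict: "yes" | "no" | "unclear"; explanation: string; screenshot_path: string }
  | { success: false; error: string };

// Hypothetical follow-up actions an agent loop might choose between.
type NextStep = "continue" | "recover" | "describe_screen" | "dom_fallback";

function nextStepFor(result: VerifyResult): NextStep {
  // The vision call itself failed: fall back to DOM-based verification.
  if (!result.success) return "dom_fallback";
  switch (result.verdict) {
    case "yes":
      return "continue"; // expected state is visible
    case "no":
      return "recover"; // action ran mechanically but the outcome is wrong
    case "unclear":
      return "describe_screen"; // get a free-form description before deciding
  }
}
```

In the example above, step 24 would map to "recover": the click succeeded mechanically, but the verdict is 'no'.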

`browser_describe_screen(pane_id, prompt?)`

Free-form vision description. The model names the page, lists visible interactive elements with labels, notes modals/loading states. Useful when the page doesn't match expectations and the AI needs to figure out where it is before acting.

{
  success: true,
  description: string,
  screenshot_path: string
}

Example AI usage:

[step 30] AI: "I expected the dashboard but the page looks wrong. Let me check."
[step 31] browser_describe_screen()
          → {description: "Login page. Visible: email input (focused), password input (empty), 
             'Forgot password?' link, 'Sign in' button (disabled), Google SSO button. Top-right: 
             'Sign up' link. Footer: privacy policy link, terms link."}
[step 32] AI: "I was logged out. Need to re-authenticate first."

How the vision call works

The vision helpers in AIManager.buildVisionHelpers() do the following:

Read the screenshot file as a Buffer
Encode as data:image/png;base64,…
Build an OpenAI-compat vision message: [{type: 'text', text: '...'}, {type: 'image_url', image_url: {url: 'data:image/png;base64,...'}}]
POST to the vision provider (active provider with model swapped for visionModel if set)
Strict prompt for verify: "Reply on the first line: YES / NO / UNCLEAR. Second line: one-sentence reason."
Strict prompt for describe: "Name the page, list visible interactive elements, note modals/loading."
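
The encoding and parsing steps above can be sketched as follows. This is an illustration under assumptions, not the code in ai-manager.ts: buildVisionMessage and parseVerdictReply are hypothetical names, and treating any unrecognized first line as 'unclear' is a choice made for this sketch.

```typescript
type Verdict = "yes" | "no" | "unclear";

// Builds an OpenAI-compatible user message carrying a prompt and a PNG screenshot.
function buildVisionMessage(png: Buffer, prompt: string) {
  const dataUrl = `data:image/png;base64,${png.toString("base64")}`;
  return {
    role: "user" as const,
    content: [
      { type: "text" as const, text: prompt },
      { type: "image_url" as const, image_url: { url: dataUrl } },
    ],
  };
}

// Parses a reply of the form "YES|NO|UNCLEAR" on line 1 and a reason on line 2.
function parseVerdictReply(reply: string): { verdict: Verdict; explanation: string } {
  const lines = reply
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const first = (lines[0] ?? "").toUpperCase();
  let verdict: Verdict = "unclear"; // assumption: anything unrecognized is "unclear"
  if (/^YES\b/.test(first)) verdict = "yes";
  else if (/^NO\b/.test(first)) verdict = "no";
  return { verdict, explanation: lines.slice(1).join(" ") };
}
```

Asking for the verdict on its own line keeps the parser trivial and avoids guessing at the model's intent from free-form prose.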

Vision model resolution

In buildVisionHelpers:

const visionProvider: AIProviderConfig = {
  ...baseProvider,
  model: baseProvider.visionModel || baseProvider.model
}

If visionModel is set on the active provider (e.g., a separate vision-capable model), the vision helpers use it. Otherwise they fall back to the main model, which works for dual-purpose models such as Claude (Sonnet, Opus, Haiku) and GPT-4o that handle text and vision in the same model.

When neither the main nor vision model can handle images, the vision tool calls will get a model-error response. The tool wraps it and returns {success: false, error: ...} so the AI knows to fall back to DOM-based verification.
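
The fallback and the error wrapping can be sketched together. The AIProviderConfig interface here lists only the two fields this page mentions (the real type presumably has more), and resolveVisionProvider and toToolFailure are hypothetical helper names; the "Vision call failed: ..." prefix is taken from the Failure modes table.

```typescript
// Only the fields discussed on this page; the real config likely carries more.
interface AIProviderConfig {
  model: string;
  visionModel?: string;
}

// Same rule as in buildVisionHelpers: prefer visionModel, else the main model.
function resolveVisionProvider(base: AIProviderConfig): AIProviderConfig {
  return { ...base, model: base.visionModel || base.model };
}

// Converts a thrown provider error into the tool's failure shape.
function toToolFailure(err: unknown): { success: false; error: string } {
  const message = err instanceof Error ? err.message : String(err);
  return { success: false, error: `Vision call failed: ${message}` };
}
```

A tool handler would wrap its provider call in try/catch and return toToolFailure(err) from the catch branch, so the AI receives a structured failure instead of an exception.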

Per-provider configuration

In AI-Providers:

Provider	model	visionModel
Claude	`claude-sonnet-4-5`	leave blank
OpenAI	`gpt-4o`	leave blank
OpenAI cost-optimized	`gpt-4o-mini`	`gpt-4o` (small text model + vision fallback)
Ollama with text-only model	`mistral-nemo`	`llava` (separate vision model)
LM Studio	(your loaded text model)	(vision model if loaded separately)

For most users, Claude or GPT-4o "just works" — vision is built into the same model.

Cost

Each vision call sends one image and returns roughly 50 tokens of text. For high-fidelity hosted providers (Claude, GPT-4o), this is ~1-2 cents per call. For local vision models (llava), it's free.

If you're cost-conscious, set a cheaper vision-capable model as visionModel (e.g., gpt-4o-mini), and reserve the main model for reasoning.
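
To budget a run, multiply the per-call figure by the number of vision calls you expect. The sketch below uses the ~1-2 cent range quoted above as default inputs; estimateVisionCostCents is a hypothetical helper, and actual pricing depends on the provider and image size.

```typescript
// Returns a low/high estimate in cents for a number of vision calls.
// Defaults come from the rough 1-2 cents-per-call figure on this page.
function estimateVisionCostCents(
  calls: number,
  lowCentsPerCall: number = 1,
  highCentsPerCall: number = 2,
): { low: number; high: number } {
  return { low: calls * lowCentsPerCall, high: calls * highCentsPerCall };
}

// A goal with 40 verification steps lands between 40 and 80 cents.
const estimate = estimateVisionCostCents(40);
```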

When to use vision verification

Use it when:

DOM is misleading (modals, overlays, late JS render)
You want to confirm visual state matches expectations
You're navigating a page you've never seen before (describe_screen first, then act)
You're verifying success of a UI flow before claiming a goal complete

Don't use it for:

Things wait_for_selector can solve faster
Reading text on a page (get_axtree or get_content is cheaper)
Verifying URL or title changes (check the navigated field returned by click tools instead)

Where the screenshots go

Vision tools save screenshots to:

<userData>/clusterspace-data/browser-screenshots/shot-<timestamp>-<random>.png

The tool result returns screenshot_path so the AI can reference it later. ClusterSpace doesn't auto-clean this directory — periodically wipe it if it grows large.
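
Since nothing cleans the directory automatically, a small script can do it. The following is a sketch, not part of ClusterSpace: pruneScreenshots is a hypothetical helper that takes the screenshot directory as an argument (resolve the <userData> prefix yourself) and deletes only files matching the shot-*.png naming pattern shown above.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Deletes shot-*.png files older than maxAgeMs and returns the removed names.
function pruneScreenshots(dir: string, maxAgeMs: number, now: number = Date.now()): string[] {
  const removed: string[] = [];
  for (const entry of fs.readdirSync(dir)) {
    // Leave anything that does not look like a vision-tool screenshot.
    if (!/^shot-.*\.png$/.test(entry)) continue;
    const file = path.join(dir, entry);
    const ageMs = now - fs.statSync(file).mtimeMs;
    if (ageMs > maxAgeMs) {
      fs.unlinkSync(file);
      removed.push(entry);
    }
  }
  return removed;
}
```

For example, calling pruneScreenshots(dir, 7 * 24 * 60 * 60 * 1000) removes screenshots last modified more than a week ago. Keep in mind that any screenshot_path the AI recorded earlier will no longer resolve once its file is deleted.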

Failure modes

Symptom	Likely cause
All verdicts are `unclear`	Vision model is poor; try Claude or GPT-4o
Verdicts disagree with what you see	Prompt was vague — be specific about what "expected state" means visually
`success: false, error: "No vision model available"`	No active provider configured; check AI-Providers
`success: false, error: "Vision call failed: ..."`	Provider returned a model error; check if the model supports images
Screenshot is blank / mostly white	Browser pane wasn't fully rendered; precede with `wait_for_navigation` or `wait_for_selector`
