Vision Verification
Browser automation goes wrong in ways DOM checks can't catch: a cookie banner overlays your target button, an error toast appears that the DOM scan missed, the page renders but visually breaks. Vision verification closes that loop by asking a vision model "did the expected thing happen?"
Source: src/main/ai-tools/browser/vision.ts + the vision helpers in src/main/ai-manager.ts:buildVisionHelpers().
browser_verify_visual_state takes a screenshot and asks a vision model whether expected_state is visible. It returns:
{
success: true,
verdict: 'yes' | 'no' | 'unclear',
explanation: string,
screenshot_path: string
}
Use it after an action when the DOM alone won't tell you the outcome: after a click that should open a modal, after a form submit that should redirect, after a hover that should reveal a tooltip.
Example AI usage:
[step 23] browser_click(selector="#submit")
[step 24] browser_verify_visual_state(expected_state="a success toast appears at the top of the page with the text 'Saved'")
→ {verdict: 'no', explanation: "I see a red error toast saying 'Username already taken'"}
[step 25] AI now knows the click "worked" mechanically but the action failed — adjusts
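The branching in the example above can be sketched as a small function over the tool result. This is only an illustration: the success shape comes from the result documented above and the failure shape from the {success: false, error: ...} form the tool returns on model errors, while `decideNextStep` and the `NextStep` labels are hypothetical names, not part of ClusterSpace.

```typescript
// Result shapes as documented on this page.
type VerifyResult =
  | { success: true; verdict: 'yes' | 'no' | 'unclear'; explanation: string; screenshot_path: string }
  | { success: false; error: string };

// Hypothetical labels for what a caller might do next (not a ClusterSpace API).
type NextStep = 'continue' | 'replan' | 'retry_with_sharper_prompt' | 'fallback_to_dom';

function decideNextStep(result: VerifyResult): NextStep {
  // The vision call itself failed, so rely on DOM-based checks instead.
  if (!result.success) return 'fallback_to_dom';
  // The expected state is visible: the action worked.
  if (result.verdict === 'yes') return 'continue';
  // The click may have "worked" mechanically, but the outcome is wrong.
  if (result.verdict === 'no') return 'replan';
  // 'unclear': the expected_state wording was probably too vague.
  return 'retry_with_sharper_prompt';
}
```

Treating 'unclear' as a prompt problem rather than a failure is an assumption of this sketch; a caller could equally fall back to DOM checks in that case.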
browser_describe_screen returns a free-form vision description. The model names the page, lists visible interactive elements with their labels, and notes modals and loading states. It is useful when the page doesn't match expectations and the AI needs to figure out where it is before acting.
{
success: true,
description: string,
screenshot_path: string
}
Example AI usage:
[step 30] AI: "I expected the dashboard but the page reads strange. Let me check."
[step 31] browser_describe_screen()
→ {description: "Login page. Visible: email input (focused), password input (empty),
'Forgot password?' link, 'Sign in' button (disabled), Google SSO button. Top-right:
'Sign up' link. Footer: privacy policy link, terms link."}
[step 32] AI: "I was logged out. Need to re-authenticate first."
The vision helpers in AIManager.buildVisionHelpers():
- Read the screenshot file as Buffer
- Encode as data:image/png;base64,…
- Build an OpenAI-compat vision message: [{type: 'text', text: '...'}, {type: 'image_url', image_url: {url: 'data:image/png;base64,...'}}]
- POST to the vision provider (active provider with model swapped for visionModel if set)
- Strict prompt for verify: "Reply on the first line: YES / NO / UNCLEAR. Second line: one-sentence reason."
- Strict prompt for describe: "Name the page, list visible interactive elements, note modals/loading."
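The encode, build, and parse steps in the list above can be sketched as follows. This is a minimal illustration, not the code in ai-manager.ts: the function names `toDataUrl`, `buildVisionContent`, and `parseVerifyReply` are hypothetical, and treating an unrecognized first line as 'unclear' is an assumption.

```typescript
import { Buffer } from 'node:buffer';

// The two content parts of an OpenAI-compatible vision message.
type VisionContent =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } };

// Encode raw PNG bytes as a base64 data URL.
function toDataUrl(png: Buffer): string {
  return `data:image/png;base64,${png.toString('base64')}`;
}

// Build the content array: the prompt text first, then the screenshot.
function buildVisionContent(prompt: string, png: Buffer): VisionContent[] {
  return [
    { type: 'text', text: prompt },
    { type: 'image_url', image_url: { url: toDataUrl(png) } },
  ];
}

type Verdict = 'yes' | 'no' | 'unclear';

// Parse a reply that follows the strict verify prompt:
// first line YES / NO / UNCLEAR, remaining lines the reason.
// Assumption: anything else on the first line maps to 'unclear'.
function parseVerifyReply(reply: string): { verdict: Verdict; explanation: string } {
  const lines = reply
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const match = /^(YES|NO|UNCLEAR)\b/i.exec(lines[0] ?? '');
  const verdict = (match?.[1] ?? 'unclear').toLowerCase() as Verdict;
  return { verdict, explanation: lines.slice(1).join(' ') };
}
```

The word boundary in the regular expression keeps a first line such as "NOT SURE" from being read as NO; it falls through to 'unclear' instead.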
In buildVisionHelpers:
const visionProvider: AIProviderConfig = {
...baseProvider,
model: baseProvider.visionModel || baseProvider.model
}
If visionModel is set on the active provider (e.g., a separate vision-capable model), it is used. Otherwise the helper falls back to the main model, which works for dual-purpose models such as Claude (Sonnet, Opus, Haiku) and GPT-4o that handle text and vision in the same model.
When neither the main nor vision model can handle images, the vision tool calls will get a model-error response. The tool wraps it and returns {success: false, error: ...} so the AI knows to fall back to DOM-based verification.
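A rough sketch of the fallback and the error wrapping described above, under stated assumptions: `ProviderConfigSketch` is a stand-in that models only the model and visionModel fields of AIProviderConfig, and `resolveVisionModel` and `toToolError` are hypothetical helper names. The "Vision call failed: ..." prefix is the one listed in the troubleshooting table on this page.

```typescript
// Stand-in for AIProviderConfig; only the two fields discussed here are modeled.
interface ProviderConfigSketch {
  model: string;
  visionModel?: string;
}

// Same fallback as the snippet in buildVisionHelpers: `||` rather than `??`,
// so a blank visionModel ('' as well as undefined) falls back to the main model.
function resolveVisionModel(base: ProviderConfigSketch): ProviderConfigSketch {
  return { ...base, model: base.visionModel || base.model };
}

// Wrap a thrown provider error into the failure shape the AI sees,
// so it can switch to DOM-based verification instead of crashing the step.
function toToolError(err: unknown): { success: false; error: string } {
  const message = err instanceof Error ? err.message : String(err);
  return { success: false, error: `Vision call failed: ${message}` };
}
```

Because the fallback uses `||`, leaving the visionModel field blank in the provider settings behaves the same as never setting it.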
In AI-Providers:
| Provider | model | visionModel |
|---|---|---|
| Claude | claude-sonnet-4-5 | leave blank |
| OpenAI | gpt-4o | leave blank |
| OpenAI cost-optimized | gpt-4o-mini | gpt-4o (small text model + vision fallback) |
| Ollama with text-only model | mistral-nemo | llava (separate vision model) |
| LM Studio | (your loaded text model) | (vision model if loaded separately) |
For most users, Claude or GPT-4o "just works" — vision is built into the same model.
Each vision call sends roughly one image and returns about 50 tokens of text. For high-fidelity providers (Claude, GPT-4o), this costs roughly 1-2 cents per call. Local vision models (llava) have no per-call cost.
If you're cost-conscious, set a cheaper vision-capable model as visionModel (e.g., gpt-4o-mini), and reserve the main model for reasoning.
Use it when:
- DOM is misleading (modals, overlays, late JS render)
- You want to confirm visual state matches expectations
- You're navigating a page you've never seen before (describe_screen first, then act)
- You're verifying success of a UI flow before claiming a goal complete
Don't use it for:
- Things wait_for_selector can solve faster
- Reading text on a page (get_axtree or get_content is cheaper)
- Verifying URL or title changes (navigated field on click tools)
Vision tools save screenshots to:
<userData>/clusterspace-data/browser-screenshots/shot-<timestamp>-<random>.png
The tool result returns screenshot_path so the AI can reference it later. ClusterSpace doesn't auto-clean this directory — periodically wipe it if it grows large.
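Since the directory is not cleaned automatically, a periodic prune could look like the sketch below. It is an illustration only: `pruneScreenshots` is a hypothetical helper, not a ClusterSpace command, and the caller must pass the resolved screenshot directory because the <userData> location is platform-specific and is not resolved here.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Delete shot-*.png files in `dir` whose modification time is older than
// `maxAgeMs`. Returns the names of the files that were removed.
function pruneScreenshots(dir: string, maxAgeMs: number, now: number = Date.now()): string[] {
  const removed: string[] = [];
  for (const name of fs.readdirSync(dir)) {
    // Only touch files that match the documented shot-<timestamp>-<random>.png pattern.
    if (!name.startsWith('shot-') || !name.endsWith('.png')) continue;
    const file = path.join(dir, name);
    const stat = fs.statSync(file);
    if (stat.isFile() && now - stat.mtimeMs > maxAgeMs) {
      fs.unlinkSync(file);
      removed.push(name);
    }
  }
  return removed;
}
```

For example, `pruneScreenshots(dir, 7 * 24 * 60 * 60 * 1000)` would remove screenshots older than a week. Paths that earlier tool results returned as screenshot_path stop resolving once their files are pruned.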
| Symptom | Likely cause |
|---|---|
| All verdicts are unclear | Vision model is poor; try Claude or GPT-4o |
| Verdicts disagree with what you see | Prompt was vague; be specific about what "expected state" means visually |
| success: false, error: "No vision model available" | No active provider configured; check AI-Providers |
| success: false, error: "Vision call failed: ..." | Provider returned a model error; check if the model supports images |
| Screenshot is blank / mostly white | Browser pane wasn't fully rendered; precede with wait_for_navigation or wait_for_selector |
- Browser-Panes: the panes vision tools target
- AI-Providers: vision model field per provider
- Goal-Runner-Overview: how vision tools integrate into autonomous loops
- Success-Criteria: model_question (similar pattern, text-only)
- AI-Tools-Reference: full vision tool signatures