Releases: KeyCode17/rust-ai-surfer
v4.0.0 — Multi-Tenant Tier (BrowserContext isolation → SessionManager)
Consolidates everything from v3.2.0 → v4.0.0: the crates evolve from a single-user agent engine into a foundation you can embed in a multi-tenant SaaS, where many users each get their own AI agent + isolated browser session. Built as five phases, each released as a minor; v4.0.0 is the major that marks the tier complete (and carries the breaking API changes below).
Design + audit trail: docs/superpowers/specs/2026-05-30-multi-tenant-tier-design.md.
What landed, phase by phase
v3.2.0 — Phase 1: CDP BrowserContext isolation (the backbone)
- `ras_types::ContextId`; `BrowserPort::{create_context, close_context, new_target_in, list_targets_in}`.
- `ras-cdp::context_ops`: isolated context per user, tabs scoped to a context, context-scoped cookie clear (`Storage.clearCookies{browserContextId}`), per-context download directory.
- Proven by a live e2e against headless Chrome: a cookie set in context A is invisible in context B; `list_targets_in` returns only that context's tabs.
v3.3.0 — Phase 2: Egress / SSRF policy
- `ras_validation::EgressPolicy`: scheme allowlist (http/https), blocks loopback/private/link-local + cloud metadata 169.254.169.254 + `localhost`, denied ports (incl. the CDP port), consumer allow/deny lists.
- `navigate` is gated through the policy before touching the browser, so an agent can't be steered straight to metadata/internal hosts/`file://`.
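The rules above can be sketched with the standard library alone. This is a minimal illustration, not the `EgressPolicy` API: `Verdict`, `check_egress`, and its pre-split `(scheme, host, port)` arguments are hypothetical names, and consumer allow/deny lists are left out. Hostnames that are not IP literals pass the address check, which matches the DNS-rebinding work listed as deferred.

```rust
use std::net::IpAddr;

/// Hypothetical verdict type; the real `EgressPolicy` API is not shown in these notes.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Allow,
    Deny(&'static str),
}

/// Checks an already-split URL against the scheme, port, and address rules.
/// `host` is expected without IPv6 brackets.
pub fn check_egress(scheme: &str, host: &str, port: u16, denied_ports: &[u16]) -> Verdict {
    // Scheme allowlist: only http and https may leave the sandbox.
    if scheme != "http" && scheme != "https" {
        return Verdict::Deny("scheme not allowed");
    }
    // Denied ports, for example the CDP debugging port.
    if denied_ports.contains(&port) {
        return Verdict::Deny("port denied");
    }
    if host.eq_ignore_ascii_case("localhost") {
        return Verdict::Deny("localhost");
    }
    // IP literals: block loopback, private, and link-local ranges.
    // 169.254.169.254 (cloud metadata) falls inside the link-local range.
    if let Ok(ip) = host.parse::<IpAddr>() {
        let internal = match ip {
            IpAddr::V4(v4) => {
                v4.is_loopback() || v4.is_private() || v4.is_link_local() || v4.is_unspecified()
            }
            IpAddr::V6(v6) => v6.is_loopback() || v6.is_unspecified(),
        };
        if internal {
            return Verdict::Deny("internal address");
        }
    }
    Verdict::Allow
}
```

Calling `check_egress("http", "169.254.169.254", 80, &[9222])` yields a `Deny`, while a public hostname on port 443 is allowed.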
v3.4.0 — Phase 3: Agent target binding
- `ToolContext.target` + `RunAgent::with_target`; every builtin now acts on its bound tab instead of "the last focused tab". Removes the cross-session tab race; `focused_target` is gone from the tool layer.
v3.5.0 — Phase 4: Context-tagged event producer
- The CDP→`BrowserEvent` pipeline, which previously discarded events, now forwards them: per-tab listeners (`attach_events`) send navigation + dialog events to a supplied `EventBus`, so a session that owns its bus receives only its own tab's events. This gives structural per-session event isolation.
v3.6.0 — Phase 5: ras-session crate (capstone)
- `SessionManager<Owner>`: `spawn`/`get`/`list`, bounded (`max_sessions` + `OnFull::Reject|EvictOldest`), one-session-per-owner, background idle reaper.
- `SessionHandle`: per-session event stream, `run(task)` with a one-active-task guard (`SessionError::Busy`) wired to the bound agent, explicit `close()` that disposes the context.
- `BrowserProvider` trait (ships `SharedBrowserProvider` = one Chrome, context-per-session; swap for process-per-tenant for a hard boundary).
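The bounded, one-session-per-owner behaviour can be illustrated with a small in-memory registry. This is a sketch under assumptions, not the `ras-session` implementation: `Registry`, `SpawnError`, and the choice to return an error on a second spawn for the same owner are hypothetical, and the idle reaper and browser contexts are omitted. Only the `OnFull::Reject|EvictOldest` names come from the notes.

```rust
use std::collections::HashMap;
use std::hash::Hash;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OnFull {
    Reject,
    EvictOldest,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum SpawnError {
    Full,
    AlreadyExists,
}

/// Hypothetical registry: maps each owner to the sequence number of its session.
pub struct Registry<Owner> {
    max_sessions: usize,
    on_full: OnFull,
    next_seq: u64,
    sessions: HashMap<Owner, u64>,
}

impl<Owner: Eq + Hash + Clone> Registry<Owner> {
    pub fn new(max_sessions: usize, on_full: OnFull) -> Self {
        Registry { max_sessions, on_full, next_seq: 0, sessions: HashMap::new() }
    }

    /// Registers a session for `owner`. Returns the owner that was evicted
    /// to make room, if any.
    pub fn spawn(&mut self, owner: Owner) -> Result<Option<Owner>, SpawnError> {
        // One session per owner (assumed here to be an error on a repeat spawn).
        if self.sessions.contains_key(&owner) {
            return Err(SpawnError::AlreadyExists);
        }
        let mut evicted = None;
        if self.sessions.len() >= self.max_sessions {
            match self.on_full {
                OnFull::Reject => return Err(SpawnError::Full),
                OnFull::EvictOldest => {
                    // The oldest session is the one with the smallest sequence number.
                    let oldest = self
                        .sessions
                        .iter()
                        .min_by_key(|(_, seq)| **seq)
                        .map(|(o, _)| o.clone());
                    match oldest {
                        Some(o) => {
                            self.sessions.remove(&o);
                            evicted = Some(o);
                        }
                        // Only reachable when max_sessions is 0.
                        None => return Err(SpawnError::Full),
                    }
                }
            }
        }
        self.sessions.insert(owner, self.next_seq);
        self.next_seq += 1;
        Ok(evicted)
    }

    pub fn len(&self) -> usize {
        self.sessions.len()
    }
}
```

With `max_sessions = 2` and `EvictOldest`, a third owner displaces the first one spawned; with `Reject`, the third spawn fails and existing sessions are untouched.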
v4.0.0 — major
- Marks the tier complete and absorbs the breaking changes below.
- Internal workspace dependency pins moved `3.x` → `4.0.0` so the per-crate crates.io publish resolves at the new major.
Breaking changes
- `BrowserPort` gained context + `attach_events` methods (default-impl'd, so most consumers are unaffected; custom `BrowserPort` impls inherit safe defaults).
- `ToolContext` gained a `target: Option<TargetId>` field; any code constructing `ToolContext` directly must set it.
Isolation guarantees (and honest limits)
- Per-user separation is enforced at the primitive layer (contexts, scoped targets/storage, per-tab events) so it holds even if a consumer writes their own session manager.
- One shared Chrome = a soft process boundary, sufficient for web-session isolation; for a hard boundary against a hostile page, supply a process-per-tenant `BrowserProvider`.
- Auth/transport are intentionally not in the crates; the consumer owns them (a capability-by-`SessionId` model + example wiring is the intended pattern).
Deferred (tracked follow-ups)
- CDP `Fetch.requestPaused` enforcement for 3xx redirects + JS-driven navigation, and DNS-rebinding (Phase 2b).
- Broader event coverage (downloads, network, target lifecycle) on the existing per-tab→bus model.
🤖 Generated with Claude Code
v3.1.0 — Richer clickable naming for DOM grounding
Improves the DOM-grounding clickable map the agent sees, so it can tell elements apart from the text channel instead of relying on the screenshot alone — especially anchors, JS-handler buttons, and icon-only controls.
What changed
- Detection now unions the markup heuristic with CDP's native `isClickable`, catching elements made clickable via JS event listeners (not just `<a>`/`<button>`/`onclick`).
- Naming falls back through a chain so elements stop arriving as bare `[N] tag`: `aria-label`/`alt`/`title`/`name`/`placeholder` → descendant text (e.g. `<a><span>Home</span></a>` → "Home") → `onclick` handler name (`viewBankAccount('0')` → "view bank account", verb only) → FontAwesome icon class (`fa-eye` → "view", `fa-plus-circle` → "add") → `role`.
- `role` overrides the displayed tag, so `<i role="button">` is shown as `button`, not `i`.
- Clickable map cap raised 80 → 200, sorted visible-first so on-screen elements survive truncation on dense pages (e.g. long sidebars).
- `snapshot_parser` split into `snapshot_parser`/`clickables`/`clickable_naming` to honor the 200-LOC module cap.
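Two links of the naming chain, the `onclick` handler name and the FontAwesome icon class, can be sketched as plain string functions. The function names are hypothetical and the splitting rule is an assumption: the handler's identifier is split on camelCase boundaries and arguments are dropped. Only the two icon mappings quoted above are included; the real table is presumably larger.

```rust
/// Turns an inline handler such as `viewBankAccount('0')` into "view bank account".
/// Sketch only: acronyms like `loadURL` would be split letter by letter.
pub fn name_from_onclick(handler: &str) -> Option<String> {
    // Keep the identifier before the first '(' and drop any `object.` prefix.
    let ident = handler.trim().split('(').next()?.rsplit('.').next()?.trim();
    if ident.is_empty() || !ident.chars().all(|c| c.is_ascii_alphanumeric() || c == '_') {
        return None;
    }
    let mut words = String::new();
    for c in ident.chars() {
        // Underscores and uppercase letters both start a new word.
        if (c == '_' || c.is_ascii_uppercase()) && !words.is_empty() && !words.ends_with(' ') {
            words.push(' ');
        }
        if c != '_' {
            words.push(c.to_ascii_lowercase());
        }
    }
    let words = words.trim().to_string();
    if words.is_empty() { None } else { Some(words) }
}

/// Maps a `class` attribute to a name using the two icon mappings quoted in the notes.
pub fn name_from_icon(class_attr: &str) -> Option<&'static str> {
    class_attr.split_whitespace().find_map(|token| match token {
        "fa-eye" => Some("view"),
        "fa-plus-circle" => Some("add"),
        _ => None,
    })
}
```

A caller would try these only after the attribute and descendant-text steps have produced nothing, then fall back to `role`.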
Verification
- Unit tests for the naming chain (text/onclick/icon/role precedence, FontAwesome mapping, style-token filtering).
- Live e2e against headless Chrome confirming anchor text, nested-span text, `onclick`-derived names, and `role`→`button` all resolve on a real page.
v3.0.0 — PerimeterX press-and-hold solver + CDP primitives
Highlights
End-to-end PerimeterX press-and-hold challenge solver via pure CDP — verified clearing the deny page on pedidosya.com.ar. Includes the new BrowserPort primitives that make it possible plus the agent + tools surface to drive it from an LLM.
⚠️ Breaking changes
- chromiumoxide 0.6 → 0.9 (Chrome 142+ wire compatibility — 0.6's protocol crate can't deserialize new CDP message variants)
- `BrowserPort` trait gains 6 required methods:
  - `mouse_down(x, y)`, `mouse_up(x, y)`, `mouse_move(x, y, buttons_mask)`
  - `mouse_hold(x, y, ms)`: humanized press-and-hold (pre-approach + ±1px jitter during hold)
  - `block_urls(patterns)`: `Network.setBlockedURLs` glob suppression
  - `clear_cookies(origin)`: `Storage.clearDataForOrigin` (cookies + storage)
- All workspace internal version pins bumped 2.1.0 → 3.0.0
New tools (ras-tools)
- `press_and_hold_element { index, ms? }`: resolves bbox center of a clickable, dispatches humanized hold (default 12000ms, max 60000ms)
- `press_and_hold_coordinate { x, y, ms? }`: raw pixel variant for buttons inside cross-origin iframes / closed shadow DOM
Logging
- `ras_agent::brain` target: per-step `agent decision`, `agent action params`, `action result` records
- `ras_llm_openai` TRACE level: full request + response body dumps
- Presets:
  - `RUST_LOG=info,ras_agent::brain=info`: brain + action stream
  - `RUST_LOG=info,ras_agent::brain=debug`: with raw action params
  - `RUST_LOG=info,ras_llm_openai=trace`: full LLM I/O
New examples
| Example | What it does |
|---|---|
| `pedidosya_px_solver` | Pure-CDP (no LLM) PerimeterX solver. Clears cookies, locates `#px-captcha` shadow host, holds ~18s with humanized approach + jitter. Verified to clear the deny page on pedidosya.com.ar. |
| `pedidosya_perimeterx_smoke` | LLM-driven variant exercising `press_and_hold_*` builtins. |
| `mouse_drag_smoke` | autodraw.com canvas drawing smoke: verifies CDP `isTrusted=true` mouse pipeline draws a recognizable circle. |
| `youtube_search_claude_code` | Claude Code OAuth path via `ChatAnthropicClaudeCode`. |
Internal
- `ras-cdp` adapter refactored: `mouse_input.rs` + `cdp_ext.rs` extracted; new `with_page` helper and `op!` macro deduplicate the page_for+within boilerplate.
- Pre-commit guard scripts (`check-loc.sh`, `check-no-comments.sh`, `check-no-unwrap.sh`) fixed to exclude `examples/`, `tests/`, `xtask/` when staged without a parent path component.
Verified end-to-end
- `cargo check --workspace --all-targets` ✓
- `cargo clippy --workspace --all-targets -- -D warnings` ✓
- `cargo test -p ras-tools -p ras-agent -p ras-cdp` ✓
- Manual: `cargo run --example pedidosya_px_solver`, PASS (PX cleared, redirected to homepage)
- Manual: `cargo run --example mouse_drag_smoke`, PASS (visible circle drawn on autodraw canvas)
2.8.0
Fixed
- `ras-cdp` no longer pulls `async-std`. The workspace `chromiumoxide` pin previously enabled `tokio-runtime` additively on top of `chromiumoxide`'s defaults, which meant `async-std-runtime` was still on. We now set `default-features = false` and enable only `tokio-runtime` + `bytes`.
- Resolves RUSTSEC-2025-0052 (async-std unmaintained) for every downstream that consumes `ras-cdp` (e.g. `ai-flow-surfer`).
Why a minor bump
publish.yml gates patch versions; bumping to 2.8.0 fires the crates.io publish job. No public API changed.
Verification
`cargo tree -p ras-cdp -i async-std` → `did not match any packages`.
v2.7.0
v2.7.0 - grounding chain closes
v2.6.0
Highlights
Agent DOM grounding lands. v2.5.0 shipped ChromiumoxideDomExtractor but nothing called it. v2.6.0 closes the loop: every step captures a fresh DOM snapshot, the next prompt carries a numbered clickable map, and the LLM operates on real page state instead of fabricating from text narration.
This is Phase B of the agent grounding fix. Combined with v2.4.0 (vision feedback) and v2.5.0 (extractor implementation), the agent now sees and references the DOM the model claims to interact with.
Closes #31.
What's new
ras-cdp — ChromiumoxideAdapter::browser_arc() (closes #31)
Returns a clone of the adapter's Arc<Mutex<Browser>> so the extractor and adapter share one CDP connection / target space:
```rust
let adapter = ChromiumoxideAdapter::connect(ws, timeout).await?;
let extractor = ChromiumoxideDomExtractor::new(adapter.browser_arc(), timeout);
```
The v2.5.0 release notes referenced this accessor before it existed. It's real now.
ras-agent — DOM extractor wired through the loop
- `RunAgent::with_dom_extractor(Arc<dyn DomExtractor>)` builder. `None` preserves pre-2.6.0 behavior.
- `RunStep` calls `extractor.snapshot(target)` after every step and stores the result on the new `StepRecord.summary: Option<BrowserStateSummary>`.
- Failures degrade gracefully: `tracing::warn` + `summary: None`. Grounding is auxiliary, not required.
- Field is `#[serde(default, skip_serializing_if = "Option::is_none")]`, so old serialized history files still deserialize cleanly.
Prompt format — numbered clickable map
```
Step 4 result:
url: https://example.com/login
action results:
[0] clicked login button
clickable_elements:
[0] button "Sign in"
[1] input "Email"
[2] input "Password"
…and 5 more (truncated)
```
Indexes match ClickableElement.index from the snapshot. click_element(index=N) is now grounded in real DOM state instead of model guesses. ax_name takes precedence over label. List capped at 80 elements with overflow marker.
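The rendering rules described here (name precedence, cap, overflow marker) can be sketched as follows. `Clickable` is a simplified stand-in for `ClickableElement`, whose real field types are not shown in these notes, and `render_clickable_map` is a hypothetical name; the output layout follows the prompt sample above.

```rust
/// Simplified stand-in for `ClickableElement`; field types are assumptions.
pub struct Clickable {
    pub index: usize,
    pub tag: String,
    pub ax_name: Option<String>,
    pub label: Option<String>,
}

/// Cap stated in the v2.6.0 notes.
pub const CLICKABLE_LIMIT: usize = 80;

/// Renders the numbered clickable map, or an empty string when there is nothing to show.
pub fn render_clickable_map(items: &[Clickable], limit: usize) -> String {
    if items.is_empty() {
        return String::new();
    }
    let mut out = String::from("clickable_elements:\n");
    for item in items.iter().take(limit) {
        // `ax_name` wins over `label`; empty strings count as missing.
        let name = item
            .ax_name
            .as_deref()
            .filter(|s| !s.is_empty())
            .or(item.label.as_deref().filter(|s| !s.is_empty()));
        match name {
            Some(n) => out.push_str(&format!("[{}] {} \"{}\"\n", item.index, item.tag, n)),
            // No quotes at all when neither name is available.
            None => out.push_str(&format!("[{}] {}\n", item.index, item.tag)),
        }
    }
    if items.len() > limit {
        out.push_str(&format!("…and {} more (truncated)\n", items.len() - limit));
    }
    out
}
```

Because the printed number is the element's own `index`, the value the model passes to `click_element(index=N)` maps straight back to the snapshot.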
Screenshot precedence
If summary.screenshot_b64 is present, it is the sole image part attached to the user message. Otherwise the legacy path (one part per ActionResult.images entry) kicks in. Steps that explicitly screenshot no longer end up with two image parts in the prompt.
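The precedence rule amounts to a single match. A minimal sketch, assuming base64 payloads are passed as plain strings; `select_images` is a hypothetical helper, not a function from `ras-agent`.

```rust
/// Returns the base64 payloads that should become image parts of the next user message.
pub fn select_images<'a>(
    summary_screenshot: Option<&'a str>,
    action_images: &'a [String],
) -> Vec<&'a str> {
    match summary_screenshot {
        // The extractor's screenshot, when present, is the only image attached.
        Some(b64) => vec![b64],
        // Legacy path: one image per `ActionResult.images` entry, in order.
        None => action_images.iter().map(String::as_str).collect(),
    }
}
```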
ras-llm::ChatMessage::user_parts (carried context)
The constructor for mixed-content user messages introduced in v2.4.0 is the load-bearing primitive Phase B uses to attach text + screenshot in one turn.
Architecture decisions
- `ras-agent` already depended on `ras-dom`; no new crate dep.
- `clickable_map` extracted to its own module to keep `render_step_message.rs` under the 200-LOC cap. Split is also semantic: clickable rendering is independent of step message assembly.
- `ScriptedDomExtractor` mock in the integration test bypasses real Chrome for fast deterministic coverage. Real CDP testing requires cosmium and remains a manual smoke step.
Tests
5 new clickable_map unit tests:
- empty clickables → empty string
- ax_name precedence over label
- label fallback when no ax_name
- no quotes when neither
- truncation past CLICKABLE_LIMIT (80) with "…and N more" marker
New integration test dom_extractor_grounding_reaches_next_prompt:
- `ScriptedDomExtractor` returns a canned `BrowserStateSummary` with two clickables (button "Sign in" + input "Email") and a known screenshot byte marker.
- Asserts step 2's prompt contains `clickable_elements:` text with both rendered indexes AND the extractor's screenshot bytes.
- Asserts `extractor.snapshot` was invoked at least once across the run.
Total ras-agent: 18 unit + 5 integration. Workspace: 97 test groups all green.
Verification
- `cargo test --workspace --no-fail-fast`: clean
- `cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro`: clean
- `cargo fmt --all -- --check`: clean
- `cargo doc --workspace --no-deps`: clean
LOC per new/modified file (200 cap):
- `clickable_map.rs`: 115
- `render_step_message.rs`: 173
- `run_step.rs`: 161
- `run_agent.rs`: 168
Compatibility
- All new APIs additive (`browser_arc()`, `with_dom_extractor`, `StepRecord.summary`).
- No public-API breaks anywhere in the workspace.
- Workspace MSRV unchanged.
- Old serialized `AgentHistoryList` files deserialize cleanly (`summary` defaults to `None`).
Migration
```rust
// before (v2.5.0): no DOM grounding
let agent = RunAgent::new(task, llm, registry, browser, events);
// after (v2.6.0): wire extractor for real grounding
use std::sync::Arc;
use std::time::Duration;
use ras_dom::{ChromiumoxideDomExtractor, DomExtractor};
let adapter = ChromiumoxideAdapter::connect(ws_url, Duration::from_secs(60)).await?;
let extractor: Arc<dyn DomExtractor> = Arc::new(
    ChromiumoxideDomExtractor::new(adapter.browser_arc(), Duration::from_secs(30)),
);
let browser: Arc<dyn BrowserPort> = Arc::new(adapter);
let agent = RunAgent::new(task, llm, registry, browser, events)
.with_dom_extractor(extractor);
```
Deferred (still after this release)
- Full `EnhancedDomTreeNode` tree: currently `tree: None` in `BrowserStateSummary`.
- `stable_hash`: currently empty in `ClickableElement.stable_hash`.
- Real AX tree via `Accessibility.getFullAXTree`: current `ax_name` from attributes is a sound MVP.
- Paint-order occlusion: `paint_orders` requested but not yet used to drop covered elements.
- Per-action snapshot: current snapshot fires once per step after all actions complete. Per-click feedback is plausible if models need finer-grained grounding; defer until evidence.
Artifacts
- Linux x86_64: `ras-x86_64-unknown-linux-gnu`, `ras-daemon-x86_64-unknown-linux-gnu`
- macOS arm64: `ras-aarch64-apple-darwin`, `ras-daemon-aarch64-apple-darwin`
- crates.io: all `ras-*` workspace crates published at `2.6.0` once `publish.yml` finishes
Pull requests
- #32: `feat(agent): DOM grounding via ChromiumoxideDomExtractor (v2.6.0)`
- #33: `release: v2.6.0 (agent DOM grounding)`
Sub-phase commits
- B1 `fix(cdp): add ChromiumoxideAdapter::browser_arc() accessor`: 2.5.1 (closes #31)
- B2 `feat(agent): capture DOM snapshot per step via Option<Arc<dyn DomExtractor>>`: 2.5.2
- B3 `feat(agent): inject numbered clickable map + prefer extractor screenshot`: 2.5.3
- B4 `feat(examples): wire ChromiumoxideDomExtractor into claude_code_oauth_cosmium`: 2.5.4
- B5 `test(agent): integration test for end-to-end DOM grounding flow`: 2.5.5
- `chore: bump to 2.6.0`
Closes: #31
Full changelog: v2.5.0...v2.6.0
v2.5.0
Highlights
DomExtractor has a real implementation. Before v2.5.0 the trait existed in ras-dom with zero implementations — BrowserStateSummary was unreachable from any action and Phase B (numbered clickable index map for prompts) was blocked.
v2.5.0 ships ChromiumoxideDomExtractor using pure CDP via DOMSnapshot.captureSnapshot, the same primitive Puppeteer and Playwright use for fast structural snapshots.
This is Phase C of the agent grounding fix. Phase B (prompt wiring) ships as 2.6.0.
This release also carries the v2.4.1 anthropic ImageUrl fix (#27) to crates.io — that patch was held per publish.yml patch-skip gate and folded into this minor.
What's new
ras-dom::ChromiumoxideDomExtractor
- New module `ras-dom/src/infrastructure/chromiumoxide/` with `extractor.rs`, `snapshot.rs`, `snapshot_parser.rs`, `highlight.rs`.
- `ChromiumoxideDomExtractor::new(Arc<Mutex<Browser>>, Duration)`: wires a browser handle plus request timeout.
- Implements the `DomExtractor` trait that has lived without an impl since 2.0.
- Re-exported as `ras_dom::ChromiumoxideDomExtractor`.
snapshot() — pure-CDP path
- One `Page.execute(CaptureSnapshotParams)` round-trip with `includePaintOrder = true`, `includeDOMRects = true`.
- Parser walks `NodeTreeSnapshot` parallel arrays (`node_name`, `attributes`, `backendNodeId`) resolving `StringIndex` references through `resp.strings`.
- Layout: `node_index → BoundingBox` map from `LayoutTreeSnapshot.bounds`.
- Clickable detection: tag in `{a, button, input, select, textarea, summary, label, details}` OR presence of `onclick`/`tabindex`/`role`/`aria-pressed`/`aria-checked`.
- `ax_name` derived from first non-empty of `aria-label`, `alt`, `title`, `name`, `placeholder`. `label` from `value` attribute.
- Tabs via `Browser.execute(GetTargetsParams)` filtered to `type=="page"`.
- Inline screenshot via `Page.screenshot(PNG, viewport)`.
- Whole flow wrapped in `tokio::time::timeout(request_timeout)`.
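The clickable heuristic and the `ax_name` fallback listed above reduce to two small predicates over a node's tag and attributes. A minimal sketch, assuming the attributes have already been resolved from the snapshot's string table into a `HashMap`; the function names and that map shape are assumptions, while the tag and attribute lists are taken from the notes.

```rust
use std::collections::HashMap;

const CLICKABLE_TAGS: [&str; 8] =
    ["a", "button", "input", "select", "textarea", "summary", "label", "details"];
const CLICKABLE_ATTRS: [&str; 5] = ["onclick", "tabindex", "role", "aria-pressed", "aria-checked"];
const NAME_ATTRS: [&str; 5] = ["aria-label", "alt", "title", "name", "placeholder"];

/// True when the tag is inherently interactive or any interactivity attribute is present.
pub fn is_clickable(tag: &str, attrs: &HashMap<String, String>) -> bool {
    // Snapshot node names can arrive upper-cased for HTML elements, so normalise first.
    let tag = tag.to_ascii_lowercase();
    CLICKABLE_TAGS.iter().any(|t| *t == tag)
        || CLICKABLE_ATTRS.iter().any(|a| attrs.contains_key(*a))
}

/// First non-empty value among the naming attributes, in priority order.
pub fn ax_name(attrs: &HashMap<String, String>) -> Option<&str> {
    NAME_ATTRS
        .iter()
        .filter_map(|a| attrs.get(*a))
        .map(|v| v.trim())
        .find(|v| !v.is_empty())
}
```

An `<input>` with a blank `title` and `placeholder="Email"` is clickable and named "Email", since the blank `title` is skipped.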
highlight() — draw-bbox canvas overlay
- `Page.evaluate` installs a fixed-position 100vw/100vh `pointer-events: none` overlay div at z-index 2^31-1 (highest valid value, sits above app UI without intercepting events).
- Same selector set as the snapshot parser. Slices to `options.max_index` (default 200).
- Per visible element: 2px `#ff3366` border box (gated on `options.draw_bounding_boxes`) and `[N]` index label above it (gated on `options.include_text_labels`).
- After screenshot, a second `Page.evaluate` unconditionally removes the overlay.
- Index labels match the index space `snapshot()` produces: a model that sees `[3]` in the highlighted screenshot can call `click_element(index=3)` directly. Phase B wires the prompt plumbing.
ras-llm-anthropic — ImageUrl native source.type=url (carried from v2.4.1 / #27)
- `AnthropicImageSource` refactored from struct to enum; `ContentPart::ImageUrl` now emits Anthropic's native `{"type":"image","source":{"type":"url","url":"..."}}` shape.
Known gap (#31)
ChromiumoxideAdapter does not yet expose a browser_arc() accessor for its Arc<Mutex<Browser>> field. Today the only ways to construct ChromiumoxideDomExtractor are:
- Open a second `Browser::connect_with_config` to the same CDP URL (doubles WebSocket connections, separate target space).
- Custom adapter path.
The accessor will land in v2.6.0 alongside Phase B's ToolContext wiring. Tracked at #31.
Architecture decisions
- Impl lives in `ras-dom`, not `ras-cdp`, because `ras-dom → ras-cdp` is the existing dependency direction. Reversing it would have caused a cycle. `ras-dom` now depends on `chromiumoxide` and `tokio` directly.
- `ChromiumoxideDomExtractor` takes a shared `Arc<Mutex<Browser>>` instead of owning its own connection, so the caller decides how the handle is shared.
- No fixture-JSON parser unit tests in this release. The `chromiumoxide_cdp` types are codegen'd from a `.pdl` file; constructing valid `CaptureSnapshotReturns` by hand is mechanical busywork that doesn't catch real bugs (which live at the CDP wire level). Real verification needs a live Chrome.
Deferred to follow-ups
- #31: `ChromiumoxideAdapter::browser_arc()` accessor (target: v2.6.0).
- Full `EnhancedDomTreeNode` tree: `tree: None` in `BrowserStateSummary`. Phase B prompt injection only needs `clickables`.
- `stable_hash`: empty string in `ClickableElement.stable_hash`. Wiring `ras_dom::application::stable_hash` requires building the tree first.
- Real AX tree via `Accessibility.getFullAXTree`: current `ax_name` from attributes is a sound MVP but misses computed accessibility names.
- Paint-order occlusion: `paint_orders` requested in the CDP call but not yet used. `ras_dom::application::paint_order` exists; can plug in.
- Phase B (v2.6.0): wire `Arc<dyn DomExtractor>` into `ToolContext`, post-action snapshot in click/navigate/scroll, numbered index map in agent prompt.
Verification
- `cargo test --workspace --no-fail-fast`: all 97 test groups pass
- `cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro`: clean
- `cargo fmt --all -- --check`: clean
- `cargo doc --workspace --no-deps`: clean
LOC per file (200 cap):
- `extractor.rs`: 51
- `snapshot.rs`: 150
- `snapshot_parser.rs`: 178
- `highlight.rs`: 115
Compatibility
- New types are purely additive (`ChromiumoxideDomExtractor`, new module path).
- `ras-dom` direct dependencies grew: now depends on `chromiumoxide` and `tokio` directly (transitively via `ras-cdp` before, but explicit now).
- No breaking changes to public APIs in any existing crate.
- Workspace MSRV unchanged.
Artifacts
- Linux x86_64: `ras-x86_64-unknown-linux-gnu`, `ras-daemon-x86_64-unknown-linux-gnu`
- macOS arm64: `ras-aarch64-apple-darwin`, `ras-daemon-aarch64-apple-darwin`
- crates.io: all `ras-*` workspace crates published at `2.5.0` once `publish.yml` finishes (v2.4.1 anthropic fix carried)
Pull requests
- #29: `feat(dom): ChromiumoxideDomExtractor via DOMSnapshot.captureSnapshot (v2.5.0)`
- #30: `release: v2.5.0 (CDP DomExtractor)`
Sub-phase commits
- `feat(dom): scaffold ChromiumoxideDomExtractor (Phase C1)`: 2.4.2
- `feat(dom): implement snapshot() via DOMSnapshot.captureSnapshot (Phase C2)`: 2.4.3
- `feat(dom): implement highlight() with draw-bbox canvas overlay (Phase C3)`: 2.4.4
- `chore: bump to 2.5.0`
Full changelog: v2.4.1...v2.5.0
v2.4.1
Highlights
Anthropic ImageUrl fix. Closes #24. Before this patch, ContentPart::ImageUrl { url } was silently mapped to AnthropicContent::Text { text: url } in ras-llm-anthropic — a URL-referenced image (CDN, hosted bucket, signed S3 link) lost its image semantics and the model received a literal URL string in a text block. No vision grounding for the URL path on Anthropic.
v2.4.1 emits Anthropic's native {"type":"image","source":{"type":"url","url":"..."}} shape.
What's new
ras-llm-anthropic — native URL source
- `AnthropicImageSource` refactored from struct to enum tagged on `type`:
  - `Base64 { media_type, data }`: wire shape unchanged from 2.4.0.
  - `Url { url }`: new variant emitting Anthropic's native URL image source.
- `content_part_to_anthropic` now maps both `ContentPart::ImageBase64` and `ContentPart::ImageUrl` to `AnthropicContent::Image` with the appropriate source variant.
ras-agent — unchanged
Phase A (v2.4.0) emits ContentPart::ImageBase64 for screenshots, which already worked. This fix is only relevant to callers that bypass the screenshot pipeline and pass image URLs directly through ChatMessage::user_parts — for example, hosted screenshot services or pre-uploaded asset references.
Tests
2 new unit tests in ras-llm-anthropic/src/infrastructure/http/dto.rs:
- `image_base64_serializes_with_native_source`: asserts `type=image`, `source.type=base64`, `media_type` + `data` present, `url` absent.
- `image_url_serializes_with_native_url_source`: asserts `type=image`, `source.type=url`, `url` present, `media_type` + `data` absent.
Verification
- `cargo test --workspace --no-fail-fast`: all suites pass
- `cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro`: clean
- `cargo fmt --all -- --check`: clean
- `cargo doc --workspace --no-deps`: clean
crates.io
Not published. publish.yml gate skips patch bumps. The fix folds into the next minor release (planned Phase B) and reaches crates.io there. If you need 2.4.1 on crates.io before then, trigger publish.yml manually via workflow_dispatch with force=true.
Compatibility
- `AnthropicImageSource` is `pub` but only used inside `ras-llm-anthropic::infrastructure::http::dto`; no external consumers exist in the workspace. The struct → enum change is breaking for any out-of-tree consumer constructing `AnthropicImageSource` directly, which is unlikely given the type's role as an internal serialization DTO.
- `ContentPart` API unchanged.
- `ChatMessage` API unchanged.
- Workspace MSRV unchanged.
Artifacts
- Linux x86_64: `ras-x86_64-unknown-linux-gnu`, `ras-daemon-x86_64-unknown-linux-gnu`
- macOS arm64: `ras-aarch64-apple-darwin`, `ras-daemon-aarch64-apple-darwin`
- crates.io: not published this release (see above)
Pull requests
- #27: `fix(anthropic): emit native source.type=url for ContentPart::ImageUrl (v2.4.1)`
- #28: `release: v2.4.1 (anthropic ImageUrl fix)`
Closes: #24
Full changelog: v2.4.0...v2.4.1
v2.4.0
Highlights
Vision feedback lands. Before v2.4.0, ScreenshotAction captured PNG bytes via ActionResult::with_image(b64) but the next prompt's renderer returned a plain String and discarded the image. Models received only text narration, fabricated everything visual, and modern Claude/GPT-4o runs against real sites produced no observable network traffic past the initial GET. v2.4.0 closes that single broken link end-to-end.
This is Phase A of the broader grounding fix. Phase B (numbered clickable index map) and Phase C (full CDP DomExtractor impl) ship in later releases.
Surprise findings during scope review
The bug report assumed ChatMessage was text-only and that all 14 provider DTOs needed image support added. Actuals:
- `ContentPart::{ImageBase64, ImageUrl}` already existed in `ras-llm`.
- `ras-llm-openai` DTO already serialized both to OpenAI vision format (`{"type":"image_url","image_url":{"url":"data:..."}}`).
- `ras-llm-anthropic` DTO already mapped `ImageBase64` to native Anthropic image source.
- 6 OpenAI-compatible providers (cerebras, deepseek, groq, mistral, openrouter, vercel) inherit serialization via `ChatOpenAICompatible`.
- 6 scaffold providers (bedrock, cloud, google, langchain, oci, ollama) have empty `mod.rs`: no `LlmClient` impl yet, nothing to update.
The single broken link was in ras-agent. Scope of actual code change collapsed to one new module + one constructor + one render call site swap.
What's new
ras-llm — multipart constructor
- New `ChatMessage::user_parts(parts: Vec<ContentPart>)` constructor for emitting mixed-content user messages directly.
ras-agent — image-aware result message
- New module `ras_agent::application::render_step_message`:
  - Returns `Option<ChatMessage>` (`None` when results are empty so no spurious user turns).
  - One `ContentPart::Text` part with step header (`Step N result:`), `url:` line, and per-action result summaries (truncated to 480 chars, errors to 240).
  - One `ContentPart::ImageBase64 { media_type: "image/png", data }` part per `ActionResult.images` entry.
- `run_agent::build_prompt` now calls `render_step_message` and pushes the returned `ChatMessage`. Old text-only `render_step_results` removed.
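The text part of that message can be sketched with a character-safe truncation helper. This is an illustration under assumptions: `ActionOutcome`, `render_step_text`, the `error:` prefix, and the trailing `…` on truncated strings are all hypothetical; only the 480/240 limits, the `None`-on-empty rule, and the header layout come from the notes.

```rust
/// Hypothetical, simplified stand-in for `ActionResult`.
pub struct ActionOutcome {
    pub summary: Option<String>,
    pub error: Option<String>,
}

const SUMMARY_LIMIT: usize = 480;
const ERROR_LIMIT: usize = 240;

/// Cuts `text` to at most `max_chars` characters without splitting a UTF-8 sequence.
pub fn truncate_chars(text: &str, max_chars: usize) -> String {
    match text.char_indices().nth(max_chars) {
        // `byte_idx` is the start of the first character past the limit.
        Some((byte_idx, _)) => format!("{}…", &text[..byte_idx]),
        None => text.to_string(),
    }
}

/// Builds the text part; returns `None` when there are no results to report.
pub fn render_step_text(step: usize, url: &str, results: &[ActionOutcome]) -> Option<String> {
    if results.is_empty() {
        return None;
    }
    let mut out = format!("Step {} result:\nurl: {}\naction results:\n", step, url);
    for (i, r) in results.iter().enumerate() {
        if let Some(err) = &r.error {
            out.push_str(&format!("[{}] error: {}\n", i, truncate_chars(err, ERROR_LIMIT)));
        } else if let Some(summary) = &r.summary {
            out.push_str(&format!("[{}] {}\n", i, truncate_chars(summary, SUMMARY_LIMIT)));
        }
    }
    Some(out)
}
```

Counting characters rather than bytes keeps the cut on a valid boundary when page text contains non-ASCII content.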
Prompt — unchanged
Still tells the model to emit one JSON object matching AgentOutput shape, lists the action catalog with parameter schemas, warns that empty action lists are treated as failure. The new image part rides alongside that contract; no prompt rewrite was needed.
Reproduction
Before v2.4.0:
$ RAS_MODEL=anthropic/claude-haiku-4.5 cargo run --example claude_code_oauth_cosmium
[step 0] screenshot → captured (b64 discarded by renderer)
[step 1] LLM narrates "I see the login form" ← fabrication; received only text
[mitmproxy] only GET / and GET /favicon.ico across N calls
After v2.4.0: each step that produces a screenshot attaches the image to the next user turn. Vision-capable models receive the actual page bytes.
Migration
No code changes required for callers. RunAgent::new signature unchanged. cargo update -p ras-agent --precise 2.4.0 (or any workspace crate; the workspace bumps together).
If you previously consumed `render_step_results` or any internal renderer, note that it has been removed and replaced by the crate-internal (`pub(crate)`) `render_step_message`, which returns `Option<ChatMessage>`.
Compatibility
- New constructor `ChatMessage::user_parts` is purely additive.
- Existing text-only `ChatMessage::user_text` unchanged.
- No breaking changes to public APIs.
- Workspace MSRV unchanged.
Tests
5 new unit tests in render_step_message:
- empty results → no message
- text-only result → text part only
- screenshot result → text + 1 image part
- multiple images across results → all attached in order
- error result → error included in text
1 new integration test screenshot_image_reaches_next_prompt_as_image_part:
- `ScriptedLlm` records every received `Vec<ChatMessage>`.
- After step 1's `screenshot` action, asserts step 2's prompt contains `ContentPart::ImageBase64` with `media_type = "image/png"` and non-empty data.
Verification
- `cargo test --workspace --no-fail-fast`: all suites pass (13 unit + 4 executor + 5 + 4 + 3 integration)
- `cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro`: clean
- `cargo fmt --all -- --check`: clean
- `cargo doc --workspace --no-deps`: clean
Artifacts
- Linux x86_64: `ras-x86_64-unknown-linux-gnu`, `ras-daemon-x86_64-unknown-linux-gnu`
- macOS arm64: `ras-aarch64-apple-darwin`, `ras-daemon-aarch64-apple-darwin`
- crates.io: all `ras-*` workspace crates published at `2.4.0` once `publish.yml` finishes
Follow-ups
- #24: in `ras-llm-anthropic`, `ContentPart::ImageUrl` currently degrades to plaintext URL instead of using Anthropic's native `source.type = url` image format. Planned 2.4.1 patch (or fold into next minor, since `publish.yml` skips patch bumps).
- Phase B: JS-eval clickable extractor (`querySelectorAll` via `BrowserPort::evaluate`) producing a numbered index map for `click_element` parameters. No full DomExtractor.
- Phase C: Full CDP `DomExtractor` impl with paint-order occlusion + stable hashing across snapshots. The trait exists in `ras-dom` with zero implementations today.
Pull requests
- #25: `feat(agent): feed screenshot images to LLM as ImageBase64 (v2.4.0)`
- #26: `release: v2.4.0 (vision feedback)`
Full changelog: v2.3.0...v2.4.0
v2.3.0
Highlights
Parser hotfix. Strong models (Claude, GPT-4o, etc.) wrap their AgentOutput JSON in ```json ... ``` markdown fences ~30% of the time despite the system-prompt instruction not to. In v2.2.0 the strict parser failed on fenced output, fell back to an empty action[], and RunAgent::execute aborted via the 2-streak stall guard at step 0 — making the executor effectively unusable on most modern Claude/GPT-4o calls.
v2.3.0 makes the parser defensive without weakening the prompt.
What's new
ras-agent — fence-tolerant parser
- `parse_agent_output` extracted into its own module (`ras_agent::application::parse_output`).
- Two-stage strategy:
  - Fast path: `serde_json::from_str` on the raw `content` (zero allocation, unchanged behavior for unfenced output).
  - Slow path: strip a leading `` ```json ``, `` ```JSON ``, `` ```Json ``, or plain `` ``` `` opening fence + a trailing `` ``` `` closing fence + surrounding whitespace, then retry.
- If both paths fail, falls through to the existing empty-action fallback. Never panics on malformed model output.
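The two-stage strategy can be sketched independently of the JSON library. In this illustration `strip_markdown_fence` and `parse_with_fence_fallback` are hypothetical names, and the parser is passed in as a closure so the block stays standard-library only; in the crate the closure's role is played by `serde_json::from_str`. The fence marker is spelled with escapes purely so the example does not contain a literal fence.

```rust
/// Three backticks, written with escapes to keep this example fence-safe.
const FENCE: &str = "\u{60}\u{60}\u{60}";

/// Returns the text inside a surrounding markdown fence, or `None` when the
/// content is not fenced.
pub fn strip_markdown_fence(content: &str) -> Option<&str> {
    let inner = content.trim().strip_prefix(FENCE)?.strip_suffix(FENCE)?;
    // Optional language tag right after the opening fence.
    let inner = ["json", "JSON", "Json"]
        .iter()
        .find_map(|tag| inner.strip_prefix(*tag))
        .unwrap_or(inner);
    Some(inner.trim())
}

/// Fast path on the raw content, slow path on the unfenced content.
/// `None` stands for the caller's empty-action fallback.
pub fn parse_with_fence_fallback<T, E>(
    content: &str,
    parse: impl Fn(&str) -> Result<T, E>,
) -> Option<T> {
    if let Ok(value) = parse(content) {
        return Some(value);
    }
    let inner = strip_markdown_fence(content)?;
    parse(inner).ok()
}
```

Unfenced output never reaches the stripping step, so the common case keeps its original behavior.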
Prompt — unchanged
The "no markdown fences" instruction stays as a hint. The parser is now defensive because strong models ignore the hint anyway; the prompt change wouldn't have fixed it.
Tests
- 8 new unit tests in `parse_output` covering:
  - unfenced JSON (existing behavior)
  - `` ```json … ``` `` with newline after the open fence
  - `` ```JSON … ``` `` and `` ```Json … ``` `` (case variants)
  - `` ``` … ``` `` with no language tag
  - leading/trailing whitespace around the fence block
  - fenced but invalid JSON → empty-action fallback
  - unfenced and invalid JSON → empty-action fallback
- New integration test `agent_recovers_from_markdown_fenced_response`: feeds a fenced response through `ScriptedLlm` + `RunAgent`, asserts `output.action` is non-empty and the navigate call reaches the mock `BrowserPort`.
Reproduction
Before v2.3.0:
$ RAS_MODEL=anthropic/claude-haiku-4.5 cargo run --example claude_code_oauth_cosmium
ras_agent::application::run_agent: model returned empty action list (streak=1); treating as stalled
ras_agent::application::run_agent: model returned empty action list (streak=2); treating as stalled
ras_agent::application::run_agent: agent stalled: 2 consecutive empty action lists, aborting
[done] (no final result returned)
After v2.3.0: the fenced JSON parses, navigate reaches the browser, the loop progresses past step 0.
Migration
No code changes required — drop-in replacement for 2.2.0. cargo update -p ras-agent --precise 2.3.0 (or any workspace crate; the workspace bumps together).
Compatibility
- No public API changes.
- No breaking changes.
- Workspace MSRV unchanged.
Verification
- `cargo test --workspace --no-fail-fast`: all suites pass (8 new unit tests + 1 new integration test green)
- `cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro`: clean
- `cargo fmt --all -- --check`: clean
- `cargo doc --workspace --no-deps`: clean
Artifacts
- Linux x86_64: `ras-x86_64-unknown-linux-gnu`, `ras-daemon-x86_64-unknown-linux-gnu`
- macOS arm64: `ras-aarch64-apple-darwin`, `ras-daemon-aarch64-apple-darwin`
- crates.io: all `ras-*` workspace crates published at `2.3.0` once `publish.yml` finishes
Pull requests
- #22: `fix(agent): parse markdown-fenced LLM responses (v2.3.0)`
- #23: `release: v2.3.0 (parser hotfix)`
Full changelog: v2.2.0...v2.3.0