Releases: KeyCode17/rust-ai-surfer

v4.0.0 — Multi-Tenant Tier (BrowserContext isolation → SessionManager)

30 May 03:01
55832e8

v4.0.0 — Multi-Tenant Tier

Consolidates everything from v3.2.0 → v4.0.0: the crates evolve from a single-user agent engine into a foundation you can embed in a multi-tenant SaaS, where many users each get their own AI agent + isolated browser session. Built as five phases, each released as a minor; v4.0.0 is the major that marks the tier complete (and carries the breaking API changes below).

Design + audit trail: docs/superpowers/specs/2026-05-30-multi-tenant-tier-design.md.

What landed, phase by phase

v3.2.0 — Phase 1: CDP BrowserContext isolation (the backbone)

  • ras_types::ContextId; BrowserPort::{create_context, close_context, new_target_in, list_targets_in}.
  • ras-cdp::context_ops: isolated context per user, tabs scoped to a context, context-scoped cookie clear (Storage.clearCookies{browserContextId}), per-context download directory.
  • Proven by a live e2e against headless Chrome: a cookie set in context A is invisible in context B; list_targets_in returns only that context's tabs.

v3.3.0 — Phase 2: Egress / SSRF policy

  • ras_validation::EgressPolicy: scheme allowlist (http/https), blocks loopback/private/link-local + cloud metadata 169.254.169.254 + localhost, denied ports (incl. the CDP port), consumer allow/deny lists.
  • navigate is gated through the policy before touching the browser, so an agent can't be steered straight to metadata/internal hosts/file://.
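
The checks listed above can be sketched with the standard library alone. This is a minimal illustration, not the ras_validation implementation: EgressSketch, EgressDenied, and is_internal are hypothetical names, the caller is assumed to pass an already-parsed scheme, host (IPv6 without brackets), and port, and 9222 in the usage note is only an assumed CDP port. Hostnames that resolve to internal addresses are not caught here, which matches the DNS-rebinding item listed as deferred.

```rust
use std::net::IpAddr;

/// Why a navigation was refused (hypothetical variants, for illustration only).
#[derive(Debug, PartialEq)]
pub enum EgressDenied {
    Scheme,
    Host,
    Port,
}

/// Hypothetical stand-in for an egress policy; not the crate's real type.
pub struct EgressSketch {
    pub denied_ports: Vec<u16>,
}

impl EgressSketch {
    /// Checks an already-parsed (scheme, host, port) triple.
    pub fn check(&self, scheme: &str, host: &str, port: u16) -> Result<(), EgressDenied> {
        // Scheme allowlist: only http and https may reach the browser.
        if !matches!(scheme, "http" | "https") {
            return Err(EgressDenied::Scheme);
        }
        if self.denied_ports.contains(&port) {
            return Err(EgressDenied::Port);
        }
        let host = host.trim_end_matches('.').to_ascii_lowercase();
        if host == "localhost" || host.ends_with(".localhost") {
            return Err(EgressDenied::Host);
        }
        // Literal IP addresses are classified directly; names pass through.
        if let Ok(ip) = host.parse::<IpAddr>() {
            if is_internal(ip) {
                return Err(EgressDenied::Host);
            }
        }
        Ok(())
    }
}

/// Loopback, private, link-local (includes 169.254.169.254), and unspecified.
fn is_internal(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_loopback() || v4.is_private() || v4.is_link_local() || v4.is_unspecified()
        }
        IpAddr::V6(v6) => v6.is_loopback() || v6.is_unspecified(),
    }
}
```

With denied_ports set to vec![9222], check("http", "169.254.169.254", 80) is refused as a host violation and check("file", "example.com", 0) as a scheme violation, while check("https", "example.com", 443) passes.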

v3.4.0 — Phase 3: Agent target binding

  • ToolContext.target + RunAgent::with_target; every builtin now acts on its bound tab instead of "the last focused tab". Removes the cross-session tab race; focused_target is gone from the tool layer.

v3.5.0 — Phase 4: Context-tagged event producer

  • The CDP→BrowserEvent pipeline that previously discarded events now exists: per-tab listeners (attach_events) forward navigation + dialog events to a supplied EventBus, so a session that owns its bus receives only its own tab's events — structural per-session event isolation.

v3.6.0 — Phase 5: ras-session crate (capstone)

  • SessionManager<Owner>: spawn/get/list, bounded (max_sessions + OnFull::Reject|EvictOldest), one-session-per-owner, background idle reaper.
  • SessionHandle: per-session event stream, run(task) with a one-active-task guard (SessionError::Busy) wired to the bound agent, explicit close() that disposes the context.
  • BrowserProvider trait (ships SharedBrowserProvider = one Chrome, context-per-session; swap for process-per-tenant for a hard boundary).
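
The bounding behavior described above (max_sessions, OnFull::Reject|EvictOldest, one session per owner) can be illustrated with a small single-threaded registry. This is a sketch under stated assumptions, not the ras-session API: Registry, SpawnError, and the sequence counter are hypothetical, a duplicate owner is assumed to be rejected rather than handed the existing session, and the real manager also disposes browser contexts and runs an idle reaper.

```rust
use std::collections::HashMap;
use std::hash::Hash;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OnFull {
    Reject,
    EvictOldest,
}

#[derive(Debug, PartialEq)]
pub enum SpawnError {
    AlreadyExists,
    Full,
}

/// Hypothetical bounded registry: owner -> creation sequence number.
pub struct Registry<Owner> {
    max_sessions: usize,
    on_full: OnFull,
    next_seq: u64,
    sessions: HashMap<Owner, u64>,
}

impl<Owner: Eq + Hash + Clone> Registry<Owner> {
    pub fn new(max_sessions: usize, on_full: OnFull) -> Self {
        Self { max_sessions, on_full, next_seq: 0, sessions: HashMap::new() }
    }

    /// Registers a session for `owner`; returns the evicted owner, if any.
    pub fn spawn(&mut self, owner: Owner) -> Result<Option<Owner>, SpawnError> {
        // One session per owner (assumed here to be an error, not a lookup).
        if self.sessions.contains_key(&owner) {
            return Err(SpawnError::AlreadyExists);
        }
        let mut evicted = None;
        if self.sessions.len() >= self.max_sessions {
            match self.on_full {
                OnFull::Reject => return Err(SpawnError::Full),
                OnFull::EvictOldest => {
                    // The smallest sequence number is the oldest session.
                    let oldest = self
                        .sessions
                        .iter()
                        .min_by_key(|(_, seq)| **seq)
                        .map(|(owner, _)| owner.clone())
                        .ok_or(SpawnError::Full)?; // only empty when max_sessions == 0
                    self.sessions.remove(&oldest);
                    evicted = Some(oldest);
                }
            }
        }
        self.sessions.insert(owner, self.next_seq);
        self.next_seq += 1;
        Ok(evicted)
    }

    pub fn len(&self) -> usize {
        self.sessions.len()
    }
}
```

A monotonic counter stands in for timestamps so that "oldest" stays deterministic even when two sessions are created within the same clock tick.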

v4.0.0 — major

  • Marks the tier complete and absorbs the breaking changes below.
  • Internal workspace dependency pins moved 3.x → 4.0.0 so the per-crate crates.io publish resolves at the new major.

Breaking changes

  • BrowserPort gained context + attach_events methods (default-impl'd, so most consumers are unaffected; custom BrowserPort impls inherit safe defaults).
  • ToolContext gained a target: Option<TargetId> field — any code constructing ToolContext directly must set it.

Isolation guarantees (and honest limits)

  • Per-user separation is enforced at the primitive layer (contexts, scoped targets/storage, per-tab events) so it holds even if a consumer writes their own session manager.
  • One shared Chrome = a soft process boundary, sufficient for web-session isolation; for a hard boundary against a hostile page, supply a process-per-tenant BrowserProvider.
  • Auth/transport are intentionally not in the crates — the consumer owns them (a capability-by-SessionId model + example wiring is the intended pattern).

Deferred (tracked follow-ups)

  • CDP Fetch.requestPaused enforcement for 3xx redirects + JS-driven navigation, and DNS-rebinding (Phase 2b).
  • Broader event coverage (downloads, network, target lifecycle) on the existing per-tab→bus model.

🤖 Generated with Claude Code

v3.1.0 — Richer clickable naming for DOM grounding

29 May 21:41
cd4ee22

v3.1.0 — Richer clickable naming

Improves the DOM-grounding clickable map the agent sees, so it can tell elements apart from the text channel instead of relying on the screenshot alone — especially anchors, JS-handler buttons, and icon-only controls.

What changed

  • Detection now unions the markup heuristic with CDP's native isClickable, catching elements made clickable via JS event listeners (not just <a>/<button>/onclick).
  • Naming falls back through a chain so elements stop arriving as bare [N] tag:
    aria-label/alt/title/name/placeholder → descendant text (e.g. <a><span>Home</span></a> → "Home") → onclick handler name (viewBankAccount('0') → "view bank account", verb only) → FontAwesome icon class (fa-eye → "view", fa-plus-circle → "add") → role.
  • role overrides the displayed tag, so <i role="button"> is shown as button, not i.
  • Clickable map cap raised 80 → 200, sorted visible-first so on-screen elements survive truncation on dense pages (e.g. long sidebars).
  • snapshot_parser split into snapshot_parser / clickables / clickable_naming to honor the 200-LOC module cap.
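
The naming fallback chain above can be sketched as a series of or_else steps. This is an illustration only: Candidate, display_name, humanize_handler, and icon_name are hypothetical names rather than the clickable_naming module's API, the icon table contains only the two mappings quoted above, and acronyms in handler names (for example showURL) are not treated specially.

```rust
/// Hypothetical inputs gathered for one clickable element.
pub struct Candidate<'a> {
    pub attr_name: Option<&'a str>, // aria-label / alt / title / name / placeholder
    pub descendant_text: Option<&'a str>,
    pub onclick: Option<&'a str>,
    pub class: Option<&'a str>,
    pub role: Option<&'a str>,
}

/// Walks the chain and returns the first usable name.
pub fn display_name(c: &Candidate<'_>) -> Option<String> {
    non_empty(c.attr_name)
        .or_else(|| non_empty(c.descendant_text))
        .or_else(|| c.onclick.and_then(humanize_handler))
        .or_else(|| c.class.and_then(icon_name))
        .or_else(|| non_empty(c.role))
}

fn non_empty(s: Option<&str>) -> Option<String> {
    s.map(str::trim).filter(|t| !t.is_empty()).map(String::from)
}

/// Turns a handler call such as `viewBankAccount('0')` into "view bank account".
pub fn humanize_handler(onclick: &str) -> Option<String> {
    // Keep the identifier before the argument list, and the last `.` segment.
    let name = onclick.trim().split('(').next()?.trim();
    let name = name.rsplit('.').next()?;
    if name.is_empty() || !name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_') {
        return None;
    }
    let mut out = String::new();
    for c in name.chars() {
        // Start a new word at underscores and at camelCase boundaries.
        let boundary = c == '_' || c.is_ascii_uppercase();
        if boundary && !out.is_empty() && !out.ends_with(' ') {
            out.push(' ');
        }
        if c != '_' {
            out.push(c.to_ascii_lowercase());
        }
    }
    let out = out.trim_end().to_string();
    if out.is_empty() { None } else { Some(out) }
}

/// Maps a known icon class token to a verb (only the two mappings quoted above).
fn icon_name(class: &str) -> Option<String> {
    class.split_whitespace().find_map(|token| match token {
        "fa-eye" => Some("view".to_string()),
        "fa-plus-circle" => Some("add".to_string()),
        _ => None,
    })
}
```

Because each step is tried only when the previous one yields nothing, an icon-only control with class "fa fa-eye" and role "button" is named "view" rather than "button".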

Verification

  • Unit tests for the naming chain (text/onclick/icon/role precedence, FontAwesome mapping, style-token filtering).
  • Live e2e against headless Chrome confirming anchor text, nested-span text, onclick-derived names, and role → button all resolve on a real page.

v3.0.0 — PerimeterX press-and-hold solver + CDP primitives

15 May 21:51
afb2aa0

Highlights

End-to-end PerimeterX press-and-hold challenge solver via pure CDP — verified clearing the deny page on pedidosya.com.ar. Includes the new BrowserPort primitives that make it possible plus the agent + tools surface to drive it from an LLM.

⚠️ Breaking changes

  • chromiumoxide 0.6 → 0.9 (Chrome 142+ wire compatibility — 0.6's protocol crate can't deserialize new CDP message variants)
  • BrowserPort trait gains 6 required methods:
    • mouse_down(x, y), mouse_up(x, y), mouse_move(x, y, buttons_mask)
    • mouse_hold(x, y, ms) — humanized press-and-hold (pre-approach + ±1px jitter during hold)
    • block_urls(patterns) → Network.setBlockedURLs glob suppression
    • clear_cookies(origin) → Storage.clearDataForOrigin (cookies + storage)
  • All workspace internal version pins bumped 2.1.0 → 3.0.0

New tools (ras-tools)

  • press_and_hold_element { index, ms? } — resolves bbox center of a clickable, dispatches humanized hold (default 12000ms, max 60000ms)
  • press_and_hold_coordinate { x, y, ms? } — raw pixel variant for buttons inside cross-origin iframes / closed shadow DOM

Logging

  • ras_agent::brain target: per-step agent decision, agent action params, action result records
  • ras_llm_openai TRACE level: full request + response body dumps
  • Presets:
    • RUST_LOG=info,ras_agent::brain=info — brain + action stream
    • RUST_LOG=info,ras_agent::brain=debug — with raw action params
    • RUST_LOG=info,ras_llm_openai=trace — full LLM I/O

New examples

| Example | What it does |
| --- | --- |
| pedidosya_px_solver | Pure-CDP (no LLM) PerimeterX solver. Clears cookies, locates #px-captcha shadow host, holds ~18s with humanized approach + jitter. Verified to clear the deny page on pedidosya.com.ar. |
| pedidosya_perimeterx_smoke | LLM-driven variant exercising press_and_hold_* builtins. |
| mouse_drag_smoke | autodraw.com canvas drawing smoke: verifies that the CDP isTrusted=true mouse pipeline draws a recognizable circle. |
| youtube_search_claude_code | Claude Code OAuth path via ChatAnthropicClaudeCode. |

Internal

  • ras-cdp adapter refactored: mouse_input.rs + cdp_ext.rs extracted; new with_page helper and op! macro deduplicate the page_for+within boilerplate.
  • Pre-commit guard scripts (check-loc.sh, check-no-comments.sh, check-no-unwrap.sh) fixed to exclude examples/, tests/, xtask/ when staged without a parent path component.

Verified end-to-end

  • cargo check --workspace --all-targets
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test -p ras-tools -p ras-agent -p ras-cdp
  • Manual: cargo run --example pedidosya_px_solver — PASS (PX cleared, redirected to homepage)
  • Manual: cargo run --example mouse_drag_smoke — PASS (visible circle drawn on autodraw canvas)

2.8.0

15 May 14:15

Fixed

  • ras-cdp no longer pulls async-std. The workspace chromiumoxide pin previously enabled tokio-runtime additively on top of chromiumoxide's defaults, which meant async-std-runtime was still on. We now set default-features = false and enable only tokio-runtime + bytes.
  • Resolves RUSTSEC-2025-0052 (async-std unmaintained) for every downstream that consumes ras-cdp (e.g. ai-flow-surfer).

Why a minor bump

publish.yml gates patch versions; bumping to 2.8.0 fires the crates.io publish job. No public API changed.

Verification

cargo tree -p ras-cdp -i async-std → "did not match any packages".

v2.7.0

10 May 08:22
33a29e9

v2.7.0 - grounding chain closes

v2.6.0

10 May 06:37
02a8099

Highlights

Agent DOM grounding lands. v2.5.0 shipped ChromiumoxideDomExtractor but nothing called it. v2.6.0 closes the loop: every step captures a fresh DOM snapshot, the next prompt carries a numbered clickable map, and the LLM operates on real page state instead of fabricating from text narration.

This is Phase B of the agent grounding fix. Combined with v2.4.0 (vision feedback) and v2.5.0 (extractor implementation), the agent now sees and references the DOM the model claims to interact with.

Closes #31.

What's new

ras-cdp: ChromiumoxideAdapter::browser_arc() (closes #31)

Returns a clone of the adapter's Arc<Mutex<Browser>> so the extractor and adapter share one CDP connection / target space:
```rust
let adapter = ChromiumoxideAdapter::connect(ws, timeout).await?;
let extractor = ChromiumoxideDomExtractor::new(adapter.browser_arc(), timeout);
```
The v2.5.0 release notes referenced this accessor before it existed. It's real now.

ras-agent — DOM extractor wired through the loop

  • RunAgent::with_dom_extractor(Arc<dyn DomExtractor>) builder. None preserves pre-2.6.0 behavior.
  • RunStep calls extractor.snapshot(target) after every step and stores the result on the new StepRecord.summary: Option<BrowserStateSummary>.
  • Failures degrade gracefully: tracing::warn + summary: None. Grounding is auxiliary, not required.
  • Field is #[serde(default, skip_serializing_if = "Option::is_none")] — old serialized history files still deserialize cleanly.

Prompt format — numbered clickable map

```
Step 4 result:
url: https://example.com/login
action results:
[0] clicked login button
clickable_elements:
[0] button "Sign in"
[1] input "Email"
[2] input "Password"
…and 5 more (truncated)
```
Indexes match ClickableElement.index from the snapshot. click_element(index=N) is now grounded in real DOM state instead of model guesses. ax_name takes precedence over label. List capped at 80 elements with overflow marker.
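
A minimal sketch of how such a map could be rendered, following the rules stated above (ax_name before label, no quotes when neither is present, cap of 80 with an overflow marker). The Clickable struct and render_clickable_map function are hypothetical stand-ins; the real ClickableElement carries more fields.

```rust
pub const CLICKABLE_LIMIT: usize = 80;

/// Hypothetical, trimmed-down stand-in for a snapshot's clickable element.
pub struct Clickable {
    pub index: usize,
    pub tag: String,
    pub ax_name: Option<String>,
    pub label: Option<String>,
}

/// Renders the numbered map; an empty input yields an empty string.
pub fn render_clickable_map(items: &[Clickable]) -> String {
    if items.is_empty() {
        return String::new();
    }
    let mut out = String::from("clickable_elements:\n");
    for el in items.iter().take(CLICKABLE_LIMIT) {
        // ax_name takes precedence; label is the fallback.
        let name = el
            .ax_name
            .as_deref()
            .filter(|n| !n.is_empty())
            .or_else(|| el.label.as_deref().filter(|n| !n.is_empty()));
        match name {
            Some(n) => out.push_str(&format!("[{}] {} \"{}\"\n", el.index, el.tag, n)),
            None => out.push_str(&format!("[{}] {}\n", el.index, el.tag)),
        }
    }
    if items.len() > CLICKABLE_LIMIT {
        let hidden = items.len() - CLICKABLE_LIMIT;
        out.push_str(&format!("…and {} more (truncated)\n", hidden));
    }
    out
}
```

The index printed is the element's own index field rather than its position in the slice, so the number the model sees is the number click_element expects.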

Screenshot precedence

If summary.screenshot_b64 is present, it is the sole image part attached to the user message. Otherwise the legacy path (one part per ActionResult.images entry) kicks in. Steps that explicitly screenshot no longer end up with two image parts in the prompt.
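
The precedence rule reduces to a single match. In this sketch select_images is a hypothetical helper and images are represented as plain base64 strings:

```rust
/// Picks the image payloads for the next user message.
/// An extractor screenshot, when present, replaces all per-action images.
pub fn select_images(summary_screenshot: Option<&str>, action_images: &[String]) -> Vec<String> {
    match summary_screenshot {
        Some(shot) => vec![shot.to_string()],
        // Legacy path: one image part per ActionResult image.
        None => action_images.to_vec(),
    }
}
```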

ras-llm::ChatMessage::user_parts (carried context)

The constructor for mixed-content user messages introduced in v2.4.0 is the load-bearing primitive Phase B uses to attach text + screenshot in one turn.

Architecture decisions

  • ras-agent already depended on ras-dom; no new crate dep.
  • clickable_map extracted to its own module to keep render_step_message.rs under the 200-LOC cap. Split is also semantic — clickable rendering is independent of step message assembly.
  • ScriptedDomExtractor mock in the integration test bypasses real Chrome for fast deterministic coverage. Real CDP testing requires cosmium and remains a manual smoke step.

Tests

5 new clickable_map unit tests:

  • empty clickables → empty string
  • ax_name precedence over label
  • label fallback when no ax_name
  • no quotes when neither
  • truncation past CLICKABLE_LIMIT (80) with "…and N more" marker

New integration test dom_extractor_grounding_reaches_next_prompt:

  • ScriptedDomExtractor returns a canned BrowserStateSummary with two clickables (button "Sign in" + input "Email") and a known screenshot byte marker.
  • Asserts step 2's prompt contains clickable_elements: text with both rendered indexes AND the extractor's screenshot bytes.
  • Asserts extractor.snapshot was invoked at least once across the run.

Total ras-agent: 18 unit + 5 integration. Workspace: 97 test groups all green.

Verification

  • cargo test --workspace --no-fail-fast — clean
  • cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro — clean
  • cargo fmt --all -- --check — clean
  • cargo doc --workspace --no-deps — clean

LOC per new/modified file (200 cap):

  • clickable_map.rs 115
  • render_step_message.rs 173
  • run_step.rs 161
  • run_agent.rs 168

Compatibility

  • All new APIs additive (browser_arc(), with_dom_extractor, StepRecord.summary).
  • No public-API breaks anywhere in the workspace.
  • Workspace MSRV unchanged.
  • Old serialized AgentHistoryList deserialize cleanly (summary defaults to None).

Migration

```rust
// before (v2.5.0): no DOM grounding
let agent = RunAgent::new(task, llm, registry, browser, events);

// after (v2.6.0): wire extractor for real grounding
use std::sync::Arc;
use std::time::Duration;
use ras_dom::{ChromiumoxideDomExtractor, DomExtractor};

let adapter = ChromiumoxideAdapter::connect(ws_url, Duration::from_secs(60)).await?;
let extractor: Arc<dyn DomExtractor> = Arc::new(
    ChromiumoxideDomExtractor::new(adapter.browser_arc(), Duration::from_secs(30)),
);
let browser: Arc<dyn BrowserPort> = Arc::new(adapter);

let agent = RunAgent::new(task, llm, registry, browser, events)
    .with_dom_extractor(extractor);
```

Deferred (still after this release)

  • Full EnhancedDomTreeNode tree — currently tree: None in BrowserStateSummary.
  • stable_hash — currently empty in ClickableElement.stable_hash.
  • Real AX tree via Accessibility.getFullAXTree — current ax_name from attributes is a sound MVP.
  • Paint-order occlusion: paint_orders requested but not yet used to drop covered elements.
  • Per-action snapshot — current snapshot fires once per step after all actions complete. Per-click feedback is plausible if models need finer-grained grounding; defer until evidence.

Artifacts

  • Linux x86_64: ras-x86_64-unknown-linux-gnu, ras-daemon-x86_64-unknown-linux-gnu
  • macOS arm64: ras-aarch64-apple-darwin, ras-daemon-aarch64-apple-darwin
  • crates.io: all ras-* workspace crates published at 2.6.0 once publish.yml finishes

Pull requests

  • #32: feat(agent): DOM grounding via ChromiumoxideDomExtractor (v2.6.0)
  • #33: release: v2.6.0 (agent DOM grounding)

Sub-phase commits

  • B1 fix(cdp): add ChromiumoxideAdapter::browser_arc() accessor — 2.5.1 (closes #31)
  • B2 feat(agent): capture DOM snapshot per step via Option<Arc<dyn DomExtractor>> — 2.5.2
  • B3 feat(agent): inject numbered clickable map + prefer extractor screenshot — 2.5.3
  • B4 feat(examples): wire ChromiumoxideDomExtractor into claude_code_oauth_cosmium — 2.5.4
  • B5 test(agent): integration test for end-to-end DOM grounding flow — 2.5.5
  • chore: bump to 2.6.0

Closes: #31

Full changelog: v2.5.0...v2.6.0

v2.5.0

10 May 06:05
2a1187b

Highlights

DomExtractor has a real implementation. Before v2.5.0 the trait existed in ras-dom with zero implementations: BrowserStateSummary was unreachable from any action and Phase B (numbered clickable index map for prompts) was blocked.

v2.5.0 ships ChromiumoxideDomExtractor using pure CDP via DOMSnapshot.captureSnapshot, the same primitive Puppeteer and Playwright use for fast structural snapshots.

This is Phase C of the agent grounding fix. Phase B (prompt wiring) ships as 2.6.0.

This release also carries the v2.4.1 anthropic ImageUrl fix (#27) to crates.io — that patch was held per publish.yml patch-skip gate and folded into this minor.

What's new

ras-dom::ChromiumoxideDomExtractor

  • New module ras-dom/src/infrastructure/chromiumoxide/ with extractor.rs, snapshot.rs, snapshot_parser.rs, highlight.rs.
  • ChromiumoxideDomExtractor::new(Arc<Mutex<Browser>>, Duration) — wires a browser handle plus request timeout.
  • Implements the DomExtractor trait that has lived without an impl since 2.0.
  • Re-exported as ras_dom::ChromiumoxideDomExtractor.

snapshot() — pure-CDP path

  • One Page.execute(CaptureSnapshotParams) round-trip with includePaintOrder = true, includeDOMRects = true.
  • Parser walks NodeTreeSnapshot parallel arrays (node_name, attributes, backendNodeId) resolving StringIndex references through resp.strings.
  • Layout: node_index → BoundingBox map from LayoutTreeSnapshot.bounds.
  • Clickable detection: tag in {a, button, input, select, textarea, summary, label, details} OR presence of onclick / tabindex / role / aria-pressed / aria-checked.
  • ax_name derived from first non-empty of aria-label, alt, title, name, placeholder.
  • label from value attribute.
  • Tabs via Browser.execute(GetTargetsParams) filtered to type=="page".
  • Inline screenshot via Page.screenshot (PNG, viewport).
  • Whole flow wrapped in tokio::time::timeout(request_timeout).
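
The clickable-detection and ax_name rules listed above can be sketched over a plain attribute map. The function names and the HashMap representation are assumptions made for illustration; the actual parser works on the snapshot's parallel arrays and string table.

```rust
use std::collections::HashMap;

const CLICKABLE_TAGS: [&str; 8] =
    ["a", "button", "input", "select", "textarea", "summary", "label", "details"];
const CLICKABLE_ATTRS: [&str; 5] =
    ["onclick", "tabindex", "role", "aria-pressed", "aria-checked"];
const NAME_ATTRS: [&str; 5] = ["aria-label", "alt", "title", "name", "placeholder"];

/// True when the tag is inherently interactive or a hinting attribute is present.
pub fn is_clickable(tag: &str, attrs: &HashMap<String, String>) -> bool {
    // Node names may arrive upper-cased, so normalize before comparing.
    let tag = tag.to_ascii_lowercase();
    CLICKABLE_TAGS.contains(&tag.as_str())
        || CLICKABLE_ATTRS.iter().any(|a| attrs.contains_key(*a))
}

/// First non-empty value among the naming attributes, in priority order.
pub fn ax_name(attrs: &HashMap<String, String>) -> Option<&str> {
    NAME_ATTRS
        .iter()
        .filter_map(|a| attrs.get(*a))
        .map(|v| v.trim())
        .find(|v| !v.is_empty())
}
```

A blank aria-label does not block the chain: with aria-label set to whitespace and title set to "Close", ax_name returns "Close".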

highlight() — draw-bbox canvas overlay

  • Page.evaluate installs a fixed-position 100vw/100vh pointer-events: none overlay div at z-index 2^31-1 (highest valid value, sits above app UI without intercepting events).
  • Same selector set as the snapshot parser. Slices to options.max_index (default 200).
  • Per visible element: 2px #ff3366 border box (gated on options.draw_bounding_boxes) and [N] index label above it (gated on options.include_text_labels).
  • After screenshot, a second Page.evaluate unconditionally removes the overlay.
  • Index labels match the index space snapshot() produces — a model that sees [3] in the highlighted screenshot can call click_element(index=3) directly. Phase B wires the prompt plumbing.

ras-llm-anthropic — ImageUrl native source.type=url (carried from v2.4.1 / #27)

  • AnthropicImageSource refactored from struct to enum; ContentPart::ImageUrl now emits Anthropic's native {\"type\":\"image\",\"source\":{\"type\":\"url\",\"url\":\"...\"}} shape.

Known gap (#31)

ChromiumoxideAdapter does not yet expose a browser_arc() accessor for its Arc<Mutex<Browser>> field. Today the only ways to construct ChromiumoxideDomExtractor are:

  1. Open a second Browser::connect_with_config to the same CDP URL (doubles WebSocket connections, separate target space).
  2. Custom adapter path.

The accessor will land in v2.6.0 alongside Phase B's ToolContext wiring. Tracked at #31.

Architecture decisions

  • Impl lives in ras-dom, not ras-cdp, because ras-dom → ras-cdp is the existing dependency direction. Reversing it would have caused a cycle. ras-dom now depends on chromiumoxide and tokio directly.
  • ChromiumoxideDomExtractor takes a shared Arc<Mutex<Browser>> instead of owning its own connection — caller decides how the handle is shared.
  • No fixture-JSON parser unit tests in this release. The chromiumoxide_cdp types are codegen'd from a .pdl file; constructing valid CaptureSnapshotReturns by hand is mechanical busywork that doesn't catch real bugs (which live at the CDP wire level). Real verification needs a live Chrome.

Deferred to follow-ups

  • #31: ChromiumoxideAdapter::browser_arc() accessor (target: v2.6.0).
  • Full EnhancedDomTreeNode tree: the field is still tree: None in BrowserStateSummary. Phase B prompt injection only needs clickables.
  • stable_hash — empty string in ClickableElement.stable_hash. Wiring ras_dom::application::stable_hash requires building the tree first.
  • Real AX tree via Accessibility.getFullAXTree — current ax_name from attributes is a sound MVP but misses computed accessibility names.
  • Paint-order occlusion: paint_orders requested in the CDP call but not yet used. ras_dom::application::paint_order exists; can plug in.
  • Phase B (v2.6.0) — wire Arc<dyn DomExtractor> into ToolContext, post-action snapshot in click/navigate/scroll, numbered index map in agent prompt.

Verification

  • cargo test --workspace --no-fail-fast — all 97 test groups pass
  • cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro — clean
  • cargo fmt --all -- --check — clean
  • cargo doc --workspace --no-deps — clean

LOC per file (200 cap):

  • extractor.rs 51
  • snapshot.rs 150
  • snapshot_parser.rs 178
  • highlight.rs 115

Compatibility

  • New types are purely additive (ChromiumoxideDomExtractor, new module path).
  • ras-dom direct dependencies grew: now depends on chromiumoxide and tokio directly (transitively via ras-cdp before, but explicit now).
  • No breaking changes to public APIs in any existing crate.
  • Workspace MSRV unchanged.

Artifacts

  • Linux x86_64: ras-x86_64-unknown-linux-gnu, ras-daemon-x86_64-unknown-linux-gnu
  • macOS arm64: ras-aarch64-apple-darwin, ras-daemon-aarch64-apple-darwin
  • crates.io: all ras-* workspace crates published at 2.5.0 once publish.yml finishes (v2.4.1 anthropic fix carried)

Pull requests

  • #29: feat(dom): ChromiumoxideDomExtractor via DOMSnapshot.captureSnapshot (v2.5.0)
  • #30: release: v2.5.0 (CDP DomExtractor)

Sub-phase commits

  • feat(dom): scaffold ChromiumoxideDomExtractor (Phase C1) — 2.4.2
  • feat(dom): implement snapshot() via DOMSnapshot.captureSnapshot (Phase C2) — 2.4.3
  • feat(dom): implement highlight() with draw-bbox canvas overlay (Phase C3) — 2.4.4
  • chore: bump to 2.5.0

Full changelog: v2.4.1...v2.5.0

v2.4.1

10 May 05:35
65ac701

Highlights

Anthropic ImageUrl fix. Closes #24. Before this patch, ContentPart::ImageUrl { url } was silently mapped to AnthropicContent::Text { text: url } in ras-llm-anthropic — a URL-referenced image (CDN, hosted bucket, signed S3 link) lost its image semantics and the model received a literal URL string in a text block. No vision grounding for the URL path on Anthropic.

v2.4.1 emits Anthropic's native {"type":"image","source":{"type":"url","url":"..."}} shape.

What's new

ras-llm-anthropic — native URL source

  • AnthropicImageSource refactored from struct to enum tagged on type:
    • Base64 { media_type, data } — wire shape unchanged from 2.4.0.
    • Url { url } — new variant emitting Anthropic's native URL image source.
  • content_part_to_anthropic now maps both ContentPart::ImageBase64 and ContentPart::ImageUrl to AnthropicContent::Image with the appropriate source variant.

ras-agent — unchanged

Phase A (v2.4.0) emits ContentPart::ImageBase64 for screenshots, which already worked. This fix is only relevant to callers that bypass the screenshot pipeline and pass image URLs directly through ChatMessage::user_parts — for example, hosted screenshot services or pre-uploaded asset references.

Tests

2 new unit tests in ras-llm-anthropic/src/infrastructure/http/dto.rs:

  • image_base64_serializes_with_native_source — asserts type=image, source.type=base64, media_type + data present, url absent.
  • image_url_serializes_with_native_url_source — asserts type=image, source.type=url, url present, media_type + data absent.

Verification

  • cargo test --workspace --no-fail-fast — all suites pass
  • cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro — clean
  • cargo fmt --all -- --check — clean
  • cargo doc --workspace --no-deps — clean

crates.io

Not published. publish.yml gate skips patch bumps. The fix folds into the next minor release (planned Phase B) and reaches crates.io there. If you need 2.4.1 on crates.io before then, trigger publish.yml manually via workflow_dispatch with force=true.

Compatibility

  • AnthropicImageSource is pub but only used inside ras-llm-anthropic::infrastructure::http::dto; no external consumers exist in the workspace. The struct → enum change is breaking for any out-of-tree consumer constructing AnthropicImageSource directly, which is unlikely given the type's role as an internal serialization DTO.
  • ContentPart API unchanged.
  • ChatMessage API unchanged.
  • Workspace MSRV unchanged.

Artifacts

  • Linux x86_64: ras-x86_64-unknown-linux-gnu, ras-daemon-x86_64-unknown-linux-gnu
  • macOS arm64: ras-aarch64-apple-darwin, ras-daemon-aarch64-apple-darwin
  • crates.io: not published this release (see above)

Pull requests

  • #27: fix(anthropic): emit native source.type=url for ContentPart::ImageUrl (v2.4.1)
  • #28: release: v2.4.1 (anthropic ImageUrl fix)

Closes: #24

Full changelog: v2.4.0...v2.4.1

v2.4.0

10 May 05:23
6e7aa8c

Highlights

Vision feedback lands. Before v2.4.0, ScreenshotAction captured PNG bytes via ActionResult::with_image(b64) but the next prompt's renderer returned a plain String and discarded the image. Models received only text narration, fabricated everything visual, and modern Claude/GPT-4o runs against real sites produced no observable network traffic past the initial GET. v2.4.0 closes that single broken link end-to-end.

This is Phase A of the broader grounding fix. Phase B (numbered clickable index map) and Phase C (full CDP DomExtractor impl) ship in later releases.

Surprise findings during scope review

The bug report assumed ChatMessage was text-only and that all 14 provider DTOs needed image support added. Actuals:

  • ContentPart::{ImageBase64, ImageUrl} already existed in ras-llm.
  • ras-llm-openai DTO already serialized both to OpenAI vision format ({"type":"image_url","image_url":{"url":"data:..."}}).
  • ras-llm-anthropic DTO already mapped ImageBase64 to native Anthropic image source.
  • 6 OpenAI-compatible providers (cerebras, deepseek, groq, mistral, openrouter, vercel) inherit serialization via ChatOpenAICompatible.
  • 6 scaffold providers (bedrock, cloud, google, langchain, oci, ollama) have empty mod.rs — no LlmClient impl yet, nothing to update.

The single broken link was in ras-agent. Scope of actual code change collapsed to one new module + one constructor + one render call site swap.

What's new

ras-llm — multipart constructor

  • New ChatMessage::user_parts(parts: Vec<ContentPart>) constructor for emitting mixed-content user messages directly.

ras-agent — image-aware result message

  • New module ras_agent::application::render_step_message:
    • Returns Option<ChatMessage> (None when results are empty so no spurious user turns).
    • One ContentPart::Text part with step header (Step N result:), url: line, and per-action result summaries (truncated to 480 chars, errors to 240).
    • One ContentPart::ImageBase64 { media_type: "image/png", data } part per ActionResult.images entry.
  • run_agent::build_prompt now calls render_step_message and pushes the returned ChatMessage. Old text-only render_step_results removed.

Prompt — unchanged

Still tells the model to emit one JSON object matching AgentOutput shape, lists the action catalog with parameter schemas, warns that empty action lists are treated as failure. The new image part rides alongside that contract; no prompt rewrite was needed.

Reproduction

Before v2.4.0:

$ RAS_MODEL=anthropic/claude-haiku-4.5 cargo run --example claude_code_oauth_cosmium
[step 0] screenshot → captured (b64 discarded by renderer)
[step 1] LLM narrates "I see the login form"  ← fabrication; received only text
[mitmproxy] only GET / and GET /favicon.ico across N calls

After v2.4.0: each step that produces a screenshot attaches the image to the next user turn. Vision-capable models receive the actual page bytes.

Migration

No code changes required for callers. RunAgent::new signature unchanged. cargo update -p ras-agent --precise 2.4.0 (or any workspace crate; the workspace bumps together).

If you previously consumed render_step_results or any internal renderer, note that it has been removed and replaced by the crate-internal (pub(crate)) render_step_message, which returns Option<ChatMessage>.

Compatibility

  • New constructor ChatMessage::user_parts is purely additive.
  • Existing text-only ChatMessage::user_text unchanged.
  • No breaking changes to public APIs.
  • Workspace MSRV unchanged.

Tests

5 new unit tests in render_step_message:

  • empty results → no message
  • text-only result → text part only
  • screenshot result → text + 1 image part
  • multiple images across results → all attached in order
  • error result → error included in text

1 new integration test screenshot_image_reaches_next_prompt_as_image_part:

  • ScriptedLlm records every received Vec<ChatMessage>.
  • After step 1's screenshot action, asserts step 2's prompt contains ContentPart::ImageBase64 with media_type = "image/png" and non-empty data.

Verification

  • cargo test --workspace --no-fail-fast — all suites pass (13 unit + 4 executor + 5 + 4 + 3 integration)
  • cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro — clean
  • cargo fmt --all -- --check — clean
  • cargo doc --workspace --no-deps — clean

Artifacts

  • Linux x86_64: ras-x86_64-unknown-linux-gnu, ras-daemon-x86_64-unknown-linux-gnu
  • macOS arm64: ras-aarch64-apple-darwin, ras-daemon-aarch64-apple-darwin
  • crates.io: all ras-* workspace crates published at 2.4.0 once publish.yml finishes

Follow-ups

  • #24 (ras-llm-anthropic): ContentPart::ImageUrl currently degrades to plaintext URL instead of using Anthropic's native source.type = url image format. Planned 2.4.1 patch (or fold into next minor; publish.yml skips patch bumps).
  • Phase B — JS-eval clickable extractor (querySelectorAll via BrowserPort::evaluate) producing a numbered index map for click_element parameters. No full DomExtractor.
  • Phase C — Full CDP DomExtractor impl with paint-order occlusion + stable hashing across snapshots. The trait exists in ras-dom with zero implementations today.

Pull requests

  • #25: feat(agent): feed screenshot images to LLM as ImageBase64 (v2.4.0)
  • #26: release: v2.4.0 (vision feedback)

Full changelog: v2.3.0...v2.4.0

v2.3.0

10 May 04:13
1e22fd2

Highlights

Parser hotfix. Strong models (Claude, GPT-4o, etc.) wrap their AgentOutput JSON in ```json ... ``` markdown fences ~30% of the time despite the system-prompt instruction not to. In v2.2.0 the strict parser failed on fenced output, fell back to an empty action[], and RunAgent::execute aborted via the 2-streak stall guard at step 0 — making the executor effectively unusable on most modern Claude/GPT-4o calls.

v2.3.0 makes the parser defensive without weakening the prompt.

What's new

ras-agent — fence-tolerant parser

  • parse_agent_output extracted into its own module (ras_agent::application::parse_output).
  • Two-stage strategy:
    • Fast path: serde_json::from_str on the raw content (zero allocation, unchanged behavior for unfenced output).
    • Slow path: strip a leading ```json, ```JSON, ```Json, or plain ``` opening fence + a trailing ``` closing fence + surrounding whitespace, then retry.
  • If both paths fail, falls through to the existing empty-action fallback. Never panics on malformed model output.
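
The two-stage strategy above can be sketched without a JSON dependency by passing the parser in as a closure. strip_code_fence and parse_tolerant are hypothetical names; this version accepts the json tag in any letter case, which is a superset of the three spellings listed, and a real caller would pass something like a serde_json-based parser instead of the toy one used in the usage note.

```rust
/// Removes one surrounding markdown fence (with optional `json` tag), if present.
pub fn strip_code_fence(raw: &str) -> &str {
    let trimmed = raw.trim();
    let Some(rest) = trimmed.strip_prefix("```") else {
        return trimmed; // not fenced: hand back the trimmed input
    };
    // Drop an optional language tag; `get` avoids panics on short input.
    let rest = match rest.get(..4) {
        Some(tag) if tag.eq_ignore_ascii_case("json") => &rest[4..],
        _ => rest,
    };
    let rest = rest.strip_suffix("```").unwrap_or(rest);
    rest.trim()
}

/// Fast path on the raw text, slow path on the de-fenced text.
pub fn parse_tolerant<T>(raw: &str, parse: impl Fn(&str) -> Option<T>) -> Option<T> {
    parse(raw).or_else(|| parse(strip_code_fence(raw)))
}
```

For example, with a toy parser |s: &str| s.parse::<i32>().ok(), both "42" and a fenced "42" parse to the same value, while fenced garbage yields None so the caller can apply its empty-action fallback.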

Prompt — unchanged

The "no markdown fences" instruction stays as a hint. The parser is now defensive because strong models ignore the hint anyway; a prompt change would not have fixed it.

Tests

  • 8 new unit tests in parse_output covering:
    • unfenced JSON (existing behavior)
    • ```json … ``` with newline after the open fence
    • ```JSON … ``` and ```Json … ``` (case variants)
    • ``` … ``` with no language tag
    • leading/trailing whitespace around the fence block
    • fenced but invalid JSON → empty-action fallback
    • unfenced and invalid JSON → empty-action fallback
  • New integration test agent_recovers_from_markdown_fenced_response — feeds a fenced response through ScriptedLlm + RunAgent, asserts output.action is non-empty and the navigate call reaches the mock BrowserPort.

Reproduction

Before v2.3.0:

$ RAS_MODEL=anthropic/claude-haiku-4.5 cargo run --example claude_code_oauth_cosmium
ras_agent::application::run_agent: model returned empty action list (streak=1); treating as stalled
ras_agent::application::run_agent: model returned empty action list (streak=2); treating as stalled
ras_agent::application::run_agent: agent stalled: 2 consecutive empty action lists, aborting
[done] (no final result returned)

After v2.3.0: the fenced JSON parses, navigate reaches the browser, the loop progresses past step 0.

Migration

No code changes required — drop-in replacement for 2.2.0. cargo update -p ras-agent --precise 2.3.0 (or any workspace crate; the workspace bumps together).

Compatibility

  • No public API changes.
  • No breaking changes.
  • Workspace MSRV unchanged.

Verification

  • cargo test --workspace --no-fail-fast — all suites pass (8 new unit tests + 1 new integration test green)
  • cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro — clean
  • cargo fmt --all -- --check — clean
  • cargo doc --workspace --no-deps — clean

Artifacts

  • Linux x86_64: ras-x86_64-unknown-linux-gnu, ras-daemon-x86_64-unknown-linux-gnu
  • macOS arm64: ras-aarch64-apple-darwin, ras-daemon-aarch64-apple-darwin
  • crates.io: all ras-* workspace crates published at 2.3.0 once publish.yml finishes

Pull requests

  • #22: fix(agent): parse markdown-fenced LLM responses (v2.3.0)
  • #23: release: v2.3.0 (parser hotfix)

Full changelog: v2.2.0...v2.3.0