Snapshot optimizations #13

Merged: kzajac-opera merged 34 commits into main from snapshot-optimizations on May 22, 2026

@kzajac-opera
Contributor

MR: Snapshot token optimizations + benchmarks

Summary

Adds two benchmarks that quantify the token cost of opera-browser-cli snapshot output, and
introduces the snapshot compaction layer (Layer 1 + URL LUT) that they measure.


Benchmarks

benchmarks/page-token-benchmark

Measures raw token cost of CLI output across flag combinations, with no LLM involved. Runs
opera-browser-cli open on 50 static pages (Wikipedia, GitHub, MDN, Python docs, RFC Editor)
and counts tokens via tiktoken. Seven conditions: opera-compact, opera-compact-full,
opera-raw, opera-raw-full, mcp-raw, axi, axi-full.

The --full option runs the Opera CLI or AXI without the hard character limit on output.

Results (50 runs each):

| Condition | Runs | Errors | Avg tokens | Median tokens | p95 tokens |
|---|---|---|---|---|---|
| opera-compact | 50 | 0 | 3,729 | 3,682 | 4,732 |
| opera-raw | 50 | 0 | 4,931 | 4,920 | 5,810 |
| axi | 50 | 0 | 4,986 | 4,908 | 5,736 |
| opera-compact-full | 50 | 0 | 60,595 | 24,294 | 256,104 |
| mcp-raw | 50 | 0 | 94,652 | 44,962 | 391,250 |
| opera-raw-full | 50 | 0 | 94,915 | 45,130 | 381,434 |
| axi-full | 50 | 0 | 98,469 | 46,586 | 396,891 |
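
The Avg, Median, and p95 columns are per-condition aggregates over the 50 per-page token counts. A minimal sketch of such an aggregation follows. It assumes a nearest-rank p95 (the PR does not say which percentile method the benchmark uses), and both `summarize` and the sample counts are made up for illustration.

```python
import math
import statistics


def summarize(token_counts: list[int]) -> dict[str, float]:
    """Aggregate per-page token counts into avg / median / p95."""
    ordered = sorted(token_counts)
    # Nearest-rank percentile: the smallest value with at least 95% of
    # the samples at or below it (assumed method, see lead-in).
    rank = math.ceil(0.95 * len(ordered))
    return {
        "avg": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


# Hypothetical counts, not taken from the benchmark output.
sample = [3200, 3500, 3700, 3900, 4800]
print(summarize(sample))
```

In the --full rows the average sits well above the median, which is the pattern a small number of very large pages would produce.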

benchmarks/snapshot-agentic-use

End-to-end agentic benchmark: an LLM agent completes 7 browser tasks (adapted from the
axi bench-browser benchmark) across 4 conditions, graded pass/fail by an LLM judge.
Captures input tokens, snapshot size, wall time, and tool call count per run. The agent
is free to decide whether to use the --full flag.

Results (21 runs each):

| Condition | Runs | Pass% | Avg input tok | Avg total tok | Avg snap chars | Avg wall (s) | Avg tool calls |
|---|---|---|---|---|---|---|---|
| opera-compact | 21 | 100% | 41,572 | 41,717 | 76.5k | 7.4 | 1.5 |
| opera-raw | 21 | 100% | 90,808 | 90,959 | 186.3k | 8.0 | 1.4 |
| axi | 21 | 100% | 97,036 | 97,224 | 187.4k | 9.9 | 1.8 |
| mcp-raw | 21 | 100% | 199,015 | 199,164 | 213.0k | 9.9 | 2.2 |

opera-compact saves 79% total tokens vs mcp-raw baseline, at identical 100% pass rate.
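
The 79% figure can be checked directly from the "Avg total tok" column. The short calculation below copies those four averages from the table above; `savings_vs` is just an illustrative helper name.

```python
# Avg total tokens per run, copied from the agentic benchmark table.
avg_total = {
    "opera-compact": 41_717,
    "opera-raw": 90_959,
    "axi": 97_224,
    "mcp-raw": 199_164,
}


def savings_vs(baseline: str, condition: str) -> float:
    """Fraction of total tokens saved by `condition` relative to `baseline`."""
    return 1 - avg_total[condition] / avg_total[baseline]


for name in ("opera-compact", "opera-raw", "axi"):
    print(f"{name}: {savings_vs('mcp-raw', name):.1%} fewer total tokens than mcp-raw")
```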


CLI Changes

  • src/snapshot.ts — Layer 1 compaction (role renames, echo-dedup, description dedup,
    numeric attr quotes, heading→markdown, text-run collapse) and URL LUT (Layer 2: dedup and
    whale-URL tokenisation with $uN trailer).
  • src/bridge.ts: GET /last-snapshot endpoint; caches the most recent take_snapshot
    result for use by the snap CLI command.
  • src/cli.ts / src/run.ts — wiring for the snapshot pipeline and new CLI flags.
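
To make the URL LUT idea concrete, here is a rough Python sketch of the technique described in the list above: repeated URLs and very long ("whale") URLs are replaced by $uN tokens, and the full values are appended in a urls: trailer. The real implementation lives in src/snapshot.ts (TypeScript) and may differ. The URL regex, the numbering starting at $u1, the trailer layout, and the sample snapshot lines are all assumptions made for this example.

```python
import re
from collections import Counter

WHALE_MIN_CHARS = 200  # threshold quoted in the commit message
URL_RE = re.compile(r"https?://\S+")  # assumption: URLs are whitespace-delimited


def apply_url_lut(text: str) -> tuple[str, dict[str, str]]:
    """Replace repeated or very long URLs with $uN tokens and append a trailer."""
    counts = Counter(URL_RE.findall(text))
    url_to_token: dict[str, str] = {}

    def substitute(match: re.Match) -> str:
        url = match.group(0)
        if counts[url] < 2 and len(url) < WHALE_MIN_CHARS:
            return url  # short and unique: leave it inline
        # Tokens are numbered in first-appearance order, so identical
        # input always yields identical token assignments.
        if url not in url_to_token:
            url_to_token[url] = f"$u{len(url_to_token) + 1}"
        return url_to_token[url]

    body = URL_RE.sub(substitute, text)
    url_map = {token: url for url, token in url_to_token.items()}
    if url_map:
        trailer = "\n".join(f"{token} {url}" for token, url in url_map.items())
        body = f"{body}\nurls:\n{trailer}"
    return body, url_map


# Hypothetical snapshot lines; the real snapshot syntax is not shown in this PR.
snapshot = (
    'link "Docs" url=https://example.com/docs\n'
    'link "Docs again" url=https://example.com/docs\n'
    'link "Home" url=https://example.com/'
)
body, url_map = apply_url_lut(snapshot)
print(body)
```

Each tokenised URL is written out in full only once, in the trailer, so the saving grows with the number of repeats and with URL length.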

mugorski and others added 12 commits May 21, 2026 09:37
- compactSnapshot(): drop noise nodes, normalise PascalCase roles,
  convert headings to markdown, rewrite refs to @PAGE.ELEM dot form
  (better BPE tokenisation than uid=X_Y), strip ARIA default attributes
- Collapse consecutive same-indent text siblings into one line; drop
  the merged line when it echoes the parent label
- cleanUrl(): drop javascript:/data: URLs, same-origin → relative paths,
  strip generic cross-site tracking params (utm_*, gclid/fbclid family);
  removed Amazon-specific params (ie, _encoding, ref_, pd_rd_, pf_rd_)
- applyUrlLut(): dedup repeated URLs and hide whale URLs (≥200 chars)
  behind $uN tokens; full values printed in urls: footer trailer
- Compact truncation limit lowered 16k → 12k chars (raw keeps 16k);
  compaction savings recover the headroom
- --raw flag on all snapshot commands to bypass compaction
- `url <$uN|@ref>` command to resolve LUT tokens and element refs
- Test fixture (test/fixtures/elements.html) covering all major element types
- Task benchmark prompts in test/tasks/ for compact vs raw cost comparison
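
The cleanUrl() rules listed in the commit message can be sketched in a few lines of Python. This is only an illustration built on the standard library's urllib.parse. The project's function is TypeScript, and the exact tracking-parameter list ("gclid/fbclid family") is not spelled out in the PR, so `TRACKING_KEYS` below is a deliberately short assumption, as is the `clean_url` signature.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid"}  # assumed subset; the real list is likely longer


def clean_url(url: str, page_url: str) -> Optional[str]:
    """Return a shortened URL, or None for javascript:/data: URLs."""
    if url.lower().startswith(("javascript:", "data:")):
        return None
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)
    ]
    parts = parts._replace(query=urlencode(kept))
    page = urlsplit(page_url)
    if (parts.scheme, parts.netloc) == (page.scheme, page.netloc):
        # Same origin: scheme and host are redundant, keep the relative path.
        parts = parts._replace(scheme="", netloc="")
    return urlunsplit(parts)


print(clean_url("https://example.com/a?utm_source=x&id=7&gclid=abc", "https://example.com/"))
print(clean_url("https://other.org/p?fbclid=1", "https://example.com/"))
print(clean_url("javascript:void(0)", "https://example.com/"))
```

One caveat of this sketch: parse_qsl followed by urlencode can re-encode the remaining query values, so a production version may prefer to filter the raw query string instead.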
@kzajac-opera kzajac-opera self-assigned this May 21, 2026
Contributor

@macieju-opera macieju-opera left a comment

Overall: solid work. Two blockers to fix before merge, plus a couple of suggestions below.

Comment thread src/bridge.ts Outdated
    writeJson(res, 200, lastSnapshot);
  }
  return;
}
Contributor

Blocker — duplicate handler (dead code). This block (lines 306–313) is an exact copy of the one 9 lines above it (297–304). The first block always returns, so this one is unreachable. Delete it.

Contributor Author

Ok, fixed.

Comment thread src/client.ts Outdated
  raw: string;
  pageUrl: string | null;
  capturedAt: number;
}
Contributor

Blocker — duplicate interface. CachedSnapshot here has the exact same shape as LastSnapshotCache exported from bridge.ts ({ raw, pageUrl, capturedAt }). They're the same wire format under two names — one field addition will require updating both. Either import LastSnapshotCache from bridge.ts here, or define a shared type in snapshot.ts that both files import.

Contributor Author

Ok, changed as requested.

Comment thread src/cli.ts Outdated
// Re-derive the full (non-truncated) URL map so tokens match what the agent
// saw, regardless of the truncation applied to the original output.
const compact = compactSnapshot(raw);
const { body, urlMap } = applyUrlLut(compact);
Contributor

Suggestion — token map may not match what the agent saw. applyUrlLut is called here on the full (non-truncated) snapshot, but the $uN tokens the agent saw were derived from the truncated version. A URL that appears once in the visible window and once in the truncated tail would not be tokenised in the agent's output, but would be in the full snapshot — potentially with a different index. To guarantee exact match, cache the urlMap alongside raw in LastSnapshotCache and read it back here instead of re-deriving it.

Contributor Author

Fixed - the urlMap is now written to last-url-map.json immediately after the truncated render, so the url $uN command always reads back the exact same token assignments the agent saw.
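
The mismatch discussed in this thread is easy to reproduce with a toy tokeniser. The sketch below is hypothetical (`tokenise_repeats` is not the project's applyUrlLut) and only tokenises URLs that occur at least twice. It shows that the truncated render and the full snapshot can assign the same $uN token to different URLs, which is the case that persisting the map from the truncated render avoids.

```python
import re
from collections import Counter


def tokenise_repeats(text: str) -> dict[str, str]:
    """Map $uN tokens to URLs that occur at least twice, in first-seen order."""
    urls = re.findall(r"https?://\S+", text)
    counts = Counter(urls)
    url_map: dict[str, str] = {}
    for url in urls:
        if counts[url] >= 2 and url not in url_map.values():
            url_map[f"$u{len(url_map) + 1}"] = url
    return url_map


full = "\n".join([
    "link https://example.com/a",
    "link https://example.com/b",
    "link https://example.com/b",
    "link https://example.com/a",  # second occurrence of /a sits in the tail
])
truncated = "\n".join(full.splitlines()[:3])  # what the agent actually saw

seen_by_agent = tokenise_repeats(truncated)
rederived = tokenise_repeats(full)
print("agent saw: ", seen_by_agent)
print("re-derived:", rederived)
```

Here $u1 refers to /b in the output the agent saw but to /a after re-deriving from the full snapshot, so resolving a token against the re-derived map would return the wrong URL.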

Contributor

@macieju-opera macieju-opera left a comment

Python benchmark code review (benchmarks/).

self.tool_call_count += len(turn.tool_calls)
for tc in turn.tool_calls:
    if tc.name in SNAPSHOT_TOOLS:
        self.snapshot_chars.append(len(tool_results[tc.call_id]))
Contributor

Bug — inconsistent dict access. Line 64 uses tool_results[tc.call_id] (raises KeyError if the key is missing), but line 70 in the same method uses tool_results.get(tc.call_id, "") (safe). The .get() on line 70 implies someone already knew this key could be absent. If dispatch() ever raises and is caught upstream before populating all call IDs, line 64 crashes the run. Use .get(tc.call_id) here for consistency, and decide what the right default is (probably "" or 0).

Contributor Author

This is not really a bug, but changed for better consistency.

)
raw = turn.text.strip()
if raw.startswith("```"):
    raw = raw.split("```")[1].removeprefix("json")
Contributor

Suggestion — fragile code-fence stripping. raw.split("```")[1].removeprefix("json") works for the common ```json\n{...}\n``` format but silently breaks if the model uses ```json with a trailing space, or outputs multiple code blocks. A wrong parse returns {"pass": False, ...}, which poisons benchmark results without any loud failure. A more robust approach: re.search(r'```(?:json)?\s*([\s\S]+?)```', raw) and extract group 1, falling back to raw if no match.

Contributor Author

Ok, used the regex for parsing.
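
For reference, the regex approach suggested in this thread might look like the sketch below. The helper names `extract_json_payload` and `parse_verdict` are hypothetical, and the merged benchmark code is not shown in this conversation, so treat this as one possible shape, not the actual fix. The fence marker is built programmatically only so the example nests cleanly inside a markdown code block.

```python
import json
import re

FENCE = "`" * 3  # three backticks
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*([\s\S]+?)" + FENCE)


def extract_json_payload(raw: str) -> str:
    """Return the first fenced block's contents, or the stripped input if unfenced."""
    match = FENCE_RE.search(raw)
    return match.group(1).strip() if match else raw.strip()


def parse_verdict(raw: str) -> dict:
    """Parse the judge reply; malformed JSON raises instead of being recorded as a fail."""
    return json.loads(extract_json_payload(raw))


reply = "Verdict:\n" + FENCE + 'json \n{"pass": true}\n' + FENCE + "\nDone."
print(parse_verdict(reply))
```

Letting json.JSONDecodeError propagate (or logging it loudly) addresses the second half of the review comment: an unparseable reply then shows up as an error instead of a silent {"pass": False}.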

Contributor

@macieju-opera macieju-opera left a comment

Non-blocking suggestions — TypeScript and Python.

Comment thread src/snapshot.ts
* actually see in the body. Token IDs are assigned in tree-walk (top-down)
* order and are therefore deterministic for identical input.
*/
export function applyUrlLut(text: string): UrlLutResult {
Contributor

Suggestion — split file responsibilities. snapshot.ts now covers ref conversion, compaction, URL cleaning, LUT tokenisation, and URL resolution (~500 lines, up from ~90). The URL LUT layer (applyUrlLut, resolveUrl, cleanUrl, UrlLutResult, whalePreview) is a self-contained concern that would sit cleanly in src/url-lut.ts. Makes both files easier to navigate and test in isolation.

Contributor Author

I'd keep this out of scope for now, we can address in some follow-up to avoid too many changes to the source package, WDYT?

Comment thread src/bridge.ts
}

/** Reset the snapshot cache — for use in tests only. */
export function resetLastSnapshotCache(): void {
Contributor

Suggestion — don't export test-only helpers. resetLastSnapshotCache is labelled "for use in tests only" but is part of the public module export. Either unexport it and access lastSnapshot indirectly in tests (e.g. via the /last-snapshot endpoint, which tests already exercise), or move it to a dedicated test helper file. Leaking it here means any consumer of bridge.ts can call it.

Contributor Author

@kzajac-opera kzajac-opera May 22, 2026

I'd keep it as it is right now, the function accesses the private module variables and extracting this could be a larger refactor.

@property
def all_errored(self) -> bool:
    """True if every tool call returned an error — indicates the tool is not installed/running."""
    return bool(self.records) and all(r.result.startswith("[error:") for r in self.records)
Contributor

Suggestion — all_errored misses the two most common failure modes. _run() returns [error: ...] only for FileNotFoundError. Timeouts return [timeout after Ns] and non-zero exits return the stderr text or [exit N] — neither matches startswith("[error:"), so all_errored stays False even when every call failed. Fix: check all three, e.g.:

_ERROR_PREFIXES = ("[error:", "[timeout", "[exit ", "[unknown")

@property
def all_errored(self) -> bool:
    return bool(self.records) and all(
        r.result.startswith(_ERROR_PREFIXES) for r in self.records
    )

Contributor Author

Extracted.

judge_model = args.judge_model or models_cfg["judge"]["model"]
judge_effort = args.judge_reasoning_effort or models_cfg["judge"]["reasoning_effort"]

selected_conditions = args.conditions.split(",") if args.conditions else list(all_conditions.keys())
Contributor

Suggestion — strip whitespace from comma-split args. args.conditions.split(",") produces " b" for --conditions "a, b", which then fails validation at line 189 with a confusing "Unknown condition" error. Same issue on line 185 for --tasks.

selected_conditions = [c.strip() for c in args.conditions.split(",")] if args.conditions else list(all_conditions.keys())
selected_tasks = [t.strip() for t in args.tasks.split(",")] if args.tasks else list(all_tasks.keys())

Contributor Author

Fixed.
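
Since the same parsing is needed for both --conditions and --tasks, it could be factored into one helper. The sketch below is an assumption about how that might look: `parse_csv_arg` is a hypothetical name, and dropping empty items (for example from a trailing comma) goes slightly beyond what the review asked for.

```python
from typing import Optional


def parse_csv_arg(value: Optional[str], known: list[str]) -> list[str]:
    """Split a comma-separated CLI value, trim whitespace, reject unknown names."""
    if not value:
        return list(known)  # flag omitted: select everything
    selected = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [item for item in selected if item not in known]
    if unknown:
        raise ValueError(f"Unknown name(s): {', '.join(unknown)}")
    return selected


conditions = ["opera-compact", "opera-raw", "axi", "mcp-raw"]
print("opera-compact, axi".split(","))  # naive split keeps the leading space
print(parse_csv_arg("opera-compact, axi", conditions))
```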


class Client:
    def __init__(self, model: str, reasoning_effort: str = "medium"):
        self._api = openai.OpenAI()
Contributor

Suggestion — fail fast on missing API key. openai.OpenAI() succeeds even without OPENAI_API_KEY set; the error surfaces only on the first API call with a cryptic auth message. Add a startup check:

if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")
self._api = openai.OpenAI()

Contributor Author

Good catch, this will be more explicit!

@kzajac-opera kzajac-opera force-pushed the snapshot-optimizations branch from a9dbdcc to ac89c74 Compare May 21, 2026 13:11
@kzajac-opera kzajac-opera force-pushed the snapshot-optimizations branch from b1abddc to 37b7625 Compare May 21, 2026 14:34
tiktoken_encoding: str = args.encoding or settings["tiktoken_encoding"]
run_id = datetime.now().strftime("%y%m%d%H%M")
results_dir = Path(__file__).parent.parent / settings["output_dir"] / run_id

Contributor

Same whitespace-strip fix applied to snapshot-agentic-use/src/run_benchmark.py was missed here. --conditions "a, b" will silently drop b.

wanted = {c.strip() for c in args.conditions.split(",")}

Contributor Author

Fixed.

@kzajac-opera kzajac-opera merged commit 3d26f62 into main May 22, 2026
2 checks passed
@kzajac-opera kzajac-opera deleted the snapshot-optimizations branch May 22, 2026 13:58