Snapshot optimizations by kzajac-opera · Pull Request #13 · operasoftware/opera-browser-cli

kzajac-opera · 2026-05-21T07:55:58Z

MR: Snapshot token optimizations + benchmarks

Summary

Adds two benchmarks that quantify the token cost of opera-browser-cli snapshot output, and
introduces the snapshot compaction layer (Layer 1 + URL LUT) that they measure.

Benchmarks

`benchmarks/page-token-benchmark`

Measures raw token cost of CLI output across flag combinations, with no LLM involved. Runs
opera-browser-cli open on 50 static pages (Wikipedia, GitHub, MDN, Python docs, RFC Editor)
and counts tokens via tiktoken. Seven conditions: opera-compact, opera-compact-full,
opera-raw, opera-raw-full, mcp-raw, axi, axi-full.

The --full option runs Opera CLI or AXI without the hard character limit on output.

Results (50 runs each):

| Condition | Runs | Avg tokens | Median tokens | p95 tokens |
| --- | --- | --- | --- | --- |
| `opera-compact` | 50 | 3,729 | 3,682 | 4,732 |
| `opera-raw` | 50 | 4,931 | 4,920 | 5,810 |
| `axi` | 50 | 4,986 | 4,908 | 5,736 |
| `opera-compact-full` | 50 | 60,595 | 24,294 | 256,104 |
| `mcp-raw` | 50 | 94,652 | 44,962 | 391,250 |
| `opera-raw-full` | 50 | 94,915 | 45,130 | 381,434 |
| `axi-full` | 50 | 98,469 | 46,586 | 396,891 |

`benchmarks/snapshot-agentic-use`

End-to-end agentic benchmark: an LLM agent completes 7 browser tasks (adapted from the
axi bench-browser benchmark) across 4 conditions, graded pass/fail by an LLM judge.
Captures input tokens, snapshot size, wall time, and tool call count per run. The agent
is free to use the --full flag or to omit it.

Results (21 runs each):

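| Condition | Runs | Pass% | Avg input tokens | Avg total tokens | Avg snapshot chars | Avg wall time (s) | Avg tool calls |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `opera-compact` | 21 | 100% | 41,572 | 41,717 | 76.5k | 7.4 | 1.5 |
| `opera-raw` | 21 | 100% | 90,808 | 90,959 | 186.3k | 8.0 | 1.4 |
| `axi` | 21 | 100% | 97,036 | 97,224 | 187.4k | 9.9 | 1.8 |
| `mcp-raw` | 21 | 100% | 199,015 | 199,164 | 213.0k | 9.9 | 2.2 |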

opera-compact saves 79% total tokens vs mcp-raw baseline, at identical 100% pass rate.
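
The 79% figure can be checked directly from the average total tokens reported in the table above. The short sketch below does that arithmetic; the dictionary and function names are illustrative and not part of the benchmark code.

```python
# Average total tokens per run, copied from the results table above.
avg_total_tokens = {
    "opera-compact": 41_717,
    "opera-raw": 90_959,
    "axi": 97_224,
    "mcp-raw": 199_164,
}


def savings_vs(baseline: str, condition: str) -> float:
    """Fraction of total tokens saved by `condition` relative to `baseline`."""
    base = avg_total_tokens[baseline]
    return (base - avg_total_tokens[condition]) / base


if __name__ == "__main__":
    for name in avg_total_tokens:
        print(f"{name}: {savings_vs('mcp-raw', name):.1%} saved vs mcp-raw")
```

The same helper gives the savings of any condition against any other baseline, for example opera-compact against opera-raw.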

CLI Changes

- src/snapshot.ts: Layer 1 compaction (role renames, echo-dedup, description dedup, numeric attr quotes, heading→markdown, text-run collapse) and URL LUT (Layer 2: dedup and whale-URL tokenisation with $uN trailer).
- src/bridge.ts: GET /last-snapshot endpoint; caches the most recent take_snapshot result for use by the snap CLI command.
- src/cli.ts / src/run.ts: wiring for the snapshot pipeline and new CLI flags.

@ref

- compactSnapshot(): drop noise nodes, normalise PascalCase roles, convert headings to markdown, rewrite refs to @PAGE.ELEM dot form (better BPE tokenisation than uid=X_Y), strip ARIA default attributes
- Collapse consecutive same-indent text siblings into one line; drop the merged line when it echoes the parent label
- cleanUrl(): drop javascript:/data: URLs, same-origin → relative paths, strip generic cross-site tracking params (utm_*, gclid/fbclid family); removed Amazon-specific params (ie, _encoding, ref_, pd_rd_, pf_rd_)
- applyUrlLut(): dedup repeated URLs and hide whale URLs (≥200 chars) behind $uN tokens; full values printed in urls: footer trailer
- Compact truncation limit lowered 16k → 12k chars (raw keeps 16k); compaction savings recover the headroom
- --raw flag on all snapshot commands to bypass compaction
- `url <$uN|@ref>` command to resolve LUT tokens and element refs
- Test fixture (test/fixtures/elements.html) covering all major element types
- Task benchmark prompts in test/tasks/ for compact vs raw cost comparison
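
To make the URL LUT idea concrete, here is a rough Python sketch of the dedup and whale-URL step. It is not the TypeScript code in src/snapshot.ts: the function names, the URL regex, the token numbering starting at 1, and the exact layout of the urls: trailer are all assumptions made for illustration. Only the 200-character threshold and the $uN token shape come from the description above.

```python
import re
from collections import Counter

WHALE_MIN_CHARS = 200  # threshold taken from the PR description
URL_RE = re.compile(r"https?://\S+")  # assumption: a deliberately simple matcher


def apply_url_lut(text: str) -> tuple[str, dict[str, str]]:
    """Replace repeated or very long URLs with $uN tokens.

    Returns the rewritten body and a token -> URL map. Tokens are numbered
    in order of first appearance, so identical input gives identical output.
    """
    counts = Counter(URL_RE.findall(text))
    token_for: dict[str, str] = {}

    def substitute(match: re.Match) -> str:
        url = match.group(0)
        if counts[url] < 2 and len(url) < WHALE_MIN_CHARS:
            return url  # unique and short: cheaper to leave inline
        if url not in token_for:
            token_for[url] = f"$u{len(token_for) + 1}"  # assumed numbering
        return token_for[url]

    body = URL_RE.sub(substitute, text)
    url_map = {token: url for url, token in token_for.items()}
    return body, url_map


def render(text: str) -> str:
    """Append a urls: trailer listing the full value behind every token."""
    body, url_map = apply_url_lut(text)
    if not url_map:
        return body
    footer = "\n".join(f"  {token} = {url}" for token, url in url_map.items())
    return f"{body}\nurls:\n{footer}"
```

A URL that occurs once and is short stays inline, since replacing it with a token plus a trailer entry would cost more characters than it saves.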


macieju-opera

Overall: solid work. Two blockers to fix before merge, plus a couple of suggestions below.

macieju-opera · 2026-05-21T12:57:04Z

+      writeJson(res, 200, lastSnapshot);
+    }
+    return;
+  }


Blocker — duplicate handler (dead code). This block (lines 306–313) is an exact copy of the one 9 lines above it (297–304). The first block always returns, so this one is unreachable. Delete it.

macieju-opera · 2026-05-21T12:57:04Z

+  raw: string;
+  pageUrl: string | null;
+  capturedAt: number;
+}


Blocker — duplicate interface. CachedSnapshot here has the exact same shape as LastSnapshotCache exported from bridge.ts ({ raw, pageUrl, capturedAt }). They're the same wire format under two names — one field addition will require updating both. Either import LastSnapshotCache from bridge.ts here, or define a shared type in snapshot.ts that both files import.

Ok, changed as requested.

macieju-opera · 2026-05-21T12:57:04Z

+  // Re-derive the full (non-truncated) URL map so tokens match what the agent
+  // saw, regardless of the truncation applied to the original output.
+  const compact = compactSnapshot(raw);
+  const { body, urlMap } = applyUrlLut(compact);


Suggestion — token map may not match what the agent saw. applyUrlLut is called here on the full (non-truncated) snapshot, but the $uN tokens the agent saw were derived from the truncated version. A URL that appears once in the visible window and once in the truncated tail would not be tokenised in the agent's output, but would be in the full snapshot — potentially with a different index. To guarantee exact match, cache the urlMap alongside raw in LastSnapshotCache and read it back here instead of re-deriving it.

Fixed - the urlMap is now written to last-url-map.json immediately after the truncated render, so the url $uN command always reads back the exact same token assignments the agent saw.
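
The persist-then-read-back approach described in this reply could look roughly like the Python sketch below. The real change lives in the TypeScript CLI; the function names here are hypothetical, and only the last-url-map.json file name comes from the reply. The point is that the resolver never re-derives the map, so it cannot disagree with what the agent saw.

```python
import json
from pathlib import Path
from typing import Optional


def save_url_map(url_map: dict[str, str], path: Path) -> None:
    """Persist the token -> URL map produced by the render the agent saw."""
    path.write_text(json.dumps(url_map, indent=2), encoding="utf-8")


def resolve_token(token: str, path: Path) -> Optional[str]:
    """Look up a $uN token in the persisted map.

    Returns None when the token is unknown or no map has been written yet.
    """
    if not path.exists():
        return None
    url_map = json.loads(path.read_text(encoding="utf-8"))
    return url_map.get(token)
```

Calling save_url_map() immediately after the truncated render, and only ever calling resolve_token() afterwards, keeps token assignments stable regardless of what the truncated tail contained.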


macieju-opera

Python benchmark code review (benchmarks/).

macieju-opera · 2026-05-21T13:01:20Z

+        self.tool_call_count += len(turn.tool_calls)
+        for tc in turn.tool_calls:
+            if tc.name in SNAPSHOT_TOOLS:
+                self.snapshot_chars.append(len(tool_results[tc.call_id]))


Bug — inconsistent dict access. Line 64 uses tool_results[tc.call_id] (raises KeyError if the key is missing), but line 70 in the same method uses tool_results.get(tc.call_id, "") (safe). The .get() on line 70 implies someone already knew this key could be absent. If dispatch() ever raises and is caught upstream before populating all call IDs, line 64 crashes the run. Use .get(tc.call_id) here for consistency, and decide what the right default is (probably "" or 0).

This is not really a bug, but changed for better consistency.
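
A minimal sketch of the consistent access pattern, assuming an empty string is the chosen default. The function below is a simplified stand-in for the tracker method: it takes plain (call_id, name) tuples rather than the benchmark's tool-call objects, and its name is hypothetical.

```python
def snapshot_sizes(tool_calls, tool_results, snapshot_tools):
    """Return the result length for each snapshot tool call.

    tool_calls: iterable of (call_id, tool_name) pairs
    tool_results: dict mapping call_id -> result text (may be incomplete)
    snapshot_tools: set of tool names that count as snapshots
    """
    sizes = []
    for call_id, name in tool_calls:
        if name in snapshot_tools:
            # .get() with a default records 0 chars instead of raising
            # KeyError when a call never produced a result.
            sizes.append(len(tool_results.get(call_id, "")))
    return sizes
```

With this default, a missing result shows up as a zero-length snapshot in the metrics rather than aborting the run.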

macieju-opera · 2026-05-21T13:01:20Z

+        )
+        raw = turn.text.strip()
+        if raw.startswith("```"):
+            raw = raw.split("```")[1].removeprefix("json")


Suggestion — fragile code-fence stripping. raw.split("```")[1].removeprefix("json") works for the common ```json\n{...}\n``` format but silently breaks if the model uses ```json with a trailing space, or outputs multiple code blocks. A wrong parse returns {"pass": False, ...}, which poisons benchmark results without any loud failure. A more robust approach: re.search(r'```(?:json)?\s*([\s\S]+?)```', raw) and extract group 1, falling back to raw if no match.

Ok, used the regex for parsing.
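
For reference, the regex approach from the suggestion can be sketched as below. The helper names are hypothetical, and raising on invalid JSON is one possible design choice rather than what the benchmark necessarily does. The fence marker is built with string repetition only so that the snippet contains no literal triple backticks.

```python
import json
import re

FENCE = "`" * 3  # a literal triple backtick
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*([\s\S]+?)" + FENCE)


def strip_code_fence(text: str) -> str:
    """Return the contents of the first fenced block, or the text itself."""
    raw = text.strip()
    match = FENCE_RE.search(raw)
    return match.group(1).strip() if match else raw


def parse_verdict(text: str) -> dict:
    """Parse a judge reply as JSON, failing loudly on malformed output."""
    payload = strip_code_fence(text)
    try:
        return json.loads(payload)
    except json.JSONDecodeError as exc:
        # Assumption: a loud error is preferable to recording a silent fail.
        raise ValueError(f"judge reply is not valid JSON: {payload[:80]!r}") from exc
```

The lazy quantifier stops at the first closing fence, so when the model emits several code blocks only the first one is parsed.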

macieju-opera

Non-blocking suggestions — TypeScript and Python.

macieju-opera · 2026-05-21T13:05:59Z

+ * actually see in the body.  Token IDs are assigned in tree-walk (top-down)
+ * order and are therefore deterministic for identical input.
+ */
+export function applyUrlLut(text: string): UrlLutResult {


Suggestion — split file responsibilities. snapshot.ts now covers ref conversion, compaction, URL cleaning, LUT tokenisation, and URL resolution (~500 lines, up from ~90). The URL LUT layer (applyUrlLut, resolveUrl, cleanUrl, UrlLutResult, whalePreview) is a self-contained concern that would sit cleanly in src/url-lut.ts. Makes both files easier to navigate and test in isolation.

I'd keep this out of scope for now; we can address it in a follow-up to avoid too many changes to the source package. WDYT?

macieju-opera · 2026-05-21T13:05:59Z

+}
+
+/** Reset the snapshot cache — for use in tests only. */
+export function resetLastSnapshotCache(): void {


Suggestion — don't export test-only helpers. resetLastSnapshotCache is labelled "for use in tests only" but is part of the public module export. Either unexport it and access lastSnapshot indirectly in tests (e.g. via the /last-snapshot endpoint, which tests already exercise), or move it to a dedicated test helper file. Leaking it here means any consumer of bridge.ts can call it.

I'd keep it as it is for now. The function accesses private module variables, and extracting it could be a larger refactor.

macieju-opera · 2026-05-21T13:05:59Z

+    @property
+    def all_errored(self) -> bool:
+        """True if every tool call returned an error — indicates the tool is not installed/running."""
+        return bool(self.records) and all(r.result.startswith("[error:") for r in self.records)


Suggestion — all_errored misses the two most common failure modes. _run() returns [error: ...] only for FileNotFoundError. Timeouts return [timeout after Ns] and non-zero exits return the stderr text or [exit N] — neither matches startswith("[error:"), so all_errored stays False even when every call failed. Fix: check all three, e.g.:

    _ERROR_PREFIXES = ("[error:", "[timeout", "[exit ", "[unknown")

    @property
    def all_errored(self) -> bool:
        return bool(self.records) and all(
            r.result.startswith(_ERROR_PREFIXES) for r in self.records
        )

macieju-opera · 2026-05-21T13:06:00Z

+    judge_model = args.judge_model or models_cfg["judge"]["model"]
+    judge_effort = args.judge_reasoning_effort or models_cfg["judge"]["reasoning_effort"]
+
+    selected_conditions = args.conditions.split(",") if args.conditions else list(all_conditions.keys())


Suggestion — strip whitespace from comma-split args. args.conditions.split(",") produces " b" for --conditions "a, b", which then fails validation at line 189 with a confusing "Unknown condition" error. Same issue on line 185 for --tasks.

    selected_conditions = [c.strip() for c in args.conditions.split(",")] if args.conditions else list(all_conditions.keys())
    selected_tasks = [t.strip() for t in args.tasks.split(",")] if args.tasks else list(all_tasks.keys())
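
Since the same split-and-strip logic is needed for both --conditions and --tasks (and again in the other benchmark), one option is a small shared helper. The sketch below is only a suggestion: split_csv_arg is a hypothetical name, and dropping empty entries goes slightly beyond the fix proposed above.

```python
def split_csv_arg(value, default):
    """Split a comma-separated CLI value into trimmed, non-empty items.

    Falls back to a copy of `default` when the flag was not given.
    """
    if not value:
        return list(default)
    return [item.strip() for item in value.split(",") if item.strip()]
```

With this helper, --conditions "a, b" yields the names a and b without the leading space that previously triggered the "Unknown condition" error.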

macieju-opera · 2026-05-21T13:06:00Z

+
+class Client:
+    def __init__(self, model: str, reasoning_effort: str = "medium"):
+        self._api = openai.OpenAI()


Suggestion — fail fast on missing API key. openai.OpenAI() succeeds even without OPENAI_API_KEY set; the error surfaces only on the first API call with a cryptic auth message. Add a startup check:

    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is not set")
    self._api = openai.OpenAI()

Good catch, this will be more explicit!

macieju-opera · 2026-05-22T12:41:12Z

+    tiktoken_encoding: str = args.encoding or settings["tiktoken_encoding"]
+    run_id = datetime.now().strftime("%y%m%d%H%M")
+    results_dir = Path(__file__).parent.parent / settings["output_dir"] / run_id
+


Same whitespace-strip fix applied to snapshot-agentic-use/src/run_benchmark.py was missed here. --conditions "a, b" will silently drop b.

wanted = {c.strip() for c in args.conditions.split(",")}

mugorski and others added 12 commits May 21, 2026 09:37

feat: add python package for benchmarking

7ee028b

feat: add python package for benchmarking

4df1d09

feat: add benchmark report generation

1770e5b

feat: add benchmark configuration files

a5cc1e8

docs: add benchmark documentation

602cb59

chore: Add python CI and reformat code

394aedd

chore: Use secure command parsing with shlex

2726d16

chore: Rename snapshot-efficiency to snapshot-agentic-use

f7858b1

chore: Move python configuration one-level up

6c4f3da

chore: Add shared code with token counter

0bfdaab

kzajac-opera self-assigned this May 21, 2026

kzajac-opera requested a review from macieju-opera May 21, 2026 12:15

macieju-opera reviewed May 21, 2026


kzajac-opera added 12 commits May 21, 2026 14:57

docs: Update benchmark docs

3982ac5

feat: Add benchmark config for snapshotting single page

3a58b59

docs: Document snapshot benchmark

3aab459

fix: Fix linter issues after mv

2d39628

chore: Add more conditions to page-token-benchmark

15c9b82

feat: Add explicit --full option to agentic-use-benchmark

6a45631

chore: Move token_counter.py

5368860

fix: Explicitly request open before snapshot in every mode

b501288

feat: Update SKILL.md after token optimization

5aa9f2f

fix: Set devtools ports explicitly to avoid port collision

be6a2fc

fix: Update tasks for wikipedia extraction -> year change from 2024 to 2025

6b012dd

chore: remove running benchmark with and without --full and always assume --full

856366c

macieju-opera reviewed May 21, 2026


docs: Combine two CLAUDE files into one for benchmarks

5ce1714

docs: Mention the results in main README

ac89c74

kzajac-opera force-pushed the snapshot-optimizations branch from a9dbdcc to ac89c74 Compare May 21, 2026 13:11

kzajac-opera added 2 commits May 21, 2026 15:23

fix: Fixes from review

440542f

fix: Add python safety check and verbose error handling

37b7625

kzajac-opera force-pushed the snapshot-optimizations branch from b1abddc to 37b7625 Compare May 21, 2026 14:34

kzajac-opera added 5 commits May 22, 2026 11:16

fix: Error message readability fix

3be0b73

chore: execute in benchmark run in different directory

44319ea

chore: use consistent table formatting in report and README

55e0341

chore: add debug helper for CLAUDE

4672b00

docs: Document experiment results in README

33567f6

macieju-opera reviewed May 22, 2026


fix: strip whitespace from --conditions split

485aa49

macieju-opera approved these changes May 22, 2026


kzajac-opera merged commit 3d26f62 into main May 22, 2026
2 checks passed

kzajac-opera deleted the snapshot-optimizations branch May 22, 2026 13:58
