Snapshot optimizations#13
Conversation
- compactSnapshot(): drop noise nodes, normalise PascalCase roles, convert headings to markdown, rewrite refs to @PAGE.ELEM dot form (better BPE tokenisation than uid=X_Y), strip ARIA default attributes - Collapse consecutive same-indent text siblings into one line; drop the merged line when it echoes the parent label - cleanUrl(): drop javascript:/data: URLs, same-origin → relative paths, strip generic cross-site tracking params (utm_*, gclid/fbclid family); removed Amazon-specific params (ie, _encoding, ref_, pd_rd_, pf_rd_) - applyUrlLut(): dedup repeated URLs and hide whale URLs (≥200 chars) behind $uN tokens; full values printed in urls: footer trailer - Compact truncation limit lowered 16k → 12k chars (raw keeps 16k); compaction savings recover the headroom - --raw flag on all snapshot commands to bypass compaction - `url <$uN|@ref>` command to resolve LUT tokens and element refs - Test fixture (test/fixtures/elements.html) covering all major element types - Task benchmark prompts in test/tasks/ for compact vs raw cost comparison
- compactSnapshot(): drop noise nodes, normalise PascalCase roles, convert headings to markdown, rewrite refs to @PAGE.ELEM dot form (better BPE tokenisation than uid=X_Y), strip ARIA default attributes - Collapse consecutive same-indent text siblings into one line; drop the merged line when it echoes the parent label - cleanUrl(): drop javascript:/data: URLs, same-origin → relative paths, strip generic cross-site tracking params (utm_*, gclid/fbclid family); removed Amazon-specific params (ie, _encoding, ref_, pd_rd_, pf_rd_) - applyUrlLut(): dedup repeated URLs and hide whale URLs (≥200 chars) behind $uN tokens; full values printed in urls: footer trailer - Compact truncation limit lowered 16k → 12k chars (raw keeps 16k); compaction savings recover the headroom - --raw flag on all snapshot commands to bypass compaction - `url <$uN|@ref>` command to resolve LUT tokens and element refs - Test fixture (test/fixtures/elements.html) covering all major element types - Task benchmark prompts in test/tasks/ for compact vs raw cost comparison
macieju-opera
left a comment
There was a problem hiding this comment.
Overall: solid work. Two blockers to fix before merge, plus a couple of suggestions below.
| writeJson(res, 200, lastSnapshot); | ||
| } | ||
| return; | ||
| } |
There was a problem hiding this comment.
Blocker — duplicate handler (dead code). This block (lines 306–313) is an exact copy of the one 9 lines above it (297–304). The first block always returns, so this one is unreachable. Delete it.
| raw: string; | ||
| pageUrl: string | null; | ||
| capturedAt: number; | ||
| } |
There was a problem hiding this comment.
Blocker — duplicate interface. CachedSnapshot here has the exact same shape as LastSnapshotCache exported from bridge.ts ({ raw, pageUrl, capturedAt }). They're the same wire format under two names — one field addition will require updating both. Either import LastSnapshotCache from bridge.ts here, or define a shared type in snapshot.ts that both files import.
There was a problem hiding this comment.
Ok, changed as requested.
| // Re-derive the full (non-truncated) URL map so tokens match what the agent | ||
| // saw, regardless of the truncation applied to the original output. | ||
| const compact = compactSnapshot(raw); | ||
| const { body, urlMap } = applyUrlLut(compact); |
There was a problem hiding this comment.
Suggestion — token map may not match what the agent saw. applyUrlLut is called here on the full (non-truncated) snapshot, but the $uN tokens the agent saw were derived from the truncated version. A URL that appears once in the visible window and once in the truncated tail would not be tokenised in the agent's output, but would be in the full snapshot — potentially with a different index. To guarantee exact match, cache the urlMap alongside raw in LastSnapshotCache and read it back here instead of re-deriving it.
There was a problem hiding this comment.
Fixed - the urlMap is now written to last-url-map.json immediately after the truncated render, so the url $uN command always reads back the exact same token assignments the agent saw.
macieju-opera
left a comment
There was a problem hiding this comment.
Python benchmark code review (benchmarks/).
| self.tool_call_count += len(turn.tool_calls) | ||
| for tc in turn.tool_calls: | ||
| if tc.name in SNAPSHOT_TOOLS: | ||
| self.snapshot_chars.append(len(tool_results[tc.call_id])) |
There was a problem hiding this comment.
Bug — inconsistent dict access. Line 64 uses tool_results[tc.call_id] (raises KeyError if the key is missing), but line 70 in the same method uses tool_results.get(tc.call_id, "") (safe). The .get() on line 70 implies someone already knew this key could be absent. If dispatch() ever raises and is caught upstream before populating all call IDs, line 64 crashes the run. Use .get(tc.call_id) here for consistency, and decide what the right default is (probably "" or 0).
There was a problem hiding this comment.
This is not really a bug, but changed for better consistency.
| ) | ||
| raw = turn.text.strip() | ||
| if raw.startswith("```"): | ||
| raw = raw.split("```")[1].removeprefix("json") |
There was a problem hiding this comment.
Suggestion — fragile code-fence stripping. raw.split("\``")[1].removeprefix("json")works for the common```json\n{...}\n```format but silently breaks if the model uses```json with a trailing space, or outputs multiple code blocks. A wrong parse returns{"pass": False, ...}which poisons benchmark results without any loud failure. A more robust approach:re.search(r'```(?:json)?\s*([\s\S]+?)```', raw)and extract group 1, falling back toraw` if no match.
There was a problem hiding this comment.
Ok, used the regex for parsing.
macieju-opera
left a comment
There was a problem hiding this comment.
Non-blocking suggestions — TypeScript and Python.
| * actually see in the body. Token IDs are assigned in tree-walk (top-down) | ||
| * order and are therefore deterministic for identical input. | ||
| */ | ||
| export function applyUrlLut(text: string): UrlLutResult { |
There was a problem hiding this comment.
Suggestion — split file responsibilities. snapshot.ts now covers ref conversion, compaction, URL cleaning, LUT tokenisation, and URL resolution (~500 lines, up from ~90). The URL LUT layer (applyUrlLut, resolveUrl, cleanUrl, UrlLutResult, whalePreview) is a self-contained concern that would sit cleanly in src/url-lut.ts. Makes both files easier to navigate and test in isolation.
There was a problem hiding this comment.
I'd keep this out of scope for now, we can address in some follow-up to avoid too many changes to the source package, WDYT?
| } | ||
|
|
||
| /** Reset the snapshot cache — for use in tests only. */ | ||
| export function resetLastSnapshotCache(): void { |
There was a problem hiding this comment.
Suggestion — don't export test-only helpers. resetLastSnapshotCache is labelled "for use in tests only" but is part of the public module export. Either unexport it and access lastSnapshot indirectly in tests (e.g. via the /last-snapshot endpoint, which tests already exercise), or move it to a dedicated test helper file. Leaking it here means any consumer of bridge.ts can call it.
There was a problem hiding this comment.
I'd keep it as it is right now, the function accesses the private module variables and extracting this could be a larger refactor.
| @property | ||
| def all_errored(self) -> bool: | ||
| """True if every tool call returned an error — indicates the tool is not installed/running.""" | ||
| return bool(self.records) and all(r.result.startswith("[error:") for r in self.records) |
There was a problem hiding this comment.
Suggestion — all_errored misses the two most common failure modes. _run() returns [error: ...] only for FileNotFoundError. Timeouts return [timeout after Ns] and non-zero exits return the stderr text or [exit N] — neither matches startswith("[error:"), so all_errored stays False even when every call failed. Fix: check all three, e.g.:
_ERROR_PREFIXES = ("[error:", "[timeout", "[exit ", "[unknown")
@property
def all_errored(self) -> bool:
return bool(self.records) and all(
r.result.startswith(_ERROR_PREFIXES) for r in self.records
)| judge_model = args.judge_model or models_cfg["judge"]["model"] | ||
| judge_effort = args.judge_reasoning_effort or models_cfg["judge"]["reasoning_effort"] | ||
|
|
||
| selected_conditions = args.conditions.split(",") if args.conditions else list(all_conditions.keys()) |
There was a problem hiding this comment.
Suggestion — strip whitespace from comma-split args. args.conditions.split(",") produces " b" for --conditions "a, b", which then fails validation at line 189 with a confusing "Unknown condition" error. Same issue on line 185 for --tasks.
selected_conditions = [c.strip() for c in args.conditions.split(",")] if args.conditions else list(all_conditions.keys())
selected_tasks = [t.strip() for t in args.tasks.split(",")] if args.tasks else list(all_tasks.keys())|
|
||
| class Client: | ||
| def __init__(self, model: str, reasoning_effort: str = "medium"): | ||
| self._api = openai.OpenAI() |
There was a problem hiding this comment.
Suggestion — fail fast on missing API key. openai.OpenAI() succeeds even without OPENAI_API_KEY set; the error surfaces only on the first API call with a cryptic auth message. Add a startup check:
if not os.environ.get("OPENAI_API_KEY"):
raise RuntimeError("OPENAI_API_KEY is not set")
self._api = openai.OpenAI()There was a problem hiding this comment.
Good catch, this will be more explicit!
a9dbdcc to
ac89c74
Compare
b1abddc to
37b7625
Compare
| tiktoken_encoding: str = args.encoding or settings["tiktoken_encoding"] | ||
| run_id = datetime.now().strftime("%y%m%d%H%M") | ||
| results_dir = Path(__file__).parent.parent / settings["output_dir"] / run_id | ||
|
|
There was a problem hiding this comment.
Same whitespace-strip fix applied to snapshot-agentic-use/src/run_benchmark.py was missed here. --conditions "a, b" will silently drop b.
wanted = {c.strip() for c in args.conditions.split(",")}
MR: Snapshot token optimizations + benchmarks
Summary
Adds two benchmarks that quantify the token cost of
opera-browser-clisnapshot output, andintroduces the snapshot compaction layer (Layer 1 + URL LUT) that they measure.
Benchmarks
benchmarks/page-token-benchmarkMeasures raw token cost of CLI output across flag combinations, with no LLM involved. Runs
opera-browser-cli openon 50 static pages (Wikipedia, GitHub, MDN, Python docs, RFC Editor)and counts tokens via tiktoken. Seven conditions:
opera-compact,opera-compact-full,opera-raw,opera-raw-full,mcp-raw,axi,axi-full.--fulloption runs Opera CLI or AXI without hard limit of characters.Results (50 runs each):
opera-compactopera-rawaxiopera-compact-fullmcp-rawopera-raw-fullaxi-fullbenchmarks/snapshot-agentic-useEnd-to-end agentic benchmark: an LLM agent completes 7 browser tasks (adapted from the
axi bench-browser benchmark)
across 4 conditions, graded pass/fail by an LLM judge. Captures input tokens, snapshot size,
wall time, and tool call count per run. Agent is able to use
--fullflag or not use it.Results (21 runs each):
CLI Changes
src/snapshot.ts— Layer 1 compaction (role renames, echo-dedup, description dedup,numeric attr quotes, heading→markdown, text-run collapse) and URL LUT (Layer 2: dedup and
whale-URL tokenisation with
$uNtrailer).src/bridge.ts—GET /last-snapshotendpoint; caches the most recenttake_snapshotresult for use by the
snapCLI command.src/cli.ts/src/run.ts— wiring for the snapshot pipeline and new CLI flags.