…ts, extract command
Three fixes/additions driven by agent-usage gaps, as one complete change:
- network (P0 fix): lift silent 4000-char body truncation in CDP + extension
paths to an 8MB memory-guard cap, and surface body_truncated / body_full_size
/ body_truncation_reason in the --detail envelope so the agent sees when a
body was cut. List view also exposes body_truncated_count and per-entry flag.
Adds --max-body flag for explicit caller-side capping.
- get html --as json (P1): add --depth / --children-max / --text-max budget
knobs on the tree serializer, plus a truncated={depth,children_dropped,
text_truncated} envelope that only appears when a budget is hit. Lets the
agent narrow DOM output without walking away empty-handed.
- extract (P2 new command): agent-native article/content channel. Scope →
denoise (strip nav/header/footer/scripts/forms/etc.) → HTML→markdown via
existing htmlToMarkdown → paragraph-boundary-aware chunk with stateless
next_start_char resume cursor. Agents no longer misuse `get html` to read.
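The paragraph-boundary-aware chunking with a stateless resume cursor can be sketched roughly as follows. This is a minimal illustration, not the PR's `chunkMarkdown`: the function name `chunkAt`, its signature, the backwards-only boundary search, and the minimum-size clamp are assumptions. The envelope field names and the 15% window come from the PR description.

```typescript
// Illustrative envelope; field names mirror the ones listed in the PR description.
interface Chunk {
  content: string;
  total_chars: number;
  chunk_size: number;
  start: number;
  end: number;
  next_start_char: number | null; // null once the document is exhausted
}

// Hypothetical helper: cut one chunk starting at `start`, snapping the end back
// to a paragraph break (or a line break) if one lies inside the boundary window.
function chunkAt(markdown: string, start: number, chunkSize: number, windowRatio = 0.15): Chunk {
  const total = markdown.length;
  const size = Math.max(1, Math.floor(chunkSize)); // assumed clamp so every call makes progress
  const from = Math.min(Math.max(start, 0), total);
  let end = Math.min(from + size, total);

  if (end < total) {
    // Only look for a boundary in the last `windowRatio` share of the chunk.
    const windowStart = Math.max(from + 1, end - Math.floor(size * windowRatio));
    const para = markdown.lastIndexOf("\n\n", end - 2);
    const line = markdown.lastIndexOf("\n", end - 1);
    if (para >= windowStart) {
      end = para + 2; // cut just after the blank line
    } else if (line >= windowStart) {
      end = line + 1; // cut just after the line break
    }
    // Otherwise keep the hard cut at `end`.
  }

  return {
    content: markdown.slice(from, end),
    total_chars: total,
    chunk_size: size,
    start: from,
    end,
    next_start_char: end < total ? end : null,
  };
}

// Resume chain: the caller keeps no state besides the cursor it was handed.
const doc = "aaaa\n\nbbbb\n\ncccc";
const pieces: string[] = [];
let cursor: number | null = 0;
while (cursor !== null) {
  const chunk: Chunk = chunkAt(doc, cursor, 8, 0.5); // wide window so the tiny demo snaps
  pieces.push(chunk.content);
  cursor = chunk.next_start_char;
}
console.log(pieces);
```

Because the cursor is just a character offset into the same markdown, the chunks concatenate back to the original text and any call can be repeated without server-side state.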
…/fallback
Addresses review blockers on #1104:
- NETWORK_INTERCEPTOR_JS fallback no longer silently drops bodies above the
per-entry cap. Raised cap to 1 MiB (ring stays at 200 entries), and on
overflow keeps the string prefix + sets `bodyTruncated` / `bodyFullSize` so
`browser network` propagates the same agent-visible signal the CDP /
extension paths emit.
- `CachedNetworkEntry` schema switches from internal camelCase `bodyTruncated`
to the user-facing `body_truncated` / `body_full_size` fields. `--raw` emits
cache entries verbatim, so this removes the snake_case/camelCase split across
list / --detail / --raw.
- Adds a `--raw` truncation-contract test that also asserts the camelCase
fields do not leak through.
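The "keep the prefix and flag it" behaviour can be sketched as below. This is a hedged illustration only: `capBody` and `PER_ENTRY_CAP` are hypothetical names, the cap is measured in string length as an assumption, and whether `body_full_size` is emitted for untruncated bodies is not stated in the commit message.

```typescript
// User-facing field names taken from the commit message; the shape is illustrative.
interface CappedBody {
  body: string;
  body_truncated: boolean;
  body_full_size: number;
}

// Assumption: 1 MiB counted in UTF-16 code units (string length), not bytes.
const PER_ENTRY_CAP = 1024 * 1024;

// Hypothetical helper: never drop a body, keep its prefix and record the cut.
function capBody(raw: string, cap: number = PER_ENTRY_CAP): CappedBody {
  if (raw.length <= cap) {
    return { body: raw, body_truncated: false, body_full_size: raw.length };
  }
  return { body: raw.slice(0, cap), body_truncated: true, body_full_size: raw.length };
}

const small = capBody("hello", 10);
const big = capBody("x".repeat(25), 10);
console.log(small.body_truncated, big.body_truncated, big.body_full_size);
```

Using one snake_case shape at the point of capture is what lets list, `--detail`, and `--raw` output agree without a translation layer.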
luxiaolei pushed a commit to luxiaolei/OpenCLI that referenced this pull request on Apr 21, 2026
…ts, extract command (jackwener#1104)
* feat(browser): agent-native payload — network bodies, html tree budgets, extract command
* fix(browser): unify body-truncation signal contract across raw/detail/fallback
(cherry picked from commit acb08a4)
Why
Three independent agent-usage gaps in the `browser` surface, addressed as one complete change per direction from the #opencli-browser thread:
- `network --detail` silently cut response bodies at 4000 chars, breaking `JSON.parse` on any non-trivial API response. The cut was invisible to the agent.
- `get html --as json` had no budget knobs, so the agent got either a full-page tree dump or nothing, with no middle ground when scoping down.
- Agents misused `get html` to try to read articles, but it returns DOM structure, not prose.
All three come down to the same first-principles requirement: an agent-readable payload shape, with visible truncation so the agent can continue rather than fail silently.
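The first gap is easy to reproduce in isolation. The sketch below uses a made-up payload and a plain `slice(0, 4000)` to stand in for the old silent cut. It is not the capture code from this PR.

```typescript
// Hypothetical API response; any JSON body longer than the cut point behaves the same.
const body = JSON.stringify({
  items: Array.from({ length: 1000 }, (_, i) => ({ id: i, name: `item-${i}` })),
});

// Stand-in for the old behaviour: a silent slice at 4000 characters.
const sliced = body.slice(0, 4000);

function tryParse(text: string): { ok: boolean; error?: string } {
  try {
    JSON.parse(text);
    return { ok: true };
  } catch (err) {
    return { ok: false, error: (err as Error).message };
  }
}

const full = tryParse(body);
const cut = tryParse(sliced);
console.log(full.ok, cut.ok);
```

A proper prefix of a JSON object is never valid JSON, so without a truncation flag the agent only sees a parse error and cannot tell a cut body from a malformed response.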
What changed
`browser network`: body truncation becomes visible
- `src/browser/cdp.ts` + `extension/src/cdp.ts`: raise the silent 4KB slice to an 8MB memory-guard cap and record `responseBodyFullSize` / `responseBodyTruncated` (extension path also gets request-body mirrors).
- `src/browser/network-cache.ts`: `CachedNetworkEntry.bodyTruncated` propagates through.
- `src/cli.ts` `browser network`:
  - `body_truncated_count` + per-entry `body_truncated` flag.
  - `--detail` envelope emits `body_truncated`, `body_full_size`, `body_truncation_reason: 'capture-limit' | 'max-body'`.
  - `--max-body <chars>` flag for explicit caller-side cap (default `0` = unlimited).
`browser get html --as json`: budget knobs
- `src/browser/html-tree.ts`: add `depth` / `childrenMax` / `textMax` budgets to the in-page serializer. The result envelope includes `truncated: {depth, children_dropped, text_truncated}` only when a budget was actually hit.
- `src/cli.ts`: expose as `--depth <n>` / `--children-max <n>` / `--text-max <n>`.
`browser extract`: new agent-native reading channel
- `src/browser/extract.ts` (new):
  - `buildExtractHtmlJs(selector)`: in-page clone + denoise (strip `script/style/nav/header/footer/aside/iframe/form/button/...` + `role=navigation/banner/contentinfo/complementary` + `aria-hidden=true`; scrub `on*` / `style` / `data-*` attrs).
  - `htmlToMarkdown` (existing helper) → structure-aware `chunkMarkdown` that snaps the chunk end to the nearest `\n\n` (or `\n`) within a 15% boundary window, falling back to a hard cut.
  - Stateless `next_start_char` resume cursor.
- `src/cli.ts`: new `browser extract [url] --selector <sel> --start <n> --chunk-size <n>` command.
Truncation signal contract (all three channels)
Every channel makes cuts visible to the agent in the envelope:
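- `browser network` (list view): `body_truncated_count`, per-entry `body_truncated`
- `browser network --detail`: `body_truncated`, `body_full_size`, `body_truncation_reason`
- `browser get html --as json`: `truncated: { depth?, children_dropped?, text_truncated? }` (only when hit)
- `browser extract`: `total_chars`, `chunk_size`, `start`, `end`, `next_start_char`
Test plan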
- `pnpm test src/browser/extract.test.ts`: 13 tests pass (chunking boundaries, resume chain, min-clamp, envelope wrapping)
- `pnpm test src/browser/html-tree.test.ts`: all green incl. 5 new budget tests
- `pnpm test src/cli.test.ts`: all green incl. 5 new body-truncation-signal tests
- `pnpm typecheck`: clean
- `browser network --detail ... --max-body 500` returns `body_truncated: true` + `body_full_size`
- `browser extract <url>` on a real article page + resume via `--start <next>`
Scope discipline
- … `package.json`; `htmlToMarkdown` already exists in `src/utils.ts`).
- … `body_truncated_count` field (additive).
cc @codex-mini1 @First-principles-1: requesting review per thread assignment.