feat(web,download): absorb #1048 — video/audio/iframe + --stdout (#1146)

Merged: jackwener merged 3 commits into main from feat/web-read-media-stdout on Apr 22, 2026

@jackwener
Owner

Summary

Per @卡比卡比's direction (#other 18:05): close out PR #1048 (web md, standalone command) by folding only its useful pieces into the existing shared pipeline. No new command. The pipeline-level hardening that #1048 was originally written against (Readability, GFM, base64 drop, page chrome) is already in main via #1143, so only the non-overlapping bits are absorbed here.

What changed

src/download/article-download.ts

  • Turndown rules for <video>, <audio>, <iframe>:
    • Video / audio → inline HTML (<video src controls [poster]>, <audio src controls>), falls back to <source> when no direct src.
    • Iframe → markdown link [title](src) (title defaults to Embedded content), so YouTube / CodePen / Twitter embeds keep a reachable URL instead of being dropped.
    • iframe removed from STRIPPED_TAGS since it's now handled explicitly.
  • New stdout?: boolean option on ArticleDownloadOptions:
    • Writes full markdown to process.stdout, skips image download + mkdir + writeFile, returns row with saved: '-'.
    • Remote image URLs remain intact so piped output is self-contained.
    • size is still reported (Buffer byte length) for the result row.
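
The media rules above can be sketched as plain string-building functions. This is a minimal illustration, not the code in article-download.ts: the PR implements these as Turndown rules, and the names `MediaAttrs`, `mediaToHtml`, `iframeToMarkdown`, and `escapeAttr` are hypothetical. Attribute escaping and the closing tag are also assumptions.

```typescript
// Hypothetical attribute bag; in the real pipeline these come from the DOM node.
interface MediaAttrs {
  src?: string;
  poster?: string;
  title?: string;
  sources?: string[]; // src values of nested <source> children, in document order
}

// Escape characters that would break out of a double-quoted attribute value.
function escapeAttr(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;").replace(/</g, "&lt;");
}

// <video>/<audio> become inline HTML; fall back to the first <source> when
// the element has no direct src, and drop the element when neither exists.
function mediaToHtml(tag: "video" | "audio", attrs: MediaAttrs): string {
  const src = attrs.src || (attrs.sources ?? []).find((s) => s.length > 0);
  if (!src) return "";
  const poster =
    tag === "video" && attrs.poster ? ` poster="${escapeAttr(attrs.poster)}"` : "";
  return `<${tag} src="${escapeAttr(src)}" controls${poster}></${tag}>`;
}

// <iframe> degrades to a markdown link; an iframe without src is dropped.
// Simplification: brackets inside the title are not escaped here.
function iframeToMarkdown(attrs: MediaAttrs): string {
  if (!attrs.src) return "";
  const title = attrs.title?.trim() || "Embedded content";
  return `[${title}](${attrs.src})`;
}
```

For example, `iframeToMarkdown({ src: "https://www.youtube.com/embed/x" })` yields a link titled Embedded content, and `mediaToHtml("video", {})` yields an empty string, matching the dropped-when-no-src case in the test plan.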

clis/web/read.js

  • New --stdout flag (boolean, default false), passes through to downloadArticle.
  • Lazy-load image URL rewrite before innerHTML is captured: promotes data-src / data-original / data-lazy-src / first data-srcset entry onto src. Fixes a latent bug where the markdown body kept placeholder.gif URLs while the image downloader fetched a different URL from the collected list — now the body and the download list always agree.
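
A rough sketch of the lazy-load promotion, written against a plain attribute map rather than the DOM. The names `ImgAttrs`, `firstSrcsetUrl`, and `promoteLazySrc` are hypothetical. Two things are assumptions rather than facts from this PR: the precedence order (taken from the order the attributes are listed above) and the choice to overwrite an existing src whenever a lazy attribute is present.

```typescript
type ImgAttrs = Record<string, string | undefined>;

// Assumed precedence: the order in which the PR description lists the attributes.
const LAZY_ATTRS = ["data-src", "data-original", "data-lazy-src"] as const;

// Return the URL of the first srcset candidate, e.g. "a.jpg 1x, b.jpg 2x" -> "a.jpg".
// Simplification: splits on commas, so URLs that themselves contain commas are not handled.
function firstSrcsetUrl(srcset: string): string | undefined {
  const first = srcset.split(",")[0].trim();
  return first ? first.split(/\s+/)[0] : undefined;
}

// Copy the real image URL onto src so the markdown body and the download
// list both see the same value. Returns a new object; the input is not mutated.
function promoteLazySrc(attrs: ImgAttrs): ImgAttrs {
  for (const name of LAZY_ATTRS) {
    const value = attrs[name]?.trim();
    if (value) return { ...attrs, src: value };
  }
  const srcset = attrs["data-srcset"];
  const fromSrcset = srcset ? firstSrcsetUrl(srcset) : undefined;
  return fromSrcset ? { ...attrs, src: fromSrcset } : attrs;
}
```

With this shape, `promoteLazySrc({ src: "placeholder.gif", "data-src": "real.jpg" })` returns an object whose src is real.jpg, which is the case the latent bug concerned.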

Intentionally NOT kept from #1048

Test plan

  • npx vitest run src/download/article-download.test.ts: 20 passed (was 13; +7 for video/audio/iframe/stdout).
    • Iframe strip test updated: asserts iframe now degrades to [title](src) link.
    • New: video inline HTML with poster, <source> fallback, dropped-when-no-src, audio inline HTML, iframe to-link with title + default, dropped iframe.
    • New stdout mode: asserts nothing hits disk, markdown emitted via process.stdout ends with \n, remote image URLs preserved.
  • npx vitest run --project e2e tests/e2e/article-download-pipeline.test.ts: 6/6 real sites pass in 25s (invariants unaffected).
  • npm run build — clean, manifest 624 unchanged (no new command).
  • npx tsc --noEmit — clean.
  • Live CLI smoke: opencli web read --url https://example.com/ --stdout --format json emits the markdown to stdout followed by the JSON result row — as expected.
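
The stdout behaviour exercised above can be sketched roughly as follows. This is an illustration under assumptions, not the actual downloadArticle code: `emitToStdout`, `ResultRow`, and the `title` field are hypothetical, the writer is injected so the sketch has no Node-specific dependencies, and whether the reported size counts the trailing newline is a guess. The PR reports size as a Buffer byte length; `TextEncoder` is used here as a stand-in that gives the same UTF-8 byte count.

```typescript
// Hypothetical result row; only `saved` and `size` are named in the PR description.
interface ResultRow {
  title: string;
  saved: string;
  size: number;
}

// Write the markdown through the supplied writer and return the result row.
// No directory is created and no file or image is written in this mode.
function emitToStdout(
  title: string,
  markdown: string,
  write: (chunk: string) => void,
): ResultRow {
  // The test plan asserts the emitted markdown ends with a newline.
  const body = markdown.endsWith("\n") ? markdown : markdown + "\n";
  write(body);
  // UTF-8 byte length, not character count (assumed to include the newline).
  const size = new TextEncoder().encode(body).length;
  return { title, saved: "-", size };
}
```

In a Node CLI the writer would typically be `(chunk) => process.stdout.write(chunk)`, with the result row printed afterwards, which gives the markdown-then-row ordering seen in the smoke test.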

Closes #1048.

Distill the useful pieces of the abandoned PR #1048 (`web md`) into the
existing shared pipeline instead of introducing a parallel command:

- Turndown rules for <video> / <audio> / <iframe>. Video and audio are
  emitted as inline HTML so renderers that support it keep playback,
  and iframes degrade to markdown links (title + src) so embedded
  content (YouTube, CodePen, …) stays reachable. `iframe` moves out of
  STRIPPED_TAGS since it's now handled explicitly.
- `stdout` option on ArticleDownloadOptions: writes the full markdown
  to process.stdout, skips image download + mkdir + file write, and
  reports saved='-'. Remote image URLs stay intact so piped output is
  self-contained.
- `web read --stdout` wires the above through.
- Lazy-load src rewrite: the extractor now promotes data-src /
  data-original / data-lazy-src / data-srcset onto `src` before the
  HTML is frozen, so the markdown body and the image-download list
  reference the same URL (previously a page with placeholder.gif +
  data-src produced broken image links in the output).

Nothing in #1048 that overlapped with the already-merged #1143
hardening was kept — no new Readability wiring, no duplicate Turndown
config, no new command.
- article-extract e2e fixture test: iframe now converts to a markdown
  link instead of being stripped, so assert the YouTube embed link
  survives rather than asserting its absence.
- clis/web/read.test.js: replace vi.importActual('../../src/registry.js')
  with a direct __test__.command export from read.js; the relative
  import into src/ tripped the package-exports adapter guardrail.
@jackwener merged commit 648390e into main on Apr 22, 2026
13 checks passed
@jackwener deleted the feat/web-read-media-stdout branch on April 22, 2026 10:42