feat(web,download): absorb #1048 — video/audio/iframe + --stdout#1146
Merged
Distill the useful pieces of the abandoned PR #1048 (`web md`) into the existing shared pipeline instead of introducing a parallel command:
- Turndown rules for `<video>` / `<audio>` / `<iframe>`. Video and audio are emitted as inline HTML so renderers that support it keep playback, and iframes degrade to markdown links (title + src) so embedded content (YouTube, CodePen, …) stays reachable. `iframe` moves out of STRIPPED_TAGS since it's now handled explicitly.
- `stdout` option on ArticleDownloadOptions: writes the full markdown to process.stdout, skips image download + mkdir + file write, and reports saved='-'. Remote image URLs stay intact so piped output is self-contained.
- `web read --stdout` wires the above through.
- Lazy-load src rewrite: the extractor now promotes data-src / data-original / data-lazy-src / data-srcset onto `src` before the HTML is frozen, so the markdown body and the image-download list reference the same URL (previously a page with placeholder.gif + data-src produced broken image links in the output).

Nothing in #1048 that overlapped with the already-merged #1143 hardening was kept: no new Readability wiring, no duplicate Turndown config, no new command.
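The lazy-load promotion described above can be sketched roughly as follows. This is only an illustration working on a plain attribute map, not the extractor's actual code: `Attrs`, `LAZY_ATTRS`, `firstSrcsetUrl` and `promoteLazySrc` are hypothetical names, and the priority order is assumed to follow the order in which the attributes are listed.

```typescript
// Hypothetical helper types and names; the real extractor works on live DOM nodes.
type Attrs = Record<string, string | undefined>;

// Assumed priority: the order the attributes are listed in the PR description.
const LAZY_ATTRS = ["data-src", "data-original", "data-lazy-src"] as const;

// A srcset looks like "url descriptor, url descriptor, ...".
// Naive split on "," (URLs containing commas are not handled in this sketch).
function firstSrcsetUrl(srcset: string): string | undefined {
  const first = srcset.split(",")[0]?.trim() ?? "";
  const url = first.split(/\s+/)[0];
  return url || undefined;
}

// Returns a copy of the attributes with `src` replaced by the first usable
// lazy-load value, or the original attributes when none is present.
function promoteLazySrc(attrs: Attrs): Attrs {
  for (const name of LAZY_ATTRS) {
    const value = attrs[name]?.trim();
    if (value) return { ...attrs, src: value };
  }
  const srcset = attrs["data-srcset"];
  const fromSrcset = srcset ? firstSrcsetUrl(srcset) : undefined;
  return fromSrcset ? { ...attrs, src: fromSrcset } : attrs;
}
```

With `{ src: "placeholder.gif", "data-src": "https://example.com/a.jpg" }` the result carries the real image URL in `src`, which is the property that lets the markdown body and the download list agree.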
- article-extract e2e fixture test: iframe now converts to a markdown
link instead of being stripped, so assert the YouTube embed link
survives rather than asserting its absence.
- clis/web/read.test.js: replace vi.importActual('../../src/registry.js')
with a direct __test__.command export from read.js; the relative
import into src/ tripped the package-exports adapter guardrail.
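For reviewers who want a concrete picture of the conversion the fixture test now asserts, here is a minimal sketch of the iframe-to-link and video-to-inline-HTML behaviour. It is written as plain string functions rather than as Turndown rules, and `escapeAttr`, `iframeToMarkdown` and `videoToHtml` are hypothetical names, not the functions in `article-download.ts`.

```typescript
// Hypothetical helpers; the PR implements this as Turndown rules.

// Minimal attribute escaping for the inline HTML output.
function escapeAttr(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;").replace(/</g, "&lt;");
}

// iframe -> [title](src); the title defaults to "Embedded content".
// An iframe without a src has nothing reachable to link to, so it is dropped.
function iframeToMarkdown(src: string | undefined, title?: string): string {
  if (!src) return "";
  const label = (title ?? "").trim() || "Embedded content";
  return `[${label}](${src})`;
}

// video -> inline HTML; falls back to the first <source> URL when the
// element has no direct src, and is dropped when neither exists.
function videoToHtml(src: string | undefined, sourceSrcs: string[], poster?: string): string {
  const url = src || sourceSrcs[0];
  if (!url) return "";
  const posterAttr = poster ? ` poster="${escapeAttr(poster)}"` : "";
  return `<video src="${escapeAttr(url)}" controls${posterAttr}></video>`;
}
```

Under these assumptions a YouTube embed with no `title` attribute becomes `[Embedded content](<embed URL>)`, which is the kind of link the e2e fixture expects to survive.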
Summary
Per @卡比卡比's direction (#other 18:05): close out PR #1048 (`web md`, standalone command) by folding only its useful pieces into the existing shared pipeline. No new command. The pipeline-level hardening that #1048 was originally written against (Readability, GFM, base64 drop, page chrome) is already in `main` via #1143, so only the non-overlapping bits are absorbed here.

What changed
`src/download/article-download.ts`
- Turndown rules for `<video>`, `<audio>`, `<iframe>`:
  - Video and audio are emitted as inline HTML (`<video src controls [poster]>`, `<audio src controls>`), falling back to `<source>` when there is no direct `src`.
  - Iframes become `[title](src)` links (title defaults to `Embedded content`), so YouTube / CodePen / Twitter embeds keep a reachable URL instead of being dropped.
  - `iframe` removed from `STRIPPED_TAGS` since it's now handled explicitly.
- `stdout?: boolean` option on `ArticleDownloadOptions`:
  - Writes the full markdown to `process.stdout`, skips image download + `mkdir` + `writeFile`, returns a row with `saved: '-'`.
  - `size` is still reported (Buffer byte length) for the result row.

`clis/web/read.js`
- `--stdout` flag (boolean, default false), passes through to `downloadArticle`.
- Lazy-load `src` rewrite before `innerHTML` is captured: promotes `data-src` / `data-original` / `data-lazy-src` / first `data-srcset` entry onto `src`. Fixes a latent bug where the markdown body kept placeholder.gif URLs while the image downloader fetched a different URL from the collected list; now the body and the download list always agree.

Intentionally NOT kept from #1048
- `web md` command itself (duplicates `web read` after #1143, feat(download): harden HTML→Markdown pipeline).
- … (… `src/browser/article-extract.ts`).
- `mediaUrls` object (dead code in the original PR).
- … (`# Title / > meta / ---` frontmatter).

Test plan
- `npx vitest run src/download/article-download.test.ts`: 20 passed (was 13; +7 for video/audio/iframe/stdout).
  - … `[title](src)` link.
  - … `<source>` fallback, dropped-when-no-src, audio inline HTML, iframe to-link with title + default, dropped iframe.
  - … `\n`, remote image URLs preserved.
- `npx vitest run --project e2e tests/e2e/article-download-pipeline.test.ts`: 6/6 real sites pass in 25s (invariants unaffected).
- `npm run build`: clean, manifest 624 unchanged (no new command).
- `npx tsc --noEmit`: clean.
- `opencli web read --url https://example.com/ --stdout --format json` emits the markdown to stdout followed by the JSON result row, as expected.

Closes #1048.
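
The stdout path exercised in the test plan can be pictured with the sketch below. It is a simplified stand-in rather than the code in `downloadArticle`: `TextSink`, `ResultRow` and `emitToSink` are hypothetical names, the trailing-newline handling is an assumption, and `TextEncoder` is used here for the UTF-8 byte count where the PR reports a Buffer byte length.

```typescript
// Hypothetical shapes; the real option lives on ArticleDownloadOptions.
interface TextSink {
  write(chunk: string): unknown; // process.stdout satisfies this in Node
}

interface ResultRow {
  saved: string; // '-' signals "written to stdout, nothing saved on disk"
  size: number;  // UTF-8 byte length of what was written
}

// Writes the markdown to the sink and returns the result row.
// No image download, mkdir or file write happens on this path.
function emitToSink(markdown: string, sink: TextSink): ResultRow {
  // Assumption: end the stream with a newline so a following JSON row
  // starts on its own line.
  const body = markdown.endsWith("\n") ? markdown : markdown + "\n";
  sink.write(body);
  return { saved: "-", size: new TextEncoder().encode(body).length };
}
```

In a Node CLI this would be called as `emitToSink(markdown, process.stdout)`; passing the sink in keeps the function easy to test with an in-memory collector.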