ninejuan/webstract

Webstract

CLI to snapshot a web page for offline use: downloads HTML plus CSS/JS/images on the same site, rewrites references, and saves everything locally.

Install

npm install -g webstract          # after publishing
# or run without installing
npx webstract <url> <outputDir>

Quick start

yarn start <url> <outputDir> [--concurrency <n>] [--timeout <ms>]
yarn start https://example.com ./dump

Build once for distribution:

yarn build
node dist/cli.js <url> <outputDir>

What it does

  • Follows redirects; the final URL is the base for asset rewriting.
  • Saves index.html under <outputDir>/<domain>/ and rewrites references to point to downloaded files.
  • Downloads assets on the same registrable domain or same root label (e.g., daum.net and daumcdn.net), not just the strict origin.
  • Collects linked CSS (link[rel=stylesheet]), JS (script[src]), images (img[src], img[srcset], source[src|srcset], icons), inline CSS in <style> elements and style= attributes, and meta images (OG/Twitter).
  • Parses downloaded CSS for @import and url(...) references on the same domain/root label.
  • External origins remain absolute; skipped items are listed in missing-assets.json. Use --download-external to force-download other domains (saved under a hostname-prefixed path).
  • Writes a _WST.md summary with the requested and final URLs and the download/skip/fail counts.
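
The CSS parsing step in the list above can be illustrated with a minimal sketch. The function name extractCssRefs is hypothetical and is not part of webstract's documented API; the regular expressions are a simplification, and the same-domain filtering described above is left out.

```typescript
// Hypothetical helper for illustration; webstract's actual CSS parser may differ.
function extractCssRefs(css: string, baseUrl: string): string[] {
  const refs = new Set<string>();
  // url(...) with optional single or double quotes around the target.
  const urlPattern = /url\(\s*(['"]?)([^'")]+)\1\s*\)/g;
  // @import "file.css" or @import 'file.css' (the url(...) form is caught above).
  const importPattern = /@import\s+(['"])([^'"]+)\1/g;

  for (const pattern of [urlPattern, importPattern]) {
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(css)) !== null) {
      const raw = match[2].trim();
      // Inline data URIs and fragment-only references have nothing to download.
      if (raw === "" || raw.startsWith("data:") || raw.startsWith("#")) continue;
      try {
        // Resolve relative references against the stylesheet's own URL.
        refs.add(new URL(raw, baseUrl).href);
      } catch {
        // Skip references that cannot be resolved against the base.
      }
    }
  }
  return Array.from(refs);
}
```

Resolving against the stylesheet URL rather than the page URL matters because relative paths in CSS are relative to the stylesheet file.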

CLI options

Option                      Description                                                      Default
-c, --concurrency <n>       Concurrent downloads                                             WEBSTRACT_CONCURRENCY or 5
-t, --timeout <ms>          Request timeout in ms                                            WEBSTRACT_TIMEOUT_MS or 15000
-r, --retries <n>           Retry attempts per request                                       WEBSTRACT_MAX_RETRIES or 3
--retry-delay <ms>          Delay between retries (exponential backoff)                      WEBSTRACT_RETRY_DELAY_MS or 1000
--user-agent <ua>           Custom User-Agent string                                         WEBSTRACT_USER_AGENT
--no-follow-redirects       Do not follow HTTP redirects                                     follow redirects
--insecure                  Allow insecure TLS (self-signed)                                 off
--download-external         Force download of external-domain assets (prefixed by hostname)  off
--no-css-parse              Skip CSS @import/url() parsing                                   on
--no-meta                   Skip meta (OG/Twitter) image discovery                           on
--summary-format <md|json>  _WST summary format
--output-name <name>        Override output folder name derived from domain
--quiet / --verbose         Control log verbosity                                            normal
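
As a rough illustration of how the retry count and retry delay could interact, here is a minimal sketch of retrying with exponential backoff. The helper withRetries is hypothetical, and two details are assumptions rather than documented behavior: that the retry count means extra attempts after the first, and that the delay doubles after each failure.

```typescript
// Hypothetical helper; webstract's real retry logic may differ.
async function withRetries<T>(
  task: () => Promise<T>,
  retries = 3,
  baseDelayMs = 1000,
  // Injectable sleep so the schedule can be observed without real waiting.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt === retries) break;
      // Assumed schedule: baseDelayMs, then 2x, then 4x, and so on.
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

Under these assumptions, the defaults would wait 1000 ms, 2000 ms, and 4000 ms between the four attempts before giving up.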

Environment variables: WEBSTRACT_CONCURRENCY, WEBSTRACT_TIMEOUT_MS, WEBSTRACT_MAX_RETRIES, WEBSTRACT_RETRY_DELAY_MS, WEBSTRACT_USER_AGENT.
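
The "variable or default" pattern in the table can be sketched as follows. The helper resolveNumber is hypothetical, and the precedence shown (CLI flag first, then environment variable, then built-in default) is an assumption based on how the defaults are listed, not confirmed behavior.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: an explicit CLI value wins, then the environment
// variable, then the built-in default. Invalid values fall back to the default.
function resolveNumber(
  cliValue: number | undefined,
  env: Env,
  envName: string,
  fallback: number,
): number {
  if (cliValue !== undefined) return cliValue;
  const raw = env[envName];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallback;
}

// Example: resolveNumber(undefined, process.env, "WEBSTRACT_CONCURRENCY", 5)
```

Passing the environment in as an argument keeps the helper easy to exercise without touching the real process environment.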

Output layout

<outputDir>/<domain>/
├─ index.html
├─ _WST.md                  # summary
├─ missing-assets.json      # only if something was skipped
├─ css/...
├─ js/...
└─ images/...               # file tree mirrors remote paths

Open index.html in a browser for the offline copy. Check _WST.md for a quick summary and missing-assets.json to see which external assets stayed remote.
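
To make "file tree mirrors remote paths" concrete, here is a minimal sketch of one way a remote URL could map to a local relative path. The function localPathFor is hypothetical, and its rules (query string dropped, directory-style URLs saved as index.html, other hosts stored under a hostname-named folder) are assumptions for illustration; the actual mapping lives in webstract's source.

```typescript
// Hypothetical mapping for illustration only.
function localPathFor(assetUrl: string, pageHost: string): string {
  const url = new URL(assetUrl);
  // Use only the path; the query string and fragment are dropped.
  let relative = url.pathname.replace(/^\/+/, "");
  // Treat directory-style URLs as an index.html inside that directory.
  if (relative === "" || relative.endsWith("/")) relative += "index.html";
  // Assumed: assets from another host go under a folder named after that host.
  return url.hostname === pageHost ? relative : `${url.hostname}/${relative}`;
}
```

With this sketch, https://example.com/css/site.css?v=3 would land at css/site.css when the page host is example.com.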

Programmatic use

import { webstract } from "webstract";

await webstract("https://example.com", "./dump/example.com");

Project layout

  • src/webstract.ts: Orchestrates extraction and options.
  • src/lib/: Shared utilities (HTTP client, logger).
  • src/extract/: Core extraction logic (collector, CSS parsing, downloader, rewriter, output).
  • CLI entry: src/cli.ts.

Environment variables are loaded via dotenv (quiet mode); keep your .env out of version control.

About

Webstract extracts a website’s complete HTML DOM along with all related CSS, JavaScript, and image assets, representing everything as a structured tree.
