CLI to snapshot a web page for offline use: downloads HTML plus CSS/JS/images on the same site, rewrites references, and saves everything locally.
npm install -g webstract # after publishing
# or run without installing
npx webstract <url> <outputDir>

yarn start <url> <outputDir> [--concurrency <n>] [--timeout <ms>]
yarn start https://example.com ./dump

Build once for distribution:
yarn build
node dist/cli.js <url> <outputDir>

- Follows redirects; the final URL is the base for asset rewriting.
- Saves `index.html` under `<outputDir>/<domain>/` and rewrites references to point to downloaded files.
- Downloads assets on the same registrable domain or same root label (e.g., `daum.net` → `daumcdn.net`), not just strict origin.
- Collects linked CSS (`link[rel=stylesheet]`), JS (`script[src]`), images (img/srcset, `img[src]`, `source[src|srcset]`, icons), inline CSS in `<style>`/`style=`, and meta images (OG/Twitter).
- Parses downloaded CSS for `@import` and `url(...)` references on the same domain/root label.
- External origins remain absolute; skipped items are listed in `missing-assets.json`. Use `--download-external` to force-download other domains (saved under a hostname-prefixed path).
- Writes a `_WST.md` summary with request/final URLs and download/skip/fail counts.
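The CSS parsing step in the list above can be illustrated with a minimal sketch. `extractCssRefs` is a hypothetical helper, not webstract's actual parser: it assumes simple regular expressions are enough to find `@import` and `url(...)` targets, and it resolves each one against the stylesheet's own URL. The same-domain filtering described above is left out.

```typescript
// Hypothetical helper, not webstract's real implementation: collects the
// targets of @import and url(...) in a CSS string as absolute URLs.
function extractCssRefs(css: string, baseUrl: string): string[] {
  const refs = new Set<string>();
  const patterns = [
    // url(foo), url('foo'), url("foo")
    /url\(\s*(['"]?)([^'")]+)\1\s*\)/gi,
    // @import "foo"; or @import 'foo'; (the url(...) form is caught above)
    /@import\s+(['"])([^'"]+)\1/gi,
  ];

  for (const pattern of patterns) {
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(css)) !== null) {
      const raw = match[2].trim();
      // data: URIs and fragment-only references have nothing to download.
      if (raw === "" || raw.startsWith("data:") || raw.startsWith("#")) continue;
      try {
        // Relative references resolve against the stylesheet URL, not the page URL.
        refs.add(new URL(raw, baseUrl).href);
      } catch {
        // Skip references that cannot be resolved to a URL.
      }
    }
  }
  return Array.from(refs);
}
```

Resolving against the stylesheet URL matters because a reference such as `url(../img/bg.png)` inside `/css/site.css` points at `/img/bg.png`, regardless of which page linked the stylesheet.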
| Option | Description | Default |
|---|---|---|
| `-c, --concurrency <n>` | Concurrent downloads | `WEBSTRACT_CONCURRENCY` or 5 |
| `-t, --timeout <ms>` | Request timeout in ms | `WEBSTRACT_TIMEOUT_MS` or 15000 |
| `-r, --retries <n>` | Retry attempts per request | `WEBSTRACT_MAX_RETRIES` or 3 |
| `--retry-delay <ms>` | Delay between retries (exponential backoff) | `WEBSTRACT_RETRY_DELAY_MS` or 1000 |
| `--user-agent <ua>` | Custom User-Agent string | `WEBSTRACT_USER_AGENT` |
| `--no-follow-redirects` | Do not follow HTTP redirects | follow redirects |
| `--insecure` | Allow insecure TLS (self-signed) | off |
| `--download-external` | Force download of external-domain assets (prefixed by hostname) | off |
| `--no-css-parse` | Skip CSS `@import`/`url()` parsing | on |
| `--no-meta` | Skip meta (OG/Twitter) image discovery | on |
| `--summary-format <md\|json>` | `_WST` summary format | |
| `--output-name <name>` | Override output folder name | derived from domain |
| `--quiet` / `--verbose` | Control log verbosity | normal |
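To make the `--retries` and `--retry-delay` rows concrete, here is a rough sketch of retrying with exponential backoff. The names `backoffDelayMs` and `withRetries` are hypothetical, and two details are assumptions rather than documented behavior: the delay doubles after each failure, and the retry count excludes the first attempt.

```typescript
// Hypothetical sketch: delay before retry number `attempt` (0-based).
// Assumes a doubling factor; the tool's real growth rate may differ.
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, attempt);
}

// Runs `task`, retrying up to `retries` times after the initial attempt.
// `sleep` is injectable so the wait can be replaced in tests.
async function withRetries<T>(
  task: () => Promise<T>,
  retries: number = 3,
  baseDelayMs: number = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt === retries) break; // out of retries, give up
      await sleep(backoffDelayMs(attempt, baseDelayMs));
    }
  }
  throw lastError;
}
```

Under these assumptions, the defaults of 3 retries and a 1000 ms base delay would wait 1000, 2000, and 4000 ms between attempts before giving up.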
Environment variables: `WEBSTRACT_CONCURRENCY`, `WEBSTRACT_TIMEOUT_MS`, `WEBSTRACT_MAX_RETRIES`, `WEBSTRACT_RETRY_DELAY_MS`, `WEBSTRACT_USER_AGENT`.
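The defaults listed above (for example, `WEBSTRACT_CONCURRENCY` or 5) suggest a precedence of CLI flag, then environment variable, then built-in default. The sketch below shows one way that resolution might look. `resolveNumberOption` is a hypothetical name, and ignoring invalid or negative values is an assumption, not documented behavior.

```typescript
// Hypothetical helper: returns the first usable value among the CLI flag,
// the environment variable, and the built-in default, in that order.
function resolveNumberOption(
  cliValue: string | undefined,
  envValue: string | undefined,
  fallback: number,
): number {
  for (const candidate of [cliValue, envValue]) {
    if (candidate === undefined || candidate.trim() === "") continue;
    const parsed = Number(candidate);
    // Assumption: non-numeric or negative input falls through to the next source.
    if (isFinite(parsed) && parsed >= 0) return parsed;
  }
  return fallback;
}

// Example with a stand-in environment object (in the CLI this would be process.env).
const sampleEnv: Record<string, string | undefined> = { WEBSTRACT_CONCURRENCY: "8" };
const concurrency = resolveNumberOption(undefined, sampleEnv.WEBSTRACT_CONCURRENCY, 5);
const timeoutMs = resolveNumberOption(undefined, sampleEnv.WEBSTRACT_TIMEOUT_MS, 15000);
```

Here `concurrency` comes from the environment because no flag was passed, while `timeoutMs` falls back to the default because neither source is set.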
<outputDir>/<domain>/
├─ index.html
├─ _WST.md # summary
├─ missing-assets.json # only if something was skipped
├─ css/...
├─ js/...
└─ images/... # file tree mirrors remote paths
Open `index.html` in a browser for the offline copy. Check `_WST.md` for a quick summary and `missing-assets.json` to see which external assets stayed remote.
import { webstract } from "webstract";
await webstract("https://example.com", "./dump/example.com");

- `src/webstract.ts`: Orchestrates extraction and options.
- `src/lib/`: Shared utilities (HTTP client, logger).
- `src/extract/`: Core extraction logic (collector, CSS parsing, downloader, rewriter, output).
- CLI entry: `src/cli.ts`.
Environment variables are loaded via dotenv (quiet mode); keep your `.env` out of version control.