cyte is a TypeScript CLI for extracting website content into Markdown, discovering links, and crawling internal pages recursively.
- Extract a page into clean Markdown.
- Discover all links on a page with internal/external classification.
- Recursively crawl internal pages and save content by domain + route.
- Output JSON for automation/agent workflows.
Global install (recommended for regular usage):

```sh
pnpm add -g cyte
cyte --help
```

No-install one-off usage:

```sh
npx cyte --help
npx cyte https://example.com
```

PNPM one-off alternative:

```sh
pnpm dlx cyte --help
```

For local development of this repo:

```sh
pnpm install
pnpm build
```

Single page extraction (stdout):

```sh
cyte vercel.com
npx cyte vercel.com
```

Links only:

```sh
cyte links vercel.com --json
```

Deep crawl + file output:

```sh
cyte vercel.com --deep --depth 2
```

Extract a single page into Markdown.
Default behavior:
- Prints Markdown to stdout.
- Does not save files unless `--deep` is enabled.
Examples:
```sh
cyte https://example.com
cyte example.com
cyte example.com --json
```

Options:

- `--deep`: enable recursive internal crawl and file output.
- `--depth <number>`: max crawl depth, default `1`.
- `--delay <number>`: delay between requests in ms, default `150`.
- `--concurrency <number>`: max parallel crawl requests, default `3`.
- `--output <path>`: output directory for deep crawl, default `./cyte`.
- `--clean`: remove the target domain's output directory before deep crawl.
- `--sitemap`: seed the crawl from sitemap URLs (including robots sitemap entries).
- `--no-respect-robots`: ignore robots.txt rules during deep crawl.
- `--json`: return structured JSON instead of human-readable output.
- `--format <type>`: `json` or `jsonl` (used with `--json`), default `json`.
- `--download-media`: reserved flag (not active yet).
Return the links found on a page.
Examples:
```sh
cyte links https://example.com
cyte links example.com --json
cyte links example.com --internal
cyte links example.com --external --match docs
```

Options:

- `--internal`: only internal links.
- `--external`: only external links.
- `--match <pattern>`: filter by title or URL substring.
- `--json`: output a JSON array.
- `--format <type>`: `json` or `jsonl` (used with `--json`), default `json`.
- Bare domains are accepted: `vercel.com` -> `https://vercel.com/`.
- For failed `https://` requests, cyte retries with `http://`.
- URLs are normalized for crawl deduplication:
  - hash fragments removed
  - query strings removed in crawl/link normalization
  - trailing slashes normalized
- Skips unsupported link protocols: `mailto:`, `javascript:`, `tel:`
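The normalization rules above can be sketched as follows. This is a hypothetical helper for illustration, not cyte's actual implementation:

```ts
// Sketch of the documented normalization rules; illustrative only.
function normalizeForDedup(input: string): string {
  // Accept bare domains by assuming https://
  const withScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(input)
    ? input
    : `https://${input}`;
  const url = new URL(withScheme);
  url.hash = "";   // drop hash fragments
  url.search = ""; // drop query strings
  // Normalize trailing slashes (keep the single slash on the root path)
  if (url.pathname !== "/" && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}
```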
- Returns extracted markdown to stdout.
- Also prints success metadata (source URL, links discovered).
- Does not write files.
- Returns a links table, or a JSON array with `--json`.
- Does not write files.
- Crawls internal links only.
- Writes markdown files grouped by domain and route:

  ```
  cyte/
    example.com/
      index.md
      docs/
        index.md
        intro/
          index.md
  ```
- Existing output files at the same path are overwritten.
- Missing directories are created automatically.
- Deep crawl ensures `.gitignore` contains `cyte/`.
- If crawl failures occur, an error report is written to: `cyte/<domain>/_errors.json`
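The domain + route layout described above amounts to a URL-to-path mapping, which could be sketched like this (a hypothetical helper, not cyte's code):

```ts
// Sketch: map a crawled URL to the documented output layout
// (<output>/<domain>/<route...>/index.md). Illustrative only.
function outputPathFor(pageUrl: string, outputDir = "cyte"): string {
  const url = new URL(pageUrl);
  const route = url.pathname.replace(/^\/|\/$/g, ""); // strip edge slashes
  const parts = route ? route.split("/") : [];
  return [outputDir, url.hostname, ...parts, "index.md"].join("/");
}
```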
Extract output (`cyte <url> --json`):

```json
{
  "url": "https://example.com/",
  "title": "Example Domain",
  "markdown": "# Example Domain\n...",
  "links": [
    {
      "title": "Learn more",
      "url": "https://iana.org/domains/example",
      "type": "external"
    }
  ]
}
```

Links output (`cyte links <url> --json`):

```json
[
  {
    "title": "Docs",
    "url": "https://example.com/docs",
    "type": "internal"
  }
]
```

Deep crawl summary (`cyte <url> --deep --json`):

```json
{
  "startUrl": "https://example.com/",
  "pagesVisited": 10,
  "pagesSucceeded": 10,
  "pagesFailed": 0,
  "pages": []
}
```

Use `--format jsonl` together with `--json`.
Examples:
```sh
cyte links docs.example.com --json --format jsonl
cyte docs.example.com --deep --json --format jsonl
```

Notes:

- `links` emits one link object per line.
- `extract` emits one object line.
- `deep` emits:
  - one summary line
  - one page line per crawled page
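Consuming the JSONL deep-crawl stream could look like this sketch. The summary line is identified by its `startUrl` field (per the shapes documented in this README); this is illustrative, not an official cyte client API:

```ts
// Sketch: split JSONL deep-crawl output into the summary line
// and the per-page lines. Field names follow this README's docs.
function splitDeepJsonl(stdout: string): {
  summary: Record<string, unknown> | undefined;
  pages: Record<string, unknown>[];
} {
  const rows = stdout
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as Record<string, unknown>);
  return {
    summary: rows.find((row) => "startUrl" in row),
    pages: rows.filter((row) => !("startUrl" in row)),
  };
}
```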
cyte is agent-friendly by default: deterministic CLI, URL normalization, and machine-readable --json output.
- Discover routes, then fetch selected pages:

  ```sh
  cyte links https://docs.example.com --json
  cyte https://docs.example.com/authentication --json
  ```

- Filter internal links by topic:

  ```sh
  cyte links https://docs.example.com --internal --match auth --json
  ```

- Build a knowledge snapshot for RAG:

  ```sh
  cyte https://docs.example.com --deep --depth 2 --json
  ```

A typical agent loop:

- Run `links --json` on the seed page.
- Keep only `internal` links and apply topic filters (`--match`).
- Fetch top candidate pages with `cyte <url> --json`.
- Escalate to deep crawl if coverage is insufficient.
- Index markdown + metadata for retrieval.
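The "keep internal links, apply topic filter" step above might look like this sketch, assuming the `{ title, url, type }` link shape documented in this README:

```ts
// Sketch of the candidate-selection step; illustrative only.
interface Link {
  title: string;
  url: string;
  type: string; // "internal" | "external"
}

function selectCandidates(links: Link[], match: string): string[] {
  const needle = match.toLowerCase();
  return links
    .filter((l) => l.type === "internal")
    .filter(
      (l) =>
        l.title.toLowerCase().includes(needle) ||
        l.url.toLowerCase().includes(needle)
    )
    .map((l) => l.url);
}
```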
- `cyte links <url> --json` returns an array of: `{ title, url, type }`
- `cyte <url> --json` returns: `{ url, title, markdown, links }`
- `cyte <url> --deep --json` returns a summary: `{ startUrl, pagesVisited, pagesSucceeded, pagesFailed, pages }`
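In TypeScript, these shapes could be declared as follows (type names are illustrative, not exported by cyte):

```ts
// Illustrative types for the documented --json shapes.
interface CyteLink {
  title: string;
  url: string;
  type: "internal" | "external";
}

interface CytePage {
  url: string;
  title: string;
  markdown: string;
  links: CyteLink[];
}

interface CyteDeepSummary {
  startUrl: string;
  pagesVisited: number;
  pagesSucceeded: number;
  pagesFailed: number;
  pages: CytePage[];
}
```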
- Use `--json` in automation paths.
- Start conservative on crawling: `--depth 1 --concurrency 2 --delay 200`.
- Treat page-level failures as partial success and continue.
- Re-crawls overwrite existing files by output path.
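One way to act on partial failures is to check the `_errors.json` report described earlier. This is a sketch; the report's exact schema isn't specified here, so entries are read as unknown JSON:

```ts
import { readFileSync } from "node:fs";

// Sketch: read a deep-crawl error report at the documented path.
// Assumes the report is a JSON array of error entries (unverified).
function readErrorReport(domain: string, outputDir = "cyte"): unknown[] | null {
  try {
    const raw = readFileSync(`${outputDir}/${domain}/_errors.json`, "utf8");
    return JSON.parse(raw) as unknown[];
  } catch {
    return null; // no report found: no failures were recorded
  }
}
```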
```ts
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

async function discoverLinks(url: string) {
  const { stdout } = await execFileAsync("cyte", ["links", url, "--json"]);
  return JSON.parse(stdout) as Array<{
    title: string;
    url: string;
    type: string;
  }>;
}
```

- Uses Readability + fallback extraction for landing pages where Readability's output is too thin.
- Preserves headings, lists, tables, code blocks, blockquotes.
- Converts relative media URLs to absolute URLs: `/logo.png` -> `https://domain.com/logo.png`
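The relative-to-absolute conversion above is a one-liner with the WHATWG `URL` constructor; a sketch, not cyte's actual code:

```ts
// Resolve a (possibly relative) media URL against the page URL.
function absolutize(src: string, pageUrl: string): string {
  return new URL(src, pageUrl).toString();
}
```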
Run in dev mode:

```sh
pnpm dev -- --help
```

Build:

```sh
pnpm build
```

Tests:

```sh
pnpm test
pnpm test:watch
```

- Node.js + TypeScript
- Commander
- Undici
- JSDOM + Readability
- Turndown + GFM plugin
- Cheerio
- p-limit
- fs-extra
- If output seems too thin on a landing page, rerun and compare `--json` output to inspect extracted content and links.
- If a site blocks requests, try lower concurrency and add delay: `--concurrency 1 --delay 400`
- If deep crawl seems incomplete, increase `--depth`.
- If crawl coverage is still low, try `--sitemap`.
- If pages are skipped unexpectedly, verify robots rules or use `--no-respect-robots`.
- Changelog: see `CHANGELOG.md`.
- Versioning: semantic versioning (`major.minor.patch`).
- Publish flow:
  - update `CHANGELOG.md`
  - bump version (`pnpm version patch|minor|major`)
  - publish (`pnpm publish --access public`)

MIT. See LICENSE.