cyte

cyte is a TypeScript CLI for extracting website content into Markdown, discovering links, and crawling internal pages recursively.

What It Does

  • Extract a page into clean Markdown.
  • Discover all links on a page with internal/external classification.
  • Recursively crawl internal pages and save content by domain + route.
  • Output JSON for automation/agent workflows.

Install

Global install (recommended for regular usage):

pnpm add -g cyte
cyte --help

No-install one-off usage:

npx cyte --help
npx cyte https://example.com

PNPM one-off alternative:

pnpm dlx cyte --help

For local development of this repo:

pnpm install
pnpm build

Quickstart

Single page extraction (stdout):

cyte vercel.com
npx cyte vercel.com

Links only:

cyte links vercel.com --json

Deep crawl + file output:

cyte vercel.com --deep --depth 2

Command Reference

cyte <url>

Extract a single page into Markdown.

Default behavior:

  • Prints Markdown to stdout.
  • Does not save files unless --deep is enabled.

Examples:

cyte https://example.com
cyte example.com
cyte example.com --json

Options:

  • --deep: enable recursive internal crawl and file output.
  • --depth <number>: max crawl depth, default 1.
  • --delay <number>: delay between requests in ms, default 150.
  • --concurrency <number>: max parallel crawl requests, default 3.
  • --output <path>: output directory for deep crawl, default ./cyte.
  • --clean: remove target domain output directory before deep crawl.
  • --sitemap: seed crawl from sitemap URLs (including robots sitemap entries).
  • --no-respect-robots: ignore robots.txt rules during deep crawl.
  • --json: return structured JSON instead of human output.
  • --format <type>: json or jsonl (used with --json), default json.
  • --download-media: reserved flag (not active yet).

cyte links <url>

Return links found in a page.

Examples:

cyte links https://example.com
cyte links example.com --json
cyte links example.com --internal
cyte links example.com --external --match docs

Options:

  • --internal: only internal links.
  • --external: only external links.
  • --match <pattern>: filter by title or URL substring.
  • --json: output JSON array.
  • --format <type>: json or jsonl (used with --json), default json.

URL Handling

  • Bare domains are accepted: vercel.com -> https://vercel.com/.
  • For failed https:// requests, cyte retries with http://.
  • URLs are normalized for crawl deduplication:
    • hash fragments removed
    • query string removed in crawl/link normalization
    • trailing slashes normalized
  • Skips unsupported link protocols:
    • mailto:
    • javascript:
    • tel:
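
The rules above can be sketched as a single normalization function. This is an illustrative sketch of the documented behavior, not cyte's actual implementation:

```typescript
// Sketch of cyte's documented URL normalization for crawl deduplication:
// skip unsupported protocols, default bare domains to https://, drop hash
// fragments and query strings, and normalize trailing slashes.
function normalizeForCrawl(raw: string): string | null {
  // Unsupported link protocols are skipped entirely.
  if (/^(mailto|javascript|tel):/i.test(raw)) return null;
  // Bare domains are accepted by defaulting to https://.
  const withScheme = /^https?:\/\//i.test(raw) ? raw : `https://${raw}`;
  const url = new URL(withScheme);
  url.hash = "";   // remove #fragment
  url.search = ""; // remove ?query
  // Normalize trailing slash: keep "/" for the root, strip it elsewhere.
  if (url.pathname !== "/" && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}
```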

Output Behavior

Single page mode (cyte <url>)

  • Prints extracted Markdown to stdout.
  • Also prints success metadata (source URL, links discovered).
  • Does not write files.

Links mode (cyte links <url>)

  • Returns links table or JSON.
  • Does not write files.

Deep mode (cyte <url> --deep)

  • Crawls internal links only.
  • Writes markdown files grouped by domain and route:
cyte/
  example.com/
    index.md
    docs/
      index.md
      intro/
        index.md
  • Existing output files at the same path are overwritten.
  • Missing directories are created automatically.
  • Deep crawl ensures .gitignore contains cyte/.
  • If crawl failures occur, an error report is written to:
    • cyte/<domain>/_errors.json
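
The domain + route layout shown above maps each page URL to an `index.md` inside nested route directories. A hypothetical sketch of that mapping (cyte's real path logic may differ):

```typescript
import path from "node:path";

// Map a crawled page URL to its output file under <outDir>/<domain>/<route>/index.md.
// This mirrors the file tree shown in the README; it is an illustrative sketch.
function outputPathFor(pageUrl: string, outDir = "cyte"): string {
  const url = new URL(pageUrl);
  const route = url.pathname.replace(/\/+$/, ""); // trim trailing slashes
  // Each route segment becomes a directory; every page is written as index.md.
  return path.join(outDir, url.hostname, ...route.split("/").filter(Boolean), "index.md");
}
```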

JSON Output Contracts

Extract JSON (cyte <url> --json)

{
  "url": "https://example.com/",
  "title": "Example Domain",
  "markdown": "# Example Domain\n...",
  "links": [
    {
      "title": "Learn more",
      "url": "https://iana.org/domains/example",
      "type": "external"
    }
  ]
}

Links JSON (cyte links <url> --json)

[
  {
    "title": "Docs",
    "url": "https://example.com/docs",
    "type": "internal"
  }
]

Crawl JSON (cyte <url> --deep --json)

{
  "startUrl": "https://example.com/",
  "pagesVisited": 10,
  "pagesSucceeded": 10,
  "pagesFailed": 0,
  "pages": []
}

JSONL output

Use --format jsonl with --json.

Examples:

cyte links docs.example.com --json --format jsonl
cyte docs.example.com --deep --json --format jsonl

Notes:

  • links emits one link object per line.
  • extract emits one object line.
  • deep emits:
    • one summary line
    • one page line per crawled page
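
Since every JSONL line is an independent JSON object, consumers can parse output line by line. A minimal sketch of such a parser:

```typescript
// Parse JSONL text (as produced with --json --format jsonl): each non-empty
// line is parsed as an independent JSON object.
function parseJsonl<T = unknown>(text: string): T[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as T);
}
```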

AI Agent Usage

cyte is agent-friendly by default: deterministic CLI, URL normalization, and machine-readable --json output.

Core agent workflows

  1. Discover routes, then fetch selected pages:
cyte links https://docs.example.com --json
cyte https://docs.example.com/authentication --json
  2. Filter internal links by topic:
cyte links https://docs.example.com --internal --match auth --json
  3. Build a knowledge snapshot for RAG:
cyte https://docs.example.com --deep --depth 2 --json

Recommended decision loop for agents

  1. Run links --json on the seed page.
  2. Keep only internal links and apply topic filters (--match).
  3. Fetch top candidate pages with cyte <url> --json.
  4. Escalate to deep crawl if coverage is insufficient.
  5. Index markdown + metadata for retrieval.
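
Step 2 of the loop can be done in-process on the parsed link array instead of re-invoking the CLI. A sketch mirroring the `--internal` and `--match` filters (case-insensitive matching here is an assumption about `--match` semantics):

```typescript
type Link = { title: string; url: string; type: string };

// Keep internal links whose title or URL contains a topic substring,
// mirroring cyte's --internal and --match filters.
function filterInternal(links: Link[], match?: string): Link[] {
  return links.filter((l) => {
    if (l.type !== "internal") return false;
    if (!match) return true;
    const needle = match.toLowerCase();
    return (
      l.title.toLowerCase().includes(needle) ||
      l.url.toLowerCase().includes(needle)
    );
  });
}
```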

Contracts agents can rely on

  • cyte links <url> --json returns an array of:
    • { title, url, type }
  • cyte <url> --json returns:
    • { url, title, markdown, links }
  • cyte <url> --deep --json returns summary:
    • { startUrl, pagesVisited, pagesSucceeded, pagesFailed, pages }
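
The contracts above translate directly into TypeScript types, plus a runtime guard for validating untrusted CLI output. These type names are hypothetical, not exported by cyte:

```typescript
// Hypothetical types for cyte's documented JSON contracts.
interface CyteLink {
  title: string;
  url: string;
  type: "internal" | "external";
}

interface CyteExtract {
  url: string;
  title: string;
  markdown: string;
  links: CyteLink[];
}

interface CyteCrawlSummary {
  startUrl: string;
  pagesVisited: number;
  pagesSucceeded: number;
  pagesFailed: number;
  pages: unknown[];
}

// Runtime check before trusting parsed CLI output as a CyteLink.
function isCyteLink(x: any): x is CyteLink {
  return (
    typeof x?.title === "string" &&
    typeof x?.url === "string" &&
    (x?.type === "internal" || x?.type === "external")
  );
}
```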

Production notes for agent pipelines

  • Use --json in automation paths.
  • Start conservative on crawling:
    • --depth 1 --concurrency 2 --delay 200
  • Treat page-level failures as partial success and continue.
  • Re-crawls overwrite existing files by output path.
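
"Partial success" can be made concrete by thresholding the crawl summary before indexing. A sketch using the summary contract above (the threshold value is an assumption, not a cyte default):

```typescript
// Decide whether a deep crawl is usable despite page-level failures,
// based on the { pagesVisited, pagesSucceeded } summary fields.
function isUsableCrawl(
  summary: { pagesVisited: number; pagesSucceeded: number },
  minSuccessRate = 0.8, // assumed threshold; tune per pipeline
): boolean {
  if (summary.pagesVisited === 0) return false;
  return summary.pagesSucceeded / summary.pagesVisited >= minSuccessRate;
}
```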

Node.js tool wrapper example

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

async function discoverLinks(url: string) {
  // Shell out to the CLI and parse its JSON link array.
  const { stdout } = await execFileAsync("cyte", ["links", url, "--json"]);
  return JSON.parse(stdout) as Array<{
    title: string;
    url: string;
    type: string;
  }>;
}

Extraction Details

  • Uses Readability, with a fallback extractor for landing pages where Readability's output is too thin.
  • Preserves headings, lists, tables, code blocks, blockquotes.
  • Converts relative media URLs to absolute URLs:
    • /logo.png -> https://domain.com/logo.png
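
The relative-to-absolute conversion described above corresponds to standard WHATWG URL resolution, which Node.js provides via the `URL` constructor:

```typescript
// Resolve a relative media URL against the page it appeared on.
function absolutize(src: string, pageUrl: string): string {
  return new URL(src, pageUrl).toString();
}
```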

Development

Run in dev mode:

pnpm dev -- --help

Build:

pnpm build

Tests:

pnpm test
pnpm test:watch

Tech Stack

  • Node.js + TypeScript
  • Commander
  • Undici
  • JSDOM + Readability
  • Turndown + GFM plugin
  • Cheerio
  • p-limit
  • fs-extra

Troubleshooting

  • If output seems too thin on a landing page, rerun with --json to inspect the extracted content and links.
  • If a site blocks requests, try a lower concurrency and add delay:
    • --concurrency 1 --delay 400
  • If deep crawl seems incomplete, increase --depth.
  • If crawl coverage is still low, try --sitemap.
  • If pages are skipped unexpectedly, verify robots rules or use --no-respect-robots.

Releases

  • Changelog: see CHANGELOG.md.
  • Versioning: semantic versioning (major.minor.patch).
  • Publish flow:
    1. update CHANGELOG.md
    2. bump version (pnpm version patch|minor|major)
    3. publish (pnpm publish --access public)

License

MIT. See LICENSE.
