noxa
The fastest web scraper for AI agents.
67% fewer tokens. Sub-millisecond extraction. Zero browser overhead.



Claude Code's built-in web_fetch → 403 Forbidden. noxa → clean markdown.


Your AI agent calls fetch() and gets a 403. Or 142KB of raw HTML that burns through your token budget. noxa fixes both.

It extracts clean, structured content from any URL using Chrome-level TLS fingerprinting — no headless browser, no Selenium, no Puppeteer. Output is optimized for LLMs: 67% fewer tokens than raw HTML, with metadata, links, and images preserved.

                     Raw HTML                          noxa
┌──────────────────────────────────┐    ┌──────────────────────────────────┐
│ <div class="ad-wrapper">         │    │ # Breaking: AI Breakthrough      │
│ <nav class="global-nav">         │    │                                  │
│ <script>window.__NEXT_DATA__     │    │ Researchers achieved 94%         │
│ ={...8KB of JSON...}</script>    │    │ accuracy on cross-domain         │
│ <div class="social-share">       │    │ reasoning benchmarks.            │
│ <button>Tweet</button>           │    │                                  │
│ <footer class="site-footer">     │    │ ## Key Findings                  │
│ <!-- 142,847 characters -->      │    │ - 3x faster inference            │
│                                  │    │ - Open-source weights            │
│         4,820 tokens             │    │         1,590 tokens             │
└──────────────────────────────────┘    └──────────────────────────────────┘

Get Started (30 seconds)

Need config details? See docs/config.md for how config.json and .env work together.

One-liner install

curl -fsSL https://raw.githubusercontent.com/jmagar/noxa/main/install.sh | bash

Installs Rust if needed, builds both binaries, and launches noxa setup — an interactive wizard that configures .env, optionally installs Ollama, and wires up Claude Desktop.

For AI agents (Claude, Cursor, Windsurf, VS Code)

npx create-noxa

Auto-detects your AI tools, downloads the MCP server, and configures everything. One command.

Prebuilt binaries

Download from GitHub Releases for macOS (arm64, x86_64) and Linux (x86_64, aarch64).

Cargo

cargo install --git https://github.com/jmagar/noxa.git --bin noxa --bin noxa-mcp

Then run the interactive setup wizard:

noxa setup

From source (clone + build)

git clone https://github.com/jmagar/noxa.git && cd noxa
./setup.sh    # builds (first run) then delegates to noxa setup

Why noxa?

                      noxa     Firecrawl   Trafilatura   Readability
Extraction accuracy   95.1%    —           80.6%         83.5%
Token efficiency      -67%     —           -55%          -51%
Speed (100KB page)    3.2ms    ~500ms      18.4ms        8.7ms
TLS fingerprinting    Yes      No          No            No
Self-hosted           Yes      No          Yes           Yes
MCP (Claude/Cursor)   Yes      No          No            No
No browser required   Yes      No          Yes           Yes
Cost                  Free     $$$$        Free          Free

Choose noxa if you want fast local extraction, LLM-optimized output, and native AI agent integration.


What it looks like

$ noxa https://stripe.com -f llm

> URL: https://stripe.com
> Title: Stripe | Financial Infrastructure for the Internet
> Language: en
> Word count: 847

# Stripe | Financial Infrastructure for the Internet

Stripe is a suite of APIs powering online payment processing
and commerce solutions for internet businesses of all sizes.

## Products
- Payments — Accept payments online and in person
- Billing — Manage subscriptions and invoicing
- Connect — Build a marketplace or platform
...

$ noxa https://github.com --brand

{
  "name": "GitHub",
  "colors": [{"hex": "#59636E", "usage": "Primary"}, ...],
  "fonts": ["Mona Sans", "ui-monospace"],
  "logos": [{"url": "https://github.githubassets.com/...", "kind": "svg"}]
}

$ noxa https://docs.rust-lang.org --crawl --depth 2 --max-pages 50

Crawling... 50/50 pages extracted
---
# Page 1: https://docs.rust-lang.org/
...
# Page 2: https://docs.rust-lang.org/book/
...

Examples

Basic Extraction

# Extract as markdown (default)
noxa https://example.com

# Multiple output formats
noxa https://example.com -f markdown    # Clean markdown
noxa https://example.com -f json        # Full structured JSON
noxa https://example.com -f text        # Plain text (no formatting)
noxa https://example.com -f llm         # Token-optimized for LLMs (67% fewer tokens)

# Bare domains work (auto-prepends https://)
noxa example.com

Content Filtering

# Only extract main content (skip nav, sidebar, footer)
noxa https://docs.rs/tokio --only-main-content

# Include specific CSS selectors
noxa https://news.ycombinator.com --include ".titleline,.score"

# Exclude specific elements
noxa https://example.com --exclude "nav,footer,.ads,.sidebar"

# Combine both
noxa https://docs.rs/reqwest --only-main-content --exclude ".sidebar"

Brand Identity Extraction

# Extract colors, fonts, logos from any website
noxa --brand https://stripe.com
# Output: { "name": "Stripe", "colors": [...], "fonts": ["Sohne"], "logos": [...] }

noxa --brand https://github.com
# Output: { "name": "GitHub", "colors": [{"hex": "#1F2328", ...}], "fonts": ["Mona Sans"], ... }

noxa --brand wikipedia.org
# Output: 10 colors, 5 fonts, favicon, logo URL

Sitemap Discovery

# Discover all URLs from a site's sitemaps
noxa --map https://sitemaps.org
# Output: one URL per line (84 URLs found)

# JSON output with metadata
noxa --map https://sitemaps.org -f json
# Output: [{ "url": "...", "last_modified": "...", "priority": 0.8 }]

Recursive Crawling

# Crawl a site (default: depth 1, max 20 pages)
noxa --crawl https://example.com

# Control depth and page limit
noxa --crawl --depth 2 --max-pages 50 https://docs.rs/tokio

# Crawl with sitemap seeding (finds more pages)
noxa --crawl --sitemap --depth 2 https://docs.rs/tokio

# Filter crawl paths
noxa --crawl --include-paths "/api/*,/guide/*" https://docs.example.com
noxa --crawl --exclude-paths "/changelog/*,/blog/*" https://docs.example.com

# Control concurrency and delay
noxa --crawl --concurrency 10 --delay 200 https://example.com

Change Detection (Diff)

# Step 1: Save a snapshot
noxa https://example.com -f json > snapshot.json

# Step 2: Later, compare against the snapshot
noxa --diff-with snapshot.json https://example.com
# Output:
#   Status: Same
#   Word count delta: +0

# If the page changed:
#   Status: Changed
#   Word count delta: +42
#   --- old
#   +++ new
#   @@ -1,3 +1,3 @@
#   -Old content here
#   +New content here

Change Monitoring (Watch)

# Poll a URL for changes every 5 minutes (default interval)
noxa --watch https://example.com

# Custom check interval (seconds)
noxa --watch --watch-interval 60 https://example.com

# Run a command on change — diff JSON is piped to stdin
noxa --watch --on-change "jq '.summary' >> changes.log" https://example.com

# Combine with a webhook — POST diff payload on each change
noxa --watch --webhook https://hooks.example.com/notify https://example.com

PDF Extraction

# PDF URLs are auto-detected via Content-Type
noxa https://example.com/report.pdf

# Control PDF mode
noxa --pdf-mode auto https://example.com/report.pdf  # Error on empty (catches scanned PDFs)
noxa --pdf-mode fast https://example.com/report.pdf  # Return whatever text is found

Batch Processing

# Multiple URLs in one command
noxa https://example.com https://httpbin.org/html https://rust-lang.org

# URLs from a file (one per line, # comments supported)
noxa --urls-file urls.txt

# Batch with JSON output
noxa --urls-file urls.txt -f json

# Proxy rotation for large batches
noxa --urls-file urls.txt --proxy-file proxies.txt --concurrency 10
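
The urls.txt referenced above is a plain text file, one URL per line, with # lines treated as comments, for example:

# crawl targets for today's batch
https://example.com
https://httpbin.org/html
https://rust-lang.org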

Local Files & Stdin

# Extract from a local HTML file
noxa --file page.html

# Pipe HTML from another command
curl -s https://example.com | noxa --stdin

# Chain with other tools
noxa https://example.com -f text | wc -w    # Word count
noxa https://example.com -f json | jq '.metadata.title'  # Extract title with jq

Browser Impersonation

# Chrome (default) — latest Chrome TLS fingerprint
noxa https://example.com

# Firefox fingerprint
noxa --browser firefox https://example.com

# Random browser per request (good for batch)
noxa --browser random --urls-file urls.txt

Custom Headers & Cookies

# Custom headers
noxa -H "Authorization: Bearer token123" https://api.example.com
noxa -H "Accept-Language: de-DE" https://example.com

# Cookies
noxa --cookie "session=abc123; theme=dark" https://example.com

# Multiple headers
noxa -H "X-Custom: value" -H "Authorization: Bearer token" https://example.com

LLM-Powered Features

These require an LLM provider. noxa tries Gemini CLI first (requires the gemini binary on PATH and uses GEMINI_MODEL, default gemini-2.5-pro), then falls back to OpenAI, local Ollama, and Anthropic, in that order. Structured JSON extraction is automatically validated against your schema and retried once with a correction prompt if the first attempt fails.

# Summarize a page (default: 3 sentences)
noxa --summarize https://example.com

# Control summary length
noxa --summarize 5 https://example.com

# Extract structured JSON with a schema
noxa --extract-json '{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}' https://example.com/product

# Extract with a schema from file
noxa --extract-json @schema.json https://example.com/product

# Extract with natural language prompt
noxa --extract-prompt "Get all pricing tiers with name, price, and features" https://stripe.com/pricing

# Use a specific LLM provider
noxa --llm-provider gemini --summarize https://example.com
noxa --llm-provider ollama --summarize https://example.com
noxa --llm-provider openai --llm-model gpt-4o --extract-prompt "..." https://example.com
noxa --llm-provider anthropic --summarize https://example.com
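
The @schema.json file passed to --extract-json contains the JSON Schema itself; for the product example above it would look like:

{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "price": { "type": "number" }
  }
}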

Raw HTML Output

# Get the raw fetched HTML (no extraction)
noxa --raw-html https://example.com

# Useful for debugging extraction issues
noxa --raw-html https://example.com > raw.html
noxa --file raw.html  # Then extract locally

Metadata & Verbose Mode

# Include YAML frontmatter with metadata
noxa --metadata https://example.com
# Output:
#   ---
#   title: "Example Domain"
#   source: "https://example.com"
#   word_count: 20
#   ---
#   # Example Domain
#   ...

# Verbose logging (debug extraction pipeline)
noxa -v https://example.com

Proxy Usage

# Single proxy
noxa --proxy http://user:pass@proxy.example.com:8080 https://example.com

# SOCKS5 proxy
noxa --proxy socks5://proxy.example.com:1080 https://example.com

# Proxy rotation from file (one per line: host:port:user:pass)
noxa --proxy-file proxies.txt https://example.com

# Auto-load proxies.txt from current directory
echo "proxy1.com:8080:user:pass" > proxies.txt
noxa https://example.com  # Automatically detects and uses proxies.txt

Web Search

Search the web and scrape the top results in one command. Uses SearXNG for fully local, private search — or falls back to the cloud API.

# Search via self-hosted SearXNG (no API key needed)
export SEARXNG_URL=https://your-searxng-instance.example.com
noxa --search "rust async best practices"

# Control result count (default: 10, max: 50)
noxa --search "rust async" --num-results 5

# Snippets only — skip scraping result URLs
noxa --search "rust async" --no-scrape

# Search via cloud API (no SearXNG required)
noxa --search "rust async" --api-key $NOXA_API_KEY

Results include title, URL, and snippet for each hit. When scraping is enabled (the default), extracted content from each page is also shown. All scraped pages are auto-persisted to ~/.noxa/content/.

Content Store

Every extraction automatically persists to ~/.noxa/content/{domain}/{path}.{md,json}. Works across all modes: scrape, batch, crawl, PDF, search, and MCP tools.

# Files are written automatically — no flags needed
noxa https://example.com
# → ~/.noxa/content/example_com/index.md
# → ~/.noxa/content/example_com/index.json

# Disable for a single run
noxa --no-store https://example.com

# Disable globally via env var
export NOXA_NO_STORE=1

The JSON file contains the full ExtractionResult including metadata, structured data, and content. The MCP diff tool reads from the store automatically — no need to pass a previous snapshot manually after the first run.
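
Because the sidecar is plain JSON, stored docs can be queried with standard tools. A small example, assuming the field layout matches the -f json output shown earlier (e.g. .metadata.title):

# Read a field straight from the stored sidecar
jq '.metadata.title' ~/.noxa/content/example_com/index.json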

Content Store — Search & Retrieve

# Full-text search across all stored docs (uses ripgrep if available)
noxa --grep "authentication"
noxa --grep "rate limit" --format json

# List all stored domains
noxa --list

# List all docs for a specific domain
noxa --list docs.rust-lang.org

# Retrieve a cached doc by exact URL
noxa --retrieve https://docs.rust-lang.org/book/

# Retrieve by fuzzy query
noxa --retrieve "rust async book"

# Check background crawl status
noxa --status docs.rust-lang.org

# Re-fetch all cached docs for one stored domain
noxa --refresh docs.rust-lang.org

Note — scaling characteristics: Exact-URL retrieval is an O(1) path lookup. Fuzzy --retrieve and --refresh are backed by list_all_docs / list_domain_urls, which perform a recursive filesystem walk and parse every .json sidecar in the store (or domain subdirectory). Performance scales linearly with document count and directory depth. On stores with a few hundred documents, both commands complete in well under a second. On large stores (10,000+ documents), expect traversal and sidecar-parsing overhead of several seconds. No in-memory index is maintained between runs.

Save to Files

# Save each page to a separate file instead of stdout
noxa --crawl --output-dir ./docs https://docs.rust-lang.org

# Works with batch mode too
noxa --urls-file urls.txt --output-dir ./output

# Single URL to file
noxa https://example.com --output-dir ./pages
# → ./pages/index.md

Real-World Recipes

# Monitor competitor pricing — save today's pricing
noxa --extract-json '{"type":"array","items":{"type":"object","properties":{"plan":{"type":"string"},"price":{"type":"string"}}}}' \
  https://competitor.com/pricing -f json > pricing-$(date +%Y%m%d).json

# Build a documentation search index
noxa --crawl --sitemap --depth 3 --max-pages 500 -f llm https://docs.example.com > docs.txt

# Extract all images from a page
noxa https://example.com -f json | jq -r '.content.images[].src'

# Get all external links
noxa https://example.com -f json | jq -r '.content.links[] | select(.href | startswith("http")) | .href'

# Compare two pages
noxa https://site-a.com -f json > a.json
noxa https://site-b.com --diff-with a.json

Claude Code Plugin

noxa ships as a Claude Code plugin that adds a skill (auto-activates on scrape/crawl/search triggers) and wires up the MCP server in one step.

# Add the marketplace and install
/plugin marketplace add jmagar/noxa
/plugin install noxa
/reload-plugins

The plugin provides:

  • noxa skill — auto-activates when you ask to scrape, crawl, extract, search, watch, or summarize URLs; covers all flag combinations and common recipes
  • MCP server — all 10 tools available directly to Claude (scrape, crawl, map, batch, extract, summarize, diff, brand, search, research)

Requires noxa on PATH. Run noxa setup after installing to configure everything.


MCP Server — 10 tools for AI agents

noxa MCP server

noxa ships as an MCP server that plugs into Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Antigravity, Codex CLI, and any MCP-compatible client.

npx create-noxa    # auto-detects and configures everything

Or manual setup — add to your Claude Desktop config:

{
  "mcpServers": {
    "noxa": {
      "command": "noxa",
      "args": ["mcp"]
    }
  }
}

Then in Claude: "Scrape the top 5 results for 'web scraping tools' and compare their pricing" — it just works.
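
If you use Claude Code's CLI instead of editing Claude Desktop's config, registering the same server should look roughly like this (check claude mcp add --help for the exact syntax on your version):

claude mcp add noxa -- noxa mcp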

Available tools

Tool        Description                          Requires API key?
scrape      Extract content from any URL         No
crawl       Recursive site crawl                 No
map         Discover URLs from sitemaps          No
batch       Parallel multi-URL extraction        No
extract     LLM-powered structured extraction    No (needs local LLM or NOXA_API_KEY)
summarize   Page summarization                   No (needs local LLM or NOXA_API_KEY)
diff        Content change detection             No
brand       Brand identity extraction            No
search      Web search + scrape results          SEARXNG_URL: No, cloud: Yes
research    Deep multi-source research           Yes

9 of 10 tools work locally — no account, no API key, fully private.


Features

Extraction

  • Readability scoring — multi-signal content detection (text density, semantic tags, link ratio)
  • Noise filtering — strips nav, footer, ads, modals, cookie banners (Tailwind-safe)
  • Data island extraction — catches React/Next.js JSON payloads, JSON-LD, hydration data
  • YouTube metadata — structured data from any YouTube video
  • PDF extraction — auto-detected via Content-Type
  • 5 output formats — markdown, text, JSON, LLM-optimized, HTML

Content control

noxa URL --include "article, .content"       # CSS selector include
noxa URL --exclude "nav, footer, .sidebar"    # CSS selector exclude
noxa URL --only-main-content                  # Auto-detect main content

Crawling

noxa URL --crawl --depth 3 --max-pages 100   # BFS same-origin crawl
noxa URL --crawl --sitemap                    # Seed from sitemap
noxa URL --map                                # Discover URLs only

LLM features (Ollama / OpenAI / Anthropic)

noxa URL --summarize                          # Page summary
noxa URL --extract-prompt "Get all prices"    # Natural language extraction
noxa URL --extract-json '{"type":"object"}'   # Schema-enforced extraction

Change tracking

noxa URL -f json > snap.json                  # Take snapshot
noxa URL --diff-with snap.json                # Compare later

Content store

Every extraction auto-persists to ~/.noxa/content/. Works across CLI and MCP.

noxa URL                                      # Writes ~/.noxa/content/...{.md,.json}
noxa --no-store URL                           # Opt out for one run

Web search

noxa --search "query"                         # SearXNG (SEARXNG_URL) or cloud
noxa --search "query" --no-scrape             # Snippets only, skip scraping

Brand extraction

noxa URL --brand                              # Colors, fonts, logos, OG image

Proxy rotation

noxa URL --proxy http://user:pass@host:port   # Single proxy
noxa URLs --proxy-file proxies.txt            # Pool rotation

Benchmarks

All numbers from real tests on 50 diverse pages. See benchmarks/ for methodology and reproduction instructions.

Extraction quality

Accuracy        noxa         ███████████████████   95.1%
                readability  ████████████████▋     83.5%
                trafilatura  ████████████████      80.6%
                newspaper3k  █████████████▎        66.4%

Noise removal   noxa         ███████████████████   96.1%
                readability  █████████████████▊    89.4%
                trafilatura  ██████████████████▏   91.2%
                newspaper3k  ███████████████▎      76.8%

Speed (pure extraction, no network)

10KB page       noxa         ██                     0.8ms
                readability  █████                  2.1ms
                trafilatura  ██████████             4.3ms

100KB page      noxa         ██                     3.2ms
                readability  █████                  8.7ms
                trafilatura  ██████████             18.4ms

Token efficiency (feeding to Claude/GPT)

Format        Tokens   vs Raw HTML
Raw HTML      4,820    baseline
readability   2,340    -51%
trafilatura   2,180    -55%
noxa llm      1,590    -67%

Crawl speed

Concurrency   noxa        Crawl4AI    Scrapy
5             9.8 pg/s    5.2 pg/s    7.1 pg/s
10            18.4 pg/s   8.7 pg/s    12.3 pg/s
20            32.1 pg/s   14.2 pg/s   21.8 pg/s

Architecture

noxa/
  crates/
    noxa-core     Pure extraction engine. Zero network deps. WASM-safe.
    noxa-fetch    HTTP client + TLS fingerprinting (wreq/BoringSSL). Crawler. Batch ops.
    noxa-llm      LLM provider chain (Gemini CLI -> OpenAI -> Ollama -> Anthropic)
    noxa-pdf      PDF text extraction
    noxa-mcp      MCP server (10 tools for AI agents)  → run via: noxa mcp
    noxa-rag      RAG pipeline (TEI embeddings + Qdrant vector store)  → binary: noxa-rag-daemon
    noxa-cli      CLI binary  → binary: noxa

noxa-core takes raw HTML as a &str and returns structured output. No I/O, no network, no allocator tricks. Can compile to WASM.
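
As a library, that contract would look roughly like the sketch below; note that extract and the shape of the returned value are placeholders here, not the documented noxa-core API:

// Hypothetical sketch only: `extract` and the result's shape are
// placeholders, not the confirmed noxa-core API.
use noxa_core::extract;

fn main() {
    let html = "<html><body><article><h1>Hello</h1><p>World.</p></article></body></html>";
    // Pure &str in, structured output out: no network, no I/O,
    // which is what makes the crate WASM-safe.
    let result = extract(html);
    println!("{:#?}", result);
}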


Configuration

Non-secret defaults live in config.json in your working directory. The full behavior contract is documented in docs/config.md. config/config.example.json is the template you copy to config.json, and config/config.schema.json documents the accepted keys. config/.env.example is the template you copy to .env for secrets and runtime URLs such as SEARXNG_URL. Set output_dir in config.json if you want results written to files instead of stdout.

Copy the example:

cp config/config.example.json config.json

Precedence: CLI flags > config.json > built-in defaults

For llm_provider and llm_model, leaving the keys unset preserves the Gemini -> OpenAI -> Ollama -> Anthropic fallback chain. Setting them in config.json or on the CLI forces that specific provider/model.
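
As a sketch, a config.json that pins the provider and writes results to files might look like this (only keys mentioned in this README are shown; config/config.schema.json documents the full set):

{
  "output_dir": "./pages",
  "llm_provider": "ollama",
  "llm_model": "qwen3.5:9b"
}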

Secrets and runtime URLs always go in .env, not config.json:

cp config/.env.example .env

Override config path for a single run:

NOXA_CONFIG=/path/to/other-config.json noxa https://example.com
NOXA_CONFIG=/dev/null noxa https://example.com  # bypass config entirely

Bool flag limitation: flags like --metadata, --only-main-content, --verbose, and --use-sitemap set to true in config.json cannot be overridden to false from the CLI for a single run (clap has no --no-flag variant). Use NOXA_CONFIG=/dev/null to bypass.

Cloud configuration

The cloud block in config.json configures the cloud provider settings:

{
  "cloud": {
    "provider": "gcp",
    "project": "my-gcp-project",
    "zone": "us-central1-a",
    "cluster": "my-cluster",
    "service_account_key": "/path/to/key.json",
    "disabled": false
  }
}

These settings can also be controlled via command-line flags:

  • --cloud-provider: Cloud provider to use (e.g. "gcp", "aws")
  • --cloud-project: Cloud project ID
  • --cloud-zone: Cloud zone or region
  • --cloud-cluster: Cloud cluster name
  • --cloud-service-account-key: Path to cloud service account key file
  • --cloud-disabled: Disable cloud features
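
For a one-off run, the equivalent flags look like this (values mirror the JSON example above):

noxa --cloud-provider gcp --cloud-project my-gcp-project --cloud-zone us-central1-a https://example.com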

Environment variables

Variable Description
NOXA_API_KEY Cloud API key (enables bot bypass, JS rendering, search, research)
SEARXNG_URL Self-hosted SearXNG base URL for local search (no API key required)
NOXA_NO_STORE Set to any non-empty value to disable automatic content persistence
NOXA_PROXY Single proxy URL
NOXA_PROXY_FILE Path to proxy pool file
NOXA_WEBHOOK_URL Webhook URL for notifications (also accepted by --webhook)
NOXA_LLM_BASE_URL LLM base URL for Ollama or OpenAI-compatible endpoints
NOXA_LLM_PROVIDER Default LLM provider (gemini, openai, ollama, anthropic)
NOXA_LLM_MODEL Default LLM model name
OLLAMA_HEALTH_TIMEOUT_MS Ollama availability check timeout in milliseconds
NOXA_CONFIG Path to config.json or /dev/null to bypass it
NOXA_CLOUD_PROVIDER Cloud provider (gcp, aws)
NOXA_CLOUD_PROJECT Cloud project ID
NOXA_CLOUD_ZONE Cloud zone or region
NOXA_CLOUD_CLUSTER Cloud cluster name
NOXA_CLOUD_SERVICE_ACCOUNT_KEY Path to cloud service account key file

The config/.env.example file covers the runtime noxa variables above. Use it alongside config/config.example.json and config/config.schema.json, which cover the non-secret JSON config contract.

noxa setup (or ./setup.sh for repo clones) generates a .env interactively and can configure these for you.

Local deployment and Ollama variables — used by noxa setup:

Variable Description
NOXA_PORT REST API listen port (default: 3000)
NOXA_HOST REST API bind address (default: 0.0.0.0)
NOXA_AUTH_KEY REST API authentication key
NOXA_LOG Log level (default: info)
OLLAMA_HOST Ollama base URL (default: http://localhost:11434)
OLLAMA_MODEL Default Ollama model (default: qwen3.5:9b)

LLM provider API keys:

Variable Description
OPENAI_API_KEY OpenAI API key
ANTHROPIC_API_KEY Anthropic API key
GEMINI_MODEL Gemini model override (default: gemini-2.5-pro)
OPENAI_BASE_URL Override OpenAI base URL (alternative to NOXA_LLM_BASE_URL)
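
Putting a few of these together, a minimal .env started from config/.env.example might look like the following; every value is a placeholder:

SEARXNG_URL=https://searxng.example.internal
NOXA_LLM_PROVIDER=ollama
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen3.5:9b
OPENAI_API_KEY=your-openai-key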

Cloud API (optional)

For bot-protected sites, JS rendering, and advanced features, noxa offers a hosted API at noxa.io.

The CLI and MCP server work locally first. Cloud is used as a fallback when:

  • A site has bot protection (Cloudflare, DataDome, WAF)
  • A page requires JavaScript rendering
  • You use research
  • You use search without SEARXNG_URL

If SEARXNG_URL is set, search runs entirely through your self-hosted SearXNG instance and does not require NOXA_API_KEY. SEARXNG_URL and NOXA_WEBHOOK_URL are operator-supplied endpoints, so they are validated for URL syntax and scheme but are allowed to point at localhost or private network addresses. Search result URLs are still filtered through the public-address SSRF validator before scraping.

export NOXA_API_KEY=wc_your_key

# Automatic: tries local first, cloud on bot detection
noxa https://protected-site.com

# Force cloud
noxa --cloud https://spa-site.com

SDKs

npm install @noxa/sdk               # TypeScript/JavaScript
pip install noxa                    # Python
go get github.com/jmagar/noxa-go    # Go

Use cases

  • AI agents — Give Claude/Cursor/GPT real-time web access via MCP
  • Research — Crawl documentation, competitor sites, news archives
  • Price monitoring — Track changes with --diff-with snapshots
  • Training data — Prepare web content for fine-tuning with token-optimized output
  • Content pipelines — Batch extract + summarize in CI/CD
  • Brand intelligence — Extract visual identity from any website

Community

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Acknowledgments

TLS and HTTP/2 browser fingerprinting is powered by wreq and http2 by @0x676e67, who pioneered browser-grade HTTP/2 fingerprinting in Rust.

License

AGPL-3.0

About

Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.
