The fastest web scraper for AI agents.
67% fewer tokens. Sub-millisecond extraction. Zero browser overhead.
Claude Code's built-in web_fetch → 403 Forbidden. noxa → clean markdown.
Your AI agent calls fetch() and gets a 403. Or 142KB of raw HTML that burns through your token budget. noxa fixes both.
It extracts clean, structured content from any URL using Chrome-level TLS fingerprinting — no headless browser, no Selenium, no Puppeteer. Output is optimized for LLMs: 67% fewer tokens than raw HTML, with metadata, links, and images preserved.
```
              Raw HTML                                          noxa
┌──────────────────────────────────┐ ┌──────────────────────────────────┐
│ <div class="ad-wrapper">         │ │ # Breaking: AI Breakthrough      │
│ <nav class="global-nav">         │ │                                  │
│ <script>window.__NEXT_DATA__     │ │ Researchers achieved 94%         │
│ ={...8KB of JSON...}</script>    │ │ accuracy on cross-domain         │
│ <div class="social-share">       │ │ reasoning benchmarks.            │
│ <button>Tweet</button>           │ │                                  │
│ <footer class="site-footer">     │ │ ## Key Findings                  │
│ <!-- 142,847 characters -->      │ │ - 3x faster inference            │
│                                  │ │ - Open-source weights            │
│ 4,820 tokens                     │ │ 1,590 tokens                     │
└──────────────────────────────────┘ └──────────────────────────────────┘
```
Need config details? See `docs/config.md` for how `config.json` and `.env` work together.
```bash
curl -fsSL https://raw.githubusercontent.com/jmagar/noxa/main/install.sh | bash
```

Installs Rust if needed, builds both binaries, and launches `noxa setup` — an interactive wizard that configures `.env`, optionally installs Ollama, and wires up Claude Desktop.
```bash
npx create-noxa
```

Auto-detects your AI tools, downloads the MCP server, and configures everything. One command.
Download from GitHub Releases for macOS (arm64, x86_64) and Linux (x86_64, aarch64).
```bash
cargo install --git https://github.com/jmagar/noxa.git --bin noxa --bin noxa-mcp
```

Then run the interactive setup wizard:

```bash
noxa setup
```

Or, from a clone of the repo:

```bash
git clone https://github.com/jmagar/noxa.git && cd noxa
./setup.sh   # builds (first run) then delegates to noxa setup
```

|  | noxa | Firecrawl | Trafilatura | Readability |
|---|---|---|---|---|
| Extraction accuracy | 95.1% | — | 80.6% | 83.5% |
| Token efficiency | -67% | — | -55% | -51% |
| Speed (100KB page) | 3.2ms | ~500ms | 18.4ms | 8.7ms |
| TLS fingerprinting | Yes | No | No | No |
| Self-hosted | Yes | No | Yes | Yes |
| MCP (Claude/Cursor) | Yes | No | No | No |
| No browser required | Yes | No | Yes | Yes |
| Cost | Free | $$$$ | Free | Free |
Choose noxa if you want fast local extraction, LLM-optimized output, and native AI agent integration.
```
$ noxa https://stripe.com -f llm
> URL: https://stripe.com
> Title: Stripe | Financial Infrastructure for the Internet
> Language: en
> Word count: 847
# Stripe | Financial Infrastructure for the Internet
Stripe is a suite of APIs powering online payment processing
and commerce solutions for internet businesses of all sizes.
## Products
- Payments — Accept payments online and in person
- Billing — Manage subscriptions and invoicing
- Connect — Build a marketplace or platform
...
```

```
$ noxa https://github.com --brand
{
"name": "GitHub",
"colors": [{"hex": "#59636E", "usage": "Primary"}, ...],
"fonts": ["Mona Sans", "ui-monospace"],
"logos": [{"url": "https://github.githubassets.com/...", "kind": "svg"}]
}
```

```
$ noxa https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
Crawling... 50/50 pages extracted
---
# Page 1: https://docs.rust-lang.org/
...
# Page 2: https://docs.rust-lang.org/book/
...
```

```bash
# Extract as markdown (default)
noxa https://example.com
# Multiple output formats
noxa https://example.com -f markdown # Clean markdown
noxa https://example.com -f json # Full structured JSON
noxa https://example.com -f text # Plain text (no formatting)
noxa https://example.com -f llm # Token-optimized for LLMs (67% fewer tokens)
# Bare domains work (auto-prepends https://)
noxa example.com
```

```bash
# Only extract main content (skip nav, sidebar, footer)
noxa https://docs.rs/tokio --only-main-content
# Include specific CSS selectors
noxa https://news.ycombinator.com --include ".titleline,.score"
# Exclude specific elements
noxa https://example.com --exclude "nav,footer,.ads,.sidebar"
# Combine both
noxa https://docs.rs/reqwest --only-main-content --exclude ".sidebar"
```

```bash
# Extract colors, fonts, logos from any website
noxa --brand https://stripe.com
# Output: { "name": "Stripe", "colors": [...], "fonts": ["Sohne"], "logos": [...] }
noxa --brand https://github.com
# Output: { "name": "GitHub", "colors": [{"hex": "#1F2328", ...}], "fonts": ["Mona Sans"], ... }
noxa --brand wikipedia.org
# Output: 10 colors, 5 fonts, favicon, logo URL
```

```bash
# Discover all URLs from a site's sitemaps
noxa --map https://sitemaps.org
# Output: one URL per line (84 URLs found)
# JSON output with metadata
noxa --map https://sitemaps.org -f json
# Output: [{ "url": "...", "last_modified": "...", "priority": 0.8 }]
```

```bash
# Crawl a site (default: depth 1, max 20 pages)
noxa --crawl https://example.com
# Control depth and page limit
noxa --crawl --depth 2 --max-pages 50 https://docs.rs/tokio
# Crawl with sitemap seeding (finds more pages)
noxa --crawl --sitemap --depth 2 https://docs.rs/tokio
# Filter crawl paths
noxa --crawl --include-paths "/api/*,/guide/*" https://docs.example.com
noxa --crawl --exclude-paths "/changelog/*,/blog/*" https://docs.example.com
# Control concurrency and delay
noxa --crawl --concurrency 10 --delay 200 https://example.com
```

```bash
# Step 1: Save a snapshot
noxa https://example.com -f json > snapshot.json
# Step 2: Later, compare against the snapshot
noxa --diff-with snapshot.json https://example.com
# Output:
# Status: Same
# Word count delta: +0
# If the page changed:
# Status: Changed
# Word count delta: +42
# --- old
# +++ new
# @@ -1,3 +1,3 @@
# -Old content here
# +New content here
```

```bash
# Poll a URL for changes every 5 minutes (default interval)
noxa --watch https://example.com
# Custom check interval (seconds)
noxa --watch --watch-interval 60 https://example.com
# Run a command on change — diff JSON is piped to stdin
noxa --watch --on-change "jq '.summary' >> changes.log" https://example.com
# Combine with a webhook — POST diff payload on each change
noxa --watch --webhook https://hooks.example.com/notify https://example.com
```

```bash
# PDF URLs are auto-detected via Content-Type
noxa https://example.com/report.pdf
# Control PDF mode
noxa --pdf-mode auto https://example.com/report.pdf # Error on empty (catches scanned PDFs)
noxa --pdf-mode fast https://example.com/report.pdf   # Return whatever text is found
```

```bash
# Multiple URLs in one command
noxa https://example.com https://httpbin.org/html https://rust-lang.org
# URLs from a file (one per line, # comments supported)
noxa --urls-file urls.txt
# Batch with JSON output
noxa --urls-file urls.txt -f json
# Proxy rotation for large batches
noxa --urls-file urls.txt --proxy-file proxies.txt --concurrency 10
```

```bash
# Extract from a local HTML file
noxa --file page.html
# Pipe HTML from another command
curl -s https://example.com | noxa --stdin
# Chain with other tools
noxa https://example.com -f text | wc -w # Word count
noxa https://example.com -f json | jq '.metadata.title'   # Extract title with jq
```

```bash
# Chrome (default) — latest Chrome TLS fingerprint
noxa https://example.com
# Firefox fingerprint
noxa --browser firefox https://example.com
# Random browser per request (good for batch)
noxa --browser random --urls-file urls.txt
```

```bash
# Custom headers
noxa -H "Authorization: Bearer token123" https://api.example.com
noxa -H "Accept-Language: de-DE" https://example.com
# Cookies
noxa --cookie "session=abc123; theme=dark" https://example.com
# Multiple headers
noxa -H "X-Custom: value" -H "Authorization: Bearer token" https://example.comThese require an LLM provider. noxa tries Gemini CLI first (requires the gemini
binary on PATH and uses GEMINI_MODEL, default gemini-2.5-pro), then falls
back to OpenAI, Ollama local, and Anthropic. Structured JSON extraction is
automatically validated against your schema and retried once with a correction
prompt if the first attempt fails.
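As a quick orientation, here is one way to prepare the whole fallback chain. The chain order comes from the paragraph above and the variable names from the environment tables later in this README; every value shown is a placeholder:

```bash
# Provider chain: Gemini CLI -> OpenAI -> Ollama -> Anthropic.
# noxa uses the first provider that is available.
export GEMINI_MODEL=gemini-2.5-pro          # used when the gemini binary is on PATH
export OPENAI_API_KEY=sk-...                # fallback: OpenAI
export OLLAMA_HOST=http://localhost:11434   # fallback: local Ollama
export ANTHROPIC_API_KEY=sk-ant-...         # fallback: Anthropic
```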
```bash
# Summarize a page (default: 3 sentences)
noxa --summarize https://example.com
# Control summary length
noxa --summarize 5 https://example.com
# Extract structured JSON with a schema
noxa --extract-json '{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}' https://example.com/product
# Extract with a schema from file
noxa --extract-json @schema.json https://example.com/product
# Extract with natural language prompt
noxa --extract-prompt "Get all pricing tiers with name, price, and features" https://stripe.com/pricing
# Use a specific LLM provider
noxa --llm-provider gemini --summarize https://example.com
noxa --llm-provider ollama --summarize https://example.com
noxa --llm-provider openai --llm-model gpt-4o --extract-prompt "..." https://example.com
noxa --llm-provider anthropic --summarize https://example.com
```

```bash
# Get the raw fetched HTML (no extraction)
noxa --raw-html https://example.com
# Useful for debugging extraction issues
noxa --raw-html https://example.com > raw.html
# 
noxa --file raw.html   # Then extract locally
```

```bash
# Include YAML frontmatter with metadata
noxa --metadata https://example.com
# Output:
# ---
# title: "Example Domain"
# source: "https://example.com"
# word_count: 20
# ---
# # Example Domain
# ...
# Verbose logging (debug extraction pipeline)
noxa -v https://example.com
```

```bash
# Single proxy
noxa --proxy http://user:pass@proxy.example.com:8080 https://example.com
# SOCKS5 proxy
noxa --proxy socks5://proxy.example.com:1080 https://example.com
# Proxy rotation from file (one per line: host:port:user:pass)
noxa --proxy-file proxies.txt https://example.com
# Auto-load proxies.txt from current directory
echo "proxy1.com:8080:user:pass" > proxies.txt
noxa https://example.com   # Automatically detects and uses proxies.txt
```

Search the web and scrape the top results in one command. Uses SearXNG for fully local, private search — or falls back to the cloud API.

```bash
# Search via self-hosted SearXNG (no API key needed)
export SEARXNG_URL=https://your-searxng-instance.example.com
noxa --search "rust async best practices"
# Control result count (default: 10, max: 50)
noxa --search "rust async" --num-results 5
# Snippets only — skip scraping result URLs
noxa --search "rust async" --no-scrape
# Search via cloud API (no SearXNG required)
noxa --search "rust async" --api-key $NOXA_API_KEYResults include title, URL, and snippet for each hit. When scraping is enabled (the default), extracted content from each page is also shown. All scraped pages are auto-persisted to ~/.noxa/content/.
Every extraction automatically persists to `~/.noxa/content/{domain}/{path}.{md,json}`. Works across all modes: scrape, batch, crawl, PDF, search, and MCP tools.
```bash
# Files are written automatically — no flags needed
noxa https://example.com
# → ~/.noxa/content/example_com/index.md
# → ~/.noxa/content/example_com/index.json
# Disable for a single run
noxa --no-store https://example.com
# Disable globally via env var
export NOXA_NO_STORE=1
```

The JSON file contains the full `ExtractionResult` including metadata, structured data, and content. The MCP diff tool reads from the store automatically — no need to pass a previous snapshot manually after the first run.
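Since the sidecar is plain JSON on disk, you can query it directly. A small sketch, reusing the `.metadata.title` path from the jq examples earlier (the full `ExtractionResult` shape is documented by the schema, not here):

```bash
# Read a stored sidecar without re-fetching the page
jq '.metadata.title' ~/.noxa/content/example_com/index.json
```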
```bash
# Full-text search across all stored docs (uses ripgrep if available)
noxa --grep "authentication"
noxa --grep "rate limit" --format json
# List all stored domains
noxa --list
# List all docs for a specific domain
noxa --list docs.rust-lang.org
# Retrieve a cached doc by exact URL
noxa --retrieve https://docs.rust-lang.org/book/
# Retrieve by fuzzy query
noxa --retrieve "rust async book"
# Check background crawl status
noxa --status docs.rust-lang.org
# Re-fetch all cached docs for one stored domain
noxa --refresh docs.rust-lang.org
```

Note — scaling characteristics: Exact-URL retrieval is an O(1) path lookup. Fuzzy `--retrieve` and `--refresh` are backed by `list_all_docs`/`list_domain_urls`, which perform a recursive filesystem walk and parse every `.json` sidecar in the store (or domain subdirectory). Performance scales linearly with document count and directory depth. On stores with a few hundred documents, both commands complete in well under a second. On large stores (10,000+ documents), expect traversal and sidecar-parsing overhead of several seconds. No in-memory index is maintained between runs.
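To see the difference on your own store, time the two retrieval styles (commands from the section above):

```bash
time noxa --retrieve https://docs.rust-lang.org/book/   # exact URL: O(1) path lookup
time noxa --retrieve "rust async book"                  # fuzzy: walks every .json sidecar
```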
```bash
# Save each page to a separate file instead of stdout
noxa --crawl --output-dir ./docs https://docs.rust-lang.org
# Works with batch mode too
noxa --urls-file urls.txt --output-dir ./output
# Single URL to file
noxa https://example.com --output-dir ./pages
# → ./pages/index.md
```

```bash
# Monitor competitor pricing — save today's pricing
noxa --extract-json '{"type":"array","items":{"type":"object","properties":{"plan":{"type":"string"},"price":{"type":"string"}}}}' \
https://competitor.com/pricing -f json > pricing-$(date +%Y%m%d).json
# Build a documentation search index
noxa --crawl --sitemap --depth 3 --max-pages 500 -f llm https://docs.example.com > docs.txt
# Extract all images from a page
noxa https://example.com -f json | jq -r '.content.images[].src'
# Get all external links
noxa https://example.com -f json | jq -r '.content.links[] | select(.href | startswith("http")) | .href'
# Compare two pages
noxa https://site-a.com -f json > a.json
noxa https://site-b.com --diff-with a.json
```

noxa ships as a Claude Code plugin that adds a skill (auto-activates on scrape/crawl/search triggers) and wires up the MCP server in one step.
```
# Add the marketplace and install
/plugin marketplace add jmagar/noxa
/plugin install noxa
/reload-plugins
```

The plugin provides:

- `noxa` skill — auto-activates when you ask to scrape, crawl, extract, search, watch, or summarize URLs; covers all flag combinations and common recipes
- MCP server — all 10 tools available directly to Claude (`scrape`, `crawl`, `map`, `batch`, `extract`, `summarize`, `diff`, `brand`, `search`, `research`)
Requires `noxa` on PATH. Run `noxa setup` after installing to configure everything.
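A quick pre-flight check before installing the plugin:

```bash
# The plugin invokes the noxa binary, so verify it resolves first
command -v noxa >/dev/null || echo "noxa not on PATH: run the installer first"
noxa setup   # then configure .env, Ollama, and clients interactively
```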
noxa ships as an MCP server that plugs into Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Antigravity, Codex CLI, and any MCP-compatible client.
```bash
npx create-noxa   # auto-detects and configures everything
```

Or manual setup — add to your Claude Desktop config:

```json
{
"mcpServers": {
"noxa": {
"command": "noxa",
"args": ["mcp"]
}
}
}
```

Then in Claude: "Scrape the top 5 results for 'web scraping tools' and compare their pricing" — it just works.
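Other MCP clients take the same server definition. For example, a sketch for Cursor, assuming its global `~/.cursor/mcp.json` location (not covered in this README; check Cursor's docs, and note this overwrites any existing file):

```bash
mkdir -p ~/.cursor
cat > ~/.cursor/mcp.json <<'EOF'
{
  "mcpServers": {
    "noxa": { "command": "noxa", "args": ["mcp"] }
  }
}
EOF
```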
| Tool | Description | Requires API key? |
|---|---|---|
| `scrape` | Extract content from any URL | No |
| `crawl` | Recursive site crawl | No |
| `map` | Discover URLs from sitemaps | No |
| `batch` | Parallel multi-URL extraction | No |
| `extract` | LLM-powered structured extraction | No (needs local LLM or `NOXA_API_KEY`) |
| `summarize` | Page summarization | No (needs local LLM or `NOXA_API_KEY`) |
| `diff` | Content change detection | No |
| `brand` | Brand identity extraction | No |
| `search` | Web search + scrape results | `SEARXNG_URL`: No; cloud: Yes |
| `research` | Deep multi-source research | Yes |
9 of 10 tools work locally — no account, no API key, fully private.
- Readability scoring — multi-signal content detection (text density, semantic tags, link ratio)
- Noise filtering — strips nav, footer, ads, modals, cookie banners (Tailwind-safe)
- Data island extraction — catches React/Next.js JSON payloads, JSON-LD, hydration data
- YouTube metadata — structured data from any YouTube video
- PDF extraction — auto-detected via Content-Type
- 5 output formats — markdown, text, JSON, LLM-optimized, HTML
```bash
noxa URL --include "article, .content"     # CSS selector include
noxa URL --exclude "nav, footer, .sidebar" # CSS selector exclude
noxa URL --only-main-content               # Auto-detect main content
```

```bash
noxa URL --crawl --depth 3 --max-pages 100 # BFS same-origin crawl
noxa URL --crawl --sitemap                 # Seed from sitemap
noxa URL --map                             # Discover URLs only
```

```bash
noxa URL --summarize                        # Page summary
noxa URL --extract-prompt "Get all prices"  # Natural language extraction
noxa URL --extract-json '{"type":"object"}' # Schema-enforced extraction
```

```bash
noxa URL -f json > snap.json   # Take snapshot
noxa URL --diff-with snap.json # Compare later
```

Every extraction auto-persists to `~/.noxa/content/`. Works across CLI and MCP.

```bash
noxa URL            # Writes ~/.noxa/content/...{.md,.json}
noxa --no-store URL # Opt out for one run
```

```bash
noxa --search "query"             # SearXNG (SEARXNG_URL) or cloud
noxa --search "query" --no-scrape # Snippets only, skip scraping
```

```bash
noxa URL --brand # Colors, fonts, logos, OG image
```

```bash
noxa URL --proxy http://user:pass@host:port # Single proxy
noxa URLs --proxy-file proxies.txt          # Pool rotation
```

All numbers from real tests on 50 diverse pages. See `benchmarks/` for methodology and reproduction instructions.
```
Accuracy       noxa         ███████████████████  95.1%
               readability  ████████████████▋    83.5%
               trafilatura  ████████████████     80.6%
               newspaper3k  █████████████▎       66.4%

Noise removal  noxa         ███████████████████  96.1%
               readability  █████████████████▊   89.4%
               trafilatura  ██████████████████▏  91.2%
               newspaper3k  ███████████████▎     76.8%

10KB page      noxa         ██           0.8ms
               readability  █████        2.1ms
               trafilatura  ██████████   4.3ms

100KB page     noxa         ██           3.2ms
               readability  █████        8.7ms
               trafilatura  ██████████  18.4ms
```
| Format | Tokens | vs Raw HTML |
|---|---|---|
| Raw HTML | 4,820 | baseline |
| readability | 2,340 | -51% |
| trafilatura | 2,180 | -55% |
| noxa llm | 1,590 | -67% |
| Concurrency | noxa | Crawl4AI | Scrapy |
|---|---|---|---|
| 5 | 9.8 pg/s | 5.2 pg/s | 7.1 pg/s |
| 10 | 18.4 pg/s | 8.7 pg/s | 12.3 pg/s |
| 20 | 32.1 pg/s | 14.2 pg/s | 21.8 pg/s |
```
noxa/
  crates/
    noxa-core   Pure extraction engine. Zero network deps. WASM-safe.
    noxa-fetch  HTTP client + TLS fingerprinting (wreq/BoringSSL). Crawler. Batch ops.
    noxa-llm    LLM provider chain (Gemini CLI -> OpenAI -> Ollama -> Anthropic)
    noxa-pdf    PDF text extraction
    noxa-mcp    MCP server (10 tools for AI agents) → run via: noxa mcp
    noxa-rag    RAG pipeline (TEI embeddings + Qdrant vector store) → binary: noxa-rag-daemon
    noxa-cli    CLI binary → binary: noxa
```
`noxa-core` takes raw HTML as a `&str` and returns structured output. No I/O, no network, no allocator tricks. Can compile to WASM.
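A build sketch for the WASM claim; the `-p noxa-core` package flag assumes the workspace layout shown above:

```bash
rustup target add wasm32-unknown-unknown
cargo build -p noxa-core --target wasm32-unknown-unknown --release
```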
Non-secret defaults live in `config.json` in your working directory. The full behavior contract is documented in `docs/config.md`.

`config/config.example.json` is the template you copy to `config.json`, and `config/config.schema.json` documents the accepted keys.

`config/.env.example` is the template you copy to `.env` for secrets and runtime URLs such as `SEARXNG_URL`.

Set `output_dir` in `config.json` if you want results written to files instead of stdout.
Copy the example:
```bash
cp config/config.example.json config.json
```

Precedence: CLI flags > `config.json` > built-in defaults
For `llm_provider` and `llm_model`, leaving the keys unset preserves the Gemini -> OpenAI -> Ollama -> Anthropic fallback chain. Setting them in `config.json` or on the CLI forces that specific provider/model.
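For example, a minimal `config.json` that forces a specific provider (key names from the paragraph above; the model value reuses the Ollama default listed under environment variables):

```bash
cat > config.json <<'EOF'
{
  "llm_provider": "ollama",
  "llm_model": "qwen3.5:9b"
}
EOF
```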
Secrets and runtime URLs always go in `.env`, not `config.json`:

```bash
cp config/.env.example .env
```

Override config path for a single run:

```bash
NOXA_CONFIG=/path/to/other-config.json noxa https://example.com
NOXA_CONFIG=/dev/null noxa https://example.com              # bypass config entirely
```

Bool flag limitation: flags like `--metadata`, `--only-main-content`, `--verbose`, and `--use-sitemap` set to true in `config.json` cannot be overridden to false from the CLI for a single run (clap has no `--no-flag` variant). Use `NOXA_CONFIG=/dev/null` to bypass.
The `cloud` block in `config.json` configures the cloud provider settings:

```json
{
"cloud": {
"provider": "gcp",
"project": "my-gcp-project",
"zone": "us-central1-a",
"cluster": "my-cluster",
"service_account_key": "/path/to/key.json",
"disabled": false
}
}
```

These settings can also be controlled via command-line flags:

- `--cloud-provider`: Cloud provider to use (e.g. "gcp", "aws")
- `--cloud-project`: Cloud project ID
- `--cloud-zone`: Cloud zone or region
- `--cloud-cluster`: Cloud cluster name
- `--cloud-service-account-key`: Path to cloud service account key file
- `--cloud-disabled`: Disable cloud features
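The same settings as one-off flags, mirroring the JSON example above (values are placeholders):

```bash
noxa --cloud-provider gcp --cloud-project my-gcp-project \
     --cloud-zone us-central1-a https://example.com
```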
| Variable | Description |
|---|---|
| `NOXA_API_KEY` | Cloud API key (enables bot bypass, JS rendering, search, research) |
| `SEARXNG_URL` | Self-hosted SearXNG base URL for local search (no API key required) |
| `NOXA_NO_STORE` | Set to any non-empty value to disable automatic content persistence |
| `NOXA_PROXY` | Single proxy URL |
| `NOXA_PROXY_FILE` | Path to proxy pool file |
| `NOXA_WEBHOOK_URL` | Webhook URL for notifications (also accepted by `--webhook`) |
| `NOXA_LLM_BASE_URL` | LLM base URL for Ollama or OpenAI-compatible endpoints |
| `NOXA_LLM_PROVIDER` | Default LLM provider (gemini, openai, ollama, anthropic) |
| `NOXA_LLM_MODEL` | Default LLM model name |
| `OLLAMA_HEALTH_TIMEOUT_MS` | Ollama availability check timeout in milliseconds |
| `NOXA_CONFIG` | Path to config.json, or /dev/null to bypass it |
| `NOXA_CLOUD_PROVIDER` | Cloud provider (gcp, aws) |
| `NOXA_CLOUD_PROJECT` | Cloud project ID |
| `NOXA_CLOUD_ZONE` | Cloud zone or region |
| `NOXA_CLOUD_CLUSTER` | Cloud cluster name |
| `NOXA_CLOUD_SERVICE_ACCOUNT_KEY` | Path to cloud service account key file |
The `config/.env.example` file covers the runtime noxa variables above. Use it alongside `config/config.example.json` and `config/config.schema.json`, which cover the non-secret JSON config contract.

`noxa setup` (or `./setup.sh` for repo clones) generates a `.env` interactively and can configure these for you.

Local deployment and Ollama variables — used by `noxa setup`:
| Variable | Description |
|---|---|
| `NOXA_PORT` | REST API listen port (default: 3000) |
| `NOXA_HOST` | REST API bind address (default: 0.0.0.0) |
| `NOXA_AUTH_KEY` | REST API authentication key |
| `NOXA_LOG` | Log level (default: info) |
| `OLLAMA_HOST` | Ollama base URL (default: http://localhost:11434) |
| `OLLAMA_MODEL` | Default Ollama model (default: qwen3.5:9b) |
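A sketch `.env` for a local deployment, using only variable names from the table above (values are placeholders; `noxa setup` can generate this for you):

```bash
cat >> .env <<'EOF'
NOXA_PORT=3000
NOXA_HOST=0.0.0.0
NOXA_AUTH_KEY=change-me
NOXA_LOG=info
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen3.5:9b
EOF
```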
LLM provider API keys:
| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key |
| `ANTHROPIC_API_KEY` | Anthropic API key |
| `GEMINI_MODEL` | Gemini model override (default: gemini-2.5-pro) |
| `OPENAI_BASE_URL` | Override OpenAI base URL (alternative to `NOXA_LLM_BASE_URL`) |
For bot-protected sites, JS rendering, and advanced features, noxa offers a hosted API at noxa.io.
The CLI and MCP server work locally first. Cloud is used as a fallback when:
- A site has bot protection (Cloudflare, DataDome, WAF)
- A page requires JavaScript rendering
- You use `research`
- You use `search` without `SEARXNG_URL`
If `SEARXNG_URL` is set, search runs entirely through your self-hosted SearXNG instance and does not require `NOXA_API_KEY`.
`SEARXNG_URL` and `NOXA_WEBHOOK_URL` are operator-supplied endpoints, so they are validated for URL syntax and scheme but are allowed to point at localhost or private network addresses. Search result URLs are still filtered through the public-address SSRF validator before scraping.
```bash
export NOXA_API_KEY=wc_your_key
# Automatic: tries local first, cloud on bot detection
noxa https://protected-site.com
# Force cloud
noxa --cloud https://spa-site.com
```

Client SDKs:

```bash
npm install @noxa/sdk            # TypeScript/JavaScript
pip install noxa                 # Python
go get github.com/jmagar/noxa-go # Go
```

Use cases:

- AI agents — Give Claude/Cursor/GPT real-time web access via MCP
- Research — Crawl documentation, competitor sites, news archives
- Price monitoring — Track changes with `--diff-with` snapshots
- Content pipelines — Batch extract + summarize in CI/CD
- Brand intelligence — Extract visual identity from any website
Support:

- GitHub Issues — bug reports and feature requests
We welcome contributions! See CONTRIBUTING.md for guidelines.
TLS and HTTP/2 browser fingerprinting is powered by `wreq` and `http2` by @0x676e67, who pioneered browser-grade HTTP/2 fingerprinting in Rust.