noxa
The fastest web scraper for AI agents.
67% fewer tokens. Sub-millisecond extraction. Zero browser overhead.



Claude Code's built-in web_fetch → 403 Forbidden. noxa → clean markdown.


Your AI agent calls fetch() and gets a 403. Or 142KB of raw HTML that burns through your token budget. noxa fixes both.

It extracts clean, structured content from any URL using Chrome-level TLS fingerprinting — no headless browser, no Selenium, no Puppeteer. Output is optimized for LLMs: 67% fewer tokens than raw HTML, with metadata, links, and images preserved.

                     Raw HTML                          noxa
┌──────────────────────────────────┐    ┌──────────────────────────────────┐
│ <div class="ad-wrapper">         │    │ # Breaking: AI Breakthrough      │
│ <nav class="global-nav">         │    │                                  │
│ <script>window.__NEXT_DATA__     │    │ Researchers achieved 94%         │
│ ={...8KB of JSON...}</script>    │    │ accuracy on cross-domain         │
│ <div class="social-share">       │    │ reasoning benchmarks.            │
│ <button>Tweet</button>           │    │                                  │
│ <footer class="site-footer">     │    │ ## Key Findings                  │
│ <!-- 142,847 characters -->      │    │ - 3x faster inference            │
│                                  │    │ - Open-source weights            │
│         4,820 tokens             │    │         1,590 tokens             │
└──────────────────────────────────┘    └──────────────────────────────────┘

Get Started (30 seconds)

Need config details? See docs/config.md for how config.json and .env work together.

One-liner install

curl -fsSL https://raw.githubusercontent.com/jmagar/noxa/main/install.sh | bash

Installs Rust if needed, builds both binaries, and launches noxa setup — an interactive wizard that configures .env, optionally installs Ollama, and wires up Claude Desktop.

For AI agents (Claude, Cursor, Windsurf, VS Code)

npx create-noxa

Auto-detects your AI tools, downloads the MCP server, and configures everything. One command.

Prebuilt binaries

Download from GitHub Releases for macOS (arm64, x86_64) and Linux (x86_64, aarch64).

Cargo

cargo install --git https://github.com/jmagar/noxa.git --bin noxa --bin noxa-mcp

Then run the interactive setup wizard:

noxa setup

From source (clone + build)

git clone https://github.com/jmagar/noxa.git && cd noxa
./setup.sh    # builds (first run) then delegates to noxa setup

Why noxa?

                      noxa     Firecrawl   Trafilatura   Readability
Extraction accuracy   95.1%    —           80.6%         83.5%
Token efficiency      -67%     —           -55%          -51%
Speed (100KB page)    3.2ms    ~500ms      18.4ms        8.7ms
TLS fingerprinting    Yes      No          No            No
Self-hosted           Yes      No          Yes           Yes
MCP (Claude/Cursor)   Yes      No          No            No
No browser required   Yes      No          Yes           Yes
Cost                  Free     $$$$        Free          Free

Choose noxa if you want fast local extraction, LLM-optimized output, and native AI agent integration.


What it looks like

$ noxa https://stripe.com -f llm

> URL: https://stripe.com
> Title: Stripe | Financial Infrastructure for the Internet
> Language: en
> Word count: 847

# Stripe | Financial Infrastructure for the Internet

Stripe is a suite of APIs powering online payment processing
and commerce solutions for internet businesses of all sizes.

## Products
- Payments — Accept payments online and in person
- Billing — Manage subscriptions and invoicing
- Connect — Build a marketplace or platform
...

$ noxa https://github.com --brand

{
  "name": "GitHub",
  "colors": [{"hex": "#59636E", "usage": "Primary"}, ...],
  "fonts": ["Mona Sans", "ui-monospace"],
  "logos": [{"url": "https://github.githubassets.com/...", "kind": "svg"}]
}

$ noxa https://docs.rust-lang.org --crawl --depth 2 --max-pages 50

Crawling... 50/50 pages extracted
---
# Page 1: https://docs.rust-lang.org/
...
# Page 2: https://docs.rust-lang.org/book/
...

Examples

Basic Extraction

# Extract as markdown (default)
noxa https://example.com

# Multiple output formats
noxa https://example.com -f markdown    # Clean markdown
noxa https://example.com -f json        # Full structured JSON
noxa https://example.com -f text        # Plain text (no formatting)
noxa https://example.com -f llm         # Token-optimized for LLMs (67% fewer tokens)

# Bare domains work (auto-prepends https://)
noxa example.com

Content Filtering

# Only extract main content (skip nav, sidebar, footer)
noxa https://docs.rs/tokio --only-main-content

# Include specific CSS selectors
noxa https://news.ycombinator.com --include ".titleline,.score"

# Exclude specific elements
noxa https://example.com --exclude "nav,footer,.ads,.sidebar"

# Combine both
noxa https://docs.rs/reqwest --only-main-content --exclude ".sidebar"

Brand Identity Extraction

# Extract colors, fonts, logos from any website
noxa --brand https://stripe.com
# Output: { "name": "Stripe", "colors": [...], "fonts": ["Sohne"], "logos": [...] }

noxa --brand https://github.com
# Output: { "name": "GitHub", "colors": [{"hex": "#1F2328", ...}], "fonts": ["Mona Sans"], ... }

noxa --brand wikipedia.org
# Output: 10 colors, 5 fonts, favicon, logo URL

Sitemap Discovery

# Discover all URLs from a site's sitemaps
noxa --map https://sitemaps.org
# Output: one URL per line (84 URLs found)

# JSON output with metadata
noxa --map https://sitemaps.org -f json
# Output: [{ "url": "...", "last_modified": "...", "priority": 0.8 }]

Recursive Crawling

# Crawl a site (default: depth 1, max 20 pages)
noxa --crawl https://example.com

# Control depth and page limit
noxa --crawl --depth 2 --max-pages 50 https://docs.rs/tokio

# Crawl with sitemap seeding (finds more pages)
noxa --crawl --sitemap --depth 2 https://docs.rs/tokio

# Filter crawl paths
noxa --crawl --include-paths "/api/*,/guide/*" https://docs.example.com
noxa --crawl --exclude-paths "/changelog/*,/blog/*" https://docs.example.com

# Control concurrency and delay
noxa --crawl --concurrency 10 --delay 200 https://example.com

Change Detection (Diff)

# Step 1: Save a snapshot
noxa https://example.com -f json > snapshot.json

# Step 2: Later, compare against the snapshot
noxa --diff-with snapshot.json https://example.com
# Output:
#   Status: Same
#   Word count delta: +0

# If the page changed:
#   Status: Changed
#   Word count delta: +42
#   --- old
#   +++ new
#   @@ -1,3 +1,3 @@
#   -Old content here
#   +New content here

Change Monitoring (Watch)

# Poll a URL for changes every 5 minutes (default interval)
noxa --watch https://example.com

# Custom check interval (seconds)
noxa --watch --watch-interval 60 https://example.com

# Run a command on change — diff JSON is piped to stdin
noxa --watch --on-change "jq '.summary' >> changes.log" https://example.com

# Combine with a webhook — POST diff payload on each change
noxa --watch --webhook https://hooks.example.com/notify https://example.com

PDF Extraction

# PDF URLs are auto-detected via Content-Type
noxa https://example.com/report.pdf

# Control PDF mode
noxa --pdf-mode auto https://example.com/report.pdf  # Error on empty (catches scanned PDFs)
noxa --pdf-mode fast https://example.com/report.pdf  # Return whatever text is found

Batch Processing

# Multiple URLs in one command
noxa https://example.com https://httpbin.org/html https://rust-lang.org

# URLs from a file (one per line, # comments supported)
noxa --urls-file urls.txt

# Batch with JSON output
noxa --urls-file urls.txt -f json

# Proxy rotation for large batches
noxa --urls-file urls.txt --proxy-file proxies.txt --concurrency 10
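
The urls.txt referenced above is a plain text file, one URL per line, with # lines treated as comments, for example:

# crawl targets for today's batch
https://example.com
https://httpbin.org/html
https://rust-lang.org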

Local Files & Stdin

# Extract from a local HTML file
noxa --file page.html

# Pipe HTML from another command
curl -s https://example.com | noxa --stdin

# Chain with other tools
noxa https://example.com -f text | wc -w    # Word count
noxa https://example.com -f json | jq '.metadata.title'  # Extract title with jq

Browser Impersonation

# Chrome (default) — latest Chrome TLS fingerprint
noxa https://example.com

# Firefox fingerprint
noxa --browser firefox https://example.com

# Random browser per request (good for batch)
noxa --browser random --urls-file urls.txt

Custom Headers & Cookies

# Custom headers
noxa -H "Authorization: Bearer token123" https://api.example.com
noxa -H "Accept-Language: de-DE" https://example.com

# Cookies
noxa --cookie "session=abc123; theme=dark" https://example.com

# Multiple headers
noxa -H "X-Custom: value" -H "Authorization: Bearer token" https://example.com

LLM-Powered Features

These require an LLM provider. noxa tries Gemini CLI first (requires the gemini binary on PATH and uses GEMINI_MODEL, default gemini-2.5-pro), then falls back to OpenAI, local Ollama, and Anthropic, in that order. Structured JSON extraction is automatically validated against your schema and retried once with a correction prompt if the first attempt fails.

# Summarize a page (default: 3 sentences)
noxa --summarize https://example.com

# Control summary length
noxa --summarize 5 https://example.com

# Extract structured JSON with a schema
noxa --extract-json '{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}' https://example.com/product

# Extract with a schema from file
noxa --extract-json @schema.json https://example.com/product

# Extract with natural language prompt
noxa --extract-prompt "Get all pricing tiers with name, price, and features" https://stripe.com/pricing

# Use a specific LLM provider
noxa --llm-provider gemini --summarize https://example.com
noxa --llm-provider ollama --summarize https://example.com
noxa --llm-provider openai --llm-model gpt-4o --extract-prompt "..." https://example.com
noxa --llm-provider anthropic --summarize https://example.com
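
The @schema.json file passed to --extract-json contains the JSON Schema itself; for the product example above it would look like:

{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "price": { "type": "number" }
  }
}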

Raw HTML Output

# Get the raw fetched HTML (no extraction)
noxa --raw-html https://example.com

# Useful for debugging extraction issues
noxa --raw-html https://example.com > raw.html
noxa --file raw.html  # Then extract locally

Metadata & Verbose Mode

# Include YAML frontmatter with metadata
noxa --metadata https://example.com
# Output:
#   ---
#   title: "Example Domain"
#   source: "https://example.com"
#   word_count: 20
#   ---
#   # Example Domain
#   ...

# Verbose logging (debug extraction pipeline)
noxa -v https://example.com

Proxy Usage

# Single proxy
noxa --proxy http://user:pass@proxy.example.com:8080 https://example.com

# SOCKS5 proxy
noxa --proxy socks5://proxy.example.com:1080 https://example.com

# Proxy rotation from file (one per line: host:port:user:pass)
noxa --proxy-file proxies.txt https://example.com

# Auto-load proxies.txt from current directory
echo "proxy1.com:8080:user:pass" > proxies.txt
noxa https://example.com  # Automatically detects and uses proxies.txt

Web Search

Search the web and scrape the top results in one command. Uses SearXNG for fully local, private search — or falls back to the cloud API.

# Search via self-hosted SearXNG (no API key needed)
export SEARXNG_URL=https://your-searxng-instance.example.com
noxa --search "rust async best practices"

# Control result count (default: 10, max: 50)
noxa --search "rust async" --num-results 5

# Snippets only — skip scraping result URLs
noxa --search "rust async" --no-scrape

# Search via cloud API (no SearXNG required)
noxa --search "rust async" --api-key $NOXA_API_KEY

Results include title, URL, and snippet for each hit. When scraping is enabled (the default), extracted content from each page is also shown. All scraped pages are auto-persisted to ~/.noxa/content/.

Content Store

Every extraction automatically persists to ~/.noxa/content/{domain}/{path}.{md,json}. Works across all modes: scrape, batch, crawl, PDF, search, and MCP tools.

# Files are written automatically — no flags needed
noxa https://example.com
# → ~/.noxa/content/example_com/index.md
# → ~/.noxa/content/example_com/index.json

# Disable for a single run
noxa --no-store https://example.com

# Disable globally via env var
export NOXA_NO_STORE=1

The JSON file contains the full ExtractionResult including metadata, structured data, and content. The MCP diff tool reads from the store automatically — no need to pass a previous snapshot manually after the first run.
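
Because the sidecar is plain JSON, stored docs can be queried with standard tools. A small example, assuming the field layout matches the -f json output shown earlier (e.g. .metadata.title):

# Read a field straight from the stored sidecar
jq '.metadata.title' ~/.noxa/content/example_com/index.json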

Content Store — Search & Retrieve

# Full-text search across all stored docs (uses ripgrep if available)
noxa --grep "authentication"
noxa --grep "rate limit" --format json

# List all stored domains
noxa --list

# List all docs for a specific domain
noxa --list docs.rust-lang.org

# Retrieve a cached doc by exact URL
noxa --retrieve https://docs.rust-lang.org/book/

# Retrieve by fuzzy query
noxa --retrieve "rust async book"

# Check background crawl status
noxa --status docs.rust-lang.org

# Re-fetch all cached docs for one stored domain
noxa --refresh docs.rust-lang.org

Note — scaling characteristics: Exact-URL retrieval is an O(1) path lookup. Fuzzy --retrieve and --refresh are backed by list_all_docs / list_domain_urls, which perform a recursive filesystem walk and parse every .json sidecar in the store (or domain subdirectory). Performance scales linearly with document count and directory depth. On stores with a few hundred documents, both commands complete in well under a second. On large stores (10,000+ documents), expect traversal and sidecar-parsing overhead of several seconds. No in-memory index is maintained between runs.

Save to Files

# Save each page to a separate file instead of stdout
noxa --crawl --output-dir ./docs https://docs.rust-lang.org

# Works with batch mode too
noxa --urls-file urls.txt --output-dir ./output

# Single URL to file
noxa https://example.com --output-dir ./pages
# → ./pages/index.md

Real-World Recipes

# Monitor competitor pricing — save today's pricing
noxa --extract-json '{"type":"array","items":{"type":"object","properties":{"plan":{"type":"string"},"price":{"type":"string"}}}}' \
  https://competitor.com/pricing -f json > pricing-$(date +%Y%m%d).json

# Build a documentation search index
noxa --crawl --sitemap --depth 3 --max-pages 500 -f llm https://docs.example.com > docs.txt

# Extract all images from a page
noxa https://example.com -f json | jq -r '.content.images[].src'

# Get all external links
noxa https://example.com -f json | jq -r '.content.links[] | select(.href | startswith("http")) | .href'

# Compare two pages
noxa https://site-a.com -f json > a.json
noxa https://site-b.com --diff-with a.json

Claude Code Plugin

noxa ships as a Claude Code plugin that adds a skill (auto-activates on scrape/crawl/search triggers) and wires up the MCP server in one step.

# Add the marketplace and install
/plugin marketplace add jmagar/noxa
/plugin install noxa
/reload-plugins

The plugin provides:

  • noxa skill — auto-activates when you ask to scrape, crawl, extract, search, watch, or summarize URLs; covers all flag combinations and common recipes
  • MCP server — all 10 tools available directly to Claude (scrape, crawl, map, batch, extract, summarize, diff, brand, search, research)

Requires noxa on PATH. Run noxa setup after installing to configure everything.


MCP Server — 10 tools for AI agents

noxa MCP server

noxa ships as an MCP server that plugs into Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Antigravity, Codex CLI, and any MCP-compatible client.

npx create-noxa    # auto-detects and configures everything

Or manual setup — add to your Claude Desktop config:

{
  "mcpServers": {
    "noxa": {
      "command": "noxa",
      "args": ["mcp"]
    }
  }
}

Then in Claude: "Scrape the top 5 results for 'web scraping tools' and compare their pricing" — it just works.
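
If you use Claude Code's CLI instead of editing Claude Desktop's config, registering the same server should look roughly like this (check claude mcp add --help for the exact syntax on your version):

claude mcp add noxa -- noxa mcp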

Available tools

Tool        Description                          Requires API key?
scrape      Extract content from any URL         No
crawl       Recursive site crawl                 No
map         Discover URLs from sitemaps          No
batch       Parallel multi-URL extraction        No
extract     LLM-powered structured extraction    No (needs local LLM or NOXA_API_KEY)
summarize   Page summarization                   No (needs local LLM or NOXA_API_KEY)
diff        Content change detection             No
brand       Brand identity extraction            No
search      Web search + scrape results          SEARXNG_URL: No, cloud: Yes
research    Deep multi-source research           Yes

9 of 10 tools work locally — no account, no API key, fully private.


Features

Extraction

  • Readability scoring — multi-signal content detection (text density, semantic tags, link ratio)
  • Noise filtering — strips nav, footer, ads, modals, cookie banners (Tailwind-safe)
  • Data island extraction — catches React/Next.js JSON payloads, JSON-LD, hydration data
  • YouTube metadata — structured data from any YouTube video
  • PDF extraction — auto-detected via Content-Type
  • 5 output formats — markdown, text, JSON, LLM-optimized, HTML

Content control

noxa URL --include "article, .content"       # CSS selector include
noxa URL --exclude "nav, footer, .sidebar"    # CSS selector exclude
noxa URL --only-main-content                  # Auto-detect main content

Crawling

noxa URL --crawl --depth 3 --max-pages 100   # BFS same-origin crawl
noxa URL --crawl --sitemap                    # Seed from sitemap
noxa URL --map                                # Discover URLs only

LLM features (Ollama / OpenAI / Anthropic)

noxa URL --summarize                          # Page summary
noxa URL --extract-prompt "Get all prices"    # Natural language extraction
noxa URL --extract-json '{"type":"object"}'   # Schema-enforced extraction

Change tracking

noxa URL -f json > snap.json                  # Take snapshot
noxa URL --diff-with snap.json                # Compare later

Content store

Every extraction auto-persists to ~/.noxa/content/. Works across CLI and MCP.

noxa URL                                      # Writes ~/.noxa/content/...{.md,.json}
noxa --no-store URL                           # Opt out for one run

Web search

noxa --search "query"                         # SearXNG (SEARXNG_URL) or cloud
noxa --search "query" --no-scrape             # Snippets only, skip scraping

Brand extraction

noxa URL --brand                              # Colors, fonts, logos, OG image

Proxy rotation

noxa URL --proxy http://user:pass@host:port   # Single proxy
noxa URLs --proxy-file proxies.txt            # Pool rotation

Benchmarks

All numbers from real tests on 50 diverse pages. See benchmarks/ for methodology and reproduction instructions.

Extraction quality

Accuracy        noxa         ███████████████████   95.1%
                readability  ████████████████▋     83.5%
                trafilatura  ████████████████      80.6%
                newspaper3k  █████████████▎        66.4%

Noise removal   noxa         ███████████████████   96.1%
                readability  █████████████████▊    89.4%
                trafilatura  ██████████████████▏   91.2%
                newspaper3k  ███████████████▎      76.8%

Speed (pure extraction, no network)

10KB page       noxa         ██                     0.8ms
                readability  █████                  2.1ms
                trafilatura  ██████████             4.3ms

100KB page      noxa         ██                     3.2ms
                readability  █████                  8.7ms
                trafilatura  ██████████             18.4ms

Token efficiency (feeding to Claude/GPT)

Format        Tokens   vs Raw HTML
Raw HTML      4,820    baseline
readability   2,340    -51%
trafilatura   2,180    -55%
noxa llm      1,590    -67%

Crawl speed

Concurrency   noxa        Crawl4AI    Scrapy
5             9.8 pg/s    5.2 pg/s    7.1 pg/s
10            18.4 pg/s   8.7 pg/s    12.3 pg/s
20            32.1 pg/s   14.2 pg/s   21.8 pg/s

Architecture

noxa/
  crates/
    noxa-core     Pure extraction engine. Zero network deps. WASM-safe.
    noxa-fetch    HTTP client + TLS fingerprinting (wreq/BoringSSL). Crawler. Batch ops.
    noxa-llm      LLM provider chain (Gemini CLI -> OpenAI -> Ollama -> Anthropic)
    noxa-pdf      PDF text extraction
    noxa-mcp      MCP server (10 tools for AI agents)  → run via: noxa mcp
    noxa-rag      RAG pipeline (TEI embeddings + Qdrant vector store)  → binary: noxa-rag-daemon
    noxa-cli      CLI binary  → binary: noxa

noxa-core takes raw HTML as a &str and returns structured output. No I/O, no network, no allocator tricks. Can compile to WASM.
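
As a library, that contract would look roughly like the sketch below; note that extract and the shape of the returned value are placeholders here, not the documented noxa-core API:

// Hypothetical sketch only: `extract` and the result's shape are
// placeholders, not the confirmed noxa-core API.
use noxa_core::extract;

fn main() {
    let html = "<html><body><article><h1>Hello</h1><p>World.</p></article></body></html>";
    // Pure &str in, structured output out: no network, no I/O,
    // which is what makes the crate WASM-safe.
    let result = extract(html);
    println!("{:#?}", result);
}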


Configuration

Non-secret defaults live in config.json in your working directory. The full behavior contract is documented in docs/config.md. config/config.example.json is the template you copy to config.json, and config/config.schema.json documents the accepted keys. config/.env.example is the template you copy to .env for secrets and runtime URLs such as SEARXNG_URL. Set output_dir in config.json if you want results written to files instead of stdout.

Copy the example:

cp config/config.example.json config.json

Precedence: CLI flags > config.json > built-in defaults

For llm_provider and llm_model, leaving the keys unset preserves the Gemini -> OpenAI -> Ollama -> Anthropic fallback chain. Setting them in config.json or on the CLI forces that specific provider/model.
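
As a sketch, a config.json that pins the provider and writes results to files might look like this (only keys mentioned in this README are shown; config/config.schema.json documents the full set):

{
  "output_dir": "./pages",
  "llm_provider": "ollama",
  "llm_model": "qwen3.5:9b"
}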

Secrets and runtime URLs always go in .env, not config.json:

cp config/.env.example .env

Override config path for a single run:

NOXA_CONFIG=/path/to/other-config.json noxa https://example.com
NOXA_CONFIG=/dev/null noxa https://example.com  # bypass config entirely

Bool flag limitation: flags like --metadata, --only-main-content, --verbose, and --use-sitemap set to true in config.json cannot be overridden to false from the CLI for a single run (clap has no --no-flag variant). Use NOXA_CONFIG=/dev/null to bypass.

Cloud configuration

The cloud block in config.json configures the cloud provider settings:

{
  "cloud": {
    "provider": "gcp",
    "project": "my-gcp-project",
    "zone": "us-central1-a",
    "cluster": "my-cluster",
    "service_account_key": "/path/to/key.json",
    "disabled": false
  }
}

These settings can also be controlled via command-line flags:

  • --cloud-provider: Cloud provider to use (e.g. "gcp", "aws")
  • --cloud-project: Cloud project ID
  • --cloud-zone: Cloud zone or region
  • --cloud-cluster: Cloud cluster name
  • --cloud-service-account-key: Path to cloud service account key file
  • --cloud-disabled: Disable cloud features
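
For a one-off run, the equivalent flags look like this (values mirror the JSON example above):

noxa --cloud-provider gcp --cloud-project my-gcp-project --cloud-zone us-central1-a https://example.com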

Environment variables

Variable Description
NOXA_API_KEY Cloud API key (enables bot bypass, JS rendering, search, research)
SEARXNG_URL Self-hosted SearXNG base URL for local search (no API key required)
NOXA_NO_STORE Set to any non-empty value to disable automatic content persistence
NOXA_PROXY Single proxy URL
NOXA_PROXY_FILE Path to proxy pool file
NOXA_WEBHOOK_URL Webhook URL for notifications (also accepted by --webhook)
NOXA_LLM_BASE_URL LLM base URL for Ollama or OpenAI-compatible endpoints
NOXA_LLM_PROVIDER Default LLM provider (gemini, openai, ollama, anthropic)
NOXA_LLM_MODEL Default LLM model name
OLLAMA_HEALTH_TIMEOUT_MS Ollama availability check timeout in milliseconds
NOXA_CONFIG Path to config.json or /dev/null to bypass it
NOXA_CLOUD_PROVIDER Cloud provider (gcp, aws)
NOXA_CLOUD_PROJECT Cloud project ID
NOXA_CLOUD_ZONE Cloud zone or region
NOXA_CLOUD_CLUSTER Cloud cluster name
NOXA_CLOUD_SERVICE_ACCOUNT_KEY Path to cloud service account key file

The config/.env.example file covers the runtime noxa variables above. Use it alongside config/config.example.json and config/config.schema.json, which cover the non-secret JSON config contract.

noxa setup (or ./setup.sh for repo clones) generates a .env interactively and can configure these for you.

Local deployment and Ollama variables — used by noxa setup:

Variable Description
NOXA_PORT REST API listen port (default: 3000)
NOXA_HOST REST API bind address (default: 0.0.0.0)
NOXA_AUTH_KEY REST API authentication key
NOXA_LOG Log level (default: info)
OLLAMA_HOST Ollama base URL (default: http://localhost:11434)
OLLAMA_MODEL Default Ollama model (default: qwen3.5:9b)

LLM provider API keys:

Variable Description
OPENAI_API_KEY OpenAI API key
ANTHROPIC_API_KEY Anthropic API key
GEMINI_MODEL Gemini model override (default: gemini-2.5-pro)
OPENAI_BASE_URL Override OpenAI base URL (alternative to NOXA_LLM_BASE_URL)
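
Putting a few of these together, a minimal .env started from config/.env.example might look like the following; every value is a placeholder:

SEARXNG_URL=https://searxng.example.internal
NOXA_LLM_PROVIDER=ollama
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen3.5:9b
OPENAI_API_KEY=your-openai-key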

Cloud API (optional)

For bot-protected sites, JS rendering, and advanced features, noxa offers a hosted API at noxa.io.

The CLI and MCP server work locally first. Cloud is used as a fallback when:

  • A site has bot protection (Cloudflare, DataDome, WAF)
  • A page requires JavaScript rendering
  • You use research
  • You use search without SEARXNG_URL

If SEARXNG_URL is set, search runs entirely through your self-hosted SearXNG instance and does not require NOXA_API_KEY. SEARXNG_URL and NOXA_WEBHOOK_URL are operator-supplied endpoints, so they are validated for URL syntax and scheme but are allowed to point at localhost or private network addresses. Search result URLs are still filtered through the public-address SSRF validator before scraping.

export NOXA_API_KEY=wc_your_key

# Automatic: tries local first, cloud on bot detection
noxa https://protected-site.com

# Force cloud
noxa --cloud https://spa-site.com

SDKs

npm install @noxa/sdk               # TypeScript/JavaScript
pip install noxa                    # Python
go get github.com/jmagar/noxa-go    # Go

Use cases

  • AI agents — Give Claude/Cursor/GPT real-time web access via MCP
  • Research — Crawl documentation, competitor sites, news archives
  • Price monitoring — Track changes with --diff-with snapshots
  • Training data — Prepare web content for fine-tuning with token-optimized output
  • Content pipelines — Batch extract + summarize in CI/CD
  • Brand intelligence — Extract visual identity from any website

Community

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Acknowledgments

TLS and HTTP/2 browser fingerprinting is powered by wreq and http2 by @0x676e67, who pioneered browser-grade HTTP/2 fingerprinting in Rust.

License

AGPL-3.0

About

Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.
