Web search, scraping, and live package-version lookup
for your terminal, your code, and your local LLM.
No accounts. No API keys. No cloud.
SoloCrawl is a small, self-hosted, fully async Python tool that does three things well: it searches the web across many sources at once, scrapes pages into clean markdown that's ready for an LLM, and looks up the latest package version from official registries. It runs on your machine, for free, and works the moment you install it: nothing to sign up for, no keys to paste.
It's built for individual developers and people tinkering with local LLMs. Use it as an MCP server (LM Studio, Claude Desktop, OpenCode, …), straight from the CLI, or as a Python library.
| Feature | What it does |
|---|---|
| Zero config | Search, scrape, and package lookup all work out of the box: no accounts, no keys. |
| Federated search | Queries up to 11 sources, merges them with Reciprocal Rank Fusion, and de-duplicates URLs into one clean ranking. |
| Smart scraping | HTML to tidy markdown via trafilatura/readability, with a Playwright browser fallback for JS-heavy pages, used only when it's actually needed. |
| Live package versions | 10 ecosystems (PyPI, npm, crates.io, Maven, Go, …) resolved live from official registries, never a stale local DB. |
| MCP-native | Drops straight into local LLM tooling as a stdio MCP server with five ready tools. |
| Fully async, bounded | One shared HTTP client, one recycled browser, global + per-domain concurrency limits. Fast without hammering anyone. |
| Hackable | Add a search or package provider as a single self-registering file; the core stays untouched. |
| Safe by default | Blocks localhost/private/cloud-metadata targets and honours robots.txt. |
The recommended way to install SoloCrawl is pipx: it drops the solocrawl and solocrawl-mcp
commands onto your PATH in their own isolated environment, so you can run them from anywhere without
juggling a virtualenv:
git clone https://github.com/hlavacm/solocrawl.git
pipx install ./solocrawl # or an absolute path: pipx install /path/to/solocrawl

Already on PyPI? Then it's just pipx install solocrawl, no checkout needed.
That's it. Now run the three core commands from any directory:
# Scrape a page to markdown
solocrawl scrape https://example.com
# Federated web search (Wikipedia + DuckDuckGo + StackExchange by default)
solocrawl search "python asyncio semaphore" --limit 5
# Live package version lookup
solocrawl package requests --ecosystem pypi

# Installed from PyPI:
pipx upgrade solocrawl
# Installed from a local checkout: pull the latest changes, then reinstall
cd /path/to/solocrawl && git pull && pipx install --force . # alias: pipx reinstall solocrawl

After upgrading, restart your MCP client (LM Studio, Cursor, Claude Desktop, …) so it picks up the new
solocrawl-mcp binary. Your mcp.json needs no changes as long as it points at solocrawl-mcp on
your PATH.
One query, many sources, a single merged ranking, with no single provider deciding everything for you.
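How a merged ranking like this can come about is easiest to see in a small Reciprocal Rank Fusion sketch. The function name, the constant k = 60 (a value commonly used with RRF), and the example URLs are assumptions for illustration; this is not SoloCrawl's actual implementation.

```python
from __future__ import annotations

from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked URL lists into one de-duplicated ranking.

    Each URL scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 within each list.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        seen: set[str] = set()
        rank = 0
        for url in ranking:
            if url in seen:  # a repeat inside one source does not count twice
                continue
            seen.add(url)
            rank += 1
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for a stable order
    return sorted(scores, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    # Hypothetical per-source results for the same query
    wikipedia = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    duckduckgo = ["https://example.com/b", "https://example.com/a", "https://example.com/d"]
    stackexchange = ["https://example.com/b", "https://example.com/e"]
    for url in reciprocal_rank_fusion([wikipedia, duckduckgo, stackexchange]):
        print(url)
```

A URL that several sources rank highly rises to the top even if no single source puts it first, which is why the fused list does not depend on any one provider.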
solocrawl search "python asyncio semaphore" --limit 5
# Pick exactly which sources to hit, and get machine-readable output
solocrawl search "django orm" --sources wikipedia,stackexchange --json

Turn any URL into LLM-ready markdown with page metadata (title, author, date, …) as front-matter.
solocrawl scrape https://example.com
# Save to a file, or force the browser for a JS-rendered page
solocrawl scrape https://example.com --out page.md
solocrawl scrape https://example.com --force-browser

The current version, and the one matching your constraint, straight from the official registry.
solocrawl package react --ecosystem npm --constraint ">=18,<19"
solocrawl package monolog/monolog --ecosystem packagist --json
solocrawl package some-lib --ecosystem pypi --allow-prerelease

The classic LLM workflow: search, scrape the top hits, and get back one aggregated, cited report.
solocrawl research "python asyncio semaphore" --depth 3

Fetch a whole list at once under the same bounded concurrency; --out-dir writes one file per URL.
solocrawl batch https://example.com https://www.python.org --out-dir /tmp/scrape
solocrawl batch --from-file urls.txt --out-dir /tmp/scrape

# List every registered provider (search + package), default vs. opt-in
solocrawl providers

This is where SoloCrawl really shines: give your local model (LM Studio, OpenCode, Claude Desktop, …)
the ability to search, scrape, and check versions. The pipx install from the
Quick start already put solocrawl-mcp on your PATH, so all that's left is pointing
your MCP client at it.
LM Studio / Claude Desktop: ready-to-use config at examples/mcp.json.
Drop it into your client's MCP settings (mcp.json):
{
  "mcpServers": {
    "solocrawl": {
      "command": "solocrawl-mcp",
      "args": [],
      "env": {
        "SOLOCRAWL_LOG_LEVEL": "INFO",
        "SOLOCRAWL_LOG_FILE": "~/.local/state/solocrawl/mcp.log"
      }
    }
  }
}

OpenCode uses a different config format. Copy
examples/opencode.jsonc into ~/.config/opencode/opencode.jsonc
(global) or opencode.jsonc in your project root. OpenCode expects type: "local", command as an
array, and environment instead of env.
If your MCP client doesn't inherit your shell PATH, replace "solocrawl-mcp" with the full path
printed by running which solocrawl-mcp (typically ~/.local/bin/solocrawl-mcp after pipx install). Logs go to
stderr (visible in LM Studio Developer Logs) and optionally to SOLOCRAWL_LOG_FILE.
The server exposes five tools:
- web_search(query, limit=5, sources=None): federated search across enabled providers
- scrape(url): fetch and extract markdown (with page metadata) from a URL
- research(query, depth=3): search, scrape the top results, and return an aggregated cited report
- package_version(name, ecosystem, constraint=None, allow_prerelease=False): live registry lookup
- list_providers(provider_type="all"): list registered search/package providers (default vs. opt-in)
To check the active version and command path:
pipx list | grep solocrawl
which solocrawl-mcp

Working from a local clone? A pipx-installed solocrawl-mcp is a snapshot: editing the repo does not update the command on your PATH, so your MCP client keeps running the old code. After changing the source, refresh it with pipx install --force . (or install once with pipx install --editable . so future edits are picked up automatically).
import asyncio

from solocrawl.config import load_config
from solocrawl.core.search import federated_search, select_providers

# Importing the provider modules is what makes them register themselves
from solocrawl.core.search.providers import duckduckgo, stackexchange, wikipedia  # noqa: F401


async def main() -> None:
    providers = select_providers(load_config())
    results = await federated_search(providers, "asyncio python", limit=3)
    for result in results:
        print(result.title, result.url)


asyncio.run(main())

See examples/library_search.py for a runnable example.
Default (zero-config, always enabled):
| Provider | Source |
|---|---|
| wikipedia | MediaWiki API |
| duckduckgo | ddgs package |
| stackexchange | Stack Exchange API (Stack Overflow) |
Opt-in (enable with SOLOCRAWL_ENABLE_PROVIDERS):
| Provider | Source |
|---|---|
| wikidata | Wikidata entity search |
| hackernews | Hacker News (Algolia) |
| arxiv | arXiv Atom API |
| pubmed | PubMed/NCBI E-utilities |
| github | GitHub repository search |
| mdn | MDN Web Docs search |
| reddit | Reddit post search (search.json) |
| searxng | Self-hosted SearXNG (set SOLOCRAWL_SEARXNG_URL) |
SOLOCRAWL_ENABLE_PROVIDERS=arxiv,hackernews solocrawl search "transformer attention" --limit 6
SOLOCRAWL_ENABLE_PROVIDERS=github,mdn solocrawl search "fetch api" --limit 6

Default ecosystems: PyPI, npm, Packagist, crates.io, NuGet, Maven Central,
RubyGems, Go modules, pub.dev, Swift. Versions are always fetched live from official
registries; SoloCrawl does not maintain its own version database. Swift packages have no central
registry, so versions come from the repository's git tags (owner/repo on GitHub).
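Once a registry has returned its list of published versions, options such as --constraint ">=18,<19" and --allow-prerelease imply a selection step. The sketch below shows one simple way such a step could work using only the standard library. All function names are hypothetical, the parser only understands dotted numeric versions and treats everything else as a pre-release, and SoloCrawl's real resolution rules per ecosystem may differ.

```python
from __future__ import annotations

import operator
import re

_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
        "!=": operator.ne, ">": operator.gt, "<": operator.lt}
_CLAUSE = re.compile(r"\s*(>=|<=|==|!=|>|<)\s*(\d+(?:\.\d+)*)\s*")


def parse(version: str) -> tuple[int, ...] | None:
    """Return a numeric tuple, or None for anything that is not purely dotted digits."""
    if not re.fullmatch(r"\d+(?:\.\d+)*", version):
        return None
    return tuple(int(part) for part in version.split("."))


def _pad(a: tuple[int, ...], b: tuple[int, ...]) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Pad the shorter tuple with zeros so that 18 compares equal to 18.0.0."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def satisfies(version: tuple[int, ...], constraint: str) -> bool:
    """Check a comma-separated constraint such as '>=18,<19' (all clauses must hold)."""
    for clause in constraint.split(","):
        match = _CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f"unsupported constraint clause: {clause!r}")
        bound = tuple(int(part) for part in match.group(2).split("."))
        left, right = _pad(version, bound)
        if not _OPS[match.group(1)](left, right):
            return False
    return True


def latest_matching(versions: list[str], constraint: str | None = None) -> str | None:
    """Pick the highest stable version, optionally restricted by a constraint."""
    candidates = []
    for raw in versions:
        parsed = parse(raw)
        if parsed is None:  # pre-release or unparseable: skipped in this sketch
            continue
        if constraint is None or satisfies(parsed, constraint):
            candidates.append((_pad(parsed, (0, 0, 0))[0], raw))
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    published = ["17.0.2", "18.2.0", "18.3.1", "19.0.0", "19.1.0-rc.1"]  # made-up list
    print(latest_matching(published))
    print(latest_matching(published, ">=18,<19"))
```

The two calls correspond to "the current version" and "the one matching your constraint"; supporting pre-releases would mean giving the parser a real ordering for suffixes like -rc.1 instead of skipping them.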
solocrawl package serde --ecosystem crates
solocrawl package Newtonsoft.Json --ecosystem nuget
solocrawl package org.junit.jupiter:junit-jupiter --ecosystem maven
solocrawl package github.com/gorilla/mux --ecosystem go
solocrawl package apple/swift-argument-parser --ecosystem swift

# Browser fallback for JS-heavy pages (Playwright)
pip install -e ".[browser]"
playwright install chromium
solocrawl scrape https://example.com --force-browser
# Install everything
pip install -e ".[all]"

All defaults work with no configuration. Everything below is optional and uses the SOLOCRAWL_
prefix. For local development, copy .env.dist to .env and uncomment what you need. SoloCrawl
loads .env automatically via python-dotenv, and existing
shell environment variables take precedence.
Environment variables
| Variable | Default | Purpose |
|---|---|---|
| SOLOCRAWL_ENABLE_PROVIDERS | (empty) | Comma-separated opt-in provider names |
| SOLOCRAWL_SEARXNG_URL | (empty) | Base URL of a self-hosted SearXNG instance (enables the searxng provider) |
| SOLOCRAWL_RESPECT_ROBOTS | true | Honour robots.txt on scrape (fail-open); set false to skip |
| SOLOCRAWL_CACHE_TTL_SECONDS | 0 | In-memory fetch cache TTL in seconds (0 = disabled) |
| SOLOCRAWL_MAX_CONCURRENCY | 10 | Global fetch concurrency limit |
| SOLOCRAWL_PER_DOMAIN_LIMIT | 2 | Per-domain concurrency limit |
| SOLOCRAWL_TIMEOUT_SECONDS | 30 | Per-request timeout in seconds |
| SOLOCRAWL_MAX_RETRIES | 3 | Retries on network errors / rate limits |
| SOLOCRAWL_MAX_RESPONSE_BYTES | 10485760 | Cap on fetched response body size (10 MiB); larger bodies are truncated |
| SOLOCRAWL_PROXY_ENABLED | false | Enable optional proxy layer |
| SOLOCRAWL_PROXY_MODE | list | Proxy mode: list (rotate a pool) or endpoint (single rotating endpoint) |
| SOLOCRAWL_PROXY_LIST | (empty) | Comma-separated proxy URLs |
| SOLOCRAWL_PROXY_ENDPOINT | (empty) | Single rotating proxy endpoint |
| SOLOCRAWL_PROXY_USERNAME | (empty) | Proxy auth username |
| SOLOCRAWL_PROXY_PASSWORD | (empty) | Proxy auth password |
| SOLOCRAWL_ALLOW_INTERNAL_URLS | false | Allow scraping localhost/private IPs (dev only) |
| SOLOCRAWL_USER_AGENT | (SoloCrawl default) | Override HTTP User-Agent for API requests |
| SOLOCRAWL_BROWSER_ALLOWED | true | Allow Playwright fallback when installed |
| SOLOCRAWL_LOG_LEVEL | WARNING | Log level: DEBUG, INFO, WARNING, ERROR |
| SOLOCRAWL_LOG_FILE | (empty) | Optional log file path (also logs to stderr) |
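The interplay of SOLOCRAWL_MAX_CONCURRENCY and SOLOCRAWL_PER_DOMAIN_LIMIT can be pictured as two nested semaphores: one shared by every request and one per host. The sketch below illustrates that idea with asyncio only. The BoundedFetcher class and the fake fetch function are hypothetical and stand in for whatever SoloCrawl does internally.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class BoundedFetcher:
    """Hypothetical limiter: one global semaphore plus one semaphore per host."""

    def __init__(self, max_concurrency: int = 10, per_domain_limit: int = 2) -> None:
        self._global = asyncio.Semaphore(max_concurrency)
        self._per_domain_limit = per_domain_limit
        self._domains: dict = {}

    async def run(self, url: str, fetch):
        host = urlsplit(url).hostname or ""
        if host not in self._domains:
            self._domains[host] = asyncio.Semaphore(self._per_domain_limit)
        # Take the per-domain slot first so that requests queued behind a busy
        # host do not occupy a global slot while they wait.
        async with self._domains[host]:
            async with self._global:
                return await fetch(url)


async def demo():
    """Run ten fake fetches and report the peak total and per-host concurrency."""
    limiter = BoundedFetcher(max_concurrency=3, per_domain_limit=2)
    active = defaultdict(int)
    peak_total = 0
    peak_domain = 0

    async def fake_fetch(url: str) -> str:
        nonlocal peak_total, peak_domain
        host = urlsplit(url).hostname or ""
        active[host] += 1
        peak_total = max(peak_total, sum(active.values()))
        peak_domain = max(peak_domain, active[host])
        await asyncio.sleep(0.01)  # stands in for network latency
        active[host] -= 1
        return url

    urls = [f"https://a.example/{i}" for i in range(5)]
    urls += [f"https://b.example/{i}" for i in range(5)]
    await asyncio.gather(*(limiter.run(url, fake_fetch) for url in urls))
    return peak_total, peak_domain


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With these limits, no more than three fake fetches run at once overall and no more than two against the same host, regardless of how many URLs are queued.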
By default SoloCrawl refuses to fetch localhost, link-local, private, reserved, and
cloud-metadata addresses. It checks literal hosts, DNS-resolved A/AAAA records, HTTP redirect
targets, and Playwright's final browser URL. SoloCrawl is still a single-user local tool, not a
hostile-multi-tenant proxy; do not expose it to untrusted network callers.
SOLOCRAWL_ALLOW_INTERNAL_URLS=true disables these internal-target checks entirely (intended for
trusted local development only).
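As a rough illustration of the first two checks (literal hosts and DNS-resolved addresses), the sketch below classifies targets with Python's ipaddress module. The function names and the injectable resolver are hypothetical, redirect and browser-URL checks are not covered, and SoloCrawl's own rules may be stricter.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_internal_address(ip: str) -> bool:
    """True for loopback, private, link-local (incl. 169.254.169.254), and reserved addresses."""
    addr = ipaddress.ip_address(ip.split("%")[0])  # drop any IPv6 zone id
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:  # ::ffff:127.0.0.1 must be judged as 127.0.0.1
        addr = mapped
    return (
        addr.is_loopback
        or addr.is_private
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )


def is_url_allowed(url: str, resolve=socket.getaddrinfo) -> bool:
    """Reject non-HTTP(S) URLs and any host that is, or resolves to, an internal address."""
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme not in ("http", "https") or not host:
        return False
    try:
        return not is_internal_address(host)  # host is a literal IP
    except ValueError:
        pass  # not an IP literal, fall through to name checks
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        infos = resolve(host, None)
    except socket.gaierror:
        return False  # unresolvable hosts are refused in this sketch
    # Every resolved address must be public
    return bool(infos) and all(not is_internal_address(info[4][0]) for info in infos)


if __name__ == "__main__":
    for candidate in ("http://127.0.0.1:8080/", "http://169.254.169.254/latest/meta-data/"):
        print(candidate, is_url_allowed(candidate))
```

Checking the resolved addresses, not only the hostname string, matters because a public-looking name can point at a private IP.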
The whole point of the plugin layout is that adding a source is a single self-registering file; the core never changes. To add a search provider:
- Create src/solocrawl/core/search/providers/myprovider.py implementing SearchProvider.
- Register with @register("myprovider", zero_config=True) or as opt-in.
- Import the module in src/solocrawl/core/search/providers/__init__.py so registration runs.
- Add fixture-based tests in tests/.
The same pattern applies to package providers under src/solocrawl/core/packages/providers/.
Work from a checkout in a virtualenv with an editable install; this also drops the solocrawl and
solocrawl-mcp scripts into .venv/bin/:
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Then run the quality gate:
ruff check . && ruff format --check .
pyright
pytest
# …or all in one line:
ruff check . && pyright && pytest

SoloCrawl is built for individual developers and local LLM tooling. It respects the robots.txt and
terms of service of target sites: scrape consults robots.txt and refuses disallowed URLs by
default (fail-open on errors; opt out with SOLOCRAWL_RESPECT_ROBOTS=false). The proxy and scraping
features are not intended to bypass site rules, captchas, or anti-bot systems. Use responsibly and
stay within legitimate access patterns.
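A fail-open robots.txt check of the kind described here can be sketched with the standard library's urllib.robotparser. The helper name, the "SoloCrawl" user-agent string, and the convention of passing None when robots.txt could not be fetched are assumptions for illustration only.

```python
from __future__ import annotations

from urllib import robotparser


def is_allowed(robots_txt: str | None, url: str, user_agent: str = "SoloCrawl") -> bool:
    """Return True if `url` may be fetched according to the given robots.txt body.

    `robots_txt` is None when the file could not be retrieved; in that case the
    check fails open and the URL is allowed.
    """
    if robots_txt is None:
        return True
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "https://example.com/private/page"))
    print(is_allowed(rules, "https://example.com/public/page"))
    print(is_allowed(None, "https://example.com/private/page"))
```

Failing open keeps a broken or unreachable robots.txt from blocking every scrape, at the cost of occasionally fetching a page the site would have disallowed.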
MIT. See LICENSE.