Self-hosted, API-compatible Firecrawl alternative with Agent endpoint. MIT licensed. One docker compose up and you're running.
GroktoCrawl implements the Firecrawl v2 API surface — scrape, search, map, crawl, extract, browser sessions, monitors, and the Agent endpoint (autonomous web research) — without the closed-source dependencies. Runs entirely in Docker on your own hardware. Bring your own LLM or use the built-in fixtures.
cp .env.sample .env
docker compose up --build -d
Eight containers start. The stack includes SearXNG for real web search, a smart scraper, and an Ofelia-scheduled monitor system.
# CLI
./groktocrawl scrape https://example.com
./groktocrawl search "raspberry pi 5" --limit 3
./groktocrawl agent "What were the key Google I/O 2025 announcements?"
# Or raw curl
curl http://localhost:8080/health
curl -X POST http://localhost:8080/v2/scrape -H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Edit .env to point at a real LLM:
# DeepSeek
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.deepseek.com/v1
LLM_MODEL=deepseek-v4-flash
# OpenAI
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini
# Ollama (local)
LLM_BASE_URL=http://host.docker.internal:11434/v1
LLM_MODEL=llama3.2
flowchart TD
subgraph compose["docker-compose.yml"]
valkey[("valkey<br/>(queue + storage)")]
searxng["searxng<br/>(web search)"]
scraper("scraper-svc<br/>(smart fetch)")
browser["browser-svc<br/>(Playwright sessions)"]
agent("agent-svc<br/>(FastAPI + workers)")
ofelia["ofelia<br/>(cron scheduler)"]
valkey --- agent
searxng --- agent
scraper --- agent
browser --- agent
ofelia -.->|docker exec| agent
end
llm_provider("LLM Provider<br/>(DeepSeek / OpenAI / Ollama)")
llm_provider -.->|LLM_BASE_URL| agent
style valkey fill:#ffe0b0
style searxng fill:#b0d4ff
style scraper fill:#b0ffb0
style browser fill:#d4b0ff
style agent fill:#ffb0b0
style ofelia fill:#b0b0b0
The scraper uses a three-tier strategy: check /llms.txt first, try Accept: text/markdown second, render with Playwright third.
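The fallback order described above can be sketched roughly as follows. This is an illustration, not the scraper service's actual code: `smart_scrape`, `fetch`, and `render` are hypothetical names, and the two callables stand in for a real HTTP client and a Playwright renderer.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical callable types: each returns markdown text, or None on a miss.
Fetch = Callable[[str, dict], Optional[str]]
Render = Callable[[str], Optional[str]]


def smart_scrape(url: str, fetch: Fetch, render: Render) -> Tuple[str, Optional[str]]:
    """Return (tier_name, content) from the first tier that yields content."""
    # Tier 1: look for a site-level /llms.txt.
    content = fetch(urljoin(url, "/llms.txt"), {})
    if content:
        return "llms.txt", content

    # Tier 2: ask the server for markdown through content negotiation.
    content = fetch(url, {"Accept": "text/markdown"})
    if content:
        return "accept-markdown", content

    # Tier 3: fall back to a rendered page (Playwright in the real stack).
    return "rendered", render(url)
```

Passing the fetcher and renderer in as arguments keeps the ordering logic testable without network access; the cheap tiers run first so the browser is only used when both fail.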
groktocrawl is a CLI tool in the repo root. It needs requests (pip install requests).
./groktocrawl scrape <url> # Scrape a page to markdown
./groktocrawl search <query> --limit 5 # Search the web
./groktocrawl map <url> --limit 100 # Discover URLs on a site
./groktocrawl crawl <url> --max-depth 2 # Crawl a website
./groktocrawl agent "<prompt>" # Autonomous research agent
./groktocrawl --json --server <url> <cmd> # JSON output, custom server

| Method | Endpoint | Description |
|---|---|---|
| POST | /v2/scrape | Scrape a single URL to clean markdown |
| POST | /v2/agent | Start an autonomous research agent |
| GET | /v2/agent/:jobId | Get agent job status and results |
| DELETE | /v2/agent/:jobId | Cancel an agent job |
| POST | /v2/extract | Extract structured data from URLs (with schema) |
| GET | /v2/extract/:jobId | Get extract status and results |
| POST | /v2/crawl | Crawl a website |
| GET | /v2/crawl/:jobId | Get crawl status |
| DELETE | /v2/crawl/:jobId | Cancel a crawl |
| POST | /v2/batch/scrape | Scrape multiple URLs |
| POST | /v2/search | Search the web with content |
| POST | /v2/map | Discover URLs on a site |
| POST | /v2/parse | Upload a file (PDF, DOCX, PPTX, XLSX) and get markdown back |
| POST | /v2/browser | Create a headless browser session |
| GET | /v2/browser | List active browser sessions |
| POST | /v2/browser/:id/execute | Execute action (navigate, click, screenshot, etc.) |
| DELETE | /v2/browser/:id | Destroy a browser session |
| POST | /v2/monitor | Create a scheduled change monitor |
| GET | /v2/monitor | List all monitors |
| GET | /v2/monitor/:id | Get monitor status and history |
| PATCH | /v2/monitor/:id | Update monitor config |
| DELETE | /v2/monitor/:id | Delete a monitor |
| POST | /v2/generate-llmstxt | Generate an llms.txt file for a website |
| GET | /v2/generate-llmstxt/:jobId | Get generation status and result |
All endpoints are compatible with the Firecrawl v2 API in request and response shape.
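The agent, extract, crawl, and generate-llmstxt endpoints follow a start-then-poll pattern: a POST starts a job and a GET on the job ID reports its status. A minimal polling sketch follows. The helper name `wait_for_job`, the `status` field, and the terminal values are assumptions made for illustration; check the OpenAPI spec of your running instance for the real response schema.

```python
import time
from typing import Callable, Dict

# Assumption: job responses carry a "status" field with these terminal values.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    get_json: Callable[[str], Dict],
    base_url: str,
    kind: str,
    job_id: str,
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll GET /v2/<kind>/<job_id> until the job reaches a terminal status."""
    url = f"{base_url.rstrip('/')}/v2/{kind}/{job_id}"
    for _ in range(max_polls):
        body = get_json(url)
        if body.get("status") in TERMINAL_STATUSES:
            return body
        sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

With the requests library, `get_json` could be `lambda u: requests.get(u, timeout=30).json()`, and `kind` would be `"agent"`, `"crawl"`, `"extract"`, or `"generate-llmstxt"`.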
Interactive API documentation is available when the stack is running:
- Swagger UI: http://localhost:8080/docs
- Raw OpenAPI spec: http://localhost:8080/openapi.json
The spec is auto-generated by FastAPI from the route handlers and Pydantic models — always up to date with the running code. All 17+ endpoints with request/response schemas are documented.
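Because the spec is a standard OpenAPI document, the endpoint list can be read straight from its `paths` object. The sketch below assumes only the generic OpenAPI layout; `list_endpoints` is a hypothetical helper, not part of the repo.

```python
from typing import Dict, List, Tuple

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(spec: Dict) -> List[Tuple[str, str]]:
    """Return (METHOD, path) pairs from an OpenAPI document, sorted by path."""
    pairs = []
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items may also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                pairs.append((key.upper(), path))
    return sorted(pairs, key=lambda p: (p[1], p[0]))
```

To run it against a live stack, load the spec first, for example with `json.load(urllib.request.urlopen("http://localhost:8080/openapi.json"))`, and pass the resulting dict in.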
| Feature | Firecrawl Cloud | Firecrawl Self-Hosted | GroktoCrawl |
|---|---|---|---|
| Scrape / Crawl / Map / Search | ✅ | ✅ | ✅ |
| Agent endpoint | ✅ | ❌ (closed-source) | ✅ |
| Extract (schema-based) | ✅ | ❌ (closed-source) | ✅ |
| Browser sessions | ✅ | ❌ (closed-source) | ✅ |
| Scheduled monitors | ✅ | ❌ (closed-source) | ✅ |
| Parse (PDF, DOCX) | ✅ | ✅ | ✅ |
| Generate llms.txt | ❌ (deprecated in v2) | ❌ (deprecated in v2) | ✅ |
| Webhook delivery | ✅ | ✅ | ✅ |
| License | Proprietary | AGPL-3.0 | MIT |
| Self-contained Docker | | ✅ | ✅ one file |
| LLM integration | Built-in | Requires API key | BYO or fixture |
GroktoCrawl ships as an AgentSkills-compatible skill at skills/groktocrawl/. Any agent that supports the AgentSkills format (Claude Code, Cursor, etc.) can load it:
skills/groktocrawl/
├── SKILL.md # Metadata + instructions
├── scripts/groktocrawl # CLI — all endpoints
├── references/triggers.md # When to use which command
└── assets/examples.md # Usage examples
The skill bundles the CLI directly — no additional setup required beyond having the repo on disk.
If you use Hermes Agent, GroktoCrawl replaces the built-in web_search and web_extract tools with more capable alternatives. To avoid competition between tools:
Remove web from default_toolsets and platform_toolsets.cli in ~/.hermes/config.yaml:
# Before
default_toolsets:
- terminal
- file
- web # ← remove
# After
default_toolsets:
- terminal
- file
This removes web_search and web_extract from your agent's toolset. All web tasks will route through groktocrawl instead.
The CLI is at groktocrawl in the repo root. Copy it to your PATH:
cp groktocrawl ~/.local/bin/
The bundled skill at skills/groktocrawl/ follows the AgentSkills spec. Symlink it into your Hermes skills directory:
ln -sf "$PWD/skills/groktocrawl" ~/.hermes/skills/
Then load it in-session with /skill groktocrawl or preload it via hermes -s groktocrawl.
The CLI discovers the server in this order:
1. --server <url> flag
2. GROKTOCRAWL_API_URL env var
3. FIRECRAWL_API_URL env var (backward compat)
4. ~/.hermes/.env file
5. Default: http://localhost:8080
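The lookup order above amounts to a first-match-wins chain. The sketch below shows one way to express it and is not the CLI's actual code: `resolve_server` and `parse_env_file` are hypothetical names, and reading only GROKTOCRAWL_API_URL from the .env file is an assumption.

```python
from typing import Mapping, Optional

DEFAULT_SERVER = "http://localhost:8080"


def parse_env_file(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip("\"'")
    return values


def resolve_server(flag: Optional[str], env: Mapping[str, str], env_file_text: str = "") -> str:
    # 1. An explicit --server flag wins.
    if flag:
        return flag
    # 2-3. Environment variables, the new name before the legacy one.
    for key in ("GROKTOCRAWL_API_URL", "FIRECRAWL_API_URL"):
        if env.get(key):
            return env[key]
    # 4. Contents of ~/.hermes/.env (assumed to use the GROKTOCRAWL_API_URL key).
    from_file = parse_env_file(env_file_text).get("GROKTOCRAWL_API_URL")
    if from_file:
        return from_file
    # 5. Built-in default.
    return DEFAULT_SERVER
```

A caller would pass `os.environ` as `env` and the text of `~/.hermes/.env` (if the file exists) as `env_file_text`.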
Add to ~/.hermes/.env if your instance runs elsewhere:
GROKTOCRAWL_API_URL=http://localhost:8080
Active development. All core Firecrawl v2 API endpoints are implemented and integration-tested. See issues for upcoming features. Contributions are welcome; see CONTRIBUTING.md.