Live web · JavaScript rendering · Internet Archive · Global discovery · Multi-engine routing
Omnispider is a unified crawl orchestrator that routes every request through policy, discovery, and the best engine for the job — static HTTP, browser rendering, archival snapshots, or fast link discovery.
It synthesizes patterns from 12 crawler ecosystems (Scrapy, Playwright, Puppeteer, Crawlee, Colly, Katana, Splash, MechanicalSoup, Portia, Heritrix3, Nutch, StormCrawler) into one pipeline with a CLI, REST API, and persistent frontier.
| Temporal | Spatial | Surface |
Scope note: Omnispider maximizes reachable public coverage ethically — respecting
robots.txt, rate limits, and domain policy. No crawler can fetch every page ever published; deleted, private, and auth-gated content remains out of scope.
flowchart TB
subgraph INPUT["Input Layer"]
SEEDS["Seed URLs"]
GLOBAL["Global Seeds / ccTLD"]
SITEMAP["Sitemap / robots.txt"]
ARCHIVE_SEED["Wayback CDX Snapshots"]
end
subgraph CORE["Omnispider Core"]
FRONTIER["Frontier Queue<br/><i>SQLite · priority · depth</i>"]
POLICY["Policy Gate<br/><i>robots.txt · rate limit · domain</i>"]
ROUTER["Engine Router<br/><i>auto · http · playwright · archive</i>"]
end
subgraph ENGINES["Engine Layer"]
HTTP["HTTP Engine"]
PW["Playwright"]
ARC["Archive / Wayback"]
KAT["Katana Discovery"]
SPL["Splash Sidecar"]
MECH["MechanicalSoup"]
SCR["Scrapy Batch"]
end
subgraph OUTPUT["Output Layer"]
STORE["SQLite + Content Store"]
LINKS["Link Extractor"]
API["REST API / CLI"]
end
SEEDS --> FRONTIER
GLOBAL --> FRONTIER
SITEMAP --> FRONTIER
ARCHIVE_SEED --> FRONTIER
FRONTIER --> POLICY
POLICY --> ROUTER
ROUTER --> HTTP
ROUTER --> PW
ROUTER --> ARC
ROUTER --> KAT
ROUTER --> SPL
ROUTER --> MECH
ROUTER --> SCR
HTTP --> STORE
PW --> STORE
ARC --> STORE
SPL --> STORE
MECH --> STORE
STORE --> LINKS
LINKS -->|"expand frontier"| FRONTIER
STORE --> API
sequenceDiagram
autonumber
actor User
participant CLI as CLI / API
participant ORCH as Orchestrator
participant DISC as Discovery
participant FR as Frontier
participant POL as Policy
participant ENG as Engine
participant DB as Storage
User->>CLI: crawl / POST /v1/jobs
CLI->>ORCH: submit_job(seeds, depth, limits)
ORCH->>DISC: sitemaps · robots · katana · wayback
DISC->>FR: enqueue seed URLs
loop Until max_pages or frontier empty
FR->>POL: pop next URL
POL->>POL: robots.txt + rate limit + domain
alt blocked
POL-->>FR: skip
else allowed
POL->>ENG: route engine (auto/http/playwright/archive)
ENG->>ENG: fetch + render if needed
ENG->>DB: save page + metadata
DB->>FR: extract links → enqueue children
end
end
ORCH->>CLI: job completed
CLI->>User: pages crawled / failed
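The loop in the sequence diagram can be sketched as a small synchronous function. This is an illustration only: the names `crawl`, `CrawlDeps`, `isAllowed`, and `fetchLinks` are hypothetical stand-ins for the policy gate and engine layer, and the real orchestrator is presumably asynchronous and concurrent.

```typescript
// Hypothetical sketch of the crawl loop: frontier -> policy -> engine -> link expansion.
interface FrontierItem {
  url: string;
  depth: number;
}

interface CrawlDeps {
  isAllowed: (url: string) => boolean;   // stand-in for robots.txt + rate limit + domain checks
  fetchLinks: (url: string) => string[]; // stand-in for engine fetch + link extraction
}

interface CrawlResult {
  crawled: string[];
  skipped: string[];
}

function crawl(seeds: string[], maxDepth: number, maxPages: number, deps: CrawlDeps): CrawlResult {
  const frontier: FrontierItem[] = seeds.map((url) => ({ url, depth: 0 }));
  const seen = new Set<string>(seeds); // prevents re-enqueueing known URLs
  const crawled: string[] = [];
  const skipped: string[] = [];

  // Loop until max_pages is reached or the frontier is empty.
  while (frontier.length > 0 && crawled.length < maxPages) {
    const item = frontier.shift(); // FIFO order gives a breadth-first crawl
    if (!item) break;

    if (!deps.isAllowed(item.url)) {
      skipped.push(item.url); // blocked by policy: skip without fetching
      continue;
    }

    const links = deps.fetchLinks(item.url);
    crawled.push(item.url);

    if (item.depth >= maxDepth) continue; // do not expand beyond the depth limit
    for (const link of links) {
      if (!seen.has(link)) {
        seen.add(link);
        frontier.push({ url: link, depth: item.depth + 1 });
      }
    }
  }
  return { crawled, skipped };
}
```

A plain array stands in for the SQLite-backed priority frontier; swapping in a priority queue would change the visiting order but not the shape of the loop.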
flowchart TD
START(["Incoming URL"]) --> SOURCE{Source type?}
SOURCE -->|Wayback / Archive| ARC["Archive Engine"]
SOURCE -->|Live web| JS{JS rendering forced?}
JS -->|Yes| PW["Playwright Engine"]
JS -->|No| FETCH["HTTP Engine"]
FETCH --> CHECK{Response OK?}
CHECK -->|No| PW
CHECK -->|Yes| HEUR{Page needs JS?}
HEUR -->|Yes| PW
HEUR -->|No| DONE(["Store + extract links"])
PW --> DONE
ARC --> DONE
style START fill:#0f172a,color:#e2e8f0,stroke:#334155
style DONE fill:#14532d,color:#ecfdf5,stroke:#166534
style ARC fill:#1e3a5f,color:#dbeafe,stroke:#2563eb
style PW fill:#3b0764,color:#f3e8ff,stroke:#9333ea
style FETCH fill:#1e293b,color:#e2e8f0,stroke:#475569
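The routing decision above can be expressed as two small functions: one that picks the first engine, and one that decides whether an HTTP result should be escalated to Playwright. This is a sketch under assumptions: the function names and the `looksJsRendered` heuristic (little visible text plus script tags) are hypothetical, not the project's actual logic.

```typescript
// Engine names mirror the engine catalog; everything else here is illustrative.
type Engine = "http" | "playwright" | "archive";

interface RouteInput {
  source: "live" | "archive"; // where the URL came from
  forceJs: boolean;           // e.g. the --js flag or "js_rendering": true
}

interface HttpResult {
  ok: boolean;  // true when the HTTP engine returned a usable response
  html: string;
}

// Hypothetical heuristic: almost no visible text but at least one script tag
// suggests a client-rendered application.
function looksJsRendered(html: string): boolean {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, "")
    .trim();
  const scripts = (html.match(/<script\b/gi) || []).length;
  return scripts > 0 && text.length < 200;
}

// First decision: which engine to try before any fetch happens.
function initialEngine(input: RouteInput): Engine {
  if (input.source === "archive") return "archive";
  return input.forceJs ? "playwright" : "http";
}

// Second decision: after an HTTP fetch, retry with Playwright on failure
// or when the page appears to need JavaScript.
function shouldEscalate(result: HttpResult): boolean {
  return !result.ok || looksJsRendered(result.html);
}
```

Trying plain HTTP first keeps the common case cheap; the browser is only launched when the forced flag, a failed response, or the heuristic calls for it.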
flowchart LR
S["Seed URL"] --> R["robots.txt"]
S --> M["sitemap.xml"]
S --> K["Katana binary"]
S --> W["Wayback CDX"]
R --> F["Unified Frontier"]
M --> F
K --> F
W --> F
F --> C["Concurrent Workers"]
C --> P["Policy Gate"]
P --> E["Engine Fetch"]
E --> X["Link Extraction"]
X --> F
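The robots.txt and sitemap.xml branches of the discovery flow both reduce to extracting URLs from text. The sketch below shows one way to do that; the function names are hypothetical, and the regular expressions are a simplification (a production parser would use a real XML parser and decode entities such as `&amp;`).

```typescript
// Collect the URLs named by "Sitemap:" lines in a robots.txt body.
function sitemapsFromRobots(robotsTxt: string): string[] {
  const urls: string[] = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line); // directive name is case-insensitive
    if (match) urls.push(match[1]);
  }
  return urls;
}

// Collect the contents of every <loc> element in a sitemap document.
function locsFromSitemap(xml: string): string[] {
  const urls: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

// Merge several discovery sources into one de-duplicated seed list.
function mergeSeeds(...sources: string[][]): string[] {
  const unique = new Set<string>();
  for (const source of sources) {
    for (const url of source) unique.add(url);
  }
  return Array.from(unique);
}
```

De-duplicating before enqueueing matters because sitemaps, Katana output, and Wayback CDX listings often overlap heavily for the same site.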
| Engine | Best for | Vendor inspiration | Install |
|---|---|---|---|
| http | Static pages, APIs, feeds | Scrapy · Colly | built-in |
| playwright | SPAs, React, Vue, Next.js | Playwright · Puppeteer | npx playwright install chromium |
| archive | Historical snapshots | Heritrix · Internet Archive | built-in |
| katana | Fast link discovery | Katana · Colly | Install Katana (external) |
| splash | JS render sidecar | Splash | Run Splash on :8050 (external) |
| mechanical | Forms, sessions | MechanicalSoup | external adapter |
| scrapy | Batch spider projects | Scrapy · Portia | external adapter |
| auto | Smart routing (default) | Crawlee patterns | built-in |
Reference vendor trees can be extracted locally — see vendors/README.md.
git clone https://github.com/houseofasher/web-crawlers.git
cd web-crawlers
npm install
npm run build

Link the CLI globally (optional):

npm link
# or run via npm scripts: npm run dev -- crawl ...

# Live web + sitemaps + Wayback snapshots
omnispider crawl https://example.com --depth 3 --max-pages 500
# Skip archive layer
omnispider crawl https://example.com --no-archive
# Force JavaScript rendering
omnispider crawl https://example.com --js

# Find pages about a person/topic (use --seed for direct profile URLs)
omnispider lookup "Asher Shepherd Newton Cape Coral Florida" \
--seed https://github.com/houseofasher \
--seed https://github.com/shep95 \
--json ./data/reports/lookup.json
# List Wayback Machine snapshots for a URL
omnispider archive https://example.com

omnispider serve --port 8080
# → http://127.0.0.1:8080/health

npx playwright install chromium # JS rendering
npm test # run Vitest suite

# Start a crawl job
curl -X POST http://127.0.0.1:8080/v1/jobs \
-H "Content-Type: application/json" \
-d '{
"seeds": ["https://example.com"],
"max_depth": 3,
"max_pages": 100,
"include_archive": true,
"js_rendering": false
}'
# Poll job status
curl http://127.0.0.1:8080/v1/jobs/{job_id}
# List crawled pages
curl "http://127.0.0.1:8080/v1/jobs/{job_id}/pages?limit=50"

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Service health |
| POST | /v1/jobs | Create crawl job |
| GET | /v1/jobs | List jobs |
| GET | /v1/jobs/{id} | Job status |
| GET | /v1/jobs/{id}/pages | Paginated page results |
| GET | /v1/engines | Engine catalog |
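For callers that prefer TypeScript over curl, the requests above can be prepared with a few typed helpers. This is a minimal sketch: the body fields come from the curl example, while the helper names and the client-side empty-seed check are assumptions, and the response shape is deliberately left untyped because it is not documented here.

```typescript
// Request body fields as shown in the POST /v1/jobs curl example.
interface CreateJobBody {
  seeds: string[];
  max_depth?: number;
  max_pages?: number;
  include_archive?: boolean;
  js_rendering?: boolean;
}

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the URL and fetch options for creating a crawl job.
function prepareCreateJob(baseUrl: string, body: CreateJobBody): PreparedRequest {
  // Hypothetical client-side guard; the server may validate differently.
  if (body.seeds.length === 0) throw new Error("at least one seed URL is required");
  return {
    url: new URL("/v1/jobs", baseUrl).toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Build the URL for a page of results from GET /v1/jobs/{id}/pages.
function pagesUrl(baseUrl: string, jobId: string, limit: number): string {
  const url = new URL(`/v1/jobs/${encodeURIComponent(jobId)}/pages`, baseUrl);
  url.searchParams.set("limit", String(limit));
  return url.toString();
}
```

The prepared values can be passed straight to the global `fetch`, for example `fetch(req.url, req.init)`, and the job status can then be polled at `/v1/jobs/{id}`.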
Edit config/default.yaml:
orchestrator:
  max_concurrency: 16
  max_depth: 5
  max_pages_per_job: 10000
policy:
  respect_robots_txt: true
  rate_limit_per_host: 2.0
archive:
  enabled: true
discovery:
  sitemap: true
  global_seeds:
    - "https://www.wikipedia.org/"
storage:
  database_path: "./data/omnispider.db"
  content_dir: "./data/content"

web-crawlers/
├── src/
│ ├── cli.ts # Commander CLI
│ ├── api.ts # Fastify REST server
│ ├── core/
│ │ ├── orchestrator.ts # Main crawl loop
│ │ ├── storage.ts # Jobs + pages persistence (SQLite)
│ │ └── policy.ts # robots.txt + rate limits
│ ├── engines/ # HTTP, Playwright, Archive
│ ├── discovery/ # Sitemaps, link extraction
│ ├── security/ # Nomad Cyber stack
│ └── topic/ # Topic-centric lookup
├── config/default.yaml
├── tests/
└── vendors/ # Optional local reference trees
| Artifact | Location | Contents |
|---|---|---|
| Job store | ./data/omnispider.db | Jobs, frontier, page metadata |
| HTML shards | ./data/content/ | SHA-256 sharded page bodies |
| Logs | stdout (pino) | Structured JSON logs |
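"SHA-256 sharded" storage generally means the file path is derived from the hash of the page body, so identical bodies map to the same file and no single directory grows too large. The sketch below shows one possible layout; the two-level directory scheme, the `.html` suffix, and the `contentPath` name are assumptions rather than the project's confirmed on-disk format.

```typescript
import { createHash } from "node:crypto";
import { join } from "node:path";

// Derive a storage path from the SHA-256 digest of the page body.
// Hypothetical layout: <contentDir>/<hex 0-1>/<hex 2-3>/<full hex digest>.html
function contentPath(contentDir: string, body: string): string {
  const digest = createHash("sha256").update(body, "utf8").digest("hex");
  return join(contentDir, digest.slice(0, 2), digest.slice(2, 4), `${digest}.html`);
}
```

Because the path depends only on the content, re-crawling an unchanged page resolves to the same file, which gives de-duplication without extra bookkeeping.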
Omnispider integrates the Nomad Cyber Algorithm sovereign security stack — adapted from its gateway, audit, replay, and organism patterns.
flowchart TB
subgraph perimeter [API Perimeter]
HDR[OWASP Security Headers]
RL[Rate Limiter]
RBAC[RBAC + API Keys]
RG[Replay Guard]
end
subgraph organism [Sovereign Organism]
AUDIT["Audit Immune<br/>HMAC chain"]
VITAL["Vital Guard<br/>pulse every 30s"]
SSRF["SSRF Lungs<br/>block private IPs"]
end
REQ[API Request] --> HDR --> RL --> RBAC --> RG
RG --> VITAL
VITAL -->|vital| HANDLER[Crawl Handler]
VITAL -->|lockdown| BLOCK[503 ORGANISM_LOCKDOWN]
HANDLER --> SSRF
SSRF --> AUDIT
HANDLER --> CRAWL[Orchestrator]
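The SSRF check in the diagram amounts to rejecting crawl targets that point at the local machine or a private network. A minimal sketch follows, assuming hypothetical function names and a hand-written range list; it inspects only the literal hostname, so a production guard would also need to check the addresses a hostname resolves to and handle IPv4-mapped IPv6 addresses.

```typescript
import { isIP } from "node:net";

// Return true when the hostname is a loopback, private, link-local, or unspecified address.
function isBlockedHost(hostname: string): boolean {
  const host = hostname.toLowerCase().replace(/^\[|\]$/g, ""); // URL wraps IPv6 in brackets
  if (host === "localhost" || host.endsWith(".localhost")) return true;

  const version = isIP(host);
  if (version === 4) {
    const [a, b] = host.split(".").map(Number);
    return (
      a === 0 ||                          // 0.0.0.0/8 (unspecified)
      a === 10 ||                         // 10.0.0.0/8 (private)
      a === 127 ||                        // 127.0.0.0/8 (loopback)
      (a === 169 && b === 254) ||         // 169.254.0.0/16 (link-local, cloud metadata)
      (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12 (private)
      (a === 192 && b === 168)            // 192.168.0.0/16 (private)
    );
  }
  if (version === 6) {
    return (
      host === "::1" ||         // loopback
      host === "::" ||          // unspecified
      /^f[cd]/.test(host) ||    // fc00::/7 (unique local)
      /^fe[89ab]/.test(host)    // fe80::/10 (link-local)
    );
  }
  return false; // ordinary hostname: allowed by this literal-only check
}

// Reject unparsable URLs, non-HTTP(S) schemes, and blocked hosts.
function isBlockedTarget(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return true;
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;
  return isBlockedHost(url.hostname);
}
```

Parsing with `URL` first is deliberate: it normalizes alternative IPv4 spellings before the range check runs.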
| Organ | Protection |
|---|---|
| Gateway Skin | RBAC roles, bearer API keys, body limits |
| Replay Nerves | X-Nonce + X-Timestamp on POST /v1/jobs |
| SSRF Lungs | Blocks localhost, private IPs, metadata endpoints in crawl targets |
| Audit Immune | Tamper-evident HMAC-chained JSONL log |
| Sovereign Organism | All organs vital or total API lockdown |
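A tamper-evident HMAC chain, as named in the Audit Immune row, links each log entry to the MAC of the previous one, so editing or deleting any entry invalidates everything after it. The sketch below illustrates the general technique; the entry fields, the all-zero genesis value, and the function names are assumptions, not the project's actual log format.

```typescript
import { createHmac } from "node:crypto";

interface AuditEntry {
  event: string; // what happened
  prev: string;  // MAC of the previous entry (or the genesis value)
  mac: string;   // HMAC over prev + event
}

// Hypothetical starting value for the first entry in the chain.
const GENESIS = "0".repeat(64);

function macFor(key: string, prev: string, event: string): string {
  return createHmac("sha256", key).update(prev).update("\n").update(event).digest("hex");
}

// Return a new log with the event appended and chained to the last entry.
function appendEntry(log: AuditEntry[], key: string, event: string): AuditEntry[] {
  const prev = log.length > 0 ? log[log.length - 1].mac : GENESIS;
  return [...log, { event, prev, mac: macFor(key, prev, event) }];
}

// Walk the chain from the start; any edit, reorder, or removal breaks it.
function verifyChain(log: AuditEntry[], key: string): boolean {
  let prev = GENESIS;
  for (const entry of log) {
    if (entry.prev !== prev || entry.mac !== macFor(key, prev, entry.event)) return false;
    prev = entry.mac;
  }
  return true;
}
```

Each entry serializes naturally to one JSON line with `JSON.stringify(entry)`. A hardened verifier would compare MACs with `crypto.timingSafeEqual` rather than `!==`.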
# Generate production API key
omnispider security generate-key --role admin
# Check organism vitals
omnispider security vitals
curl http://127.0.0.1:8080/organism/vitals
# Authenticated crawl job (production)
curl -X POST http://127.0.0.1:8080/v1/jobs \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "X-Nonce: $(uuidgen)" \
-H "X-Timestamp: $(date +%s000)" \
-H "Content-Type: application/json" \
-d '{"seeds":["https://example.com"],"max_depth":2,"max_pages":50}'

See SECURITY.md for the production hardening checklist.
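The X-Nonce and X-Timestamp headers in the authenticated request allow the server to reject replayed requests: a timestamp outside a tolerance window is stale, and a nonce seen before within that window is a replay. The sketch below shows that general check; the `ReplayGuard` class, its in-memory map, and the window size are assumptions, and the real guard may differ.

```typescript
// In-memory replay guard keyed by nonce. Timestamps are in milliseconds,
// matching the `date +%s000` value sent in the X-Timestamp header.
class ReplayGuard {
  private readonly seen: Map<string, number> = new Map();
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Returns true when the request is fresh and its nonce has not been used.
  check(nonce: string, timestampMs: number, nowMs: number): boolean {
    // Forget nonces whose timestamps have aged out of the window;
    // a request carrying such a timestamp would be rejected as stale anyway.
    this.seen.forEach((seenAt, seenNonce) => {
      if (nowMs - seenAt > this.windowMs) this.seen.delete(seenNonce);
    });

    if (nonce.length === 0 || !Number.isFinite(timestampMs)) return false;
    if (Math.abs(nowMs - timestampMs) > this.windowMs) return false; // stale or too far ahead
    if (this.seen.has(nonce)) return false;                          // replayed nonce

    this.seen.set(nonce, timestampMs);
    return true;
  }
}
```

Passing `nowMs` in explicitly keeps the guard deterministic under test; a server would supply `Date.now()`. A multi-process deployment would need a shared store instead of a per-process map.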
This project is maintained at:
MIT — see LICENSE.