Omnispider

All-in-one web spider orchestrator for the full digital surface

Live web · JavaScript rendering · Internet Archive · Global discovery · Multi-engine routing

Quick Start · Architecture · Workflows · Engines · API · Config

Overview

Omnispider is a unified crawl orchestrator that routes every request through policy, discovery, and the best engine for the job — static HTTP, browser rendering, archival snapshots, or fast link discovery.

It synthesizes patterns from 12 crawler ecosystems (Scrapy, Playwright, Puppeteer, Crawlee, Colly, Katana, Splash, MechanicalSoup, Portia, Heritrix3, Nutch, StormCrawler) into one pipeline with a CLI, REST API, and persistent frontier.

Temporal

Wayback Machine CDX
Historical snapshots
Past → present

Spatial

Sitemaps · robots.txt
Global seeds · ccTLD
Four corners of the web

Surface

Static HTML · JS apps
Forms · feeds · archives
Every page type

Scope note: Omnispider maximizes reachable public coverage ethically — respecting robots.txt, rate limits, and domain policy. No crawler can fetch every page ever published; deleted, private, and auth-gated content remains out of scope.

Architecture

flowchart TB
    subgraph INPUT["Input Layer"]
        SEEDS["Seed URLs"]
        GLOBAL["Global Seeds / ccTLD"]
        SITEMAP["Sitemap / robots.txt"]
        ARCHIVE_SEED["Wayback CDX Snapshots"]
    end

    subgraph CORE["Omnispider Core"]
        FRONTIER["Frontier Queue<br/><i>SQLite · priority · depth</i>"]
        POLICY["Policy Gate<br/><i>robots.txt · rate limit · domain</i>"]
        ROUTER["Engine Router<br/><i>auto · http · playwright · archive</i>"]
    end

    subgraph ENGINES["Engine Layer"]
        HTTP["HTTP Engine"]
        PW["Playwright"]
        ARC["Archive / Wayback"]
        KAT["Katana Discovery"]
        SPL["Splash Sidecar"]
        MECH["MechanicalSoup"]
        SCR["Scrapy Batch"]
    end

    subgraph OUTPUT["Output Layer"]
        STORE["SQLite + Content Store"]
        LINKS["Link Extractor"]
        API["REST API / CLI"]
    end

    SEEDS --> FRONTIER
    GLOBAL --> FRONTIER
    SITEMAP --> FRONTIER
    ARCHIVE_SEED --> FRONTIER

    FRONTIER --> POLICY
    POLICY --> ROUTER

    ROUTER --> HTTP
    ROUTER --> PW
    ROUTER --> ARC
    ROUTER --> KAT
    ROUTER --> SPL
    ROUTER --> MECH
    ROUTER --> SCR

    HTTP --> STORE
    PW --> STORE
    ARC --> STORE
    SPL --> STORE
    MECH --> STORE

    STORE --> LINKS
    LINKS -->|"expand frontier"| FRONTIER
    STORE --> API

Workflow logic

End-to-end crawl lifecycle

sequenceDiagram
    autonumber
    actor User
    participant CLI as CLI / API
    participant ORCH as Orchestrator
    participant DISC as Discovery
    participant FR as Frontier
    participant POL as Policy
    participant ENG as Engine
    participant DB as Storage

    User->>CLI: crawl / POST /v1/jobs
    CLI->>ORCH: submit_job(seeds, depth, limits)
    ORCH->>DISC: sitemaps · robots · katana · wayback
    DISC->>FR: enqueue seed URLs
    loop Until max_pages or frontier empty
        FR->>POL: pop next URL
        POL->>POL: robots.txt + rate limit + domain
        alt blocked
            POL-->>FR: skip
        else allowed
            POL->>ENG: route engine (auto/http/playwright/archive)
            ENG->>ENG: fetch + render if needed
            ENG->>DB: save page + metadata
            DB->>FR: extract links → enqueue children
        end
    end
    ORCH->>CLI: job completed
    CLI->>User: pages crawled / failed
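
The lifecycle above can be sketched as a small loop. This is a minimal in-memory illustration, not Omnispider's actual orchestrator: `crawlLoop`, `PolicyFn`, and `FetchFn` are hypothetical names, and the fetch step is an injected synchronous stub so the sketch runs without network access (a real engine would be asynchronous).

```typescript
// Hypothetical types for illustration only; not Omnispider's real interfaces.
type FrontierItem = { url: string; depth: number };
type FetchResult = { ok: boolean; links: string[] };
type PolicyFn = (url: string) => boolean;
type FetchFn = (url: string) => FetchResult;

interface CrawlSummary {
  crawled: string[];
  failed: string[];
  skipped: string[];
}

function crawlLoop(
  seeds: string[],
  opts: { maxDepth: number; maxPages: number },
  allowed: PolicyFn,
  fetchPage: FetchFn,
): CrawlSummary {
  const seen = new Set<string>(seeds);
  const frontier: FrontierItem[] = Array.from(seen).map((url) => ({ url, depth: 0 }));
  const summary: CrawlSummary = { crawled: [], failed: [], skipped: [] };

  // Loop until max_pages is reached or the frontier is empty.
  while (frontier.length > 0 && summary.crawled.length < opts.maxPages) {
    const item = frontier.shift()!;

    // Policy gate: blocked URLs are skipped, never fetched.
    if (!allowed(item.url)) {
      summary.skipped.push(item.url);
      continue;
    }

    const result = fetchPage(item.url);
    if (!result.ok) {
      summary.failed.push(item.url);
      continue;
    }
    summary.crawled.push(item.url);

    // Extract links and enqueue children, respecting the depth limit.
    if (item.depth >= opts.maxDepth) continue;
    for (const link of result.links) {
      if (!seen.has(link)) {
        seen.add(link);
        frontier.push({ url: link, depth: item.depth + 1 });
      }
    }
  }
  return summary;
}
```

A plain array used as a FIFO queue stands in for the SQLite-backed priority frontier; the control flow (policy check, fetch, store, expand) is the part the sketch is meant to show.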

Engine auto-routing decision tree

flowchart TD
    START(["Incoming URL"]) --> SOURCE{Source type?}

    SOURCE -->|Wayback / Archive| ARC["Archive Engine"]
    SOURCE -->|Live web| JS{JS rendering forced?}

    JS -->|Yes| PW["Playwright Engine"]
    JS -->|No| FETCH["HTTP Engine"]

    FETCH --> CHECK{Response OK?}
    CHECK -->|No| PW
    CHECK -->|Yes| HEUR{Page needs JS?}

    HEUR -->|Yes| PW
    HEUR -->|No| DONE(["Store + extract links"])

    PW --> DONE
    ARC --> DONE

    style START fill:#0f172a,color:#e2e8f0,stroke:#334155
    style DONE fill:#14532d,color:#ecfdf5,stroke:#166534
    style ARC fill:#1e3a5f,color:#dbeafe,stroke:#2563eb
    style PW fill:#3b0764,color:#f3e8ff,stroke:#9333ea
    style FETCH fill:#1e293b,color:#e2e8f0,stroke:#475569
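
The decision tree above maps onto two small functions. The following is a sketch under stated assumptions: the function names and the `needsJs` heuristic (a page with script tags and almost no visible text) are hypothetical, since the README does not say how Omnispider decides that a page needs JavaScript.

```typescript
type Engine = "archive" | "playwright" | "http";

interface RouteInput {
  source: "live" | "archive";
  forceJs: boolean;
}

interface HttpProbe {
  ok: boolean;
  body: string;
}

// Hypothetical heuristic: script tags present but fewer than 50 characters of visible text.
function needsJs(body: string): boolean {
  const hasScripts = /<script\b/i.test(body);
  const visibleText = body
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, "")
    .trim();
  return hasScripts && visibleText.length < 50;
}

// First branch of the tree: source type, then the forced-JS flag.
function initialEngine(input: RouteInput): Engine {
  if (input.source === "archive") return "archive";
  return input.forceJs ? "playwright" : "http";
}

// Second branch: after an HTTP fetch, either finish or escalate to Playwright.
function afterHttp(probe: HttpProbe): "done" | "playwright" {
  if (!probe.ok) return "playwright";
  return needsJs(probe.body) ? "playwright" : "done";
}
```

Splitting the decision into "before fetch" and "after fetch" keeps the cheap HTTP path as the default and only pays for a browser when the response fails or looks client-rendered.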

Discovery pipeline

flowchart LR
    S["Seed URL"] --> R["robots.txt"]
    S --> M["sitemap.xml"]
    S --> K["Katana binary"]
    S --> W["Wayback CDX"]

    R --> F["Unified Frontier"]
    M --> F
    K --> F
    W --> F

    F --> C["Concurrent Workers"]
    C --> P["Policy Gate"]
    P --> E["Engine Fetch"]
    E --> X["Link Extraction"]
    X --> F
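
As an illustration of how several discovery sources can feed one frontier, the sketch below extracts `<loc>` entries from a sitemap and merges multiple URL lists into a deduplicated queue. The function names are hypothetical, and the regex-based parsing is a simplification chosen to keep the example dependency-free; a real implementation would more likely use an XML parser.

```typescript
// Pull every <loc> value out of a sitemap document (simplified, regex-based).
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    // Sitemaps escape ampersands as &amp; inside URLs.
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}

// Merge URL lists from several sources into one ordered, deduplicated frontier.
function mergeIntoFrontier(sources: string[][]): string[] {
  const seen = new Set<string>();
  const frontier: string[] = [];
  for (const source of sources) {
    for (const raw of source) {
      let normalized: string;
      try {
        normalized = new URL(raw).href; // normalizes case, default path, etc.
      } catch {
        continue; // drop anything that is not an absolute URL
      }
      if (!seen.has(normalized)) {
        seen.add(normalized);
        frontier.push(normalized);
      }
    }
  }
  return frontier;
}
```

Normalizing through `URL` before deduplication means `https://example.com` and `https://example.com/` collapse to one frontier entry.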

Engine matrix

| Engine | Best for | Vendor inspiration | Install |
| --- | --- | --- | --- |
| `http` | Static pages, APIs, feeds | Scrapy · Colly | built-in |
| `playwright` | SPAs, React, Vue, Next.js | Playwright · Puppeteer | `npx playwright install chromium` |
| `archive` | Historical snapshots | Heritrix · Internet Archive | built-in |
| `katana` | Fast link discovery | Katana · Colly | Install Katana (external) |
| `splash` | JS render sidecar | Splash | run Splash on `:8050` (external) |
| `mechanical` | Forms, sessions | MechanicalSoup | external adapter |
| `scrapy` | Batch spider projects | Scrapy · Portia | external adapter |
| `auto` | Smart routing (default) | Crawlee patterns | built-in |

Reference vendor trees can be extracted locally — see vendors/README.md.

Quick start

1 · Install

git clone https://github.com/houseofasher/web-crawlers.git
cd web-crawlers

npm install
npm run build

Link the CLI globally (optional):

npm link
# or run via npm scripts: npm run dev -- crawl ...

2 · Crawl

# Live web + sitemaps + Wayback snapshots
omnispider crawl https://example.com --depth 3 --max-pages 500

# Skip archive layer
omnispider crawl https://example.com --no-archive

# Force JavaScript rendering
omnispider crawl https://example.com --js

3 · Topic lookup & archive

# Find pages about a person/topic (use --seed for direct profile URLs)
omnispider lookup "Asher Shepherd Newton Cape Coral Florida" \
  --seed https://github.com/houseofasher \
  --seed https://github.com/shep95 \
  --json ./data/reports/lookup.json

# List Wayback Machine snapshots for a URL
omnispider archive https://example.com
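
For readers curious about what a snapshot listing involves, here is a sketch of querying the public Wayback Machine CDX endpoint. Whether `omnispider archive` issues exactly this request is an assumption; `buildCdxUrl` and `parseCdxRows` are hypothetical helper names. The endpoint, the `url`, `output`, and `limit` parameters, and the header-row JSON layout belong to the Internet Archive's CDX server.

```typescript
interface Snapshot {
  timestamp: string;
  original: string;
  snapshotUrl: string;
}

// Build a CDX query URL that asks for JSON output.
function buildCdxUrl(target: string, limit = 10): string {
  const u = new URL("https://web.archive.org/cdx/search/cdx");
  u.searchParams.set("url", target);
  u.searchParams.set("output", "json");
  u.searchParams.set("limit", String(limit));
  return u.href;
}

// With output=json the first row is a header naming the columns.
function parseCdxRows(rows: string[][]): Snapshot[] {
  if (rows.length < 2) return [];
  const header = rows[0];
  const tsIndex = header.indexOf("timestamp");
  const origIndex = header.indexOf("original");
  if (tsIndex < 0 || origIndex < 0) return [];
  return rows.slice(1).map((row) => ({
    timestamp: row[tsIndex],
    original: row[origIndex],
    snapshotUrl: `https://web.archive.org/web/${row[tsIndex]}/${row[origIndex]}`,
  }));
}
```

Fetching is left out so the sketch stays offline; passing the built URL to `fetch` and the parsed JSON body to `parseCdxRows` would complete the round trip.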

4 · Serve API

omnispider serve --port 8080
# → http://127.0.0.1:8080/health

Optional power-ups

npx playwright install chromium   # JS rendering
npm test                          # run Vitest suite

REST API

# Start a crawl job
curl -X POST http://127.0.0.1:8080/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "seeds": ["https://example.com"],
    "max_depth": 3,
    "max_pages": 100,
    "include_archive": true,
    "js_rendering": false
  }'

# Poll job status (replace {job_id} with a real job ID)
curl "http://127.0.0.1:8080/v1/jobs/{job_id}"

# List crawled pages
curl "http://127.0.0.1:8080/v1/jobs/{job_id}/pages?limit=50"

| Method | Endpoint | Description |
| --- | --- | --- |
| `GET` | `/health` | Service health |
| `POST` | `/v1/jobs` | Create crawl job |
| `GET` | `/v1/jobs` | List jobs |
| `GET` | `/v1/jobs/{id}` | Job status |
| `GET` | `/v1/jobs/{id}/pages` | Paginated page results |
| `GET` | `/v1/engines` | Engine catalog |
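
The same job submission can be made from TypeScript. The sketch below mirrors the request body from the curl example; the response shape is not documented in this README, so it is typed as `unknown`. The helper names and the check that at least one seed is present are assumptions added for illustration.

```typescript
// Body fields taken from the curl example; only `seeds` is assumed to be required.
interface CreateJobBody {
  seeds: string[];
  max_depth?: number;
  max_pages?: number;
  include_archive?: boolean;
  js_rendering?: boolean;
}

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the request separately so it can be inspected or tested without a server.
function prepareCreateJob(baseUrl: string, body: CreateJobBody): PreparedRequest {
  if (body.seeds.length === 0) {
    throw new Error("at least one seed URL is required");
  }
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/v1/jobs`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Send the request using the global fetch available in Node 18+.
async function createJob(baseUrl: string, body: CreateJobBody): Promise<unknown> {
  const { url, init } = prepareCreateJob(baseUrl, body);
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`POST /v1/jobs failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

In production mode the authentication and replay headers described under Nomad Cyber security would also need to be added to `headers`.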

Configuration

Edit config/default.yaml:

orchestrator:
  max_concurrency: 16
  max_depth: 5
  max_pages_per_job: 10000

policy:
  respect_robots_txt: true
  rate_limit_per_host: 2.0

archive:
  enabled: true

discovery:
  sitemap: true
  global_seeds:
    - "https://www.wikipedia.org/"

storage:
  database_path: "./data/omnispider.db"
  content_dir: "./data/content"
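
One way to picture how these settings might be consumed is a typed config object with per-section overrides. This is only a sketch: the interface mirrors the YAML keys shown above, but `OmnispiderConfig`, `DEFAULTS`, and `withOverrides` are hypothetical names, and YAML loading itself is left out.

```typescript
interface OmnispiderConfig {
  orchestrator: { max_concurrency: number; max_depth: number; max_pages_per_job: number };
  policy: { respect_robots_txt: boolean; rate_limit_per_host: number };
  archive: { enabled: boolean };
  discovery: { sitemap: boolean; global_seeds: string[] };
  storage: { database_path: string; content_dir: string };
}

// Each section may be partially overridden.
type ConfigOverrides = { [K in keyof OmnispiderConfig]?: Partial<OmnispiderConfig[K]> };

// Values copied from the config/default.yaml excerpt above.
const DEFAULTS: OmnispiderConfig = {
  orchestrator: { max_concurrency: 16, max_depth: 5, max_pages_per_job: 10000 },
  policy: { respect_robots_txt: true, rate_limit_per_host: 2.0 },
  archive: { enabled: true },
  discovery: { sitemap: true, global_seeds: ["https://www.wikipedia.org/"] },
  storage: { database_path: "./data/omnispider.db", content_dir: "./data/content" },
};

// Merge overrides section by section without mutating the base object.
function withOverrides(base: OmnispiderConfig, overrides: ConfigOverrides): OmnispiderConfig {
  return {
    orchestrator: { ...base.orchestrator, ...overrides.orchestrator },
    policy: { ...base.policy, ...overrides.policy },
    archive: { ...base.archive, ...overrides.archive },
    discovery: { ...base.discovery, ...overrides.discovery },
    storage: { ...base.storage, ...overrides.storage },
  };
}
```

For example, `withOverrides(DEFAULTS, { orchestrator: { max_depth: 2 } })` keeps `max_concurrency` at 16 while lowering the depth limit.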

Project layout

web-crawlers/
├── src/
│   ├── cli.ts              # Commander CLI
│   ├── api.ts              # Fastify REST server
│   ├── core/
│   │   ├── orchestrator.ts # Main crawl loop
│   │   ├── storage.ts      # Jobs + pages persistence (SQLite)
│   │   └── policy.ts       # robots.txt + rate limits
│   ├── engines/            # HTTP, Playwright, Archive
│   ├── discovery/          # Sitemaps, link extraction
│   ├── security/           # Nomad Cyber stack
│   └── topic/              # Topic-centric lookup
├── config/default.yaml
├── tests/
└── vendors/                # Optional local reference trees

Data output

| Artifact | Location | Contents |
| --- | --- | --- |
| Job store | `./data/omnispider.db` | Jobs, frontier, page metadata |
| HTML shards | `./data/content/` | SHA-256 sharded page bodies |
| Logs | stdout (pino) | Structured JSON logs |
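
To make "SHA-256 sharded" concrete, here is one plausible layout: hash the page body and use the first hex characters of the digest as nested directory names. The two-level `ab/cd/` scheme, the `.html` suffix, and the name `shardPath` are assumptions for illustration; the README does not specify the actual on-disk layout.

```typescript
import { createHash } from "node:crypto";

// Map a page body to a content-addressed path under the content directory.
function shardPath(contentDir: string, body: string): string {
  const digest = createHash("sha256").update(body, "utf8").digest("hex");
  const root = contentDir.replace(/\/+$/, ""); // tolerate a trailing slash
  // Assumed layout: <root>/<first 2 hex chars>/<next 2 hex chars>/<full digest>.html
  return `${root}/${digest.slice(0, 2)}/${digest.slice(2, 4)}/${digest}.html`;
}
```

Content addressing of this kind stores identical bodies once and keeps any single directory from growing too large.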

Nomad Cyber security

Omnispider integrates the Nomad Cyber Algorithm sovereign security stack, adapted from that project's gateway, audit, replay, and organism patterns.

flowchart TB
    subgraph perimeter [API Perimeter]
        HDR[OWASP Security Headers]
        RL[Rate Limiter]
        RBAC[RBAC + API Keys]
        RG[Replay Guard]
    end

    subgraph organism [Sovereign Organism]
        AUDIT["Audit Immune<br/>HMAC chain"]
        VITAL["Vital Guard<br/>pulse every 30s"]
        SSRF["SSRF Lungs<br/>block private IPs"]
    end

    REQ[API Request] --> HDR --> RL --> RBAC --> RG
    RG --> VITAL
    VITAL -->|vital| HANDLER[Crawl Handler]
    VITAL -->|lockdown| BLOCK[503 ORGANISM_LOCKDOWN]
    HANDLER --> SSRF
    SSRF --> AUDIT
    HANDLER --> CRAWL[Orchestrator]
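
The "HMAC chain" idea behind the audit log can be sketched as follows: each entry's MAC covers the previous entry's MAC, so editing or removing an earlier entry invalidates everything after it. The entry fields, the all-zero genesis value, and the function names are hypothetical; Omnispider's actual record format is not shown in this README.

```typescript
import { createHmac } from "node:crypto";

interface AuditEntry {
  event: string;
  prev: string; // MAC of the previous entry, or GENESIS for the first one
  mac: string;
}

const GENESIS = "0".repeat(64);

// MAC over the previous MAC and the event payload.
function entryMac(key: string, prev: string, event: string): string {
  return createHmac("sha256", key).update(prev).update("\n").update(event).digest("hex");
}

// Return a new log with one more chained entry (the input is not mutated).
function appendEntry(log: AuditEntry[], key: string, event: string): AuditEntry[] {
  const prev = log.length > 0 ? log[log.length - 1].mac : GENESIS;
  return [...log, { event, prev, mac: entryMac(key, prev, event) }];
}

// Walk the chain from the start; any edited, reordered, or dropped entry breaks it.
function verifyChain(log: AuditEntry[], key: string): boolean {
  let prev = GENESIS;
  for (const entry of log) {
    if (entry.prev !== prev || entry.mac !== entryMac(key, prev, entry.event)) {
      return false;
    }
    prev = entry.mac;
  }
  return true;
}

// Serialize as JSONL: one JSON object per line.
function toJsonl(log: AuditEntry[]): string {
  return log.map((entry) => JSON.stringify(entry)).join("\n");
}
```

The sketch compares MACs with `!==` for brevity; a hardened verifier would typically use a constant-time comparison such as `crypto.timingSafeEqual`.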

| Organ | Protection |
| --- | --- |
| Gateway Skin | RBAC roles, bearer API keys, body limits |
| Replay Nerves | `X-Nonce` + `X-Timestamp` on `POST /v1/jobs` |
| SSRF Lungs | Blocks localhost, private IPs, metadata endpoints in crawl targets |
| Audit Immune | Tamper-evident HMAC-chained JSONL log |
| Sovereign Organism | All organs vital or total API lockdown |
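
A simplified version of the SSRF check described in the table might look like this. It is a sketch only: `isBlockedTarget` is a hypothetical name, the rule set covers loopback, RFC 1918 ranges, and link-local addresses (which include the common `169.254.169.254` metadata endpoint), and it inspects the URL's literal host without resolving DNS.

```typescript
// Return true when a crawl target should be refused.
function isBlockedTarget(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return true; // unparseable targets are rejected outright
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;

  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return true;
  if (host === "[::1]") return true; // IPv6 loopback

  const ipv4 = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!ipv4) return false; // ordinary hostname; DNS is not resolved here

  const a = Number(ipv4[1]);
  const b = Number(ipv4[2]);
  return (
    a === 0 ||                            // "this network"
    a === 10 ||                           // 10.0.0.0/8
    a === 127 ||                          // loopback
    (a === 172 && b >= 16 && b <= 31) ||  // 172.16.0.0/12
    (a === 192 && b === 168) ||           // 192.168.0.0/16
    (a === 169 && b === 254)              // link-local, incl. cloud metadata
  );
}
```

Because a public hostname can still resolve to a private address, a production check would also validate the resolved IP at connection time.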

# Generate production API key
omnispider security generate-key --role admin

# Check organism vitals
omnispider security vitals
curl http://127.0.0.1:8080/organism/vitals

# Authenticated crawl job (production)
curl -X POST http://127.0.0.1:8080/v1/jobs \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "X-Nonce: $(uuidgen)" \
  -H "X-Timestamp: $(date +%s000)" \
  -H "Content-Type: application/json" \
  -d '{"seeds":["https://example.com"],"max_depth":2,"max_pages":50}'
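
On the server side, the `X-Nonce` and `X-Timestamp` headers sent above could be validated along these lines. This is a sketch under assumptions: the class name, the five-minute tolerance window, and the in-memory nonce store are all hypothetical; the only detail taken from the curl example is that the timestamp is in milliseconds.

```typescript
class ReplayGuard {
  private readonly windowMs: number;
  private readonly seen = new Map<string, number>(); // nonce -> request timestamp

  constructor(windowMs: number = 5 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  // Accept a request only if its timestamp is fresh and its nonce is unused.
  check(nonce: string, timestampMs: number, nowMs: number = Date.now()): boolean {
    // Forget nonces that have aged out of the window.
    this.seen.forEach((ts, key) => {
      if (nowMs - ts > this.windowMs) this.seen.delete(key);
    });

    if (nonce.length === 0 || !Number.isFinite(timestampMs)) return false;
    if (Math.abs(nowMs - timestampMs) > this.windowMs) return false; // stale or far-future
    if (this.seen.has(nonce)) return false; // replayed request

    this.seen.set(nonce, timestampMs);
    return true;
  }
}
```

The timestamp check bounds how long nonces must be remembered: anything older than the window is rejected on freshness alone, so its nonce can be discarded.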

See SECURITY.md for the production hardening checklist.

Repositories

This project is maintained at:

License

MIT — see LICENSE.
