houseofasher/aureon_algorithm

Omnispider

All-in-one web spider orchestrator for the full digital surface

Live web · JavaScript rendering · Internet Archive · Global discovery · Multi-engine routing


Node.js 20+ · TypeScript · Fastify · License: MIT


Quick Start · Architecture · Workflows · Engines · API · Config


Overview

Omnispider is a unified crawl orchestrator. It routes every URL through discovery, a policy gate, and the engine best suited to the page: static HTTP, browser rendering, archival snapshots, or fast link discovery.

It synthesizes patterns from 12 crawler ecosystems (Scrapy, Playwright, Puppeteer, Crawlee, Colly, Katana, Splash, MechanicalSoup, Portia, Heritrix3, Nutch, StormCrawler) into one pipeline with a CLI, REST API, and persistent frontier.

Temporal

Wayback Machine CDX
Historical snapshots
Past → present

Spatial

Sitemaps · robots.txt
Global seeds · ccTLD
Four corners of the web

Surface

Static HTML · JS apps
Forms · feeds · archives
Every page type

Scope note: Omnispider maximizes reachable public coverage ethically — respecting robots.txt, rate limits, and domain policy. No crawler can fetch every page ever published; deleted, private, and auth-gated content remains out of scope.


Architecture

flowchart TB
    subgraph INPUT["Input Layer"]
        SEEDS["Seed URLs"]
        GLOBAL["Global Seeds / ccTLD"]
        SITEMAP["Sitemap / robots.txt"]
        ARCHIVE_SEED["Wayback CDX Snapshots"]
    end

    subgraph CORE["Omnispider Core"]
        FRONTIER["Frontier Queue<br/><i>SQLite · priority · depth</i>"]
        POLICY["Policy Gate<br/><i>robots.txt · rate limit · domain</i>"]
        ROUTER["Engine Router<br/><i>auto · http · playwright · archive</i>"]
    end

    subgraph ENGINES["Engine Layer"]
        HTTP["HTTP Engine"]
        PW["Playwright"]
        ARC["Archive / Wayback"]
        KAT["Katana Discovery"]
        SPL["Splash Sidecar"]
        MECH["MechanicalSoup"]
        SCR["Scrapy Batch"]
    end

    subgraph OUTPUT["Output Layer"]
        STORE["SQLite + Content Store"]
        LINKS["Link Extractor"]
        API["REST API / CLI"]
    end

    SEEDS --> FRONTIER
    GLOBAL --> FRONTIER
    SITEMAP --> FRONTIER
    ARCHIVE_SEED --> FRONTIER

    FRONTIER --> POLICY
    POLICY --> ROUTER

    ROUTER --> HTTP
    ROUTER --> PW
    ROUTER --> ARC
    ROUTER --> KAT
    ROUTER --> SPL
    ROUTER --> MECH
    ROUTER --> SCR

    HTTP --> STORE
    PW --> STORE
    ARC --> STORE
    SPL --> STORE
    MECH --> STORE

    STORE --> LINKS
    LINKS -->|"expand frontier"| FRONTIER
    STORE --> API

Workflow logic

End-to-end crawl lifecycle

sequenceDiagram
    autonumber
    actor User
    participant CLI as CLI / API
    participant ORCH as Orchestrator
    participant DISC as Discovery
    participant FR as Frontier
    participant POL as Policy
    participant ENG as Engine
    participant DB as Storage

    User->>CLI: crawl / POST /v1/jobs
    CLI->>ORCH: submit_job(seeds, depth, limits)
    ORCH->>DISC: sitemaps · robots · katana · wayback
    DISC->>FR: enqueue seed URLs
    loop Until max_pages or frontier empty
        FR->>POL: pop next URL
        POL->>POL: robots.txt + rate limit + domain
        alt blocked
            POL-->>FR: skip
        else allowed
            POL->>ENG: route engine (auto/http/playwright/archive)
            ENG->>ENG: fetch + render if needed
            ENG->>DB: save page + metadata
            DB->>FR: extract links → enqueue children
        end
    end
    ORCH->>CLI: job completed
    CLI->>User: pages crawled / failed
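
The lifecycle above reduces to a bounded loop over a frontier. The following is a minimal sketch of that loop, not Omnispider's actual orchestrator: `crawlLoop`, `FetchFn`, and `AllowFn` are hypothetical names, the frontier is an in-memory FIFO rather than the SQLite-backed queue, and the fetcher is injected so the loop can run without network access.

```typescript
// Hypothetical types for illustration; not taken from the Omnispider source.
interface FrontierItem { url: string; depth: number }
interface FetchedPage { url: string; links: string[] }
type FetchFn = (url: string) => Promise<FetchedPage>;
type AllowFn = (url: string) => boolean;

interface CrawlResult { crawled: string[]; skipped: string[]; failed: string[] }

async function crawlLoop(
  seeds: string[],
  maxDepth: number,
  maxPages: number,
  isAllowed: AllowFn,   // stands in for the policy gate
  fetchPage: FetchFn,   // stands in for the routed engine
): Promise<CrawlResult> {
  const frontier: FrontierItem[] = seeds.map((url) => ({ url, depth: 0 }));
  const seen = new Set<string>(seeds);
  const result: CrawlResult = { crawled: [], skipped: [], failed: [] };

  // Loop until the page budget is spent or the frontier is empty.
  while (frontier.length > 0 && result.crawled.length < maxPages) {
    const item = frontier.shift()!; // FIFO order gives a breadth-first crawl
    if (!isAllowed(item.url)) {
      result.skipped.push(item.url);
      continue;
    }
    try {
      const page = await fetchPage(item.url);
      result.crawled.push(page.url);
      if (item.depth >= maxDepth) continue; // do not expand past the depth limit
      for (const link of page.links) {
        if (!seen.has(link)) {
          seen.add(link);
          frontier.push({ url: link, depth: item.depth + 1 });
        }
      }
    } catch {
      result.failed.push(item.url);
    }
  }
  return result;
}
```

Injecting `isAllowed` and `fetchPage` keeps policy and engine choice out of the loop itself, which mirrors the separation between the Policy and Engine participants in the diagram.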

Engine auto-routing decision tree

flowchart TD
    START(["Incoming URL"]) --> SOURCE{Source type?}

    SOURCE -->|Wayback / Archive| ARC["Archive Engine"]
    SOURCE -->|Live web| JS{JS rendering forced?}

    JS -->|Yes| PW["Playwright Engine"]
    JS -->|No| FETCH["HTTP Engine"]

    FETCH --> CHECK{Response OK?}
    CHECK -->|No| PW
    CHECK -->|Yes| HEUR{Page needs JS?}

    HEUR -->|Yes| PW
    HEUR -->|No| DONE(["Store + extract links"])

    PW --> DONE
    ARC --> DONE

    style START fill:#0f172a,color:#e2e8f0,stroke:#334155
    style DONE fill:#14532d,color:#ecfdf5,stroke:#166534
    style ARC fill:#1e3a5f,color:#dbeafe,stroke:#2563eb
    style PW fill:#3b0764,color:#f3e8ff,stroke:#9333ea
    style FETCH fill:#1e293b,color:#e2e8f0,stroke:#475569
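
The decision tree can be read as two small decisions: which engine to try first, and whether an HTTP result should be retried in a browser. The sketch below illustrates that reading. The names (`initialEngine`, `shouldEscalate`, `looksLikeJsApp`) and the "page needs JS" heuristic with its 200-character threshold are assumptions for illustration; the source does not say how Omnispider detects JS-dependent pages.

```typescript
type Engine = "http" | "playwright" | "archive";
type Source = "live" | "archive";

interface RouteInput {
  source: Source;
  forceJs: boolean; // e.g. the --js flag or "js_rendering": true
}

interface HttpOutcome {
  ok: boolean;  // did the HTTP engine get a usable response?
  html: string; // response body, empty if the fetch failed
}

// First decision: archive URLs go to the archive engine; live URLs go to
// Playwright only when JS rendering is forced, otherwise to plain HTTP.
function initialEngine(input: RouteInput): Engine {
  if (input.source === "archive") return "archive";
  return input.forceJs ? "playwright" : "http";
}

// Assumed heuristic: a page that ships scripts but almost no visible text
// is probably rendered client-side.
function looksLikeJsApp(html: string): boolean {
  const scripts = (html.match(/<script\b/gi) ?? []).length;
  const text = html
    .replace(/<script\b[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, "")
    .trim();
  return scripts > 0 && text.length < 200;
}

// Second decision: escalate to Playwright on a failed response or when
// the fetched page appears to need JavaScript.
function shouldEscalate(outcome: HttpOutcome): boolean {
  return !outcome.ok || looksLikeJsApp(outcome.html);
}
```

Trying HTTP first and escalating only when needed keeps the expensive browser engine off pages that a plain fetch can already handle.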

Discovery pipeline

flowchart LR
    S["Seed URL"] --> R["robots.txt"]
    S --> M["sitemap.xml"]
    S --> K["Katana binary"]
    S --> W["Wayback CDX"]

    R --> F["Unified Frontier"]
    M --> F
    K --> F
    W --> F

    F --> C["Concurrent Workers"]
    C --> P["Policy Gate"]
    P --> E["Engine Fetch"]
    E --> X["Link Extraction"]
    X --> F
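
For the Wayback CDX branch of the pipeline, a discovery step could build a query against the public CDX endpoint and turn the JSON rows into replay URLs for the frontier. This is a sketch under assumptions: `buildCdxQuery` and `parseCdxRows` are hypothetical helpers, not Omnispider functions, and only the `url`, `output`, and `limit` query parameters are used. With `output=json` the CDX server returns an array of rows whose first row is the column header.

```typescript
interface Snapshot {
  timestamp: string;  // 14-digit capture time, e.g. "20020120142510"
  original: string;   // URL as originally captured
  statusCode: string; // HTTP status recorded for the capture
  replayUrl: string;  // Wayback URL that replays this capture
}

// Build a CDX query for a target URL.
function buildCdxQuery(target: string, limit = 50): string {
  return (
    "https://web.archive.org/cdx/search/cdx" +
    `?url=${encodeURIComponent(target)}&output=json&limit=${limit}`
  );
}

// Convert the JSON rows into snapshots. The first row names the columns,
// so fields are looked up by name instead of by fixed position.
function parseCdxRows(rows: string[][]): Snapshot[] {
  if (rows.length < 2) return [];
  const [header, ...data] = rows;
  const ts = header.indexOf("timestamp");
  const orig = header.indexOf("original");
  const status = header.indexOf("statuscode");
  if (ts < 0 || orig < 0 || status < 0) return [];
  return data.map((row) => ({
    timestamp: row[ts],
    original: row[orig],
    statusCode: row[status],
    replayUrl: `https://web.archive.org/web/${row[ts]}/${row[orig]}`,
  }));
}
```

Keeping URL construction and row parsing as pure functions lets the network call sit in one thin wrapper, which makes the discovery step easy to test offline.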

Engine matrix

| Engine | Best for | Vendor inspiration | Install |
| --- | --- | --- | --- |
| http | Static pages, APIs, feeds | Scrapy · Colly | built-in |
| playwright | SPAs, React, Vue, Next.js | Playwright · Puppeteer | npx playwright install chromium |
| archive | Historical snapshots | Heritrix · Internet Archive | built-in |
| katana | Fast link discovery | Katana · Colly | Install Katana (external) |
| splash | JS render sidecar | Splash | run Splash on :8050 (external) |
| mechanical | Forms, sessions | MechanicalSoup | external adapter |
| scrapy | Batch spider projects | Scrapy · Portia | external adapter |
| auto | Smart routing (default) | Crawlee patterns | built-in |

Reference vendor trees can be extracted locally — see vendors/README.md.


Quick start

1 · Install

git clone https://github.com/houseofasher/web-crawlers.git
cd web-crawlers

npm install
npm run build

Link the CLI globally (optional):

npm link
# or run via npm scripts: npm run dev -- crawl ...

2 · Crawl

# Live web + sitemaps + Wayback snapshots
omnispider crawl https://example.com --depth 3 --max-pages 500

# Skip archive layer
omnispider crawl https://example.com --no-archive

# Force JavaScript rendering
omnispider crawl https://example.com --js

3 · Topic lookup & archive

# Find pages about a person/topic (use --seed for direct profile URLs)
omnispider lookup "Asher Shepherd Newton Cape Coral Florida" \
  --seed https://github.com/houseofasher \
  --seed https://github.com/shep95 \
  --json ./data/reports/lookup.json

# List Wayback Machine snapshots for a URL
omnispider archive https://example.com

4 · Serve API

omnispider serve --port 8080
# → http://127.0.0.1:8080/health

Optional power-ups

npx playwright install chromium   # JS rendering
npm test                          # run Vitest suite

REST API

# Start a crawl job
curl -X POST http://127.0.0.1:8080/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "seeds": ["https://example.com"],
    "max_depth": 3,
    "max_pages": 100,
    "include_archive": true,
    "js_rendering": false
  }'

# Poll job status (replace {job_id} with the ID of your job)
curl http://127.0.0.1:8080/v1/jobs/{job_id}

# List crawled pages
curl "http://127.0.0.1:8080/v1/jobs/{job_id}/pages?limit=50"

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Service health |
| POST | /v1/jobs | Create crawl job |
| GET | /v1/jobs | List jobs |
| GET | /v1/jobs/{id} | Job status |
| GET | /v1/jobs/{id}/pages | Paginated page results |
| GET | /v1/engines | Engine catalog |

Configuration

Edit config/default.yaml:

orchestrator:
  max_concurrency: 16
  max_depth: 5
  max_pages_per_job: 10000

policy:
  respect_robots_txt: true
  rate_limit_per_host: 2.0

archive:
  enabled: true

discovery:
  sitemap: true
  global_seeds:
    - "https://www.wikipedia.org/"

storage:
  database_path: "./data/omnispider.db"
  content_dir: "./data/content"
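
To make the `rate_limit_per_host` setting concrete, here is a sketch of a per-host limiter. It assumes the value means requests per second per host, which the config does not state, and `HostRateLimiter` is a hypothetical class rather than Omnispider's policy implementation. The clock is passed in so the behavior is deterministic.

```typescript
// Assumption: rate_limit_per_host is "requests per second, per host".
class HostRateLimiter {
  private perSecond: number;
  private nextAllowed = new Map<string, number>(); // host -> earliest free slot (ms)

  constructor(perSecond: number) {
    if (perSecond <= 0) throw new Error("perSecond must be positive");
    this.perSecond = perSecond;
  }

  // Reserve the next slot for a host and return how many milliseconds
  // the caller should wait before sending the request.
  reserve(host: string, nowMs: number): number {
    const intervalMs = 1000 / this.perSecond;
    const earliest = Math.max(nowMs, this.nextAllowed.get(host) ?? 0);
    this.nextAllowed.set(host, earliest + intervalMs);
    return earliest - nowMs;
  }
}
```

With `rate_limit_per_host: 2.0`, three back-to-back requests to the same host would be spaced 500 ms apart, while a request to a different host is not delayed.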

Project layout

web-crawlers/
├── src/
│   ├── cli.ts              # Commander CLI
│   ├── api.ts              # Fastify REST server
│   ├── core/
│   │   ├── orchestrator.ts # Main crawl loop
│   │   ├── storage.ts      # Jobs + pages persistence (SQLite)
│   │   └── policy.ts       # robots.txt + rate limits
│   ├── engines/            # HTTP, Playwright, Archive
│   ├── discovery/          # Sitemaps, link extraction
│   ├── security/           # Nomad Cyber stack
│   └── topic/              # Topic-centric lookup
├── config/default.yaml
├── tests/
└── vendors/                # Optional local reference trees

Data output

| Artifact | Location | Contents |
| --- | --- | --- |
| Job store | ./data/omnispider.db | Jobs, frontier, page metadata |
| HTML shards | ./data/content/ | SHA-256 sharded page bodies |
| Logs | stdout (pino) | Structured JSON logs |
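
One common way to shard content by SHA-256 is to use the leading hex characters of the digest as directory names. The sketch below shows that scheme; the two-level layout, the `.html` extension, and the `contentPath` helper are assumptions, since the source only says that bodies are SHA-256 sharded.

```typescript
import { createHash } from "node:crypto";
import { posix } from "node:path";

// Map a page body to a content-addressed path such as
// <contentDir>/e3/b0/<full digest>.html (layout assumed for illustration).
function contentPath(contentDir: string, body: string): string {
  const digest = createHash("sha256").update(body, "utf8").digest("hex");
  return posix.join(contentDir, digest.slice(0, 2), digest.slice(2, 4), `${digest}.html`);
}
```

Because the path depends only on the body, identical pages fetched from different URLs land on the same file, which gives deduplication for free.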

Nomad Cyber security

Omnispider integrates a security stack adapted from the Nomad Cyber Algorithm's gateway, audit, replay, and organism patterns.

flowchart TB
    subgraph perimeter [API Perimeter]
        HDR[OWASP Security Headers]
        RL[Rate Limiter]
        RBAC[RBAC + API Keys]
        RG[Replay Guard]
    end

    subgraph organism [Sovereign Organism]
        AUDIT["Audit Immune<br/>HMAC chain"]
        VITAL["Vital Guard<br/>pulse every 30s"]
        SSRF["SSRF Lungs<br/>block private IPs"]
    end

    REQ[API Request] --> HDR --> RL --> RBAC --> RG
    RG --> VITAL
    VITAL -->|vital| HANDLER[Crawl Handler]
    VITAL -->|lockdown| BLOCK[503 ORGANISM_LOCKDOWN]
    HANDLER --> SSRF
    SSRF --> AUDIT
    HANDLER --> CRAWL[Orchestrator]

| Organ | Protection |
| --- | --- |
| Gateway Skin | RBAC roles, bearer API keys, body limits |
| Replay Nerves | X-Nonce + X-Timestamp on POST /v1/jobs |
| SSRF Lungs | Blocks localhost, private IPs, metadata endpoints in crawl targets |
| Audit Immune | Tamper-evident HMAC-chained JSONL log |
| Sovereign Organism | All organs vital or total API lockdown |

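
As an illustration of the SSRF check described in the table, the sketch below rejects crawl targets that point at localhost, private or link-local addresses, or non-HTTP schemes. `isBlockedTarget` is a hypothetical function, not the project's SSRF module, and the range list is a reasonable minimum, not a complete one. The cloud metadata address 169.254.169.254 falls inside the link-local range.

```typescript
// True for loopback, private, link-local, and "this network" IPv4 literals.
function isPrivateIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 ||                          // 0.0.0.0/8
    a === 10 ||                         // 10.0.0.0/8
    a === 127 ||                        // loopback
    (a === 169 && b === 254) ||         // link-local, incl. metadata endpoints
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168)            // 192.168.0.0/16
  );
}

function isBlockedTarget(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return true; // unparseable targets are rejected
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;

  // URL.hostname keeps the brackets around IPv6 literals; strip them.
  const host = url.hostname.replace(/^\[|\]$/g, "").toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return true;
  if (host.includes(":")) {
    // IPv6 literal: loopback, unspecified, unique-local (fc00::/7), link-local (fe80::/10)
    return host === "::1" || host === "::" || /^f[cd]/.test(host) || /^fe[89ab]/.test(host);
  }
  return isPrivateIPv4(host);
}
```

A literal-only check like this does not catch a public hostname that resolves to a private address, so a production guard would also need to validate the resolved IP at connection time.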
# Generate production API key
omnispider security generate-key --role admin

# Check organism vitals
omnispider security vitals
curl http://127.0.0.1:8080/organism/vitals

# Authenticated crawl job (production)
curl -X POST http://127.0.0.1:8080/v1/jobs \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "X-Nonce: $(uuidgen)" \
  -H "X-Timestamp: $(date +%s000)" \
  -H "Content-Type: application/json" \
  -d '{"seeds":["https://example.com"],"max_depth":2,"max_pages":50}'
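
The `X-Nonce` and `X-Timestamp` headers in the request above suggest a replay check on the server side. A minimal sketch of such a check follows. `ReplayGuard`, its five-minute window, and the in-memory nonce store are assumptions, not the project's implementation; the timestamp is taken to be Unix time in milliseconds, matching `date +%s000`.

```typescript
type ReplayVerdict = "ok" | "missing" | "stale" | "replayed";

class ReplayGuard {
  private windowMs: number;
  private seen = new Map<string, number>(); // nonce -> request timestamp (ms)

  constructor(windowMs = 5 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  check(nonce: string | undefined, timestampHeader: string | undefined, nowMs: number): ReplayVerdict {
    if (!nonce || !timestampHeader) return "missing";

    // Reject timestamps that are unparseable or outside the window.
    const ts = Number(timestampHeader);
    if (!Number.isFinite(ts) || Math.abs(nowMs - ts) > this.windowMs) return "stale";

    // Forget nonces whose timestamps have aged out; a replay of those
    // requests would already be rejected as stale.
    this.seen.forEach((t, n) => {
      if (nowMs - t > this.windowMs) this.seen.delete(n);
    });

    if (this.seen.has(nonce)) return "replayed";
    this.seen.set(nonce, ts);
    return "ok";
  }
}
```

The timestamp window bounds how long nonces must be remembered, so the store stays small even under sustained traffic.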

See SECURITY.md for the production hardening checklist.
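
The tamper-evident audit log can be pictured as a chain in which each record's HMAC covers the previous record's HMAC. The sketch below shows one way to build and verify such a chain with Node's `crypto` module. The record shape, the all-zero genesis value, and the function names are assumptions for illustration, not the format Omnispider writes.

```typescript
import { createHmac } from "node:crypto";

interface AuditRecord {
  seq: number;     // position in the log
  event: string;   // what happened
  prevMac: string; // MAC of the previous record, or GENESIS for the first
  mac: string;     // HMAC-SHA256 over seq, prevMac, and event
}

const GENESIS = "0".repeat(64);

function macFor(key: string, seq: number, event: string, prevMac: string): string {
  return createHmac("sha256", key).update(`${seq}\n${prevMac}\n${event}`).digest("hex");
}

function appendRecord(log: AuditRecord[], key: string, event: string): AuditRecord {
  const prevMac = log.length > 0 ? log[log.length - 1].mac : GENESIS;
  const seq = log.length;
  const record: AuditRecord = { seq, event, prevMac, mac: macFor(key, seq, event, prevMac) };
  log.push(record);
  return record;
}

// Recompute every MAC in order; any edit, reorder, or deletion in the
// middle of the log breaks the chain from that point on.
function verifyLog(log: AuditRecord[], key: string): boolean {
  let prevMac = GENESIS;
  for (let i = 0; i < log.length; i++) {
    const r = log[i];
    if (r.seq !== i || r.prevMac !== prevMac) return false;
    if (r.mac !== macFor(key, r.seq, r.event, r.prevMac)) return false;
    prevMac = r.mac;
  }
  return true;
}

// One JSON object per line, as in a JSONL file.
function toJsonl(log: AuditRecord[]): string {
  return log.map((r) => JSON.stringify(r)).join("\n");
}
```

Chaining alone cannot detect truncation of the newest records, so a deployment would typically also anchor the latest MAC somewhere outside the log.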


Repositories

This project is maintained at:


License

MIT — see LICENSE.
