AI-powered knowledge ingestion system. Discovers sources, rates content through multi-lens AI scoring, and serves scored/clustered content via REST API, CLI, MCP, dashboard, webhooks, and public feed.
Agent-first design — DAI agents are primary consumers, humans secondary.
Three layers:
- Scout — Source discovery. RSS/Atom feeds, X/Twitter accounts, YouTube channels, web scrapers. Automated intent-based discovery finds new sources matching your interests.
- Comb — Multi-lens AI rating. Triage assigns relevant lenses, Gemini Flash rates 1-100 per lens, Workers AI embeds, Vectorize clusters by similarity.
- Waggle — Delivery. REST API, CLI, MCP server, Next.js dashboard, push webhooks, public feed.
High-level flow:

```
Sources → Scout (fetch + extract) → Comb (triage + rate + embed + cluster) → Waggle (API/CLI/MCP/dashboard/webhooks)
```

Detailed pipeline:

```
Sources → source-fetcher (cron) → Jina/Crawl4AI extract → dedup → enqueue → triage (Gemini Flash) → process-handler (rate + embed + cluster in one queue message) → delivered via API/CLI/MCP/webhooks
```
Embed/cluster phase is non-blocking — Workers AI quota exhaustion doesn't fail ratings.
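The non-blocking pattern can be sketched as follows. This is an illustrative stand-in, not Cortex's actual handler: `Article`, `rate`, and `embed` are hypothetical names, but the control flow matches the description above (rating failures propagate for retry, embedding failures are swallowed).

```typescript
// Sketch of the non-blocking embed step. All names here (Article, rate,
// embed, processArticle) are illustrative, not Cortex's real API.
interface Article {
  id: string;
  content: string;
}

async function processArticle(
  article: Article,
  rate: (a: Article) => Promise<number>,
  embed: (a: Article) => Promise<number[]>,
): Promise<{ score: number; embedding: number[] | null }> {
  // Rating must succeed; let failures propagate so the queue retries.
  const score = await rate(article);

  // Embedding is best-effort: a Workers AI quota error is logged and
  // swallowed, leaving the article for a later catch-up pass.
  let embedding: number[] | null = null;
  try {
    embedding = await embed(article);
  } catch (err) {
    console.warn(`embed skipped for ${article.id}:`, err);
  }

  return { score, embedding };
}
```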
| Layer | Technology |
|---|---|
| Runtime | Bun + TypeScript |
| API | Hono on Cloudflare Workers |
| Database | Cloudflare D1 (SQLite) |
| Vectors | Cloudflare Vectorize (1024-dim cosine) |
| Queue | Cloudflare Queues (async pipeline) |
| LLM | Google Gemini Flash (primary, free tier) |
| Embeddings | Cloudflare Workers AI (bge-large-en-v1.5, free tier) |
| Content Extraction | Jina Reader API (with 402→free tier fallback) + Crawl4AI (local) |
| Dashboard | Next.js 16 via OpenNext on Workers |
| CLI | Commander.js on Bun |
| Package | Type | Purpose |
|---|---|---|
| @cortex/api | Worker | Hono REST API — 13 route modules, Bearer auth |
| @cortex/analyzer | Worker | Queue consumer, cron dispatcher, rating pipeline |
| @cortex/shared | Library | Types, lens configs, filter parser, LLM pricing |
| @cortex/local-fetch | Local | CLI, MCP server, Playwright fetch service |
| @cortex/dashboard | Worker | Next.js dashboard via OpenNext |
- Bun runtime
- Cloudflare account with Workers, D1, KV, Queues, Vectorize
- API keys: Google AI Studio (Gemini Flash), Jina AI (optional, free tier works)
```shell
# Install dependencies
bun install

# Create Cloudflare resources
wrangler d1 create cortex-db
wrangler kv namespace create cortex-cache
wrangler vectorize create cortex-articles --dimensions=1024 --metric=cosine

# Update wrangler.jsonc files with resource IDs:
#   packages/api/wrangler.jsonc — database_id, KV id
#   packages/analyzer/wrangler.jsonc — database_id, KV id

# Apply migrations
cd packages/api && wrangler d1 migrations apply cortex-db --remote

# Set secrets
wrangler secret put CORTEX_API_TOKEN --name cortex-api
wrangler secret put JINA_API_KEY --name cortex-api
wrangler secret put GEMINI_API_KEY --name cortex-analyzer
wrangler secret put JINA_API_KEY --name cortex-analyzer
wrangler secret put CORTEX_API_TOKEN --name cortex-dashboard

# Deploy (order matters — API first)
cd packages/api && wrangler deploy
cd ../analyzer && wrangler deploy
cd ../dashboard && npm run deploy
```

```shell
bun run dev                              # All packages in dev mode
cd packages/api && bun run dev           # API with local D1
cd packages/dashboard && bun run dev     # Next.js dev server
cd packages/local-fetch && bun run dev   # CLI
```

For bulk re-extraction of articles with empty content:
```shell
python3 -m venv .venv && source .venv/bin/activate
pip install crawl4ai && crawl4ai-setup
python scripts/crawl4ai-extract.py --batch-size 50 --concurrency 5
```

```shell
# Sources
cortex source list
cortex source add https://example.com/feed.xml --name "Example Blog"

# Feed
cortex trending
cortex trending --lens security --period 7d

# Search
cortex search "kubernetes security"
cortex search "lens:security tier:S after:7d"

# Stats
cortex stats
```

```shell
# Articles feed
curl -H "Authorization: Bearer $TOKEN" https://your-api.workers.dev/api/feed

# Semantic search
curl -H "Authorization: Bearer $TOKEN" "https://your-api.workers.dev/api/search?q=kubernetes"

# Structured filter syntax
curl -H "Authorization: Bearer $TOKEN" "https://your-api.workers.dev/api/search?fq=lens:security+tier:S+after:7d"

# Health check
curl -H "Authorization: Bearer $TOKEN" https://your-api.workers.dev/api/health

# Public feed (no auth)
curl https://your-api.workers.dev/public/feed
```
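The same endpoints are easy to call from TypeScript. This sketch only builds the authenticated request; the base URL and token are placeholders, and the query parameters mirror the curl examples above:

```typescript
// Build an authenticated Cortex search request. Base URL and token are
// placeholders; q is free-text semantic search, fq is the structured
// filter syntax (e.g. "lens:security tier:S after:7d").
function searchRequest(
  base: string,
  token: string,
  opts: { q?: string; fq?: string } = {},
): Request {
  const url = new URL("/api/search", base);
  if (opts.q) url.searchParams.set("q", opts.q);
  if (opts.fq) url.searchParams.set("fq", opts.fq);
  return new Request(url.toString(), {
    headers: { Authorization: `Bearer ${token}` },
  });
}

// Usage:
//   const res = await fetch(searchRequest("https://your-api.workers.dev",
//     process.env.CORTEX_API_TOKEN!, { fq: "lens:security tier:S" }));
```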
```shell
# Backfill unrated articles through the pipeline
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -d '{"batch_size": 200}' https://your-api.workers.dev/api/internal/backfill-unrated

# Re-extract content for empty articles (via Jina)
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -d '{"batch_size": 10}' https://your-api.workers.dev/api/internal/re-extract

# List articles needing local extraction
curl -H "Authorization: Bearer $TOKEN" https://your-api.workers.dev/api/internal/empty-articles

# Automated drain (re-extract + backfill)
./scripts/drain-backlog.sh
```

```shell
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://your-endpoint.com/hook","events":["new_s_tier"],"secret":"hmac-secret"}' \
  https://your-api.workers.dev/api/webhooks/subscribe
```

Events: `new_s_tier`, `trending_cluster`, `pipeline_stall`, `new_source_discovered`, `score_drift`, `inference_escalated`, `budget_alert`
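Since each subscription carries an HMAC secret, the receiving endpoint should verify payload signatures. The sketch below assumes a hex-encoded SHA-256 HMAC over the raw body; the actual digest scheme and signature header used by Cortex should be checked against your deployment:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify a webhook body against its HMAC signature. The hex SHA-256
// scheme here is an assumption for illustration, not confirmed Cortex
// behavior — adjust to match the actual signature format.
function verifySignature(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // Length check first: timingSafeEqual throws on unequal lengths.
  return a.length === b.length && timingSafeEqual(a, b);
}
```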
Each article is triaged and rated through relevant lenses on a 1-100 scale:
| Lens | ID | Focus |
|---|---|---|
| AI Agents | ai-agents | Autonomous agents, tool use, orchestration |
| Security | security | Vulnerabilities, threat intelligence, defense |
| Engineering | engineering | Systems design, performance, developer tooling |
| Business | business | Strategy, market analysis, industry trends |
| Geopolitics | geopolitics | International relations, policy, conflict |
| Privacy | privacy | Data protection, surveillance, compliance |
Tiers: S (90-100), A (75-89), B (50-74), C (25-49), D (1-24)
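The tier cutoffs above map directly onto score ranges; a minimal sketch (the function name is illustrative):

```typescript
type Tier = "S" | "A" | "B" | "C" | "D";

// Map a 1-100 lens score onto the S/A/B/C/D tiers listed above.
function tierFor(score: number): Tier {
  if (score >= 90) return "S";
  if (score >= 75) return "A";
  if (score >= 50) return "B";
  if (score >= 25) return "C";
  return "D";
}
```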
- Auto-escalation — Gemini Flash is primary; Haiku is available as a fallback, with automatic failover on 429 and probe-based recovery.
- Non-blocking embeddings — Workers AI quota exhaustion doesn't fail ratings; embeddings catch up when the quota resets.
- Content extraction fallback — a Jina 402 (expired key) automatically falls back to the free tier; local Crawl4AI handles bulk re-extraction.
- Calibration monitoring — a golden set tracks score drift; a cron fires the `score_drift` webhook when any lens drifts >10%.
- Cost tracking — every LLM call logs tokens and cost, with budget alerts at the 80% and 100% thresholds.
- Structured filters — `lens:security tier:S after:7d score:>80` works in both CLI and API.
- Pipeline observability — `GET /api/health` surfaces degraded sources, stale jobs, extraction errors, and pipeline throughput.
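The structured filter strings split cleanly into key:value tokens. This is a simplified sketch of that tokenization, not the actual parser in @cortex/shared (which presumably also validates keys and handles quoting):

```typescript
// Simplified tokenizer for filter strings such as
// "lens:security tier:S after:7d score:>80".
// The real parser in @cortex/shared likely does more (validation, quoting).
function parseFilter(input: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const token of input.trim().split(/\s+/)) {
    const i = token.indexOf(":");
    // Keep everything after the first colon so values like ">80" survive.
    if (i > 0) out[token.slice(0, i)] = token.slice(i + 1);
  }
  return out;
}
```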
Author: mj-deving
