
Cortex


AI-powered knowledge ingestion system. Discovers sources, rates content through multi-lens AI scoring, and serves scored/clustered content via REST API, CLI, MCP, dashboard, webhooks, and public feed.

Agent-first design — DAI agents are primary consumers, humans secondary.


Architecture

Three layers:

  • Scout — Source discovery. RSS/Atom feeds, X/Twitter accounts, YouTube channels, web scrapers. Automated intent-based discovery finds new sources matching your interests.
  • Comb — Multi-lens AI rating. Triage assigns relevant lenses, Gemini Flash rates 1-100 per lens, Workers AI embeds, Vectorize clusters by similarity.
  • Waggle — Delivery. REST API, CLI, MCP server, Next.js dashboard, push webhooks, public feed.
Sources --> Scout (fetch + extract) --> Comb (triage + rate + embed + cluster) --> Waggle (API/CLI/MCP/dashboard/webhooks)

Data Flow

Sources → source-fetcher (cron) → Jina/Crawl4AI extract → dedup → enqueue → triage (Gemini Flash) → process-handler (rate + embed + cluster in 1 queue msg) → delivered via API/CLI/MCP/webhooks
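The dedup step in the flow above can be sketched as a hash over the canonicalized article URL, so the same piece fetched from two sources is enqueued only once. This is a sketch under assumptions — the real pipeline may key on extracted content as well:

```typescript
// Hypothetical dedup sketch: hash the canonical URL before enqueueing.
import { createHash } from "node:crypto";

function dedupKey(url: string): string {
  // Drop trivial variations (fragment, trailing slash) before hashing.
  const u = new URL(url);
  u.hash = "";
  const canonical = u.toString().replace(/\/$/, "");
  return createHash("sha256").update(canonical).digest("hex");
}

const seen = new Set<string>();

function shouldEnqueue(url: string): boolean {
  const key = dedupKey(url);
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}
```

In the deployed pipeline the seen-set would live in D1 or KV rather than process memory.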

Embed/cluster phase is non-blocking — Workers AI quota exhaustion doesn't fail ratings.
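The non-blocking behavior amounts to keeping the rating on the critical path and treating embed/cluster as best-effort. A minimal sketch (not the actual implementation; the names here are illustrative):

```typescript
// Sketch: a quota error in the embed step never fails the rating before it.
type Article = { id: string; rating?: number; embedded: boolean };

async function processArticle(
  article: Article,
  rate: (a: Article) => Promise<number>,
  embed: (a: Article) => Promise<void>,
): Promise<Article> {
  article.rating = await rate(article); // critical path: must succeed
  try {
    await embed(article); // best-effort: Workers AI may be out of quota
    article.embedded = true;
  } catch {
    // Leave embedded=false so a later cron pass can catch up when quota resets.
  }
  return article;
}
```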

Tech Stack

| Layer | Technology |
| --- | --- |
| Runtime | Bun + TypeScript |
| API | Hono on Cloudflare Workers |
| Database | Cloudflare D1 (SQLite) |
| Vectors | Cloudflare Vectorize (1024-dim cosine) |
| Queue | Cloudflare Queues (async pipeline) |
| LLM | Google Gemini Flash (primary, free tier) |
| Embeddings | Cloudflare Workers AI (bge-large-en-v1.5, free tier) |
| Content extraction | Jina Reader API (with 402 → free tier fallback) + Crawl4AI (local) |
| Dashboard | Next.js 16 via OpenNext on Workers |
| CLI | Commander.js on Bun |

Packages

| Package | Type | Purpose |
| --- | --- | --- |
| @cortex/api | Worker | Hono REST API — 13 route modules, Bearer auth |
| @cortex/analyzer | Worker | Queue consumer, cron dispatcher, rating pipeline |
| @cortex/shared | Library | Types, lens configs, filter parser, LLM pricing |
| @cortex/local-fetch | Local | CLI, MCP server, Playwright fetch service |
| @cortex/dashboard | Worker | Next.js dashboard via OpenNext |

Quick Start

Prerequisites

Setup

# Install dependencies
bun install

# Create Cloudflare resources
wrangler d1 create cortex-db
wrangler kv namespace create cortex-cache
wrangler vectorize create cortex-articles --dimensions=1024 --metric=cosine

# Update wrangler.jsonc files with resource IDs
# packages/api/wrangler.jsonc — database_id, KV id
# packages/analyzer/wrangler.jsonc — database_id, KV id
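The IDs printed by the create commands above go into each package's wrangler.jsonc. A minimal sketch of the relevant bindings — the binding names (`DB`, `CACHE`, `VECTORIZE`) are assumptions; match whatever the code actually expects:

```jsonc
{
  "d1_databases": [
    { "binding": "DB", "database_name": "cortex-db", "database_id": "<id from wrangler d1 create>" }
  ],
  "kv_namespaces": [
    { "binding": "CACHE", "id": "<id from wrangler kv namespace create>" }
  ],
  "vectorize": [
    { "binding": "VECTORIZE", "index_name": "cortex-articles" }
  ]
}
```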

# Apply migrations
cd packages/api && wrangler d1 migrations apply cortex-db --remote

# Set secrets
wrangler secret put CORTEX_API_TOKEN --name cortex-api
wrangler secret put JINA_API_KEY --name cortex-api

wrangler secret put GEMINI_API_KEY --name cortex-analyzer
wrangler secret put JINA_API_KEY --name cortex-analyzer

wrangler secret put CORTEX_API_TOKEN --name cortex-dashboard

# Deploy (order matters — API first)
cd packages/api && wrangler deploy
cd ../analyzer && wrangler deploy
cd ../dashboard && npm run deploy

Local Development

bun run dev                            # All packages in dev mode
cd packages/api && bun run dev         # API with local D1
cd packages/dashboard && bun run dev   # Next.js dev server
cd packages/local-fetch && bun run dev # CLI

Local Content Extraction (Crawl4AI)

For bulk re-extraction of articles with empty content:

python3 -m venv .venv && source .venv/bin/activate
pip install crawl4ai && crawl4ai-setup
python scripts/crawl4ai-extract.py --batch-size 50 --concurrency 5

Usage

CLI

# Sources
cortex source list
cortex source add https://example.com/feed.xml --name "Example Blog"

# Feed
cortex trending
cortex trending --lens security --period 7d

# Search
cortex search "kubernetes security"
cortex search "lens:security tier:S after:7d"

# Stats
cortex stats

API

# Articles feed
curl -H "Authorization: Bearer $TOKEN" https://your-api.workers.dev/api/feed

# Semantic search
curl -H "Authorization: Bearer $TOKEN" "https://your-api.workers.dev/api/search?q=kubernetes"

# Structured filter syntax
curl -H "Authorization: Bearer $TOKEN" "https://your-api.workers.dev/api/search?fq=lens:security+tier:S+after:7d"

# Health check
curl -H "Authorization: Bearer $TOKEN" https://your-api.workers.dev/api/health

# Public feed (no auth)
curl https://your-api.workers.dev/public/feed

Internal Endpoints

# Backfill unrated articles through the pipeline
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -d '{"batch_size": 200}' https://your-api.workers.dev/api/internal/backfill-unrated

# Re-extract content for empty articles (via Jina)
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -d '{"batch_size": 10}' https://your-api.workers.dev/api/internal/re-extract

# List articles needing local extraction
curl -H "Authorization: Bearer $TOKEN" https://your-api.workers.dev/api/internal/empty-articles

# Automated drain (re-extract + backfill)
./scripts/drain-backlog.sh

Webhooks

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://your-endpoint.com/hook","events":["new_s_tier"],"secret":"hmac-secret"}' \
  https://your-api.workers.dev/api/webhooks/subscribe

Events: new_s_tier, trending_cluster, pipeline_stall, new_source_discovered, score_drift, inference_escalated, budget_alert
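On the receiving end, the `secret` from the subscription can be used to verify payload authenticity. A sketch assuming the common scheme of a hex-encoded HMAC-SHA256 over the raw request body — the actual header name and signing scheme are assumptions, so check the webhook payload before relying on this:

```typescript
// Receiver-side verification sketch (scheme assumed: HMAC-SHA256, hex digest).
import { createHmac, timingSafeEqual } from "node:crypto";

function verifySignature(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // Length check first: timingSafeEqual throws on unequal lengths.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Verify against the raw body bytes, before any JSON parsing, or re-serialization differences will break the comparison.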

Rating Lenses

Each article is triaged and rated through relevant lenses on a 1-100 scale:

| Lens | ID | Focus |
| --- | --- | --- |
| AI Agents | ai-agents | Autonomous agents, tool use, orchestration |
| Security | security | Vulnerabilities, threat intelligence, defense |
| Engineering | engineering | Systems design, performance, developer tooling |
| Business | business | Strategy, market analysis, industry trends |
| Geopolitics | geopolitics | International relations, policy, conflict |
| Privacy | privacy | Data protection, surveillance, compliance |

Tiers: S (90-100) A (75-89) B (50-74) C (25-49) D (1-24)
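The tier cutoffs above map to a simple threshold function (a sketch; the real cutoffs would live alongside the lens configs in @cortex/shared):

```typescript
// Map a 1-100 lens score to its tier, per the cutoffs listed above.
function tierFor(score: number): "S" | "A" | "B" | "C" | "D" {
  if (score >= 90) return "S";
  if (score >= 75) return "A";
  if (score >= 50) return "B";
  if (score >= 25) return "C";
  return "D";
}
```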

Scale Features

  • Auto-escalation — Gemini Flash is primary. Haiku available as fallback. Auto-fallback on 429 with probe-based recovery.
  • Non-blocking embeddings — Workers AI quota exhaustion doesn't fail ratings. Embeddings catch up when quota resets.
  • Content extraction fallback — Jina 402 (expired key) auto-falls back to free tier. Local Crawl4AI for bulk re-extraction.
  • Calibration monitoring — Golden set tracks score drift. Cron fires score_drift webhook when any lens drifts >10%.
  • Cost tracking — Every LLM call logs tokens and cost. Budget alerts at 80% and 100% thresholds.
  • Structured filters — lens:security tier:S after:7d score:>80 in CLI and API.
  • Pipeline observability — GET /api/health surfaces degraded sources, stale jobs, extraction errors, pipeline throughput.
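The structured-filter grammar shown in the CLI and API examples can be parsed with a small tokenizer. A minimal sketch — the real parser lives in @cortex/shared, and the field names here only mirror the examples:

```typescript
// Parse space-separated field:value tokens, with optional comparison operators
// on the value (e.g. "score:>80").
type Filter = { field: string; op: string; value: string };

function parseFilters(query: string): Filter[] {
  const out: Filter[] = [];
  for (const token of query.trim().split(/\s+/)) {
    const i = token.indexOf(":");
    if (i < 0) continue; // bare terms would fall through to full-text search
    const field = token.slice(0, i);
    const raw = token.slice(i + 1);
    const m = raw.match(/^(>=|<=|>|<)?(.*)$/);
    out.push({ field, op: m?.[1] ?? "=", value: m?.[2] ?? "" });
  }
  return out;
}
```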

Author: mj-deving
