
Cortex


AI-powered knowledge ingestion system. Discovers sources, rates content through multi-lens AI scoring, and serves scored/clustered content via REST API, CLI, MCP, dashboard, webhooks, and public feed.

Agent-first design — DAI agents are primary consumers, humans secondary.


Architecture

Three layers:

  • Scout — Source discovery. RSS/Atom feeds, X/Twitter accounts, YouTube channels, web scrapers. Automated intent-based discovery finds new sources matching your interests.
  • Comb — Multi-lens AI rating. Triage assigns relevant lenses, Gemini Flash rates 1-100 per lens, Workers AI embeds, Vectorize clusters by similarity.
  • Waggle — Delivery. REST API, CLI, MCP server, Next.js dashboard, push webhooks, public feed.
Sources --> Scout (fetch + extract) --> Comb (triage + rate + embed + cluster) --> Waggle (API/CLI/MCP/dashboard/webhooks)

Data Flow

Sources → source-fetcher (cron) → Jina/Crawl4AI extract → dedup → enqueue → triage (Gemini Flash) → process-handler (rate + embed + cluster in 1 queue msg) → delivered via API/CLI/MCP/webhooks
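The dedup step in the flow above can be sketched as a hash over the canonicalized article URL, so the same piece fetched from two sources is enqueued only once. This is a sketch under assumptions — the real pipeline may key on extracted content as well:

```typescript
// Hypothetical dedup sketch: hash the canonical URL before enqueueing.
import { createHash } from "node:crypto";

function dedupKey(url: string): string {
  // Drop trivial variations (fragment, trailing slash) before hashing.
  const u = new URL(url);
  u.hash = "";
  const canonical = u.toString().replace(/\/$/, "");
  return createHash("sha256").update(canonical).digest("hex");
}

const seen = new Set<string>();

function shouldEnqueue(url: string): boolean {
  const key = dedupKey(url);
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}
```

In the deployed pipeline the seen-set would live in D1 or KV rather than process memory.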

Embed/cluster phase is non-blocking — Workers AI quota exhaustion doesn't fail ratings.
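The non-blocking behavior amounts to keeping the rating on the critical path and treating embed/cluster as best-effort. A minimal sketch (not the actual implementation; the names here are illustrative):

```typescript
// Sketch: a quota error in the embed step never fails the rating before it.
type Article = { id: string; rating?: number; embedded: boolean };

async function processArticle(
  article: Article,
  rate: (a: Article) => Promise<number>,
  embed: (a: Article) => Promise<void>,
): Promise<Article> {
  article.rating = await rate(article); // critical path: must succeed
  try {
    await embed(article); // best-effort: Workers AI may be out of quota
    article.embedded = true;
  } catch {
    // Leave embedded=false so a later cron pass can catch up when quota resets.
  }
  return article;
}
```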

Tech Stack

| Layer | Technology |
| --- | --- |
| Runtime | Bun + TypeScript |
| API | Hono on Cloudflare Workers |
| Database | Cloudflare D1 (SQLite) |
| Vectors | Cloudflare Vectorize (1024-dim cosine) |
| Queue | Cloudflare Queues (async pipeline) |
| LLM | Google Gemini Flash (primary, free tier) |
| Embeddings | Cloudflare Workers AI (bge-large-en-v1.5, free tier) |
| Content extraction | Jina Reader API (with 402 → free tier fallback) + Crawl4AI (local) |
| Dashboard | Next.js 16 via OpenNext on Workers |
| CLI | Commander.js on Bun |

Packages

| Package | Type | Purpose |
| --- | --- | --- |
| @cortex/api | Worker | Hono REST API — 13 route modules, Bearer auth |
| @cortex/analyzer | Worker | Queue consumer, cron dispatcher, rating pipeline |
| @cortex/shared | Library | Types, lens configs, filter parser, LLM pricing |
| @cortex/local-fetch | Local | CLI, MCP server, Playwright fetch service |
| @cortex/dashboard | Worker | Next.js dashboard via OpenNext |

Quick Start

Prerequisites

Setup

# Install dependencies
bun install

# Create Cloudflare resources
wrangler d1 create cortex-db
wrangler kv namespace create cortex-cache
wrangler vectorize create cortex-articles --dimensions=1024 --metric=cosine

# Update wrangler.jsonc files with resource IDs
# packages/api/wrangler.jsonc — database_id, KV id
# packages/analyzer/wrangler.jsonc — database_id, KV id
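The IDs printed by the create commands above go into each package's wrangler.jsonc. A minimal sketch of the relevant bindings — the binding names (`DB`, `CACHE`, `VECTORIZE`) are assumptions; match whatever the code actually expects:

```jsonc
{
  "d1_databases": [
    { "binding": "DB", "database_name": "cortex-db", "database_id": "<id from wrangler d1 create>" }
  ],
  "kv_namespaces": [
    { "binding": "CACHE", "id": "<id from wrangler kv namespace create>" }
  ],
  "vectorize": [
    { "binding": "VECTORIZE", "index_name": "cortex-articles" }
  ]
}
```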

# Apply migrations
cd packages/api && wrangler d1 migrations apply cortex-db --remote

# Set secrets
wrangler secret put CORTEX_API_TOKEN --name cortex-api
wrangler secret put JINA_API_KEY --name cortex-api

wrangler secret put GEMINI_API_KEY --name cortex-analyzer
wrangler secret put JINA_API_KEY --name cortex-analyzer

wrangler secret put CORTEX_API_TOKEN --name cortex-dashboard

# Deploy (order matters — API first)
cd packages/api && wrangler deploy
cd ../analyzer && wrangler deploy
cd ../dashboard && npm run deploy

Local Development

bun run dev                            # All packages in dev mode
cd packages/api && bun run dev         # API with local D1
cd packages/dashboard && bun run dev   # Next.js dev server
cd packages/local-fetch && bun run dev # CLI

Local Content Extraction (Crawl4AI)

For bulk re-extraction of articles with empty content:

python3 -m venv .venv && source .venv/bin/activate
pip install crawl4ai && crawl4ai-setup
python scripts/crawl4ai-extract.py --batch-size 50 --concurrency 5

Usage

CLI

# Sources
cortex source list
cortex source add https://example.com/feed.xml --name "Example Blog"

# Feed
cortex trending
cortex trending --lens security --period 7d

# Search
cortex search "kubernetes security"
cortex search "lens:security tier:S after:7d"

# Stats
cortex stats

API

# Articles feed
curl -H "Authorization: Bearer $TOKEN" https://your-api.workers.dev/api/feed

# Semantic search
curl -H "Authorization: Bearer $TOKEN" "https://your-api.workers.dev/api/search?q=kubernetes"

# Structured filter syntax
curl -H "Authorization: Bearer $TOKEN" "https://your-api.workers.dev/api/search?fq=lens:security+tier:S+after:7d"

# Health check
curl -H "Authorization: Bearer $TOKEN" https://your-api.workers.dev/api/health

# Public feed (no auth)
curl https://your-api.workers.dev/public/feed

Internal Endpoints

# Backfill unrated articles through the pipeline
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -d '{"batch_size": 200}' https://your-api.workers.dev/api/internal/backfill-unrated

# Re-extract content for empty articles (via Jina)
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -d '{"batch_size": 10}' https://your-api.workers.dev/api/internal/re-extract

# List articles needing local extraction
curl -H "Authorization: Bearer $TOKEN" https://your-api.workers.dev/api/internal/empty-articles

# Automated drain (re-extract + backfill)
./scripts/drain-backlog.sh

Webhooks

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://your-endpoint.com/hook","events":["new_s_tier"],"secret":"hmac-secret"}' \
  https://your-api.workers.dev/api/webhooks/subscribe

Events: new_s_tier, trending_cluster, pipeline_stall, new_source_discovered, score_drift, inference_escalated, budget_alert
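On the receiving end, the `secret` from the subscription can be used to verify payload authenticity. A sketch assuming the common scheme of a hex-encoded HMAC-SHA256 over the raw request body — the actual header name and signing scheme are assumptions, so check the webhook payload before relying on this:

```typescript
// Receiver-side verification sketch (scheme assumed: HMAC-SHA256, hex digest).
import { createHmac, timingSafeEqual } from "node:crypto";

function verifySignature(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // Length check first: timingSafeEqual throws on unequal lengths.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Verify against the raw body bytes, before any JSON parsing, or re-serialization differences will break the comparison.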

Rating Lenses

Each article is triaged and rated through relevant lenses on a 1-100 scale:

| Lens | ID | Focus |
| --- | --- | --- |
| AI Agents | ai-agents | Autonomous agents, tool use, orchestration |
| Security | security | Vulnerabilities, threat intelligence, defense |
| Engineering | engineering | Systems design, performance, developer tooling |
| Business | business | Strategy, market analysis, industry trends |
| Geopolitics | geopolitics | International relations, policy, conflict |
| Privacy | privacy | Data protection, surveillance, compliance |

Tiers: S (90-100) A (75-89) B (50-74) C (25-49) D (1-24)
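The tier cutoffs above map to a simple threshold function (a sketch; the real cutoffs would live alongside the lens configs in @cortex/shared):

```typescript
// Map a 1-100 lens score to its tier, per the cutoffs listed above.
function tierFor(score: number): "S" | "A" | "B" | "C" | "D" {
  if (score >= 90) return "S";
  if (score >= 75) return "A";
  if (score >= 50) return "B";
  if (score >= 25) return "C";
  return "D";
}
```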

Scale Features

  • Auto-escalation — Gemini Flash is primary. Haiku available as fallback. Auto-fallback on 429 with probe-based recovery.
  • Non-blocking embeddings — Workers AI quota exhaustion doesn't fail ratings. Embeddings catch up when quota resets.
  • Content extraction fallback — Jina 402 (expired key) auto-falls back to free tier. Local Crawl4AI for bulk re-extraction.
  • Calibration monitoring — Golden set tracks score drift. Cron fires score_drift webhook when any lens drifts >10%.
  • Cost tracking — Every LLM call logs tokens and cost. Budget alerts at 80% and 100% thresholds.
  • Structured filters — lens:security tier:S after:7d score:>80 in CLI and API.
  • Pipeline observability — GET /api/health surfaces degraded sources, stale jobs, extraction errors, pipeline throughput.
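The structured-filter grammar shown in the CLI and API examples can be parsed with a small tokenizer. A minimal sketch — the real parser lives in @cortex/shared, and the field names here only mirror the examples:

```typescript
// Parse space-separated field:value tokens, with optional comparison operators
// on the value (e.g. "score:>80").
type Filter = { field: string; op: string; value: string };

function parseFilters(query: string): Filter[] {
  const out: Filter[] = [];
  for (const token of query.trim().split(/\s+/)) {
    const i = token.indexOf(":");
    if (i < 0) continue; // bare terms would fall through to full-text search
    const field = token.slice(0, i);
    const raw = token.slice(i + 1);
    const m = raw.match(/^(>=|<=|>|<)?(.*)$/);
    out.push({ field, op: m?.[1] ?? "=", value: m?.[2] ?? "" });
  }
  return out;
}
```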

Author: mj-deving
