Live demo: 2chain.dev
A self-hostable tool registry with hybrid retrieval, reliability gating, and JSON Schema contract enforcement for AI agents.
Two ways to run:
- **Personal tier (v2)**: SQLite + sqlite-vec + FTS5 + Ollama (`nomic-embed-text`). Zero cloud dependencies. `npm run setup:personal && npm run seed:v2 && npm run dev:v2`. 341 tools (199 demo fixtures + 142 real-corpus catalog) embedded in ~2.6s on local Ollama.
- **Hackathon demo (v1)**: MongoDB Atlas Vector Search + `$rankFusion` + Voyage AI. The original submission for the MongoDB Agentic Evolution Hackathon, May 2026.
Both run the same agent-facing surface: `/discover` (hybrid retrieval), `/push` (eval + register), `/call` (contract-enforced invocation), an MCP server, and a live SSE dashboard.
60-second demo video: youtu.be/puINYgtQXdM
3-min stage script: demo/SCRIPT.md · demo prompts: demo/prompts.md
v1 → v2 retrieval baseline: docs/perf/phase-1-baseline.md
```bash
# 1. Install Ollama and pull the embedder
curl -fsSL https://ollama.com/install.sh | sh   # or download from ollama.com
ollama pull nomic-embed-text

# 2. Install + preflight + seed + run
npm install
npm run setup:personal   # 5 hard checks: Ollama reachable, model present,
                         # sqlite-vec loadable, ~/.2chain writable, warm probe
npm run seed:v2          # 341 tools (199 demo + 142 real catalog), ~2.6s
npm run dev:v2           # http://localhost:3030
```
```bash
# 3. Verify the demo arc routes correctly
npm run smoke:v2:demos   # 3/3 strict pass: DCF, arxiv, security
```

To grow the catalog: edit `src/fixtures/real-corpus.ts` (12 domains pre-seeded with named, real-world tool specs from the MCP registry, public APIs, and well-known SaaS) and re-run `npm run seed:v2`.

To disable the catalog (just the 199 demo fixtures): `INCLUDE_REAL_CORPUS=false npm run seed:v2`.
The registry indexes four discriminated kinds of unit, all sharing the same retrieval pipeline (RRF over sqlite-vec + FTS5) and discovery surface (`/discover` returns `tool_kind` on every result):
- `tool`: RPC-style endpoints with JSON Schema input/output contracts. Default. The original 2chain unit.
- `skill`: Anthropic Claude Code skills (`~/.claude/skills/<slug>/SKILL.md`). Discovery-only; agents load matched skills into context rather than calling them. Imported via `npm run import:skills`.
- `subagent`: Claude Code subagents (`~/.claude/agents/*.md`). Discovery-only; agents spawn matched subagents via the Task tool. Imported via `npm run import:subagents`.
- `prompt`: Curated parameterised prompt templates with `{{var}}` substitution. Callable: `/call` returns `{ rendered: string }`. Seeded from `src/import/prompts-seed.ts` (12 templates: commit, PR, postmortem, grant impact, etc.). Imported via `npm run import:prompts`.
The schema discriminator is `tools.tool_kind` (CHECK-constrained, default `'tool'`). Filter by kind: `storage.listTools({ kind: 'skill' })`. End-to-end smoke: `npm run smoke:v2:mixed`.
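For intuition, the `{{var}}` substitution that makes the `prompt` kind callable can be sketched in a few lines. This is an illustrative sketch only: `renderPrompt` and its behaviour for unknown placeholders are assumptions, not the registry's actual implementation.

```typescript
// Hypothetical sketch of {{var}} template rendering for the `prompt` kind.
// `renderPrompt` is an illustrative name; the real code lives in the registry.
function renderPrompt(
  template: string,
  vars: Record<string, string>,
): { rendered: string } {
  const rendered = template.replace(
    /\{\{(\w+)\}\}/g,
    (match: string, name: string) =>
      name in vars ? vars[name] : match, // assumption: unknown vars pass through
  );
  return { rendered };
}

const out = renderPrompt("Write a commit message for {{change}} in {{repo}}.", {
  change: "the retry logic",
  repo: "2chain",
});
// out.rendered → "Write a commit message for the retry logic in 2chain."
```

`/call` on a `prompt` unit would return an object of this `{ rendered: string }` shape rather than invoking a remote endpoint.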
Every layer of 2chain is an Atlas primitive. Nothing custom-built that Atlas does better.
| Layer | Atlas primitive | Role |
|---|---|---|
| Semantic match | Atlas Vector Search (1024-dim cosine, Voyage `voyage-3`) | Finds tools whose capability means the same thing as the user's query |
| Lexical match | Atlas Search (`lucene.english` BM25) | Catches concrete keyword hits the embedding misses |
| Hybrid fusion | `$rankFusion` (new Atlas operator) | Reciprocal rank fusion of both pipelines in a single server-side aggregation. No client merge, no two-DB juggle |
| Reliability gate | Filtered `$vectorSearch` | `metadata.reliability_score >= 0.80` enforced inside the index, so bad tools never even score |
| Live dashboard | Change streams (`tools.watch()`, `usage.watch()`, `violations.watch()`) | SSE-driven UI with zero polling |
| Audit trail | Standard collections | `usage`, `violations`, `eval_runs`, `rankings`: every action observable |
The whole hybrid retrieval + reliability gate is six lines of aggregation pipeline. That's only possible because `$rankFusion` ships in Atlas.
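For readers who want the shape of that stage, here is a hedged sketch of a `$rankFusion` hybrid pipeline. The index names (`vector_index`, `text_index`), field paths, and limits are assumptions for illustration, not the repo's actual values.

```typescript
// Sketch of a $rankFusion hybrid-retrieval stage (Atlas / MongoDB 8.1+).
// Index names, paths, and limits below are assumptions, not the repo's values.
const queryEmbedding: number[] = []; // placeholder for a 1024-dim Voyage embedding
const query = "extract financial tables from PDF";

const rankFusionStage = {
  $rankFusion: {
    input: {
      pipelines: {
        vector: [
          {
            $vectorSearch: {
              index: "vector_index",
              path: "embedding",
              queryVector: queryEmbedding,
              numCandidates: 100,
              limit: 20,
              filter: {
                status: { $eq: "active" },
                "metadata.reliability_score": { $gte: 0.8 }, // reliability gate
              },
            },
          },
        ],
        text: [
          { $search: { index: "text_index", text: { query, path: "description" } } },
          { $limit: 20 },
        ],
      },
    },
    combination: { weights: { vector: 0.7, text: 0.3 } }, // 0.7 vec / 0.3 text
  },
};

const hybridPipeline = [rankFusionStage, { $limit: 10 }];
```

Because fusion runs server-side, the client never merges two result sets or juggles two databases; it just reads the fused, already-ranked output.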
```
caller agent ── "Extract tables from PDF" ──► /discover
                         │
                         ▼
   ┌──────────────────────────────────────────────────┐
   │ Atlas Vector Search + 0.80 reliability gate      │
   │ Voyage embeddings 1024d + 0.70 vec-score gate    │
   │ composite = 0.4·vec + 0.6·reliability            │
   │ dedupe by tool name (latest version wins)        │
   └──────────────────────────────────────────────────┘
                         │
                         ▼  ranked top-N visible to agent

tool author ──► /push   embed + run inline evals
                        → status='active' + reliability score

caller agent ─► /call   input contract + stub + output contract
                        + circuit-break on violation
```
Three trust layers:

- **Discovery filter**: pre-search; only `status='active'` tools with `reliability ≥ 0.80` are even considered.
- **Relevance gate**: post-search; vector similarity must be ≥ 0.70 (drops semantic noise).
- **Contract enforcement**: at call time, input + output schemas are validated; tools that lie circuit-break.
Two retrieval modes:

- **Vector** (`/discover?mode=vector`, default): `$vectorSearch` + composite re-rank (0.4·vec + 0.6·reliability). Best for natural-language queries.
- **Hybrid** (`/discover?mode=hybrid`): Atlas `$rankFusion` of `$vectorSearch` (Voyage embeddings) + `$search` (Atlas Search text). Reciprocal rank fusion with 0.7 vector / 0.3 text weights. Best for queries that mix semantic intent with concrete keywords ("lint javascript", "extract financial tables from PDF"). Pure adaptive retrieval: different rank arms agree on the trustworthy answer.
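The composite re-rank in vector mode is simple enough to sketch directly. This is an illustrative sketch assuming scores normalised to [0, 1]; the `Candidate` shape and `rerank` name are not the repo's actual code.

```typescript
// Illustrative composite re-rank for vector mode: 0.4·vec + 0.6·reliability.
// The record shape and function name are assumptions for this sketch.
interface Candidate {
  name: string;
  vecScore: number;    // cosine similarity from $vectorSearch, in [0, 1]
  reliability: number; // eval pass-rate, in [0, 1]
}

function rerank(candidates: Candidate[]): Candidate[] {
  return candidates
    .filter((c) => c.vecScore >= 0.7) // post-search relevance gate (D33)
    .sort(
      (a, b) =>
        0.4 * b.vecScore + 0.6 * b.reliability -
        (0.4 * a.vecScore + 0.6 * a.reliability),
    );
}

// A highly reliable on-topic tool beats a slightly closer but flaky one:
const ranked = rerank([
  { name: "pylint-pro", vecScore: 0.82, reliability: 0.6 },
  { name: "security-scanner", vecScore: 0.78, reliability: 1.0 },
  { name: "off-topic-tool", vecScore: 0.55, reliability: 1.0 }, // dropped by gate
]);
// ranked[0].name → "security-scanner"
```

The 0.6 weight on reliability is what lets a graded specialist outrank a generically similar tool, as in the security-scanner demo below.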
199 tools live in the registry. Headline demo prompts run via Claude Code over MCP:
| # | User prompt to Claude Code | Tool 2chain picks | What happens |
|---|---|---|---|
| 1 | "Build a DCF for NVIDIA, pull the income statement" | `sec-edgar-financials@1.0` | Real fetch from the data.sec.gov XBRL API. Live numbers, source URL, schema-validated |
| 2 | "Lit review on Mamba state-space models, fetch top 3 papers" | `arxiv-paper-search@1.0` | Real fetch from export.arxiv.org. Live papers, abstracts, PDF URLs |
| 3 | "Lint this JS for our CI dashboard, structured findings" | `eslint-snitch@7.5` | Returns the `{issues: [...]}` contract every time |
| 4 | "Audit this Python auth function, OWASP-graded" | `security-scanner@1.5` | Wins over `pylint-pro` because reliability is graded on security, not style |
| 5 | "Try malformed-bot v1.0 directly" | `malformed-bot@1.0` | Returns prose instead of JSON → ajv catches it on the wire → tool flips to `circuit_broken` in MongoDB → change stream fires → dashboard ticks red. Every future agent is now protected |
One command to dry-run end-to-end without an agent:
```bash
npm run dev        # in terminal 1
npm run demo:full  # in terminal 2
```

Open http://127.0.0.1:3030 for the live dashboard. Between recording takes, `npm run reset:state` clears violations and unbreaks circuit-broken tools without re-seeding.

Full prompt set with expected dashboard reactions: demo/prompts.md. Stage script: demo/SCRIPT.md. Original 4-beat dry-run narrative (pdf-extractor v3.1 reliability drop): DEMO.md.
Measured latencies (real Atlas M10, eu-west-2):
| Operation | Latency |
|---|---|
| `/discover` (warm, query embed pre-cached) | 30ms |
| `/discover` (cold, Voyage call) | 320ms |
| `/push` (embed + 5 evals + status flip) | ~340ms |
| `/call` happy path | ~40ms |
| `/call` triggering circuit-break | ~80ms |
Prerequisites: Node 20+, an Atlas cluster (M10+ for change streams), a Voyage AI API key.
```bash
git clone https://github.com/kitfunso/2chain.git
cd 2chain
npm install

# Create .env
cat > .env <<'EOF'
MONGODB_URI=mongodb+srv://USER:PASS@cluster.xxx.mongodb.net/?appName=Cluster0
MONGODB_DB=twochain
VOYAGE_API_KEY=pa-xxxxxxxxxxxxxxxx
EOF

npm run smoke:setup  # creates collections, indexes, vector index (1024d cosine)
npm run setup:text   # creates Atlas Search text index (for hybrid mode)
npm run seed         # seeds 199 tools (14 hand-crafted + 185 generated) + 3 agents + eval_runs
npm run dev          # http://127.0.0.1:3030
```

The vector index takes ~45s to become queryable on first creation; the setup script polls until it's ready.
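The wait-until-queryable step can be sketched as a generic polling helper. `pollUntil`, its interval, and the commented readiness check are illustrative assumptions, not the repo's actual setup code.

```typescript
// Illustrative polling helper in the spirit of the setup script's
// wait-until-queryable step. Names and intervals are assumptions.
async function pollUntil(
  check: () => Promise<boolean>,
  { intervalMs = 2000, timeoutMs = 120_000 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return; // ready: stop polling
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error("timed out waiting for the vector index to become queryable");
}

// Usage sketch against the Node driver (index name is an assumption):
// await pollUntil(async () => {
//   const [idx] = await collection.listSearchIndexes("tool_vec_idx").toArray();
//   return idx?.status === "READY";
// });
```

A bounded timeout matters here: a silently stuck seed is worse than a loud failure 2 minutes in.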
The vector index is created with these filter paths declared up-front (filter paths can't be added post-build):
- `status` (string)
- `metadata.reliability_score` (number)
- `metadata.cost_per_call_usd` (number)
- `metadata.p95_latency_ms` (number)
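One plausible shape for that index definition, in the Atlas Vector Search JSON format, is sketched below. The index name (`tool_vec_idx`) and embedding field path (`embedding`) are assumptions; the filter paths and vector parameters come from this README.

```typescript
// Plausible Atlas Vector Search index definition with the filter paths above.
// The index name and embedding field path are assumptions for this sketch.
const vectorField = {
  type: "vector",
  path: "embedding",
  numDimensions: 1024, // Voyage voyage-3 embedding size
  similarity: "cosine",
};

const vectorIndexDefinition = {
  name: "tool_vec_idx",
  type: "vectorSearch",
  definition: {
    fields: [
      vectorField,
      { type: "filter", path: "status" },
      { type: "filter", path: "metadata.reliability_score" },
      { type: "filter", path: "metadata.cost_per_call_usd" },
      { type: "filter", path: "metadata.p95_latency_ms" },
    ],
  },
};
```

Declaring every filter path at creation time is the point: since filter paths can't be added post-build, forgetting one means rebuilding the index.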
Change streams require a replica set (M10+ on Atlas). The seed will work on M0, but the dashboard's live re-rank will not; fall back to polling.
```
src/
├── types.ts                 Shared TypeScript types + locked constants
├── db/client.ts             Singleton MongoClient
├── embeddings/voyage.ts     Voyage v3 fetch wrapper (1024d)
├── fixtures/
│   ├── tools.ts             14 hand-crafted tool specs (incl. sec-edgar-financials, arxiv-paper-search)
│   ├── generated.ts         185 generated specs across pdf-extract, code, summarisation, lint domains
│   ├── cases.ts             15 eval cases across 3 domains
│   └── agents.ts            3 demo agents with sha256-hashed API keys
├── services/
│   ├── discover.ts          $vectorSearch + 0.70 vec-gate + composite re-rank + dedupe
│   ├── discoverHybrid.ts    $rankFusion (vector 0.7 + text 0.3) + heuristic / Cohere re-rank
│   ├── push.ts              insert pending → embed → run evals → flip to active
│   ├── call.ts              ajv input/output validation + fail-fast circuit-break
│   ├── evalRunner.ts        Sequential domain-case runner with 5s per-case timeout
│   ├── stubs.ts             In-process tool registry (case_id-keyed responses + real-fetch wrappers)
│   ├── secEdgar.ts          Real SEC EDGAR XBRL client (CIK lookup, concept fallback chain)
│   ├── arxivSearch.ts       Real arxiv.org Atom feed client
│   └── graders.ts           numeric_tolerance, regex, length, json_schema_array_field
└── server/
    ├── index.ts             buildServer() + change-stream subscriptions
    ├── auth.ts              x-api-key middleware (sha256 hashed lookup)
    ├── sse.ts               SSE broadcast manager
    ├── streams.ts           MongoDB change-stream watchers
    └── routes/
        ├── discover.ts      GET /discover
        ├── push.ts          POST /push
        ├── call.ts          POST /call
        └── dashboard.ts     GET / (HTML), /events (SSE), /state (snapshot)

bin/2chain.mjs               CLI: 2chain push|discover|call
demo/                        Locked tool-spec JSON for Beat 2
scripts/                     Smoke tests + seed + demo:full orchestrator
```
| Script | Purpose |
|---|---|
| `npm run dev` | Start the API server on 127.0.0.1:3030 |
| `npm run seed` | Full re-seed: 199 tools, embeddings, agents, eval_runs (~1m25s, hits Voyage) |
| `npm run reset:state` | Fast reset between recording takes: clears violations/usage/rankings, un-breaks circuit-broken tools (no re-embed) |
| `npm run demo:full` | Orchestrated 4-beat dry run with timing labels |
| `npm run demo:beat1..4` | Run individual beats |
| `npm run smoke:all` | Run every smoke test in sequence |
| `npm run typecheck` | `tsc --noEmit` |
Per-component smoke tests live under `npm run smoke:*`. All tests use the real Atlas cluster, no mocks (per CLAUDE.md / DESIGN.md).
```js
$vectorSearch: {
  ...,
  filter: {
    status: { $eq: 'active' },
    'metadata.reliability_score': { $gte: 0.80 }, // hard gate
  }
}
```

Tools below 0.80 are invisible to discovery. The `/push` flow calculates reliability as the eval pass-rate; tools that ship buggy versions get filtered automatically without any agent-side change.
```js
{ $match: { vec_score: { $gte: 0.70 } } }
```

Voyage-3's similarity floor for AI-tool descriptions is ~0.55-0.65 regardless of topic. Without this gate, off-topic tools at high reliability outrank lower-reliability on-topic tools. Standard semantic-search hygiene.
Every `/call` validates input + output against the tool's JSON Schema (ajv). On output failure with `output_repair_strategy: 'fail-fast'`, the tool flips to `circuit_broken` immediately and subsequent calls 503 without re-invoking the stub.
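The fail-fast behaviour amounts to a tiny state machine around the output check. The sketch below abstracts the ajv-compiled schema check behind a plain predicate; all names, the 422 code on first failure, and the class shape are assumptions, not the repo's actual `call.ts`.

```typescript
// Illustrative fail-fast circuit-break around output validation.
// In the real service the validator is an ajv-compiled JSON Schema check;
// here it is abstracted as a plain predicate. All names are assumptions.
type Validator = (output: unknown) => boolean;

class ToolCircuit {
  status: "active" | "circuit_broken" = "active";
  constructor(private validateOutput: Validator) {}

  call(invoke: () => unknown): { ok: boolean; httpStatus: number } {
    if (this.status === "circuit_broken") {
      return { ok: false, httpStatus: 503 }; // broken: never re-invoke the stub
    }
    const output = invoke();
    if (!this.validateOutput(output)) {
      this.status = "circuit_broken"; // fail-fast on the first contract lie
      return { ok: false, httpStatus: 422 }; // assumed code for the first failure
    }
    return { ok: true, httpStatus: 200 };
  }
}

// malformed-bot returns prose instead of JSON; the first bad call breaks it.
const circuit = new ToolCircuit((o) => typeof o === "object" && o !== null);
circuit.call(() => "I'm sorry, here are your results..."); // breaks the circuit
const second = circuit.call(() => ({ issues: [] }));       // 503, stub never runs
```

The key property: once broken, the stub is never invoked again, so every subsequent agent is protected at zero cost.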
The full 34-decision log lives in DESIGN.md. The most consequential:
- **D9**: Pushed tools insert with `status='pending'`, `reliability=0`. The eval runner is the only writer that flips the status. Closes the eval race window.
- **D14**: 5 binary cases per domain, so pass-rates quantise to multiples of 0.2. Demo math is deterministic at the 0.6/0.8/1.0 boundaries.
- **D33**: Post-search `vec_score >= 0.70` relevance gate (added at H1 after the Voyage-3 baseline was tested empirically; see lessons below).
- **D34**: `/push` always ends in `status='active'`. Reliability filtering is done by the discovery gate; circuit-break is reserved for `/call` contract violations only.
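D14's arithmetic is worth making concrete: with 5 binary cases, reliability is the pass count divided by 5, so it can only land on 0.0, 0.2, ..., 1.0. A minimal sketch (the function name is illustrative, not the repo's):

```typescript
// D14 in code: 5 binary eval cases per domain quantise pass-rates to 0.2 steps.
// `reliabilityFromEvals` is an illustrative name for this sketch.
function reliabilityFromEvals(results: boolean[]): number {
  const passed = results.filter(Boolean).length;
  return passed / results.length;
}

const fourOfFive = reliabilityFromEvals([true, true, true, true, false]);
// 0.8 → sits exactly on the 0.80 discovery gate, so the tool stays visible

const threeOfFive = reliabilityFromEvals([true, true, true, false, false]);
// 0.6 → below the gate, filtered from discovery without being circuit-broken
```

This is why the demo's reliability drops are deterministic: one failing case moves a tool cleanly across the 0.80 boundary.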
Voyage-3 cosine similarity for AI-tool descriptions floors at ~0.55-0.65 regardless of how topic-separated the descriptions are. The original DESIGN predicted 0.20-0.30 for off-topic; reality is ~0.6. Fixing this with text tuning is fragile. D33 (post-search vec gate at 0.70) is the right move. Industry-standard semantic search has this anyway.
The `/push` flow's status-flip rule had a contradiction: the DESIGN.md sequence diagram suggested circuit-breaking at low pass-rates, while the §3.4 state table and EVALS Beat 2 say the status stays active and reliability does the gating. D34 resolves it: push always ends `active`, and circuit-break is `/call`-only. This matches the demo's narrative: bad versions are filtered, not banished. A fixed v3.2 can pass evals later and reclaim the top slot without admin intervention.
2chain ships an MCP server so Claude Code (and any MCP-compatible agent) can use the registry natively. Configure it in your MCP client:
```json
{
  "mcpServers": {
    "2chain": {
      "command": "node",
      "args": ["/path/to/2chain/bin/2chain-mcp.mjs"],
      "env": {
        "TWOCHAIN_HOST": "https://your-codespace-3030.app.github.dev",
        "TWOCHAIN_API_KEY": "sk_demo_pdf_agent_8f2c4a"
      }
    }
  }
}
```

The MCP server exposes two tools:
- `discover_tools(query, mode?, top?)`: search the registry and get a ranked list with reliability scores. Returns only tools that pass the 0.80 reliability gate.
- `call_tool(tool_name, tool_version, input)`: invoke a tool. Input/output schemas are enforced; bad responses circuit-break the tool automatically.
After configuring the MCP server, prompts like "extract the line items from this PDF text" or "lint this JavaScript for bugs" trigger Claude to call `discover_tools`, pick the right tool, then invoke it via `call_tool`: all visible on the dashboard's live call feed in real time.
See demo/prompts.md for 7 ready-to-paste demo prompts covering financial extraction, code review, security scanning, summarisation, invoice parsing, contract violations, and live re-ranking.
The headline demo shows two real-fetch tools (SEC EDGAR financials, arxiv paper search) plus three contract-enforced specialists (linter, security scanner, malformed-bot). The same registry mechanism handles any agent task with multiple competing tools and a JSON contract:
| Domain | Multiple tools because... | Eval style |
|---|---|---|
| Audio transcription (Whisper, Deepgram, AssemblyAI) | Accuracy varies per accent, jargon, multi-speaker | WER vs reference |
| Text-to-SQL (Vanna, sqlcoder, Claude, GPT) | Quality varies per schema complexity, dialect | Run vs fixture DB, compare result rows |
| OCR / document understanding (Textract, Document AI) | Per-document-type reliability varies wildly | Field-by-field exact match |
| Code review (already in fixtures) | Different rule sets, different langs, different specialities | Synthetic buggy code, pass/fail per rule |
| Translation (DeepL, Google, Azure) | Reliability per language pair + domain | BLEU vs reference |
| Image generation (DALL-E, SD, Flux) | Style fidelity, brand safety vary | LLM-as-judge with rubric |
The discovery + reliability gate + contract layers stay identical. Only the eval grader changes per domain.
- **Atlas auto-embedding**: drop the Voyage env var; Atlas Vector Search then generates embeddings on insert.
- **LLM-driven repair branch**: for tools with `output_repair_strategy: 'llm'`, attempt up to 3 schema-guided repairs before circuit-break.
- **Consumer chat UI**: a non-technical user types "convert this PDF"; the orchestrator agent calls `/discover` then `/call`. The registry stays the moat; the chat is the wrapper.
- **Signed manifests + sandbox execution**: prevent malicious tools from poisoning the network beyond the contract layer.
- Theme: Adaptive Retrieval. Results reorder as evals roll in; the system gets smarter without any agent code changing.
- Atlas Sandbox: required by hackathon (M10 dedicated, eu-west-2 London).
- Public repo: yes, this one.
- Live demo: see DEMO.md for the locked 3-min stage script.
- Submission video: youtu.be/puINYgtQXdM (60s) · script + shotlist in SUBMISSION.md.
Keith So · Principal Researcher and Lead Engineer, KITFUNSO LTD · skfskf27@gmail.com