Config-driven Document Intelligence Platform on GCP
Turn document collections into searchable knowledge graphs — defined by one config file, deployed by one command.
The truth is in the documents.
Live Demo · Functional Spec · Roadmap · Example Config
Mulder transforms unstructured document collections — PDFs with complex layouts like magazines, newspapers, government correspondence — into structured, searchable knowledge.
You define your domain ontology in a single mulder.config.yaml. The pipeline adapts: extraction, entity resolution, retrieval, and analysis all derive from that one config file. No custom code per domain.
mulder.config.yaml → terraform apply → mulder pipeline run ./pdfs/ → mulder query "..."
| # | Capability | What it does |
|---|---|---|
| 1 | Layout Extraction | Document AI + Gemini Vision fallback for magazines, newspapers, multi-column layouts |
| 2 | Domain Ontology | One YAML defines entities, relationships, extraction rules. Gemini structured output with auto-generated JSON Schema. |
| 3 | Taxonomy | Auto-bootstrapped after ~25 docs, incremental growth, human-in-the-loop curation, cross-lingual |
| 4 | Hybrid Retrieval | Vector (pgvector) + BM25 (tsvector) + graph traversal (recursive CTEs), fused via RRF + LLM re-ranking |
| 5 | Web Grounding | Gemini verifies entities against live web data — coordinates, bios, org descriptions |
| 6 | Spatio-Temporal | PostGIS proximity queries, temporal clustering, pattern detection across time and space |
| 7 | Evidence Scoring | Corroboration scores, two-phase contradiction detection, source reliability (PageRank), evidence chains |
| 8 | Cross-Lingual Resolution | 3-tier entity resolution (attribute match, embedding similarity, LLM-assisted) across 100+ languages |
| 9 | Deduplication | MinHash/SimHash near-duplicate detection, dedup-aware corroboration scoring |
| 10 | Schema Evolution | Config-hash tracking per document per step, selective reprocessing after config changes |
| 11 | Visual Intelligence | Image extraction, Gemini analysis, image embeddings, map/diagram data extraction |
| 12 | Pattern Discovery | Cluster anomalies, temporal spikes, subgraph similarity, proactive insights |
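
To make the fusion step in row 4 more concrete, here is a minimal sketch of Reciprocal Rank Fusion (RRF) in TypeScript. The function name, the sample ids, and the constant `k = 60` are illustrative assumptions rather than Mulder's actual code or defaults, and the LLM re-ranking stage is left out.

```typescript
// Reciprocal Rank Fusion (RRF): merge several ranked lists into one ranking.
// Each list holds document ids, best match first.
type RankedList = readonly string[];

// k dampens the influence of top ranks. 60 is a commonly used default,
// not a value taken from Mulder's config.
function reciprocalRankFusion(
  lists: readonly RankedList[],
  k = 60,
): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical result ids from the three retrieval strategies.
const vectorHits = ["story-7", "story-2", "story-9"];
const fullTextHits = ["story-2", "story-7", "story-4"];
const graphHits = ["story-2", "story-9"];

const fused = reciprocalRankFusion([vectorHits, fullTextHits, graphHits]);
console.log(fused.map(([id]) => id)); // order: story-2, story-7, story-9, story-4
```

RRF only needs ranks, not raw scores, which is why it can combine cosine distances, full-text ranks, and graph hop counts without normalizing them first.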
     PDF
      │
┌─────▼─────┐
│  Ingest   │  Upload to Cloud Storage, pre-flight validation
└─────┬─────┘
      │
┌─────▼─────┐
│  Extract  │  Document AI + Gemini Vision fallback → layout JSON + page images → GCS
└─────┬─────┘
      │
┌─────▼─────┐
│  Segment  │  Gemini identifies stories from page images → Markdown + metadata → GCS
└─────┬─────┘
      │
┌─────▼─────┐
│  Enrich   │  Entity extraction, taxonomy normalization, cross-lingual resolution
└─────┬─────┘
      │
┌─────▼─────┐
│  Ground   │  Web enrichment via Gemini Search: coordinates, bios, verification
└─────┬─────┘
      │
┌─────▼─────┐
│   Embed   │  Semantic chunking + text-embedding-004 (768-dim) → pgvector + BM25
└─────┬─────┘
      │
┌─────▼─────┐
│   Graph   │  Deduplication, corroboration scoring, contradiction flagging
└─────┬─────┘
      │
┌─────▼─────┐
│  Analyze  │  Contradiction resolution, PageRank reliability, evidence chains
└─────┬─────┘
      │
  Knowledge
    Graph
Every step is idempotent, independently runnable, and CLI-accessible. Content artifacts live in GCS, search index in PostgreSQL.
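
One way idempotent steps and config-hash tracking could fit together is sketched below. Everything here is hypothetical: the step names, the in-memory `completed` map (standing in for state the README places in PostgreSQL), and the small FNV-1a hash, which keeps the sketch dependency-free where a real implementation would more likely use a cryptographic hash such as SHA-256.

```typescript
// Hypothetical step names; Mulder's real state schema may differ.
type StepName = "extract" | "segment" | "enrich" | "embed";

// Serialize with sorted keys so logically equal configs produce the same string.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([key, inner]) => `${JSON.stringify(key)}:${stableStringify(inner)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// 32-bit FNV-1a over UTF-16 code units (illustrative, not cryptographic).
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Hash only the config slice a step depends on, so unrelated edits do not trigger reruns.
function configHash(stepConfig: unknown): string {
  return fnv1a(stableStringify(stepConfig));
}

// Key: `${documentId}:${step}`, value: config hash the step last ran with.
const completed = new Map<string, string>();

function needsRun(documentId: string, step: StepName, stepConfig: unknown): boolean {
  return completed.get(`${documentId}:${step}`) !== configHash(stepConfig);
}

function markDone(documentId: string, step: StepName, stepConfig: unknown): void {
  completed.set(`${documentId}:${step}`, configHash(stepConfig));
}

const enrichConfigV1 = { entity_types: ["person", "event"] };
const enrichConfigV2 = { entity_types: ["person", "event", "location"] };

console.log(needsRun("doc-1", "enrich", enrichConfigV1)); // true: never processed
markDone("doc-1", "enrich", enrichConfigV1);
console.log(needsRun("doc-1", "enrich", enrichConfigV1)); // false: same config, step is skipped
console.log(needsRun("doc-1", "enrich", enrichConfigV2)); // true: config changed, reprocess
```

Because the check is keyed per document and per step, rerunning a step with unchanged config is a no-op, while a config change marks only the affected steps for reprocessing.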
All domain logic lives in mulder.config.yaml. Define your domain, the pipeline adapts:
project:
  name: investigative-journalism
ontology:
  entity_types:
    - name: person
      description: Individual mentioned in documents
      attributes:
        - { name: role, type: string }
        - { name: affiliation, type: string }
    - name: event
      description: A specific incident or occurrence
      attributes:
        - { name: date, type: date }
        - { name: location, type: string }
    - name: location
      description: Geographic place
      attributes:
        - { name: coordinates, type: geo_point, optional: true }
  relationships:
    - { name: involved_in, from: person, to: event }
    - { name: occurred_at, from: event, to: location }

Everything beyond project and ontology has sensible defaults. See mulder.config.example.yaml for the full reference.
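
As a rough illustration of how an ontology like this could drive structured extraction, the sketch below turns one entity type into a JSON Schema-style object. The interfaces mirror the YAML above, but the type mapping (`date` as a formatted string, `geo_point` as a `lat`/`lon` object) and the rule that attributes are required unless marked `optional: true` are assumptions, not Mulder's documented behavior.

```typescript
// Shapes mirroring the ontology YAML; names are taken from the example config.
interface AttributeDef {
  name: string;
  type: "string" | "date" | "geo_point";
  optional?: boolean;
}

interface EntityTypeDef {
  name: string;
  description: string;
  attributes: AttributeDef[];
}

type JsonSchema = Record<string, unknown>;

// Assumed mapping from ontology attribute types to JSON Schema fragments.
function attributeSchema(type: AttributeDef["type"]): JsonSchema {
  switch (type) {
    case "string":
      return { type: "string" };
    case "date":
      return { type: "string", format: "date" };
    case "geo_point":
      return {
        type: "object",
        properties: { lat: { type: "number" }, lon: { type: "number" } },
        required: ["lat", "lon"],
      };
  }
}

// Build an object schema for one entity type.
function entitySchema(entity: EntityTypeDef): JsonSchema {
  const properties: Record<string, JsonSchema> = {};
  const required: string[] = [];
  for (const attr of entity.attributes) {
    properties[attr.name] = attributeSchema(attr.type);
    if (!attr.optional) required.push(attr.name); // assumption: required by default
  }
  return {
    title: entity.name,
    description: entity.description,
    type: "object",
    properties,
    required,
  };
}

const eventType: EntityTypeDef = {
  name: "event",
  description: "A specific incident or occurrence",
  attributes: [
    { name: "date", type: "date" },
    { name: "location", type: "string" },
  ],
};

const schema = entitySchema(eventType);
console.log(JSON.stringify(schema, null, 2));
```

Deriving the schema from the config keeps the extraction prompt and the ontology in sync: adding an attribute in YAML changes the schema without touching code.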
| Decision | Details |
|---|---|
| Single PostgreSQL | pgvector + tsvector + PostGIS + recursive CTEs + job queue — one instance, no graph DB, no Redis, no Pub/Sub |
| Content in GCS | PDFs, layout JSON, page images, story Markdown in Cloud Storage. PostgreSQL holds references + search index only. |
| Service Abstraction | All GCP services behind interfaces. Dev mode uses fixtures — zero API calls, zero cost. |
| CLI-first | Every capability is a CLI command. The API is a job producer, not a direct executor. |
| PostgreSQL is truth | Pipeline state, job queue, config tracking. Firestore is observability-only (UI monitoring). |
Baseline cost: ~30-40 EUR/mo for a small Cloud SQL instance. Scales with Gemini API usage.
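
The Service Abstraction decision above might look roughly like the following sketch. The `EmbeddingService` interface, the fixture class, and the factory are hypothetical names for illustration; Mulder's actual service contracts are not shown in this README, and the production branch is deliberately left unimplemented rather than guessing at a client API.

```typescript
// Hypothetical interface that pipeline code would depend on.
interface EmbeddingService {
  embed(texts: string[]): Promise<number[][]>;
}

// Dev-mode implementation: deterministic vectors derived from the text,
// with no network calls and therefore no API cost.
class FixtureEmbeddingService implements EmbeddingService {
  private readonly dimensions: number;

  constructor(dimensions = 768) {
    this.dimensions = dimensions;
  }

  async embed(texts: string[]): Promise<number[][]> {
    return texts.map((text) => {
      const vector = new Array<number>(this.dimensions).fill(0);
      for (let i = 0; i < text.length; i++) {
        const slot = i % this.dimensions;
        vector[slot] = (vector[slot] ?? 0) + text.charCodeAt(i) / 1000;
      }
      return vector;
    });
  }
}

// Callers ask the factory for a service and never import a concrete client.
function createEmbeddingService(mode: "dev" | "production"): EmbeddingService {
  if (mode === "dev") {
    return new FixtureEmbeddingService();
  }
  // A production implementation would wrap the real embedding client here.
  throw new Error("production embedding service is not part of this sketch");
}

async function main(): Promise<void> {
  const service = createEmbeddingService("dev");
  const [vector] = await service.embed(["The truth is in the documents."]);
  console.log(vector?.length); // 768
}

main().catch(console.error);
```

Since pipeline steps only see the interface, tests and local runs can swap in fixtures without any change to step logic.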
| Layer | Technology |
|---|---|
| Language | TypeScript (ESM, strict mode) |
| Monorepo | pnpm + Turborepo |
| Infrastructure | Terraform (modular) |
| OCR | Document AI Layout Parser |
| LLM | Gemini 2.5 Flash (Vertex AI) |
| Embeddings | text-embedding-004 (768-dim Matryoshka) |
| Database | Cloud SQL PostgreSQL |
| Search | pgvector (HNSW) + tsvector (BM25) + recursive CTEs |
| Geospatial | PostGIS |
| CLI | Commander.js |
| Testing | Vitest |
Mulder's design phase is complete — functional spec, implementation roadmap, and config schema are finalized.
Currently building Milestone 2 (ingest + extract: first GCP integration, Document AI, Cloud Storage).
See the roadmap for all 9 milestones from foundation to multi-format ingestion.
Contributions, feedback, and ideas are welcome. Open an issue or start a discussion.
