Mulder

Config-driven Document Intelligence Platform on GCP
Turn document collections into searchable knowledge graphs — defined by one config file, deployed by one command.

The truth is in the documents.

Live Demo · Functional Spec · Roadmap · Example Config

Development Progress 16 / 81 steps

M1 Foundation       ██████████████████████████████ 11/11 ✓
M2 Ingest+Extract   ████████████████░░░░░░░░░░░░░  5/9
M3 Segment+Enrich   ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/10
M4 Search (v1.0)    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/11
M5 Curation         ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/5
M6 Intelligence     ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/7
M7 API+Workers      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/9
M8 Operations       ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/6
M9 Multi-Format     ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/13

What it does

Mulder transforms unstructured document collections — PDFs with complex layouts like magazines, newspapers, government correspondence — into structured, searchable knowledge.

You define your domain ontology in a single mulder.config.yaml. The pipeline adapts: extraction, entity resolution, retrieval, and analysis all derive from that one config file. No custom code per domain.

mulder.config.yaml  →  terraform apply  →  mulder pipeline run ./pdfs/  →  mulder query "..."

Capabilities

#	Capability	What it does
1	Layout Extraction	Document AI + Gemini Vision fallback for magazines, newspapers, multi-column layouts
2	Domain Ontology	One YAML defines entities, relationships, extraction rules. Gemini structured output with auto-generated JSON Schema.
3	Taxonomy	Auto-bootstrapped after ~25 docs, incremental growth, human-in-the-loop curation, cross-lingual
4	Hybrid Retrieval	Vector (pgvector) + BM25 (tsvector) + graph traversal (recursive CTEs), fused via RRF + LLM re-ranking
5	Web Grounding	Gemini verifies entities against live web data — coordinates, bios, org descriptions
6	Spatio-Temporal	PostGIS proximity queries, temporal clustering, pattern detection across time and space
7	Evidence Scoring	Corroboration scores, two-phase contradiction detection, source reliability (PageRank), evidence chains
8	Cross-Lingual Resolution	3-tier entity resolution (attribute match, embedding similarity, LLM-assisted) across 100+ languages
9	Deduplication	MinHash/SimHash near-duplicate detection, dedup-aware corroboration scoring
10	Schema Evolution	Config-hash tracking per document per step, selective reprocessing after config changes
11	Visual Intelligence	Image extraction, Gemini analysis, image embeddings, map/diagram data extraction
12	Pattern Discovery	Cluster anomalies, temporal spikes, subgraph similarity, proactive insights

Pipeline

          PDF
           │
     ┌─────▼─────┐
     │   Ingest  │  Upload to Cloud Storage, pre-flight validation
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │  Extract  │  Document AI + Gemini Vision fallback → layout JSON + page images → GCS
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │  Segment  │  Gemini identifies stories from page images → Markdown + metadata → GCS
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │   Enrich  │  Entity extraction, taxonomy normalization, cross-lingual resolution
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │   Ground  │  Web enrichment via Gemini Search — coordinates, bios, verification
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │   Embed   │  Semantic chunking + text-embedding-004 (768-dim) → pgvector + BM25
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │   Graph   │  Deduplication, corroboration scoring, contradiction flagging
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │  Analyze  │  Contradiction resolution, PageRank reliability, evidence chains
     └─────┬─────┘
           │
       Knowledge
         Graph

Every step is idempotent, independently runnable, and CLI-accessible. Content artifacts live in GCS, search index in PostgreSQL.

Configuration

All domain logic lives in mulder.config.yaml. Define your domain, the pipeline adapts:

project:
  name: investigative-journalism

ontology:
  entity_types:
    - name: person
      description: Individual mentioned in documents
      attributes:
        - { name: role, type: string }
        - { name: affiliation, type: string }
    - name: event
      description: A specific incident or occurrence
      attributes:
        - { name: date, type: date }
        - { name: location, type: string }
    - name: location
      description: Geographic place
      attributes:
        - { name: coordinates, type: geo_point, optional: true }

  relationships:
    - { name: involved_in, from: person, to: event }
    - { name: occurred_at, from: event, to: location }

Everything beyond project and ontology has sensible defaults. See mulder.config.example.yaml for the full reference.

Architecture

Single PostgreSQL	pgvector + tsvector + PostGIS + recursive CTEs + job queue — one instance, no graph DB, no Redis, no Pub/Sub
Content in GCS	PDFs, layout JSON, page images, story Markdown in Cloud Storage. PostgreSQL holds references + search index only.
Service Abstraction	All GCP services behind interfaces. Dev mode uses fixtures — zero API calls, zero cost.
CLI-first	Every capability is a CLI command. The API is a job producer, not a direct executor.
PostgreSQL is truth	Pipeline state, job queue, config tracking. Firestore is observability-only (UI monitoring).

Baseline cost: ~30-40 EUR/mo for a small Cloud SQL instance. Scales with Gemini API usage.

Tech Stack


Language	TypeScript (ESM, strict mode)
Monorepo	pnpm + Turborepo
Infrastructure	Terraform (modular)
OCR	Document AI Layout Parser
LLM	Gemini 2.5 Flash (Vertex AI)
Embeddings	text-embedding-004 (768-dim Matryoshka)
Database	Cloud SQL PostgreSQL
Search	pgvector (HNSW) + tsvector (BM25) + recursive CTEs
Geospatial	PostGIS
CLI	Commander.js
Testing	Vitest

Status

Mulder's design phase is complete — functional spec, implementation roadmap, and config schema are finalized.

Currently building Milestone 2 (ingest + extract: first GCP integration, Document AI, Cloud Storage).

See the roadmap for all 9 milestones from foundation to multi-format ingestion.

Contributing

Contributions, feedback, and ideas are welcome. Open an issue or start a discussion.

License

Apache 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 196 Commits
.claude/commands		.claude/commands
.github		.github
apps		apps
demo		demo
devlog		devlog
docker/postgres		docker/postgres
docs		docs
fixtures		fixtures
packages		packages
public		public
tests		tests
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
biome.json		biome.json
docker-compose.yaml		docker-compose.yaml
mulder.config.example.yaml		mulder.config.example.yaml
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
pnpm-workspace.yaml		pnpm-workspace.yaml
tsconfig.base.json		tsconfig.base.json
turbo.json		turbo.json
vitest.config.ts		vitest.config.ts
wrangler.json		wrangler.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Mulder

What it does

Capabilities

Pipeline

Configuration

Architecture

Tech Stack

Status

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Mulder

What it does

Capabilities

Pipeline

Configuration

Architecture

Tech Stack

Status

Contributing

License

About

Topics

Resources

License

Code of conduct

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages