Tracefield Lab is a modular pipeline for multi-dataset analysis. It abstracts ingestion, entity mapping, feature extraction, and statistical analysis into configurable modules, so researchers can compare heterogeneous datasets without rewriting the pipeline.
Demo / MVP: https://tracefieldlab.thor-nydal.no
Many research workflows require combining datasets, harmonizing entities, extracting features, and running statistical tests. Tracefield Lab provides a reproducible system for that workflow, with a feature store, analysis jobs, and provenance tracking.
The system:
- Registers datasets with schemas and licensing metadata (schema can be inferred from CSV/JSON samples).
- Ingests raw data into staging tables and object storage.
- Maps entities across datasets using manual mapping or automated semantic resolution (exact keys + BGE embeddings).
- Extracts features via modular workers: text-to-embedding (BGE 1024-dim) and domain-specific scalar features.
- Runs statistical analysis with effect sizes and multiple-comparison correction.
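The analysis step above can be sketched with stdlib-only Python. Note that `cohens_d` and `benjamini_hochberg` are illustrative helpers, not the analysis worker's actual API:

```python
import statistics

def cohens_d(a, b):
    """Effect size: difference of means scaled by the pooled standard deviation."""
    na, nb = len(a), len(b)
    sa, sb = statistics.stdev(a), statistics.stdev(b)
    pooled = (((na - 1) * sa**2 + (nb - 1) * sb**2) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled

def benjamini_hochberg(pvals, alpha=0.05):
    """Indices of hypotheses rejected under Benjamini-Hochberg FDR control."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

print(round(cohens_d([2, 4, 6], [1, 3, 5]), 2))      # 0.5
print(benjamini_hochberg([0.01, 0.2, 0.02, 0.04]))   # [0, 2]
```

Correction matters here because cross-dataset analysis runs many tests at once, so uncorrected p-values would overstate discoveries.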
Most research tools work within a single dataset or domain. Tracefield Lab is built for correlation discovery across heterogeneous sources—different labs, disciplines, and formats that rarely get compared. Entity resolution (exact keys plus semantic matching with embeddings) lets you map "the same thing" across datasets. The feature store and analysis layer then surface correlations that emerge only when you can cross-reference. That makes it a tool for cracking open hermeticized science: bringing siloed knowledge into one auditable, reproducible system so you can find the patterns that live at the boundaries.
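A toy sketch of that two-stage matching idea (exact key first, embedding similarity as fallback). The real resolver uses 1024-dim BGE vectors; the 2-D vectors, the threshold, and the `resolve` function here are purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def resolve(record, candidates, threshold=0.8):
    """Match a record to a canonical entity: exact key first, then embeddings."""
    for cand in candidates:
        if record["key"] == cand["key"]:
            return cand["id"], "exact"
    best = max(candidates, key=lambda c: cosine(record["vec"], c["vec"]))
    if cosine(record["vec"], best["vec"]) >= threshold:
        return best["id"], "semantic"
    return None, "unmatched"

candidates = [
    {"id": 1, "key": "aspirin", "vec": [1.0, 0.0]},
    {"id": 2, "key": "ibuprofen", "vec": [0.0, 1.0]},
]
# Different surface key, near-identical embedding -> semantic match to entity 1.
print(resolve({"key": "acetylsalicylic acid", "vec": [0.9, 0.1]}, candidates))
```

In the pipeline itself, similarity search runs inside PostgreSQL via pgvector rather than in application code.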
```
Dataset upload
  ↓
Dataset registry + raw storage
  ↓
Worker-ingest → staging tables
  ↓
Entity mapping (resolver)
  ↓
Feature workers → feature store
  ↓
Analysis worker → results
```
| Component | Description |
|---|---|
| API (Kotlin/Ktor) | Dataset registry, job orchestration, results |
| Worker-ingest | Parses datasets and normalizes raw data |
| Resolver | Semantic entity resolution (BGE embeddings, exact + fuzzy matching) |
| Feature workers | Embeddings, custom modules |
| Analysis worker | Statistical tests and corrections |
| PostgreSQL + pgvector | Structured data and vector storage |
| Kafka | Job queue |
| MinIO | Raw dataset object storage |
| Grafana (optional) | Metrics |
Core pipeline assumptions (provenance, job status lifecycle, feature contract) are documented in `docs/INVARIANTS.md` and checked in CI. A full-workflow integration test (`test/test_full_workflow_integration.py`) validates the path from seed data to analysis results and provenance. CI runs unit tests (excluding `@pytest.mark.integration`), then integration tests against a real Postgres; see the RUNBOOK for local test commands.
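As an illustration of what a job-status-lifecycle check might look like — the state names here are assumptions, and the authoritative invariants live in docs/INVARIANTS.md:

```python
# Illustrative lifecycle; terminal states have no outgoing transitions.
ALLOWED = {
    "pending":   {"running"},
    "running":   {"succeeded", "failed"},
    "succeeded": set(),
    "failed":    set(),
}

def check_transitions(history):
    """True iff every consecutive pair of states is an allowed transition."""
    return all(b in ALLOWED[a] for a, b in zip(history, history[1:]))

print(check_transitions(["pending", "running", "succeeded"]))  # True
print(check_transitions(["pending", "succeeded"]))             # False
```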
Feature modules follow a common contract and write to the feature store with provenance. Examples:
- Text embeddings (semantic vectors)
- Structured trait extraction
- Domain-specific numeric features
- Entity attribute normalization
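A minimal sketch of that contract, assuming hypothetical names (`Feature`, `make_module`); the actual interface is defined by the feature workers:

```python
from dataclasses import dataclass
from typing import Any, Callable
import datetime
import hashlib

@dataclass
class Feature:
    entity_id: int
    name: str
    value: Any
    provenance: dict  # module, version, input hash, timestamp

def make_module(name: str, version: str, fn: Callable[[str], Any]):
    """Wrap a plain function so its output always carries provenance."""
    def run(entity_id: int, text: str) -> Feature:
        return Feature(
            entity_id=entity_id,
            name=name,
            value=fn(text),
            provenance={
                "module": name,
                "version": version,
                "input_sha256": hashlib.sha256(text.encode()).hexdigest(),
                "extracted_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            },
        )
    return run

# Toy scalar feature: text length.
text_length = make_module("text_length", "0.1.0", len)
f = text_length(42, "hello world")
print(f.value, f.provenance["module"])  # 11 text_length
```

Attaching provenance in the wrapper, rather than trusting each module to do it, is what keeps every feature-store row auditable.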
- `datasets` — metadata, schema, source, license
- `dataset_files` — object storage references
- `entities` — canonical entities and types
- `entity_map` — cross-dataset mapping rules (manual or from resolution jobs)
- `resolution_jobs` — entity resolution job queue and status
- `features` — normalized feature values with provenance
- `analysis_jobs` — analysis configurations and status
- `analysis_results` — tests, effect sizes, p-values
- `provenance_event` — process audit tracking
- Docker + Docker Compose
- ~12 GB disk space
- CPU-only support (GPU optional for faster inference)
Dev (API built locally):
```
docker compose up -d --build
```

Production (API and services from registry; Watchtower pulls new images from CI and restarts containers). Use the prod override and start via deploy/start.ps1 or docker compose:

```
# One-time: copy and edit deploy/deploy.env (see deploy/deploy.env.example)
# Start production stack (start.ps1; api, frontend, workers, resolver use pull_policy: always; schedule it for automatic updates).
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```

Ensure `TRACEFIELD_API_IMAGE` (and other image vars) is set in deploy/deploy.env or .env, e.g. `ghcr.io/<owner>/<repo>/api:main`. See the RUNBOOK.md production deployment section.
The web frontend follows the system light/dark preference. Users with a dark OS theme see an inverted (dark) theme. You can override it by setting `localStorage.setItem('color-theme', 'light')` or `'dark'` in the browser.
The pipeline expects ingest, features, and analysis topics (adjust per config):

```
docker compose exec -T kafka rpk topic create ingest
docker compose exec -T kafka rpk topic create features
docker compose exec -T kafka rpk topic create analysis
```

Verify:

```
docker compose ps
```

Pull the local LLM model:

```
docker exec -it <local-llm-container-name> ollama pull qwen2.5:7b-instruct-q4_K_M
```

Test:

```
curl http://localhost:8001/api/tags
```

- Register a dataset (optionally infer schema from a pasted CSV/JSON sample)
- Upload raw data
- Map entities (manual via Entity Mappings UI, or automated via resolution jobs with embeddings)
- Trigger feature extraction
- Run analysis jobs and inspect results
Each processing step logs a provenance record for reproducibility.
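For illustration, such a record might minimally capture the step, the job, and a timestamp. The field names below are assumptions, not the actual `provenance_event` columns:

```python
import datetime
import json
import uuid

def provenance_event(step: str, entity_id: int, job_id: str, detail: dict) -> str:
    """Serialize a hypothetical provenance event for the audit trail."""
    event = {
        "event_id": str(uuid.uuid4()),
        "step": step,            # e.g. ingest, resolve, feature, analysis
        "entity_id": entity_id,
        "job_id": job_id,
        "detail": detail,
        "recorded_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    return json.dumps(event)

evt = json.loads(provenance_event("feature", 42, "job-7", {"module": "text_embedding"}))
print(evt["step"], evt["detail"]["module"])  # feature text_embedding
```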
Grafana dashboards are provisioned from `grafana/` when you run `docker compose up`. The default Grafana login is `admin` / `admin`, and the PostgreSQL datasource points at the local db container.
Dashboards:
- **Pipeline Observability** — counts for datasets, features, and jobs.
Alerts:
- **Pipeline stuck** — triggers when any record has not progressed within the SLA.
- **Pipeline errors** — triggers when error events appear in the last 15 minutes.
For research use only. External datasets must follow their respective licenses.
If using Tracefield Lab for academic research, cite:
- Model name and version (if LLMs used)
- Prompt hash (if LLMs used)
- Processing date
- Dataset source attribution