Tracefield Lab

Tracefield Lab is a modular pipeline for multi-dataset analysis. It abstracts ingestion, entity mapping, feature extraction, and statistical analysis into configurable modules, so researchers can compare heterogeneous datasets without rewriting the pipeline.

Demo / MVP: https://tracefieldlab.thor-nydal.no

Abstract

Many research workflows require combining datasets, harmonizing entities, extracting features, and running statistical tests. Tracefield Lab provides a reproducible system for that workflow, with a feature store, analysis jobs, and provenance tracking.

The system:

Registers datasets with schemas and licensing metadata (schema can be inferred from CSV/JSON samples).
Ingests raw data into staging tables and object storage.
Maps entities across datasets using manual mapping or automated semantic resolution (exact keys + BGE embeddings).
Extracts features via modular workers: text-to-embedding (BGE 1024-dim) and domain-specific scalar features.
Runs statistical analysis with effect sizes and multiple-comparison correction.
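To illustrate the last step, here is a minimal sketch that computes an effect size for two groups and adjusts a set of p-values. Both choices are assumptions: this README does not say which effect-size measure or correction method the analysis worker uses, so Cohen's d and the Benjamini-Hochberg procedure stand in as common examples, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(a: list[float], b: list[float]) -> float:
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


effect = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
print(effect, adjusted)
```

Reporting the effect size next to the adjusted p-value helps separate "statistically detectable" from "large enough to matter", which becomes more important as the number of cross-dataset comparisons grows.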

What makes it different

Most research tools work within a single dataset or domain. Tracefield Lab is built for correlation discovery across heterogeneous sources: different labs, disciplines, and formats that rarely get compared. Entity resolution (exact keys plus semantic matching with embeddings) lets you map "the same thing" across datasets. The feature store and analysis layer then surface correlations that only emerge once datasets can be cross-referenced. The goal is to open up siloed science by bringing isolated knowledge into one auditable, reproducible system, so you can find the patterns that live at the boundaries.
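The two-stage matching described above (exact keys first, then embedding similarity) can be sketched as follows. This is an illustration under stated assumptions, not the resolver's actual code: the record fields, the resolve function, and the 0.85 threshold are hypothetical, and toy 3-dimensional vectors stand in for the 1024-dimensional BGE embeddings.

```python
from math import sqrt


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def resolve(record: dict, candidates: list[dict], threshold: float = 0.85):
    """Return (candidate_id, method), or (None, "unmatched")."""
    # 1. An exact key match wins outright.
    for cand in candidates:
        if record["key"] == cand["key"]:
            return cand["id"], "exact"
    # 2. Otherwise take the most similar embedding, if it clears the threshold.
    best = max(
        candidates,
        key=lambda c: cosine(record["embedding"], c["embedding"]),
        default=None,
    )
    if best is not None and cosine(record["embedding"], best["embedding"]) >= threshold:
        return best["id"], "semantic"
    return None, "unmatched"


candidates = [
    {"id": "e1", "key": "aspirin", "embedding": [1.0, 0.0, 0.0]},
    {"id": "e2", "key": "ibuprofen", "embedding": [0.0, 1.0, 0.0]},
]
print(resolve({"key": "aspirin", "embedding": [0.0, 0.0, 1.0]}, candidates))
print(resolve({"key": "acetylsalicylic acid", "embedding": [0.9, 0.1, 0.0]}, candidates))
print(resolve({"key": "unknown", "embedding": [0.0, 0.0, 1.0]}, candidates))
```

Recording the match method alongside each mapping keeps semantic matches auditable, so they can be reviewed separately from exact ones.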

Technical Overview

Data Flow (Target)

Dataset upload
      ↓
Dataset registry + raw storage
      ↓
Worker-ingest → staging tables
      ↓
Entity mapping (resolver)
      ↓
Feature workers → feature store
      ↓
Analysis worker → results

System Components

Component	Description
API (Kotlin/Ktor)	Dataset registry, job orchestration, results
Worker-ingest	Parses datasets and normalizes raw data
Resolver	Semantic entity resolution (BGE embeddings, exact + fuzzy matching)
Feature workers	Embeddings, custom modules
Analysis worker	Statistical tests and corrections
PostgreSQL + pgvector	Structured data and vector storage
Kafka	Job queue
MinIO	Raw dataset object storage
Grafana (optional)	Metrics

Invariants and Guardrails

Core pipeline assumptions (provenance, job status lifecycle, feature contract) are documented in docs/INVARIANTS.md and checked in CI. A full-workflow integration test (test/test_full_workflow_integration.py) validates the path from seed data to analysis results and provenance. CI runs unit tests (excluding @pytest.mark.integration) and then integration tests against a real Postgres; see RUNBOOK.md for local test commands.
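A job status lifecycle invariant of the kind mentioned above could be enforced with a small transition table. The status names and the advance helper below are hypothetical placeholders; the actual lifecycle is the one defined in docs/INVARIANTS.md.

```python
# Hypothetical statuses; each maps to the statuses it may move to next.
ALLOWED_TRANSITIONS = {
    "queued": {"running"},
    "running": {"succeeded", "failed"},
    "succeeded": set(),  # terminal
    "failed": set(),     # terminal
}


def advance(current: str, new: str) -> str:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal job status transition: {current} -> {new}")
    return new


status = advance("queued", "running")
status = advance(status, "succeeded")
print(status)
```

Keeping the allowed transitions in one table makes the invariant easy to assert in a unit test, independent of any database.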

Feature Modules

Feature modules follow a common contract and write to the feature store with provenance. Examples:

Text embeddings (semantic vectors)
Structured trait extraction
Domain-specific numeric features
Entity attribute normalization
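A common contract for such modules might look like the sketch below. The FeatureRow fields, the FeatureModule protocol, and the WordCount module are all hypothetical and chosen for illustration; the real contract is the one the repository documents and checks in CI.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Protocol


@dataclass(frozen=True)
class FeatureRow:
    entity_id: str
    feature_name: str
    value: float
    module: str          # provenance: which module produced the value
    module_version: str  # provenance: which version of that module
    created_at: str      # provenance: UTC timestamp in ISO 8601


class FeatureModule(Protocol):
    name: str
    version: str

    def extract(self, entity_id: str, payload: dict) -> list[FeatureRow]: ...


class WordCount:
    """Toy numeric feature: number of whitespace-separated tokens in a text field."""

    name = "word_count"
    version = "0.1.0"

    def extract(self, entity_id: str, payload: dict) -> list[FeatureRow]:
        count = float(len(payload.get("text", "").split()))
        now = datetime.now(timezone.utc).isoformat()
        return [FeatureRow(entity_id, self.name, count, self.name, self.version, now)]


def run_modules(modules: list[FeatureModule], entity_id: str, payload: dict) -> list[FeatureRow]:
    """Run every module on one entity and collect the rows for the feature store."""
    rows: list[FeatureRow] = []
    for module in modules:
        rows.extend(module.extract(entity_id, payload))
    return rows


print(run_modules([WordCount()], "e1", {"text": "cross dataset analysis"}))
```

Because every row carries the module name and version, a feature value can later be traced back to the code that produced it.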

Database Schema (Target Core Tables)

datasets — metadata, schema, source, license
dataset_files — object storage references
entities — canonical entities and types
entity_map — cross-dataset mapping rules (manual or from resolution jobs)
resolution_jobs — entity resolution job queue and status
features — normalized feature values with provenance
analysis_jobs — analysis configurations and status
analysis_results — tests, effect sizes, p-values
provenance_event — process audit tracking

Installation & Local Setup

Requirements

Docker + Docker Compose
~12 GB disk space
CPU-only operation is supported (GPU optional for faster inference)

Start services

Dev (API built locally):

docker compose up -d --build

Production (API and services from registry; Watchtower pulls new images from CI and restarts containers). Use the prod override and start via deploy/start.ps1 or docker compose:

# One-time: copy and edit deploy/deploy.env (see deploy/deploy.env.example)
# Start the production stack with the command below or with deploy/start.ps1 (schedule the script for automatic updates).
# The api, frontend, workers, and resolver services use pull_policy: always.
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

Ensure TRACEFIELD_API_IMAGE and the other image variables are set in deploy/deploy.env or .env, e.g. ghcr.io/<owner>/<repo>/api:main. See the production deployment section of RUNBOOK.md.

The web frontend follows the system light/dark preference, so users with a dark OS theme see the dark theme. To override it, run localStorage.setItem('color-theme', 'light') (or 'dark') in the browser console.

Kafka topics (required)

The pipeline expects ingest, features, and analysis topics (adjust per config):

docker compose exec -T kafka rpk topic create ingest
docker compose exec -T kafka rpk topic create features
docker compose exec -T kafka rpk topic create analysis

Verify:

docker compose ps

Load the LLM (optional)

docker exec -it <local-llm-container-name> ollama pull qwen2.5:7b-instruct-q4_K_M

Test:

curl http://localhost:8001/api/tags
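The same check can be scripted. The sketch below assumes that port 8001 exposes Ollama's /api/tags endpoint, as the curl command suggests, and that the response has Ollama's usual shape: a "models" list whose entries carry a "name". The helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def is_model_loaded(model: str, base_url: str = "http://localhost:8001") -> bool:
    """Query the running service and report whether the model has been pulled."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model in model_names(json.load(resp))


# Offline example using a hand-written payload instead of a live request:
sample = {"models": [{"name": "qwen2.5:7b-instruct-q4_K_M"}]}
print("qwen2.5:7b-instruct-q4_K_M" in model_names(sample))
```

Calling is_model_loaded("qwen2.5:7b-instruct-q4_K_M") requires the local LLM container to be running.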

Typical Usage Flow

Register a dataset (optionally infer schema from a pasted CSV/JSON sample)
Upload raw data
Map entities (manual via Entity Mappings UI, or automated via resolution jobs with embeddings)
Trigger feature extraction
Run analysis jobs and inspect results

Each processing step logs a provenance record for reproducibility.
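As an example of the first step, schema inference from a pasted CSV sample can be approximated by trying progressively looser type casts per column. This is a sketch only: the type labels and function names are hypothetical, and the API's actual inference rules may differ.

```python
import csv
import io


def infer_type(values: list[str]) -> str:
    """Return the narrowest type label that every non-empty value satisfies."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "string"
    for label, cast in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return label
        except ValueError:
            continue
    return "string"


def infer_schema(sample: str) -> dict[str, str]:
    """Map each CSV column in the sample to an inferred type label."""
    reader = csv.DictReader(io.StringIO(sample))
    rows = list(reader)
    return {
        col: infer_type([(row[col] or "") for row in rows])
        for col in (reader.fieldnames or [])
    }


print(infer_schema("id,name,score\n1,alice,3.5\n2,bob,4\n"))
```

Empty cells are ignored during inference, so a column with missing values can still be typed as numeric.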

Observability

Grafana dashboards are provisioned from grafana/ when you run docker compose up. The default Grafana login is admin / admin and the PostgreSQL datasource points at the local db container.

Dashboards:

Pipeline Observability provides counts for datasets, features, and jobs.

Alerts:

"Pipeline stuck" triggers when any record has not progressed within the SLA.
"Pipeline errors" triggers when error events have appeared in the last 15 minutes.
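The logic behind the two alerts can be sketched as plain predicates. The record and event fields, the status names, and the one-hour SLA below are hypothetical; the real alerts are defined in the provisioned Grafana configuration, and only the 15-minute error window comes from this README.

```python
from datetime import datetime, timedelta, timezone

TERMINAL_STATUSES = {"succeeded", "failed"}  # hypothetical status names


def stuck_records(records: list[dict], now: datetime, sla: timedelta) -> list[str]:
    """IDs of non-terminal records whose last update is older than the SLA."""
    return [
        r["id"]
        for r in records
        if r["status"] not in TERMINAL_STATUSES and now - r["updated_at"] > sla
    ]


def recent_errors(
    events: list[dict], now: datetime, window: timedelta = timedelta(minutes=15)
) -> list[dict]:
    """Error events that occurred within the look-back window."""
    return [e for e in events if e["level"] == "error" and now - e["at"] <= window]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"id": "job-1", "status": "running", "updated_at": now - timedelta(hours=2)},
    {"id": "job-2", "status": "running", "updated_at": now - timedelta(minutes=5)},
    {"id": "job-3", "status": "succeeded", "updated_at": now - timedelta(days=1)},
]
events = [
    {"level": "error", "at": now - timedelta(minutes=3)},
    {"level": "error", "at": now - timedelta(hours=1)},
    {"level": "info", "at": now - timedelta(minutes=1)},
]
print(stuck_records(records, now, sla=timedelta(hours=1)))
print(len(recent_errors(events, now)))
```

Finished jobs are excluded from the stuck check so that old, completed records do not trigger the alert.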

License

For research use only. External datasets must follow their respective licenses.

Citation

If you use Tracefield Lab for academic research, include the following in your citation:

Model name and version (if LLMs used)
Prompt hash (if LLMs used)
Processing date
Dataset source attribution
