OpenRunner

GitHub PyPI License: MIT Python 3.10+

Open-source, self-hosted ML experiment tracking platform -- a drop-in replacement for Weights & Biases.

Track experiments, run hyperparameter sweeps, trace LLM calls, version artifacts, manage models, and share reports. Same wandb.init() / wandb.log() API you already know.

Screenshots

Screenshots coming soon. See docs/screenshots for the capture guide.

Key screens: Dashboard | Run Table | Run Detail with Charts | Run Comparison | Sweep Parallel Coordinates | LLM Trace Waterfall | Artifact Lineage | Reports

Features

Core Experiment Tracking

  • W&B-compatible SDK -- switch by changing one import (import openrunner as wandb)
  • Experiment tracking -- metrics, hyperparameters, system stats (CPU/GPU/memory)
  • Run dashboard -- real-time charts, search/filter/sort, responsive with 10K+ runs
  • Run comparison -- overlay charts, bar/scatter/box plots, config diff
  • Custom dashboards -- drag-and-drop chart panels with persistent layouts
  • Full-fidelity export -- CSV, JSON, Parquet (no sampling limits)

Hyperparameter Sweeps

  • Grid, random, and Bayesian search strategies
  • Hyperband early termination -- stop underperforming runs automatically
  • W&B-compatible sweep config -- same YAML format, same API
  • Parallel coordinates visualization -- explore parameter-metric relationships

LLM Tracing

  • @openrunner.trace decorator -- capture inputs, outputs, timing, and errors
  • OpenAI auto-patching -- openrunner.trace.patch_openai() instruments all completions
  • Async support -- trace sync and async functions with automatic span nesting
  • Waterfall visualization -- inspect call chains, latency, and token usage

Launch (Remote Execution)

  • openrunner.launch() -- submit jobs to remote workers with Docker images
  • openrunner.launch.from_run() -- re-launch from a previous run's config
  • Job lifecycle -- .wait(), .cancel(), .refresh() for monitoring
  • Worker queue -- Redis-backed job dispatch with configurable concurrency
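The worker-queue model above can be pictured with a minimal in-memory sketch. The real dispatcher is Redis-backed; the `JobQueue` class, its method names, and the job ids here are purely illustrative:

```python
from collections import deque

class JobQueue:
    """Toy FIFO job queue illustrating dispatch with a concurrency cap."""

    def __init__(self, max_concurrency: int = 2):
        self.pending = deque()          # jobs waiting for a worker
        self.running = []               # jobs currently executing
        self.max_concurrency = max_concurrency

    def submit(self, job_id: str) -> None:
        self.pending.append(job_id)

    def dispatch(self) -> list[str]:
        """Move jobs from pending to running until the cap is hit."""
        started = []
        while self.pending and len(self.running) < self.max_concurrency:
            job = self.pending.popleft()
            self.running.append(job)
            started.append(job)
        return started

q = JobQueue(max_concurrency=2)
for j in ("job-a", "job-b", "job-c"):
    q.submit(j)
print(q.dispatch())  # first two jobs start; "job-c" stays queued
```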

Artifact Versioning & Model Registry

  • Artifact versioning -- datasets, models, checkpoints with SHA-256 content-addressed dedup
  • Model registry with aliases -- latest, production, staging for deployment workflows
  • openrunner.link_artifact() -- upload and set aliases in one call
  • name:alias syntax -- openrunner.use_artifact("my-model:production")
  • Lineage graph -- track producer/consumer relationships across runs
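Content-addressed dedup boils down to keying each file by its SHA-256 digest, so identical bytes are stored exactly once. A minimal sketch of the idea (the dict stand-in is illustrative, not the actual object-store layout):

```python
import hashlib

store: dict[str, bytes] = {}  # digest -> content (stand-in for object storage)

def put(content: bytes) -> str:
    """Store content under its SHA-256 digest; duplicate uploads are free."""
    digest = hashlib.sha256(content).hexdigest()
    store.setdefault(digest, content)  # no-op if the blob already exists
    return digest

a = put(b"model weights v1")
b = put(b"model weights v1")   # same bytes -> same key, no second copy
c = put(b"model weights v2")

assert a == b and a != c
assert len(store) == 2  # two unique blobs despite three uploads
```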

Alerts & Notifications

  • openrunner.alert() -- send alerts from training code (INFO/WARN/ERROR)
  • Slack webhooks -- receive alerts in your Slack channel
  • Email notifications -- delivery via Resend API
  • Console fallback -- alerts logged when no external channel is configured

Reports & Collaboration

  • Shareable reports -- snapshots with metric charts and run data
  • Report anonymization -- strip identifiers for blind paper reviews
  • Real-time collaborative editing -- WebSocket-based concurrent editing

Media Logging (10 types)

  • Image -- numpy arrays, PIL Images, file paths with captions
  • Table -- structured columnar data
  • Audio -- WAV serialization from numpy arrays or file paths
  • Video -- file paths or numpy array sequences
  • Histogram -- distribution visualization from numeric arrays
  • Plotly -- interactive Plotly figures (JSON serialization)
  • PlotlyChart -- enhanced Plotly with static PNG fallback
  • MatplotlibFigure -- capture matplotlib figures as PNG
  • PointCloud3D -- 3D point cloud visualization
  • BoundingBoxes2D -- bounding box overlay on images

Enterprise

  • SAML/OIDC SSO -- Okta, Azure AD, OneLogin, and any SAML 2.0 / OpenID Connect IdP
  • OAuth -- Google and GitHub social login
  • Audit logs -- compliance-ready event trail with filtering by action, user, resource, date
  • Organization management -- teams, roles (admin/member), invitations

Programmatic Query API

  • openrunner.Api() -- read-only access to projects, runs, artifacts
  • Filter and sort -- api.runs("project", filters={"state": "finished"}, order="-summary.accuracy")
  • History export -- run.history(pandas=True) returns a DataFrame

Framework Integrations (8 frameworks)

  • PyTorch -- gradient logging
  • HuggingFace Transformers -- OpenRunnerCallback for Trainer
  • PyTorch Lightning -- OpenRunnerLogger for pl.Trainer
  • Keras -- callback for training loops
  • XGBoost -- callback for boosting rounds
  • Scikit-learn -- experiment logging wrapper
  • FastAI -- Learner callback
  • LangChain -- chain/agent tracing callback

Infrastructure

  • Self-hosted -- Docker Compose or Kubernetes, your data stays yours
  • Offline mode -- JSONL storage, idempotent sync when back online
  • Kubernetes Helm chart -- production-ready deployment with HA
  • Email delivery via Resend -- transactional emails without SMTP setup

Quick Start

1. Deploy the server

git clone https://github.com/jqueguiner/openrunner.git
cd openrunner
docker compose up -d

All 5 services start automatically: PostgreSQL, Redis, MinIO, API server, frontend.

Open http://localhost:3000 and create an account.

Hit a snag on first run? See Troubleshooting first-run for the issues most fresh-clone users encounter (port mismatches, an empty SECRET_KEY, MinIO endpoint and credential drift).

2. Install the SDK

pip install openrunner-sdk

3. Set your credentials

export OPENRUNNER_API_KEY="or_your_key"
export OPENRUNNER_BASE_URL="http://localhost:8000"

Get your API key from the dashboard under Settings > API Keys.

4. Track experiments

import openrunner

openrunner.init(project="my-project", config={"lr": 0.001, "epochs": 10})

for epoch in range(10):
    loss = train(epoch)
    acc = evaluate()
    openrunner.log({"loss": loss, "accuracy": acc, "epoch": epoch})

openrunner.finish()

Migrating from W&B

import openrunner as wandb

wandb.init(project="my-project")
wandb.log({"loss": 0.5})
wandb.finish()

Sweeps Quick Start

import openrunner

sweep_config = {
    "method": "bayes",
    "name": "lr-sweep",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 0.0001, "max": 0.1, "distribution": "log_uniform_values"},
        "batch_size": {"values": [16, 32, 64]},
        "epochs": {"value": 10},
    },
    "early_terminate": {"type": "hyperband", "min_iter": 3, "eta": 3},
}

sweep_id = openrunner.sweep(sweep_config, project="my-project")

def train_fn():
    run = openrunner.init()
    lr = openrunner.config.lr
    for epoch in range(openrunner.config.epochs):
        loss = train(lr=lr, epoch=epoch)
        openrunner.log({"val_loss": loss})
    openrunner.finish()

openrunner.agent(sweep_id, function=train_fn)

Sweep methods: grid, random, bayes. Early termination via Hyperband stops underperforming runs.
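With `min_iter: 3` and `eta: 3` as in the config above, Hyperband evaluates runs at geometrically spaced rungs (3, 9, 27, ... iterations), keeping roughly the top 1/eta of runs at each rung. A sketch of that rung schedule (the helper name is hypothetical):

```python
def hyperband_rungs(min_iter: int, eta: int, max_iter: int) -> list[int]:
    """Iteration counts at which Hyperband considers stopping runs."""
    rungs = []
    r = min_iter
    while r <= max_iter:
        rungs.append(r)
        r *= eta  # each rung is eta times longer than the previous one
    return rungs

print(hyperband_rungs(min_iter=3, eta=3, max_iter=100))  # [3, 9, 27, 81]
```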


LLM Tracing Quick Start

import openai
import openrunner

# Decorator-based tracing
@openrunner.trace
def summarize(text: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.choices[0].message.content

# Or auto-patch all OpenAI calls
openrunner.trace.patch_openai()

openrunner.init(project="llm-app")
result = summarize("Long article text...")
openrunner.finish()

Traces capture inputs, outputs, duration, errors, and token usage. View them in the Trace Waterfall UI.
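Under the hood, a tracing decorator of this kind wraps the function, records its inputs, output, duration, and any exception, and emits a span. A simplified, dependency-free sketch (the real `@openrunner.trace` also handles async functions and span nesting; the `spans` list stands in for the backend):

```python
import functools
import time

spans = []  # stand-in for the trace backend

def trace(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = {"name": fn.__name__, "inputs": {"args": args, "kwargs": kwargs}}
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            span["output"] = result
            return result
        except Exception as exc:
            span["error"] = repr(exc)  # errors are captured, then re-raised
            raise
        finally:
            span["duration_s"] = time.perf_counter() - start
            spans.append(span)
    return wrapper

@trace
def summarize(text: str) -> str:
    return text[:10] + "..."

summarize("Long article text")
print(spans[0]["name"], spans[0]["output"])
```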


Launch Quick Start

import openrunner

# Submit a remote training job
job = openrunner.launch(
    project="my-project",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    command="python train.py --lr 0.001 --epochs 50",
    name="gpu-training-run",
)

# Wait for completion
final_state = job.wait(poll_interval=10.0)
print(f"Job finished with state: {final_state}")

# Re-launch from a previous run's config
job2 = openrunner.launch.from_run(
    run_id="abc12345",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    command="python train.py",
)

Model Registry Quick Start

import openrunner

openrunner.init(project="my-project")

# Log a model artifact with aliases
artifact = openrunner.Artifact(name="resnet50", type="model")
artifact.add_file("model.pth")
openrunner.link_artifact(artifact, aliases=["production", "v2.1"])

openrunner.finish()

# Later -- download by alias
openrunner.init(project="my-project")
model_dir = openrunner.use_artifact("resnet50:production")
# model_dir is a Path to the cached local directory

Built-in aliases: latest (auto-set), plus custom aliases like production, staging, best.
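The `name:alias` syntax can be resolved with a simple split, defaulting to `latest` when no alias is given. A hypothetical sketch of that resolution step (not the SDK's actual parser):

```python
def parse_artifact_ref(ref: str) -> tuple[str, str]:
    """Split "name:alias" into (name, alias); alias defaults to "latest"."""
    name, sep, alias = ref.partition(":")
    return name, alias if sep else "latest"

print(parse_artifact_ref("resnet50:production"))  # ('resnet50', 'production')
print(parse_artifact_ref("resnet50"))             # ('resnet50', 'latest')
```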


Query API Quick Start

import openrunner

api = openrunner.Api()

# List projects
projects = api.projects()

# Query runs with filters and ordering
runs = api.runs("my-project", filters={"state": "finished"}, order="-summary.accuracy")

# Access run details
for run in runs:
    print(f"{run.name}: accuracy={run.summary.get('accuracy')}")

# Get full metric history
run = api.run("my-project/abc12345")
history = run.history()             # List of dicts
df = run.history(pandas=True)       # pandas DataFrame

# Get artifact by alias
artifact = api.artifact("resnet50:production")

Alerts & Notifications

import openrunner

openrunner.init(project="my-project")

# Send alerts from training code
openrunner.alert("Training complete", text="Final accuracy: 0.95", level="INFO")
openrunner.alert("Loss spike", text="Loss jumped from 0.1 to 5.0", level="WARN")
openrunner.alert("OOM Error", text="GPU out of memory at batch 1024", level="ERROR")

openrunner.finish()

Configure Slack webhooks via the SLACK_WEBHOOK_URL environment variable. Email alerts use Resend (RESEND_API_KEY).
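A Slack incoming webhook simply receives a JSON body with a `text` field, so an alert maps to a payload along these lines (the formatting and emoji choices here are an illustration, not the server's exact message shape):

```python
import json

def slack_payload(title: str, text: str, level: str = "INFO") -> str:
    """Render an alert as a Slack incoming-webhook JSON body."""
    emoji = {"INFO": ":information_source:", "WARN": ":warning:", "ERROR": ":rotating_light:"}
    body = {"text": f"{emoji.get(level, '')} *{title}* ({level})\n{text}"}
    return json.dumps(body)

payload = slack_payload("Loss spike", "Loss jumped from 0.1 to 5.0", level="WARN")
# POST this body to SLACK_WEBHOOK_URL with Content-Type: application/json
```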


Architecture

openrunner/
  src/
    api/          # FastAPI backend (Python)
    web/          # React frontend (TypeScript)
  sdk/            # Python SDK (pip install openrunner-sdk)
  helm/           # Kubernetes Helm chart
  examples/       # Demo scripts (MNIST, artifacts)
  docker-compose.yml
| Component      | Technology                                           |
| -------------- | ---------------------------------------------------- |
| Backend        | Python, FastAPI, SQLAlchemy (async), asyncpg         |
| Frontend       | React 19, TypeScript, Vite, ECharts, TanStack Table  |
| Database       | PostgreSQL 16 (partitioned metrics table)            |
| Object storage | MinIO (S3-compatible)                                |
| Cache/pubsub   | Redis                                                |
| SDK            | Python, httpx, Click CLI                             |
| Email          | Resend API (no SMTP needed)                          |
| SSO            | SAML 2.0, OpenID Connect                             |
| Deployment     | Docker Compose, Kubernetes Helm chart                |

Backend API Routes

| Route Group    | Purpose                                       |
| -------------- | --------------------------------------------- |
| /auth          | Login, register, OAuth, SSO (OIDC/SAML)       |
| /users         | User profile management                       |
| /organizations | Org CRUD, member management                   |
| /projects      | Project CRUD                                  |
| /runs          | Run lifecycle, metrics, config, summary       |
| /sweeps        | Sweep creation, parameter suggestions, status |
| /jobs          | Launch job submission and monitoring          |
| /traces        | LLM trace and span storage                    |
| /artifacts     | Artifact versioning, file upload/download     |
| /registry      | Model registry aliases                        |
| /media         | Image, audio, video, table upload             |
| /alerts        | Alert creation and listing                    |
| /reports       | Report CRUD, sharing, anonymization           |
| /dashboards    | Custom dashboard layouts                      |
| /streams       | Server-Sent Events for real-time updates      |
| /export        | CSV, JSON, Parquet data export                |
| /audit-logs    | Organization audit trail (admin only)         |
| /tags          | Tag CRUD and bulk operations                  |
| /health        | Health check endpoint                         |

SDK Documentation

Full API reference: openrunner-sdk on PyPI

Core API

import openrunner

# Initialize a run
run = openrunner.init(
    project="my-project",
    name="experiment-1",
    config={"lr": 0.001},
    tags=["baseline"],
    notes="First experiment",
    group="ablation-study",
    resume=True,  # or "must" to require an existing run
)

# Log metrics (non-blocking)
openrunner.log({"loss": 0.5, "accuracy": 0.85})

# Log media
openrunner.log({"samples": openrunner.Image(img_array, caption="epoch 5")})
openrunner.log({"audio": openrunner.Audio(waveform, sample_rate=16000)})
openrunner.log({"fig": openrunner.Plotly(plotly_fig)})
openrunner.log({"detections": openrunner.BoundingBoxes2D(image, boxes)})

# Log tables
table = openrunner.Table(
    columns=["input", "predicted", "actual"],
    data=[["img_01", 7, 7], ["img_02", 3, 5]],
)
openrunner.log({"results": table})

# Log and manage artifacts
artifact = openrunner.Artifact(name="my-model", type="model")
artifact.add_file("model.pth")
run.log_artifact(artifact)

# Config & Summary
openrunner.config["batch_size"] = 64
openrunner.config.optimizer.lr  # dot notation access
openrunner.summary["best_accuracy"] = 0.95

# Alerts
openrunner.alert("Training done", level="INFO")

# Finish
openrunner.finish()

Framework Integrations

# HuggingFace Transformers
from openrunner.integration.huggingface import OpenRunnerCallback
trainer = Trainer(callbacks=[OpenRunnerCallback()])

# PyTorch Lightning
from openrunner.integration.lightning import OpenRunnerLogger
trainer = pl.Trainer(logger=OpenRunnerLogger(project="my-project"))

# PyTorch
from openrunner.integration.pytorch import log_gradients
log_gradients(model)

# Keras
from openrunner.integration.keras import OpenRunnerCallback
model.fit(x, y, callbacks=[OpenRunnerCallback()])

# XGBoost
from openrunner.integration.xgboost import OpenRunnerCallback
xgb.train(params, dtrain, callbacks=[OpenRunnerCallback()])

# Scikit-learn
from openrunner.integration.sklearn import log_model
log_model(clf, X_test, y_test)

# FastAI
from openrunner.integration.fastai import OpenRunnerCallback
learn = cnn_learner(dls, resnet34, cbs=[OpenRunnerCallback()])

# LangChain
from openrunner.integration.langchain import OpenRunnerCallback
chain.invoke({"input": "hello"}, config={"callbacks": [OpenRunnerCallback()]})

Install framework extras:

pip install openrunner-sdk[pytorch]
pip install openrunner-sdk[huggingface]
pip install openrunner-sdk[lightning]
pip install openrunner-sdk[keras]
pip install openrunner-sdk[xgboost]
pip install openrunner-sdk[sklearn]
pip install openrunner-sdk[fastai]
pip install openrunner-sdk[gpu]  # NVIDIA GPU monitoring

Offline Mode

export OPENRUNNER_MODE=offline
python train.py
openrunner sync   # Upload when back online

Offline runs are stored as JSONL files. Config is saved at init() time (not finish()), so it survives crashes.
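The JSONL model is easy to picture: each logged event is one JSON line, and sync stays idempotent by deduplicating on a per-record key before upload. A simplified sketch (the field names and in-memory `uploaded` set are illustrative, not the SDK's actual wire format):

```python
import json

offline_log = [
    '{"record_id": 1, "metrics": {"loss": 0.9}}',
    '{"record_id": 2, "metrics": {"loss": 0.7}}',
    '{"record_id": 2, "metrics": {"loss": 0.7}}',  # duplicate from a retried write
]

uploaded: set[int] = set()  # record ids the server has already accepted

def sync(lines: list[str]) -> int:
    """Upload each JSONL record at most once; safe to re-run."""
    sent = 0
    for line in lines:
        record = json.loads(line)
        if record["record_id"] not in uploaded:
            uploaded.add(record["record_id"])
            sent += 1
    return sent

print(sync(offline_log))  # 2 -- the duplicate line is skipped
print(sync(offline_log))  # 0 -- re-running sync uploads nothing
```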

CLI

openrunner login   # Store API key
openrunner sync    # Upload offline runs
openrunner ls      # List projects and runs

Kubernetes Deployment

A production-ready Helm chart is included for Kubernetes clusters:

helm install openrunner ./helm/openrunner \
  --namespace openrunner --create-namespace \
  --set api.replicas=3 \
  --set image.tag=0.2.0 \
  --set env.DATABASE_URL="postgresql+asyncpg://user:pass@db:5432/openrunner" \
  --set env.SECRET_KEY="your-production-secret"

The chart deploys: API server, web frontend, PostgreSQL, Redis, MinIO -- all with resource limits, health checks, and PVC storage.

For external databases in production, disable the bundled PostgreSQL:

# values-production.yaml
postgresql:
  enabled: false
env:
  DATABASE_URL: "postgresql+asyncpg://user:pass@your-rds-host:5432/openrunner"

See helm/openrunner/ for full chart documentation and values.yaml.


Environment Variables

SDK (Client)

| Variable                  | Description                        | Default               |
| ------------------------- | ---------------------------------- | --------------------- |
| OPENRUNNER_API_KEY        | API key for server authentication  | (required)            |
| OPENRUNNER_BASE_URL       | Server URL                         | http://localhost:8000 |
| OPENRUNNER_PROJECT        | Default project name               | (none)                |
| OPENRUNNER_MODE           | online or offline                  | online                |
| OPENRUNNER_SYSTEM_METRICS | Enable GPU/CPU/memory monitoring   | true                  |

W&B environment variables (WANDB_API_KEY, WANDB_BASE_URL) are supported as fallback for migration.

Server

| Variable                  | Description | Default |
| ------------------------- | ----------- | ------- |
| DATABASE_URL              | PostgreSQL connection string | postgresql+asyncpg://openrunner:openrunner@postgres:5432/openrunner |
| DB_USE_PGBOUNCER          | Enable PgBouncer compatibility mode | false |
| REDIS_URL                 | Redis connection string | redis://redis:6379/0 |
| MINIO_INTERNAL_ENDPOINT   | MinIO/S3 endpoint reached from inside the docker/cluster network -- used for API & worker reads/writes. Legacy MINIO_ENDPOINT is still accepted. | minio:9000 |
| MINIO_PUBLIC_ENDPOINT     | MinIO/S3 endpoint baked into presigned URLs returned to the browser/SDK. Must resolve from outside the cluster. Legacy MINIO_EXTERNAL_ENDPOINT is still accepted. | localhost:9000 |
| MINIO_ACCESS_KEY          | MinIO/S3 access key | minioadmin |
| MINIO_SECRET_KEY          | MinIO/S3 secret key | minioadmin |
| MINIO_SECURE              | Use HTTPS for MinIO | false |
| MINIO_BUCKET              | Object storage bucket name | openrunner |
| SECRET_KEY                | Application secret key | change-me-in-production |
| DEBUG                     | Enable debug mode | false |
| API_HOST                  | API bind address | 0.0.0.0 |
| API_PORT                  | API bind port | 8000 |
| MAX_UPLOAD_SIZE           | Max artifact upload size accepted by the in-stack nginx /api/ proxy. Sized for typical ML model artifacts; lower it for shared deployments. Restart the web service to apply. | 5g |
| JWT_ACCESS_SECRET         | JWT access token signing key | change-me-access-secret |
| JWT_REFRESH_SECRET        | JWT refresh token signing key | change-me-refresh-secret |
| JWT_ACCESS_EXPIRE_MINUTES | Access token lifetime | 15 |
| JWT_REFRESH_EXPIRE_DAYS   | Refresh token lifetime | 30 |
| FRONTEND_URL              | Frontend URL for CORS and redirects | http://localhost:3000 |

OAuth & SSO

| Variable             | Description | Default |
| -------------------- | ----------- | ------- |
| GOOGLE_CLIENT_ID     | Google OAuth client ID | (empty) |
| GOOGLE_CLIENT_SECRET | Google OAuth client secret | (empty) |
| GOOGLE_REDIRECT_URI  | Google OAuth callback URL | http://localhost:8000/api/v1/auth/google/callback |
| GITHUB_CLIENT_ID     | GitHub OAuth client ID | (empty) |
| GITHUB_CLIENT_SECRET | GitHub OAuth client secret | (empty) |
| GITHUB_REDIRECT_URI  | GitHub OAuth callback URL | http://localhost:8000/api/v1/auth/github/callback |
| OIDC_CLIENT_ID       | OpenID Connect client ID | (empty) |
| OIDC_CLIENT_SECRET   | OpenID Connect client secret | (empty) |
| OIDC_DISCOVERY_URL   | OIDC discovery endpoint (.well-known/openid-configuration) | (empty) |
| OIDC_PROVIDER_NAME   | Display name for OIDC button | SSO |
| OIDC_REDIRECT_URI    | OIDC callback URL | http://localhost:8000/api/v1/auth/oidc/callback |
| SAML_ENTITY_ID       | SAML Service Provider entity ID | (empty) |
| SAML_SSO_URL         | SAML IdP single sign-on URL | (empty) |
| SAML_CERTIFICATE     | SAML IdP certificate (PEM) | (empty) |
| SAML_SP_CERT         | SAML SP certificate (PEM) | (empty) |
| SAML_SP_KEY          | SAML SP private key (PEM) | (empty) |

Notifications & Email

| Variable          | Description                       | Default           |
| ----------------- | --------------------------------- | ----------------- |
| SLACK_WEBHOOK_URL | Slack incoming webhook for alerts | (none)            |
| RESEND_API_KEY    | Resend API key for email delivery | (none)            |
| RESEND_FROM_EMAIL | Sender email address              | noreply@gladia.io |

Production Deployment

For production (e.g., openrun.gladia.io):

# Use external PostgreSQL
cp docker-compose.prod.yml docker-compose.override.yml
# Edit .env with your database credentials

# Set up HTTPS
# Add DNS A record pointing to your server
# Use certbot or your reverse proxy for SSL

See docker-compose.prod.yml for external database configuration.


Server Requirements

  • Docker and Docker Compose (or Kubernetes with Helm)
  • 2GB RAM minimum (4GB+ recommended for production)
  • 10GB disk (grows with data)
  • Python 3.10+ for the SDK

Contributing

MIT licensed. PRs welcome.

# Development setup
git clone https://github.com/jqueguiner/openrunner.git
cd openrunner
make install   # Install backend deps
cd src/web && npm install   # Install frontend deps
docker compose up -d   # Start infra (DB, Redis, MinIO)
make dev   # Start API dev server
cd src/web && npm run dev   # Start frontend dev server

License

MIT
