Open-source, self-hosted ML experiment tracking platform -- a drop-in replacement for Weights & Biases.
Track experiments, run hyperparameter sweeps, trace LLM calls, version artifacts, manage models, and share reports. Same wandb.init() / wandb.log() API you already know.
Screenshots coming soon. See docs/screenshots for the capture guide.
Key screens: Dashboard | Run Table | Run Detail with Charts | Run Comparison | Sweep Parallel Coordinates | LLM Trace Waterfall | Artifact Lineage | Reports
- W&B-compatible SDK -- switch by changing one import (`import openrunner as wandb`)
- Experiment tracking -- metrics, hyperparameters, system stats (CPU/GPU/memory)
- Run dashboard -- real-time charts, search/filter/sort, 10K+ run performance
- Run comparison -- overlay charts, bar/scatter/box plots, config diff
- Custom dashboards -- drag-and-drop chart panels with persistent layouts
- Full-fidelity export -- CSV, JSON, Parquet (no sampling limits)
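To illustrate what full-fidelity export looks like downstream, here is a small sketch that flattens history records (the list-of-dicts shape returned by `run.history()`) into CSV using only the standard library. The sample records and their fields are made up for illustration.

```python
import csv
import io

# Hypothetical history records, shaped like the list of dicts
# that run.history() returns (one dict per logged step).
history = [
    {"step": 0, "loss": 0.92, "accuracy": 0.41},
    {"step": 1, "loss": 0.57, "accuracy": 0.63},
    {"step": 2, "loss": 0.31, "accuracy": 0.78},
]

# Collect every key so sparse records still line up in columns.
columns = sorted({key for record in history for key in record})

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=columns)
writer.writeheader()
writer.writerows(history)

print(buf.getvalue())
```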
- Grid, random, and Bayesian search strategies
- Hyperband early termination -- stop underperforming runs automatically
- W&B-compatible sweep config -- same YAML format, same API
- Parallel coordinates visualization -- explore parameter-metric relationships
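For intuition on Hyperband's `min_iter` and `eta` parameters (the ones used in the sweep config later in this README): runs are compared at geometrically spaced rungs, and only the best fraction survive each rung. A minimal sketch of the rung schedule, not OpenRunner's internal implementation:

```python
def hyperband_rungs(min_iter: int, eta: int, max_iter: int) -> list[int]:
    """Iteration counts at which runs are compared and culled."""
    rungs = []
    r = min_iter
    while r <= max_iter:
        rungs.append(r)
        r *= eta  # each rung is eta times longer than the last
    return rungs

# With min_iter=3, eta=3, and a 100-epoch budget, runs are checked at:
print(hyperband_rungs(3, 3, 100))  # → [3, 9, 27, 81]
```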
- `@openrunner.trace` decorator -- capture inputs, outputs, timing, and errors
- OpenAI auto-patching -- `openrunner.trace.patch_openai()` instruments all completions
- Async support -- trace sync and async functions with automatic span nesting
- Waterfall visualization -- inspect call chains, latency, and token usage
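Automatic span nesting typically works by tracking the current parent span in context-local state, which survives across `await` points. A simplified sketch of the idea (not OpenRunner's actual tracer):

```python
import contextvars
import uuid

_current_span = contextvars.ContextVar("current_span", default=None)

spans = []  # collected span records

def traced(fn):
    """Record a span for each call, parented to the active span."""
    def wrapper(*args, **kwargs):
        span = {"id": uuid.uuid4().hex, "name": fn.__name__,
                "parent": _current_span.get()}
        spans.append(span)
        token = _current_span.set(span["id"])
        try:
            return fn(*args, **kwargs)
        finally:
            _current_span.reset(token)  # restore the parent on exit
    return wrapper

@traced
def child():
    return "ok"

@traced
def parent():
    return child()

parent()
# spans[1]["parent"] == spans[0]["id"] -- child nests under parent
```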
- `openrunner.launch()` -- submit jobs to remote workers with Docker images
- `openrunner.launch.from_run()` -- re-launch from a previous run's config
- Job lifecycle -- `.wait()`, `.cancel()`, `.refresh()` for monitoring
- Worker queue -- Redis-backed job dispatch with configurable concurrency
- Artifact versioning -- datasets, models, checkpoints with SHA-256 content-addressed dedup
- Model registry with aliases -- `latest`, `production`, `staging` for deployment workflows
- `openrunner.link_artifact()` -- upload and set aliases in one call
- `name:alias` syntax -- `openrunner.use_artifact("my-model:production")`
- Lineage graph -- track producer/consumer relationships across runs
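Content-addressed dedup means a file's storage key is derived from its bytes, so identical files are stored once no matter how often or under what name they are uploaded. A minimal illustration of the SHA-256 scheme (not the server's actual storage layout):

```python
import hashlib

store = {}  # digest -> bytes, standing in for object storage

def put(data: bytes) -> str:
    """Store data under its SHA-256 digest; identical content dedups."""
    digest = hashlib.sha256(data).hexdigest()
    store.setdefault(digest, data)
    return digest

key_a = put(b"model weights v1")
key_b = put(b"model weights v1")  # same bytes, same key, no new object
key_c = put(b"model weights v2")

print(key_a == key_b, len(store))  # → True 2
```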
- `openrunner.alert()` -- send alerts from training code (INFO/WARN/ERROR)
- Slack webhooks -- receive alerts in your Slack channel
- Email notifications -- delivery via Resend API
- Console fallback -- alerts logged when no external channel is configured
- Shareable reports -- snapshots with metric charts and run data
- Report anonymization -- strip identifiers for blind paper reviews
- Real-time collaborative editing -- WebSocket-based concurrent editing
- Image -- numpy arrays, PIL Images, file paths with captions
- Table -- structured columnar data
- Audio -- WAV serialization from numpy arrays or file paths
- Video -- file paths or numpy array sequences
- Histogram -- distribution visualization from numeric arrays
- Plotly -- interactive Plotly figures (JSON serialization)
- PlotlyChart -- enhanced Plotly with static PNG fallback
- MatplotlibFigure -- capture matplotlib figures as PNG
- PointCloud3D -- 3D point cloud visualization
- BoundingBoxes2D -- bounding box overlay on images
- SAML/OIDC SSO -- Okta, Azure AD, OneLogin, and any SAML 2.0 / OpenID Connect IdP
- OAuth -- Google and GitHub social login
- Audit logs -- compliance-ready event trail with filtering by action, user, resource, date
- Organization management -- teams, roles (admin/member), invitations
- `openrunner.Api()` -- read-only access to projects, runs, artifacts
- Filter and sort -- `api.runs("project", filters={"state": "finished"}, order="-summary.accuracy")`
- History export -- `run.history(pandas=True)` returns a DataFrame
- PyTorch -- gradient logging
- HuggingFace Transformers -- `OpenRunnerCallback` for Trainer
- PyTorch Lightning -- `OpenRunnerLogger` for pl.Trainer
- Keras -- callback for training loops
- XGBoost -- callback for boosting rounds
- Scikit-learn -- experiment logging wrapper
- FastAI -- Learner callback
- LangChain -- chain/agent tracing callback
- Self-hosted -- Docker Compose or Kubernetes, your data stays yours
- Offline mode -- JSONL storage, idempotent sync when back online
- Kubernetes Helm chart -- production-ready deployment with HA
- Email delivery via Resend -- transactional emails without SMTP setup
```bash
git clone https://github.com/jqueguiner/openrunner.git
cd openrunner
docker compose up -d
```

All 5 services start automatically: PostgreSQL, Redis, MinIO, API server, frontend.
Open http://localhost:3000 and create an account.
Hit a snag on first run? See Troubleshooting first-run for the issues most fresh-clone users encounter (port mismatch, empty `SECRET_KEY`, MinIO endpoint and credential drift).
```bash
pip install openrunner-sdk
export OPENRUNNER_API_KEY="or_your_key"
export OPENRUNNER_BASE_URL="http://localhost:8000"
```

Get your API key from the dashboard under Settings > API Keys.
```python
import openrunner

openrunner.init(project="my-project", config={"lr": 0.001, "epochs": 10})

for epoch in range(10):
    loss = train(epoch)
    acc = evaluate()
    openrunner.log({"loss": loss, "accuracy": acc, "epoch": epoch})

openrunner.finish()
```

Or keep your existing W&B code unchanged:

```python
import openrunner as wandb

wandb.init(project="my-project")
wandb.log({"loss": 0.5})
wandb.finish()
```

```python
import openrunner

sweep_config = {
    "method": "bayes",
    "name": "lr-sweep",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 0.0001, "max": 0.1, "distribution": "log_uniform_values"},
        "batch_size": {"values": [16, 32, 64]},
        "epochs": {"value": 10},
    },
    "early_terminate": {"type": "hyperband", "min_iter": 3, "eta": 3},
}

sweep_id = openrunner.sweep(sweep_config, project="my-project")

def train_fn():
    run = openrunner.init()
    lr = openrunner.config.lr
    for epoch in range(openrunner.config.epochs):
        loss = train(lr=lr, epoch=epoch)
        openrunner.log({"val_loss": loss})
    openrunner.finish()

openrunner.agent(sweep_id, function=train_fn)
```

Sweep methods: grid, random, bayes. Early termination via Hyperband stops underperforming runs.
```python
import openrunner

# Decorator-based tracing
@openrunner.trace
def summarize(text: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.choices[0].message.content

# Or auto-patch all OpenAI calls
openrunner.trace.patch_openai()

openrunner.init(project="llm-app")
result = summarize("Long article text...")
openrunner.finish()
```

Traces capture inputs, outputs, duration, errors, and token usage. View them in the Trace Waterfall UI.
```python
import openrunner

# Submit a remote training job
job = openrunner.launch(
    project="my-project",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    command="python train.py --lr 0.001 --epochs 50",
    name="gpu-training-run",
)

# Wait for completion
final_state = job.wait(poll_interval=10.0)
print(f"Job finished with state: {final_state}")

# Re-launch from a previous run's config
job2 = openrunner.launch.from_run(
    run_id="abc12345",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    command="python train.py",
)
```

```python
import openrunner

openrunner.init(project="my-project")

# Log a model artifact with aliases
artifact = openrunner.Artifact(name="resnet50", type="model")
artifact.add_file("model.pth")
openrunner.link_artifact(artifact, aliases=["production", "v2.1"])
openrunner.finish()

# Later -- download by alias
openrunner.init(project="my-project")
model_dir = openrunner.use_artifact("resnet50:production")
# model_dir is a Path to the cached local directory
```

Built-in aliases: `latest` (auto-set), plus custom aliases like `production`, `staging`, `best`.
```python
import openrunner

api = openrunner.Api()

# List projects
projects = api.projects()

# Query runs with filters and ordering
runs = api.runs("my-project", filters={"state": "finished"}, order="-summary.accuracy")

# Access run details
for run in runs:
    print(f"{run.name}: accuracy={run.summary.get('accuracy')}")

# Get full metric history
run = api.run("my-project/abc12345")
history = run.history()        # List of dicts
df = run.history(pandas=True)  # pandas DataFrame

# Get artifact by alias
artifact = api.artifact("resnet50:production")
```

```python
import openrunner

openrunner.init(project="my-project")

# Send alerts from training code
openrunner.alert("Training complete", text="Final accuracy: 0.95", level="INFO")
openrunner.alert("Loss spike", text="Loss jumped from 0.1 to 5.0", level="WARN")
openrunner.alert("OOM Error", text="GPU out of memory at batch 1024", level="ERROR")
openrunner.finish()
```

Configure Slack webhooks via the `SLACK_WEBHOOK_URL` environment variable. Email alerts use Resend (`RESEND_API_KEY`).
```
openrunner/
  src/
    api/        # FastAPI backend (Python)
    web/        # React frontend (TypeScript)
    sdk/        # Python SDK (pip install openrunner-sdk)
  helm/         # Kubernetes Helm chart
  examples/     # Demo scripts (MNIST, artifacts)
  docker-compose.yml
```
| Component | Technology |
|---|---|
| Backend | Python, FastAPI, SQLAlchemy (async), asyncpg |
| Frontend | React 19, TypeScript, Vite, ECharts, TanStack Table |
| Database | PostgreSQL 16 (partitioned metrics table) |
| Object storage | MinIO (S3-compatible) |
| Cache/pubsub | Redis |
| SDK | Python, httpx, Click CLI |
| Email | Resend API (no SMTP needed) |
| SSO | SAML 2.0, OpenID Connect |
| Deployment | Docker Compose, Kubernetes Helm chart |
| Route Group | Purpose |
|---|---|
| `/auth` | Login, register, OAuth, SSO (OIDC/SAML) |
| `/users` | User profile management |
| `/organizations` | Org CRUD, member management |
| `/projects` | Project CRUD |
| `/runs` | Run lifecycle, metrics, config, summary |
| `/sweeps` | Sweep creation, parameter suggestions, status |
| `/jobs` | Launch job submission and monitoring |
| `/traces` | LLM trace and span storage |
| `/artifacts` | Artifact versioning, file upload/download |
| `/registry` | Model registry aliases |
| `/media` | Image, audio, video, table upload |
| `/alerts` | Alert creation and listing |
| `/reports` | Report CRUD, sharing, anonymization |
| `/dashboards` | Custom dashboard layouts |
| `/streams` | Server-Sent Events for real-time updates |
| `/export` | CSV, JSON, Parquet data export |
| `/audit-logs` | Organization audit trail (admin only) |
| `/tags` | Tag CRUD and bulk operations |
| `/health` | Health check endpoint |
Full API reference: openrunner-sdk on PyPI
```python
import openrunner

# Initialize a run
run = openrunner.init(
    project="my-project",
    name="experiment-1",
    config={"lr": 0.001},
    tags=["baseline"],
    notes="First experiment",
    group="ablation-study",
    resume=True,  # or "must" to require an existing run
)

# Log metrics (non-blocking)
openrunner.log({"loss": 0.5, "accuracy": 0.85})

# Log media
openrunner.log({"samples": openrunner.Image(img_array, caption="epoch 5")})
openrunner.log({"audio": openrunner.Audio(waveform, sample_rate=16000)})
openrunner.log({"fig": openrunner.Plotly(plotly_fig)})
openrunner.log({"detections": openrunner.BoundingBoxes2D(image, boxes)})

# Log tables
table = openrunner.Table(
    columns=["input", "predicted", "actual"],
    data=[["img_01", 7, 7], ["img_02", 3, 5]],
)
openrunner.log({"results": table})

# Log and manage artifacts
artifact = openrunner.Artifact(name="my-model", type="model")
artifact.add_file("model.pth")
run.log_artifact(artifact)

# Config & Summary
openrunner.config["batch_size"] = 64
openrunner.config.optimizer.lr  # dot notation access
openrunner.summary["best_accuracy"] = 0.95

# Alerts
openrunner.alert("Training done", level="INFO")

# Finish
openrunner.finish()
```

```python
# HuggingFace Transformers
from openrunner.integration.huggingface import OpenRunnerCallback
trainer = Trainer(callbacks=[OpenRunnerCallback()])

# PyTorch Lightning
from openrunner.integration.lightning import OpenRunnerLogger
trainer = pl.Trainer(logger=OpenRunnerLogger(project="my-project"))

# PyTorch
from openrunner.integration.pytorch import log_gradients
log_gradients(model)

# Keras
from openrunner.integration.keras import OpenRunnerCallback
model.fit(x, y, callbacks=[OpenRunnerCallback()])

# XGBoost
from openrunner.integration.xgboost import OpenRunnerCallback
xgb.train(params, dtrain, callbacks=[OpenRunnerCallback()])

# Scikit-learn
from openrunner.integration.sklearn import log_model
log_model(clf, X_test, y_test)

# FastAI
from openrunner.integration.fastai import OpenRunnerCallback
learn = cnn_learner(dls, resnet34, cbs=[OpenRunnerCallback()])

# LangChain
from openrunner.integration.langchain import OpenRunnerCallback
chain.invoke({"input": "hello"}, config={"callbacks": [OpenRunnerCallback()]})
```

Install framework extras:
```bash
pip install openrunner-sdk[pytorch]
pip install openrunner-sdk[huggingface]
pip install openrunner-sdk[lightning]
pip install openrunner-sdk[keras]
pip install openrunner-sdk[xgboost]
pip install openrunner-sdk[sklearn]
pip install openrunner-sdk[fastai]
pip install openrunner-sdk[gpu]  # NVIDIA GPU monitoring
```

```bash
export OPENRUNNER_MODE=offline
python train.py
openrunner sync  # Upload when back online
```

Offline runs are stored as JSONL files. Config is saved at `init()` time (not `finish()`), so it survives crashes.
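Idempotent sync generally relies on each JSONL record carrying a stable id, so replaying a file after a crash or retry never double-counts. A toy sketch of the pattern (record shape and dedup state are illustrative, not the SDK's actual format):

```python
import json

offline_log = "\n".join(json.dumps(r) for r in [
    {"record_id": "run1-step0", "loss": 0.9},
    {"record_id": "run1-step1", "loss": 0.5},
])

seen = set()   # ids already accepted (server-side dedup state)
uploaded = []

def sync(jsonl_text: str):
    """Replay a JSONL log; records already seen are skipped."""
    for line in jsonl_text.splitlines():
        record = json.loads(line)
        if record["record_id"] in seen:
            continue
        seen.add(record["record_id"])
        uploaded.append(record)

sync(offline_log)
sync(offline_log)  # a second sync after a retry uploads nothing new
print(len(uploaded))  # → 2
```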
```bash
openrunner login  # Store API key
openrunner sync   # Upload offline runs
openrunner ls     # List projects and runs
```

A production-ready Helm chart is included for Kubernetes clusters:
```bash
helm install openrunner ./helm/openrunner \
  --namespace openrunner --create-namespace \
  --set api.replicas=3 \
  --set image.tag=0.2.0 \
  --set env.DATABASE_URL="postgresql+asyncpg://user:pass@db:5432/openrunner" \
  --set env.SECRET_KEY="your-production-secret"
```

The chart deploys: API server, web frontend, PostgreSQL, Redis, MinIO -- all with resource limits, health checks, and PVC storage.
For external databases in production, disable the bundled PostgreSQL:
```yaml
# values-production.yaml
postgresql:
  enabled: false
env:
  DATABASE_URL: "postgresql+asyncpg://user:pass@your-rds-host:5432/openrunner"
```

See `helm/openrunner/` for full chart documentation and values.yaml.
| Variable | Description | Default |
|---|---|---|
| `OPENRUNNER_API_KEY` | API key for server authentication | (required) |
| `OPENRUNNER_BASE_URL` | Server URL | `http://localhost:8000` |
| `OPENRUNNER_PROJECT` | Default project name | (none) |
| `OPENRUNNER_MODE` | `online` or `offline` | `online` |
| `OPENRUNNER_SYSTEM_METRICS` | Enable GPU/CPU/memory monitoring | `true` |

W&B environment variables (`WANDB_API_KEY`, `WANDB_BASE_URL`) are supported as fallback for migration.
| Variable | Description | Default |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | `postgresql+asyncpg://openrunner:openrunner@postgres:5432/openrunner` |
| `DB_USE_PGBOUNCER` | Enable PgBouncer compatibility mode | `false` |
| `REDIS_URL` | Redis connection string | `redis://redis:6379/0` |
| `MINIO_INTERNAL_ENDPOINT` | MinIO/S3 endpoint reached from inside the Docker/cluster network; used for API and worker reads/writes. Legacy `MINIO_ENDPOINT` is still accepted. | `minio:9000` |
| `MINIO_PUBLIC_ENDPOINT` | MinIO/S3 endpoint baked into presigned URLs returned to the browser/SDK; must resolve from outside the cluster. Legacy `MINIO_EXTERNAL_ENDPOINT` is still accepted. | `localhost:9000` |
| `MINIO_ACCESS_KEY` | MinIO/S3 access key | `minioadmin` |
| `MINIO_SECRET_KEY` | MinIO/S3 secret key | `minioadmin` |
| `MINIO_SECURE` | Use HTTPS for MinIO | `false` |
| `MINIO_BUCKET` | Object storage bucket name | `openrunner` |
| `SECRET_KEY` | Application secret key | `change-me-in-production` |
| `DEBUG` | Enable debug mode | `false` |
| `API_HOST` | API bind address | `0.0.0.0` |
| `API_PORT` | API bind port | `8000` |
| `MAX_UPLOAD_SIZE` | Max artifact upload size accepted by the in-stack nginx `/api/` proxy. Sized for typical ML model artifacts; lower it for shared deployments. Restart the web service to apply. | `5g` |
| `JWT_ACCESS_SECRET` | JWT access token signing key | `change-me-access-secret` |
| `JWT_REFRESH_SECRET` | JWT refresh token signing key | `change-me-refresh-secret` |
| `JWT_ACCESS_EXPIRE_MINUTES` | Access token lifetime | `15` |
| `JWT_REFRESH_EXPIRE_DAYS` | Refresh token lifetime | `30` |
| `FRONTEND_URL` | Frontend URL for CORS and redirects | `http://localhost:3000` |
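If you put your own reverse proxy in front of the API instead of the bundled one, the `MAX_UPLOAD_SIZE` limit must be mirrored there or large artifact uploads will be rejected upstream. A hypothetical nginx fragment (directive values and upstream name are examples, not shipped config):

```nginx
# Allow large artifact uploads through the /api/ proxy.
location /api/ {
    client_max_body_size 5g;       # mirror MAX_UPLOAD_SIZE
    proxy_request_buffering off;   # stream uploads instead of buffering to disk
    proxy_pass http://api:8000;
}
```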
| Variable | Description | Default |
|---|---|---|
| `GOOGLE_CLIENT_ID` | Google OAuth client ID | (empty) |
| `GOOGLE_CLIENT_SECRET` | Google OAuth client secret | (empty) |
| `GOOGLE_REDIRECT_URI` | Google OAuth callback URL | `http://localhost:8000/api/v1/auth/google/callback` |
| `GITHUB_CLIENT_ID` | GitHub OAuth client ID | (empty) |
| `GITHUB_CLIENT_SECRET` | GitHub OAuth client secret | (empty) |
| `GITHUB_REDIRECT_URI` | GitHub OAuth callback URL | `http://localhost:8000/api/v1/auth/github/callback` |
| `OIDC_CLIENT_ID` | OpenID Connect client ID | (empty) |
| `OIDC_CLIENT_SECRET` | OpenID Connect client secret | (empty) |
| `OIDC_DISCOVERY_URL` | OIDC discovery endpoint (`.well-known/openid-configuration`) | (empty) |
| `OIDC_PROVIDER_NAME` | Display name for OIDC button | `SSO` |
| `OIDC_REDIRECT_URI` | OIDC callback URL | `http://localhost:8000/api/v1/auth/oidc/callback` |
| `SAML_ENTITY_ID` | SAML Service Provider entity ID | (empty) |
| `SAML_SSO_URL` | SAML IdP single sign-on URL | (empty) |
| `SAML_CERTIFICATE` | SAML IdP certificate (PEM) | (empty) |
| `SAML_SP_CERT` | SAML SP certificate (PEM) | (empty) |
| `SAML_SP_KEY` | SAML SP private key (PEM) | (empty) |
| Variable | Description | Default |
|---|---|---|
| `SLACK_WEBHOOK_URL` | Slack incoming webhook for alerts | (none) |
| `RESEND_API_KEY` | Resend API key for email delivery | (none) |
| `RESEND_FROM_EMAIL` | Sender email address | `noreply@gladia.io` |
For production (e.g., openrun.gladia.io):
```bash
# Use external PostgreSQL
cp docker-compose.prod.yml docker-compose.override.yml
# Edit .env with your database credentials

# Set up HTTPS
# Add a DNS A record pointing to your server
# Use certbot or your reverse proxy for SSL
```

See docker-compose.prod.yml for external database configuration.
- Docker and Docker Compose (or Kubernetes with Helm)
- 2GB RAM minimum (4GB+ recommended for production)
- 10GB disk (grows with data)
- Python 3.10+ for the SDK
MIT licensed. PRs welcome.
```bash
# Development setup
git clone https://github.com/jqueguiner/openrunner.git
cd openrunner
make install               # Install backend deps
cd src/web && npm install  # Install frontend deps
docker compose up -d       # Start infra (DB, Redis, MinIO)
make dev                   # Start API dev server
cd src/web && npm run dev  # Start frontend dev server
```