OpenRunner

GitHub PyPI License: MIT Python 3.10+

Open-source, self-hosted ML experiment tracking platform -- a drop-in replacement for Weights & Biases.

Track experiments, run hyperparameter sweeps, trace LLM calls, version artifacts, manage models, and share reports. Same wandb.init() / wandb.log() API you already know.

Screenshots

Screenshots coming soon. See docs/screenshots for the capture guide.

Key screens: Dashboard | Run Table | Run Detail with Charts | Run Comparison | Sweep Parallel Coordinates | LLM Trace Waterfall | Artifact Lineage | Reports

Features

Core Experiment Tracking

  • W&B-compatible SDK -- switch by changing one import (import openrunner as wandb)
  • Experiment tracking -- metrics, hyperparameters, system stats (CPU/GPU/memory)
  • Run dashboard -- real-time charts, search/filter/sort, responsive with 10K+ runs
  • Run comparison -- overlay charts, bar/scatter/box plots, config diff
  • Custom dashboards -- drag-and-drop chart panels with persistent layouts
  • Full-fidelity export -- CSV, JSON, Parquet (no sampling limits)

Hyperparameter Sweeps

  • Grid, random, and Bayesian search strategies
  • Hyperband early termination -- stop underperforming runs automatically
  • W&B-compatible sweep config -- same YAML format, same API
  • Parallel coordinates visualization -- explore parameter-metric relationships

LLM Tracing

  • @openrunner.trace decorator -- capture inputs, outputs, timing, and errors
  • OpenAI auto-patching -- openrunner.trace.patch_openai() instruments all completions
  • Async support -- trace sync and async functions with automatic span nesting
  • Waterfall visualization -- inspect call chains, latency, and token usage

Launch (Remote Execution)

  • openrunner.launch() -- submit jobs to remote workers with Docker images
  • openrunner.launch.from_run() -- re-launch from a previous run's config
  • Job lifecycle -- .wait(), .cancel(), .refresh() for monitoring
  • Worker queue -- Redis-backed job dispatch with configurable concurrency
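The worker-queue model above can be pictured with a minimal in-memory sketch. The real dispatcher is Redis-backed; the `JobQueue` class, its method names, and the job ids here are purely illustrative:

```python
from collections import deque

class JobQueue:
    """Toy FIFO job queue illustrating dispatch with a concurrency cap."""

    def __init__(self, max_concurrency: int = 2):
        self.pending = deque()          # jobs waiting for a worker
        self.running = []               # jobs currently executing
        self.max_concurrency = max_concurrency

    def submit(self, job_id: str) -> None:
        self.pending.append(job_id)

    def dispatch(self) -> list[str]:
        """Move jobs from pending to running until the cap is hit."""
        started = []
        while self.pending and len(self.running) < self.max_concurrency:
            job = self.pending.popleft()
            self.running.append(job)
            started.append(job)
        return started

q = JobQueue(max_concurrency=2)
for j in ("job-a", "job-b", "job-c"):
    q.submit(j)
print(q.dispatch())  # first two jobs start; "job-c" stays queued
```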

Artifact Versioning & Model Registry

  • Artifact versioning -- datasets, models, checkpoints with SHA-256 content-addressed dedup
  • Model registry with aliases -- latest, production, staging for deployment workflows
  • openrunner.link_artifact() -- upload and set aliases in one call
  • name:alias syntax -- openrunner.use_artifact("my-model:production")
  • Lineage graph -- track producer/consumer relationships across runs
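Content-addressed dedup boils down to keying each file by its SHA-256 digest, so identical bytes are stored exactly once. A minimal sketch of the idea (the dict stand-in is illustrative, not the actual object-store layout):

```python
import hashlib

store: dict[str, bytes] = {}  # digest -> content (stand-in for object storage)

def put(content: bytes) -> str:
    """Store content under its SHA-256 digest; duplicate uploads are free."""
    digest = hashlib.sha256(content).hexdigest()
    store.setdefault(digest, content)  # no-op if the blob already exists
    return digest

a = put(b"model weights v1")
b = put(b"model weights v1")   # same bytes -> same key, no second copy
c = put(b"model weights v2")

assert a == b and a != c
assert len(store) == 2  # two unique blobs despite three uploads
```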

Alerts & Notifications

  • openrunner.alert() -- send alerts from training code (INFO/WARN/ERROR)
  • Slack webhooks -- receive alerts in your Slack channel
  • Email notifications -- delivery via Resend API
  • Console fallback -- alerts logged when no external channel is configured

Reports & Collaboration

  • Shareable reports -- snapshots with metric charts and run data
  • Report anonymization -- strip identifiers for blind paper reviews
  • Real-time collaborative editing -- WebSocket-based concurrent editing

Media Logging (10 types)

  • Image -- numpy arrays, PIL Images, file paths with captions
  • Table -- structured columnar data
  • Audio -- WAV serialization from numpy arrays or file paths
  • Video -- file paths or numpy array sequences
  • Histogram -- distribution visualization from numeric arrays
  • Plotly -- interactive Plotly figures (JSON serialization)
  • PlotlyChart -- enhanced Plotly with static PNG fallback
  • MatplotlibFigure -- capture matplotlib figures as PNG
  • PointCloud3D -- 3D point cloud visualization
  • BoundingBoxes2D -- bounding box overlay on images

Enterprise

  • SAML/OIDC SSO -- Okta, Azure AD, OneLogin, and any SAML 2.0 / OpenID Connect IdP
  • OAuth -- Google and GitHub social login
  • Audit logs -- compliance-ready event trail with filtering by action, user, resource, date
  • Organization management -- teams, roles (admin/member), invitations

Programmatic Query API

  • openrunner.Api() -- read-only access to projects, runs, artifacts
  • Filter and sort -- api.runs("project", filters={"state": "finished"}, order="-summary.accuracy")
  • History export -- run.history(pandas=True) returns a DataFrame

Framework Integrations (8 frameworks)

  • PyTorch -- gradient logging
  • HuggingFace Transformers -- OpenRunnerCallback for Trainer
  • PyTorch Lightning -- OpenRunnerLogger for pl.Trainer
  • Keras -- callback for training loops
  • XGBoost -- callback for boosting rounds
  • Scikit-learn -- experiment logging wrapper
  • FastAI -- Learner callback
  • LangChain -- chain/agent tracing callback

Infrastructure

  • Self-hosted -- Docker Compose or Kubernetes, your data stays yours
  • Offline mode -- JSONL storage, idempotent sync when back online
  • Kubernetes Helm chart -- production-ready deployment with HA
  • Email delivery via Resend -- transactional emails without SMTP setup

Quick Start

1. Deploy the server

git clone https://github.com/jqueguiner/openrunner.git
cd openrunner
docker compose up -d

All 5 services start automatically: PostgreSQL, Redis, MinIO, API server, frontend.

Open http://localhost:3000 and create an account.

Hit a snag on first run? See Troubleshooting first-run for the issues most fresh-clone users encounter (port mismatches, an empty SECRET_KEY, MinIO endpoint and credential drift).

2. Install the SDK

pip install openrunner-sdk

3. Set your credentials

export OPENRUNNER_API_KEY="or_your_key"
export OPENRUNNER_BASE_URL="http://localhost:8000"

Get your API key from the dashboard under Settings > API Keys.

4. Track experiments

import openrunner

openrunner.init(project="my-project", config={"lr": 0.001, "epochs": 10})

for epoch in range(10):
    loss = train(epoch)
    acc = evaluate()
    openrunner.log({"loss": loss, "accuracy": acc, "epoch": epoch})

openrunner.finish()

Migrating from W&B

import openrunner as wandb

wandb.init(project="my-project")
wandb.log({"loss": 0.5})
wandb.finish()

Sweeps Quick Start

import openrunner

sweep_config = {
    "method": "bayes",
    "name": "lr-sweep",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 0.0001, "max": 0.1, "distribution": "log_uniform_values"},
        "batch_size": {"values": [16, 32, 64]},
        "epochs": {"value": 10},
    },
    "early_terminate": {"type": "hyperband", "min_iter": 3, "eta": 3},
}

sweep_id = openrunner.sweep(sweep_config, project="my-project")

def train_fn():
    run = openrunner.init()
    lr = openrunner.config.lr
    for epoch in range(openrunner.config.epochs):
        loss = train(lr=lr, epoch=epoch)
        openrunner.log({"val_loss": loss})
    openrunner.finish()

openrunner.agent(sweep_id, function=train_fn)

Sweep methods: grid, random, bayes. Early termination via Hyperband stops underperforming runs.
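With `min_iter: 3` and `eta: 3` as in the config above, Hyperband evaluates runs at geometrically spaced rungs (3, 9, 27, ... iterations), keeping roughly the top 1/eta of runs at each rung. A sketch of that rung schedule (the helper name is hypothetical):

```python
def hyperband_rungs(min_iter: int, eta: int, max_iter: int) -> list[int]:
    """Iteration counts at which Hyperband considers stopping runs."""
    rungs = []
    r = min_iter
    while r <= max_iter:
        rungs.append(r)
        r *= eta  # each rung is eta times longer than the previous one
    return rungs

print(hyperband_rungs(min_iter=3, eta=3, max_iter=100))  # [3, 9, 27, 81]
```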


LLM Tracing Quick Start

import openai
import openrunner

# Decorator-based tracing
@openrunner.trace
def summarize(text: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.choices[0].message.content

# Or auto-patch all OpenAI calls
openrunner.trace.patch_openai()

openrunner.init(project="llm-app")
result = summarize("Long article text...")
openrunner.finish()

Traces capture inputs, outputs, duration, errors, and token usage. View them in the Trace Waterfall UI.
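Under the hood, a tracing decorator of this kind wraps the function, records its inputs, output, duration, and any exception, and emits a span. A simplified, dependency-free sketch (the real `@openrunner.trace` also handles async functions and span nesting; the `spans` list stands in for the backend):

```python
import functools
import time

spans = []  # stand-in for the trace backend

def trace(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = {"name": fn.__name__, "inputs": {"args": args, "kwargs": kwargs}}
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            span["output"] = result
            return result
        except Exception as exc:
            span["error"] = repr(exc)  # errors are captured, then re-raised
            raise
        finally:
            span["duration_s"] = time.perf_counter() - start
            spans.append(span)
    return wrapper

@trace
def summarize(text: str) -> str:
    return text[:10] + "..."

summarize("Long article text")
print(spans[0]["name"], spans[0]["output"])
```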


Launch Quick Start

import openrunner

# Submit a remote training job
job = openrunner.launch(
    project="my-project",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    command="python train.py --lr 0.001 --epochs 50",
    name="gpu-training-run",
)

# Wait for completion
final_state = job.wait(poll_interval=10.0)
print(f"Job finished with state: {final_state}")

# Re-launch from a previous run's config
job2 = openrunner.launch.from_run(
    run_id="abc12345",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    command="python train.py",
)

Model Registry Quick Start

import openrunner

openrunner.init(project="my-project")

# Log a model artifact with aliases
artifact = openrunner.Artifact(name="resnet50", type="model")
artifact.add_file("model.pth")
openrunner.link_artifact(artifact, aliases=["production", "v2.1"])

openrunner.finish()

# Later -- download by alias
openrunner.init(project="my-project")
model_dir = openrunner.use_artifact("resnet50:production")
# model_dir is a Path to the cached local directory

Built-in aliases: latest (auto-set), plus custom aliases like production, staging, best.
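The `name:alias` syntax can be resolved with a simple split, defaulting to `latest` when no alias is given. A hypothetical sketch of that resolution step (not the SDK's actual parser):

```python
def parse_artifact_ref(ref: str) -> tuple[str, str]:
    """Split "name:alias" into (name, alias); alias defaults to "latest"."""
    name, sep, alias = ref.partition(":")
    return name, alias if sep else "latest"

print(parse_artifact_ref("resnet50:production"))  # ('resnet50', 'production')
print(parse_artifact_ref("resnet50"))             # ('resnet50', 'latest')
```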


Query API Quick Start

import openrunner

api = openrunner.Api()

# List projects
projects = api.projects()

# Query runs with filters and ordering
runs = api.runs("my-project", filters={"state": "finished"}, order="-summary.accuracy")

# Access run details
for run in runs:
    print(f"{run.name}: accuracy={run.summary.get('accuracy')}")

# Get full metric history
run = api.run("my-project/abc12345")
history = run.history()             # List of dicts
df = run.history(pandas=True)       # pandas DataFrame

# Get artifact by alias
artifact = api.artifact("resnet50:production")

Alerts & Notifications

import openrunner

openrunner.init(project="my-project")

# Send alerts from training code
openrunner.alert("Training complete", text="Final accuracy: 0.95", level="INFO")
openrunner.alert("Loss spike", text="Loss jumped from 0.1 to 5.0", level="WARN")
openrunner.alert("OOM Error", text="GPU out of memory at batch 1024", level="ERROR")

openrunner.finish()

Configure Slack webhooks via the SLACK_WEBHOOK_URL environment variable. Email alerts use Resend (RESEND_API_KEY).
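A Slack incoming webhook simply receives a JSON body with a `text` field, so an alert maps to a payload along these lines (the formatting and emoji choices here are an illustration, not the server's exact message shape):

```python
import json

def slack_payload(title: str, text: str, level: str = "INFO") -> str:
    """Render an alert as a Slack incoming-webhook JSON body."""
    emoji = {"INFO": ":information_source:", "WARN": ":warning:", "ERROR": ":rotating_light:"}
    body = {"text": f"{emoji.get(level, '')} *{title}* ({level})\n{text}"}
    return json.dumps(body)

payload = slack_payload("Loss spike", "Loss jumped from 0.1 to 5.0", level="WARN")
# POST this body to SLACK_WEBHOOK_URL with Content-Type: application/json
```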


Architecture

openrunner/
  src/
    api/          # FastAPI backend (Python)
    web/          # React frontend (TypeScript)
  sdk/            # Python SDK (pip install openrunner-sdk)
  helm/           # Kubernetes Helm chart
  examples/       # Demo scripts (MNIST, artifacts)
  docker-compose.yml
| Component      | Technology                                           |
| -------------- | ---------------------------------------------------- |
| Backend        | Python, FastAPI, SQLAlchemy (async), asyncpg         |
| Frontend       | React 19, TypeScript, Vite, ECharts, TanStack Table  |
| Database       | PostgreSQL 16 (partitioned metrics table)            |
| Object storage | MinIO (S3-compatible)                                |
| Cache/pubsub   | Redis                                                |
| SDK            | Python, httpx, Click CLI                             |
| Email          | Resend API (no SMTP needed)                          |
| SSO            | SAML 2.0, OpenID Connect                             |
| Deployment     | Docker Compose, Kubernetes Helm chart                |

Backend API Routes

| Route Group    | Purpose                                       |
| -------------- | --------------------------------------------- |
| /auth          | Login, register, OAuth, SSO (OIDC/SAML)       |
| /users         | User profile management                       |
| /organizations | Org CRUD, member management                   |
| /projects      | Project CRUD                                  |
| /runs          | Run lifecycle, metrics, config, summary       |
| /sweeps        | Sweep creation, parameter suggestions, status |
| /jobs          | Launch job submission and monitoring          |
| /traces        | LLM trace and span storage                    |
| /artifacts     | Artifact versioning, file upload/download     |
| /registry      | Model registry aliases                        |
| /media         | Image, audio, video, table upload             |
| /alerts        | Alert creation and listing                    |
| /reports       | Report CRUD, sharing, anonymization           |
| /dashboards    | Custom dashboard layouts                      |
| /streams       | Server-Sent Events for real-time updates      |
| /export        | CSV, JSON, Parquet data export                |
| /audit-logs    | Organization audit trail (admin only)         |
| /tags          | Tag CRUD and bulk operations                  |
| /health        | Health check endpoint                         |

SDK Documentation

Full API reference: openrunner-sdk on PyPI

Core API

import openrunner

# Initialize a run
run = openrunner.init(
    project="my-project",
    name="experiment-1",
    config={"lr": 0.001},
    tags=["baseline"],
    notes="First experiment",
    group="ablation-study",
    resume=True,  # or "must" to require an existing run
)

# Log metrics (non-blocking)
openrunner.log({"loss": 0.5, "accuracy": 0.85})

# Log media
openrunner.log({"samples": openrunner.Image(img_array, caption="epoch 5")})
openrunner.log({"audio": openrunner.Audio(waveform, sample_rate=16000)})
openrunner.log({"fig": openrunner.Plotly(plotly_fig)})
openrunner.log({"detections": openrunner.BoundingBoxes2D(image, boxes)})

# Log tables
table = openrunner.Table(
    columns=["input", "predicted", "actual"],
    data=[["img_01", 7, 7], ["img_02", 3, 5]],
)
openrunner.log({"results": table})

# Log and manage artifacts
artifact = openrunner.Artifact(name="my-model", type="model")
artifact.add_file("model.pth")
run.log_artifact(artifact)

# Config & Summary
openrunner.config["batch_size"] = 64
openrunner.config.optimizer.lr  # dot notation access
openrunner.summary["best_accuracy"] = 0.95

# Alerts
openrunner.alert("Training done", level="INFO")

# Finish
openrunner.finish()

Framework Integrations

# HuggingFace Transformers
from openrunner.integration.huggingface import OpenRunnerCallback
trainer = Trainer(callbacks=[OpenRunnerCallback()])

# PyTorch Lightning
from openrunner.integration.lightning import OpenRunnerLogger
trainer = pl.Trainer(logger=OpenRunnerLogger(project="my-project"))

# PyTorch
from openrunner.integration.pytorch import log_gradients
log_gradients(model)

# Keras
from openrunner.integration.keras import OpenRunnerCallback
model.fit(x, y, callbacks=[OpenRunnerCallback()])

# XGBoost
from openrunner.integration.xgboost import OpenRunnerCallback
xgb.train(params, dtrain, callbacks=[OpenRunnerCallback()])

# Scikit-learn
from openrunner.integration.sklearn import log_model
log_model(clf, X_test, y_test)

# FastAI
from openrunner.integration.fastai import OpenRunnerCallback
learn = cnn_learner(dls, resnet34, cbs=[OpenRunnerCallback()])

# LangChain
from openrunner.integration.langchain import OpenRunnerCallback
chain.invoke({"input": "hello"}, config={"callbacks": [OpenRunnerCallback()]})

Install framework extras:

pip install openrunner-sdk[pytorch]
pip install openrunner-sdk[huggingface]
pip install openrunner-sdk[lightning]
pip install openrunner-sdk[keras]
pip install openrunner-sdk[xgboost]
pip install openrunner-sdk[sklearn]
pip install openrunner-sdk[fastai]
pip install openrunner-sdk[gpu]  # NVIDIA GPU monitoring

Offline Mode

export OPENRUNNER_MODE=offline
python train.py
openrunner sync   # Upload when back online

Offline runs are stored as JSONL files. Config is saved at init() time (not finish()), so it survives crashes.
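The JSONL model is easy to picture: each logged event is one JSON line, and sync stays idempotent by deduplicating on a per-record key before upload. A simplified sketch (the field names and in-memory `uploaded` set are illustrative, not the SDK's actual wire format):

```python
import json

offline_log = [
    '{"record_id": 1, "metrics": {"loss": 0.9}}',
    '{"record_id": 2, "metrics": {"loss": 0.7}}',
    '{"record_id": 2, "metrics": {"loss": 0.7}}',  # duplicate from a retried write
]

uploaded: set[int] = set()  # record ids the server has already accepted

def sync(lines: list[str]) -> int:
    """Upload each JSONL record at most once; safe to re-run."""
    sent = 0
    for line in lines:
        record = json.loads(line)
        if record["record_id"] not in uploaded:
            uploaded.add(record["record_id"])
            sent += 1
    return sent

print(sync(offline_log))  # 2 -- the duplicate line is skipped
print(sync(offline_log))  # 0 -- re-running sync uploads nothing
```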

CLI

openrunner login   # Store API key
openrunner sync    # Upload offline runs
openrunner ls      # List projects and runs

Kubernetes Deployment

A production-ready Helm chart is included for Kubernetes clusters:

helm install openrunner ./helm/openrunner \
  --namespace openrunner --create-namespace \
  --set api.replicas=3 \
  --set image.tag=0.2.0 \
  --set env.DATABASE_URL="postgresql+asyncpg://user:pass@db:5432/openrunner" \
  --set env.SECRET_KEY="your-production-secret"

The chart deploys: API server, web frontend, PostgreSQL, Redis, MinIO -- all with resource limits, health checks, and PVC storage.

For external databases in production, disable the bundled PostgreSQL:

# values-production.yaml
postgresql:
  enabled: false
env:
  DATABASE_URL: "postgresql+asyncpg://user:pass@your-rds-host:5432/openrunner"

See helm/openrunner/ for full chart documentation and values.yaml.


Environment Variables

SDK (Client)

| Variable                  | Description                        | Default               |
| ------------------------- | ---------------------------------- | --------------------- |
| OPENRUNNER_API_KEY        | API key for server authentication  | (required)            |
| OPENRUNNER_BASE_URL       | Server URL                         | http://localhost:8000 |
| OPENRUNNER_PROJECT        | Default project name               | (none)                |
| OPENRUNNER_MODE           | online or offline                  | online                |
| OPENRUNNER_SYSTEM_METRICS | Enable GPU/CPU/memory monitoring   | true                  |

W&B environment variables (WANDB_API_KEY, WANDB_BASE_URL) are supported as fallback for migration.

Server

| Variable                  | Description | Default |
| ------------------------- | ----------- | ------- |
| DATABASE_URL              | PostgreSQL connection string | postgresql+asyncpg://openrunner:openrunner@postgres:5432/openrunner |
| DB_USE_PGBOUNCER          | Enable PgBouncer compatibility mode | false |
| REDIS_URL                 | Redis connection string | redis://redis:6379/0 |
| MINIO_INTERNAL_ENDPOINT   | MinIO/S3 endpoint reached from inside the docker/cluster network -- used for API & worker reads/writes. Legacy MINIO_ENDPOINT is still accepted. | minio:9000 |
| MINIO_PUBLIC_ENDPOINT     | MinIO/S3 endpoint baked into presigned URLs returned to the browser/SDK. Must resolve from outside the cluster. Legacy MINIO_EXTERNAL_ENDPOINT is still accepted. | localhost:9000 |
| MINIO_ACCESS_KEY          | MinIO/S3 access key | minioadmin |
| MINIO_SECRET_KEY          | MinIO/S3 secret key | minioadmin |
| MINIO_SECURE              | Use HTTPS for MinIO | false |
| MINIO_BUCKET              | Object storage bucket name | openrunner |
| SECRET_KEY                | Application secret key | change-me-in-production |
| DEBUG                     | Enable debug mode | false |
| API_HOST                  | API bind address | 0.0.0.0 |
| API_PORT                  | API bind port | 8000 |
| MAX_UPLOAD_SIZE           | Max artifact upload size accepted by the in-stack nginx /api/ proxy. Sized for typical ML model artifacts; lower it for shared deployments. Restart the web service to apply. | 5g |
| JWT_ACCESS_SECRET         | JWT access token signing key | change-me-access-secret |
| JWT_REFRESH_SECRET        | JWT refresh token signing key | change-me-refresh-secret |
| JWT_ACCESS_EXPIRE_MINUTES | Access token lifetime | 15 |
| JWT_REFRESH_EXPIRE_DAYS   | Refresh token lifetime | 30 |
| FRONTEND_URL              | Frontend URL for CORS and redirects | http://localhost:3000 |

OAuth & SSO

| Variable             | Description | Default |
| -------------------- | ----------- | ------- |
| GOOGLE_CLIENT_ID     | Google OAuth client ID | (empty) |
| GOOGLE_CLIENT_SECRET | Google OAuth client secret | (empty) |
| GOOGLE_REDIRECT_URI  | Google OAuth callback URL | http://localhost:8000/api/v1/auth/google/callback |
| GITHUB_CLIENT_ID     | GitHub OAuth client ID | (empty) |
| GITHUB_CLIENT_SECRET | GitHub OAuth client secret | (empty) |
| GITHUB_REDIRECT_URI  | GitHub OAuth callback URL | http://localhost:8000/api/v1/auth/github/callback |
| OIDC_CLIENT_ID       | OpenID Connect client ID | (empty) |
| OIDC_CLIENT_SECRET   | OpenID Connect client secret | (empty) |
| OIDC_DISCOVERY_URL   | OIDC discovery endpoint (.well-known/openid-configuration) | (empty) |
| OIDC_PROVIDER_NAME   | Display name for OIDC button | SSO |
| OIDC_REDIRECT_URI    | OIDC callback URL | http://localhost:8000/api/v1/auth/oidc/callback |
| SAML_ENTITY_ID       | SAML Service Provider entity ID | (empty) |
| SAML_SSO_URL         | SAML IdP single sign-on URL | (empty) |
| SAML_CERTIFICATE     | SAML IdP certificate (PEM) | (empty) |
| SAML_SP_CERT         | SAML SP certificate (PEM) | (empty) |
| SAML_SP_KEY          | SAML SP private key (PEM) | (empty) |

Notifications & Email

| Variable          | Description                       | Default           |
| ----------------- | --------------------------------- | ----------------- |
| SLACK_WEBHOOK_URL | Slack incoming webhook for alerts | (none)            |
| RESEND_API_KEY    | Resend API key for email delivery | (none)            |
| RESEND_FROM_EMAIL | Sender email address              | noreply@gladia.io |

Production Deployment

For production (e.g., openrun.gladia.io):

# Use external PostgreSQL
cp docker-compose.prod.yml docker-compose.override.yml
# Edit .env with your database credentials

# Set up HTTPS
# Add DNS A record pointing to your server
# Use certbot or your reverse proxy for SSL

See docker-compose.prod.yml for external database configuration.


Server Requirements

  • Docker and Docker Compose (or Kubernetes with Helm)
  • 2GB RAM minimum (4GB+ recommended for production)
  • 10GB disk (grows with data)
  • Python 3.10+ for the SDK

Contributing

MIT licensed. PRs welcome.

# Development setup
git clone https://github.com/jqueguiner/openrunner.git
cd openrunner
make install   # Install backend deps
cd src/web && npm install   # Install frontend deps
docker compose up -d   # Start infra (DB, Redis, MinIO)
make dev   # Start API dev server
cd src/web && npm run dev   # Start frontend dev server

License

MIT
