Argus

Self-hostable ML experiment monitoring — batches, jobs, GPU/CPU resources, reruns, hyperopt sweeps. Real-time dashboard. Multi-user. One container.

The name comes from Argus Panoptes (Ἄργος Πανόπτης) — the hundred-eyed giant of Greek mythology, tasked with watching over what others could not see. A fitting name for a tool whose job is to keep every eye open so you don't have to.

English · 简体中文 · Documentation

A familiar story

It's 23:00. You launched a 32-config Optuna sweep on the lab's workstations and went to dinner. At 02:00 you wake up worried, SSH in, and discover:

two of the eight GPUs have sat at 0% for four hours — a dataloader stalled and the loop never crashed;
one trial died at epoch 47 of 50 with CUDA out of memory, but the only trace is a 600 MB stdout.log you don't want to read;
when you re-run the dead trial it lands as a new batch in your notes, fragmenting the sweep across two rows;
your advisor pings: "which trial won — can I see the loss curve?";
the PI on Slack: "is anyone using gpu-03 right now?".

Argus is the dashboard the author built after losing several nights to that exact story. It's an open-source experiment tracker for teams running long ML training jobs across one or many machines. You instrument your training script with a two-line Reporter SDK call, and Argus answers the questions above without your having to SSH in.

Pain → feature

The pain at 02:00	What Argus does
GPU has been at 0% for hours, training loop didn't notice	Idle-job detector — `resource_snapshot` events every 30 s; flagged when GPU util < 5% for `ARGUS_IDLE_JOB_THRESHOLD_MIN` (default 10) min
Training script crashed but Argus still says running	Stalled-batch detector — flips status to stalled after `ARGUS_STALL_TIMEOUT_MIN` (default 15) min of silence
Resumed run becomes a new batch — sweep history fragments	Crash-resume — `derive_batch_id(project, experiment, git_sha)` returns a deterministic id; pass `Reporter(resume_from=…)` and the resumed run appends to the original Batch row
Wandering through 600 MB of stdout to find the failure	Tail of `log_line` events on the job detail page; full log streamed via SSE
"Which trial won?" across a 32-config sweep	JobMatrix — one global best (green border + trophy) and one global worst (red border + warning) on the primary metric across the whole `model × dataset` grid; CSV export for paper tables
"Show me the loss curves of trial #17" without a login	Per-batch share links — opaque slug, read-only, revocable; or set the project visibility to public
200 lines of W&B / CSV glue per training script	Two `with` blocks: `with Reporter(...)` opens a batch, `with r.job(...)` opens a job, `job.epoch(...)` streams metrics; failures auto-emit `job_failed`
Optuna sweeps need their own dashboard	Studies tab — scatter, parallel-coordinates and parameter-importance from `optuna.{study_name, trial_number}` labels Sibyl's monitor callback stamps on `job_start`
"Re-run that failed trial, but I'm at home and the box is on campus"	Rerun button on a finished batch sends a `kind=rerun` command to the `argus-agent` daemon on the origin host; the agent calls `subprocess.Popen` with the recorded `env_snapshot.command`
"Stop this run, I changed the hyperparameter"	Stop button flips a flag your loop reads via `if job.stopped: break` (10 s polling); the agent escalates to `SIGTERM` if the SDK isn't running
Co-authors all want notifications, but each on their own terms	Per-user prefs (email-on-batch-done, email-on-job-failed, daily digest) + per-project recipient routing — a project mails its event list, but each recipient's prefs win
"Where did this batch's command line go?"	`env_snapshot` (git SHA, command, cwd, hostname) recorded on `batch_start`; one-click Copy command in the UI
Hydra training, want monitoring without code changes	`argus.integrations.hydra.ArgusCallback` — wire under `hydra.callbacks` once, every `python main.py …` (and `-m` sweeps) emits a batch automatically
Lightning / Keras training, same story	`argus.integrations.{lightning,keras}.ArgusCallback` drop-in callbacks
"I want to deploy this on our airgapped cluster"	Single Docker image (FastAPI + Vue 3 + SQLite by default). No SaaS, no telemetry. Optional Postgres for multi-host.

Quickstart (60 seconds)

git clone https://github.com/leehom0123/argus.git
cd argus/deploy
cp .env.example .env

# Generate a JWT secret (>=32 bytes is enforced when ARGUS_ENV=prod)
python3 -c "import secrets; print('ARGUS_JWT_SECRET=' + secrets.token_urlsafe(48))" >> .env

docker compose up -d --build
# Open http://localhost:8000 — register the first account; it becomes admin.

Then, in your training script:

from argus import Reporter

with Reporter("my-run",
              experiment_type="forecast",
              source_project="my-paper",
              n_total=1,
              monitor_url="http://localhost:8000",
              token="em_live_…") as r:                # mint at Settings → Tokens
    with r.job("run-1", model="patchtst", dataset="etth1") as job:
        for epoch in range(50):
            # JobContext.epoch defines four named optional floats —
            # train_loss, val_loss, lr, batch_time_ms — plus **extra:
            # any other keyword is stored on the event payload, so you
            # can stream as many per-epoch metrics as you compute.
            job.epoch(epoch,
                      train_loss=..., val_loss=..., lr=..., batch_time_ms=...,
                      val_mse=..., val_rmse=..., val_mae=...,
                      val_r2=..., val_pcc=...)
        # Final / headline metrics for the run — any dict[str, float],
        # surfaced on the JobMatrix and CSV export. Drop everything you
        # want plotted on the leaderboard here.
        job.metrics({
            "MSE":  ..., "RMSE": ...,
            "MAE":  ..., "R2":   ...,
            "PCC":  ...,
        })

Or set the env vars and skip the explicit args:

export ARGUS_URL=http://localhost:8000          # SDK reads ARGUS_URL (not ARGUS_BASE_URL)
export ARGUS_TOKEN=em_live_…

That's it — refresh the dashboard and your batch is live.

Repository layout

argus/
├── backend/    FastAPI + async SQLAlchemy 2.0 + Alembic (Python ≥3.10)
├── frontend/   Vue 3 + TypeScript + Vite + Pinia + Ant Design Vue (4.2.6 pinned)
├── client/     argus-reporter SDK (PyPI: argus-reporter)
├── schemas/    Event contract (event_v1.json, schema v1.1)
├── deploy/     Docker compose, Dockerfile, nginx snippet, .env.example
└── docs/       MkDocs Material site (English + 简体中文)

Feature highlights

Area	What's there
Tracking	Batches → jobs → epochs hierarchy; idempotent ingest by `event_id`; JSONL spill replay on outage
Live UI	Single multiplexed SSE connection (`GET /api/sse`) per page; ECharts loss curves; GPU/CPU sparklines
Auth	Email + password (argon2id); GitHub OAuth (optional); JWT dual-key rotation
Tokens	`em_live_` for SDK, `ag_live_` for agents — both bound per user
Executor	Backend serves `/api/agents/*`. The agent daemon (`argus-agent`) ships in the Sibyl package; reruns use `subprocess.Popen`, stops use SIGTERM
Studies	Optuna multirun via `optuna.{study_name, trial_number}` labels stashed by Sibyl's monitor callback
Notifications	Per-user email prefs + per-project multi-recipient routing; SMTP delivery
Runtime config	GitHub OAuth, SMTP, retention, demo project, feature flags editable from UI without redeploy
i18n	Full English + Simplified Chinese UI
Frameworks	First-class PyTorch Lightning + Keras + Hydra callback adapters (`argus.integrations.{lightning,keras,hydra}`)
Resume	`derive_batch_id` keeps a relaunched experiment on the same Batch row

Documentation


🚀 Getting started	Install, first run, send first event
📖 User guide	Dashboard, batches, jobs, sharing, notifications
🐍 SDK reference	`Reporter` API, framework adapters, event schema
🛠 Operations	Docker, configuration, admin settings, agent, database, retention
🏗 Architecture overview	How the pieces fit together
📝 Contributing · Security

License

Apache-2.0 — see LICENSE and NOTICE.

If you use Argus in academic work, please cite via CITATION.cff.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github		.github
backend		backend
client		client
deploy		deploy
docs		docs
frontend		frontend
schemas		schemas
.gitignore		.gitignore
.gitmessage		.gitmessage
.yamllint		.yamllint
CITATION.cff		CITATION.cff
CLAUDE.md		CLAUDE.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
README.zh-CN.md		README.zh-CN.md
SECURITY.md		SECURITY.md
mkdocs.yml		mkdocs.yml
requirements-docs.txt		requirements-docs.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Argus

A familiar story

Pain → feature

Quickstart (60 seconds)

Repository layout

Feature highlights

Documentation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Argus

A familiar story

Pain → feature

Quickstart (60 seconds)

Repository layout

Feature highlights

Documentation

License

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages