Know which commit is going to break production — before you merge it.
A read-only, calibrated risk score for every code change, plus an honest ledger
of what actually happened. Deterministic. LLM-free at the core. It never touches production.
- What it looks like
- Why this exists
- Why you can trust the number
- Quick start
- Reproduce the model from scratch
- How it works
- Risk tiers
- The Pellet ledger
- Command reference
- The bigger picture
- Project layout
- Contributing
- License
Open a pull request. Before a human even reads the diff, ROOST posts a verdict:
┌─ Augur risk ───────────────────────────────────────────────┐
│ 73% tier: network ⚠ high risk │
│ │
│ Why this scored high: │
│ • change is spread across 5 subsystems (high diffusion) │
│ • touches files with 4 prior fix-inducing changes │
│ • large, scattered diff — not a focused edit │
│ │
│ 73% of changes that look like this needed a fix later. │
└─────────────────────────────────────────────────────────────┘
No LLM wrote that. It's a calibrated probability from a trained model — reproducible from a fixed seed, byte-for-byte. When ROOST says 73%, ~73% of changes like it really did get fixed later. That last sentence is the whole point: the number means something.
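One way to make "the number means something" concrete is a reliability check: group past predictions by score and compare the average predicted probability with the fraction that actually needed a fix. The sketch below is illustrative only. The function name `reliability_table` and the toy data are invented for this example and are not part of ROOST.

```python
from collections import defaultdict


def reliability_table(scores, outcomes, n_bins=10):
    """Bin predictions by score and compare predicted vs. observed rates.

    scores   -- predicted probabilities in [0, 1]
    outcomes -- 1 if the change later needed a fix, else 0
    Returns rows of (bin_index, count, mean_predicted, observed_rate).
    """
    bins = defaultdict(list)
    for score, outcome in zip(scores, outcomes):
        # Clamp so a score of exactly 1.0 lands in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, outcome))

    rows = []
    for index in sorted(bins):
        members = bins[index]
        mean_predicted = sum(s for s, _ in members) / len(members)
        observed_rate = sum(o for _, o in members) / len(members)
        rows.append((index, len(members), mean_predicted, observed_rate))
    return rows


# Toy data, invented for illustration: ten changes scored in the low 0.7s,
# seven of which later needed a fix.
toy_scores = [0.71, 0.72, 0.74, 0.71, 0.73, 0.75, 0.76, 0.72, 0.74, 0.73]
toy_outcomes = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]

for index, count, predicted, observed in reliability_table(toy_scores, toy_outcomes):
    print(f"bin {index}: n={count} predicted={predicted:.2f} observed={observed:.2f}")
```

For a well-calibrated model the two rates stay close in every bin that has enough members. Bins with few members are noisy and should be read with caution.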
AI now writes more code than any human can carefully review. Agents open PRs in minutes; the diffs are bigger, more frequent, and land faster than a reviewer can keep up. So review quietly degrades into skimming — you approve, you merge, you hope. The tests are green, so it's probably fine. Right?
Your CI tells you the tests passed. It does not tell you which of today's twenty green PRs — half of them machine-written — is the one that quietly induces an incident three weeks from now. That call gets left to gut feeling, reviewer fatigue, and "looks fine to me."
ROOST gives that judgment a backbone: it puts a calibrated number on each change so you can spend your scarce review attention where the risk actually is. Scrutinize the 4% it flags network/destructive; let the 65% it scores low ride through with a lighter touch. Review smarter, not by reading every line a robot wrote.
Risk tools that do exist tend to fail in one of two ways:
- Uncalibrated rankers — they sort changes "risky → safe" but a "0.8" doesn't mean 80% of anything. You can't set a threshold you trust.
- LLM black boxes — non-deterministic, unauditable, and they'll happily hallucinate a rationale for a number they made up.
ROOST is the opposite of both. The score is calibrated (probabilities you can act on), deterministic (same input → same output, forever), and the core is LLM-free (a test literally asserts it never imports an LLM SDK). Then it remembers every prediction and checks it against what really happened — so the score sharpens on your own history instead of staying a one-shot guess.
Read-only, always. ROOST reads diffs and posts advisory verdicts. It never writes code, never merges, and never blocks a build unless you explicitly opt in with --fail-at. Any LLM is an optional, swappable explainer that can only rephrase the verdict, never change the score. It's off by default.
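A guard like the "never imports an LLM SDK" test mentioned above can be approximated with a static scan of import statements. This is a minimal sketch, not the project's actual test: the deny-list `FORBIDDEN_ROOTS` and the function name `forbidden_imports` are assumptions made for illustration.

```python
import ast

# Hypothetical deny-list; the project's real list is not given in this README.
FORBIDDEN_ROOTS = {"openai", "anthropic"}


def forbidden_imports(source: str) -> set[str]:
    """Return the forbidden top-level packages that `source` imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in FORBIDDEN_ROOTS:
                found.add(root)
    return found
```

A test could read every `.py` file under the core package with `pathlib.Path.rglob`, pass its text to `forbidden_imports`, and assert that the result is empty. A static scan does not catch dynamic imports such as `importlib.import_module`, so it is a floor rather than a proof.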
Most "AI for code" projects ask you to take their metrics on faith. We did the opposite — and this is the part we're proudest of.
We wrote down the pass/fail bar before we saw any results, committed it to git, and reported against it honestly. A clean FAIL would have been just as publishable as a PASS.
It passed:
| What we measured | Result | Bar we set in advance |
|---|---|---|
| Top-20% riskiest changes vs. base rate | 3.2× more fix-inducing | ≥ 2.0× |
| Beats a "just count lines changed" baseline (PR-AUC) | +0.149 | ≥ 0.05 |
| Calibration error (Brier) | 0.087 | < 0.118 (base rate) |
| Ranking quality (ROC-AUC) | 0.873 | — |
| Generalizes to a repo it never trained on (leave-one-out) | 2.8× (every repo > 2.0) | cold-start sanity |
| Holds up under noisy labels | 2.4× | robustness |
And the caveats we don't hide: labels are an SZZ public-OSS proxy, not real production incidents; OSS ≠ your private code; the bespoke "blast-radius" feature honestly didn't earn its place, so we dropped it. We hold the same bar for new ideas: an experimental path-signal set (slim_paths) improves discrimination (PR-AUC +0.017) and calibration over the noise band, but its effort-aware lift gain stays within noise — a qualified result we report rather than dress up. The full warts-and-all findings log is in docs/DECISIONS.md; intended use and limits in the model card.
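Two rows of the results table can be written down in a few lines. The sketch below assumes "top-20% lift" means the fix-inducing rate among the highest-scored fifth of changes divided by the overall base rate, and that the Brier bar is the score of a predictor that always outputs the base rate. The project's evaluation code may define these differently (for example, ranking by review effort rather than by change count). The toy data is invented.

```python
def top_fraction_lift(scores, labels, fraction=0.20):
    """Fix-inducing rate in the top `fraction` of scores, relative to the base rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


def brier_score(scores, labels):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def base_rate_brier(labels):
    """Brier score of always predicting the base rate p, which equals p * (1 - p)."""
    p = sum(labels) / len(labels)
    return p * (1 - p)


# Toy data, invented for illustration: 10 changes, 2 of them fix-inducing.
toy_scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print("lift:", top_fraction_lift(toy_scores, toy_labels))
print("brier:", brier_score(toy_scores, toy_labels))
print("base-rate brier:", base_rate_brier(toy_labels))
```

A model only clears the Brier bar if its score is lower than the base-rate figure, because a constant guess at the base rate needs no model at all.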
Calibration is a first-class output, not a footnote — the score comes from isotonic-calibrated LightGBM on a strict time-ordered split (never shuffled — temporal leakage is a pre-registered failure mode, not a thing we discovered later).
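The point of a strict time-ordered split is that the model never trains on changes newer than the ones it is evaluated on. A minimal sketch of that idea follows. The `Change` record, the three-way train/calibration/test layout, and the 60/20/20 fractions are assumptions for illustration, not the project's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    sha: str
    committed_at: int  # Unix timestamp of the commit
    label: int         # 1 = fix-inducing, 0 = clean


def time_ordered_split(changes, train_frac=0.6, calib_frac=0.2):
    """Split oldest -> newest into (train, calibration, test) with no shuffling."""
    # Sort by time; break ties on sha so the result is deterministic.
    ordered = sorted(changes, key=lambda c: (c.committed_at, c.sha))
    n = len(ordered)
    train_end = int(n * train_frac)
    calib_end = train_end + int(n * calib_frac)
    return ordered[:train_end], ordered[train_end:calib_end], ordered[calib_end:]


# Toy history, deliberately given out of order.
toy_changes = [
    Change(sha=f"c{t}", committed_at=t, label=t % 2)
    for t in [7, 2, 9, 4, 1, 10, 3, 8, 5, 6]
]
train, calib, test = time_ordered_split(toy_changes)
print(len(train), len(calib), len(test))
```

Fitting the classifier on the oldest slice, the calibrator on the middle slice, and reporting metrics on the newest slice keeps every later step blind to its own future.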
Requires Python 3.11–3.12 and uv. No API key, no cloud, no LLM.
make setup # uv: pinned py3.12 venv + deps
make init # create the local Pellet ledger
# Score one commit of a local checkout (the shipped model is baked in):
roost ci --commit HEAD --format md
# …or score an entire repo you've never seen:
roost score-repo https://github.com/some/repo
That's it. Nothing leaves your machine.
Drop it into GitHub Actions (advisory, ~10 lines)
name: Augur risk
on: [pull_request]
jobs:
  risk:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # full history is required
      - uses: ninoxAI/roost@v1
        with:
          fail-at: ""              # advisory by default; set e.g. 0.9 to block

--format md drops straight into a PR comment or $GITHUB_STEP_SUMMARY; --format json pipes to jq. GitLab CI, a self-contained Docker image (model baked in, nothing leaves your infra), and a read-only GitHub App are in docs/ci.md, docs/github-app.md, and docs/deploy.md.
Every step is deterministic from the committed repo list + seed. The LLM is off the whole way.
make ingest # mine 11 OSS repos → 'change' (28,009 commits / 24,375 non-merge)
make label # SZZ fix-inducing labels → 'outcome'
make features # 11 language-agnostic Kamei features → features.parquet
make train # calibrated LightGBM, strict time-ordered split → 'prediction'
make eval # full report + PASS/FAIL vs the pre-registered bar
make test # the test suite, LLM disabled

Beyond the pipeline targets, the CLI exposes roost robustness (multi-seed / rolling-origin CV / ablations), roost thresholds (data-driven tiers), and roost package (shippable cold-start model + model card). make ablation-paths runs the research track: the experimental slim_paths set vs the shipping slim baseline. See Command reference for the full surface.
mine → label (SZZ) → features (Kamei) → calibrate → score + tier → verdict
│
every score is kept and later checked against the real outcome
▼
Pellet ledger: change → prediction → action → outcome → recurrence
| Step | What happens |
|---|---|
| mine | Read-only PyDriller pass over a repo's git history → diff stats, sanitized messages, parents. |
| label | An SZZ-style blame trace marks each past change clean or fix_inducing — the training signal. |
| features | 11 language-agnostic Kamei change metrics: diffusion, size, purpose, history. No import graph, no leakage. |
| calibrate | LightGBM + isotonic CalibratedClassifierCV on a strict time-ordered split. |
| score | A calibrated probability + a risk tier on the read_only → write → execute → network → destructive scale. |
| record | The scored change + its prediction land in the Pellet ledger, ready to be graded later. |
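The README does not list the 11 features individually, but the "diffusion" group in the Kamei family typically includes an entropy measure of how a change's modified lines are spread across files. The sketch below shows that one metric under that assumption. The function name `change_entropy` is invented here and may not match the project's feature code.

```python
import math


def change_entropy(lines_changed_per_file):
    """Shannon entropy (in bits) of a change's modified lines across files.

    0.0 means the whole change sits in one file; higher values mean the
    edit is scattered more evenly over more files.
    """
    total = sum(lines_changed_per_file)
    if total == 0:
        return 0.0
    entropy = 0.0
    for lines in lines_changed_per_file:
        if lines > 0:
            p = lines / total
            entropy -= p * math.log2(p)
    return entropy


print(change_entropy([100]))             # focused edit in a single file
print(change_entropy([25, 25, 25, 25]))  # evenly spread over four files
```

Because it only needs per-file line counts from the diff, a metric like this works the same way for any programming language.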
Each tier is a documented operating point you choose between — be conservative or aggressive on purpose, not by accident. The cut points below are the data-driven thresholds from the shipped model card; roost thresholds re-derives them for your own data, and tier_thresholds in configs/default.yaml sets the advisory defaults.
| tier | score ≥ | precision | recall | share of changes |
|---|---|---|---|---|
| write | 0.086 | 0.31 | 0.99 | 65% |
| execute | 0.200 | 0.51 | 0.77 | 31% |
| network | 0.750 | 0.89 | 0.17 | 4% |
| destructive | 1.000 | 1.00 | 0.09 | 2% |
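Read as a lookup, the cut points in the tier table map a calibrated score to the highest tier whose threshold it meets. The sketch below hard-codes the table's values and assumes that anything below the lowest cut point is read_only, which the table does not state explicitly. The configured tier_thresholds in configs/default.yaml may differ from these numbers.

```python
from bisect import bisect_right

# Cut points copied from the tier table; boundaries are inclusive ("score >=").
TIER_CUTS = [0.086, 0.200, 0.750, 1.000]
# Assumption: scores below the first cut point fall into "read_only".
TIER_NAMES = ["read_only", "write", "execute", "network", "destructive"]


def score_to_tier(score: float) -> str:
    """Return the highest tier whose cut point is <= score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be a probability in [0, 1], got {score}")
    # bisect_right counts how many cut points are <= score.
    return TIER_NAMES[bisect_right(TIER_CUTS, score)]


for example in (0.05, 0.086, 0.5, 0.8, 1.0):
    print(example, score_to_tier(example))
```

Keeping the cut points in one sorted list makes it easy to swap in thresholds re-derived for your own data without touching the lookup logic.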
A prediction nobody checks is a horoscope. Pellet is the local system-of-record that closes the loop: every score is stored and later compared to what actually happened, so you build a verifiable track record instead of a stream of unaccountable guesses.
change → prediction → action → outcome → recurrence
(what landed) (Augur's call) (who acted) (what really happened) (did it come back?)
- Built to grow up. action and recurrence already exist in the schema (empty for now), so wiring in real incident/rollback signals or autonomous-agent actions later needs no migration: the outcome label just upgrades from an OSS proxy to production truth.
- No secrets, no PII. author_id is a salted hash; raw names/emails never land in the ledger; commit messages are sanitized at ingest. Public data only.
- Zero infra. It's a local DuckDB file (data/ledger.duckdb): columnar, regenerable, with content-hash keys that make every re-run byte-identical.
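The two ledger properties above, salted author hashes and content-hash keys, can both be sketched with the standard library. This is an illustration under assumptions: the function names, the use of HMAC-SHA256 for salting, and the choice of key fields are not taken from the project's schema.

```python
import hashlib
import hmac


def author_id(email: str, salt: bytes) -> str:
    """Keyed hash of a normalized email, so the raw address is never stored."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(salt, normalized, hashlib.sha256).hexdigest()


def content_key(*parts: str) -> str:
    """Deterministic row key derived only from the row's content."""
    digest = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


# Hypothetical values, for illustration only.
salt = b"example-salt"
print(author_id("Dev@Example.org", salt))
print(content_key("https://github.com/some/repo", "abc123"))
```

Because the key depends only on content, re-ingesting the same commit produces the same key, which is what lets a re-run overwrite rows instead of duplicating them.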
The CLI is installed as roost (uv run roost <cmd>). Pipeline commands have matching make targets; the rest are run directly.
| Command | Make target | What it does |
|---|---|---|
| roost init | make init | Create + migrate the Pellet ledger. --reset recreates it. |
| roost info | make info | Show ledger row counts, seed, and explainer status. |
| roost ingest | make ingest | Mine the configured OSS repos into the change table. |
| roost label | make label | Write SZZ fix-inducing labels into outcome. |
| roost features | make features | Build the Kamei feature matrix → features.parquet. |
| roost train | make train | Train the calibrated LightGBM model → prediction. |
| roost eval | make eval | Honest eval + PASS/FAIL vs the pre-registered bar. |
| roost robustness | - | Multi-seed bands, rolling-origin CV, ablations, importance. |
| roost thresholds | - | Derive score→tier cut points from calibration-slice targets. |
| roost package | - | Build a shippable cold-start model bundle + model card. |
| roost ci | - | Score one commit of a local checkout for a CI pipeline. |
| roost score-repo <url> | - | Score a repo Augur has never seen with the cold-start model. |
| roost comment | make comment | Render the deterministic risk comment for a change. |
| roost serve | - | Local webhook simulator (needs the serve extra). |
| roost version | - | Print the version. |
roost ci is advisory by default (--warn-at 0.6); pass --fail-at with a threshold to make it return a non-zero exit code for changes scored that high. The optional LLM explainer is off everywhere unless you pass --explainer (or set explainer.enabled in config) and install the llm extra.
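The advisory-versus-blocking behavior can be pictured as a small decision function. This is a sketch of the idea only: the function name `ci_exit_code`, the status strings, the exit code value 1, and the inclusive comparisons are all assumptions, not the CLI's documented behavior.

```python
def ci_exit_code(score, warn_at=0.6, fail_at=None):
    """Return (exit_code, status) for a scored change.

    The exit code is non-zero only when a fail threshold was explicitly
    supplied and the score reaches it; otherwise the result is advisory.
    """
    if fail_at is not None and score >= fail_at:
        return 1, "fail"
    if score >= warn_at:
        return 0, "warn"
    return 0, "ok"


print(ci_exit_code(0.73))               # advisory: warns but does not block
print(ci_exit_code(0.95, fail_at=0.9))  # opted in: blocks the build
```

Keeping the blocking path behind an explicit argument mirrors the read-only default: nothing fails a build unless someone chose a threshold on purpose.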
ROOST is one corner of a deliberate design. Today's release builds and honestly evaluates the first two pieces; the rest are designed-for, not built.
| Module | Role | Status |
|---|---|---|
| AUGUR | score — calibrated risk over change features, before a change lands | here today |
| PELLET | record — the outcome ledger / system-of-record | here today |
| PARLIAMENT | grade — cross-vendor evaluation of other AI-ops agents | designed |
| TALON | gate — a permissioned write layer, earned only once Augur proves the bar on your own history | designed |
The thesis: autonomous agents are unreliable, so the layer that measures and bounds them must itself be deterministic and auditable. LLMs only ever show up as bounded, optional, swappable parts — never load-bearing decision logic.
src/roost/
ledger/ Pellet schema, migrations, deterministic ids, DuckDB wrapper
ingest/ repo mining (PyDriller)
labeling/ SZZ fix-inducing labels
features/ Kamei change features
model/ calibrated LightGBM, feature sets, packaging, thresholds
evaluation/ PR-AUC, calibration, effort-aware lift, leave-one-repo-out, robustness
render/ deterministic risk comment
explain/ optional LLM explainer (no-op default)
serve/ cold-start scoring + local webhook simulator
models/ shipped cold-start model bundle
configs/ default.yaml, repos.yaml
docs/ spec, decisions, model card, CI / deploy / GitHub-App guides
ROOST is young and contributions move it forward fast. Whether you fix a typo or add a whole language to the feature extractor, you're welcome here — see CONTRIBUTING.md for the full guide.
Good places to start:
- Score a new language. The feature extractor is intentionally language-agnostic — help us validate it on Go, Rust, TypeScript, Java.
- Add a repo to the evaluation set. More repos = a more honest, more general model. Mixed sizes/domains/languages especially.
- Try a calibration method. Beat isotonic on the reliability diagram without leaking time.
- Wire up a real outcome source. A connector that upgrades Pellet's outcome from the SZZ proxy to genuine incident/rollback signals.
- Reproduce a result and tell us if it doesn't hold. Honest negative findings are first-class here.
Every PR runs the test suite with the LLM off — the deterministic core must stay deterministic. The fastest way to get a change merged is a reproducible command and a test.
New contributors are welcome on Discord — say hi, ask anything, or bring a repo you want scored.
Apache 2.0 — free to use, self-host, fork, and build on, in open or closed projects alike.
Predict honestly. Record everything. Touch nothing.