ROOST

Know which commit is going to break production — before you merge it.

A read-only, calibrated risk score for every code change, plus an honest ledger
of what actually happened. Deterministic. LLM-free at the core. It never touches production.

What it looks like

Open a pull request. Before a human even reads the diff, ROOST posts a verdict:

┌─ Augur risk ───────────────────────────────────────────────┐
│  73%   tier: network        ⚠ high risk                     │
│                                                             │
│  Why this scored high:                                      │
│   • change is spread across 5 subsystems (high diffusion)   │
│   • touches files with 4 prior fix-inducing changes         │
│   • large, scattered diff — not a focused edit              │
│                                                             │
│  73% of changes that look like this needed a fix later.     │
└─────────────────────────────────────────────────────────────┘

No LLM wrote that. It's a calibrated probability from a trained model — reproducible from a fixed seed, byte-for-byte. When ROOST says 73%, ~73% of changes like it really did get fixed later. That last sentence is the whole point: the number means something.
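To make "the number means something" concrete, here is a minimal sketch of a reliability check: bucket past predictions by score and compare each bucket's mean predicted score with the fix rate actually observed. The function name `reliability_bins` and the equal-width binning are assumptions for illustration, not ROOST's own evaluation code.

```python
def reliability_bins(scores, labels, n_bins=10):
    """Return (mean predicted score, observed fix rate, count) per non-empty bin.

    scores: predicted probabilities in [0, 1]
    labels: 1 if the change later needed a fix, else 0
    """
    # One accumulator per bin: [sum of scores, sum of labels, count]
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        # Equal-width bins; a score of exactly 1.0 falls into the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index][0] += score
        bins[index][1] += label
        bins[index][2] += 1
    return [
        (score_sum / count, label_sum / count, count)
        for score_sum, label_sum, count in bins
        if count
    ]
```

For a well-calibrated model the first two values in each tuple stay close: a bucket whose predictions average about 0.73 should show a fix rate near 0.73.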

Why this exists

AI now writes more code than any human can carefully review. Agents open PRs in minutes; the diffs are bigger and more frequent, and they land faster than a reviewer can keep up with. So review quietly degrades into skimming: you approve, you merge, you hope. The tests are green, so it's probably fine. Right?

Your CI tells you the tests passed. It does not tell you which of today's twenty green PRs — half of them machine-written — is the one that quietly induces an incident three weeks from now. That call gets left to gut feeling, reviewer fatigue, and "looks fine to me."

ROOST gives that judgment a backbone: it puts a calibrated number on each change so you can spend your scarce review attention where the risk actually is. Scrutinize the 4% it flags network/destructive; let the 65% it scores low ride through with a lighter touch. Review smarter, not by reading every line a robot wrote.

Risk tools that do exist tend to fail in one of two ways:

Uncalibrated rankers — they sort changes "risky → safe" but a "0.8" doesn't mean 80% of anything. You can't set a threshold you trust.
LLM black boxes — non-deterministic, unauditable, and they'll happily hallucinate a rationale for a number they made up.

ROOST is the opposite of both. The score is calibrated (probabilities you can act on), deterministic (same input → same output, forever), and the core is LLM-free (a test literally asserts it never imports an LLM SDK). Then it remembers every prediction and checks it against what really happened — so the score sharpens on your own history instead of staying a one-shot guess.

Read-only, always. ROOST reads diffs and posts advisory verdicts. It never writes code and never merges, and it blocks a build only if you explicitly opt in with --fail-at. Any LLM is an optional, swappable explainer that can only rephrase the verdict, never change the score. It's off by default.

Why you can trust the number

Most "AI for code" projects ask you to take their metrics on faith. We did the opposite — and this is the part we're proudest of.

We wrote down the pass/fail bar before we saw any results, committed it to git, and reported against it honestly. A clean FAIL would have been just as publishable as a PASS.

It passed:

What we measured	Result	Bar we set in advance
Top-20% riskiest changes vs. base rate	3.2× more fix-inducing	≥ 2.0×
Beats a "just count lines changed" baseline (PR-AUC)	+0.149	≥ 0.05
Calibration error (Brier)	0.087	< 0.118 (base rate)
Ranking quality (ROC-AUC)	0.873	—
Generalizes to a repo it never trained on (leave-one-out)	2.8× (every repo > 2.0)	cold-start sanity
Holds up under noisy labels	2.4×	robustness
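For readers who want to sanity-check the first and third rows on their own predictions, here is a minimal sketch of the two quantities. It rests on two assumptions: the "top 20%" is taken by count of changes (not weighted by lines of code), and the 0.118 bar is the Brier score of always predicting the base rate. The function names are illustrative, not ROOST's API.

```python
def top_fraction_lift(scores, labels, fraction=0.2):
    """Fix-inducing rate among the top `fraction` of scores, divided by the base rate."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when no change is fix-inducing")
    # Rank highest score first; sorted() is stable, so ties keep input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    return top_rate / base_rate


def brier_score(scores, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def base_rate_brier(labels):
    """Brier score of a model that always predicts the base rate p: p * (1 - p)."""
    p = sum(labels) / len(labels)
    return p * (1 - p)
```

A lift of 3.2 then reads as: changes in the riskiest fifth were fix-inducing 3.2 times as often as changes overall. A Brier score below `base_rate_brier(labels)` means the scores carry more information than the base rate alone.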

And the caveats we don't hide: labels are an SZZ proxy mined from public OSS, not real production incidents; OSS ≠ your private code; and the bespoke "blast-radius" feature didn't earn its place, so we dropped it. We hold new ideas to the same bar: an experimental path-signal set (slim_paths) improves discrimination (PR-AUC +0.017) and calibration beyond the noise band, but its effort-aware lift gain stays within noise. That is a qualified result, and we report it as one rather than dress it up. The full warts-and-all findings log is in docs/DECISIONS.md; intended use and limits are in the model card.

Calibration is a first-class output, not a footnote — the score comes from isotonic-calibrated LightGBM on a strict time-ordered split (never shuffled — temporal leakage is a pre-registered failure mode, not a thing we discovered later).
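The two ideas in that sentence can be sketched without any dependencies. The block below is a stand-in for illustration, not the shipped pipeline (which uses LightGBM with scikit-learn's isotonic calibration): `time_ordered_split` cuts the data once by timestamp instead of shuffling, and `fit_isotonic` is a plain pool-adjacent-violators pass that maps raw scores to monotone, outcome-matched probabilities. The `timestamp` key and both function names are assumptions.

```python
from bisect import bisect_left


def time_ordered_split(rows, train_fraction=0.8):
    """Sort by timestamp and cut once: earlier rows train, later rows evaluate."""
    ordered = sorted(rows, key=lambda row: row["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def fit_isotonic(raw_scores, labels):
    """Pool-adjacent-violators fit. Returns (upper_bounds, means) of a step function."""
    # Pool identical raw scores first so ties share one calibrated value.
    grouped = {}
    for score, label in zip(raw_scores, labels):
        total, count = grouped.get(score, (0.0, 0))
        grouped[score] = (total + label, count + 1)

    blocks = []  # each block: [sum of labels, count, highest raw score in block]
    for score in sorted(grouped):
        total, count = grouped[score]
        blocks.append([total, count, score])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def calibrate(raw_score, upper_bounds, means):
    """Look up the calibrated probability for one raw score (step function)."""
    index = bisect_left(upper_bounds, raw_score)
    return means[min(index, len(means) - 1)]
```

Fitting the calibrator only on rows that come after the training cut is what keeps future information out of the model. Note that this sketch returns a step function, whereas scikit-learn's isotonic regression interpolates between fitted points.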

Quick start

Requires Python 3.11–3.12 and uv. No API key, no cloud, no LLM.

make setup                          # uv: pinned py3.12 venv + deps
make init                           # create the local Pellet ledger

# Score one commit of a local checkout (the shipped model is baked in):
roost ci --commit HEAD --format md

# …or score an entire repo you've never seen:
roost score-repo https://github.com/some/repo

That's it — nothing leaves your machine.

Drop it into GitHub Actions (advisory, ~10 lines)

name: Augur risk
on: [pull_request]
jobs:
  risk:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }      # full history is required
      - uses: ninoxAI/roost@v1
        with:
          fail-at: ""                 # advisory by default; set e.g. 0.9 to block

--format md drops straight into a PR comment or $GITHUB_STEP_SUMMARY; --format json pipes to jq. GitLab CI, a self-contained Docker image (model baked in, nothing leaves your infra), and a read-only GitHub App are in docs/ci.md, docs/github-app.md, and docs/deploy.md.

Reproduce the model from scratch

Every step is deterministic from the committed repo list + seed. The LLM is off the whole way.

make ingest      # mine 11 OSS repos → 'change'   (28,009 commits / 24,375 non-merge)
make label       # SZZ fix-inducing labels → 'outcome'
make features    # 11 language-agnostic Kamei features → features.parquet
make train       # calibrated LightGBM, strict time-ordered split → 'prediction'
make eval        # full report + PASS/FAIL vs the pre-registered bar
make test        # the test suite, LLM disabled

Beyond the pipeline targets, the CLI exposes roost robustness (multi-seed / rolling-origin CV / ablations), roost thresholds (data-driven tiers), and roost package (shippable cold-start model + model card). make ablation-paths runs the research track: the experimental slim_paths set vs the shipping slim baseline. See Command reference for the full surface.

How it works

mine → label (SZZ) → features (Kamei) → calibrate → score + tier → verdict
                                                         │
        every score is kept and later checked against the real outcome
                                                         ▼
        Pellet ledger:  change → prediction → action → outcome → recurrence

Step	What happens
mine	Read-only PyDriller pass over a repo's git history → diff stats, sanitized messages, parents.
label	An SZZ-style blame trace marks each past change `clean` or `fix_inducing` — the training signal.
features	11 language-agnostic Kamei change metrics: diffusion, size, purpose, history. No import graph, no leakage.
calibrate	LightGBM + isotonic `CalibratedClassifierCV` on a strict time-ordered split.
score	A calibrated probability + a risk tier on the `read_only → write → execute → network → destructive` scale.
record	The scored change + its prediction land in the Pellet ledger, ready to be graded later.
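As a rough illustration of the features step, here is a sketch of a few diffusion and size metrics in the style of Kamei et al., computed from per-file diff stats. Which 11 metrics ROOST ships, and how it defines a subsystem, are not spelled out here, so the choices below (top-level directory as subsystem, unnormalized Shannon entropy, the name `change_features`) are assumptions.

```python
import math
from pathlib import PurePosixPath


def change_features(files):
    """Diffusion and size metrics for one change.

    files: iterable of (path, lines_added, lines_deleted) tuples.
    """
    subsystems, directories = set(), set()
    churn = []
    lines_added = lines_deleted = 0
    for path, added, deleted in files:
        posix = PurePosixPath(path)
        # The top-level directory stands in for "subsystem"; root-level files share ".".
        subsystems.add(posix.parts[0] if len(posix.parts) > 1 else ".")
        directories.add(str(posix.parent))
        lines_added += added
        lines_deleted += deleted
        churn.append(added + deleted)

    total = sum(churn)
    # Shannon entropy (bits) of how the changed lines spread across files:
    # 0 for a single-file edit, higher for a scattered diff.
    entropy = 0.0
    if total:
        entropy = sum(-(c / total) * math.log2(c / total) for c in churn if c)

    return {
        "subsystems": len(subsystems),
        "directories": len(directories),
        "files": len(churn),
        "lines_added": lines_added,
        "lines_deleted": lines_deleted,
        "entropy": entropy,
    }
```

A high subsystem count and high entropy correspond to the "spread across 5 subsystems" and "large, scattered diff" reasons in the example verdict. Everything here depends only on paths and line counts, which is what keeps the features language-agnostic.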

Risk tiers

Each tier is a documented operating point, so you can be conservative or aggressive on purpose, not by accident. The cut points below are the data-driven thresholds from the shipped model card; roost thresholds re-derives them for your own data, and tier_thresholds in configs/default.yaml sets the advisory defaults.

tier	score ≥	precision	recall	share of changes
`write`	0.086	0.31	0.99	65%
`execute`	0.200	0.51	0.77	31%
`network`	0.750	0.89	0.17	4%
`destructive`	1.000	1.00	0.09	2%
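Reading the table as a lookup, a score maps to the highest tier whose cut point it reaches. The sketch below hard-codes the cut points from the table above and assumes that anything under the `write` cut falls into `read_only`, the lowest rung of the scale; the table itself does not list that row. The names `TIER_CUTS` and `tier_for` are illustrative.

```python
from bisect import bisect_right

# Cut points copied from the table above, in ascending order.
TIER_CUTS = [
    (0.086, "write"),
    (0.200, "execute"),
    (0.750, "network"),
    (1.000, "destructive"),
]


def tier_for(score, cuts=TIER_CUTS):
    """Return the highest tier whose cut point is <= score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    thresholds = [threshold for threshold, _ in cuts]
    reached = bisect_right(thresholds, score)  # number of cut points <= score
    # Assumption: scores below the first cut point are read_only.
    return "read_only" if reached == 0 else cuts[reached - 1][1]
```

Passing a different `cuts` list is how you would try thresholds re-derived for your own data; the advisory defaults in configs/default.yaml may differ from the values hard-coded here.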

The Pellet ledger

A prediction nobody checks is a horoscope. Pellet is the local system-of-record that closes the loop: every score is stored and later compared to what actually happened, so you build a verifiable track record instead of a stream of unaccountable guesses.

change         →  prediction      →  action       →  outcome                 →  recurrence
(what landed)     (Augur's call)     (who acted)     (what really happened)     (did it come back?)

Built to grow up. action and recurrence already exist in the schema (empty for now), so wiring in real incident/rollback signals or autonomous-agent actions later needs no migration — the outcome label just upgrades from an OSS proxy to production truth.
No secrets, no PII. author_id is a salted hash; raw names/emails never land in the ledger; commit messages are sanitized at ingest. Public data only.
Zero infra. It's a local DuckDB file (data/ledger.duckdb) — columnar, regenerable, with content-hash keys that make every re-run byte-identical.
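The two properties above that are easiest to get subtly wrong are the content-hash keys and the salted author hash, so here is a minimal sketch of both using only the standard library. SHA-256, the canonical-JSON encoding, and the function names are assumptions for illustration; the ledger's actual key scheme may differ.

```python
import hashlib
import json


def content_id(record):
    """Deterministic key: hash a canonical JSON encoding of the record."""
    # Sorted keys and fixed separators make the bytes independent of dict order.
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def author_id(email, salt):
    """Salted hash of a normalized email; the raw address is never stored."""
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.sha256(salt + normalized).hexdigest()
```

Because the key is derived from content rather than from an auto-increment counter or a timestamp, re-running ingest on the same history produces the same ids. The salt must stay fixed across runs for author ids to stay stable.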

Command reference

The CLI is installed as roost (uv run roost <cmd>). Pipeline commands have matching make targets; the rest are run directly.

Command	Make target	What it does
`roost init`	`make init`	Create + migrate the Pellet ledger. `--reset` recreates it.
`roost info`	`make info`	Show ledger row counts, seed, and explainer status.
`roost ingest`	`make ingest`	Mine the configured OSS repos into the `change` table.
`roost label`	`make label`	Write SZZ fix-inducing labels into `outcome`.
`roost features`	`make features`	Build the Kamei feature matrix → `features.parquet`.
`roost train`	`make train`	Train the calibrated LightGBM model → `prediction`.
`roost eval`	`make eval`	Honest eval + PASS/FAIL vs the pre-registered bar.
`roost robustness`	—	Multi-seed bands, rolling-origin CV, ablations, importance.
`roost thresholds`	—	Derive score→tier cut points from calibration-slice targets.
`roost package`	—	Build a shippable cold-start model bundle + model card.
`roost ci`	—	Score one commit of a local checkout for a CI pipeline.
`roost score-repo <url>`	—	Score a repo Augur has never seen with the cold-start model.
`roost comment`	`make comment`	Render the deterministic risk comment for a change.
`roost serve`	—	Local webhook simulator (needs the `serve` extra).
`roost version`	—	Print the version.

roost ci is advisory by default (--warn-at 0.6); pass --fail-at with a score threshold (e.g. 0.9) to make high-scoring changes return a non-zero exit code. The optional LLM explainer is off everywhere unless you pass --explainer (or set explainer.enabled in config) and install the llm extra.
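One way to picture the warn/fail behaviour is the small decision function below. It is a sketch of the semantics as described, not the CLI's source: the `>=` comparison, the exit code of 1, and the status labels are assumptions.

```python
from typing import Optional


def ci_outcome(score: float, warn_at: float = 0.6, fail_at: Optional[float] = None):
    """Return (status, exit_code) for one scored change.

    With fail_at unset the result is always advisory (exit code 0).
    """
    if fail_at is not None and score >= fail_at:
        return "fail", 1  # assumed: non-zero exit blocks the build
    if score >= warn_at:
        return "warn", 0  # flagged in the comment, build still passes
    return "ok", 0
```

Under these assumptions the 73% change from the example verdict would warn but pass by default, and would only fail a build if `fail_at` were set to 0.73 or lower.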

The bigger picture

ROOST is one corner of a deliberate design. Today's release builds and honestly evaluates the first two pieces; the rest are designed-for, not built.

Module	Role	Status
AUGUR	score — calibrated risk over change features, before a change lands	here today
PELLET	record — the outcome ledger / system-of-record	here today
PARLIAMENT	grade — cross-vendor evaluation of other AI-ops agents	designed
TALON	gate — a permissioned write layer, earned only once Augur proves the bar on your own history	designed

The thesis: autonomous agents are unreliable, so the layer that measures and bounds them must itself be deterministic and auditable. LLMs only ever show up as bounded, optional, swappable parts — never load-bearing decision logic.

Project layout

src/roost/
  ledger/      Pellet schema, migrations, deterministic ids, DuckDB wrapper
  ingest/      repo mining (PyDriller)
  labeling/    SZZ fix-inducing labels
  features/    Kamei change features
  model/       calibrated LightGBM, feature sets, packaging, thresholds
  evaluation/  PR-AUC, calibration, effort-aware lift, leave-one-repo-out, robustness
  render/      deterministic risk comment
  explain/     optional LLM explainer (no-op default)
  serve/       cold-start scoring + local webhook simulator
  models/      shipped cold-start model bundle
configs/       default.yaml, repos.yaml
docs/          spec, decisions, model card, CI / deploy / GitHub-App guides

Contributing

ROOST is young and contributions move it forward fast. Whether you fix a typo or add a whole language to the feature extractor, you're welcome here — see CONTRIBUTING.md for the full guide.

Good places to start:

Score a new language. The feature extractor is intentionally language-agnostic; help us validate it on Go, Rust, TypeScript, and Java.
Add a repo to the evaluation set. More repos = a more honest, more general model. Mixed sizes/domains/languages especially.
Try a calibration method. Beat isotonic on the reliability diagram without leaking time.
Wire up a real outcome source. A connector that upgrades Pellet's outcome from the SZZ proxy to genuine incident/rollback signals.
Reproduce a result and tell us if it doesn't hold. Honest negative findings are first-class here.

Every PR runs the test suite with the LLM off — the deterministic core must stay deterministic. The fastest way to get a change merged is a reproducible command and a test.

New contributors are welcome on Discord — say hi, ask anything, or bring a repo you want scored.

License

Apache 2.0 — free to use, self-host, fork, and build on, in open or closed projects alike.

Predict honestly. Record everything. Touch nothing.
