Free, local-first regression-catcher. Generates unit tests from a diff, then runs them against the parent and child revs in isolated git worktrees. A test that passes on the parent and fails on the child is executable evidence of a regression.
JitCatch implements ideas from Meta's Just-in-Time Catching Test Generation at Meta (Becker et al., FSE Companion '26), adapted for local developer loops and open-source LLM backends.
- Why JitCatch
- How it works
- Installation
- Quick start
- CLI reference
- LLM providers
- Workflows
- Output
- Project layout
- Supported languages
- Configuration tips
- Development
- Security
- License
Most LLM-generated-test tools stop once a test compiles. That's cheap theater. An untested test is just a plausible-looking string. JitCatch enforces a stronger bar:
- Generate a test targeting the diff.
- Write it into two detached git worktrees. One at the parent rev, one at the child.
- Run it in both worktrees.
- Rank candidates: a test that passes on parent and fails on child is a weak catch. Real, reproducible evidence that the diff changed behavior.
The "weak catch" is the core invariant. Everything else in the pipeline exists to improve the signal-to-noise ratio on top of it:
- Rule-based assessors flag common false-positive patterns (
fp:reflection,fp:flakiness,fp:broken_test_runner). - An LLM-as-judge pass scores each weak catch (
tp_prob,bucket, rationale). - A feedback-driven retry loop targets risks the first round missed, with the prior test's failure output in the prompt.
- An agentic reviewer channel surfaces bugs that test-gen can't reach (mocks, env-coupled paths, untested symbols). Opinion-only findings, kept in a separate section.
- Runtime flake detection re-runs every failing child test N times (default 3). Any flip demotes the candidate with
fp:flake_runtime. - Parallel worktree evaluation runs parent and child tests concurrently per candidate.
- Risk-inference cache under
.jitcatch/cache/risks/keyed on(bundle + lang + model)with a 7-day TTL. Reruns on the same diff skip the LLM round trip. - Per-run token + cost accounting surfaces in the report and on stderr, broken down by stage (risks / tests / judge / review).
Design goals: local-first, zero-config for Ollama, no API keys required for the full offline path (--stub), and deterministic wherever the signal can be expressed as a pattern instead of a prompt.
┌──────────────────────────────────────────────────────────────────────┐
│ jitcatch pr <repo> │
└──────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────┐ ┌──────────────────────┐
│ revs.resolve │────▶│ (parent, child) │ (HEAD~1..HEAD,
│ pr | last | │ │ RevPair │ merge-base,
│ staged | working│ │ │ scratch commit)
└──────────────────┘ └──────────────────────┘
│
▼
┌──────────────────┐
│ context.bundle │ Group changed files by language adapter.
│ (top-N by churn)│ Build a single prompt per group.
└──────────────────┘
│
├────────────────────────────────┐
▼ ▼
┌──────────────────┐ ┌────────────────────┐
│ workflows/intent │ │ workflows/dodgy │
│ risks → tests │ │ mutation-mindset │
└──────────────────┘ └────────────────────┘
│ │
└────────────────┬───────────────┘
▼
┌────────────────────┐
│ WorktreeSandbox │ git worktree add --detach
│ parent / child │ run each test in both
└────────────────────┘
│
▼
┌────────────────────┐ ┌──────────────────────┐
│ assessor/rules │ │ workflows/reviewer │
│ fp:* / tp:* flags │ │ BugBot-style diff │
└────────────────────┘ │ review (+ validator) │
│ └──────────────────────┘
▼ │
┌────────────────────┐ │
│ assessor/judge │ │
│ tp_prob, bucket │ │
└────────────────────┘ │
│ │
▼ │
┌────────────────────┐ │
│ workflows/retry │ │
│ feedback loop for │ │
│ uncaught risks │ │
└────────────────────┘ │
│ │
└─────────────┬─────────────┘
▼
┌─────────────────┐
│ report │
│ html / md / │
│ json + usage │
└─────────────────┘
Requires Python ≥ 3.9, git, and (for the JavaScript adapter) Node ≥ 18.
git clone https://github.com/kushankurdas/jitcatch
cd jitcatch
pip install -e '.[dev]'This installs the jitcatch console script. Verify:
jitcatch --helpRun against a pull request, using the repo's origin default branch as the base:
cd /path/to/your/repo
jitcatch pr .Output lands in .jitcatch/output/: a JSON report is always written. Pass --format html (or md, or all) to also emit a readable report. A summary is printed to stdout.
No API key, no network. Uses the built-in StubClient:
jitcatch pr . --stubZero config when Ollama is running on the default port:
ollama pull qwen2.5-coder:7b
jitcatch pr . --provider ollamaexport ANTHROPIC_API_KEY=sk-ant-...
jitcatch pr . --provider anthropic
# or just:
jitcatch pr . # --provider=auto picks anthropic when ANTHROPIC_API_KEY is setjitcatch <subcommand> <repo> [options]
| Subcommand | Parent rev | Child rev | Use case |
|---|---|---|---|
last |
HEAD~1 |
HEAD |
Smoke-test the commit you just made |
pr [--base <ref>] |
merge-base(base, HEAD) |
HEAD |
Review a whole PR against its base |
staged |
HEAD |
synthetic commit of git diff --cached |
Pre-commit check |
working |
HEAD |
synthetic commit of working tree | Check uncommitted changes |
run --file <f> --parent <r> --child <r> |
explicit | explicit | Single-file, explicit revs |
explain <repo> <id-prefix> |
- | - | Print full detail for a candidate from the latest JSON report (prefix ≥ 4 chars), then open an interactive LLM chat about it |
staged and working create a detached scratch worktree at HEAD, apply your patch there, and commit it. Your index and working tree are never mutated.
explain reads the most recently modified jitcatch-*.json under .jitcatch/output/ (override with --report <path>). JSON is always emitted, so explain works after any run. In an interactive terminal it opens a colored LLM REPL seeded with the candidate's full context. Ask follow-ups ("is this a real regression?", "what would a proper fix look like?") without leaving the terminal. Pass --no-chat or pipe the output to skip the REPL and get the plain candidate detail block instead.
| Flag | Default | Description |
|---|---|---|
--workflow {intent,dodgy,both} |
both |
Which test-gen strategies to run |
--provider {auto,anthropic,ollama,openai-compat} |
auto |
LLM backend. auto → anthropic if ANTHROPIC_API_KEY, else ollama |
--base-url <url> |
- | Required for openai-compat. Defaults for ollama: http://localhost:11434/v1 |
--model <name> |
provider-aware | Default model. claude-sonnet-4-6 for anthropic, qwen2.5-coder:7b otherwise |
--model-risks / --model-tests / --model-judge / --model-review |
- | Per-stage model overrides |
--stub |
off | Offline mode, no LLM calls |
--no-judge |
off | Skip LLM-as-judge scoring pass |
--no-review |
off | Skip the agentic reviewer (BugBot-style) |
--no-retry |
off | Skip feedback-driven retry rounds |
--max-retries <n> |
2 |
Retry rounds for uncaught risks |
--max-retry-risks <n> |
8 |
Per-round risk cap - bounds LLM spend |
--skip-validator |
off | Keep every reviewer finding (don't drop FPs) |
--with-callers |
off | Include caller source as usage context |
--max-callers <n> |
5 |
Cap on callers added per file |
--max-files <n> |
20 |
Cap on files per adapter group (by churn) |
--max-bytes <n> |
200_000 |
Cap on bundle prompt size |
--timeout <sec> |
60 |
Per-test execution timeout |
--flake-check <n> |
3 |
Extra child re-runs to confirm a failure is deterministic. Any flip tags the candidate fp:flake_runtime. Set 0 to disable |
--no-cache |
off | Bypass the risk-inference cache for this run |
--clear-cache |
off | Purge .jitcatch/cache/ before running |
--llm-timeout <sec> |
120 |
HTTP read timeout per LLM call |
--max-tokens <n> |
model ceiling | Per-call output token cap |
--filename <name> |
timestamped | Base name for report files |
--format <list> |
- | Comma-separated human-readable formats to emit alongside the always-on JSON: html, md, all. Omit the flag → JSON only. Example: --format html,md |
--verbose |
off | Write per-call LLM transcripts to .jitcatch/logs/ |
--log-dir <path> |
- | Override LLM transcript directory |
| Provider | Flag | Auth | Default model |
|---|---|---|---|
| Anthropic | --provider anthropic |
ANTHROPIC_API_KEY env var |
claude-sonnet-4-6 |
| Ollama | --provider ollama |
none (local) | qwen2.5-coder:7b |
| OpenAI-compatible | --provider openai-compat --base-url <url> |
OPENAI_API_KEY if required by endpoint |
qwen2.5-coder:7b |
| Stub (offline) | --stub |
- | - |
The openai-compat provider works with any chat-completions endpoint:
- LM Studio, vLLM, LocalAI
- Groq, Together, Fireworks, OpenRouter
- Self-hosted gateways
Why Ollama gets its own client: JitCatch routes Ollama through the native /api/chat endpoint so format: "json" and num_ctx are honored. The generic /v1 OpenAI-compat shim silently drops both, which breaks strict JSON-schema prompts on many local models.
Different stages have different cost/quality profiles. Use cheaper models for bulk output and reasoning-heavy models where it matters:
jitcatch pr . \
--provider openai-compat --base-url https://api.together.xyz/v1 \
--model-risks meta-llama/Meta-Llama-3.1-70B-Instruct \
--model-tests meta-llama/Meta-Llama-3.1-8B-Instruct \
--model-judge meta-llama/Meta-Llama-3.1-70B-Instruct \
--model-review meta-llama/Meta-Llama-3.1-70B-Instruct- Ask the LLM to enumerate risks the diff introduces (null deref, off-by-one, contract change, etc.), each tagged with
[file:line]. - Generate one test per risk.
Best for: structured diffs where intent can be reasoned about from the code.
- Skip risk inference.
- Directly ask for tests that would detect the diff as if it were a bug. The test should pin the parent's behavior and fail on any change.
Best for: refactors, small tweaks, cases where the intent is "preserve behavior".
--workflow both (the default) runs both and merges candidates.
Runs independently of test-gen. The reviewer reads the bundle and flags suspected bugs with a rationale. A second LLM validator pass filters obvious false positives (drops) or reduces confidence (downgrades). Findings with validator_verdict ∈ {keep, downgrade} are kept.
Why a separate channel: some bugs can't be exercised by a generated test. A mock swallows the error, an env var stubs out the broken path, or the buggy function is never called in any test. The reviewer surfaces those without pretending they come with executable evidence. Findings appear in their own Markdown section and never outrank test-backed weak catches in the report.
After the first round of tests runs, JitCatch diffs the risk list against the weak catches. For each uncaught risk it generates a follow-up test, including the prior test's failure output as feedback. Capped by --max-retries and --max-retry-risks to bound cost.
Reports land under <repo>/.jitcatch/output/. The JSON report is always written; readable formats are opt-in via --format:
jitcatch-<timestamp>.json. Always written. Machine-readable, sorted so weak catches come first (byfinal_scoredescending, non-weak appended). Consumed byjitcatch explain.jitcatch-<timestamp>.html. With--format html(orall). Self-contained single-file HTML: inlines all CSS, no CDN, works offline. Color-coded diffs, severity badges, collapsed false-positive section.jitcatch-<timestamp>.md. With--format md(orall). Same groupings as HTML.
All three group findings into:
- Test-backed findings (weak catches). Ranked by severity × confidence.
- Reviewer-only findings. Opinion-based, never outrank test-backed.
- Likely false positives. Low-signal entries collapsed to keep the top of the report clean.
A LLM usage panel (tokens, cost, per-stage breakdown) renders in every format when a real LLM client was used. --stub runs omit it.
Each candidate carries:
id. Stable 12-hex hash (workflow + test name + sorted target files). Pass any 4+ char prefix tojitcatch explain.parent_result/child_result. Pass/fail status, stdout, stderr.rule_flags. Deterministic assessor signals (fp:reflection,fp:flake_runtime,tp:null_value, …).judge_tp_prob,judge_bucket,judge_rationale. LLM-as-judge scores.final_score∈ [-1, 1]. Combined ranking score.target_files. Files the test targets.
jitcatch last .
# copy an id prefix from .jitcatch/output/jitcatch-*.json (field: "id")
jitcatch explain . a7f3b2explain opens an interactive chat with the LLM, seeded with that candidate's full context (test code, parent/child stdout/stderr, risks, judge rationale). No need to open the JSON by hand:
────────────────────────────────────────────────────────────
jitcatch explain a7f3b2c1d0e9 test_parses_empty_body intent_aware bucket=High score=+0.72 weak-catch
────────────────────────────────────────────────────────────
ask about this candidate. empty line, 'exit', or Ctrl-D to quit.
you ❯ is this a real regression or a flake?
thinking…
llm ❯ The `fp:flake_runtime` flag wasn't set and the child failed on …
you ❯ what would a minimal fix look like?
llm ❯ …
you ❯ exit
bye.
- Provider/model flags mirror
run/pr:--provider {auto,anthropic,ollama,openai-compat},--model,--base-url,--stub,--max-tokens,--llm-timeout,--verbose,--log-dir. - Colored banner + prompts (
you ❯cyan,llm ❯green) render only on a TTY. SetNO_COLOR=1to disable styling; redirecting stdout also drops colors automatically. --no-chatskips the REPL and prints the plain candidate detail block. No LLM call.- Non-tty stdin (pipes, redirects, CI) auto-skips the REPL and falls back to the detail block, so
jitcatch explain . a7f3b2 | lessstill works. - Exit the REPL with an empty line,
exit/quit/:q, or Ctrl-D.
See docs/ for per-use-case documentation, and the Output section above for how findings are grouped and ranked.
jitcatch/
├── cli.py Argument parsing, subcommand dispatch, end-to-end orchestration
├── llm.py Provider clients (Anthropic, Ollama, OpenAI-compat, Stub) + UsageStats
├── cache.py Risk-inference disk cache (sha256 keys, TTL, clear)
├── revs.py Parent/child rev resolution + scratch worktrees
├── diff.py Low-level git helpers
├── context.py Bundle assembly, caller discovery, file selection
├── runner.py WorktreeSandbox, evaluate_test (parallel parent/child), rerun_child
├── config.py Dataclasses: CatchCandidate, GeneratedTest, ReviewFinding, TestResult
├── report.py JSON + Markdown + HTML output, stable_id
├── workflows/
│ ├── intent_aware.py Risks-first test gen
│ ├── dodgy_diff.py Mutation-mindset test gen
│ ├── reviewer.py BugBot-style diff review + validator
│ └── retry.py Feedback-driven retry rounds
├── assessor/
│ ├── rules.py Deterministic fp:*/tp:* flagging + final_score
│ └── judge.py LLM-as-judge wrapper
└── adapters/
├── base.py Adapter ABC + subprocess helper
├── python.py pytest
└── javascript.py node:test (ESM + CommonJS)
tests/
├── test_context.py Bundle + caller discovery
├── test_revs.py Rev resolvers (last/pr/staged/working)
├── test_smoke.py End-to-end with StubClient
├── test_rules.py Deterministic assessor rules
├── test_reviewer_retry.py Reviewer pipeline, retry loop, report sorting
├── test_llm_parse.py JSON extraction, truncation recovery
├── test_pr_mode.py pr/base-detection logic
├── test_provider_dispatch.py Provider routing via httpx MockTransport
├── test_cache.py Risk-inference cache (TTL, key stability, clear)
├── test_explain.py stable_id + `jitcatch explain` behavior
├── test_html_report.py HTML writer, --format flag, usage panel
└── fixtures/ Fixture repos for language-adapter tests
| Language | Extensions | Test runner | Adapter |
|---|---|---|---|
| Python | .py |
pytest |
jitcatch/adapters/python.py |
| JavaScript | .js, .mjs, .cjs |
node --test (node:test) |
jitcatch/adapters/javascript.py |
The JS adapter auto-detects ESM vs CommonJS from the target file extension and the project's package.json "type" field. detect_runner also recognizes Jest and Vitest for future extension.
Subclass jitcatch.adapters.base.Adapter, register it in jitcatch/adapters/__init__.py, and add a fixture under tests/fixtures/. See the existing adapters as templates. Contract:
class Adapter(ABC):
lang: str
exts: tuple[str, ...] = ()
def detect(self, source_rel: str) -> bool: ...
def prompt_hints(self, module_rel: str, repo_root: Path | None = None) -> str: ...
def write_test(self, repo_root: Path, test_name: str, code: str) -> TestArtifact: ...
def run_test(self, repo_root: Path, artifact: TestArtifact, timeout: int) -> TestResult: ...- Large diffs are bounded by design. Bundle is capped at
--max-bytes(default 200 KB). Files beyond that are hunk-windowed (50 lines around each hunk). Override with--max-bytesif you have a generous context window. - Top-N by churn.
select_fileskeeps the most-changed files when a group exceeds--max-files. Noise from incidental edits doesn't crowd out the signal. - Prompt injection from untrusted repos. JitCatch assumes you trust the code it reads. See
SECURITY.mdfor the threat model. - Verbose logs.
--verbosewrites every LLM request/response to.jitcatch/logs/untruncated. Invaluable when a run produces no weak catches. Start by reading the risk list. - Truncation. If the stderr summary reports
truncated (max_tokens): N > 0, a response was cut off. Raise--max-tokens, shrink--max-bytes, or switch stages to a higher-ceiling model via--model-tests.
git clone https://github.com/kushankurdas/jitcatch
cd jitcatch
pip install -e '.[dev]'
pytest tests/The test suite is fully offline:
StubClientfor LLM callshttpx.MockTransportfor provider-dispatch tests- Temp-dir git repos for sandbox tests
No API keys, no network. CI runs on Python 3.9–3.12 against Ubuntu with Node 20 installed for the JS adapter.
Run a single test:
pytest tests/test_reviewer_retry.py::ReportSortingTest -vSee CONTRIBUTING.md for the PR checklist, architecture orientation, and the code style rules.
JitCatch executes generated tests with the invoking user's privileges. The worktree sandbox is for rev isolation, not security containment. Run it against trusted repositories only, or inside a disposable container for CI.
Report vulnerabilities privately. See SECURITY.md. Do not open public issues for security bugs.
JitCatch is an independent open-source implementation of ideas from:
Becker, M. et al. Just-in-Time Catching Test Generation at Meta. In Companion Proceedings of the 34th ACM International Conference on the Foundations of Software Engineering (FSE Companion '26), June 2026, Montreal, Canada. arXiv:2601.22832.
@inproceedings{becker2026jitcatch,
author = {Becker, Matthew and others},
title = {Just-in-Time Catching Test Generation at Meta},
booktitle = {Companion Proceedings of the 34th ACM International Conference on the Foundations of Software Engineering (FSE Companion '26)},
year = {2026},
address = {Montreal, Canada},
url = {https://arxiv.org/abs/2601.22832},
eprint = {2601.22832},
archivePrefix = {arXiv}
}This repository is not affiliated with or endorsed by Meta.
MIT. See LICENSE.
Copyright © 2026 Kushankur Das.