Merged

40 commits
All commits by carlos-alm.

- d52a3a4 feat: add maintenance skills — deps-audit, bench-check, test-health, … (Mar 21, 2026)
- b187fe1 Merge branch 'main' into feat/maintenance-skills (Mar 21, 2026)
- a562b52 fix(bench-check): capture stderr, guard division-by-zero, commit base… (Mar 21, 2026)
- 4fc994d fix(deps-audit): run npm ci after revert, document tokenizer skip reason (Mar 21, 2026)
- 89aef6b fix(housekeep): guard Phase 5 in source repo, fix stale-worktree crit… (Mar 21, 2026)
- 3e892d1 Merge remote-tracking branch 'origin/feat/maintenance-skills' into fe… (Mar 21, 2026)
- ce5d811 fix: address Round 3 Greptile review feedback (Mar 22, 2026)
- 01b5110 fix: move deps-audit stash to Phase 0, before npm commands modify man… (Mar 22, 2026)
- 3b0e293 fix: capture flaky-detection loop output to per-run files for comparison (Mar 22, 2026)
- 52de495 fix: always require confirmation for stale worktree removal (Mar 22, 2026)
- 8be5cec fix: use parsed threshold in baseline.json, guard --compare-only on f… (Mar 22, 2026)
- 0691ffc Merge branch 'main' into feat/maintenance-skills (Mar 22, 2026)
- 87d9213 fix(deps-audit): track stash creation to avoid operating on wrong entry (Mar 22, 2026)
- 65d9836 fix(test-health): use mktemp for flaky-run directory to avoid concurr… (Mar 22, 2026)
- eef2c03 fix(bench-check): add save-baseline verdict path, fix em-dash, use ex… (Mar 22, 2026)
- 19b14e9 docs(roadmap): update Phase 5 TypeScript migration with accurate prog… (Mar 22, 2026)
- 5bda6ba fix: deps-audit success path should keep npm changes, not revert (#565) (Mar 22, 2026)
- bd0ba1a fix: bench-check use git add + diff --cached to detect new files (#565) (Mar 22, 2026)
- 7b91e3c fix: housekeep require confirmation before branch deletion (#565) (Mar 22, 2026)
- 7fcdd93 Merge branch 'main' into feat/maintenance-skills (Mar 23, 2026)
- 5462d32 fix: scope git diff --cached to bench-check files only (#565) (Mar 23, 2026)
- 457e6b9 fix: use json-summary reporter to match coverage-summary.json output … (Mar 23, 2026)
- 852003d fix: capture stash ref by name to avoid position-based targeting (#565) (Mar 23, 2026)
- eea2954 fix: remove unreachable Phase 5 subphases since source-repo guard alw… (Mar 23, 2026)
- baf6797 Merge remote-tracking branch 'origin/feat/maintenance-skills' into fi… (Mar 23, 2026)
- 9b4869c fix: use dynamic threshold variable in bench-check Phase 6 report tem… (Mar 23, 2026)
- 8d92c99 fix: address open review items in maintenance skills (#565) (Mar 23, 2026)
- 9ad37ea fix: add COVERAGE_ONLY guards to Phase 2 and Phase 4 in test-health (Mar 23, 2026)
- 30ab30e fix: add regression skip guard to bench-check Phase 5, expand deps-au… (Mar 23, 2026)
- a8631d2 fix: add empty-string guard for stat size check in housekeep (#565) (Mar 23, 2026)
- 8fd7430 fix: add BASELINE SAVED verdict path and clarify if/else-if in bench-… (Mar 23, 2026)
- 23f2f76 docs(roadmap): mark Phase 4 complete, update Phase 5 progress (5 of 7) (Mar 23, 2026)
- 2616c78 docs(roadmap): correct Phase 5 progress — 5.3/5.4/5.5 still in progress (Mar 23, 2026)
- 9d2b7ff fix(skill): ban untracked deferrals in /review skill (Mar 23, 2026)
- ef11f5c feat(types): migrate db, graph algorithms/builders, and domain/querie… (Mar 23, 2026)
- 152540c Merge remote-tracking branch 'origin/main' into fix/review-570 (Mar 23, 2026)
- eb7bdf8 fix: remove type lie and false safety guard in Leiden partition (#570) (Mar 23, 2026)
- e4d87e5 fix: safe get() casts, preserve transaction types, add builders barre… (Mar 23, 2026)
- 7a1e6a9 fix: simplify dead interface and truthy guard in leiden (#570) (Mar 23, 2026)
- 36cc8bb Merge branch 'main' into feat/ts-migrate-phase5.5 (Mar 23, 2026)
271 changes: 271 additions & 0 deletions .claude/skills/bench-check/SKILL.md
@@ -0,0 +1,271 @@
---
name: bench-check
description: Run benchmarks against a saved baseline, detect performance regressions, and update the baseline — guards against silent slowdowns
argument-hint: "[--save-baseline | --compare-only | --threshold 15] (default: compare + save)"
allowed-tools: Bash, Read, Write, Edit, Glob, Grep, Agent
---

# /bench-check — Performance Regression Check

Run the project's benchmark suite, compare results against a saved baseline, flag regressions beyond a threshold, and optionally update the baseline. Prevents silent performance degradation between releases.

## Arguments

- `$ARGUMENTS` may contain:
- `--save-baseline` — run benchmarks and save as the new baseline (no comparison)
- `--compare-only` — compare against baseline without updating it
- `--threshold N` — regression threshold percentage (default: 15%)
- No arguments — compare against baseline, then update it if no regressions

## Phase 0 — Pre-flight

1. Confirm we're in the codegraph repo root
2. Check that benchmark scripts exist:
- `scripts/benchmark.js` (build speed, query latency)
- `scripts/incremental-benchmark.js` (incremental build tiers)
- `scripts/query-benchmark.js` (query depth scaling)
- `scripts/embedding-benchmark.js` (search recall) — optional, skip if embedding deps missing
3. Parse `$ARGUMENTS`:
- `SAVE_ONLY=true` if `--save-baseline`
- `COMPARE_ONLY=true` if `--compare-only`
- `THRESHOLD=N` from `--threshold N` (default: 15)
4. Check for existing baseline at `generated/bench-check/baseline.json`
- If missing and not `--save-baseline`: warn that this will be an initial baseline run
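
The flag handling in step 3 can be sketched as a small shell function. The flag names and variables (`SAVE_ONLY`, `COMPARE_ONLY`, `THRESHOLD`) follow this doc; the `parse_args` helper itself is illustrative:

```shell
# Illustrative parser for the flags listed above; parse_args is a hypothetical helper
parse_args() {
  SAVE_ONLY=false
  COMPARE_ONLY=false
  THRESHOLD=15   # default threshold (%)
  while [ $# -gt 0 ]; do
    case "$1" in
      --save-baseline) SAVE_ONLY=true ;;
      --compare-only)  COMPARE_ONLY=true ;;
      --threshold)     shift; THRESHOLD="${1:-15}" ;;
    esac
    shift
  done
}

# Word-split the raw argument string into positional parameters, then parse
parse_args $ARGUMENTS
```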

## Phase 1 — Run Benchmarks

Run each benchmark script and collect results. Each script outputs JSON to stdout.

### 1a. Build & Query Benchmark

```bash
output=$(timeout 300 node scripts/benchmark.js 2>&1)
exit_code=$?
```

If `exit_code` is 124: record `"timeout"` for this suite and skip to the next suite.
Else if `exit_code` is non-zero: record `"error: $output"` for this suite and skip to the next suite.

Extract:
- `buildTime` (ms) — per engine (native, WASM)
- `queryTime` (ms) — per query type
- `nodeCount`, `edgeCount` — graph size

### 1b. Incremental Benchmark

```bash
output=$(timeout 300 node scripts/incremental-benchmark.js 2>&1)
exit_code=$?
```

If `exit_code` is 124: record `"timeout"` for this suite and skip to the next suite.
Else if `exit_code` is non-zero: record `"error: $output"` for this suite and skip to the next suite.

Extract:
- `noOpRebuild` (ms) — time for no-change rebuild
- `singleFileRebuild` (ms) — time after one file change
- `importResolution` (ms) — resolution throughput

### 1c. Query Depth Benchmark

```bash
output=$(timeout 300 node scripts/query-benchmark.js 2>&1)
exit_code=$?
```

If `exit_code` is 124: record `"timeout"` for this suite and skip to the next suite.
Else if `exit_code` is non-zero: record `"error: $output"` for this suite and skip to the next suite.

Extract:
- `fnDeps` scaling by depth
- `fnImpact` scaling by depth
- `diffImpact` latency

### 1d. Embedding Benchmark (optional)

```bash
output=$(timeout 300 node scripts/embedding-benchmark.js 2>&1)
exit_code=$?
```

If `exit_code` is 124: record `"timeout"` for this suite and skip to the next suite.
Else if `exit_code` is non-zero: record `"error: $output"` for this suite and skip to the next suite.

Extract:
- `embeddingTime` (ms)
- `recall` at Hit@1, Hit@3, Hit@5, Hit@10

> **Timeout:** Each benchmark gets 5 minutes max (`timeout 300`). Exit code 124 indicates timeout — record `"timeout"` for that suite and continue.

> **Errors:** If a benchmark script fails (non-zero exit), record `"error: <message>"` and continue with remaining benchmarks.

## Phase 2 — Normalize Results

Build a flat metrics object from all benchmark results:

```json
{
"timestamp": "<ISO 8601>",
"version": "<from package.json>",
"gitRef": "<current HEAD short SHA>",
"metrics": {
"build.native.ms": 1234,
"build.wasm.ms": 2345,
"query.fnDeps.depth3.ms": 45,
"query.fnImpact.depth3.ms": 67,
"query.diffImpact.ms": 89,
"incremental.noOp.ms": 12,
"incremental.singleFile.ms": 34,
"incremental.importResolution.ms": 56,
"graph.nodes": 500,
"graph.edges": 1200,
"embedding.time.ms": 3000,
"embedding.recall.hit1": 0.85,
"embedding.recall.hit5": 0.95
}
}
```

Adapt the metric keys to match whatever the benchmark scripts actually output — the above are representative. The goal is a flat key→number map for easy comparison.
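
One way to produce that flat map without presuming the scripts' exact output shape is to flatten every scalar path into a dotted key with `jq` (assumed available; the `flatten_metrics` helper is illustrative):

```shell
# Flatten arbitrary nested JSON into a {"a.b.c": value} map (jq 1.6+; sketch only)
flatten_metrics() {
  jq '[paths(scalars) as $p
       | {key: ($p | map(tostring) | join(".")), value: getpath($p)}]
      | from_entries'
}
```

For example, piping `{"build":{"native":{"ms":1234}}}` through `flatten_metrics` yields `{"build.native.ms": 1234}`.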

## Phase 3 — Compare Against Baseline

Skip this phase if `SAVE_ONLY=true` or no baseline exists.

For each metric in the current run:

1. Look up the same metric in the baseline
2. Guard against division-by-zero: if `baseline == 0`, mark the delta as `"N/A — baseline was zero"` and treat the metric as **informational only** (not a regression or improvement)
3. Otherwise compute: `delta_pct = ((current - baseline) / baseline) * 100`
4. Classify:
- **Regression**: metric increased by more than `THRESHOLD`% (for time metrics) or decreased by more than `THRESHOLD`% (for recall/quality metrics)
   - **Improvement**: metric decreased by more than `THRESHOLD`% (time) or increased by more than `THRESHOLD`% (recall/quality)
- **Stable**: within threshold

> **Direction awareness:** For latency metrics (ms), higher = worse. For recall/quality metrics, higher = better. For count metrics (nodes, edges), changes are informational only — not regressions.
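
The delta computation and zero-baseline guard from steps 2–3 might look like this in shell (`awk` handles the float math; the `delta_pct` helper name is illustrative):

```shell
# Percentage delta with the zero-baseline guard from step 2 (illustrative helper)
delta_pct() {
  baseline="$1"; current="$2"
  if [ "$(awk -v b="$baseline" 'BEGIN{print (b == 0 ? 1 : 0)}')" = 1 ]; then
    echo "N/A"   # baseline was zero: treat the metric as informational only
  else
    awk -v b="$baseline" -v c="$current" 'BEGIN{printf "%+.1f", (c - b) / b * 100}'
  fi
}
```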

### Regression table

| Metric | Baseline | Current | Delta | Status |
|--------|----------|---------|-------|--------|
| build.native.ms | 1200 | 1500 | +25% | REGRESSION |
| query.fnDeps.depth3.ms | 45 | 43 | -4.4% | stable |

## Phase 4 — Verdict

Based on comparison results:

### No regressions found
- Print: `BENCH-CHECK PASSED — no regressions beyond {THRESHOLD}% threshold`
- If not `COMPARE_ONLY`: update baseline with current results

### Regressions found
- Print: `BENCH-CHECK FAILED — {N} regressions detected`
- List each regression with metric name, baseline value, current value, delta %
- Do NOT update the baseline
- Suggest investigation:
- `git log --oneline <baseline-ref>..HEAD` to find what changed
- `codegraph diff-impact <baseline-ref> -T` to find structural changes
- Re-run individual benchmarks to confirm (not flaky)

### First run (no baseline)
- If `COMPARE_ONLY` is set: print a warning that no baseline exists and exit without saving
- Otherwise: print `BENCH-CHECK — initial baseline saved` and save current results as baseline

### Save-baseline with existing baseline (`--save-baseline`)
- Print: `BENCH-CHECK — baseline overwritten (previous: <old gitRef>, new: <new gitRef>)`
- Save current results as the new baseline (overwrite existing)
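
The branch order above can be sketched as follows; the `regressions`/`has_baseline` variables are assumptions, the message strings follow this doc, and the `COMPARE_ONLY` first-run warning and `--save-baseline` overwrite message are omitted for brevity:

```shell
# Illustrative verdict selection; argument names are assumptions, not a fixed API
print_verdict() {
  regressions="$1"; has_baseline="$2"; threshold="${3:-15}"
  if [ "$has_baseline" = false ]; then
    echo "BENCH-CHECK — initial baseline saved"
  elif [ "$regressions" -gt 0 ]; then
    echo "BENCH-CHECK FAILED — $regressions regressions detected"
  else
    echo "BENCH-CHECK PASSED — no regressions beyond ${threshold}% threshold"
  fi
}
```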

## Phase 5 — Save Baseline

**Skip this phase if `COMPARE_ONLY` is set.** Compare-only mode never writes or commits baselines.
**Skip this phase if regressions were detected in Phase 4.** The baseline is only updated on a clean run.

When saving (initial run, `--save-baseline`, or passed comparison):

Write to `generated/bench-check/baseline.json`:
```json
{
"savedAt": "<ISO 8601>",
"version": "<package version>",
"gitRef": "<HEAD short SHA>",
"threshold": $THRESHOLD,
"metrics": { ... }
}
```

Also append a one-line summary to `generated/bench-check/history.ndjson`:
```json
{"timestamp":"...","version":"...","gitRef":"...","metrics":{...}}
```

This creates a running log of benchmark results over time.

After writing both files, commit the baseline so it is a shared reference point:
```bash
git add generated/bench-check/baseline.json generated/bench-check/history.ndjson
if ! git diff --cached --quiet -- generated/bench-check/baseline.json generated/bench-check/history.ndjson; then
  git commit generated/bench-check/baseline.json generated/bench-check/history.ndjson \
    -m "chore: update bench-check baseline (<gitRef>)"
fi
```

> `git add` first so that newly created files (first run) are staged; `--cached` then detects them correctly. Without this, `git diff --quiet` ignores untracked files and the baseline is never committed on the first run.

## Phase 6 — Report

Write a human-readable report to `generated/bench-check/BENCH_REPORT_<date>.md`.

**If `SAVE_ONLY` is set or no prior baseline existed (first run):** write a shortened report — omit the "Comparison vs Baseline" and "Regressions" sections since no comparison was performed:

```markdown
# Benchmark Report — <date>

**Version:** X.Y.Z | **Git ref:** abc1234 | **Threshold:** $THRESHOLD%

## Verdict: BASELINE SAVED — no comparison performed

## Raw Results

<!-- Full JSON output from each benchmark -->
```

**Otherwise (comparison was performed):** write the full report with comparison and verdict:

```markdown
# Benchmark Report — <date>

**Version:** X.Y.Z | **Git ref:** abc1234 | **Threshold:** $THRESHOLD%

## Verdict: PASSED / FAILED

## Comparison vs Baseline

<!-- Full comparison table with all metrics -->

## Regressions (if any)

<!-- Detail each regression with possible causes -->

## Trend (if history.ndjson has 3+ entries)

<!-- Show trend for key metrics: build time, query time, graph size -->

## Raw Results

<!-- Full JSON output from each benchmark -->
```
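
For the trend section, one way to pull a single metric's history out of `history.ndjson` (assuming `jq`; the `metric_trend` helper name and tab-separated output format are illustrative):

```shell
# Print "gitRef<TAB>value" per history entry for one metric (sketch only)
metric_trend() {
  key="$1"
  file="${2:-generated/bench-check/history.ndjson}"
  jq -r --arg k "$key" '[.gitRef, (.metrics[$k] // "n/a")] | @tsv' "$file"
}
```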

## Phase 7 — Cleanup

1. If report was written, print its path
2. If baseline was updated, print confirmation
3. Print one-line summary: `PASSED (0 regressions) | FAILED (N regressions) | BASELINE SAVED`

## Rules

- **Never silently skip a benchmark** — if a suite fails or times out, record the failure and continue (only the optional embedding suite may be skipped when its dependencies are missing, and the skip must be recorded)
- **Timeout is 5 minutes per benchmark** — use appropriate timeout flags
- **Don't update baseline on regression** — the user must investigate first
- **Recall/quality metrics are inverted** — a decrease is a regression
- **Count metrics are informational** — graph growing isn't a regression
- **The baseline file is committed to git** — it's a shared reference point; Phase 5 commits it whenever a new baseline is written
- **history.ndjson is append-only** — never truncate or rewrite it
- Generated files go in `generated/bench-check/` — create the directory if needed