v0.7.4

v0.7.4 — SWE-bench Validation + Rules 2.3–2.6

Released: 2026-04-10
Theme: Measure the framework. Tighten four rules that the measurement exposed.

TL;DR

This release ran Kasidit against SWE-bench Lite and hardened four rules where the measurement revealed the framework was drifting. 56 SWE-bench Lite tasks → 60.7% strict PASS / 87.5% valid rate. New rules ban fabricated metrics, enforce numbered option lists, require native-language replies, and make git log --grep / git log -S mandatory before any bug-fix proposal.

Why

By v0.3.0 the framework had stabilized; by v0.7 it had accumulated variants and exceptions. The author wanted to know whether the discipline actually produced better outputs. Kasidit was run unmodified against SWE-bench Lite samples across multiple model tiers. The results exposed four failure patterns that were not tier-specific; they came from the framework itself.

What's new

SWE-bench runs

Sample	Tasks	PASS (strict)	Valid rate
SWE-bench Lite sequential	56 / 300	34 (60.7%)	49 (87.5%)
Curated 15-task	15	8 (53%)	15 (100%) — zero FAIL
Curated 7-task (Opus)	7	6 (86%)	7 (100%)

"Valid" = the run completes and satisfies the task definition. "PASS (strict)" = the run passes the hidden test suite.
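The percentages in the table follow directly from the raw counts. A minimal sketch that recomputes them; the `rate` helper is hypothetical and exists only for this illustration:

```python
def rate(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)

# (sample, tasks run, strict PASS count, valid count), copied from the table
runs = [
    ("SWE-bench Lite sequential", 56, 34, 49),
    ("Curated 15-task", 15, 8, 15),
    ("Curated 7-task (Opus)", 7, 6, 7),
]

for name, tasks, passed, valid in runs:
    print(f"{name}: PASS {rate(passed, tasks)}%, valid {rate(valid, tasks)}%")
```

The table rounds the two curated rows to whole percentages; the one-decimal values here are the same figures before that rounding.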

Rule 2.3 — No fake metrics

Before: "expected 10× speedup", "theoretical savings of 40% memory", "projected reduction in DB calls" — all allowed if clearly labeled.

After: these phrasings are banned. Numbers must be measured before claimed. Labels "analytical / theoretical / expected / projected" do not make fabricated numbers acceptable.

Why: when the framework recorded its own outputs, fabricated numbers were the single biggest source of user-perceived hallucination. Users read "expected 10×" and remembered "10×".
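One way a check in the spirit of Rule 2.3 could look is a text scan that flags a hedging label followed closely by a figure. This is a sketch only: the function name `find_unmeasured_metrics`, the 40-character window, and the regex itself are assumptions for illustration, not Kasidit's actual implementation. It catches only phrases that carry a number, so a number-free phrase such as "projected reduction in DB calls" would need a separate check.

```python
import re

# Labels that Rule 2.3 says do not excuse an unmeasured number.
HEDGE_LABELS = ("analytical", "theoretical", "expected", "projected")

# A hedge label, then up to 40 non-period characters, then a figure such as
# "10×", "10x" or "40%". The window size is an arbitrary choice for this sketch.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(HEDGE_LABELS) + r")\b[^.]{0,40}?\d+(?:\.\d+)?\s*(?:×|x|%)",
    re.IGNORECASE,
)


def find_unmeasured_metrics(text: str) -> list[str]:
    """Return every phrase that pairs a hedge label with a number."""
    return [m.group(0) for m in _PATTERN.finditer(text)]
```

A caller could treat a non-empty result as an error and ask for a measurement before the number is reported.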

Rule 2.4 — Numbered options

Before: "you could try A, B, or C" — AI offers a list, user paraphrases back.

After: number every option — 1. A / 2. B / 3. C. User replies by number. Removes transcription error from the loop.

Why: paraphrase drift. The agent would propose option B, the user would answer "yes, the second one" while meaning option C (counting from zero), and the agent would implement something different entirely.
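The numbering loop can be sketched in a few lines. The helper names `format_options` and `resolve_choice` are hypothetical; the point is only that numbering starts at 1 and that a reply which is not a valid number is rejected instead of interpreted.

```python
def format_options(options: list[str]) -> str:
    """Render options as a 1-based numbered list, one per line."""
    return "\n".join(f"{i}. {text}" for i, text in enumerate(options, start=1))


def resolve_choice(options: list[str], reply: str) -> str:
    """Map a numeric reply such as "2" back to the option it names.

    Raises ValueError for anything that is not a number in range, so an
    ambiguous reply is rejected rather than guessed at.
    """
    cleaned = reply.strip().rstrip(".")
    if not cleaned.isdecimal():
        raise ValueError(f"reply {reply!r} is not an option number")
    index = int(cleaned)
    if not 1 <= index <= len(options):
        raise ValueError(f"option {index} is out of range 1..{len(options)}")
    return options[index - 1]
```

With options A, B and C, a reply of "2" always resolves to B, and "the second one" raises an error that prompts the user to answer with a number.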

Rule 2.5 — Native language reply

Before: AI replied in whichever language the system prompt was in (usually English).

After: reply in the user's native language. Thai user → Thai response. Code and identifiers stay English. Mixed is fine.

Why: cognitive load. A Thai user parsing English architecture discussion is slower and more error-prone than the same user parsing Thai. Code stays English so it compiles.
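How the reply language gets chosen is not specified here. As a rough illustration only, a script check on the user's message could serve as a first guess. The functions below are hypothetical, handle only Thai, and treat "contains Thai characters" as a crude proxy for "the user's native language is Thai"; a real implementation would more likely rely on a user setting or a proper language detector.

```python
def contains_thai(text: str) -> bool:
    """True if any character falls in the Unicode Thai block (U+0E00 to U+0E7F)."""
    return any("\u0e00" <= ch <= "\u0e7f" for ch in text)


def reply_language(user_message: str, default: str = "en") -> str:
    """Pick a reply language tag from the script of the user's message."""
    return "th" if contains_thai(user_message) else default
```

A message that mixes Thai prose with English identifiers still resolves to "th", which matches the rule that code stays English while the surrounding reply does not.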

Rule 2.6 — Mandatory git-log before bug fix

Before: bug-fix agent reads the current code, forms a hypothesis, writes a fix.

After: before any fix, run:

git log --grep=<term> (searches commit messages for related history)
git log -S <symbol> (finds commits that changed the number of occurrences of the exact string, i.e. added or removed it)
read the relevant commits before hypothesizing

Why: most "new" bugs are regressions of bugs that were fixed before. The earlier fix's commit message is the shortest path to the cause. Skipping this step produces fixes that re-introduce the bug the earlier commit fixed.
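The two history searches can be scripted as follows. This is a sketch under assumptions: `history_commands` and `search_history` are hypothetical names, `--oneline` is added only to keep the output short, and the documented attached form `-S<string>` is used for the pickaxe search.

```python
import subprocess


def history_commands(term: str, symbol: str) -> list[list[str]]:
    """Build the two history searches Rule 2.6 asks for, as argv lists."""
    return [
        ["git", "log", "--oneline", f"--grep={term}"],
        ["git", "log", "--oneline", f"-S{symbol}"],
    ]


def search_history(term: str, symbol: str, repo: str = ".") -> list[str]:
    """Run both searches in `repo` and return the matching one-line summaries."""
    lines: list[str] = []
    for argv in history_commands(term, symbol):
        result = subprocess.run(
            argv, cwd=repo, capture_output=True, text=True, check=True
        )
        lines.extend(result.stdout.splitlines())
    return lines
```

Each returned line starts with an abbreviated commit hash, which can be passed to `git show` to read the earlier fix before forming a hypothesis.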

What changed vs v0.3.0 (prior recorded major)

Added: four rules (2.3, 2.4, 2.5, 2.6)
Added: SWE-bench validation data in the Version section
Tightened: confidence labels — [unsure] items cannot be silently dropped

(Versions between 0.3 and 0.7.4 are not separately documented in the wiki; they were iterative tuning.)

Breaking changes

Rule 2.3 may reject outputs that previously passed. Fabricated metrics are now surfaced as errors instead of warnings.
Rule 2.6 adds steps to bug-fix flow. Missions that bypassed git-log now take longer but succeed more often.

Migration

No code change required. The next bug-fix mission will run git log --grep and git log -S as its opening step. If the repo is shallow or the bug is in files without history, the agent will state that explicitly and proceed.
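The two fallback conditions can be detected with standard git commands. The sketch below is illustrative only: `run_git` and `history_caveats` are hypothetical names, and the wording of the notes is made up. It relies on `git rev-parse --is-shallow-repository`, which prints `true` or `false`, and on `git log -- <path>` printing nothing when no commit touches the path.

```python
import subprocess
from typing import Callable


def run_git(args: list[str]) -> str:
    """Run a git subcommand in the current directory and return its stdout."""
    done = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    return done.stdout


def history_caveats(path: str, git: Callable[[list[str]], str] = run_git) -> list[str]:
    """Return notes to state explicitly before proceeding without full history."""
    notes = []
    if git(["rev-parse", "--is-shallow-repository"]).strip() == "true":
        notes.append("repository is a shallow clone; history may be truncated")
    if not git(["log", "--oneline", "-1", "--", path]).strip():
        notes.append(f"no commits touch {path}")
    return notes
```

The `git` parameter is injectable so the logic can be exercised without a real repository; an empty result means neither caveat applies.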

Known limitations

SWE-bench Lite is a curated 300-task subset of SWE-bench, and the sequential run covered only 56 of those tasks. Results do not generalize to every codebase.
git log -S on large monorepos can be slow — agents time out the query and fall back to --grep.
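The timeout-and-fallback behaviour could be sketched like this. The function name `pickaxe_or_grep`, the 30-second default, and the injectable `run` parameter are assumptions for illustration; the actual timeout the agents use is not stated here.

```python
import subprocess


def pickaxe_or_grep(symbol: str, term: str, timeout_s: float = 30.0,
                    run=subprocess.run) -> str:
    """Try `git log -S`; if it exceeds timeout_s, fall back to `git log --grep`."""
    try:
        done = run(
            ["git", "log", "--oneline", f"-S{symbol}"],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The pickaxe search scans diffs and was too slow, so search
        # commit messages only.
        done = run(
            ["git", "log", "--oneline", f"--grep={term}"],
            capture_output=True, text=True,
        )
    return done.stdout
```

The fallback is cheaper because `--grep` matches commit messages only, while `-S` has to inspect the content changes of every commit it considers.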
