Oak Dev-inter edited this page Apr 23, 2026 · 1 revision

v0.7.4 — SWE-bench Validation + Rules 2.3–2.6

Released: 2026-04-10

Theme: Measure the framework. Tighten four rules that the measurement exposed.

TL;DR

This release ran Kasidit against SWE-bench Lite and hardened four rules in areas where the measurement showed the framework drifting. Across 56 SWE-bench Lite tasks, 34 (60.7%) passed strictly and 49 (87.5%) were valid. The new rules ban fabricated metrics, enforce numbered option lists, require native-language replies, and make git log --grep and git log -S mandatory before any bug-fix proposal.

Why

By v0.3.0 the framework had stabilized; by v0.7 it had accumulated variants and exceptions. The author wanted to know whether the discipline actually produced better outputs. Kasidit was run unmodified against SWE-bench Lite samples using multiple tiers; the results exposed four patterns of failure that were not tier-specific — they were framework-specific.

What's new

SWE-bench runs

| Sample | Tasks | PASS (strict) | Valid rate |
|---|---|---|---|
| SWE-bench Lite sequential | 56 / 300 | 34 (60.7%) | 49 (87.5%) |
| Curated 15-task | 15 | 8 (53%) | 15 (100%), zero FAIL |
| Curated 7-task (Opus) | 7 | 6 (86%) | 7 (100%) |
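
The percentages in the table are plain ratios of a count over the number of tasks run. As a quick check, the sketch below recomputes them from the counts; the helper name `rate` is illustrative and not part of Kasidit.

```python
def rate(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# Counts taken from the SWE-bench Lite sequential row.
tasks, strict_pass, valid = 56, 34, 49

strict_rate = rate(strict_pass, tasks)
valid_rate = rate(valid, tasks)
print(f"strict PASS {strict_rate}% / valid {valid_rate}%")
```

The same helper applied to the curated rows gives 53.3% for 8 of 15 and 85.7% for 6 of 7, which the table rounds to whole percentages.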

"Valid" = runs to completion, satisfies the task definition. "PASS strict" = passes the hidden test suite.

Rule 2.3 — No fake metrics

Before: "expected 10× speedup", "theoretical savings of 40% memory", "projected reduction in DB calls" — all allowed if clearly labeled.

After: these phrasings are banned. Numbers must be measured before claimed. Labels "analytical / theoretical / expected / projected" do not make fabricated numbers acceptable.

Why: when the framework recorded its own outputs, fabricated numbers were the single biggest source of user-perceived hallucination. Users read "expected 10×" and remembered "10×".
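
The simplest cases of Rule 2.3 could be checked mechanically. The sketch below is an assumption about how such a check might look, not Kasidit's actual implementation: the function name `find_unmeasured_claims` and the regular expression are hypothetical, and it only catches a banned label followed by a number in the same sentence. A claim with no number, such as "projected reduction in DB calls", would need a separate check.

```python
import re

# Hedging labels that Rule 2.3 says do not excuse an unmeasured number.
BANNED_LABELS = ("analytical", "theoretical", "expected", "projected")

# A banned label, then (within the same sentence) a number with an
# optional unit marker such as "×", "x", or "%".
_PATTERN = re.compile(
    r"\b(" + "|".join(BANNED_LABELS) + r")\b[^.\n]*?\d+(?:\.\d+)?[×x%]?",
    re.IGNORECASE,
)


def find_unmeasured_claims(text: str) -> list[str]:
    """Return each span where a banned label introduces a number."""
    return [match.group(0) for match in _PATTERN.finditer(text)]


print(find_unmeasured_claims("This gives an expected 10× speedup."))
```

A reviewer or pre-send hook could treat any non-empty result as an error and ask for a measurement instead.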

Rule 2.4 — Numbered options

Before: "you could try A, B, or C" — AI offers a list, user paraphrases back.

After: number every option: 1. A / 2. B / 3. C. The user replies by number, which removes transcription error from the loop.

Why: paraphrase drift. The agent would list options A, B, and C; the user replies "yes, the second one", which is B when counting from one but C when counting from zero, and the agent implements something other than what the user meant.
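
A minimal sketch of the numbered-option loop, assuming options are plain strings and the user replies with a bare number. The function names `format_options` and `resolve_choice` are hypothetical; the source does not describe how Kasidit implements this.

```python
def format_options(options: list[str]) -> str:
    """Render options as a 1-based numbered list, one per line."""
    return "\n".join(f"{i}. {text}" for i, text in enumerate(options, start=1))


def resolve_choice(reply: str, options: list[str]) -> str:
    """Map a numeric reply back to the option it names.

    Anything other than a number in range is rejected rather than guessed.
    """
    reply = reply.strip()
    if not reply.isdecimal():
        raise ValueError("reply with the option number, e.g. '2'")
    index = int(reply)
    if not 1 <= index <= len(options):
        raise ValueError(f"choose a number between 1 and {len(options)}")
    return options[index - 1]


options = ["A", "B", "C"]
print(format_options(options))
print(resolve_choice("2", options))
```

Rejecting free-text replies such as "the second one" is the point of the design: the agent asks again instead of interpreting a paraphrase.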

Rule 2.5 — Native language reply

Before: AI replied in whichever language the system prompt was in (usually English).

After: reply in the user's native language. Thai user → Thai response. Code and identifiers stay English. Mixed is fine.

Why: cognitive load. A Thai user parsing English architecture discussion is slower and more error-prone than the same user parsing Thai. Code stays English so it compiles.
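
The source does not say how the user's language is determined. One rough heuristic, shown here purely as an assumption, is to look for characters from the Thai Unicode block (U+0E00 to U+0E7F) in the user's message. The names `contains_thai` and `reply_language` are hypothetical.

```python
def contains_thai(text: str) -> bool:
    """True if any character falls in the Thai Unicode block U+0E00..U+0E7F."""
    return any("\u0e00" <= ch <= "\u0e7f" for ch in text)


def reply_language(user_message: str, default: str = "en") -> str:
    """Pick a reply language code from the script of the user's message.

    Only Thai is detected in this sketch; everything else uses the default.
    """
    return "th" if contains_thai(user_message) else default


# A mixed message: Thai prose around an English identifier.
print(reply_language("ช่วยแก้บั๊กใน parse_config หน่อย"))
```

Script detection says nothing about which language a Latin-script message is in, so a real implementation would likely need an explicit user setting or a proper language detector.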

Rule 2.6 — Mandatory git-log before bug fix

Before: bug-fix agent reads the current code, forms a hypothesis, writes a fix.

After: before any fix, run:

  • git log --grep=<term> — search commit messages for related history
  • git log -S <symbol> — search for commits that added/removed the exact symbol
  • read the relevant commits before hypothesizing

Why: most "new" bugs are re-regressions of bugs fixed before. The earlier fix's commit message is the shortest path to the cause. Skipping this step produces fixes that re-introduce the bug the earlier commit fixed.
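
The two history queries can be scripted so the agent always runs them before forming a hypothesis. The sketch below assumes it runs inside a git working tree with `git` on the PATH; the function names are hypothetical, and `--oneline` is added here only to keep the output short (the rule itself does not call for it).

```python
import subprocess


def history_commands(term: str, symbol: str) -> list[list[str]]:
    """Build the two git invocations that Rule 2.6 asks for."""
    return [
        # Commits whose message mentions the term.
        ["git", "log", "--oneline", f"--grep={term}"],
        # Commits that changed how often the symbol occurs (pickaxe search).
        ["git", "log", "--oneline", f"-S{symbol}"],
    ]


def run_history_search(term: str, symbol: str, repo: str = ".") -> dict[str, str]:
    """Run both queries in `repo` and return their output keyed by command."""
    results = {}
    for argv in history_commands(term, symbol):
        done = subprocess.run(
            argv, cwd=repo, capture_output=True, text=True, check=False
        )
        results[" ".join(argv)] = done.stdout
    return results
```

Each commit hash returned can then be inspected with `git show <hash>` before any fix is proposed.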

What changed vs v0.3.0 (prior recorded major)

  • Added: four rules (2.3, 2.4, 2.5, 2.6)
  • Added: SWE-bench validation data in the Version section
  • Tightened: confidence labels — [unsure] items cannot be silently dropped

(Versions between 0.3 and 0.7.4 are not separately documented in the wiki; they were iterative tuning.)

Breaking changes

  • Rule 2.3 may reject outputs that previously passed. Fabricated metrics are now surfaced as errors instead of warnings.
  • Rule 2.6 adds steps to bug-fix flow. Missions that bypassed git-log now take longer but succeed more often.

Migration

No code change required. The next bug-fix mission will run git log --grep and git log -S as its opening step. If the repo is shallow or the bug is in files without history, the agent will state that explicitly and proceed.
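
A sketch of how the shallow-repo and no-history cases might be detected and reported. It relies on `git rev-parse --is-shallow-repository`, which prints `true` or `false` (available in Git 2.15 and later); the function names and the notice wording are assumptions, not Kasidit's actual output.

```python
import subprocess
from typing import Optional


def is_shallow_repo(repo: str = ".") -> bool:
    """Ask git whether the repository at `repo` is a shallow clone."""
    done = subprocess.run(
        ["git", "rev-parse", "--is-shallow-repository"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return done.stdout.strip() == "true"


def history_notice(shallow: bool, commits_touching_file: int) -> Optional[str]:
    """Return the statement the agent should make, or None if history is usable."""
    if shallow:
        return "Repository is a shallow clone; git history is incomplete."
    if commits_touching_file == 0:
        return "No commits touch this file; there is no history to consult."
    return None


print(history_notice(shallow=False, commits_touching_file=0))
```

The commit count could come from counting the lines of `git log --oneline -- <path>`.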

Known limitations

  • SWE-bench Lite is a curated subset of SWE-bench. Results do not generalize to every codebase.
  • git log -S on large monorepos can be slow — agents time out the query and fall back to --grep.
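
The timeout-and-fall-back behaviour can be sketched with the `timeout` argument of `subprocess.run`. The 30-second default, the function name `search_history`, and the injectable `run` parameter (used here so the logic can be exercised without a real repository) are all assumptions for illustration.

```python
import subprocess


def search_history(symbol: str, term: str, timeout_s: float = 30.0,
                   run=subprocess.run) -> tuple[str, str]:
    """Try the pickaxe search first; fall back to --grep if it times out.

    Returns (which_query_answered, stdout).
    """
    try:
        done = run(
            ["git", "log", "--oneline", f"-S{symbol}"],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return "pickaxe", done.stdout
    except subprocess.TimeoutExpired:
        # -S inspects the diff of every commit, so it can be slow on a
        # large monorepo; --grep only inspects commit messages.
        done = run(
            ["git", "log", "--oneline", f"--grep={term}"],
            capture_output=True, text=True,
        )
        return "grep", done.stdout
```

Returning which query answered lets the agent say explicitly that the pickaxe search was skipped.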
