v0.7.4
Released: 2026-04-10
Theme: Measure the framework. Tighten four rules that the measurement exposed.
This release ran Kasidit against SWE-bench Lite and hardened four rules where the measurement revealed the framework was drifting. 56 SWE-bench Lite tasks → 60.7% strict PASS / 87.5% valid rate. New rules ban fabricated metrics, enforce numbered option lists, require native-language replies, and make `git log --grep` / `git log -S` mandatory before any bug-fix proposal.
By v0.3.0 the framework had stabilized; by v0.7 it had accumulated variants and exceptions. The author wanted to know whether the discipline actually produced better outputs. Kasidit was run unmodified against SWE-bench Lite samples using multiple tiers; the results exposed four patterns of failure that were not tier-specific — they were framework-specific.
| Sample | Tasks | PASS (strict) | Valid rate |
|---|---|---|---|
| SWE-bench Lite sequential | 56 / 300 | 34 (60.7%) | 49 (87.5%) |
| Curated 15-task | 15 | 8 (53%) | 15 (100%) — zero FAIL |
| Curated 7-task (Opus) | 7 | 6 (86%) | 7 (100%) |
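The percentages in the table can be re-derived from the raw counts. The short sketch below does that, assuming both rates are computed against the number of tasks run in each sample (the table does not state the denominator explicitly, but the figures are consistent with it). The curated rows in the table are rounded to whole percents.

```python
# Counts copied from the table: (sample, tasks run, strict PASS, valid).
RESULTS = [
    ("SWE-bench Lite sequential", 56, 34, 49),
    ("Curated 15-task", 15, 8, 15),
    ("Curated 7-task (Opus)", 7, 6, 7),
]


def rate(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


for name, tasks, passed, valid in RESULTS:
    print(f"{name}: PASS {rate(passed, tasks)}% / valid {rate(valid, tasks)}%")
```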
"Valid" = runs to completion, satisfies the task definition. "PASS strict" = passes the hidden test suite.
Before: "expected 10× speedup", "theoretical savings of 40% memory", "projected reduction in DB calls" — all allowed if clearly labeled.
After: these phrasings are banned. Numbers must be measured before claimed. Labels "analytical / theoretical / expected / projected" do not make fabricated numbers acceptable.
Why: when the framework recorded its own outputs, fabricated numbers were the single biggest source of user-perceived hallucination. Users read "expected 10×" and remembered "10×".
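One way to picture the rule is as a lint pass over an agent's draft output. The sketch below is illustrative only and is not Kasidit's implementation: `find_unmeasured_claims` and its regular expression are hypothetical, and the check only catches a banned label followed within a few words by a number with a `×`, `x`, or `%` unit. Number-free phrasings such as "projected reduction in DB calls" would need a separate check.

```python
import re

# Labels that, per the rule above, do not make an unmeasured number acceptable.
BANNED_LABELS = ("analytical", "theoretical", "expected", "projected")

# A banned label, up to three intervening words, then a number with a unit.
_CLAIM = re.compile(
    r"\b(?:" + "|".join(BANNED_LABELS) + r")\b"
    r"(?:\s+\w+){0,3}?\s+\d+(?:\.\d+)?\s*[x×%]",
    re.IGNORECASE,
)


def find_unmeasured_claims(text: str) -> list[str]:
    """Return every span that pairs a banned label with a number (hypothetical helper)."""
    return [match.group(0) for match in _CLAIM.finditer(text)]


if __name__ == "__main__":
    draft = "Caching gives an expected 10× speedup and theoretical savings of 40% memory."
    for claim in find_unmeasured_claims(draft):
        print("unmeasured claim:", claim)
```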
Before: "you could try A, B, or C" — AI offers a list, user paraphrases back.
After: number every option — 1. A / 2. B / 3. C. User replies by number. Removes transcription error from the loop.
Why: paraphrase drift. The agent would propose option B, the user would say "yes, the second one" meaning option C (counting from zero), and the agent would implement something different entirely.
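A minimal sketch of the numbered-reply loop, assuming options are plain strings and the user answers with a bare number. The helper names `format_options` and `resolve_reply` are hypothetical and not part of Kasidit.

```python
def format_options(options: list[str]) -> str:
    """Render options as a 1-based numbered list, one option per line."""
    return "\n".join(f"{i}. {text}" for i, text in enumerate(options, start=1))


def resolve_reply(reply: str, options: list[str]) -> str:
    """Map a numeric reply such as "2" back to the option it names.

    Raises ValueError for non-numeric or out-of-range replies, so an
    ambiguous answer is rejected instead of being guessed at.
    """
    choice = int(reply.strip())
    if not 1 <= choice <= len(options):
        raise ValueError(f"reply must be between 1 and {len(options)}")
    return options[choice - 1]


if __name__ == "__main__":
    options = ["A", "B", "C"]
    print(format_options(options))
    print("chosen:", resolve_reply("2", options))
```

Because the numbering is explicitly 1-based and printed next to each option, a reply of "2" can only mean B.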
Before: AI replied in whichever language the system prompt was in (usually English).
After: reply in the user's native language. Thai user → Thai response. Code and identifiers stay English. Mixed is fine.
Why: cognitive load. A Thai user parsing English architecture discussion is slower and more error-prone than the same user parsing Thai. Code stays English so it compiles.
Before: bug-fix agent reads the current code, forms a hypothesis, writes a fix.
After: before any fix, run:
- `git log --grep=<term>`: search commit messages for related history
- `git log -S <symbol>`: search for commits that added/removed the exact symbol
- read the relevant commits before hypothesizing
Why: most "new" bugs are re-regressions of bugs fixed before. The earlier fix's commit message is the shortest path to the cause. Skipping this step produces fixes that re-introduce the bug the earlier commit fixed.
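The two history searches can be scripted as follows. This is a sketch under stated assumptions, not the agent's actual code: `history_commands` and `search_history` are hypothetical helpers, `--oneline` is added here only to keep the output short, and the symbol is attached directly to `-S`, a form Git accepts.

```python
import subprocess


def history_commands(term: str, symbol: str) -> list[list[str]]:
    """Build both searches as argv lists, so no shell quoting is needed."""
    return [
        ["git", "log", "--oneline", f"--grep={term}"],  # search commit messages
        ["git", "log", "--oneline", f"-S{symbol}"],     # commits that added/removed the symbol
    ]


def search_history(repo_dir: str, term: str, symbol: str) -> list[str]:
    """Run both searches in repo_dir and return the one-line commit summaries."""
    lines: list[str] = []
    for cmd in history_commands(term, symbol):
        done = subprocess.run(
            cmd, cwd=repo_dir, capture_output=True, text=True, check=True
        )
        lines.extend(done.stdout.splitlines())
    return lines
```

A call such as `search_history(".", "timeout", "parse_config")` (both arguments are placeholders) returns the candidate commits to read before forming a hypothesis.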
- Added: four rules (2.3, 2.4, 2.5, 2.6)
- Added: SWE-bench validation data in the Version section
- Tightened: confidence labels; `[unsure]` items cannot be silently dropped
(Versions between 0.3 and 0.7.4 are not separately documented in the wiki; they were iterative tuning.)
- Rule 2.3 may reject outputs that previously passed. Fabricated metrics are now surfaced as errors instead of warnings.
- Rule 2.6 adds steps to bug-fix flow. Missions that bypassed git-log now take longer but succeed more often.
No code change required. The next bug-fix mission will run git log --grep and git log -S as its opening step. If the repo is shallow or the bug is in files without history, the agent will state that explicitly and proceed.
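How the agent detects a shallow repo or a file without history is not documented here. One plausible check, sketched below with hypothetical helper names, uses `git rev-parse --is-shallow-repository` (available since Git 2.15) and a path-limited `git log`. It assumes the repository has at least one commit.

```python
import subprocess


def git_output(repo_dir: str, *args: str) -> str:
    """Run a git subcommand in repo_dir and return its stdout."""
    done = subprocess.run(
        ["git", *args], cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return done.stdout


def history_caveats(is_shallow: bool, commits_touching_path: int) -> list[str]:
    """Turn the two facts into the statements to surface before proceeding."""
    caveats = []
    if is_shallow:
        caveats.append("repository is a shallow clone; history search may be incomplete")
    if commits_touching_path == 0:
        caveats.append("no commits found for this path; proceeding without history")
    return caveats


def check_history(repo_dir: str, path: str) -> list[str]:
    """Collect caveats for one file (assumes the repo has at least one commit)."""
    # Prints "true" or "false" on Git 2.15 and later.
    shallow = git_output(repo_dir, "rev-parse", "--is-shallow-repository").strip() == "true"
    commits = git_output(repo_dir, "log", "--oneline", "--", path).splitlines()
    return history_caveats(shallow, len(commits))
```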
- SWE-bench Lite is a curated subset of SWE-bench. Results do not generalize to every codebase.
- `git log -S` on large monorepos can be slow; agents time out the query and fall back to `--grep`.
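The timeout-and-fallback behaviour can be sketched with `subprocess.run` and its `timeout` argument. The function name and the injectable `runner` parameter are hypothetical, and the timeout is left to the caller because no value is documented here.

```python
import subprocess


def run_with_fallback(primary, fallback, timeout_s, runner=subprocess.run):
    """Run the primary command; if it exceeds timeout_s seconds, run the fallback.

    `runner` is injectable so the fallback logic can be exercised without git.
    """
    try:
        return runner(primary, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The slow pickaxe search was abandoned; use the cheaper message search.
        return runner(fallback, capture_output=True, text=True)
```

For example, `primary` could be `["git", "log", "--oneline", "-Sparse_config"]` and `fallback` `["git", "log", "--oneline", "--grep=parse_config"]`, where `parse_config` is a placeholder symbol.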
MIT • © Kasidit Wansudon