-
Notifications
You must be signed in to change notification settings - Fork 0
The Complexity Refactor Campaign
A run of releases (roughly v3.19–v3.22) built around a different discipline than either the later "bulletproofing pass" (Home Assistant, Kiwix, Snapshot Engine) or general feature work: running real static-analysis tooling (radon for cyclomatic complexity, mypy, bandit, vulture, pip-audit) against the codebase, then treating every genuine outlier as a reason to read that specific code closely — not refactoring for the score's own sake, but using the score as a pointer to where a careful read was actually warranted. Several real bugs came out of it, almost all found the same way: comparing two superficially-similar code paths side by side and checking whether they were really doing the same thing.
radon flagged route_with_source() at a cyclomatic complexity of F(45) — by far the worst in the codebase, the natural consequence of years of real logic (conditional detection, recursive remainder handling, decomposition, fallback resolution) accumulating in one function.
First extraction: the single-source-resolution-with-fallback logic pulled into a new _resolve_single_source() helper, bringing the parent function to D(30). Comparing the newly-isolated logic against its near-duplicate inline counterpart (the decomposition loop had its own, separate copy of the same fallback logic) surfaced two real bugs neither implementation's own tests had caught on their own:
- A real fallback-caching inconsistency. The decomposition loop's fallback path called the fallback source's handler directly, with no cache check — while the top-level path correctly checked the fallback source's own cache first. A repeated query that had already fallen back once should be served from cache the second time, regardless of which code path led to it; the decomposition loop wasn't doing this.
-
An "unknown source" message
_looks_empty()couldn't recognize. If intent detection ever returned a source name with no matching handler — not normally expected, but not provably unreachable either — the resulting error message matched none of the known failure phrases, so it was treated as real content instead of being dropped the way every other failed result already was.
Both fixed by unifying both call sites on the new shared helper, which always checks cache first and recognizes "unknown source" as a real failure.
Second and third extractions, in the next release: _resolve_conditional() (the leading "if X, Y" handling, including recursive condition/remainder resolution — the single most bug-prone piece of this function's own history) and _merge_decomposed_parts() (consecutive same-source merging and header formatting). A third near-duplicate — the decomposition loop's own per-sub-query conditional check, which looks similar to the newly-extracted _resolve_conditional() at first glance — was checked carefully and deliberately left unmerged: the two are genuinely, correctly different by design (the top-level handler builds its own complete headers since nothing else follows it; the sub-query version defers header-wrapping to the decomposition loop's later merge step, since its result is just one ingredient among potentially several still to come). Forcing them into one shared function would have broken the sub-query path's correct deferred-header behavior purely to look less duplicated. route_with_source() reached C(18), the original target range, across both releases combined.
A fourth pass found one more real gap, this time while investigating whether pushing further into B-grade complexity was worth pursuing at all — explicitly scoped in advance as "only proceed if it surfaces a real bug, not purely to chase the grade." It did: a decomposed sub-query that itself resolves to fusion (multiple sources at once) had no caching at all, unlike every other path in the system. A repeated compound query whose individual clause happened to resolve to multiple sources internally would re-run LLM-based fusion source selection and re-query every fusion source on every single request, even identical, immediate repeats. Fixed using the same cache-key convention the top-level fusion path already used. The dispatch-unification idea that originally motivated the investigation was explicitly not pursued once the actual numbers showed it would only reach roughly C(12) — not worth the real risk of restructuring the function's recursive, loop-internal logic for that small a further gain. The fix itself cost two points back (C(18) → C(20)) for the new cache check and empty-result check it added — a small, deliberate, accepted regression in the metric in exchange for a real correctness fix, recorded honestly rather than smoothed over.
The same side-by-side comparison technique, applied next to home_assistant.py's search() (F, 43-44) — comparing its area-filtered entity-matching branch against its keyword-filtered branch — found that the area-filtered branch reimplemented only a subset of the real filtering logic, silently missing exclusion-keyword filtering and several other real filter fields entirely. This is the same bug documented in full, with its real user-facing consequence, in The Home Assistant Bulletproofing Pass — recorded here only to note it was this campaign's discipline, not a dedicated full-file read, that originally found it.
A second, smaller finding from the same pass: an early version of the fix added a defensive branch preserving what looked like a deliberate leniency for a bare area-only query. Rather than assume it was needed, it was checked three independent ways — a full static trace of the filter-building logic, 2000 randomly fuzzed inputs run directly against it, and an exhaustive check of every real entry in the keyword map — before concluding the branch was genuinely unreachable, not cheap insurance, and removing it.
Continuing the same discipline into a third file, E(34) → D(24). Unlike the two functions above, this one's duplication had no hidden behavioral difference to find — the "pick Wikipedia if available, otherwise the first book" fallback logic was duplicated byte-for-byte, used identically whether the LLM wasn't configured at all or had responded with something unusable. Extracted into _fallback_book_choice(), with a dedicated test confirming both real call sites genuinely invoke the same shared function rather than just coincidentally producing matching output — a distinction worth testing for directly, since the latter could silently mask a future re-divergence the former would catch immediately.
The campaign's fifth and largest single reduction: _interpret_yes_no() (D, 28) down to B(8) via a new _interpret_binary_state() helper, generalizing the shared shape behind uptime's, ha's, and forecast's three structurally similar "does the condition's asserted state match the result's actual state" checks.
A real bug was introduced during the extraction itself, and caught before it ever shipped. "Locked" is a literal substring of "unlocked," and the original, separate implementations correctly avoided this trap by always checking for "unlocked" first, regardless of which polarity the condition asserted. A first, naive attempt at generalizing this instead checked whichever result keyword matched the condition's own polarity first — which gets the "condition says locked, result says unlocked" case exactly backwards, silently returning True instead of the correct False. Caught by deliberately constructing and testing that exact scenario before trusting the generalization, not by the existing test suite — which, also found during this same investigation, had never actually covered this specific case at all. This is the origin of the substring-trap guard described in Conditional Query Detection; a genuinely missing regression test for the original, correct behavior was added at the same time, specifically so a future "simplification" can't silently reintroduce the same bug a second time.
A related, fourth-pass investigation into the structurally similar _llm_detect() (D, 29) found something worth recording for the opposite reason: a difference that looked like the same kind of inconsistency on first read — two of four near-duplicate code blocks re-caching a routing decision after applying a bias, two not — turned out, after direct testing, to be correct, deliberate behavior. The bias check is cheap and re-evaluated fresh on every call regardless of what's cached, so a cache entry written before the bias existed still correctly escalates on every subsequent call without needing the escalated decision itself re-cached. Two genuinely reusable patterns were still extracted from the four blocks (D(29) → D(23)), with one honest, minor logging-detail tradeoff disclosed rather than hidden: one of the four call sites lost a specific intermediate log line the shared helper has no way to reproduce, with the same diagnostic information still visible in a second, always-fires log line immediately after it.
bandit found one real, medium-severity issue (Kiwix's OPDS catalog parser using a stdlib XML parser documented as vulnerable to entity-expansion attacks on untrusted input — switched to defusedxml, a drop-in replacement) and confirmed three flagged try/except patterns as deliberate, already-considered design rather than hidden bugs. pip-audit found zero known CVEs in pinned dependencies. vulture found a real, recurring copy-paste accident — the same dead, undecorated duplicate of a logs_clear function body, found and removed four separate times across the project's history as it kept reappearing. mypy found three real, independently-verified issues (a confusing variable-naming collision across two non-overlapping branches of route_with_source(), a stale return-type annotation, and a return type narrower than what the function's own logic already correctly produced) among 31 raw findings, most of the rest being the same low-value pattern repeated.
Every one of these tools was deliberately run as a one-time check during this campaign, not added to permanent CI — a choice revisited and confirmed each time, not assumed by default.
Almost every real bug this campaign found came from the same mechanism: a complexity score pointed at a function, and the actual finding came from comparing that function's near-duplicate-looking branches or call sites against each other directly, not from the refactor itself. The complexity score was never the point — it was a reliable way to decide where a careful comparison was worth doing, in a codebase too large to give every function that scrutiny by default. Two genuine non-findings are recorded here with the same honesty as the real bugs: _llm_detect()'s apparent caching inconsistency turned out to be correct, and pushing route_with_source() from C(18) toward B(6-10) was explicitly tried, measured, and declined once the real numbers showed the further gain wasn't worth the risk.