The Kiwix Bulletproofing Pass
Five real bugs found in app/sources/kiwix.py: four during a deliberate, full read of the file, and one more caught in the same area by a full repo-wide test coverage audit. This is the same discipline, applied to a different source, that found the Home Assistant bug chain documented in The Home Assistant Bulletproofing Pass. Most are independent findings; two form a short chain, the second found while verifying the first.
_pick_books_with_llm() decides which Kiwix book(s) a query should search, normally falling back to Wikipedia-first if the LLM doesn't return anything usable. The fallback check had a substring-matching flaw. When the LLM returned an empty or whitespace-only string (a network hiccup, a timeout, a blank model response), the empty candidate passed Python's "" in name test, which is true for every string, so it matched whichever book name happened to come first in unordered set iteration instead of triggering the intended Wikipedia-first fallback. The practical effect was non-deterministic: the same failure condition could pick a different, effectively random book depending on iteration order, instead of consistently falling back to the same sensible default. Found through a full repo-wide test coverage audit, not through any specific reported behavior. Fixed by skipping empty candidates before the substring match runs at all.
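The failure mode and the guard can be reproduced in a few lines. This is a minimal sketch, not the project's code: pick_book, the book names, and the "wikipedia" fallback value are hypothetical stand-ins for whatever _pick_books_with_llm() does internally.

```python
def pick_book(candidate: str, book_names: list[str], fallback: str = "wikipedia") -> str:
    """Return the first book whose name contains the candidate, else the fallback.

    Hypothetical helper: the real selection logic may differ.
    """
    candidate = candidate.strip().lower()
    if not candidate:
        # "" is a substring of every string, so without this guard an empty
        # candidate would "match" whichever book is checked first.
        return fallback
    for name in sorted(book_names):  # sorted so the result never depends on set order
        if candidate in name.lower():
            return name
    return fallback


books = ["wikipedia_en_all", "stackoverflow_en", "gutenberg_en"]  # made-up names

# The root cause: the empty string is contained in any string.
print("" in "gutenberg_en")

# A blank LLM response now takes the fallback path instead of a random book.
print(pick_book("   ", books))
print(pick_book("stack", books))
```

Stripping before the emptiness check means whitespace-only responses are treated the same as empty ones.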
_fetch_article() passed .toc/#toc to BeautifulSoup's soup([...]) call, intending to strip table-of-contents clutter from article text before returning it. But .toc and #toc are CSS-selector syntax, not HTML tag names, and soup([...]) only matches tag names. Confirmed directly that TOC clutter had survived in every fetched article since this code was written, despite the code's clear intent. The genuine tag names in the same strip list ("script", "table", and others) had been working correctly the whole time; only the two CSS-selector-style entries were silently inert. Fixed with soup.select(".toc, #toc") for the selector-style entries, keeping the existing soup([...]) call for the real tag names.
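The difference between the two calls is easy to see on a small document. The sketch below assumes the beautifulsoup4 package is installed; the HTML sample and variable names are invented for illustration and are not taken from _fetch_article().

```python
from bs4 import BeautifulSoup

# Invented sample page: two TOC containers, a script, and the real content.
html = """
<html><body>
<div class="toc">Contents: 1 History 2 Usage</div>
<div id="toc">Contents (duplicate)</div>
<script>var x = 1;</script>
<p>Article text.</p>
</body></html>
"""

# Original pattern: soup([...]) is shorthand for find_all([...]), which treats
# each list entry as a tag name. ".toc" and "#toc" match no tag, so they do nothing.
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", ".toc", "#toc"]):
    tag.decompose()
before = soup.get_text(" ", strip=True)

# Fixed pattern: tag names stay in soup([...]); selectors go through select().
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script"]):
    tag.decompose()
for tag in soup.select(".toc, #toc"):
    tag.decompose()
after = soup.get_text(" ", strip=True)

print(before)
print(after)
```

In the first pass the script is removed but both TOC blocks survive; in the second only the article paragraph remains.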
If Kiwix's search endpoint was healthy but the article-content fetch path kept failing, the retry logic would keep trying every remaining scored search result in sequence, with no cap. A realistic worst case (multiple books selected, disambiguation active, on the order of 59 total scored results across all of them) could mean up to 59 sequential fetch attempts at a real 10-second timeout each: 590 seconds, close to ten minutes for a single search request. Fixed by capping retries at 5: generous enough to recover from a realistic cluster of a few broken links near the top of the results, narrow enough to bound the worst case to under a minute (5 attempts at 10 seconds each is 50 seconds).
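A capped retry loop of this kind might look like the sketch below. The function name, the constants, and the fake fetcher are all hypothetical; the only figures taken from the text are the cap of 5, the 10-second timeout, and the 59-result worst case.

```python
from typing import Callable, Iterable, Optional

MAX_FETCH_ATTEMPTS = 5   # cap described above
FETCH_TIMEOUT_S = 10     # per-request timeout described above


def first_fetchable(
    results: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    max_attempts: int = MAX_FETCH_ATTEMPTS,
) -> Optional[str]:
    """Try results in score order, giving up after max_attempts failures."""
    for attempt, url in enumerate(results):
        if attempt >= max_attempts:
            break  # bound the worst case instead of walking the whole list
        text = fetch(url)
        if text is not None:
            return text
    return None


# Simulate a content path that is completely down.
calls = []

def always_failing_fetch(url: str) -> Optional[str]:
    calls.append(url)
    return None

results = [f"article-{i}" for i in range(59)]
outcome = first_fetchable(results, always_failing_fetch)

print("attempts made:", len(calls))
print("capped worst case (s):", len(calls) * FETCH_TIMEOUT_S)
print("uncapped worst case (s):", len(results) * FETCH_TIMEOUT_S)
```

The cap trades a small chance of missing a good result further down the list for a hard upper bound on request latency.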
Single-character search terms were silently dropped, the same bug independently rediscovered in a second file
_build_search_terms() carried the identical bug already found and fixed in the same release in scoring.py's own keyword extraction: a query like "what is r programming used for" reduced to the literal Kiwix search string "programm," with the single letter that actually distinguished "R" from any other programming language filtered out the same way stray punctuation is. Fixed with the same isalnum()-based approach already proven correct in scoring.py.
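An isalnum()-based filter can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the stopword set and function name are hypothetical, and any stemming the real code applies is left out.

```python
import string

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"what", "is", "used", "for", "the", "of"}


def build_search_terms(query: str) -> list[str]:
    """Keep every alphanumeric token, regardless of length."""
    terms = []
    for raw in query.lower().split():
        token = raw.strip(string.punctuation)  # drop surrounding punctuation
        # isalnum() is False for "" and for pure punctuation, but True for a
        # single letter such as "r", so one-character terms survive.
        if token.isalnum() and token not in STOPWORDS:
            terms.append(token)
    return terms


print(build_search_terms("what is r programming used for"))
print(build_search_terms("c ? - language"))
```

Filtering on what a token contains, rather than on how long it is, is what lets "r" through while still rejecting a stray "?" or "-".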
Found immediately while verifying that fix: making single-character terms reachable exposed a second, related weakness one layer downstream. _get_disambiguation_candidates() has its own sanity filter, which checks whether a candidate phrase "contains the original word". For one-letter terms that check provided almost no protection, since almost any English phrase contains a single letter like "c" somewhere as a bare substring. This had been a latent, untriggered gap for as long as the first bug existed, because a single-character term could never reach this filter until the first fix made it reachable. Fixed with the same word-boundary regex discipline already used elsewhere in the codebase; see Kiwix Disambiguation for this fix in its own fuller context.
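The gap between a bare substring check and a word-boundary check shows up clearly with a one-letter term. The helper below is a hypothetical sketch of the technique, not the filter in _get_disambiguation_candidates(); the example phrases are invented.

```python
import re


def contains_word(phrase: str, word: str) -> bool:
    """True only if `word` appears in `phrase` as a whole word."""
    # re.escape guards against regex metacharacters in the term; \b assumes
    # the term starts and ends with a word character, which holds for
    # alphanumeric search terms.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, phrase, flags=re.IGNORECASE) is not None


# Bare substring matching accepts nearly anything for a one-letter term.
print("c" in "electric guitar")

# Word-boundary matching rejects the coincidental hit and keeps the real one.
print(contains_word("electric guitar", "c"))
print(contains_word("C (programming language)", "c"))
```

Longer terms behave the same under both checks in most cases, which is why the weakness stayed invisible until one-letter terms became reachable.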
Of the four bugs found by the read itself, two were found purely by reading code that looked fine and asking "does this actually do what it's clearly trying to do". The TOC-stripping bug in particular passed every existing test, because nothing in the test suite checked whether TOC markup survived; it just never came up. The other two are a reminder that fixing a bug can create a code path that didn't exist before, with its own latent weaknesses worth checking immediately rather than assuming a narrow, targeted fix is complete on its own.