Kiwix Catalog and Article Fetching
Kiwix Disambiguation, Kiwix Scoring, and Multi-Book Fusion all assume two things already happened: a real, current list of which books exist, and a way to turn a chosen search result into actual article text. This page covers both — the part of the pipeline the other three pages take for granted.
Mnemolis never hardcodes a book list. On the first request (or right after a POST /catalog/refresh), it fetches Kiwix's own OPDS catalog feed and builds the list from what is actually there.
get_books()
     │
     ▼
Cache already populated?
     │
┌────┴─────────┐
▼ yes          ▼ no
Return         Fetch /catalog/v2/entries
cached         (start=0, count=10)
list                │
                    ▼
               Parse Atom/OPDS XML,
               extract name/title/summary
               per <entry>
                    │
                    ▼
               Got a FULL page (10 entries)?
                    │
            ┌───────┴───────┐
            ▼ yes           ▼ no (partial or empty)
      Fetch next page   Stop: this was
      (start += 10)     the last page
            │               │
            └───────┬───────┘
                    ▼
               Cache the full list,
               return it
get_books() is a real cache, not just a memoized function: once populated, the list is never re-fetched until something explicitly clears it. POST /catalog/refresh is that explicit clear. It empties _book_cache and immediately re-fetches, which is the only way Mnemolis learns about a ZIM file added after startup.
Why pagination matters here: Kiwix's OPDS endpoint returns 10 entries per page by default, and a real homelab stack with several ZIMs (Wikipedia, multiple Stack Exchange sites, iFixit, DevDocs) can easily hold more books than fit on one page. get_books() keeps requesting the next page (start += 10) until it gets back fewer than a full page, the honest signal that there's nothing left, rather than assuming a fixed page count or a has_next field the feed doesn't actually provide.
Each page is parsed independently, and a parse failure on any one page doesn't crash the whole fetch — _fetch_catalog_page() catches its own exceptions and returns an empty list on failure, which get_books()'s loop reads as "no more pages" and stops there. A genuinely malformed catalog response, a Kiwix instance that's temporarily unreachable mid-fetch, or a real XML parsing error all degrade to "stop here with whatever was already collected" rather than discarding everything gathered so far.
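The pagination loop and its "empty page means stop" behavior can be sketched as below. This is a minimal illustration, not the project's actual code: collect_all_pages and the injected fetch_page callable are hypothetical names, and the real get_books() also handles caching, which is left out here.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 10  # page size described above

Book = Dict[str, str]


def collect_all_pages(
    fetch_page: Callable[[int, int], List[Book]],
    page_size: int = PAGE_SIZE,
) -> List[Book]:
    """Request pages until one comes back shorter than a full page.

    fetch_page(start, count) is a hypothetical stand-in for the page
    fetcher; it is expected to return [] on any failure instead of raising.
    """
    books: List[Book] = []
    start = 0
    while True:
        page = fetch_page(start, page_size)
        books.extend(page)
        # A partial or empty page is the only "last page" signal used.
        # A failed page looks empty, so the loop keeps what it already has.
        if len(page) < page_size:
            break
        start += page_size
    return books
```

With a fetcher backed by 23 entries this makes three requests (10, 10, 3) and returns all 23. If the second request fails and yields an empty list, the first 10 entries are still returned.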
The full, versioned book name (wikipedia_en_all_maxi_2026-02, not just Wikipedia) comes from the entry's own text/html link href, not its title — Kiwix's title field is the human-readable display name, but every other part of this codebase (search requests, KIWIX_MAX_BOOKS selection, multi-book fusion attribution) needs the exact, full versioned identifier to actually address the book.
The OPDS feed is parsed with defusedxml.ElementTree, not the standard library's own xml.etree.ElementTree. This isn't a stylistic choice: a static security analysis pass flagged the standard library parser, which is documented as vulnerable to XML entity expansion attacks (the "billion laughs" attack class) on untrusted input, and the code was switched to a drop-in-compatible replacement built specifically to reject that pattern. The realistic threat model here is contained, since this XML comes from KIWIX_URL, expected to be your own self-hosted, trusted Kiwix instance rather than arbitrary internet content, but the fix was free and was applied regardless.
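A rough sketch of parsing one catalog page follows. Several things here are assumptions rather than facts from this page: the function name parse_catalog_page, the sample feed, and the idea that the text/html link's href ends in the versioned book name (for example /content/wikipedia_en_all_maxi_2026-02). The fallback import exists only so the sketch runs where defusedxml is not installed.

```python
try:
    # The hardened, drop-in-compatible parser described above.
    from defusedxml import ElementTree as ET
except ImportError:
    # Fallback for running this sketch only. The stdlib parser does NOT
    # give the entity-expansion protection discussed above.
    import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_catalog_page(xml_text):
    """Return a list of {name, title, summary} dicts, or [] on any failure."""
    try:
        root = ET.fromstring(xml_text)
    except Exception:
        # Malformed XML degrades to "no entries" instead of crashing.
        return []
    books = []
    for entry in root.iter(f"{ATOM}entry"):
        name = None
        for link in entry.findall(f"{ATOM}link"):
            if link.get("type") == "text/html":
                # Assumed href shape: ".../<versioned_book_name>"
                href = (link.get("href") or "").rstrip("/")
                name = href.rsplit("/", 1)[-1] or None
                break
        if not name:
            continue  # without the versioned name the book is not addressable
        books.append({
            "name": name,
            "title": entry.findtext(f"{ATOM}title", default=""),
            "summary": entry.findtext(f"{ATOM}summary", default=""),
        })
    return books


# Hypothetical sample entry, shaped like an Atom/OPDS feed.
SAMPLE = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry>"
    "<title>Wikipedia</title>"
    "<summary>The free encyclopedia</summary>"
    '<link type="text/html" href="/content/wikipedia_en_all_maxi_2026-02" />'
    "</entry>"
    "</feed>"
)
```

Calling parse_catalog_page(SAMPLE) yields one book whose name is the full versioned identifier and whose title is the display name, which is the distinction the paragraph on versioned names draws.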
Once a book is selected (by keyword match, LLM selection, or the Wikipedia-first fallback when no LLM is configured), _search_book() queries Kiwix's own /search endpoint and scrapes the HTML result page — Kiwix doesn't expose a structured JSON search API, so this is real HTML parsing, not a clean REST call.
Each result needs a title and a link; an excerpt is optional (some results genuinely don't have one, and a missing <cite> tag is handled rather than treated as a parse failure). One deliberate filter runs on every result: anything whose URL contains /questions/tagged/ is dropped. A Stack Exchange tag-listing page (questions/tagged/python) is a list of many loosely-related questions, not a focused answer to anything — exactly the kind of result that would otherwise win on raw keyword overlap while being useless as an actual answer.
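The per-result rules above (title and link required, excerpt optional, tag-listing pages dropped) could look roughly like this once results have been scraped into dictionaries. The function name filter_results and the dictionary keys are hypothetical; the HTML scraping step itself is not shown.

```python
TAG_LISTING_MARKER = "/questions/tagged/"


def filter_results(raw_results):
    """Keep results that have a title and URL and are not tag-listing pages."""
    kept = []
    for result in raw_results:
        title = (result.get("title") or "").strip()
        url = (result.get("url") or "").strip()
        if not title or not url:
            continue  # a result without a title or link is unusable
        if TAG_LISTING_MARKER in url:
            continue  # tag-listing page: many loosely related questions
        kept.append({
            "title": title,
            "url": url,
            # A missing excerpt is normal, not a parse failure.
            "excerpt": (result.get("excerpt") or "").strip(),
        })
    return kept
```

A result whose URL contains questions/tagged/python is removed no matter how well its title matches, while a result with no excerpt is kept with an empty excerpt string.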
KIWIX_SEARCH_LIMIT (default 15) controls how many results are requested per book per search — raised from an original hardcoded 5 specifically to give scoring more real candidates to choose from when a common search term collides with several brand-name or homonym results.
A search result is a title, URL, and maybe a short excerpt — none of that is the article. _fetch_article() is what turns the winning result into the text that actually gets returned.
It fetches the real page and strips it down to readable content: scripts, stylesheets, navigation, headers, footers, and any <table> are removed outright, and a table-of-contents box is removed separately. Content is then pulled from whichever of several known selectors matches first — Wikipedia's own content div, a generic content-text div, a generic <article> tag, a blog-style post-content div, a Stack Exchange question div, falling all the way back to <body> if nothing more specific matches. The result is truncated to KIWIX_ARTICLE_MAX_CHARS (default 3000) — a per-article limit, distinct from Fusion's own FUSION_MAX_CHARS_PER_SOURCE, which truncates the already-combined multi-source response after fusion, not a single article before it.
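To make the strip-then-truncate step concrete, here is a small sketch built on the standard library's html.parser. It is a stand-in under stated assumptions, not the project's parser: the names STRIP_TAGS, _TextExtractor, and extract_text are hypothetical, and the table-of-contents removal and the selector fallback chain are left out.

```python
from html.parser import HTMLParser

# Tags removed outright, per the description above.
STRIP_TAGS = {"script", "style", "nav", "header", "footer", "table"}


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping everything inside STRIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a stripped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in STRIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html, max_chars=3000):
    """Return readable text truncated to max_chars (default mirrors
    KIWIX_ARTICLE_MAX_CHARS)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)[:max_chars]
```

The truncation here applies to one article's text. Any later limit applied by fusion is a separate step and is not modeled in this sketch.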
If the winning result's article genuinely can't be fetched — a broken link, a malformed page, a transient timeout — the response is honest about it rather than silently returning nothing: "Found {title} but could not fetch article content.", with the real URL attached, so the failure is visible and actionable rather than indistinguishable from a clean empty result.
- Table-of-contents boxes were never actually being stripped from any fetched article, despite the code's clear intent. The original strip list passed ".toc" and "#toc" to a function that only matches literal HTML tag names. ".toc" and "#toc" are CSS selector syntax, not tag names, so neither ever matched anything. "table" in the same list was a valid bare tag name and was already working correctly; only the two CSS-selector-style entries needed a selector-aware method instead.
- A failed article fetch used to retry every remaining scored result with no upper bound. A realistic worst case (multiple books selected, disambiguation active, up to roughly 59 total results across all of them) could mean up to 59 sequential article-fetch attempts at a 10-second timeout each, nearly 10 minutes for one search request, if Kiwix's search endpoint stayed healthy but the specific article-content path kept failing. Retries are now capped at 5: generous enough to recover from a realistic cluster of a few broken links near the top of the results, narrow enough to bound the worst case to under a minute (5 × 10 s = 50 s).
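A bounded retry over scored results might be sketched as follows. The function name first_fetchable_article, the injected fetch_article callable, and the choice to report the top result once the cap is exhausted are all assumptions made for illustration; only the cap of 5 and the wording of the failure message come from this page.

```python
MAX_ARTICLE_FETCH_ATTEMPTS = 5  # cap described above


def first_fetchable_article(scored_results, fetch_article,
                            max_attempts=MAX_ARTICLE_FETCH_ATTEMPTS):
    """Try the best-scored results in order, giving up after max_attempts.

    fetch_article(url) is a hypothetical callable that returns article
    text, or returns None / raises when the page cannot be fetched.
    """
    for result in scored_results[:max_attempts]:
        try:
            text = fetch_article(result["url"])
        except Exception:
            text = None  # treat any fetch error as "try the next result"
        if text:
            return {"title": result["title"], "url": result["url"],
                    "content": text}
    if not scored_results:
        return None
    # Assumption: report the top-scored result so the failure stays visible.
    top = scored_results[0]
    return {
        "title": top["title"],
        "url": top["url"],
        "content": f"Found {top['title']} but could not fetch article content.",
    }
```

With 59 scored results and a fetcher that always times out, this makes 5 attempts instead of 59, so a 10-second timeout bounds the request at about 50 seconds rather than about 590.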