Kiwix Catalog and Article Fetching

Bob edited this page Jun 26, 2026 · 1 revision

Kiwix Disambiguation, Kiwix Scoring, and Multi-Book Fusion all assume two things already happened: a real, current list of which books exist, and a way to turn a chosen search result into actual article text. This page covers both — the part of the pipeline the other three pages take for granted.

Catalog discovery

Mnemolis never hardcodes a book list. On the first request (or right after a POST /catalog/refresh), it fetches Kiwix's own OPDS catalog feed and builds the list from what's actually there.

   get_books()
        │
        ▼
   Cache already populated?
        │
   ┌────┴────┐
   ▼ yes      ▼ no
 Return    Fetch /catalog/v2/entries
 cached    (start=0, count=10)
 list           │
                ▼
        Parse Atom/OPDS XML,
        extract name/title/summary
        per <entry>
                │
                ▼
        Got a FULL page (10 entries)?
                │
        ┌───────┴───────┐
        ▼ yes            ▼ no (partial or empty)
   Fetch next page    Stop — this was
   (start += 10)       the last page
        │                   │
        └─────────┬─────────┘
                   ▼
         Cache the full list,
         return it

get_books() is backed by a real cache, not just a memoized function: once populated, it's never re-fetched until something explicitly clears it. POST /catalog/refresh is that explicit clear. It empties _book_cache and immediately re-fetches, which is the only way Mnemolis learns about a ZIM file added after startup.
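The cache-then-refresh behaviour can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `fetch_all` parameter and the `refresh_catalog` helper are hypothetical names introduced here so the sketch can run without a Kiwix instance, and treating `None` as "not populated" is an assumption.

```python
from typing import Callable, Dict, List, Optional

Book = Dict[str, str]

# Module-level cache; None means "not populated yet" (an assumption of this sketch).
_book_cache: Optional[List[Book]] = None


def get_books(fetch_all: Callable[[], List[Book]]) -> List[Book]:
    """Return the cached catalog, fetching it only when the cache is empty."""
    global _book_cache
    if _book_cache is None:
        _book_cache = fetch_all()
    return _book_cache


def refresh_catalog(fetch_all: Callable[[], List[Book]]) -> List[Book]:
    """Mirror the described POST /catalog/refresh: clear, then re-fetch at once."""
    global _book_cache
    _book_cache = None
    return get_books(fetch_all)
```

Repeated calls to `get_books()` hit the fetcher once; only `refresh_catalog()` triggers a second fetch.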

Why pagination matters here: Kiwix's OPDS endpoint returns 10 entries per page by default, and a real homelab stack with several ZIMs (Wikipedia, multiple Stack Exchange sites, iFixit, DevDocs) can easily exceed a single page. get_books() keeps requesting the next page (start += 10) until it gets back fewer than a full page, the honest signal that there's nothing left, rather than assuming a fixed page count or a has_next field the feed doesn't actually provide.

Each page is parsed independently, and a parse failure on any one page doesn't crash the whole fetch: _fetch_catalog_page() catches its own exceptions and returns an empty list on failure, which get_books()'s loop reads as "no more pages" and stops there. A genuinely malformed catalog response, a Kiwix instance that's temporarily unreachable mid-fetch, or a real XML parsing error all degrade to "stop here with whatever was already collected" rather than discarding everything gathered so far.
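The pagination rule and the failure handling fit together in one short loop. The sketch below assumes a hypothetical `fetch_page(start, count)` callable standing in for the real HTTP request and XML parse; `safe_fetch_page` and `collect_books` are illustrative names, not the project's.

```python
from typing import Callable, Dict, List

Book = Dict[str, str]
PAGE_SIZE = 10  # Kiwix's default OPDS page length, per the text above


def safe_fetch_page(fetch_page: Callable[[int, int], List[Book]],
                    start: int, count: int) -> List[Book]:
    """Fetch one page; any failure degrades to an empty page."""
    try:
        return fetch_page(start, count)
    except Exception:
        return []


def collect_books(fetch_page: Callable[[int, int], List[Book]],
                  page_size: int = PAGE_SIZE) -> List[Book]:
    """Request pages until one comes back short, keeping what was collected."""
    books: List[Book] = []
    start = 0
    while True:
        page = safe_fetch_page(fetch_page, start, page_size)
        books.extend(page)
        if len(page) < page_size:  # partial, empty, or failed page: stop here
            return books
        start += page_size
```

Because a failed page looks like an empty page, the same `len(page) < page_size` check ends the loop in both cases, and the books from earlier pages are still returned.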

The full, versioned book name (wikipedia_en_all_maxi_2026-02, not just Wikipedia) comes from the entry's own text/html link href, not its title — Kiwix's title field is the human-readable display name, but every other part of this codebase (search requests, KIWIX_MAX_BOOKS selection, multi-book fusion attribution) needs the exact, full versioned identifier to actually address the book.
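As a rough illustration of pulling the versioned name from the link rather than the title, the sketch below parses a small hand-written feed. The sample XML, the `/content/<name>` href shape, and the `parse_entries` helper are assumptions made for this example. The import prefers defusedxml, the hardened parser this page describes; the standard-library fallback is there only so the sketch runs without that dependency.

```python
try:
    from defusedxml import ElementTree as ET  # hardened parser described on this page
except ImportError:  # fallback only so this sketch runs without the dependency
    from xml.etree import ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hand-written sample; a real feed carries more fields per entry.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Wikipedia</title>
    <summary>The free encyclopedia</summary>
    <link type="text/html" href="/content/wikipedia_en_all_maxi_2026-02"/>
  </entry>
  <entry>
    <title>No HTML link</title>
    <summary>Skipped: nothing to address it by</summary>
  </entry>
</feed>"""


def parse_entries(xml_text):
    """Return name/title/summary dicts, taking name from the text/html link."""
    books = []
    for entry in ET.fromstring(xml_text).iter(ATOM + "entry"):
        name = ""
        for link in entry.findall(ATOM + "link"):
            if link.get("type") == "text/html":
                # Assumed href shape: the last path segment is the book name.
                name = link.get("href", "").rstrip("/").rsplit("/", 1)[-1]
                break
        if not name:
            continue  # without the versioned name the book cannot be addressed
        books.append({
            "name": name,
            "title": (entry.findtext(ATOM + "title", default="") or "").strip(),
            "summary": (entry.findtext(ATOM + "summary", default="") or "").strip(),
        })
    return books
```

Here `parse_entries(SAMPLE_FEED)` yields a single book whose `name` is `wikipedia_en_all_maxi_2026-02` while its `title` stays `Wikipedia`.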

A real, deliberate security hardening

The OPDS feed is parsed with defusedxml.ElementTree, not the standard library's own xml.etree.ElementTree. This isn't a stylistic choice: a static security analysis pass flagged the standard library parser, which is documented as vulnerable to XML entity expansion attacks (the "billion laughs" attack class) on untrusted input, and the code was switched to a drop-in-compatible replacement built specifically to reject that pattern. The realistic threat model here is contained, since this XML comes from KIWIX_URL, expected to be your own self-hosted, trusted Kiwix instance rather than arbitrary internet content, but the fix was free and was applied regardless.

Searching a book

Once a book is selected (by keyword match, LLM selection, or the Wikipedia-first fallback when no LLM is configured), _search_book() queries Kiwix's own /search endpoint and scrapes the HTML result page — Kiwix doesn't expose a structured JSON search API, so this is real HTML parsing, not a clean REST call.

Each result needs a title and a link; an excerpt is optional (some results genuinely don't have one, and a missing <cite> tag is handled rather than treated as a parse failure). One deliberate filter runs on every result: anything whose URL contains /questions/tagged/ is dropped. A Stack Exchange tag-listing page (questions/tagged/python) is a list of many loosely-related questions, not a focused answer to anything — exactly the kind of result that would otherwise win on raw keyword overlap while being useless as an actual answer.
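The per-result rules (title and link required, excerpt optional, tag listings dropped) can be sketched independently of the HTML scraping. The dict shape and the `clean_results` name below are assumptions for illustration; the input is whatever the scraper pulled from each result element.

```python
from typing import Dict, List, Optional

TAG_LISTING_MARKER = "/questions/tagged/"


def clean_results(raw: List[Dict[str, Optional[str]]]) -> List[Dict[str, str]]:
    """Keep results with a title and a URL; drop Stack Exchange tag listings."""
    cleaned = []
    for item in raw:
        title = (item.get("title") or "").strip()
        url = (item.get("url") or "").strip()
        if not title or not url:
            continue  # both are required
        if TAG_LISTING_MARKER in url:
            continue  # a tag-listing page is not a focused answer
        cleaned.append({
            "title": title,
            "url": url,
            # A missing excerpt is normal, not a parse failure.
            "excerpt": (item.get("excerpt") or "").strip(),
        })
    return cleaned
```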

KIWIX_SEARCH_LIMIT (default 15) controls how many results are requested per book per search — raised from an original hardcoded 5 specifically to give scoring more real candidates to choose from when a common search term collides with several brand-name or homonym results.
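A hedged sketch of how the limit might be read and passed to Kiwix. The helper names are hypothetical, and the query parameter names (`books.name`, `pattern`, `pageLength`) reflect my understanding of kiwix-serve's /search endpoint rather than anything this page confirms; falling back to the default on an invalid value is also an assumption.

```python
import os
from urllib.parse import urlencode


def search_limit(env=os.environ, default: int = 15) -> int:
    """Read KIWIX_SEARCH_LIMIT, falling back to the default on bad input."""
    raw = env.get("KIWIX_SEARCH_LIMIT", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def build_search_url(base_url: str, book: str, query: str, limit: int) -> str:
    """Build a per-book search URL (parameter names are assumed, see above)."""
    params = urlencode({"books.name": book, "pattern": query, "pageLength": limit})
    return f"{base_url.rstrip('/')}/search?{params}"
```

The book argument is the full versioned name, for example `build_search_url("http://kiwix:8080", "wikipedia_en_all_maxi_2026-02", "python list", search_limit())`.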

Fetching the actual article

A search result is a title, URL, and maybe a short excerpt — none of that is the article. _fetch_article() is what turns the winning result into the text that actually gets returned.

It fetches the real page and strips it down to readable content: scripts, stylesheets, navigation, headers, footers, and any <table> are removed outright, and a table-of-contents box is removed separately. Content is then pulled from whichever of several known selectors matches first — Wikipedia's own content div, a generic content-text div, a generic <article> tag, a blog-style post-content div, a Stack Exchange question div, falling all the way back to <body> if nothing more specific matches. The result is truncated to KIWIX_ARTICLE_MAX_CHARS (default 3000) — a per-article limit, distinct from Fusion's own FUSION_MAX_CHARS_PER_SOURCE, which truncates the already-combined multi-source response after fusion, not a single article before it.
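The "first selector that matches wins, then truncate" step can be sketched without committing to a particular HTML library. Everything named here is an assumption: the selector strings are plausible stand-ins for the ones described above, and `select_text` is a hypothetical callable that returns the text for a CSS selector, or `None` when nothing matches.

```python
from typing import Callable, Optional, Sequence

# Illustrative selectors, most specific first; the real list may differ.
CONTENT_SELECTORS = (
    "#mw-content-text",  # Wikipedia-style content div
    ".content-text",     # generic content-text div
    "article",           # generic <article> tag
    ".post-content",     # blog-style post body
    ".question",         # Stack Exchange question div
    "body",              # last-resort fallback
)


def extract_text(select_text: Callable[[str], Optional[str]],
                 selectors: Sequence[str] = CONTENT_SELECTORS,
                 max_chars: int = 3000) -> str:
    """Return text from the first selector that matches, cut to max_chars."""
    for selector in selectors:
        text = select_text(selector)
        if text:
            return text[:max_chars]
    return ""
```

With an HTML parser such as BeautifulSoup, `select_text` would wrap a CSS lookup on the already-stripped document; `max_chars` plays the role of KIWIX_ARTICLE_MAX_CHARS.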

If the winning result's article genuinely can't be fetched — a broken link, a malformed page, a transient timeout — the response is honest about it rather than silently returning nothing: "Found {title} but could not fetch article content.", with the real URL attached, so the failure is visible and actionable rather than indistinguishable from a clean empty result.
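A minimal sketch of that fallback, assuming a hypothetical `fetch_article(url)` callable that returns text, returns nothing, or raises. The response dict shape and the `fetched` flag are inventions of this example; only the message wording comes from the text above.

```python
from typing import Callable, Dict, Optional


def build_response(result: Dict[str, str],
                   fetch_article: Callable[[str], Optional[str]]) -> Dict[str, object]:
    """Return article text, or a visible notice when the fetch fails."""
    try:
        text = fetch_article(result["url"])
    except Exception:
        text = None
    if text:
        return {"title": result["title"], "url": result["url"],
                "content": text, "fetched": True}
    # Keep the URL attached so the failure is actionable, not silent.
    return {"title": result["title"], "url": result["url"],
            "content": f"Found {result['title']} but could not fetch article content.",
            "fetched": False}
```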


Development Notes

  • Table-of-contents boxes were never actually being stripped from any fetched article, despite the code's clear intent. The original strip list passed ".toc" and "#toc" to a function that only matches literal HTML tag names — ".toc" and "#toc" are CSS selector syntax, not tag names, so neither ever matched anything real. "table" in the same original list was a genuine, valid bare tag name and was already working correctly; only the two CSS-selector-style entries needed a real selector-aware method instead.
  • A failed article fetch used to retry every remaining scored result with no upper bound. A realistic worst case (multiple books selected, disambiguation active, up to roughly 59 total results across all of them) could mean up to 59 sequential article-fetch attempts at a 10-second timeout each, about 590 seconds or nearly 10 minutes for one search request, if Kiwix's search endpoint stayed healthy but the specific article-content path kept failing. Attempts are now capped at 5: generous enough to recover from a realistic cluster of a few broken links near the top of the results, narrow enough to bound the worst case to about 50 seconds.
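The capped retry from the second note can be sketched as below. The function and constant names are hypothetical, and `fetch_article` again stands in for the real HTTP fetch; the cap of 5 and the 10-second timeout are the figures given above.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_FETCH_ATTEMPTS = 5       # cap described in the note above
FETCH_TIMEOUT_SECONDS = 10   # per-attempt timeout described in the note above


def fetch_first_available(
    scored_results: List[Dict[str, str]],
    fetch_article: Callable[[str], Optional[str]],
    max_attempts: int = MAX_FETCH_ATTEMPTS,
) -> Optional[Tuple[Dict[str, str], str]]:
    """Try results in score order, giving up after max_attempts failures."""
    for result in scored_results[:max_attempts]:
        try:
            text = fetch_article(result["url"])
        except Exception:
            continue  # broken link or timeout: move on to the next result
        if text:
            return result, text
    return None


def worst_case_seconds(attempts: int, timeout: int = FETCH_TIMEOUT_SECONDS) -> int:
    """Upper bound on time spent when every attempt times out."""
    return attempts * timeout
```

With these figures, `worst_case_seconds(59)` is 590 and `worst_case_seconds(5)` is 50, which is the difference the cap makes.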
