The Meaningful Content Filter Bugs
Query Decomposition's meaningful-content filter decides which fragments of a split query are real intents worth keeping and which are stray leftovers (stop words, filler, a stranded comma) worth discarding. It is a different mechanism from the proper-noun-pair guard, which protects specific pairs from being split apart in the first place; this filter runs after a split has already happened and decides what survives. Two separate bugs lived here, both found the same way: by tracing live Adversarial Self-Testing production data on an actual deployment, not synthetic test cases written in advance.
A candidate fragment survives if any of three checks pass, run in this specific order:
1. Colloquial-phrase check: does the fragment contain a recognized filler phrase ("what's the deal with", "that thing", etc.)? If so, keep it regardless of what's left after stripping.
2. Real keyword check: does the fragment contain a literal INTENT_MAP keyword phrase, from the same flattened list every source's routing already uses? If so, keep it, regardless of what stop-word stripping would otherwise conclude.
3. Length and stop-word check: is the fragment longer than 3 characters, and does it have at least one real content word once stop words are stripped?
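The three checks above can be sketched roughly as follows. This is a minimal illustration, not Mnemolis's actual code: the function names, the tiny `STOP_WORDS` set, the `example_source` key, and the whole-word matching rule are all assumptions made for the example. Only the quoted phrases and the `len(p) <= 3` cutoff come from this page.

```python
import re

# Hypothetical stand-ins; the real lists are larger and live elsewhere.
COLLOQUIAL_PHRASES = ("what's the deal with", "that thing")
INTENT_MAP = {
    "uptime": ["is it up", "are they up"],
    "example_source": ["rss"],  # which source owns "rss" is an assumption
}
STOP_WORDS = {"is", "it", "up", "are", "they", "the", "a", "and", "in", "as"}

# Flattened keyword list, shared across every source.
KEYWORDS = [kw for phrases in INTENT_MAP.values() for kw in phrases]


def _words(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def _contains_phrase(fragment, phrase):
    """True if `phrase` occurs in `fragment` on whole-word boundaries."""
    haystack = " " + " ".join(_words(fragment)) + " "
    needle = " " + " ".join(_words(phrase)) + " "
    return needle in haystack


def is_meaningful(fragment):
    # 1. Colloquial-phrase check: keep regardless of what stripping leaves.
    if any(_contains_phrase(fragment, p) for p in COLLOQUIAL_PHRASES):
        return True
    # 2. Real keyword check: a known keyword phrase always counts.
    if any(_contains_phrase(fragment, kw) for kw in KEYWORDS):
        return True
    # 3. Length and stop-word check: the general fallback, run last.
    if len(fragment.strip()) <= 3:
        return False
    return any(w not in STOP_WORDS for w in _words(fragment))


for fragment in ["is it up", "rss", "door locked", ", and", "the"]:
    print(repr(fragment), is_meaningful(fragment))
```

With these sample lists, "is it up" and "rss" are kept by check 2, "door locked" is kept by check 3, and ", and" and "the" are discarded.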
The order of checks 2 and 3 is the actual subject of this page. It wasn't always this order, and the filter got it wrong twice: once over what stop-word stripping considers "real," and once over what the length gate considers "long enough." Both bugs are worth understanding in detail because they share the same root shape: a generic, blunt filter discarding something that a more specific check already knew was meaningful.
uptime's INTENT_MAP entry includes "is it up" and "are they up", perfectly natural phrases a person would actually type. Both are made entirely of common English stop words: "is", "it", "up", "are", "they". A check against all 113 keyword phrases across every source confirmed that these are the only two with this property.
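An audit of this kind is easy to reproduce. The sketch below assumes INTENT_MAP is a dict mapping each source name to a list of keyword phrases; the sample entries other than the two uptime phrases, the stop-word set, and the helper name are hypothetical.

```python
# Hypothetical sample data; the real INTENT_MAP holds 113 phrases.
STOP_WORDS = {"is", "it", "up", "are", "they", "the", "a", "and"}
INTENT_MAP = {
    "uptime": ["is it up", "are they up", "uptime"],
    "example_source": ["rss", "door locked"],
}


def all_stop_word_phrases(intent_map, stop_words):
    """Return the keyword phrases whose every word is a stop word."""
    return sorted(
        phrase
        for phrases in intent_map.values()
        for phrase in phrases
        # Guard against empty phrases, for which all() would be vacuously True.
        if phrase.split() and all(w in stop_words for w in phrase.lower().split())
    )


print(all_stop_word_phrases(INTENT_MAP, STOP_WORDS))
```

Running the same helper over the real map after each INTENT_MAP change would show whether a new phrase has joined this group.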
When either phrase ended up as its own clause in a longer compound query — "feeds plus is it up in addition later today also door locked as well as google" — the stop-word check stripped every single word from it, leaving zero content words. The filter correctly concluded "nothing meaningful here" by its own stated logic, and silently discarded the entire clause. Not folded into a neighboring fragment, not logged — just gone. The query that should have decomposed into 5 parts (feeds, is it up, later today, door locked, google) came back with 4, missing uptime entirely.
Fixed by adding the real-keyword check (step 2 above) before the generic stop-word check — a real keyword phrase now always counts as meaningful, even when every individual word in it happens to be a stop word. This closes the general case, not just these two phrases by name: any future INTENT_MAP addition with the same all-stop-words property is automatically protected too, with no special-casing required.
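The before-and-after behaviour can be illustrated on the five clauses from the example query. This is a simplified sketch: the function names and the stop-word set are assumptions, and the length gate is left out because it plays no part in this bug.

```python
STOP_WORDS = {"is", "it", "up", "are", "they", "the", "a", "and"}
KEYWORDS = {"is it up", "are they up"}  # the two all-stop-word phrases


def has_content_word(fragment):
    """Generic check: does anything survive stop-word stripping?"""
    return any(w not in STOP_WORDS for w in fragment.lower().split())


def keep_before_fix(fragment):
    # Stop-word stripping alone decides.
    return has_content_word(fragment)


def keep_after_fix(fragment):
    # A known keyword phrase counts as meaningful before stripping is consulted.
    text = " " + " ".join(fragment.lower().split()) + " "
    if any(f" {kw} " in text for kw in KEYWORDS):
        return True
    return has_content_word(fragment)


clauses = ["feeds", "is it up", "later today", "door locked", "google"]
before = [c for c in clauses if keep_before_fix(c)]
after = [c for c in clauses if keep_after_fix(c)]
print(len(before), before)
print(len(after), after)
```

With this data the first list has 4 clauses and is missing "is it up", while the second keeps all 5.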
Bug 2 — the length gate ran before the keyword check that was supposed to protect against exactly this
Bug 1's fix wasn't actually sufficient on its own, and the gap took a second real production query to surface: "everyone keeps talking about black holes, and rss" should decompose into ["...black holes,", "rss"], but didn't — it stayed as one unsplit string. Tracing why landed on a different, second filter entirely: "rss" — confirmed the only real INTENT_MAP keyword that is itself 3 characters or shorter — was being discarded by the filter's length gate (if len(p) <= 3: continue), which ran before the keyword check that bug 1 had just added, not after.
Fragment: "rss" (length 3)
│
▼
Length gate: len(p) <= 3? ──── TRUE
│
▼
Discarded immediately.
The keyword check added for
bug 1 never even runs — it's
positioned AFTER this gate,
not before it.
The fix for bug 1 added a real, working check — but adding it to the end of the filter chain meant it could only ever protect a fragment that survived every earlier check first. A short keyword that the length gate would discard outright never got that far.
This had a real, visible downstream consequence beyond just the missing decomposition. Once "rss" failed to split off as its own clause, it rode along as part of the larger discourse-framed clause sent to kiwix — polluting both the actual Kiwix search query and its relevance scoring with a word that had nothing to do with the real topic. Tracing that consequence is its own, separate story, told in The Adversarial Testing Production Bugs and The Fusion Merge Bugs.
Fixed by reordering the filter: colloquial-phrase and keyword checks now both run before the length gate, not after. "rss" survives as its own clause regardless of its length, because it's checked against the real keyword list before anything ever asks how long it is.
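One way to see why position matters is to model the filter as an ordered chain of checks, where each check either gives a verdict or passes the fragment along. This is only an illustrative model, not how the real filter is structured. All names are hypothetical, and the colloquial-phrase check is omitted for brevity.

```python
from typing import Callable, List, Optional

STOP_WORDS = {"is", "it", "up", "are", "they", "the", "a", "and"}
KEYWORDS = {"rss", "is it up", "are they up"}

# A check returns True (keep), False (discard), or None (no opinion).
Check = Callable[[str], Optional[bool]]


def keyword_check(p: str) -> Optional[bool]:
    text = " " + " ".join(p.lower().split()) + " "
    return True if any(f" {kw} " in text for kw in KEYWORDS) else None


def length_gate(p: str) -> Optional[bool]:
    return False if len(p) <= 3 else None


def stop_word_check(p: str) -> Optional[bool]:
    return any(w not in STOP_WORDS for w in p.lower().split())


def survives(p: str, checks: List[Check]) -> bool:
    """Run checks in order; the first one with an opinion decides."""
    for check in checks:
        verdict = check(p)
        if verdict is not None:
            return verdict
    return False


ORDER_AFTER_BUG_1_FIX = [length_gate, keyword_check, stop_word_check]
ORDER_AFTER_BUG_2_FIX = [keyword_check, length_gate, stop_word_check]

for fragment in ["rss", "is it up", "and"]:
    print(
        repr(fragment),
        survives(fragment, ORDER_AFTER_BUG_1_FIX),
        survives(fragment, ORDER_AFTER_BUG_2_FIX),
    )
```

In this model "rss" is discarded under the first ordering and kept under the second, "is it up" is kept under both, and "and" is discarded under both.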
Both bugs have the identical shape: a blunt, general-purpose filter (stop-word stripping; a short-fragment length cutoff) discarding something that a more specific check — "is this a real, documented keyword Mnemolis already knows about?" — would have correctly protected, if only that check ran first. Neither bug was a flaw in the idea of either filter; stop-word stripping and a length floor are both reasonable, generally-correct heuristics. The bug, both times, was assuming a general heuristic and a specific exception could coexist in any order, when the only correct order is specific-before-general — the same lesson the proper-noun-pair saga arrived at independently, in a structurally different part of the same decomposition mechanism.
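The specific-before-general rule can also be stated as a regression invariant: every INTENT_MAP keyword, presented alone, must survive the filter. The sketch below is an assumed test helper, not something this page says exists. It takes the filter as a parameter and reports any keyword that would be discarded. The two sample filters and the `example_source` key are hypothetical.

```python
STOP_WORDS = {"is", "it", "up", "are", "they", "the", "a", "and"}
INTENT_MAP = {
    "uptime": ["is it up", "are they up"],
    "example_source": ["rss"],
}


def unprotected_keywords(intent_map, is_meaningful):
    """Return every keyword phrase the given filter would discard."""
    return sorted(
        {kw for phrases in intent_map.values() for kw in phrases
         if not is_meaningful(kw)}
    )


def general_only(fragment):
    # Only the blunt heuristics: length floor, then stop-word stripping.
    if len(fragment) <= 3:
        return False
    return any(w not in STOP_WORDS for w in fragment.lower().split())


def specific_first(fragment):
    # The known-keyword exception runs before either general heuristic.
    known = {kw for phrases in INTENT_MAP.values() for kw in phrases}
    if fragment.lower().strip() in known:
        return True
    return general_only(fragment)


print(unprotected_keywords(INTENT_MAP, general_only))
print(unprotected_keywords(INTENT_MAP, specific_first))
```

With this sample data the general-only filter leaves all three keywords unprotected, and the specific-first filter leaves none.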