The Adversarial Testing Production Bugs
Adversarial Self-Testing exists to institutionalize the adversarial megaquery testing approach that found most of Mnemolis's real bugs historically — and once it actually ran against real traffic, it kept doing exactly that. This page is the record of what it found: two bugs caught during the feature's own development, a four-bug chain found tracing a single real query through several rounds of verification, a genuine false positive in the feature's own detector, a real backend timeout with no way to tune it, and one investigation that ended without ever finding a root cause.
Building the discourse-framing check exposed a real logic bug during its own unit testing, worth recording here in the same spirit as the rest of Design History: the first version checked "kiwix" in result.lower() as one of its two ways to confirm kiwix was actually used. A genuinely realistic mock result reading "plain web result, no kiwix involved" — explicitly stating kiwix was not used — contains the literal substring "kiwix", so the naive check passed it as if kiwix had been present. Fixed by trusting only source_used and the real, structural "[KIWIX —" header marker fusion.py actually emits — never a freeform substring search across response text.
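The difference between the two checks can be sketched in a few lines. This is a minimal illustration, not the real detector: the function names are hypothetical, and only source_used, the "[KIWIX —" marker, and the mock result text come from the description above.

```python
def kiwix_used_naive(result: str) -> bool:
    # First-draft logic: any mention of the word counts, even a denial.
    return "kiwix" in result.lower()


def kiwix_used_structural(source_used: str, result: str) -> bool:
    # Trust only routing metadata or the structural section header marker.
    return source_used == "kiwix" or "[KIWIX —" in result


mock = "plain web result, no kiwix involved"
print(kiwix_used_naive(mock))              # True: the false positive
print(kiwix_used_structural("web", mock))  # False
```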
A second, more consequential bug surfaced during code review of the first fix: the original design explicitly documented "a flag is only ever cleared by a clean re-roll of the same fingerprint" as a deliberate choice — but a reviewer correctly identified that this was a real risk, not a stylistic tradeoff, specifically for intermittent anomalies. The ever_flagged/first_flagged_*/review_status design described on the main feature page is the actual fix — and writing the fix surfaced two more real bugs in its own first draft: a schema-migration ordering bug (an index was created on the new ever_flagged column before the column itself had been added to a pre-existing table, raising no such column: ever_flagged on every real, already-deployed database), and a missing review_status reset (a dismissed combination that got a genuinely new, different flag later stayed permanently invisible, since nothing ever cleared the earlier dismissal). Both were caught by failing tests written specifically to exercise the scenario, not found by inspection — the same discipline this whole feature exists to apply to Mnemolis itself, applied here to its own code.
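The migration-ordering half of that fix is easy to reproduce in isolation. The sketch below assumes SQLite (the error text quoted above matches its wording) and a simplified, hypothetical one-column version of the table; the index name is also invented for illustration.

```python
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    # Step 1: add the column if an older database does not have it yet.
    columns = {row[1] for row in conn.execute(
        "PRAGMA table_info(adversarial_combinations)")}
    if "ever_flagged" not in columns:
        conn.execute(
            "ALTER TABLE adversarial_combinations "
            "ADD COLUMN ever_flagged INTEGER NOT NULL DEFAULT 0")
    # Step 2: only now is it safe to index the column.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_combinations_ever_flagged "
        "ON adversarial_combinations(ever_flagged)")
    conn.commit()


# Simulate an already-deployed database that predates the new column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adversarial_combinations (fingerprint TEXT PRIMARY KEY)")
migrate(conn)
migrate(conn)  # running it a second time is harmless
```

Running the two steps in the opposite order against the same pre-existing table is what produces the failure, which is why a test that starts from an old-shaped database catches it and a test that starts from a fresh schema does not.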
The first cycle ever run against the real, fully-reachable Kiwix/SearXNG/Ollama stack came back clean — 8/8, zero flags. Worth recording what it actually generated, since "clean" doesn't mean "boring":
nested_proper_noun_pairs fusion 11909ms
conditional_with_remainder uptime 2028ms
no_intent_fallthrough kiwix 1092ms
discourse_framing_plus_real_keyword fusion 6080ms
discourse_framing_plus_real_keyword fusion 3080ms
conditional_with_remainder fusion 276ms
no_intent_fallthrough kiwix 1990ms
nosplit_adjacent_to_real_conjunction web 2502ms
Two real things worth noting, neither of which got flagged (correctly — no history existed yet for the latency check to compare against):
- "whats the deal with the Beatles and the Rolling Stones plus Mercury and Venus, in addition since last time": two proper-noun pairs in one query, resolved to fusion in 11.9 seconds, by far the slowest of the eight. A real, legitimately slow case the recipe was built to surface; worth watching once more history accumulates.
- Two conditional_with_remainder queries differing 2028ms vs. 276ms: almost certainly a cache hit/miss difference on the sub-query, not a real anomaly. Exactly the kind of normal variance ADVERSARIAL_TEST_LATENCY_OUTLIER_FLOOR_MS exists to absorb.
This is the actual point of the feature, not a footnote: after running for roughly a day against MiniDock's real stack (136 real combinations tried, 9 flagged), tracing every single flag — not just the ones that looked interesting — turned up two genuine, previously-unknown bugs in Mnemolis's actual routing/decomposition logic, one genuine false positive in this feature's own detector, and one real, structural (not buggy) latency characteristic worth documenting rather than chasing.
The Discourse-Framing Investigation documents fixing "all four real code paths" inside _llm_detect() — fresh and cached, single- and multi-source. All four genuinely were fixed. What none of those four cover: detect_intent()'s own if source: return source early-returns the instant _keyword_detect() matches any real INTENT_MAP keyword — even a single, common, generic one like "rss" or "news" — short-circuiting before _llm_detect() (and therefore every one of its four correctly-fixed escalation paths) is ever reached.
A real, live flag caught this directly: "everyone keeps talking about black holes, and rss" resolved to bare "news" in 35ms — far too fast to have touched the LLM at all, confirming pure keyword-match resolution. Reproduced and generalized immediately: every natural discourse-framed sentence tried that happened to mention any ordinary INTENT_MAP word ("news", "weather", "rss", "feeds", "door locked") hit the identical gap, for both single- and multi-keyword matches. The original fix narrowed the bug's surface area — closing the LLM-routing version — without ever closing the keyword-routing version, since INTENT_MAP contains dozens of short, ordinary words that can easily co-occur with genuine discourse framing in a real sentence.
Fixed by applying the exact same, already-existing _escalate_single_source_for_discourse_framing() / _escalate_multi_source_for_discourse_framing() helpers directly inside detect_intent()'s keyword-match branch — no new escalation logic, just reusing what _llm_detect() already had, at the one call site that was missing it.
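The shape of the gap, and of the fix, can be sketched roughly as follows. Everything here is a simplified stand-in: the INTENT_MAP subset, the framing phrases, and the single escalate_for_discourse_framing() helper are assumptions for illustration, and the real helpers' signatures are not shown on this page.

```python
import re

# Toy subset of INTENT_MAP; the real one has many more sources and phrases.
INTENT_MAP = {"news": ["news", "rss", "feeds"], "weather": ["weather"]}
# Hypothetical framing phrases, for illustration only.
DISCOURSE_FRAMES = ("everyone keeps talking about", "everyone is obsessed with")


def keyword_detect(query):
    q = query.lower()
    for source, keywords in INTENT_MAP.items():
        if any(re.search(rf"\b{re.escape(k)}\b", q) for k in keywords):
            return source
    return None


def escalate_for_discourse_framing(source, query):
    # Stand-in for the real escalation helpers: framed queries go to fusion.
    if any(frame in query.lower() for frame in DISCOURSE_FRAMES):
        return "fusion"
    return source


def detect_intent(query, llm_detect):
    source = keyword_detect(query)
    if source:
        # Before the fix this branch was a bare `return source`.
        return escalate_for_discourse_framing(source, query)
    return llm_detect(query)


print(detect_intent("everyone keeps talking about black holes, and rss",
                    lambda q: "kiwix"))  # fusion
```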
A second flagged row — "feeds plus is it up in addition later today also door locked as well as google" — was meant to test 5 independent intents but only resolved to 3 visible sections. Traced directly: _decompose() only produced 4 parts, not 5, with "is it up" (the literal, real uptime keyword phrase) missing entirely — not folded into a neighboring clause, just gone.
Root cause: _filter_meaningful()'s stop-word-stripping check has no awareness of INTENT_MAP at all. "is it up" and "are they up" are made entirely of common English stop words ("is", "it", "up", "are", "they"), and were confirmed to be the only two of the 113 real keyword phrases across every source with that property. A clause consisting only of one of these phrases came back with zero content_words and was discarded as not meaningful.
Fixed the same way this function already handles a structurally identical problem for _COLLOQUIAL_PHRASES — checking the clause against the real, flattened INTENT_MAP keyword list (_ALL_INTENT_KEYWORDS, computed once at import time) before falling through to generic stop-word stripping. A real keyword phrase now always counts as meaningful, even if every individual word in it happens to be a stop word — closing the general case, not just these two phrases by name, so a future INTENT_MAP addition with the same property is automatically covered too.
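A minimal sketch of that keyword-aware check, assuming an exact-match comparison against the flattened list. The stop-word set and INTENT_MAP subset are toy values, and is_meaningful() is a hypothetical single-clause version of the real function.

```python
STOP_WORDS = frozenset({"is", "it", "up", "are", "they", "the", "a", "and"})
INTENT_MAP = {"uptime": ["is it up", "are they up"], "news": ["feeds"]}

# Flattened once at import time, as the text describes for _ALL_INTENT_KEYWORDS.
ALL_INTENT_KEYWORDS = frozenset(
    keyword for keywords in INTENT_MAP.values() for keyword in keywords)


def is_meaningful(clause: str) -> bool:
    text = clause.strip().lower()
    # A real keyword phrase always counts, even if every word is a stop word.
    if text in ALL_INTENT_KEYWORDS:
        return True
    content_words = [w for w in text.split() if w not in STOP_WORDS]
    return bool(content_words)


print(is_meaningful("is it up"))  # True
print(is_meaningful("and the"))   # False
```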
A real /search call against MiniDock, run to manually confirm the discourse-framing keyword-path fix above, returned a technically correct result by every check this feature runs — source_used: "fusion", real [NEWS — ...] and [KIWIX — ...] sections both present — and would have scored a clean pass on every one of the seven checks the main feature page describes. But the actual kiwix section was bad: for "everyone keeps talking about black holes, and rss", kiwix returned an unrelated Space StackExchange thread about Hubble telescope camera placement and an unrelated Wikipedia article about a true-crime podcast — never the real Black Hole article. This is exactly the kind of thing the feature's own hard rule (never judge correctness) means it can't catch on its own — finding it took a human actually reading a result and asking "is this actually good," not a structural check.
Tracing it found two distinct, real root causes, both upstream of anything the discourse-framing check itself inspects:
fusion.search() calls every selected source with the identical, full, unmodified query string — "everyone keeps talking about black holes, and rss" in its entirety, not separated per-source. Kiwix has no way to know "rss" is the literal text that triggered news as a co-source, not part of its own topic — confirmed this is general, pre-existing behavior, not specific to discourse framing: an ordinary, non-discourse multi-keyword fusion query ("check the news and tell me about black holes") shows the identical pattern.
The actual fix wasn't to make kiwix defensively robust against arbitrary cross-source noise — a generic "strip every other source's keywords" approach was checked and confirmed unsafe: words like "weather", "forecast", "google" are real INTENT_MAP triggers for other sources but also genuinely legitimate kiwix topics in their own right. The real fix was upstream: _decompose() should have split "black holes" and "rss" into two independent clauses in the first place — the same mechanism that already correctly handles every other multi-intent query in the system — but didn't, because of a second, separate bug: "rss" (confirmed the only real INTENT_MAP keyword that is itself 3 characters or shorter) was being discarded by _filter_meaningful()'s if len(p) <= 3: continue length gate before the _ALL_INTENT_KEYWORDS check (added earlier this same investigation, for "is it up"/"are they up") ever got a chance to protect it — the two checks existed in the wrong order. Fixed by reordering the keyword/colloquial checks ahead of the length gate. Once decomposition correctly splits the query, kiwix only ever receives "black holes" as its own search text.
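The ordering problem is easiest to see with the two gates side by side. This is a sketch under stated assumptions: the keyword and stop-word sets are toy values and filter_meaningful() is a hypothetical simplification, though the length gate mirrors the `len(p) <= 3` condition quoted above.

```python
KEYWORDS = frozenset({"rss", "is it up", "are they up"})  # toy subset
STOP_WORDS = frozenset({"is", "it", "up", "are", "they", "and", "the"})


def filter_meaningful(parts):
    kept = []
    for p in parts:
        text = p.strip().lower()
        # Keyword protection runs first, so short keywords survive.
        if text in KEYWORDS:
            kept.append(p)
            continue
        # Length gate: when this ran first, it discarded "rss".
        if len(text) <= 3:
            continue
        if any(word not in STOP_WORDS for word in text.split()):
            kept.append(p)
    return kept


print(filter_meaningful(["black holes", "rss", "and"]))
# ['black holes', 'rss']
```

Swapping the first two checks back is enough to drop "rss" again, since a three-character clause never reaches the keyword lookup.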
Scoring never actually used the cleaned text _build_search_terms() already builds. Separately, and worth fixing regardless of the decomposition fix above: The Discourse-Framing Investigation documents fixing search-term pollution from discourse-framing words ("everyone", "obsessed") by stripping them in _build_search_terms() — and that part is genuinely true. But _score_result(), which ranks whatever Kiwix's search actually returns, was never updated to use that same cleaned text — it scores against the raw, original query parameter, where "everyone"/"keeps"/"talking" are still real, counted words today, confirmed even for the original bitcoin case the wiki documents as fully fixed. That case's real winner just never visibly changed, because the real Bitcoin article's title-overlap signal was strong enough to win regardless of the noise; "black holes" had no such margin. Fixed by stripping discourse framing from the specific word set used for keyword-overlap scoring (query_words) — query_lower itself stays the full, original phrasing, since the exact-match check and _is_definitional_query() genuinely need the real leading phrase structure that _strip_discourse_framing() was never meant to touch.
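A toy version of the overlap scoring shows why the framing words mattered. The discourse word list, the tokenizer, and both candidate titles are invented for this example; only the idea of removing framing words from the overlap set while leaving the full query string untouched comes from the text.

```python
import re

# Hypothetical framing vocabulary, for illustration only.
DISCOURSE_WORDS = frozenset({"everyone", "keeps", "talking", "about", "obsessed"})


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, title, strip_framing=True):
    query_words = tokenize(query)
    if strip_framing:
        # Only the overlap word set is cleaned; the query itself is unchanged.
        query_words -= DISCOURSE_WORDS
    return len(query_words & tokenize(title))


query = "everyone keeps talking about black holes"
for title in ("Everyone Keeps Talking (podcast)", "Black holes"):
    raw = overlap_score(query, title, strip_framing=False)
    cleaned = overlap_score(query, title)
    print(title, raw, cleaned)
# Everyone Keeps Talking (podcast) 3 0
# Black holes 2 2
```

With the raw word set the unrelated title outranks the on-topic one; with framing words removed the order flips.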
Re-running the exact same real query against MiniDock to confirm the decomposition/scoring fixes above worked — and they did; the real Black Hole disambiguation article correctly led the kiwix section — surfaced a third, genuinely separate bug in the actual merged answer: a second, redundant [NEWS — ...] section appeared near the end, duplicating real headlines already shown earlier in the response ([KIWIX, NEWS, WEB, NEWS] instead of the correct [KIWIX, NEWS, WEB]).
Root cause: once decomposition correctly splits a query like this into ["...black holes,", "rss"], the first clause's own LLM-judged source selection can independently land on internal fusion (e.g. ["kiwix", "news", "web"] together, all sharing one already-headered, nested blob), while the second, separately-decomposed "rss" clause resolves to bare news on its own. _merge_same_source() only ever compares the outer tuple label, so it has no way to see that a section nested inside the fusion blob duplicates the second, separate tuple's own source.
Confirmed this is a real, pre-existing gap that predates this investigation, not a new regression from either fix above — this recipe's specific shape (discourse escalation + an unrelated trailing keyword) just made it reliably reachable instead of needing a rarer LLM-judgment coincidence to trigger. This is the same bug, and the same _dedupe_nested_fusion_sections() fix, described in full in The Fusion Merge Bugs — this page covers the production discovery; that page covers the merge-logic mechanism.
Re-running the exact query one more time to confirm the section-level fix above against MiniDock's real stack — and it worked exactly as designed, producing a single, correct [NEWS — ...] header — surfaced a fourth, separate bug: the same several headlines still appeared twice inside that one, correctly-deduplicated section's own body, since _merge_same_source()'s actual content join had zero awareness of what was inside either blob. This is the second bug in the same three-bug merge chain — see The Fusion Merge Bugs for the mechanism and the fix (_dedupe_items_across_blobs()).
The lesson across all four bugs in this chain: each one only became visible once the previous fix in the chain was already verified working — they were found by tracing a single real, live query end to end on actual production data, four separate times, not by inspection or a synthetic test case. A feature designed to catch structural anomalies in Mnemolis's output ended up needing exactly that same discipline — re-run, re-check, don't assume the first fix was the whole fix — applied to itself.
Tracing a third flagged row — "what's offline as well as while i've been at work in addition this weekend plus news and security status", 5 intended intents, only 3 headers — turned out to be a false positive, not a Mnemolis bug. Decomposition produced all 5 correct parts; every part resolved to the correct source. The 2 "missing" sources legitimately and correctly returned empty results, and route_with_source() deliberately drops an empty sub-query result before merging — exactly the right behavior; nobody wants an answer cluttered with empty sections.
The real problem: by the time the part-count check sees the final merged string, there's no trace anywhere of which sub-queries were tried and legitimately came back empty versus which results were silently lost to a bug — that information is gone before merging ever happens.
Fixed by loosening the check from an exact-count comparison to "fewer than half of the intended sources produced any header at all." The original proper-noun-pair bug (bug 5) that this check exists to catch was a global veto: it collapsed an entire multi-intent query down to a single, un-split result (0 or 1 headers against 4+ intended sources), not a partial 2-of-5 gap. "Less than half" is loose enough to never fire on ordinary empty-result variance across this recipe's real range (3–5 intended sources), while still catching a genuine large-scale collapse with the same shape as the original bug. As a direct consequence, the ADVERSARIAL_TEST_PART_COUNT_MISMATCH_TOLERANCE setting was removed.
(This also corrected an unrelated, separate defect in the old check that the new logic improves on for free: the old version's n_headers > 0 guard meant a complete collapse to zero headers could never be flagged at all, for any number of intended sources — a worse blind spot than the false positive this same fix closes, since a total collapse is exactly bug 5's real signature. The new check correctly flags 0 headers for 2+ intended sources.)
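Read literally, the loosened rule comes down to one comparison. The sketch below is an interpretation of the description above, not the real check: the function name is hypothetical, and comparing 2 * n_headers against n_intended is just one way to express "fewer than half" without floats.

```python
def part_count_collapsed(n_intended: int, n_headers: int) -> bool:
    # Single-intent queries have nothing to collapse.
    if n_intended < 2:
        return False
    # "Fewer than half of the intended sources produced any header."
    return 2 * n_headers < n_intended


print(part_count_collapsed(5, 3))  # False: the 2-of-5 gap is tolerated
print(part_count_collapsed(4, 1))  # True: the shape of the original collapse
print(part_count_collapsed(2, 0))  # True: zero headers is now flagged
```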
That fix, while real and correct on its own terms, turned out not to be the whole story. Tracing a completely separate flag much later — conditional_remainder_missing_sections on "if it is raining, I will be careful with communication, as well as feeds" — led to _HEADER_PATTERN itself: the regex both this check and the conditional-remainder check use to count real headers in a result string required exactly one literal " — " separator, with the character class after it deliberately excluding the em-dash. kiwix's real label ("ENCYCLOPEDIC KNOWLEDGE — UNRELATED TO OTHER SECTIONS BELOW") and news's real label ("RECENT NEWS HEADLINES — GENERAL, NOT LOCATION-SPECIFIC UNLESS STATED") both legitimately contain a second em-dash — so neither header could ever be matched by the original regex, at all, regardless of any threshold.
Both real part_count_mismatch flags described above involved news as one of the intended sources. Reconstructing a realistic 5-header result including both vulnerable headers confirmed the regex undercounted by exactly 2 (5 real headers, 3 counted), the precise shape of both flags' literal text ("intended 5 intents, found 3 headers"). The threshold fix above made the check tolerant of this undercount without ever finding why the undercount was happening. It likely papered over this exact regex bug the whole time, and the legitimate-empty-results explanation was probably never the complete story.
Fixed properly by rebuilding _HEADER_PATTERN from the real, exact header strings fusion._format_header() can actually produce (re.escape()'d, not a generic bracket-matching character class) — the same safe approach router.py's own _dedupe_nested_fusion_sections() already uses for the identical underlying need. Several existing tests for both checks had also been using fabricated header text ("[KIWIX — A]") that happened to be equally invisible to the broken regex for an unrelated reason, which is the real, structural reason this had survived in the test suite for as long as it did — a test using fake data can accidentally agree with a real bug. New tests use only real header strings going forward.
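The undercount is straightforward to reproduce. The sketch below assumes headers have the shape "[SOURCE — LABEL]"; the KIWIX and NEWS labels are the ones quoted above, the WEB label is invented, and OLD_PATTERN is only an approximation of the broken regex as described, not a copy of it.

```python
import re

# Assumed header shape "[SOURCE — LABEL]". The WEB label is invented.
REAL_HEADERS = [
    "[KIWIX — ENCYCLOPEDIC KNOWLEDGE — UNRELATED TO OTHER SECTIONS BELOW]",
    "[NEWS — RECENT NEWS HEADLINES — GENERAL, NOT LOCATION-SPECIFIC UNLESS STATED]",
    "[WEB — WEB SEARCH RESULTS]",
]

# Approximation of the broken pattern: one separator, no further em-dash allowed.
OLD_PATTERN = re.compile(r"\[[A-Z]+ — [^\]—]+\]")
# Rebuilt pattern: an alternation of the exact, escaped header strings.
NEW_PATTERN = re.compile("|".join(re.escape(h) for h in REAL_HEADERS))

result = "\n".join(f"{header}\nsome body text" for header in REAL_HEADERS)
print(len(OLD_PATTERN.findall(result)))  # 1
print(len(NEW_PATTERN.findall(result)))  # 3
```

Building the pattern from the exact strings trades generality for safety: a new header must be added to the list before it can be counted, but no label wording can make an existing header invisible.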
The last of this round's two genuinely unresolved flags: unexpected_empty on "if any services are down, let me know right away, as well as lights off", latency 30056ms. The number itself was the actual clue — UptimeKumaApi(settings.uptime_kuma_url, timeout=30) was a bare, hardcoded 30-second client timeout, and 30056ms is exactly that, plus the small overhead of everything else the query touched.
Tracing the real call path confirmed this is not a Mnemolis bug at all — the Uptime Kuma client connection genuinely timed out, the exception was caught, and Mnemolis correctly, honestly returned "Could not connect to Uptime Kuma: {e}" rather than hiding the failure or crashing. fusion._looks_empty() correctly recognizes "could not connect" as a real failure signal, and this feature's own check correctly flagged the resulting empty-looking merged response — every layer did exactly what it was supposed to do.
The real, fixable gap: there was no setting anywhere to tune that 30-second wait. Every other source this project touches (SEARXNG_REQUEST_TIMEOUT_SECONDS, FUSION_TIMEOUT_SECONDS) already has a real, configurable timeout; Uptime Kuma's was the one bare literal left over. 30 seconds is a long time to wait on what should be a fast, same-LAN service before falling back — fixed by adding UPTIME_KUMA_TIMEOUT_SECONDS (default 10), wired directly into the real client call, with the documented fallback behavior (a real "could not connect" message on a genuine failure) completely unchanged.
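One way such a setting could be read is sketched below. Only the setting name and its default of 10 come from the text; reading it straight from the environment, the helper name, and the fallback on invalid values are assumptions, since the project's actual settings mechanism is not shown on this page.

```python
import os

DEFAULT_UPTIME_KUMA_TIMEOUT_SECONDS = 10.0


def uptime_kuma_timeout(env=None) -> float:
    env = os.environ if env is None else env
    raw = env.get("UPTIME_KUMA_TIMEOUT_SECONDS")
    if raw is None:
        return DEFAULT_UPTIME_KUMA_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_UPTIME_KUMA_TIMEOUT_SECONDS
    # A zero or negative timeout is treated as a misconfiguration.
    return value if value > 0 else DEFAULT_UPTIME_KUMA_TIMEOUT_SECONDS


# The client call would then pass the setting instead of the literal 30, e.g.
#   UptimeKumaApi(settings.uptime_kuma_url, timeout=uptime_kuma_timeout())
print(uptime_kuma_timeout({}))                                      # 10.0
print(uptime_kuma_timeout({"UPTIME_KUMA_TIMEOUT_SECONDS": "3.5"}))  # 3.5
```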
Whether the original 30-second timeout on a real, live deployment reflects a genuine, repeatable network issue worth investigating further, or was a one-off hiccup, is a separate, open question this fix doesn't answer on its own — but a shorter, configurable timeout means a future occurrence fails fast and falls back sooner, rather than holding up an entire conditional+remainder response for half a minute.
A real unexpected_empty flag on a nosplit_adjacent_to_real_conjunction query ("vs Python and JavaScript, plus while at work", routed correctly to changes via the literal "while at work" keyword) was traced as far as the actual production logs would allow. Every mechanism that could be checked against real evidence was checked and ruled out, in order: the LLM being down at the time (it was — but this specific query never calls the LLM at all, a pure keyword match); a different, adjacent synthetic query's LLM/SearXNG failures in the same batch somehow bleeding into this one (no causal path found); a time-window edge case in _hours_since() (the resolved window was a sane ~14 hours); cold-start snapshots (history was old enough); a swallowed exception (_check_crash() runs before _check_unexpected_empty() in priority order and would have fired first, and didn't); a timezone misconfiguration (confirmed directly — TZ=America/Phoenix, correct local clock); and stale message text (confirmed byte-for-byte on the real deployed container — format_changes({}) produces exactly the message expected, which does not match any _looks_empty() phrase).
Every one of those checks came back negative, against real logs, real container state, and real function calls, not assumption. times_generated for this combination was 1: it had only ever happened once, and nothing about it has recurred since. The investigation reached the genuine limit of what was recoverable and stopped there, rather than manufacturing one more theory to force a tidy conclusion.
The real, lasting finding wasn't the root cause; it was that the schema made finding the root cause impossible after the fact. adversarial_combinations recorded that something matched a known empty/error phrase, but never the phrase itself. Once that one real occurrence was gone, there was no way to recover what had actually happened, no matter how thoroughly every other angle got checked. Fixed with a new last_flagged_result_excerpt column (up to 500 characters, populated only when a flag actually fires, never on a clean run), exposed through GET /adversarial/flagged. The next time this fires, the first query is the database, not a step-by-step elimination across production logs.
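The write path for that column could look roughly like this. It is a sketch assuming SQLite and a simplified two-column table; record_run() is a hypothetical helper, and leaving an earlier excerpt untouched on a clean run is one reading of "populated only when a flag actually fires".

```python
import sqlite3

EXCERPT_LIMIT = 500  # the "up to 500 characters" from the text


def record_run(conn, fingerprint, result, flagged):
    conn.execute(
        "INSERT OR IGNORE INTO adversarial_combinations (fingerprint) VALUES (?)",
        (fingerprint,))
    if flagged:
        # Store the evidence only when a flag fires; clean runs leave it alone.
        conn.execute(
            "UPDATE adversarial_combinations "
            "SET last_flagged_result_excerpt = ? WHERE fingerprint = ?",
            (result[:EXCERPT_LIMIT], fingerprint))
    conn.commit()


def excerpt_for(conn, fingerprint):
    row = conn.execute(
        "SELECT last_flagged_result_excerpt FROM adversarial_combinations "
        "WHERE fingerprint = ?", (fingerprint,)).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adversarial_combinations ("
    "fingerprint TEXT PRIMARY KEY, last_flagged_result_excerpt TEXT)")
record_run(conn, "abc", "Could not connect to Uptime Kuma: timed out", flagged=True)
print(excerpt_for(conn, "abc"))
```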
The lesson: the feature was built to verify Mnemolis's behavioral guarantees, not to explain every anomaly it finds — and sometimes the honest, correct outcome of an investigation is recording that nothing more could be recovered, while fixing the actual, structural reason it couldn't be recovered next time.