Adversarial Self-Testing

Adversarial self-testing is a background job that runs on the same apscheduler infrastructure the snapshot engine already uses. It generates structurally novel queries by combining Mnemolis's own real ingredient vocabulary, runs each one through the real route_with_source() pipeline, and flags structural anomalies for human review. It exists to institutionalize the adversarial megaquery testing approach that found most of the bugs documented in Design History (bug 5 of the proper-noun-pair saga in particular), instead of relying on someone deliberately constructing a nasty test sentence by hand each time.

The one hard rule

Nothing in this feature ever judges whether a response was correct. That's not a stylistic choice — it's the load-bearing design constraint the whole feature depends on.

An LLM-as-judge approach to this exact shape of problem (generate a test input and an expected answer, then trust an LLM's own judgment about whether a system's real output matches) was measured in real research at 6.3% precision — 93.7% of flagged "failures" were the judge's own invented expected-answer being wrong, not the system under test. Building this feature around that approach would have meant trading a few hours of setup for a permanent, self-inflicted false-positive problem.

Instead, every check here verifies one of Mnemolis's own documented, already-stated behavioral guarantees against what the real pipeline actually did:

Does a discourse_framing_plus_real_keyword query actually keep kiwix in the result, the way the discourse-framing bias is supposed to guarantee?
Does a query built from N independent intents produce something close to N [SOURCE — LABEL] headers, the same signal that originally caught the proper-noun-pair bug?
Does the response contain a raw traceback, an empty-result phrase from fusion._looks_empty(), or a source that doesn't match anything the query actually said?

None of those require knowing whether the content of the answer was right. They require knowing whether Mnemolis did the thing it claims to do — a fundamentally more reliable kind of check, and one that needs no LLM call and no ground truth.

Generation — pure combinatorics, no LLM calls

Every generated query comes from one of seven recipes, each pure Python combining real vocabulary already defined elsewhere in the codebase:

router.INTENT_MAP — the same dict detect_intent() uses for keyword routing
router._CONJUNCTIONS / router._NOSPLIT_PATTERNS — the same lists query decomposition uses
kiwix.DISCOURSE_FRAMING_PATTERNS — the same list behind the discourse-framing investigation
A small hardcoded seed corpus: real proper-noun pairs, and the real conditional phrases from tests/locustfile.py's CONDITIONAL_QUERIES/CONDITIONAL_WITH_REMAINDER_QUERIES — reused directly rather than re-typed, so the two test surfaces can never silently drift apart

| Recipe | What it stresses |
|---|---|
| `proper_noun_plus_pronoun_intent` | The exact shape that found proper-noun-pair bug 5: a real pair immediately followed by a conjunction and the pronoun "I" |
| `multi_intent_chain` | 3–5 independent intents from different sources, joined by different conjunctions |
| `conditional_with_remainder` | A real conditional seed plus a genuinely unrelated remainder intent after it |
| `nosplit_adjacent_to_real_conjunction` | A nosplit phrase ("compare", "versus", etc.) placed next to a different, unrelated real conjunction elsewhere in the query |
| `discourse_framing_plus_real_keyword` | A discourse phrase followed by a clean keyword match for a different source |
| `nested_proper_noun_pairs` | Two distinct proper-noun pairs in the same query, testing whether the per-occurrence guard protects both independently |
| `no_intent_fallthrough` | A query with no `INTENT_MAP` keyword at all: does it fall through to Kiwix/LLM routing sanely? |

Each generated query is fingerprinted by the ingredients used (not the literal string). Generation biases toward fingerprints never seen before and falls back to a repeat only once a recipe's seed vocabulary is exhausted. This was confirmed directly: against a single-recipe, five-topic test vocabulary, all five topics surface as novel within the first five generations before repeats begin.
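The novelty bias can be sketched as follows. This is a minimal illustration, not the project's actual generator: fingerprint(), pick_ingredients(), the hash choice, and the five-topic vocabulary are all hypothetical names and values chosen for the example.

```python
import hashlib
import random


def fingerprint(recipe: str, ingredients: tuple) -> str:
    """Hash the recipe name and its ingredients, not the rendered query text."""
    key = recipe + "|" + "|".join(ingredients)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def pick_ingredients(recipe, candidates, seen, rng):
    """Prefer a candidate with an unseen fingerprint; repeat only when exhausted."""
    novel = [c for c in candidates if fingerprint(recipe, c) not in seen]
    choice = rng.choice(novel if novel else candidates)
    seen.add(fingerprint(recipe, choice))
    return choice


# Hypothetical single-recipe, five-topic test vocabulary.
topics = [("black holes",), ("bitcoin",), ("tides",), ("volcanoes",), ("jazz",)]
seen = set()
rng = random.Random(0)
first_five = [
    pick_ingredients("no_intent_fallthrough", topics, seen, rng) for _ in range(5)
]
sixth = pick_ingredients("no_intent_fallthrough", topics, seen, rng)
```

Because the fingerprint covers ingredients rather than text, two differently worded renderings of the same ingredient combination would count as the same test case under this sketch.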

The one place an LLM call would actually be worth its cost is periodic (weekly-scale, not per-cycle) expansion of the seed lists themselves — PROPER_NOUN_PAIRS, CONDITIONAL_SEEDS, _DISCOURSE_TOPICS — not the generation loop itself. That's a deliberate, not-yet-built follow-up, not part of the hot path.

What gets flagged

Seven checks run in priority order against every generated query's real result:

Crash — an exception escaped, or a raw traceback ended up in the response body
Source mismatch — source_used doesn't match any source the query's own keywords actually pointed at (fusion is always allowed, since merging multiple real sources is itself correct behavior)
Part-count mismatch — a multi_intent_chain query's intended intent count is significantly off from its result's [SOURCE — LABEL] header count
Discourse framing dropped kiwix — a discourse_framing_plus_real_keyword query's result has neither source_used == "kiwix" nor a [KIWIX — ...] header
Conditional remainder missing sections — a conditional_with_remainder query's result has zero [SOURCE — LABEL] headers at all
Unexpected empty — the result matches one of fusion._looks_empty()'s own canonical empty/error phrases
Latency outlier — more than 1.5x the same recipe's own historical p95, once at least 10 samples exist
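The latency-outlier rule can be sketched like this. The 1.5x factor and the 10-sample minimum come from the text above; the nearest-rank percentile method and the function names are assumptions for illustration, and the sketch ignores any floor setting the real check may apply.

```python
import math

MIN_SAMPLES = 10       # from the rule above
OUTLIER_FACTOR = 1.5   # from the rule above


def p95(samples):
    """Nearest-rank 95th percentile (an assumed method, not necessarily the real one)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def is_latency_outlier(latency_ms, history_ms):
    """Flag only when enough history exists for the same recipe."""
    if len(history_ms) < MIN_SAMPLES:
        return False
    return latency_ms > OUTLIER_FACTOR * p95(history_ms)


short_history = [1000] * 9
full_history = list(range(100, 1100, 100))  # ten samples: 100..1000 ms
```

With fewer than ten samples the check stays silent no matter how slow the run was, which is why the very first cycle cannot produce a latency flag.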

A flagged combination is stored, never silently dropped. GET /adversarial/flagged returns the union of two things: combinations flagged on their most recent run, and combinations that have ever been flagged and haven't been explicitly dismissed by a human yet — not just the narrower "currently flagged" set alone.

This distinction exists because of a real gap a reviewer caught in an earlier version of this feature. The original design only tracked "currently flagged," so a combination flagged once for an intermittent anomaly (a flaky latency outlier, a transient bug that doesn't reproduce on every run) could silently vanish from the review queue the moment the same fingerprint happened to be re-rolled and came back clean, with no human ever having reviewed or dismissed it. Each row now carries three pieces of state: ever_flagged (sticky, never auto-resets); first_flagged_reason/first_flagged_timestamp (the original anomaly, preserved even after later clean runs overwrite the last_* columns); and currently_flagged (true only if the most recent run is still actively anomalous). A person can therefore tell "still broken right now" apart from "flagged once, currently clean, still genuinely needs a look."

The only way a combination actually leaves the default review queue is POST /adversarial/dismiss?fingerprint=... — a real human action, not a side effect of a lucky clean run. Dismissal doesn't delete history (include_dismissed=true still shows it), and a genuinely new flag on a previously-dismissed combination correctly resurfaces it — an old, closed-out review doesn't permanently suppress a fresh, unrelated anomaly on the same fingerprint later.
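The row lifecycle described above can be modelled in memory as a rough sketch. The field names ever_flagged, currently_flagged, first_flagged_reason, and review_status come from the text; the CombinationRow class, the helper functions, and the "dismissed" status string are hypothetical. It also simplifies by resetting review_status on any flagged run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CombinationRow:
    fingerprint: str
    ever_flagged: bool = False           # sticky, never auto-resets
    currently_flagged: bool = False      # reflects only the most recent run
    first_flagged_reason: Optional[str] = None
    last_reason: Optional[str] = None
    review_status: Optional[str] = None  # None, or "dismissed" (assumed value)


def record_run(row, reason):
    """Record one run; reason is None for a clean run."""
    row.last_reason = reason
    row.currently_flagged = reason is not None
    if reason is None:
        return  # a clean run never clears ever_flagged
    if not row.ever_flagged:
        row.ever_flagged = True
        row.first_flagged_reason = reason
    row.review_status = None  # a new flag resurfaces a dismissed row


def dismiss(row):
    row.review_status = "dismissed"


def undismiss(row):
    row.review_status = None  # back to the never-dismissed state


def in_review_queue(row, include_dismissed=False):
    if not row.ever_flagged:
        return False
    return include_dismissed or row.review_status != "dismissed"


row = CombinationRow("abc123")
record_run(row, "latency_outlier")
record_run(row, None)  # lucky clean re-roll
```

After the clean re-roll, the row is no longer currently flagged but still sits in the queue until a person dismisses it.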

POST /adversarial/undismiss?fingerprint=... exists for the real reason most "one-way" actions eventually need a way back: a mistake. The first real batch-review on MiniDock dismissed several flags at once by matching index numbers against a listing fetched a turn earlier in a conversation, rather than a freshly re-fetched one — the live queue had reordered in between (a new flag had been recorded), so the indices no longer lined up, and two genuinely unresolved flags got closed out along with the seven that were actually fine. Before this endpoint existed, the only way back was editing the database by hand. Undismiss restores review_status to exactly the state it was in before the first-ever dismissal — NULL, the same as a combination that was never dismissed at all — not a new, third state.

A bug this feature found in itself, before it ever ran in production

Building the discourse-framing check exposed a real logic bug during its own unit testing, worth recording here in the same spirit as the rest of Design History: the first version checked "kiwix" in result.lower() as one of its two ways to confirm kiwix was actually used. A genuinely realistic mock result reading "plain web result, no kiwix involved" — explicitly stating kiwix was not used — contains the literal substring "kiwix", so the naive check passed it as if kiwix had been present. Fixed by trusting only source_used and the real, structural "[KIWIX —" header marker fusion.py actually emits — never a freeform substring search across response text. A small, contained version of exactly the kind of trap this whole feature exists to catch in Mnemolis itself, caught here by a real failing unit test rather than by accident.
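The difference between the two versions of the check fits in a few lines. This is a sketch with hypothetical function names; only source_used, the "[KIWIX —" marker, and the mock sentence come from the text above.

```python
KIWIX_HEADER_MARKER = "[KIWIX —"  # structural marker quoted in the text above


def naive_kiwix_used(result: str) -> bool:
    """First version: freeform substring search (the buggy approach)."""
    return "kiwix" in result.lower()


def kiwix_used(source_used: str, result: str) -> bool:
    """Fixed version: trust only source_used and the structural header."""
    return source_used == "kiwix" or KIWIX_HEADER_MARKER in result


mock_result = "plain web result, no kiwix involved"
fused_result = (
    "[KIWIX — ENCYCLOPEDIC KNOWLEDGE — UNRELATED TO OTHER SECTIONS BELOW]\n"
    "Black hole: a region of spacetime..."
)
```

The naive version accepts mock_result because the word appears in the sentence that denies kiwix was used; the structural version rejects it.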

A second, more consequential bug surfaced during code review of the first fix above: the original design explicitly documented "a flag is only ever cleared by a clean re-roll of the same fingerprint" as a deliberate choice — but a reviewer correctly identified that this was a real risk, not a stylistic tradeoff, specifically for intermittent anomalies. The ever_flagged/first_flagged_*/review_status design above is the actual fix, not a reframing of the old behavior — and writing the fix surfaced two more real bugs in its own first draft: a schema-migration ordering bug (an index was created on the new ever_flagged column before the column itself had been added to a pre-existing table, raising no such column: ever_flagged on every real, already-deployed database), and a missing review_status reset (a dismissed combination that got a genuinely new, different flag later stayed permanently invisible, since nothing ever cleared the earlier dismissal). Both were caught by failing tests written specifically to exercise the scenario, not found by inspection — the same discipline this whole feature exists to apply to Mnemolis itself, applied here to its own code.
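The migration-ordering bug is easy to reproduce with the standard sqlite3 module. The sketch below assumes a SQLite store (suggested by the quoted error message); the table name combinations, the index name, and the migrate() helper are hypothetical.

```python
import sqlite3


def migrate(conn):
    """Add new columns first, then create indexes that depend on them."""
    cols = {row[1] for row in conn.execute("PRAGMA table_info(combinations)")}
    if "ever_flagged" not in cols:
        conn.execute(
            "ALTER TABLE combinations "
            "ADD COLUMN ever_flagged INTEGER NOT NULL DEFAULT 0"
        )
    if "review_status" not in cols:
        conn.execute("ALTER TABLE combinations ADD COLUMN review_status TEXT")
    # Safe only now: the indexed column is guaranteed to exist.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_combinations_ever_flagged "
        "ON combinations(ever_flagged)"
    )
    conn.commit()


# Simulate a pre-existing, already-deployed table without the new columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE combinations (fingerprint TEXT PRIMARY KEY, last_reason TEXT)"
)
migrate(conn)
migrate(conn)  # running twice is a no-op
```

A test like this has to start from the old schema. Creating the table fresh with every column already present would hide the ordering problem entirely.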

First real run, on MiniDock

The first cycle ever run against the real, fully-reachable Kiwix/SearXNG/Ollama stack came back clean — 8/8, zero flags. Worth recording what it actually generated, since "clean" doesn't mean "boring":

nested_proper_noun_pairs              fusion  11909ms
conditional_with_remainder            uptime   2028ms
no_intent_fallthrough                 kiwix    1092ms
discourse_framing_plus_real_keyword   fusion   6080ms
discourse_framing_plus_real_keyword   fusion   3080ms
conditional_with_remainder            fusion    276ms
no_intent_fallthrough                 kiwix    1990ms
nosplit_adjacent_to_real_conjunction  web      2502ms

Two real things worth noting, neither of which got flagged (correctly — no history existed yet for the latency check to compare against):

"whats the deal with the Beatles and the Rolling Stones plus Mercury and Venus, in addition since last time" — two proper-noun pairs in one query, resolved to fusion in 11.9 seconds, by far the slowest of the eight. A real, legitimately slow case the recipe was built to surface; worth watching once more history accumulates.
The two conditional_with_remainder queries differed widely in latency, 2028ms vs. 276ms. This is almost certainly a cache hit/miss difference on the sub-query, not a real anomaly, and exactly the kind of normal variance ADVERSARIAL_TEST_LATENCY_OUTLIER_FLOOR_MS exists to absorb.

Real bugs this feature found in Mnemolis itself, after running for real

This is the actual point of the feature, not a footnote: after running for roughly a day against MiniDock's real stack (136 real combinations tried, 9 flagged), tracing every single flag — not just the ones that looked interesting — turned up two genuine, previously-unknown bugs in Mnemolis's actual routing/decomposition logic, one genuine false positive in this feature's own detector, and one real, structural (not buggy) latency characteristic worth documenting rather than chasing.

Real bug: discourse-framing escalation never ran on the keyword-match path

The Discourse-Framing Investigation documents fixing "all four real code paths" inside _llm_detect(): fresh and cached, single- and multi-source. All four genuinely were fixed. What none of those four cover is detect_intent() itself. Its own "if source: return source" check returns early the instant _keyword_detect() matches any real INTENT_MAP keyword, even a single, common, generic one like "rss" or "news". That short-circuits before _llm_detect() (and therefore every one of its four correctly fixed escalation paths) is ever reached.

A real, live flag caught this directly: "everyone keeps talking about black holes, and rss" resolved to bare "news" in 35ms — far too fast to have touched the LLM at all, confirming pure keyword-match resolution. Reproduced and generalized immediately: every natural discourse-framed sentence tried that happened to mention any ordinary INTENT_MAP word ("news", "weather", "rss", "feeds", "door locked") hit the identical gap, for both single- and multi-keyword matches. The original fix narrowed the bug's surface area — closing the LLM-routing version — without ever closing the keyword-routing version, since INTENT_MAP contains dozens of short, ordinary words that can easily co-occur with genuine discourse framing in a real sentence.

Fixed by applying the exact same, already-existing _escalate_single_source_for_discourse_framing() / _escalate_multi_source_for_discourse_framing() helpers directly inside detect_intent()'s keyword-match branch — no new escalation logic, just reusing what _llm_detect() already had, at the one call site that was missing it.

Real bug: two real `INTENT_MAP` keywords made entirely of stop words were silently dropped during decomposition

A second flagged row — "feeds plus is it up in addition later today also door locked as well as google" — was meant to test 5 independent intents but only resolved to 3 visible sections. Traced directly: _decompose() only produced 4 parts, not 5, with "is it up" (the literal, real uptime keyword phrase) missing entirely — not folded into a neighboring clause, just gone.

Root cause: _filter_meaningful()'s stop-word-stripping check had no awareness of INTENT_MAP at all. "is it up" and "are they up" are made entirely of common English stop words ("is", "it", "up", "are", "they"), and were confirmed to be the only two of all 113 real keyword phrases across every source with that property. A clause consisting only of one of these phrases came back with zero content_words and was discarded as not meaningful.

Fixed the same way this function already handles a structurally identical problem for _COLLOQUIAL_PHRASES — checking the clause against the real, flattened INTENT_MAP keyword list (_ALL_INTENT_KEYWORDS, computed once at import time) before falling through to generic stop-word stripping. A real keyword phrase now always counts as meaningful, even if every individual word in it happens to be a stop word — closing the general case, not just these two phrases by name, so a future INTENT_MAP addition with the same property is automatically covered too.
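A minimal sketch of the keyword-first check follows. The INTENT_MAP subset and stop-word list are illustrative stand-ins rather than the real data, and is_meaningful() is a hypothetical single-clause simplification of _filter_meaningful().

```python
# Illustrative subset only; the real INTENT_MAP has 113 keyword phrases.
INTENT_MAP = {
    "uptime": ["is it up", "are they up"],
    "news": ["news", "rss", "feeds"],
    "ha": ["door locked"],
}
# Flattened once at import time, mirroring _ALL_INTENT_KEYWORDS.
_ALL_INTENT_KEYWORDS = frozenset(
    kw for keywords in INTENT_MAP.values() for kw in keywords
)
# Assumed stop-word list for the example.
_STOP_WORDS = frozenset({"is", "it", "up", "are", "they", "the", "a", "in", "also"})


def is_meaningful(clause: str) -> bool:
    """Return True if a decomposed clause should be kept."""
    text = clause.strip().lower()
    # A real keyword phrase always counts, even if every word is a stop word.
    if text in _ALL_INTENT_KEYWORDS:
        return True
    content_words = [w for w in text.split() if w not in _STOP_WORDS]
    return bool(content_words)
```

Checking the flattened keyword set rather than the two phrases by name is what makes a future all-stop-word keyword safe without another code change.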

Real bug: a flagged-clean fusion result still had a real, visible answer-quality problem

A real /search call against MiniDock, run to manually confirm the discourse-framing keyword-path fix above, returned a technically correct result by every check this feature runs — source_used: "fusion", real [NEWS — ...] and [KIWIX — ...] sections both present — and would have scored a clean pass on every one of the seven checks above. But the actual kiwix section was bad: for "everyone keeps talking about black holes, and rss", kiwix returned an unrelated Space StackExchange thread about Hubble telescope camera placement and an unrelated Wikipedia article about a true-crime podcast — never the real Black Hole article. This is exactly the kind of thing this feature's own hard rule (never judge correctness) means it can't catch on its own — finding it took a human actually reading a result and asking "is this actually good," not a structural check. Worth recording here anyway, since the eventual fix traces back through the exact same recipe this feature generates.

Tracing it found two distinct, real root causes, both upstream of anything discourse_framing_plus_real_keyword's own check inspects:

fusion.search() calls every selected source with the identical, full, unmodified query string — "everyone keeps talking about black holes, and rss" in its entirety, not separated per-source. Kiwix has no way to know "rss" is the literal text that triggered news as a co-source, not part of its own topic — confirmed this is general, pre-existing behavior, not specific to discourse framing: an ordinary, non-discourse multi-keyword fusion query ("check the news and tell me about black holes") shows the identical pattern.

The actual fix wasn't to make kiwix defensively robust against arbitrary cross-source noise. A generic "strip every other source's keywords" approach was checked and confirmed unsafe: words like "weather", "forecast", and "google" are real INTENT_MAP triggers for other sources but also legitimate kiwix topics in their own right, so it would have broken real queries like "tell me about google's history". (Rejecting a fix like that before shipping is exactly the kind of check this project's bug-hunting culture insists on.) The real fix was upstream. _decompose() should have split "black holes" and "rss" into two independent clauses in the first place, using the same mechanism that already correctly handles every other multi-intent query in the system. It didn't, because of a second, separate bug: "rss" (confirmed the only real INTENT_MAP keyword that is itself 3 characters or shorter) was being discarded by _filter_meaningful()'s "if len(p) <= 3: continue" length gate before the _ALL_INTENT_KEYWORDS check (added earlier in this same investigation, for "is it up"/"are they up") ever got a chance to protect it. The two checks ran in the wrong order. Fixed by moving the keyword/colloquial checks ahead of the length gate. Once decomposition correctly splits the query, kiwix only ever receives "black holes" as its own search text, so the cross-source pollution problem needs no separate fix: the noise word never reaches kiwix in the first place.

Scoring never actually used the cleaned text _build_search_terms() already builds. This is separate from the decomposition fix above and worth fixing regardless, since fusion's "same full query to every source" behavior is itself a broader, pre-existing pattern that could resurface this class of problem elsewhere. The Discourse-Framing Investigation documents fixing search-term pollution from discourse-framing words ("everyone", "obsessed") by stripping them in _build_search_terms(), and that part is true. But _score_result(), which ranks whatever Kiwix's search actually returns, was never updated to use that same cleaned text. It scored against the raw, original query parameter, where "everyone"/"keeps"/"talking" still counted as real words; this was confirmed even for the original bitcoin case the wiki documents as fully fixed. That case's winner never visibly changed only because the real Bitcoin article's title-overlap signal was strong enough to win despite the noise; "black holes" had no such margin. Fixed by stripping discourse framing from the specific word set used for keyword-overlap scoring (query_words). query_lower itself stays the full, original phrasing, since the exact-match check and _is_definitional_query() need the real leading phrase structure ("what's the deal with") that _strip_discourse_framing() was never meant to touch.
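To see why scoring against the raw query matters, consider a toy overlap score. Everything here is a simplified stand-in: the DISCOURSE_WORDS set, the word-level stripping, the overlap formula, and the title "Everyone Keeps Talking" are hypothetical, and the real _score_result() uses more signals than this.

```python
# Assumed discourse-framing words for the example only.
DISCOURSE_WORDS = frozenset({"everyone", "keeps", "talking", "about", "obsessed"})


def words(text):
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(",.?!'\"") for w in text.lower().split()} - {""}


def overlap_score(query_words, title):
    """Fraction of query words that also appear in the title."""
    if not query_words:
        return 0.0
    return len(query_words & words(title)) / len(query_words)


query = "everyone keeps talking about black holes"
raw_words = words(query)                    # what scoring used before the fix
clean_words = raw_words - DISCOURSE_WORDS   # what scoring uses after the fix
```

With raw_words, a title that merely echoes the framing words can outscore the on-topic article; with clean_words it scores zero.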

A third, separate real bug found verifying the first two: duplicate sections from nested fusion

Re-running the exact same real query against MiniDock to confirm the decomposition/scoring fixes above worked — and they did; the real Black Hole disambiguation article correctly led the kiwix section — surfaced a third, genuinely separate bug in the actual merged answer: a second, redundant [NEWS — ...] section appeared near the end, duplicating real headlines already shown earlier in the response ([KIWIX, NEWS, WEB, NEWS] instead of the correct [KIWIX, NEWS, WEB]).

Root cause: once decomposition correctly splits a query like this into ["...black holes,", "rss"], the first clause's own LLM-judged source selection can independently land on internal fusion (e.g. ["kiwix", "news", "web"] together, all sharing one already-headered, nested blob), while the second, separately-decomposed "rss" clause resolves to bare news on its own. _merge_same_source() — the function that already correctly merges two bare same-source tuples like ("ha", ...) and ("ha", ...) — only ever compares the outer tuple label. "fusion" (the first clause's label) and "news" (the second clause's label) are genuinely different outer labels, so it has no way to see that a section nested inside the fusion blob duplicates the second, separate tuple's own source.

Confirmed this is a real, pre-existing gap that predates this whole investigation, not a new regression from either fix above — it was already reachable via any other query shape where one decomposed clause's own ordinary LLM judgment happens to pick multiple sources that overlap with a different, separately-decomposed clause's source; this recipe's specific shape (discourse escalation + an unrelated trailing keyword) just made it reliably, easily reachable instead of needing a rarer LLM-judgment coincidence to trigger.

Fixed with a second, separate post-processing pass, _dedupe_nested_fusion_sections(), that runs on the final, fully-assembled result text — after _merge_same_source()'s existing tuple-level merge, not instead of it. It splits the text on the exact, real header strings fusion._format_header() can produce (re.escape()'d, not a generic bracket-matching pattern — confirmed safe against real content that happens to contain bracket-like or dash-like text, since a match requires the literal, exact header, not just a bracket shape), groups by header, and merges duplicate sections' content while preserving first-occurrence position — the same convention _merge_same_source() already uses. A true no-op for the overwhelming majority of results, which never contain a duplicate section at all.
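The section-level pass can be sketched as below. The KIWIX and NEWS labels are the ones quoted elsewhere on this page, assuming the "[SOURCE — LABEL]" shape; the WEB label, the KNOWN_HEADERS list, and dedupe_sections() are hypothetical stand-ins for whatever fusion._format_header() and _dedupe_nested_fusion_sections() really do.

```python
import re

KNOWN_HEADERS = [
    "[KIWIX — ENCYCLOPEDIC KNOWLEDGE — UNRELATED TO OTHER SECTIONS BELOW]",
    "[NEWS — RECENT NEWS HEADLINES — GENERAL, NOT LOCATION-SPECIFIC UNLESS STATED]",
    "[WEB — WEB SEARCH RESULTS]",  # hypothetical label
]
# Match only the literal, exact header strings, never a generic bracket shape.
_HEADER_SPLIT = re.compile("(" + "|".join(re.escape(h) for h in KNOWN_HEADERS) + ")")


def dedupe_sections(text: str) -> str:
    """Merge sections that share a header, keeping first-occurrence order."""
    pieces = _HEADER_SPLIT.split(text)
    preamble, rest = pieces[0].strip(), pieces[1:]
    if not rest:
        return text  # no known headers: true no-op
    merged = {}  # header -> bodies; dicts preserve insertion order
    for header, body in zip(rest[0::2], rest[1::2]):
        merged.setdefault(header, []).append(body.strip())
    sections = [
        header + "\n" + "\n\n".join(b for b in bodies if b)
        for header, bodies in merged.items()
    ]
    out = "\n\n".join(sections)
    return preamble + "\n\n" + out if preamble else out


K, N, W = KNOWN_HEADERS
nested = f"{K}\nBlack hole\n\n{N}\nHeadline A\n\n{W}\nResult\n\n{N}\nHeadline B"
deduped = dedupe_sections(nested)
```

Note that this sketch only merges section bodies. It makes no attempt to remove repeated items inside a merged body.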

A fourth real bug found verifying the third: duplicate content survives even when the section-level fix works correctly

Re-running the exact query one more time to confirm the section-level fix above against MiniDock's real stack — and it worked exactly as designed, producing a single, correct [NEWS — ...] header — surfaced a fourth, separate bug: the same several headlines still appeared twice inside that one, correctly-deduplicated section's own body.

Root cause: _dedupe_nested_fusion_sections() fixed the structural duplication (two headers becoming one), but _merge_same_source()'s actual content join — current_result.rstrip() + "\n\n" + result.lstrip() — is a plain string concatenation with zero awareness of what's inside either blob. When the nested fusion blob's own news section and the second, separately-decomposed clause's bare news result both came from genuinely independent calls to news.search() — and FreshRSS's own _is_general_query() path means a broad query gets "everything, no filtering" — the two calls legitimately returned overlapping recent headlines, and nothing anywhere deduplicated across them.

The real fix had to happen at a specific point, found only after a first attempt failed: deduping the headline items after _merge_same_source()'s join (by re-splitting the already-joined text on its own "---" item separator) seemed reasonable, but a failing test caught the actual problem directly — by the time two blobs are joined with a bare "\n\n", the real boundary between "the last item of call 1" and "the first item of call 2" is no longer reliably distinguishable from an ordinary paragraph break within either call's own content, so a later split can silently merge what should have been two separate items into one and miss a real, earlier duplicate.

Fixed by moving the dedup before the join — a new _dedupe_items_across_blobs() helper runs at the one point where the boundary between the two original results is still completely unambiguous: two distinct strings, not yet concatenated. It splits each blob on the real "---" item separator every multi-item source (freshrss.py's news, searxng.py's web) already uses for its own result blocks, and removes any item from the second blob whose leading **Title** line exactly matches one already present in the first — exact match only, never fuzzy similarity. Both fusion._merge_same_source()'s own join and router.py's separate header-level merge now call this helper and use the real item separator (not a bare double-newline) when joining genuinely multi-item content, so the visual boundary between two merged result lists stays as clean and unambiguous as the dedup logic itself needs it to be.

A real false positive this feature's own detector had, fixed

A third flagged row ("what's offline as well as while i've been at work in addition this weekend plus news and security status", 5 intended intents, only 3 headers) turned out to be a false positive, not a Mnemolis bug. Decomposition produced all 5 correct parts, and every part resolved to the correct source. The 2 "missing" sources legitimately and correctly returned empty results, and route_with_source() deliberately drops an empty sub-query result before merging (if not _looks_empty(sub_result): parts.append(...)). That is exactly the right behavior; nobody wants an answer cluttered with empty sections.

The real problem: by the time _check_multi_intent_part_count sees the final merged string, there's no trace anywhere of which sub-queries were tried and legitimately came back empty versus which results were silently lost to a bug — that information is gone before merging ever happens, and recovering it would mean re-running every real backend call a second time just to validate the check, doubling real load on every test cycle.

Fixed by loosening the check from an exact-count comparison to "fewer than half of the intended sources produced any header at all." The original proper-noun-pair bug 5 this check exists to catch was a global veto, collapsing an entire multi-intent query down to a single, un-split result (0 or 1 headers against 4+ intended sources), not a partial gap such as 3 headers for 5 intended sources. "Fewer than half" is loose enough to never fire on ordinary empty-result variance across this recipe's real range (3–5 intended sources), while still catching a genuine large-scale collapse with the same shape as the original bug. As a direct consequence, the ADVERSARIAL_TEST_PART_COUNT_MISMATCH_TOLERANCE setting has been removed. There is nothing left to tune: the new threshold is a fixed, principled rule derived from the original bug's actual signature, not a per-deployment knob.

(This also corrected an unrelated, separate defect in the old check that the new logic improves on for free: the old version's n_headers > 0 guard meant a complete collapse to zero headers could never be flagged at all, for any number of intended sources — a worse blind spot than the false positive this same fix closes, since a total collapse is exactly bug 5's real signature. The new check correctly flags 0 headers for 2+ intended sources.)
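The loosened rule reduces to a single comparison. This is a sketch, not the body of _check_multi_intent_part_count; the function name and the minimum of two intended sources are taken from the description above.

```python
def part_count_collapsed(intended: int, n_headers: int) -> bool:
    """Flag when fewer than half of the intended sources produced a header."""
    if intended < 2:
        return False  # nothing to compare against for a single-intent query
    # Integer form of n_headers < intended / 2; no n_headers > 0 guard,
    # so a total collapse to zero headers is flagged too.
    return n_headers * 2 < intended
```

Under this rule the flagged "intended 5 intents, found 3 headers" row passes, while 0 or 1 headers against 4 or more intended sources is still caught.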

That fix, while real and correct on its own terms, turned out not to be the whole story. Tracing a completely separate flag much later — conditional_remainder_missing_sections on "if it is raining, I will be careful with communication, as well as feeds" — led to _HEADER_PATTERN itself: the regex both this check and _check_conditional_remainder_sections use to count real headers in a result string required exactly one literal " — " separator, with the character class after it deliberately excluding the em-dash. kiwix's real label ("ENCYCLOPEDIC KNOWLEDGE — UNRELATED TO OTHER SECTIONS BELOW") and news's real label ("RECENT NEWS HEADLINES — GENERAL, NOT LOCATION-SPECIFIC UNLESS STATED") both legitimately contain a second em-dash — so neither header could ever be matched by the original regex, at all, regardless of any threshold.

Both real part_count_mismatch flags this page already describes involved news as one of the intended sources. Reconstructing a realistic 5-header result including both vulnerable headers confirmed the regex undercounted by exactly 2 — 5 real headers, 3 counted — the precise shape of both flags' literal text ("intended 5 intents, found 3 headers"). The threshold fix above made the check tolerant of this undercount without ever finding why the undercount was happening; it likely papered over this exact regex bug the whole time, rather than the legitimate-empty-results explanation being the real, complete story.

Fixed properly by rebuilding _HEADER_PATTERN from the real, exact header strings fusion._format_header() can actually produce (re.escape()'d, not a generic bracket-matching character class) — the same safe approach router.py's own _dedupe_nested_fusion_sections() already uses for the identical underlying need. Several existing tests for both checks had also been using fabricated header text ("[KIWIX — A]") that happened to be equally invisible to the broken regex for an unrelated reason, which is the real, structural reason this had survived in the test suite for as long as it did — a test using fake data can accidentally agree with a real bug. New tests use only real header strings going forward.
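The undercount is reproducible in a few lines. OLD_HEADER_PATTERN below is a guess at the shape of the original regex based on the description above, not the real pattern. The KIWIX and NEWS labels are the real ones quoted on this page, while the HA, UPTIME, and WEB labels are hypothetical.

```python
import re

REAL_HEADERS = [
    "[KIWIX — ENCYCLOPEDIC KNOWLEDGE — UNRELATED TO OTHER SECTIONS BELOW]",
    "[NEWS — RECENT NEWS HEADLINES — GENERAL, NOT LOCATION-SPECIFIC UNLESS STATED]",
    "[HA — HOME ASSISTANT STATE]",   # hypothetical label
    "[UPTIME — SERVICE STATUS]",     # hypothetical label
    "[WEB — WEB SEARCH RESULTS]",    # hypothetical label
]

# Assumed shape of the broken regex: one " — " separator, then a character
# class that excludes the em-dash, so a second em-dash can never match.
OLD_HEADER_PATTERN = re.compile(r"\[[A-Z]+ — [^\]—]+\]")

# Rebuilt from the exact header strings, each one escaped.
NEW_HEADER_PATTERN = re.compile("|".join(re.escape(h) for h in REAL_HEADERS))

result_text = "\n\n".join(f"{header}\nsome content" for header in REAL_HEADERS)
old_count = len(OLD_HEADER_PATTERN.findall(result_text))
new_count = len(NEW_HEADER_PATTERN.findall(result_text))
```

Under these assumptions the old pattern finds three of the five headers and the rebuilt one finds all five, the same 5-versus-3 shape as the flags.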

A real, genuine backend timeout — correctly reported, but with no way to tune it

The last of this round's two genuinely unresolved flags: unexpected_empty on "if any services are down, let me know right away, as well as lights off", latency 30056ms. The number itself was the actual clue — UptimeKumaApi(settings.uptime_kuma_url, timeout=30) was a bare, hardcoded 30-second client timeout, and 30056ms is exactly that, plus the small overhead of everything else the query touched.

Tracing the real call path confirmed this is not a Mnemolis bug at all — the Uptime Kuma client connection genuinely timed out, the exception was caught, and Mnemolis correctly, honestly returned "Could not connect to Uptime Kuma: {e}" rather than hiding the failure or crashing. fusion._looks_empty() correctly recognizes "could not connect" as a real failure signal, and this feature's own check correctly flagged the resulting empty-looking merged response — every layer did exactly what it was supposed to do.

The real, fixable gap: there was no setting anywhere to tune that 30-second wait. Every other source this project touches (SEARXNG_REQUEST_TIMEOUT_SECONDS, FUSION_TIMEOUT_SECONDS) already has a real, configurable timeout; Uptime Kuma's was the one bare literal left over. 30 seconds is a long time to wait on what should be a fast, same-LAN service before falling back — fixed by adding UPTIME_KUMA_TIMEOUT_SECONDS (default 10), wired directly into the real client call, with the documented fallback behavior (a real "could not connect" message on a genuine failure) completely unchanged.
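One way such a setting could be read is sketched below. It assumes the value arrives as an environment variable and that invalid values fall back to the default; the project's actual settings loader may work differently, and uptime_kuma_timeout_seconds() is a hypothetical helper.

```python
import os

DEFAULT_UPTIME_KUMA_TIMEOUT_SECONDS = 10.0  # default stated above


def uptime_kuma_timeout_seconds(env=os.environ) -> float:
    """Read UPTIME_KUMA_TIMEOUT_SECONDS, falling back to the default."""
    raw = env.get("UPTIME_KUMA_TIMEOUT_SECONDS", "")
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_UPTIME_KUMA_TIMEOUT_SECONDS
    # Assumed policy: a zero or negative timeout is treated as invalid.
    return value if value > 0 else DEFAULT_UPTIME_KUMA_TIMEOUT_SECONDS


# The value would then replace the bare literal at the client call site, e.g.
# UptimeKumaApi(settings.uptime_kuma_url, timeout=uptime_kuma_timeout_seconds())
```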

Whether the original 30-second timeout on a real, live deployment reflects a genuine, repeatable network issue worth investigating further, or was a one-off hiccup, is a separate, open question this fix doesn't answer on its own — but a shorter, configurable timeout means a future occurrence fails fast and falls back sooner, rather than holding up an entire conditional+remainder response for half a minute.

A real, structural latency characteristic — found, initially accepted, then actually fixed

Three of the real latency-outlier flags, plus one near-timeout unexpected_empty flag, all traced to the same mechanical cause: route_with_source() used to handle a conditional_with_remainder query's condition and remainder as two separate, sequential, blocking calls — sub_condition_result, sub_source = route_with_source(sub_condition, "auto"), then, afterward, remainder_result, remainder_source = route_with_source(sub_remainder, "auto"). If either half hit a slow LLM call or a slow fusion fan-out, the total wall-clock time was additive, not the max of the two.

This was initially recorded as deliberately not being fixed — the reasoning at the time cited the same conditional-handling code's two unrelated real bug fixes (The Recursion Design Bug among them) as a reason to avoid touching it. That reasoning was later re-examined directly and found to conflate "this code has a real bug history" with "this specific change is risky" — the two unrelated bugs live in parsing and interpretation logic, not in call ordering, and don't actually bear on whether the condition and remainder calls can safely run concurrently. They can — the same as the web query-expansion case below, which was fixed first and supplied both the real blocker (a genuine file-write race, since fixed) and the verification discipline (a real concurrency-timing test, a real suppress_cache_writes() propagation test) this fix reused directly. See Conditional Query Detection for the full mechanism and the real measured improvement.

A second, structurally different latency source — `web`'s own query expansion, now fixed

A fresh flag on nosplit_adjacent_to_real_conjunction — "difference between Iran and Israel, and find online", 6412ms vs. a recipe p95 of 2502ms — traced to a genuinely different mechanism than the conditional+remainder case above, despite the same surface symptom. This query never decomposes (the nosplit guard correctly keeps "Iran and Israel" intact) and resolves to a single source, web — no fusion, no merge, nothing conditional involved at all. The cost was entirely inside searxng.py's own search(): a primary SearXNG fetch, followed by Query Expansion's get_alternate_phrasing() (a real, blocking LLM completion call), followed by a second SearXNG fetch for the alternate phrasing — three sequential network/LLM round-trips billed as one source's latency. Reproduced directly with realistic mocked timings: roughly 4x the cost of a single fetch when expansion fired versus when it didn't.

Unlike the conditional+remainder case as it was understood at that point, the two fetches here have no real data dependency on each other: get_alternate_phrasing() only needs the original query text, not the primary fetch's results. That made this a genuinely better parallelization candidate, if it could be shown safe rather than just assumed safe. _fetch_searxng() itself was already confirmed to be a pure function with no shared state. The one real, open question was get_alternate_phrasing()'s own routing-cache read/write, not because it looked unsafe, but because this project has real, hard-won history (suppress_cache_writes()'s own ContextVar design) showing that concurrent access to shared cache state looks safe right up until it isn't. Answering that question meant auditing the cache path directly.

That audit found a real, separate, pre-existing bug — not in the cache's in-memory dict (a single dict mutation is already safe under the GIL), but in how both the routing cache and the result cache persisted to disk: a bare open(path, "w") followed by json.dump(), with no protection against two concurrent writers truncating the same file at the same time. This wasn't a new risk introduced by parallelizing query expansion — FastAPI's /search endpoint is a synchronous route, so Starlette already runs genuinely concurrent real requests on its own thread pool today, making this a real, already-live exposure. Confirmed directly: a deliberate 8-writer/8-reader stress test against the old pattern produced 79,609 JSON corruption errors in two seconds. The actual blast radius was bounded — the existing cache-load logic already catches a corrupt file and starts fresh — but "silently lose the entire on-disk cache on next restart" is still a real, avoidable cost.

Fixed properly, not worked around, with the standard pattern for this exact problem: write to a temporary file in the same directory, then os.replace() onto the real target — atomic on POSIX, so the file is always either the complete old version or the complete new one, never a partial write from either side. The identical 8-writer/8-reader stress test against the fixed version: zero errors. Both _save_routing_cache() and _save_cache() now use this shared helper, closing the same gap in both places rather than just the one this investigation started from.
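The write-then-replace pattern described above can be sketched as follows. The helper name `atomic_write_json` is hypothetical (the source only says a shared helper exists), and the `fsync` call is an extra durability step assumed here, not something the source confirms.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, data) -> None:
    """Write JSON to a temp file in the target's directory, then swap it into place.

    Readers see either the complete old file or the complete new one,
    because os.replace() is a single atomic rename on POSIX.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # assumed extra step: flush to disk before the swap
        os.replace(tmp_path, path)
    except BaseException:
        # Never leave a stray temp file behind if the write or rename failed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Each writer gets its own uniquely named temp file, so two concurrent writers never truncate the same file; the last `os.replace()` simply wins.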

With that real concern resolved, the primary fetch and the alternate-phrasing chain now run concurrently via a small ThreadPoolExecutor, the same pattern fusion.py already uses for its own multi-source dispatch. Verified with the exact original repro timings: 4.15s (sequential) down to 3.04s (concurrent) — not a full elimination, since the alternate chain's own two steps (LLM call, then its own second fetch) are still genuinely sequential within themselves; the real, available win was removing the outer wait between the primary fetch and that whole chain, not every sequential dependency inside it.

This is the second real recipe-latency-variance mechanism found, on two different recipes, both stemming from sequential work inside what the adversarial check treats as a single, uniform "this recipe's latency distribution." When this fix shipped, only this one was actually fixed; the conditional+remainder case was still recorded as a documented, accepted cost, and has since been fixed as well (see the section above). A per-recipe latency-history comparison, rather than every recipe sharing one global ADVERSARIAL_TEST_LATENCY_OUTLIER_MULTIPLIER, was considered as a way to give conditional_with_remainder its own, honestly higher baseline instead of judging it against a mix of fast and slow samples. Not yet built.

A real regression this very fix introduced, found while researching whether the riskier case was actually feasible

The question was then asked directly: was the conditional+remainder case, the one left as a documented, accepted cost above, actually infeasible to parallelize, or just assumed to be? Re-deriving the real data dependencies confirmed that the two route_with_source() calls inside _resolve_conditional() have no real dependency on each other either, the same as the web case above. Re-examining the original "two carefully-reasoned bug fixes nearby" caution found that both of those real bugs live in detect_conditional()'s parsing and _interpret_binary_state()'s keyword matching, neither of which the calls' execution order would touch.

That re-investigation is what surfaced this: researching whether ThreadPoolExecutor actually propagates contextvars.ContextVar state into worker threads (the real mechanism suppress_cache_writes() depends on) found that it does not, by default — confirmed as official, documented Python behavior, not an implementation quirk. Testing the already-shipped web query-expansion fix directly against this found a real, live regression: suppress_cache_writes() active in the calling thread was being silently ignored inside the concurrent alternate-phrasing thread, meaning a synthetic Adversarial Self-Testing query could leak a real write into the routing cache — precisely the bug suppress_cache_writes() exists to prevent, reintroduced by the very fix meant to improve performance.

Fixed by giving each submitted task its own contextvars.copy_context() call before submission, so the calling thread's suppression state correctly propagates into both worker threads. A first attempt at this fix tried sharing one captured context object between both tasks, which failed a second way: a single Context object cannot be entered by two threads simultaneously (Context.run() is documented as non-reentrant across concurrent execution) — confirmed directly via a real RuntimeError: cannot enter context... already entered from the test suite itself. Each task needs its own, independently-copied context, not one shared copy. Both failure modes were caught by real tests before shipping, not found in production after the fact — but worth being honest that the first version of this exact fix had a real gap, found only by deliberately going back and verifying the reasoning that had been used to avoid a different, structurally similar change.
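A minimal sketch of the per-task context copy, using only the standard library. The `suppress_cache_writes()` context manager and `_suppress` variable here are simplified stand-ins written for illustration, not Mnemolis's real implementation.

```python
import contextlib
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the flag that suppress_cache_writes() controls (hypothetical name).
_suppress = contextvars.ContextVar("suppress_cache_writes", default=False)


@contextlib.contextmanager
def suppress_cache_writes():
    """Simplified stand-in: set the flag for the duration of the with-block."""
    token = _suppress.set(True)
    try:
        yield
    finally:
        _suppress.reset(token)


def writes_suppressed() -> bool:
    return _suppress.get()


def run_both_naive(task_a, task_b):
    """Plain submit(): worker threads do not see the caller's ContextVar state by default."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fa, fb = pool.submit(task_a), pool.submit(task_b)
        return fa.result(), fb.result()


def run_both(task_a, task_b):
    """Each task runs inside its own copy of the caller's context.

    copy_context() is called once per task: a single Context object
    cannot be entered by two threads at the same time.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        fa = pool.submit(contextvars.copy_context().run, task_a)
        fb = pool.submit(contextvars.copy_context().run, task_b)
        return fa.result(), fb.result()
```

Calling `run_both(writes_suppressed, writes_suppressed)` inside `with suppress_cache_writes():` lets both workers observe the suppression; `run_both_naive` is included only to show the shape of the regression.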

One known limitation worth tracking, not yet tuned

Source mismatch on the conditional path — a conditional query's condition text gets routed through LLM-based source selection, which can validly land on a source that doesn't literally appear as an INTENT_MAP keyword in the query. The check doesn't yet distinguish "the LLM made a different valid call" from "the LLM made a wrong call" — right now it flags both the same way. Not yet a confirmed real false-positive rate against live traffic (unlike the part-count issue above, which was directly traced and confirmed) — recorded here as a standing, plausible concern worth watching, not yet acted on.
A single, global latency-outlier multiplier across every recipe — the two real, distinct latency-variance mechanisms that originally motivated this observation (conditional_with_remainder's sequential routing, web's query expansion) have both since been fixed at the root rather than accommodated with a per-recipe override. Recorded here in case a third, different mechanism surfaces a similar pattern in the future — at that point, a per-recipe baseline genuinely earns its complexity; fixing the actual cause has been the better trade twice in a row so far.

Configuration

| Setting | Default | What it controls |
| --- | --- | --- |
| `ADVERSARIAL_TEST_ENABLED` | `true` | Master on/off switch. `false` skips DB init, never registers the scheduler job, and `POST /adversarial/trigger` returns `{"status": "disabled"}` instead of running anyway. Checked at both scheduler-registration time and inside `run_adversarial_test_cycle()` itself, so a direct call can never accidentally run real queries against the LLM/SearXNG/Kiwix backends while turned off |
| `ADVERSARIAL_TEST_INTERVAL_MINUTES` | `60` | How often the scheduler tick fires |
| `ADVERSARIAL_TEST_BATCH_SIZE` | `8` | Queries generated per tick; cheap to raise (no LLM calls in the hot path) |
| `ADVERSARIAL_TEST_LATENCY_OUTLIER_MULTIPLIER` | `1.5` | How many multiples of a recipe's own historical p95 counts as a real latency outlier |
| `ADVERSARIAL_TEST_LATENCY_OUTLIER_FLOOR_MS` | `1000` | A floor below which latency is never flagged regardless of the multiplier; protects fast, cache-hit-driven queries from getting flagged just for being a multiple of an even-faster sample |
| `ADVERSARIAL_TEST_LATENCY_OUTLIER_MIN_SAMPLES` | `10` | How many historical samples a recipe needs before the latency-outlier check engages at all |
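
How the three latency-outlier settings could combine is sketched below. This is an illustration under stated assumptions, not the actual detector: the nearest-rank p95 method, the function names, and the strict comparisons at the boundaries are all assumed here.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile (assumed method; the real one may differ)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def is_latency_outlier(latency_ms, history_ms,
                       multiplier=1.5, floor_ms=1000, min_samples=10):
    """Decide whether one run counts as an outlier against its recipe's history.

    Defaults mirror the documented settings:
    multiplier  -> ADVERSARIAL_TEST_LATENCY_OUTLIER_MULTIPLIER
    floor_ms    -> ADVERSARIAL_TEST_LATENCY_OUTLIER_FLOOR_MS
    min_samples -> ADVERSARIAL_TEST_LATENCY_OUTLIER_MIN_SAMPLES
    """
    if len(history_ms) < min_samples:
        return False  # not enough history for the check to engage at all
    if latency_ms < floor_ms:
        return False  # fast runs are never flagged, whatever the ratio
    return latency_ms > multiplier * p95(history_ms)
```

With the defaults, a 6412ms run against a recipe whose p95 is 2502ms exceeds 1.5 x 2502 = 3753ms and would be flagged, which matches the web query-expansion flag described earlier on this page.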

/health reports adversarial_testing alongside snapshot_jobs, using the same staleness-grace-multiplier convention (SNAPSHOT_STALE_GRACE_MULTIPLIER, default 3x) the snapshot engine already uses. When disabled, it reports {"status": "disabled"} directly rather than eventually reading as "stale" — a deliberate off-switch shouldn't look like a job that silently stopped running.
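The staleness convention can be sketched roughly as below. The function name, the `"ok"` status string, and the treatment of a job that has never run are assumptions made for illustration; only `"disabled"`, `"stale"`, the interval default, and the 3x grace multiplier come from the text above.

```python
from datetime import datetime, timedelta, timezone


def adversarial_health(enabled, last_run, now,
                       interval_minutes=60, grace_multiplier=3):
    """Hypothetical health summary for the adversarial_testing job."""
    if not enabled:
        # A deliberate off-switch is reported as such, never as a stalled job.
        return {"status": "disabled"}
    if last_run is None:
        # Assumption: a job that is enabled but has never run reads as stale.
        return {"status": "stale"}
    allowed_gap = timedelta(minutes=interval_minutes * grace_multiplier)
    status = "stale" if now - last_run > allowed_gap else "ok"
    return {"status": status}
```

With the defaults, the job only reads as stale once more than 180 minutes have passed since its last run, so a single late or skipped tick does not trip the check.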

Endpoints

POST /adversarial/trigger — manually run one cycle immediately, rather than waiting for the next scheduled tick. Mirrors /snapshots/trigger's exact pattern. Returns {"status": "ran", "queries_run": N, "flagged": N}, or {"status": "disabled", "queries_run": 0, "flagged": 0} without touching any real backend if ADVERSARIAL_TEST_ENABLED is false.

GET /adversarial/flagged?limit=50&include_dismissed=false — the union of currently-flagged and ever-flagged-but-not-dismissed combinations, most recent first. Each row includes ever_flagged, currently_flagged, first_flagged_reason/first_flagged_timestamp (the original anomaly), and review_status. Pass include_dismissed=true for the full audit trail including closed-out rows. Reports {"status": "disabled", ...} the same way if turned off. Deliberately left unauthenticated, the same way /health and /areas already are: it exposes only synthetic, generated test queries and their structural anomaly flags, never real user queries or cache contents, so it sits outside API_KEYS' documented scope (POST /search and GET /changes only) for the same reason those two already do.

POST /adversarial/dismiss?fingerprint=... — mark a flagged combination as reviewed and closed. The fingerprint is the exact value from a flagged row's own fingerprint field, copied verbatim — not constructed by hand. Returns 404 for an unknown fingerprint. History is never deleted by a dismissal; a genuinely new flag on the same fingerprint later resurfaces it normally.

POST /adversarial/undismiss?fingerprint=... — the real, symmetric reversal. Use GET /adversarial/flagged?include_dismissed=true to find a dismissed row's fingerprint, since the default view no longer shows it once dismissed. Returns 404 for an unknown fingerprint; a fingerprint that was never dismissed in the first place is a safe no-op (the row already has the state this would restore it to).
