The Latency Parallelization Investigation

Bob edited this page Jun 26, 2026 · 1 revision

This is the story in Mnemolis's history that is scattered across the most pages: two structurally similar latency problems, found by the same feature and fixed at different times, where one fix's own regression was caught only by going back and checking whether the other problem had actually been correctly left alone. It is told here in the order it actually happened.

Two recipes, the same surface symptom

Adversarial Self-Testing flagged real latency outliers on two different recipes, both stemming from the same underlying shape of mistake: sequential work being billed as one source's latency, when the work involved had no real reason to run sequentially.

conditional_with_remainder — three of the real latency-outlier flags, plus one near-timeout unexpected_empty flag, all traced to the same mechanical cause: route_with_source() used to handle a conditional query's condition and remainder as two separate, sequential, blocking calls. If either half hit a slow LLM call or a slow fusion fan-out, the total wall-clock time was additive, not the max of the two — a condition that took 2 seconds and a remainder that took 6 added up to roughly 8 seconds total, not 6.
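The additive-versus-max arithmetic can be sketched with two stand-in calls. This is a minimal illustration, not the project's code: `slow_call`, `run_sequential`, and `run_concurrent` are hypothetical names, and the 2-second and 6-second durations from the example are scaled down 20x so the sketch runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_call(label: str, seconds: float) -> str:
    """Hypothetical stand-in for a blocking route call (LLM or fetch)."""
    time.sleep(seconds)
    return label


def run_sequential(cond_s: float, rem_s: float) -> float:
    """Run both halves back to back; wall-clock cost is roughly the sum."""
    start = time.perf_counter()
    slow_call("condition", cond_s)
    slow_call("remainder", rem_s)
    return time.perf_counter() - start


def run_concurrent(cond_s: float, rem_s: float) -> float:
    """Run both halves in threads; wall-clock cost is roughly the max."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        cond = pool.submit(slow_call, "condition", cond_s)
        rem = pool.submit(slow_call, "remainder", rem_s)
        cond.result()
        rem.result()
    return time.perf_counter() - start


# A 2s condition and a 6s remainder, scaled down 20x.
sequential = run_sequential(0.1, 0.3)
concurrent = run_concurrent(0.1, 0.3)
print(f"sequential ~{sequential:.2f}s, concurrent ~{concurrent:.2f}s")
```

The sequential figure tracks the sum of the two durations and the concurrent figure tracks the longer of the two, which is the gap the flagged queries were paying for.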

nosplit_adjacent_to_real_conjunction — a fresh flag on "difference between Iran and Israel, and find online" (6412ms vs. a recipe p95 of 2502ms) traced to a genuinely different mechanism despite the same surface symptom. This query never decomposes and resolves to a single source, web — no fusion, no merge, nothing conditional. The cost was entirely inside Query Expansion: a primary SearXNG fetch, followed by a real, blocking LLM completion call for an alternate phrasing, followed by a second SearXNG fetch — three sequential network/LLM round-trips billed as one source's latency. Reproduced directly with realistic mocked timings: roughly 4x the cost of a single fetch when expansion fired versus when it didn't.

The web case was fixed first

Of the two, the web/query-expansion case was the better parallelization candidate on its face: the primary fetch and the alternate-phrasing chain have no real data dependency on each other — get_alternate_phrasing() only needs the original query text, not the primary fetch's results. _fetch_searxng() itself was already confirmed to be a pure function with no shared state.

The one real open question was get_alternate_phrasing()'s own routing-cache read/write — not because it looked unsafe, but because this project has real, hard-won history (suppress_cache_writes()'s own ContextVar design) showing concurrent access to shared cache state looks safe right up until it isn't. Auditing that turned up a real, separate, pre-existing file-write race in how both caches persisted to disk — full mechanism and the 79,609-error stress test that confirmed it are in The Caching Concurrency Investigation.

With that resolved, the primary fetch and the alternate-phrasing chain now run concurrently via a small ThreadPoolExecutor, the same pattern fusion.py already uses for its own multi-source dispatch. Verified with the exact original repro timings: 4.15s sequential down to 3.04s concurrent — not a full elimination, since the alternate chain's own two steps (the LLM call, then its own second fetch) are still genuinely sequential within that chain; the real, available win was removing the outer wait between the primary fetch and the whole chain, not every dependency inside it.
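The shape of that change can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: `fetch`, `alternate_phrasing`, `alternate_chain`, and `search_with_expansion` are hypothetical stand-ins for the real functions, and the sketch leaves out ContextVar propagation, which the regression section of this page deals with.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(query: str) -> list[str]:
    """Hypothetical stand-in for a SearXNG fetch."""
    return [f"result for {query!r}"]


def alternate_phrasing(query: str) -> str:
    """Hypothetical stand-in for the blocking LLM rephrasing call."""
    if "fail" in query:
        raise RuntimeError("LLM call failed")
    return f"{query} (rephrased)"


def alternate_chain(query: str) -> list[str]:
    """LLM call, then a second fetch: these two steps stay sequential."""
    return fetch(alternate_phrasing(query))


def search_with_expansion(query: str) -> list[str]:
    """Run the primary fetch and the whole alternate chain concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(fetch, query)
        alternate = pool.submit(alternate_chain, query)
        results = primary.result()  # a primary failure propagates
        try:
            results = results + alternate.result()
        except Exception:
            # Assumed policy: a failed alternate chain is non-fatal,
            # so the primary results stand on their own.
            pass
    return results
```

In this shape the critical path is the longer of the primary fetch and the full alternate chain, which is why the measured win is a reduction rather than a full elimination.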

The conditional+remainder case was initially, wrongly, left alone

This was initially recorded as deliberately not being fixed. The reasoning at the time cited the same conditional-handling code's two unrelated real bug fixes — the recursion depth-counter bug and the greedy-consequence-regex bug — as a reason to avoid touching nearby code.

Re-examined directly, that reasoning conflated two different things. Both real bugs live in detect_conditional()'s parsing logic and _interpret_binary_state()'s keyword matching — neither has anything to do with the order the two route_with_source() calls inside _resolve_conditional() execute in. Re-deriving the actual data dependencies confirmed those two calls don't depend on each other any more than the web case's two SearXNG fetches did — a structurally similar case that had already been found feasible and fixed by the time this one was re-examined. "This code has a real bug history" and "this specific change is risky" are not the same claim, and the first doesn't establish the second without actually checking.

The real, original blocker turned out to be concrete and fixable, not a vague "this area is fragile" caution: the same file-write race the web fix had already found and closed. That fix removed the actual reason this page's earlier caution had any teeth.

Fixed the same way, with the same verification discipline — and worth being explicit that the discipline mattered, not just the idea. Building this fix started from the web fix's own lesson rather than relearning it from scratch: each task gets its own contextvars.copy_context() call before submission, and four real checks ran before trusting any of it — a genuine timing proof of concurrency, direct confirmation suppress_cache_writes() correctly reaches both worker threads, confirmation normal caching is unaffected when suppression isn't active, and a fourth check specific to this case: real exception propagation from the remainder thread. That fourth check exists because this case's failure semantics genuinely differ from the web fix's — a failed alternate-phrasing fetch is deliberately non-fatal (the primary search still stands on its own), but a failed remainder search is a real, user-facing failure that needs to surface, not get silently swallowed by the thread pool. All four passed cleanly; no second regression this time.

The thread pool is only spun up when a remainder genuinely exists. A plain "if X, Y" query with no trailing conjunction (the more common real-world shape) has an empty remainder and never needed a second call in the first place; that path is completely unchanged. Verified against realistic timings matching the real flagged query's shape (a 2-second fusion fan-out condition, a 1.5-second single-source remainder): 2.0s concurrent versus what would have been 3.5s sequential.
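A minimal sketch of that control flow, assuming a hypothetical `resolve_conditional` helper and a `route` callable standing in for the real routing call (neither name is taken from the codebase):

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Tuple

Route = Callable[[str], str]


def resolve_conditional(
    condition: str, remainder: str, route: Route
) -> Tuple[str, Optional[str]]:
    """Resolve a condition and an optional remainder.

    `route` is a hypothetical stand-in for the real routing call.
    """
    if not remainder:
        # Plain "if X, Y": no second call, no thread pool.
        return route(condition), None

    with ThreadPoolExecutor(max_workers=2) as pool:
        # Each task gets its own copy of the caller's context.
        cond = pool.submit(contextvars.copy_context().run, route, condition)
        rem = pool.submit(contextvars.copy_context().run, route, remainder)
        # result() re-raises any exception from the worker thread, so a
        # failed remainder surfaces to the caller instead of vanishing.
        return cond.result(), rem.result()
```

Calling `result()` on the remainder future is what makes a remainder failure fatal here; wrapping that call in a broad `try`/`except` would instead give the non-fatal behavior that suits the alternate-phrasing fetch.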

The regression that almost shipped twice

The question asked directly afterward was whether the conditional+remainder case was actually infeasible to parallelize, or just assumed to be. Re-deriving the dependencies (above) is what reopened that question, and the re-investigation is what surfaced something the web fix had missed.

Researching whether ThreadPoolExecutor actually propagates contextvars.ContextVar state into worker threads — the real mechanism suppress_cache_writes() depends on — found that it does not, by default, confirmed as official, documented Python behavior. Testing the already-shipped web query-expansion fix directly against this found a real, live regression: suppress_cache_writes() active in the calling thread was being silently ignored inside the concurrent alternate-phrasing thread, meaning a synthetic Adversarial Self-Testing query could leak a real write into the routing cache — precisely the bug suppress_cache_writes() exists to prevent, reintroduced by the very fix meant to improve performance.
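The default behavior is easy to reproduce in isolation. In this sketch `suppress_writes` is a hypothetical stand-in for whatever ContextVar backs suppress_cache_writes(); the real variable name and shape are not shown on this page.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the flag behind suppress_cache_writes().
suppress_writes = contextvars.ContextVar("suppress_writes", default=False)


def read_flag() -> bool:
    """Report the suppression flag as seen by the current thread."""
    return suppress_writes.get()


def flag_seen_by_worker() -> bool:
    """Set the flag in the caller, then read it from a pool worker."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the single worker thread before the flag is set, so the
        # second task runs on an already-existing, reused worker.
        pool.submit(read_flag).result()
        token = suppress_writes.set(True)
        try:
            return pool.submit(read_flag).result()
        finally:
            suppress_writes.reset(token)


print(flag_seen_by_worker())  # False: the worker never saw the caller's value
```

The caller has the flag set to True for the whole second submission, yet the worker reads the default, which is the silent-ignore behavior described above.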

Fixed by giving each submitted task its own contextvars.copy_context() call before submission, so the calling thread's suppression state correctly propagates into both worker threads. A first attempt at this fix tried sharing one captured context object between both tasks, which failed a second way: a single Context object cannot be entered by two threads simultaneously — confirmed directly via a real RuntimeError: cannot enter context... already entered from the test suite itself. Each task needs its own, independently-copied context, not one shared copy.
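Both halves of that paragraph can be sketched with the standard library alone. As before, `suppress_writes` is a hypothetical stand-in for the real suppression flag, and the two helper functions exist only for illustration.

```python
import contextvars
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the flag behind suppress_cache_writes().
suppress_writes = contextvars.ContextVar("suppress_writes", default=False)


def flags_with_per_task_copies() -> list[bool]:
    """Submit two tasks, each inside its own copy of the caller's context."""
    token = suppress_writes.set(True)
    try:
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [
                pool.submit(contextvars.copy_context().run, suppress_writes.get)
                for _ in range(2)
            ]
            return [f.result() for f in futures]
    finally:
        suppress_writes.reset(token)


def shared_context_error() -> str:
    """Enter one Context from two threads at once and return the error text."""
    shared = contextvars.copy_context()
    entered = threading.Event()
    release = threading.Event()

    def hold() -> None:
        # Keep the shared context entered until the main thread is done.
        entered.set()
        release.wait(timeout=5)

    with ThreadPoolExecutor(max_workers=1) as pool:
        holder = pool.submit(shared.run, hold)
        entered.wait(timeout=5)
        try:
            shared.run(lambda: None)  # second entry while the first is active
            message = ""
        except RuntimeError as exc:
            message = str(exc)
        finally:
            release.set()
        holder.result()
    return message
```

With per-task copies both workers report the caller's value, while the shared-context variant fails with a `RuntimeError` whose message says the context cannot be entered because it is already entered.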

Both failure modes were caught by real tests before shipping, not found in production after the fact — but worth being honest that the first version of this exact fix had a real gap, found only by deliberately going back and verifying the reasoning that had been used to avoid a different, structurally similar change. The conditional+remainder fix above was built with both lessons already in hand, which is why it didn't repeat either mistake.

What's still open

Two real recipe-latency-variance mechanisms have now been found on two different recipes, both stemming from sequential work inside what the adversarial check treats as a single, uniform "this recipe's latency distribution." Both are now fixed at the root. Still worth a real design pass at some point: a per-recipe latency-history comparison, rather than every recipe sharing one global ADVERSARIAL_TEST_LATENCY_OUTLIER_MULTIPLIER, would let any future recipe with a genuinely higher honest baseline avoid being judged against a mix of fast and slow samples. Not yet built.

The lesson: "this area has a real bug history" is a reason to be careful, not a reason to stop looking. The original deferral wasn't wrong to be cautious — it was wrong to treat caution about unrelated code as evidence about this specific change, without ever re-deriving whether the two were actually connected. The fix that came from re-checking that exact reasoning also caught a second, real regression along the way — vigilance about old assumptions paid for itself twice in the same investigation.
