-
Notifications
You must be signed in to change notification settings - Fork 0
The Benchmark Investigation Log
Six separate threads, three of them spanning multiple releases before reaching a real conclusion, one of them involving a real mistake that a later benchmark caught, one resolved as "understood and deliberately not fixed" rather than patched. This page is the chronological story — what each benchmark run actually found, what got tried, what worked, and what didn't. For the current, present-tense numbers and how to run your own benchmark, see Benchmarks instead; this page is the "how we got here," not the reference.
The raw, dated tables for every run mentioned here live in BENCHMARKS.md in the repo root.
First spotted in the v3.17.0 benchmark: an unexplained tail (p95 980ms, p99 1400ms) with no clear cause, flagged honestly as "could be a connection hiccup or sampling noise" rather than guessed at. It reproduced in v3.44.0 and again in v3.50.2 — three releases showing the same shape was enough to treat it as a real, deterministic cost rather than noise, even without a confirmed mechanism yet.
v3.50.2's hypothesis: CACHE_TTL_UPTIME_SECONDS (60s) is the only source TTL shorter than a typical 120-second benchmark window, so a cache miss genuinely happens during every run. That correctly explained when the miss occurred but not why it cost what it cost — a real, slow-but-successful Socket.IO round-trip was assumed, but never confirmed by reading the actual connection code.
v3.50.4 found the first real cause: app/sources/uptime_kuma.py opened a brand-new Socket.IO connection, logged in, and disconnected on every single call — including every 2-minute snapshot_uptime() scheduler tick, independent of whether anyone was asking a live question. Fixed by making the connection persistent, reused across calls and warmed once at app startup. One thing the design doc for this fix got wrong and caught before shipping: it assumed UptimeKumaApi had a .connected property to check connection liveness. Reading the installed library's actual source showed no such property exists — the real signal is the underlying sio (python-socketio) client's own .connected attribute.
The fix measurably helped — warm p95 dropped from 1500ms to 470ms — but left a smaller tail that reproduced identically across two further releases (v3.50.6, v3.50.7): the exact same ~440ms value, every single time. That precision was itself a clue: a cost that's noisy looks different release to release. A cost that's the same number every time is deterministic, and a deterministic, unexplained number is something with a findable cause.
v3.50.8 found it, by directly reproducing the library's behavior in isolation rather than continuing to guess from benchmark timing: uptime_kuma_api's own _get_event_data() pays an unconditional 0.2-second wait_events sleep on every single call — even when the data it's waiting for was already complete. Two such calls per search() (get_monitors() + get_heartbeats()) is exactly the ~0.4-second floor that had shown up identically three releases running. This is a real, deliberate design choice in the upstream library, not a bug in it — Uptime Kuma's server sends one heartbeatList push per monitor after login, and wait_events exists to let trailing per-monitor pushes land before the client treats the initial batch as complete. But that risk only applies to the first call after a fresh connect or reconnect; every later call on an already-settled persistent connection has nothing left to wait for, confirmed directly by reading the steady-state push handler (one complete record per call, no multi-message batching).
The fix shrinks wait_events only after a connection's first call genuinely settles, keeping the library's full, safe default for the one call that still needs it. v3.50.9 confirmed it with a real re-benchmark: warm p98/p99 dropped to 69ms — finally in the same order of magnitude as the cleanest sources (kiwix 44ms, forecast 45ms), not a separate tail. Five releases, three real findings, one closed loop.
Thread 2: the auto/conditional thundering herd, including a real mistake caught by the next benchmark
First appeared as conditional's warm-cache tail in v3.17.0, traced to a real, identifiable cause once the project started looking for one: _resolve_conditional() recurses into the full routing pipeline using the extracted condition text as the cache key, not the original "if X, Y" phrasing — and that extracted text never appeared as a standalone query anywhere else in the load test's small, fixed query pool. With 20 concurrent Locust users and only a handful of distinct conditions, multiple users could pick the same never-yet-cached condition before the first one to resolve it had written the cache entry, so several paid the full LLM routing cost concurrently even on a nominally "warm" run. auto's tail shared the same root cause for the same reason. This reproduced identically through v3.44.0 and v3.50.2 — auto's cold p99 hit a full 10 seconds in the v3.50.2 run, the single worst sample the project had seen.
v3.50.3 widened the pools (AUTO_QUERIES 6→12, CONDITIONAL_QUERIES 4→8, CONDITIONAL_WITH_REMAINDER_QUERIES 2→4) on the straightforward intuition that more options dilutes collision odds. Re-benchmarks in v3.50.5 and v3.50.7 confirmed it helped but not enough — auto's cold p99 dropped from the 10-second spike but stayed multi-second-adjacent, and conditional/conditional_remainder kept real warm-cache tails that a fully-warmed pool shouldn't have shown at all.
v3.50.8 tried to model the problem properly instead of guessing again — and made a real, documented mistake in the process. The model used ("expected number of pool entries hit by 2 or more of 20 users") correctly identified that the relationship between pool size and collision count isn't simple, but it was the wrong metric: it doesn't predict what the benchmark's tail actually measures. The model's own output for conditional_with_remainder's specific change (4→12 entries) showed the absolute collision count getting worse (3.90 → 6.07), not better — and that warning was sitting in the model's own numbers, unacted on. The widening shipped anyway (AUTO_QUERIES 24, CONDITIONAL_QUERIES 20, CONDITIONAL_WITH_REMAINDER_QUERIES 12).
The very next re-benchmark (v3.50.9) caught it: conditional's cold p99 hit 9800ms — the worst single sample that endpoint had ever produced — and conditional_remainder's cold p98/p99 nearly tripled (1300ms → 4200ms) versus the pre-widening baseline. Both got worse, not better. Taking that result seriously rather than writing it off as noise led to re-deriving the math properly: the metric that actually predicts the tail is "fraction of the 20 users whose first pick collides with someone else's," which declines monotonically with pool size (no peak to dodge) but slowly enough that v3.50.8's sizes still left 62-81% of users colliding with someone.
A second, independent factor made it worse than the collision rate alone explains: AUTO_QUERIES's widening had worked well at a similar nominal collision rate (55.5% at 24 entries) because most of its collisions land on a cheap, structured source — only 2 of 24 entries fall through to kiwix's expensive LLM book-selection path, confirmed by running every entry through detect_intent() directly. CONDITIONAL_QUERIES's 20-entry pool was the opposite: 17 of 20 conditions (85%) fell through to kiwix. A collision there costs far more per occurrence than a similarly-frequent collision on auto's pool — a lower collision rate alone was never going to be enough, and the v3.50.8 widening's new entries (mostly "if X is in retrograde"-style conditions, checked only for correct conditional detection, never checked against which source they'd actually route to) made the mix slightly worse, not better.
v3.50.9's correction: widened further (CONDITIONAL_QUERIES 40, CONDITIONAL_WITH_REMAINDER_QUERIES 30) and fixed the source mix — the new entries were specifically written to hit ha/uptime/forecast/changes keywords rather than falling through to kiwix, verified directly against detect_intent() before being added. This brought the pool's kiwix-fallback ratio from 85% down to 42%.
The next re-benchmark confirmed real, genuine improvement — but not a fix. conditional's cold p98/p99 dropped from 9800ms to 5300ms (46% reduction); conditional_remainder's dropped from 4200ms to 1500ms (64% reduction) — the corrected model's predictions held up directionally, this time. But neither cleared the bar: both still show real multi-second cold tails, and the warm-cache numbers barely moved (conditional 1300ms → 1400ms, essentially flat; conditional_remainder 1800ms → 1400ms, a modest improvement) — nowhere near the "low tens of milliseconds" every clean source shows. This matches what the model itself predicted in advance: getting the collision rate down from ~70-80% to ~40% meaningfully helps without eliminating the thundering herd, and full elimination would need pool sizes (150+) that aren't realistic to maintain. Recorded as the genuine, partial result it is — real progress, not a closed thread.
Two further re-benchmarks on the same codebase tell a more nuanced story than a single clean confirmation. conditional's cold p99 across the three v3.50.11-era runs was 5300ms, then 2500ms, then 2500ms again — real, substantial improvement from the original 9800ms in every case, but genuinely noisy run to run, not settled on one number. conditional_remainder showed the same pattern (1500ms, 1800ms, 2900ms) — always well below the pre-correction 4200ms, never fully quiet. auto told the more interesting story: 990ms, 2300ms, 2400ms — three of the four total runs at this pool size (including the original v3.50.9 run) cluster tightly at 2300-2700ms, with one clean, favorable miss. That asymmetry — mostly clustered, occasionally clean — turned out to have its own, different root cause from conditional's fix: a fixed 2-of-24 count of genuinely LLM-dependent AUTO_QUERIES entries that pool-widening structurally cannot dilute away, confirmed by reading the real keyword-matching code directly rather than inferred from the numbers alone.
This time the fix sits at the application layer rather than the load-test layer. Per-key in-flight deduplication ("singleflight") now wraps every routing-cache call site that pays an LLM cost on a miss — _llm_detect() (the auto path this thread is about), _llm_pick_fusion_sources(), and Kiwix's own _pick_books_with_llm()/_get_disambiguation_candidates() — so N concurrent callers for one never-yet-cached key now cost roughly 1x the LLM call instead of Nx, the actual collision mechanism this thread traced rather than another pass at diluting the odds of it occurring. See Caching's own coverage of the mechanism (the result cache's identical, structurally-equivalent gap — see Thread 3 and TestResultCacheThunderingHerd — is deliberately left alone). Confirmed directly in isolation that the mechanism itself works exactly as designed: 8 concurrent callers for the identical uncached key collapse to exactly 1 real LLM call, not 8.
Three real re-benchmarks against the shipped fix told a genuinely surprising story: auto's cold p99 didn't move at all. 2500ms, 2300ms, 2300ms — landing in almost the identical 2300-2700ms band the pre-fix runs already showed (990/2300/2400ms), with the fix's own three runs actually clustering slightly worse than the pre-fix best case. Singleflight wasn't broken — the isolated test above confirms the mechanism does exactly what it's supposed to — but the diagnosis this thread had built (Part 1's "consistent with... a genuine, repeated same-query collision" framing) turned out to be incomplete. Investigated rather than written off as noise or quietly re-attributed without checking: kiwix/kiwix_disambiguation/cache_hit all dropped substantially run-over-run across the same three benchmarks while auto stayed flat — an asymmetry that ruled out "this was all just one noisy session of shared backend contention" as the explanation, since a shared-contention story predicts everything moving together, not one endpoint standing still while its neighbors visibly improve.
The actual missing piece, found by reading the code directly rather than continuing to model auto's collision rate: app/llm.py had zero connection reuse. Every single call into complete() — book selection, source routing, fusion-source selection, disambiguation candidates, the entire LLM-dependent surface this whole investigation has been chasing — used the bare requests.post() module function, never a requests.Session, meaning every call opened a fresh TCP connection to the LLM backend and tore it down immediately after. The identical class of bug Thread 1 already found and fixed for Uptime Kuma's own connection, just never checked for here. Confirmed directly against a real local server, not inferred: 10 sequential calls through the old pattern opened 10 distinct connections; the same 10 calls through a persistent session opened exactly 1. Singleflight removes redundant concurrent work (N callers racing the same key collapse to 1 real cost) — it was never going to touch a cost every single call pays individually, which a fresh connection per call structurally is. See Caching for the fix itself.
Three confirmed re-benchmarks against the connection-pooling fix told the identical story singleflight's own re-benchmark did: auto's cold p99 still didn't move. 3400ms, 2300ms, 2200ms — squarely inside the same overall scatter every prior pool-size-era run has shown, arguably clustering toward the worse end rather than the better one (one further run, of uncertain build provenance — possibly still the v3.50.13 image, never confirmed — showed 2400ms and was deliberately excluded from this comparison rather than counted as a fourth confirmed data point). The connection-pooling fix is real and independently verified (genuine reuse, genuine thread-safety, genuine pool sizing — none of that is in question), but two real, independently-verified fixes in a row failing to move the same number was the signal that the actual mechanism was still something else.
Re-reading the same numbers differently, rather than running a fourth pass and hoping, surfaced something that should have been obvious earlier: at this benchmark's real sample sizes, p99 isn't a stable statistic. With auto drawing roughly 60-80 total picks per run and only 2 of 24 pool entries being genuinely LLM-dependent, p99 sits at rank 1-2 from the top of a sample that small — it's effectively reporting "the single slowest request in this run," not a repeatable property of the routing code. auto's own p90 — a steadier signal, reflecting roughly 7-9 real samples rather than 1 — has stayed remarkably consistent across every release this project has ever benchmarked: 710ms (v3.50.9), 720/740/720ms (the three v3.50.11 runs), 710/710/91ms (v3.50.13, the last being the one genuine statistical outlier in the whole series), 720/720/700ms (the three confirmed v3.50.14 runs). That's not a plateau caused by anything specific to fix — it's plausibly just the genuine, ordinary cost of one real, unqueued LLM call on this hardware, with p99's run-to-run volatility coming from whichever single request happens to be unluckiest in a given 60-80-request sample, not from a property either singleflight or connection pooling was ever going to change.
One real, concrete candidate for that single-worst-sample volatility, found the same "read the actual client code" way as both prior fixes: app/llm.py never sent Ollama's keep_alive field at all, relying entirely on the server's own ambient 5-minute default with zero application-level control. This project's own deployment notes (Thread 6, below) that the same qwen3:8b instance is shared with an unrelated agentic-coding workflow on the same machine — a real, plausible way for the model to be evicted from VRAM by something entirely outside Mnemolis's own request pattern, independent of anything this codebase does between calls. LLM_KEEP_ALIVE (see Caching) now lets every Ollama-native call assert its own explicit value rather than depending on whatever the server's ambient state happens to be. Recorded with the same honesty as the connection-pooling fix before it: this is a real, low-risk, idiomatic fix worth having regardless of outcome, but there is no direct evidence the other workflow was actually active during any of the specific runs that showed the plateau — it closes one plausible contributor, not a confirmed explanation. Whether it moves auto's p99 at all, or whether p99 simply continues being noisy regardless (which the analysis above suggests may be the more likely outcome either way, since a single-instance VRAM eviction wouldn't explain a plateau so much as an occasional separate spike), is for the next re-benchmark to show.
The lesson, restated for this second round: a worked-out model is better than a guess, but only if the model checks every mechanism that could explain the magnitude, not just the one the investigation happened to be looking for already. Part 1's own "queueing/collision, both real and additive" framing correctly named two mechanisms — but a third, connection-lifecycle cost, sat unexamined in the same code path the whole time, findable by the same "read the actual client code" discipline Thread 1 already used once. A clean re-benchmark showing zero movement on the targeted metric, while adjacent metrics genuinely improved, was the signal that something real was still missing — not evidence the fix had failed, and not noise to wave off either.
The third round's own conclusion — that the plateau was likely just irreducible small-sample noise at the tail, not a fixable property — turned out to be wrong, and a fourth round found the real cause. A single, genuinely cold request to a query that escalates to fusion was timed directly, repeatedly: 1.75s on one run, 11-13s on others, for the identical query with no code change in between. That inconsistency was the tell that something external and intermittent, not statistical noise, was the actual mechanism — small-sample noise would look the same shape every time, not swing 7x. Checking SearXNG's own container logs directly (not Mnemolis's, not the benchmark output) found it: duckduckgo's own per-engine timeout: override had stayed at SearXNG's old factory value (10.0s) the entire time, completely unaffected by the global outgoing.request_timeout/max_request_timeout fix already documented in The SearXNG Timeout Lesson — per-engine overrides don't inherit from global settings, they replace them outright for that one engine. DuckDuckGo was also, independently, hitting its own CAPTCHA defense and triggering a real Brave rate-limit suspension, both confirmed directly in the same logs. None of this was ever reachable from auto's own LLM-routing code, which is exactly why singleflight and connection pooling — both real, both correctly diagnosed and fixed for what they targeted — could never have closed it: the actual bottleneck lived in web/fusion's fan-out to SearXNG, not in _llm_detect()'s 2-entry subset at all. Full mechanism and the fix: The SearXNG Timeout Lesson's own third recurrence, and searxng/settings.yml's own comments.
The lesson from this fourth round, on top of the others: an explanation that fits the available evidence ("this looks like small-sample noise, p90 is stable, p99 just reflects the worst single draw") can still be wrong if the available evidence was incomplete — single, repeated, direct timing of the identical query, outside the benchmark harness entirely, surfaced a real swing the percentile analysis alone couldn't distinguish from noise. When a plausible statistical explanation and a live system both exist, testing the live system directly — the same query, run by hand, more than once — is what actually settles it, not further analysis of numbers already in hand.
Surfaced in the v3.50.4/v3.50.5 benchmark runs: cache_hit's cold p90/p98/p99 spiked to 5100ms/8000ms/8000ms — a real anomaly with no precedent anywhere in the project's benchmark history, where cache_hit had always been one of the cheapest, most boring rows in the table. Flagged honestly as an unexplained open item rather than guessed at.
The root cause, found by reading tests/locustfile.py directly rather than the timing numbers alone: cache_hit's task used the literal query "what is nitrogen" — which was also the first entry in KIWIX_QUERIES, the pool the separate, much more frequent kiwix_search task (the highest-weighted task in its user class) drew from at random. On a cold run, both tasks could independently draw the identical, not-yet-cached key at nearly the same instant. _resolve_single_source()'s check-cache-then-call-then-write sequence has no per-key lock, so both genuinely missed and both paid the full cold-routing cost concurrently — the exact same thundering-herd shape as Thread 2, just on a task that was never supposed to be exposed to it, since cache_hit's whole purpose is to never miss at all once warm.
Fixed (v3.50.6) by giving cache_hit its own dedicated query, confirmed via a direct AST-based parse of every list literal in locustfile.py to never collide with any other pool. v3.50.7's re-benchmark confirmed it actually closed the gap: cold p90/p98/p99 dropped to 880ms/940ms/940ms — the same shape every other genuinely-cold single-source row shows — and the warm run landed cache_hit at 29ms p99, essentially identical to kiwix's own 34ms.
This specific bug — the query-pool collision — was genuinely, fully fixed. But cache_hit's cold-run cost resurfaced in two later benchmarks (v3.50.9, v3.50.11-era) for an entirely different, unrelated reason: see Thread 6 below. The two are easy to conflate since they produce a similar-looking symptom (one slow cold cache_hit request) but have nothing to do with each other.
A fresh finding from the v3.50.2 run, not previously flagged anywhere: a warm-cache /health sample hit 5244ms, several times worse than its own 750ms median. _check_kiwix(), _check_forecast(), _check_news(), _check_web(), _check_uptime(), _check_ha(), and _check_llm() were plain sequential calls in app/main.py's health() endpoint, each a real network request with its own 3-5 second timeout — one or two genuinely slow checks could stack additively into a multi-second worst case.
Fixed (v3.50.3) with a ThreadPoolExecutor, the same concurrent-dispatch pattern fusion.py had already established for source queries. Verified correct in the test suite immediately (TestHealthConcurrentSourceChecks), but real-hardware confirmation waited one release — v3.50.4's re-benchmark confirmed it held under actual load: warm max dropped to 1152ms, with no recurrence of the old sequential-stacking signature.
Thread 5: the RemoteDisconnected — flagged, recurred once, a real candidate fix shipped, still not provably closed
The v3.50.9 warm run produced one real failure: POST /search [fusion_triple]: RemoteDisconnected('Remote end closed connection without response') — the server closed the TCP connection without sending any HTTP response at all, not a timeout or an error response. fusion_triple queries uptime alongside forecast/news, so it touches code changed that same release, but a direct read of both app/sources/uptime_kuma.py and app/sources/fusion.py's concurrent-dispatch error handling (every per-source exception is caught and converted to a None result, never re-raised past the dispatch loop) found nothing that explains a dropped connection at the HTTP layer.
Not dismissed as noise, not attributed to the recent changes either — both would be guessing past the actual evidence available from the Locust output alone. Left as an open item pending real server-side logs from around the time of that run.
The very next re-benchmark (v3.50.11-era) showed zero exceptions, zero failures on both cold and warm passes. One data point either way isn't conclusive on its own — a single non-recurrence doesn't rule out a real, intermittent cause — but it's at least consistent with a one-off network or Docker-level blip rather than something the v3.50.9 code changes introduced. Genuinely still unexplained; recorded here as "didn't recur" rather than "resolved."
It recurred once more, on the exact same endpoint, in a benchmark run against the v3.50.18 fusion.py fixes — POST /search [fusion_triple]: RemoteDisconnected('Remote end closed connection without response'), one occurrence, warm run. That release's own investigation (see The Fusion Merge Bugs) had already independently arrived at this exact failure as a real, plausible motivation for replacing fusion.py's per-call ThreadPoolExecutor with a shared, bounded one — found from the codebase side, before this benchmark run's recurrence was even known, then confirmed consistent with it. The very next benchmark run after that fix shipped (v3.50.20-era) showed zero exceptions, zero failures on both passes again — the same "didn't recur" caveat applies (one clean run doesn't prove the fix closed it, especially against an already-rare, intermittent failure), but it's now two non-recurrences against one recurrence across the fix, not one against one. Still not claimed as resolved — there's no way to prove an intermittent OS/socket-level failure won't happen again — but the thread-pool fix at least removes the leading, well-corroborated theory's own root cause, which is the most this kind of bug can honestly offer without direct dmesg/ulimit access from the moment of failure.
Thread 6: cache_hit resurfaces with a different, unfixable-on-Mnemolis's-side cause — and a real, separate bug found along the way
Thread 3 closed cache_hit's 8-second cold-run anomaly in two releases (v3.50.6 fix, v3.50.7 confirmation). Two benchmarks later (v3.50.9, v3.50.11-era), cache_hit's single cold request paid a real cost again — 3800ms, then 3601ms — a genuinely different question from Thread 3's already-fixed query-collision bug, since cache_hit's dedicated query was reconfirmed clean of any pool collision both times.
Investigated thoroughly rather than waved off as "probably the same old thing": disambiguation eligibility ruled out directly (the query's search terms reduce to 3 words; disambiguation requires exactly 1); a second, undetected routing-cache collision ruled out (the cache key is the full literal query, reconfirmed unique across every pool); routing-cache disk-write cost growing across a run ruled out by direct measurement (writing a realistic 200-entry cache takes under 1ms, nowhere near the observed cost).
The real explanation: ordinary Ollama request queueing, shared by every cold LLM call across the whole benchmark, not specific to cache_hit at all. Ollama defaults to OLLAMA_NUM_PARALLEL=1 on memory-constrained setups, queuing every other concurrent request FIFO; Mnemolis's own LLM client has zero concurrency control on its side. Comparing which single endpoint shows the worst sample between the two runs found a different one each time — conditional in one run, kiwix_disambiguation in the other, with fusion_auto/web/cache_hit all shuffling rank in between — the signature of every cold call sharing one queue, not an endpoint-specific defect. cache_hit itself landed at nearly the same magnitude both times (3800ms, 3601ms), consistent with "draws from the same shared lottery."
Deliberately not pursued as a fix. Raising OLLAMA_NUM_PARALLEL would genuinely help, and the actual VRAM math (worked out directly against the real deployment: qwen3:8b already at 10GB with a 32K context window, on a 24GB card) shows real headroom for num_parallel=2. But that VRAM is something the person actively wants free for other use (gaming) — spending it on a benefit that's purely benchmark cosmetics, since this contention doesn't reflect real single-household usage, isn't worth the tradeoff. Recorded as understood and deliberately left as-is.
A real, separate, previously-undiscovered bug was found while checking _looks_empty() as one of the candidate mechanisms above — ruled out for explaining cache_hit specifically, but real and worth fixing on its own: the function matched phrases like "not configured", "could not connect", "could not determine", and "error:" anywhere in a result, with no other constraint. Several of these are ordinary English, not unique sentinel strings. _diff_news() and freshrss.py's own article formatting both echo raw, real upstream headlines verbatim — a real headline like "Tech Company Could Not Determine Cause of Outage" could trigger this by sheer coincidence, and since news is a real, live entry in FALLBACK_CHAIN, a genuinely correct news answer could be silently discarded in production and replaced with a worse web-search fallback, for no reason but an unlucky word in a real headline.
Two fixes were tried and rejected before the real one: a length cap couldn't distinguish a short real headline from a short real error message without breaking real coverage either way; a prefix check broke 5 of the project's own genuine "X is not configured" messages, since the source name always comes first. The actual fix: every genuine empty/error message is plain, unformatted prose; every real article/multi-source result wraps titles in markdown bold. A bare ** anywhere in the result is now checked first as a structural gate, before any phrase matching happens at all — confirmed this distinction holds for every real message (none use markdown) and every constructed false-positive case (all of them do, since the offending text only ever appears wrapped in the same formatting every other real article uses).
A possible, unconfirmed connection to The Adversarial Testing Production Bugs's own unresolved investigation: that page documents a single historical unexpected_empty flag that was traced six ways and never explained, including confirming what an empty changes response looks like — but never testing whether the real response at that moment was genuinely non-empty with a coincidentally-matching headline, since that mechanism didn't exist as a known possibility until this investigation found it. Can't be confirmed retroactively (the original result text was never recoverable), and deliberately left unresolved rather than retroactively claimed — if it recurs, this fix should at least rule out a coincidental-headline cause.
Thread 7: conditional_remainder warm-worse-than-cold — the same Ollama constraint, but deterministic this time
A user-supplied cold/warm Locust pair showed conditional_remainder's warm-run p98/p99/max (4400/4400/4410ms) worse than its own cold-run numbers (1200/1200/1200ms) — backwards from every other row in either table, and backwards from this project's entire benchmark history. Investigated with the raw stats_history.csv files rather than the aggregate percentile tables alone, which had initially, incorrectly, pointed at a different endpoint entirely (fusion_auto — two plausible-looking synthetic reconstructions both happened to land near the observed number, neither one correct; see the full retraction in this investigation's own design doc, "_resolve_conditional()'s Concurrent Dispatch Can Self-Inflict Ollama Queueing Worse Than the Sequential Code It Replaced," Part 1). The per-second time series let the 4410ms cumulative-max jump be matched, by exact value, against conditional_remainder's own row instead.
The mechanism is Thread 6's OLLAMA_NUM_PARALLEL=1 constraint again — but a deterministic, single-request version of it, not the diffuse cross-request "shared lottery" Thread 6 documented. _resolve_conditional() already dispatches its condition and remainder concurrently (ThreadPoolExecutor(max_workers=2), with the contextvars.copy_context() fix already in place from the same investigation that produced fusion.py's own seven fixes), specifically so the two calls cost max(a, b) instead of a + b. Checked directly, entry by entry, against CONDITIONAL_WITH_REMAINDER_QUERIES's real 30 entries: 5 of them (17%) had both the condition and the remainder independently resolving to kiwix's real, LLM-bound book-selection path — meaning _resolve_conditional()'s own concurrent dispatch fired two genuine LLM calls at the same instant, from the same single incoming request, every single time one of those 5 entries was drawn. Against a backend that queues every request FIFO by default with no per-model parallelism unless explicitly raised, that's not two unrelated users colliding by chance — it's a guaranteed property of the query shape, degrading the dispatch's own intended max(a,b) benefit back toward a+b for exactly those 5 entries. Magnitude check: kiwix/kiwix_disambiguation's own worst real sample anywhere else in the same benchmark run was 2600ms; conditional_remainder's 4410ms anomaly is squarely in the range a 2600ms call plus a second, independently-costed kiwix call queued behind it would produce, and not explainable as either side's own single worst case alone.
The warm-worse-than-cold appearance is small-sample variance, confirmed by direct simulation, not a real warm-cache regression. A kiwix-bound result, once resolved, is cached for 24 hours — far longer than one benchmark session — so naively, the cold run should absorb this cost and the warm run (run immediately after, no cache clear, per this file's own documented procedure) should be clean. But with only ~13 cold-run picks against a 30-entry pool (random.choice(), not a deterministic sweep), there's a real, simulated ≈30.6% chance per entry that its first-ever occurrence across the combined cold+warm session lands in the warm run instead of the cold one — meaning the expected count of "first occurrence happens to land warm" across the 5 double-hit entries is ≈1.5. One of them simply didn't get drawn during the smaller cold sample and paid its one-time real cost during the warm run instead, which looks paradoxical in the aggregate table but is fully explained by the same small-sample tail instability Thread 2 already worked out for auto's own p99 — here inverting cold vs. warm instead of just adding noise within one run.
Three real, independent keyword-list gaps were found and fixed along the way (v3.50.19), narrowing but not eliminating the double-hit set. 3 of the 5 double-hit remainders only fell through to kiwix because of confirmed INTENT_MAP gaps in the same class this project has already found and fixed once before (the "is it up"/"are they up" stop-word-only gap; the GPIO/"on" word-boundary issue): "is it raining"/"is it going to rain" had no forecast trigger at all (only "will it rain" existed); "is everything online" had no uptime trigger (only "is everything up" existed); "doors are locked" (with "are" inserted) matched none of ha's three door-lock triggers, all of which require word-adjacency. Fixing these brought the pool's double-kiwix-hit count from 5/30 down to 2/30 — at the time, the remaining 2 ("the garage door is open"/"any cameras go offline", both paired with a "whats happening with bitcoin" remainder) were recorded as intentional, correct kiwix routing on both sides.
That "intentional on both sides" framing was wrong, caught on direct pushback, and corrected the same release cycle (v3.50.20). Only the bitcoin remainder half of each pair was ever a genuine, considered design choice (an open-ended topic with no structured-source home — correctly kiwix's job). The condition half of each pair was never actually checked with the same rigor. Checked directly: "the garage door is open" turned out to be a real, two-file gap, not intentional routing at all — router.py's INTENT_MAP["ha"] had no trigger for open/closed garage-door phrasing (a genuinely different question from the existing locked/unlocked door triggers), and even with routing fixed, home_assistant.py's own _QUERY_MAP had no entry that could find a real garage-door entity (neither a cover-domain entity with device_class: "garage", nor the alternate binary_sensor-with-"garage_door" shape a live, filed Home Assistant core issue confirms is a real, separate naming convention for the same physical thing). Both fixed, in both files, with tests confirming neither direction bleeds into the existing, pinned lock-domain behavior. "any cameras go offline", by contrast, held up under the same scrutiny as a genuine, unfixable-without-new-backend-work gap — home_assistant.py's existing "camera" entry answers motion-detection questions, not connectivity/reachability, and no other source in this system covers that question either. Pool double-hit count: 5/30 → 2/30 → 1/30, with the final 1 being the real, confirmed-irreducible case rather than an assumption that stopped checking one step too early.
No fix to _resolve_conditional()'s dispatch or to OLLAMA_NUM_PARALLEL itself. Making the dispatch sequential again would regress the 83% of the pool that genuinely benefits from max(a,b) today; a per-call-site admission lock would treat this exact deployment-wide constraint as a local problem, needing its own new global-scope primitive to do correctly, for a question (is closing this exposure worth a new mechanism, given Thread 6 already declined the simpler, broader fix) this investigation doesn't have standing to answer on its own. Thread 6's own decision not to raise OLLAMA_NUM_PARALLEL stands unchanged — this thread adds one more concrete data point (a deterministic single-request case, not just Thread 6's diffuse cross-request one) to weigh if that decision is ever revisited, not a reason to revisit it now.
The lesson, on top of Thread 6's own: a fix that correctly makes two independent calls run concurrently can still cost more than the sequential code it replaced, for the specific subset of inputs where both calls share a serialized downstream dependency neither call's own code controls — confirmed correct concurrency at the application layer (the existing TestConditionalRemainderConcurrency test suite already proved this, and still does) is necessary but not sufficient evidence that the concurrency actually delivers its intended benefit once a real, shared, externally-constrained backend is involved instead of two independent mocks.
This is the project's own benchmark methodology under examination as much as the application code is. Two of these six threads took multiple releases and, in one case, a real wrong turn, to actually close. One was resolved by investigating thoroughly enough to confirm a real, external cause (Ollama's own request queueing) and then deliberately choosing not to chase it further, since the actual fix would trade real, wanted resources for a benefit that's purely benchmark cosmetics. That's recorded here on purpose too — "investigated, understood, and consciously left alone" is a different, equally honest outcome from "fixed," and conflating the two would make this record less trustworthy, not more complete.