-
Notifications
You must be signed in to change notification settings - Fork 0
The Benchmark Investigation Log
Six separate threads, three of them spanning multiple releases before reaching a real conclusion, one of them involving a real mistake that a later benchmark caught, one resolved as "understood and deliberately not fixed" rather than patched. This page is the chronological story — what each benchmark run actually found, what got tried, what worked, and what didn't. For the current, present-tense numbers and how to run your own benchmark, see Benchmarks instead; this page is the "how we got here," not the reference.
The raw, dated tables for every run mentioned here live in BENCHMARKS.md in the repo root.
First spotted in the v3.17.0 benchmark: an unexplained tail (p95 980ms, p99 1400ms) with no clear cause, flagged honestly as "could be a connection hiccup or sampling noise" rather than guessed at. It reproduced in v3.44.0 and again in v3.50.2 — three releases showing the same shape was enough to treat it as a real, deterministic cost rather than noise, even without a confirmed mechanism yet.
v3.50.2's hypothesis: CACHE_TTL_UPTIME_SECONDS (60s) is the only source TTL shorter than a typical 120-second benchmark window, so a cache miss genuinely happens during every run. That correctly explained when the miss occurred but not why it cost what it cost — a real, slow-but-successful Socket.IO round-trip was assumed, but never confirmed by reading the actual connection code.
v3.50.4 found the first real cause: app/sources/uptime_kuma.py opened a brand-new Socket.IO connection, logged in, and disconnected on every single call — including every 2-minute snapshot_uptime() scheduler tick, independent of whether anyone was asking a live question. Fixed by making the connection persistent, reused across calls and warmed once at app startup. One thing the design doc for this fix got wrong and caught before shipping: it assumed UptimeKumaApi had a .connected property to check connection liveness. Reading the installed library's actual source showed no such property exists — the real signal is the underlying sio (python-socketio) client's own .connected attribute.
The fix measurably helped — warm p95 dropped from 1500ms to 470ms — but left a smaller tail that reproduced identically across two further releases (v3.50.6, v3.50.7): the exact same ~440ms value, every single time. That precision was itself a clue: a cost that's noisy looks different release to release. A cost that's the same number every time is deterministic, and a deterministic, unexplained number is something with a findable cause.
v3.50.8 found it, by directly reproducing the library's behavior in isolation rather than continuing to guess from benchmark timing: uptime_kuma_api's own _get_event_data() pays an unconditional 0.2-second wait_events sleep on every single call — even when the data it's waiting for was already complete. Two such calls per search() (get_monitors() + get_heartbeats()) is exactly the ~0.4-second floor that had shown up identically three releases running. This is a real, deliberate design choice in the upstream library, not a bug in it — Uptime Kuma's server sends one heartbeatList push per monitor after login, and wait_events exists to let trailing per-monitor pushes land before the client treats the initial batch as complete. But that risk only applies to the first call after a fresh connect or reconnect; every later call on an already-settled persistent connection has nothing left to wait for, confirmed directly by reading the steady-state push handler (one complete record per call, no multi-message batching).
The fix shrinks wait_events only after a connection's first call genuinely settles, keeping the library's full, safe default for the one call that still needs it. v3.50.9 confirmed it with a real re-benchmark: warm p98/p99 dropped to 69ms — finally in the same order of magnitude as the cleanest sources (kiwix 44ms, forecast 45ms), not a separate tail. Five releases, three real findings, one closed loop.
Thread 2: the auto/conditional thundering herd, including a real mistake caught by the next benchmark
First appeared as conditional's warm-cache tail in v3.17.0, traced to a real, identifiable cause once the project started looking for one: _resolve_conditional() recurses into the full routing pipeline using the extracted condition text as the cache key, not the original "if X, Y" phrasing — and that extracted text never appeared as a standalone query anywhere else in the load test's small, fixed query pool. With 20 concurrent Locust users and only a handful of distinct conditions, multiple users could pick the same never-yet-cached condition before the first one to resolve it had written the cache entry, so several paid the full LLM routing cost concurrently even on a nominally "warm" run. auto's tail shared the same root cause for the same reason. This reproduced identically through v3.44.0 and v3.50.2 — auto's cold p99 hit a full 10 seconds in the v3.50.2 run, the single worst sample the project had seen.
v3.50.3 widened the pools (AUTO_QUERIES 6→12, CONDITIONAL_QUERIES 4→8, CONDITIONAL_WITH_REMAINDER_QUERIES 2→4) on the straightforward intuition that more options dilutes collision odds. Re-benchmarks in v3.50.5 and v3.50.7 confirmed it helped but not enough — auto's cold p99 dropped from the 10-second spike but stayed multi-second-adjacent, and conditional/conditional_remainder kept real warm-cache tails that a fully-warmed pool shouldn't have shown at all.
v3.50.8 tried to model the problem properly instead of guessing again — and made a real, documented mistake in the process. The model used ("expected number of pool entries hit by 2 or more of 20 users") correctly identified that the relationship between pool size and collision count isn't simple, but it was the wrong metric: it doesn't predict what the benchmark's tail actually measures. The model's own output for conditional_with_remainder's specific change (4→12 entries) showed the absolute collision count getting worse (3.90 → 6.07), not better — and that warning was sitting in the model's own numbers, unacted on. The widening shipped anyway (AUTO_QUERIES 24, CONDITIONAL_QUERIES 20, CONDITIONAL_WITH_REMAINDER_QUERIES 12).
The very next re-benchmark (v3.50.9) caught it: conditional's cold p99 hit 9800ms — the worst single sample that endpoint had ever produced — and conditional_remainder's cold p98/p99 nearly tripled (1300ms → 4200ms) versus the pre-widening baseline. Both got worse, not better. Taking that result seriously rather than writing it off as noise led to re-deriving the math properly: the metric that actually predicts the tail is "fraction of the 20 users whose first pick collides with someone else's," which declines monotonically with pool size (no peak to dodge) but slowly enough that v3.50.8's sizes still left 62-81% of users colliding with someone.
A second, independent factor made it worse than the collision rate alone explains: AUTO_QUERIES's widening had worked well at a similar nominal collision rate (55.5% at 24 entries) because most of its collisions land on a cheap, structured source — only 2 of 24 entries fall through to kiwix's expensive LLM book-selection path, confirmed by running every entry through detect_intent() directly. CONDITIONAL_QUERIES's 20-entry pool was the opposite: 17 of 20 conditions (85%) fell through to kiwix. A collision there costs far more per occurrence than a similarly-frequent collision on auto's pool — a lower collision rate alone was never going to be enough, and the v3.50.8 widening's new entries (mostly "if X is in retrograde"-style conditions, checked only for correct conditional detection, never checked against which source they'd actually route to) made the mix slightly worse, not better.
v3.50.9's correction: widened further (CONDITIONAL_QUERIES 40, CONDITIONAL_WITH_REMAINDER_QUERIES 30) and fixed the source mix — the new entries were specifically written to hit ha/uptime/forecast/changes keywords rather than falling through to kiwix, verified directly against detect_intent() before being added. This brought the pool's kiwix-fallback ratio from 85% down to 42%.
The next re-benchmark confirmed real, genuine improvement — but not a fix. conditional's cold p98/p99 dropped from 9800ms to 5300ms (46% reduction); conditional_remainder's dropped from 4200ms to 1500ms (64% reduction) — the corrected model's predictions held up directionally, this time. But neither cleared the bar: both still show real multi-second cold tails, and the warm-cache numbers barely moved (conditional 1300ms → 1400ms, essentially flat; conditional_remainder 1800ms → 1400ms, a modest improvement) — nowhere near the "low tens of milliseconds" every clean source shows. This matches what the model itself predicted in advance: getting the collision rate down from ~70-80% to ~40% meaningfully helps without eliminating the thundering herd, and full elimination would need pool sizes (150+) that aren't realistic to maintain. Recorded as the genuine, partial result it is — real progress, not a closed thread.
The lesson: a worked-out model is better than a guess, but only if the model is checking the metric that actually matters, and only if its own output gets read carefully enough to notice when it's predicting a regression. Both gaps existed in the same sizing pass, and a real re-benchmark — not more modeling — is what caught both.
Surfaced in the v3.50.4/v3.50.5 benchmark runs: cache_hit's cold p90/p98/p99 spiked to 5100ms/8000ms/8000ms — a real anomaly with no precedent anywhere in the project's benchmark history, where cache_hit had always been one of the cheapest, most boring rows in the table. Flagged honestly as an unexplained open item rather than guessed at.
The root cause, found by reading tests/locustfile.py directly rather than the timing numbers alone: cache_hit's task used the literal query "what is nitrogen" — which was also the first entry in KIWIX_QUERIES, the pool the separate, much more frequent kiwix_search task (the highest-weighted task in its user class) drew from at random. On a cold run, both tasks could independently draw the identical, not-yet-cached key at nearly the same instant. _resolve_single_source()'s check-cache-then-call-then-write sequence has no per-key lock, so both genuinely missed and both paid the full cold-routing cost concurrently — the exact same thundering-herd shape as Thread 2, just on a task that was never supposed to be exposed to it, since cache_hit's whole purpose is to never miss at all once warm.
Fixed (v3.50.6) by giving cache_hit its own dedicated query, confirmed via a direct AST-based parse of every list literal in locustfile.py to never collide with any other pool. v3.50.7's re-benchmark confirmed it actually closed the gap: cold p90/p98/p99 dropped to 880ms/940ms/940ms — the same shape every other genuinely-cold single-source row shows — and the warm run landed cache_hit at 29ms p99, essentially identical to kiwix's own 34ms.
This specific bug — the query-pool collision — was genuinely, fully fixed. But cache_hit's cold-run cost resurfaced in two later benchmarks (v3.50.9, v3.50.11-era) for an entirely different, unrelated reason: see Thread 6 below. The two are easy to conflate since they produce a similar-looking symptom (one slow cold cache_hit request) but have nothing to do with each other.
A fresh finding from the v3.50.2 run, not previously flagged anywhere: a warm-cache /health sample hit 5244ms, several times worse than its own 750ms median. _check_kiwix(), _check_forecast(), _check_news(), _check_web(), _check_uptime(), _check_ha(), and _check_llm() were plain sequential calls in app/main.py's health() endpoint, each a real network request with its own 3-5 second timeout — one or two genuinely slow checks could stack additively into a multi-second worst case.
Fixed (v3.50.3) with a ThreadPoolExecutor, the same concurrent-dispatch pattern fusion.py had already established for source queries. Verified correct in the test suite immediately (TestHealthConcurrentSourceChecks), but real-hardware confirmation waited one release — v3.50.4's re-benchmark confirmed it held under actual load: warm max dropped to 1152ms, with no recurrence of the old sequential-stacking signature.
The v3.50.9 warm run produced one real failure: POST /search [fusion_triple]: RemoteDisconnected('Remote end closed connection without response') — the server closed the TCP connection without sending any HTTP response at all, not a timeout or an error response. fusion_triple queries uptime alongside forecast/news, so it touches code changed that same release, but a direct read of both app/sources/uptime_kuma.py and app/sources/fusion.py's concurrent-dispatch error handling (every per-source exception is caught and converted to a None result, never re-raised past the dispatch loop) found nothing that explains a dropped connection at the HTTP layer.
Not dismissed as noise, not attributed to the recent changes either — both would be guessing past the actual evidence available from the Locust output alone. Left as an open item pending real server-side logs from around the time of that run.
The very next re-benchmark (v3.50.11-era) showed zero exceptions, zero failures on both cold and warm passes. One data point either way isn't conclusive on its own — a single non-recurrence doesn't rule out a real, intermittent cause — but it's at least consistent with a one-off network or Docker-level blip rather than something the v3.50.9 code changes introduced. Genuinely still unexplained; recorded here as "didn't recur" rather than "resolved."
Thread 6: cache_hit resurfaces with a different, unfixable-on-Mnemolis's-side cause — and a real, separate bug found along the way
Thread 3 closed cache_hit's 8-second cold-run anomaly in two releases (v3.50.6 fix, v3.50.7 confirmation). Two benchmarks later (v3.50.9, v3.50.11-era), cache_hit's single cold request paid a real cost again — 3800ms, then 3601ms — a genuinely different question from Thread 3's already-fixed query-collision bug, since cache_hit's dedicated query was reconfirmed clean of any pool collision both times.
Investigated thoroughly rather than waved off as "probably the same old thing": disambiguation eligibility ruled out directly (the query's search terms reduce to 3 words; disambiguation requires exactly 1); a second, undetected routing-cache collision ruled out (the cache key is the full literal query, reconfirmed unique across every pool); routing-cache disk-write cost growing across a run ruled out by direct measurement (writing a realistic 200-entry cache takes under 1ms, nowhere near the observed cost).
The real explanation: ordinary Ollama request queueing, shared by every cold LLM call across the whole benchmark, not specific to cache_hit at all. Ollama defaults to OLLAMA_NUM_PARALLEL=1 on memory-constrained setups, queuing every other concurrent request FIFO; Mnemolis's own LLM client has zero concurrency control on its side. Comparing which single endpoint shows the worst sample between the two runs found a different one each time — conditional in one run, kiwix_disambiguation in the other, with fusion_auto/web/cache_hit all shuffling rank in between — the signature of every cold call sharing one queue, not an endpoint-specific defect. cache_hit itself landed at nearly the same magnitude both times (3800ms, 3601ms), consistent with "draws from the same shared lottery."
Deliberately not pursued as a fix. Raising OLLAMA_NUM_PARALLEL would genuinely help, and the actual VRAM math (worked out directly against the real deployment: qwen3:8b already at 10GB with a 32K context window, on a 24GB card) shows real headroom for num_parallel=2. But that VRAM is something the person actively wants free for other use (gaming) — spending it on a benefit that's purely benchmark cosmetics, since this contention doesn't reflect real single-household usage, isn't worth the tradeoff. Recorded as understood and deliberately left as-is.
A real, separate, previously-undiscovered bug was found while checking _looks_empty() as one of the candidate mechanisms above — ruled out for explaining cache_hit specifically, but real and worth fixing on its own: the function matched phrases like "not configured", "could not connect", "could not determine", and "error:" anywhere in a result, with no other constraint. Several of these are ordinary English, not unique sentinel strings. _diff_news() and freshrss.py's own article formatting both echo raw, real upstream headlines verbatim — a real headline like "Tech Company Could Not Determine Cause of Outage" could trigger this by sheer coincidence, and since news is a real, live entry in FALLBACK_CHAIN, a genuinely correct news answer could be silently discarded in production and replaced with a worse web-search fallback, for no reason but an unlucky word in a real headline.
Two fixes were tried and rejected before the real one: a length cap couldn't distinguish a short real headline from a short real error message without breaking real coverage either way; a prefix check broke 5 of the project's own genuine "X is not configured" messages, since the source name always comes first. The actual fix: every genuine empty/error message is plain, unformatted prose; every real article/multi-source result wraps titles in markdown bold. A bare ** anywhere in the result is now checked first as a structural gate, before any phrase matching happens at all — confirmed this distinction holds for every real message (none use markdown) and every constructed false-positive case (all of them do, since the offending text only ever appears wrapped in the same formatting every other real article uses).
A possible, unconfirmed connection to Thread 5 of The Adversarial Testing Production Bugs: that page documents a single historical unexpected_empty flag that was traced six ways and never explained, including confirming what an empty changes response looks like — but never testing whether the real response at that moment was genuinely non-empty with a coincidentally-matching headline, since that mechanism didn't exist as a known possibility until this investigation found it. Can't be confirmed retroactively (the original result text was never recoverable), and deliberately left unresolved rather than retroactively claimed — if it recurs, this fix should at least rule out a coincidental-headline cause.
This is the project's own benchmark methodology under examination as much as the application code is. Two of these six threads took multiple releases and, in one case, a real wrong turn, to actually close. One was resolved by investigating thoroughly enough to confirm a real, external cause (Ollama's own request queueing) and then deliberately choosing not to chase it further, since the actual fix would trade real, wanted resources for a benefit that's purely benchmark cosmetics. That's recorded here on purpose too — "investigated, understood, and consciously left alone" is a different, equally honest outcome from "fixed," and conflating the two would make this record less trustworthy, not more complete.