-
Notifications
You must be signed in to change notification settings - Fork 0
The Benchmark Investigation Log
Five separate threads, three of them spanning multiple releases before reaching a real conclusion, one of them involving a real mistake that a later benchmark caught. This page is the chronological story — what each benchmark run actually found, what got tried, what worked, and what didn't. For the current, present-tense numbers and how to run your own benchmark, see Benchmarks instead; this page is the "how we got here," not the reference.
The raw, dated tables for every run mentioned here live in BENCHMARKS.md in the repo root.
First spotted in the v3.17.0 benchmark: an unexplained tail (p95 980ms, p99 1400ms) with no clear cause, flagged honestly as "could be a connection hiccup or sampling noise" rather than guessed at. It reproduced in v3.44.0 and again in v3.50.2 — three releases showing the same shape was enough to treat it as a real, deterministic cost rather than noise, even without a confirmed mechanism yet.
v3.50.2's hypothesis: CACHE_TTL_UPTIME_SECONDS (60s) is the only source TTL shorter than a typical 120-second benchmark window, so a cache miss genuinely happens during every run. That correctly explained when the miss occurred but not why it cost what it cost — a real, slow-but-successful Socket.IO round-trip was assumed, but never confirmed by reading the actual connection code.
v3.50.4 found the first real cause: app/sources/uptime_kuma.py opened a brand-new Socket.IO connection, logged in, and disconnected on every single call — including every 2-minute snapshot_uptime() scheduler tick, independent of whether anyone was asking a live question. Fixed by making the connection persistent, reused across calls and warmed once at app startup. One thing the design doc for this fix got wrong and caught before shipping: it assumed UptimeKumaApi had a .connected property to check connection liveness. Reading the installed library's actual source showed no such property exists — the real signal is the underlying sio (python-socketio) client's own .connected attribute.
The fix measurably helped — warm p95 dropped from 1500ms to 470ms — but left a smaller tail that reproduced identically across two further releases (v3.50.6, v3.50.7): the exact same ~440ms value, every single time. That precision was itself a clue: a cost that's noisy looks different release to release. A cost that's the same number every time is deterministic, and a deterministic, unexplained number is something with a findable cause.
v3.50.8 found it, by directly reproducing the library's behavior in isolation rather than continuing to guess from benchmark timing: uptime_kuma_api's own _get_event_data() pays an unconditional 0.2-second wait_events sleep on every single call — even when the data it's waiting for was already complete. Two such calls per search() (get_monitors() + get_heartbeats()) is exactly the ~0.4-second floor that had shown up identically three releases running. This is a real, deliberate design choice in the upstream library, not a bug in it — Uptime Kuma's server sends one heartbeatList push per monitor after login, and wait_events exists to let trailing per-monitor pushes land before the client treats the initial batch as complete. But that risk only applies to the first call after a fresh connect or reconnect; every later call on an already-settled persistent connection has nothing left to wait for, confirmed directly by reading the steady-state push handler (one complete record per call, no multi-message batching).
The fix shrinks wait_events only after a connection's first call genuinely settles, keeping the library's full, safe default for the one call that still needs it. v3.50.9 confirmed it with a real re-benchmark: warm p98/p99 dropped to 69ms — finally in the same order of magnitude as the cleanest sources (kiwix 44ms, forecast 45ms), not a separate tail. Five releases, three real findings, one closed loop.
Thread 2: the auto/conditional thundering herd, including a real mistake caught by the next benchmark
First appeared as conditional's warm-cache tail in v3.17.0, traced to a real, identifiable cause once the project started looking for one: _resolve_conditional() recurses into the full routing pipeline using the extracted condition text as the cache key, not the original "if X, Y" phrasing — and that extracted text never appeared as a standalone query anywhere else in the load test's small, fixed query pool. With 20 concurrent Locust users and only a handful of distinct conditions, multiple users could pick the same never-yet-cached condition before the first one to resolve it had written the cache entry, so several paid the full LLM routing cost concurrently even on a nominally "warm" run. auto's tail shared the same root cause for the same reason. This reproduced identically through v3.44.0 and v3.50.2 — auto's cold p99 hit a full 10 seconds in the v3.50.2 run, the single worst sample the project had seen.
v3.50.3 widened the pools (AUTO_QUERIES 6→12, CONDITIONAL_QUERIES 4→8, CONDITIONAL_WITH_REMAINDER_QUERIES 2→4) on the straightforward intuition that more options dilutes collision odds. Re-benchmarks in v3.50.5 and v3.50.7 confirmed it helped but not enough — auto's cold p99 dropped from the 10-second spike but stayed multi-second-adjacent, and conditional/conditional_remainder kept real warm-cache tails that a fully-warmed pool shouldn't have shown at all.
v3.50.8 tried to model the problem properly instead of guessing again — and made a real, documented mistake in the process. The model used ("expected number of pool entries hit by 2 or more of 20 users") correctly identified that the relationship between pool size and collision count isn't simple, but it was the wrong metric: it doesn't predict what the benchmark's tail actually measures. The model's own output for conditional_with_remainder's specific change (4→12 entries) showed the absolute collision count getting worse (3.90 → 6.07), not better — and that warning was sitting in the model's own numbers, unacted on. The widening shipped anyway (AUTO_QUERIES 24, CONDITIONAL_QUERIES 20, CONDITIONAL_WITH_REMAINDER_QUERIES 12).
The very next re-benchmark (v3.50.9) caught it: conditional's cold p99 hit 9800ms — the worst single sample that endpoint had ever produced — and conditional_remainder's cold p98/p99 nearly tripled (1300ms → 4200ms) versus the pre-widening baseline. Both got worse, not better. Taking that result seriously rather than writing it off as noise led to re-deriving the math properly: the metric that actually predicts the tail is "fraction of the 20 users whose first pick collides with someone else's," which declines monotonically with pool size (no peak to dodge) but slowly enough that v3.50.8's sizes still left 62-81% of users colliding with someone.
A second, independent factor made it worse than the collision rate alone explains: AUTO_QUERIES's widening had worked well at a similar nominal collision rate (55.5% at 24 entries) because most of its collisions land on a cheap, structured source — only 2 of 24 entries fall through to kiwix's expensive LLM book-selection path, confirmed by running every entry through detect_intent() directly. CONDITIONAL_QUERIES's 20-entry pool was the opposite: 17 of 20 conditions (85%) fell through to kiwix. A collision there costs far more per occurrence than a similarly-frequent collision on auto's pool — a lower collision rate alone was never going to be enough, and the v3.50.8 widening's new entries (mostly "if X is in retrograde"-style conditions, checked only for correct conditional detection, never checked against which source they'd actually route to) made the mix slightly worse, not better.
v3.50.9's correction: widened further (CONDITIONAL_QUERIES 40, CONDITIONAL_WITH_REMAINDER_QUERIES 30) and fixed the source mix — the new entries were specifically written to hit ha/uptime/forecast/changes keywords rather than falling through to kiwix, verified directly against detect_intent() before being added. This brought the pool's kiwix-fallback ratio from 85% down to 42%. Not yet re-benchmarked as of this writing — recorded as a real, falsifiable hypothesis for the next run, the same way the prior, wrong sizing was recorded as one rather than declared fixed in advance.
The lesson: a worked-out model is better than a guess, but only if the model is checking the metric that actually matters, and only if its own output gets read carefully enough to notice when it's predicting a regression. Both gaps existed in the same sizing pass, and a real re-benchmark — not more modeling — is what caught both.
Surfaced in the v3.50.4/v3.50.5 benchmark runs: cache_hit's cold p90/p98/p99 spiked to 5100ms/8000ms/8000ms — a real anomaly with no precedent anywhere in the project's benchmark history, where cache_hit had always been one of the cheapest, most boring rows in the table. Flagged honestly as an unexplained open item rather than guessed at.
The root cause, found by reading tests/locustfile.py directly rather than the timing numbers alone: cache_hit's task used the literal query "what is nitrogen" — which was also the first entry in KIWIX_QUERIES, the pool the separate, much more frequent kiwix_search task (the highest-weighted task in its user class) drew from at random. On a cold run, both tasks could independently draw the identical, not-yet-cached key at nearly the same instant. _resolve_single_source()'s check-cache-then-call-then-write sequence has no per-key lock, so both genuinely missed and both paid the full cold-routing cost concurrently — the exact same thundering-herd shape as Thread 2, just on a task that was never supposed to be exposed to it, since cache_hit's whole purpose is to never miss at all once warm.
Fixed (v3.50.6) by giving cache_hit its own dedicated query, confirmed via a direct AST-based parse of every list literal in locustfile.py to never collide with any other pool. v3.50.7's re-benchmark confirmed it actually closed the gap: cold p90/p98/p99 dropped to 880ms/940ms/940ms — the same shape every other genuinely-cold single-source row shows — and the warm run landed cache_hit at 29ms p99, essentially identical to kiwix's own 34ms. A clean, fully-closed thread, start to finish, in two releases.
A fresh finding from the v3.50.2 run, not previously flagged anywhere: a warm-cache /health sample hit 5244ms, several times worse than its own 750ms median. _check_kiwix(), _check_forecast(), _check_news(), _check_web(), _check_uptime(), _check_ha(), and _check_llm() were plain sequential calls in app/main.py's health() endpoint, each a real network request with its own 3-5 second timeout — one or two genuinely slow checks could stack additively into a multi-second worst case.
Fixed (v3.50.3) with a ThreadPoolExecutor, the same concurrent-dispatch pattern fusion.py had already established for source queries. Verified correct in the test suite immediately (TestHealthConcurrentSourceChecks), but real-hardware confirmation waited one release — v3.50.4's re-benchmark confirmed it held under actual load: warm max dropped to 1152ms, with no recurrence of the old sequential-stacking signature.
The v3.50.9 warm run produced one real failure: POST /search [fusion_triple]: RemoteDisconnected('Remote end closed connection without response') — the server closed the TCP connection without sending any HTTP response at all, not a timeout or an error response. fusion_triple queries uptime alongside forecast/news, so it touches code changed that same release, but a direct read of both app/sources/uptime_kuma.py and app/sources/fusion.py's concurrent-dispatch error handling (every per-source exception is caught and converted to a None result, never re-raised past the dispatch loop) found nothing that explains a dropped connection at the HTTP layer.
Not dismissed as noise, not attributed to the recent changes either — both would be guessing past the actual evidence available from the Locust output alone. Left as an open item pending real server-side logs from around the time of that run, the same honest-incompleteness standard every other thread on this page started from before it had an answer.
This is the project's own benchmark methodology under examination as much as the application code is. Two of these five threads took multiple releases and, in one case, a real wrong turn, to actually close. That's recorded here on purpose — an investigation log that only showed the clean, final answer would be a worse record than one that shows the false starts too, since the false starts are exactly what makes the eventual fix trustworthy rather than just plausible-sounding.