feat: Add load test from any-llm PR #1001, fix race condition #42
Open

brightsparc wants to merge 1 commit into mozilla-ai:main from
Conversation
Contributor
@brightsparc could you review the PR? There are a few extra markdown files that might not be required here and that contain mentions of branches etc. Could you also add a few details about the version of k6 you are using? I will play around with it and add a GitHub workflow as a follow-up PR. Also, can you rebase on main?
Description
Fixes a row-loss bug in `BatchLogWriter` during graceful shutdown, and adds the missing `tests/load/` harness that was dropped when the gateway was split out of `any-llm` (mozilla-ai/any-llm#1001).

The bug

`BatchLogWriter.stop()` used to call `task.cancel()` on the background run loop. If the cancel landed while `_flush()` was mid-commit, the entire in-flight batch was lost: `_collect_batch` calls `queue.task_done()` the moment items are dequeued, before the flush commits, so a `CancelledError` interrupting `await db.commit()` rolls the session back and leaves the rows in neither the queue nor the database. `_flush_all()` in `stop()` only drains what's still in the queue, so those items are gone.

The load test caught it: k6 reported 8,518 iterations but `usage_logs` held 8,418 rows, exactly 100 missing, equal to `max_batch`: one in-flight batch dropped on shutdown. In production the same race silently drops up to `max_batch` rows on every clean SIGTERM.

The fix: graceful shutdown via
`asyncio.Event`

Instead of cancelling the task from under the flush, `stop()` now signals an `asyncio.Event`, awaits the task with a bounded `_STOP_TIMEOUT` (10s), and only falls back to `cancel()` if the timeout expires (a genuinely wedged flush).

The run loop checks the event between batches, lets any in-flight `_flush()` complete, then drains any remaining queue contents via `_flush_all()` in a `finally` block. `_collect_batch` races the queue get against `stop_event.wait()` using `asyncio.wait(FIRST_COMPLETED)`, so an idle collector wakes immediately on shutdown instead of waiting out the 1s flush interval.

Why an `asyncio.Event` rather than a plain bool: `_collect_batch` needs to await the shutdown signal concurrently with the queue get, so one primitive covers both the loop-top check and the collector race. `asyncio.shield(_flush)` was considered and rejected: the outer `await` still raises `CancelledError`, so `stop()` would return while an orphaned flush continued into a half-torn-down event loop.

Tradeoff:
`stop()` can now take up to 10s in the wedged-flush case; previously it returned instantly but silently dropped the batch. Durability over teardown speed, and the timeout caps the worst case. Idle-writer shutdown is actually faster than before, because the `asyncio.wait` race wakes immediately instead of waiting for `_collect_batch`'s `wait_for` to time out.

Full trace and rationale in
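The shutdown shape described above can be sketched as follows. This is a hedged simplification, not the real `BatchLogWriter`: the hypothetical `Writer` class flushes into a plain list where the real code commits to the database.

```python
import asyncio

_STOP_TIMEOUT = 10.0  # bounded wait before falling back to cancel()

class Writer:
    """Hypothetical sketch of the Event-based graceful shutdown."""

    def __init__(self) -> None:
        self.queue: asyncio.Queue = asyncio.Queue()
        self.committed: list = []
        self._stop_event = asyncio.Event()
        self._task: asyncio.Task | None = None

    def start(self) -> None:
        self._task = asyncio.create_task(self._run())

    async def _run(self) -> None:
        try:
            # Check the event between batches; the loop is never
            # cancelled mid-flush, so an in-flight flush completes.
            while not self._stop_event.is_set():
                batch = await self._collect_batch()
                if batch:
                    await self._flush(batch)
        finally:
            await self._flush_all()  # drain whatever remains on exit

    async def _collect_batch(self) -> list:
        # Race the queue get against shutdown so an idle collector
        # wakes immediately instead of sleeping out a flush interval.
        get = asyncio.create_task(self.queue.get())
        stop = asyncio.create_task(self._stop_event.wait())
        done, pending = await asyncio.wait(
            {get, stop}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()
        return [get.result()] if get in done else []

    async def _flush(self, batch: list) -> None:
        await asyncio.sleep(0)  # stands in for `await db.commit()`
        self.committed.extend(batch)

    async def _flush_all(self) -> None:
        batch = []
        while not self.queue.empty():
            batch.append(self.queue.get_nowait())
        if batch:
            await self._flush(batch)

    async def stop(self) -> None:
        if self._task is None:
            return
        self._stop_event.set()  # signal instead of cancelling
        try:
            await asyncio.wait_for(self._task, _STOP_TIMEOUT)
        except asyncio.TimeoutError:
            self._task.cancel()  # only for a genuinely wedged flush

async def demo() -> list:
    writer = Writer()
    writer.start()
    for i in range(5):
        writer.queue.put_nowait(i)
    await asyncio.sleep(0.05)  # let the run loop drain the queue
    await writer.stop()
    return sorted(writer.committed)

result = asyncio.run(demo())
print(result)  # → [0, 1, 2, 3, 4]
```

With the old `task.cancel()` approach, the `await asyncio.sleep(0)` stand-in for the commit is exactly where a `CancelledError` would land and drop the batch; here the event check lets it finish.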
`bug.md`.

Missing load-test harness

mozilla-ai/any-llm#1001 split the gateway out to this repo but missed `tests/load/**` (6 files) from `introspection-org:julian/async-asyncpg`. Restored here, with one small fix to `run_load_test.sh`: dropped `uv run --extra gateway` in two places. That extra was an `any-llm` concept; in this repo the gateway is the project itself, so plain `uv run` works.

Benchmark results
Same k6 setup as the prior reference run (`GATEWAY_PORT=4001 BUDGET_STRATEGY=disabled LOG_WRITER_STRATEGY=batch`, 100 VUs × 30s per scenario + 5s warmup, noop fake provider).

(Results table: per-scenario throughput, latency, and `usage_logs` row counts.)

No throughput or latency regression; all changes are within noise. The meaningful delta is 100% row coverage, zero drops.

For comparison against the original historical reference run from any-llm#1001 (different machine, but same k6 config): 97.4 total rps / 7,637 requests / 100% coverage. This branch: ~108 total rps / 8,577 requests / 100% coverage. Faster hardware this time, but the durability guarantee matches.
Changes
- `src/gateway/services/log_writer.py`
  - `BatchLogWriter.__init__`: adds `self._stop_event = asyncio.Event()` and `_STOP_TIMEOUT = 10.0`.
  - `BatchLogWriter.stop`: sets the event, `await asyncio.wait_for(self._task, _STOP_TIMEOUT)`, cancels only on timeout.
  - `BatchLogWriter._run`: `while not self._stop_event.is_set()` loop; `try/finally` ensures `_flush_all()` always runs on exit.
  - `BatchLogWriter._collect_batch`: `asyncio.wait({queue.get(), stop_event.wait()}, FIRST_COMPLETED)` replaces the old `asyncio.wait_for(queue.get())`.
  - `BatchLogWriter._flush_all`: simplified single-loop drain.
  - `SingleLogWriter`, `NoopLogWriter`, `LogWriterProtocol` all unchanged. Lifespan hook in `main.py` unchanged: `stop()`'s signature is still parameter-free.
- `tests/unit/test_log_writer.py`: 4 new tests around the batch writer shutdown contract:
  - `test_batch_writer_flushes_queued_items_on_stop`: baseline.
  - `test_batch_writer_does_not_drop_in_flight_batch_on_stop`: the regression test. Uses a gated fake `_flush` to block a batch mid-commit, fires `stop()` while blocked, releases the gate, and asserts every item is flushed (including ones queued after `stop_event` was set).
  - `test_batch_writer_stop_times_out_and_cancels`: verifies the `_STOP_TIMEOUT` fallback fires on a wedged flush.
  - `test_batch_writer_stop_is_idempotent_when_not_started`: `stop()` on a never-started writer is a no-op.
- `tests/load/`: restored from `introspection-org:julian/async-asyncpg` (6 files: `README.md`, `fake_provider.py`, `gateway-config.yml`, `load_test.js`, `run_load_test.sh`, `results/results.md`). Minor fix to `run_load_test.sh` to drop `--extra gateway` for this repo's layout.

PR Type
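The gated-flush idea behind the regression test can be sketched like this. `GatedWriter` is hypothetical, not the real test code or the real `BatchLogWriter`: a flush blocks on an event (the gate), `stop()` is fired while the batch is in flight, the gate is released, and every item must still land.

```python
import asyncio

class GatedWriter:
    """Hypothetical writer whose flush blocks on a gate event."""

    def __init__(self, gate: asyncio.Event) -> None:
        self.gate = gate
        self.queue: asyncio.Queue = asyncio.Queue()
        self.committed: list = []
        self._stop = asyncio.Event()
        self._task: asyncio.Task | None = None

    def start(self) -> None:
        self._task = asyncio.create_task(self._run())

    async def _run(self) -> None:
        try:
            while not self._stop.is_set():
                if self.queue.empty():
                    await asyncio.sleep(0.01)
                    continue
                batch = [
                    self.queue.get_nowait()
                    for _ in range(self.queue.qsize())
                ]
                await self.gate.wait()  # flush blocked "mid-commit"
                self.committed.extend(batch)
        finally:
            while not self.queue.empty():  # drain remainder on exit
                self.committed.append(self.queue.get_nowait())

    async def stop(self) -> None:
        self._stop.set()  # signal shutdown; never cancel mid-flush
        if self._task is not None:
            await asyncio.wait_for(self._task, 10.0)

async def regression() -> list:
    gate = asyncio.Event()
    writer = GatedWriter(gate)
    writer.start()
    for i in range(3):
        writer.queue.put_nowait(i)
    await asyncio.sleep(0.05)  # batch dequeued, now blocked on the gate
    stopping = asyncio.create_task(writer.stop())
    writer.queue.put_nowait(3)  # queued after stop was signalled
    gate.set()                  # release the blocked flush
    await stopping
    return sorted(writer.committed)

result = asyncio.run(regression())
print(result)  # → [0, 1, 2, 3]
```

Under the old cancel-based shutdown, stopping while the flush is parked on the gate would discard the dequeued batch; here both the in-flight batch and the late-queued item survive.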
- Bug fix (`BatchLogWriter` shutdown)
- Tests (`tests/load/` harness)

Relevant issues
Checklist
AI Usage Information
`max_batch`), traced it to the cancel-mid-`_flush` path, designed the `asyncio.Event`-based graceful shutdown, wrote the regression tests, and produced the comparison benchmarks. All results are reproducible via `GATEWAY_PORT=4001 BUDGET_STRATEGY=disabled LOG_WRITER_STRATEGY=batch ./tests/load/run_load_test.sh async-batch-disabled`.

When answering questions by the reviewer, please respond yourself; do not copy/paste the reviewer comments into an AI system and paste back its answer. We want to discuss with you, not your AI :)