
feat: Add load test from any-llm PR #1001, fix race condition #42

Open
brightsparc wants to merge 1 commit into mozilla-ai:main from brightsparc:julian/gateway-load-test

Conversation

@brightsparc
Contributor

Description

Fixes a row-loss bug in BatchLogWriter during graceful shutdown, and adds the missing tests/load/ harness that was dropped when the gateway was split out of any-llm (mozilla-ai/any-llm#1001).

The bug

BatchLogWriter.stop() used to call task.cancel() on the background run loop. If the cancel landed while _flush() was mid-commit, the entire in-flight batch was lost: _collect_batch calls queue.task_done() the moment items are dequeued — before the flush commits — so a CancelledError interrupting await db.commit() rolls the session back and leaves the rows in neither the queue nor the database. _flush_all() in stop() only drains what's still in the queue, so those items are gone.
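The race can be reproduced without any of the gateway's machinery. The sketch below is a minimal, self-contained stand-in (names, timings, and the in-memory "commit" are illustrative, not the real BatchLogWriter): rows are marked done the moment they leave the queue, so a cancel landing mid-commit leaves them in neither the queue nor the committed set.

```python
import asyncio

async def demo_lost_batch() -> tuple[int, int]:
    # Minimal reproduction of the cancel-mid-flush race described above.
    queue: asyncio.Queue = asyncio.Queue()
    committed = []
    commit_started = asyncio.Event()

    async def run() -> None:
        while True:
            batch = []
            # Dequeue a whole batch and mark each row done immediately -- the
            # bug: task_done() runs before the flush commits.
            while not queue.empty():
                batch.append(queue.get_nowait())
                queue.task_done()
            if batch:
                commit_started.set()
                await asyncio.sleep(0.05)   # stand-in for `await db.commit()`
                committed.extend(batch)     # never reached if cancelled above
            else:
                await asyncio.sleep(0.01)

    for i in range(100):
        queue.put_nowait(i)
    task = asyncio.create_task(run())
    await commit_started.wait()
    task.cancel()                           # old stop(): cancel lands mid-commit
    try:
        await task
    except asyncio.CancelledError:
        pass
    # The old _flush_all() only drains what is still queued -- nothing is.
    while not queue.empty():
        committed.append(queue.get_nowait())
    return 100, len(committed)

enqueued, flushed = asyncio.run(demo_lost_batch())
print(f"enqueued={enqueued} flushed={flushed}")  # flushed == 0: batch is gone
```

Every one of the 100 in-flight rows is lost, matching the max_batch-sized gap the load test observed.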

The load test caught it: k6 reported 8,518 iterations but usage_logs held 8,418 rows — exactly 100 missing, equal to max_batch. One in-flight batch dropped on shutdown. In production the same race silently drops up to max_batch rows on every clean SIGTERM.

The fix: graceful shutdown via asyncio.Event

Instead of cancelling the task from under the flush, stop() now signals an asyncio.Event, awaits the task with a bounded _STOP_TIMEOUT (10s), and only falls back to cancel() if the timeout expires (a genuinely wedged flush).

The run loop checks the event between batches, lets any in-flight _flush() complete, then drains any remaining queue contents via _flush_all() in a finally block. _collect_batch races the queue get against stop_event.wait() using asyncio.wait(FIRST_COMPLETED) so an idle collector wakes immediately on shutdown instead of waiting out the 1s flush interval.

asyncio.Event (rather than a plain bool) because _collect_batch needs to await the shutdown signal concurrently with the queue get — one primitive covers both the loop-top check and the collector race. asyncio.shield(_flush) was considered and rejected: the outer await still raises CancelledError, so stop() would return while an orphaned flush continued into a half-torn-down event loop.
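The shutdown pattern described above can be sketched as follows. This is a hedged, self-contained illustration (class name, in-memory flush, and the start() helper are stand-ins; the real BatchLogWriter carries DB sessions and batching logic this omits):

```python
import asyncio

_STOP_TIMEOUT = 10.0  # bound from the PR description

class GracefulWriter:
    """Sketch of event-based graceful shutdown; not the actual gateway code."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._stop_event = asyncio.Event()
        self._task: asyncio.Task | None = None
        self.flushed: list = []

    async def start(self) -> None:
        self._task = asyncio.create_task(self._run())

    async def _run(self) -> None:
        try:
            while not self._stop_event.is_set():
                batch = await self._collect_batch()
                if batch:
                    await self._flush(batch)  # completes even during shutdown
        finally:
            await self._flush_all()           # drain whatever remains

    async def _collect_batch(self) -> list:
        # Race the queue get against shutdown so an idle collector wakes
        # immediately instead of sleeping out the flush interval.
        get = asyncio.ensure_future(self._queue.get())
        stop = asyncio.ensure_future(self._stop_event.wait())
        done, pending = await asyncio.wait(
            {get, stop}, return_when=asyncio.FIRST_COMPLETED
        )
        for t in pending:
            t.cancel()
        return [get.result()] if get in done else []

    async def _flush(self, batch: list) -> None:
        self.flushed.extend(batch)            # stand-in for the DB commit

    async def _flush_all(self) -> None:
        while not self._queue.empty():
            await self._flush([self._queue.get_nowait()])

    async def stop(self) -> None:
        self._stop_event.set()
        if self._task is None:
            return                            # never started: no-op
        try:
            await asyncio.wait_for(self._task, _STOP_TIMEOUT)
        except asyncio.TimeoutError:
            self._task.cancel()               # only for a genuinely wedged flush

async def _demo() -> list:
    w = GracefulWriter()
    await w.start()
    for i in range(5):
        w._queue.put_nowait(i)
    await asyncio.sleep(0.05)
    await w.stop()
    return w.flushed

flushed_rows = asyncio.run(_demo())
print(flushed_rows)  # all five rows survive shutdown
```

Note that asyncio.wait_for already cancels the awaited task on timeout; the explicit fallback mirrors the PR's description of the wedged-flush path.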

Tradeoff: stop() can now take up to 10s in the wedged-flush case; previously it returned instantly but silently dropped the batch. This trades teardown speed for durability, with the timeout capping the worst case. Idle-writer shutdown is actually faster than before, because the asyncio.wait race wakes immediately instead of waiting for _collect_batch's wait_for to time out.

Full trace and rationale in bug.md.

Missing load-test harness

mozilla-ai/any-llm#1001 split the gateway to this repo but missed tests/load/** (6 files) from introspection-org:julian/async-asyncpg. Restored here, with one small fix to run_load_test.sh: dropped uv run --extra gateway in two places — that extra was an any-llm concept; in this repo the gateway is the project itself, so plain uv run works.

Benchmark results

Same k6 setup as the prior reference run (GATEWAY_PORT=4001 BUDGET_STRATEGY=disabled LOG_WRITER_STRATEGY=batch, 100 VUs × 30s per scenario + 5s warmup, noop fake provider):

| Metric | Before fix | After fix |
| --- | --- | --- |
| k6 iterations | 8,518 | 8,577 |
| Rows in usage_logs | 8,418 | 8,577 |
| Coverage | 98% (100 dropped) | 100% |
| distinct_users rps | 54.7 | 55.0 |
| same_user rps | 53.2 | 53.7 |
| distinct p50 / p95 / p99 | 778 / 1243 / 1649 ms | 750 / 1259 / 1688 ms |
| same p50 / p95 / p99 | 768 / 1281 / 1598 ms | 763 / 1240 / 1702 ms |
| Failures | 0 | 0 |
| CPU avg / max | 89.0% / 101% | 90.0% / 101% |
| RSS max | 430 MB | 394 MB |

No throughput or latency regression — all changes are within noise. The meaningful delta is 100% row coverage, zero drops.

For comparison against the original historical reference run from any-llm#1001 (different machine, but same k6 config): 97.4 total rps / 7,637 requests / 100%. This branch: ~108 total rps / 8,577 requests / 100%. Faster hardware this time, but the durability guarantee matches.

Changes

src/gateway/services/log_writer.py

  • BatchLogWriter.__init__: adds self._stop_event = asyncio.Event() and _STOP_TIMEOUT = 10.0.
  • BatchLogWriter.stop: sets the event, await asyncio.wait_for(self._task, _STOP_TIMEOUT), cancels only on timeout.
  • BatchLogWriter._run: while not self._stop_event.is_set() loop; try/finally ensures _flush_all() always runs on exit.
  • BatchLogWriter._collect_batch: asyncio.wait({queue.get(), stop_event.wait()}, FIRST_COMPLETED) replaces the old asyncio.wait_for(queue.get()).
  • BatchLogWriter._flush_all: simplified single-loop drain.
  • SingleLogWriter, NoopLogWriter, LogWriter Protocol all unchanged. Lifespan hook in main.py unchanged — stop() signature is still parameter-free.

tests/unit/test_log_writer.py — 4 new tests around the batch writer shutdown contract:

  • test_batch_writer_flushes_queued_items_on_stop — baseline.
  • test_batch_writer_does_not_drop_in_flight_batch_on_stop — the regression test. Uses a gated fake _flush to block a batch mid-commit, fires stop() while blocked, releases the gate, asserts every item is flushed (including ones queued after stop_event was set).
  • test_batch_writer_stop_times_out_and_cancels — verifies _STOP_TIMEOUT fallback fires on a wedged flush.
  • test_batch_writer_stop_is_idempotent_when_not_started — stop() on a never-started writer is a no-op.
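The gated-flush pattern behind the regression test can be illustrated with a self-contained stand-in (the real test targets BatchLogWriter; the loop, names, and timings here are illustrative):

```python
import asyncio

async def _regression_demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    stop_event = asyncio.Event()
    gate = asyncio.Event()          # held closed to pin a flush mid-commit
    flushed: list = []

    async def flush(batch: list) -> None:
        await gate.wait()           # blocks like a slow `await db.commit()`
        flushed.extend(batch)

    async def run() -> None:
        try:
            while not stop_event.is_set():
                batch = []
                while not queue.empty():
                    batch.append(queue.get_nowait())
                if batch:
                    await flush(batch)      # in flight while the gate is closed
                else:
                    await asyncio.sleep(0.01)
        finally:
            leftover = []
            while not queue.empty():
                leftover.append(queue.get_nowait())
            if leftover:
                await flush(leftover)       # drain rows queued after stop

    for i in range(3):
        queue.put_nowait(i)
    task = asyncio.create_task(run())
    await asyncio.sleep(0.02)       # let the flush start and block on the gate
    stop_event.set()                # fire stop() while the batch is in flight
    queue.put_nowait(99)            # item queued after stop was signalled
    gate.set()                      # release the gate
    await asyncio.wait_for(task, 1.0)
    return flushed

result = asyncio.run(_regression_demo())
print(result)  # both the in-flight batch and the late row are flushed
```

The assertion in the real test is the same shape: nothing in flight is dropped, and rows queued after the stop signal still land.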

tests/load/ — restored from introspection-org:julian/async-asyncpg (6 files: README.md, fake_provider.py, gateway-config.yml, load_test.js, run_load_test.sh, results/results.md). Minor fix to run_load_test.sh to drop --extra gateway for this repo's layout.

PR Type

  • 🐛 Bug Fix (in-flight batch loss on BatchLogWriter shutdown)
  • 🚦 Infrastructure (restore missing tests/load/ harness)

Relevant issues

Checklist

  • I understand the code I am submitting.
  • I have added unit tests that prove my fix/feature works.
  • I have run this code locally and verified it fixes the issue.
  • New and existing tests pass locally.
  • Documentation was updated where necessary.
  • I have read and followed the contribution guidelines.
  • AI Usage:
    • No AI was used.
    • AI was used for drafting/refactoring.
    • This is fully AI-generated.

AI Usage Information

  • AI Model used: Claude Opus 4.6
  • AI Developer Tool used: Claude Code
  • Any other info you'd like to share: AI diagnosed the race from the load-test coverage mismatch (100 missing rows == max_batch), traced it to the cancel-mid-_flush path, designed the asyncio.Event-based graceful shutdown, wrote the regression tests, and produced the comparison benchmarks. All results are reproducible via GATEWAY_PORT=4001 BUDGET_STRATEGY=disabled LOG_WRITER_STRATEGY=batch ./tests/load/run_load_test.sh async-batch-disabled.

When answering reviewer questions, please respond yourself; do not copy/paste the reviewer comments into an AI system and paste back its answer. We want to discuss with you, not your AI :)

  • I am an AI Agent filling out this form (check box if true)

@tbille
Contributor

tbille commented Apr 14, 2026

@brightsparc could you review the PR? There are a few extra markdown files that might not be required here, and they contain mentions of branches etc.

Could you also add a few details about the k6 version you are using? I will play around with it and add a GitHub workflow as a follow-up PR.

Also, can you rebase on main?
