
feat: Add load test from any-llm PR #1001, fix race condition #42

Open
brightsparc wants to merge 1 commit into mozilla-ai:main from brightsparc:julian/gateway-load-test

Conversation

@brightsparc
Contributor

Description

Fixes a row-loss bug in BatchLogWriter during graceful shutdown, and adds the missing tests/load/ harness that was dropped when the gateway was split out of any-llm (mozilla-ai/any-llm#1001).

The bug

BatchLogWriter.stop() used to call task.cancel() on the background run loop. If the cancel landed while _flush() was mid-commit, the entire in-flight batch was lost: _collect_batch calls queue.task_done() the moment items are dequeued — before the flush commits — so a CancelledError interrupting await db.commit() rolls the session back and leaves the rows in neither the queue nor the database. _flush_all() in stop() only drains what's still in the queue, so those items are gone.
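The race can be reproduced without any of the gateway's machinery. The sketch below is a minimal, self-contained stand-in (names, timings, and the in-memory "commit" are illustrative, not the real BatchLogWriter): rows are marked done the moment they leave the queue, so a cancel landing mid-commit leaves them in neither the queue nor the committed set.

```python
import asyncio

async def demo_lost_batch() -> tuple[int, int]:
    # Minimal reproduction of the cancel-mid-flush race described above.
    queue: asyncio.Queue = asyncio.Queue()
    committed = []
    commit_started = asyncio.Event()

    async def run() -> None:
        while True:
            batch = []
            # Dequeue a whole batch and mark each row done immediately -- the
            # bug: task_done() runs before the flush commits.
            while not queue.empty():
                batch.append(queue.get_nowait())
                queue.task_done()
            if batch:
                commit_started.set()
                await asyncio.sleep(0.05)   # stand-in for `await db.commit()`
                committed.extend(batch)     # never reached if cancelled above
            else:
                await asyncio.sleep(0.01)

    for i in range(100):
        queue.put_nowait(i)
    task = asyncio.create_task(run())
    await commit_started.wait()
    task.cancel()                           # old stop(): cancel lands mid-commit
    try:
        await task
    except asyncio.CancelledError:
        pass
    # The old _flush_all() only drains what is still queued -- nothing is.
    while not queue.empty():
        committed.append(queue.get_nowait())
    return 100, len(committed)

enqueued, flushed = asyncio.run(demo_lost_batch())
print(f"enqueued={enqueued} flushed={flushed}")  # flushed == 0: batch is gone
```

Every one of the 100 in-flight rows is lost, matching the max_batch-sized gap the load test observed.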

The load test caught it: k6 reported 8,518 iterations but usage_logs held 8,418 rows — exactly 100 missing, equal to max_batch. One in-flight batch dropped on shutdown. In production the same race silently drops up to max_batch rows on every clean SIGTERM.

The fix: graceful shutdown via asyncio.Event

Instead of cancelling the task from under the flush, stop() now signals an asyncio.Event, awaits the task with a bounded _STOP_TIMEOUT (10s), and only falls back to cancel() if the timeout expires (a genuinely wedged flush).

The run loop checks the event between batches, lets any in-flight _flush() complete, then drains any remaining queue contents via _flush_all() in a finally block. _collect_batch races the queue get against stop_event.wait() using asyncio.wait(FIRST_COMPLETED) so an idle collector wakes immediately on shutdown instead of waiting out the 1s flush interval.

asyncio.Event (rather than a plain bool) because _collect_batch needs to await the shutdown signal concurrently with the queue get — one primitive covers both the loop-top check and the collector race. asyncio.shield(_flush) was considered and rejected: the outer await still raises CancelledError, so stop() would return while an orphaned flush continued into a half-torn-down event loop.
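The shutdown pattern described above can be sketched as follows. This is a hedged, self-contained illustration (class name, in-memory flush, and the start() helper are stand-ins; the real BatchLogWriter carries DB sessions and batching logic this omits):

```python
import asyncio

_STOP_TIMEOUT = 10.0  # bound from the PR description

class GracefulWriter:
    """Sketch of event-based graceful shutdown; not the actual gateway code."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._stop_event = asyncio.Event()
        self._task: asyncio.Task | None = None
        self.flushed: list = []

    async def start(self) -> None:
        self._task = asyncio.create_task(self._run())

    async def _run(self) -> None:
        try:
            while not self._stop_event.is_set():
                batch = await self._collect_batch()
                if batch:
                    await self._flush(batch)  # completes even during shutdown
        finally:
            await self._flush_all()           # drain whatever remains

    async def _collect_batch(self) -> list:
        # Race the queue get against shutdown so an idle collector wakes
        # immediately instead of sleeping out the flush interval.
        get = asyncio.ensure_future(self._queue.get())
        stop = asyncio.ensure_future(self._stop_event.wait())
        done, pending = await asyncio.wait(
            {get, stop}, return_when=asyncio.FIRST_COMPLETED
        )
        for t in pending:
            t.cancel()
        return [get.result()] if get in done else []

    async def _flush(self, batch: list) -> None:
        self.flushed.extend(batch)            # stand-in for the DB commit

    async def _flush_all(self) -> None:
        while not self._queue.empty():
            await self._flush([self._queue.get_nowait()])

    async def stop(self) -> None:
        self._stop_event.set()
        if self._task is None:
            return                            # never started: no-op
        try:
            await asyncio.wait_for(self._task, _STOP_TIMEOUT)
        except asyncio.TimeoutError:
            self._task.cancel()               # only for a genuinely wedged flush

async def _demo() -> list:
    w = GracefulWriter()
    await w.start()
    for i in range(5):
        w._queue.put_nowait(i)
    await asyncio.sleep(0.05)
    await w.stop()
    return w.flushed

flushed_rows = asyncio.run(_demo())
print(flushed_rows)  # all five rows survive shutdown
```

Note that asyncio.wait_for already cancels the awaited task on timeout; the explicit fallback mirrors the PR's description of the wedged-flush path.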

Tradeoff: stop() can now take up to 10s in the wedged-flush case; previously it returned instantly but silently dropped the batch. This trades teardown speed for durability, with the timeout capping the worst case. Idle-writer shutdown is actually faster than before, because the asyncio.wait race wakes immediately instead of waiting for _collect_batch's wait_for to time out.

Full trace and rationale in bug.md.

Missing load-test harness

mozilla-ai/any-llm#1001 split the gateway to this repo but missed tests/load/** (6 files) from introspection-org:julian/async-asyncpg. Restored here, with one small fix to run_load_test.sh: dropped uv run --extra gateway in two places — that extra was an any-llm concept; in this repo the gateway is the project itself, so plain uv run works.

Benchmark results

Same k6 setup as the prior reference run (GATEWAY_PORT=4001 BUDGET_STRATEGY=disabled LOG_WRITER_STRATEGY=batch, 100 VUs × 30s per scenario + 5s warmup, noop fake provider):

| Metric | Before fix | After fix |
| --- | --- | --- |
| k6 iterations | 8,518 | 8,577 |
| Rows in usage_logs | 8,418 | 8,577 |
| Coverage | 98% (100 dropped) | 100% |
| distinct_users rps | 54.7 | 55.0 |
| same_user rps | 53.2 | 53.7 |
| distinct p50 / p95 / p99 | 778 / 1243 / 1649 ms | 750 / 1259 / 1688 ms |
| same p50 / p95 / p99 | 768 / 1281 / 1598 ms | 763 / 1240 / 1702 ms |
| Failures | 0 | 0 |
| CPU avg / max | 89.0% / 101% | 90.0% / 101% |
| RSS max | 430 MB | 394 MB |

No throughput or latency regression — all changes are within noise. The meaningful delta is 100% row coverage, zero drops.

For comparison against the original historical reference run from any-llm#1001 (different machine, but same k6 config): 97.4 total rps / 7,637 requests / 100%. This branch: ~108 total rps / 8,577 requests / 100%. Faster hardware this time, but the durability guarantee matches.

Changes

src/gateway/services/log_writer.py

  • BatchLogWriter.__init__: adds self._stop_event = asyncio.Event() and _STOP_TIMEOUT = 10.0.
  • BatchLogWriter.stop: sets the event, await asyncio.wait_for(self._task, _STOP_TIMEOUT), cancels only on timeout.
  • BatchLogWriter._run: while not self._stop_event.is_set() loop; try/finally ensures _flush_all() always runs on exit.
  • BatchLogWriter._collect_batch: asyncio.wait({queue.get(), stop_event.wait()}, FIRST_COMPLETED) replaces the old asyncio.wait_for(queue.get()).
  • BatchLogWriter._flush_all: simplified single-loop drain.
  • SingleLogWriter, NoopLogWriter, LogWriter Protocol all unchanged. Lifespan hook in main.py unchanged — stop() signature is still parameter-free.

tests/unit/test_log_writer.py — 4 new tests around the batch writer shutdown contract:

  • test_batch_writer_flushes_queued_items_on_stop — baseline.
  • test_batch_writer_does_not_drop_in_flight_batch_on_stop — the regression test. Uses a gated fake _flush to block a batch mid-commit, fires stop() while blocked, releases the gate, asserts every item is flushed (including ones queued after stop_event was set).
  • test_batch_writer_stop_times_out_and_cancels — verifies _STOP_TIMEOUT fallback fires on a wedged flush.
  • test_batch_writer_stop_is_idempotent_when_not_started — stop() on a never-started writer is a no-op.
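The gated-flush pattern behind the regression test can be illustrated with a self-contained stand-in (the real test targets BatchLogWriter; the loop, names, and timings here are illustrative):

```python
import asyncio

async def _regression_demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    stop_event = asyncio.Event()
    gate = asyncio.Event()          # held closed to pin a flush mid-commit
    flushed: list = []

    async def flush(batch: list) -> None:
        await gate.wait()           # blocks like a slow `await db.commit()`
        flushed.extend(batch)

    async def run() -> None:
        try:
            while not stop_event.is_set():
                batch = []
                while not queue.empty():
                    batch.append(queue.get_nowait())
                if batch:
                    await flush(batch)      # in flight while the gate is closed
                else:
                    await asyncio.sleep(0.01)
        finally:
            leftover = []
            while not queue.empty():
                leftover.append(queue.get_nowait())
            if leftover:
                await flush(leftover)       # drain rows queued after stop

    for i in range(3):
        queue.put_nowait(i)
    task = asyncio.create_task(run())
    await asyncio.sleep(0.02)       # let the flush start and block on the gate
    stop_event.set()                # fire stop() while the batch is in flight
    queue.put_nowait(99)            # item queued after stop was signalled
    gate.set()                      # release the gate
    await asyncio.wait_for(task, 1.0)
    return flushed

result = asyncio.run(_regression_demo())
print(result)  # both the in-flight batch and the late row are flushed
```

The assertion in the real test is the same shape: nothing in flight is dropped, and rows queued after the stop signal still land.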

tests/load/ — restored from introspection-org:julian/async-asyncpg (6 files: README.md, fake_provider.py, gateway-config.yml, load_test.js, run_load_test.sh, results/results.md). Minor fix to run_load_test.sh to drop --extra gateway for this repo's layout.

PR Type

  • 🐛 Bug Fix (in-flight batch loss on BatchLogWriter shutdown)
  • 🚦 Infrastructure (restore missing tests/load/ harness)

Relevant issues

Checklist

  • I understand the code I am submitting.
  • I have added unit tests that prove my fix/feature works.
  • I have run this code locally and verified it fixes the issue.
  • New and existing tests pass locally.
  • Documentation was updated where necessary.
  • I have read and followed the contribution guidelines.
  • AI Usage:
    • No AI was used.
    • AI was used for drafting/refactoring.
    • This is fully AI-generated.

AI Usage Information

  • AI Model used: Claude Opus 4.6
  • AI Developer Tool used: Claude Code
  • Any other info you'd like to share: AI diagnosed the race from the load-test coverage mismatch (100 missing rows == max_batch), traced it to the cancel-mid-_flush path, designed the asyncio.Event-based graceful shutdown, wrote the regression tests, and produced the comparison benchmarks. All results are reproducible via GATEWAY_PORT=4001 BUDGET_STRATEGY=disabled LOG_WRITER_STRATEGY=batch ./tests/load/run_load_test.sh async-batch-disabled.

When answering reviewer questions, please respond yourself; do not copy/paste the reviewer comments into an AI system and paste back its answer. We want to discuss with you, not your AI :)

  • I am an AI Agent filling out this form (check box if true)

@tbille
Contributor

tbille commented Apr 14, 2026

@brightsparc could you review the PR? There are a few extra markdown files that might not be required here, and they contain mentions of branches etc.

Could you also add a few details about the k6 version you are using? I will play around with it and add a GitHub workflow as a follow-up PR.

Also, can you rebase on main?
