Fix GC race on freshly created dynamic pools via RAII guard#255
Merged
Conversation
Before: a dynamic auth_query passthrough pool was inserted into POOLS before its first server connection existed. During the get_server_parameters call that establishes that connection — under the SSLRequest handshake in prefer mode this is several round-trips — a GC sweep could observe pool_state().size == 0 and remove the pool. The next client looking it up got "No pool configured" until a fresh login rebuilt it. The 2-second created_at grace period mitigated the race but did not close it. Now: ConnectionPool carries an init_complete: Arc<AtomicBool> flag, false on dynamic pool creation and true on static pools (which never race with GC). create_dynamic_pool returns (pool, PoolInitGuard); the guard owns that flag. The auth flow calls guard.commit() right after the first get_server_parameters succeeds — at that point the flag flips to true and GC treats the pool like any other. If the auth flow fails or the guard goes out of scope without commit, Drop calls drop_dynamic_pool so the next login starts from a clean slate. GC checks the flag instead of an elapsed-time grace period, so the window is closed regardless of how long the first probe takes. The earlier structural attempt (commit 089fc22, reverted in 760ac60) tried to delay POOLS insertion until after the first connection. Two concurrent first logins for the same user would each get None from get_pool and build a separate ConnectionPool, leaving the second client wired to a bb8 Pool that was about to be overwritten. The guard approach keeps the insertion order (POOLS first, then connection) and relies on the flag — no race, no double-pool.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #209.
Summary
A dynamic auth_query passthrough pool was inserted into
POOLSbeforeits first server connection existed. Between insertion and the
get_server_parameterscall that establishes that connection, a GCsweep could observe
pool_state().size == 0and remove the pool. Thenext client looking it up saw
No pool configureduntil a fresh loginrebuilt it.
The previous mitigation was a 2-second
created_atgrace period ingc_idle_dynamic_pools. It worked locally but on slow CI runners — andin
preferTLS mode where the SSLRequest round-trip widens theinitialization window — the race still triggered.
What changes for the operator
POOLSuntil its firstserver connection is established. GC keeps its hands off until
initialization is done.
parameter, backend unreachable, panic in the middle) leaves no
stale pool entry behind. The next login starts from a clean slate.
created_atfield and its 2-second grace period are gone.What changes internally
ConnectionPoolcarries aninit_complete: Arc<AtomicBool>. Staticpools start with
true. Dynamic pools start withfalse.create_dynamic_poolreturns(ConnectionPool, PoolInitGuard). Theguard owns that flag and the pool identifier.
commitflips theflag to
true;Dropwithoutcommitcallsdrop_dynamic_pool.init_guard.commit()right after the firstsuccessful
get_server_parameters. Every error path returns beforecommit, soDropcleans up.gc_idle_dynamic_poolsnow checksinit_completeinstead of anelapsed-time window. The decision logic is extracted into a small
pure function (
should_gc_idle_pool) covered by unit tests.The earlier structural attempt (
089fc22, reverted in760ac60)moved the POOLS insertion to after the first connection. Two
concurrent first logins for the same user could then each build a
separate
ConnectionPool, leaving one client wired to abb8::Poolthat was about to be overwritten — the hang in
760ac60traced backto that race. The guard approach keeps the insertion order (POOLS
first, then connection) so the single-pool invariant survives, and
relies on the flag to close the GC window.
Test plan
cargo test --lib pool::gc::tests— 4 unit tests on the puredecision function. Two of them encode the regression directly:
idle_pool_still_initializing_is_skippedandflipping_init_complete_makes_pool_eligible. Without this fixthey would assert against the missing
init_completefield.cargo test --lib pool::init_guard::tests— 2 unit tests on theguard (
already_committed_leaves_flag_true_and_does_nothing_on_drop,commit_flips_flag_and_blocks_drop_cleanup).make test-bdd TAGS=@auth-query— 10 features, 42 scenarios,360 steps, all green.
make test-bdd TAGS=@auth-query-init-race— new scenario openstwo concurrent first logins for two different dynamic users with
retain_connections_time = 200msand short idle/serverlifetimes, then asserts that both succeed and that GC reaps the
pools the normal way once connections drain. On CI runners where
the historical 2 s grace period was not enough, this scenario
would have flaked without the fix.
cargo fmt,cargo clippy --lib --tests -- --deny warnings,and the full unit-test suite are clean.
Performance
The new flag is a single
Arc<AtomicBool>:gc_idle_dynamic_poolsreads each dynamic pool once persweep;
create_dynamic_poolruns once per first login per user.pool per sweep and one relaxed atomic store at first-login success.
Instant::elapsedbranch in any production workload; both bound by
O(N)GC sweepover dynamic pools.