refactor(bigtable): unify session/vRPC concurrency on an executor model#13604
refactor(bigtable): unify session/vRPC concurrency on an executor model#13604igorbernstein2 wants to merge 0 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the threading and scheduling model of the Bigtable client by introducing a custom HashedWheelTimer and a per-operation serializing executor (OpExecutor), replacing the previous usage of ScheduledExecutorService and SynchronizationContext to improve thread safety and reduce lock contention. Feedback on these changes highlights a critical deadlock risk in OpExecutor.drain(), where an abrupt termination caused by an uncaught exception in the handler can leave drainScheduled set to true, permanently wedging the executor.
handler.uncaught is the last-resort recovery for a task throw; it cancels the chain, which drives Done.onStart, which calls listener.onClose. If the user's terminal listener also throws, the throw escapes the handler and exits drain() before reaching the `r == null` branch that clears drainScheduled. The flag stays set and every subsequent execute() queues without rescheduling — the OpExecutor is wedged. Per-op lifecycle today has no work after terminal close, so the wedge leaks nothing observable. Fixing it for symmetry with scheduleDrainLocked (which already resets drainScheduled on a backing.execute throw) and to keep the contract robust if a future handler can throw on a non-terminal op. Caught by gemini-code-assist on PR googleapis#13604.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the threading and scheduling model of the Bigtable client by introducing a custom HashedWheelTimer to replace ScheduledExecutorService for low-resolution scheduling, and a per-operation serializing executor (OpExecutor) to replace heavy synchronization in SessionImpl with a SynchronizationContext. Feedback on these changes highlights a critical concurrency bug in OpExecutor.runInline that violates serialization guarantees, a potential race condition in HashedWheelTimer.stop() that could cause stop hooks to execute twice, and a cumulative timeout issue in Client.close() when sequentially waiting for multiple session pools to drain.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a significant refactoring of the Bigtable client's internal concurrency and session management. Key changes include the introduction of a custom 'BigtableTimer' (a hashed-wheel timer) to replace reliance on 'ScheduledExecutorService' for short-lived tasks, the implementation of an 'OpExecutor' to provide per-operation serialization, and a migration of session state management to a 'SynchronizationContext'. These changes improve thread safety, prevent potential memory leaks, and optimize resource usage by reducing the number of threads and timer tasks. I have identified a potential correctness issue in 'VRpcImpl.java' where the state is not transitioned to 'CLOSED' upon successful response, which could lead to memory leaks in the 'VOperationImpl' cancellation listener registration.
| ctx.getExecutor() | ||
| .execute( | ||
| () -> { | ||
| RespT resp; |
There was a problem hiding this comment.
Correctness / Memory Leak Risk
In handleResponse, the state of VRpcImpl is checked but never transitioned to CLOSED (unlike handleError and handleSessionClose which use compareAndSet(State.STARTED, State.CLOSED)).
This leads to two issues:
- Inconsistent
isDone()state:isDone()will returnfalseeven after the RPC has successfully completed. - Potential Memory Leak in
VOperationImpl: IfVRpcImplis used directly (or via a delegating middleware),VOperationImpl.startchecks!chain.isDone()to decide whether to register thecancellationListenerongrpcContext. SinceisDone()returnsfalseon success, the listener will be registered after the call has already completed andCleanupListener.onClosehas run. This permanently leaks the listener (and the entire call chain) on thegrpcContext.
Suggestion:
Transition the state to CLOSED inside the executor task in handleResponse using compareAndSet to ensure symmetry and correctness.
ctx.getExecutor()
.execute(
() -> {
if (!state.compareAndSet(State.STARTED, State.CLOSED)) {
return;
}
RespT resp;4ed8f07 to
b56f9b8
Compare
handler.uncaught is the last-resort recovery for a task throw; it cancels the chain, which drives Done.onStart, which calls listener.onClose. If the user's terminal listener also throws, the throw escapes the handler and exits drain() before reaching the `r == null` branch that clears drainScheduled. The flag stays set and every subsequent execute() queues without rescheduling — the OpExecutor is wedged. Per-op lifecycle today has no work after terminal close, so the wedge leaks nothing observable. Fixing it for symmetry with scheduleDrainLocked (which already resets drainScheduled on a backing.execute throw) and to keep the contract robust if a future handler can throw on a non-terminal op. Caught by gemini-code-assist on PR googleapis#13604.
|
Superseded by #13633 after a rebase mishap on this branch. Closing. |
Summary
Reworks the threading model of the new session-based transport
(
com.google.cloud.bigtable.data.v2.internal). The old model serializedsession callbacks, retry/op state, and pool topology all on a single
synchronized(SessionPoolImpl.this)monitor, dispatched user callbackson the gRPC user-thread pool, and used the JDK
ScheduledThreadPoolExecutorfor heartbeats and per-vRPC deadlinemonitoring. This produced three concrete problems:
mutations, retry decisions, and
PendingVRpcbookkeeping.and deadline-monitor futures are cancelled within ~1 ms but stay in
the
DelayQueueuntil natural expiry, inflating O(log n) inserts.inside the SyncContext could orphan the caller's
Futurewith nolog or timeout.
New model
sessionSyncContextSessionImplOpExecutor=SerializingExecutor(userCallbackExecutor)DirectExecutor)HashedWheelTimer(in-process)SessionPoolImplScheduledExecutorServiceEvery state mutation in a session crosses
sessionSyncContext; everystate mutation in the retry/op layer crosses the per-op
OpExecutor.Netty I/O threads and the user thread are producers only — they submit
and return. The pool monitor now covers only pool topology
(
pendingRpcs,poolState, session list).The
userCallbackExecutoris the one configured on the gaxTransportChannelProvider, so transport and user-callback dispatchshare a single user-controlled pool instead of each Client/Shim
spinning up its own unbounded
CachedThreadPool.Key behavior changes worth flagging to reviewers
gRPC user-thread pool. Callers that implicitly relied on the prior
thread identity will see a different
Thread.currentThread().SessionImpl.forceCloseis now fully async (queues ontosessionSyncContextand returns). If the SyncContext is wedged theabort path catches the uncaught exception, logs, and tears the
session down on the next drain — delivering terminal
onClosetoevery attached vRPC instead of orphaning their futures.
ticking on every idle session.
Client.closedrainsSessionPools before tearing down theuser-callback executor, and serializes against concurrent
open*so racing opens cannot create orphan pools.
Not in scope
topology but is still pool-wide; per-AFE sharding is a queued
follow-up).
requestNextis stillUnsupportedOperationException— unary-only surface today).