feat: RPC timeout exception and fix main's blocking on writing to sockets #3942
Merged
c9eefa6 to 65fb5fe
andrejtonev reviewed Mar 26, 2026
andrejtonev approved these changes Mar 27, 2026
…rror message" This reverts commit fd0a6ee.
cb668c1 to b8a3640
What
Adds end-to-end RPC timeout detection and error propagation for storage transactions (data replication). When a replica becomes unresponsive during replication, the main instance now detects the timeout at the TCP/socket level and surfaces a distinct `TimeoutReplicationError` through the storage commit path, ultimately throwing a `ReplicationException` with an actionable message to the user. Previously, all replication failures (timeout, connection reset, protocol error) were collapsed into a single generic error.

Additionally, introduces a configurable `replication_mode` parameter to the HA stress test infrastructure and a new `rpc_timeout` stress workload that uses `iptables` to simulate network partitions.

Also fixes a scenario in which main would block because sending data to a replica was blocked. This is solved by setting the `TCP_USER_TIMEOUT` socket option.
Why
In HA clusters, when a SYNC or STRICT_SYNC replica goes down or becomes unreachable, the main instance's `send()` would block for minutes (default TCP retransmission timeout) before failing. Users had no way to distinguish a timeout from other replication failures, making diagnosis difficult. With `TCP_USER_TIMEOUT` set on sockets, the kernel tears down the connection within seconds, and the distinct error type lets users understand the root cause immediately.
How
The change threads timeout information through four layers:
- `src/io/network/socket.cpp`: Adds `SetUserTimeout()` (`TCP_USER_TIMEOUT`) and `WaitForReadyWrite()` with timeout detection. Socket `Write()` now returns `std::expected<void, ClientCommunicationError>`, distinguishing `TIMEOUT_ERROR` from `GENERIC_ERROR`.
- `src/rpc/client.hpp`, `src/rpc/exceptions.hpp`: Introduces `RpcTimeoutException` (a sibling of `GenericRpcFailedException`; both inherit from `RpcFailedException`). `StreamHandler::SendAndWait()`/`SendAndWaitProgress()` and the SLK write callback translate socket timeout errors into `RpcTimeoutException`. Adds `InProgressRes` to the generic RPC messages layer (moved from the storage-specific `rpc.hpp`).
- `src/storage/v2/replication/`: `ReplicationStorageClient::FinalizePrepareCommitPhase()` and `FinalizeTransactionReplication()` catch `RpcTimeoutException` separately and return `ClientCommunicationError::TIMEOUT_ERROR`. `TransactionReplication::ShipDeltas()` returns `std::expected<void, ClientCommunicationError>` (was `bool`) and prioritizes timeout errors over generic errors when aggregating across replicas.
- `src/storage/v2/inmemory/storage.cpp`, `src/query/interpreter.cpp`: `InMemoryStorage::Commit()` maps `ClientCommunicationError::TIMEOUT_ERROR` to `TimeoutReplicationError{}` in the `StorageManipulationError` variant. The interpreter's commit path checks for `TimeoutReplicationError` and throws `ReplicationException(rpc::kRpcTimeoutMsg)`.
Other changes:
- Added a `main-reached-rpc-timeout?` handler across the HA bank, create, multi-tenancy, and replication test suites. Changed several error handlers from `:ok` to `:info` for correctness (these are indeterminate outcomes, not confirmed successes).
- Added a `replication_mode` input to `stress_tests.yml`/`reusable_stress_tests.yaml`, threaded through `mgbuild.sh` → `deployment.sh`. The new `rpc_timeout` workload uses `iptables` to block a replica's replication port during a heavy write.
- `StreamAndFinalizeDelta` (system replication client in `src/replication/include/replication/replication_client.hpp`): Catches `RpcFailedException` (base class) instead of only `GenericRpcFailedException` to handle both timeout and generic failures gracefully.
Testing
- `rpc_timeouts.cpp`, `rpc_in_progress.cpp`, `replication_rpc_progress.cpp`, `snapshot_rpc_progress.cpp`: all timeout scenarios now expect `RpcTimeoutException` instead of `GenericRpcFailedException`.
- `tests/stress/ha/workloads/rpc_timeout/workload.py`: spawns a heavy write query, blocks a replica's replication port via `iptables`, and verifies that the query fails with a timeout or sync replication error. Enabled for native HA, disabled for Docker/EKS (requires root/`iptables` access).
:info)
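
The error propagation described under How can be pictured with the simplified sketch below. It reuses the names `RpcFailedException`, `GenericRpcFailedException`, `RpcTimeoutException`, and `ClientCommunicationError` from the description, but the definitions here are stand-ins written for illustration. `FinalizeOnReplica` and `Aggregate` are hypothetical helpers, and `std::optional` stands in for `std::expected<void, ClientCommunicationError>` so the sketch compiles as C++17; the PR's real code may be organized differently.

```cpp
#include <optional>
#include <stdexcept>
#include <vector>

// Stand-in definitions; the real ones live in the PR and may differ.
enum class ClientCommunicationError { GENERIC_ERROR, TIMEOUT_ERROR };

struct RpcFailedException : std::runtime_error {
  using std::runtime_error::runtime_error;
};
struct GenericRpcFailedException : RpcFailedException {
  GenericRpcFailedException() : RpcFailedException("generic RPC failure") {}
};
struct RpcTimeoutException : RpcFailedException {
  RpcTimeoutException() : RpcFailedException("RPC timed out") {}
};

using MaybeError = std::optional<ClientCommunicationError>;

// Hypothetical helper: runs one replica's RPC and maps exceptions to an
// error code. std::nullopt means the RPC succeeded.
template <typename Rpc>
MaybeError FinalizeOnReplica(Rpc &&rpc) {
  try {
    rpc();
    return std::nullopt;
  } catch (const RpcTimeoutException &) {
    // The more specific handler has to come before the base-class handler.
    return ClientCommunicationError::TIMEOUT_ERROR;
  } catch (const RpcFailedException &) {
    return ClientCommunicationError::GENERIC_ERROR;
  }
}

// Hypothetical helper: combines per-replica results so that a timeout
// takes priority over a generic error.
MaybeError Aggregate(const std::vector<MaybeError> &results) {
  MaybeError combined;
  for (const auto &result : results) {
    if (!result) continue;  // this replica succeeded
    if (*result == ClientCommunicationError::TIMEOUT_ERROR) return result;
    combined = result;
  }
  return combined;
}
```

A caller can then branch on the aggregated value and pick the timeout-specific message only when at least one replica timed out.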