feat: RPC timeout exception and fix main's blocking on writing to sockets#3942

Merged
as51340 merged 22 commits into master from feat/rpc-timeout-exception on Mar 30, 2026

@as51340 as51340 commented Mar 20, 2026

What

Adds end-to-end RPC timeout detection and error propagation for storage transactions (data replication). When a replica becomes unresponsive during replication, the main instance now detects the timeout at the TCP/socket level and surfaces a distinct
TimeoutReplicationError through the storage commit path, ultimately throwing a ReplicationException with an actionable message to the user. Previously, all replication failures (timeout, connection reset, protocol error) were collapsed into a single generic error.

Additionally, introduces a configurable replication_mode parameter to the HA stress test infrastructure and a new rpc_timeout stress workload that uses iptables to simulate network partitions.

Also fixes a scenario in which main would block on a socket write because the data could not be sent to the replica. This is solved by setting the TCP_USER_TIMEOUT socket option.

Why

In HA clusters, when a SYNC or STRICT_SYNC replica goes down or becomes unreachable, the main instance's send() would block for minutes (default TCP retransmission timeout) before failing. Users had no way to distinguish a timeout from other replication failures,
making diagnosis difficult. With TCP_USER_TIMEOUT set on sockets, the kernel tears down the connection within seconds, and the distinct error type lets users understand the root cause immediately.
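
As a rough illustration of the mechanism described above, the following sketch sets TCP_USER_TIMEOUT on a TCP socket and reads it back. It is Linux-specific and is not the PR's SetUserTimeout() implementation; the helper names SetUserTimeoutSketch and GetUserTimeoutMs are hypothetical. Per tcp(7), the option takes an unsigned int in milliseconds and bounds how long transmitted data may stay unacknowledged before the kernel closes the connection and reports ETIMEDOUT.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>  // TCP_USER_TIMEOUT (Linux-specific)
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <chrono>

// Hypothetical helper (not the PR's code): sets TCP_USER_TIMEOUT on an
// existing TCP socket. Returns true when setsockopt succeeds.
bool SetUserTimeoutSketch(int fd, std::chrono::milliseconds timeout) {
  assert(timeout.count() >= 0);  // the option value is an unsigned int
  const unsigned int timeout_ms = static_cast<unsigned int>(timeout.count());
  return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &timeout_ms, sizeof(timeout_ms)) == 0;
}

// Hypothetical helper: reads the option back in milliseconds, or -1 on failure.
long GetUserTimeoutMs(int fd) {
  unsigned int timeout_ms = 0;
  socklen_t len = sizeof(timeout_ms);
  if (getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &timeout_ms, &len) != 0) return -1;
  return static_cast<long>(timeout_ms);
}
```

A caller would typically apply this right after creating or accepting the socket, for example `SetUserTimeoutSketch(fd, std::chrono::milliseconds(5000))`. The 5-second value is only an example; the PR description does not state which timeout the project uses.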

How

The change threads timeout information through four layers:

  • Socket layer (src/io/network/socket.cpp): Adds SetUserTimeout() (TCP_USER_TIMEOUT) and WaitForReadyWrite() with timeout detection. Socket Write() now returns std::expected<void, ClientCommunicationError> distinguishing TIMEOUT_ERROR from
    GENERIC_ERROR.
  • RPC layer (src/rpc/client.hpp, src/rpc/exceptions.hpp): Introduces RpcTimeoutException (sibling of GenericRpcFailedException, both inherit RpcFailedException). StreamHandler::SendAndWait()/SendAndWaitProgress() and the SLK write callback translate
    socket timeout errors into RpcTimeoutException. Adds InProgressRes to the generic RPC messages layer (moved from storage-specific rpc.hpp).
  • Replication layer (src/storage/v2/replication/): ReplicationStorageClient::FinalizePrepareCommitPhase() and FinalizeTransactionReplication() catch RpcTimeoutException separately and return ClientCommunicationError::TIMEOUT_ERROR.
    TransactionReplication::ShipDeltas() returns std::expected<void, ClientCommunicationError> (was bool) and prioritizes timeout errors over generic errors when aggregating across replicas.
  • Storage/interpreter layer (src/storage/v2/inmemory/storage.cpp, src/query/interpreter.cpp): InMemoryStorage::Commit() maps ClientCommunicationError::TIMEOUT_ERROR to TimeoutReplicationError{} in the StorageManipulationError variant. The interpreter's
    commit path checks for TimeoutReplicationError and throws ReplicationException(rpc::kRpcTimeoutMsg).
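
The propagation across the layers above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: std::optional stands in for std::expected<void, ClientCommunicationError> so the sketch compiles as C++17, OtherReplicationErrorSketch and the *Sketch aliases are hypothetical names, and the timeout text is copied from the release note on the assumption that it matches rpc::kRpcTimeoutMsg.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Enumerator names follow the PR description.
enum class ClientCommunicationError { TIMEOUT_ERROR, GENERIC_ERROR };

// Stand-in for std::expected<void, ClientCommunicationError>:
// std::nullopt means the replica replied successfully.
using ReplicaResult = std::optional<ClientCommunicationError>;

// Aggregates per-replica results; a timeout takes priority over a generic error.
ReplicaResult AggregateReplicaResults(const std::vector<ReplicaResult> &results) {
  ReplicaResult aggregated;  // stays nullopt unless some replica failed
  for (const auto &result : results) {
    if (!result) continue;
    if (*result == ClientCommunicationError::TIMEOUT_ERROR) return result;
    aggregated = result;
  }
  return aggregated;
}

struct TimeoutReplicationError {};
struct OtherReplicationErrorSketch {};  // hypothetical non-timeout alternative
using StorageManipulationErrorSketch =
    std::variant<TimeoutReplicationError, OtherReplicationErrorSketch>;

// Storage layer: map the communication error onto the commit error variant.
std::optional<StorageManipulationErrorSketch> MapCommitError(const ReplicaResult &result) {
  if (!result) return std::nullopt;
  if (*result == ClientCommunicationError::TIMEOUT_ERROR) {
    return StorageManipulationErrorSketch{TimeoutReplicationError{}};
  }
  return StorageManipulationErrorSketch{OtherReplicationErrorSketch{}};
}

// Interpreter layer: pick the user-facing message for the error.
std::string DescribeCommitError(const StorageManipulationErrorSketch &error) {
  if (std::holds_alternative<TimeoutReplicationError>(error)) {
    // Text taken from the release note; assumed to match rpc::kRpcTimeoutMsg.
    return "Main reached an RPC timeout while waiting for the response from at least one replica";
  }
  return "Replication failed";  // placeholder, not the project's wording
}
```

Returning early on the first timeout is one simple way to give timeouts priority; the actual ShipDeltas() may aggregate differently.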

Other changes:

  • Jepsen tests: Added main-reached-rpc-timeout? handler across HA bank, create, multi-tenancy, and replication test suites. Changed several error handlers from :ok to :info for correctness (these are indeterminate outcomes, not confirmed successes).
  • Stress test infra: Added replication_mode input to stress_tests.yml / reusable_stress_tests.yaml, threaded through mgbuild.sh and deployment.sh. New rpc_timeout workload uses iptables to block a replica's replication port during a heavy write.
  • StreamAndFinalizeDelta (system replication client in src/replication/include/replication/replication_client.hpp): Catches RpcFailedException (base class) instead of only GenericRpcFailedException to handle both timeout and generic failures gracefully.
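
The exception relationship mentioned above (RpcTimeoutException and GenericRpcFailedException as siblings under RpcFailedException) can be illustrated with a small sketch. The class names follow the PR description, but the constructors, the SimulatedOutcome enum, and the FakeRpcCall, StreamAndFinalizeSketch and ClassifyFailure helpers are hypothetical stand-ins rather than the project's code.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Class names follow the PR description; the bodies are illustrative only.
struct RpcFailedException : std::runtime_error {
  explicit RpcFailedException(const std::string &msg) : std::runtime_error(msg) {}
};
struct GenericRpcFailedException : RpcFailedException {
  GenericRpcFailedException() : RpcFailedException("generic RPC failure") {}
};
struct RpcTimeoutException : RpcFailedException {
  RpcTimeoutException() : RpcFailedException("RPC timeout") {}
};

enum class SimulatedOutcome { OK, TIMEOUT, GENERIC_FAILURE };

// Hypothetical stand-in for an RPC call that can fail in either way.
void FakeRpcCall(SimulatedOutcome outcome) {
  if (outcome == SimulatedOutcome::TIMEOUT) throw RpcTimeoutException{};
  if (outcome == SimulatedOutcome::GENERIC_FAILURE) throw GenericRpcFailedException{};
}

// Catching the base class handles both timeout and generic failures.
bool StreamAndFinalizeSketch(SimulatedOutcome outcome) {
  try {
    FakeRpcCall(outcome);
    return true;
  } catch (const RpcFailedException &) {
    return false;
  }
}

// Catching the derived type first lets a caller tell the two failures apart.
SimulatedOutcome ClassifyFailure(SimulatedOutcome outcome) {
  try {
    FakeRpcCall(outcome);
    return SimulatedOutcome::OK;
  } catch (const RpcTimeoutException &) {
    return SimulatedOutcome::TIMEOUT;
  } catch (const RpcFailedException &) {
    return SimulatedOutcome::GENERIC_FAILURE;
  }
}
```

A handler that caught only GenericRpcFailedException would let RpcTimeoutException escape, which is presumably why the system replication client now catches the base class.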

Testing

  • Unit tests updated: rpc_timeouts.cpp, rpc_in_progress.cpp, replication_rpc_progress.cpp, snapshot_rpc_progress.cpp — all timeout scenarios now expect RpcTimeoutException instead of GenericRpcFailedException.
  • New stress test: tests/stress/ha/workloads/rpc_timeout/workload.py — spawns a heavy write query, blocks a replica's replication port via iptables, verifies that the query fails with a timeout or sync replication error. Enabled for native HA, disabled for
    Docker/EKS (requires root/iptables access).
  • Jepsen tests: Updated all HA and replication suites to handle the new timeout error as an indeterminate outcome (:info).

@as51340 as51340 added this to the mg-v3.10.0 milestone Mar 20, 2026
@as51340 as51340 self-assigned this Mar 20, 2026
@as51340 as51340 added the feature, Docs needed, CI -build=community -test=core, CI -build=coverage -test=core, CI -build=jepsen -test=core, CI -build=release -test=core, CI -build=release -test=e2e, and CI -build=coverage -test=clang_tidy labels Mar 20, 2026

as51340 commented Mar 20, 2026

Tracking

  • [Link to Epic/Issue]

Standard development

CI Testing Labels

  • Select the appropriate CI test labels (CI -build=build-name -test=test-suite)

Documentation checklist

  • Add the documentation label
  • Add the bug / feature label
  • Add the milestone for which this feature is intended
    • If not known, set for a later milestone
  • Write a release note, including added/changed clauses
    • When a SYNC or STRICT_SYNC replica becomes unresponsive during replication, the main instance now detects the network timeout within seconds (via TCP_USER_TIMEOUT) instead of blocking for minutes. It returns a distinct error message, "Main reached an RPC timeout
      while waiting for the response from at least one replica", so users can immediately identify the root cause rather than seeing a generic replication failure. For SYNC clusters the transaction still commits on main; for STRICT_SYNC clusters the transaction is aborted
      on all instances, consistent with existing behavior. No configuration change or user action is required; the timeout detection is automatic. #3942
    • What has changed? What does it mean for a user? What should a user do with it? [#{{PR_number}}]({{link to the PR}})
  • [ Documentation PR link memgraph/documentation#XXXX ]
    • Is back linked to this development PR

@as51340 as51340 added CI -build=release -test=stress Run release build and stress tests on push and removed CI -build=release -test=stress Run release build and stress tests on push labels Mar 23, 2026
@as51340 as51340 force-pushed the feat/rpc-timeout-exception branch from c9eefa6 to 65fb5fe Compare March 25, 2026 06:19
@as51340 as51340 requested a review from andrejtonev March 25, 2026 08:44
@as51340 as51340 marked this pull request as ready for review March 25, 2026 08:44
Comment thread src/communication/session.hpp Outdated
Comment thread src/rpc/exceptions.hpp Outdated
Comment thread src/rpc/exceptions.hpp Outdated
Comment thread src/rpc/messages.hpp
Comment thread src/storage/v2/replication/replication_client.cpp
Comment thread src/storage/v2/replication/replication_client.cpp
Comment thread src/storage/v2/replication/replication_client.cpp
Comment thread src/storage/v2/replication/replication_transaction.cpp Outdated
Comment thread src/rpc/version.hpp Outdated
@as51340 as51340 changed the title feat: RPC timeout exception feat: RPC timeout exception and fix main's blocking on writing to sockets Mar 27, 2026
@as51340 as51340 requested a review from andrejtonev March 27, 2026 09:50
@as51340 as51340 force-pushed the feat/rpc-timeout-exception branch from cb668c1 to b8a3640 Compare March 30, 2026 05:58
@as51340 as51340 enabled auto-merge March 30, 2026 05:58
@as51340 as51340 added this pull request to the merge queue Mar 30, 2026
Merged via the queue into master with commit a5262ff Mar 30, 2026
41 checks passed
@as51340 as51340 deleted the feat/rpc-timeout-exception branch March 30, 2026 07:49
Labels

Capability - high-availability, CI -build=community -test=core, CI -build=coverage -test=clang_tidy, CI -build=coverage -test=core, CI -build=jepsen -test=core, CI -build=release -test=core, CI -build=release -test=e2e, Docs needed, feature

2 participants