From 6114af2588e8266dc2a83a19d4aadeea9563b408 Mon Sep 17 00:00:00 2001
From: dfitzmau
Date: Thu, 16 Oct 2025 15:09:53 +0100
Subject: [PATCH] OSDOCS-13616-takeover: Documented Cluster latency requirements for etcd

---
 etcd/etcd-performance.adoc                    |  3 +++
 etcd/etcd-practices.adoc                      | 12 ++++++++-
 modules/etcd-tuning-parameters.adoc           |  1 -
 modules/recommended-cluster-latency-etcd.adoc | 27 +++++++++++++++++++
 4 files changed, 41 insertions(+), 2 deletions(-)
 create mode 100644 modules/recommended-cluster-latency-etcd.adoc

diff --git a/etcd/etcd-performance.adoc b/etcd/etcd-performance.adoc
index cc40c2b4db76..b7e4ace6fe27 100644
--- a/etcd/etcd-performance.adoc
+++ b/etcd/etcd-performance.adoc
@@ -26,6 +26,9 @@ include::modules/etcd-node-scaling.adoc[leveloffset=+1]
 * link:https://docs.redhat.com/en/documentation/assisted_installer_for_openshift_container_platform/2024/html/installing_openshift_container_platform_with_the_assisted_installer/expanding-the-cluster#installing-control-plane-node-healthy-cluster_expanding-the-cluster[Expanding the cluster]
 * xref:../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]
 
+// * xref:../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/etcd-tuning-parameters.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]
+
+
 // Effects of disk latency on etcd
 include::modules/etcd-disk-latency.adoc[leveloffset=+1]
 
diff --git a/etcd/etcd-practices.adoc b/etcd/etcd-practices.adoc
index f758f65e3bda..3f4f2e675ceb 100644
--- a/etcd/etcd-practices.adoc
+++ b/etcd/etcd-practices.adoc
@@ -6,9 +6,19 @@ include::_attributes/common-attributes.adoc[]
 
 toc::[]
 
-The following documentation provides information on recommended performance and scalability practices for etcd.
+The following documentation provides information about recommended performance and scalability practices for etcd.
 
+// Storage practices for etcd
 include::modules/recommended-etcd-practices.adoc[leveloffset=+1]
+
+// Cluster latency requirements for etcd
+include::modules/recommended-cluster-latency-etcd.adoc[leveloffset=+1]
+
+[role="_additional-resources"]
+.Additional resources
+* xref:../etcd/etcd-performance.adoc#etcd-tuning-parameters_etcd-performance[Setting tuning parameters for etcd]
+
+// Validating the hardware for etcd
 include::modules/etcd-verify-hardware.adoc[leveloffset=+1]
 
 [role="_additional-resources"]
diff --git a/modules/etcd-tuning-parameters.adoc b/modules/etcd-tuning-parameters.adoc
index f756a2ad8aa1..f1b73ea697c7 100644
--- a/modules/etcd-tuning-parameters.adoc
+++ b/modules/etcd-tuning-parameters.adoc
@@ -17,7 +17,6 @@ By selecting one of the other values, you are overriding the default. If you see
 
 To change the hardware speed tolerance for etcd, complete the following steps.
 
-
 .Procedure
 
 . Check to see what the current value is by entering the following command:
diff --git a/modules/recommended-cluster-latency-etcd.adoc b/modules/recommended-cluster-latency-etcd.adoc
new file mode 100644
index 000000000000..789830bd80f6
--- /dev/null
+++ b/modules/recommended-cluster-latency-etcd.adoc
@@ -0,0 +1,27 @@
+// Module included in the following assemblies:
+//
+// * etcd/etcd-practices.adoc
+
+:_mod-docs-content-type: CONCEPT
+[id="recommended-cluster-latency-etcd_{context}"]
+= Cluster latency requirements for etcd
+
+[role="_abstract"]
+Two important constraints should be addressed to provide a low-latency, high-availability network for etcd:
+
+* network I/O latency
+* disk I/O latency
+
+etcd uses the Raft consensus algorithm, and every change should replicate to a majority of the cluster members before it commits. This process is highly sensitive to network and disk performance. The minimum time for an etcd request is the Round-Trip Time (RTT) between members, plus the time required for data to write to permanent storage.
+
+To achieve high availability, etcd should detect and recover from a leader failure quickly. This depends on two key tuning parameters:
+
+Heartbeat Interval:: The frequency that the leader sends a heartbeat to followers. This value should be close to the average RTT between members.
+Election Timeout:: The time a follower waits without hearing a heartbeat before it attempts to become the new leader. This should be at least 10 times the RTT value to account for network variance.
+
+In a healthy cluster, the round-trip time between members should be less than 50 ms to ensure stability and avoid frequent leader elections. This is why etcd clusters are often deployed within a single data center or availability zone to minimize physical distance and network latency.
+
+To support a low-latency, high-availability network, especially during the leader election process, an arbiter site should be located where it provides an RTT latency of less than 10 ms. The arbiter component of a network maintains consistency and availability in a distributed system.
+
+// Need to clarify so the impression is that the arbiter is not counted in the number of nodes
+// In the case of leader election and similar processes, the arbiter is used when clusters have an odd number of nodes, so a majority vote determines the system state.
\ No newline at end of file
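
Reviewer note, not part of the patch: the sizing guidance in the new module (heartbeat interval close to the average RTT, election timeout at least 10 times the RTT, member-to-member RTT under 50 ms) can be illustrated with a short calculation. The Python sketch below is only an illustration under stated assumptions. The helper name `suggest_etcd_timing`, the returned keys, and the sample RTT values are hypothetical, and using the average RTT as the base for the election timeout is an assumption, because the module text says only "the RTT value". The sketch does not query or configure a real etcd cluster.

```python
from statistics import mean

MAX_HEALTHY_RTT_MS = 50.0  # RTT ceiling stated in the module text
ELECTION_MULTIPLIER = 10   # "at least 10 times the RTT value"


def suggest_etcd_timing(rtt_samples_ms):
    """Derive suggested timings from measured member-to-member RTT samples.

    Hypothetical helper: it applies only the rules of thumb from the module
    text to a list of RTT measurements given in milliseconds.
    """
    if not rtt_samples_ms:
        raise ValueError("at least one RTT sample is required")
    avg_rtt = mean(rtt_samples_ms)
    return {
        "avg_rtt_ms": avg_rtt,
        # Heartbeat interval should be close to the average RTT.
        "heartbeat_interval_ms": avg_rtt,
        # Assumption: the average RTT is the base for the 10x lower bound.
        "min_election_timeout_ms": ELECTION_MULTIPLIER * avg_rtt,
        # True when the average RTT is below the 50 ms ceiling.
        "rtt_within_limit": avg_rtt < MAX_HEALTHY_RTT_MS,
    }


if __name__ == "__main__":
    # Made-up RTT measurements in milliseconds.
    print(suggest_etcd_timing([8.0, 10.0, 12.0]))
```

For the sample values, the average RTT is 10 ms, so the sketch suggests a heartbeat interval of about 10 ms and an election timeout of at least 100 ms. A more conservative variant could base the election timeout on the largest observed RTT instead of the average.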