This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

(7.0) Refactor etcd disk check to be more tolerant #1847

Merged
r0mant merged 4 commits into version/7.0.x from roman/7.0/etcdwarn on Jul 10, 2020


r0mant
Contributor

@r0mant r0mant commented Jul 8, 2020

Description

The current fio-based etcd disk performance check is too strict: it often fails in environments where etcd performance is not a major concern, and there is no way to skip or override it. This results in a poor user experience, to the point where the check does more harm than good.

This PR splits the limits the checker verifies into soft and hard thresholds: hitting a soft threshold only produces a warning, while hitting a hard threshold leads to a critical failure. The resulting behavior is as follows:

  • If any of the soft limits are hit, the installer prints a warning in the output and proceeds as normal.
  • If any of the hard limits are hit, the pre-flight checks and the installation fail, as they do today.
  • As an added bonus, all warnings/failures are now printed by the installer process right away.

The old hard limits (50ms latency and 50 IOPS) are now soft limits and produce warnings. The hard limits are 150ms latency and 10 IOPS. The hard limits were chosen by gathering information from users who experienced these issues.
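The two-tier classification described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Severity`, `Limits`, `CheckIOPS` and `CheckLatency` names are hypothetical, and the strict comparisons (a value exactly equal to a limit passes) are an assumption based on the "lower than" / "higher than" wording of the check messages. Only the default values (50/10 IOPS, 50ms/150ms latency) come from the description.

```go
package main

import (
	"fmt"
	"time"
)

// Severity is a hypothetical type describing the outcome of one check.
type Severity int

const (
	OK       Severity = iota // within both limits
	Warning                  // soft limit hit: warn and proceed
	Critical                 // hard limit hit: fail pre-flight checks
)

func (s Severity) String() string {
	return [...]string{"ok", "warning", "critical"}[s]
}

// Limits groups the soft and hard thresholds (hypothetical struct).
type Limits struct {
	MinIOPSSoft, MinIOPSHard       int
	MaxLatencySoft, MaxLatencyHard time.Duration
}

// DefaultLimits uses the values stated in the PR description.
var DefaultLimits = Limits{
	MinIOPSSoft:    50,
	MinIOPSHard:    10,
	MaxLatencySoft: 50 * time.Millisecond,
	MaxLatencyHard: 150 * time.Millisecond,
}

// CheckIOPS classifies measured sequential write IOPS.
// Assumption: the comparison is strict ("lower than").
func CheckIOPS(l Limits, iops int) Severity {
	switch {
	case iops < l.MinIOPSHard:
		return Critical
	case iops < l.MinIOPSSoft:
		return Warning
	}
	return OK
}

// CheckLatency classifies measured fsync latency.
// Assumption: the comparison is strict ("higher than").
func CheckLatency(l Limits, latency time.Duration) Severity {
	switch {
	case latency > l.MaxLatencyHard:
		return Critical
	case latency > l.MaxLatencySoft:
		return Warning
	}
	return OK
}

func main() {
	fmt.Println("130 IOPS:", CheckIOPS(DefaultLimits, 130))
	fmt.Println("30 IOPS:", CheckIOPS(DefaultLimits, 30))
	fmt.Println("15ms fsync:", CheckLatency(DefaultLimits, 15*time.Millisecond))
	fmt.Println("200ms fsync:", CheckLatency(DefaultLimits, 200*time.Millisecond))
}
```

The hard limit is tested first so that a value violating both thresholds is reported once, as a failure rather than a warning.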

In addition, I added the ability to override any of the soft/hard limits by setting the respective environment variables. These are manual knobs that are convenient for debugging/troubleshooting.
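One way such an override could be read is sketched below. This is an assumption-laden illustration rather than the code in lib/utils/env.go: the `intFromEnv` helper is hypothetical, and falling back to the default on an unset or unparsable value is a design assumption. The variable names are the ones used in the testing section, and the latency values are treated as whole milliseconds because the test output reports `GRAVITY_ETCD_MAX_LATENCY_SOFT=1` as "1ms".

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// intFromEnv is a hypothetical helper. It returns the integer value of the
// named variable as reported by lookup, or def when the variable is unset
// or does not parse as an integer (an assumed fallback policy).
func intFromEnv(lookup func(string) (string, bool), name string, def int) int {
	raw, ok := lookup(name)
	if !ok {
		return def
	}
	value, err := strconv.Atoi(raw)
	if err != nil {
		return def
	}
	return value
}

func main() {
	// Defaults are the limits stated in the PR description.
	minIOPSSoft := intFromEnv(os.LookupEnv, "GRAVITY_ETCD_MIN_IOPS_SOFT", 50)
	minIOPSHard := intFromEnv(os.LookupEnv, "GRAVITY_ETCD_MIN_IOPS_HARD", 10)
	// Assumption: latency overrides are whole milliseconds.
	maxLatencySoftMs := intFromEnv(os.LookupEnv, "GRAVITY_ETCD_MAX_LATENCY_SOFT", 50)
	maxLatencyHardMs := intFromEnv(os.LookupEnv, "GRAVITY_ETCD_MAX_LATENCY_HARD", 150)

	fmt.Printf("IOPS limits: soft=%d hard=%d\n", minIOPSSoft, minIOPSHard)
	fmt.Printf("latency limits: soft=%dms hard=%dms\n", maxLatencySoftMs, maxLatencyHardMs)
}
```

Passing the lookup function as a parameter keeps the helper testable without touching the process environment.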

Type of change

  • Internal change (not necessarily a bug fix or a new feature)

Linked tickets and other PRs

TODOs

  • Self-review the change
  • Write tests
  • Perform manual testing
  • Write documentation
  • Address review feedback

Implementation

I had to change the signature of opsservice's ValidateServers method to return the failed probes as well, so that I could process these results in the client (installer) and print warnings/failures properly. Previously, it only returned a pre-formatted error. This should not cause any incompatibility issues because the method is only used during installation (not upgrade).
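To illustrate why the client needs the individual probes rather than a single pre-formatted error, here is a rough sketch of client-side handling. The `Probe` type and `report` function are hypothetical and are not the actual ValidateServers types or signature; the sketch only shows the idea of printing every result immediately and failing only when a critical one is present.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Probe is a hypothetical stand-in for one failed check result.
type Probe struct {
	Message  string
	Critical bool // true for a hard-limit failure, false for a warning
}

// report prints every probe right away and returns an error that lists
// the critical ones, or nil when there are only warnings.
func report(probes []Probe) error {
	var failed []string
	for _, probe := range probes {
		fmt.Println(probe.Message)
		if probe.Critical {
			failed = append(failed, "[×] "+probe.Message)
		}
	}
	if len(failed) == 0 {
		return nil
	}
	return errors.New("the following pre-flight checks failed:\n" + strings.Join(failed, "\n"))
}

func main() {
	probes := []Probe{
		{Message: "fsync latency is higher than the soft limit", Critical: false},
		{Message: "sequential write IOPS is lower than the hard limit", Critical: true},
	}
	if err := report(probes); err != nil {
		fmt.Println("[ERROR]:", err)
	}
}
```

With only a pre-formatted error, the client could not tell warnings from failures or print them as they arrive.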

Testing done

Override soft limits to trigger the warnings

ubuntu@node-1:~/installer$ export GRAVITY_ETCD_MAX_LATENCY_SOFT=1
ubuntu@node-1:~/installer$ export GRAVITY_ETCD_MIN_IOPS_SOFT=5000
ubuntu@node-1:~/installer$ sudo -E ./gravity install --advertise-addr=192.168.99.102 --cluster=test
Wed Jul  8 20:53:22 UTC	Starting enterprise installer
...
Wed Jul  8 20:53:38 UTC	Executing "/checks" locally
Wed Jul  8 20:53:38 UTC	Running pre-flight checks
Wed Jul  8 20:53:39 UTC	Execute preflight checks
Wed Jul  8 20:53:48 UTC		Still running pre-flight checks (10 seconds elapsed)
Wed Jul  8 20:53:54 UTC	Node node-1 sequential write IOPS on /var/lib/gravity/planet/etcd is lower than 5000 (130) which may result in poor etcd performance
Wed Jul  8 20:53:54 UTC	Node node-1 fsync latency on /var/lib/gravity/planet/etcd is higher than 1ms (15ms) which may result in poor etcd performance
...

Override hard limits to trigger the failures

ubuntu@node-1:~/installer$ export GRAVITY_ETCD_MIN_IOPS_HARD=5000
ubuntu@node-1:~/installer$ export GRAVITY_ETCD_MAX_LATENCY_HARD=1
ubuntu@node-1:~/installer$ sudo -E ./gravity install --advertise-addr=192.168.99.102 --cluster=test
Wed Jul  8 20:54:32 UTC	Starting enterprise installer
...
Wed Jul  8 20:54:50 UTC	Executing "/checks" locally
Wed Jul  8 20:54:50 UTC	Running pre-flight checks
Wed Jul  8 20:54:51 UTC	Execute preflight checks
Wed Jul  8 20:55:00 UTC		Still running pre-flight checks (10 seconds elapsed)
Wed Jul  8 20:55:06 UTC	Node node-1 sequential write IOPS on /var/lib/gravity/planet/etcd is lower than 5000 (125)
Wed Jul  8 20:55:06 UTC	Node node-1 fsync latency on /var/lib/gravity/planet/etcd is higher than 1ms (17ms)
Wed Jul  8 20:55:06 UTC	Saving debug report to /home/ubuntu/installer/crashreport.tgz
[ERROR]: failed to execute phase "/checks"
	The following pre-flight checks failed:
	[×] Node node-1 sequential write IOPS on /var/lib/gravity/planet/etcd is lower than 5000 (125)
	[×] Node node-1 fsync latency on /var/lib/gravity/planet/etcd is higher than 1ms (17ms)

Restore default limits and make sure install succeeds

ubuntu@node-1:~/installer$ unset GRAVITY_ETCD_MIN_IOPS_SOFT
ubuntu@node-1:~/installer$ unset GRAVITY_ETCD_MIN_IOPS_HARD
ubuntu@node-1:~/installer$ unset GRAVITY_ETCD_MAX_LATENCY_HARD
ubuntu@node-1:~/installer$ unset GRAVITY_ETCD_MAX_LATENCY_SOFT
ubuntu@node-1:~/installer$ sudo -E ./gravity install --advertise-addr=192.168.99.102 --cluster=test
Wed Jul  8 20:56:21 UTC	Starting enterprise installer
...
Wed Jul  8 20:56:43 UTC	Executing "/checks" locally
Wed Jul  8 20:56:43 UTC	Running pre-flight checks
Wed Jul  8 20:56:43 UTC	Execute preflight checks
Wed Jul  8 20:56:53 UTC		Still running pre-flight checks (10 seconds elapsed)
Wed Jul  8 20:56:59 UTC	Executing "/configure" locally
Wed Jul  8 20:57:00 UTC	Configuring cluster packages
Wed Jul  8 20:57:00 UTC	Configure packages for all nodes
...

@r0mant r0mant requested review from bernardjkim and a team July 8, 2020 21:40
@r0mant r0mant self-assigned this Jul 8, 2020
Review comment on lib/ops/ops.go (outdated, resolved)
Review comment on lib/utils/env.go (outdated, resolved)
Review comment on lib/checks/disks.go (outdated, resolved)
@r0mant r0mant merged commit 8a1dbd9 into version/7.0.x Jul 10, 2020
@r0mant r0mant deleted the roman/7.0/etcdwarn branch July 10, 2020 01:31
r0mant added a commit that referenced this pull request Jul 16, 2020
* (7.0) Refactor etcd disk check to be more tolerant (#1847)

* Pull only required packages during join (#1862)

* Add docs about etcd disk requirements and relevant environment variables

* Update e
helgi pushed a commit to helgi/gravity that referenced this pull request Jun 21, 2021
3 participants