(7.0) Refactor etcd disk check to be more tolerant #1847

r0mant · 2020-07-08T21:40:17Z

Description

The current fio-based etcd disk performance check is too strict and often fails in environments where etcd performance is not a major concern, and there is no way to skip or override it. This results in a poor experience for users, to the point where the check does more harm than good.

This PR splits the limits the checker verifies into soft and hard thresholds: hitting a soft threshold only produces a warning, while hitting a hard threshold leads to a critical failure. The behavior is as follows:

If any of the soft limits is hit, the installer prints a warning in the output and proceeds as normal.
If any of the hard limits is hit, the pre-flight checks and the installation fail, as they do now.
As an added bonus, all warnings and failures are now printed by the installer process right away.

The old hard limits (50ms latency and 50 IOPS) are now soft limits and produce warnings. The hard limits are 150ms latency and 10 IOPS. The hard limits were chosen by gathering information from users who experienced these issues.
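To illustrate the soft/hard split, here is a minimal sketch of how a measurement could be classified. The default values are the ones quoted above; the type and function names (`Severity`, `limits`, `classifyLatency`, `classifyIOPS`) are hypothetical and not taken from lib/checks/disks.go, and the strict comparisons are an assumption based on the "higher than" / "lower than" wording of the messages.

```go
package main

import (
	"fmt"
	"time"
)

// Severity is the outcome of comparing one measurement against its limits.
type Severity int

const (
	OK       Severity = iota // within the soft limit
	Warning                  // soft limit hit: warn and continue
	Critical                 // hard limit hit: fail pre-flight checks
)

func (s Severity) String() string {
	return [...]string{"ok", "warning", "critical"}[s]
}

// limits groups the four thresholds (hypothetical struct for this sketch).
type limits struct {
	softMaxLatency, hardMaxLatency time.Duration
	softMinIOPS, hardMinIOPS       float64
}

// defaults uses the values from the PR description.
var defaults = limits{
	softMaxLatency: 50 * time.Millisecond,
	hardMaxLatency: 150 * time.Millisecond,
	softMinIOPS:    50,
	hardMinIOPS:    10,
}

// classifyLatency checks the hard limit first so that a value exceeding
// both limits is reported as critical rather than as a warning.
func classifyLatency(l limits, latency time.Duration) Severity {
	switch {
	case latency > l.hardMaxLatency:
		return Critical
	case latency > l.softMaxLatency:
		return Warning
	}
	return OK
}

// classifyIOPS mirrors classifyLatency, but lower values are worse.
func classifyIOPS(l limits, iops float64) Severity {
	switch {
	case iops < l.hardMinIOPS:
		return Critical
	case iops < l.softMinIOPS:
		return Warning
	}
	return OK
}

func main() {
	fmt.Println(classifyLatency(defaults, 15*time.Millisecond)) // ok
	fmt.Println(classifyLatency(defaults, 80*time.Millisecond)) // warning
	fmt.Println(classifyIOPS(defaults, 5))                      // critical
}
```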

In addition, I added the ability to override any of the soft or hard limits by setting the respective environment variables. These are manual knobs that are convenient for debugging and troubleshooting.
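A rough sketch of how such an override could be resolved is shown below. The variable names are the ones used in the testing section of this PR; the helper `intSetting`, the `lookupFunc` type, and the choice to reject unparsable values with an error are assumptions for illustration and do not reflect the actual code in lib/utils/env.go. Latency values are assumed to be in milliseconds, matching the "higher than 1ms" output for `GRAVITY_ETCD_MAX_LATENCY_SOFT=1`.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// lookupFunc matches the signature of os.LookupEnv so that the lookup can
// be replaced in tests (hypothetical type for this sketch).
type lookupFunc func(name string) (string, bool)

// intSetting returns the integer value of the named variable, or def if
// the variable is unset or empty. An unparsable value is an error.
func intSetting(lookup lookupFunc, name string, def int) (int, error) {
	raw, ok := lookup(name)
	if !ok || raw == "" {
		return def, nil
	}
	v, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid value %q for %s: %w", raw, name, err)
	}
	return v, nil
}

func main() {
	// Defaults are the limits quoted in the PR description.
	settings := []struct {
		name string
		def  int
	}{
		{"GRAVITY_ETCD_MAX_LATENCY_SOFT", 50},
		{"GRAVITY_ETCD_MAX_LATENCY_HARD", 150},
		{"GRAVITY_ETCD_MIN_IOPS_SOFT", 50},
		{"GRAVITY_ETCD_MIN_IOPS_HARD", 10},
	}
	for _, s := range settings {
		v, err := intSetting(os.LookupEnv, s.name, s.def)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("%s=%d\n", s.name, v)
	}
}
```

Note that the testing commands below run the installer with `sudo -E`, which preserves the caller's environment so the exported variables reach the installer process.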

Type of change

Internal change (not necessarily a bug fix or a new feature)

Linked tickets and other PRs

Closes Make etcd disk check into a warning #1834.

TODOs

Implementation

I had to change the signature of opsservice's ValidateServers method to also return the failed probes, so that I could process these results in the client (installer) and print warnings/failures properly. Previously, it returned only a pre-formatted error. This should not cause any incompatibility issues because the method is only used during installation (not upgrade).
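The client-side processing described above might look roughly like the following sketch. Everything here is hypothetical: the `Probe` type, its fields, and `splitProbes` are illustrative stand-ins and are not the actual types or signature used by ValidateServers in lib/ops/ops.go. The point is only that returning structured probes, rather than one pre-formatted error, lets the caller print warnings and still fail on critical results.

```go
package main

import (
	"fmt"
	"strings"
)

// Severity distinguishes soft-limit and hard-limit probe failures.
type Severity int

const (
	Warning Severity = iota
	Critical
)

// Probe is a hypothetical failed check result returned to the client.
type Probe struct {
	Detail   string
	Severity Severity
}

// splitProbes returns the warning messages to print and a non-nil error
// if at least one probe is critical.
func splitProbes(probes []Probe) (warnings []string, err error) {
	var failures []string
	for _, p := range probes {
		if p.Severity == Critical {
			failures = append(failures, "[×] "+p.Detail)
		} else {
			warnings = append(warnings, p.Detail)
		}
	}
	if len(failures) > 0 {
		err = fmt.Errorf("the following pre-flight checks failed:\n%s",
			strings.Join(failures, "\n"))
	}
	return warnings, err
}

func main() {
	warnings, err := splitProbes([]Probe{
		{Detail: "fsync latency is higher than 50ms (80ms)", Severity: Warning},
		{Detail: "sequential write IOPS is lower than 10 (5)", Severity: Critical},
	})
	for _, w := range warnings {
		fmt.Println(w)
	}
	if err != nil {
		fmt.Println(err)
	}
}
```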

Testing done

Override soft limits to trigger the warnings

ubuntu@node-1:~/installer$ export GRAVITY_ETCD_MAX_LATENCY_SOFT=1
ubuntu@node-1:~/installer$ export GRAVITY_ETCD_MIN_IOPS_SOFT=5000
ubuntu@node-1:~/installer$ sudo -E ./gravity install --advertise-addr=192.168.99.102 --cluster=test
Wed Jul  8 20:53:22 UTC	Starting enterprise installer
...
Wed Jul  8 20:53:38 UTC	Executing "/checks" locally
Wed Jul  8 20:53:38 UTC	Running pre-flight checks
Wed Jul  8 20:53:39 UTC	Execute preflight checks
Wed Jul  8 20:53:48 UTC		Still running pre-flight checks (10 seconds elapsed)
Wed Jul  8 20:53:54 UTC	Node node-1 sequential write IOPS on /var/lib/gravity/planet/etcd is lower than 5000 (130) which may result in poor etcd performance
Wed Jul  8 20:53:54 UTC	Node node-1 fsync latency on /var/lib/gravity/planet/etcd is higher than 1ms (15ms) which may result in poor etcd performance
...

Override hard limits to trigger the failures

ubuntu@node-1:~/installer$ export GRAVITY_ETCD_MIN_IOPS_HARD=5000
ubuntu@node-1:~/installer$ export GRAVITY_ETCD_MAX_LATENCY_HARD=1
ubuntu@node-1:~/installer$ sudo -E ./gravity install --advertise-addr=192.168.99.102 --cluster=test
Wed Jul  8 20:54:32 UTC	Starting enterprise installer
...
Wed Jul  8 20:54:50 UTC	Executing "/checks" locally
Wed Jul  8 20:54:50 UTC	Running pre-flight checks
Wed Jul  8 20:54:51 UTC	Execute preflight checks
Wed Jul  8 20:55:00 UTC		Still running pre-flight checks (10 seconds elapsed)
Wed Jul  8 20:55:06 UTC	Node node-1 sequential write IOPS on /var/lib/gravity/planet/etcd is lower than 5000 (125)
Wed Jul  8 20:55:06 UTC	Node node-1 fsync latency on /var/lib/gravity/planet/etcd is higher than 1ms (17ms)
Wed Jul  8 20:55:06 UTC	Saving debug report to /home/ubuntu/installer/crashreport.tgz
[ERROR]: failed to execute phase "/checks"
	The following pre-flight checks failed:
	[×] Node node-1 sequential write IOPS on /var/lib/gravity/planet/etcd is lower than 5000 (125)
	[×] Node node-1 fsync latency on /var/lib/gravity/planet/etcd is higher than 1ms (17ms)

Restore default limits and make sure install succeeds

ubuntu@node-1:~/installer$ unset GRAVITY_ETCD_MIN_IOPS_SOFT
ubuntu@node-1:~/installer$ unset GRAVITY_ETCD_MIN_IOPS_HARD
ubuntu@node-1:~/installer$ unset GRAVITY_ETCD_MAX_LATENCY_HARD
ubuntu@node-1:~/installer$ unset GRAVITY_ETCD_MAX_LATENCY_SOFT
ubuntu@node-1:~/installer$ sudo -E ./gravity install --advertise-addr=192.168.99.102 --cluster=test
Wed Jul  8 20:56:21 UTC	Starting enterprise installer
...
Wed Jul  8 20:56:43 UTC	Executing "/checks" locally
Wed Jul  8 20:56:43 UTC	Running pre-flight checks
Wed Jul  8 20:56:43 UTC	Execute preflight checks
Wed Jul  8 20:56:53 UTC		Still running pre-flight checks (10 seconds elapsed)
Wed Jul  8 20:56:59 UTC	Executing "/configure" locally
Wed Jul  8 20:57:00 UTC	Configuring cluster packages
Wed Jul  8 20:57:00 UTC	Configure packages for all nodes
...

* (7.0) Refactor etcd disk check to be more tolerant (#1847) * Pull only required packages during join (#1862) * Add docs about etcd disk requirements and relevant environment variables * Update e

r0mant added 3 commits July 8, 2020 12:23

Refactoring etcd disk check into warning

6848d42

Fix etcd limits from env vars

e04e70a

Polish

ebbe6bd

r0mant requested review from bernardjkim and a team July 8, 2020 21:40

r0mant self-assigned this Jul 8, 2020

r0mant requested review from a-palchikov and knisbet July 8, 2020 21:40

bernardjkim approved these changes Jul 8, 2020

a-palchikov approved these changes Jul 9, 2020

lib/ops/ops.go (outdated, resolved)

lib/utils/env.go (outdated, resolved)

lib/checks/disks.go (outdated, resolved)

Address review comments

c5a01d3

r0mant merged commit 8a1dbd9 into version/7.0.x Jul 10, 2020

r0mant deleted the roman/7.0/etcdwarn branch July 10, 2020 01:31

r0mant added a commit that referenced this pull request Jul 10, 2020

(7.0) Refactor etcd disk check to be more tolerant (#1847)

7fc09da

This was referenced Jul 10, 2020

(7.1) Forward-port etcd disk check updates #1859

Merged

(7.0) Pull only required packages during join #1862

Merged

r0mant added a commit that referenced this pull request Jul 16, 2020

(7.0) Refactor etcd disk check to be more tolerant (#1847)

5703fa3

aelkugia mentioned this pull request Sep 15, 2020

Let customers turn off install check / skip with actionable errors. #707

Closed
