This repository has been archived by the owner on Feb 9, 2024. It is now read-only.
(7.0) Refactor etcd disk check to be more tolerant #1847
Merged
Description
The current fio-based etcd disk performance check is too strict. It often fails in environments where etcd performance is not a major concern, and there is no way to skip or override it. This results in a poor experience for users, to the point where the check does more harm than good.
This PR splits the limits the checker verifies into soft and hard thresholds: exceeding a soft threshold only produces a warning, while exceeding a hard threshold leads to a critical failure. The behavior is as follows:
The old hard limits (50ms latency and 50 IOPS) are now soft limits and produce warnings. The hard limits are 150ms latency and 10 IOPS. The hard limits were chosen by gathering information from users who experienced these issues.
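The soft/hard split described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `Severity`, `Limits`, `DefaultLimits`, `ClassifyLatency`, and `ClassifyIOPS` names are hypothetical, and treating a value exactly at a limit as acceptable is an assumption. Only the numeric defaults come from the description.

```go
package main

import "time"

// Severity is the outcome of comparing one measurement with its limits.
// (Hypothetical type, used only for this sketch.)
type Severity int

const (
	OK       Severity = iota // within the soft limit
	Warning                  // soft limit violated, hard limit not
	Critical                 // hard limit violated
)

// Limits groups the soft and hard thresholds for the disk check.
type Limits struct {
	SoftLatency time.Duration // above this: warning
	HardLatency time.Duration // above this: critical failure
	SoftIOPS    float64       // below this: warning
	HardIOPS    float64       // below this: critical failure
}

// DefaultLimits returns the values quoted in the PR description.
func DefaultLimits() Limits {
	return Limits{
		SoftLatency: 50 * time.Millisecond,
		HardLatency: 150 * time.Millisecond,
		SoftIOPS:    50,
		HardIOPS:    10,
	}
}

// ClassifyLatency treats higher latency as worse.
// Assumption: a value exactly equal to a limit still passes it.
func (l Limits) ClassifyLatency(d time.Duration) Severity {
	switch {
	case d > l.HardLatency:
		return Critical
	case d > l.SoftLatency:
		return Warning
	}
	return OK
}

// ClassifyIOPS treats lower IOPS as worse.
func (l Limits) ClassifyIOPS(iops float64) Severity {
	switch {
	case iops < l.HardIOPS:
		return Critical
	case iops < l.SoftIOPS:
		return Warning
	}
	return OK
}
```

Under these assumptions, a disk measured at 100ms latency would yield a warning rather than failing the install, while 200ms would still be a critical failure.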
In addition, I added the ability to override any of the soft/hard limits by setting the respective environment variables. These are manual knobs that are convenient for debugging and troubleshooting.
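One way such an override could work is sketched below. The description does not name the actual environment variables, so `EXAMPLE_ETCD_MIN_IOPS_SOFT` is a made-up placeholder, and the helpers `overrideFrom` and `LimitFromEnv` are hypothetical. Falling back to the default on an unparsable or negative value is also an assumption, not documented behavior.

```go
package main

import (
	"os"
	"strconv"
)

// overrideFrom returns the numeric value stored under name, or def when the
// variable is unset, unparsable, or negative. The lookup function is injected
// so the logic can be exercised without touching the process environment.
func overrideFrom(lookup func(string) (string, bool), name string, def float64) float64 {
	raw, ok := lookup(name)
	if !ok {
		return def
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil || v < 0 {
		// Assumption: a bad override is ignored rather than treated as fatal.
		return def
	}
	return v
}

// LimitFromEnv reads an override from the real process environment.
// Example with a placeholder variable name:
//
//	softIOPS := LimitFromEnv("EXAMPLE_ETCD_MIN_IOPS_SOFT", 50)
func LimitFromEnv(name string, def float64) float64 {
	return overrideFrom(os.LookupEnv, name, def)
}
```

Keeping the built-in default whenever the override is missing or malformed means a typo in a debugging knob cannot silently disable the check.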
Type of change
Linked tickets and other PRs
TODOs
Implementation
I had to change the signature of opsservice's ValidateServers method so that it also returns the failed probes, which lets the client (installer) process the results and print warnings and failures properly. Previously, it only returned a pre-formatted error. This should not cause any incompatibility issues because the method is only used during installation (not upgrade).
Testing done
Override soft limits to trigger the warnings
Override hard limits to trigger the failures
Restore default limits and make sure install succeeds