Bug 2092395: etcdHighNumberOfFailedGRPCRequests alerts with wrong results #843
Conversation
@tjungblu: This pull request references Bugzilla bug 2092395, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/cherry-pick release-4.10

@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.10 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/retest

/retest-required

related to changes in openshift/cluster-etcd-operator#843

runbook update here: openshift/runbooks#55

/retest-required
2 similar comments

/retest-required

/retest-required

@wking by popular demand of the etcd team, do you also want to take a look?

Unless Trevor has any thoughts, lgtm

/lgtm

/approve
runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
summary: etcd cluster 99th percentile fsync durations are too high.
nit: etcdHighFsyncDurations is not claiming a specific percentile, so you might want to go more generic here in summary with something like "etcd cluster fsync durations are too high.". The description should definitely call out 99th percentile. That would give you space for extending your existing series of graduated etcdHighFsyncDurations with versions that set different thresholds for different percentiles, and still have them all rolled up into a single reporting entry.
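One way that suggestion could look in rule form (a hedged sketch; the expression, threshold, and severity below are illustrative, not the shipped rule):

```yaml
# Illustrative only: a generic summary that graduated variants could share,
# with the specific percentile called out in the description instead.
- alert: etcdHighFsyncDurations
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: etcd cluster fsync durations are too high.
    description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync
      durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
```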
The same feedback applies to some of the other summaries, like the one for etcdHighCommitDurations.
that's from the upstream mixin: https://github.com/etcd-io/etcd/blob/main/contrib/mixin/mixin.libsonnet#L168-L182
nit: etcdHighFsyncDurations is not claiming a specific percentile, so you might want to go more generic here in summary with something like "etcd cluster fsync durations are too high.".
hmm, are you sure? it does say histogram_quantile(0.99, ...) in the expression of the alert.
description: 'etcd cluster "{{ $labels.job }}": database size exceeds the defined quota on etcd instance {{ $labels.instance }}, please defrag or increase the quota as the writes to etcd will be disabled when it is full.'
"... exceeds the defined quota..." is a bit premature. I'd just say "... is {{ $value }}% of the defined quota..." or some such.
that's also from upstream: https://github.com/etcd-io/etcd/blob/main/contrib/mixin/mixin.libsonnet#L213-L226
but I agree, this is a rather strange description.
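A sketch of that rewording (illustrative only; the annotation layout is assumed from the surrounding quoted manifest, not copied from it):

```yaml
# Illustrative rewording: state how full the database actually is instead of
# claiming the quota is already exceeded.
annotations:
  description: 'etcd cluster "{{ $labels.job }}": database size is {{ $value }}%
    of the defined quota on etcd instance {{ $labels.instance }}, please defrag
    or increase the quota as the writes to etcd will be disabled when it is full.'
```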
runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdBackendQuotaLowSpace.md
summary: etcd cluster database is running full.
"running full" is not very specific about what is full. Maybe rephrase to "etcd cluster database size is near quota" or something?
same upstream mixin as above: https://github.com/etcd-io/etcd/blob/main/contrib/mixin/mixin.libsonnet#L213-L226
I think this one is actually fine, but it diverges from its alert description above. Should definitely be updated upstream.
description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes leading to 50% increase in database size over the past four hours on etcd instance {{ $labels.instance }}, please check as it might be disruptive.'
summary: etcd cluster database growing very fast.
expr: |
  increase(((etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100)[240m:1m]) > 50
unrelated nit, but putting etcd_server_quota_backend_bytes inside the increase could lead to false-positives if the user scales down the quota because they had overprovisioned. What you care about is consumption vs. the current quota, so maybe:
increase(etcd_mvcc_db_total_size_in_bytes[4h]) / etcd_server_quota_backend_bytes *100 > 50
with a description like:
etcd cluster "{{ $labels.job }}": etcd database size increased {{ $value }}% of the configured quota over the past four hours on etcd instance {{ $labels.instance }}. Please defrag (FIXME: doc link?) or increase the quota (FIXME: doc link?) as the writes to etcd will be disabled when it is full. Alternatively, investigate consumption and see if you can remove whatever's spewing this data into etcd (FIXME: reword. Doc link).
Good point, let's take that upstream too.
I think OCP customers currently don't have the option to change the size, so unless you go completely unmanaged with etcd, it's not possible to get into such a scenario.
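The false-positive scenario from the review comment can be made concrete with toy numbers (a Python sketch; the values are invented for illustration, not real metric data):

```python
# Database size stays flat at 1.5 GiB, but an admin scales the quota down
# from 8 GiB to 2 GiB inside the 4-hour window.
db_size_start, db_size_end = 1.5, 1.5  # GiB, no actual growth
quota_start, quota_end = 8.0, 2.0      # GiB, lowered mid-window

# Original expression: increase(((size/quota)*100)[240m:1m]).
# The ratio itself jumps, so the alert sees "growth" without any writes.
ratio_increase = (db_size_end / quota_end - db_size_start / quota_start) * 100
print(ratio_increase)  # 56.25 -> crosses the > 50 threshold

# Suggested expression: increase(size[4h]) / current quota * 100.
growth_vs_quota = (db_size_end - db_size_start) / quota_end * 100
print(growth_vs_quota)  # 0.0 -> no false positive
```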
/retest-required

/lgtm

[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dusk125, EmilyM1, tjungblu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing

/retest-required

/retest-required
1 similar comment

/retest-required

/retest-required

/retest-required
1 similar comment

/retest-required

.......

overriding the jobs now, dunno why prow wanted to retest the whole thing after all was green already...
/override ci/prow/e2e-aws-serial

@tjungblu: tjungblu unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest-required
1 similar comment

/retest-required

@tjungblu: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Merged commit 28a4ae4 into openshift:master
@tjungblu: All pull requests linked via external trackers have merged: Bugzilla bug 2092395 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tjungblu: new pull request created: #850 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tjungblu: #843 failed to apply on top of branch "release-4.9": In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
No description provided.