OCPBUGS-1130: increase etcdGRPCRequestsSlow thresholds #932
Conversation
@tjungblu: This pull request references Jira Issue OCPBUGS-1130, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm 🎉
/lgtm
/jira refresh
@hasbro17: This pull request references Jira Issue OCPBUGS-1130, which is valid. 3 validation(s) were run on this bug.
@@ -130,6 +119,17 @@ spec:
      severity: warning
- name: openshift-etcd.rules
  rules:
  - alert: etcdGRPCRequestsSlow
    annotations:
      description: 'etcd cluster "{{ $labels.%s }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method.'
Are you sure you want `$labels.%s` here instead of `$labels.job`?
/hold
good catch, thanks - fixing
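(For reference, the corrected annotation presumably ends up as the snippet below; this is just the description line from the hunk above with the stray format placeholder swapped for the job label, not a verified quote of the follow-up commit.)

```yaml
  - alert: etcdGRPCRequestsSlow
    annotations:
      description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method.'
```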
      > 1
    for: 30m
    labels:
      severity: critical
`critical` is midnight-page territory. And currently the only direct mitigation mentioned in the runbook is running a defrag. Perhaps the alert should pivot from "right now things are awful!" to "the past day has been pretty weak", and it could be a `warning` so folks could start thinking about provisioning faster disks, or whatever, in the near future? Because it's hard for me to imagine rolling out of bed to run a defrag, or to scale up my control plane disks, and feeling like that was a healthy UX.
as mentioned above, I'm going to also update the runbook. Will send you the link to the updated one later today.
I would definitely want to get woken up if etcd takes longer than 1s to respond to anything. I'll revise the mitigation.
> so folks could start thinking about provisioning faster disks
On BM that makes sense, in clouds you can easily scale up your VM type to unlock more IOPS or a faster CPU in a couple of minutes.
> I would definitely want to get woken up if etcd takes longer than 1s to respond to anything.

Would you like to have had a `warning` ping the day before that response times were frequently up over 0.25s? Or would that be too noisy, and you'd rather not hear about that and take the midnight page if/when it gets up to 1s?
I would, if we assume that this is something that degrades slowly over time. The failures we observe in CI are transient; they come briefly and go away again. Would you proactively change anything in the build cluster when you see such a warning the day before?
Maybe we can find some folks over at SD to see whether their ROSA/ARO clusters would benefit from such a warning. I'm not sure they would find it actionable as such, and it would just drown in the rest of their 400 alerts a day.
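For illustration only, a lower-severity companion alert along the lines floated above might look roughly like the sketch below. The 0.25s threshold and the day-long `for` window are taken from the numbers mentioned in this thread, not from anything actually proposed in the PR.

```yaml
# Hypothetical warning-level variant; threshold and duration are illustrative only.
- alert: etcdGRPCRequestsSlowWarning
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests has been above 0.25s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method.'
    summary: etcd gRPC requests have been slow over a long window
  expr: |
    histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method!="Defragment", grpc_type="unary"}[10m])) without(grpc_type))
    > 0.25
  for: 1d
  labels:
    severity: warning
```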
      summary: etcd grpc requests are slow
    expr: |
      histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method!="Defragment", grpc_type="unary"}[10m])) without(grpc_type))
      > 1
We collect a bunch of etcd performance metrics in Telemetry. Maybe we can process those, and possibly also correlate to other signs of cluster distress, to motivate particular thresholds? No worries if we want to take a guess now, and circle back later when we have more time for analysis.
we have this in our backlog already in:
https://issues.redhat.com/browse/ETCD-144
feel free to add anything that's missing from your PoV.
As for that threshold: sadly we don't gather the gRPC latencies, so there's not much we can derive from Telemeter.
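(If someone did want fleet data for this one day, the usual route would be a recording rule that could then be proposed for the Telemetry allow-list. A minimal sketch built on the same expression as the alert; the rule name here is made up, and as noted above this series is not shipped via Telemetry today.)

```yaml
# Hypothetical recording rule; only sketches what would have to exist before
# thresholds could be derived from fleet-wide data.
- record: etcd:grpc_server_handling_seconds:p99_10m
  expr: |
    histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method!="Defragment", grpc_type="unary"}[10m])) without(grpc_type))
```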
@Elbehery can you please take a look again? I fixed a label for @wking and updated the runbook here:
/hold cancel
/cherry-pick release-4.11
@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.11 in a new PR and assign it to you.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Elbehery, hasbro17, tjungblu. The full list of commands accepted by this bot can be found here; the pull request process is described here.
/test e2e-aws-ovn-serial
The current pass rate via Sippy is 10% on CI and 20% on nightly for aws-ovn-serial jobs. I'm really not sure this job should be holding up this PR at this point.
@tjungblu: The following test failed:
Full PR test history. Your PR dashboard.
@tjungblu: Some pull requests linked via external trackers have merged. The following pull requests linked via external trackers have not merged:
These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh. Jira Issue OCPBUGS-1130 has not been moved to the MODIFIED state.
@tjungblu: new pull request created: #934
This replaces the upstream alert with something that's a lot less sensitive to bad etcd (fsync) latency.
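Pieced together from the hunks quoted in the review threads above (and with the `$labels.job` fix applied), the new rule reads roughly as follows; treat this as a reconstruction of the diff rather than the exact merged file:

```yaml
- name: openshift-etcd.rules
  rules:
  - alert: etcdGRPCRequestsSlow
    annotations:
      description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method.'
      summary: etcd grpc requests are slow
    expr: |
      histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method!="Defragment", grpc_type="unary"}[10m])) without(grpc_type))
      > 1
    for: 30m
    labels:
      severity: critical
```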