
OCPBUGS-1130: increase etcdGRPCRequestsSlow thresholds #932

Merged: 1 commit merged into openshift:master on Sep 21, 2022

Conversation

@tjungblu (Contributor) commented Sep 20, 2022:

This replaces the upstream alert with something that's a lot less sensitive to bad etcd (fsync) latency.

@openshift-ci-robot added the jira/invalid-bug label (indicates that the referenced Jira bug is invalid for the branch this PR is targeting) on Sep 20, 2022.
@openshift-ci-robot:

@tjungblu: This pull request references Jira Issue OCPBUGS-1130, which is invalid:

  • expected the bug to target the "4.12.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This replaces the upstream alert with something that's a lot less sensitive to bad etcd (fsync) latency.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tjungblu (Contributor Author):

TODO update the runbook: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdGRPCRequestsSlow.md

@Elbehery (Contributor):

/lgtm 🎉

@openshift-ci-robot:

@tjungblu: This pull request references Jira Issue OCPBUGS-1130, which is invalid:

  • expected the bug to target the "4.12.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

This replaces the upstream alert with something that's a lot less sensitive to bad etcd (fsync) latency.

@tjungblu (Contributor Author) commented Sep 20, 2022:

@dusk125 / @hasbro17 PTAL so we can unblock the TRT team

@hasbro17 (Contributor):

/lgtm
/retest-required

@hasbro17 (Contributor):

/jira refresh

@openshift-ci-robot added the jira/valid-bug and bugzilla/valid-bug labels and removed the jira/invalid-bug label on Sep 20, 2022.
@openshift-ci-robot:

@hasbro17: This pull request references Jira Issue OCPBUGS-1130, which is valid.

3 validations were run on this bug:
  • bug is open, matching expected state (open)
  • bug target version (4.12.0) matches configured target version for branch (4.12.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

/jira refresh

@openshift-ci bot added the lgtm and approved labels on Sep 20, 2022.
@openshift-ci-robot:

/retest-required

Remaining retests: 0 against base HEAD 45a07ee and 2 for PR HEAD 25ace42 in total

@@ -130,6 +119,17 @@ spec:
severity: warning
- name: openshift-etcd.rules
rules:
- alert: etcdGRPCRequestsSlow
annotations:
description: 'etcd cluster "{{ $labels.%s }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method.'
Member:

Are you sure you want $labels.%s here instead of $labels.job?

/hold

@tjungblu (Contributor Author):

good catch, thanks - fixing
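
For reference, the fixed annotation (switching the stray %s format placeholder to the job label, as suggested above) would presumably read:

  description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method.'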

@openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Sep 21, 2022.
> 1
for: 30m
labels:
severity: critical
@wking (Member) commented Sep 21, 2022:

critical is midnight-page territory. And currently the only direct mitigation mentioned in the runbook is running a defrag. Perhaps the alert should pivot from "right now things are awful!" to "the past day has been pretty weak", and it could be a warning so folks could start thinking about provisioning faster disks, or whatever, in the near future? Because it's hard for me to imagine rolling out of bed to run a defrag, or to scale up my control plane disks, and feeling like that was a healthy UX.

@tjungblu (Contributor Author):

As mentioned above, I'm going to update the runbook as well; I'll send you the link to the updated version later today.
I would definitely want to get woken up if etcd takes longer than 1s to respond to anything. I'll revise the mitigation.

"so folks could start thinking about provisioning faster disks"

On bare metal that makes sense; in the clouds you can easily scale up your VM type to unlock more IOPS or a faster CPU in a couple of minutes.

Member:

"I would definitely want to get woken up if etcd takes longer than 1s to respond to anything."

Would you like to have had a warning ping the day before that response times were frequently up over 0.25s? Or would that be too noisy, and you'd rather not hear about that and take the midnight page if/when it gets up to 1s?

@tjungblu (Contributor Author):

I would, if we assume that this is something that degrades slowly over time. The failures we observe in CI are transient: they come briefly and go away again. Would you proactively change anything in the build cluster when you see such a warning the day before?

Maybe we can find some folks over at SD to see whether their ROSA/ARO clusters would benefit from such a warning. I'm not sure they would find it actionable as such; it might just drown in the rest of their 400 alerts a day.
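
(For illustration only, not part of this PR: a warning-tier companion along the lines discussed above might look roughly like the sketch below. The alert name, the 0.25s threshold, the one-day rate window, and the for duration are assumptions taken from the conversation, not from the merged change.)

  - alert: etcdGRPCRequestsSlowWarning  # hypothetical name, not in the merged rule
    expr: |
      histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method!="Defragment", grpc_type="unary"}[1d])) without(grpc_type))
      > 0.25
    for: 1h
    labels:
      severity: warning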

summary: etcd grpc requests are slow
expr: |
histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method!="Defragment", grpc_type="unary"}[10m])) without(grpc_type))
> 1
Member:

We collect a bunch of etcd performance metrics in Telemetry. Maybe we can process those, and possibly also correlate to other signs of cluster distress, to motivate particular thresholds? No worries if we want to take a guess now, and circle back later when we have more time for analysis.

@tjungblu (Contributor Author):

We already have this in our backlog:
https://issues.redhat.com/browse/ETCD-144

Feel free to add anything that's missing from your PoV.

@tjungblu (Contributor Author):

As for that threshold: we sadly don't gather the gRPC latencies, so there isn't much we can derive from Telemeter.
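
Putting the hunks quoted in this review together, and applying the $labels.job fix from the follow-up commit, the new rule as merged should look roughly like this; the exact ordering and indentation are assumptions based on the snippets above:

  - alert: etcdGRPCRequestsSlow
    annotations:
      description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method.'
      summary: etcd grpc requests are slow
    expr: |
      histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method!="Defragment", grpc_type="unary"}[10m])) without(grpc_type))
      > 1
    for: 30m
    labels:
      severity: critical

The 1s threshold over a 10m window, combined with the 30m for clause, is what makes this version less sensitive to transient fsync latency spikes than the upstream alert it replaces.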

@openshift-ci bot removed the lgtm label on Sep 21, 2022.
@tjungblu (Contributor Author):

@Elbehery can you please take a look again? I fixed a label for @wking and updated the runbook here:
openshift/runbooks#69

@tjungblu (Contributor Author):

/hold cancel

@openshift-ci bot removed the do-not-merge/hold label on Sep 21, 2022.
@tjungblu (Contributor Author):

/cherry-pick release-4.11

@openshift-cherrypick-robot

@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.11 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.11

@Elbehery (Contributor):

/lgtm

@openshift-ci bot added the lgtm label on Sep 21, 2022.
@openshift-ci bot commented Sep 21, 2022:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Elbehery, hasbro17, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot:

/retest-required

Remaining retests: 0 against base HEAD 45a07ee and 2 for PR HEAD b54cf99 in total

@soltysh (Member) commented Sep 21, 2022:

/test e2e-aws-ovn-serial

@kikisdeliveryservice (Contributor) commented Sep 21, 2022:

The current pass rate via Sippy is 10% on CI and 20% on nightly for the aws-ovn-serial jobs. I'm really not sure this job should be holding up this PR at this point.

@tjungblu (Contributor Author) commented Sep 21, 2022:

@hasbro17 and @dusk125 can help you out with overrides during the US working hours if this continues to fail.

@openshift-ci bot commented Sep 21, 2022:

@tjungblu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-gcp-ovn-five-control-plane-replicas
Commit: b54cf99
Required: false
Rerun command: /test e2e-gcp-ovn-five-control-plane-replicas

@openshift-merge-robot merged commit ac362e9 into openshift:master on Sep 21, 2022.
@openshift-ci-robot:

@tjungblu: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-1130 has not been moved to the MODIFIED state.

In response to this:

This replaces the upstream alert with something that's a lot less sensitive to bad etcd (fsync) latency.

@openshift-cherrypick-robot

@tjungblu: new pull request created: #934

In response to this:

/cherry-pick release-4.11

@kikisdeliveryservice (Contributor):

aws-ovn-serial heard me complain and proved me wrong 😆

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
lgtm: Indicates that a PR is ready to be merged.