Bug 2092395: etcdHighNumberOfFailedGRPCRequests alerts with wrong results #843

Merged
merged 2 commits into from Jun 9, 2022

tjungblu
Contributor

@tjungblu tjungblu commented Jun 1, 2022

No description provided.

@openshift-ci openshift-ci bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jun 1, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jun 1, 2022

@tjungblu: This pull request references Bugzilla bug 2092395, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @geliu2016

In response to this:

Bug 2092395: etcdHighNumberOfFailedGRPCRequests alerts with wrong results

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tjungblu
Contributor Author

tjungblu commented Jun 1, 2022

/cherry-pick release-4.10
/cherry-pick release-4.9

@openshift-cherrypick-robot

@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.10 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.10
/cherry-pick release-4.9
/cherry-pick release-4.8
/cherry-pick release-4.7

@Elbehery
Contributor

Elbehery commented Jun 1, 2022

/retest

@tjungblu
Contributor Author

tjungblu commented Jun 1, 2022

/retest-required

@tjungblu
Contributor Author

tjungblu commented Jun 1, 2022

runbook update here: openshift/runbooks#55

@tjungblu
Contributor Author

tjungblu commented Jun 1, 2022

/retest-required

@tjungblu
Contributor Author

tjungblu commented Jun 2, 2022

/retest-required

@tjungblu
Contributor Author

tjungblu commented Jun 2, 2022

/retest-required

@tjungblu
Contributor Author

tjungblu commented Jun 2, 2022

@wking by popular demand of the etcd team, do you also want to take a look?

@tjungblu tjungblu requested a review from wking June 2, 2022 14:27
@dusk125
Contributor

dusk125 commented Jun 2, 2022

@wking by popular demand of the etcd team, do you also want to take a look?

Unless Trevor has any thoughts, lgtm

@EmilyM1

EmilyM1 commented Jun 2, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 2, 2022
@dusk125
Contributor

dusk125 commented Jun 3, 2022

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2022
@openshift-ci-robot

/retest-required

Remaining retests: 2 against base HEAD 7e439f3 and 8 for PR HEAD 05b302a in total

runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
summary: etcd cluster 99th percentile fsync durations are too high.
Member

nit: etcdHighFsyncDurations is not claiming a specific percentile, so you might want to go more generic here in summary with something like "etcd cluster fsync durations are too high.". The description should definitely call out 99th percentile. That would give you space for extending your existing series of graduated etcdHighFsyncDurations with versions that set different thresholds for different percentiles, and still have them all rolled up into a single reporting entry.

Member

Same feedback applies to some other summary, like the one for etcdHighCommitDurations.

Contributor Author

that's from the upstream mixin: https://github.com/etcd-io/etcd/blob/main/contrib/mixin/mixin.libsonnet#L168-L182

nit: etcdHighFsyncDurations is not claiming a specific percentile, so you might want to go more generic here in summary with something like "etcd cluster fsync durations are too high.".

hmm, are you sure? it does say histogram_quantile(0.99, ...) in the expression of the alert.
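
For context on what `histogram_quantile(0.99, ...)` evaluates to, here is a minimal Python sketch of the bucket interpolation, assuming cumulative `le` buckets of the kind a Prometheus histogram exposes. The function name, the bucket bounds, and the sample counts are illustrative assumptions, not values taken from the alert or from etcd. It approximates the documented behaviour and is not the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last upper_bound being +Inf (hypothetical input format).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                # Empty bucket matched (only possible for rank 0): no interpolation.
                return bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up fsync duration buckets in seconds: (le, cumulative count).
sample_buckets = [
    (0.001, 0),
    (0.002, 50),
    (0.004, 90),
    (0.008, 98),
    (0.016, 100),
    (math.inf, 100),
]
p99 = histogram_quantile(0.99, sample_buckets)
```

With these made-up samples, the 99th observation out of 100 lands in the 0.008 to 0.016 second bucket, so the estimate is interpolated between those two bounds.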

- description: 'etcd cluster "{{ $labels.job }}": database size exceeds the
- defined quota on etcd instance {{ $labels.instance }}, please defrag or
- increase the quota as the writes to etcd will be disabled when it is full.'
+ description: 'etcd cluster "{{ $labels.job }}": database size exceeds the defined quota on etcd instance {{ $labels.instance }}, please defrag or increase the quota as the writes to etcd will be disabled when it is full.'
Member

"... exceeds the defined quota..." is a bit premature. I'd just say "... is {{ $value }}% of the defined quota..." or some such.

Contributor Author

that's also from upstream: https://github.com/etcd-io/etcd/blob/main/contrib/mixin/mixin.libsonnet#L213-L226

but I agree, this is a rather strange description.

runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdBackendQuotaLowSpace.md
summary: etcd cluster database is running full.
Member

"running full" is not very specific about what is full. Maybe rephrase to "etcd cluster database size is near quota" or something?

Contributor Author

same upstream mixin as above: https://github.com/etcd-io/etcd/blob/main/contrib/mixin/mixin.libsonnet#L213-L226

I think this one is actually fine, but it diverges from its alert description above. Should definitely be updated upstream.

- leading to 50% increase in database size over the past four hours on etcd
- instance {{ $labels.instance }}, please check as it might be disruptive.'
+ description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes leading to 50% increase in database size over the past four hours on etcd instance {{ $labels.instance }}, please check as it might be disruptive.'
summary: etcd cluster database growing very fast.
expr: |
increase(((etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100)[240m:1m]) > 50
Member

unrelated nit, but putting etcd_server_quota_backend_bytes inside the increase could lead to false positives if the user scales down the quota because they had overprovisioned. What you care about is consumption vs. the current quota, so maybe:

increase(etcd_mvcc_db_total_size_in_bytes[4h]) / etcd_server_quota_backend_bytes *100 > 50

with a description like:

etcd cluster "{{ $labels.job }}": etcd database size increased {{ $value }}% of the configured quota over the past four hours on etcd instance {{ $labels.instance }}. Please defrag (FIXME: doc link?) or increase the quota (FIXME: doc link?) as the writes to etcd will be disabled when it is full. Alternatively, investigate consumption and see if you can remove whatever's spewing this data into etcd (FIXME: reword. Doc link).
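
The false positive described above can be illustrated with a small Python sketch. It compares the two formulations on made-up samples where the database size stays flat while the quota is lowered. It models the four-hour change as a plain last-minus-first difference, which is a simplifying assumption: PromQL's `increase()` also handles counter resets and extrapolation, and neither is modelled here. The function names and sample values are hypothetical.

```python
GIB = 1024 ** 3


def growth_with_quota_inside(sizes, quotas):
    """Change in (size / quota * 100) between the first and last sample.

    Stands in for the existing expression, where the quota is part of the
    series whose change is measured.
    """
    first = sizes[0] / quotas[0] * 100
    last = sizes[-1] / quotas[-1] * 100
    return last - first


def growth_vs_current_quota(sizes, quotas):
    """Change in size over the window, as a percentage of the current quota.

    Stands in for the suggested expression, where only the database size is
    differenced and the result is divided by the latest quota.
    """
    return (sizes[-1] - sizes[0]) / quotas[-1] * 100


# Database size is flat at 3 GiB; the quota is lowered from 16 GiB to 4 GiB.
sizes = [3 * GIB] * 5
quotas = [16 * GIB, 16 * GIB, 16 * GIB, 4 * GIB, 4 * GIB]

inside = growth_with_quota_inside(sizes, quotas)
outside = growth_vs_current_quota(sizes, quotas)

THRESHOLD = 50
fires_inside = inside > THRESHOLD
fires_outside = outside > THRESHOLD
```

In this sketch the first formulation crosses the 50% threshold even though no data was written, while the second reports no growth.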

Contributor Author

Good point, let's take that upstream too.

I think OCP customers currently don't have the option to change the size, so unless you run etcd completely unmanaged, it's not possible to get into such a scenario.

@tjungblu
Contributor Author

tjungblu commented Jun 7, 2022

/retest-required

@dusk125
Contributor

dusk125 commented Jun 7, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 7, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jun 7, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, EmilyM1, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tjungblu
Contributor Author

tjungblu commented Jun 7, 2022

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 2 against base HEAD 7e439f3 and 8 for PR HEAD 5c7d414 in total

@openshift-ci-robot

/retest-required

Remaining retests: 1 against base HEAD 7e439f3 and 7 for PR HEAD 5c7d414 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 7e439f3 and 6 for PR HEAD 5c7d414 in total

@tjungblu
Contributor Author

tjungblu commented Jun 8, 2022

/retest-required

@tjungblu
Contributor Author

tjungblu commented Jun 8, 2022

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 2 against base HEAD 82bf899 and 5 for PR HEAD 5c7d414 in total

@tjungblu
Contributor Author

tjungblu commented Jun 9, 2022

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 1 against base HEAD 82bf899 and 4 for PR HEAD 5c7d414 in total

@tjungblu
Contributor Author

tjungblu commented Jun 9, 2022

/retest-required

@tjungblu
Contributor Author

tjungblu commented Jun 9, 2022

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 82bf899 and 3 for PR HEAD 5c7d414 in total

@tjungblu
Contributor Author

tjungblu commented Jun 9, 2022

.......

@tjungblu
Contributor Author

tjungblu commented Jun 9, 2022

overriding the jobs now, dunno why prow wanted to retest the whole thing after all was green already...

/override ci/prow/e2e-aws-serial

@openshift-ci
Contributor

openshift-ci bot commented Jun 9, 2022

@tjungblu: tjungblu unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file.

In response to this:

overriding the jobs now, dunno why prow wanted to retest the whole thing after all was green already...

/override ci/prow/e2e-aws-serial

@tjungblu
Contributor Author

tjungblu commented Jun 9, 2022

/retest-required

@tjungblu
Contributor Author

tjungblu commented Jun 9, 2022

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Jun 9, 2022

@tjungblu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-five-control-plane-replicas 5c7d414 link false /test e2e-gcp-five-control-plane-replicas

Full PR test history. Your PR dashboard.

@openshift-ci-robot

/retest-required

Remaining retests: 2 against base HEAD b71793a and 2 for PR HEAD 5c7d414 in total

@openshift-merge-robot openshift-merge-robot merged commit 28a4ae4 into openshift:master Jun 9, 2022
10 of 11 checks passed
@openshift-ci
Contributor

openshift-ci bot commented Jun 9, 2022

@tjungblu: All pull requests linked via external trackers have merged:

Bugzilla bug 2092395 has been moved to the MODIFIED state.

In response to this:

Bug 2092395: etcdHighNumberOfFailedGRPCRequests alerts with wrong results

@openshift-cherrypick-robot

@tjungblu: new pull request created: #850

In response to this:

/cherry-pick release-4.10
/cherry-pick release-4.9

@openshift-cherrypick-robot

@tjungblu: #843 failed to apply on top of branch "release-4.9":

Applying: Bug 2092395: etcdHighNumberOfFailedGRPCRequests alerts with wrong results
Using index info to reconstruct a base tree...
M	jsonnet/custom.libsonnet
M	jsonnet/jsonnetfile.lock.json
M	jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/mixin.libsonnet
M	manifests/0000_90_etcd-operator_03_prometheusrule.yaml
Falling back to patching base and 3-way merge...
Auto-merging manifests/0000_90_etcd-operator_03_prometheusrule.yaml
CONFLICT (content): Merge conflict in manifests/0000_90_etcd-operator_03_prometheusrule.yaml
Auto-merging jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/mixin.libsonnet
CONFLICT (content): Merge conflict in jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/mixin.libsonnet
Auto-merging jsonnet/jsonnetfile.lock.json
CONFLICT (content): Merge conflict in jsonnet/jsonnetfile.lock.json
Auto-merging jsonnet/custom.libsonnet
CONFLICT (content): Merge conflict in jsonnet/custom.libsonnet
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Bug 2092395: etcdHighNumberOfFailedGRPCRequests alerts with wrong results
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.10
/cherry-pick release-4.9

@tjungblu tjungblu deleted the alert branch November 28, 2022 16:43
8 participants