Bug 1929944: etcdInsufficientMembers is wrong when etcd is in a pod #1064

Merged
merged 1 commit into from Feb 19, 2021

Conversation

smarterclayton
Contributor

@smarterclayton smarterclayton commented Feb 18, 2021

The upstream etcd alert is incorrect because it excludes only the instance label from its aggregation, but OpenShift runs etcd in a pod, so the pod label must be excluded as well.

Exclude the upstream alert, improve the resiliency of the alert expression, target the alert at the expected job for the cluster etcd (job="etcd"), update the description and health text to explain more clearly what insufficient members means, along with its consequences and some actions to take, and separate the alert into its own rule group to prepare for moving the alert into the cluster-etcd-operator repo in the future.
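
To illustrate why the leftover pod label matters, here is a minimal Python sketch of a `without (...)`-style aggregation. It is a toy model, not a PromQL evaluator, and the instance addresses and pod names are hypothetical. It assumes each series carries job, instance, and pod labels, and that the alert compares the sum of healthy members with (count + 1) / 2 per group.

```python
from collections import defaultdict


def aggregate_without(series, excluded):
    """Group series by every label except those in `excluded`.

    Returns {group_key: (sum_of_values, number_of_series)}.
    """
    groups = defaultdict(lambda: [0, 0])
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in excluded))
        groups[key][0] += value
        groups[key][1] += 1
    return {key: tuple(acc) for key, acc in groups.items()}


def insufficient(series, excluded):
    """True if any group has fewer healthy members than (count + 1) / 2."""
    return any(
        total < (count + 1) / 2
        for total, count in aggregate_without(series, excluded).values()
    )


# Hypothetical three-member cluster with one member down (value 0 = down).
SERIES = [
    ({"job": "etcd", "instance": "10.0.0.1:9979", "pod": "etcd-a"}, 1),
    ({"job": "etcd", "instance": "10.0.0.2:9979", "pod": "etcd-b"}, 1),
    ({"job": "etcd", "instance": "10.0.0.3:9979", "pod": "etcd-c"}, 0),
]

# Excluding only `instance` leaves `pod`, so every member is its own group
# and a single down member already looks like lost quorum.
print(insufficient(SERIES, {"instance"}))
# Excluding both labels yields one group of three with two healthy members.
print(insufficient(SERIES, {"instance", "pod"}))
```

With only `instance` excluded, each group contains a single series, so the majority threshold is 1 and any one member going down trips the comparison even though two of three members are still healthy.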

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Feb 18, 2021
@openshift-ci-robot
Contributor

@smarterclayton: This pull request references Bugzilla bug 1929944, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.7.z" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1929944: etcdInsufficientMembers is wrong when etcd is in a pod

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

The upstream etcd alert is incorrect because it only excludes instance,
but OpenShift runs etcd in a pod and therefore the pod label must be
excluded.

Exclude the upstream alert, improve the resiliency of the alert
expression, target the alert at the expected job for the cluster etcd
(job="etcd"), update the description and health text to explain more
clearly what insufficient members means, along with its consequences
and some actions to take, and separate the alert into its own rule
group to prepare for moving the alert into the cluster-etcd-operator
repo in the future. The alert now includes
etcd_server_has_leader == 1 to ensure that if an instance from a
previous quorum appears we will not consider it part of the majority
calculation. This also flags when we can't establish quorum due to
failures in communication between nodes (but not between monitoring
and etcd).
@s-urbaniak
Contributor

/approve
/cc @hexfusion
for another pair of eyes

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 18, 2021
@smarterclayton
Contributor Author

/bugzilla refresh

@openshift-ci-robot
Contributor

@smarterclayton: An error was encountered adding this pull request to the external tracker bugs for bug 1929944 on the Bugzilla server at https://bugzilla.redhat.com. We were able to detect the following conditions from the error:

  • The Bugzilla server failed to load data from GitHub when creating the bug. This is usually caused by rate-limiting, please try again later.
Full error message. JSONRPC error 32000: There was an error reported for the RPC call to Jira: There was an error reported for a GitHub REST call. URL: https://api.github.com/repos/openshift/cluster-monitoring-operator/pulls/1064 Error: 403 Forbidden at /loader/0x55cef398d888/Bugzilla/Extension/ExternalBugs/Type/GitHub.pm line 111.

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.

In response to this:

/bugzilla refresh

@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Feb 19, 2021
@openshift-ci-robot
Contributor

@smarterclayton: This pull request references Bugzilla bug 1929944, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Feb 19, 2021
@smarterclayton
Contributor Author

/retest

@hexfusion
Contributor

Thanks, running a few quick tests.

@hexfusion
Contributor

I like this. So we are comparing the members that report having a leader

sum(up{job="etcd"} == bool 1 and etcd_server_has_leader{job="etcd"} == bool 1) without (instance,pod)

against the majority

((count(up{job="etcd"}) without (instance,pod) + 1) / 2)

so that if a quorum has been lost for more than 3 minutes we fire a critical alert.
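
The comparison described above can be sketched in Python as follows. This models only the stated intent (count a member if it is up and reports a leader, then compare with the majority); it does not reproduce how PromQL evaluates `== bool` or matches series with `and`. The function name and the tuple layout are assumptions made for illustration.

```python
def has_quorum(members):
    """Return True if enough members are up and report a leader.

    `members` is a list of (up, has_leader) pairs, one per etcd member,
    where each value is 0 or 1.
    """
    healthy = sum(1 for up, has_leader in members if up == 1 and has_leader == 1)
    majority = (len(members) + 1) / 2  # same form as the expression above
    return healthy >= majority


# Three members, one down: two healthy members still meet the majority of 2.
print(has_quorum([(1, 1), (1, 1), (0, 0)]))
# All three are reachable, but two report no leader: quorum is not established.
print(has_quorum([(1, 1), (1, 0), (1, 0)]))
```

The second case is the one the leader check is meant to catch: monitoring can reach every member, yet the members cannot form a majority among themselves.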

I guess my only concern here is that 3m is a really long time to lose quorum. We probably want to know if we ever lose quorum for more than 30s, as a warning? Leader elections are very fast, and if we have functional issues with actions the cluster is taking that result in quorum loss, they are bugs. What do you think?
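
To make the 3m versus 30s trade-off concrete, here is a simplified Python model of an alert that fires only after its condition has held continuously for a given duration. The 30-second evaluation interval and the sample outage are assumptions for illustration, and the model ignores details of how a real rule evaluator tracks pending alerts.

```python
def fires(condition_samples, interval_s, for_s):
    """Return True if the condition holds continuously for at least `for_s` seconds.

    `condition_samples` holds one boolean per evaluation, `interval_s` seconds apart.
    """
    active_since = None
    for i, condition in enumerate(condition_samples):
        now = i * interval_s
        if not condition:
            active_since = None  # any recovery resets the timer
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= for_s:
            return True
    return False


# Hypothetical outage: quorum lost for four evaluations (0s to 90s), then restored.
OUTAGE = [True] * 4 + [False] * 4

print(fires(OUTAGE, interval_s=30, for_s=180))  # 3m threshold
print(fires(OUTAGE, interval_s=30, for_s=30))   # 30s threshold
```

Under these assumptions, a 90-second quorum loss never reaches the 3m threshold but would be reported with a 30s threshold, which is the gap the comment above is asking about.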

@hexfusion
Contributor

I think this is progress, though. I will try to get a grasp on how often we might see quorum loss in the wild and can follow up with warn thresholds.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2021
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, s-urbaniak, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit e60321a into openshift:master Feb 19, 2021
@openshift-ci-robot
Contributor

@smarterclayton: All pull requests linked via external trackers have merged:

Bugzilla bug 1929944 has been moved to the MODIFIED state.

In response to this:

Bug 1929944: etcdInsufficientMembers is wrong when etcd is in a pod

@hexfusion
Contributor

/cherry-pick release-4.7

@openshift-cherrypick-robot

@hexfusion: new pull request created: #1066

In response to this:

/cherry-pick release-4.7

@hexfusion
Contributor

I will grab the changelog

paulfantom added a commit to paulfantom/cluster-monitoring-operator that referenced this pull request Mar 1, 2021
@hexfusion
Contributor

/cherry-pick release-4.6

@openshift-cherrypick-robot

@hexfusion: #1064 failed to apply on top of branch "release-4.6":

Applying: jsonnet/rules: etcdInsufficientMembers is wrong when etcd is in a pod
Using index info to reconstruct a base tree...
M	assets/prometheus-k8s/rules.yaml
M	jsonnet/main.jsonnet
M	jsonnet/rules.jsonnet
Falling back to patching base and 3-way merge...
Auto-merging jsonnet/rules.jsonnet
Auto-merging jsonnet/main.jsonnet
Auto-merging assets/prometheus-k8s/rules.yaml
CONFLICT (content): Merge conflict in assets/prometheus-k8s/rules.yaml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 jsonnet/rules: etcdInsufficientMembers is wrong when etcd is in a pod
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.6

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
6 participants