Bug 1929944: etcdInsufficientMembers is wrong when etcd is in a pod #1064
@smarterclayton: This pull request references Bugzilla bug 1929944, which is invalid:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The upstream etcd alert is incorrect because it only excludes the instance label, but OpenShift runs etcd in a pod, so the pod label must be excluded as well.
This change excludes the upstream alert, improves the resiliency of the alert expression, targets the alert at the expected job for the cluster etcd (job="etcd"), updates the description and health text to explain more clearly what insufficient members means, its consequences, and some actions to address the impact, and separates the alert into its own rule group to prepare for moving it into the cluster-etcd-operator repo in the future.
The alert now includes etcd_server_has_leader == 1 so that an instance from a previous quorum, if one appears, is not counted in the majority calculation. This also flags cases where quorum cannot be established because of communication failures between nodes (but not between monitoring and etcd).
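To illustrate why the pod label matters, here is a minimal Python sketch (not the alert expression from this PR) that mimics an aggregation dropping a set of labels and then applies a majority check to each resulting group. The sample label values, the `group_without` and `insufficient_members` helpers, and the exact form of the majority test are assumptions made for illustration.

```python
from collections import defaultdict


def group_without(samples, excluded):
    """Group samples by their labels after dropping the `excluded` label names.

    `samples` is a list of (labels, has_leader) pairs; this loosely imitates
    an aggregation that removes some labels before combining series.
    """
    groups = defaultdict(list)
    for labels, has_leader in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in excluded))
        groups[key].append(has_leader)
    return dict(groups)


def insufficient_members(has_leader_flags):
    """Return True when fewer members report a leader than a majority needs.

    Integer form of `healthy < (total + 1) / 2` (assumed majority test).
    """
    total = len(has_leader_flags)
    healthy = sum(1 for flag in has_leader_flags if flag)
    return 2 * healthy < total + 1


# Hypothetical three-member cluster: one member does not report a leader.
samples = [
    ({"job": "etcd", "instance": "10.0.0.1:9979", "pod": "etcd-master-0"}, True),
    ({"job": "etcd", "instance": "10.0.0.2:9979", "pod": "etcd-master-1"}, True),
    ({"job": "etcd", "instance": "10.0.0.3:9979", "pod": "etcd-master-2"}, False),
]

# Dropping only `instance` leaves `pod` in the key: one group per member.
only_instance = group_without(samples, {"instance"})
# Dropping both labels yields a single group for the whole cluster.
instance_and_pod = group_without(samples, {"instance", "pod"})
```

With only `instance` dropped, each pod ends up in its own group of one, so a single member without a leader trips the check even though two of the three members still form a majority. Dropping `pod` as well evaluates the cluster as one group, and the check stays quiet.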
277ba3d to b19f564
/approve
/bugzilla refresh
@smarterclayton: An error was encountered adding this pull request to the external tracker bugs for bug 1929944 on the Bugzilla server at https://bugzilla.redhat.com. We were able to detect the following conditions from the error:
Full error message.
JSONRPC error 32000: There was an error reported for the RPC call to Jira: There was an error reported for a GitHub REST call. URL: https://api.github.com/repos/openshift/cluster-monitoring-operator/pulls/1064 Error: 403 Forbidden at /loader/0x55cef398d888/Bugzilla/Extension/ExternalBugs/Type/GitHub.pm line 111.
Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.
/retest
/bugzilla refresh
@smarterclayton: This pull request references Bugzilla bug 1929944, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
/retest
Thanks, running a few quick tests.
I like this: we are comparing the members that report having a leader against the majority, so that if quorum has been lost for more than 3 minutes we fire a critical alert. My only concern is that 3m is a really long time to lose quorum. We probably want to know, as a warning, if we ever lose quorum for more than 30s. Leader elections are very fast, and if actions the cluster takes cause functional issues that result in quorum loss, those are bugs. What do you think?
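The two thresholds discussed in this comment (critical after 3 minutes, a possible warning after 30 seconds) can be sketched in Python as below. This is only an illustration of the idea, not how the alert in this PR or Prometheus itself evaluates durations; the function names and the evaluation format are hypothetical.

```python
def seconds_without_quorum(evaluations):
    """Return how long quorum has been continuously lost at the last evaluation.

    `evaluations` is a list of (timestamp_seconds, quorum_lost) pairs in
    ascending time order (a hypothetical format for this sketch).
    """
    start = None
    for timestamp, lost in evaluations:
        if lost:
            if start is None:
                start = timestamp  # first evaluation of the current loss streak
        else:
            start = None  # quorum recovered, reset the streak
    if start is None:
        return 0.0
    return float(evaluations[-1][0] - start)


def severity(lost_seconds, warn_after=30.0, critical_after=180.0):
    """Map a loss duration to a severity, or None if below both thresholds."""
    if lost_seconds > critical_after:
        return "critical"
    if lost_seconds > warn_after:
        return "warning"
    return None


# Quorum lost from t=15s through t=60s: 45 seconds of continuous loss.
short_loss = [(0, False), (15, True), (30, True), (60, True)]
# Quorum lost from t=0s through t=200s.
long_loss = [(0, True), (60, True), (120, True), (200, True)]
```

Under these assumed thresholds, `short_loss` would only reach the proposed warning level, while `long_loss` would cross the 3 minute critical threshold.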
I think this is progress, though. I will try to get a grasp on how often we might see quorum loss in the wild and can follow up with warn thresholds.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hexfusion, s-urbaniak, smarterclayton
The full list of commands accepted by this bot can be found here. The pull request process is described here.
@smarterclayton: All pull requests linked via external trackers have merged: Bugzilla bug 1929944 has been moved to the MODIFIED state.
/cherry-pick release-4.7
@hexfusion: new pull request created: #1066
I will grab the changelog.
/cherry-pick release-4.6
@hexfusion: #1064 failed to apply on top of branch "release-4.6":