[release-4.7] Bug 1930876: etcdInsufficientMembers is wrong when etcd is in a pod #1066

Merged
28 changes: 18 additions & 10 deletions assets/prometheus-k8s/rules.yaml
@@ -961,6 +961,24 @@ spec:
offset 25s) or (absent(cluster:usage:workload:capacity_physical_cpu_core_seconds
offset 25s)*0))
record: cluster:usage:workload:capacity_physical_cpu_core_seconds
+ - name: openshift-etcd.rules
+ rules:
+ - alert: etcdInsufficientMembers
+ annotations:
+ message: etcd is reporting fewer instances are available than are needed ({{
+ $value }}). When etcd does not have a majority of instances available the
+ Kubernetes and OpenShift APIs will reject read and write requests and operations
+ that preserve the health of workloads cannot be performed. This can occur
+ when multiple control plane nodes are powered off or are unable to connect
+ to each other via the network. Check that all control plane nodes are powered
+ on and that network connections between each machine are functional.
+ summary: etcd is reporting that a majority of instances are unavailable.
+ expr: sum(up{job="etcd"} == bool 1 and etcd_server_has_leader{job="etcd"} ==
+ bool 1) without (instance,pod) < ((count(up{job="etcd"}) without (instance,pod)
+ + 1) / 2)
+ for: 3m
+ labels:
+ severity: critical
- name: openshift-ingress.rules
rules:
- expr: sum by (code) (rate(haproxy_server_http_responses_total[5m]) > 0)
@@ -2335,16 +2353,6 @@ spec:
for: 10m
labels:
severity: critical
- - alert: etcdInsufficientMembers
- annotations:
- description: 'etcd cluster "{{ $labels.job }}": insufficient members ({{ $value
- }}).'
- summary: etcd cluster has insufficient number of members.
- expr: |
- sum(up{job=~".*etcd.*"} == bool 1) without (instance) < ((count(up{job=~".*etcd.*"}) without (instance) + 1) / 2)
- for: 3m
- labels:
- severity: critical
- alert: etcdNoLeader
annotations:
description: 'etcd cluster "{{ $labels.job }}": member {{ $labels.instance
2 changes: 1 addition & 1 deletion jsonnet/main.jsonnet
@@ -37,7 +37,7 @@ local kp = (import 'kube-prometheus/kube-prometheus.libsonnet') +
std.map(
function(ruleGroup)
if ruleGroup.name == 'etcd' then
- ruleGroup { rules: std.filter(function(rule) !('alert' in rule && rule.alert == 'etcdHighNumberOfFailedGRPCRequests'), ruleGroup.rules) }
+ ruleGroup { rules: std.filter(function(rule) !('alert' in rule && (rule.alert == 'etcdHighNumberOfFailedGRPCRequests' || rule.alert == 'etcdInsufficientMembers')), ruleGroup.rules) }
else if ruleGroup.name == 'kubernetes-system' then
ruleGroup { rules: std.filter(function(rule) !('alert' in rule && rule.alert == 'KubeVersionMismatch'), ruleGroup.rules) }
// Removing CPUThrottlingHigh alert as per https://bugzilla.redhat.com/show_bug.cgi?id=1843346
17 changes: 17 additions & 0 deletions jsonnet/rules.jsonnet
@@ -371,6 +371,23 @@ local droppedKsmLabels = 'endpoint, instance, job, pod, service';
},
],
},
+ {
+ name: 'openshift-etcd.rules',
+ rules: [
+ {
+ expr: 'sum(up{job="etcd"} == bool 1 and etcd_server_has_leader{job="etcd"} == bool 1) without (instance,pod) < ((count(up{job="etcd"}) without (instance,pod) + 1) / 2)',
+ alert: 'etcdInsufficientMembers',
+ 'for': '3m',
+ annotations: {
+ message: 'etcd is reporting fewer instances are available than are needed ({{ $value }}). When etcd does not have a majority of instances available the Kubernetes and OpenShift APIs will reject read and write requests and operations that preserve the health of workloads cannot be performed. This can occur when multiple control plane nodes are powered off or are unable to connect to each other via the network. Check that all control plane nodes are powered on and that network connections between each machine are functional.',
+ summary: 'etcd is reporting that a majority of instances are unavailable.',
+ },
+ labels: {
+ severity: 'critical',
+ },
+ },
+ ],
+ },
{
name: 'openshift-ingress.rules',
rules: [
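
For reviewers who want to sanity-check the threshold in the new `etcdInsufficientMembers` expression, the comparison `available < (count + 1) / 2` can be worked through with a small Python sketch. This is only an illustration of the arithmetic, not part of the change: the function names `count_available` and `insufficient_members` are hypothetical, and the sketch assumes a member counts as available when it is both up and reporting a leader, which is the apparent intent of the rule. It does not model PromQL vector matching or the `without (instance,pod)` aggregation.

```python
from typing import Iterable, Tuple


def count_available(members: Iterable[Tuple[bool, bool]]) -> int:
    """Count members that are both up and reporting a leader.

    Each member is an (is_up, has_leader) pair. This is an assumed
    simplification of the rule's left-hand side, not PromQL semantics.
    """
    return sum(1 for is_up, has_leader in members if is_up and has_leader)


def insufficient_members(available: int, total: int) -> bool:
    """Apply the same comparison as the rule: available < (total + 1) / 2."""
    return available < (total + 1) / 2


if __name__ == "__main__":
    # Hypothetical three-member cluster: one healthy member, one that is up
    # but has no leader, and one that is down.
    cluster = [(True, True), (True, False), (False, False)]
    available = count_available(cluster)
    print(available, insufficient_members(available, len(cluster)))
```

With three members the threshold is `(3 + 1) / 2 = 2`, so the condition holds once fewer than two members are available. With four members the threshold is `2.5`, so the condition holds at two or fewer available members, which matches a quorum of three.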