Add MachineHealthCheckUnterminatedShortCircuitSRE alert
Dee-6777 committed Apr 25, 2024
1 parent 18af189 commit d2f40f4
Showing 1 changed file with 14 additions and 0 deletions.
14 changes: 14 additions & 0 deletions install/0000_90_machine-api-operator_04_alertrules.yaml
@@ -78,3 +78,17 @@ spec:
             description: |
               The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check
               the status of machines in the cluster.
+    - name: machine-health-check-unterminated-short-circuit-sre
+      rules:
+        - alert: MachineHealthCheckUnterminatedShortCircuitSRE
+          expr: |
+            mapi_machinehealthcheck_short_circuit == 1
+          for: 30m
+          labels:
+            severity: critical
+          annotations:
+            summary: "machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes"
+            description: |
+              The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check
+              the status of machines in the cluster.
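
The new rule fires only after `mapi_machinehealthcheck_short_circuit` has stayed at 1 for the full 30-minute `for` window. One way to check that behaviour is a `promtool test rules` unit test. The sketch below is not part of this commit and rests on several assumptions: the rule group has been copied out of the PrometheusRule manifest into a plain Prometheus rule file (a top-level `groups:` list) under the hypothetical name `mhc-short-circuit.rules.yaml`, the metric carries a `name` label (as the summary template implies), and `example-mhc` is a made-up MachineHealthCheck name.

```yaml
# Hypothetical promtool unit test; run with: promtool test rules mhc-short-circuit_test.yaml
# Assumes mhc-short-circuit.rules.yaml holds the rule group as a plain
# Prometheus rule file (top-level `groups:`), not the PrometheusRule manifest.
rule_files:
  - mhc-short-circuit.rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Short circuit stays active (value 1) from t=0 through t=40m.
      - series: 'mapi_machinehealthcheck_short_circuit{name="example-mhc"}'
        values: '1x40'
    alert_rule_test:
      # Before the 30m `for` window elapses the alert is only pending,
      # so no firing alerts are expected.
      - eval_time: 29m
        alertname: MachineHealthCheckUnterminatedShortCircuitSRE
        exp_alerts: []
      # After the window elapses the alert should be firing.
      - eval_time: 31m
        alertname: MachineHealthCheckUnterminatedShortCircuitSRE
        exp_alerts:
          - exp_labels:
              severity: critical
              name: example-mhc
            exp_annotations:
              summary: "machine health check example-mhc has been disabled by short circuit for more than 30 minutes"
              description: |
                The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check
                the status of machines in the cluster.
```

promtool compares expected and actual annotations exactly, so the `description` text here has to match the rule's block scalar character for character, including its trailing newline.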
