Bug 2093016: Add alert about attach / mount failing #324

Merged
merged 1 commit into from Oct 18, 2022
19 changes: 19 additions & 0 deletions manifests/12_prometheusrules.yaml
@@ -25,3 +25,22 @@ spec:
            Cluster storage operator monitors all storage classes configured in the cluster
            and checks there is not more than one default StorageClass configured.
          message: "StorageClass count check is failing (there should not be more than one default StorageClass)"

    - name: storage-operations.rules
      rules:
      - alert: PodStartupStorageOperationsFailing
        # There was at least one failing operation in past 5 minutes *and* there was no successful one.
        # Focus on attach and mount operations - they have the same diagnostic steps and are the most common.
        expr: |

We should perhaps add a comment noting that this does not cover cases where the attach operation never starts. For example, a node is shut down and the pod is deleted, but detach won't start until a certain check in the ADC expires (I think it is something like 5 minutes, after which the volume is force-detached).

Which reminds me: do we need an alert for detach failures? Maybe we can do that in a follow-up. If detach is failing, the attach operation may not even start, and hence no attach failure metric would be emitted.

Well, we can address this comment in a follow-up or something. Not super critical.

          increase(storage_operation_duration_seconds_count{status != "success", operation_name =~"volume_attach|volume_mount"}[5m]) > 0
          and on() increase(storage_operation_duration_seconds_count{status = "success", operation_name =~"volume_attach|volume_mount"}[5m]) == 0
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Pods can't start because {{ $labels.operation_name }} of volume plugin {{ $labels.volume_plugin }} is permanently failing on node {{ $labels.node }}."
          description: |
            Failing storage operation "{{ $labels.operation_name }}" of volume plugin {{ $labels.volume_plugin }} was preventing Pods on node {{ $labels.node }}
            from starting for the past 5 minutes.
            Please investigate Pods that are "ContainerCreating" on the node: "oc get pod --field-selector=spec.nodeName={{ $labels.node }} --all-namespaces | grep ContainerCreating".
            Events of the Pods should contain exact error message: "oc describe pod -n <pod namespace> <pod name>".
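
The intent stated in the rule's comment (at least one failed attach or mount in the window, and no successful one) can be sketched in plain Python. This is only an illustrative model, not part of the PR: the `Sample` type, the `should_alert` helper, and the example status value are hypothetical, and the sketch deliberately ignores Prometheus details such as rate extrapolation, counter resets, the `for: 5m` hold period, and the exact vector-matching behaviour of `and on()`.

```python
from dataclasses import dataclass
from typing import Iterable

# Operations the alert focuses on, taken from the rule's operation_name matcher.
WATCHED_OPERATIONS = {"volume_attach", "volume_mount"}


@dataclass(frozen=True)
class Sample:
    """Hypothetical view of one counter series over a single 5-minute window."""
    operation_name: str
    status: str
    count_at_start: int  # counter value at the start of the window
    count_at_end: int    # counter value at the end of the window


def increase(sample: Sample) -> int:
    """Simplified stand-in for PromQL increase(): plain difference, floored at 0."""
    return max(sample.count_at_end - sample.count_at_start, 0)


def should_alert(samples: Iterable[Sample]) -> bool:
    """True when a watched operation failed in the window and none succeeded."""
    watched = [s for s in samples if s.operation_name in WATCHED_OPERATIONS]
    failed = sum(increase(s) for s in watched if s.status != "success")
    succeeded = sum(increase(s) for s in watched if s.status == "success")
    return failed > 0 and succeeded == 0


if __name__ == "__main__":
    # "fail-unknown" is used here only as an example of a non-success status.
    window = [
        Sample("volume_attach", "fail-unknown", 3, 7),
        Sample("volume_attach", "success", 10, 10),
    ]
    print(should_alert(window))
```

In this model a single successful attach or mount in the window suppresses the alert, which matches the comment's goal of reporting only operations that appear to be failing persistently rather than transiently.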