Prometheus alerts mapping #34

bzurkowski · 2020-04-02T07:33:17Z

Prometheus provides a comprehensive set of alerting rules for Kubernetes such as:

KubeContainerWaiting
KubeDaemonSetNotScheduled
KubeAPIErrorsHigh

Open RCA enables connecting some of these alerts to elements present in the infra graph by using a mapping file. The entries in the file are of the form:

- name: KubePodCrashLooping
  source_mapping:
    origin: kubernetes
    kind: pod
    properties:
      name: pod
      namespace: namespace

The example above says that whenever a KubePodCrashLooping alert is detected, it should be mapped to a graph element of origin kubernetes and kind pod. The target element is the one whose name and namespace properties match the values of the pod and namespace labels in the alert payload, respectively.
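The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Open RCA's actual implementation: the `MAPPING` dict and the `resolve_source` function are hypothetical names, the mapping entry is flattened for brevity (no `source_mapping` level), and the alert is assumed to carry its name in the standard `alertname` label.

```python
from typing import Optional

# Hypothetical in-memory form of one mapping entry (mirrors the entry above).
MAPPING = {
    "KubePodCrashLooping": {
        "origin": "kubernetes",
        "kind": "pod",
        # graph property name -> alert label name
        "properties": {"name": "pod", "namespace": "namespace"},
    }
}


def resolve_source(alert: dict, mapping: dict = MAPPING) -> Optional[dict]:
    """Describe the graph element an alert points at, or return None."""
    labels = alert.get("labels", {})
    entry = mapping.get(labels.get("alertname"))
    if entry is None:
        return None  # the alert is not covered by the mapping
    # Every label named in the mapping must be present in the payload.
    if any(label not in labels for label in entry["properties"].values()):
        return None
    return {
        "origin": entry["origin"],
        "kind": entry["kind"],
        "properties": {
            prop: labels[label] for prop, label in entry["properties"].items()
        },
    }


alert = {
    "labels": {
        "alertname": "KubePodCrashLooping",
        "pod": "web-0",
        "namespace": "default",
    }
}
print(resolve_source(alert))
```

Returning None for an unknown alert name corresponds to the case of an alert that the mapping does not yet recognize.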

The mapping file is not complete. There is still a significant number of alerts that Open RCA cannot recognize. The remaining alerting rules should be reviewed and integrated into the mapping.
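One way to track which alerting rules still need a mapping entry is a plain set difference between the known alert names and the names already present in the mapping. The sketch below is hypothetical: `unmapped_alerts` is an illustrative helper, and the sample names are taken from the alerts mentioned in this issue.

```python
def unmapped_alerts(known_alerts, mapped_names):
    """Return the alert names that have no entry in the mapping, sorted."""
    return sorted(set(known_alerts) - set(mapped_names))


known = ["KubeContainerWaiting", "KubeDaemonSetNotScheduled", "KubePodCrashLooping"]
mapped = ["KubePodCrashLooping"]
print(unmapped_alerts(known, mapped))
```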

aleksandra-galara · 2020-04-03T11:37:52Z

bzurkowski · 2020-04-03T12:29:37Z

@aleksandra-galara Thanks for preparing the list. Very helpful! 👍 Let's keep it up to date as the support for new alerts is added to the mapping.

In the next iteration we should focus on the following items:

Kubernetes control plane:
- KubeAPIDown
- KubeSchedulerDown
- KubeControllerManagerDown
Kubernetes jobs:
- KubeCronJobRunning
- KubeJobCompletion
- KubeJobFailed
Kubernetes resources:
- KubeCPUOvercommit
- KubeMemOvercommit
- KubeQuotaExceeded
Node timing:
- NodeClockSkewDetected
- NodeClockNotSynchronising
Version semantics:
- KubeVersionMismatch

aleksandra-galara · 2020-04-13T21:16:25Z

Hi,
I've mapped and tested the first part of the alerts. For now I'm working on the alerts triggered by Prometheus components.
I hope to finish mapping and testing all of the alerts from the list soon!

bzurkowski · 2020-04-14T07:00:25Z

@aleksandra-galara Good work! I look forward to the first commits 😉

aleksandra-galara · 2020-04-16T07:00:27Z

Hi, I've mapped the alerts according to the updated list of Prometheus alerts:

AggregatedAPIDown
AggregatedAPIErrors
AlertmanagerConfigInconsistent
AlertmanagerDown
AlertmanagerFailedReload
AlertmanagerMembersInconsistent
ClockSkewDetected
CPUThrottlingHigh
ErrorBudgetBurn
etcdGRPCRequestsSlow
etcdHighCommitDurations
etcdHighFsyncDurations
etcdHighNumberOfFailedGRPCRequests
etcdHighNumberOfFailedHTTPRequests
etcdHighNumberOfFailedProposals
etcdHighNumberOfLeaderChanges
etcdHTTPRequestsSlow
etcdInsufficientMembers
etcdMemberCommunicationSlow
etcdNoLeader
KubeAPIDown
KubeAPIErrorBudgetBurn
KubeAPIErrorsHigh
KubeAPILatencyHigh
KubeClientCertificateExpiration
KubeClientErrors
KubeContainerWaiting
KubeControllerManagerDown
KubeCPUOvercommit
KubeCPUQuotaOvercommit
KubeCronJobRunning
KubeDaemonSetMisScheduled
KubeDaemonSetNotScheduled
KubeDaemonSetRolloutStuck
KubeDeploymentGenerationMismatch
KubeDeploymentReplicasMismatch
KubeHpaMaxedOut
KubeHpaReplicasMismatch
KubeJobCompletion
KubeJobFailed
KubeletDown
KubeletPlegDurationHigh
KubeletPodStartUpLatencyHigh
KubeletTooManyPods
KubeMemoryOvercommit
KubeMemoryQuotaOvercommit
KubeMemOvercommit
KubeNodeNotReady
KubeNodeReadinessFlapping
KubeNodeUnreachable
KubePersistentVolumeErrors
KubePersistentVolumeFillingUp
KubePersistentVolumeFullInFourDays
KubePersistentVolumeUsageCritical
KubePodCrashLooping
KubePodNotReady
KubeQuotaExceeded
KubeSchedulerDown
KubeStatefulSetGenerationMismatch
KubeStatefulSetReplicasMismatch
KubeStatefulSetUpdateNotRolledOut
KubeStateMetricsDown
KubeStateMetricsListErrors
KubeStateMetricsWatchErrors
KubeVersionMismatch
NodeClockNotSynchronising
NodeClockSkewDetected
NodeExporterDown
NodeFilesystemAlmostOutOfFiles
NodeFilesystemAlmostOutOfSpace
NodeFilesystemFilesFillingUp
NodeFilesystemSpaceFillingUp
NodeHighNumberConntrackEntriesUsed
NodeNetworkInterfaceFlapping
NodeNetworkReceiveErrs
NodeNetworkTransmitErrs
PrometheusBadConfig
PrometheusDown
PrometheusDuplicateTimestamps
PrometheusErrorSendingAlertsToAnyAlertmanager
PrometheusErrorSendingAlertsToSomeAlertmanagers
PrometheusMissingRuleEvaluations
PrometheusNotConnectedToAlertmanagers
PrometheusNotificationQueueRunningFull
PrometheusNotIngestingSamples
PrometheusOperatorDown
PrometheusOperatorNodeLookupErrors
PrometheusOperatorReconcileErrors
PrometheusOutOfOrderTimestamps
PrometheusRemoteStorageFailures
PrometheusRemoteWriteBehind
PrometheusRemoteWriteDesiredShards
PrometheusRuleFailures
PrometheusTSDBCompactionsFailing
PrometheusTSDBReloadsFailing
TargetDown

It refers to openrca#34 and completes the mapping of alerts according to the list created in the issue. It introduces the changes suggested in the review. Signed-off-by: Aleksandra Galara <a.galara@samsung.com>

bzurkowski · 2020-04-22T18:10:11Z

Closing, because all alerts from the above list have been mapped. The ones awaiting mapping improvement are tracked by #58.

bzurkowski added the enhancement (New feature or request) and good first issue (Good for newcomers) labels Apr 2, 2020

bzurkowski changed the title ~~Prometheus alerts support~~ Prometheus alerts mapping Apr 2, 2020

bzurkowski assigned aleksandra-galara Apr 3, 2020

bzurkowski added this to the 0.2 milestone Apr 5, 2020

aleksandra-galara pushed a commit to aleksandra-galara/orca that referenced this issue Apr 16, 2020

Complete mapping of Prometheus alerts

14d4da5

It refers to openrca#34 and complete mapping alerts due to list created in issue

aleksandra-galara mentioned this issue Apr 16, 2020

Complete mapping of Prometheus alerts #53

Merged

aleksandra-galara added a commit to aleksandra-galara/orca that referenced this issue Apr 16, 2020

Complete mapping of Prometheus alerts

bfa3ea7

It refers to openrca#34 and complete mapping alerts due to list created in issue Signed-off-by: Aleksandra Galara <a.galara@samsung.com>

bzurkowski closed this as completed Apr 22, 2020

aleksandra-galara added a commit to aleksandra-galara/orca that referenced this issue Apr 29, 2020

Add probe for Kubernetes jobs

4d37070

It implements openrca#34 Signed-off-by: Aleksandra Galara <a.galara@samsung.com>

aleksandra-galara mentioned this issue Apr 29, 2020

Add probe for Kubernetes jobs #76

Merged
