Prometheus alerts mapping #34

Closed
bzurkowski opened this issue Apr 2, 2020 · 6 comments

bzurkowski commented Apr 2, 2020

Prometheus provides a comprehensive set of alerting rules for Kubernetes such as:

  • KubeContainerWaiting
  • KubeDaemonSetNotScheduled
  • KubeAPIErrorsHigh

Open RCA connects some of these alerts to elements in the infrastructure graph through a mapping file. Entries in the file have the form:

- name: KubePodCrashLooping
  source_mapping:
    origin: kubernetes
    kind: pod
    properties:
      name: pod
      namespace: namespace

The example above states that whenever a KubePodCrashLooping alert is detected, it should be connected to a graph element of origin kubernetes and kind pod. The element is identified by its name and namespace properties, whose values are taken from the pod and namespace labels in the alert payload, respectively.
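
As a rough illustration of the lookup described above, the sketch below turns an alert's labels into the properties used to find a graph element. It is not Open RCA's actual implementation: the function `resolve_source`, the dictionary layout, and the sample alert labels are assumptions made for this example.

```python
from typing import Optional

# Mapping entry mirroring the YAML example above, written as a Python dict.
MAPPING = {
    "KubePodCrashLooping": {
        "origin": "kubernetes",
        "kind": "pod",
        # graph property name -> alert label name
        "properties": {"name": "pod", "namespace": "namespace"},
    },
}


def resolve_source(alert_name: str, labels: dict) -> Optional[dict]:
    """Return a selector for the graph element an alert refers to.

    Hypothetical helper: returns None when the alert has no mapping entry
    or when a label required by the mapping is missing from the payload.
    """
    entry = MAPPING.get(alert_name)
    if entry is None:
        return None
    properties = {}
    for prop, label in entry["properties"].items():
        if label not in labels:
            return None
        properties[prop] = labels[label]
    return {"origin": entry["origin"], "kind": entry["kind"], "properties": properties}


# Sample alert labels (made up for illustration).
labels = {"alertname": "KubePodCrashLooping", "pod": "web-0", "namespace": "default"}
print(resolve_source(labels["alertname"], labels))
```

Returning None for an unknown alert or a missing label keeps the two failure cases explicit, which is where an incomplete mapping file would show up.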

The mapping file is not complete. There is still a significant number of alerts that Open RCA cannot recognize. The remaining alerting rules should be reviewed and integrated into the mapping.
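
One way to track what remains is to compare the alert names defined in the Prometheus rules with the names already present in the mapping. The sketch below assumes both are available as plain lists of names; `unmapped_alerts` and the sample inputs are hypothetical and only illustrate the comparison.

```python
def unmapped_alerts(rule_names, mapped_names):
    """Return alert names that have no mapping entry, sorted and de-duplicated."""
    mapped = set(mapped_names)
    return sorted({name for name in rule_names if name not in mapped})


# Sample inputs: alert names taken from this issue; the split is illustrative.
rule_names = [
    "KubePodCrashLooping",
    "KubeContainerWaiting",
    "KubeAPIErrorsHigh",
    "KubeContainerWaiting",  # rule names may repeat across rule groups
]
mapped_names = ["KubePodCrashLooping"]
print(unmapped_alerts(rule_names, mapped_names))
```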

bzurkowski added the enhancement and good first issue labels Apr 2, 2020
bzurkowski changed the title from "Prometheus alerts support" to "Prometheus alerts mapping" Apr 2, 2020

aleksandra-galara commented Apr 3, 2020

List of Prometheus alerts:

  • KubeStateMetricsListErrors
  • KubeStateMetricsWatchErrors
  • NodeFilesystemSpaceFillingUp
  • NodeFilesystemAlmostOutOfSpace
  • NodeFilesystemFilesFillingUp
  • NodeFilesystemAlmostOutOfFiles
  • NodeNetworkReceiveErrs
  • NodeNetworkTransmitErrs
  • NodeHighNumberConntrackEntriesUsed
  • NodeClockSkewDetected
  • NodeClockNotSynchronising
  • KubePodCrashLooping
  • KubePodNotReady
  • KubeDeploymentGenerationMismatch
  • KubeDeploymentReplicasMismatch
  • KubeStatefulSetReplicasMismatch
  • KubeStatefulSetGenerationMismatch
  • KubeStatefulSetUpdateNotRolledOut
  • KubeDaemonSetRolloutStuck
  • KubeContainerWaiting
  • KubeDaemonSetNotScheduled
  • KubeDaemonSetMisScheduled
  • KubeCronJobRunning
  • KubeJobCompletion
  • KubeJobFailed
  • KubeHpaReplicasMismatch
  • KubeHpaMaxedOut
  • KubeCPUOvercommit
  • KubeMemOvercommit
  • KubeCPUOvercommit
  • KubeMemOvercommit
  • KubeQuotaExceeded
  • CPUThrottlingHigh
  • KubePersistentVolumeUsageCritical
  • KubePersistentVolumeFullInFourDays
  • KubePersistentVolumeErrors
  • KubeVersionMismatch
  • KubeClientErrors
  • ErrorBudgetBurn
  • KubeAPILatencyHigh
  • KubeAPIErrorsHigh
  • KubeClientCertificateExpiration
  • AggregatedAPIErrors
  • AggregatedAPIDown
  • KubeAPIDown
  • KubeNodeNotReady
  • KubeNodeUnreachable
  • KubeletTooManyPods
  • KubeNodeReadinessFlapping
  • KubeletPlegDurationHigh
  • KubeletPodStartUpLatencyHigh
  • KubeletDown
  • KubeSchedulerDown
  • KubeControllerManagerDown
  • PrometheusBadConfig
  • PrometheusNotificationQueueRunningFull
  • PrometheusErrorSendingAlertsToSomeAlertmanagers
  • PrometheusErrorSendingAlertsToAnyAlertmanager
  • PrometheusNotConnectedToAlertmanagers
  • PrometheusTSDBReloadsFailing
  • PrometheusTSDBCompactionsFailing
  • PrometheusNotIngestingSamples
  • PrometheusDuplicateTimestamps
  • PrometheusOutOfOrderTimestamps
  • PrometheusRemoteStorageFailures
  • PrometheusRemoteWriteBehind
  • PrometheusRemoteWriteDesiredShards
  • PrometheusRuleFailures
  • PrometheusMissingRuleEvaluations
  • AlertmanagerConfigInconsistent
  • AlertmanagerFailedReload
  • AlertmanagerMembersInconsistent
  • TargetDown
  • Watchdog
  • NodeNetworkInterfaceFlapping
  • PrometheusOperatorReconcileErrors
  • PrometheusOperatorNodeLookupErrors

bzurkowski commented Apr 3, 2020

@aleksandra-galara Thanks for preparing the list. Very helpful! 👍 Let's keep it up to date as the support for new alerts is added to the mapping.

In the next iteration we should focus on the following items:

  • Kubernetes control plane:
    • KubeAPIDown
    • KubeSchedulerDown
    • KubeControllerManagerDown
  • Kubernetes jobs:
    • KubeCronJobRunning
    • KubeJobCompletion
    • KubeJobFailed
  • Kubernetes resources:
    • KubeCPUOvercommit
    • KubeMemOvercommit
    • KubeQuotaExceeded
  • Node timing:
    • NodeClockSkewDetected
    • NodeClockNotSynchronising
  • Version semantics:
    • KubeVersionMismatch

bzurkowski added this to the 0.2 milestone Apr 5, 2020
aleksandra-galara commented

Hi,
I've mapped and tested the first part of the alerts. For now I'm working on the alerts triggered by Prometheus components.
I hope to finish mapping and testing all of the alerts from the list soon!

bzurkowski commented

@aleksandra-galara Good work! I look forward to the first commits 😉

aleksandra-galara pushed a commit to aleksandra-galara/orca that referenced this issue Apr 16, 2020
It refers to openrca#34 and complete mapping
alerts due to list created in issue
aleksandra-galara commented

Hi, I've mapped the alerts according to the updated list of Prometheus alerts:

  • AggregatedAPIDown
  • AggregatedAPIErrors
  • AlertmanagerConfigInconsistent
  • AlertmanagerDown
  • AlertmanagerFailedReload
  • AlertmanagerMembersInconsistent
  • ClockSkewDetected
  • CPUThrottlingHigh
  • ErrorBudgetBurn
  • etcdGRPCRequestsSlow
  • etcdHighCommitDurations
  • etcdHighFsyncDurations
  • etcdHighNumberOfFailedGRPCRequests
  • etcdHighNumberOfFailedHTTPRequests
  • etcdHighNumberOfFailedProposals
  • etcdHighNumberOfLeaderChanges
  • etcdHTTPRequestsSlow
  • etcdInsufficientMembers
  • etcdMemberCommunicationSlow
  • etcdNoLeader
  • KubeAPIDown
  • KubeAPIErrorBudgetBurn
  • KubeAPIErrorsHigh
  • KubeAPILatencyHigh
  • KubeClientCertificateExpiration
  • KubeClientErrors
  • KubeContainerWaiting
  • KubeControllerManagerDown
  • KubeCPUOvercommit
  • KubeCPUQuotaOvercommit
  • KubeCronJobRunning
  • KubeDaemonSetMisScheduled
  • KubeDaemonSetNotScheduled
  • KubeDaemonSetRolloutStuck
  • KubeDeploymentGenerationMismatch
  • KubeDeploymentReplicasMismatch
  • KubeHpaMaxedOut
  • KubeHpaReplicasMismatch
  • KubeJobCompletion
  • KubeJobFailed
  • KubeletDown
  • KubeletPlegDurationHigh
  • KubeletPodStartUpLatencyHigh
  • KubeletTooManyPods
  • KubeMemoryOvercommit
  • KubeMemoryQuotaOvercommit
  • KubeMemOvercommit
  • KubeNodeNotReady
  • KubeNodeReadinessFlapping
  • KubeNodeUnreachable
  • KubePersistentVolumeErrors
  • KubePersistentVolumeFillingUp
  • KubePersistentVolumeFullInFourDays
  • KubePersistentVolumeUsageCritical
  • KubePodCrashLooping
  • KubePodNotReady
  • KubeQuotaExceeded
  • KubeSchedulerDown
  • KubeStatefulSetGenerationMismatch
  • KubeStatefulSetReplicasMismatch
  • KubeStatefulSetUpdateNotRolledOut
  • KubeStateMetricsDown
  • KubeStateMetricsListErrors
  • KubeStateMetricsWatchErrors
  • KubeVersionMismatch
  • NodeClockNotSynchronising
  • NodeClockSkewDetected
  • NodeExporterDown
  • NodeFilesystemAlmostOutOfFiles
  • NodeFilesystemAlmostOutOfSpace
  • NodeFilesystemFilesFillingUp
  • NodeFilesystemSpaceFillingUp
  • NodeHighNumberConntrackEntriesUsed
  • NodeNetworkInterfaceFlapping
  • NodeNetworkReceiveErrs
  • NodeNetworkTransmitErrs
  • PrometheusBadConfig
  • PrometheusDown
  • PrometheusDuplicateTimestamps
  • PrometheusErrorSendingAlertsToAnyAlertmanager
  • PrometheusErrorSendingAlertsToSomeAlertmanagers
  • PrometheusMissingRuleEvaluations
  • PrometheusNotConnectedToAlertmanagers
  • PrometheusNotificationQueueRunningFull
  • PrometheusNotIngestingSamples
  • PrometheusOperatorDown
  • PrometheusOperatorNodeLookupErrors
  • PrometheusOperatorReconcileErrors
  • PrometheusOutOfOrderTimestamps
  • PrometheusRemoteStorageFailures
  • PrometheusRemoteWriteBehind
  • PrometheusRemoteWriteDesiredShards
  • PrometheusRuleFailures
  • PrometheusTSDBCompactionsFailing
  • PrometheusTSDBReloadsFailing
  • TargetDown

aleksandra-galara added a commit to aleksandra-galara/orca that referenced this issue Apr 16, 2020
It refers to openrca#34 and complete mapping
alerts due to list created in issue

Signed-off-by: Aleksandra Galara <a.galara@samsung.com>
aleksandra-galara added a commit to aleksandra-galara/orca that referenced this issue Apr 20, 2020
It refers to openrca#34 and complete mapping
alerts due to list created in issue
It introduces the changes suggested
in the review

Signed-off-by: Aleksandra Galara <a.galara@samsung.com>
aleksandra-galara added a commit to aleksandra-galara/orca that referenced this issue Apr 20, 2020
It refers to openrca#34 and complete mapping
alerts due to list created in issue.
It introduces the changes suggested
in the review

Signed-off-by: Aleksandra Galara <a.galara@samsung.com>
aleksandra-galara added a commit to aleksandra-galara/orca that referenced this issue Apr 21, 2020
It refers to openrca#34 and complete mapping
alerts due to list created in issue.
It introduces the changes suggested
in the review

Signed-off-by: Aleksandra Galara <a.galara@samsung.com>
bzurkowski commented

Closing, because all alerts from the above list have been mapped. The ones awaiting mapping improvement are tracked by #58.

aleksandra-galara added a commit to aleksandra-galara/orca that referenced this issue Apr 29, 2020
It implements openrca#34

Signed-off-by: Aleksandra Galara <a.galara@samsung.com>