Complete mapping of Prometheus alerts by aleksandra-galara · Pull Request #53 · openrca/orca

aleksandra-galara · 2020-04-16T06:54:21Z

Refers to #34 and completes the alert mapping according to the list created in that issue.

It refers to openrca#34 and complete mapping alerts due to list created in issue

Signed-off-by: Aleksandra Galara <a.galara@samsung.com>

bzurkowski

@aleksandra-galara First off, thanks for your first contribution. 🎉🍰 You did really well! 👌 I added only a few minor remarks. Some are purely cosmetic, others concern the mapping scope.

I noticed that a significant number of alerts are mapped to the root cluster node. In most cases this is correct, because these alerts don't provide enough labels to correlate them with less generic elements. If you see any options to narrow down the mapping scope, please share your ideas 😉

bzurkowski · 2020-04-17T08:38:42Z

+    - name: KubeCronJobRunning
+      source_mapping:
+        origin: kubernetes
+        kind: cronjob


Suggested change

-        kind: cronjob
+        kind: cron_job

Let's follow the convention of using underscores for kind names consisting of multiple parts. See, e.g., persistent_volume_claim or daemon_set.

bzurkowski · 2020-04-17T08:40:02Z

+        properties:
+          name: pod
+          namespace: namespace
+    - name: PrometheusOperatorReconcileErrors


I'm wondering what the controller label means in the alert definition. Maybe together with the namespace label we could map it to something less generic than the namespace object?

bzurkowski · 2020-04-17T08:40:48Z

+        origin: kubernetes
+        kind: cluster
+        properties: {}
+    - name: KubeDeploymentGenerationMismatch


Please move it up next to the other deployment-related alerts.

bzurkowski · 2020-04-17T08:41:25Z

+    - name: etcdGRPCRequestsSlow
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance
+    - name: etcdHTTPRequestsSlow
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance
+    - name: etcdHighCommitDurations
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance
+    - name: etcdHighFsyncDurations
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance
+    - name: etcdHighNumberOfFailedGRPCRequests
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance
+    - name: etcdHighNumberOfFailedHTTPRequests
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance
+    - name: etcdHighNumberOfFailedProposals
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance
+    - name: etcdHighNumberOfLeaderChanges
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance
+    - name: etcdMemberCommunicationSlow
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance
+    - name: etcdNoLeader
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance


I'm not sure whether the instance label is actually what we think it is. It might be the IP/hostname of the etcd pod rather than the name of a node. Look at this issue and here for examples.

If that's the case, do you think we could map these alerts by pod IP? We already extract an IP for each pod.
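A rough sketch of what a pod-IP mapping could look like, using one of the alerts above. The `ip` property name is an assumption made for illustration only; the graph may expose the extracted pod IP under a different name. The sketch also assumes that the instance label holds a bare IP, so a `host:port` value would need to be normalized first.

```yaml
# Sketch only: `ip` is a hypothetical property name for the extracted pod IP.
- name: etcdNoLeader
  source_mapping:
    origin: kubernetes
    kind: pod
    properties:
      ip: instance  # assumes the label holds a bare pod IP, without a port
```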

bzurkowski · 2020-04-17T08:41:43Z

+        origin: kubernetes
+        kind: persistent_volume_claim
+        properties:
+          name: persistentvolumeclaim


PVCs are namespaced. To prevent name conflicts across namespaces, you must include the namespace property as well.
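For illustration, the mapping with the namespace property added might look like this. `ExamplePvcAlert` is a placeholder, since the alert name is not visible in the hunk above, and the sketch assumes the alert carries a `namespace` label, following the `namespace: namespace` pattern used elsewhere in this file.

```yaml
# `ExamplePvcAlert` is a placeholder; the real alert name is not shown in the diff.
- name: ExamplePvcAlert
  source_mapping:
    origin: kubernetes
    kind: persistent_volume_claim
    properties:
      name: persistentvolumeclaim
      namespace: namespace  # assumes the alert carries a `namespace` label
```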

bzurkowski · 2020-04-17T09:16:26Z

+    - name: TargetDown
+      source_mapping:
+        origin: kubernetes
+        kind: namespace


Is it possible to map this alert to the service kind instead of the namespace kind, based on the service and namespace labels?
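Something along these lines, assuming the TargetDown alert exposes both a `service` and a `namespace` label:

```yaml
- name: TargetDown
  source_mapping:
    origin: kubernetes
    kind: service
    properties:
      name: service          # assumes the alert carries a `service` label
      namespace: namespace   # services are namespaced, so both are needed
```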

bzurkowski · 2020-04-17T09:18:32Z

+    - name: NodeClockSkewDetected
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance
+    - name: NodeClockNotSynchronising
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance
+    - name: NodeHighNumberConntrackEntriesUsed
+      source_mapping:
+        origin: kubernetes
+        kind: node
+        properties:
+          name: instance


Please move this mapping up next to the other node-related alerts.

bzurkowski · 2020-04-17T09:23:11Z

+    - name: AggregatedAPIDown
+      source_mapping:
+        origin: kubernetes
+        kind: namespace
+        properties:
+          name: namespace
+    - name: AggregatedAPIErrors
+      source_mapping:
+        origin: kubernetes
+        kind: namespace
+        properties:
+          name: namespace


For now, let's map the aggregated API alerts to the cluster kind.

bzurkowski · 2020-04-17T09:26:15Z

+    - name: KubeCPUQuotaOvercommit
+      source_mapping:
+        origin: kubernetes
+        kind: namespace
+        properties:
+          name: namespace
+    - name: KubeMemoryQuotaOvercommit
+      source_mapping:
+        origin: kubernetes
+        kind: namespace
+        properties:
+          name: namespace


Shouldn't we map these alerts to the cluster kind instead of the namespace kind?

Although the kube_resourcequota metric is namespace-scoped, the expressions for these two alerts sum the values across all metric instances (namespaces):

    - alert: KubeCPUQuotaOvercommit
      annotations:
        message: Cluster has overcommitted CPU resource requests for Namespaces.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuquotaovercommit
      expr: |
        sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"})
          /
        sum(kube_node_status_allocatable_cpu_cores)
          > 1.5

The alert message ("for Namespaces") is misleading, though.
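If we go that way, both entries would follow the same shape as the other cluster-level mappings in this file (a sketch, not a tested change):

```yaml
- name: KubeCPUQuotaOvercommit
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}  # no labels needed; the cluster is the single root node
- name: KubeMemoryQuotaOvercommit
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
```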

bzurkowski · 2020-04-17T09:30:44Z

+    - name: PrometheusOperatorNodeLookupErrors
+      source_mapping:
+        origin: kubernetes
+        kind: namespace


Let's map to the cluster kind instead of the namespace kind.

The namespace kind is not enabled in the graph, so we should avoid it as much as possible. Since many other Prometheus alerts (e.g. PrometheusOperatorDown) have already been mapped to the cluster kind, we should be consistent.

It refers to openrca#34 and complete mapping alerts due to list created in issue. It introduces the changes suggested in the review

Signed-off-by: Aleksandra Galara <a.galara@samsung.com>

bzurkowski

Thanks for the fixes. I added two more minor remarks. Thanks!

bzurkowski · 2020-04-21T05:43:23Z

+        origin: kubernetes
+        kind: cluster
+        properties: {}
+    - name: KubeMemoryQuotaOvercommit


This one is duplicated and should be removed.

bzurkowski · 2020-04-21T05:46:52Z

+    - name: KubeStateMetricsDown
+      source_mapping:
+        origin: kubernetes
+        kind: cluster
+        properties: {}
+    - name: AlertmanagerDown
+      source_mapping:
+        origin: kubernetes
+        kind: cluster
+        properties: {}
+    - name: NodeExporterDown
+      source_mapping:
+        origin: kubernetes
+        kind: cluster
+        properties: {}
+    - name: PrometheusDown
+      source_mapping:
+        origin: kubernetes
+        kind: cluster
+        properties: {}
+    - name: PrometheusOperatorDown
+      source_mapping:
+        origin: kubernetes
+        kind: cluster
+        properties: {}


I see. Then for now, let's keep the mapping as it is. Thanks for adding the issue for improvements.

bzurkowski · 2020-04-21T05:55:05Z

+        properties:
+          name: service
+          namespace: namespace
+    - name: KubeClientErrors


I just looked at this GitHub thread and it seems that the instance label does not contain the name of a node, but rather the IP address of a client:

Kubernetes API server client 'kubelet/10.0.2.15:10250' is experiencing 3% errors.
Kubernetes API server client 'apiserver/192.168.99.100:8443' is experiencing 24% errors.

I'm afraid we must map to the cluster node again.

bzurkowski

Change approved. Ready for merge!

Complete mapping of Prometheus alerts

bfa3ea7

aleksandra-galara force-pushed the Prometheus_alerts_mapping branch from 14d4da5 to bfa3ea7 on April 16, 2020 09:41

bzurkowski requested changes Apr 17, 2020

bzurkowski added the enhancement label Apr 17, 2020

bzurkowski added this to the 0.2 milestone Apr 17, 2020

Correct mapping of Prometheus alerts

e6067bf

aleksandra-galara force-pushed the Prometheus_alerts_mapping branch from 18ae256 to e6067bf on April 20, 2020 14:01

bzurkowski requested changes Apr 21, 2020

Correct mapping of Prometheus alerts(2)

1136be0

bzurkowski approved these changes Apr 21, 2020

bzurkowski merged commit 36db729 into openrca:master Apr 21, 2020

bzurkowski removed the enhancement label Apr 26, 2020

bzurkowski removed this from the 0.2 milestone Sep 9, 2020
