
Complete mapping of Prometheus alerts#53

Merged
bzurkowski merged 3 commits into openrca:master from aleksandra-galara:Prometheus_alerts_mapping
Apr 21, 2020

Conversation

@aleksandra-galara

It refers to openrca#34 and completes the mapping of alerts according to the list created in that issue.

Signed-off-by: Aleksandra Galara <a.galara@samsung.com>
@aleksandra-galara aleksandra-galara force-pushed the Prometheus_alerts_mapping branch from 14d4da5 to bfa3ea7 on April 16, 2020, 09:41

@bzurkowski bzurkowski left a comment


@aleksandra-galara First off, thanks for your first contribution. 🎉🍰 You did really well! 👌 I added only a few minor remarks. Some are purely cosmetic, others concern the mapping scope.

I noticed that a significant number of alerts are mapped to the root cluster node. In most cases this is correct, because these alerts don't provide enough labels to correlate with less generic elements. If you see any options to narrow down the mapping scope, please share your ideas 😉

Comment thread helm/orca/config/alerts-mapping.yaml Outdated
- name: KubeCronJobRunning
  source_mapping:
    origin: kubernetes
    kind: cronjob

Suggested change
kind: cronjob
kind: cron_job

Let's follow the convention of using underscores for kind names consisting of multiple parts. Note e.g. persistent_volume_claim or daemon_set.

    properties:
      name: pod
      namespace: namespace
- name: PrometheusOperatorReconcileErrors

I'm wondering what the controller label means in the alert definition. Maybe together with the namespace label we could map it to something less generic than the namespace object?
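If the controller label does turn out to name the owning workload, a narrower mapping could be sketched as below. This is purely hypothetical: whether `controller` refers to a Deployment (and whether its value matches the object name) has not been verified.

```yaml
# Hypothetical sketch: assumes the `controller` alert label holds the name
# of the owning Deployment, which is NOT confirmed.
- name: PrometheusOperatorReconcileErrors
  source_mapping:
    origin: kubernetes
    kind: deployment        # assumption, not verified
    properties:
      name: controller
      namespace: namespace
```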

Comment thread helm/orca/config/alerts-mapping.yaml Outdated
    origin: kubernetes
    kind: cluster
    properties: {}
- name: KubeDeploymentGenerationMismatch

Please move it up near the other deployment-related alerts.

Comment thread helm/orca/config/alerts-mapping.yaml Outdated
Comment on lines +500 to +559
- name: etcdGRPCRequestsSlow
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHTTPRequestsSlow
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHighCommitDurations
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHighFsyncDurations
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHighNumberOfFailedGRPCRequests
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHighNumberOfFailedHTTPRequests
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHighNumberOfFailedProposals
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHighNumberOfLeaderChanges
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdMemberCommunicationSlow
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdNoLeader
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance

I'm not sure whether the instance label is actually what we think it is. It might be the IP/hostname of the etcd pod rather than the name of a node. Look at this issue and here for examples.

If that's the case, do you think we could map these alerts by pod IP? We already extract an IP for each pod.
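If mapping by pod IP is feasible, one possible shape is sketched below. This is hypothetical: it assumes `instance` carries the etcd pod's IP and that the mapper supports matching pods by an `ip` property, neither of which is confirmed here.

```yaml
# Hypothetical: assumes `instance` holds the etcd pod IP and that the
# mapper can match pods on an `ip` property (not verified).
- name: etcdNoLeader
  source_mapping:
    origin: kubernetes
    kind: pod
    properties:
      ip: instance
```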

    origin: kubernetes
    kind: persistent_volume_claim
    properties:
      name: persistentvolumeclaim

PVCs are namespaced. To prevent conflicts, you must include the namespace property as well.
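Following the mapping format used elsewhere in the file, the entry with the namespace property added would look roughly like:

```yaml
# Sketch: PVC mapping extended with the namespace label to avoid
# cross-namespace name conflicts.
    origin: kubernetes
    kind: persistent_volume_claim
    properties:
      name: persistentvolumeclaim
      namespace: namespace
```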

Comment thread helm/orca/config/alerts-mapping.yaml Outdated
- name: TargetDown
  source_mapping:
    origin: kubernetes
    kind: namespace

Is it possible to map this alert to service kind instead of namespace kind based on service + namespace labels?
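A sketch of the suggested change, assuming the alert carries both the service and namespace labels:

```yaml
# Sketch: TargetDown narrowed from namespace kind to service kind,
# assuming `service` and `namespace` labels are present on the alert.
- name: TargetDown
  source_mapping:
    origin: kubernetes
    kind: service
    properties:
      name: service
      namespace: namespace
```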

Comment thread helm/orca/config/alerts-mapping.yaml Outdated
Comment on lines +306 to +323
- name: NodeClockSkewDetected
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: NodeClockNotSynchronising
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: NodeHighNumberConntrackEntriesUsed
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance

Please move this mapping up near the other node-related alerts.

Comment thread helm/orca/config/alerts-mapping.yaml Outdated
Comment on lines +566 to +577
- name: AggregatedAPIDown
  source_mapping:
    origin: kubernetes
    kind: namespace
    properties:
      name: namespace
- name: AggregatedAPIErrors
  source_mapping:
    origin: kubernetes
    kind: namespace
    properties:
      name: namespace

For now, let's map aggregated API alerts to cluster kind.
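Applying the cluster-kind pattern already used elsewhere in the file, the remapped entries would look roughly like:

```yaml
# Sketch: aggregated API alerts remapped to cluster kind, following the
# existing cluster-mapping pattern in this file.
- name: AggregatedAPIDown
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
- name: AggregatedAPIErrors
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
```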

Comment thread helm/orca/config/alerts-mapping.yaml Outdated
Comment on lines +239 to +250
- name: KubeCPUQuotaOvercommit
  source_mapping:
    origin: kubernetes
    kind: namespace
    properties:
      name: namespace
- name: KubeMemoryQuotaOvercommit
  source_mapping:
    origin: kubernetes
    kind: namespace
    properties:
      name: namespace

Shouldn't we map these alerts to the cluster kind instead of the namespace kind?

Although the kube_resourcequota metric is namespace-scoped, alert expressions for these two alerts sum values for all metric instances (namespaces):

- alert: KubeCPUQuotaOvercommit
  annotations:
    message: Cluster has overcommitted CPU resource requests for Namespaces.
    runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuquotaovercommit
  expr: |
    sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"})
      /
    sum(kube_node_status_allocatable_cpu_cores)
      > 1.5

The alert name (for Namespaces) is misleading, though.

Comment thread helm/orca/config/alerts-mapping.yaml Outdated
- name: PrometheusOperatorNodeLookupErrors
  source_mapping:
    origin: kubernetes
    kind: namespace

Let's map to cluster kind instead of namespace kind.

Namespace kind is not enabled in the graph, so we should avoid it as much as possible. Since many other Prometheus alerts (e.g. PrometheusOperatorDown) have already been mapped to cluster kind, we should be consistent.
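Following the cluster-kind pattern used for PrometheusOperatorDown, the entry would become roughly:

```yaml
# Sketch: remapped to cluster kind for consistency with other
# Prometheus operator alerts in this file.
- name: PrometheusOperatorNodeLookupErrors
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
```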

@bzurkowski bzurkowski added the enhancement New feature or request label Apr 17, 2020
@bzurkowski bzurkowski added this to the 0.2 milestone Apr 17, 2020
It refers to openrca#34 and completes the mapping of alerts according to the list created in the issue. It introduces the changes suggested in the review.

Signed-off-by: Aleksandra Galara <a.galara@samsung.com>
@aleksandra-galara aleksandra-galara force-pushed the Prometheus_alerts_mapping branch from 18ae256 to e6067bf on April 20, 2020, 14:01

@bzurkowski bzurkowski left a comment


Thanks for the fixes! I added two more minor remarks.

Comment thread helm/orca/config/alerts-mapping.yaml Outdated
    origin: kubernetes
    kind: cluster
    properties: {}
- name: KubeMemoryQuotaOvercommit

This one is duplicated and should be removed.

Comment on lines +262 to +286
- name: KubeStateMetricsDown
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
- name: AlertmanagerDown
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
- name: NodeExporterDown
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
- name: PrometheusDown
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
- name: PrometheusOperatorDown
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}

I see. Then for now, let's keep the mapping as it is. Thanks for adding the issue for improvements.

    properties:
      name: service
      namespace: namespace
- name: KubeClientErrors

I just looked at this GitHub thread, and it seems that the instance label does not contain the name of a node, but rather the IP address of a client:

Kubernetes API server client 'kubelet/10.0.2.15:10250' is experiencing 3% errors.
Kubernetes API server client 'apiserver/192.168.99.100:8443' is experiencing 24% errors.

I'm afraid we must map to the cluster node again.

It refers to openrca#34 and completes the mapping of alerts according to the list created in the issue. It introduces the changes suggested in the review.

Signed-off-by: Aleksandra Galara <a.galara@samsung.com>

@bzurkowski bzurkowski left a comment


Change approved. Ready for merge!

@bzurkowski bzurkowski merged commit 36db729 into openrca:master Apr 21, 2020
@bzurkowski bzurkowski removed the enhancement New feature or request label Apr 26, 2020
@bzurkowski bzurkowski removed this from the 0.2 milestone Sep 9, 2020