Complete mapping of Prometheus alerts #53
It refers to openrca#34 and completes the alert mapping according to the list created in that issue.
Signed-off-by: Aleksandra Galara <a.galara@samsung.com>
14d4da5 to bfa3ea7
bzurkowski left a comment
@aleksandra-galara First off, thanks for your first contribution. 🎉🍰 You did really well! 👌 I added only a few minor remarks. Some are purely cosmetic, others concern the mapping scope.
I noticed that a significant number of alerts are mapped to the root cluster node. In most cases this is correct, because these alerts don't provide enough labels to correlate them with less generic elements. If you see any options to narrow down the mapping scope, please share your ideas 😉
- name: KubeCronJobRunning
  source_mapping:
    origin: kubernetes
    kind: cronjob

Suggested change:
-     kind: cronjob
+     kind: cron_job
Let's follow the convention of using underscores for kind names that consist of multiple parts. See, for example, persistent_volume_claim or daemon_set.
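With the suggestion applied, the entry might look like the sketch below. The `properties` block is an assumption: it is not visible in the hunk above, and the `cronjob` label name is hypothetical.

```yaml
- name: KubeCronJobRunning
  source_mapping:
    origin: kubernetes
    kind: cron_job              # underscore-separated, like persistent_volume_claim
    properties:                 # assumed; not shown in the reviewed hunk
      name: cronjob             # hypothetical alert label carrying the CronJob name
      namespace: namespace
```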
    properties:
      name: pod
      namespace: namespace
- name: PrometheusOperatorReconcileErrors

I'm wondering what the controller label means in the alert definition. Maybe together with the namespace label we could map it to something less generic than the namespace object?
    origin: kubernetes
    kind: cluster
    properties: {}
- name: KubeDeploymentGenerationMismatch

Please move it up, next to the other deployment-related alerts.
- name: etcdGRPCRequestsSlow
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHTTPRequestsSlow
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHighCommitDurations
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHighFsyncDurations
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHighNumberOfFailedGRPCRequests
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHighNumberOfFailedHTTPRequests
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHighNumberOfFailedProposals
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdHighNumberOfLeaderChanges
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdMemberCommunicationSlow
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: etcdNoLeader
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance

    origin: kubernetes
    kind: persistent_volume_claim
    properties:
      name: persistentvolumeclaim

PVCs are namespaced. To prevent conflicts between claims with the same name in different namespaces, you must include the namespace property as well.
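A minimal sketch of the requested change, reusing the `namespace: namespace` property that other mappings in this pull request already use. The alert name is a placeholder, since the hunk above does not show it.

```yaml
- name: ExamplePersistentVolumeAlert   # placeholder; the real alert name is not shown above
  source_mapping:
    origin: kubernetes
    kind: persistent_volume_claim
    properties:
      name: persistentvolumeclaim
      namespace: namespace              # disambiguates same-named PVCs across namespaces
```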
- name: TargetDown
  source_mapping:
    origin: kubernetes
    kind: namespace

Is it possible to map this alert to the service kind instead of the namespace kind, based on the service and namespace labels?
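If the alert does carry both labels, the narrower mapping could be sketched as follows. This assumes that `service` is a supported kind in the mapping schema and that TargetDown exposes `service` and `namespace` labels; neither is confirmed in this thread.

```yaml
- name: TargetDown
  source_mapping:
    origin: kubernetes
    kind: service             # assumed to be a supported kind
    properties:
      name: service           # assumes the alert exposes a `service` label
      namespace: namespace
```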
- name: NodeClockSkewDetected
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: NodeClockNotSynchronising
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance
- name: NodeHighNumberConntrackEntriesUsed
  source_mapping:
    origin: kubernetes
    kind: node
    properties:
      name: instance

Please move these mappings up, next to the other node-related alerts.
- name: AggregatedAPIDown
  source_mapping:
    origin: kubernetes
    kind: namespace
    properties:
      name: namespace
- name: AggregatedAPIErrors
  source_mapping:
    origin: kubernetes
    kind: namespace
    properties:
      name: namespace

For now, let's map the aggregated API alerts to the cluster kind.
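Following the pattern that other cluster-scoped entries in this file use (`kind: cluster` with empty `properties`), the two entries would become roughly:

```yaml
- name: AggregatedAPIDown
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}    # the cluster is the root node, so no labels are needed to locate it
- name: AggregatedAPIErrors
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
```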
- name: KubeCPUQuotaOvercommit
  source_mapping:
    origin: kubernetes
    kind: namespace
    properties:
      name: namespace
- name: KubeMemoryQuotaOvercommit
  source_mapping:
    origin: kubernetes
    kind: namespace
    properties:
      name: namespace

Shouldn't we map these alerts to the cluster kind instead of the namespace kind?
Although the kube_resourcequota metric is namespace-scoped, alert expressions for these two alerts sum values for all metric instances (namespaces):
- alert: KubeCPUQuotaOvercommit
  annotations:
    message: Cluster has overcommitted CPU resource requests for Namespaces.
    runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuquotaovercommit
  expr: |
    sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"})
      /
    sum(kube_node_status_allocatable_cpu_cores)
      > 1.5

The alert name (for Namespaces) is misleading though.
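For contrast, here is a sketch of what a namespace-scoped rule would have to look like. It is purely hypothetical and not part of kubernetes-mixin; the alert name is invented and the threshold is kept only for illustration.

```yaml
# Hypothetical variant for contrast only; NOT an existing kubernetes-mixin rule.
- alert: ExampleNamespaceCPUQuotaOvercommit   # invented name
  expr: |
    # `sum by (namespace)` keeps the namespace label on every resulting series.
    sum by (namespace) (kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"})
      /
    # scalar() turns the cluster-wide total into a plain number to divide by.
    scalar(sum(kube_node_status_allocatable_cpu_cores))
      > 1.5
```

The real expression uses a plain `sum(...)` with no `by` clause, which drops every label, so the firing alert carries no namespace label to correlate on. That leaves the cluster kind as the only consistent target.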
- name: PrometheusOperatorNodeLookupErrors
  source_mapping:
    origin: kubernetes
    kind: namespace

Let's map it to the cluster kind instead of the namespace kind.
The namespace kind is not enabled in the graph, so we should avoid it as much as possible. Since many other Prometheus alerts (e.g. PrometheusOperatorDown) have already been mapped to the cluster kind, we should be consistent.
It refers to openrca#34 and completes the alert mapping according to the list created in that issue. It introduces the changes suggested in the review.
Signed-off-by: Aleksandra Galara <a.galara@samsung.com>
18ae256 to e6067bf
bzurkowski left a comment
Thanks for the fixes. I added two more minor remarks. Thanks!
    origin: kubernetes
    kind: cluster
    properties: {}
- name: KubeMemoryQuotaOvercommit

This one is duplicated and should be removed.
- name: KubeStateMetricsDown
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
- name: AlertmanagerDown
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
- name: NodeExporterDown
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
- name: PrometheusDown
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}
- name: PrometheusOperatorDown
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}

I see. Then for now, let's keep the mapping as it is. Thanks for adding the issue for improvements.
    properties:
      name: service
      namespace: namespace
- name: KubeClientErrors

I just looked at this GitHub thread, and it seems that the instance label does not contain the name of a node but rather the IP address and port of a client:
Kubernetes API server client 'kubelet/10.0.2.15:10250' is experiencing 3% errors.
Kubernetes API server client 'apiserver/192.168.99.100:8443' is experiencing 24% errors.
I'm afraid we must map to the cluster node again.
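Since the `instance` value (for example `10.0.2.15:10250`) cannot be matched against a node name, the fallback would look roughly like this, using the same cluster pattern as the other entries:

```yaml
- name: KubeClientErrors
  source_mapping:
    origin: kubernetes
    kind: cluster
    properties: {}    # `instance` holds an address:port pair, not a node name
```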
It refers to openrca#34 and completes the alert mapping according to the list created in that issue. It introduces the changes suggested in the review.
Signed-off-by: Aleksandra Galara <a.galara@samsung.com>
bzurkowski left a comment
Change approved. Ready for merge!