sum(container_memory_usage_bytes{...}) rule doubles values #136
Hi @rrichardson; thanks for the issue! I'd be surprised if node_exporter is exporting container_* metrics, but cAdvisor (embedded in the kubelet) exports metrics in a hierarchical fashion, so if we aggregate lower levels of the hierarchy together with upper levels, we can get doubling. We solve this by dropping the lower levels at scrape time, see https://github.com/grafana/jsonnet-libs/blob/master/prometheus-ksonnet/lib/prometheus-config.libsonnet#L287. Could you confirm that adding image!="" to the rules fixes this for you? We should add that here, and see if we can have the prometheus operator also drop those series.
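For illustration, a minimal sketch of that drop-at-scrape-time approach (the regex and label choice are assumptions based on this comment, not a copy of the linked jsonnet):

```yaml
# Sketch only: in the cAdvisor scrape job, drop container_* series whose
# image label is empty, i.e. the intermediate cgroup-hierarchy levels,
# so only per-container series remain.
metric_relabel_configs:
  - source_labels: [__name__, image]
    separator: ';'
    regex: container_([a-z_]+);
    action: drop
```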
image != "" does fix it.
The Prometheus-Operator supports relabeling on the Endpoints of a ServiceMonitor. What we would need to do is update this endpoint in the kube-prometheus stack. Would you like to do a PR, @rrichardson? Let me know; otherwise I can make the change too, but it's a nice little contribution if you want to.
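For context, endpoint-level relabelings on a ServiceMonitor look roughly like this (a minimal sketch, assuming the kubelet ServiceMonitor and the port/path names commonly used by kube-prometheus; not the actual patch):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
spec:
  selector:
    matchLabels:
      k8s-app: kubelet
  endpoints:
    - port: https-metrics
      path: /metrics/cadvisor
      # Drop hierarchical cgroup series (empty image label) at scrape time.
      metricRelabelings:
        - sourceLabels: [__name__, image]
          regex: container_([a-z_]+);
          action: drop
```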
@metalmatze Sure. I'll take a whack at it.
I still don't fully understand Prometheus or the scope of cAdvisor and its hierarchy of metrics, but shouldn't https://github.com/grafana/jsonnet-libs/blob/master/prometheus-ksonnet/lib/prometheus-config.libsonnet#L286-L292 already solve my problem? It should be dropping …
Hey @rrichardson, thanks for documenting your findings. I'm currently having the exact same problem with the metrics from cAdvisor not playing well with sum(), for the exact same reason you opened this issue: the "total" (…
I see that the PR on kube-prometheus was closed because this metric is not redundant and only the Grafana dashboards should be fixed to filter it out. How did you end up fixing the issue? Were the changes upstreamed and am I missing something? Edit: To add a bit more context, here is an example of a value I get doubled: Just like in your issue, one is the Pod total (…
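To make the doubling concrete, a hypothetical pair of recording rules (the names are invented; use `container_name` instead of `container` on older kubelets) showing that the parent-cgroup series already equals the sum over the pod's containers:

```yaml
groups:
  - name: doubling-demo.rules
    rules:
      # The pod-level cgroup series alone ("Pod total", empty container label)…
      - record: pod:memory_parent_cgroup:bytes
        expr: sum by (namespace, pod) (container_memory_usage_bytes{container=""})
      # …already equals the sum over the pod's containers, so an unfiltered
      # sum() counts every byte twice.
      - record: pod:memory_containers:sum_bytes
        expr: sum by (namespace, pod) (container_memory_usage_bytes{container!=""})
```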
…ner_cpu_cfs (#456) Currently, kubelet cAdvisor exports metrics for the parent cgroup as well as for each container. This leads to "duplicate metrics" and especially to strange or wrong visualisations. Filtering by `container!=""` excludes metrics from the parent cgroup. This patch avoids having two time series in the CPU Throttling panel. Related to #136. Signed-off-by: Mathis Raguin <mathis@cri.epita.fr>
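For illustration, the same filter expressed as a recording rule (a sketch: the record name is invented, the metric names are the standard cAdvisor CFS throttling counters, and the actual dashboard uses a panel query along these lines):

```yaml
groups:
  - name: cpu-throttling-demo.rules
    rules:
      # Throttled share of CFS periods; container!="" excludes the
      # parent-cgroup series so each container appears exactly once.
      - record: namespace_pod_container:cpu_cfs_throttled:ratio
        expr: |
          sum by (namespace, pod, container) (
            increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
          )
          /
          sum by (namespace, pod, container) (
            increase(container_cpu_cfs_periods_total{container!=""}[5m])
          )
```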
I think this can be closed after #512 was merged.
The `sum(container* ...)` rules duplicate data provided by cAdvisor within the kubelet, but the data is reported under the same record names, albeit with different labels. The label selectors in the default rules file collect both the NodeExporter (I think?) records and the kubelet cAdvisor records. This results in values that are exactly double reality.

I think the solution here is to just use the `service="kubelet"` and `container_name!=""` label selectors; then there is no need for a sum() (see the sketch below).

Originally posted here: prometheus-operator/prometheus-operator#2302
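A hedged sketch of that proposed selector change as a recording rule (the rule name and grouping are illustrative; `container_name` matches the Kubernetes 1.11-era cAdvisor labels referenced in this issue, while newer versions use `container`):

```yaml
groups:
  - name: k8s.rules
    rules:
      # Restricting to the kubelet job and to non-empty container_name avoids
      # counting node_exporter/parent-cgroup series alongside the containers.
      - record: namespace:container_memory_usage_bytes:sum
        expr: |
          sum by (namespace) (
            container_memory_usage_bytes{service="kubelet", container_name!=""}
          )
```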
What did you do?
Installed the Prometheus chart and friends via Helm in a K8s cluster created by kubeadm 1.11.
What did you expect to see?
Correct values aggregated by the rules. If the rules were changed to just use the output from the kubelet, a `sum()` would not be necessary. This would require setting `{service="kubelet", container_name!=""}`.
What did you see instead? Under which circumstances?
In addition to NodeExporter (I think?) exporting data under these record names, the kubelet also reports data under these record names, albeit with different labels. The kubelet reports the exact sum of all containers in the Pod, so the above rules report a value that is exactly double the actual value.
Environment
Prometheus Operator version:
Image ID: docker-pullable://quay.io/coreos/prometheus-operator@sha256:faa9f8a9045092b9fe311016eb3888e2c2c824eb2b4029400f188a765b97648a
Kubernetes version information:
Kubernetes cluster kind:
kubeadm on bare metal
Manifests:
https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/prometheus-rules.yaml#L22
not relevant