
Some of the basic dashboards not working right #3058

Closed
XDavidT opened this issue Feb 22, 2023 · 25 comments
Labels: bug, lifecycle/stale

Comments

XDavidT commented Feb 22, 2023

Describe the bug

Screenshots: https://tinyurl.com/2exxpskb https://tinyurl.com/2qdrsk5b https://tinyurl.com/2hfq6zrl
Networking: https://tinyurl.com/2otnavgt
In this example, we don't even see namespaces: https://tinyurl.com/2pbekvbz
I've used screenshots because that's the easiest way to show it.
Where should I start? Which components or queries should I check?

What's your helm version?

v3.11.1

What's your kubectl version?

Client v1.25.0 Server v1.24.9

Which chart?

kube-prometheus-stack

What's the chart version?

45.2.0

What happened?

I've tried to look at some of the basic dashboards to investigate the usage of my new Jenkins and found that most of the dashboards don't provide any information.

What you expected to happen?

No response

How to reproduce it?

No response

Enter the changed values of values.yaml?

kube-prometheus-stack:
  prometheus:
    service:
      type: NodePort
      nodePort: 30090
    prometheusSpec:
      additionalScrapeConfigs:
      - job_name: nexus
        scrape_interval: 30s
        static_configs:
          - targets: [ "nexus.local:8081" ]
        metrics_path: /service/metrics/prometheus
        basic_auth:
          username: metrics
          password: metrics
      - job_name: postgresql
        scrape_interval: 10s
        static_configs:
          - targets: ["pg.local:9187"]
      - job_name: netapp
        scrape_interval: 10s
        static_configs:
          - targets: ["poller-cluster-01:13000"]
      storageSpec:
        volumeClaimTemplate:
          spec:
            storageClassName: netapp-storage
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 50Gi
  grafana:
    service:
      type: NodePort
      nodePort: 30091
    persistence:
      existingClaim: kube-prometheus-stack-netapp-pvc
    additionalDataSources:
     - name: Prometheus-cnvrg
       basicAuth: true
       basicAuthPassword: *********
       basicAuthUser: user
       jsonData:
           tlsSkipVerify: true
       orgId: 1
       type: prometheus
       url: http://host.local:32216/

Enter the command that you execute and failing/misfunctioning.

None

Anything else we need to know?

I have 2 Prometheus data sources (one old, from another system, added to Grafana manually, and one new that came with kube-prometheus-stack). I've checked the kubernetes-compute-resources-node-pods dashboard and saw that the CPU panel queries sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{cluster="$cluster", node=~"$node"}) by (pod). On the Explore page I looked for node_namespace_pod_container but only found it in the old Prometheus; this metric is not available in the new one.

Is there any chance the metrics were updated but the dashboards weren't?

XDavidT added the bug label on Feb 22, 2023
@yashiang1986

I have the same issue.
My k8s version is 1.25.6 and helm-chart is 45.2.0.

zeritti (Member) commented Feb 24, 2023

You are specifically mentioning node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate. This is a recording rule defined in k8s.rules.yaml that uses the metric container_cpu_usage_seconds_total, produced by the kubelet job at the /metrics/cadvisor endpoint.

If you cannot see any time series of this metric/endpoint in Grafana or Prometheus UI, Prometheus has likely not discovered/scraped that specific kubelet endpoint. The Status/Targets page in Prometheus UI shows info on discovered targets.

BTW The corresponding service monitors get created through

kubelet:
  enabled: true
  namespace: kube-system
  serviceMonitor:
    cAdvisor: true

and the service through

prometheusOperator:
  kubeletService:
    enabled: true
    namespace: kube-system

Both are enabled by default.
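
A quick sanity check that the cAdvisor endpoint is actually discovered and scraped (just a sketch, assuming the chart's default job and metrics_path target labels) is to run the query below on the Prometheus query page; it should return one series with value 1 per node:

up{job="kubelet", metrics_path="/metrics/cadvisor"}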

zeritti (Member) commented Feb 24, 2023

Further to that query: it includes the variables cluster (which may be hidden) and node. Please make sure these get populated on the dashboard.

XDavidT (Author) commented Feb 26, 2023

You are specifically mentioning node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate. This is a recording rule defined in k8s.rules.yaml that uses the metric container_cpu_usage_seconds_total, produced by the kubelet job at the /metrics/cadvisor endpoint.

If you cannot see any time series of this metric/endpoint in Grafana or Prometheus UI, Prometheus has likely not discovered/scraped that specific kubelet endpoint. The Status/Targets page in Prometheus UI shows info on discovered targets.

BTW The corresponding service monitors get created through

kubelet:
  enabled: true
  namespace: kube-system
  serviceMonitor:
    cAdvisor: true

and the service through

prometheusOperator:
  kubeletService:
    enabled: true
    namespace: kube-system

Both are enabled by default.

I didn't override those values and kept the defaults (you can see my values above), which means it's not this issue.
I've checked: container_cpu_usage_seconds_total is working, but node_namespace_pod_container still doesn't exist.

zeritti (Member) commented Feb 26, 2023

Prometheus has to successfully complete recording the results of the rule before they can be returned by a query. In the Prometheus UI, under Status/Rules, in the section k8s.rules, the rule node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate should be green/OK, meaning the rule is executed successfully; otherwise an error will be shown, e.g.:
[screenshot]

If you click on the expression, a query will be executed through the query field where you can evaluate results.

XDavidT (Author) commented Feb 26, 2023

Prometheus has to successfully complete recording the results of the rule before they can be returned by a query. In the Prometheus UI, under Status/Rules, in the section k8s.rules, the rule node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate should be green/OK, meaning the rule is executed successfully; otherwise an error will be shown, e.g.: [screenshot]

If you click on the expression, a query will be executed through the query field where you can evaluate results.

https://www.awesomescreenshot.com/video/15129229?key=6ffd2deb551c5de77d3a5d46408980a2
I can see it, but I can't query it.

zeritti (Member) commented Feb 26, 2023

I can see it, but I can't query it.

There do not seem to be any recorded time series matching the expression. Will you show a sample output of this query?

irate(container_cpu_usage_seconds_total{image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])

XDavidT (Author) commented Feb 26, 2023

I can see it, but I can't query it.

There do not seem to be any recorded time series matching the expression. Will you show a sample output of this query?

irate(container_cpu_usage_seconds_total{image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])

https://tinyurl.com/2f2kmgbc

zeritti (Member) commented Feb 26, 2023

And a sample output of container_cpu_usage_seconds_total?

XDavidT (Author) commented Feb 26, 2023

Working well. https://tinyurl.com/2ohjr6zj
Could it be related to the fact that I've installed my cluster with RKE?

zeritti (Member) commented Feb 26, 2023

The expression in the recording rule does not find a match, as we can see, since the labels image, cluster and container are not present in the time series of container_cpu_usage_seconds_total. This is also the reason why you see an empty dashboard panel.
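
To double-check that diagnosis, a quick comparison (just a sketch, assuming the default kubelet job labels) is to run these two queries side by side:

count(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor"})
count(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""})

If the first returns a count but the second returns no data, the image label is indeed missing from the scraped series and the default recording rule can never match.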

XDavidT (Author) commented Feb 26, 2023

@zeritti Thanks for checking this out. What could be the reason for that? What needs to be checked next?

liniann commented Feb 28, 2023

I have the same problem. I provisioned my cluster with RKE v1.4.2 and my helm chart version is 45.2.0.
The following dashboards have no data (some have only a little data to show):
kubernetes / compute resources / *
kubernetes / network / *
Thanks.

BongoEADGC6 commented Mar 2, 2023

Same here on a k3s cluster, using K3s v1.25.4+k3s1 and chart version 45.4.0. It seems that the cluster label is not populated, which throws off the rest of the recording rules.

XDavidT (Author) commented Mar 2, 2023

I saw that @johnswarbrick-napier and @monotek released a new version 3 days ago. Could you take a look at this issue before the next release?

BongoEADGC6 commented Mar 2, 2023

Tested 45.5.0 just now and there is no difference at this time.
Wanted to follow up on my solution for this issue: it was actually related to labels being overwritten by ArgoCD. More information can be found here: #1769 (comment)

akantak commented Mar 3, 2023

I have installed kube-prometheus-stack v45.1.1 (also tested with 45.5.0 and 44.4.1) on K8s 1.25.6 and I have the same issue. I did some digging inside Prometheus, and the root cause is the image!="" matcher in the rule node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate, while the image label is not present in the container_cpu_usage_seconds_total metric.

That label is missing in @XDavidT's screenshot too.

node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate silently records nothing (it returns 0 results with state OK, because no real failure happened there; 0 results is fine). That causes the charts to be empty because there is no data to show.

Smart Prometheus folks, please advise. Probably the k8s.rules group should be adjusted, or the missing image label is a bug here.

WRT questions asked by @zeritti:
irate(container_cpu_usage_seconds_total{job="kubelet",metrics_path="/metrics/cadvisor"}[5m]) works like a charm; even the whole rule:

record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
expr: sum by (cluster, namespace, pod, container) (irate(container_cpu_usage_seconds_total{image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])) * on (cluster, namespace, pod) group_left (node) topk by (cluster, namespace, pod) (1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}))

would work, but image!="" needs to be removed from the default one.

So this works for me, and it should work for other people too:

sum by (cluster, namespace, pod, container) (irate(container_cpu_usage_seconds_total{job="kubelet",metrics_path="/metrics/cadvisor"}[5m])) * on (cluster, namespace, pod) group_left (node) topk by (cluster, namespace, pod) (1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}))
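
If you want to apply that as a chart-level workaround instead of patching the rendered rule, a minimal, untested values.yaml sketch could look like the one below. It assumes the chart's additionalPrometheusRulesMap hook (nest it under kube-prometheus-stack: if you install the chart as a subchart, as in the values above) and simply re-records the same series without the image!="" matcher:

additionalPrometheusRulesMap:
  cpu-usage-without-image-label:   # arbitrary key for this example
    groups:
      - name: k8s.rules.cpu-usage-workaround
        rules:
          - record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
            expr: |-
              sum by (cluster, namespace, pod, container) (
                irate(container_cpu_usage_seconds_total{job="kubelet",metrics_path="/metrics/cadvisor"}[5m])
              )
              * on (cluster, namespace, pod) group_left (node)
              topk by (cluster, namespace, pod) (
                1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""})
              )

Note that the default rule in the k8s.rules group stays in place (it just records nothing here), so two rules write the same record name; disabling the default group instead would also drop its other recording rules.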

zeritti (Member) commented Mar 3, 2023

Could it be related to the fact that I've installed my cluster with RKE?

What could be the reason for that? What needs to be checked next?

The fact that values for container and image originating at the container runtime are missing usually points to a problem that prevents cAdvisor from retrieving the necessary metadata from the container runtime. It can be a transient problem arising under specific runtime conditions which then goes away, i.e. a restart of the kubelet/container runtime may restore normal communication between the two and the labels are suddenly back. I would suggest looking in the kubelet log for events related to cAdvisor or read/write permission issues.

Another thing is that cAdvisor may not be able to get that information from the container runtime at all because the runtime no longer makes it available - see e.g. this issue at rancher or this issue at kubernetes. Some distributions have been affected, others have not, but 1.24 seems to be the critical release in some distributions, when the dockershim was removed from the kubelet.

zeritti (Member) commented Mar 3, 2023

Same here on a k3s cluster, using K3s v1.25.4+k3s1 and chart version 45.4.0. It seems that the cluster label is not populated, which throws off the rest of the recording rules.

No, the cluster label is not populated by default; it needs to be set. It is present in the dashboards to support multi-cluster environments. External labels can be used for this purpose. Note, though, that these labels get attached to the time series only as they are leaving the Prometheus instance through federation or remote write (the label will be present upstream only). Otherwise the label can be set through relabelings, metric relabelings, pod target labels and target labels in service monitors, or through additional scrape configs.
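
As an illustration, a minimal sketch (untested; chart-level values, with my-cluster as a placeholder identifier) of setting it via external labels; as noted, the label will only appear on series leaving this Prometheus through federation or remote write:

prometheus:
  prometheusSpec:
    externalLabels:
      cluster: my-cluster   # placeholder name

For locally queried series, the same static label can instead be added through the relabelings field of the relevant service monitor, roughly:

relabelings:
  - targetLabel: cluster
    replacement: my-cluster   # placeholder name; sets cluster=my-cluster on every scraped target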

akantak commented Mar 7, 2023

@XDavidT I assume you are using the Docker engine. If it is not a requirement for you, you can switch to containerd; with that, the Prometheus stack works fine without any workarounds/fixes.

XDavidT (Author) commented Mar 7, 2023

Could it be related to the fact that I've installed my cluster with RKE?

What could be the reason for that? What needs to be checked next?

The fact that values for container and image originating at the container runtime are missing usually points to a problem that prevents cAdvisor from retrieving the necessary metadata from the container runtime. It can be a transient problem arising under specific runtime conditions which then goes away, i.e. a restart of the kubelet/container runtime may restore normal communication between the two and the labels are suddenly back. I would suggest looking in the kubelet log for events related to cAdvisor or read/write permission issues.

Another thing is that cAdvisor may not be able to get that information from the container runtime at all because the runtime no longer makes it available - see e.g. this issue at rancher or this issue at kubernetes. Some distributions have been affected, others have not, but 1.24 seems to be the critical release in some distributions, when the dockershim was removed from the kubelet.

I can report that after trying to upgrade the cluster to v1.24.10-rancher3-1, following this comment: rancher/rancher#38934 (comment),
still no relevant information showed up.
https://tinyurl.com/2fuhnrd5

@akantak Thanks, but only RKE2 supports that, and we're sticking with RKE1 for now.
And the current cluster installation is Docker-based.

Edit 1:
Today I saw a new release: https://github.com/rancher/rke/releases/tag/v1.4.3
I've tried v1.24.10-rancher4-1 but nothing changed.

stale bot commented Apr 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale bot added the lifecycle/stale label on Apr 7, 2023
ivanfavi commented Apr 7, 2023

It seems related to this issue in cAdvisor.

stale bot removed the lifecycle/stale label on Apr 7, 2023
stale bot commented May 20, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale bot commented Jun 11, 2023

This issue is being automatically closed due to inactivity.

stale bot closed this as completed on Jun 11, 2023