
Some of the basic dashboards not working right #3058

Closed
XDavidT opened this issue Feb 22, 2023 · 25 comments
Labels: bug, lifecycle/stale

Comments

XDavidT commented Feb 22, 2023

Describe the bug

Screenshots: https://tinyurl.com/2exxpskb https://tinyurl.com/2qdrsk5b https://tinyurl.com/2hfq6zrl
Networking: https://tinyurl.com/2otnavgt
In this example, we don't even see namespaces: https://tinyurl.com/2pbekvbz
I've used screenshots because that's the easiest way to show it.
Where should I start? Which components or queries should I check?

What's your helm version?

v3.11.1

What's your kubectl version?

Client v1.25.0 Server v1.24.9

Which chart?

kube-prometheus-stack

What's the chart version?

45.2.0

What happened?

I've tried to look at some of the basic dashboards to investigate the usage of my new Jenkins and found that most of the dashboards don't provide any information.

What you expected to happen?

No response

How to reproduce it?

No response

Enter the changed values of values.yaml?

kube-prometheus-stack:
  prometheus:
    service:
      type: NodePort
      nodePort: 30090
    prometheusSpec:
      additionalScrapeConfigs:
      - job_name: nexus
        scrape_interval: 30s
        static_configs:
          - targets: [ "nexus.local:8081" ]
        metrics_path: /service/metrics/prometheus
        basic_auth:
          username: metrics
          password: metrics
      - job_name: postgresql
        scrape_interval: 10s
        static_configs:
          - targets: ["pg.local:9187"]
      - job_name: netapp
        scrape_interval: 10s
        static_configs:
          - targets: ["poller-cluster-01:13000"]
      storageSpec:
        volumeClaimTemplate:
          spec:
            storageClassName: netapp-storage
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 50Gi
  grafana:
    service:
      type: NodePort
      nodePort: 30091
    persistence:
      existingClaim: kube-prometheus-stack-netapp-pvc
    additionalDataSources:
     - name: Prometheus-cnvrg
       basicAuth: true
       basicAuthPassword: *********
       basicAuthUser: user
       jsonData:
           tlsSkipVerify: true
       orgId: 1
       type: prometheus
       url: http://host.local:32216/

Enter the command that you execute and failing/misfunctioning.

None

Anything else we need to know?

I have 2 Prometheus data sources (one old, from another system, added to Grafana manually, and one new that came with kube-prometheus-stack). I've checked the kubernetes-compute-resources-node-pods dashboard and saw that the CPU panel queries sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{cluster="$cluster", node=~"$node"}) by (pod). On the Explore page I looked for node_namespace_pod_container but only found it in the old Prometheus; this metric is not available in the new one.

Is there any chance the metrics were updated but the dashboards weren't?

XDavidT added the bug label on Feb 22, 2023
@yashiang1986

I have the same issue.
My k8s version is 1.25.6 and helm-chart is 45.2.0.

zeritti (Member) commented Feb 24, 2023

You are specifically mentioning node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate. This is a recording rule defined in k8s.rules.yaml that uses the metric container_cpu_usage_seconds_total, produced by the kubelet job at the /metrics/cadvisor endpoint.

If you cannot see any time series of this metric/endpoint in Grafana or Prometheus UI, Prometheus has likely not discovered/scraped that specific kubelet endpoint. The Status/Targets page in Prometheus UI shows info on discovered targets.

BTW The corresponding service monitors get created through

kubelet:
  enabled: true
  namespace: kube-system
  serviceMonitor:
    cAdvisor: true

and the service through

prometheusOperator:
  kubeletService:
    enabled: true
    namespace: kube-system

Both are enabled by default.
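
A quick sanity check that the cAdvisor endpoint is actually discovered and scraped (just a sketch, assuming the chart's default job and metrics_path target labels) is to run the query below on the Prometheus query page; it should return one series with value 1 per node:

up{job="kubelet", metrics_path="/metrics/cadvisor"}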

zeritti (Member) commented Feb 24, 2023

Further to that query: it includes the variables cluster (which may be hidden) and node. Please make sure these get populated on the dashboard.

XDavidT (Author) commented Feb 26, 2023

You are specifically mentioning node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate. This is a recording rule defined in k8s.rules.yaml that uses the metric container_cpu_usage_seconds_total, produced by the kubelet job at the /metrics/cadvisor endpoint.

If you cannot see any time series of this metric/endpoint in Grafana or Prometheus UI, Prometheus has likely not discovered/scraped that specific kubelet endpoint. The Status/Targets page in Prometheus UI shows info on discovered targets.

BTW The corresponding service monitors get created through

kubelet:
  enabled: true
  namespace: kube-system
  serviceMonitor:
    cAdvisor: true

and the service through

prometheusOperator:
  kubeletService:
    enabled: true
    namespace: kube-system

Both are enabled by default.

I didn't override those values and kept the defaults (you can see my values above), which means it's not this issue.
I've checked: container_cpu_usage_seconds_total is working, but node_namespace_pod_container still doesn't exist.

zeritti (Member) commented Feb 26, 2023

Prometheus has to successfully complete recording the results of the rule before they can be returned by a query. In the Prometheus UI, under Status/Rules, in the section k8s.rules, the rule node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate should be green/OK, meaning the rule is executed successfully; otherwise an error will be shown, e.g.:
[screenshot]

If you click on the expression, a query will be executed through the query field where you can evaluate results.

XDavidT (Author) commented Feb 26, 2023

Prometheus has to successfully complete recording the results of the rule before they can be returned by a query. In the Prometheus UI, under Status/Rules, in the section k8s.rules, the rule node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate should be green/OK, meaning the rule is executed successfully; otherwise an error will be shown, e.g.: [screenshot]

If you click on the expression, a query will be executed through the query field where you can evaluate results.

https://www.awesomescreenshot.com/video/15129229?key=6ffd2deb551c5de77d3a5d46408980a2
I can see it, but I can't query it.

zeritti (Member) commented Feb 26, 2023

I can see it, but I can't query it.

There do not seem to be any recorded time series matching the expression. Will you show a sample output of this query?

irate(container_cpu_usage_seconds_total{image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])

XDavidT (Author) commented Feb 26, 2023

I can see it, but I can't query it.

There do not seem to be any recorded time series matching the expression. Will you show a sample output of this query?

irate(container_cpu_usage_seconds_total{image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])

https://tinyurl.com/2f2kmgbc

zeritti (Member) commented Feb 26, 2023

And a sample output of container_cpu_usage_seconds_total?

XDavidT (Author) commented Feb 26, 2023

Working well. https://tinyurl.com/2ohjr6zj
Could it be related to the fact that I've installed my cluster with RKE?

zeritti (Member) commented Feb 26, 2023

The expression in the recording rule does not find a match, as we can see, since the labels image, cluster and container are not present in the time series of container_cpu_usage_seconds_total. This is also the reason why you see an empty dashboard panel.
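
To double-check that diagnosis, a quick comparison (just a sketch, assuming the default kubelet job labels) is to run these two queries side by side:

count(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor"})
count(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""})

If the first returns a count but the second returns no data, the image label is indeed missing from the scraped series and the default recording rule can never match.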

XDavidT (Author) commented Feb 26, 2023

@zeritti Thanks for checking this out. What could be the reason for that? What needs to be checked next?

liniann commented Feb 28, 2023

I have the same problem. I provisioned my cluster with RKE v1.4.2 and my helm chart version is 45.2.0.
The following dashboards have no data (some have only a little data to show):
kubernetes / compute resources / *
kubernetes / network / *
Thanks.

BongoEADGC6 commented Mar 2, 2023

Same here on a k3s cluster, using K3s v1.25.4+k3s1 and chart version 45.4.0. It seems that the cluster label is not populated, which throws off the rest of the recording rules.

XDavidT (Author) commented Mar 2, 2023

I saw that @johnswarbrick-napier and @monotek released a new version 3 days ago. Could you take a look at this issue before the next release?

BongoEADGC6 commented Mar 2, 2023

Tested 45.5.0 just now and there is no difference at this time.
Wanted to follow up on my solution for this issue: it was actually related to labels being overwritten by ArgoCD. More information can be found here: #1769 (comment)

akantak commented Mar 3, 2023

I have installed kube-prometheus-stack v45.1.1 (also tested with 45.5.0 and 44.4.1) on K8s 1.25.6 and I have the same issue. I did some digging inside Prometheus, and the root cause is the image!="" matcher in the rule node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate, while the image label is not present in the container_cpu_usage_seconds_total metric.

That label is missing in @XDavidT's screenshot too.

node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate silently records nothing (it returns 0 results with state OK, because no real failure happened there; 0 results is fine). That causes the charts to be empty because there is no data to show.

Smart Prometheus folks, please advise. Probably the k8s.rules group should be adjusted, or the missing image label is a bug here.

WRT questions asked by @zeritti:
irate(container_cpu_usage_seconds_total{job="kubelet",metrics_path="/metrics/cadvisor"}[5m]) works like a charm; even the whole rule:

record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
expr: sum by (cluster, namespace, pod, container) (irate(container_cpu_usage_seconds_total{image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])) * on (cluster, namespace, pod) group_left (node) topk by (cluster, namespace, pod) (1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}))

would work, but image!="" needs to be removed from the default one.

So this works for me, and it should work for other people too:

sum by (cluster, namespace, pod, container) (irate(container_cpu_usage_seconds_total{job="kubelet",metrics_path="/metrics/cadvisor"}[5m])) * on (cluster, namespace, pod) group_left (node) topk by (cluster, namespace, pod) (1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""}))
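
If you want to apply that as a chart-level workaround instead of patching the rendered rule, a minimal, untested values.yaml sketch could look like the one below. It assumes the chart's additionalPrometheusRulesMap hook (nest it under kube-prometheus-stack: if you install the chart as a subchart, as in the values above) and simply re-records the same series without the image!="" matcher:

additionalPrometheusRulesMap:
  cpu-usage-without-image-label:   # arbitrary key for this example
    groups:
      - name: k8s.rules.cpu-usage-workaround
        rules:
          - record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
            expr: |-
              sum by (cluster, namespace, pod, container) (
                irate(container_cpu_usage_seconds_total{job="kubelet",metrics_path="/metrics/cadvisor"}[5m])
              )
              * on (cluster, namespace, pod) group_left (node)
              topk by (cluster, namespace, pod) (
                1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""})
              )

Note that the default rule in the k8s.rules group stays in place (it just records nothing here), so two rules write the same record name; disabling the default group instead would also drop its other recording rules.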

zeritti (Member) commented Mar 3, 2023

Could it be related to the fact that I've installed my cluster with RKE?

What could be the reason for that? What needs to be checked next?

The fact that values for container and image originating at the container runtime are missing usually points to a problem that prevents cAdvisor from retrieving the necessary metadata from the container runtime. It can be a transient problem arising under specific runtime conditions which then goes away, i.e. a restart of the kubelet/container runtime may restore normal communication between the two and the labels are suddenly back. I would suggest looking in the kubelet log for events related to cAdvisor or read/write permission issues.

Another thing is that cAdvisor may not be able to get that information from the container runtime at all because the runtime no longer makes it available - see e.g. this issue at rancher or this issue at kubernetes. Some distributions have been affected, others have not, but 1.24 seems to be the critical release in some distributions, when the dockershim was removed from the kubelet.

zeritti (Member) commented Mar 3, 2023

Same here on a k3s cluster, using K3s v1.25.4+k3s1 and chart version 45.4.0. It seems that the cluster label is not populated, which throws off the rest of the recording rules.

No, the cluster label is not populated by default; it needs to be set. It is present in the dashboards to support multi-cluster environments. External labels can be used for this purpose. Note, though, that these labels get attached to the time series only as they are leaving the Prometheus instance through federation or remote write (the label will be present upstream only). Otherwise the label can be set through relabelings, metric relabelings, pod target labels and target labels in service monitors, or through additional scrape configs.
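
As an illustration, a minimal sketch (untested; chart-level values, with my-cluster as a placeholder identifier) of setting it via external labels; as noted, the label will only appear on series leaving this Prometheus through federation or remote write:

prometheus:
  prometheusSpec:
    externalLabels:
      cluster: my-cluster   # placeholder name

For locally queried series, the same static label can instead be added through the relabelings field of the relevant service monitor, roughly:

relabelings:
  - targetLabel: cluster
    replacement: my-cluster   # placeholder name; sets cluster=my-cluster on every scraped target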

akantak commented Mar 7, 2023

@XDavidT I assume you are using the Docker engine. If it is not a requirement for you, you can switch to containerd; with that, the Prometheus stack works fine without any workarounds/fixes.

XDavidT (Author) commented Mar 7, 2023

Could it be related to the fact that I've installed my cluster with RKE?

What could be the reason for that? What needs to be checked next?

The fact that values for container and image originating at the container runtime are missing usually points to a problem that prevents cAdvisor from retrieving the necessary metadata from the container runtime. It can be a transient problem arising under specific runtime conditions which then goes away, i.e. a restart of the kubelet/container runtime may restore normal communication between the two and the labels are suddenly back. I would suggest looking in the kubelet log for events related to cAdvisor or read/write permission issues.

Another thing is that cAdvisor may not be able to get that information from the container runtime at all because the runtime no longer makes it available - see e.g. this issue at rancher or this issue at kubernetes. Some distributions have been affected, others have not, but 1.24 seems to be the critical release in some distributions, when the dockershim was removed from the kubelet.

I can report that after trying to upgrade the cluster to v1.24.10-rancher3-1, following this comment: rancher/rancher#38934 (comment),
still no relevant information showed up.
https://tinyurl.com/2fuhnrd5

@akantak Thanks, but only RKE2 supports that, and we're sticking with RKE1 for now.
And the current cluster installation is Docker-based.

Edit 1:
Today I saw a new release: https://github.com/rancher/rke/releases/tag/v1.4.3
I've tried v1.24.10-rancher4-1 but nothing changed.

stale bot commented Apr 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale bot added the lifecycle/stale label on Apr 7, 2023
ivanfavi commented Apr 7, 2023

It seems related to this issue in cAdvisor.

stale bot removed the lifecycle/stale label on Apr 7, 2023
stale bot commented May 20, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale bot commented Jun 11, 2023

This issue is being automatically closed due to inactivity.

stale bot closed this as completed on Jun 11, 2023