Cadvisor not reporting Container/Image metadata #473

Closed
cfchad opened this issue May 13, 2019 · 62 comments

Labels: kind/bug

@cfchad commented May 13, 2019

Describe the bug
When making the call to retrieve metrics via cAdvisor, the container and image label values are empty on every series.

container_tasks_state{container="",container_name="",id="/system.slice/lxd.socket",image="",name="",namespace="",pod="",pod_name="",state="running"} 0 1557525150119

To Reproduce
Install k3s via multipass https://medium.com/@zhimin.wen/running-k3s-with-multipass-on-mac-fbd559966f7c

kubectl get --raw /api/v1/nodes/k3s/proxy/metrics/cadvisor

Expected behavior
The container and image labels should be populated.

Additional context
Wondering if it might be related to #213
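
For anyone reproducing this, a quick way to spot the empty labels (rough sketch; the node name k3s comes from the repro step above):

# show a few series where the container label is empty
kubectl get --raw /api/v1/nodes/k3s/proxy/metrics/cadvisor \
  | grep container_tasks_state \
  | grep 'container=""' \
  | head -n 3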

@cfchad (Author) commented May 15, 2019

@ibuildthecloud do you have any insight into this? Did you happen to run into any issues of this sort? I saw that you are maintaining the cadvisor fork for k3s.

@fore5fire:

I'm running into the same issue, which makes the cAdvisor metrics in Prometheus basically unusable, since container_name is, as far as I can tell, the only way to distinguish certain aggregated metrics from individual ones.
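
As an illustration, queries like the ones in the kube-prometheus mixins rely on exactly those labels to separate per-container series from pod-level aggregates; a rough PromQL sketch of the usual filter (with the labels empty, it matches nothing):

# per-container CPU usage, excluding the pause container and cgroup aggregates
sum by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total{container!="", container!="POD", image!=""}[5m])
)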

@cfchad (Author) commented Jun 7, 2019

@lsmith130 seems like a pretty important metric. I am unable to see memory statistics at the pod and container level.

@ahmedmagdiosman commented Jul 14, 2019

I've run into this as well with some metrics like container_network_receive_bytes_total.

Querying it in Prometheus returns the following (showing just one entry; all others are similar):

{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",id="/",instance="node1",interface="enp2s0",job="kubernetes-cadvisor",kubernetes_io_arch="amd64",kubernetes_io_hostname="node1",kubernetes_io_os="linux"}

value: 318908420765

I found these issues with metrics related to Disk I/O and Network I/O.

Edit: this could be related to google/cadvisor#2249 (containerd instead of docker runtime)

@carlosedp (Contributor) commented Aug 21, 2019

There are other metrics that also don't have pod and image metadata like container_cpu_usage_seconds_total and container_memory_working_set_bytes, for example.

On K3s:

container_cpu_usage_seconds_total{cpu="total",endpoint="https-metrics",id="/kubepods/burstable/pod52bf597a-c3ac-11e9-93c5-080027881c8e",instance="10.0.2.15:10250",job="kubelet",namespace="monitoring",node="ubuntu-k3s",pod="prometheus-k8s-0",pod_name="prometheus-k8s-0",service="kubelet"}

container_memory_working_set_bytes{endpoint="https-metrics",id="/kubepods/burstable/pod52bf597a-c3ac-11e9-93c5-080027881c8e",instance="10.0.2.15:10250",job="kubelet",namespace="monitoring",node="ubuntu-k3s",pod="prometheus-k8s-0",pod_name="prometheus-k8s-0",service="kubelet"}

K8s:

container_cpu_usage_seconds_total{container="prometheus",container_name="prometheus",cpu="total",endpoint="http-metrics",id="/kubepods/burstable/pod9ac8dd9a-544e-4389-a00b-0a6441f95b22/8c63324993ebf607921e797588b22e4e6792001ac0f10a4291b910818e3e26b5",image="prom/prometheus@sha256:8f34c18cf2ccaf21e361afd18e92da2602d0fa23a8917f759f906219242d8572",instance="10.0.2.15:10255",job="kubelet",name="k8s_prometheus_prometheus-k8s-0_monitoring_9ac8dd9a-544e-4389-a00b-0a6441f95b22_0",namespace="monitoring",node="minikube",pod="prometheus-k8s-0",pod_name="prometheus-k8s-0",service="kubelet"}

container_memory_working_set_bytes{container="POD",container_name="POD",endpoint="http-metrics",id="/kubepods/burstable/pod9ac8dd9a-544e-4389-a00b-0a6441f95b22/f63eba5bae95994f82e873e64e834a07bea0f6bcb14ce89ffa6f090cf02b57d7",image="k8s.gcr.io/pause:3.1",instance="10.0.2.15:10255",job="kubelet",name="k8s_POD_prometheus-k8s-0_monitoring_9ac8dd9a-544e-4389-a00b-0a6441f95b22_0",namespace="monitoring",node="minikube",pod="prometheus-k8s-0",pod_name="prometheus-k8s-0",service="kubelet"}

This breaks many of the rules and dashboards provided by https://github.com/carlosedp/cluster-monitoring and the kube-prometheus project libs.

@ibuildthecloud (Contributor):

@carlosedp are you running k3s with the embedded containerd or with docker?

@carlosedp (Contributor):

I did the default deployment with k3s server.

@cfchad (Author) commented Aug 22, 2019

I believe running it with Docker does fix many of the issues, but I see the recommendation to use containerd instead. I'm not sure what the cost/benefit would be between having the metrics and using Docker with k3s.

@geekdave commented Aug 22, 2019

To expand on @cfchad's comment, the docs state:

k3s includes and defaults to containerd. Why? Because it’s just plain better. If you want to run with Docker first stop and think, “Really? Do I really want more headache?” If still yes then you just need to run the agent with the --docker flag.

https://rancher.com/docs/k3s/latest/en/configuration/#containerd-and-docker

@ibuildthecloud before I try switching to --docker would you mind speaking to the above disclaimer? What kinds of headaches do you imagine users would face by switching to docker? I'd love to have cadvisor metrics working with k3s so I can understand my container resource usage better, but I want to know what I'm signing up for, and if I'm trading one problem for another. 😄
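
For reference, switching runtimes is just a flag on the k3s invocation (rough sketch based on the docs quoted above; adapt it to however you installed k3s):

# install (or reconfigure) the server to use Docker instead of the embedded containerd
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker" sh -
# or, if running the binary directly:
k3s server --docker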

@carlosedp (Contributor):

I deployed with Docker but now my ingress routes are not working. Any difference on creating ingress resources by using Docker?

@carlosedp (Contributor):

@ibuildthecloud confirming that when K3s is deployed with Docker as the runtime, the cadvisor metrics have all the metadata needed by the monitoring stack:

container_cpu_usage_seconds_total{container="POD",container_name="POD",cpu="total",endpoint="https-metrics",id="/kubepods/burstable/pod1fbbf5ec-c519-11e9-a3e3-080027881c8e/a5187b99b06f032724be1076852701735a3f40573d1e69e5f180c531a2ac4ab2",image="k8s.gcr.io/pause:3.1",instance="10.0.2.15:10250",job="kubelet",name="k8s_POD_prometheus-k8s-0_monitoring_1fbbf5ec-c519-11e9-a3e3-080027881c8e_0",namespace="monitoring",node="ubuntu-k3s",pod="prometheus-k8s-0",pod_name="prometheus-k8s-0",service="kubelet"}

(screenshot)

@ibuildthecloud (Contributor):

@carlosedp Thanks for finding that out. We'll have to look into why containerd doesn't report the metrics.

@geekdave:

@carlosedp Regarding your comment:

I deployed with Docker but now my ingress routes are not working.

Saw your tweet about the Traefik metrics so I was wondering if you had to do anything special to get ingress working again.

@carlosedp (Contributor) commented Aug 23, 2019

Ah yes, please disregard that comment. It was my environment that was messed up.
If you edit the Traefik Helm chart with k3s kubectl edit helmchart traefik -n kube-system and add metrics.prometheus.enabled: "true" to spec.set, it starts exposing the metrics. Then rebuild the monitoring stack with the Traefik module enabled.
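
For reference, a minimal sketch of what that edit ends up looking like in the HelmChart resource (only the added spec.set entry is shown; every other field stays as k3s generated it):

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: traefik
  namespace: kube-system
spec:
  set:
    # the value added via k3s kubectl edit helmchart traefik -n kube-system
    metrics.prometheus.enabled: "true"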

I opened a PR against K3s so that, once merged, it will expose the Traefik metrics by default.

@ludopaquet:

Hello, any news on this topic? I will try switching my k3s cluster to Docker too.

@carlosedp (Contributor):

Hi @ibuildthecloud, any news on why K3s is not reporting cadvisor metrics without Docker as the runtime? Thanks!

@ludopaquet:

Hello @carlosedp, I think it's OK now with 0.10.x.

@carlosedp (Contributor):

Cool, gonna check and report back! Thanks!

@carlosedp (Contributor):

Tested on K3s v0.10.2, and the metrics are generated without requiring --docker as the runtime.

@cjellick (Contributor):

Thanks for checking!

@carlosedp (Contributor):

I might have spoken too early. Starting from a fresh K3s, I deployed the stack and the CPU metrics still don't show up.

Here is an example rule that fails:

record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate
expr: sum
  by(namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",job="kubelet"}[5m]))
  * on(namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)

(screenshot)

The metrics don't have image and container labels, so the results are blank. Can you reopen this, @cjellick?

@erikwilson erikwilson reopened this Nov 14, 2019
@cjellick cjellick added this to the v1.x - Backlog milestone Nov 15, 2019
@pckbls commented Dec 2, 2019

I'm running k3s v1.0.0 (18bd921) on a bunch of Raspberry Pis and have also run into this particular problem. Unfortunately, neither the containerd nor the Docker container engine populates the image and container fields.

$ kubectl get --raw /api/v1/nodes/rpi-k3s-master/proxy/metrics/cadvisor | grep container_cpu_usage_seconds_total

[...]
container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods/burstable/podcf2436bb-d82f-4464-9ede-df28261829cb/b52c4241d9a75c5e8a375a69b8f8dd38c25df4820c236ca7c99b7600d0f598d5",image="",name="",namespace="",pod=""} 0.088365464 1575303361185
container_cpu_usage_seconds_total{container="",cpu="total",id="/system.slice/docker.service",image="",name="",namespace="",pod=""} 3.604857392 1575303359295
container_cpu_usage_seconds_total{container="",cpu="total",id="/systemd/system.slice",image="",name="",namespace="",pod=""} 2867.487169506 1575303363690

Any idea how to fix or at least further investigate this issue? I do lack a bit of background knowledge in that area.

@carlosedp a few months back you wrote:

I deployed with Docker but now my ingress routes are not working. Any difference on creating ingress resources by using Docker?

I have also noticed that when using the Docker engine, I cannot access services from the outside world. I could address this problem by running sudo iptables -P FORWARD ACCEPT on all my nodes.

@NicklasWallgren:

I have also encountered this particular issue since k3s 0.9+.

container_cpu_usage_seconds_total is no longer outputting the relevant metadata; the image field is missing.

@NicklasWallgren:

container_memory_rss has also lost metadata, such as container.

@borg286 commented Jan 8, 2020

Does anyone have a reference to where cadvisor calculates the container name? Perhaps k3s is simply not specifying enough data when it creates containers.

@Vad1mo commented Mar 6, 2020

I tried with v1.17.3+k3s1 (5b17a17)

The --kubelet-arg containerd=/run/k3s/containerd/containerd.sock flag is set; I can see it in the running process:

ps aux | grep /usr/local/bin/k3s
root       543 12.7  2.0 799424 671676 ?       Ssl  23:25   0:45 /usr/local/bin/k3s server --no-deploy traefik --default-local-storage-path /data --node-external-ip 192.168.10.131 --kubelet-arg containerd=/run/k3s/containerd/containerd.sock
root      6781  0.0  0.0   6180   884 pts/0    S+   23:31   0:00 grep /usr/local/bin/k3s

However, node-exporter doesn't show the container_cpu_* metrics; I guess those are provided via cAdvisor only.

@brondum commented Mar 17, 2020

@Vad1mo Did you ever figure this one out?

@Vad1mo commented Mar 17, 2020

Yes, it works. I added --kubelet-arg containerd=/run/k3s/containerd/containerd.sock.

Here is the helmfile I use, with the chart from code-chris/helm-charts:

repositories:
  - name: stable
    url: "https://kubernetes-charts.storage.googleapis.com/"
  - name: code-chris
    #url: https://code-chris.github.io/helm-charts -- Wait until 8gears patch is merged
    url: git+https://github.com/8gears/helm-charts@charts/cadvisor?sparse=1
releases:
  - name: "cadvisor"
    chart: "code-chris/cadvisor"
    namespace: "monitoring"
    values:
      - metrics:
          enabled: true
      - resources:
          limits:
            cpu: 500m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 128Mi
  - name: "prometheus-operator"
    chart: "stable/prometheus-operator"
    # version: "6.2.1"
    namespace: "monitoring"

@brondum commented Mar 18, 2020

@Vad1mo Any reason to use code-chris' cadvisor chart instead of the "builtin" metrics server from k3s?

@Vad1mo commented Mar 18, 2020

  • The metrics server was designed to provide metrics used for autoscaling, according to the FAQ.
  • cadvisor exposes all the container_* metrics.
  • prometheus-operator comes with dashboards that need the container_* metrics, and we wanted to reuse those.

I am not sure whether the metrics server exposes a similar set of metrics to cadvisor.
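
A quick way to compare the two sources side by side (rough sketch; substitute your node name):

# metrics-server: aggregated usage via the metrics.k8s.io API
kubectl top pods -A
# kubelet cadvisor endpoint: the raw container_* series
kubectl get --raw /api/v1/nodes/<nodename>/proxy/metrics/cadvisor | grep '^container_' | head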

@brondum commented Mar 18, 2020

It actually seems to, at least from the k3s perspective, except I am still missing those node_namespace_pod_container metrics.
I think that is what's breaking the dashboards.

@Vad1mo commented Mar 18, 2020

Do you have a reference at hand for what is exposed?
I would like to compare them.

@brondum commented Mar 18, 2020

I think you have a point there! When I look at my scrape config, I actually just scrape the following:
https://kubernetes.default.svc:443/api/v1/nodes/<nodename>/proxy/metrics/cadvisor

Maybe that does not contain those.
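
One way to check what that endpoint actually exposes (rough sketch; node name per your cluster):

# list the distinct metric names returned by the kubelet cadvisor endpoint
kubectl get --raw /api/v1/nodes/<nodename>/proxy/metrics/cadvisor | grep -v '^#' | sed 's/[{ ].*//' | sort -u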

@Vad1mo commented Mar 18, 2020

Thanks. I see some similar metrics. Do you use the Prometheus Operator and its dashboards?

@brondum commented Mar 18, 2020

I actually don't use the operator; I wanted to go low-tech and learn some Prometheus scrape configs instead of using the CRDs. But I do use the mixin dashboards, which is where I have my problems. :D

@Vad1mo commented Mar 18, 2020

Would you mind having a quick chat? I think we are working on something similar: https://meet.google.com/xjg-vsgh-zwu

@Vad1mo commented Mar 18, 2020

I am puzzled about how to tell Prometheus which URL to scrape. Is https://kubernetes.default.svc:443/api/v1/nodes/<nodename>/proxy/metrics/cadvisor the correct scraping URL?

It doesn't seem to be dynamic, so hardcoding node names would be required.

@brondum commented Mar 18, 2020

Actually my scrape for that looks like this:

 # Scrape config for Kubelet cAdvisor.
  - job_name: 'kubernetes-cadvisor'
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
    - role: node
    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    # fix for mixin:
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - source_labels: [__meta_kubernetes_pod_node_name]
      action: replace
      target_label: node

@carlosedp (Contributor):

I've deployed a fresh K3s (v1.17.3+k3s1) on an ARM64 node and updated my monitoring stack (https://github.com/carlosedp/cluster-monitoring) with the latest mixins.

The metrics output still doesn't include some of the labels, but the dashboards are populated correctly:

container_cpu_usage_seconds_total{cpu="total",endpoint="https-metrics",id="/kubepods/burstable/podcd2b5231-b73c-44d7-b6a1-4a51dab6e21f",instance="192.168.15.15:10250",job="kubelet",metrics_path="/metrics/cadvisor",namespace="monitoring",node="odroidn2",pod="prometheus-k8s-0",service="kubelet"}

(dashboard screenshots)

I believe most problems related to this have been fixed now.

@brondum commented Mar 21, 2020

@carlosedp Neat! But what I can't seem to understand is where node_namespace_pod_container comes from.

I just spun up a fresh k3s and yes, that works, but I can't find node_namespace_pod_container in Prometheus, yet Grafana gets it somehow?

What am I missing here?

@carlosedp (Contributor) commented Mar 21, 2020

They are Prometheus recording rules, composed from other metrics, like:

- expr: |
    sum by (cluster, namespace, pod, container) (
      rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[5m])
    ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
      1, max by(cluster, namespace, pod, node) (kube_pod_info)
    )
  record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate

The rule node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate comes from the expression above. It's like a "view" in SQL, more or less.

Most are defined here: https://github.com/carlosedp/cluster-monitoring/blob/master/manifests/prometheus-rules.yaml
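
In other words, once Prometheus has evaluated the rule, the precomputed series can be queried directly, for example:

# the recorded series produced by the rule above, ready for dashboards and alerts
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{namespace="monitoring"}

This saves re-evaluating the full join against kube_pod_info on every dashboard refresh.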

@brondum commented Mar 21, 2020

Now it makes sense! Perfect, thank you very much for that explanation. :D

@sandys commented May 6, 2020

@Vad1mo have you gotten this to work? We are not using metrics-server in our k3s install and are instead using prometheus-adapter.
We are looking to have CPU and other metrics exported from cadvisor and hooked into Kubernetes via the prometheus-adapter.
Did you get this to work? Could you explain how?

@Vad1mo commented May 6, 2020

Yes, it works now; it was my config for the Prometheus Operator. A separate cAdvisor deployment is not needed.

@sandys commented May 6, 2020

@Vad1mo that's great to know! Can you share your Prometheus Operator config?

Would like to know two additional things:

  1. Did you set up k3s to use Docker instead of containerd? Lots of people are reporting that cadvisor doesn't work properly without that.
  2. Do you apply your Prometheus Operator config after setting up the operator, or did you include this config within the deploy of the operator itself (by forking it)?

Any help would be very welcome!

@Vad1mo commented May 6, 2020

  1. containerd, with - containerd: /run/k3s/containerd/containerd.sock
  2. It's part of the operator installation that also installs Prometheus.

We use https://github.com/cloudposse/helmfiles/blob/master/releases/prometheus-operator.yaml
and this is our modification to it:

 - global:
          rbac:
            create: true
            pspEnabled: true
        defaultRules:
          create: true
          rules:
            kubernetesResources: false
        additionalPrometheusRulesMap:
          # These rules are copied from https://raw.githubusercontent.com/coreos/kube-prometheus/release-0.1/manifests/prometheus-rules.yaml
          # Only CPUThrottlingHigh has been modified, to be replaced with a customizable version
          # to reduce alerts caused by https://github.com/kubernetes/kubernetes/pull/63437
          kubernetes-resources:
            groups:
              - name: kubernetes-resources
                rules:
                  - alert: KubeCPUOvercommit
                    annotations:
                      message: Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.
                      runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
                    expr: |-
                      sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum)
                        /
                      sum(node:node_num_cpu:sum)
                        >
                      (count(node:node_num_cpu:sum)-1) / count(node:node_num_cpu:sum)
                    for: 5m
                    labels:
                      severity: warning
                  - alert: KubeMemOvercommit
                    annotations:
                      message: Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.
                      runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit
                    expr: |-
                      sum(namespace_name:kube_pod_container_resource_requests_memory_bytes:sum)
                        /
                      sum(node_memory_MemTotal_bytes)
                        >
                      (count(node:node_num_cpu:sum)-1)
                        /
                      count(node:node_num_cpu:sum)
                    for: 5m
                    labels:
                      severity: warning
                  - alert: KubeCPUOvercommit
                    annotations:
                      message: Cluster has overcommitted CPU resource requests for Namespaces.
                      runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
                    expr: |-
                      sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"})
                        /
                      sum(node:node_num_cpu:sum)
                        > 1.5
                    for: 5m
                    labels:
                      severity: warning
                  - alert: KubeMemOvercommit
                    annotations:
                      message: Cluster has overcommitted memory resource requests for Namespaces.
                      runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit
                    expr: |-
                      sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"})
                        /
                      sum(node_memory_MemTotal_bytes{job="node-exporter"})
                        > 1.5
                    for: 5m
                    labels:
                      severity: warning
                  - alert: KubeQuotaExceeded
                    annotations:
                      message: Namespace {{`{{ $labels.namespace }}`}} is using {{`{{ printf "%0.0f" $value }}`}}% of its {{`{{ $labels.resource }}`}} quota.
                      runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaexceeded
                    expr: |-
                      100 * kube_resourcequota{job="kube-state-metrics", type="used"}
                        / ignoring(instance, job, type)
                      (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
                        > 90
                    for: 15m
                    labels:
                      severity: warning
                  # Original rule is 25% for 15 minutes
                  - alert: CPUThrottlingHigh-{{- env "PROMETHEUS_OPERATOR_RULES_CPU_THROTTLING_HIGH_THRESHOLD_PERCENT" | default "50" -}}-{{- env "PROMETHEUS_OPERATOR_RULES_CPU_THROTTLING_HIGH_THRESHOLD_TIME" | default "25m" }}
                    annotations:
                      message: '{{`{{ printf "%0.0f" $value }}`}}% throttling of CPU in namespace {{`{{ $labels.namespace }}`}} for container {{`{{ $labels.container_name }}`}} in pod {{`{{ $labels.pod_name }}`}}.'
                      runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh
                    expr: |-
                      100 * sum(increase(container_cpu_cfs_throttled_periods_total{container_name!="", }[5m])) by (container_name, pod_name, namespace)
                        /
                      sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container_name, pod_name, namespace)
                        > {{ env "PROMETHEUS_OPERATOR_RULES_CPU_THROTTLING_HIGH_THRESHOLD_PERCENT" | default "50" }}
                    for: 25m
                    labels:
                      severity: warning
        prometheusOperator:
          enabled: true
          # log level must be one of "all", "debug",	"info", "warn",	"error", "none"
          logLevel: "warn"
          resources:
            limits:
              cpu: "100m"
              memory: "96Mi"
            requests:
              cpu: "20m"
              memory: "48Mi"
          image:
            pullPolicy: "IfNotPresent"
        prometheus:
          enabled: true
          podDisruptionBudget:
            enabled: false
          ingress:
            enabled: false
          additionalServiceMonitors: []
          additionalPodMonitors: []
          prometheusSpec:
            replicas: 1
            retention: 45d
            logLevel: "warn"
            podMetadata:
              annotations:
                "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
            scrapeInterval: ""
            evaluationInterval: ""
            ## If true, a nil or {} value for prometheus.prometheusSpec.ruleSelector will cause the
            ## prometheus resource to be created with selectors based on values in the helm deployment,
            ## which will also match the PrometheusRule resources created.
            ## If false, a nil or or {} value for ruleSelector will select all PrometheusRule resources.
            ruleSelectorNilUsesHelmValues: false
            ## serviceMonitorSelectorNilUsesHelmValues works just like ruleSelectorNilUsesHelmValues
            serviceMonitorSelectorNilUsesHelmValues: false
            #externalUrl: "{{- env "PROMETHEUS_PROMETHEUS_EXTERNAL_URL" | default (print "https://api." (env "KOPS_CLUSTER_NAME") "/api/v1/namespaces/monitoring/services/prometheus-operator-prometheus:web/proxy/") }}"
            resources:
              limits:
                cpu: 300m
                memory: 1526Mi
              requests:
                cpu: 75m
                memory: 768Mi
        alertmanager:
          enabled: true
          ## Alertmanager configuration directives
          ## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
          ##      https://prometheus.io/webtools/alerting/routing-tree-editor/
          ##
          config:
            global:
              resolve_timeout: 5m
            route:
              group_by:
                - "alertname"
                - "namespace"
              group_wait: 30s
              group_interval: 5m
              repeat_interval: 12h
              receiver: "general"
              routes:
                - match:
                    alertname: Watchdog
                  receiver: "null"
            receivers:
              - name: "null"
              - name: "general"
            templates:
              - ./*.tmpl
          alertmanagerSpec:
            #externalUrl: "{{- env "PROMETHEUS_ALERTMANAGER_EXTERNAL_URL" | default (print "https://api." (env "KOPS_CLUSTER_NAME") "/api/v1/namespaces/monitoring/services/prometheus-operator-alertmanager:web/proxy/") }}"
            resources:
              limits:
                cpu: "200m"
                memory: "96Mi"
              requests:
                cpu: "10m"
                memory: "24Mi"
        grafana:
          # https://github.com/helm/charts/tree/dfa02f9b117a29de889f9c35e0b3abb6012a0877/stable/grafana#configuration
          enabled: true
          adminPassword: "CHANGEME"
          defaultDashboardsEnabled: true
          sidecar:
            dashboards:
              enabled: true
              searchNamespace: ALL
              label: grafana_dashboard
          plugins:
            - grafana-piechart-panel
          resources:
            limits:
              cpu: "250m"
              memory: "128Mi"
            requests:
              cpu: "25m"
              memory: "72Mi"
          grafana.ini:
            dataproxy:
              # default is 30 seconds
              timeout: 90
            server:
            # root_url: "https://api.{{- env "KOPS_CLUSTER_NAME" }}/api/v1/namespaces/kube-system/services/prometheus-operator-grafana:service/proxy/"
            # root_url: "{{- env "PROMETHEUS_GRAFANA_ROOT_URL" | default (print "https://api." (env "KOPS_CLUSTER_NAME") "/api/v1/namespaces/monitoring/services/prometheus-operator-grafana:service/proxy/") }}"
            auth.anonymous:
              enabled: true
              org_role: Admin
        kubeStateMetrics:
          enabled: true
        kubeApiServer:
          enabled: true
        kubelet:
          enabled: true
          # In general, few clusters are set up to allow the kubelet to authenticate a bearer token, and
          # the HTTPS endpoint requires authentication, so Prometheus cannot access it.
          # The HTTP endpoint does not require authentication, so Prometheus can access it.
          # See https://github.com/coreos/prometheus-operator/issues/926
          serviceMonitor:
            https: true
        kubeControllerManager:
          enabled: true
        coreDns:
          enabled: true
        kubeDns:
          enabled: false
        kubeEtcd:
          # Access to etcd is a huge security risk, so nodes are blocked from accessing it.
          # Therefore Prometheus cannot access it without extra setup, which is beyond the scope of this helmfile.
          # See https://github.com/kubernetes/kops/issues/5852
          #     https://github.com/kubernetes/kops/issues/4975#issuecomment-381055946
          #     https://github.com/coreos/prometheus-operator/issues/2397
          #     https://github.com/coreos/prometheus-operator/blob/v0.19.0/contrib/kube-prometheus/docs/Monitoring%20external%20etcd.md
          #     https://gist.github.com/jhohertz/476bd616d4171649a794b8c409f8d548
          # So we disable it since it is not going to work anyway
          enabled: false
        kubeScheduler:
          enabled: true
        nodeExporter:
          enabled: true
    # set:
    # - name: "alertmanager.templateFiles.deployment\\.tmpl"
    #   file: values/kube-prometheus.alerts.template

@sandys commented May 6, 2020

@Vad1mo thank you so much for this!

Are you by any chance using this for autoscaling with prometheus-adapter? If so, what has been your experience there?

@adelcast:

For now, I have to use the Docker runtime, since containerd mounts volumes as root:root, which breaks my Druid installation. Using the Docker runtime, I get metrics that are missing container/image:

container_cpu_system_seconds_total{container="",id="/",image="",name="",namespace="",pod=""} 385870.64 1605141064258

The workarounds mentioned here (--kubelet-arg containerd=/run/k3s/containerd/containerd.sock) don't apply to Docker. Any guidance on what to do to make sure cadvisor populates container/image when using the Docker runtime?

I am running k3s version v1.18.9+k3s1 (630bebf)

thanks!

@tbcdns commented Jan 29, 2021

Same issue here; it was fine before installing the latest version (v1.20.2+k3s1).

The output of kubectl get --raw /api/v1/nodes/<node>/proxy/metrics/cadvisor contains no values for the image and container labels.

@brandond (Contributor):

@tbcdns I think you're looking for #2831 - this one has been closed for a long time.
