Cadvisor not reporting Container/Image metadata #473
@ibuildthecloud do you have any insight into this? Did you happen to run into any issues of this sort? I saw that you are hosting the cadvisor fork for k3s.
I'm running into the same issue. It makes cadvisor metrics in Prometheus basically unusable, since container_name is, I think, the only way to distinguish between certain aggregated and individual metrics.
@lsmith130 Seems like a pretty important metric. I am unable to see memory statistics at the pod and container level.
I've run into this as well. For some metrics, querying Prometheus returns the following (showing just one entry; all others are similar):
I found these issues with metrics related to disk I/O and network I/O. Edit: this could be related to google/cadvisor#2249 (containerd instead of the Docker runtime).
There are other metrics that also don't have pod and image metadata, like … On K3s:
On K8s:
This breaks many metrics from the monitoring stack and rules provided by https://github.com/carlosedp/cluster-monitoring and the kube-prometheus project libraries.
@carlosedp are you running k3s with the embedded containerd or with Docker?
Did the default deploy with …
I believe running it with Docker does fix many of the issues, but I see the recommendation to use containerd instead. Not sure what the cost/benefit would be between having the metrics and using Docker with k3s.
To expand on @cfchad's comment, the docs state:
https://rancher.com/docs/k3s/latest/en/configuration/#containerd-and-docker
@ibuildthecloud, before I try switching to …
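For reference, that docs page describes switching k3s to the Docker runtime via the install script; a sketch of the documented command:

```sh
# Install/run the k3s server with Docker instead of the embedded containerd
curl -sfL https://get.k3s.io | sh -s - --docker
```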
I deployed with Docker but now my ingress routes are not working. Is there any difference in creating ingress resources when using Docker?
@ibuildthecloud confirming that when K3s is deployed with Docker as the runtime, the cadvisor metrics have all the metadata needed by the monitoring stack:
@carlosedp Thanks for finding that out. We'll have to look into why containerd doesn't report the metrics.
@carlosedp Regarding your comment:
Saw your tweet about the Traefik metrics, so I was wondering if you had to do anything special to get ingress working again.
Ah yes, please disregard that comment. It was my environment that was messed up. I opened a PR against K3s so it will expose the Traefik metrics by default once it gets merged.
Hello, any news on this topic? I will try changing my k3s cluster to use Docker too.
Hi @ibuildthecloud, any news on why K3s is not reporting cadvisor metrics without Docker as the runtime? Thanks!
Hello @carlosedp, I think it's OK now with 0.10.x.
Cool, gonna check and report back! Thanks!
Tested on K3s v0.10.2 and metrics are generated without requiring …
Thanks for checking!
I might have spoken too early. Starting from a fresh K3s, I deployed the stack and the CPU metrics still don't show up. Here is an example rule that fails:
The metrics don't have …
I'm running k3s v1.0.0 (18bd921) on a bunch of Raspberry Pis and also ran into this particular problem. Unfortunately, neither the …
Any idea how to fix, or at least further investigate, this issue? I lack a bit of background knowledge in that area. @carlosedp, a few months back you wrote:
I have also noticed that when using the Docker engine, I cannot access services from the outside world. I could address this problem by running …
I have also encountered this particular issue since k3s 0.9+.
Does anyone have a reference to where cadvisor calculates the container name? Perhaps k3s is simply not specifying enough data when it creates containers. |
I tried with v1.17.3+k3s1 (5b17a17):

```
$ ps aux | grep /usr/local/bin/k3s
root       543 12.7  2.0 799424 671676 ?      Ssl  23:25   0:45 /usr/local/bin/k3s server --no-deploy traefik --default-local-storage-path /data --node-external-ip 192.168.10.131 --kubelet-arg containerd=/run/k3s/containerd/containerd.sock
root      6781  0.0  0.0   6180    884 pts/0  S+   23:31   0:00 grep /usr/local/bin/k3s
```

However, the node-exporter doesn't show …
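As an aside, the --kubelet-arg flag shown above can also be set via k3s's config file on newer releases (a sketch; config-file support landed in k3s releases after the v1.17 used here):

```yaml
# /etc/rancher/k3s/config.yaml -- config-file equivalent of the CLI flag above
kubelet-arg:
  - "containerd=/run/k3s/containerd/containerd.sock"
```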
@Vad1mo Did you ever figure this one out?
Yes, it works. Here is the helmfile I use, with charts from the code-chris/helm-charts repository:

```yaml
repositories:
- name: stable
url: "https://kubernetes-charts.storage.googleapis.com/"
- name: code-chris
#url: https://code-chris.github.io/helm-charts -- Wait until 8gears patch is merged
url: git+https://github.com/8gears/helm-charts@charts/cadvisor?sparse=1
releases:
- name: "cadvisor"
chart: "code-chris/cadvisor"
namespace: "monitoring"
values:
- metrics:
enabled: true
- resources:
limits:
cpu: 500m
memory: 256Mi
requests:
cpu: 100m
memory: 128Mi
- name: "prometheus-operator"
chart: "stable/prometheus-operator"
# version: "6.2.1"
  namespace: "monitoring"
```
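If useful, applying that helmfile is just (filename assumed for illustration):

```sh
# Render and deploy all releases declared in the helmfile
helmfile -f helmfile.yaml apply
```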
@Vad1mo Any reason to use code-chris's cadvisor instead of the "builtin" metrics-server from k3s?
I am not sure whether the metrics-server exposes a similar set of metrics to cadvisor.
It seems to, actually, at least from the k3s perspective, except I am still missing those …
Do you have a reference at hand for what is exposed?
I think you have a point there! When I look at my scrape config, I actually just scrape the following: … Maybe that does not contain those.
Thanks. I see some similar metrics; do you use the Prometheus operator and its dashboards?
I actually don't use the operator; I wanted to go low-tech and learn some Prometheus scrape configs instead of using the CRDs. But I do use the mixin dashboards, which is where I have my problems. :D
Would you mind having a quick chat? I think we are working on something similar: https://meet.google.com/xjg-vsgh-zwu
I am puzzled about how to tell Prometheus what URL to scrape, and whether that is the correct scraping URL. It doesn't seem to be dynamic, so hardcoding node names is required.
Actually, my scrape config for that looks like this:
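(The original snippet did not survive in this thread; a minimal sketch of what such a kubelet cAdvisor scrape job can look like, assuming Prometheus runs in-cluster with a service account that can reach the kubelets:)

```yaml
scrape_configs:
  - job_name: "kubelet-cadvisor"
    scheme: https
    # cAdvisor metrics are served by the kubelet under this path
    metrics_path: /metrics/cadvisor
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      # kubelet serving certificates are often self-signed
      insecure_skip_verify: true
    kubernetes_sd_configs:
      # one target per node, discovered dynamically -- no hardcoded node names
      - role: node
```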
I've deployed a fresh K3s (v1.17.3+k3s1) onto an ARM64 node and updated my monitoring stack (https://github.com/carlosedp/cluster-monitoring) with the latest mixins. The JSON output of the metrics still doesn't include some information, but the dashboards are populated correctly:

```
container_cpu_usage_seconds_total{cpu="total",endpoint="https-metrics",id="/kubepods/burstable/podcd2b5231-b73c-44d7-b6a1-4a51dab6e21f",instance="192.168.15.15:10250",job="kubelet",metrics_path="/metrics/cadvisor",namespace="monitoring",node="odroidn2",pod="prometheus-k8s-0",service="kubelet"}
```

I believe most problems related to this have been fixed now.
@carlosedp Neat! But what I can't seem to understand is where that metric comes from. I just spun up a fresh k3s and yes, that works, but I can't look it up. What am I missing here?
They are Prometheus rules, composed of many other metrics. Most of the rules are defined here: https://github.com/carlosedp/cluster-monitoring/blob/master/manifests/prometheus-rules.yaml
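For illustration, a recording rule of that shape (taken loosely from the kube-prometheus mixins; rule names and label sets vary by version, so treat this as a sketch rather than the exact definition in that file):

```yaml
groups:
  - name: k8s.rules
    rules:
      # Precompute per-namespace CPU usage from the raw cAdvisor counter;
      # dashboards then query the recorded metric instead of the raw series.
      - record: namespace:container_cpu_usage_seconds_total:sum_rate
        expr: |
          sum by (namespace) (
            rate(container_cpu_usage_seconds_total{job="kubelet", image!=""}[5m])
          )
```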
Now it makes sense! Perfect, thank you very much for that explanation. :D
@Vad1mo Have you gotten this to work? We are not using metrics-server in our k3s install and are instead using prometheus-adapter.
Yes, it works now; the problem was my config for the Prometheus operator. A standalone cAdvisor is not needed.
@Vad1mo That's great to know! Can you share your Prometheus operator config? I would like to know two additional things:
Any help would be very welcome!
We use https://github.com/cloudposse/helmfiles/blob/master/releases/prometheus-operator.yaml with these values:

```yaml
global:
rbac:
create: true
pspEnabled: true
defaultRules:
create: true
rules:
kubernetesResources: false
additionalPrometheusRulesMap:
# These rules are copied from https://raw.githubusercontent.com/coreos/kube-prometheus/release-0.1/manifests/prometheus-rules.yaml
# Only CPUThrottlingHigh has been modified, to be replaced with a customizable version
# to reduce alerts caused by https://github.com/kubernetes/kubernetes/pull/63437
kubernetes-resources:
groups:
- name: kubernetes-resources
rules:
- alert: KubeCPUOvercommit
annotations:
message: Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.
runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
expr: |-
sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum)
/
sum(node:node_num_cpu:sum)
>
(count(node:node_num_cpu:sum)-1) / count(node:node_num_cpu:sum)
for: 5m
labels:
severity: warning
- alert: KubeMemOvercommit
annotations:
message: Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.
runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit
expr: |-
sum(namespace_name:kube_pod_container_resource_requests_memory_bytes:sum)
/
sum(node_memory_MemTotal_bytes)
>
(count(node:node_num_cpu:sum)-1)
/
count(node:node_num_cpu:sum)
for: 5m
labels:
severity: warning
- alert: KubeCPUOvercommit
annotations:
message: Cluster has overcommitted CPU resource requests for Namespaces.
runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
expr: |-
sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"})
/
sum(node:node_num_cpu:sum)
> 1.5
for: 5m
labels:
severity: warning
- alert: KubeMemOvercommit
annotations:
message: Cluster has overcommitted memory resource requests for Namespaces.
runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit
expr: |-
sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"})
/
sum(node_memory_MemTotal_bytes{job="node-exporter"})
> 1.5
for: 5m
labels:
severity: warning
- alert: KubeQuotaExceeded
annotations:
message: Namespace {{`{{ $labels.namespace }}`}} is using {{`{{ printf "%0.0f" $value }}`}}% of its {{`{{ $labels.resource }}`}} quota.
runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaexceeded
expr: |-
100 * kube_resourcequota{job="kube-state-metrics", type="used"}
/ ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
> 90
for: 15m
labels:
severity: warning
# Original rule is 25% for 15 minutes
- alert: CPUThrottlingHigh-{{- env "PROMETHEUS_OPERATOR_RULES_CPU_THROTTLING_HIGH_THRESHOLD_PERCENT" | default "50" -}}-{{- env "PROMETHEUS_OPERATOR_RULES_CPU_THROTTLING_HIGH_THRESHOLD_TIME" | default "25m" }}
annotations:
message: '{{`{{ printf "%0.0f" $value }}`}}% throttling of CPU in namespace {{`{{ $labels.namespace }}`}} for container {{`{{ $labels.container_name }}`}} in pod {{`{{ $labels.pod_name }}`}}.'
runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh
expr: |-
100 * sum(increase(container_cpu_cfs_throttled_periods_total{container_name!="", }[5m])) by (container_name, pod_name, namespace)
/
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container_name, pod_name, namespace)
> {{ env "PROMETHEUS_OPERATOR_RULES_CPU_THROTTLING_HIGH_THRESHOLD_PERCENT" | default "50" }}
for: 25m
labels:
severity: warning
prometheusOperator:
enabled: true
# log level must be one of "all", "debug", "info", "warn", "error", "none"
logLevel: "warn"
resources:
limits:
cpu: "100m"
memory: "96Mi"
requests:
cpu: "20m"
memory: "48Mi"
image:
pullPolicy: "IfNotPresent"
prometheus:
enabled: true
podDisruptionBudget:
enabled: false
ingress:
enabled: false
additionalServiceMonitors: []
additionalPodMonitors: []
prometheusSpec:
replicas: 1
retention: 45d
logLevel: "warn"
podMetadata:
annotations:
"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
scrapeInterval: ""
evaluationInterval: ""
## If true, a nil or {} value for prometheus.prometheusSpec.ruleSelector will cause the
## prometheus resource to be created with selectors based on values in the helm deployment,
## which will also match the PrometheusRule resources created.
## If false, a nil or or {} value for ruleSelector will select all PrometheusRule resources.
ruleSelectorNilUsesHelmValues: false
## serviceMonitorSelectorNilUsesHelmValues works just like ruleSelectorNilUsesHelmValues
serviceMonitorSelectorNilUsesHelmValues: false
#externalUrl: "{{- env "PROMETHEUS_PROMETHEUS_EXTERNAL_URL" | default (print "https://api." (env "KOPS_CLUSTER_NAME") "/api/v1/namespaces/monitoring/services/prometheus-operator-prometheus:web/proxy/") }}"
resources:
limits:
cpu: 300m
memory: 1526Mi
requests:
cpu: 75m
memory: 768Mi
alertmanager:
enabled: true
## Alertmanager configuration directives
## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
## https://prometheus.io/webtools/alerting/routing-tree-editor/
##
config:
global:
resolve_timeout: 5m
route:
group_by:
- "alertname"
- "namespace"
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: "general"
routes:
- match:
alertname: Watchdog
receiver: "null"
receivers:
- name: "null"
- name: "general"
templates:
- ./*.tmpl
alertmanagerSpec:
#externalUrl: "{{- env "PROMETHEUS_ALERTMANAGER_EXTERNAL_URL" | default (print "https://api." (env "KOPS_CLUSTER_NAME") "/api/v1/namespaces/monitoring/services/prometheus-operator-alertmanager:web/proxy/") }}"
resources:
limits:
cpu: "200m"
memory: "96Mi"
requests:
cpu: "10m"
memory: "24Mi"
grafana:
# https://github.com/helm/charts/tree/dfa02f9b117a29de889f9c35e0b3abb6012a0877/stable/grafana#configuration
enabled: true
adminPassword: "CHANGEME"
defaultDashboardsEnabled: true
sidecar:
dashboards:
enabled: true
searchNamespace: ALL
label: grafana_dashboard
plugins:
- grafana-piechart-panel
resources:
limits:
cpu: "250m"
memory: "128Mi"
requests:
cpu: "25m"
memory: "72Mi"
grafana.ini:
dataproxy:
# default is 30 seconds
timeout: 90
server:
# root_url: "https://api.{{- env "KOPS_CLUSTER_NAME" }}/api/v1/namespaces/kube-system/services/prometheus-operator-grafana:service/proxy/"
# root_url: "{{- env "PROMETHEUS_GRAFANA_ROOT_URL" | default (print "https://api." (env "KOPS_CLUSTER_NAME") "/api/v1/namespaces/monitoring/services/prometheus-operator-grafana:service/proxy/") }}"
auth.anonymous:
enabled: true
org_role: Admin
kubeStateMetrics:
enabled: true
kubeApiServer:
enabled: true
kubelet:
enabled: true
  # In general, few clusters are set up to allow the kubelet to authenticate a bearer token, and
# the HTTPS endpoint requires authentication, so Prometheus cannot access it.
# The HTTP endpoint does not require authentication, so Prometheus can access it.
# See https://github.com/coreos/prometheus-operator/issues/926
serviceMonitor:
https: true
kubeControllerManager:
enabled: true
coreDns:
enabled: true
kubeDns:
enabled: false
kubeEtcd:
# Access to etcd is a huge security risk, so nodes are blocked from accessing it.
# Therefore Prometheus cannot access it without extra setup, which is beyond the scope of this helmfile.
# See https://github.com/kubernetes/kops/issues/5852
# https://github.com/kubernetes/kops/issues/4975#issuecomment-381055946
# https://github.com/coreos/prometheus-operator/issues/2397
# https://github.com/coreos/prometheus-operator/blob/v0.19.0/contrib/kube-prometheus/docs/Monitoring%20external%20etcd.md
# https://gist.github.com/jhohertz/476bd616d4171649a794b8c409f8d548
# So we disable it since it is not going to work anyway
enabled: false
kubeScheduler:
enabled: true
nodeExporter:
enabled: true
# set:
# - name: "alertmanager.templateFiles.deployment\\.tmpl"
# - name: "alertmanager.templateFiles.deployment\\.tmpl"
#   file: values/kube-prometheus.alerts.template
```
@Vad1mo Thank you so much for this! Are you by chance using this for autoscaling with prometheus-adapter? If you are, what has been your experience there?
For now, I have to use the Docker runtime, since containerd mounts volumes as root:root, which breaks my Druid installation. Using the Docker runtime, I get metrics that are missing container/image:

```
container_cpu_system_seconds_total{container="",id="/",image="",name="",namespace="",pod=""} 385870.64 1605141064258
```

The workarounds mentioned here (--kubelet-arg containerd=/run/k3s/containerd/containerd.sock) don't apply to Docker. Any guidance on what to do to make sure cadvisor populates container/image when using the Docker runtime? I am running k3s version v1.18.9+k3s1 (630bebf). Thanks!
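For what it's worth, a series with id="/" (as in the sample above) is the root-cgroup aggregate, which carries empty container/image labels by design; whether that explains all the missing labels here is a separate question. One way to keep only per-container series when querying is to filter out the empty label:

```
sum by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
)
```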
Same issue here; it was fine before installing the latest version (v1.20.2+k3s1).
Describe the bug
When making the call to retrieve metrics via cAdvisor, the container and image values are empty on all metrics.
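(One way to make that call when reproducing, via the API server's node proxy; NODE_NAME is a placeholder for a real node:)

```sh
# Fetch the kubelet's cAdvisor metrics through the API server proxy
kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/metrics/cadvisor" | head -n 20
```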
To Reproduce
Install k3s via multipass https://medium.com/@zhimin.wen/running-k3s-with-multipass-on-mac-fbd559966f7c
Expected behavior
The container and image values should be populated.
Additional context
Wondering if it might be related to #213