kubectl top nodes/pods values vs uptime/free/top #447
Comments
Thanks for your feedback @wu105. From what I understand, your main problem is the lack of proper tools to investigate the reason for node crashes and resource spikes. I don't think metrics will be a good tool for that problem. To do proper root-causing you need to look into the layers below Kubernetes (e.g. kernel logs, memory pressure). Kubernetes is a container management solution that is independent of the platform it runs on (OS, container runtime, cloud). Monitoring of the Linux kernel is best solved by third-party solutions like Prometheus. I noted possible improvements:
Well, Kubernetes is supposed to manage the resources and achieve high utilization, with the power to kill pods when necessary. The node crashes we have seen leave no traces behind that we can find. They tend to happen during higher load, usually high memory and low CPU utilization with moderate networking and disk I/O, and can be spiky, which is probably typical for Java build jobs. There is a question lurking in the distance: does Kubernetes manage such best-effort and somewhat spiky loads well?

By the way, Prometheus keeps the metrics history leading up to the crashes, but we have not found anything obvious. This leads us to look into what Kubernetes sees as the metrics and how it uses them, in part to interpret the metrics recorded by Prometheus. As a thorn in the side, we are in a peculiar situation: swapping is off (because swapping is a challenge for Kubernetes to manage), yet the system still uses large amounts of memory for buffers and caches, and reclaims that buffer/cache memory much as it would swap.
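For what it's worth, the buffer/cache behavior is visible directly on a node; a minimal check (plain Linux, nothing Kubernetes-specific):

```sh
# "free" counts completely unused pages; "available" estimates memory that
# can be made available to new workloads, including reclaimable buff/cache.
# With swap off, cache reclaim is the only wiggle room left.
free -m
grep -E 'MemFree|MemAvailable' /proc/meminfo
```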
I'm also interested in this to some extent. I'm curious what value for CPU% would show that the node is pretty close to 100% utilised; in my case it looks like maybe the CPU% in kubectl top understates it. We do use Prometheus etc. for any real monitoring, but it would help to know what these numbers actually mean.
This is something that Kubernetes can help you achieve, but it is not guaranteed just by migrating to it.
You can read more about out-of-resource handling here: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
Kubelet currently uses cAdvisor to measure resources and to make pod eviction decisions (if a node is overloaded). All this information is currently exposed on the /metrics/cadvisor endpoint, which can be collected by Prometheus. I would look into projects using Prometheus for monitoring Kubernetes.
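For reference, those raw cAdvisor metrics can be inspected directly through the API server's node proxy; a minimal sketch, assuming kubectl access to the cluster:

```sh
# Pull the raw cAdvisor metrics the kubelet exposes for the first node.
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/cadvisor" \
  | grep -m5 container_cpu_usage_seconds_total
```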
I think there is a misunderstanding about what kubectl top shows.
Well, let us start with something we have evidence for: when too many Java jobs ended up on some nodes, those nodes ran at a high load factor and became unresponsive. There were three expectations:

1. Kubernetes was expected to kill some of the jobs, but it did not.
2. Kubernetes was expected to distribute the jobs evenly to begin with, but some nodes ended up with more than others.
3. Kubernetes was expected to delay starting some of the jobs due to resource pressure.

We did not specify requests and limits because using the peak numbers would lead to severe node under-utilization, and smaller numbers probably won't help, especially without knowing exactly what Kubernetes sees.

The document on out-of-resource handling seems to be mostly about out-of-memory. If Kubernetes does not have an out-of-CPU kill, we may need one. Similarly for out-of-storage.

Our Java jobs tend to use high CPU when initializing, then idle most of the time with occasional spikes. Such characteristics cannot be expressed in requests and limits, but fit "best effort" well. Are there documents on running such workloads on Kubernetes?

From a higher level, static requests and limits most likely won't cut it, and turning off swapping may have made things worse. Without complicating the model too much, a new "head room" number, i.e., start a pod only when a node has that much resource free, might help. The "head room" is similar to a request, but is not guaranteed, which would factor more of the actual resource usage into scheduling. Similarly, the nodes may need to hold some resources in reserve. On our own, we are considering turning swap back on to see how it goes.

I understand that the Kubernetes metrics are not meant for Kubernetes scheduling. However, when there are issues that show up in Linux metrics, such as out of disk, out of memory, or load too high, and Kubernetes does not respond as expected, we need to know how the metrics are related and whether the differences are contributing factors, in order to bring Kubernetes and our expectations in line. The metrics documentation may want to "ground" the Kubernetes metrics in Linux metrics, and be explicit about whether a number is a present value, a measurement from 5 minutes ago, or some average.

Now on to some speculation that our node crashes were caused by out of memory. If we can make this assumption, then we would have the same expectations as above of Kubernetes to prevent such crashes. Here the issue may be whether Kubernetes can detect OOM early enough to prevent a crash.

For filing the issue with the proper group, can you help me? This is already spun off of #193 in search of the proper group.
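For the "high CPU at startup, mostly idle afterwards" shape, a Burstable spec (small requests for the steady state, larger limits to cap the spikes) is the usual compromise; a minimal sketch, with the name, image, and numbers purely illustrative:

```sh
# Hypothetical Java build job: modest steady-state request, higher limit.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: java-build-example    # illustrative name
spec:
  containers:
  - name: build
    image: maven:3-jdk-11     # any Java build image
    command: ["mvn", "package"]
    resources:
      requests:               # what the scheduler reserves on the node
        cpu: "250m"
        memory: "1Gi"
      limits:                 # hard cap; exceeding the memory limit => OOMKill
        cpu: "2"
        memory: "2Gi"
EOF
```

The scheduler packs nodes by summed requests, so small requests keep utilization high, while limits bound how far a spike can go.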
I get to the same values.
@wuestkamp Great! That is what 'kubectl top node' is reporting!

Talking about metrics from Prometheus, at the risk of being off topic: node_filesystem_size_bytes, node_filesystem_free_bytes, and node_filesystem_avail_bytes are reporting on the /var/lib/docker device, matching what we see on the node.

By the way, the node_filesystem_*_bytes metrics are labeled with the host's / (root) device, which would be wrong when the docker storage /var/lib/docker is on a different device.
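A minimal way to see those filesystem metrics with their device and mountpoint labels, so a mislabeled device stands out (the Prometheus URL is a placeholder):

```sh
# Query filesystem metrics per device/mountpoint from a reachable Prometheus.
PROM=http://prometheus.example:9090   # placeholder
curl -s "${PROM}/api/v1/query" \
  --data-urlencode 'query=node_filesystem_avail_bytes{mountpoint=~"/|/var/lib/docker"}' \
  | python3 -m json.tool
```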
I have a similar issue with the metrics shown by kubectl top pods.
However, HPA shows only 7m (4%), hence it doesn't scale up.
Any ideas as to why?
@hasakura12 Please reach out to SIG Autoscaling with similar problems. You can reach them via the Kubernetes Slack. My suggestion would be to look into resource requests, but please don't continue this topic on an unrelated issue.
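For context, the HPA percentage is usage divided by the pod's CPU request, not by node capacity, which is usually why kubectl top and HPA seem to disagree; a minimal sketch, with the pod name hypothetical:

```sh
# Compare raw usage (what kubectl top shows) with the request HPA divides by.
kubectl top pod my-app-pod                    # e.g. CPU(cores): 7m
kubectl get pod my-app-pod \
  -o jsonpath='{.spec.containers[*].resources.requests.cpu}'   # e.g. 175m
# HPA utilization = 7m / 175m = 4%, matching the "7m (4%)" above.
```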
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What would you like to be added:
This is spun off from #193.
I can understand the concept of units, requests, limits, and actuals, but have a hard time reconciling the 'kubectl top' output with that of "traditional Linux commands", e.g., uptime (load averages over the past 1, 5, and 15 minutes), free (memory used, free, shared, buff/cache, available), and top ('uptime' plus 'free -k', then VIRT/RES/SHR memory and %CPU/%MEM per process).
One question on the kubectl top commands is: for what time period are they reporting? When are samples taken, and which samples are rolled up into the output? The Linux commands, on the other hand, always report the state at the time the command is executed, and they include non-Kubernetes processes.
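A minimal sketch of that comparison, assuming SSH access to the node and a placeholder node name:

```sh
# Compare what Kubernetes reports with what the node itself reports.
NODE=worker-1                  # placeholder node name
kubectl top node "$NODE"       # sample served by metrics-server
ssh "$NODE" 'uptime; free -m'  # instantaneous load averages and memory
# metrics-server scrapes the kubelet on a periodic resolution (see its
# --metric-resolution flag), so kubectl top can lag the live numbers.
```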
I have looked at https://github.com/kubernetes-sigs/metrics-server/blob/master/FAQ.md which offers this: "How resource utilization is calculated?
Metrics server doesn't provide resource utilization metric (e.g. percent of CPU used). Kubectl top and HPA calculate those values by themselves based on pod resource requests or node capacity."
kubectl top man pages are silent on how they calculate the utilizations they report.
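The raw values that kubectl top formats can be fetched from the Metrics API directly, and the denominators for the percentages come from the node object; a minimal sketch:

```sh
# Raw usage values served by metrics-server (what kubectl top consumes).
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | python3 -m json.tool
# Capacity/allocatable used as the denominator for node utilization %:
NODE=worker-1                  # placeholder node name
kubectl describe node "$NODE" | grep -A5 -E 'Capacity|Allocatable'
```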
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details explains HPA, but it uses terms like currentMetricValue, desiredMetricValue, targetAverageValue, and targetAverageUtilization. We may want to define those terms in terms of the numbers reported by the kubectl top commands, as well as by kubectl describe node and the limits and requests specified on the containers.
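For reference, the core formula from that page, with a worked example (the replica count and values are made up):

```sh
# From the HPA algorithm-details doc:
#   desiredReplicas = ceil( currentReplicas * currentMetricValue / desiredMetricValue )
# Example: 2 replicas, current average CPU 7m, target 50m:
echo $(( (2 * 7 + 50 - 1) / 50 ))   # integer ceil of 14/50 -> 1, so no scale-up
```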
As for Prometheus monitoring, its dashboards present their own set of CPU, memory, and disk metrics, rolled up to nodes, namespaces, etc. The base-level metrics seem to come from Kubernetes per-container metrics, and the roll-up periods vary from chart to chart. The disk part is not reported by the kubectl top commands at all, and seems not to cover all Linux partitions. Reconciling Prometheus dashboards with Linux is a separate challenge. The premise is that Prometheus uses the Kubernetes metrics, so understanding the Kubernetes metrics will go a long way toward understanding Prometheus monitoring.
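A typical dashboard roll-up over the per-container cAdvisor metrics looks roughly like this (the Prometheus URL is a placeholder, and the grouping label depends on the scrape config); the 5m rate window is one reason charts disagree with instantaneous tools like top:

```sh
# Per-node CPU roll-up over cAdvisor container metrics.
PROM=http://prometheus.example:9090   # placeholder
curl -s "${PROM}/api/v1/query" --data-urlencode \
  'query=sum by (node) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))'
```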
Why is this needed:
The understanding would have an impact on managing resource utilization and the stability of Kubernetes. We have random node crashes and suspect resource spikes, but the utilization numbers are usually low. By the way, the majority of our pods are "best effort" and do not specify requests and limits.
It is also suspected that the Kubernetes OOM handling may not kick in soon enough. A related question is how those metrics figure in the Kubernetes OOM calculation, which seems to use memory metrics reported by 'vmstat -s' or 'free -k' among other Linux metrics; this could be where the Kubernetes memory metrics meet Linux's. Kubernetes requires us to turn off swapping, which can be a contributing factor, because no swapping means no wiggle room in memory overruns, and wasted inactive content keeps occupying memory.
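For reference, the kubelet's memory.available eviction signal is derived from the root cgroup's working set rather than from free; a sketch along the lines of the script in the out-of-resource docs (assumes cgroup v1 paths on the node):

```sh
#!/bin/bash
# Approximates the kubelet's memory.available eviction signal (cgroup v1).
memory_capacity_in_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(grep total_inactive_file /sys/fs/cgroup/memory/memory.stat | awk '{print $2}')
# Working set = usage minus inactive file cache (reclaimable, so not counted).
working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
echo "memory.available: $(( (memory_capacity_in_bytes - working_set) / 1024 ))Ki"
```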
/kind feature