kubectl top nodes/pods values vs uptime/free/top #447

Closed
wu105 opened this issue Feb 24, 2020 · 13 comments
Labels
kind/support Categorizes issue or PR as a support question. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@wu105

wu105 commented Feb 24, 2020

What would you like to be added:
This is spun off from #193

I can understand the concepts of units, requests, limits and actuals, but have a hard time reconciling the 'kubectl top' output with that of "traditional Linux commands", e.g., uptime (load averages over the past 1, 5, and 15 minutes), free (memory used, free, shared, buff/cache, available), top ('uptime' + 'free -k', then VIRT/RES/SHR memory and %CPU/%MEM per process).

One question about the kubectl top commands is: for what time period are they reporting? When are samples taken, and which samples are rolled up into the output? Linux commands, on the other hand, always report the state at the time the command is executed, and they include non-Kubernetes processes.

I have looked at https://github.com/kubernetes-sigs/metrics-server/blob/master/FAQ.md which offers this: "How resource utilization is calculated?
Metrics server doesn't provide resource utilization metric (e.g. percent of CPU used). Kubectl top and HPA calculate those values by themselves based on pod resource requests or node capacity."

The kubectl top man pages are silent on how the reported utilizations are calculated.
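(For illustration, a hedged sketch of how the percentages appear to be derived; <node-name> below is a placeholder. kubectl top node seems to report the latest sample from metrics-server, and the percentage seems to be usage divided by the node's allocatable capacity rather than a time average:)

kubectl top node <node-name>
# NAME          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%

kubectl describe node <node-name>
# the "Allocatable:" section appears to hold the denominators, i.e. CPU% ~= usage_millicores / allocatable_millicores * 100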

https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details explains HPA, but it uses terms like currentMetricValue, desiredMetricValue, targetAverageValue,
and targetAverageUtilization. We may want to relate those terms to the numbers reported by the kubectl top commands and by kubectl describe node, and to the limits and requests specified on the containers.
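(For reference, the core formula on that page is, roughly:

desiredReplicas = ceil( currentReplicas * currentMetricValue / desiredMetricValue )

where, for resource metrics with targetAverageUtilization, currentMetricValue is the pods' average usage expressed as a percentage of their resource requests.)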

As for Prometheus monitoring, its dashboards present their own set of CPU, memory, and disk metrics, rolled up to nodes, namespaces, etc. The base-level metrics seem to come from Kubernetes container metrics, and the roll-up periods vary from chart to chart. The disk part is not reported by the kubectl top commands at all, and does not seem to cover all Linux partitions. Reconciling the Prometheus dashboards with Linux is a separate challenge. The promise is that Prometheus uses Kubernetes metrics, so understanding the Kubernetes metrics should go a long way toward understanding Prometheus monitoring.

Why is this needed:

This understanding would have an impact on managing resource utilization and the stability of Kubernetes. We have random node crashes and suspect resource spikes, but the utilization numbers are usually low. By the way, the majority of our pods are "best effort" and do not specify requests and limits.

It is also suspected that the Kubernetes OOM handling may not kick in soon enough. A related question is how those metrics figure into the Kubernetes OOM calculation, which seems to use the memory metrics reported by 'vmstat -s' or 'free -k' among other Linux metrics; this could be where the Kubernetes memory metrics meet Linux's. Kubernetes requires us to turn off swapping, which can be a contributing factor, because no swapping means no wiggle room in memory overruns, and wasted inactive content keeps occupying memory.

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 24, 2020
@serathius
Contributor

Thanks for your feedback @wu105,

From what I understand, your main problem is the lack of proper tools to investigate the reason for node crashes and resource spikes. I don't think metrics will be a good tool for that problem. To do proper root-causing you need to look into the layers below Kubernetes (e.g. kernel logs, memory pressure).

Kubernetes is a container management solution that is independent of the platform it's running on (OS, container runtime, cloud). Monitoring the Linux kernel is best left to third-party solutions like Prometheus.

I noted possible improvements:

  • kubectl top is not well documented, making it hard to understand the meaning of its values and their relation to HPA
  • The kubelet OOM calculation is undocumented

@wu105
Author

wu105 commented Feb 26, 2020

Well, Kubernetes is supposed to manage the resources and achieve high utilization, with the power to kill pods when necessary. The node crashes we have leave no traces behind that we can find. They tend to happen during higher load, usually with high memory and low CPU utilization and moderate networking and disk I/O, and can be spiky, probably typical for Java build jobs. There is a question lurking in the distance: does Kubernetes manage such best-effort and somewhat spiky loads well?

By the way, Prometheus keeps the metrics history leading up to the crashes, but we have not found anything obvious. This leads us to look into what Kubernetes sees as the metrics and how it uses them, in part to interpret the metrics recorded by Prometheus.

As a thorn in the side, it is a peculiar situation: with swapping off (because swapping is a challenge for Kubernetes to manage), the system still uses large amounts of memory for buffering and caching, and effectively reclaims from the buffer and cache memory instead of swapping.

@mcginne

mcginne commented Feb 27, 2020

I'm also interested in this to some extent. I'm curious what value of CPU% busy kubectl top nodes is reporting. When running some tests I can see, with both top on the host and the Prometheus query:
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m])) * 100)

that the node is pretty close to 100% utilised, but kubectl top nodes is reporting ~60-70% busy.

In my case it looks like maybe the CPU% in steal or softirq, or possibly system, is not being counted. In my opinion it would be better to include all of these, as what I really care about is whether there is spare capacity on the host.
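To see how much CPU time each mode accounts for (and whether steal/softirq/system explain the gap), a breakdown query along these lines can help; "<node>" is a placeholder for the instance label:

# CPU-seconds per second spent in each mode, summed across cores
sum by (mode) (rate(node_cpu_seconds_total{instance="<node>"}[10m]))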

We do use Prometheus etc. for any real monitoring, but kubectl top is pretty useful for quick checks, so it would be nice if it were accurate.

@serathius
Contributor

serathius commented Feb 29, 2020

@wu105

Well, kubernetes is supposed to manage the resources and achieve high utilization, with the power to kill pods when necessary.

This is something that Kubernetes can help you achieve, but it is not guaranteed just by migrating to it.
High utilization directly depends on setting proper resource requests and limits. Without those, the kubelet cannot properly configure how resources are shared, leading to workload starvation or node crashes (e.g. the kubelet not reserving resources for itself).

The node crashes we have left no traces behind that we can find. They tend to happen during higher load, usually high memory and low cpu utilizations, moderate networking and disk i/o, can be spiky, probably typical for java build jobs. There is a question lurking in the distance: does kubernetes manage well such best-of-effort and somewhat spiky loads?

You can read more about out-of-resource handling here: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
If you have more questions please reach out to SIG Node. I would really encourage using requests and limits.
Running Java workloads on Kubernetes requires more work than other containers; I would encourage reading more about it.
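As a starting point, requests and limits can be set on an existing deployment without editing YAML; "my-java-app" and the values below are only placeholders:

kubectl set resources deployment my-java-app --requests=cpu=500m,memory=1Gi --limits=cpu=2,memory=2Gi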

By the way, Prometheus keeps the metrics history leading up to the crashes, but we have not found any thing obvious. This leads us to look into what kubernetes sees as the metrics and how it uses the metrics, in part to interpret the metrics recorded by Prometheus.

The kubelet currently uses cAdvisor to measure resources and to make pod eviction decisions (if the node is overloaded). All of this information is currently exposed on the /metrics/cadvisor endpoint, which can be collected by Prometheus. I would look into projects that use Prometheus for monitoring Kubernetes.
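For a quick look at the raw cAdvisor metrics without deploying Prometheus, the kubelet endpoint can be queried through the API server; <node-name> is a placeholder:

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | grep container_memory_working_set_bytes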

@mcginne

We do use Prometheus etc for any real monitoring, but kubectl top is pretty useful for quick checks etc, so it would be nice if it was accurate.

I think there is a misunderstanding about the purpose of kubectl top. It's not meant to be an accurate tool that replaces monitoring solutions or running top on the node. It's meant for exposing metrics from the resource metrics pipeline. You can read more about it here: https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/
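The resource metrics pipeline can also be queried directly; the response includes the timestamp and window of each sample, which answers part of the "what time period" question above. <ns> and <pod> are placeholders:

kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<ns>/pods/<pod>"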

@wu105
Author

wu105 commented Mar 1, 2020

Well, let us start with something we have evidence for: when too many Java jobs ended up on some nodes, those nodes ran at a high load factor and were not responsive. There were three expectations: 1. Kubernetes was expected to kill some of the jobs, but it did not; 2. Kubernetes was expected to distribute the jobs to begin with, but some nodes ended up with more than others; 3. Kubernetes was expected to delay starting some of the jobs due to resource pressure. We did not specify requests and limits because using the peak numbers would result in severe node under-utilization, and smaller numbers probably won't help, especially without knowing exactly what Kubernetes sees.

The out-of-resource document seems to be mostly about out-of-memory. If Kubernetes does not have an out-of-CPU kill, we may need one. Similarly for out-of-storage.

Our Java jobs tend to use high CPU when initializing, then idle most of the time with occasional spikes. Such characteristics cannot be expressed in requests and limits, but they fit "best effort" well. Are there documents on running such workloads on Kubernetes?

From a higher level, static requests and limits most likely won't cut it, and turning off swapping may have made things worse. Without complicating the model too much, a new "head room" number, i.e., start a pod only when a node has at least that much resource free, might help. The "head room" is similar to a request but is not guaranteed, which would factor more of the actual resource usage into scheduling. Similarly, the nodes may need to keep some resources in reserve. On our own, we are considering turning swap back on to see how it goes.

I understand that the Kubernetes metrics are not meant for Kubernetes scheduling. However, when there are issues that show up in Linux metrics, such as out of disk, out of memory, or load too high, and Kubernetes is not responding as expected, we need to know how the metrics are related and whether the differences are contributing factors, in order to bring Kubernetes and our expectations in line. The metrics documentation may want to "ground" the Kubernetes metrics in Linux metrics, and be explicit about whether a number is the present value, was measured 5 minutes ago, or is some average.

Now on to some speculation that our node crashes were caused by out-of-memory. If we can make this assumption, then we would have the same expectations as above of Kubernetes preventing such crashes. Here the issue may be whether Kubernetes can detect OOM early enough to prevent a crash.

As for filing the issue with the proper group, can you help me? This issue is already spun off from #193 in search of the proper group.

@serathius serathius added kind/support Categorizes issue or PR as a support question. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Mar 21, 2020
@wuestkamp

I get the same kubectl top values with these Prometheus queries:

# memory
container_memory_working_set_bytes{id="/"}

# cpu
rate(container_cpu_usage_seconds_total{id="/"}[1m])
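A per-pod variant of the same idea (a sketch: label names vary between kubelet/cAdvisor versions, e.g. pod vs pod_name; "<pod-name>" is a placeholder, and container!="POD" excludes the pause container):

# memory
sum(container_memory_working_set_bytes{pod="<pod-name>", container!="", container!="POD"})

# cpu
sum(rate(container_cpu_usage_seconds_total{pod="<pod-name>", container!="", container!="POD"}[1m]))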

@wu105
Author

wu105 commented Jun 16, 2020

@wuestkamp Great! That is what 'kubectl top node' is reporting!

Speaking of metrics from Prometheus, at the risk of being off topic: node_filesystem_size_bytes, node_filesystem_free_bytes, and node_filesystem_avail_bytes are reporting on the /var/lib/docker device, matching the df -B1 output:

df metrics   Prometheus metrics from node exporter
1B-blocks    node_filesystem_size_bytes
Used         node_filesystem_size_bytes - node_filesystem_free_bytes
Available    node_filesystem_avail_bytes
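Following the mapping above, the df "Used" column and a use percentage can be approximated as (a sketch; df's own Use% also accounts for blocks reserved for root, so the numbers will not match exactly):

# Used
node_filesystem_size_bytes - node_filesystem_free_bytes

# used fraction
1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)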

By the way, the node_filesystem_*_bytes metrics are labeled with the host's / device, which would be wrong when the Docker storage /var/lib/docker is on a different device.

@hasakura12

I have a similar issue with metrics shown by kubectl top pod vs kubectl get hpa:

kubectl top pod

# output
NAME                           CPU(cores)   MEMORY(bytes)
TEST_POD                   330m          164Mi

However, HPA shows only 7m (4%), hence it doesn't scale up:

NAME       REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
TEST_POD   Deployment/TEST_POD   4%/80%    115        1          36d

Any ideas as to why??

@serathius
Contributor

@hasakura12 Please reach out to SIG Autoscaling with problems like this. You can reach them via the Kubernetes Slack. My suggestion would be to look into resource requests, but please don't continue this topic on an unrelated issue.
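(For context, per the metrics-server FAQ quoted earlier, the HPA's resource utilization percentage is computed against requests, roughly:

utilization% = sum(container usage across targeted pods) / sum(container requests) * 100

so a low percentage despite high absolute usage in kubectl top pod usually points at the pods' CPU requests or at which pods the HPA is actually targeting.)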

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 21, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 20, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
