
consistent metric use for memory #227

Closed
brancz opened this issue Jul 12, 2019 · 15 comments

@brancz
Member

brancz commented Jul 12, 2019

The resource dashboards use inconsistent metrics for displaying memory usage: currently the cluster dashboard uses RSS and the namespace one uses total usage. I propose that both use working set bytes, and that the pod dashboard continue to show the distinct types as a stacked graph.
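For illustration only, a consistent pair of panel queries could look like the sketch below; the exact selectors (and the example "monitoring" namespace) are assumptions, not necessarily what the dashboards use:

# cluster dashboard: working set per namespace (sketch; container!="" drops the pod-level cgroup series)
sum by (namespace) (container_memory_working_set_bytes{container!=""})

# namespace dashboard: working set per pod within one namespace (example selector)
sum by (pod) (container_memory_working_set_bytes{namespace="monitoring", container!=""})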

@gouthamve @metalmatze @csmarchbanks @tomwilkie @paulfantom @kakkoyun

@csmarchbanks
Member

Why working set over RSS? I have found that working set can under-report pretty significantly, and I would personally prefer to over-report by using RSS. For example:

(image: working_set_vs_heap_stack)

@brancz
Member Author

brancz commented Jul 18, 2019

I am ok with RSS as well (although it is unfortunately no longer as representative as working set bytes for Go programs); as long as we're consistent I'm happy :)

FWIW we probably should differentiate further between the different types in our Pod dashboard.

@csmarchbanks
Member

I agree the focus should be on consistency.

RSS is definitely not as useful as I would like for Go >= 1.12. I would be happy to use working set if someone could explain to me how my graphs above show such different values. Otherwise, I think it would be safer to overestimate memory usage by using RSS than to have a pod OOM while the memory we report is nowhere near the limit.

@brancz
Member Author

brancz commented Jul 18, 2019

Yeah, I need to dig into the OOMKiller again. I feel like whatever it uses should be the default that we use for display, and then we can show all the breakdowns in the Pod dashboard.

@csmarchbanks
Member

👍 that sounds ideal. If you get to digging into the OOMKiller before me I would love to hear what you learn!

@brancz
Member Author

brancz commented Jul 18, 2019

Reading this, it sounds like container_memory_working_set_bytes is the right metric to default to.

@s-urbaniak
Contributor

Disclaimer: I am not a virtual memory subsystem expert ;-) I am just working on consolidating those metrics.

I agree with @brancz on using container_memory_working_set_bytes. It originates from the actual cgroup memory controller. Looking at the cAdvisor code, it is calculated as

container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file

which has RSS-ish semantics (as in "accounted resident memory" minus "unused file caches"), although it might include some fuzziness as per https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt.
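As a quick cross-check of that formula, the amount of memory the working set excludes (roughly the inactive file cache) can be read directly by subtracting the two cAdvisor series; the pod selector is just an example:

# approximate inactive file cache, per the formula above (sketch)
container_memory_usage_bytes{pod="prometheus-k8s-0"} - container_memory_working_set_bytes{pod="prometheus-k8s-0"}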

@csmarchbanks I rechecked your graph and noted that your stack query doesn't apply the {pod="prometheus-k8s-0"} filter.

On my cluster

go_memstats_heap_inuse_bytes{pod="prometheus-k8s-0"} + go_memstats_stack_inuse_bytes{pod="prometheus-k8s-0"}

is less than container_memory_working_set_bytes{pod="prometheus-k8s-0"}, which is expected.

The latter also accounts for active (i.e. non-evictable) filesystem cache memory, which is not present in the Go heap/stack metrics.


@s-urbaniak
Contributor

Ugh, never mind 🤦‍♂️, the subsequent stack query inherits the label selector from the heap query.
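For reference, this is standard PromQL vector matching: a binary operator such as + only pairs series whose label sets match, so the unfiltered stack metric is effectively restricted to the series that match the filtered heap metric. Writing the selector explicitly on both sides gives the same result:

go_memstats_heap_inuse_bytes{pod="prometheus-k8s-0"} + go_memstats_stack_inuse_bytes{pod="prometheus-k8s-0"}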

@metalmatze
Member

Did we more or less reach an agreement on container_memory_working_set_bytes? That is what's used in #238. Could we then go ahead and merge that PR?

@s-urbaniak
Contributor

container_memory_working_set_bytes is the way to go for now, and I agree with going ahead and merging #238 👍

Also, for another documentation reference on the semantics of that metric: http://www.brendangregg.com/wss.html

(courtesy of @paulfantom)

@csmarchbanks
Member

I am ok with moving forward with container_memory_working_set_bytes. I would like to dig into the behavior (possible bug?) I posted above, but most of the time working set is good for me.

Also, @s-urbaniak, I do not think the reference you posted by Brendan Gregg describes the same working set as reported by cAdvisor. As you said above, container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file, whereas that article tries to calculate recently touched memory.

@paulfantom
Member

container_memory_usage_bytes - total_inactive_file is a naive way to get "hot" (recently touched) memory, otherwise known as the WSS (Working Set Size).

@csmarchbanks
Member

I am going to echo what @s-urbaniak said and say that I am also not a virtual memory subsystem expert.

Is it possible that the reason I am seeing such a low working set size is that Prometheus caches things in memory but does not touch them for so long that they are removed from container_memory_working_set_bytes? If so, I am back to being against using WSS, because that memory cannot be reclaimed by the kernel, and an OOM could happen even when WSS is very low.

Another data point: today I have a Prometheus server with (a sketch of matching queries follows the list):

  • go_memstats_heap_inuse_bytes + go_memstats_stack_inuse_bytes: 40 GB
  • WSS: 11 GB
  • RSS: 85 GB
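A sketch of queries that put those three numbers side by side on one graph; the pod and container selectors are examples only:

container_memory_rss{pod="prometheus-k8s-0", container="prometheus"}
container_memory_working_set_bytes{pod="prometheus-k8s-0", container="prometheus"}
go_memstats_heap_inuse_bytes{pod="prometheus-k8s-0"} + go_memstats_stack_inuse_bytes{pod="prometheus-k8s-0"}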

@paulfantom
Member

paulfantom commented Aug 13, 2019

I spent some more time delving into the inner workings of the kernel and the Kubernetes memory management system. From that I would say we have 3 main concerns when choosing the right metric:

  1. UX
  2. OOMKiller
  3. Pod eviction

The first one is, I hope, self-explanatory, so let's look at the second one: the OOMKiller.

OOMKiller

This beast takes into account only things that can be reliably measured by the kernel and kills the process with the highest oom_score. The score is proportional to RSS + SWAP divided by the total available memory [1][2], and it also takes into consideration an adjuster in the form of oom_score_adj (important for k8s [3]). Since everything in Linux runs in a cgroup, this score can be computed for any container by using the "total available memory" of that cgroup (or of a parent cgroup if the current one doesn't have limits). So if we wanted to go only this route, it seems like choosing RSS (+ SWAP) would be the best way. However, let's look at the third option: pod eviction.
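As a rough cluster-side analogue of that score, one could compare a container's RSS plus swap against its cgroup memory limit using the cAdvisor metrics. This is only a sketch: it assumes recent cAdvisor label names and that the container has a memory limit set (containers without one report a huge placeholder limit, which makes the ratio meaningless):

# rough per-container ratio of (RSS + swap) to the cgroup memory limit (sketch)
(container_memory_rss{container!=""} + container_memory_swap{container!=""})
  / container_spec_memory_limit_bytes{container!=""}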

Pod eviction

According to the Kubernetes documentation there are 5 signals which might cause pod eviction [4], and only one of them relates to memory. The memory-based eviction signal is derived from cgroups and is known as memory.available, which is computed as TOTAL_RAM - WSS (Working Set Size [5]). In this calculation the kubelet excludes the number of bytes of file-backed memory on the inactive LRU list (known as inactive_file), as this memory is reclaimable under pressure. It is worth noting that the kubelet doesn't look at RSS, but makes its decisions based on WSS. So in this scenario it would be better to use WSS, as it is more Kubernetes-specific. Now we just need to find out what happens first, an OOMKill or pod eviction, to provide better UX.
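A node-level approximation of that memory.available signal can be read from cAdvisor's root-cgroup series; this is a sketch, and the kubelet's own accounting may differ slightly:

# approximate memory.available per node: machine RAM minus the root cgroup's working set (sketch)
machine_memory_bytes - on (instance) container_memory_working_set_bytes{id="/"}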

What's first?

Under normal conditions pod eviction should happen before an OOMKill, due to how node eviction thresholds [6] are set compared to all available memory. When the thresholds are met, the kubelet should report memory pressure and start evicting pods, so processes should avoid being OOMKilled. However, due to how the kubelet obtains its data [7], there might be cases where it doesn't observe the condition before the OOMKiller kicks in.

Summary

Considering all those findings, I would say that our reference metric for "used" memory should be WSS. However, we should keep in mind that this makes sense ONLY for Kubernetes, due to the additional memory tweaking the kubelet does on every pod.

[1]: https://github.com/torvalds/linux/blob/master/fs/proc/base.c#L547-L557
[2]: https://github.com/torvalds/linux/blob/master//mm/oom_kill.c#L198-L240
[3]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior
[4]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-policy
[5]: http://brendangregg.com/wss.html
[6]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-thresholds
[7]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#kubelet-may-not-observe-memory-pressure-right-away

@csmarchbanks
Member

Thank you for the in-depth description @paulfantom!

One point: I would say I experience far more OOMKills from container limits than pod evictions, but I am sure that depends on your deployment.

I am happy to use WSS for now and see how it goes. Closing this ticket since #238 has already been merged.
