
consistent metric use for memory #227

Closed
brancz opened this issue Jul 12, 2019 · 15 comments

@brancz
Member

brancz commented Jul 12, 2019

The resource dashboards use inconsistent metrics for displaying memory usage: currently the cluster dashboard uses RSS and the namespace one uses total usage. I propose that both use working set bytes, and that the pod dashboard continue to show the distinct types as a stacked graph.
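For illustration only, a consistent pair of panel queries could look like the sketch below; the exact selectors (and the example "monitoring" namespace) are assumptions, not necessarily what the dashboards use:

# cluster dashboard: working set per namespace (sketch; container!="" drops the pod-level cgroup series)
sum by (namespace) (container_memory_working_set_bytes{container!=""})

# namespace dashboard: working set per pod within one namespace (example selector)
sum by (pod) (container_memory_working_set_bytes{namespace="monitoring", container!=""})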

@gouthamve @metalmatze @csmarchbanks @tomwilkie @paulfantom @kakkoyun

@csmarchbanks
Member

Why working set over RSS? I have found that working set can under-report pretty significantly, and I would personally prefer to over-report by using RSS. For example:

(image: working_set_vs_heap_stack)

@brancz
Member Author

brancz commented Jul 18, 2019

I am ok with RSS as well (although it is unfortunately no longer as representative as working set bytes for Go programs); as long as we're consistent I'm happy :)

FWIW we probably should differentiate further between the different types in our Pod dashboard.

@csmarchbanks
Member

I agree the focus should be on consistency.

RSS is definitely not as useful as I would like for Go >= 1.12. I would be happy to use working set if someone could explain to me how my graphs above show such different values. Otherwise, I think it would be safer to overestimate memory usage by using RSS than to have a pod OOM while the memory we report is nowhere near the limit.

@brancz
Member Author

brancz commented Jul 18, 2019

Yeah, I need to dig into the OOMKiller again. I feel like whatever it uses should be the default that we use for display, and then we can show all the breakdowns in the Pod dashboard.

@csmarchbanks
Member

👍 that sounds ideal. If you get to digging into the OOMKiller before me I would love to hear what you learn!

@brancz
Member Author

brancz commented Jul 18, 2019

Reading this, it sounds like container_memory_working_set_bytes is the right metric to default to.

@s-urbaniak
Contributor

Disclaimer: I am not a virtual memory subsystem expert ;-) I am just working on consolidating those metrics.

I agree with @brancz on using container_memory_working_set_bytes. It originates from the actual cgroup memory controller. Looking at the cAdvisor code, it is calculated as

container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file

which has RSS-ish semantics (as in "accounted resident memory" minus "unused file caches"), although it might include some fuzziness as per https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt.
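As a quick cross-check of that formula, the amount of memory the working set excludes (roughly the inactive file cache) can be read directly by subtracting the two cAdvisor series; the pod selector is just an example:

# approximate inactive file cache, per the formula above (sketch)
container_memory_usage_bytes{pod="prometheus-k8s-0"} - container_memory_working_set_bytes{pod="prometheus-k8s-0"}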

@csmarchbanks I rechecked your graph and noted that your stack query doesn't apply the {pod="prometheus-k8s-0"} filter.

On my cluster

go_memstats_heap_inuse_bytes{pod="prometheus-k8s-0"} + go_memstats_stack_inuse_bytes{pod="prometheus-k8s-0"}

is less than container_memory_working_set_bytes{pod="prometheus-k8s-0"}, which is expected.

The latter also accounts for active (i.e. non-evictable) filesystem cache memory, which is not present in the Go heap/stack metrics.


@s-urbaniak
Contributor

Ugh, never mind 🤦‍♂️, the subsequent stack query inherits the label selector from the heap query.
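For reference, this is standard PromQL vector matching: a binary operator such as + only pairs series whose label sets match, so the unfiltered stack metric is effectively restricted to the series that match the filtered heap metric. Writing the selector explicitly on both sides gives the same result:

go_memstats_heap_inuse_bytes{pod="prometheus-k8s-0"} + go_memstats_stack_inuse_bytes{pod="prometheus-k8s-0"}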

@metalmatze
Member

Did we more or less reach an agreement on container_memory_working_set_bytes? That is what's used in #238. Could we then go ahead and merge that PR?

@s-urbaniak
Contributor

container_memory_working_set_bytes is the way to go for now, and I agree with going ahead and merging #238 👍

Also, for another documentation reference on the semantics of that metric: http://www.brendangregg.com/wss.html

(courtesy of @paulfantom)

@csmarchbanks
Member

I am ok with moving forward with container_memory_working_set_bytes. I would like to dig into the behavior (possible bug?) I posted above, but most of the time working set is good for me.

Also, @s-urbaniak, I do not think the reference you posted by Brendan Gregg describes the same working set as reported by cAdvisor. As you said above, container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file, whereas that article tries to calculate recently touched memory.

@paulfantom
Member

container_memory_usage_bytes - total_inactive_file is a naive way to get "hot" (recently touched) memory, otherwise known as the WSS (Working Set Size).

@csmarchbanks
Member

I am going to echo what @s-urbaniak said and say that I am also not a virtual memory subsystem expert.

Is it possible that the reason I am seeing such a low working set size is that Prometheus caches things in memory but does not touch them for so long that they are removed from container_memory_working_set_bytes? If so, I am back to being against using WSS, because that memory cannot be reclaimed by the kernel, and an OOM could happen even when WSS is very low.

Another data point: today I have a Prometheus server with (a sketch of matching queries follows the list):

  • go_memstats_heap_inuse_bytes + go_memstats_stack_inuse_bytes: 40 GB
  • WSS: 11 GB
  • RSS: 85 GB
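A sketch of queries that put those three numbers side by side on one graph; the pod and container selectors are examples only:

container_memory_rss{pod="prometheus-k8s-0", container="prometheus"}
container_memory_working_set_bytes{pod="prometheus-k8s-0", container="prometheus"}
go_memstats_heap_inuse_bytes{pod="prometheus-k8s-0"} + go_memstats_stack_inuse_bytes{pod="prometheus-k8s-0"}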

@paulfantom
Member

paulfantom commented Aug 13, 2019

I spent some more time delving into the inner workings of the kernel and the Kubernetes memory management system. From that I would say we have 3 main concerns when choosing the right metric:

  1. UX
  2. OOMKiller
  3. Pod eviction

The first one is, I hope, self-explanatory, so let's look at the second one: the OOMKiller.

OOMKiller

This beast takes into account only things that can be reliably measured by the kernel and kills the process with the highest oom_score. The score is proportional to RSS + SWAP divided by the total available memory [1][2], and it also takes into consideration an adjuster in the form of oom_score_adj (important for k8s [3]). Since everything in Linux runs in a cgroup, this score can be computed for any container by using the "total available memory" of that cgroup (or of a parent cgroup if the current one doesn't have limits). So if we wanted to go only this route, it seems like choosing RSS (+ SWAP) would be the best way. However, let's look at the third option: pod eviction.
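As a rough cluster-side analogue of that score, one could compare a container's RSS plus swap against its cgroup memory limit using the cAdvisor metrics. This is only a sketch: it assumes recent cAdvisor label names and that the container has a memory limit set (containers without one report a huge placeholder limit, which makes the ratio meaningless):

# rough per-container ratio of (RSS + swap) to the cgroup memory limit (sketch)
(container_memory_rss{container!=""} + container_memory_swap{container!=""})
  / container_spec_memory_limit_bytes{container!=""}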

Pod eviction

According to the Kubernetes documentation there are 5 signals which might cause pod eviction [4], and only one of them relates to memory. The memory-based eviction signal is derived from cgroups and is known as memory.available, which is computed as TOTAL_RAM - WSS (Working Set Size [5]). In this calculation the kubelet excludes the number of bytes of file-backed memory on the inactive LRU list (known as inactive_file), as this memory is reclaimable under pressure. It is worth noting that the kubelet doesn't look at RSS, but makes its decisions based on WSS. So in this scenario it would be better to use WSS, as it is more Kubernetes-specific. Now we just need to find out what happens first, an OOMKill or pod eviction, to provide better UX.
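A node-level approximation of that memory.available signal can be read from cAdvisor's root-cgroup series; this is a sketch, and the kubelet's own accounting may differ slightly:

# approximate memory.available per node: machine RAM minus the root cgroup's working set (sketch)
machine_memory_bytes - on (instance) container_memory_working_set_bytes{id="/"}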

What's first?

Under normal conditions pod eviction should happen before an OOMKill, due to how node eviction thresholds [6] are set compared to all available memory. When the thresholds are met, the kubelet should report memory pressure and start evicting pods, so processes should avoid being OOMKilled. However, due to how the kubelet obtains its data [7], there might be cases where it doesn't observe the condition before the OOMKiller kicks in.

Summary

Considering all those findings, I would say that our reference metric for "used" memory should be WSS. However, we should keep in mind that this makes sense ONLY for Kubernetes, due to the additional memory tweaking the kubelet does on every pod.

[1]: https://github.com/torvalds/linux/blob/master/fs/proc/base.c#L547-L557
[2]: https://github.com/torvalds/linux/blob/master//mm/oom_kill.c#L198-L240
[3]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior
[4]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-policy
[5]: http://brendangregg.com/wss.html
[6]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-thresholds
[7]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#kubelet-may-not-observe-memory-pressure-right-away

@csmarchbanks
Member

Thank you for the in-depth description @paulfantom!

One point: I would say I experience far more OOMKills from container limits than pod evictions, but I am sure that depends on your deployment.

I am happy to use WSS for now and see how it goes. Closing this ticket since #238 has already been merged.
