Proposal to update container_memory_usage_bytes to Cache+RSS #3286

Open
HonakerM opened this issue Mar 31, 2023 · 8 comments

HonakerM commented Mar 31, 2023

Hello, I would like to propose updating container_memory_usage_bytes so that instead of reading memory.usage_in_bytes/memory.current it is calculated manually as Cache+RSS from the memory.stat file. The reason is that memory.usage_in_bytes can diverge from the actual usage, and the discrepancy is exacerbated on multi-core systems.

Background

Metrics

My original investigation started when I was trying to debug the memory usage of an unrelated application running in a Kubernetes pod and was watching the container_memory_working_set_bytes metric, as suggested by the Kubernetes docs. I wanted to find the source of this value, which led me to cAdvisor's GetStats, which told me it is container_memory_usage_bytes - inactive_file. My question then became: where does container_memory_usage_bytes come from? The cAdvisor code calls out to runc's GetStats, which answered my question: the information is gathered from files in the /sys/fs/cgroup/memory directory, specifically memory.stat and either memory.usage_in_bytes or memory.current depending on the cgroup version.
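
To make the chain concrete, here is a minimal sketch of the same arithmetic read straight from the cgroup filesystem. It assumes a cgroup v1 hierarchy mounted at /sys/fs/cgroup/memory (on cgroup v2 the usage file is memory.current and the stat key is inactive_file); cAdvisor itself obtains these numbers through runc's GetStats rather than reading the files like this.

// Minimal sketch in Go, assuming a cgroup v1 hierarchy; this is not cAdvisor
// code, it only mirrors the arithmetic behind the two metrics.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func parseUint(s string) uint64 {
	v, err := strconv.ParseUint(strings.TrimSpace(s), 10, 64)
	if err != nil {
		panic(err)
	}
	return v
}

func readFile(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	return string(b)
}

// statValue extracts a single counter (e.g. "total_inactive_file") from memory.stat.
func statValue(stat, key string) uint64 {
	for _, line := range strings.Split(stat, "\n") {
		if f := strings.Fields(line); len(f) == 2 && f[0] == key {
			return parseUint(f[1])
		}
	}
	panic("key not found in memory.stat: " + key)
}

func main() {
	const dir = "/sys/fs/cgroup/memory"
	usage := parseUint(readFile(dir + "/memory.usage_in_bytes")) // memory.current on cgroup v2
	inactiveFile := statValue(readFile(dir+"/memory.stat"), "total_inactive_file")

	// container_memory_usage_bytes       <- usage_in_bytes / memory.current
	// container_memory_working_set_bytes <- usage - inactive_file
	fmt.Println("usage_bytes:      ", usage)
	fmt.Println("working_set_bytes:", usage-inactiveFile)
}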

Linux Kernel

Throughout this research I ran into a few other cAdvisor memory issues, like #3197 and #3081, which discuss these values and what we should be subtracting. This made me assume that usage_in_bytes could be calculated from the stat file, so now I was curious about the calculation. That led me to section 5.5 of the kernel docs, which says the following:

5.5 usage_in_bytes

For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn’t show ‘exact’ value of memory (and swap) usage, it’s a fuzz value for efficient access. (Of course, when necessary, it’s synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2).

memory.usage_in_bytes isn't a calculation, it's a fuzz value! I was surprised to see that the kernel doesn't have an exact value for memory usage, especially when RSS+CACHE is available to it in memory.stat, so I kept on digging. Looking back at the commit that introduced that documentation, a111c966, we see the following:

These changes improved performance of memory cgroup very much, but made res_counter->usage usually have a bigger value than the actual value of memory usage. So, *.usage_in_bytes, which show res_counter->usage, are not desirable for precise values of memory(and swap) usage anymore.

Instead of removing these files completely(because we cannot know res_counter->usage without them), this patch updates the meaning of those files.

That tells us that usage_in_bytes is not precise and can be an unreliable metric for memory measurement. However, res_counter->usage should still be pretty close to actual usage, right? From the kernel email discussion regarding the above change there is at least the guarantee that rss+cache <= usage_in_bytes. However, the difference between the two grows with the size of each per-CPU bulk pre-allocated charge; in other words, the difference can grow with the number of CPUs!
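
To get a feel for the scaling, here is a back-of-the-envelope bound, assuming the over-accounting on each CPU is capped at one precharge batch; the actual batch size is a kernel implementation detail that has changed over time, so the numbers are purely illustrative.

// Rough worst-case drift between usage_in_bytes and RSS+CACHE, assuming a
// hypothetical per-CPU precharge batch of 64 pages of 4 KiB each. The real
// batch size is internal to the kernel and has varied between versions.
package main

import "fmt"

func main() {
	const (
		pageSize   = 4096 // bytes, typical on x86_64
		batchPages = 64   // assumed per-CPU precharge, in pages
	)
	for _, cpus := range []int{4, 16, 64, 128} {
		drift := cpus * batchPages * pageSize
		fmt.Printf("%3d CPUs -> worst-case drift ~%d KiB\n", cpus, drift/1024)
	}
}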

At this point you might be wondering what a 12-year-old commit and email thread about a removed res_counter API have to do with the current Linux kernel. Well, lockless page counters are the replacement for resource counters, and they tried their best to keep the semantics of usage_in_bytes and stat unchanged. In the most recent mm/memcontrol.c we see two functions, mem_cgroup_usage and memcg_stat_show, which correspond to usage_in_bytes and stat respectively.

In the first function, mem_cgroup_usage, we can see that for non-root cgroups the return value is the current page counter value, accessed via page_counter_read(&memcg->memory). This page_counter_read is effectively the same as reading res_counter->usage but without the lock requirement. On the flip side, memcg_stat_show pulls its stats from the memory controller's vmstats struct, which is synced either every 2 seconds or when a large enough stat change occurs, per the comments in memcontrol.c.

Throughout the above discussion I've been focusing on cgroup v1's usage_in_bytes. It turns out that cgroup v2's memory.current uses memory_current_read, which pulls from the same page_counter_read as v1, so it is similarly affected.

Example

To illustrate this with a reproducible application, I've captured memory.stat and memory.usage_in_bytes from an nginx pod as described in the Kubernetes docs, copied below:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
root@nginx:/sys/fs/cgroup/memory# cat memory.{stat,usage_in_bytes}
<redacted for readability can provide full output>
total_cache 270336
total_rss 1826816
<redacted for readability can provide full output>
4562944

As you can see, total_cache + total_rss equals 2097152, which is almost half of the reported usage_in_bytes of 4562944!! Albeit the difference in this case is a measly ~2 MB, but it can add up. In the original cache- and memory-intensive application I was debugging, I noticed the following in my memory files.

bash-4.4$ cat memory.{stat,usage_in_bytes}
<redacted for readability can provide full output>
total_cache 2286772224
total_rss 565260288
<redacted for readability can provide full output>
3483992064

In this example CACHE+RSS equals 2852032512, which is 602.68 MB less than the reported usage_in_bytes! That 602 MB is only about a 20% overestimation, proportionally better than nginx's ~50%, but the absolute impact is much more visible.
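
For reference, here is the arithmetic behind both examples, with the values copied from the dumps above:

// Sanity-check of the numbers quoted above (copied verbatim from the two
// memory.stat / memory.usage_in_bytes dumps in this issue).
package main

import "fmt"

func main() {
	// nginx pod: total_cache + total_rss vs usage_in_bytes
	fmt.Println(270336+1826816, "of", 4562944) // 2097152 of 4562944, roughly 46%

	// cache/memory intensive application
	sum := int64(2286772224 + 565260288)                      // total_cache + total_rss = 2852032512
	fmt.Println(int64(3483992064)-sum, "bytes of difference") // 631959552 bytes, about 602.68 MiB
}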

Proposal

I'd like to update container_memory_usage_bytes to be calculated as CACHE+RSS instead of the current usage_in_bytes. This is what's suggested in the kernel docs above, and it is also what the kernel does for usage_in_bytes of the root cgroup, i.e. on the host/node. I also think this would benefit users, since container_memory_usage_bytes would then be easily derivable from, and consistent with, the other memory statistics.
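
To be concrete about the proposed calculation, here is a rough sketch, not cAdvisor's actual implementation: read memory.stat and report total_cache + total_rss as the usage value (on cgroup v2 the analogous fields are file and anon).

// Sketch of the proposed container_memory_usage_bytes calculation, assuming
// a cgroup v1 memory.stat; this is an illustration, not cAdvisor code.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	b, err := os.ReadFile("/sys/fs/cgroup/memory/memory.stat")
	if err != nil {
		panic(err)
	}
	stats := map[string]uint64{}
	for _, line := range strings.Split(string(b), "\n") {
		if f := strings.Fields(line); len(f) == 2 {
			if v, err := strconv.ParseUint(f[1], 10, 64); err == nil {
				stats[f[0]] = v
			}
		}
	}
	// Proposed: container_memory_usage_bytes = CACHE + RSS
	fmt.Println("usage (cache+rss):", stats["total_cache"]+stats["total_rss"])
}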

Most of my understanding about this issue comes from a 12 year old email discussion so I'm still familiarizing myself with the current kernel. Any corrections or historical context for the current implementation would be greatly appreciated. I'd also love to know if this has been discussed before. I'd be happy to work on implementing this change if the proposal is accepted.

HonakerM added a commit to HonakerM/cadvisor that referenced this issue Mar 31, 2023
This commit updates container_memory_usage to use the calculated
value of RSS+Cache instead of the value from cgroup. See issue google#3286
for reasoning

Signed-off-by: Michael Honaker <mchonaker@gmail.com>
@HeGaoYuan

follow with interest

@ganga1980

/track

@lance5890

m


fabiand commented Dec 6, 2023

Thoughts?


SuperQ commented Feb 25, 2024

The best thing to do is to stop attempting to compute metrics in cAdvisor. Simply pass the values from the kernel to metrics so that the downstream users can decide what to do.

This is exactly how we do it in the node_exporter. Values from things like /proc/meminfo are exposed directly, with no manipulation. This avoids confusion as to what metrics mean and allows end users to compute exactly what they need from the raw data.

@HonakerM
Author

The best thing to do is to stop attempting to compute metrics in cAdvisor. Simply pass the values from the kernel to metrics so that the downstream users can decide what to do.

@SuperQ I completely agree, and I think that is the best long-term approach. I forgot to update this issue with the thread, but I emailed the LKML and found out that Cache+RSS is not a valid usage metric. Here is a quote from the thread:

What you see as the difference is mainly kernel memory (e.g. dentries, inodes, task_struct,...). --- RSS+Cache would only show memory that userspace is directly responsible for but not the kernel structures (whose size depends on kernel implementation afterall).


SuperQ commented Feb 26, 2024

Yes, I would love to see more granular data from the kernel on these fields. Plus it would be nice if the kernel would actually differentiate gauges from monotonic counters. 😁

@astronaut0131

Same problem here

/ # cat /sys/fs/cgroup/memory/memory.usage_in_bytes 
1553903616
/ # cat /sys/fs/cgroup/memory/memory.stat 
cache 2703360
rss 237719552
rss_huge 415236096
shmem 0
mapped_file 0
dirty 0
writeback 0
swap 0
pgpgin 611193
pgpgout 578948
pgfault 637131
pgmajfault 0
inactive_anon 1549688832
active_anon 0
inactive_file 1351680
active_file 1486848
unevictable 0
hierarchical_memory_limit 17179869184
hierarchical_memsw_limit 17179869184
total_cache 2703360
total_rss 237719552
total_rss_huge 415236096
total_shmem 0
total_mapped_file 0
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 611193
total_pgpgout 578948
total_pgfault 637131
total_pgmajfault 0
total_inactive_anon 1549688832
total_active_anon 0
total_inactive_file 1351680
total_active_file 1486848
total_unevictable 0

usage_in_bytes is far greater than RSS + CACHE, which leads to unreasonable Kubernetes metrics.
