container/libcontainer: fix schedulerStatsFromProcs hogging memory and wrong stats #2979

Merged 2 commits on Oct 28, 2021

Conversation

kolyshkin
Contributor

The logic of the existing schedulerStatsFromProcs code is to provide
cumulative stats for all the processes inside a container. Once a
process is dead, its stats entry is no longer updated, but it is still
used in the totals calculation. This creates two problems:

  • the pidsMetricsCache map grows without bound -- with many short-lived
    processes in containers, this can significantly impact kubelet memory
    usage;

  • when a new process with the same PID appears (as a result of PID
    reuse), the stats from the old one are overwritten, resulting in
    wrong totals (e.g. they can be lower than before, which should never
    be the case).

To kill these two birds with one stone, let's accumulate stats from dead
processes in pidsMetricsSaved, and remove them from the pidsMetricsCache.

Closes: #2978

Instead of passing a few parameters, make this function a method,
to simplify calling it as well as further development.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
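
To make the approach concrete, here is a minimal Go sketch of the accumulate-and-prune logic described above. It is not the actual container/libcontainer/handler.go code: the Handler type shown here, the schedulerStats signature, and the idea of passing in already-read per-pid stats are simplifying assumptions; only the pidMetricsCache/pidMetricsSaved split mirrors the change.

// A minimal sketch of the accumulate-and-prune approach (not the actual
// handler.go code). Only the pidMetricsCache/pidMetricsSaved split mirrors
// the change; everything else is simplified.
package sched

type CpuSchedstat struct {
	RunTime      uint64
	RunqueueTime uint64
	RunPeriods   uint64
}

type Handler struct {
	pidMetricsCache map[int]*CpuSchedstat // stats of pids seen alive in the latest iteration
	pidMetricsSaved CpuSchedstat          // accumulated stats of pids that have exited
}

// schedulerStats takes the latest per-pid schedstat readings and returns
// cumulative totals for all processes that have ever run in the container.
func (h *Handler) schedulerStats(alive map[int]*CpuSchedstat) CpuSchedstat {
	// Fold pids that disappeared since the last call into the saved totals
	// and drop them from the cache, so the cache cannot grow without bound.
	for pid, s := range h.pidMetricsCache {
		if _, ok := alive[pid]; !ok {
			h.pidMetricsSaved.RunTime += s.RunTime
			h.pidMetricsSaved.RunqueueTime += s.RunqueueTime
			h.pidMetricsSaved.RunPeriods += s.RunPeriods
			delete(h.pidMetricsCache, pid)
		}
	}
	// Refresh the cache with the latest readings for currently alive pids.
	for pid, s := range alive {
		h.pidMetricsCache[pid] = s
	}
	// Totals = saved (exited pids) + current (alive pids).
	total := h.pidMetricsSaved
	for _, s := range h.pidMetricsCache {
		total.RunTime += s.RunTime
		total.RunqueueTime += s.RunqueueTime
		total.RunPeriods += s.RunPeriods
	}
	return total
}

With this split, the cache only ever holds pids seen in the most recent iterations, while exited pids contribute to a fixed-size running total.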
@google-cla google-cla bot added the cla: yes label Oct 26, 2021
@k8s-ci-robot
Collaborator

Hi @kolyshkin. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rphillips
Contributor

/ok-to-test

@k8s-ci-robot
Collaborator

@rphillips: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mrunalp
Collaborator

mrunalp commented Oct 26, 2021

/ok-to-test

@rphillips
Contributor

rphillips commented Oct 27, 2021

This fixes the memory leak reported on Kubernetes nodes using many exec probes.

BZ2016175

/lgtm

@rphillips
Contributor

cc @bobbypage

schedstats.RunPeriods += v.RunPeriods
schedstats.RunqueueTime += v.RunqueueTime
schedstats.RunTime += v.RunTime
if _, alive := alivePids[p]; !alive {
Collaborator

@bobbypage bobbypage Oct 27, 2021

It took me a bit to come to the conclusion that the new logic proposed here generates the same stats as the old logic.

To confirm the main change here: previously the semantics of pidMetricsCache were that it contained all pids (both alive and dead), and every time stats were generated, we would loop over all pids in pidMetricsCache and sum up RunPeriods, RunqueueTime, and RunTime.

The new logic proposed here is slightly different. pidMetricsSaved represents metrics from all "dead" pids, and h.pidMetricsCache only represents metrics from currently alive pids. Thus, when we generate the final stats at the end here, we simply fold any pids that are gone into pidMetricsSaved and remove them from the cache. So every time we generate the final stats, we start with h.pidMetricsSaved and basically "add" the currently alive pids to it (plus any newly dead pids).

Let's maybe add a doc comment to make the semantics of h.pidMetricsSaved and h.pidMetricsCache clear (as suggested above).
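
For what it's worth, such a doc comment might look roughly like the sketch below; the field names follow this discussion, but the actual wording (and the rest of the Handler struct) in handler.go is not reproduced here.

// Sketch of a possible doc comment (the actual wording in handler.go may
// differ; the other fields of the real Handler are omitted here).
package libcontainer

import info "github.com/google/cadvisor/info/v1"

type Handler struct {
	// pidMetricsCache holds schedstat readings only for pids that were
	// alive during the most recent iteration; entries of pids that have
	// since exited are folded into pidMetricsSaved and removed from this map.
	pidMetricsCache map[int]*info.CpuSchedstat

	// pidMetricsSaved accumulates the final schedstat values of exited
	// pids, so their contribution stays in the cumulative totals without
	// keeping a per-pid map entry around forever.
	pidMetricsSaved info.CpuSchedstat
}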

@kolyshkin
Contributor Author

Good description, I think you got it right (it took me a while to wrap my head around it, too). Just wanted to add one thing (which is common to both the old and the new logic) -- for pids that were present during the previous call and are still present, the stats are updated. Once a pid is gone, its stats are no longer updated, so we do have cumulative stats for all the pids that have ever existed.

Now, the differences are:

  • the old code does not work properly in case of PID recycling (i.e. when a different
    process appears with the same PID as before) -- in that case the stats from the older
    process get overwritten and are thus lost (see the sketch after this list); the new
    code is free of this issue.
  • in the old code, the pidsMetricsCache map can grow to almost the value of
    /proc/sys/kernel/pid_max (4194304 entries); in the new code, its size does not exceed
    the number of processes seen during the current and the previous iterations, which
    should be much smaller, especially with many short-lived processes.
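
To illustrate the first point, here is a tiny self-contained example (hypothetical numbers, not cadvisor code) of how keying the cache by PID alone loses the exited process's stats when the PID is recycled:

// Hypothetical numbers, not cadvisor code: with the old logic the cache is
// keyed by pid only, so a recycled pid overwrites the exited process's entry.
package main

import "fmt"

func main() {
	cache := map[int]uint64{} // per-pid accumulated run time (simplified to one counter)

	cache[1234] = 500 // process A (pid 1234) accumulated 500 units, then exited
	cache[1234] = 10  // process B reuses pid 1234; A's 500 units are silently lost

	var total uint64
	for _, v := range cache {
		total += v
	}
	fmt.Println(total) // 10 -- lower than the 500 reported earlier, i.e. totals went backwards
}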

Out of sheer curiosity I wrote some code to assess how much memory the old map can use:

package main

import "testing"

// Cpu Aggregated scheduler statistics
type CpuSchedstat struct {
	// https://www.kernel.org/doc/Documentation/scheduler/sched-stats.txt

	// time spent on the cpu
	RunTime uint64 `json:"run_time"`
	// time spent waiting on a runqueue
	RunqueueTime uint64 `json:"runqueue_time"`
	// # of timeslices run on this cpu
	RunPeriods uint64 `json:"run_periods"`
}

const pidMax = 4194304 // From /proc/sys/kernel/pid_max, YMMV

func BenchmarkMapSize(b *testing.B) {
	pidMetricsCache := make(map[int]*CpuSchedstat)

	for i := 1; i < pidMax; i++ {
		pidMetricsCache[i] = &CpuSchedstat{
			RunTime:      42,
			RunqueueTime: 42,
			RunPeriods:   42,
		}
	}
	b.Log("done")
}

A test run:

[kir@kir-rhat map-size]$ go test -bench . -benchmem -v
goos: linux
goarch: amd64
pkg: github.com/kolyshkin/test/map-size
cpu: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
BenchmarkMapSize
    size_test.go:29: done
BenchmarkMapSize-4   	       1	2059912783 ns/op	443691456 B/op	 4348077 allocs/op
PASS
ok  	github.com/kolyshkin/test/map-size	2.118s

So, this map can eat up to ~423 MB of RAM (443691456 bytes) in the worst-case scenario. Multiply that by the number of containers on the system (let's say we have 1000), and 😱

@bobbypage
Collaborator

Gotcha, thanks for the clarification. And good to hear that this also fixes the issue with pid reuse.

Thanks for creating the benchmark, very surprising results 😱. Assuming that pidMax is at its maximum of 4194304 might be a stretch, but again this is on a per-container basis, so if there is very high process churn, the map could grow quite a bit. Thanks again for the fix!

container/libcontainer/handler.go (2 review threads, resolved)
@kolyshkin
Contributor Author

Let's maybe add a doc comment to make the semantics of h.pidMetricsSaved and h.pidMetricsCache clear

done; PTAL

Collaborator

@bobbypage bobbypage left a comment

LGTM, thanks!
