container/libcontainer: fix schedulerStatsFromProcs hogging memory and wrong stats #2979

Merged 2 commits on Oct 28, 2021

Conversation

kolyshkin
Contributor

The logic of the existing schedulerStatsFromProcs code is to provide
cumulative stats for all the processes inside a container. Once a
process is dead, its stats entry is no longer updated, but it is still
used in the totals calculation. This creates two problems:

  • the pidsMetricsCache map grows without bound -- with many short-lived
    processes in containers, this can significantly impact kubelet memory
    usage;

  • when a new process with the same PID appears (as a result of PID
    reuse), the stats from the old one are overwritten, resulting in
    wrong totals (e.g. they can be lower than before, which should never
    be the case).

To kill these two birds with one stone, let's accumulate stats from dead
processes in pidsMetricsSaved, and remove them from the pidsMetricsCache.

Closes: #2978

Instead of passing a few parameters, make this function a method,
to simplify calling it as well as further development.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
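
To make the approach concrete, here is a minimal Go sketch of the accumulate-and-prune logic described above. It is not the actual container/libcontainer/handler.go code: the Handler type shown here, the schedulerStats signature, and the idea of passing in already-read per-pid stats are simplifying assumptions; only the pidMetricsCache/pidMetricsSaved split mirrors the change.

// A minimal sketch of the accumulate-and-prune approach (not the actual
// handler.go code). Only the pidMetricsCache/pidMetricsSaved split mirrors
// the change; everything else is simplified.
package sched

type CpuSchedstat struct {
	RunTime      uint64
	RunqueueTime uint64
	RunPeriods   uint64
}

type Handler struct {
	pidMetricsCache map[int]*CpuSchedstat // stats of pids seen alive in the latest iteration
	pidMetricsSaved CpuSchedstat          // accumulated stats of pids that have exited
}

// schedulerStats takes the latest per-pid schedstat readings and returns
// cumulative totals for all processes that have ever run in the container.
func (h *Handler) schedulerStats(alive map[int]*CpuSchedstat) CpuSchedstat {
	// Fold pids that disappeared since the last call into the saved totals
	// and drop them from the cache, so the cache cannot grow without bound.
	for pid, s := range h.pidMetricsCache {
		if _, ok := alive[pid]; !ok {
			h.pidMetricsSaved.RunTime += s.RunTime
			h.pidMetricsSaved.RunqueueTime += s.RunqueueTime
			h.pidMetricsSaved.RunPeriods += s.RunPeriods
			delete(h.pidMetricsCache, pid)
		}
	}
	// Refresh the cache with the latest readings for currently alive pids.
	for pid, s := range alive {
		h.pidMetricsCache[pid] = s
	}
	// Totals = saved (exited pids) + current (alive pids).
	total := h.pidMetricsSaved
	for _, s := range h.pidMetricsCache {
		total.RunTime += s.RunTime
		total.RunqueueTime += s.RunqueueTime
		total.RunPeriods += s.RunPeriods
	}
	return total
}

With this split, the cache only ever holds pids seen in the most recent iterations, while exited pids contribute to a fixed-size running total.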
@google-cla google-cla bot added the cla: yes label Oct 26, 2021
@k8s-ci-robot
Collaborator

Hi @kolyshkin. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rphillips
Contributor

/ok-to-test

@k8s-ci-robot
Collaborator

@rphillips: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mrunalp
Collaborator

mrunalp commented Oct 26, 2021

/ok-to-test

@rphillips
Contributor

rphillips commented Oct 27, 2021

This fixes the memory leak reported on Kubernetes nodes using many exec probes.

BZ2016175

/lgtm

@rphillips
Contributor

cc @bobbypage

schedstats.RunPeriods += v.RunPeriods
schedstats.RunqueueTime += v.RunqueueTime
schedstats.RunTime += v.RunTime
if _, alive := alivePids[p]; !alive {
Collaborator

@bobbypage bobbypage Oct 27, 2021

It took me a bit to come to the conclusion that the new logic proposed here generates the same stats as the old logic.

To confirm the main change here: previously the semantics of pidMetricsCache were that it contained all pids (both alive and dead), and every time stats were generated, we would loop over all pids in pidMetricsCache and sum up RunPeriods, RunqueueTime, and RunTime.

The new logic proposed here is slightly different. pidMetricsSaved represents metrics from all "dead" pids, and h.pidMetricsCache only represents metrics from currently alive pids. Thus, when we generate the final stats at the end here, we simply fold any pids that are gone into pidMetricsSaved and remove them from the cache. So every time we generate the final stats, we start with h.pidMetricsSaved and basically "add" the currently alive pids to it (plus any newly dead pids).

Let's maybe add a doc comment to make the semantics of h.pidMetricsSaved and h.pidMetricsCache clear (as suggested above).
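
For what it's worth, such a doc comment might look roughly like the sketch below; the field names follow this discussion, but the actual wording (and the rest of the Handler struct) in handler.go is not reproduced here.

// Sketch of a possible doc comment (the actual wording in handler.go may
// differ; the other fields of the real Handler are omitted here).
package libcontainer

import info "github.com/google/cadvisor/info/v1"

type Handler struct {
	// pidMetricsCache holds schedstat readings only for pids that were
	// alive during the most recent iteration; entries of pids that have
	// since exited are folded into pidMetricsSaved and removed from this map.
	pidMetricsCache map[int]*info.CpuSchedstat

	// pidMetricsSaved accumulates the final schedstat values of exited
	// pids, so their contribution stays in the cumulative totals without
	// keeping a per-pid map entry around forever.
	pidMetricsSaved info.CpuSchedstat
}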

@kolyshkin
Contributor Author

Good description, I think you got it right (it took me a while to wrap my head around it, too). Just wanted to add one thing (which is common to both the old and the new logic) -- for pids that were present during the previous call and are still present, the stats are updated. Once a pid is gone, its stats are no longer updated, so we do have cumulative stats for all the pids that have ever existed.

Now, the differences are:

  • the old code does not work properly in case of PID recycling (i.e. when a different
    process appears with the same PID as before) -- in that case the stats from the older
    process get overwritten and are thus lost (see the sketch after this list); the new
    code is free of this issue.
  • in the old code, the pidsMetricsCache map can grow to almost the value of
    /proc/sys/kernel/pid_max (4194304 entries); in the new code, its size does not exceed
    the number of processes seen during the current and the previous iterations, which
    should be much smaller, especially with many short-lived processes.
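
To illustrate the first point, here is a tiny self-contained example (hypothetical numbers, not cadvisor code) of how keying the cache by PID alone loses the exited process's stats when the PID is recycled:

// Hypothetical numbers, not cadvisor code: with the old logic the cache is
// keyed by pid only, so a recycled pid overwrites the exited process's entry.
package main

import "fmt"

func main() {
	cache := map[int]uint64{} // per-pid accumulated run time (simplified to one counter)

	cache[1234] = 500 // process A (pid 1234) accumulated 500 units, then exited
	cache[1234] = 10  // process B reuses pid 1234; A's 500 units are silently lost

	var total uint64
	for _, v := range cache {
		total += v
	}
	fmt.Println(total) // 10 -- lower than the 500 reported earlier, i.e. totals went backwards
}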

Out of sheer curiosity I wrote some code to assess how much memory the old map can use:

package main

import "testing"

// Cpu Aggregated scheduler statistics
type CpuSchedstat struct {
	// https://www.kernel.org/doc/Documentation/scheduler/sched-stats.txt

	// time spent on the cpu
	RunTime uint64 `json:"run_time"`
	// time spent waiting on a runqueue
	RunqueueTime uint64 `json:"runqueue_time"`
	// # of timeslices run on this cpu
	RunPeriods uint64 `json:"run_periods"`
}

const pidMax = 4194304 // From /proc/sys/kernel/pid_max, YMMV

func BenchmarkMapSize(b *testing.B) {
	pidMetricsCache := make(map[int]*CpuSchedstat)

	for i := 1; i < pidMax; i++ {
		pidMetricsCache[i] = &CpuSchedstat{
			RunTime:      42,
			RunqueueTime: 42,
			RunPeriods:   42,
		}
	}
	b.Log("done")
}

A test run:

[kir@kir-rhat map-size]$ go test -bench . -benchmem -v
goos: linux
goarch: amd64
pkg: github.com/kolyshkin/test/map-size
cpu: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
BenchmarkMapSize
    size_test.go:29: done
BenchmarkMapSize-4   	       1	2059912783 ns/op	443691456 B/op	 4348077 allocs/op
PASS
ok  	github.com/kolyshkin/test/map-size	2.118s

So, this map can eat up to ~423 MB of RAM (443691456 bytes) in the worst-case scenario. Multiply that by the number of containers on the system (let's say we have 1000), and 😱

@bobbypage
Collaborator

Gotcha, thanks for the clarification. And good to hear that this also fixes the issue with pid reuse.

Thanks for creating the benchmark, very surprising results 😱. Assuming that pidMax is at its maximum of 4194304 might be a stretch, but again this is on a per-container basis, so if there is very high process churn, the map could grow quite a bit. Thanks again for the fix!

container/libcontainer/handler.go (2 review threads, resolved)
@kolyshkin
Contributor Author

Let's maybe add a doc comment to make the semantics of h.pidMetricsSaved and h.pidMetricsCache clear

done; PTAL

Collaborator

@bobbypage bobbypage left a comment

LGTM, thanks!
