container/libcontainer: fix schedulerStatsFromProcs hogging memory and wrong stats #2979
Conversation
Instead of passing a few parameters, make this function a method, to simplify its calling as well as further development. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Hi @kolyshkin. Thanks for your PR. I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
@rphillips: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this: /ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
This fixes the memory leak reported on Kubernetes nodes using many exec probes. /lgtm
cc @bobbypage
schedstats.RunPeriods += v.RunPeriods
schedstats.RunqueueTime += v.RunqueueTime
schedstats.RunTime += v.RunTime
if _, alive := alivePids[p]; !alive {
It took me a bit to come to the conclusion that the new logic proposed here generates the same stats as the old logic.
To confirm the main change here: previously, the semantics of pidMetricsCache were that it contained all pids (both alive and dead), and the logic was that every time the stats were generated, we would loop over all pids in pidMetricsCache and sum up RunPeriods, RunqueueTime, and RunTime.
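That previous behavior can be sketched like this (a simplified, standalone version with hypothetical values; not the actual cAdvisor code):

```go
package main

import "fmt"

// CpuSchedstat mirrors the aggregated scheduler statistics struct from the diff.
type CpuSchedstat struct {
	RunTime      uint64
	RunqueueTime uint64
	RunPeriods   uint64
}

// oldSchedulerStats sketches the previous logic: every pid ever seen stays in
// the cache, and each call sums over all of them, dead or alive.
func oldSchedulerStats(pidMetricsCache map[int]*CpuSchedstat) CpuSchedstat {
	var schedstats CpuSchedstat
	for _, v := range pidMetricsCache {
		schedstats.RunPeriods += v.RunPeriods
		schedstats.RunqueueTime += v.RunqueueTime
		schedstats.RunTime += v.RunTime
	}
	return schedstats
}

func main() {
	cache := map[int]*CpuSchedstat{
		100: {RunTime: 10, RunqueueTime: 1, RunPeriods: 2},
		101: {RunTime: 20, RunqueueTime: 3, RunPeriods: 4}, // long dead, never evicted
	}
	fmt.Println(oldSchedulerStats(cache)) // {30 4 6}
}
```

The totals are correct, but note that nothing ever removes an entry from the map.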
The new logic proposed here is slightly different. pidMetricsSaved represents metrics from all "dead" pids, and h.pidMetricsCache only represents metrics from currently alive pids. Thus, when we generate the final stats at the end here, we simply need to update pidMetricsSaved if any pids are gone and remove them from the cache. So every time we generate final stats, we start with h.pidMetricsSaved and basically "add" the currently alive pids to it (and any newly dead pids).

Let's maybe add a doc comment to make the semantics of h.pidMetricsSaved vs. h.pidMetricsCache clear (suggested above).
Good description, I think you got it right (took me a while to wrap my head around it, too). Just wanted to add one thing that the old and the new logic have in common: for pids that were present during the previous call and are still present, the stats are updated. Once a pid is gone, its stats are no longer updated, and so we do have cumulative stats for all the pids that ever existed.
Now, the differences are:
- the old code does not work properly in case of PID recycling (i.e. when a different process with the same PID as before appears): in such a case, the stats from the older process get overwritten and thus lost; the new code is free of this issue.
- in the old code, the pidMetricsCache map can grow to almost the value of /proc/sys/kernel/pid_max (4194304 entries); in the new code, its size does not exceed the total number of processes seen during the current and the previous iterations, which should be much less, especially in case of many short-lived processes.
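A toy illustration of the first difference, with made-up numbers: under the old map-only scheme, a recycled PID silently discards the dead process's counters, while folding dead entries into a saved total preserves them:

```go
package main

import "fmt"

func main() {
	// Old scheme: one map keyed by pid; a recycled pid overwrites the entry.
	cache := map[int]uint64{42: 1000} // exited process, pid 42, 1000 units of run time
	cache[42] = 5                     // new process reuses pid 42: the old 1000 is lost
	var oldTotal uint64
	for _, v := range cache {
		oldTotal += v
	}

	// New scheme: fold the dead pid into a saved total before the slot is reused.
	var saved uint64 = 1000 // dead pid 42 retired into the saved accumulator
	cache2 := map[int]uint64{42: 5}
	newTotal := saved
	for _, v := range cache2 {
		newTotal += v
	}

	fmt.Println(oldTotal, newTotal) // 5 1005
}
```

The old total can even go backwards between calls (1000 down to 5), which is what "wrong stats" in the PR title refers to.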
Out of sheer curiosity I wrote some code to assess how much memory the old map can use:
package main
import "testing"
// Cpu Aggregated scheduler statistics
type CpuSchedstat struct {
// https://www.kernel.org/doc/Documentation/scheduler/sched-stats.txt
// time spent on the cpu
RunTime uint64 `json:"run_time"`
// time spent waiting on a runqueue
RunqueueTime uint64 `json:"runqueue_time"`
// # of timeslices run on this cpu
RunPeriods uint64 `json:"run_periods"`
}
const pidMax = 4194304 // From /proc/sys/kernel/pid_max, YMMV
func BenchmarkMapSize(b *testing.B) {
pidMetricsCache := make(map[int]*CpuSchedstat)
for i := 1; i < pidMax; i++ {
pidMetricsCache[i] = &CpuSchedstat{
RunTime: 42,
RunqueueTime: 42,
RunPeriods: 42,
}
}
}
A test run:
[kir@kir-rhat map-size]$ go test -bench . -benchmem -v
goos: linux
goarch: amd64
pkg: github.com/kolyshkin/test/map-size
cpu: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
BenchmarkMapSize
size_test.go:29: done
BenchmarkMapSize-4 1 2059912783 ns/op 443691456 B/op 4348077 allocs/op
PASS
ok github.com/kolyshkin/test/map-size 2.118s
So, this map can eat about 423 MiB of RAM in the worst-case scenario. Multiply that by the number of containers on the system (let's say we have 1000), and 😱
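A quick back-of-the-envelope check of those figures (the per-map byte count is the B/op value from the -benchmem output above; the 1000-container multiplier is hypothetical):

```go
package main

import "fmt"

func main() {
	const bytesPerMap = 443691456 // B/op reported by the benchmark above
	const containers = 1000       // assumed container count for illustration
	fmt.Printf("per container: %d MiB\n", bytesPerMap/(1<<20))
	fmt.Printf("%d containers: %d GiB\n", containers, int64(bytesPerMap)*containers/(1<<30))
}
```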
Gotcha, thanks for the clarification. And good to hear that this also fixes the issue with PID reuse.

Thanks for creating the benchmark, very surprising results 😱. Assuming that pid_max is at its maximum of 4194304 might be a stretch, but then again this is on a per-container basis, so if there is very high process churn, the map could grow quite a bit. Thanks again for the fix!
The logic of the existing schedulerStatsFromProcs code is to provide cumulative stats for all the processes inside a container. Once a process is dead, its stat entry is no longer updated, but is still used in totals calculation. This creates two problems: - the pidMetricsCache map is ever-growing -- in case of many short-lived processes in containers, this can impact kubelet memory usage a lot; - in case a new process with the same PID appears (as a result of PID reuse), the stats from the old one are overwritten, resulting in wrong totals (e.g. they can be less than the previous ones, which should never be the case). To kill these two birds with one stone, let's accumulate stats from dead processes in pidMetricsSaved, and remove them from pidMetricsCache. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Force-pushed from a62912d to 9a22e3e (Compare)
done; PTAL
LGTM, thanks!
The logic of the existing code of schedulerStatsFromProcs is to provide cumulative stats for all the processes inside a container. Once the process is dead, its stat entry is no longer updated, but still used in totals calculation. This creates two problems:

- the pidMetricsCache map is ever-growing -- in case of many short-lived processes in containers, this can impact kubelet memory usage a lot;
- in case a new process with the same PID appears (as a result of PID reuse), the stats from the old one are overwritten, resulting in wrong totals (e.g. they can be less than previous, which should not ever be the case).

To kill these two birds with one stone, let's accumulate stats from dead processes in pidMetricsSaved, and remove them from the pidMetricsCache.

Closes: #2978
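Putting the commit message together with the earlier discussion, the fixed flow can be sketched as a method, as the first commit describes (hypothetical receiver, field names, and read callback; the real cAdvisor code reads per-pid schedstat data itself):

```go
package main

import "fmt"

// CpuSchedstat mirrors the aggregated scheduler statistics from the PR.
type CpuSchedstat struct {
	RunTime, RunqueueTime, RunPeriods uint64
}

// Handler is a hypothetical stand-in for cAdvisor's cgroup handler: the PR
// turns schedulerStatsFromProcs into a method on such a receiver, so the
// per-container state lives on the struct instead of being passed around.
type Handler struct {
	pidMetricsCache map[int]*CpuSchedstat // currently alive pids only
	pidMetricsSaved CpuSchedstat          // totals retired from dead pids
}

// schedulerStatsFromProcs sketches the fixed flow: refresh stats for live
// pids (reading is faked via the read callback), retire dead pids into
// pidMetricsSaved, and return cumulative totals for all pids ever seen.
func (h *Handler) schedulerStatsFromProcs(livePids []int, read func(pid int) CpuSchedstat) CpuSchedstat {
	alive := make(map[int]bool, len(livePids))
	for _, pid := range livePids {
		alive[pid] = true
		cur := read(pid)
		h.pidMetricsCache[pid] = &cur // latest reading overwrites the cached one
	}
	total := h.pidMetricsSaved
	for pid, v := range h.pidMetricsCache {
		total.RunTime += v.RunTime
		total.RunqueueTime += v.RunqueueTime
		total.RunPeriods += v.RunPeriods
		if !alive[pid] {
			// pid is gone: fold its final stats into saved and evict it,
			// keeping the cache bounded by the number of live processes
			h.pidMetricsSaved.RunTime += v.RunTime
			h.pidMetricsSaved.RunqueueTime += v.RunqueueTime
			h.pidMetricsSaved.RunPeriods += v.RunPeriods
			delete(h.pidMetricsCache, pid)
		}
	}
	return total
}

func main() {
	h := &Handler{pidMetricsCache: map[int]*CpuSchedstat{}}
	read := func(pid int) CpuSchedstat { return CpuSchedstat{RunTime: uint64(pid)} }
	fmt.Println(h.schedulerStatsFromProcs([]int{1, 2}, read)) // {3 0 0}
	fmt.Println(h.schedulerStatsFromProcs([]int{2}, read))    // pid 1 died; still {3 0 0}
}
```

Deleting from a Go map while ranging over it is safe, so the retire-and-evict pass can happen in the same loop that computes the totals.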