PodGC default threshold causes OOMs on small master machines. #28484

wojtek-t · 2016-07-05T07:37:07Z

I've seen a cluster (from 1.2.4) where there is a job and the pods it is starting are simply crashing.
The problem is that those failed pods are not being garbage collected.

In the example cluster I've seen, we were able to produce ~13.000 failed pods over the week and it seems it none of these were ever garbage collected.

This basically results in constantly increasing memory usage of master components.
This seems like a bug to me.

@soltysh @gmarek @erictune @kubernetes/goog-control-plane @lavalamp @roberthbailey

wojtek-t · 2016-07-05T07:40:30Z

To clarify - the cluster was from 1.2.4 release.

gmarek · 2016-07-05T08:00:02Z

To be clear. The problem is that GC kicks in only after accumulating 12.5k Pods in the system and small (1, 2 core) master machines can't really handle this amount of load.

davidopp · 2016-07-05T23:47:28Z

ref/ #22680

lavalamp · 2016-07-06T19:59:15Z

Yeah. There's a flag for this already.

wojtek-t · 2016-07-06T20:01:53Z

@lavalamp - why did you close that?
Even though the flag exists, it's not used and this is a real problem.

lavalamp · 2016-07-06T20:17:51Z

@wojtek-t because this is a dup-- we already have #22680 and #25831 filed about this.

wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. team/control-plane labels Jul 5, 2016

soltysh mentioned this issue Jul 5, 2016

Job controller should not rely on calculating pod for its status #28486

Closed

gmarek changed the title ~~Failed pods from a job are not garbage collected.~~ PodGC default threshold causes OOMs on small master machines. Jul 5, 2016

lavalamp closed this as completed Jul 6, 2016

wojtek-t reopened this Jul 6, 2016

lavalamp closed this as completed Jul 6, 2016

jgreat mentioned this issue Jan 2, 2019

CIS Benchmark recommended options rancher/rke#1056

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PodGC default threshold causes OOMs on small master machines. #28484

PodGC default threshold causes OOMs on small master machines. #28484

wojtek-t commented Jul 5, 2016

wojtek-t commented Jul 5, 2016

gmarek commented Jul 5, 2016

davidopp commented Jul 5, 2016

lavalamp commented Jul 6, 2016

wojtek-t commented Jul 6, 2016

lavalamp commented Jul 6, 2016

PodGC default threshold causes OOMs on small master machines. #28484

PodGC default threshold causes OOMs on small master machines. #28484

Comments

wojtek-t commented Jul 5, 2016

wojtek-t commented Jul 5, 2016

gmarek commented Jul 5, 2016

davidopp commented Jul 5, 2016

lavalamp commented Jul 6, 2016

wojtek-t commented Jul 6, 2016

lavalamp commented Jul 6, 2016