Nodes may die due to possible memory leak caused by emptyDir.medium.Memory #72759
This happens if a pod with such a mount keeps crashing over a long period of time (days). If the pod starts crashing for any reason, the node will eventually die.
I was able to reproduce this issue by using dd to write into shared memory from the init container. In the reproduction, the pod sometimes gets evicted (due to sizeLimit) and other times gets OOMKilled/CrashLoopBackOff (not sure why evictions are not consistent).
We believe this is unrelated to init containers; it will also happen with a main container if it crashes for long enough. Using a crashing init container simply reproduces the issue faster, since there is no backoff for init containers.
In our pre-prod environment we saw this issue after pods had crashed thousands of times over several days; some setups may reproduce it faster, as it also seems to depend on how much data is written to /dev/shm.
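The underlying mechanism can be sketched on any Linux host (the filename `fill-demo` is a placeholder of ours): pages written to a tmpfs such as /dev/shm stay resident after the writing process dies, just as data in a `medium: Memory` emptyDir survives a container crash for the lifetime of the pod, so repeated crash/restart cycles can keep memory pinned.

```shell
# Write 8 MiB of zeroes into tmpfs. These pages remain in memory even
# after dd exits; only deleting the file (or unmounting the tmpfs)
# releases them.
dd if=/dev/zero of=/dev/shm/fill-demo bs=1M count=8

# The file, and the memory backing it, is still there after the writer exited:
stat --format='%s' /dev/shm/fill-demo

# Clean up to release the memory:
rm /dev/shm/fill-demo
```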
Pre-crash symptoms:
- System OOMs encountered on the node
- The pause container process gets killed by the OOM killer before dd, which is the process actually consuming the memory!

Post-crash symptoms:
- The Docker daemon on the node is hung, and PLEG marks the node NotReady:
- The garbage collector reports failures to clean up:
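A quick way to see how much node memory tmpfs mounts (including memory-backed emptyDirs) are pinning while investigating these symptoms:

```shell
# List all tmpfs mounts with their current usage; memory-backed
# emptyDir volumes show up here, and their "Used" bytes are RAM.
df -h -t tmpfs
```

On the affected node, `dmesg -T` can also confirm which processes the OOM killer chose.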
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Run the following spec:
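(The spec itself was not preserved in this excerpt. A minimal hypothetical sketch matching the description above, a crashing init container that dd-writes into a memory-backed emptyDir with a sizeLimit, might look like the following; the names `shm-oom-repro` and `fill-shm` are placeholders.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shm-oom-repro   # placeholder name
spec:
  initContainers:
  - name: fill-shm
    image: busybox
    # Write into the memory-backed volume, then exit non-zero so the
    # init container keeps restarting and rewriting the data.
    command: ["sh", "-c", "dd if=/dev/zero of=/dev/shm/fill bs=1M count=512; exit 1"]
    volumeMounts:
    - name: shm
      mountPath: /dev/shm
  containers:
  - name: main
    image: busybox
    command: ["sleep", "infinity"]
  volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi
```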
On an instance type with 32 GB RAM (t2.2xlarge) running 6 Guaranteed replicas (4 GB RAM each), it takes about 1-2 hours to kill the node.