[KEP-2400] [Swap]: Verify memory pressure behavior with swap enabled #120800
Comments
/sig node
/triage accepted
Hey,
Testing limited swap, no system-reserved memory. To tune the kernel, one can play with several knobs. The only thing that might need to be changed is rankMemoryPressure, as I don't believe it takes swap into account.
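As a sketch (these are common kernel knobs, not necessarily the exact ones the commenter meant), swapping behavior is usually tuned via sysctls such as:

```shell
# Common swap-related sysctls (assumption: the commenter's exact list is
# not specified; these are the usual suspects):
#   vm.swappiness             - reclaim preference for anonymous vs. file
#                               pages (0-200 on recent kernels, default 60)
#   vm.min_free_kbytes        - memory the kernel tries to keep free
#   vm.watermark_scale_factor - how early background reclaim (kswapd) wakes
# Read the current values (no root required):
cat /proc/sys/vm/swappiness
cat /proc/sys/vm/min_free_kbytes
```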
Hi @jan-kantert ,
Node setup:
Our case works as follows: 32GB of swap, 16GB of memory. Spin up pods that request 1GB of memory but use 2GB. This should work fine, but we see evictions from kubelet and memory pressure on the node. We use cgroup v2 and just enabled swap with no tuning. The evictions happen in kubelet, so this is not related to the Linux OOM killer at all.
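A minimal pod along the lines described might look like this (hypothetical names and image; assumes a `stress` binary that allocates the requested memory):

```yaml
# Hypothetical repro pod: requests 1Gi but allocates ~2Gi, so under
# cgroup v2 with swap enabled, roughly half its anonymous memory should
# be able to go to swap.
apiVersion: v1
kind: Pod
metadata:
  name: swap-repro          # hypothetical name
spec:
  containers:
  - name: stress
    image: polinux/stress   # assumption: any image providing `stress`
    command: ["stress", "--vm", "1", "--vm-bytes", "2G", "--vm-hang", "0"]
    resources:
      requests:
        memory: 1Gi
      # no memory limit, mirroring the described setup
```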
Would you also share whether you configure kube/system reserved memory, and whether you are using the default memory threshold?
And @jan-kantert, please mention the k8s version and the container runtime you use (and its version).
I have been trying to reproduce the reported behavior, but I mostly failed. TL;DR: I think kubelet behaves as expected. For completeness, I am sharing my setup and the cases I tried. I am using a VM with 4 CPUs and 16GB of memory. I set up 32GB of swap to match the reported environment. I am using a local build of Kubernetes (with etcd, the API server, ... pushed to edge). The distro is Fedora 39 without any of the mentioned tunings. I am reserving 500M each for kube and system via KubeReserved and SystemReserved, keeping the memory eviction threshold at its default (100M).
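The setup described above roughly corresponds to a KubeletConfiguration like this (a sketch; the swap mode is an assumption, since the comment does not state it):

```yaml
# Hypothetical KubeletConfiguration matching the described setup.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false             # required to run kubelet on a node with swap
memorySwap:
  swapBehavior: LimitedSwap   # assumption: the mode is not stated above
kubeReserved:
  memory: 500Mi
systemReserved:
  memory: 500Mi
evictionHard:
  memory.available: 100Mi     # the default hard threshold
```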
Thank you for your effort! How much do you use as memory reservation in each test?
This works for me as well if I keep a reservation of 2GB per pod; swap is not really used or needed here.
This looks suspicious to me. Swap is a synchronous activity in Linux. I assume that you set a reservation of 1GB and used 2GB. With 32GB of swap and 16GB of RAM, this should cause neither OOMs nor evictions.
How much is reserved here? How much used?
That indicates that you are operating very close to the eviction threshold. My repro looks like this (same node):
The node is clearly not out of memory. You can even construct cases where the node has free memory available. Step 3 is important, as kubelet will take a few seconds to react.
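To see why a node can be "close to the eviction threshold" while swap sits unused, it helps to look at how kubelet computes its eviction signal. A sketch with hypothetical numbers:

```shell
# Sketch (hypothetical numbers): how kubelet derives its memory.available
# eviction signal. Roughly:
#   working_set      = memory usage - inactive_file (reclaimable page cache)
#   memory.available = capacity - working_set
# Swap does not appear anywhere in this formula, which is how a node with
# plenty of free swap can still cross the eviction threshold.
capacity_kib=16777216        # 16 GiB node
usage_kib=16700000           # RAM nearly full
inactive_file_kib=20000      # little reclaimable file cache left
working_set_kib=$((usage_kib - inactive_file_kib))
available_kib=$((capacity_kib - working_set_kib))
echo "memory.available = ${available_kib} KiB"   # prints 97216
# The default hard eviction threshold is 100Mi = 102400 KiB:
if [ "$available_kib" -lt 102400 ]; then
  echo "below threshold: kubelet would start evicting"
fi
```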
@jan-kantert Are you changing the config to LimitedSwap, or just using the default settings for swap (i.e. UnlimitedSwap)?
Since this was considered a blocker for swap, I wanted to go through and try to reproduce the case. From my understanding, the issue is that we are evicting pods before swap can be used.
Setup
Memory before running any pods:
I have two workloads to simulate @jan-kantert's example: Pod1.yaml
Even with just this pod running, I see `free` reporting that we are using swap. Pod2.yaml
Running these workloads, I see via `free` that swap is used, and neither pod is evicted before swap is used.
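For anyone following along, swap usage can also be confirmed straight from /proc/meminfo, which is where `free` gets its swap columns:

```shell
# Show total and free swap as the kernel reports them; a shrinking
# SwapFree relative to SwapTotal means the node is actively using swap.
grep -E '^Swap(Total|Free):' /proc/meminfo
```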
I see swap being utilized and eviction is avoided.
/triage needs-moreinformation
@jan-kantert can you please share with us all the steps you did to reproduce this?
What version of kube are you using?
What kind of workloads are running on the node?
What is the pod yaml you use to reproduce this?
What container runtime (and version) are you using?
Are you using LimitedSwap or UnlimitedSwap?
What kind of swap are you using - disk, memory, type of disk?
What is the OS (and version)?
What is your kernel setup for swap?
@kannon92: The label(s) `triage/needs-moreinformation` cannot be applied, because the repository doesn't have them. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/triage needs-more-information |
@kannon92: The label(s) `triage/needs-more-information` cannot be applied, because the repository doesn't have them. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/triage needs-information |
Using swap, I've been hitting memory pressure that looks like it's caused by #43916; I'm trying to work around it by writing to /sys/fs/cgroup/memory.reclaim when …
I'd suggest creating a new issue. |
What would you like to be added?
A replacement for #105023
This issue is for addressing problems related to node pressure and swap.
In short, the problem is that the kernel tries to avoid/defer swapping as much as possible; therefore, swap generally kicks in only when the node is already under pressure. On the other hand, kubelet configures node-eviction thresholds so that it can reclaim memory before the kernel starts doing so (e.g. with OOM kills).
So, currently the flow is roughly:
1. Node memory usage grows until it crosses kubelet's eviction threshold.
2. Kubelet starts evicting pods to reclaim memory.
3. The kernel, which defers swapping as long as it can, never gets the chance to swap pages out.
With this flow, swap does not have the opportunity to free memory, because kubelet starts evicting before swapping kicks in.
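One direction for getting pages onto swap before kubelet's threshold trips (a sketch of one mitigation, not the KEP's proposed fix): on cgroup v2 with kernel 5.19+, the `memory.reclaim` interface lets userspace trigger proactive reclaim, which can move cold anonymous pages to swap ahead of memory pressure. The path below assumes cgroup v2 mounted at /sys/fs/cgroup, and writing to it requires root:

```shell
# Guarded proactive reclaim: ask the kernel to reclaim 1G from the root
# cgroup, which can push cold anonymous pages to swap before kubelet's
# eviction threshold is crossed. Requires root, cgroup v2, kernel >= 5.19.
RECLAIM=/sys/fs/cgroup/memory.reclaim
if [ -w "$RECLAIM" ]; then
  echo "1G" > "$RECLAIM"
  echo "requested 1G of proactive reclaim"
else
  echo "memory.reclaim not writable here (need root + cgroup v2 + kernel >= 5.19)"
fi
```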
The desired flow is the same as above with the order changed: the kernel swaps cold pages out first, and kubelet evicts only if swapping cannot relieve the pressure.
Important pointers:
Why is this needed?
KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap