
[KEP-2400] [Swap]: Verify memory pressure behavior with swap enabled #120800

Open
Tracked by #2400
iholder101 opened this issue Sep 21, 2023 · 18 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@iholder101
Contributor

What would you like to be added?

A replacement for #105023

This issue addresses problems related to node memory pressure and swap.

In short, the problem is that the kernel tries to avoid/defer swapping as much as possible, so swap generally kicks in only when the node is already under pressure. On the other hand, kubelet configures node-eviction thresholds so that it can reclaim memory before the kernel starts doing so (e.g. with OOM kills).

So, now the flow is as follows:

  • Kubelet threshold is exceeded => kubelet reclaims memory by killing pods.
  • Swap kicks in.
  • Kernel starts reclaiming space (e.g. by OOMing processes).

With this flow, swap never gets the opportunity to relieve memory pressure, because kubelet starts evicting before swap kicks in.

The desired flow is (the same steps in a different order):

  • Swap kicks in.
  • Kubelet threshold is exceeded => kubelet reclaims memory by killing pods.
  • Kernel starts reclaiming space (e.g. by OOMing processes).
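
For reference, the kubelet-side knobs involved in this ordering are the eviction threshold and the swap settings. A minimal KubeletConfiguration sketch (the values are illustrative defaults, not a recommendation):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false              # allow kubelet to start on a node with swap enabled
featureGates:
  NodeSwap: true               # KEP-2400 feature gate
memorySwap:
  swapBehavior: LimitedSwap    # or UnlimitedSwap in releases that still support it
evictionHard:
  memory.available: "100Mi"    # the default hard eviction threshold discussed here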

Important pointers:

Why is this needed?

KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap

@iholder101 iholder101 added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 21, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 21, 2023
@iholder101
Contributor Author

/sig node
/assign

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 21, 2023
@iholder101
Contributor Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 21, 2023
@xpivarc

xpivarc commented Nov 17, 2023

Hey,
I have tried to reproduce the issue, but it varies a lot based on the setup (both node and kubelet). I think this is not really an issue on the Kubernetes side but rather in the node setup.
My setup: a 4-CPU, 10G-memory VM with a 4G swap file.

apiVersion: v1
kind: Pod
metadata: 
  generateName: alloc
spec:
  restartPolicy: Never
  containers:
  - name: el
    image: registry.k8s.io/stress:v1
    args: ["-mem-alloc-size", "1G", "-mem-alloc-sleep", "100ms", "-mem-total", "4G"]
    resources:
      requests:
        memory: 3G

Testing LimitedSwap, with no system-reserved memory.
With my setup, the kernel by default tries to keep ~100M of memory free, and eviction does not happen. The Burstable Pods end up consuming as much swap as they can (~1.2G per Pod in my case). Any additional Pod is either evicted or OOM-killed, depending on how aggressive the allocations are and whether kubelet has enough time to react (this seems like the desired behavior to me).

To tune the Kernel, one can play with:
/proc/sys/vm/swappiness
/proc/sys/vm/watermark_scale_factor
/proc/sys/vm/min_free_kbytes
Additionally, one can set up the system/user cgroup slices to not swap, which leaves swap available to Burstable Pods only.
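
A rough sketch of that tuning, assuming cgroup v2 and systemd; the exact values are illustrative and need to be adapted per node:

# Make the kernel start reclaiming (and therefore swapping) earlier,
# before kubelet's eviction threshold is reached.
sysctl -w vm.swappiness=100              # default is typically 60; higher swaps more willingly
sysctl -w vm.watermark_scale_factor=200  # default 10; higher wakes kswapd sooner
sysctl -w vm.min_free_kbytes=512000      # keep roughly 500M free

# Keep system daemons out of swap so that Burstable Pods are the only swap users;
# the same drop-in can be applied to user.slice.
mkdir -p /etc/systemd/system/system.slice.d
cat > /etc/systemd/system/system.slice.d/no-swap.conf <<'EOF'
[Slice]
MemorySwapMax=0
EOF
systemctl daemon-reload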

The only thing that might need to change is rankMemoryPressure, as I don't believe it takes swap into account.

@xpivarc

xpivarc commented Nov 17, 2023

Hi @jan-kantert ,
Could you share your setup in which you observed evictions before Swap was exhausted?
Mainly I am interested in:
Kubelet setup:

  • system reserved?
  • kube reserved?
  • eviction threshold?

Node setup:

  • any Swap tunings?
  • cgroup v1/v2?
  • CRI?

It would be great if you could share how you performed the observation and what type of workloads/Pods you were running.

@jan-kantert

Hey, I have tried to reproduce the issue, but it varies a lot based on the setup (both node and kubelet). […]

Our case is as follows: 32GB of swap, 16GB of memory. We spin up pods that request 1GB of memory but use 2GB. This should work fine, but we see evictions from kubelet and memory pressure on the node. We use cgroup v2 and just enabled swap, with no tuning. The evictions happen in kubelet, so this is not related to the Linux OOM killer at all.

@xpivarc

xpivarc commented Nov 17, 2023

Our case is as follows: 32GB of swap, 16GB of memory. We spin up pods that request 1GB of memory but use 2GB. […]

Would you also share whether you configure kube/system reserved memory, and whether you are using the default memory eviction threshold?

@kannon92
Contributor

And @jan-kantert, please also mention the k8s version and the container runtime you use (and its version).

@xpivarc

xpivarc commented Dec 1, 2023

I have been trying to reproduce the reported behavior, but I mostly failed. TL;DR: I think kubelet behaves as expected.

For completeness, I am sharing my setup and the cases I tried.

I am using a VM with 4 CPUs and 16GB of memory. I have set up 32GB of swap to match the reported environment. I am using a local build of Kubernetes (with etcd, the API server, ..., pushing it to the edge). The distro is Fedora 39 without any of the tunings mentioned above. I am reserving 500M each for kube and system with KubeReserved and SystemReserved respectively, keeping the memory eviction threshold at the default (100M).

  1. case - Pod reserving 1G - I used the stress tool used in Kubernetes tests to progressively allocate memory up to 2G per Pod. This works just fine.
  2. case - Pod reserving 1G - I used a combination of stress, allocating up to 1G per Pod, and burst-allocating a 1G block of memory in a loop on top of it. This results in a few OOMs. That seems reasonable to me, as the kernel can't swap fast enough.
  3. case - Pod reserving 1G, consuming 2G - I used standard stress (stress --vm 4 --vm-bytes 500M --vm-keep; see the sketch below). With the default thresholds, this worked fine.
  4. case - Pod reserving 1G, consuming 2G - I gave case 3 a second try with an increased threshold (1500M). Here I observed evictions, as expected. Increasing /proc/sys/vm/min_free_kbytes to ~500-1000M resulted in no evictions.
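
For case 3, the Pod could look roughly like this; the image name is an assumption (any image that ships the standard stress tool works), and the request mirrors the 1G reservation above:

apiVersion: v1
kind: Pod
metadata:
  generateName: stress-
spec:
  restartPolicy: Never
  containers:
  - name: stress
    image: polinux/stress        # assumed image providing the standard `stress` binary
    command: ["stress"]
    args: ["--vm", "4", "--vm-bytes", "500M", "--vm-keep"]
    resources:
      requests:
        memory: 1G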

@jan-kantert

I have been trying to reproduce the reported behavior, but I mostly failed. TL;DR: I think kubelet behaves as expected. […]

Thank you for your effort! How much memory reservation do you use in each test?

  1. case - I used the stress tool used in Kubernetes tests to progressively allocate memory up to 2G per Pod. This works just fine.

This works for me as well if I keep a reservation of 2GB per pod. Swap is neither really used nor needed here.

  2. case - I used a combination of stress, allocating up to 1G per Pod, and burst allocating 1G block of memory in a loop on top of it. This results in a few OOMs. This seems reasonable to me as Kernel can't swap fast enough.

This looks suspicious to me. Swapping is a synchronous activity in Linux. I assume that you set a reservation of 1GB and used 2GB. With 32GB of swap and 16GB of RAM, this should cause neither OOMs nor evictions.

  3. case - I used standard stress (stress --vm 4 --vm-bytes 500M --vm-keep). For the default thresholds, this worked fine.

How much is reserved here? How much used?

  4. case - I gave a second try to case 3 with an increased threshold (1500M). Here I observed evictions as expected. Increasing the /proc/sys/vm/min_free_kbytes to ~500-1000M resulted in no eviction.

That indicates that you are operating very close to the Eviction threshold.

My repro looks like this (same node):

  1. Start pod A with a 14GB reservation that actually uses 20GB.
  2. Start pod B with a 1GB reservation and 4GB of usage.
  3. Wait.
  4. The node goes into memory pressure and evicts one of the pods.

The node is clearly not out of memory. You can even construct cases where the node has free memory available. Step 3 is important as kubelet will take a few seconds to react.
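
For concreteness, pod A from this repro might look roughly like the following, reusing the stress image from earlier in the thread; the sizes come from the steps above, everything else is illustrative:

apiVersion: v1
kind: Pod
metadata:
  generateName: alloc-a-
spec:
  restartPolicy: Never
  containers:
  - name: el
    image: registry.k8s.io/stress:v1
    # progressively allocate up to 20G in 1G steps while only requesting 14G
    args: ["-mem-alloc-size", "1G", "-mem-alloc-sleep", "100ms", "-mem-total", "20G"]
    resources:
      requests:
        memory: 14G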

@kannon92
Contributor

@jan-kantert Are you changing the config to LimitedSwap, or just using the default settings for swap (i.e. UnlimitedSwap)?

@kannon92
Contributor

kannon92 commented Dec 14, 2023

Since this was considered a blocker for swap, I wanted to go through and try and reproduce the case.

From my understanding, the issue is that we are evicting pods before swap can be used.

Setup

  1. Provision a GCP Fedora 38 node with 8 GB of boot and 8 GB of zram for swap (these were the default settings).
  2. Kernel swappiness is set to 60.
  3. Run CRI-O release-1.28.
  4. Build and run local-up-cluster for k/k on release-1.28.
  5. Run FEATURE_GATES=NodeSwap=true hack/local-up-cluster.sh

Memory before running any pods:

free -h
               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       634Mi       222Mi       1.6Gi       6.9Gi       5.2Gi
Swap:          7.7Gi       0.0Ki       7.7Gi

I have two workloads to simulate @jan-kantert's example:

Pod1.yaml

apiVersion: v1
kind: Pod
metadata: 
  generateName: alloc
spec:
  restartPolicy: Never
  containers:
  - name: el
    image: registry.k8s.io/stress:v1
    args: ["-mem-alloc-size", "1G", "-mem-alloc-sleep", "100ms", "-mem-total", "4G"]
    resources:
      requests:
        memory: 1G

Even running just this pod, I see free reporting that we are using swap.

Pod2.yaml

apiVersion: v1
kind: Pod
metadata: 
  generateName: alloc
spec:
  restartPolicy: Never
  containers:
  - name: el
    image: registry.k8s.io/stress:v1
    args: ["-mem-alloc-size", "1G", "-mem-alloc-sleep", "100ms", "-mem-total", "3G"]
    resources:
      requests:
        memory: 1.5G

Running these workloads, I can see with free that swap is used, and neither pod is evicted before swap is used.

free -h
               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       6.4Gi       220Mi        35Mi       1.1Gi       1.0Gi
Swap:          7.7Gi       3.3Gi       4.4Gi

I see swap being utilized and eviction is avoided.
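
For anyone repeating this, swap usage can also be checked per pod cgroup rather than only node-wide. A rough sketch, assuming cgroup v2 with the systemd cgroup driver (the exact path depends on the pod's QoS class and UID):

# Node-wide view, as above.
free -h

# Per-cgroup view: memory.swap.current reports how many bytes of swap each cgroup uses.
# The glob below is illustrative; substitute the pod's actual cgroup path.
grep -H . /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/*/memory.swap.current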

@kannon92
Contributor

/triage needs-moreinformation

@jan-kantert can you please share with us all the steps you took to reproduce this?

What version of kube are you using?

What kind of workloads are running on the node?

What is the pod yaml you use to reproduce this?

What container runtime (and version) are you using?

Are you using LimitedSwap or UnlimitedSwap?

What kind of swap are you using - disk, memory, type of disk?

What is the OS (and version)?

What is your kernel setup for swap?

@k8s-ci-robot
Contributor

@kannon92: The label(s) triage/needs-moreinformation cannot be applied, because the repository doesn't have them.

In response to this:

/triage needs-moreinformation

[…]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kannon92
Contributor

/triage needs-more-information

@k8s-ci-robot
Contributor

@kannon92: The label(s) triage/needs-more-information cannot be applied, because the repository doesn't have them.

In response to this:

/triage needs-more-information

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kannon92
Contributor

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Dec 14, 2023
@ozzieba

ozzieba commented Apr 2, 2024

Using swap, I've been hitting memory pressure that looks like it's caused by #43916. I'm trying to work around it by writing to /sys/fs/cgroup/memory.reclaim whenever awk '/MemFree/ {print $2}' /host/proc/meminfo gets too low, but it would be nice not to have to do that.
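
A rough sketch of that workaround (the threshold, reclaim amount, and interval are all illustrative; memory.reclaim requires cgroup v2 and a reasonably recent kernel):

#!/usr/bin/env bash
# Proactively ask the kernel to reclaim (and swap out) memory whenever free memory
# drops below ~1 GiB, so that kubelet's eviction threshold is not hit first.
while true; do
  free_kb=$(awk '/MemFree/ {print $2}' /proc/meminfo)
  if [ "$free_kb" -lt $((1024 * 1024)) ]; then
    echo "512M" > /sys/fs/cgroup/memory.reclaim
  fi
  sleep 10
done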

@kannon92
Contributor

Using swap, I've been hitting memory pressure that looks like it's caused by #43916. […]

I'd suggest creating a new issue.
