
[KEP-2400] [Swap]: Verify memory pressure behavior with swap enabled #120800

Open
Tracked by #2400
iholder101 opened this issue Sep 21, 2023 · 18 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@iholder101
Contributor

What would you like to be added?

A replacement for #105023

This issue addresses problems related to node memory pressure and swap.

In short, the problem is that the kernel tries to avoid/defer swapping as much as possible, so swap generally kicks in only when the node is already under pressure. On the other hand, kubelet configures node-eviction thresholds so that it can reclaim memory before the kernel starts doing so (e.g. with OOM kills).

So, now the flow is as follows:

  • Kubelet threshold is exceeded => kubelet reclaims memory by killing pods.
  • Swap kicks in.
  • Kernel starts reclaiming space (e.g. by OOMing processes).

With this flow, swap never gets the opportunity to relieve memory pressure, because kubelet starts evicting before swap kicks in.

The desired flow is (the same steps in a different order):

  • Swap kicks in.
  • Kubelet threshold is exceeded => kubelet reclaims memory by killing pods.
  • Kernel starts reclaiming space (e.g. by OOMing processes).
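
For reference, the kubelet-side knobs involved in this ordering are the eviction threshold and the swap settings. A minimal KubeletConfiguration sketch (the values are illustrative defaults, not a recommendation):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false              # allow kubelet to start on a node with swap enabled
featureGates:
  NodeSwap: true               # KEP-2400 feature gate
memorySwap:
  swapBehavior: LimitedSwap    # or UnlimitedSwap in releases that still support it
evictionHard:
  memory.available: "100Mi"    # the default hard eviction threshold discussed here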

Important pointers:

Why is this needed?

KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap

@iholder101 iholder101 added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 21, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 21, 2023
@iholder101
Contributor Author

/sig node
/assign

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 21, 2023
@iholder101
Contributor Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 21, 2023
@xpivarc

xpivarc commented Nov 17, 2023

Hey,
I have tried to reproduce the issue, but it varies a lot based on the setup (both node and kubelet). I think this is not really an issue on the Kubernetes side but rather in the node setup.
My setup: a 4-CPU, 10G-memory VM with a 4G swap file.

apiVersion: v1
kind: Pod
metadata: 
  generateName: alloc
spec:
  restartPolicy: Never
  containers:
  - name: el
    image: registry.k8s.io/stress:v1
    args: ["-mem-alloc-size", "1G", "-mem-alloc-sleep", "100ms", "-mem-total", "4G"]
    resources:
      requests:
        memory: 3G

Testing LimitedSwap, with no system-reserved memory.
With my setup, the kernel by default tries to keep ~100M of memory free, and eviction does not happen. The Burstable Pods end up consuming as much swap as they can (~1.2G per Pod in my case). Any additional Pod is either evicted or OOM-killed, depending on how aggressive the allocations are and whether kubelet has enough time to react (this seems like the desired behavior to me).

To tune the Kernel, one can play with:
/proc/sys/vm/swappiness
/proc/sys/vm/watermark_scale_factor
/proc/sys/vm/min_free_kbytes
Additionally, one can set up the system/user cgroup slices to not swap, which leaves swap available to Burstable Pods only.
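
A rough sketch of that tuning, assuming cgroup v2 and systemd; the exact values are illustrative and need to be adapted per node:

# Make the kernel start reclaiming (and therefore swapping) earlier,
# before kubelet's eviction threshold is reached.
sysctl -w vm.swappiness=100              # default is typically 60; higher swaps more willingly
sysctl -w vm.watermark_scale_factor=200  # default 10; higher wakes kswapd sooner
sysctl -w vm.min_free_kbytes=512000      # keep roughly 500M free

# Keep system daemons out of swap so that Burstable Pods are the only swap users;
# the same drop-in can be applied to user.slice.
mkdir -p /etc/systemd/system/system.slice.d
cat > /etc/systemd/system/system.slice.d/no-swap.conf <<'EOF'
[Slice]
MemorySwapMax=0
EOF
systemctl daemon-reload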

The only thing that might need to change is rankMemoryPressure, as I don't believe it takes swap into account.

@xpivarc

xpivarc commented Nov 17, 2023

Hi @jan-kantert ,
Could you share your setup in which you observed evictions before Swap was exhausted?
Mainly I am interested in:
Kubelet setup:

  • system reserved?
  • kube reserved?
  • eviction threshold?

Node setup:

  • any Swap tunings?
  • cgroup v1/v2?
  • CRI?

It would be great if you could share how you performed the observation and what type of workloads/Pods you were running.

@jan-kantert

Hey, I have tried to reproduce the issue, but it varies a lot based on the setup (both node and kubelet). […]

Our case is as follows: 32GB of swap, 16GB of memory. We spin up pods that request 1GB of memory but use 2GB. This should work fine, but we see evictions from kubelet and memory pressure on the node. We use cgroup v2 and just enabled swap, with no tuning. The evictions happen in kubelet, so this is not related to the Linux OOM killer at all.

@xpivarc

xpivarc commented Nov 17, 2023

Our case is as follows: 32GB of swap, 16GB of memory. We spin up pods that request 1GB of memory but use 2GB. […]

Would you also share whether you configure kube/system reserved memory, and whether you are using the default memory eviction threshold?

@kannon92
Contributor

And @jan-kantert, please also mention the k8s version and the container runtime you use (and its version).

@xpivarc

xpivarc commented Dec 1, 2023

I have been trying to reproduce the reported behavior, but I mostly failed. TL;DR: I think kubelet behaves as expected.

For completeness, I am sharing my setup and the cases I tried.

I am using a VM with 4 CPUs and 16GB of memory. I have set up 32GB of swap to match the reported environment. I am using a local build of Kubernetes (with etcd, the API server, ..., pushing it to the edge). The distro is Fedora 39 without any of the tunings mentioned above. I am reserving 500M each for kube and system with KubeReserved and SystemReserved respectively, keeping the memory eviction threshold at the default (100M).

  1. case - Pod reserving 1G - I used the stress tool used in Kubernetes tests to progressively allocate memory up to 2G per Pod. This works just fine.
  2. case - Pod reserving 1G - I used a combination of stress, allocating up to 1G per Pod, and burst-allocating a 1G block of memory in a loop on top of it. This results in a few OOMs. That seems reasonable to me, as the kernel can't swap fast enough.
  3. case - Pod reserving 1G, consuming 2G - I used standard stress (stress --vm 4 --vm-bytes 500M --vm-keep; see the sketch below). With the default thresholds, this worked fine.
  4. case - Pod reserving 1G, consuming 2G - I gave case 3 a second try with an increased threshold (1500M). Here I observed evictions, as expected. Increasing /proc/sys/vm/min_free_kbytes to ~500-1000M resulted in no evictions.
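
For case 3, the Pod could look roughly like this; the image name is an assumption (any image that ships the standard stress tool works), and the request mirrors the 1G reservation above:

apiVersion: v1
kind: Pod
metadata:
  generateName: stress-
spec:
  restartPolicy: Never
  containers:
  - name: stress
    image: polinux/stress        # assumed image providing the standard `stress` binary
    command: ["stress"]
    args: ["--vm", "4", "--vm-bytes", "500M", "--vm-keep"]
    resources:
      requests:
        memory: 1G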

@jan-kantert

I have been trying to reproduce the reported behavior, but I mostly failed. TL;DR: I think kubelet behaves as expected. […]

Thank you for your effort! How much memory reservation do you use in each test?

  1. case - I used the stress tool used in Kubernetes tests to progressively allocate memory up to 2G per Pod. This works just fine.

This works for me as well if I keep a reservation of 2GB per pod. Swap is neither really used nor needed here.

  2. case - I used a combination of stress, allocating up to 1G per Pod, and burst allocating 1G block of memory in a loop on top of it. This results in a few OOMs. This seems reasonable to me as Kernel can't swap fast enough.

This looks suspicious to me. Swapping is a synchronous activity in Linux. I assume that you set a reservation of 1GB and used 2GB. With 32GB of swap and 16GB of RAM, this should cause neither OOMs nor evictions.

  3. case - I used standard stress (stress --vm 4 --vm-bytes 500M --vm-keep). For the default thresholds, this worked fine.

How much is reserved here? How much used?

  4. case - I gave a second try to case 3 with an increased threshold (1500M). Here I observed evictions as expected. Increasing the /proc/sys/vm/min_free_kbytes to ~500-1000M resulted in no eviction.

That indicates that you are operating very close to the Eviction threshold.

My repro looks like this (same node):

  1. Start pod A with a 14GB reservation that actually uses 20GB.
  2. Start pod B with a 1GB reservation and 4GB of usage.
  3. Wait.
  4. The node goes into memory pressure and evicts one of the pods.

The node is clearly not out of memory. You can even construct cases where the node has free memory available. Step 3 is important as kubelet will take a few seconds to react.
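
For concreteness, pod A from this repro might look roughly like the following, reusing the stress image from earlier in the thread; the sizes come from the steps above, everything else is illustrative:

apiVersion: v1
kind: Pod
metadata:
  generateName: alloc-a-
spec:
  restartPolicy: Never
  containers:
  - name: el
    image: registry.k8s.io/stress:v1
    # progressively allocate up to 20G in 1G steps while only requesting 14G
    args: ["-mem-alloc-size", "1G", "-mem-alloc-sleep", "100ms", "-mem-total", "20G"]
    resources:
      requests:
        memory: 14G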

@kannon92
Contributor

@jan-kantert Are you changing the config to LimitedSwap, or just using the default settings for swap (i.e. UnlimitedSwap)?

@kannon92
Contributor

kannon92 commented Dec 14, 2023

Since this was considered a blocker for swap, I wanted to go through and try and reproduce the case.

From my understanding, the issue is that we are evicting pods before swap can be used.

Setup

  1. Provision a GCP Fedora 38 node with 8 GB of boot and 8 GB of zram for swap (these were the default settings).
  2. Kernel swappiness is set to 60.
  3. Run CRI-O release-1.28.
  4. Build and run local-up-cluster for k/k on release-1.28.
  5. Run FEATURE_GATES=NodeSwap=true hack/local-up-cluster.sh

Memory before running any pods:

free -h
               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       634Mi       222Mi       1.6Gi       6.9Gi       5.2Gi
Swap:          7.7Gi       0.0Ki       7.7Gi

I have two workloads to simulate @jan-kantert's example:

Pod1.yaml

apiVersion: v1
kind: Pod
metadata: 
  generateName: alloc
spec:
  restartPolicy: Never
  containers:
  - name: el
    image: registry.k8s.io/stress:v1
    args: ["-mem-alloc-size", "1G", "-mem-alloc-sleep", "100ms", "-mem-total", "4G"]
    resources:
      requests:
        memory: 1G

Even running just this pod, I see free reporting that we are using swap.

Pod2.yaml

apiVersion: v1
kind: Pod
metadata: 
  generateName: alloc
spec:
  restartPolicy: Never
  containers:
  - name: el
    image: registry.k8s.io/stress:v1
    args: ["-mem-alloc-size", "1G", "-mem-alloc-sleep", "100ms", "-mem-total", "3G"]
    resources:
      requests:
        memory: 1.5G

Running these workloads, I can see with free that swap is used, and neither pod is evicted before swap is used.

free -h
               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       6.4Gi       220Mi        35Mi       1.1Gi       1.0Gi
Swap:          7.7Gi       3.3Gi       4.4Gi

I see swap being utilized and eviction is avoided.
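
For anyone repeating this, swap usage can also be checked per pod cgroup rather than only node-wide. A rough sketch, assuming cgroup v2 with the systemd cgroup driver (the exact path depends on the pod's QoS class and UID):

# Node-wide view, as above.
free -h

# Per-cgroup view: memory.swap.current reports how many bytes of swap each cgroup uses.
# The glob below is illustrative; substitute the pod's actual cgroup path.
grep -H . /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/*/memory.swap.current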

@kannon92
Contributor

/triage needs-moreinformation

@jan-kantert can you please share with us all the steps you took to reproduce this?

What version of kube are you using?

What kind of workloads are running on the node?

What is the pod yaml you use to reproduce this?

What container runtime (and version) are you using?

Are you using LimitedSwap or UnlimitedSwap?

What kind of swap are you using - disk, memory, type of disk?

What is the OS (and version)?

What is your kernel setup for swap?

@k8s-ci-robot
Contributor

@kannon92: The label(s) triage/needs-moreinformation cannot be applied, because the repository doesn't have them.

In response to this:

/triage needs-moreinformation

[…]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kannon92
Contributor

/triage needs-more-information

@k8s-ci-robot
Contributor

@kannon92: The label(s) triage/needs-more-information cannot be applied, because the repository doesn't have them.

In response to this:

/triage needs-more-information

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kannon92
Contributor

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Dec 14, 2023
@ozzieba

ozzieba commented Apr 2, 2024

Using swap, I've been hitting memory pressure that looks like it's caused by #43916. I'm trying to work around it by writing to /sys/fs/cgroup/memory.reclaim whenever awk '/MemFree/ {print $2}' /host/proc/meminfo gets too low, but it would be nice not to have to do that.
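
A rough sketch of that workaround (the threshold, reclaim amount, and interval are all illustrative; memory.reclaim requires cgroup v2 and a reasonably recent kernel):

#!/usr/bin/env bash
# Proactively ask the kernel to reclaim (and swap out) memory whenever free memory
# drops below ~1 GiB, so that kubelet's eviction threshold is not hit first.
while true; do
  free_kb=$(awk '/MemFree/ {print $2}' /proc/meminfo)
  if [ "$free_kb" -lt $((1024 * 1024)) ]; then
    echo "512M" > /sys/fs/cgroup/memory.reclaim
  fi
  sleep 10
done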

@kannon92
Contributor

Using swap, I've been hitting memory pressure that looks like it's caused by #43916. […]

I'd suggest creating a new issue.
