Kubelet/Kubernetes should work with Swap Enabled #53533

Open
outcoldman opened this issue Oct 6, 2017 · 115 comments

Comments

@outcoldman commented Oct 6, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:

Kubelet/Kubernetes 1.8 does not work with Swap enabled on Linux Machines.

I have found this original issue #31676
This PR #31996
and last change which enabled it by default 71e8c8e

If Kubernetes does not know how to handle memory eviction when swap is enabled, it should find a way to do so, rather than asking users to get rid of swap.

See, for example, kernel.org's Chapter 11, Swap Management:

The casual reader may think that with a sufficient amount of memory, swap is unnecessary but this brings us to the second reason. A significant number of the pages referenced by a process early in its life may only be used for initialisation and then never used again. It is better to swap out those pages and create more disk buffers than leave them resident and unused.

When running a lot of Node/Java applications, I have always seen many pages swapped out, simply because they aren't used anymore.

What you expected to happen:

Kubelet/Kubernetes should work with swap enabled. I believe that instead of disabling swap and giving users no choice, Kubernetes should support more use cases and various workloads, some of which may be applications that rely on caches.

I am not sure how Kubernetes decides what to kill during memory eviction, but given that Linux has this capability, maybe it should align with how Linux does it? https://www.kernel.org/doc/gorman/html/understand/understand016.html

I would suggest rolling back the change that fails when swap is enabled, and revisiting how memory eviction currently works in Kubernetes. Swap can be important for some workloads.

How to reproduce it (as minimally and precisely as possible):

Run kubernetes/kubelet with default settings on a Linux box.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/sig node
cc @mtaufen @vishh @derekwaynecarr @dims

@derekwaynecarr (Member) commented Oct 7, 2017

Support for swap is non-trivial. Guaranteed pods should never require swap. Burstable pods should have their requests met without requiring swap. BestEffort pods have no guarantee. The kubelet right now lacks the smarts to provide the right amount of predictable behavior here across pods.

We discussed this topic at the resource mgmt face to face earlier this year. We are not super interested in tackling this in the near term relative to the gains it could realize. We would prefer to improve reliability around pressure detection, and optimize issues around latency before trying to optimize for swap, but if this is a higher priority for you, we would love your help.

@derekwaynecarr (Member) commented Oct 7, 2017

/kind feature

@outcoldman (Author) commented Oct 9, 2017

@derekwaynecarr thank you for the explanation! It was hard to find any information/documentation on why swap should be disabled for Kubernetes, which was the main reason I opened this topic. At this point this issue is not a high priority for me; I just wanted to make sure we have a place where it can be discussed.

@matthiasr (Member) commented Oct 9, 2017

There is more context in the discussion here: #7294 – having swap available has very strange and bad interactions with memory limits. For example, a container that hits its memory limit would then start spilling over into swap (this appears to be fixed since f4edaf2 – containers won't be allowed to use any swap whether it's there or not).

@fieryorc commented Jan 2, 2018

This is a critical use case for us too. We have a cron job that occasionally runs into high memory usage (>30 GB), and we don't want to permanently allocate 40+ GB nodes. Also, given that we run in three zones (GKE), this will allocate 3 such machines (1 in each zone). And this configuration has to be repeated in 3+ production instances and 10+ test instances, making K8s super expensive to use. We are forced to run 25+ 48 GB nodes, which incurs a huge cost!
Please enable swap!

@hjwp commented Jan 5, 2018

A workaround for those who really want swap. If you:

  • start the kubelet with --fail-swap-on=false
  • add swap to your nodes

then containers that do not specify a memory requirement will, by default, be able to use all of the machine's memory, including swap.

That's what we're doing. Or at least, I'm pretty sure it is; I didn't implement it personally, but that's what I gather.

This might only be a viable strategy if none of your containers ever specify an explicit memory requirement...
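The flag in the first step can also be set through the kubelet's configuration file rather than the command line; a minimal sketch (the field name is from the KubeletConfiguration API, but the file path is a typical default, not universal):

```yaml
# /var/lib/kubelet/config.yaml (path varies by installation)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Do not exit if swap is enabled on the node
failSwapOn: false
```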

@fieryorc commented Jan 6, 2018

We run in GKE, and I don't know of a way to set those options.

@vishh (Member) commented Jan 25, 2018

I'd be open to considering adopting zswap if someone can evaluate the implications to memory evictions in kubelet.

@ghost commented Jan 30, 2018

I am running Kubernetes on my local Ubuntu laptop, and with each restart I have to turn off swap. I also have to be careful not to go near the memory limit, since swap is off.

Is there any way to avoid turning off swap on each restart, such as a configuration-file change in the existing installation?

I don't need swap on the nodes running in the cluster.

It's just the other applications on my laptop, outside the Kubernetes local dev cluster, that need swap turned on.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T10:09:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T09:42:01Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

Right now the flag is not working:

# systemctl restart kubelet --fail-swap-on=false
systemctl: unrecognized option '--fail-swap-on=false'
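As the error shows, systemctl does not pass flags through to the service; with a kubeadm-style install, the flag can instead be added via a systemd drop-in (the path and variable name follow kubeadm conventions and may differ on other installations):

```ini
# /etc/systemd/system/kubelet.service.d/99-swap.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"
```

followed by `systemctl daemon-reload && systemctl restart kubelet`.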

@mtaufen (Contributor) commented Feb 2, 2018

@ghost commented Feb 2, 2018

thanks @mtaufen

@dbogatov commented Feb 14, 2018

For systems that bootstrap the cluster for you (like Terraform), you may need to modify the service file.

This worked for me:

sudo sed -i '/kubelet-wrapper/a \ --fail-swap-on=false \\\' /etc/systemd/system/kubelet.service

@srevenant commented Apr 3, 2018

Not supporting swap by default? I was surprised to hear this -- I thought Kubernetes was ready for prime time? Swap is one of those features.

This is not really optional in most open use cases -- it is how the Unix ecosystem is designed to run, with the VMM switching out inactive pages.

If the choice is no swap or no memory limits, I'll choose to keep swap any day, and just spin up more hosts when I start paging, and I will still come out saving money.

Can somebody clarify -- is the problem with memory eviction only a problem if you are using memory limits in the pod definition, but otherwise, it is okay?

It'd be nice to work in a world where I have control over the way an application memory works so I don't have to worry about poor memory usage, but most applications have plenty of inactive memory space.

I honestly think this recent move to run servers without swap is driven by the PaaS providers trying to coerce people into larger memory instances--while disregarding ~40 years of memory management design. The reality is that the kernel is really good about knowing what memory pages are active or not--let it do its job.

@chrissound commented May 1, 2018

This also has the effect that if memory gets exhausted on the node, the node can potentially lock up completely, requiring a restart, rather than just slowing down and recovering a while later.

@fejta-bot commented Jul 30, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@derekwaynecarr (Member) commented Nov 4, 2020

I put swap on the agenda to see if there is community appetite or volunteers to help push this forward in 1.21. As I noted in 2017, I have no objection to supporting swap; it just must not confuse kubelet eviction, pod priority, or pod quality of service, and, importantly, pods must be able to say whether or not they tolerate swap. All these things are important to ensure pods are portable.

A lot of energy has lately been focused on making things like NUMA-aligned memory work, but if there are folks who are less performance-sensitive and equally motivated to move this space forward, we would love help getting a head start on the design of a detailed KEP in this space.

@superdave commented Nov 24, 2020

I have not kept up with the community process terribly well of late, as things have been super busy for me, though they should calm down somewhat soon. Is there a way I can engage without having to join a Slack channel?

@ehashman (Member) commented Dec 17, 2020

@superdave I'm going to look into putting a document together to cover use cases and requirements in the upcoming year. (Will link here and send to the mailing list when I have something!)

@cgwalters (Contributor) commented Dec 22, 2020

Somewhat related to this: xref https://fedoraproject.org/wiki/Changes/EnableSystemdOomd. This uses PSI, which looks like it was mentioned at least here: #43916 (comment).
(I am currently recommending that systemd-oomd be disabled, at least for FCOS, but if systemd upstream enables it by default, it's likely to become something we need to either check for and disable, integrate with, or share code with.)

@iMartyn commented Dec 26, 2020

@ehashman Great that this is going forward. One use case I haven't seen mentioned is the growing number of "edge compute" clusters. For example, I've got a cluster board of 7 machines with a fair amount of compute but constrained memory: 2 GB per node. They will run k3s, but the memory left over for workloads would be minimal. With swap they would certainly run slower, but they would be able to handle normal workloads. Raspberry Pi clusters are also becoming common, and whilst the top-end boards now have a staggering (hah) 4 GB of RAM, it's still an environment where k8s could thrive if it had swap (wear on MMC devices aside).

@RussianNeuroMancer commented Dec 26, 2020

I've got a clusterboard of 7 machines with a fair amount of compute but constrained memory - 2Gb per node.

Raspberry Pi clusters are also becoming common

There are also cluster servers from Firefly (R1, R2) that combine up to 72 Core-3399-JD4 boards, each with no more than 4 GB of RAM, unfortunately.

@jolestar (Contributor) commented Dec 30, 2020

Sharing my story:
I tried to migrate GitHub Actions to k8s using https://github.com/summerwind/actions-runner-controller, which supports autoscaling runners on demand based on pending pull requests. But the runner's build process gets killed by the system when it tries to build a Rust project, because LLVM eats the pod's whole memory; upgrading the machine to 64 GB was useless. The only fix is to increase the node's swap, but that needs support from the cloud service provider, and I cannot do it myself.

@sftim (Contributor) commented Dec 31, 2020

Maybe the MVP version of this is to let the kubelet work on nodes where a swap area is configured, and have the kubelet update the .status of the node (maybe a node condition that leads to a taint?). If the kubelet requires a TolerateSwapConfigured setting (or otherwise won't start when swap is active) and such nodes are also tainted by default, then there is a low risk of surprise to cluster operators.

With that minimum available, scheduling plugins and other mechanisms could allow out-of-tree ways to ensure that task placement takes account of paging and swap space. I'd be fine with out-of-tree mechanisms so long as they work; if a clear winner emerges, it could move in-tree.
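The taint-plus-toleration idea above might look something like the following sketch. The taint key and the mechanics of applying it are purely illustrative (no such taint exists in the API today); only the toleration syntax itself is standard Kubernetes:

```yaml
# Illustrative only: a pod opting in to run on nodes where swap is configured,
# assuming the kubelet (or an out-of-tree agent) applied a hypothetical
# "node.kubernetes.io/swap-configured" taint to such nodes.
apiVersion: v1
kind: Pod
metadata:
  name: swap-tolerant-pod
spec:
  tolerations:
  - key: "node.kubernetes.io/swap-configured"   # hypothetical taint key
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: registry.example/app:latest          # placeholder image
```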

@dElogics commented Jan 4, 2021

This must be combined with cgroup limits on memory + swap, which is incredibly useful when used with a combination of zswap + SSD, or with zram.
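For context, cgroup v2 already exposes separate memory and swap limits, and systemd surfaces them directly (MemoryMax and MemorySwapMax map to the cgroup's memory.max and memory.swap.max). A sketch with illustrative values:

```ini
# example.service (values are illustrative)
[Service]
ExecStart=/usr/local/bin/example-daemon
# Hard cap on RAM; beyond this the kernel reclaims or OOM-kills
MemoryMax=512M
# Separate cap on how much of this unit's memory may be swapped out
MemorySwapMax=256M
```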

@ehashman (Member) commented Jan 5, 2021

I have put together a draft document that lists use cases, scope, and current proposals, and shared it with the SIG Node group: https://docs.google.com/document/d/1CZtRtC8W8FwW_VWQLKP9DW_2H-hcLBcH9KBbukie67M/edit#

Please take a look, document missing use cases, and add your name to the "Contacts" section of the doc if you are interested in contributing!

We are looking at targeting 1.22 for alpha support.

@karan (Member) commented Jan 5, 2021

And here's my doc with some implementation details and ideas: https://docs.google.com/document/d/1qFH-RA7GvaEidOnp7Y-QwAZ8g5Zcyk28VkEJFWcEX-U/edit

@davidvossel commented Jan 5, 2021

Great to see some traction here. This is a feature that the KubeVirt community is keen on leveraging as soon as possible. I'll share a bit about our use case for context.

The KubeVirt project allows managing traditional virtual machines in a Kubernetes cluster. The expectations around VMs differ from those of pods. As opposed to pods, VMs are long running "immortal" workloads that survive being killed and are portable between nodes. For example, we can "Live Migrate" a running VM between nodes in order to keep the VM online while a node is being updated or having maintenance performed on it. So, with Live Migration of VM workloads, this means we're capable of updating the entire cluster without ever disrupting the VM workloads.

Here's where swap comes into play and why we're so interested in it. We're finding that while we can live-migrate workloads to survive node disruption events, that's only possible if there is adequate memory capacity within the remaining cluster nodes to host the VMs being migrated away from a node being drained for maintenance. If a cluster is running with thin memory margins, there likely isn't enough free memory to completely absorb the workloads from a node being placed into maintenance mode. With swap, we can briefly overcommit the rest of the cluster's memory in order to let the VM live migrations take place and perform node maintenance without disrupting workloads (other than the potential performance hit related to swap).

@BenTheElder (Member) commented Jan 5, 2021

@ehashman just saw your email to sig-node / kubernetes-dev, thanks! 👋

FWIW, re: use cases ... KIND currently sets --fail-swap-on=false because we think it's unreasonable to expect developers to change kernel options to hack on Kubernetes / Kubernetes apps, and things work well enough for these purposes as-is. We'd love to see them work better though 😄 (there are of course other issues we need to work through, namely: google/cadvisor#2699 / kubernetes-sigs/kind#877)

@MarSik commented Jan 6, 2021

@ehashman @davidvossel I would like to point out one additional use case for swap with regards to Virtual Machines.

It is possible to save memory by using KSM (deduplication) and/or memory ballooning (reusing memory currently unused by a VM). Both techniques allow higher VM density for use cases where not all the VMs are active at the same time or all the time. Virtual desktops (VDI, VM per student/assignment/class) and CI infrastructures use this.

Swap is useful for being able to start a VM when the node is almost full (but under the limit); it helps absorb the memory spike needed to boot the OS before the deduplication/ballooning kicks in.

@ehashman (Member) commented Jan 20, 2021

/assign

@abdennour commented Feb 24, 2021

swapoff -a && systemctl restart kubelet is my way in an offline environment.

@agowa338 commented Feb 26, 2021

@abdennour That doesn't solve the issue; you're just disabling swap. Depending on your workload, that may or may not be viable, as has already been pointed out in this issue.

@t3hmrman commented Mar 29, 2021

Is there any actual downside to leaving swap on and setting --fail-swap-on=false?

I understand the decision/recommendation, and that it will take some time to work through reconsidering it, but regardless of how that goes, am I correct in thinking that the only immediate downside is over-committed memory and the resulting degraded memory performance for some workloads at the margin?

The scheduler does not take swap into account when determining node resources, right? And the OOM killer will still come around and kill processes that escape their limits -- so, theoretically, a node with no Burstable/BestEffort-classified workloads (and little or no additional usage from external sources) would still function well, especially when you're not at the margins, right?

@ehashman (Member) commented Apr 6, 2021

Swap KEP draft up at kubernetes/enhancements#2602

Feature tracked at kubernetes/enhancements#2400

Aiming for an alpha MVP for 1.22 release (the upcoming one). PTAL!

@ehashman (Member) commented Apr 6, 2021

(And sorry, I realized I assigned this and left everyone hanging - I've sent out some emails to the mailing list and we have been iterating on a design doc I used to develop the draft KEP above.)

@ehashman mentioned this issue Apr 7, 2021
@deavid commented Apr 22, 2021

I'm new to Kubernetes and just learned about this issue; I think there should be at least very minimal support for swap.
I mostly understand why swap is a problem to implement and why it is seen as something that doesn't add much value: Linux doesn't provide much tooling around swap to properly control it (although it is possible), and since Kubernetes is expected to run many pods, one would run it on a big machine with lots of memory. In that scenario, enabling swap brings almost no benefit.

But Kubernetes should aim for broader support of other scenarios. For example, I'm planning to have 3 very small nodes running 1 pod each, and to use Kubernetes mainly for replication and fail-over. Nothing fancy, just one big app on three VPSes.

When the host has a small amount of memory, having swap is critical for the stability of the host system. A Linux distribution does not run on constant or pre-allocated memory, so there is always a chance that something in the host OS could produce a surge in memory use, and without swap the OOM killer would be invoked. In my experience, when the Linux OOM killer comes in, the results are nothing good, and configuring it properly requires extensive knowledge of how your particular OS installation behaves, what's critical and what's not.

Following this train of thought, my problem is more that Kubernetes requires the sysadmin to disable swap entirely on the node than the lack of proper swap support for pods. Showing a nasty warning instead of failing to start seems a better option to me than requiring a flag to be set.

Having proper swap support for pods also sounds really interesting, as it can make the nodes very dense, which can be useful for certain applications. (Some apps preallocate a lot of memory, touch it, and almost never go back to it.) We're also seeing faster drives lately, with PCIe 4.0 and new standards for using drives as memory; with these, moving pages from disk back to memory is fast enough to consider swapping as a way to pack more per server.

My point is basically: 1) I believe swap support is needed; 2) Kubernetes doesn't need to go from 0 to 100 in one shot; there are many middle options that are reasonably valid and would mitigate the majority of issues people have with removing swap entirely.

@ehashman (Member) commented Apr 30, 2021

Since we have a lot of folks commenting on this issue who are new to the Kubernetes development process, I'll try to explain what I linked above a bit more clearly.

  • I've assigned myself to this issue. This means I am working on the implementation for MVP swap support.
  • This is a very large feature, so we can't track its implementation just with a GitHub issue. It must go through the Kubernetes enhancements process aka a KEP to ensure that all of the API changes, etc. get properly reviewed.
  • It will take a minimum of 3 release cycles to "graduate". I am targeting the 1.22 release [August 2021] for alpha support. Alpha means you must enable the feature flag to use it. I'm targeting 1.23 [December 2021] for beta support, where the feature flag would be on by default. In either case, an end user would still need to provision swap and explicitly enable swap support in the kubelet.
  • The design proposal I'm working on is here: https://github.com/ehashman/k-enhancements/blob/kep-2400/keps/sig-node/2400-node-swap/README.md#summary
  • The design proposal has not yet been accepted, but the deadline for this upcoming release cycle is May 13th, so we will know soon, and I'll post an update here when it is marked implementable.

At this point I don't think there's opposition to implementing some kind of swap support, it's just a matter of doing so carefully and in a way that will address most use cases without adding too much complexity or breaking existing users.
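Based on the draft KEP linked above, once the feature reaches alpha, an end user enabling it would likely need a kubelet configuration along these lines (exact field names and behavior values are subject to change while the KEP is under review):

```yaml
# Sketch of an alpha-era kubelet configuration for node swap support,
# per the draft KEP-2400; not final.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  NodeSwap: true        # alpha feature gate; off by default
failSwapOn: false       # allow the kubelet to start with swap present
memorySwap:
  swapBehavior: LimitedSwap   # proposed values: LimitedSwap or UnlimitedSwap
```

On top of this, the operator would still need to provision a swap area on the node itself (e.g. a swap partition or swap file), since the kubelet does not create one.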

@ehashman (Member) commented May 13, 2021

My proposal has been accepted for the 1.22 cycle.

We will proceed with the implementation described in the design doc: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap
