Kubelet/Kubernetes should work with Swap Enabled #53533

outcoldman opened this Issue Oct 6, 2017 · 35 comments

@outcoldman

outcoldman commented Oct 6, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:

Kubelet/Kubernetes 1.8 does not work with Swap enabled on Linux Machines.

I found the original issue #31676,
this PR #31996,
and the last change, which enabled this behavior by default: 71e8c8e.

If Kubernetes does not know how to handle memory eviction when swap is enabled, it should find a way to do so rather than asking users to get rid of swap.

See, for example, kernel.org's Chapter 11, Swap Management:

The casual reader may think that with a sufficient amount of memory, swap is unnecessary but this brings us to the second reason. A significant number of the pages referenced by a process early in its life may only be used for initialisation and then never used again. It is better to swap out those pages and create more disk buffers than leave them resident and unused.

When running a lot of Node.js/Java applications, I have always seen many pages swapped out, simply because they are never used again.

What you expected to happen:

Kubelet/Kubernetes should work with swap enabled. I believe that instead of disabling swap and giving users no choice, Kubernetes should support more use cases and various workloads; some of them can be applications that rely on caches.

I am not sure how Kubernetes decides what to kill with memory eviction, but considering that Linux already has this capability, maybe Kubernetes should align with how Linux does it? https://www.kernel.org/doc/gorman/html/understand/understand016.html

I would suggest rolling back the change that fails when swap is enabled, and revisiting how memory eviction currently works in Kubernetes. Swap can be important for some workloads.

How to reproduce it (as minimally and precisely as possible):

Run Kubernetes/kubelet with default settings on a Linux box.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/sig node
cc @mtaufen @vishh @derekwaynecarr @dims

@derekwaynecarr

Member

derekwaynecarr commented Oct 7, 2017

Support for swap is non-trivial. Guaranteed pods should never require swap. Burstable pods should have their requests met without requiring swap. BestEffort pods have no guarantee. The kubelet right now lacks the smarts to provide the right amount of predictable behavior here across pods.

We discussed this topic at the resource mgmt face to face earlier this year. We are not super interested in tackling this in the near term relative to the gains it could realize. We would prefer to improve reliability around pressure detection, and optimize issues around latency before trying to optimize for swap, but if this is a higher priority for you, we would love your help.

@derekwaynecarr

Member

derekwaynecarr commented Oct 7, 2017

/kind feature

@liggitt liggitt removed the kind/bug label Oct 7, 2017

@outcoldman

outcoldman commented Oct 9, 2017

@derekwaynecarr thank you for the explanation! It was hard to find any information or documentation on why swap should be disabled for Kubernetes; that was the main reason I opened this issue. At this point this is not a high priority for me, I just wanted to make sure we have a place where it can be discussed.

@matthiasr

Member

matthiasr commented Oct 9, 2017

There is more context in the discussion in #7294 – having swap available has very strange and bad interactions with memory limits. For example, a container that hits its memory limit would then start spilling over into swap (this appears to be fixed since f4edaf2 – containers won't be allowed to use any swap whether it's present or not).
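
One way this shows up on a node, as a rough sketch (this assumes cgroup v1 with swap accounting enabled; <pod-cgroup> is a placeholder for the pod's actual cgroup directory, which varies by QoS class and pod UID):

# memory.limit_in_bytes is the RAM limit; memory.memsw.limit_in_bytes caps
# RAM + swap together. When the two are equal, the container can't use swap.
cat /sys/fs/cgroup/memory/kubepods/<pod-cgroup>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/kubepods/<pod-cgroup>/memory.memsw.limit_in_bytes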

@fieryorc

fieryorc commented Jan 2, 2018

This is a critical use case for us too. We have a cron job that occasionally runs into high memory usage (>30GB), and we don't want to permanently allocate 40+GB nodes. Also, given that we run in three zones (GKE), this allocates three such machines (one in each zone). And this configuration has to be repeated across 3+ production instances and 10+ test instances, making K8s super expensive to use. We are forced to run 25+ 48GB nodes, which incurs a huge cost!
Please enable swap!

@hjwp

hjwp commented Jan 5, 2018

A workaround for those who really want swap: if you

  • start kubelet with --fail-swap-on=false, and
  • add swap to your nodes,

then containers which do not specify a memory requirement will by default be able to use all of the machine memory, including swap.

That's what we're doing. Or at least, I'm pretty sure it is; I didn't implement it personally, but that's what I gather.

This might only be a viable strategy if none of your containers ever specify an explicit memory requirement...
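
For reference, a minimal sketch of that setup on a node (flags other than the swap check are omitted; treat this as illustrative rather than a complete kubelet invocation):

# enable swap on the node (list existing swap with: swapon --show)
sudo swapon -a
# start the kubelet without the swap check
sudo kubelet --fail-swap-on=false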

@fieryorc

fieryorc commented Jan 6, 2018

We run on GKE, and I don't know of a way to set those options there.

@vishh

Member

vishh commented Jan 25, 2018

I'd be open to considering adopting zswap if someone can evaluate the implications to memory evictions in kubelet.

@icewheel

icewheel commented Jan 30, 2018

I am running Kubernetes on my local Ubuntu laptop, and with each restart I have to turn off swap. I also have to worry about not going near the memory limit, since swap is off.

Is there any way to avoid turning off swap on each restart, like a configuration file change in the existing installation?

I don't need swap on the nodes running in the cluster.

It's just the other applications on my laptop, besides the Kubernetes local dev cluster, that need swap to be turned on.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T10:09:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T09:42:01Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

Right now the flag is not working.

# systemctl restart kubelet --fail-swap-on=false
systemctl: unrecognized option '--fail-swap-on=false'
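
Note that --fail-swap-on is a flag of the kubelet binary, not of systemctl, so it has to be wired into the kubelet's unit configuration. A sketch using a systemd drop-in (the drop-in file name is arbitrary, and KUBELET_EXTRA_ARGS assumes a kubeadm-style unit; check how your distribution's unit passes arguments):

sudo mkdir -p /etc/systemd/system/kubelet.service.d
# drop-in that appends the flag via the environment variable the unit reads
printf '[Service]\nEnvironment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"\n' | \
  sudo tee /etc/systemd/system/kubelet.service.d/99-swap.conf
sudo systemctl daemon-reload
sudo systemctl restart kubelet
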
@mtaufen

Contributor

mtaufen commented Feb 2, 2018

@icewheel

icewheel commented Feb 2, 2018

thanks @mtaufen

@dbogatov

dbogatov commented Feb 14, 2018

For systems that bootstrap the cluster for you (like Terraform), you may need to modify the kubelet service file.

This worked for me:

sudo sed -i '/kubelet-wrapper/a \ --fail-swap-on=false \\\' /etc/systemd/system/kubelet.service
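
Note that systemd only re-reads unit files on a daemon reload, so after editing the service file something like the following is needed for the change to take effect:

sudo systemctl daemon-reload
sudo systemctl restart kubelet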

@srevenant

srevenant commented Apr 3, 2018

Not supporting swap by default? I was surprised to hear this -- I thought Kubernetes was ready for prime time? Swap is one of those features.

This is not really optional in most common use cases -- it is how the Unix ecosystem is designed to run, with the VMM paging out inactive pages.

If the choice is no swap or no memory limits, I'll choose to keep swap any day, and just spin up more hosts when I start paging, and I will still come out saving money.

Can somebody clarify -- is the problem with memory eviction only a problem if you are using memory limits in the pod definition, but otherwise, it is okay?

It'd be nice to work in a world where I have control over the way an application memory works so I don't have to worry about poor memory usage, but most applications have plenty of inactive memory space.

I honestly think this recent move to run servers without swap is driven by PaaS providers trying to coerce people into larger memory instances, while disregarding ~40 years of memory management design. The reality is that the kernel is really good at knowing which memory pages are active or not; let it do its job.

@chrissound

chrissound commented May 1, 2018

This also has the effect that if memory gets exhausted on a node, the node can become completely locked up, requiring a restart, rather than just slowing down and recovering a while later.

@fejta-bot

fejta-bot commented Jul 30, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@veerendra2

veerendra2 commented Aug 23, 2018

I see a high number of disk reads on my cluster nodes (K8s version v1.11.2). Could it be because swap is disabled?

https://stackoverflow.com/questions/51988566/high-number-of-disk-reads-in-kubernetes-nodes

@bronger

bronger commented Aug 27, 2018

@srevenant In the cluster world, the other nodes' RAM is the new swap. That said, I run two one-node K8s instances where swap makes sense, but that is not the typical use case for K8s.

@linuxman1

linuxman1 commented Sep 1, 2018

@srevenant I completely agree with you. Swap has been used on Unix and Linux by default since they were born; in 15 years of working on Linux, I don't think I have seen an app that asks for swap to be off.
Swap is always on by default when we install any Linux distro, so having to turn it off before installing K8s was a surprise.
The Linux kernel knows well how to manage swap to increase server performance, especially when a server is about to reach its RAM limit.
Does this mean I must switch swap off for K8s to work well?

@superdave

superdave commented Sep 6, 2018

I have an interest in making this work, and I have the skills and a number of machines to test on. If I wanted to contribute, where is the best place to start?

@derekwaynecarr

Member

derekwaynecarr commented Sep 6, 2018

@superdave please put together a KEP in kubernetes/community describing how you would like swap to be supported, and present it to sig-node. We would love to have your help.

@sindelio

sindelio commented Sep 24, 2018

I stand for properly enabling swap in Kubernetes pods. It really does not make sense not to have swap, since almost all containers are custom Linux instances and hence support swap by default.
It's understandable that the feature is complex to implement, but since when has that stopped us from moving forward?

@vasicvuk

vasicvuk commented Sep 26, 2018

I must agree that the swap issue should be solved in Kubernetes, since disabling swap causes node failure when a node runs out of memory. For example, if you have 3 worker nodes (20GB of RAM each) and one goes down because its RAM limit is reached, the 2 other worker nodes will also go down once all the pods are transferred to them.

@matthiasr

Member

matthiasr commented Sep 26, 2018

@vasicvuk

vasicvuk commented Sep 26, 2018

@matthiasr You can do that when you have 10-50 services. But when you have a cluster running over 200 services, half of them deployed using official Helm charts without any memory request in them, your hands are tied.

@bronger

bronger commented Sep 26, 2018

But then, isn’t missing memory requests the problem to be addressed?

@zerkms

zerkms commented Sep 26, 2018

@matthiasr in a lot of cases, memory once mapped to a process is only used once, or never actually used. Those are valid cases, not memory leaks. With swap, those pages eventually get swapped out and may never be swapped in again, yet you free fast RAM for better use.

@Baughn

Baughn commented Oct 2, 2018

Nor is turning swap off a good way to ensure responsiveness. Unless you pin files in memory (a capability K8s should have for executables, at least), the kernel will still swap out any and all file-backed pages in response to memory pressure, or even simply lack of use.

Having swap enabled doesn't markedly change kernel behavior. All it does is provide a space to swap out anonymous pages, or modified pages loaded from COW-mapped files.

You can't turn off swapping entirely, so K8s needs to survive its existence whether or not the special case of anonymous memory swapping is enabled.

That makes this a bug: You're failing to support a kernel feature that can't actually be turned off.

@AndrewSav

AndrewSav commented Oct 2, 2018

@Baughn

the kernel will still swap out any and all file-backed pages in response to memory pressure, or even simply lack of use. Having swap enabled doesn't markedly change kernel behavior.

You can't turn off swapping entirely,

Can you provide some reference for this so that I could educate myself?

@vishh

Member

vishh commented Oct 2, 2018

Unless you pin files in memory (a capability K8s should have for executables, at least),

What is the capability you want k8s to use? If a binary is static, just copying it over to a tmpfs in the pod should help with paging latency.
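
A rough sketch of that idea (the paths and binary name are illustrative):

# keep a static binary's pages in RAM by serving it from tmpfs; with swap
# off, tmpfs contents stay resident instead of being demand-paged from disk
mkdir -p /run/appbin
mount -t tmpfs -o size=128m tmpfs /run/appbin
cp /usr/local/bin/myserver /run/appbin/myserver
exec /run/appbin/myserver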

@adityakali any thoughts on the impact on the kernel's swapping behavior when swap is turned off?

@anguslees

Member

anguslees commented Oct 5, 2018

Can you provide some reference for this so that I could educate myself?

Like all modern virtual memory OSes, Linux demand pages executables from disk into memory. Under memory pressure, the kernel swaps the actual executable code of your program to/from disk just like any other memory pages (the "swap out" is simply a discard because read-only, but the mechanism is the same), and they will be re-fetched if required again. Same goes for things like string constants, which are typically mmapped read-only from other sections of the executable file. Other mmapped files (common for database-type workloads) are also swapped in+out to their relevant backing files (requiring an actual write-out if they've been modified) in response to memory pressure. The only swapping you disable by "disabling swap" is "anonymous memory" - memory that is not associated with a file (the best examples are the "stack" and "heap" data structures).

There are lots of details I'm skipping over in the above description of course. In particular, executables can "lock" portions of their memory space into ram using the mlock family of syscalls, do clever things via madvise(), it gets complicated when the same pages are shared by multiple processes (eg libc.so), etc. I'm afraid I don't have a more useful pointer to read more other than those manpages, or general things like textbooks or linux kernel source/docs/mailing-list.

So, a practical effect of the above is that as your process gets close to its memory limit, the kernel will be forced to evict code portions and constants from ram. The next time that bit of code or constant value is required, the program will pause, waiting to fetch it back from disk (and evict something else). So even with "swap disabled", you still get the same degradation when your working set exceeds available memory.

Before people read the above and start calling to mlock everything into memory or copy everything into a ramdrive as part of the anti-swap witch hunt, I'd like to repeat that the real resource of interest here is working set size - not total size. A program that works linearly through gigabytes of data in ram might only work on a narrow window of that data at a time. This hypothetical program would work just fine with a large amount of swap and a small ram limit - and it would be terribly inefficient to lock it all into real ram. As you've learned from the above explanation, this is exactly the same as a program that has a large amount of code but only executes a small amount of it at any particular moment.

My latest personal real-world example of something like that is linking the kubernetes executables. I'm currently (ironically) unable to compile kubernetes on my kubernetes cluster because the go link stage requires several gigabytes of (anonymous) virtual memory, even though the working set is much smaller.

To really belabour the "its about working set, not virtual memory" point, consider a program that does lots of regular file I/O and nothing to do with mmap. If you have sufficient ram, the kernel will cache repeatedly-used directory structures and file data in ram and avoid going to disk, and it will allow writes to burst into ram temporarily to optimise disk write-out. Even a "naive" program like this will degrade from ram-speeds to disk-speeds depending on working set size vs available ram. When you pin something into ram unnecessarily (eg: using mlock or disabling swap), you prevent the kernel from using that page of physical ram for something actually useful and (if you didn't have enough ram for working set) you've just moved the disk I/O to somewhere even more expensive.
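
These effects are visible with standard tools; a sketch (<PID> is a placeholder for the process you care about):

# major faults count pages re-fetched from disk (code, constants, mmapped
# files); they happen whether or not a swap device is configured
ps -o pid,maj_flt,min_flt,rss,cmd -p <PID>
# vmstat's si/so columns show only anonymous-page swap traffic, and stay at
# zero when swap is disabled, even while major faults continue
vmstat 1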

@superdave: I too am interested in improving the status-quo here. Please include me if you want another pair of eyes to review a doc, or hands at a keyboard.

@superdave

superdave commented Oct 5, 2018

@csis0247

csis0247 commented Nov 9, 2018

I would like to point out that zram works by piggybacking on swap. If there is no swap on k8s, then there is no memory compression, which is something most non-Linux OSes have enabled by default (cue Windows, macOS).

We have an Ubuntu instance on k8s that runs a large batch job every night which consumes a lot of memory. As the workload is not predetermined, we are forced to (expensively) allocate 16GB to the node regardless of its actual memory consumption, to avoid OOM. With memory compression on our local dev server, the job peaks at only 3GB; otherwise, during the day, it takes only 1GB of memory. Banning swap, and thus memory compression, is quite a silly move.
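
For anyone experimenting on their own nodes, a zram-backed swap device takes only a few commands to set up; a sketch (the algorithm and size are illustrative, and the kubelet still needs --fail-swap-on=false to tolerate it):

sudo modprobe zram
echo lz4 | sudo tee /sys/block/zram0/comp_algorithm
echo 4G  | sudo tee /sys/block/zram0/disksize
sudo mkswap /dev/zram0
# higher priority than any disk-backed swap, so compressed RAM is used first
sudo swapon -p 100 /dev/zram0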

@1e100

1e100 commented Nov 9, 2018

I think the main concern here is probably isolation. A typical machine can host a ton of pods, and if memory gets tight, they could start swapping and completely destroy performance for each other. If there's no swap, isolation is much easier.

@Baughn

Baughn commented Nov 9, 2018

I think the main concern here is probably isolation. A typical machine can host a ton of pods, and if memory gets tight, they could start swapping and completely destroy performance for each other. If there's no swap, isolation is much easier.

But as explained previously, disabling swap doesn't buy us anything here. In fact, since it increases memory pressure overall, it may force the kernel to drop parts of the working set when it could otherwise have swapped out unused data -- so it makes the situation worse.

Enabling swap, on its own, should actually improve isolation.

@1e100

1e100 commented Nov 10, 2018

But it does buy you a lot, if you run things the way you're supposed to (and the way Google runs things on Borg): all containers should specify an upper memory limit. Borg takes advantage of Google infra and learns the limits if you want it to (from past resource usage and OOM behavior), but there are limits nonetheless.

I'm actually kind of baffled that K8s folks allowed the memory limit to be optional. Unlike CPU, memory pressure has a very non-linear effect on system performance, as anyone who has seen a system completely lock up due to swapping will attest. Memory limits should really be required by default, with a warning if you choose to disable them.
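
For comparison, making the limit explicit is a small addition to the pod spec; a sketch (the names and sizes are illustrative):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: bounded-example
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "256Mi"   # the scheduler guarantees this much
      limits:
        memory: "512Mi"   # the container is OOM-killed above this
EOF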
