New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kubelet needs to allow configuration of container memory-swap #7294

Closed
derekwaynecarr opened this Issue Apr 24, 2015 · 45 comments

Comments

Projects
None yet
@derekwaynecarr
Copy link
Member

derekwaynecarr commented Apr 24, 2015

I would like the Kubelet to support a flag like the following:

--container-memory-swap-limit-factor=0.1  (defaults to 0, meaning no memory-swap)

To control the default amount of MemorySwap that is given to a container as a percentage of their requested memory limit.

See https://docs.docker.com/reference/run/#runtime-constraints-on-cpu-and-memory for the third-row which is the behavior we show today when the user sets a memory-limit, but the kubelet does not specify memory-swap

(specify memory without memory-swap) The container is not allowed to use more than L bytes of memory, swap plus memory usage is double of that.

We think using memory-swap at all is a big step to take, and if the administrator wanted to allow for memory swap, it should not default to 2L where (L is the memory specified).

For affected area of Kubelet, see:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/kubelet/dockertools/manager.go#L442

For value we should explicitly set, see:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/Godeps/_workspace/src/github.com/fsouza/go-dockerclient/container.go#L177

@derekwaynecarr

This comment has been minimized.

Copy link
Member Author

derekwaynecarr commented Apr 24, 2015

/cc @thockin this aligns with our IRC discussion.

@derekwaynecarr

This comment has been minimized.

Copy link
Member Author

derekwaynecarr commented Apr 24, 2015

/cc @danmcp per our chat as well.

@nkwangleiGIT

This comment has been minimized.

Copy link
Contributor

nkwangleiGIT commented Jun 25, 2015

Hi all,
For this issue, so we want to support memory swap at both kubelet and API level?

  1. Add option --container-memory-swap-limit-factor=0 to kubelet, and it'll be a global default settings for memory swap
  2. In the container's resource limits, we can also specify something like container.resources.limits.memoryswap=1, then it'll use 1L as memory swap(L is the memory specified). And it'll map to 2L of docker memory swap, as this value in docker is the summary of memory and memory swap.

We're also customizing k8s to support memory swap in our project, so it can leverage docker memory swap. I'd like to contribute code using this issue, is my understanding above correct?
Let me know if any design change, thanks!

@erictune

This comment has been minimized.

Copy link
Member

erictune commented Jun 25, 2015

I'd rather not put factors into a ResourceList, as is implied in item 2 of previous comment. All the items in resources.limits should be additive. If you want to allow users to specify this from the API, then they can just specify an absolute amount of swap, or if they leave it empty, then some admission controller could default it, and they could set it to an explicit value of 0 or maybe 1 if they don't want any swap.

@nkwangleiGIT

This comment has been minimized.

Copy link
Contributor

nkwangleiGIT commented Jun 26, 2015

Hi erictune,
I agree with you on option 2, we'd better to let the user specify an absolute value for memory swap instead of factors in API approach, it'll make it consistent with Docker, and sure we should have default value if the user doesn't use it. How about the question below, the memory swap here should be only memory swap or the summary?

And it'll map to 2L of docker memory swap, as this value in docker is the summary of memory and memory swap.

How about kubelet level support, do we still need it?
Thanks.

@erictune

This comment has been minimized.

Copy link
Member

erictune commented Jun 26, 2015

The default value should be to not use swap at all.

I think docker chose their format to be just like the cgroups interface. But, I don't think we need to copy that pattern. I'd rather that the resources list does not double-count resources or mix resources. It should be easy to add up resource lists like you add up vectors, to get aggregate amounts of resources.

So, I think swap should just be the amount of swap, not swap + memory. And it should be called "swap" not "memoryswap", to reflect that.

One case to think about: Swapfiles can be on rotating disk or on SSD. This gives very different performance characteristics. Also, the scheduler needs to subtract any swap space from the storage space of the node when a pod is bound to it. It needs to know whether to subtract that from rotating disk or SSD. Should, should we have separate "diskswap" and "ssdswap" resources?

@nkwangleiGIT

This comment has been minimized.

Copy link
Contributor

nkwangleiGIT commented Jun 29, 2015

Hi erictune,
that's a good point, but I'm not sure why we need separate resource for "diskswap" and "ssdswap", the user can use normal disk or SSD memory swap as they need from performance perspective, but we only need to substract the size of swap when calculate the space size of a node.
Do you mean the swap size maybe differenent depending on it's normal disk or SSD?
Thanks for help!

@erictune

This comment has been minimized.

Copy link
Member

erictune commented Jun 29, 2015

I was thinking of the case where swap files are added dynamically by
kubelet in response to pods being started. In this model, the scheduler
needs to subtract the swap from the resources of the machine, and it needs
to know which resources to reduce..

However, if nodes are statically provisioned with swap, and they exclude
the provisioned swap resources from the reported capacity, then I agree we
can do it like you say.

On Sun, Jun 28, 2015 at 11:20 PM, WangLei notifications@github.com wrote:

Hi erictune,
that's a good point, but I'm not sure why we need separate resource for
"diskswap" and "ssdswap", the user can use normal disk or SSD memory swap
as they need from performance perspective, but we only need to substract
the size of swap when calculate the space size of a node.
Do you mean the swap size maybe differenent depending on it's normal disk
or SSD?
Thanks for help!


Reply to this email directly or view it on GitHub
#7294 (comment)
.

@nkwangleiGIT

This comment has been minimized.

Copy link
Contributor

nkwangleiGIT commented Jun 29, 2015

Hi erictune,
OK, I see, then we may also want to use 'memory swap' as a factor when the scheduler tries to calculate the fit node and schedule the pod, and we also need to know the total size of swap on each node someway.
Do we already have the swap size or capability added in current scheduler and kubelet implementation? it'll probably need more change if we want to add them all, thanks!

@thockin

This comment has been minimized.

Copy link
Member

thockin commented Jun 29, 2015

I don't think we should allow pods to add swap to the system. I'd rather
see something simple like a "max swap" which defaults to 0. If the system
has swap enabled, the pod can use up to that much swap. If the system does
not have swap, too bad. This is bad from an isolation (IOPS, IO bandwidth)
point of view, but we have nothing there now, so we can maybe revisit later?

On Mon, Jun 29, 2015 at 8:32 AM, WangLei notifications@github.com wrote:

Hi erictune,
OK, I see, then we may also want to use 'memory swap' as a factor when the
scheduler tries to calculate the fit node and schedule the pod, and we also
need to know the total size of swap on each node someway.
Do we already have the swap size or capability added in current scheduler
and kubelet implementation? it'll probably need more change if we want to
add them all, thanks!


Reply to this email directly or view it on GitHub
#7294 (comment)
.

@nkwangleiGIT

This comment has been minimized.

Copy link
Contributor

nkwangleiGIT commented Jun 30, 2015

Hi thockin,
OK, we can revisit this after v1.
BTW, for now we enabled the memory swap of pods in our k8s system, just follow what Docker expect. Thanks!

@dchen1107

This comment has been minimized.

Copy link
Member

dchen1107 commented Aug 7, 2015

I don't want to support memory-swap by default, at least for a near future

  1. It won't work well with our qos proposal: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/proposals/resource-qos.md

  2. The workloads asking for Guaranteed resource will not use it. They need repeatable latency to memory. They want predictability on performance too.

  3. Adding swap slows down jobs and introducing more bandwidth to disk and isolation issues. We don't manage disk io yet, and it is hard to manage too. Without better disk io management, simply enabling swap for container/pod is bad solution.

On another side, I understand some users might want to use swap no matter what; or some application is running on a dedicate node without any sharing, which still require swap space. @brendandburns proposed a way to simply pass-through such request to docker engine for certain container / pod with privileges. cc/ @bgrant0607

@bgrant0607

This comment has been minimized.

Copy link
Member

bgrant0607 commented Aug 10, 2015

What is the use case for swap? If swap is enabled for any containers on a host, then no latency guarantees can be provided for other containers.

@dchen1107

This comment has been minimized.

Copy link
Member

dchen1107 commented Nov 2, 2015

Unassign myself since we don't want to add swap support for now unless someone provides me a real appealing use cases.

@dchen1107 dchen1107 removed their assignment Nov 2, 2015

@streamnsight

This comment has been minimized.

Copy link

streamnsight commented Apr 29, 2016

what's the proper way to handle this then? disable swap on the host?

@erictune

This comment has been minimized.

Copy link
Member

erictune commented Apr 29, 2016

Disabling swap is a good approach. When you have multiple containers and multiple machines they could be scheduled on, it is better to kill one container, than to have all the containers on a machine operate at an unpredictable, probably slow, rate.

@mikedanese

This comment has been minimized.

Copy link
Member

mikedanese commented May 19, 2016

cc @vishh

@vishh

This comment has been minimized.

Copy link
Member

vishh commented May 19, 2016

Based on the existing comments, this PR can be closed since there is no intention to support swap in the kubelet in the near future. @derekwaynecarr Feel free to re-open if you'd like to continue discussing support for swap.

@vishh vishh closed this May 19, 2016

@derekwaynecarr

This comment has been minimized.

Copy link
Member Author

derekwaynecarr commented May 19, 2016

Yeah. This can be closed. I actually don't want to support swap right now
either, when this was first opened swap was being set incorrectly. But all
is good now.

On Thursday, May 19, 2016, Vish Kannan notifications@github.com wrote:

Closed #7294 #7294.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#7294 (comment)

@yoshuawuyts

This comment has been minimized.

Copy link

yoshuawuyts commented Aug 14, 2016

To contribute a use case for future reference: Swap space for running npm install on smaller nodes is essential. Small being about <1GB RAM.

When npm install is run, the dependency graph can become a lot larger than the amount of memory available, causing npm install to reliably crash. Being able to allocate swap for this step would be tremendously helpful, if anything just to build a new container with all dependencies compiled that can then be redistributed.

Hope this is somewhat helpful. Thanks!

See Also

@yoshuawuyts

This comment has been minimized.

Copy link

yoshuawuyts commented Aug 15, 2016

@vishh the google compute engine micro burst nodes have 0.6GB RAM; having a small swarm of these floating around actually works quite well for applications that have bursts of activity or when you want to spread nodes out over multiple AZs but wanna keep costs in check. Not to forget people new to k8 that want to take it for a spin and see how well it holds up

@vishh

This comment has been minimized.

Copy link
Member

vishh commented Aug 15, 2016

IMHO, supporting swap is not worth the cost of making micro node clusters
reliable.

On Mon, Aug 15, 2016 at 1:10 PM, Yoshua Wuyts notifications@github.com
wrote:

@vishh https://github.com/vishh the google compute engine micro nodes
https://cloud.google.com/compute/pricing have 0.6GB RAM; having a small
swarm of these floating around actually works quite well for applications
that have bursts of activity or when you want to spread nodes out over
multiple AZs but wanna keep costs in check. Not to forget people new to k8
that want to take it for a spin and see how well it holds up


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7294 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AGvIKEaah2AvSORlsPXGrXUdAg7mUlCQks5qgMe-gaJpZM4EH_zA
.

@yoshuawuyts

This comment has been minimized.

Copy link

yoshuawuyts commented Aug 15, 2016

IMHO, supporting swap is not worth the cost of making micro node clusters
reliable

Yeah agree, there's probably more pressing things to work on right now. All I'm hoping is that when future discussions about adding swap / disk management occur, you'll take the use case of tiny nodes into account C:

@derekwaynecarr

This comment has been minimized.

Copy link
Member Author

derekwaynecarr commented Aug 18, 2016

I would rather support swap as a cheap means of supporting overcommit, but
there are a lot of intermediate steps to get from a to b. Would prefer we
finish up existing QOS capability in 1.5 before supporting swap as an
advanced usage scenario at that point.

On Monday, August 15, 2016, Yoshua Wuyts notifications@github.com wrote:

IMHO, supporting swap is not worth the cost of making micro node clusters
reliable

Yeah agree, there's probably more pressing things to work on right now.
All I'm hoping to achieve is that when future discussions about adding swap
/ disk management occur, you'll take the use case of tiny nodes into
account C:


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7294 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AF8dbCe1WnqzCnahXpRXFweVnWkpfbymks5qgOCtgaJpZM4EH_zA
.

@derekwaynecarr

This comment has been minimized.

Copy link
Member Author

derekwaynecarr commented Aug 18, 2016

And to be clear this original issue was because Kubelet passed the wrong
value (allowing swap) and I wanted to reduce or eliminate it. This was far
before QOS was fully defined.

On Wednesday, August 17, 2016, Derek Carr decarr@redhat.com wrote:

I would rather support swap as a cheap means of supporting overcommit, but
there are a lot of intermediate steps to get from a to b. Would prefer we
finish up existing QOS capability in 1.5 before supporting swap as an
advanced usage scenario at that point.

On Monday, August 15, 2016, Yoshua Wuyts <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

IMHO, supporting swap is not worth the cost of making micro node clusters
reliable

Yeah agree, there's probably more pressing things to work on right now.
All I'm hoping to achieve is that when future discussions about adding swap
/ disk management occur, you'll take the use case of tiny nodes into
account C:


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7294 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AF8dbCe1WnqzCnahXpRXFweVnWkpfbymks5qgOCtgaJpZM4EH_zA
.

@ravilr

This comment has been minimized.

Copy link
Contributor

ravilr commented Oct 7, 2016

@derekwaynecarr @vishh sorry to comment on a closed issue, but this has some context here:

Kubelet seems to set MemorySwap (libcontainer config) to -1 by default here:
https://github.com/kubernetes/kubernetes/blob/v1.4.0/pkg/kubelet/dockertools/docker_manager.go#L662
https://github.com/kubernetes/kubernetes/blob/v1.4.0/vendor/github.com/docker/engine-api/types/container/host_config.go#L253

which means /sys/fs/cgroup/memory.memsw.limit_in_bytes is set to unlimited (mem + swap) for the container.

Isn't the Memory+swap should also be set to container.Resources.Limits.Memory().Value() above. Or am i missing something.

@vishh

This comment has been minimized.

Copy link
Member

vishh commented Oct 7, 2016

Our approach has been that of asking users to not enable swap, but we never
really disallowed usage of swap. Given that it interferes with our ability
to provide QoS features, we are starting to disallow swap. We can either
disallow swap completely and/or have each container configured to not use
swap.

On Fri, Oct 7, 2016 at 3:18 PM, ravilr notifications@github.com wrote:

@derekwaynecarr https://github.com/derekwaynecarr @vishh
https://github.com/vishh sorry to comment on a closed issue, but this
has some context here:

Kubelet seems to set MemorySwap (libcontainer config) to -1 by default
here:
https://github.com/kubernetes/kubernetes/blob/v1.4.0/pkg/
kubelet/dockertools/docker_manager.go#L662

which means /sys/fs/cgroup/memory.memsw.limit_in_bytes is set to
unlimited (mem + swap) for the container.

Isn't the Memory+swap should also be set to container.Resources.Limits.Memory().Value()
above. Or am i missing something.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7294 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AGvIKI0Jp9VwgtgUtpamjEr5UGLNynNKks5qxsVDgaJpZM4EH_zA
.

@ravilr

This comment has been minimized.

Copy link
Contributor

ravilr commented Oct 10, 2016

@derekwaynecarr @vishh setting memsw.limit to unlimited by default, doesn't seem right. Can we fix that. Not every environment has swap disabled, by default.

@hongxima

This comment has been minimized.

Copy link

hongxima commented Dec 19, 2016

Dear all, there is a new use case where we really need either:
a) Disable swap by default (set MemorySwap=0) or
b) Provide an option to do so for kubelet

It's about Windows Docker compatibility, it was all good when using Windows docker version 1.12.2-cs2-ws-beta, but after upgraded to 1.14.0-dev, Windows Docker doesn't start any containers with below error output:
"invalid option: Windows does not support MemorySwap"
The meantime MemorySwap was set to "-1" for "unlimited", but the code expects it to be zero:

https://github.com/docker/docker/blob/master/daemon/daemon_windows.go#L175

Disabling MemorySwap for every container is an alternative, but that would almost kill me in a long term point of view.

@vishh

This comment has been minimized.

Copy link
Member

vishh commented Dec 19, 2016

@ozbillwang

This comment has been minimized.

Copy link

ozbillwang commented Dec 19, 2016

@vishh

Is this statement for Linux as well?

Do you have any document for swap management in Kubernetes? Or any road map that it may support in the future?

@vishh

This comment has been minimized.

Copy link
Member

vishh commented Dec 20, 2016

@derekwaynecarr

This comment has been minimized.

Copy link
Member Author

derekwaynecarr commented Dec 20, 2016

@hongxima

This comment has been minimized.

Copy link

hongxima commented Dec 20, 2016

I've created a separated issue #39003 specifically pointing to the showstopper for Windows case. @brendanburns provided a PR #39005 to disable MemorySwap only when it's Windows OS. Which keeps the current behavior for Linux (as of "unlimited memory swap") and also make Windows Docker working, it seems to be a proper solution at least for my case.

@munkyboy

This comment has been minimized.

Copy link

munkyboy commented Jul 18, 2017

FWIW, the default behavior is to disable swap since 1.6.0

@warmchang

This comment has been minimized.

Copy link
Contributor

warmchang commented Dec 27, 2017

👍

@jonmoter

This comment has been minimized.

Copy link

jonmoter commented Jan 7, 2018

Sorry to comment on an old dead thread, but I was looking at the commit that disabled swap, and comparing that to the docker docs on swap.

In particular, the Docker docs say:

  • If --memory-swap is set to 0, the setting is ignored, and the value is treated as unset.
  • If --memory-swapis set to the same value as --memory, and --memory is set to a positive integer, the container will not have access to swap. See Prevent a container from using swap.
  • If --memory-swap is unset, and --memory is set, the container can use twice as much swap as the --memory setting, if the host container has swap memory configured. For instance, if --memory="300m" and --memory-swap is not set, the container can use 300m of memory and 600m of swap.

So the way I'm reading the docs, if kubelet sets memory swap to 0, then that will allow the container to have 2x its memory limit in swap. To disable swap, --memory-swap should have the same value as the memory limit we set for the container.

Am I missing something here? Or is this a bug?

@jonmoter

This comment has been minimized.

Copy link

jonmoter commented Jan 8, 2018

FYI, I'm running a 1.6.11 Kubernetes cluster, and confirmed that the pods running in that cluster are being allocated swap, at 2x the amount of memory that their limit is set to. (The below is a container running kube-state-metrics.)

[root@node ~]# docker inspect b77bf3c86acc | grep Memory
            "Memory": 150000000,
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": 300000000,
            "MemorySwappiness": -1,
@embano1

This comment has been minimized.

Copy link

embano1 commented Jan 22, 2018

@jonmoter Recent kubelet versions have a flag to work around this effect, by checking whether swap is turned on.

--fail-swap-on (default: true)
https://kubernetes.io/docs/reference/generated/kubelet/

The kubelet won't start in this case. In your version, this setting was not implemented imho.

@Dominik-K

This comment has been minimized.

Copy link

Dominik-K commented Feb 2, 2018

@jonmoter Even you're right Kubernetes sets MemorySwap = 2 * Memory, the container will be out-of-memory (OOM) killed when Memory will be allocated. You can use my small test image: dominikk/swap-test.

@jonmoter

This comment has been minimized.

Copy link

jonmoter commented Feb 2, 2018

@Dominik-K, the container will be OOMkilled when Memory + Swap is exhausted. So if I have a container where I set the memory limit to 100MB, it will be killed when it exceeds ~300MB of total memory consumption.

I can see this happening in the containers running in my system. And in the stackexchange answer you link in your image, it specifically says (emphasis mine):

The OOM killer should kill this once you are out of RAM and SWAP to give to the program.

My point in bringing all this up was that @munkyboy claimed above that swap was disabled by default in 1.6.0. That's not true. Starting in 1.6, swap is set to be Memory * 2 for each container.

And yes, in recent versions, there's the --fail-swap-on flag. That makes the point moot if you have swap disabled on your host.

k8s-merge-robot added a commit that referenced this issue Feb 27, 2018

Merge pull request #59404 from ohmystack/docker-mem-swap
Automatic merge from submit-queue (batch tested with PRs 50724, 59025, 59710, 59404, 59958). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

dockertools: disable MemorySwap on Linux

In this commit, set `MemorySwap` the same with `Memory` to prevent using swap on Linux.

**What this PR does / why we need it**:

In #39731, @pires tried to disable swap on Linux by setting `MemorySwap` to 0.
However, according to [Docker's docs](https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details), setting `MemorySwap` to 0 is treated as unset, and its [default behavior](https://github.com/moby/moby/blob/v17.05.0-ce/daemon/daemon_unix.go#L266-L269) is to set to twice the size of `Memory`, which can still cause the container to use the swap.

**Which issue(s) this PR fixes** :

This issue was mentioned in this comment: #7294 (comment)

**Special notes for your reviewer**:

1. For the case on Windows, we can still use the 0 because [Windows does not support `MemorySwap`](https://github.com/moby/moby/blob/v17.05.0-ce/daemon/daemon_windows.go#L185-L187).
2. There is another place using the `DefaultMemorySwap()` is for [sandbox](https://github.com/kubernetes/kubernetes/blob/v1.9.2/pkg/kubelet/dockershim/docker_sandbox.go#L505).
Maybe setting the sandbox's `MemorySwap` to 0 is fine. I didn't change that.

**Release note**:

```release-note
dockertools: disable memory swap on Linux.
```
@huggsboson

This comment has been minimized.

Copy link
Contributor

huggsboson commented Jul 24, 2018

@jonmoter @Dominik-K This exact thing happened to us when we noticed that one of our java services was running super slow, and we debugged it to container.resources.limits.memory = 1Gi and the jvm Xms = 3Gi Xmx = 3Gi. So it wasn't getting OOM Killed but 2Gi of address space were swapped.

@vishh

This comment has been minimized.

Copy link
Member

vishh commented Jul 24, 2018

@huggsboson what version of k8s were you using? Is kubelet flag --fail-swap-on set to false?

@huggsboson

This comment has been minimized.

Copy link
Contributor

huggsboson commented Jul 24, 2018

We're still back on 1.6. We plan on disabling swap going forward. Just wanted to document some the bizarre and hard to debug behavior you can see with older versions of kube with swap enabled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment