
Add size limit to tmpfs #63641

Closed
lovejoy wants to merge 1 commit

Conversation

lovejoy
Contributor

@lovejoy lovejoy commented May 10, 2018

What this PR does / why we need it:
Add a size option when mounting tmpfs.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #63126

Special notes for your reviewer:
I ran a pod like the one below:

apiVersion: v1
kind: Pod
metadata:
  name: busybox-1
  namespace: default
spec:
  containers:
  - command:
    - sleep
    - "360000"
    image: busybox
    imagePullPolicy: IfNotPresent
    name: busybox
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - name: foo
      mountPath: /data/mysql
    resources:
      limits:
        memory: 1229Mi
      requests:
        cpu: 500m
        memory: 1Gi
  volumes:
  - name: foo
    emptyDir:
      sizeLimit: "350Mi"
      medium: "Memory"
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  terminationGracePeriodSeconds: 30

and it has a size limit now:
/ # df -h | grep mysql
tmpfs 350.0M 350.0M 0 100% /data/mysql
Release note:

Fixed emptyDir to use the sizeLimit when creating the temporary volumes. Before this fix the tmpfs volume size was set to half of the available RAM (default linux kernel behavior)

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 10, 2018
@lovejoy
Contributor Author

lovejoy commented May 10, 2018

/assign @thockin
cc @jingxu97

@wgliang
Contributor

wgliang commented May 10, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 10, 2018
@lovejoy
Contributor Author

lovejoy commented May 10, 2018

/retest

@msau42
Member

msau42 commented May 10, 2018

/assign @jingxu97

@lovejoy lovejoy force-pushed the addSizeLimitToTmpfs branch 3 times, most recently from 852c9b4 to 8816196 on May 11, 2018 02:14
@lovejoy
Contributor Author

lovejoy commented May 11, 2018

/retest

1 similar comment
@lovejoy
Contributor Author

lovejoy commented May 11, 2018

/retest

@dims
Member

dims commented May 11, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 11, 2018
@dims
Member

dims commented May 11, 2018

Please add a release note

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 11, 2018
@lovejoy
Contributor Author

lovejoy commented May 11, 2018

@dims I added the release note; maybe you have a better wording for it?

@dims
Member

dims commented May 11, 2018

How about this?

Fixed emptyDir to use the sizeLimit when creating the temporary volumes. Before this fix the tmpfs volume size was set to half of the available RAM (default linux kernel behavior)

@msau42
Member

msau42 commented May 11, 2018

We need to carefully consider and document the behavior implications of this. The emptyDir sizeLimit for non-tmpfs volumes is a soft limit, i.e. pods get evicted rather than the app seeing an IO error. But this change makes it a hard limit for tmpfs. So users are going to see different behavior of sizeLimit depending on the emptyDir type, and that could be confusing.

I think this same difference exists when handling limits for cpu/memory, so it could just be a matter of good documentation.
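
As a rough illustration of that difference (a sketch only, with hypothetical volume names; exact enforcement depends on the kubelet's eviction configuration and, for the memory medium, on this PR): two emptyDir volumes with the same sizeLimit behave differently depending on the medium.

volumes:
- name: scratch-disk      # disk-backed emptyDir: sizeLimit is a soft limit,
  emptyDir:               # enforced by kubelet eviction (the pod gets evicted
    sizeLimit: "350Mi"    # some time after usage exceeds 350Mi)
- name: scratch-mem       # memory-backed emptyDir: with this PR the tmpfs is
  emptyDir:               # mounted with size=350Mi, so writes beyond the limit
    medium: "Memory"      # fail immediately with ENOSPC (a hard limit)
    sizeLimit: "350Mi"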

@lovejoy
Contributor Author

lovejoy commented May 14, 2018

@msau42 OK, I will update the documentation about tmpfs emptyDir on the website.

@jonaz

jonaz commented May 6, 2020

Why is this not merged? We want to run our Jenkins builds in Kubernetes with memory-backed emptyDir volumes but cannot do that until this is merged.

@wesleyw72
Contributor

+1 to the above comments. @dims sorry to tag you specifically, but you're one of the most active thus far. What are the next steps for getting this in?

@dims
Member

dims commented Jun 13, 2020

/assign @xing-yang

@msau42 @xing-yang this PR has been languishing for some time, can you please make a call one way or another?

/milestone v1.19

@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jun 13, 2020
@sftim
Contributor

sftim commented Jun 13, 2020

Suggestion for release note:

Fixed emptyDir to use the sizeLimit when creating temporary volumes. Before this fix, the tmpfs volume size relied on default kernel behavior (Linux: half of the available RAM).

@lovejoy what do you think?

@cvanlabe

cvanlabe commented Jul 7, 2020

Just to confirm, this would not be affected by memory resource limits, right?

It's not entirely clear to me from the conversation above, from current experience, and from the statement "Memory-backed volume sizes are now bounded by cumulative pod memory limits." on https://bugzilla.redhat.com/show_bug.cgi?id=1422049 (OpenShift, I know) whether the memory limit also needs to account for the memory-backed volume.

I.e., some say, or give the impression, that the memory limit kills a pod once the memory-backed volume grows over that limit, while other discussions indicate there is no problem and it is tracked separately.

@scrummyin

Based on section 2.3 (Shared Page Accounting) and section 10 (OOM Control) in https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt, it might depend on your specific situation, but this PR should not change the current behavior around memory resource limits; instead, it allows more control over resource usage. In the bug you cited, this feature is listed as one of the expected results, which I read as a possible solution to it. In theory, a tmpfs size limit could mitigate a container running out of memory: users could set a smaller tmpfs size so that they hit the tmpfs limit before running out of memory.
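
A minimal sketch of that mitigation, assuming the behavior proposed in this PR (container name, image, and sizes are illustrative): keep the emptyDir sizeLimit below the container's memory limit so writes to the volume hit the tmpfs limit before they exhaust the memory cgroup limit.

spec:
  containers:
  - name: app
    image: busybox
    resources:
      limits:
        memory: 512Mi        # cgroup memory limit; tmpfs pages are charged here
    volumeMounts:
    - name: cache
      mountPath: /cache
  volumes:
  - name: cache
    emptyDir:
      medium: "Memory"
      sizeLimit: "256Mi"     # kept below the memory limit, so the volume fills up
                             # (ENOSPC) before its pages can push the container
                             # over its memory limit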

@cvanlabe

cvanlabe commented Jul 8, 2020

Thanks Brian. I don't want to sidetrack this conversation, but I would really like to understand how someone can actually set a guaranteed memory volume size (if this is possible at all). As a practical example, I have a node with 240G of memory, and tmpfs got half of that, so 120G. The problem is that a lot of things use that tmpfs: secrets are stored there, CNIs like Calico use it, and others. I run an application which expects a certain amount of disk space (now a memory filesystem), but we have had a few occurrences where we ran out of space, because apparently we have less than 120G available, though it is hard to say exactly how much.

Is there a way to do the opposite of setting a sizeLimit, more of a sizeRequest? (I can open a new GitHub issue for this if it is too distracting.)

@msau42
Member

msau42 commented Jul 8, 2020

/assign @derekwaynecarr @dashpole
to review. This change has implications for memory accounting, so I think sig-node would be best to drive the direction of this.

@dashpole
Contributor

dashpole commented Jul 8, 2020

I've read through the history, and think there are two separate problems.

  1. The default value for the size limit breaks some applications
  2. We want to add a feature to be able to limit the size of tmpfs

Based on the discussion above, users mostly care about 1, which sounds like a bug. I don't see anyone who actually wants 2, which would require some consideration to get right and which we should implement for all medium types.

I'm in favor of @derekwaynecarr's suggestion above to set the tmpfs limit to allocatable. Alternatively, we could set the tmpfs limit to the pod's memory limit. Both of these seem preferable to the current behavior, and don't have any downsides I can think of.

@2rs2ts
Contributor

2rs2ts commented Jul 8, 2020

we could set the tmpfs limit to the pod's memory limit

This, please. Or perhaps its request. The reason is that the scheduler can then take it into account.

@cvanlabe

cvanlabe commented Jul 9, 2020

For my use case, I would not prefer the pod's memory limit. I actually want to make sure the pod won't consume more than a specific amount of memory before being killed, and I want that to be independent of its (memory-backed) volumes, which may, and are expected to, grow to much larger sizes.

E.g.: a server with 240G. I want 20G for the OS doing its base functionality and 20G for the pod; a healthy run wouldn't consume more than that. The remaining 200G is for a memory-backed fast filesystem.

If it depends on the pod's memory limit, I would have to set that limit to 220G, but if my process got out of control and grew to, say, 50G, my filesystem would no longer be the expected 200G, and a lot more unexpected things could happen.

sizeLimit and sizeRequest would be my preference.

@liggitt liggitt modified the milestones: v1.19, v1.20 Jul 20, 2020
@wangjia184

When can this fix be released?

@JoshBroomberg

@dims any progress on this? We're looking to limit the size of the tmpfs that we're using to increase the size of the shared memory available to pods. We risk pod eviction without this kind of limit.

@matti

matti commented Aug 12, 2020

@JoshBroomberg can you describe how you are using tmpfs in your pods? I think that this thread needs more concrete use cases.

@JoshBroomberg

JoshBroomberg commented Aug 12, 2020

@matti we run user (long-lived) jobs based on programmatically generated pod specs. Some of these user jobs involve things like training DL models using PyTorch multiprocessing, which often requires more than 64MB of shared memory.

Our temporary workaround is to mount a memory-backed emptyDir at /dev/shm (as many others do). The annoying thing about this is that, for all other memory use, we use the pod spec for careful control, i.e. users are able to select, and be limited to, an exact amount of memory. But when we use the tmpfs workaround, we violate this limit: pods can use up to half of the node's memory.

We haven't released this option yet because we are concerned about the side effects of this usage and the resulting UX. We will release it without the limit (because some users badly need it), but we would love the kind of limit created by this PR.
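
For reference, a sketch of the /dev/shm workaround described above (pod, container, and volume names plus the 2Gi size are illustrative); the sizeLimit field is what this PR would actually enforce on the tmpfs mount.

apiVersion: v1
kind: Pod
metadata:
  name: shm-example
spec:
  containers:
  - name: trainer
    image: busybox
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm    # replaces the default 64MB /dev/shm
  volumes:
  - name: dshm
    emptyDir:
      medium: "Memory"
      sizeLimit: "2Gi"       # before this PR the tmpfs mount is still sized at
                             # half of node RAM regardless of sizeLimit; with it,
                             # the mount gets size=2Gi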

@liggitt liggitt removed this from the v1.20 milestone Aug 21, 2020
@derekwaynecarr
Member

derekwaynecarr commented Sep 1, 2020

The proposal from @dashpole is the path I see as safe to move forward.
#63641 (comment)

If all containers in a pod specify a memory limit, we could set it to the pod's limit.
If they do not, we could set it to node allocatable.

It's important to remember that when a container restarts, the charge for writes to that emptyDir transfers back to the pod-level cgroup for memory accounting. As a result, it's best to scope the size to match the pod's bounding cgroup for memory and, if there is none, fall back to the node allocatable bounding cgroup.

@derekwaynecarr
Member

I am putting a PR together that combines a user-specifiable setting with a feature gate to control rollout.
Will link shortly.

@derekwaynecarr
Member

see: #94444

@mitar
Contributor

mitar commented Sep 3, 2020

Shouldn't the issue be closed once the PR is merged?

@cvanlabe

cvanlabe commented Sep 3, 2020

I have a question on this code snippet:

..
sizeLimit = nodeAllocatable.Memory()
..
// volume local size is used if and only if less than what pod could consume
if spec.Volume.EmptyDir.SizeLimit != nil {
	volumeSizeLimit := spec.Volume.EmptyDir.SizeLimit
	if volumeSizeLimit.Cmp(*sizeLimit) < 1 {
		sizeLimit = volumeSizeLimit
	}
}

Not being very familiar with the k8s codebase, do I understand this right: we only look at a potential volume size limit when it's actually less than half of the node's memory?

If that's the case, this is a good addition, but it still doesn't solve the problem I'm facing. What I'm after is being able to allocate more than 50%, but with a granular way of saying that a pod can consume x amount of memory for its processing and y amount for a memory filesystem. More often than not, y will be much larger than x. Or is there simply no way to alter the default behavior of allocating 50% of a node's memory?

Thanks for the PR.. very clean!

@derekwaynecarr
Member

@mitar see the linked PR and KEP.
#94444
kubernetes/enhancements#1968

@mitar
Contributor

mitar commented Sep 4, 2020

Yes, I have seen the PR, but it is not yet merged.

@2rs2ts
Contributor

2rs2ts commented Oct 7, 2020

@mitar this is a PR you're commenting on. The parent issue, #63126, is still open.

Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
priority/important-soon - Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/node - Categorizes an issue or PR as relevant to SIG Node.
sig/storage - Categorizes an issue or PR as relevant to SIG Storage.
size/S - Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

emptyDir with medium: Memory ignores sizeLimit