
Add size limit to tmpfs #63641

Closed
lovejoy wants to merge 1 commit

Conversation

lovejoy
Contributor

@lovejoy lovejoy commented May 10, 2018

What this PR does / why we need it:
Add a size option when mounting tmpfs.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #63126

Special notes for your reviewer:
I ran a pod like the one below:

apiVersion: v1
kind: Pod
metadata:
  name: busybox-1
  namespace: default
spec:
  containers:
  - command:
    - sleep
    - "360000"
    image: busybox
    imagePullPolicy: IfNotPresent
    name: busybox
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - name: foo
      mountPath: /data/mysql
    resources:
      limits:
        memory: 1229Mi
      requests:
        cpu: 500m
        memory: 1Gi
  volumes:
  - name: foo
    emptyDir:
      sizeLimit: "350Mi"
      medium: "Memory"
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  terminationGracePeriodSeconds: 30

and it has a size limit now:
/ # df -h | grep mysql
tmpfs 350.0M 350.0M 0 100% /data/mysql
Release note:

Fixed emptyDir to use the sizeLimit when creating the temporary volumes. Before this fix the tmpfs volume size was set to half of the available RAM (default linux kernel behavior)

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 10, 2018
@lovejoy
Contributor Author

lovejoy commented May 10, 2018

/assign @thockin
cc @jingxu97

@wgliang
Contributor

wgliang commented May 10, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 10, 2018
@lovejoy
Contributor Author

lovejoy commented May 10, 2018

/retest

@msau42
Member

msau42 commented May 10, 2018

/assign @jingxu97

@lovejoy lovejoy force-pushed the addSizeLimitToTmpfs branch 3 times, most recently from 852c9b4 to 8816196 on May 11, 2018 02:14
@lovejoy
Contributor Author

lovejoy commented May 11, 2018

/retest

1 similar comment
@lovejoy
Contributor Author

lovejoy commented May 11, 2018

/retest

@dims
Member

dims commented May 11, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 11, 2018
@dims
Member

dims commented May 11, 2018

Please add a release note

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 11, 2018
@lovejoy
Contributor Author

lovejoy commented May 11, 2018

@dims I added the release note; maybe you have a better wording for it?

@dims
Member

dims commented May 11, 2018

How about this?

Fixed emptyDir to use the sizeLimit when creating the temporary volumes. Before this fix the tmpfs volume size was set to half of the available RAM (default linux kernel behavior)

@msau42
Member

msau42 commented May 11, 2018

We need to carefully consider and document the behavior implications of this. The emptyDir sizeLimit for non-tmpfs volumes is a soft limit, i.e. pods get evicted rather than the app seeing an IO error. But this change makes it a hard limit for tmpfs. So users are going to see different behavior of sizeLimit depending on the emptyDir type, and that could be confusing.

I think this same difference exists when handling limits for cpu/memory, so it could just be a matter of good documentation.
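
As a rough illustration of that difference (a sketch only, with hypothetical volume names; exact enforcement depends on the kubelet's eviction configuration and, for the memory medium, on this PR): two emptyDir volumes with the same sizeLimit behave differently depending on the medium.

volumes:
- name: scratch-disk      # disk-backed emptyDir: sizeLimit is a soft limit,
  emptyDir:               # enforced by kubelet eviction (the pod gets evicted
    sizeLimit: "350Mi"    # some time after usage exceeds 350Mi)
- name: scratch-mem       # memory-backed emptyDir: with this PR the tmpfs is
  emptyDir:               # mounted with size=350Mi, so writes beyond the limit
    medium: "Memory"      # fail immediately with ENOSPC (a hard limit)
    sizeLimit: "350Mi"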

@lovejoy
Contributor Author

lovejoy commented May 14, 2018

@msau42 OK, I will update the documentation about tmpfs emptyDir on the website.

@jonaz

jonaz commented May 6, 2020

Why is this not merged? We want to run our Jenkins builds in Kubernetes with memory-backed emptyDir volumes but cannot do that until this is merged.

@wesleyw72
Contributor

+1 to the above comments. @dims sorry to tag you specifically, but you're one of the most active thus far. What are the next steps for getting this in?

@dims
Member

dims commented Jun 13, 2020

/assign @xing-yang

@msau42 @xing-yang this PR has been languishing for some time, can you please make a call one way or another?

/milestone v1.19

@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jun 13, 2020
@sftim
Contributor

sftim commented Jun 13, 2020

Suggestion for release note:

Fixed emptyDir to use the sizeLimit when creating temporary volumes. Before this fix, the tmpfs volume size relied on default kernel behavior (Linux: half of the available RAM).

@lovejoy what do you think?

@cvanlabe

cvanlabe commented Jul 7, 2020

Just to confirm, this would not be affected by memory resource limits, right?

It's not entirely clear to me from the conversation above, from current experience, and from the statement "Memory-backed volume sizes are now bounded by cumulative pod memory limits." on https://bugzilla.redhat.com/show_bug.cgi?id=1422049 (OpenShift, I know) whether the memory limit also needs to account for the memory-backed volume.

I.e., some say, or give the impression, that the memory limit kills a pod once the memory-backed volume grows over that limit, while other discussions indicate there is no problem and it is tracked separately.

@scrummyin

Based on section 2.3 (Shared Page Accounting) and section 10 (OOM Control) in https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt, it might depend on your specific situation, but this PR should not change the current behavior around memory resource limits; instead, it allows more control over resource usage. In the bug you cited, this feature is listed as one of the expected results, which I read as a possible solution to it. In theory, a tmpfs size limit could mitigate a container running out of memory: users could set a smaller tmpfs size so that they hit the tmpfs limit before running out of memory.
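
A minimal sketch of that mitigation, assuming the behavior proposed in this PR (container name, image, and sizes are illustrative): keep the emptyDir sizeLimit below the container's memory limit so writes to the volume hit the tmpfs limit before they exhaust the memory cgroup limit.

spec:
  containers:
  - name: app
    image: busybox
    resources:
      limits:
        memory: 512Mi        # cgroup memory limit; tmpfs pages are charged here
    volumeMounts:
    - name: cache
      mountPath: /cache
  volumes:
  - name: cache
    emptyDir:
      medium: "Memory"
      sizeLimit: "256Mi"     # kept below the memory limit, so the volume fills up
                             # (ENOSPC) before its pages can push the container
                             # over its memory limit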

@cvanlabe

cvanlabe commented Jul 8, 2020

Thanks Brian. I don't want to sidetrack this conversation, but I would really like to understand how someone can actually set a guaranteed memory volume size (if this is possible at all). As a practical example, I have a node with 240G of memory, and tmpfs got half of that, so 120G. The problem is that a lot of things use that tmpfs: secrets are stored there, CNIs like Calico use it, and others. I run an application which expects a certain amount of disk space (now a memory filesystem), but we have had a few occurrences where we ran out of space, because apparently we have less than 120G available, though it is hard to say exactly how much.

Is there a way to do the opposite of setting a sizeLimit, more of a sizeRequest? (I can open a new GitHub issue for this if it is too distracting.)

@msau42
Member

msau42 commented Jul 8, 2020

/assign @derekwaynecarr @dashpole
to review. This change has implications for memory accounting, so I think sig-node would be best to drive the direction of this.

@dashpole
Contributor

dashpole commented Jul 8, 2020

I've read through the history, and think there are two separate problems.

  1. The default value for the size limit breaks some applications
  2. We want to add a feature to be able to limit the size of tmpfs

Based on the discussion above, users mostly care about 1, which sounds like a bug. I don't see anyone who actually wants 2, which would require some consideration to get right and which we should implement for all medium types.

I'm in favor of @derekwaynecarr's suggestion above to set the tmpfs limit to allocatable. Alternatively, we could set the tmpfs limit to the pod's memory limit. Both of these seem preferable to the current behavior, and don't have any downsides I can think of.

@2rs2ts
Contributor

2rs2ts commented Jul 8, 2020

we could set the tmpfs limit to the pod's memory limit

This, please. Or perhaps its request. The reason is that the scheduler can then take it into account.

@cvanlabe

cvanlabe commented Jul 9, 2020

For my use case, I would not prefer the pod's memory limit. I actually want to make sure the pod won't consume more than a specific amount of memory before being killed, and I want that to be independent of its (memory-backed) volumes, which may, and are expected to, grow to much larger sizes.

E.g.: a server with 240G. I want 20G for the OS doing its base functionality and 20G for the pod; a healthy run wouldn't consume more than that. The remaining 200G is for a memory-backed fast filesystem.

If it depends on the pod's memory limit, I would have to set that limit to 220G, but if my process got out of control and grew to, say, 50G, my filesystem would no longer be the expected 200G, and a lot more unexpected things could happen.

sizeLimit and sizeRequest would be my preference.

@liggitt liggitt modified the milestones: v1.19, v1.20 Jul 20, 2020
@wangjia184

When can this fix be released?

@JoshBroomberg

@dims any progress on this? We're looking to limit the size of the tmpfs that we're using to increase the size of the shared memory available to pods. We risk pod eviction without this kind of limit.

@matti

matti commented Aug 12, 2020

@JoshBroomberg can you describe how you are using tmpfs in your pods? I think that this thread needs more concrete use cases.

@JoshBroomberg

JoshBroomberg commented Aug 12, 2020

@matti we run user (long-lived) jobs based on programmatically generated pod specs. Some of these user jobs involve things like training DL models using PyTorch multiprocessing, which often requires more than 64MB of shared memory.

Our temporary workaround is to mount a memory-backed emptyDir at /dev/shm (as many others do). The annoying thing about this is that, for all other memory use, we use the pod spec for careful control, i.e. users are able to select, and be limited to, an exact amount of memory. But when we use the tmpfs workaround, we violate this limit: pods can use up to half of the node's memory.

We haven't released this option yet because we are concerned about the side effects of this usage and the resulting UX. We will release it without the limit (because some users badly need it), but we would love the kind of limit created by this PR.
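
For reference, a sketch of the /dev/shm workaround described above (pod, container, and volume names plus the 2Gi size are illustrative); the sizeLimit field is what this PR would actually enforce on the tmpfs mount.

apiVersion: v1
kind: Pod
metadata:
  name: shm-example
spec:
  containers:
  - name: trainer
    image: busybox
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm    # replaces the default 64MB /dev/shm
  volumes:
  - name: dshm
    emptyDir:
      medium: "Memory"
      sizeLimit: "2Gi"       # before this PR the tmpfs mount is still sized at
                             # half of node RAM regardless of sizeLimit; with it,
                             # the mount gets size=2Gi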

@liggitt liggitt removed this from the v1.20 milestone Aug 21, 2020
@derekwaynecarr
Member

derekwaynecarr commented Sep 1, 2020

The proposal from @dashpole is the path I see as safe to move forward.
#63641 (comment)

If all containers in a pod specify a memory limit, we could set it to the pod's limit.
If they do not, we could set it to node allocatable.

It's important to remember that when a container restarts, the charge for writes to that emptyDir transfers back to the pod-level cgroup for memory accounting. As a result, it's best to scope the size to match the pod's bounding cgroup for memory and, if there is none, fall back to the node allocatable bounding cgroup.

@derekwaynecarr
Member

I am putting a PR together that combines a user-specifiable setting with a feature gate to control rollout.
Will link shortly.

@derekwaynecarr
Member

see: #94444

@mitar
Contributor

mitar commented Sep 3, 2020

Shouldn't the issue be closed once the PR is merged?

@cvanlabe

cvanlabe commented Sep 3, 2020

I have a question on this code snippet:

..
sizeLimit = nodeAllocatable.Memory()
..
// volume local size is used if and only if less than what pod could consume
if spec.Volume.EmptyDir.SizeLimit != nil {
	volumeSizeLimit := spec.Volume.EmptyDir.SizeLimit
	if volumeSizeLimit.Cmp(*sizeLimit) < 1 {
		sizeLimit = volumeSizeLimit
	}
}

Not being very familiar with the k8s codebase, do I understand this right: we only look at a potential volume size limit when it's actually less than half of the node's memory?

If that's the case, this is a good addition, but it still doesn't solve the problem I'm facing. What I'm after is being able to allocate more than 50%, but with a granular way of saying that a pod can consume x amount of memory for its processing and y amount for a memory filesystem. More often than not, y will be much larger than x. Or is there simply no way to alter the default behavior of allocating 50% of a node's memory?

Thanks for the PR.. very clean!

@derekwaynecarr
Member

@mitar see the linked PR and KEP.
#94444
kubernetes/enhancements#1968

@mitar
Contributor

mitar commented Sep 4, 2020

Yes, I have seen the PR, but it is not yet merged.

@2rs2ts
Contributor

2rs2ts commented Oct 7, 2020

@mitar this is a PR you're commenting on. The parent issue, #63126, is still open.

Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
priority/important-soon - Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/node - Categorizes an issue or PR as relevant to SIG Node.
sig/storage - Categorizes an issue or PR as relevant to SIG Storage.
size/S - Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

emptyDir with medium: Memory ignores sizeLimit