Add size limit to tmpfs #63641
Conversation
/ok-to-test
/retest
/assign @jingxu97
852c9b4 to 8816196
/retest
/retest
/lgtm
Please add a release note.
@dims Added the release note. Maybe you have a better description for the release note?
How about this?
We need to carefully consider and document the behavior implications of this. The emptyDir sizeLimit for non-tmpfs volumes is a soft limit, i.e., pods get evicted rather than the app seeing an IO error. But this change makes it a hard limit for tmpfs. So users are going to see different behavior of sizeLimit depending on the emptyDir type, and that could be confusing. I think the same difference exists when handling limits for cpu/memory, so it could just be a matter of good documentation.
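For illustration, a hypothetical pod spec of the kind being discussed (names and sizes are made up). With medium: Memory, this change turns sizeLimit into a hard tmpfs size; for a disk-backed emptyDir the same field remains a soft, eviction-based limit:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sizelimit-demo          # hypothetical name
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir:
      medium: Memory            # tmpfs-backed volume
      sizeLimit: 256Mi          # with this change: hard tmpfs size; for disk-backed emptyDir it stays an eviction threshold
```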
@msau42 OK, I will update the emptyDir documentation about tmpfs on the website.
Why is this not merged? We want to run our Jenkins builds in Kubernetes with memory-backed emptyDir but cannot do that until this is merged.
+1 to the above comments. @dims sorry to tag you specifically, but you're one of the most active thus far. What are the next steps for getting this in?
/assign @xing-yang
@msau42 @xing-yang this PR has been languishing for some time, can you please make a call one way or another?
/milestone v1.19
Suggestion for release note:
@lovejoy what do you think?
Just to confirm, this would not be affected by memory resource limits, right? It's not entirely clear to me from the conversations above, current experience, and the statement "Memory-backed volume sizes are now bounded by cumulative pod memory limits." on https://bugzilla.redhat.com/show_bug.cgi?id=1422049 (OpenShift, I know) whether the memory limit also needs to incorporate the memory-backed volume or not. I.e., some say or give the impression that the memory limit kills a pod once the memory-backed volume goes over said limit, while other discussion indicates there's no problem and it is tracked separately.
I think based off of section
Thanks Brian. I don't want to sidetrack this conversation, but I would really like to understand how someone can actually set a guaranteed memory volume size (if this is possible at all). As a practical example, I have a node with 240G of memory, and tmpfs got half of that, so 120G. The problem is that a lot of things use that tmpfs: secrets are stored there, CNIs like Calico use it, and others. I run an application which expects to have a certain amount of disk space, now a memory filesystem, but we have had a few occurrences where we ran out of disk space, because apparently we have less than 120G, though it's hard to say how much exactly. Is there a way to do the opposite of setting a sizeLimit, i.e., more of a sizeRequest? (I can open a new GitHub issue for this if it is too distracting.)
/assign @derekwaynecarr @dashpole
I've read through the history and think there are two separate problems.
Based on the discussion above, users mostly care about the first problem, which sounds like a bug. I don't see anyone who actually wants the second, which would require some consideration to get right, and which we should implement for all medium types. I'm in favor of @derekwaynecarr's suggestion above to set the tmpfs limit to allocatable. Alternatively, we could set the tmpfs limit to the pod's memory limit. Both of these seem preferable to the current behavior, and don't have any downsides I can think of.
This, please. Or perhaps its request. The reason is so that the scheduler can take it into account.
For my use case I do not prefer the pod's memory limit. I actually want to make sure the pod won't consume more than a specific amount of memory before being killed, and I want that to be independent of its (memory-backed) volumes, which may and are expected to grow to much higher sizes. E.g., a server with 240G: I want 20G for the OS doing its base functionality, 20G for the pod (a healthy run wouldn't consume more than that), and 200G for a memory-backed fast filesystem. If it depends on the pod's memory limit, I would have to set it to 220G, but when my process gets out of control and grows to, say, 50G, my filesystem would no longer be my expected 200G, and a lot more unexpected things can happen. sizeLimit and sizeRequest would be my preference.
When can this fix be released?
@dims any progress on this? We're looking to limit the size of the tmpfs that we're using to increase the size of the shared memory available to pods. We risk pod eviction without this kind of limit.
@JoshBroomberg can you describe how you are using tmpfs in your pods? I think that this thread needs more concrete use cases.
@matti we run user (long-living) jobs based on programmatically generated pod specs. Some of these user jobs involve things like training DL models using PyTorch multiprocessing. This often requires more than 64MB of shared memory. Our temporary workaround is to mount a memory-backed emptyDir for this. We haven't released this option yet because we are concerned about the side effects of this usage and the resulting UX. We will release without the limit (because this is badly needed by some), but we would love the kind of limit created by this PR.
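A minimal sketch of that kind of workaround; the mount path, names, and image are illustrative assumptions (the container runtime's default /dev/shm segment is 64MB, which is why a memory-backed emptyDir is used to enlarge it):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-worker          # hypothetical name
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch      # illustrative image
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm       # assumed mount point; replaces the runtime's 64MB default shm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory            # tmpfs-backed, so shared memory can exceed 64MB
      # sizeLimit is what this PR would turn into an effective cap on the tmpfs size
```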
The proposal from @dashpole is the path I see as safe to move forward. If all containers in a pod specify a memory limit, we could set it to the limit of the pod. It's important to remember that when a container restarts, the charge for writes to that emptyDir transfers back to the pod-level cgroup when doing memory accounting. As a result, it's best to just scope the size to match the pod cgroup bounding for memory, and if none, fall back to the node allocatable bounding cgroup.
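A rough illustration of that sizing rule under stated assumptions (names and numbers are made up): every container below has a memory limit, so the memory-backed emptyDir's tmpfs would be sized to the pod-level limit (1.5Gi here); if any container lacked a memory limit, it would fall back to node allocatable.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: limit-derived-tmpfs     # hypothetical name
spec:
  containers:
  - name: app
    image: busybox
    resources:
      limits:
        memory: 1Gi             # counts toward the pod-level memory cgroup limit
    volumeMounts:
    - name: cache
      mountPath: /cache
  - name: sidecar
    image: busybox
    resources:
      limits:
        memory: 512Mi           # pod-level limit = 1Gi + 512Mi = 1.5Gi
  volumes:
  - name: cache
    emptyDir:
      medium: Memory            # tmpfs sized to min(pod memory limit, node allocatable)
```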
I am putting a PR together that combines a user-specifiable setting with a feature gate to control rollout.
see: #94444
Shouldn't the issue be closed once the PR is merged?
I have a question on this code snippet:

```go
sizeLimit = nodeAllocatable.Memory()
...
// volume local size is used if and only if less than what pod could consume
if spec.Volume.EmptyDir.SizeLimit != nil {
	volumeSizeLimit := spec.Volume.EmptyDir.SizeLimit
	if volumeSizeLimit.Cmp(*sizeLimit) < 1 {
		sizeLimit = volumeSizeLimit
	}
}
```

Not being very familiar with the k8s codebase, do I understand this right that we only look at a potential volume size limit when it's actually less than half of the node's memory? If that's the case, this is a good addition, but it still doesn't solve the problem I'm facing. What I'm after is being able to allocate more than 50%, but have a granular way of saying that a pod can consume x amount of memory for its processing and y amount for a memory filesystem. More often than not, y will be much larger than x. Or is there simply no way to alter the default behavior of allocating 50% of a node's memory? Thanks for the PR, very clean!
@mitar see the linked PR and KEP.
Yes, I have seen the PR, but the PR is not yet merged.
What this PR does / why we need it:
Add a size option when mounting tmpfs.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #63126
Special notes for your reviewer:
I ran a pod like the one below, and it has a size limit now:
```
/ # df -h | grep mysql
tmpfs                   350.0M    350.0M         0 100% /data/mysql
```
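For reference, a minimal sketch of the kind of pod spec described above (the names and image are assumptions; the 350Mi limit matches the df output):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mysql-tmpfs-demo        # hypothetical name
spec:
  containers:
  - name: mysql
    image: mysql                # illustrative image
    volumeMounts:
    - name: data
      mountPath: /data/mysql
  volumes:
  - name: data
    emptyDir:
      medium: Memory            # tmpfs-backed
      sizeLimit: 350Mi          # with this PR, the tmpfs is mounted with a matching size option
```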
Release note: