
Add size limit to tmpfs #63641

Open
lovejoy wants to merge 1 commit into master from lovejoy:addSizeLimitToTmpfs

Conversation

@lovejoy (Contributor) commented May 10, 2018

What this PR does / why we need it:
Add a size option when mounting tmpfs.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #63126

Special notes for your reviewer:
I ran a pod like the one below:

apiVersion: v1
kind: Pod
metadata:
  name: busybox-1
  namespace: default
spec:
  containers:
  - command:
    - sleep
    - "360000"
    image: busybox
    imagePullPolicy: IfNotPresent
    name: busybox
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
      - name: foo
        mountPath: /data/mysql
    resources:
      limits:
        memory: 1229Mi
      requests:
        cpu: 500m
        memory: 1Gi
  volumes:
  - name: foo
    emptyDir:
      sizeLimit: "350Mi"
      medium: "Memory"
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  terminationGracePeriodSeconds: 30

and it now has a size limit:
/ # df -h | grep mysql
tmpfs 350.0M 350.0M 0 100% /data/mysql
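
For context, a minimal sketch (not the exact diff in this PR) of how a declared sizeLimit could be translated into a tmpfs size= mount option. The helper name tmpfsMountOptions is hypothetical; only the resource.Quantity API is real:

    package main

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/api/resource"
    )

    // tmpfsMountOptions is a hypothetical helper illustrating the idea in this
    // PR: translate an emptyDir sizeLimit into a tmpfs "size=" mount option.
    func tmpfsMountOptions(sizeLimit *resource.Quantity) []string {
        var options []string
        if sizeLimit != nil && !sizeLimit.IsZero() {
            // tmpfs takes its maximum size in bytes via the size= option.
            options = append(options, fmt.Sprintf("size=%d", sizeLimit.Value()))
        }
        return options
    }

    func main() {
        limit := resource.MustParse("350Mi")
        fmt.Println(tmpfsMountOptions(&limit)) // prints [size=367001600]
    }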
Release note:

Fixed emptyDir to use the sizeLimit when creating the temporary volumes. Before this fix, the tmpfs volume size was set to half of the available RAM (the default Linux kernel behavior).
@lovejoy (Contributor, Author) commented May 10, 2018

/assign @thockin
cc @jingxu97

@wgliang (Member) commented May 10, 2018

/ok-to-test

@lovejoy (Contributor, Author) commented May 10, 2018

/retest

pkg/volume/empty_dir/empty_dir.go (review comments, now outdated and resolved)
@msau42 (Member) commented May 10, 2018

/assign @jingxu97

@lovejoy force-pushed the lovejoy:addSizeLimitToTmpfs branch 3 times, most recently from 852c9b4 to 8816196 on May 11, 2018
@lovejoy (Contributor, Author) commented May 11, 2018

/retest

1 similar comment
@lovejoy (Contributor, Author) commented May 11, 2018

/retest

@dims (Member) commented May 11, 2018

/lgtm

@dims (Member) commented May 11, 2018

Please add a release note

@lovejoy (Contributor, Author) commented May 11, 2018

@dims I added the release note; maybe you have a better description for it?

@dims (Member) commented May 11, 2018

How about this?

Fixed emptyDir to use the sizeLimit when creating the temporary volumes. Before this fix the tmpfs volume size was set to half of the available RAM (default linux kernel behavior)

@msau42 (Member) commented May 11, 2018

We need to carefully consider and document the behavior implications of this. The emptyDir sizeLimit for non-tmpfs volumes is a soft limit, i.e., pods get evicted rather than the app seeing an IO error. But this change makes it a hard limit for tmpfs. So users are going to see different behavior of sizeLimit depending on the emptyDir type, and that could be confusing.

I think this same difference exists when handling limits for cpu/memory, so it could just be a matter of good documentation.

@lovejoy (Contributor, Author) commented May 14, 2018

@msau42 OK, I will update the emptyDir documentation on the website.

@dims (Member) commented Jan 16, 2019

/priority important-soon

@lovejoy force-pushed the lovejoy:addSizeLimitToTmpfs branch from 0681306 to 9afb75f on Jan 18, 2019
@k8s-ci-robot (Contributor) commented Jan 18, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lovejoy
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: msau42

If they are not already assigned, you can assign the PR to them by writing /assign @msau42 in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@thockin (Member) commented Jan 26, 2019

This is back in my queue. We know that tmpfs charging is not exactly what we want here, but is it workable? Rather, is this any worse than no limit? I don't think it is.

Ergo, I am inclined to approve, but I'd like a nod from @vishh and @msau42 that this is unlikely to explode on us.

@thockin (Member) commented Jan 26, 2019

Actually, I am confused. Docs for EmptyDirVolumeSource say:

    // Total amount of local storage required for this EmptyDir volume.
    // The size limit is also applicable for memory medium.
    // The maximum usage on memory medium EmptyDir would be the minimum value between
    // the SizeLimit specified here and the sum of memory limits of all containers in a pod.
    // The default is nil which means that the limit is undefined.
    // More info: http://kubernetes.io/docs/user-guide/volumes#emptydir
    // +optional
    SizeLimit *resource.Quantity

This was written by @jingxu97 in commit 85f030c, but that commit does not seem to have any implementation for it.

So what is going on???

@jingxu97 (Contributor) commented Jan 26, 2019

@thockin, PR #45686 is the code that implements the sizeLimit for emptyDir. In that design, the sizeLimit parameter set on an emptyDir volume is not used to create the volume with that size. Instead, the eviction manager keeps monitoring the disk space used by the pod's emptyDir volumes and evicts the pod when usage exceeds the limit.
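
For illustration only, a conceptual sketch of that eviction-style check. The function name and shape are assumptions; the real logic lives in the kubelet eviction manager:

    package main

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/api/resource"
    )

    // exceedsEmptyDirLimit is a hypothetical helper: the kubelet periodically
    // measures how many bytes a pod's emptyDir volume uses and evicts the pod
    // once usage crosses the declared sizeLimit (a soft limit, not a mount cap).
    func exceedsEmptyDirLimit(usedBytes int64, sizeLimit *resource.Quantity) bool {
        if sizeLimit == nil || sizeLimit.IsZero() {
            return false // no limit declared, nothing to enforce
        }
        return usedBytes > sizeLimit.Value()
    }

    func main() {
        limit := resource.MustParse("350Mi")
        fmt.Println(exceedsEmptyDirLimit(400<<20, &limit)) // true: pod would be evicted
    }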

@vishh (Member) commented Feb 4, 2019

We preferred soft limits (based on background eviction loops) instead of hard limits (setting the limit on tmpfs to be that emptyDir size) to allow the kubelet to detect and expose situations where the pod is actually running out of space on its emptyDir volume. This lets users (or automation systems like vertical pod autoscaling) bump up limits on demand. If we instead set the limit to sizeLimit, the app would receive an ENOSPC error (which is probably expected from the developer POV), which is often not exposed clearly outside of apps or could be attributed to running out of other resources on the node, like pids for example.
In the case of tmpfs, the memory limit of the pod acts as a safety net.
If hard limits for tmpfs are needed, I'd prefer that being a non-default option controllable via kubelet configuration. I can see it being necessary in scenarios where there is a high risk of DDoS, but I'd prefer keeping the default user experience simpler.
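
To make the trade-off concrete, a small, hypothetical illustration of the hard-limit failure mode as seen from inside the container (plain Go, not Kubernetes code; the path matches the example pod above):

    package main

    import (
        "bytes"
        "errors"
        "fmt"
        "os"
        "syscall"
    )

    func main() {
        // Try to write 400 MiB into a tmpfs capped at 350Mi: with a hard limit
        // the write fails with ENOSPC inside the app; with the soft limit the
        // write succeeds (up to pod memory) and the kubelet evicts the pod later.
        data := bytes.Repeat([]byte("x"), 400<<20)
        err := os.WriteFile("/data/mysql/payload", data, 0o644)
        if errors.Is(err, syscall.ENOSPC) {
            fmt.Println("tmpfs full: the application sees ENOSPC directly")
        }
    }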

@mikkeloscar (Contributor) commented Feb 4, 2019

I can see it being necessary in scenarios where there is a high risk of DDOS, but I'd prefer keeping the default user experience simpler.

@vishh I don't follow your argument for how this is simpler without hard limits. I think it depends on what problem you are trying to solve. Let me share a use case that this change would help me solve.

We have an Elasticsearch cluster where we want to run some of the "nodes" with tmpfs storage for fast read performance. In an environment like AWS EC2 we would simply allocate x% of the instance memory as storage space and the other y% of the instance memory for the Elasticsearch application.
If we want to do this with Kubernetes, we can currently only use 50% of the node's memory for tmpfs, because this is how much is allocated when you define an emptyDir of medium Memory; and if we have a pod that uses less than 50%, we cannot easily prevent the space used for "storage" from overlapping with that of the application.

Additionally, we cannot just allocate a tmpfs manually and mount it e.g. as a hostPath, because of how disk caches are charged against the total memory of the cgroup of a pod/container, as far as I understand. That is, if we allocate say 60% of a node's memory to a tmpfs and the rest to a pod/container, then the container could get OOM-killed when writing to the tmpfs even though the process itself didn't use 40% of memory. At least this is what we have observed.
In this scenario we could of course run the pod/container without memory limits, but this only really works (and not really in practice) if there is nothing else running on that same node; otherwise we might just kill some other random application when starting to use more memory than requested.

Do you have an idea for how to solve this problem, either with this change, or some other way?

@vishh (Member) commented Feb 4, 2019

@mcluseau (Contributor) commented Mar 15, 2019

Isn't a solution to this a new parameter, hardSizeLimit?
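
For illustration, a hypothetical sketch of such a field; it does not exist in the Kubernetes API, and the field name and package are assumptions. It would keep SizeLimit's soft, eviction-based meaning and make the hard tmpfs cap opt-in:

    package v1sketch // hypothetical package name

    import "k8s.io/apimachinery/pkg/api/resource"

    // StorageMedium mirrors the real API type for this sketch.
    type StorageMedium string

    // EmptyDirVolumeSource (hypothetical variant): the HardSizeLimit field is
    // not part of the Kubernetes API; it only shows how a hard cap could be
    // made opt-in while SizeLimit keeps its eviction-based meaning.
    type EmptyDirVolumeSource struct {
        Medium    StorageMedium
        SizeLimit *resource.Quantity
        // HardSizeLimit, if set and Medium is "Memory", would be applied as the
        // tmpfs size= mount option, so writes past it fail with ENOSPC instead
        // of relying on eviction.
        // +optional
        HardSizeLimit *resource.Quantity
    }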

@sftim (Contributor) commented May 25, 2019

In terms of tmpfs emptyDir volumes and memory cgroups, it seems that it's handy to have an option to charge the tmpfs memory to the Pod (or to a cgroup inside the Pod) rather than the container.

Eg: An emptyDir volume in a Pod with 2 containers. Containers A and B have a 512 MiB memory limit.

Container A has a fairly small process that writes 499 MiB of app data to the tmpfs and doesn't OOM whilst doing so. Container B has a process twice the size of container A's that rewrites the file with updated app data. Container B OOMs because writing those pages took ownership of them, and the process plus the app data totals more than 512MiB.

This feels like it'd surprise the developer writing code to run in those containers, who'd expect to be able to set something like:

      requests:
        memory: "4Mi"
      limits:
        memory: "16Mi"

for both containers (and if the Volume were a different type, that'd work fine).

@dims (Member) commented Jul 8, 2019

/uncc

@derekwaynecarr (Member) commented Jul 8, 2019

This had fallen into the back of my queue as well.

If a container writes to an emptyDir and the container restarts, the charge is transferred to the bounding pod cgroup. This makes me wonder if it's better to default the size of the tmpfs to node allocatable. Basically, why treat tmpfs differently? If this were the default behavior, would you still desire the ability to specify a more granular limit?

@perdelt referenced this pull request on Sep 17, 2019
@fejta-bot commented Oct 6, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@mitar (Contributor) commented Oct 6, 2019

/remove-lifecycle stale

@dims (Member) commented Oct 25, 2019

@lovejoy can you please answer the last question from @derekwaynecarr (and address any other feedback as well)?

@xordspar0 commented Oct 28, 2019

this makes me wonder if its better to default the size of the tmpfs to node allocatable.

@derekwaynecarr Could you explain more about what you mean by this? Is "node allocatable" the total space on a node available to all pods? Currently the tmpfs size is half of the node's total memory, correct?
Can you explain what this would do to improve the situation?

@bashimao commented Nov 4, 2019
