Kubelet evictions - what's remaining? #31362

Closed
vishh opened this Issue Aug 24, 2016 · 35 comments

@vishh
Member

vishh commented Aug 24, 2016

Kubelet now supports evicting pods & images to free up memory and disk space whenever it notices memory or disk pressure. This feature is great since it helps keep nodes running. But there are still a few glitches that make it not ready for production. This issue attempts to expose those glitches and discuss possible solutions for them.

  • Kubelet pod evictions are not checkpointed. Kubelet restarts might forget evicted pods. We need to checkpoint the evicted phase to the API server before performing the actual eviction.
  • Kubelet eviction triggers are slow. It takes at least 10 seconds for kubelet to detect resource usage changes; in the case of disk usage, it can be as long as 70 seconds. This means low eviction thresholds do not work reliably in production. For example, memory.available<100Mi doesn't prevent system OOMs ~40% of the time. The issue is even more serious for disk.
    Proposed Solutions
    • Add an on-demand stats interface to cAdvisor that collects & provides the most recent stats on-demand - google/cadvisor#1247 is related
    • Set up memcg thresholds that trigger events as soon as the eviction thresholds are met (see the sketch after this list).
  • Kubelet evicts more pods than needed whenever it notices memory or disk pressure.
    • Memory - the 10 second update interval could result in multiple evictions. Kubelet cleans up volumes asynchronously after evicting pods, so memory held by in-memory volumes is not released right away and additional pods get deleted. Possible solutions here include evicting volumes before marking a pod as "evicted" and running eviction control loops more often (1s?) whenever there is pressure.
    • Disk - The issue described above for memory applies to disk too. In addition, the disk storage layout has not been spec'ed yet - #30799
      Disk inodes are not tracked per-pod. This could result in evicting multiple pods until the outlier is found.
      Logs are not preserved for evicted pods. This might be critical for certain applications.
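
As a rough illustration of the memcg-threshold idea above: on cgroup v1, a usage threshold can be registered through the eventfd-based notification API so the kernel wakes the watcher the moment the threshold is crossed, instead of waiting for the next stats poll. This is only a sketch; the cgroup path, threshold arithmetic, and error handling are placeholders, not the kubelet's actual implementation.

```go
// Minimal sketch: register an eventfd that fires when memory usage in a
// cgroup v1 memory controller crosses a threshold (illustrative values only).
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func watchMemoryThreshold(cgroupPath string, thresholdBytes int64) error {
	efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		return err
	}
	usage, err := os.Open(filepath.Join(cgroupPath, "memory.usage_in_bytes"))
	if err != nil {
		return err
	}
	defer usage.Close()

	// The control line "<event_fd> <usage_fd> <threshold>" registers the notification.
	control := fmt.Sprintf("%d %d %d", efd, usage.Fd(), thresholdBytes)
	if err := os.WriteFile(filepath.Join(cgroupPath, "cgroup.event_control"), []byte(control), 0o600); err != nil {
		return err
	}

	buf := make([]byte, 8)
	if _, err := unix.Read(efd, buf); err != nil { // blocks until the threshold is crossed
		return err
	}
	fmt.Println("memory usage crossed the threshold; run the eviction sync loop now")
	return nil
}

func main() {
	// To approximate memory.available<100Mi on a hypothetical 4Gi node, the
	// usage threshold would be roughly capacity minus 100Mi.
	if err := watchMemoryThreshold("/sys/fs/cgroup/memory", (4<<30)-(100<<20)); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```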

cc @kubernetes/sig-node @derekwaynecarr @sjenning @ronnielai

@vishh vishh added this to the v1.5 milestone Aug 24, 2016

@derekwaynecarr

Member

derekwaynecarr commented Aug 24, 2016

few glitches with this feature that do not make it ready for production

I disagree with the above statement. I would strongly recommend users enable this feature in production, but I also agree that setting an eviction threshold at memory.available<100Mi is skating on thin ice with the current support. I know from our conversation that @vishh feels similarly, but I want to clarify the statement for others who read this issue later. Many operators of production workloads target 70% utilization and run pods with requests and limits both specified within a well-defined level of over-commit. Those users can derive significant value from the existing support, and we should encourage them to enable the feature to reduce their risk of OOM, even if it does not completely eliminate it yet.

That said, I agree we should focus on the following areas:

Improve the latency in observing when eviction thresholds are met

For disk, we will need to continue to poll cAdvisor, and since image garbage collection is unified with disk management, we should not encourage users to run so close to total out-of-disk scenarios.

For memory, we should integrate with the memcg notification API in addition to polling cAdvisor. For each soft/hard memory threshold, we should register a notification to get a callback. This gives a much more precise mechanism for enforcing the grace period for soft evictions and vastly reduces the latency in observing memory pressure. Since observing memory pressure is then disconnected from knowing how much memory each pod/container is consuming, we also need to improve cAdvisor stat collection so we make the right eviction decision in response. There is little value in knowing there is a problem if we then respond by selecting the wrong pod.

On-demand stats when ranking pods for eviction

cAdvisor should support on-demand stat collection, and the eviction manager should request on-demand usage stats prior to making a pod eviction decision. If we are unable to complete on-demand stats in the 1.5 time frame, we should at minimum expose a global timestamp to prevent excessive pod deletion caused by the eviction manager looking at the same usage stat values in consecutive sync passes.
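
A minimal sketch of the timestamp fallback mentioned above, using hypothetical names (statsProvider and nodeSummary are stand-ins, not the real cAdvisor or eviction-manager types): the sync pass refuses to act on stats that were collected before the previous eviction completed, so consecutive passes cannot evict twice off the same stale observation.

```go
package eviction

import "time"

// statsProvider and nodeSummary are hypothetical stand-ins for the real
// cAdvisor / summary API types.
type nodeSummary struct {
	MemoryAvailableBytes int64
}

type statsProvider interface {
	// LatestStats returns the most recent usage summary and when it was collected.
	LatestStats() (nodeSummary, time.Time)
}

type manager struct {
	stats        statsProvider
	lastEviction time.Time
}

// syncPass evicts at most one pod, and only if the stats are fresher than the
// previous eviction, so the effect of that eviction is already reflected.
func (m *manager) syncPass(memThresholdBytes int64, evictOnePod func()) {
	summary, collectedAt := m.stats.LatestStats()
	if !collectedAt.After(m.lastEviction) {
		return // stale stats: they predate the last eviction we performed
	}
	if summary.MemoryAvailableBytes < memThresholdBytes {
		evictOnePod()
		m.lastEviction = time.Now()
	}
}
```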

Pod evictions should block on volume clean-up

If disk is scarce, pod evictions should block on per-pod volume clean-up rather than have it happen asynchronously. For memory, we may not want to block since the node is already under pressure.
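
A sketch of the ordering proposed here, assuming two hypothetical hooks (killPod and volumesCleanedUp are not real kubelet APIs): under disk pressure the eviction blocks until the pod's volumes are actually gone, so the next sync pass sees the reclaimed space instead of evicting another pod.

```go
package eviction

import (
	"context"
	"fmt"
	"time"
)

// evictAndReclaim evicts one pod and, when the pressure is on disk, waits for
// the pod's volumes to be cleaned up before returning control to the sync loop.
func evictAndReclaim(ctx context.Context, pod string, diskPressure bool,
	killPod func(string) error, volumesCleanedUp func(string) bool) error {

	if err := killPod(pod); err != nil {
		return err
	}
	if !diskPressure {
		// Under memory pressure, blocking here would only delay relief.
		return nil
	}
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	timeout := time.After(2 * time.Minute)
	for {
		select {
		case <-ticker.C:
			if volumesCleanedUp(pod) {
				return nil // disk space from the pod's volumes is now reclaimed
			}
		case <-timeout:
			return fmt.Errorf("timed out waiting for volume cleanup of pod %s", pod)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```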

Inode reclamation

Right now, we do not have per-container usage stats for inodes, which means knowing which pod to reclaim in the face of inode exhaustion is difficult. In practice, though, inode exhaustion would also trigger image and container gc, and we currently have no visibility into the number of inodes we reclaim in response to those actions. Rather than prioritizing per-container stats for inode usage, we would prefer to prioritize on-demand stat collection in cAdvisor so we do not err on the side of an unnecessary pod eviction after image/container gc.
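
Purely for illustration, assuming per-container inode stats existed (the podInodeUsage type below is hypothetical and was not available at the time of this discussion): ranking candidates by inode consumption would let the eviction manager hit the outlier first instead of evicting several pods before reaching it.

```go
package eviction

import "sort"

// podInodeUsage is a hypothetical per-pod stat, not something kubelet reported
// at the time this issue was filed.
type podInodeUsage struct {
	Name       string
	InodesUsed uint64
}

// rankByInodeUsage orders pods so the heaviest inode consumer is considered first.
// A real ranking would also weigh QoS class and requests, omitted here.
func rankByInodeUsage(pods []podInodeUsage) []podInodeUsage {
	sort.Slice(pods, func(i, j int) bool {
		return pods[i].InodesUsed > pods[j].InodesUsed
	})
	return pods
}
```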

@sjenning

Contributor

sjenning commented Aug 25, 2016

Doing some research on how to get better (i.e. low latency) notification of memory pressure. There is a memcg attribute memory.pressure_level which will give the following notifications when the cgroup is under pressure:

  • low: system is reclaiming "inactive" memory for new allocations
  • medium: system is reclaiming "active" memory which might include swap activity
  • critical: the amount of reclaimable memory is very low and the cgroup is thrashing. OOM is imminent.

The medium-level notification on the root cgroup might be a great indicator of when pods should start being evicted. It seems that is the point at which the overall performance is beginning to suffer due to reclamation of active cache memory.

I'll have to see how easily this is triggered though. Also, it isn't clear whether the notification is delivered again if we resolve the situation and it happens again later.
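
For reference, subscribing to these memory.pressure_level notifications uses the same cgroup v1 eventfd registration as a usage threshold; only the line written to cgroup.event_control differs. A minimal sketch, with the root memory cgroup path and the "medium" level as placeholders:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	const memCgroup = "/sys/fs/cgroup/memory"

	efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		panic(err)
	}
	pressure, err := os.Open(memCgroup + "/memory.pressure_level")
	if err != nil {
		panic(err)
	}
	defer pressure.Close()

	// "<event_fd> <pressure_fd> <level>" where level is low, medium, or critical.
	control := fmt.Sprintf("%d %d medium", efd, pressure.Fd())
	if err := os.WriteFile(memCgroup+"/cgroup.event_control", []byte(control), 0o600); err != nil {
		panic(err)
	}

	buf := make([]byte, 8)
	if _, err := unix.Read(efd, buf); err != nil { // blocks until medium pressure is reported
		panic(err)
	}
	fmt.Println("kernel reported medium memory pressure")
}
```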

@vishh

Member Author

vishh commented Aug 25, 2016

I suspect that by the time kubelet reacts to medium pressure, there could be one or more system OOM kills. If we were to run a dedicated OOM daemon, then I'd be less concerned about using these notifications. For now, I prefer using memory thresholds instead. WDYT?


@sjenning

Contributor

sjenning commented Aug 25, 2016

Yes, I've been doing some tests and, especially for systems with no swap (or memory.memsw.limit_in_bytes equal to memory.limit_in_bytes), there doesn't seem to be a notification level that is useful 😦. low is too easily triggered (it happens on any cache reclamation), and medium and critical happen too late to be of any use. If you have swap, then medium is somewhat useful, but that can't be assumed.

memory.pressure_level is a dead end.

@vishh

Member Author

vishh commented Aug 25, 2016

We recommend disabling swap since it results in unpredictable application performance in a cluster scenario.


@ravilr

Contributor

ravilr commented Oct 5, 2016

@derekwaynecarr @vishh in regards to imagefs garbage collection: if I want to set a lower threshold of, say, '<50%', would the node status condition be updated to 'under disk pressure', thereby rejecting new pods? I think disk cleanup configured at a lower threshold shouldn't lead to the scheduler treating the node as unschedulable or to the kubelet rejecting pods at admission.
What I'm saying is that people would want the kubelet to kick in garbage collection well before any pod eviction thresholds. Is that possible now?

@vishh

Member Author

vishh commented Oct 5, 2016

@ravilr IIUC, you intend to proactively clean up images to keep disk usage low. This is not currently supported. FYI, kubelet will attempt to clean up images before attempting to evict pods as of now.
I'm curious to know why you want to clean up images proactively: image downloads cause high pod startup tail latency, and the scheduler is designed to understand image locality.

@ravilr

Contributor

ravilr commented Oct 5, 2016

My thinking is around a scenario I would like to avoid, where most of the nodes in the cluster reach the disk-based eviction threshold at the same time and become unschedulable until usage falls below the threshold. I understand that existing pods wouldn't be evicted before running the cleanup, but new pods wouldn't be admitted for the eviction pressure transition period.
Also, certain types of disk may degrade in performance at fuller capacity, so one would want to avoid filling them near-to-full if possible.
Also, in a continuous deployment world, where every application change results in a new image getting deployed, many of the images on disk are older versions and there is not much incentive to keep them lying around.

@vishh

Member Author

vishh commented Oct 5, 2016

If images are the main issue, then by setting a lower hard threshold (~70% usage) and a lower eviction pressure transition period (~30s), the nodes can be made to evict images often and not stay unschedulable for too long. This will lower the effective disk utilization to 70%, though.

One possibility is to extend soft evictions to start performing image and log garbage collection as soon as the threshold is met, and evict pods only after the soft eviction grace period has passed. We could also provide an option to disable pod evictions via the soft threshold by setting the soft eviction grace period to -1 or some such special value. @derekwaynecarr WDYT?
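
For concreteness, thresholds and the transition period discussed above map onto the kubelet's eviction flags; the values below are illustrative only (a hard threshold at ~70% imagefs usage is expressed as imagefs.available<30%), not a recommendation:

```
kubelet \
  --eviction-hard=imagefs.available<30%,memory.available<100Mi \
  --eviction-soft=imagefs.available<40% \
  --eviction-soft-grace-period=imagefs.available=2m \
  --eviction-pressure-transition-period=30s
```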

@derekwaynecarr

Member

derekwaynecarr commented Oct 5, 2016

@ravilr - I am curious what storage driver you are using?


@ravilr

Contributor

ravilr commented Oct 6, 2016

@derekwaynecarr - using devicemapper driver.

@derekwaynecarr

Member

derekwaynecarr commented Oct 6, 2016

@ravilr - do you use the autoextend features? If so, with what percent? Also, how do you configure dm.min_free_space?

@vishh - I am not averse to doing node-level reclaims the moment a soft threshold is passed, but I need to reason through some edge cases.


@ravilr

Contributor

ravilr commented Oct 6, 2016

@derekwaynecarr - autoextend is turned off, with the thinpool DATA_SIZE set to 90%FREE. dm.min_free_space is also not set; I guess the default is 10%. Do you suggest configuring it differently, and why? Thanks.

@dims

Member

dims commented Nov 16, 2016

This needs to be triaged as a release-blocker or not for 1.5 @vishh @derekwaynecarr

@vishh vishh modified the milestones: v1.6, v1.5 Nov 16, 2016

@davidopp

Member

davidopp commented Nov 20, 2016

Eviction for overloaded CPU would probably be good

@vishh

Member Author

vishh commented Nov 30, 2016

@davidopp Why do you think so? CPU is compressible, right?

@davidopp

Member

davidopp commented Nov 30, 2016

To spread out the load if best effort pods are getting starved. I guess rescheduler could do it instead...

@derekwaynecarr

Member

derekwaynecarr commented Dec 1, 2016

@davidopp - we had discussed doing eviction on CPU load, but I also think the rescheduler could take on that responsibility... we can discuss more in the resource management group.

@vishh vishh assigned dashpole and unassigned rkouj and dashpole Feb 28, 2017

@dashpole

Contributor

dashpole commented Feb 28, 2017

I have a few issues regarding evicting extra pods:
#41347
#40474

@vishh

Member Author

vishh commented Mar 10, 2017

@dashpole can you work on a list of work items that are pending based on this issue?

@dashpole

Contributor

dashpole commented Mar 10, 2017

Current state of kubelet evictions:

  • Memory Evictions: With fairly well-behaved memory-consuming pods (allocating 12Mi of memory every 10 seconds), there is no evidence of extra pod evictions. Once other, more pressing issues have been dealt with, it may be worth testing with more aggressive memory-consuming pods. Enabling memcg thresholds should allow the kubelet to handle these more aggressive pods without extra evictions.
  • Disk Evictions: Stats collection is slow, and the stats used for evictions are frequently stale. This is the main blocker to reducing extra evictions.

My working items for 1.7 with links to issues:

Other items mentioned here that require more discussion, and an owner:

  • CPU based evictions #42910
  • Eviction Checkpointing
  • Blocking evictions on volume cleanup
@dashpole

Contributor

dashpole commented Mar 15, 2017

Another working item for 1.7:
#43166

@dashpole

Contributor

dashpole commented Apr 3, 2017

@dashpole

Contributor

dashpole commented May 23, 2017

Update for 1.7 items:
On Demand Stats: Punted to next release
Reenable memcg on GCI: On track to finish by 1.7
Optimize container disk stats collection: More complicated than we initially thought.
Metrics on age of status used for evictions: Complete, metrics are exposed, and logged during eviction testing.
Metrics for high-latency stats collection: Punted to next release
Fix or Mitigate Pod Eviction Race Condition: Complete
Eviction waits for pod cleanup: #43166 Complete
Garbage Collection of dead containers under disk pressure: #45896 Complete
Set default thresholds for disk eviction: Complete

@jayunit100

Member

jayunit100 commented May 23, 2017

(Deleted earlier comment; wrong thread :))

@dchen1107

Member

dchen1107 commented Jun 9, 2017

We made a lot of improvements to kubelet eviction for the 1.7 release; re-enabling memcg notification on GCI is tracked by a separate issue.

I am moving this umbrella issue to 1.8 for further improvements.

@dchen1107 dchen1107 modified the milestones: v1.7, v1.8 Jun 9, 2017

k8s-merge-robot added a commit that referenced this issue Jun 13, 2017

Merge pull request #46441 from dashpole/eviction_time
Automatic merge from submit-queue

Shorten eviction tests, and increase test suite timeout

After #43590, the eviction manager is less aggressive when evicting pods.  Because of that, many runs in the flaky suite time out.
To shorten the inode eviction test, I have lowered the eviction threshold.
To shorten the allocatable eviction test, I now set KubeReserved = NodeMemoryCapacity - 200Mb, so that any pod using 200Mb will be evicted. This shortens the test from 40 minutes to 10 minutes.
While this should be enough to not hit the flaky suite timeout anymore, it is better to keep lower individual test timeouts than a lower suite timeout, since hitting the suite timeout means that even successful test runs are not reported.

/assign @Random-Liu @mtaufen 

issue: #31362
@k8s-merge-robot

Contributor

k8s-merge-robot commented Sep 9, 2017

[MILESTONENOTIFIER] Milestone Removed

@dashpole @vishh

Important:
This issue was missing labels required for the v1.8 milestone for more than 7 days:

kind: Must specify exactly one of [kind/bug, kind/cleanup, kind/feature].

Removing it from the milestone.

Additional instructions are available here. The commands available for adding these labels are documented here.

@k8s-merge-robot k8s-merge-robot removed this from the v1.8 milestone Sep 9, 2017

@fejta-bot


fejta-bot commented Jan 5, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@dashpole

Contributor

dashpole commented Jan 5, 2018

There are a couple items I still would like to tackle related to evictions in the 1.10 - 1.11 timeframe:
Enable On-Demand metrics: #56112
Fix Implementation of Memory Cgroup notifications: #51745, kubernetes/community#1451
Cleanup eviction signals: #53902
Prevent evictions after successfully performing garbage collection: #56573
Monitor allocatable memory usage more directly: #55638
Use Memory Cgroup notifications for allocatable evictions: #57901

/remove-lifecycle stale

@fejta-bot


fejta-bot commented Apr 5, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@fejta-bot


fejta-bot commented May 5, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@dashpole

Contributor

dashpole commented May 8, 2018

The remaining items, to be finished in 1.11 are:
Use Memory Cgroup notifications for allocatable evictions: #57901
Better kubelet eviction events: #63415
Eviction tests ensure Status.Reason is Evicted so we can catch OOMs: #57849

@dashpole

Contributor

dashpole commented May 17, 2018

Everything I had planned to work on for this has been completed.
To address the initial points: we have long had per-container inode metrics and inode eviction; we have on-demand node-level and allocatable-level metrics and a decent integration with memory cgroup notifications; and we do not perform additional evictions until all artifacts of the previously evicted pod, including volumes and containers, have been cleaned up.
/close
