
kubelet: eviction: add memcg threshold notifier to improve eviction responsiveness #32577

Merged
merged 1 commit into kubernetes:master on Nov 23, 2016

Conversation

@sjenning (Contributor) commented Sep 13, 2016

This PR adds the ability for the eviction code to get immediate notification from the kernel when the available memory in the root cgroup falls below a user-defined threshold, controlled by setting the memory.available signal with the --eviction-hard flag.

This PR by itself doesn't change anything, as the frequency at which new stats can be obtained is currently controlled by the cadvisor housekeeping interval. That being the case, the call to synchronize() from the notification loop will very likely get stale stats and not act any more quickly than it does now.

However, once cadvisor gains on-demand stat gathering, this will improve eviction responsiveness by receiving async notification of the root cgroup memory state rather than relying on polling cadvisor.
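
For context, a hedged example of the kubelet eviction configuration this notifier keys off; the threshold values below are illustrative and not taken from this PR:

# hard eviction: evict as soon as available memory drops below the threshold
kubelet --eviction-hard=memory.available<100Mi

# soft eviction: tolerate the breach for a grace period before evicting
kubelet --eviction-soft=memory.available<300Mi --eviction-soft-grace-period=memory.available=30s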

@vishh @derekwaynecarr @kubernetes/rh-cluster-infra


This change is Reviewable

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Sep 13, 2016

for _, threshold := range thresholds {
// only enable memcg threshold notification if a hard memory eviction limit is set
if threshold.Signal != SignalMemoryAvailable || threshold.GracePeriod != time.Duration(0) {
Review comment (Member):

why special case grace period?

Are we not able to define two thresholds - one for the soft and one for the hard eviction config? By getting notified on a soft transition, we can get a more accurate grace period calculation.

sjenning (Contributor Author):

this makes the code more complex in that we are tracking two notifiers now, but it can be done

Review comment (Member):

is the complexity that we need to have multiple notifiers? I think for a first pass, just having it for eviction-hard is fine, but in the long run I would want both. Ideally, we could have the memcg notification that we crossed the threshold, set a timer to wake up after the associated grace period, and run sync at that point.
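
A minimal Go sketch of the notify-then-timer idea described above; the identifiers (waitAndSync, notifications, synchronize) are illustrative assumptions, not the PR's actual code:

package eviction // hypothetical package name for the sketch

import "time"

// waitAndSync: on each memcg crossing notification for the soft threshold,
// wait out the grace period, then run a sync pass to decide whether the
// threshold is still breached and eviction is warranted.
func waitAndSync(notifications <-chan struct{}, gracePeriod time.Duration, synchronize func()) {
	for range notifications {
		timer := time.NewTimer(gracePeriod)
		<-timer.C // grace period elapsed; re-check observations
		synchronize()
	}
}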

@vishh (Contributor) commented Sep 13, 2016

I'd like to review this, but will probably have to wait till next week. Apologies!

@sjenning (Contributor Author):

Updated with typo fixes, log prefix, godoc; pulled threshold setting out of the constructor and into a SetThreshold() on the interface; added stopCh to Start() (even though I used wait.NeverStop, it seemed to align with the convention).

Still outstanding: do we want to have a threshold for the soft limit too?

@derekwaynecarr derekwaynecarr added this to the v1.5 milestone Sep 19, 2016
@derekwaynecarr derekwaynecarr added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Sep 19, 2016
@derekwaynecarr (Member) left a comment:

I think I want this factored in a way that lets us verify that, no matter the ThresholdNotifier, the module actually responds to a notification being triggered. Right now, nothing is testable via unit tests / mocks with the current structure.

@@ -62,6 +63,8 @@ type managerImpl struct {
resourceToRankFunc map[api.ResourceName]rankFunc
// resourceToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
resourceToNodeReclaimFuncs map[api.ResourceName]nodeReclaimFuncs
// mcgThresholdNotifier provides immediate notification when the root cgroup crosses a memory threshold
Review comment (Member):

nit: it's a thresholdNotifier now.

@@ -136,6 +139,58 @@ func (m *managerImpl) IsUnderDiskPressure() bool {
return hasNodeCondition(m.nodeConditions, api.NodeDiskPressure)
}

func (m *managerImpl) createThresholdNotification(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, observations signalObservations) error {
Review comment (Member):

is it cleaner if this returns a ThresholdNotifier rather than changing internal state?


glog.Infof("eviction manager: registered memory notification for %s on %s at %s", cgpath, attribute, quantity)
m.thresholdNotifier = thresholdNotifier
go m.thresholdNotifier.Start(wait.NeverStop)
Review comment (Member):

if we returned a ThresholdNotifier, it seems the calling method could then handle spawning this goroutine...

not sure if it's a cleaner factoring yet, but it may let us test this better.

for example, the current structure of this code doesn't let us have a mock threshold notifier that triggers a notification AND lets us know that synchronize was actually called in response.

it would be good if the code was structured to support us knowing that we responded to a notification by actually invoking a sync.
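
A minimal sketch of the kind of test that factoring would allow; the names here (ThresholdNotifier, fakeNotifier, TestNotificationTriggersSync) are illustrative assumptions rather than the PR's actual identifiers:

package eviction_test // hypothetical external test package

import "testing"

// ThresholdNotifier is a stand-in for the interface under discussion.
type ThresholdNotifier interface {
	Start(stopCh <-chan struct{})
}

// fakeNotifier invokes its callback once when started, simulating a memcg
// threshold crossing.
type fakeNotifier struct {
	onNotify func()
}

func (f *fakeNotifier) Start(stopCh <-chan struct{}) {
	f.onNotify()
}

// The property being asked for: a fake notifier fires, and the test can
// observe that synchronize ran in response.
func TestNotificationTriggersSync(t *testing.T) {
	syncCalled := false
	synchronize := func() { syncCalled = true }

	var n ThresholdNotifier = &fakeNotifier{onNotify: synchronize}
	n.Start(nil) // the manager would normally run Start in a goroutine

	if !syncCalled {
		t.Fatal("expected a notification to trigger synchronize")
	}
}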

@sjenning (Contributor Author):

@derekwaynecarr fixed nits and refactored createThresholdNotification() just to see what it looks like. Still thinking about how to refactor the notifier code for better testing.

@k8s-bot commented Sep 19, 2016

GCE e2e build/test passed for commit 93fe23b.

@@ -32,7 +33,7 @@ import (
"k8s.io/kubernetes/pkg/util/wait"
)

// managerImpl implements NodeStabilityManager
// managerImpl implements Manager
Review comment (Contributor):

use var _ Manager = &managerImpl{} instead of the comment?

sjenning (Contributor Author):

yes, the type assertion is already below the struct def. I was just fixing an inaccuracy in the comment.

Review comment (Contributor):

First line that got hidden; no arguing about that :)
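
For readers unfamiliar with the idiom being referenced, a minimal sketch of the compile-time interface assertion; only the Manager and managerImpl names come from the thread, the rest is illustrative:

package eviction // hypothetical package name

// Manager is a stand-in for the eviction manager interface.
type Manager interface {
	Start()
}

type managerImpl struct{}

func (m *managerImpl) Start() {}

// Compile-time check that managerImpl satisfies Manager; the build fails if
// the implementation drifts out of sync with the interface.
var _ Manager = &managerImpl{}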

}
eventfd := int(efd)
if eventfd < 0 {
return nil, fmt.Errorf("eventfd call failed")
Review comment (Contributor):

missing syscall.Close(watchfd) and syscall.Close(controlfd) ?
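
A minimal sketch of the cleanup pattern being suggested; the helper name and signature are illustrative, and the actual PR opens these descriptors via cgo, so the real calls differ:

package eviction // hypothetical package name

import "syscall"

// closeOnError: if a later step fails, close the file descriptors that were
// already opened so they do not leak, then propagate the original error.
func closeOnError(err error, fds ...int) error {
	if err == nil {
		return nil
	}
	for _, fd := range fds {
		syscall.Close(fd) // best-effort cleanup; the secondary error is ignored
	}
	return err
}

In the error branch shown in the diff, the point is that watchfd and controlfd should be closed before returning the eventfd error.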

@sjenning (Contributor Author):

@ingvagabund fixed up. thanks for the review!

@derekwaynecarr (Member):

@vishh -- is this something you want to review?

@vishh (Contributor) commented Sep 30, 2016

Yes. Apologies for the delay. I will shepherd this PR through soon.


@sjenning (Contributor Author) commented Oct 3, 2016

@derekwaynecarr @vishh I reworked this PR today. The notifier interface is simpler now, with just a Start() function. I put in notifiers for both the soft and hard limits. Overall, I think it made it simpler, more functional, and possibly better for testing. I can add the tests once the approach is approved.

@sjenning (Contributor Author) commented Oct 4, 2016

@k8s-bot gci gce e2e test this

@sjenning (Contributor Author) commented Oct 4, 2016

@k8s-bot node e2e test this issue #31408

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 4, 2016
@mikedanese (Member):

@derekwaynecarr @sjenning the bazel build is not PR blocking, so please ignore it. I'm looking into the failure regardless.

@mikedanese (Member) commented Nov 22, 2016

You found a bug in gazel (or maybe an unimplemented feature)! mikedanese/gazel#3 is under review and will fix the bazel build failure. Again, the bazel presubmit is not currently blocking; as you can see, the Submit Queue check is green even though Bazel Build is failing.

@k8s-github-robot:
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@derekwaynecarr (Member):

@mikedanese - thanks!

@k8s-github-robot:
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot (Contributor):

Jenkins Bazel Build failed for commit 2583116. Full PR test history.

The magic incantation to run this job again is @k8s-bot bazel test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-github-robot:
Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit f8d8831 into kubernetes:master Nov 23, 2016
@rmmh (Contributor) commented Nov 23, 2016

@dchen1107 @mikedanese @sjenning this broke the cross-build, but I don't understand why. Please revert or fix.

@mikedanese (Member):

It's likely due to the CGO.

@derekwaynecarr (Member):

@rmmh -- I am putting together a fix for cross-build now.

@saad-ali (Member):

Removing the cherry-pick candidate label; this is already in 1.5.

@saad-ali (Member) commented Dec 7, 2016

@sjenning @derekwaynecarr @vishh This PR is currently on trial for causing issue #37853. @dchen1107, @mtaufen, and @thockin are working to verify. If convicted, this PR will be reverted from master and the 1.5 release branch.

Could you help me understand what the impact of reverting this PR and shipping 1.5.0 without it will be?

@derekwaynecarr (Member):

Can we gate its enablement behind an additional flag? It's unclear if this impacted all kernels, and it's not clear to me if it only had an impact when you experience memory pressure. Will speak to Seth more in the morning. The impact of reverting is that the kubelet is less responsive to memory pressure and the node is more likely to OOM, assuming the kernel itself was stable with the notification.

@derekwaynecarr (Member):

For example, a --kernel-memcg-notify flag would be a simple way to gate the behavior on impacted distros.
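
A minimal sketch of what such a gate might look like in the kubelet startup path; the flag name is the proposal above, and the surrounding identifiers are illustrative, not from the eventual fix:

package eviction // hypothetical package name

// startNotifierIfEnabled: the memcg notifier only starts when the operator
// opts in via the flag, so distros with problematic kernels keep the existing
// polling-only eviction behavior.
func startNotifierIfEnabled(kernelMemcgNotify bool, startNotifier func()) {
	if !kernelMemcgNotify {
		return // eviction falls back to the cadvisor housekeeping poll alone
	}
	startNotifier()
}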

@vishh (Contributor) commented Dec 7, 2016 via email

@saad-ali (Member) commented Dec 7, 2016

This PR will help avoid production issues that often occur due to the node running into memory pressure. This improves the reliability of the node. So before reverting, it would help if we could further understand the impact of this PR and identify any potential fixes for the base image.

Could you please help enumerate those.

@vishh (Contributor) commented Dec 7, 2016

1. Kubernetes nodes are typically memory overcommitted by default unless Node Allocatable is configured. This results in nodes invoking the kernel OOM killer, which at times ends up killing system services like the docker daemon, kubelet, and kube-proxy, in addition to arbitrary user processes. We introduced user space eviction logic in the kubelet to avoid invoking the kernel OOM killer as much as possible by evicting user pods to free up memory.
2. Prior to this PR, that user space eviction logic ran in a housekeeping loop, missed memory spikes at times, and failed to perform its intended functionality effectively.
3. With this PR, that user space memory eviction logic will be triggered by the kernel whenever there is memory pressure and will evict pods immediately.

@saad-ali I can enumerate a list of customers hitting this issue offline.

@derekwaynecarr (Member):

@sjenning @vishh @saad-ali - I have started putting together a patch to gate its enablement behind an additional flag for kernels where the notification could prove problematic.

@dchen1107 (Member):

I am OK with having a flag to gate the feature.

@vishh (Contributor) commented Dec 7, 2016

Thanks @derekwaynecarr

@derekwaynecarr (Member):

Opened a WIP PR here for a flag: #38258

Hope to finish this evening.

@sjenning sjenning deleted the memcg-notification-wip branch August 16, 2017 02:17
Labels:
lgtm ("Looks good to me", indicates that a PR is ready to be merged)
release-blocker
release-note-none (Denotes a PR that doesn't merit a release note)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files)