
kubelet: eviction: add memcg threshold notifier to improve eviction responsiveness #32577

Merged
merged 1 commit into kubernetes:master on Nov 23, 2016

Conversation

@sjenning (Contributor) commented Sep 13, 2016

This PR adds the ability for the eviction code to get immediate notification from the kernel when the available memory in the root cgroup falls below a user-defined threshold, controlled by setting the memory.available signal with the --eviction-hard flag.

This PR by itself doesn't change anything, as the frequency at which new stats can be obtained is currently controlled by the cadvisor housekeeping interval. That being the case, the call to synchronize() from the notification loop will very likely get stale stats and not act any more quickly than it does now.

However, once cadvisor gains on-demand stat gathering, this will improve eviction responsiveness by receiving async notification of the root cgroup memory state rather than relying on polling cadvisor.
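
For context, a hedged example of the kubelet eviction configuration this notifier keys off; the threshold values below are illustrative and not taken from this PR:

# hard eviction: evict as soon as available memory drops below the threshold
kubelet --eviction-hard=memory.available<100Mi

# soft eviction: tolerate the breach for a grace period before evicting
kubelet --eviction-soft=memory.available<300Mi --eviction-soft-grace-period=memory.available=30s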

@vishh @derekwaynecarr @kubernetes/rh-cluster-infra


This change is Reviewable

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Sep 13, 2016

for _, threshold := range thresholds {
// only enable memcg threshold notification if a hard memory eviction limit is set
if threshold.Signal != SignalMemoryAvailable || threshold.GracePeriod != time.Duration(0) {
Review comment (Member):

why special case grace period?

Are we not able to define two thresholds - one for the soft and one for the hard eviction config? By getting notified on a soft transition, we can get a more accurate grace period calculation.

sjenning (Contributor Author):

this makes the code more complex in that we are tracking two notifiers now, but it can be done

Review comment (Member):

is the complexity that we need to have multiple notifiers? I think for a first pass, just having it for eviction-hard is fine, but in the long run I would want both. Ideally, we could have the memcg notification that we crossed the threshold, set a timer to wake up after the associated grace period, and run sync at that point.
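
A minimal Go sketch of the notify-then-timer idea described above; the identifiers (waitAndSync, notifications, synchronize) are illustrative assumptions, not the PR's actual code:

package eviction // hypothetical package name for the sketch

import "time"

// waitAndSync: on each memcg crossing notification for the soft threshold,
// wait out the grace period, then run a sync pass to decide whether the
// threshold is still breached and eviction is warranted.
func waitAndSync(notifications <-chan struct{}, gracePeriod time.Duration, synchronize func()) {
	for range notifications {
		timer := time.NewTimer(gracePeriod)
		<-timer.C // grace period elapsed; re-check observations
		synchronize()
	}
}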

@vishh (Contributor) commented Sep 13, 2016

I'd like to review this, but will probably have to wait till next week. Apologies!

@sjenning (Contributor Author):

Updated with typo fixes, log prefix, godoc; pulled threshold setting out of the constructor and into a SetThreshold() on the interface; added stopCh to Start() (even though I used wait.NeverStop, it seemed to align with the convention).

Still outstanding: do we want to have a threshold for the soft limit too?

@derekwaynecarr derekwaynecarr added this to the v1.5 milestone Sep 19, 2016
@derekwaynecarr derekwaynecarr added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Sep 19, 2016
@derekwaynecarr (Member) left a comment:

I think I want this factored in a way that lets us verify that, no matter the ThresholdNotifier, the module actually responds to a notification being triggered. Right now, nothing is testable via unit tests / mocks with the current structure.

@@ -62,6 +63,8 @@ type managerImpl struct {
resourceToRankFunc map[api.ResourceName]rankFunc
// resourceToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
resourceToNodeReclaimFuncs map[api.ResourceName]nodeReclaimFuncs
// mcgThresholdNotifier provides immediate notification when the root cgroup crosses a memory threshold
Review comment (Member):

nit: it's a thresholdNotifier now.

@@ -136,6 +139,58 @@ func (m *managerImpl) IsUnderDiskPressure() bool {
return hasNodeCondition(m.nodeConditions, api.NodeDiskPressure)
}

func (m *managerImpl) createThresholdNotification(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, observations signalObservations) error {
Review comment (Member):

is it cleaner if this returns a ThresholdNotifier rather than changing internal state?


glog.Infof("eviction manager: registered memory notification for %s on %s at %s", cgpath, attribute, quantity)
m.thresholdNotifier = thresholdNotifier
go m.thresholdNotifier.Start(wait.NeverStop)
Review comment (Member):

if we returned a ThresholdNotifier, it seems the calling method could then handle spawning this goroutine...

not sure if it's a cleaner factoring yet, but it may let us test this better.

for example, the current structure of this code doesn't let us have a mock threshold notifier that triggers a notification AND lets us know that synchronize was actually called in response.

it would be good if the code was structured to support us knowing that we responded to a notification by actually invoking a sync.
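
A minimal sketch of the kind of test that factoring would allow; the names here (ThresholdNotifier, fakeNotifier, TestNotificationTriggersSync) are illustrative assumptions rather than the PR's actual identifiers:

package eviction_test // hypothetical external test package

import "testing"

// ThresholdNotifier is a stand-in for the interface under discussion.
type ThresholdNotifier interface {
	Start(stopCh <-chan struct{})
}

// fakeNotifier invokes its callback once when started, simulating a memcg
// threshold crossing.
type fakeNotifier struct {
	onNotify func()
}

func (f *fakeNotifier) Start(stopCh <-chan struct{}) {
	f.onNotify()
}

// The property being asked for: a fake notifier fires, and the test can
// observe that synchronize ran in response.
func TestNotificationTriggersSync(t *testing.T) {
	syncCalled := false
	synchronize := func() { syncCalled = true }

	var n ThresholdNotifier = &fakeNotifier{onNotify: synchronize}
	n.Start(nil) // the manager would normally run Start in a goroutine

	if !syncCalled {
		t.Fatal("expected a notification to trigger synchronize")
	}
}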

@sjenning (Contributor Author):

@derekwaynecarr fixed nits and refactored createThresholdNotification() just to see what it looks like. Still thinking about how to refactor the notifier code for better testing.

@k8s-bot commented Sep 19, 2016

GCE e2e build/test passed for commit 93fe23b.

@@ -32,7 +33,7 @@ import (
"k8s.io/kubernetes/pkg/util/wait"
)

// managerImpl implements NodeStabilityManager
// managerImpl implements Manager
Review comment (Contributor):

use var _ Manager = &managerImpl{} instead of the comment?

sjenning (Contributor Author):

yes, the type assertion is already below the struct def. I was just fixing an inaccuracy in the comment.

Review comment (Contributor):

First line that got hidden; no arguing about that :)
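
For readers unfamiliar with the idiom being referenced, a minimal sketch of the compile-time interface assertion; only the Manager and managerImpl names come from the thread, the rest is illustrative:

package eviction // hypothetical package name

// Manager is a stand-in for the eviction manager interface.
type Manager interface {
	Start()
}

type managerImpl struct{}

func (m *managerImpl) Start() {}

// Compile-time check that managerImpl satisfies Manager; the build fails if
// the implementation drifts out of sync with the interface.
var _ Manager = &managerImpl{}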

}
eventfd := int(efd)
if eventfd < 0 {
return nil, fmt.Errorf("eventfd call failed")
Review comment (Contributor):

missing syscall.Close(watchfd) and syscall.Close(controlfd) ?
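
A minimal sketch of the cleanup pattern being suggested; the helper name and signature are illustrative, and the actual PR opens these descriptors via cgo, so the real calls differ:

package eviction // hypothetical package name

import "syscall"

// closeOnError: if a later step fails, close the file descriptors that were
// already opened so they do not leak, then propagate the original error.
func closeOnError(err error, fds ...int) error {
	if err == nil {
		return nil
	}
	for _, fd := range fds {
		syscall.Close(fd) // best-effort cleanup; the secondary error is ignored
	}
	return err
}

In the error branch shown in the diff, the point is that watchfd and controlfd should be closed before returning the eventfd error.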

@sjenning (Contributor Author):

@ingvagabund fixed up. thanks for the review!

@derekwaynecarr (Member):

@vishh -- is this something you want to review?

@vishh (Contributor) commented Sep 30, 2016

Yes. Apologies for the delay. I will shepherd this PR through soon.


@sjenning (Contributor Author) commented Oct 3, 2016

@derekwaynecarr @vishh I reworked this PR today. The notifier interface is simpler now, with just a Start() function. I put in notifiers for both the soft and hard limits. Overall, I think it made it simpler, more functional, and possibly better for testing. I can add the tests once the approach is approved.

@sjenning (Contributor Author) commented Oct 4, 2016

@k8s-bot gci gce e2e test this

@sjenning (Contributor Author) commented Oct 4, 2016

@k8s-bot node e2e test this issue #31408

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 4, 2016
@mikedanese (Member):

@derekwaynecarr @sjenning the bazel build is not PR blocking, so please ignore it. I'm looking into the failure regardless.

@mikedanese (Member) commented Nov 22, 2016

You found a bug in gazel (or maybe an unimplemented feature)! mikedanese/gazel#3 is under review and will fix the bazel build failure. Again, the bazel presubmit is not currently blocking; as you can see, the Submit Queue check is green even though Bazel Build is failing.

@k8s-github-robot:
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@derekwaynecarr (Member):

@mikedanese - thanks!

@k8s-github-robot:
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot (Contributor):

Jenkins Bazel Build failed for commit 2583116. Full PR test history.

The magic incantation to run this job again is @k8s-bot bazel test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-github-robot:
Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit f8d8831 into kubernetes:master Nov 23, 2016
@rmmh (Contributor) commented Nov 23, 2016

@dchen1107 @mikedanese @sjenning this broke the cross-build, but I don't understand why. Please revert or fix.

@mikedanese (Member):

It's likely due to the CGO.

@derekwaynecarr (Member):

@rmmh -- I am putting together a fix for cross-build now.

@saad-ali (Member):

Removing the cherry-pick candidate label; this is already in 1.5.

@saad-ali (Member) commented Dec 7, 2016

@sjenning @derekwaynecarr @vishh This PR is currently on trial for causing issue #37853. @dchen1107, @mtaufen, and @thockin are working to verify. If convicted, this PR will be reverted from master and the 1.5 release branch.

Could you help me understand what the impact of reverting this PR and shipping 1.5.0 without it will be?

@derekwaynecarr (Member):

Can we gate its enablement behind an additional flag? It's unclear if this impacted all kernels, and it's not clear to me if it only had an impact when you experience memory pressure. Will speak to Seth more in the morning. The impact of reverting is that the kubelet is less responsive to memory pressure and the node is more likely to OOM, assuming the kernel itself was stable with the notification.

@derekwaynecarr (Member):

For example, a --kernel-memcg-notify flag would be a simple way to gate the behavior on impacted distros.
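
A minimal sketch of what such a gate might look like in the kubelet startup path; the flag name is the proposal above, and the surrounding identifiers are illustrative, not from the eventual fix:

package eviction // hypothetical package name

// startNotifierIfEnabled: the memcg notifier only starts when the operator
// opts in via the flag, so distros with problematic kernels keep the existing
// polling-only eviction behavior.
func startNotifierIfEnabled(kernelMemcgNotify bool, startNotifier func()) {
	if !kernelMemcgNotify {
		return // eviction falls back to the cadvisor housekeeping poll alone
	}
	startNotifier()
}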

@vishh (Contributor) commented Dec 7, 2016 via email

@saad-ali (Member) commented Dec 7, 2016

This PR will help avoid production issues that often occur due to the node running into memory pressure. This improves the reliability of the node. So before reverting, it would help if we could further understand the impact of this PR and identify any potential fixes for the base image.

Could you please help enumerate those.

@vishh (Contributor) commented Dec 7, 2016

1. Kubernetes nodes are typically memory overcommitted by default unless Node Allocatable is configured. This results in nodes invoking the kernel OOM killer, which at times ends up killing system services like the docker daemon, kubelet, and kube-proxy, in addition to arbitrary user processes. We introduced user space eviction logic in the kubelet to avoid invoking the kernel OOM killer as much as possible by evicting user pods to free up memory.
2. Prior to this PR, that user space eviction logic ran in a housekeeping loop, missed memory spikes at times, and failed to perform its intended functionality effectively.
3. With this PR, that user space memory eviction logic will be triggered by the kernel whenever there is memory pressure and will evict pods immediately.

@saad-ali I can enumerate a list of customers hitting this issue offline.

@derekwaynecarr (Member):

@sjenning @vishh @saad-ali - I have started putting together a patch to gate its enablement behind an additional flag for kernels where the notification could prove problematic.

@dchen1107 (Member):

I am OK with having a flag to gate the feature.

@vishh (Contributor) commented Dec 7, 2016

Thanks @derekwaynecarr

@derekwaynecarr (Member):

Opened a WIP PR here for a flag: #38258

Hope to finish this evening.

@sjenning sjenning deleted the memcg-notification-wip branch August 16, 2017 02:17
Labels:
lgtm ("Looks good to me", indicates that a PR is ready to be merged)
release-blocker
release-note-none (Denotes a PR that doesn't merit a release note)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files)