kubelet: eviction: add memcg threshold notifier to improve eviction responsiveness #32577
Conversation
```go
for _, threshold := range thresholds {
	// only enable memcg threshold notification if a hard memory eviction limit is set
	if threshold.Signal != SignalMemoryAvailable || threshold.GracePeriod != time.Duration(0) {
```
why special case grace period?
Are we not able to define two thresholds, one for the soft and one for the hard eviction config? By getting notified on a soft transition, we can get a more accurate grace-period calculation.
this makes the code more complex in that we are tracking two notifiers now, but it can be done
is the complexity that we need to have multiple notifiers? i think for a first pass, just having it for eviction-hard is fine, but in the long run, i would want both. ideally, we could get the memcg notification that we crossed the threshold, set a timer to wake up at the associated grace period, and run sync at that point.
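A rough sketch of that timer-after-notification idea in Go (hypothetical helper, not code from this PR): the notifier reports the soft-threshold crossing immediately, and a timer defers the sync until the grace period has elapsed.

```go
package eviction

import "time"

// onSoftThresholdCrossed sketches the idea above: when the memcg
// notifier reports a soft-threshold crossing, arm a timer for the
// configured grace period and invoke sync when it fires, instead of
// waiting for the next polling interval to notice the condition.
func onSoftThresholdCrossed(gracePeriod time.Duration, sync func()) {
	time.AfterFunc(gracePeriod, sync)
}
```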
I'd like to review this, but will probably have to wait till next week. Apologies!
Updated with typo fixes, log prefix, and godoc; pulled threshold setting out of the constructor and into a SetThreshold() on the interface; added stopCh to Start() (even though I used wait.NeverStop, it seemed to align with the convention). Still outstanding: do we want to have a threshold for the soft limit too?
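For orientation, the notifier contract at this stage presumably looked roughly like the sketch below; the method names come from the comment above, but the exact signatures are an assumption.

```go
package eviction

// ThresholdNotifier is a sketch of the contract described above;
// SetThreshold and Start come from the comment, the parameter types
// are assumed rather than taken from the PR.
type ThresholdNotifier interface {
	// SetThreshold arms the notifier at the given memory threshold,
	// moved out of the constructor per review feedback.
	SetThreshold(threshold int64) error
	// Start delivers notifications until stopCh is closed; the
	// caller here passed wait.NeverStop.
	Start(stopCh <-chan struct{})
}
```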
I think I want this factored in a way that lets us verify that, no matter what the ThresholdNotifier implementation is, the module actually responds to a notification being triggered. right now, nothing is actually testable in unit testing / with a mock given the current structure.
```diff
@@ -62,6 +63,8 @@ type managerImpl struct {
 	resourceToRankFunc map[api.ResourceName]rankFunc
 	// resourceToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
 	resourceToNodeReclaimFuncs map[api.ResourceName]nodeReclaimFuncs
+	// mcgThresholdNotifier provides immediate notification when the root cgroup crosses a memory threshold
```
nit: it's a thresholdNotifier now.
```diff
@@ -136,6 +139,58 @@ func (m *managerImpl) IsUnderDiskPressure() bool {
 	return hasNodeCondition(m.nodeConditions, api.NodeDiskPressure)
 }

+func (m *managerImpl) createThresholdNotification(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, observations signalObservations) error {
```
is it cleaner if this returns a ThresholdNotifier rather than changing internal state?
```go
glog.Infof("eviction manager: registered memory notification for %s on %s at %s", cgpath, attribute, quantity)
m.thresholdNotifier = thresholdNotifier
go m.thresholdNotifier.Start(wait.NeverStop)
```
if we returned a ThresholdNotifier, it seems the calling method could then handle spawning this go func... not sure if it's a cleaner factoring yet, but it may let us test this better. for example, the current structure of this code doesn't let us have a mock threshold notifier that triggers a notification AND lets us know that synchronize was actually called in response. it would be good if the code was structured to support us knowing that we responded to a notification by actually invoking a sync.
@derekwaynecarr fixed nits and refactored createThresholdNotification() just to see what it looks like. still thinking about how to refactor the notifier code for better testing.
GCE e2e build/test passed for commit 93fe23b.
```diff
@@ -32,7 +33,7 @@ import (
 	"k8s.io/kubernetes/pkg/util/wait"
 )

-// managerImpl implements NodeStabilityManager
+// managerImpl implements Manager
```
use `var _ Manager = &managerImpl{}` instead of the comment?
yes, the type assertion is already below the struct def. i was just fixing an inaccuracy in the comment.
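The idiom in question, as a self-contained illustration (the Manager method set here is made up so the snippet compiles):

```go
package main

// Manager's method set is illustrative, not the real eviction API.
type Manager interface {
	Start() error
}

type managerImpl struct{}

func (m *managerImpl) Start() error { return nil }

// The compile-time assertion: the build fails on this line if
// *managerImpl ever stops satisfying Manager, so the guarantee is
// enforced by the compiler rather than documented in a comment that
// can go stale. The blank identifier means nothing is retained at
// runtime.
var _ Manager = &managerImpl{}

func main() {}
```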
First line that got hidden, no argument about that :)
```go
}
eventfd := int(efd)
if eventfd < 0 {
	return nil, fmt.Errorf("eventfd call failed")
```
missing syscall.Close(watchfd) and syscall.Close(controlfd)?
@ingvagabund fixed up. thanks for the review!
@vishh -- is this something you want to review?
Yes. Apologies for the delay. I will shepherd this PR through soon.
Force-pushed from b146072 to ee11b88.
@derekwaynecarr @vishh I reworked this PR today. The notifier interface is simpler now with just a Start() function. I put in notifiers for both soft and hard limits. Overall, I think it made it simpler, more functional, and possibly better for testing. I can add the testing once the approach is approved.
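A sketch of what that reworked shape could look like, based on the description above (illustrative; the exact signatures in the PR may differ):

```go
package eviction

// ThresholdNotifier is reduced to a single method per the rework
// described above; the threshold and the handler to run on a
// crossing are assumed to be wired in at construction time.
type ThresholdNotifier interface {
	// Start blocks, invoking the configured handler each time the
	// kernel reports that the threshold has been crossed.
	Start()
}

// The manager owns one notifier per threshold type, covering the
// soft/hard discussion from earlier in the review.
type managerImpl struct {
	hardNotifier ThresholdNotifier
	softNotifier ThresholdNotifier
}

func (m *managerImpl) startNotifiers() {
	go m.hardNotifier.Start()
	go m.softNotifier.Start()
}
```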
@k8s-bot gci gce e2e test this
Force-pushed from ee11b88 to 1ab92c1.
@derekwaynecarr @sjenning the bazel build is not PR blocking, so please ignore it. I'm looking into the failure regardless.
You found a bug in gazel (or maybe an unimplemented feature)! mikedanese/gazel#3 is under review and will fix the Bazel build failure. Again, the Bazel presubmit is not currently blocking. As you can see, the Submit Queue check is green even though Bazel Build is failing.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
@mikedanese - thanks!
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
Jenkins Bazel Build failed for commit 2583116. Full PR test history. The magic incantation to run this job again is
Automatic merge from submit-queue
@dchen1107 @mikedanese @sjenning this broke the cross-build, but I don't understand why. Please revert or fix.
It's likely due to the cgo usage.
@rmmh -- I am putting together a fix for cross-build now.
Removing cherry-pick candidate label; this is already in 1.5.
@sjenning @derekwaynecarr @vishh This PR is currently on trial for causing issue #37853. @dchen1107, @mtaufen, and @thockin are working to verify. If convicted, this PR will be reverted from master and the 1.5 release branch. Could you help me understand what the impact of reverting this PR and shipping 1.5.0 without it will be?
Can we gate its enablement behind an additional flag? It's unclear if this impacted all kernels, and it's not clear to me if it only had an impact when you experience memory pressure. Will speak to Seth more in the morning. The impact of reverting is that the kubelet is less responsive to memory pressure and the node is more likely to OOM, assuming the kernel itself was stable with the notification.
For example, a --kernel-memcg-notify flag would be a simple way to gate the behavior on impacted distros.
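A minimal sketch of that gating, assuming the flag name from the comment above; the default and the wiring are illustrative, not the shipped implementation.

```go
package main

import (
	"flag"
	"fmt"
)

// kernelMemcgNotification gates the notifier; the flag name comes
// from the comment above, the default and description are assumed.
var kernelMemcgNotification = flag.Bool("kernel-memcg-notify", false,
	"use kernel memcg notifications to detect eviction threshold crossings instead of polling")

// maybeStartNotifier only spawns the notifier loop when the flag is
// set, so impacted distros can keep the polling-only behavior.
func maybeStartNotifier(start func()) {
	if !*kernelMemcgNotification {
		return
	}
	go start()
}

func main() {
	flag.Parse()
	maybeStartNotifier(func() { fmt.Println("memcg notifier running") })
}
```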
This PR will help avoid production issues that occur often due to the node running into memory pressure. This improves the reliability of the node. So before reverting, it would help if we can further understand the impact of this PR and identify any potential fixes for the base image.
Could you please help enumerate those?
Kubernetes nodes are typically memory-overcommitted by default unless Node Allocatable is configured. This results in nodes invoking the kernel OOM killer, which at times ends up killing system services like the docker daemon, kubelet, and kube-proxy, in addition to arbitrary user processes. We introduced user-space eviction logic in the kubelet to avoid invoking the kernel OOM killer as much as possible by evicting user pods to free up memory.
I am OK with having a flag to gate the feature.
Thanks @derekwaynecarr
Opened a WIP PR here for a flag: #38258. Hope to finish this evening.
This PR adds the ability for the eviction code to get immediate notification from the kernel when the available memory in the root cgroup falls below a user-defined threshold, controlled by setting the `memory.available` signal with the `--eviction-hard` flag.

This PR by itself doesn't change anything, as the frequency at which new stats can be obtained is currently controlled by the cadvisor housekeeping interval. That being the case, the call to `synchronize()` by the notification loop will very likely get stale stats and not act any more quickly than it does now.

However, whenever cadvisor does get on-demand stat-gathering ability, this will improve eviction responsiveness by getting async notification of the root cgroup memory state rather than relying on polling cadvisor.
@vishh @derekwaynecarr @kubernetes/rh-cluster-infra
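For background, the cgroup v1 mechanism this PR builds on works roughly as follows: a process registers an eventfd against memory.usage_in_bytes by writing "<event_fd> <watched_fd> <threshold>" into cgroup.event_control, and a read on the eventfd then blocks until usage crosses the threshold (the kubelet would derive that usage threshold from node capacity minus the configured memory.available value). A minimal sketch using golang.org/x/sys/unix, with paths and error handling simplified:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// waitForMemoryThreshold registers an eventfd-based memcg threshold
// notification on the root memory cgroup and blocks until the kernel
// signals a crossing. Illustrative only; the PR's real code manages
// fd lifetimes and re-arming across a notification loop.
func waitForMemoryThreshold(threshold int64) error {
	watchfd, err := unix.Open("/sys/fs/cgroup/memory/memory.usage_in_bytes", unix.O_RDONLY, 0)
	if err != nil {
		return err
	}
	defer unix.Close(watchfd)

	controlfd, err := unix.Open("/sys/fs/cgroup/memory/cgroup.event_control", unix.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer unix.Close(controlfd)

	eventfd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		return err
	}
	defer unix.Close(eventfd)

	// "<event_fd> <watched_fd> <threshold>" arms the notification
	config := fmt.Sprintf("%d %d %d", eventfd, watchfd, threshold)
	if _, err := unix.Write(controlfd, []byte(config)); err != nil {
		return err
	}

	// the read blocks until usage crosses the threshold; the 8-byte
	// payload is the eventfd counter value
	buf := make([]byte, 8)
	_, err = unix.Read(eventfd, buf)
	return err
}

func main() {
	if err := waitForMemoryThreshold(1 << 30); err != nil {
		fmt.Println("notification failed:", err)
		return
	}
	fmt.Println("memory threshold crossed")
}
```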