
Prepare pod evictor for the descheduling framework plugin #846

Merged 3 commits into kubernetes-sigs:master on Jul 9, 2022

Conversation

@ingvagabund (Contributor) commented Jun 9, 2022:

  • Pass the strategy name into the evictor through the context to reduce the list of parameters to a minimum
  • Drop the node parameter, since it can be retrieved from the pod object
  • EvictPod: stop returning an error and let higher-level callers decide when is the right time to exit a strategy loop (see the sketch after this list)
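
A rough before/after sketch of the change described above (the prior parameter list is assumed for illustration; only the new form is quoted verbatim later in this thread):

// Before (assumed shape): the caller passed the node and had to interpret an error.
//   func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod, node *v1.Node) (bool, error)
//
// After: the node name comes from pod.Spec.NodeName, and the bool only reports
// whether the pod was evicted on the server side.
//   func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod) bool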

Pre-requisite for #837

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 9, 2022
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 9, 2022
@ingvagabund ingvagabund changed the title WIP: Pass the strategy name into evictor through context WIP: Prepare pod evictor for the descheduling framework plugin Jun 9, 2022
@ingvagabund (Contributor, Author):

CC @knelasevero

@ingvagabund (Contributor, Author):

/retest

klog.ErrorS(err, "Error evicting pod", "pod", klog.KObj(pod))
break
}
podEvictor.EvictPod(ctx, pod)
Contributor:

It appears that the functionality is being changed here: before, the loop would break on a single failed eviction, but now it loops through all pods.

Should the old behavior be preserved?

Suggested change:
-podEvictor.EvictPod(ctx, pod)
+if !podEvictor.EvictPod(ctx, pod) {
+	break
+}

@ingvagabund (Contributor, Author) commented Jun 13, 2022:

In the past there was only a single error issued (exceeding the limit on the number of pods allowed to be evicted per node), so it made sense to break and move on to another node. However, a few months back we added a limit on the number of pods evicted per namespace. So when the namespace limit gets exceeded we can still continue evicting.

That said, I plan to introduce a check for the node limit being exceeded. I will incorporate it in this PR once #847 is merged. I am still making changes to figure out the smallest amount of refactoring needed to move the evictor filter bits into a plugin. Once that is done we can start moving the strategies into plugins.

@ingvagabund (Contributor, Author):

Just introduced a NodeLimitExceeded method for performing the check.
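
For illustration, a minimal sketch of how a strategy loop could combine the bool-returning EvictPod with such a check (the NodeLimitExceeded signature below is assumed, not copied from the PR):

for _, pod := range podsOnNode {
	// A false return just means this pod was not evicted (e.g. the
	// namespace limit was hit); other pods on the node may still qualify.
	podEvictor.EvictPod(ctx, pod)

	// The per-node limit, by contrast, makes further work on this node
	// pointless, so the strategy can stop early.
	if podEvictor.NodeLimitExceeded(node) {
		break
	}
}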

@ingvagabund (Contributor, Author):

#847 needs to be merged first.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 13, 2022
@ingvagabund (Contributor, Author):

Rebasing on top of #847

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 16, 2022
@ingvagabund ingvagabund changed the title WIP: Prepare pod evictor for the descheduling framework plugin Prepare pod evictor for the descheduling framework plugin Jun 16, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2022
@ingvagabund (Contributor, Author):

@damemi @a7i ptal

reason = ctx.Value("evictionReason").(string)
}

if pod.Spec.NodeName == "" {
@a7i (Contributor) commented Jun 16, 2022:

The PodLifeTime strategy allows pods in the Pending state. I assume there could be a state where the pod has not been scheduled yet (or cannot be) but needs to be evicted for a retry?

/b2418ef481298c6caf185a5f88dd0bb6ddc1cdbf/pkg/descheduler/strategies/pod_lifetime.go#L42

@ingvagabund (Contributor, Author):

I will alter the code to take this case into account. Thanks for noticing!

Contributor:

On second thought... after scanning the code, it seems that it only lists pods on nodes. We should probably discuss changing that, but not as part of this PR.

@ingvagabund (Contributor, Author):

We might remove the condition in https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/pod/pods.go#L129-L131 to have pending pods included in the getPodsAssignedToNode function, treating an empty nodeName as a special case.
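
A hypothetical sketch of the kind of indexer condition being referenced (assuming k8s.io/client-go/tools/cache and k8s.io/api/core/v1 imports; the index name and exact wiring in pods.go may differ):

indexers := cache.Indexers{
	"nodeName": func(obj interface{}) ([]string, error) {
		pod, ok := obj.(*v1.Pod)
		if !ok {
			return []string{}, nil
		}
		// The condition under discussion: Pending pods have an empty
		// .spec.nodeName, so they are never indexed and never show up
		// in getPodsAssignedToNode. Dropping this (or indexing "" as a
		// special key) would include them.
		if pod.Spec.NodeName == "" {
			return []string{}, nil
		}
		return []string{pod.Spec.NodeName}, nil
	},
}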

The method uses the node object only to get the node name, and the node name can be retrieved from the pod object.

Some strategies might try to evict a pod in the Pending state, which does not have the .spec.nodeName field set; the test for the node limit is therefore skipped in that case.
When an error is returned, a strategy either stops completely or starts processing another node. Given that the error can be transient, or that only one of the limits may have been exceeded, it is fairer to just skip a pod that failed eviction and proceed to the next one instead.

In order to optimize the processing and stop earlier, it is more practical to implement a check which says when a limit has been exceeded.
// EvictPod evicts a pod while exercising eviction limits.
// Returns true when the pod is evicted on the server side.
// Eviction reason can be set through the ctx's evictionReason:STRING pair
func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod) bool {
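
As an illustration of the doc comment above, a caller would attach the reason before invoking the method (the key string "evictionReason" appears in the diff; the reason text here is a made-up example):

ctx = context.WithValue(ctx, "evictionReason", "pod lifetime exceeded")
if podEvictor.EvictPod(ctx, pod) {
	// the pod was evicted on the server side
}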
Contributor:

I'm curious to know why the return signature was changed to not return the error.

Previously, we had the flexibility to return a distinct error for the namespace limit being reached, the node limit being reached, and potentially other limits in the future. But now it's up to each strategy to handle the limit-exceeded logic.

Contributor:

Is it because each plugin should now be responsible for deciding if it should continue/abort?

@ingvagabund (Contributor, Author):

Is it because each plugin should now be responsible for deciding if it should continue/abort?

There are several limits now, and not all of them require stopping the processing of the current node. E.g. when a namespace limit is reached, a plugin can continue processing another pod on the same node. So yes, a plugin is allowed to decide whether the right course of action is to stop processing a node and continue with another one, or to just skip a pod and take another on the same node.

// EvictPod evicts a pod while exercising eviction limits.
// Returns true when the pod is evicted on the server side.
// Eviction reason can be set through the ctx's evictionReason:STRING pair
func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod) bool {
Contributor:

Can we continue to keep the v1.Node parameter? See #859 for the use case.

@ingvagabund (Contributor, Author) commented Jul 7, 2022:

Replied in #859 (comment)

@JaneLiuL (Member) left a comment:

/lgtm

@@ -74,6 +74,7 @@ func RemovePodsViolatingNodeAffinity(ctx context.Context, client clientset.Inter

switch nodeAffinity {
case "requiredDuringSchedulingIgnoredDuringExecution":
Member:

lgtm actually, but I just wonder why this name, requiredDuringSchedulingIgnoredDuringExecution, is so long...

@ingvagabund (Contributor, Author) commented Jul 7, 2022:

Chosen by the designers: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity. There's probably a historical reason for it.

@JaneLiuL (Member) commented Jul 8, 2022:

Got it. Looks good to me now~

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 29, 2022
@ingvagabund (Contributor, Author):

@a7i @damemi would you please review the PR one more time?

@a7i (Contributor) commented Jul 7, 2022:

/ok-to-test
/lgtm

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jul 7, 2022
@a7i (Contributor) commented Jul 8, 2022:

/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: a7i

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 8, 2022
pkg/descheduler/strategies/duplicates.go (resolved review thread)
@@ -302,7 +302,7 @@ func RunDeschedulerStrategies(ctx context.Context, rs *options.DeschedulerServer
 continue
 }
 evictorFilter := evictions.NewEvictorFilter(nodes, getPodsAssignedToNode, evictLocalStoragePods, evictSystemCriticalPods, ignorePvcPods, evictBarePods, evictions.WithNodeFit(nodeFit), evictions.WithPriorityThreshold(thresholdPriority))
-f(ctx, rs.Client, strategy, nodes, podEvictor, evictorFilter, getPodsAssignedToNode)
+f(context.WithValue(ctx, "strategyName", string(name)), rs.Client, strategy, nodes, podEvictor, evictorFilter, getPodsAssignedToNode)
Contributor:

Is this intended to be the final approach or just a work-in-progress step (passing values through the context)?

@ingvagabund (Contributor, Author) commented Jul 8, 2022:

Passing some values through the context is the final approach, at least for the strategyName, which will get changed into pluginName. The context key can then be read and used in other places.
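
For illustration, reading that value back could look like this minimal sketch (the bare string key matches the diff above; a typed context key would be the more idiomatic Go pattern):

name := ""
if v, ok := ctx.Value("strategyName").(string); ok { // comma-ok avoids a panic if the key is absent
	name = v
}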

Contributor:

Is there a reason for doing that rather than adding a new parameter to the functions that use it? It looks like this is just being used to pass the strategy name and reason to EvictPod, is that right?

Contributor:

I think a better approach would be to define an "options" struct to pass as an optional param to EvictPod. That struct can have strategy and reason fields to start with, and if we decide to add more options in the future then we don't need to change EvictPod's signature.

What do you think about something like that? As this is, I don't think it's an appropriate use of context.
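
A rough sketch of the options struct being proposed (the type and field names here are hypothetical, not from the PR):

// EvictOptions carries optional metadata for logging/metrics;
// the eviction itself only needs the pod.
type EvictOptions struct {
	StrategyName string // e.g. "PodLifeTime"
	Reason       string // free-form, surfaced in logs and events
}

// The signature would then become:
//   func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod, opts EvictOptions) bool

Adding fields to the struct later would then not require changing EvictPod's signature, which is the point of the suggestion.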

Contributor:

That said, I don't need to block on this right now since this PR has been open for a while (and that's my bad for not getting around to reviewing until now). If my idea sounds good I'll take it as a follow-up refactor and we can unhold this PR.

@ingvagabund (Contributor, Author):

We might as well move the strategy name out of EvictPod. It's used mostly for metrics, plus two lines of logging which can be logged without the strategy name (the same goes for the reason). Instead, both the strategy name and the reason (if there is any) can be printed outside of EvictPod through a wrapper, which will get introduced after the framework primitives are implemented.

My reasoning is to use EvictPod only for the actual eviction. The method does not need to know anything about the strategy/plugin/reason/etc. in order to evict a pod. We can have higher-level invokers log the additional information.

If my idea sounds good I'll take it as a follow up refactor and we can unhold this PR

+1 for refactoring the code more in a follow-up PR.

Contributor:

I think the idea of passing additional info/options to EvictPod makes sense. It's already a wrapper for a private evictPod function (not sure why, though), but as the public interface it seems like a good spot to expose a handle for customization/logging/metrics.

@damemi (Contributor) commented Jul 8, 2022:

/hold
Just had 2 questions before this auto-merges.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 8, 2022
@ingvagabund (Contributor, Author):

Unholding based on #846 (comment)
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 9, 2022
@k8s-ci-robot k8s-ci-robot merged commit 6e69a10 into kubernetes-sigs:master Jul 9, 2022
@ingvagabund ingvagabund deleted the evictor-interface branch July 9, 2022 18:28