[WIP] Allow Node strategies to run with informers #488
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: damemi. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing `/approve` in a comment.
As long as an operator is used to deploy both instances of the descheduler (one per mode) and the same config file gets used (each mode just filters out what's not relevant), this change should be effectively transparent to a user. Without an operator, one will need to maintain both a CronJob and a Deployment.

In any case there will be two instances of the descheduler running, which might interfere with each other. With strategies no longer running in sequence, we need to revisit the code for potential races. Also, we should improve some error messages where an eviction fails (e.g. due to a non-existing pod). We might also need a mechanism to make sure the two mode instances run over separate sets of pods (e.g. label selector filtering) to minimize interference.
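One minimal sketch of that label-selector separation (assuming a hypothetical `podMatchesScope` helper and a per-instance selector setting; not code from this PR):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// podMatchesScope reports whether a pod falls inside this instance's scope.
// Each descheduler instance would be configured with its own rawSelector so
// the two modes operate on disjoint sets of pods.
func podMatchesScope(pod *v1.Pod, rawSelector string) (bool, error) {
	selector, err := labels.Parse(rawSelector)
	if err != nil {
		return false, fmt.Errorf("invalid label selector %q: %v", rawSelector, err)
	}
	return selector.Matches(labels.Set(pod.Labels)), nil
}
```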
```go
// StrategyFunction defines the function signature for each strategy's main function
type StrategyFunction func(
```
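For reference, a plausible continuation of the truncated signature above, based on how descheduler strategies are generally invoked (the exact parameter list here is an assumption, not taken from this PR):

```go
import (
	"context"

	v1 "k8s.io/api/core/v1"
	clientset "k8s.io/client-go/kubernetes"

	"sigs.k8s.io/descheduler/pkg/api"
	"sigs.k8s.io/descheduler/pkg/descheduler/evictions"
)

// Assumed full shape of the truncated type above; the parameter list is a guess.
type StrategyFunction func(
	ctx context.Context,
	client clientset.Interface,
	strategy api.DeschedulerStrategy,
	nodes []*v1.Node,
	podEvictor *evictions.PodEvictor,
)
```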
I'd rather keep this type private until we discuss how to refactor the way strategies are initialized.
```go
sharedInformerFactory informers.SharedInformerFactory
stopChannel           chan struct{}

f StrategyFunction
```
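Piecing together the fields visible across this review, the controller struct might look roughly like this (a reconstruction; fields beyond those shown in the hunks are inferred from later snippets, not confirmed):

```go
import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	clientset "k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/workqueue"
)

// StrategyController (reconstructed sketch): ctx, client, nodeSelector, nodes
// and queue are inferred from other hunks in this review, not shown in this one.
type StrategyController struct {
	ctx                   context.Context
	client                clientset.Interface
	sharedInformerFactory informers.SharedInformerFactory
	nodeSelector          string     // passed to nodeutil.ReadyNodes below
	nodes                 []*v1.Node // cached on each node event
	queue                 workqueue.Interface
	stopChannel           chan struct{}
	f                     StrategyFunction // the strategy this controller drives
}
```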
s/f/func to avoid one letter variable (at least two letters to make searching for it easy please :))
```go
	c.nodes = nodes
	c.queue.Add(workQueueKey)
},
UpdateFunc: func(old, new interface{}) {
```
Can you put down a comment saying DeleteFunc is not needed since pods are automatically evicted (or similar)?
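For example, the requested comment could read like this (a sketch reusing the handler shape from the diff shown later in this review, with the enqueue bodies simplified):

```go
import "k8s.io/client-go/tools/cache"

func nodeEventHandler(c *StrategyController) cache.ResourceEventHandler {
	return cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { c.queue.Add(workQueueKey) },
		UpdateFunc: func(old, new interface{}) { c.queue.Add(workQueueKey) },
		// DeleteFunc is intentionally omitted: when a node is deleted, its
		// pods are evicted automatically, so there is nothing for the
		// strategy to react to.
	}
}
```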
@ingvagabund this does not require 2 descheduler instances. The informed strategies are spun off into separate, non-blocking goroutines while the main wait loop handles the iterative strategies. This is done with 1 descheduler, run as a Deployment.
I see. So you suggest dropping the cron job and moving back to the original way of deploying the descheduler (or keeping cron but providing a Deployment as the main manifest as well). I recall it was more practical to use a cron job than to have a descheduler instance stopped on …

Also, the kubelet reports node status every 10 seconds by default. There's gonna be a lot of "empty" iterations, so it might make sense to add a check to …
It seems every informed strategy has its own `StrategyController`, so we may run several strategies at the same time, which may cause some data synchronization problems. I think we can keep only one `StrategyController` and work queue. When we get an event, the `EventHandler` should decide which strategy (or strategies) should be called from the event's type and add the strategy's name to the work queue; the worker will then process them one by one.
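A minimal sketch of that single-controller shape, with one queue and one worker (all names here are illustrative, not from the PR):

```go
import "k8s.io/client-go/util/workqueue"

// strategyRegistry: one controller, one work queue, one worker. The event
// handler decides which strategies an event concerns and enqueues their
// names; the worker then runs them strictly one at a time.
type strategyRegistry struct {
	queue      workqueue.Interface
	strategies map[string]func() // strategy name -> closure running that strategy
}

// onNodeEvent enqueues every interested strategy; deciding "interest" from
// the event type is stubbed out here.
func (r *strategyRegistry) onNodeEvent(obj interface{}) {
	for name := range r.strategies {
		r.queue.Add(name)
	}
}

// runWorker drains the queue serially, so no two strategies run concurrently.
func (r *strategyRegistry) runWorker() {
	for {
		item, shutdown := r.queue.Get()
		if shutdown {
			return
		}
		if run, ok := r.strategies[item.(string)]; ok {
			run()
		}
		r.queue.Done(item)
	}
}
```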
Yeah, this is my thinking. There are advantages and disadvantages to either way of running the descheduler, so there's no reason not to provide both.
That is a good idea. This is why I did not make the …
This is interesting, and sort of goes along with what @ingvagabund suggested above. I thought it made sense (at least from a code organization standpoint) to have individual strategies responsible for their own …

This brings up another data synchronization point: what if an event comes in right around the same time as a … Maybe this could be solved with some kind of mutex? That way, if the lock is held by either the …

I think this design (of running informed and default strategies in the same descheduler instance) is critical for both usability and further development of reactive descheduling, so I would like to make sure that this can run smoothly.
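A rough sketch of the mutex idea (names and function shapes are illustrative; `wait.Until` is the real helper from `k8s.io/apimachinery/pkg/util/wait`):

```go
import (
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// deschedulePass serializes descheduling work: the fixed-interval loop and
// the event-driven controller both take it, so at most one pass runs at a time.
var deschedulePass sync.Mutex

// runIntervalLoop is the classic wait loop; it blocks whenever an
// event-triggered pass already holds the lock.
func runIntervalLoop(interval time.Duration, runStrategies func(), stopCh <-chan struct{}) {
	wait.Until(func() {
		deschedulePass.Lock()
		defer deschedulePass.Unlock()
		runStrategies()
	}, interval, stopCh)
}

// runOnEvent is what the informer-driven controller would call per work item.
func runOnEvent(runStrategy func()) {
	deschedulePass.Lock()
	defer deschedulePass.Unlock()
	runStrategy()
}
```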
```go
	return c
}

func nodeEventHandler(c *StrategyController) cache.ResourceEventHandler {
```
Should this event handler be moved to `types.go` or somewhere else? It looks like a common event handler and will be used by other strategies. If it's a custom event handler, a name like `nodeAffinityNodeEventHandler` is better.
It is a common handler, but only between the node strategies (taints + affinity). I actually considered putting these strategies into their own subpackage (like `pkg/descheduler/strategies/node/`) but that seems a bit overcomplicated.
```go
func nodeEventHandler(c *StrategyController) cache.ResourceEventHandler {
	return cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			nodes, err := nodeutil.ReadyNodes(c.ctx, c.client, c.sharedInformerFactory.Core().V1().Nodes(), c.nodeSelector)
```
If strategies run serially, the time between listing nodes and running a strategy may be long; if another event comes during this period, we'll list nodes again, when actually we only need to do this once. So, like what was discussed in #469, listing and filtering nodes could be done in each strategy; we could also customize `nodeSelector` per strategy by doing so.
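A sketch of that per-strategy listing (a hypothetical helper; the `ReadyNodes` call is copied from the diff above):

```go
import (
	"fmt"

	v1 "k8s.io/api/core/v1"

	nodeutil "sigs.k8s.io/descheduler/pkg/descheduler/node"
)

// runWithFreshNodes lists and filters nodes at execution time, immediately
// before the strategy runs, using a selector specific to that strategy.
func runWithFreshNodes(c *StrategyController, nodeSelector string, run func(nodes []*v1.Node)) error {
	nodes, err := nodeutil.ReadyNodes(c.ctx, c.client, c.sharedInformerFactory.Core().V1().Nodes(), nodeSelector)
	if err != nil {
		return fmt.Errorf("listing ready nodes: %v", err)
	}
	run(nodes) // the strategy sees nodes gathered now, not at event time
	return nil
}
```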
> if another event comes during this period, we'll list nodes again, when actually we only need to do this once

I'm not sure, because the point of reacting to Node events as they happen is to get a real-time, updated view of the cluster to operate on. So each event should trigger a re-list so the strategy doesn't have old data.

I do agree, though, with refactoring this a bit like was previously mentioned.
All the "reactive" strategies listening to node informers need to take into account strategies like LowNodeUtilization. LowNodeUtilization
needs to take into account entire cluster (or its reasonable subset). So if the strategies are ran in incorrect order, e.g. LowNodeUtilization
followed by running PodLifeTime
removing many old pods, running LowNodeUtilization
before might be actually nullified and make the overall resource consumption worst. So far, the order of strategies was kinda hardcoded (depending on the map iterator). We might compute some static impact/score of each strategy to overall resource utilization and run the strategies in some "practical" order starting with strategies changing utilization the most to strategies changing it the least (just a though, it might not be possible).
I doubt that `LowNodeUtilization` will be easy to convert to be reactive. That is the point of the design I proposed here: we can begin to convert some of the strategies while keeping the hardcoded order of others.

I think @lixiang233's idea of a single strategy controller with a master registry of event handlers would solve any ordering issues. That, with a mutex, ensures we aren't running 2 strategies at the same time.
For example, with an interval of 60 minutes we could have:

- 0 min: interval run of strategies
- 12 min: strategycontroller runs nodeaffinity
- 60 min: interval run
- 119 min: strategycontroller run (mutex lock)
- 121 min: interval run (blocked waiting for mutex)
- 181 min: interval run

And since each periodic interval does a node re-list, those are working with an up-to-date list each time. Any results from NodeAffinity/NodeTaints won't interfere. So really, we are just triggering a descheduling run at dynamic intervals along with the hardcoded/cron intervals.
/cc

@damemi: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/cc
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to a set of staleness rules, which you can override with lifecycle commands. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
@JaneLiuL: GitHub didn't allow me to request PR reviews from the following users: JaneLiuL. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@damemi very good PR~~ Very happy to see that the descheduler supports Default and Informer modes.
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@damemi: The following tests failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to a set of staleness rules, which you can override with lifecycle commands. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to a set of staleness rules, which you can override with lifecycle commands. Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind feature