
[WIP] Allow Node strategies to run with informers #488

Closed
wants to merge 7 commits

Conversation

@damemi (Contributor) commented Jan 25, 2021

/kind feature

@k8s-ci-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.), kind/feature (Categorizes issue or PR as related to a new feature.), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.), and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels on Jan 25, 2021
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) label on Jan 25, 2021
@ingvagabund (Contributor) left a comment

As long as an operator is used to deploy both instances of the descheduler (one per mode) and the same config file is used (each mode just filters out what's not relevant to it), this change should be essentially transparent to a user. Without an operator, one will need to maintain both a CronJob and a Deployment.

In any case there will be two descheduler instances running, which might interfere with each other. With strategies no longer running in sequence, we need to revisit the code for potential races. We should also improve some error messages where an eviction fails (e.g. due to a non-existent pod). We might also need a mechanism that makes sure the two mode instances operate over separate sets of pods (e.g. label selector filtering) to minimize interference.

)

// StrategyFunction defines the function signature for each strategy's main function
type StrategyFunction func(
Contributor:

I'd rather keep this type private until we discuss how to refactor the way strategies are initialized.

sharedInformerFactory informers.SharedInformerFactory
stopChannel chan struct{}

f StrategyFunction
Contributor:

s/f/func to avoid a one-letter variable name (at least two letters, to make searching for it easy please :))

c.nodes = nodes
c.queue.Add(workQueueKey)
},
UpdateFunc: func(old, new interface{}) {
Contributor:

Can you add a comment saying DeleteFunc is not needed since pods are automatically evicted (or similar)?

@damemi (Contributor, Author) commented Jan 26, 2021

Without an operator, one will need to maintain both a CronJob and a Deployment. In any case there will be two descheduler instances running, which might interfere with each other.

@ingvagabund this does not require 2 descheduler instances. The informed strategies are spun off into separate, non-blocking goroutines while the main wait loop handles the iterative strategies. This is done with 1 descheduler, run as a deployment with deschedulingInterval.
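
To make that concrete, here is a minimal sketch of the single-process layout described above, assuming hypothetical names (strategyFunc, runDescheduler, and the informed/iterative maps) rather than this PR's actual types:

```go
package descheduler

import (
	"context"
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// strategyFunc is a hypothetical stand-in for a strategy's entry point.
type strategyFunc func(ctx context.Context)

// runDescheduler launches informed strategies in their own goroutines while
// the main loop keeps running the iterative strategies every interval.
func runDescheduler(ctx context.Context, informed, iterative map[string]strategyFunc, interval time.Duration) {
	// Informed strategies react to Node events on their own; they must not
	// block the periodic loop, so each gets its own goroutine.
	for name, run := range informed {
		go func(name string, run strategyFunc) {
			log.Printf("starting informed strategy %s", name)
			run(ctx)
		}(name, run)
	}

	// Iterative strategies keep the existing behaviour: one sequential pass
	// per deschedulingInterval until the context is cancelled.
	wait.NonSlidingUntil(func() {
		for name, run := range iterative {
			log.Printf("running iterative strategy %s", name)
			run(ctx)
		}
	}, interval, ctx.Done())
}
```

The point is only the shape: event-driven strategies live in goroutines, while the interval loop stays as it is today.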

@ingvagabund (Contributor) commented:

This is done with 1 descheduler, run as a deployment with deschedulingInterval.

I see. So you suggest dropping the cron job and moving back to the original way of deploying the descheduler (or keeping the cron job but providing a Deployment as well as the main manifest). I recall it was more practical to use a CronJob than to have a descheduler instance blocked in time.Sleep for deschedulingInterval. A Deployment makes more sense now.

Also, the Kubelet reports node status every 10 seconds by default, so there are going to be a lot of "empty" iterations. It might make sense to add a check to UpdateFunc for each strategy to compare what actually changed, so a strategy does not have to run every 10 seconds, going through all pods and nodes, just because a node changed e.g. its labels.
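
For illustration, such a per-strategy check could look roughly like the sketch below; the predicate and handler names are hypothetical, and the sketch also carries the comment requested in the inline review above about why DeleteFunc is omitted:

```go
package strategies

import (
	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/client-go/tools/cache"
)

// taintsChanged is a per-strategy relevance check for a taints-based strategy.
func taintsChanged(oldNode, newNode *v1.Node) bool {
	return !apiequality.Semantic.DeepEqual(oldNode.Spec.Taints, newNode.Spec.Taints)
}

// labelsChanged is a per-strategy relevance check for an affinity-based strategy.
func labelsChanged(oldNode, newNode *v1.Node) bool {
	return !apiequality.Semantic.DeepEqual(oldNode.Labels, newNode.Labels)
}

// filteredNodeEventHandler enqueues work only when the strategy-specific
// predicate says something relevant changed, so the 10s status heartbeats
// are dropped here. DeleteFunc is intentionally omitted: pods on a deleted
// node are evicted by the cluster anyway, so there is nothing to deschedule.
func filteredNodeEventHandler(relevant func(oldNode, newNode *v1.Node) bool, enqueue func()) cache.ResourceEventHandler {
	return cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { enqueue() },
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNode, okOld := oldObj.(*v1.Node)
			newNode, okNew := newObj.(*v1.Node)
			if okOld && okNew && relevant(oldNode, newNode) {
				enqueue()
			}
		},
	}
}
```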

@lixiang233 (Contributor) left a comment

It seems every informed strategy has its own StrategyController, so we may run several strategies at the same time, and this may cause data synchronization problems. I think we can keep only one StrategyController and work queue. When we get an event, the EventHandler should decide which strategy (or strategies) should be triggered based on the event's type and add the strategy's name to the work queue; the worker will then process them one by one.
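
A rough sketch of that single-controller idea, with hypothetical names (strategyRunner, enqueueFor, runWorker), just to show one queue keyed by strategy name:

```go
package strategies

import (
	"context"

	"k8s.io/client-go/util/workqueue"
)

// strategyRunner is a hypothetical per-strategy entry point.
type strategyRunner func(ctx context.Context)

// strategyController sketches the single-controller design: one work queue
// keyed by strategy name, shared by all informed strategies.
type strategyController struct {
	queue      workqueue.RateLimitingInterface
	strategies map[string]strategyRunner // registered informed strategies
}

// enqueueFor is called from the informer event handlers; the handler decides
// which strategies care about the event and adds their names to the queue.
func (c *strategyController) enqueueFor(names ...string) {
	for _, name := range names {
		c.queue.Add(name)
	}
}

// runWorker drains the queue one strategy at a time.
func (c *strategyController) runWorker(ctx context.Context) {
	for {
		key, shutdown := c.queue.Get()
		if shutdown {
			return
		}
		if run, ok := c.strategies[key.(string)]; ok {
			run(ctx)
		}
		c.queue.Done(key)
	}
}
```

Because the worker pops one strategy name at a time, event-triggered strategies never run concurrently, which is what avoids the data synchronization problems mentioned above.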

@damemi (Contributor, Author) commented Jan 27, 2021

keeping the cron job but providing a Deployment as well as the main manifest

Yeah, this is my thinking. There are advantages and disadvantages to either way of running the descheduler, so there's no reason not to provide both.

it might make sense to add a check to UpdateFunc for each strategy to compare what actually changed so a strategy does not have to be run every 10 seconds

That is a good idea. This is why I did not make the eventHandler a method of StrategyController: the idea is that each informed strategy may want to implement its own logic for filtering out no-op events (or modifying the event before operating on it).
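
One way to express that ownership, purely as a sketch with hypothetical names, is a small interface that each informed strategy implements:

```go
package strategies

import "k8s.io/client-go/tools/cache"

// informedStrategy sketches the idea that each informed strategy owns its
// event filtering: the controller wires up the informer, the strategy
// decides which events are worth a run. Names here are illustrative only.
type informedStrategy interface {
	// Name identifies the strategy, e.g. as a work queue key.
	Name() string
	// EventHandler returns the strategy's own handler, which can drop
	// no-op events before anything is enqueued.
	EventHandler(enqueue func(strategyName string)) cache.ResourceEventHandler
}
```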

It seems every informed strategy has its own StrategyController, so we may run several strategies at the same time, and this may cause data synchronization problems. I think we can keep only one StrategyController and work queue. When we get an event, the EventHandler should decide which strategy (or strategies) should be triggered based on the event's type and add the strategy's name to the work queue; the worker will then process them one by one.

This is interesting, and sort of goes along with what @ingvagabund suggested above.

I thought it made sense (at least from a code organization standpoint) to have individual strategies responsible for their own EventHandlers, but I see your point about data synchronization. A central event handler "registry" which then processes the informed strategies serially would be safer. I would probably like to add some way to pass the event itself to each strategy, where the strategy's code can decide whether to run or ignore it.

This brings up another data synchronization point: what if an event comes in right around the same time as a deschedulingInterval tick? We shouldn't run the periodic strategies at the same time as the informed strategies.

Maybe this could be solved with some kind of mutex? That way, whichever of the StrategyController or the wait loop is triggered first holds the lock, and we don't hit multithreaded data synchronization issues.
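
A minimal sketch of that mutex idea, assuming hypothetical wrapper functions around the two entry points (not the PR's actual code):

```go
package strategies

import (
	"context"
	"sync"
)

// runLock is taken by whichever side fires first, the deschedulingInterval
// loop or the event-driven StrategyController, so they never run at once.
var runLock sync.Mutex

// runIntervalPass is a hypothetical wrapper around the periodic pass.
func runIntervalPass(ctx context.Context, runAll func(ctx context.Context)) {
	runLock.Lock()
	defer runLock.Unlock()
	runAll(ctx) // existing sequential pass over all strategies
}

// runInformedPass is a hypothetical wrapper used by the controller worker.
func runInformedPass(ctx context.Context, run func(ctx context.Context)) {
	runLock.Lock()
	defer runLock.Unlock()
	run(ctx) // single event-triggered strategy
}
```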

I think this design (of running informed and default strategies in the same descheduler instance) is critical for both usability and further development of reactive descheduling. So I would like to make sure that this can run smoothly.

return c
}

func nodeEventHandler(c *StrategyController) cache.ResourceEventHandler {
Contributor:

Should this event handler be moved to types.go or somewhere else? It looks like a common event handler and will be used by other strategies. If it's a custom event handler, a name like nodeAffinityNodeEventHandler would be better.

Contributor (Author):

It is a common handler, but only between the node strategies (taints+affinity). I actually considered putting these strategies into their own subpackage (like pkg/descheduler/strategies/node/) but that seems a bit overcomplicated.

func nodeEventHandler(c *StrategyController) cache.ResourceEventHandler {
return cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
nodes, err := nodeutil.ReadyNodes(c.ctx, c.client, c.sharedInformerFactory.Core().V1().Nodes(), c.nodeSelector)
Contributor:

If strategies run serially, the time between listing nodes and running a strategy may be long; if another event comes in during this period, we'll list nodes again, but actually we only need to do this once. So, as discussed in #469, listing and filtering nodes could be done in each strategy; we could also customize the nodeSelector for each strategy by doing so.
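
As a sketch of that refactor, each strategy could list and filter its own nodes with its own selector, reusing the ReadyNodes call shown in the diff context above; the import paths and the strategy signature here are assumptions for illustration only:

```go
package strategies

import (
	"context"

	"k8s.io/client-go/informers"
	clientset "k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"

	// Assumed import path for the descheduler's node utilities.
	nodeutil "sigs.k8s.io/descheduler/pkg/descheduler/node"
)

// runNodeAffinityStrategy sketches per-strategy node listing: the strategy
// owns its nodeSelector and re-lists ready nodes when it actually runs, so
// it never works from a stale list captured at event time.
func runNodeAffinityStrategy(ctx context.Context, client clientset.Interface,
	sharedInformerFactory informers.SharedInformerFactory, nodeSelector string) {
	nodes, err := nodeutil.ReadyNodes(ctx, client, sharedInformerFactory.Core().V1().Nodes(), nodeSelector)
	if err != nil {
		klog.ErrorS(err, "failed to list ready nodes")
		return
	}
	// ... evaluate node affinity against the freshly listed nodes ...
	_ = nodes
}
```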

Contributor (Author):

if another event comes in during this period, we'll list nodes again, but actually we only need to do this once

I'm not sure, because the point of reacting to Node events as they happen is to get a real-time updated view of the cluster to operate on. So each event should trigger a re-list so the strategy doesn't have old data.

I do agree, though, with refactoring this a bit as was previously mentioned.

Contributor:

All the "reactive" strategies listening to node informers need to take into account strategies like LowNodeUtilization. LowNodeUtilization needs to take the entire cluster (or a reasonable subset of it) into account. So if the strategies run in the wrong order, e.g. LowNodeUtilization followed by PodLifeTime removing many old pods, the earlier LowNodeUtilization run might effectively be nullified and the overall resource consumption made worse. So far, the order of strategies has been more or less hardcoded (depending on the map iterator). We might compute some static impact/score of each strategy on overall resource utilization and run the strategies in some "practical" order, starting with the strategies that change utilization the most and ending with those that change it the least (just a thought, it might not be possible).
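
If such a static ordering were ever tried, it could be as simple as sorting the strategies by an assigned score before each pass; the type and the notion of a score below are purely illustrative:

```go
package strategies

import "sort"

// scoredStrategy pairs a strategy name with a hypothetical static
// "utilization impact" score.
type scoredStrategy struct {
	name  string
	score int // higher = changes cluster utilization more
}

// orderByImpact returns strategy names sorted from highest to lowest
// impact, so e.g. PodLifeTime could run before LowNodeUtilization.
func orderByImpact(strategies []scoredStrategy) []string {
	sort.Slice(strategies, func(i, j int) bool {
		return strategies[i].score > strategies[j].score
	})
	names := make([]string, 0, len(strategies))
	for _, s := range strategies {
		names = append(names, s.name)
	}
	return names
}
```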

Contributor (Author):

I doubt that LowNodeUtilization can easily be converted to be reactive. That is the point of the design I proposed here: we can begin to convert some of the strategies while keeping the hardcoded order of the others.

I think @lixiang233's idea of a single strategy controller with a master registry of event handlers would solve any ordering issues. That, combined with a mutex, ensures we aren't running two strategies at the same time.

For example, with an interval of 60 minutes we could have:

0 min   - interval run of strategies
12 min  - StrategyController runs NodeAffinity
60 min  - interval run
119 min - StrategyController run (takes the mutex lock)
121 min - interval run (was blocked waiting for the mutex)
181 min - interval run

And since each periodic interval does a node re-list, those runs are working with an up-to-date list each time, and any results from NodeAffinity/NodeTaints won't interfere. So really, we are just triggering a descheduling run at dynamic intervals along with the hardcoded/cron intervals.

@seanmalloy (Member) commented:

/cc

@k8s-ci-robot added the needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) label on Apr 6, 2021
@k8s-ci-robot (Contributor) commented:

@damemi: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@denkensk (Member) commented Jun 1, 2021

/cc

@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.) label on Jan 10, 2022
@k8s-ci-robot (Contributor) commented:

@JaneLiuL: GitHub didn't allow me to request PR reviews from the following users: JaneLiuL.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@JaneLiuL (Member) commented:

@damemi very good PR! Very happy to see that the descheduler supports both Default and Informer modes.
I can help split this into several PRs if you don't mind :)

@Dentrax (Contributor) commented Jan 20, 2022

Hey @damemi, according to issue #696, we can pick up this PR if you show us what kind of things need to be done. 🙏 We can either work on this PR or create a new carry PR by cherry-picking your commits. Can you please enlighten us on the next steps?

cc @developer-guy @eminaktas @yasintahaerol @necatican @f9n

@k8s-ci-robot removed the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) label on Jan 20, 2022
@k8s-ci-robot (Contributor) commented:

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the cncf-cla: no (Indicates the PR's author has not signed the CNCF CLA.) label on Jan 20, 2022
@k8s-ci-robot (Contributor) commented:

@damemi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-descheduler-test-e2e-k8s-master-1-21 | f01c473 | link | | /test pull-descheduler-test-e2e-k8s-master-1-21 |
| pull-descheduler-helm-test | f01c473 | link | | /test pull-descheduler-helm-test |
| pull-descheduler-test-e2e-k8s-master-1-22 | f01c473 | link | | /test pull-descheduler-test-e2e-k8s-master-1-22 |
| pull-descheduler-test-e2e-k8s-1-21-1-21 | f01c473 | link | | /test pull-descheduler-test-e2e-k8s-1-21-1-21 |
| pull-descheduler-unit-test-master-master | f01c473 | link | true | /test pull-descheduler-unit-test-master-master |
| pull-descheduler-test-e2e-k8s-master-1-23 | f01c473 | link | true | /test pull-descheduler-test-e2e-k8s-master-1-23 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) label and removed the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.) label on Mar 5, 2022
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
  • approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
  • cncf-cla: no (Indicates the PR's author has not signed the CNCF CLA.)
  • do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.)
  • kind/feature (Categorizes issue or PR as related to a new feature.)
  • lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)
  • needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)

9 participants