
Reschedule bindings when unjoining a target cluster #1049

Closed
wants to merge 3 commits

Conversation

dddddai
Member

@dddddai dddddai commented Nov 30, 2021

What type of PR is this?
/kind feature

What this PR does / why we need it:
Please refer to #829 (comment)

Which issue(s) this PR fixes:
Part of #829

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

karmada-scheduler: reschedule bindings when unjoining a target cluster.

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 30, 2021
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign rainbowmango after the PR has been reviewed.
You can assign the PR to them by writing /assign @rainbowmango in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 30, 2021
@mrlihanbo

We need to refactor this code, which fails for the Duplicated policy:

	// TODO: should schedule as much as possible?
	deltaLen := len(spec.Clusters) - len(reservedClusters)
	if len(candidateClusters) < deltaLen {
		// for ReplicaSchedulingTypeDivided, we will try to migrate replicas to the other health clusters
		if placement.ReplicaScheduling == nil || placement.ReplicaScheduling.ReplicaSchedulingType == policyv1alpha1.ReplicaSchedulingTypeDuplicated {
			klog.Warningf("ignore reschedule binding as insufficient available cluster")
			return ScheduleResult{}, nil
		}
	}

@dddddai
Member Author

dddddai commented Nov 30, 2021

We need to refactor this code, which fails for the Duplicated policy:

What do you mean by "fails for the Duplicated policy"?
If the number of available clusters is less than deltaLen, it will NOT reschedule, and I'm not sure whether that's expected.

For example, given a propagation policy that propagates a workload to all clusters, should it reschedule when a cluster is unjoined? I guess the answer is YES, so is this the difference between Reschedule and FailoverSchedule?

@mrlihanbo

We need to refactor this code, which fails for the Duplicated policy:

What do you mean by "fails for the Duplicated policy"? If the number of available clusters is less than deltaLen, it will NOT reschedule, and I'm not sure whether that's expected.

For example, given a propagation policy that propagates a workload to all clusters, should it reschedule when a cluster is unjoined? I guess the answer is YES, so is this the difference between Reschedule and FailoverSchedule?

FailoverSchedule was originally designed to be used together with spread constraints, so its logic is:

  1. When some clusters in the original scheduling result become faulty, it picks the same number of replacement clusters from the healthy ones. For the Divided case this also triggers a scale schedule. For the Duplicated case without a spread constraint, no new cluster can be selected, so the original result is kept.
  2. The problem this causes is that we need to remove the unjoined cluster from the scheduling result, but once candidateClusters is smaller than the number of faulty clusters, nothing is removed.

Original scenario:
The propagation policy specifies the three clusters [A, B, C] with a spread constraint of 2, so the scheduling result is [A, B]. If A fails, C is selected to replace it and the new result is [B, C].

Problem scenario:
The propagation policy specifies the three clusters [A, B, C] without a spread constraint, so the scheduling result is [A, B, C]. If A fails, no new cluster can be selected and the [A, B, C] result is kept, whereas what we expect is that the unjoined cluster is removed from the scheduling result.
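
To make the expected behavior concrete, here is a minimal standalone Go sketch (illustrative types and a hypothetical removeUnjoinedClusters helper, not code from this PR): the unjoining cluster is dropped from the existing result even when no replacement cluster is available.

package main

import "fmt"

// TargetCluster mirrors a target entry in a binding's scheduling result.
type TargetCluster struct {
	Name     string
	Replicas int32
}

// removeUnjoinedClusters keeps only the target clusters that are not in the
// unjoined set; it never tries to pick replacements.
func removeUnjoinedClusters(targets []TargetCluster, unjoined map[string]bool) []TargetCluster {
	kept := make([]TargetCluster, 0, len(targets))
	for _, t := range targets {
		if !unjoined[t.Name] {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	result := []TargetCluster{{Name: "A"}, {Name: "B"}, {Name: "C"}}
	// Cluster A is being unjoined and no candidate cluster is available:
	// the expected new result keeps only B and C, not the stale [A, B, C].
	fmt.Println(removeUnjoinedClusters(result, map[string]bool{"A": true}))
}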

@dddddai
Member Author

dddddai commented Dec 1, 2021

If A fails, no new cluster can be selected and the [A, B, C] result is kept, whereas what we expect is that the unjoined cluster is removed from the scheduling result.

Got it.
So for Reschedule we don't consider "picking the same number of replacement clusters from the healthy ones", right? And is that the only difference between Reschedule and FailoverSchedule?

@mrlihanbo

If A fails, no new cluster can be selected and the [A, B, C] result is kept, whereas what we expect is that the unjoined cluster is removed from the scheduling result.

Got it. So for Reschedule we don't consider "picking the same number of replacement clusters from the healthy ones", right? And is that the only difference between Reschedule and FailoverSchedule?

It's a bit more complicated; it depends on whether a spread constraint is configured. The previous implementation wasn't complete either: simply picking the same number of clusters can only satisfy the SpreadByFieldCluster type of spread constraint.

  1. Without a spread constraint, Reschedule should select all clusters that fit and remove the unjoined clusters from the scheduling result.
  2. With a spread constraint, the rescheduling result still has to satisfy the spread constraint (see the sketch after this list).
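
As a rough illustration of point 2, here is a small Go sketch. It only models the SpreadByFieldCluster case, where the constraint boils down to a minimum/maximum number of selected clusters; the names are illustrative, not the scheduler's real API.

package main

import "fmt"

// clusterSpreadConstraint is a simplified stand-in for a SpreadByFieldCluster constraint.
type clusterSpreadConstraint struct {
	MinGroups int
	MaxGroups int // 0 means "no upper bound" in this sketch
}

// satisfiesClusterSpread reports whether a rescheduling result still honors the constraint.
func satisfiesClusterSpread(selected []string, c clusterSpreadConstraint) bool {
	n := len(selected)
	if n < c.MinGroups {
		return false
	}
	if c.MaxGroups > 0 && n > c.MaxGroups {
		return false
	}
	return true
}

func main() {
	// After removing the unjoined cluster A, [B, C] still satisfies MinGroups=2.
	fmt.Println(satisfiesClusterSpread([]string{"B", "C"}, clusterSpreadConstraint{MinGroups: 2, MaxGroups: 3}))
}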

@Garrybest
Member

What if we drop the unhealthy clusters from the scheduling result, keep the results on the healthy clusters, and then go straight through the normal scheduling logic so the scheduler does a second scheduling pass? In #1051 I have already merged scale scheduling with normal scheduling, and I'm wondering whether the failover logic can be merged as well.
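
A sketch of that flow, assuming it can be expressed as "trim, then run the normal pass" (the names below are hypothetical and this is not the code from #1051):

package main

import "fmt"

type targetCluster struct {
	Name     string
	Replicas int32
}

// normalSchedule stands in for the scheduler's ordinary scheduling pass.
type normalSchedule func(reserved []targetCluster) []targetCluster

// rescheduleViaNormalPath keeps the results on healthy clusters and lets the
// regular pass add or rebalance clusters under the usual placement rules.
func rescheduleViaNormalPath(prev []targetCluster, healthy map[string]bool, schedule normalSchedule) []targetCluster {
	reserved := make([]targetCluster, 0, len(prev))
	for _, t := range prev {
		if healthy[t.Name] {
			reserved = append(reserved, t)
		}
	}
	return schedule(reserved)
}

func main() {
	prev := []targetCluster{{"A", 2}, {"B", 2}, {"C", 1}}
	healthy := map[string]bool{"B": true, "C": true} // A is unjoining
	// An identity "normal pass" is enough to show that A's result is gone.
	fmt.Println(rescheduleViaNormalPath(prev, healthy, func(r []targetCluster) []targetCluster { return r }))
}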

@Garrybest
Member

Once all types share the same scheduling logic, spread constraints will be taken into account automatically. Failover currently has some defects, e.g., it does not take idle resources into consideration when rescheduling. I think we could merge all types into one scheduling process.

@dddddai
Member Author

dddddai commented Dec 1, 2021

Once all types share the same scheduling logic, spread constraints will be taken into account automatically. Failover currently has some defects, e.g., it does not take idle resources into consideration when rescheduling. I think we could merge all types into one scheduling process.

Agree, I think this is the best way to solve the problem

@mrlihanbo

What if we drop the unhealthy clusters from the scheduling result, keep the results on the healthy clusters, and then go straight through the normal scheduling logic so the scheduler does a second scheduling pass? In #1051 I have already merged scale scheduling with normal scheduling, and I'm wondering whether the failover logic can be merged as well.

Good idea, I'll take a look at the merging PR. Failover indeed needs a major overhaul.

@RainbowMango
Member

Please cc me after you've reached an agreement. :)

@dddddai dddddai marked this pull request as draft December 2, 2021 02:40
@karmada-bot karmada-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 2, 2021
@dddddai dddddai force-pushed the reschedule branch 2 times, most recently from a13375d to c5ece75 Compare December 2, 2021 14:31
@dddddai dddddai marked this pull request as ready for review December 2, 2021 15:00
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 2, 2021
@dddddai
Member Author

dddddai commented Dec 3, 2021

Hi @Garrybest, I committed a new patch that merges failover scheduling with normal scheduling.

@Garrybest
Member

Thanks @dddddai, I will check it after #1051 gets merged.

@Garrybest
Member

/assign

@dddddai dddddai force-pushed the reschedule branch 2 times, most recently from 660220b to e2f7f16 Compare December 4, 2021 07:59
@karmada-bot karmada-bot removed the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 4, 2021
@dddddai dddddai force-pushed the reschedule branch 2 times, most recently from d65ab94 to 7abe58d Compare December 8, 2021 01:57
@dddddai dddddai marked this pull request as draft December 8, 2021 02:00
@karmada-bot karmada-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 8, 2021
@dddddai dddddai force-pushed the reschedule branch 2 times, most recently from ff11011 to c83dff6 Compare December 8, 2021 02:44
@dddddai dddddai force-pushed the reschedule branch 2 times, most recently from 12be21d to 2590917 Compare December 8, 2021 03:36
@dddddai dddddai marked this pull request as ready for review December 8, 2021 04:40
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 8, 2021
@huone1
Contributor

huone1 commented Jan 17, 2022

@dddddai Let's discuss this PR in the weekly meeting tomorrow; the "unjoining a target cluster" scenario is a common user behavior.

@huone1
Contributor

huone1 commented Jan 17, 2022

There is nothing in the deleteCluster function that requeues the RBs related to the deleted cluster; I think we should add the requeue logic.

func (s *Scheduler) deleteCluster(obj interface{}) {
	var cluster *clusterv1alpha1.Cluster
	switch t := obj.(type) {
	case *clusterv1alpha1.Cluster:
		cluster = t
	case cache.DeletedFinalStateUnknown:
		var ok bool
		cluster, ok = t.Obj.(*clusterv1alpha1.Cluster)
		if !ok {
			klog.Errorf("cannot convert to clusterv1alpha1.Cluster: %v", t.Obj)
			return
		}
	default:
		klog.Errorf("cannot convert to clusterv1alpha1.Cluster: %v", t)
		return
	}
	klog.V(3).Infof("Delete event for cluster %s", cluster.Name)
	if s.enableSchedulerEstimator {
		s.schedulerEstimatorWorker.Add(cluster.Name)
	}
}
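
For illustration, here is a minimal, simplified sketch of the suggested addition (hypothetical simplified types; the real scheduler already has the enqueueAffectedBinding and enqueueAffectedClusterBinding helpers quoted later in this thread):

package main

import "fmt"

type scheduler struct{}

func (s *scheduler) enqueueAffectedBinding(cluster string)        { fmt.Println("requeue RBs for", cluster) }
func (s *scheduler) enqueueAffectedClusterBinding(cluster string) { fmt.Println("requeue CRBs for", cluster) }

// onClusterDelete mirrors the tail of deleteCluster with the suggested requeue calls added.
func (s *scheduler) onClusterDelete(clusterName string) {
	// ...existing estimator cleanup elided...
	s.enqueueAffectedBinding(clusterName)        // suggested: reschedule affected ResourceBindings
	s.enqueueAffectedClusterBinding(clusterName) // suggested: reschedule affected ClusterResourceBindings
}

func main() { (&scheduler{}).onClusterDelete("member1") }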

@@ -33,24 +33,29 @@ func (s *Snapshot) GetClusters() []*framework.ClusterInfo {
 func (s *Snapshot) GetReadyClusters() []*framework.ClusterInfo {
 	var readyClusterInfoList []*framework.ClusterInfo
 	for _, c := range s.clusterInfoList {
-		if util.IsClusterReady(&c.Cluster().Status) {
+		if util.IsClusterReady(&c.Cluster().Status) && c.Cluster().DeletionTimestamp.IsZero() {
Contributor

The two conditions could be combined into one: util.IsClusterReady(c.Cluster()); a cluster whose DeletionTimestamp is set is also a special kind of not-ready.

Member Author

@dddddai dddddai Jan 17, 2022

I thought so too, but util.IsClusterReady is also used in other places and I'm afraid updating the func could affect them.
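
One way to get both checks without touching util.IsClusterReady would be a small wrapper, sketched below; isClusterAvailableForScheduling is a hypothetical name (and package placement), not something this PR adds.

package core

import (
	clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
	"github.com/karmada-io/karmada/pkg/util"
)

// isClusterAvailableForScheduling treats a cluster as a scheduling candidate
// only if it is Ready and is not being unjoined (DeletionTimestamp unset).
func isClusterAvailableForScheduling(cluster *clusterv1alpha1.Cluster) bool {
	return util.IsClusterReady(&cluster.Status) && cluster.DeletionTimestamp.IsZero()
}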

// available clusters are ready clusters NOT in `binding.spec.clusters`
availableClusters := readyClusters.Difference(bindClusters)
// candidate clusters are available clusters that fit the placement
candidateClusters, err := g.findClustersThatFit(ctx, g.scheduleFramework, placement, &spec.Resource, clusterInfoSnapshot, availableClusters)
Contributor

There are some problems with filtering based only on availableClusters; a scenario is as follows:

  1. a deployment is applied to clusters A, B, and C.
  2. cluster A adds a taint that the deployment doesn't tolerate
  3. the replicas are scaled up from 5 to 10
  4. the new pods should not be scheduled to cluster A

Member Author

Yeah, that sounds reasonable, but we cannot remove cluster A from binding.spec.clusters directly because that would break the workloads already running in cluster A.

It's kind of tricky, and IMHO it's really an assignReplicas concern; we may implement this in a follow-up.
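
Purely to illustrate that follow-up idea (all names below are hypothetical; this is not the assignReplicas change itself): existing replicas stay where they are, and only the scale-up delta is restricted to clusters that still fit the placement, e.g. whose taints the workload tolerates.

package main

import "fmt"

type assignment struct {
	Name     string
	Replicas int32
}

// scaleUp keeps current assignments untouched and spreads the extra replicas
// round-robin over the clusters that still fit the placement.
func scaleUp(current []assignment, delta int32, stillFits map[string]bool) []assignment {
	out := append([]assignment(nil), current...) // never evict existing replicas
	fitting := make([]int, 0, len(out))
	for i, a := range out {
		if stillFits[a.Name] {
			fitting = append(fitting, i)
		}
	}
	if len(fitting) == 0 {
		return out // nowhere to place the delta; leave the result unchanged
	}
	for n := int32(0); n < delta; n++ {
		out[fitting[int(n)%len(fitting)]].Replicas++
	}
	return out
}

func main() {
	current := []assignment{{"A", 2}, {"B", 2}, {"C", 1}} // 5 replicas in total
	// Cluster A now carries a taint the deployment does not tolerate.
	fmt.Println(scaleUp(current, 5, map[string]bool{"B": true, "C": true}))
}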

@@ -45,28 +45,52 @@ func NewGenericScheduler(
 	}
 }
 
-func (g *genericScheduler) Schedule(ctx context.Context, placement *policyv1alpha1.Placement, spec *workv1alpha2.ResourceBindingSpec) (result ScheduleResult, err error) {
+// evict indicates whether to remove reserved clusters that don't fit the placement
+func (g *genericScheduler) Schedule(ctx context.Context, placement *policyv1alpha1.Placement, spec *workv1alpha2.ResourceBindingSpec, evict bool) (result ScheduleResult, err error) {
Contributor

The function description should introduce the behavior first, then the important parameters.

Member Author

I will update it once I'm back from my trip.
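
A possible shape for the requested doc comment, describing the behavior first and the evict parameter second (the wording is only a suggestion, not what was eventually committed):

// Schedule selects target clusters for the resource referenced by spec
// according to the placement, reusing clusters already recorded in
// spec.Clusters where they still fit.
// evict indicates whether reserved clusters that no longer fit the placement
// (for example, clusters being unjoined) should be removed from the result.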

// reserve unready clusters if failover is disabled
reservedClusters = reservedClusters.Union(unreadyBindClusters)
} else if len(candidateClusters) < unreadyBindClusters.Len() {
// *DON'T evict unready clusters in `binding.spec.clusters` if there are insufficient candidate clusters*
Contributor

I think the number of bindClusters doesn't have to equal the number of clusters in the next scheduling result.

Member Author

This is exactly the previous behavior:

	// TODO: should schedule as much as possible?
	deltaLen := len(spec.Clusters) - len(reservedClusters)
	if len(candidateClusters) < deltaLen {
		// for ReplicaSchedulingTypeDivided, we will try to migrate replicas to the other health clusters
		if placement.ReplicaScheduling == nil || placement.ReplicaScheduling.ReplicaSchedulingType == policyv1alpha1.ReplicaSchedulingTypeDuplicated {
			klog.Warningf("ignore reschedule binding as insufficient available cluster")
			return ScheduleResult{}, nil
		}
	}

Are we gonna change the old behavior? That would be a breaking change

@dddddai
Member Author

dddddai commented Jan 17, 2022

@dddddai Let's discuss this PR in the weekly meeting tomorrow; the "unjoining a target cluster" scenario is a common user behavior.

@huone1 Sorry, I'll be travelling for the next three days; how about discussing it in the Slack channel :)

@dddddai
Member Author

dddddai commented Jan 17, 2022

There is nothing in the deleteCluster function that requeues the RBs related to the deleted cluster; I think we should add the requeue logic.

There's no need to requeue RBs on delete since they are already requeued when the DeletionTimestamp is set:

	// Check if cluster is unjoined
	if !newCluster.DeletionTimestamp.IsZero() {
		// Trigger reschedule when cluster is unjoined
		s.enqueueAffectedBinding(newCluster.Name)
		s.enqueueAffectedClusterBinding(newCluster.Name)

@huone1
Contributor

huone1 commented Jan 17, 2022

@dddddai Let's discuss this PR in the weekly meeting tomorrow; the "unjoining a target cluster" scenario is a common user behavior.

@huone1 Sorry, I'll be travelling for the next three days; how about discussing it in the Slack channel :)

OK, have fun!

Signed-off-by: dddddai <dddwq@foxmail.com>
Signed-off-by: dddddai <dddwq@foxmail.com>
Signed-off-by: dddddai <dddwq@foxmail.com>
@dddddai dddddai closed this Mar 4, 2022
Labels
kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.