Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix a corner case that re-schedule be skipped in case of the cluster becomes not fit. #2912

Merged
merged 1 commit into from
Dec 13, 2022

Conversation

jwcesign
Copy link
Member

@jwcesign jwcesign commented Dec 7, 2022

Signed-off-by: jwcesign jiangwei115@huawei.com

What type of PR is this?
/kind bug

What this PR does / why we need it:

  • Fix resource hangover in member cluster when propagating missmatch

Which issue(s) this PR fixes:
Fixes #2906

Special notes for your reviewer:
none

Does this PR introduce a user-facing change?:

`karmada-scheduler`: Fixed a corner case that re-schedule be skipped in case of the cluster becomes not fit.

@karmada-bot karmada-bot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 7, 2022
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 7, 2022
@jwcesign jwcesign changed the title Fix resource hangover in member cluster when propagating matching changes Fix resource hangover in member cluster when propagating missmatch Dec 7, 2022
@jwcesign
Copy link
Member Author

jwcesign commented Dec 7, 2022

cc @XiShanYongYe-Chang

@XiShanYongYe-Chang
Copy link
Member

Thanks~
/assign

@codecov-commenter
Copy link

codecov-commenter commented Dec 7, 2022

Codecov Report

Merging #2912 (24bc5a0) into master (69bd4ec) will increase coverage by 0.65%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #2912      +/-   ##
==========================================
+ Coverage   37.87%   38.52%   +0.65%     
==========================================
  Files         201      205       +4     
  Lines       18494    18815     +321     
==========================================
+ Hits         7004     7249     +245     
- Misses      11081    11137      +56     
- Partials      409      429      +20     
Flag Coverage Δ
unittests 38.52% <0.00%> (+0.65%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/scheduler/core/generic_scheduler.go 0.00% <0.00%> (ø)
pkg/scheduler/event_handler.go 27.27% <0.00%> (-0.70%) ⬇️
...ceinterpreter/configurableinterpreter/luavm/lua.go 57.14% <0.00%> (-0.12%) ⬇️
pkg/util/rbac.go 100.00% <0.00%> (ø)
pkg/search/proxy/store/util.go 93.36% <0.00%> (ø)
pkg/karmadactl/interpret/execute.go 63.63% <0.00%> (ø)
pkg/karmadactl/interpret/interpret.go 56.00% <0.00%> (ø)
...einterpreter/configurableinterpreter/luavm/kube.go 67.39% <0.00%> (ø)
pkg/karmadactl/interpret/check.go 91.17% <0.00%> (ø)
pkg/util/names/names.go 96.55% <0.00%> (+0.44%) ⬆️
... and 1 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@RainbowMango
Copy link
Member

/retitle fix a corner case that re-schedule be skipped in case of the cluster becomes not fit.

@karmada-bot karmada-bot changed the title Fix resource hangover in member cluster when propagating missmatch fix a corner case that re-schedule be skipped in case of the cluster becomes not fit. Dec 7, 2022
@jwcesign
Copy link
Member Author

jwcesign commented Dec 7, 2022

cc @chaunceyjiang , Please take a look, I've heard that this question has other context.

pkg/scheduler/core/generic_scheduler.go Outdated Show resolved Hide resolved
test/e2e/propagationpolicy_test.go Outdated Show resolved Hide resolved
pkg/scheduler/core/generic_scheduler.go Outdated Show resolved Hide resolved
test/e2e/rescheduling_test.go Outdated Show resolved Hide resolved
test/e2e/framework/cluster.go Show resolved Hide resolved
test/e2e/rescheduling_test.go Show resolved Hide resolved
@RainbowMango RainbowMango added this to the v1.5 milestone Dec 9, 2022
@chaunceyjiang
Copy link
Member

Oh, I missed this pr.

cc @chaunceyjiang , Please take a look, I've heard that this question has other context.

yes ,refer to #2261

@jwcesign jwcesign force-pushed the fix-pp-rb branch 2 times, most recently from 452595a to 24bc5a0 Compare December 9, 2022 06:56
Copy link
Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some typo.
Other parts LTGM

test/e2e/framework/cluster.go Outdated Show resolved Hide resolved
test/e2e/framework/cluster.go Outdated Show resolved Hide resolved
test/e2e/rescheduling_test.go Outdated Show resolved Hide resolved
@jwcesign jwcesign force-pushed the fix-pp-rb branch 3 times, most recently from 95beefb to ba18160 Compare December 12, 2022 07:10
@XiShanYongYe-Chang
Copy link
Member

/lgtm
/assign @Garrybest

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Dec 12, 2022
pkg/scheduler/event_handler.go Show resolved Hide resolved
@@ -476,13 +478,14 @@ func (s *Scheduler) scheduleResourceBinding(resourceBinding *workv1alpha2.Resour
}

scheduleResult, err := s.Algorithm.Schedule(context.TODO(), &placement, &resourceBinding.Spec, &core.ScheduleAlgorithmOption{EnableEmptyWorkloadPropagation: s.enableEmptyWorkloadPropagation})
if err != nil {
if err != nil && reflect.TypeOf(err) != reflect.TypeOf(&framework.FitError{}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid we have a situation here. When a cluster is not ready for a while, the cluster will be added a NoSchedule taint. And before, a workload is running on this cluster and happens to scale up(may be triggered by users or hpa-controller), however, no cluster is fit due to the taint. I guess the previous replicas will be evicted here because now fit error is skipped and result will be patched to RB.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to

if bindingSpec.TargetContains(cluster.Name) {
, the previous replicas will not be evicted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. I didn't realize taint plugin is updated here. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the FitError will be dedicated used in the situation that no cluster fit, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, then the fitError information will be shown in the event and RB's status, the same behavior as k8s.

@Garrybest
Copy link
Member

/lgtm

@jwcesign
Copy link
Member Author

cc @RainbowMango for final checking

@RainbowMango
Copy link
Member

@XiShanYongYe-Chang Have you confirmed the E2E part?

@XiShanYongYe-Chang
Copy link
Member

@XiShanYongYe-Chang Have you confirmed the E2E part?

Yes, I've confirmed it.

@@ -476,13 +478,14 @@ func (s *Scheduler) scheduleResourceBinding(resourceBinding *workv1alpha2.Resour
}

scheduleResult, err := s.Algorithm.Schedule(context.TODO(), &placement, &resourceBinding.Spec, &core.ScheduleAlgorithmOption{EnableEmptyWorkloadPropagation: s.enableEmptyWorkloadPropagation})
if err != nil {
if err != nil && reflect.TypeOf(err) != reflect.TypeOf(&framework.FitError{}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err != nil && reflect.TypeOf(err) != reflect.TypeOf(&framework.FitError{}) {
var noClusterFit *framework.FitError
// in case of no cluster fit, can not return but continue to patch(cleanup) the result.
if err != nil && !errors.As(err, &noClusterFit) {

Does this make it a little clearer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@karmada-bot karmada-bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 13, 2022
…nges

Signed-off-by: jwcesign <jiangwei115@huawei.com>
@RainbowMango
Copy link
Member

Have you figured out the root cause of the failing tests? I didn't see any change from the force-push.

@jwcesign
Copy link
Member Author

Have you figured out the root cause of the failing tests? I didn't see any change from the force push.

it's my mistake, should be if err != nil && !errors.As(err, &noClusterFit) { instead of if err != nil && errors.As(err, &noClusterFit) {.

Copy link
Member

@RainbowMango RainbowMango left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Please cherry-pick this patch to release branches.
And we are going to tag a new release after that.

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Dec 13, 2022
@karmada-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 13, 2022
@karmada-bot karmada-bot merged commit f029394 into karmada-io:master Dec 13, 2022
karmada-bot added a commit that referenced this pull request Dec 14, 2022
…-upstream-release-1.4

Automated cherry pick of #2912: fix a corner case that re-schedule be skipped in case of the cluster becomes not fit
karmada-bot added a commit that referenced this pull request Dec 15, 2022
…-upstream-release-1.3

Automated cherry pick of #2912: fix a corner case that re-schedule be skipped in case of the cluster becomes not fit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should a label mismatch in PropagationPolicy placement delete resources from member clusters?
7 participants