
add NoExecute taint manager #1945

Merged: 2 commits, Jul 18, 2022
Conversation

Garrybest
Member

@Garrybest Garrybest commented Jun 1, 2022

Signed-off-by: Garrybest garrybest@foxmail.com

What type of PR is this?
/kind feature

What this PR does / why we need it:
Add a taint manager to handle the `NoExecute` taint effect.

Which issue(s) this PR fixes:
Fixes part of #1762

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

`karmada-controller-manager`: Introduced `taint manager` to cluster controller to evict workloads from clusters with `no-execute` taint. 

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 1, 2022
@karmada-bot karmada-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 1, 2022
@RainbowMango
Member

/assign
But I'm afraid I don't have enough time to look at it.

@Garrybest
Member Author

Thanks @RainbowMango. The taint manager in k/k is really complicated. I have tried my best to keep our taint manager concise. Let me invite more reviewers to check.

/cc @lynnsong @XiShanYongYe-Chang @huone1

Member

@dddddai dddddai left a comment


Cool!

Member

@RainbowMango RainbowMango left a comment


It's really enjoyable to look at code like this.

Thanks @Garrybest, and sorry for letting this sit.

@Garrybest Garrybest changed the title add NoExecute taint manager [WIP] add NoExecute taint manager Jun 13, 2022
@karmada-bot karmada-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 13, 2022
@Garrybest Garrybest changed the title [WIP] add NoExecute taint manager add NoExecute taint manager Jun 13, 2022
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 13, 2022
@Garrybest
Member Author

The e2e error does not make any sense. Maybe we need a retest.

@RainbowMango
Member

Right.

Echo here for @XiShanYongYe-Chang to check.
https://github.com/karmada-io/karmada/runs/6858581736?check_suite_focus=true

[namespace auto-provision] namespace auto-provision testing joining new cluster
namespace should be propagated to new created clusters[needCreateCluster]
/home/runner/work/karmada/karmada/test/e2e/namespace_test.go:131
------------------------------
• [FAILED] [105.817 seconds]
[namespace auto-provision] namespace auto-provision testing
/home/runner/work/karmada/karmada/test/e2e/namespace_test.go:24
joining new cluster [BeforeEach][needCreateCluster]
/home/runner/work/karmada/karmada/test/e2e/namespace_test.go:84
    namespace should be propagated to new created clusters
/home/runner/work/karmada/karmada/test/e2e/namespace_test.go:131
Begin Captured StdOut/StdErr Output >>
    E0613 09:56:15.873001   42505 unjoin.go:277] Failed to delete cluster object. cluster name: member-e2e-blx, error: Delete "https://172.18.0.5:5443/apis/cluster.karmada.io/v1alpha1/clusters/member-e2e-blx": dial tcp 172.18.0.5:5443: connect: connection refused
    E0613 09:56:15.873109   42505 unjoin.go:166] Failed to delete cluster object. cluster name: member-e2e-blx, error: Delete "https://172.18.0.5:5443/apis/cluster.karmada.io/v1alpha1/clusters/member-e2e-blx": dial tcp 172.18.0.5:5443: connect: connection refused

@XiShanYongYe-Chang
Member

XiShanYongYe-Chang commented Jun 13, 2022

The logs are lost. Based on the remaining error messages, access to karmada-apiserver was refused:

Delete "https://172.18.0.5:5443/apis/cluster.karmada.io/v1alpha1/clusters/member-e2e-blx": dial tcp 172.18.0.5:5443: connect: connection refused
Get "https://172.18.0.5:5443/api/v1/namespaces/karmada-cluster": dial tcp 172.18.0.5:5443: connect: connection refused
Delete "https://172.18.0.5:5443/api/v1/namespaces/karmadatest-zzk": dial tcp 172.18.0.5:5443: connect: connection refused

@Garrybest
Member Author

No big deal, maybe just some occasional errors.

@dddddai
Member

dddddai commented Jun 17, 2022

Looks pretty good!

@Garrybest
Member Author

Could we move forward?

@RainbowMango
Member

I just talked to @XiShanYongYe-Chang about this; he will work on the issue I mentioned above (#1945 (comment)).

I don't want the replica/cluster to be removed directly from `binding.spec.clusters`, as that would lead to the replicas being removed immediately.

So, can we just hold for a while? I guess we can include this in the next release (v1.3).

@Garrybest
Member Author

OK.

ClusterTaintEvictionRetryFrequency: 10 * time.Second,
ConcurrentReconciles: 3,
}
if err := taintManager.SetupWithManager(mgr); err != nil {
Member

The NoExecuteTaintManager is essentially a controller, do you think we should reserve the capacity of disabling it?

Member Author

I think NoExecuteTaintManager is a part of cluster controller.

Kubernetes takes TaintManager as a part of NodeLifecycleController. It provides an option --enable-taint-manager in NodeLifecycleController to enable taint manager.

So what about adding an option --enable-taint-manager as well?

Member

I agree. The `--enable-taint-manager` flag defaults to true. But I think the feature is part of Failover, so I suggest checking both the feature gate and the flag.
By the way, the feature gate will be removed eventually.

Member Author

But I think the feature is part of Failover.

Actually, I think this feature is independent. We implement NoExecute taint manager, so another effect of taint is introduced into karmada. Users may taint the cluster by themselves manually. Meanwhile, Failover does depend on this feature, right?

Member Author

Oh, the feature gate should be added in #1781. Let me make a change here.

Member

Actually, I think this feature is independent. We implement NoExecute taint manager, so another effect of taint is introduced into karmada. Users may taint the cluster by themselves manually. Meanwhile, Failover does depend on this feature, right?

After the eviction, the scheduler would select another cluster to fill the slot (if possible), which is exactly the scenario of Failover. For the case where users taint the cluster themselves, I think it's a special scenario in which users initiate the failover manually.

By the way, what do you think about the failover feature? What are the use cases?

Member

Oh, the feature gate should be added in #1781. Let me make a change here.

I'm OK with it. The feature gate means the feature is just introduced and maybe not mature enough to be enabled by default, right?

Member Author

By the way, what do you think about the failover feature? What are the use cases?

  1. Cluster has something wrong; its ready condition becomes NotReady or Unknown.
  2. Cluster controller tries to add a NoExecute taint after 5 minutes (enhance cluster lifecycle management: add taints for the clusters which are unhealthy for a period of time #1781).
  3. Taint manager removes the scheduling result of the failed cluster from rb.spec.clusters (add NoExecute taint manager #1945). It's a little bit like what descheduler does.
  4. Scheduler tries to scale up these evicted replicas.

I'm OK with it. The feature gate means the feature is just introduced and maybe not mature enough to be enabled by default, right?

Not exactly, we just try to move Failover from scheduler to controller-manager.

Member

Not exactly, we just try to move Failover from scheduler to controller-manager.

I see the plan at #1762. I think removing failover from the scheduler means relieving some responsibilities from the scheduler and letting it focus on selecting clusters for workloads according to PropagationPolicy. Are we on the same page?

Anyway, let's have a talk at next week's meeting, are you available?

Member Author

Cool. No problem.

Member

@RainbowMango RainbowMango left a comment

Generally looks good to me.

Even though I hoped this issue could be solved before this PR, I don't want this to be delayed too much, so I suggest moving this forward now, since we have a lot of things to do that build on #1945.

Does that make sense to you? @Garrybest


@Garrybest
Member Author

Thanks, I will add a new option and let us move forward.

@RainbowMango
Member

Please rebase your branch, as we have been fixing the fake testing recently.

Signed-off-by: Garrybest <garrybest@foxmail.com>
@Garrybest Garrybest force-pushed the pr_taint branch 2 times, most recently from cdceb6f to c2b3c85 Compare July 15, 2022 03:28
Signed-off-by: Garrybest <garrybest@foxmail.com>
Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Looks good to me.

Member

@RainbowMango RainbowMango left a comment

/lgtm
/approve

/hold
for: no release notes?

@karmada-bot karmada-bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Jul 18, 2022
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 18, 2022
@RainbowMango
Member

/hold cancel
PS: I updated the release notes to the PR description.

@karmada-bot karmada-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 18, 2022
@karmada-bot karmada-bot merged commit d626fcc into karmada-io:master Jul 18, 2022
@Garrybest Garrybest deleted the pr_taint branch July 18, 2022 09:17
@Garrybest
Member Author

OK.

Garrybest pushed a commit to Garrybest/karmada that referenced this pull request Aug 25, 2022
add NoExecute Taints
karmada-io#1945

Signed-off-by: Garrybest <garrybest@foxmail.com>