add NoExecute taint manager #1945
/assign
Thanks @RainbowMango. The taint manager in
Cool!
It's really enjoyable to look at code like this.
Thanks @Garrybest, and sorry for letting this sit.
The e2e error does not make any sense. Maybe we need a retest.
Right. Echo here for @XiShanYongYe-Chang to check.
Logs are lost. Access to
No big deal, maybe just some occasional errors.
Looks pretty good!
Could we move forward?
I just talked to @XiShanYongYe-Chang about this; he will work on the issue I mentioned above (#1945 (comment)). I don't want the So, can we just hold for a while? I guess we can include this in the next release (v1.3).
OK.
ClusterTaintEvictionRetryFrequency: 10 * time.Second,
ConcurrentReconciles: 3,
}
if err := taintManager.SetupWithManager(mgr); err != nil {
The NoExecuteTaintManager is essentially a controller. Do you think we should keep the ability to disable it?
I think NoExecuteTaintManager is a part of the cluster controller.
Kubernetes takes TaintManager as a part of NodeLifecycleController. It provides an option --enable-taint-manager in NodeLifecycleController to enable the taint manager.
So what about adding an option --enable-taint-manager as well?
I agree. --enable-taint-manager defaults to true. But I think the feature is part of Failover, so I suggest checking both the feature gate and the flag.
By the way, the feature gate will be removed eventually.
But, I think the feature is part of Failover.
Actually, I think this feature is independent. We implement the NoExecute taint manager, so another taint effect is introduced into Karmada. Users may taint a cluster manually themselves. Meanwhile, Failover does depend on this feature, right?
Oh, the feature gate should be added in #1781. Let me make a change here.
Actually, I think this feature is independent. We implement NoExecute taint manager, so another effect of taint is introduced into karmada. Users may taint the cluster by themselves manually. Meanwhile, Failover does depend on this feature, right?
After the eviction, the scheduler would select another cluster to fill the slot (if possible), which is exactly the Failover scenario. For the case where users taint the cluster, I think it's a special scenario in which users initiate the failover themselves.
By the way, how do you think about the failover feature? What are the use cases?
Oh, the feature gate should be added in #1781. Let me make a change here.
I'm OK with it. The feature gate means the feature has just been introduced and may not be mature enough to be enabled by default, right?
By the way, how do you think about the failover feature? What are the use cases?
- The cluster has something wrong; its ready condition becomes not ready or unknown.
- The cluster controller tries to add a NoExecute taint after 5 minutes (enhance cluster lifecycle management: add taints for the clusters which are unhealthy for a period of time #1781).
- The taint manager removes the scheduling result of the failed cluster in rb.spec.clusters (add NoExecute taint manager #1945). It's a little bit like what descheduler does.
- The scheduler tries to scale up these evicted replicas.
I'm ok with it. The feature gate means the feature is just introduced, maybe not mature enough to default enable, right?
Not exactly, we just try to move Failover from scheduler to controller-manager.
Not exactly, we just try to move Failover from scheduler to controller-manager.
I see the plan at #1762. I think removing failover from the scheduler means relieving the scheduler of some responsibilities and letting it focus on selecting clusters for workloads according to the PropagationPolicy. Are we on the same page?
Anyway, let's have a talk at next week's meeting. Are you available?
Cool. No problem.
Generally looks good to me.
Even though I hope this issue could be solved before this PR, in the meantime I don't want this to be delayed so much, so I suggest moving this forward now, since we have a lot of things to do that are addressed by #1945.
Does that make sense to you? @Garrybest
Thanks, I will add a new option and let us move forward.
Please rebase your branch, as we have been fixing the fake testing recently.
Signed-off-by: Garrybest <garrybest@foxmail.com>
cdceb6f to c2b3c85
Signed-off-by: Garrybest <garrybest@foxmail.com>
Looks good to me.
/lgtm
/approve
/hold for no release notes?
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: RainbowMango
/hold cancel
OK.
Signed-off-by: Garrybest <garrybest@foxmail.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Add a taint manager for NoExecute taints.
Which issue(s) this PR fixes:
Fixes part of #1762
Special notes for your reviewer:
Does this PR introduce a user-facing change?: