KEP: Node Fencing Specification #2763
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Hi @beekhof. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test.
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files; approvers can indicate their approval by writing a comment. The full list of commands accepted by this bot can be found here, and the pull request process is described here.
/cc @orainxiong |
@NickrenREN: GitHub didn't allow me to request PR reviews from the following users: orainxiong. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
As suggested by the bot... /assign @jdumars
I agree, fencing is a fundamental and useful building block. Do we cover pod fencing here, or only node fencing? Both seem useful. Superficially this proposal looks sane (although I've not reviewed it in detail). Specifically: the ability to request that a node/pod be fenced, and confirmation that the node/pod has been successfully fenced. A useful addition would be some intermediate state where the node/pod has not actually been fenced yet, but is guaranteed not to come alive again (for example, the machine is currently down, might be rebooted, but promises not to do anything until it has checked that it was not asked to fence while down). Admittedly that's hard to do in practice, but might be possible. My 2c.
> Three additional NodeConditions and taints:
> * a `node.alpha.kubernetes.io/fencingtriaged` taint corresponding to NodeCondition `FencingTriaged` being
In general we do not want to put the alpha/beta in the name any more. It causes too much user pain for too little value.
ack
> Admins may also set NodeCondition `FencingRequired` to `True` to manually
> trigger fencing.
>
> Once fencing has been initiated, the NodeCondition `FencingComplete` should
"should" is a bad word.
Who will set this?
I would expect this to be handled by the fencing implementation.
I will replace s/should/must/
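To make that lifecycle concrete, here is a minimal sketch in Python. It is purely illustrative and not the KEP's actual API: the `Node` class and function names are hypothetical, and a real fencing implementation would patch NodeCondition objects through the API server. It encodes the point from the review above: only the fencing implementation sets `FencingComplete`, and setting it is a "must", not a "should".

```python
# Illustrative sketch only -- NOT the KEP's actual API. Class and function
# names are hypothetical; a real implementation would update NodeCondition
# objects via the API server.

class Node:
    def __init__(self, name):
        self.name = name
        # NodeCondition statuses are the strings "True"/"False"/"Unknown".
        self.conditions = {
            "FencingRequired": "False",
            "FencingTriaged": "False",
            "FencingComplete": "False",
        }

def request_fencing(node):
    """An admin or health monitor marks the node as needing fencing."""
    node.conditions["FencingRequired"] = "True"

def complete_fencing(node):
    """Set by the fencing implementation once the node is confirmed safe
    (e.g. a power-off has been acknowledged). Per the review, this MUST
    happen, so we fail loudly if fencing was never requested."""
    if node.conditions["FencingRequired"] != "True":
        raise RuntimeError("fencing was never requested for " + node.name)
    node.conditions["FencingComplete"] = "True"

node = Node("worker-1")
request_fencing(node)
complete_fencing(node)
# node.conditions["FencingComplete"] is now "True"
```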
> ## Drawbacks
>
> * In deleting the Node object, we are also removing the history of events that
Most of this doc does not talk about delete ... ?
Right, this paragraph is missing some much-needed context. The rest of the document covers how implementations should identify nodes in need of fencing and indicate progress towards the node being made safe. However, once the node is safe, something must use that information to convey to the rest of the system that affected workloads are now safe to run elsewhere. One approach would be having the fencing logic delete Node objects; the other extreme is for every controller to monitor for the new Node Conditions and react appropriately. I do not yet have a good sense of which part of the spectrum would be preferred.
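The "every controller watches the new Node Conditions" end of that spectrum could look roughly like the hedged sketch below. `safe_to_reschedule` is a hypothetical helper, not an existing Kubernetes API: the idea is simply that a controller treats a failed node's workloads as recoverable elsewhere only once `FencingComplete` is confirmed `"True"`, instead of waiting for the Node object to be deleted.

```python
# Hypothetical helper, not an existing Kubernetes API. node_conditions
# maps condition type -> status string, mirroring NodeCondition statuses.

def safe_to_reschedule(node_conditions):
    """A controller may recover workloads elsewhere only after the
    fencing implementation has confirmed the node is safe."""
    return node_conditions.get("FencingComplete") == "True"

# Fencing requested but not yet confirmed: keep waiting.
pending = {"FencingRequired": "True", "FencingComplete": "False"}
# Fencing confirmed: workloads may be started elsewhere.
fenced = {"FencingRequired": "True", "FencingComplete": "True"}
```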
> Three additional NodeConditions and taints:
> * a `node.alpha.kubernetes.io/fencingtriaged` taint corresponding to NodeCondition `FencingTriaged` being
What does this indicate? What does "triaged" mean in this context? If a node has been triaged for fencing, this taint will prevent new work from landing there, even if fencing is not required -- why?
The intention is that this would be an intermediate state that indicates the node is in some bad state, but policy dictates that it should not be fenced (at least not yet).
Whether this state should inhibit new work is a good question. I can see arguments both ways.
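One way to frame that open question is as a policy knob that selects the taint effect for the triaged state. The `inhibit_new_work` flag and the function below are assumptions for the sake of the sketch, though `NoSchedule` and `PreferNoSchedule` are real Kubernetes taint effects.

```python
# Illustrative only: the open question of whether "triaged" should keep
# new work off the node, expressed as a hypothetical policy function.

def triaged_taint_effect(inhibit_new_work):
    # NoSchedule keeps all new pods off a triaged node; PreferNoSchedule
    # merely discourages scheduling, leaving the node usable while policy
    # decides whether it should actually be fenced.
    return "NoSchedule" if inhibit_new_work else "PreferNoSchedule"
```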
> ## Proposal
>
> Three additional NodeConditions and taints:
You need to talk about the flavor of taint - what is the effect?
It might be good to introduce a real-world example and use that to motivate the design points.
Will do
@beekhof: PR needs rebase.
We do not consider pod fencing in the absence of a node-level failure. Fencing at both levels is useful, but the focus here is on node fencing to recover from node-level failures, though node fencing does imply pod fencing for anything unfortunate enough to be located there.
Fencing is the act of putting the machine into a safe state. Without having been fenced, you can't really know that it has actually been down, which is needed to respect this kind of please-be-good-after-reboot logic. It could be disconnected from the network, or kubelet could be hung; both would look identical to "currently down" from the master's perspective. What we see in traditional HA solutions is: Our implementation is already planning for a), but perhaps the switch for that mechanism deserves to be part of this specification rather than an implementation detail.
REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus. Any questions regarding this move should be directed to that thread and not asked on GitHub. |
@derekwaynecarr Should we try to merge this KEP prior to the Nov 30 deadline? If so, what would you like to see added/changed? |
KEPs have moved to k/enhancements. Any questions regarding this move should be directed to that thread and not asked on GitHub. |
@justaugustus: Closed this PR.
Submitting early to solicit feedback