Proposal: Node fence design #1416
Signed-off-by: Yaniv Bronhaim <ybronhei@redhat.com>
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
related to #124
/ok-to-test
@kubernetes/sig-node-proposals
/sig node
@bronhaim: GitHub didn't allow me to request PR reviews from the following users: nirs, aglitke. Note that only kubernetes members can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
### Implementation
#### Fence Controller
- The fence controller is a stateless pod instance running in a cluster that supports fencing.
- The controller identifies unresponsive nodes by checking their apiserver objects; once a node becomes "not ready", a fence treatment is triggered by posting a fence CRD object, which initiates the fence flow in the Fence Executor (sketched below).
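For illustration only, here is a minimal sketch of that detection loop in Go with client-go. The in-cluster config, the 30-second poll, and the `createFenceCR` helper (standing in for posting the fence CRD object) are assumptions, not details fixed by the proposal:

```go
package main

import (
	"context"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			log.Printf("listing nodes: %v", err)
		} else {
			for _, node := range nodes.Items {
				for _, cond := range node.Status.Conditions {
					// Any node whose Ready condition is not True is a fencing candidate.
					if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
						createFenceCR(node.Name)
					}
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}

// createFenceCR is a hypothetical placeholder for posting the fence custom
// resource that the Fence Executor consumes; the proposal does not pin down
// its schema.
func createFenceCR(nodeName string) {
	log.Printf("would create a fence CR for node %s", nodeName)
}
```

A real controller would presumably use informers and leader election rather than polling, which is exactly the concern raised in the review comments below.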
What will happen if the controller's or executor's node becomes unresponsive?
This looks like a leader election process.
The current behavior in the code base uses Jobs to execute the fence script. In this update I changed the implementation section to fit the current flow.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bronhaim. Assign the PR to them by writing `/assign @bronhaim` in a comment when ready. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing `/approve` in a comment.
- How to trigger fence devices/APIs: general template parameters (e.g. cluster UPS address and credentials) and per-node override values for specific fields (e.g. ports related to the node)
- How to run a fence flow for each node (each node can be attached to several fence devices/APIs, and the fence flow can differ)

The following is stored in `ConfigMap` objects (sketched after the review thread below):
Is everyone on-board with config maps? I'd have thought we'd be creating a lot of noise for that particular set of objects, particularly for the bare metal case. By creating a CRD for this information we isolate it from regular usage and it allows us to optimise the format for this use-case.
I agree. However, a new CRD means a new object in the cluster with its own version and specific services. For key-value fields, the built-in ConfigMap structure seems good enough for our purposes. How would you optimize the format with a CRD?
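For what it's worth, a ConfigMap along the lines described in the diff above could be written with client-go as below. This is a sketch only: the key names (`agent`, `address`, `secret`, `port`), the `fence-config-<node>` naming scheme, and the `kube-system` namespace are assumptions, not part of the proposal:

```go
package fence

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// writeFenceConfig stores fence parameters for one node, following the
// template-plus-per-node-overrides split described in the proposal.
func writeFenceConfig(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "fence-config-" + nodeName, // hypothetical naming scheme
			Namespace: "kube-system",              // illustrative choice
		},
		Data: map[string]string{
			// General template parameters shared by the cluster,
			// e.g. the UPS address and a reference to its credentials.
			"agent":   "fence_apc",
			"address": "10.0.0.5",
			"secret":  "fence-ups-credentials",
			// Per-node override for a node-specific field,
			// e.g. the UPS outlet this node is wired to.
			"port": "7",
		},
	}
	_, err := client.CoreV1().ConfigMaps(cm.Namespace).Create(ctx, cm, metav1.CreateOptions{})
	return err
}
```

The same key-value payload would fit a CRD just as well; the difference discussed above is mainly about isolation and schema validation, not expressiveness.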
Some ways to prevent fencing storms:
- Skip fencing if a certain percentage of hosts in the cluster is non-responsive (will help with issue #2 above; see the sketch after the discussion below)
- Skip fencing if it is detected that the host cannot connect to storage.
The act of checking and/or making sure it's not currently connected to storage is itself a fencing operation, so the above doesn't seem completely accurate.
A fencing storm is a situation in which fencing for several nodes in the cluster is triggered at the same time, due to an environmental issue.
Examples:
1. Switch failure - a switch used to connect a few nodes to the environment fails, and there is no redundancy in the network environment. In such a case, the nodes will be reported as unresponsive while they are still alive and kicking, perhaps providing service through other networks they are attached to. If we fence those nodes, a few services will be offline until restarted, while it might not have been necessary.
Sort of; there are downsides either way (deciding whether to fence or not). Have a read of http://blog.clusterlabs.org/blog/2018/two-node-problems
The problem is that you have no way to know if it is a switch failure.
So either:
- you assume it's not a network failure, initiate fencing, and potentially kill a healthy Pod (does it really count as healthy if no-one can reach it, though?) so it can be recovered, or
- you assume it is a network failure, don't initiate fencing, and risk that there are multiple copies of some Pods.
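To make the percentage-based storm guard from the list above concrete, here is a hedged sketch; the 50% cutoff and the `fencingAllowed` name are illustrative assumptions, not part of the proposal:

```go
package fence

import corev1 "k8s.io/api/core/v1"

// stormThreshold is an illustrative cutoff: if half or more of the cluster
// is unready, the failures are more likely environmental (e.g. a switch
// outage) than many independent node failures.
const stormThreshold = 0.5

// fencingAllowed returns false when the share of NotReady nodes suggests a
// fencing storm rather than isolated node failures.
func fencingAllowed(nodes []corev1.Node) bool {
	if len(nodes) == 0 {
		return false
	}
	unready := 0
	for _, n := range nodes {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady && c.Status != corev1.ConditionTrue {
				unready++
				break
			}
		}
	}
	return float64(unready)/float64(len(nodes)) < stormThreshold
}
```

Note that such a guard deliberately errs on the side of not fencing: it trades slower recovery of genuinely dead nodes for a lower risk of fencing healthy ones during a network outage, which is the same trade-off described in the comment above.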
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
What is the status on this one?
@thockin I'm in the process of trying to resurrect it. After additional consultation we're moving to a Machine API-based approach.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@beekhof I'm just wondering about your last comment
@rob-nicholson I resurrected it in a different form as a KEP - #2763. Rather than preaching a specific implementation, it tried to describe the interaction points that solutions should respect. In the end it didn't get a lot of attention and was later closed when KEPs moved to a new repo. We also have an implementation based on this work in progress at https://github.com/openshift/machine-api-operator/tree/master/pkg/controller/machinehealthcheck which we are actively working on.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I know that some people are working on this, thus reopening it
/reopen
@fabiand: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Propose a fencing mechanism for Kubernetes clusters that allows managing cluster nodes with advanced fence actions.