
Proposal: Node fence design #1416

Closed

Conversation


@bronhaim bronhaim commented Nov 19, 2017

Proposes a fencing mechanism for Kubernetes clusters that allows managing cluster nodes with advanced fence actions.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 19, 2017
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 19, 2017
@k8s-github-robot k8s-github-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Nov 19, 2017
@rootfs
Contributor

rootfs commented Nov 19, 2017

related to #124

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 19, 2017
@rootfs
Contributor

rootfs commented Nov 19, 2017

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 19, 2017
@rootfs
Contributor

rootfs commented Nov 19, 2017

@kubernetes/sig-node-proposals

@k8s-ci-robot k8s-ci-robot added the kind/design Categorizes issue or PR as related to design. label Nov 19, 2017
@rootfs
Contributor

rootfs commented Nov 19, 2017

/sig node

@k8s-ci-robot k8s-ci-robot removed the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 20, 2017
@k8s-ci-robot
Contributor

@bronhaim: GitHub didn't allow me to request PR reviews from the following users: nirs, aglitke.

Note that only kubernetes members can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @nirs @aglitke

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

### Implementation
#### Fence Controller
- The fence controller is a stateless pod instance running in a cluster that supports fencing.
- The controller identifies unresponsive nodes by checking their apiserver objects; once a node becomes “not ready”, a fence treatment is triggered by posting a fence CRD object, which initiates the fence flow handled by the Fence Executor.
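For illustration only, here is a minimal sketch of what that detect-and-post step could look like in Go with a recent client-go. The NodeFence group/version/resource and its spec fields are placeholders (the proposal does not fix a schema here), and the loop is simplified to a single pass:

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Hypothetical CRD coordinates; the actual proposal would define these.
var nodeFenceGVR = schema.GroupVersionResource{
	Group: "fence.example.io", Version: "v1alpha1", Resource: "nodefences",
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)
	ctx := context.Background()

	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			// Treat a node whose Ready condition is not True as unresponsive.
			if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
				// Post a NodeFence object; the Fence Executor would pick it up.
				nf := &unstructured.Unstructured{Object: map[string]interface{}{
					"apiVersion": "fence.example.io/v1alpha1",
					"kind":       "NodeFence",
					"metadata":   map[string]interface{}{"name": "fence-" + node.Name},
					"spec":       map[string]interface{}{"node": node.Name, "step": "power-management"},
				}}
				if _, err := dyn.Resource(nodeFenceGVR).Create(ctx, nf, metav1.CreateOptions{}); err != nil {
					log.Printf("failed to create NodeFence for %s: %v", node.Name, err)
				}
			}
		}
	}
}
```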
Contributor

What will happen if the controller's or executor's node becomes unresponsive?

Contributor
Looks like a leader election process.

@rootfs
Contributor

rootfs commented Dec 19, 2017

cc @nirs @aglitke

The current behavior in the code base uses Jobs to execute the fence script. In this update I changed the implementation section to fit the current flow.
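For context, a hedged sketch (not the PR's actual code) of how a fence script could be launched as a Kubernetes Job via client-go; the namespace, image, fence agent, and arguments below are illustrative only:

```go
package main

import (
	"context"
	"log"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	backoff := int32(0) // a fence attempt should not be retried blindly
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "fence-node-1", Namespace: "kube-system"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoff,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "fence-agent",
						Image:   "example/fence-agents:latest", // hypothetical image
						Command: []string{"fence_ipmilan"},      // hypothetical agent choice
						Args:    []string{"--ip", "10.0.0.5", "--action", "reboot"},
					}},
				},
			},
		},
	}

	// Create the Job and let Kubernetes report its success or failure status.
	if _, err := clientset.BatchV1().Jobs(job.Namespace).Create(context.Background(), job, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("fence job created")
}
```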
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bronhaim
We suggest the following additional approver: dchen1107

Assign the PR to them by writing /assign @dchen1107 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

- How to trigger fence devices/APIs - general template parameters (e.g. cluster UPS address and credentials) and per-node override values for specific fields (e.g. ports related to the node)
- How to run a fence flow for each node (each node can be attached to several fence devices/APIs, and the fence flow can differ)

The following is stored in `ConfigMap` objects:
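The exact keys are not shown in this excerpt; as a hedged sketch, the controller might look up general template parameters and per-node overrides from such a ConfigMap along these lines (the ConfigMap name and key names are hypothetical):

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical ConfigMap holding general template parameters plus per-node overrides.
	cm, err := clientset.CoreV1().ConfigMaps("kube-system").
		Get(context.Background(), "fence-config", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// General template parameters, e.g. the cluster UPS address and credentials.
	upsAddress := cm.Data["ups.address"]
	upsSecretRef := cm.Data["ups.credentials-secret"]

	// Per-node override, e.g. the UPS port a specific node is wired to,
	// falling back to a cluster-wide default when no override is present.
	nodePort, ok := cm.Data["node.node-1.ups-port"]
	if !ok {
		nodePort = cm.Data["ups.default-port"]
	}

	fmt.Printf("fencing node-1 via UPS %s (credentials: %s), port %s\n",
		upsAddress, upsSecretRef, nodePort)
}
```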

Is everyone on-board with config maps? I'd have thought we'd be creating a lot of noise for that particular set of objects, particularly for the bare metal case. By creating a CRD for this information we isolate it from regular usage and it allows us to optimise the format for this use-case.

Author

I agree. However, a new CRD means a new object in the cluster with a version and specific services. For key-value fields, using the built-in ConfigMap structure seems good enough for our purposes. How would you optimize the format with a CRD?


Some ways to prevent fencing storms:
- Skip fencing if a selected percentage of hosts in the cluster is non-responsive (will help with issue #2 above); see the sketch below.
- Skip fencing if it is detected that the host cannot connect to storage.
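As a rough sketch of the first rule (the threshold value and helper name are illustrative, not taken from the proposal), the controller could count non-Ready nodes before triggering any fence action:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// shouldSkipFencing returns true when the fraction of non-Ready nodes exceeds
// maxUnready, i.e. the failure looks environmental and fencing every node
// would likely make things worse (a "fencing storm").
func shouldSkipFencing(nodes []corev1.Node, maxUnready float64) bool {
	if len(nodes) == 0 {
		return false
	}
	unready := 0
	for _, node := range nodes {
		ready := false
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				ready = true
			}
		}
		if !ready {
			unready++
		}
	}
	return float64(unready)/float64(len(nodes)) > maxUnready
}

func main() {
	// With a 50% threshold, three unresponsive nodes out of four would
	// suppress fencing entirely rather than trigger it for each node.
	fmt.Println(shouldSkipFencing(nil, 0.5)) // false: no nodes, nothing to decide
}
```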

The act of checking and/or making sure it's not currently connected to storage is itself a fencing operation, so the above doesn't seem completely accurate.

A fencing storm is a situation in which fencing for several nodes in the cluster is triggered at the same time due to an environmental issue.

Examples:
1. Switch failure - a switch used to connect a few nodes to the environment fails, and there is no redundancy in the network environment. In such a case, the nodes will be reported as unresponsive while they are still alive and kicking, and perhaps providing service through other networks they are attached to. If we fence those nodes, a few services will be offline until restarted, even though it might not have been necessary.

Sort of, there are downsides either way (deciding whether to fence or not). Have a read of http://blog.clusterlabs.org/blog/2018/two-node-problems

The problem is that you have no way to know if it is a switch failure.
So either:

  • you assume it's not a network failure, initiate fencing and potentially kill a healthy Pod (does it really count as healthy if no-one can reach it though) so it can be recovered, or
  • you assume it is a network failure, don't initiate fencing and risk that there are multiple copies of some Pods

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 28, 2018
@fabiand

fabiand commented May 29, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 29, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 27, 2018
@mwperina

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 28, 2018
@thockin
Member

thockin commented Sep 10, 2018

What is the status on this one?

@beekhof

beekhof commented Sep 12, 2018

@thockin I'm in the process of trying to resurrect it.

After additional consultation we're moving to a Machine API based approach.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 19, 2019
@rob-nicholson

@beekhof I'm just wondering about your last comment
#1416 (comment). Does that mean that you are no longer trying to resurrect this proposal? Is there somewhere I could look for discussion/elaboration of the 'Machine API based approach' to solving this problem?

@beekhof

beekhof commented Mar 4, 2019

@rob-nicholson I resurrected it in a different form as a KEP - #2763

Rather than preaching a specific implementation, it tried to describe the interaction points that solutions should respect.

In the end it didn't get a lot of attention and was later closed when KEPs moved to a new repo.
If people are interested, I would be happy to resurrect it once again in the new, new location :)

We also have an implementation based on this work in progress at https://github.com/openshift/machine-api-operator/tree/master/pkg/controller/machinehealthcheck which we are actively working on.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fabiand

fabiand commented Apr 3, 2019 via email

@k8s-ci-robot
Contributor

@fabiand: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

I know that some people are working on this, thus reopening it

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
