
iSCSI PV does not recover when a node goes down. #63475

Closed
dElogics opened this issue May 7, 2018 · 14 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments


dElogics commented May 7, 2018

/kind bug
What happened:
Suppose a pod is bound to a PV and the node running that pod goes down. Under the default configuration, the PV should be force-detached from the downed node within about 6 minutes (I think this depends on --attach-detach-reconcile-sync-period) and successfully relocated to another node; however, on the other node the pod stays stuck in the ContainerCreating state unless either:

  1. the node is deleted (see the client-go sketch after this list), or
  2. the controller is restarted.
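
A minimal sketch of the first workaround, assuming a recent client-go (the context-taking Delete signature) and a kubeconfig on disk; the node name and kubeconfig path are placeholders, and kubectl delete node <name> achieves the same thing:

```go
// delete_node.go: illustrative only. Deleting the Node object is the
// workaround described above; it lets the attach/detach controller
// force-detach the volumes that were attached to the dead node.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a local kubeconfig (the path is an assumption).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// "worker-1" is a placeholder for the node that went down.
	if err := clientset.CoreV1().Nodes().Delete(context.TODO(), "worker-1", metav1.DeleteOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("node object deleted; the stuck pod should now be able to attach its PV elsewhere")
}
```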

Judging from bug reports #57497, #50004 and #50200, this appears to have been fixed for many PV drivers, but not for the iSCSI one.
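
For context, here is a minimal sketch of the 6-minute behaviour described above, assuming a fixed grace period before a volume on an unreachable node may be force-detached; the type and function names are made up for illustration and are not the actual attach/detach controller code:

```go
// force_detach_sketch.go: illustrates the timing only, not the real reconciler.
package main

import (
	"fmt"
	"time"
)

// The controller waits roughly this long for a clean unmount before it will
// force-detach a volume from a node it can no longer reach.
const maxWaitForUnmount = 6 * time.Minute

type attachedVolume struct {
	pvName          string
	nodeUnreachable bool
	detachRequested time.Time // when the controller first wanted the volume off the node
}

// shouldForceDetach mimics the decision described in this issue: only after
// the grace period has elapsed is it considered safe to detach without a
// confirmed unmount from the (dead) kubelet.
func shouldForceDetach(v attachedVolume, now time.Time) bool {
	return v.nodeUnreachable && now.Sub(v.detachRequested) > maxWaitForUnmount
}

func main() {
	v := attachedVolume{
		pvName:          "iscsi-pv-1",
		nodeUnreachable: true,
		detachRequested: time.Now().Add(-7 * time.Minute),
	}
	fmt.Println(shouldForceDetach(v, time.Now())) // prints "true"
}
```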

How to reproduce it (as minimally and precisely as possible):
Make a node unavailable: take its IP down, pkill kubelet, crash the kernel, power off the hardware/VM, etc.

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:48:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:38:10Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
Bare metal; 512 GB RAM, 4-processor nodes.
  • OS (e.g. from /etc/os-release):
    PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
    NAME="Debian GNU/Linux"
    VERSION_ID="9"
    VERSION="9 (stretch)"
    ID=debian
    HOME_URL="https://www.debian.org/"
    SUPPORT_URL="https://www.debian.org/support"
    BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
Linux aws-prod132 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u5 (2017-09-19) x86_64 GNU/Linux
  • Install tools:
  • Others:
k8s-ci-robot added the needs-sig and kind/bug labels on May 7, 2018

dims commented May 7, 2018

/sig storage

k8s-ci-robot added the sig/storage label and removed the needs-sig label on May 7, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Aug 5, 2018

nikhita commented Aug 10, 2018

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Aug 10, 2018

adampl commented Aug 11, 2018

@dElogics AFAIK this fundamental issue has been fixed only for several cloud providers, as if bare-metal clusters were second-class citizens...


zoobab commented Oct 2, 2018

@adampl can you mention which providers it has been fixed for? I am chasing a simple bug where, in OpenStack, volumes are not being freed from a node that has been shut down...


adampl commented Oct 2, 2018

I don't remember now, but probably AWS and/or GCE. Now I'm not even sure whether it's actually fixed or it's just a difference in behavior. Some providers actually remove the node from the cluster when it's shut down, which AFAIK somehow helps Kubernetes force-detach the volume. Generally, several solutions to this problem have appeared over time as pull requests (like #67977), but they were eventually put on hold or closed in favor of something better (like #65392).

@aizuddin85

I had a similar problem with iSCSI when a node crashed due to a kernel panic.


jaywryan commented Dec 6, 2018

@aizuddin85 any workarounds?


aizuddin85 commented Dec 6, 2018

@aizuddin85 any workarounds?

There is a timer within provisioner.go that eventually releases the lock after 6 minutes, if I recall correctly; I didn't manage to capture which line it is in the codebase.
The reason for the delay is the nature of block devices: it avoids any unintended data corruption.
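
A minimal sketch of the lock-with-timeout behaviour described in the comment above, assuming a simple time-to-live on the lock; the names and the exact mechanism are assumptions for illustration, not the code in provisioner.go:

```go
// lock_ttl_sketch.go: illustrative only.
package main

import (
	"fmt"
	"time"
)

// volumeLock models a per-volume lock that may be broken once the holder has
// been silent for longer than the TTL, so another node can attach the block
// device without risking two concurrent writers.
type volumeLock struct {
	holder     string
	acquiredAt time.Time
	ttl        time.Duration
}

// canBreak reports whether the lock may be taken over by a new holder.
func (l volumeLock) canBreak(now time.Time) bool {
	return now.Sub(l.acquiredAt) > l.ttl
}

func main() {
	l := volumeLock{holder: "node-a", acquiredAt: time.Now().Add(-10 * time.Minute), ttl: 6 * time.Minute}
	if l.canBreak(time.Now()) {
		fmt.Println("lock expired; the volume can be attached to another node")
	}
}
```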

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Mar 6, 2019

kmova commented Mar 15, 2019

cc: @humblec

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Apr 14, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
