
iSCSI PV does not recover when a node goes down. #63475

Closed
dElogics opened this issue May 7, 2018 · 14 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments


dElogics commented May 7, 2018

/kind bug
What happened:
Suppose a pod is bound to a PV and the node running that pod goes down. Under the default configuration, the PV should be force-detached from the downed node within about 6 minutes (I think this depends on --attach-detach-reconcile-sync-period) and successfully relocated to another node; however, on the other node the pod stays stuck in the ContainerCreating state unless either:

  1. the node is deleted (see the client-go sketch after this list), or
  2. the controller is restarted.
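
A minimal sketch of the first workaround, assuming a recent client-go (the context-taking Delete signature) and a kubeconfig on disk; the node name and kubeconfig path are placeholders, and kubectl delete node <name> achieves the same thing:

```go
// delete_node.go: illustrative only. Deleting the Node object is the
// workaround described above; it lets the attach/detach controller
// force-detach the volumes that were attached to the dead node.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a local kubeconfig (the path is an assumption).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// "worker-1" is a placeholder for the node that went down.
	if err := clientset.CoreV1().Nodes().Delete(context.TODO(), "worker-1", metav1.DeleteOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("node object deleted; the stuck pod should now be able to attach its PV elsewhere")
}
```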

Judging from bug reports #57497, #50004 and #50200, this appears to have been fixed for many PV drivers, but not for the iSCSI one.
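
For context, here is a minimal sketch of the 6-minute behaviour described above, assuming a fixed grace period before a volume on an unreachable node may be force-detached; the type and function names are made up for illustration and are not the actual attach/detach controller code:

```go
// force_detach_sketch.go: illustrates the timing only, not the real reconciler.
package main

import (
	"fmt"
	"time"
)

// The controller waits roughly this long for a clean unmount before it will
// force-detach a volume from a node it can no longer reach.
const maxWaitForUnmount = 6 * time.Minute

type attachedVolume struct {
	pvName          string
	nodeUnreachable bool
	detachRequested time.Time // when the controller first wanted the volume off the node
}

// shouldForceDetach mimics the decision described in this issue: only after
// the grace period has elapsed is it considered safe to detach without a
// confirmed unmount from the (dead) kubelet.
func shouldForceDetach(v attachedVolume, now time.Time) bool {
	return v.nodeUnreachable && now.Sub(v.detachRequested) > maxWaitForUnmount
}

func main() {
	v := attachedVolume{
		pvName:          "iscsi-pv-1",
		nodeUnreachable: true,
		detachRequested: time.Now().Add(-7 * time.Minute),
	}
	fmt.Println(shouldForceDetach(v, time.Now())) // prints "true"
}
```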

How to reproduce it (as minimally and precisely as possible):
Make a node unavailable: take its IP down, pkill kubelet, crash the kernel, power off the hardware/VM, etc.

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:48:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:38:10Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
Bare metal; 512 GB RAM, 4-processor nodes.
  • OS (e.g. from /etc/os-release):
    PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
    NAME="Debian GNU/Linux"
    VERSION_ID="9"
    VERSION="9 (stretch)"
    ID=debian
    HOME_URL="https://www.debian.org/"
    SUPPORT_URL="https://www.debian.org/support"
    BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
Linux aws-prod132 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u5 (2017-09-19) x86_64 GNU/Linux
  • Install tools:
  • Others:
k8s-ci-robot added the needs-sig and kind/bug labels on May 7, 2018

dims commented May 7, 2018

/sig storage

k8s-ci-robot added the sig/storage label and removed the needs-sig label on May 7, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Aug 5, 2018

nikhita commented Aug 10, 2018

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Aug 10, 2018

adampl commented Aug 11, 2018

@dElogics AFAIK this fundamental issue has been fixed only for several cloud providers, as if bare-metal clusters were second-class citizens...


zoobab commented Oct 2, 2018

@adampl can you mention which providers it has been fixed for? I am chasing a simple bug where, in OpenStack, volumes are not being freed from a node that has been shut down...


adampl commented Oct 2, 2018

I don't remember now, but probably AWS and/or GCE. Now I'm not even sure whether it's actually fixed or it's just a difference in behavior. Some providers actually remove the node from the cluster when it's shut down, which AFAIK somehow helps Kubernetes force-detach the volume. Generally, several solutions to this problem have appeared over time as pull requests (like #67977), but they were eventually put on hold or closed in favor of something better (like #65392).

@aizuddin85

I had a similar problem with iSCSI when a node crashed due to a kernel panic.


jaywryan commented Dec 6, 2018

@aizuddin85 any workarounds?


aizuddin85 commented Dec 6, 2018

@aizuddin85 any workarounds?

There is a timer within provisioner.go that eventually releases the lock after 6 minutes, if I recall correctly; I didn't manage to capture which line it is in the codebase.
The reason for the delay is the nature of block devices: it avoids any unintended data corruption.
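
A minimal sketch of the lock-with-timeout behaviour described in the comment above, assuming a simple time-to-live on the lock; the names and the exact mechanism are assumptions for illustration, not the code in provisioner.go:

```go
// lock_ttl_sketch.go: illustrative only.
package main

import (
	"fmt"
	"time"
)

// volumeLock models a per-volume lock that may be broken once the holder has
// been silent for longer than the TTL, so another node can attach the block
// device without risking two concurrent writers.
type volumeLock struct {
	holder     string
	acquiredAt time.Time
	ttl        time.Duration
}

// canBreak reports whether the lock may be taken over by a new holder.
func (l volumeLock) canBreak(now time.Time) bool {
	return now.Sub(l.acquiredAt) > l.ttl
}

func main() {
	l := volumeLock{holder: "node-a", acquiredAt: time.Now().Add(-10 * time.Minute), ttl: 6 * time.Minute}
	if l.canBreak(time.Now()) {
		fmt.Println("lock expired; the volume can be attached to another node")
	}
}
```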

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Mar 6, 2019

kmova commented Mar 15, 2019

cc: @humblec

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Apr 14, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
