Pods with volumes stuck in ContainerCreating after cluster node is deleted from OpenStack #50200
Comments
@kars7e
/sig storage
It seems attachDetachController does not update actualStateOfWorld when the node is deleted. I will check it.
From the log @kars7e provided, I think the following happened.
One thing I am not sure about: syncState should check the volume status and update the actual state periodically. @kars7e, did you set the flag disable-attach-detach-reconcile-sync? I opened an issue for the OpenStack bug in DetachVolume.
@jingxu97 Oops, we need to check for the "available" status in DetachDisk.
Thanks @jingxu97 for looking into it. I actually had the same theory, so I added the following change to DetachDisk:
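A minimal sketch of the kind of guard being described (simplified, hypothetical types, not the real cloud-provider API; `detachDisk` and `Volume` here are stand-ins):

```go
package main

import "fmt"

const (
	volumeAvailableStatus = "available"
	volumeInUseStatus     = "in-use"
)

// Volume is a simplified stand-in for a Cinder volume as seen by the
// cloud provider; the real code uses the OpenStack API types.
type Volume struct {
	ID     string
	Status string
}

// detachDisk sketches the guard under discussion: if Cinder already
// reports the volume as "available" (e.g. because the instance was
// deleted), treat the detach as a no-op instead of failing.
func detachDisk(vol Volume) error {
	if vol.Status == volumeAvailableStatus {
		// Nothing to do: the volume is already detached.
		return nil
	}
	if vol.Status != volumeInUseStatus {
		return fmt.Errorf("can not detach volume %s, its status is %s", vol.ID, vol.Status)
	}
	// ... here the real code would ask Nova to detach the volume ...
	return nil
}

func main() {
	err := detachDisk(Volume{ID: "9dd4110b", Status: volumeAvailableStatus})
	fmt.Println(err) // prints: <nil>
}
```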
And tried with that. I don't see errors about detaching the volume anymore, but somehow the asw (actual state of world) does not update the mounts, and I'm seeing thousands of repeated log lines. I haven't set the disable-attach-detach-reconcile-sync flag.
If it's of any help, here is the controller manager log (with --v=4). Note: I removed all lines with
@jingxu97 Can we update the node status before ensuring that volumes that should be attached are attached and that volumes that should be detached are detached?
@kars7e Can you test it with the following code, if it is convenient for you? If not, I will test it (I am sorry, my cluster is down and I will rebuild it the day after tomorrow): + // Update Node Status
@FengyunPan Thanks! I will try that tomorrow |
I don't think that will work. UpdateNodeStatuses() does not update the reconciler's actual state; it updates the node status used to communicate with the kubelet, so that the kubelet volume manager knows whether the volume is already attached or not. The strange thing is that if disable-attach-detach-reconcile-sync is not set, the reconciler's syncState should periodically check whether the volume is still attached to the node and update the actual state. I wonder whether this function also has a bug: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/openstack/openstack_volumes.go#L409. Even though the volume status is already available, is it possible that volume.AttachedServerId still holds the server id?
Thanks for your comment, that's right.
I think that's impossible: if the volume's status is available, its attachments list will be [].
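That invariant can be sketched as follows (a hypothetical, simplified view of a Cinder volume; `volume` and `attachedServerID` are stand-in names, not the real OpenStack API types):

```go
package main

import "fmt"

// volume is a simplified, hypothetical view of a Cinder volume; the
// real code uses the OpenStack block storage API types.
type volume struct {
	Status      string
	Attachments []string // attached server IDs; empty when detached
}

// attachedServerID returns the attached server's ID, or "" when the
// volume has no attachments. Per the comment above, a volume whose
// status is "available" has an empty Attachments list, so a stale
// server ID cannot linger there.
func attachedServerID(v volume) string {
	if len(v.Attachments) == 0 {
		return ""
	}
	return v.Attachments[0]
}

func main() {
	fmt.Printf("%q\n", attachedServerID(volume{Status: "available"})) // prints: ""
}
```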
I've tried the patch. It was working quite all right for us in 1.6; this problem surfaced after upgrading to 1.7. My guess is that it is because of commit 06baeb3.
So this time I see the following lines in the log:
Then it fails to attach, because it tries to attach to the instance that is being deleted.
But eventually the target node is updated, and it successfully gets attached:
Now, what about the other volumes that are failing? I looked for VolumesAreAttached, and I see plenty of:
Also, this shows why one of the volumes was reattached correctly this time.
@kars7e I have seen the log line (E0806 05:36:32.343037 21 operation_generator.go:161] VolumesAreAttached failed for checking on node "k8s-node-2-cca67ed1-eda7-4988-848c-3222706c2b45" with: Failed to find object), which means VerifyVolumesAreAttachedPerNode checks a node that no longer exists. This is a bug.
@kars7e Hi, I opened a PR for this bug; can you help me test it?
If the node does not exist, its volumes will be detached automatically and become available, so mark them detached. Fix: kubernetes#50200
@FengyunPan Right, it is a bug. I checked a few volume plugins; currently GCE PD and AWS have the correct behavior: if the node no longer exists according to the cloud provider, we can safely mark the volume as detached. But the rest of the volume plugins, such as Cinder, vSphere, and Photon, do not check for this error. We need to fix them too. Opened an issue to track this: #50266
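The correct behavior described above can be sketched like this (hypothetical helper names and error value, not the actual plugin code): when the cloud provider reports the node as gone, report every volume as detached instead of propagating the lookup error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound stands in for the cloud provider's "Failed to find
// object" error when the instance has been deleted.
var errNodeNotFound = errors.New("failed to find instance")

// disksAreAttached sketches the fix: if the node no longer exists, Nova
// has already detached its volumes, so every volume is reported as
// detached and no error is returned.
func disksAreAttached(volumeIDs []string, lookupNode func() error) (map[string]bool, error) {
	attached := make(map[string]bool, len(volumeIDs))
	for _, id := range volumeIDs {
		attached[id] = false
	}
	if err := lookupNode(); err != nil {
		if errors.Is(err, errNodeNotFound) {
			// Node is gone: safe to mark all volumes detached.
			return attached, nil
		}
		return attached, err
	}
	// ... here the real code would query Nova for each volume's state ...
	return attached, nil
}

func main() {
	result, err := disksAreAttached([]string{"vol-a", "vol-b"}, func() error { return errNodeNotFound })
	fmt.Println(result, err)
}
```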
@FengyunPan Thanks for the patch, I tested it and it worked! Can you post a PR with it? CC @jingxu97
Automatic merge from submit-queue (batch tested with PRs 38947, 50239, 51115, 51094, 51116)

Mark the volumes as detached when node does not exist

If node does not exist, node's volumes will be detached automatically and become available. So mark them detached and do not return err.

**Which issue this PR fixes**: fixes #50200

**Release note**:
```release-note
NONE
```
If the node doesn't exist, OpenStack Nova will assume the volumes are not attached to it, so mark the volumes as detached and return false without error. Fix: kubernetes#50200
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
One of the cluster worker nodes was deleted from OpenStack.
Pods running on this node were rescheduled onto different nodes but got stuck in ContainerCreating. They stay stuck for 20+ minutes until an action such as restarting the controller manager is taken (the controller cannot reconcile without manual intervention). See the end of this report for actions that resolve it.
What you expected to happen:
Pod should be started correctly on a different node, with volumes attached to it.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
The underlying issue is that Cinder volumes get detached when the instance is deleted, but k8s does not register this fact and throws:
Multi-Attach error for volume "pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c" Volume is already exclusively attached to one node and can't be attached to another
It seems that the controller manager attempts the detach, but cannot handle the fact that the volume is in the available state (k8s-node-2 is the node that was deleted):
E0805 19:36:33.196285 6 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\"" failed. No retries permitted until 2017-08-05 19:37:37.196173574 +0000 UTC (durationBeforeRetry 1m4s). Error: DetachVolume.Detach failed for volume "pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c" (UniqueName: "kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b") on node "k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515" : can not detach volume kubernetes-dynamic-pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c, its status is available.
After inspecting the pod, the following events are registered (note: this pod eventually started because the controller manager was rebooted):
Also an excerpt from controller manager log (with --v=4):
Note: the following operations will resolve this issue:
Environment:
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration: OpenStack Mitaka
- OS: Ubuntu 16.04.2 LTS (Xenial Xerus)
- Kernel (`uname -a`): Linux k8s-master-1-31217f04-941c-48f2-b36e-8a97a3bf7515 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kargo (kubespray)