Deployments with GCE PD fail with "...is already being used by..." #48968
Comments
This report mentioned Deployments along with GCE PDs. This can get tricky because in some cases it can result in multiple pods (scheduled to different nodes) referencing the same (read-write-once) volume, which causes the second pod to fail to start. To prevent this from happening, the general recommendations for using Deployments with GCE PDs are to keep a single replica and use the "Recreate" update strategy (see the sketch at the end of this comment).
However, the reporter mentioned they used the "Recreate" strategy, which means that there must be a bug here. To help us debug, if you run into this issue, please:
Let's figure this out! CC @kubernetes/sig-storage-bugs |
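For illustration, a minimal sketch of those recommended settings; everything here (names, image) is hypothetical, and the exact apiVersion depends on your cluster version:

```yaml
# Single replica + Recreate: the old pod is deleted before the replacement
# is scheduled, giving the controller a chance to detach the read-write-once
# disk before any other node tries to attach it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:3.2
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: redis-data  # PVC bound to a gce-pd volume
```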
The initial report mentions that the node was not healthy, so the pod got rescheduled. Does that trigger the update strategy (Recreate or RollingUpdate)? |
We've run into this a lot too, and we weren't using Deployments. We were simply spawning pods that had a PD attached. After the pods were killed, the disks didn't seem to be detached from the underlying VM automatically, sometimes for hours. When the pods came back up on a different machine, we often got this exact same error (and it sometimes cleared up in a few minutes, often not). We wrote a script that just watches for this issue (grepping describe in a loop, ugh) and runs the appropriate gcloud detach command, and that made the problem 'go away' for us. We were on GKE. |
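For anyone else working around this, the manual detach described above is roughly the following (disk, instance, and zone names are placeholders):

```sh
# Find which instance the disk is still attached to...
gcloud compute disks describe my-disk --zone us-central1-a --format='value(users)'
# ...and detach it so the new node can attach it.
gcloud compute instances detach-disk my-node --disk my-disk --zone us-central1-a
```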
We also see this issue happening. The most annoying thing about GCE is that it does "silent" migration of VMs, after which PVs are not released. So when Kubernetes wants to start a pod on that node, it fails with a "resource already in use" type of error. cc: @sadlil |
Hello, I'm the original reddit poster. A couple more details. The Deployment in question (sans sensitive info):
The PVC:
The last time I observed the behavior in question, it was induced by a faulty job producing a high number of failing pods, which generated disk pressure. Redis was rescheduled to a different node and the rest is history. |
@yuvipanda Thank you for reporting the issue. Would it be possible to share some controller logs from when the issue happened? Thanks! |
@yuvipanda can you elaborate on "pod gets killed"? Previously, for pods that were terminated but not deleted from the API server, we never used to perform volume detach, but this changed in #45191. |
@jingxu97 we were running GKE, and I couldn't find a way to get logs of controllers, unfortunately. @gnufied we spawn one pod per user, and then the user can execute arbitrary code inside the pod (via a Jupyter Notebook). We have a script that watches for users who are inactive and performs a delete (via the k8s API) of their pods. We found that for a long (and intermittent?) time after that, the volumes would still be attached to the node on which the user's pod had been running. So when the user's pod was re-created (maybe they became active again), we needed to attach the same volume back to this pod, and this would fail with this message. |
@jingxu97 reached out to me, and I provided a repro case (https://gist.github.com/yuvipanda/0b6aa32192c35b960e91698e1c14690c) |
@wawastein thanks for reporting the error. You mentioned a job producing a high number of faulty pods. Does each pod use a different PVC, so that it creates a new volume? Thanks! |
@jingxu97 hi, no, the job pods do not use or create any volumes. The problem was a misconfiguration: containers couldn't connect to the DB and failed time and time again. |
I can't reproduce this on 1.7, btw. I can on 1.6. |
@yuvipanda I tried both 1.6.4 and 1.7 clusters with your repro steps, but could not reproduce the errors. |
Here we go again. |
@wawastein could you please share more information about this issue, in what condition this error occurred? If possible, could you also share the project, zone and cluster name information so that we could check the master log? |
@jingxu97 so the timeline was as follows:
Project: superlocal-149713 |
@wawastein I checked the master log and found the following related information:
From the log we can see that around 13:52:53 the PVC is first attached to node gke-staging-cluster-2-pool-1-bb497dff-9mms. Later, around 14:00:38, the reconciler tries to attach this PVC to a different node and fails, since it is still attached to the old node. The problem here is that the old pod is not deleted, so the volume is still attached, and the new pod tries to attach it and fails. I haven't looked at the deployment controller in detail (see the deployment doc https://kubernetes.io/docs/concepts/workloads/controllers/deployment/), but if the new pod fails to start because of a volume problem and the controller does not kill the old pod, it will be stuck. I also notice that in your master controller log there are lots of errors like the following; not sure whether they are relevant or not.
I will check with the workload team to see whether there is a problem during deployment updates. |
@jingxu97 the last log messages confused me too; I thought it might be because most of the time those pods are idle and don't use resources. The HPA ones are related to deleted deployments. The bound PVC one, however, is at least counter-intuitive. If a new pod is scheduled to a new node, maybe there ought to be a check for any detachable volumes of the old pod first. |
@wawastein the scheduler and the volume manager are completely separate. When the scheduler schedules the pod, it does not know whether the volume used by the pod is still attached to some node or not. But normally the old pod should be deleted at some point, so the volume will be detached from the old node, even though that might happen a little after the new pod is created. |
@wawastein Could you check the comment here to see whether following the steps could solve your problem? #48968 (comment) |
I've had this same problem pop up when attempting to schedule single-replica deployments after VM migrations of my nodes on GCP. Eventually the scheduler gives up trying to reschedule the pod and I have to delete it manually to resolve the issue. I'm going to try recreating the deployments with strategy: Recreate and see if that helps, but I'll report back in this issue if it happens again and I can provide a timestamp, node, project, etc. UPDATE: It looks like, even when using the new strategy, the issue can still occur. Anecdotally though, after changing the strategy, the disk was released much faster than previously and the pod quickly recovered. I'll need a greater sample size before I can say that with any authority, though. |
I'm noticing this during lots of deploys. Maybe there's a timeout when switching disks between machines, or some other sort of sync loop? |
@jingxu97 I'll try it on the new deployment. But I can't really force a reschedule to test; I'll just sit there and hope it doesn't happen. |
/assign
Looking into recovery from different failure scenarios with recommended settings to make sure recreated pods are schedulable. |
I am facing a similar issue with StatefulSets. I have attached the PV through a PVC in GKE, but when the replicas are more than 1, the remaining pods go into the waiting CrashLoopBackOff state. Any progress on this issue? |
Hi @jakirpatel, CrashLoopBackOff generally means that the volume is successfully mounted, but your container is crashing. You'll need to check the logs of your container to see why it's crashing. |
Can you post the events you see with kubectl describe pod <pod-name>? |
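For reference, the usual triage commands (pod name is a placeholder):

```sh
# The events section shows whether this is an attach/mount failure
# or a crashing container.
kubectl describe pod my-pod
# For CrashLoopBackOff, the previous container's logs usually say why it died.
kubectl logs my-pod --previous
```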
I have looked into different failure scenarios with our recommended settings and have the following conclusions:
Note: This may not be an exhaustive list of failure scenarios. Please contact me if you are experiencing issues with a different scenario.
Note: All scenarios were tested with "replicas: 1" and a Deployment with a PVC referencing a gce-pd (which only supports single-node attach). |
@davidz627 This seems to be in line with what I've experienced as well. The multi-attach error happened when I was updating the StatefulSet, which caused pods to be reshuffled across nodes. |
@davidz627 can you also try this experiment with statefulsets? Theoretically, you should not see multi-attach errors with StatefulSets because the pod must be completely deleted before a new pod with the same name can be recreated. @mofirouz if you could paste your full pod events showing the CrashLoopBackOff, that would be helpful in triaging. |
Also tested with StatefulSets with 3 replicas, using volumeClaimTemplates with gce-pd. Here are the results: |
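For context, a minimal sketch of a StatefulSet of that shape — three replicas, each getting its own gce-pd PVC from a volumeClaimTemplate (names, image, and storage class are assumptions):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx
        volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  # Each replica (web-0, web-1, web-2) gets its own PVC from this template.
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard  # GKE's default gce-pd-backed class
      resources:
        requests:
          storage: 1Gi
```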
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle rotten
Hi, we experienced the same problem with Kubernetes 1.9.6:
Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration: GCE
Some more details: we use a StatefulSet for our pod, configured with the RollingUpdate strategy. The pod was restarted from node1 to node2 and got stuck in the ContainerCreating state due to:
controller-manager logs:
After checking for volumesAttached and volumesInUse for both nodes we saw:
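For anyone reproducing this, the attachment state being compared here is visible on the Node objects themselves (node name is a placeholder):

```sh
# volumesAttached: what the attach/detach controller believes is attached.
kubectl get node node1 -o jsonpath='{.status.volumesAttached}'
# volumesInUse: what the kubelet reports as mounted by pods on that node.
kubectl get node node1 -o jsonpath='{.status.volumesInUse}'
```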
By draining node 2 and having the pod restart on node 1 (which happened by accident, since we have more than 2 nodes in our cluster), the problem was fixed. Due to the requirement of a manual fix there is some downtime, and this is a critical pod for our cluster. |
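The drain workaround they describe amounts to the following (node name is a placeholder):

```sh
# Evict all pods from the node holding the stale attachment so the pod
# reschedules elsewhere; DaemonSet pods cannot be evicted and are skipped.
kubectl drain node2 --ignore-daemonsets
# Re-enable scheduling once the pod is healthy on another node.
kubectl uncordon node2
```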
@plkokanov to confirm, you saw this problem when you issued a rolling update on your statefulset? |
@plkokanov would you mind sharing the full controller-manager log with us so we can help triage? Seeing VolumeInUse on node 2 is normal, because the controller tries to mount gcp-dynam-pvc-id on node 2. The strange thing is why the volume is still attached to node 1. |
@msau42 we saw it after the pod was restarted. The reason for the restart was most likely an unsuccessful liveness probe. I'll make sure to post the exact reason as soon as I manage to reproduce it. |
This error is still occurring for me, running GKE 1.9.6-1 with a pod using a gce-pd PVC. The odd thing is that this error pops up immediately on creation of the pod and its associated PVC, after deleting the old pod and PVC. My guess is that the old one has not been deleted in GCP, even though it's gone from Kubernetes and from the GCE console. This is a really big problem and is making it impossible to run any stateful apps in my cluster. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@pv93, sorry for the late reply. Could you please give us more details about your issue? You mentioned you deleted the old pod and PVC. Did you use the "--force" and "--grace-period=0" options (a sketch of the command follows below)? Otherwise, the new pod or PVC (if using the same name as the old one) cannot be created without the old one being cleaned up and deleted. For the PVC: unless you change the reclaim policy, by default, if the PVC is deleted, the volume should also be deleted. Your new PVC should create a new volume. |
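For completeness, the force-delete being asked about looks like this (pod name is a placeholder); it skips graceful termination, so use it with care:

```sh
kubectl delete pod my-pod --grace-period=0 --force
```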
@jingxu97 it seems like this issue has stopped popping up as much with Kubernetes 1.10. Once a pod is rescheduled, GKE is much faster at detaching the volume from the old node and attaching it to the new one. So this isn't a problem for me anymore. Thanks anyway. |
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
From a user report on reddit:
Other reports here:
What you expected to happen:
Volume should attach to new node without issue.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
- Kernel (e.g. uname -a):