Mounting/unmounting volume error on GKE #29903
I'm experiencing the same problems on multiple clusters, all running … Here's what I get in …
I solved it temporarily by deleting the affected node.
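For anyone reaching for the same workaround, a rough sketch of draining and deleting a GKE node is below; the node name and zone are placeholders, and flag availability varies a bit by kubectl version. On GKE the node pool's managed instance group should recreate the deleted VM on its own.

```bash
# Stop new pods landing on the node and evict the existing ones
kubectl cordon gke-mycluster-default-pool-abc123-xyz
kubectl drain gke-mycluster-default-pool-abc123-xyz --force --ignore-daemonsets

# Delete the underlying VM; the node pool's instance group recreates a fresh node
gcloud compute instances delete gke-mycluster-default-pool-abc123-xyz --zone europe-west1-d
```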
Just a quick update: I temporarily resolved the problem by downgrading my cluster's nodes to …
Maybe it was just resolved because the node got recreated, like in my case?
@mwuertinger Yes, I assume that might have been the case.
@mesmerizingr do you happen to have some background information about this issue so that we can reference it here?
This seems to be related: #29617
@mwuertinger take a look at this thread: #28750 Also, one may find that the problem was acknowledged in the following release notes: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md/#v133 The most reassuring comment I've got was: #28750 (comment) So, if you aren't vendor-locked on GKE and have complete control over your cluster, including its master, I would upgrade it to …
@mwuertinger @mesmerizingr, thank you for reporting your issue. Yes, we … Jing
Thanks for confirming, @jingxu97
@mwuertinger, if you are on GKE, downgrading to 1.3.2 seemed to have solved the issue on at least one of my clusters... however, I didn't really follow up anymore (by restarting pods with PDs and secret volumes), as I was just happy that the cluster was healthy again.
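In case it helps anyone else on GKE, changing a node pool's version is roughly the following; the cluster, zone, and node-pool names are placeholders, and whether a particular downgrade is allowed depends on which versions GKE currently offers.

```bash
# See what the master and nodes are currently running
gcloud container clusters describe my-cluster --zone europe-west1-d \
  --format='value(currentMasterVersion,currentNodeVersion)'

# Move the default node pool to a specific version (here, back to 1.3.2)
gcloud container clusters upgrade my-cluster --zone europe-west1-d \
  --node-pool default-pool --cluster-version 1.3.2
```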
@mheese thanks for sharing this information. I'll just wait for …
@mesmerizingr: false alarm: it was still happening today... only the downgrade to …
@mheese …
@mesmerizingr @mheese could you please try out 1.3.4 and let us know whether there are still issues or not? Thanks!
We upgraded our GKE cluster to 1.3.4 (both master and nodes) and we are still seeing volume mounting issues:
I should say that this deployment manifest was working fine before the upgrade. Here are the kubelet logs on the gke-c2-default-pool-40cc47de-tun8 node: If needed, the cluster name is c2 in zone europe-west1-d, with REDACTED as the project id.
@rvrignaud There are a number of issues that can cause …
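For reference, pulling the kubelet logs off a GKE node usually looks something like the sketch below; the node name and zone are the ones mentioned above, but the exact log location depends on the node image (older images write a plain log file, systemd-based images use the journal).

```bash
# SSH to the node the stuck pod was scheduled on
gcloud compute ssh gke-c2-default-pool-40cc47de-tun8 --zone europe-west1-d

# Older GKE node images write the kubelet log to a file...
sudo tail -n 500 /var/log/kubelet.log

# ...while systemd-based images keep it in the journal
sudo journalctl -u kubelet --since "2 hours ago"
```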
Hi @saad-ali, Let me know if you need anything more.
Hi @rvrignaud I am checking the kubelet log and found a few things, but I am still trying to figure out the root cause. Could you please share the deployment spec with me, and also what version you were using when the deployment was still working? Also, you mentioned that deleting a pod could eventually make it run. Could you please give me more details about what operations you performed? Thanks!
Hi @jingxu97, here is the deployment spec:
It was running correctly with master and nodes at 1.3.2.
@rvrignaud Thanks for the information. If you still have the pod that has the problem, could you try to log in to the node, check the path under /dev/disk/by-id, and let me know what files you see? Also, after you delete the pod, do you mean you can successfully create another deployment without problems (the volume can be attached and mounted)? Thanks!
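For anyone following along, checking those symlinks on the node looks roughly like this; google-schall1-int2-es is the disk discussed in this thread, and the sdb target shown in the comment is only an example.

```bash
# List the by-id symlinks kubelet uses to find attached PDs
ls -l /dev/disk/by-id/

# A correctly attached PD should show up as something like:
#   google-schall1-int2-es -> ../../sdb
ls -l /dev/disk/by-id/ | grep google-
```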
Hi @jingxu97, the pod is still in the same state (I kept it in that state for debugging).
What I meant is that for the other pod showing the same symptom on the same cluster, when I deleted it, it was rescheduled and successfully mounted its volume.
Hi @rvrignaud Thanks for the information. The strange thing is that, according to the log you gave earlier today, the device path "google-schall1-int2-es" should not exist. That is why mounting the volume failed to start: it first needs to check whether the device path exists. To confirm this, could you please send me the log around the time you … Jing
@rvrignaud …
@rvrignaud Thanks for working with us to debug this. Turns out you hit #29358 again, and that bug was not completely fixed in v1.3.4. I reopened that bug. We'll work on a fix and get it into v1.3.6. In the meantime, I restarted your …
@rvrignaud Thanks a lot for providing us detailed information to debug. We will keep you posted when the fix is in. Please try the new version and let us know whether the issue still happens or not. @mesmerizingr @mheese We are working on a fix based on the information @rvrignaud provided. The fix should be in v1.3.6. Right now I am not sure whether the problems you experienced are the same as @rvrignaud's. Please let us know if you still have issues after the upgrade. Thanks!
Hi @saad-ali, thanks for having a look at the cluster. After a while (a few dozen minutes?) the pods were still not running, so I tried deleting them (I can't remember whether they were still in ContainerCreating state or not) and they were rescheduled and mounted their volumes in a "normal" amount of time.
We hit this bug again during the 1.3.7 GKE migration. 4 out of 11 pods using PersistentDisk failed to be scheduled. This is really painful for us.
project-id: clustree-cloud / zone: europe-west1-d / cluster: c2. Do not hesitate to ask if we can send you more information.
I checked the log provided by @rvrignaud, and it hits the issue addressed by PR #32807, which should be checked in for release 1.4.1. @saad-ali, could we also consider backporting this fix to the 1.3 release?
I am experiencing this very irritating problem again. I just upgraded a GKE cluster from 1.3.7 to 1.4. I have a working mongodb cluster, and all its pods are stuck in …
I am stuck with one pod; how am I going to get the working cluster back?
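Not a fix, but a sketch of the usual way to see why a pod is stuck and to force a retry; the pod and claim names below are placeholders, not the actual resources from this cluster.

```bash
# The events section usually shows the attach/mount timeout messages
kubectl describe pod mongodb-0

# Check the claim is bound and which PV backs it
kubectl get pvc
kubectl describe pvc mongodb-data

# Deleting the pod makes the controller recreate it and retry the attach/mount
kubectl delete pod mongodb-0
```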
It seems to be trying to mount a wrong PVC name?
@Hokutosei, 5d120abb-66e9-11e6-942f-42010af00048 is the pod UID, not a PVC name. Could you please share your kubelet log and your GKE cluster's project id and zone information? Right now, there is one bug fix, #32807, that has not yet been checked into the release and might be causing your problem. I can double check and also try to push the fix out asap. Sorry for the inconvenience.
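To illustrate the distinction: the pod UID is what appears in the kubelet's per-pod volume directory, while the PV name only shows up a level further down. The path layout below is the generic kubelet layout, not something read off this particular cluster.

```bash
# Per-pod volume mounts live under the pod UID:
#   /var/lib/kubelet/pods/<pod-uid>/volumes/
# For a GCE PD, the PV name appears under the plugin directory:
#   /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~gce-pd/<pv-name>/
sudo ls /var/lib/kubelet/pods/5d120abb-66e9-11e6-942f-42010af00048/volumes/kubernetes.io~gce-pd/
```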
@jingxu97 I would love to share it in private instead; I'd need to get permission first. Hopefully you can fix this, as I still have pending upgrades to our GKE production environment. Thanks for the response.
@jeanepaul Sure, you can reach me through Slack @jinxu or email.
Hello, thanks for all your hard work, kubernetes team! I'm having very similar problems (…). I see that some of the commits/PRs referenced here are merged, but I can't tell which minor version they will be released with. Is there a version/release with the fix for this that I can watch out for?
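For anyone else wondering which release picked up a given PR, a quick check against a local kubernetes checkout works; #32807 is just an example PR from this thread, and the commit SHA is a placeholder.

```bash
# Find the merge commit for a PR referenced in this thread (e.g. #32807)...
git log --oneline --grep='Merge pull request #32807' master

# ...then list which release tags contain that commit
git tag --contains <merge-commit-sha> | grep '^v1\.'
```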
Hi @aldenks, thank you for your information. As of now, quite a few PRs are already merged into 1.4.1, but a couple of important ones are still pending, such as #33616. They should be in very soon. Since you are using GKE, you can also email me at jinxu@google.com or find me on Slack @jinxu about your cluster so that I can take a look. Thanks! Jing
Hi - I have just upgraded to 1.4.1 and have this exact issue.
Also @Stono, if you want to make sure, I can help take a look at your logs to double check the root cause of your issue. You can share your cluster information (project name, cluster name, zone) via my email jinxu@google.com or on Slack @jinxu if that is ok. Thanks!
We had the same thing again yesterday, going from 1.4.3 -> 1.4.4: timeouts on mounts, specifically PV claims on dynamically provisioned gcePersistentDisks. Deleting the deployment and recreating it fixed the issue - however, this is a pretty consistent failure.
It seems as though the fix (#34859) is still under review. We are also affected by this, and it is causing downtime in production daily.
#34859 is already merged to master. It should be in the 1.4.6 branch. Please kindly let me know if you continue to experience any issues related to mounts. Thank you!
I'm using v1.4.6 on GKE and I'm seeing a similar issue. I have a PetSet that creates a few pods and PVs, but if a node dies and the pod gets rescheduled, the new pod is unable to mount its PV. I can see in the compute engine console that the PV is attached to the new node, and it's listed under /dev/disk/by-id, yet VerifyControllerAttachedVolume never seems to recognise it:
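In case it helps with debugging: as far as I understand, VerifyControllerAttachedVolume goes off what the node object's status reports, so comparing that with what GCE shows for the disk can narrow things down. The node and disk names below are placeholders.

```bash
# What the attach/detach controller reports as attached to the node
kubectl get node gke-mycluster-default-pool-abc123-xyz -o jsonpath='{.status.volumesAttached}'
kubectl get node gke-mycluster-default-pool-abc123-xyz -o jsonpath='{.status.volumesInUse}'

# What GCE thinks the disk is attached to
gcloud compute disks describe my-petset-pv-disk --zone europe-west1-d --format='value(users)'
```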
@simonhorlick Slight side question: PetSet on GKE? As it's still alpha, are you using the "Alpha Clusters", which are not upgradable and last 30 days? Or do you have some other ninja way of enabling PetSet?
@Stono Correct, I was testing on an alpha cluster.
@simonhorlick cool - I don't know if you're aware, but 1.5.1 is available for 'new' clusters now: https://groups.google.com/forum/#!topic/gke-release-notes/kl5KAr-1oeA
I am also having similar issues on Google Container Engine.
I have a 3-node cluster in Google Container Engine. 2 of the 3 nodes are preemptible. This error happens every 24 hours or so, when the nodes get removed and created again. But if I manually delete one of the preemptible nodes to test the failover, I have so far not seen these errors. Here is my setup: gce-volumes.yaml
mysql-deployment.yaml
wordpress-deployment.yaml
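A few commands along these lines can help show where things stand after a preemptible node is recycled; the disk, pod, and zone names below are placeholders rather than values from the manifests above.

```bash
# Confirm the claims are bound and which PVs back them
kubectl get pvc,pv

# Check which instance (if any) the PD is currently attached to
gcloud compute disks describe wordpress-pd --zone us-central1-a --format='value(users)'

# The stuck pod's events show the attach/mount errors
kubectl describe pod wordpress-3406392019-abcde
```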
@munjalpatel Sorry for the late response. Do you still have the error every 24 hours? What version of Kubernetes is your cluster on? Thanks!
@jingxu97 Yes, I still have these errors. I have to manually delete the nodes every day to fix the issue. All three nodes are currently on 1.4.8.
@munjalpatel We have quite a few fixes in since 1.4.8. Could you please try the latest version and let us know whether you still have the problem? If needed, you could also share your zone, project, and cluster name with me so that I can take a look at the controller log.
I am still having this issue on GKE 1.21. Draining the node manually can fix this problem.
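For reference, the manual drain workaround looks roughly like the following; the node name is a placeholder, and the eviction flags vary a little between kubectl versions.

```bash
# Evict everything off the node so the stuck pods get rescheduled elsewhere
kubectl drain gke-mycluster-default-pool-abc123-xyz --ignore-daemonsets --delete-emptydir-data

# Once the pods are healthy on other nodes, allow scheduling on the node again
kubectl uncordon gke-mycluster-default-pool-abc123-xyz
```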
Greetings all. I'm running my cluster on GKE and after upgrading to 1.3.3 I've been having errors of the following format:

Evidently, that's not something specific to the mongodb-data volume; I get these same errors for other volumes too (the majority of them are secrets). I'm anticipating the possibility of upgrading to the 1.3.4 release on GKE, as I've read that it possibly resolves the problem, but it's not available yet. For now, I'd like to know if there are some workarounds/hacks for the problem that could allow me to continue deploying my services?

Also, huge thanks to the whole Kubernetes community. You're awesome 🖖🏻