Mounting/unmounting volume error on GKE #29903

Open
qteqoo opened this Issue Aug 2, 2016 · 56 comments

Projects

None yet
@qteqoo
qteqoo commented Aug 2, 2016 edited

Greetings all. I'm running my cluster on the GKE and after upgrading to 1.3.3 I've been having errors of the following format:

Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message

8m 4m 20 {default-scheduler } Warning FailedScheduling no nodes available to schedule pods
3m 3m 1 {default-scheduler } Normal Scheduled Successfully assigned mongodb-321322295-mnmo7 to gke-test-cluster-default-pool-f56cf9ee-f2kb
1m 1m 1 {kubelet gke-test-cluster-default-pool-f56cf9ee-f2kb} Warning FailedMount Unable to mount volumes for pod "mongodb-321322295-mnmo7_dev(20c985df-5882-11e6-ac8b-42010af00014)": timeout expired waiting for volumes to attach/mount for pod "mongodb-321322295-mnmo7"/"dev". list of unattached/unmounted volumes=[mongodb-data]
1m 1m 1 {kubelet gke-test-cluster-default-pool-f56cf9ee-f2kb} Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "mongodb-321322295-mnmo7"/"dev". list of unattached/unmounted volumes=[mongodb-data]

Evidently, that's not something specific to mongodb-data volume. I get these same errors for other volumes too (majority of them are secrets). I'm anticipating the possibility of upgrading to 1.3.4 release on GKE as I've read that it possibly resolves the problem but it's not available yet.

But for now I'd like to know if there're some workarounds/hacks for the problem that could allow me to continue deploying my services?

Also, huge thanks to the whole Kubernetes community. You're awesome ๐Ÿ––๐Ÿป

@mwuertinger

I'm experiencing the same problems on multiple clusters, all running 1.3.3 on the nodes. I'm seeing this with secrets as well as with persistent disks.

Here's what I get in /var/log/kubelet.log in a recent incident (Removed some unrelated logs from the output):

E0804 10:37:23.625498    4237 goroutinemap.go:155] Operation for "kubernetes.io/gce-pd/jenkins-co" failed. No retries permitted until 2016-08-04 10:37:24.125485506 +0000 UTC (durationBeforeRetry 500ms). error: MountVolume.WaitForAttach failed for volume "kubernetes.io/gce-pd/jenkins-co" (spec.Name: "jenkins-data") pod "03bdc3f3-5a2e-11e6-91e2-42010af0000b" (UID: "03bdc3f3-5a2e-11e6-91e2-42010af0000b") with: Could not find attached GCE PD "jenkins-co". Timeout waiting for mount paths to be created.
I0804 10:37:24.181044    4237 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/jenkins-co" (spec.Name: "jenkins-data") pod "38981380-5a2f-11e6-91e2-42010af0000b" (UID: "38981380-5a2f-11e6-91e2-42010af0000b")
I0804 10:37:24.184064    4237 operation_executor.go:959] Controller successfully attached volume "kubernetes.io/gce-pd/jenkins-co" (spec.Name: "jenkins-data") pod "38981380-5a2f-11e6-91e2-42010af0000b" (UID: "38981380-5a2f-11e6-91e2-42010af0000b") devicePath: "/dev/disk/by-id/google-jenkins-co"
I0804 10:37:24.281385    4237 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/gce-pd/jenkins-co" (spec.Name: "jenkins-data") to pod "38981380-5a2f-11e6-91e2-42010af0000b" (UID: "38981380-5a2f-11e6-91e2-42010af0000b"). 
I0804 10:37:24.281468    4237 operation_executor.go:629] Entering MountVolume.WaitForAttach for volume "kubernetes.io/gce-pd/jenkins-co" (spec.Name: "jenkins-data") pod "38981380-5a2f-11e6-91e2-42010af0000b" (UID: "38981380-5a2f-11e6-91e2-42010af0000b") DevicePath: "/dev/disk/by-id/google-jenkins-co"
E0804 10:37:54.107339    4237 kubelet.go:1930] Unable to mount volumes for pod "jenkins-8r2qa_jenkins-co(38981380-5a2f-11e6-91e2-42010af0000b)": timeout expired waiting for volumes to attach/mount for pod "jenkins-8r2qa"/"jenkins-co". list of unattached/unmounted volumes=[jenkins-data]; skipping pod
E0804 10:37:54.107392    4237 pod_workers.go:183] Error syncing pod 38981380-5a2f-11e6-91e2-42010af0000b, skipping: timeout expired waiting for volumes to attach/mount for pod "jenkins-8r2qa"/"jenkins-co". list of unattached/unmounted volumes=[jenkins-data]
I0804 10:38:09.277105    4237 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/secret/ssl" (spec.Name: "ssl") to pod "38981380-5a2f-11e6-91e2-42010af0000b" (UID: "38981380-5a2f-11e6-91e2-42010af0000b"). Volume is already mounted to pod, but remount was requested.
I0804 10:38:09.277236    4237 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/secret/default-token-h26dn" (spec.Name: "default-token-h26dn") to pod "38981380-5a2f-11e6-91e2-42010af0000b" (UID: "38981380-5a2f-11e6-91e2-42010af0000b"). Volume is already mounted to pod, but remount was requested.
I0804 10:38:09.282795    4237 operation_executor.go:720] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/default-token-h26dn" (spec.Name: "default-token-h26dn") pod "38981380-5a2f-11e6-91e2-42010af0000b" (UID: "38981380-5a2f-11e6-91e2-42010af0000b").
I0804 10:38:09.283012    4237 operation_executor.go:720] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/ssl" (spec.Name: "ssl") pod "38981380-5a2f-11e6-91e2-42010af0000b" (UID: "38981380-5a2f-11e6-91e2-42010af0000b").
@mwuertinger

I solved it temporarily by deleting the affected node.

@qteqoo
qteqoo commented Aug 4, 2016 edited

Just a quick update: I temporarily resolved the problem by downgrading my cluster's nodes to 1.2.6 and haven't experienced the problem yet albeit I've found in similar topics that it doesn't resolve the problem. But so far it works.

@mwuertinger

Maybe it was just resolved because the node got recreated like in my case?

@qteqoo
qteqoo commented Aug 4, 2016

@mwuertinger yes, I assume that it might have been the case.

@mwuertinger

@mesmerizingr do you happen to have some background information about this issue so that we can reference it here?

@mwuertinger

This seems to be related: #29617

@qteqoo
qteqoo commented Aug 4, 2016 edited

@mwuertinger take a look on this thread: #28750

Also, one may find that the problem was acknowledged in the following release notes: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md/#v133

The most reassuring comment I've got was: #28750 (comment)

So, if you aren't vendor-locked on GKE and have complete control on your cluster including its master I should upgrade it to 1.3.4.

@jingxu97
Contributor
jingxu97 commented Aug 5, 2016

@mwuertinger @mesmerizingr, thank you for reporting your issue. Yes, we
recently fixed a few issues for mounting/umounting volumes in v1.3.3. The
fix is already in v1.3.4. Please upgrade to v1.3.4 and kindly let us know
if you still have problems.

Jing

On Thu, Aug 4, 2016 at 7:46 AM, Roman notifications@github.com wrote:

@mwuertinger https://github.com/mwuertinger take a look on this thread:
#28750 #28750

โ€”
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#29903 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ASSNxSOTF3WlyjFGuOv-zrLj-wNuMwEeks5qcfs1gaJpZM4JaVMN
.

  • Jing
@mwuertinger

Thanks for confirming, @jingxu97
We're on Google Container Engine so we have to wait until they approve v1.3.4 there.

@mheese
mheese commented Aug 9, 2016

@mwuertinger, if you are on GKE, downgrading to 1.3.2 seemed to have solved the issue on at least one of my clusters ... however, I didn't really follow up anymore (by restarting pods with PDs and secret volumes) as I was just happy that the cluster was healthy again.

@qteqoo
qteqoo commented Aug 9, 2016 edited

@mheese thank for sharing this information. I'll just wait for 1.3.4 upgrade. Downgrading cluster's nodes to 1.2.6 resolved the problem.

@mheese
mheese commented Aug 9, 2016

@mesmerizingr: false alarm: it was still happening today .... only the downgrade to 1.2.6 really fixes it ... thx for the hint ๐Ÿ‘

@qteqoo
qteqoo commented Aug 13, 2016

@mheese 1.3.4 upgrade has been rolled out. Let your deployments be as stable as possible! ๐Ÿ––๐Ÿป

@jingxu97
Contributor

@mesmerizingr @mheese could you please try out 1.3.4 and let us know whether there still issues or not? Thanks!

@rvrignaud
rvrignaud commented Aug 16, 2016 edited

Hi @jingxu97 and @saad-ali,

We upgraded our GKE cluster to 1.3.4 (both master and nodes) and we are still seeing volume mounting issues:

  21h       1m      589 {kubelet gke-c2-default-pool-40cc47de-tun8}         Warning     FailedMount Unable to mount volumes for pod "prometheus-2727983241-wjnth_kube-inf(7ab3e75f-62d3-11e6-95e0-42010af00088)": timeout expired waiting for volumes to attach/mount for pod "prometheus-2727983241-wjnth"/"kube-inf". list of unattached/unmounted volumes=[data]
  21h       1m      589 {kubelet gke-c2-default-pool-40cc47de-tun8}         Warning     FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "prometheus-2727983241-wjnth"/"kube-inf". list of unattached/unmounted volumes=[data]

I should say that this deployment manifest was working fine before upgrade.
It seems that deleting a pod could eventually make it run again.

Here is the kubelet logs on the gke-c2-default-pool-40cc47de-tun8 node:
EDIT : logs removed

If needed the cluster name is c2 on zone europe-west1-d and REDACTED as project id.

@saad-ali
Member

@rvrignaud There are a number of issues that can cause timeout expired waiting while mounting. Send me your kubelet logs and master info (cluster name/project ID/zone) on Slack and the deployment files and I'll take a look.

@rvrignaud

Hi @saad-ali,
Everything is already provided above.

Let me know if you need anything more.

@jingxu97
Contributor
jingxu97 commented Aug 16, 2016 edited

Hi @rvrignaud I am checking the kubelet log and found a few things, but still trying to figure out the root cause. Could you please share the deployment spec with me and also what is the version you used when the deployment still working?

Also you mentioned by deleting a pod could eventually make it run. Could you please give me more details about what operations you performed? Thanks!

@rvrignaud

Hi @jingxu97 ,

Here is the deployment spec:

{
   "kind":"Deployment",
   "apiVersion":"extensions/v1beta1",
   "metadata":{
      "name":"prometheus",
      "namespace": "kube-inf"
   },
   "spec":{
      "replicas":1,
      "template":{
         "metadata":{
            "labels":{
               "name":"prometheus"
            }
         },
         "spec":{
           "nodeSelector": {
             "failure-domain.beta.kubernetes.io/zone": "europe-west1-d"
           },
            "containers":[
               {
                  "name":"prometheus",
                  "image":"REDACTED",
                  "resources":{
                    "limits": {
                      "memory": "5120Mi",
                      "cpu": "300m"
                    }
                  },
                  "livenessProbe": {
                    "httpGet": {
                      "path": "/status",
                      "port": 9090
                    },
                    "initialDelaySeconds": 900
                  },
                  "readinessProbe": {
                    "httpGet": {
                      "path": "/status",
                      "port": 9090
                    }
                  },
                  "env": [
                    {
                      "name": "CLUSTER_NAME",
                      "value": "c2"
                    }
                  ],
                  "volumeMounts": [
                    {
                      "mountPath": "/opt/prometheus/var",
                      "name": "data"
                    },
                    {
                      "mountPath": "/opt/prometheus/etc/rules/",
                      "name": "alerts"
                    }
                  ]
               },
               {
                  "name":"blackbox-exporter",
                  "image":"prom/blackbox-exporter:v0.2.0",
                  "imagePullPolicy": "Always",
                  "resources":{
                    "limits": {
                      "memory": "50Mi",
                      "cpu": "10m"
                    }
                  },
                  "command": [
                    "/bin/blackbox_exporter",
                    "-config.file=/etc/blackbox_exporter/config.yml"
                  ],
                  "volumeMounts": [
                    {
                      "mountPath": "/etc/blackbox_exporter",
                      "name": "blackbox"
                    }
                  ]
              }
            ],
            "volumes": [
              {
                "name": "data",
                "gcePersistentDisk": {
                  "pdName": "c2-kube-inf-prometheus-data",
                  "fsType": "ext4"
                }
              },
              {
                "name": "alerts",
                "configMap": {
                  "name": "alerts"
                }
              },
              {
                "name": "blackbox",
                "configMap": {
                  "name": "blackbox"
                }
              }
           ]
         }
      }
   }
}

It was running correctly with master and nodes 1.3.2.
I had another pod which show same problem. A simple kubectl delete pods -l name=XXX solved it.

@jingxu97
Contributor

@rvrignaud Thanks for the information. If you still have the pod which has problem, could you try to login to the node, and check the path under /dev/disk/by-id and let me know what files you can see?

Also after you delete the pod, you mean you can successfully create another deployment without problem (volume can be attached and mounted)? Thanks!

@rvrignaud

Hi @jingxu97 ,

The pod is still in same state (kept it in that state for debugging).

โžœ  deployments git:(master) c2kinfkubectl get pods -l name=prometheus -o wide
NAME                          READY     STATUS              RESTARTS   AGE       IP        NODE
prometheus-2727983241-wjnth   0/2       ContainerCreating   0          1d        <none>    gke-c2-default-pool-40cc47de-tun8
โžœ  deployments git:(master) ssh gke-c2-default-pool-40cc47de-tun8
rvrignaud@gke-c2-default-pool-40cc47de-tun8:~$ sudo -i
root@gke-c2-default-pool-40cc47de-tun8:~# cd /dev/disk/by-id/
root@gke-c2-default-pool-40cc47de-tun8:/dev/disk/by-id# ls -alh
total 0
drwxr-xr-x 2 root root 160 Aug 16 09:28 .
drwxr-xr-x 5 root root 100 Aug 15 09:11 ..
lrwxrwxrwx 1 root root   9 Aug 15 09:11 google-persistent-disk-0 -> ../../sda
lrwxrwxrwx 1 root root  10 Aug 15 09:11 google-persistent-disk-0-part1 -> ../../sda1
lrwxrwxrwx 1 root root   9 Aug 16 08:11 google-schall1-int2-es -> ../../sde
lrwxrwxrwx 1 root root   9 Aug 15 09:11 scsi-0Google_PersistentDisk_persistent-disk-0 -> ../../sda
lrwxrwxrwx 1 root root  10 Aug 15 09:11 scsi-0Google_PersistentDisk_persistent-disk-0-part1 -> ../../sda1
lrwxrwxrwx 1 root root   9 Aug 16 08:11 scsi-0Google_PersistentDisk_schall1-int2-es -> ../../sde

What I meant is for the other pod that showed the same symptom on the same cluster, when I deleted it, it was rescheduled and mounted successfully its volume.

@jingxu97
Contributor
jingxu97 commented Aug 16, 2016 edited

Hi @rvrignaud Thanks for the information. Strange thing is according to the log you gave earlier today, the device path "google-schall1-int2-es" should not exist. That is why mounting the volume failed to start because it needs to check the device path exists or not first.

To confirm this, could you please send me the log around the time you ls the /dev/disk path Aug 16 08:11. Thank you!

Jing

@jingxu97
Contributor

@rvrignaud
How many nodes in your cluster? Could you please send me the files under /dev/disk/by-id/ for all other cluster nodes? I am wondering whether it is possible that the volume attached to the wrong node. Thanks a lot!

@saad-ali
Member

@rvrignaud Thanks for working with us to debug this. Turns out you hit #29358 again, and that bug was not completeley fixed in v1.3.4. I reopened that bug. We'll work on a fix and get it into v1.3.6. In the meantime, I restarted your kube-controller-manager binary on your master which should unblock you.

@jingxu97
Contributor

@rvrignaud Thanks a lot for providing us detailed information to debug. We will keep you posted when the fix is in. Please try it on the new version and let us know whether the issue still happen or not.

@mesmerizingr @mheese We are working on a fix based on the information @rvrignaud provided. The fix should be in v1.3.6. Right now I am not sure the problems you experienced are the same as @rvrignaud. Please let us know if you still have issue after the upgrade. Thanks!

@rvrignaud

Hi @saad-ali and @jingxu97 ,

Thanks you for the support. I'll keep an eye on 1.3.6 release.
Sorry if I hijacked this thread if the issue is not the same.

@jingxu97
Contributor

Hi @rvrignaud , actually there are still a few things not very clear to me
and I want to investigate. Could you please share the full kubelet log on
node gke-c2-default-pool-40cc47de-tun8 after it was created. (if log is
rolled over, could you also share the zip files?) Thanks a lot!

On Tue, Aug 16, 2016 at 11:25 PM, Romain Vrignaud notifications@github.com
wrote:

Hi @saad-ali https://github.com/saad-ali and @jingxu97
https://github.com/jingxu97 ,

Thanks you for the support. I'll keep an eye on 1.3.6 release.
Sorry if I hijacked this thread if the issue is not the same.

โ€”
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#29903 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ASSNxbH8dxoi1t1A_-EIcb7TjxOc7V_Oks5qgqlLgaJpZM4JaVMN
.

  • Jing
@jingxu97
Contributor

After checking the log, we found another related issue is if node no longer exists, node status updater fails to update all the rest of the nodes and cause kubelet receiving incorrect information from node status and acting upon it. This is fixed by part of PR #30737

@rvrignaud

Hi @saad-ali and @jingxu97,

We just upgraded both master and nodes to 1.3.6 today. All of our pods with a PD attached failed to start. All of them ...

For the same pod as #29903 (comment)

We got:

  19m       1m          9       {kubelet gke-c2-default-pool-40cc47de-tun8}             Warning     FailedMount     Unable to mount volumes for pod "prometheus-3753817817-dz1ff_kube-inf(d5953f2e-6f7b-11e6-ac06-42010af000f0)": timeout expired waiting for volumes to attach/mount for pod "prometheus-3753817817-dz1ff"/"kube-inf". list of unattached/unmounted volumes=[data]
  19m       1m          9       {kubelet gke-c2-default-pool-40cc47de-tun8}             Warning     FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "prometheus-3753817817-dz1ff"/"kube-inf". list of unattached/unmounted volumes=[data]

This is still on the same GKE cluster : c2 on zone europe-west1-d. I'll send you project-id and kubelet.log by slack.
Do not hesitate to ask for more details if needed

@saad-ali
Member
saad-ali commented Sep 7, 2016

@rvrignaud Thanks for the details. I took a look at your cluster. All your pods appear to be running normally at the moment and their volumes are attached/mounted properly. The master logs show no indications that you ran into #29358 again. So that issue appears to be resolved in v1.3.6.

That said, I'd still like to understand the issue you experienced immediately after this upgrade.

The timeout expired waiting for volumes to attach/mount is a generic umbrella error: in normal scenarios it will be posted every 2 minutes as a warning until the volumes are successfully attached and mounted (which could take 5-10 minutes in some cases). If the volume is unable to attached or mounted indefinitely then the pod will eventually go into failed state.

I took a look at the prometheus-3753817817-dz1ff pod. It was referencing volume c2-kube-inf-prometheus-data. The master kube-controller-manager appears to have restarted at 08:13:53.321620 and when it came up it tried to attach the volume. The volume was already attached so it proceeded accordingly:

I0903 08:13:53.449084       6 reconciler.go:170] Started AttachVolume for volume "kubernetes.io/gce-pd/c2-kube-inf-prometheus-data" to node "gke-c2-default-pool-40cc47de-tun8"
I0903 08:13:53.917183       6 attacher.go:87] Attach operation is successful. PD "c2-kube-inf-prometheus-data" is already attached to node "gke-c2-default-pool-40cc47de-tun8".
I0903 08:13:53.917210       6 operation_executor.go:484] AttachVolume.Attach succeeded for volume "kubernetes.io/gce-pd/c2-kube-inf-prometheus-data" (spec.Name: "data") from node "gke-c2-default-pool-40cc47de-tun8".

However, the kubelet failed to mount the volume, at first because updating the node object was slow:

E0903 08:35:54.553685    3751 nestedpendingoperations.go:254] Operation for "\"kubernetes.io/gce-pd/c2-kube-inf-prometheus-data\"" failed. No retries permitted until 2016-09-03 08:35:55.05366788 +0000 UTC (durationBeforeRetry 500ms). Error: Volume "kubernetes.io/gce-pd/c2-kube-inf-prometheus-data" (spec.Name: "data") pod "d5953f2e-6f7b-11e6-ac06-42010af000f0" (UID: "d5953f2e-6f7b-11e6-ac06-42010af000f0") has not yet been added to the list of VolumesInUse in the node's volume status.

But ultimately because the volume was not reported as attached according to the node status:

I0903 08:36:26.430040    3751 reconciler.go:220] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/c2-kube-inf-prometheus-data" (spec.Name: "data") pod "d5953f2e-6f7b-11e6-ac06-42010af000f0" (UID: "d5953f2e-6f7b-11e6-ac06-42010af000f0")
E0903 08:36:26.433097    3751 nestedpendingoperations.go:254] Operation for "\"kubernetes.io/gce-pd/c2-kube-inf-prometheus-data\"" failed. No retries permitted until 2016-09-03 08:36:58.43306514 +0000 UTC (durationBeforeRetry 32s). Error: Volume "kubernetes.io/gce-pd/c2-kube-inf-prometheus-data" (spec.Name: "data") pod "d5953f2e-6f7b-11e6-ac06-42010af000f0" (UID: "d5953f2e-6f7b-11e6-ac06-42010af000f0") is not yet attached according to node **status.**

Need to understand why ^this happened.

About an hour later, I presume because the pod was deleted and recreated, the volume was detached and reattached, and everything seems to be ok since:

I0903 09:44:32.090598       6 reconciler.go:115] Started DetachVolume for volume "kubernetes.io/gce-pd/c2-kube-inf-prometheus-data" from node "gke-c2-default-pool-40cc47de-tun8"
I0903 09:44:32.093015       6 operation_executor.go:619] Verified volume is safe to detach for volume "kubernetes.io/gce-pd/c2-kube-inf-prometheus-data" (spec.Name: "data") from node "gke-c2-default-pool-40cc47de-tun8".
I0903 09:44:32.523905       6 attacher.go:226] Detach operation is successful. PD "c2-kube-inf-prometheus-data" was not attached to node "gke-c2-default-pool-40cc47de-tun8".
I0903 09:44:32.523963       6 operation_executor.go:565] DetachVolume.Detach succeeded for volume "kubernetes.io/gce-pd/c2-kube-inf-prometheus-data" (spec.Name: "data") from node "gke-c2-default-pool-40cc47de-tun8".
I0903 09:44:33.193664       6 reconciler.go:170] Started AttachVolume for volume "kubernetes.io/gce-pd/c2-kube-inf-prometheus-data" to node "gke-c2-default-pool-40cc47de-tun8"
I0903 09:44:41.052550       6 operation_executor.go:484] AttachVolume.Attach succeeded for volume "kubernetes.io/gce-pd/c2-kube-inf-prometheus-data" (spec.Name: "data") from node "gke-c2-default-pool-40cc47de-tun8".
@rvrignaud

Hi @saad-ali ,

Thanks to had a look to the cluster.
What happened is that I upgraded the master (maybe that's why kube-controler-manager restarted ?) then the nodes to 1.3.6.

After a moment (a few dozen of minutes ?) pods where still not running, I did try to delete the pods (can't remember if they where still in containerCreating state or not) and they did scheduled and mount their volume in a "normal" amount of time.

@rvrignaud
rvrignaud commented Sep 20, 2016 edited

Hi @saad-ali and @jingxu97 ,

Did you have any chance to have a look of what happened ? Did you reproduce the bug ?

Regards,

@rvrignaud

@saad-ali and @jingxu97,

We hit this bug again during 1.3.7 GKE migration. 4 pods on 11 using PersistentDisk failed to be scheduled. This is really painful for us.

kubectl --namespace=infrastructure describe pods clair-2479896191-26ry6
Name:       clair-2479896191-26ry6
Namespace:      infrastructure
Node:       gke-c2-default-pool-9f928830-ospw/10.10.32.4
Start Time:     Sat, 03 Sep 2016 11:45:10 +0200
Labels:     name=clair
            pod-template-hash=2479896191
Status:     Pending
IP:
Controllers:    ReplicaSet/clair-2479896191
Containers:
  clair:
    Container ID:
    Image:          quay.io/coreos/clair:v1.2.3
    Image ID:
    Port:
    Args:
      -config
      /config/config.yaml
    Limits:
      cpu:      100m
      memory:   1500Mi
    Requests:
      cpu:              20m
      memory:           1500Mi
    State:              Waiting
      Reason:           ContainerCreating
    Last State:         Terminated
      Reason:           Error
      Exit Code:            1
      Started:          Sat, 03 Sep 2016 11:45:57 +0200
      Finished:         Sat, 03 Sep 2016 11:45:58 +0200
    Ready:              False
    Restart Count:          1
    Liveness:           tcp-socket :6061 delay=30s timeout=1s period=10s #success=1 #failure=3
    Readiness:          tcp-socket :6061 delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment Variables:      <none>
  postgres:
    Container ID:
    Image:          postgres:9.5
    Image ID:
    Port:
    Limits:
      cpu:      200m
      memory:   300Mi
    Requests:
      cpu:          200m
      memory:       300Mi
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:      0
    Environment Variables:
      POSTGRES_PASSWORD:        xxx
      POSTGRES_USER:        clair
      PGDATA:           /var/lib/postgresql/data/pgdata
Conditions:
  Type      Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  inf-clair-db:
    Type:       GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:     inf-clair-db
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
  config:
    Type:       ConfigMap (a volume populated by a ConfigMap)
    Name:       clair
  default-token-mi2id:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-mi2id
QoS Tier:       Burstable
Events:
  FirstSeen     LastSeen        Count   From                        SubobjectPath   Type        Reason      Message
  ---------     --------        -----   ----                        -------------   --------        ------      -------
  18m       1m          9       {kubelet gke-c2-default-pool-9f928830-ospw}             Warning     FailedMount     Unable to mount volumes for pod "clair-2479896191-26ry6_infrastructure(f56ba462-71ba-11e6-95ab-42010af000f0)": timeout expired waiting for volumes to attach/mount for pod "clair-2479896191-26ry6"/"infrastructure". list of unattached/unmounted volumes=[inf-clair-db]
  18m       1m          9       {kubelet gke-c2-default-pool-9f928830-ospw}             Warning     FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "clair-2479896191-26ry6"/"infrastructure". list of unattached/unmounted volumes=[inf-clair-db]

project-id : clustree-cloud / zone : europe-west1-d / cluster : c2
Sending kubelet.log over slack

Do not hesitate if we can send you more informations.

@jingxu97
Contributor

Checked the log provided by @rvrignaud, and it hits the PR #32807, which should be checked in release 1.4.1.

@saad-ali, could we also consider backport this fix to 1.3 release?

@Hokutosei
Contributor
Hokutosei commented Oct 4, 2016 edited

I am experiencing this very irritating problem again. just upgraded a GKE from 1.3.7 to 1.4. i have a working cluster mongodb, and all are stuck in ContainerCreating I tried deleting some pods, and it wont spin up new mongodb pods.

Events:
  FirstSeen LastSeen    Count   From                                SubobjectPath       Type        Reason      Message
  --------- --------    -----   ----                                -------------       --------    ------      -------
  52m       52m     1   {kubelet gke-beeedev-cluster-default-pool-86c239e7-4rf2}    spec.containers{mongo}  Normal      Killing     Killing container with docker id d28a167aa8ec: Need to kill pod.
  52m       52m     1   {kubelet gke-beeedev-cluster-default-pool-86c239e7-4rf2}                Warning     FailedSync  Error syncing pod, skipping: failed to "KillContainer" for "mongo" with KillContainerError: "Cannot connect to the Docker daemon. Is the docker daemon running on this host?"

  45m   1m  21  {kubelet gke-beeedev-cluster-default-pool-86c239e7-4rf2}        Warning FailedMount Unable to mount volumes for pod "a-mongo-0_default(5d120abb-66e9-11e6-942f-42010af00048)": timeout expired waiting for volumes to attach/mount for pod "a-mongo-0"/"default". list of unattached/unmounted volumes=[data]
  45m   1m  21  {kubelet gke-beeedev-cluster-default-pool-86c239e7-4rf2}        Warning FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "a-mongo-0"/"default". list of unattached/unmounted volumes=[data]

i am stuck with 1 pod, how am i going to return the working cluster?

NAME                          READY     STATUS              RESTARTS   AGE
a-mongo-0                     0/1       ContainerCreating   4          45d
@Hokutosei
Contributor

it seems to be trying to mount a wrong pvc name? 5d120abb-66e9-11e6-942f-42010af00048. there is no pvc like that name if that's the case

NAME             STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
data-a-mongo-0   Bound     pvc-41685d71-662a-11e6-8641-42010af00048   10Gi       RWO           46d
data-a-mongo-1   Bound     pvc-416a60a7-662a-11e6-8641-42010af00048   10Gi       RWO           46d
data-a-mongo-2   Bound     pvc-416c51bc-662a-11e6-8641-42010af00048   10Gi       RWO           46d
data-a-redis-0   Bound     pvc-a2784a4d-68f5-11e6-942f-42010af00048   15Gi       RWO           42
@jingxu97
Contributor
jingxu97 commented Oct 4, 2016

@Hokutosei, 5d120abb-66e9-11e6-942f-42010af00048 is the pod uid, not a pvc name. Could you please share your kebelet log and your GKE cluster's project id, zone information?

Right now, there is one bug fix #32807 has not yet checked in the release which might cause your problem. I can double check and also try to push the fix out asap. Sorry for the inconvenience.

@Hokutosei
Contributor

@jingxu97 would love to share it in private instead? I'd get a permission first. Hopefully you can fix this, I still have a pending upgrades to our gke production environment. Thanks for the response

@jingxu97
Contributor
jingxu97 commented Oct 5, 2016

@jeanepaul Sure. you can reach me through slack @jinxu or email
jinxu@google.com.

On Tue, Oct 4, 2016 at 4:46 PM, jeanepaul notifications@github.com wrote:

@jingxu97 https://github.com/jingxu97 would love to share it in private
instead? I'd get a permission first. Hopefully you can fix this, I still
have a pending upgrades to our gke production environment. Thanks for the
response

โ€”
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#29903 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ASSNxVe-12jU46-fZe_rRyf8ropP1pSZks5qwuVjgaJpZM4JaVMN
.

  • Jing
@aldenks
aldenks commented Oct 12, 2016

Hello, thanks for all your hard work kubernetes team! I'm having very similar problems (Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod...). This happens fairly often because we're using preemptible instances with GKE. I can usually "fix" it by manually deleting the pod and letting the deployment recreate it.

I see that some of the commits/PRs referenced here are merged, but can't tell which minor version they will be released with. Is there a version/release with the fix for this I can watch out for?

@jingxu97
Contributor
jingxu97 commented Oct 12, 2016 edited

Hi @aldenks, thank you for your information. As now, quite some PR are already merged in 1.4.1, but still a couple of important ones are still pending such as #33616. It should be in very soon.
For your case, could you please share more details about the issue so that I can double check whether our fixes cover your issue. It would be very helpful if you could let us know in what cases you notice the problem, or share some logs (kubelet and kube-controller-manager) with us.

Since your are using GKE, you can also email me jinxu@google.com or find me at slack @jinxu about your cluster so that I can take a look. Thanks!

Jing

@Stono
Stono commented Oct 16, 2016

Hi - I have just upgraded to 1.4.1 and have this exact issue.
Please advise quickly as currently our services are down.

@jingxu97
Contributor

Hi @stono, sorry for the inconvenience. I think delete the pods and then recreate them should solve the problem. Please let me know if it is not working or you have other issues for doing this way.

The fix for this problem is currently under review #34859.

@jingxu97
Contributor

Also @stono, if you want to make sure, I can help take a look at your logs to double check the root cause of your issue. You can share your cluster information (project name, cluster name, zone) to my email jinxu@google.com or slack on @jinxu if it is ok. Thanks!

@Stono
Stono commented Oct 24, 2016

We had the same thing agains yesterday, going from 1.4.3 -> 1.4.4, time out on mounts, specifically PV claims on dynamically provision gcePersistentDisks

Deleting the deployment and recreating it fixed the issue - however this is a pretty consistent failure.

@aldenks
aldenks commented Oct 25, 2016

It seems as though the fix (#34859) is still under review. We are also effected by this and it is causing downtime in production daily.

@jingxu97
Contributor
jingxu97 commented Nov 1, 2016

#34859 is already merged to the master. It should be in 1.4.6 branch. Please kindly let me know if you continue experience any issues related to mounts. Thank you!

@simonhorlick

I'm using v1.4.6 on GKE and I'm seeing a similar issue. I have a PetSet that creates a few pods and PVs, but if a node dies and the pod gets rescheduled the new pod is unable to mount its PV. I can see in the compute engine console that the PV is attached to the new node, it's listed under /dev/disk/by-id, but yet VerifyControllerAttachedVolume never seems to recognise it:

"I1124 18:35:33.206128    1274 reconciler.go:229] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/gke-default-e5f91d80-d-pvc-c222811d-b0bc-11e6-ba25-42010af000f2" (spec.Name: "pvc-c222811d-b0bc-11e6-ba25-42010af000f2") pod "9553076c-b16d-11e6-ba25-42010af000f2" (UID: "9553076c-b16d-11e6-ba25-42010af000f2")"
"E1124 18:35:33.211726    1274 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/gce-pd/gke-default-e5f91d80-d-pvc-c222811d-b0bc-11e6-ba25-42010af000f2\"" failed. No retries permitted until 2016-11-24 18:37:33.211683109 +0000 UTC (durationBeforeRetry 2m0s). Error: Volume "kubernetes.io/gce-pd/gke-default-e5f91d80-d-pvc-c222811d-b0bc-11e6-ba25-42010af000f2" (spec.Name: "pvc-c222811d-b0bc-11e6-ba25-42010af000f2") pod "9553076c-b16d-11e6-ba25-42010af000f2" (UID: "9553076c-b16d-11e6-ba25-42010af000f2") is not yet attached according to node status."
@Stono
Stono commented Dec 19, 2016

@simonhorlick slight side question, PetSet on GKE? As it's still alpha are you using the "Alpha Clusters" which are not upgradable and last 30days? Or do you have some other ninja way of enabling PetSet

@simonhorlick

@Stono Correct, I was testing on an alpha cluster.

@Stono
Stono commented Dec 19, 2016

@simonhorlick cool - I don't know if you're aware but 1.5.1 is available for 'new' clusters now

https://groups.google.com/forum/#!topic/gke-release-notes/kl5KAr-1oeA

@munjalpatel

I am also having similar issues on Google Container Engine.
Here are the errors in the "Replica Set" view:

Unable to mount volumes for pod "wordpress-mysql-2379610080-if843_munpat-wordpress(1b30f418-dd1c-11e6-8301-42010a8a0163)": timeout expired waiting for volumes to attach/mount for pod "wordpress-mysql-2379610080-if843"/"munpat-wordpress". list of unattached/unmounted volumes=[mysql-persistent-storage]

Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "wordpress-mysql-2379610080-if843"/"munpat-wordpress". list of unattached/unmounted volumes=[mysql-persistent-storage]

I have a 3 node cluster in Google Container Engine. 2 of 3 nodes are preemptible. This error happens every 24 hours or so when the nodes get removed and created again. But if I manually delete one of the preemptible node to test the failover, I have so far not seen these errors.

Here is my setup:

gce-volumes.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  namespace: munpat-wordpress
  name: wordpress-pv-1
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  gcePersistentDisk:
    pdName: munpat-wordpress-1
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolume
metadata:
  namespace: munpat-wordpress
  name: wordpress-pv-2
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  gcePersistentDisk:
    pdName: munpat-wordpress-2
    fsType: ext4

mysql-deployment.yaml

apiVersion: v1
kind: Service
metadata:
  namespace: munpat-wordpress
  name: wordpress-mysql
  labels:
    app: wordpress
spec:
  ports:
    - port: 3306
  selector:
    app: wordpress
    tier: mysql
  clusterIP: None
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: munpat-wordpress
  name: mysql-pv-claim
  labels:
    app: wordpress
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: munpat-wordpress
  name: wordpress-mysql
  labels:
    app: wordpress
spec:
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: wordpress
        tier: mysql
    spec:
      containers:
      - image: mysql:5.6
        name: mysql
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-pass
              key: password.txt
        ports:
        - containerPort: 3306
          name: mysql
        volumeMounts:
        - name: mysql-persistent-storage
          mountPath: /var/lib/mysql
      volumes:
      - name: mysql-persistent-storage
        persistentVolumeClaim:
          claimName: mysql-pv-claim

wordpress-deployment.yaml

apiVersion: v1
kind: Service
metadata:
  namespace: munpat-wordpress
  name: wordpress
  labels:
    app: wordpress
spec:
  ports:
    - port: 80
  selector:
    app: wordpress
    tier: frontend
  type: LoadBalancer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: munpat-wordpress
  name: wp-pv-claim
  labels:
    app: wordpress
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: munpat-wordpress
  name: wordpress
  labels:
    app: wordpress
spec:
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: wordpress
        tier: frontend
    spec:
      containers:
      - image: wordpress
        name: wordpress
        env:
        - name: WORDPRESS_DB_HOST
          value: wordpress-mysql
        - name: WORDPRESS_DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-pass
              key: password.txt
        ports:
        - containerPort: 80
          name: wordpress
        volumeMounts:
        - name: wordpress-persistent-storage
          mountPath: /var/www/html
      volumes:
      - name: wordpress-persistent-storage
        persistentVolumeClaim:
          claimName: wp-pv-claim
@jingxu97
Contributor

@munjalpatel sorry for the late response. Are you still have the error every 24 hours? What version of kubernetes in your cluster? Thanks!

@munjalpatel

@jingxu97 Yes, I still have these errors. I have to manually delete the nodes everyday to fix the issue. All three nodes are currently on: 1.4.8.

@jingxu97
Contributor

@munjalpatel Since we have quite some fixes since 1.4.8. Could you please try the latest version and let us know whether you still have the problem? If you need, you could also share with me your zone, project and cluster name so that I can take a look at the controller log.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment