Pet Set stuck in ContainerCreating #28709

Closed
chrislovecnm opened this Issue Jul 8, 2016 · 26 comments

@chrislovecnm
Member

A previously healthy pod was running. It became unhealthy for some reason, and I did a silly thing: I deleted the pod. Now it is stuck:

Volumes:
  cassandra-analytics:
    Type:   PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  cassandra-analytics-cassandra-analytics-112
    ReadOnly:   false
  default-token-mu6y5:
    Type:   Secret (a volume populated by a Secret)
    SecretName: default-token-mu6y5
QoS Tier:   Guaranteed
Events:
  FirstSeen LastSeen    Count   From                    SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----                    -------------   --------    ------      -------
  14d       22s     9100    {kubelet kubernetes-minion-group-0jg9}          Warning     FailedMount Unable to mount volumes for pod "cassandra-analytics-112_default(13171b3b-3a50-11e6-b7d0-42010a800002)": timeout expired waiting for volumes to attach/mount for pod "cassandra-analytics-112"/"default". list of unattached/unmounted volumes=[cassandra-analytics]
  14d       22s     9100    {kubelet kubernetes-minion-group-0jg9}          Warning     FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "cassandra-analytics-112"/"default". list of unattached/unmounted volumes=[cassandra-analytics]

Can we document the process for getting this unstuck? I am guessing this is a bug, because the PV has data in it. Here is the PV:

$ kubectl describe pv pvc-a5f45f53-38f0-11e6-b7d0-42010a800002
Name:       pvc-a5f45f53-38f0-11e6-b7d0-42010a800002
Labels:     failure-domain.beta.kubernetes.io/region=us-central1
        failure-domain.beta.kubernetes.io/zone=us-central1-c
Status:     Bound
Claim:      default/cassandra-analytics-cassandra-analytics-112
Reclaim Policy: Delete
Access Modes:   RWO
Capacity:   380Gi
Message:
Source:
    Type:   GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName: kubernetes-dynamic-pvc-a5f45f53-38f0-11e6-b7d0-42010a800002
    FSType:
    Partition:  0
    ReadOnly:   false
No events.

FYI I am on a 1.3 beta without the last-minute volume fixes. I can try to recreate this with the new volume code, but I don't think it will make a difference.
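For reference, a rough way to inspect the Kubernetes-side attach state from the CLI (a sketch; the claim, PV, and node names below are taken from this report, so adjust them for your cluster):

```shell
# Check the claim and the bound PV.
kubectl get pvc cassandra-analytics-cassandra-analytics-112
kubectl describe pv pvc-a5f45f53-38f0-11e6-b7d0-42010a800002

# The kubelet will not mount until the attach is reflected in node status;
# volumesAttached / volumesInUse show what the node believes is attached.
kubectl get node kubernetes-minion-group-0jg9 -o yaml | grep -A 5 'volumes'
```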

@bprashanth
Contributor

Hmm, that event fired 9100 times; you're probably hitting one of the races we fixed in the last phases of 1.3. Can you confirm through the GCE UI that the PD kubernetes-dynamic-pvc-a5f45f53-38f0-11e6-b7d0-42010a800002 is attached to the node kubernetes-minion-group-0jg9?
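The GCE-side check can also be done from the CLI rather than the UI (a sketch; the disk name is from this thread and the zone is from the PV's labels above):

```shell
# `users` lists the instances this persistent disk is currently attached to.
gcloud compute disks describe \
  kubernetes-dynamic-pvc-a5f45f53-38f0-11e6-b7d0-42010a800002 \
  --zone us-central1-c --format 'value(users)'
```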

@chrislovecnm
Member
chrislovecnm commented Jul 8, 2016 edited

It is attached according to the GCE UI.

kubernetes-dynamic-pvc-a5f45f53-38f0-11e6-b7d0-42010a800002 SSD persistent disk 380 GB us-central1-c kubernetes-minion-group-0jg9

@chrislovecnm
Member

Nudge: how can we debug this?

@chrislovecnm
Member

@bprashanth ping ... I want to kill this cluster. Do we need to debug this at all?

@iazarny
iazarny commented Jul 20, 2016

The same behavior occurs when a ConfigMap is mounted as a volume to read some config. K8s creates a service-account token, which appears in the list of unattached/unmounted volumes after restarting the pods.

@bprashanth
Contributor

I think these are two different bugs; the former is a bug in the PV provisioner that we fixed in later stages of 1.3 (I believe Chris was running a beta build). @iazarny can you give me a petset yaml to repro?

@ShengjieLuo

Hi, I ran into a similar problem after deleting a pod and restarting it.
See the kubectl output here:

NAME                                     READY        STATUS              RESTARTS                       AGE            IP           NODE
965948e462c21505447838bd4b4d8f89-38ab9   0/1          ContainerCreating   0                              47m            <none>       127.0.0.1
etcd-bmnds                               1/1          Running             0                              8d             172.17.0.8   127.0.0.1
fdc1f548472ead805c457dd82ebdf7c6-hbwry   0/1          ContainerCreating   0                              47m            <none>       127.0.0.1
pachd-j1tcj                              1/1          Running             10                             8d             172.17.0.7   127.0.0.1
pachd-jxl0q                              1/1          Running             10                             8d             172.17.0.4   127.0.0.1
rethink-jxb3q                            1/1          Running             0                              8d             172.17.0.3   127.0.0.1

See debug information here.

Name:           fdc1f548472ead805c457dd82ebdf7c6-hbwry
Namespace:      default
Node:           127.0.0.1/127.0.0.1
Start Time:     Thu, 21 Jul 2016 10:32:03 +0800
Labels:         app=fdc1f548472ead805c457dd82ebdf7c6
                suite=pachyderm
Status:         Pending
IP:
Controllers:    Job/fdc1f548472ead805c457dd82ebdf7c6
Containers:
  user:
    Container ID:
    Image:              fruit_stand
    Image ID:
    Port:
    Command:
      /job-shim
      fdc1f548472ead805c457dd82ebdf7c6
    State:              Waiting
      Reason:           ContainerCreating
    Ready:              False
    Restart Count:      0
    Environment Variables:
      PACH_OUTPUT_COMMIT_ID:    603dd21753b048d3b390b6fa06d0d5b0
      PACH_DATA_COMMIT_ID:      6095a2a6faf6485fb7873f5dfd4558a8
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  default-token-bm3z3:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-bm3z3
QoS Tier:       BestEffort
Events:
  FirstSeen     LastSeen        Count   From                    SubobjectPath   Type            Reason          Message
  ---------     --------        -----   ----                    -------------   --------        ------          -------
  40m           40m             1       {default-scheduler }                    Normal          Scheduled       Successfully assigned fdc1f548472ead805c457dd82ebdf7c6-hbwry to 127.0.0.1
  38m           39s             18      {kubelet 127.0.0.1}                     Warning         FailedMount     Unable to mount volumes for pod "fdc1f548472ead805c457dd82ebdf7c6-hbwry_default(4f6bd8f2-4eeb-11e6-b885-001e67548707)": timeout expired waiting for volumes to attach/mount for pod "fdc1f548472ead805c457dd82ebdf7c6-hbwry"/"default". list of unattached/unmounted volumes=[default-token-bm3z3]
  38m           39s             18      {kubelet 127.0.0.1}                     Warning         FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "fdc1f548472ead805c457dd82ebdf7c6-hbwry"/"default". list of unattached/unmounted volumes=[default-token-bm3z3]

How can this be solved? Thanks.

@ShengjieLuo ShengjieLuo referenced this issue in pachyderm/pachyderm Jul 21, 2016
Closed

Stuck in "Pulling State" after delete-all #674

@saad-ali
Member

@ShengjieLuo You are likely hitting #28750. PR #28939 should help (the patch will be in k8s v1.3.4).

@Hokutosei
Contributor

I am experiencing this problem too...

Events:
  FirstSeen LastSeen    Count   From                                SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----                                -------------   --------    ------      -------
  13m       13m     1   {default-scheduler }                                Normal      Scheduled   Successfully assigned a-mongo-1 to gke-beeedev-cluster-default-pool-86c239e7-4rf2
  11m       38s     6   {kubelet gke-beeedev-cluster-default-pool-86c239e7-4rf2}            Warning     FailedMount Unable to mount volumes for pod "a-mongo-1_default(eca29be0-629c-11e6-8641-42010af00048)": timeout expired waiting for volumes to attach/mount for pod "a-mongo-1"/"default". list of unattached/unmounted volumes=[data]
  11m       38s     6   {kubelet gke-beeedev-cluster-default-pool-86c239e7-4rf2}            Warning     FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "a-mongo-1"/"default". list of unattached/unmounted volumes=[data]

The weekend passed, and I tried deleting the pod again today, but it's still the same...

kubectl get pods
NAME                           READY     STATUS              RESTARTS   AGE
a-mongo-0                      1/1       Running             0          2d
a-mongo-1                      0/1       ContainerCreating   0          14m
a-mongo-2                      1/1       Running             0          2d
@chrislovecnm
Member

@bprashanth how do we debug this more?

@Hokutosei what version are you using, and what cloud are you on?

@Hokutosei
Contributor

@chrislovecnm thanks for the reply. I am on GKE (Google Container Engine); both master and nodes are on 1.3.3. I am waiting for the 1.3.4 update, which will hopefully fix this. I am on asia-east1-a (master + nodes), but the 1.3.4 upgrade isn't available yet...

@solsson
solsson commented Aug 25, 2016 edited

I'm seeing the same error (or at least the same error message) in GKE after upgrade to 1.3.5. Created the PetSet in 1.3.4 using https://github.com/Yolean/kubernetes-mysql-cluster/blob/master/50mariadb.yml. Pods restarted fine in 1.3.4.

Is there any way to get more details on the error timeout expired waiting for volumes to attach/mount for pod?

  22m       21s     11  {kubelet gke-yo2-default-pool-fb17f416-r2sn}            Warning     FailedMount Unable to mount volumes for pod "mariadb-0_mysql(2879728f-6a91-11e6-bd82-42010a840025)": timeout expired waiting for volumes to attach/mount for pod "mariadb-0"/"mysql". list of unattached/unmounted volumes=[datadir]
$ kubectl get pvc
NAME                STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
datadir-mariadb-0   Bound     pvc-8b46765d-6619-11e6-8405-42010a840025   1Gi        RWO           5d
datadir-mariadb-1   Bound     pvc-b025c0b5-661a-11e6-8405-42010a840025   1Gi        RWO           5d

What happens if the PetSet resource is replaced, for example to change env or arguments? How does K8s match the volumes?

Edit: Works now, after petset+pod deletion and new kubectl create -f. This was probably a case of pod termination and startup timing during cluster upgrade.

@ceefour
ceefour commented Aug 28, 2016 edited

Also got this issue with both v1.3.4 and v1.3.5. I originally reported it in https://googlecloudplatform.uservoice.com/forums/302595-compute-engine/suggestions/15838738-timeout-expired-waiting-for-volumes-to-attach-moun . It happened using a ReplicationController and still happened after I switched to a Deployment (of a ReplicaSet).

The pod has been stuck in this error state for 4 days, with 2631 repeated errors in the process. I will be forced to recreate the pod. :(

>kubectl describe po mongo-rc0-iosry
Name:           mongo-rc0-iosry
Namespace:      default
Node:           gke-fatih-small-pool-59881027-h46g/-
Start Time:     Tue, 23 Aug 2016 19:59:08 +0700
Labels:         instance=fatih0
                mongo-rs-name=bippo
                name=mongo-node0
Status:         Pending
IP:
Controllers:    ReplicationController/mongo-rc0
Containers:
  mongo-node0:
    Container ID:
    Image:              mongo
    Image ID:
    Port:               27017/TCP
    Command:
      /bin/sh
      -c
    Args:
      cp /etc/mongo-keyfile/mongo.keyfile /etc/mongo.keyfile && chmod 600 /etc/mongo.keyfile && chown mongodb:mongodb /etc/mongo.keyfile && mongod --replSet bippo --keyFile /etc/mongo.keyfile
    Requests:
      cpu:              100m
    State:              Waiting
      Reason:           ContainerCreating
    Ready:              False
    Restart Count:      0
    Environment Variables:      <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  mongo-keyfile:
    Type:       Secret (a volume populated by a Secret)
    SecretName: mongo-keyfile
  mongo-persistent-storage0:
    Type:       GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:     mongodb-disk0
    FSType:
    Partition:  0
    ReadOnly:   false
  default-token-i0lox:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-i0lox
QoS Tier:       Burstable
Events:
  FirstSeen     LastSeen        Count   From                                            SubobjectPath   Type            Reason          Message
  ---------     --------        -----   ----                                            -------------   --------        ------          -------
  4d            1m              2631    {kubelet gke-fatih-small-pool-59881027-h46g}                    Warning         FailedMount     Unable to mount volumes for pod "mongo-rc0-iosry_default(60ec9427-6931-11e6-b129-42010af0011e)": timeout expired waiting for volumes to attach/mount for pod "mongo-rc0-iosry"/"default". list of unattached/unmounted volumes=[mongo-persistent-storage0]
  4d            1m              2631    {kubelet gke-fatih-small-pool-59881027-h46g}                    Warning         FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "mongo-rc0-iosry"/"default". list of unattached/unmounted volumes=[mongo-persistent-storage0]

(Unfortunately, UserVoice didn't preserve text indentation. I don't know who at Google suggested UserVoice over Stack Overflow for reporting issues, but I can tell it's a terrible idea.)

I have:

  1. upgraded to v1.3.5
  2. changed from g1-small pool to n1-standard-1 pool
  3. switched from ReplicationController to Deployment (ReplicaSet)
  4. recreated the pod countless number of times
  5. deleted the original PD and recreated

... problem persists (tell me what I haven't tried). Here's the current state:

>kubectl describe po mongo1-1619551385-ie291
Name:           mongo1-1619551385-ie291
Namespace:      default
Node:           gke-fatih-standard-fb894cbb-sc1j/x
Start Time:     Sun, 28 Aug 2016 13:46:56 +0700
Labels:         instance=fatih1
                mongo-rs-name=bippo
                pod-template-hash=1619551385
Status:         Pending
IP:
Controllers:    ReplicaSet/mongo1-1619551385
Containers:
  mongo:
    Container ID:
    Image:              mongo
    Image ID:
    Port:               27017/TCP
    Command:
      /bin/sh
      -c
    Args:
      cp /etc/mongo-keyfile/mongo.keyfile /etc/mongo.keyfile && chmod 600 /etc/mongo.keyfile && chown mongodb:mongodb /etc/mongo.keyfile && mongod --replSet bippo --keyFile /etc/mongo.keyfile
    Requests:
      cpu:                      100m
      memory:                   2Gi
    State:                      Waiting
      Reason:                   ContainerCreating
    Ready:                      False
    Restart Count:              0
    Environment Variables:      <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  mongo-keyfile:
    Type:       Secret (a volume populated by a Secret)
    SecretName: mongo-keyfile
  mongo-persistent-storage1:
    Type:       GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:     mongodb-disk1
    FSType:
    Partition:  0
    ReadOnly:   false
  default-token-i0lox:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-i0lox
QoS Tier:       Burstable
Events:
  FirstSeen     LastSeen        Count   From                    SubobjectPath   Type            Reason
Message
  ---------     --------        -----   ----                    -------------   --------        ------
-------
  26m           26m             2       {default-scheduler }                    Warning         FailedScheduling
pod (mongo1-1619551385-ie291) failed to fit in any node
fit failure on node (gke-fatih-standard-fb894cbb-d7ue): Insufficient Memory
fit failure on node (gke-fatih-standard-fb894cbb-sc1j): NoDiskConflict

  26m   26m     4       {default-scheduler }            Warning FailedScheduling        pod (mongo1-1619551385-ie291) failed to fit in any node
fit failure on node (gke-fatih-standard-fb894cbb-d7ue): Insufficient Memory
fit failure on node (gke-fatih-standard-fb894cbb-sc1j): Insufficient Memory

  25m   25m     1       {default-scheduler }                                    Normal  Scheduled   Successfully assigned mongo1-1619551385-ie291 to gke-fatih-standard-fb894cbb-sc1j
  23m   1m      11      {kubelet gke-fatih-standard-fb894cbb-sc1j}              Warning FailedMount Unable to mount volumes for pod "mongo1-1619551385-ie291_default(23bfe726-6ceb-11e6-98ac-42010af00085)": timeout expired waiting for volumes to attach/mount for pod "mongo1-1619551385-ie291"/"default". list of unattached/unmounted volumes=[mongo-persistent-storage1]
  23m   1m      11      {kubelet gke-fatih-standard-fb894cbb-sc1j}              Warning FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "mongo1-1619551385-ie291"/"default". list of unattached/unmounted volumes=[mongo-persistent-storage1]

(Please ignore Insufficient Memory; it's an intentionally simplistic constraint to prevent two mongodb instances on the same node.)

Nobody is using the "disk" (I'd argue it's a drive, since it's an SSD):

>gcloud compute disks describe mongodb-disk1
creationTimestamp: '2016-08-27T23:45:56.010-07:00'
id: '5239705523257816651'
kind: compute#disk
name: mongodb-disk1
selfLink: https://www.googleapis.com/compute/v1/projects/bippo-141108/zones/us-east1-d/disks/mongodb-disk1
sizeGb: '100'
status: READY
type: https://www.googleapis.com/compute/v1/projects/bippo-141108/zones/us-east1-d/diskTypes/pd-ssd
zone: https://www.googleapis.com/compute/v1/projects/bippo-141108/zones/us-east1-d
@ceefour
ceefour commented Aug 28, 2016

Tried recreating the Deployment again, just for kicks, and it's still happening:

>kubectl describe po mongo1-1619551385-x0u9x
Name:           mongo1-1619551385-x0u9x
Namespace:      default
Node:           gke-fatih-standard-fb894cbb-sc1j/x
Start Time:     Sun, 28 Aug 2016 14:22:49 +0700
Labels:         instance=fatih1
                mongo-rs-name=bippo
                pod-template-hash=1619551385
Status:         Pending
IP:
Controllers:    ReplicaSet/mongo1-1619551385
Containers:
  mongo:
    Container ID:
    Image:              mongo
    Image ID:
    Port:               27017/TCP
    Command:
      /bin/sh
      -c
    Args:
      cp /etc/mongo-keyfile/mongo.keyfile /etc/mongo.keyfile && chmod 600 /etc/mongo.keyfile && chown mongodb:mongodb /etc/mongo.keyfile && mongod --replSet bippo --keyFile /etc/mongo.keyfile
    Requests:
      cpu:                      100m
      memory:                   2Gi
    State:                      Waiting
      Reason:                   ContainerCreating
    Ready:                      False
    Restart Count:              0
    Environment Variables:      <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  mongo-keyfile:
    Type:       Secret (a volume populated by a Secret)
    SecretName: mongo-keyfile
  mongo-persistent-storage1:
    Type:       GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:     mongodb-disk1
    FSType:
    Partition:  0
    ReadOnly:   false
  default-token-i0lox:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-i0lox
QoS Tier:       Burstable
Events:
  FirstSeen     LastSeen        Count   From                                            SubobjectPath   Type
Reason          Message
  ---------     --------        -----   ----                                            -------------   --------
------          -------
  2m            2m              1       {default-scheduler }                                        Normal
Scheduled       Successfully assigned mongo1-1619551385-x0u9x to gke-fatih-standard-fb894cbb-sc1j
  22s           22s             1       {kubelet gke-fatih-standard-fb894cbb-sc1j}                  Warning
FailedMount     Unable to mount volumes for pod "mongo1-1619551385-x0u9x_default(39860491-6cf0-11e6-98ac-42010af00085)": timeout expired waiting for volumes to attach/mount for pod "mongo1-1619551385-x0u9x"/"default". list of unattached/unmounted volumes=[mongo-persistent-storage1]
  22s           22s             1       {kubelet gke-fatih-standard-fb894cbb-sc1j}                  Warning
FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "mongo1-1619551385-x0u9x"/"default". list of unattached/unmounted volumes=[mongo-persistent-storage1]

I have a gut feeling, though, that this has something to do with the missing fsType: ext4?

@ceefour
ceefour commented Aug 28, 2016 edited

@chrislovecnm @Hokutosei I think the issue title needs to mention that this originally concerned Pet Set but is potentially not limited to it; it also affects ReplicationController and ReplicaSet. The title also needs to signify that this is about FailedMount of a GCE Persistent Disk (whether a direct volume or via a PVC), not about the ContainerCreating state (which can mean many things).

Unfortunately it's not because of missing fsType, adding fsType: ext4 has no effect.

>kubectl describe po mongo1-3639502366-ck47a
Name:           mongo1-3639502366-ck47a
Namespace:      default
Node:           gke-fatih-standard-fb894cbb-sc1j/x
Start Time:     Sun, 28 Aug 2016 14:27:01 +0700
Labels:         instance=fatih1
                mongo-rs-name=bippo
                pod-template-hash=3639502366
Status:         Pending
IP:
Controllers:    ReplicaSet/mongo1-3639502366
Containers:
  mongo:
    Container ID:
    Image:              mongo
    Image ID:
    Port:               27017/TCP
    Command:
      /bin/sh
      -c
    Args:
      cp /etc/mongo-keyfile/mongo.keyfile /etc/mongo.keyfile && chmod 600 /etc/mongo.keyfile && chown mongodb:mongodb /etc/mongo.keyfile && mongod --replSet bippo --keyFile /etc/mongo.keyfile
    Requests:
      cpu:                      100m
      memory:                   2Gi
    State:                      Waiting
      Reason:                   ContainerCreating
    Ready:                      False
    Restart Count:              0
    Environment Variables:      <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  mongo-keyfile:
    Type:       Secret (a volume populated by a Secret)
    SecretName: mongo-keyfile
  mongo-persistent-storage1:
    Type:       GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:     mongodb-disk1
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
  default-token-i0lox:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-i0lox
QoS Tier:       Burstable
Events:
  FirstSeen     LastSeen        Count   From                    SubobjectPath   Type            Reason
Message
  ---------     --------        -----   ----                    -------------   --------        ------
-------
  6m            6m              1       {default-scheduler }                    Warning         FailedScheduling
pod (mongo1-3639502366-ck47a) failed to fit in any node
fit failure on node (gke-fatih-standard-fb894cbb-sc1j): Insufficient Memory
fit failure on node (gke-fatih-standard-fb894cbb-d7ue): Insufficient Memory

  6m    6m      1       {default-scheduler }            Warning FailedScheduling        pod (mongo1-3639502366-ck47a) failed to fit in any node
fit failure on node (gke-fatih-standard-fb894cbb-sc1j): NoDiskConflict
fit failure on node (gke-fatih-standard-fb894cbb-d7ue): Insufficient Memory

  6m    6m      1       {default-scheduler }                                    Normal  Scheduled   Successfully assigned mongo1-3639502366-ck47a to gke-fatih-standard-fb894cbb-sc1j
  4m    1m      2       {kubelet gke-fatih-standard-fb894cbb-sc1j}              Warning FailedMount Unable to mount volumes for pod "mongo1-3639502366-ck47a_default(cf178aa2-6cf0-11e6-98ac-42010af00085)": timeout expired waiting for volumes to attach/mount for pod "mongo1-3639502366-ck47a"/"default". list of unattached/unmounted volumes=[mongo-persistent-storage1]
  4m    1m      2       {kubelet gke-fatih-standard-fb894cbb-sc1j}              Warning FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "mongo1-3639502366-ck47a"/"default". list of unattached/unmounted volumes=[mongo-persistent-storage1]

Latest /var/log/kubelet.log on the node:

I0828 07:29:49.862778    3419 server.go:959] GET /healthz: (38.033µs) 200 [[curl/7.26.0] 127.0.0.1:45674]
I0828 07:29:59.876687    3419 server.go:959] GET /healthz: (33.818µs) 200 [[curl/7.26.0] 127.0.0.1:45684]
I0828 07:30:03.971510    3419 reconciler.go:254] MountVolume operation started for volume "kubernetes.io/secret/349883c9-6c66-11e6-98ac-42010af00085-default-token-sb5ix" (spec.Name: "default-token-sb5ix") to pod "349883c9-6c66-11e6-98ac-42010af00085" (UID: "349883c9-6c66-11e6-98ac-42010af00085"). Volume is already mounted to pod, but remount was requested.
I0828 07:30:03.974620    3419 operation_executor.go:740] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/349883c9-6c66-11e6-98ac-42010af00085-default-token-sb5ix" (spec.Name: "default-token-sb5ix") pod "349883c9-6c66-11e6-98ac-42010af00085" (UID: "349883c9-6c66-11e6-98ac-42010af00085").
I0828 07:30:04.511484    3419 container_manager_linux.go:614] Found 44 PIDs in root, 44 of them are not to be moved
I0828 07:30:05.025852    3419 server.go:959] GET /stats/summary/: (9.852816ms) 200 [[Go-http-client/1.1] 10.60.4.5:33076]
I0828 07:30:09.895201    3419 server.go:959] GET /healthz: (45.607µs) 200 [[curl/7.26.0] 127.0.0.1:45734]
I0828 07:30:11.461372    3419 server.go:959] GET /healthz: (44.393µs) 200 [[Go-http-client/1.1] 127.0.0.1:45738]
I0828 07:30:19.913709    3419 server.go:959] GET /healthz: (53.076µs) 200 [[curl/7.26.0] 127.0.0.1:45745]
I0828 07:30:29.927719    3419 server.go:959] GET /healthz: (25.364µs) 200 [[curl/7.26.0] 127.0.0.1:45753]
I0828 07:30:39.943744    3419 server.go:959] GET /healthz: (33.425µs) 200 [[curl/7.26.0] 127.0.0.1:45768]
I0828 07:30:47.987922    3419 reconciler.go:254] MountVolume operation started for volume "kubernetes.io/secret/f6bb05a4-6ce1-11e6-98ac-42010af00085-mongo-keyfile" (spec.Name: "mongo-keyfile") to pod "f6bb05a4-6ce1-11e6-98ac-42010af00085" (UID: "f6bb05a4-6ce1-11e6-98ac-42010af00085"). Volume is already mounted to pod, but remount was requested.
I0828 07:30:47.988007    3419 reconciler.go:254] MountVolume operation started for volume "kubernetes.io/secret/f6bb05a4-6ce1-11e6-98ac-42010af00085-default-token-i0lox" (spec.Name: "default-token-i0lox") to pod "f6bb05a4-6ce1-11e6-98ac-42010af00085" (UID: "f6bb05a4-6ce1-11e6-98ac-42010af00085"). Volume is already mounted to pod, but remount was requested.
I0828 07:30:47.991753    3419 operation_executor.go:740] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/f6bb05a4-6ce1-11e6-98ac-42010af00085-default-token-i0lox" (spec.Name: "default-token-i0lox") pod "f6bb05a4-6ce1-11e6-98ac-42010af00085" (UID: "f6bb05a4-6ce1-11e6-98ac-42010af00085").
I0828 07:30:47.992095    3419 operation_executor.go:740] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/f6bb05a4-6ce1-11e6-98ac-42010af00085-mongo-keyfile" (spec.Name: "mongo-keyfile") pod "f6bb05a4-6ce1-11e6-98ac-42010af00085" (UID: "f6bb05a4-6ce1-11e6-98ac-42010af00085").
I0828 07:30:49.691312    3419 reconciler.go:180] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/mongodb-disk1" (spec.Name: "mongo-persistent-storage1") pod "cf178aa2-6cf0-11e6-98ac-42010af00085" (UID: "cf178aa2-6cf0-11e6-98ac-42010af00085")
E0828 07:30:49.694187    3419 nestedpendingoperations.go:233] Operation for "\"kubernetes.io/gce-pd/mongodb-disk1\"" failed. No retries permitted until 2016-08-28 07:32:49.694170006 +0000 UTC (durationBeforeRetry 2m0s). Error: Volume "kubernetes.io/gce-pd/mongodb-disk1" (spec.Name: "mongo-persistent-storage1") pod "cf178aa2-6cf0-11e6-98ac-42010af00085" (UID: "cf178aa2-6cf0-11e6-98ac-42010af00085") is not yet attached according to node status.
I0828 07:30:49.957393    3419 server.go:959] GET /healthz: (36.682µs) 200 [[curl/7.26.0] 127.0.0.1:45778]
@saad-ali
Member

@ceefour Thanks for providing logs. Unable to mount volumes for pod... timeout expired waiting for volumes to attach/mount... list of unattached/unmounted volumes... is a catch-all for all types of volume failure errors. In your case it looks like the volume failed to attach (...is not yet attached according to node status). You may be running into issue #29358. A fix for that issue is in v1.3.6 (which will be on GKE shortly). In the meantime, feel free to email me the details of your GKE cluster (project name/cluster name/zone) and I can confirm/help you get unblocked.

@Hokutosei @solsson Check your /var/log/kubelet.log file for errors from nestedpendingoperations. You may also be running in to the same issue as @ceefour, in which case if you email me the details of your GKE cluster (project name/cluster name/zone) I can confirm/help you get unblocked.
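A minimal sketch of the check described above, run against a sample line abridged from this thread (on a real node you would grep /var/log/kubelet.log itself, e.g. after SSHing in with gcloud compute ssh):

```shell
# Sample kubelet log (abridged from this thread); on a real node,
# grep /var/log/kubelet.log directly instead of this file.
cat > /tmp/kubelet-sample.log <<'EOF'
I0828 07:30:47.991753 3419 operation_executor.go:740] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/...-default-token-i0lox"
E0828 07:30:49.694187 3419 nestedpendingoperations.go:233] Operation for "\"kubernetes.io/gce-pd/mongodb-disk1\"" failed. No retries permitted until 2016-08-28 07:32:49 +0000 UTC (durationBeforeRetry 2m0s). Error: Volume "kubernetes.io/gce-pd/mongodb-disk1" is not yet attached according to node status.
EOF

# Pull out the volume-related failures; the nestedpendingoperations lines
# carry the underlying error behind the generic "timeout expired" event.
grep -E 'nestedpendingoperations|FailedMount|WaitForAttach' /tmp/kubelet-sample.log
```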

@saad-ali
Member

@ceefour got in touch with me offline and I was able to confirm that he ran into #29358 (which is fixed in v1.3.6). I was able to unblock him using the workarounds mentioned in that issue.

@geoah
geoah commented Sep 1, 2016

I'm running into the same issue on AWS (1.3.4, CoreOS, using kube-aws).

Events report...

FailedMount   {kubelet ip-10-0-2-7.eu-west-1.compute.internal}   Unable to mount volumes for pod "cassandra-2_default(e0c1d1dd-701a-11e6-b003-02ae0e2810f7)": timeout expired waiting for volumes to attach/mount for pod "cassandra-2"/"default". list of unattached/unmounted volumes=[cassandra-data]

Worker journal...

24996 nestedpendingoperations.go:233] Operation for "\"kubernetes.io/aws-ebs/aws://eu-west-1c/vol-843d7f35\"" failed. No retries permitted until 2016-09-01 09:12:45.513677556 +0000 UTC (durationBeforeRetry 2s). Error: MountVolume.WaitForAttach failed for volume "kubernetes.io/aws-ebs/aws://eu-west-1c/vol-843d7f35" (spec.Name: "pvc-b1170328-6943-11e6-b003-02ae0e2810f7") pod "e0c1d1dd-701a-11e6-b003-02ae0e2810f7" (UID: "e0c1d1dd-701a-11e6-b003-02ae0e2810f7") with: Could not find attached AWS Volume "aws://eu-west-1c/vol-843d7f35". Timeout waiting for mount paths to be created.
24996 reconciler.go:254] MountVolume operation started for volume "kubernetes.io/aws-ebs/aws://eu-west-1c/vol-843d7f35" (spec.Name: "pvc-b1170328-6943-11e6-b003-02ae0e2810f7") to pod "e0c1d1dd-701a-11e6-b003-02ae0e2810f7" (UID: "e0c1d1dd-701a-11e6-b003-02ae0e2810f7").
24996 operation_executor.go:649] Entering MountVolume.WaitForAttach for volume "kubernetes.io/aws-ebs/aws://eu-west-1c/vol-843d7f35" (spec.Name: "pvc-b1170328-6943-11e6-b003-02ae0e2810f7") pod "e0c1d1dd-701a-11e6-b003-02ae0e2810f7" (UID: "e0c1d1dd-701a-11e6-b003-02ae0e2810f7") DevicePath: "/dev/xvdbd"
...
24996 kubelet.go:1932] Unable to mount volumes for pod "cassandra-2_default(e0c1d1dd-701a-11e6-b003-02ae0e2810f7)": timeout expired waiting for volumes to attach/mount for pod "cassandra-2"/"default". list of unattached/unmounted volumes=[cassandra-data]; skipping pod
24996 pod_workers.go:183] Error syncing pod e0c1d1dd-701a-11e6-b003-02ae0e2810f7, skipping: timeout expired waiting for volumes to attach/mount for pod "cassandra-2"/"default". list of unattached/unmounted volumes=[cassandra-data]

Petset volumeClaimTemplates...

  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi

AWS shows the volume properly mounted on the instance.

Any ideas/hints on what might be going wrong? :)
Thanks.

@saad-ali
Member
saad-ali commented Sep 1, 2016

AWS shows the volume properly mounted on the instance.

@geoah Can you confirm that the device MountVolume.WaitForAttach is waiting for, /dev/xvdbd, exists on the node? I wonder if you might be encountering #29324.
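A quick node-side version of that check (the device name comes from the log above; EBS volumes can surface under a different name than the one requested, depending on the instance's block-device driver):

```shell
# Does the device the kubelet is waiting for actually exist?
ls -l /dev/xvdbd

# List all block devices; the EBS volume may appear under a shifted name.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
```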

@chrislovecnm
Member

Has anyone reproduced this? I'm probably going to close it.

@emiliauskas-fuzzy
emiliauskas-fuzzy commented Oct 27, 2016 edited

Managed to bump into the same issue while going through this tutorial: https://cloud.google.com/endpoints/docs/quickstart-container-engine, but I don't see any logs in /var/log/kubelet.log on the cluster VM.
Though I managed to get some info, and it again seems related to mounting:

{kubelet gke-rec-service-cluster-default-pool-28938bf0-c9vz} Warning FailedMount Unable to mount volumes for pod "esp-product-recommendations-3552363475-216am_default(bba10941-9c65-11e6-becb-42010af00224)": timeout expired waiting for volumes to attach/mount for pod "esp-product-recommendations-3552363475-216am"/"default". list of unattached/unmounted volumes=[nginx-ssl]
18h 1m 501 {kubelet gke-rec-service-cluster-default-pool-28938bf0-c9vz} Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "esp-product-recommendations-3552363475-216am"/"default". list of unattached/unmounted volumes=[nginx-ssl]
18h 1m 563 {kubelet gke-rec-service-cluster-default-pool-28938bf0-c9vz} Warning FailedMount MountVolume.SetUp failed for volume "kubernetes.io/secret/bba10941-9c65-11e6-becb-42010af00224-nginx-ssl" (spec.Name: "nginx-ssl") pod "bba10941-9c65-11e6-becb-42010af00224" (UID: "bba10941-9c65-11e6-becb-42010af00224") with: secrets "nginx-ssl" not found

@emiliauskas-fuzzy

OK, please ignore my last post. I finally understood that I was using the wrong config and didn't know that you need to create those secrets beforehand.
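For anyone else hitting the secrets "nginx-ssl" not found variant: the secret must exist before any pod references it. A sketch, assuming the certificate and key live in local files (the file names here are hypothetical):

```shell
# Create the secret the deployment's volume expects, then verify it exists.
kubectl create secret generic nginx-ssl \
  --from-file=nginx.crt \
  --from-file=nginx.key
kubectl get secret nginx-ssl
```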

@chrislovecnm
Member

Closing this

@whereisaaron

This 'stuck attaching' state is a common issue using k8s with AWS and GCE (at least). On AWS it appears to be an AWS-side problem, triggered by fast unmounting and remounting of EBS volumes. You see an error in the AWS console on the problem volume. One suggestion (based on the AWS FAQ) is that it is caused by mounting a volume using the same device name as one that was detached a very short time ago.
I comment on this general problem here:
#36450
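On AWS the stuck state is also visible on the volume itself from the CLI; a sketch (the volume ID is taken from an earlier comment in this thread):

```shell
# An attachment stuck in "attaching"/"detaching" (rather than "attached")
# confirms the EBS-side problem described above.
aws ec2 describe-volumes --volume-ids vol-843d7f35 \
  --query 'Volumes[0].Attachments[].{State:State,Device:Device,Instance:InstanceId}'
```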
