
Deployments with GCE PD fail with "...is already being used by..." #48968

Closed

saad-ali opened this issue Jul 14, 2017 · 75 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. milestone/removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@saad-ali
Member

saad-ali commented Jul 14, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

From user report on reddit:

I have a deployment with a persistent volume claim in Google Cloud. One pod is using this volume. The deployment is of the "Recreate" type. But each time a node is feeling under the weather and this pod is rescheduled to another one, it fails to start with:

googleapi: Error 400: The disk resource 'projects/...-pvc-...' is already being used by 'projects/.../instances/..node-bla-bla'     

I've stumbled across some issues on GitHub, but don't see a definitive solution. Due to the nature of the problem, I cannot reliably recreate it manually; an artificial overload needs to be created.

What I considered doing:
1. Create some sort of Gluster/Ceph/whatever-fs cluster and use it as the PV. Con: an additional point of failure that needs setup and maintenance of its own.
2. Create a separate node pool with 1 node in it and schedule the deployment strictly to that pool. Con: it doesn't scale up or down; at this point there is no need for a whole node just for that deployment, but if it grows the problem starts all over.

I've upgraded the cluster and nodes to 1.6.7, but don't know if it will matter. Any help appreciated.

Other reports here:

What you expected to happen:
Volume should attach to new node without issue.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 14, 2017
@saad-ali saad-ali added sig/storage Categorizes an issue or PR as relevant to SIG Storage. kind/bug Categorizes issue or PR as related to a bug. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 14, 2017
@saad-ali
Member Author

This report mentioned Deployments along with GCE PDs. This can get tricky because in some cases it can result in multiple pods (scheduled to different nodes) referencing the same (read-write once) volume, which will cause the second pod to fail to start.

To prevent this from happening, the general recommendations for using Deployments with GCE PDs are:

  • Set the deployment replicas count to 1 -- GCE PDs only support read-write attachment to a single node at a time, and if you have more than 1 replica, pods may be scheduled to different nodes.
  • When doing rolling updates, either:
    1. Use the "Recreate" strategy, which ensures that old pods are killed before new pods are created (there was a bug Recreate deployments dont wait for pod termination #27362 where this didn't work correctly in some cases, but that was apparently fixed a long time ago).
    2. Use the "RollingUpdate" strategy with maxSurge=0 and maxUnavailable=1 (a sketch of this configuration follows this list).
      • If a strategy is not specified for a deployment, the default is RollingUpdate. The rolling update strategy has two parameters, maxUnavailable and maxSurge; when not specified they default to 1 and 1 respectively. This means that during a rolling update at least one pod from the old deployment must remain and an extra new pod (beyond the requested number of replicas) may be created. When this happens, if the new pod lands on a different node, it will fail to start, because the old pod still has the disk attached as read-write.
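For illustration, here is a minimal sketch of a Deployment following option 2; the names, image, and claimName are placeholders (not taken from this report), and the apiVersion shown is the current apps/v1 (older clusters used extensions/v1beta1 or apps/v1beta1):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: single-writer-app              # placeholder name
spec:
  replicas: 1                          # GCE PD is ReadWriteOnce, so keep a single replica
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0                      # never create the new pod before the old one is gone
      maxUnavailable: 1                # allow the single replica to be taken down during the update
  selector:
    matchLabels:
      app: single-writer-app
  template:
    metadata:
      labels:
        app: single-writer-app
    spec:
      containers:
      - name: app
        image: example.com/app:latest  # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: app-data          # placeholder PVC name

The same caveats apply to the "Recreate" option; only the strategy block changes.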

However, the reporter mentioned they used the "Recreate" strategy, which means that there must be a bug here.

To help us debug, if you run into this issue, please:

  1. Verify that you are adhering to the guidance provided above.
  2. Grab and share the following with me (either post here or email directly if you don't want to share publicly):
    * Your kube-controller-manager logs from your master (if you're on GKE contact customer support, reference this issue, and ask them to grab the logs for you).
    * Your deployment YAML
    * A description of what commands you ran and when.

Let's figure this out!

CC @kubernetes/sig-storage-bugs

@msau42
Member

msau42 commented Jul 15, 2017

The initial report mentions that the node was not healthy, so the pod got rescheduled. Does that trigger the update strategy (recreate or rollingupdate)?

@yuvipanda
Contributor

We've run into this a lot too, and we weren't using deployments. We were simply spawning pods that had a PD attached. After the pods were killed, the disks didn't seem to be detached from the underlying VM automatically, sometimes for hours. When the pods came back up on a different machine, we often got this exact same error (and it sometimes cleared up in a few minutes, but often not).

We wrote a script that just watches for this issue (grepping describe in a loop, ugh) and runs the appropriate gcloud detach command, and that made the problem 'go away' for us.

We were on GKE.
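For illustration only, here is a rough sketch of the kind of watch-and-detach workaround described above. The namespace, zone, and the way the disk and node names are pulled out of the error text are all assumptions, not the actual script:

#!/usr/bin/env bash
# Hypothetical sketch: watch for the GCE "already being used by" attach error
# and force-detach the disk from the node it is still attached to.
NAMESPACE="default"        # placeholder
ZONE="us-central1-a"       # placeholder
while true; do
  kubectl describe pods -n "$NAMESPACE" |
    grep "is already being used by" |
    while read -r line; do
      # Pull the disk and instance names out of the error text (format may vary).
      disk=$(echo "$line" | grep -o "disks/[^' ]*" | head -n1 | cut -d/ -f2)
      node=$(echo "$line" | grep -o "instances/[^' ]*" | head -n1 | cut -d/ -f2)
      if [ -n "$disk" ] && [ -n "$node" ]; then
        gcloud compute instances detach-disk "$node" --disk "$disk" --zone "$ZONE"
      fi
    done
  sleep 60
done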

@tamalsaha
Member

We also see this issue happening. The most annoying thing about GCE is that it does "silent" migration of VMs, after which PVs are not released. So when Kubernetes wants to start a pod on that node, it fails with a "resource already in use" type of error.

cc: @sadlil

@wawastein

Hello, I'm the original reddit poster. A couple more details.

Deployment in question (sans sensitive info):

kind: Deployment
metadata:
  name: ...-redis
  labels:
    name: ...-redis
spec:
  selector:
    matchLabels:
      name: ...-redis
  template:
    metadata:
      labels:
        name: ...-redis
        app: ...
      namespace: ...
    spec:
      nodeSelector:
          cloud.google.com/gke-nodepool: pool-1
      volumes:
      - name: redis-data
        persistentVolumeClaim:
          claimName: ...-redis
      containers:
      - name: redis
        image: "gcr.io/.../...-redis:latest"
        ports: 
        - name: redis
          containerPort: 6379
        volumeMounts:
        - mountPath: /data
          name: redis-data
  strategy: 
    type: Recreate

The PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ...-redis
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi

Last time I observed the behavior in question, it was induced by a faulty job producing a high number of failing pods, which generated disk pressure. Redis was rescheduled to a different node and the rest is history.
Please let me know if I can be of any additional help.

@jingxu97
Contributor

@yuvipanda Thank you for reporting the issue. Is it possible you could share some controller logs from when the issue happened? Thanks!

@gnufied
Member

gnufied commented Jul 17, 2017

@yuvipanda can you elaborate on "pod gets killed"? Previously, for pods that were terminated but not deleted from the API server, we never used to perform volume detach, but this changed in #45191.

@yuvipanda
Contributor

@jingxu97 we were running GKE, and I couldn't find a way to get the controller logs, unfortunately.

@gnufied we spawn one pod per user, and then the user can execute arbitrary code inside the pod (via a Jupyter Notebook). We have a script that watches for users that are inactive and performs a delete (via the k8s API) of their pods. We found that for a long (and intermittent?) time after that, the volumes would still be attached to the node on which the user's pod had been running. So when the user's pod was re-created (maybe they became active again), we needed to attach the same volume back to this pod, and this would fail with this message.

@yuvipanda
Contributor

@jingxu97 reached out to me, and I provided a repro case (https://gist.github.com/yuvipanda/0b6aa32192c35b960e91698e1c14690c)

@jingxu97
Contributor

@wawastein thanks for reporting the error. You mentioned a job producing a high number of faulty pods. Does each pod use a different PVC, so that it creates a new volume? Thanks!

@wawastein

@jingxu97 hi, no, the job pods do not use or create any volumes. The problem was a misconfiguration: containers couldn't connect to the DB and failed time and time again.

@yuvipanda
Contributor

I can't reproduce this on 1.7, btw. I can on 1.6.

@jingxu97
Contributor

@yuvipanda I tried on both a 1.6.4 and a 1.7 cluster with your repro steps, but could not get the errors.

@wawastein

wawastein commented Aug 1, 2017

Here we go again.
Cluster 1.7.2
This time nothing extraordinary, just a regular deployment.

Multi-Attach error for volume "pvc-25580310-76bf-11e7-b22a-42010a84002d" Volume is already exclusively attached to one node and can't be attached to another
Unable to mount volumes for pod "imgs-640158079-lq5rg_imgs(cc617d27-76c1-11e7-b22a-42010a84002d)": timeout expired waiting for volumes to attach/mount for pod "imgs"/"imgs-640158079-lq5rg". list of unattached/unmounted volumes=[nginx-cache]
Error syncing pod

@jingxu97
Contributor

jingxu97 commented Aug 1, 2017

@wawastein could you please share more information about this issue and the conditions under which this error occurred? If possible, could you also share the project, zone, and cluster name so that we can check the master log?

@wawastein

@jingxu97 so the timeline was as follows:

  1. I created a 1 Gi PVC for use with the nginx container.
  2. Deployed the changed config; the deploy was successful.
  3. Pushed a new nginx image to the registry and deployed once again.
  4. Got the error above.

Project: superlocal-149713
Zone: europe-west1-b
Cluster name: staging-cluster-2

@jingxu97
Contributor

jingxu97 commented Aug 2, 2017

@wawastein I checked the master log and found the following related information:

kube-controller-manager.log-20170802-1501632000:I0801 13:41:43.693416       5 gce_util.go:144] Successfully created GCE PD volume gke-staging-cluster-2--pvc-25580310-76bf-11e7-b22a-42010a84002d
kube-controller-manager.log-20170802-1501632000:I0801 13:41:44.494282       5 pv_controller.go:1409] volume "pvc-25580310-76bf-11e7-b22a-42010a84002d" provisioned for claim "imgs/imgs-nginx-cache"
kube-controller-manager.log-20170802-1501632000:I0801 13:41:44.500007       5 pv_controller.go:718] volume "pvc-25580310-76bf-11e7-b22a-42010a84002d" entered phase "Bound"
kube-controller-manager.log-20170802-1501632000:I0801 13:41:44.500046       5 pv_controller.go:853] volume "pvc-25580310-76bf-11e7-b22a-42010a84002d" bound to claim "imgs/imgs-nginx-cache"
kube-controller-manager.log-20170802-1501632000:I0801 13:52:45.861876       5 reconciler.go:272] attacherDetacher.AttachVolume started for volume "pvc-25580310-76bf-11e7-b22a-42010a84002d" (UniqueName: "kubernetes.io/gce-pd/gke-staging-cluster-2--pvc-25580310-76bf-11e7-b22a-42010a84002d") from node "gke-staging-cluster-2-pool-1-bb497dff-9mms" 
kube-controller-manager.log-20170802-1501632000:I0801 13:52:53.731639       5 operation_generator.go:271] AttachVolume.Attach succeeded for volume "pvc-25580310-76bf-11e7-b22a-42010a84002d" (UniqueName: "kubernetes.io/gce-pd/gke-staging-cluster-2--pvc-25580310-76bf-11e7-b22a-42010a84002d") from node "gke-staging-cluster-2-pool-1-bb497dff-9mms" 
kube-controller-manager.log-20170802-1501632000:W0801 14:00:38.804960       5 reconciler.go:262] Multi-Attach error for volume "pvc-25580310-76bf-11e7-b22a-42010a84002d" (UniqueName: "kubernetes.io/gce-pd/gke-staging-cluster-2--pvc-25580310-76bf-11e7-b22a-42010a84002d") from node "gke-staging-cluster-2-pool-1-bb497dff-rqsz" Volume is already exclusively attached to one node and can't be attached to another

From the log we can see that around 13:52:53 the PVC is first attached to node gke-staging-cluster-2-pool-1-bb497dff-9mms. Later, around 14:00:38, the reconciler tries to attach this PVC to a different node and fails since it is still attached to the old node. The problem here is that the old pod is not deleted, so the volume is still attached, and the new pod's attach attempt fails.

From the deployment doc https://kubernetes.io/docs/concepts/workloads/controllers/deployment/,
"For example, if you look at the above Deployment closely, you will see that it first created a new Pod, then deleted some old Pods and created new ones. It does not kill old Pods until a sufficient number of new Pods have come up, and does not create new Pods until a sufficient number of old Pods have been killed. It makes sure that number of available Pods is at least 2 and the number of total Pods is at most 4."

I haven't looked at the deployment controller in detail, but if the new pod fails to start because of a volume problem and the controller does not kill the old pod, it will get stuck.

I also noticed that in your master controller log there are lots of errors like the following; I'm not sure whether they are relevant or not.

E0801 13:52:19.659620       5 horizontal.go:206] failed to query scale subresource for Deployment/default/dj: deployments/scale.extensions "dj" not found
E0801 13:52:19.668547       5 horizontal.go:206] failed to query scale subresource for Deployment/default/puma: deployments/scale.extensions "puma" not found
E0801 13:52:26.056806       5 horizontal.go:206] failed to compute desired number of replicas based on listed metrics for Deployment/deals-production/cloudsqlproxy: failed to get cpu utilization: missing request for cpu on container cloudsqlproxy in pod deals-production/cloudsqlproxy-795806673-0g5ch
E0801 13:52:27.566351       5 horizontal.go:206] failed to compute desired number of replicas based on listed metrics for Deployment/staging/rpush: failed to get cpu utilization: missing request for cpu on container rpush in pod staging/rpush-196916759-l2p29
E0801 13:52:29.141446       5 horizontal.go:206] failed to compute desired number of replicas based on listed metrics for Deployment/imgs/imgs: failed to get cpu utilization: missing request for cpu on container app in pod imgs/imgs-3690798063-ct914
I0801 13:52:45.765722       5 replica_set.go:455] Too few "imgs"/"imgs-557785410" replicas, need 1, creating 1

I will check with the workloads team to see whether there is a problem during deployment updates.

@wawastein

@jingxu97 the last log messages confused me too; I thought it might be because most of the time those pods are idle and don't use resources. The HPA ones are related to deleted deployments.

On the bound PVC, however, it's at least counter-intuitive. If a new pod is scheduled to a new node, maybe there ought to be a check for any detachable volumes of the old pod first.

@jingxu97
Contributor

jingxu97 commented Aug 2, 2017

@wawastein the scheduler and the volume manager are completely separate. When the scheduler schedules the pod, it does not know whether the volume used by the pod is still attached to some node. But normally the old pod should be deleted at some point, so the volume will be detached from the old node, even though that might happen a little after the new pod is created.

@jingxu97 jingxu97 self-assigned this Aug 2, 2017
@jingxu97
Contributor

jingxu97 commented Aug 2, 2017

@wawastein Could you check the comment here to see whether following the steps could solve your problem? #48968 (comment)

@dfcowell

dfcowell commented Aug 3, 2017

I've had this same problem pop up when attempting to schedule single-replica deployments after VM migrations of my nodes on GCP.

Eventually the scheduler gives up trying to reschedule the pod and I have to delete it manually to resolve the issue.

I'm going to try recreating the deployments with strategy: Replace and see if that helps, but I'll report back in this issue if it happens again and I can provide a timestamp, node, project etc.

UPDATE: It looks like, even when using the Recreate strategy on the deployment, the cluster schedules the new pod before removing the old one:

[screenshot of pod events omitted]

Anecdotally though, after changing the strategy to replace, the disk was released much faster than previously and the pod quickly recovered. I'll need a greater sample size before I can say that with any authority though.

@andrewhowdencom

I'm noticing this during lots of deploys. Maybe there's a timeout between switching disks between machines, or some other sort of sync loop?

@wawastein

@jingxu97 I'll try it on the new deployment. But I can't really force a reschedule to test it; I'll just sit there and hope it doesn't happen.

@davidz627
Contributor

davidz627 commented Dec 11, 2017

/assign
Looking into recovery from different failure scenarios with the recommended settings to make sure recreated pods are schedulable.

@jakirpatel

I am facing a similar issue with StatefulSets. I have attached the PV through a PVC in GKE. But when there is more than 1 replica, the remaining pods go into the waiting CrashLoopBackOff state.

Any progress on this issue?

@msau42
Member

msau42 commented Dec 15, 2017

Hi @jakirpatel, CrashLoopBackOff generally means that the volume was successfully mounted but your container is crashing. You'll need to check the logs of your container to see why it's crashing.

@msau42
Member

msau42 commented Dec 15, 2017

Can you post the events you see with 'kubectl describe pod'?

@davidz627
Contributor

davidz627 commented Dec 21, 2017

I have looked into different failure scenarios with our recommended settings and have the following conclusions:

Note: This may not be an exhaustive list of failure scenarios. Please contact me if you are experiencing issues with a different scenario.

Note: All scenarios tested with "replicas:1" and deployment with a PVC referencing gce-pd (only supports single node attach).

| Failure scenario | No deployment strategy | Deployment strategy: Recreate | Deployment strategy: RollingUpdate (maxSurge: 0, maxUnavailable: 1) |
| --- | --- | --- | --- |
| Deleting pod manually | Successful new pod | Successful new pod | Successful new pod |
| Updating deployment | Expected error: multi-attach | Successful new pod | Successful new pod |
| Tainting node to evict pods | Successful new pod | Successful new pod | Successful new pod |
| Node is killed | Successful new pod | Successful new pod | Successful new pod |
| Stressing node to evict pods | #57531 | #57531 | #57531 |
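For reference, a rough sketch of commands that can reproduce a couple of these scenarios (node, pod, and taint names are placeholders):

# Deleting the pod manually
kubectl delete pod <pod-name>

# Tainting the node to evict its pods (NoExecute evicts running pods that do not tolerate the taint)
kubectl taint nodes <node-name> repro=test:NoExecute

# Remove the taint again afterwards
kubectl taint nodes <node-name> repro=test:NoExecute-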

@mofirouz

@davidz627 This seems to be in line with what I've experienced as well. The multi-attach error happened when I was updating the StatefulSet, which caused pods to be reshuffled onto other nodes.

@msau42
Member

msau42 commented Dec 22, 2017

@davidz627 can you also try this experiment with statefulsets? Theoretically, you should not see multi-attach errors with StatefulSets because the pod must be completely deleted before a new pod with the same name can be recreated.

@mofirouz if you could paste your full pod events showing the CrashLoopBackOff, that would be helpful in triaging.

@davidz627
Contributor

Also tested with StatefulSets with 3 replicas, using volumeClaimTemplates with gce-pd. Here are the results:

| Failure scenario | StatefulSet |
| --- | --- |
| Deleting pod manually | Successful new pods |
| Updating stateful set | Successful new pods |
| Tainting node to evict pods | Successful new pods |
| Node is killed | Successful new pods |
| Stressing node to evict pods | #57531 |
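For illustration, a minimal sketch of the kind of StatefulSet used in these tests; the names, image, and storage class are assumptions (on GKE the default "standard" StorageClass provisions GCE PDs):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-ss                        # placeholder name
spec:
  serviceName: test-ss
  replicas: 3
  selector:
    matchLabels:
      app: test-ss
  template:
    metadata:
      labels:
        app: test-ss
    spec:
      containers:
      - name: app
        image: k8s.gcr.io/pause:3.1    # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard       # assumed GCE PD storage class
      resources:
        requests:
          storage: 1Gi

Each replica gets its own PVC (data-test-ss-0, data-test-ss-1, ...), which is why multi-attach is not expected here.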

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 22, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 21, 2018
@plkokanov

/remove-lifecycle rotten

Hi, we experienced the same problem with kubernetes 1.9.6:

Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-27T00:13:02Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}

Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider or hardware configuration: GCE

Some more details: we use a StatefulSet for our pod, configured with the RollingUpdate strategy.

The pod was rescheduled from node1 to node2 and got stuck in the ContainerCreating state due to:

AttachVolume.Attach failed for volume "pvc-id" : googleapi: Error 400: The disk resource 'projects/blabla/zones/blablazone/disks/gcp-dynam-pvc-id' is already being used by 'projects/blabla/zones/blablazone/instances/node1'

controller-manager logs:

I0515 18:04:22.207470       1 reconciler.go:287] attacherDetacher.AttachVolume started for volume "pvc-id" (UniqueName: "kubernetes.io/gce-pd/gcp-dynam-pvc-id) from node "node2"

E0515 18:04:26.778812       1 gce_op.go:88] GCE operation failed: googleapi: Error 400: The disk resource 'projects/blabla/zones/blablazone/disks/gcp-dynam-pvc-id' is already being used by 'projects/blabla/zones/blablazone/instances/node1'

E0515 18:04:26.778879       1 attacher.go:92] Error attaching PD "gcp-dynam-pvc-id" to node "node2": googleapi: Error 400: The disk resource 'projects/blabla/zones/blablazone/disks/gcp-dynam-pvc-id' is already being used by 'projects/blabla/zones/blablazone/instances/node1'

E0515 18:04:26.778969       1 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/gcp-dynam-pvc-id\"" failed. No retries permitted until 2018-05-15 18:06:28.778939119 +0000 UTC m=+19053.925176613 (durationBeforeRetry 2m2s). Error: "AttachVolume.Attach failed for volume \"pvc-id\" (UniqueName: \"kubernetes.io/gce-pd/gcp-dynam-pvc-id") from node \"node2\" : googleapi: Error 400: The disk resource 'projects/blabla/zones/blablazone/disks/gcp-dynam-pvc-id' is already being used by 'projects/blabla/zones/blablazone/instances/node1'"

After checking volumesAttached and volumesInUse for both nodes we saw (one way to inspect these node status fields is sketched after this list):

  • on node 1 there were no volumesAttached and no volumesInUse
  • on node 2 there were no volumesAttached; however, there was one entry in volumesInUse: gcp-dynam-pvc-id
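For reference, those status fields can be inspected directly on the Node objects (node names are placeholders):

kubectl get node node1 -o jsonpath='{.status.volumesAttached}'
kubectl get node node1 -o jsonpath='{.status.volumesInUse}'
kubectl get node node2 -o jsonpath='{.status.volumesAttached}'
kubectl get node node2 -o jsonpath='{.status.volumesInUse}'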
    

By draining node 2 and having the pod restart on node 1 (which happened by accident, since we have more than 2 nodes in our cluster), the problem was fixed.

Because a manual fix is required there is some downtime, and this is a critical pod for our cluster.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 16, 2018
@msau42
Member

msau42 commented May 16, 2018

@plkokanov to confirm, you saw this problem when you issued a rolling update on your statefulset?

@jingxu97
Contributor

@plkokanov would you mind sharing the full controller-manager log with us so we can help triage? Seeing the volume in volumesInUse on node 2 is normal, because the controller tries to mount gcp-dynam-pvc-id on node 2. The strange thing is why the volume is still attached to node 1.

@plkokanov

@msau42 we saw it after the pod was restarted. The reason for the restart was most likely an unsuccessful liveness probe. I'll make sure to post the exact reason as soon as I manage to reproduce it.
@jingxu97 I can't get the controller-manager logs from when the issue happened. I'll see if I can do anything about that.

@pv93

pv93 commented Jun 1, 2018

This error is still occurring for me, running GKE 1.9.6-1, with a pod using a gce-pd PVC. The odd thing is that this error pops up immediately on creation of the pod and its associated PVC, after deleting the old pod and PVC. My guess is that the old one has not been deleted in GCP, even though it's gone from Kubernetes and from the GCE console. This is a really big problem and is making it impossible to run any stateful apps in my cluster.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 30, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 29, 2018
@jingxu97
Contributor

@pv93, sorry for the late reply. Could you please let us know more details about your issue? You mentioned you deleted the old pod and PVC. Did you use the "--force" and "--grace-period=0" options? Otherwise, the new pod or PVC (if using the same name as the old one) cannot be created until the old one has been cleaned up and deleted. For the PVC, unless you change the reclaim policy, by default when a PVC is deleted the volume is also deleted, and your new PVC should create a new volume.
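For reference, the force-delete variant being referred to looks roughly like this (the pod name is a placeholder):

kubectl delete pod <pod-name> --grace-period=0 --force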

@pv93

pv93 commented Oct 18, 2018

@jingxu97 it seems like this issue has stopped popping up as much with Kubernetes 1.10. Once a pod is rescheduled, GKE is much faster with removing the PVC from the old node and attaching it to the new one. So this isn't a problem for me anymore. Thanks anyway.

@saad-ali saad-ali closed this as completed Nov 1, 2018