
pv is stuck Terminating due to race condition when csi-attacher removes finalizer and csi-provisioner tries to remove another finalizer #1217

Open
andyzhangx opened this issue May 22, 2024 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@andyzhangx
Member

andyzhangx commented May 22, 2024

What happened:
PV is stuck Terminating due to a race condition when csi-attacher removes its finalizer and csi-provisioner tries to remove another finalizer.

  • symptom
    PV is stuck Terminating when the PVC is deleted with the HonorPVReclaimPolicy feature gate enabled.

  • process
    csi-attacher removes its finalizer (e.g. external-attacher/disk-csi-azure-com) when the PV is detached. Later, when the PVC is deleted, csi-provisioner tries to remove the external-provisioner.volume.kubernetes.io/finalizer finalizer, but because the PV object in the provisioner's informer cache is stale, the finalizer removal keeps failing until the maximum of 6 retries is exceeded, which leaves the PV in Terminating state forever (the underlying storage is deleted before the finalizer removal fails). A sketch of a conflict-tolerant removal follows the workaround below.

csi-attacher-disk	E0510 10:18:09.499513       1 csi_handler.go:701] Failed to remove finalizer from PV "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0": PersistentVolume "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"kubernetes.io/pv-protection"}

csi-attacher-disk	I0510 10:18:09.510077       1 csi_handler.go:706] Removed finalizer from PV "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0"

csi-provisioner-disk	I0510 10:18:09.466810       1 controller.go:1517] delete "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0": volume deleted

csi-azuredisk-controller	I0510 10:18:09.466386       1 azure_managedDiskController.go:325] azureDisk - deleted a managed disk: /subscriptions/xxx/resourceGroups/icto-1019_npi-lcm-cn-lcm-npi-cluster-01-nodes/providers/Microsoft.Compute/disks/pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0

csi-provisioner-disk	I0510 10:18:09.489676       1 controller.go:1554] delete "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0": failed to remove finalizer for persistentvolume: Operation cannot be fulfilled on persistentvolumes "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0": the object has been modified; please apply your changes to the latest version and try again

csi-provisioner-disk	W0510 10:18:09.489714       1 controller.go:989] Retrying syncing volume "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0", failure 6

csi-provisioner-disk	E0510 10:18:09.489752       1 controller.go:1007] error syncing volume "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0": Operation cannot be fulfilled on persistentvolumes "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0": the object has been modified; please apply your changes to the latest version and try again
  • workaround
    remove all finalizers from the PV, then delete the PV manually:
kubectl patch pv NAME -p '{"metadata":{"finalizers":null}}'
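
For illustration, a minimal Go sketch of a conflict-tolerant removal (package and function names are mine, not external-provisioner's): re-read the PV from the API server on every attempt instead of relying on the informer cache, and retry on conflict, so a concurrent finalizer removal by csi-attacher costs one extra round trip rather than exhausting the retry budget.

package finalizersketch

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/util/retry"
)

const provisionerFinalizer = "external-provisioner.volume.kubernetes.io/finalizer"

// removeProvisionerFinalizer GETs the live PV on every attempt, so a
// resourceVersion bumped by csi-attacher's own finalizer removal does not
// make the update conflict permanently.
func removeProvisionerFinalizer(ctx context.Context, c kubernetes.Interface, pvName string) error {
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        pv, err := c.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
        if err != nil {
            return err
        }
        var kept []string
        for _, f := range pv.Finalizers {
            if f != provisionerFinalizer {
                kept = append(kept, f)
            }
        }
        if len(kept) == len(pv.Finalizers) {
            return nil // finalizer already gone, nothing to do
        }
        pv.Finalizers = kept
        _, err = c.CoreV1().PersistentVolumes().Update(ctx, pv, metav1.UpdateOptions{})
        return err
    })
}

Note that retry.RetryOnConflict retries only on 409 Conflict, which is exactly the "the object has been modified" failure in the logs above.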

/kind bug
cc @jsafrane

What you expected to happen:

How to reproduce it:

Anything else we need to know?:

Environment:

  • Driver version: v4.0.0
  • Kubernetes version (use kubectl version): 1.27
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot added the kind/bug label May 22, 2024
@Jainbrt

Jainbrt commented Jun 20, 2024

Team, do we have any debug points on this issue? We are also hitting this in our tests with the 5.0.1 provisioner image. @carlory @jsafrane

@carlory
Member

carlory commented Jun 20, 2024

I appreciate the reminder; I wasn't aware of this issue. Is it easy to reproduce? It would have been a blocker for promoting this feature to beta; unfortunately, we have already promoted it to beta and enabled it by default. cc @xing-yang @deepakkinni

I will investigate this issue next week. Can you provide reproduction steps or a simple test case?

@Jainbrt

Jainbrt commented Jun 20, 2024

@carlory not sure if this issue is related to #1235, but that one has more details.

@Jainbrt

Jainbrt commented Jun 20, 2024

If a pre-provisioned PV is bound to a PVC and then deleted, the PV first moves to the Failed state (during the PVC delete) and then gets stuck in Terminating (during the PV delete).

Below are the provisioner logs:

I0620 10:50:11.805561       1 controller.go:1132] "shouldDelete" PV="pv-fbeqn"
I0620 10:50:11.805575       1 controller.go:1548] "Started" PV="pv-fbeqn"
I0620 10:50:11.805665       1 connection.go:264] "GRPC call" method="/csi.v1.Controller/DeleteVolume" request="{\"volume_id\":\"453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-rgupwoxrg\"}"
I0620 10:50:12.894422       1 leaderelection.go:281] successfully renewed lease ibm-spectrum-scale-csi-driver/spectrumscale-csi-ibm-com
I0620 10:50:21.453285       1 connection.go:270] "GRPC response" response="{}" err=null
I0620 10:50:21.453335       1 controller.go:1563] "Volume deleted" PV="pv-fbeqn"
I0620 10:50:21.528417       1 controller.go:1607] "PersistentVolume deleted succeeded" PV="pv-fbeqn"
I0620 10:50:21.528709       1 controller.go:1171] "handleProtectionFinalizer" PV="&PersistentVolume{ObjectMeta:{pv-fbeqn    2d464b99-70d9-4001-b4bd-058d4a51dcb4 19599924 0 2024-06-20 10:47:32 +0000 UTC 2024-06-20 10:50:21 +0000 UTC 0xc0009efa80 map[author:kubetest] map[pv.kubernetes.io/bound-by-controller:yes] [] [external-provisioner.volume.kubernetes.io/finalizer] [{OpenAPI-Generator Update v1 2024-06-20 10:47:32 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:labels\":{\".\":{},\"f:author\":{}}},\"f:spec\":{\"f:accessModes\":{},\"f:capacity\":{\".\":{},\"f:storage\":{}},\"f:csi\":{\".\":{},\"f:driver\":{},\"f:volumeHandle\":{}},\"f:persistentVolumeReclaimPolicy\":{},\"f:storageClassName\":{},\"f:volumeMode\":{}}} } {csi-provisioner Update v1 2024-06-20 10:47:32 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:finalizers\":{\"v:\\\"external-provisioner.volume.kubernetes.io/finalizer\\\"\":{}}}} } {kube-controller-manager Update v1 2024-06-20 10:47:32 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:pv.kubernetes.io/bound-by-controller\":{}}},\"f:spec\":{\"f:claimRef\":{\".\":{},\"f:apiVersion\":{},\"f:kind\":{},\"f:name\":{},\"f:namespace\":{},\"f:resourceVersion\":{},\"f:uid\":{}}}} } {kube-controller-manager Update v1 2024-06-20 10:50:11 +0000 UTC FieldsV1 {\"f:status\":{\"f:message\":{},\"f:phase\":{}}} status}]},Spec:PersistentVolumeSpec{Capacity:ResourceList{storage: {{8589934592 0} {<nil>}  BinarySI},},PersistentVolumeSource:PersistentVolumeSource{GCEPersistentDisk:nil,AWSElasticBlockStore:nil,HostPath:nil,Glusterfs:nil,NFS:nil,RBD:nil,ISCSI:nil,Cinder:nil,CephFS:nil,FC:nil,Flocker:nil,FlexVolume:nil,AzureFile:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Local:nil,StorageOS:nil,CSI:&CSIPersistentVolumeSource{Driver:spectrumscale.csi.ibm.com,VolumeHandle:453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-rgupwoxrg,ReadOnly:false,FSType:,VolumeAttributes:map[string]string{},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,NodeExpandSecretRef:nil,},},AccessModes:[ReadWriteMany],ClaimRef:&ObjectReference{Kind:PersistentVolumeClaim,Namespace:test-driver-static-sc-14-1718880431,Name:pvc-kqdmmbm,UID:f3821261-6e7a-4c91-b3e5-59871d7dfc2a,APIVersion:v1,ResourceVersion:19599399,FieldPath:,},PersistentVolumeReclaimPolicy:Delete,StorageClassName:sc-wdijv,MountOptions:[],VolumeMode:*Filesystem,NodeAffinity:nil,VolumeAttributesClassName:nil,},Status:PersistentVolumeStatus{Phase:Failed,Message:error getting deleter volume plugin for volume \"pv-fbeqn\": no deletable volume plugin matched,Reason:,LastPhaseTransitionTime:2024-06-20 10:50:11 +0000 UTC,},}"
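
For readers skimming the object dump above: the PV has a deletionTimestamp, carries only the external-provisioner.volume.kubernetes.io/finalizer finalizer, and is in phase Failed ("no deletable volume plugin matched"). A hedged reconstruction of the kind of check the handleProtectionFinalizer step has to make (illustrative names, not the actual external-provisioner source):

package finalizersketch

import (
    v1 "k8s.io/api/core/v1"
)

const provisionerFinalizer = "external-provisioner.volume.kubernetes.io/finalizer"

// shouldStripFinalizer is a hypothetical helper: the provisioner's
// protection finalizer should come off once the PV is being deleted
// (deletionTimestamp set) and the backing volume is already gone.
func shouldStripFinalizer(pv *v1.PersistentVolume, volumeDeleted bool) bool {
    if pv.DeletionTimestamp == nil || !volumeDeleted {
        return false
    }
    for _, f := range pv.Finalizers {
        if f == provisionerFinalizer {
            return true
        }
    }
    return false
}

If the removal that follows a true result is done via Update against a stale cached object, it hits the same conflict shown in the original report.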

@deepakkinni
Member

@andyzhangx, there have been changes to use Patch; hopefully you shouldn't see this error in 5.0.1. Can you try it again?
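
For reference, a hedged sketch of what "use Patch" can mean here (illustrative only; I have not verified this matches the 5.0.1 code): a JSON patch with a test guard removes a single finalizer without sending a resourceVersion, so it cannot fail with the "object has been modified" conflict that Update produces.

package finalizersketch

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

// patchOutFinalizer removes one finalizer by index with a JSON patch.
func patchOutFinalizer(ctx context.Context, c kubernetes.Interface, pvName, finalizer string) error {
    pv, err := c.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    for i, f := range pv.Finalizers {
        if f != finalizer {
            continue
        }
        // The "test" op makes the apiserver reject the patch if another
        // controller changed the entry at this index in the meantime.
        patch := fmt.Sprintf(
            `[{"op":"test","path":"/metadata/finalizers/%d","value":%q},`+
                `{"op":"remove","path":"/metadata/finalizers/%d"}]`, i, f, i)
        _, err = c.CoreV1().PersistentVolumes().Patch(ctx, pvName,
            types.JSONPatchType, []byte(patch), metav1.PatchOptions{})
        return err
    }
    return nil // finalizer already gone
}

The test op still fails if another controller reordered or removed the entry concurrently, so callers must be prepared to re-fetch and retry.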

@Jainbrt
Copy link

Jainbrt commented Jun 21, 2024

@deepakkinni we are using the patch verb but are still seeing the above error.
