
[cinder-csi-plugin] Use Nova's os-volume_attachments API to poll attach/detach completion #1645

Closed
lyarwood opened this issue Sep 13, 2021 · 17 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@lyarwood

lyarwood commented Sep 13, 2021

/kind feature

What happened:

The current volume attach and detach code within the cinder-csi-plugin uses Nova's os-volume_attachments API [1] to initially attach [2] and later detach [3] a volume from an instance. As noted in [1], both of these operations are asynchronous and need to be polled (with GETs against /servers/{server_id}/os-volume_attachments) after the initial call to determine when they have fully completed.

Instead, in both situations cinder-csi appears to poll Cinder's volume API [4], checking the attachments listed there. As discussed upstream in OpenStack [5][6], there are race conditions within Cinder at present that can allow cinder-csi to request that a volume be deleted before the original request to detach it has actually completed. Cinder then incorrectly accepts the request to delete the volume and breaks the original detach late in the flow.

Ultimately cinder-csi just needs to poll Nova's os-volume_attachments API to determine when an operation has completed.

[1] https://docs.openstack.org/api-ref/compute/#servers-with-volume-attachments-servers-os-volume-attachments
[2]

func (os *OpenStack) AttachVolume(instanceID, volumeID string) (string, error) {
	computeServiceClient := os.compute

	volume, err := os.GetVolume(volumeID)
	if err != nil {
		return "", err
	}

	for _, att := range volume.Attachments {
		if instanceID == att.ServerID {
			klog.V(4).Infof("Disk %s is already attached to instance %s", volumeID, instanceID)
			return volume.ID, nil
		}
	}

	if volume.Multiattach {
		// For multiattach volumes, supported compute api version is 2.60
		// Init a local thread safe copy of the compute ServiceClient
		computeServiceClient, err = openstack.NewComputeV2(os.compute.ProviderClient, os.epOpts)
		if err != nil {
			return "", err
		}
		computeServiceClient.Microversion = "2.60"
	}

	_, err = volumeattach.Create(computeServiceClient, instanceID, &volumeattach.CreateOpts{
		VolumeID: volume.ID,
	}).Extract()
	if err != nil {
		return "", fmt.Errorf("failed to attach %s volume to %s compute: %v", volumeID, instanceID, err)
	}
	return volume.ID, nil
}

[3]
func (os *OpenStack) DetachVolume(instanceID, volumeID string) error {
	volume, err := os.GetVolume(volumeID)
	if err != nil {
		return err
	}
	if volume.Status == VolumeAvailableStatus {
		klog.V(2).Infof("volume: %s has been detached from compute: %s ", volume.ID, instanceID)
		return nil
	}
	if volume.Status != VolumeInUseStatus {
		return fmt.Errorf("can not detach volume %s, its status is %s", volume.Name, volume.Status)
	}
	// In case the volume is of type multiattach, it could be attached to more than one instance
	for _, att := range volume.Attachments {
		if att.ServerID == instanceID {
			err = volumeattach.Delete(os.compute, instanceID, volume.ID).ExtractErr()
			if err != nil {
				return fmt.Errorf("failed to detach volume %s from compute %s : %v", volume.ID, instanceID, err)
			}
			klog.V(2).Infof("Successfully detached volume: %s from compute: %s", volume.ID, instanceID)
			return nil
		}
	}
	// Disk has no attachments or is not attached to the provided compute
	return nil
}

[4]
func (os *OpenStack) diskIsAttached(instanceID, volumeID string) (bool, error) {
	volume, err := os.GetVolume(volumeID)
	if err != nil {
		return false, err
	}
	for _, att := range volume.Attachments {
		if att.ServerID == instanceID {
			return true, nil
		}
	}
	return false, nil
}

[5] https://review.opendev.org/c/openstack/cinder/+/801913
[6] https://bugs.launchpad.net/cinder/+bug/1937084

What you expected to happen:

Ultimately cinder-csi just needs to poll Nova's os-volume_attachments API to determine when an operation has completed.
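
For illustration only, here is a minimal sketch of what such polling could look like using gophercloud's compute volumeattach package, assuming the plugin's existing Nova *gophercloud.ServiceClient; the helper name, polling interval and timeout handling are assumptions for the example, not the plugin's actual implementation:

package openstack

import (
	"fmt"
	"time"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/volumeattach"
)

// waitForVolumeAttachmentState polls Nova's
// GET /servers/{server_id}/os-volume_attachments until the volume is listed
// (attached == true) or no longer listed (attached == false) for the server,
// or the timeout expires.
func waitForVolumeAttachmentState(compute *gophercloud.ServiceClient, serverID, volumeID string, attached bool, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		// Ask Nova, not Cinder, which volumes it considers attached to the server.
		pages, err := volumeattach.List(compute, serverID).AllPages()
		if err != nil {
			return err
		}
		attachments, err := volumeattach.ExtractVolumeAttachments(pages)
		if err != nil {
			return err
		}

		found := false
		for _, att := range attachments {
			if att.VolumeID == volumeID {
				found = true
				break
			}
		}
		if found == attached {
			// Nova now reports the state we were waiting for.
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for volume %s to reach attached=%t on server %s", volumeID, attached, serverID)
		}
		time.Sleep(3 * time.Second)
	}
}

With something along these lines, AttachVolume could wait with attached set to true after volumeattach.Create, and DetachVolume could wait with attached set to false after volumeattach.Delete, instead of re-reading volume.Attachments from Cinder.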

How to reproduce it:

I've marked this as a feature, but honestly it could just be a bug, so feel free to change it.

Cause volumes to be attached to or detached from an instance and note that Nova's os-volume_attachments API isn't used to poll the state of either operation.

Anything else we need to know?:

Environment:

  • openstack-cloud-controller-manager(or other related binary) version: N/A
  • OpenStack version: N/A (Nova's os-volume_attachments API has been around for all of v2)
  • Others:
@k8s-ci-robot k8s-ci-robot added the kind/feature label Sep 13, 2021
@lyarwood lyarwood changed the title [cinder-csi-plugin] Use Nova's os-volume_attachments API to attach/detach completion [cinder-csi-plugin] Use Nova's os-volume_attachments API to poll attach/detach completion Sep 13, 2021
@ramineni
Contributor

@lyarwood Thanks for reporting the issue and the detailed explanation.
It makes sense to me to poll the Nova attachments API instead. Will take a look. Thanks

@ramineni
Contributor

/assign

openstack-mirroring pushed a commit to openstack/openstack that referenced this issue Nov 16, 2021
* Update cinder from branch 'master'
  to 5b1578f472c534e0036de3a1e896fc7f35e7f50f
  - Merge "Fix: Race between attachment and volume deletion"
  - Fix: Race between attachment and volume deletion
    
    There are cases where requests to delete an attachment made by Nova can
    race other third-party requests to delete the overall volume.
    
    This has been observed when running cinder-csi, where it first requests
    that Nova detaches a volume before itself requesting that the overall
    volume is deleted once it becomes `available`.
    
    This is a cinder race condition, and like most race conditions is not
    simple to explain.
    
    Some context on the issue:
    
    - Cinder API uses the volume "status" field as a locking mechanism to
      prevent concurrent request processing on the same volume.
    
    - Most cinder operations are asynchronous, so the API returns before the
      operation has been completed by the cinder-volume service, but the
      attachment operations such as creating/updating/deleting an attachment
      are synchronous, so the API only returns to the caller after the
      cinder-volume service has completed the operation.
    
    - Our current code **incorrectly** modifies the status of the volume
      both on the cinder-volume and the cinder-api services on the
      attachment delete operation.
    
    The actual set of events that leads to the issue reported in this bug
    are:
    
    [Cinder-CSI]
    - Requests Nova to detach volume (Request R1)
    
    [Nova]
    - R1: Asks cinder-api to delete the attachment and **waits**
    
    [Cinder-API]
    - R1: Checks the status of the volume
    - R1: Sends terminate connection request (R1) to cinder-volume and
      **waits**
    
    [Cinder-Volume]
    - R1: Ask the driver to terminate the connection
    - R1: The driver asks the backend to unmap and unexport the volume
    - R1: The last attachment is removed from the DB and the status of the
          volume is changed in the DB to "available"
    
    [Cinder-CSI]
    - Checks that there are no attachments in the volume and asks Cinder to
      delete it (Request R2)
    
    [Cinder-API]
    
    - R2: Check that the volume's status is valid. It doesn't have
      attachments and is available, so it can be deleted.
    - R2: Tell cinder-volume to delete the volume and return immediately.
    
    [Cinder-Volume]
    - R2: Volume is deleted and DB entry is deleted
    - R1: Finish the termination of the connection
    
    [Cinder-API]
    - R1: Now that cinder-volume has finished the termination the code
      continues
    - R1: Try to modify the volume in the DB
    - R1: DB layer raises VolumeNotFound since the volume has been deleted
      from the DB
    - R1: VolumeNotFound is converted to HTTP 404 status code which is
      returned to Nova
    
    [Nova]
    - R1: Cinder responds with 404 on the attachment delete request
    - R1: Nova leaves the volume as attached, since the attachment delete
      failed
    
    At this point the Cinder and Nova DBs are out of sync, because Nova
    thinks that the attachment is connected and Cinder has detached the
    volume and even deleted it.
    
    Hardening is also being done on the Nova side [2] to accept that the
    volume attachment may be gone.
    
    This patch fixes the issue mentioned above, but there is a request on
    Cinder-CSI [1] to use Nova as the source of truth regarding its
    attachments that, when implemented, would also fix the issue.
    
    [1]: kubernetes/cloud-provider-openstack#1645
    [2]: https://review.opendev.org/q/topic:%2522bug/1937084%2522+project:openstack/nova
    
    Closes-Bug: #1937084
    Change-Id: Iaf149dadad5791e81a3c0efd089d0ee66a1a5614
openstack-mirroring pushed a commit to openstack/cinder that referenced this issue Nov 16, 2021
Fix: Race between attachment and volume deletion
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Dec 13, 2021
@ramineni
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale label Dec 13, 2021
openstack-mirroring pushed a commit to openstack/cinder that referenced this issue Jan 28, 2022
Fix: Race between attachment and volume deletion
(cherry picked from commit 2ec2222)
@ramineni
Contributor

ramineni commented Jan 31, 2022

@lyarwood is this issue fixed on the OpenStack side in the latest releases?

@lyarwood
Author

@lyarwood is this issue fixed on the OpenStack side in the latest releases?

The Cinder change referenced above is just a bugfix resolving a race that users could hit if they incorrectly used Cinder's API to determine when a volume was attached or detached. There are likely others, so this project should still move over to Nova's API to determine when an attach or detach has completed.
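
For the detach path specifically, a sketch of one possible approach (assuming gophercloud v1 error types and that Nova keys os-volume_attachments entries by volume ID, with names and timings that are illustrative only) would be to poll the individual attachment until Nova returns a 404:

package openstack

import (
	"fmt"
	"time"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/volumeattach"
)

// waitForDetachComplete polls Nova's
// GET /servers/{server_id}/os-volume_attachments/{volume_id} until Nova no
// longer knows about the attachment (404), meaning the detach has fully
// completed on the Nova side, or until the timeout expires.
func waitForDetachComplete(compute *gophercloud.ServiceClient, serverID, volumeID string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		_, err := volumeattach.Get(compute, serverID, volumeID).Extract()
		if err != nil {
			if _, ok := err.(gophercloud.ErrDefault404); ok {
				// The attachment record is gone: Nova has finished the detach.
				return nil
			}
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for volume %s to detach from server %s", volumeID, serverID)
		}
		time.Sleep(3 * time.Second)
	}
}

Only once something like this returns successfully would it be safe for the caller to go on and, for example, ask Cinder to delete the volume.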

openstack-mirroring pushed a commit to openstack/cinder that referenced this issue Feb 6, 2022
Fix: Race between attachment and volume deletion
(cherry picked from commit 2ec2222)
(cherry picked from commit ed0be0c)
openstack-mirroring pushed a commit to openstack/cinder that referenced this issue Feb 10, 2022
Fix: Race between attachment and volume deletion
(cherry picked from commit 2ec2222)
(cherry picked from commit ed0be0c)
(cherry picked from commit 7210c91)
lyarwood added a commit to lyarwood/cloud-provider-openstack that referenced this issue Apr 13, 2022
…rnetes#1645)

As set out in the issue this is the correct OpenStack API to use when
polling volume attachment and detachment from an instance.
lyarwood added a commit to lyarwood/cloud-provider-openstack that referenced this issue Apr 20, 2022
…rnetes#1645)

As set out in the issue this is the correct OpenStack API to use when
polling volume attachment and detachment from an instance.

Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label May 1, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 31, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

lyarwood added a commit to lyarwood/cloud-provider-openstack that referenced this issue Jun 30, 2022
…rnetes#1645)

As set out in the issue this is the correct OpenStack API to use when
polling volume attachment and detachment from an instance.

Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
@chrigl
Contributor

chrigl commented Jul 5, 2022

/reopen
/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot reopened this Jul 5, 2022
@k8s-ci-robot
Contributor

@chrigl: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label Jul 5, 2022
lyarwood added a commit to lyarwood/cloud-provider-openstack that referenced this issue Aug 31, 2022
…rnetes#1645)

As set out in the issue this is the correct OpenStack API to use when
polling volume attachment and detachment from an instance.

Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Oct 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Nov 2, 2022
nectar-gerrit pushed a commit to NeCTAR-RC/cinder that referenced this issue Nov 16, 2022
Fix: Race between attachment and volume deletion
(cherry picked from commit 2ec2222)
(cherry picked from commit ed0be0c)
(cherry picked from commit 7210c91)
@seanschneeweiss
Contributor

In our environment we also observed many problems with volumes stuck in reserved or detaching, and inconsistencies between the Nova attachments and the Cinder attachments - e.g. Cinder showing available while the volume is still attached in the Nova database.

Canonical identified a bug in keystonemiddleware related to service tokens.

I found this issue when nova calls cinder with an expired X-Auth-Token but it is configured to also send an X-Service-Token. The traffic goes like this:

nova-compute -> cinder: post with X-Auth-Token and X-Service-Token
cinder -> keystone: validate X-Auth-Token
keystone -> cinder: returns 404
cinder -> nova-compute: returns 401
nova-compute -> cinder: retry post with new X-Service-Token
cinder -> keystone: validate X-Service-Token
keystone -> cinder: returns 200 showing that the token is valid
cinder -> nova-compute: returns 401

As I understand it, Cinder should return 200 in the last message, as the token is valid.

My test client is a long-running service that uses the same token to communicate with nova until it receives a 401 and then generates a new one. Sometimes the token is invalidated in the middle of a transaction and nova returns 200 to the client but cinder returns 401 to nova.

The bug fix is currently being reviewed:
https://review.opendev.org/c/openstack/keystonemiddleware/+/860481

As far as my understanding goes, the other bugs mentioned in this issue have already been fixed too.

Sean Schneeweiss sean.schneeweiss@mercedes-benz.com, Mercedes-Benz Tech Innovation GmbH, Provider Information

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) Dec 25, 2022
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

WanzenBug pushed a commit to LINBIT/openstack-cinder that referenced this issue Apr 13, 2023
Fix: Race between attachment and volume deletion
(cherry picked from commit 2ec2222)
(cherry picked from commit ed0be0c)
(cherry picked from commit 7210c91)