Change destroy operation to use foreground cascading delete #2379
Conversation
Does the PR have any schema changes? Looking good! No breaking changes found.
LGTM! Appears to do what it says on the box :)
I think this is a reasonable change. Given that we'd like to somewhat manage the lifecycle and ensure resources are deleted correctly, switching over to foreground cascading delete seems like a reasonable trade-off vs speed.
@lblackstone Loved the detailed PR description with motivation and explanation of the change. I'd love to see it extended with the testing approach: why you didn't think we needed extra tests, whether you tested it manually, what kind of risks it could have brought in, etc.
I added some additional detail to the PR description.
Proposed changes
By default, Kubernetes uses "background cascading deletion" (BCD) to clean up resources. This works in most cases with the eventual consistency model as resources are garbage collected. However, there are cases where BCD can lead to stuck resources due to race conditions between dependents. A concrete example is an application Deployment that includes a volume mount managed by a Container Storage Interface (CSI) driver. The underlying Pods managed by this Deployment depend on the CSI driver to unmount the volume on teardown, and this process can take some time. Thus, if a Namespace containing both the CSI driver and the application Deployment is deleted, it is possible for the CSI driver to be removed before it has finished tearing down the application Pods, leaving them stuck in a "Terminating" state.
A reliable way to avoid this race condition is by using "foreground cascading deletion" (FCD) instead. FCD blocks deletion of the parent resource until any children have been deleted. In the previous example, the application Deployment resource would not be deleted until all of the underlying Pods had unmounted the CSI volume and finished terminating. Once the application Deployment is gone, then Pulumi can safely clean up the CSI driver as well.
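At the API level, the deletion mode is selected via the `propagationPolicy` field of the `DeleteOptions` body sent with the DELETE request. A minimal sketch of the two request bodies, built with the Python standard library rather than a real Kubernetes client:

```python
import json

# DeleteOptions body requesting foreground cascading deletion: the API
# server keeps the parent object around (with a "foregroundDeletion"
# finalizer) until all of its dependents have been deleted.
foreground = {
    "apiVersion": "v1",
    "kind": "DeleteOptions",
    "propagationPolicy": "Foreground",
}

# The Kubernetes default: the parent is removed immediately and the
# garbage collector deletes dependents asynchronously in the background.
background = {
    "apiVersion": "v1",
    "kind": "DeleteOptions",
    "propagationPolicy": "Background",
}

print(json.dumps(foreground))
print(json.dumps(background))
```

The same choice is exposed on the command line as `kubectl delete --cascade=foreground` versus the default `--cascade=background`.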
One downside of this approach is that resource deletion can take longer to resolve since Kubernetes is explicitly waiting on the delete operation to complete. However, this increases reliability of the delete operation by making it less prone to race conditions, so the tradeoff seems worth it.
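While a foreground deletion is in progress, the parent object remains visible with a `deletionTimestamp` set and a `foregroundDeletion` finalizer; the API server only removes the object once its finalizer list is empty. A hypothetical snapshot of a Deployment mid-deletion (names and timestamps are illustrative, not from this PR):

```python
# Hypothetical snapshot of a Deployment while foreground deletion is in
# progress: deletionTimestamp is set, and the "foregroundDeletion"
# finalizer blocks removal until all dependent Pods are gone.
deployment_mid_delete = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
        "name": "app",                                # illustrative name
        "deletionTimestamp": "2023-01-01T00:00:00Z",  # illustrative time
        "finalizers": ["foregroundDeletion"],
    },
}

def deletion_blocked(obj):
    # The object persists as long as any finalizer remains on it.
    return bool(obj["metadata"].get("finalizers"))

print(deletion_blocked(deployment_mid_delete))
```

This is the mechanism behind the longer delete times noted above: the blocking is implemented as an ordinary finalizer wait, not a separate API.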
This PR doesn't include additional testing because this scenario is already well covered by existing tests. Every test includes a destroy operation, which exercises the new behavior. This change was confirmed to fix the customer issue, and manual testing was also performed.
Related issues (optional)
Fix https://github.com/pulumi/customer-support/issues/931