
Deleting pendingDeletes at beginning of deployment leads to stuck states #2948

Closed
lukehoban opened this issue Jul 19, 2019 · 10 comments · Fixed by #11027
Assignees: Frassle
Labels: area/engine, kind/bug, resolution/fixed
Milestone: 0.80

Comments

lukehoban commented Jul 19, 2019

Today, we process any pendingDeletes at the beginning of a deployment. This is not "correct".

Two examples:

First, a program with a VPC and an Instance. A change causes the VPC to be replaced, and the Instance fails to create. This leaves a newly created VPC and a pending-delete VPC. On the next update, we try to flush the pending deletes, which means trying to delete the old VPC. This fails, because the Instance is still running in the old VPC. It is only "correct" to delete the old VPC at the end of the deployment, after all other repercussions of the replacement have been processed.
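For concreteness, a minimal sketch of the shape of such a program (names, AMI, and CIDRs are illustrative, not from the original report):

    import * as aws from "@pulumi/aws";

    // Changing the VPC's cidrBlock forces the VPC to be replaced.
    const vpc = new aws.ec2.Vpc("my-vpc", { cidrBlock: "10.0.0.0/16" });

    const subnet = new aws.ec2.Subnet("my-subnet", {
        vpcId: vpc.id,
        cidrBlock: "10.0.1.0/24",
    });

    // The instance keeps running in the old VPC until its own replacement
    // succeeds, so the old VPC cannot be deleted before that happens.
    const instance = new aws.ec2.Instance("my-instance", {
        ami: "ami-0c55b159cbfafe1f0", // placeholder AMI
        instanceType: "t3.micro",
        subnetId: subnet.id,
    });

If the Instance's create-replacement fails, the old VPC is left as a pendingDelete, and flushing it at the start of the next update fails because the old Instance (and its subnet) still live in it.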

Second, a Kubernetes Provider and a Kubernetes Resource. A change causes the Kubernetes Provider to be replaced, but the Kubernetes Resource fails to create. This leaves both a newly created Provider and a pending-delete Provider in the checkpoint. On the next update, we successfully delete the pending-delete Provider from the checkpoint. However, the remaining resources in the checkpoint still hold provider references to a provider that no longer exists. When we try to process the recreate of the Kubernetes Resource, it fails with a message like:

resource urn:pulumi:ds-dog-k8s-dev::sg-deploy-k8s-helper::kubernetes:core/v1:Secret::langserver-auth refers to unknown provider urn:pulumi:ds-dog-k8s-dev::sg-deploy-k8s-helper::pulumi:providers:kubernetes::dogfood-full-k8s::3a90eb1d-d8d5-4272-ae29-300c34caaab9
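A minimal sketch of that setup (resource names taken from the error above; where the kubeconfig comes from is an assumption):

    import * as pulumi from "@pulumi/pulumi";
    import * as k8s from "@pulumi/kubernetes";

    const config = new pulumi.Config();

    // If whatever produces this kubeconfig is replaced (e.g. the cluster),
    // this explicit provider can end up being replaced as well.
    const provider = new k8s.Provider("dogfood-full-k8s", {
        kubeconfig: config.requireSecret("kubeconfig"),
    });

    // The Secret's checkpoint entry records a reference to the provider above.
    // If the pending-delete provider is removed up front, the old Secret's
    // entry is left pointing at a provider that no longer exists.
    const secret = new k8s.core.v1.Secret("langserver-auth", {
        stringData: { token: "placeholder" },
    }, { provider });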

To be correct, I believe we will need to postpone pending deletes to the end of the deployment.

@lukehoban lukehoban added feature/q3 kind/bug Some behavior is incorrect or out of spec labels Jul 19, 2019
@lukehoban lukehoban added this to the 0.26 milestone Jul 19, 2019
@lukehoban lukehoban added the p1 Bugs severe enough to be the next item assigned to an engineer label Jul 19, 2019
lukehoban commented:

Another example that is likely related - from https://pulumi-community.slack.com/archives/C84L4E3N1/p1563557008032100:

Just trying to get the initial cluster set up, and made some silly mistakes (set subnets to public, not private). But trying to make changes to the cluster config is crazy. It tries to replace the cluster, but then gets stuck since it can't delete the resources for the now deleted cluster.

Deleting everything now fails with dial tcp: lookup xxx.gr7.us-east-1.eks.amazonaws.com: no such host


pgavlin commented Aug 16, 2019

We've decided that this is too risky a change to take at this point in Q3. We will fix this ASAP post-1.0.

@pgavlin pgavlin modified the milestones: 0.26, 0.27 Aug 16, 2019
@lukehoban lukehoban modified the milestones: 0.27, 0.28 Sep 17, 2019
@pgavlin pgavlin modified the milestones: 0.28, 0.29 Oct 21, 2019
@pgavlin pgavlin modified the milestones: 0.29, 0.30 Nov 5, 2019
@pgavlin pgavlin modified the milestones: 0.30, 0.31 Dec 3, 2019
@lukehoban lukehoban modified the milestones: 0.31, 0.32 Feb 1, 2020
@pgavlin pgavlin removed the p1 Bugs severe enough to be the next item assigned to an engineer label Feb 12, 2020
@pgavlin pgavlin modified the milestones: 0.32, 0.33 Feb 12, 2020
@leezen leezen removed this from the 0.33 milestone Mar 9, 2020

mdcuk34 commented Nov 16, 2020

(quoting @lukehoban's comment above)

What's the recommended solution to get out of this weird state? I'm having similar issues as described in the Slack message, but I can't see the responses due to the 10,000-message limit. Error log below:


     Type                      Name                                    Status                  Info
     pulumi:pulumi:Stack       xxx-xxx-xxx-service-dev  **failed**              1 error
 -   ├─ aws:ec2:SecurityGroup  xxx-xxx-dev                        **deleting failed**     1 error
 -   └─ aws:lb:TargetGroup     xxx-xxx-dev                          **deleting failed**     1 error

Diagnostics:
  pulumi:pulumi:Stack (xxx-xxx-xxx-service-dev):
    error: update failed

  aws:lb:TargetGroup (xxx-xxx):
    error: deleting urn:pulumi:dev::xxx-xxx-xxx-service::aws:lb:ApplicationLoadBalancer$awsx:lb:ApplicationTargetGroup$aws:lb/targetGroup:TargetGroup::xxx-targetdev: 1 error occurred:
    	* Error deleting Target Group: ResourceInUse: Target group 'arn:aws:elasticloadbalancing:eu-west-1:675965213304:targetgroup/xxx-targetdev-74e5679/2fa26820b86b102b' is currently in use by a listener or a rule
    	status code: 400, request id: 72e44b5c-f97a-4f80-9f04-0bece5688359

  aws:ec2:SecurityGroup (xxx-cluster-dev):
    error: deleting urn:pulumi:dev::xxx-xxx-xxx-service::awsx:x:ecs:Cluster$awsx:x:ec2:SecurityGroup$aws:ec2/securityGroup:SecurityGroup::xxx-cluster-dev: 1 error occurred:
    	* Error deleting security group: DependencyViolation: resource sg-07d619669ce3f4793 has a dependent object
    	status code: 400, request id: 47918f5f-1a1a-44be-9772-32a6e73167aa


blampe commented Aug 10, 2022

This would be a great quality of life improvement! I've run into both of the problems Luke mentioned in the description.

lukehoban commented:

Another member of the internal team hit this today.

Their first update did the create side of a replacement of a LaunchConfiguration.

++  aws:ec2:LaunchConfiguration ecsClusterInstanceLaunchConfiguration create-replacement

Then that update failed, for a legitimate reason.

The next update they did failed almost immediately with:

ecsClusterInstanceLaunchConfiguration (aws:ec2:LaunchConfiguration)
completing deletion from previous update
 
error: deleting urn:pulumi:kimberley::pulumi-service::aws:ec2/launchConfiguration:LaunchConfiguration::ecsClusterInstanceLaunchConfiguration: 1 error occurred:
	* error deleting Autoscaling Launch Configuration (ecsClusterInstanceLaunchConfiguration-13f7e0f): ResourceInUse: Cannot delete launch configuration ecsClusterInstanceLaunchConfiguration-13f7e0f because it is attached to AutoScalingGroup autoScalingGroupStack-4a63cb8-Instances-L4JB1QE2ZJ6J
	status code: 400, request id: 88a8a416-5fd0-48a9-9d5d-52358c77e2df
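The shape of that program, roughly (a sketch only; the AMI, sizes, and subnet ID are placeholders):

    import * as aws from "@pulumi/aws";

    // Launch configurations are immutable in AWS, so most changes
    // (e.g. instanceType) force a create-replacement.
    const launchConfig = new aws.ec2.LaunchConfiguration("ecsClusterInstanceLaunchConfiguration", {
        imageId: "ami-0c55b159cbfafe1f0", // placeholder AMI
        instanceType: "t3.medium",
    });

    // The auto scaling group keeps the old launch configuration attached
    // until it has been updated to the new one, so deleting the old
    // configuration up front fails with ResourceInUse.
    const asg = new aws.autoscaling.Group("autoScalingGroupStack", {
        launchConfiguration: launchConfig.name,
        minSize: 1,
        maxSize: 3,
        vpcZoneIdentifiers: ["subnet-0123456789abcdef0"], // placeholder subnet
    });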

jonasgroendahl commented:

(quoting @mdcuk34's comment above)

Got exactly this error too.

I merely changed some VPC settings, and then it decided it was time to delete the target group; now I can't get rid of the "completing deletion from previous update..." state.


parryian commented Sep 7, 2022

Any suggested workarounds for this issue?

ralvarez-globant commented:

Same here... I tried to manually remove the resource from the stack, to no avail.
Now my stack has two identical resources...

Do you want to perform this update? yes
Updating (CLIENT/ENV)

View Live: https://app.pulumi.com/CLIENT/STACK/ENV/updates/NN

     Type                          Name           Status                  Info
     pulumi:pulumi:Stack           RESOURCE_NAME  **failed**              1 error
 -   └─ gcp:projects:IAMMember     BINDING_NAME   **deleting failed**     1 error

Diagnostics:
  gcp:projects:IAMMember (BINDING_NAME):
    error: unable to find required configuration setting: GCP Project
    Set the GCP Project by using:
        pulumi config set gcp:project <project>

Resources:

Duration: 2s

@ralvarez-globant
Copy link

ralvarez-globant commented Sep 9, 2022

OK, found a workaround. It's not pretty, but it does the job:

  1. Export your current state (and back it up)
  2. Back up your stack (just in case)
  3. Modify your stack.json (vim or whatever editor you choose). Just make sure to remove the source of the conflict (remove your conflicting resources and manually clean up your infrastructure)
  4. Make sure you are configuring the proper stack
  5. Import the modified stack
  6. Refresh and carry on from where you left off
pulumi stack export -s STACK > stack.json   # export current state
cp stack.json stack.json.origin             # keep a backup copy
vi stack.json                               # remove the conflicting resources
pulumi stack select STACK                   # make sure the right stack is selected
pulumi stack import < stack.json            # import the modified state
pulumi up                                   # continue from where you left off

@Frassle Frassle self-assigned this Oct 14, 2022
@Frassle Frassle added this to the 0.79 milestone Oct 14, 2022
Frassle added a commit that referenced this issue Oct 14, 2022
This removes all the handling of pending deletes from the start of
deployments. Instead we allow resources to just be deleted as they
usually would at the end of the deployment.

There's a big comment in TestPendingDeleteOrder that explains the order
of operations in a successful run and how that order differs if we try
to do pending deletes up-front.

Fixes #2948
Frassle added a commit that referenced this issue Oct 14, 2022
This removes all the handling of pending deletes from the start of
deployments. Instead we allow resources to just be deleted as they
usually would at the end of the deployment.

There's a big comment in TestPendingDeleteOrder that explains the order
of operations in a successful run and how that order differs if we try
to do pending deletes up-front.

Fixes #2948
solomonshorser commented:

@ralvarez-globant You say:

3. Modify your stack.json (vim or whatever editor you choose). Just make sure to remove the source of the conflict (remove your conflicting resources and manually clean up your infrastructure)

Did you just remove the problematic resource itself? I would imagine that you also need to remove any other resources that reference it as a dependency.

@mikhailshilkov mikhailshilkov modified the milestones: 0.79, 0.80 Oct 25, 2022
bors bot added a commit that referenced this issue Nov 1, 2022
11009: Fix update plans with dependent replacements r=Frassle a=Frassle


# Description


We weren't correctly handling the case where a resource was marked for deletion due to one of its dependencies being deleted. We would add an entry to its "Ops" list, but then overwrite that "Ops" list when we came to generate the recreation step.

Fixes #10924

## Checklist

- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have run `make changelog` and committed the `changelog/pending/<file>` documenting my change
- [ ] Yes, there are changes in this PR that warrant bumping the Pulumi Service API version


11027: Do not execute pending deletes at the start of deployment r=Frassle a=Frassle


# Description


This removes all the handling of pending deletes from the start of deployments. Instead we allow resources to just be deleted as they usually would at the end of the deployment.

There's a big comment in TestPendingDeleteOrder that explains the order of operations in a successful run and how that order differs if we try to do pending deletes up-front.

Fixes #2948

## Checklist

- [x] I have added tests that prove my fix is effective or that my feature works
- [ ] I have run `make changelog` and committed the `changelog/pending/<file>` documenting my change
- [ ] Yes, there are changes in this PR that warrant bumping the Pulumi Service API version


Co-authored-by: Fraser Waters <fraser@pulumi.com>
@bors bors bot closed this as completed in a3128e5 Nov 1, 2022
@pulumi-bot pulumi-bot added the resolution/fixed This issue was fixed label Nov 1, 2022
@ericrudder ericrudder changed the title Oct 1, 2023