
Deleting pendingDeletes at beginning of deployment leads to stuck states #2948

Closed
lukehoban opened this issue Jul 19, 2019 · 10 comments · Fixed by #11027
Assignees: Frassle
Labels: area/engine, kind/bug, resolution/fixed
Milestone: 0.80

Comments

lukehoban commented Jul 19, 2019

Today, we process any pendingDeletes at the beginning of a deployment. This is not "correct".

Two examples:

First, a program with a VPC and an Instance. A change causes the VPC to be replaced, and the Instance fails to create. This leaves a newly created VPC and a pending-delete VPC. On the next update, we try to flush the pending deletes, which means trying to delete the old VPC. This fails, because the Instance is still running in the old VPC. It is only "correct" to delete the old VPC at the end of the deployment, after all other repercussions of the replacement have been processed.
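For concreteness, a minimal sketch of the shape of such a program (names, AMI, and CIDRs are illustrative, not from the original report):

    import * as aws from "@pulumi/aws";

    // Changing the VPC's cidrBlock forces the VPC to be replaced.
    const vpc = new aws.ec2.Vpc("my-vpc", { cidrBlock: "10.0.0.0/16" });

    const subnet = new aws.ec2.Subnet("my-subnet", {
        vpcId: vpc.id,
        cidrBlock: "10.0.1.0/24",
    });

    // The instance keeps running in the old VPC until its own replacement
    // succeeds, so the old VPC cannot be deleted before that happens.
    const instance = new aws.ec2.Instance("my-instance", {
        ami: "ami-0c55b159cbfafe1f0", // placeholder AMI
        instanceType: "t3.micro",
        subnetId: subnet.id,
    });

If the Instance's create-replacement fails, the old VPC is left as a pendingDelete, and flushing it at the start of the next update fails because the old Instance (and its subnet) still live in it.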

Second, a Kubernetes Provider and a Kubernetes Resource. A change causes the Kubernetes Provider to be replaced, but the Kubernetes Resource fails to create. This leaves both a newly created Provider and a pending-delete Provider in the checkpoint. On the next update, we successfully delete the pending-delete Provider from the checkpoint. However, the remaining resources in the checkpoint still hold provider references to a provider that no longer exists. When we try to process the recreate of the Kubernetes Resource, it fails with a message like:

resource urn:pulumi:ds-dog-k8s-dev::sg-deploy-k8s-helper::kubernetes:core/v1:Secret::langserver-auth refers to unknown provider urn:pulumi:ds-dog-k8s-dev::sg-deploy-k8s-helper::pulumi:providers:kubernetes::dogfood-full-k8s::3a90eb1d-d8d5-4272-ae29-300c34caaab9
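A minimal sketch of that setup (resource names taken from the error above; where the kubeconfig comes from is an assumption):

    import * as pulumi from "@pulumi/pulumi";
    import * as k8s from "@pulumi/kubernetes";

    const config = new pulumi.Config();

    // If whatever produces this kubeconfig is replaced (e.g. the cluster),
    // this explicit provider can end up being replaced as well.
    const provider = new k8s.Provider("dogfood-full-k8s", {
        kubeconfig: config.requireSecret("kubeconfig"),
    });

    // The Secret's checkpoint entry records a reference to the provider above.
    // If the pending-delete provider is removed up front, the old Secret's
    // entry is left pointing at a provider that no longer exists.
    const secret = new k8s.core.v1.Secret("langserver-auth", {
        stringData: { token: "placeholder" },
    }, { provider });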

To be correct, I believe we will need to postpone pending deletes to the end of the deployment.

@lukehoban lukehoban added feature/q3 kind/bug Some behavior is incorrect or out of spec labels Jul 19, 2019
@lukehoban lukehoban added this to the 0.26 milestone Jul 19, 2019
@lukehoban lukehoban added the p1 Bugs severe enough to be the next item assigned to an engineer label Jul 19, 2019
lukehoban commented:

Another example that is likely related - from https://pulumi-community.slack.com/archives/C84L4E3N1/p1563557008032100:

Just trying to get the initial cluster set up, and made some silly mistakes (set subnets to public, not private). But trying to make changes to the cluster config is crazy. It tries to replace the cluster, but then gets stuck since it can't delete the resources for the now deleted cluster.

Deleting everything now fails with dial tcp: lookup xxx.gr7.us-east-1.eks.amazonaws.com: no such host


pgavlin commented Aug 16, 2019

We've decided that this is too risky a change to take at this point in Q3. We will fix this ASAP post-1.0.

@pgavlin pgavlin modified the milestones: 0.26, 0.27 Aug 16, 2019
@lukehoban lukehoban modified the milestones: 0.27, 0.28 Sep 17, 2019
@pgavlin pgavlin modified the milestones: 0.28, 0.29 Oct 21, 2019
@pgavlin pgavlin modified the milestones: 0.29, 0.30 Nov 5, 2019
@pgavlin pgavlin modified the milestones: 0.30, 0.31 Dec 3, 2019
@lukehoban lukehoban modified the milestones: 0.31, 0.32 Feb 1, 2020
@pgavlin pgavlin removed the p1 Bugs severe enough to be the next item assigned to an engineer label Feb 12, 2020
@pgavlin pgavlin modified the milestones: 0.32, 0.33 Feb 12, 2020
@leezen leezen removed this from the 0.33 milestone Mar 9, 2020

mdcuk34 commented Nov 16, 2020

(quoting @lukehoban's comment above)

What's the recommended solution to get out of this weird state? I'm having similar issues as described in the Slack message, but I can't see the responses due to the 10,000-message limit. Error log below:


     Type                      Name                                    Status                  Info
     pulumi:pulumi:Stack       xxx-xxx-xxx-service-dev  **failed**              1 error
 -   ├─ aws:ec2:SecurityGroup  xxx-xxx-dev                        **deleting failed**     1 error
 -   └─ aws:lb:TargetGroup     xxx-xxx-dev                          **deleting failed**     1 error

Diagnostics:
  pulumi:pulumi:Stack (xxx-xxx-xxx-service-dev):
    error: update failed

  aws:lb:TargetGroup (xxx-xxx):
    error: deleting urn:pulumi:dev::xxx-xxx-xxx-service::aws:lb:ApplicationLoadBalancer$awsx:lb:ApplicationTargetGroup$aws:lb/targetGroup:TargetGroup::xxx-targetdev: 1 error occurred:
    	* Error deleting Target Group: ResourceInUse: Target group 'arn:aws:elasticloadbalancing:eu-west-1:675965213304:targetgroup/xxx-targetdev-74e5679/2fa26820b86b102b' is currently in use by a listener or a rule
    	status code: 400, request id: 72e44b5c-f97a-4f80-9f04-0bece5688359

  aws:ec2:SecurityGroup (xxx-cluster-dev):
    error: deleting urn:pulumi:dev::xxx-xxx-xxx-service::awsx:x:ecs:Cluster$awsx:x:ec2:SecurityGroup$aws:ec2/securityGroup:SecurityGroup::xxx-cluster-dev: 1 error occurred:
    	* Error deleting security group: DependencyViolation: resource sg-07d619669ce3f4793 has a dependent object
    	status code: 400, request id: 47918f5f-1a1a-44be-9772-32a6e73167aa


blampe commented Aug 10, 2022

This would be a great quality of life improvement! I've run into both of the problems Luke mentioned in the description.

lukehoban commented:

Another member of the internal team hit this today.

Their first update did the create side of a replacement of a LaunchConfiguration.

++  aws:ec2:LaunchConfiguration ecsClusterInstanceLaunchConfiguration create-replacement

Then that update failed, for a legitimate reason.

The next update they did failed almost immediately with:

ecsClusterInstanceLaunchConfiguration (aws:ec2:LaunchConfiguration)
completing deletion from previous update
 
error: deleting urn:pulumi:kimberley::pulumi-service::aws:ec2/launchConfiguration:LaunchConfiguration::ecsClusterInstanceLaunchConfiguration: 1 error occurred:
	* error deleting Autoscaling Launch Configuration (ecsClusterInstanceLaunchConfiguration-13f7e0f): ResourceInUse: Cannot delete launch configuration ecsClusterInstanceLaunchConfiguration-13f7e0f because it is attached to AutoScalingGroup autoScalingGroupStack-4a63cb8-Instances-L4JB1QE2ZJ6J
	status code: 400, request id: 88a8a416-5fd0-48a9-9d5d-52358c77e2df
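The shape of that program, roughly (a sketch only; the AMI, sizes, and subnet ID are placeholders):

    import * as aws from "@pulumi/aws";

    // Launch configurations are immutable in AWS, so most changes
    // (e.g. instanceType) force a create-replacement.
    const launchConfig = new aws.ec2.LaunchConfiguration("ecsClusterInstanceLaunchConfiguration", {
        imageId: "ami-0c55b159cbfafe1f0", // placeholder AMI
        instanceType: "t3.medium",
    });

    // The auto scaling group keeps the old launch configuration attached
    // until it has been updated to the new one, so deleting the old
    // configuration up front fails with ResourceInUse.
    const asg = new aws.autoscaling.Group("autoScalingGroupStack", {
        launchConfiguration: launchConfig.name,
        minSize: 1,
        maxSize: 3,
        vpcZoneIdentifiers: ["subnet-0123456789abcdef0"], // placeholder subnet
    });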

jonasgroendahl commented:

(quoting @mdcuk34's comment above)

Got exactly this error too.

I merely changed some VPC settings, and then it decided it was time to delete the target group; now I can't get rid of the "completing deletion from previous update..." state.


parryian commented Sep 7, 2022

Any suggested workarounds for this issue?

ralvarez-globant commented:

Same here... I tried to manually remove the resource from the stack, to no avail.
Now my stack has two identical resources...

Do you want to perform this update? yes
Updating (CLIENT/ENV)

View Live: https://app.pulumi.com/CLIENT/STACK/ENV/updates/NN

     Type                          Name           Status                  Info
     pulumi:pulumi:Stack           RESOURCE_NAME  **failed**              1 error
 -   └─ gcp:projects:IAMMember     BINDING_NAME   **deleting failed**     1 error

Diagnostics:
  gcp:projects:IAMMember (BINDING_NAME):
    error: unable to find required configuration setting: GCP Project
    Set the GCP Project by using:
        pulumi config set gcp:project <project>

Resources:

Duration: 2s

@ralvarez-globant
Copy link

ralvarez-globant commented Sep 9, 2022

OK, found a workaround. It's not pretty, but it does the job:

  1. Export your current state (and back it up)
  2. Back up your stack (just in case)
  3. Modify your stack.json (vim or whatever editor you choose). Just make sure to remove the source of the conflict (remove your conflicting resources and manually clean up your infrastructure)
  4. Make sure you are configuring the proper stack
  5. Import the modified stack
  6. Refresh and carry on from where you left off
pulumi stack export -s STACK > stack.json   # export current state
cp stack.json stack.json.origin             # keep a backup copy
vi stack.json                               # remove the conflicting resources
pulumi stack select STACK                   # make sure the right stack is selected
pulumi stack import < stack.json            # import the modified state
pulumi up                                   # continue from where you left off

@Frassle Frassle self-assigned this Oct 14, 2022
@Frassle Frassle added this to the 0.79 milestone Oct 14, 2022
Frassle added a commit that referenced this issue Oct 14, 2022
This removes all the handling of pending deletes from the start of
deployments. Instead we allow resources to just be deleted as they
usually would at the end of the deployment.

There's a big comment in TestPendingDeleteOrder that explains the order
of operations in a successful run and how that order differs if we try
to do pending deletes up-front.

Fixes #2948
Frassle added a commit that referenced this issue Oct 14, 2022
This removes all the handling of pending deletes from the start of
deployments. Instead we allow resources to just be deleted as they
usually would at the end of the deployment.

There's a big comment in TestPendingDeleteOrder that explains the order
of operations in a successful run and how that order differs if we try
to do pending deletes up-front.

Fixes #2948
solomonshorser commented:

@ralvarez-globant You say:

3. Modify your stack.json (vim or whatever editor you choose). Just make sure to remove the source of the conflict (remove your conflicting resources and manually clean up your infrastructure)

Did you just remove the problematic resource itself? I would imagine that you also need to remove any other resources that reference it as a dependency.

@mikhailshilkov mikhailshilkov modified the milestones: 0.79, 0.80 Oct 25, 2022
bors bot added a commit that referenced this issue Nov 1, 2022
11009: Fix update plans with dependent replacements r=Frassle a=Frassle


# Description


We weren't correctly handling the case where a resource was marked for deletion due to one of its dependencies being deleted. We would add an entry to its "Ops" list, but then overwrite that "Ops" list when we came to generate the recreation step.

Fixes #10924

## Checklist

- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have run `make changelog` and committed the `changelog/pending/<file>` documenting my change
- [ ] Yes, there are changes in this PR that warrant bumping the Pulumi Service API version


11027: Do not execute pending deletes at the start of deployment r=Frassle a=Frassle


# Description


This removes all the handling of pending deletes from the start of deployments. Instead we allow resources to just be deleted as they usually would at the end of the deployment.

There's a big comment in TestPendingDeleteOrder that explains the order of operations in a successful run and how that order differs if we try to do pending deletes up-front.

Fixes #2948

## Checklist

- [x] I have added tests that prove my fix is effective or that my feature works
- [ ] I have run `make changelog` and committed the `changelog/pending/<file>` documenting my change
- [ ] Yes, there are changes in this PR that warrant bumping the Pulumi Service API version


Co-authored-by: Fraser Waters <fraser@pulumi.com>
@bors bors bot closed this as completed in a3128e5 Nov 1, 2022
@pulumi-bot pulumi-bot added the resolution/fixed This issue was fixed label Nov 1, 2022
@ericrudder ericrudder changed the title Oct 1, 2023