
destroy: return aws resources that could not be deleted #3772

Closed

Conversation

@staebler (Contributor) commented Jun 18, 2020

The RunWithContext method has been modified to return a slice of ARNs that could not be destroyed. Only the ARNs of the first-level resources will be returned. For example, when deleting a VPC, the uninstaller will first delete other resources that use the VPC. Any of those resources that are blocked from being deleted but are not tagged for the cluster will not show up in the list of blocked resources. However, the first-level VPC will show up as blocked in this case.

This will be used by Hive to expose to the user the resources that cannot be deleted so that the user can take action on those resources.

https://issues.redhat.com/browse/CO-973
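
The shape of the change, as a minimal sketch (hypothetical type and ARN; the real implementation lives in pkg/destroy/aws):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// uninstaller is a hypothetical stand-in for the installer's AWS destroyer.
type uninstaller struct{}

// RunWithContext deletes the tagged cluster resources and, with this change,
// returns the ARNs of the first-level resources that could not be deleted.
func (u *uninstaller) RunWithContext(ctx context.Context) ([]string, error) {
	// Deletion loop elided. Suppose the VPC delete was blocked by an
	// untagged dependent: only the first-level VPC ARN is reported.
	blocked := []string{"arn:aws:ec2:us-east-1:123456789012:vpc/vpc-0abc123"}
	return blocked, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	u := &uninstaller{}
	blocked, err := u.RunWithContext(ctx)
	if err != nil {
		fmt.Println("destroy failed:", err)
		return
	}
	// Hive can surface these ARNs so the user can act on them.
	for _, arn := range blocked {
		fmt.Println("could not delete:", arn)
	}
}
```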

@staebler (Contributor, Author)

/hold

This builds on #3765.

@openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jun 18, 2020
@staebler force-pushed the aws_destroy_return_blockers branch from 7d5fcb4 to 9465ece on June 18, 2020 12:40
@staebler (Contributor, Author)

As suggested in #3765 (comment), the destroyer will now collect, but not attempt to delete, the undeleted tagged resources when the context expires while trying to delete the EC2 instances.
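
In sketch form (illustrative Go, not the actual diff; terminate is a hypothetical stand-in for the per-instance EC2 call):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// terminate is a hypothetical stand-in for the per-instance EC2 delete call.
func terminate(arn string) error {
	return errors.New("DependencyViolation") // pretend AWS refused
}

// terminateInstances collects, but does not attempt to delete, whatever is
// left when the context expires, matching the behaviour described above.
func terminateInstances(ctx context.Context, arns []string) (remaining []string) {
	for i, arn := range arns {
		select {
		case <-ctx.Done():
			// Context expired: record the instances we never got to.
			return append(remaining, arns[i:]...)
		default:
		}
		if err := terminate(arn); err != nil {
			remaining = append(remaining, arn)
		}
	}
	return remaining
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate an already-expired context
	left := terminateInstances(ctx, []string{"arn:aws:ec2:us-east-1:123456789012:instance/i-0abc"})
	fmt.Println("not deleted:", left)
}
```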

@staebler (Contributor, Author)

/hold cancel

#3765 has merged.

@openshift-ci-robot removed the do-not-merge/hold label on Jun 19, 2020
@sdodson (Member) commented Jul 16, 2020

/assign @patrickdillon @jstuever
PTAL

@gregsheremeta (Contributor)

Hi, we really need someone to take a look at this. @patrickdillon @jstuever do you have some cycles to review?

@sdodson @abhinavdahiya can you assign other reviewers if @patrickdillon and @jstuever can't get to it?

@jstuever (Contributor) commented Aug 7, 2020

/cc @abhinavdahiya
I invested some time earlier in this PR trying to understand the changes; there is a lot going on. I believe it may change behaviour in terms of what is printed as "Deleted". I think someone who is more familiar with the AWS destroy code should review it.

@jstuever (Contributor)

I'll take another look at this after code freeze; sorry for the delay.

@abhinavdahiya (Contributor) left a comment

I am not in favor of the second level of tracking being added here for remaining resources, i.e. we need to delete the instance (which we identified from its tag), but that usually requires us to also clean up the instance's IAM profile, so we also add that profile resource to the tracker.

I think the best way to provide the most information is to just track the tagged resources when describing to users what remains; our discovery is best effort, and a lot of the time there are other resources blocking deletion of the tagged resources that only AWS knows about or surfaces in its error messages.

So telling the user that the VPC deletion is left is all the information the user requires to go ahead and delete the VPC in case we cannot.
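
The suggestion, sketched (hypothetical types; not code from this PR): on a failed delete, record the tagged resource itself rather than descending into its untagged dependents.

```go
package main

import "fmt"

// resource is a hypothetical tagged cluster resource.
type resource struct {
	arn string
}

// deleteTagged attempts each tagged resource once and reports only the
// tagged (first-level) resources that remain; AWS's own error message is
// left to name the untagged blocker.
func deleteTagged(resources []resource, del func(resource) error) []string {
	var remaining []string
	for _, r := range resources {
		if err := del(r); err != nil {
			remaining = append(remaining, r.arn)
		}
	}
	return remaining
}

func main() {
	vpc := resource{arn: "arn:aws:ec2:us-east-1:123456789012:vpc/vpc-0abc123"}
	left := deleteTagged([]resource{vpc}, func(resource) error {
		return fmt.Errorf("DependencyViolation: the vpc has dependencies")
	})
	fmt.Println("remaining:", left)
}
```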

@staebler (Contributor, Author) commented Oct 8, 2020

> I am not in favor of the second level of tracking being added here for remaining resources, i.e. we need to delete the instance (which we identified from its tag), but that usually requires us to also clean up the instance's IAM profile, so we also add that profile resource to the tracker.

Thanks for the review, @abhinavdahiya. I can live with only returning the tagged resources that could not be deleted rather than also including the next-level resources. I will rework this PR with that in mind.

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign jstuever after the PR has been reviewed.
You can assign the PR to them by writing /assign @jstuever in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@staebler (Contributor, Author)


@abhinavdahiya I've rewritten the changes. The logic should be much easier to follow now.

nextTagClients = append(nextTagClients, tagClient)
} else {
o.Logger.Debugf("no deletions from %s, removing client", tagClientNames[tagClient])
instancesToDelete, err := o.findEC2Instances(ctx, ec2Client, deleted)
@staebler (Contributor, Author)

I would like to add a comment about why EC2 instances are special and need to be in the terminated state first before attempting to delete other resources. Can someone shed light on this for me?

Contributor

This commit message should provide context on why we stop instances before anything else. cf69c1e

@staebler (Contributor, Author)

Great! Thanks.
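
For readers without that commit handy: broadly, running instances keep holding dependent resources (network interfaces, instance profiles) until they reach the terminated state, so the destroyer terminates and waits before deleting anything else. With aws-sdk-go, that wait might look like the following sketch (made-up instance ID; not the PR's code):

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// The session carries the region; clients created from it inherit it.
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	client := ec2.New(sess)
	ctx := context.Background()

	ids := []*string{aws.String("i-0123456789abcdef0")} // hypothetical instance ID

	// Ask AWS to terminate the instances...
	if _, err := client.TerminateInstancesWithContext(ctx, &ec2.TerminateInstancesInput{
		InstanceIds: ids,
	}); err != nil {
		log.Fatal(err)
	}

	// ...and wait until they are actually terminated, since dependent
	// resources cannot be deleted while instances still hold them.
	if err := client.WaitUntilInstanceTerminatedWithContext(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: ids,
	}); err != nil {
		log.Fatal(err)
	}
	log.Println("instances terminated; safe to delete the remaining resources")
}
```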

tagClientStack := append([]*resourcegroupstaggingapi.ResourceGroupsTaggingAPI(nil), tagClients...)

// Terminate EC2 instances.
ec2Client := ec2.New(awsSession)
Contributor

I think this needs a region argument; otherwise this might not function correctly..?

Something like aws.NewConfig().WithRegion(region).

@staebler (Contributor, Author)

The current code in master does not specify the region. The awsSession should already be configured with the region.

deleted, err := terminateEC2InstancesByTags(ctx, ec2.New(awsSession), iamClient, o.Filters, o.Logger)
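
For reference, both patterns work in aws-sdk-go: a client inherits the session's region, and an explicit per-client override is also possible (a sketch, not this PR's code):

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// What the PR relies on: the session is already configured with the
	// region, so clients created from it inherit it.
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	_ = ec2.New(sess)

	// The reviewer's alternative: pass an explicit per-client override.
	_ = ec2.New(sess, aws.NewConfig().WithRegion("us-west-2"))
}
```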

@abhinavdahiya (Contributor)

The change looks good at a high level, but the destroy isn't completing currently; see e2e-aws exceeding the 4-hour timeout with no deprovision logs.

So I think there is something missing here. Maybe https://github.com/openshift/installer/pull/3772/files#discussion_r504111660?

@openshift-ci-robot (Contributor)

@staebler: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-crc 5db7fb7 link /test e2e-crc
ci/prow/e2e-ovirt 5db7fb7 link /test e2e-ovirt
ci/prow/e2e-libvirt 5db7fb7 link /test e2e-libvirt
ci/prow/e2e-aws 5db7fb7 link /test e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@staebler (Contributor, Author)

> The change looks good at a high level, but the destroy isn't completing currently; see e2e-aws exceeding the 4-hour timeout with no deprovision logs.
>
> So I think there is something missing here. Maybe https://github.com/openshift/installer/pull/3772/files#discussion_r504111660?

The tests are timing out running e2e-aws-gather-core-dump, before any attempts to deprovision.

@staebler (Contributor, Author)

>> The change looks good at a high level, but the destroy isn't completing currently; see e2e-aws exceeding the 4-hour timeout with no deprovision logs.
>> So I think there is something missing here. Maybe https://github.com/openshift/installer/pull/3772/files#discussion_r504111660?
>
> The tests are timing out running e2e-aws-gather-core-dump, before any attempts to deprovision.

I think that it is a problem with this PR in particular and not the code in general. This may be because this PR is quite old and is missing some changes that have been made to the tests in the past couple of months. I copied the branch and opened a new PR with the same code (#4270), and the e2e-aws test passes. I am going to close this PR in favor of the new one.

/close

@openshift-ci-robot (Contributor)

@staebler: Closed this PR.

In response to this:

> The change looks good at a high level, but the destroy isn't completing currently; see e2e-aws exceeding the 4-hour timeout with no deprovision logs.
> So I think there is something missing here. Maybe https://github.com/openshift/installer/pull/3772/files#discussion_r504111660?
>
> The tests are timing out running e2e-aws-gather-core-dump, before any attempts to deprovision.
>
> I think that it is a problem with this PR in particular and not the code in general. This may be because this PR is quite old and is missing some changes that have been made to the tests in the past couple of months. I copied the branch and opened a new PR with the same code (#4270), and the e2e-aws test passes. I am going to close this PR in favor of the new one.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
