Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

destroy: provide way to stop aws uninstall using context #3765

Merged

Conversation

staebler
Copy link
Contributor

@staebler staebler commented Jun 17, 2020

The RunWithContext method was added to the AWS uninstaller. This provides a way for the user to supply a context that can be used to stop the uninstall.

This will be used by Hive to backoff uninstall attempts.

https://issues.redhat.com/browse/CO-973

@staebler
Copy link
Contributor Author

/cc @csrwng

pkg/destroy/aws/aws.go Outdated Show resolved Hide resolved
Copy link
Contributor

@csrwng csrwng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me, just a nit about the named return values... in most cases it seems that just the types would be enough to understand what is being returned.

pkg/destroy/aws/aws.go Outdated Show resolved Hide resolved
profile := fmt.Sprintf("%s-%s-profile", o.ClusterID, profileType)
response, err := iamClient.GetInstanceProfileWithContext(ctx, &iam.GetInstanceProfileInput{InstanceProfileName: &profile})
if err != nil {
if err.(awserr.Error).Code() == iam.ErrCodeNoSuchEntityException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need a get preflight? Looks like we catch and ignore this from the delete itself in deleteIAMInstanceProfileByName since the function landed in #1268.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The preflight is so that I can get the ARN to record whether it was deleted successfully.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update deleteIAMInstanceProfileByName to return the deleted ARN?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can. But that function does not know the ARN either, so it would have to look it up. The other caller of the function already knows the ARN, so it would be a wasted request for that caller.

Copy link
Contributor Author

@staebler staebler Jun 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like I can deduce the ARN without looking up the instance profile. Let me change it to do that instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it is not that straight-forward. The account ID is not known in that function.

}

// merge merges the record of destroyed resources from changes into r. If changes and r have results for the same
// resource, the results from changes have precedence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This precedence doesn't make sense to me. Wouldn't we want "if either value was true, there merged value is true"? Because if Alice deleted A and merges with Bob who failed to delete A, A is still gone, regardless of Bob's failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, thinking of it another way, if Alice failed to delete A even though Bob previously succeeded in deleting A, then A is still there regardless of whether Bob originally thought it had been deleted successfully.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...regardless of whether Bob originally thought it had been deleted successfully.

That is very bad. Why would we be ok with Bob thinking he deleted something when it had not been deleted? I don't want to rely on Alice to come along and double-check him.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I don't care too much either way, so I will change it.

// the ARNs of those resources.
func (r destroyedResources) extractRemaining() []string {
if r == nil {
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels strange to me to have a nil guard here but not in merge or record. Is there a reason behind the difference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nil guard in merge or record would mean that the operation did not do anything. The functions cannot maps to merge or record into that will be accessible to the caller.

This works the same way as the operations on a general map. You can get from a nil map, but you cannot set into a nil map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you feel better about it, I can get rid of the nil guard. The function will return an empty slice instead of a nil slice, in that case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nil guard in merge or record would mean that the operation did not do anything.

Makes sense. Where are we initializing these maps today? Do we need a:

func newDestroyedResources() destroyedResources {
  return map[string]bool{}
}

or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are initialized with destroyedResources{}. I don't think we need to introduce a function to do that.

for k, v := range r {
if !v {
notDestroyed = append(notDestroyed, k)
delete(r, k)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the delete? They're still ARNs we have not deleted, so seems like they should stay in r.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The delete is for the case where there was a failure deleting a resource in one iteration of the loop but the resource is no longer present in subsequent iterations of the loop. For example, let's say the destroyer failed to delete resource A in iteration 1. Later in that same iteration, deleting a different resource had the ripple effect of deleting resource A. In iteration 2, there is no attempt to delete resource A since it is not found when looking for tagged resources. When the context expires, the RunWithContext function will respond that resource A could not be deleted, even though it no longer exists.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The delete is for the case where there was a failure deleting a resource in one iteration of the loop but the resource is no longer present in subsequent iterations of the loop.

Hmm... I think we may want to iterate over undeleted resources and confirm that they are gone, instead of hoping that we'll discover them again if they aren't gone. Of course, if we kill and restart the destroyer, we'd loose that state. But within a given run, I'm concerned about conflating "we don't check it again" with "it is deleted". Don't hold on this account though; I'll mull it over some more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This retains the same behavior that we have had before. If a resource is not found on a subsequent iteration, then the destroyer makes no further attempt to delete the resource.

if err != nil {
return err
return deleted.extractRemaining(), err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be a very limited view of remaining ARNs, since it does not include all the other resource types we check in the poll loop below. How thorough do callers expect our remaining-ARN list to be?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. I am not sure what else we can do here because this will happen when the context expires. It may be better to collect all of the ARNs from tags upfront before attempting to delete anything. That still would not cover sub-resources, but at least it will give a more complete picture.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That still would not cover sub-resources...

Personally, I'm fine treating sub-resources of owned resources as also owned, so a caller who received a set of top-level resources could recurse through sub-resources and remove the whole set. @abhinavdahiya has pushed back on that interpretation, although I'm not clear on his counter-argument. In the context of this PR, that means that I'm fine with returning a list of whatever the tag-search logic turns up, but that you might want to run it by him before coding anything up.

@wking
Copy link
Member

wking commented Jun 17, 2020

Returning remaining ARNs seems tricky. If we want to land a good chunk of this change while we sort out the remaining-ARN logic, maybe spin out the Context addition to a separate PR? Seems like that would be an easier review. But if folks are happy with both in this PR, that's fine with me too.

@abhinavdahiya
Copy link
Contributor

The PR title and the changes in this PR do not match. Either separate the context cancel and report remaining out into separate PRs or update the PR + desc to include/highlight info on the remaining reporting changes.

@staebler
Copy link
Contributor Author

I will extract the changes for returning the resources that could not be deleted into a separate PR.

The `RunWithContext` method was added to the AWS uninstaller. This provides
a way for the user to supply a context that can be used to stop the
uninstall.

This will be used by Hive to backoff uninstall attempts.

https://issues.redhat.com/browse/CO-973
@staebler
Copy link
Contributor Author

This PR has been updated to only include the option of providing a context and not the returning of resources that could not be deleted.

@abhinavdahiya
Copy link
Contributor

/approve
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 18, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 18, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Jun 18, 2020

@staebler: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-ovirt dea5719 link /test e2e-ovirt

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking
Copy link
Member

wking commented Jun 19, 2020

Could not resolve host: github.com

/retest

@openshift-merge-robot openshift-merge-robot merged commit 9eb7698 into openshift:master Jun 19, 2020
staebler added a commit to staebler/installer that referenced this pull request Jun 24, 2020
The Run function in the Destroyer interface was modified to take a context as
a parameter. This provides a way for the user stop the uninstall after a
period of time by providing a context with a deadline.

A "--timeout" flag was added to the `openshift-install destroy cluster` command.
This allows the user to time out the destroy after a specified number of seconds.

The baremetal, libvirt, openstack, and ovirt providers do not provide a means
by which most requests made to the provider can be stopped prematurely. In these
cases, the context is checked prior to making the requests as a best effort.
But the uninstall may continue for a period of time after the context is done.

The RunWithContext function introduced in openshift#3765
for AWS has been obsoleted since the Run function now accepts a context.

This will be used by Hive to backoff uninstall attempts.

https://issues.redhat.com/browse/CO-974
staebler added a commit to staebler/installer that referenced this pull request Jun 24, 2020
The Run function in the Destroyer interface was modified to take a context as
a parameter. This provides a way for the user stop the uninstall after a
period of time by providing a context with a deadline.

A "--timeout" flag was added to the `openshift-install destroy cluster` command.
This allows the user to time out the destroy after a specified number of seconds.

The baremetal, libvirt, openstack, and ovirt providers do not provide a means
by which most requests made to the provider can be stopped prematurely. In these
cases, the context is checked prior to making the requests as a best effort.
But the uninstall may continue for a period of time after the context is done.

The RunWithContext function introduced in openshift#3765
for AWS has been obsoleted since the Run function now accepts a context.

This will be used by Hive to backoff uninstall attempts.

https://issues.redhat.com/browse/CO-974
staebler added a commit to staebler/installer that referenced this pull request Jun 24, 2020
The Run function in the Destroyer interface was modified to take a context as
a parameter. This provides a way for the user stop the uninstall after a
period of time by providing a context with a deadline.

A "--timeout" flag was added to the `openshift-install destroy cluster` command.
This allows the user to time out the destroy after a specified number of seconds.

The baremetal, libvirt, openstack, and ovirt providers do not provide a means
by which most requests made to the provider can be stopped prematurely. In these
cases, the context is checked prior to making the requests as a best effort.
But the uninstall may continue for a period of time after the context is done.

The RunWithContext function introduced in openshift#3765
for AWS has been obsoleted since the Run function now accepts a context.

This will be used by Hive to backoff uninstall attempts.

https://issues.redhat.com/browse/CO-974
staebler added a commit to staebler/installer that referenced this pull request Jun 24, 2020
The Run function in the Destroyer interface was modified to take a context as
a parameter. This provides a way for the user stop the uninstall after a
period of time by providing a context with a deadline.

A "--timeout" flag was added to the `openshift-install destroy cluster` command.
This allows the user to time out the destroy after a specified number of seconds.

The baremetal, libvirt, openstack, and ovirt providers do not provide a means
by which most requests made to the provider can be stopped prematurely. In these
cases, the context is checked prior to making the requests as a best effort.
But the uninstall may continue for a period of time after the context is done.

The RunWithContext function introduced in openshift#3765
for AWS has been obsoleted since the Run function now accepts a context.

This will be used by Hive to backoff uninstall attempts.

https://issues.redhat.com/browse/CO-974
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants