New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
destroy: provide way to stop aws uninstall using context #3765
destroy: provide way to stop aws uninstall using context #3765
Conversation
/cc @csrwng |
6a1f732
to
705391f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me, just a nit about the named return values... in most cases it seems that just the types would be enough to understand what is being returned.
pkg/destroy/aws/aws.go
Outdated
profile := fmt.Sprintf("%s-%s-profile", o.ClusterID, profileType) | ||
response, err := iamClient.GetInstanceProfileWithContext(ctx, &iam.GetInstanceProfileInput{InstanceProfileName: &profile}) | ||
if err != nil { | ||
if err.(awserr.Error).Code() == iam.ErrCodeNoSuchEntityException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need a get preflight? Looks like we catch and ignore this from the delete itself in deleteIAMInstanceProfileByName
since the function landed in #1268.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The preflight is so that I can get the ARN to record whether it was deleted successfully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we update deleteIAMInstanceProfileByName
to return the deleted ARN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can. But that function does not know the ARN either, so it would have to look it up. The other caller of the function already knows the ARN, so it would be a wasted request for that caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like I can deduce the ARN without looking up the instance profile. Let me change it to do that instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it is not that straight-forward. The account ID is not known in that function.
pkg/destroy/aws/aws.go
Outdated
} | ||
|
||
// merge merges the record of destroyed resources from changes into r. If changes and r have results for the same | ||
// resource, the results from changes have precedence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This precedence doesn't make sense to me. Wouldn't we want "if either value was true
, there merged value is true
"? Because if Alice deleted A and merges with Bob who failed to delete A, A is still gone, regardless of Bob's failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, thinking of it another way, if Alice failed to delete A even though Bob previously succeeded in deleting A, then A is still there regardless of whether Bob originally thought it had been deleted successfully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...regardless of whether Bob originally thought it had been deleted successfully.
That is very bad. Why would we be ok with Bob thinking he deleted something when it had not been deleted? I don't want to rely on Alice to come along and double-check him.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I don't care too much either way, so I will change it.
pkg/destroy/aws/aws.go
Outdated
// the ARNs of those resources. | ||
func (r destroyedResources) extractRemaining() []string { | ||
if r == nil { | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feels strange to me to have a nil
guard here but not in merge
or record
. Is there a reason behind the difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nil guard in merge
or record
would mean that the operation did not do anything. The functions cannot maps to merge or record into that will be accessible to the caller.
This works the same way as the operations on a general map. You can get from a nil map, but you cannot set into a nil map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you feel better about it, I can get rid of the nil guard. The function will return an empty slice instead of a nil slice, in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nil guard in
merge
orrecord
would mean that the operation did not do anything.
Makes sense. Where are we initializing these maps today? Do we need a:
func newDestroyedResources() destroyedResources {
return map[string]bool{}
}
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are initialized with destroyedResources{}
. I don't think we need to introduce a function to do that.
pkg/destroy/aws/aws.go
Outdated
for k, v := range r { | ||
if !v { | ||
notDestroyed = append(notDestroyed, k) | ||
delete(r, k) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the delete
? They're still ARNs we have not deleted, so seems like they should stay in r
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The delete
is for the case where there was a failure deleting a resource in one iteration of the loop but the resource is no longer present in subsequent iterations of the loop. For example, let's say the destroyer failed to delete resource A in iteration 1. Later in that same iteration, deleting a different resource had the ripple effect of deleting resource A. In iteration 2, there is no attempt to delete resource A since it is not found when looking for tagged resources. When the context expires, the RunWithContext
function will respond that resource A could not be deleted, even though it no longer exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The delete is for the case where there was a failure deleting a resource in one iteration of the loop but the resource is no longer present in subsequent iterations of the loop.
Hmm... I think we may want to iterate over undeleted resources and confirm that they are gone, instead of hoping that we'll discover them again if they aren't gone. Of course, if we kill and restart the destroyer, we'd loose that state. But within a given run, I'm concerned about conflating "we don't check it again" with "it is deleted". Don't hold on this account though; I'll mull it over some more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This retains the same behavior that we have had before. If a resource is not found on a subsequent iteration, then the destroyer makes no further attempt to delete the resource.
pkg/destroy/aws/aws.go
Outdated
if err != nil { | ||
return err | ||
return deleted.extractRemaining(), err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be a very limited view of remaining ARNs, since it does not include all the other resource types we check in the poll loop below. How thorough do callers expect our remaining-ARN list to be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point. I am not sure what else we can do here because this will happen when the context expires. It may be better to collect all of the ARNs from tags upfront before attempting to delete anything. That still would not cover sub-resources, but at least it will give a more complete picture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That still would not cover sub-resources...
Personally, I'm fine treating sub-resources of owned resources as also owned, so a caller who received a set of top-level resources could recurse through sub-resources and remove the whole set. @abhinavdahiya has pushed back on that interpretation, although I'm not clear on his counter-argument. In the context of this PR, that means that I'm fine with returning a list of whatever the tag-search logic turns up, but that you might want to run it by him before coding anything up.
Returning remaining ARNs seems tricky. If we want to land a good chunk of this change while we sort out the remaining-ARN logic, maybe spin out the |
The PR title and the changes in this PR do not match. Either separate the context cancel and report remaining out into separate PRs or update the PR + desc to include/highlight info on the remaining reporting changes. |
I will extract the changes for returning the resources that could not be deleted into a separate PR. |
The `RunWithContext` method was added to the AWS uninstaller. This provides a way for the user to supply a context that can be used to stop the uninstall. This will be used by Hive to backoff uninstall attempts. https://issues.redhat.com/browse/CO-973
a977a42
to
dea5719
Compare
This PR has been updated to only include the option of providing a context and not the returning of resources that could not be deleted. |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhinavdahiya The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes. |
3 similar comments
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
@staebler: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
The Run function in the Destroyer interface was modified to take a context as a parameter. This provides a way for the user stop the uninstall after a period of time by providing a context with a deadline. A "--timeout" flag was added to the `openshift-install destroy cluster` command. This allows the user to time out the destroy after a specified number of seconds. The baremetal, libvirt, openstack, and ovirt providers do not provide a means by which most requests made to the provider can be stopped prematurely. In these cases, the context is checked prior to making the requests as a best effort. But the uninstall may continue for a period of time after the context is done. The RunWithContext function introduced in openshift#3765 for AWS has been obsoleted since the Run function now accepts a context. This will be used by Hive to backoff uninstall attempts. https://issues.redhat.com/browse/CO-974
The Run function in the Destroyer interface was modified to take a context as a parameter. This provides a way for the user stop the uninstall after a period of time by providing a context with a deadline. A "--timeout" flag was added to the `openshift-install destroy cluster` command. This allows the user to time out the destroy after a specified number of seconds. The baremetal, libvirt, openstack, and ovirt providers do not provide a means by which most requests made to the provider can be stopped prematurely. In these cases, the context is checked prior to making the requests as a best effort. But the uninstall may continue for a period of time after the context is done. The RunWithContext function introduced in openshift#3765 for AWS has been obsoleted since the Run function now accepts a context. This will be used by Hive to backoff uninstall attempts. https://issues.redhat.com/browse/CO-974
The Run function in the Destroyer interface was modified to take a context as a parameter. This provides a way for the user stop the uninstall after a period of time by providing a context with a deadline. A "--timeout" flag was added to the `openshift-install destroy cluster` command. This allows the user to time out the destroy after a specified number of seconds. The baremetal, libvirt, openstack, and ovirt providers do not provide a means by which most requests made to the provider can be stopped prematurely. In these cases, the context is checked prior to making the requests as a best effort. But the uninstall may continue for a period of time after the context is done. The RunWithContext function introduced in openshift#3765 for AWS has been obsoleted since the Run function now accepts a context. This will be used by Hive to backoff uninstall attempts. https://issues.redhat.com/browse/CO-974
The Run function in the Destroyer interface was modified to take a context as a parameter. This provides a way for the user stop the uninstall after a period of time by providing a context with a deadline. A "--timeout" flag was added to the `openshift-install destroy cluster` command. This allows the user to time out the destroy after a specified number of seconds. The baremetal, libvirt, openstack, and ovirt providers do not provide a means by which most requests made to the provider can be stopped prematurely. In these cases, the context is checked prior to making the requests as a best effort. But the uninstall may continue for a period of time after the context is done. The RunWithContext function introduced in openshift#3765 for AWS has been obsoleted since the Run function now accepts a context. This will be used by Hive to backoff uninstall attempts. https://issues.redhat.com/browse/CO-974
The
RunWithContext
method was added to the AWS uninstaller. This provides a way for the user to supply a context that can be used to stop the uninstall.This will be used by Hive to backoff uninstall attempts.
https://issues.redhat.com/browse/CO-973