Adding feature to ignore list and nuke errors #114
Conversation
Thanks for the PR! I'm super booked this week, but will try to take a look in the next few days.
Thanks for the PR!
Is there some way we can add automated tests to check this behavior?
README.md
Outdated
```shell
cloud-nuke aws --ignore-errors ec2
```

This will ignore any errorrs encountered while listing/nuking EC2 resources - please note that if there are resources that
Please specify what other values you can pass here beyond `ec2`, and how to handle multiple values.
Also, does this ignore an EC2 error and continue nuking EC2 instances... Or does it stop nuking EC2 instances at the first EC2 error, but then keeps nuking other types of resources? We should make the behavior clear in the docs to avoid surprises.
Thanks for the feedback - clarified with example. It keeps on nuking other resource types - updated the README.
aws/aws.go
Outdated
```go
if collections.ListContainsElement(resourceTypes, "all") ||
	collections.ListContainsElement(resourceTypes, resourceType) {
	return true
}
return false
```
Suggested change:

```go
return collections.ListContainsElement(resourceTypes, "all") ||
	collections.ListContainsElement(resourceTypes, resourceType)
```
Fixed.
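For readers outside the diff, the suggestion collapses an `if { return true } / return false` into returning the boolean expression directly. A self-contained sketch, using a plain loop in place of the gruntwork `collections.ListContainsElement` helper (names are hypothetical):

```go
package main

import "fmt"

// listContains is a stand-in for collections.ListContainsElement.
func listContains(list []string, item string) bool {
	for _, v := range list {
		if v == item {
			return true
		}
	}
	return false
}

// shouldInclude returns the boolean expression directly instead of the
// if { return true } / return false form flagged in the review.
func shouldInclude(resourceTypes []string, resourceType string) bool {
	return listContains(resourceTypes, "all") || listContains(resourceTypes, resourceType)
}

func main() {
	fmt.Println(shouldInclude([]string{"all"}, "ec2")) // true
	fmt.Println(shouldInclude([]string{"s3"}, "ec2"))  // false
}
```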
commands/cli.go
Outdated
```go
	}
}
// Check command line resource type values
invalidResourceTypes := map[string][]string{}
```
Why store these in a map?
Had kept them in a map in case we want to aggregate errors for the list, exclude, and ignore-error types if the user specifies invalid values. But there isn't a need to do so at the moment - fixed it.
I doubt there is a straightforward way to do so. We are not checking for per-resource-type deletion errors - we are checking the access-related errors they return and deciding whether we want to move ahead or not. The way I tested it was to create a role that has permissions to create other roles, create a restricted role with it, apply that role's credentials to the CLI, run cloud-nuke, and try to nuke a resource not covered by the restricted role. Doing that in a test case would essentially be equal to retesting the above process. The way to introduce errors lies not in the creation of the resource or its properties, but in the role that tries to delete the resource. And to test whether cloud-nuke goes ahead, we would essentially have to test multiple deletions while switching the creating and deleting roles. That seems like a bigger overhead than the value it provides. Does it make sense to do that? Honestly speaking, I was hoping for this to be a feature which helps first-time users get some best-effort deletion done and not make them turn away if they encounter errors. I wouldn't use this flag if I knew I had the right roles in place to run cloud-nuke.
README.md
Outdated
### Ignoring list/nuke errors for certain/all resources

Ideally you should have AWS IAM permissions to list and nuke all target resources. If you do not have permissions to
nuke resource X - you should exclude it using the `--exclude-resource-type` flag. However, if you don't know which resource
Suggested change:

> nuke resource X, you should exclude it using the `--exclude-resource-type` flag. However, if you don't know which resource
README.md
Outdated
```shell
cloud-nuke aws --ignore-errors ec2 --ignore-errors s3
```

This will ignore any errorrs encountered while listing/nuking EC2 and S3 resources. For example, if there are 3
Suggested change:

> This will ignore any errors encountered while listing/nuking EC2 and S3 resources. For example, if there are 3
README.md
Outdated
This will ignore any errorrs encountered while listing/nuking EC2 and S3 resources. For example, if there are 3
EC2 instances and 2 S3 buckets to nuke and few/all of the EC2 instances and few/all of the S3 bucket deletion fails - cloud-nuke
Suggested change:

> EC2 instances and 2 S3 buckets to nuke and few/all of the EC2 instances and few/all of the S3 bucket deletion fails, cloud-nuke
README.md
Outdated
Please note that cloud-nuke deletes resource types in a certain order to avoid dependency errors. If there are resources
that should be deleted before EC2 or S3 - and there are errors encountered while listing/nuking those resources e.g.
cloud-nuke deletes autoscaling groups before EC2 and if your role does not have permissions to delete autoscaling groups
and you have not added `asg` to the `--ignore-errors` list, then cloud-nuke will error out and fail before reaching to EC2.
Suggested change:

> Please note that cloud-nuke deletes resource types in a certain order to avoid dependency errors. If there are resources
> that should be deleted before EC2 or S3, and there are errors encountered while listing/nuking those resources, then cloud-nuke will error out and fail before reaching EC2 or S3. For example,
> cloud-nuke deletes autoscaling groups before EC2 instances, and if your role does not have permissions to delete autoscaling groups
> and you added `ec2` to the `--ignore-errors` list but not `asg`, then cloud-nuke will still exit with an error around deleting those autoscaling groups.
errors by specifying the `--ignore-errors` flag:

```shell
cloud-nuke aws --ignore-errors ec2 --ignore-errors s3
```
We still need to provide users with a pointer to what resource types can be specified via `--ignore-errors`. i.e., it's more than just `ec2` and `s3`, right?
Bump
I actually think it does make sense to test this! In particular, I'd create automated tests for two scenarios:

Okay - will give it a shot and update this thread with findings.
@brikis98 - was able to try out the individual functions for creating/assuming role and
Same query for creating test S3 buckets but taking a call for S3 is more complicated as its PR is still open - #110 - @yorinasub17 - your feedback would also be useful for this query as it is tied to S3 also.
Feel free to move these functions to some package (e.g.,
That test coverage sounds great 👍 One NIT: Go naming convention is CamelCase and not snake_case or Upper_Snake_case. So the function names should be
Force-pushed from 6112b76 to 5d6ffd3.
* Adding iam_utils.go for common IAM role, policy CRUD operations.
* Added assume role tests for ignore-errors.
* Updated GetAllResources and NukeAllResources to take optional per region session param to assume role.
* Moving out EC2 and S3 create/delete functions to respective _utils.go files.
* Adding waitForTermination flag in ec2.go to avoid longer test times and random shutdown/deletion timeouts.
* Updating EBS, ASG, AMI, LaunchConfig tests to use ec2_utils.go
* Updating S3 tests to use s3_utils.go
Force-pushed from 5630e7e to d3b0f03.
@brikis98 Updated as per feedback. As the config file PR also touched some common files - Summary:
@brikis98 @yorinasub17 This change has more testing code than the actual functionality :) - maybe because the feature required a very methodical series of steps (create role, assume role, create resources, try deletion, delete resources, delete role) to verify ignoring errors.
@brikis98 - can you please review this and check if this looks good? Thanks.
Sorry for the delay. It's a huge PR, so I keep snoozing it every time I see it 😁 I'll try to get to it in the next week or so!
OK, thank you for your patience while I finally found some time to sit down and really go through this! The new test cases are excellent. This should go a long way in ensuring we handle errors correctly 👍
errors by specifying the `--ignore-errors` flag:

```shell
cloud-nuke aws --ignore-errors ec2 --ignore-errors s3
```
Bump
```go
if err != nil {
	assert.Failf(t, "Could not create test EC2 instance", errors.WithStackTrace(err).Error())
```
NIT: You can use `require.NoError(t, err, "<message>")` to reduce this sort of check to one line, both here and elsewhere in this PR.
Have done that in the new files I added, but didn't do so in the old ones as they were all using `assert.Fail` and I wanted to keep in sync. Ideally they should be using `require` where we want the test to not proceed further, but I didn't want to do that level of yak shaving. Is that ok, or should I use `require` for all the new code I added and keep the older asserts as is?
```go
instances, err := findEC2InstancesByNameTag(session, uniqueTestID)
if err != nil {
	assert.Fail(t, errors.WithStackTrace(err).Error())
}
```
Why add this check before the `defer nukeAllEc2Instances` call? This question applies here and several places below in the PR where similar changes were added.
Used to checking `err` immediately after the call, but I see your point. Would changing

```go
// clean up after this test
defer nukeAllAMIs(session, []*string{image.ImageId})

instances, err := findEC2InstancesByNameTag(session, uniqueTestID)
if err != nil {
	assert.Fail(t, errors.WithStackTrace(err).Error())
}
defer nukeAllEc2Instances(session, instances, true)
```

to

```go
// clean up after this test
defer nukeAllAMIs(session, []*string{image.ImageId})

instances, err := findEC2InstancesByNameTag(session, uniqueTestID)
defer nukeAllEc2Instances(session, instances, true)
if err != nil {
	assert.Fail(t, errors.WithStackTrace(err).Error())
}
```

make sense?
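The reordering discussed here works because `defer` evaluates the call's arguments at the point of registration and the deferred calls still run (in LIFO order) even when the function bails out early on an error. A minimal sketch - the names mirror the test snippet but are hypothetical stand-ins, and it assumes the real nuke helper tolerates an empty instance list:

```go
package main

import "fmt"

// run mirrors the reordered test code: the EC2 cleanup is deferred right
// after the lookup, before the error check. It returns the order in which
// the deferred cleanups ran.
func run(lookupFails bool) (order []string) {
	nuke := func(name string) { order = append(order, name) }

	// Mirrors `defer nukeAllAMIs(...)`.
	defer nuke("amis")

	// Mirrors `instances, err := findEC2InstancesByNameTag(...)`.
	instances := []string{"i-123"}
	var err error
	if lookupFails {
		instances, err = nil, fmt.Errorf("lookup failed")
	}

	// The argument is evaluated here, so on failure the cleanup sees an
	// empty list; the real nukeAllEc2Instances must tolerate that.
	defer nuke(fmt.Sprintf("ec2 (%d instances)", len(instances)))

	if err != nil {
		return // mirrors assert.Fail; both deferred cleanups still run (LIFO)
	}
	return
}

func main() {
	fmt.Println(run(true)) // [ec2 (0 instances) amis]
}
```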
```go
func TestNukeAllResources(t *testing.T) {
	t.Parallel()
	// Create a top level AWS session object which will be used to create/destroy roles
	// and resources. Specifically use us-east-1 as we are running >= 8 tests and vCpu limits
```
Are the default limits for `us-east-1` any higher? Or is it just that you requested limit increases already for that region?
These are the default limits. I did not request an increase.
```go
	assumeRolePolicyDocument: assumeRolePolicyDocument,
},
createIAMRolePolicyArgs{
	roleName: "cloud-nuke-test-1-noperms",
```
Why does the policy need a role name?
Because it is an inline policy attached to a role - https://docs.aws.amazon.com/IAM/latest/APIReference/API_PutRolePolicy.html
Force-pushed from cc48fc9 to 062b7db.
Force-pushed from 5593b42 to be5130a.
Thanks for the detailed feedback @brikis98 - I appreciate that. I have closed most of them and have put in queries where I needed some guidance. Smoke tests on
@brikis98 as per above, ran tests for the following test files also - looks good
We can close this PR as its review has been pending for a few months and it has deviated from the main codebase.
This PR adds the following features:

* Support for the `--ignore-errors` flag on the command line, as per the discussion in Incomplete Execution Due to Missing Exception Handling #112
* Updated ami.go to fix a bug which printed the wrong number of deletions if AMI deletion fails.

Demo:

`--ignore-errors all`: Complete command output - https://gist.github.com/saurabh-hirani/eaa9455488af5299ee5d07746a965f88 - does a best-effort deletion.

Detailed description as per the change in README.md.