
Adding feature to ignore list and nuke errors #114

Conversation

@saurabh-hirani (Contributor) commented May 13, 2020

This PR adds the following features:

  1. Support for the --ignore-errors flag on the command line, as per the discussion in Incomplete Execution Due to Missing Exception Handling #112

  2. Updated ami.go to fix a bug which printed the wrong number of deletions when AMI deletion fails.

Demo:

  1. Running cloud-nuke with --ignore-errors all:
```
cloud-nuke --ignore-errors all
.
.
INFO[2020-05-06T19:31:11+05:30] Checking region [1/1]: us-east-1
WARN[2020-05-06T19:31:12+05:30] Ignoring get resources error - asg - AccessDenied: User: arn:aws:sts::12345678:assumed-role/role-1/session-1 is not authorized to perform: autoscaling:DescribeAutoScalingGroups
        status code: 403, request id: xxxx
WARN[2020-05-06T19:31:12+05:30] Ignoring get resources error - lc - AccessDenied: User: arn:aws:sts::12345678:assumed-role/role-1/session-1 is not authorized to perform: autoscaling:DescribeLaunchConfigurations
        status code: 403, request id: xxxx
.
.
```

Complete command output - https://gist.github.com/saurabh-hirani/eaa9455488af5299ee5d07746a965f88 - shows that the run does a best-effort deletion.

A detailed description of the behavior is in the README.md change.

@brikis98 (Member)

Thanks for the PR! I'm super booked this week, but will try to take a look in the next few days.

@brikis98 (Member) left a comment


Thanks for the PR!

Is there some way we can add automated tests to check this behavior?

README.md (Outdated, resolved)
README.md Outdated
cloud-nuke aws --ignore-errors ec2
```

This will ignore any errors encountered while listing/nuking EC2 resources - please note that if there are resources that
@brikis98 (Member):

Please specify what other values you can pass here beyond ec2 and how to handle multiple values.

@brikis98 (Member):

Also, does this ignore an EC2 error and continue nuking EC2 instances... Or does it stop nuking EC2 instances at the first EC2 error, but then keeps nuking other types of resources? We should make the behavior clear in the docs to avoid surprises.

@saurabh-hirani (Contributor, Author):

Thanks for the feedback - clarified with an example. It keeps nuking other resource types - updated the README.

aws/aws.go Outdated
Comment on lines 539 to 543
```go
if collections.ListContainsElement(resourceTypes, "all") ||
	collections.ListContainsElement(resourceTypes, resourceType) {
	return true
}
return false
```
@brikis98 (Member):

Suggested change:

```diff
-if collections.ListContainsElement(resourceTypes, "all") ||
-	collections.ListContainsElement(resourceTypes, resourceType) {
-	return true
-}
-return false
+return collections.ListContainsElement(resourceTypes, "all") ||
+	collections.ListContainsElement(resourceTypes, resourceType)
```

@saurabh-hirani (Contributor, Author):

Fixed.

commands/cli.go Outdated
```go
	}
}
// Check command line resource type values
invalidResourceTypes := map[string][]string{}
```
@brikis98 (Member):

Why store these in a map?

@saurabh-hirani (Contributor, Author):

I had kept them in a map in case we wanted to aggregate errors across the list, exclude, and ignore-errors types when the user specifies invalid values. But there isn't a need for that at the moment - fixed it.

@saurabh-hirani (Contributor, Author) commented May 26, 2020

Thanks for the PR!

Is there some way we can add automated tests to check this behavior?

I doubt there is a straightforward way to do so. We are not checking for per-resource-type deletion errors - we are checking the access-related errors those calls return and deciding whether to move ahead. The way I tested it was to create a role that has permissions to create other roles, create a restricted role with it, apply that role's credentials to the CLI, run cloud-nuke, and try to nuke a resource not covered by the restricted role.

Doing that in a test case would essentially amount to retesting the above process. The errors would be introduced not in the creation of the resource or its properties, but in the role that tries to delete the resource. And to test whether cloud-nuke goes ahead, we would essentially have to test multiple deletions while switching the creating and deleting roles. That seems like a bigger overhead than the value it provides. Does it make sense to do that?

Honestly, I was hoping this would be a feature that helps first-time users get some best-effort deletion done, rather than turning them away when they encounter errors. I wouldn't use this flag if I knew I had the right roles in place to run cloud-nuke.

README.md Outdated
### Ignoring list/nuke errors for certain/all resources

Ideally you should have AWS IAM permissions to list and nuke all target resources. If you do not have permissions to
nuke resource X - you should exclude it using the `--exclude-resource-type` flag. However, if you don't know which resource
@brikis98 (Member):

Suggested change:

```diff
-nuke resource X - you should exclude it using the `--exclude-resource-type` flag. However, if you don't know which resource
+nuke resource X, you should exclude it using the `--exclude-resource-type` flag. However, if you don't know which resource
```

README.md Outdated
cloud-nuke aws --ignore-errors ec2 --ignore-errors s3
```

This will ignore any errorrs encountered while listing/nuking EC2 and S3 resources. For example, if there are 3
@brikis98 (Member):

Suggested change:

```diff
-This will ignore any errorrs encountered while listing/nuking EC2 and S3 resources. For example, if there are 3
+This will ignore any errors encountered while listing/nuking EC2 and S3 resources. For example, if there are 3
```

README.md Outdated
```

This will ignore any errorrs encountered while listing/nuking EC2 and S3 resources. For example, if there are 3
EC2 instances and 2 S3 buckets to nuke and few/all of the EC2 instances and few/all of the S3 bucket deletion fails - cloud-nuke
@brikis98 (Member):

Suggested change:

```diff
-EC2 instances and 2 S3 buckets to nuke and few/all of the EC2 instances and few/all of the S3 bucket deletion fails - cloud-nuke
+EC2 instances and 2 S3 buckets to nuke and few/all of the EC2 instances and few/all of the S3 bucket deletion fails, cloud-nuke
```

README.md Outdated
Comment on lines 142 to 145
Please note that cloud-nuke deletes resource types in a certain order to avoid dependency errors. If there are resources
that should be deleted before EC2 or S3 - and there are errors encountered while listing/nuking those resources e.g.
cloud-nuke deletes autoscaling groups before EC2 and if your role does not have permissions to delete autoscaling groups
and you have not added `asg` to the `--ignore-errors` list, then cloud-nuke will error out and fail before reaching to EC2.
@brikis98 (Member):

Suggested change:

```diff
-Please note that cloud-nuke deletes resource types in a certain order to avoid dependency errors. If there are resources
-that should be deleted before EC2 or S3 - and there are errors encountered while listing/nuking those resources e.g.
-cloud-nuke deletes autoscaling groups before EC2 and if your role does not have permissions to delete autoscaling groups
-and you have not added `asg` to the `--ignore-errors` list, then cloud-nuke will error out and fail before reaching to EC2.
+Please note that cloud-nuke deletes resource types in a certain order to avoid dependency errors. If there are resources
+that should be deleted before EC2 or S3, and there are errors encountered while listing/nuking those resources, then cloud-nuke will error out and fail before reaching EC2 or S3. For example,
+cloud-nuke deletes autoscaling groups before EC2 instances, and if your role does not have permissions to delete autoscaling groups
+and you added `ec2` to the `--ignore-errors` list but not `asg`, then cloud-nuke will still end up exiting with an error around deleting those autoscaling groups.
```

errors by specifying the `--ignore-errors` flag:

```shell
cloud-nuke aws --ignore-errors ec2 --ignore-errors s3
```
@brikis98 (Member):

We still need to provide users with a pointer to what resource types can be specified via --ignore-errors. I.e., it's more than just ec2 and s3, right?

@brikis98 (Member):

Bump

@brikis98 (Member)

Doing that in a test case would essentially be equal to retesting the above process. The way to introduce errors will lie not in the creation of the resource or its properties but in the role that tries to delete the resource. And to test if cloud-nuke goes ahead - we would essentially have to test multiple deletions while switching the creating and deleting roles. Seems like a bigger overhead than the value it provides. Does it make sense to do that?

I actually think it does make sense to test this! In particular, I'd create automated tests for two scenarios:

  1. Create an IAM role that only has read permissions for EC2 and S3. Create an S3 bucket and an EC2 instance, assume the role, run cloud-nuke telling it to ignore errors, and make sure the run completes successfully, and that the S3 bucket and EC2 instance are still around (which the test can then clean up at the end).
  2. Same as above, except this time, the IAM role also has write permissions for S3, so at the end of the test, the S3 bucket should be deleted, but the EC2 instance should still be around.

@saurabh-hirani (Contributor, Author)

  1. Create an IAM role that only has read permissions for EC2 and S3. Create an S3 bucket and an EC2 instance, assume the role, run cloud-nuke telling it to ignore errors, and make sure the run completes successfully, and that the S3 bucket and EC2 instance are still around (which the test can then clean up at the end).
  2. Same as above, except this time, the IAM role also has write permissions for S3, so at the end of the test, the S3 bucket should be deleted, but the EC2 instance should still be around.

Okay - will give it a shot and update this thread with findings.

@saurabh-hirani (Contributor, Author) commented Jun 7, 2020

@brikis98 - I was able to try out the individual functions for creating/assuming a role and creating/deleting its policies locally. I had the following queries before proceeding:

  1. Having a file called cli_nukeallresources_test.go (content fleshed out in the next query)
    besides cli_test.go. But as we have to create test EC2 instances and S3 buckets - can we reuse the corresponding createTestEC2Instance in aws/ec2_test.go? That would mean uppercasing that func to export it. getRandomRegion cannot be renamed as it is heavily used elsewhere. Or does it make sense to repeat the test EC2 creation code in the test case to avoid any dependencies? Update: read golang/go#10184 (cmd/go: types and functions defined in _test.go files are not visible/exported) and saw that it is not possible to export funcs from _test.go files - but it would be better to get your thoughts on this.

The same query applies to creating test S3 buckets, but taking a call for S3 is more complicated as its PR is still open - #110. @yorinasub17 - your feedback would also be useful here as this is tied to S3.

  2. Having the following test functions in cli_nukeallresources_test.go - does this coverage
    make sense?
  • TestNukeAllResources_AllPerms()

    • Creates test EC2, S3
    • No assume role
    • Call GetAllResources for resource types EC2, S3 - returns list with test EC2 and S3
    • Call NukeAllResources with test EC2 and S3 - deletes them
    • Test EC2, S3 do not exist
  • TestNukeAllResources_NoPerms()

    • Creates test EC2, S3
    • Creates test role with no read/write for EC2, S3
    • Assume role
    • Call GetAllResources for resource types EC2, S3 - returns errors
    • Test EC2, S3 still exist
  • TestNukeAllResources_NoPerms_IgnoreErrors()

    • Same as above but the GetAllResources call has the ignore-errors flag set - so it doesn't fail and returns an empty list.
    • Test EC2, S3 still exist
  • TestNukeAllResources_EC2S3_RORole()

    • Creates test EC2, S3
    • Creates test role with read for EC2, S3
    • Assume role
    • Call GetAllResources for resource types EC2, S3 - no errors - returns test EC2 and S3
    • Call NukeAllResources with test EC2 and S3 - returns errors
    • Test EC2, S3 still exist
  • TestNukeAllResources_EC2S3_RORole_IgnoreErrors()

    • Same as above but NukeAllResources has the ignore-errors flag set - so it doesn't fail and returns nil.
    • Test EC2, S3 still exist
  • TestNukeAllResources_EC2_RORole_S3_RWRole()

    • Creates test EC2, S3
    • Creates test role with read for EC2, write for S3
    • Assume role
    • Call GetAllResources for resource types EC2, S3 - no errors - returns test EC2 and S3
    • Call NukeAllResources with test EC2 and S3 - fails at EC2 - does not go up to S3
    • Test EC2, S3 still exist
  • TestNukeAllResources_EC2_RORole_S3_RWRole_IgnoreErrors()

    • Same as above but NukeAllResources has the ignore-errors flag set - so it doesn't fail
      for EC2, continues S3 deletion, and returns nil.
    • Test S3 deleted. Test EC2 exists.

@brikis98 (Member)

Having a file called cli_nukeallresources_test.go (content fleshed out in the next query)
besides cli_test.go. But as we have to create test EC2 and S3 instances - can we reuse the
corresponding createTestEC2Instance in aws/ec2_test.go - that would mean uppercasing that func to export it. getRandomRegion cannot be renamed as it is heavily used outside also. Or does it make sense to repeat the test EC2 creation code in the test case to avoid any dependencies?

Feel free to move these functions to some package (e.g., test_util) so they can be imported wherever needed.

Having the following test functions in cli_nukeallresources_test.go - does this coverage
make sense?

That test coverage sounds great 👍

One NIT: the Go naming convention is CamelCase, not snake_case or Upper_Snake_Case, so the function names should be e.g. TestNukeAllResourcesEC2S3ReadOnlyRoleIgnoreErrors... But otherwise, LGTM!

* Adding iam_utils.go for common IAM role, policy CRUD operations.
* Added assume role tests for ignore-errors.
* Updated GetAllResources and NukeAllResources to take optional per region session param to assume role.
* Moving out EC2 and S3 create/delete functions to respective _utils.go files.
* Adding waitForTermination flag in ec2.go to avoid longer test times and random shutdown/deletion timeouts.
* Updating EBS, ASG, AMI, LaunchConfig tests to use ec2_utils.go
* Updating S3 tests to use s3_utils.go
@saurabh-hirani (Contributor, Author) commented Jun 19, 2020

@brikis98 Updated as per feedback. As the config file PR also touched some common files (cli.go, aws.go, s3_test.go), I had to merge, and the diffs didn't seem reviewable, so I squashed the commits for review as one unit. Please review and let me know if this looks good.

Summary:

  • Added the following new files in aws/:

    • ec2_utils.go - moving out common test CRUD functions for ec2 - led to updating test files - ec2_test.go and dependent test files - ebs_test.go, asg_test.go, ami_test.go, launch_config_test.go .
    • s3_utils.go - same as above but for S3 - led to updating test files - s3_test.go
    • iam_utils.go - new file for CRUD for IAM roles/policies - doubles up as a base for future addition of IAM role/policy nuking
  • Added the following new files in commands/:

  • Updated aws.GetAllResources and aws.NukeAllResources to accept a session param, because in order to assume a role you have to pass in a session while deleting resources - which wasn't supported before. The argument list of these two functions was getting long, so I used a struct to simplify it and to support optional args.

  • Updated ec2.go to support a flag which decides whether to wait for instance termination after triggering deletion - https://github.com/gruntwork-io/cloud-nuke/pull/114/files#diff-6c5db3fd88049ee01a36ba713dfcdbdeR76. Initially the EC2 nuke did not wait for instance termination by default - trigger and return - because when repeatedly testing deletion with around 8-10 EC2 instances (encountered when the deferred nukes run for the assume-role tests) there were random timeouts and sometimes long delays in EC2 deletion. Dependent tests like ebs_test.go wait for EC2 deletion, e.g. https://github.com/gruntwork-io/cloud-nuke/pull/114/files#diff-c57ef596e24478400f32b6312d13256eR183

  • Update - maybe I was hitting some API limits on my free tier and my requests were getting throttled due to multiple create/delete ops - I retried after some time and EC2 deletions work fine, so I kept the default EC2 nuke behaviour as wait-for-deletion.

  • There is no support for the user to specify whether to return after triggering deletion or to wait until deletion completes, for resources that support it (--sync / --async maybe?). This can be taken up as a separate feature request.

  • When running s3_test.go, the following tests failed for me because, as per Add config option for complex matching against s3 buckets #113 (comment), some tests run in Gruntwork-specific accounts (e.g. phxdevops) and I am running them in my own account. I wanted to check if the corresponding tests should be rewritten to create test buckets so that they can also run on non-Gruntwork AWS accounts.

```
--- FAIL: TestFilterS3Bucket_Config (30.65s)
    --- FAIL: TestFilterS3Bucket_Config/config_tests (0.00s)
        --- FAIL: TestFilterS3Bucket_Config/config_tests/Include (0.20s)
            s3_test.go:441:
                        Error Trace:    s3_test.go:441
                        Error:          Not equal:
                                        expected: 4
                                        actual  : 0
                        Test:           TestFilterS3Bucket_Config/config_tests/Include
        --- FAIL: TestFilterS3Bucket_Config/config_tests/IncludeAndExclude (0.22s)
            s3_test.go:441:
                        Error Trace:    s3_test.go:441
                        Error:          Not equal:
                                        expected: 3
                                        actual  : 0
                        Test:           TestFilterS3Bucket_Config/config_tests/IncludeAndExclude
        --- FAIL: TestFilterS3Bucket_Config/config_tests/Exclude (0.27s)
            s3_test.go:441:
                        Error Trace:    s3_test.go:441
                        Error:          Not equal:
                                        expected: 6
                                        actual  : 0
                        Test:           TestFilterS3Bucket_Config/config_tests/Exclude
```
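The struct-based argument approach described in the summary above can be sketched as follows (getAllResourcesArgs and describeRun are illustrative names, not the PR's actual types): bundling a growing parameter list into one struct lets optional fields, such as a pre-assumed-role session, be added later without changing every call site.

```go
package main

import "fmt"

// getAllResourcesArgs is a hypothetical bundled-argument struct: optional
// fields default to their zero value, so call sites only set what they need.
type getAllResourcesArgs struct {
	Regions       []string
	ResourceTypes []string
	IgnoreErrors  []string
	SessionName   string // optional: non-empty means "use this assumed-role session"
}

// describeRun shows how a function reads the bundled args and falls back to
// a default when the optional session field is left unset.
func describeRun(args getAllResourcesArgs) string {
	session := "default session"
	if args.SessionName != "" {
		session = "assumed-role session " + args.SessionName
	}
	return fmt.Sprintf("%d region(s), %d type(s), %s",
		len(args.Regions), len(args.ResourceTypes), session)
}

func main() {
	// Call site that omits the optional session field.
	fmt.Println(describeRun(getAllResourcesArgs{
		Regions:       []string{"us-east-1"},
		ResourceTypes: []string{"ec2", "s3"},
	}))
	// Call site that opts into an assumed-role session.
	fmt.Println(describeRun(getAllResourcesArgs{
		Regions:       []string{"us-east-1"},
		ResourceTypes: []string{"ec2"},
		SessionName:   "session-1",
	}))
}
```

Adding a new optional field to such a struct is backward compatible, which is the stated motivation for the change.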

@saurabh-hirani (Contributor, Author) commented Jun 19, 2020

@brikis98 @yorinasub17 This change has more testing code than actual functionality :) - maybe because the feature required a very methodical series of steps (create role, assume role, create resources, try deletion, delete resources, delete role) to verify that errors are ignored.

@saurabh-hirani (Contributor, Author)

@brikis98 - can you please review this and check if this looks good? Thanks.

@brikis98 (Member) commented Jul 4, 2020

Sorry for the delay. It's a huge PR, so I keep snoozing it every time I see it 😁

I'll try to get to it in the next week or so!

@brikis98 (Member) left a comment


OK, thank you for your patience while I finally found some time to sit down and really go through this! The new test cases are excellent. This should go a long way in ensuring we handle errors correctly 👍

errors by specifying the `--ignore-errors` flag:

```shell
cloud-nuke aws --ignore-errors ec2 --ignore-errors s3
```
@brikis98 (Member):

Bump

Comment on lines +37 to +38
```go
if err != nil {
	assert.Failf(t, "Could not create test EC2 instance", errors.WithStackTrace(err).Error())
```
@brikis98 (Member):

NIT: You can use require.NoError(t, err, "<message>") to reduce this sort of check to one-line, both here, and elsewhere in this PR.

@saurabh-hirani (Contributor, Author):

I have done that in the new files I added, but didn't in the old ones, as they were all using assert.Fail and I wanted to keep them consistent. Ideally they should use require wherever we want the test to stop on failure, but I didn't want to do that level of yak shaving. Is that ok, or should I use require for all the new code I added and keep the older asserts as-is?

Comment on lines +110 to +113
```go
instances, err := findEC2InstancesByNameTag(session, uniqueTestID)
if err != nil {
	assert.Fail(t, errors.WithStackTrace(err).Error())
}
```
@brikis98 (Member):

Why add this check before the defer nukeAllEc2Instances call?

This question applies here and several places below in the PR where similar changes were added.

@saurabh-hirani (Contributor, Author):

I'm used to checking err immediately after the call, but I see your point. Would changing


```go
// clean up after this test
defer nukeAllAMIs(session, []*string{image.ImageId})
instances, err := findEC2InstancesByNameTag(session, uniqueTestID)
if err != nil {
	assert.Fail(t, errors.WithStackTrace(err).Error())
}
defer nukeAllEc2Instances(session, instances, true)
```

to


```go
// clean up after this test
defer nukeAllAMIs(session, []*string{image.ImageId})
instances, err := findEC2InstancesByNameTag(session, uniqueTestID)
defer nukeAllEc2Instances(session, instances, true)
if err != nil {
	assert.Fail(t, errors.WithStackTrace(err).Error())
}
```

make sense?

aws/aws.go (Outdated, resolved)
aws/aws.go (resolved)
```go
func TestNukeAllResources(t *testing.T) {
	t.Parallel()
	// Create a top level AWS session object which will be used to create/destroy roles
	// and resources. Specifically use us-east-1 as we are running >= 8 tests and vCpu limits
```
@brikis98 (Member):

Are the default limits for us-east-1 any higher? Or is it just that you requested limited increases already for that region?

@saurabh-hirani (Contributor, Author):

These are the default limits. I did not request an increase.

commands/cli_nukeallresources_test.go (two resolved comment threads)
```go
	assumeRolePolicyDocument: assumeRolePolicyDocument,
},
createIAMRolePolicyArgs{
	roleName: "cloud-nuke-test-1-noperms",
```
@brikis98 (Member):

Why does the policy need a role name?

@saurabh-hirani (Contributor, Author):

Because it is an inline policy attached to a role - https://docs.aws.amazon.com/IAM/latest/APIReference/API_PutRolePolicy.html
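For context, iam:PutRolePolicy takes RoleName, PolicyName, and PolicyDocument, so an inline policy only exists as part of the named role. A minimal read-only policy document of the kind such a test role might embed (illustrative values, not the PR's actual policy) looks like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:Describe*", "s3:List*", "s3:Get*"],
      "Resource": "*"
    }
  ]
}
```

Deleting the role requires deleting its inline policies first, which is why the test helpers pair role and policy CRUD.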

commands/cli_nukeallresources_test.go (Outdated, resolved)
saurabh-hirani added a commit to saurabh-hirani/cloud-nuke that referenced this pull request Jul 26, 2020
@saurabh-hirani (Contributor, Author)

Thanks for the detailed feedback @brikis98 - I appreciate it. I have addressed most of the comments and have added queries where I needed some guidance. Smoke tests on cli_nukeallresources_test.go look good - I will test the other _test.go files tomorrow and update this thread.

@saurabh-hirani (Contributor, Author)

@brikis98 as per the above, I ran tests for the following test files as well - they look good:

ami_test.go
asg_test.go
ebs_test.go
ec2_test.go
s3_test.go

@saurabh-hirani (Contributor, Author) commented Jan 18, 2021

We can close this PR, as its review has been pending for a few months and it has deviated from the main codebase.

2 participants