Unable to replace EKS cluster: Error revoking security group ingress rules #69
Curious: did you use any non-default settings, in particular for the node security group?
Nothing non-default. I reproduced the issue with an extremely minimal config, just:

And then added some subnets to the list, and the same issue occurred.
Any updates for this issue? I just ran into the same situation.
This appears to be related to a long-standing Terraform bug when using descriptions in security group rules: hashicorp/terraform-provider-aws#2879. A fix supposedly landed in terraform-provider-aws 1.19.0, but it doesn't appear to be a remedy in this scenario. Per user comments, removing the descriptions from the security group rules seems to skate around the issue. Thoughts, @jen20 @lukehoban?
Update: this error seems related to various other Terraform user issues that arise when mixing and matching in-line security group rules with standalone security group rules; all of these have been rolled up into hashicorp/terraform-provider-aws#2069 by Terraform staff. There's a fix that's supposed to correct this in this comment. However, the merged fix does not actually fix the problem, and a comment in that issue calls out the identical behavior we're experiencing:
This crossing of ingress rules matches what we're seeing in CloudTrail events, which record attempts to revoke an ingress rule without permission. Per the linked issue and others, all signs point to avoiding in-line rules on SecurityGroups in Terraform, due to issues in how ingress rules are gathered within SecurityGroups; defining standalone SecurityGroupRules instead seems to resolve the problem for most folks. Using SecurityGroupRules seems like the right path, but it does not appear to play well with existing clusters.
@metral Do you have a repro for this behavior so that an issue can be opened on https://github.com/terraform-providers/terraform-provider-aws?
Here's a minimal repro Pulumi program. Steps:
For a little more insight into this: with @metral's repro, here are the changes happening as part of the second update:
Note in particular that the SecurityGroup is only even being updated due to a change in its
Turning on verbose logging from the Terraform Provider (
This corresponds to the code at https://github.com/terraform-providers/terraform-provider-aws/blob/7beffd7ce24efe62af769e581b3c354ddb724a6d/aws/resource_aws_security_group.go#L725. What's really surprising here is that it says we are revoking this:
But the corresponding ingress rule looks like this:
Which is then what gets embedded into the AWS API request that fails:
Note that these have different

And indeed, when I look at the actual AWS resources, there is a resource corresponding to the second, but no resource corresponding to the first. It is unclear where the Terraform provider is even coming up with that

In the source code, this ingress rule is defined as:
Notably, it uses
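The failure mode described above (the provider attempting to revoke an ingress rule that never existed on the AWS side) can be illustrated with a toy model. This is not the actual provider code; the `Rule` type and diff logic below are illustrative inventions:

```python
# Toy model (not real terraform-provider-aws code) of how a phantom rule in
# recorded state leads to a doomed RevokeSecurityGroupIngress call.
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    protocol: str
    from_port: int
    to_port: int
    cidr: str


def rules_to_revoke(old, new):
    """Rules present in recorded 'old' state but absent from the desired
    state get scheduled for revocation against the cloud API."""
    return [r for r in old if r not in new]


# Desired state: a single HTTPS rule.
desired = [Rule("tcp", 443, 443, "0.0.0.0/0")]

# Recorded "old" state containing a phantom rule that was never actually
# created on the AWS side:
recorded = desired + [Rule("tcp", 443, 443, "10.0.0.0/8")]

phantom = rules_to_revoke(recorded, desired)
# The phantom rule is scheduled for revocation; AWS has no record of it,
# so the revoke call fails and takes the whole update down with it,
# which is the error reported in this issue.
print(phantom)
```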
@lukehoban Your analysis is spot-on with what I've been seeing. I've even tried removing the use of

Moving away from in-line rules to standalone security group rules sounds like the best path forward to avoid the errors we're seeing. Though doing so cleanly for existing clusters seems to be a challenge without triggering recreation of the worker nodes via the dependency chain, and it will still hit the same errors. Thoughts?
Okay, I'm beginning to believe the root cause here is actually a core Pulumi engine bug. Moreover, I think it's the same bug I spent most of this week fighting as part of pulumi/pulumi-terraform#362, now being tracked in pulumi/pulumi#2650. The key is to look at the Pulumi state file (

Inputs:
Outputs:
Note that the two arrays are in different orders. But then, as highlighted in pulumi/pulumi#2650, the Pulumi engine does a (highly suspicious) deep merge of inputs and outputs to compute the "old inputs", including an element-wise merge of arrays. This will cause the values passed to the Terraform provider

And indeed, a close inspection of the
Which includes:
Which includes both

Net: I have fairly high confidence the root cause here is pulumi/pulumi#2650.
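The element-wise array merge described above can be sketched in a few lines. This is a hypothetical merge function, not the engine's actual implementation: when inputs and outputs list the same ingress rules in different orders, merging position-by-position crosses fields between unrelated rules and manufactures an "old" rule that exists nowhere.

```python
# Toy illustration of the input/output deep merge described in
# pulumi/pulumi#2650 (hypothetical code, not the Pulumi engine's).
def deep_merge(inputs, outputs):
    """Merge two values, preferring input fields and merging arrays
    element-wise by position."""
    if isinstance(inputs, dict) and isinstance(outputs, dict):
        merged = dict(outputs)
        for k, v in inputs.items():
            merged[k] = deep_merge(v, outputs[k]) if k in outputs else v
        return merged
    if isinstance(inputs, list) and isinstance(outputs, list):
        # Element-wise: position i of inputs merged with position i of
        # outputs, regardless of whether they describe the same rule.
        return [deep_merge(i, o) for i, o in zip(inputs, outputs)]
    return inputs


# The same two ingress rules, recorded in different orders
# (field names here are illustrative):
inputs = [
    {"fromPort": 443, "description": "api server"},
    {"fromPort": 0, "description": "all egress", "cidrBlocks": ["0.0.0.0/0"]},
]
outputs = [
    {"fromPort": 0, "description": "all egress", "cidrBlocks": ["0.0.0.0/0"]},
    {"fromPort": 443, "description": "api server"},
]

old = deep_merge(inputs, outputs)
# The merge crosses the rules: old[0] takes fromPort/description from the
# API-server rule but inherits cidrBlocks from the unrelated egress rule,
# yielding a rule that was never created anywhere.
print(old[0])
```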
As a result, a workaround for this bug is to do the following:
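The concrete workaround steps were lost in this copy of the thread. Purely as an illustration of the general shape of a state-surgery workaround (the key paths and sort key below are assumptions; inspect your own exported checkpoint before editing anything), one could align the out-of-order `ingress` inputs with the outputs so the element-wise merge lines up:

```python
# Hypothetical sketch of checkpoint surgery: sort recorded `ingress`
# inputs and outputs by a stable key so they line up element-wise.
# Key paths ("deployment"/"resources"/"inputs"/"outputs") are assumptions
# about the exported state shape, not a documented contract.
def align_ingress(state: dict) -> dict:
    for res in state.get("deployment", {}).get("resources", []):
        ins = res.get("inputs", {}).get("ingress")
        outs = res.get("outputs", {}).get("ingress")
        if not (isinstance(ins, list) and isinstance(outs, list)):
            continue

        def key(rule):
            # Stable, order-independent identity for a rule.
            return (rule.get("fromPort", 0), str(rule.get("cidrBlocks", "")))

        res["inputs"]["ingress"] = sorted(ins, key=key)
        res["outputs"]["ingress"] = sorted(outs, key=key)
    return state


# Demo: inputs and outputs recorded in different orders.
state = {"deployment": {"resources": [{
    "inputs": {"ingress": [{"fromPort": 443}, {"fromPort": 0}]},
    "outputs": {"ingress": [{"fromPort": 0}, {"fromPort": 443}]},
}]}}
fixed = align_ingress(state)
```

In practice this would sit between a `pulumi stack export` and a `pulumi stack import`; treat any such edit as a last resort and keep a backup of the original checkpoint.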
As for near-term fixes to the specific case here:
I agree that in general this is a better approach, both to avoid issues like the one described above and because it's generally easier to extend (users can add their own rules if they want).
I think you are right that any attempt to change the SecurityGroup this resource currently deploys will fail, so any attempt to "fix" this would itself trigger the issue. We may want to wait and see if we can fix pulumi/pulumi#2650 in the core engine first; after that, I believe we could safely transition to a cleaner approach to managing ingress rules here. For now, the workaround in #69 (comment) should unblock anyone currently blocked by this.
The original issue here will be fixed by using a
The fix for this issue is now available in the current latest version of the CLI, v0.17.8: https://github.com/pulumi/pulumi/releases/tag/v0.17.8
I just encountered
Performing a change that requires an EKS cluster to be replaced fails, and leaves the stack in a state that can be neither destroyed nor updated.
Pulumi version: v0.16.14
I have a stack containing an EKS cluster. Updating the subnetIds causes a "replace" (as expected), however the replace fails.
The stack is then in a bad state, as the EKS cluster was replaced, but other resources were not updated.
Further updates (e.g. trying to revert to the previous subnetIds) fail with the same issue.
If I have deployments on the cluster, I cannot destroy the stack either, as these cannot be deleted (the cluster they belonged to no longer exists).