
Unable to replace EKS cluster: Error revoking security group ingress rules #69

Closed
colinbankier opened this issue Feb 25, 2019 · 16 comments

@colinbankier

colinbankier commented Feb 25, 2019

Performing a change that requires an EKS cluster to be replaced fails, and it leaves the stack in a state that can be neither destroyed nor updated.

Pulumi version: v0.16.14

I have a stack containing an EKS cluster. Updating the subnetIds causes a "replace" (as expected), however the replace fails.

Diagnostics:
  aws:ec2:SecurityGroup (cluster-nodeSecurityGroup):
    error: Plan apply failed: 1 error occurred:
    
    * updating urn:pulumi:test::eks-test::eks:index:Cluster$aws:ec2/securityGroup:SecurityGroup::cluster-nodeSecurityGroup: Error revoking security group ingress rules: InvalidPermission.NotFound: The specified rule does not exist in this security group.
    	status code: 400, request id: 577f76ae-06d1-4030-8f84-a8844b89a6d3

The stack is then in a bad state, as the EKS cluster was replaced, but other resources were not updated.
Further updates (e.g. trying to revert to the previous subnetIds) fail with the same issue.
If I have deployments on the cluster, I cannot destroy the stack either, as these cannot be deleted (the cluster they belonged to no longer exists).

@lukehoban
Member

Curious - did you use any non-default settings, in particular for the node security group?

@lukehoban lukehoban transferred this issue from pulumi/pulumi Feb 25, 2019
@lukehoban lukehoban added this to the 0.21 milestone Feb 25, 2019
@colinbankier
Author

colinbankier commented Feb 25, 2019

Nothing non-default - I reproduced the issue with an extremely minimal config, just

const cluster = new eks.Cluster("cluster", {
    vpcId: "vpc-xxxx",
    subnetIds: ["subnet-xxxx", "subnet-xxxx"],
    instanceType: "r5.large",
    desiredCapacity: 2,
    minSize: 2,
    maxSize: 2,
    storageClasses: "gp2",
    deployDashboard: true,
});

And then I added some subnets to the list and the same issue occurred.
(The initial subnets were public ones; I added extra private ones, if that makes any difference.)

@metral metral modified the milestones: 0.21, 0.22 Mar 6, 2019
@ggilmore

Any updates for this issue? I just ran into the same situation.

@metral
Contributor

metral commented Apr 8, 2019

This appears to be related to a long-standing TF bug triggered by using descriptions in security group rules: hashicorp/terraform-provider-aws#2879. A fix supposedly landed in terraform-provider-aws 1.19.0, but it doesn't appear to be a remedy in this scenario.

Per user comments, removing the descriptions from the security group rules seems to skate around this issue. Thoughts, @jen20 @lukehoban?

@metral
Contributor

metral commented Apr 18, 2019

Update: this error seems related to various other TF user issues that arise when mixing & matching in-line secgroup rules with standalone secgroup rules; TF staff have bubbled all of these up into hashicorp/terraform-provider-aws#2069. There's a fix that's supposed to correct this in this comment.

However, the merged fix does not actually fix the problem, and a comment in that issue calls out the identical behavior we're experiencing:

In the result of the second apply action above, note the ingress rule with number 2396394866. Terraform, somehow and from somewhere, sees an ingress rule that it tries to delete and which doesn't exist anywhere in my state file or my security group in AWS. And that's why it complains that the security group ingress rule does not exist. Why and how Terraform is seeing something that doesn't exist in the .tf or the state file is the question.

This crossing of ingress rules matches what we're seeing in CloudTrail events denoting the failed attempt to revoke an ingress rule: the rules in question for the nodeSecurityGroup can at times show another rule that doesn't exist anywhere, oftentimes using the eksClusterSecurityGroup id in lieu of the desired nodeSecurityGroup id (or some other variation), while none of this matches the actual state in AWS. There definitely seems to be some crossing of wires happening in the request.

Per the linked issue and others, all signs point to avoiding in-line rule definitions in SecurityGroups in TF, due to issues in how ingress rules are gathered within SecurityGroups, and instead opting for standalone SecurityGroupRules, which seems to resolve the problem for most folks. Attempts at using SecurityGroupRules seem like the right path, but this does not seem to play well with existing clusters.
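
For reference, here's a rough TypeScript sketch (not the actual eks component code; names and IDs are placeholders) contrasting in-line ingress rules on a SecurityGroup with standalone SecurityGroupRule resources:

import * as aws from "@pulumi/aws";

const vpcId = "vpc-xxxx"; // placeholder

// In-line style: the rules live on the SecurityGroup itself. This is the
// pattern the linked TF issues point to as fragile when ingress rules are
// diffed and revoked.
const inlineSg = new aws.ec2.SecurityGroup("node-sg-inline", {
    vpcId: vpcId,
    ingress: [{
        description: "Allow nodes to communicate with each other",
        fromPort: 0,
        toPort: 0,
        protocol: "-1",
        self: true,
    }],
});

// Standalone style: the SecurityGroup carries no in-line rules, and each rule
// is its own SecurityGroupRule resource that can be created and deleted
// independently of the group.
const nodeSg = new aws.ec2.SecurityGroup("node-sg", { vpcId: vpcId });

const nodeSelfIngress = new aws.ec2.SecurityGroupRule("node-sg-self-ingress", {
    type: "ingress",
    securityGroupId: nodeSg.id,
    description: "Allow nodes to communicate with each other",
    fromPort: 0,
    toPort: 0,
    protocol: "-1",
    self: true,
});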

@ggilmore

@metral Do you have a repro for this behavior so that an issue can be opened on https://github.com/terraform-providers/terraform-provider-aws?

@metral
Contributor

metral commented Apr 20, 2019

Here's a minimum repro Pulumi program (a rough sketch of its shape is included after the steps below):

Steps:

  1. Run pulumi up in the unzipped dir
  2. After the initial deployment is complete, comment out line #74, subnetIds.pop(), and run another update.
    • This simulates having a VPC with existing subnets and going from using 2 subnets in the cluster to 3 (as noted in the earlier comment).
  3. After ~10 min it errors out with Error revoking security group ingress rules: InvalidPermission.NotFound: The specified rule does not exist in this security group.
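
For anyone who can't grab the attached repro, a program roughly along these lines (a sketch with placeholder VPC/subnet IDs, not the exact attached program) exercises the same shape - the first deployment uses 2 subnets, and commenting out the pop() on the second update moves the cluster to 3:

import * as eks from "@pulumi/eks";

// Placeholder IDs -- substitute subnets from an existing VPC.
const subnetIds = ["subnet-aaaa", "subnet-bbbb", "subnet-cccc"];

// The line to comment out for the second update: with it, the cluster uses
// 2 subnets; without it, 3, which forces the EKS cluster to be replaced.
subnetIds.pop();

const cluster = new eks.Cluster("eks69", {
    vpcId: "vpc-xxxx",
    subnetIds: subnetIds,
    desiredCapacity: 2,
    minSize: 2,
    maxSize: 2,
});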

@lukehoban
Member

For a little more insight into this - with @metral's repro, here are the changes that happen as part of the second update:

Previewing update (dev):

     Type                                          Name                           Plan        Info
     pulumi:pulumi:Stack                           eks69-dev
     └─ eks:index:Cluster                          eks69
 +-     ├─ aws:eks:Cluster                         eks69-eksCluster               replace     [diff: ~name,vpcConfig]
 ~      ├─ pulumi:providers:kubernetes             eks69-eks-k8s                  update      [diff: ~kubeconfig]
 ~      ├─ kubernetes:storage.k8s.io:StorageClass  eks69-gp2                      update      [diff: ~metadata]; 1 warning
 ~      ├─ pulumi-nodejs:dynamic:Resource          eks69-vpc-cni                  update      [diff: ~kubeconfig]
 ~      ├─ aws:ec2:SecurityGroup                   eks69-nodeSecurityGroup        update      [diff: ~tags]
 +-     ├─ aws:ec2:LaunchConfiguration             eks69-nodeLaunchConfiguration  replace     [diff: ~name,userData]
 ~      ├─ aws:cloudformation:Stack                eks69-nodes                    update      [diff: ~templateBody]
 ~      └─ pulumi:providers:kubernetes             eks69-provider                 update      [diff: ~kubeconfig]

Note in particular that the SecurityGroup is only being updated at all due to a change in its tags.

@lukehoban
Member

Turning on verbose logging from the Terraform Provider (export TF_LOG=TRACE) and then logging the pulumi update (e.g. pulumi up --logtostderr -v=9 --debug 2> out.txt) shows this:

    debug: Revoking security group {
    debug:   Description: "Managed by Pulumi",
    debug:   GroupId: "sg-00ed8f2d8501ba93f",
    debug:   GroupName: "eks69-nodeSecurityGroup-dc62a71",
    debug:   IpPermissions: [{
    debug:       IpProtocol: "-1",
    debug:       UserIdGroupPairs: [{
    debug:           Description: "Allow nodes to communicate with each other",
    debug:           GroupId: "sg-00ed8f2d8501ba93f",
    debug:           UserId: "153052954103"
    debug:         }]
    debug:     },{
    debug:       FromPort: 1025,
    debug:       IpProtocol: "tcp",
    debug:       ToPort: 65535,
    debug:       UserIdGroupPairs: [{
    debug:           Description: "Allow worker Kubelets and pods to receive communication from the cluster control plane",
    debug:           GroupId: "sg-01fb5f6f2cd802f8f",
    debug:           UserId: "153052954103"
    debug:         }]
    debug:     },{
    debug:       FromPort: 443,
    debug:       IpProtocol: "tcp",
    debug:       ToPort: 443,
    debug:       UserIdGroupPairs: [{
    debug:           Description: "Allow pods running extension API servers on port 443 to receive communication from cluster control plane",
    debug:           GroupId: "sg-01fb5f6f2cd802f8f",
    debug:           UserId: "153052954103"
    debug:         }]
    debug:     }],
    debug:   IpPermissionsEgress: [{
    debug:       IpProtocol: "-1",
    debug:       IpRanges: [{
    debug:           CidrIp: "0.0.0.0/0",
    debug:           Description: "Allow internet access."
    debug:         }]
    debug:     }],
    debug:   OwnerId: "153052954103",
    debug:   Tags: [{
    debug:       Key: "kubernetes.io/cluster/eks69-eksCluster-254e5bc",
    debug:       Value: "owned"
    debug:     }],
    debug:   VpcId: "vpc-0eab3666fe9e0ab7b"
    debug: } ingress rule: []*ec2.IpPermission{{
    debug:   FromPort: 0,
    debug:   IpProtocol: "-1",
    debug:   ToPort: 0,
    debug:   UserIdGroupPairs: [{
    debug:       Description: "Allow nodes to communicate with each other",
    debug:       GroupId: "sg-01fb5f6f2cd802f8f"
    debug:     }]
    debug: }}
    debug: [aws-sdk-go] DEBUG: Request ec2/RevokeSecurityGroupIngress Details:
    debug: ---[ REQUEST POST-SIGN ]-----------------------------
    debug: POST / HTTP/1.1
    debug: Host: ec2.us-west-2.amazonaws.com
    debug: User-Agent: aws-sdk-go/1.19.4 (go1.12.3; darwin; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.12
    debug: Content-Length: 297
    debug: Authorization: AWS4-HMAC-SHA256 Credential=redacted/20190420/us-west-2/ec2/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date, Signature=7b1037d29442740b6c5d3cd1b8b86c97304eb2dd84a49b250c6fcf06a920baf9
    debug: Content-Type: application/x-www-form-urlencoded; charset=utf-8
    debug: X-Amz-Date: 20190420T191141Z
    debug: Accept-Encoding: gzip
    debug: 
    debug: Action=RevokeSecurityGroupIngress&GroupId=sg-00ed8f2d8501ba93f&IpPermissions.1.FromPort=0&IpPermissions.1.Groups.1.Description=Allow+nodes+to+communicate+with+each+other&IpPermissions.1.Groups.1.GroupId=sg-01fb5f6f2cd802f8f&IpPermissions.1.IpProtocol=-1&IpPermissions.1.ToPort=0&Version=2016-11-15
    debug: -----------------------------------------------------
    debug: RegisterResource RPC finished: resource:eks69-gp2[kubernetes:storage.k8s.io/v1:StorageClass]; err: null, resp: urn:pulumi:dev::eks69::eks:index:Cluster$kubernetes:storage.k8s.io/v1:StorageClass::eks69-gp2,eks69-gp2-ffhxrwou,__inputs,,,,,apiVersion,,,storage.k8s.io/v1,kind,,,StorageClass,metadata,,,,,annotations,,,,,pulumi.com/autonamed,,,true,storageclass.kubernetes.io/is-default-class,,,true,name,,,eks69-gp2-ffhxrwou,parameters,,,,,type,,,gp2,provisioner,,,kubernetes.io/aws-ebs,apiVersion,,,storage.k8s.io/v1,kind,,,StorageClass,metadata,,,,,annotations,,,,,pulumi.com/autonamed,,,true,storageclass.kubernetes.io/is-default-class,,,true,creationTimestamp,,,2019-04-20T18:57:19Z,name,,,eks69-gp2-ffhxrwou,resourceVersion,,,370,selfLink,,,/apis/storage.k8s.io/v1/storageclasses/eks69-gp2-ffhxrwou,uid,,,1f9c24b2-639e-11e9-b4bc-02ace6ef69c6,parameters,,,,,type,,,gp2,provisioner,,,kubernetes.io/aws-ebs,reclaimPolicy,,,Delete,volumeBindingMode,,,Immediate,true,
    debug: [aws-sdk-go] DEBUG: Response ec2/RevokeSecurityGroupIngress Details:
    debug: ---[ RESPONSE ]--------------------------------------
    debug: HTTP/1.1 400 Bad Request
    debug: Connection: close
    debug: Transfer-Encoding: chunked
    debug: Date: Sat, 20 Apr 2019 19:11:41 GMT
    debug: Server: AmazonEC2
    debug: 
    debug: 
    debug: -----------------------------------------------------
    debug: [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
    debug: <Response><Errors><Error><Code>InvalidPermission.NotFound</Code><Message>The specified rule does not exist in this security group.</Message></Error></Errors><RequestID>994d0790-1dc9-487c-969e-993b4e89f8f4</RequestID></Response>
    debug: [aws-sdk-go] DEBUG: Validate Response ec2/RevokeSecurityGroupIngress failed, not retrying, error InvalidPermission.NotFound: The specified rule does not exist in this security group.
    debug:      status code: 400, request id: 994d0790-1dc9-487c-969e-993b4e89f8f4
    error: update failed

This corresponds to the code at https://github.com/terraform-providers/terraform-provider-aws/blob/7beffd7ce24efe62af769e581b3c354ddb724a6d/aws/resource_aws_security_group.go#L725.

What's really surprising here is that it says we are revoking this:

    debug: []*ec2.IpPermission{{
    debug:   FromPort: 0,
    debug:   IpProtocol: "-1",
    debug:   ToPort: 0,
    debug:   UserIdGroupPairs: [{
    debug:       Description: "Allow nodes to communicate with each other",
    debug:       GroupId: "sg-01fb5f6f2cd802f8f"
    debug:     }]
    debug: }}

But the corresponding ingress rule looks like this:

    debug:    {
    debug:       IpProtocol: "-1",
    debug:       UserIdGroupPairs: [{
    debug:           Description: "Allow nodes to communicate with each other",
    debug:           GroupId: "sg-00ed8f2d8501ba93f",
    debug:           UserId: "153052954103"
    debug:         }]
    debug:     }

Which is then what gets embedded into the AWS API request that fails:

Action=RevokeSecurityGroupIngress&GroupId=sg-00ed8f2d8501ba93f&IpPermissions.1.FromPort=0&IpPermissions.1.Groups.1.Description=Allow+nodes+to+communicate+with+each+other&IpPermissions.1.Groups.1.GroupId=sg-01fb5f6f2cd802f8f&IpPermissions.1.IpProtocol=-1&IpPermissions.1.ToPort=0&Version=2016-11-15

Note that these have different GroupId values.

And indeed, when I look at the actual AWS resources, there is a resource corresponding to the second, but no resource corresponding to the first. It is unclear where the Terraform provider is even coming up with that sg-01fb5f6f2cd802f8f value. It's the right value for a different ingress rule, but is not present in any form for this ingress rule.

In the source code, this ingress rule is defined as:

            {
                description: "Allow nodes to communicate with each other",
                fromPort: 0,
                toPort: 0,
                protocol: "-1", // all
                self: true,
            },

Notably - it uses self: true to specify the source securityGroup, but the Terraform Provider appears to be getting confused and using some totally unrelated Security Group in place of the self link.

@metral
Contributor

metral commented Apr 20, 2019

@lukehoban Your analysis is spot on with what I've been seeing. I've even tried replacing the use of self with the nodeSecurityGroupId itself, but even that doesn't stop the TF provider from tripping up when gathering rules - a spot ref'd in a couple of issues.

Moving away from in-line rules to standalone secgroup rules sounds like the best path forward to avoid the errors we're seeing. Doing so cleanly for existing clusters, though, seems to be a challenge: per the dependency chain it wants to recreate the worker nodes, and it will still hit the same errors. Thoughts?

@lukehoban
Member

Okay - I'm beginning to believe the root cause here is actually a core Pulumi engine bug. Moreover, I'm beginning to think it's the same bug I spent most of this week fighting as part of pulumi/pulumi-terraform#362, which is ultimately being tracked in pulumi/pulumi#2650.

The key is to look at the Pulumi state file (pulumi stack export > stack.json) and notice the following values for ingress in the "inputs" and "outputs" of the security group in question, respectively:

Inputs:

"ingress": [
                        {
                            "__defaults": [],
                            "description": "Allow nodes to communicate with each other",
                            "fromPort": 0,
                            "protocol": "-1",
                            "self": true,
                            "toPort": 0
                        },
                        {
                            "__defaults": [
                                "self"
                            ],
                            "description": "Allow worker Kubelets and pods to receive communication from the cluster control plane",
                            "fromPort": 1025,
                            "protocol": "tcp",
                            "securityGroups": [
                                "sg-01fb5f6f2cd802f8f"
                            ],
                            "self": false,
                            "toPort": 65535
                        },
                        {
                            "__defaults": [
                                "self"
                            ],
                            "description": "Allow pods running extension API servers on port 443 to receive communication from cluster control plane",
                            "fromPort": 443,
                            "protocol": "tcp",
                            "securityGroups": [
                                "sg-01fb5f6f2cd802f8f"
                            ],
                            "self": false,
                            "toPort": 443
                        }
                    ],

Outputs:

"ingress": [
                        {
                            "cidrBlocks": [],
                            "description": "Allow worker Kubelets and pods to receive communication from the cluster control plane",
                            "fromPort": 1025,
                            "ipv6CidrBlocks": [],
                            "prefixListIds": [],
                            "protocol": "tcp",
                            "securityGroups": [
                                "sg-01fb5f6f2cd802f8f"
                            ],
                            "self": false,
                            "toPort": 65535
                        },
                        {
                            "cidrBlocks": [],
                            "description": "Allow nodes to communicate with each other",
                            "fromPort": 0,
                            "ipv6CidrBlocks": [],
                            "prefixListIds": [],
                            "protocol": "-1",
                            "securityGroups": [],
                            "self": true,
                            "toPort": 0
                        },
                        {
                            "cidrBlocks": [],
                            "description": "Allow pods running extension API servers on port 443 to receive communication from cluster control plane",
                            "fromPort": 443,
                            "ipv6CidrBlocks": [],
                            "prefixListIds": [],
                            "protocol": "tcp",
                            "securityGroups": [
                                "sg-01fb5f6f2cd802f8f"
                            ],
                            "self": false,
                            "toPort": 443
                        }
                    ],

Note that the two arrays are in different orders. But then, as highlighted in pulumi/pulumi#2650, the Pulumi engine does a (highly suspicious) deep merge of inputs and outputs to compute the "old inputs", including an element-wise merge of arrays. This causes the values passed to the Terraform provider's Update call to be a strange mix of two of the specified rules - which is capable of triggering the results seen above.
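
As a rough illustration of why that element-wise merge is dangerous (a sketch of the failure mode, not the engine's actual implementation), consider merging the two differently-ordered arrays above:

// Sketch only: inputs and outputs hold the same rules in different orders,
// and merging them element-by-element crosses two unrelated rules.
const inputIngress: any[] = [
    { description: "Allow nodes to communicate with each other", protocol: "-1", self: true },
    { description: "Allow worker Kubelets and pods ...", protocol: "tcp", securityGroups: ["sg-01fb5f6f2cd802f8f"], self: false },
];
const outputIngress: any[] = [
    { description: "Allow worker Kubelets and pods ...", protocol: "tcp", securityGroups: ["sg-01fb5f6f2cd802f8f"], self: false },
    { description: "Allow nodes to communicate with each other", protocol: "-1", securityGroups: [], self: true },
];

// Element i of the inputs is merged onto element i of the outputs, even
// though they describe different rules.
const merged = inputIngress.map((rule, i) => ({ ...outputIngress[i], ...rule }));

// merged[0] now carries self: true from the inputs' first rule *and*
// securityGroups: ["sg-01fb5f6f2cd802f8f"] left over from the outputs' first
// rule -- the phantom rule the provider then tries to revoke.
console.log(merged[0]);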

And indeed - a close inspection of the Update call Pulumi schedules shows:

I0420 12:11:39.874548   23508 rpc.go:68] Marshaling property for RPC[Provider[aws, 0xc004f81d60].Update(sg-00ed8f2d8501ba93f,urn:pulumi:dev::eks69::eks:index:Cluster$aws:ec2/securityGroup:SecurityGroup::eks69-nodeSecurityGroup).olds]: ingress={[{map[__defaults:{[]} cidrBlocks:{[]} description:{Allow worker Kubelets and pods to receive communication from the cluster control plane} fromPort:{1025} ipv6CidrBlocks:{[]} prefixListIds:{[]} protocol:{tcp} securityGroups:{[{sg-01fb5f6f2cd802f8f}]} self:{false} toPort:{65535}]} {map[__defaults:{[{self}]} cidrBlocks:{[]} description:{Allow nodes to communicate with each other} fromPort:{0} ipv6CidrBlocks:{[]} prefixListIds:{[]} protocol:{-1} securityGroups:{[{sg-01fb5f6f2cd802f8f}]} self:{true} toPort:{0}]} {map[__defaults:{[{self}]} cidrBlocks:{[]} description:{Allow pods running extension API servers on port 443 to receive communication from cluster control plane} fromPort:{443} ipv6CidrBlocks:{[]} prefixListIds:{[]} protocol:{tcp} securityGroups:{[{sg-01fb5f6f2cd802f8f}]} self:{false} toPort:{443}]}]}

Which includes:

{map[__defaults:{[{self}]} cidrBlocks:{[]} description:{Allow nodes to communicate with each other} fromPort:{0} ipv6CidrBlocks:{[]} prefixListIds:{[]} protocol:{-1} securityGroups:{[{sg-01fb5f6f2cd802f8f}]} self:{true} toPort:{0}]} 

Which includes both self: true and securityGroups: ["sg-01fb5f6f2cd802f8f"] - even though the user never specified this, and no such thing exists in the cloud or even in the state file.

Net - I have fairly high confidence the root cause here is pulumi/pulumi#2650.

@lukehoban
Member

As a result - a workaround for this bug is to do the following:

  1. pulumi stack export > stack.json
  2. Find the nodeSecurityGroup outputs ingress array and ensure it is ordered the same as the inputs ingress. This most likely means swapping the first two elements of the three-element array (a small helper script is sketched below). Save the file.
  3. pulumi stack import < stack.json
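
If hand-editing the JSON feels error-prone, a small helper along these lines can do the reordering (a hypothetical script, assuming the usual pulumi stack export layout of deployment.resources[*].inputs/outputs - review the diff before importing):

import * as fs from "fs";

const stack = JSON.parse(fs.readFileSync("stack.json", "utf8"));

for (const res of stack.deployment.resources) {
    // Only touch the node security group resource.
    if (!res.urn || !res.urn.endsWith("nodeSecurityGroup")) continue;
    const inputs = res.inputs && res.inputs.ingress;
    const outputs = res.outputs && res.outputs.ingress;
    if (!inputs || !outputs) continue;

    // Reorder the outputs' ingress rules to line up with the inputs',
    // matching rules by description; unmatched rules sort last.
    const order = (rule: any) => {
        const idx = inputs.findIndex((i: any) => i.description === rule.description);
        return idx === -1 ? inputs.length : idx;
    };
    res.outputs.ingress = [...outputs].sort((a: any, b: any) => order(a) - order(b));
}

fs.writeFileSync("stack.json", JSON.stringify(stack, null, 4));

Then pulumi stack import < stack.json as in step 3.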

@lukehoban
Member

As for near-term fixes to the specific case here:

Moving away from in-line rules to standalone secgroup rules sounds like the best path forward to avoid the errors we're seeing.

I agree that in general this is a better approach - both to avoid issues like the one described above and because it's generally easier to extend (users can add their own rules if they want).

Doing so cleanly for existing clusters, though, seems to be a challenge: per the dependency chain it wants to recreate the worker nodes, and it will still hit the same errors.

I think you are right that any attempt to change the SecurityGroup this resource currently deploys will fail, so any attempt to "fix" this would itself trigger this issue. We may want to wait to see if we can fix pulumi/pulumi#2650 in the core engine first; after that, I believe we could safely transition to a cleaner approach to managing the ingress rules here.

For right now, the workaround in #69 (comment) should unblock anyone currently blocked by this.

@lukehoban
Member

The original issue here will be fixed by using a pulumi CLI with the fix for pulumi/pulumi#2650. @metral will open a new issue to track some related improvements we can make to avoid using inline security groups. But the specific issue here should be fixed.

@metral
Contributor

metral commented Apr 24, 2019

The fix for this issue is now available in the current latest version of the CLI, v0.17.8: https://github.com/pulumi/pulumi/releases/tag/v0.17.8

@infin8x infin8x added the p1 A bug severe enough to be the next item assigned to an engineer label Jul 10, 2021
@steveellis

I just encountered Error revoking security group ingress rules: InvalidPermission.NotFound: The specified rule does not exist in this security group. I'm on pulumi cli version 3.19.0. Running pulumi destroy -r was all I needed. Hopefully this will help someone else.
