Incorrect security group cleanup after service deletion #84926

Open
gnieto opened this issue Nov 7, 2019 · 1 comment


What happened:

We were running a service of type LoadBalancer with the AWS cloud provider. After deleting this service, the cleanup process on AWS removed the wrong ingress rule, which left all traffic between nodes unable to reach its destination.

The service we were running had a configuration similar to:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: Key=Value
    service.beta.kubernetes.io/aws-load-balancer-extra-security-groups: sg-123456
  name: test-service
  namespace: test
spec:
  loadBalancerSourceRanges:
  - 10.0.0.0/8
  ports:
  - name: port-name
    port: 10000
    protocol: TCP
    targetPort: 10000
  selector:
    label: "true"
  sessionAffinity: None
  type: LoadBalancer

After applying this manifest to the cluster, the cloud provider created the associated load balancer with two security groups:

  • sg-123456 (due to extra security group annotation)
  • sg-abcdef (a new SG created by the AWS cloud provider with the rules defined on the loadBalancerSourceRanges)

The cloud provider also created a new ingress rule on the nodes security group allowing all traffic on all ports from sg-abcdef (the security group created for the new load balancer in our example).

One particularity of our setup is that the security group we passed via service.beta.kubernetes.io/aws-load-balancer-extra-security-groups was the nodes security group itself. After the service was created, the ingress rules of this security group were similar to:

Type            Protocol        Port Range        Source        Description
All traffic     All             All               sg-abcdef     LB security group
All traffic     All             All               sg-123456     Nodes security group

Everything worked fine with this setup until we deleted the service and the cloud provider started the deletion process of the LoadBalancer on AWS. We've been debugging the deletion code and found that the first thing it does is clean up the security group. To do so, it first removes the ingress rule on the nodes security group that allows traffic from the load balancer security group (sg-abcdef in this example) and then, once that group is no longer referenced, deletes it.
The expected result of this process would be a nodes security group like:

Type            Protocol        Port Range        Source        Description
All traffic     All             All               sg-123456     Nodes security group

But, in our case, it ended up as:

Type            Protocol        Port Range        Source        Description
All traffic     All             All               sg-abcdef     LB security group

which dropped all traffic between nodes (as no remaining rule allowed any traffic between them).

The cleanup process failed to remove the correct ingress rule from the nodes security group (the one allowing all traffic from source sg-abcdef), and it also did not remove the sg-abcdef security group itself.
In our case, because we were reusing the nodes security group as an extra security group, this bug caused a network partition between nodes; but we think that whenever service.beta.kubernetes.io/aws-load-balancer-extra-security-groups is used, the cleanup process may misbehave and have unexpected side effects like those we observed.

We've been debugging and found that the process which removes the ingress rule from the nodes security group starts here: https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/cloudprovider/providers/aws/aws.go#L4021

The first thing this function does is this: https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/cloudprovider/providers/aws/aws.go#L3737-L3748

When extra security groups are used, elb.LoadBalancerDescription.SecurityGroups contains both the security group created for the load balancer and all the extra security groups. We also found that AWS returns this SecurityGroups list in a different order after the load balancer configuration changes. The code always assumes that the last security group in the list is loadBalancerSecurityGroupID, which may not be true when extra security groups are used.
In the cases where the load balancer security group is not returned in the last position, the cleanup process does not remove that security group, and the ingress rule on the nodes security group is not removed as expected.

We've seen that the master branch still contains this code and we think this may be happening on newer versions:

for _, securityGroup := range lb.SecurityGroups {
	if aws.StringValue(securityGroup) == "" {
		continue
	}
	if loadBalancerSecurityGroupID != "" {
		// We create LBs with one SG
		klog.Warningf("Multiple security groups for load balancer: %q", aws.StringValue(lb.LoadBalancerName))
	}
	loadBalancerSecurityGroupID = *securityGroup
}
so we think that newer versions may be affected by this bug (we did not try it).

In our case, we saw this log several times during the removal process: https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/cloudprovider/providers/aws/aws.go#L3745

Additionally, after wrongly revoking the rule discussed above, the code tried to remove the security group but was unable to do so.

What you expected to happen:

The cleanup process should properly detect the load balancer security group, remove the ingress rule from the nodes security group, and remove the load balancer security group, regardless of the order in which AWS returns the security group list in elb.LoadBalancerDescription.SecurityGroups.

How to reproduce it (as minimally and precisely as possible):

We did not find a consistent way to reproduce the scenario, as it happens randomly (we are not sure whether this is due to AWS or to some race condition in the AWS cloud provider).

Setup:

  • Create a Service with type: LoadBalancer, running the controller manager with the AWS cloud provider and the annotation service.beta.kubernetes.io/aws-load-balancer-extra-security-groups containing one or more security groups (as in the What happened section above)
  • Wait until load balancer is created
  • Run aws elb describe-load-balancers --load-balancer-names <balancer_name>
  • Check whether SecurityGroups has the newly created load balancer security group in the last position.
  • Repeat until the load balancer security group is not in the last position

Trigger the bug:

  • kubectl delete service <service>

Result:

  • The log Multiple security groups for load balancer will appear multiple times on controller-manager logs
  • Load balancer associated security group is not removed
  • Nodes security group input rule allowing all traffic from load balancer security group is not removed
  • If the last security group in the list was the nodes security group, the ingress rule allowing all traffic on all ports from the other nodes is removed and no traffic is allowed between nodes in the cluster.

Environment:

  • Kubernetes version (use kubectl version): version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.10", GitCommit:"7a578febe155a7366767abce40d8a16795a96371", GitTreeState:"clean", BuildDate:"2019-05-01T04:05:01Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: aws
  • OS (e.g: cat /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
  • Kernel (e.g. uname -a): Linux ip-10-2-13-67 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
  • Install tools: kops
@gnieto gnieto added the kind/bug label Nov 7, 2019
@gnieto gnieto commented Nov 7, 2019

/sig cloud-provider
