We were running a service of type LoadBalancer with the AWS cloud provider and, after deleting this service, the cleanup process on AWS removed the wrong ingress rule, which caused all traffic between nodes to be unable to reach its destination.
The service we were running had a configuration similar to:
After applying this manifest on the cluster, the cloud provider created the associated load balancer with two security groups:
The cloud provider also created a new ingress rule on the nodes security group which allows all traffic on all ports from
One particularity of our setup is that the security group we added as
Everything worked fine with this setup until we deleted this service and the cloud provider started the deletion process of the LoadBalancer on AWS. We've been debugging the deletion code and we found that the first thing it does is clean up the security group. To do so, it first removes the ingress rule from the nodes security group which allows traffic from the load balancer security group (sg-abcdef, in this example) and then, as it's no longer referenced, it removes the security group.
But, in our case, it ended up as:
which dropped all the traffic between nodes (as it didn't allow any traffic between them).
The cleanup process didn't properly remove the ingress rule from the nodes security group (the one which allows all traffic from source sg-abcdef) and it also did not remove the sg-abcdef security group.
We've been debugging and we found that the process which removes the input rule from the nodes security groups starts here: https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/cloudprovider/providers/aws/aws.go#L4021
The first thing this function does is this: https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/cloudprovider/providers/aws/aws.go#L3737-L3748 :
When extra security groups are used,
We've seen that master branch still contains this code and we think this may be hapenning on newer versions:
In our case, we saw several times this log during the removal process: https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/cloudprovider/providers/aws/aws.go#L3745
Additionally, the code tried to remove the security group (but being unable to do so), after wrongly revoking the rule we have talked about before.
What you expected to happen:
The cleanup process should properly detect the load balancer security group, remove the input rule from nodes security group and remove the load balancer security group, regardless of the order in which AWS returns the security group list on
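As an added example (not code from kubernetes/kubernetes or from the cloud provider), the following Go program contrasts the two ways of choosing which security group to treat as the load balancer's own during cleanup: an order-dependent pick that keeps the last non-empty ID in whatever order AWS returned the list, and an order-independent pick that excludes every operator-supplied group and refuses to guess when the result is ambiguous. The ID sg-abcdef is the one named in this issue; sg-0extra00 (the extra group) and sg-0nodes00 (the nodes group) are constructed for the example. The final function models the revoke step, so it is visible which rule on the nodes security group would actually be removed for each choice.

```go
package main

import (
	"errors"
	"fmt"
)

// Constructed sample data (not from the original issue): sg-abcdef is the SG the
// cloud provider created for the ELB; sg-0extra00 stands in for a group the
// operator attached as an extra security group; sg-0nodes00 is the nodes' SG.
const (
	lbSecurityGroup    = "sg-abcdef"
	extraSecurityGroup = "sg-0extra00"
	nodeSecurityGroup  = "sg-0nodes00"
)

// pickLastNonEmpty is the order-dependent selection: walk the attached list,
// keep the last non-empty ID and emit a warning whenever more than one is
// present. Its result changes with the order AWS returns the list in.
func pickLastNonEmpty(attached []string) (chosen string, warnings []string) {
	for _, sg := range attached {
		if sg == "" {
			continue
		}
		if chosen != "" {
			warnings = append(warnings, fmt.Sprintf("multiple security groups for load balancer: %s", sg))
		}
		chosen = sg
	}
	return chosen, warnings
}

// pickOwnedSecurityGroup selects the load balancer's own SG by elimination:
// any group the operator supplied (extra SGs, a global ELB SG) cannot be the
// one the cloud provider created, so it is skipped. Exactly one candidate must
// remain; zero or several is an error rather than a guess.
func pickOwnedSecurityGroup(attached []string, operatorSupplied []string) (string, error) {
	skip := make(map[string]bool, len(operatorSupplied))
	for _, sg := range operatorSupplied {
		skip[sg] = true
	}
	var candidates []string
	for _, sg := range attached {
		if sg == "" || skip[sg] {
			continue
		}
		candidates = append(candidates, sg)
	}
	switch len(candidates) {
	case 1:
		return candidates[0], nil
	case 0:
		return "", errors.New("no cloud-provider-owned security group is attached to the load balancer")
	default:
		return "", fmt.Errorf("ambiguous: %d candidate security groups %v", len(candidates), candidates)
	}
}

// revokeIngressFromNodes models the cleanup step on the nodes' SG: the ingress
// rule whose source is the chosen SG is revoked, every other rule is kept.
// nodeIngressSources lists the source SG of each ingress rule on the nodes' SG.
func revokeIngressFromNodes(nodeIngressSources []string, source string) []string {
	remaining := make([]string, 0, len(nodeIngressSources))
	for _, s := range nodeIngressSources {
		if s != source {
			remaining = append(remaining, s)
		}
	}
	return remaining
}

func main() {
	// The same two groups attached to the ELB, reported in two different orders.
	orders := [][]string{
		{lbSecurityGroup, extraSecurityGroup},
		{extraSecurityGroup, lbSecurityGroup},
	}
	// Ingress rules on the nodes' SG: the self-rule for node-to-node traffic
	// and the rule the cloud provider added for the load balancer.
	nodeRules := []string{nodeSecurityGroup, lbSecurityGroup}

	for _, attached := range orders {
		naive, _ := pickLastNonEmpty(attached)
		owned, err := pickOwnedSecurityGroup(attached, []string{extraSecurityGroup})
		fmt.Printf("attached=%v\n  order-dependent pick=%s -> node rules left %v\n", attached, naive, revokeIngressFromNodes(nodeRules, naive))
		fmt.Printf("  order-independent pick=%s err=%v -> node rules left %v\n", owned, err, revokeIngressFromNodes(nodeRules, owned))
	}
}
```

With the list in one order the order-dependent pick revokes the rule for sg-abcdef, with the other it revokes a rule for sg-0extra00 that the nodes' SG never had, leaving the sg-abcdef rule (and therefore the group) in place. The elimination-based pick returns sg-abcdef for both orders, and an attached list with no identifiable owned group, or more than one, is reported instead of silently revoking something.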
How to reproduce it (as minimally and precisely as possible):
We did not find any consistent process to create the scenario, as it happens randomly (we are not sure if due to AWS or any kind of race condition on AWS cloud provider).
Trigger the bug: