
Introduce Shared Security Group to Allow Traffic from Unlimited Number of ELBs. #26670

Closed
kevinkim9264 opened this issue Jun 2, 2016 · 37 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@kevinkim9264
Contributor

Currently, Kubernetes adds a rule for every ELB's security group to the instances' security group, which means the number of rules in the instance security group grows as the number of ELBs grows.

AWS supports up to 50 inbound rules per security group and 5 security groups per network interface. http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_Limits.html#vpc-limits-security-groups

With these AWS limits and the current Kubernetes setup, there is a hard limit of 250 services per Kubernetes cluster (50 rules × 5 security groups = 250 rules per interface). The problem arises mainly because every ELB creation results in a new rule being added to every instance's security group.

We can resolve this by introducing a shared security group per Kubernetes cluster. A simple solution is to modify the code in aws.go so that when an ELB is created, it finds or creates a shared security group (with no rules of its own) and attaches it to the ELB. Then, instead of adding each ELB's own security group rule to the instances, it checks whether each instance's security group already allows all traffic from the shared security group as a source, and adds that single rule to any instance that is missing it.

With this revision, the number of security group rules on the instances becomes independent of the number of ELBs, so the number of services a cluster can support is no longer constrained by this AWS limit.
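
For readers who want to see the shape of the proposal in code, here is a minimal sketch against the aws-sdk-go EC2 API. The function names, the "k8s-elb-shared-" naming convention, and the error handling are illustrative assumptions, not the actual aws.go implementation:

// Sketch of the proposed shared-security-group flow (not the real aws.go code).
package awssketch

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/ec2"
)

// ensureSharedSecurityGroup finds or creates one rule-less security group per
// cluster; every ELB would get this group attached instead of a unique one.
func ensureSharedSecurityGroup(svc *ec2.EC2, clusterName, vpcID string) (string, error) {
    name := "k8s-elb-shared-" + clusterName // hypothetical naming convention

    out, err := svc.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{
        Filters: []*ec2.Filter{
            {Name: aws.String("group-name"), Values: []*string{aws.String(name)}},
            {Name: aws.String("vpc-id"), Values: []*string{aws.String(vpcID)}},
        },
    })
    if err != nil {
        return "", err
    }
    if len(out.SecurityGroups) > 0 {
        return aws.StringValue(out.SecurityGroups[0].GroupId), nil
    }

    created, err := svc.CreateSecurityGroup(&ec2.CreateSecurityGroupInput{
        GroupName:   aws.String(name),
        Description: aws.String("shared SG attached to every ELB of cluster " + clusterName),
        VpcId:       aws.String(vpcID),
    })
    if err != nil {
        return "", err
    }
    return aws.StringValue(created.GroupId), nil
}

// ensureNodeIngressFromShared adds a single rule to the node security group
// allowing all traffic whose source is the shared group, so the node SG no
// longer needs one rule per ELB.
func ensureNodeIngressFromShared(svc *ec2.EC2, nodeSGID, sharedSGID string) error {
    _, err := svc.AuthorizeSecurityGroupIngress(&ec2.AuthorizeSecurityGroupIngressInput{
        GroupId: aws.String(nodeSGID),
        IpPermissions: []*ec2.IpPermission{{
            IpProtocol: aws.String("-1"), // all protocols and ports
            UserIdGroupPairs: []*ec2.UserIdGroupPair{{
                GroupId: aws.String(sharedSGID),
            }},
        }},
    })
    // An InvalidPermission.Duplicate error here means the rule already exists,
    // which a real implementation would treat as success.
    return err
}

The ELB-facing half of the change would attach the shared group's ID to the load balancer at creation time, so the node security group never needs to reference individual ELB security groups.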

@dgolja

dgolja commented Sep 19, 2016

We hit the same limit. As it stands, with the default AWS limits you can only have 50 services that need an ELB.

Our workaround will be to increase the maximum inbound rules to 100, but this can only be used if you do not have other EC2 instances with more than 2 SGs per network interface.

Hopefully there will be a fix before we hit ~100 ELB services.

@jswoods

jswoods commented Oct 4, 2016

Have either of you tried out the setting DisableSecurityGroupIngress? From https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L379-L386

@hjacobs

hjacobs commented Dec 13, 2016

This is a pretty serious limitation we ran into today. Luckily we are fixing it by using Ingress: zalando-incubator/kubernetes-on-aws#169

@Krylon360

It looks like this is already in the Go SDK, but it doesn't seem to be hitting the correct methods to actually work.
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1836

@dimpavloff
Contributor

dimpavloff commented Feb 13, 2017

@Krylon360 I don't work on the codebase, but it seems to me the code should already work. The code you've linked gets called by setSecurityGroupIngress https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L2730 with the ELB's SG as the argument.
The issue has to do with modifying the node's SG, which happens in updateInstanceSecurityGroupsForLoadBalancer https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L2757

P.S. For anyone else who, like me, didn't know how to set DisableSecurityGroupIngress: you can pass --cloud-config=<filepath> to the master components, with contents matching https://godoc.org/gopkg.in/kubernetes/kubernetes.v1/pkg/cloudprovider/providers/aws#CloudConfig . I haven't yet confirmed whether this solves the issue.
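
For reference, the file that --cloud-config points at is a gcfg-style INI file. A minimal example that should toggle this option (assuming the [Global] section maps onto the CloudConfig struct linked above) would be:

[Global]
DisableSecurityGroupIngress = true

The file then has to be passed to the master components together with --cloud-provider=aws, e.g. --cloud-config=/etc/kubernetes/cloud.config (the path is just an example).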

@prakash1991

Hey guys,

I was also facing the same roadblock, but was able to resolve it by editing the minion SG rules to allow traffic from the Kubernetes VPC CIDR and removing all other rules.

Hope this helps.

Regards,
Prakash
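
For anyone wanting to script that workaround rather than do it in the console, a minimal aws-sdk-go sketch is below; the security group ID and the 10.0.0.0/16 CIDR are placeholders for your node/minion SG and your cluster VPC, not values from this thread:

// Sketch of the workaround above: one CIDR rule on the node security group
// instead of one rule per ELB. The SG ID and VPC CIDR are placeholders.
package awssketch

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/ec2"
)

func allowVPCCIDROnNodeSG() error {
    svc := ec2.New(session.Must(session.NewSession()))
    _, err := svc.AuthorizeSecurityGroupIngress(&ec2.AuthorizeSecurityGroupIngressInput{
        GroupId: aws.String("sg-0123456789abcdef0"), // placeholder: node/minion SG
        IpPermissions: []*ec2.IpPermission{{
            IpProtocol: aws.String("-1"), // all protocols and ports
            IpRanges: []*ec2.IpRange{{
                CidrIp:      aws.String("10.0.0.0/16"), // placeholder: cluster VPC CIDR
                Description: aws.String("allow all traffic from the cluster VPC"),
            }},
        }},
    })
    return err
}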

@cbluth

cbluth commented Feb 28, 2017

@prakash1991 , can you elaborate?

@Krylon360

Krylon360 commented Feb 28, 2017 via email

@rexc

rexc commented Mar 23, 2017

I have VPC subnets and nodes/minions tagged with the correct KubernetesCluster tag, but I'm still seeing our node SG get an entry for each ELB that is created.

@Krylon360 is there anything else I might be missing?

@szuecs
Member

szuecs commented Apr 27, 2017

We (same team as @hjacobs) are running into this issue again for all non-HTTP traffic. We use Ingress for HTTP traffic, but for Postgres it is not an option.

@szuecs
Member

szuecs commented May 2, 2017

@cbluth I think @prakash1991 manually modified the SG, which works as a workaround because nothing resets the SG. A simple manual delete works, but it is not a solution for Kubernetes.

@szuecs
Member

szuecs commented May 2, 2017

For the record, you could use the following configuration change to fix your "Too many ELBs: RulesPerSecurityGroupLimitExceeded" issue:

https://github.com/zalando-incubator/kubernetes-on-aws/pull/390/files

DISCLAIMER: we do not use it in production yet, so please make sure you understand the change and test it.

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 31, 2017
@chrislovecnm
Contributor

/sig aws

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 19, 2017
@chrislovecnm
Contributor

@justinsb has any work been done on this?

@ahawkins

To prevent an SG per ELB, you have to tag the VPC subnets assigned to the minions (you would need to tag all subnets associated with the cluster, internal/private and external/public) with the same Name=KubernetesCluster,Value=clusterName tag.

@chrislovecnm is this relevant to kops? (re: #26670 (comment))

@ahawkins

I've observed something else. My cluster has SG entries for ELBs for services that no longer exist. It seems in my case I'm left with dangling ELBs and (perhaps) thus dangling SG entries.

@jeb5-ccl

We have an issue whereby we frequently deploy new versions of services from our CI pipeline, deleting and recreating services as part of the process. The ELBs, SGs, rules, and network interfaces don't get deleted, and we end up hitting our AWS account limits very quickly, forcing manual deletion via the AWS console. Can anyone point me at which logs I should be looking at to see what might be going wrong during the deletes?

@szuecs
Member

szuecs commented Jul 30, 2017

@jeb5-ccl You should have a look at the controller-manager logs.

@chrislovecnm
Contributor

This is supported through the cloud controller manager configuration. You can now name a single security group to be used.

@lypht

lypht commented Nov 10, 2017

Glad to see this is addressed (as alpha) in 1.8. Does anyone have a programmatic workaround for legacy cluster versions (K8s 1.5 or older)?

@szuecs
Member

szuecs commented Nov 11, 2017

@lypht not sure exactly what you mean, but we are happy with https://github.com/zalando-incubator/kube-ingress-aws-controller/
We use one SG shared for all ALBs https://github.com/zalando-incubator/kube-ingress-aws-controller/blob/master/aws/adapter.go#L238
If you have any problems with using it, please let us know in an issue in the kube-ingress-aws-controller repository.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 9, 2018
@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 9, 2018
@racyber

racyber commented Apr 15, 2018

Hi, does anyone know how to work around this issue on AWS? We have a lot of TCP-based apps and no HTTP, so we can't use Ingress.

@hsingh6764

hsingh6764 commented Apr 15, 2018

@racyber if you are using kops then you can apply this setting:

spec:
  cloudConfig:
    disableSecurityGroupIngress: true

Kubernetes will still create a security group per ELB but won't add it to the node security group.
You will have to add a rule to your node security group to allow all ELBs access (typically by allowing your VPC CIDR).

@racyber

racyber commented Apr 15, 2018

@hsingh6764 just found the documentation today! thanks!

https://github.com/kubernetes/kops/blob/release-1.9/docs/cluster_spec.md#cloudconfig

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 14, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 13, 2018
@szuecs
Member

szuecs commented Aug 13, 2018

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 13, 2018
@luckymagic7

@hsingh6764 @racyber
Where should I apply that configuration?
I have a YAML file whose kinds are Service, Deployment, and HorizontalPodAutoscaler.

@hsingh6764

@luckymagic7 it is part of the Kubernetes cluster setup, not of these objects.

@luckymagic7

@hsingh6764 Many Thanks!! I created a new cluster and it works fine^^

@ghost

ghost commented Feb 4, 2019

Hi, any more word on this? I'm creating NLBs (for TCP services) and running into issues with too many rules on my worker security groups. Is there any way to reference a specific security group for an NLB?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 6, 2019
@ghost

ghost commented May 6, 2019

For reference, this has been implemented in #62774.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 5, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
