Split KUBE-SERVICES chain to re-shrink the INPUT chain #56164

Merged
k8s-merge-robot merged 1 commit into kubernetes:master from danwinship:proxier-chain-split on Feb 23, 2018

8 participants
@danwinship
Contributor

danwinship commented Nov 21, 2017

What this PR does / why we need it:
#43972 added an iptables rule "-A INPUT -j KUBE-SERVICES" to make NodePort ICMP rejection work. (Previously the KUBE-SERVICES chain was only run from OUTPUT, not INPUT.) #44547 extended that patch for ExternalIP rejection as well.

However, the KUBE-SERVICES chain can contain a very large number of ICMP reject rules for plain ClusterIP services (the ones that get run from OUTPUT), and for some reason the kernel seems to be much more sensitive to the length of the INPUT chain than to the length of the OUTPUT chain. So a node that worked fine with kube 1.6 (when KUBE-SERVICES was only run from OUTPUT) might fall over with kube 1.7 (with KUBE-SERVICES being run from both INPUT and OUTPUT).

(Specifically, a node with about 5000 ClusterIP reject rules that ran fine with OpenShift 3.6 [kube 1.6] slowed almost to a complete halt with OpenShift 3.7 [kube 1.7].)

This PR fixes things by splitting out the "new" part of KUBE-SERVICES (NodePort and ExternalIP reject rules) into a separate KUBE-EXTERNAL-SERVICES chain run from INPUT, and moves KUBE-SERVICES back to being only run from OUTPUT. (So, yes, this assumes that you don't have 5000 NodePort/ExternalIP services, but, if you do, there's not much we can do, since those rules have to be run on the INPUT side.)
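
As a rough illustration of the split described above, the following Python sketch sorts reject rules into the two chains by the kind of address they match. The `RejectRule` type, the `split_reject_rules` helper, the match strings, and the sample addresses are all hypothetical and are not taken from the kube-proxy source; only the chain names and the INPUT/OUTPUT hook points come from this PR, and the jump rules are shown in the simplified form used in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RejectRule:
    """Hypothetical model of one 'service has no endpoints' reject rule."""
    kind: str   # "ClusterIP", "NodePort" or "ExternalIP"
    match: str  # iptables match arguments (illustrative only)


# Kinds whose traffic arrives from outside the node, so the rules
# have to be evaluated on the INPUT side.
EXTERNAL_KINDS = {"NodePort", "ExternalIP"}

# Simplified jump rules after the split: each built-in chain reaches one kube chain.
JUMPS = [
    "-A INPUT -j KUBE-EXTERNAL-SERVICES",
    "-A OUTPUT -j KUBE-SERVICES",
]


def split_reject_rules(rules):
    """Group filter-table reject rules by the chain they would live in."""
    chains = {"KUBE-SERVICES": [], "KUBE-EXTERNAL-SERVICES": []}
    for rule in rules:
        if rule.kind in EXTERNAL_KINDS:
            chain = "KUBE-EXTERNAL-SERVICES"
        else:
            chain = "KUBE-SERVICES"
        chains[chain].append(f"-A {chain} {rule.match} -j REJECT")
    return chains


# 5000 ClusterIP rejects (as in the report above) plus two external ones.
rules = [
    RejectRule("ClusterIP", f"-d 10.0.{i // 256}.{i % 256}/32 -p tcp --dport 80")
    for i in range(5000)
]
rules.append(RejectRule("NodePort", "-p tcp --dport 30080"))
rules.append(RejectRule("ExternalIP", "-d 192.0.2.10/32 -p tcp --dport 443"))

chains = split_reject_rules(rules)
print(len(chains["KUBE-SERVICES"]), len(chains["KUBE-EXTERNAL-SERVICES"]))
```

With this sample input, packets entering through INPUT only traverse the two rules in KUBE-EXTERNAL-SERVICES, while the 5000 ClusterIP rules are reachable from OUTPUT alone.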

Oh, and I left in the code to clean up the "-A INPUT -j KUBE-SERVICES" rule even though we don't generate it any more, so it gets fixed on upgrade.
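
The upgrade behaviour can be pictured as reconciling the jump rules found on a node against the ones the proxier currently wants. This is a minimal sketch, assuming a hypothetical `reconcile_jumps` helper that works on plain rule strings; the real proxier drives iptables through its own utility code, which is not shown here, and chain creation is omitted.

```python
# Jump rules wanted after this PR (simplified: no comments or extra matches).
DESIRED_JUMPS = [
    ("INPUT", "KUBE-EXTERNAL-SERVICES"),
    ("OUTPUT", "KUBE-SERVICES"),
]

# Jump rules that are no longer generated but may still exist on upgraded nodes.
LEGACY_JUMPS = [
    ("INPUT", "KUBE-SERVICES"),
]


def reconcile_jumps(existing):
    """Given existing '-A <chain> -j <target>' lines, return iptables arguments to apply."""
    present = set(existing)
    actions = []
    # Delete stale jumps only if they are actually present.
    for chain, target in LEGACY_JUMPS:
        if f"-A {chain} -j {target}" in present:
            actions.append(f"-D {chain} -j {target}")
    # Add any desired jump that is missing.
    for chain, target in DESIRED_JUMPS:
        if f"-A {chain} -j {target}" not in present:
            actions.append(f"-I {chain} -j {target}")
    return actions


# A node still carrying the rule layout described for kube 1.7.
node_before_upgrade = [
    "-A INPUT -j KUBE-SERVICES",
    "-A OUTPUT -j KUBE-SERVICES",
]
print(reconcile_jumps(node_before_upgrade))
```

Keeping the legacy rule in a separate list is one way to express "we no longer generate it, but we still remove it"; a node that already has the new layout yields no actions.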

Release note:

Reorganized iptables rules to fix a performance regression on clusters with thousands of services.

@kubernetes/sig-network-bugs @kubernetes/rh-networking

@danwinship

Contributor

danwinship commented Nov 21, 2017

/retest

@thockin

Do we need an equivalent on ipvs side?

@m1093782566

@m1093782566

Member

m1093782566 commented Nov 22, 2017

Thanks @thockin

I haven't taken a deep look yet, but the IPVS proxier does not need iptables REJECT rules to reject packets when a service has no endpoints. When you connect to an IPVS virtual server that has no real servers, the kernel rejects the connection by itself. For example:

[root@SHA1000130405 app]# curl 1.2.3.4:8080
curl: (7) Failed connect to 1.2.3.4:8080; Connection refused

Of course, please correct me if this PR has other benefits.

@danwinship

Contributor

danwinship commented Nov 27, 2017

It's not about rejecting packets specifically, it's about having too many rules in the INPUT chain. But IPVS doesn't use the INPUT chain at all, so it's fine.

@danwinship

Contributor

danwinship commented Dec 7, 2017

/hold
The reporter of #56842 apparently has enough ExternalIP services to hit this problem even with this patch. I think the suggestion there is not quite right though. I'll look into this more next week when I'm back from KubeCon.

@danwinship

Contributor

danwinship commented Dec 20, 2017

/hold cancel
While #56842 needs more than just this, this PR doesn't conflict with the changes needed there, and I think the fix here (having one chain for INPUT and one for OUTPUT rather than a single chain containing a mix of rules some of which only apply to input packets and some of which only apply to output packets) makes sense regardless of whether it fixes #56842.

@dcbw

Member

dcbw commented Jan 18, 2018

/lgtm

@dcbw

Member

dcbw commented Jan 18, 2018

/test pull-kubernetes-e2e-kops-aws

"error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached."

@thockin

Member

thockin commented Feb 6, 2018

@m1093782566 do we need an IPVS equivalent? Or does IPVS get this for free?

@thockin

Member

thockin commented Feb 6, 2018

LGTM, but let's wait for #57336 to merge, since I didn't re-review the first commit here :) Or rebase the 2nd commit here on top of that so I can verify same hash and not re-read it :)

monopole pushed a commit to monopole/kubernetes that referenced this pull request Feb 6, 2018

Merge pull request kubernetes#57336 from danwinship/proxier-simplification

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Abstract some duplicated code in the iptables proxier

Reorganizes the iptables proxier code so we only have the list of "-A FOO -j KUBE-BAR" rules in one place rather than duplicating the same list in multiple places. Split out from kubernetes#56164 for ease of review/merging.

**Release note**:
```release-note
NONE
```
@m1093782566

Member

m1093782566 commented Feb 7, 2018

@thockin

IPVS gets this for free, since the IPVS proxier does not create any INPUT chain rules.

@k8s-ci-robot k8s-ci-robot added size/S and removed size/L labels Feb 7, 2018

@danwinship

Contributor

danwinship commented Feb 8, 2018

@thockin: rebased

@knobunc

Contributor

knobunc commented Feb 12, 2018

@thockin #57336 has merged. What's the next step to getting this in? Thanks.

@dcbw

Member

dcbw commented Feb 21, 2018

/lgtm on the rebase

@thockin

Member

thockin commented Feb 23, 2018

/lgtm
/approve no-issue

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 23, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Feb 23, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, dcbw, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@m1093782566

Member

m1093782566 commented Feb 23, 2018

/retest

@k8s-merge-robot

Contributor

k8s-merge-robot commented Feb 23, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-merge-robot

Contributor

k8s-merge-robot commented Feb 23, 2018

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit f0ca996 into kubernetes:master Feb 23, 2018

12 of 13 checks passed

Submit Queue Required Github CI test is not green: pull-kubernetes-e2e-gce
cla/linuxfoundation danwinship authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
pull-kubernetes-kubemark-e2e-gce Job succeeded.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-unit Job succeeded.
pull-kubernetes-verify Job succeeded.

k8s-merge-robot added a commit that referenced this pull request Feb 23, 2018

Merge pull request #57461 from danwinship/proxier-no-dummy-nat-rules
Automatic merge from submit-queue (batch tested with PRs 55637, 57461, 60268, 60290, 60210). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Don't create no-op iptables rules for services with no endpoints

Currently for all services we create `-t nat -A KUBE-SERVICES` rules that match the destination IPs (ClusterIP, ExternalIP, NodePort IPs, etc) and then jump to the appropriate `KUBE-SVC-XXXXXX` chain. But if the service has no endpoints then the `KUBE-SVC-XXXXXX` chain will be empty and so nothing happens except that we wasted time (a) forcing iptables-restore to parse the match rules, and (b) forcing the kernel to test matches that aren't going to have any effect.

This PR gets rid of the match rules in this case. Which is to say, it changes things so that every incoming service packet is matched *either* by nat rules to rewrite it *or* by filter rules to ICMP reject it, but not both. (Actually, that's not quite true: there are no filter rules to reject Ingress-addressed packets, and I *think* that's a bug?)
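
The "either nat or filter, but not both" behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: the `Service` type, the `rules_for` helper, the sample addresses, and the `KUBE-SVC-<NAME>` chain names are hypothetical stand-ins (real per-service chain names are hashed), and only ClusterIP matching is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical, minimal view of a service as the proxier sees it."""
    name: str
    cluster_ip: str
    port: int
    endpoints: list = field(default_factory=list)


def rules_for(svc):
    """Return (nat_rules, filter_rules); exactly one of the two lists is non-empty."""
    match = f"-d {svc.cluster_ip}/32 -p tcp --dport {svc.port}"
    if svc.endpoints:
        # Endpoints exist: match in the nat table and jump to the per-service chain.
        return [f"-A KUBE-SERVICES {match} -j KUBE-SVC-{svc.name.upper()}"], []
    # No endpoints: emit no nat match at all, only a filter-table reject.
    return [], [f"-A KUBE-SERVICES {match} -j REJECT"]


web = Service("web", "10.0.0.10", 80, ["10.1.0.5:8080"])
idle = Service("idle", "10.0.0.11", 80)

for svc in (web, idle):
    nat_rules, filter_rules = rules_for(svc)
    print(svc.name, len(nat_rules), len(filter_rules))
```

Here `web` contributes one nat rule and no filter rule, while `idle` contributes one filter rule and no nat rule, so neither iptables-restore nor the kernel has to process a match that cannot have any effect.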

I also got rid of some comments that seemed redundant.

The patch is mostly reindentation, so best viewed with `diff -w`.

Partial fix for #56842 / Related to #56164 (which it conflicts with but I'll fix that after one or the other merges).

**Release note**:
```release-note
Removed some redundant rules created by the iptables proxier, to improve performance on systems with very many services.
```

openshift-merge-robot added a commit to openshift/origin that referenced this pull request Feb 27, 2018

Merge pull request #18754 from danwinship/upstream-iptables-fixes
Automatic merge from submit-queue (batch tested with PRs 18754, 18761).

kube-proxy iptables performance fixes

Pull in multiple upstream iptables fixes to improve performance in "very large clusters" (ie, Online).

Includes kubernetes/kubernetes#57336, kubernetes/kubernetes#56164, kubernetes/kubernetes#57461, and kubernetes/kubernetes#60306.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1514174

openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Feb 27, 2018

Merge pull request kubernetes#18754 from danwinship/upstream-iptables-fixes

Automatic merge from submit-queue (batch tested with PRs 18754, 18761).

kube-proxy iptables performance fixes

Pull in multiple upstream iptables fixes to improve performance in "very large clusters" (ie, Online).

Includes kubernetes#57336, kubernetes#56164, kubernetes#57461, and kubernetes#60306.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1514174

Origin-commit: e2e14cb4fe6a6789936da736d627ae96ca822116

openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Mar 5, 2018

openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Mar 23, 2018

@danwinship danwinship deleted the danwinship:proxier-chain-split branch Mar 26, 2018
