Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proxy/iptables: don't call iptables-restore when we don't need to #36006

Closed
wants to merge 4 commits into from

Conversation

dcbw
Copy link
Member

@dcbw dcbw commented Nov 1, 2016

Diff old and new rules and don't bother calling iptables-restore when nothing has actually changed.

iptables-restore is a very heavy operation and depending on the kernel,
the CPU, and the number of rules to restore, can take a very long time
(~500ms or more).

i7-5600U @ 2.6GHz: 700ms for 1000 services (2+4 cores+threads, kernel 4.6.6)
i7-4790 @ 3.6GHz: 270ms for 1000 services (4+8 cores+threads, kernel 4.6.7)

Other parts of kubernetes (eg kubenet) might be running iptables-restore
too, leading to some pileup as each iptables-restore waits for others to
complete.

Related: #26637
Related: #33693
Related: #35334


This change is Reviewable

@thockin @timothysc @danwinship @eparis

@k8s-github-robot k8s-github-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. release-note-label-needed labels Nov 1, 2016
@dcbw
Copy link
Member Author

dcbw commented Nov 1, 2016

@k8s-bot unit test this #35898

@dcbw
Copy link
Member Author

dcbw commented Nov 1, 2016

Rebase to get revert that fixes #35898

@dcbw
Copy link
Member Author

dcbw commented Nov 2, 2016

@k8s-bot node e2e test this issue #35983

if destIP != "" {
return strings.Contains(r[iptablestest.Destination], destIP)
_, ok := r.Options[iptablestest.Destination]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change (and the destPort change below) seems wrong; it only tests that r has some destination, not that it's the right one

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

// no-op implementation of iptables Interface
type FakeIPTables struct {
Lines []byte
RestoreLines []byte
SaveLines []byte
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SaveLines seems to be unused

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to the commit that actually uses it.

}

func parseRule(line string, filterChains []Chain, filterChainPrefixes []string) (*Rule, Chain, error) {
knownOptions := []knownOption{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be a global (and const?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrays can't be 'const' in Go, something about being able to take the address of an array element which would be illegal for a const. It also shouldn't really be global since it's used in only one function, so I moved the struct declaration into the function too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that means it's going to be constructing the array every time you call the function. If you made it global it would only have to do it once.

return nil, "", fmt.Errorf("invalid iptables rule (not enough arguments): %v", line)
}
if len(items[i]) == 0 {
return nil, "", fmt.Errorf("invalid iptables rule (empty argument): %v", line)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it just means there were multiple spaces rather than just 1... which doesn't seem like it should necessarily be invalid. (also, you don't check for empty argument if it's quoted)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, multiple spaces are now handled initially by a helper that splits the line, and empty quoted arguments are passed through as-is (sans quoting of course).

option.Arg += " "
}
option.Arg += items[j]
if strings.HasSuffix(option.Arg, "\"") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use HasPrefix for the initial quote above for consistency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

} else if strings.HasPrefix(line, "-A ") {
rule, chain, err := parseRule(line, filterChains, filterChainPrefixes)
if err != nil {
return make(map[Chain][]Rule), err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return nil, err? (likewise elsewhere)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed for the error cases.

@@ -235,7 +236,7 @@ func NewProxier(ipt utiliptables.Interface, sysctl utilsysctl.Interface, exec ut
return nil, fmt.Errorf("invalid iptables-masquerade-bit %v not in [0, 31]", masqueradeBit)
}
masqueradeValue := 1 << uint(masqueradeBit)
masqueradeMark := fmt.Sprintf("%#08x/%#08x", masqueradeValue, masqueradeValue)
masqueradeMark := fmt.Sprintf("%#04x/%#04x", masqueradeValue, masqueradeValue)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly in the wrong commit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not right. At least on my machines, it is actually best-fit:

-A FOO -j MARK --set-xmark 0x1/0x1
-A FOO -j MARK --set-xmark 0x10/0x10
-A FOO -j MARK --set-xmark 0x100/0x100
-A FOO -j MARK --set-xmark 0x1000/0x1000
-A FOO -j MARK --set-xmark 0x10000/0x10000
-A FOO -j MARK --set-xmark 0x100000/0x100000
-A FOO -j MARK --set-xmark 0x1000000/0x1000000
-A FOO -j MARK --set-xmark 0x10000000/0x10000000

newChains, err := parseKubeProxyTableAddRules(table, newLines)
if err != nil {
return false, err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a difference between oldLines failing to parse and newLines failing to parse: oldLines can fail to parse if the administrator mucks around with iptables behind our back, or if the format of iptables-save changes. But newLines can only fail to parse in the event of a programmer error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, fixed.

// Revert new local ports.
revertPorts(replacementPortsMap, proxier.portsMap)
return
glog.Warningf("failed to parse iptables filter rules: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some failure cases (eg, iptables-save output format changes), every single call to sameIptablesRuleset() will fail... we might want to not log every time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll now only warn if parsing the new rules failed, which should be entirely under our control.

stickyMaxAgeSeconds: 180, // TODO: paramaterize this in the API.
clusterIP: ip,
protocol: protocol,
port: port,
onlyNodeLocalEndpoints: onlyNodeLocalEndpoints,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what happened there but this isn't gofmt'ed correctly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it is; that's what gofmt -s -w spit out for this file. Maybe Go 1.7 now considers variable length?

@timothysc timothysc added area/kube-proxy priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 2, 2016
@timothysc timothysc added this to the v1.5 milestone Nov 2, 2016
@timothysc timothysc assigned danwinship, thockin and wojtek-t and unassigned lavalamp Nov 3, 2016
@timothysc timothysc added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Nov 3, 2016
@timothysc
Copy link
Member

re-assigning, b/c daniel is out.

@timothysc timothysc added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Nov 4, 2016
@dcbw
Copy link
Member Author

dcbw commented Nov 9, 2016

@danwinship PTAL, thanks!

@dcbw
Copy link
Member Author

dcbw commented Nov 9, 2016

integration test flake is #36473

@dcbw
Copy link
Member Author

dcbw commented Nov 9, 2016

@k8s-bot test this #36473

// Split a line into space-separated arguments, handling simple quoting
func splitLine(line string) ([]string, error) {
items := strings.Split(line, " ")
args := make([]string, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider make([]string, 0, len(items)) as an optimization

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

func parseRule(line string, allowChains []Chain, allowChainPrefixes []string) (*Rule, Chain, error) {
optionAttrs := map[string]optionAttrs{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned above, this has to be re-built every time you call parseRule(), while if it was global it would only have to be built once

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also maybe sort them into four groups by the optionAttr values, with a comment above each group? (// arg required, can't negate, etc)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


attr, ok := optionAttrs[opt]
if !ok {
return nil, "", fmt.Errorf("Unknown iptables rule option %v", opt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, I think there are some more places where you said %v when you meant %q (across multiple commits)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// negation only valid at beginning of options
if items[i] == "!" {
negated = true
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would simplify the loop if you just did an i++ here like you do for options with arguments. (eg, then both opt and negated would be local to the loop body)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it that way before, but then we have to also check if i == len(items) a lot. I'll try it again this way and see.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

args, err := splitLine(tc.line)
if err != nil {
if tc.err != "" {
if !strings.HasPrefix(fmt.Sprintf("%v", err), tc.err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err.Error(). Although none of the test cases sets tc.err anyway

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and added some error testcases.

result, err := ParseTableAddRules(tc.table, tc.filterChains, tc.filterChainPrefixes, tc.data)
if err != nil {
if tc.err != "" {
if !strings.HasPrefix(fmt.Sprintf("%v", err), tc.err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err.Error()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Revert new local ports.
revertPorts(replacementPortsMap, proxier.portsMap)
return
glog.Warningf("failed to parse iptables filter rules: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With just a warning, I fear that this is just going to get broken. Someone will add a new rule that the parsing code doesn't know how to parse, no one will notice the warning message, and we'll ship a release where all of this parsing and comparing ends up just being an expensive no-op. OTOH, the code is complicated enough that I'm not comfortable saying we should panic if parsing fails...

Maybe utiliptables.EnsureRule() should return an error if it finds itself about to add a rule that it wouldn't later be able to parse? (Though in that case we'd want to validate module names too, to ensure that there aren't going to be any new "fixups for default module options" needed.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proxy testcases should pick up changes to the ruleset though, so it should fail pretty early if syncProxyRules() gets changed. And it's at least designed to fail safe, as in it shouldn't change existing behavior if things don't work, it'll just always sync and log a message.

Ideally instead of adding textual rules we'd have a more structured iptables module that would use the vishvananda/netlink to read/write stuff and thus be able to more directly compare instead of screen-scraping, but that's a ton of work...

If you really want it I can try the EnsureRule thing and add hostports, userspace proxy, and other testcases too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it didn't occur to me that the parsing code didn't deal with hostports, etc, already...

Maybe just add warning comments to all the relevant bits of the iptables proxy then where people might change things in ways that would cause breakage

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danwinship the proxy code is the only one so far that uses ParseTableRules(), and while the others could use this too, they don't get called anywhere near as often so they aren't a hotpath like the proxy stuff is. I'll definitely add comments.

@dcbw
Copy link
Member Author

dcbw commented Nov 10, 2016

@danwinship PTAL, thanks

@dims
Copy link
Member

dims commented Nov 13, 2016

@thockin @wojtek-t @danwinship @saad-ali : seems a bit big change at the tail end of v1.5. right?

@saad-ali saad-ali removed this from the v1.5 milestone Nov 14, 2016
@saad-ali
Copy link
Member

@thockin @wojtek-t @danwinship @saad-ali : seems a bit big change at the tail end of v1.5. right?

Agreed, we are well into the code freeze for 1.5. Is this a bug fix? @kubernetes/sig-network for more context. Removing 1.5 milestone until we have verfication.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 1, 2016
We'll need this in the iptables proxy code to figure out when its
pointless to call iptables-restore since things haven't changed.
iptables-restore is a very heavy operation and depending on the kernel,
the CPU, and the number of rules to restore, can take a very long time
(~500ms or more).

i7-5600U @ 2.6GHz: 700ms for 1000 services (2+4 cores+threads, kernel 4.6.6)
i7-4790  @ 3.6GHz: 270ms for 1000 services (4+8 cores+threads, kernel 4.6.7)

Other parts of kubernetes (eg kubenet) might be running iptables-restore
too, leading to some pileup as each iptables-restore waits for others to
complete.

Related: kubernetes#26637
Related: kubernetes#33693
Related: kubernetes#35334
Go assumes int types for numerics that aren't explicitly typed,
leading to:

"constant 2147483648 overflows int"
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 1, 2016
@k8s-ci-robot
Copy link
Contributor

Jenkins GCI GCE e2e failed for commit 12fa2af. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@dcbw
Copy link
Member Author

dcbw commented Dec 1, 2016

@k8s-bot gci gce e2e test this issue #37361

@jeremyeder
Copy link

To give you a sense of what we're seeing wrt iptables save/restore churn, here is a graph showing CPU utilization across a variety of system roles in a cluster. Avg in the 2+ core range for iptables rule validation.

This cluster has 1000 nodes and 14,000 pods. At scales far lower than this, you can find iptables consuming a core or so on every node in the cluster at any given time. For RH'rs reading, if you'd like access please ping me internally.

iptables

@rantoniuk
Copy link

@jeremyeder out of curiosity, how many services do you have in the cluster?

@jeremyeder
Copy link

Approximately 600 services.

@widgetpl
Copy link

@jeremyeder do you have any network issues like packet drops etc. couse we have faced such issues when we hit to around 500 services.

@jeremyeder
Copy link

@widgetpl Just confirmed, no drops or errors on any of the interfaces.

@k8s-github-robot
Copy link

@dcbw PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 12, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @brendandburns
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

This PR hasn't been active in 62 days. It will be closed in 27 days (Mar 16, 2017).

cc @danwinship @dcbw @dchen1107 @smarterclayton @thockin @wojtek-t

You can add 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days

@dcbw
Copy link
Member Author

dcbw commented Feb 17, 2017

Going to close this one too, as the core issue may have already been addressed by other changes like #41022 and #38996

@dcbw dcbw closed this Feb 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kube-proxy cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet