
kube-proxy iptables code clarity cleanups #106158

Merged — 2 commits merged into kubernetes:master on Nov 6, 2021

Conversation

@thockin (Member) commented Nov 4, 2021

Looking at PRs in the iptables code makes my heart ache. I stared at this code for a while and have been looking for ways to make it more manageable without making it significantly less readable.

This PR explores some of the "low hanging fruit".

/kind cleanup

Release note: NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 4, 2021
@k8s-ci-robot (Contributor):

@thockin: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/ipvs sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 4, 2021
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2021
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 5, 2021
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 5, 2021
@thockin thockin changed the title WIP: kube-proxy cleanups kube-proxy cleanups Nov 5, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 5, 2021
@thockin thockin changed the title kube-proxy cleanups kube-proxy iptables code clarity cleanups Nov 5, 2021
@thockin (Member, Author) commented Nov 5, 2021

OK to review this now

@@ -481,18 +481,6 @@ func WriteLine(buf *bytes.Buffer, words ...string) {
}
}

// WriteRuleLine prepends the strings "-A" and chainName to the buffer and calls
Review comment (Member):

yeah, this was a bit obscure

@aojea (Member) commented Nov 5, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 5, 2021
@k8s-ci-robot k8s-ci-robot merged commit bdb9c08 into kubernetes:master Nov 6, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Nov 6, 2021
}
utilproxy.WriteRuleLine(proxier.natRules, string(kubeServicesChain), append(args, "-j", string(svcChain))...)
args = prepend(args, "-A", string(kubeServicesChain))
Review comment (Contributor):

so this breaks the "not reallocating args" behavior we had before. `args` was initialized as

	args := make([]string, 64)

at the top of syncProxyRules, before the start of the `range proxier.serviceMap` loop, and then each time we started a new arg list, it would do `args = args[:0]` to truncate it while keeping the underlying storage.

But a single `args = prepend(...)` breaks that for the entire rest of the current syncProxyRules call, so we will now be doing a LOT more memory allocations than we had been doing before.

Reply (Member):

great catch

package main

import (
	"testing"
)

func BenchmarkPrependAlloc(b *testing.B) {
	a := make([]int, b.N)

	for i := 0; i < b.N; i++ {
		a = prepend(a, i)
	}
}

func BenchmarkAlloc(b *testing.B) {
	a := make([]int, b.N)

	for i := 0; i < b.N; i++ {
		a = append(a, i)
	}
}

func prepend(sl []int, args ...int) []int {
	return append(args, sl...)
}
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz
BenchmarkPrependAlloc-12    	  112240	    171318 ns/op
BenchmarkAlloc-12           	164673328	         6.298 ns/op
PASS

@thockin (Member, Author) replied Nov 6, 2021:

We can either:
a) change prepend to allocate the full capacity of args (so it allocs, but only once);
b) change prepend to shuffle the elements down and insert into the same slice;
c) change Write to take a [][]string (scatter-gather style).

The truth is that every time we do Write() with a list of args, we cause a slice allocation behind the scenes. I'm skeptical that re-using args is a big win.

If we want to take this seriously, we should explore how to do it better in all cases.

Reply (Contributor):

change Write to take a [][]string (scatter-gather style). The truth is that every time we do

Your sentence got cut off, but whatever you were going to say, that idea sounds really good...

matchClusterIP := []string{
	"-m", protocol, "-p", protocol,
	"-d", utilproxy.ToCIDR(svcInfo.ClusterIP()),
	"--dport", strconv.Itoa(svcInfo.Port()),
}

if hasEndpoints {
	comment := utilproxy.Comment("%s cluster IP", svcNameString)
	if proxier.masqueradeAll {
		proxier.natRules.Write(
			[]string{"-A", string(svcChain)},
			comment,
			matchClusterIP,
			[]string{"-j", string(KubeMarkMasqChain)},
		)
	} else if proxier.localDetector.Implemented() {
		// ...
		proxier.natRules.Write(
			[]string{"-A", string(svcChain)},
			comment,
			proxier.localDetector.JumpIfNotLocal(KubeMarkMasqChain),
		)
	}
	proxier.natRules.Write(
		[]string{"-A", string(kubeServicesChain)},
		comment,
		matchClusterIP,
		[]string{"-j", string(svcChain)},
	)
} else {
	// No endpoints.
	proxier.filterRules.Write(
		[]string{"-A", string(kubeServicesChain)},
		utilproxy.Comment("%s has no endpoints", svcNameString),
		matchClusterIP,
		[]string{"-j", "REJECT"},
	)
}

(utilproxy.Comment is left as an exercise for the reader). Maybe even

func addToChain(chain utiliptables.Chain) []string {
	return []string{"-A", string(chain)}
}

func jumpToChain(chain utiliptables.Chain) []string {
	return []string{"-j", string(chain)}
}

Reply (Contributor):

Or even have Write(args ...interface{}) and let you pass an arbitrary combination of string, []string, and utiliptables.Chain...

	proxier.filterRules.Write(
		"-A", kubeServicesChain,
		utilproxy.Comment("%s has no endpoints", svcNameString),
		matchClusterIP,
		"-j", "REJECT",
	)

@thockin (Member, Author) replied:

Adding a string-or-slice mode to write seems plausible, and makes the net diff smaller AND opens further opportunities.

I want to be careful not to make this code any less readable than it already is. Let me try a couple things.

@thockin (Member, Author) replied:

Updated. I'll take it apart:

BenchmarkOneAppend-6                   	12719592	        95.78 ns/op	      48 B/op	       1 allocs/op
BenchmarkOneVariadic-6                 	13332460	        90.14 ns/op	      48 B/op	       1 allocs/op
BenchmarkOnePrepend-6                  	 6574177	       183.5 ns/op	     176 B/op	       2 allocs/op
BenchmarkOnePrependInPlace-6           	11058524	       108.8 ns/op	      48 B/op	       1 allocs/op
BenchmarkOneSG-6                       	11836407	       100.9 ns/op	      48 B/op	       1 allocs/op
BenchmarkOneIntf-6                     	12110347	        98.34 ns/op	      48 B/op	       1 allocs/op

No penalty for variadic calls. Prepend is bad. In-place prepend is not bad but surprising. Scatter-gather and using an interface both are within spitting distance of no-op.

Run these under more load to amortize the overhead out:

BenchmarkAppend-6                      	 3696194	       286.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkPrepend-6                     	  394273	      2944 ns/op	    6464 B/op	      32 allocs/op
BenchmarkPrependInPlaceManual-6        	 1438062	       747.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkPrependInPlaceCopy-6          	 2344239	       519.9 ns/op	       0 B/op	       0 allocs/op
BenchmarkScatterGatherFlat-6           	 1000000	      1064 ns/op	     608 B/op	      17 allocs/op
BenchmarkScatterGatherDeep-6           	  897902	      1352 ns/op	    1120 B/op	      33 allocs/op

Append doesn't actually work, but it sets the bar. Prepend is 10x slower. In-place prepend is only 2x slower. Scatter-gather turns into a different form of the prepend problem. These results are not very interesting.

A more representative test (look at the gist for code):

BenchmarkReprSGAsFewAsPossible-6       	11684551	        92.63 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprSGMany-6                  	 3608090	       333.5 ns/op	     288 B/op	       5 allocs/op
BenchmarkReprSGManyInline-6            	10851805	       100.5 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprIntfAsFewAsPossible-6   	11803284	        91.16 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprIntfMany-6              	 5385872	       235.8 ns/op	     160 B/op	       3 allocs/op
BenchmarkReprIntfManyInline-6        	10768394	        98.16 ns/op	       0 B/op	       0 allocs/op

These show that both SG and intf are viable, and depending how we use them can perform well. Some code readability considerations...

To explore code readability:

BenchmarkReprClauseAsFewAsPossible-6   	11474544	        91.50 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprClauseMany-6              	 4877307	       244.6 ns/op	     128 B/op	       4 allocs/op
BenchmarkReprClauseManyInline-6        	 6517003	       182.9 ns/op	      72 B/op	       2 allocs/op

This adds a function-type argument. I like the code but perf leaves something to be desired.

BenchmarkReprFuncAsFewAsPossible-6     	11681811	        91.60 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprFuncAsFewAsPossible2-6    	10203375	       103.0 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprFuncMany-6                	 5535034	       216.0 ns/op	     160 B/op	       3 allocs/op
BenchmarkReprFuncManyInline-6          	10537470	       101.1 ns/op	       0 B/op	       0 allocs/op

This is more or less Intf with function helpers. I am not sure it makes the code any clearer.

Take a look and tell me what you think

@thockin (Member, Author) replied:

One more:

We could define a "WordList" and "WordSlice" type, which manage an offset slice underneath so prepends are free. Code would look like:

wl := NewWordList(64, 8) // capacity and zero-offset
// ... later
sl := wl.Slice(0, 0)  // a view into wl
sl = sl.Append("-m", protocol, "-p", protocol)
sl = sl.Append("-d", ip, "--dport", port)
sl = sl.Append("-j", "FOOBAR")
proxier.natRules.write(sl.Prepend("-X", "CHAIN").Slice())

We could use helper funcs if we think it is cleaner:

sl = sl.Append(protocolFunc(protocol)...)
sl = sl.Append(destFunc(ip, port)...)
sl = sl.Append(jumpFunc("FOOBAR")...)
proxier.natRules.write(sl.Prepend(appendFunc("CHAIN")...).Slice())

We could maybe drop the `...`s.

We could maybe even get more DSLish, maybe?

I wrote a simple form of the code; it's only 80 LOC. Scatter-gather and Intf are simpler (fewer LOC), but maybe this is cleaner? Not sure.

Reply (Contributor):

These show that both SG and intf are viable, and depending how we use them can perform well. Some code readability considerations...

BenchmarkReprIntfManyInline is using arrays that it doesn't need to use though:

		writeIntf(&buf,
			"-A", "CHAIN",
			[]string{"-m", protocol, "-p", protocol},
			[]string{"-d", ip, "--dport", port},
			[]string{"-j", "FOOBAR"})

This could just be

		writeIntf(&buf,
			"-A", "CHAIN",
			"-m", protocol, "-p", protocol,
			"-d", ip, "--dport", port,
			"-j", "FOOBAR")

which doesn't seem unreadable to me at all.

Of course, it also doesn't test the interface{} functionality. But we could do:

		matchService := []string{
			"-m", protocol, "-p", protocol,
			"-d", ip, "--dport", port,
		}
		writeIntf(&buf,
			"-A", "CHAIN",
			matchService,
			"-j", "FOOBAR")

which again seems readable to me.

FTR

BenchmarkReprIntfAsFewAsPossible-12            	15124519	        80.31 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprIntfManyInlinePreArray-12         	14044212	        82.45 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprIntfManyInlineNoExtraArrays-12    	14103712	        84.79 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprIntfManyInlineNonRecursive-12     	13345648	        88.40 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprIntfManyInline-12                 	12820171	        90.49 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprIntfMany-12                       	 2739234	       437.9 ns/op	     424 B/op	       8 allocs/op

(PreArray is the one with matchService, NoExtraArrays is the one with all the arrays replaced with individual strings. NonRecursive was another experiment, with making writeIntf loop over []string{} arguments itself rather than doing write(buf, x...).)

https://gist.github.com/danwinship/b5524b557dc8ab8b544ab25f0df9ecf7

We could define a "WordList" and "WordSlice" type, which manage an offset slice underneath so prepends are free.

I think we can just avoid needing to do prepends

We could maybe even get more DSLish, maybe?

Yeah, I was wondering about that when I suggested utilproxy.Comment, jumpToChain etc... I'm not sure if it's a good idea, because we're not going to be able to completely abstract away iptables syntax, so we'd probably just end up making it so that people need to learn both iptables syntax and our DSL if they want to hack on the code. It may be better to just keep all the iptables syntax explicit, but just make it easier to efficiently reuse bits and pieces of it as needed.

@thockin (Member, Author) replied:

using arrays that it doesn't need to use

I was just trying to push the pattern and see where it broke down. Using many slices allows the proliferation of "builder" functions which each return a []string.

It may be better to just keep all the iptables syntax explicit, but just make it easier to efficiently reuse bits and pieces of it as needed.

+1

The main downside of interface{} is that it changes a compile-time error (passing the wrong type) into a runtime error.

Shall we agree that everything but scatter-gather and interface models are ruled out?

@thockin (Member, Author) commented Nov 7, 2021

all names subject to revision :)

type WordList struct {
	words []string
	zero  int // index into words
}

//FIXME: uint?
func NewWordList(capacity, zero int) *WordList {
	wl := &WordList{}
	wl.words = make([]string, capacity)
	wl.zero = zero
	return wl
}

func (wl *WordList) append(idx int, words ...string) {
	wl.maybeGrowRight(idx, len(words))
	copy(wl.words[wl.zero+idx:], words)
}

func (wl *WordList) maybeGrowRight(idx, n int) {
	if len(wl.words)-(wl.zero+idx) >= n {
		return
	}
	if n < len(wl.words)-wl.zero {
		n = len(wl.words) - wl.zero
	}
	tmp := make([]string, len(wl.words)+n)
	copy(tmp[wl.zero:], wl.words[wl.zero:])
	wl.words = tmp
}

func (wl *WordList) prepend(idx int, words ...string) {
	wl.maybeGrowLeft(idx, len(words))
	copy(wl.words[wl.zero+idx:], words)
}

//FIXME: if idx is neg, prepend.  one func?
func (wl *WordList) maybeGrowLeft(idx, n int) {
	if wl.zero+idx >= 0 {
		return
	}
	if n < wl.zero {
		n = wl.zero
	}
	tmp := make([]string, len(wl.words)+n)
	copy(tmp[n:], wl.words)
	wl.words = tmp
	wl.zero += n
}

type WordSlice struct {
	wl    *WordList
	start int // index into wl.words[wl.zero:]
	size  int
}

//FIXME: uint?
func (wl *WordList) Slice(start, size int) WordSlice {
	return WordSlice{
		wl:    wl,
		start: start,
		size:  size,
	}
}

func (ws WordSlice) Append(words ...string) WordSlice {
	ws.wl.append(ws.start+ws.size, words...)
	ws.size += len(words)
	return ws
}

func (ws WordSlice) Prepend(words ...string) WordSlice {
	ws.wl.prepend(ws.start-len(words), words...)
	ws.start -= len(words)
	ws.size += len(words)
	return ws
}

func (ws WordSlice) Slice() []string {
	//FIXME: hide zero stuff in WL
	return ws.wl.words[ws.wl.zero+ws.start : ws.wl.zero+ws.start+ws.size]
}

@thockin (Member, Author) commented Nov 7, 2021

Updated to just play with SG and Intf:

https://gist.github.com/thockin/8fc599cc4adc7aba72a367057704a798#file-bench2_test-go

Each "Repr" test covers all 3 patterns of usage.

BenchmarkOneAppend-6                   	12585250	        94.25 ns/op	      48 B/op	       1 allocs/op
BenchmarkOneVariadic-6                 	13252210	        90.64 ns/op	      48 B/op	       1 allocs/op
BenchmarkOnePrepend-6                  	 6501661	       184.9 ns/op	     176 B/op	       2 allocs/op
BenchmarkOneSG-6                       	12065289	        99.08 ns/op	      48 B/op	       1 allocs/op
BenchmarkOneIntf-6                     	11805279	       101.4 ns/op	      48 B/op	       1 allocs/op

This shows intf and SG are same ballpark as append and variadic.

BenchmarkReprSGMinSlices-6             	 3983193	       264.5 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprSGMinSlicesHelpers-6      	 4134222	       296.4 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprSGManySlices-6            	 1709065	       706.9 ns/op	     480 B/op	       9 allocs/op
BenchmarkReprSGManySlicesHelpers-6     	 1701709	       703.9 ns/op	     480 B/op	       9 allocs/op
BenchmarkReprIntfMinSlices-6           	 3839482	       278.0 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprIntfMinSlicesHelpers-6    	 3778912	       298.8 ns/op	       0 B/op	       0 allocs/op
BenchmarkReprIntfManySlices-6          	 1266530	       904.1 ns/op	     600 B/op	      15 allocs/op
BenchmarkReprIntfManySlicesHelpers-6   	 1327848	       901.1 ns/op	     600 B/op	      15 allocs/op

Whether we use helper funcs or not doesn't matter WRT allocs. That is a code-clarity question.

The difference between SG and Intf is negligible - both are fine if we minimize the number of slices and don't try to accumulate a slice of slices.

BenchmarkIntf-6                        	10762976	        99.28 ns/op	       0 B/op	       0 allocs/op
BenchmarkIntf2-6                       	10822875	        96.12 ns/op	       0 B/op	       0 allocs/op

Your intf2 is not notably faster

@thockin (Member, Author) commented Nov 7, 2021 via email

@thockin (Member, Author) commented Nov 7, 2021

https://github.com/kubernetes/kubernetes/pull/106214/files for a PoC.

I admit that adding []string{} around things is annoying.

We could do the Intf model after all. Or we could do WriteWords() and WriteSlices() (or something).

Some of the append(args could become extra lines, if we think it reads better. E.g.:

-                        proxier.natRules.Write(append(args,
-                               "-s", utilproxy.ToCIDR(netutils.ParseIPSloppy(epInfo.IP())),
-                               "-j", string(KubeMarkMasqChain)))
+                        proxier.natRules.Write(
+                               args,
+                               []string{
+                                   "-s", utilproxy.ToCIDR(netutils.ParseIPSloppy(epInfo.IP())),
+                                   "-j", string(KubeMarkMasqChain),
+                               })

or:

                         proxier.natRules.Write(
-                                []string{"-A", string(svcChain)},
-                                append(args, "-j", string(KubeMarkMasqChain)))
+                                []string{"-A", string(svcChain)},
+                                args,
+                                []string{"-j", string(KubeMarkMasqChain)}
+                        )

...because those are all on-stack, they are zero-alloc.

@danwinship (Contributor):

My feeling is that the strong typing of SG makes it a better choice at (to me) a low price. Agree or disagree?

I don't oppose SG, but I still prefer Intf. If the unit test is comprehensive enough (which it nearly is) then that should catch any type errors in the Intf version. (And we want a very comprehensive unit test to catch logic errors anyway, so relying on it to also catch type errors is not awful.)

@thockin (Member, Author) commented Nov 8, 2021

Let's move discussion to #106214 ?

hanamantagoudvk pushed a commit to Nordix/kpng that referenced this pull request Dec 1, 2021
porting latest optimization from kube-proxy
kubernetes/kubernetes#106158
(The same commit was re-pushed to Nordix/kpng on Dec 3 and Dec 6, 2021.)