
proxy/ipvs: Only compute node ip addresses once per sync #79444

Merged
merged 2 commits into kubernetes:master from cezarsa:node_addrs on Jul 24, 2019

Conversation

@cezarsa
Contributor

commented Jun 26, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:

This PR rearranges the syncProxyRules loop when kube-proxy is using IPVS so that node IPs are computed only once per sync.

Previously, the same IP addresses would be recomputed for each nodePort service, and this could be CPU intensive when there are a large number of nodePort services and a large number of IP addresses on the node.
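
A minimal sketch of the idea (simplified, stand-in names; not the actual kube-proxy code): the address lookup is hoisted out of the per-service loop so it runs once per sync.

package main

import "fmt"

// getNodeAddresses stands in for the potentially expensive enumeration of
// node IP addresses (utilproxy.GetNodeAddresses in the real proxier).
func getNodeAddresses() []string {
	return []string{"10.0.0.1", "172.17.0.1"}
}

func main() {
	services := []string{"svc-a", "svc-b", "svc-c"}

	// Before: the addresses were recomputed inside the loop, once per nodePort service.
	// After: compute them once per sync and reuse the result for every service.
	addrs := getNodeAddresses()
	for _, svc := range services {
		fmt.Println(svc, addrs)
	}
}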

Which issue(s) this PR fixes:
Possibly related to #73382.

Special notes for your reviewer:

cc @andrewsykim
/sig network

Does this PR introduce a user-facing change?:

Reduce kube-proxy CPU usage in IPVS mode when a large number of nodePort services exist.
@k8s-ci-robot

Contributor

commented Jun 26, 2019

Hi @cezarsa. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

if hasNodePort {
	nodeAddresses, err = utilproxy.GetNodeAddresses(proxier.nodePortAddresses, proxier.networkInterfacer)
	if err != nil {
		klog.Errorf("Failed to get node ip address matching nodeport cidr: %v", err)


tedyu Jun 26, 2019

Contributor

Why not skip iterating nodeAddresses in this case?


cezarsa Jun 27, 2019

Author Contributor

This is what will happen in practice. When GetNodeAddresses returns an error it also returns a nil value for nodeAddresses and this will cause the iteration to be skipped.
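
A tiny, self-contained demonstration of the Go behavior being relied on here (a plain map standing in for the sets.String that GetNodeAddresses returns): ranging over a nil map executes zero iterations.

package main

import "fmt"

func main() {
	// GetNodeAddresses returns a nil set alongside the error; the set is backed
	// by a map, and ranging over a nil map simply performs no iterations.
	var nodeAddresses map[string]struct{} // nil
	for address := range nodeAddresses {
		fmt.Println(address) // never reached
	}
	fmt.Println("zero iterations over a nil map")
}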


andrewsykim Jun 28, 2019

Member

This isn't directly obvious though; I agree with tedyu that it's easier to follow with an explicit skip or check. You can use nodeAddresses.Len() to do an explicit check before iterating node addresses.


cezarsa Jun 28, 2019

Author Contributor

Done.

@andrewsykim

Member

commented Jun 26, 2019

/assign

if err != nil {
	klog.Errorf("Failed to get node ip address matching nodeport cidr: %v", err)
}
for address := range nodeAddresses {


andrewsykim Jun 28, 2019

Member

use nodeAddresses.List() here


cezarsa Jun 28, 2019

Author Contributor

Done. I actually changed nodeAddresses to a []string that is only set if nodeAddrSet.Len() > 0, to avoid multiple calls to .List() down the road.

if err != nil {
	klog.Errorf("Failed to list all node IPs from host, err: %v", err)
}
break


andrewsykim Jun 28, 2019

Member

Is this break intentional?


cezarsa Jun 28, 2019

Author Contributor

Yes, that's intentional. The reasoning is that if IsZeroCIDR is true for any address, then nodeIPs will be set to all host IPs (the return value of ipGetter.NodeIPs()), so we don't need to append any extra IP addresses to nodeIPs and it's safe to break here.
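
A self-contained sketch of that control flow, with stand-ins for utilproxy.IsZeroCIDR and proxier.ipGetter.NodeIPs() (names and addresses are illustrative only):

package main

import (
	"fmt"
	"net"
)

// isZeroCIDR and allHostIPs stand in for utilproxy.IsZeroCIDR and
// proxier.ipGetter.NodeIPs(); they only illustrate the control flow.
func isZeroCIDR(cidr string) bool {
	return cidr == "0.0.0.0/0" || cidr == "::/0"
}

func allHostIPs() []net.IP {
	return []net.IP{net.ParseIP("10.0.0.1"), net.ParseIP("127.0.0.1"), net.ParseIP("172.17.0.1")}
}

func main() {
	nodeAddresses := []string{"0.0.0.0/0", "::/0"} // what GetNodeAddresses returns when --nodeport-addresses is unset

	var nodeIPs []net.IP
	for _, address := range nodeAddresses {
		if !isZeroCIDR(address) {
			nodeIPs = append(nodeIPs, net.ParseIP(address))
			continue
		}
		// A zero CIDR already implies every IP on the host, so fetch them all
		// once and stop; appending further addresses would be redundant.
		nodeIPs = allHostIPs()
		break
	}
	fmt.Println(nodeIPs) // the host IPs are enumerated a single time
}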

continue
}

var lps []utilproxy.LocalPort
for address := range addresses {
for address := range nodeAddresses {


andrewsykim Jun 28, 2019

Member

nodeAddresses.List()

@andrewsykim

Member

commented Jun 28, 2019

Thanks for the PR @cezarsa!

/ok-to-test

@cezarsa cezarsa force-pushed the cezarsa:node_addrs branch 2 times, most recently from 141876b to 289430a Jun 28, 2019

if nodeAddrSet.Len() > 0 {
	nodeAddresses = nodeAddrSet.List()
	for _, address := range nodeAddresses {
		if utilproxy.IsZeroCIDR(address) {


lbernail Jul 1, 2019

Contributor

Is there a case where nodeAddresses could contain 0.0.0.0/0 and other addresses?
Because we could:

  • if nodeAddrSet.Len()==1 and nodeAddrSet[0]=0.0.0.0/0 => iterate over nodeAddrSet
  • otherwise iterate over proxier.ipGetter.NodeIPs()


cezarsa Jul 1, 2019

Author Contributor

Assuming the node has three interfaces:

eth0:    10.0.0.1, fe80::1c00:a8ff:fe00:481f
lo0:     127.0.0.1, ::1
docker0: 172.17.0.1

It's possible for nodeAddresses to contain both ::/0 and another IP, but
only when mixing IPv4 and IPv6 arguments due to checks here:

if utilnet.IsIPv6(ip) && !uniqueAddressList.Has(IPv6ZeroCIDR) {
uniqueAddressList.Insert(ip.String())
}
if !utilnet.IsIPv6(ip) && !uniqueAddressList.Has(IPv4ZeroCIDR) {
uniqueAddressList.Insert(ip.String())
}

Starting kube-proxy with --nodeport-addresses=::/0,10.0.0.1/32 would cause these values:

nodeAddresses == ["10.0.0.1", "::/0"]                     # This is the order sets.String.List would return them.
nodeIPs       == ["10.0.0.1", "127.0.0.1", "172.17.0.1"]  # IPv6 addresses possibly here also, if there are routes using them as source.

Down the line, when configuring a node port service we would have:

lps == []utilproxy.LocalPort{
    {
        IP: "10.0.0.1",
        ...
    },
    {
        IP: "",
    },
}

I'm not sure this value for lps makes much sense, but I didn't touch this part of the code. If it's okay for it to have a single empty IP entry, then I guess utilproxy.GetNodeAddresses could be modified to return only ::/0 and then skip iterating on nodeAddresses if utilproxy.IsZeroCIDR(nodeAddresses[0]) is true.

@andrewsykim

Member

commented Jul 12, 2019

/priority important-soon

}
if nodeAddrSet.Len() > 0 {
	nodeAddresses = nodeAddrSet.List()
	for _, address := range nodeAddresses {


andrewsykim Jul 22, 2019

Member

For this PR, can we keep the logic the same as what it was before and open a separate PR for the zero cidr changes?

i.e. keep this section as:

for _, address := range nodeAddresses {
	if !utilproxy.IsZeroCIDR(address) {
		nodeIPs = append(nodeIPs, net.ParseIP(address))
		continue
	}
	// zero cidr
	nodeIPs, err = proxier.ipGetter.NodeIPs()
	if err != nil {
		klog.Errorf("Failed to list all node IPs from host, err: %v", err)
	}
}


andrewsykim Jul 22, 2019

Member

FWIW I think your current changes do make more sense, since we should break on the zero CIDR, but it's easier to isolate the PRs that way.


andrewsykim Jul 22, 2019

Member

Not opposed to separate commits either actually


cezarsa Jul 23, 2019

Author Contributor

Thanks for the feedback @andrewsykim! I have updated the PR, separating the changes into two commits and trying to explain the reasoning better in the commit messages.


andrewsykim Jul 23, 2019

Member

thank you!

@cezarsa cezarsa force-pushed the cezarsa:node_addrs branch from 289430a to 7460ca2 Jul 23, 2019

if err != nil {
	klog.Errorf("Failed to get node ip address matching nodeport cidr: %v", err)
}
if nodeAddrSet.Len() > 0 {


andrewsykim Jul 23, 2019

Member

This will panic if err != nil above


cezarsa Jul 23, 2019

Author Contributor

Actually this works because the set implementation will simply call len() on the underlying map, which returns 0 even for a nil map: https://play.golang.org/p/ZO4U9-VvxPs. I agree this is confusing though; I'll update this condition to if err == nil && nodeAddrSet.Len() > 0.
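
Roughly what the linked playground snippet demonstrates (a plain map standing in for the set's underlying storage):

package main

import "fmt"

func main() {
	// len() of a nil map is 0 and does not panic, so calling Len() on a set
	// backed by a nil map is safe even when GetNodeAddresses returned an error.
	var m map[string]struct{} // nil map
	fmt.Println(len(m))       // prints 0
}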


andrewsykim Jul 23, 2019

Member

Good point though on the nil map

@cezarsa cezarsa force-pushed the cezarsa:node_addrs branch from 7460ca2 to 9fee940 Jul 23, 2019

@andrewsykim
Member

left a comment

/approve
/assign @lbernail

One really minor comment, assigning @lbernail for final lgtm

// List of node addresses to listen on if a nodePort is set.
nodeAddresses []string
// List of node IP addresses to be used as IPVS services if nodePort is
// set.


andrewsykim Jul 23, 2019

Member

small nit: make this comment into a single line?


cezarsa Jul 23, 2019

Author Contributor

Done.

@andrewsykim

Member

commented Jul 23, 2019

@cezarsa can you update the release notes please?

@k8s-ci-robot

Contributor

commented Jul 23, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, cezarsa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cezarsa added some commits Jun 26, 2019

proxy/ipvs: Only compute node ip addresses once per sync
Previously the same IP addresses would be computed for each nodePort
service and this could be CPU intensive for a large number of nodePort
services with a large number of IP addresses on the node.

proxy/ipvs: Compute all node ips only once when a zero cidr is used
Computing all node IPs twice would always happen when no node port
addresses were explicitly set. The GetNodeAddresses call would return
two zero cidrs (ipv4 and ipv6) and we would then retrieve all node IPs
twice because the loop wouldn't break after the first time.

Also, it is possible for the user to set explicit node port addresses
including both a zero and a non-zero cidr, but this wouldn't make sense
for nodeIPs since the zero cidr would already cause nodeIPs to include
all IPs on the node.
@lbernail

Contributor

commented Jul 24, 2019

/lgtm

Thanks again!

Nice improvement for later: modify GetNodeAddresses and realIPGetter.NodeIPs() to ignore the kube-ipvs0 interface (even if it runs only once now, going through possibly thousands of unnecessary IPs will take some time that we could easily avoid).
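
A rough, self-contained sketch of the kind of userspace filtering being suggested (interface naming assumed; this is not how the proxier currently enumerates node IPs):

package main

import (
	"fmt"
	"log"
	"net"
	"strings"
)

func main() {
	// Collect host addresses while skipping the IPVS dummy device by name;
	// kube-ipvs0 is where the IPVS proxier binds service IPs, so it can carry
	// thousands of addresses that are irrelevant as node IPs.
	ifaces, err := net.Interfaces()
	if err != nil {
		log.Fatal(err)
	}
	for _, iface := range ifaces {
		if strings.HasPrefix(iface.Name, "kube-ipvs") {
			continue
		}
		addrs, err := iface.Addrs()
		if err != nil {
			continue
		}
		for _, addr := range addrs {
			fmt.Println(iface.Name, addr)
		}
	}
}

Note that this only filters in userspace; as discussed further down in the thread, the kernel still enumerates all addresses when answering the underlying netlink query.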

@k8s-ci-robot k8s-ci-robot added the lgtm label Jul 24, 2019

@lbernail

Contributor

commented Jul 24, 2019

We should cherry-pick this fix to all supported releases

@k8s-ci-robot k8s-ci-robot merged commit 1dac5fd into kubernetes:master Jul 24, 2019

23 checks passed

cla/linuxfoundation cezarsa authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.
@cezarsa

Contributor Author

commented Jul 24, 2019

Thanks @lbernail! About your suggestion on ignoring the kube-ipvs0 interface: I've been looking into this, and while I'm by no means close to fully understanding how this works on the kernel side, from what I could gather there's no feasible way to tell the kernel which interfaces to list addresses or routes for.

To retrieve the IP addresses from the kernel, a netlink RTM_GETADDR request is issued. One of the flags that can be set is NLM_F_MATCH, which filters the response; however, according to the man page this flag is not implemented yet in the kernel. This also applies to getting routes from the kernel with RTM_GETROUTE. I found a somewhat old discussion on the subject here: https://lists.infradead.org/pipermail/libnl/2013-June/001014.html.
Also, a patch to handle NLM_F_MATCH on IPv6 routes was rejected in https://patchwork.ozlabs.org/patch/133440/.

To make sure this is what happens today, I ran strace both on this snippet https://play.golang.org/p/vtBjdRhwO3K, which should only list addresses for the lo interface, and on ip addr list lo:

strace -e trace=network ip addr list lo

In both cases I can see in the response messages the addresses of all interfaces being returned by the kernel (strace output edited for clarity):

sendto(3 ... type=RTM_GETADDR
...
recvmsg(3 ... if_nametoindex("lo") ... if_nametoindex("enp0s3") ... if_nametoindex("enp0s8")
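
The playground snippet itself isn't reproduced above, so as an assumption, here is a sketch of what such a test presumably looks like using only the standard library:

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Ask only for the loopback interface's addresses. As the strace above
	// shows, the kernel still answers the underlying RTM_GETADDR dump with
	// the addresses of every interface; the filtering happens in userspace.
	iface, err := net.InterfaceByName("lo")
	if err != nil {
		log.Fatal(err)
	}
	addrs, err := iface.Addrs()
	if err != nil {
		log.Fatal(err)
	}
	for _, addr := range addrs {
		fmt.Println(addr)
	}
}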

Sure, we could still try improving the way we filter out the kube-ipvs0 interface on our side; however, I'm not sure it would help a lot, since the kernel would still be doing most of the work of enumerating all IP addresses anyway.

Again, I might be wrong, as I have very little experience interacting directly with netlink. It would be great if some of you folks with more kernel/netlink expertise could take a look at this and see if it makes sense.
