NodePort will not listen on external IP if same IP is used as loadBalancerIP #114815

Closed
adr-xyt opened this issue Jan 4, 2023 · 30 comments · Fixed by #115019
Assignees
Labels
area/ipvs kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

adr-xyt commented Jan 4, 2023

What happened?

Hey, after upgrading Kubernetes from 1.22 to 1.23, I'm experiencing strange service provisioning behavior with NodePort. I'm using kube-proxy (ipvs) + calico + metallb.
If an external IP is used by another service of type LoadBalancer (allocated by metallb), then the node owning this IP does not listen on the NodePort on that interface.
Unfortunately, I can't find any changes that could be causing this. Any ideas?
Kubernetes 1.23

apiVersion: v1
kind: Service
metadata:
  annotations:
    metallb.universe.tf/loadBalancerIPs: externalip1
  name: service
spec:
  ports:
  - name: client
    port: 31010
    protocol: TCP
    targetPort: client
  type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: service2
spec:
  ports:
  - appProtocol: http
    name: http
    nodePort: 32080
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    nodePort: 32443
    port: 443
    protocol: TCP
    targetPort: https
  type: NodePort

ipvsadm -Ln | grep externalip1
node1:
    TCP  externalip1:31010 lc
    TCP  externalip1:32010 lc
node2:
    TCP  externalip1:31010 lc
    TCP  externalip1:32010 lc
node3:
    TCP  externalip1:31010 lc
    TCP  externalip1:32010 lc

Missing externalip1 here:

ipvsadm -L -n | grep 32443
node1:
    TCP  172.17.0.1:32443 lc
    TCP  10.20.10.13:32443 lc
    TCP  10.50.135.0:32443 lc
node2:
    TCP  externalip2:32443 lc
    TCP  172.17.0.1:32443 lc
    TCP  10.20.10.12:32443 lc
    TCP  10.58.88.192:32443 lc
node3:
    TCP  externalip3:32443 lc
    TCP  172.17.0.1:32443 lc
    TCP  172.26.0.1:32443 lc
    TCP  10.20.10.10:32443 lc
    TCP  10.20.10.11:32443 lc
    TCP  10.63.32.192:32443 lc

What did you expect to happen?

With k8s 1.22, a service with a NodePort listened on every interface, even when the external IP was also used by another service.

Kubernetes 1.22

apiVersion: v1
kind: Service
metadata:
  annotations:
    metallb.universe.tf/loadBalancerIPs: externalip1
  name: service
spec:
  ports:
  - name: client
    port: 31010
    protocol: TCP
    targetPort: client
  type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: service2
spec:
  ports:
  - appProtocol: http
    name: http
    nodePort: 32080
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    nodePort: 32443
    port: 443
    protocol: TCP
    targetPort: https
  type: NodePort

ipvsadm -Ln | grep externalip1
node1:
    TCP  externalip1:31010 lc
    TCP  externalip1:32010 lc
node2:
    TCP  externalip1:31010 lc
    TCP  externalip1:32010 lc
node3:
    TCP  externalip1:31010 lc
    TCP  externalip1:32010 lc

ipvsadm -Ln | grep 32443
node1:
    TCP  172.17.0.1:32443 lc
    TCP  externalip1:32443 lc
    TCP  10.10.10.1:32443 lc
    TCP  10.36.174.0:32443 lc
node2:
    TCP  172.17.0.1:32443 lc
    TCP  externalip2:32443 lc
    TCP  10.10.10.4:32443 lc
    TCP  10.45.192.0:32443 lc
node3:
    TCP  172.17.0.1:32443 lc
    TCP  externalip3:32443 lc
    TCP  10.10.10.2:32443 lc
    TCP  10.45.11.192:32443 lc

How can we reproduce it (as minimally and precisely as possible)?

Upgrade k8s from 1.22.15 to 1.23.15

Anything else we need to know?

No response

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.15", GitCommit:"b84cb8ab29366daa1bba65bc67f54de2f6c34848", GitTreeState:"clean", BuildDate:"2022-12-08T10:49:13Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.15", GitCommit:"b84cb8ab29366daa1bba65bc67f54de2f6c34848", GitTreeState:"clean", BuildDate:"2022-12-08T10:42:57Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

on-premise instances

OS version

# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ uname -a
Linux D001-FSN1DC17 6.0.8-060008-generic #202211110629-Ubuntu SMP PREEMPT_DYNAMIC Fri Nov 11 06:36:01 UTC x86_64 x86_64 x86_64 GNU/Linux

Install tools

kubeadm

Container runtime (CRI) and version (if applicable)

docker

Related plugins (CNI, CSI, ...) and versions (if applicable)

calico v3.24.5

metallb v0.13.7
ipset v7.5, protocol version: 7
ipvsadm v1.31 2019/12/24 (compiled with popt and IPVS v1.2.1)

#ipvs config:

    ipvs:
      excludeCIDRs: null
      scheduler: lc
      strictARP: true
@adr-xyt adr-xyt added the kind/bug Categorizes issue or PR as related to a bug. label Jan 4, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 4, 2023
adr-xyt (Author) commented Jan 4, 2023

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 4, 2023
uablrek (Contributor) commented Jan 4, 2023

/assign

uablrek (Contributor) commented Jan 4, 2023

From slack:

loadBalancerIP is assigned to the physical interface (eno1). It accepts traffic from another network there and I use metallb in L2 mode for that. From what I can see, in the dummy interface I have all external addresses used as loadBalancerIP and they are excluded from listening to NodePort - it is as you mentioned

This is not a valid setup. A virtual IP (VIP), like the loadBalancerIP, must not be assigned to a physical interface.

The reason to assign a VIP to a physical interface would be to attract traffic with L2 mechanisms, ARP for IPv4 and neighbor discovery for IPv6, but this way is flawed for several reasons. That's why you use metallb in L2 mode instead. Metallb in L2 mode answers L2 requests without assigning the VIP to any interface.

uablrek (Contributor) commented Jan 4, 2023

When you send traffic to a nodePort, the destination address should be a node address, not a loadBalancerIP.

uablrek (Contributor) commented Jan 4, 2023

There is however a change in the implementation (I will try to find the PR) in the way the nodePort addresses are computed. This is how it should be:

nodePort addresses = addresses on all node interfaces, excluding link-local, loopback, and all addresses on kube-ipvs0

On kube-ipvs0 all clusterIPs, loadBalancerIPs and externalIPs are assigned. So the current implementation is correct.

What I suspect is that pre-v1.23 K8s did this wrong by adding all addresses on interfaces except kube-ipvs0 (including loopback, which would not work).

adr-xyt (Author) commented Jan 4, 2023

This is not a valid setup. A virtual IP (VIP), like the loadBalancerIP, must not be assigned to a physical interface.
The reason to assign a VIP to a physical interface would be to attract traffic with L2 mechanisms, ARP for IPv4 and neighbor discovery for IPv6, but this way is flawed for several reasons. That's why you use metallb in L2 mode instead. Metallb in L2 mode answers L2 requests without assigning the VIP to any interface.

These loadBalancerIP addresses are just the external IPs of my nodes, which I use in MetalLB as the available pool for LoadBalancer services. The provider automatically assigns this address to the physical interface as a lease; I don't really have the option of a different setup at the moment.

When you send traffic to a nodePort, the destination address should be a node address, not a loadBalancerIP.

Yes, I know. In my case the public IP address pools for node addresses and loadBalancerIP addresses are the same.

uablrek (Contributor) commented Jan 4, 2023

These loadBalancerIP addresses are just the external IPs of my nodes

So, the loadBalancerIPs assigned by metallb are actually the real addresses of your K8s nodes?

adr-xyt (Author) commented Jan 4, 2023

So, the loadBalancerIPs assigned by metallb are actually the real addresses of your K8s nodes?

Yes, these are public IP addresses (each node has its own), which I use as an entry point for some LoadBalancer services.
But I also have an external load balancer that targets the NodePort. To eliminate a SPoF, it directs traffic to the NodePort service, which until now has been listening on all nodes, regardless of whether the IP is assigned to another service.

uablrek (Contributor) commented Jan 4, 2023

So, the loadBalancerIPs assigned by metallb are actually the real addresses of your K8s nodes?

@thockin @danwinship Is this setup supported?

@adr-xyt Thanks for explaining. I understand your setup now. I am unsure if it's supposed to work, so I have to ask: have you tested proxy-mode=iptables? It may work, but I wouldn't be surprised if it doesn't.

adr-xyt (Author) commented Jan 4, 2023

I will try to test it with iptables, but I use IPVS because of specific services, so that I can use the least-connection load-balancing algorithm... so this method won't solve my problem :(

uablrek (Contributor) commented Jan 4, 2023

Found the PR #101429

uablrek (Contributor) commented Jan 4, 2023

Why do you need NodePort services at all? Can't you use only LoadBalancer?

NodePort is mainly intended to be used by an external load balancer, e.g. in AWS, but when you manage your own cluster you usually don't have an external load balancer, so using only loadBalancerIPs makes sense. We do that, and even disable nodePort allocation.
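
For reference, a minimal sketch of what disabling nodePort allocation can look like, assuming the allocateLoadBalancerNodePorts field is available (beta in v1.22, GA in v1.24); the name and annotation simply mirror the example above:

apiVersion: v1
kind: Service
metadata:
  annotations:
    metallb.universe.tf/loadBalancerIPs: externalip1
  name: service
spec:
  # No nodePort is allocated for this Service; only port 31010 on the loadBalancerIP is exposed.
  allocateLoadBalancerNodePorts: false
  ports:
  - name: client
    port: 31010
    protocol: TCP
    targetPort: client
  type: LoadBalancer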

uablrek (Contributor) commented Jan 5, 2023

Why do you need NodePort services at all? Can't you use only LoadBalancer?

Scratch that! In your case, where the node addresses are the external addresses, you should use NodePort only.

If you use your node addresses as loadBalancerIP you will hit PR #108460 when you upgrade to later K8s versions. It inserts a blocking rule in the INPUT chain for loadBalancerIPs.

WARNING

This will block input to your node if loadBalancerIP==nodeIP.

adr-xyt (Author) commented Jan 5, 2023

I understand, thanks for the info. In my case this is a breaking change, and I will have to migrate some services to NodePort before the Kubernetes upgrade in the prod environment.

adr-xyt (Author) commented Jan 5, 2023

@adr-xyt Thanks for explaining. I understand your setup now. I am unsure if it's supposed to work, so I have to ask: have you tested proxy-mode=iptables? It may work, but I wouldn't be surprised if it doesn't.

@uablrek

With proxy-mode=iptables everything started working: the node is listening on the NodePort and I can reach the loadBalancerIP. I cleared the rules with kube-proxy --cleanup on each node, just to be sure.
However, this is not a solution for me; I need the lc algorithm from IPVS.

uablrek (Contributor) commented Jan 5, 2023

I will have to migrate some services to NodePort before the Kubernetes upgrade in the prod environment.

Port 31010 used in the LoadBalancer service example is within the default NodePort range. I think it can be converted to a NodePort service without affecting external users. Or is 31010 just an example? Or are there other reasons for type: LoadBalancer?

uablrek (Contributor) commented Jan 5, 2023

/area ipvs

adr-xyt (Author) commented Jan 5, 2023

Port 31010 used in the LoadBalancer service example is within the default NodePort range. I think it can be converted to a NodePort service without affecting external users. Or is 31010 just an example? Or are there other reasons for type: LoadBalancer?

This is just an example; I won't have much trouble with the migration. At most, I will divide the available lease of IP addresses into smaller pools for NodePort / LoadBalancer services.
I think I already know everything, and there should be no more surprises.
Thanks for the information and your time!
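
For illustration only, a hypothetical sketch of such a split using the MetalLB v0.13 CRDs (the pool name and addresses are placeholders; the idea is that the LoadBalancer pool no longer contains the node-owned addresses that serve NodePort traffic):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lb-only-pool
  namespace: metallb-system
spec:
  # Placeholder range reserved for type: LoadBalancer services;
  # node addresses that receive NodePort traffic stay out of this pool.
  addresses:
  - 203.0.113.10-203.0.113.20
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lb-only-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - lb-only-pool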

thockin (Member) commented Jan 5, 2023

It is a little "unusual" to use the host IP as the LB IP, but it's unfortunate that we broke something that used to work. https://www.hyrumslaw.com/

Lars, is there any path back to "working" that doesn't lose the protections in that linked PR?

I could easily see iptables also breaking this..

uablrek (Contributor) commented Jan 5, 2023

Not excluding loadBalancerIPs (and externalIPs) when they are on a physical interface is not hard, but it messes up the nice set operation.

The access protection in #108460 should basically not be there if any loadBalancerIP (or externalIP) is also on a physical interface. It can be done, but may be messy.

uablrek (Contributor) commented Jan 5, 2023

/triage accepted

Even though this issue is accepted, it's still under investigation.

I will check the code to see if the old behavior can be restored in a fairly simple way. But there is a problem with back-porting: if I make a PR now it will land in v1.26.x or v1.27, but the function was broken in v1.23, so a fix should really be back-ported all the way back.

I would prefer to document that a real address on an interface on the node can't be used as a virtual address, like loadBalancerIP or externalIP.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 5, 2023
chlam4 commented Jan 12, 2023

Just wanted to chime in and say that I've also encountered this issue... very grateful to find the discussion and looking forward to a solution. Thank you all!

Also, just to share, I've tried the following 3 workarounds and they all work in my case:

  • Switching to iptables mode. I think it works because the iptables NAT rule refers to the hostname as the destination, which gets resolved to the public node IP.
  • Disabling IPv6 on my OS. This works too, as it makes the code fall back to single-stack mode, which also grabs the node IP by looking up the hostname.
  • Explicitly configuring the nodePortAddresses field in the kube-proxy config, which overrides the default local-address discovery (see the sketch after this list).
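
For reference, a minimal sketch of the third workaround as a kube-proxy configuration fragment; the CIDRs below are placeholders for the node networks that should serve NodePort traffic:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: lc
  strictARP: true
# Placeholder CIDRs: NodePorts are bound only to node addresses within these ranges,
# which bypasses the default local-address discovery discussed in this thread.
nodePortAddresses:
- 10.20.10.0/24
- 198.51.100.0/24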

uablrek (Contributor) commented Jan 13, 2023

@adr-xyt I backed down to K8s v1.22.4 and I can't get it to work unless the master is outside the cluster (i.e. not a node), or the setup is such that the external IP you set as loadBalancerIP is not on the main K8s network, something like:

K8s is using the "backend" network, the "external" IPs are on the "frontend" network.

The reason is that when you assign a loadBalancerIP, that address is assigned to the kube-ipvs0 interface, even before v1.23. So, on an in-cluster master node you will see something like:

vm-001 ~ # ip addr show dev kube-ipvs0
15: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default 
    link/ether 1a:8d:7f:b9:0a:95 brd ff:ff:ff:ff:ff:ff
    inet 12.0.0.1/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 12.0.253.214/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 12.0.117.25/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 192.168.1.2/32 scope global kube-ipvs0

The 192.168.1.2 address is the loadBalancerIP and is also the node address of another node in the cluster, "vm-002". When the master receives "normal" packets from "vm-002" they will have src=192.168.1.2, but since that address is assigned to a local interface on the master, the packets will be discarded as martians or spoofed packets.

The result is that "vm-002" (e.g. kubelet on that node) loses contact with the master;

vm-002 ~ # kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Unable to connect to the server: dial tcp 192.168.1.1:6443: connect: no route to host

uablrek (Contributor) commented Jan 13, 2023

I discovered this when testing PR #115019. I think I can restore the pre v1.23 behavior now, but I suffer from the problem described above.

uablrek (Contributor) commented Jan 13, 2023

@chlam4 I don't think your problem is precisely the same. Disabling IPv6 or setting nodePortAddresses can't really be used as work-arounds for this issue.

chlam4 commented Jan 14, 2023

Thank you, @uablrek. You're right about IPv6 - disabling it didn't really help. I forgot to remove the nodePortAddresses I put in while testing with IPv6 disabled; it was the nodePortAddresses that made the difference. That said, it really appears to work if we explicitly put in the nodePortAddresses. Maybe I'm still missing something in the code, but if I read it correctly, here is where we take out the load balancer IP:

func (r *realIPGetter) NodeIPs() (ips []net.IP, err error) {

	nodeAddress, err := r.nl.GetAllLocalAddresses()
	if err != nil {
		return nil, fmt.Errorf("error listing LOCAL type addresses from host, error: %v", err)
	}

	// We must exclude the addresses on the IPVS dummy interface
	bindedAddress, err := r.BindedIPs()
	if err != nil {
		return nil, err
	}
	ipset := nodeAddress.Difference(bindedAddress)

	// translate ip string to IP
	for _, ipStr := range ipset.UnsortedList() {
		a := netutils.ParseIPSloppy(ipStr)
		ips = append(ips, a)
	}
	return ips, nil
}

It is called from the following, but only if nodeAddrSet contains any zero CIDRs. If a nodePort address is explicitly provided and it is NOT a zero CIDR, then ipGetter.NodeIPs() will not be called.

	if hasNodePort {
		nodeAddrSet, err := utilproxy.GetNodeAddresses(proxier.nodePortAddresses, proxier.networkInterfacer)
		if err != nil {
			klog.ErrorS(err, "Failed to get node IP address matching nodeport cidr")
		} else {
			nodeAddresses = nodeAddrSet.List()
			for _, address := range nodeAddresses {
				a := netutils.ParseIPSloppy(address)
				if a.IsLoopback() {
					continue
				}
				if utilproxy.IsZeroCIDR(address) {
					nodeIPs, err = proxier.ipGetter.NodeIPs()
					if err != nil {
						klog.ErrorS(err, "Failed to list all node IPs from host")
					}
					break
				}
				nodeIPs = append(nodeIPs, a)
			}
		}
	}

We're planning to roll out this workaround before upgrading to a future Kubernetes release with your fix. We would much appreciate it if you could shed more light on why it wouldn't work. We'd rather not switch to iptables mode if this second workaround really works. Thank you so much!

uablrek (Contributor) commented Jan 16, 2023

@chlam4 You are right. Explicitly setting nodePortAddresses in the kube-proxy config works as a work-around.

Still, the loadBalancerIP (which is also an address on a node) can't be used by the main k8s network, or else the master must not be a node in the cluster, as described in #114815 (comment). And...

Warning

This will work up to and including K8s v1.25.x, but NOT v1.26.0! I haven't checked yet but I think it's because of #108460.

uablrek (Contributor) commented Jan 16, 2023

Since the loadBalancerIP that is owned by one node is assigned to kube-ipvs0 on the other nodes, you must stop those other nodes from responding to ARP for it. For example with:

sysctl -w net.ipv4.conf.all.arp_ignore=1

adr-xyt (Author) commented Jan 17, 2023

From my perspective this is not a big problem; I have the whole architecture implemented in separate networks. I think it's worth adding this information somewhere in the documentation.

chlam4 commented Jan 19, 2023

@uablrek Thank you so much for the response, especially the info about v1.26. We'll check it out for sure and will report back. Our NodePort use case is for single-node deployments only, so luckily there are no other nodes.
