node is nat'd and doesn't know its IP address on hybrid cluster use wireguard-native is wrong #9535

Closed
vast0906 opened this issue Feb 21, 2024 · 18 comments

Comments

@vast0906

vast0906 commented Feb 21, 2024

Environmental Info:
K3s Version:
k3s -v
k3s version v1.28.6+k3s2 (c9f49a3)
go version go1.20.13

Node(s) CPU architecture, OS, and Version:

Linux master 6.1.0-13-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
Linux node-x86 6.1.0-13-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
Linux node-arm 5.15.0-1049-oracle #55-Ubuntu SMP Mon Nov 20 19:53:49 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

Cluster Configuration:

server:

  1. master
    EXTERNAL-IP: xx.xx.xx.xx
    INTERNAL-IP: 10.0.8.17

node:

  1. node-x86
    node-x86 is NAT'd and doesn't know its IP address.
    EXTERNAL-IP: xx.xx.xx.yy
    INTERNAL-IP: 192.168.36.22

  2. node-arm
    EXTERNAL-IP: xx.xx.xx.zz
    INTERNAL-IP: 10.0.1.217

  • Installed K3s:
export PUBLIC_IP=`curl -sSL https://ipconfig.sh`
export INSTALL_K3S_EXEC="--disable servicelb --kube-proxy-arg proxy-mode=ipvs  --kube-proxy-arg masquerade-all=true --kube-proxy-arg metrics-bind-address=0.0.0.0  --disable traefik --node-ip 10.0.8.17 --node-external-ip $PUBLIC_IP --flannel-backend wireguard-native --flannel-external-ip"
curl -sfL https://get.k3s.io | sh -
  • node-x86 configuration
/usr/local/bin/k3s \
    agent \
	'--node-ip' \
	'192.168.36.22' \
cat /etc/systemd/system/k3s-agent.service.env
K3S_TOKEN='K10f09c8dffcb10a0d83dbd3eb2875327de80ffe9c03a208fe68ffb5b32fa51d78e::server:5d3906836799daaa8b70851155c11190'
K3S_URL='https://xx.xx.xx.xx:6443'
  • node-arm configuration
/usr/local/bin/k3s \
    agent \
	'--node-ip' \
	'10.0.1.217' \
cat /etc/systemd/system/k3s-agent.service.env
K3S_TOKEN='K10f09c8dffcb10a0d83dbd3eb2875327de80ffe9c03a208fe68ffb5b32fa51d78e::server:5d3906836799daaa8b70851155c11190'
K3S_URL='https://xx.xx.xx.xx:6443'
  • master wg show
# wg show flannel-wg
interface: flannel-wg
  public key: Wxxxx
  private key: (hidden)
  listening port: 51820

peer: hldi2xxxx
  endpoint: xx.xx.xx.zz:51820
  allowed ips: 10.42.2.0/24
  latest handshake: 25 seconds ago
  transfer: 11.72 MiB received, 6.53 MiB sent
  persistent keepalive: every 25 seconds

peer: Ap//Dxxx
  endpoint: 192.168.36.22:51820  # It's wrong
  allowed ips: 10.42.5.0/24
  transfer: 0 B received, 33.39 KiB sent
  persistent keepalive: every 25 seconds
  • node-x86 wg show
interface: flannel-wg
  public key: Ap//xxxx
  private key: (hidden)
  listening port: 51820

peer: hldi2xxx
  endpoint: xx.xx.xx.zz:51820
  allowed ips: 10.42.2.0/24
  latest handshake: 28 seconds ago
  transfer: 1.52 KiB received, 3.16 KiB sent
  persistent keepalive: every 25 seconds

peer: Ww7xx
  endpoint: xx.xx.xx.xx:51820
  allowed ips: 10.42.0.0/24
  transfer: 0 B received, 30.06 KiB sent
  persistent keepalive: every 25 seconds
  • node-arm wg show
interface: flannel-wg
  public key: hldi26xxxx
  private key: (hidden)
  listening port: 51820

peer: Ww7xxxx
  endpoint: xx.xx.xx.xx:51820
  allowed ips: 10.42.0.0/24
  latest handshake: 8 seconds ago
  transfer: 6.53 MiB received, 15.16 MiB sent
  persistent keepalive: every 25 seconds

peer: Ap//xxxx
  endpoint: xx.xx.xx.yy:8598 # that's right
  allowed ips: 10.42.5.0/24
  latest handshake: 1 minute, 12 seconds ago
  transfer: 2.86 KiB received, 2.04 KiB sent
  persistent keepalive: every 25 seconds

Describe the bug:

  1. I am trying to create a hybrid cluster in which two nodes have public IP addresses and one node sits behind a NAT that I do not control.
  2. I am using --flannel-backend=wireguard-native to create a mesh VPN between the nodes and use that network for internal communication.
  3. wg show on every node seems to confirm that the master can communicate with the workers and that the workers can ping the master over the VPN mesh. However, pinging node-x86 from the master fails, because the wrong address (the internal IP) is configured as its endpoint.
  4. What I do not understand:
  • How do I set the correct WireGuard configuration on the master?
  • Why does the master use the internal IP as node-x86's endpoint, which is not consistent with what node-arm has for the same peer?
  5. metrics-server does not use the flannel-wg IP.

Steps To Reproduce:

Installed K3s server using configurations above
Installed K3s workers using configurations above
Expected behavior:

  • master wg show
# wg show flannel-wg
interface: flannel-wg
  public key: Wxxxx
  private key: (hidden)
  listening port: 51820

peer: hldi2xxxx
  endpoint: xx.xx.xx.zz:51820
  allowed ips: 10.42.2.0/24
  latest handshake: 25 seconds ago
  transfer: 11.72 MiB received, 6.53 MiB sent
  persistent keepalive: every 25 seconds

peer: Ap//Dxxx
  endpoint: xx.xx.xx.yy:8598 # that's right
  allowed ips: 10.42.5.0/24
  transfer: 0 B received, 33.39 KiB sent
  persistent keepalive: every 25 seconds

Actual behavior:

  • master wg show
# wg show flannel-wg
interface: flannel-wg
  public key: Wxxxx
  private key: (hidden)
  listening port: 51820

peer: hldi2xxxx
  endpoint: xx.xx.xx.zz:51820
  allowed ips: 10.42.2.0/24
  latest handshake: 25 seconds ago
  transfer: 11.72 MiB received, 6.53 MiB sent
  persistent keepalive: every 25 seconds

peer: Ap//Dxxx
  endpoint: 192.168.36.22:51820  # It's wrong
  allowed ips: 10.42.5.0/24
  transfer: 0 B received, 33.39 KiB sent
  persistent keepalive: every 25 seconds
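
The difference between the expected and the actual output can be checked mechanically. The following is a minimal Python sketch, not part of K3s or flannel: the helper names parse_wg_show and suspicious_peers are hypothetical, and it assumes the text layout of wg show pasted above with IPv4 endpoints only. It flags peers whose endpoint is a private address or that have never completed a handshake.

```python
import ipaddress


def parse_wg_show(text):
    """Return one dict per peer from `wg show <iface>` text output."""
    peers = []
    current = None
    for raw in text.splitlines():
        # Drop hand-written remarks such as "# It's wrong".
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "peer":
            current = {"peer": value}
            peers.append(current)
        elif key == "interface":
            current = None  # interface fields are not peer fields
        elif current is not None:
            current[key] = value
    return peers


def suspicious_peers(peers):
    """Names of peers with a private endpoint or no completed handshake."""
    flagged = []
    for p in peers:
        host = p.get("endpoint", "").rsplit(":", 1)[0]
        try:
            private = ipaddress.ip_address(host).is_private
        except ValueError:
            private = False  # redacted, IPv6 in brackets, or missing
        if private or "latest handshake" not in p:
            flagged.append(p["peer"])
    return flagged


# Shortened copy of the master output from this issue.
MASTER_OUTPUT = """\
interface: flannel-wg
  public key: Wxxxx
  listening port: 51820

peer: hldi2xxxx
  endpoint: xx.xx.xx.zz:51820
  allowed ips: 10.42.2.0/24
  latest handshake: 25 seconds ago

peer: Ap//Dxxx
  endpoint: 192.168.36.22:51820  # It's wrong
  allowed ips: 10.42.5.0/24
  transfer: 0 B received, 33.39 KiB sent
"""

print(suspicious_peers(parse_wg_show(MASTER_OUTPUT)))  # ['Ap//Dxxx']
```

On a live node the same functions could be fed the captured output of wg show flannel-wg instead of the pasted sample.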

Additional context / logs:

@brandond
Contributor

Did you want to enable the --flannel-external-ip option?

https://docs.k3s.io/installation/network-options?_highlight=flannel-external-ip#flannel-options

@vast0906
Author

> Did you want to enable the --flannel-external-ip option?
>
> https://docs.k3s.io/installation/network-options?_highlight=flannel-external-ip#flannel-options

--flannel-external-ip is enabled, but node-x86 is NAT'd and doesn't know its public IP address.
It should use the public address negotiated by the WireGuard peer itself.

@brandond
Contributor

brandond commented Feb 22, 2024

What specifically do you mean by "NAT'd and doesn't know its IP address"? Why doesn't it know? Did you not configure it, or does it change periodically?

Each node needs to know its external IP for Flannel to operate correctly. It cannot discover it for itself.

@vast0906
Author

vast0906 commented Feb 22, 2024

> What specifically do you mean by "NAT'd and doesn't know its IP address"? Why doesn't it know? Did you not configure it, or does it change periodically?
>
> Each node needs to know its external IP for Flannel to operate correctly. It cannot discover it for itself.

It changes periodically: it's home broadband with a dynamic IP behind a NAT that I don't control.
Can the master automatically learn the address of the WireGuard peer endpoint, as node-arm does?

@brandond
Contributor

No, it cannot. The node needs to know its public IP, and the IP should be static.

@brandond
Contributor

@manuelbuil can you think of any way that this might work without knowing the external IP?

@vast0906
Author

> No, it cannot. The node needs to know its public IP, and the IP should be static.

The WireGuard peer handshake will automatically give the master the correct IP and port.

How do I modify the WireGuard configuration on the master?

@manuelbuil
Contributor

> > No, it cannot. The node needs to know its public IP, and the IP should be static.
>
> The WireGuard peer handshake will automatically give the master the correct IP and port.
>
> How do I modify the WireGuard configuration on the master?

How are you configuring wireguard in the arm server so that it knows its public IP address?

@manuelbuil
Contributor

If you are behind a NAT, you'll need a more advanced solution such as Tailscale. Here is a good write-up of how they solve the NAT traversal problem: https://tailscale.com/blog/how-nat-traversal-works. Tailscale is integrated into K3s :) ==> https://docs.k3s.io/installation/network-options#integration-with-the-tailscale-vpn-provider-experimental

@vast0906
Author

> How are you configuring wireguard in the arm server so that it knows its public IP address?

I want to configure WireGuard on the master server.

@manuelbuil
Contributor

> endpoint: xx.xx.xx.zz:51820
>
> > How are you configuring wireguard in the arm server so that it knows its public IP address?
>
> I want to configure WireGuard on the master server.

I want to understand one thing. You deployed k3s-master, k3s-node-x86 and k3s-node-arm. You claim that the WireGuard configuration between k3s-master and k3s-node-arm is correct, and I can see that k3s-master knows about endpoint: xx.xx.xx.zz:51820. My question: did you manually configure something in WireGuard on k3s-master or k3s-node-arm so that the master knows about xx.xx.xx.zz? The docs specify that you should pass the public IP address in the config of k3s-node-arm, but you are not doing that. However, according to your description, WireGuard knows about xx.xx.xx.zz. I wonder how that can happen.

@vast0906
Author

> > endpoint: xx.xx.xx.zz:51820
> >
> > > How are you configuring wireguard in the arm server so that it knows its public IP address?
> >
> > I want to configure WireGuard on the master server.
>
> I want to understand one thing. You deployed k3s-master, k3s-node-x86 and k3s-node-arm. You claim that the WireGuard configuration between k3s-master and k3s-node-arm is correct, and I can see that k3s-master knows about endpoint: xx.xx.xx.zz:51820. My question: did you manually configure something in WireGuard on k3s-master or k3s-node-arm so that the master knows about xx.xx.xx.zz? The docs specify that you should pass the public IP address in the config of k3s-node-arm, but you are not doing that. However, according to your description, WireGuard knows about xx.xx.xx.zz. I wonder how that can happen.

You can refer to https://www.wireguard.com/#built-in-roaming. WireGuard can learn a client's current public IP and port.
But wireguard-native seems to set the peer's endpoint in the master's configuration. As that page explains, the server configuration doesn't have any initial endpoints of its peers (the clients). This is because the server discovers the endpoint of its peers by examining from where correctly authenticated data originates.
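
The roaming rule described on that page can be sketched in a few lines. This is an illustrative Python model only, not WireGuard or flannel code: the Peer class and the on_packet function are hypothetical names, and 203.0.113.7 is a documentation address standing in for the redacted xx.xx.xx.yy. It shows a peer that starts with the endpoint 192.168.36.22:51820 and adopts the NAT'd source address once a correctly authenticated packet arrives from it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Endpoint = Tuple[str, int]  # (ip, udp_port)


@dataclass
class Peer:
    public_key: str
    # A "server" may start without an endpoint and learn it later.
    endpoint: Optional[Endpoint] = None


def on_packet(peer: Peer, source: Endpoint, authenticated: bool) -> None:
    """Roaming rule: adopt the source of correctly authenticated data."""
    if authenticated:
        peer.endpoint = source
    # Unauthenticated packets never change the stored endpoint.


# The endpoint as it was configured from the node's internal IP.
node_x86 = Peer("Ap//xxxx", ("192.168.36.22", 51820))

# A forged or corrupt packet is ignored.
on_packet(node_x86, ("198.51.100.9", 4444), authenticated=False)
print(node_x86.endpoint)  # ('192.168.36.22', 51820)

# An authenticated packet arrives through the NAT, with a translated port.
on_packet(node_x86, ("203.0.113.7", 8598), authenticated=True)
print(node_x86.endpoint)  # ('203.0.113.7', 8598)
```

Under this model, a node that receives authenticated traffic from node-x86 ends up with the NAT'd address and port, while a node that only ever sends to the configured address keeps it. Whether that is what happened on node-arm and on the master is not confirmed in this thread.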

@manuelbuil
Contributor

> > > endpoint: xx.xx.xx.zz:51820
> > >
> > > > How are you configuring wireguard in the arm server so that it knows its public IP address?
> > >
> > > I want to configure WireGuard on the master server.
> >
> > I want to understand one thing. You deployed k3s-master, k3s-node-x86 and k3s-node-arm. You claim that the WireGuard configuration between k3s-master and k3s-node-arm is correct, and I can see that k3s-master knows about endpoint: xx.xx.xx.zz:51820. My question: did you manually configure something in WireGuard on k3s-master or k3s-node-arm so that the master knows about xx.xx.xx.zz? The docs specify that you should pass the public IP address in the config of k3s-node-arm, but you are not doing that. However, according to your description, WireGuard knows about xx.xx.xx.zz. I wonder how that can happen.
>
> You can refer to https://www.wireguard.com/#built-in-roaming. WireGuard can learn a client's current public IP and port. But wireguard-native seems to set the peer's endpoint in the master's configuration. As that page explains, the server configuration doesn't have any initial endpoints of its peers (the clients). This is because the server discovers the endpoint of its peers by examining from where correctly authenticated data originates.

Thanks for the information. You are correct: the WireGuard implementation in flannel creates a peer config with its endpoint: https://github.com/flannel-io/flannel/blob/master/pkg/backend/wireguard/wireguard_network.go#L177. That behaviour is the same for servers and agents, so whenever a new node joins K3s, all nodes (server or agent) are updated with the new node's information. The endpoint information comes from the annotation:

 flannel.alpha.coreos.com/public-ip:

on the new node. Therefore, I am surprised that on your master node you get endpoint: 192.168.36.22:51820 for peer: Ap//Dxxx, whereas on node-arm you get endpoint: xx.xx.xx.yy:8598 for the same peer. It should be the same on both because they consume the same information. Could you check the value of flannel.alpha.coreos.com/public-ip: on node node-x86, please? Thanks

@manuelbuil
Contributor

Apart from the previous comment, I understand that you would like to change flannel's WireGuard implementation so that it can work around the NAT issue (or the case where you don't know the public IP in advance). It might be tricky to implement because in K8s, when it comes to pod-to-pod communication, there is not really a server-client architecture. Anyway, as K3s basically uses the Flannel project for this, we should continue the discussion in the flannel repo. Could you open an issue over there, please: https://github.com/flannel-io/flannel?

@vast0906
Author

> Thanks for the information. You are correct: the WireGuard implementation in flannel creates a peer config with its endpoint: https://github.com/flannel-io/flannel/blob/master/pkg/backend/wireguard/wireguard_network.go#L177. That behaviour is the same for servers and agents, so whenever a new node joins K3s, all nodes (server or agent) are updated with the new node's information. The endpoint information comes from the annotation:
>
>  flannel.alpha.coreos.com/public-ip:
>
> on the new node. Therefore, I am surprised that on your master node you get endpoint: 192.168.36.22:51820 for peer: Ap//Dxxx, whereas on node-arm you get endpoint: xx.xx.xx.yy:8598 for the same peer. It should be the same on both because they consume the same information. Could you check the value of flannel.alpha.coreos.com/public-ip: on node node-x86, please? Thanks

Annotations:        alpha.kubernetes.io/provided-node-ip: 192.168.36.22
                    flannel.alpha.coreos.com/backend-data: {"PublicKey":"Ap//xxxx"}
                    flannel.alpha.coreos.com/backend-type: wireguard
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.36.22
                    k3s.io/hostname: node-x86
                    k3s.io/internal-ip: 192.168.36.22
                    k3s.io/node-args: ["agent","--node-ip","192.168.36.22"]
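
A small check over these annotations makes the situation explicit. This is a hedged Python sketch with a hypothetical helper name (check_flannel_annotations); it only inspects the annotation values pasted above and does not query the cluster.

```python
import ipaddress
import json


def check_flannel_annotations(annotations):
    """Summarise whether the flannel public-ip annotation looks usable."""
    public_ip = annotations.get("flannel.alpha.coreos.com/public-ip", "")
    internal_ip = annotations.get("k3s.io/internal-ip", "")
    node_args = json.loads(annotations.get("k3s.io/node-args", "[]"))
    try:
        is_private = ipaddress.ip_address(public_ip).is_private
    except ValueError:
        is_private = False  # missing or redacted value
    return {
        "public_ip_is_private": is_private,
        "public_ip_equals_internal_ip": bool(public_ip) and public_ip == internal_ip,
        # Accept both "--node-external-ip X" and "--node-external-ip=X".
        "node_external_ip_flag_passed": any(
            arg.split("=", 1)[0] == "--node-external-ip" for arg in node_args
        ),
    }


# Annotation values copied from the node-x86 output above.
NODE_X86 = {
    "alpha.kubernetes.io/provided-node-ip": "192.168.36.22",
    "flannel.alpha.coreos.com/backend-type": "wireguard",
    "flannel.alpha.coreos.com/public-ip": "192.168.36.22",
    "k3s.io/hostname": "node-x86",
    "k3s.io/internal-ip": "192.168.36.22",
    "k3s.io/node-args": '["agent","--node-ip","192.168.36.22"]',
}

print(check_flannel_annotations(NODE_X86))
```

For node-x86 all three findings point the same way: the agent was started without --node-external-ip, which suggests that the annotation used for the peer endpoint simply carries the internal address.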

@vast0906
Author

> Apart from the previous comment, I understand that you would like to change flannel's WireGuard implementation so that it can work around the NAT issue (or the case where you don't know the public IP in advance). It might be tricky to implement because in K8s, when it comes to pod-to-pod communication, there is not really a server-client architecture. Anyway, as K3s basically uses the Flannel project for this, we should continue the discussion in the flannel repo. Could you open an issue over there, please: https://github.com/flannel-io/flannel?

OK, thanks.

@manuelbuil
Contributor

> Annotations:        alpha.kubernetes.io/provided-node-ip: 192.168.36.22
>                     flannel.alpha.coreos.com/backend-data: {"PublicKey":"Ap//xxxx"}
>                     flannel.alpha.coreos.com/backend-type: wireguard
>                     flannel.alpha.coreos.com/kube-subnet-manager: true
>                     flannel.alpha.coreos.com/public-ip: 192.168.36.22
>                     k3s.io/hostname: node-x86
>                     k3s.io/internal-ip: 192.168.36.22
>                     k3s.io/node-args: ["agent","--node-ip","192.168.36.22"]

Can you confirm that your arm node does not have 192.168.36.22 as the endpoint to contact this node, and that you did not change it manually? I wonder if we are hitting some kind of race condition.

@vast0906
Author

> Can you confirm that your arm node does not have 192.168.36.22 as the endpoint to contact this node, and that you did not change it manually? I wonder if we are hitting some kind of race condition.

Running wg show on the arm node does not show 192.168.36.22 as the endpoint for this node.
I did not change it manually.
