
layer2 mode doesn't receive broadcast packets on VM unless promiscuous mode is enabled #253

Closed
michaelfig opened this issue Apr 26, 2018 · 10 comments



michaelfig commented Apr 26, 2018

Is this a bug report or a feature request?: Feature request, for documentation.

What happened:
I couldn't connect to my LoadBalancer IP from outside the host, and the documentation provided no hints.

What you expected to happen:
Traffic would flow correctly, or some kind of hint would be present in the documentation.

How to reproduce it (as minimally and precisely as possible):
I installed the stable/metallb Helm chart with the following resulting configmap:

$ kubectl get cm metallb -oyaml
apiVersion: v1
data:
  config: |
    address-pools:
    - addresses:
      - 192.168.1.39/32
      name: default
      protocol: layer2
kind: ConfigMap
metadata:
  creationTimestamp: 2018-04-26T18:22:43Z
  labels:
    app: metallb
    chart: metallb-0.5.0
    heritage: Tiller
    release: metallb
  name: metallb
  namespace: default
  resourceVersion: "3517970"
  selfLink: /api/v1/namespaces/default/configmaps/metallb
  uid: cfb29a64-497e-11e8-a2c2-6c3be52d32a5
$ kubectl get svc | grep 192.168.1
ingress-nginx-ingress-controller        LoadBalancer   192.168.223.82    192.168.1.39   80:31455/TCP,443:30129/TCP   2h

The node host's vmbr0 interface is 192.168.1.40/24. I have a second host at 192.168.1.41/24 that I expect to be able to take over the load-balancer IPs.

When I run nc -v -z 192.168.1.39 80 from another host, it just hangs. However, when I run tcpdump on the node, the netcat and HTTP connections magically start working. This is because tcpdump puts the interface into promiscuous mode.
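
(As an aside: tcpdump's -p flag tells it not to put the interface into promiscuous mode, which makes it easy to confirm that it's the promiscuous mode, and not the capture itself, that makes traffic flow. A rough sketch, using the bridge interface from my setup:)

# Capture without enabling promiscuous mode; if promiscuous mode is what matters, connections should still hang:
$ tcpdump -p -i vmbr0 arp
# A plain capture (no -p) puts vmbr0 into promiscuous mode; this is when connections start working:
$ tcpdump -i vmbr0 arp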

I solved this by putting the following into my Debian host's /etc/network/interfaces script, under the configuration for the interface in question:

iface vmbr0 inet static
       [...]
       up /bin/ip link set vmbr0 promisc on
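
To check that the flag actually took effect, the PROMISC flag should show up in the interface flags (output abbreviated):

$ ip link show vmbr0
3: vmbr0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP ...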

Environment:

  • MetalLB version: 0.6.1
  • Kubernetes version: 1.9.6
  • BGP router type/version: none (using layer2)
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 9 (stretch)
  • Kernel (e.g. uname -a): Linux pve2 4.13.16-1-pve #1 SMP PVE 4.13.16-43 (Fri, 16 Mar 2018 19:41:43 +0100) x86_64 GNU/Linux
michaelfig changed the title from "Promiscuous mode is needed to accept packets with protocol layer2" to "DOCS: Promiscuous mode is needed to accept packets with protocol layer2" on Apr 30, 2018
@danderson (Contributor)

This is interesting. This is the first time I've heard of any setup that requires promiscuous mode to make MetalLB's layer2 mode work. It specifically does not rely on anything other than broadcast Ethernet packets, which all correct Ethernet NICs and drivers should make visible to the OS.

Based on your kernel version, it looks like you're running on Proxmox. My dev environment is on Proxmox, with a bunch of clusters that work fine without the need for promiscuous mode... so there is definitely some way to make it work; now we just need to figure out what's different between your Proxmox and mine :)

What version of PVE are you running? Can you paste the output of ip addr on the PVE machine (not the VM)? What kind of NIC hardware do you have configured on the VM?

danderson changed the title from "DOCS: Promiscuous mode is needed to accept packets with protocol layer2" to "layer2 mode doesn't receive broadcast packets on VM unless promiscuous mode is enabled" on May 3, 2018

michaelfig commented May 8, 2018

Hi, sorry for the delay...

The kubelet is running directly on PVE via systemd, not in a VM. Here is my PVE version from the web UI at https://192.168.1.40:8006/:

Proxmox Virtual Environment 5.1-46

and ip addr output:

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP group default qlen 1000
    link/ether 6c:3b:e5:2d:32:a5 brd ff:ff:ff:ff:ff:ff
3: vmbr0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 6c:3b:e5:2d:32:a5 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.40/24 brd 192.168.1.255 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::6e3b:e5ff:fe2d:32a5/64 scope link
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:8d:52:28:db brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 0a:58:c0:a8:e0:01 brd ff:ff:ff:ff:ff:ff
    inet 192.168.224.1/24 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::cc3e:6bff:fea5:e73c/64 scope link
       valid_lft forever preferred_lft forever
6: vethc93787bb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default
    link/ether a6:70:3c:83:35:95 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::a470:3cff:fe83:3595/64 scope link
       valid_lft forever preferred_lft forever
7: veth43403364@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default
    link/ether 2e:65:06:66:13:8a brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::2c65:6ff:fe66:138a/64 scope link
       valid_lft forever preferred_lft forever
8: veth8c19b389@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default
    link/ether e2:1c:5c:31:0b:5c brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::e01c:5cff:fe31:b5c/64 scope link
       valid_lft forever preferred_lft forever
9: vethed5ca8c3@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default
    link/ether ee:ca:71:db:62:84 brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::ecca:71ff:fedb:6284/64 scope link
       valid_lft forever preferred_lft forever
10: vethac465a18@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default
    link/ether 86:5b:d7:32:a0:91 brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::845b:d7ff:fe32:a091/64 scope link
       valid_lft forever preferred_lft forever
12: vethfafa9b33@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default
    link/ether 32:6e:8d:6c:8a:0c brd ff:ff:ff:ff:ff:ff link-netnsid 6
    inet6 fe80::306e:8dff:fe6c:8a0c/64 scope link
       valid_lft forever preferred_lft forever
14: veth2c501807@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default
    link/ether 46:b7:81:b4:27:90 brd ff:ff:ff:ff:ff:ff link-netnsid 8
    inet6 fe80::44b7:81ff:feb4:2790/64 scope link
       valid_lft forever preferred_lft forever
15: veth56f5577c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default
    link/ether 8a:e6:18:d4:f1:3f brd ff:ff:ff:ff:ff:ff link-netnsid 9
    inet6 fe80::88e6:18ff:fed4:f13f/64 scope link
       valid_lft forever preferred_lft forever
17: vethb1aef139@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default
    link/ether 46:cc:dc:ff:a0:fd brd ff:ff:ff:ff:ff:ff link-netnsid 11
    inet6 fe80::44cc:dcff:feff:a0fd/64 scope link
       valid_lft forever preferred_lft forever
18: vethc9584dbb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default
    link/ether 2a:22:b0:4b:81:45 brd ff:ff:ff:ff:ff:ff link-netnsid 12
    inet6 fe80::2822:b0ff:fe4b:8145/64 scope link
       valid_lft forever preferred_lft forever
19: vethcf81ee97@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default
    link/ether a2:95:af:4d:53:15 brd ff:ff:ff:ff:ff:ff link-netnsid 13
    inet6 fe80::a095:afff:fe4d:5315/64 scope link
       valid_lft forever preferred_lft forever
22: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether 3e:df:93:54:5b:9d brd ff:ff:ff:ff:ff:ff
    inet 192.168.224.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::3cdf:93ff:fe54:5b9d/64 scope link
       valid_lft forever preferred_lft forever
$ cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.40
        netmask 255.255.255.0
        gateway 192.168.1.1
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0
        up /bin/ip link set vmbr0 promisc on
$

I'm using flannel for my CNI plugin, if that helps.

Michael.

@danderson (Contributor)

Thanks for the info.

My hunch is that this is a weirdness of how Linux implements bridges. Either it's filtering ARP requests based on L3 knowledge, or it's filtering inbound traffic that doesn't match local IPs (but that makes no sense; it would break IP forwarding... Just to be paranoid, does cat /proc/sys/net/ipv4/ip_forward on the PVE machine output 1?)

Once the IP is working with promisc on, if you turn promisc off again, does it keep working for a few minutes, or does it stop working immediately? When it's broken, can you run tcpdump -i any 'arp or host 192.168.1.39' on the client side and try to curl or netcat again? I want to see whether it's the ARP traffic or the TCP traffic that's getting blackholed.
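
Something like this on the client, in case it helps (a sketch; 192.168.1.39 is the service IP from above):

# Terminal 1 on the client: watch ARP plus any traffic to/from the service IP
$ tcpdump -i any -n 'arp or host 192.168.1.39'
# Terminal 2 on the client: retry the connection
$ nc -v -z 192.168.1.39 80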

I'll try to reproduce this locally, but my available time to work on MetalLB right now is very limited, so I can't promise quick progress :(


michaelfig commented May 23, 2018

So, ip_forward is indeed 1.

When I turn promisc off, nc -v -z 192.168.1.39 80 still works. In fact, I can't get it to fail now.

I'll try again tomorrow from the office where I was able to get it to fail before.

Will keep you updated.


michael-robbins commented Dec 24, 2018

As referenced in #284, I hit this as well on my Raspberry Pi cluster running MetalLB; all nodes use wlan0 to connect to the local LAN (both for inter-node traffic and for the rest of the network).

All nodes within the cluster correctly resolved the IP, and I was able to see the nginx landing page from the layer 2 tutorial.

But PCs on the local network could not; MetalLB did not respond to their ARP requests.

After enabling promiscuous mode on wlan0, all computers on the LAN were able to see the nginx landing page.
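
(For reference, the change was just the following, which is not persistent across reboots:)

$ sudo ip link set wlan0 promisc on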

Here's my log:

{"caller":"announcer.go:89","event":"createARPResponder","interface":"cni0","msg":"created ARP responder for interface","ts":"2018-12-24T05:21:18.138825423Z"}
{"caller":"announcer.go:94","error":"creating NDP responder for \"cni0\": listen ip6:ipv6-icmp fe80::8ce0:7ff:fe8a:4301%cni0: bind: invalid argument","interface":"cni0","msg":"failed to create NDP responder","op":"createNDPResponder","ts":"2018-12-24T05:21:18.139929008Z"}
{"caller":"announcer.go:94","error":"creating NDP responder for \"cni0\": listen ip6:ipv6-icmp fe80::8ce0:7ff:fe8a:4301%cni0: bind: invalid argument","interface":"cni0","msg":"failed to create NDP responder","op":"createNDPResponder","ts":"2018-12-24T05:21:28.142298761Z"}
{"caller":"announcer.go:98","event":"createNDPResponder","interface":"cni0","msg":"created NDP responder for interface","ts":"2018-12-24T05:21:38.147381253Z"}
{"caller":"announcer.go:89","event":"createARPResponder","interface":"veth24075984","msg":"created ARP responder for interface","ts":"2018-12-24T05:21:38.219083376Z"} {"caller":"announcer.go:98","event":"createNDPResponder","interface":"veth24075984","msg":"created NDP responder for interface","ts":"2018-12-24T05:21:38.219997379Z"} {"caller":"main.go:159","event":"startUpdate","msg":"start of service update","service":"default/nginx","ts":"2018-12-24T05:22:10.985143393Z"}
{"caller":"main.go:229","event":"serviceAnnounced","ip":"192.168.58.170","msg":"service has IP, announcing","pool":"dal-prahran-k8s-lb","protocol":"layer2","service":"default/nginx","ts":"2018-12-24T05:22:10.985626723Z"}
{"caller":"main.go:231","event":"endUpdate","msg":"end of service update","service":"default/nginx","ts":"2018-12-24T05:22:10.985969949Z"}

# curl 192.168.58.170 from PC times out
# Enabled wlan0 promiscuous mode

{"caller":"arp.go:102","interface":"wlan0","ip":"192.168.58.170","msg":"got ARP request for service IP, sending response","responseMAC":"b8:27:eb:e9:f3:6f","senderIP":"192.168.58.107","senderMAC":"78:24:af:be:54:e5","ts":"2018-12-24T05:43:49.764529457Z"}

The odd thing is that MetalLB doesn't say it started an ARP responder on wlan0 initially.

Interfaces on node01 which has the pod running on it hosting the example nginx service:

pi@node01:~ $ ifconfig
cni0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.244.1.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::8ce0:7ff:fe8a:4301  prefixlen 64  scopeid 0x20<link>
        ether 0a:58:0a:f4:01:01  txqueuelen 1000  (Ethernet)
        RX packets 69  bytes 11566 (11.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 203  bytes 35160 (34.3 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:25:13:ef:f6  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether b8:27:eb:bc:a6:3a  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.244.1.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::3857:1cff:fe76:92d6  prefixlen 64  scopeid 0x20<link>
        ether 3a:57:1c:76:92:d6  txqueuelen 0  (Ethernet)
        RX packets 7  bytes 450 (450.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5  bytes 1118 (1.0 KiB)
        TX errors 0  dropped 204 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 12  bytes 640 (640.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 12  bytes 640 (640.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth24075984: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 169.254.246.118  netmask 255.255.0.0  broadcast 169.254.255.255
        inet6 fe80::34bc:84ff:fe09:2bcc  prefixlen 64  scopeid 0x20<link>
        ether 36:bc:84:09:2b:cc  txqueuelen 0  (Ethernet)
        RX packets 69  bytes 12532 (12.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 300  bytes 61358 (59.9 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST>  mtu 1500
        inet 192.168.58.61  netmask 255.255.255.0  broadcast 192.168.58.255
        inet6 fe80::9466:e215:e3cc:93b4  prefixlen 64  scopeid 0x20<link>
        ether b8:27:eb:e9:f3:6f  txqueuelen 1000  (Ethernet)
        RX packets 133801  bytes 136482035 (130.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 90754  bytes 10312512 (9.8 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

@russellb (Collaborator)

I'm going to close this out as stale since it hasn't been updated in a couple of years.


arthurcgc commented Feb 2, 2022

I can confirm that this is still an issue on my Raspberry Pi 4 Model B.
Turning on promiscuous mode fixes the problem, but is there a better way?


virgil9306 commented Feb 5, 2022

I am encountering this issue as well on k3s nodes running on Proxmox, so I don't think this is Pi-specific; I'm not using a Pi.

More recent reports that I think describe the same issue are available here: #284. One of them reads:

    What I found was that upon an initial loadbalancer being created I could access the public IP until a timeout occurred, and then the service would become unavailable. If I edited the service, the public IP would become available again over the wifi until that same timeout occurred. So I plugged the Pi into ethernet on the same network and boom, it immediately worked and the service doesn't go down.

The same thing happens for me. I am trying to set the nodes' NICs to promiscuous mode to see if that changes anything.


If this is a known issue, even if it's outside the scope of MetalLB, it's something many people seem to have encountered, and it could be added to the documentation to help people who are stuck.

I don't know the root cause of this problem, however, so I can't say "in XYZ circumstances, this is how to fix it."

@bbockelm

@virgil9306 - we tripped over this locally as well. Here's the most succinct description of the problem I can come up with:

When Kubernetes uses a bridge device in its default configuration, the bridge will not pass IP traffic for load-balanced IPs assigned by MetalLB UNLESS the pod happens to be running on the same host as the L2 IP.

Reported workarounds:
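
(Summarizing what's reported earlier in this thread: enable promiscuous mode on the bridge/uplink interface that carries the LAN traffic, either one-off or persistently. A sketch, assuming a Debian-style host with a vmbr0 bridge as above:)

# One-off (lost on reboot):
$ ip link set vmbr0 promisc on

# Persistent, e.g. in /etc/network/interfaces under the bridge stanza:
#   up /bin/ip link set vmbr0 promisc on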

I believe this was recently re-triggered by #976 in MetalLB 0.11.0 because, with that code merged, the requirement that the load-balanced IP be on the same host as a pod was relaxed.

(Fun story - this one was a pain to debug because the cluster would magically start working whenever I did a tcpdump of the interface to see where the packet was getting dropped. One side effect of tcpdump, of course, is to set promiscuous mode on the bridge ... so debugging the problem "fixed" it!)

Hopefully this summary and workaround are helpful for the next person who hits this issue.

@csmithson12345

I just ran into this issue as well. Enabling promiscuous mode works.
