Regression: DSA breaks roaming to WLAN bridged to VLAN #11650

Open
1 task done
dahopem opened this issue Dec 30, 2022 · 78 comments
Labels
bug issue report with a confirmed bug

Comments

@dahopem

dahopem commented Dec 30, 2022

Describe the bug

Subject: Regression: DSA breaks roaming to WLAN bridged to VLAN

Summary

When bridging a WLAN with a VLAN on DSA-enabled OpenWRT versions, an erroneous FDB entry causes 100% packet loss for 300 seconds.

Summary analysis

When a VLAN-tagged packet is received by the device from the Ethernet (LAN) side, 2 FDB entries for the source MAC address of the packet are created in the forwarding database (FDB) of the bridge:

  1. one VLAN-tagged FDB entry and
  2. one VLAN-untagged FDB entry

Both FDB entries tell the switch to forward packets whose destination is that MAC address to that Ethernet port.

When a packet from the same source is then received by the device from the WiFi (WLAN) side (e.g. after a client device roams to that WiFi) and forwarded to the Ethernet (LAN) side (with a VLAN tag), only 1 of these 2 FDB entries is updated (namely the VLAN-tagged FDB entry). The VLAN-untagged FDB entry erroneously remains unchanged.

When another VLAN-tagged packet (e.g. an ARP reply or a DHCP reply) is then received by the device from the Ethernet (LAN) side and that packet's destination is the client device's MAC address (= the MAC address of the stale VLAN-untagged FDB entry), the switch believes that this packet ought to be forwarded to the Ethernet port (even though it was received on exactly this port), while it should actually be forwarded to the CPU of the router (and then on to the WLAN). The packet is therefore dropped.

This packet loss continues until, after about 300 seconds (5 minutes), the erroneous FDB entry expires.

Impact of this bug

This bug potentially affects all DSA-enabled platforms, and at least a sizable subset of them.

It has been verified to exist on:

  1. OpenWRT 22.03 on BT HomeHub 5a (hh5a).
  2. Turris OS 4.0 on Turris Omnia already since the year 2020.

How to reproduce

Steps

  1. Ensure that your OpenWRT device A and its software support Distributed Switch Architecture (DSA).
  2. Set up your OpenWRT device A to have at least one WiFi interface and at least one Ethernet port with VLAN support:
    1. Create bridge device "br-switch", with the list of bridge ports only consisting of port "lan2".
    2. Enable "VLAN filtering" for that bridge-device and create a VLAN with VLAN ID 31, local enabled and Egress tagged for port "lan2".
    3. Create OpenWRT interface "users" with device "br-switch.31".
    4. Create an OpenWRT WiFi interface with SSID "test". Ensure (under Interface Configuration/General Setup/Network) that it is connected to network "users".
    5. Verify by running "brctl show" on the command line (e.g. using SSH) that there is a bridge called "br-switch" with at least 2 members (one member being "lan2" and the other member being a WiFi interface, for example "wlan1-1").
    6. Reboot the OpenWRT device A.
  3. Connect your OpenWRT device A to another WiFi router B.
    1. Ensure that WiFi router B also supports VLANs.
    2. Connect an Ethernet cable to port "lan2" of OpenWRT device A and a suitable Ethernet port of WiFi router B.
    3. Set up WiFi router B such that it has a VLAN with VLAN ID 31 which is available (tagged) at the Ethernet port of WiFi router B.
    4. Set up a WiFi interface on WiFi router B with the same SSID "test". Similarly to OpenWRT device A, ensure that this WiFi interface is bridged to VLAN ID 31.
  4. Connect a client device C to OpenWRT device A.
    1. Run "ping" from the client device C to the IP address of WiFi router B.
    2. Observe that client device C receives ping replies.
  5. Roam to WiFi router B.
    1. Change the physical position of client device C to be close to WiFi router B and away from OpenWRT device A. (You may need to reduce the output power of OpenWRT device A if the devices are close to each other.)
    2. Wait 15 seconds.
    3. Verify that client device C is now associated with WiFi router B. (You may verify this by looking at the UI of WiFi router B or at the output of running "iw dev wlan0 link" on client device C.)
    4. Observe that client device C still receives ping replies although its WLAN association has changed.
  6. Roam to OpenWRT device A again.
    1. Change the physical position of client device C to be close to OpenWRT device A and away from WiFi router B. (You may need to reduce the output power of WiFi router B if the devices are close to each other.)
    2. Wait 15 seconds.
    3. Verify that client device C is now associated with OpenWRT device A. (You may verify this by looking at the UI of OpenWRT device A or at the output of running "iw dev wlan0 link" on client device C.)
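
The bridge part of step 2 can also be expressed on the command line instead of through the web UI. The following is a minimal sketch, assuming the standard OpenWrt UCI network options for bridge VLAN filtering; the function name `emit_uci_batch`, the section names `br_switch` and `vlan<id>`, and the choice of `proto 'none'` are assumptions made for this example and are not taken from the report. The WiFi side (SSID "test" attached to network "users") is not covered. The function only prints commands and changes nothing by itself.

```shell
#!/bin/sh
# Sketch: print UCI commands for a VLAN-filtering bridge with one tagged port.
# Arguments: bridge name, port name, VLAN ID, interface (network) name.
# Section names (br_switch, vlan<id>) and proto 'none' are assumptions.
emit_uci_batch() {
    bridge="$1"; port="$2"; vid="$3"; iface="$4"
    cat <<EOF
set network.br_switch=device
set network.br_switch.type='bridge'
set network.br_switch.name='${bridge}'
add_list network.br_switch.ports='${port}'
set network.br_switch.vlan_filtering='1'
set network.vlan${vid}=bridge-vlan
set network.vlan${vid}.device='${bridge}'
set network.vlan${vid}.vlan='${vid}'
add_list network.vlan${vid}.ports='${port}:t'
set network.${iface}=interface
set network.${iface}.device='${bridge}.${vid}'
set network.${iface}.proto='none'
commit network
EOF
}

# Usage on the device (review the output before applying it):
#   emit_uci_batch br-switch lan2 31 users | uci batch
```

Printing the commands first makes it possible to review them before piping them into `uci batch`; a reboot or network restart is still needed afterwards, as in step 2.6.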

Expected result

It is expected that client device C still receives ping replies although its WLAN association has changed again.

Observed result

It can be observed that client device C does not receive ping replies once its WLAN association has changed again.

How to analyze

  1. Install the OpenWRT package "ip-bridge" on your OpenWRT device A.
  2. Re-run the reproduction steps 4, 5, and 6. Assume that the MAC address of client device C is "02:ff:04:05:06:07".
  3. Run "bridge fdb show | grep 02:ff:04:05:06:07" after step 3 but before step 4. Observe that the output is empty.
  4. Run "bridge fdb show | grep 02:ff:04:05:06:07" after step 4 but before step 5. Observe that the output is akin to
    02:ff:04:05:06:07 dev wlan1-1 vlan 31 master br-switch
    
    This shows that there is one FDB entry which says that if there is a packet which is tagged with VLAN 31 and which has the destination MAC address 02:ff:04:05:06:07, then it should be forwarded through device "wlan1-1".
  5. Run "bridge -statistics fdb show | grep 02:ff:04:05:06:07" after step 5 but before step 6. Observe that the output is akin to
    02:ff:04:05:06:07 dev lan2 vlan 31 master br-switch
    02:ff:04:05:06:07 dev lan2 self
    
    This shows that there is one FDB entry which says that if there is a packet which is tagged with VLAN 31 and which has the destination MAC address 02:ff:04:05:06:07, then it should be forwarded through device "lan2", and another FDB entry which says that if there is a packet which has the destination MAC address 02:ff:04:05:06:07, then it should be forwarded through device "lan2".
  6. Run "bridge -statistics fdb show | grep 02:ff:04:05:06:07" after step 6. Observe that the output is akin to
    02:ff:04:05:06:07 dev lan2 self
    02:ff:04:05:06:07 dev wlan1-1 vlan 31 master br-switch 
    
    This shows that there is one FDB entry which says that if there is a packet which is tagged with VLAN 31 and which has the destination MAC address 02:ff:04:05:06:07, then it should be forwarded through device "wlan1-1", and another FDB entry which says that if there is a packet which has the destination MAC address 02:ff:04:05:06:07, then it should be forwarded through device "lan2". The second entry is the stale, erroneous one.

Working workarounds

Wait 5 minutes

After 5 minutes, the erroneous FDB entry expires automatically.

Delete the erroneous FDB entry explicitly once

Run "bridge fdb del 02:ff:04:05:06:07 dev lan2". Once you do this, you will observe immediately that the packets are not dropped anymore.

Delete the erroneous FDB entry explicitly automatically

This bug is so pervasive that someone has created a workaround which deletes the erroneous FDB entries automatically.

Downgrade to swconfig-enabled-OpenWRT versions (instead of DSA-enabled OpenWRT versions)

This is possible and works perfectly. However, not upgrading is not viable in the long run.

Non-working workarounds

Enabling learning_sync

Running

bridge link set dev lan2 learning_sync on

has no effect on the bug; it still occurs.

Disabling learning

Running

bridge link set dev lan2 learning off

fails with

RTNETLINK answers: Not supported

Analysis

The root cause seems to be confusion over whether VLAN-untagged FDB entries should also apply to VLAN-tagged packets.
This confusion results in unequal treatment for VLAN-tagged packets incoming through the Ethernet port and forwarded to the CPU on the one hand and VLAN-tagged packets outgoing from the CPU through the Ethernet port on the other hand:

  1. When a VLAN-tagged packet is incoming, 2 FDB entries are created (or updated)
  2. When a VLAN-tagged packet is outgoing, only 1 FDB entry is created (or updated)

Solution 1: VLAN-untagged FDB entries should not apply to VLAN-tagged packets

In this case,

  1. when a VLAN-tagged packet is incoming, only 1 FDB entry should be created (or updated),
  2. when a VLAN-tagged packet is outgoing, only 1 FDB entry should be created (or updated).

Solution 2: VLAN-untagged FDB entries should also apply to VLAN-tagged packets

In this case,

  1. when a VLAN-tagged packet is incoming, 2 FDB entries should be created (or updated),
  2. when a VLAN-tagged packet is outgoing, 2 FDB entries should be created (or updated).

This confusion is evidently software-based, as non-DSA versions of OpenWRT do not exhibit this bug.

The exact location of this confusion is (currently) unknown.

OpenWrt version

r19803-9a599fee93

OpenWrt target/subtarget

lantiq/xrx200

Device

hh5a (BT HomeHub 5a)

Image kind

Official downloaded image

Additional info

No response

Diffconfig

No response

Terms

  • I am reporting an issue for OpenWrt, not an unsupported fork.
@dahopem dahopem added the bug issue report with a confirmed bug label Dec 30, 2022
@f00b4r0
Contributor

f00b4r0 commented Jan 11, 2023

Possibly related: #11218 #11277 #11029

@Brain2000

Brain2000 commented Mar 31, 2023

I have experienced the exact same issue on the Belkin RT3200.
I worked around it with the following:

brctl setfd br-lan 0.01
brctl setageing br-lan 10

brctl setfd br-vlan8 0.01
brctl setageing br-vlan8 10
...
repeat for every vlan

This doesn't solve the initial 5 minute wait when you first boot up, but it does resolve the roaming issue once it is past the initial 5 minutes.
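
The "repeat for every vlan" step above can be scripted. The following is a minimal sketch, assuming that every bridge device exposes a "bridge" directory under /sys/class/net/<name>/; the function name `shorten_bridge_ageing` and the `DRY_RUN` variable are made up for this example. It applies the same two brctl settings (values copied from the commands above) to each bridge it finds, or only prints the commands when `DRY_RUN=1` is set.

```shell
#!/bin/sh
# Sketch: apply the forward-delay/ageing workaround to every bridge device.
# $1: sysfs network directory (defaults to /sys/class/net).
# Set DRY_RUN=1 to print the commands instead of running them.
shorten_bridge_ageing() {
    sysfs="${1:-/sys/class/net}"
    for d in "$sysfs"/*/bridge; do
        [ -d "$d" ] || continue               # no bridge found, or not a bridge
        br="$(basename "$(dirname "$d")")"    # e.g. br-lan
        for cmd in "brctl setfd $br 0.01" "brctl setageing $br 10"; do
            if [ "${DRY_RUN:-0}" = 1 ]; then
                echo "$cmd"
            else
                $cmd                          # intentionally unquoted: split into words
            fi
        done
    done
}

# Usage on the device:
#   DRY_RUN=1 shorten_bridge_ageing    # preview
#   shorten_bridge_ageing              # apply
```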

However, this issue goes further. Or perhaps me setting the lower ageing causes this secondary problem.

After a couple of days, the bridge will stop learning newly connected wifi devices altogether.

The only traffic that is passed from a connecting station is the 802.1X exchange with the RADIUS server. But that's generated by the local router anyway. Once the station has been authenticated, nothing can pass through from the station's MAC address.

I have recently added the ip-bridge utility so hopefully that might help me find a workaround.... (thank you for mentioning the package @dahopem , I would have never found it since "ip-bridge" hints at layer 3 )

In the meantime my workaround for this secondary fdb blockade is a reboot.

@Flole998
Contributor

Just for completeness: I am seeing these issues on the Netgear RBR50 and Cisco Meraki MR33, both are also IPQ40xx based. On the Forum @Ansuel suggested that further debugging is needed and this can definitely be solved, but unfortunately I'm not sure where to start.

@spolack

spolack commented May 8, 2023

I'm also affected by that issue

@Znevna
Contributor

Znevna commented Jun 14, 2023

Can anyone test if this commit fixes this issue? 244328b

@dseven

dseven commented Jun 14, 2023

Can anyone test if this commit fixes this issue? 244328b

On a BT HomeHub 5a with build from master today (git log shows above commit), I still see the issue:

root@hh5a-1:~# bridge fdb | grep xx:xx:xx:xx:xx:xx
xx:xx:xx:xx:xx:xx dev lan3 self
xx:xx:xx:xx:xx:xx dev phy1-ap0 vlan 1 master br-lan
root@hh5a-1:~#

Hope I'm not doing something stupid, but it doesn't seem to have made a difference.

As a side-note; I've never been able to get bridge fdb del to work...

root@hh5a-1:~# bridge fdb del xx:xx:xx:xx:xx:xx dev lan3
RTNETLINK answers: No such file or directory
root@hh5a-1:~#

@thebub

thebub commented Jun 20, 2023

I'm also encountering the issue on MediaTek devices (ramips/mt7621; TP-Link EAP235-Wall & TP-Link EAP615-Wall). However, I haven't had the chance to check with any of the recent builds whether the issue is fixed.

@jtkohl

jtkohl commented Jun 21, 2023

I'm seeing this work OK for me, perhaps it's an accidental configuration tweak?
When the mobile device is associated with OpenWRT access point A (Linksys MR-8300, 23.05.0-rc1), the bridge ethernet info shows:

01:02:03:04:05:06 dev eth0 self permanent
01:02:03:04:05:06 dev phy1-ap0 vlan 5 offload master br-vlan 

When the mobile device roams to the other OpenWRT access point B (Netgear R7800, 22.03.5), the bridge ethernet info on access point A changes to:

01:02:03:04:05:06 dev lan1 vlan 5 offload master br-vlan 
01:02:03:04:05:06 dev lan1 vlan 5 self 

I'm watching the system logs on both OpenWRT systems, and I see the device changing its association between the access points. Momentary packet loss (1 or 2 seconds at most), but no 5-minute delay.

brctl show shows that STP is disabled.

root@linksysAP:~# brctl show
bridge name	bridge id		STP enabled	interfaces
br-vlan		7fff.e89f8015759c	no		guest-hi
							phy2-ap0
							rep-only6
							lan4
							lan2
							lan-24g-radio1
							rep-local6
							rep-iot
							wan
							iot-2.4G
							lan-hi
							guest-24g
							lan-radio1
							rep-24ghz
							guest-low
							lan-5g-hi
							lan3
							phy1-ap0
							lan-5g-low
							lan1
							lan-low

I think that matters: when I turn on STP, roaming behavior becomes inconsistent and looks similar to what the earlier reports above describe.

@Flole998
Contributor

You only have a single vlan bridge configured. I think that might be important. In my case I had a bridge without vlan tagging and another one with vlan tagging.

@dseven

dseven commented Jun 21, 2023

I see the same as above on my MR8300's. Note that in that case the "other" FDB entry points to eth0, which I believe is the management interface for the switch, as opposed to one of the lanX ports, and it does not go away after 5 minutes. This still looks wrong to me, but I don't know DSA well enough to tell if it's OK or not. In any case, the problem described in this issue (traffic not passing until the old FDB entry expires) does not occur on MR8300 for me either.

The problem does occur on a BT Home Hub 5 type A, and STP is not in play.

@krokodilerian

FWIW. TP-Link EAP615 (mt7621), no STP, running 2 ESSIDs, each in a separate VLAN. Had exactly this problem with 22.03.5, 23.05r2 works. Haven't tried to bisect it.

@thebub

thebub commented Jul 29, 2023

I will try one of the more recent builds next week to see whether this is solved.

@Flole998
Contributor

For mt76xx there has been a commit that supposedly fixes it, that doesn't help on ipq40xx and other affected architectures though.

@dseven

dseven commented Jul 30, 2023

For mt76xx there has been a commit that supposedly fixes it, that doesn't help on ipq40xx and other affected architectures though.

A link to the commit or/and discussion about how it supposedly fixes this issue would be helpful, so we can see if there's any commonality....

@Flole998
Contributor

Try searching for something like "mt76 dsa roaming".

@thebub

thebub commented Jul 30, 2023

I think he might be referring to this commit, as listed in the 23.05-rc2 changelog:
09322f3 kernel: remove bridge offload hack (-846)

@Flole998
Contributor

I'm referring to #9706 which contains a patch file as well.

@schuettecarsten

I'm referring to #9706 which contains a patch file as well.

Please let me know if the patch fixes your issue.

@Flole998
Contributor

For obvious reasons the patch won't fix any issue on non-mt76 hardware. So I don't even need to test it on ipq40xx.

@dseven

dseven commented Jul 30, 2023

I'm referring to #9706 which contains a patch file as well.

#9706 is still open. There is a patch that seems to make the problem go away for some people. A patch file is not a commit - it's just a file that someone posted with some code changes - there is no indication that it has been committed to any git repo. It also sounds like it's dodging the issue by configuring things in a way that the OpenWRT developers don't want to (allowing switch learning). This would not explain why the problem has apparently gone away for @krokodilerian anyway (since, AFAICT, it has not been committed / merged)....

@Flole998
Contributor

That patch file is apparently a ported version of some other commit. There is a commit somewhere as well, but I couldn't find it when doing a quick search.

@quarkysg
Contributor

quarkysg commented Aug 1, 2023

@Flole998 if you are referring to the patch that @schuettecarsten linked to, it's not ported from anywhere. I just commented out the code that disabled MAC autolearning on the mt7530/1 switch.

My analysis of the issue suggested that the DSA framework is not updating the switch's ARL tables at all, and with the switch's autolearning disabled, the switch has no way of forwarding the network packets it receives correctly. Turning on the switch's autolearn feature seems to solve the issue.

Edit:
To add on, the patch only works for routers with mt7530/1 network switches. It does not work for any other network switches.

@rbubley

rbubley commented Sep 17, 2023

Can anyone test if this commit fixes this issue? 244328b

On a BT HomeHub 5a with build from master today (git log shows above commit), I still see the issue:

root@hh5a-1:~# bridge fdb | grep xx:xx:xx:xx:xx:xx
xx:xx:xx:xx:xx:xx dev lan3 self
xx:xx:xx:xx:xx:xx dev phy1-ap0 vlan 1 master br-lan
root@hh5a-1:~#

Hope I'm not doing something stupid, but it doesn't seem to have made a difference.

As a side-note; I've never been able to get bridge fdb del to work...

root@hh5a-1:~# bridge fdb del xx:xx:xx:xx:xx:xx dev lan3
RTNETLINK answers: No such file or directory
root@hh5a-1:~#

FWIW, a workaround that works for me on HH5A (based on various hints and pointers above and elsewhere):

A file, /etc/init.d/fixfdbservice with execute permission containing:

#!/bin/sh /etc/rc.common

USE_PROCD=1
START=99
STOP=01

start_service() {
    procd_open_instance
    procd_set_param command /bin/fixfdb.sh
    procd_set_param stdout 1
    procd_set_param stderr 1
    procd_close_instance
}

Together with a file, /bin/fixfdb.sh, again with execute permission, containing:

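#!/bin/sh

# Watch FDB updates; when a MAC address shows up on one device, delete
# entries for the same MAC address on any other device of br-lan.
/usr/sbin/bridge monitor fdb | while read -r mac d dev rest; do
    # Deletion events start with "Deleted"; skip them.
    if [ "$mac" != Deleted ]; then
        echo "Found $mac on $dev"
        /usr/sbin/bridge fdb show br br-lan | while read -r omac d odev rest; do
            if [ "$omac" = "$mac" ] && [ "$odev" != "$dev" ]; then
                echo "Removing old entry from $odev"
                # $rest is intentionally unquoted so that flags such as
                # "vlan 1 self" are passed as separate arguments.
                /usr/sbin/bridge fdb del "$mac" dev "$odev" $rest
            fi
        done
    fi
done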

(This relies on the ip-bridge package: opkg install ip-bridge)

@rondoval

rondoval commented Sep 28, 2023

Redmi AX6000 running 23.05.0-rc3 and 23.05 snapshots is affected too.
Problem is, the workaround does not work either.
Given e.g. two entries:

xx:xx:xx:xx:xx:xx dev wan vlan 206 master br-lan
xx:xx:xx:xx:xx:xx dev wan vlan 206 self

I can do:
bridge fdb del xx:xx:xx:xx:xx:xx dev wan vlan 206 master
but

bridge fdb del xx:xx:xx:xx:xx:xx dev wan vlan 206 self
RTNETLINK answers: No such file or directory

@dseven

dseven commented Sep 29, 2023

@rondoval Do you have an actual problem, where traffic is not passing (for some period)? Your example above shows two entries, but they both point to the wan device, so there doesn't appear to be a conflict, where in the problem case we have entries pointing to two different devices. I'm not sure that this is a manifestation of the issue that we're dealing with (or not dealing with!) here.

@rondoval

rondoval commented Oct 2, 2023

@rondoval Do you have an actual problem, where traffic is not passing (for some period)? Your example above shows two entries, but they both point to the wan device, so there doesn't appear to be a conflict, where in the problem case we have entries pointing to two different devices. I'm not sure that this is a manifestation of the issue that we're dealing with (or not dealing with!) here.

Yes, I do have a problem where I end up having (after roaming between APs) entries pointing to both a local wifi device (e.g. phy0-ap1) and LAN interface (in my case the wan device - just a name of the port on Redmi AX6000).

This was an example meant to show that the workaround mentioned in posts above does not work on Redmi AX6000, because it is not possible to remove one of the entries pointing towards LAN. One has to wait till it expires.

I'll post my config and command outputs from the issue itself later today.

@rany2
Contributor

rany2 commented Feb 23, 2024

@corvin84 that offload parameter is obviously not right, maybe try to fix the script so it's not sent?

Edit: so I think it should be /usr/sbin/bridge fdb del ac:80:fb:bc:aa:d1 dev lan1 vlan 11 master br-lan (without offload)

@corvin84

/usr/sbin/bridge fdb del ac:80:fb:bc:aa:d1 dev lan1 vlan 11 master br-lan

That was the first thing I tried, but I get:

root@router:/etc/config# /usr/sbin/bridge fdb del ac:80:fb:bc:aa:d1 dev lan1 vlan 11 master br-lan
Error: either "to" is duplicate, or "br-lan" is a garbage.

Although it is still there:

root@router:/etc/config#  /usr/sbin/bridge fdb show br br-lan |grep ac:80:fb:bc:aa:d1
ac:80:fb:bc:aa:d1 dev phy1-ap0 vlan 11 offload master br-lan

@rany2
Contributor

rany2 commented Feb 23, 2024

@corvin84 I don't know but maybe try without br-lan as well? So /usr/sbin/bridge fdb del ac:80:fb:bc:aa:d1 dev lan1 vlan 11 master

@corvin84

@corvin84 I don't know but maybe try without br-lan as well? So /usr/sbin/bridge fdb del ac:80:fb:bc:aa:d1 dev lan1 vlan 11 master

The script works as expected, but there is an error in the way the command output is checked. The entry is indeed removed correctly. Thanks all. Is there any positive sign that it will be fixed in one of the upcoming versions?

@rbubley

rbubley commented Feb 24, 2024

@corvin84 do you have a fix for the script? Happy to amend...

@corvin84

corvin84 commented Feb 24, 2024

@corvin84 do you have a fix for the script? Happy to amend...

Hi,

So I added sed to remove offload and br-lan from the end of the lines, because these should not be part of the fdb del command:

/usr/sbin/bridge fdb show br "$bridge_device" | sed -e 's/offload\s//g' -e 's/br-lan\s$//g' | while read -r omac d odev rest; do

However, it still does not work, and I have no idea why:

# Log
Sat Feb 24 20:27:46 2024 daemon.info fixfdb.sh[4089]: Attempting to remove old entry for ac:80:fb:bc:aa:d1 from lan1 as it should now be on lan4
Sat Feb 24 20:27:46 2024 daemon.info fixfdb.sh[4089]: Error running /usr/sbin/bridge fdb del ac:80:fb:bc:aa:d1 dev lan1 vlan 11 master
Sat Feb 24 20:27:46 2024 daemon.info fixfdb.sh[4089]: Found ac:80:fb:bc:aa:d1 on lan1

# So it is still on lan1
root@router:~# bridge fdb show br br-lan dev lan4 vlan 11 | grep ac:80
root@router:~# bridge fdb show br br-lan dev lan1 vlan 11 | grep ac:80
ac:80:fb:bc:aa:d1 vlan 11 master br-lan
ac:80:fb:bc:aa:d1 vlan 11 self

# I can run the command without any problem:
root@bzs_router:~# /usr/sbin/bridge fdb del ac:80:fb:bc:aa:d1 dev lan1 vlan 11 master
root@bzs_router:~# echo $?
0

# To confirm: 
root@bzs_router:~# bridge fdb show br br-lan dev lan4 vlan 11 | grep ac:80
root@bzs_router:~# bridge fdb show br br-lan dev lan1 vlan 11 | grep ac:80
ac:80:fb:bc:aa:d1 vlan 11 self

@corvin84

So I tried various things. For some reason, if the FDB entry contains self, I am unable to delete it, so I filtered those entries out and also stripped offload and the trailing br-lan from the lines. Even though I can see the script doing its job, I still have this roaming problem. Any other suggestions? 22.03 was trouble-free for me; I started to see this issue when I upgraded to 23. I already used DSA on 22.03...
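
The filtering described above can be sketched as a small helper that turns one line of "bridge fdb show" output into arguments for "bridge fdb del". This is a minimal sketch, not the script actually used in this thread: the function name `fdb_del_args` is made up, and the choice to keep only the MAC address, "dev", "vlan" and a bare "master" follows the command form reported to work earlier in this thread. Lines containing "self" are dropped, since deleting those entries was reported to fail.

```shell
#!/bin/sh
# Sketch: convert "bridge fdb show" lines (stdin) into argument strings for
# "bridge fdb del" (stdout). Drops "offload" and the bridge name after
# "master"; skips entries flagged "self".
fdb_del_args() {
    awk '
        {
            for (i = 1; i <= NF; i++)
                if ($i == "self") next        # reported as not deletable
            out = $1                          # MAC address
            for (i = 2; i <= NF; i++) {
                if ($i == "dev" || $i == "vlan") {
                    out = out " " $i " " $(i + 1)
                    i++                       # consume the value
                } else if ($i == "master") {
                    out = out " master"
                    i++                       # drop the bridge name
                }
                # anything else (e.g. "offload") is ignored
            }
            print out
        }
    '
}

# Usage on the device (prints the commands instead of running them):
#   bridge fdb show br br-lan | fdb_del_args | sed 's|^|bridge fdb del |'
```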

@rany2
Contributor

rany2 commented Feb 26, 2024

@corvin84 you could avoid this issue altogether by just using the DSA interfaces (lan1, lan2, etc.) directly instead of doing VLANs on the bridge itself containing those DSA interfaces.

So lan1.<vlanid> instead of br-vlan.<vlanid>. That's what I've done and it works fine. IMHO it's the least hacky solution to all this, but you will need to have your entire network be tagged; it won't work if you need untagged traffic.

@rany2
Contributor

rany2 commented Feb 26, 2024

but you will need to have your entire network be tagged; won't work if you need untagged traffic.

To clarify, you could still have other ports be untagged but you can't have a single port, say lan1, be both tagged and untagged. It won't work reliably, behaves very weirdly.

@corvin84

but you will need to have your entire network be tagged; won't work if you need untagged traffic.

To clarify, you could still have other ports be untagged but you can't have a single port, say lan1, be both tagged and untagged. It won't work reliably, behaves very weirdly.

Interesting, is it only required to be set like that on the router or on the dumb APs as well? Additionally, all the ports on my router are actually trunk ports, as all four ports are connected to APs/switches. Can I setup multiple VLANs per LAN port without a bridge? Do you know if it is ever going to be fixed? I had no trouble on 22... Thank you.

@rany2
Contributor

rany2 commented Feb 27, 2024

is it only required to be set like that on the router or on the dumb APs as well?

I've done so on all my routers/dumb APs to be safe. I don't know if you need to do it on both.

Can I setup multiple VLANs per LAN port without a bridge?

Yes, that is fine.

Do you know if it is ever going to be fixed?

No idea.

@corvin84

is it only required to be set like that on the router or on the dumb APs as well?

I've done so on all my routers/dumb APs to be safe. I don't know if you need to do it on both.

Can I setup multiple VLANs per LAN port without a bridge?

Yes, that is fine.

Do you know if it is ever going to be fixed?

No idea.

Okay, I will do the same. I guess it is not an issue if I have end devices connected to one of the APs and I use PVID?

@fascinatedcow

@corvin84 you could avoid this issue altogether by just using the DSA interfaces (lan1, lan2, etc.) directly instead of doing VLANs on the bridge itself containing those DSA interfaces.

So lan1.<vlanid> instead of br-vlan.<vlanid>. That's what I've done and it works fine. IMHO it's the least hacky solution to all this, but you will need to have your entire network be tagged; it won't work if you need untagged traffic.

Hello. Did you have to do anything particularly special to get this to work? I tried like so:

  1. Removed lan2 and lan3 from my existing bridge (br-lan)
  2. Created lan2.4 and lan3.4 (vlan 4 was not previously in use anywhere)
  3. Created a new bridge (expvlan4) and added lan2.4 and lan3.4 to it
  4. Assigned an IP to expvlan4

But I was not able to get any traffic through it. Connecting a computer to one of the ports and sending tagged frames I wasn't even able to ping the address I'd assigned to expvlan4. Did I miss something?

I agree, this solution looks the least hacky and is more similar to how things used to look with swconfig.

@corvin84

@corvin84 you could avoid this issue altogether by just using the DSA interfaces (lan1, lan2, etc.) directly instead of doing VLANs on the bridge itself containing those DSA interfaces.
So lan1.<vlanid> instead of br-vlan.<vlanid>. That's what I've done and it works fine. IMHO it's the least hacky solution to all this, but you will need to have your entire network be tagged; it won't work if you need untagged traffic.

Hello. Did you have to do anything particularly special to get this to work? I tried like so:

1. Removed `lan2` and `lan3` from my existing bridge (`br-lan`)

2. Created `lan2.4` and `lan3.4` (vlan 4 was not previously in use anywhere)

3. Created a new bridge (`expvlan4`) and added `lan2.4` and `lan3.4` to it

4. Assigned an IP to `expvlan4`

But I was not able to get any traffic through it. Connecting a computer to one of the ports and sending tagged frames I wasn't even able to ping the address I'd assigned to expvlan4. Did I miss something?

I agree, this solution looks the least hacky and is more similar to how things used to look with swconfig.

Why did you create a new bridge? I think the suggestion is to not use any bridge interface and to use VLAN tagging on a per-port basis. I have not tried this yet, though. I really hope there will be a new OpenWRT release shortly; if it is still broken then, I will start messing with my network. BTW, do you know how a PVID can be set on a non-bridged port?

@fascinatedcow


Not sure I understand what PVID is, tbh, even after a bit of Googling.

Regarding the two bridges, the first one, br-lan, was already there. This is a configuration I'm testing on an "unclean" router with all of my current configuration on it. Thanks for clarifying. Not sure I understand why one would want to tag the ports and then not bridge them, though. How would a frame coming in on, say, lan2.4 get forwarded to lan3.4 without a bridge? I guess I misunderstood the use case.

Also, trying to hold out for a fix, but starting to consider rolling back to 19.x, where everything worked great as far as I recall.

@enmaskarado
Contributor

enmaskarado commented Mar 16, 2024

Does this issue continue with the new 6.1 and 6.6 kernels on main?

@corvin84

corvin84 commented Mar 17, 2024


I assume that to use a new kernel, one needs to install a snapshot build, right? I am on the official 23.05.2.

@rany2
Contributor

rany2 commented Mar 17, 2024

@enmaskarado Yes, the issue still exists in 6.6.

@corvin84 You'll need to enable the testing kernel in make menuconfig and compile OpenWRT yourself, but in this case it doesn't matter because the issue is not fixed in 6.6.

@corvin84


Thanks @rany2 and @enmaskarado. I am fairly new to OpenWRT; what can one do in this case?

@rany2
Contributor

rany2 commented Mar 17, 2024

Hello. Did you have to do anything particularly special to get this to work? I tried like so:

1. Removed `lan2` and `lan3` from my existing bridge (`br-lan`)

2. Created `lan2.4` and `lan3.4` (vlan 4 was not previously in use anywhere)

3. Created a new bridge (`expvlan4`) and added `lan2.4` and `lan3.4` to it

4. Assigned an IP to `expvlan4`

But I was not able to get any traffic through it. After connecting a computer to one of the ports and sending tagged frames, I wasn't even able to ping the address I'd assigned to expvlan4. Did I miss something?

It works fine for me; this looks OK. I did have an issue with /etc/init.d/network reload not working, so you might be better off rebooting to make sure it was applied properly.

@rany2
Contributor

rany2 commented Mar 17, 2024

Thanks @rany2 and @enmaskarado. I am fairly new to OpenWRT; what can one do in this case?

I don't know, the workaround I've described previously worked OK for me. Unfortunately it seems like it might be "hit or miss"

@corvin84


Yup. The only reason I have not tried it yet is that I am not sure how I can define a PVID without DSA/bridge.

@slogen

slogen commented Mar 31, 2024

I am also affected by this issue on 5x Belkin RT3200 acting as dumb APs (since at least 23.05.2, upgraded from 22). It took some time before I found out what the issue was!

I have complicated VLAN and WLAN setups, with guest MACs on the network. Thus, the only workaround I have gotten to work is disabling FDB learning on the physical port I use for uplink, as suggested by @ccdunder in #11650 (comment).

@corvin84


What do you mean by the port used for uplink? The ports connected to the switches? Did you do it on every device or just on the router? How can MAC learning be disabled? Thanks!

@slogen

slogen commented Mar 31, 2024

@corvin84 I have selected the WAN port to connect to my (dumb-switched) network. I have all 5 ports (lan1-4+wan) on the devices in a "br-lan" bridge, providing untagged "guest" traffic and tagged separated traffic, with individual WLAN (for net X) hooked up to br-lan(.X).

I have simply configured the wan port to not learn, which works around the bug (thanks @ccdunder in #11650). You can do that by setting learning=0 for the device (at index 1 on my devices):

root@ap3:~# uci show 'network.@device[1]' | egrep -e '(learning|name)'
network.cfg040f15.name='wan'
network.cfg040f15.learning='0'

Which I configured with:

root@ap3:~# uci set "network.cfg040f15.learning"=0
root@ap3:~# uci commit

If you use the luci GUI, you can do this with Network => Interfaces => "wan" => Configure => "bridge port specific options" => "Enable MAC address learning" => disabled.
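
Because the device section id (`cfg040f15` above) and its index differ between routers, a small helper can look the section up by port name instead. The following is a sketch only: the function name `learning_off_cmds` is made up for illustration, it assumes `uci show network` prints lines of the form `network.<section>.name='<port>'` as in the output above, and it merely prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: find the device section named "$1" in `uci show network` output
# (read from stdin) and print the commands that would disable learning on it.
# "learning_off_cmds" is a hypothetical helper name, not an OpenWRT tool.

learning_off_cmds() {
    dev="$1"
    # Matching lines look like: network.cfg040f15.name='wan'
    section="$(sed -n "s/^network\.\([^.]*\)\.name='${dev}'\$/\1/p" | head -n 1)"
    if [ -z "$section" ]; then
        echo "no device section named ${dev}" >&2
        return 1
    fi
    echo "uci set network.${section}.learning='0'"
    echo "uci commit network"
}

# On the router this would be: uci show network | learning_off_cmds wan
# Here it is fed the section shown in the output above.
printf "%s\n" "network.cfg040f15=device" "network.cfg040f15.name='wan'" \
    | learning_off_cmds wan
```

If no device section exists for the port yet, the helper reports that on stderr and returns non-zero; in that case the section would have to be created first (for example through the luci dialog described above).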

@corvin84


This is pretty much the way my network is configured. I have one router (WAN is obviously connected to the ISP modem); LAN1 and LAN2 are connected to the dumb switches' WAN ports. I disabled MAC learning on the dumb switches and rebooted all my devices. Everything works now, but it also worked before, until I moved around the house with my devices and they roamed. I will circle back if I have an update. Thanks for the quick response; fingers crossed this fixes the problem. Actually, why does it fix it? Why does MAC learning cause the trouble, and how can the network function without MAC learning?

@corvin84

corvin84 commented Apr 1, 2024


Interestingly, even though I set this on the WAN ports of all the dumb switches, I am still facing the same issue. Did you disable MAC learning only on the dumb switches' WAN ports, or also on the main router's LAN port that they connect to?

@corvin84

corvin84 commented Apr 4, 2024


Did you do anything else? Simply disabling MAC learning on the APs' WAN port (part of the bridge), which is connected to the router via cable, does not fix the problem. If I disable MAC learning on the router port the AP is connected to, then I cannot join the network through that AP. Thanks

@corvin84

corvin84 commented Apr 8, 2024

Fixed this on 23.05 by disabling MAC address learning (learning) on the trunked network device(s).

On 22.03.03, the FDB shows the same issue, but strangely it doesn't cause a problem. Also, learning can't be disabled on the VLAN trunk there: it automatically re-enables itself.

Hi, did you do anything else? @slogen followed what you did and it works for him, but not for me. I have an OpenWRT router with two additional dumb APs, also running OpenWRT. I have two VLANs for LAN and IOT (so most of the ports are trunk ports) and a separate WiFi for each. On the APs, the WAN port is part of the bridge and I use the WAN link to connect to the main router. I disabled MAC learning on the WAN ports of the APs, but I am still having issues.

If I start from scratch (reboot everything), things work: e.g. if I am connected to one of the APs with my phone, I can access the internet and everything on my local network. If I move close to the router, things still work. However, from that point on, whenever I am connected to one of the APs rather than the router, I can no longer access one of the APs or the LAN and IOT devices, until I move close to the router again. I had no issues with this setup on 22.XX; things started to break when I upgraded to 23. Thanks
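
When comparing setups like this, it may help to check whether a roamed client still has a stale FDB entry on the wired port. The sketch below filters `bridge fdb show` output for one MAC address and marks entries without a VLAN id. It assumes the iproute2 `bridge` tool is installed and that each output line starts with the MAC followed by `dev <port>` and an optional `vlan <id>`; the function name, MAC addresses and interface names in the sample are invented for illustration and are not taken from any device in this thread.

```shell
#!/bin/sh
# Sketch: list FDB entries for one MAC from `bridge fdb show` output on stdin,
# printing the port and the VLAN id ("none" when the entry has no vlan field).
# "fdb_entries_for" is a hypothetical helper name.

fdb_entries_for() {
    mac="$(echo "$1" | tr 'A-F' 'a-f')"
    awk -v mac="$mac" '
        tolower($1) == mac {
            dev = "?"; vlan = "none"
            for (i = 2; i < NF; i++) {
                if ($i == "dev")  dev  = $(i + 1)
                if ($i == "vlan") vlan = $(i + 1)
            }
            print "dev=" dev " vlan=" vlan
        }'
}

# On the router this would be: bridge fdb show br br-lan | fdb_entries_for <client MAC>
# Made-up sample input: one tagged entry on a WiFi interface, one untagged on wan.
printf "%s\n" \
    "02:00:00:00:00:01 dev phy0-ap0 vlan 4 master br-lan" \
    "02:00:00:00:00:01 dev wan master br-lan" \
    "02:00:00:00:00:02 dev lan1 vlan 4 master br-lan" \
    | fdb_entries_for 02:00:00:00:00:01
```

A client that shows up on the WiFi interface for its VLAN while an entry without a VLAN id still points at the wired uplink would match the behaviour described at the top of this issue; which port that is on each AP and on the router is what the learning setting would need to target.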
