Regression: DSA breaks roaming to WLAN bridged to VLAN #11650
Comments
I have experienced the exact same issue on the Belkin RT3200.
This doesn't solve the initial 5-minute wait when you first boot up, but it does resolve the roaming issue once the initial 5 minutes have passed. However, this issue goes further, or perhaps my setting a lower ageing time causes this secondary problem: after a couple of days, the bridge stops learning newly connected WiFi devices altogether. The only traffic that passes from a connecting station is the 802.1X exchange with the RADIUS server, but that is generated by the local router anyway. Once the station has been authenticated, nothing can pass through from the station's MAC address. I have recently added the ip-bridging utility, so hopefully that will help me find a workaround (thank you for mentioning the package, @dahopem; I would never have found it, since "ip-bridging" hints at layer 3). In the meantime, my workaround for this secondary FDB blockade is a reboot. |
Just for completeness: I am seeing these issues on the Netgear RBR50 and Cisco Meraki MR33, both are also IPQ40xx based. On the Forum @Ansuel suggested that further debugging is needed and this can definitely be solved, but unfortunately I'm not sure where to start. |
I'm also affected by that issue |
Can anyone test if this commit fixes this issue? 244328b |
On a BT HomeHub 5a with build from master today (git log shows above commit), I still see the issue:
Hope I'm not doing something stupid, but it doesn't seem to have made a difference. As a side-note; I've never been able to get
|
I'm also encountering the issue on MediaTek devices (ramips/mt7621; TP-Link EAP235-Wall & TP-Link EAP615-Wall). However, I haven't had the chance to check with any recent build whether the issue is fixed. |
I'm seeing this work OK for me, perhaps it's an accidental configuration tweak?
When the mobile device roams to the other OpenWRT access point B (Netgear R7800, 22.03.5), the bridge ethernet info on access point A changes to:
I'm watching the system logs on both OpenWRT systems, and I see the device changing its association between the access points. Momentary packet loss (1 or 2 seconds at most), but no 5-minute delay.
I think that matters: when I turn on STP, roaming behavior becomes inconsistent and similar to that described in the earlier reports above. |
You only have a single vlan bridge configured. I think that might be important. In my case I had a bridge without vlan tagging and another one with vlan tagging. |
I see the same as above on my MR8300's. Note that in that case the "other" FDB entry points to The problem does occur on a BT Home Hub 5 type A, and STP is not in play. |
FWIW. TP-Link EAP615 (mt7621), no STP, running 2 ESSIDs, each in a separate VLAN. Had exactly this problem with 22.03.5, 23.05r2 works. Haven't tried to bisect it. |
I will try one of the more recent builds next week to see whether this is solved. |
For mt76xx there has been a commit that supposedly fixes it. That doesn't help on ipq40xx and other affected architectures, though. |
A link to the commit and/or a discussion about how it supposedly fixes this issue would be helpful, so we can see if there's any commonality.... |
Try searching for something like "mt76 dsa roaming". |
I think he might be referring to this commit, as listed in the 23.05-rc2 changelog: |
I'm referring to #9706, which contains a patch file as well. |
Please let me know if the patch fixes your issue. |
For obvious reasons the patch won't fix any issue on non-mt76 hardware. So I don't even need to test it on ipq40xx. |
#9706 is still open. There is a patch that seems to make the problem go away for some people. A patch file is not a commit - it's just a file that someone posted with some code changes - there is no indication that it has been committed to any git repo. It also sounds like it's dodging the issue by configuring things in a way that the OpenWRT developers don't want to (allowing switch learning). This would not explain why the problem has apparently gone away for @krokodilerian anyway (since, AFAICT, it has not been committed / merged).... |
That patch file is apparently a ported version of some other commit. There is a commit somewhere as well, but I couldn't find it when doing a quick search |
@Flole998 if you are referring to the patch that @schuettecarsten linked to, it's not ported from anywhere. I just commented out the code that disabled mt7530/1 switch MAC autolearn. My analysis of the issue suggested that the DSA framework is not updating the switch ARL tables at all, and with the switch's autolearn disabled, the switch has no way of correctly forwarding the network packets it receives. Turning on the switch's autolearn feature seems to solve the issue. Edit: |
FWIW, a workaround that works for me on HH5A (based on various hints and pointers above and elsewhere): A file,
#!/bin/sh /etc/rc.common
USE_PROCD=1
START=99
STOP=01
start_service() {
    procd_open_instance
    procd_set_param command /bin/fixfdb.sh
    procd_set_param stdout 1
    procd_set_param stderr 1
    procd_close_instance
}
Together with a file,
#!/bin/sh
/usr/sbin/bridge monitor fdb | while read -r mac d dev rest; do
    if [ "$mac" != Deleted ]; then
        echo Found "$mac" on "$dev"
        /usr/sbin/bridge fdb show br br-lan | while read -r omac d odev rest; do
            if [ "$omac" = "$mac" ] && [ "$odev" != "$dev" ]; then
                echo Removing old entry from "$odev"
                /usr/sbin/bridge fdb del "$mac" dev "$odev" "$rest"
            fi
        done
    fi
done
(This relies on the ip-bridge package, ( |
Redmi AX6000 running 23.05.0-rc3 and 23.05 snapshots is affected too.
I can do:
|
@rondoval Do you have an actual problem, where traffic is not passing (for some period)? Your example above shows two entries, but they both point to the |
Yes, I do have a problem where I end up having (after roaming between APs) entries pointing to both a local wifi device (e.g. phy0-ap1) and LAN interface (in my case the wan device - just a name of the port on Redmi AX6000). This was an example meant to show that the workaround mentioned in posts above does not work on Redmi AX6000, because it is not possible to remove one of the entries pointing towards LAN. One has to wait till it expires. I'll post my config and command outputs from the issue itself later today. |
@corvin84 that offload parameter is obviously not right, maybe try to fix the script so it's not sent? Edit: so I think it should be |
That was the first thing I tried, but I get:
Although it is still there:
|
@corvin84 I don't know but maybe try without |
The script works as expected, but there is an error in the way the command output is checked. The entry is indeed removed correctly. Thanks all. Is there any positive sign that this will be fixed in one of the upcoming versions? |
@corvin84 do you have a fix for the script? Happy to amend... |
Hi, so I added
However, it still does not work, and I have no idea why:
|
So I tried various things, for some reason if the fdb entry contains |
@corvin84 you could avoid this issue altogether by just using the DSA interfaces (lan1, lan2, etc.) directly instead of doing VLANs on the bridge containing those DSA interfaces. So lan1.<vlanid> instead of br-vlan.<vlanid>. That's what I've done and it works fine. IMHO it's the least hacky solution to all this, but you will need your entire network to be tagged; it won't work if you need untagged traffic. |
To clarify, you could still have other ports be untagged but you can't have a single port, say lan1, be both tagged and untagged. It won't work reliably, behaves very weirdly. |
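For illustration, the per-port tagging described above could be tried at runtime roughly as follows. This is only a sketch, not the commenter's actual configuration: `vlan_port_cmds` is a hypothetical helper name, the port name and VLAN IDs in the usage note are made-up examples, and the helper merely prints standard iproute2 `ip link` commands instead of running them.

```shell
#!/bin/sh
# Print (do not execute) the iproute2 commands that would create one 802.1Q
# sub-interface per VLAN ID directly on a DSA port, without a VLAN-aware bridge.
# vlan_port_cmds is a hypothetical helper; port names and VLAN IDs are examples.
vlan_port_cmds() {
    port=$1
    shift
    for vid in "$@"; do
        echo "ip link add link $port name $port.$vid type vlan id $vid"
        echo "ip link set dev $port.$vid up"
    done
}
```

Calling `vlan_port_cmds lan1 10 20` prints the commands for `lan1.10` and `lan1.20`, which can be reviewed before running them by hand. Interfaces created this way do not survive a reboot; a persistent setup would have to go into the OpenWRT network configuration instead.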
Interesting. Is it only required to be set like that on the router, or on the dumb APs as well? Additionally, all my ports on the router are actually trunk ports, as all four ports are connected to APs/switches. Can I set up multiple VLANs per LAN port without a bridge? Do you know if it is ever going to be fixed? I had no trouble on 22... Thank you. |
I've done so on all my routers/dumb APs to be safe. I don't know if you need to do it on both.
Yes, that is fine.
No idea. |
Okay, I will do the same. I guess it is not an issue if I have end devices connected to one of the APs and I use PVID? |
Hello. Did you have to do anything particularly special to get this to work? I tried like so:
But I was not able to get any traffic through it. Connecting a computer to one of the ports and sending tagged frames I wasn't even able to ping the address I'd assigned to expvlan4. Did I miss something? I agree, this solution looks the least hacky and is more similar to how things used to look with swconfig. |
Why did you create a new bridge? I think the suggestion is not to use any bridge interface and to use VLAN tagging on a per-port basis. I have not tried this yet, though; I really hope that there is going to be a new OpenWRT release shortly. If there is and it is still broken, then I will start messing with my network. BTW, do you know how a PVID can be set on a non-bridged port? |
Not sure I understand what PVID is, tbh, even after a bit of Googling. Regarding the two bridges, the first one Also, trying to hold out for a fix, but starting to consider rolling back to 19.x, where everything worked great as far as I recall. |
Does this issue continue with the new 6.1 and 6.6 kernels on main? |
I assume that to use a new kernel, one needs to install a snapshot build, right? I am on the official 23.05.2. |
@enmaskarado Yes, the issue still exists in 6.6. @corvin84 You'll need to enable testing kernel in |
Thanks @rany2 and @enmaskarado. I am fairly new to OpenWRT; what can one do in this case? |
It works fine for me, this looks OK. I did have an issue with |
I don't know, the workaround I've described previously worked OK for me. Unfortunately it seems like it might be "hit or miss" |
Yup. The only reason I have not tried it yet is that I am not sure how to define a PVID without DSA/bridge. |
I am also affected by this issue on 5x Belkin RT3200 acting as dumb APs (since at least 23.05.2, which I upgraded to from 22). It took some time before I found out what the issue was! I have complicated VLAN and WLAN setups, with guest MACs on the network. Thus, the only workaround I have made work is disabling FDB learning on the physical port I use for uplink, as suggested by @ccdunder in #11650 (comment) |
What do you mean by the port used for uplink? Ports connected to the switches? Did you do it on every device or just on the router? How can MAC learning be disabled? Thanks! |
@corvin84 I have selected the WAN port to connect to my (dumb-switched) network. I have all 5 ports (lan1-4+wan) on the devices in a "br-lan" bridge, providing untagged "guest" traffic and tagged, separated traffic, with an individual WLAN (for net X) hooked up to br-lan(.X). I have simply configured the wan port to not learn, which works around the bug (thanks @ccdunder in #11650). You can do that by setting learning=0 for the device (at index 1 on my devices):
Which I configured with:
If you use the luci GUI, you can do this with Network => Interfaces => "wan" => Configure => "bridge port specific options" => "Enable MAC address learning" => disabled. |
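To check whether the setting actually took effect on a running device, something like the following sketch may help. It assumes the kernel exposes the per-port flag at `/sys/class/net/<port>/brport/learning` (`1` for on, `0` for off); `port_learning` is a hypothetical helper name, and its optional second argument exists only so the function can be pointed at a different directory for testing.

```shell
#!/bin/sh
# Report whether MAC learning is enabled on a bridge port by reading the
# brport attribute from sysfs. port_learning is a hypothetical helper name.
#   $1: port name, e.g. wan
#   $2: optional sysfs root (defaults to /sys/class/net; override for testing)
port_learning() {
    root=${2:-/sys/class/net}
    case "$(cat "$root/$1/brport/learning" 2>/dev/null)" in
        1) echo on ;;
        0) echo off ;;
        *) echo unknown ;;   # not a bridge port, or attribute not readable
    esac
}
```

On an affected AP, `port_learning wan` should print `off` once the workaround is active. For a quick, non-persistent experiment the flag can also be toggled with iproute2, for example `bridge link set dev wan learning off`, before committing the change to the configuration.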
This is pretty much the way my network is configured. I have one router (WAN is obviously connected to the ISP modem); LAN1 and LAN2 are connected to the dumb switches' WAN ports. I disabled MAC learning on the dumb switches and rebooted all my devices. Now everything obviously works, but it also worked before, until I moved my devices around the house and ended up roaming. I will circle back if I have an update. Thanks for the quick response; fingers crossed this fixes the problem. Actually, why does it fix it? Why does MAC learning cause the trouble, and how can the network function without the MAC learning feature? |
Interestingly, despite setting this on the WAN ports of all the dumb switches, I am still facing the same issue. Did you disable MAC learning only on the dumb switches' WAN ports, or also on the main router's LAN ports that connect to them? |
Did you do anything else? Simply disabling the MAC learning on the APs' WAN port (part of the bridge) connected to the router via cable does not fix the problem. If I disable MAC learning on the router itself where the AP is connected to then I cannot join to the network through the AP in question. Thanks |
Hi, did you do anything else? @slogen followed what you did and it works for him, but not for me. I have an OpenWRT router with two additional dumb APs, also running OpenWRT. I have two VLANs (so most of the ports are trunk ports) for LAN and IOT, and a separate WiFi network for each. On the APs the WAN port is part of the bridge, and I use the WAN link to connect to the main router. I disabled MAC learning on the WAN ports of the APs, but I am still having issues. If I start from scratch (reboot everything), things work: e.g. if I am connected to one of the APs with my phone, I can access the internet and everything on my local network. If I move close to the router, things still work. However, from that point on, whenever I am connected to one of the APs rather than the router, I can no longer access one of the APs or the LAN or IOT devices, until I move close to the router again and things work again. I had no issues with my setup/config on 22.XX; things started to break when I upgraded to 23. Thanks |
Describe the bug
Subject: Regression: DSA breaks roaming to WLAN bridged to VLAN
Summary
When bridging a WLAN with a VLAN on DSA-enabled OpenWRT versions, an erroneous FDB entry creates 100% packet loss for 300 seconds.
Summary analysis
When a VLAN-tagged packet is received by the device from the Ethernet (LAN) port side, 2 FDB entries for the source MAC address of the packet are created in the forwarding database (FDB) of the bridge:
Both FDB entries tell the switch to forward packets whose destination is that MAC address to that Ethernet port.
When a packet from the same source is then received by the device from the WiFi (WLAN) side (e.g. after roaming a client device to that WiFi) and forwarded to the Ethernet (LAN) side (with a VLAN tag), only 1 of these 2 FDB entries is updated (namely the VLAN-tagged FDB entry). The other, VLAN-untagged FDB entry erroneously remains unchanged.
When another VLAN-tagged packet (e.g. an ARP reply packet or a DHCP reply packet) is then received by the device from the Ethernet (LAN) side and that packet's destination is the client device's MAC address (=the VLAN-untagged FDB entry's MAC address), the switch believes that this packet ought to be forwarded to the Ethernet port (even though it was received from exactly this port), while it actually should be forwarded to the CPU of the router (and then further to the WLAN). This packet then gets dropped.
This packet loss continues until, after about 300 seconds (5 minutes), the erroneous FDB entry expires.
Impact of this bug
This bug potentially affects all DSA-enabled platforms, and at least a sizable subset of them.
It has been verified to exist on:
How to reproduce
Steps
Expected result
It is expected that client device C still receives ping replies although its WLAN association has changed again.
Observed result
It can be observed that client device C does not receive ping replies once its WLAN association has changed again.
How to analyze
Working workarounds
Wait 5 minutes
After 5 minutes, the erroneous FDB entry expires automatically.
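The 5 minutes correspond to the bridge's FDB ageing time. As a small sketch, assuming the value is read from `/sys/class/net/br-lan/bridge/ageing_time` and that sysfs reports it in hundredths of a second (so the default would read `30000`), it can be converted to seconds like this; `ageing_to_seconds` is a hypothetical helper name.

```shell
#!/bin/sh
# Convert a bridge ageing_time value as reported by sysfs (assumed to be in
# hundredths of a second) to whole seconds.
# ageing_to_seconds is a hypothetical helper name.
ageing_to_seconds() {
    echo $(( $1 / 100 ))
}
```

On a device this could be called as `ageing_to_seconds "$(cat /sys/class/net/br-lan/bridge/ageing_time)"` to confirm how long a stale entry is expected to linger.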
Delete the erroneous FDB entry explicitly once
Run "bridge fdb del 02:ff:04:05:06:07 dev lan2". Once you do this, you will observe immediately that the packets are not dropped anymore.
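When it is not obvious which entry to delete, the candidates can be listed first. The sketch below assumes the usual `bridge fdb show` line format and only prints `bridge fdb del` commands for review; it does not run them. `stale_fdb_del_cmds` is a hypothetical helper name, and the interface name `wlan0` in the usage note is an example, not taken from the report.

```shell
#!/bin/sh
# Read `bridge fdb show` output on stdin and print a `bridge fdb del` command
# for every non-permanent entry of the given MAC that points at a port other
# than the one the client is currently reachable through.
# stale_fdb_del_cmds is a hypothetical helper name.
#   $1: client MAC address (as printed by bridge, i.e. lowercase)
#   $2: port the client is currently behind
stale_fdb_del_cmds() {
    awk -v mac="$1" -v cur="$2" '
        /permanent/ { next }
        $1 == mac && $2 == "dev" && $3 != cur {
            cmd = "bridge fdb del " mac " dev " $3
            for (i = 4; i < NF; i++)          # keep the VLAN of tagged entries
                if ($i == "vlan") cmd = cmd " vlan " $(i + 1)
            print cmd
        }
    '
}
```

For example, `bridge fdb show br br-lan | stale_fdb_del_cmds 02:ff:04:05:06:07 wlan0` would list the entries for that client that still point away from `wlan0`, in the same form as the manual command above.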
Delete the erroneous FDB entry explicitly automatically
This bug is so pervasive that someone has created a workaround which deletes the erroneous FDB entries automatically.
Downgrade to swconfig-enabled-OpenWRT versions (instead of DSA-enabled OpenWRT versions)
This is possible and works perfectly. However, not upgrading is not viable in the long run.
Non-working workarounds
Enabling learning_sync
Running
has no effect on the bug, it still occurs.
Disabling learning
Running
fails with
Analysis
The root cause seems to be confusion about whether VLAN-untagged FDB entries should also apply to VLAN-tagged packets.
This confusion results in unequal treatment for VLAN-tagged packets incoming through the Ethernet port and forwarded to the CPU on the one hand and VLAN-tagged packets outgoing from the CPU through the Ethernet port on the other hand:
Solution 1: VLAN-untagged FDB entries should not apply to VLAN-tagged packets
In this case,
Solution 2: VLAN-untagged FDB entries should also apply to VLAN-tagged packets
In this case,
This confusion is evidently software-based, as non-DSA versions of OpenWRT do not exhibit this bug.
The exact location of this confusion is (currently) unknown.
OpenWrt version
r19803-9a599fee93
OpenWrt target/subtarget
lantiq/xrx200
Device
hh5a (BT HomeHub 5a)
Image kind
Official downloaded image
Steps to reproduce
Actual behaviour
It can be observed that client device C does not receive ping replies once its WLAN association has changed again.
Expected behaviour
It is expected that client device C still receives ping replies although its WLAN association has changed again.