FS#4099 - Mesh 802.11s blackhole due to bogus mpath routes with nexthop 00:00:00:00:00:00 #9083

Open
openwrt-bot opened this issue Oct 20, 2021 · 2 comments

@openwrt-bot openwrt-bot commented Oct 20, 2021

nemesisdev:

  • Device problem occurs on: reported by [[https://forum.openwrt.org/t/mesh-802-11s-routing-table-gets-filled-with-garbage-causing-a-black-hole-openwrt-21-02-rc4-mt7603e-mt7615e/104808|a few users on different devices]]; I am using a MediaTek-based one

  • I am experiencing this on current master, [[https://github.com/openwrt/openwrt/commit/ade56b8d9e|ade56b8d9e]]

  • Steps to reproduce: I am not sure what exactly triggers this; the bug happens on its own periodically. If anyone has suggestions for specific actions that might reproduce it, please let me know

What happens:

Connection to the root mesh node is lost, but inspecting the status of the mesh links with "iw mesh0 station dump" or "iw mesh1 station dump" shows the links are active.

Inspecting "iw mesh0 mpath dump" or "iw mesh1 mpath dump" shows a list of MAC addresses that belong to devices on the LAN, each with an invalid next hop (00:00:00:00:00:00). For some reason these end up in the mesh routing table and fill it.

For example, when the problem starts showing up, the mesh routing table may look as follows:

iw mesh1 mpath dump
DEST ADDR          NEXT HOP           IFACE  SN  METRIC  QLEN  EXPTIME  DTIM  DRET  FLAGS  HOP_COUNT  PATH_CHANGE
16:dd:0c:a4:ba:aa  00:00:00:00:00:00  mesh1  0   0       0     0        1600  4     0x0    0          0
fc:93:c3:3b:0b:fe  00:00:00:00:00:00  mesh1  0   0       0     0        1600  4     0x0    0          0
90:f4:c0:8f:de:80  00:00:00:00:00:00  mesh1  0   0       0     0        1600  4     0x0    0          0
bc:a1:da:cb:87:a8  00:00:00:00:00:00  mesh1  0   0       0     0        1600  4     0x0    0          0
1e:f7:95:47:4a:b3  00:00:00:00:00:00  mesh1  0   0       0     0        1600  4     0x0    0          0
3a:e2:e6:88:65:fb  00:00:00:00:00:00  mesh1  0   0       0     0        1600  4     0x0    0          0
6c:cd:48:37:af:bc  00:00:00:00:00:00  mesh1  0   0       0     0        1600  4     0x0    0          0
d8:54:0b:7c:20:46  00:00:00:00:00:00  mesh1  0   0       0     0        1600  4     0x0    0          0
ce:43:28:84:44:7e  00:00:00:00:00:00  mesh1  0   0       0     0        1600  4     0x0    0          0
26:75:58:0b:39:18  00:00:00:00:00:00  mesh1  0   0       0     0        1600  4     0x0    0          0
80:3f:5d:::**      80:3f:5d:::**      mesh1  0   4857    0     0        0     0     0x10   1          1
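
For anyone who wants to watch how fast the table fills up, a dump like the one above can be filtered for the all-zero next hop. This is only a rough sketch: the helper name `bogus_mpaths` is made up for illustration, and it assumes the next hop is the second whitespace-separated field, as in the output shown.

```shell
#!/bin/sh
# Hypothetical helper (not part of iw or OpenWrt): reads the output of
# "iw <iface> mpath dump" on stdin and prints the destination address of
# every entry whose next hop is the all-zero MAC.
bogus_mpaths() {
    # The header line never matches, because its second field is "ADDR".
    awk '$2 == "00:00:00:00:00:00" { print $1 }'
}

# Usage sketch (interface name taken from the report above):
#   iw mesh1 mpath dump | bogus_mpaths | wc -l
```

Running the count once a minute should show whether the table really doubles or triples as described.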

After one minute, the size may have doubled or tripled.

At some point one of the mesh nodes itself ends up in the routing table with a bogus route (next hop 00:00:00:00:00:00), which breaks routing completely. Until that happens, the other rogue mesh routes do not cause issues. Once the black hole appears, however, removing the bogus mesh routes does not fix it; only turning Wi-Fi off and then on again does.
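
The black-hole condition described here (a mesh peer that is still listed as a station but has an all-zero next hop in the path table) could be checked for with something like the sketch below. The function name `blackholed_peers` and the temporary file paths are assumptions for illustration; it also assumes station dump entries start with a line of the form "Station <MAC> (on <iface>)".

```shell
#!/bin/sh
ZERO_MAC="00:00:00:00:00:00"

# Hypothetical helper: blackholed_peers STATION_DUMP_FILE MPATH_DUMP_FILE
# Prints every MAC that appears as a peer in the station dump AND as a
# destination with an all-zero next hop in the mpath dump.
blackholed_peers() {
    awk -v zero="$ZERO_MAC" '
        # First file: remember each peer MAC from the "Station ..." lines.
        NR == FNR { if ($1 == "Station") peer[$2] = 1; next }
        # Second file: report bogus routes whose destination is a known peer.
        $2 == zero && ($1 in peer) { print $1 }
    ' "$1" "$2"
}

# Usage sketch (paths and the recovery action are assumptions):
#   iw mesh1 station dump > /tmp/sta.txt
#   iw mesh1 mpath dump   > /tmp/mpath.txt
#   [ -n "$(blackholed_peers /tmp/sta.txt /tmp/mpath.txt)" ] && { wifi down; wifi up; }
```

The sketch only detects the condition. Since deleting the routes does not recover the link once the black hole is there, the usage comment falls back to the Wi-Fi off/on cycle mentioned above.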

@openwrt-bot openwrt-bot commented Oct 20, 2021

jonozzz:

I'm running on slightly older master: [[https://github.com/openwrt/openwrt/commit/0470159552|0470159552]]

I used to experience something similar, albeit not as many null nexthops in my mpath dump. I remember seeing 2-3 at a time. I have a mesh between a Netgear EA8500 and an EX6400, not sure if that matters at all.

For me, when this issue was happening, the radio carrying the mesh network (5 GHz in my case) was going down for no apparent reason. All my APs share the same SSID on both 2.4 GHz and 5 GHz, so the only way I could tell that one of the radios on the APs was dead was to use the WiFiman app on my phone. The workaround was to shut down the radio and start it again, or to trigger a scan (iw dev mesh0 scan freq 5785).

However, about 3 weeks ago I changed these 802.11r options on all APs:
option reassociation_deadline '20000'
option ft_over_ds '0'
and surprisingly I haven't had any issues since.
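
In case it helps others try the same settings, here is a rough sketch that prints the uci commands for a list of wifi-iface sections. The helper name `ft_option_cmds` and the section names in the usage comment are assumptions; check `uci show wireless` for the actual section names on your device.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) the uci commands that set the
# two 802.11r options quoted above on each wifi-iface section given as an
# argument, followed by a commit and a wifi reload.
ft_option_cmds() {
    for section in "$@"; do
        printf "uci set wireless.%s.reassociation_deadline='20000'\n" "$section"
        printf "uci set wireless.%s.ft_over_ds='0'\n" "$section"
    done
    printf 'uci commit wireless\n'
    printf 'wifi reload\n'
}

# Usage sketch (section names are assumptions):
#   ft_option_cmds default_radio0 default_radio1        # review the output
#   ft_option_cmds default_radio0 default_radio1 | sh   # then apply it
```

Printing first and piping to sh afterwards keeps the change reviewable before it touches the wireless config.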

So, I'm still not sure what's happening and I don't have any repro steps.

I'll update if, for some reason, it starts happening again.

@openwrt-bot openwrt-bot commented Nov 4, 2021

nemesisdev:

I reflashed the routers with a build based on [[https://github.com/openwrt/openwrt/commits/94c41ef2ef|Revision 94c41ef]] on master and ensured these options are present on all APs:

option reassociation_deadline '20000'
option ft_over_ds '0'

However, the issue still appears from time to time.

I am going to follow Jow's suggestion on the forum and send an inquiry to the Linux Wireless mailing list shortly.
