FS#4098 - MESH-SAE-AUTH-BLOCKED #9082
Sometimes the devices in a mesh can't connect to each other after a power outage or a reboot of the root node (the node which is connected to the gateway and allows the rest of the mesh to connect to the internet).
When this happens, the links show up in "iw mesh0 station dump" or "iw mesh1 station dump" but in BLOCKED state.
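A minimal sketch of how the blocked links could be picked out of that output automatically. It assumes the dump has been captured as text (for example on a workstation with Python available), that each entry starts with a `Station <MAC>` line, and that the peer-link state appears on a `mesh plink:` line. The function names are hypothetical and not part of any existing tool.

```python
import re

# Assumed layout: each station entry starts with "Station <MAC> (on <ifname>)".
STATION_RE = re.compile(r"^Station\s+([0-9A-Fa-f:]{17})\b")
# Assumed layout: the peer-link state is reported on a "mesh plink:" line.
PLINK_RE = re.compile(r"^\s*mesh plink:\s*(\S+)")


def peers_by_plink_state(dump_text):
    """Map each peer MAC address (lowercased) to its mesh plink state."""
    states = {}
    current = None
    for line in dump_text.splitlines():
        station = STATION_RE.match(line)
        if station:
            current = station.group(1).lower()
            continue
        plink = PLINK_RE.match(line)
        if plink and current is not None:
            states[current] = plink.group(1)
    return states


def blocked_peers(dump_text):
    """Return the sorted MAC addresses of peers whose plink state is BLOCKED."""
    states = peers_by_plink_state(dump_text)
    return sorted(mac for mac, state in states.items() if state == "BLOCKED")
```

Passing the captured text of "iw mesh0 station dump" to `blocked_peers()` would return the peers to look at first; an empty list means no link is currently reported as BLOCKED.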
Rebooting the nodes which have their link blocked at the same time fixes the issue, which seems to rule out an interference issue, because how can a reboot fix an interference issue?
I also tried setting "cell_density '1'" in the configuration of the radios, but the problem keeps happening. It doesn't happen often, but when it does it can wreak havoc.
The mesh configuration is the following:
We have this problem too. It occurs in two of our three meshes. It is much more frequent lately. I do not know whether it is merely coincidental that we recently upgraded from 21.01 to 21.02.
My current solution is to maintain a pair of openssh tunnels between each dhcp server (in which gw_mode='server') and each client (in which gw_mode='client'). If a dhcp server finds itself with no clients that are (still?) in contact with it, it reboots. If a client finds itself with no dhcp server that is (still?) in contact with it, it reboots. It's a ridiculously heavy solution which is a lot of trouble to set up in a secure manner, but it has the advantage that each node can detect whether it is in contact with the node(s) with which it has one or more critical relationships.
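The reboot decision described above can be sketched roughly as follows. This is not the actual script in use here, only an illustration: the class name is hypothetical, and the threshold of several consecutive failed checks before rebooting is an added assumption to avoid rebooting on a single missed check. How reachability is determined (the tunnels) is left outside the sketch.

```python
class ContactWatchdog:
    """Decide when a node should reboot because it lost all its counterparts.

    For a dhcp server the counterparts are its clients; for a client they
    are the dhcp servers. The failure threshold is an assumption, not part
    of the setup described in the thread.
    """

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def update(self, reachable_counterparts):
        """Record one check; return True when a reboot is due."""
        if reachable_counterparts:
            # At least one critical peer answered, so reset the counter.
            self.failures = 0
        else:
            self.failures += 1
        return self.failures >= self.max_failures
```

A periodic job would call `update()` with the list of counterparts that answered the latest check and trigger the reboot when it returns `True`.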
I suspect this problem is actually a driver issue. These are all Archer [CA]7 v  routers (affordable!) with QCA "wave1" radios. I haven't been able to use the -CT (Candela Technologies) driver for those radios in a mesh; perhaps I haven't understood the advice I've received about that, or perhaps the advice just doesn't work. Therefore, I have to use the stock (QCA) driver's inherent 802.11s implementation, which has quirks. For example, it always fails, usually within hours or minutes, if I have tweaked the radio's built-in MAC address. Therefore, I suspect the QCA firmware may be insufficiently hardened against the depredations of real-world environments.
On the other hand, this could be a real OpenWRT bug. I have no explanation as to why it is suddenly so much more frequent. If anyone can suggest debugging instrumentation that I haven't already tried, I'll be grateful for the advice.
Currently I am not using openwrt, but I have/had a similar issue. This had to do with "too" many clients trying to connect to the mesh peer at the same time. It then also got into the PLINK_BLOCKED state.
First of all, I removed the code that sets the PLINK_BLOCKED state when authentication fails several times (I couldn't find this behaviour in the IEEE 802.11 standard anyway...). Then I noticed a lot of "anti-clogging" messages (see also chapter 12.4.6 in the IEEE 802.11 standard). This mechanism starts sending tokens along with frames to reduce the number of peers which are allowed to perform authentication at the same time. This then led to peers getting blocked because they were not allowed to authenticate.
Maybe you can check your logs for this kind of message. Also, when you try to reproduce the issue, make sure you have a lot of peers (I had to have more than 5 peers...)
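As a rough aid for that log check, here is a small sketch that filters saved log text for lines mentioning anti-clogging. It only does a case-insensitive substring match on "anti-clogging"; the exact wording of the messages is not assumed, and the function name is hypothetical.

```python
def anti_clogging_lines(log_text):
    """Return the log lines that mention anti-clogging (case-insensitive)."""
    return [
        line
        for line in log_text.splitlines()
        if "anti-clogging" in line.lower()
    ]
```

Feeding it the saved system log of a node and looking at how many lines come back around the time the links went to BLOCKED should show whether the token mechanism was active then.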
I have posted my original issue here, maybe this helps to get more insight into the issue. http://lists.infradead.org/pipermail/hostap/2021-December/040095.html