
mt7915e: AP mode: UDP flood kills driver/MCU communication #776

Closed
rany2 opened this issue May 2, 2023 · 94 comments
rany2 (Contributor) commented May 2, 2023

Issue found courtesy of @Brain2000.

This issue was brought up in #690, but I think it is worth tracking in its own issue, as it appears there are many different things that could trigger #690.

In essence, the following is a requirement to trigger this issue:

  • an mt7915e AP with "Isolated Clients" disabled in hostapd
  • an STA that connects, floods the network with broadcast/multicast traffic for about 30 seconds, and then immediately does an rfkill with no reconnection to the AP

To trigger this, use the following script from @Brain2000.

File spam_multicast.py:

import socket

# mDNS multicast group/port; any multicast destination works
UDP_IP = "224.0.0.251"
UDP_PORT = 5353
DEST_PAIR = (UDP_IP, UDP_PORT)

TTL = 2
DATA = b"flajshdflkjashdflkjhasdlkfjhwlueiryluiashdfljhasljkdfhlkajsdhfl ashdfljkashdlfkjhaslkdjfhlaskdfhwhateverandever"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, TTL)

# flood as fast as the socket allows
while True:
    sock.sendto(DATA, DEST_PAIR)

You could trigger it like so:

$ # Connect to the MT7915E AP
$ nmcli con up mt7915ap
$ # Run the script for a period of 30 seconds and immediately rfkill 
$ timeout 30 python3 ~/spam_multicast.py; rfkill block wifi
$ # Wait 5 seconds
$ sleep 5
$ # You could unblock your radio now

I've tried triggering this on MT76x2E, MT7610E, and MT7628AN with no luck, so I presume this is unique to mt7915e's firmware or kernel driver.

Notes:

  • WiFi encryption does not matter; it occurs on open networks as well (not GTK related)
  • Occurs when only 802.11n is enabled, so it is not specific to 802.11ax
  • It occurs irrespective of whether AMPDU is enabled or disabled
  • On rare occasions you might need to try this more than once to trigger it
  • It is easier to trigger when you are closer to the AP; the further away you are, the less likely it is to trigger. So link speed is of the essence here.

All the best

rany2 commented May 2, 2023

EDIT: this was wrong; it did not fix the issue.

Note: I meant to say "AMPDU," not "AMSDU." I disabled it by applying the following patch:

diff --git a/mt7915/main.c b/mt7915/main.c
index 8ce7b1c5..98e37afa 100644
--- a/mt7915/main.c
+++ b/mt7915/main.c
@@ -777,6 +777,8 @@ mt7915_ampdu_action(struct ieee80211_hw *hw, struct ieee80211_vif *vif,
        struct mt76_txq *mtxq;
        int ret = 0;
 
+       return -EOPNOTSUPP;
+
        if (!txq)
                return -EINVAL;

rany2 commented May 18, 2023

Note, I meant to say that "AMPDU" not "AMSDU." I disabled it by applying the following patch:

I would like to make an important update: it turns out there is no relation. This is not the issue.

rany2 commented May 18, 2023

Well, perhaps the connection between this issue and AMPDU is simply that degraded link speed makes the issue occur less often; obviously, disabling AMPDU leaves you with slower speeds. It might just be a memory-corruption issue, where after a certain amount of time corruption builds up and leads to this. I legitimately don't know, but I think it is likely, seeing how tremendously the behavior varies after the issue occurs. (Sometimes it recovers, sometimes not; sometimes it doesn't recover automatically; sometimes writing a value to sys_recovery fixes it; etc.)

Edit: I should mention by "it recovers" I mean L1 SER kicks in.

rany2 commented May 18, 2023

Also important to note something I mentioned on another issue thread:

This issue also occurs when running that spam_multicast.py script from the ethernet switch itself; HOWEVER, in that case it ends up doing an L1 recovery every time. So perhaps some corruption in the RX buffers that over time takes its toll on the system?


@rany2 rany2 changed the title mt7915e: AP mode: broadcast flood from STA and immediate subsequent rfkill from STA kills driver/MCU communication mt7915e: AP mode: broadcast/mcast flood from STA and immediate subsequent rfkill from STA kills driver/MCU communication May 18, 2023
rany2 commented May 18, 2023

Apologies for the constant corrections, but I tried once more to get it to crash from the ethernet switch, and this time it was not so lucky and was unable to recover itself. This whole ordeal is really inconsistent.

rany2 commented May 20, 2023

@ryderlee1110 This node had L1 SER kicking in whenever it crashed under normal circumstances (and was working OK), but when I ran the spam_multicast script, it tried to recover itself and ended up failing to do so.

Starting from 42764.738366 I intentionally tried to crash it with that script; it seems a recovery was triggered but had no effect:

[21428.819377] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000004
[21428.825916] mt7915e 0000:02:00.0: phy0 L1 SER recovery start.
[21428.832581] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000008
[21428.859071] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000010
[21428.865835] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000020
[21428.875986] mt7915e 0000:02:00.0: phy0 L1 SER recovery completed.
[21494.763591] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000004
[21494.770172] mt7915e 0000:02:00.0: phy0 L1 SER recovery start.
[21494.776999] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000008
[21494.803319] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000010
[21494.809978] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000020
[21494.820620] mt7915e 0000:02:00.0: phy0 L1 SER recovery completed.
[42764.731797] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000004
[42764.738366] mt7915e 0000:02:00.0: phy0 L1 SER recovery start.
[42764.770530] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000008
[42764.823208] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000010
[42764.829861] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000020
[42764.911799] mt7915e 0000:02:00.0: phy0 L1 SER recovery completed.
[42771.059616] mt7915e 0000:02:00.0: Message 00005aed (seq 15) timeout
[42774.099611] mt7915e 0000:02:00.0: Message 00005aed (seq 1) timeout
[42777.139574] mt7915e 0000:02:00.0: Message 00005aed (seq 2) timeout
[42780.179584] mt7915e 0000:02:00.0: Message 00005aed (seq 3) timeout
[42783.219522] mt7915e 0000:02:00.0: Message 00005aed (seq 4) timeout
[42786.259499] mt7915e 0000:02:00.0: Message 00005aed (seq 5) timeout
[42789.299507] mt7915e 0000:02:00.0: Message 00005aed (seq 6) timeout
[42792.349449] mt7915e 0000:02:00.0: Message 00005aed (seq 7) timeout
[42795.379422] mt7915e 0000:02:00.0: Message 00005aed (seq 8) timeout
[42798.419397] mt7915e 0000:02:00.0: Message 000025ed (seq 9) timeout

rany2 commented May 21, 2023

@ryderlee1110 I think there is an issue with SER with respect to how MT_PCIE1_MAC_INT_ENABLE is used unconditionally.

This didn't fix my issue, but I noticed that in mac_restart the driver isn't checking whether mdev is an mt7915, causing it to use MT_PCIE1_MAC_INT_ENABLE when it should have used MT_PCIE1_MAC_INT_ENABLE_MT7916. At least this is the logic in pci.c.

If that's unclear, check this patch I made, which fixes what I think is a possible issue: rany2/mt76@7cb022d.

Something else I noticed: after I do a full chip recovery by writing 7 to sys_recovery, hif no longer receives any interrupts. I thought the above would fix that, but apparently not. Regardless, I think it is worth checking whether this is intended or indeed a bug.

rany2 commented May 22, 2023

Something else I noticed is that after I do a full chip recovery by writing 7 to sys_recovery, hif no longer receives any interrupts. So I thought the above would fix it but apparently not. Regardless I think it might be worth checking to see if it was intended or if this is indeed a bug.

I should have mentioned that writing 7 to sys_recovery until memory consumption goes back to normal, followed by rmmod and insmod of mt7915e, fixes it.
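The manual recovery sequence described above can be sketched as a small helper. This is just a sketch: the debugfs path is assumed from the phy0 dumps shown later in this thread (adjust for your phy), and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def full_chip_recovery(debugfs="/sys/kernel/debug/ieee80211/phy0/mt76"):
    """Write 7 to sys_recovery to trigger a full SER chip recovery."""
    Path(debugfs, "sys_recovery").write_text("7")


def reload_driver():
    """Once memory consumption is back to normal, reload mt7915e."""
    subprocess.run(["rmmod", "mt7915e"], check=True)
    subprocess.run(["modprobe", "mt7915e"], check=True)
```

Per the comment above, repeat full_chip_recovery() until memory consumption normalizes, then call reload_driver().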

@rany2 rany2 changed the title mt7915e: AP mode: broadcast/mcast flood from STA and immediate subsequent rfkill from STA kills driver/MCU communication mt7915e: AP mode: UDP flood kills driver/MCU communication May 22, 2023
rany2 commented May 22, 2023

Updated the title to reflect that it doesn't matter whether the flood originated from an STA or not: if there is a flood of any kind and the AP has to transmit that traffic to clients, it will die, and this will be spammed to dmesg (provided fw wm debug is enabled):

[  471.558903] ieee80211 phy2: WM: ( 116.141511:98:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  471.568533] ieee80211 phy2: WM: ( 116.141541:99:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  471.764205] ieee80211 phy2: WM: ( 116.346711:00:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  471.773806] ieee80211 phy2: WM: ( 116.346772:01:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  471.860779] ieee80211 phy2: WM: ( 116.443299:02:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  471.947629] ieee80211 phy2: WM: ( 116.530121:03:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  472.066295] ieee80211 phy2: WM: ( 116.648743:04:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  472.075883] ieee80211 phy2: WM: ( 116.648774:05:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)

rany2 added a commit to rany2/mt76 that referenced this issue May 23, 2023
This version I sourced from EAP615 WALL GPL source dump fixes the issue
of random crashes, in particular these issues:

- openwrt/mt76#776
- openwrt/mt76#690

Firmware downloaded from:

https://static.tp-link.com/upload/gpl-code/2023/202305/20230515/eap615-wall_gpl_code.tar.gz
rany2 commented May 23, 2023

Just a heads up, I no longer face this issue when using this WA firmware I sourced from the EAP615 WALL source dump: rany2/mt76@78fd50c

@ryderlee1110 (Contributor)

Which version did you downgrade from?

rany2 commented May 24, 2023

These firmware versions I tried had this issue:

  • 2023041815 (downgraded from this)
  • v204520220929

However, it still has the random MCU timeout issue like before; it's just that a UDP flood no longer kills it, and it recovers.

@lukasz1992

@rany2 So:

  • the issue from this ticket is solved with this specific firmware
  • you still have issues with MCU timeouts
  • but the driver from your repo recovers itself

rany2 commented May 24, 2023

@lukasz1992 No, it only recovers itself in this specific case (UDP flooding). I still have random driver hangs/MCU timeout.

@lukasz1992

:( many thanks anyway

rany2 commented May 24, 2023

Looking at another source dump, I noticed there seems to be a distinction between three variants of the same firmware, with only the MT7915D variant fixing this issue; however, OpenWrt/linux-firmware doesn't make this distinction. Could this be the issue?

littoy commented May 24, 2023

Just a heads up, I no longer face this issue when using this WA firmware I sourced from the EAP615 WALL source dump: rany2/mt76@78fd50c

@rany2 hi, the tools and scripts are missing in your source. Is this by design?

rany2 commented May 24, 2023

@littoy It's missing because my mt76 is based on wireless-next, which has some changes compared to the OpenWrt variant.

rany2 commented May 24, 2023

@littoy on your openwrt source tree, you need to make the following changes: rany2/openwrt@f52d443

just change PKG_SOURCE_VERSION to dee319231825423a9ac5135591a781e1e398267f

littoy commented May 24, 2023

@littoy on your openwrt source tree, you need to make the following changes: rany2/openwrt@f52d443

just change PKG_SOURCE_VERSION to e3578d0be0984451eb4c80dafa5906a969dd4fae

👌, thank you very much.

rany2 commented May 24, 2023

@ryderlee1110 the hang is with MCUWA now, and the UDP flood is no longer relevant:

root@router:/sys/kernel/debug/ieee80211/phy0/mt76# cat xmit-queues 
     queue | hw-queued |      head |      tail |
      MAIN |         0 |       837 |       837 |
     MCUWM |         0 |        17 |        17 |
     MCUWA |        94 |       122 |        28 |
   MCUFWDL |         0 |       121 |       121 |
root@router:/sys/kernel/debug/ieee80211/phy0/mt76# cat sys_recovery 
Please echo the correct value ...
0: grab firmware transient SER state
1: trigger system error L1 recovery
2: trigger system error L2 recovery
3: trigger system error L3 rx abort
4: trigger system error L3 tx abort
5: trigger system error L3 tx disable
6: trigger system error L3 bf recovery
7: trigger system error full recovery
8: trigger firmware crash

let's dump firmware SER statistics...
::E  R , SER_STATUS        = 0x00000000
::E  R , SER_PLE_ERR       = 0x00000000
::E  R , SER_PLE_ERR_1     = 0x00000000
::E  R , SER_PLE_ERR_AMSDU = 0x00000000
::E  R , SER_PSE_ERR       = 0x00000000
::E  R , SER_PSE_ERR_1     = 0x00000000
::E  R , SER_LMAC_WISR6_B0 = 0x00000000
::E  R , SER_LMAC_WISR6_B1 = 0x00000000
::E  R , SER_LMAC_WISR7_B0 = 0x00000000
::E  R , SER_LMAC_WISR7_B1 = 0x00000000

SYS_RESET_COUNT: WM 0, WA 0
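For reference, a quick way to spot the stuck queue in a dump like the one above is to parse the xmit-queues table and flag any queue that still has frames queued in hardware. This is a sketch: parse_xmit_queues and stuck_queues are hypothetical helper names, and the column layout is assumed from the output shown above.

```python
def parse_xmit_queues(text):
    """Parse the mt76 debugfs xmit-queues table into a dict per queue."""
    queues = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        name, hw_queued, head, tail = [f.strip() for f in line.split("|") if f.strip()]
        queues[name] = {"hw-queued": int(hw_queued), "head": int(head), "tail": int(tail)}
    return queues


def stuck_queues(queues):
    """Queues that still have frames queued in hardware (a possible hang)."""
    return [name for name, q in queues.items() if q["hw-queued"] > 0]
```

On the dump above, this flags only MCUWA, matching the observed hang.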

rany2 commented May 24, 2023

If I write 7 to sys_recovery, I get this after about a minute:

[ 4787.831468] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4787.837063] rcu: 	0-....: (5999 ticks this GP) idle=767/1/0x40000002 softirq=140248/140248 fqs=2999 
[ 4787.846166] 	(t=6000 jiffies g=215273 q=2055)
[ 4787.850509] NMI backtrace for cpu 0
[ 4787.853980] CPU: 0 PID: 694 Comm: napi/phy0-9 Tainted: G        W         5.15.112 #0
[ 4787.861782] Stack : 00000000 800859e8 00000000 00000004 00000000 00000000 8140dd14 80a20000
[ 4787.870139]         80860000 80789264 81505078 8085ce03 00000000 00000001 8140dcc0 81452640
[ 4787.878490]         00000000 00000000 80789264 8140db60 ffffefff 00000000 ffffffea 00000000
[ 4787.886841]         8140db6c 00000309 80862ae0 ffffffff 80789264 00000000 00000000 00000000
[ 4787.895193]         00000000 8085a2fc 80860000 8085a300 00000018 8040b510 00000000 80a20000
[ 4787.903546]         ...
[ 4787.905986] Call Trace:
[ 4787.908415] [<8000812c>] show_stack+0x28/0xf0
[ 4787.912788] [<80380234>] dump_stack_lvl+0x60/0x80
[ 4787.917489] [<80387020>] nmi_cpu_backtrace+0x108/0x178
[ 4787.922607] [<803871d8>] nmi_trigger_cpumask_backtrace+0x148/0x178
[ 4787.928764] [<8009f254>] rcu_dump_cpu_stacks+0x158/0x1ac
[ 4787.934072] [<8009fb98>] rcu_sched_clock_irq+0x800/0x9d0
[ 4787.939368] [<800a6270>] update_process_times+0xc8/0x124
[ 4787.944665] [<800bae90>] tick_handle_periodic+0x34/0xc8
[ 4787.949888] [<804d34c8>] gic_compare_interrupt+0x7c/0x9c
[ 4787.955189] [<8008e348>] handle_percpu_devid_irq+0xbc/0x188
[ 4787.960757] [<80087a9c>] generic_handle_domain_irq+0x2c/0x44
[ 4787.966393] [<8039d4a0>] gic_handle_local_int+0xa4/0x110
[ 4787.971690] [<8039d51c>] gic_irq_dispatch+0x10/0x20
[ 4787.976550] [<800879fc>] handle_irq_desc+0x20/0x38
[ 4787.981321] [<806dc538>] do_domain_IRQ+0x3c/0x50
[ 4787.985940] [<8039c7dc>] plat_irq_dispatch+0x98/0xcc
[ 4787.990908] [<80003568>] except_vec_vi_end+0xb8/0xc4
[ 4787.995858] [<82f501a4>] mt76_mmio_init+0xf8/0x144 [mt76]
[ 4788.001271] [<82e6dd38>] mt7915_mac_wtbl_lmac_addr+0x1b0/0x960 [mt7915e]
[ 4788.007980] [<82e70000>] mt7915_mac_reset_work+0x464/0xc88 [mt7915e]

@lukasz1992

@rany2 because of this line https://github.com/openwrt/mt76/blob/969b7b5ebd129068ca56e4b0d831593a2f92382f/mmio.c#LL103C23-L103C36

  • probably need to reset the spinlock or something like that; I am not sure, I do not have enough knowledge, but this is the line that hangs the CPU

rany2 commented May 24, 2023

@lukasz1992 honestly I think the stack trace is wrong/misleading... it doesn't make sense

rany2 added a commit to rany2/mt76 that referenced this issue Jun 1, 2023
Using WA firmware from EAP605 fixes the issue of UDP floods
killing MCU communication.

Firmware downloaded from:

https://static.tp-link.com/upload/gpl-code/2023/202305/20230515/eap615-wall_gpl_code.tar.gz

Fixes: openwrt/mt76#776

Signed-off-by: rany <ranygh@riseup.net>
ggg70 commented Jun 2, 2023

@ggg70 if you could try again, then observing /proc/interrupts would be useful to help troubleshoot. Could you also share your config/scenario?

@lukasz1992 After almost 4 days of working fine, the problem with 100% CPU usage appeared again. Here is the interrupts file:

           CPU0       CPU1       
 10:   14125046   18713122     GICv2  30 Level     arch_timer
 15:          1          0  MT_SYSIRQ 163 Level     mt-pmic-pwrap
 22:          0          0   mt-eint   0 Edge      gpio-keys
 75:          1          0   mt-eint  53 Level     mt7530
124:          0          0   mt-eint 102 Edge      gpio-keys
125:         12          0  MT_SYSIRQ  91 Level     ttyS0
128:          0          0  MT_SYSIRQ 118 Level     1100a000.spi
131:      22145          0  MT_SYSIRQ  96 Level     mtk-snand
132:      84039          0  MT_SYSIRQ  95 Level     mtk-ecc
133:          0          0  MT_SYSIRQ 122 Level     11016000.spi
134:    4144758          0  MT_SYSIRQ 211 Level     mt7615e
135:          0          0  MT_SYSIRQ 232 Level     xhci-hcd:usb1
138:          0          0  MT_SYSIRQ 219 Level     1b007000.dma-controller
139:   40526140          0  MT_SYSIRQ 214 Level     mt7915e
142:   13794927          0  MT_SYSIRQ 224 Level     1b100000.ethernet
143:        290    1852388  MT_SYSIRQ 225 Level     1b100000.ethernet
146:          0          0     dummy   0 Edge      PCIe PME
147:          0          0    mt7530   0 Edge      mt7530-0:00
148:          0          0    mt7530   1 Edge      mt7530-0:01
149:          0          0    mt7530   2 Edge      mt7530-0:02
150:          0          0    mt7530   3 Edge      mt7530-0:03
151:          0          1    mt7530   4 Edge      mt7530-0:04
IPI0:   1039635    2000396       Rescheduling interrupts
IPI1:  13525940   45954702       Function call interrupts
IPI2:         0          0       CPU stop interrupts
IPI3:         0          0       CPU stop (for crash dump) interrupts
IPI4:         0          0       Timer broadcast interrupts
IPI5:   1487744    8024278       IRQ work interrupts
IPI6:         0          0       CPU wake-up interrupts
Err:          0

I am using a Belkin RT3200 as a router with a PPPoE connection and mostly WLAN clients, so WED is enabled. All variants I tried have some issues:

  • master main mt76 -> timeout issue #690
  • rany2's mt76 -> 100% CPU after a while (several days)
  • main mt76 + some specific patches (TP-Link WA firmware, VHT on 2.4 GHz + fix command timeout in AP stop period) -> 100% CPU; restart needed to fix. This is the build I'm using now, and I never observed the UDP flood bug in my situation, only the timeout or 100% CPU issues.

What remains for me to test is main mt76 + this patch, which I hope will fix the timeout issue without being affected by the 100% CPU usage.

@lukasz1992

@ggg70 Could you check this file twice, the second time 10 seconds after the first?
My idea is to compare the numbers between calls, not just the static values.
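The two-snapshot comparison suggested here can be sketched as follows. irq_counts and irq_deltas are hypothetical helpers: they sum the per-CPU columns of /proc/interrupts-style text and report the increase per IRQ between two reads.

```python
def irq_counts(text):
    """Map IRQ label -> total count across CPUs from /proc/interrupts text."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue  # skip the CPU header row
        label = fields[0].rstrip(":")
        total = 0
        for field in fields[1:]:
            if not field.isdigit():
                break  # counts end at the first non-numeric column
            total += int(field)
        counts[label] = total
    return counts


def irq_deltas(before, after):
    """Per-IRQ count increase between two snapshots."""
    b, a = irq_counts(before), irq_counts(after)
    return {label: a[label] - b[label] for label in a if label in b}
```

Usage: read /proc/interrupts, sleep 10 seconds, read it again, then print irq_deltas(first, second) to see which source (e.g. the mt7915e line) is climbing fastest.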

rany2 commented Jun 2, 2023

rany2's mt76 -> 100% CPU after a while (several days)

what process was responsible for it? ksoftirqd or the tx worker?

ggg70 commented Jun 2, 2023

what process was responsible for it? ksoftirqd or the tx worker?

ksoftirqd. I will try again with your latest mt76 version to see if it still happens... but we might have to wait a few days.

ggg70 commented Jun 6, 2023

ksoftirqd. I will try again with your latest mt76 version to see if it still happens... but we might have to wait a few days.

@rany2 OK, after almost 4 days, back to 100% CPU from ksoftirqd using your mt76 version. Here are some paired readings from interrupts, in case they tell you something:

 10:   17854275   15025655     GICv2  30 Level     arch_timer
 10:   17860521   15031962     GICv2  30 Level     arch_timer

131:      21776          0  MT_SYSIRQ  96 Level     mtk-snand
131:      21783          0  MT_SYSIRQ  96 Level     mtk-snand

132:      82566          0  MT_SYSIRQ  95 Level     mtk-ecc
132:      82586          0  MT_SYSIRQ  95 Level     mtk-ecc

134:    4144772          0  MT_SYSIRQ 211 Level     mt7615e
134:    4145319          0  MT_SYSIRQ 211 Level     mt7615e

139:   30191900          0  MT_SYSIRQ 214 Level     mt7915e
139:   30197192          0  MT_SYSIRQ 214 Level     mt7915e

142:    8407271          0  MT_SYSIRQ 224 Level     1b100000.ethernet
142:    8408206          0  MT_SYSIRQ 224 Level     1b100000.ethernet

143:        244    1344989  MT_SYSIRQ 225 Level     1b100000.ethernet
143:        244    1345620  MT_SYSIRQ 225 Level     1b100000.ethernet

IPI0:    479847    1067299       Rescheduling interrupts
IPI0:    481092    1069887       Rescheduling interrupts

Anything else I can do?

rany2 commented Jun 6, 2023

@ggg70 I have a similar issue; it's been occurring with the mainline kernel version as well. When an iperf3 test is carried out, there seems to be an onslaught of IRQ activity around that time (with softirq taking up multiple CPU cores). I think it's the same issue/theme.

When that happens, TX takes a hit: instead of >800 Mbit/s it drops to 400 Mbit/s and often worse. A reboot sometimes fixes it and it gets back to a nice 800+ Mbit/s; however, for me this happens immediately, not after many days.

mrkiko commented Jun 24, 2023

Is this reproducible with the latest HEAD?

rany2 commented Jun 24, 2023

@mrkiko Yep. Seeing that it is not reproducible with TP-Link's WA firmware, it's probably firmware- and not driver-specific, UNLESS the driver could take some steps to prevent this issue from occurring.

mrkiko commented Jun 24, 2023 via email

rany2 commented Feb 9, 2024

This issue doesn't occur for me anymore, so I'm closing it.

@rany2 rany2 closed this as completed Feb 9, 2024
rany2 commented Mar 9, 2024

I was able to reproduce it again on the latest master, it's still relevant.

@rany2 rany2 reopened this Mar 9, 2024
@lukasz1992

@rany2 are you able to reproduce it with this firmware? https://github.com/cmonroe/feed-wifi-master/tree/smartrg-master/mt76/files/firmware

rany2 commented Mar 11, 2024

@lukasz1992 I'll try this firmware. Related: I haven't had any success with this patch #690 (comment)... maybe I'm imagining things, but with it applied the issue does happen less frequently; I was still able to reproduce it regardless.

rany2 commented Mar 11, 2024

@lukasz1992 No luck, that firmware doesn't help.

rany2 commented Mar 11, 2024

Either way, it does seem like a firmware issue, seeing that rany2/mt76@07be1c7 still works as a workaround.

rany2 commented Mar 17, 2024

It seems like with the broadcast AQL patches, this issue is resolved. I can't trigger it anymore. Hopefully I don't reopen this again, but so far so good!

@rany2 rany2 closed this as completed Mar 17, 2024
9 participants