mt7915e: AP mode: UDP flood kills driver/MCU communication #776
Comments
EDIT: wrong, this did not fix the issue. Note: I meant to say "AMPDU", not "AMSDU". I disabled it by applying the following patch:
diff --git a/mt7915/main.c b/mt7915/main.c
index 8ce7b1c5..98e37afa 100644
--- a/mt7915/main.c
+++ b/mt7915/main.c
@@ -777,6 +777,8 @@ mt7915_ampdu_action(struct ieee80211_hw *hw, struct ieee80211_vif *vif,
 	struct mt76_txq *mtxq;
 	int ret = 0;
+	return -EOPNOTSUPP;
+
 	if (!txq)
 		return -EINVAL; |
I would like to make an important update: it turns out there is no relation. This is not the issue. |
Well, perhaps the only connection between this issue and AMPDU is the degraded link speed, which makes this issue occur less often; obviously, disabling AMPDU gives you slower speeds. It might just be a memory corruption issue that builds up over time and eventually leads to this. I legitimately don't know, but I think it is likely, seeing how the behavior after this issue occurs varies tremendously. (Sometimes it recovers, sometimes not; sometimes it doesn't recover automatically; sometimes writing some value to sys_recovery fixes it; etc.) Edit: I should mention that by "it recovers" I mean L1 SER kicks in. |
Also important to note something I mentioned on another issue thread: this issue also occurs when running that spam_multicast.py script from the ethernet switch itself; HOWEVER, it ends up doing an L1 recovery every time. So perhaps some corruption in the RX buffers that over time takes its toll on the system?
|
Apologies for the constant corrections, but I tried once more to get it to crash from the ethernet switch and this time around it was not so lucky and was unable to recover itself. This whole ordeal is really inconsistent. |
@ryderlee1110 This node had L1 SER kicking in whenever it was crashing under normal circumstances (and was working OK) but when I ran the spam_multicast script, it tried to recover itself but ended up failing to do so. Starting from
|
@ryderlee1110 I think there is an issue with SER: MT_PCIE1_MAC_INT_ENABLE is used unconditionally. In mac_restart, the driver doesn't check whether mdev is an mt7915, so it uses MT_PCIE1_MAC_INT_ENABLE in cases where it should use MT_PCIE1_MAC_INT_ENABLE_MT7916. At least, that is the logic in pci.c. If this is unclear, check the patch I made, which fixes what I think is a possible issue: rany2/mt76@7cb022d. This didn't fix my issue, though. Something else I noticed is that after I do a full chip recovery by writing 7 to sys_recovery, hif no longer receives any interrupts. I thought the above would fix that, but apparently not. Regardless, I think it is worth checking whether this was intended or is indeed a bug. |
Should have mentioned that writing 7 to sys_recovery until memory consumption goes back to normal and then rmmod and insmod mt7915e fixes it. |
Updated the title to reflect the fact that it doesn't matter whether the flood originated from a STA or not: if there is a flood of any kind and the AP has to transmit that traffic to clients, it will die and this will be spammed to dmesg (provided fw wm debug is enabled):
|
This version I sourced from EAP615 WALL GPL source dump fixes the issue of random crashes, in particular these issues: - openwrt/mt76#776 - openwrt/mt76#690 Firmware downloaded from: https://static.tp-link.com/upload/gpl-code/2023/202305/20230515/eap615-wall_gpl_code.tar.gz Signed-off-by: rany <ranygh@riseup.net>
Just a heads up, I no longer face this issue when using this WA firmware I sourced from the EAP615 WALL source dump: rany2/mt76@78fd50c |
Which version did you downgrade from? |
These firmwares I tried had this issue:
However it still has the random MCU timeout issue like before, just that now UDP flood doesn't kill it and it recovers. |
@rany2 So:
|
@lukasz1992 No, it only recovers itself in this specific case (UDP flooding). I still have random driver hangs/MCU timeout. |
:( many thanks anyway |
Looking at another source dump, I noticed that there seems to be a distinction between three variants of the same firmware, with only the MT7915D variant working to fix this issue; however, OpenWrt/linux-firmware don't make this distinction. Could this be the issue? |
@rany2 Hi, the tools and scripts are missing in your source. Is this by design? |
@littoy It's missing because my mt76 is based on wireless-next which has some changes compared to OWRT variant. |
@littoy on your openwrt source tree, you need to make the following changes: rany2/openwrt@f52d443 just change PKG_SOURCE_VERSION to dee319231825423a9ac5135591a781e1e398267f |
👌, thank you very much. |
@ryderlee1110 hang is with MCUWA now, and the UDP flood is no longer relevant:
|
If I write 7 to sys_recovery, I get this after about a minute:
|
@rany2 because of this line https://github.com/openwrt/mt76/blob/969b7b5ebd129068ca56e4b0d831593a2f92382f/mmio.c#LL103C23-L103C36
|
@lukasz1992 honestly I think the stacktrace is wrong/deceitful... doesn't make sense |
Using WA firmware from EAP605 fixes the issue of UDP floods killing MCU communication. Firmware downloaded from: https://static.tp-link.com/upload/gpl-code/2023/202305/20230515/eap615-wall_gpl_code.tar.gz Fixes: openwrt/mt76#776 Signed-off-by: rany <ranygh@riseup.net>
@lukasz1992 After almost 4 days of working fine, the problem with 100% CPU usage appeared again. Here is the interrupts file:
I am using a Belkin RT3200 as a router with a PPPoE connection and mostly WLAN clients, so WED is enabled. All variants I tried have some issues:
What remains to be tested by me is main mt76 + this patch which I hope will fix the timeout issue and not be affected by the 100% CPU usage. |
@ggg70 Could you check this file twice, with the second reading taken about 10 seconds after the first? |
What process was responsible for it? ksoftirqd or the tx worker? |
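For anyone collecting the same data: a minimal Python sketch of comparing two readings of the interrupts file taken a few seconds apart. It assumes the usual /proc/interrupts layout (a header row of CPU columns, then one row per IRQ). The function names are made up for this example and are not part of mt76 or any existing tool.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {row label: count summed over CPUs}."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Per-CPU counters come first; rows such as ERR may have fewer columns.
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        name = " ".join(fields[len(counts):])
        key = f"{label.strip()} {name}".strip()
        totals[key] = sum(counts)
    return totals


def interrupt_deltas(before, after):
    """Return (row, increase) pairs between two snapshots, largest first."""
    old = parse_interrupts(before)
    new = parse_interrupts(after)
    deltas = [(key, new[key] - old.get(key, 0)) for key in new]
    return sorted((d for d in deltas if d[1] > 0), key=lambda d: d[1], reverse=True)


def sample_twice(path="/proc/interrupts", interval=10):
    """Read the file twice, `interval` seconds apart, and return the deltas."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return interrupt_deltas(before, after)
```

Calling `sample_twice()` on the router (if Python is installed there) would list the rows whose counters grew the most during the interval, which is the information two manual readings are meant to reveal.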
ksoftirqd. I will try again with your latest mt76 version to see if it still happens... but we might have to wait a few days. |
@rany2 OK, after almost 4 days, back to 100% CPU by ksoftirqd using your mt76 version. Here are some numbers from interrupts if they tell you something:
Anything else I can do? |
@ggg70 I have a similar issue; it has been occurring with the mainline kernel version as well. When an iperf3 test is carried out, there is an onslaught of IRQ activity at around that time (with softirq taking up multiple CPU cores). I think it's the same issue/theme. When that happens, TX takes a hit: instead of >800 Mbit/s it drops to 400 Mbit/s and often worse. A reboot sometimes fixes it and throughput gets back to a nice 800+ Mbit/s; however, in my case this happens immediately and not after many days. |
Is this reproducible with latest HEAD? |
@mrkiko Yep. Seeing that it is not reproducible with TP-Link's WA firmware, it's probably firmware-specific and not driver-specific, UNLESS the driver could take some steps to prevent this issue from occurring. |
Thanks a lot for your reply. Now I remember you explained this to me so my question was kinda useless. Sorry.
|
This issue doesn't occur for me anymore, so I'm closing it. |
I was able to reproduce it again on the latest master, it's still relevant. |
@rany2 are you able to reproduce it with this firmware? https://github.com/cmonroe/feed-wifi-master/tree/smartrg-master/mt76/files/firmware |
@lukasz1992 I'll try this firmware. Related: I haven't had any success with this patch #690 (comment). Maybe I'm imagining things, but with it applied the issue does happen less frequently; I was still able to repro regardless. |
@lukasz1992 No luck, that firmware doesn't help. |
Either way, it does seem like a firmware issue, seeing that rany2/mt76@07be1c7 still works as a workaround. |
It seems like with the broadcast AQL patches, this issue is resolved. I can't trigger it anymore. Hopefully I don't reopen this again, but so far so good! |
Issue found courtesy of @Brain2000.
This issue was brought up in #690, but I think it is worth keeping track of in its own issue, as it appears there are many different things that could trigger #690.
In essence, the following is a requirement to trigger this issue:
In order to trigger this I will provide you with the following code from @Brain2000.
File spam_multicast.py:
You could trigger it like so:
I've tried triggering this on MT76x2E, MT7610E and MT7628AN with no luck so I presume this is unique to mt7915e's firmware or kernel driver.
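For readers who don't have the script at hand, here is a minimal sketch of what a UDP multicast flood sender of this kind might look like. This is not @Brain2000's spam_multicast.py, which is not reproduced in this thread. The multicast group, port, packet count and payload size used below are placeholder assumptions, and the function names are made up for illustration.

```python
import socket


def make_multicast_socket(ttl=1):
    """Create a UDP socket for sending multicast; ttl=1 keeps packets on the local segment."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


def flood(sock, group, port, count, payload_size=1400):
    """Send `count` zero-filled datagrams to (group, port); return total bytes sent.

    All parameter values are assumptions chosen by the caller, not values
    taken from the original script.
    """
    payload = bytes(payload_size)
    sent = 0
    for _ in range(count):
        sent += sock.sendto(payload, (group, port))
    return sent


# Hypothetical usage, run from a wired host or a STA behind the AP under test:
#   sock = make_multicast_socket()
#   flood(sock, "239.1.2.3", 5000, count=1_000_000)
```

The socket is passed in rather than created inside `flood` so that the send loop can be exercised with a stand-in object before pointing it at a real network.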
Notes:
All the best