mt7981 - mt798x-wmac 18000000.wifi: Message 00005aed timeout #860

Closed
lukasz1992 opened this issue Feb 15, 2024 · 20 comments

@lukasz1992

Hello, with the most recent version of mt76 in OpenWRT 23.05.2 I get temporary WiFi outages (every 2-4 weeks): both SSIDs (2.4GHz and 5GHz) disappear. The driver/fw/hw recovers on its own within a minute.

[559161.263353] mt798x-wmac 18000000.wifi: Message 00005aed (seq 10) timeout
[1563791.313702] mt798x-wmac 18000000.wifi: Message 00005aed (seq 1) timeout
[2326012.477521] mt798x-wmac 18000000.wifi: Message 00005aed (seq 4) timeout
[2326776.176547] mt798x-wmac 18000000.wifi: Message 000026ed (seq 1) timeout
[2326776.183360] mt798x-wmac 18000000.wifi: Message 00005aed (seq 2) timeout

The common factor is message 0x5A, which is GET_MIB_INFO (I'd guess the 0x26 request timed out because of the next 0x5A request).
The same problem was reported with the mt7915 chip; @rany2 found a workaround, which is to revert d9dd763.

I will try it and report if it helps.
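
For anyone who wants to check their own logs the same way, here is a minimal Python sketch that pulls the command ID out of these timeout lines. It assumes, based only on the reading above (00005aed is 0x5A, 000026ed is 0x26), that the command ID sits in bits 8-15 of the message value and that the low byte is the constant 0xed. The function name `parse_timeout` and the returned field names are made up for illustration and are not part of mt76.

```python
import re

# Matches dmesg/logread lines such as:
# [559161.263353] mt798x-wmac 18000000.wifi: Message 00005aed (seq 10) timeout
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<dev>.+?): "
    r"Message (?P<msg>[0-9a-fA-F]{8}) \(seq (?P<seq>\d+)\) timeout"
)


def parse_timeout(line):
    """Return the decoded fields of a timeout line, or None if it is not one."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    raw = int(m.group("msg"), 16)
    return {
        "uptime_s": float(m.group("ts")),   # seconds since boot
        "device": m.group("dev"),
        "low_byte": raw & 0xFF,             # 0xed in every line of this thread
        "cmd_id": (raw >> 8) & 0xFF,        # ASSUMPTION: bits 8-15 hold the command ID
        "seq": int(m.group("seq")),
    }


if __name__ == "__main__":
    sample = "[559161.263353] mt798x-wmac 18000000.wifi: Message 00005aed (seq 10) timeout"
    info = parse_timeout(sample)
    print(hex(info["cmd_id"]), info["seq"])
```

Feeding it the output of dmesg or logread line by line and counting `cmd_id` values should show quickly whether 0x5A dominates on a given device.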

@Fail-Safe

Fail-Safe commented Mar 21, 2024

@lukasz1992 Any chance your issue and this could be related? #866

cc: @rany2

@lukasz1992
Author

lukasz1992 commented Mar 23, 2024

@Fail-Safe Crash log looks exactly the same.

I compile mt76 on my own. I reverted commit d9dd763, and now I have 0 crashes.

@rany2
Contributor

rany2 commented Mar 23, 2024

You could use this patch from my tree if you don't want to revert it yourself: https://raw.githubusercontent.com/rany2/openwrt/4ad7ed1d5d5b77d32999c9314cc1cec3ee2a9724/package/kernel/mt76/patches/9009-wifi-mt76-mt7915-do-not-use-event-format-to-get-.patch

@Fail-Safe

So with your patch, @rany2, you're able to run with multicast_to_unicast_all enabled without any crashes?

@rany2
Contributor

rany2 commented Mar 23, 2024

@Fail-Safe I never had to daily drive multicast_to_unicast_all, so I don't know. I guess you'll have to figure that out yourself.

@lukasz1992
Author

After ~1.5 months of testing I can say that reverting d9dd763 is a working workaround.
My WiFi is now rock solid, with no more MCU timeouts.

@xize

xize commented Apr 15, 2024

@rany2
Could this patch also fix certain multi-PSK situations as well?

I noticed that when I use wifi-station and wifi-vlan, multicast is active by default.

Here is the relevant part:

open stacktrace
[  658.714376] br-lan: port 12(phy1-ap0-aya) entered disabled state
[  658.720518] mt798x-wmac 18000000.wifi phy1-ap0-aya: entered allmulticast mode
[  658.727865] mt798x-wmac 18000000.wifi phy1-ap0-aya: entered promiscuous mode
[  658.738574] mt798x-wmac 18000000.wifi phy1-ap0-aya: left allmulticast mode
[  658.745449] mt798x-wmac 18000000.wifi phy1-ap0-aya: left promiscuous mode
[  658.752359] br-lan: port 12(phy1-ap0-aya) entered disabled state
[  658.778774] br-lan: port 12(phy1-ap0-aya) entered blocking state
[  658.784779] br-lan: port 12(phy1-ap0-aya) entered disabled state
[  658.790816] mt798x-wmac 18000000.wifi phy1-ap0-aya: entered allmulticast mode
[  658.798121] mt798x-wmac 18000000.wifi phy1-ap0-aya: entered promiscuous mode
[  658.995820] br-lan: port 6(phy1-ap0) entered blocking state
[  659.001419] br-lan: port 6(phy1-ap0) entered forwarding state
[  659.007405] br-lan: port 12(phy1-ap0-aya) entered blocking state
[  659.013402] br-lan: port 12(phy1-ap0-aya) entered forwarding state
[ 1804.642711] mt798x-wmac 18000000.wifi phy1-ap0: left allmulticast mode
[ 1804.649296] mt798x-wmac 18000000.wifi phy1-ap0: left promiscuous mode
[ 1804.655785] br-lan: port 6(phy1-ap0) entered disabled state
[ 1805.289028] br-lan: port 12(phy1-ap0-aya) entered disabled state
[ 1805.548024] br-lan: port 12(phy1-ap0-aya) entered disabled state
[ 1805.587960] mt798x-wmac 18000000.wifi phy1-ap0-aya (unregistering): left allmulticast mode
[ 1805.596217] mt798x-wmac 18000000.wifi phy1-ap0-aya (unregistering): left promiscuous mode
[ 1805.604429] br-lan: port 12(phy1-ap0-aya) entered disabled state
[ 1806.306349] br-lan: port 6(phy1-ap0) entered blocking state
[ 1806.311964] br-lan: port 6(phy1-ap0) entered disabled state
[ 1806.317576] mt798x-wmac 18000000.wifi phy1-ap0: entered allmulticast mode
[ 1806.324532] mt798x-wmac 18000000.wifi phy1-ap0: entered promiscuous mode
[ 1806.332261] br-lan: port 6(phy1-ap0) entered blocking state
[ 1806.337830] br-lan: port 6(phy1-ap0) entered forwarding state
[ 1806.346761] br-lan: port 12(phy1-ap0-aya) entered blocking state
[ 1806.352791] br-lan: port 12(phy1-ap0-aya) entered disabled state
[ 1806.358825] mt798x-wmac 18000000.wifi phy1-ap0-aya: entered allmulticast mode
[ 1806.366118] mt798x-wmac 18000000.wifi phy1-ap0-aya: entered promiscuous mode
[ 1806.374153] br-lan: port 12(phy1-ap0-aya) entered blocking state
[ 1806.380157] br-lan: port 12(phy1-ap0-aya) entered forwarding state
[ 1882.836808] br-lan: port 12(phy1-ap0-aya) entered disabled state
[ 1882.896545] mt798x-wmac 18000000.wifi phy1-ap0-aya (unregistering): left allmulticast mode
[ 1882.904804] mt798x-wmac 18000000.wifi phy1-ap0-aya (unregistering): left promiscuous mode
[ 1882.912980] br-lan: port 12(phy1-ap0-aya) entered disabled state
[ 1883.016297] mt798x-wmac 18000000.wifi phy1-ap0: left allmulticast mode
[ 1883.022836] mt798x-wmac 18000000.wifi phy1-ap0: left promiscuous mode
[ 1883.029359] br-lan: port 6(phy1-ap0) entered disabled state
[ 1883.872420] br-lan: port 6(phy1-ap0) entered blocking state
[ 1883.878016] br-lan: port 6(phy1-ap0) entered disabled state
[ 1883.883607] mt798x-wmac 18000000.wifi phy1-ap0: entered allmulticast mode
[ 1883.890650] mt798x-wmac 18000000.wifi phy1-ap0: entered promiscuous mode
[ 1883.898428] br-lan: port 6(phy1-ap0) entered blocking state
[ 1883.903999] br-lan: port 6(phy1-ap0) entered forwarding state
[ 1884.085102] br-lan: port 6(phy1-ap0) entered disabled state
[ 1904.462971] br-lan: port 12(phy1-ap0-aya) entered blocking state
[ 1904.468990] br-lan: port 12(phy1-ap0-aya) entered disabled state
[ 1904.475115] mt798x-wmac 18000000.wifi phy1-ap0-aya: entered allmulticast mode
[ 1904.482463] mt798x-wmac 18000000.wifi phy1-ap0-aya: entered promiscuous mode
[ 1904.492963] mt798x-wmac 18000000.wifi phy1-ap0-aya: left allmulticast mode
[ 1904.499860] mt798x-wmac 18000000.wifi phy1-ap0-aya: left promiscuous mode
[ 1904.506708] br-lan: port 12(phy1-ap0-aya) entered disabled state
[ 1904.563556] br-lan: port 12(phy1-ap0-aya) entered blocking state
[ 1904.569560] br-lan: port 12(phy1-ap0-aya) entered disabled state
[ 1904.575623] mt798x-wmac 18000000.wifi phy1-ap0-aya: entered allmulticast mode
[ 1904.582912] mt798x-wmac 18000000.wifi phy1-ap0-aya: entered promiscuous mode
[ 1904.787762] br-lan: port 6(phy1-ap0) entered blocking state
[ 1904.793349] br-lan: port 6(phy1-ap0) entered forwarding state
[ 1904.799282] br-lan: port 12(phy1-ap0-aya) entered blocking state
[ 1904.805296] br-lan: port 12(phy1-ap0-aya) entered forwarding state
[54243.578836] mt798x-wmac 18000000.wifi: Message 000026ed (seq 5) timeout
[54264.036475] mt798x-wmac 18000000.wifi: Message 00005aed (seq 6) timeout
[54284.495848] mt798x-wmac 18000000.wifi: Message 000026ed (seq 7) timeout
[54284.502531] ------------[ cut here ]------------
[54284.507134] WARNING: CPU: 2 PID: 13913 at ___ieee80211_stop_tx_ba_session+0x2b4/0x2f4 [mac80211]
[54284.515952] Modules linked in: pppoe ppp_async nft_fib_inet nf_flow_table_inet wireguard pppox ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_compat nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack_netlink nf_conntrack mt7915e(O) mt76_connac_lib(O) mt76(O) mac80211(O) libchacha20poly1305 iptable_mangle iptable_filter ipt_REJECT ipt_ECN ip_tables chacha_neon cfg80211(O) xt_time xt_tcpudp xt_tcpmss xt_statistic xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_ecn xt_dscp xt_comment xt_TCPMSS xt_LOG xt_HL xt_DSCP xt_CLASSIFY x_tables slhc sch_cake poly1305_neon nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcurve25519_generic libcrc32c libchacha compat(O) crypto_safexcel sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact
[54284.516113]  ip6_gre ip_gre gre ifb ip6_tunnel tunnel6 ip_tunnel vxlan udp_tunnel ip6_udp_tunnel sha512_arm64 sha1_ce sha1_generic seqiv md5 geniv des_generic libdes authencesn authenc leds_gpio xhci_plat_hcd xhci_pci xhci_mtk_hcd xhci_hcd gpio_button_hotplug(O) usbcore usb_common aquantia
[54284.631088] CPU: 2 PID: 13913 Comm: kworker/u8:0 Tainted: G           O       6.6.25 #0
[54284.639070] Hardware name: GL.iNet GL-MT6000 (DT)
[54284.643757] Workqueue: phy1 ieee80211_ba_session_work [mac80211]
[54284.649773] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[54284.656715] pc : ___ieee80211_stop_tx_ba_session+0x2b4/0x2f4 [mac80211]
[54284.663332] lr : ___ieee80211_stop_tx_ba_session+0x1d4/0x2f4 [mac80211]
[54284.669947] sp : ffffffc08c60bc80
[54284.673245] x29: ffffffc08c60bc80 x28: 0000000000000001 x27: ffffff8005c32780
[54284.680361] x26: ffffff800782c3b8 x25: ffffff80050408a0 x24: ffffff80050408a0
[54284.687477] x23: ffffffc0790aed10 x22: ffffff800782e0e8 x21: 0000000000000001
[54284.694592] x20: ffffff8005c32780 x19: ffffff800782c000 x18: 0000000000000028
[54284.701707] x17: 0000000000000001 x16: 0000000000006f68 x15: ffffff80050408a6
[54284.708823] x14: ffffffc08104ac90 x13: 0000000000000000 x12: 0000000000000002
[54284.715939] x11: 0000000000000040 x10: ffffffc080b57470 x9 : ffffffc080b57468
[54284.723056] x8 : 0000000000000002 x7 : 000000000000b8c1 x6 : 0000009f9d0fdd46
[54284.730171] x5 : 0000000001000000 x4 : 0000000000000000 x3 : 0000000000000000
[54284.737286] x2 : 0000000000000001 x1 : 0000000000000002 x0 : 00000000fffffff4
[54284.744403] Call trace:
[54284.746834]  ___ieee80211_stop_tx_ba_session+0x2b4/0x2f4 [mac80211]
[54284.753106]  ieee80211_ba_session_work+0x418/0x444 [mac80211]
[54284.758854]  process_one_work+0x154/0x2a0
[54284.762854]  worker_thread+0x2a8/0x484
[54284.766589]  kthread+0xdc/0xe8
[54284.769630]  ret_from_fork+0x10/0x20
[54284.773192] ---[ end trace 0000000000000000 ]---

The only custom change I made to mt76 was this patch from Lean's OpenWrt: https://github.com/coolsnowwolf/lede/blob/master/package/kernel/mt76/patches/001-allow-vht-on-2g.patch, though the crash happened on 5GHz for me, and of all my network devices only my Ayaneo Geek 1S (Intel AX210) seems to be the troublemaker.

Reading this issue about multicast, and especially this line: [ 658.720518] mt798x-wmac 18000000.wifi phy1-ap0-aya: entered allmulticast mode, makes me think my crash is caused by this issue as well?

I compile my own openwrt based on test kernel 6.6, here is my fork: https://github.com/xize/openwrt-flint2-testing

-edit-

@rany2 I have tested this patch, but it doesn't work in my particular topology. Interestingly, I have several multi-PSK networks, mainly with HomeWizard and Aqara devices, and these do not seem to crash the mt76 driver (it can run for a week or more). But when I use my Ayaneo Geek 1S (Intel AX210), it crashes. Sometimes that happens immediately; other times it takes a while, mainly with P2P UDP traffic from GTA Online. After I have crashed it a few times, it starts working for a longer period. It is a very confusing issue and I am clueless about where to look. Do you have any idea? I also realized that the multi-PSK phys all get allmulticast mode enabled as well.

^ I also tried playing around with the Intel AX210 driver to see whether WoWLAN or turning SMPS off would make a difference, but to no avail.

@lukasz1992
Author

blocktrron@7447213 ?

@graysky2

graysky2 commented Jul 3, 2024

@nbd168 - what do you think about this? My flogic/xiaomi_redmi-router-ax6000-ubootmod is experiencing this as well.

I compile mt76 on my own. I reverted commit d9dd763, and now I have 0 crashes.

dmesg:
[ 3395.669339] mt798x-wmac 18000000.wifi: Message 00005aed (seq 3) timeout

logread:
Sun Jun 30 00:56:36 2024 kern.err kernel: [ 3395.669339] mt798x-wmac 18000000.wifi: Message 00005aed (seq 3) timeout

Many times, the client is just disconnected but I recently found that one of my SSIDs went down and would not recover without a reboot. I am on a snapshot I built several days ago.

@graysky2

graysky2 commented Jul 3, 2024

@lukasz1992 -

After ~1.5 months of testing I can say that reverting d9dd763 is a working workaround.

Do you have an updated revert of that commit? It no longer reverts cleanly:

% git revert d9dd7635b0551839d2dc5544855c1ebb5205c800
Auto-merging mt7915/init.c
CONFLICT (content): Merge conflict in mt7915/init.c
Auto-merging mt7915/mac.c
CONFLICT (content): Merge conflict in mt7915/mac.c
Auto-merging mt7915/mcu.c
CONFLICT (content): Merge conflict in mt7915/mcu.c
Auto-merging mt7915/mcu.h
CONFLICT (content): Merge conflict in mt7915/mcu.h
Auto-merging mt7915/mt7915.h
CONFLICT (content): Merge conflict in mt7915/mt7915.h
Auto-merging mt7915/regs.h
CONFLICT (content): Merge conflict in mt7915/regs.h
error: could not revert d9dd7635... mt76: mt7915: use mt7915_mcu_get_mib_info() to get survey data
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config advice.mergeConflict false"

% git status
On branch master
Your branch is up to date with 'origin/master'.

You are currently reverting commit d9dd7635.
  (fix conflicts and run "git revert --continue")
  (use "git revert --skip" to skip this patch)
  (use "git revert --abort" to cancel the revert operation)

Unmerged paths:
  (use "git restore --staged <file>..." to unstage)
  (use "git add <file>..." to mark resolution)
	both modified:   mt7915/init.c
	both modified:   mt7915/mac.c
	both modified:   mt7915/mcu.c
	both modified:   mt7915/mcu.h
	both modified:   mt7915/mt7915.h
	both modified:   mt7915/regs.h

no changes added to commit (use "git add" and/or "git commit -a")

@graysky2

graysky2 commented Jul 3, 2024

Thanks @lukasz1992 - I am building from master, it seems your rebase is for a stable release. Do you also build from master/have a patch for that branch?

EDIT: my bad, this applies.

graysky2/openwrt@83c9baf

graysky2 added a commit to graysky2/mt76 that referenced this issue Jul 3, 2024
@lukasz1992
Author

Thanks @lukasz1992 - I am building from master, it seems your rebase is for a stable release. Do you also build from master/have a patch for that branch?

EDIT: my bad, this applies.

graysky2/openwrt@83c9baf

I base my build on the OpenWrt stable release, but for mt76 I use master.
PS: there were some reports that my fix does not solve the issue in all cases.

@lukasz1992
Author

^ I also tried playing around with the Intel AX210 driver to see whether WoWLAN or turning SMPS off would make a difference, but to no avail.

openwrt/openwrt#15824 ?

@graysky2

graysky2 commented Jul 3, 2024

@lukasz1992

I base my build on the OpenWrt stable release, but for mt76 I use master.
PS: there were some reports that my fix does not solve the issue in all cases.

That's OK. I am not entirely sure what I am seeing is the same issue. See my description here: #860 (comment)

@degen91

degen91 commented Jul 3, 2024

@graysky2

That's OK. I am not entirely sure what I am seeing is the same issue. See my description here: #860 (comment)

If using multicast_to_unicast_all or VLAN, potentially it can be #866 or #881, which may be connected. Some interesting comments -- #866 (comment) and #881 (comment) (potential workaround for VLANs). I bring them up because they can cause disconnects or crashes with similar timeout messages.

@graysky2

graysky2 commented Jul 3, 2024

@degen91 - thanks for the reply. I am not using multicast_to_unicast_all but I am using VLANs with the xiaomi_redmi-router-ax6000-ubootmod as a dumb access point.

@graysky2

graysky2 commented Jul 5, 2024

@lukasz1992 - the patch you shared does not seem to fix the issue for me. This time wifi did not crash but I found the same output in dmesg:

[10745.991134] mt798x-wmac 18000000.wifi: Message 00005aed (seq 10) timeout

and in logread

Fri Jul  5 14:40:00 2024 kern.err kernel: [10745.991134] mt798x-wmac 18000000.wifi: Message 00005aed (seq 10) timeout

@CAMOBAP

CAMOBAP commented Sep 8, 2024

I recently started to observe the same issue on my OpenWRT (23.05.2 r23630-842932a63d)

[137054.969466] mt7915e 0000:06:10.0: Message 00005aed (seq 2) timeout
...
[137259.763160] mt7915e 0000:06:10.0: Message 00005aed (seq 15) timeout
[137054.969466] mt7915e 0000:06:10.0: Message 00005aed (seq 1) timeout
...
[137259.763160] mt7915e 0000:06:10.0: Message 00005aed (seq 15) timeout
...

So the seq id reaches a maximum of 15 and then starts over again.

In my case it looks related to some hardware failure.
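
The wrap-around at 15 is what a 4-bit sequence counter that skips zero would produce. The Python sketch below only illustrates that behaviour as seen in the log; it is an assumption and has not been checked against the driver source in this thread. The name `next_seq` is hypothetical.

```python
def next_seq(seq):
    """Advance a 4-bit sequence number, skipping 0 (ASSUMED behaviour)."""
    seq = (seq + 1) & 0xF   # keep only the low 4 bits: 0..15
    if seq == 0:            # 0 is skipped, so 15 is followed by 1
        seq = 1
    return seq


if __name__ == "__main__":
    seq = 0
    values = []
    for _ in range(17):
        seq = next_seq(seq)
        values.append(seq)
    print(values)  # 1 through 15, then 1 and 2 again
```

If that is what happens here, a run of timeouts with seq going 1, 2, ... 15, 1, ... would mean every single command is timing out rather than one particular command being retried.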

@LuisMitaHL

MediaTek recently released a new wifi firmware fixing various bugs: #881

You can try a SNAPSHOT OpenWrt firmware (this includes the latest wifi fw), or you can download the new wifi fw to /lib/firmware/mediatek/ from https://github.com/openwrt/mt76/tree/master/firmware
