mt76x2e/mt7603e: WiFi locks up requiring reboot #198

Closed
matt7aylor opened this issue Oct 3, 2018 · 7 comments


@matt7aylor

I have been experiencing an issue with mt76 hardware since OpenWrt 18.06 (I didn't see the problem with 17.04). Every so often the WiFi stops responding, and any attempt to interact with it (changing the channel, or even just running ifconfig) hangs forever. My guess is that any interaction that uses netlink communication hangs. The only fix appears to be rebooting the device. I can't find an obvious cause and there are no particularly relevant log messages, although the devices do often show messages like this one:

mt76x2e 0000:01:00.0: MCU message 3 (seq 7) timed out
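To check how often such messages appear and whether they cluster before a lockup, a saved kernel log can be tallied offline. The following is a minimal sketch, assuming the log has been copied off the router as plain text; the function name `count_mcu_timeouts` is made up for illustration, and the pattern is derived only from the single message quoted above.

```python
import re
from collections import Counter

# Pattern built from the one line quoted in this report; other mt76
# messages may be worded differently.
MCU_TIMEOUT = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"MCU message (?P<msg>\d+) \(seq (?P<seq>\d+)\) timed out"
)


def count_mcu_timeouts(log_text):
    """Count MCU timeout lines per (driver, PCI device, message id)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MCU_TIMEOUT.search(line)
        if match:
            key = (match["driver"], match["device"], int(match["msg"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = "mt76x2e 0000:01:00.0: MCU message 3 (seq 7) timed out"
    for key, n in count_mcu_timeouts(sample).items():
        print(key, n)
```

A steadily growing count for one message id would suggest the timeouts are ongoing; it would not by itself show that they cause the hang.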

I have observed the same issue on multiple routers, all with the same mt76 hardware (ZBT WE1326); it does not happen with the non-mt76 hardware I have tested. I have also found similar-looking reports from others in seemingly unrelated issues, e.g. the one originally posted by @easyteacher in #188 (comment).

I am currently testing with OpenWRT 18.06 git snapshot r7332-2163b4936e (includes mt76 updates made 01 Oct) and still have the same issue.

@matt7aylor
Author

The issue persists with the OpenWrt master branch compiled from the latest git on 31st October (including the latest mt76 drivers). However, the MCU timeout messages do seem to have been fixed, so they were potentially unrelated to the lockup.

The issue presents as a silent failure of the WiFi devices: any interaction with them, even just running ifconfig, hangs forever, and only a reboot restores functionality. If someone can recommend a way of getting additional debugging information, I'd be happy to try it.

@kirelagin

I think I am seeing this issue on my Xiaomi Mi WiFi Nano with OpenWrt 18.06.1, r7258-5eb055306f.

When this happens, tcpdump on the wireless interface looks like this:

17:19:04.317918 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:05.322228 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:06.362239 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:08.268090 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:09.322239 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:10.362243 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:11.619676 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:12.682226 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:13.722223 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:18.070372 IP 185.63.145.1.443 > 172.16.100.108.54755: Flags [.], seq 253:289, ack 410, win 15, options [nop,nop,TS val 1355765248 ecr 320151767], length 36
17:19:22.242125 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:23.082226 ARP, Request who-has 172.16.100.108 tell 172.16.100.1, length 28
17:19:23.242224 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:24.122226 ARP, Request who-has 172.16.100.108 tell 172.16.100.1, length 28
17:19:24.282225 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:25.162227 ARP, Request who-has 172.16.100.108 tell 172.16.100.1, length 28
17:19:28.694310 ARP, Request who-has 172.16.100.108 tell 172.16.100.1, length 28
17:19:29.722230 ARP, Request who-has 172.16.100.108 tell 172.16.100.1, length 28
17:19:30.762228 ARP, Request who-has 172.16.100.108 tell 172.16.100.1, length 28
17:19:33.363892 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:34.442227 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:35.482229 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:42.392233 IP 172.16.100.128.52544 > 104.238.19.81.5222: Flags [S], seq 2221626543, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 437526164 ecr 0,sackOK,eol], length 0
17:19:42.392476 IP 172.16.100.128.52543 > 104.238.19.81.80: Flags [S], seq 1924033833, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 437526164 ecr 0,sackOK,eol], length 0
17:19:42.392592 IP 172.16.100.128.52542 > 104.238.19.81.443: Flags [S], seq 2298650092, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 437526164 ecr 0,sackOK,eol], length 0
17:19:42.486043 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28
17:19:42.608942 IP 172.16.100.128.52545 > 192.241.190.55.443: Flags [S], seq 4159648810, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 437526481 ecr 0,sackOK,eol], length 0
17:19:42.831067 IP 172.16.100.128.52541 > 104.25.152.102.443: Flags [S], seq 91621987, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 437526703 ecr 0,sackOK,eol], length 0
17:19:42.922239 IP 172.16.100.108.54778 > 64.233.184.188.5228: Flags [S], seq 4004987576, win 65535, options [mss 1460,nop,wscale 5,nop,nop,TS val 320417023 ecr 0,sackOK,eol], length 0
17:19:42.922447 IP 172.16.100.108.54779 > 64.233.184.188.5228: Flags [S], seq 3834331790, win 65535, options [mss 1460,nop,wscale 5,nop,nop,TS val 320417111 ecr 0,sackOK,eol], length 0
17:19:43.009654 ARP, Request who-has 172.16.100.108 tell 172.16.100.1, length 28
17:19:43.562230 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28

That is, an occasional packet gets through, but mostly traffic just disappears. There is nothing interesting in logs and the stations do not get disassociated. I thought reloading the module might help; rmmod mt76x2e completed successfully, but rmmod mt7603e hangs forever.
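The pattern in the capture above (repeated ARP requests from the router, with no replies visible) can be quantified from saved tcpdump text. This is a rough sketch, assuming the capture was saved to a text file in tcpdump's default output format; `arp_summary` is a hypothetical helper name, and the reply pattern follows tcpdump's usual `ARP, Reply <ip> is-at <mac>` wording, which does not occur in the capture quoted here.

```python
import re
from collections import Counter

ARP_REQUEST = re.compile(r"ARP, Request who-has (\S+) tell (\S+), length")
ARP_REPLY = re.compile(r"ARP, Reply (\S+) is-at (\S+), length")


def arp_summary(capture_text):
    """Return {target_ip: (requests_seen, replies_seen)} for a capture."""
    requests = Counter()
    replies = Counter()
    for line in capture_text.splitlines():
        request = ARP_REQUEST.search(line)
        if request:
            requests[request.group(1)] += 1
            continue
        reply = ARP_REPLY.search(line)
        if reply:
            replies[reply.group(1)] += 1
    # A Counter returns 0 for addresses that never replied.
    return {ip: (count, replies[ip]) for ip, count in requests.items()}


if __name__ == "__main__":
    sample = (
        "17:19:04.317918 ARP, Request who-has 172.16.100.128 tell 172.16.100.1, length 28\n"
        "17:19:23.082226 ARP, Request who-has 172.16.100.108 tell 172.16.100.1, length 28\n"
    )
    for ip, (asked, answered) in arp_summary(sample).items():
        print(f"{ip}: {asked} requests, {answered} replies")
```

Many requests with zero replies for a station that is still associated would be consistent with frames being lost between the router and the station, in one direction or both.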

I’ll try to build master next and see if it changes anything. Please let me know if there is anything else I can do to help diagnose this issue. (BTW, I think it is the same as #192.)

@araujorm

araujorm commented Nov 9, 2018

Hi, I have been having crashes once in a while with the latest master commits. Most of the time the crashlog has nothing useful, but this one may help since it shows mt76 function references:

<4>[33170.727113] Unhandled kernel unaligned access[#1]:
<4>[33170.736613] CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.79 #0
<4>[33170.748158] task: 8039f310 task.stack: 8039a000
<4>[33170.757114] $ 0   : 00000000 00000001 00000001 0000000d
<4>[33170.767459] $ 4   : 3c5a0002 00000001 00000000 83af5918
<4>[33170.777803] $ 8   : 0000c7c8 00000000 00000000 00000000
<4>[33170.788147] $12   : 096c0000 00000000 3fff0000 00000000
<4>[33170.798493] $16   : 83820e40 83b6ff40 00000013 83b70004
<4>[33170.808840] $20   : 00000001 00000006 83aeede0 00000000
<4>[33170.819184] $24   : 00000000 02cbd800                 
<4>[33170.829530] $28   : 8039a000 83809d88 00000000 80220044
<4>[33170.839878] Hi    : 00000015
<4>[33170.845567] Lo    : 00000000
<4>[33170.851279] epc   : 80220c1c skb_release_data+0x80/0x188
<4>[33170.861791] ra    : 80220044 __kfree_skb+0x14/0x28
<4>[33170.871263] Status: 1100f403      KERNEL EXL IE
<4>[33170.879545] Cause : 00800010 (ExcCode 04)
<4>[33170.887470] BadVA : 3c5a0016
<4>[33170.893160] PrId  : 00019655 (MIPS 24KEc)
<4>[33170.901082] Modules linked in: mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 cfg80211 compat leds_gpio gpio_button_hotplug
<4>[33170.924708] Process swapper (pid: 0, threadinfo=8039a000, task=8039f310, tls=00000000)
<4>[33170.940372] Stack : 01080020 01095220 83820e40 80220008 83820e40 00000000 83af4bc0 00000000
<4>[33170.956922]         00000001 80220044 000002e0 838203c0 838203c0 83b78440 83820e40 83b2618c
<4>[33170.973470]         0ee6b280 ffffffff 80420000 8005ecc8 8325ac00 83a44bd0 00000000 83809e88
<4>[33170.990019]         803a53e8 8039f310 00000000 00000003 000000c8 00000000 803bed08 00000000
<4>[33171.006566]         00000000 00000000 00000000 096c0000 00000000 00000000 c803c800 000000c7
<4>[33171.023114]         ...
<4>[33171.027945] Call Trace:
<4>[33171.032779] [<80220c1c>] skb_release_data+0x80/0x188
<4>[33171.042605] [<80220044>] __kfree_skb+0x14/0x28
<4>[33171.051555] [<83b2618c>] ieee80211_rx_napi+0x948/0x960 [mac80211]
<4>[33171.063683] [<83afa430>] mt76_rx_complete+0x134/0x174 [mt76]
<4>[33171.074894] [<83afa668>] mt76_rx_poll_complete+0x1f8/0x334 [mt76]
<4>[33171.086966] [<83af906c>] mt76_dma_tx_queue_skb+0xc88/0x10e0 [mt76]
<4>[33171.099199] Code: 1000fff2  24420001  8e640000 <8c820014> 30430001  10600002  00000000  2444ffff  24820010
<4>[33171.118513]
<4>[33171.121499] ---[ end trace 5f93fc887a1c1d08 ]---

===================================
Time: 1541752845.56225
<4>[33170.956922]         00000001 80220044 000002e0 838203c0 838203c0 83b78440 83820e40 83b2618c
<4>[33170.973470]         0ee6b280 ffffffff 80420000 8005ecc8 8325ac00 83a44bd0 00000000 83809e88
<4>[33170.990019]         803a53e8 8039f310 00000000 00000003 000000c8 00000000 803bed08 00000000
<4>[33171.006566]         00000000 00000000 00000000 096c0000 00000000 00000000 c803c800 000000c7
<4>[33171.023114]         ...
<4>[33171.027945] Call Trace:
<4>[33171.032779] [<80220c1c>] skb_release_data+0x80/0x188
<4>[33171.042605] [<80220044>] __kfree_skb+0x14/0x28
<4>[33171.051555] [<83b2618c>] ieee80211_rx_napi+0x948/0x960 [mac80211]
<4>[33171.063683] [<83afa430>] mt76_rx_complete+0x134/0x174 [mt76]
<4>[33171.074894] [<83afa668>] mt76_rx_poll_complete+0x1f8/0x334 [mt76]
<4>[33171.086966] [<83af906c>] mt76_dma_tx_queue_skb+0xc88/0x10e0 [mt76]
<4>[33171.099199] Code: 1000fff2  24420001  8e640000 <8c820014> 30430001  10600002  00000000  2444ffff  24820010
<4>[33171.118513]
<4>[33171.121499] ---[ end trace 5f93fc887a1c1d08 ]---
<0>[33171.132859] Kernel panic - not syncing: Fatal exception in interrupt
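When comparing several crashlogs like the one above, it can help to reduce each one to its call-trace frames and the module each frame belongs to. The following is a minimal sketch, assuming the crashlog text has been saved to a file; `parse_call_traces` is a name invented for this example, and the frame pattern is based only on the lines shown in this log.

```python
import re

# Matches frames such as:
#   [<83afa430>] mt76_rx_complete+0x134/0x174 [mt76]
FRAME = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\] "
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def parse_call_traces(log_text):
    """Return one list of (symbol, module) frames per 'Call Trace:' block."""
    traces = []
    current = None
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            current = []
            traces.append(current)
            continue
        if current is None:
            continue
        frame = FRAME.search(line)
        if frame:
            # Frames without a [module] suffix are in the core kernel.
            current.append((frame["symbol"], frame["module"] or "kernel"))
        else:
            current = None  # first non-frame line ends the trace
    return traces


if __name__ == "__main__":
    sample = (
        "<4>[33171.027945] Call Trace:\n"
        "<4>[33171.032779] [<80220c1c>] skb_release_data+0x80/0x188\n"
        "<4>[33171.063683] [<83afa430>] mt76_rx_complete+0x134/0x174 [mt76]\n"
        "<4>[33171.099199] Code: 1000fff2  24420001  8e640000 <8c820014>\n"
    )
    for trace in parse_call_traces(sample):
        print(" <- ".join(f"{sym} ({mod})" for sym, mod in trace))
```

Because the pasted crashlog repeats its call trace, this sketch would return two identical traces for it; comparing traces across separate crashes is the more useful application.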

@araujorm

araujorm commented Nov 9, 2018

One more thing: this happened on a router that only has the 2.4 GHz radio on; the 5 GHz radio is disabled.

@araujorm

Hello.

As I said in #193 (comment), this crash happened again, with more or less the same call trace, even after commit ffccb48.

Best regards

@nbd168
Member

nbd168 commented Nov 13, 2018

Please try the latest OpenWrt master.

@matt7aylor
Author

I have been using snapshot r8454-d965f41ac8, compiled just after the update above (13 Nov), on one device for a few weeks now and have not seen any recurrence of the issue. This looks like it may now be fixed in master. Thank you.
