Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FS#3760 - Wifi to WAN traffic causes eth0 to crash with a NETDEV WATCHDOG #8777

Open
openwrt-bot opened this issue Apr 29, 2021 · 3 comments
Open
Labels
flyspray release/21.02

Comments

@openwrt-bot
Copy link

@openwrt-bot openwrt-bot commented Apr 29, 2021

stevenj:

I have been testing OpenWRT 21.02-RC1 stock released firmware image on the Newifi-D2 target (MT7621 based).

Sporadically while using the router, i experience the WAN interface crashing with NETDEV WATCHDOG errors.

I am using the released image for OpenWrt 21.02-rc1 for the Newifi-D2 target
I have also experienced this fault with a custom router build based on MT7621 for a different target, the fault appears general to MT7621 not this target.

I have loaded 19.07.7 stock firmware on this same hardware and the fault does not occur in that firmware version. A colleague has replicated this fault in 21.02-RC1 with the same type of target (physically different unit), but in a different environment with no shared network resources.

Other than setting up the wifi ssid and password and enabling the 5ghz radio, the network settings are stock.

I have worked out a 90% reliable test case to reproduce the fault:

Have a client device connected to the router on WiFi 5ghz
And a local Iperf3 server running on the wan side of the wired interface.

On the server run:

iperf3 -s

//Note the servers IP address//

On the client run:

iperf3 -c -P 50

//Notes:
Iperf3 server needs to be local on a 1gbps wired link. With an internet based Iperf3 server (or one with a slower connection) the error is much harder to trigger, it will happen but its much more random.
The number of connections does not need to be 50, however, the lower the number of connections the less reliable the error triggers.
The WiFi client should be getting about 300-400mbps throughput over the wifi link when its working correctly.//

On most runs, the iperf3 performance will drop to 0mbps, logread on the router then reveals:

Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.065748] ------------[ cut here ]------------
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.070403] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:448 0x8047e780
Thu Apr 29 03:58:44 2021 kern.info kernel: [ 578.077443] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.084412] Modules linked in: pppoe ppp_async iptable_nat xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD xt_CT pppox ppp_generic nf_nat nf_flow_table_hw nf_flow_table nf_conntrack_rtcache nf_conntrack mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG slhc nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_filter ip_tables crc_ccitt compat ledtrig_usbport nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.147503] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.4.111 #0
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.153490] Stack : 00000000 80840000 ffffffff 8007d6e0 00000000 00000000 00000000 00000000
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.161831] 00000000 00000000 00000000 00000000 00000000 00000001 8fc0fd50 8ba64c1c
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.170163] 8fc0fde8 00000000 00000000 00000000 00000038 805e1804 342e3520 3131312e
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.178486] 00000000 0000001c 00000000 0002402f 00000000 8fc0fd30 00000000 8047e780
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.186808] 00000009 00000001 00200000 00000122 00000003 80359e2c 00000004 80810004
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.195132] ...
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.197567] Call Trace:
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.197583] [<8007d6e0>] 0x8007d6e0
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.203512] [<805e1804>] 0x805e1804
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.206990] [<8047e780>] 0x8047e780
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.210464] [<80359e2c>] 0x80359e2c
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.213941] [<8000b05c>] 0x8000b05c
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.217409] [<8000b064>] 0x8000b064
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.220879] [<805c6f9c>] 0x805c6f9c
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.224351] [<8007d8ac>] 0x8007d8ac
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.227829] [<8002bfe8>] 0x8002bfe8
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.231302] [<8047e780>] 0x8047e780
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.234775] [<8002c0c0>] 0x8002c0c0
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.238257] [<8047e780>] 0x8047e780
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.241731] [<800a9018>] 0x800a9018
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.245212] [<8047e484>] 0x8047e484
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.248688] [<800965d4>] 0x800965d4
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.252161] [<8009681c>] 0x8009681c
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.255640] [<805e7d1c>] 0x805e7d1c
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.259116] [<80030768>] 0x80030768
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.262588] [<802f8404>] 0x802f8404
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.266064] [<80006c28>] 0x80006c28
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.269533]
Thu Apr 29 03:58:44 2021 kern.warn kernel: [ 578.271153] ---[ end trace eabfb050c5985bce ]---
Thu Apr 29 03:58:44 2021 kern.err kernel: [ 578.275799] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
Thu Apr 29 03:58:44 2021 kern.info kernel: [ 578.282602] mtk_soc_eth 1e100000.ethernet eth0: Link is Down
Thu Apr 29 03:58:44 2021 kern.info kernel: [ 578.318310] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
Thu Apr 29 03:58:44 2021 kern.info kernel: [ 578.326251] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx

This only occurs on the first instance of the error, the error will repeatedly occur but the only evidence in the logs after the first error is:

Thu Apr 29 04:33:36 2021 kern.err kernel: [ 2670.043971] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out Thu Apr 29 04:33:36 2021 kern.info kernel: [ 2670.050555] mtk_soc_eth 1e100000.ethernet eth0: Link is Down Thu Apr 29 04:33:36 2021 kern.info kernel: [ 2670.084619] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode Thu Apr 29 04:33:36 2021 kern.info kernel: [ 2670.092556] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
@openwrt-bot
Copy link
Author

@openwrt-bot openwrt-bot commented May 1, 2021

stevenj:

I just build from master with a 5.10.32 kernel and the problem persists:

Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.135261] ------------[ cut here ]------------ Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.139934] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:443 dev_watchdog+0x2fc/0x304 Sat May 1 02:38:18 2021 kern.info kernel: [ 304.148205] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.155132] Modules linked in: pppoe ppp_async iptable_nat xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD pppox ppp_generic nf_nat nf_flow_table nf_conntrack mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG slhc nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_filter ip_tables crc_ccitt compat ledtrig_usbport nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd gpio_button_hotplug usbcore nls_base usb_common Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.215126] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.32 #0 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.221147] Stack : 8072aa08 ffffffff 80830000 80082648 80830000 8072b958 00000000 00000000 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.229518] 8140bdac 80980000 807bf2a8 807bedc7 806fa7cc 00000001 8140bd50 8fe23c7d Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.237866] 00000000 00000000 806fa7cc 8140bbf0 ffffefff 80828764 00000000 00000000 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.246212] 00000000 0000000e 00000000 00034856 00000000 80830000 00000000 80700000 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.254556] 00000009 ffffffff 00000122 ffffffff 00000018 803db974 00000000 80980000 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.262899] ... Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.265343] Call Trace: Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.267810] [<80008108>] show_stack+0x30/0x100 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.272251] [<803606e0>] dump_stack+0xa4/0xe0 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.276643] [<8002d958>] __warn+0xc0/0xe8 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.280646] [<8002da0c>] warn_slowpath_fmt+0x8c/0xac Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.285605] [<8050ad44>] dev_watchdog+0x2fc/0x304 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.290317] [<8009ff34>] call_timer_fn.isra.30+0x20/0x90 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.295607] [<800a08f4>] run_timer_softirq+0x220/0x4c8 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.300740] [<80652e4c>] __do_softirq+0x16c/0x334 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.305426] [<800325a4>] irq_exit+0xc4/0xec Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.309592] [<8037bc7c>] plat_irq_dispatch+0x64/0x104 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.314621] [<80003548>] except_vec_vi_end+0xb8/0xc4 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.319576] [<8047b9c8>] cpuidle_enter_state_coupled+0x3b0/0x500 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.325558] [<80479214>] cpuidle_enter+0x54/0xac Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.330182] [<8005d0d8>] do_idle+0x25c/0x2cc Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.334432] [<8005d3c0>] cpu_startup_entry+0x2c/0x34 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.339383] [<80844d68>] start_kernel+0x5bc/0x5e4 Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.344068] Sat May 1 02:38:18 2021 kern.warn kernel: [ 304.345968] ---[ end trace 67930318eb59ae3a ]--- Sat May 1 02:38:18 2021 kern.err kernel: [ 304.350623] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out Sat May 1 02:38:18 2021 kern.info kernel: [ 304.357732] mtk_soc_eth 1e100000.ethernet eth0: Link is Down

@openwrt-bot
Copy link
Author

@openwrt-bot openwrt-bot commented May 11, 2021

kristrev:

I spent some time trying to reproduce the error at hand. The transmit timeout issue has been bothering the mt7531 driver "forever", there are plenty of examples of the issue affecting both 19.07 and 18.06. When testing your setup, I was able to trigger the timeout relatively easy.

However, when I tested the timeout only happened every now and then (and not repeatedly). While the ideal would be to fix the problem once and for all, I will argue that the behavior of the upstream driver is better than that of the downstream driver used in for example 19.07. At least during my testing, the automatic restart of the switch was always successful and iperf3 would resume after ~5 seconds. With the downstream driver, the restart had to triggered from user space.

@openwrt-bot
Copy link
Author

@openwrt-bot openwrt-bot commented May 11, 2021

stevenj:

I agree the restart is certainly more friendly. I am concerned about all the connections that are established at the time being dropped.

My company is prepared to pay a developer to fix the issue once and for all. So if there is anyone willing and capable of doing it, please contact me.

@aparcar aparcar added the release/21.02 label Feb 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flyspray release/21.02
Projects
None yet
Development

No branches or pull requests

2 participants