FS#4066 - ipq40xx: Switch (ar40xx) freezes #9050

Closed
openwrt-bot opened this issue Oct 7, 2021 · 11 comments
Labels
flyspray kernel pull request/issue with Linux kernel related changes

Comments

@openwrt-bot

openwrt-bot commented Oct 7, 2021

PolynomialDivision:

The ar40xx switch on ipq40xx devices freezes from time to time. The issue is hard to reproduce, since it has been seen on multiple devices on very different time scales. We run a mesh network with olsr (IPv4) and babeld (IPv6) as routing daemons. The affected devices are mainly Fritz!Box 4040 and Fritz!Box 7530. Typically, the switches begin to freeze on setups under heavy load. For example, a church that acts as a central point of our mesh network and as its connection to the internet experiences the freeze daily, so we can easily reproduce the problem and test solutions at that location. Devices with fewer clients and just one mesh connection also crash, but it takes longer (about 30 days). Some devices with almost no traffic or clients do not crash.

As a workaround, we wrote the naywatch daemon, which checks for IPv6 link-local connectivity. It can also collect debug output. With that, we can show what a diff of the swconfig output looks like once a freeze has happened, i.e. while the switch is already in a frozen state. As you can see, CPU port 0 is no longer sending any frames to the CPU (no Tx counters show up in the diff), while the switch still receives frames.

Port 0:
mib: Port 0 MIB counters
-RxBroad : 472793
+RxBroad : 472882
RxPause : 0
-RxMulti : 36772
+RxMulti : 36854
RxFcsErr : 0
RxAlignErr : 0
RxRunt : 0
RxFragment : 0
-Rx64Byte : 2075611
-Rx128Byte : 20964917
-Rx256Byte : 16459560
-Rx512Byte : 623700
-Rx1024Byte : 907303
-Rx1518Byte : 48003389
+Rx64Byte : 2075618
+Rx128Byte : 20964980
+Rx256Byte : 16459583
+Rx512Byte : 623752
+Rx1024Byte : 907344
+Rx1518Byte : 48003397
RxMaxByte : 21422694
RxTooLong : 0
-RxGoodByte : 107718798589
+RxGoodByte : 107718869396
RxBadByte : 0
RxOverFlow : 0
Filtered : 0
@@ -51,38 +51,38 @@
link: port:0 link:up speed:1000baseT full-duplex txflow rxflow
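The naywatch workaround mentioned above is not reproduced here. As a rough illustration only, a watchdog that checks IPv6 link-local connectivity might look like the following Python sketch. The names `LinkLocalWatchdog` and `ping_all_nodes`, the failure threshold, and the choice of pinging the all-nodes address `ff02::1` are assumptions made for this example, not naywatch's actual implementation.

```python
import subprocess


def ping_all_nodes(interface, timeout_s=2):
    """Ping the IPv6 all-nodes link-local multicast group on one interface.

    Assumption: a `ping` binary that accepts -6, -c and -W is installed;
    on some systems (e.g. BusyBox builds) `ping6` may be needed instead.
    """
    result = subprocess.run(
        ["ping", "-6", "-c", "1", "-W", str(timeout_s), f"ff02::1%{interface}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class LinkLocalWatchdog:
    """Counts consecutive rounds in which no interface saw a link-local reply."""

    def __init__(self, interfaces, max_failures=3, probe=ping_all_nodes):
        self.interfaces = list(interfaces)
        self.max_failures = max_failures  # hypothetical threshold
        self.probe = probe                # injectable, so it can be tested offline
        self.failures = 0

    def check(self):
        """Run one probe round.

        Returns True once `max_failures` consecutive rounds have failed,
        i.e. when the caller should act (collect debug logs, reboot, ...).
        """
        if any(self.probe(iface) for iface in self.interfaces):
            self.failures = 0
        else:
            self.failures += 1
        return self.failures >= self.max_failures
```

A caller would invoke `check()` periodically (for example every few seconds) and save outputs such as `swconfig dev switch0 show` before rebooting, which is how snapshots like the ones in this report can be obtained.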

My guesses at the cause:

  • The switch is broken?
  • The switch is wrongly configured?

However, there seems to be a DSA implementation of the ar40xx driver that should be released some day, so maybe it is better to switch to DSA before fixing this issue. I have already been in touch with blocktrron and, to my understanding, he was also able to observe a freeze.

Rest of diff:

root@emma-core:~# diff -u 1633480443-swconfig\ dev\ switch0\ show.log 1633480463-swconfig\ dev\ switch0\ show.log
--- "1633480443-swconfig dev switch0 show.log" 2021-10-06 02:34:03.000000000 +0200
+++ "1633480463-swconfig dev switch0 show.log" 2021-10-06 02:34:23.000000000 +0200
@@ -7,22 +7,22 @@
linkdown: ???
Port 0:
mib: Port 0 MIB counters
-RxBroad : 472793
+RxBroad : 472882
RxPause : 0
-RxMulti : 36772
+RxMulti : 36854
RxFcsErr : 0
RxAlignErr : 0
RxRunt : 0
RxFragment : 0
-Rx64Byte : 2075611
-Rx128Byte : 20964917
-Rx256Byte : 16459560
-Rx512Byte : 623700
-Rx1024Byte : 907303
-Rx1518Byte : 48003389
+Rx64Byte : 2075618
+Rx128Byte : 20964980
+Rx256Byte : 16459583
+Rx512Byte : 623752
+Rx1024Byte : 907344
+Rx1518Byte : 48003397
RxMaxByte : 21422694
RxTooLong : 0
-RxGoodByte : 107718798589
+RxGoodByte : 107718869396
RxBadByte : 0
RxOverFlow : 0
Filtered : 0
@@ -51,38 +51,38 @@
link: port:0 link:up speed:1000baseT full-duplex txflow rxflow
Port 1:
mib: Port 1 MIB counters
-RxBroad : 107158
+RxBroad : 107267
RxPause : 0
-RxMulti : 1147
+RxMulti : 1173
RxFcsErr : 0
RxAlignErr : 0
RxRunt : 0
RxFragment : 0
-Rx64Byte : 1536
-Rx128Byte : 68123
-Rx256Byte : 20668
-Rx512Byte : 4912
-Rx1024Byte : 10327
-Rx1518Byte : 90128
-RxMaxByte : 9851
+Rx64Byte : 1555
+Rx128Byte : 68180
+Rx256Byte : 20673
+Rx512Byte : 4925
+Rx1024Byte : 10332
+Rx1518Byte : 90209
+RxMaxByte : 9860
RxTooLong : 0
-RxGoodByte : 166901240
+RxGoodByte : 167051151
RxBadByte : 0
RxOverFlow : 0
-Filtered : 118
-TxBroad : 856851
-TxPause : 1074
-TxMulti : 89960
+Filtered : 307
+TxBroad : 856940
+TxPause : 3442
+TxMulti : 90042
TxUnderRun : 0
-Tx64Byte : 38606
-Tx128Byte : 206906
-Tx256Byte : 56950
-Tx512Byte : 27986
-Tx1024Byte : 60718
-Tx1518Byte : 677209
+Tx64Byte : 40978
+Tx128Byte : 206965
+Tx256Byte : 56973
+Tx512Byte : 28038
+Tx1024Byte : 60759
+Tx1518Byte : 677217
TxMaxByte : 74048
TxOverSize : 0
-TxByte : 1186157213
+TxByte : 1186378970
TxCollision : 0
TxAbortCol : 0
TxMultiCol : 0
@@ -95,38 +95,38 @@
link: port:1 link:up speed:1000baseT full-duplex txflow rxflow auto
Port 2:
mib: Port 2 MIB counters
-RxBroad : 170588
+RxBroad : 170832
RxPause : 0
-RxMulti : 11717
+RxMulti : 11767
RxFcsErr : 0
RxAlignErr : 0
RxRunt : 0
RxFragment : 0
-Rx64Byte : 28452
-Rx128Byte : 6337895
-Rx256Byte : 408749
-Rx512Byte : 54975
-Rx1024Byte : 62053
-Rx1518Byte : 343977
-RxMaxByte : 130630
+Rx64Byte : 28455
+Rx128Byte : 6338150
+Rx256Byte : 408825
+Rx512Byte : 55001
+Rx1024Byte : 62066
+Rx1518Byte : 344149
+RxMaxByte : 130646
RxTooLong : 0
-RxGoodByte : 1345452080
+RxGoodByte : 1345783813
RxBadByte : 0
RxOverFlow : 0
-Filtered : 574
-TxBroad : 793647
-TxPause : 1151
-TxMulti : 79418
+Filtered : 1135
+TxBroad : 793736
+TxPause : 3519
+TxMulti : 79500
TxUnderRun : 0
-Tx64Byte : 97711
-Tx128Byte : 603920
-Tx256Byte : 173282
-Tx512Byte : 76156
-Tx1024Byte : 104686
-Tx1518Byte : 3582480
+Tx64Byte : 100079
+Tx128Byte : 603969
+Tx256Byte : 173305
+Tx512Byte : 76208
+Tx1024Byte : 104727
+Tx1518Byte : 3582488
TxMaxByte : 7391371
TxOverSize : 0
-TxByte : 16528781103
+TxByte : 16529001614
TxCollision : 0
TxAbortCol : 0
TxMultiCol : 0
@@ -139,38 +139,38 @@
link: port:2 link:up speed:1000baseT full-duplex txflow rxflow auto
Port 3:
mib: Port 3 MIB counters
-RxBroad : 159441
+RxBroad : 159671
RxPause : 0
-RxMulti : 18188
+RxMulti : 18257
RxFcsErr : 0
RxAlignErr : 0
RxRunt : 0
RxFragment : 0
-Rx64Byte : 9911
-Rx128Byte : 3051611
-Rx256Byte : 711251
-Rx512Byte : 377985
-Rx1024Byte : 459307
-Rx1518Byte : 46523339
-RxMaxByte : 21133903
+Rx64Byte : 9932
+Rx128Byte : 3052044
+Rx256Byte : 711326
+Rx512Byte : 378003
+Rx1024Byte : 459361
+Rx1518Byte : 46523507
+RxMaxByte : 21133924
RxTooLong : 0
-RxGoodByte : 100988636934
+RxGoodByte : 100989010961
RxBadByte : 0
RxOverFlow : 0
-Filtered : 36461
-TxBroad : 804786
-TxPause : 6016
-TxMulti : 72948
+Filtered : 37251
+TxBroad : 804875
+TxPause : 8384
+TxMulti : 73030
TxUnderRun : 0
-Tx64Byte : 1661972
-Tx128Byte : 18816233
-Tx256Byte : 15830078
-Tx512Byte : 276436
-Tx1024Byte : 524862
-Tx1518Byte : 3198326
+Tx64Byte : 1664343
+Tx128Byte : 18816282
+Tx256Byte : 15830101
+Tx512Byte : 276488
+Tx1024Byte : 524903
+Tx1518Byte : 3198334
TxMaxByte : 426878
TxOverSize : 0
-TxByte : 9485704082
+TxByte : 9485924799
TxCollision : 0
TxAbortCol : 0
TxMultiCol : 0
@@ -202,19 +202,19 @@
RxBadByte : 871047744
RxOverFlow : 0
Filtered : 99
-TxBroad : 909305
-TxPause : 4319
-TxMulti : 67716
+TxBroad : 909394
+TxPause : 6687
+TxMulti : 67798
TxUnderRun : 0
-Tx64Byte : 335196
-Tx128Byte : 1832541
-Tx256Byte : 520952
-Tx512Byte : 320853
-Tx1024Byte : 412422
-Tx1518Byte : 42645040
+Tx64Byte : 337564
+Tx128Byte : 1832588
+Tx256Byte : 520975
+Tx512Byte : 320905
+Tx1024Byte : 412463
+Tx1518Byte : 42645048
TxMaxByte : 13773617
TxOverSize : 0
-TxByte : 84230298783
+TxByte : 84230519096
TxCollision : 0
TxAbortCol : 0
TxMultiCol : 0

@openwrt-bot
Author

openwrt-bot commented Oct 7, 2021

PolynomialDivision:

To clear up any misunderstandings: the diff shows the state while the switch is frozen. Both snapshots were taken when the router could no longer be reached, or could not reach anything itself, so the switch had already stopped working. According to the timestamps in the log file names, the two snapshots are 20 seconds apart.
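To make the comparison of two such snapshots less error-prone, the following Python sketch parses two `swconfig dev switch0 show` dumps and reports, per port, whether any Rx or Tx counter advanced between them. It assumes the `Port N:` and `Name : value` line layout visible in the diff; the function names `parse_mib` and `advancing_directions` are made up for this example.

```python
import re

PORT_RE = re.compile(r"^\s*Port (\d+):\s*$")
COUNTER_RE = re.compile(r"^\s*(\w+)\s*:\s*(\d+)\s*$")


def parse_mib(text):
    """Return {port_number: {counter_name: value}} from a swconfig dump."""
    ports = {}
    current = None
    for line in text.splitlines():
        match = PORT_RE.match(line)
        if match:
            current = ports.setdefault(int(match.group(1)), {})
            continue
        match = COUNTER_RE.match(line)
        if match and current is not None:
            current[match.group(1)] = int(match.group(2))
    return ports


def advancing_directions(before, after):
    """Return {port: {"rx": bool, "tx": bool}}.

    A direction is True if at least one counter whose name starts with
    Rx/Tx grew between the two snapshots. Other counters (e.g. Filtered)
    are ignored.
    """
    report = {}
    for port, old in before.items():
        new = after.get(port, {})
        moved = {"rx": False, "tx": False}
        for name, value in old.items():
            direction = name[:2].lower()
            if direction in moved and new.get(name, value) > value:
                moved[direction] = True
        report[port] = moved
    return report
```

A port that reports `{"rx": True, "tx": False}` matches the pattern described for CPU port 0: frames are still received, but nothing is transmitted any more.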

@openwrt-bot
Author

openwrt-bot commented Oct 18, 2021

PolynomialDivision:

We may have found a workaround. It could be that U-Boot initializes the switch with some configuration that swconfig then overwrites.

Since changing the config to

config switch
    option name 'switch0'
    option reset '0'
    option enable_vlan '1'

we have not seen a crash for 3 days.

@openwrt-bot
Author

openwrt-bot commented Nov 8, 2021

PolynomialDivision:

The DSA driver could fix the issue:
#4721

@openwrt-bot
Author

openwrt-bot commented Nov 12, 2021

PolynomialDivision:

Maybe the combination of an ipq40xx device with Ubiquiti products that use the Cisco Discovery Protocol is problematic?

@openwrt-bot
Author

openwrt-bot commented Nov 12, 2021

PolynomialDivision:

A possibly important log message:
Fri Nov 12 21:19:20 2021 daemon.err babeld[2608]: netlink_read: recvmsg(): No buffer space available

@openwrt-bot
Author

openwrt-bot commented Nov 15, 2021

PolynomialDivision:

Enabled offloads:

# ethtool -k lan1 | grep on
rx-checksumming: on [fixed]
tx-checksumming: on
tx-checksum-ip-generic: on [fixed]
scatter-gather: on
tx-scatter-gather: on [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on [fixed]
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: on [fixed]
tx-tcp6-segmentation: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
tx-lockless: on [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
hw-tc-offload: on

@openwrt-bot
Author

openwrt-bot commented Nov 15, 2021

PolynomialDivision:

Example diff

diff -u 1636876105-ethtool\ -S\ lan1.log 1636876125-ethtool\ -S\ lan1.log
--- "1636876105-ethtool -S lan1.log" 2021-11-14 09:57:54.296129780 +0100
+++ "1636876125-ethtool -S lan1.log" 2021-11-14 09:57:54.762796277 +0100
@@ -1,40 +1,40 @@
NIC statistics:
-tx_packets: 37118865
-tx_bytes: 39940635954
+tx_packets: 37119048
+tx_bytes: 39940687530
rx_packets: 17896842
rx_bytes: 2765004610
-RxBroad: 618724
+RxBroad: 618736
RxPause: 0
-RxMulti: 71642
+RxMulti: 71666
RxFcsErr: 0
RxAlignErr: 0
RxRunt: 0
RxFragment: 0
-Rx64Byte: 81401
-Rx128Byte: 16481482
-Rx256Byte: 266421
-Rx512Byte: 74709
-Rx1024Byte: 147338
-Rx1518Byte: 919906
+Rx64Byte: 81421
+Rx128Byte: 16481568
+Rx256Byte: 266434
+Rx512Byte: 74718
+Rx1024Byte: 147342
+Rx1518Byte: 919914
RxMaxByte: 87542
RxTooLong: 0
-RxGoodByte: 3173218205
+RxGoodByte: 3173248165
RxBadByte: 0
RxOverFlow: 0
-Filtered: 161894
-TxBroad: 9615221
-TxPause: 15710
-TxMulti: 811563
+Filtered: 162034
+TxBroad: 9615315
+TxPause: 18185
+TxMulti: 811634
TxUnderRun: 0
-Tx64Byte: 948123
-Tx128Byte: 14871136
-Tx256Byte: 893717
-Tx512Byte: 444792
-Tx1024Byte: 907971
+Tx64Byte: 950607
+Tx128Byte: 14871198
+Tx256Byte: 893747
+Tx512Byte: 444849
+Tx1024Byte: 907996
Tx1518Byte: 10961510
TxMaxByte: 14193799
TxOverSize: 0
-TxByte: 39901619727
+TxByte: 39901830561
TxCollision: 0
TxAbortCol: 0
TxMultiCol: 0
@@ -42,5 +42,5 @@
TxExcDefer: 0
TxDefer: 0
TxLateCol: 0
-RXUnicast: 17368433
-TXunicast: 32778554
+RXUnicast: 17368537
+TXunicast: 32778572

The hardware Rx counters still increase, but the interface (host) counters rx_packets and rx_bytes do not. I suspect some offload failure?
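The observation above can be turned into a simple check. The Python sketch below parses two `ethtool -S` outputs and flags the case where a hardware Rx counter advances while the host counter stays put. The function names and the choice of `RXUnicast` and `rx_packets` as the default counter pair are assumptions for this example; the counter names themselves are taken from the output shown in this comment.

```python
def parse_ethtool_stats(text):
    """Return {counter_name: value} from `ethtool -S <iface>` output.

    Lines without a numeric value (e.g. "NIC statistics:") are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def rx_path_stalled(before, after, host_counter="rx_packets", hw_counter="RXUnicast"):
    """True if the hardware counter grew while the host counter did not move."""
    hw_delta = after[hw_counter] - before[hw_counter]
    host_delta = after[host_counter] - before[host_counter]
    return hw_delta > 0 and host_delta == 0
```

With the values from the diff (`rx_packets` unchanged at 17896842, `RXUnicast` going from 17368433 to 17368537), `rx_path_stalled` returns `True`, which is the signature described here: the switch port still counts incoming frames, but they never reach the host interface.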

@openwrt-bot
Author

openwrt-bot commented Nov 15, 2021

PolynomialDivision:

More information:
#4721 (comment)

@aparcar aparcar added the kernel pull request/issue with Linux kernel related changes label Feb 22, 2022
@PolynomialDivision
Member

PolynomialDivision commented Apr 18, 2022

Fix: #9731

@PolynomialDivision
Member

PolynomialDivision commented Apr 29, 2022

Should be fixed with: ab7e53e

@PolynomialDivision
Member

PolynomialDivision commented Apr 29, 2022

@aparcar can you close?

@aparcar aparcar closed this as completed Apr 29, 2022