FS#4146 - e1000e: Detected Hardware Unit Hang, Reset adapter unexpectedly #9135

Open
openwrt-bot opened this issue Nov 22, 2021 · 2 comments
Labels
flyspray kernel release/21.02

openwrt-bot commented Nov 22, 2021

misieck:

System is a Fujitsu Esprimo C5731 with Intel Core2Duo E7500 and 4 GB RAM.

The problem NIC:
00:19.0 Ethernet controller [0200]: Intel Corporation 82567LF-3 Gigabit Network Connection [8086:10df] (rev 02)

OpenWrt:
OpenWrt x86_64 21.02.1 r16325-88151b8303

System is configured as a simple router with the e1000e NIC as WAN and a skge NIC [Ethernet controller [0200]: D-Link System Inc Gigabit Ethernet Adapter [1186:4c00] (rev 11)] as LAN.
When doing a speedtest through the router (bredbandskollen.se) the hang occurs during the upload test (when the e1000e NIC sends data and the skge NIC receives data). The download test does not cause the error.

Similar (identical?) problems were reported previously:
https://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang
https://serverfault.com/questions/193114/linux-e1000e-intel-networking-driver-problems-galore-where-do-i-start
https://web.archive.org/web/20160205153351/http://ehc.ac:80/p/e1000/bugs/378/

Turning TSO off is a workaround:
ethtool -K eth0 tso off

Setting pcie_aspm=off, however, does not help.
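
The TSO workaround above can be wrapped so that it only touches the interface when TSO is actually enabled. This is a minimal sketch, not a tested fix: the function names `offload_state` and `disable_tso` are hypothetical, and it assumes the `ethtool` binary is installed on the router and that `ethtool -k` prints features as `name: on|off` lines.

```shell
#!/bin/sh
# Print the state ("on" or "off") of one offload feature.
# Reads the output of `ethtool -k IFACE` on stdin.
# $1 = feature name exactly as ethtool prints it (hypothetical helper).
offload_state() {
    awk -v feat="$1" -F': *' '$1 == feat { split($2, a, " "); print a[1] }'
}

# Turn TSO off on interface $1, but only if it is currently on
# (hypothetical helper; needs root and a real interface).
disable_tso() {
    state=$(ethtool -k "$1" | offload_state tcp-segmentation-offload)
    if [ "$state" = "on" ]; then
        ethtool -K "$1" tso off
    fi
}
```

Calling `disable_tso eth0` applies the workaround once. To keep it across reboots, one option might be to call it from a startup script such as `/etc/rc.local`; whether that runs late enough for the interface to exist is an assumption that would need checking on the target system.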

[49573.954931] e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
[49573.954931] TDH <2>
[49573.954931] TDT <1a>
[49573.954931] next_to_use <1a>
[49573.954931] next_to_clean
[49573.954931] buffer_info[next_to_clean]:
[49573.954931] time_stamp <100bbf478>
[49573.954931] next_to_watch <2>
[49573.954931] jiffies <100bbf6f8>
[49573.954931] next_to_watch.status <0>
[49573.954931] MAC Status <80083>
[49573.954931] PHY Status <796d>
[49573.954931] PHY 1000BASE-T Status <3800>
[49573.954931] PHY Extended Status <3000>
[49573.954931] PCI Status <10>
[49575.970909] e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
[49575.970909] TDH <2>
[49575.970909] TDT <1a>
[49575.970909] next_to_use <1a>
[49575.970909] next_to_clean
[49575.970909] buffer_info[next_to_clean]:
[49575.970909] time_stamp <100bbf478>
[49575.970909] next_to_watch <2>
[49575.970909] jiffies <100bbf8f0>
[49575.970909] next_to_watch.status <0>
[49575.970909] MAC Status <80083>
[49575.970909] PHY Status <796d>
[49575.970909] PHY 1000BASE-T Status <3800>
[49575.970909] PHY Extended Status <3000>
[49575.970909] PCI Status <10>
[49577.954909] e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
[49577.954909] TDH <2>
[49577.954909] TDT <1a>
[49577.954909] next_to_use <1a>
[49577.954909] next_to_clean
[49577.954909] buffer_info[next_to_clean]:
[49577.954909] time_stamp <100bbf478>
[49577.954909] next_to_watch <2>
[49577.954909] jiffies <100bbfae0>
[49577.954909] next_to_watch.status <0>
[49577.954909] MAC Status <80083>
[49577.954909] PHY Status <796d>
[49577.954909] PHY 1000BASE-T Status <3800>
[49577.954909] PHY Extended Status <3000>
[49577.954909] PCI Status <10>
[49578.082559] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
[49578.254005] e1000e: eth0 NIC Link is Down
[49581.083429] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
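
The hex values in the reports above can be read with a little arithmetic. The sketch below is only an illustration of that arithmetic: the helper names `hex_diff` and `estimate_hz` are hypothetical, and the tick-rate estimate assumes jiffies advance at a constant rate between two reports.

```shell
#!/bin/sh
# Difference of two hex values (given without the 0x prefix), in decimal.
hex_diff() {
    echo $(( 0x$1 - 0x$2 ))
}

# Estimate the kernel tick rate: jiffies elapsed divided by seconds elapsed.
# $1 = jiffies delta (decimal), $2 = later and $3 = earlier dmesg timestamp.
estimate_hz() {
    awk -v j="$1" -v t1="$2" -v t0="$3" 'BEGIN { printf "%.0f\n", j / (t1 - t0) }'
}

pending=$(hex_diff 1a 2)                 # TDT - TDH: descriptors queued but not yet processed
age=$(hex_diff 100bbf6f8 100bbf478)      # jiffies - time_stamp in the first report
step=$(hex_diff 100bbf8f0 100bbf6f8)     # jiffies elapsed between the first two reports
hz=$(estimate_hz "$step" 49575.970909 49573.954931)

echo "pending descriptors: $pending"
echo "oldest buffer age:   $age jiffies"
echo "estimated HZ:        $hz"
```

TDH, TDT and time_stamp are identical in all three reports while jiffies keeps increasing, which is consistent with the transmit ring not advancing at all between the first report and the reset.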

openwrt-bot commented Dec 19, 2021

equid0x:

This is known as the "TX Unit Hang" issue, and it's allegedly a bug in silicon that can't be fixed. As far as I recall, Intel released updated microcode (included in the driver) for this series of chips that partially mitigates, but does not completely eliminate, the issue. This is a very, very old issue.

I believe the workaround is to turn off checksum offloading:

ethtool -K eth0 tx off rx off

The bug is probably reproducible if you use something like iPerf or Netcat to totally flood the affected interface with TX traffic for an extended period of time (several minutes).
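
A reproduction attempt along those lines could look like the sketch below. It assumes `iperf3` specifically (the comment only says "iPerf"), an `iperf3 -s` server already running on a host reachable through the affected interface, and a kernel log that has not wrapped during the test. The function names and the server address are hypothetical.

```shell
#!/bin/sh
# Count "Detected Hardware Unit Hang" reports in kernel log text on stdin.
# grep -c exits non-zero when the count is 0, so mask that with `|| true`.
count_hangs() {
    grep -c 'Detected Hardware Unit Hang' || true
}

# Flood a peer with TX traffic, then report how many new hang reports appeared.
# $1 = iperf3 server address (hypothetical), $2 = test duration in seconds.
flood_and_check() {
    before=$(dmesg | count_hangs)
    iperf3 -c "$1" -t "$2" >/dev/null
    after=$(dmesg | count_hangs)
    echo "new hang reports: $((after - before))"
}
```

For example, `flood_and_check 192.0.2.10 300` would send for five minutes toward a placeholder address and then compare the hang count before and after.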

I did a cursory search on this out of curiosity and, interestingly, at least one user has reported that the issue does not seem to occur under kernel 5.11, so it's possible someone finally tracked down and fixed a long-standing bug in the driver source. This issue has been around since at least 2009 (!).

openwrt-bot commented Dec 23, 2021

misieck:

The problem does not show up in OPNsense, at least not in an overly noticeable way. So even if it is a hardware problem, there ostensibly exists a workable workaround.

aparcar added the release/21.02 and kernel labels on Feb 22, 2022