MT7602E: Radio becomes unusable after ~3 days #246
Comments
What's the latest version that you tested so far? |
OpenWrt v18.06.2-2-g13eeee7b2b |
I just saw that I had not pushed my latest backport commit to the stable branch yet. Please try the latest version. |
Mesh "died" after 8 hours of uptime on openwrt/openwrt@e5ace80. The TQ to all neighbours is stuck at 5/255, so mesh is just very lossy. The load is unaffected at this time.
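For scale: batman-adv reports TQ (transmit quality) on a 0 to 255 range, so 5/255 is only about 2% link quality. A minimal sketch of that conversion follows; the function name `tq_percent` is purely illustrative and not part of any tool.

```python
def tq_percent(tq: int, tq_max: int = 255) -> float:
    """Convert a batman-adv TQ value (0..tq_max) to a percentage."""
    if not 0 <= tq <= tq_max:
        raise ValueError(f"TQ must be within 0..{tq_max}, got {tq}")
    return 100.0 * tq / tq_max


print(f"{tq_percent(5):.1f}%")    # the stuck value reported above
print(f"{tq_percent(255):.1f}%")  # a perfect link
```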
|
After another reboot it just behaves weirdly. It starts out meh, after 7 hours it starts working quite well, after 20 hours it is meh again, and it is OK again after 22 hours of uptime. And now it seems to go down again after 24 hours of uptime. I have neighbouring routers that are meshing just fine, so there are no RF issues. Client associations with the AP seem to work though.
|
I see the same issue |
Please try the latest version of mt76.git (not pushed to OpenWrt yet) |
After 8 hours the CPU load went up, mesh connections died, and client connections vanished. The log shows an abundance of MCU message timeouts. |
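To put a number on "an abundance", the timeouts can be counted per hour of uptime from saved `dmesg` output. This is a rough sketch only: the exact driver message is not reproduced in this thread, so the `timed out` substring and the sample lines below are assumptions made up for illustration.

```python
import re
from collections import Counter

# Kernel timestamp as printed by dmesg, e.g. "[ 1234.567890]"
TS_RE = re.compile(r"^\[\s*(\d+)\.\d+\]")


def timeouts_per_hour(lines, needle="timed out"):
    """Count log lines containing `needle`, bucketed by hour of uptime."""
    buckets = Counter()
    for line in lines:
        if needle not in line:
            continue
        m = TS_RE.match(line)
        if m:
            buckets[int(m.group(1)) // 3600] += 1
    return dict(sorted(buckets.items()))


# Hypothetical sample lines, NOT copied from the affected device
sample = [
    "[ 3601.000000] example0: MCU message timed out",
    "[ 3700.500000] example0: MCU message timed out",
    "[ 7300.250000] example0: MCU message timed out",
    "[ 7400.000000] br-lan: port 1 entered forwarding state",
]
print(timeouts_per_hour(sample))
```

A sudden jump in the per-hour count would line up with the point where the load graph starts climbing.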
When that happens and the CPU load is up, could you please show me the number of interrupts per second for wifi and then do |
Another snapshot
|
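For anyone else collecting the interrupt rate requested above: it can be derived from two `/proc/interrupts` snapshots taken a known interval apart. The sketch below is an assumption-laden illustration, not a tool from this project. The helper names are invented, the sample snapshots are fabricated, and the `mt76` filter string assumes the wifi IRQ is registered under a name containing it.

```python
def parse_interrupts(text):
    """Return {irq_label: (total_count, description)} from /proc/interrupts text."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:  # leading per-CPU counters
            if not field.isdigit():
                break
            counts.append(int(field))
        result[label.strip()] = (sum(counts), " ".join(fields[len(counts):]))
    return result


def irq_rates(before, after, interval_s, name_filter=""):
    """Interrupts per second for IRQs whose description contains name_filter."""
    a, b = parse_interrupts(before), parse_interrupts(after)
    return {
        f"{irq} {desc}": (b[irq][0] - a[irq][0]) / interval_s
        for irq, (_, desc) in b.items()
        if irq in a and name_filter in desc
    }


# Fabricated snapshots, taken 2 seconds apart in this example
BEFORE = """\
           CPU0       CPU1
 11:       1000        200  EXAMPLE  11  mt76x2e
 22:        500          0  EXAMPLE  22  eth0
"""
AFTER = """\
           CPU0       CPU1
 11:       4000        800  EXAMPLE  11  mt76x2e
 22:        510          0  EXAMPLE  22  eth0
"""
print(irq_rates(BEFORE, AFTER, interval_s=2, name_filter="mt76"))
```

On a device, the two strings would come from reading `/proc/interrupts` twice with a sleep in between.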
I updated PKG_SOURCE_VERSION to the new commit id and dropped PKG_MIRROR_HASH. Is this the correct way to bump the package version? The resulting packages look like this:
|
Yes, that works. |
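For reference, the bump described above (set `PKG_SOURCE_VERSION`, drop `PKG_MIRROR_HASH`) can be scripted. This is only a sketch of those two steps applied to Makefile text; the helper name and the placeholder values are hypothetical, and it does not touch any other variable a real bump might also change.

```python
import re


def bump_source_version(makefile_text, new_commit):
    """Set PKG_SOURCE_VERSION to new_commit and drop the PKG_MIRROR_HASH line."""
    out = []
    for line in makefile_text.splitlines():
        if re.match(r"PKG_MIRROR_HASH\s*:?=", line):
            continue  # dropped, as described in the comment above
        if re.match(r"PKG_SOURCE_VERSION\s*:?=", line):
            line = f"PKG_SOURCE_VERSION:={new_commit}"
        out.append(line)
    return "\n".join(out) + "\n"


# Placeholder Makefile fragment; the values are intentionally not real
sample = (
    "PKG_SOURCE_PROTO:=git\n"
    "PKG_SOURCE_VERSION:=<old-commit-id>\n"
    "PKG_MIRROR_HASH:=<old-hash>\n"
)
print(bump_source_version(sample, "<new-commit-id>"))
```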
I found some more bugs related to ED/CCA, which could be causing this. Please try the latest version |
The device has seen its first bunch of MCU message timeouts but no visible impact after 7h uptime. I'm hopeful. |
The driver now resets the device after an MCU message timeout, so it is now able to recover by itself. |
Yes, I understand what the code is supposed to do. Except after ~15 hours of uptime the mesh connections died and the CPU load is up; the AP still seems to accept station connections though. |
Pushed some more fixes to mt76.git that should hopefully make it possible to recover from these errors now. |
No 2.4 GHz meshing ever occurred with The load is fine, but a lot more connectivity was lost. This is what I could read from dmesg just now; I can certainly get a more complete log if necessary.
|
Also here is the output of logread which might be helpful in this case.
|
Is that with 18.06 or master? |
OpenWrt 18.06 openwrt/openwrt@1be6ff6 |
Could you please try this with OpenWrt master? The mac80211 version might make a difference here |
I just pushed another commit that should help with mesh connection stability |
The router disappeared after flashing OpenWrt master, I'll have to recover it later tonight or tomorrow. |
Complete serial log with several stacktraces and the result that mesh1 is removed from bat0 openwrt openwrt/openwrt@994428f |
I pushed some mac80211 fixes to openwrt and some more mt76 fixes to mt76.git. It should work much better now. |
Not quite yet. https://gist.github.com/mweinelt/e6beb6289b2f6162038fde5d61ee9a46 openwrt openwrt/openwrt@c6caa7a |
Please show me the output of ifconfig and the contents of /etc/config/wireless. |
Before the error occurs |
ifconfig
/etc/config/wireless
|
As I suspected, I couldn't reproduce your issue because it was triggered by your use of the macaddr option. I've pushed another change that should resolve this properly |
Yup, the stacktraces are gone. I'll be watching graphs again until something interesting happens (or not). Thanks so far! |
The wireless mesh connections (the node has no wired connection) seem to be quite lossy now, but the router is otherwise stable. Due to the loss it fails to report metrics (see graph below), and with mtr I get roughly 40% loss over several minutes. The node has been very close (30 cm) to the node it was meshing with since I attached the serial console, but I wouldn't think that this could influence the mesh connection in such a negative way. |
Please try the latest mt76 version, I've pushed a fix for ED/CCA that might help |
I just pushed more fixes to OpenWrt master, things should work much better now. |
Up 12 hours and everything looks pretty good. Packet loss is gone, metrics arrive properly, CPU isn't under pressure, clients and mesh connections work and are stable. |
nice, let's hope this state persists, as your original report was about problems after ~3d |
Several restarts on both phy and no visible negative impact yet.
|
Yep, looks stable three days in. Thanks a ton! Is that something we could get backported to 18.06? |
I will look into it. I probably need to backport some mac80211 fixes for that. |
Done |
The 2.4 GHz radio (MT7602E) on my DIR-860L currently becomes unusable again after ~3 days. It is configured to use two VAPs, one for AP, one for Mesh Point.
The kernel ringbuffer only contains these MCU message timeouts:
When that happens the load climbs somewhat significantly:
During all this the radio still supplies us with airtime values:
The radio had been running just fine (up to two weeks of continuous uptime without issues) on OpenWrt 18.06 between mid-November and mid-January. The issue appeared after upgrading OpenWrt from eef6bd3393f406f73187a670fa34d5e6a228f9e8 to 939fa07b041fef58196fba8dd4b5184adb7b4d3f.
The culprit is likely somewhere in: