Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Huge ping value increase after running 5g wifi for some time #152

Closed
updatede opened this issue Feb 17, 2018 · 60 comments
Closed

Huge ping value increase after running 5g wifi for some time #152

updatede opened this issue Feb 17, 2018 · 60 comments

Comments

@updatede
Copy link

updatede commented Feb 17, 2018

device: xiaomi router 3g
target: ramips/mt7621
system: OpenWrt SNAPSHOT r6150-dc7a1e8
signal strength: -60dbm
I use just only one client to connect router: a ipad air2.
When wlan just start:
img_1880
After play game for some time, network get stucked, at the same place, close all programs on pad, ping value increase huge, won't decrease untill restart wlan on router, after restart ping value decrease to normal.
img_1879

@iancd
Copy link

iancd commented Feb 17, 2018

This should be expected behaviour from iOS. It would not be possible for an iPad to achieve 10 hours of battery life if it behaved like a desktop computer.

When the device is idle the wifi hardware goes into a powersave mode because it isn't critical to get data quickly. When network data is needed the wifi hardware is taken out of power save mode until network traffic is reduced to an idle state again so that power save mode is kicked back in. Source link

Maybe streaming audio from the internet in the background of your game will prevent the iPad powersaving.

@updatede
Copy link
Author

no, the device is not idle and hardware does not go to powersave mode. because it's never happen when using wndr3800 with this same ipad doing same thing at same place. and this "idle" can't wake up whatever you do to ipad, unless you do restart on mt76 router. so it's mt76 driver not ios.

@mastum
Copy link

mastum commented Feb 17, 2018

@updatede I can't reproduce it with my iPad, what is for you "for some time" ?
Mine is connected (5 GHz) from 5 days without problems.
Xiaomi 3G OpenWrt SNAPSHOT r6022

@updatede
Copy link
Author

@Mafesa for some time is a duration of time , not fixed, 3mins,5mins,10mins or just connected.

@updatede
Copy link
Author

@Mafesa what's your ping value from route to client ?

@mastum
Copy link

mastum commented Feb 17, 2018

192.168.0.98 is your router IP?
I will test r6150

@updatede
Copy link
Author

@Mafesa I don't know what's your client ip. you need ping from the router to client.
192.168.0.89 is my ipad's ip.

@mastum
Copy link

mastum commented Feb 17, 2018

OK I understand, I was doing the opposite from iPad to router with this app.

@mastum
Copy link

mastum commented Feb 17, 2018

Tested now OpenWrt SNAPSHOT r6152-f4e5880 and ping router -> iPad 5GHz is constant with ~50 ms

@updatede
Copy link
Author

@nbd168 because lack of debug info, I want to add some debug info in my copy of source code to find the reason why ping value get so high to 1000 or 2000 ms, any advice for places to add ?

@barhom
Copy link

barhom commented Feb 22, 2018

I can say that I see something similar on both mt7603 and mt7612.
After running a while ping increases a lot.
I still don't know the cause.

@updatede
Copy link
Author

Still unfixed on latest snapshot r6302-0f54d96, 7% packet loss on this test

img_1884

@Mushoz
Copy link

Mushoz commented Mar 6, 2018

Seeing similar behavior on my Dir-860L. 1000-2000ms pings, high packet loss and very slow speeds. Only rebooting the router solves the issue for me. Lede 17.01.4 is completely fine.

@nbd168
Copy link
Member

nbd168 commented Mar 8, 2018

Please try to build the latest version from my staging tree at https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=summary

@updatede
Copy link
Author

updatede commented Mar 8, 2018

Because I rely on snapshot sdk to build, and this tree switch to kernel 4.14 before mac80211 patch, I don't know how to build it, I will test as soon as it commit to master tree. Thank very much @nbd168

@moonwatcher
Copy link

moonwatcher commented Mar 9, 2018

nbd, I have been experimenting with mir3g and the new 4.14 kernel. i have been seeing similar behavior. Would love to test patches. I am currently running 6c194078db6958e349114c8605304a7b87edae3e on 4.09, but I can build 4.14 with patches tomorrow if you want me to try stuff (I have 6 of those units...).

How do I build your staging, is that the entire build system or a package repo?

@updatede
Copy link
Author

updatede commented Mar 9, 2018

@nbd168 I have modified function ieee80211_amsdu_aggregate according to your patch, and compile mac80211.ko on my old snapshot sdk with 4.9.77 kernel. I replace mac80211.ko on my router which run a 4.9.86 kernel snapshot, after restart, it always popup error of wifi password not correct. I'll try with same snapshot version system and sdk later.

@Mushoz
Copy link

Mushoz commented Mar 9, 2018

@nbd168 I am leaving for my holiday tomorrow. I will be back the 19th of March. If you still need testing to be done I will help after that. I'm sorry that I can't help at the moment.

@updatede
Copy link
Author

updatede commented Mar 9, 2018

@nbd168, I manually patched tx.c in mac80211 directory of backports-2017-11-01, and rebuild mac80211.ko file with latest snapshot sdk. After testing, the same problem still exist. But it take about 20mins to trigger, it usually take 3-5mins to trigger before this patch. There are other similar problems, move to weak signal place, client will stuck in high ping value and packet loss, move back to strong signal place won't get out of stuck. Turn off wifi or screen on client, wait about 5mins, turn on wifi or screen again, client will stuck in high ping value and packet loss either. The common problem of all above is, once high ping and packet loss appears, you can't let it disappear by any means unless restart wifi on router.

@nbd168
Copy link
Member

nbd168 commented Mar 9, 2018

Any chance you could give me SSH access to your device while it's in the faulty state? Maybe I can find out a bit more about this bug that way.

@updatede
Copy link
Author

updatede commented Mar 9, 2018

@nbd168 of course, but I need private way to tell you how. is there private message at github ? it is 22:28 at my time, if you can do it tonight, I’ll tell you privately as soon as stuck appear.

@nbd168
Copy link
Member

nbd168 commented Mar 9, 2018

You could send me an email: nbd@nbd.name

@updatede
Copy link
Author

updatede commented Mar 9, 2018

ok , wait my email

@updatede
Copy link
Author

updatede commented Mar 9, 2018

have you receive my email ? I have not receive any response.

@nbd168
Copy link
Member

nbd168 commented Mar 9, 2018

I'm logged in and running some tests already. Thanks!

@Mushoz
Copy link

Mushoz commented Mar 19, 2018

I am back from my skiing holiday. Do you still need testers, @nbd168 ?

@updatede
Copy link
Author

updatede commented Apr 9, 2018

@nbd168 The latest r6653-c0bbb97 still has this bug unfixed after issue opened for about two months.
After analysing captured 80211 packets, I found that while my IPAD in normal state, AP often received sleep frame from my IPAD, after AP send TIM+ (TIM indicating buffered frame for IPAD) in beacon, AP received wakeup frame from my IPAD immediately.
But when my IPAD enter wifi doze state(turn off screen for about 10 mins), then I turn on screen and ping my IPAD, AP can't receive my IPAD's wakeup frame right after it sent TIM+ , after sending many many TIM+, AP received wakeup frame from IPAD. No matter what you do (even reboot IPAD) , this state will not change unless restart wifi on AP. When repeat this doze then turn on process to my wndr3800 with openwrt 15.05.1, it can wake up correctly just like normal state.
I don't know why. Maybe wifi doze STA let AP enter a false state and possibly:

  1. make AP can't transmit TIM+ to STA.( I can just see TIM+ in monitor interface by tcpdump, maybe these frames not really sent to air or something).
  2. make AP can't receive wakeup frame from STA.
  3. AP notify STA stay in wifi doze state forever (weird,because I can't get any ping response in doze state, but I can get ping response when turn on screen from doze state, just massive latency, don't know if there is mechanism in 80211).

@Mushoz
Copy link

Mushoz commented Apr 9, 2018

Thank you very much for digging into this! @nbd if you need testers, please let me know.

@Mushoz
Copy link

Mushoz commented Apr 9, 2018

I meant to tag @nbd168 :) sorry about that

@nbd168
Copy link
Member

nbd168 commented Apr 15, 2018

Please try the latest version

@updatede
Copy link
Author

updatede commented Apr 15, 2018

@nbd168
It improves since commit 7a31731. Turn off screen for 10 mins seems have no latency any more. But for 15~20 mins, the same problem is back.
I checked ath9k code , when sta is about sleeping, ath9k call ieee80211_sta_set_buffered with "true", and it has a ath_tx_aggr_sleep function that call ieee80211_sta_set_buffered either, after releasing buffer, it call ieee80211_sta_set_buffered with "false". But in mt76 code , there is only place call ieee80211_sta_set_buffered with "false", no "true" call by checking sleeping, is it the reason of issue ?

@updatede
Copy link
Author

"If the driver buffers frames in the driver for aggregation in any way, it must use the ieee80211_sta_set_buffered() call when it is notified of the station going to sleep to inform mac80211 of any TIDs that have frames buffered." from Linux 802.11 Driver Developer’s Guide.
Maybe need some aggregation-sleeping logic.

@updatede
Copy link
Author

updatede commented Apr 16, 2018

@nbd168
I add some printk() in mt76_txq_schedule_list(), the doze time need to trigger bug increase again.
So, maybe it's newly added code increase the execution time of mt76_txq_schedule_list(), and the trigger of the bug need to call mt76_txq_schedule_list() a number of times.

@updatede
Copy link
Author

updatede commented Apr 17, 2018

@nbd168
I reverted mt76 driver to first commit 9e7049b "Initial import" and the same problem still exist.
I made some modification to fit latest mac80211/cfg80211 because I couldn't get old mac80211 source code.
After that I wonder if mir3g's hardware is not compatible with ios or ipad air2 hardware, so I restore mir3g to stock firmware, the issue disppeared.
So I think bug is in initial driver code, or mac80211 code before/after "Initial import".

@updatede
Copy link
Author

updatede commented Apr 18, 2018

@nbd168
I disable 80211n/ac by setting mode to legacy to see if issue still exist.
From captures I can see ping frame from AP to STA now because of no aggregation.
When bug triggered I saw once a ping frame sent to STA before STA sent wakup null frame to AP.
If it's not a monitor interface issue, I think maybe it's the reason of this bug.
I've sent packet capture to your email address.

@updatede
Copy link
Author

updatede commented Apr 21, 2018

@nbd168
I have some new findings about this bug.
I find new ways to reproduce the symptom:

  1. Switch off wifi on iPad air2 , restart wifi on AP, don't connect any STA to AP, let AP run wifi for 15~20 mins, then connect iPad air2 to AP, the symptom appears.
  2. Restart wifi on AP, connect iPad air2 to AP, keep pinging from AP to iPad air2 for 15-20 mins, the symptom appears.
    So, even we can see later wakeup null frame, but this bug is not about sleep, not about sleep time, it's only about running time of AP after restarting wifi.
    Old version ipad that only support 80211N seems not affected , not very sure.
    Android client that support 80211AC also affected, ping echo time after triggered is 500-600ms, it's less than iPad's but significantly larger than normal.

@updatede
Copy link
Author

updatede commented May 6, 2018

@nbd168
I don't know if you test after my last comment. But I found that changing "beacon_int" can change bug trigger time. For example, change "beacon_int" to 50 will trigger this bug 10 mins after restarting wifi, 100 is 20 mins, 200 is 40 mins, and so on. So I think this bug is in beacon or packet sending related code.

@updatede
Copy link
Author

this project does not want to fix this bug and i switch to stock driver to abandon this driver.

@nbd168
Copy link
Member

nbd168 commented May 16, 2018

I do want to fix this bug, I just don't have much time on my hands at the moment.

@nbd168 nbd168 reopened this May 16, 2018
@updatede
Copy link
Author

no, thanks

@jsantala
Copy link

Just keep the ticket open, please. There's other people who want to see this fixed. Thanks!

@Mushoz
Copy link

Mushoz commented May 18, 2018

Agreed. Please leave this ticket open. There is valuable information in here and there are more people who would like to see this fixed.

@jedi7
Copy link

jedi7 commented May 18, 2018

And want help. Because @nbd168 do great work. (It is still opensource)

@nbd168
Copy link
Member

nbd168 commented May 18, 2018

I think I may have found a possible cause for this. Apparently the beacon timer drifts by one microsecond every time it fires. The minimum configurable interval is 64us, so it has to be corrected every 64 beacons. This would also explain why changing the beacon interval affects the time it takes for this issue to manifest.
When the client is in powersave mode, it regularly wakes up before the beacon transmission time and waits for a short period of time before going back to sleep again. After some time, the drift becomes so big that the client goes back to sleep again before the AP sends its beacon. The client then only notices after beacon loss detection kicks in and forces the client out of powersave.
I will let you know when I've implemented a fix for this and verified the timing.
In the mean time, is there anybody here except for @updatede that can reliably reproduce the issue?

@nbd168 nbd168 reopened this May 18, 2018
@Mushoz
Copy link

Mushoz commented May 18, 2018

I was seeing the same issue 2 months ago when I tried the master branch. I returned back to 17.01.4 where this issue does not exist for me.

I haven't tried the master branch in the mean time, but I plan on flashing the 18.06 branch this Monday when I come back from my holiday to help testing. Will let you know if I am able to reliably reproduce this.

@nbd168
Copy link
Member

nbd168 commented May 18, 2018

I've made a fix and verified the timings. It's in the master branch now, and I intend to push it to the 18.06 branch soon.

@slthomason
Copy link

slthomason commented May 18, 2018 via email

@nbd168
Copy link
Member

nbd168 commented May 18, 2018

I ran some tests, MT7603 does not have the same issue

@araujorm
Copy link

Hello.

As I have mentioned in #139 during tests for mt76x2u (USB variant) that is sharing some of the code, this exact issue happens everytime one scans for networks. TX and RX are simply blocked for about 15 seconds, which is not normal. I've just had the chance of confirming that it happens exactly the same with mt76x2 on latest master.

Just try it, launch some kind of speed test, iperf or ping and then do a iw dev wlan1 scan inside a shell on your openwrt router, or invoke a scan with the 5GHz radio in LuCI.

As I've also mentioned there, if you go to tx.c and comment both of the if (...) return -EBUSY lines the issue no longer happens, although performance is subpar during the scan.

My guess is that the original issue reporter has something in his configuration that triggers a network scan once in a while (in my case with the USB variant, it was happening very frequently since I was in a PC with Fedora/MATE and NetworkManager triggers a scan every other minute if the signal isn't what they would consider perfect).

Hope this helps @nbd168 and @LorenzoBianconi to find out what is going on.

@nbd168
Copy link
Member

nbd168 commented May 20, 2018

I've fixed the issue with latency during scanning in current mt76 git. Please test.

@araujorm
Copy link

@nbd168 Looking good, I no longer have total RX/TX loss when scanning. Ping responses are now normal when doing so. Iperf3 performance isn't still what I'd call optimal on that situation, but that may involve something specific to my environment, so it would be nice if other people also tested and gave feedback :)

Thank you.

@Mushoz
Copy link

Mushoz commented May 24, 2018

I promised to get back to you @nbd168 to report whether I could still reproduce this issue. I have been running the 18.06 branch since Tuesday (which does not contain the beacon timer drift fix) and I am unable to reproduce the issue. So in my case, the flaky 5ghz connection must have been caused by something else. So therefor, I am unable to test whether the beacon timer drift fix actually fixed anything regarding this reported issue. It would be best if @updatede could confirm/deny whether the fix is successful for him.

@nbd168 nbd168 closed this as completed Jun 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests