runtime: time.Sleep takes more time than expected #44343
Comments
It looks like this happened in 8fdc79e. https://go-review.googlesource.com/c/go/+/232298 CC: @ChrisHines |
This is reproducible with a trivial benchmark in the time package:
func BenchmarkSimpleSleep(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Sleep(50*Microsecond)
	}
}
amd64/linux, before/after http://golang.org/cl/232298:
|
For reference, across different sleep times:
|
Looking at the 100µs, the immediate problem is the delay resolution in Prior to http://golang.org/cl/232298, 95% of timer expirations in the 100µs case are detected by After http://golang.org/cl/232298, this path is gone and the wakeup must come from I'm not sure why I'm seeing ~500µs on the 10µs and 50µs benchmarks, but I may have bimodal distribution where ~50% of cases a spinning M is still awake long enough to detect the timer before entering I'm also not sure why @egonelbre is seeing ~14ms on Windows, as that also appears to have 1ms resolution on |
I think the ideal fix to this would be to increase the resolution of As it happens, Linux v5.11 includes In the past, I've also prototyped changing the Both of these are Linux-specific solutions, I'd have to research other platforms more to get a sense of the options there. We also may just want to bring the |
I guess that Perhaps when |
While working on CL232298 I definitely observed anecdotal evidence that the netpoller has more latency than other ways of sleeping. From #38860 (comment):
I didn't try to address that in CL232298 primarily because it was already risky enough that I didn't want to make bigger changes. But an idea for something to try occurred to me back then. Maybe we could improve the latency of non-network timers by having one M block on a I haven't fully gauged how messy that would get. Questions and concerns:
One other oddity that I noticed when testing CL232298: The Linux netpoller sometimes wakes up from the timeout value early by several microseconds. When that happens |
As somewhat of a combination of these, one potential option would be to make
Hm, this sounds like another bug, or perhaps a spurious |
It seems that on Windows, |
My first thought on the Windows behavior is that somehow |
That could be, but I was logging at least some of the calls to netpollBreak as well and don't recall seeing that happen. I saved my logging code in case it can help. https://github.com/ChrisHines/go/tree/dlog-backup |
For reference, output on my Windows 10:
Totally different results from #44343 (comment).
|
@vellotis this could be because there's something running in the background changing the Windows timer resolution. This could be some other Go service/binary built using an older Go version. Of course there can be plenty of other programs that may change it. You can use https://github.com/tebjan/TimerTool to see what the current value is. There's some more detail in https://randomascii.wordpress.com/2013/07/08/windows-timer-resolution-megawatts-wasted. |
For what it's worth, go1.16.2 darwin/amd64 is also exhibiting this. In a program I'm running, the line |
@prattmic Strangely, I just checked the output of the script now and (although I just restarted it a few hours before) it was spot on at one hour twice in a row. However, yesterday it was "generally" 3 minutes. I don't have an exact time since we were only seeing the minute logged. It was always 3 minutes (rounded) from the previous run. 1:00pm, 2:03pm, 3:06pm, 4:09pm, etc. Some things to note about this loop I'm running, it is calling a shell script using If you have some code you would like me to run I would be happy to - otherwise I will keep an eye on it here as well and report anything else I see. Hopefully some of this is helpful in narrowing down the cause if you haven't already. UPDATE 3-20-2021 10am ET: I just ran my program for about the last 24 hours and sure enough, it would run fine for the first few iterations and then start to sleep longer based on the logging. Mostly it would hover between 3-5 minutes late but once it was 9 minutes! I've (this morning) written another program using the same logic but with simulated activity (since the actual tasks did not seem to be part of the problem) and much more detailed logging. I have it running now in 3 environments, two using |
3 minutes per hour is rather worrisome, especially if some cron-like functionality is required in a long-running service... |
In case it's related; on Windows, timers tick while the system is in standby/suspend, but they pause on other platforms. This is supposed to have been addressed already #36141 |
@prattmic et al, I took the liberty of writing a testbed to experiment with this issue. See all the details of test runs on Linux, Intel, and M1, along with source code here: https://github.com/zobkiw/sleeptest Comments welcome - especially if you find a bug in the code. Also, see the |
cc @zx2c4 :-) |
@zobkiw could it be something due to cpu frequency scaling? Can you try locking the cpu to a specific frequency (something low) and see whether the problem disappears? |
My understanding is that this issue is about Go 1.16+ runtime timers having worse best case granularity than earlier versions of Go. In other words, a 50 microsecond timer always takes longer than that to fire and also takes longer to fire than on earlier versions of Go, especially on Windows. It does not require high CPU load to trigger the problem described here. It is fundamental to changes in how timers are serviced in the runtime in Go 1.16+. I think a fix for this issue would be focussed on reducing the timer latency and would not have to reduce the runtime overhead of |
|
@egonelbre my entire analysis was on Mac and Linux. Mac was unaffected by these changes, but Linux on both ARM and x86 was strongly impacted by a regression in Go 1.16, as I documented there. This isn’t just about Windows. I would also point to my comment discussing some ideas for how to move forward with resolving this regression without completely negating the original reasons that the regression was introduced, as far as I can tell. |
@egonelbre Your original post shows a regression on Debian 10 from ~80µs to ~1ms. The regression is much less than on Windows, but it's still there. |
The underlying issue here is that we depend on the Due to the extremely coarse resolution on Windows, the issue is more extreme, but it still affects all systems to some extent. |
@ChrisHines I need to find the facepalm emote. Somehow completely forgot and eyes glided over it. |
In #44343 (comment), I noted that @egonelbre No worries, this issue has gotten rather long and I keep having to reread to remember as well. :) |
I've thrown together the mostly-untested https://golang.org/cl/363417 to prototype using |
Change https://golang.org/cl/363417 mentions this issue: |
If you focus on the time spent in netpoll, maybe this issue is related: #43997 |
By the way, could you review my CL https://go-review.googlesource.com/c/go/+/356253? It's for issue #49026. We are testing it on our own Go, but we don't have enough confidence to use it in a production environment. |
Dear Contributors, This is very bad, since we use the sleep and ticker funcs for periodic hardware measurements. I think those funcs should always stay within roughly 1ms accuracy; otherwise they are no longer usable. Timer funcs are very important. Many thanks for your great work! go version go1.15.15 windows/amd64 go version go1.17.6 windows/amd64 |
I want to pile in saying we are also experiencing this issue, causing us to roll back to Go 1.15 for our project (it is performance-sensitive, and the degradation is not worth any other benefits right now). This has been proven present on Linux-based distributions, so it would be great if the tagging reflected that (or removed the Windows-specific one) to help with prioritization. |
I wish there was a voting for this ticket, as I would have given it a +10 (just like @jhnlsn). The current labeling of this issue, in my view, as a Windows-specific issue is misleading. If I read what's being written here correctly, then Go 1.16 introduces several independent regressions within its event triggering subsystem: Second, on Linux-based systems, the entire time-based event system (which includes time.Sleep, but also time.After and time.Timer) was downgraded to have millisecond-level accuracy. Third, on Windows, the regression (which I have not verified myself, and I'll refer to @woidl's excellent post to speak for itself) is making the usage of these calls absurdly hard to justify. I do realize that fixing these issues might not be trivial, especially as @prattmic noted with his Linux v5.11 solution. Could I suggest that besides prioritizing this issue, the documentation itself would be updated to reflect the above? Do you agree with my conclusion above @mknyszek @ChrisHines? |
With regards to fixes, I've thought about this more today and my thoughts are:
cc @aclements |
If we're willing to assume that long timers require less precision, we could somewhat reduce this cost by only using the high-resolution timer if the timeout is short, and otherwise using the low-precision timeout to the blocking We could go even further and record the precision when a timer is enqueued, so we know at creation-time how precise we have to be once it bubbles to the top of the heap. However, this is complicated because the top timer might be a low-precision timer firing soon, immediately followed by a high-precision timer and we actually need a high-precision wait to avoid missing the second timer. |
Would this involve returning to 1ms precision on the regular timer? There are probably many cases where +/- 8ms is acceptable, but I doubt they account for the majority of sleeps in any given Go program. |
Hi, I still hold the opinion that 1ms resolution is needed. |
I'm not sure what you mean by a "regular" timer. If it's a short sleep or timeout, we would use a slightly more CPU-expensive way to wait that has higher precision. Hopefully we can do that without I suspect it's actually quite common for programs to have long timers. RPC-driven programs often have a large number of long-duration timers for timeouts (often on the order of minutes). They also often go idle, which would enter a blocking netpoll the same way a sleep does. |
I see, you're right. The main concern I have, then, is that semaphores (and therefore notesleep/notewake) are still based on the system timer granularity. I understand that this is the "slow" path, but there's a difference between a 1ms futex and a 16ms (when unmodified) singleobjectwait. |
## Summary This PR upgrades the go-algorand repository to use Go 1.16 while adding workarounds for golang/go#44343. ## Test Plan Use performance testing to verify no regressions.
This seems to be a regression with Go 1.16 time.Sleep.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
hrtime is a package that uses RDTSC for time measurements.
Output on Windows 10:
Output on Linux (Debian 10):
The time granularity shouldn't be that bad even for Windows. So, there might be something going wrong somewhere.