time: timer gets significantly delayed during long time GC #45632
Comments
Thanks for filing the issue. Does this issue reproduce with 1.16 as well? Edit: never mind, I see you said it applies to the latest release as well.
The listed demo works well with Go 1.13, but the problem reproduces with Go 1.14 and later. It is probably because the [...] As an easy fix, we can make the P's not running [...] It does work, but I am not sure whether it is a proper fix.
@randall77 Hi Randall, if I read this right, calling runtime.GC will block the calling goroutine. It is reasonable that the caller is blocked, because it needs to wait for the GC to finish. I don't think the other goroutines should be blocked as well. To make this clear, I collected trace data for the timer demo. We can see that goroutines are scheduled well during GC: each goroutine gets a small scheduling slice, and all runnable goroutines get a chance to run. However, the timer-driven goroutine G19, which is expected to be scheduled every 200ms, gets significantly delayed. It misses many ticks while GC is working, i.e., during the second GC phase in the following graph. The timer-driven goroutine behaves completely differently from the other goroutines, and this behavior severely degrades timer precision.
We don't make that promise but generally speaking it doesn't block other goroutines except briefly to stop-the-world (just as a GC does). I don't think the dedicated GC needs to be involved in this solution at all -- ideally, it really does just run unimpeded. We have 3 other Ps (in this reproducer) that should be checking timers regularly, basically whenever they yield back into the scheduler. Your execution trace shows that we're very clearly time slicing these goroutines, so they're calling into the scheduler fairly often. The fact that timers are missed seems like it could be a bug in the scheduler or timer system (but I haven't thought about this enough to have an idea of what it is yet).
cc @ChrisHines
I believe the issue here is that, in the situations where the dedicated GC worker runs on the P that owns the 200ms timer, the timer starves while the GC is running. It starves because the owning P is busy and because all of the other Ps have a steady supply of runnable Gs. Ps only attempt to steal timers from other Ps when they are out of local work, which never happens in this scenario. As a result, the timer isn't checked until its owning P finishes the GC work and checks its own timers.
We have also run into this problem in our company's products. Our service has a very long GC cycle and a relatively large number of timers, so the probability of seriously late timers is particularly high. Much of our processing logic relies on timer accuracy, so it is very important to us.
Change https://golang.org/cl/352710 mentions this issue:
We do have a fairly efficient mechanism for stealing timers from other Ps ([...]). This problem isn't necessarily specific to the dedicated GC worker, though that may be the worst culprit. We could make the scheduler try stealing timers more frequently, sort of like how it currently consults the global run queue periodically. We could focus on the dedicated GC worker and have [...] (More broadly, I've been thinking for a while about unsharding the timer queue and replacing it with a true concurrent data structure. That would probably solve this, too, because any P could run an expired timer, but that's a big chunk of work to bite off.)
@aclements I agree that the problem is not specific to the dedicated GC worker. I was going to reply along the same lines, but you beat me to it. :) I am curious about your ideas for unsharding the timer queue. It seems to me that it could certainly simplify the runtime's model of timers, but how do we avoid reintroducing #15133? Did you have any specific concurrent data structures in mind for this situation?
@aclements Yes, steal work by [...]
Starting from [...]
Prior to 8fdc79e in Go 1.16 the [...] This issue boils down to a question of priorities. The system is oversubscribed, with more work than it has P's to handle. If we want it to give more priority to timers, something else will get less priority. Done poorly, something else may starve.
Yes, Prior to 8fdc79e in Go 1.16 the sysmon thread did exactly that, which doesn't work mainly because the [...]
What version of Go are you using (go version)?
go1.15.8
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
What did you do?
First, I initialized a large number of objects that are never released while the program runs. These objects make each GC cycle take a long time.
Then I started 6 goroutines:
After that, the main goroutine sleeps for 20 seconds and waits for the ticker goroutine's output.
The source code of my program is also posted:
What did you expect to see?
I expect every elapsed time to be around 200 ms, with only a small deviation.
go1.15.8
timer with 200 ms period started...
200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 190, 200, 200, 200, 200, 200, 200, 200, 200, 200, .......
Process finished with exit code 0
What did you see instead?
In practice, the timer is randomly delayed by a large amount. In some runs, the 200 ms ticker is delayed by more than one second. The result is posted:
go1.15.8
timer with 200 ms period started...
200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 190, 200, 200, 200, 200, 200, 201, 200, 200, 200,
541 ms=========== WAKE UP TOO LATE!!!
59, 200, 200, 200, 200,
1516 ms=========== WAKE UP TOO LATE!!!
84, 200, 200, 200, 200,
1458 ms=========== WAKE UP TOO LATE!!!
142, 200, 200, 200, 200,
1403 ms=========== WAKE UP TOO LATE!!!
197, 200, 200, 200, 200,
1378 ms=========== WAKE UP TOO LATE!!!
22, 200, 200, 200, 200,
1528 ms=========== WAKE UP TOO LATE!!!
72, 200, 200, 200, 200,
1511 ms=========== WAKE UP TOO LATE!!!
89, 200,
Process finished with exit code 0