runtime: bgsweep's use of Gosched bloats runtime traces and wastes CPU time #54767
Comments
Interestingly I also just ran into this while collecting a histogram of how long goroutines remain running for, which had a big chunk running for absurdly short periods (e.g. ~30ns). Upon digging I found it was this same loop. cc @mknyszek
One somewhat hacky workaround would be to check for idle Ps prior to Gosched. If there are idle Ps, then Gosched is very unlikely to find work, so we could just skip it.
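A minimal sketch of that idea (the `goschedIfBusy` name comes from code quoted later in this thread; `idleProcCount` is a hypothetical stand-in for the scheduler's idle-P count, which user code cannot read):

```go
package main

import "runtime"

// idleProcCount is a hypothetical placeholder for the scheduler's count of
// idle Ps; the real runtime tracks this internally.
func idleProcCount() int { return 1 }

// goschedIfBusy yields only if no Ps are idle. When Ps are idle, Gosched
// would just reschedule this goroutine almost immediately, so skipping the
// yield avoids a pointless scheduler round-trip (and the trace events).
func goschedIfBusy() {
	if idleProcCount() > 0 {
		return
	}
	runtime.Gosched()
}

func main() {
	for i := 0; i < 5; i++ {
		// ...do a small unit of background work here...
		goschedIfBusy()
	}
}
```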
I agree it would be great to do something about this. Putting it in backlog since it's not a new issue. At the very least we can do what Michael suggested and just patch it for bgsweep. I like your idea of making Gosched a little smarter at emitting events in general (or calling into the scheduler).
This could bear to be higher priority. Wasting 1% of CPU is expensive these days, certainly on cloud platforms! But thanks for looking into it.
Thanks for the bump. I agree this issue is important to resolve, but I don't think resolving it is going to result in a 1% improvement in CPU utilization (or an improvement to throughput, for that matter). Something has to do that sweep work, and bgsweep typically only takes a non-trivial amount of CPU time if the application has idle time to spare anyway, precisely because it does call Gosched.

I think @dominikh's point is more that 1.31% of CPU time is responsible for 50% of the trace data, which I agree seems wildly disproportionate. In any case, I've already put this on my list for 1.20 (EDIT: just realized it was still in the Backlog milestone; moved), so I plan to work on it. There are a number of ways to resolve this issue.
I think (2) -> (3) -> (1) is a reasonable path. I'll do (2) first, which should solve the immediate issue, (3) might be nice in general (though I need to double-check that it's not breaking any documented semantics), and (1) just to see how gross it is and if it makes any meaningful performance difference.
And to be clear, (2) is the hacky workaround @prattmic mentioned. I personally don't think it's that hacky, and the effects of such a change are local to just bgsweep.
I'm worried about (3).
While this is true and makes (3) more risky, I'm not positive we actually guarantee any of that (and in theory, it could change out from underneath users). Though, I don't feel strongly about this. We can skip (3), and I'll just try (2).
Also, (3) will probably break a whole bunch of our own tests...
Working on (3), and while I think it mostly resolves this, there's still a case left over. A mitigation for that case could just be to do more work than sweeping one span before yielding. For instance, we could sweep 10 spans at once. Since we have a pretty good idea that this goroutine runs for only 10s of nanoseconds at a time, I think we can easily do 10x the amount of work before trying to yield, which will produce 10% of the events it produces today.

For something slightly more robust, we can more explicitly bound the latency the same way the scavenger does. However, while that is conceptually more robust, it has some issues that we need to work around (e.g. coarse-grained timers) which add complexity. I'm inclined to just sweep 10 spans at a time, write a comment about it, and call it good (plus the idle-P check). Any thoughts?
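A rough sketch of the batching idea, with stand-ins for runtime internals (`sweepSpan` and `yieldIfBusy` below are placeholders, not the actual runtime functions):

```go
package main

import "runtime"

const sweepBatchSize = 10 // spans to sweep between yields

// sweepSpan is a placeholder for the runtime's sweepone: sweep one span and
// report whether there was any sweep work left to do.
func sweepSpan() bool { return false }

// yieldIfBusy is a placeholder for an idle-P-aware yield (see the earlier
// sketch); plain Gosched is used here for simplicity.
func yieldIfBusy() { runtime.Gosched() }

// backgroundSweep sweeps spans in batches, yielding only after each full
// batch. Compared to yielding after every span, this makes roughly a tenth
// of the scheduler calls and emits roughly a tenth of the trace events.
func backgroundSweep() {
	swept := 0
	for sweepSpan() {
		swept++
		if swept%sweepBatchSize == 0 {
			yieldIfBusy()
		}
	}
}

func main() { backgroundSweep() }
```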
I tend to agree, but I also don't see much of a downside to sweeping 10 spans at a time, and it would be unfortunate to miss this one case (for the sake of robustness). The bgsweep goroutine is still preemptible in between each span swept, so it's not like it adds any latency to, say, a STW or something.
Fewer calls into the scheduler sounds good to me. I think that doing even 10µs of work at a time would keep this responsive enough to hide in the noise of the other latency sources we have today. Avoiding coarse-grained timer problems and bumping by a constant 10x (and confirming that in practice the loop typically takes 10µs or less on modern hardware) seems like a good plan. Maybe the next release goes up to 32 iterations, and the one after increases it to 100, if we're not yet happy with the balance of responsiveness vs efficiency vs simplicity.
Change https://go.dev/cl/429615 mentions this issue:
Sorry, yes, sweeping multiple spans SGTM. I was thinking of the complexity of timer-based latency bounds.
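For contrast, a time-budget version of the loop (the approach set aside above) might look roughly like the sketch below. The 100µs budget is an arbitrary illustration, `sweepSpan` is a placeholder, and the real scavenger uses runtime-internal clocks rather than the time package:

```go
package main

import (
	"runtime"
	"time"
)

const sweepBudget = 100 * time.Microsecond // illustrative latency budget

// sweepSpan is a placeholder for the runtime's sweepone.
func sweepSpan() bool { return false }

// backgroundSweepTimed yields whenever the loop has held the P for longer
// than sweepBudget, instead of counting swept spans. This bounds latency
// more directly, but depends on cheap, fine-grained time measurement, which
// is exactly the complexity a fixed batch size avoids.
func backgroundSweepTimed() {
	start := time.Now()
	for sweepSpan() {
		if time.Since(start) >= sweepBudget {
			runtime.Gosched()
			start = time.Now()
		}
	}
}

func main() { backgroundSweepTimed() }
```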
(Embedded permalink to the bgsweep loop, lines 298 to 306 at commit cc1b20e.)
@mknyszek It seems that nSwept could overflow in the loop above. How about something like this?

```go
const sweepBatchSize = 1 << 3 // or 1<<4
const mask = sweepBatchSize - 1

nSwept := 0
for sweepone() != ^uintptr(0) {
	sweep.nbgsweep++
	// Wrap the counter with a mask so it never grows without bound,
	// and yield only once per sweepBatchSize swept spans.
	nSwept = (nSwept + 1) & mask
	if nSwept == 0 {
		goschedIfBusy()
	}
}
```
Thanks, but I don't think this is necessary. It's not actually possible for nSwept to overflow: the number of spans it can count is bounded by the size of the heap.

However, I don't think it needs a comment or anything; we have plenty of counters in the runtime that are proportional to the size of the heap that won't overflow for exactly the same reason. IMO the logic of the existing code demonstrates intent more clearly, and is also (I believe) correct.
Sorry for the noise! I thought nSwept was a global variable when I wrote the comment. I must be out of my mind then. Thank you for your patience!
It's no worries. :) You may have been thinking of
runtime.bgsweep has a loop that calls Gosched after doing a small amount of work. The intention is to only use a
fraction of the available CPU time. When there are more runnable Gs than Ps, this works well: bgsweep relinquishes most
of its time and gets scheduled again after some time has passed. However, when we have excess Ps, bgsweep gets scheduled
again almost immediately after calling Gosched. This continues until bgsweep runs out of work entirely, at which point
it will stop looping. On large systems, with code that produces a disproportionate amount of garbage, bgsweep may only
rarely run out of work.
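For reference, the relevant part of the loop has roughly the following shape (a simplified sketch, not the actual runtime source; `sweepSpan` stands in for the runtime's sweepone):

```go
package main

import "runtime"

// sweepSpan is a placeholder for the runtime's sweepone, which sweeps a
// single span and reports (via a sentinel return value) whether any sweep
// work remained.
func sweepSpan() bool { return false }

// bgsweepShape mirrors the structure described above: sweep one span, then
// call Gosched, and repeat until the sweep work runs out. With idle Ps, each
// Gosched reschedules this goroutine almost immediately, so every iteration
// pays for a scheduler round-trip (and two trace events) to do one span's
// worth of work.
func bgsweepShape() {
	for sweepSpan() {
		runtime.Gosched()
	}
}

func main() { bgsweepShape() }
```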
By itself, this might not be a major problem, although it's probably a waste of CPU time to constantly call Gosched just
to be scheduled again almost immediately. However, in combination with runtime/trace, this causes a bad user experience.
Each time bgsweep calls Gosched, and each time it gets scheduled again, an event gets emitted. This turns out to be a
lot of events. I have a trace captured from a real-world deployment of Prometheus. It is 60 seconds long and has the
following statistics:
- bgsweep was actively sweeping for about 28 of the 60 seconds, across 3 periods of doing work (one continuing from the previous cycle when the trace begins).
- bgsweep was responsible for roughly 50% of all events in the trace.
- According to go tool trace's goroutine analysis, bgsweep was responsible for 1.31% of total program execution time.

This means that during 28 seconds of actual activity, constituting 3 periods of doing work and just 1.31% of all work done, bgsweep was responsible for 50% of all events in the trace. During these periods, bgsweep spends about 500 ns running and 600 ns inactive, repeated millions of times.
In the end, this results in a 300 MB trace file that takes 2.5 minutes to be parsed and split by go tool trace, during which time it consumes over 23 GB of system memory. The resulting JSON for the browser frontend (for all split
traces combined) is 11 GB in size. Admittedly, halving all of these metrics would still not be ideal, but it'd be half
as bad. And 1.31% of all execution wouldn't be responsible for 50% of the trace data. Furthermore, the remaining 33
million events are much more evenly distributed.
I don't have any proposed solutions nor deep knowledge of the runtime. However, I wonder if it would make sense and be
feasible for Gosched to be a no-op when the system is under-utilized and we're not trying to stop the world? In this particular example, that should let bgsweep finish almost twice as fast and eliminate nearly all of these events, wasting less time in the scheduler without impacting other goroutines.