runtime: significant performance regression #67585
This may have been fixed in https://go.dev/cl/586796; it looks like the benchmarks haven't run on a commit that includes that change. The patch series in question starts with https://go.dev/cl/585638 and ends with https://go.dev/cl/586796.
Indeed, I accidentally ran tip (6861b2e) as the "bad" commit initially when I was bisecting and was very confused because it didn't seem different from the "good" commit. I'll leave this issue open until we can confirm it on the perf dashboard, but I'm glad it's probably already fixed. :)
Part of the problem here is that I reduced the resources devoted to running perf benchmarks, so the data is much sparser than it used to be. I'll piggyback on this issue to fix that and turn it back up.
Looks like there is a run at d0edd9a now, which is slightly newer than your CL @rhysh (8ab131f). Performance is improved, but still significantly regressed. A lot of these are microbenchmarks, but https://perf.golang.org/dashboard/?benchmark=Tile38QueryLoad-88&unit=sec/op is not. I'm surprised to see such a huge regression there. That benchmark is part of sweet (https://cs.opensource.google/go/x/benchmarks/+/master:sweet/).
Ah, Tile38QueryLoad-88 regressed, but Tile38QueryLoad-16 did not (at least not noticeably). So this is a scalability regression, or maybe a regression that only appears once contention occurs.
I'm having trouble getting a local version of the tile38 benchmarks running, starting from the suggestion in https://cs.opensource.google/go/x/benchmarks:
Sorry, the top-level instructions in that repo are apparently out of date; you want the instructions for sweet itself. I double-checked that those instructions work for me:
The benchstat results will be in the results directory. It is possible to run the benchmark in a more isolated way, but it is a bit complicated for tile38, since it is client/server; see how sweet builds and runs the benchmark.
That's working, thanks!
Comparing from before the mutex profile changes through to a version that includes this morning's sampling fix (https://go.dev/cl/586796) doesn't show much change on my "20 thread" linux/amd64 test machine: "sec/op" is "+0.75%". It's an i7-13700H, with a mix of E- and P-cores, running kernel 6.5. Here's the banner from that run:
I'm working on a finer-grained set of benchmark runs. Seeing which commit in the series caused the regression should give good hints about how to resolve it ... at least, if this 20-thread machine is big enough to reproduce it.
My 20-thread machine doesn't show a clear reproduction of the Tile38QueryLoad-88 regression. But I am able to see a slowdown in the runtime's own microbenchmarks, such as ChanContended:
Focusing on the ChanContended benchmark, I looked at the program's syscalls while it ran:
Each iteration of BenchmarkChanContended involves 200 operations (100 sends, 100 receives) on a buffered channel, so that run would have done 200 * 10 * 10000 = 20,000,000 channel operations. The 27,000,000 futex calls would be 13,500,000 sleep/wake pairs (or many failed sleeps, since the value would keep changing as new Ms registered?), which could mean roughly two thirds of the channel operations resulted in a futex sleep. Comparing against the mutex profile shows that not all contention events (in that microbenchmark) led to futexsleep:
19.8 million contended lock calls taking 123 seconds to resolve (with GOMAXPROCS=20) means an average of 6.22µs each.

It seems to start with dfb7073, which restructured lock_futex.go to be similar to lock_sema.go (though ba1c5b2 gave another increase). That restructuring changed the speculative swap at the start into a CAS. But it wasn't until d881ed6 that each M started writing a pointer to itself into the mutex.

My hypothesis on the Tile38QueryLoad-88 regression, and why it appears only on relatively large machines, is that there's a (runtime-internal) mutex that's not quite contended with that app until a certain number of Ms/Ps are active. Could be something like ...
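For concreteness, here is a minimal sketch of a ChanContended-style benchmark, based only on the description above (100 sends and 100 receives per iteration, all against one shared buffered channel). It is not the runtime's actual source; the package and function names are made up for the illustration, and the sends and receives are interleaved so the sketch cannot deadlock for any worker count up to the buffer size.

```go
package chansketch

import "testing"

// Sketch only: approximates the shape of a ChanContended-style benchmark as
// described above, not the runtime's exact source. Every parallel worker
// hammers the same buffered channel, so the channel's internal runtime.mutex
// is the contended resource being measured.
func BenchmarkChanContendedSketch(b *testing.B) {
	c := make(chan int, 100) // one shared buffered channel
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// 100 sends and 100 receives per iteration (200 channel ops),
			// interleaved so no worker can block with the buffer full.
			for i := 0; i < 100; i++ {
				c <- 1
				<-c
			}
		}
	})
}
```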
But I had my focus in the on-CPU profile backwards: the interesting work isn't how threads spend their time waiting ... they're waiting because something else is inside the critical section. The ChanContended benchmark is a measure of the critical section of channel operations, including the portions of lock2 and unlock2 where the other threads are excluded. And the critical section now includes new code. With each "op" representing 200 channel operations, growing from 43 µs/op to 82 µs/op means the test's critical section grew, in effect, from about 220ns to 410ns: an extra 190ns of locking overhead. Does that sound like the right line of analysis?
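To make the fast-path restructuring discussed above concrete, here is an illustrative sketch of the two shapes being compared: the old speculative swap versus the CAS introduced by the restructuring. This is not the runtime's lock_futex.go code; the atomic.Int32 state word and the function names are assumptions for the sketch.

```go
package locksketch

import "sync/atomic"

const (
	unlocked int32 = 0
	locked   int32 = 1
)

// Old shape: speculatively swap in "locked". If the previous value was
// "unlocked", the fast path owns the lock; otherwise it falls through to
// the slow path with whatever value the swap returned.
func tryLockSwap(key *atomic.Int32) bool {
	return key.Swap(locked) == unlocked
}

// New shape: compare-and-swap from "unlocked" to "locked". The fast path
// only succeeds when the word still holds exactly "unlocked"; once the word
// can also carry other state (such as a pointer to a waiting M), a blind
// swap would clobber that state, so a CAS is used instead.
func tryLockCAS(key *atomic.Int32) bool {
	return key.CompareAndSwap(unlocked, locked)
}
```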
I ran the tile38 benchmark on a larger machine and looked at its CPU profile with GOMAXPROCS set to 22, 44, and 88.
The CPU profile shows much more contention when running with GOMAXPROCS=88 than with 44 or 22: it's 35% of on-CPU time, rather than 2.4% or 0.56%. The contention seems focused on the runtime's lock calls, and the call stack points at a contended runtime.mutex.

My current understanding of the problem:
Do we have explicit testing for the ratio of contended versus uncontended runtime.mutex performance? It's possible that tile38's performance could be improved by reducing how often it reaches that contended path.

@golang/runtime, do you read the data in the same way, in that lock2/unlock2 pairs on contended runtime.mutex values are about 200 ns slower? And that memory latency is a likely cause? And if so, that a viable path forward could be somehow preloading that memory while outside of the critical section?
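On the question of explicit testing for the contended-versus-uncontended ratio, the sketch below shows the general shape such a comparison could take. It uses sync.Mutex as a stand-in, since lock2/unlock2 are not reachable from ordinary test code; a real version would need to live inside (or be exported from) the runtime package, and these benchmark names are hypothetical.

```go
package mutexratio

import (
	"sync"
	"testing"
)

// Uncontended baseline: a single goroutine repeatedly locks and unlocks.
func BenchmarkMutexUncontended(b *testing.B) {
	var mu sync.Mutex
	for i := 0; i < b.N; i++ {
		mu.Lock()
		mu.Unlock()
	}
}

// Contended case: all parallel workers fight over one mutex. The ratio of
// the two reported ns/op values approximates the contended/uncontended cost
// ratio being asked about above.
func BenchmarkMutexContended(b *testing.B) {
	var mu sync.Mutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			mu.Unlock()
		}
	})
}
```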
Not that I am aware of. The tar pit is a bit worrying.
This sounds correct to me. It seems to be counting the cross-core communication overhead as well, but I suppose that counts as part of the critical section (if other cores can't observe that you have unlocked, it is effectively locked).
I could believe that, though I'm not certain; I'd need to think about it more. Stepping back, I think we have three high-level options for next steps:

1. Keep the change as-is and accept the slowdown for heavily contended runtime.mutex values.
2. Keep the change and try to fix the regression within this release cycle.
3. Revert for this release and revisit in the next cycle.
Though I hate to revert, I'd say I'm leaning towards (3). If it were the beginning of the dev cycle, I'd be more comfortable getting more production experience to see if (1) is OK. Similarly with (2): the complexity there is somewhat unknown, though if we find a single smoking gun that fixes everything, great.
Thanks Michael. I'm not in a position to support option 1: that sort of performance cliff is one of the most operationally worrisome ways that a Go program can fail, so I don't want to make it worse. It's also bad news for vertically-scaled apps, which may not be able to find machines that are big and fast enough to recover the performance loss.

I've made some progress on option 2, on a path with a pretty simple concept for futex-based locks: moving where the sampling decision is made.

Option 3 is safe, and would work for my purposes. My goal for this change is to support the project by increasing the fraction of bug reports we get that include enough detail to find a solution. So there's a question of balancing that benefit against the risk of the change, including the work in support of option 2. Maybe this is addressing a sharp pain point for the project and so can pay its way; if not, then a revert is probably right for the project's overall health.
Change https://go.dev/cl/588857 mentions this issue: |
Change https://go.dev/cl/588856 mentions this issue: |
I'm also in support of reverting and revisiting for 1.24. It's always unfortunate for things to just miss a release, but given that this touches really foundational code, I lean strongly toward the safest solution at this point in the release cycle. I look forward to seeing this re-land in 1.24! @rhysh, do you have time to prep rollbacks? If not, we can do it.
Change https://go.dev/cl/589095 mentions this issue: |
Change https://go.dev/cl/589096 mentions this issue: |
Change https://go.dev/cl/589116 mentions this issue: |
Change https://go.dev/cl/589098 mentions this issue: |
Change https://go.dev/cl/589097 mentions this issue: |
Change https://go.dev/cl/589115 mentions this issue: |
Change https://go.dev/cl/589117 mentions this issue: |
Thanks @aclements, the reverts are ready. Their test failure (cmd/cgo/internal/swig on linux-386) is also present on the build dashboard. Beyond touching foundational code, this series touches foundational code across several platforms: the current performance builders are only for linux/amd64, which wouldn't catch problems with semaphore-based locks (darwin, windows) or on platforms where cputicks is implemented by calling nanotime (often on arm64 and arm).
This reverts commit 8ab131f (CL 586796).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I54711691e86e072081482102019d168292b5150a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589095
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>

This reverts commit f9ba2cf (CL 586237).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck. This test verifies that the semantics of the mutex profile for runtime.mutex values matches that of sync.Mutex values. Without the rest of the patch series, this test would correctly identify that Go 1.22's semantics are incorrect (issue #66999).

Updates #66999
Updates #67585

Change-Id: Id06ae01d7bc91c94054c80d273e6530cb2d59d10
Reviewed-on: https://go-review.googlesource.com/c/go/+/589096
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

This reverts commit 87e930f (CL 585639).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I1e286d2a16d16e4af202cd5dc04b2d9c4ee71b32
Reviewed-on: https://go-review.googlesource.com/c/go/+/589097
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>

This reverts commit ba1c5b2 (CL 585638).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: Ibeec5d8deb17e87966cf352fefc7efe2267839d6
Reviewed-on: https://go-review.googlesource.com/c/go/+/589115
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

This reverts commit d881ed6 (CL 585637).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I70d8d0b74f73be95c43d664f584e8d98519aba26
Reviewed-on: https://go-review.googlesource.com/c/go/+/589116
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>

This reverts commit dfb7073 (CL 585636).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I3483bf0b85ba0b77204032a68b7cbe93f142703e
Reviewed-on: https://go-review.googlesource.com/c/go/+/589098
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>

This reverts commit be0b569 (CL 585635).

Reason for revert: This is part of a patch series that changed the handling of contended lock2/unlock2 calls, reducing the maximum throughput of contended runtime.mutex values, and causing a performance regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I7843ccaecbd273b7ceacfa0f420dd993b4b15a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589117
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
It looks like mostly afbbc28, plus a little from the following CLs, got everything back to where it was. Notably, Tile38QueryLoad-88 is back to where it was on both sec/op and latency. A bunch of microbenchmarks benefited in mysterious ways from dfb7073, and thus slowed back down to where they had been once that was reverted, but the original effect was clearly not intended. Several benchmarks are still behind where they were as of the last release; I'll open a separate issue to investigate that.
According to the performance dashboard, there was a significant regression across many benchmarks somewhere between d68d485 (good) and 647870b (bad). This was almost certainly from one of the commits in this CL stack, but I'm going to start up a bisect to pin it down. The regression is noticeable across more benchmarks on the 88-core machine than on the 16-core machine, suggesting it may be a scalability issue.
cc @mknyszek @rhysh