runtime: regression in BiogoKrishna Sweet benchmark in the Go 1.21 cycle #64342

Closed
mknyszek opened this issue Nov 22, 2023 · 0 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Performance
mknyszek commented Nov 22, 2023

As we looked back over the year, we noticed that the BiogoKrishna benchmark had regressed significantly (~20%) between the Go 1.20 and Go 1.21 releases on our linux-amd64-perf builder.

Upon further investigation, the culprit appears to be 9f9bb26. This change partially rolled back a switch, made during the Go 1.21 cycle, to setting MADV_HUGEPAGE on all heap memory. For a number of reasons, this switch turned out not to be a good idea. Mainly, marking memory as MADV_HUGEPAGE can trigger an unbounded stall on an access to that memory in many common kernel configurations for huge pages. See #61718. (9f9bb26 replaced MADV_HUGEPAGE with something else which had the same latency problem, just at syscall time. This too was rolled back.)

As of Go 1.21.4, the Go runtime no longer tries to mark any memory for huge page tracking purposes. What is left is a policy change in the Go runtime's scavenger (the part that returns memory to the OS) that makes it significantly friendlier to OS huge page heuristics, which at this point I believe is about as good as we can get. The Linux APIs unfortunately do not give the Go runtime the precise control it wants over the situation. They seem mostly focused on the use case of operators and application owners tweaking the huge page settings, not memory allocators or language runtimes, despite some of the messaging around these features.

(For more details about huge pages see https://go.dev/doc/gc-guide#Linux_transparent_huge_pages, which was added during the Go 1.21 release.)

One thing to note is that this culprit alone doesn't explain why there is a regression between Go 1.20 and Go 1.21. The explanation is that, prior to the policy change that landed in Go 1.21, Go 1.20 would occasionally mark memory as MADV_HUGEPAGE. The behavior was hard to predict (since it relied on the order of memory allocations), but deterministic. In short, if a memory allocation was made that contained at least one complete aligned huge page, that huge page would get marked as MADV_HUGEPAGE.
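The "at least one complete aligned huge page" condition can be illustrated with a small sketch. This is not the runtime's actual code: the function name `alignedHugePageRange` is hypothetical, and the 2 MiB huge page size is an assumption (typical for linux/amd64). It only shows why the outcome depends on where an allocation happens to land.

```go
package main

import "fmt"

// hugePageSize is an ASSUMED huge page size (2 MiB, typical on linux/amd64).
const hugePageSize = 2 << 20

// alignedHugePageRange is a hypothetical helper. It returns the sub-range
// [start, end) of the allocation [base, base+size) that is made up of whole,
// aligned huge pages. ok is false if no complete aligned huge page fits.
// Address overflow is ignored for brevity.
func alignedHugePageRange(base, size uintptr) (start, end uintptr, ok bool) {
	// Round the start of the allocation up to a huge page boundary.
	start = (base + hugePageSize - 1) &^ (hugePageSize - 1)
	// Round the end of the allocation down to a huge page boundary.
	end = (base + size) &^ (hugePageSize - 1)
	if start >= end {
		return 0, 0, false
	}
	return start, end, true
}

func main() {
	// A 3 MiB allocation starting 1 MiB into a huge page contains one
	// complete aligned huge page.
	fmt.Println(alignedHugePageRange(1<<20, 3<<20))

	// A 2 MiB allocation at the same misaligned address contains none,
	// even though it is exactly one huge page in size.
	fmt.Println(alignedHugePageRange(1<<20, 2<<20))
}
```

Under these assumptions, whether a given allocation qualifies depends on its address as well as its size, which is consistent with the behavior depending on allocation order.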

My best understanding of the BiogoKrishna benchmark is that it is mostly a microbenchmark. It operates on a significant data source, but the analysis it performs on that data source involves a series of tight loops that run for a very long time. It is quite sensitive to the caching effects of the memory accesses in those loops. Hence, no longer backing that memory with a huge page led to a significant performance regression.

Crucially though, MADV_HUGEPAGE only made a meaningful difference if the memory wasn't already backed by a huge page. As of this writing, the default kernel configuration of the linux-amd64-perf builder sets /sys/kernel/mm/transparent_hugepage/enabled to madvise, which means huge pages are only enabled for memory regions marked MADV_HUGEPAGE. This regression did not always reproduce, and it did not reproduce on machines that set /sys/kernel/mm/transparent_hugepage/enabled to always, because that memory region was presumably already backed by a huge page.
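When trying to reproduce the regression, it may help to check which mode a machine is in first. The following is a minimal sketch, not part of the Go runtime or the benchmark tooling. It relies on the sysfs file listing all modes with the active one in square brackets (for example `always [madvise] never`). The helper name `parseTHPMode` is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// thpEnabledPath is the sysfs file discussed above.
const thpEnabledPath = "/sys/kernel/mm/transparent_hugepage/enabled"

// parseTHPMode is a hypothetical helper that extracts the active mode,
// which the kernel marks with square brackets, from the file contents.
// It returns false if no bracketed entry is found.
func parseTHPMode(contents string) (string, bool) {
	start := strings.IndexByte(contents, '[')
	if start < 0 {
		return "", false
	}
	rest := contents[start+1:]
	end := strings.IndexByte(rest, ']')
	if end < 0 {
		return "", false
	}
	return rest[:end], true
}

func main() {
	data, err := os.ReadFile(thpEnabledPath)
	if err != nil {
		// Non-Linux systems or kernels without THP support lack this file.
		fmt.Println("could not read THP setting:", err)
		return
	}
	mode, ok := parseTHPMode(string(data))
	if !ok {
		fmt.Println("unrecognized THP setting:", strings.TrimSpace(string(data)))
		return
	}
	fmt.Println("transparent huge pages mode:", mode)
}
```

Per the description above, a machine reporting `always` would not be expected to show the regression, while one reporting `madvise` might.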

Therefore, I conclude that this performance regression is an unfortunate consequence of the following factors:

  • BiogoKrishna behaves more like a microbenchmark and isn't as realistic as I originally thought. It held up for a long time, but this regression is comparable to a microbenchmark regressing because of arbitrary code alignment changes. (And we have no end of those on https://perf.golang.org/dashboard.)
  • We made a conscious decision in the Go 1.21.4 release that the Go runtime would no longer try to impose a huge page policy itself. It would still strive to be friendly to huge pages and pick allocation and release policies that aid the kernel in applying and maintaining huge pages, but it would no longer force the kernel to back any heap memory with huge pages.
gopherbot added the compiler/runtime label Nov 22, 2023
mknyszek closed this as not planned Nov 22, 2023
mknyszek added this to the Backlog milestone Nov 22, 2023