Because the Go runtime's background scavenger releases memory in 64 KiB chunks, the heuristic in mem_linux.go's implementation of sysUnusedOS never actually calls MADV_NOHUGEPAGE to break up huge pages that would otherwise keep a lot of memory around. That heuristic exists to avoid issuing the call on every sysUnusedOS, since each madvise with different advice can split a VMA, and Linux caps the number of VMAs per process (vm.max_map_count, which defaults to a fairly low value).
The result is that Linux may be transparently keeping huge pages around, using a lot more memory than intended and reducing the effectiveness of the scavenger.
I think the fix for this is to mark all new heap memory as MADV_NOHUGEPAGE initially, and then only call it again when we free memory that is likely to have had MADV_HUGEPAGE applied to it (e.g. a large object containing at least one aligned 2 MiB region will have had this happen via sysUsed). However, this might not be enough and needs more investigation.
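To make the "contains at least one aligned 2 MiB region" condition concrete, here is a minimal sketch of the alignment math involved. The helper names (`alignUp`, `alignDown`, `hugeSpan`) are mine for illustration, not the runtime's identifiers:

```go
package main

import "fmt"

const hugePageSize = 2 << 20 // 2 MiB, the usual x86-64 transparent huge page size

// alignUp and alignDown are illustrative helpers (not the runtime's exact
// names) for rounding an address to a huge page boundary.
func alignUp(x, a uintptr) uintptr   { return (x + a - 1) &^ (a - 1) }
func alignDown(x, a uintptr) uintptr { return x &^ (a - 1) }

// hugeSpan reports the aligned huge-page region contained in
// [addr, addr+size), i.e. the part of a freed object that may have had
// MADV_HUGEPAGE applied and so would need MADV_NOHUGEPAGE again under the
// proposed fix. ok is false if the range contains no full aligned 2 MiB
// region.
func hugeSpan(addr, size uintptr) (start, end uintptr, ok bool) {
	start = alignUp(addr, hugePageSize)
	end = alignDown(addr+size, hugePageSize)
	return start, end, start < end
}

func main() {
	// A 5 MiB object starting 1 MiB past a huge page boundary fully
	// contains two aligned 2 MiB regions: [2 MiB, 6 MiB).
	_, _, ok := hugeSpan(1<<20, 5<<20)
	fmt.Println(ok) // true

	// A 1 MiB object can never fully contain an aligned huge page.
	_, _, ok = hugeSpan(0, 1<<20)
	fmt.Println(ok) // false
}
```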
Thinking about this more, what my proposed solution loses is the benefit of backing dense parts of the heap with huge pages.
However, back when we worked on the scavenger, there was a key insight about having a GC'd heap: as we reach the end of a GC cycle, the whole heap is always densely packed. I wonder if we should say something like "back the heap with MADV_HUGEPAGE up to the heap goal." Due to fragmentation this leaves a non-huge-page tail, but that's OK.
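As a rough sketch of where that policy would stop, the cutoff is just the last huge page boundary at or below the heap goal; everything above it is the tail left alone. The function name and the flat `[heapBase, heapBase+heapGoal)` model are simplifying assumptions (the real heap is made of arenas and spans):

```go
package main

import "fmt"

const hugePageSize = 2 << 20 // 2 MiB

// hugePageCutoff computes where "back the heap with MADV_HUGEPAGE up to the
// heap goal" would stop: the last huge page boundary at or below
// heapBase+heapGoal. This is a simplified, illustrative model, not the
// runtime's actual layout.
func hugePageCutoff(heapBase, heapGoal uintptr) uintptr {
	return (heapBase + heapGoal) &^ (hugePageSize - 1)
}

func main() {
	base := uintptr(4 << 30)            // hypothetical heap base (huge-page aligned)
	goal := uintptr(100<<20 + 12345)    // hypothetical heap goal, not 2 MiB aligned
	cutoff := hugePageCutoff(base, goal)
	fmt.Printf("MADV_HUGEPAGE over %d MiB; non-huge-page tail of %d bytes\n",
		(cutoff-base)>>20, base+goal-cutoff)
}
```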
At first I thought the goroutine doing mark termination could spend a little extra time post-STW going over the (mostly contiguous) heap and calling MADV_HUGEPAGE and MADV_NOHUGEPAGE a bunch of times. But maybe only the MADV_HUGEPAGE part should happen there, with the scavenger taking care of MADV_NOHUGEPAGE as it walks down the heap from the highest addresses. That introduces a delay, but it gives the scavenger a chance to call MADV_DONTNEED on whole huge pages, which is considerably faster. This may not be worth it, however; it's not hard to try both approaches.
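A minimal simulation of that second approach, under my reading of it: the scavenger walks down in 64 KiB chunks, and once a whole 2 MiB region below the walk is free, it issues one MADV_DONTNEED for the full huge page (instead of 32 per-chunk calls) followed by MADV_NOHUGEPAGE so the kernel doesn't rebuild it. The `op`/`walkDown` names and the exact ordering of the two calls are my assumptions, not the runtime's code:

```go
package main

import "fmt"

const (
	chunkSize    = 64 << 10 // the scavenger's working granularity
	hugePageSize = 2 << 20
)

// op records a simulated madvise(2) call; using a callback instead of real
// madvise keeps the walk order visible without touching memory.
type op struct {
	advice string
	addr   uintptr
}

// walkDown sketches the scavenger releasing the huge page
// [base, base+2 MiB) from the top down. Once the whole huge page is free,
// it issues a single MADV_DONTNEED for the 2 MiB region, then
// MADV_NOHUGEPAGE to keep the kernel from collapsing it back. This is a
// simplified model of the proposal, not the runtime's actual scavenger.
func walkDown(base uintptr, advise func(op)) {
	for addr := base + hugePageSize - chunkSize; ; addr -= chunkSize {
		// ...per-chunk bookkeeping would mark [addr, addr+chunkSize)
		// free here...
		if addr == base {
			// The whole huge page is now free: one madvise instead of
			// 32 per-chunk MADV_DONTNEED calls.
			advise(op{"MADV_DONTNEED", addr})
			advise(op{"MADV_NOHUGEPAGE", addr})
			return
		}
	}
}

func main() {
	var ops []op
	walkDown(4<<30, func(o op) { ops = append(ops, o) })
	for _, o := range ops {
		fmt.Println(o.advice, o.addr)
	}
}
```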
This is in effect a very simple "huge page aware" allocation policy without needing to do any special bin-packing. We're taking advantage of the fact that we can count on heap density, thanks to the GC.