Currently, on Darwin, the scavenger is too eager and is causing performance regressions. The issue stems mainly from the fact that in Go 1.14 the scavenger is paced empirically according to the measured cost of scavenging, most of which comes from sysUnused, which makes an madvise syscall on most platforms, including Darwin.
However, the problem on Darwin is that we no longer just do MADV_FREE: we do MADV_FREE_REUSABLE in the sysUnused path and MADV_FREE_REUSE in the sysUsed path (instead of sysUsed being a no-op). It turns out the source of the regression is mostly that sysUsed is quite expensive relative to other systems, where it's just a no-op and we instead incur an implicit page fault once the memory is actually touched. The benefits of using this API outweigh the costs, since it updates the process's RSS counters in the kernel appropriately and MADV_FREE_REUSABLE is properly lazy.
So, since we don't account for the cost of sysUsed, we end up scavenging far more frequently than we should while trying to maintain the scavenger's goal of using only 1% of the CPU time of one CPU.
The actual size of the regression can be quite large, up to 5%, as seen in #36218, so we should fix this before Go 1.14 goes out.
The fix here is relatively simple: we just need to account for this extra cost somehow. We could measure it directly in the runtime, but that slows down allocation unnecessarily, and even then it's unclear how we would attribute that cost to the scavenger (maybe as a debt it needs to pay down?). Trying to account for this cost on non-Darwin platforms is also tricky, because there the cost comes not from sysUsed but from the page fault.
Instead, I think it's a good idea to do something along the lines of what we did last release: take some empirical measurements and use them to get an order-of-magnitude approximation. In this particular case, I think we should measure an empirical ratio "r" between the cost of using a scavenged page (sysUsed) and the cost of sysUnused, and turn it into a multiplicative constant "1+r" applied to the time spent scavenging. For example, if sysUsed is roughly as expensive as sysUnused, the factor would be 2.
Digging deeper, I think I know better why sysUsed is hurting performance so much, even though it takes only around the same amount of time as sysUnused: starting with this release, we might allocate across a scavenged/unscavenged memory boundary. If one allocates several pages at once and even only one page in that range is scavenged, we call sysUsed on the whole region. This has no effect on systems where sysUsed is a no-op, since those systems just fault in the pages on demand.
I'm not sure what to do here. The patch I uploaded helps a little, but completely solving the problem would require figuring out exactly which pages are scavenged and only calling sysUsed on those, and that's somewhat expensive to keep track of in the new allocator.
Because the allocator doesn't propagate up which pages are scavenged (though it does know this information precisely!), we currently just do the heavy-handed thing. But we could instead have the allocator, which actually clears the bits, do the sysUsed operation; it would then have precise information. Because syscall overhead is also a factor here, there would need to be a heuristic, though. For example, if more than 50% of the memory region is scavenged, just sysUsed the whole thing instead of doing it piece-by-piece.
Unfortunately, lowering the sysUsed operation down into the allocator like this is annoying for tests. In most cases it's safe to sysUsed something the OS already considers in-use, but I don't think we make that part of sysUsed's API, though I suppose it could be. We do already have a flag that does this so we can test scavenging, so maybe we should just use that?
OK, I'm going to walk back my second thought. It turns out that, yes, #36507 (comment) is a real problem. It's slightly worse now, but only by a little bit. This is basically going back to Go 1.11 behavior, where if any memory in a range was scavenged you would just scavenge the whole thing.