Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: scavenger is too eager on Darwin #36507

Closed
mknyszek opened this issue Jan 10, 2020 · 4 comments
Closed

runtime: scavenger is too eager on Darwin #36507

mknyszek opened this issue Jan 10, 2020 · 4 comments
Assignees
Milestone

Comments

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Jan 10, 2020

Currently on Darwin the scavenger is too eager and causing performance regressions. The issue mainly stems from the fact that in Go 1.14 the scavenger is paced empirically according to costs of scavenging, most of which comes from sysUnused which makes an madvise syscall on most platforms, including Darwin.

However, the problem on Darwin is that we don't just do MADV_FREE anymore, we do MADV_FREE_REUSABLE in the sysUnused path and MADV_FREE_REUSE in the sysUsed path (instead of sysUsed being a no-op). It turns out the source of the regression is mostly the fact that sysUsed is actually quite expensive relative to other systems where it's just a no-op and we instead incur an implicit page fault once the memory is actually touched. The benefits of using this API outweigh the costs, since it updates process RSS counters in the kernel appropriately and MADV_FREE_REUSABLE is properly lazy.

So since we don't account for sysUsed we end up scavenging a lot more frequently than we should to maintain the scavenger's goal of only using 1% of the CPU time of one CPU.

The actual size of the regression can be quite large, up to 5%, as seen in #36218, so we should fix this before 1.14 goes out.

The fix here is relatively simple: we just need to account for this extra cost somehow. We could measure it directly in the runtime but this then slows down allocation unnecessarily, and even then it's unclear how we should attribute that cost to the scavenger (maybe as a debt it needs to pay down?). Trying to account for this cost on non-Darwin platforms is also tricky because the costs aren't actually coming from sysUsed but from the page fault.

Instead, I think it's a good idea to do something along the lines of what we did last release: get some empirical measurements and use that to get an order-of-magnitude approximation. In this particular case, I think we should compute an empirical ratio "r" of using a scavenged page and sysUnused and turn this into a multiplicative constant, "1+r" for the time spent scavenging. So for example if sysUsed is roughly as expensive as sysUnused the factor would be 2.

@mknyszek mknyszek added this to the Go1.14 milestone Jan 10, 2020
@mknyszek mknyszek self-assigned this Jan 10, 2020
@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jan 13, 2020

Change https://golang.org/cl/214517 mentions this issue: runtime: multiply critical time by system-dependent constant

@mknyszek

This comment has been minimized.

Copy link
Contributor Author

@mknyszek mknyszek commented Jan 13, 2020

Digging deeper, I think I know better why sysUsed is hurting performance so much, even though it only takes around the same amount of time as sysUnused: starting with this release, we might allocate across a scavenged/unscavenged memory boundary. If one allocates several pages at once and if even only one page in that range is scavenged, we call sysUsed on that whole region. This has no effect on systems where sysUsed is a no-op, since those systems just fault in those pages on demand.

I'm not sure what to do here. The patch I uploaded helps the problem a little bit, but to completely solve the problem would require figuring out exactly which pages are scavenged and only calling sysUsed on those, but that's somewhat expensive to keep track of in the new allocator.

Because the allocator doesn't propagate up which pages are scavenged (though it does know this information precisely!) currently we just do the heavy-handed thing. But, we could instead have the allocator, which actually clears the bits, do the sysUsed operation. It would then have precise information. Because syscall overhead is also a factor here, there would need to be a heuristic, though. For example, if >50% of the memory region is scavenged, just sysUsed the whole thing, instead of piece-by-piece.

Unfortunately, lowering the sysUsed operation down into the allocator like this is annoying for tests. In most cases it's safe to sysUsed something already considered in-use by the OS, but I don't think we make that part of sysUsed's API, though I suppose could. We also do already have a flag which does this so we can test scavenging, so maybe we should just use that?

@mknyszek

This comment has been minimized.

Copy link
Contributor Author

@mknyszek mknyszek commented Jan 13, 2020

On second thought, I can't really find evidence of the number of scavenged pages in an allocation being less than the size of the allocation in the above benchmarks. Perhaps this isn't a problem...

@mknyszek

This comment has been minimized.

Copy link
Contributor Author

@mknyszek mknyszek commented Jan 13, 2020

OK going to walk back my second thought. Turns out, yes, #36507 (comment) is a real problem. It's slightly worse now, but only by a little bit. This is basically going back to Go 1.11 behavior where if any memory was scavenged you would just scavenge the whole thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.