proposal: runtime: smarter scavenging #30333
Proposal: Smarter Scavenging
Motivation & Purpose
Out-of-memory errors (OOMs) have been a pain point for Go applications. A class of these errors comes from the same underlying cause: a temporary spike in memory causes the Go runtime to grow the heap, but it takes a very long time (on the order of minutes) to return that unneeded memory back to the system. The system can end up killing the application in many situations, such as if the system has no swap space or if system monitors count this space against your application. In addition, if this additional space is counted against your application, you end up paying more for memory when you don’t really need it.
The Go runtime does have internal mechanisms to help deal with this, but they don’t react to changes in the application promptly enough. The way users solve this problem today is through a runtime library function called
I believe the Go runtime should do better here by default, so that for most cases the scavenger is prompt enough and
Dynamic memory allocators typically obtain memory from the operating system by requesting that it be mapped into their virtual address space. Sometimes this space ends up unused, and modern operating systems provide a way to tell the OS that certain virtual memory address regions won’t be used without unmapping them. This means the physical memory backing those regions may be taken back by the OS and used elsewhere. We in the Go runtime refer to this technique as “scavenging”.
Scavenging is especially useful in dealing with page-level external fragmentation, since we can give these fragments back to the OS, reducing the process’ resident set size (RSS), that is, the amount of memory in the application’s address space that is backed by physical memory.
As of Go 1.11, the only scavenging process in the Go runtime was a periodic scavenger that runs every 2.5 minutes. This scavenger combs over all the free spans in the heap and scavenges them if they have been unused for at least 5 minutes. When the runtime coalesced spans, it would track how much of the new span was scavenged.
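To make the timing of the periodic scavenger concrete, here is a minimal simulation sketch. Only the 2.5-minute interval and the 5-minute idle threshold come from the text above; the `freeSpan` type and the `scavengeIdle` and `tick` helpers are hypothetical names for illustration and are not the runtime's data structures.

```go
package main

import (
	"fmt"
	"time"
)

// The two constants come from the text; everything else is hypothetical.
const (
	scavengeInterval = 150 * time.Second // the scavenger wakes every 2.5 minutes
	minIdle          = 5 * time.Minute   // spans idle at least this long are released
)

// freeSpan is a simplified stand-in for a run of free heap pages.
type freeSpan struct {
	bytes     uint64
	idle      time.Duration // how long the span has been unused
	scavenged bool          // true once its physical memory was given back
}

// scavengeIdle marks every unscavenged span that has been idle for at least
// minIdle as scavenged and returns the number of bytes released in this pass.
func scavengeIdle(spans []freeSpan) uint64 {
	var released uint64
	for i := range spans {
		s := &spans[i]
		if !s.scavenged && s.idle >= minIdle {
			s.scavenged = true // a real allocator would advise the OS here
			released += s.bytes
		}
	}
	return released
}

// tick advances simulated time by one scavenger interval, then runs a pass.
func tick(spans []freeSpan) uint64 {
	for i := range spans {
		spans[i].idle += scavengeInterval
	}
	return scavengeIdle(spans)
}

func main() {
	spans := []freeSpan{
		{bytes: 8192},                         // freed just now
		{bytes: 16384, idle: 4 * time.Minute}, // freed four minutes ago
	}
	for pass := 1; pass <= 2; pass++ {
		fmt.Printf("pass %d released %d bytes\n", pass, tick(spans))
	}
}
```

In this sketch a span freed just before a pass is not released until the second pass, five minutes later, which is the slow reaction time the proposal is concerned with.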
While this simple technique is surprisingly effective for long-running applications, the peak RSS of an application can end up wildly exaggerated in many circumstances, even though the application’s peak in-use memory is significantly smaller. The periodic scavenger just does not react quickly enough to changes in the application’s memory usage.
As of Go 1.12, in addition to the periodic scavenger, the Go runtime also performs heap-growth scavenging. On each heap growth, up to N bytes of the largest spans are scavenged, where N is the number of bytes the heap grew by. The idea here is to “pay back” the cost of a heap growth. This technique helped to reduce the peak RSS of some applications (#14045).
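The "pay back" idea can be sketched as a largest-first walk over the free spans. This is an illustration under stated assumptions, not the runtime's code: the `freeSpan` type and `scavengeLargest` function here are hypothetical, whole spans are released, and the loop stops once the quota is met, so the last span released may overshoot N.

```go
package main

import (
	"fmt"
	"sort"
)

// freeSpan is a hypothetical, simplified free span.
type freeSpan struct {
	bytes     uint64
	scavenged bool
}

// scavengeLargest releases unscavenged spans from largest to smallest until
// at least n bytes have been released or no candidates remain. It returns
// the number of bytes released. Assumption: spans are released whole.
func scavengeLargest(spans []freeSpan, n uint64) uint64 {
	// Sort indices rather than the spans so the caller's order is preserved.
	order := make([]int, len(spans))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(a, b int) bool {
		return spans[order[a]].bytes > spans[order[b]].bytes
	})

	var released uint64
	for _, i := range order {
		if released >= n {
			break // quota for this heap growth has been paid back
		}
		if spans[i].scavenged {
			continue // nothing to gain from an already scavenged span
		}
		spans[i].scavenged = true
		released += spans[i].bytes
	}
	return released
}

func main() {
	spans := []freeSpan{
		{bytes: 4096},
		{bytes: 65536},
		{bytes: 16384, scavenged: true},
		{bytes: 32768},
	}
	// Pretend the heap just grew by 80000 bytes.
	fmt.Println(scavengeLargest(spans, 80000))
}
```

With these inputs the two largest unscavenged spans (65536 and 32768 bytes) are released and the 4096-byte span is left alone.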
The goal in scavenging smarter is two-fold:
The two goals go hand-in-hand. On the one hand, you want to keep the RSS of the application as close to its in-use memory usage as possible. On the other hand, doing so is expensive in terms of CPU time, having to make syscalls and handle page faults. If we’re too aggressive and scavenge every free space we have, then on every span allocation we effectively incur a hard page fault (or invoke a syscall), and we’re calling a syscall on every span free.
The ideal scenario, in my view, is that the RSS of the application “tracks” the peak in-use memory over time.
The goal of this proposal is to improve the Go runtime’s scavenging mechanisms such that it exhibits the behavior described. Compared with today’s implementation, this behavior should reduce the average overall RSS of most Go applications with minimal impact on performance.
Three questions represent the key policy decisions that describe a memory scavenging system.
I propose that for the Go runtime, we:
Additionally, I propose we change the span allocation policy to prefer unscavenged spans over scavenged spans, and to be first-fit rather than best-fit.
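As a rough illustration of that allocation policy, the sketch below picks the lowest-address unscavenged span that fits, and falls back to the lowest-address scavenged span only when no unscavenged span is large enough. The `span` type and `pickSpan` function are hypothetical, and the sketch assumes the free list is already sorted by base address; it is one possible reading of the policy, not the proposed implementation.

```go
package main

import "fmt"

// span is a hypothetical free span; the slice passed to pickSpan is assumed
// to be sorted by base address.
type span struct {
	base      uint64 // start address
	pages     int    // length in pages
	scavenged bool   // true if its physical memory was returned to the OS
}

// pickSpan returns the index of the span to allocate npages from, or -1 if
// nothing fits. It is first-fit in address order and prefers unscavenged
// spans, since reusing a scavenged span costs page faults.
func pickSpan(spans []span, npages int) int {
	fallback := -1
	for i, s := range spans {
		if s.pages < npages {
			continue // too small
		}
		if !s.scavenged {
			return i // lowest-address unscavenged span that fits
		}
		if fallback < 0 {
			fallback = i // remember the lowest-address scavenged fit
		}
	}
	return fallback
}

func main() {
	spans := []span{
		{base: 0x1000, pages: 2},
		{base: 0x4000, pages: 8, scavenged: true},
		{base: 0x9000, pages: 8},
		{base: 0xF000, pages: 4},
	}
	fmt.Println(pickSpan(spans, 4))
}
```

For a 4-page request this selects index 2 (the 8-page unscavenged span at 0x9000). A best-fit policy would instead have chosen the exact 4-page span at the higher address 0xF000.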
A brief rationale is that:
Detailed rationale and design available very soon.
What happens in an application that has a huge heap spike (say, an initial loading phase) and then the heap drops significantly? In particular, let's say this is drastic enough that the runtime doesn't even notice the drop until a 2 minute GC kicks in. At that scale, it could take a while for N GCs to pass, and we won't reclaim the heap spike until they do.
I wonder if it makes sense to pace this to periodic GC. E.g., pace scavenging to the max of the heap growth and the two-minute delay before the periodic GC.
If we're switching to first-fit and scavenging from higher addresses, do we also need to prefer unscavenged spans during allocation?
This is something that came to my mind recently too. An alternative is to set a schedule to decrease the scavenge goal linearly, or according to a smoothstep function, which goes to zero over N GCs. If this schedule ever gets below C * the heap goal, we use that instead. We'll get smoother cliffs in general and still make progress in the case you describe. Smoothstep is preferred here since we won't over-fit to transient drops in heap size, but this also means we might be slower to react in the case you described. I prefer not to over-fit here because that carries a performance cost.
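For concreteness, here is a small sketch of such a schedule. It uses the standard cubic smoothstep, 3x^2 - 2x^3, and assumes one reading of the comment above: the retained amount decays from the spike's peak toward zero over N GCs, but never drops below a floor standing in for C * the heap goal. The `retainGoal` function and its parameters are hypothetical.

```go
package main

import "fmt"

// smoothstep is the standard cubic 3x^2 - 2x^3, clamped to [0, 1].
func smoothstep(x float64) float64 {
	if x <= 0 {
		return 0
	}
	if x >= 1 {
		return 1
	}
	return x * x * (3 - 2*x)
}

// retainGoal returns how many bytes to keep resident gc cycles after a heap
// spike of size peak. The schedule reaches zero after n cycles (n must be
// positive). floor stands in for C * the heap goal and wins once the
// schedule falls below it (an assumed reading of the comment).
func retainGoal(peak, floor float64, gc, n int) float64 {
	scheduled := peak * (1 - smoothstep(float64(gc)/float64(n)))
	if scheduled < floor {
		return floor
	}
	return scheduled
}

func main() {
	const n = 4
	for gc := 0; gc <= n; gc++ {
		fmt.Printf("GC %d: retain %.2f\n", gc, retainGoal(1000, 100, gc, n))
	}
}
```

With a peak of 1000 and a floor of 100 over 4 GCs, the retained goal stays near the peak for the first cycle (843.75), falls fastest around the midpoint (500 at GC 2), and settles on the floor by the last cycle, which is the "slower to react, but not over-fit to transient drops" shape described above.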
That makes a lot of sense to me, and I think this is the right call. Thanks!
There's a trade-off here: by preferring unscavenged spans we're cementing that we want to avoid page faults at all cost. By not preferring them, we may end up getting overall better cache locality ("first-fit at all costs") but we could incur more page faults. Given that the policy already works to our advantage it's hard to say what works better in practice. I'm trying to avoid additional syscalls and page faults to perhaps an excessive degree at this point, but in practice I hypothesize the performance would be similar either way.
Maybe only one thing has me leaning toward preferring unscavenged spans: the implementation for that is definitely simpler. While it's not much code, I definitely got global best-fit wrong a couple times. :) But if there's another reason to do a global first-fit allocation policy that I'm not seeing (perhaps stability of the policy?), I don't feel too strongly about this.