runtime/debug: soft memory limit #48409
Proposal: Soft memory limit
Author: Michael Knyszek
I propose a new option for tuning the behavior of the Go garbage collector by setting a soft memory limit on the total amount of memory that Go uses.
This option comes in two flavors: a new runtime/debug function called SetMemoryLimit, and a GOMEMLIMIT environment variable.
This new option gives applications better control over their resource economy. It empowers users to:

- make better use of the memory they already have available to them, and
- avoid unrecoverable out-of-memory conditions ("OOMs") by giving the runtime a limit it will work to respect.
Full design document found here.
Note that, for the time being, this proposal intends to supersede #44309. Frankly, I haven't been able to find a significant use-case for it, as opposed to a soft memory limit overall. If you believe you have a real-world use-case for a memory target where a memory limit with GOGC=off would not be a suitable replacement, please don't hesitate to share it on that issue.
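As a rough sketch of how the option might look in practice, assuming the runtime/debug.SetMemoryLimit call and GOMEMLIMIT variable described in the design document (exact names and signatures are part of the open discussion):

```go
package main

import "runtime/debug"

func main() {
	// Ask the runtime to keep total memory use under roughly 1 GiB, minus
	// some headroom for non-Go memory in the container. A negative value
	// would just report the current limit without changing it.
	debug.SetMemoryLimit(1<<30 - 64<<20)

	// Optionally turn off GOGC-based heap growth entirely, so the soft
	// limit alone drives GC frequency. This combination is the suggested
	// replacement for the memory target in #44309.
	debug.SetGCPercent(-1)

	// ... run the application ...
}
```

The environment-variable form would presumably express the same thing without a code change, e.g. GOMEMLIMIT=960MiB GOGC=off.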
For golang/go#48409. Change-Id: I4e5d6d117982f51108dca83a8e59b118c2b6f4bf Reviewed-on: https://go-review.googlesource.com/c/proposal/+/350116 Reviewed-by: Michael Pratt <email@example.com>
@mpx I think that's an interesting idea. If
I think that's pretty close. Ideally we would have just one metric that could show, at-a-glance, "are you in the red, and if so, how far?"
In case you find this useful as a reference (and possibly to include in "prior art"), the go-watchdog library schedules GC according to a user-defined policy. It can infer limits from the environment (host or container), and it can target a maximum heap size defined by the user. I built this library to deal with #42805, and ever since we integrated it into https://github.com/filecoin-project/lotus, we haven't had a single OOM reported.
@rsc I believe the design is complete. I've received feedback on the design, iterated on it, and I've arrived at a point where there aren't any major remaining comments that need to be addressed. I think the big question at the center of this proposal is whether the API benefit is worth the cost. The implementation can change and improve over time; most of the details are internal.
Personally, I think the answer is yes. I've found that mechanisms that respect users' memory limits and that give the GC the flexibility to use more of the available memory are quite popular. Where Go users implement this themselves, they're left working with tools (like heap ballasts and manually triggered GCs) that are fragile and awkward to maintain.
In terms of implementation, I have some of the foundational bits up for review now that I wish to land in 1.18 (I think they're uncontroversial improvements, mostly related to the scavenger). My next step is to create a complete implementation and trial it on real workloads. I suspect that a complete implementation won't land in 1.18 at this point, which is fine. It'll give me time to work out any unexpected issues with the design in practice.
@mknyszek I'm somewhat confused by this. At a high level, what are you hoping to include in 1.18, and what do you expect to come later?
@kent-h The proposal has not been accepted, so the API will definitely not land in 1.18. All that I'm planning to land is work on the scavenger, to make it scale a bit better. This is useful in its own right, and it happens that the implementation of the memory limit, should this proposal be accepted, will build on that work.
This change adds a GC CPU utilization limiter to the GC. It disables assists to ensure GC CPU utilization remains under 50%. It uses a leaky bucket mechanism that will only fill if GC CPU utilization exceeds 50%. Once the bucket begins to overflow, GC assists are limited until the bucket empties, at the risk of GC overshoot. The limiter is primarily updated by assists. The scheduler may also update it, but only if the GC is on and a few milliseconds have passed since the last update. This second case exists to ensure that if the limiter is on, and no assists are happening, we're still updating the limiter regularly. The purpose of this limiter is to mitigate GC death spirals, opting to use more memory instead. This change turns the limiter on always. In practice, 50% overall GC CPU utilization is very difficult to hit unless you're trying; even the most allocation-heavy applications with complex heaps still need to do something with that memory. Note that small GOGC values (i.e. single-digit, or low teens) are more likely to trigger the limiter, which means the GOGC tradeoff may no longer be respected. Even so, it should still be relatively rare. This change also introduces the feature flag for code to support the memory limit feature. For #48409. Change-Id: Ia30f914e683e491a00900fd27868446c65e5d3c2 Reviewed-on: https://go-review.googlesource.com/c/go/+/353989 Reviewed-by: Michael Pratt <firstname.lastname@example.org>
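To make the leaky-bucket mechanism concrete, here is a toy model of the idea (not the runtime's actual limiter; the names, window accounting, and constants are all illustrative):

```go
// gcCPULimiter is a toy leaky bucket: it fills while GC CPU time exceeds
// the 50% budget and drains while it is below it. Once full, assists are
// disabled (the GC may overshoot its heap goal) until the bucket empties.
type gcCPULimiter struct {
	fill     int64 // nanoseconds of accumulated over-budget GC CPU time
	capacity int64
	enabled  bool
}

// accumulate folds in a measurement window split into GC CPU time and
// mutator CPU time. With a 50% target, any GC time beyond the mutator
// time counts against the bucket; any shortfall drains it.
func (l *gcCPULimiter) accumulate(gcTime, mutatorTime int64) {
	l.fill += gcTime - mutatorTime
	if l.fill < 0 {
		l.fill = 0
	}
	if l.fill >= l.capacity {
		l.fill = l.capacity
		l.enabled = true
	} else if l.fill == 0 {
		l.enabled = false
	}
}

// limiting reports whether assists should currently be held off.
func (l *gcCPULimiter) limiting() bool { return l.enabled }
```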
This change adds a parser for the GOMEMLIMIT environment variable's input. This environment variable accepts a number followed by an optional suffix expressing the unit. Acceptable units include B, KiB, MiB, GiB, TiB, where *iB is a power-of-two byte unit. For #48409. Change-Id: I6a3b4c02b175bfcf9c4debee6118cf5dda93bb6f Reviewed-on: https://go-review.googlesource.com/c/go/+/393400 TryBot-Result: Gopher Robot <email@example.com> Reviewed-by: Michael Pratt <firstname.lastname@example.org> Run-TryBot: Michael Knyszek <email@example.com>
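For illustration, a parser along these lines might look like the following (a sketch only, not the runtime's implementation; error handling is minimal):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseByteLimit parses inputs like "1073741824", "64MiB", or "4GiB" into a
// byte count, using the power-of-two suffixes listed above.
func parseByteLimit(s string) (int64, error) {
	suffixes := []struct {
		unit  string
		shift uint
	}{
		{"TiB", 40}, {"GiB", 30}, {"MiB", 20}, {"KiB", 10}, {"B", 0},
	}
	for _, suf := range suffixes {
		if strings.HasSuffix(s, suf.unit) {
			n, err := strconv.ParseInt(strings.TrimSuffix(s, suf.unit), 10, 64)
			if err != nil {
				return 0, err
			}
			return n << suf.shift, nil
		}
	}
	// No suffix: a plain byte count.
	return strconv.ParseInt(s, 10, 64)
}

func main() {
	n, err := parseByteLimit("4GiB")
	fmt.Println(n, err) // 4294967296 <nil>
}
```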
Nothing much to see here, just some plumbing to make later CLs smaller and clearer. For #48409. Change-Id: Ide23812d5553e0b6eea5616c277d1a760afb4ed0 Reviewed-on: https://go-review.googlesource.com/c/go/+/393401 Run-TryBot: Michael Knyszek <firstname.lastname@example.org> TryBot-Result: Gopher Robot <email@example.com> Reviewed-by: Michael Pratt <firstname.lastname@example.org>
This will be used by the memory limit computation to determine overheads. For #48409. Change-Id: Iaa4e26e1e6e46f88d10ba8ebb6b001be876dc5cd Reviewed-on: https://go-review.googlesource.com/c/go/+/394220 Reviewed-by: Michael Pratt <email@example.com> Run-TryBot: Michael Knyszek <firstname.lastname@example.org> TryBot-Result: Gopher Robot <email@example.com>
This change adds a field to memstats called mappedReady that tracks how much memory is in the Ready state at any given time. In essence, it's the total memory usage by the Go runtime (with one exception which is documented). Essentially, it is all memory mapped read/write that has either been paged in or soon will be. To make tracking this not involve the many different stats that track mapped memory, we track this statistic at a very low level. The downside of tracking this statistic at such a low level is that it managed to catch lots of situations where the runtime wasn't fully accounting for memory. This change rectifies these situations by always accounting for memory that's mapped in some way (i.e. always passing a sysMemStat to a mem.go function), with *three* exceptions. Rectifying these situations means also having the memory mapped during testing be accounted for, so that tests (i.e. ReadMemStats) that ultimately check mappedReady continue to work correctly without special exceptions. We choose to simply account for this memory in other_sys. Let's talk about the exceptions. The first is that the arenas array, used to find heap arena metadata for an address, is mapped read/write in one large chunk. It's tens of MiB in size. On systems with demand paging, we assume that the whole thing isn't paged in at once (after all, it maps to the whole address space, and it's exceedingly difficult with today's technology to even broach having as much physical memory as the total address space). On systems where we have to commit memory manually, we use a two-level structure. The reason this is an exception is that we have no mechanism to track what memory is paged in, and we can't just account for the entire thing, because that would *look* like an enormous overhead. Furthermore, this structure is on a few really, really critical paths in the runtime, so doing more explicit tracking isn't really an option. So, we explicitly don't account for it, and call sysAllocOS to map this memory. The second exception is that we call sysFree with no accounting to clean up address space reservations, or otherwise to throw out mappings we don't care about. In this case, we also drop down to a lower level and call sysFreeOS to explicitly avoid accounting. The third exception is debuglog allocations. That is purely a debugging facility and ideally we want it to have as small an impact on the runtime as possible. If we include it in mappedReady calculations, it could cause GC pacing shifts in future CLs, especially if one increases the debuglog buffer sizes as a one-off. As of this CL, these are the only three places in the runtime that would pass nil for a stat to any of the functions in mem.go. As a result, this CL makes sysMemStat parameters mandatory to facilitate better accounting in the future. It's now much easier to grep and find out where accounting is explicitly elided, because one doesn't have to follow the trail of sysMemStat nil pointer values, and can just look at the function name. For #48409. Change-Id: I274eb467fc2603881717482214fddc47c9eaf218 Reviewed-on: https://go-review.googlesource.com/c/go/+/393402 Reviewed-by: Michael Pratt <firstname.lastname@example.org> TryBot-Result: Gopher Robot <email@example.com> Run-TryBot: Michael Knyszek <firstname.lastname@example.org>
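A hedged sketch of the accounting pattern this describes, with illustrative names (the real mem.go functions and stats differ): every mapping helper requires a stat to credit, and the unaccounted paths are only reachable through explicitly named *OS variants.

```go
package mem

import "sync/atomic"

// sysMemStat is a simple atomic byte counter; there is one per memory
// source (heap, stacks, GC metadata, ...) plus a global mappedReady total.
type sysMemStat struct{ v atomic.Int64 }

func (s *sysMemStat) add(n int64) { s.v.Add(n) }

var mappedReady sysMemStat

// sysMapAccounted maps n bytes read/write and always credits both the
// per-source stat and the global total, so callers can't forget to account.
func sysMapAccounted(n uintptr, stat *sysMemStat) {
	stat.add(int64(n))
	mappedReady.add(int64(n))
	sysMapOS(n) // the low-level, unaccounted mapping path
}

// sysMapOS is a placeholder for the OS-level call that the few documented
// exceptions invoke directly, bypassing accounting on purpose.
func sysMapOS(n uintptr) { /* mmap / VirtualAlloc / ... */ }
```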
The inconsistent heaps stats in memstats are a bit messy. Primarily, heap_sys is non-orthogonal with heap_released and heap_inuse. In later CLs, we're going to want heap_sys-heap_released-heap_inuse, so clean this up by replacing heap_sys with an orthogonal metric: heapFree. heapFree represents page heap memory that is free but not released. I think this change also simplifies a lot of reasoning about these stats; it's much clearer what they mean, and to obtain HeapSys for memstats, we no longer need to do the strange subtraction from heap_sys when allocating specifically non-heap memory from the page heap. Because we're removing heap_sys, we need to replace it with a sysMemStat for mem.go functions. In this case, heap_released is the most appropriate because we increase it anyway (again, non-orthogonality). In which case, it makes sense for heap_inuse, heap_released, and heapFree to become more uniform, and to just represent them all as sysMemStats. While we're here and messing with the types of heap_inuse and heap_released, let's also fix their names (and last_heap_inuse's name) up to the more modern Go convention of camelCase. For #48409. Change-Id: I87fcbf143b3e36b065c7faf9aa888d86bd11710b Reviewed-on: https://go-review.googlesource.com/c/go/+/397677 Run-TryBot: Michael Knyszek <email@example.com> Reviewed-by: Michael Pratt <firstname.lastname@example.org> TryBot-Result: Gopher Robot <email@example.com>
Fundamentally, all of these memstats exist to serve the runtime in managing memory. For the sake of simpler testing, couple these stats more tightly with the GC. This CL was mostly done automatically. The fields had to be moved manually, but the references to the fields were updated via gofmt -w -r 'memstats.<field> -> gcController.<field>' *.go For #48409. Change-Id: Ic036e875c98138d9a11e1c35f8c61b784c376134 Reviewed-on: https://go-review.googlesource.com/c/go/+/397678 Reviewed-by: Michael Pratt <firstname.lastname@example.org> Run-TryBot: Michael Knyszek <email@example.com> TryBot-Result: Gopher Robot <firstname.lastname@example.org>
Currently the maximum trigger calculation is totally incorrect with respect to the comment above it and its intent. This change rectifies this mistake. For #48409. Change-Id: Ifef647040a8bdd304dd327695f5f315796a61a74 Reviewed-on: https://go-review.googlesource.com/c/go/+/398834 Run-TryBot: Michael Knyszek <email@example.com> Reviewed-by: Michael Pratt <firstname.lastname@example.org> TryBot-Result: Gopher Robot <email@example.com>
As it stands, the heap goal and the trigger are set once by gcController.commit, and then read out of gcController. However, with the coming memory limit we need the GC to be able to respond to changes in non-heap memory. The simplest way of achieving this is to compute the heap goal and its associated trigger dynamically. In order to make this easier to implement, the GC trigger is now based on the heap goal, as opposed to the status quo of computing both simultaneously. In many cases we just want the heap goal anyway, not both, but we definitely need the goal to compute the trigger, because the trigger's bounds are entirely based on the goal (the initial runway is not). A consequence of this is that we can't rely on the trigger to enforce a minimum heap size anymore, and we need to lift that up directly to the goal. Specifically, we need to lift up any part of the calculation that *could* put the trigger ahead of the goal. Luckily this is just the heap minimum and the minimum sweep distance. In the first case, the pacer may behave slightly differently, as the heap minimum is no longer the minimum trigger, but the actual minimum heap goal. In the second case it should be the same, as we ensure the additional runway for sweeping is added to both the goal *and* the trigger, as before, by computing that in gcControllerState.commit. There's also another place we update the heap goal: if a GC starts and we triggered beyond the goal, we always ensure there's some runway. That calculation uses the current trigger, which violates the rule that the goal must not depend on the precomputed trigger. Notice, however, that using the precomputed trigger for this isn't even quite correct: due to a bug, or something else, we might trigger a GC beyond the precomputed trigger. So this change also adds a "triggered" field to gcControllerState that tracks the point at which a GC actually triggered. This is independent of the precomputed trigger, so it's fine for the heap goal calculation to rely on it. It also turns out there's more than just that one place where we really should be using the actual trigger point, so this change fixes those up too. Also, because the heap minimum is set by the goal and not the trigger, the maximum trigger calculation now happens *after* the goal is set, so the maximum trigger actually does what I originally intended (and what the comment says): at small heaps, the pacer picks 95% of the runway as the maximum trigger. Currently, the pacer picks a small trigger based on a not-yet-rounded-up heap goal, so the trigger gets rounded up to the goal, and as per the "ensure there's some runway" check, the runway always ends up being 64 KiB. That check is supposed to be for exceptional circumstances, not the status quo. There's a test introduced in the last CL that needs to be updated to accommodate this slight change in behavior. So, this all sounds like a lot that changed, but what we're talking about here are really, really tight corner cases that arise from situations outside of our control, like pathologically bad behavior on the part of an OS or CPU. Even in these corner cases, it's very unlikely that users will notice any difference at all. What's more important, I think, is that the pacer behaves more closely to what all the comments describe, and what the original intent was. Another note: at first, one might think that computing the heap goal and trigger dynamically introduces some raciness, but not in this CL: the heap goal and trigger are completely static.
Allocation outside of a GC cycle may now be a bit slower than before, as the GC trigger check is now significantly more complex. However, note that this executes basically just as often as gcController.revise, and that makes up a vanishingly small part of any CPU profile. The next CL cleans up the floating point multiplications on this path nonetheless, just to be safe. For #48409. Change-Id: I280f5ad607a86756d33fb8449ad08555cbee93f9 Reviewed-on: https://go-review.googlesource.com/c/go/+/397014 Run-TryBot: Michael Knyszek <firstname.lastname@example.org> Reviewed-by: Michael Pratt <email@example.com> TryBot-Result: Gopher Robot <firstname.lastname@example.org>
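To illustrate the direction of the dependency (trigger computed from the goal, not alongside it), a rough sketch with made-up constants, not the pacer's actual code:

```go
// gcTriggerFromGoal sketches the relationship described above: the trigger
// is the goal minus the runway we expect to need for concurrent marking,
// clamped so it never falls before ~45% or after ~95% of the distance from
// the marked heap to the goal. Assumes heapGoal >= heapMarked.
func gcTriggerFromGoal(heapGoal, heapMarked, runway uint64) uint64 {
	minTrigger := heapMarked + (heapGoal-heapMarked)*45/100
	maxTrigger := heapMarked + (heapGoal-heapMarked)*95/100
	var trigger uint64
	if runway < heapGoal {
		trigger = heapGoal - runway
	}
	if trigger < minTrigger {
		trigger = minTrigger
	}
	if trigger > maxTrigger {
		trigger = maxTrigger
	}
	return trigger
}
```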
As of the last CL, the heap trigger is computed as-needed. This means that some of the niceties we assumed (that the float64 computations don't matter because we're doing this rarely anyway) are no longer true. While we're not exactly on a hot path right now, the trigger check still happens often enough that it's a little too hot for comfort. This change optimizes the computation by replacing the float64 multiplication with a shift and a constant integer multiplication. I ran an allocation microbenchmark for an allocation size that would hit this path often. CPU profiles seem to indicate this path was ~0.1% of cycles (dwarfed by other costs, e.g. zeroing memory) even if all we're doing is allocating, so the "optimization" here isn't particularly important. However, since the code here is executed significantly more frequently, and this change isn't particularly complicated, let's err on the side of efficiency if we can help it. Note that because of the way the constants are represented now, they're ever so slightly different from before, so this change technically isn't a total no-op. In practice, however, it should be. These constants are fuzzy and hand-picked anyway, so having them shift a little is unlikely to make a significant change to the behavior of the GC. For #48409. Change-Id: Iabb2385920f7d891b25040226f35a3f31b7bf844 Reviewed-on: https://go-review.googlesource.com/c/go/+/397015 Run-TryBot: Michael Knyszek <email@example.com> TryBot-Result: Gopher Robot <firstname.lastname@example.org> Reviewed-by: Michael Pratt <email@example.com>
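The strength reduction mentioned here is roughly of this shape (the ratio and constants below are chosen only to show the idea, and the new fixed-point constants differ very slightly from the old float ones, as noted):

```go
// Before: a float64 multiply on every trigger check.
func minTriggerFloat(runway uint64) uint64 {
	return uint64(float64(runway) * 0.7)
}

// After: divide by a power-of-two denominator and multiply by an integer
// numerator (45/64 ≈ 0.703), avoiding floating point entirely. Dividing
// first keeps the intermediate value from overflowing for large runways.
func minTriggerFixed(runway uint64) uint64 {
	const (
		den = 64
		num = 45
	)
	return runway / den * num
}
```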
This change makes the memory limit functional by including it in the heap goal calculation. Specifically, we derive a heap goal from the memory limit, and compare that to the GOGC-based goal. If the goal based on the memory limit is lower, we prefer that. To derive the memory limit goal, the heap goal calculation now takes a few additional parameters as input. As a result, the heap goal, in the presence of a memory limit, may change dynamically. The consequences of this are that different parts of the runtime can have different views of the heap goal; this is OK. What's important is that all of the runtime is able to observe the correct heap goal for the moment it's doing something that affects it, like anything that should trigger a GC cycle. On the topic of triggering a GC cycle, this change also allows any manually managed memory allocation from the page heap to trigger a GC: specifically workbufs, unrolled GC scan programs, and goroutine stacks. The reason for this is that now non-heap memory can affect the trigger or the heap goal. Most sources of non-heap memory only change slowly, like GC pointer bitmaps, or change in response to explicit function calls like GOMAXPROCS. Note also that unrolled GC scan programs and workbufs are really only relevant during a GC cycle anyway, so they won't actually ever trigger a GC. Our primary target here is goroutine stacks. Goroutine stacks can increase quickly, and this is currently totally independent of the GC cycle. Thus, if for example a goroutine begins to recurse suddenly and deeply, then even though the heap goal and trigger react, we might not notice until it's too late. As a result, we need to trigger a GC cycle. We do this trigger in allocManual instead of in stackalloc because it's far more general. We ultimately care about memory that's mapped read/write and not returned to the OS, which is much more the domain of the page heap than the stack allocator. Furthermore, there may be new sources of manually managed memory allocation in the future (e.g. arenas) that need to trigger a GC if necessary. As such, I'm inclined to leave the trigger in allocManual as an extra defensive measure. It's worth noting that because goroutine stacks do not behave quite as predictably as other non-heap memory, there is the potential for the heap goal to swing wildly. Fortunately, goroutine stacks that haven't been set up to shrink by the last GC cycle will not shrink until after the next one. This reduces the amount of possible churn in the heap goal because it means that shrinkage only happens once per goroutine, per GC cycle. After all the goroutines that should shrink have done so, goroutine stacks will only grow. The shrink mechanism is analogous to sweeping, which is incremental and thus tends toward a steady amount of heap memory used. As a result, in practice, I expect this to be a non-issue. Note that if the memory limit is not set, this change should be a no-op. For #48409. Change-Id: Ie06d10175e5e36f9fb6450e26ed8acd3d30c681c Reviewed-on: https://go-review.googlesource.com/c/go/+/394221 Run-TryBot: Michael Knyszek <firstname.lastname@example.org> TryBot-Result: Gopher Robot <email@example.com> Reviewed-by: Michael Pratt <firstname.lastname@example.org>
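The goal selection amounts to taking the smaller of two candidate goals, something like the following (illustrative names; the actual accounting of non-heap overheads is more involved than a single parameter):

```go
// heapGoal returns the smaller of the GOGC-derived goal and a goal derived
// from the memory limit, after subtracting memory the heap can't use:
// non-heap runtime memory (stacks, GC metadata, ...) plus a little headroom.
func heapGoal(gogcGoal, memoryLimit, nonHeapMemory, headroom uint64) uint64 {
	overhead := nonHeapMemory + headroom
	if overhead > memoryLimit {
		overhead = memoryLimit
	}
	limitGoal := memoryLimit - overhead
	if limitGoal < gogcGoal {
		return limitGoal
	}
	return gogcGoal
}
```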
This change does everything necessary to make the memory allocator and the scavenger respect the memory limit. In particular, it:

- Adds a second goal for the background scavenger that's based on the memory limit, setting a target 5% below the limit to make sure it's working hard when the application is close to it.
- Makes span allocation assist the scavenger if the next allocation is about to put total memory use above the memory limit.
- Measures any scavenge assist time and adds it to GC assist time for the sake of GC CPU limiting, to avoid a death spiral as a result of scavenging too much.

All of these changes have a relatively small impact, but they are intimately related and thus benefit from being done together. For #48409. Change-Id: I35517a752f74dd12a151dd620f102c77e095d3e8 Reviewed-on: https://go-review.googlesource.com/c/go/+/397017 Reviewed-by: Michael Pratt <email@example.com> Run-TryBot: Michael Knyszek <firstname.lastname@example.org> TryBot-Result: Gopher Robot <email@example.com>
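The scavenger side of this can be pictured as a second retained-memory target sitting just below the limit (a sketch; the 5% figure comes from the description above):

```go
// scavengeGoalFromLimit returns how much memory the runtime should aim to
// keep retained (mapped and not returned to the OS) under a memory limit:
// about 5% below the limit, so the background scavenger is already working
// hard by the time the application gets close to it.
func scavengeGoalFromLimit(memoryLimit uint64) uint64 {
	return memoryLimit - memoryLimit/20
}
```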
Currently the runtime's scavenging algorithm involves running from the top of the heap address space to the bottom (or as far as it gets) once per GC cycle. Once it treads some ground, it doesn't tread it again until the next GC cycle. This works just fine for the background scavenger, for heap-growth scavenging, and for debug.FreeOSMemory. However, it breaks down in the face of a memory limit for small heaps in the tens of MiB. Basically, because the scavenger never retreads old ground, it's completely oblivious to new memory it could scavenge, and that it really *should* scavenge in the face of a memory limit. Also, every time some thread goes to scavenge in the runtime, it reserves what could be a considerable amount of address space, hiding it from other scavengers. This change modifies and simplifies the implementation overall. It's less code, with complexities that are much better encapsulated. The current implementation iterates optimistically over the address space looking for memory to scavenge, keeping track of what it last saw. The new implementation does the same, but instead of directly iterating over pages, it iterates over chunks. It maintains an index of chunks (as a bitmap over the address space) that indicates which chunks may contain scavenge work. The page allocator populates this index, while scavengers consume it and iterate over it optimistically. This has two key benefits:

1. Scavenging is much simpler: find a candidate chunk, and check it, essentially just using the scavengeOne fast path. There's no need for the complexity of iterating beyond one chunk, because the index is lock-free and already maintains that information.
2. If pages are freed to the page allocator (always guaranteed to be unscavenged), the page allocator immediately notifies all scavengers of the new source of work, avoiding the hiding issues of the old implementation.

One downside of the new implementation, however, is that it's potentially more expensive to find pages to scavenge. In the past, if a single page became free high up in the address space, the runtime's scavengers would ignore it. Now that scavengers won't, one or more scavengers may need to iterate potentially across the whole heap to find the next source of work. For the background scavenger, this just means a potentially less reactive scavenger -- overall it should still use the same amount of CPU. It means worse overheads for memory limit scavenging, but that's not exactly something with a baseline yet. In practice, this shouldn't be too bad, hopefully, since the chunk index is extremely compact. For a 48-bit address space, the index is only 8 MiB in size at worst, but even just one physical page in the index is able to support up to 128 GiB heaps, provided they aren't terribly sparse. On 32-bit platforms, the index is only 128 bytes in size. For #48409. Change-Id: I72b7e74365046b18c64a6417224c5d85511194fb Reviewed-on: https://go-review.googlesource.com/c/go/+/399474 Reviewed-by: Michael Pratt <firstname.lastname@example.org> Run-TryBot: Michael Knyszek <email@example.com> TryBot-Result: Gopher Robot <firstname.lastname@example.org>
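The chunk index can be thought of as a shared bitmap between the page allocator and the scavengers, roughly like this toy version (granularity, atomicity details, and names are all illustrative, not the runtime's code):

```go
package scavenge

import "sync/atomic"

// scavIndex is a toy scavenge index: one bit per chunk of the address space.
// The page allocator sets a chunk's bit when pages freed into it may be
// scavengable; scavengers scan for set bits to find their next candidate.
type scavIndex struct {
	bits []atomic.Uint64
}

// mark records that chunk may contain scavenge work.
func (s *scavIndex) mark(chunk uint) {
	word := &s.bits[chunk/64]
	mask := uint64(1) << (chunk % 64)
	for {
		old := word.Load()
		if old&mask != 0 || word.CompareAndSwap(old, old|mask) {
			return
		}
	}
}

// next returns the first chunk at or after start that may contain work.
func (s *scavIndex) next(start uint) (uint, bool) {
	for c := start; c < uint(len(s.bits))*64; c++ {
		if s.bits[c/64].Load()&(uint64(1)<<(c%64)) != 0 {
			return c, true
		}
	}
	return 0, false
}
```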
This metric exports the last GC cycle index in which the GC limiter was enabled. This metric is useful for debugging and identifying the root cause of OOMs, especially when SetMemoryLimit is in use. For #48409. Change-Id: Ic6383b19e88058366a74f6ede1683b8ffb30a69c Reviewed-on: https://go-review.googlesource.com/c/go/+/403614 TryBot-Result: Gopher Robot <email@example.com> Reviewed-by: David Chase <firstname.lastname@example.org> Run-TryBot: Michael Knyszek <email@example.com>
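Once this lands, the metric can be read with runtime/metrics like any other; a minimal example (the metric name below, /gc/limiter/last-enabled:gc-cycle, is the one I believe this CL adds):

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{{Name: "/gc/limiter/last-enabled:gc-cycle"}}
	metrics.Read(samples)
	if samples[0].Value.Kind() == metrics.KindUint64 {
		// A value of 0 means the limiter has never been enabled in this process.
		fmt.Println("GC CPU limiter last enabled at cycle:", samples[0].Value.Uint64())
	}
}
```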
For #48409. Change-Id: I056afcdbc417ce633e48184e69336213750aae28 Reviewed-on: https://go-review.googlesource.com/c/go/+/406575 Reviewed-by: Michael Knyszek <firstname.lastname@example.org> Run-TryBot: Michael Knyszek <email@example.com> TryBot-Result: Gopher Robot <firstname.lastname@example.org> Reviewed-by: Ian Lance Taylor <email@example.com>
WIP implementation of a memory limit. This will likely be superseded by Go's incoming soft memory limit feature (coming August?), but it's interesting to explore nonetheless. Each time we receive a PUT request, check the used memory. To calculate used memory, we use runtime.ReadMemStats. I was concerned that it would have a large performance cost, because it stops the world on every invocation, but it turns out that it has previously been optimised. Return a 500 if this value has exceeded the current max memory. We use TotalAlloc to determine used memory, because this seemed to be closest to the container memory usage reported by Docker. This is broken regardless, because the value does not decrease as we delete keys (possibly because the store map does not shrink). If we can work out a constant overhead for the map data structure, we might be able to compute memory usage based on the size of keys and values. I think it will be difficult to do this reliably, though. Given that a new language feature will likely remove the need for this work, a simple interim solution might be to implement a max number of objects limit, which provides some value in situations where the user can predict the size of keys and values. TODO:

* Make the memory limit configurable by way of an environment variable
* Push the limit checking code down to the put handler

golang/go#48409 golang/go@4a7cf96 patrickmn/go-cache#5 https://github.com/vitessio/vitess/blob/main/go/cache/lru_cache.go golang/go#20135 https://redis.io/docs/getting-started/faq/#what-happens-if-redis-runs-out-of-memory https://redis.io/docs/manual/eviction/
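A minimal sketch of the interim approach described above (the handler name and the limit variable are illustrative; as noted, TotalAlloc only ever grows, so a stat like HeapAlloc or Sys is probably the better signal):

```go
package main

import (
	"net/http"
	"runtime"
)

// maxMemory would come from an environment variable per the TODO above.
var maxMemory uint64 = 256 << 20

func putHandler(w http.ResponseWriter, r *http.Request) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	if ms.HeapAlloc > maxMemory {
		// Mirror the behavior described above: reject the write with a 500.
		http.Error(w, "memory limit exceeded", http.StatusInternalServerError)
		return
	}
	// ... store the key/value pair ...
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/put", putHandler)
	http.ListenAndServe(":8080", nil)
}
```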
For #48409. Change-Id: Ia6616a377bc4c871b7ffba6f5a59792a09b64809 Reviewed-on: https://go-review.googlesource.com/c/go/+/410734 Run-TryBot: Michael Pratt <firstname.lastname@example.org> TryBot-Result: Gopher Robot <email@example.com> Reviewed-by: Chris Hines <firstname.lastname@example.org> Reviewed-by: Russ Cox <email@example.com>
This addresses comments from CL 410356. For #48409. For #51400. Change-Id: I03560e820a06c0745700ac997b02d13bc03adfc6 Reviewed-on: https://go-review.googlesource.com/c/go/+/410735 Run-TryBot: Michael Pratt <firstname.lastname@example.org> TryBot-Result: Gopher Robot <email@example.com> Reviewed-by: Chris Hines <firstname.lastname@example.org> Reviewed-by: Russ Cox <email@example.com>