runtime: idle GC interferes with auto-scaling #39983

Open
aclements opened this issue Jul 1, 2020 · 17 comments

@aclements
Member

@aclements aclements commented Jul 1, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14 linux/amd64

But this has been true for years.

What did you do?

Run a mostly-idle application in a container with CPU limits under a mechanism that monitors CPU use and increases the job's CPU reservation if it appears to be too little. Specifically, this was observed with a latency-sensitive policy that looked at high percentile usage sampled over a short time period.

What did you expect to see?

Since the application is mostly idle, a small CPU reservation should be adequate and the auto-scaler should not need to grow that reservation.

What did you see instead?

Because the garbage collector attempts to use any idle cores up to GOMAXPROCS, even an otherwise mostly idle application will see periodic spikes of CPU activity. These will happen at least every 2 minutes if not more frequently. In this case, the auto-scaler's policy was sensitive enough that these spikes caused it to grow the job's CPU reservation. However, then the garbage collector uses all of the new CPU reservation. This leads to a feedback cycle where the auto-scaler continually grows the reservation.
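The feedback cycle can be illustrated with a toy simulation. This is only a sketch: the auto-scaler policy (grow by 50% when peak usage reaches 90% of the reservation), the 0.1-core baseline, and all function names are hypothetical, and it assumes GOMAXPROCS tracks the reservation so that idle GC can fill whatever it is given.

```go
package main

import "fmt"

// peakUsage models the peak CPU use (in cores) of a mostly idle application.
// With idle GC, a GC cycle briefly uses every core in the reservation
// (assumption: GOMAXPROCS follows the reservation). Without it, the peak
// stays at a small, hypothetical baseline.
func peakUsage(reservation float64, idleGC bool) float64 {
	const baseline = 0.1 // hypothetical steady-state usage, in cores
	if idleGC {
		return reservation
	}
	return baseline
}

// nextReservation is a hypothetical latency-sensitive auto-scaler policy:
// grow the reservation by 50% whenever peak usage reaches 90% of it.
func nextReservation(reservation, peak float64) float64 {
	if peak >= 0.9*reservation {
		return reservation * 1.5
	}
	return reservation
}

// simulate runs the auto-scaler for the given number of rounds and returns
// the final reservation.
func simulate(start float64, rounds int, idleGC bool) float64 {
	r := start
	for i := 0; i < rounds; i++ {
		r = nextReservation(r, peakUsage(r, idleGC))
	}
	return r
}

func main() {
	fmt.Println("with idle GC:   ", simulate(1, 5, true))
	fmt.Println("without idle GC:", simulate(1, 5, false))
}
```

Under these assumptions the reservation grows on every round with idle GC and never grows without it, which is the cycle described above.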

See also #17969, #14812 (comment).

/cc @mknyszek

Thoughts

We've been thinking for a while now that idle GC may be more trouble than it's worth. The idea was to speed up the mark phase, so the write barrier is on for less time and we allocate black for less time. However, if an application is mostly idle, then it's not very sensitive to the (very small) performance impact of the write barrier and probably doesn't produce much floating garbage from allocate-black; and if an application isn't mostly idle, then it's not benefiting from idle GC anyway.

@aclements aclements added this to the Unplanned milestone Jul 1, 2020
@ianlancetaylor
Contributor

@ianlancetaylor ianlancetaylor commented Jul 2, 2020

Out of curiosity I looked up when we started doing a GC every two minutes. It was introduced in https://golang.org/cl/5451057, a CL whose purpose was to release unused memory to the operating system. The specific suggestion to ensure that a GC is run at least once every two minutes was https://codereview.appspot.com/5451057/#msg14.

Of course the garbage collector is very different now than it was in 2012. But it does seem valid to consider whether an idle program will eventually release unused memory to the operating system.

@dr2chase
Contributor

@dr2chase dr2chase commented Jul 2, 2020

Or we could monitor how much was released to the OS in a GC at N minutes that recovered R memory, and if it was not "enough", then delay the next GC until either 2N minutes pass or 2R bytes are allocated (or the GOGC threshold is hit). Reset this policy after periods of rapid allocation, and perhaps start at a smaller number of minutes than 2.

This might have a poor interaction with finalizers in programs that relied on them being even vaguely timely.

@dr2chase
Contributor

@dr2chase dr2chase commented Jul 2, 2020

Regarding the autoscaler, I am not entirely sure that this will seal the deal for them. Once upon a time I wrote an all-singing-all-dancing implementation of CRC32 for Java, that would use fork-join parallelism from the system pool to make CRC run faster (CRCs are somewhat embarrassingly parallel, the combine step requires a little work but not tons). I went so far as to demonstrate superlinear speedup. This did not get checked in, because the Java library team was apparently too nervous about people actually using fork-join parallelism in libraries, even though it was there for all the Collections stuff. My attitude was (and remains) "there is no such thing as unexpected parallelism in this millennium, and if a user doesn't want it to happen, they can size the system fork join pool to what they want, that's what it's there for". And of course, pay no attention to that multithreaded same-address-space optimizing compiler behind the curtain. But no dice, then.

My opinion is that if you don't want a process using more than N cores, that is what GOMAXPROCS is for: set it to the number that you want. An autotuning algorithm that gives you more cores after you use all the cores it told you that you could use is broken, especially now. We should not be surprised if people start using all the cores they can to handle embarrassingly parallel problems.
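For reference, a small sketch of capping the process from inside the program with `runtime.GOMAXPROCS`. The helper name `capProcs` and the value 2 are hypothetical; the same limit can be set without code changes through the `GOMAXPROCS` environment variable.

```go
package main

import (
	"fmt"
	"runtime"
)

// capProcs limits how many CPUs can execute Go code simultaneously and
// returns the setting now in effect. capProcs is a hypothetical helper.
func capProcs(n int) int {
	runtime.GOMAXPROCS(n)        // returns the previous setting, ignored here
	return runtime.GOMAXPROCS(0) // an argument below 1 only queries the value
}

func main() {
	const maxCores = 2 // hypothetical limit chosen by the operator
	fmt.Println("GOMAXPROCS is now", capProcs(maxCores))
}
```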

@taralx

@taralx taralx commented Jul 2, 2020

There's a difference between bulk computational tasks (where you can expect them to use what you give them) and demand-driven tasks (which will not use it, and if they're using it they might need more). Autoscalers are designed to respond to the latter type. But idle GC that uses easily 10x the CPU of the task's recent 1 minute usage completely throws off that assumption -- there's no way to tell the difference between "task is hammered with demand" and "task is doing a random GC".

@CAFxX
Contributor

@CAFxX CAFxX commented Jul 3, 2020

> But it does seem valid to consider whether an idle program will eventually release unused memory to the operating system.

💯

I agree it would be a mistake to not let idle programs eventually return unused memory. For one, it would be pretty confusing for users and potentially even lead to more "maybe-a-memory-leak" issues being filed ("after restarting my application, go heap decreased significantly").

> there's no way to tell the difference between "task is hammered with demand" and "task is doing a random GC"

Random suggestion: expose an expvar that tracks inflight real work (e.g. number of "real work" requests being currently served) and inhibit autoscaling if that metric is 0.
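A rough sketch of what that could look like with the standard `expvar` package. The variable name `inflight_requests`, the middleware, and the helper `observeInflight` are all hypothetical, and how an auto-scaler would consume the value (for example by scraping `/debug/vars`) is an assumption.

```go
package main

import (
	"expvar"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// inflight counts requests currently being served. Importing expvar also
// registers a /debug/vars handler on http.DefaultServeMux, where the value
// is published. The name "inflight_requests" is hypothetical.
var inflight = expvar.NewInt("inflight_requests")

// trackInflight wraps a handler so the gauge reflects "real work" in progress.
func trackInflight(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		inflight.Add(1)
		defer inflight.Add(-1)
		next.ServeHTTP(w, r)
	})
}

// observeInflight serves one in-memory request and reports the gauge value
// seen while the request was being handled and after it finished.
func observeInflight() (during, after int64) {
	h := trackInflight(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		during = inflight.Value()
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/work", nil))
	return during, inflight.Value()
}

func main() {
	during, after := observeInflight()
	fmt.Println("in flight during request:", during)
	fmt.Println("in flight after request:", after)
}
```

An auto-scaler that reads this gauge could ignore CPU spikes that occur while it is 0, since those cannot be caused by request load.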

Or tweak the autoscaler to more quickly lower the reservation once the process goes idle to avoid the "unbounded" growth.

@taralx

@taralx taralx commented Jul 3, 2020

Alternatively, since idle GC isn't triggered by any real memory pressure, run it single-threaded?

@aclements
Member Author

@aclements aclements commented Jul 3, 2020

Oops.

I want to clarify that I didn't mean the two minute GC. That also has some downsides, but I think the benefits outweigh the costs.

I was talking about how the garbage collector schedules GC work during idle time on every GC cycle. This is what interferes with autoscaling.

@CAFxX
Contributor

@CAFxX CAFxX commented Jul 3, 2020

I see, IIUC what you mean then #14812 (comment) (GC causing ~100ms latency spikes when running in containers with CPU limits) can also be considered as an additional argument in favor of fixing this.

@RLH
Contributor

@RLH RLH commented Jul 4, 2020

@taralx

@taralx taralx commented Jul 5, 2020

Java's GC is heavily tunable. One could, for example, restrict the number of threads it runs on.

@aclements
Member Author

@aclements aclements commented Jul 5, 2020

> I see, IIUC what you mean then #14812 (comment) (GC causing ~100ms latency spikes when running in containers with CPU limits) can also be considered as an additional argument in favor of fixing this.

@CAFxX , yep! (Thanks for that cross-reference, I've added it to the top post.)

@dr2chase
Contributor

@dr2chase dr2chase commented Jul 6, 2020

In the short run, how would autoscalers respond if GC limited its "looks like nobody's using those cores" resource consumption to min(idle/2, idle-1)? So for example, a single-threaded app, on a 2-core box, would leave the 2nd core alone and do 25% time-slicing against the mutator in one core.

3 and 4 cores -> GC takes 100% of one idle core (leaving 1 or 2 still idle)
5, 6 -> 2 (leaving 2 or 3 still idle)

(alternate formula: 1 + idle/3, which would use the second core for the 1-busy-1-idle case).
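The two formulas can be tabulated with a few lines of Go. This is just a sketch: the function names are hypothetical, integer division is assumed (which is what reproduces the numbers listed above), one core is assumed busy with the mutator, and the 25% time-slicing on the busy core is not modeled.

```go
package main

import "fmt"

// idleWorkersMin implements min(idle/2, idle-1) with integer division.
// idle is the number of cores that currently look idle.
func idleWorkersMin(idle int) int {
	if idle <= 0 {
		return 0
	}
	half, allButOne := idle/2, idle-1
	if half < allButOne {
		return half
	}
	return allButOne
}

// idleWorkersAlt implements the alternate formula 1 + idle/3, which uses
// the second core in the 1-busy-1-idle case.
func idleWorkersAlt(idle int) int {
	if idle <= 0 {
		return 0
	}
	return 1 + idle/3
}

func main() {
	const busy = 1 // assumption: a single-threaded mutator
	fmt.Println("cores  idle  min(idle/2, idle-1)  1+idle/3")
	for cores := 2; cores <= 6; cores++ {
		idle := cores - busy
		fmt.Printf("%5d  %4d  %19d  %8d\n",
			cores, idle, idleWorkersMin(idle), idleWorkersAlt(idle))
	}
}
```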

This doesn't mean the cores are actually idle, if there are other processes or if the OS decides it has important work to do, perhaps induced by GC activity.

Another possibility is capping the idle parallelism; have we measured how much marginal returns diminish as the number of cores devoted to GC increases? For the CRC example above, it turned out that 4 cores was very much a sweet spot; speedup fell into the almost-linear to superlinear range.

(I still think autoscalers will need to cope more gracefully with the existence of embarrassingly parallel subproblems. Clock rates are stuck, so people will look for these. It would be nice if the OS/autoscaler had some way to communicate the "cost" of an extra core. Otherwise, sure, the marginal returns per core are falling, but it's "my" core and it's "idle", so why not use it?)

@aclements
Member Author

@aclements aclements commented Jul 6, 2020

Limiting idle GC is certainly a possibility. In the past we've talked about letting one P enter scheduler idle even if GC is running so that there's a P to quickly wake up and also to block in netpoll (though if you do wake it up, do you have to shut down idle GC running on another P to keep a spare P?).

It's actually been a very long time since we've measured the scalability of the garbage collector. It certainly has limits. I suspect they're high enough that they're significantly higher than a mostly-idle latency-sensitive application is likely to use on its own, but at least it would cap out the feedback cycle.

@RLH
Contributor

@RLH RLH commented Jul 6, 2020

@CAFxX
Contributor

@CAFxX CAFxX commented Jul 6, 2020

This may be a naive question but here goes anyway: wouldn't it be ideal to give the idle GC its own pacer so that it can be throttled up/down to finish roughly when the next GC cycle is expected to start, so as to avoid, as much as possible, any spiky behavior that we know can cause issues with autoscalers, CPU quotas, and latency in general?

@dr2chase
Contributor

@dr2chase dr2chase commented Jul 6, 2020

A naive question that I have also asked, so I know some of the answer. Running the garbage collector has additional costs above and beyond the work of garbage collection:

  • memory is "allocated black" during garbage collection, meaning already marked, which adds to the size of the live set at the end of GC.
  • any writes to pointers must also process a write barrier, which slows down the application in general.
  • the GC is thought to be generally disruptive of otherwise useful processing; it tends to trash caches and use a lot of memory bandwidth.

From the point of view of energy efficiency, it is believed (with some historical evidence) that the "best" way to collect garbage is as quickly as possible, subject to constraints of letting the mutator get its job done. When I've raised the question of "does the GC's intermittent 25% cpu tax require a 33% container overprovision to ensure that latency/throughput goals are met?", the answer has been that in any large (multiple containers handling requests) service, this is not that different from variations in service load over time, variable request sizes, etc. Stuff happens, loads get balanced. And, this is probably fair, anyone running a single-node service probably has to overprovision anyway because reasons, and how much could a single node cost, anyway?

@RLH
Contributor

@RLH RLH commented Jul 6, 2020
