Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: 1.13 performance regression on kubernetes scaling #32828

Closed
mm4tt opened this issue Jun 28, 2019 · 57 comments
Closed

runtime: 1.13 performance regression on kubernetes scaling #32828

mm4tt opened this issue Jun 28, 2019 · 57 comments

Comments

@mm4tt
Copy link

@mm4tt mm4tt commented Jun 28, 2019

Together with golang team, we, kubernetes sig-scalability, are testing not-yet-released golang1.13 against kubernetes scale tests.

What version of Go are you using (go version)?

$ go version
devel +c290cb6338 Sun Jun 23 22:20:39 2019 +0000

Does this issue reproduce with the latest release?

Yes, it reproduces with golang head.

What did you do?

We've run k8s scale tests with kubernetes compiled with go1.13.
There is a visible increase in 99th percentile of api call latency when compared to the same version of kubernetes compiled against go1.12.5.

What did you expect to see?

We expected similar api-call latency as in the baseline run compiled with go1.12.5
YVNzXYG5swf

What did you see instead?

The api-call latency was visibly worse in the run compiled with golang1.13
kNOEaY1LND8

@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Jun 28, 2019

As asked by @mknyszek, we re-run the tests from golang head with patched https://go-review.googlesource.com/c/go/+/183857
Unfortunately, it didn't help much, the api-call latency was still visibly higher than in the baseline.

DjPh981FmfF

@ALTree ALTree added this to the Go1.13 milestone Jun 28, 2019
@ALTree ALTree changed the title Golang 1.13 performance regression runtime: golang 1.13 performance regression Jun 28, 2019
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jun 28, 2019

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Jul 1, 2019

Interesting! The fact that the regression is still there indicates that it has little to do with that code snippet. This is a little surprising to me but it's definitely progress.

Also, by removing that code snippet, it seems we lost virtually nothing in terms of physical memory utilization for the load test, based on graphs which were shared out-of-band. This is particularly interesting because I ran a bunch of benchmarks against that patch, of which 2 worsened significantly in terms of average and peak RSS when applying this patch (others retained their RSS usage improvements). This indicates to me that leaving it in is probably makes the most sense.

My next best guess as to the cause of the regression (assuming it's the result of my changes, related to #30333) is the background scavenger existing in the first place, since it is a source of additional lock contention. If this is actually the cause, I have a new patch which might help. https://go-review.googlesource.com/c/go/+/184497

This patch makes it so that the background scavenger should rarely block all allocations from making progress. This patch is basically just a performance improvement, but it could be enough to help the 99th percentile API latency. If this patch doesn't help at all, then I'm out of guesses. I think we should consider bisecting at that point to try to find the source of the regression, though given that these scalability tests take a very long time to run, I imagine it'll take quite a few days to get to the bottom of that.

There's a more aggressive version of this patch (that's slightly trickier) where we only block the one allocation which conflicts with the background scavenger. It's a very minor change to this patch but what makes it tricky is all the consequences. For now, let's try this patch.

Out of curiosity, do you run the load test with GODEBUG=madvdontneed=1 set? In my experimentation, using MADV_FREE over MADV_DONTNEED does help here, but I understand that may not work if you need to track memory usage via any of the Linux-reported statistics.

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Jul 1, 2019

This patch makes it so that the background scavenger should rarely block all allocations from making progress.

Sorry, re-reading this I realized it's slightly inaccurate. It doesn't just block allocations, it blocks frees, and anything else that needs to acquire the heap lock (which is a lot of things).

So, it's a good bit worse today than what I originally said, but even the common case is fairly uncommon when we're talking about these code-paths.

@ALTree ALTree changed the title runtime: golang 1.13 performance regression runtime: 1.13 performance regression on kubernetes scaling Jul 1, 2019
@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Jul 2, 2019

Thanks @mknyszek!

Just to confirm, do you want us to run the tests from golang head with just the new patch applied and forget about the previous patch, right?

Our scalability test resources are under high demand recently, but I will see what I can do to run the tests today or tomorrow.
Will post an update here when I have the results.

Regarding the GODEBUG=madvdontneed=1, AFAIR when debugging the go1.12 performance regression we agreed that it shouldn't matter as we're using 4.14+ kernel version. Let me know if the situation has changed here in any way. Given that it would make impossible to track memory usage via Linux-reported statistics, I think we can try it as an experiment if it can provide you with any meaningful insights, but it wouldn't work for us as a long term solution.

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Jul 2, 2019

Thanks @mknyszek!

No problem!

Just to confirm, do you want us to run the tests from golang head with just the new patch applied and forget about the previous patch, right?

That's correct.

Our scalability test resources are under high demand recently, but I will see what I can do to run the tests today or tomorrow.
Will post an update here when I have the results.

I realize this all takes a while to set up and run, so I appreciate you taking the time.

Regarding the GODEBUG=madvdontneed=1, AFAIR when debugging the go1.12 performance regression we agreed that it shouldn't matter as we're using 4.14+ kernel version. Let me know if the situation has changed here in any way. Given that it would make impossible to track memory usage via Linux-reported statistics, I think we can try it as an experiment if it can provide you with any meaningful insights, but it wouldn't work for us as a long term solution.

Ahhh right, I remember now. I forgot you're already using MADV_FREE. In that case you can ignore this, like last time.

@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Jul 2, 2019

The bottleneck here is not my time but the scalability test resources, i.e. the gcp projects where we run the tests. The test uses 5K+ CPUs and takes over 10+h, there is only a few projects where I can run it and all of them are in high demand, other teams use them to run their scalability tests. We'll need to figure out something for long term plan, especially if we'd like to test new golang releases automatically / continuously.

Nevertheless, I managed to find a free project and run the tests. I'm running the tests at the golang commit fbde753a58e286c405a04388816ed044486151bb (current head) with reverted https://go-review.googlesource.com/c/go/+/184098 (which apparently broke k8s build again, I asked for help in kubernetes/kubernetes#78845 (comment)) and with https://go-review.googlesource.com/c/go/+/184497 patched.
Will post the results tomorrow morning.

@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Jul 3, 2019

The test passed, but the api-call latency is still visibly worse than in the baseline:
yyxNY8qUJxr

If we don't have any other ideas, I can try re-running the baseline to verify that we don't have any external infrastructure issues impacting the results.

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Jul 3, 2019

@mm4tt when you say it "passed" do you mean there's some threshold that average 99th percentile latency needs to stay under? If it did better this time around, then that's good news, and I think I have a next step.

I have one more patch which is basically taking the idea from the previous patch further. It's a slightly riskier change to land for Go 1.13, but it might be enough to bring it back to expected levels.

Can you please do another run with the following two patches applied to HEAD in the following order?

  1. https://go-review.googlesource.com/c/go/+/184497
  2. https://go-review.googlesource.com/c/go/+/184442
@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Jul 5, 2019

@mknyszek - yes, that's exactly the case. We have a threshold that the 99th percentile of various groups of requests needs to stay under. It was the first run compiled with golang1.13 that passed, so there is an improvement, but the SLI is still much worse than in the baseline.

I'm rerunning the tests with your patches. I still needed to revert https://go-review.googlesource.com/c/go/+/184098, as I'm not able to compile k8s with it in, but I don't think it should have any performance implications.
Will update the thread when I have the results.

@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Jul 8, 2019

I've run the tests twice over the weekend and both times it made our api-server explode during the load test.

From a cursory look at the graphs I found that there is a big increase in the memory usage comparing to the previous run (head + https://go-review.googlesource.com/c/go/+/184497)

Jul 2nd run:
aqP1Lz0q3wP

Jul 5th: run
hpObYdxS01R

@mknyszek, any ideas how to interpret that?

In the meantime, I'm re-running the tests compiled with golang1.12 to get a better baseline.

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Jul 8, 2019

That's unfortunate, but thanks for confirming. It's possible the second change is broken in some way (not correctness, but in breaking some invariant), so I'll do some more testing.

When you say "exploded" do you mean crashed in some way? OoM? What exactly went wrong?

Can you please clarify what you mean by "memory usage"? Is it RSS, or virtual memory footprint? Or is it computed from MemStats and if so, which MemStats?

@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Jul 8, 2019

By exploded I meant that it became overloaded (due to other components crashlooping) and eventually also started crashlooping, most likely because the load on it was so high that it wasn't able to reply on liveness probe in time and was restarted by kubelet. Hard to say what was the exact cause here, but there is a know vicious circle problem here. If the api-server becomes too overloaded, it slows down, other components may start restarting as they're not able to renew the "master lease" on time. Restarting components in turn introduce higher load on the apiserver as they issue more requests during initialization than during normal operation. Then api-server becomes even more overloaded and so the vicious circle continuous...

"memory usage" is based on container_memory_usage_bytes metrics which I think is a sum of all kinds of memory, e.g. container_memory_rss., container_memory_cache, container_memory_swap

I modified the graphs to see which one went up and it's the container_memory_rss:

Jul 2nd:
73QDML8pbV4

Jul 5th:
OxB8PX6KXpF

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Jul 8, 2019

Thanks for looking into the source of the memory increase.

I'm having a hard time figuring out what could be going wrong that's related to my patch, especially since memstats seem to cut off a good amount (between 10:50 and 10:55) before memory actually spikes:

Screen Shot 2019-07-08 at 2 41 40 PM

Based on this, it seems like the load spike comes in before RSS actually increases, and so the memory blow-up seems like it's just a symptom. The first stat, btw (heap_sys - heap_released) is what the runtime thinks the heap's contribution to RSS is, and I think it's fairly accurate. Naturally though it gets cut off when metrics start to disappear.

My patch definitely seems to be involved though; I tried to run a bunch of benchmarks with the latest patch and managed to OOM a machine (at least, as far as I can tell, I can't SSH back into it even though it's on, which is usually a symptom of accidentally OOMing it).

On the plus side, it seems like the latency is much better compared to before, at least while the server is alive? I think this is going in the right direction, I just need to figure out what went wrong here.

@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Jul 9, 2019

Thanks Michael,

Let me know if your benchmarks found anything.
We almost have a ready PR for fixing the disappearing metric problem under high load, I can try re-running the test with it patched if you think that it might be helpful in debugging.
I've re-run the baseline, will add it to Grafana shortly.

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Jul 9, 2019

Well, I'm not sure now if I actually OOM'd the machine I mentioned, but I'm pretty sure I found a deadlock bug in that last patch that (I think if left sitting for long enough) has the potential to OOM. Is it possible that the api-server just deadlocked in the last run and got overloaded when it woke up, so much so that it started crash looping?

In any case, I rewrote the patch to be more "obviously correct," meaning the synchronization should be much less funky, and this specific deadlock should be gone.

To be safe and not waste valuable resources, I'm going to run my benchmarks overnight with those patches and make sure they don't OOM/deadlock/etc. I'll let you know tomorrow how it turns out, and if all goes well, could you please try applying the latest patch and trying it out again?

Just to be completely clear, I mean the following two patches, at their latest versions:

  1. https://go-review.googlesource.com/c/go/+/184497/2
  2. https://go-review.googlesource.com/c/go/+/184442/3
@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Jul 10, 2019

Sounds good, will run the tests once we get a green light from you. Thanks Michael!

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Jul 10, 2019

The benchmarks didn't explode, which is a good sign. I think the patches are good to go.

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Jul 10, 2019

Well, I took a longer look at everything and it seems like the fact that the scavenger tries to re-acquire the heap lock can actually make things worse in some cases.

I have one more patch which makes things a little better, but it still seems like it's not much better than the Go 1.13 baseline (without the patches) for the benchmarks I've been running. It seems like it's helping y'all at least, given that the test passed? Do you think it's possible it was a fluke? Do you know how much it failed by?

Finally, if you haven't started running the load test yet, would it be possible apply the following patches instead?

  1. https://go-review.googlesource.com/c/go/+/184497/2
  2. https://go-review.googlesource.com/c/go/+/184442/3
  3. https://go-review.googlesource.com/c/go/+/185617/1

This includes that newest patch I mentioned.

@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Jul 11, 2019

Unfortunately, I didn't get your second message on time and the tests were run without the third patch.

This time the API server didn't explode. Unfortunately the results are still visible worse than in the baseline

Baseline:
OFTckmm27G7

Yesterday's run:
3icudVa7Tfq

To compare the same graph for Go 1.13 (no patches), imho it looks worse than the yesterday's run
3ttBm4gPc08

I'm not sure if I'll find resources to run the tests with the third patch this week, so most likely expect the results in the beginning of the next week.

Nevertheless, I think we're still far from getting to the state where api-call latency is reasonable (i.e. similar to golang1.12 baseline).
Given how long and expensive (and hard to find resources for) the tests we're currently running are, it's not the most efficient way to iterate on this issue. I'll see if we can find some faster and less expensive tests instead (that would also catch the difference between go1.12 and go1.13). Ideally, I could give you access to some smaller dev project where you could execute the tests yourself. This should allow us to iterate faster and once we had some reasonable version we could re-run the standard tests to verify.  

I'll see what can be done and get back to you beginning of the next week.

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Jul 11, 2019

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Jul 11, 2019

I think I've narrowed down the range of CLs which contain the source of the regression, and they're all mine.

Turns out it has nothing to do with the background scavenger, but rather with changes I made to the treap implementation in the runtime, which feels familiar. Reducing the number of treap operations required for allocation/free/scavenging probably isn't possible without a huge refactoring, so I think the main thing to do here is to try to make the operations themselves faster. I didn't expect what I added to have such a large effect but given that there's a lot of heap lock contention I suppose I shouldn't be surprised.

In the past I attempted to add a cache to speed up allocations, but it didn't help much. But, perhaps there's something yet-unexplored here. I'll think about it some more.

@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Jul 16, 2019

A short update from my side, busy week, but haven't forgot about this.
I'm working on reducing time and cost of the tests I used to verify differences between golang1.12 and 1.13, so it will be possible to run them on my dev project.
Hopefully will have an update within the next few days.

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Aug 1, 2019

@mm4tt / @oxddr, I'm going to work on getting spun up with the dev instance based on the instructions you sent via email.

Generally, though, I would love to have a more concrete metric we can track versus just seeing that it's "visibly worse" than the baseline. My particular concern is that all of these latency plots have both extreme overplotting and different palettes, which makes them very hard to compare visually. It's obvious that the worst p99 latencies are worse, since they stand out above the rest, but it seems like only a few request types may actually be affected. How many requests types are actually affected? Are they always the same request types? Is there a metric we can use to more easily compare these, like time-averaged p99 latency?

@oxddr

This comment has been minimized.

Copy link

@oxddr oxddr commented Aug 2, 2019

@aclements I am aware that eyeballing a graph is not an ideal approach for validation, however at the moment I don't think we have a better metric to track. From our perspective p99 latency of requests is real, user-focused metric. I'll give some thought and try to come back to you after the weekend.
Something to try out is to plot only N top streams or actually aggregate them together, instead of grouping them by resource, verb. Most of types are handled the same way, with few exceptions.

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Aug 11, 2019

I've finished running the experiments proposed by @mknyszek. The TL;DR is: at current HEAD, the answer appears to be to apply CL 183857.

Here's a full summary of the experiments:

  • 1ad2298 (baseline): multiple treaps, no background scavenger, inline scavenging. Bad:
    image
  • 1ad2298 + CL 183857: multiple treaps, no background scavenger, inline scavenging. GOOD:
    image

Jumping forward to treap changes:

  • 9baa430 + CL 183857: unified treaps, no background scavenger, no inline scavenging. Pretty good:
    image
  • 31c4e09 + CL 183857: unified treaps + optimizations, no background scavenger, no inline scavenging. Pretty good: image

Jumping forward to background scavenger:

  • fe67ea3~1 + CL 183857: same, but just before background scavenger. GOOD:
    image
  • fe67ea3 + CL 183857: unified treaps + optimizations, background scavenger, no inline scavenging. Slightly bad (may be worth a second run):
    image

Jumping forward to HEAD, which includes CL 186926, which fixes a bug introduced in 31c4e09 that lots of unnecessary madvise calls:

  • 0ca4f6b: unified treaps + optimizations + mdavise fix, background scavenger, inline scavenging. Bad:
    image
  • 0ca4f6b + CL 183857: unified treaps + optimizations + mdavise fix, background scavenger, no inline scavenging: GOOD:
    image

I think we have our answer.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 11, 2019

Thanks. Should we put 183857 into 1.13? It looks safe enough.

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Aug 11, 2019

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Aug 12, 2019

Change https://golang.org/cl/189957 mentions this issue: runtime: grow the heap incrementally

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Aug 12, 2019

Change https://golang.org/cl/183857 mentions this issue: runtime: scavenge on growth instead of inline with allocation

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Aug 13, 2019

@mknyszek pointed out that the final result above (0ca4f6b + CL 183857: unified treaps + optimizations + mdavise fix, background scavenger, no inline scavenging) didn't do heap growth scavenging, so in some sense it was cheating. However, I just got a test run with CLs 189957 PS 3 and 183957 PS 8, with the fix to add back heap-growth scavenging. It doesn't look quite as good, but it still looks pretty good, and it's not cheating:

image

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Aug 22, 2019

With @oxddr's help, I was able to run the full 14 hour load test with CL 183857 PS 8. The results look pretty good to me:

Screenshot from 2019-08-22 09-29-49

@oxddr says this is the same plot as the "non-WATCH request latency (99th percentile)" from the original posts. Compared to the 1.12 baseline, there are a lot of requests that actually appear much lower latency overall. There are a few brief spikes, but the baseline had that, too, and it's nothing compared to the fog of spikes from 1.13 beta.

So, based on my assessment, CL 183857 (and its parent) appears to fix this issue.

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Aug 22, 2019

Actually, the plot I pasted isn't comparable because it omits LIST calls, so the results are currently inconclusive.

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Aug 27, 2019

@oxddr built a tool to do a more principled comparison of the latency distributions and found that the proposed fixes "don't look good". Unfortunately, that means I don't really know how to make progress here. @mknyszek and I may have to go back to the drawing board.

We'll keep working on this issue, and once we find a fix, our plan is to issue a point release if at all possible, but in the mean time, we're going to go ahead with releasing 1.13.

@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Sep 13, 2019

Hey Folks,

I think I have great news.

I've re-run the full tests on golang1.12 (baseline) and golang1.13 (at 2754118) with the two patches that we tried already (which helped, but didn't fix the performance issue totally), i.e., https://go-review.googlesource.com/c/go/+/189957/3, https://go-review.googlesource.com/c/go/+/183857/8 + the new @mknyszek's patch, https://go-review.googlesource.com/c/go/+/193040/2, and there is no visible difference in api-call-latency. In some aspects the golang1.13 run is even slightly better than the baseline.

I'll post a more detailed comparison here later, but the summary is that if we could land all those 3 patches, then we, K8s SIG-Scalability, would be completely satisfied with the golang1.13 performance, meaning that we can basically close this issue.

@aclements, @mknyszek, any thoughts on that? How hard would it be to release these patches? Any estimate on the potential timeline?

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Sep 13, 2019

@mm4tt Austin's patches already have +2 and my patch is very short, so it should be easy to land to master.

I'd personally like to release it in the first 1.13 point release, 1.13.1. Unfortunately I don't know what the time frame is for it since we're a little bit out of the usual release cycle. I'll ask around and get back to you with that.

@aclements also mentioned wanting having some soak time for them to catch any unintended side-effects prior to release. I imagine since the full release just came out we'll have a decent amount of time to let it soak on master before cherry-picking it into the next point release.

@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Sep 16, 2019

The comparison I promised in the previous post.

Api-call latency (non-LIST)

baseline
api-call-baseline
golang1.13
api-call-113

The results are very similar, golang1.13 has 3 spikes exceeding 1.5s that we don't see in the baseline, but this is most likely noise. I'm quite sure that if we re-run the tests these spikes could disappear or even appear in the baseline. Single spikes are acceptable, and that's why our performance SLIs are not based on instant api-call-latency but rather on aggregation over whole test, which we'll compare below.

Aggregated api-call latency

baseline
api-call-sli-baseline
golang1.13
api-call-sli-113

During the test, the graph for golang1.13 looks worse due to the 3 spikes above, but looking at the end of the test the golang1.13 run actually looks better. There are only 6 metrics exceeding 500ms (in baseline there are 8) and none exceeding 1s (in baseline there is 1).

Aggregated LIST latency

baseline
list-latency-baseline
golang1.13
list-latency-113

This time it's clearly in favor of the golang1.13. The aggregated list latency is clearly better for the golang1.13 run.

To sum up, some aspects are worse for golang1.13 and some are better, but most differences are most likely just a noise.
Overall the golang1.13 run looks quite good comparing to the baseline. There is nothing that stands out and golang1.13 performance looks reasonable from the k8s scalability point of view. Putting it in other words, with the three patches we don't observe any obvious performance degradation between golang1.12 and golang1.13.

I believe we can close this issue once all three patches merge.

Let me know if there is anything more we can assist you with and thank you so much for your tireless work on this issue!

@mm4tt

This comment has been minimized.

Copy link
Author

@mm4tt mm4tt commented Sep 20, 2019

@mknyszek, @aclements, circling back on this, any updates on merging these patches into HEAD?

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Sep 20, 2019

Change https://golang.org/cl/193040 mentions this issue: runtime: redefine scavenge goal in terms of heap_inuse

gopherbot pushed a commit that referenced this issue Sep 25, 2019
This change makes it so that the scavenge goal is defined primarily in
terms of heap_inuse at the end of the last GC rather than next_gc. The
reason behind this change is that next_gc doesn't take into account
fragmentation, and we can fall into situation where the scavenger thinks
it should have work to do but there's no free and unscavenged memory
available.

In order to ensure the scavenge goal still tracks next_gc, we multiply
heap_inuse by the ratio between the current heap goal and the last heap
goal, which describes whether the heap is growing or shrinking, and by
how much.

Finally, this change updates the documentation for scavenging and
elaborates on why the scavenge goal is defined the way it is.

Fixes #34048.
Updates #32828.

Change-Id: I8deaf87620b5dc12a40ab8a90bf27932868610da
Reviewed-on: https://go-review.googlesource.com/c/go/+/193040
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
@gopherbot gopherbot closed this in f18109d Sep 25, 2019
gopherbot pushed a commit that referenced this issue Sep 25, 2019
Inline scavenging causes significant performance regressions in tail
latency for k8s and has relatively little benefit for RSS footprint.

We disabled inline scavenging in Go 1.12.5 (CL 174102) as well, but
we thought other changes in Go 1.13 had mitigated the issues with
inline scavenging. Apparently we were wrong.

This CL switches back to only doing foreground scavenging on heap
growth, rather than doing it when allocation tries to allocate from
scavenged space.

Fixes #32828.

Change-Id: I1f5df44046091f0b4f89fec73c2cde98bf9448cb
Reviewed-on: https://go-review.googlesource.com/c/go/+/183857
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Sep 26, 2019

This issue is closed now because we fixed this on master, but for Go 1.13 we have two backport issues which track the two problems and their fixes for Go 1.13.

They are #34556 and #34149.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Oct 2, 2019

Change https://golang.org/cl/198485 mentions this issue: [release-branch.go1.13] runtime: grow the heap incrementally

gopherbot pushed a commit that referenced this issue Oct 4, 2019
Currently, we map and grow the heap a whole arena (64MB) at a time.
Unfortunately, in order to fix #32828, we need to switch from
scavenging inline with allocation back to scavenging on heap growth,
but heap-growth scavenging happens in large jumps because we grow the
heap in large jumps.

In order to prepare for better heap-growth scavenging, this CL
separates mapping more space for the heap from actually "growing" it
(tracking the new space with spans). Instead, growing the heap keeps
track of the "current arena" it's growing into. It track that with new
spans as needed, and only maps more arena space when the current arena
is inadequate. The effect to the user is the same, but this will let
us scavenge on much smaller increments of heap growth.

There are two slightly subtleties to this change:

1. If an allocation requires mapping a new arena and that new arena
   isn't contiguous with the current arena, we don't want to lose the
   unused space in the current arena, so we have to immediately track
   that with a span.

2. The mapped space must be accounted as released and idle, even
   though it isn't actually tracked in a span.

For #32828, since this makes heap-growth scavenging far more
effective, especially at small heap sizes. For example, this change is
necessary for TestPhysicalMemoryUtilization to pass once we remove
inline scavenging.

Updates #34556

Change-Id: I300e74a0534062467e4ce91cdc3508e5ef9aa73a
Reviewed-on: https://go-review.googlesource.com/c/go/+/189957
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit f18109d)
Reviewed-on: https://go-review.googlesource.com/c/go/+/198485
Run-TryBot: Andrew Bonventre <andybons@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.