GOGC=40 default causes performance issues #2665

Closed
fabxc opened this Issue Apr 28, 2017 · 9 comments

fabxc commented Apr 28, 2017

So the GOGC=40 default made it into v2.0.0-alpha.0 by accident. I removed it again but forgot about it. What followed was a 6h journey of figuring out why my seemingly irrelevant changes in dev-2.0 made it perform a lot better than v2.0.0-alpha.0 in prombench.
Eventually I remembered the GOGC setting.
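For context, the "GOGC=40 default" means the binary lowers the GC target percentage at startup unless the user has set GOGC in the environment; with GOGC=40, the runtime triggers the next GC once the heap has grown 40% beyond the live heap of the previous cycle instead of the usual 100%. A minimal sketch of how such a default can be applied (illustrative only; the helper name is made up and this is not necessarily how the alpha wired it in):

```go
package main

import (
	"os"
	"runtime/debug"
)

// applyDefaultGOGC (hypothetical helper) lowers the GC target percentage,
// but only if the user has not set GOGC explicitly, so an operator can
// still override the shipped default via the environment.
func applyDefaultGOGC(percent int) {
	if os.Getenv("GOGC") != "" {
		return // respect the explicit user setting
	}
	debug.SetGCPercent(percent)
}

func main() {
	applyDefaultGOGC(40)
	// ... start the server as usual ...
}
```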

I thought it would be valuable to verify whether this is a 2.0-specific thing or not. I just used my typical prombench setup, which also tests the read path.
We run two v1.6.1 servers with a 10GB target heap size, one with GOGC=40 and one with GOGC=100.

[screenshots: 2017-04-28 21:31:38, 21:31:47, 21:32:07]

Even after turning queries off, the memory savings stayed about the same, but so did the increased CPU load, even though the ratio shrank a little.

The graphs lead me to believe that the drawbacks of increased CPU load and query latency outweigh the slight memory savings by a large margin. We might want to consider setting it back to the default and instead document that users can adjust it themselves if they are willing to accept the tradeoffs.

The results reported in #2528 suggest something very different. I'm not quite sure what causes the large difference. The test deployment is by no means crazy at about 80k samples/sec and the ratios did not change after pod scaling happened.

fabxc commented Apr 28, 2017

More granular query histograms with less aggressive quantile:

[screenshot: 2017-04-28 21:43:47]

beorn7 commented Apr 29, 2017

I did those tests with 1.6 before releasing. Obviously, the CPU load increased as well. I guess, if you are CPU bound, this will negatively affect queries, which will certainly be the case for some users. However, in 1.x, everything that is managed by the page cache in 2.x is on the heap, so the impact of better memory utilization is huge. (I'm planning to post something about it.)

We should definitely document clearly the option to tweak GOGC.

fabxc commented Apr 29, 2017

In case it was ambiguous before, both of the servers graphed here are at 1.6.1.
beorn7 commented May 1, 2017

> In case it was ambiguous before, both of the servers graphed here are at 1.6.1.

Gotcha! That puts things into perspective. I'll leave my thoughts here in a few moments. (Sorry for the delay, caught in conference travel and jet lag...)

beorn7 commented May 1, 2017

Here are my thoughts:

The graphs posted above are for test servers that haven't reached steady state yet, i.e. the heap size is still far away from the configured target heap size and also fairly small, reaching only about 1.5GiB by the end of the test run. Also, the heap size graph uses HeapInuse bytes, while the graphs in #2528 use HeapAlloc and RSS. This has a number of implications:

  • The rationale behind GOGC=40 is that about 60% of heap usage on a mid-size production server is essentially static (seen from the time horizon of a GC cycle, which is seconds to minutes); the arithmetic is sketched right after this list. With the relatively short test run, that state hasn't been reached: there are relatively few memory chunks, but all the rest of the Prometheus machinery is running under full steam. With most of the heap allocations being short-term, GC will run quite frequently (it would be interesting to see how often, see the measurement sketch at the end of this comment, and compare it with the numbers reported in #2528, where a GC cycle happened every 90s before and every 35s after). Possibly, with a "normal" scrape interval of 15s+, GC will run more often than scrapes, which might skew the results (by not catching the peaks). Looking at RSS might be interesting because it is less noisy. Looking at HeapAlloc might be interesting because there might be a baseline of heap fragmentation that renders the relative memory savings smaller for smaller heap sizes. In any case, with the relatively small heap and GC already running quite often with GOGC=100, the further increase in GC frequency you get from GOGC=40 on an otherwise not terribly busy Prometheus server explains the relatively sharp increase in CPU usage compared to #2528.
  • The tragedy behind the previous item is the following: while the heap is small, we run GC more frequently because the usual allocation churn will reach the GC target heap size more quickly. At the same time, with a small heap, we need GC the least. There is certainly optimization potential in running GC infrequently for as long as we are far away from the configured target heap size and only collecting more aggressively once we get close to the boundary. For now, I don't want to over-engineer this because I'm pretty sure those levers are exactly the ones the improved GC management will give us with Go 1.9. My plan is to revisit this once 1.9 materializes and we can play with the new features.
  • With the new target heap size setting, there will be no memory savings in the steady-state case, but instead more memory chunks. The number of memory chunks is the defining metric for various typically hit bottlenecks: number of series, queries that hit the disk, ingestion rate, how many chunks can be batched up for writes to reduce write amplification. As reported in #2528, the increase in memory chunks is about 50%. That boils down to 50% more time series, which is the most frequently hit bottleneck in practice. The test setup here doesn't hit those bottlenecks: it has all chunks in memory, so no query ever hits the disk, and the increased number of max memory chunks hasn't kicked in yet. So you get all the penalties here (and those even amplified, see above) but none of the rewards. In a real-life case, the slowest queries will be those that hit the disk. Since those are less likely to occur with 50% more memory chunks, you will probably see a huge improvement in long-tail query latency with the GOGC=40 setting.
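To make the pacing arithmetic behind the first bullet concrete, here is a minimal Go sketch with illustrative numbers (not Prometheus code, not measurements): the runtime triggers the next GC once the heap has grown GOGC% beyond the live heap left by the previous cycle, so with 60% of the live heap static, GOGC=40 grants roughly the same allocation headroom as GOGC=100 would on the short-lived part alone, while on a small, mostly churning heap it simply means more frequent GC cycles.

```go
package main

import "fmt"

// gcGoal returns the heap size at which the Go runtime triggers the next
// GC for a given live heap: live * (1 + GOGC/100).
func gcGoal(live float64, gogc int) float64 {
	return live * (1 + float64(gogc)/100)
}

func report(label string, live float64) {
	const GiB = float64(1 << 30)
	for _, gogc := range []int{100, 40} {
		goal := gcGoal(live, gogc)
		fmt.Printf("%s GOGC=%d: next GC at %.1f GiB, headroom %.1f GiB\n",
			label, gogc, goal/GiB, (goal-live)/GiB)
	}
}

func main() {
	const GiB = float64(1 << 30)

	// Steady state as assumed above: a 10 GiB live heap of which ~60%
	// (6 GiB) is long-lived chunk data and ~4 GiB churns. Illustrative
	// numbers, not measurements.
	report("steady state:", 10*GiB)
	// GOGC=40 leaves 4 GiB of headroom, i.e. exactly the size of the
	// short-lived part, which is the churn budget GOGC=100 would grant
	// if the 6 GiB static part were not on the heap.

	// Early in the benchmark the live heap is only ~1.5 GiB, so GOGC=40
	// mostly just shrinks the headroom and makes GC fire more often.
	report("early test:", 1.5*GiB)
}
```

In steady state GOGC=40 still leaves 4 GiB of headroom; early in the test it leaves only 0.6 GiB, which is why GC runs so much more often there.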

In summary: should you ever be CPU-bound instead of memory-bound, you should increase GOGC. But that's even true for GOGC=100. As said, the rationale really boils down to re-creating, on top of a largely static heap, the GC behavior you would normally get with predominantly short-lived heap allocations. But that rationale only holds if about 60% of the heap is long-lived, i.e. only in steady state.
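To actually answer the "how often does GC run" question from the list above: in a Prometheus-monitoring-Prometheus setup, the rate of the go_gc_duration_seconds_count metric already gives that number; as a standalone alternative (just a sketch, not Prometheus code), one can poll runtime.MemStats.NumGC periodically:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Poll the runtime every 30s and report how many GC cycles happened in
// the interval, i.e. roughly how often GC runs.
func main() {
	var lastNumGC uint32
	lastTime := time.Now()

	for range time.Tick(30 * time.Second) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)

		cycles := ms.NumGC - lastNumGC
		elapsed := time.Since(lastTime).Seconds()
		if cycles > 0 {
			fmt.Printf("%d GC cycles in %.0fs (about one every %.0fs)\n",
				cycles, elapsed, elapsed/float64(cycles))
		} else {
			fmt.Printf("no GC cycles in %.0fs\n", elapsed)
		}

		lastNumGC = ms.NumGC
		lastTime = time.Now()
	}
}
```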

beorn7 commented May 1, 2017

Another point: The first graph above plots avg_over_time, but the relevant aggregation would be max_over_time, as the peak of the heap size is most important for the RSS. With avg_over_time, you see only half of the memory gain.

beorn7 commented May 5, 2017

CPU usage data averaged over all ~70 Prometheus servers at SC (a diverse mix of bare metal and EC2 and of big and small):

  • v1.5 GOGC=100: 1.8–2.7 cores
  • v1.5 GOGC=40: 2.3–3.0 cores
  • v1.6 GOGC=40: 2.7–3.2 cores

In general, CPU usage seems to be correlated with the number of chunks in memory, and with 1.6 we now have way more of those (thanks to the clamped heap size, we don't need to leave a lot of headroom anymore, so we now have more than twice as many memory chunks fleet-wide). I would attribute the higher CPU usage of 1.6 to the increased number of memory chunks.

Tail latency for queries is dominated by the occasional ad-hoc queries (those that hit cold chunks). Since those queries are very different each time, the tail latency is very noisy, and it is thus quite difficult to draw conclusions. Having said that, I see far fewer tail latency peaks since we have a lot more memory chunks, and the baseline 90th percentile latency also seems much reduced. But as said, this has to be taken with a grain of salt. The screenshot below shows the median and 90th percentile latency of range queries on the one Prometheus server that drives our most important dashboard (visible on a lot of wall screens). Here, the latency is probably dominated by the regular dashboard queries. We switched to GOGC=40 (while staying on v1.5) on 2017-03-21 (approx. in the middle of the graph) and upgraded to 1.6 on 2017-04-18. There is no dramatic effect on the latency. If anything, it gets a bit smoother and lower after the GOGC=40 switch.
[screenshot: 2017-05-05 14:40:23]

With the documentation we have now added, I suggest closing this issue.

beorn7 commented Nov 6, 2017

With the 2.0.0 release imminent, I don't think it's worth investing more research into this topic.
(2.0.0 does not manipulate GOGC anymore.)

beorn7 closed this Nov 6, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited the conversation to collaborators Mar 23, 2019
