runtime: GC causes latency spikes #14812

Open
Dieterbe opened this Issue Mar 14, 2016 · 103 comments

Comments

@Dieterbe
Contributor

Dieterbe commented Mar 14, 2016

Hello,
while running the program at https://github.com/raintank/raintank-metric/tree/master/metric_tank
I'm seeing mark-and-scan times of ~15s CPU time and 2000~2500 ms clock time (8-core system) for a heap of about 6.5GB
(STW pauses are fine and ~1ms)
I used https://circleci.com/gh/raintank/raintank-metric/507 to obtain the data below.

$ metric_tank --version
metrics_tank (built with go1.6, git hash 8897ef4f8f8f1a2585ee88ecadee501bfc1a4139)
$ go version
go version go1.6 linux/amd64
$ uname -a #on the host where the app runs
Linux metric-tank-3-qa 3.19.0-43-generic #49~14.04.1-Ubuntu SMP Thu Dec 31 15:44:49 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

I know the app is currently not optimized for GC workload: while I've gotten allocations down in various parts of the program, there are currently probably a million or more live pointers referencing pieces of data. I was going to work on optimizing this when Dave Cheney suggested there's a problem with the runtime and that I should file a bug (https://groups.google.com/forum/#!topic/golang-nuts/Q0rXKYjy1cg)
Here's the log with gctrace and schedtrace enabled: https://gist.githubusercontent.com/Dieterbe/18453451c5af0cdececa/raw/9c4f2abd85bb7a815c6cda5c1828334d3d29817d/log.txt
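For reference, both traces are enabled via GODEBUG, i.e. something like the following (the 1000 ms schedtrace interval here is just an example value):
GODEBUG=gctrace=1,schedtrace=1000 ./metric_tank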

At http://dieter.plaetinck.be/files/go/mt3-qa-gc-vs-no-gc.zip you'll find a zip containing this log, the binary, a CPU profile taken during GC run 1482, and a CPU and heap profile in between runs 1482 and 1483.

I also have these two dashboards that seem useful (they both end just after the spike induced by GC run 1482):
https://snapshot.raintank.io/dashboard/snapshot/MtLqvc4F6015zbs4iMQSPzfizvG7OQjC
shows memory usage, GC runs and STW pause times. It also shows that the incoming load (requests) of the app is constant, which suggests to me that any extra load is caused by GC, not by a changing workload.
https://snapshot.raintank.io/dashboard/snapshot/c2zwTZCF7BmfyzEuGF6cHN9GX9aM1V99
shows the system stats; note the CPU spikes corresponding to the GC workload.

Let me know if there's anything else I can provide.
Thanks,
Dieter.

@RLH

Contributor

RLH commented Mar 14, 2016

The GC will use as much CPU as is available. If your program is basically idle, which it appears to be, the GC will use the idle CPU and CPU load will naturally go up. If your application is active and load is already high, then the GC will limit its load to 25% of GOMAXPROCS. The mark and scan phase is concurrent; it is unclear how it is adversely affecting your idle application.

@ianlancetaylor ianlancetaylor changed the title from mark and scan needs excessive amount of time (15s for 6.5GB heap) to runtime: mark and scan needs excessive amount of time (15s for 6.5GB heap) Mar 14, 2016

@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Mar 14, 2016

@Dieterbe

Contributor

Dieterbe commented Mar 16, 2016

The mark and scan phase is concurrent, it is unclear how it is adversely affecting your idle application.

Just a guess, but perhaps the cause is the extra workload induced by the write barrier? (I watched your GopherCon talk again today :) Interestingly, when I use top, I have never been able to catch a core running at 100%.

But you're right that there are essentially two things going on, which may or may not be related:

  • mark and scan takes too long
  • the app slows down during GC runs, while CPU cores don't saturate.

Let me know how I can help.

@aclements

Member

aclements commented Apr 3, 2016

Hi @Dieterbe, could you clarify what the issue is? 15s for 6.5GB is actually pretty good (I get ~5s/GB of CPU time on some benchmarks locally, but this can vary a lot based on heap layout and hardware).
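(By that measure, the 15s of CPU time reported here for a ~6.5GB heap is roughly 2.3 s/GB, well under that figure.)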

If it's that the CPU utilization goes up during GC, please clarify why this is a problem (the GC has to do its work somehow, and FPGA accelerators for GC are still an open area of research :)

If it's that response time goes up during GC, could you try the CL in #15022? (And, if you're feeling adventurous, there's also https://go-review.googlesource.com/21036 and https://go-review.googlesource.com/21282)

@Dieterbe

Contributor

Dieterbe commented Apr 4, 2016

Hey @aclements!

15s for 6.5GB is actually pretty good (I get ~5s/GB of CPU time on some benchmarks locally, but this can vary a lot based on heap layout and hardware).

OK, fair enough. I just reported this here because @davecheney mentioned in
https://groups.google.com/forum/#!topic/golang-nuts/Q0rXKYjy1cg
that 1.5s for 5GB was unexpected and that I should open a ticket for it; hence this ticket.

If it's that the CPU utilization goes up during GC, please clarify why this is a problem (the GC has to do its work somehow, and FPGA accelerators for GC are still an open area of research :)

Of course, this by itself is not a problem.

If it's that response time goes up during GC, could you try the CL in #15022?

Initially the ticket wasn't about this, but it was brought up and is definitely a problem for us, so from now on we may as well consider this the issue at hand.
I recompiled my app with a Go toolchain rebuilt with your patch, and did a test run before and after.
Unfortunately I see no change and the latency spikes are still there (details at grafana/metrictank#172).
Note that I can verify this problem quite early on; e.g. in this case I've seen spikes as early as GC run 270. The issue is probably there much earlier, but my app needs to load in a lot of data before I can test. The bug mentioned in #15022 looks like it only activates after a sufficient number of GC runs.

@aclements aclements changed the title from runtime: mark and scan needs excessive amount of time (15s for 6.5GB heap) to runtime: GC causes latency spikes Apr 22, 2016

@aclements

Member

aclements commented May 16, 2016

@Dieterbe, would it be possible for you to collect a runtime trace (https://godoc.org/runtime/trace) around one of the periods of increased latency? If you do this with current master, the trace file will be entirely self-contained (otherwise, I'll also need the binary to read the trace file).

I have a hunch about what could be causing this. GC shrinks the stacks, so if many of your goroutines are constantly growing and shrinking the amount of stack they're using by at least a factor of 4, you would see a spike as many goroutines re-grew their stacks after the shrink. This should be more smeared out on master than with Go 1.6 since f11e4eb made shrinking happen concurrently at the beginning of the GC cycle, but if this is the problem I don't think that would have completely mitigated it. (Unfortunately, the trace doesn't say when stack growth happens, so it wouldn't be a smoking gun, but if many distinct goroutines have high latency right after GC that will be some evidence for this theory.)

@Dieterbe

Contributor

Dieterbe commented May 16, 2016

Hey @aclements,
I did curl 'http://localhost:6063/debug/pprof/trace?seconds=20' > trace.bin
About 5~7 seconds in, I think (it's a bit hard to tell), is where the GC kicks in and a latency spike was observed.
Files: http://dieter.plaetinck.be/files/go-gc-team-is-awesome/trace.bin and http://dieter.plaetinck.be/files/go-gc-team-is-awesome/metric_tank for the binary, compiled with official 1.6.2. Hopefully this helps to diagnose; if not, let me know and maybe I can get a better trace.
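For reference, that /debug/pprof/trace endpoint comes from net/http/pprof; a minimal sketch of how such an endpoint gets exposed (the exact wiring in metric_tank may differ):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers (including /debug/pprof/trace) on the default mux
)

func main() {
	// With this running, a trace can be grabbed with:
	//   curl 'http://localhost:6063/debug/pprof/trace?seconds=20' > trace.bin
	log.Fatal(http.ListenAndServe("localhost:6063", nil))
}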

@aclements aclements modified the milestones: Go1.7Maybe, Go1.7 May 24, 2016

@Dieterbe

Contributor

Dieterbe commented May 30, 2016

I read through #9477 and #10345 and wonder if this issue is another similar case. Note that this app is centered around a map (https://github.com/raintank/raintank-metric/blob/master/metric_tank/mdata/aggmetrics.go#L13) that has just over 1M values (and each value in turn has a bunch of pointers to things that have more pointers, and lots of strings involved too). Optimizing this is on my todo list, but in the meantime I wonder if maybe a GC thread blocks the map, leaving other application threads (mutators) unable to interact with it. And since everything in the app needs this map, that could explain the slowdowns?
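For illustration, the shape of the data is roughly like this (a simplified sketch, not the real types from aggmetrics.go):

package main

// One huge map whose values fan out into many more pointers and strings;
// the GC has to traverse all of it during mark.
type chunk struct {
	data []byte
}

type aggMetric struct {
	key    string   // metric name (one of the many small strings)
	tags   []string // more small strings
	chunks []*chunk // each value fans out into further pointers
}

func main() {
	metrics := make(map[string]*aggMetric, 1<<20) // just over 1M live entries in our case
	_ = metrics
}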

@aclements

Member

aclements commented May 31, 2016

@Dieterbe, it's possible. Could you try the fix I posted for #10345? (https://golang.org/cl/23540)

Note that it's not that the GC thread blocks the map. Mutators are free to read and write the map while GC is scanning it; there's no synchronization on the map itself. The issue is that whatever thread gets picked to scan the buckets array of the map is stuck not being able to do anything else until it's scanned the whole bucket array. If there's other mutator work queued up on that thread, it's blocked during this time.

(Sorry I haven't had a chance to dig into the trace you sent.)

@aclements

Member

aclements commented Jun 7, 2016

I dug into the trace you posted (thanks!) and took another look at your gctrace. This is pretty clearly an over-assist problem. When the GC is running, the expected 25% of the CPU goes to dedicated GC workers, but nearly all of the rest of the CPU time goes to GC assists triggered by allocation. Hence, when GC is running, even though it's technically concurrent, nearly all of the CPU is being soaked up by GC. You can see this really clearly in the trace:

[trace screenshot: trace-14812]

The GC kicks in in the middle. When the GC isn't running, all of the user goroutine executions are really short and the system keeps up. As soon as GC kicks in, two Ps go to dedicated gcBgMarkWorkers (expected) and everything else turns to a checkerboard of 10–20 ms long user goroutine time slices (not expected). Because of this, the run queue lengths climb rapidly (the second row in the group of three at the top), which is almost certainly what's causing the latency spikes. Right after GC finishes there's a period where the user goroutines are running quickly again but they're packed much more densely as it drains the long run queues and returns to normal.

I don't yet understand why this is happening. The heap is in steady state, so I would expect it to compute an assist ratio that causes it to finish roughly half way between the trigger and the goal, but it seems to be computing a much higher assist ratio and finishing much earlier.

@Dieterbe, could you re-run your application with GODEBUG=gctrace=1,gcpacertrace=1 and grab stderr? It would be awesome if you can run with master, since there's more gcpacertrace output than there was in 1.6, but even the 1.6 output would be helpful.

@Dieterbe

Contributor

Dieterbe commented Jun 8, 2016

Hey @aclements,

I recompiled latest master Go, then compiled my app with it, and ran with that GODEBUG variable.
This time I just ran it in a local dev environment with fake traffic, but the symptoms are visible again (some of the fake requests got slow responses at GC times).
Here's the stderr output: https://gist.github.com/Dieterbe/6a57caf1e6fae129f375146ef95e6276
Note that at line 459 I pressed return twice; that's where my fake traffic started and where the symptoms become clear.
The 1st and 2nd GC run after that didn't show it clearly, but the 3rd, 4th, 5th and basically every GC run after that showed response time spikes again.

If it helps, here's a dashboard of the entire run:
https://snapshot.raintank.io/dashboard/snapshot/gNKMeTpufl77H9gK1JCZn5TYvo88uLju
and zoomed in on the region where the fake traffic kicked in:
https://snapshot.raintank.io/dashboard/snapshot/QgUlChTZGR4Bg4bw95mHxmN3vozfb334

The "golang GC" graph (2nd row, middle) and "request handle duration (all steps)" (3rd row, right) will probably be most helpful.

@aclements

Member

aclements commented Jun 14, 2016

@Dieterbe, is there a way I can run the load test with fake traffic myself? I built metric_tank (I had to switch github.com/grafana/grafana/pkg/log to origin/metrics_reporting, but otherwise go get just worked), but I don't know how to run it.

@Dieterbe

Contributor

Dieterbe commented Jun 14, 2016

I had to switch github.com/grafana/grafana/pkg/log to origin/metrics_reporting

I'm not sure what metrics_reporting is, but the log problem is caused by Grafana recently changing its logging mechanism; you can just check out the last commit before that (grafana/grafana@22cda19) and metric-tank will build fine against it.

As for running it, MT (metric tank) relies on Cassandra, Elasticsearch and nsqd (from the nsq.io project) at runtime. The easiest (hopefully) way to get it all up and running is by using https://github.com/raintank/raintank-docker, so you can follow that readme, with a few changes:

  • you need to update screens/metricTank with
-./metric_tank \
+GODEBUG=gctrace=1,gcpacertrace=1 ./metric_tank \
  • you would also need to create a file screens/benchmark in raintank-docker with contents like the below:
#backfill an hour at 100k, wait for it, and then do realtime at 10k in background, and start http load.
fake_metrics -nsqd-tcp-address nsqd:4150 -statsd-addr statsdaemon:8125 -orgs 100 -keys-per-org 100 -offset 1h -speedup 10 -stop-at-now
fake_metrics -nsqd-tcp-address nsqd:4150 -statsd-addr statsdaemon:8125 -orgs 100 -keys-per-org 100  &
go get github.com/raintank/raintank-metric/inspect-es
go get github.com/tsenart/vegeta
inspect-es --es-addr elasticsearch:9200 -format vegeta-mt | sed 's#localhost:6063#metricTank:6063#' | vegeta attack -rate 500 -duration=120m > attack.out

These tools set up the "contextual circumstances": fake_metrics generates an input stream, and inspect-es piped to vegeta triggers a consistent HTTP request load.

You might hit some snags in the process of getting this all up and running, and I would hate for you to lose much time on it, but I could hop on a call or whatever to help out.

Thanks.

@aclements

Member

aclements commented Jun 15, 2016

I think I got it up and running (with many tweaks), but now I'm a little stuck.

  1. What's the easiest way to get a custom-built Go into the metric_tank container? I'd like to add some debugging to the runtime.
  2. How do I monitor the end-to-end latency so I can test any changes?
@Dieterbe

Contributor

Dieterbe commented Jun 15, 2016

Hey @aclements, sorry you had to go through that. What I personally do is compile metric-tank manually on the host, so that the binary is $GOPATH/src/github.com/raintank/raintank-metric/metric_tank/metric_tank, and then make raintank-docker/raintank_code/raintank-metric a symlink to $GOPATH/src/github.com/raintank/raintank-metric/metric_tank/. You can see in fig.yaml, in the metricTank entry, that ./raintank_code/raintank-metric from the host gets mapped to /go/src/github.com/raintank/raintank-metric within the container via a Docker directory mapping. In other words, the directory in the container is synchronized with the host, so whatever you compile on the host will be available in the container.

  2. there are two methods:

A) Using vegeta. In the vegeta attack command you already have, probably change the duration from 120m to about 5m or so, to make the benchmark shorter. Then it should be as simple as cat attack.out | vegeta report. Vegeta has a few other interesting reporting modes as well, and I think you can even run reports "streaming", i.e. get reports while the file is still being written to.

B) Using the Grafana dashboard. The easiest way to verify that Grafana is running properly in the container is to attach to the screen session and go to the grafana tab (in fact, you may want to check all tabs just to verify nothing's obviously broken; usually the last tab I switch to is the benchmark one, which has the console where vegeta is executing and where you can run more vegeta commands). If Grafana is running, you can go to http://localhost:3000, log in as admin:admin and load the "metric tank" dashboard. That's the dashboard I've been using to provide the snapshots above; it should show real-time stats coming in (which it loads from Graphite, which also runs in a container).

@aclements

Member

aclements commented Jul 1, 2016

I spent much of this week working on this. I still have ideas on my list, but wanted to post an update.

First, I was able to bring up the whole system with a custom runtime and observe the behavior through Grafana and vegeta report. I don't think I was able to quite reproduce the problem locally, though. My laptop simply didn't have the resources, so it was swapping and over-saturating the CPU and causing all sorts of horrible latency. My workstation had the resources (though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!), but I'm not sure it exhibited the problem. According to vegeta, max latency was ~350ms, but P99 was just 2ms in several runs. I did see small jumps in max "request handle duration" correlated with GC in Grafana, though they were only around 15ms.

However, I was still able to dig into the execution traces and glean some insight. I fixed some problems with high goroutine churn potentially losing track of assist credit, and with the way we build up spare assist credit causing long assists when the assist ratio is high (which it is in Metric Tank). I also did several more experimental changes, which did get rid of the bad checkerboard pattern in the execution trace and break it up into a much more finely interleaved workload. This is certainly what we want to see in the trace, but it didn't seem to move the overall stats. I have a few more ideas on why that may be, but I'll be out next week, so it may be a little bit before I can try them.

I'm going to mail a bunch of my experimental commits, which should get linked here shortly. @Dieterbe, if you have time, it would be good to try running with these, though I'm not sure you'll see much difference.

I did notice one thing you may be able to change on the application side to mitigate some of the problem. I think Metric Tank is allocating very small objects at a very high rate. This is causing the GC scan rate to be much lower than typical, since the memory access pattern is even more random than typical. This is stretching out the GC assists and interacting poorly with some of the tuning that assumes a more typical scan rate. It would be worth doing a heap profile with -alloc_objects to see if there are any sources of large numbers of small allocations that would be easy to address.
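For example (assuming the same pprof HTTP endpoint used for the trace earlier), something like:

go tool pprof -alloc_objects http://localhost:6063/debug/pprof/heap

and then top at the interactive prompt should show the hottest allocation sites by object count.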

@Dieterbe

Contributor

Dieterbe commented Jul 4, 2016

I don't think I was able to quite reproduce the problem locally, though. My laptop simply didn't have the resources

Yeah, it's not trivial to consistently trigger clear symptoms (large spikes), but more load always seems to help.

though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!)

I run https://github.com/meltwater/docker-cleanup/blob/master/docker-cleanup-volumes.sh periodically in between runs of the stack; this frees up space.

It would be worth doing a heap profile with -alloc_objects to see if there are any sources of large numbers of small allocations that would be easy to address.

Oh boy. I already went through weeks of optimizing memory allocations, and what's left now is a bit harder. IIRC the remaining inefficiencies are about allocating a whole bunch of strings (several for each incoming metric). Optimizing that (splitting the incoming stream into separate data and metadata streams, string interning, using enums, etc.) is on the agenda, but it will require a broader architectural overhaul.
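To sketch what I mean by string interning (illustrative only, and not goroutine-safe as written; the real thing would need a mutex or sharding):

package main

import "fmt"

// interned maps each distinct string to one canonical copy, so repeated
// metric names/tags don't each keep their own allocation alive.
var interned = make(map[string]string)

func intern(s string) string {
	if c, ok := interned[s]; ok {
		return c
	}
	interned[s] = s
	return s
}

func main() {
	name := intern("collectd.host1.cpu.idle") // hypothetical metric key
	fmt.Println(name)
}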

I'll be out next week

Enjoy your time off!

@uluyol

Contributor

uluyol commented Jul 4, 2016

(though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!)

Mounting a directory on another partition into the container might help with this.

@ianlancetaylor

Contributor

ianlancetaylor commented Jul 7, 2016

Postponing to 1.8.

@gopherbot

gopherbot commented Jul 23, 2016

CL https://golang.org/cl/25155 mentions this issue.

@dr2chase

Contributor

dr2chase commented Sep 5, 2017

This appears to be some sort of library mismatch nonsense.

@RLH

Contributor

RLH commented Sep 5, 2017

@RLH

Contributor

RLH commented Sep 5, 2017

@Dieterbe Dieterbe referenced this issue Oct 1, 2017

Merged

Tag index #729

gopherbot pushed a commit that referenced this issue Oct 31, 2017

runtime: separate soft and hard heap limits
Currently, GC pacing is based on a single hard heap limit computed
based on GOGC. In order to achieve this hard limit, assist pacing
makes the conservative assumption that the entire heap is live.
However, in the steady state (with GOGC=100), only half of the heap is
live. As a result, the garbage collector works twice as hard as
necessary and finishes half way between the trigger and the goal.
Since this is a stable state for the trigger controller, this repeats
from cycle to cycle. Matters are even worse if GOGC is higher. For
example, if GOGC=200, only a third of the heap is live in steady
state, so the GC will work three times harder than necessary and
finish only a third of the way between the trigger and the goal.

Since this causes the garbage collector to consume ~50% of the
available CPU during marking instead of the intended 25%, about 25% of
the CPU goes to mutator assists. This high mutator assist cost causes
high mutator latency variability.

This commit improves the situation by separating the heap goal into
two goals: a soft goal and a hard goal. The soft goal is set based on
GOGC, just like the current goal is, and the hard goal is set at a 10%
larger heap than the soft goal. Prior to the soft goal, assist pacing
assumes the heap is in steady state (e.g., only half of it is live).
Between the soft goal and the hard goal, assist pacing switches to the
current conservative assumption that the entire heap is live.

In benchmarks, this nearly eliminates mutator assists. However, since
background marking is fixed at 25% CPU, this causes the trigger
controller to saturate, which leads to somewhat higher variability in
heap size. The next commit will address this.

The lower CPU usage of course leads to longer mark cycles, though
really it means the mark cycles are as long as they should have been
in the first place. This does, however, lead to two potential
down-sides compared to the current pacing policy: 1. the total
overhead of the write barrier is higher because it's enabled more of
the time and 2. the heap size may be larger because there's more
floating garbage. We addressed 1 by significantly improving the
performance of the write barrier in the preceding commits. 2 can be
demonstrated in intense GC benchmarks, but doesn't seem to be a
problem in any real applications.

Updates #14951.
Updates #14812 (fixes?).
Fixes #18534.

This has no significant effect on the throughput of the
github.com/dr2chase/bent benchmarks-50.

This has little overall throughput effect on the go1 benchmarks:

name                      old time/op    new time/op    delta
BinaryTree17-12              2.41s ± 0%     2.40s ± 0%  -0.22%  (p=0.007 n=20+18)
Fannkuch11-12                2.95s ± 0%     2.95s ± 0%  +0.07%  (p=0.003 n=17+18)
FmtFprintfEmpty-12          41.7ns ± 3%    42.2ns ± 0%  +1.17%  (p=0.002 n=20+15)
FmtFprintfString-12         66.5ns ± 0%    67.9ns ± 2%  +2.16%  (p=0.000 n=16+20)
FmtFprintfInt-12            77.6ns ± 2%    75.6ns ± 3%  -2.55%  (p=0.000 n=19+19)
FmtFprintfIntInt-12          124ns ± 1%     123ns ± 1%  -0.98%  (p=0.000 n=18+17)
FmtFprintfPrefixedInt-12     151ns ± 1%     148ns ± 1%  -1.75%  (p=0.000 n=19+20)
FmtFprintfFloat-12           210ns ± 1%     212ns ± 0%  +0.75%  (p=0.000 n=19+16)
FmtManyArgs-12               501ns ± 1%     499ns ± 1%  -0.30%  (p=0.041 n=17+19)
GobDecode-12                6.50ms ± 1%    6.49ms ± 1%    ~     (p=0.234 n=19+19)
GobEncode-12                5.43ms ± 0%    5.47ms ± 0%  +0.75%  (p=0.000 n=20+19)
Gzip-12                      216ms ± 1%     220ms ± 1%  +1.71%  (p=0.000 n=19+20)
Gunzip-12                   38.6ms ± 0%    38.8ms ± 0%  +0.66%  (p=0.000 n=18+19)
HTTPClientServer-12         78.1µs ± 1%    78.5µs ± 1%  +0.49%  (p=0.035 n=20+20)
JSONEncode-12               12.1ms ± 0%    12.2ms ± 0%  +1.05%  (p=0.000 n=18+17)
JSONDecode-12               53.0ms ± 0%    52.3ms ± 0%  -1.27%  (p=0.000 n=19+19)
Mandelbrot200-12            3.74ms ± 0%    3.69ms ± 0%  -1.17%  (p=0.000 n=18+19)
GoParse-12                  3.17ms ± 1%    3.17ms ± 1%    ~     (p=0.569 n=19+20)
RegexpMatchEasy0_32-12      73.2ns ± 1%    73.7ns ± 0%  +0.76%  (p=0.000 n=18+17)
RegexpMatchEasy0_1K-12       239ns ± 0%     238ns ± 0%  -0.27%  (p=0.000 n=13+17)
RegexpMatchEasy1_32-12      69.0ns ± 2%    69.1ns ± 1%    ~     (p=0.404 n=19+19)
RegexpMatchEasy1_1K-12       367ns ± 1%     365ns ± 1%  -0.60%  (p=0.000 n=19+19)
RegexpMatchMedium_32-12      105ns ± 1%     104ns ± 1%  -1.24%  (p=0.000 n=19+16)
RegexpMatchMedium_1K-12     34.1µs ± 2%    33.6µs ± 3%  -1.60%  (p=0.000 n=20+20)
RegexpMatchHard_32-12       1.62µs ± 1%    1.67µs ± 1%  +2.75%  (p=0.000 n=18+18)
RegexpMatchHard_1K-12       48.8µs ± 1%    50.3µs ± 2%  +3.07%  (p=0.000 n=20+19)
Revcomp-12                   386ms ± 0%     384ms ± 0%  -0.57%  (p=0.000 n=20+19)
Template-12                 59.9ms ± 1%    61.1ms ± 1%  +2.01%  (p=0.000 n=20+19)
TimeParse-12                 301ns ± 2%     307ns ± 0%  +2.11%  (p=0.000 n=20+19)
TimeFormat-12                323ns ± 0%     323ns ± 0%    ~     (all samples are equal)
[Geo mean]                  47.0µs         47.1µs       +0.23%

https://perf.golang.org/search?q=upload:20171030.1

Likewise, the throughput effect on the x/benchmarks is minimal (and
reasonably positive on the garbage benchmark with a large heap):

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.40ms ± 4%  2.29ms ± 3%  -4.57%  (p=0.000 n=19+18)
Garbage/benchmem-MB=64-12    2.23ms ± 1%  2.24ms ± 2%  +0.59%  (p=0.016 n=19+18)
HTTP-12                      12.5µs ± 1%  12.6µs ± 1%    ~     (p=0.326 n=20+19)
JSON-12                      11.1ms ± 1%  11.3ms ± 2%  +2.15%  (p=0.000 n=16+17)

It does increase the heap size of the garbage benchmarks, but seems to
have relatively little impact on more realistic programs. Also, we'll
gain some of this back with the next commit.

name                         old peak-RSS-bytes  new peak-RSS-bytes  delta
Garbage/benchmem-MB=1024-12          1.21G ± 1%          1.88G ± 2%  +55.59%  (p=0.000 n=19+20)
Garbage/benchmem-MB=64-12             168M ± 3%           248M ± 8%  +48.08%  (p=0.000 n=18+20)
HTTP-12                              45.6M ± 9%          47.0M ±27%     ~     (p=0.925 n=20+20)
JSON-12                               193M ±11%           206M ±11%   +7.06%  (p=0.001 n=20+20)

https://perf.golang.org/search?q=upload:20171030.2

Change-Id: Ic78904135f832b4d64056cbe734ab979f5ad9736
Reviewed-on: https://go-review.googlesource.com/59970
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>

gopherbot pushed a commit that referenced this issue Oct 31, 2017

runtime: allow 5% mutator assist over 25% background mark
Currently, the background mark worker and the goal GC CPU are
both fixed at 25%. The trigger controller's goal is to achieve the
goal CPU usage, and with the previous commit it can actually achieve
this. But this means there are *no* assists, which sounds ideal but
actually causes problems for the trigger controller. Since the
controller can't lower CPU usage below the background mark worker CPU,
it saturates at the CPU goal and no longer gets feedback, which
translates into higher variability in heap growth.

This commit fixes this by allowing assists 5% CPU beyond the 25% fixed
background mark. This avoids saturating the trigger controller, since
it can now get feedback from both sides of the CPU goal. This leads to
low variability in both CPU usage and heap growth, at the cost of
reintroducing a low rate of mark assists.

We also experimented with 20% background plus 5% assist, but 25%+5%
clearly performed better in benchmarks.

Updates #14951.
Updates #14812.
Updates #18534.

Combined with the previous CL, this significantly improves tail
mutator utilization in the x/benchmarks garbage benchmark. On a sample
trace, it increased the 99.9%ile mutator utilization at 10ms from 26%
to 59%, and at 5ms from 17% to 52%. It reduced the 99.9%ile zero
utilization window from 2ms to 700µs. It also helps the mean mutator
utilization: it increased the 10s mutator utilization from 83% to 94%.
The minimum mutator utilization is also somewhat improved, though
there is still some unknown artifact that causes a minuscule fraction
of mutator assists to take 5--10ms (in fact, there was exactly one
10ms mutator assist in my sample trace).

This has no significant effect on the throughput of the
github.com/dr2chase/bent benchmarks-50.

This has little effect on the go1 benchmarks (and the slight overall
improvement makes up for the slight overall slowdown from the previous
commit):

name                      old time/op    new time/op    delta
BinaryTree17-12              2.40s ± 0%     2.41s ± 1%  +0.26%  (p=0.010 n=18+19)
Fannkuch11-12                2.95s ± 0%     2.93s ± 0%  -0.62%  (p=0.000 n=18+15)
FmtFprintfEmpty-12          42.2ns ± 0%    42.3ns ± 1%  +0.37%  (p=0.001 n=15+14)
FmtFprintfString-12         67.9ns ± 2%    67.2ns ± 3%  -1.03%  (p=0.002 n=20+18)
FmtFprintfInt-12            75.6ns ± 3%    76.8ns ± 2%  +1.59%  (p=0.000 n=19+17)
FmtFprintfIntInt-12          123ns ± 1%     124ns ± 1%  +0.77%  (p=0.000 n=17+14)
FmtFprintfPrefixedInt-12     148ns ± 1%     150ns ± 1%  +1.28%  (p=0.000 n=20+20)
FmtFprintfFloat-12           212ns ± 0%     211ns ± 1%  -0.67%  (p=0.000 n=16+17)
FmtManyArgs-12               499ns ± 1%     500ns ± 0%  +0.23%  (p=0.004 n=19+16)
GobDecode-12                6.49ms ± 1%    6.51ms ± 1%  +0.32%  (p=0.008 n=19+19)
GobEncode-12                5.47ms ± 0%    5.43ms ± 1%  -0.68%  (p=0.000 n=19+20)
Gzip-12                      220ms ± 1%     216ms ± 1%  -1.66%  (p=0.000 n=20+19)
Gunzip-12                   38.8ms ± 0%    38.5ms ± 0%  -0.80%  (p=0.000 n=19+20)
HTTPClientServer-12         78.5µs ± 1%    78.1µs ± 1%  -0.53%  (p=0.008 n=20+19)
JSONEncode-12               12.2ms ± 0%    11.9ms ± 0%  -2.38%  (p=0.000 n=17+19)
JSONDecode-12               52.3ms ± 0%    53.3ms ± 0%  +1.84%  (p=0.000 n=19+20)
Mandelbrot200-12            3.69ms ± 0%    3.69ms ± 0%  -0.19%  (p=0.000 n=19+19)
GoParse-12                  3.17ms ± 1%    3.19ms ± 1%  +0.61%  (p=0.000 n=20+20)
RegexpMatchEasy0_32-12      73.7ns ± 0%    73.2ns ± 1%  -0.66%  (p=0.000 n=17+20)
RegexpMatchEasy0_1K-12       238ns ± 0%     239ns ± 0%  +0.32%  (p=0.000 n=17+16)
RegexpMatchEasy1_32-12      69.1ns ± 1%    69.2ns ± 1%    ~     (p=0.669 n=19+13)
RegexpMatchEasy1_1K-12       365ns ± 1%     367ns ± 1%  +0.49%  (p=0.000 n=19+19)
RegexpMatchMedium_32-12      104ns ± 1%     105ns ± 1%  +1.33%  (p=0.000 n=16+20)
RegexpMatchMedium_1K-12     33.6µs ± 3%    34.1µs ± 4%  +1.67%  (p=0.001 n=20+20)
RegexpMatchHard_32-12       1.67µs ± 1%    1.62µs ± 1%  -2.78%  (p=0.000 n=18+17)
RegexpMatchHard_1K-12       50.3µs ± 2%    48.7µs ± 1%  -3.09%  (p=0.000 n=19+18)
Revcomp-12                   384ms ± 0%     386ms ± 0%  +0.59%  (p=0.000 n=19+19)
Template-12                 61.1ms ± 1%    60.5ms ± 1%  -1.02%  (p=0.000 n=19+20)
TimeParse-12                 307ns ± 0%     303ns ± 1%  -1.23%  (p=0.000 n=19+15)
TimeFormat-12                323ns ± 0%     323ns ± 0%  -0.12%  (p=0.011 n=15+20)
[Geo mean]                  47.1µs         47.0µs       -0.20%

https://perf.golang.org/search?q=upload:20171030.4

It slightly improves the performance of the x/benchmarks:

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.29ms ± 3%  2.22ms ± 2%  -2.97%  (p=0.000 n=18+18)
Garbage/benchmem-MB=64-12    2.24ms ± 2%  2.21ms ± 2%  -1.64%  (p=0.000 n=18+18)
HTTP-12                      12.6µs ± 1%  12.6µs ± 1%    ~     (p=0.690 n=19+17)
JSON-12                      11.3ms ± 2%  11.3ms ± 1%    ~     (p=0.163 n=17+18)

and fixes some of the heap size bloat caused by the previous commit:

name                         old peak-RSS-bytes  new peak-RSS-bytes  delta
Garbage/benchmem-MB=1024-12          1.88G ± 2%          1.77G ± 2%  -5.52%  (p=0.000 n=20+18)
Garbage/benchmem-MB=64-12             248M ± 8%           226M ± 5%  -8.93%  (p=0.000 n=20+20)
HTTP-12                              47.0M ±27%          47.2M ±12%    ~     (p=0.512 n=20+20)
JSON-12                               206M ±11%           206M ±10%    ~     (p=0.841 n=20+20)

https://perf.golang.org/search?q=upload:20171030.5

Combined with the change to add a soft goal in the previous commit,
this achieves a decent performance improvement on the garbage
benchmark:

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.40ms ± 4%  2.22ms ± 2%  -7.40%  (p=0.000 n=19+18)
Garbage/benchmem-MB=64-12    2.23ms ± 1%  2.21ms ± 2%  -1.06%  (p=0.000 n=19+18)
HTTP-12                      12.5µs ± 1%  12.6µs ± 1%    ~     (p=0.330 n=20+17)
JSON-12                      11.1ms ± 1%  11.3ms ± 1%  +1.87%  (p=0.000 n=16+18)

https://perf.golang.org/search?q=upload:20171030.6

Change-Id: If04ddb57e1e58ef2fb9eec54c290eb4ae4bea121
Reviewed-on: https://go-review.googlesource.com/59971
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
@gopherbot

gopherbot commented Nov 6, 2017

Change https://golang.org/cl/24704 mentions this issue: runtime: over-assist in terms of scan work instead of allocation bytes

@gopherbot

gopherbot commented Nov 6, 2017

Change https://golang.org/cl/24706 mentions this issue: runtime: shorten mark worker preemption to 1ms

@gopherbot

gopherbot commented Nov 6, 2017

Change https://golang.org/cl/24705 mentions this issue: runtime: adjust trigger based only on CPU utilization

@gopherbot

gopherbot commented Nov 6, 2017

Change https://golang.org/cl/24708 mentions this issue: runtime: essentially disable dedicated and fractional mark workers

@gopherbot

gopherbot commented Nov 6, 2017

Change https://golang.org/cl/24707 mentions this issue: runtime: add a control to disable the idle worker

@gopherbot

gopherbot commented Nov 6, 2017

Change https://golang.org/cl/24703 mentions this issue: runtime: flush assist credit on goroutine exit

@aclements

Member

aclements commented Nov 6, 2017

Gopherbot probably shouldn't post in response to old CLs that are being abandoned. :) Only the last of these (CL 24703) is real, and it's just a minor fix.

It would be great if everyone could try current master and report their findings. The fixes for the pacer that everyone tried earlier are now in, along with several other latency improvements.

@dkisman-laserlike, I took another look at the traces you posted (thanks!), and noticed something deeper this time that I think explains why the earlier patches didn't help you (and you probably won't see much improvement on master). I think what's going on is that your application's idle time varies significantly between GC cycles. Since GC scheduling is based on a feedback system, it "gets used to" being able to soak up the application's idle time for a few cycles. Then it hits a cycle that doesn't have idle time, but realizes too late that it started too late and makes up for the loss by having a lot of assists. It would adjust to this over the next few cycles, but then it starts having idle time available again. If it's not too annoying to run your test, you can disable idle GC workers by disabling the two if statements in findrunnable in runtime/proc.go that mention gcBlackenEnabled != 0. I'd be interested in the results.

@petermattis

petermattis commented Nov 7, 2017

I'll give go-tip a whirl on our workload later this week.

@petermattis

petermattis commented Nov 7, 2017

I don't have precisely the same machine setup as previously, so the numbers below differ from ones I've reported earlier. Before running the tests below, I let my load generator run as fast as possible to see the capacity of the system. Then I limited the rate of the load generator to ~50% of the max. Here are the numbers for a 3-node cockroach cluster compiled with go1.9 vs go-tip (devel +33c246f Tue Nov 7 14:18:00 2017 +0000):

go1.9:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     14760.2      2.4      0.8     13.6     37.7

go-tip:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     14998.5      1.1      0.8      2.9      8.9

Big thumbs up on the reduction in 95%-tile and 99%-tile latencies. For reference, if I don't rate limit the load generator I see:

go1.9:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     27537.1      7.0      2.4     26.2     52.4

go-tip:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     27368.3      7.0      2.8     27.3     41.9

From the rate limited runs, here are a few lines of gctrace output from go1.9 compiled binary:

gc 299 @270.168s 3%: 0.28+94+0.13 ms clock, 1.9+226/188/0+0.95 ms cpu, 399->410->209 MB, 420 MB goal, 8 P
gc 300 @271.634s 3%: 0.26+108+0.10 ms clock, 2.1+197/216/0+0.83 ms cpu, 398->409->211 MB, 419 MB goal, 8 P
gc 301 @273.202s 3%: 0.21+104+0.11 ms clock, 1.7+196/207/0+0.93 ms cpu, 401->411->210 MB, 422 MB goal, 8 P

And from go-tip:

gc 285 @270.125s 2%: 0.041+120+0.14 ms clock, 0.33+151/231/48+1.1 ms cpu, 417->436->220 MB, 435 MB goal, 8 P
gc 286 @271.713s 2%: 0.091+117+0.16 ms clock, 0.73+19/234/139+1.3 ms cpu, 419->434->216 MB, 441 MB goal, 8 P
gc 287 @273.322s 2%: 0.061+120+0.12 ms clock, 0.48+115/240/74+0.96 ms cpu, 415->430->216 MB, 433 MB goal, 8 P
@aclements

Member

aclements commented Nov 7, 2017

@petermattis, thanks for re-running! The first set of numbers looks great. The second set might just be showing the natural queuing delays of a fully loaded system.

Your gctrace is interesting. Just based on those three lines, I suspect you're encountering to some degree the same problem with idle fluctuation that I hypothesized in #14812 (comment). Can you paste a few more lines of gctrace? A dozen is probably enough.

gopherbot pushed a commit that referenced this issue Nov 7, 2017

runtime: flush assist credit on goroutine exit
Currently dead goroutines retain their assist credit. This credit can
be used if the goroutine gets recycled, but in general this can make
assist pacing over-aggressive by hiding an amount of credit
proportional to the number of exited (and not reused) goroutines.

Fix this "hidden credit" by flushing assist credit to the global
credit pool when a goroutine exits.

Updates #14812.

Change-Id: I65f7f75907ab6395c04aacea2c97aea963b60344
Reviewed-on: https://go-review.googlesource.com/24703
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
@petermattis

petermattis commented Nov 7, 2017

Yeah, I'm not concerned about the second set of numbers. I included those to verify that top-end performance wasn't suffering.

Would it be worthwhile to try disabling idle GC workers?

go1.9 gctrace:

gc 290 @256.195s 3%: 0.25+100+0.10 ms clock, 2.0+208/199/0+0.84 ms cpu, 400->410->210 MB, 420 MB goal, 8 P
gc 291 @257.727s 3%: 0.24+104+0.21 ms clock, 1.9+212/207/0+1.7 ms cpu, 400->410->210 MB, 420 MB goal, 8 P
gc 292 @259.293s 3%: 0.25+104+0.11 ms clock, 2.0+191/208/0.24+0.93 ms cpu, 401->411->210 MB, 421 MB goal, 8 P
gc 293 @260.812s 3%: 0.29+106+0.25 ms clock, 2.0+209/202/0+1.7 ms cpu, 401->411->210 MB, 421 MB goal, 8 P
gc 294 @262.345s 3%: 0.078+108+0.18 ms clock, 0.63+214/217/13+1.4 ms cpu, 401->411->211 MB, 420 MB goal, 8 P
gc 295 @263.929s 3%: 0.14+103+3.2 ms clock, 1.0+214/206/0+22 ms cpu, 402->412->210 MB, 422 MB goal, 8 P
gc 296 @265.492s 3%: 0.26+103+0.11 ms clock, 2.1+205/207/0+0.94 ms cpu, 400->410->210 MB, 420 MB goal, 8 P
gc 297 @267.053s 3%: 0.16+103+0.10 ms clock, 1.3+211/204/0+0.86 ms cpu, 401->411->210 MB, 421 MB goal, 8 P
gc 298 @268.609s 3%: 0.18+102+0.19 ms clock, 1.5+216/204/0+1.5 ms cpu, 400->410->210 MB, 420 MB goal, 8 P
gc 299 @270.168s 3%: 0.28+94+0.13 ms clock, 1.9+226/188/0+0.95 ms cpu, 399->410->209 MB, 420 MB goal, 8 P
gc 300 @271.634s 3%: 0.26+108+0.10 ms clock, 2.1+197/216/0+0.83 ms cpu, 398->409->211 MB, 419 MB goal, 8 P
gc 301 @273.202s 3%: 0.21+104+0.11 ms clock, 1.7+196/207/0+0.93 ms cpu, 401->411->210 MB, 422 MB goal, 8 P

go-tip gctrace:

gc 276 @255.549s 2%: 0.041+118+0.16 ms clock, 0.32+116/236/73+1.3 ms cpu, 416->431->216 MB, 434 MB goal, 8 P
gc 277 @257.165s 2%: 0.038+118+0.14 ms clock, 0.30+88/234/86+1.1 ms cpu, 414->429->216 MB, 432 MB goal, 8 P
gc 278 @258.793s 2%: 0.075+125+0.096 ms clock, 0.60+111/248/57+0.77 ms cpu, 415->431->217 MB, 433 MB goal, 8 P
gc 279 @260.381s 2%: 0.043+119+0.17 ms clock, 0.34+92/237/81+1.4 ms cpu, 416->431->216 MB, 435 MB goal, 8 P
gc 280 @261.956s 2%: 0.14+127+0.14 ms clock, 1.1+111/253/57+1.1 ms cpu, 414->431->217 MB, 433 MB goal, 8 P
gc 281 @263.588s 2%: 0.049+132+0.12 ms clock, 0.39+109/264/59+1.0 ms cpu, 416->433->218 MB, 435 MB goal, 8 P
gc 282 @265.224s 2%: 0.051+126+0.14 ms clock, 0.40+76/251/98+1.1 ms cpu, 417->433->217 MB, 437 MB goal, 8 P
gc 283 @266.856s 2%: 0.074+125+0.092 ms clock, 0.59+107/247/59+0.74 ms cpu, 417->432->217 MB, 435 MB goal, 8 P
gc 284 @268.487s 2%: 0.075+128+0.12 ms clock, 0.60+92/255/55+1.0 ms cpu, 417->432->217 MB, 435 MB goal, 8 P
gc 285 @270.125s 2%: 0.041+120+0.14 ms clock, 0.33+151/231/48+1.1 ms cpu, 417->436->220 MB, 435 MB goal, 8 P
gc 286 @271.713s 2%: 0.091+117+0.16 ms clock, 0.73+19/234/139+1.3 ms cpu, 419->434->216 MB, 441 MB goal, 8 P
gc 287 @273.322s 2%: 0.061+120+0.12 ms clock, 0.48+115/240/74+0.96 ms cpu, 415->430->216 MB, 433 MB goal, 8 P
@aclements

Member

aclements commented Nov 8, 2017

Would it be worthwhile to try disabling idle GC workers?

There isn't as much idle variance in the longer gctrace, but if it's not too much trouble, I would love to know what effect disabling idle GC has and what the trace looks like. Most of the cycles in your longer trace still have more assist time than they're supposed to, which I suspect is related to idle GC.

BTW, this patch should disable idle GC:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index 112543db10..febb676a07 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -2290,7 +2290,7 @@ stop:
        // We have nothing to do. If we're in the GC mark phase, can
        // safely scan and blacken objects, and have work to do, run
        // idle-time marking rather than give up the P.
-       if gcBlackenEnabled != 0 && _p_.gcBgMarkWorker != 0 && gcMarkWorkAvailable(_p_) {
+       if false && gcBlackenEnabled != 0 && _p_.gcBgMarkWorker != 0 && gcMarkWorkAvailable(_p_) {
                _p_.gcMarkWorkerMode = gcMarkWorkerIdleMode
                gp := _p_.gcBgMarkWorker.ptr()
                casgstatus(gp, _Gwaiting, _Grunnable)
@@ -2357,7 +2357,7 @@ stop:
        }
 
        // Check for idle-priority GC work again.
-       if gcBlackenEnabled != 0 && gcMarkWorkAvailable(nil) {
+       if false && gcBlackenEnabled != 0 && gcMarkWorkAvailable(nil) {
                lock(&sched.lock)
                _p_ = pidleget()
                if _p_ != nil && _p_.gcBgMarkWorker == 0 {
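(After applying a patch like this, rebuild Go with src/make.bash and recompile the application with the resulting toolchain.)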
@petermattis

petermattis commented Nov 8, 2017

Disabling GC idle didn't seem to affect latencies. The run below shows slightly better 99%-tile latency and slightly worse 95%-tile latency vs go-tip, but I think this is noise.

go-tip+no-gc-idle:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     14999.0      1.1      0.8      3.1      8.4

go-tip:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     14998.5      1.1      0.8      2.9      8.9

gctrace output shows no GC idle time as expected and somewhat lower assist times, though I thought the recent GC changes were intended to eliminate GC assist in most cases.

gc 266 @255.887s 3%: 0.084+155+0.16 ms clock, 0.67+82/309/0+1.3 ms cpu, 421->441->221 MB, 442 MB goal, 8 P
gc 267 @257.543s 3%: 0.086+147+0.11 ms clock, 0.69+86/295/0+0.91 ms cpu, 422->441->220 MB, 443 MB goal, 8 P
gc 268 @259.179s 3%: 0.081+150+0.095 ms clock, 0.65+95/300/0+0.76 ms cpu, 420->440->220 MB, 441 MB goal, 8 P
gc 269 @260.741s 3%: 0.16+154+0.10 ms clock, 1.3+91/308/0+0.83 ms cpu, 421->441->221 MB, 441 MB goal, 8 P
gc 270 @262.389s 3%: 0.070+149+0.093 ms clock, 0.56+81/299/0+0.74 ms cpu, 422->441->221 MB, 443 MB goal, 8 P
gc 271 @264.034s 3%: 0.031+148+0.10 ms clock, 0.25+88/296/0+0.80 ms cpu, 421->440->220 MB, 442 MB goal, 8 P
gc 272 @265.676s 3%: 0.081+145+0.094 ms clock, 0.65+88/289/0+0.75 ms cpu, 420->439->220 MB, 440 MB goal, 8 P
gc 273 @267.307s 3%: 0.20+153+0.10 ms clock, 1.6+95/307/0+0.83 ms cpu, 420->440->221 MB, 440 MB goal, 8 P
gc 274 @268.949s 3%: 0.11+154+0.14 ms clock, 0.91+92/308/0+1.1 ms cpu, 421->440->221 MB, 442 MB goal, 8 P
gc 275 @270.515s 3%: 0.055+154+0.097 ms clock, 0.44+82/306/0+0.78 ms cpu, 421->441->221 MB, 442 MB goal, 8 P
gc 276 @272.171s 3%: 0.12+152+0.16 ms clock, 1.0+98/303/0+1.3 ms cpu, 422->441->221 MB, 443 MB goal, 8 P
gc 277 @273.820s 3%: 0.052+157+0.11 ms clock, 0.42+89/314/0+0.94 ms cpu, 421->441->221 MB, 442 MB goal, 8 P

If I also disable GC assist (small patch to src/runtime/malloc.go) I see:

go-tip+no-gc-idle+no-gc-assist:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     15000.1      0.9      0.8      2.2      3.9

And the corresponding gctrace:

gc 250 @258.759s 3%: 0.031+187+0.088 ms clock, 0.24+0/375/0+0.70 ms cpu, 431->456->227 MB, 451 MB goal, 8 P
gc 251 @260.447s 3%: 0.058+184+0.20 ms clock, 0.46+0/367/0+1.6 ms cpu, 433->458->226 MB, 454 MB goal, 8 P
gc 252 @262.137s 3%: 0.064+181+0.24 ms clock, 0.51+0/353/0+1.9 ms cpu, 432->455->225 MB, 452 MB goal, 8 P
gc 253 @263.857s 3%: 0.093+188+0.070 ms clock, 0.75+0/376/0+0.56 ms cpu, 430->455->227 MB, 450 MB goal, 8 P
gc 254 @265.592s 3%: 0.11+175+0.12 ms clock, 0.90+0/350/0+0.97 ms cpu, 433->456->224 MB, 454 MB goal, 8 P
gc 255 @267.308s 3%: 0.047+175+0.12 ms clock, 0.37+0/350/0+0.96 ms cpu, 429->453->225 MB, 449 MB goal, 8 P
gc 256 @269.018s 2%: 0.036+180+0.17 ms clock, 0.28+0/360/0+1.3 ms cpu, 430->454->225 MB, 450 MB goal, 8 P
gc 257 @270.690s 2%: 0.068+185+0.11 ms clock, 0.54+0/369/0+0.91 ms cpu, 430->455->225 MB, 450 MB goal, 8 P
gc 258 @272.374s 2%: 0.074+193+0.12 ms clock, 0.52+0/386/0+0.86 ms cpu, 431->457->227 MB, 451 MB goal, 8 P
gc 259 @274.111s 2%: 0.11+184+0.25 ms clock, 0.94+0/367/0+2.0 ms cpu, 433->457->225 MB, 454 MB goal, 8 P

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@aclements aclements modified the milestones: Go1.11, Unplanned Jul 3, 2018

@brian-armstrong-discord

brian-armstrong-discord commented Aug 8, 2018

Hi,

I posted on golang-nuts but was pointed to this issue. I've also got a service that's encountering long request latencies due to CPU starvation when the GC runs. Like a couple of others have mentioned, I think the GC may act poorly on long linked lists: I have a 1M-long list/map used as an LRU cache.

Here's one trace captured from the service. I think most look pretty similar.
GC forced
pacer: assist ratio=+8.611156e-002 (scan 960 MB in 15690->21268 MB) workers=1++0.000000e+000
pacer: H_m_prev=11151043456 h_t=+9.500000e-001 H_T=21744534739 h_a=+4.765627e-001 H_a=16465215016 h_g=+1.000000e+000 H_g=22302086912 u_a=+9.186874e-001 u_g=+3.000000e-001 W_a=347113784 goalΔ=+5.000000e-002 actualΔ=-4.734373e-001 u_a/u_g=+3.062291e+000
gc 36 @1197.776s 1%: 0.59+2210+0.29 ms clock, 2.3+5915/2192/0+1.1 ms cpu, 15690->15702->10671 MB, 21268 MB goal, 4 P
scvg7: inuse: 15834, idle: 107, sys: 15942, released: 0, consumed: 15942 (MB)
pacer: sweep done at heap size 10851MB; allocated 179MB during sweep; swept 2026864 pages at +1.906868e-004 pages/byte

This is on Go 1.10.3 with 4 cores. The allocation is slow enough compared to the heap size that the GC runs compulsorily every 120 seconds, but changing GOGC to make it run more often doesn't help the CPU issue, just makes the spikes more frequent.

Are there any fixes for the latency spikes on Go's roadmap?

Edit: Also, I don't understand why the GC runs this concurrent phase in such a short duration. If the GCs are spaced 120 seconds apart, why use just ~2.5s of that doing GC? Services typically don't like spiky load, so it seems silly to burst all the GC at once.

@RLH

Contributor

RLH commented Aug 13, 2018

@brian-armstrong-discord

brian-armstrong-discord commented Aug 13, 2018

Unfortunately I don't have any traces that work. I captured a few but they just render blank pages in the browser, and I didn't have time to dig into what was wrong with them. There's no ambiguity that the spikes are correlated with the GC behavior though since our metrics (and htop) show very clear CPU spikes that occur when Go reports that it has started its GC.

I haven't tried any Go builds but it wouldn't be trivial for us to change out to a nonstandard Go binary on our build system. Are there GC fixes in Go > 1.10.3?

We did come up with a very hacky workaround, which is to add more machines to the fleet. By bringing the average CPU utilization under 20%, we made the impact of the CPU spikes on request latency less pronounced. The spikes are still present and still completely saturate the CPU, but the smaller load doesn't seem to be as affected by them. However, it seems rather unfortunate if Go can only use 1/5 of the computing resources of a machine because of its GC design.

I'll come up with a repro repo that we can share. I actually think it will be pretty easy given the variables we tested, but I'll need to actually test and confirm to be sure.

@RLH

Contributor

RLH commented Aug 14, 2018

gopherbot pushed a commit that referenced this issue Nov 5, 2018

cmd/trace: add minimum mutator utilization (MMU) plot
This adds an endpoint to the trace tool that plots the minimum mutator
utilization curve using information on mark assists and GC pauses from
the trace.

This commit implements a fairly straightforward O(nm) algorithm for
computing the MMU (and tests against an even more direct but slower
algorithm). Future commits will extend and optimize this algorithm.

This should be useful for debugging and understanding mutator
utilization issues like #14951, #14812, #18155, #18534, #21107,
particularly once follow-up CLs add trace cross-referencing.

Change-Id: Ic2866869e7da1e6c56ba3e809abbcb2eb9c4923a
Reviewed-on: https://go-review.googlesource.com/c/60790
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>