runtime: GC causes latency spikes #14812

Open · Dieterbe opened this issue Mar 14, 2016 · 116 comments

@Dieterbe (Contributor) commented Mar 14, 2016

Hello,
while running the program at https://github.com/raintank/raintank-metric/tree/master/metric_tank
I'm seeing mark-and-scan times of ~15s CPU time and 2000~2500 ms clock time (8-core system) for a heap of about 6.5GB.
(STW pauses are fine and ~1ms)
I used https://circleci.com/gh/raintank/raintank-metric/507 to obtain the data below.

$ metric_tank --version
metrics_tank (built with go1.6, git hash 8897ef4f8f8f1a2585ee88ecadee501bfc1a4139)
$ go version
go version go1.6 linux/amd64
$ uname -a #on the host where the app runs
Linux metric-tank-3-qa 3.19.0-43-generic #49~14.04.1-Ubuntu SMP Thu Dec 31 15:44:49 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

I know the app is currently not optimized for GC workload: while I've gotten allocations down in various parts of the program, there are currently probably a million or more live pointers referencing pieces of data. I was going to work on optimizing this when Dave Cheney suggested there's a problem with the runtime and I should file a bug (https://groups.google.com/forum/#!topic/golang-nuts/Q0rXKYjy1cg).
Here's the log with gctrace and schedtrace enabled: https://gist.githubusercontent.com/Dieterbe/18453451c5af0cdececa/raw/9c4f2abd85bb7a815c6cda5c1828334d3d29817d/log.txt

At http://dieter.plaetinck.be/files/go/mt3-qa-gc-vs-no-gc.zip you'll find a zip containing this log, the binary, a CPU profile taken during GC run 1482, and a CPU and heap profile taken between runs 1482 and 1483.

I also have these two dashboards that seem useful (they both end just after the spike induced by GC run 1482):
https://snapshot.raintank.io/dashboard/snapshot/MtLqvc4F6015zbs4iMQSPzfizvG7OQjC
shows memory usage, GC runs and STW pause times. It also shows that the app's incoming load (requests) is constant, which tells me that any extra load is caused by GC, not by a changing workload.
https://snapshot.raintank.io/dashboard/snapshot/c2zwTZCF7BmfyzEuGF6cHN9GX9aM1V99
shows the system stats; note the CPU spikes corresponding to the GC workload.
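
For readers who can't load the snapshots, a minimal sketch (not part of metric_tank; the polling interval and log format are arbitrary choices) of pulling similar GC numbers straight from the runtime:

package main

import (
	"log"
	"runtime"
	"time"
)

// logGC prints the most recent GC pause and heap size once per second,
// using only documented runtime.MemStats fields. ReadMemStats briefly
// stops the world, so keep the polling interval coarse.
func logGC() {
	var lastGC uint32
	for range time.Tick(time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		if m.NumGC == lastGC {
			continue // no new GC cycle since the last sample
		}
		lastGC = m.NumGC
		pause := time.Duration(m.PauseNs[(m.NumGC+255)%256])
		log.Printf("gc %d: last STW pause=%v heap=%d MB", m.NumGC, pause, m.HeapAlloc>>20)
	}
}

func main() {
	go logGC()
	select {} // stand-in for the real application
}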

let me know if there's anything else I can provide,
thanks,
Dieter.

@RLH (Contributor) commented Mar 14, 2016

The GC will use as much CPU as is available. If your program is basically idle, which it appears to be, the GC will use the idle CPU and CPU load will naturally go up. If your application is active and load is already high then the GC will limit its load to 25% of GOMAXPROCS. The mark and scan phase is concurrent; it is unclear how it is adversely affecting your idle application.

@ianlancetaylor changed the title from "mark and scan needs excessive amount of time (15s for 6.5GB heap)" to "runtime: mark and scan needs excessive amount of time (15s for 6.5GB heap)" Mar 14, 2016
@ianlancetaylor added this to the Go1.7 milestone Mar 14, 2016
@Dieterbe (Contributor, Author) commented Mar 16, 2016

The mark and scan phase is concurrent, it is unclear how it is adversely affecting your idle application.

Just a guess, but perhaps the cause is the extra workload induced by the write barrier? (I watched your GopherCon talk again today :) Interestingly, when I use top, I haven't ever been able to catch a core running at 100%.

But you're right that there are essentially two things going on, which may or may not be related:

  • mark and scan takes too long
  • app slows down during GC runs, while cpu cores don't saturate.

Let me know how I can help.

@aclements (Member) commented Apr 3, 2016

Hi @Dieterbe, could you clarify what the issue is? 15s for 6.5GB is actually pretty good (I get ~5s/GB of CPU time on some benchmarks locally, but this can vary a lot based on heap layout and hardware).

If it's that the CPU utilization goes up during GC, please clarify why this is a problem (the GC has to do its work somehow, and FPGA accelerators for GC are still an open area of research :)

If it's that response time goes up during GC, could you try the CL in #15022? (And, if you're feeling adventurous, there's also https://go-review.googlesource.com/21036 and https://go-review.googlesource.com/21282)

@Dieterbe (Contributor, Author) commented Apr 4, 2016

Hey @aclements!

15s for 6.5GB is actually pretty good (I get ~5s/GB of CPU time on some benchmarks locally, but this can vary a lot based on heap layout and hardware).

OK, fair enough. I just reported this here because @davecheney mentioned in
https://groups.google.com/forum/#!topic/golang-nuts/Q0rXKYjy1cg
that 1.5s for 5GB was unexpected and that I should open a ticket for it; hence this ticket.

If it's that the CPU utilization goes up during GC, please clarify why this is a problem (the GC has to do its work somehow, and FPGA accelerators for GC are still an open area of research :)

of course, this is by itself not a problem.

If it's that response time goes up during GC, could you try the CL in #15022?

Initially the ticket wasn't about this, but it was brought up and is definitely a problem for us, so from now on we may as well consider this the issue at hand.
I recompiled my app with a Go recompiled with your patch, and did a test run before and after.
Unfortunately I see no change and the latency spikes are still there (details at grafana/metrictank#172).
Note that I can verify this problem quite early on; e.g. in this case I've seen spikes as early as GC run 270. The issue is probably there much earlier, but my app needs to load a lot of data before I can test. The bug mentioned in #15022 looks like it only activates after a sufficient number of GC runs.

@aclements aclements changed the title runtime: mark and scan needs excessive amount of time (15s for 6.5GB heap) runtime: GC causes latency spikes Apr 22, 2016
@aclements (Member) commented May 16, 2016

@Dieterbe, would it be possible for you to collect a runtime trace (https://godoc.org/runtime/trace) around one of the periods of increased latency? If you do this with current master, the trace file will be entirely self-contained (otherwise, I'll also need the binary to read the trace file).

I have a hunch about what could be causing this. GC shrinks the stacks, so if many of your goroutines are constantly growing and shrinking the amount of stack they're using by at least a factor of 4, you would see a spike as many goroutines re-grew their stacks after the shrink. This should be more smeared out on master than with Go 1.6 since f11e4eb made shrinking happen concurrently at the beginning of the GC cycle, but if this is the problem I don't think that would have completely mitigated it. (Unfortunately, the trace doesn't say when stack growth happens, so it wouldn't be a smoking gun, but if many distinct goroutines have high latency right after GC that will be some evidence for this theory.)
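
For reference, a minimal sketch of capturing such a trace programmatically with the runtime/trace package; the file name and duration here are arbitrary choices:

package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

// captureTrace writes a runtime execution trace covering dur to path.
// View the result with: go tool trace <path>
func captureTrace(path string, dur time.Duration) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		return err
	}
	time.Sleep(dur) // let the program (and ideally a GC cycle) run
	trace.Stop()
	return nil
}

func main() {
	if err := captureTrace("trace.bin", 20*time.Second); err != nil {
		log.Fatal(err)
	}
}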

@Dieterbe (Contributor, Author) commented May 16, 2016

Hey @aclements
I did curl 'http://localhost:6063/debug/pprof/trace?seconds=20' > trace.bin
About 5~7 seconds in, I think (it's a bit hard to tell), is where the GC kicks in and a latency spike was observed.
Files: http://dieter.plaetinck.be/files/go-gc-team-is-awesome/trace.bin and http://dieter.plaetinck.be/files/go-gc-team-is-awesome/metric_tank (the binary), compiled with official 1.6.2. Hopefully this helps to diagnose; if not, let me know and maybe I can get a better trace.
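
For context, the /debug/pprof/trace endpoint used in that curl command comes from net/http/pprof; a minimal sketch of wiring it up, assuming port 6063 as above:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers, including /trace
)

func main() {
	// curl 'http://localhost:6063/debug/pprof/trace?seconds=20' > trace.bin
	log.Fatal(http.ListenAndServe("localhost:6063", nil))
}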

@aclements aclements modified the milestones: Go1.7Maybe, Go1.7 May 24, 2016
@Dieterbe (Contributor, Author) commented May 30, 2016

I read through #9477 and #10345 and wonder if this issue is another similar case. Note that this app is centered around a map (https://github.com/raintank/raintank-metric/blob/master/metric_tank/mdata/aggmetrics.go#L13) that has just over 1M values (and each value in turn has a bunch of pointers to things that have more pointers, with lots of strings involved too). Optimizing this is on my todo list, but in the meantime I wonder if maybe a GC thread blocks the map, leaving other application threads (mutators) unable to interact with it. Since everything in the app needs this map, that could explain the slowdowns?

@aclements (Member) commented May 31, 2016

@Dieterbe, it's possible. Could you try the fix I posted for #10345? (https://golang.org/cl/23540)

Note that it's not that the GC thread blocks the map. Mutators are free to read and write the map while GC is scanning it; there's no synchronization on the map itself. The issue is that whatever thread gets picked to scan the buckets array of the map is stuck not being able to do anything else until it's scanned the whole bucket array. If there's other mutator work queued up on that thread, it's blocked during this time.

(Sorry I haven't had a chance to dig into the trace you sent.)
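
Not a fix suggested in this thread, but to illustrate the point above: because scanning one huge bucket array is indivisible work, splitting the map into shards gives the GC many small bucket arrays it can scan independently. A rough sketch, with the shard count and hash function chosen arbitrarily:

package shardedmap

import (
	"hash/fnv"
	"sync"
)

const numShards = 64

// Map splits one huge map into numShards smaller maps so that no single
// GC scan unit (one bucket array) covers all entries, and so that lock
// contention is spread out as well.
type Map struct {
	shards [numShards]struct {
		mu sync.RWMutex
		m  map[string]interface{}
	}
}

func New() *Map {
	sm := &Map{}
	for i := range sm.shards {
		sm.shards[i].m = make(map[string]interface{})
	}
	return sm
}

func (sm *Map) shard(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % numShards)
}

func (sm *Map) Get(key string) (interface{}, bool) {
	s := &sm.shards[sm.shard(key)]
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[key]
	return v, ok
}

func (sm *Map) Set(key string, v interface{}) {
	s := &sm.shards[sm.shard(key)]
	s.mu.Lock()
	s.m[key] = v
	s.mu.Unlock()
}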

@aclements (Member) commented Jun 7, 2016

I dug into the trace you posted (thanks!) and took another look at your gctrace. This is pretty clearly an over-assist problem. When the GC is running, the expected 25% of the CPU goes to dedicated GC workers, but nearly all of the rest of the CPU time goes to GC assists triggered by allocation. Hence, when GC is running, even though it's technically concurrent, nearly all of the CPU is being soaked up by GC. You can see this really clearly in the trace:

[execution trace screenshot: trace-14812]

The GC kicks in in the middle. When the GC isn't running, all of the user goroutine executions are really short and the system keeps up. As soon as GC kicks in, two Ps go to dedicated gcBgMarkWorkers (expected) and everything else turns to a checkerboard of 10–20 ms long user goroutine time slices (not expected). Because of this, the run queue lengths climb rapidly (the second row in the group of three at the top), which is almost certainly what's causing the latency spikes. Right after GC finishes there's a period where the user goroutines are running quickly again but they're packed much more densely as it drains the long run queues and returns to normal.

I don't yet understand why this is happening. The heap is in steady state, so I would expect it to compute an assist ratio that causes it to finish roughly half way between the trigger and the goal, but it seems to be computing a much higher assist ratio and finishing much earlier.

@Dieterbe, could you re-run your application with GODEBUG=gctrace=1,gcpacertrace=1 and grab stderr? It would be awesome if you can run with master, since there's more gcpacertrace output than there was in 1.6, but even the 1.6 output would be helpful.

@Dieterbe (Contributor, Author) commented Jun 8, 2016

hey @aclements

I recompiled my app with the latest master Go and ran it with that GODEBUG variable.
This time I just ran it in my local dev environment with fake traffic, but the symptoms are visible again (some of the fake requests got slow responses at GC times).
Here's the stderr output: https://gist.github.com/Dieterbe/6a57caf1e6fae129f375146ef95e6276
Note that at line 459 I pressed return twice; that's where my fake traffic started and where the symptoms become clear.
The 1st and 2nd GC runs after that didn't show it clearly, but the 3rd, 4th, 5th, and basically every GC run after that showed response time spikes again.

if it helps, here's a dashboard of the entire run:
https://snapshot.raintank.io/dashboard/snapshot/gNKMeTpufl77H9gK1JCZn5TYvo88uLju
and zoomed in on the region where fake traffic kicked in:
https://snapshot.raintank.io/dashboard/snapshot/QgUlChTZGR4Bg4bw95mHxmN3vozfb334

The "golang GC" graph (2nd row, middle) and the "request handle duration (all steps)" graph (3rd row, right) will probably be most helpful.

@aclements (Member) commented Jun 14, 2016

@Dieterbe, is there a way I can run the load test with fake traffic myself? I built metric_tank (I had to switch github.com/grafana/grafana/pkg/log to origin/metrics_reporting, but otherwise go get just worked), but I don't know how to run it.

@Dieterbe (Contributor, Author) commented Jun 14, 2016

I had to switch github.com/grafana/grafana/pkg/log to origin/metrics_reporting

I'm not sure what metrics_reporting is, but the log problem is caused by grafana recently changing its logging mechanism; you can just check out the last commit before that (grafana/grafana@22cda19) and metric-tank will build fine against it.

as for running it, MT (metric tank) relies on cassandra, elasticsearch and nsqd (from the nsq.io project) at runtime. the easiest (hopefully) way to get it all up and running is by using https://github.com/raintank/raintank-docker, so you can follow that readme, with a few changes:

  • you need to update screens/metricTank with
-./metric_tank \
+GODEBUG=gctrace=1,gcpacertrace=1 ./metric_tank \
  • you would also need to create a file screens/benchmark in raintank-docker with contents like the below:
#backfill an hour at 100k, wait for it, and then do realtime at 10k in background, and start http load.
fake_metrics -nsqd-tcp-address nsqd:4150 -statsd-addr statsdaemon:8125 -orgs 100 -keys-per-org 100 -offset 1h -speedup 10 -stop-at-now
fake_metrics -nsqd-tcp-address nsqd:4150 -statsd-addr statsdaemon:8125 -orgs 100 -keys-per-org 100  &
go get github.com/raintank/raintank-metric/inspect-es
go get github.com/tsenart/vegeta
inspect-es --es-addr elasticsearch:9200 -format vegeta-mt | sed 's#localhost:6063#metricTank:6063#' | vegeta attack -rate 500 -duration=120m > attack.out

these tools set up the "contextual circumstances": fake_metrics generates an input stream, inspect-es piped to vegeta triggers a consistent http request load.

you might hit some snags in the process of getting this all up and running though, and i would hate to have you lose much time with this. but i could hop on a call or whatever to help out.

thanks.

@aclements (Member) commented Jun 15, 2016

I think I got it up and running (with many tweaks), but now I'm a little stuck.

  1. What's the easiest way to get a custom-built Go into the metric_tank container? I'd like to add some debugging to the runtime.
  2. How do I monitor the end-to-end latency so I can test any changes?
@Dieterbe (Contributor, Author) commented Jun 15, 2016

Hey @aclements, sorry you had to go through that. What I personally do is compile metric-tank manually on the host, so that the binary is $GOPATH/src/github.com/raintank/raintank-metric/metric_tank/metric_tank, and then make raintank-docker/raintank_code/raintank-metric a symlink to $GOPATH/src/github.com/raintank/raintank-metric/metric_tank/. You can see in fig.yaml, in the metricTank entry, that ./raintank_code/raintank-metric from the host gets mapped to /go/src/github.com/raintank/raintank-metric within the container using a docker directory mapping. In other words, the directory in the container is synchronized with the host, so whatever you compile on the host will be available in the container.

  2. There are two methods:

A) Using vegeta: in the vegeta attack command you already have, change the duration from 120m to about 5m or so to make the benchmark shorter. Then it should be as simple as cat attack.out | vegeta report. Vegeta has a few other interesting reporting modes as well, and I think you can even run reports "streaming", i.e. get reports while the file is still being written to.

B) If grafana is running properly in the container (the easiest way to verify is to attach to the screen session and go to the grafana tab; in fact you may want to check all the tabs just to verify nothing's obviously broken. Usually the last tab I switch to is the benchmark one, which has the console where vegeta is executing and where you can run more vegeta commands). Anyway, if grafana's running you can go to http://localhost:3000, log in as admin:admin and load the "metric tank" dashboard. That's the dashboard I've been using to provide the snapshots above; it should show real-time stats coming in (which it loads from graphite, which also runs in a container).

@aclements (Member) commented Jul 1, 2016

I spent much of this week working on this. I still have ideas on my list, but wanted to post an update.

First, I was able to bring up the whole system with a custom runtime and observe the behavior through Grafana and vegeta report. I don't think I was able to quite reproduce the problem locally, though. My laptop simply didn't have the resources, so it was swapping and over-saturating the CPU and causing all sorts of horrible latency. My workstation had the resources (though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!), but I'm not sure it exhibited the problem. According to vegeta, max latency was ~350ms, but P99 was just 2ms in several runs. I did see small jumps in max "request handle duration" correlated with GC in Grafana, though they were only around 15ms.

However, I was still able to dig into the execution traces and glean some insight. I fixed some problems with high goroutine churn potentially losing track of assist credit, and with the way we build up spare assist credit causing long assists when the assist ratio is high (which it is in Metric Tank). I also did several more experimental changes, which did get rid of the bad checkerboard pattern in the execution trace and break it up into a much more finely interleaved workload. This is certainly what we want to see in the trace, but it didn't seem to move the overall stats. I have a few more ideas on why that may be, but I'll be out next week, so it may be a little bit before I can try them.

I'm going to mail a bunch of my experimental commits, which should get linked here shortly. @Dieterbe, if you have time, it would be good to try running with these, though I'm not sure you'll see much difference.

I did notice one thing you may be able to change on the application side to mitigate some of the problem. I think Metric Tank is allocating very small objects at a very high rate. This is causing the GC scan rate to be much lower than typical, since the memory access pattern is even more random than typical. This is stretching out the GC assists and interacting poorly with some of the tuning that assumes a more typical scan rate. It would be worth doing a heap profile with -alloc_objects to see if there are any sources of large numbers of small allocations that would be easy to address.
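
A minimal sketch of capturing such a profile, in case the HTTP pprof endpoint isn't convenient; the file name and the exact go tool pprof invocation in the comment are illustrative:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// writeHeapProfile dumps a heap profile, which records allocation counts
// as well as in-use memory. Inspect the allocation-site view with:
//   go tool pprof -alloc_objects <binary> heap.prof
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return pprof.Lookup("heap").WriteTo(f, 0)
}

func main() {
	// ... run the workload long enough to be representative ...
	if err := writeHeapProfile("heap.prof"); err != nil {
		log.Fatal(err)
	}
}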

@Dieterbe (Contributor, Author) commented Jul 4, 2016

I don't think I was able to quite reproduce the problem locally, though. My laptop simply didn't have the resources

Yeah, it's not trivial to consistently trigger clear symptoms (large spikes), but more load always seems to help.

though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!)

I run https://github.com/meltwater/docker-cleanup/blob/master/docker-cleanup-volumes.sh periodically in between runs of the stack, this frees up space.

It would be worth doing a heap profile with -alloc_objects to see if there are any sources of large numbers of small allocations that would be easy to address.

Oh boy. I already went through weeks of optimizing memory allocations, and what's left now is a bit harder. IIRC the remaining inefficiencies are about allocating a whole bunch of strings (several for each incoming metric). Optimizing that (splitting the incoming stream into separate data and metadata streams, string interning, using enums, etc.) is on the agenda but will require a broader architectural overhaul.
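
Since string interning comes up here, a minimal single-goroutine sketch of the idea; a real implementation would need synchronization (or per-shard interners) if metrics are processed concurrently:

package intern

// Interner deduplicates strings so that many incoming metrics sharing
// the same name hold one backing string instead of millions of copies,
// which shrinks both the live heap and the number of pointers the GC
// has to trace.
type Interner map[string]string

func New() Interner { return make(Interner) }

// Intern returns a canonical copy of s: the first occurrence is stored,
// later occurrences return the stored string.
func (in Interner) Intern(s string) string {
	if v, ok := in[s]; ok {
		return v
	}
	in[s] = s
	return s
}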

I'll be out next week

Enjoy your time off!

@uluyol (Contributor) commented Jul 4, 2016

(though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!)

Mounting a directory on another partition into the container might help with this.

@ianlancetaylor (Contributor) commented Jul 7, 2016

Postponing to 1.8.

@gopherbot commented Jul 23, 2016

CL https://golang.org/cl/25155 mentions this issue.

@aclements (Member) commented Nov 7, 2017

@petermattis, thanks for re-running! The first set of numbers looks great. The second set might just be showing the natural queuing delays of a fully loaded system.

Your gctrace is interesting. Just based on those three lines, I suspect you're encountering to some degree the same problem with idle fluctuation that I hypothesized in #14812 (comment). Can you paste a few more lines of gctrace? A dozen is probably enough.

gopherbot pushed a commit that referenced this issue Nov 7, 2017
Currently dead goroutines retain their assist credit. This credit can
be used if the goroutine gets recycled, but in general this can make
assist pacing over-aggressive by hiding an amount of credit
proportional to the number of exited (and not reused) goroutines.

Fix this "hidden credit" by flushing assist credit to the global
credit pool when a goroutine exits.

Updates #14812.

Change-Id: I65f7f75907ab6395c04aacea2c97aea963b60344
Reviewed-on: https://go-review.googlesource.com/24703
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
@petermattis commented Nov 7, 2017

Yeah, I'm not concerned about the second set of numbers. I included those to verify that top-end performance wasn't suffering.

Would it be worthwhile to try disabling idle GC workers?

go1.9 gctrace:

gc 290 @256.195s 3%: 0.25+100+0.10 ms clock, 2.0+208/199/0+0.84 ms cpu, 400->410->210 MB, 420 MB goal, 8 P
gc 291 @257.727s 3%: 0.24+104+0.21 ms clock, 1.9+212/207/0+1.7 ms cpu, 400->410->210 MB, 420 MB goal, 8 P
gc 292 @259.293s 3%: 0.25+104+0.11 ms clock, 2.0+191/208/0.24+0.93 ms cpu, 401->411->210 MB, 421 MB goal, 8 P
gc 293 @260.812s 3%: 0.29+106+0.25 ms clock, 2.0+209/202/0+1.7 ms cpu, 401->411->210 MB, 421 MB goal, 8 P
gc 294 @262.345s 3%: 0.078+108+0.18 ms clock, 0.63+214/217/13+1.4 ms cpu, 401->411->211 MB, 420 MB goal, 8 P
gc 295 @263.929s 3%: 0.14+103+3.2 ms clock, 1.0+214/206/0+22 ms cpu, 402->412->210 MB, 422 MB goal, 8 P
gc 296 @265.492s 3%: 0.26+103+0.11 ms clock, 2.1+205/207/0+0.94 ms cpu, 400->410->210 MB, 420 MB goal, 8 P
gc 297 @267.053s 3%: 0.16+103+0.10 ms clock, 1.3+211/204/0+0.86 ms cpu, 401->411->210 MB, 421 MB goal, 8 P
gc 298 @268.609s 3%: 0.18+102+0.19 ms clock, 1.5+216/204/0+1.5 ms cpu, 400->410->210 MB, 420 MB goal, 8 P
gc 299 @270.168s 3%: 0.28+94+0.13 ms clock, 1.9+226/188/0+0.95 ms cpu, 399->410->209 MB, 420 MB goal, 8 P
gc 300 @271.634s 3%: 0.26+108+0.10 ms clock, 2.1+197/216/0+0.83 ms cpu, 398->409->211 MB, 419 MB goal, 8 P
gc 301 @273.202s 3%: 0.21+104+0.11 ms clock, 1.7+196/207/0+0.93 ms cpu, 401->411->210 MB, 422 MB goal, 8 P

go-tip gctrace:

gc 276 @255.549s 2%: 0.041+118+0.16 ms clock, 0.32+116/236/73+1.3 ms cpu, 416->431->216 MB, 434 MB goal, 8 P
gc 277 @257.165s 2%: 0.038+118+0.14 ms clock, 0.30+88/234/86+1.1 ms cpu, 414->429->216 MB, 432 MB goal, 8 P
gc 278 @258.793s 2%: 0.075+125+0.096 ms clock, 0.60+111/248/57+0.77 ms cpu, 415->431->217 MB, 433 MB goal, 8 P
gc 279 @260.381s 2%: 0.043+119+0.17 ms clock, 0.34+92/237/81+1.4 ms cpu, 416->431->216 MB, 435 MB goal, 8 P
gc 280 @261.956s 2%: 0.14+127+0.14 ms clock, 1.1+111/253/57+1.1 ms cpu, 414->431->217 MB, 433 MB goal, 8 P
gc 281 @263.588s 2%: 0.049+132+0.12 ms clock, 0.39+109/264/59+1.0 ms cpu, 416->433->218 MB, 435 MB goal, 8 P
gc 282 @265.224s 2%: 0.051+126+0.14 ms clock, 0.40+76/251/98+1.1 ms cpu, 417->433->217 MB, 437 MB goal, 8 P
gc 283 @266.856s 2%: 0.074+125+0.092 ms clock, 0.59+107/247/59+0.74 ms cpu, 417->432->217 MB, 435 MB goal, 8 P
gc 284 @268.487s 2%: 0.075+128+0.12 ms clock, 0.60+92/255/55+1.0 ms cpu, 417->432->217 MB, 435 MB goal, 8 P
gc 285 @270.125s 2%: 0.041+120+0.14 ms clock, 0.33+151/231/48+1.1 ms cpu, 417->436->220 MB, 435 MB goal, 8 P
gc 286 @271.713s 2%: 0.091+117+0.16 ms clock, 0.73+19/234/139+1.3 ms cpu, 419->434->216 MB, 441 MB goal, 8 P
gc 287 @273.322s 2%: 0.061+120+0.12 ms clock, 0.48+115/240/74+0.96 ms cpu, 415->430->216 MB, 433 MB goal, 8 P
@aclements (Member) commented Nov 8, 2017

Would it be worthwhile to try disabling idle GC workers?

There isn't as much idle variance in the longer gctrace, but if it's not too much trouble, I would love to know what effect disabling idle GC has and what the trace looks like. Most of the cycles in your longer trace still have more assist time than they're supposed to, which I suspect is related to idle GC.

BTW, this patch should disable idle GC:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index 112543db10..febb676a07 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -2290,7 +2290,7 @@ stop:
        // We have nothing to do. If we're in the GC mark phase, can
        // safely scan and blacken objects, and have work to do, run
        // idle-time marking rather than give up the P.
-       if gcBlackenEnabled != 0 && _p_.gcBgMarkWorker != 0 && gcMarkWorkAvailable(_p_) {
+       if false && gcBlackenEnabled != 0 && _p_.gcBgMarkWorker != 0 && gcMarkWorkAvailable(_p_) {
                _p_.gcMarkWorkerMode = gcMarkWorkerIdleMode
                gp := _p_.gcBgMarkWorker.ptr()
                casgstatus(gp, _Gwaiting, _Grunnable)
@@ -2357,7 +2357,7 @@ stop:
        }
 
        // Check for idle-priority GC work again.
-       if gcBlackenEnabled != 0 && gcMarkWorkAvailable(nil) {
+       if false && gcBlackenEnabled != 0 && gcMarkWorkAvailable(nil) {
                lock(&sched.lock)
                _p_ = pidleget()
                if _p_ != nil && _p_.gcBgMarkWorker == 0 {
@petermattis commented Nov 8, 2017

Disabling GC idle didn't seem to affect latencies. The run below shows slightly better 99%-tile latency and slightly worse 95%-tile latency vs go-tip, but I think this is noise.

go-tip+no-gc-idle:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     14999.0      1.1      0.8      3.1      8.4

go-tip:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     14998.5      1.1      0.8      2.9      8.9

gctrace output shows no GC idle time as expected and somewhat lower assist times, though I thought the recent GC changes were intended to eliminate GC assist in most cases.

gc 266 @255.887s 3%: 0.084+155+0.16 ms clock, 0.67+82/309/0+1.3 ms cpu, 421->441->221 MB, 442 MB goal, 8 P
gc 267 @257.543s 3%: 0.086+147+0.11 ms clock, 0.69+86/295/0+0.91 ms cpu, 422->441->220 MB, 443 MB goal, 8 P
gc 268 @259.179s 3%: 0.081+150+0.095 ms clock, 0.65+95/300/0+0.76 ms cpu, 420->440->220 MB, 441 MB goal, 8 P
gc 269 @260.741s 3%: 0.16+154+0.10 ms clock, 1.3+91/308/0+0.83 ms cpu, 421->441->221 MB, 441 MB goal, 8 P
gc 270 @262.389s 3%: 0.070+149+0.093 ms clock, 0.56+81/299/0+0.74 ms cpu, 422->441->221 MB, 443 MB goal, 8 P
gc 271 @264.034s 3%: 0.031+148+0.10 ms clock, 0.25+88/296/0+0.80 ms cpu, 421->440->220 MB, 442 MB goal, 8 P
gc 272 @265.676s 3%: 0.081+145+0.094 ms clock, 0.65+88/289/0+0.75 ms cpu, 420->439->220 MB, 440 MB goal, 8 P
gc 273 @267.307s 3%: 0.20+153+0.10 ms clock, 1.6+95/307/0+0.83 ms cpu, 420->440->221 MB, 440 MB goal, 8 P
gc 274 @268.949s 3%: 0.11+154+0.14 ms clock, 0.91+92/308/0+1.1 ms cpu, 421->440->221 MB, 442 MB goal, 8 P
gc 275 @270.515s 3%: 0.055+154+0.097 ms clock, 0.44+82/306/0+0.78 ms cpu, 421->441->221 MB, 442 MB goal, 8 P
gc 276 @272.171s 3%: 0.12+152+0.16 ms clock, 1.0+98/303/0+1.3 ms cpu, 422->441->221 MB, 443 MB goal, 8 P
gc 277 @273.820s 3%: 0.052+157+0.11 ms clock, 0.42+89/314/0+0.94 ms cpu, 421->441->221 MB, 442 MB goal, 8 P

If I also disable GC assist (small patch to src/runtime/malloc.go) I see:

go-tip+no-gc-idle+no-gc-assist:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     15000.1      0.9      0.8      2.2      3.9

And the corresponding gctrace:

gc 250 @258.759s 3%: 0.031+187+0.088 ms clock, 0.24+0/375/0+0.70 ms cpu, 431->456->227 MB, 451 MB goal, 8 P
gc 251 @260.447s 3%: 0.058+184+0.20 ms clock, 0.46+0/367/0+1.6 ms cpu, 433->458->226 MB, 454 MB goal, 8 P
gc 252 @262.137s 3%: 0.064+181+0.24 ms clock, 0.51+0/353/0+1.9 ms cpu, 432->455->225 MB, 452 MB goal, 8 P
gc 253 @263.857s 3%: 0.093+188+0.070 ms clock, 0.75+0/376/0+0.56 ms cpu, 430->455->227 MB, 450 MB goal, 8 P
gc 254 @265.592s 3%: 0.11+175+0.12 ms clock, 0.90+0/350/0+0.97 ms cpu, 433->456->224 MB, 454 MB goal, 8 P
gc 255 @267.308s 3%: 0.047+175+0.12 ms clock, 0.37+0/350/0+0.96 ms cpu, 429->453->225 MB, 449 MB goal, 8 P
gc 256 @269.018s 2%: 0.036+180+0.17 ms clock, 0.28+0/360/0+1.3 ms cpu, 430->454->225 MB, 450 MB goal, 8 P
gc 257 @270.690s 2%: 0.068+185+0.11 ms clock, 0.54+0/369/0+0.91 ms cpu, 430->455->225 MB, 450 MB goal, 8 P
gc 258 @272.374s 2%: 0.074+193+0.12 ms clock, 0.52+0/386/0+0.86 ms cpu, 431->457->227 MB, 451 MB goal, 8 P
gc 259 @274.111s 2%: 0.11+184+0.25 ms clock, 0.94+0/367/0+2.0 ms cpu, 433->457->225 MB, 454 MB goal, 8 P
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@aclements aclements modified the milestones: Go1.11, Unplanned Jul 3, 2018
@brian-armstrong-discord commented Aug 8, 2018

Hi,

I posted on golang-nuts but was pointed to this issue. I've also got a service that's encountering long request latencies due to CPU starvation when the GC runs. Like a couple others have mentioned, I think the GC may act poorly on long linked lists - I have a 1M-long list/map used as an LRU cache.

Here's one trace captured from the service. I think most look pretty similar.
GC forced
pacer: assist ratio=+8.611156e-002 (scan 960 MB in 15690->21268 MB) workers=1++0.000000e+000
pacer: H_m_prev=11151043456 h_t=+9.500000e-001 H_T=21744534739 h_a=+4.765627e-001 H_a=16465215016 h_g=+1.000000e+000 H_g=22302086912 u_a=+9.186874e-001 u_g=+3.000000e-001 W_a=347113784 goalΔ=+5.000000e-002 actualΔ=-4.734373e-001 u_a/u_g=+3.062291e+000
gc 36 @1197.776s 1%: 0.59+2210+0.29 ms clock, 2.3+5915/2192/0+1.1 ms cpu, 15690->15702->10671 MB, 21268 MB goal, 4 P
scvg7: inuse: 15834, idle: 107, sys: 15942, released: 0, consumed: 15942 (MB)
pacer: sweep done at heap size 10851MB; allocated 179MB during sweep; swept 2026864 pages at +1.906868e-004 pages/byte

This is on Go 1.10.3 with 4 cores. The allocation is slow enough compared to the heap size that the GC runs compulsorily every 120 seconds, but changing GOGC to make it run more often doesn't help the CPU issue, just makes the spikes more frequent.

Are there any fixes for the latency spikes on Go's roadmap?

Edit: Also, I don't understand why the GC runs this concurrent phase in such a short duration. If the GCs are spaced 120 seconds apart, why use just ~2.5s of that doing GC? Services typically don't like spiky load, so it seems silly to burst all the GC at once.

@RLH (Contributor) commented Aug 13, 2018

@brian-armstrong-discord commented Aug 13, 2018

Unfortunately I don't have any traces that work. I captured a few but they just render blank pages in the browser, and I didn't have time to dig into what was wrong with them. There's no ambiguity that the spikes are correlated with the GC behavior though since our metrics (and htop) show very clear CPU spikes that occur when Go reports that it has started its GC.

I haven't tried any Go builds, but it wouldn't be trivial for us to swap in a nonstandard Go binary on our build system. Are there GC fixes in Go > 1.10.3?

We did come up with a very hacky workaround which is to add more machines to the fleet. By bringing the average CPU utilization under 20%, the impact of the CPU spikes is not as pronounced on request latency. The spikes are still present and still completely saturate the CPU, but the smaller load doesn't seem to be as affected by it. However it seems rather unfortunate if Go can only use 1/5 of the computing resources of a machine it's on because of its GC design.

I'll come up with a repro repo that we can share. I actually think it will be pretty easy given the variables we tested, but I'll need to actually test and confirm to be sure.

@RLH (Contributor) commented Aug 14, 2018

gopherbot pushed a commit that referenced this issue Nov 5, 2018
This adds an endpoint to the trace tool that plots the minimum mutator
utilization curve using information on mark assists and GC pauses from
the trace.

This commit implements a fairly straightforward O(nm) algorithm for
computing the MMU (and tests against an even more direct but slower
algorithm). Future commits will extend and optimize this algorithm.

This should be useful for debugging and understanding mutator
utilization issues like #14951, #14812, #18155, #18534, #21107,
particularly once follow-up CLs add trace cross-referencing.

Change-Id: Ic2866869e7da1e6c56ba3e809abbcb2eb9c4923a
Reviewed-on: https://go-review.googlesource.com/c/60790
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
@CAFxX (Contributor) commented Aug 7, 2019

The CPU spike at GC is working as intended. If there are idle Ps then the GC will use them, so you should expect full CPU utilization.

I know this goes way beyond the scope of the Go runtime, but as it's arguably a common configuration I think it warrants mentioning: such behavior (a CPU spike during GC) is not desirable when using CFS bandwidth control cfs_quota_us/cfs_period_us (i.e. when running in basically any linux container with a CPU limit set). The default CFS period is 100ms, and a GC-induced spike in CPU usage that burns through the quota will make the kernel stop scheduling the process until the end of the period, thereby turning what would be a harmless concurrent GC phase into a pretty long (on the order of cfs_period_us, ~100ms) stop-the-world pause.

I am not saying that we need to necessarily do something about this in Go, just that this may not be the most desirable behavior in a pretty common configuration (running a go program in a linux container with cpu limits), and therefore stretching a bit the definition of "working as intended" if the intention is having consistent sub-ms stop-the-world pauses.
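
One common mitigation, not discussed above: size GOMAXPROCS to the container's CPU quota so that even a GC-induced burst cannot exceed the CFS budget. A rough sketch assuming cgroup v1 paths; libraries such as uber-go/automaxprocs do this more robustly:

package main

import (
	"io/ioutil"
	"runtime"
	"strconv"
	"strings"
)

func readInt(path string) (int64, error) {
	b, err := ioutil.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

// capGOMAXPROCSToQuota lowers GOMAXPROCS to match the CFS CPU quota, so
// that all Ps running flat out (e.g. during GC) stay within the budget
// instead of triggering ~100ms throttling stalls.
func capGOMAXPROCSToQuota() {
	quota, err1 := readInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
	period, err2 := readInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
	if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
		return // no limit configured; leave the default alone
	}
	procs := int(quota / period)
	if procs < 1 {
		procs = 1
	}
	if procs < runtime.GOMAXPROCS(0) {
		runtime.GOMAXPROCS(procs)
	}
}

func main() {
	capGOMAXPROCSToQuota()
	// ... rest of the program ...
}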

@larytet commented Aug 7, 2019

In my experience keeping a large map (10M entries) seriously impacts the system latency. Under heavy loads the typical response time is under 100 microseconds. When GC kicks in, the system stops responding altogether for 3-4 seconds. It happens every ~5 minutes. The system has 4 cores. In the gctrace I see that GC frees ~8GB of memory and it takes most of 4s. In this scenario GC consumes all cores.

The flow is: default HTTP server gets a short HTTP GET (under 200 bytes), handler looks in the map (syncmap), generates a short response (under 200 bytes). The map gets a new entry or two every few hours.

One approach would be for the GC to avoid scanning objects which appear to live a long time. A low-level malloc/free API in the unsafe package could help me as well.

I switched to BigCache which solved the problem.
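
For readers wondering why BigCache helps: the core trick is keeping pointers out of the cache so the GC has almost nothing to trace in it. A much-simplified illustration of that idea (not BigCache's actual layout), with no eviction or collision handling:

package cache

import (
	"encoding/binary"
	"hash/fnv"
)

// arenaCache keeps the GC's work small: the index map holds no pointers
// (uint64 -> uint32), and all values live in one big []byte arena, so a
// GC cycle has almost nothing to scan inside the cache.
type arenaCache struct {
	index map[uint64]uint32 // key hash -> offset of entry in data
	data  []byte            // entries: 4-byte little-endian length + payload
}

func newArenaCache() *arenaCache {
	return &arenaCache{index: make(map[uint64]uint32)}
}

func hashKey(key string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	return h.Sum64()
}

func (c *arenaCache) set(key string, val []byte) {
	off := uint32(len(c.data))
	var lenBuf [4]byte
	binary.LittleEndian.PutUint32(lenBuf[:], uint32(len(val)))
	c.data = append(c.data, lenBuf[:]...)
	c.data = append(c.data, val...)
	c.index[hashKey(key)] = off
}

func (c *arenaCache) get(key string) ([]byte, bool) {
	off, ok := c.index[hashKey(key)]
	if !ok {
		return nil, false
	}
	n := binary.LittleEndian.Uint32(c.data[off : off+4])
	return c.data[off+4 : off+4+n], true
}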

@dr2chase (Contributor) commented Aug 7, 2019

If you could turn that large map problem into some sort of a test or benchmark, that would be really useful. There's a short list of few-millisecond hiccup causes, but I don't know of any related to maps.

@larytet commented Aug 7, 2019

@dr2chase This is not quite the "benchmark" you probably have in mind, but it demonstrates the latency problem: https://github.com/larytet-go/map-benchmark
It is a simple HTTP server. When allocated memory reaches ~3GB, the server stops responding for ~5s. I keep the system at around 0% idle time. Arguably the demo, an in-memory database, is not a typical Go microservice; the service is CPU and memory (DDR) bound.

The issue is in no way a new one. I assume the behavior is by design: if the GC sees that the system remains busy, it stays put (the GC "marker" lags allocations) until the amount of allocated memory crosses a threshold (GCPercent?). The threshold forces the GC's hand; the GC consumes all available cores, releases the allocated memory, and returns to hibernation (background). I think I understand some of the trade-offs. The GC's main goal is to let the system serve occasional short spikes with minimal latency impact (?), and devops is expected to add nodes until the latency distribution hits the target.

I do have a strong opinion about this. The GC is excellent when the machine has CPUs to spare and/or the service waits for I/O a lot. I personally prefer predictable latency with a well-defined and reasonable worst case. I wonder why there is no separate list for long-living objects, const pointers, access to runtime/unsafe malloc/free, a buddy allocator in the standard library, more GC tuning options, GC feedback and so on. Is it possible to avoid scanning 10M pointers in a map each and every time the GC wakes up? Should I use a rate limiter in all services? I agree that my opinion is not a very educated one.

In the trace on a 4 core system I see

SCHED 119544ms: gomaxprocs=4 idleprocs=0 threads=11 spinningthreads=0 idlethreads=2 runqueue=0 [0 0 0 0]
GC forced
gc 15 @120.185s 0%: 0.011+1.8+0.006 ms clock, 0.045+0/1.8/1.5+0.027 ms cpu, 2->2->1 MB, 4 MB goal, 4 P
SCHED 120557ms: gomaxprocs=4 idleprocs=4 threads=12 spinningthreads=0 idlethreads=5 runqueue=0 [0 0 0 0]
SCHED 120050ms: gomaxprocs=4 idleprocs=0 threads=11 spinningthreads=0 idlethreads=2 runqueue=0 [0 0 0 0]
gc 15 @113.396s 6%: 0.010+7082+0.005 ms clock, 0.043+8.9/7071/18652+0.020 ms cpu, 2977->3054->1663 MB, 3245 MB goal, 4 P
SCHED 121066ms: gomaxprocs=4 idleprocs=4 threads=12 spinningthreads=0 idlethreads=5 runqueue=0 [0 0 0 0]

Links
#9477
Probably relevant #27410
GC performance http://big-elephants.com/2018-09/unexpected-gc-pauses/

@dr2chase (Contributor) commented Jan 13, 2020

Sorry for the huge delay in replying. The short answer to your problem is "don't call memstat that often, and don't put the call someplace you would mind waiting for an answer". There was an old problem with memstat blocking until the GC finished; we thought we had fixed it, but apparently you have found a new problem with memstat blocking. When I remove the calls to memstat from your test program, all is well.

Even when we fix this new bug, frequent calls to memstat are not a great idea, because it actually stops the world. Not for long, but everything. It has overheads.
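
A minimal sketch of the pattern being suggested: sample the stats in one place on a coarse interval and let hot paths read a cached copy; the interval and the exported field are arbitrary choices:

package main

import (
	"runtime"
	"sync"
	"time"
)

var (
	memStatsMu sync.RWMutex
	memStats   runtime.MemStats
)

// pollMemStats calls ReadMemStats (which briefly stops the world) on a
// coarse interval so request handlers never pay that cost themselves.
func pollMemStats(interval time.Duration) {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		memStatsMu.Lock()
		memStats = m
		memStatsMu.Unlock()
		time.Sleep(interval)
	}
}

// heapAllocMB is what a hot path would call instead of ReadMemStats.
func heapAllocMB() uint64 {
	memStatsMu.RLock()
	defer memStatsMu.RUnlock()
	return memStats.HeapAlloc >> 20
}

func main() {
	go pollMemStats(10 * time.Second)
	// ... serve requests, calling heapAllocMB() as needed ...
	select {}
}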

@genez commented Jan 14, 2020

We have software in production that shows similar behavior.
Even without calling memstat, we experience large delays (+100ms for heap sizes in the range of 7 or 8 GB).
Unfortunately the software is not open-source.

The memory model is composed of a relatively low number (10) of btrees (IIRC https://github.com/emirpasic/gods#redblacktree) that represent partitions of a large dataset of key/value pairs (loaded at startup), where the value is a Go structure and the key is a string.

Once loaded, the dataset is 95% read and 5% updated

From traces it is clearly evident that the large delays are caused by GC sweeping starving the mutators (more frequent) or by the GC STW phase (rare; please see this example):
[trace screenshots]

@dr2chase (Contributor) commented Jan 14, 2020

Which version of Go? There are several bugs that cause few-millisecond pauses for one or more threads; some fixes are planned for 1.15, and one -- failure to preempt call-free long-running loops -- got fixed in 1.14, so if you test with tip and the latency gets better, that was your problem. We also fixed one of the others in 1.14, but it's limited to just a few milliseconds. See #27732 for a discussion of those bugs, and traces showing what they look like -- it would be nice to know whether what's causing your problems is new or known.

A quick summary of the known bugs (from 27732):

  1. Stack allocations are not divisible work; can take a long time to process. (Workaround for 1.14 added https://go-review.googlesource.com/c/go/+/180817/ -- use go build -gcflags 'all=-smallframes' hello.go)
  2. Mark assist applied to global and stack allocations (generally, roots) doesn't generate any credit.
  3. Sweeping problem (#18155 - experimental fix here: https://go-review.googlesource.com/c/go/+/187817/ )
  4. Reschedule (i.e., quantum expires) that discovers a need for GC does not wake other Ps (or does not check the global run queue); i.e., the work to do has expanded, and if any Ps are idle, one should be woken to take over doing "real work". (Fixed in 1.14: https://go-review.googlesource.com/c/go/+/146817/ )
  5. (Related to 4?) Dedicated worker doesn't kick goroutines out of its local run queue.

The workaround for bug 1 will also address bug 2 -- if there's no oversized frames to mark, the size of the accounting error is small, also.

Bug 3 is rare, usually only occurs during program startup or if your code is very very good at not generating garbage; if the heap is completely live, scanning for a free slot can run very long. The workaround is somewhat nutty -- add a counter to the allocator for the problem, and every 64 allocations (counter&63 == 0) drop the allocation on the floor and do it again. This leaves holes in your heap that prevent long stalls for scanning.

If you try these workarounds and one of them works, that's really useful information about which bugs is the bigger problem and needs attention first.

It's also helpful to exclude startup from delay stats for long-running servers; those are a separate category of bug, and (our opinion) not nearly as important as steady-state glitches. They do matter; minor changes in startup heap growth heuristics have a noticeable effect on build speed, so we care, just not as much.

@genez commented Jan 14, 2020

Thanks for the detailed follow-up
I am using Go 1.13.5 on windows x64

I have a reduced/modified version of the sample program that I can publish if needed.
Unfortunately the simplified version exhibits a less evident issue, but it can still be useful for investigations.

Behavior is the same with 1.13 and tip (go version devel +563287ae06 Mon Jan 13 23:35:18 2020 +0000 windows/amd64)

Based on your explanation it looks like we are in scenario (3).
We have a relatively large allocation at startup, then we execute a large number of reads/updates on the data set. Only the second phase is considered in the trace.

I expect that the real live heap is about 95% of the total heap

This is a sample fragment of the output of the trace tool:
[screenshot of go tool trace output]

I will provide source code and other details as soon as possible

@genez commented Feb 18, 2020

please check repro code here: https://github.com/genez/collectiontest
please note this is extremely simplified; production code does more computation (JSON and string handling, logging, etc.) that leads to >50ms latency

in production code we are not using stdlib maps because they do not shrink and cause OOM panics; we use a custom data structure (an rb+ tree where we serialize the items as a byte slice)

on my test bench (Intel i9 16 core, 32 GB of RAM) I can repro >10ms latency
go version go1.13 windows/amd64

it looks like the platform (win, linux) does not matter

[screenshot of go tool trace output]

@dr2chase (Contributor) commented Feb 18, 2020

Short answer: it is definitely sweeping, and the patch for sweep delays works nicely.
Unfortunately the next mole to whack is mark assist, and though I think it's technically working as intended, it's sometimes not good.

You can see these in the MMU graph, here's a particularly terrible one (some look great, this one is awful even with an experimental "fix"). By turning the mark-assist toggle on and off you can see where most of the problem comes from:

[MMU graph screenshots: with and without mark assist included]

@genez commented Feb 19, 2020

Thank you for this insight!
I am not sure I fully grasp what "turning mark assist off" means or how to correctly read those two charts.

I'm sorry but I am not that confident in runtime internals

@dr2chase (Contributor) commented Feb 19, 2020

"Turning mark assist off" refers to the theft-of-time toggles for the graph; if you don't count mark-assist "theft" of mutator time, the worst-case delay is below 10ms, and the longest window in which only 50% of the cpu is available is about 25-30ms.

Turning mark assist "on" (in the accounting), the 50% point is pushed out into seconds.

The chart shows "minimum mutator utilization" (MMU) and is available from "go tool trace". The time on the bottom (x-axis) is the size of a window that is slid across the trace, and the fraction (y-axis) is the minimum for all windows (of that size) fraction of time in which the mutator wanted to run but could not. The toggle buttons ("Include") allow you to specify things that are not mutation -- if it's one of those, then the mutator is "not running". "View" allows you to specify either the entire system or worst-case per-goroutine ("system" runs the window across the threads in a stack, takes the fraction for the entire stack, "Per-goroutine" is instead thread-by-thread).

By default the graph shows worst case, and toggling "Percentile" is supposed to get you more nuanced numbers, but I think there is some sort of a bug because I'm not seeing that no matter how hard I fiddle.

One thing to be careful of is that this is a distillation of your app's memory allocation, and it is quite possible that it sees mark assist problems that your app will not. Spending a lot of time in mark assist happens when the mutator (your app) is allocating quickly during GC -- when that happens a tax is applied to allocations to prevent the GC from falling too far behind. If your app does more work that is not allocation -- or even if it does a lot more allocations but they are often smaller than in this benchmark -- then you won't see the mark assist problem.
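
Not a fix for the assist accounting itself, but since the assist tax scales with how fast the mutator allocates during GC, reusing buffers is the usual application-side lever. A minimal sketch with sync.Pool; the buffer type and request shape are made up:

package main

import (
	"bytes"
	"sync"
)

// bufPool recycles scratch buffers so each request allocates far less,
// which directly reduces the mark-assist work charged to goroutines
// that allocate while a GC cycle is running.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func render(payload []byte) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.Write(payload) // stand-in for the real formatting work
	out := append([]byte(nil), buf.Bytes()...)
	bufPool.Put(buf)
	return out
}

func main() {
	_ = render([]byte("example"))
}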

Spending a lot of time in sweep (your original problem) is a property of a heap that is very well-filled; it's large, and it's 100% live objects. Right after GC completes, a "background sweeper" thread starts to figure out where free space is, but if the mutator allocates before that thread has processed very much memory, then the mutator has to go looking (sweeping) instead. When a lot of the heap is 100% filled, it can look for a while. The patch for that simply says "after you've looked at 100 spans with no luck, give up and allocate a new span".

So, if you're up for patching a Go release, you might try
https://go-review.googlesource.com/c/go/+/187817/
and see if your latency problem goes away (meaning, your real app does other stuff), or if it only gets a little better (you app allocates a lot anyway and we need to improve mark assist.)

Thanks very much for this benchmark. This gives us a nice reproducible case (taken from a real-world app) of sweep being a problem, which will probably get that fix into 1.15, and it provides a nice motivating case for improving mark assist.

@genez commented Feb 24, 2020

I just repeated the test on a real production-grade linux machine with the following Go versions:

  • Go 1.13.8 (the one we are using at the moment in prod)
    [screenshot]

  • Go tip origin/master 218f457
    [screenshot]

It looks like there's a nice improvement
I can upload and share trace output files if needed

I did try to patch the tip sources, but I get compilation errors:

Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.

/opt/go_MASTER/src/runtime/mcentral.go:262:17: too many arguments in call to mheap_.freeSpan
        have (*mspan, bool)
        want (*mspan)

/opt/go_MASTER/src/runtime/mcentral.go:272:19: too many arguments in call to mheap_.alloc
        have (uintptr, spanClass, bool, bool)
        want (uintptr, spanClass, bool)