runtime: system experienced long pauses during GC marking #27410

Open
fmstephe opened this Issue Aug 31, 2018 · 6 comments

fmstephe commented Aug 31, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.10.3 linux/amd64

Does this issue reproduce with the latest release?

We don't know. Unfortunately we can't run this exact experiment again.

What operating system and processor architecture are you using (go env)?

linux/amd64

What did you do?

We were running a tool migrating data between two database systems. It maintains two in-memory caches (implemented as sharded maps), which became very large. As the heap grew we experienced very uneven performance.
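For illustration only, since the tool's actual cache code is not included in this issue, a sharded map of the kind described above looks roughly like the sketch below; the shard count, key and value types, and all names are assumptions, not our real implementation.

```go
// Minimal sketch of a sharded map cache, assuming string keys and
// []byte values. shardCount and every name here is illustrative.
package cache

import (
	"hash/fnv"
	"sync"
)

const shardCount = 64

type shard struct {
	mu sync.RWMutex
	m  map[string][]byte
}

// ShardedMap spreads keys across independent shards so concurrent
// readers and writers do not contend on a single lock.
type ShardedMap struct {
	shards [shardCount]shard
}

func New() *ShardedMap {
	s := &ShardedMap{}
	for i := range s.shards {
		s.shards[i].m = make(map[string][]byte)
	}
	return s
}

func (s *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &s.shards[h.Sum32()%shardCount]
}

func (s *ShardedMap) Get(key string) ([]byte, bool) {
	sh := s.shardFor(key)
	sh.mu.RLock()
	defer sh.mu.RUnlock()
	v, ok := sh.m[key]
	return v, ok
}

func (s *ShardedMap) Put(key string, value []byte) {
	sh := s.shardFor(key)
	sh.mu.Lock()
	defer sh.mu.Unlock()
	sh.m[key] = value
}
```

With tens of gigabytes of live entries held in maps like these, the GC mark phase has a very large amount of live data to scan.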

We ran a series of experiments capping the size of the caches to compare the system operation at different cache sizes.

We took traces of the running system to compare the behaviour and see if we could get a better understanding of what was causing the performance degradation.

Unfortunately we had to force the garbage collector to run during the trace to ensure that we would actually see it, so the traces contain an artificial GC cycle, which won't behave exactly the same as a naturally occurring one.
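For context, the pattern we used was roughly the sketch below (the file name and sleep durations are illustrative, not the tool's actual code): start the runtime trace, let the workload run for a bit, then call runtime.GC() so the cycle is guaranteed to land inside the trace window.

```go
// Rough sketch of forcing a GC cycle while an execution trace is being
// captured. The trace file name and timings are illustrative only.
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// Let the workload run for a moment, then force a collection so the
	// mark phase is captured inside the trace window. A forced cycle may
	// not behave exactly like a naturally triggered one.
	time.Sleep(2 * time.Second)
	runtime.GC()
	time.Sleep(2 * time.Second)

	// The resulting file can then be inspected with: go tool trace trace.out
}
```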

What did you expect to see?

Consistent performance as the heap and caches grew, without long pauses during garbage collection.
What did you see instead?

Repeated, dramatically long gaps during the mark phase, on the order of 10-20 milliseconds. During these periods no network events occurred, effectively pausing the system.

We have written up an account of what we saw here:

https://docs.google.com/document/d/1-DO8jp9q0ONuo9nL1EXVm9unzMHLA0KUkrisAWxjaRo/edit?usp=sharing

My company is reluctant to attach the traces directly here, as they contain stack traces from private code, but we are happy to send them directly to anyone who would like to see them.

@ALTree ALTree changed the title from System experienced long pauses during GC marking to runtime: system experienced long pauses during GC marking Aug 31, 2018

@ALTree ALTree added this to the Go1.12 milestone Aug 31, 2018

Member

ysmolsky commented Sep 4, 2018

The Google document mentioned above cannot be accessed.

What was the approximate size of the allocated heap?

fmstephe commented Sep 4, 2018

@ysmolsky We experienced similar-sized pauses during the mark phase. We tested heaps of the following sizes:

8-13 GB
20-30 GB
60-70 GB

We saw a lot of 5-20ms pauses (where a pause is measured as a period in which no network events are recorded in the trace). The larger heap sizes saw a greater performance impact, but it looks like that was simply because the mark phase ran for much longer.

In particular we saw a huge number of Idle-GC slices in the trace. If I were to guess, I would say the scheduler was struggling to schedule all of the Idle-GC goroutines as well as meaningful work.

If you request access to the Google Doc, I will grant it as soon as I can.
I need to work out how to publish it without using Google Docs, as that format makes the doc really hard to read.

Member

agnivade commented Sep 19, 2018

Likely a dupe of #27732. @dr2chase ?

fmstephe commented Sep 19, 2018

We saw no Mark assists in the trace. I specifically looked for them, because that was my first thought.

The way we forced the GC to run during the trace also makes assists less likely, because an early GC run is less likely to conclude that the mutators are outpacing it and ask them for assists (that is my understanding).

The unusual feature of this test is that the system was running on a machine with 48 cores with GOMAXPROCS unset. It looks like the runtime was trying to use all of the cores during GC, but most GC slices are idle.
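As an aside, if we wanted to test whether the idle GC workers spread over all 48 cores are part of the problem, a run with a capped GOMAXPROCS would look something like the sketch below; the value 8 is an arbitrary example, not a recommendation.

```go
// Illustrative only: cap the number of Ps at startup instead of letting
// the runtime default to all 48 cores of the machine.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	prev := runtime.GOMAXPROCS(8) // equivalent to running with GOMAXPROCS=8 set in the environment
	fmt.Printf("GOMAXPROCS changed from %d to %d\n", prev, runtime.GOMAXPROCS(0))
}
```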

fmstephe commented Oct 5, 2018

The blog post has been published:

http://big-elephants.com/2018-09/unexpected-gc-pauses/

If anything written there is incorrect I am happy to make edits.

It should be easier to access than the Google Doc.
