
runtime: 10ms-26ms latency from GC in go1.14rc1, possibly due to 'GC (idle)' work #37116

Open
thepudds opened this issue Feb 7, 2020 · 8 comments

@thepudds thepudds commented Feb 7, 2020

What version of Go are you using (go version)?

$ go1.14rc1 version

go version go1.14rc1 linux/amd64

Does this issue reproduce with the latest release?

Yes, go1.14rc1.

What operating system and processor architecture are you using (go env)?

Ubuntu 18.04.2 LTS (stock GCE image). 4 vCPUs, 15 GB.

go env Output
$ go1.14rc1 env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/user/.cache/go-build"
GOENV="/home/user/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/user/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/user/sdk/go1.14rc1"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/user/sdk/go1.14rc1/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build656951356=/tmp/go-build -gno-record-gcc-switches"

Background

A recent blog post reported an issue seen in the now more-than-two-year-old Go 1.10 when an LRU held "tens of millions" of entries, with CPU spikes and response time spikes every two minutes, and a p95 response time reportedly hitting 50ms-300ms.

The blog explained the two-minute periodicity with the following, which seems plausible:

we had written the Go code very efficiently and had very few allocations. We were not creating a lot of garbage.
After digging through the Go source code, we learned that Go will force a garbage collection run every 2 minutes at minimum. In other words, if garbage collection has not run for 2 minutes, regardless of heap growth, go will still force a garbage collection.

A separate recent post from one of the engineers theorized the problem was due to a very long linked list, though I am less sure that this comment precisely describes the actual root cause:

we used a large free-list like structure, a very long linked list. The head of the list is maintained as a variable, which means that Go's mark phase must start scanning from the head and then pointer chase its way through the list. For whatever reason, Go does (did?) this section in a single-threaded manner with a global lock held. As a result, everything must wait until this extremely long pointer chase occurs.

An engineer from the same team may have also posted this roughly 2.5 years ago in #14812 (comment):

I've also got a service that's encountering long request latencies due to CPU starvation when the GC runs. Like a couple others have mentioned, I think the GC may act poorly on long linked lists - I have a 1M-long list/map used as an LRU cache.
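The structure described in these quotes — a large map whose entries are also chained on a linked list so the oldest can be expired — can be sketched minimally as follows. Names and details here are my own illustration, not the actual code from the blog's service:

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key string
	val int
}

// LRU pairs a map (for O(1) lookup) with a doubly linked list ordered by
// recency (front = most recently used), so eviction pops from the back.
type LRU struct {
	max   int
	items map[string]*list.Element
	order *list.List
}

func NewLRU(max int) *LRU {
	return &LRU{max: max, items: make(map[string]*list.Element), order: list.New()}
}

func (c *LRU) Get(key string) (int, bool) {
	el, ok := c.items[key]
	if !ok {
		return 0, false
	}
	c.order.MoveToFront(el) // mark as most recently used
	return el.Value.(*entry).val, true
}

func (c *LRU) Put(key string, val int) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, val})
	if c.order.Len() > c.max {
		oldest := c.order.Back() // expire the least recently used entry
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := NewLRU(2)
	c.Put("a", 1)
	c.Put("b", 2)
	c.Put("c", 3) // evicts "a"
	_, ok := c.Get("a")
	fmt.Println(ok) // false
}
```

With tens of millions of entries, the list portion of this structure is exactly the long pointer chain the quotes describe: the GC can only discover each node by following the pointer from the previous one.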

What did you do?

Wrote a simple GC benchmark intended to emulate an LRU with a separate free list.

It's not 100% clear what their exact data structures were: it may have been just an LRU whose elements were linked together to help expire older items, or there may have been an additional free list separate from the LRU, or something else entirely.

Last night, without too much to go on, I put together a simple GC benchmark of an LRU cache along with a separate free list.

The benchmark is here:

https://play.golang.org/p/-Vo1hqIAoNz

Here is a sample run with 50M items in an LRU cache (a map whose items are also linked into a list), 1M items on a separate free list, and a background allocation rate of about 10k/sec:

$ go1.14rc1 build 
$ GODEBUG=gctrace=1 ./lru-gc-benchmark -free-list-items=1000000 -lru-items=50000000 -duration=20m -bg-alloc-sleep=10ms

What did you expect to see?

Ideally, sub-ms worst-case latency impact of GC.

What did you see instead?

This was fairly quick and dirty, and might not be related to the problem reported in the blog at all, but it did occasionally observe higher latency values.

It is important to note that the issue here does not seem to be anywhere near the scale reported in the blog:

  • The amplitude of the latency impact here (10ms-26ms) seems substantially smaller than what was reported on the blog (50ms-300ms).
  • The numbers here are worst-case latencies, whereas the blog reported p95 latencies.

Here is an example where a single 26ms worst-case latency was observed when looking up an element in the LRU during a forced GC.

$ go1.14rc1 build  && GODEBUG=gctrace=1 ./lru-gc-benchmark -free-list-items=1000000 -lru-items=50000000 -duration=20m -bg-alloc-sleep=10ms 

gc 1 @0.011s 5%: 0.066+4.1+0.023 ms clock, 0.26+0.30/3.0/6.9+0.095 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc 2 @0.022s 10%: 0.008+9.0+0.021 ms clock, 0.033+0.33/8.7/10+0.087 ms cpu, 7->7->6 MB, 8 MB goal, 4 P
gc 3 @0.044s 10%: 0.012+13+0.020 ms clock, 0.049+0/12/23+0.083 ms cpu, 13->13->12 MB, 14 MB goal, 4 P
gc 4 @0.087s 10%: 0.009+26+0.030 ms clock, 0.038+0/24/46+0.12 ms cpu, 27->27->25 MB, 28 MB goal, 4 P
gc 5 @0.177s 10%: 0.010+49+0.023 ms clock, 0.042+0/47/89+0.095 ms cpu, 53->54->50 MB, 54 MB goal, 4 P
gc 6 @0.369s 10%: 0.015+98+0.022 ms clock, 0.060+0/98/179+0.089 ms cpu, 107->108->100 MB, 108 MB goal, 4 P
gc 7 @0.767s 10%: 0.012+195+0.021 ms clock, 0.049+0/194/380+0.086 ms cpu, 213->216->200 MB, 214 MB goal, 4 P
gc 8 @1.586s 11%: 7.4+377+0.028 ms clock, 29+71/376/742+0.11 ms cpu, 426->430->398 MB, 427 MB goal, 4 P
gc 9 @3.343s 11%: 0.024+787+0.023 ms clock, 0.099+236/785/1501+0.094 ms cpu, 853->859->795 MB, 854 MB goal, 4 P
gc 10 @7.071s 11%: 0.023+1806+0.023 ms clock, 0.092+497/1803/3513+0.095 ms cpu, 1706->1719->1591 MB, 1707 MB goal, 4 P
gc 11 @14.905s 12%: 49+3796+0.023 ms clock, 196+1084/3796/7271+0.093 ms cpu, 3411->3439->3182 MB, 3412 MB goal, 4 P
gc 12 @31.292s 14%: 226+8162+0.043 ms clock, 904+5028/8161/14256+0.17 ms cpu, 6823->6866->6353 MB, 6824 MB goal, 4 P
LRU count:       50.00M
Free list count: 1.00M
BG allocation:   10000 / sec (approx.)
BG retain count: 10000
Printing slowest lookup times:
i=0          bgAllocs=540000   at=64.026s  lookup=0.009ms
i=4096       bgAllocs=540100   at=64.028s  lookup=0.016ms
i=8944       bgAllocs=540100   at=64.030s  lookup=0.118ms
i=9410       bgAllocs=540100   at=64.031s  lookup=0.123ms
i=52120      bgAllocs=540300   at=64.053s  lookup=0.141ms
i=54295      bgAllocs=540300   at=64.054s  lookup=0.148ms
i=81167      bgAllocs=540400   at=64.068s  lookup=0.185ms
i=141525     bgAllocs=540700   at=64.099s  lookup=0.193ms
i=473297     bgAllocs=542400   at=64.271s  lookup=0.205ms
i=820934     bgAllocs=544200   at=64.451s  lookup=0.233ms
i=1045129    bgAllocs=545300   at=64.572s  lookup=0.323ms
i=49675599   bgAllocs=810200   at=91.600s  lookup=0.651ms
i=59934276   bgAllocs=863000   at=97.000s  lookup=0.661ms
i=63432689   bgAllocs=881700   at=98.900s  lookup=0.696ms
GC forced
i=173370432  bgAllocs=1476800   at=159.688s  lookup=3.407ms
i=173381145  bgAllocs=1476800   at=159.727s  lookup=26.235ms
gc 13 @159.687s 5%: 0.096+17356+0.020 ms clock, 0.38+0.65/17348/34668+0.081 ms cpu, 9642->9659->8463 MB, 12706 MB goal, 4 P
GC forced
gc 14 @297.047s 4%: 0.12+17250+0.021 ms clock, 0.49+30/17250/34376+0.086 ms cpu, 8696->8713->8463 MB, 16926 MB goal, 4 P
GC forced
gc 15 @434.306s 4%: 0.10+16952+0.024 ms clock, 0.41+0.20/16949/33820+0.096 ms cpu, 8696->8712->8463 MB, 16927 MB goal, 4 P

I would guess that the 26ms lookup latency is likely related to this 26ms 'GC (idle)' period, during which the main goroutine did not seem to be on a P:

[execution trace image]

Another sample run:


$ go1.14rc1 build  && GODEBUG=gctrace=1 ./lru-gc-benchmark -free-list-items=1000000 -lru-items=50000000 -duration=20m -bg-alloc-sleep=10ms 

gc 1 @0.011s 6%: 0.014+4.3+0.025 ms clock, 0.056+0.023/3.7/6.6+0.10 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc 2 @0.021s 10%: 0.009+8.4+0.026 ms clock, 0.038+0.34/7.8/11+0.10 ms cpu, 7->7->6 MB, 8 MB goal, 4 P
gc 3 @0.043s 10%: 0.009+12+0.023 ms clock, 0.036+0/11/23+0.092 ms cpu, 13->13->12 MB, 14 MB goal, 4 P
gc 4 @0.084s 11%: 0.010+24+0.023 ms clock, 0.041+0/24/42+0.093 ms cpu, 27->27->25 MB, 28 MB goal, 4 P
gc 5 @0.171s 10%: 0.014+48+0.024 ms clock, 0.058+0/47/93+0.097 ms cpu, 53->54->50 MB, 54 MB goal, 4 P
gc 6 @0.364s 10%: 0.017+97+0.022 ms clock, 0.070+0/96/177+0.088 ms cpu, 107->108->100 MB, 108 MB goal, 4 P
gc 7 @0.758s 10%: 5.6+203+0.022 ms clock, 22+0.49/201/386+0.091 ms cpu, 213->216->200 MB, 214 MB goal, 4 P
gc 8 @1.605s 10%: 0.019+411+0.023 ms clock, 0.078+0/409/796+0.095 ms cpu, 426->432->399 MB, 427 MB goal, 4 P
gc 9 @3.312s 11%: 28+790+0.031 ms clock, 114+194/787/1520+0.12 ms cpu, 853->860->795 MB, 854 MB goal, 4 P
gc 10 @6.964s 12%: 46+1586+0.033 ms clock, 185+730/1585/2988+0.13 ms cpu, 1706->1716->1587 MB, 1707 MB goal, 4 P
gc 11 @14.638s 13%: 79+3760+0.022 ms clock, 316+1164/3758/7168+0.088 ms cpu, 3411->3438->3182 MB, 3412 MB goal, 4 P
gc 12 @31.187s 18%: 196+6723+0.007 ms clock, 785+11516/6720/7379+0.028 ms cpu, 6823->6824->6311 MB, 6824 MB goal, 4 P
LRU count:       50.00M
Free list count: 1.00M
BG allocation:   10000 / sec (approx.)
BG retain count: 10000
Printing slowest lookup times:
i=0          bgAllocs=530900   at=63.844s  lookup=0.006ms
i=100        bgAllocs=530900   at=63.844s  lookup=0.006ms
i=671        bgAllocs=530900   at=63.844s  lookup=0.028ms
i=2407       bgAllocs=530900   at=63.845s  lookup=0.030ms
i=4258       bgAllocs=530900   at=63.847s  lookup=0.120ms
i=80012      bgAllocs=531300   at=63.886s  lookup=0.634ms
i=134006913  bgAllocs=1258500   at=138.086s  lookup=0.679ms
i=145128912  bgAllocs=1321200   at=144.486s  lookup=0.703ms
GC forced
i=170136450  bgAllocs=1454900   at=158.133s  lookup=16.008ms
gc 13 @158.108s 6%: 0.11+17483+0.022 ms clock, 0.44+0.23/17480/34884+0.089 ms cpu, 9639->9656->8463 MB, 12622 MB goal, 4 P
GC forced
gc 14 @295.596s 5%: 0.13+16977+0.024 ms clock, 0.52+28/16977/33847+0.099 ms cpu, 8695->8712->8463 MB, 16927 MB goal, 4 P
GC forced
i=632749868  bgAllocs=3978800   at=432.596s  lookup=11.794ms
gc 15 @432.576s 4%: 0.073+17356+0.021 ms clock, 0.29+0.23/17355/34654+0.084 ms cpu, 8697->8714->8463 MB, 16927 MB goal, 4 P
GC forced
gc 16 @569.938s 4%: 0.17+16911+0.021 ms clock, 0.71+4.9/16908/33760+0.086 ms cpu, 8696->8713->8463 MB, 16926 MB goal, 4 P
GC forced
gc 17 @706.859s 3%: 0.19+17107+0.023 ms clock, 0.76+0.41/17106/34143+0.093 ms cpu, 8697->8714->8463 MB, 16927 MB goal, 4 P
GC forced
i=1327474340 bgAllocs=7763400   at=843.987s  lookup=8.791ms
gc 18 @843.969s 3%: 0.28+17297+0.022 ms clock, 1.1+0.82/17295/34532+0.088 ms cpu, 8696->8713->8463 MB, 16927 MB goal, 4 P
i=1344317800 bgAllocs=7849300   at=861.294s  lookup=13.015ms
...
^C

I suspect this trace section corresponds to the at=432.596s lookup=11.794ms observed latency:

[execution trace image]

In both cases, the traces started roughly 100 seconds after the start of the process, so the timestamps in the log messages and in the images are offset by roughly 100 seconds.

Related issues

This issue might very well be a duplicate. For example, there is a recent list of 5 known issues in #14812 (comment). That said, if this issue is explained by one of the 5 items there, it is not immediately obvious to me which one; some of the issues in that list are resolved now, yet this still happens in go1.14rc1. Also possibly related to #14179, #18155, or others.

@cagedmantis cagedmantis modified the milestones: Backlog, Go1.14 Feb 7, 2020
@cagedmantis cagedmantis commented Feb 7, 2020

This comment has been minimized.

@heschik heschik commented Feb 9, 2020

I think this is working as expected. At the heart of github.com/golang/groupcache/lru is a single large map. AFAIK the map has to be marked all in one go, which will take a long time. In the execution trace above you can see a MARK ASSIST segment below G21. If you click on that I think you'll find that it spent all of its time marking a single object, presumably the map. Austin and others will know better, of course.

Since a real server would need to lock the map, sharding it would be a good idea, which would also reduce the latency impact of a mark.
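The sharding suggested here can be sketched as follows: split one huge map into N independently locked maps keyed by a hash of the key, so each map object is smaller and lock contention drops. This is my own illustrative sketch, not code from the issue:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const numShards = 16

type shard struct {
	mu sync.Mutex
	m  map[string]int
}

// ShardedMap spreads keys over numShards smaller maps. Each shard is a
// smaller object for the GC to mark, and each has its own lock.
type ShardedMap struct {
	shards [numShards]*shard
}

func NewShardedMap() *ShardedMap {
	sm := &ShardedMap{}
	for i := range sm.shards {
		sm.shards[i] = &shard{m: make(map[string]int)}
	}
	return sm
}

func (sm *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return sm.shards[h.Sum32()%numShards]
}

func (sm *ShardedMap) Put(key string, val int) {
	s := sm.shardFor(key)
	s.mu.Lock()
	s.m[key] = val
	s.mu.Unlock()
}

func (sm *ShardedMap) Get(key string) (int, bool) {
	s := sm.shardFor(key)
	s.mu.Lock()
	v, ok := s.m[key]
	s.mu.Unlock()
	return v, ok
}

func main() {
	sm := NewShardedMap()
	sm.Put("x", 42)
	v, ok := sm.Get("x")
	fmt.Println(v, ok) // 42 true
}
```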

(cc @rsc, since I think @cagedmantis made a typo.)

@RLH RLH commented Feb 9, 2020

This comment has been minimized.

@randall77 randall77 commented Feb 9, 2020

This comment has been minimized.

@cagedmantis cagedmantis commented Feb 10, 2020

Thanks @heschik

@aclements aclements commented Feb 10, 2020

Thanks for trying to put something together to recreate this behavior, especially from the little scraps of information we have. :)

Based on the execution traces, I think you're probably right that the idle GC worker is the root of the problem here. The idle worker currently runs until a scheduler preemption, which happens only every 10 to 20ms. The trace shows a fifth goroutine that appears to enter the run queue a moment after the idle worker kicks in, which is presumably the user goroutine.

I agree with @randall77 that this is not likely to be related to the large map, since we do break up large objects when marking.

In general, long linked lists are not very GC-friendly, since the GC has no choice but to walk the list sequentially. That will inherently serialize that part of marking (assuming it's the only work left), though I don't understand the quoted comment about this happening "with a global lock". I don't think there's been anything like a global lock here since we switched to a concurrent GC in Go 1.5.

@aclements aclements modified the milestones: Go1.14, Go1.15 Feb 10, 2020
@thepudds thepudds commented Feb 11, 2020

The Discord blog post described above had said they saw GC-related latency problems with Go 1.9.2 (and also later said elsewhere they had tested but seemingly not deployed Go 1.10), so part of the intent here was to see if the pauses they reported could be reproduced in Go 1.9.2.

Here are some results from running that benchmark above in a loop overnight, with ten 15-minute runs first on 1.9.2 and then on 1.14rc1.

Comparing 1.9.2 to 1.14rc1

  • 1.9.2 was seeing occasional delays in the range 32->64ms, as well as above 64ms. None of those were present in 1.14rc1.
  • The frequency of delays in the range 4->16ms reduced by a factor of roughly 40x with 1.14rc1.
  • The frequency of delays in the range 1->4ms reduced by a factor of roughly 100x with 1.14rc1.
  • This was a latency benchmark and not really focused on throughput, but 1.14rc1 did about 30% more work over the same time interval.
  • Delays in the range 16->32ms were rare in both cases (1 or 0 out of a billion lookups or so), but persist in Go 1.14rc1, which is partly why this issue was opened.

Each line here in these two tables shows the counts of delays in different ranges for one 15-minute run:

Go 1.9.2 counts of delays

 0->1ms: 1034773027  1->4ms: 1640  4->16ms: 188  16->32ms: 0  32->64ms: 1  64ms->∞: 2
 0->1ms: 1020927457  1->4ms: 1232  4->16ms: 144  16->32ms: 1  32->64ms: 2  64ms->∞: 0
 0->1ms: 1027859639  1->4ms: 1724  4->16ms: 189  16->32ms: 0  32->64ms: 1  64ms->∞: 2
 0->1ms: 1023974382  1->4ms: 981   4->16ms: 109  16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1024773755  1->4ms: 1767  4->16ms: 186  16->32ms: 1  32->64ms: 0  64ms->∞: 1
 0->1ms: 1028694483  1->4ms: 1755  4->16ms: 187  16->32ms: 1  32->64ms: 0  64ms->∞: 2
 0->1ms: 1028308359  1->4ms: 2226  4->16ms: 236  16->32ms: 1  32->64ms: 0  64ms->∞: 1
 0->1ms: 1034795065  1->4ms: 1273  4->16ms: 143  16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1033581601  1->4ms: 1643  4->16ms: 172  16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1027894298  1->4ms: 1412  4->16ms: 152  16->32ms: 1  32->64ms: 0  64ms->∞: 0

Go 1.14rc1 counts of delays

 0->1ms: 1359388697  1->4ms: 6     4->16ms: 6    16->32ms: 1  32->64ms: 0  64ms->∞: 0
 0->1ms: 1355108389  1->4ms: 9     4->16ms: 5    16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1344033293  1->4ms: 13    4->16ms: 3    16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1333374534  1->4ms: 21    4->16ms: 5    16->32ms: 1  32->64ms: 0  64ms->∞: 0
 0->1ms: 1340122378  1->4ms: 19    4->16ms: 4    16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1353391580  1->4ms: 14    4->16ms: 6    16->32ms: 1  32->64ms: 0  64ms->∞: 0
 0->1ms: 1354059777  1->4ms: 9     4->16ms: 5    16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1366321263  1->4ms: 14    4->16ms: 3    16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1350138536  1->4ms: 16    4->16ms: 6    16->32ms: 0  32->64ms: 0  64ms->∞: 0
 0->1ms: 1366766376  1->4ms: 15    4->16ms: 1    16->32ms: 0  32->64ms: 0  64ms->∞: 0

I'm including the raw results here so that anyone running this benchmark in the future can see some of the current run-to-run variability.

@thepudds thepudds commented Feb 11, 2020

Also, the 5 worst lookups seen in 1.9.2 across those ten runs:

i=509412400  bgAllocs=4601400   at=486.606s  lookup=260.360ms
i=863625700  bgAllocs=7347600   at=775.641s  lookup=171.862ms
i=4738900    bgAllocs=760700    at=89.159s   lookup=137.284ms
i=163862600  bgAllocs=2008200   at=216.857s  lookup=117.707ms
i=429438200  bgAllocs=4006925   at=423.171s  lookup=117.672ms

As I noted above, this is a simplified benchmark: the lookups are all done sequentially, so a GC-induced delay of >100ms might be recorded in only one measurement here, whereas in a real system with parallel inbound requests it might impact many requests at once and hence show up in multiple response time measurements.
