Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: GC pacing exhibits strange behavior with a low GOGC #37927

Closed
mknyszek opened this issue Mar 18, 2020 · 16 comments
Closed

runtime: GC pacing exhibits strange behavior with a low GOGC #37927

mknyszek opened this issue Mar 18, 2020 · 16 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@mknyszek
Copy link
Contributor

As of Go 1.14, GC pacing changed a bit to alleviate issues with an increased allocation rate, related to #35112.

Unfortunately, the pacing change that was made was made in error. Currently, the code for capping the trigger ratio in gcSetTriggerRatio looks like:

    const minTriggerRatio = 0.6                                                                                                                                                                             
                                                                                                                                                                                                            
    // Set the trigger ratio, capped to reasonable bounds.                                                                                                                                                  
    if triggerRatio < minTriggerRatio {                                                                                                                                                                     
        // This can happen if the mutator is allocating very                                                                                                                                                
        // quickly or the GC is scanning very slowly.                                                                                                                                                       
        triggerRatio = minTriggerRatio                                                                                                                                                                      
    } else if gcpercent >= 0 {                                                                                                                                                                              
        // Ensure there's always a little margin so that the                                                                                                                                                
        // mutator assist ratio isn't infinity.                                                                                                                                                             
        maxTriggerRatio := 0.95 * float64(gcpercent) / 100                                                                                                                                                  
        if triggerRatio > maxTriggerRatio {                                                                                                                                                                 
            triggerRatio = maxTriggerRatio                                                                                                                                                                  
        }                                                                                                                                                                                                   
    }                                                                                                                                                                                                       
    memstats.triggerRatio = triggerRatio

Consider the case where gcpercent == 3 as in the Go 1.14 log found here. If the input trigger ratio is < 0.6, then we'll set it at 0.6... but if it's greater, then we'll always set it to 0.95*3/100, or 0.0285. Given that gcpercent is 3, the latter is the correct behavior. In fact, when we see a 0.6 in a gcpacertrace, we also see a spike in heap use compared to Go 1.13. This is not good, since we're not respecting the trade-off.

I think the fix is to scale the minTriggerRatio just like the maxTriggerRatio, though it probably stops making sense to have a minimum at some point (it gets so close to 0 anyway).

We should fix this and probably backport it to Go 1.14. This breaks the promise of GOGC in some cases, which I think counts as a correctness bug?

CC @aclements @randall77

@mknyszek mknyszek added the NeedsFix The path to resolution is known, but the work has not been done. label Mar 18, 2020
@mknyszek mknyszek added this to the Go1.15 milestone Mar 18, 2020
@mknyszek mknyszek self-assigned this Mar 18, 2020
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/223937 mentions this issue: runtime: ensure minTriggerRatio never exceeds maxTriggerRatio

@mknyszek
Copy link
Contributor Author

@pijng Would you be willing to try out https://golang.org/cl/223937 and to see if it helps?

@ianlancetaylor
Copy link
Contributor

@gopherbot Please open a backport to 1.14. The comment above suggests that we should do this to fix GOGC in 1.14.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #37928 (for 1.14).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@ianlancetaylor
Copy link
Contributor

@mknyszek Should this be for 1.14.1 or can it wait until 1.14.2?

@pijng
Copy link

pijng commented Mar 18, 2020

@mknyszek I did the same tests as before.
It definitely helped and now I don't see any significant difference between 1.13.3 and 1.14 with https://golang.org/cl/223937.

Attaching log just in case.

If additional actions/logs are needed – I'm always ready.

@y3llowcake
Copy link

I believe we hit this issue in a recent upgrade of a high QPS production service to go 1.14. On startup, this service makes a few very large heap allocations and then dials down GOGC. During normal request processing operation it makes a high rate of smaller allocations. We found it is now OOMing where previously it did not.

I can write a constrained example program to demonstrate this behavior if it would help?

@mknyszek
Copy link
Contributor Author

@y3llowcake that would be great! Let me know if you need any help; I think I've nailed down the conditions for reproducing this, but it's a little tricky to construct the right benchmark.

@ianlancetaylor After talking to some folks (and you), it looks like the 1.14.1 release is very close and we shouldn't delay it more. We'll get this in 1.14.2.

@y3llowcake
Copy link

y3llowcake commented Mar 20, 2020

I didn't get a chance to write up that program today, but figured I would at least share the snippet below. It's taken directly from the service in question and is the function we are using to adjust GOGC.

// This function attempts to set GOGC to a value that will maintain a maximum
// **additional** heap size of maxBytes. Call this after making a very large
// heap allocation.
func AdjustGCPercent(maxBytes uint64) (int, error) {
	if maxBytes < 1 {
		return 0, fmt.Errorf("adjust gcpercent: max bytes must be > 0")
	}
	// Force a GC to run so that memstats are up to date.
	runtime.GC()
	s := runtime.MemStats{}
	var diff uint64
	p := 100
	for ; p > 0; p-- {
		debug.SetGCPercent(p)
		runtime.ReadMemStats(&s)
		// https://golang.org/pkg/runtime/#MemStats
		// NextGC is the target heap size of the next GC cycle.
		//
		// The garbage collector's goal is to keep HeapAlloc ≤ NextGC.
		// At the end of each GC cycle, the target for the next cycle
		// is computed based on the amount of reachable data and the
		// value of GOGC.
		diff = s.NextGC - s.HeapAlloc
		if diff < maxBytes {
			break
		}
	}
	if diff < maxBytes {
		return p, nil
	}
	return p, fmt.Errorf("GC percent adjustment failed; last diff %d max %d", diff, maxBytes)
}

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/225637 mentions this issue: [release-branch.go1.14] runtime: ensure minTriggerRatio never exceeds maxTriggerRatio

gopherbot pushed a commit that referenced this issue Mar 27, 2020
… maxTriggerRatio

Currently, the capping logic for the GC trigger ratio is such that if
gcpercent is low, we may end up setting the trigger ratio far too high,
breaking the promise of SetGCPercent and GOGC has a trade-off knob (we
won't start a GC early enough, and we will use more memory).

This change modifies the capping logic for the trigger ratio by scaling
the minTriggerRatio with gcpercent the same way we scale
maxTriggerRatio.

For #37927.
Fixes #37928.

Change-Id: I2a048c1808fb67186333d3d5a6bee328be2f35da
Reviewed-on: https://go-review.googlesource.com/c/go/+/223937
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
(cherry picked from commit d1ecfcc)
Reviewed-on: https://go-review.googlesource.com/c/go/+/225637
Reviewed-by: David Chase <drchase@google.com>
@edaniels
Copy link
Contributor

I have a service on 1.14.2, also with a very high QPS, and started experiencing this after upgrading.

@edaniels
Copy link
Contributor

@y3llowcake did 1.14.2 solve your issues?

@mknyszek
Copy link
Contributor Author

@edaniels when you say "this" do you mean you set GOGC < 100 and you see a spike in memory use? If you're still seeing problems, please file a new issue (if you haven't already), since the patch here seemed to solve a specific problem (which appears to have been the same problem as #37525).

More detail would also be helpful.

@edaniels
Copy link
Contributor

We're seeing hosts go out of memory that have about 61.3GB total. Here's a memory dump when there's about 5GB left:

=== memory stats ===
Memory Stats:
Alloc: 30698342168
TotalAlloc: 3571647324080
Sys: 59102877128
Lookups: 0
Mallocs: 36810087943
Frees: 36543839111
HeapAlloc: 30698342168
HeapSys: 55796236288
HeapIdle: 23416643584
HeapInuse: 32379592704
HeapReleased: 1767669760
HeapObjects: 266248832
StackInuse: 38338560
StackSys: 38338560
MSpanInuse: 408217328
MSpanSys: 783368192
MCacheInuse: 27776
MCacheSys: 32768
BuckHashSys: 61592850
GCSys: 2324183808
OtherSys: 99124662
NextGC: 48695359040
LastGC: 1588280936483426698
PauseTotalNs: 110904381
PauseNs:
	358357, 395484, 268496, 426147, 430009, 512197, 351232, 376371, 468639, 387465, 345571, 334499, 428578, 411265, 381217, 382551, 372943, 252301, 328028, 419602, 459612, 194891, 319705, 607428, 475961, 767992, 577126, 301774, 442705, 563811, 401015, 152673, 145005, 156672, 219362, 182173, 223865, 222223, 141329, 219462, 190199, 168172, 188998, 150842, 201376, 238452, 137399, 37946, 176429, 162584, 232576, 168027, 168412, 285547, 186578, 92097, 364949, 169108, 85049, 308691, 193857, 174312, 208231, 161889, 208067, 203456, 199075, 91059, 64501, 151284, 241966, 327597, 315046, 251170, 205349, 188194, 245368, 189961, 222272, 283295, 183003, 350968, 250886, 214969, 160934, 239626, 255945, 243904, 250315, 318811, 216497, 201451, 160635, 300990, 120597, 265684, 179997, 305779, 510226, 274997, 129059, 202099, 413275, 165699, 255028, 183387, 317511, 328824, 284125, 192652, 247897, 262318, 326472, 225600, 304063, 224090, 251700, 233200, 234734, 246989, 289660, 326316, 244799, 374684, 273956, 373061, 284836, 334782, 207459, 344947, 225856, 404933, 744424, 227856, 188582, 387621, 418437, 283129, 272922, 307916, 230312, 266560, 209388, 352040, 359575, 317218, 302814, 353008, 306782, 241623, 176553, 368828, 245890, 300505, 242951, 383005, 273138, 348770, 412267, 549904, 370591, 374093, 562233, 250576, 477666, 395721, 366534, 262036, 460345, 377615, 301965, 288310, 364779, 498470, 325060, 254182, 312343, 245593, 368077, 351691, 355038, 360711, 330080, 349042, 312731, 409800, 231470, 231494, 290557, 372604, 313610, 812495, 379494, 407504, 267812, 352239, 425772, 462748, 467109, 341275, 334382, 371320, 359384, 387665, 416107, 350079, 252152, 302836, 341458, 340916, 310648, 428497, 248683, 342683, 313744, 399811, 336211, 389972, 384586, 339408, 288506, 639844, 307817, 439204, 448958, 379096, 216607, 292359, 330522, 357550, 377618, 429495, 295428, 364711, 731998, 434532, 454233, 350951, 356772, 385330, 270037, 254745, 329036, 389417, 415166, 400588, 336518, 429203, 215493, 390259, 555167, 322374, 572121, 334960, 339037, 404408, 
PauseEnd:
	1588277651292971065, 1588277774089900913, 1588277896871928518, 1588278019673839520, 1588278142489222415, 1588278265292239166, 1588278388102707395, 1588278510912938699, 1588278633774958575, 1588278756586662048, 1588278879442941001, 1588279002230657896, 1588279125048069656, 1588279247888651545, 1588279370695087788, 1588279493494782308, 1588279616324070931, 1588279739120380877, 1588279861997684287, 1588279984915707576, 1588280107887712549, 1588280199712321506, 1588280282202002549, 1588280405546919269, 1588280518114985714, 1588280553767664385, 1588280589722729279, 1588280629006267480, 1588280690798454263, 1588280813678789480, 1588280936483426698, 1588254680005717832, 1588254691089677540, 1588254701774592372, 1588254712965345544, 1588254723275075840, 1588254734488814390, 1588254745083472508, 1588254755861769033, 1588254767007934392, 1588254778179613767, 1588254788786036725, 1588254800358957346, 1588254811507663304, 1588254823116946695, 1588254833963824458, 1588254845470700019, 1588254856958971155, 1588254867757133295, 1588254879046341833, 1588254890678782581, 1588254901966977235, 1588254913365416505, 1588254925053990219, 1588254936275642160, 1588254948593557268, 1588254960310726485, 1588254972012026948, 1588254983607956612, 1588254994859852303, 1588255007021877442, 1588255019038749132, 1588255030579913990, 1588255042694428564, 1588255054497868851, 1588255067146474638, 1588255080729503785, 1588255094589914312, 1588255109275435655, 1588255129489262307, 1588255230306057164, 1588255271062603346, 1588255367851103932, 1588255487590563104, 1588255595905169865, 1588255716886317411, 1588255837906384090, 1588255958931596387, 1588256079988138442, 1588256201085120923, 1588256322194243152, 1588256443321348905, 1588256564454508509, 1588256685592776507, 1588256806772960114, 1588256927943812064, 1588257049127141696, 1588257170331329389, 1588257291572518047, 1588257412854703041, 1588257534143040078, 1588257655446970359, 1588257757322136830, 1588257878664789734, 1588258000009614689, 1588258121355746970, 1588258242717649214, 1588258364103811759, 1588258446301475022, 1588258567715332910, 1588258689150161968, 1588258785280583298, 1588258882874732479, 1588259004350256364, 1588259125851310260, 1588259247359756070, 1588259357212481174, 1588259478747904789, 1588259600295024089, 1588259721903636513, 1588259843519727838, 1588259965163385977, 1588260086824003854, 1588260208465855066, 1588260330128931943, 1588260451818116908, 1588260573620176407, 1588260695342165079, 1588260817078614829, 1588260938820087525, 1588261060584606538, 1588261174554234205, 1588261296348222646, 1588261418186109731, 1588261540003651488, 1588261661846564685, 1588261783708563975, 1588261905566686511, 1588262027444693778, 1588262149328878143, 1588262271248439030, 1588262393216729223, 1588262463126333385, 1588262584740747049, 1588262706818903329, 1588262828804352556, 1588262950787624452, 1588263072776056578, 1588263194806991426, 1588263316836312031, 1588263438873243830, 1588263560949917861, 1588263683016295923, 1588263805091327030, 1588263927237410621, 1588264049376253435, 1588264171504381515, 1588264293692345679, 1588264415811281162, 1588264537965573755, 1588264660134699662, 1588264782292188482, 1588264904469705921, 1588265026649750050, 1588265148849941109, 1588265271225028318, 1588265393536959415, 1588265516234016898, 1588265638454959710, 1588265760759225266, 1588265883021888289, 1588266005418463732, 1588266127669826590, 1588266249898340765, 1588266372165943817, 1588266494401991430, 1588266616664445418, 1588266738904010252, 1588266861141084908, 1588266983412394473, 1588267105684823043, 1588267228074557847, 1588267350366388701, 1588267472681211500, 1588267595022872536, 1588267717312735911, 1588267839617022982, 1588267961913090274, 1588268084249094186, 1588268206577676060, 1588268328919727927, 1588268451265490048, 1588268573627099852, 1588268695976939143, 1588268818360447815, 1588268940753071522, 1588269063159034072, 1588269185714587915, 1588269308200980696, 1588269430590212404, 1588269553001060401, 1588269675478026159, 1588269797909794260, 1588269920350910562, 1588270042851429757, 1588270165322287525, 1588270287789203696, 1588270410241591805, 1588270532772007920, 1588270655304762510, 1588270777835257221, 1588270900331948322, 1588271022847469121, 1588271145393843505, 1588271267904536199, 1588271390420630355, 1588271513022753783, 1588271635574649007, 1588271758137300915, 1588271880775230182, 1588272003411987577, 1588272126016195805, 1588272248627523587, 1588272371282056886, 1588272493910695683, 1588272616552412630, 1588272739246706488, 1588272861894464703, 1588272984620410965, 1588273107334898763, 1588273230025286974, 1588273352881377832, 1588273475656290409, 1588273598410941698, 1588273721153281251, 1588273843942273183, 1588273966756170356, 1588274089521872267, 1588274212318701167, 1588274335088778195, 1588274457874173831, 1588274580690545619, 1588274703585278155, 1588274826372574862, 1588274949168593174, 1588275071941687674, 1588275194735678811, 1588275317499337785, 1588275440462280599, 1588275563266771938, 1588275686073136124, 1588275808912725325, 1588275931774628667, 1588276054652138468, 1588276177446076578, 1588276300296382802, 1588276423074424764, 1588276545866045427, 1588276668665411613, 1588276791486666410, 1588276914354828705, 1588277037145945630, 1588277159948564843, 1588277282862056861, 1588277405707420939, 1588277528505876727, 
NextGC: 543
NextGC: 0
GCCPUFraction: 0.005853
EnableGC: true
DebugGC: false
BySize:
	0: 
		Size: 0
		Mallocs: 0
		Frees: 0
	1: 
		Size: 8
		Mallocs: 249495444
		Frees: 248772717
	2: 
		Size: 16
		Mallocs: 8505508246
		Frees: 8457368646
	3: 
		Size: 32
		Mallocs: 10623688479
		Frees: 10532291781
	4: 
		Size: 48
		Mallocs: 5525764630
		Frees: 5477937705
	5: 
		Size: 64
		Mallocs: 2209711844
		Frees: 2188264580
	6: 
		Size: 80
		Mallocs: 1050495428
		Frees: 1048309572
	7: 
		Size: 96
		Mallocs: 1178713086
		Frees: 1176119584
	8: 
		Size: 112
		Mallocs: 91607123
		Frees: 91238150
	9: 
		Size: 128
		Mallocs: 1828895711
		Frees: 1804519657
	10: 
		Size: 144
		Mallocs: 29677314
		Frees: 29226185
	11: 
		Size: 160
		Mallocs: 248909072
		Frees: 248472714
	12: 
		Size: 176
		Mallocs: 86977376
		Frees: 86833906
	13: 
		Size: 192
		Mallocs: 54657307
		Frees: 54533570
	14: 
		Size: 208
		Mallocs: 19666613
		Frees: 19592086
	15: 
		Size: 224
		Mallocs: 16746575
		Frees: 16690598
	16: 
		Size: 240
		Mallocs: 36042133
		Frees: 36008927
	17: 
		Size: 256
		Mallocs: 54632170
		Frees: 54402087
	18: 
		Size: 288
		Mallocs: 270858085
		Frees: 270171185
	19: 
		Size: 320
		Mallocs: 48541994
		Frees: 48306245
	20: 
		Size: 352
		Mallocs: 1167405940
		Frees: 1146352314
	21: 
		Size: 384
		Mallocs: 26730653
		Frees: 26622280
	22: 
		Size: 416
		Mallocs: 131673712
		Frees: 129922763
	23: 
		Size: 448
		Mallocs: 9548804
		Frees: 9525234
	24: 
		Size: 480
		Mallocs: 18031428
		Frees: 17990450
	25: 
		Size: 512
		Mallocs: 324917256
		Frees: 324574060
	26: 
		Size: 576
		Mallocs: 32720815
		Frees: 32618780
	27: 
		Size: 640
		Mallocs: 18280523
		Frees: 18181727
	28: 
		Size: 704
		Mallocs: 11431393
		Frees: 11320366
	29: 
		Size: 768
		Mallocs: 8848012
		Frees: 8793961
	30: 
		Size: 896
		Mallocs: 12483341
		Frees: 12378830
	31: 
		Size: 1024
		Mallocs: 14883680
		Frees: 14757823
	32: 
		Size: 1152
		Mallocs: 6766228
		Frees: 6725101
	33: 
		Size: 1280
		Mallocs: 9741973
		Frees: 9717094
	34: 
		Size: 1408
		Mallocs: 5098334
		Frees: 5008796
	35: 
		Size: 1536
		Mallocs: 177033281
		Frees: 176835696
	36: 
		Size: 1792
		Mallocs: 5966179
		Frees: 5925807
	37: 
		Size: 2048
		Mallocs: 4288405
		Frees: 4277545
	38: 
		Size: 2304
		Mallocs: 2223893
		Frees: 2202602
	39: 
		Size: 2688
		Mallocs: 8843092
		Frees: 8788269
	40: 
		Size: 3072
		Mallocs: 1442307
		Frees: 1436368
	41: 
		Size: 3200
		Mallocs: 244758
		Frees: 243365
	42: 
		Size: 3456
		Mallocs: 639841
		Frees: 637008
	43: 
		Size: 4096
		Mallocs: 3266811
		Frees: 3241695
	44: 
		Size: 4864
		Mallocs: 1145067
		Frees: 1140108
	45: 
		Size: 5376
		Mallocs: 393858
		Frees: 392986
	46: 
		Size: 6144
		Mallocs: 1606474
		Frees: 1587198
	47: 
		Size: 6528
		Mallocs: 95866
		Frees: 95398
	48: 
		Size: 6784
		Mallocs: 127991
		Frees: 127489
	49: 
		Size: 6912
		Mallocs: 66094
		Frees: 65828
	50: 
		Size: 8192
		Mallocs: 2417313
		Frees: 2410711
	51: 
		Size: 9472
		Mallocs: 833378
		Frees: 830654
	52: 
		Size: 9728
		Mallocs: 39758
		Frees: 39604
	53: 
		Size: 10240
		Mallocs: 178127
		Frees: 177390
	54: 
		Size: 10880
		Mallocs: 59864
		Frees: 59464
	55: 
		Size: 12288
		Mallocs: 506783
		Frees: 504650
	56: 
		Size: 13568
		Mallocs: 428978
		Frees: 427417
	57: 
		Size: 14336
		Mallocs: 37825
		Frees: 37589
	58: 
		Size: 16384
		Mallocs: 2090474
		Frees: 2084767
	59: 
		Size: 18432
		Mallocs: 112612
		Frees: 111821
	60: 
		Size: 19072
		Mallocs: 931371
		Frees: 899217

=== end of memory stats===

In go 1.13.x, we had virtually identical QPS and alloc rate. It appears right now when the alloc rate is sustained at a high rate long enough, we end up running out of memory even though there are objects that are available to be GCd.

@mknyszek
Copy link
Contributor Author

@edaniels Thanks for the MemStats dump. The only thing that jumps out at me as that there's a lot of free and unused memory at the point you took this snapshot. Unfortunately that doesn't tell us how we got here.

Could you please file a new issue? Now that you're sharing details, it'll be a lot easier to reference all the related details if they're concentrated on a new one, rather than continuing the conversation on a closed (and possibly unrelated) issue.

A few questions:

Do you set GOGC < 100? I'm trying to determine if you might be experiencing the same problem as the one described in this issue.

How did you determine that there were objects available to be GC'd? What platform are you running on?

Getting a GC trace and scavenge trace would probably be the most useful thing now (GODEBUG=gctrace=1,scavtrace=1), if you're able.

@y3llowcake
Copy link

1.14.2 fixed the issue for me, thank you

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

6 participants