runtime: smoother growth rate transition yields great results #64877
Comments
I know I wrote all of the explanation and everything, but actually, just keeping the growth rate at x2 has even better results, so why not? Why lower it in the first place? Keeping it at x2 was better in every parameter than using the golden ratio for any capacity above 1024, so something here just wants a high growth rate.
Golden ratio benchmark:
bash-3.2$ GOROOT=~/git/go GOPATH=~/go ~/git/go/bin/go test go-playground/pkg -bench .
goos: darwin
goarch: arm64
pkg: go-playground/pkg
BenchmarkGrowSlice-8 1 8139819000 ns/op 48032196000 B/op 18695 allocs/op
PASS
ok go-playground/pkg 8.511s
Keeping it at x2 benchmark:
bash-3.2$ GOROOT=~/git/go GOPATH=~/go ~/git/go/bin/go test go-playground/pkg -bench .
goos: darwin
goarch: arm64
pkg: go-playground/pkg
BenchmarkGrowSlice-8 1 7253913416 ns/op 37604270040 B/op 13737 allocs/op
PASS
ok go-playground/pkg 7.465s
It seems the allocated bytes decrease more slowly for a bigger growth rate. This could be because, with a higher growth rate, more of the appended values fit into the previously grown slice; so the complexity analysis I wrote has to be amended somehow.
Documentation is in CL 347917 and the linked discussions (the groups discussion linked to in the CL description, and the graphs linked to in the last comment). So if I understand correctly, your change is just keeping the growth factor a bit larger for a bit longer?
Sounds like fewer total reallocations, which of course leads to less total memory and work. We'd get an even better effect, measured this way, by just always doing 2x reallocations. The trick here is that "total memory" means total allocated memory. It does not reflect live but unused memory, which will rise as we make fewer, larger reallocations and thus leave more tail waste. How do we know what that effect will be?
I don't think this needs to be in the proposal process, as it doesn't involve any API. We can just discuss it on this issue and decide.
Added some benchmarks from the package:
package pkg
import (
"runtime/debug"
"runtime/metrics"
"testing"
)
func GetMetricsForLog() any {
// Get descriptions for all supported metrics.
m := []metrics.Sample{
{Name: "/gc/heap/frees:bytes"},
{Name: "/gc/heap/allocs:bytes"},
{Name: "/gc/heap/live:bytes"},
{Name: "/cpu/classes/gc/total:cpu-seconds"},
}
metrics.Read(m)
return map[string]any{
"gc_heap_frees": m[0].Value.Uint64(),
"gc_heap_allocs": m[1].Value.Uint64(),
"gc_heap_live": m[2].Value.Uint64(),
"cpu_gc_total": m[3].Value.Float64(),
}
}
// From runtime/sizeclasses.go:
// We only test allocations above it, since anything below it has a small allocation price.
const _MaxSmallSize = 32768
func BenchmarkGrowSlice(b *testing.B) {
baseSlice := make([]byte, _MaxSmallSize)
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
allSlices := make([][]byte, 0, 1000)
// Test the average across a range of target sizes.
for k := _MaxSmallSize * 10; k < 1<<24; k += 10001 {
debug.FreeOSMemory()
testSlice := baseSlice
for j := 0; j < k; j += 10 {
// Unroll the loop to reduce dependency on the loop time.
testSlice = append(testSlice, 1)
testSlice = append(testSlice, 1)
testSlice = append(testSlice, 1)
testSlice = append(testSlice, 1)
testSlice = append(testSlice, 1)
testSlice = append(testSlice, 1)
testSlice = append(testSlice, 1)
testSlice = append(testSlice, 1)
testSlice = append(testSlice, 1)
testSlice = append(testSlice, 1)
}
// keep a pointer to prevent garbage collection
allSlices = append(allSlices, testSlice)
}
}
b.StopTimer()
debug.FreeOSMemory()
b.Log(GetMetricsForLog())
}
bash-3.2$ GOROOT=/usr/local/go /usr/local/go/bin/go test go-playground/pkg -bench .
goos: darwin
goarch: arm64
pkg: go-playground/pkg
BenchmarkGrowSlice-8 1 13316692167 ns/op 78178604488 B/op 36868 allocs/op
--- BENCH: BenchmarkGrowSlice-8
growSlice_test.go:59: map[cpu_gc_total:1.841684158 gc_heap_allocs:78178862360 gc_heap_frees:78178704800 gc_heap_live:157872]
PASS
ok go-playground/pkg 13.951s
bash-3.2$ GOROOT=~/git/go GOPATH=~/go ~/git/go/bin/go test go-playground/pkg -bench .
goos: darwin
goarch: arm64
pkg: go-playground/pkg
BenchmarkGrowSlice-8 1 12797637958 ns/op 72003326144 B/op 24958 allocs/op
--- BENCH: BenchmarkGrowSlice-8
growSlice_test.go:59: map[cpu_gc_total:1.736739948 gc_heap_allocs:72003585408 gc_heap_frees:72003421600 gc_heap_live:164096]
PASS
ok go-playground/pkg 13.300s
So it does seem to use a bit more space that is not freed, but not much of it.
Oh, I missed that groups discussion, oops. It also explains why the phi reasoning would not work as of now: other slices may point to the previously allocated slice, so it can't be reused.
In general, more aggressive slice growth seems to give better time and total memory consumption at the cost of a larger live heap, so we'd have to assume both the range and distribution of typical slice sizes, and the desired memory-limit-to-performance ratio.
From your benchmark results, it looks like a ~4% CPU improvement at the cost of a ~4% live memory increase. The benchmark doesn't have much live data itself, so it doesn't measure the effect that a larger live set has. A larger live set implies a higher GC scanning cost (for slices whose element types contain pointers, anyway) and possibly more frequent GCs.
Proposal Details
Proposal
In the growslice function, smoothly decrease the growth rate of newCap from x2 to x1.125.
This is the benchmarked code change -
Edit: fixed a bug in the formula
Benchmark
So about x1.5 fewer allocations, x1.08 less memory, and x1.05 less time.
The results are consistent with small variability, and also work with different capacity ranges.
This is the benchmark code -
Background
Up until go1.17, the growth rate was x2 if the capacity was below 1024, and x1.25 if it was above.
In go1.18 this was smoothed somewhat: https://go-review.googlesource.com/c/go/+/347917
I was not able to find documentation of the rationale and reasoning behind that.
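The two regimes can be sketched roughly as follows. This is a simplified model of my own, not the runtime's exact code: the real growslice also rounds the result up to an allocator size class, and the go1.18 threshold in the runtime source is 256 elements rather than 1024.

```go
package main

import "fmt"

// growPre118 sketches the pre-go1.18 rule: double the capacity below
// 1024 elements, otherwise grow by 25% until the needed capacity is
// reached. The jump from x2 to x1.25 at the threshold is abrupt.
func growPre118(oldCap, needed int) int {
	if oldCap < 1024 {
		return 2 * oldCap
	}
	newCap := oldCap
	for newCap < needed {
		newCap += newCap / 4
	}
	return newCap
}

// grow118 sketches the smoothed rule from CL 347917: double below a
// threshold of 256 elements, then transition gradually from x2 toward
// x1.25 as the capacity grows, via newCap += (newCap + 3*threshold)/4.
func grow118(oldCap, needed int) int {
	const threshold = 256
	if oldCap < threshold {
		return 2 * oldCap
	}
	newCap := oldCap
	for newCap < needed {
		newCap += (newCap + 3*threshold) / 4
	}
	return newCap
}

func main() {
	for _, c := range []int{128, 256, 1024, 1 << 20} {
		fmt.Printf("cap %8d: pre-1.18 -> %8d, 1.18 -> %8d\n",
			c, growPre118(c, c+1), grow118(c, c+1))
	}
}
```

Note that at the 256-element threshold grow118 still yields exactly x2 (256 -> 512), which is what makes the transition smooth rather than a step.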
Rationale
The rationale behind growing the capacity exponentially is to reduce the number of required reallocations.
It makes the number of reallocations grow with an amortized complexity of O(log(n)), where n is the length. But this has a trade-off: the greater the exponent (growth factor), the more redundant capacity is allocated and never used.
That waste grows in proportion to the last reallocation size, hence O(exp(log(n))) ~ O(n).
To mitigate this, Go reduced the growth rate for large slices, where memory starts to matter.
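This trade-off can be illustrated with a toy simulation of my own (it ignores size classes and runtime rounding): the reallocation count falls as the growth factor rises, while the final unused tail capacity tends to grow with it.

```go
package main

import "fmt"

// simulateAppends counts reallocations and the final unused (tail)
// capacity when growing a slice one element at a time up to n elements
// with a fixed growth factor. This is a toy model: the real growslice
// rounds capacities up to allocator size classes.
func simulateAppends(n int, factor float64) (reallocs, tailWaste int) {
	capNow := 1
	for length := 1; length <= n; length++ {
		if length > capNow {
			// +1 guarantees progress even for factors near 1.
			capNow = int(float64(capNow)*factor) + 1
			reallocs++
		}
	}
	return reallocs, capNow - n
}

func main() {
	for _, f := range []float64{1.25, 1.5, 2.0} {
		r, w := simulateAppends(1_000_000, f)
		fmt.Printf("factor %.2f: %d reallocs, %d unused slots\n", f, r, w)
	}
}
```

The reallocation count behaves like log(n)/log(factor), while the tail waste is bounded by (factor-1)*n, matching the O(log(n)) vs O(n) trade-off described above.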
This change is too abrupt, both before the go1.18 fix and after, causing two issues -
The proposal is to gradually decrease the growth factor from x2 to x1.125, across typical sizes from 1<<10 up to 1<<24, above which the growth rate is kept at x1.125. To keep using integers, I quantized it to steps of 1/16.
Keeping a high growth rate matters: since it's a difference in the exponent, it leads to substantial performance differences.
Further investigation
The golden ratio
Some claim that having a growth rate below the golden ratio (~1.61) is optimal for memory reuse:
https://stackoverflow.com/questions/1100311/what-is-the-ideal-growth-rate-for-a-dynamically-allocated-array
So it might be beneficial to quickly decrease to something below it, and only then start the gradual decrease.
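The intuition behind the golden-ratio claim can be simulated with a short sketch of my own. After each reallocation the old buffer is freed; a contiguous allocator can place a later buffer in that reclaimed region only once the accumulated freed space reaches the size of the next request. That eventually happens precisely when g*(g-1) < 1, i.e. when g is below the golden ratio.

```go
package main

import "fmt"

// canEventuallyReuse simulates repeated reallocation with growth
// factor g and reports the first step at which the sum of all
// previously freed buffers is at least as large as the next buffer,
// i.e. when a contiguous allocator could reuse the reclaimed space.
// Returns -1 if that never happens within maxSteps.
func canEventuallyReuse(g float64, maxSteps int) int {
	size := 1.0  // current buffer size
	freed := 0.0 // total size of buffers freed so far
	for step := 0; step < maxSteps; step++ {
		next := size * g
		if freed >= next {
			return step
		}
		freed += size // the old buffer is freed after the copy
		size = next
	}
	return -1
}

func main() {
	for _, g := range []float64{1.25, 1.5, 1.618, 2.0} {
		fmt.Printf("g=%.3f: reuse possible at step %d\n",
			g, canEventuallyReuse(g, 100))
	}
}
```

With g=2 the freed space (2^k - 1) always trails the next request (2^(k+1)), so reuse never becomes possible; smaller factors catch up within a few steps. As the comment above notes, though, this reasoning breaks down when other slices still point at the old backing array, since then it cannot be freed at all.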
Decrease rate
The growth factor decreases linearly with the number of allocations, which means it decreases only logarithmically with the growing capacity.
To do this I spread the decrease across the powers of 2 of the capacity, even though the actual growth slows below doubling.
There might be a better strategy, but it might not be worth the complexity.
Keep it high
Keeping the growth rate high at the beginning gave tremendously better results, but I don't know the exact form it should take.