There is currently a large performance gap between allocating N elements of type T individually and allocating a single slice of N elements of type T (even in the case in which T contains no pointers and the size of T is equal to the size class used for T).
Just to give an idea, the following benchmark[^1] shows the performance gap between the two for a varying number of elements N (old is N individual allocations; new is a single allocation of a slice of N elements); a sketch of such a benchmark follows the table:
```
name          old time/op  new time/op  delta
Allocs/1       65.4ns ± 0%  65.2ns ± 0%     ~     (p=0.079 n=4+5)
Allocs/2        100ns ± 3%    74ns ±11%  -25.60%  (p=0.008 n=5+5)
Allocs/4        184ns ± 1%    84ns ± 4%  -54.26%  (p=0.016 n=4+5)
Allocs/8        315ns ± 1%   103ns ±11%  -67.41%  (p=0.008 n=5+5)
Allocs/16       580ns ± 2%   132ns ±12%  -77.17%  (p=0.008 n=5+5)
Allocs/32      1.18µs ±16%  0.19µs ±24%  -83.93%  (p=0.008 n=5+5)
Allocs/64      2.21µs ± 0%  0.29µs ±25%  -86.93%  (p=0.008 n=5+5)
Allocs/128     4.75µs ±22%  0.51µs ± 9%  -89.32%  (p=0.008 n=5+5)
Allocs/256     12.6µs ±17%   0.9µs ± 4%  -92.50%  (p=0.008 n=5+5)
Allocs/512     22.4µs ±12%   1.8µs ± 5%  -91.94%  (p=0.008 n=5+5)
Allocs/1024    42.5µs ± 6%   3.5µs ± 5%  -91.81%  (p=0.008 n=5+5)
Allocs/2048    85.2µs ±12%   7.1µs ±13%  -91.72%  (p=0.008 n=5+5)
Allocs/4096     169µs ±10%    14µs ± 8%  -91.82%  (p=0.008 n=5+5)
Allocs/8192     298µs ±15%    25µs ± 1%  -91.63%  (p=0.008 n=5+5)
Allocs/16384    542µs ± 1%    49µs ± 3%  -91.02%  (p=0.008 n=5+5)
Allocs/32768   1.07ms ± 1%  0.09ms ± 2%  -91.11%  (p=0.008 n=5+5)
Allocs/65536   2.13ms ± 2%  0.19ms ± 3%  -91.20%  (p=0.008 n=5+5)
```
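For reference, this is roughly the shape of benchmark that produces numbers like the above (an illustrative reconstruction, not the original code; the element type T and the sink variable are assumptions):

```go
package allocs_test

import (
	"fmt"
	"testing"
)

// T stands in for the allocated element type; assumed here to be small
// and pointer-free, matching the scenario described above.
type T [2]int64

var sink any // prevents the compiler from optimizing the allocations away

func BenchmarkAllocs(b *testing.B) {
	for n := 1; n <= 65536; n *= 2 {
		b.Run(fmt.Sprint(n), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				// "old": n individual allocations.
				for j := 0; j < n; j++ {
					sink = new(T)
				}
				// "new": replace the inner loop with a single
				// allocation of a slice of n elements:
				//   sink = make([]T, n)
			}
		})
	}
}
```

The two columns then come from running the benchmark once with each variant and comparing the results with benchstat.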
To help close the gap, it may be worthwhile to investigate performing batch allocations for heap allocations that occur inside loops. This pattern shows up frequently in unmarshaling code and, more generally, in data-processing code.
To clarify, I am not suggesting turning N allocations of type T into a single allocation of a slice of N elements of type T, as that would prevent garbage collection of the individual elements. Instead, by "batch allocation" I mean performing the allocation of N individual elements of type T in a single call to the runtime. This way, each of the N elements can still be garbage collected individually.
This could mean, for a loop with a trip count known at compile time, turning this:

```go
for ... /* trip count = C (known at compile time) */ {
	e := new(T)
	// use e
}
```

into this:

```go
a := make([]*T, 0, C) // stack allocated
for ... {
	e := new_batched(&a)
	// use e
}
batch_deallocate(a) // optional
```

and for loops with a trip count known only at runtime, turning this:
```go
for ... /* trip count = c (not known at compile time) */ {
	e := new(T)
	// use e
}
```

into:

```go
a := make([]*T, 0, 64) // stack allocated
for ... {
	if len(a) == 0 && remaining_trip_count < cap(a) { // optional
		a = a[:0:remaining_trip_count]
	}
	e := new_batched(&a)
	// use e
}
batch_deallocate(a) // optional
```

and for loops with unknown trip count, turning this:
```go
for ... {
	e := new(T)
	// use e
}
```

into:

```go
a := make([]*T, 0, 64) // stack allocated
for ... {
	e := new_batched(&a)
	// use e
}
batch_deallocate(a) // optional
```

where the runtime function calls inserted above by the compiler would be equivalent to:
```go
func new_batched[T any](a *[]*T) *T {
	if a == nil || cap(*a) <= 1 {
		return new(T)
	}
	if len(*a) == 0 {
		*a = (*a)[:cap(*a)] // refill the batch
		batch_allocate(*a)
	}
	e := (*a)[len(*a)-1]
	*a = (*a)[:len(*a)-1]
	return e
}

func batch_allocate[T any](a []*T) {
	// This would be a runtime call that allocates len(a) elements of type T
	// in a single pass; the loop below is only its semantic equivalent.
	for i := range a {
		a[i] = new(T)
	}
}

func batch_deallocate[T any](a []*T) {
	// This would be a runtime call that deallocates the len(a) leftover
	// elements, e.g. by marking each one as free in the corresponding
	// allocation bitmap. It has no user-level equivalent.
	for _, e := range a {
		_ = e
	}
}
```

In a sense, this would entail having loop-scoped bump allocators, so that the main heap allocator does not have to be entered for each individual element.
Some heuristics would surely be needed to decide when to use batch allocation and when not to: for example, we definitely would not want to use it for loops with very low trip counts. Nevertheless, I think this would be a worthwhile optimization to consider.
An alternative to the above would be to specialize the heap allocator per {size class, has pointers} combination, or otherwise to add generic, fast per-P bump allocators (which would have the benefit of helping all code, not just allocations in loops). These seem like more invasive changes to the runtime, though.
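To illustrate the bump-allocation idea at user level, here is a minimal sketch with assumed names. Note that, unlike the runtime-level proposal above, carving pointers out of a single backing slice keeps the whole batch reachable until every element is unreachable, which is exactly the limitation a runtime implementation would avoid:

```go
// bumpAlloc hands out *T values carved out of a backing slice, paying for
// one heap allocation per batch instead of one per element. A user-level
// approximation only: the runtime version proposed above would instead
// hand out individually collectable objects.
type bumpAlloc[T any] struct {
	buf []T
}

func (b *bumpAlloc[T]) alloc() *T {
	if len(b.buf) == 0 {
		b.buf = make([]T, 64) // refill: one allocation amortized over 64 elements
	}
	e := &b.buf[0]
	b.buf = b.buf[1:]
	return e
}
```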
Other alternatives include the (over)use of sync.Pool, but, as has been noted in the past, almost any use of sync.Pool is an indication that the memory management subsystem is lacking in one way or another.
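For comparison, the sync.Pool workaround looks roughly like this (a sketch; T is as defined earlier, and the reset step depends on T):

```go
import "sync"

// tPool recycles *T values across loop iterations instead of allocating
// a fresh one per iteration.
var tPool = sync.Pool{
	New: func() any { return new(T) },
}

func processBatch(n int) {
	for i := 0; i < n; i++ {
		e := tPool.Get().(*T)
		// use e ...
		*e = T{} // reset before recycling so no stale state leaks out
		tPool.Put(e)
	}
}
```

This recovers much of the allocation cost, but it shifts lifetime management onto the programmer, which is precisely the complaint above.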
[^1]: go version go1.20.5 darwin/amd64, Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz (turbo boost disabled)