slices: Insert/Replace sizing can produce immense GC churn #60134

Closed
seebs opened this issue May 11, 2023 · 10 comments
Assignees: randall77
Labels: FrozenDueToAge, NeedsInvestigation (someone must examine and confirm this is a valid issue and not a duplicate of an existing one), Performance

Comments

@seebs
Contributor

seebs commented May 11, 2023

What version of Go are you using (go version)?

(N/A, I don't think this is released yet?)

Does this issue reproduce with the latest release?

N/A

What operating system and processor architecture are you using (go env)?

N/A

What did you do?

Read the source code for slices.go.

What did you expect to see?

That's a really good question. I think I expected slices.Insert() to have allocation behavior more similar to append().

What did you see instead?

Insert computes the needed size, then uses append(S(nil), make(S, tot)...) to get that size, rounded up to the next size class.
Replace computes the needed size, then uses make(..., tot) to get exactly that size.

Neither of them uses the usual append slice growth strategy, which is surprising, because Insert is using append. But it's appending to a nil slice, not to the original slice, so it doesn't pick up the usual growth behavior.

I think that, as a user, I'd have been surprised by this; I'd have expected insert to have the same behavior as append. And it's true that this doesn't actually change algorithmic complexity -- we're always going to have copying costs linear with the size of the original slice, because the offset i is assumed to be randomly distributed somewhere in that slice. But it does mean that, if you do a lot of single-item inserts, you will pay a lot more allocation cost than you would if you were using append.

One possibility might be something like:

s2 := append(s[:i], make(S, tot-i)...)
copy(s2[i:], v)
copy(s2[i+len(v):], s[i:])

This would pick up the expected append growth patterns from the existing slice's size, rather than just rounding up by size class. I think this would be a nice improvement; if you actually want to coerce a slice to not grow, there are ways, but Insert seems like it should be as efficient as possible for large-scale insertions, and having it generate a lot of garbage seems like it'll hurt that.
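(One such way, as an aside, is a full slice expression, which clamps capacity so that any later growth forces a fresh allocation:)

s = s[:len(s):len(s)] // now cap(s) == len(s), so the next append that needs room must allocate a new array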

Similarly, for Replace:

s2 := append(s[:i], make(S, tot-j)...)
copy(s2[i:], v)
copy(s2[i+len(v):], s[j:])

(same thing, except using j rather than i as the right-side offset).

This would produce the desirable trait that Replace(s, i, i, v...) behaves precisely like Insert(s, i, v...), rather than producing a result slice which is not rounded up to the next size class (in cases where reallocation is in fact needed).

Alternatively, if it's too late to make changes to semantics: maybe have a second pair of functions, differently named, that have the expected behavior. And if this can't be changed, at least the documentation should be explicit that, for Replace, there is no room for new items if the slice is reallocated, and that for Insert, the growth algorithm results in a lot more reallocations than you'd get with similar/corresponding append operations.
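For reference, a quick way to compare the two (a sketch only, assuming the slices API with slices.Insert and slices.Replace, and forcing both calls to reallocate):

package main

import (
	"fmt"
	"slices"
)

func main() {
	v := []int{1, 2, 3}

	// make gives an exact capacity, so both calls below must reallocate.
	a := make([]int, 600)
	b := make([]int, 600)

	ins := slices.Insert(a, 10, v...)
	rep := slices.Replace(b, 10, 10, v...)

	fmt.Println("Insert:  len", len(ins), "cap", cap(ins))
	fmt.Println("Replace: len", len(rep), "cap", cap(rep))
}

With the implementation described above, Insert's cap would be rounded up to the next size class while Replace's cap would be exactly 603.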

@ianlancetaylor
Contributor

The append function computes the amount of memory to allocate based on the total size. And that is exactly what the code in slices.Insert is doing:

	// Use append rather than make so that we bump the size of
	// the slice up to the next storage class.
	// This is what Grow does but we don't call Grow because
	// that might copy the values twice.
	s2 := append(S(nil), make(S, tot)...)

Here tot is the total size of the new slice.

So when you say

I'd have expected insert to have the same behavior as append

can you clarify what you mean? Because it seems to me that it does. Thanks.

@seebs
Contributor Author

seebs commented May 11, 2023

package main

import "fmt"

func show(label string, s []int) {
	fmt.Printf("%s: len %d cap %d\n", label, len(s), cap(s))
}

func main() {
	s1 := append(make([]int, 511), make([]int, 5)...)
	s2 := append([]int(nil), make([]int, 516)...)
	show("s1", s1)
	show("s2", s2)
}

s1: len 516 cap 848
s2: len 516 cap 608

Changing 511 to 1023:
s1: len 1028 cap 1536
s2: len 1028 cap 1184

(Arguably, perhaps the algorithm should really use the larger of the cap of the original slice and the len of the new slice to figure out how much to grow, but in practice I think that almost never matters.)

When you append to nil, your size is determined entirely by rounding up to size class from the size of the appended thing. When you append to an existing slice that isn't big enough, your size is determined by the append-growth algorithm based on the current size. That's usually significantly more growth than rounding up by size class. Yes, the code (for Insert, not Replace) is using append, but since it's appending to nil, not to an existing slice, it isn't using the existing slice's capacity to guess at rates of growth. (Replace, meanwhile, isn't even rounding up to size classes.)

package main

import "fmt"

func main() {
	caps := make(map[int]struct{})
	var s []int
	for i := 0; i < 1024; i++ {
		s = append(s, make([]int, 1)...)
		caps[cap(s)] = struct{}{}
	}
	fmt.Printf("caps seen: %d\n", len(caps))

	for k := range caps {
		delete(caps, k)
	}

	s = nil
	for i := 0; i < 1024; i++ {
		s = append([]int(nil), make([]int, len(s)+1)...)
		caps[cap(s)] = struct{}{}
	}
	fmt.Printf("caps seen: %d\n", len(caps))
}

This gives me 12 and 51 respectively.

@seebs
Contributor Author

seebs commented May 12, 2023

I do note the test case there is sort of dumb because it's doing allocations unconditionally.

package main

import "testing"

const inserts = 1024

func BenchmarkSmartAppend(b *testing.B) {
        b.ReportAllocs()
        caps := make(map[int]struct{})
        for i := 0; i < b.N; i++ {
                var s []int
                for i := 0; i < inserts; i++ {
                        if len(s) + 1 >= cap(s) {
                                s = append(s, make([]int, 1)...)
                                caps[cap(s)] = struct{}{}
                        } else {
                                s = append(s, make([]int, 1)...)
                        }
                }
        }
        b.Logf("caps seen: %d\n", len(caps))
}

func BenchmarkNaiveAppend(b *testing.B) {
        b.ReportAllocs()
        caps := make(map[int]struct{})
        for i := 0; i < b.N; i++ {
                var s []int
                for i := 0; i < inserts; i++ {
                        if len(s) + 1 >= cap(s) {
                                s = append([]int(nil), make([]int, len(s)+1)...)
                                caps[cap(s)] = struct{}{}
                        } else {
                                s = append(s, make([]int, 1)...)
                        }
                }
        }
        b.Logf("caps seen: %d\n", len(caps))
}

This one shows 12 allocs per run through appending 1024 items using the existing slice logic, and 98 per run using the naive logic. Honestly not sure why 98 rather than, say, 51. It also reports that the second one takes >4x as long to run, which I'm slightly surprised by.

@ianlancetaylor
Contributor

Thanks.

I guess the next question is: what behavior do people expect? People should not normally be building large slices using individual calls to Insert, which is not a particularly fast operation. Do you have a suggestion for a better algorithm?

@Merovius
Contributor

Merovius commented May 12, 2023

FWIW I would have expected Insert to have capacity growth characteristics similar to append's. That is, I'd expect the behavior of

func Insert[S ~[]E, E any](s S, i int, v ...E) S {
    if i < 0 || i > len(s) { panic("gnoes") }
    s = append(s, make(S, len(v))...)
    copy(s[i+len(v):], s[i:])
    copy(s[i:], v)
    return s
}

I was kind of surprised, when reading the Slack discussion prompting this issue, that this isn't what happens. And I think it would be reasonable to assume that Insert(s, len(s), v...) does the same as append(s, v...).

I agree that Insert is inefficient, so it's hard to argue that changing the amortized allocation cost matters much, given that the cost of copying can't be amortized. But I also don't really see any harm in the extra capacity. It might make some programs hold on to slightly too much memory, but I'd assume that anyone who cares about that also cares enough about the inefficiency of Insert not to use it.

@merykitty

I kind of disagree that "Insert is inefficient". The efficiency of Insert depends on where the insertion occurs; if it consistently happens near the end, then one should expect Insert to have performance characteristics similar to append's. This can arise, for example, when implementing a stack-based machine where shuffling data on top of the stack is common (see the sketch below).

Thanks.
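
As a hedged, hypothetical illustration of that near-the-end pattern (not code from this thread):

package main

import (
	"fmt"
	"slices"
)

// tuck inserts v just below the current top of the stack, so the insertion
// point is always len(stack)-1, i.e. near the end of the slice.
// (Assumes a non-empty stack.)
func tuck[E any](stack []E, v E) []E {
	return slices.Insert(stack, len(stack)-1, v)
}

func main() {
	s := []int{1, 2, 3}
	s = tuck(s, 99)
	fmt.Println(s) // [1 2 99 3]
}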

@seebs
Contributor Author

seebs commented May 12, 2023

So, I don't have a suggestion for a better algorithm overall; I just think that having the same growth characteristics as append on the same slice gets us closer to a good solution. It's not an algorithmic-complexity speedup, but it is a fairly significant improvement. (I also think, more strongly as I keep re-evaluating it, that it's almost certainly a very bad idea to have Replace(s, i, i, v...) behave differently from Insert(s, i, v...).)

Hmm. I think... You can't really make an Insert which isn't, in general, roughly O(len(s)+len(v)). Or, more precisely, O(len(s)-i+len(v)).

Actually, that could be a useful way to think about it: The expected/amortized cost of append(s, item) is O(1). The expected/amortized cost of slices.Insert(s, len(s), item) is closer to O(len(s)), because it's copying the entirety of s at a frequency much higher than 1/len(s). So in cases where i tends to be close to len(s), the existing algorithm actually has worse time complexity. If i tends to be randomly distributed, the time complexities are similar, but there is a noticeable constant factor.
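
(A rough back-of-envelope, as a sketch: with 8-byte elements and capacity rounded up only to the next 8192-byte size class, end-inserting n items one at a time reallocates roughly every 1024 inserts and copies about len(s) elements each time, so the total copy work is on the order of n^2/2048 elements. append's multiplicative growth keeps that total at a small constant multiple of n, roughly 5n for a 1.25x growth factor.)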

So I think this would get better behavior, in general:

s2 := append(s[:i], make(S, tot-i)...)
copy(s2[i:], v)
copy(s2[i+len(v):], s[i:])

The rationale is, this gets us the append logic including awareness of the current size, which is a good way to guess how much we need to grow by, but it only copies each value once. Same thing for Replace, really.

This doesn't make it sane to build million-item slices by inserts at random offsets, but it does reduce the overhead quite a bit.

If we really want to solve for a more general problem... Honestly, I think at that point you want a type Builder [...] which has a fancy data structure to accept random-access inserts and reify the slice when it's done, and I would be very hesitant to pitch implementing that on short notice. Surely, someone's done a renumbering-friendly data structure otherwise similar to a btree? I think something... vaguely [handwaves] similar in logic to an AVL tree, except that instead of propagating depth information, you're propagating item-count information, so you know how many items are in each branch and can quickly home in on the right leaf to add an item to?

Should be a simple matter of programming to design and implement that.

But I think, in the mean time, "make Insert and Replace behave a bit more like existing ways of expanding slices, to reduce the pathological allocations if people are doing admittedly sorta stupid things with them, and document more explicitly that hey, this is not actually going to give you great performance for large N" is probably a good call.

@bcmills added the Performance and NeedsInvestigation labels May 12, 2023
@pconstantinou

Adding my $0.02 to this discussion:
I ran into an unexpected slices.Insert(..) memory performance bottleneck because I was trying to maintain an ordered slice using a combination of slices.BinarySearch() and slices.Insert(). These functions are designed to work together nicely. My expectation was that slices would have better GC performance than btrees, and large contiguous blocks of memory fit my requirements well.
The documentation on Insert didn't specify the memory allocation model, so I presumed it matched append().

My problem requires millions of sorted slices, ingesting a 150GB file to provide an in-memory database.
I have a tight loop of: read > find correct slice > insert into the slice (if the value doesn't already exist).

pos, found := slices.BinarySearchFunc(list, rec, func(a, b r) int { return b.ID - a.ID })
if !found {
   list = slices.Insert(list, pos, rec)
}

I expected the slice to grow similarly to append(), but with internal, intelligent use of copy() reducing the amount of memory copying required.

Instead, I found that this loop was a significant source of memory allocation and garbage collection, particularly once the slices got over 10MB.

My hack solution was to reimplement Insert as follows, trying to reproduce the append() growth logic.

// Insert inserts the values v... into s at index i,
// returning the modified slice.
// In the returned slice r, r[i] == v[0].
// Insert panics if i is out of range.
// This function is O(len(s) + len(v)).
// This implementation differs from `slices.Insert` in that it expands the slice like `append`.
func Insert[S ~[]E, E any](s S, i int, v ...E) S {
	tot := len(s) + len(v)
	c := cap(s)
	if tot <= c {
		s2 := s[:tot]
		copy(s2[i+len(v):], s[i:])
		copy(s2[i:], v)
		return s2
	}
	if c > 1024 {
		c += c / 4
	} else {
		c *= 2
	}
	if tot > c {
		c = tot
	}
	s2 := make(S, tot, c)
	copy(s2, s[:i])
	copy(s2[i:], v)
	copy(s2[i+len(v):], s[i:])
	return s2
}
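
(For context, that growth rule, doubling below a 1024-element threshold and growing by about a quarter above it, mirrors the heuristic the runtime historically used for append; newer Go releases use a lower threshold and a smoother ramp between the two regimes, but the effect on allocation counts is similar.)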

Before making this change, slices.Insert was responsible for 97% of all the allocations in the process. After making the change above, it changed to less than 3%.

@randall77
Contributor

I'm agreeing with @seebs here that the right, or at least right now, fix is to append to s[:i] instead of S(nil).

When inserting near the end of large slices, this gets you approximately the same behavior as append, where we grow in 1.25x increments (instead of currently, where we grow in 8192-byte increments).

When inserting near the start of large slices, this doesn't help so much. But if you're inserting near the start of large slices, you probably have bigger problems. GC, although expensive, is not O(n^2) expensive.

@gopherbot
Contributor

Change https://go.dev/cl/495296 mentions this issue: slices: for Insert and Replace, grow slices like append does

@randall77 self-assigned this May 16, 2023
eric pushed a commit to fancybits/go that referenced this issue Sep 7, 2023
At least when we're inserting/replacing near the end of a slice, when
we have to grow it use the same multiplicative growth factor that the
runtime uses for append.

Before this CL, we would grow the slice one page (8192 bytes) at a time
for large slices. This would cause O(n^2) work when appending near the
end should only take O(n) work.

This doesn't fix the problem if you insert/replace near the start of the
array, but maybe that doesn't need fixing because it is O(n^2) anyway.

Fixes golang#60134

Change-Id: If05376bc512ab839769180e5ce4cb929f47363b1
Reviewed-on: https://go-review.googlesource.com/c/go/+/495296
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
@golang locked and limited conversation to collaborators May 15, 2024