cmd/compile: avoid memclr for provable overwritten slices #57153
It is hard to implement this for all cases, but it might be a good idea to do some special optimizations for the latter.
Generally I would not want to open the Pandora's box of giving standard library functions like bytes.Join special runtime access. Let's make the pattern simple enough to detect, and then bytes.Join can use exactly that pattern so programmers don't need to remember it. I suggest:

```go
var b []byte
for _, src := range source {
	b = append(b, src...)
}
```

We can teach the compiler to detect that pattern and avoid the unnecessary memclr (but only for non-pointer types), as well as do only one allocation for the full `b` slice with enough room for all appends. This avoids the explicit size computation and `make`.

To make it work for bytes.Join we can then expand this to additionally allow an append of an unchanged variable or constant in the loop:

```go
for _, src := range source {
	b = append(b, src...)
	b = append(b, var|const...)
}
```

Join would then be:

```go
for _, src := range source {
	b = append(b, src...)
	b = append(b, var|const...)
}
b = b[:len(b)-len(var|const)]
```

Yes, that over-allocates by len(var|const).
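For concreteness, a Join written in exactly that shape might look like the following sketch (not the current bytes.Join implementation):

```go
// Join concatenates the elements of source, placing sep between them.
// It is written in the canonical append-loop shape described above, so a
// compiler pattern match could elide the memclr and size the allocation once.
func Join(source [][]byte, sep []byte) []byte {
	if len(source) == 0 {
		return nil
	}
	var b []byte
	for _, src := range source {
		b = append(b, src...)
		b = append(b, sep...)
	}
	// Trim the trailing separator appended after the last element.
	return b[:len(b)-len(sep)]
}
```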
Re: requiring that to be the canonical pattern has the weird property that someone pre-allocating with the "right" size will get worse performance. When I encountered this, my first wish was for append to solve the problem via a double definition, i.e. a second signature that accepts a variadic list of slices. (I'm not claiming that's a great solution, just a data point about what one Go user would have found intuitive.)
After a first iteration that only accepts new slices, one could change it to just match:

```go
for _, src := range source {
	b = append(b, src...)
}
```

and have some checks around existing vs. needed capacity that determine how the allocation is made. However, that also complicates some things, e.g. when `b` has free capacity but not sufficiently much. We would then still need to fill `b` with as many copies of the slices as would have happened in the append loop before allocating a new backing array, which requires more care and machinery. I would still suggest we start with something more restricted and then iterate to be less restrictive, rather than reaching for something more versatile but more complex right away.

> func append[T any](base []T, args... []T)

I agree that a dedicated builtin, instead of a pattern, makes detection easier and the intention clearer. A language change, on the other hand, has a much higher bar to get accepted. There have been discussions in the past about allowing a variadic list of slices to append, but afaik they got negative feedback on the grounds that append shouldn't become an even more special language construct.
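As a rough illustration of what such a variadic-slice form would compute, here is the same operation written as an ordinary generic helper (`concat` is an illustrative name, not a real built-in or proposed API; a compiler-known version could additionally skip zeroing the allocation, since every byte is overwritten by the copies):

```go
// concat appends the contents of each slice in srcs to base, sizing the
// result once up front instead of growing it repeatedly.
func concat[T any](base []T, srcs ...[]T) []T {
	n := len(base)
	for _, s := range srcs {
		n += len(s)
	}
	out := make([]T, 0, n)
	out = append(out, base...)
	for _, s := range srcs {
		out = append(out, s...)
	}
	return out
}
```

A separator-less join would then reduce to something like `concat[byte](nil, source...)`.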
It seems like an alternative approach might be "lazy clearing": track an index i such that everything prior to that index doesn't need clearing and everything after it is uninitialized. After an allocation, just avoid clearing the remaining part of the array until one of these conditions is met: …
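A rough user-level sketch of the watermark idea (illustrative only; the suggestion above is about tracking this inside the runtime allocator, where the tail of an allocation really can be left uncleared):

```go
// lazyBuf tracks how far its backing array has been initialized.
type lazyBuf struct {
	data []byte // backing array; bytes at and beyond init are treated as uninitialized
	init int    // watermark: data[:init] has been written, data[init:] has not
}

// append copies src just past the watermark and advances it, so the bytes it
// writes never needed to be cleared first.
func (b *lazyBuf) append(src []byte) {
	n := copy(b.data[b.init:], src)
	b.init += n
}

// bytes exposes only the initialized prefix, so uninitialized memory is never
// observable by the caller.
func (b *lazyBuf) bytes() []byte { return b.data[:b.init] }
```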
Change https://go.dev/cl/456336 mentions this issue:
Heh... that's exactly the approach I took in https://go.dev/cl/456336. I personally think it's fairly palatable since …

As part of that CL, I also optimized …

In general, this provides 2x performance gains for sufficiently large buffers. This makes sense since we're removing an entire write pass through the memory.
`func append[T any](base []T, args ...[]T)` has an ambiguity if the type argument for …
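For illustration, here is one way such an ambiguity can arise (a sketch only: Go has no overloading, so the two candidate forms are given distinct names; if both were spelled `append`, the call below could match either one when the element type is an interface such as `any`):

```go
package main

// appendElems mirrors the existing built-in: variadic elements of type T.
func appendElems[T any](base []T, elems ...T) []T {
	return append(base, elems...)
}

// appendSlices mirrors the proposed form: variadic slices of T to splice in.
func appendSlices[T any](base []T, srcs ...[]T) []T {
	for _, s := range srcs {
		base = append(base, s...)
	}
	return base
}

func main() {
	var base []any
	extra := []any{1, 2, 3}

	// With T = any, `extra` is a valid argument to both forms:
	_ = appendElems(base, extra)  // extra treated as a single element of type any
	_ = appendSlices(base, extra) // extra treated as a []any to splice in
}
```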
I don't think this belongs in the …
cc @golang/compiler
Using unsafe and assembly is not the same as giving special access into the runtime by linknaming. Yes, there are other packages that do that, like reflect, but I don't think bytes and strings are in the same category of being tied to internals. Additionally, bytes.MakeNoZero solves this for one specific type, which is also not ideal; it will invite unsafe casts to other types. Using …

What I mean by Pandora's box is that this invites adding linknames from std packages everywhere into the runtime for special optimization operations: "more efficient map copy" -> create a map package and link it to a special function in the runtime. That's not to say we should not optimize these functions, but I think we should do so via patterns detected in the compiler and replaced by runtime calls, or through builtins, and not through linknaming, which is not portable or maintainable within the Go language, as others cannot just add their own linknames into the runtime.
@martisch: In principle, I agree with you. However, I'm not sure compiler optimization is going to fix all cases. For example, it seems like a hard (though solvable) problem to statically detect that the …
@dsnet My view of "perfect is the enemy of good" is from another angle. We are using the complex case of strings.Builder, which is too complex to catch by pattern matching, to argue that patterns are not good enough to be used, and thereby applying the heavy-cannon approach to all use cases. I think we should first cover the cases we can with pattern matching, without using linknames. All those cases then just rely on compiler correctness and don't need to use unsafe.

The case of strings.Builder could be handled as a separate topic. We could discuss exposing in the unsafe package a function that allocates memory without zeroing. I'm not saying we should or will: but if we give strings.Builder that power without additional safeguards (the compiler proving that this can never leak data), then why not give it to everyone, but have it be explicitly unsafe? Then strings.Builder, which already uses unsafe, can use a public but unsafe function.

I would like to point out another issue. If we now go for milking every drop of performance out of the bytes and strings packages, then it will be hard to take that back later to something fast and good enough that doesn't rely on unsafe and linknaming once we have pattern matching (which seems preferable to use). So we may lock ourselves into a different tradeoff of flexibility and maintainability with the in-between solution than the actual target we may agree on.
I'm using …
If the maintenance cost is light, we should milk every drop of performance out of "bytes" and "strings". https://go.dev/cl/456336 adds a dozen lines of code to the runtime in the same repo. This is not a situation where removing this in the future requires a complicated dance across multiple repos. "strings" and "bytes" consistently come up as bottlenecks in logic that I write, and I've sent various optimizations over the years to make them faster. By having these be faster, I don't have to use "unsafe" in my code. That's good, as I don't like using "unsafe".
This seems like a good litmus test, then, for the compiler having the right optimization. Once the compiler optimization lands, it seems that the "bytes" changes in https://go.dev/cl/456336 could at least be reverted. It's doubtful the compiler could ever satisfy the use case in strings.Builder, though.

(BTW, most of the changed lines in https://go.dev/cl/456336 are unrelated to the optimization itself, but rather fix an existing integer overflow bug in the packages.)
TL;DR: Why not introduce a new function …
I meant "all" as in other use cases in bytes and strings that can be more locally optimized, dont need strings builder and are easier matches by the patterns as discussed in the beginning.
One difference with …

We do not disagree that a complex case such as …

Then why not add an unsafe function to allocate without memclr, giving all developers the possibility to performance-optimize to the max and apply that to their own "bytes"- and "strings"-like functions? This could then also be used in new generic libraries (e.g. in /exp/) that don't have the luxury of being shipped with the std lib and thereby being linked into the runtime.

If we are using unsafe in "bytes" and "strings" by exposing an allocate-without-memclr function, then "your" code won't use "unsafe" either, similar to …

I think if we go down the route of linknaming into the runtime, we should agree beforehand that it is OK to remove it if the compiler catches up, and that small performance differences should then not be a blocker to reverting the linknaming.
I may misunderstand this. @martisch seems to imply that users will use "linknaming into runtime" once the bytealg function is available. After >5 years of Go I'm aware of unsafe (though I've never needed to use it), but I have only a faint idea of linknaming. I seem to recall a Go issue where the Go team made clear that using it is "your problem". I may of course be wrong, but isn't bytealg, with its potential for misuse, a much smaller footgun (given it's not promoted as Go API) than an unsafe function that actually invites you to do so?
The users of the bytes and strings packages don't need to do anything new. However, someone who wanted to copy the code into another package outside the std lib would not be able to do so (and have it work), as the function used to allocate without zeroing is referenced from a Go file in the runtime package. So making this usable elsewhere outside the std lib is not possible without adding another linkname and recompiling the runtime. See the addition made to "src/runtime/slice.go" here: https://go-review.googlesource.com/c/go/+/456336/1/src/runtime/slice.go
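For readers unfamiliar with the mechanism, the pairing looks roughly like the following sketch (simplified; the exact code is in the CL linked above):

```go
// file: internal/bytealg/bytealg.go (sketch)
package bytealg

// MakeNoZero is declared without a body; the linker resolves it to the
// runtime implementation below via go:linkname.
func MakeNoZero(n int) []byte

// file: runtime/slice.go (sketch; imports and the allocation-size check omitted)
package runtime

//go:linkname bytealg_MakeNoZero internal/bytealg.MakeNoZero
func bytealg_MakeNoZero(n int) []byte {
	// mallocgc with needzero=false returns memory that is not cleared.
	return unsafe.Slice((*byte)(mallocgc(uintptr(n), nil, false)), n)
}
```

Because the go:linkname directive lives in the runtime's own source, a package outside the standard library cannot add an equivalent binding without patching and recompiling the runtime, which is the portability concern raised above.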
Add bytealg.MakeNoZero that specially allocates a []byte without zeroing it. It assumes the caller will populate every byte.

From within the bytes and strings packages, we can use bytealg.MakeNoZero in a way where our logic ensures that the entire slice is overwritten such that uninitialized bytes are never leaked to the end user.

We use bytealg.MakeNoZero from within the following functions:

* bytes.Join
* bytes.Repeat
* bytes.ToUpper
* bytes.ToLower
* strings.Builder.Grow

The optimization in strings.Builder transitively benefits the following:

* strings.Join
* strings.Map
* strings.Repeat
* strings.ToUpper
* strings.ToLower
* strings.ToValidUTF8
* strings.Replace
* any user logic that depends on strings.Builder

This optimization is especially notable on large buffers that do not fit in the CPU cache, such that the cost of runtime.memclr and runtime.memmove are non-trivial since they are both limited by the relatively slow speed of physical RAM.

Performance:

```
RepeatLarge/256/1            66.0ns ± 3%   64.5ns ± 1%     ~     (p=0.095 n=5+5)
RepeatLarge/256/16           55.4ns ± 5%   53.1ns ± 3%   -4.17%  (p=0.016 n=5+5)
RepeatLarge/512/1            95.5ns ± 7%   87.1ns ± 2%   -8.78%  (p=0.008 n=5+5)
RepeatLarge/512/16           84.4ns ± 9%   76.2ns ± 5%   -9.73%  (p=0.016 n=5+5)
RepeatLarge/1024/1            161ns ± 4%    144ns ± 7%  -10.45%  (p=0.016 n=5+5)
RepeatLarge/1024/16           148ns ± 3%    141ns ± 5%     ~     (p=0.095 n=5+5)
RepeatLarge/2048/1            296ns ± 7%    288ns ± 5%     ~     (p=0.841 n=5+5)
RepeatLarge/2048/16           298ns ± 8%    281ns ± 5%     ~     (p=0.151 n=5+5)
RepeatLarge/4096/1            593ns ± 8%    539ns ± 8%   -8.99%  (p=0.032 n=5+5)
RepeatLarge/4096/16           568ns ±12%    526ns ± 7%     ~     (p=0.056 n=5+5)
RepeatLarge/8192/1           1.15µs ± 8%   1.08µs ±12%     ~     (p=0.095 n=5+5)
RepeatLarge/8192/16          1.12µs ± 4%   1.07µs ± 7%     ~     (p=0.310 n=5+5)
RepeatLarge/8192/4097        1.77ns ± 1%   1.76ns ± 2%     ~     (p=0.310 n=5+5)
RepeatLarge/16384/1          2.06µs ± 7%   1.94µs ± 5%     ~     (p=0.222 n=5+5)
RepeatLarge/16384/16         2.02µs ± 4%   1.92µs ± 6%     ~     (p=0.095 n=5+5)
RepeatLarge/16384/4097       1.50µs ±15%   1.44µs ±11%     ~     (p=0.802 n=5+5)
RepeatLarge/32768/1          3.90µs ± 8%   3.65µs ±11%     ~     (p=0.151 n=5+5)
RepeatLarge/32768/16         3.92µs ±14%   3.68µs ±12%     ~     (p=0.222 n=5+5)
RepeatLarge/32768/4097       3.71µs ± 5%   3.43µs ± 4%   -7.54%  (p=0.032 n=5+5)
RepeatLarge/65536/1          7.47µs ± 8%   6.88µs ± 9%     ~     (p=0.056 n=5+5)
RepeatLarge/65536/16         7.29µs ± 4%   6.74µs ± 6%   -7.60%  (p=0.016 n=5+5)
RepeatLarge/65536/4097       7.90µs ±11%   6.34µs ± 5%  -19.81%  (p=0.008 n=5+5)
RepeatLarge/131072/1         17.0µs ±18%   14.1µs ± 6%  -17.32%  (p=0.008 n=5+5)
RepeatLarge/131072/16        15.2µs ± 2%   16.2µs ±17%     ~     (p=0.151 n=5+5)
RepeatLarge/131072/4097      15.7µs ± 6%   14.8µs ±11%     ~     (p=0.095 n=5+5)
RepeatLarge/262144/1         30.4µs ± 5%   31.4µs ±13%     ~     (p=0.548 n=5+5)
RepeatLarge/262144/16        30.1µs ± 4%   30.7µs ±11%     ~     (p=1.000 n=5+5)
RepeatLarge/262144/4097      31.2µs ± 7%   32.7µs ±13%     ~     (p=0.310 n=5+5)
RepeatLarge/524288/1         67.5µs ± 9%   63.7µs ± 3%     ~     (p=0.095 n=5+5)
RepeatLarge/524288/16        67.2µs ± 5%   62.9µs ± 6%     ~     (p=0.151 n=5+5)
RepeatLarge/524288/4097      65.5µs ± 4%   65.2µs ±18%     ~     (p=0.548 n=5+5)
RepeatLarge/1048576/1         141µs ± 6%    137µs ±14%     ~     (p=0.421 n=5+5)
RepeatLarge/1048576/16        140µs ± 2%    134µs ±11%     ~     (p=0.222 n=5+5)
RepeatLarge/1048576/4097      141µs ± 3%    134µs ±10%     ~     (p=0.151 n=5+5)
RepeatLarge/2097152/1         258µs ± 2%    271µs ±10%     ~     (p=0.222 n=5+5)
RepeatLarge/2097152/16        263µs ± 6%    273µs ± 9%     ~     (p=0.151 n=5+5)
RepeatLarge/2097152/4097      270µs ± 2%    277µs ± 6%     ~     (p=0.690 n=5+5)
RepeatLarge/4194304/1         684µs ± 3%    467µs ± 6%  -31.69%  (p=0.008 n=5+5)
RepeatLarge/4194304/16        682µs ± 1%    471µs ± 7%  -30.91%  (p=0.008 n=5+5)
RepeatLarge/4194304/4097      685µs ± 2%    465µs ±20%  -32.12%  (p=0.008 n=5+5)
RepeatLarge/8388608/1        1.50ms ± 1%   1.16ms ± 8%  -22.63%  (p=0.008 n=5+5)
RepeatLarge/8388608/16       1.50ms ± 2%   1.22ms ±17%  -18.49%  (p=0.008 n=5+5)
RepeatLarge/8388608/4097     1.51ms ± 7%   1.33ms ±11%  -11.56%  (p=0.008 n=5+5)
RepeatLarge/16777216/1       3.48ms ± 4%   2.66ms ±13%  -23.76%  (p=0.008 n=5+5)
RepeatLarge/16777216/16      3.37ms ± 3%   2.57ms ±13%  -23.72%  (p=0.008 n=5+5)
RepeatLarge/16777216/4097    3.38ms ± 9%   2.50ms ±11%  -26.16%  (p=0.008 n=5+5)
RepeatLarge/33554432/1       7.74ms ± 1%   4.70ms ±19%  -39.31%  (p=0.016 n=4+5)
RepeatLarge/33554432/16      7.90ms ± 4%   4.78ms ± 9%  -39.50%  (p=0.008 n=5+5)
RepeatLarge/33554432/4097    7.80ms ± 2%   4.86ms ±11%  -37.60%  (p=0.008 n=5+5)
RepeatLarge/67108864/1       16.4ms ± 3%    9.7ms ±15%  -41.29%  (p=0.008 n=5+5)
RepeatLarge/67108864/16      16.5ms ± 1%    9.9ms ±15%  -39.83%  (p=0.008 n=5+5)
RepeatLarge/67108864/4097    16.5ms ± 1%   11.0ms ±18%  -32.95%  (p=0.008 n=5+5)
RepeatLarge/134217728/1      35.2ms ±12%   19.2ms ±10%  -45.58%  (p=0.008 n=5+5)
RepeatLarge/134217728/16     34.6ms ± 6%   19.3ms ± 7%  -44.07%  (p=0.008 n=5+5)
RepeatLarge/134217728/4097   33.2ms ± 2%   19.3ms ±14%  -41.79%  (p=0.008 n=5+5)
RepeatLarge/268435456/1      70.9ms ± 2%   36.2ms ± 5%  -48.87%  (p=0.008 n=5+5)
RepeatLarge/268435456/16     77.4ms ± 7%   36.1ms ± 8%  -53.33%  (p=0.008 n=5+5)
RepeatLarge/268435456/4097   75.8ms ± 4%   37.0ms ± 4%  -51.15%  (p=0.008 n=5+5)
RepeatLarge/536870912/1       163ms ±14%     77ms ± 9%  -52.94%  (p=0.008 n=5+5)
RepeatLarge/536870912/16      156ms ± 4%     76ms ± 6%  -51.42%  (p=0.008 n=5+5)
RepeatLarge/536870912/4097    151ms ± 2%     76ms ± 6%  -49.64%  (p=0.008 n=5+5)
RepeatLarge/1073741824/1      293ms ± 5%    149ms ± 8%  -49.18%  (p=0.008 n=5+5)
RepeatLarge/1073741824/16     308ms ± 9%    150ms ± 8%  -51.19%  (p=0.008 n=5+5)
RepeatLarge/1073741824/4097   299ms ± 5%    151ms ± 6%  -49.51%  (p=0.008 n=5+5)
```

Updates #57153

Change-Id: I024553b7e676d6da6408278109ac1fa8def0a802
Reviewed-on: https://go-review.googlesource.com/c/go/+/456336
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Joseph Tsai <joetsai@digital-static.net>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
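To illustrate the safety argument above, here is a simplified Repeat-like sketch; `makeNoZero` is a stand-in (user code cannot import internal/bytealg, so a plain make is used to keep the sketch runnable), and the real bytes.Repeat differs in detail:

```go
// makeNoZero stands in for internal/bytealg.MakeNoZero; a real no-zero
// allocation would skip the memclr that make performs here.
func makeNoZero(n int) []byte { return make([]byte, n) }

// repeat returns b repeated count times. Every byte of the allocation is
// overwritten by the copy loop below, which is what makes skipping the
// initial zeroing safe: uninitialized bytes can never reach the caller.
func repeat(b []byte, count int) []byte {
	out := makeNoZero(len(b) * count)
	n := copy(out, b)
	// Double the initialized prefix each iteration until the buffer is full.
	for n < len(out) {
		n += copy(out[n:], out[:n])
	}
	return out
}
```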
Consider the following pattern, which often occurs when merging segmented buffers:
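A minimal sketch of such a pattern, assuming `source` is a `[][]byte` (the exact snippet from the report may have differed):

```go
// merge concatenates segmented buffers into one contiguous []byte.
// The make pre-sizes the result, and the append loop then overwrites
// every byte that make just spent time zeroing.
func merge(source [][]byte) []byte {
	n := 0
	for _, src := range source {
		n += len(src)
	}
	b := make([]byte, 0, n)
	for _, src := range source {
		b = append(b, src...)
	}
	return b
}
```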
Running this benchmark shows that the CPU time is dominated by `runtime.memclrNoHeapPointers`, which occupies 44% of the time, while `runtime.memmove` occupies 36% of the time (the latter is possibly faster than the memclr due to caching effects from the former function). The compiler should be able to prove that every byte of this slice is overwritten, and thus avoid the call to memclr entirely.

If detecting this pattern is too difficult, we could instead special-case `bytes.Join` such that the internal call to `make` avoids the memclr (perhaps by calling a runtime-internal variant of `make` that avoids zeroing). Thus, the pattern above could be replaced with `bytes.Join(source, nil)`.
cc @randall77 @martisch @lavalamp