Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: For some sizes, zeroing memory through OpZero could be faster (AMD64) #56997

Open
jake-ciolek opened this issue Nov 30, 2022 · 7 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@jake-ciolek
Copy link
Contributor

jake-ciolek commented Nov 30, 2022

I'm working on a patch for inlining known-size memclrNoHeapPointers calls and replacing them with an OpZero.

The inlining significantly speeds up clearing, but for some clear size, AMD64's OpZero is actually slower than calling memcrlNoHeapPointers,

see:

Intel Alder Lake 12600k:

name old time/op new time/op delta
MemclrKnownSize1-16 0.59ns ± 3% 0.38ns ± 6% -36.00% (p=0.000 n=19+18)
MemclrKnownSize2-16 0.57ns ± 1% 0.19ns ± 5% -66.27% (p=0.000 n=19+19)
MemclrKnownSize4-16 0.66ns ± 2% 0.36ns ±21% -45.12% (p=0.000 n=19+20)
MemclrKnownSize8-16 0.74ns ± 1% 0.30ns ±26% -59.81% (p=0.000 n=18+20)
MemclrKnownSize16-16 1.00ns ± 7% 0.21ns ± 8% -79.51% (p=0.000 n=20+19)
MemclrKnownSize32-16 0.95ns ± 1% 0.40ns ± 1% -57.61% (p=0.000 n=20+18)
MemclrKnownSize64-16 1.20ns ± 2% 0.41ns ± 0% -65.82% (p=0.000 n=20+18)
MemclrKnownSize112-16 1.27ns ± 2% 1.03ns ± 0% -19.35% (p=0.000 n=20+18)
MemclrKnownSize128-16 1.34ns ± 2% 1.03ns ± 0% -23.02% (p=0.000 n=20+20)
MemclrKnownSize192-16 1.92ns ± 2% 1.44ns ± 0% -24.89% (p=0.000 n=20+16)
MemclrKnownSize248-16 2.77ns ± 1% 3.29ns ± 0% +18.81% (p=0.000 n=20+16)
MemclrKnownSize256-16 1.92ns ± 1% 1.86ns ± 0% -3.49% (p=0.000 n=19+15)
MemclrKnownSize512-16 2.81ns ± 2% 3.49ns ± 0% +24.15% (p=0.000 n=20+17)
MemclrKnownSize1024-16 4.02ns ± 1% 6.78ns ± 0% +68.44% (p=0.000 n=20+18)
MemclrKnownSize4096-16 17.2ns ± 2% 14.4ns ± 0% -16.73% (p=0.000 n=20+17)
MemclrKnownSize512KiB-16 6.71µs ± 1% 6.52µs ± 0% -2.85% (p=0.000 n=20+18)
[Geo mean] 2.60ns 1.71ns -34.06%

name old speed new speed delta
MemclrKnownSize1-16 1.71GB/s ± 3% 2.67GB/s ± 6% +56.39% (p=0.000 n=19+18)
MemclrKnownSize2-16 3.52GB/s ± 2% 10.43GB/s ± 6% +196.04% (p=0.000 n=20+20)
MemclrKnownSize4-16 6.06GB/s ± 1% 10.83GB/s ±11% +78.63% (p=0.000 n=19+18)
MemclrKnownSize8-16 10.7GB/s ± 1% 27.0GB/s ±21% +151.49% (p=0.000 n=18+20)
MemclrKnownSize16-16 16.0GB/s ± 8% 78.1GB/s ± 7% +387.24% (p=0.000 n=20+19)
MemclrKnownSize32-16 33.6GB/s ± 1% 79.4GB/s ± 1% +135.89% (p=0.000 n=20+18)
MemclrKnownSize64-16 53.3GB/s ± 2% 155.9GB/s ± 0% +192.58% (p=0.000 n=20+18)
MemclrKnownSize112-16 88.0GB/s ± 2% 109.1GB/s ± 0% +23.97% (p=0.000 n=20+18)
MemclrKnownSize128-16 95.3GB/s ± 2% 123.8GB/s ± 0% +29.88% (p=0.000 n=20+20)
MemclrKnownSize192-16 100GB/s ± 2% 133GB/s ± 0% +33.12% (p=0.000 n=20+17)
MemclrKnownSize248-16 89.7GB/s ± 1% 75.5GB/s ± 0% -15.84% (p=0.000 n=20+19)
MemclrKnownSize256-16 133GB/s ± 1% 138GB/s ± 0% +3.61% (p=0.000 n=19+14)
MemclrKnownSize512-16 182GB/s ± 2% 147GB/s ± 0% -19.46% (p=0.000 n=20+17)
MemclrKnownSize1024-16 254GB/s ± 1% 151GB/s ± 0% -40.64% (p=0.000 n=20+18)
MemclrKnownSize4096-16 237GB/s ± 2% 285GB/s ± 0% +20.09% (p=0.000 n=20+17)
MemclrKnownSize512KiB-16 78.2GB/s ± 1% 80.4GB/s ± 0% +2.93% (p=0.000 n=20+18)
[Geo mean] 42.1GB/s 63.8GB/s +51.53%

This problem doesn't appear on arm64 (M1 Pro):

name old time/op new time/op delta
MemclrKnownSize1-10 2.03ns ± 0% 0.31ns ± 0% -84.69% (p=0.000 n=18+19)
MemclrKnownSize2-10 1.97ns ± 0% 0.31ns ± 0% -84.19% (p=0.000 n=12+19)
MemclrKnownSize4-10 2.02ns ± 0% 0.31ns ± 0% -84.56% (p=0.000 n=12+20)
MemclrKnownSize8-10 2.02ns ± 0% 0.31ns ± 0% -84.59% (p=0.000 n=14+19)
MemclrKnownSize16-10 2.15ns ± 0% 0.31ns ± 0% -85.50% (p=0.000 n=18+19)
MemclrKnownSize32-10 2.48ns ± 0% 0.31ns ± 0% -87.48% (p=0.000 n=20+19)
MemclrKnownSize64-10 1.93ns ± 0% 0.62ns ± 0% -67.88% (p=0.000 n=20+19)
MemclrKnownSize112-10 2.48ns ± 0% 1.80ns ± 0% -27.74% (p=0.000 n=19+20)
MemclrKnownSize128-10 10.0ns ±112% 2.0ns ± 0% -79.76% (p=0.000 n=18+17)
MemclrKnownSize192-10 27.4ns ±103% 2.6ns ± 0% -90.38% (p=0.000 n=16+19)
MemclrKnownSize248-10 9.67ns ±43% 3.26ns ± 0% -66.29% (p=0.000 n=19+19)
MemclrKnownSize256-10 85.4ns ±148% 3.3ns ± 0% -96.18% (p=0.000 n=20+20)
MemclrKnownSize512-10 223ns ±54% 6ns ± 0% -97.42% (p=0.000 n=18+20)
MemclrKnownSize1024-10 216ns ±26% 11ns ± 0% -95.00% (p=0.000 n=18+15)
MemclrKnownSize4096-10 265ns ± 2% 88ns ± 0% -66.84% (p=0.000 n=19+17)
MemclrKnownSize512KiB-10 9.91µs ± 1% 10.23µs ± 2% +3.14% (p=0.000 n=19+19)
[Geo mean] 15.6ns 2.5ns -83.62%

name old speed new speed delta
MemclrKnownSize1-10 493MB/s ± 0% 3216MB/s ± 0% +553.04% (p=0.000 n=18+19)
MemclrKnownSize2-10 1.02GB/s ± 0% 6.43GB/s ± 0% +532.33% (p=0.000 n=16+19)
MemclrKnownSize4-10 1.99GB/s ± 0% 12.86GB/s ± 0% +547.67% (p=0.000 n=18+20)
MemclrKnownSize8-10 3.96GB/s ± 0% 25.72GB/s ± 0% +548.81% (p=0.000 n=19+19)
MemclrKnownSize16-10 7.46GB/s ± 0% 51.43GB/s ± 0% +589.72% (p=0.000 n=20+19)
MemclrKnownSize32-10 12.9GB/s ± 0% 102.9GB/s ± 0% +698.60% (p=0.000 n=20+18)
MemclrKnownSize64-10 33.1GB/s ± 0% 103.0GB/s ± 0% +211.34% (p=0.000 n=19+19)
MemclrKnownSize112-10 45.1GB/s ± 0% 62.4GB/s ± 0% +38.38% (p=0.000 n=19+20)
MemclrKnownSize128-10 13.3GB/s ±107% 63.5GB/s ± 0% +378.03% (p=0.000 n=19+18)
MemclrKnownSize192-10 6.97GB/s ±139% 72.72GB/s ± 0% +943.44% (p=0.000 n=19+19)
MemclrKnownSize248-10 25.9GB/s ±46% 76.1GB/s ± 0% +194.16% (p=0.000 n=20+17)
MemclrKnownSize256-10 8.64GB/s ±196% 78.51GB/s ± 0% +808.19% (p=0.000 n=20+20)
MemclrKnownSize512-10 2.33GB/s ±86% 89.13GB/s ± 0% +3719.50% (p=0.000 n=17+20)
MemclrKnownSize1024-10 4.85GB/s ±32% 94.93GB/s ± 0% +1856.74% (p=0.000 n=18+19)
MemclrKnownSize4096-10 15.4GB/s ± 2% 46.6GB/s ± 0% +201.55% (p=0.000 n=19+18)
MemclrKnownSize512KiB-10 52.9GB/s ± 1% 51.3GB/s ± 2% -3.04% (p=0.000 n=19+19)
[Geo mean] 7.54GB/s 42.86GB/s +468.76%

It seems that either: on AMD64 OpZero could be improved or for some clears we could use memcrlNoHeapPointers instead.

I'll send the patch soon. Creating this issue so that it can be linked from the change.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Nov 30, 2022
@gopherbot
Copy link

Change https://go.dev/cl/454255 mentions this issue: cmd/compile: inline known-size memclrNoHeapPointers calls

@randall77
Copy link
Contributor

The problem I see here is that sometimes OpZero is used in situations where we can't issue a call. Typically, this is because it is being used to set up call arguments, so setting arguments for memclrNoHeapPointers overwrites the arguments we've already set for some other call. Also initializing stack frames is a tricky time, as we can't do an interruptible call because the stack frame has junk in it.

Maybe though this problem only applies to OpCopy? Not sure that OpZero is used for setting up arguments, and since memclrNoHeapPointers is not interruptible maybe it is ok.

@jake-volvo
Copy link

I think it’s best solved by improving the DUFFZERO implementation. If you look at the benchmark it seems the problem areas are all under the DUFFZERO range.

We know the size of the clear, thus we can skip the initial branching in memclrNoHeapPointers. The trick is to teach DUFFZERO checking for AVX support and figuring out whether the increased code size (would it increase?) pays off. We could also elide the check on higher arch levels.

@toothrot toothrot added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Dec 1, 2022
@mknyszek mknyszek added this to the Go1.21 milestone Dec 7, 2022
@laboger
Copy link
Contributor

laboger commented Jan 23, 2023

On ppc64le, we have an instruction to clear blocks of 128 which is used in the memclr function and is more efficient than doing it this way. It sounds like these cases were recognized because it is mentioned in this issue as well as the commit message of the CL, but not addressed?

@jake-volvo
Copy link

@laboger is it possible to use that instruction in ppc64’s Zero implementation? Perhaps we can skip inlining for ppc64 if it’s a problem.

Maybe you could checkout this change and try running memory clearing benchmarks to see how it impacts performance. Thanks.

@laboger
Copy link
Contributor

laboger commented Jan 24, 2023

@jake-volvo Not really because there are different conditions to verify before you can use that instruction, such as alignment of the target, which makes the code more complicated. It is better to do that within the memclr assembler function.

Why don't you do this the same way as the runtime.memmove generic rules, which use the function isInlinableMemmove to determine when it should be done and returns the condition depending on the target? You could define a similar function isInlinableMemclr.

Here is the comparison with your change:

MemclrKnownSize1-64         2.07ns ± 0%     0.57ns ± 1%   -72.44%  (p=0.029 n=4+4)
MemclrKnownSize2-64         2.55ns ± 1%     0.57ns ± 0%   -77.66%  (p=0.029 n=4+4)
MemclrKnownSize4-64         5.14ns ± 0%     0.50ns ± 0%   -90.36%  (p=0.029 n=4+4)
MemclrKnownSize8-64         2.24ns ± 0%     0.57ns ± 1%   -74.58%  (p=0.029 n=4+4)
MemclrKnownSize16-64        2.23ns ± 0%     0.50ns ± 1%   -77.48%  (p=0.029 n=4+4)
MemclrKnownSize32-64        2.28ns ± 1%     0.57ns ± 1%   -74.95%  (p=0.029 n=4+4)
MemclrKnownSize64-64        2.27ns ± 0%     0.72ns ± 0%   -68.13%  (p=0.029 n=4+4)
MemclrKnownSize112-64       2.95ns ± 0%     1.15ns ± 0%   -60.89%  (p=0.029 n=4+4)
MemclrKnownSize128-64       4.74ns ± 1%     2.68ns ± 0%   -43.52%  (p=0.029 n=4+4)
MemclrKnownSize192-64       5.08ns ± 1%     3.15ns ± 0%   -38.09%  (p=0.029 n=4+4)
MemclrKnownSize248-64       4.68ns ± 4%     3.00ns ± 0%   -36.00%  (p=0.029 n=4+4)
MemclrKnownSize256-64       6.81ns ± 1%     3.56ns ± 1%   -47.70%  (p=0.029 n=4+4)
MemclrKnownSize512-64       3.65ns ± 0%     8.97ns ±31%  +145.78%  (p=0.029 n=4+4)
MemclrKnownSize1024-64      4.74ns ± 0%    12.02ns ± 0%  +153.75%  (p=0.029 n=4+4)
MemclrKnownSize4096-64      17.1ns ± 0%     51.9ns ± 0%  +203.34%  (p=0.029 n=4+4)
MemclrKnownSize512KiB-64    2.12µs ± 0%     5.30µs ± 0%  +149.61%  (p=0.029 n=4+4)

@jake-ciolek
Copy link
Contributor Author

Thanks, I'll gate the inlining so it only happens for AMD64 and ARM64 for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
Status: In Progress
Development

No branches or pull requests

7 participants