Skip to content

cmd/compile: For some sizes, zeroing memory through OpZero could be faster (AMD64) #56997

@jake-ciolek

Description

@jake-ciolek

I'm working on a patch for inlining known-size memclrNoHeapPointers calls and replacing them with an OpZero.

The inlining significantly speeds up clearing, but for some clear size, AMD64's OpZero is actually slower than calling memcrlNoHeapPointers,

see:

Intel Alder Lake 12600k:

name old time/op new time/op delta
MemclrKnownSize1-16 0.59ns ± 3% 0.38ns ± 6% -36.00% (p=0.000 n=19+18)
MemclrKnownSize2-16 0.57ns ± 1% 0.19ns ± 5% -66.27% (p=0.000 n=19+19)
MemclrKnownSize4-16 0.66ns ± 2% 0.36ns ±21% -45.12% (p=0.000 n=19+20)
MemclrKnownSize8-16 0.74ns ± 1% 0.30ns ±26% -59.81% (p=0.000 n=18+20)
MemclrKnownSize16-16 1.00ns ± 7% 0.21ns ± 8% -79.51% (p=0.000 n=20+19)
MemclrKnownSize32-16 0.95ns ± 1% 0.40ns ± 1% -57.61% (p=0.000 n=20+18)
MemclrKnownSize64-16 1.20ns ± 2% 0.41ns ± 0% -65.82% (p=0.000 n=20+18)
MemclrKnownSize112-16 1.27ns ± 2% 1.03ns ± 0% -19.35% (p=0.000 n=20+18)
MemclrKnownSize128-16 1.34ns ± 2% 1.03ns ± 0% -23.02% (p=0.000 n=20+20)
MemclrKnownSize192-16 1.92ns ± 2% 1.44ns ± 0% -24.89% (p=0.000 n=20+16)
MemclrKnownSize248-16 2.77ns ± 1% 3.29ns ± 0% +18.81% (p=0.000 n=20+16)
MemclrKnownSize256-16 1.92ns ± 1% 1.86ns ± 0% -3.49% (p=0.000 n=19+15)
MemclrKnownSize512-16 2.81ns ± 2% 3.49ns ± 0% +24.15% (p=0.000 n=20+17)
MemclrKnownSize1024-16 4.02ns ± 1% 6.78ns ± 0% +68.44% (p=0.000 n=20+18)
MemclrKnownSize4096-16 17.2ns ± 2% 14.4ns ± 0% -16.73% (p=0.000 n=20+17)
MemclrKnownSize512KiB-16 6.71µs ± 1% 6.52µs ± 0% -2.85% (p=0.000 n=20+18)
[Geo mean] 2.60ns 1.71ns -34.06%

name old speed new speed delta
MemclrKnownSize1-16 1.71GB/s ± 3% 2.67GB/s ± 6% +56.39% (p=0.000 n=19+18)
MemclrKnownSize2-16 3.52GB/s ± 2% 10.43GB/s ± 6% +196.04% (p=0.000 n=20+20)
MemclrKnownSize4-16 6.06GB/s ± 1% 10.83GB/s ±11% +78.63% (p=0.000 n=19+18)
MemclrKnownSize8-16 10.7GB/s ± 1% 27.0GB/s ±21% +151.49% (p=0.000 n=18+20)
MemclrKnownSize16-16 16.0GB/s ± 8% 78.1GB/s ± 7% +387.24% (p=0.000 n=20+19)
MemclrKnownSize32-16 33.6GB/s ± 1% 79.4GB/s ± 1% +135.89% (p=0.000 n=20+18)
MemclrKnownSize64-16 53.3GB/s ± 2% 155.9GB/s ± 0% +192.58% (p=0.000 n=20+18)
MemclrKnownSize112-16 88.0GB/s ± 2% 109.1GB/s ± 0% +23.97% (p=0.000 n=20+18)
MemclrKnownSize128-16 95.3GB/s ± 2% 123.8GB/s ± 0% +29.88% (p=0.000 n=20+20)
MemclrKnownSize192-16 100GB/s ± 2% 133GB/s ± 0% +33.12% (p=0.000 n=20+17)
MemclrKnownSize248-16 89.7GB/s ± 1% 75.5GB/s ± 0% -15.84% (p=0.000 n=20+19)
MemclrKnownSize256-16 133GB/s ± 1% 138GB/s ± 0% +3.61% (p=0.000 n=19+14)
MemclrKnownSize512-16 182GB/s ± 2% 147GB/s ± 0% -19.46% (p=0.000 n=20+17)
MemclrKnownSize1024-16 254GB/s ± 1% 151GB/s ± 0% -40.64% (p=0.000 n=20+18)
MemclrKnownSize4096-16 237GB/s ± 2% 285GB/s ± 0% +20.09% (p=0.000 n=20+17)
MemclrKnownSize512KiB-16 78.2GB/s ± 1% 80.4GB/s ± 0% +2.93% (p=0.000 n=20+18)
[Geo mean] 42.1GB/s 63.8GB/s +51.53%

This problem doesn't appear on arm64 (M1 Pro):

name old time/op new time/op delta
MemclrKnownSize1-10 2.03ns ± 0% 0.31ns ± 0% -84.69% (p=0.000 n=18+19)
MemclrKnownSize2-10 1.97ns ± 0% 0.31ns ± 0% -84.19% (p=0.000 n=12+19)
MemclrKnownSize4-10 2.02ns ± 0% 0.31ns ± 0% -84.56% (p=0.000 n=12+20)
MemclrKnownSize8-10 2.02ns ± 0% 0.31ns ± 0% -84.59% (p=0.000 n=14+19)
MemclrKnownSize16-10 2.15ns ± 0% 0.31ns ± 0% -85.50% (p=0.000 n=18+19)
MemclrKnownSize32-10 2.48ns ± 0% 0.31ns ± 0% -87.48% (p=0.000 n=20+19)
MemclrKnownSize64-10 1.93ns ± 0% 0.62ns ± 0% -67.88% (p=0.000 n=20+19)
MemclrKnownSize112-10 2.48ns ± 0% 1.80ns ± 0% -27.74% (p=0.000 n=19+20)
MemclrKnownSize128-10 10.0ns ±112% 2.0ns ± 0% -79.76% (p=0.000 n=18+17)
MemclrKnownSize192-10 27.4ns ±103% 2.6ns ± 0% -90.38% (p=0.000 n=16+19)
MemclrKnownSize248-10 9.67ns ±43% 3.26ns ± 0% -66.29% (p=0.000 n=19+19)
MemclrKnownSize256-10 85.4ns ±148% 3.3ns ± 0% -96.18% (p=0.000 n=20+20)
MemclrKnownSize512-10 223ns ±54% 6ns ± 0% -97.42% (p=0.000 n=18+20)
MemclrKnownSize1024-10 216ns ±26% 11ns ± 0% -95.00% (p=0.000 n=18+15)
MemclrKnownSize4096-10 265ns ± 2% 88ns ± 0% -66.84% (p=0.000 n=19+17)
MemclrKnownSize512KiB-10 9.91µs ± 1% 10.23µs ± 2% +3.14% (p=0.000 n=19+19)
[Geo mean] 15.6ns 2.5ns -83.62%

name old speed new speed delta
MemclrKnownSize1-10 493MB/s ± 0% 3216MB/s ± 0% +553.04% (p=0.000 n=18+19)
MemclrKnownSize2-10 1.02GB/s ± 0% 6.43GB/s ± 0% +532.33% (p=0.000 n=16+19)
MemclrKnownSize4-10 1.99GB/s ± 0% 12.86GB/s ± 0% +547.67% (p=0.000 n=18+20)
MemclrKnownSize8-10 3.96GB/s ± 0% 25.72GB/s ± 0% +548.81% (p=0.000 n=19+19)
MemclrKnownSize16-10 7.46GB/s ± 0% 51.43GB/s ± 0% +589.72% (p=0.000 n=20+19)
MemclrKnownSize32-10 12.9GB/s ± 0% 102.9GB/s ± 0% +698.60% (p=0.000 n=20+18)
MemclrKnownSize64-10 33.1GB/s ± 0% 103.0GB/s ± 0% +211.34% (p=0.000 n=19+19)
MemclrKnownSize112-10 45.1GB/s ± 0% 62.4GB/s ± 0% +38.38% (p=0.000 n=19+20)
MemclrKnownSize128-10 13.3GB/s ±107% 63.5GB/s ± 0% +378.03% (p=0.000 n=19+18)
MemclrKnownSize192-10 6.97GB/s ±139% 72.72GB/s ± 0% +943.44% (p=0.000 n=19+19)
MemclrKnownSize248-10 25.9GB/s ±46% 76.1GB/s ± 0% +194.16% (p=0.000 n=20+17)
MemclrKnownSize256-10 8.64GB/s ±196% 78.51GB/s ± 0% +808.19% (p=0.000 n=20+20)
MemclrKnownSize512-10 2.33GB/s ±86% 89.13GB/s ± 0% +3719.50% (p=0.000 n=17+20)
MemclrKnownSize1024-10 4.85GB/s ±32% 94.93GB/s ± 0% +1856.74% (p=0.000 n=18+19)
MemclrKnownSize4096-10 15.4GB/s ± 2% 46.6GB/s ± 0% +201.55% (p=0.000 n=19+18)
MemclrKnownSize512KiB-10 52.9GB/s ± 1% 51.3GB/s ± 2% -3.04% (p=0.000 n=19+19)
[Geo mean] 7.54GB/s 42.86GB/s +468.76%

It seems that either: on AMD64 OpZero could be improved or for some clears we could use memcrlNoHeapPointers instead.

I'll send the patch soon. Creating this issue so that it can be linked from the change.

Metadata

Metadata

Assignees

Labels

NeedsDecisionFeedback is required from experts, contributors, and/or the community before a change can be made.compiler/runtimeIssues related to the Go compiler and/or runtime.

Type

No type

Projects

Status

In Progress

Relationships

None yet

Development

No branches or pull requests

Issue actions