cmd/compile: multiplication strength-reduction rules hurting performance #21434
I noticed that certain strength-reduction rules for multiplication seem to harm performance on my machine.
Take the following rule in AMD64.rules (that reduces
Two silly benchmarks:
The generated code for
The rule makes no difference on the microbenchmark and strongly hurts performance on the more realistic second benchmark.
This is on a Haswell machine, where the rule-of-thumb currently used for this kind of reduction (
I don't think your test is actually measuring the multiply latency. It is only measuring multiply throughput.
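To illustrate the distinction, here is a minimal sketch of a throughput-bound loop next to a latency-bound one. Everything in it is an assumption made for illustration: the multiplier `k = 15`, the function names, and the slice size are hypothetical, and these are not the F1 to F4 functions referred to in this comment.

```go
package main

import "testing"

// k is a hypothetical multiplier chosen for illustration (k+1 is a power of two).
const k = 15

// sink keeps results alive so the compiler cannot discard the loops.
var sink int64

// independentProducts multiplies each element on its own. No multiply waits
// for the result of another, so the CPU can overlap them: throughput-bound.
func independentProducts(xs []int64) int64 {
	var sum int64
	for _, x := range xs {
		sum += x * k
	}
	return sum
}

// dependentChain feeds each product into the next multiply, so every
// multiply must wait for the previous one to finish: latency-bound.
func dependentChain(x int64, n int) int64 {
	for i := 0; i < n; i++ {
		x *= k
	}
	return x
}

// BenchmarkThroughput times the loop of independent multiplies.
func BenchmarkThroughput(b *testing.B) {
	xs := make([]int64, 1024)
	for i := range xs {
		xs[i] = int64(i)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = independentProducts(xs)
	}
}

// BenchmarkLatency times the serial chain of multiplies.
func BenchmarkLatency(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = dependentChain(3, 1024)
	}
}
```

The benchmark functions follow the `testing` package's `BenchmarkXxx(b *testing.B)` convention. Comparing each one with and without the rewrite rule enabled would show which property the rule affects.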
So when measuring throughput (F1 vs F2) the rewrite hurts, probably because we're substituting 3 instructions for 1. (Side note - why does this hurt? I would expect a reasonable fetch/retire engine to keep up with this loop.)
When measuring latency (F3 vs F4), however, the rewrite helps. We're now doing the multiply in two latency-1 instructions instead of one latency-3 instruction.
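As a rough sketch of what such a rewrite does, assume a hypothetical multiplier of 15 (the actual rule from the issue is not reproduced here). Since 15 = 2^4 - 1, the product can be formed with a shift followed by a subtract:

```go
package main

// mulDirect multiplies by the hypothetical constant 15 with a plain multiply.
func mulDirect(x int64) int64 {
	return x * 15
}

// mulShiftSub computes the same value as a shift and a subtract:
// 15*x = 16*x - x = (x << 4) - x.
// Signed overflow wraps in Go, so the two forms agree for every int64.
func mulShiftSub(x int64) int64 {
	return (x << 4) - x
}
```

The Go compiler may well emit the same machine code for both functions, since it applies rewrites of this kind itself. The two forms are written out separately here only to make the arithmetic identity explicit.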
I think latency is more important than throughput, so we should keep the rewrite. I'm happy to hear arguments otherwise, though.
My processor is Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz, YMMV.
Ah, the reduction improves latency. Thanks for the explanation.
Well, the other compilers I looked at (GCC, Clang, Intel) also seem to optimize for latency, so I guess that's a point in favour of keeping those rewrites.