cmd/compile: reorganize associative computation to allow superscalar execution #49331

josharian · 2021-11-04T00:06:25Z

The compiler currently compiles a+b+c+d as a+(b+(c+d)). It should use (a+b)+(c+d) instead, because the latter can be executed out of order.

More broadly, we should balance trees of associative computation.

Doing this in the compiler with a rule tailored for a single computation type in round 4 of md5block.go yielded a 15% throughput improvement. (See below for details.)

It's not obvious to me whether we can do this with carefully crafted rewrite rules or whether a dedicated pass would be better. But it looks like there may be significant performance wins available on tight mathematical code.

I don't plan to work on this further, but I really hope someone else picks it up.

cc @randall77 @martisch @FiloSottile @mmcloughlin @mdempsky

To reproduce the md5 results, disable the optimized assembly routines, and add this rewrite rule:

(Add32 <t> c:(Const32) (Add32 d:(Load _ _) (Add32 x:(Xor32 _ _) a:(Add32 _ _)))) => (Add32 (Add32 <t> c d) (Add32 <t> x a))

and disable this one to avoid an infinite loop:

(Add32 (Add32 i:(Const32 <t>) z) x) && (z.Op != OpConst32 && x.Op != OpConst32) => (Add32 i (Add32 <t> z x))

The text was updated successfully, but these errors were encountered:

seebs · 2021-11-04T02:24:01Z

	a := float64(1<<53)
	b := float64(1)
	c := float64(1)
	d := float64(0)
	e := (a+(b+(c+d)))
	f := (a+b)+(c+d)

It's probably good for integers, because I think that, with the guarantees Go gives, addition is associative in Go, but for float values, it isn't always.

josharian · 2021-11-05T16:23:23Z

@seebs the SSA backend has different ops for different data types exactly to avoid this kind of issue. (We have @dr2chase to thank for that.) Relatedly this kind of optimization is also not appropriate for adding pointers, as intermediate results may be invalid pointers.

mdempsky · 2021-11-05T19:29:53Z

Relatedly this kind of optimization is also not appropriate for adding pointers, as intermediate results may be invalid pointers.

Note that it is safe (AFAICT) to rewrite (ptr + x1) + x2 into ptr + (x1 + x2), just not the other way around. I'm skeptical this is useful in practice though.

…ions Currently the compiler groups expressions with commutative operations such as a + b + c + d as so: (a + (b + (c + d))) which is suboptimal for CPU instruction pipelining. This pass balances commutative expressions as shown above to (a + b) + (c + d) to optimally pipeline them. It also attempts to reassociate constants to as far right of the commutative expression as possible for better constant folding opportunities. Below is a benchmark from crypto/md5 on an MacBook Pro M2: trunk reassociate Hash1K-8 433.7Mi ± 0% 499.4Mi ± 4% +15.17% (p=0.000 n=10) Hash8K-8 454.3Mi ± 1% 524.9Mi ± 1% +15.53% (p=0.000 n=10) .... geomean 284.4Mi 327.5Mi +15.15% Other CPU architectures tried showed very little change (+/-1%) on this particular benchmark but tight mathematical code stands to gain greatly from this optimization Fixes golang#49331

gopherbot · 2023-05-05T22:44:20Z

Change https://go.dev/cl/493115 mentions this issue: cmd/compile: add reassociate pass to rebalance commutative operations

y1yang0 · 2023-05-06T10:21:50Z

Interesting. As far as I know, reassociate in other compilers is mainly to assist other optimizations (such as sccp, gcse, loopopt). It seems that reassociate is used to "accelerate CPU superscalar execution" is a new expressive statement, is there any related extension information/paper? Maybe this is also helpful for other compilers. Thanks.

Java applies reassociate only when invariant participates computation in loop. GCC trunk does not reassociate a+b+c+d to (a+b)+(c+d). Clang trunk does reassociates it.

CAFxX · 2023-05-06T14:14:09Z

is there any related extension information/paper?

@y1yang0 see section 9.5 (and in general the whole of chapter 9) of https://www.agner.org/optimize/optimizing_assembly.pdf

ryan-berger · 2023-05-08T03:54:51Z

@y1yang0 Yes, Clang outputs the optimization because of the Reassociate pass in LLVM.

This is an extremely simple analysis pass compared to LLVM's version, and some of the things the LLVM pass are already taken care of by the rewrite rules which I didn't know existed until after I submitted the PR so I've simplified this even further to get rid of constant sorting.

If this pass runs again further after opt it might have constant folding opportunities but it is more important to get done before gcse since it can help group expressions together nicely. For now the goal is to just get out of order execution accelerated and I'll work on some more of the opportunities that this optimization opens up soon.

It looks like the reason Java does a reassociation optimization only inside of loops is auto-vectorization. This sort of pass helps sort out all the dependencies and lets you pretty easily recognize 4 consecutive additions that could be turned into some SIMD if a model decides it is cost effective.

This pass seems to help make analysis much easier in a lot of places though so I'll be combing through some other compilers and LLVM to see if we can leverage it more in Go

…ions Currently the compiler groups expressions with commutative operations such as a + b + c + d as so: (a + (b + (c + d))) which is suboptimal for CPU instruction pipelining. This pass balances commutative expressions as shown above to (a + b) + (c + d) to optimally pipeline them. It also attempts to reassociate constants to as far right of the commutative expression as possible for better constant folding opportunities. Below is a benchmark from crypto/md5 on an MacBook Pro M2: trunk reassociate Hash1K-8 433.7Mi ± 0% 499.4Mi ± 4% +15.17% (p=0.000 n=10) Hash8K-8 454.3Mi ± 1% 524.9Mi ± 1% +15.53% (p=0.000 n=10) .... geomean 284.4Mi 327.5Mi +15.15% Other CPU architectures tried showed very little change (+/-1%) on this particular benchmark but tight mathematical code stands to gain greatly from this optimization Fixes golang#49331

gopherbot · 2023-05-18T04:03:29Z

Change https://go.dev/cl/496095 mentions this issue: cmd/compile: add ilp pass to help balance commutative expressions aiding in instruction level parallelism

…ions Currently the compiler groups expressions with commutative operations such as a + b + c + d as so: (a + (b + (c + d))) which is suboptimal for CPU instruction pipelining. This pass balances commutative expressions as shown above to (a + b) + (c + d) to optimally pipeline them. It also attempts to reassociate constants to as far right of the commutative expression as possible for better constant folding opportunities. Below is a benchmark from crypto/md5 on an MacBook Pro M2: trunk reassociate Hash1K-8 433.7Mi ± 0% 499.4Mi ± 4% +15.17% (p=0.000 n=10) Hash8K-8 454.3Mi ± 1% 524.9Mi ± 1% +15.53% (p=0.000 n=10) .... geomean 284.4Mi 327.5Mi +15.15% Other CPU architectures tried showed very little change (+/-1%) on this particular benchmark but tight mathematical code stands to gain greatly from this optimization Fixes golang#49331

josharian added the Performance label Nov 4, 2021

josharian mentioned this issue Nov 4, 2021

crypto/md5: remove assembly cores #49248

Open

thanm added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 4, 2021

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022

seankhliao added this to the Unplanned milestone Aug 20, 2022

ryan-berger mentioned this issue May 5, 2023

cmd/compile: add reassociate pass to find better optimization opportunities within commutative expressions #60015

Open

ryan-berger linked a pull request May 18, 2023 that will close this issue

cmd/compile: add ilp pass to help balance commutative expressions aiding in instruction level parallelism #60278

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: reorganize associative computation to allow superscalar execution #49331

cmd/compile: reorganize associative computation to allow superscalar execution #49331

josharian commented Nov 4, 2021

seebs commented Nov 4, 2021

josharian commented Nov 5, 2021

mdempsky commented Nov 5, 2021

gopherbot commented May 5, 2023

y1yang0 commented May 6, 2023 •

edited

Loading

CAFxX commented May 6, 2023

ryan-berger commented May 8, 2023

gopherbot commented May 18, 2023

cmd/compile: reorganize associative computation to allow superscalar execution #49331

cmd/compile: reorganize associative computation to allow superscalar execution #49331

Comments

josharian commented Nov 4, 2021

seebs commented Nov 4, 2021

josharian commented Nov 5, 2021

mdempsky commented Nov 5, 2021

gopherbot commented May 5, 2023

y1yang0 commented May 6, 2023 • edited Loading

CAFxX commented May 6, 2023

ryan-berger commented May 8, 2023

gopherbot commented May 18, 2023

y1yang0 commented May 6, 2023 •

edited

Loading