cmd/compile: reorganize associative computation to allow superscalar execution #49331
It's probably good for integers, because with the guarantees Go gives, integer addition is associative in Go. For floating-point values, it isn't always.
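A quick illustration of the point above (a sketch, not from the original thread): in Go, reordering integer additions cannot change the result, because integer overflow wraps around, but reordering `float64` additions can change rounding.

```go
package main

import "fmt"

func main() {
	// Integer addition is associative in Go: wraparound two's-complement
	// arithmetic gives the same result regardless of grouping.
	a, b, c := 1, 2, 3
	fmt.Println((a+b)+c == a+(b+c)) // true

	// Float64 addition is not associative: each addition rounds,
	// so the grouping changes which rounding errors accumulate.
	x, y, z := 0.1, 0.2, 0.3
	fmt.Println((x+y)+z == x+(y+z)) // false
}
```

This is why a reassociation pass must restrict itself to integer (and similarly exact) operations unless the language permits relaxed floating-point semantics.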
Note that it is safe (AFAICT) to rewrite
…ions

Currently the compiler groups expressions with commutative operations such as `a + b + c + d` as `(a + (b + (c + d)))`, which is suboptimal for CPU instruction pipelining. This pass balances commutative expressions as shown above to `(a + b) + (c + d)` to pipeline them optimally. It also attempts to reassociate constants as far right in the commutative expression as possible, for better constant-folding opportunities.

Below is a benchmark from crypto/md5 on a MacBook Pro M2:

```
           trunk         reassociate
Hash1K-8   433.7Mi ± 0%  499.4Mi ± 4%  +15.17% (p=0.000 n=10)
Hash8K-8   454.3Mi ± 1%  524.9Mi ± 1%  +15.53% (p=0.000 n=10)
....
geomean    284.4Mi       327.5Mi       +15.15%
```

Other CPU architectures tried showed very little change (±1%) on this particular benchmark, but tight mathematical code stands to gain greatly from this optimization.

Fixes golang#49331
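As a rough sketch of the balancing idea described above (hypothetical helper, not the actual compiler code — the real pass operates on SSA values, not slices), a balanced grouping splits the operands in half and recurses, producing a tree of depth O(log n) instead of a linear dependency chain of depth O(n):

```go
package main

import "fmt"

// balancedSum adds a slice by forming a balanced tree of additions,
// e.g. ((a+b)+(c+d)) for four elements, rather than the linear chain
// (a+(b+(c+d))). In a compiler this shortens the critical path of
// dependent additions, letting a superscalar CPU execute independent
// subtrees in parallel.
func balancedSum(xs []int) int {
	switch len(xs) {
	case 0:
		return 0
	case 1:
		return xs[0]
	}
	mid := len(xs) / 2
	// The two recursive sums have no data dependency on each other.
	return balancedSum(xs[:mid]) + balancedSum(xs[mid:])
}

func main() {
	fmt.Println(balancedSum([]int{1, 2, 3, 4})) // 10
}
```

The result is identical to the sequential sum for integers; only the shape of the addition tree changes.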
Change https://go.dev/cl/493115 mentions this issue:
Interesting. As far as I know, reassociation in other compilers mainly exists to assist other optimizations (such as SCCP, GCSE, and loop optimizations). Using reassociation to "accelerate CPU superscalar execution" seems to be a new framing — is there any related information or paper? Maybe this would also be helpful for other compilers. Thanks. Java applies reassociation only when a loop invariant participates in a computation inside a loop. GCC trunk does not reassociate.
@y1yang0 see section 9.5 (and in general the whole of chapter 9) of https://www.agner.org/optimize/optimizing_assembly.pdf
@y1yang0 Yes, Clang produces that output because of the Reassociate pass in LLVM. This is an extremely simple analysis pass compared to LLVM's version, and some of the things the LLVM pass does are already taken care of by the rewrite rules, which I didn't know existed until after I submitted the PR, so I've simplified this even further and got rid of constant sorting. If this pass runs again later, after opt, it might find constant-folding opportunities, but it is more important that it run before gcse, since it can help group expressions together nicely. For now the goal is just to accelerate out-of-order execution, and I'll work soon on more of the opportunities this optimization opens up.

It looks like the reason Java does a reassociation optimization only inside of loops is auto-vectorization. This sort of pass helps sort out all the dependencies and lets you fairly easily recognize four consecutive additions that could be turned into SIMD if a cost model decides it is worthwhile. This pass seems to make analysis much easier in a lot of places, though, so I'll be combing through some other compilers and LLVM to see if we can leverage it more in Go.
Change https://go.dev/cl/496095 mentions this issue:
The compiler currently compiles `a+b+c+d` as `a+(b+(c+d))`. It should use `(a+b)+(c+d)` instead, because the latter can be executed out of order.

More broadly, we should balance trees of associative computation.
Doing this in the compiler with a rule tailored for a single computation type in round 4 of md5block.go yielded a 15% throughput improvement. (See below for details.)
It's not obvious to me whether we can do this with carefully crafted rewrite rules or whether a dedicated pass would be better. But it looks like there may be significant performance wins available on tight mathematical code.
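The kind of win described here can be sketched at the source level (illustrative only; function names are hypothetical): keeping two independent accumulators halves the length of the dependency chain, so a superscalar CPU can overlap the additions. For floats this changes rounding, which is exactly why the compiler cannot apply it automatically there.

```go
package main

import "fmt"

// sumChain accumulates serially: every addition depends on the
// previous one, so the additions cannot overlap in the pipeline.
func sumChain(xs []float64) float64 {
	var s float64
	for _, x := range xs {
		s += x
	}
	return s
}

// sumTwoAcc keeps two independent accumulators, shortening the
// dependency chain so pairs of additions can execute in parallel.
func sumTwoAcc(xs []float64) float64 {
	var s0, s1 float64
	for i := 0; i+1 < len(xs); i += 2 {
		s0 += xs[i]
		s1 += xs[i+1]
	}
	if len(xs)%2 == 1 {
		s0 += xs[len(xs)-1]
	}
	return s0 + s1
}

func main() {
	xs := []float64{1, 2, 3, 4, 5}
	fmt.Println(sumChain(xs), sumTwoAcc(xs)) // 15 15
}
```

For small-integer values like these both variants agree exactly; on large float inputs the results may differ slightly due to rounding order.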
I don't plan to work on this further, but I really hope someone else picks it up.
cc @randall77 @martisch @FiloSottile @mmcloughlin @mdempsky
To reproduce the md5 results, disable the optimized assembly routines, and add this rewrite rule:
and disable this one to avoid an infinite loop:
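(The exact rules were attached to the original report and are not reproduced here. Purely for illustration — this is a hypothetical sketch in the style of the compiler's generic rewrite-rule DSL, not the rule used in the experiment — a balancing rule has roughly this shape:)

```
// Hypothetical sketch: rebalance a right-leaning chain of four
// 32-bit additions into a balanced tree.
(Add32 x (Add32 y (Add32 z w))) => (Add32 (Add32 x y) (Add32 z w))
```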