New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: loop invariant code motion #63670
Comments
Do you have a favorite reference for LCSSA? I understand what it's about from reading your patch (and I can see why we would want it) but there are also corner cases (multiple exits to different blocks, exits varying in loop level of target) and it would be nice to see that discussed. I tried searching, but mostly got references to LLVM and GCC internals. Also, how far are you into a performance evaluation for this? Experience in the past (I also tried this) was uncompelling on x86, better but not inspiring for arm64 and ppc64. But things are changing; I'm working on (internal use only) pure and const calls that would allow commoning, hoisting, and dead code elimination, and arm64 is used much more now in data centers than five years ago. |
LLVM's doc is a good reference, maybe I can write some doc later either in go/doc or other place.
Yes! This is the most challenging part.
Experimental data(building go toolchain itself) shows that 98.27% loops can apply LCSSA.
x86 benchmark result:
geomean does not show significant changes. However, if we look at specific cases, we will find that they have significant data fluctuations. There are many performance tuning work. |
I agree that being able to hoist loads of constants and invariants should
help loops in ppc64 and probably arm64. Previous hoisting CLs didn't
attempt loads which is why there was not much benefit, and I couldn't make
it work because of what tighten does with loads is at odds with what
hoisting is trying to do.
I also recently tried to do a change to add PCALIGN to the top of loop body
but because of the way looprotate moved the blocks, the alignment couldn't
be inserted in the correct place. With this improved way of doing
looprotate it should be more straightforward to insert the PCALIGN in the
appropriate spot.
I have done some performance comparisons between gccgo and Golang on ppc64
since gccgo does better at optimizing loops in most cases on ppc64. That
can show what improvement is possible.
…On Mon, Oct 23, 2023 at 9:30 PM Yi Yang ***@***.***> wrote:
Do you have a favorite reference for LCSSA?
LLVM's doc
<https://llvm.org/docs/LoopTerminology.html#loop-closed-ssa-lcssa> is a
good reference, maybe I can write some doc later either in go/doc or other
place.
but there are also corner cases (multiple exits to different blocks, exits
varying in loop level of target) and it would be nice to see that discussed.
Yes! This is the most challenging part.
[image: image]
<https://user-images.githubusercontent.com/5010047/277528772-13f97a7b-4fd3-46f0-b668-cd41abea860a.png>
- For case 1, only one exit block, E1 doms use block, so we insert
proxy phi at E1
- For case2, all exits E1 and E2 dominates all predecessors of use
block, insert proxy phi p1,p2 at E1 and E2 respectively, and insert yet
another proxy phi p3 at use block to merge p1, p2.
- For case3, use block is reachable from E1 and E2, but not E3, this
is hard, maybe we need to we start from all loop exits(including inner loop
exits) though dominance frontier and see if we can reach to the use block.
THis is hard, we give up now.
Experimental data(building go toolchain itself shows that 98.27% loops can
apply LCSSA.
—
Reply to this email directly, view it on GitHub
<#63670 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACH7BDAZ7M4ENQRHATOCOHLYA4R4HAVCNFSM6AAAAAA6K2QRESVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZWGM4TKMZXGM>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
I tried your source tree. It looks like there is a bug when building the go1 benchmark test Fannkuch. This test used to be under test/bench/go1 but it was removed in go 1.21. If you go back to go 1.20 you can build it. One of the loops in the Fannkuch test doesn't exit. I also noticed that hoisting the NilCheck is not always a good thing. I found a similar problem recently with tighten when a load gets moved to a different block. There is a later pass to eliminate unnecessary NilChecks, but it only gets optimized if the NilCheck is in the same block as the load or store that uses the same pointer/address. Look at the loop in math.BenchmarkSqrtIndirectLatency.
|
I don't think this should be a problem. True that hoisting and tightening are working in opposite directions, but tightening should never move anything into a loop, so as long as you're correctly hoisting out of a loop then tightening should not undo that work. (Or tightening has a bug.) |
(Or maybe you mean the rematerialization of constants that regalloc does? We could disable rematerialization for constants that would require a load from a constant pool.) |
I mean loads of constant values which shouldn't have any aliases. I think the rematerialization flag mostly controls where this gets loaded, and tighten has some effect on that. It's been a while since I tried to play with it. Base addresses of arrays, slices, etc. should be constant/invariant and hoisted instead of reloaded at each iteration. I can't seem to get that to happen. I should add that I don't know if there is anything in place right now to allow this type of hoisting even without the rematerialization done in regalloc. This statement is based on experimenting with various CLs that have tried to enable hoisting including the new one mentioned in this issue. |
Here's an example of what I meant above, from test/bench/go1.revcomp (back before it was removed)
|
I think this is just a problem with our register allocator. At this point go/src/cmd/compile/internal/ssa/regalloc.go Line 1402 in 9cdcb01
|
Do you means something like loop alignment. That's interesting.
Thanks a lot! I'll fix this then. The current development is almost complete, and the remaining work mainly involves ensuring robustness and optimizing performance. I will try testing go1 benchmark and golang/benchmarks and then resolve any issues that may arise. (I am thinking about adding support for simple stress testing to the golang compiler, i.e. allow a pass runs between [start pass, end pass], and executing the pass after each pass within that range. This approach would greatly ensure the stability of optimization/transformation.) Update: I confirmed this bug, TBAA thinks they are NoAlias but actually they do alias, because unsafe pointer may aliases with everything
|
Yes, there are several ports that support the PCALIGN directive and currently it is only used in assembly implementations. It would be worthwhile to have the compiler generate it for appropriate loops.
What do you mean by definition point for a constant? Do you mean that the loop hoisting code would generate the definition?
I still don't think the way NilChecks are hoisted is providing the best performance, as I mentioned earlier. |
Currently, if you have this before regalloc:
Then after regalloc it moves v1 inside the loop
We shouldn't do that move if the destination is in a loop. We should force |
Hi, several months ago I submit a patch to implement LICM(#59194), but in that patch the performance improvement was not compelling, after some research we think we need to hoist the high-yield instructions (Load/Store/etc) to achieve a positive performance improvement. Since these high-yield instructions are usually not speculative execution, we may need to implement a fulll LICM with the following dependencies:
Overall, LCSSA opens the door to future loop optimizations and LICM is expected to have attractive performance improvements (I'll update this later), the current work is largely complete, if you're interested you can check out this POC.
Before submitting the complete PR, I'm looking forward to hearing some comments and suggestions from the community about it. Thank you in advance!
The text was updated successfully, but these errors were encountered: