-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: Get rid of redundant TEST instructions by using the ZF flag of DEC/INC #49113
Comments
The non optimization of using flags from arithmetic instructions directly is a general gap (not just INC/DEC). I think I got a bit around it by defining new instruction for MUL with flag output then optimizing the SETcc CMP away in https://go-review.googlesource.com/c/go/+/141820. If we going to do this more generally we should probably look for a generic way how to easily use the flag output of any instruction in rules files for branching instructions that can consume those flags. |
I think it might go even beyond arithmetic. I've looked at doing this for AND(Q|L)const/XOR(Q|L)const , but ran into issues related to handling blocks... I've changed the Op definition so it emitted two values (Uint and Flags) but was unable to make it work for some reason. It was somehow failing when handling OpAMD64SETEQ / OpAMD64SETNE and others. I think we could get a lot of speedup on x86 by doing this. |
I'd love to see a general purpose approach to this. Maybe a pass that occurs after flagalloc, to avoid the issues around spilling flags. |
Feel free to post a CL showing your changes and the errors if you'd like some help getting unstuck. (Although as mentioned I'd prefer something general purpose.) |
Are there micro-benchmarks proving this is a speed increase? Chips are doing macro-op and micro-op fusions nowadays. The test and the following jump are very clearly in that territory. You might end-up with smaller binaries, which will help caching and decoding, but no real speed increase on the µOp level. |
I believe reducing code size will improve performance, for example this change where we replace longer forms of AND with a shorter one clearly boosted performance: https://go-review.googlesource.com/c/go/+/354970 Additionally, even though TEST/JUMP fuses, DEC/JUMP fuses too. Putting a test in between makes A DEC/TEST/JUMP execute one extra instruction. The same goes for AND, ADD and SUB. |
There's another thing, for which I think I'll start a new issue. Sometimes we generate code that prevents macro-op fusion from happening because there's an instruction between TEST and a JUMP or a CMP and a JUMP. That instruction could be put before TEST/CMP and the execution of the program wouldn't change, but the speed would increase. |
@jake-ciolek RE: removing instructions in between. I think often this can increase performance but I think we should still prove that with microbenchmarks or prove its not worse but better for binary size. There AFAIK can be cases were it doesnt improve performance and not all amd64 do op fusion or deal with partial register flags updates (INC/DEC) equally well without delays. (e.g. if I remember correctly Sandy Bridge has a flag merging uOP delay after some INC/DEC anyway.) |
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
Dumped the Go assembly itself, but this happens for anything compiled with gc.
What did you expect to see?
The compiler making use of the ZF flag value when emitting DEC(Q|L) / INC(Q|L) instructions.
What did you see instead?
Redundant TEST(Q|L) instructions for checking if the result is zero, for example:
We could get rid of testq and use the value of ZF set by INCQ/DECQ instead.
I have looked into this briefly and it seems that it might be better to move the INC/DEC implementation to AMD64.rules instead? Currently it's handled in a special way inside of cmd/cmpile/internal/amd64/ssa.go
The text was updated successfully, but these errors were encountered: