We could get rid of testq and use the value of ZF set by INCQ/DECQ instead.
I have looked into this briefly, and it seems it might be better to move the INC/DEC implementation to AMD64.rules instead. Currently it is handled as a special case inside cmd/compile/internal/amd64/ssa.go.
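For illustration, here is a minimal sketch of the kind of Go code where a decrement is immediately followed by a test against zero. The function name `countdown` is hypothetical, and whether the compiler actually emits a DECQ followed by a TESTQ for it depends on the Go version and on how the loop is lowered; the generated assembly can be inspected with `go build -gcflags=-S`.

```go
package main

import "fmt"

// countdown is a hypothetical example: it decrements n until it reaches
// zero and returns the number of iterations. The loop condition compares
// the freshly decremented value against zero, which is where a separate
// TESTQ could in principle be replaced by the ZF that DECQ already sets.
//
//go:noinline
func countdown(n int64) int64 {
	var steps int64
	for n != 0 {
		n--
		steps++
	}
	return steps
}

func main() {
	fmt.Println(countdown(5))
}
```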
Not using the flags produced by arithmetic instructions directly is a general gap in the optimizer (not just for INC/DEC).
I think I worked around it to some extent by defining a new MUL instruction with a flag output and then optimizing the SETcc/CMP away in https://go-review.googlesource.com/c/go/+/141820. If we are going to do this more generally, we should probably look for a generic way to use the flag output of any instruction in the rules files, for branching instructions that can consume those flags.
Are there micro-benchmarks proving this is a speed increase? Chips do macro-op and micro-op fusion nowadays, and a test followed by a jump is very clearly in that territory. You might end up with smaller binaries, which will help caching and decoding, but see no real speed increase at the µop level.
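As a starting point for such a measurement, here is a rough sketch of a micro-benchmark built on `testing.Benchmark` from the standard library. The function `sumDown` and the iteration count are assumptions chosen for illustration. Go source alone cannot select between the two code sequences, so this would have to be built and run once with an unmodified toolchain and once with a patched one, and the results compared.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the result alive so the call is not optimized away.
var sink int64

// sumDown is a hypothetical workload: a loop whose counter is decremented
// and then compared against zero on every iteration.
//
//go:noinline
func sumDown(n int64) int64 {
	var s int64
	for i := n; i != 0; i-- {
		s += i
	}
	return s
}

func main() {
	// testing.Benchmark runs the function outside of "go test" and
	// reports the iteration count and time per operation.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = sumDown(1 << 16)
		}
	})
	fmt.Println(res)
}
```

Comparing the binary sizes of the two builds alongside the timings would also cover the code-size side of the question.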
There's another thing, for which I think I'll start a new issue.
Sometimes we generate code that prevents macro-op fusion from happening because there's an instruction between TEST and a JUMP or a CMP and a JUMP. That instruction could be put before TEST/CMP and the execution of the program wouldn't change, but the speed would increase.
@jake-ciolek RE: removing instructions in between. I think this can often increase performance, but we should still prove that with microbenchmarks, or prove that it is no worse for speed and better for binary size. AFAIK there can be cases where it doesn't improve performance, and not all amd64 chips do op fusion or handle partial flag updates (INC/DEC) equally well without delays (e.g. if I remember correctly, Sandy Bridge has a flag-merging uop delay after some INC/DEC anyway).