cmd/compile: amd64 carry flag spilling uses SBBQ + NEGQ instead of SETCS #68961
Comments
cc @golang/compiler
If we knew that the input was only 0 or 1 then the zeroing would be unnecessary. Maybe we can infer that because it is passed as an arg to Maybe we could use only the low byte within the loop and extend it just on return, but that's tricky. Calling We could use
Another option is to represent the carry as 0/-1 instead of 0/1. Then we don't need the
That's a great point about the high bits: it would violate In any case,
I believe
That would be nice. I think my 0/-1 idea does that also. (But really unrolling is probably the better solution here. See math/big/arith_amd64.s:addVV. It's a 1 instruction dependency chain within the loop.)
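To make the two carry representations concrete, here is a small Go model of the arithmetic only. It is a sketch under assumptions: the helper names spillMask, maskToBit and restoreFlag are hypothetical, and the code models what the SBBQ reg, reg and NEG instructions compute, not what the compiler actually emits.

```go
package main

import "fmt"

// spillMask models SBBQ r, r (hypothetical helper name): given the carry
// flag as input, it yields 0 when the flag is clear and -1 (all bits set)
// when the flag is set.
func spillMask(cf bool) uint64 {
	if cf {
		return ^uint64(0)
	}
	return 0
}

// maskToBit models the NEGQ that follows SBBQ: it turns 0/-1 into 0/1.
func maskToBit(mask uint64) uint64 {
	return -mask
}

// restoreFlag models reloading the carry flag from a spilled value.
// NEG sets the carry flag exactly when its operand is nonzero, so the
// result is the same for the 0/1 and the 0/-1 representation.
func restoreFlag(v uint64) bool {
	return v != 0
}

func main() {
	for _, cf := range []bool{false, true} {
		mask := spillMask(cf)
		bit := maskToBit(mask)
		fmt.Println(cf, int64(mask), bit, restoreFlag(mask) == restoreFlag(bit))
	}
}
```

In this model the 0/-1 value is already what spillMask returns, so a spill that stores it directly would not need the maskToBit step, and restoreFlag behaves the same on either form.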
🤦 Of course, sorry.
It does on the Zen microarchitectures, but none of the Intel chips that I know of. In any case,
Unrolling helps, though I'm not sure it's an either-or situation. Outside of pure bignum arithmetic, the amount you can usefully unroll a loop involving carries on amd64 is limited. The example I gave is cut down from a more complex loop I was converting from C. That loop was unrolled with a stride of 2. Other bitwise logic in the original loop caused either register or carry flag spilling if the loop was unrolled further. For that loop, the 0/-1 representation or mov+setcs should still speed it up by 20% after unrolling.
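For illustration, a carry loop unrolled with a stride of 2 might look like the following in Go. This is a sketch, not the loop discussed in this thread: addChain2 is a hypothetical name, the slices are assumed to have equal and even length, and the extra bitwise logic of the original loop is left out.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addChain2 is a hypothetical example. It adds y into x, two words per
// iteration (least significant word first), and returns the final carry,
// which is always 0 or 1.
// Assumption: len(x) == len(y) and the length is even.
func addChain2(x, y []uint64) uint64 {
	var carry uint64
	for i := 0; i+1 < len(x); i += 2 {
		// Within one iteration the carry passes straight from the first
		// addition to the second. Only the carry leaving the second
		// addition has to survive until the next iteration.
		x[i], carry = bits.Add64(x[i], y[i], carry)
		x[i+1], carry = bits.Add64(x[i+1], y[i+1], carry)
	}
	return carry
}

func main() {
	x := []uint64{^uint64(0), ^uint64(0), 0, 0}
	y := []uint64{1, 0, 0, 0}
	carry := addChain2(x, y)
	fmt.Println(x, carry)
}
```

With a stride of 2, the carry crosses the loop back-edge once per two additions instead of once per addition, which is why any per-iteration cost of saving and restoring it is halved but not removed.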
Go version
go version go1.23.0 linux/amd64
Output of go env in your module/workspace:
What did you do?
This code is a simplified form of a longer unrolled loop, with the non-carry-related logic removed:
https://go.dev/play/p/gGVkiLN6qbV
https://go.godbolt.org/z/W313f1EYG
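The linked snippets are not reproduced in this text. As a rough sketch only, and assuming the loop chains bits.Add64 from math/bits with the carry handed from one iteration to the next, a loop of the following general shape keeps a carry live across iterations. The name addChain and its signature are hypothetical and are not taken from the linked code.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addChain is a hypothetical stand-in, not the code behind the links.
// It adds y into x word by word (least significant word first) and
// returns the final carry, which is always 0 or 1.
// Assumption: len(y) >= len(x).
func addChain(x, y []uint64) uint64 {
	var carry uint64
	for i := range x {
		// The carry-out of one iteration is the carry-in of the next,
		// so it must be kept alive across the loop back-edge.
		x[i], carry = bits.Add64(x[i], y[i], carry)
	}
	return carry
}

func main() {
	x := []uint64{^uint64(0), 0}
	y := []uint64{1, 0}
	carry := addChain(x, y)
	fmt.Println(x, carry)
}
```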
What did you see happen?
On amd64, the compiled loop has a throughput of one iteration every four cycles:
The bottleneck is the NEGL -> ADCQ -> SBBQ -> NEGQ dependency chain.
What did you expect to see?
The SBBQ / NEGQ pair should use SETCS instead, e.g.
This shortens the dependency chain to three instructions.