cmd/compile: shorter slicing instruction sequence on x86/amd64 #47969
Comments
Nice. You planning to do this? cc also @mundaym, who also experimented with alternative SSA slicing ops, if memory serves.
I plan to make a CL for at least the first step, using OpSlicedelta. But happy if anyone gets to make that CL before me (that's why I wrote it down here first), as I have other things I promised to do sooner rather than later in the CL-making pipeline. The other steps are likely more involved, as they probably need some more complex rules, new lowered arch-specific ops, and flag handling in SSA optimization rules. It might only be worth it if we first put in work to simplify the actual rules by having more first-class support for flag dependencies on arithmetic ops. Last time I worked with them, it required defining new ops with explicit flag outputs, which may then inhibit other optimizations from triggering.
My memory is convinced that a while back someone did some experiments that showed CMOV* operations were slow on some processors (needed extra ports on the register file?). Not sure if that is relevant any more. Or my memory is bad; I don't see anything obvious in an issue search.
AFAIK CMOV performance depends on the dependency chain, as both values need to be calculated; it can be worse when branches are nicely predictable or can skip large amounts of work. With or without CMOV we compute the scaled index here anyway, so that's the same. However, for the mask we can replace the NEG and SAR with a TEST (or reuse the SUB). As a data point in support: gccgo -O2 generates a CMOV for setting the slice pointer too.
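For illustration only, a hedged sketch of the two mask strategies being compared; the register assignment is hypothetical (BX holds the new length, DI the pointer delta, DX a scratch register):

```
// current: materialize a mask with NEG+SAR, then AND
MOVQ    BX, DX
NEGQ    DX
SARQ    $63, DX        // DX = -1 if newlen != 0, else 0
ANDQ    DX, DI         // zero the delta when newlen == 0

// alternative: a TEST (or the flags of the earlier SUB) feeding a CMOV
MOVQ    $0, DX         // MOV leaves the flags intact
TESTQ   BX, BX
CMOVQEQ DX, DI         // zero the delta when newlen == 0
```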
Keith: maybe #26306? That also links to some other issues. |
@josharian Thanks, definitely on point. But I don't think that's the one I'm remembering - that has to do with depending on arguments, for which the current masking scheme has a similar problem. I'm remembering something about cmov being generally slow (several uops, maybe?). In any case, cmov might be better now. Just wanted to make sure we benchmark it (on older processors?) before pulling the trigger. |
Definitely agreed on the benchmarking. I think we can aim for "old" being Sandy Bridge or newer on amd64 as the performance target. But I can power up my Core 2 and Athlon 64 if need be.
And if it’s slow only on older machines, it can be a GOAMD64 candidate. And of course there are the other arches. |
Yeah, I experimented with changing the SSA representation of slices: https://go-review.googlesource.com/c/go/+/170125. It wouldn't make any difference to the example in the issue unless perhaps it was called in a loop.
Is this a duplicate of #17638? |
Compiling on tip:
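The Go source block was not preserved in this copy of the issue; a representative function (an assumption, consistent with the 8-byte element size implied by the SHL $3 discussed below) would be:

```go
package main

// f returns the tail of s starting at index i.
// Inspect the generated code with: go build -gcflags=-S
func f(s []int, i int) []int {
	return s[i:]
}
```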
and cutting out the function prologue on amd64 we get:
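The assembly listing is likewise missing here; what follows is a hedged reconstruction of the sequence the discussion implies, with a hypothetical register assignment based on the amd64 register ABI (AX = s.ptr, BX = s.len, CX = s.cap, DI = i):

```
	CMPQ	DI, BX          // bounds check: i <= len
	JA	slicepanic
	SUBQ	DI, CX          // newcap = cap - i
	SUBQ	DI, BX          // newlen = len - i
	MOVQ	BX, DX
	NEGQ	DX              // OpSlicemask:
	SARQ	$63, DX         //   DX = -1 if newlen != 0, else 0
	SHLQ	$3, DI          // delta = i * 8
	ANDQ	DX, DI          // zero the delta when newlen == 0
	ADDQ	DI, AX          // ptr += delta
	RET
slicepanic:
	// call into the runtime's slice-bounds panic
```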
We should be able to slim that down to (not tested):
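Again the original listing is not preserved; a hedged sketch of what the slimmer sequence could look like, replacing the NEG/SAR/AND mask with a TEST and a CMOV (same hypothetical registers as above):

```
	CMPQ	DI, BX          // bounds check: i <= len
	JA	slicepanic
	SUBQ	DI, CX          // newcap = cap - i
	SUBQ	DI, BX          // newlen = len - i
	SHLQ	$3, DI          // delta = i * 8
	ADDQ	AX, DI          // candidate new pointer
	TESTQ	BX, BX          // newlen == 0 ?
	CMOVQNE	DI, AX          // advance ptr only when newlen != 0
	RET
```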
by pulling the AND and the OpSlicemask operation in the SSA generation phase into a single new OpSlicedelta operation:
before:
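(The SSA snippets were dropped from this copy; this is a hedged pseudo-SSA sketch with value names invented for illustration.)

```
// before: mask computed as its own value, then ANDed into the delta
v1 = Slicemask <int> newlen      // -1 if newlen != 0, else 0
v2 = Mul64 <int> i elemsize      // byte delta
v3 = And64 <int> v2 v1           // masked delta
v4 = AddPtr <*T> ptr v3          // new slice pointer
```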
after:
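And the corresponding hedged sketch after the change, with OpSlicedelta being the proposed op:

```
// after: one combined op, free to lower to NEG/SAR/AND or TEST/CMOV
v1 = Mul64 <int> i elemsize      // byte delta
v2 = Slicedelta <int> v1 newlen  // v1 if newlen != 0, else 0 (proposed op)
v3 = AddPtr <*T> ptr v2          // new slice pointer
```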
By either making the compiler's SSA optimizations smarter or pulling even more operations into a special SSA op, we could save the TESTQ and get to:
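A hedged sketch of such a sequence (same hypothetical registers as above); note that the index is scaled after the CMOV here, which is the caveat raised next:

```
	CMPQ	DI, BX          // bounds check: i <= len
	JA	slicepanic
	SUBQ	DI, CX          // newcap = cap - i
	MOVQ	$0, DX          // MOV leaves the flags intact
	SUBQ	DI, BX          // newlen = len - i, sets ZF
	CMOVQEQ	DX, DI          // zero the index when newlen == 0 (reuses the SUB's flags)
	SHLQ	$3, DI          // scaling happens after the CMOV
	ADDQ	DI, AX          // ptr += delta
	RET
```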
However, it is unclear without benchmarking whether this will be any faster (or worth the complexity), since the scaling of the index for the delta then happens after the CMOV.
A further reduction in instructions is possible by moving the panic jumps to depend on the SUB instructions:
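A hedged sketch of this step, folding the bounds check into the length SUB (the carry flag is set on unsigned underflow, i.e. when i > len):

```
	SUBQ	DI, CX          // newcap = cap - i
	SUBQ	DI, BX          // newlen = len - i; CF set when i > len
	JCS	slicepanic      // bounds check rides on the SUB's flags
	MOVQ	$0, DX          // MOV leaves the flags intact
	CMOVQEQ	DX, DI          // ZF is still the length SUB's: zero the index when newlen == 0
	SHLQ	$3, DI          // delta = i * 8
	ADDQ	DI, AX          // ptr += delta
	RET
slicepanic:
	// len/cap were already overwritten here; see the note below
```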
That will then need extra handling to recover the original slice len/cap on the panic path.
Lastly, for this specific case, the SHL and ADD can be folded into a LEA:
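A hedged sketch of the final form; combined with the steps above, this gets the body down to six instructions plus the return:

```
	SUBQ	DI, CX          // newcap = cap - i
	SUBQ	DI, BX          // newlen = len - i; CF set when i > len
	JCS	slicepanic
	MOVQ	$0, DX
	CMOVQEQ	DX, DI          // zero the index when newlen == 0
	LEAQ	(AX)(DI*8), AX  // SHL and ADD folded into one LEA
	RET
```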
/cc @randall77 @josharian @mdempsky