cmd/compile: shorter slicing instruction sequence on x86/amd64 #47969

Open
martisch opened this issue Aug 26, 2021 · 9 comments

@martisch (Contributor) commented Aug 26, 2021

Compiling on tip:

func Slice(x []int, y int) []int {
    return x[y:]
}

and cutting out the function prologue on amd64 we get:

CMPQ    BX, DI
JCS     panicpath
SUBQ    DI, CX
SUBQ    DI, BX
MOVQ    CX, DX
NEGQ    CX
SHLQ    $3, DI
SARQ    $63, CX
ANDQ    CX, DI
ADDQ    DI, AX
MOVQ    DX, CX
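For reference, the generated sequence corresponds to the following Go sketch (hypothetical helper name and raw-pointer arithmetic for illustration; element size 8 for []int):

```go
// sliceTail models what the compiler emits for x[y:] on a []int.
// The scaled offset is masked to zero when the new capacity is zero,
// so an empty result never points past the end of the backing array.
func sliceTail(base uintptr, length, capacity, y int) (ptr uintptr, newLen, newCap int) {
	// CMPQ/JCS: a single unsigned compare also catches y < 0.
	if uint(y) > uint(length) {
		panic("slice bounds out of range")
	}
	newLen = length - y   // SUBQ DI, BX
	newCap = capacity - y // SUBQ DI, CX
	// OpSlicemask (NEGQ + SARQ $63): all ones iff newCap > 0.
	mask := -newCap >> 63
	// ANDQ: zero the scaled offset for an empty result.
	delta := uintptr(y*8) & uintptr(mask)
	return base + delta, newLen, newCap
}
```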

We should be able to slim that down to (not tested):

CMPQ    BX, DI
JCS     panicpath
SUBQ    DI, CX
SUBQ    DI, BX
SHLQ    $3, DI
TESTQ   CX, CX
CMOVZ   CX, DI
ADDQ    DI, AX

by combining the AND and OpSlicemask operations in the SSA generation phase into a single new OpSlicedelta operation:

before:

mask := s.newValue1(ssa.OpSlicemask, types.Types[types.TINT], rcap)
delta = s.newValue2(andOp, types.Types[types.TINT], delta, mask)

after:

delta = s.newValue2(ssa.OpSlicedelta, types.Types[types.TINT], delta, rcap)
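In terms of semantics, the proposed OpSlicedelta would compute the following (a sketch; the op does not exist yet):

```go
// slicedelta returns the pointer delta for a slicing operation:
// zero when the resulting capacity is zero, delta otherwise.
// On amd64 this lowers to TESTQ rcap, rcap followed by CMOVZ,
// which moves the zero rcap into the delta register when ZF is set.
func slicedelta(delta, rcap int) int {
	if rcap == 0 {
		return 0
	}
	return delta
}
```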

By either making the compiler's SSA optimizations smarter or pulling even more operations into a special SSA op, we could save the TESTQ and get to:

CMPQ    BX, DI
JCS     panicpath
SUBQ    DI, BX
SUBQ    DI, CX
CMOVE   CX, DI
SHLQ    $3, DI
ADDQ    DI, AX

However, without benchmarking it is unclear whether this will be any faster (or worth the complexity), since in this variant the scaling of the index for the delta happens after the CMOV.

A further reduction in instructions is possible by moving the panic jumps to be dependent on the SUB instructions:

SUBQ    DI, BX
JS      panicpath
SUBQ    DI, CX
CMOVE   CX, DI
SHLQ    $3, DI
ADDQ    DI, AX

That will then need extra handling to recover the original slice len/cap in the panicpath.
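Recovering the original length in the panic path amounts to undoing the subtraction (illustrative arithmetic, not actual compiler output):

```go
// After SUBQ DI, BX the register holds len-y, so the panic path
// must add y back before formatting the bounds error.
func recoverLen(subResult, y int) int {
	return subResult + y
}
```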

Finally, for this specific case, the SHL and ADD can be folded into a LEA:

SUBQ    DI, BX
JS      panicpath
SUBQ    DI, CX
CMOVE   CX, DI
LEAQ    [AX+DI*8], AX

/cc @randall77 @josharian @mdempsky

@josharian (Contributor) commented Aug 26, 2021

Nice. You planning to do this?

cc also @mundaym who also experimented with alternative SSA slicing ops, if memory serves

@martisch (Contributor, Author) commented Aug 26, 2021

I plan to make a CL for at least the first step using OpSlicedelta, but I'm happy if anyone gets to that CL before me (that's why I wrote it down here first), as I have other things I promised to do sooner in the CL-making pipeline.

The other steps are likely more involved, as they will probably need more complex rules, new arch-specific lowering ops, and flag handling in the SSA optimization rules. It might only be worth it if we first put in work to simplify the actual rules by adding more first-class support for flag dependencies on arithmetic ops. Last time I worked with them, it required defining new ops with explicit flag outputs, which may then inhibit other optimizations from triggering.

@randall77 (Contributor) commented Aug 26, 2021

My memory is convinced that a while back someone did some experiments showing that CMOV* operations were slow on some processors (needed extra ports on the register file?). Not sure if that is relevant any more. Or my memory is bad; I don't see anything obvious in an issue search.

Similar issues: #8448 #27780

@martisch (Contributor, Author) commented Aug 26, 2021

AFAIK CMOV performance depends on the dependency chain, since both values need to be calculated, and it can be worse when branches are nicely predictable or can skip large amounts of work. With or without CMOV we compute the scaled index here anyway, so that's the same. However, for the mask we can replace the NEG and SAR with a TEST (or reuse the SUB).

As a data point for support: gccgo -O2 generates a CMOV for setting the slice pointer too.

@josharian (Contributor) commented Aug 26, 2021

Keith: maybe #26306? That also links to some other issues.

@randall77 (Contributor) commented Aug 26, 2021

@josharian Thanks, definitely on point. But I don't think that's the one I'm remembering - that has to do with depending on arguments, for which the current masking scheme has a similar problem. I'm remembering something about cmov being generally slow (several uops, maybe?). In any case, cmov might be better now. Just wanted to make sure we benchmark it (on older processors?) before pulling the trigger.

@martisch (Contributor, Author) commented Aug 26, 2021

Definitely agreed on the benchmarking. I think we can aim for Sandy Bridge or newer on amd64 as the performance target. But I can power up my Core 2 and Athlon64 if need be.

@josharian (Contributor) commented Aug 26, 2021

And if it’s slow only on older machines, it can be a GOAMD64 candidate. And of course there are the other arches.

@mundaym (Member) commented Aug 26, 2021

> cc also @mundaym who also experimented with alternative SSA slicing ops, if memory serves

Yeah, I experimented with changing the SSA representation of slices. Rather than representing the ptr, len, and cap values directly, I changed SSA to represent slices as ptr, len, and cap-len values. This meant that the cap() operation required an extra addition (to add back len), but cap-len is often loop-invariant when re-slicing in a loop. There is a more detailed description in the commit message:

https://go-review.googlesource.com/c/go/+/170125

It wouldn't make any difference to the example in this issue unless it was called in a loop.
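The loop-invariance argument can be checked directly in Go today (a small demo, not tied to CL 170125):

```go
// consume re-slices x down to empty. cap(x)-len(x) never changes
// under x = x[1:], which is exactly what the (ptr, len, cap-len)
// representation exploits: that component can be hoisted out of the loop.
func consume(x []int) {
	spare := cap(x) - len(x) // loop invariant
	for len(x) > 0 {
		x = x[1:]
		if cap(x)-len(x) != spare {
			panic("cap-len changed while re-slicing")
		}
	}
}
```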

@toothrot toothrot added this to the Backlog milestone Aug 26, 2021