
cmd/compile: unexpected difference in compiled code for returned array #31198

Open
marigonzes opened this issue Apr 1, 2019 · 7 comments
@marigonzes marigonzes commented Apr 1, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What did you do?

I compiled this code:

package test

func test() [16]byte {
    x := [16]byte{0}

    return x
}

What did you expect to see?

I expected the generated assembly code to be similar to this program's code:

package test

func test() [16]byte {
    x := [16]byte{1}

    return x
}

What did you see instead?

Instead, the generated code is much more bloated than I anticipated, even setting up a stack frame:

        subq    $24, SP
        movq    BP, 16(SP)
        leaq    16(SP), BP

        xorps   X0, X0
        movups  X0, "".~r0+32(SP)
        movups  X0, "".x(SP)
        movups  "".x(SP), X0
        movups  X0, "".~r0+32(SP)
        movq    16(SP), BP
        addq    $24, SP
        ret
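Assembly dumps like the above can be reproduced with the compiler's -S flag (the file name test.go is an assumption; any file containing the snippet works):

```shell
# Dump the assembly generated for the file (file name assumed).
go tool compile -p test -S test.go
```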

As pointed out by @vcabbage, this issue is probably related to the assignment to "x", given that the compiler produces better code for the following program:

package test

func test() [16]byte {
    return [16]byte{0}
}

Another interesting case is the one below, where the "good" case above becomes as bad as the "bad" one:

package test

func test() [16]byte {
    x := [16]byte{1}
    y := x

    return y
}
@randall77 (Contributor) commented Apr 1, 2019

Probably a dup of #24416.
Generated code is not optimal for large structs and arrays.

@mundaym (Member) commented Apr 2, 2019

I think this is caused by LocalAddr + VarDef blocking the store-forwarding rewrite rules. This is what the SSA graph looks like just before lowering:

late opt [4974 ns]

    b1:

    v1 (?) = InitMem <mem>
    v2 (?) = SP <uintptr>
    v5 (17) = VarDef <mem> {~r0} v1
    v6 (17) = LocalAddr <*[16]byte> {~r0} v2 v5
    v7 (+17) = Zero <mem> {[16]byte} [16] v6 v5
    v8 (18) = VarDef <mem> {x} v7
    v9 (18) = LocalAddr <*[16]byte> {x} v2 v8
    v10 (+18) = Zero <mem> {[16]byte} [16] v9 v8
    v11 (20) = LocalAddr <*[16]byte> {x} v2 v10
    v12 (20) = VarDef <mem> {~r0} v10
    v13 (+20) = LocalAddr <*[16]byte> {~r0} v2 v12
    v14 (20) = Move <mem> {[16]byte} [16] v13 v11 v12
    v4 (?) = Const64 <int64> [0]

Ret v14 (+20)

Most of the store forwarding rules have special cases to allow them to work despite the presence of VarDefs. However, most of these rules also check that the number of uses of the VarDef memory state is 1, so they are blocked when LocalAddr is used, since it also takes the memory state as an argument.
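The blocking condition can be illustrated with a small sketch. The types below are hypothetical stand-ins, not the real cmd/compile/internal/ssa ones; the point is only that the VarDef's memory state in the dump above has two users (the LocalAddr and the Zero), so a rule requiring exactly one use cannot fire:

```go
package main

import "fmt"

// Value is a minimal stand-in for an SSA value (hypothetical type,
// not the real cmd/compile/internal/ssa.Value).
type Value struct {
	Op   string
	Args []*Value
	Uses int
}

// newValue creates a value and bumps the use count of its arguments.
func newValue(op string, args ...*Value) *Value {
	v := &Value{Op: op, Args: args}
	for _, a := range args {
		a.Uses++
	}
	return v
}

// buildChain mirrors the memory chain in the SSA dump above: the
// VarDef's memory state is consumed by both a LocalAddr and a Zero.
func buildChain() *Value {
	initMem := newValue("InitMem")
	varDef := newValue("VarDef", initMem)
	localAddr := newValue("LocalAddr", varDef)
	newValue("Zero", localAddr, varDef)
	return varDef
}

func main() {
	varDef := buildChain()
	// A store-forwarding rule that requires exactly one use of the
	// VarDef's memory state is blocked here, because the LocalAddr
	// adds a second use.
	fmt.Println("VarDef uses:", varDef.Uses)
	fmt.Println("rule fires:", varDef.Uses == 1)
}
```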

In general this is all quite fragile, sorry. These rules were added before the LocalAddr op was introduced.

@marigonzes (Author) commented Apr 2, 2019

The last example (the "interesting case") is actually a regression from commit 8607b2e:

Before:

xorps   X0, X0
movups  X0, "".~r0+8(SP)
movups  X0, "".~r0+8(SP)
movb    $1, "".~r0+8(SP)
ret

After:

subq    $24, SP
movq    BP, 16(SP)
leaq    16(SP), BP

xorps   X0, X0
movups  X0, "".~r0+32(SP)
movups  X0, "".x(SP)
movb    $1, "".x(SP)
movups  "".x(SP), X0
movups  X0, "".~r0+32(SP)
movq    16(SP), BP
addq    $24, SP
ret

Should I open a separate issue for this?

@mundaym (Member) commented Apr 2, 2019

> The last example (the "interesting case") is actually a regression from commit 8607b2e:

Thanks, that makes sense. Any thoughts, @josharian? I guess the rule added in that commit somehow creates a 'dead end' for the optimization.

> Should I open a separate issue for this?

No, this issue is fine.

@josharian (Contributor) commented Apr 2, 2019

> I guess the rule added in that commit adds a 'dead end' for the optimization somehow.

I think what is happening is that the rule is interfering with dead-auto elimination. Previously, we had 'x -> y -> ret', and both x and y were recognized as dead autos. This rule rewrites 'x -> y -> ret' into 'x -> ret', ironically in the hope that y can be dead-auto-eliminated; this interferes with eliminating x. I'm not sure exactly why yet; maybe dead-auto elimination relies on the store-forwarding rules, which aren't triggering here, in order to be effective?
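The rewrite being described can be pictured as collapsing a chain of block copies. This is a sketch with hypothetical types, not the actual rewrite-rule code:

```go
package main

import "fmt"

// Move is a stand-in for the SSA Move op "dst = src" (hypothetical
// representation, not the compiler's actual one).
type Move struct {
	Dst, Src string
}

// collapse rewrites the copy chain x -> y -> ret into x -> ret,
// dropping the intermediate copy so that y becomes a dead auto.
func collapse(moves []Move) []Move {
	if len(moves) == 2 && moves[1].Src == moves[0].Dst {
		return []Move{{Dst: moves[1].Dst, Src: moves[0].Src}}
	}
	return moves
}

func main() {
	// The chain from the example: x is copied to y, then y to the result.
	chain := []Move{{Dst: "y", Src: "x"}, {Dst: "~r0", Src: "y"}}
	// After the rewrite, x feeds the result directly and y is unused.
	fmt.Println(collapse(chain))
}
```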

From spending a few minutes on this now I suspect that the fix won't be straightforward/easy. So part of the question is whether this is a rare case or not, and thus whether it is worth spending the time to restore this optimization.

@josharian (Contributor) commented Apr 2, 2019

The challenge here is that, at the moment the rule is applied, it is hard to tell whether 'x' can be eliminated or not. I don't see a good approach here, although I'd be open to suggestions. On balance, the optimization does appear to help more than it hurts.

One question:

> However, most of these rules also check that the number of uses of the VarDef memory state is 1, so they are blocked when LocalAddr is used, since it also takes the memory state as an argument.

Is it worth it (and sound!) to allow these rules to count the number of non-LocalAddr uses of the VarDef memory state? The only important thing for LocalAddr's memory arg is that it be correctly ordered in the memory chain for liveness purposes; we should be able to preserve that property.
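The suggested check could look something like the sketch below, again with hypothetical stand-in types rather than the real SSA ones: count only the users of the memory state that are not LocalAddr ops, so the lone LocalAddr in the dump above no longer blocks the rule:

```go
package main

import "fmt"

// Value is a minimal stand-in for an SSA value (hypothetical type,
// not the real cmd/compile/internal/ssa.Value).
type Value struct {
	Op   string
	Args []*Value
	Uses int
}

func newValue(op string, args ...*Value) *Value {
	v := &Value{Op: op, Args: args}
	for _, a := range args {
		a.Uses++
	}
	return v
}

// nonLocalAddrUses counts the users of mem, among the given values,
// that are not LocalAddr ops. A rule could test this count instead
// of the raw use count, per the suggestion above.
func nonLocalAddrUses(mem *Value, all []*Value) int {
	n := 0
	for _, v := range all {
		if v.Op == "LocalAddr" {
			continue
		}
		for _, a := range v.Args {
			if a == mem {
				n++
			}
		}
	}
	return n
}

func main() {
	// The same chain as in the SSA dump: VarDef's memory state is
	// used by a LocalAddr and a Zero.
	initMem := newValue("InitMem")
	varDef := newValue("VarDef", initMem)
	localAddr := newValue("LocalAddr", varDef)
	zero := newValue("Zero", localAddr, varDef)
	all := []*Value{initMem, varDef, localAddr, zero}

	fmt.Println("total uses:", varDef.Uses)
	fmt.Println("uses ignoring LocalAddr:", nonLocalAddrUses(varDef, all))
}
```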

All in all, I'm inclined to wait for Keith's work on SSAing all structs and arrays. That should handle this and much more.

@mundaym (Member) commented Apr 2, 2019

> Is it worth it (and sound!) to allow these rules to count the number of non-LocalAddr uses of the VarDef memory state?

Not sure; it sounds unpleasant to do in SSA. I'd rather we found a way to make these sorts of annotations less intrusive when optimizing. Not that I have any ideas for how to do that...

> All in all, I'm inclined to wait for Keith's work on SSAing all structs and arrays. That should handle this and much more.

Yes, that sounds interesting. Avoiding touching memory entirely, at least early on in the compilation pipeline, seems like the nicest approach to take for these sorts of issues.

@andybons andybons added this to the Unplanned milestone Apr 3, 2019