Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: assembly generated is bigger than previous versions #30229

Open
marigonzes opened this issue Feb 14, 2019 · 6 comments

Comments

Projects
None yet
3 participants
@marigonzes
Copy link

commented Feb 14, 2019

What version of Go are you using (go version)?

$ go version
go version go1.11.4 windows/amd64

Does this issue reproduce with the latest release?

Yes. I tried on tip and the same happens.

What did you do?

I compiled the following code and inspected the assembly instructions generated by the compiler:

func test(slc [][]int) (int, int) {
	var lentotal, lenslc int
	for _, x := range slc {
		lentotal += len(x)
		lenslc++
	}
	return lentotal, lenslc
}

Assembly code generated:

What did you expect to see?

I expected the compiler to generate code similar to the 1.10.1 version, because on tip it generates unnecessary jumps and an extra block of XOR's.

What did you see instead?

Instead, the compiler generated more code than what is necessary.
In the 1.10.1 version, the only thing that I think could be different is that, on line 5, the slice address is moved to CX, but it might not be necessary in the case that len(slc) is 0, which is well handled on tip.

Summing up, I believe the code should look something like:

        pcdata  $2, $0
        pcdata  $0, $0
        xorl    DX, DX
        xorl    BX, BX
        movq    "".slc+16(SP), AX
        testq   AX, AX
        jle     test_pc37
        pcdata  $2, $1
        pcdata  $0, $1
        movq    "".slc+8(SP), CX
        jmp     test_pc25
test_pc21:
        addq    $24, CX
test_pc25:
        addq    8(CX), BX
        incq    DX
        cmpq    DX, AX
        jlt     test_pc21
test_pc37:
        pcdata  $2, $0
        movq    BX, "".~r1+32(SP)
        movq    DX, "".~r2+40(SP)
        ret

Regarding the test_pc21 block, it could disappear, as is done in the 1.10.1 version.

@mvdan

This comment has been minimized.

Copy link
Member

commented Feb 14, 2019

Is this about performance, binary size, or just correctness?

@marigonzes

This comment has been minimized.

Copy link
Author

commented Feb 14, 2019

In terms of correctness, I believe both versions are correct in the sense that they produce the desired behavior.
In this issue, my point is more related towards better performance and smaller binary size, by eliminating redundant JUMP instructions and unneeded assembly code blocks.

@mvdan mvdan added the Performance label Feb 14, 2019

@mvdan

This comment has been minimized.

Copy link
Member

commented Feb 14, 2019

Fair enough - was just wondering to appropriately label the issue :)

If you'd like to help get this issue fixed faster, you could try bisecting which Go commit introduced the regression. For example, you could bisect between the go1.10 and go1.11 tags, running make.bash at each step, building a tiny program, and checking its size or number of assembly instructions.

@marigonzes

This comment has been minimized.

Copy link
Author

commented Feb 14, 2019

After following your advice, I found that the commit that started generating this code was 837ed98, which makes sense.
According to the commit message, I believe the test_pc21 block makes sense to exist to prevent the past-the-end pointer. Now, what I think could be made better is the duplicated XOR's, which could be put together as the first instructions in the function, as shown in the assembly I wrote above.

@mvdan

This comment has been minimized.

Copy link
Member

commented Feb 14, 2019

/cc @aclements @randall77 @dr2chase; please see the comments above.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Feb 14, 2019

Looks like the phi tighten pass introduces the duplicate xors. If I turn that pass off, then the register allocator also introduces duplicate xors. I can't turn off the register allocator :(

The reason for both of those behaviors is that we're trying to avoid excess register pressure by loading constants into registers as late as possible. For an example reason why, see #16407. It sounds like we're being a bit too aggressive in that regard, but I'm not sure how to design a knob to adjust it, or how to decide where to set it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.