Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: improve interface dispatch performance #29276

Open
robaho opened this Issue Dec 14, 2018 · 9 comments

Comments

Projects
None yet
5 participants
@robaho
Copy link

commented Dec 14, 2018

Reviewing the following code:

image

and the generated assembly:

varint_bench_test.go:14       0x10f07ee               488b4c2430              MOVQ 0x30(SP), CX
varint_bench_test.go:14       0x10f07f3               488b5118                MOVQ 0x18(CX), DX
...
varint_bench_test.go:14       0x10f0807               ffd2                    CALL DX
 
TEXT command-line-arguments.WriteUvarint(SB) /Users/robertengels/gotest/gotest/varint_bench_test.go
  varint_bench_test.go:12       0x10f07b0               65488b0c2530000000      MOVQ GS:0x30, CX
  varint_bench_test.go:12       0x10f07b9               483b6110                CMPQ 0x10(CX), SP
  varint_bench_test.go:12       0x10f07bd               0f869f000000            JBE 0x10f0862
  varint_bench_test.go:12       0x10f07c3               4883ec28                SUBQ $0x28, SP
  varint_bench_test.go:12       0x10f07c7               48896c2420              MOVQ BP, 0x20(SP)
  varint_bench_test.go:12       0x10f07cc               488d6c2420              LEAQ 0x20(SP), BP
  varint_bench_test.go:13       0x10f07d1               488b442440              MOVQ 0x40(SP), AX
  varint_bench_test.go:13       0x10f07d6               eb09                    JMP 0x10f07e1
  varint_bench_test.go:18       0x10f07d8               488b442440              MOVQ 0x40(SP), AX
  varint_bench_test.go:18       0x10f07dd               48c1e807                SHRQ $0x7, AX
  varint_bench_test.go:13       0x10f07e1               483d80000000            CMPQ $0x80, AX
  varint_bench_test.go:13       0x10f07e7               7243                    JB 0x10f082c
  varint_bench_test.go:13       0x10f07e9               4889442440              MOVQ AX, 0x40(SP)
  varint_bench_test.go:14       0x10f07ee               488b4c2430              MOVQ 0x30(SP), CX
  varint_bench_test.go:14       0x10f07f3               488b5118                MOVQ 0x18(CX), DX
  varint_bench_test.go:14       0x10f07f7               83c880                  ORL $-0x80, AX
  varint_bench_test.go:14       0x10f07fa               88442408                MOVB AL, 0x8(SP)
  varint_bench_test.go:14       0x10f07fe               488b442438              MOVQ 0x38(SP), AX
  varint_bench_test.go:14       0x10f0803               48890424                MOVQ AX, 0(SP)
  varint_bench_test.go:14       0x10f0807               ffd2                    CALL DX
  varint_bench_test.go:14       0x10f0809               488b442418              MOVQ 0x18(SP), AX
  varint_bench_test.go:14       0x10f080e               488b4c2410              MOVQ 0x10(SP), CX
  varint_bench_test.go:15       0x10f0813               4885c9                  TESTQ CX, CX
  varint_bench_test.go:15       0x10f0816               74c0                    JE 0x10f07d8
  varint_bench_test.go:16       0x10f0818               48894c2448              MOVQ CX, 0x48(SP)
  varint_bench_test.go:16       0x10f081d               4889442450              MOVQ AX, 0x50(SP)
  varint_bench_test.go:16       0x10f0822               488b6c2420              MOVQ 0x20(SP), BP
  varint_bench_test.go:16       0x10f0827               4883c428                ADDQ $0x28, SP
  varint_bench_test.go:16       0x10f082b               c3                      RET
  varint_bench_test.go:20       0x10f082c               488b4c2430              MOVQ 0x30(SP), CX
  varint_bench_test.go:20       0x10f0831               488b4918                MOVQ 0x18(CX), CX
  varint_bench_test.go:20       0x10f0835               88442408                MOVB AL, 0x8(SP)
  varint_bench_test.go:20       0x10f0839               488b442438              MOVQ 0x38(SP), AX
  varint_bench_test.go:20       0x10f083e               48890424                MOVQ AX, 0(SP)
  varint_bench_test.go:20       0x10f0842               ffd1                    CALL CX
  varint_bench_test.go:20       0x10f0844               488b442418              MOVQ 0x18(SP), AX
  varint_bench_test.go:20       0x10f0849               488b4c2410              MOVQ 0x10(SP), CX
  varint_bench_test.go:20       0x10f084e               48894c2448              MOVQ CX, 0x48(SP)
  varint_bench_test.go:20       0x10f0853               4889442450              MOVQ AX, 0x50(SP)
  varint_bench_test.go:20       0x10f0858               488b6c2420              MOVQ 0x20(SP), BP
  varint_bench_test.go:20       0x10f085d               4883c428                ADDQ $0x28, SP
  varint_bench_test.go:20       0x10f0861               c3                      RET
  varint_bench_test.go:12       0x10f0862               e89948f6ff              CALL runtime.morestack_noctxt(SB)
  varint_bench_test.go:12       0x10f0867               e944ffffff              JMP command-line-arguments.WriteUvarint(SB)
  :-1                           0x10f086c               cc                      INT $0x3
  :-1                           0x10f086d               cc                      INT $0x3
  :-1                           0x10f086e               cc                      INT $0x3
  :-1                           0x10f086f               cc                      INT $0x3

The generated code loads the interface address using double indirection in every loop invocation, and every call (line 14 & 20).

The compiler could easily generate optimized code where the DX is loaded once and used for every interface call, as w is constant in the method.

It is my opinion that loops like this are very common in typical Go code and deserve more optimization attention. As an example, issue #29010 makes specific reference to not using interfaces as call sites due to their inefficiency.

At a minimum the call address could be placed in a stack local avoiding one indirection.

A more advanced change might be to reserve a couple of general purpose registers for the hot interface call address and object reference (r10/r11) and so push/pop r10/r11 on entry/exit for those routines using the optimization.

Issue #18597 has some overlap with this.

@robaho

This comment has been minimized.

Copy link
Author

commented Dec 15, 2018

I can see why it’s problematic - due to the GC and stack growth, but only the dedicated struct pointer would need fix up. The interface fn pointer doesn’t need it.

@martisch

This comment has been minimized.

Copy link
Member

commented Dec 15, 2018

The compiler could easily generate optimized code where the DX is loaded once and used for every interface call, as w is constant in the method.

Note that I think DX would currently need to be at least spilled and reloaded on every iteration since the function executed with the interface call could clobber DX.

@robaho

This comment has been minimized.

Copy link
Author

commented Dec 15, 2018

@josharian

This comment has been minimized.

Copy link
Contributor

commented Dec 15, 2018

Registers are precious on and64. If we are going to dedicate one to a single purpose it’ll probably be for the current g.

Changing the ABI (e.g. to support callee-save registers) is a major undertaking...and one which we are actively working on.

With callee-save registers, it is reasonably likely that a good register allocator will already choose a callee-save register for this, but it would be good to confirm (once that hypothetical becomes reality).

@randall77

This comment has been minimized.

Copy link
Contributor

commented Dec 15, 2018

It would be nice to teach the compiler that there's no need to reload from slots of an itab - there's nothing that writes (already initialized) itabs.

I'm not sure how much performance it actually costs, at least in this example. The branch predictor is going to predict the call correctly, so nothing is actually waiting for the results of these loads. As long as L1 has the bandwidth for these loads it won't slow anything down.

It would be good to find a benchmark where we could see an improvement. Not sure what that would look like, or how we would know without implementing the optimization. Maybe use performance counters to find a benchmark with stalls on these instructions?

@robaho

This comment has been minimized.

Copy link
Author

commented Dec 15, 2018

I am working on some hand assembly to test the performance difference.

@josharian

This comment has been minimized.

Copy link
Contributor

commented Dec 15, 2018

It would be nice to teach the compiler that there's no need to reload from slots of an itab - there's nothing that writes (already initialized) itabs.

I was also thinking that but I don’t see what the implementation would look like. Do you have a suggestion for how?

@robaho

This comment has been minimized.

Copy link
Author

commented Dec 15, 2018

Initial testing shows no performance improvement when you need to save/reload the register on each call. It seems the only viable solution is to dedicate registers for interface dispatch and make them callee save - once the GC/safepoint occurs all registers appear to be trashed trashed - which in most cases (outside of GC/runtime) they should not need to be saved/restored.

I think this would go along ways towards improving the performance of Go while encouraging interface based design.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Dec 15, 2018

I was also thinking that but I don’t see what the implementation would look like. Do you have a suggestion for how?

Nope. If we're going to do any sort of alias analysis to enable more aggressive store-load forwarding, the itab special case would be easy to incorporate. Not sure how to do the former, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.