
cmd/compile: miscellaneous optimizations #24958

Closed
benshi001 opened this issue Apr 20, 2018 · 8 comments

@benshi001
Member

commented Apr 20, 2018

My optimization plan for arm64 in go-1.12:

  1. further optimization of (shifted) register-indexed load/store, especially its interaction with combined load/store.
    1.1 combining uint16/uint32 loads/stores into a wider type
    1.2 both BE and LE should be supported in the same situation
    1.3 loads/stores issued in order from high memory down to low memory (this is not BE-specific; LE loads/stores can also go from the high-memory/upper byte down to the low-memory/lower byte)

  2. (shifted) register indexed load/store for FP. (assembler and compiler)

  3. optimization with MADD/MSUB/MADDW/MSUBW
    "MUL R1, R2, R3"
    "ADD R3, R4" can be optimized to
    "MADD R1, R2, R3, R4"
    I expect to see both performance improvement and code size reduction with MADD/MSUB emitted.

  4. optimize comparisons with ANDS/ADDS/SUBS

  5. optimize atomic operations with SWPALD/SWPALW/SWPALH/SWPALB

  6. STADD/STSUB/STEOR/STOR
    These instructions operate directly on a memory operand atomically; I expect to see both performance improvement and code size reduction when they are emitted.

  7. BFC - bit field clear

  8. constant pool
    "ADD $0x00aaaaaa, Rx" is assembled to "LDR off(PC), Rtmp" + "ADD Rtmp, Rx", and a 4-byte item is stored in the constant pool, so 12 bytes are used in total.
    It can be optimized to "ADD $0xaaa, Rx" + "ADD $0xaaa000, Rx", which costs only 8 bytes.
    This optimization does not improve performance, but does reduce code size.

  9. FMULA/FMULS
    (FADD z:(FMUL x y) a) -> (FMULA x a y)
    a restriction z.Uses==1 is needed for this to improve performance.

  10. constant pool
    use "MOV $0xaaaa, Rx" + "MOVK $0xbbbb0000, Rx" instead of loading a 32-bit 0xbbbbaaaa from the constant pool;
    use "MOV $0xaaaa, Rx" + "MOVK $0xbbbb0000, Rx" + "MOVK $0xcccc00000000, Rx" instead of loading a 64-bit 0xccccbbbbaaaa from the constant pool.

  11. LDP/STP for FP (assembler and compiler)
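
Items 3 and 8 above can be illustrated with a small Go sketch (the function names `madd` and `splitConst` are mine, for illustration only; the actual lowering rules live in the compiler's arm64 SSA files):

```go
package main

import "fmt"

// madd computes a*b + c. Under the planned rule, when the MUL result has
// only this single use, the arm64 backend can fuse the MUL and ADD into
// one MADD instruction.
func madd(a, b, c int64) int64 {
	return a*b + c
}

// splitConst illustrates the constant-pool item: adding 0x00aaaaaa can be
// expressed as two adds of 12-bit-shifted immediates (0xaaa and 0xaaa000),
// avoiding a constant-pool load.
func splitConst(x int64) int64 {
	return x + 0xaaa + 0xaaa000 // equals x + 0x00aaaaaa
}

func main() {
	fmt.Println(madd(3, 4, 5))               // 17
	fmt.Println(splitConst(0) == 0x00aaaaaa) // true
}
```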

@agnivade agnivade changed the title optimize ARM64 code cmd/compile/internal/arm64: miscellaneous optimizations Apr 20, 2018

@agnivade agnivade added this to the Go1.11 milestone Apr 20, 2018

@benshi001 benshi001 modified the milestones: Go1.11, Go1.12 Apr 20, 2018

@benshi001 benshi001 self-assigned this Apr 20, 2018

@benshi001

Member Author

commented May 17, 2018

Some optimizations can also be done for ARM:

  1. BICconst <-> ANDconst, SUBconst <-> ADDconst when it yields a smaller constant

  2. MULA/MULS
    2.1 (ADD a h:(MUL x y)) -> (MULA a x y) needs h.Uses==1
    2.2 (CMPconst [0] (MULA a x y)) -> (CMN a (MUL x y))

  3. optimize comparisons
    3.1 (CMPconst [0] L:(ADD x y)) && L.Uses==1 -> (CMN x y)
    3.2 (CMPconst [0] L:(ADD x y)) && L.Uses>1 -> (ADDS x y)

  4. combination of MOVB to MOV[HW] & MOVH to MOVW

  5. use BFI / BFC to simplify bit operations

  6. optimization with MOVBUloadshiftLL, MOVBUloadshiftRA, MOVBUloadshiftRL

  7. MULAF/MULAD/MULSF/MULSD
    (ADDF a (MULF x y)) && a.Uses == 1 && objabi.GOARM >= 6 -> (MULAF a x y)
    also needs z:(MULF x y), z.Uses == 1

  8. use MOVT/MOVW pair instead of constant pool on ARMv7 (for large constant and symbolic address)
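
Items 2 and 3 above correspond to Go source patterns like the following sketch (function names are mine, for illustration only):

```go
package main

import "fmt"

// mula computes a + x*y. With the planned rule, a MUL whose result feeds a
// single ADD can be folded into one MULA instruction on ARM.
func mula(a, x, y int32) int32 {
	return a + x*y
}

// cmpZero compares a sum against zero. With a single use of the ADD,
// (CMPconst [0] (ADD x y)) can become (CMN x y), so the sum need not be
// materialized in a register just for the comparison.
func cmpZero(x, y int32) bool {
	return x+y == 0
}

func main() {
	fmt.Println(mula(5, 2, 3))  // 11
	fmt.Println(cmpZero(4, -4)) // true
}
```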

@benshi001

Member Author

commented May 18, 2018

The reason the following two MOVBs are not combined into a single MOVH is that the first store (v23) is used twice.

func ssg(s []byte, idx int) {
	s[(idx<<1)+0], s[(idx<<1)+1] = 0, 0
}

It seems this optimization cannot be done by SSA rules alone.

  pass lower begin
  pass lower end [16576 ns]
ssg func([]byte, int)
  b1:
    v1 = InitMem <mem>
    v7 = Arg <int> {idx}
    v8 = MOVDconst <uint> [1] DEAD
    v13 = MOVDconst <int> [1] DEAD
    v15 = MOVDconst <byte> [0] DEAD
    v30 = Arg <*byte> {s} (s[*byte])
    v25 = Arg <int> {s} [8] (s+8[int])
    v28 = SLLconst <int> [1] v7
    v26 = MOVDconst <int> [0] DEAD
    v3 = FlagLT_ULT <flags> DEAD
    v6 = CMPshiftLL <flags> [1] v25 v7
    v14 = ADDconst <int> [1] v28
    v17 = GreaterThanU <bool> v6 DEAD
    v24 = InvertFlags <flags> v6 DEAD
    UGT v6 -> b2 b3 (likely)
  b2: <- b1
    v21 = ADDshiftLL <*byte> [1] v30 v7 DEAD
    v23 = MOVBstorezeroidx <mem> v30 v28 v1
    v12 = CMP <flags> v14 v25
    v27 = LessThanU <bool> v12 DEAD
    ULT v12 -> b4 b3 (likely)
  b3: <- b1 b2
    v18 = Phi <mem> v1 v23
    v19 = CALLstatic <mem> {runtime.panicindex} v18
    Exit v19
  b4: <- b2
    v29 = ADD <*byte> v30 v14 DEAD
    v31 = MOVBstorezeroidx <mem> v30 v14 v23
    Ret v31
@randall77

Contributor

commented May 18, 2018

This is because the bounds check might panic after assigning s[(idx<<1)+0] but before assigning s[(idx<<1)+1]. If you ensure that the bounds check, when it fails, fails before the first assignment, then it should work.

func ssg(s []byte, idx int) {
	_ = s[(idx<<1)+1]
	s[(idx<<1)+0], s[(idx<<1)+1] = 0, 0
}
@benshi001

Member Author

commented May 21, 2018

Thank you, Keith.

But such code looks odd. Is it possible to clobber the MOVBstore if it is only referenced by a bounds check? Or to make the bounds check omit the eliminated store/load?

@randall77

Contributor

commented May 21, 2018

Combining the stores is not legal in the original example. If the first array index succeeds but the second fails, the language semantics require that the first write happens and the second doesn't. That means that the first write must be just a single byte; it can't be combined with the second write. The "in-between" memory state is observable if someone recovers the panic.

That requirement is realized in SSA by passing the results of the first store to the bounds check.
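
The observability argument can be demonstrated directly: a recovered panic sees the first byte already written (this standalone sketch is mine, not from the thread):

```go
package main

import "fmt"

// store writes two adjacent bytes. If the second index is out of bounds,
// the language requires that the first byte has already been written when
// the panic happens, so the two MOVBs cannot legally merge into one MOVH.
func store(s []byte, idx int) {
	s[(idx<<1)+0], s[(idx<<1)+1] = 1, 2
}

func main() {
	s := make([]byte, 3) // room for s[2] but not s[3]
	func() {
		defer func() { recover() }() // observe the in-between memory state
		store(s, 1)                  // writes s[2]=1, then panics on s[3]
	}()
	fmt.Println(s[2]) // 1: the first write happened before the panic
}
```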

@benshi001

Member Author

commented May 21, 2018

Thank you, Keith. I see the key point.

@benshi001 benshi001 changed the title cmd/compile/internal/arm64: miscellaneous optimizations cmd/compile: miscellaneous optimizations May 29, 2018

@benshi001

Member Author

commented May 29, 2018

386/amd64:

  1. "MOVFconst $0.0, F0" -> "LDZ F0", "MOVFconst $1.0, F0" -> "LD1 F0"

  2. implement DIVSSload/DIVSDload/MULLload

  3. support more instructions that use a destination memory operand, such as (ADD/SUB/OR/EOR/AND)Lconstmodify, (ADD/SUB/OR/EOR/AND)Lmodify

  4. support more instructions that use a source memory operand, such as CMP

  5. optimize 386 with BT/BTC/BTR/BTS

  6. improve memory operands with register indexed form

  7. can more combined load/store be optimized?

  8. can optimization be done with CMOV/FMOVcc?

  9. can some optimization be done with FIADD/FISUB/FIMUL/FIDIV?
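
Item 3 corresponds to read-modify-write patterns like this sketch (the function name is mine, for illustration only):

```go
package main

import "fmt"

// addToMem adds x to the value at p. With an ADDLmodify-style rule, the
// load/add/store sequence can be emitted as a single ADD with a
// destination memory operand (e.g. "ADDL AX, (BX)"), shrinking code size.
func addToMem(p *int32, x int32) {
	*p += x
}

func main() {
	v := int32(40)
	addToMem(&v, 2)
	fmt.Println(v) // 42
}
```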

@benshi001

Member Author

commented Sep 10, 2018

About 40% of these have been implemented via my recent commits; the others did not show improvement.

@benshi001 benshi001 closed this Sep 10, 2018
