Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: optimize CAS loops #29716

Open
CAFxX opened this issue Jan 13, 2019 · 1 comment

Comments

Projects
None yet
4 participants
@CAFxX
Copy link
Contributor

commented Jan 13, 2019

Note: I'm not 100% sure the transformation I'm proposing is correct, or profitable. If it's not feel free to close.

Consider the typical Go CAS loop:

    for {
        oldA := atomic.LoadUint64(&A)
        newA := f(oldA)
        if atomic.CompareAndSwapUint64(&A, oldA, newA) {
            return
        }
    }

On x64, this gets compiled to: (https://godbolt.org/z/7905Kp)

main_pc29:
        movq    "".A(SB), AX
		// (compute newA, store it in CX...) 
        leaq    "".A(SB), DX
        lock
        cmpxchgq        CX, (DX)
        seteq   AL
        testb   AL, AL
        jeq     main_pc29
        // (return...)

There are 3 things that jump out:

  • cmpxchgq, if A != oldA, already reloads the updated value of A in AX, so the first instruction of the loop (reloading A) can be skipped from the second iteration onwards
  • the seteq and testb after cmqxchgq are not required (they, and the jeq can be replaced with a jnz main_pc29)
  • the address of A is reloaded at every iteration: it should be hoisted outside of the loop (note that in the link above f() is marked noinline, so it makes sense to reload the address of A once f returns; but even if you remove the noinline the address is still reloaded even though the register is not reused for other purposes) (#15808?)

Rewritten, the loop could become:

        movq    "".A(SB), AX   // atomic.LoadUint64(&A)
        leaq    "".A(SB), DX   // DX = &A
main_pc29:
		// (compute newA, store it in CX...) 
        lock
        cmpxchgq        CX, (DX)
        jnz     main_pc29
        // (return...)

Since modifying atomic.CompareAndSwap* to return the reloaded value instead of a bool is likely not an option, it would be great if the compiler learned to do this kind of transformation (when safe).

@randall77

This comment has been minimized.

Copy link
Contributor

commented Jan 13, 2019

the seteq and testb after cmqxchgq are not required (they, and the jeq can be replaced with a jnz main_pc29)

This would be nice to fix. There's even a TODO in the code:

		// TODO: have these return flags instead of bool.  The current system generates:
		//    CMPXCHGQ ...
		//    SETEQ AX
		//    CMPB  AX, $0
		//    JNE ...
		// instead of just
		//    CMPXCHGQ ...
		//    JEQ ...
		// but we can't do that because memory-using ops can't generate flags yet
		// (flagalloc wants to move flag-generating instructions around).

The restriction mentioned may still be a problem. Memory-using ops can now generate flags, but I think it would still be a problem for memory-generating ops.

cmpxchgq, if A != oldA, already reloads the updated value of A in AX, so the first instruction of the loop (reloading A) can be skipped from the second iteration onwards

the address of A is reloaded at every iteration: it should be hoisted outside of the loop (note that in the link above f() is marked noinline, so it makes sense to reload the address of A once f returns; but even if you remove the noinline the address is still reloaded even though the register is not reused for other purposes) (#15808?)

I'm not particularly worried about either of these. The loop should only execute once in typical situations (and if it doesn't, you have bigger problems).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.