New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: optimize f32Slice[i] += 1 across math.Float32bits, Float32frombits #17220

Open
nigeltao opened this Issue Sep 24, 2016 · 9 comments

Comments

Projects
None yet
5 participants
@nigeltao
Contributor

nigeltao commented Sep 24, 2016

I'm not sure how often this arises in practice, but it did come up in https://go-review.googlesource.com/#/c/29691/ in a vector rasterizer

Roughly speaking, in func floatingAccumulateMask, I'm accumulating the elements of a []float32 and storing the (scaled) cumulative sums in an []uint32.

I don't actually need both of the individual and cumulative values at the same time. If the two slices were both []float32 or both []uint32, then I could halve the amount of memory that I need (and possibly have better cache access patterns) by writing the output elements in-place over the input elements.

I can actually still do this, even without using unsafe, with just one slice (of type []uint32), and sprinking some math.Float32bits and math.Float32frombits throughout my float32 code.

This works, in that it produces the correct output, but there is a performance penalty. In a simpler repro case, suppose that I had these global variables:

f32Slice []float32
u32Slice []uint32

and these three lines of code:

f32Slice[0] += 1
u32Slice[0] = uint32(1 + int32(u32Slice[0]))
u32Slice[0] = math.Float32bits(1 + math.Float32frombits(u32Slice[0]))

The GOARCH=amd64 codegen for each of the three lines are:

48 8b 05 b1 71 09 00    mov    0x971b1(%rip),%rax        # 498298 <main.f32Slice+0x8>
48 8b 0d a2 71 09 00    mov    0x971a2(%rip),%rcx        # 498290 <main.f32Slice>
48 85 c0                test   %rax,%rax
76 3a                   jbe    40112d <main.main+0x12d>
f3 0f 10 01             movss  (%rcx),%xmm0
f3 0f 10 0d 15 cc 06    movss  0x6cc15(%rip),%xmm1        # 46dd14 <$f32.3f800000>
00 
f3 0f 58 c8             addss  %xmm0,%xmm1
f3 0f 11 09             movss  %xmm1,(%rcx)
48 8b 05 fe 71 09 00    mov    0x971fe(%rip),%rax        # 4982b0 <main.u32Slice>
48 8b 0d ff 71 09 00    mov    0x971ff(%rip),%rcx        # 4982b8 <main.u32Slice+0x8>
48 85 c9                test   %rcx,%rcx
76 76                   jbe    401134 <main.main+0x134>
8b 08                   mov    (%rax),%ecx
ff c1                   inc    %ecx
89 08                   mov    %ecx,(%rax)
48 8b 05 6c 72 09 00    mov    0x9726c(%rip),%rax        # 4982b0 <main.u32Slice>
48 8b 0d 6d 72 09 00    mov    0x9726d(%rip),%rcx        # 4982b8 <main.u32Slice+0x8>
48 85 c9                test   %rcx,%rcx
0f 86 e7 00 00 00       jbe    40113b <main.main+0x13b>
8b 00                   mov    (%rax),%eax
89 44 24 0c             mov    %eax,0xc(%rsp)
f3 0f 10 44 24 0c       movss  0xc(%rsp),%xmm0
f3 0f 10 0d ac cc 06    movss  0x6ccac(%rip),%xmm1        # 46dd14 <$f32.3f800000>
00 
f3 0f 58 c1             addss  %xmm1,%xmm0
f3 0f 11 44 24 08       movss  %xmm0,0x8(%rsp)
8b 44 24 08             mov    0x8(%rsp),%eax
48 8b 0d 33 72 09 00    mov    0x97233(%rip),%rcx        # 4982b0 <main.u32Slice>
48 8b 15 34 72 09 00    mov    0x97234(%rip),%rdx        # 4982b8 <main.u32Slice+0x8>
48 85 d2                test   %rdx,%rdx
0f 86 ae 00 00 00       jbe    40113b <main.main+0x13b>
89 01                   mov    %eax,(%rcx)

The codegen for the first two lines are efficient. The codegen for the third line could be better in two ways. First, there's an unnecessary bounce via the stack from memory to XMM0:

8b 00                   mov    (%rax),%eax
89 44 24 0c             mov    %eax,0xc(%rsp)
f3 0f 10 44 24 0c       movss  0xc(%rsp),%xmm0

and likewise on the way back. Second, there are two bounds checks instead of one, but the second is redundant.

I am not a compiler person, but it looks to me like the Float32{,from}bits calls are treated as black box functions and not like a (no-op) uint32 to int32 conversion until too late in the codegen for e.g. the relevant bounds check elimination.

I'll let y'all decide if this is a dupe of issue #17069.

@randall77

This comment has been minimized.

Contributor

randall77 commented Sep 24, 2016

The duplicate bounds check is because it is reloading the length, and the length may have changed because you're reading from a global. It's reloading because the intermediate store/load of the float value is hiding the fact that the loads of the length can be combined.

The extraneous write/read problem is #13095. I'd rather fix that issue than intrinsify, if we can.

@griesemer

This comment has been minimized.

Contributor

griesemer commented Sep 24, 2016

@nigeltao Could you take the address instead?

p := &u32Slice[0];
*p = math.Float32bits(1 + math.Float32frombits(*p))

Presumably this gives you just one index check.

@nigeltao

This comment has been minimized.

Contributor

nigeltao commented Sep 28, 2016

@griesemer: yes, that gives me just one index check.

@quentinmit quentinmit added the NeedsFix label Oct 3, 2016

@quentinmit quentinmit added this to the Go1.8Maybe milestone Oct 3, 2016

@quentinmit

This comment has been minimized.

Contributor

quentinmit commented Oct 3, 2016

Feel free to punt to Unplanned if you think this isn't worth doing any time soon.

@randall77 randall77 modified the milestones: Unplanned, Go1.8Maybe Oct 3, 2016

@mundaym

This comment has been minimized.

Member

mundaym commented Sep 8, 2017

Was this fixed by CL 58732?

@randall77

This comment has been minimized.

Contributor

randall77 commented Sep 14, 2017

This appears at least partially fixed by CL 58732. There are still some issues :(

package foo

import "math"

var f []float32
var u []uint32

func a(f *float32) {
	*f = *f + 1
}
func b(u *uint32) {
	*u = *u + 1
}

func c(u *uint32) {
	*u = math.Float32bits(1 + math.Float32frombits(*u))
}

func d(f *float32) {
	*f = math.Float32frombits(1 + math.Float32bits(*f))
}
"".a STEXT nosplit size=22 args=0x8 locals=0x0
        0x0000 00000 (tmp.go:8)   MOVSS   $f32.3f800000(SB), X0
        0x0008 00008 (tmp.go:8)   MOVQ    "".f+8(SP), AX
        0x000d 00013 (tmp.go:9)   ADDSS   (AX), X0
        0x0011 00017 (tmp.go:9)   MOVSS   X0, (AX)
        0x0015 00021 (tmp.go:10)  RET
"".b STEXT nosplit size=8 args=0x8 locals=0x0
        0x0000 00000 (tmp.go:11)  MOVQ    "".u+8(SP), AX
        0x0005 00005 (tmp.go:12)  INCL    (AX)
        0x0007 00007 (tmp.go:13)  RET
"".c STEXT nosplit size=28 args=0x8 locals=0x0
        0x0000 00000 (tmp.go:15)  MOVQ    "".u+8(SP), AX
        0x0005 00005 (tmp.go:16)  MOVL    (AX), CX
        0x0007 00007 (tmp.go:16)  MOVL    CX, X0
        0x000b 00011 (tmp.go:16)  MOVSS   $f32.3f800000(SB), X1
        0x0013 00019 (tmp.go:16)  ADDSS   X0, X1
        0x0017 00023 (tmp.go:16)  MOVSS   X1, (AX)
        0x001b 00027 (tmp.go:17)  RET
"".d STEXT nosplit size=18 args=0x8 locals=0x0
        0x0000 00000 (tmp.go:19)  MOVQ    "".f+8(SP), AX
        0x0005 00005 (tmp.go:20)  MOVSS   (AX), X0
        0x0009 00009 (tmp.go:20)  MOVL    X0, CX
        0x000d 00013 (tmp.go:20)  INCL    CX
        0x000f 00015 (tmp.go:20)  MOVL    CX, (AX)
        0x0011 00017 (tmp.go:21)  RET

On the plus side, c and d do the direct store without any need for conversion.
On the minus side, c and d do a load+convert instead of loading to the target register type directly.
I'll see if I can fix that.

@randall77

This comment has been minimized.

Contributor

randall77 commented Sep 19, 2017

Fixing this is tricky. There's a phase ordering problem.
When you have a load/convert sequence, you'd like to rewrite it to just a load of the converted type. That rewrite is only safe if the load has only one use. (Otherwise you're replacing one source load with two generated loads, and that's not ok.) In the cases here, during rewrite there are 2 uses of the load. There's the convert op, but also the store to the (eventually unused) stack temp. We don't know that second use is going to go away until after we remove unread temps, and that doesn't happen until rewrites are done.

@mundaym

This comment has been minimized.

Member

mundaym commented Sep 20, 2017

That rewrite is only safe if the load has only one use. (Otherwise you're replacing one source load with two generated loads, and that's not ok.)

Out of curiosity why isn't that ok? From a performance perspective I would have thought that issuing two loads might be better than a load followed by a move (easier to schedule). Are there other limitations?

@randall77

This comment has been minimized.

Contributor

randall77 commented Sep 20, 2017

We have to be careful that we don't break code that does something like:

   x := *p
   y := x - x

We want to guarantee that y is always zero, even in the presence of races.

I think the language spec doesn't require this, but as a practical matter we have to. I allowed this by accident once, and I broke sync.Mutex.Lock. It does essentially

old := m.state
if old & mask == value {
    atomic.CompareAndSwap(&m.state, old, new)
}

If we allowed the m.state load to happen twice instead of once, then the two values of old can be different, and bad things happen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment