cmd/compile: missed opportunity to coalesce reads/writes #41663

Closed · josharian opened this issue Sep 27, 2020 · 19 comments
Labels: FrozenDueToAge, help wanted, NeedsFix, Performance
Milestone: Unplanned

@josharian (Contributor) commented Sep 27, 2020

package p

import "encoding/binary"

func f(b []byte, x *[8]byte) {
	_ = b[8]
	t := binary.LittleEndian.Uint64(x[:])
	binary.LittleEndian.PutUint64(b, t)
}

This should compile down to two MOVQs on amd64, one to load from x and one to write to b.

Instead, it compiles to a series of smaller MOVxs. The coalescing rules may need more cases added.
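
For reference, the hoped-for codegen is roughly the following (a sketch only; the register choices are illustrative, not actual compiler output):

	MOVQ (AX), CX	// load all 8 bytes of *x in one instruction
	MOVQ CX, (BX)	// store all 8 bytes into b in one instruction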

cc @randall77 @dr2chase @martisch @mundaym

josharian added this to the Unplanned milestone Sep 27, 2020

@agarciamontoro (Contributor) commented Sep 28, 2020

I'd like to work on this one!

@mundaym (Member) commented Sep 28, 2020

@agarciamontoro Thanks for offering to take a look. I suspect the best place to start is to add a new test in test/codegen/memcombine.go. You'll want to use the annotations to check that no small MOV* operations are generated, for example:

func f(b []byte, x *[8]byte) {
	_ = b[8]
	// amd64:-`MOVB`,-`MOVW`,-`MOVL`
	binary.LittleEndian.PutUint64(b, binary.LittleEndian.Uint64(x[:]))
}
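
If useful, the new test file can be run on its own through the test harness (a sketch, assuming a checkout of the main Go repo; adding --all_codegen also checks the non-host architectures):

	$ go run test/run.go -- test/codegen/memcombine.go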

After verifying that your new test fails, the tricky bit will be figuring out why the existing rules don't fire. Most likely another optimization is being applied first and interfering with the pattern match. GOSSAFUNC will be useful for figuring out what is going on (see the sketch at the end of this comment). You can find the optimization rules for load/store merging on AMD64 in src/cmd/compile/internal/ssa/gen/AMD64.rules:

// Combining byte loads into larger (unaligned) loads.
// There are many ways these combinations could occur. This is
// designed to match the way encoding/binary.LittleEndian does it.
// Little-endian loads
(OR(L|Q) x0:(MOVBload [i0] {s} p mem)
sh:(SHL(L|Q)const [8] x1:(MOVBload [i1] {s} p mem)))
&& i1 == i0+1
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVWload [i0] {s} p mem)
(OR(L|Q) x0:(MOVBload [i] {s} p0 mem)
sh:(SHL(L|Q)const [8] x1:(MOVBload [i] {s} p1 mem)))
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVWload [i] {s} p0 mem)
(OR(L|Q) x0:(MOVWload [i0] {s} p mem)
sh:(SHL(L|Q)const [16] x1:(MOVWload [i1] {s} p mem)))
&& i1 == i0+2
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVLload [i0] {s} p mem)
(OR(L|Q) x0:(MOVWload [i] {s} p0 mem)
sh:(SHL(L|Q)const [16] x1:(MOVWload [i] {s} p1 mem)))
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVLload [i] {s} p0 mem)
(ORQ x0:(MOVLload [i0] {s} p mem)
sh:(SHLQconst [32] x1:(MOVLload [i1] {s} p mem)))
&& i1 == i0+4
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVQload [i0] {s} p mem)
(ORQ x0:(MOVLload [i] {s} p0 mem)
sh:(SHLQconst [32] x1:(MOVLload [i] {s} p1 mem)))
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVQload [i] {s} p0 mem)
(OR(L|Q)
s1:(SHL(L|Q)const [j1] x1:(MOVBload [i1] {s} p mem))
or:(OR(L|Q)
s0:(SHL(L|Q)const [j0] x0:(MOVBload [i0] {s} p mem))
y))
&& i1 == i0+1
&& j1 == j0+8
&& j0 % 16 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (OR(L|Q) <v.Type> (SHL(L|Q)const <v.Type> [j0] (MOVWload [i0] {s} p mem)) y)
(OR(L|Q)
s1:(SHL(L|Q)const [j1] x1:(MOVBload [i] {s} p1 mem))
or:(OR(L|Q)
s0:(SHL(L|Q)const [j0] x0:(MOVBload [i] {s} p0 mem))
y))
&& j1 == j0+8
&& j0 % 16 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (OR(L|Q) <v.Type> (SHL(L|Q)const <v.Type> [j0] (MOVWload [i] {s} p0 mem)) y)
(ORQ
s1:(SHLQconst [j1] x1:(MOVWload [i1] {s} p mem))
or:(ORQ
s0:(SHLQconst [j0] x0:(MOVWload [i0] {s} p mem))
y))
&& i1 == i0+2
&& j1 == j0+16
&& j0 % 32 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (ORQ <v.Type> (SHLQconst <v.Type> [j0] (MOVLload [i0] {s} p mem)) y)
(ORQ
s1:(SHLQconst [j1] x1:(MOVWload [i] {s} p1 mem))
or:(ORQ
s0:(SHLQconst [j0] x0:(MOVWload [i] {s} p0 mem))
y))
&& j1 == j0+16
&& j0 % 32 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (ORQ <v.Type> (SHLQconst <v.Type> [j0] (MOVLload [i] {s} p0 mem)) y)
// Big-endian loads
(OR(L|Q)
x1:(MOVBload [i1] {s} p mem)
sh:(SHL(L|Q)const [8] x0:(MOVBload [i0] {s} p mem)))
&& i1 == i0+1
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (ROLWconst <v.Type> [8] (MOVWload [i0] {s} p mem))
(OR(L|Q)
x1:(MOVBload [i] {s} p1 mem)
sh:(SHL(L|Q)const [8] x0:(MOVBload [i] {s} p0 mem)))
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (ROLWconst <v.Type> [8] (MOVWload [i] {s} p0 mem))
(OR(L|Q)
r1:(ROLWconst [8] x1:(MOVWload [i1] {s} p mem))
sh:(SHL(L|Q)const [16] r0:(ROLWconst [8] x0:(MOVWload [i0] {s} p mem))))
&& i1 == i0+2
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, r0, r1, sh)
=> @mergePoint(b,x0,x1) (BSWAPL <v.Type> (MOVLload [i0] {s} p mem))
(OR(L|Q)
r1:(ROLWconst [8] x1:(MOVWload [i] {s} p1 mem))
sh:(SHL(L|Q)const [16] r0:(ROLWconst [8] x0:(MOVWload [i] {s} p0 mem))))
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, r0, r1, sh)
=> @mergePoint(b,x0,x1) (BSWAPL <v.Type> (MOVLload [i] {s} p0 mem))
(ORQ
r1:(BSWAPL x1:(MOVLload [i1] {s} p mem))
sh:(SHLQconst [32] r0:(BSWAPL x0:(MOVLload [i0] {s} p mem))))
&& i1 == i0+4
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, r0, r1, sh)
=> @mergePoint(b,x0,x1) (BSWAPQ <v.Type> (MOVQload [i0] {s} p mem))
(ORQ
r1:(BSWAPL x1:(MOVLload [i] {s} p1 mem))
sh:(SHLQconst [32] r0:(BSWAPL x0:(MOVLload [i] {s} p0 mem))))
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, r0, r1, sh)
=> @mergePoint(b,x0,x1) (BSWAPQ <v.Type> (MOVQload [i] {s} p0 mem))
(OR(L|Q)
s0:(SHL(L|Q)const [j0] x0:(MOVBload [i0] {s} p mem))
or:(OR(L|Q)
s1:(SHL(L|Q)const [j1] x1:(MOVBload [i1] {s} p mem))
y))
&& i1 == i0+1
&& j1 == j0-8
&& j1 % 16 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (OR(L|Q) <v.Type> (SHL(L|Q)const <v.Type> [j1] (ROLWconst <typ.UInt16> [8] (MOVWload [i0] {s} p mem))) y)
(OR(L|Q)
s0:(SHL(L|Q)const [j0] x0:(MOVBload [i] {s} p0 mem))
or:(OR(L|Q)
s1:(SHL(L|Q)const [j1] x1:(MOVBload [i] {s} p1 mem))
y))
&& j1 == j0-8
&& j1 % 16 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (OR(L|Q) <v.Type> (SHL(L|Q)const <v.Type> [j1] (ROLWconst <typ.UInt16> [8] (MOVWload [i] {s} p0 mem))) y)
(ORQ
s0:(SHLQconst [j0] r0:(ROLWconst [8] x0:(MOVWload [i0] {s} p mem)))
or:(ORQ
s1:(SHLQconst [j1] r1:(ROLWconst [8] x1:(MOVWload [i1] {s} p mem)))
y))
&& i1 == i0+2
&& j1 == j0-16
&& j1 % 32 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, r0, r1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (ORQ <v.Type> (SHLQconst <v.Type> [j1] (BSWAPL <typ.UInt32> (MOVLload [i0] {s} p mem))) y)
(ORQ
s0:(SHLQconst [j0] r0:(ROLWconst [8] x0:(MOVWload [i] {s} p0 mem)))
or:(ORQ
s1:(SHLQconst [j1] r1:(ROLWconst [8] x1:(MOVWload [i] {s} p1 mem)))
y))
&& j1 == j0-16
&& j1 % 32 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, r0, r1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (ORQ <v.Type> (SHLQconst <v.Type> [j1] (BSWAPL <typ.UInt32> (MOVLload [i] {s} p0 mem))) y)
// Combine 2 byte stores + shift into rolw 8 + word store
(MOVBstore [i] {s} p w
x0:(MOVBstore [i-1] {s} p (SHRWconst [8] w) mem))
&& x0.Uses == 1
&& clobber(x0)
=> (MOVWstore [i-1] {s} p (ROLWconst <w.Type> [8] w) mem)
(MOVBstore [i] {s} p1 w
x0:(MOVBstore [i] {s} p0 (SHRWconst [8] w) mem))
&& x0.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& clobber(x0)
=> (MOVWstore [i] {s} p0 (ROLWconst <w.Type> [8] w) mem)
// Combine stores + shifts into bswap and larger (unaligned) stores
(MOVBstore [i] {s} p w
x2:(MOVBstore [i-1] {s} p (SHRLconst [8] w)
x1:(MOVBstore [i-2] {s} p (SHRLconst [16] w)
x0:(MOVBstore [i-3] {s} p (SHRLconst [24] w) mem))))
&& x0.Uses == 1
&& x1.Uses == 1
&& x2.Uses == 1
&& clobber(x0, x1, x2)
=> (MOVLstore [i-3] {s} p (BSWAPL <w.Type> w) mem)
(MOVBstore [i] {s} p3 w
x2:(MOVBstore [i] {s} p2 (SHRLconst [8] w)
x1:(MOVBstore [i] {s} p1 (SHRLconst [16] w)
x0:(MOVBstore [i] {s} p0 (SHRLconst [24] w) mem))))
&& x0.Uses == 1
&& x1.Uses == 1
&& x2.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& sequentialAddresses(p1, p2, 1)
&& sequentialAddresses(p2, p3, 1)
&& clobber(x0, x1, x2)
=> (MOVLstore [i] {s} p0 (BSWAPL <w.Type> w) mem)
(MOVBstore [i] {s} p w
x6:(MOVBstore [i-1] {s} p (SHRQconst [8] w)
x5:(MOVBstore [i-2] {s} p (SHRQconst [16] w)
x4:(MOVBstore [i-3] {s} p (SHRQconst [24] w)
x3:(MOVBstore [i-4] {s} p (SHRQconst [32] w)
x2:(MOVBstore [i-5] {s} p (SHRQconst [40] w)
x1:(MOVBstore [i-6] {s} p (SHRQconst [48] w)
x0:(MOVBstore [i-7] {s} p (SHRQconst [56] w) mem))))))))
&& x0.Uses == 1
&& x1.Uses == 1
&& x2.Uses == 1
&& x3.Uses == 1
&& x4.Uses == 1
&& x5.Uses == 1
&& x6.Uses == 1
&& clobber(x0, x1, x2, x3, x4, x5, x6)
=> (MOVQstore [i-7] {s} p (BSWAPQ <w.Type> w) mem)
(MOVBstore [i] {s} p7 w
x6:(MOVBstore [i] {s} p6 (SHRQconst [8] w)
x5:(MOVBstore [i] {s} p5 (SHRQconst [16] w)
x4:(MOVBstore [i] {s} p4 (SHRQconst [24] w)
x3:(MOVBstore [i] {s} p3 (SHRQconst [32] w)
x2:(MOVBstore [i] {s} p2 (SHRQconst [40] w)
x1:(MOVBstore [i] {s} p1 (SHRQconst [48] w)
x0:(MOVBstore [i] {s} p0 (SHRQconst [56] w) mem))))))))
&& x0.Uses == 1
&& x1.Uses == 1
&& x2.Uses == 1
&& x3.Uses == 1
&& x4.Uses == 1
&& x5.Uses == 1
&& x6.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& sequentialAddresses(p1, p2, 1)
&& sequentialAddresses(p2, p3, 1)
&& sequentialAddresses(p3, p4, 1)
&& sequentialAddresses(p4, p5, 1)
&& sequentialAddresses(p5, p6, 1)
&& sequentialAddresses(p6, p7, 1)
&& clobber(x0, x1, x2, x3, x4, x5, x6)
=> (MOVQstore [i] {s} p0 (BSWAPQ <w.Type> w) mem)
// Combine constant stores into larger (unaligned) stores.
(MOVBstoreconst [c] {s} p x:(MOVBstoreconst [a] {s} p mem))
&& x.Uses == 1
&& a.Off() + 1 == c.Off()
&& clobber(x)
=> (MOVWstoreconst [makeValAndOff64(a.Val()&0xff | c.Val()<<8, a.Off())] {s} p mem)
(MOVBstoreconst [a] {s} p x:(MOVBstoreconst [c] {s} p mem))
&& x.Uses == 1
&& a.Off() + 1 == c.Off()
&& clobber(x)
=> (MOVWstoreconst [makeValAndOff64(a.Val()&0xff | c.Val()<<8, a.Off())] {s} p mem)
(MOVWstoreconst [c] {s} p x:(MOVWstoreconst [a] {s} p mem))
&& x.Uses == 1
&& a.Off() + 2 == c.Off()
&& clobber(x)
=> (MOVLstoreconst [makeValAndOff64(a.Val()&0xffff | c.Val()<<16, a.Off())] {s} p mem)
(MOVWstoreconst [a] {s} p x:(MOVWstoreconst [c] {s} p mem))
&& x.Uses == 1
&& a.Off() + 2 == c.Off()
&& clobber(x)
=> (MOVLstoreconst [makeValAndOff64(a.Val()&0xffff | c.Val()<<16, a.Off())] {s} p mem)
(MOVLstoreconst [c] {s} p x:(MOVLstoreconst [a] {s} p mem))
&& x.Uses == 1
&& a.Off() + 4 == c.Off()
&& clobber(x)
=> (MOVQstore [a.Off32()] {s} p (MOVQconst [a.Val()&0xffffffff | c.Val()<<32]) mem)
(MOVLstoreconst [a] {s} p x:(MOVLstoreconst [c] {s} p mem))
&& x.Uses == 1
&& a.Off() + 4 == c.Off()
&& clobber(x)
=> (MOVQstore [a.Off32()] {s} p (MOVQconst [a.Val()&0xffffffff | c.Val()<<32]) mem)
(MOVQstoreconst [c] {s} p x:(MOVQstoreconst [c2] {s} p mem))
&& config.useSSE
&& x.Uses == 1
&& c2.Off() + 8 == c.Off()
&& c.Val() == 0
&& c2.Val() == 0
&& clobber(x)
=> (MOVOstore [c2.Off32()] {s} p (MOVOconst [0]) mem)
// Combine stores into larger (unaligned) stores. Little endian.
(MOVBstore [i] {s} p (SHR(W|L|Q)const [8] w) x:(MOVBstore [i-1] {s} p w mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVWstore [i-1] {s} p w mem)
(MOVBstore [i] {s} p w x:(MOVBstore [i+1] {s} p (SHR(W|L|Q)const [8] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVWstore [i] {s} p w mem)
(MOVBstore [i] {s} p (SHR(L|Q)const [j] w) x:(MOVBstore [i-1] {s} p w0:(SHR(L|Q)const [j-8] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVWstore [i-1] {s} p w0 mem)
(MOVBstore [i] {s} p1 (SHR(W|L|Q)const [8] w) x:(MOVBstore [i] {s} p0 w mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& clobber(x)
=> (MOVWstore [i] {s} p0 w mem)
(MOVBstore [i] {s} p0 w x:(MOVBstore [i] {s} p1 (SHR(W|L|Q)const [8] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& clobber(x)
=> (MOVWstore [i] {s} p0 w mem)
(MOVBstore [i] {s} p1 (SHR(L|Q)const [j] w) x:(MOVBstore [i] {s} p0 w0:(SHR(L|Q)const [j-8] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& clobber(x)
=> (MOVWstore [i] {s} p0 w0 mem)
(MOVWstore [i] {s} p (SHR(L|Q)const [16] w) x:(MOVWstore [i-2] {s} p w mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVLstore [i-2] {s} p w mem)
(MOVWstore [i] {s} p (SHR(L|Q)const [j] w) x:(MOVWstore [i-2] {s} p w0:(SHR(L|Q)const [j-16] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVLstore [i-2] {s} p w0 mem)
(MOVWstore [i] {s} p1 (SHR(L|Q)const [16] w) x:(MOVWstore [i] {s} p0 w mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& clobber(x)
=> (MOVLstore [i] {s} p0 w mem)
(MOVWstore [i] {s} p1 (SHR(L|Q)const [j] w) x:(MOVWstore [i] {s} p0 w0:(SHR(L|Q)const [j-16] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& clobber(x)
=> (MOVLstore [i] {s} p0 w0 mem)
(MOVLstore [i] {s} p (SHRQconst [32] w) x:(MOVLstore [i-4] {s} p w mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVQstore [i-4] {s} p w mem)
(MOVLstore [i] {s} p (SHRQconst [j] w) x:(MOVLstore [i-4] {s} p w0:(SHRQconst [j-32] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVQstore [i-4] {s} p w0 mem)
(MOVLstore [i] {s} p1 (SHRQconst [32] w) x:(MOVLstore [i] {s} p0 w mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& clobber(x)
=> (MOVQstore [i] {s} p0 w mem)
(MOVLstore [i] {s} p1 (SHRQconst [j] w) x:(MOVLstore [i] {s} p0 w0:(SHRQconst [j-32] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& clobber(x)
=> (MOVQstore [i] {s} p0 w0 mem)
(MOVBstore [i] {s} p
x1:(MOVBload [j] {s2} p2 mem)
mem2:(MOVBstore [i-1] {s} p
x2:(MOVBload [j-1] {s2} p2 mem) mem))
&& x1.Uses == 1
&& x2.Uses == 1
&& mem2.Uses == 1
&& clobber(x1, x2, mem2)
=> (MOVWstore [i-1] {s} p (MOVWload [j-1] {s2} p2 mem) mem)
(MOVWstore [i] {s} p
x1:(MOVWload [j] {s2} p2 mem)
mem2:(MOVWstore [i-2] {s} p
x2:(MOVWload [j-2] {s2} p2 mem) mem))
&& x1.Uses == 1
&& x2.Uses == 1
&& mem2.Uses == 1
&& clobber(x1, x2, mem2)
=> (MOVLstore [i-2] {s} p (MOVLload [j-2] {s2} p2 mem) mem)
(MOVLstore [i] {s} p
x1:(MOVLload [j] {s2} p2 mem)
mem2:(MOVLstore [i-4] {s} p
x2:(MOVLload [j-4] {s2} p2 mem) mem))
&& x1.Uses == 1
&& x2.Uses == 1
&& mem2.Uses == 1
&& clobber(x1, x2, mem2)
=> (MOVQstore [i-4] {s} p (MOVQload [j-4] {s2} p2 mem) mem)

Finally, if you get a chance, try to check that other architectures optimize the pattern too. arm64, ppc64le, and s390x all do unaligned load/store merging as well.
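
For reference, a minimal GOSSAFUNC invocation might look like this (a sketch; p.go is assumed to contain the function f from above):

	$ GOSSAFUNC=f go build p.go
	# writes ssa.html, which shows the function after every compiler pass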

@mundaym (Member) commented Sep 28, 2020

Aside: it would be really nice to do the unaligned load/store merging optimizations in a generic optimization pass. These rules are quite hard to maintain when there are a lot of optimizations that might interfere with the target patterns.

@mvdan (Member) commented Sep 28, 2020

As a quick aside, when new rules are added, is there anything to warn us about them making other rules suddenly trigger less often? With the huge amount of rules we have today, it's practically impossible to foresee that kind of interaction.

I guess we can keep adding more and more tests to cover common cases, but it still feels like some sort of tooling would be nice. I imagine there are a few rules today that basically never trigger anymore, for example.

@mundaym (Member) commented Sep 28, 2020

@mvdan I use compilecmp a lot (https://github.com/josharian/compilecmp). That at least tells me when the generated code for a function grows significantly (often a sign I've broken a pre-existing rewrite rule). In practice the codegen tests are also fairly good at catching a lot of this; there is quite a lot of coverage there now.

andybons added the NeedsFix label Sep 29, 2020

@agarciamontoro (Contributor) commented Sep 29, 2020

Thank you for all the detailed info; I'll start investigating as soon as possible! I may get back to you in the coming days with questions, as this would be my first contribution to the project :)

@josharian (Contributor, Author) commented Dec 15, 2020

I just bumped into this again. @agarciamontoro anything I can do to help you out here?

@agarciamontoro (Contributor) commented Dec 16, 2020

Hi @josharian, sorry for the ghosting here, life got in the way!

I did take a look at this a couple of weeks ago and learned quite a lot about how everything works. However, I got stuck when trying to identify which rules lead to the generated code here. My initial idea was to simply add a new rule that covers this specific issue, but I wanted to first find the rule (or rules) at the root of the problem.

The problem is that the generated code stores a byte, then a long, then a word, and finally the last byte:

v34 (+79) = MOVQload <uint64> v7 v1
v171 (+84) = MOVBstore <mem> v248 v34 v1
v168 (+85) = SHRQconst <uint64> [8] v34
v216 (+89) = SHRQconst <uint64> [40] v34
v180 (+91) = SHRQconst <uint64> [56] v34
v219 (88) = MOVLstore <mem> [1] v248 v168 v171
v243 (90) = MOVWstore <mem> [5] v248 v216 v219
v255 (91) = MOVBstore <mem> [7] v248 v180 v243

There are of course rules for the "normal" cases, where we coalesce sequential byte stores into a quad. But there is nothing for converting the sequence B, L, W, B into a Q. We could add a specific rule for that, but I would like to know why we end up with this weird sequence in the first place (which is where I got stuck).
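
For reference, the store offsets in the dump above carve up the eight bytes like this (a 1-4-2-1 byte pattern):

	offset:	0 | 1 2 3 4 | 5 6 | 7
	store:	B |    L    |  W  | B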

If you could give me a hint on where to look to debug this, that would be awesome.

@josharian (Contributor, Author) commented Dec 17, 2020

No worries on disappearing. It’s been a tough, unpredictable year for everyone.

It’s not easy to debug SSA rewrite rules. There’s a logging mode you can use: pass the compiler the flag -d=ssa/lower/debug=2 (see the sketch below). But the output may be voluminous. And managing rewrite rule dependencies is a headache. As long as there’s a codegen test, I’d personally be comfortable just matching the 1-4-2-1 pattern.
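
A sketch of how that flag is passed in practice (the file name is assumed):

	$ go build -gcflags='-d=ssa/lower/debug=2' p.go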

Don’t hesitate to ask if there’s more we can help with.

@agarciamontoro (Contributor) commented Dec 17, 2020

Thank you for the tip, I'll try to investigate the issue with the logging mode. If that doesn't pan out, I'll simply match the weird pattern; I already have the codegen test.

Thanks for the help!

@josharian (Contributor, Author) commented Dec 25, 2020

Marking this as Go 1.17. It is a regression, I've encountered it in multiple codebases, it has come up multiple times in Gopher Slack, and it is impacting design decisions in inet.af/netaddr.

danderson added a commit to inetaf/netaddr that referenced this issue Dec 25, 2020
This allows IP manipulation operations to operate on IPs as two registers, which empirically
leads to significant speedups, in particular on OOO superscalars where the two halves can
be processed in parallel.

You might expect that we could keep the representation as [16]byte, and do a cycle of
BigEndian.Uint64+tweak+BigEndian.PutUint64, and that would compile down to efficient
code. Unfortunately, due to a variety of missing optimizations in the Go compiler,
that is not the case and that code turns into byte-wise operations. On the other hand,
converting to a uint64 pair at construction results in efficient construction (a pair of
MOVQ+BSWAPQ) and efficient twiddling operations (single-cycle arithmetic on 64-bit
pairs). See also:
 - golang/go#41663
 - golang/go#41684
 - golang/go#42958
 - The discussion in #63

name                              old time/op    new time/op    delta
StdIPv4-8                            146ns ± 2%     141ns ± 2%    -3.42%  (p=0.016 n=5+5)
IPv4-8                               120ns ± 1%     107ns ± 2%   -10.65%  (p=0.008 n=5+5)
IPv4_inline-8                        120ns ± 0%     118ns ± 1%    -1.67%  (p=0.016 n=4+5)
StdIPv6-8                            211ns ± 2%     215ns ± 1%    +2.18%  (p=0.008 n=5+5)
IPv6-8                               281ns ± 1%     252ns ± 1%   -10.19%  (p=0.008 n=5+5)
IPv4Contains-8                      11.8ns ± 4%     4.7ns ± 2%   -60.00%  (p=0.008 n=5+5)
ParseIPv4-8                         68.1ns ± 4%    78.8ns ± 1%   +15.74%  (p=0.008 n=5+5)
ParseIPv6-8                          419ns ± 1%     409ns ± 0%    -2.40%  (p=0.016 n=4+5)
StdParseIPv4-8                      73.7ns ± 1%    88.8ns ± 2%   +20.50%  (p=0.008 n=5+5)
StdParseIPv6-8                       132ns ± 2%     134ns ± 1%      ~     (p=0.079 n=5+5)
IPPrefixMasking/IPv4_/32-8          36.3ns ± 3%     4.8ns ± 4%   -86.72%  (p=0.008 n=5+5)
IPPrefixMasking/IPv4_/17-8          39.0ns ± 0%     4.8ns ± 3%   -87.78%  (p=0.008 n=5+5)
IPPrefixMasking/IPv4_/0-8           36.9ns ± 2%     4.8ns ± 4%   -87.07%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/128-8         32.7ns ± 1%     4.7ns ± 2%   -85.47%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/65-8          39.8ns ± 1%     4.7ns ± 1%   -88.13%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/0-8           40.7ns ± 1%     4.7ns ± 2%   -88.41%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/128-8     136ns ± 3%       5ns ± 2%   -96.53%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/65-8      142ns ± 2%       5ns ± 1%   -96.65%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/0-8       143ns ± 2%       5ns ± 3%   -96.67%  (p=0.008 n=5+5)
IPSetFuzz-8                         22.7µs ± 2%    16.4µs ± 2%   -27.84%  (p=0.008 n=5+5)

name                              old alloc/op   new alloc/op   delta
StdIPv4-8                            16.0B ± 0%     16.0B ± 0%      ~     (all equal)
IPv4-8                               0.00B          0.00B           ~     (all equal)
IPv4_inline-8                        0.00B          0.00B           ~     (all equal)
StdIPv6-8                            16.0B ± 0%     16.0B ± 0%      ~     (all equal)
IPv6-8                               16.0B ± 0%     16.0B ± 0%      ~     (all equal)
IPv4Contains-8                       0.00B          0.00B           ~     (all equal)
ParseIPv4-8                          0.00B          0.00B           ~     (all equal)
ParseIPv6-8                          48.0B ± 0%     48.0B ± 0%      ~     (all equal)
StdParseIPv4-8                       16.0B ± 0%     16.0B ± 0%      ~     (all equal)
StdParseIPv6-8                       16.0B ± 0%     16.0B ± 0%      ~     (all equal)
IPPrefixMasking/IPv4_/32-8           0.00B          0.00B           ~     (all equal)
IPPrefixMasking/IPv4_/17-8           0.00B          0.00B           ~     (all equal)
IPPrefixMasking/IPv4_/0-8            0.00B          0.00B           ~     (all equal)
IPPrefixMasking/IPv6_/128-8          0.00B          0.00B           ~     (all equal)
IPPrefixMasking/IPv6_/65-8           0.00B          0.00B           ~     (all equal)
IPPrefixMasking/IPv6_/0-8            0.00B          0.00B           ~     (all equal)
IPPrefixMasking/IPv6_zone_/128-8     16.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/65-8      16.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/0-8       16.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
IPSetFuzz-8                         2.60kB ± 0%    2.60kB ± 0%      ~     (p=1.000 n=5+5)

name                              old allocs/op  new allocs/op  delta
StdIPv4-8                             1.00 ± 0%      1.00 ± 0%      ~     (all equal)
IPv4-8                                0.00           0.00           ~     (all equal)
IPv4_inline-8                         0.00           0.00           ~     (all equal)
StdIPv6-8                             1.00 ± 0%      1.00 ± 0%      ~     (all equal)
IPv6-8                                1.00 ± 0%      1.00 ± 0%      ~     (all equal)
IPv4Contains-8                        0.00           0.00           ~     (all equal)
ParseIPv4-8                           0.00           0.00           ~     (all equal)
ParseIPv6-8                           3.00 ± 0%      3.00 ± 0%      ~     (all equal)
StdParseIPv4-8                        1.00 ± 0%      1.00 ± 0%      ~     (all equal)
StdParseIPv6-8                        1.00 ± 0%      1.00 ± 0%      ~     (all equal)
IPPrefixMasking/IPv4_/32-8            0.00           0.00           ~     (all equal)
IPPrefixMasking/IPv4_/17-8            0.00           0.00           ~     (all equal)
IPPrefixMasking/IPv4_/0-8             0.00           0.00           ~     (all equal)
IPPrefixMasking/IPv6_/128-8           0.00           0.00           ~     (all equal)
IPPrefixMasking/IPv6_/65-8            0.00           0.00           ~     (all equal)
IPPrefixMasking/IPv6_/0-8             0.00           0.00           ~     (all equal)
IPPrefixMasking/IPv6_zone_/128-8      1.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/65-8       1.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/0-8        1.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
IPSetFuzz-8                           33.0 ± 0%      33.0 ± 0%      ~     (all equal)

Signed-off-by: David Anderson <dave@natulte.net>

@agarciamontoro (Contributor) commented Dec 25, 2020

Expect something from my side this weekend, @josharian; I'm on vacation now and have a bit more time. If I'm not able to solve it by then, I'll free it up for someone else :)

@josharian (Contributor, Author) commented Dec 25, 2020

We have a few months yet before 1.17, so no rush. :)

danderson added a commit to inetaf/netaddr that referenced this issue Dec 26, 2020, with the same commit message and benchmark results as above.

majst01 pushed a commit to majst01/netaddr that referenced this issue Dec 26, 2020, again with the same commit message and benchmark results (plus an additional Signed-off-by: Stefan Majer <stefan.majer@gmail.com>).

@agarciamontoro (Contributor) commented Dec 28, 2020

It looks like I have a working rule, both for amd64:

(MOVBstore [7] p1 (SHRQconst [56] w)
  x1:(MOVWstore [5] p1 (SHRQconst [40] w)
  x2:(MOVLstore [1] p1 (SHRQconst [8] w)
  x3:(MOVBstore p1 w mem))))
  && x1.Uses == 1
  && x2.Uses == 1
  && x3.Uses == 1
  && clobber(x1, x2, x3)
  => (MOVQstore p1 w mem)

and for s390x, using its corresponding ops:

(MOVBstore [7] p1 (SRDconst w)
  x1:(MOVHBRstore [5] p1 (SRDconst w)
  x2:(MOVWBRstore [1] p1 (SRDconst w)
  x3:(MOVBstore p1 w mem))))
  && x1.Uses == 1
  && x2.Uses == 1
  && x3.Uses == 1
  && clobber(x1, x2, x3)
  => (MOVDBRstore p1 w mem)

The current rules seem to already optimize the generated code both for arm64 and ppc64le, so I did not need to do anything else there.

This change makes the new codegen test pass, but I do have a couple of questions before submitting a patch:

  1. I'm not sure if these rules are too specific (both the pattern itself and the constants in the move and shift ops). Maybe specificity is not an issue at this step (you already said that you were ok matching the 1-4-2-1 pattern), but I'm curious about how you evaluate/measure that (if there is a need to evaluate that at all). I'm mostly concerned about avoidable performance issues that could arise from adding overly-specific rules that match weird patterns.
  2. I tested this for other architectures using the --all_codegen flag (as in go run test/run.go --all_codegen -- test/codegen/memcombine.go), but I'm not sure whether this is the right approach. Should I do anything else to test this for architectures other than amd64?

@agnivade (Contributor) commented Dec 28, 2020

Nice work @agarciamontoro.

  1. I'll let @josharian answer that. But we usually look at the impact of the change and how widespread it is. If it has a significant impact and the change is minimal, it should be okay to add a specific pattern. But the question does come up as to why we match just this one ordering and not the rest of the 4!-1 patterns.
  2. That is the right approach. As long as you have architecture-specific annotations, you don't need to do anything else (see the sketch below).
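
For reference, a sketch of what architecture-specific annotations for this test could look like (same annotation syntax as earlier in the thread; the s390x op names here are an assumption, so check the actual generated mnemonics):

	// amd64:-`MOVB`,-`MOVW`,-`MOVL`
	// s390x:-`MOVB`,-`MOVH`,-`MOVW`
	binary.LittleEndian.PutUint64(b, binary.LittleEndian.Uint64(x[:]))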

@agarciamontoro (Contributor) commented Dec 28, 2020

Thank you, @agnivade!

But the question does come up as to why we match just this one ordering and not the rest of the 4!-1 patterns.

Yup, that's a good question. The specific pattern I used is the one generated when compiling the function in the description of the issue. My guess is that this happens because of the order of the previously applied rules, but I was not able to identify the ones that caused the problem: the output from -gcflags=-d=ssa/lower/debug=2 is indeed voluminous, and from what I saw, it logs every value changed instead of every rule applied (which can affect multiple values), so it's hard to keep track of what's happening. Is there some other debug mode I don't know of that we could use here?

I can probably come up with a rule that covers all the cases, but do we have proof of the other cases actually happening? I'm not sure how we could check that.

@josharian (Contributor, Author) commented Dec 28, 2020

Yay!

I'm not sure if these rules are too specific (both the pattern itself and the constants in the move and shift ops). Maybe specificity is not an issue at this step (you already said that you were ok matching the 1-4-2-1 pattern), but I'm curious about how you evaluate/measure that (if there is a need to evaluate that at all).

The specificity is fine. This is a particular pattern that arises, so this is the pattern we match. It would be nice eventually to do something more principled around how we order and apply rewrite rules, but it gets very weedy very quickly, and this is Good Enough. And we have codegen tests to prevent regressions.

I'm mostly concerned about avoidable performance issues that could arise from adding overly-specific rules that match weird patterns.

The incremental cost of adding additional rewrite rules is very low.

If you're curious, you could add an && noteRule("1421") to the LHS of the rule and re-run make.bash to see how often it triggers. Or use compilecmp -fn=changed to see what functions get optimized. But in this case, because we know it shows up in the wild, that supporting evidence is not too critical.
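
For reference, a sketch of where that clause would go in the amd64 rule from above (noteRule logs a message each time the rule fires and always returns true, so it doesn't change the match):

	(MOVBstore [7] p1 (SHRQconst [56] w)
	  x1:(MOVWstore [5] p1 (SHRQconst [40] w)
	  x2:(MOVLstore [1] p1 (SHRQconst [8] w)
	  x3:(MOVBstore p1 w mem))))
	  && x1.Uses == 1
	  && x2.Uses == 1
	  && x3.Uses == 1
	  && clobber(x1, x2, x3)
	  && noteRule("1421")
	  => (MOVQstore p1 w mem)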

@gopherbot commented Dec 30, 2020

Change https://golang.org/cl/280456 mentions this issue: cmd/compile: add rule to coalesce writes

@agarciamontoro (Contributor) commented Dec 30, 2020

Just sent the change matching the specific pattern. Thank you for the tip about noteRule, though; it's super useful!

golang locked and limited conversation to collaborators Feb 24, 2022