cmd/compile: missed opportunity to coalesce reads/writes #41663

Open · josharian opened this issue Sep 27, 2020 · 6 comments

@josharian (Contributor) commented Sep 27, 2020

package p

import "encoding/binary"

func f(b []byte, x *[8]byte) {
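	// bounds check hint to the compiler: proves len(b) > 8 up front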
	_ = b[8]
	t := binary.LittleEndian.Uint64(x[:])
	binary.LittleEndian.PutUint64(b, t)
}

This should compile down to two MOVQs on amd64, one to load from x and one to write to b.

Instead, it compiles to a series of smaller MOVxs. The coalescing rules may need more cases added.
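
For concreteness, the desired output is something like the following (a sketch; the registers are illustrative, not necessarily what the compiler would allocate):

MOVQ	(CX), DX	// single 8-byte load from x
MOVQ	DX, (AX)	// single 8-byte store to b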

cc @randall77 @dr2chase @martisch @mundaym

josharian added this to the Unplanned milestone on Sep 27, 2020
@agarciamontoro commented Sep 28, 2020

I'd like to work on this one!

@mundaym (Member) commented Sep 28, 2020

@agarciamontoro Thanks for offering to take a look. I suspect the best place to start is to add a new test in test/codegen/memcombine.go (which already imports encoding/binary). You'll want to use the annotations to check that no small MOV* operations are generated, for example:

func f(b []byte, x *[8]byte) {
	_ = b[8]
	// amd64:-`MOVB`,-`MOVW`,-`MOVL`
	binary.LittleEndian.PutUint64(b, binary.LittleEndian.Uint64(x[:]))
}
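
To actually run the codegen tests, the harness lives in the test directory of a Go checkout; per the test/codegen README, the invocation is along the lines of:

$ cd $GOROOT/test
$ go run run.go -- codegen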

After verifying that your new test fails, the tricky bit will be figuring out why the existing rules don't apply. Most likely another optimization is being applied first and is interfering with the pattern match. GOSSAFUNC will be useful for figuring out what is going on. You can find the optimization rules for load/store merging on AMD64 in src/cmd/compile/internal/ssa/gen/AMD64.rules:

// Combining byte loads into larger (unaligned) loads.
// There are many ways these combinations could occur. This is
// designed to match the way encoding/binary.LittleEndian does it.
// Little-endian loads
(OR(L|Q) x0:(MOVBload [i0] {s} p mem)
sh:(SHL(L|Q)const [8] x1:(MOVBload [i1] {s} p mem)))
&& i1 == i0+1
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVWload [i0] {s} p mem)
(OR(L|Q) x0:(MOVBload [i] {s} p0 mem)
sh:(SHL(L|Q)const [8] x1:(MOVBload [i] {s} p1 mem)))
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVWload [i] {s} p0 mem)
(OR(L|Q) x0:(MOVWload [i0] {s} p mem)
sh:(SHL(L|Q)const [16] x1:(MOVWload [i1] {s} p mem)))
&& i1 == i0+2
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVLload [i0] {s} p mem)
(OR(L|Q) x0:(MOVWload [i] {s} p0 mem)
sh:(SHL(L|Q)const [16] x1:(MOVWload [i] {s} p1 mem)))
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVLload [i] {s} p0 mem)
(ORQ x0:(MOVLload [i0] {s} p mem)
sh:(SHLQconst [32] x1:(MOVLload [i1] {s} p mem)))
&& i1 == i0+4
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVQload [i0] {s} p mem)
(ORQ x0:(MOVLload [i] {s} p0 mem)
sh:(SHLQconst [32] x1:(MOVLload [i] {s} p1 mem)))
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (MOVQload [i] {s} p0 mem)
(OR(L|Q)
s1:(SHL(L|Q)const [j1] x1:(MOVBload [i1] {s} p mem))
or:(OR(L|Q)
s0:(SHL(L|Q)const [j0] x0:(MOVBload [i0] {s} p mem))
y))
&& i1 == i0+1
&& j1 == j0+8
&& j0 % 16 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (OR(L|Q) <v.Type> (SHL(L|Q)const <v.Type> [j0] (MOVWload [i0] {s} p mem)) y)
(OR(L|Q)
s1:(SHL(L|Q)const [j1] x1:(MOVBload [i] {s} p1 mem))
or:(OR(L|Q)
s0:(SHL(L|Q)const [j0] x0:(MOVBload [i] {s} p0 mem))
y))
&& j1 == j0+8
&& j0 % 16 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (OR(L|Q) <v.Type> (SHL(L|Q)const <v.Type> [j0] (MOVWload [i] {s} p0 mem)) y)
(ORQ
s1:(SHLQconst [j1] x1:(MOVWload [i1] {s} p mem))
or:(ORQ
s0:(SHLQconst [j0] x0:(MOVWload [i0] {s} p mem))
y))
&& i1 == i0+2
&& j1 == j0+16
&& j0 % 32 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (ORQ <v.Type> (SHLQconst <v.Type> [j0] (MOVLload [i0] {s} p mem)) y)
(ORQ
s1:(SHLQconst [j1] x1:(MOVWload [i] {s} p1 mem))
or:(ORQ
s0:(SHLQconst [j0] x0:(MOVWload [i] {s} p0 mem))
y))
&& j1 == j0+16
&& j0 % 32 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (ORQ <v.Type> (SHLQconst <v.Type> [j0] (MOVLload [i] {s} p0 mem)) y)
// Big-endian loads
(OR(L|Q)
x1:(MOVBload [i1] {s} p mem)
sh:(SHL(L|Q)const [8] x0:(MOVBload [i0] {s} p mem)))
&& i1 == i0+1
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (ROLWconst <v.Type> [8] (MOVWload [i0] {s} p mem))
(OR(L|Q)
x1:(MOVBload [i] {s} p1 mem)
sh:(SHL(L|Q)const [8] x0:(MOVBload [i] {s} p0 mem)))
&& x0.Uses == 1
&& x1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, sh)
=> @mergePoint(b,x0,x1) (ROLWconst <v.Type> [8] (MOVWload [i] {s} p0 mem))
(OR(L|Q)
r1:(ROLWconst [8] x1:(MOVWload [i1] {s} p mem))
sh:(SHL(L|Q)const [16] r0:(ROLWconst [8] x0:(MOVWload [i0] {s} p mem))))
&& i1 == i0+2
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, r0, r1, sh)
=> @mergePoint(b,x0,x1) (BSWAPL <v.Type> (MOVLload [i0] {s} p mem))
(OR(L|Q)
r1:(ROLWconst [8] x1:(MOVWload [i] {s} p1 mem))
sh:(SHL(L|Q)const [16] r0:(ROLWconst [8] x0:(MOVWload [i] {s} p0 mem))))
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, r0, r1, sh)
=> @mergePoint(b,x0,x1) (BSWAPL <v.Type> (MOVLload [i] {s} p0 mem))
(ORQ
r1:(BSWAPL x1:(MOVLload [i1] {s} p mem))
sh:(SHLQconst [32] r0:(BSWAPL x0:(MOVLload [i0] {s} p mem))))
&& i1 == i0+4
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& sh.Uses == 1
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, r0, r1, sh)
=> @mergePoint(b,x0,x1) (BSWAPQ <v.Type> (MOVQload [i0] {s} p mem))
(ORQ
r1:(BSWAPL x1:(MOVLload [i] {s} p1 mem))
sh:(SHLQconst [32] r0:(BSWAPL x0:(MOVLload [i] {s} p0 mem))))
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& sh.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& mergePoint(b,x0,x1) != nil
&& clobber(x0, x1, r0, r1, sh)
=> @mergePoint(b,x0,x1) (BSWAPQ <v.Type> (MOVQload [i] {s} p0 mem))
(OR(L|Q)
s0:(SHL(L|Q)const [j0] x0:(MOVBload [i0] {s} p mem))
or:(OR(L|Q)
s1:(SHL(L|Q)const [j1] x1:(MOVBload [i1] {s} p mem))
y))
&& i1 == i0+1
&& j1 == j0-8
&& j1 % 16 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (OR(L|Q) <v.Type> (SHL(L|Q)const <v.Type> [j1] (ROLWconst <typ.UInt16> [8] (MOVWload [i0] {s} p mem))) y)
(OR(L|Q)
s0:(SHL(L|Q)const [j0] x0:(MOVBload [i] {s} p0 mem))
or:(OR(L|Q)
s1:(SHL(L|Q)const [j1] x1:(MOVBload [i] {s} p1 mem))
y))
&& j1 == j0-8
&& j1 % 16 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (OR(L|Q) <v.Type> (SHL(L|Q)const <v.Type> [j1] (ROLWconst <typ.UInt16> [8] (MOVWload [i] {s} p0 mem))) y)
(ORQ
s0:(SHLQconst [j0] r0:(ROLWconst [8] x0:(MOVWload [i0] {s} p mem)))
or:(ORQ
s1:(SHLQconst [j1] r1:(ROLWconst [8] x1:(MOVWload [i1] {s} p mem)))
y))
&& i1 == i0+2
&& j1 == j0-16
&& j1 % 32 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, r0, r1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (ORQ <v.Type> (SHLQconst <v.Type> [j1] (BSWAPL <typ.UInt32> (MOVLload [i0] {s} p mem))) y)
(ORQ
s0:(SHLQconst [j0] r0:(ROLWconst [8] x0:(MOVWload [i] {s} p0 mem)))
or:(ORQ
s1:(SHLQconst [j1] r1:(ROLWconst [8] x1:(MOVWload [i] {s} p1 mem)))
y))
&& j1 == j0-16
&& j1 % 32 == 0
&& x0.Uses == 1
&& x1.Uses == 1
&& r0.Uses == 1
&& r1.Uses == 1
&& s0.Uses == 1
&& s1.Uses == 1
&& or.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& mergePoint(b,x0,x1,y) != nil
&& clobber(x0, x1, r0, r1, s0, s1, or)
=> @mergePoint(b,x0,x1,y) (ORQ <v.Type> (SHLQconst <v.Type> [j1] (BSWAPL <typ.UInt32> (MOVLload [i] {s} p0 mem))) y)
// Combine 2 byte stores + shift into rolw 8 + word store
(MOVBstore [i] {s} p w
x0:(MOVBstore [i-1] {s} p (SHRWconst [8] w) mem))
&& x0.Uses == 1
&& clobber(x0)
=> (MOVWstore [i-1] {s} p (ROLWconst <w.Type> [8] w) mem)
(MOVBstore [i] {s} p1 w
x0:(MOVBstore [i] {s} p0 (SHRWconst [8] w) mem))
&& x0.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& clobber(x0)
=> (MOVWstore [i] {s} p0 (ROLWconst <w.Type> [8] w) mem)
// Combine stores + shifts into bswap and larger (unaligned) stores
(MOVBstore [i] {s} p w
x2:(MOVBstore [i-1] {s} p (SHRLconst [8] w)
x1:(MOVBstore [i-2] {s} p (SHRLconst [16] w)
x0:(MOVBstore [i-3] {s} p (SHRLconst [24] w) mem))))
&& x0.Uses == 1
&& x1.Uses == 1
&& x2.Uses == 1
&& clobber(x0, x1, x2)
=> (MOVLstore [i-3] {s} p (BSWAPL <w.Type> w) mem)
(MOVBstore [i] {s} p3 w
x2:(MOVBstore [i] {s} p2 (SHRLconst [8] w)
x1:(MOVBstore [i] {s} p1 (SHRLconst [16] w)
x0:(MOVBstore [i] {s} p0 (SHRLconst [24] w) mem))))
&& x0.Uses == 1
&& x1.Uses == 1
&& x2.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& sequentialAddresses(p1, p2, 1)
&& sequentialAddresses(p2, p3, 1)
&& clobber(x0, x1, x2)
=> (MOVLstore [i] {s} p0 (BSWAPL <w.Type> w) mem)
(MOVBstore [i] {s} p w
x6:(MOVBstore [i-1] {s} p (SHRQconst [8] w)
x5:(MOVBstore [i-2] {s} p (SHRQconst [16] w)
x4:(MOVBstore [i-3] {s} p (SHRQconst [24] w)
x3:(MOVBstore [i-4] {s} p (SHRQconst [32] w)
x2:(MOVBstore [i-5] {s} p (SHRQconst [40] w)
x1:(MOVBstore [i-6] {s} p (SHRQconst [48] w)
x0:(MOVBstore [i-7] {s} p (SHRQconst [56] w) mem))))))))
&& x0.Uses == 1
&& x1.Uses == 1
&& x2.Uses == 1
&& x3.Uses == 1
&& x4.Uses == 1
&& x5.Uses == 1
&& x6.Uses == 1
&& clobber(x0, x1, x2, x3, x4, x5, x6)
=> (MOVQstore [i-7] {s} p (BSWAPQ <w.Type> w) mem)
(MOVBstore [i] {s} p7 w
x6:(MOVBstore [i] {s} p6 (SHRQconst [8] w)
x5:(MOVBstore [i] {s} p5 (SHRQconst [16] w)
x4:(MOVBstore [i] {s} p4 (SHRQconst [24] w)
x3:(MOVBstore [i] {s} p3 (SHRQconst [32] w)
x2:(MOVBstore [i] {s} p2 (SHRQconst [40] w)
x1:(MOVBstore [i] {s} p1 (SHRQconst [48] w)
x0:(MOVBstore [i] {s} p0 (SHRQconst [56] w) mem))))))))
&& x0.Uses == 1
&& x1.Uses == 1
&& x2.Uses == 1
&& x3.Uses == 1
&& x4.Uses == 1
&& x5.Uses == 1
&& x6.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& sequentialAddresses(p1, p2, 1)
&& sequentialAddresses(p2, p3, 1)
&& sequentialAddresses(p3, p4, 1)
&& sequentialAddresses(p4, p5, 1)
&& sequentialAddresses(p5, p6, 1)
&& sequentialAddresses(p6, p7, 1)
&& clobber(x0, x1, x2, x3, x4, x5, x6)
=> (MOVQstore [i] {s} p0 (BSWAPQ <w.Type> w) mem)
// Combine constant stores into larger (unaligned) stores.
(MOVBstoreconst [c] {s} p x:(MOVBstoreconst [a] {s} p mem))
&& x.Uses == 1
&& a.Off() + 1 == c.Off()
&& clobber(x)
=> (MOVWstoreconst [makeValAndOff64(a.Val()&0xff | c.Val()<<8, a.Off())] {s} p mem)
(MOVBstoreconst [a] {s} p x:(MOVBstoreconst [c] {s} p mem))
&& x.Uses == 1
&& a.Off() + 1 == c.Off()
&& clobber(x)
=> (MOVWstoreconst [makeValAndOff64(a.Val()&0xff | c.Val()<<8, a.Off())] {s} p mem)
(MOVWstoreconst [c] {s} p x:(MOVWstoreconst [a] {s} p mem))
&& x.Uses == 1
&& a.Off() + 2 == c.Off()
&& clobber(x)
=> (MOVLstoreconst [makeValAndOff64(a.Val()&0xffff | c.Val()<<16, a.Off())] {s} p mem)
(MOVWstoreconst [a] {s} p x:(MOVWstoreconst [c] {s} p mem))
&& x.Uses == 1
&& a.Off() + 2 == c.Off()
&& clobber(x)
=> (MOVLstoreconst [makeValAndOff64(a.Val()&0xffff | c.Val()<<16, a.Off())] {s} p mem)
(MOVLstoreconst [c] {s} p x:(MOVLstoreconst [a] {s} p mem))
&& x.Uses == 1
&& a.Off() + 4 == c.Off()
&& clobber(x)
=> (MOVQstore [a.Off32()] {s} p (MOVQconst [a.Val()&0xffffffff | c.Val()<<32]) mem)
(MOVLstoreconst [a] {s} p x:(MOVLstoreconst [c] {s} p mem))
&& x.Uses == 1
&& a.Off() + 4 == c.Off()
&& clobber(x)
=> (MOVQstore [a.Off32()] {s} p (MOVQconst [a.Val()&0xffffffff | c.Val()<<32]) mem)
(MOVQstoreconst [c] {s} p x:(MOVQstoreconst [c2] {s} p mem))
&& config.useSSE
&& x.Uses == 1
&& c2.Off() + 8 == c.Off()
&& c.Val() == 0
&& c2.Val() == 0
&& clobber(x)
=> (MOVOstore [c2.Off32()] {s} p (MOVOconst [0]) mem)
// Combine stores into larger (unaligned) stores. Little endian.
(MOVBstore [i] {s} p (SHR(W|L|Q)const [8] w) x:(MOVBstore [i-1] {s} p w mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVWstore [i-1] {s} p w mem)
(MOVBstore [i] {s} p w x:(MOVBstore [i+1] {s} p (SHR(W|L|Q)const [8] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVWstore [i] {s} p w mem)
(MOVBstore [i] {s} p (SHR(L|Q)const [j] w) x:(MOVBstore [i-1] {s} p w0:(SHR(L|Q)const [j-8] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVWstore [i-1] {s} p w0 mem)
(MOVBstore [i] {s} p1 (SHR(W|L|Q)const [8] w) x:(MOVBstore [i] {s} p0 w mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& clobber(x)
=> (MOVWstore [i] {s} p0 w mem)
(MOVBstore [i] {s} p0 w x:(MOVBstore [i] {s} p1 (SHR(W|L|Q)const [8] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& clobber(x)
=> (MOVWstore [i] {s} p0 w mem)
(MOVBstore [i] {s} p1 (SHR(L|Q)const [j] w) x:(MOVBstore [i] {s} p0 w0:(SHR(L|Q)const [j-8] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 1)
&& clobber(x)
=> (MOVWstore [i] {s} p0 w0 mem)
(MOVWstore [i] {s} p (SHR(L|Q)const [16] w) x:(MOVWstore [i-2] {s} p w mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVLstore [i-2] {s} p w mem)
(MOVWstore [i] {s} p (SHR(L|Q)const [j] w) x:(MOVWstore [i-2] {s} p w0:(SHR(L|Q)const [j-16] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVLstore [i-2] {s} p w0 mem)
(MOVWstore [i] {s} p1 (SHR(L|Q)const [16] w) x:(MOVWstore [i] {s} p0 w mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& clobber(x)
=> (MOVLstore [i] {s} p0 w mem)
(MOVWstore [i] {s} p1 (SHR(L|Q)const [j] w) x:(MOVWstore [i] {s} p0 w0:(SHR(L|Q)const [j-16] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 2)
&& clobber(x)
=> (MOVLstore [i] {s} p0 w0 mem)
(MOVLstore [i] {s} p (SHRQconst [32] w) x:(MOVLstore [i-4] {s} p w mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVQstore [i-4] {s} p w mem)
(MOVLstore [i] {s} p (SHRQconst [j] w) x:(MOVLstore [i-4] {s} p w0:(SHRQconst [j-32] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVQstore [i-4] {s} p w0 mem)
(MOVLstore [i] {s} p1 (SHRQconst [32] w) x:(MOVLstore [i] {s} p0 w mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& clobber(x)
=> (MOVQstore [i] {s} p0 w mem)
(MOVLstore [i] {s} p1 (SHRQconst [j] w) x:(MOVLstore [i] {s} p0 w0:(SHRQconst [j-32] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& clobber(x)
=> (MOVQstore [i] {s} p0 w0 mem)
(MOVBstore [i] {s} p
x1:(MOVBload [j] {s2} p2 mem)
mem2:(MOVBstore [i-1] {s} p
x2:(MOVBload [j-1] {s2} p2 mem) mem))
&& x1.Uses == 1
&& x2.Uses == 1
&& mem2.Uses == 1
&& clobber(x1, x2, mem2)
=> (MOVWstore [i-1] {s} p (MOVWload [j-1] {s2} p2 mem) mem)
(MOVWstore [i] {s} p
x1:(MOVWload [j] {s2} p2 mem)
mem2:(MOVWstore [i-2] {s} p
x2:(MOVWload [j-2] {s2} p2 mem) mem))
&& x1.Uses == 1
&& x2.Uses == 1
&& mem2.Uses == 1
&& clobber(x1, x2, mem2)
=> (MOVLstore [i-2] {s} p (MOVLload [j-2] {s2} p2 mem) mem)
(MOVLstore [i] {s} p
x1:(MOVLload [j] {s2} p2 mem)
mem2:(MOVLstore [i-4] {s} p
x2:(MOVLload [j-4] {s2} p2 mem) mem))
&& x1.Uses == 1
&& x2.Uses == 1
&& mem2.Uses == 1
&& clobber(x1, x2, mem2)
=> (MOVQstore [i-4] {s} p (MOVQload [j-4] {s2} p2 mem) mem)
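As an aside on GOSSAFUNC: assuming the reproducer is saved as p.go, the following dumps every SSA pass for f to ssa.html, which you can open in a browser to watch the rewrite rules fire (or fail to fire):

GOSSAFUNC=f go tool compile p.go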

Finally, if you get a chance, try to check that other architectures optimize the pattern too; arm64, ppc64le, and s390x all do unaligned load/store merging as well.
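
A sketch of what the cross-architecture annotations might look like (the negative patterns for the non-amd64 ports are my guesses at each assembler's narrow move mnemonics and would need checking against real output):

func f(b []byte, x *[8]byte) {
	_ = b[8]
	// amd64:-`MOVB`,-`MOVW`,-`MOVL`
	// arm64:-`MOVB`,-`MOVH`,-`MOVW`
	// ppc64le:-`MOVB`,-`MOVH`,-`MOVW`
	// s390x:-`MOVB`,-`MOVH`,-`MOVW`
	binary.LittleEndian.PutUint64(b, binary.LittleEndian.Uint64(x[:]))
}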

@mundaym (Member) commented Sep 28, 2020

Aside: it would be really nice to do the unaligned load/store merging optimizations in a generic optimization pass. These rules are quite hard to maintain when there are a lot of optimizations that might interfere with the target patterns.

@mvdan (Member) commented Sep 28, 2020

As a quick aside, when new rules are added, is there anything to warn us about them making other rules suddenly trigger less often? With the huge amount of rules we have today, it's practically impossible to foresee that kind of interaction.

I guess we can keep adding more and more tests to cover common cases, but it still feels like some sort of tooling would be nice. I imagine there are a few rules today that basically never trigger anymore, for example.

@mundaym (Member) commented Sep 28, 2020

@mvdan I use compilecmp a lot (https://github.com/josharian/compilecmp). That at least tells me when the generated code for a function grows significantly, which is often a sign I've broken a pre-existing rewrite rule. In practice the codegen tests are also fairly good at catching a lot of stuff; there is quite a lot of coverage there now.
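
For reference, compilecmp compares the code the compiler generates at two git refs; going by its README, the basic invocation is something like the following (the refs here are just examples):

compilecmp master HEAD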

@agarciamontoro commented Sep 29, 2020

Thank you for all the detailed info; I'll start investigating as soon as possible! I may get back to you in the coming days with questions, as this would be my first contribution to the project :)
