Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: add unrolling stage for automatic loop unrolling #51302

Open
vpachkov opened this issue Feb 21, 2022 · 8 comments
Open

cmd/compile: add unrolling stage for automatic loop unrolling #51302

vpachkov opened this issue Feb 21, 2022 · 8 comments
Labels
NeedsInvestigation Performance
Milestone

Comments

@vpachkov
Copy link
Contributor

@vpachkov vpachkov commented Feb 21, 2022

Loop unrolling is a tecnique intended to speed up loops. It's supported by other mature compilers such as Clang.

This proposal consists of possible implementation ideas: general loop unrolling rules and how they can apply to Golang compiler, and some simple benchmarks reflecting this optimization performance on simple constant range loops.

It's easy to begin with a simple constant range for loops such as:

for i := 0 ; i < 1000 ; i++ {
	a[i] += 2;
}

And then add more features inside unroll package which will represent a loop unrolling optimization pass.

Unroll package implementation ideas

The following approach could already be easily integrated inside Golang optimization pipeline right after the inlining stage.

// Inlining
base.Timer.Start("fe", "inlining")
if base.Flag.LowerL != 0 {
	inline.InlinePackage()
}
noder.MakeWrappers(typecheck.Target) // must happen after inlining

// Unrolling
unroll.UnrollPackage()

UnrollPackage() will traverse each function to find for loops and check if it's appropriate for unrolling, then perform unrolling by calling Unroll() function if so.

// Unroll function takes 2 parameters:
// forstmt - an appropriate for loop,
// unroll - an unrolling factor (the amount of times the body will be repeated).
// It unrolls it in-place and returns a tail which should be placed right after the loop.
// The tail is generated if for range isn't divisible by the unrolling factor.
func Unroll(forstmt *ir.ForStmt, unroll uint32) (tail ir.Nodes) {
	....
}

It's important to calculate the unrolling factor correctly. If it's too big we can run into a problem when a for loop body exceeds the instruction cache. A possible idea of picking the factor is by reusing the part of the inlining stage, that is, hairyVisitor since there was already a lot of work done for choosing the weights of the nodes.

if A = maximum weight of the for loop which is short enough to be kept in cache, and B = the weight of a for loop body calculated by the extended version of hairyVisitor, then the unrolling factor = A / B. If it's greater than 1 then loop unrolling is beneficial for that loop.

Unroll function implementation ideas

When unroll variable is picked a for loop can be unrolled in 4 steps. Once again, we're dealing with constant values here:

  1. Align condition so the loop does not go out of the boundaries (i < 1000 -> i < 1000 / unroll * unroll)
if forstmt.Cond != nil {
	cmp := forstmt.Cond.(*ir.BinaryExpr)
	val := ir.ConstValue(cmp.Y).(int64)
	alignedval = val / int64(unroll) * int64(unroll)
	newval := ir.NewConstExpr(constant.MakeInt64(alignedval), cmp.Y)
	cmp.Y = newval
}
  1. Modify post expression so the induction variable goes unroll steps at a time (i++ -> i += unroll)
// Unroll post
if forstmt.Post != nil {
	post := forstmt.Post.(*ir.AssignOpStmt)
	inc := ir.ConstValue(post.Y).(int64)
	unrolledinc := inc * int64(unroll)
	newinc := ir.NewConstExpr(constant.MakeInt64(unrolledinc), post.Y)
	post.Y = newinc
}
  1. Modify body.

This step is a little bit more complex since to only copy the body isn't enough. Suppose unroll is 4 and the body of the loop is:

for i := 0 ; i < 100 ; i++ {
	sum += i
}

Just coping the body 4 times isn't enough since it gives us:

for i := 0 ; i < 100 / 4 * 4 ; i+=4 {
	sum += i
	sum += i
	sum += i
	sum += i
}

The correct version is:

for i := 0 ; i < 100 / 4 * 4 ; i+=4 {
	sum += i
	sum += i + 1
	sum += i + 2
	sum += i + 3
}

Firstly, we must find all induction variables and then after coping the body, apply shifting operation each time.
Keeping that in mind, body unrolling can be implemented in the following way:

// Firstly, copy original version of a body
bodycopy := ir.DeepCopyList(base.Pos, forstmt.Body)

// Copy the body unroll - 1 times, apply shifting and insert it in the body
for unr := uint64(1); unr < unroll; unr++ {
	appendbody := ir.DeepCopyList(base.Pos, bodycopy)

	// i is a loop induction variable
	// Note: there could be multiple induction variables for a loop.
	shiftNodes(appendbody, i, uint64(inc)*unr)

	forstmt.Body.Append(appendbody...)
}

// shiftNodes function takes 3 parameters:
// nodes - a list of nodes,
// orig - an original node that's going to be shifted,
// shift - a shift constant.
// It generates a new node which represents an expression orig + shift and
// changes every orig reference to the new expression.
func shiftNodes(nodes ir.Nodes, orig ir.Node, shift uint64) {
	idx := ir.NewConstExpr(constant.MakeUint64(shift), orig)
	idx.SetType(types.Types[types.TUINTPTR])
	idx.SetTypecheck(1)

	shifted := ir.NewBinaryExpr(base.Pos, ir.OADD, orig, idx)
	shifted.SetType(orig.Type())
	shifted.SetTypecheck(orig.Typecheck())

	var edit func(ir.Node) ir.Node
	edit = func(x ir.Node) ir.Node {
		ir.EditChildren(x, edit)
		if x == orig {
			return shifted
		}
		return x
	}

	for _, node := range nodes {
		edit(node)
	}
}

Suppose we have a loop:

for i := 0 ; i < 101 ; i++ {
	sum += i
}

And the unroll is 3. Than it should be unrolled to this:

for i := 0 ; i < 101 / 3 * 3 ; i+=3 {
	sum += i
	sum += i + 1
	sum += i + 2
}

Since 101 isn't divisible by 3 there are 101 % 3 operations that hasn't been performed yet:

sum += 99
sum += 100

This should also be generated and placed after the for loop.

for idx := val / unroll * unroll; idx < val; idx++ {
	appendtail := ir.DeepCopyList(base.Pos, bodycopy)
	placeConst(appendtail, i, idx)
	tail.Append(appendtail...)
}

// placeConst function takes 3 parameters:
// nodes - a list of nodes,
// orig - an original node that's going to be replaced by a constant,
// con - a constant itself.
// It generates a new constant and changes every orig reference to the new contant node.
func placeConst(nodes ir.Nodes, orig ir.Node, con int64) {
	i := ir.NewConstExpr(constant.MakeInt64(con), orig)
	i.SetType(types.Types[types.TINT64])
	i.SetTypecheck(1)

	var edit func(ir.Node) ir.Node
	edit = func(x ir.Node) ir.Node {
		ir.EditChildren(x, edit)
		if x == orig {
			return i
		}
		return x
	}

	for _, node := range nodes {
		edit(node)
	}
}

Results

The following lines of code:

var a [100]int

for i := 0 ; i < 100 - 1; i++ {
	a[i] += 2
}

currently are compiled to:

 1053eb3:	31 c0                	xor    eax,eax
 1053eb5:	eb 09                	jmp    1053ec0 <_main.main+0x60>
 1053eb7:	48 83 44 c4 18 02    	add    QWORD PTR [rsp+rax*8+0x18],0x2
 1053ebd:	48 ff c0             	inc    rax
 1053ec0:	48 83 f8 63          	cmp    rax,0x63
 1053ec4:	7c f1                	jl     1053eb7 <_main.main+0x57>

After applying a very basic version of loop unrolling with the above approach those are compiled to:

 1053eb3:	31 c0                	xor    eax,eax
 1053eb5:	eb 1c                	jmp    1053ed3 <_main.main+0x73>
 1053eb7:	48 83 44 c4 18 02    	add    QWORD PTR [rsp+rax*8+0x18],0x2
 1053ebd:	48 83 44 c4 20 02    	add    QWORD PTR [rsp+rax*8+0x20],0x2
 1053ec3:	48 83 44 c4 28 02    	add    QWORD PTR [rsp+rax*8+0x28],0x2
 1053ec9:	48 83 44 c4 30 02    	add    QWORD PTR [rsp+rax*8+0x30],0x2
 1053ecf:	48 83 c0 04          	add    rax,0x4
 1053ed3:	48 83 f8 60          	cmp    rax,0x60
 1053ed7:	7c de                	jl     1053eb7 <_main.main+0x57>
 1053ed9:	48 83 84 24 18 03 00 	add    QWORD PTR [rsp+0x318],0x2
 1053ee0:	00 02 
 1053ee2:	48 83 84 24 20 03 00 	add    QWORD PTR [rsp+0x320],0x2
 1053ee9:	00 02 
 1053eeb:	48 83 84 24 28 03 00 	add    QWORD PTR [rsp+0x328],0x2

Body is repeated 4 times with the shifted indices. Tail is placed after the loop that handles 96th, 97th and 98th indices.

Benchmarks

func BenchmarkUnrolling(b *testing.B) {
	var a [100]int

	for j := 0 ; j < b.N ; j++ {
		for i := 0 ; i < 100 - 1; i++ {
			a[i] += 2
		}
	}
}
name          old time/op  new time/op  delta
Unrolling-12  51.7ns ± 0%  23.9ns ± 3%  -53.71%  (p=0.016 n=4+5)
@gopherbot gopherbot added this to the Proposal milestone Feb 21, 2022
@vpachkov
Copy link
Contributor Author

@vpachkov vpachkov commented Feb 21, 2022

Related: #49997

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 21, 2022

This is not an API change, so taking this out of the proposal process.

In my opinion the most important consideration is compile time. The Go compiler has a much greater emphasis on compile time than a compiler like clang.

CC @randall77 @golang/runtime

@ianlancetaylor ianlancetaylor added NeedsInvestigation and removed Proposal labels Feb 21, 2022
@ianlancetaylor ianlancetaylor removed this from the Proposal milestone Feb 21, 2022
@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Feb 21, 2022
@ianlancetaylor ianlancetaylor changed the title proposal: cmd/compile: add unrolling stage for automatic loop unrolling cmd/compile: add unrolling stage for automatic loop unrolling Feb 21, 2022
@randall77
Copy link
Contributor

@randall77 randall77 commented Feb 22, 2022

Another consideration is binary size. We don't want to just unroll everything, as that can lead to binary bloat.

I think there's definitely something we can proceed with here. The way I see it there are 2 parts:

  1. The mechanism of loop unrolling
  2. The heuristics for loop unrolling

It would be good to start with figuring out how to implement 1, and have a very conservative heuristic for 2. Once we have that working, we can investigate improving the heuristic to encompass more cases. Whether that is using profile-directed feedback, or analysis of what kind of loop bodies would benefit the most, or whatever.

@erifan
Copy link
Contributor

@erifan erifan commented Feb 22, 2022

Maybe we should do the loop invariant optimization first?

@randall77
Copy link
Contributor

@randall77 randall77 commented Feb 22, 2022

@erifan is mentioning #15808, I think. I agree that fixing that issue would also help, but I don't think there is any fundamental need to order the two. Other than, perhaps, #15808 is easier and the heuristics are probably a lot easier to figure out.

@erifan
Copy link
Contributor

@erifan erifan commented Feb 22, 2022

@erifan is mentioning #15808, I think. I agree that fixing that issue would also help, but I don't think there is any fundamental need to order the two. Other than, perhaps, #15808 is easier and the heuristics are probably a lot easier to figure out.

Yeah, I agree that there is no necessary connection between the two, I mean if loop unrolling is based on loop invariant hoisting , will this optimization be a little easier to implement?

In addition, perhaps not very relevant to this topic, do we need to consider loop-related optimizations from a higher dimension. Because maybe in the future we will also consider auto-vectorization, I hope we have a general consideration.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Feb 22, 2022

cc @dr2chase who did some experiment with loop unrolling in SSA.

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Feb 22, 2022

I've played with both loop unrolling and loop invariant hoisting, and when the Go world was very very Intel-oriented neither of these reliably made sense (and their code was unpleasant). I recently attempted to revive the invariant-hoisting CL and could not easily get it to work (and I am not likely to look at this again before March 7, i.e., the likely go1.18 release date).

But in general, the wins were less than awesome, binary sizes go up a bit with loop unrolling, and for a given binary-size/compile-time budget it was not at all clear that this was the best way to increase performance. It also creates problems for debugging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Performance
Projects
None yet
Development

No branches or pull requests

8 participants