New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/go: critical path scheduling #8893

Open
dvyukov opened this Issue Oct 7, 2014 · 30 comments

Comments

Projects
None yet
7 participants
@dvyukov
Member

dvyukov commented Oct 7, 2014

Currently go tool schedules work in topological order, this leads to suboptimal
parallelization. It's possible that towards end, very few work items are available
leading to idle processors.

Critical path scheduling gives much better results in this respect:
http://en.wikipedia.org/wiki/Critical_path_method

Here is a prototype CL:
https://golang.org/cl/13235046/diff/3001/src/cmd/go/build.go
I've observed up to 2x 'go test -a std' speedup on a 32-way machine.

We can also assign meaningful weights to work nodes: e.g. test > link > compile.
It would also be super useful to somehow assign weights to test nodes proportional to
test run time (e.g. runtime tests take significantly more than than most packages). This
will have significant effect even on low-core machines (as runtime tests will start
first).
@rsc

This comment has been minimized.

Contributor

rsc commented Oct 7, 2014

Comment 1:

Please send CL after Go 1.4 is out.

Labels changed: added release-go1.5, removed release-none.

Status changed to Accepted.

@josharian

This comment has been minimized.

Contributor

josharian commented Feb 7, 2015

@dvyukov are you planning to migrate that CL to gerrit and mail? Does it speed up make.bash as well?

@dvyukov

This comment has been minimized.

Member

dvyukov commented Feb 7, 2015

Is anybody planning to review it to +2 state?

@josharian

This comment has been minimized.

Contributor

josharian commented Feb 7, 2015

I promise to review it thoroughly (although I'll be in and out until late March). And rsc asked for it to be mailed. If you don't want to bother, do you mind if I take it over?

The existing CL looks possibly incomplete. assignPriorities contains (lines 648-649):

if a.priority < a1.priority+1 {
}
@dvyukov

This comment has been minimized.

Member

dvyukov commented Feb 8, 2015

Take over it.

That if should be:

if a.priority < a1.priority+a.weight {
a.priority = a1.priority+a.weight
}

Also weight must be more sensible, that is, represent real cost of action. Something along the lines of:
link > compile
run test > link
Fake instant actions should have weight << everything else.
The goal here is to avoid serial tail of actions that run one after another delaying whole thing.

Ideally 'run test' weights represent actual execution time, so that runtime test starts first rather then starts late and delays all.bash. This may be possible in 'go test' continuos mode that monitors filesystem changes. But that's of course not a part of this change.

@minux

This comment has been minimized.

Member

minux commented Feb 8, 2015

@dvyukov

This comment has been minimized.

Member

dvyukov commented Feb 8, 2015

And compiling anything involving cgo should be even heavier weight than
compiling any other packages.

good idea

If we reorder tests, do we need to preserve the order of showing the
results?

absolutely
but that is already ensured by chaining printing actions with dependencies, otherwise they would be printed out of order even today with p>1
so you don't need to do anything special here

@josharian

This comment has been minimized.

Contributor

josharian commented Mar 17, 2015

I still plan to take over this CL. I am just waiting for the compiler performance to stabilize a bit, so that I can get more realistic measurements of the relative speeds of different kinds of actions.

@rsc rsc removed accepted labels Apr 14, 2015

josharian added a commit to josharian/go that referenced this issue Apr 24, 2015

cmd/go: use critical path scheduling
DO NOT REVIEW

This is a straight port of Dmitry's CL 13235046.

TODO:

* Understand better how it works.
* Assign better weights.
* Convince myself that it is an improvement.
  Early measurements are not yet compelling,
  although that could be because the weights
  are not well-chosen.

Notes on weights from the issue:

link > compile
run test > link
Fake instant actions << everything else
cgo increases weights

Fixes golang#8893.

Change-Id: I8fba875a9a56aed132bd6b37b607e374f336728d
@bradfitz

This comment has been minimized.

Member

bradfitz commented May 7, 2015

This is pretty high priority for our all-compile builder. I increased it from 2 cores to 16 cores and we got almost no additional speed out of the builder.

@josharian

This comment has been minimized.

Contributor

josharian commented May 7, 2015

I'll put this back in my queue (next step is attaching some more realistic weights.), but I'm unlikely to make much progress this week.

In the meantime, if you pull in https://go-review.googlesource.com/#/c/8918 as-is, does it help? It didn't show much improvement for me locally.

@josharian

This comment has been minimized.

Contributor

josharian commented May 8, 2015

@bradfitz any numbers on that CL on a big machine?

@dvyukov @minux I spent a bit of time looking at tool exec times for 6g, 6l, and asm. asm is very fast, linking is very slow, but 6g has a huge amount of variability--two orders of magnitude. The variability goes down a tiny bit if you look at the number of files to compile, but not much. The variability is down to one order of magnitude if you look at the number of bytes to be compiled, but I don't think that we want to stat every file before getting started. Any other ideas for things to look at? As is, I think the best we can do for weights is very rough estimates, like fake=1, asm=10, gc=100, link=1000, cgo=1000. Any other ideas?

@mwhudson

This comment has been minimized.

Contributor

mwhudson commented May 8, 2015

We already do stat every file before getting started (to compute staleness). So maybe that information can be saved somewhere?

@dvyukov

This comment has been minimized.

Member

dvyukov commented May 8, 2015

Number of files look like a good first approximation.
Does this thing help with 8-16 cores?

@dvyukov

This comment has been minimized.

Member

dvyukov commented May 8, 2015

Probably also open files to read build tags.

@bradfitz

This comment has been minimized.

Member

bradfitz commented May 8, 2015

Sorry, I haven't had any free time to test this.

@josharian

This comment has been minimized.

Contributor

josharian commented May 8, 2015

No prob. I'm in the middle of running some tests now, actually, and it looks better than before. Expect a minimal CL and more comments here later today.

@josharian

This comment has been minimized.

Contributor

josharian commented May 9, 2015

My earlier optimism was the result of a bug.

I have 8 logical (4 real) CPUs on my laptop. It spent the afternoon running three build tests in a loop: go build -a std, go test -short std, all.bash. I had everything else shut down, measured wall time, and ran the results through a t-test. At 50 iterations (20 for all.bash, which is SLOW), the CL showed no statistically significant improvement, and the means were within 1%.

I then hacked in the best available predictor: Store previous action times on disk. Even with almost perfect weights, I got pretty much the same results.

Furthermore, this re-ordering hurts the experience of running all.bash, as there are long stalls between test outputs, which then come in spurts.

Maybe this helps with lots more cores than I have--Brad, I guess I would like your help confirming that after all. As it is, though, I don't feel comfortable mailing the CL, since I have no way to demonstrate its value.

@minux

This comment has been minimized.

Member

minux commented May 9, 2015

@josharian

This comment has been minimized.

Contributor

josharian commented May 9, 2015

Dmitry's initial post said 2x speed-up on a 32 core machine. Maybe my machine (which is probably typical) doesn't have enough cores to benefit.

@minux

This comment has been minimized.

Member

minux commented May 9, 2015

@josharian

This comment has been minimized.

Contributor

josharian commented May 9, 2015

CL 8918. It is a straight port of Dmitry's original CL.

@minux

This comment has been minimized.

Member

minux commented May 9, 2015

@dvyukov

This comment has been minimized.

Member

dvyukov commented May 10, 2015

I've done some testing using time go build -a -p 4 std.
Before:

real    0m11.527s
real    0m11.478s
real    0m11.353s
real    0m11.544s
real    0m11.586s

After:

real    0m10.443s
real    0m10.496s
real    0m10.487s
real    0m10.348s
real    0m10.451s

There is clear 10% win.
I've also modified cmd/go to print number of executing actions every 500ms to confirm the win.
Before:

EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 2
EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 2

After:

EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 1
EXECUTING 1
EXECUTING 4

There is clear difference towards end of build -- old code leads to lower parallelism because previous actions were scheduled badly.

Also, the win highly depends on work graph structure. It is possible to come up with dependency structure which will be handled arbitrary inefficiently by the old code. So some users can see higher speedup than we see on std lib. Moreover, currently scheduling depends on package names (or how does it order them now?), which is just weird. You can increase/reduce build speed by renaming packages.

Also, better scheduling will have more impact when we make actions finer-grained. In particular split cgo compilation and maybe asm compilation.

So I am sure that this is a positive change and we should do it. The additional complexity is minimal and very localized.

@josharian

This comment has been minimized.

Contributor

josharian commented May 11, 2015

There is clear 10% win.

I don't know why I can't replicate that speed-up. Maybe you should take the CL back over, with my apologies. It doesn't make sense for me to continue to work on it when I cannot observe its impact (and thus can't tell whether my changes hurt or help).

better scheduling will have more impact when we make actions finer-grained.

Agreed. It is very coarse right now. And having thought about it more, simply saving the execution time of past identical actions might not be a bad long-term strategy for weight estimation.

In particular split cgo compilation and maybe asm compilation.

asm compilation is very fast. At least in the stdlib, there's very little cgo, so I don't think this will impact all.bash much, but it could make a big difference for heavy cgo users.

@dvyukov

This comment has been minimized.

Member

dvyukov commented May 12, 2015

@josharian have you tried running exactly "time go build -a -p 4 std"?

What does the following debug output shows on your machine?

diff --git a/src/cmd/go/build.go b/src/cmd/go/build.go
index fda126b..ceca4c5 100644
--- a/src/cmd/go/build.go
+++ b/src/cmd/go/build.go
@@ -18,6 +18,7 @@ import (
        "os"
        "os/exec"
        "path"
+       "sync/atomic"
        "path/filepath"
        "regexp"
        "runtime"
@@ -1139,6 +1140,12 @@ func (b *builder) do(root *action) {
        if buildN {
                par = 1
        }
+       var executing uint32
+       go func() {
+               for range time.NewTicker(500*time.Millisecond).C {
+                       fmt.Printf("EXECUTING %v\n", executing)
+               }
+       }()
        for i := 0; i < par; i++ {
                wg.Add(1)
                go func() {
@@ -1154,7 +1161,9 @@ func (b *builder) do(root *action) {
                                        b.exec.Lock()
                                        a := b.ready.pop()
                                        b.exec.Unlock()
+                                       atomic.AddUint32(&executing, 1)
                                        handle(a)
+                                       atomic.AddUint32(&executing, ^uint32(0))
                                case <-interrupted:
                                        setExitStatus(1)
                                        return

josharian added a commit to josharian/go that referenced this issue May 15, 2015

cmd/go: use critical path scheduling
This is a port of part of CL 13235046, by dvyukov.

Use the critical path method to schedule work.

On my laptop (8 logical cores), this CL reduces
the wall time for some but not all large commands:

go build -a std
old=19.0s	new=13.2s	delta=-30.7%	n=50	p=2e-119

go test -short std
old=65.3s	new=54.4s	delta=-16.8%	n=25	p=1e-45

all.bash
old=318.4s	new=317.9s	delta=-0.16%	n=10	p=0.73

This initial implementation assigns equal weight
to all actions. This is a significant oversimplification.
Making actions more granular and assigning more
accurate weights is future work.
However, this should be enough to improve the CPU
utilization of the compile-all builder.

Note that this causes noticeable stalls in output
when building or testing many packages, since
the scheduling usually does not correspond
to the planned output order. This should improve
as scheduling improves.

Updates golang#8893.

Change-Id: I8fba875a9a56aed132bd6b37b607e374f336728d

josharian added a commit to josharian/go that referenced this issue May 15, 2015

cmd/go: use critical path scheduling
This is a port of part of CL 13235046, by dvyukov.

Use the critical path method to schedule work.

On my laptop (8 logical cores), this CL reduces
the wall time for some but not all large commands:

go build -a std
old=19.0s	new=13.2s	delta=-30.7%	n=50	p=2e-119

go test -short std
old=65.3s	new=54.4s	delta=-16.8%	n=25	p=1e-45

all.bash
old=318.4s	new=317.9s	delta=-0.16%	n=10	p=0.73

This initial implementation assigns equal weight
to all actions. This is a significant oversimplification.
Making actions more granular and assigning more
accurate weights is future work.
However, this should be enough to improve the CPU
utilization of the compile-all builder.

Note that this causes noticeable stalls in output
when building or testing many packages, since
the scheduling usually does not correspond
to the planned output order. This should improve
as scheduling improves.

Updates golang#8893.

Change-Id: I8fba875a9a56aed132bd6b37b607e374f336728d
@rsc

This comment has been minimized.

Contributor

rsc commented Jun 29, 2015

Too late for Go 1.5.

@rsc rsc modified the milestones: Go1.6Early, Go1.5 Jun 29, 2015

@josharian josharian self-assigned this Jun 29, 2015

@josharian

This comment has been minimized.

Contributor

josharian commented Aug 25, 2015

I'm spending what time I have on SSA. Hopefully someone else wants to pick this back up.

@josharian josharian removed their assignment Aug 25, 2015

@josharian

This comment has been minimized.

Contributor

josharian commented Sep 16, 2015

My WIP CL is 8918.

@ALTree

This comment has been minimized.

Member

ALTree commented Oct 9, 2015

I fooled around with josharian's CL during a slow evening, I got nowhere but I'll report my findings, just to add another datapoint. This is on a 4-core intel machine.

First of all I applied dvyukov patch (the one that prints the number of currently executing jobs every 500ms), and run go build -a -p 4 std .

Before:

EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 3
EXECUTING 1
EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 1

after:

EXECUTING 1
EXECUTING 1
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 4
EXECUTING 1
EXECUTING 4

the parallelism of the last part of the job is indeed higher.

Unfortunately, as minux and josharian, I wasn't able to observe any significant speedup. For time go build -a -p 4 std there's no statistical difference between tip and the CL. For make.bash I observed a modest 1.5% speedup. I didn't try with all.bash.

@rsc

This comment has been minimized.

Contributor

rsc commented Nov 4, 2015

Thanks for the null results. They're helpful. Deprioritizing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment