Shake ~40% slower than Ninja to build Chrome #25

Closed
ndmitchell opened this Issue Jul 1, 2013 · 20 comments

Comments

Projects
None yet
3 participants
@ndmitchell
Owner

ndmitchell commented Jul 1, 2013

The shake process needed ~100% cpu, ~50 threads, ~1.2GB of ram throughout most of the build. Ninja needs around 2.5%cpu, 1 thread, and tops out at 50MB ram.

1,138,046,195,456 bytes allocated in the heap
1,862,700,251,920 bytes copied during GC
     569,340,160 bytes maximum residency (791 sample(s))
      28,014,248 bytes maximum slop
            1251 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     2112419 colls,     0 par   1085.59s   1117.12s     0.0005s    0.0139s
  Gen  1       791 colls,     0 par   285.72s   288.67s     0.3649s    0.8631s

  TASKS: 52 (1 bound, 51 peak workers (51 total), using -N18)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  365.22s  (1133.66s elapsed)
  GC      time  1371.30s  (1405.79s elapsed)
  EXIT    time    0.01s  (  0.03s elapsed)
  Total   time  1736.53s  (2539.48s elapsed)

  Alloc rate    3,116,097,718 bytes per MUT second

  Productivity  21.0% of total user, 14.4% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

real    42m19.694s
user    378m31.914s
sys 26m21.621s
@ndmitchell

This comment has been minimized.

Show comment
Hide comment
@ndmitchell

ndmitchell Jul 1, 2013

Owner

The final database was 27Mb. First guess is a space leak somewhere.

Owner

ndmitchell commented Jul 1, 2013

The final database was 27Mb. First guess is a space leak somewhere.

@mathstuf

This comment has been minimized.

Show comment
Hide comment
@mathstuf

mathstuf Jul 11, 2013

Another large project to test would be VXL. I'm seeing Ninja taking ~60% the time of shake.

I'm using an internal branch, but timings (trying my best to keep ccache out of the equation, disabling the OUL and BRL contrib code; defaults otherwise):

% time ninja

[0-0->4395/4395@14.6] Linking CXX executable contrib/prip/vpyr/tests/vpyr_test2
ninja 2109.61s user 215.86s system 772% cpu 5:00.84 total
% # (14.6 targets/second according to my NINJA_STATUS)
% time shake -j10

shake -j10 2286.25s user 206.05s system 515% cpu 8:03.24 total

Another large project to test would be VXL. I'm seeing Ninja taking ~60% the time of shake.

I'm using an internal branch, but timings (trying my best to keep ccache out of the equation, disabling the OUL and BRL contrib code; defaults otherwise):

% time ninja

[0-0->4395/4395@14.6] Linking CXX executable contrib/prip/vpyr/tests/vpyr_test2
ninja 2109.61s user 215.86s system 772% cpu 5:00.84 total
% # (14.6 targets/second according to my NINJA_STATUS)
% time shake -j10

shake -j10 2286.25s user 206.05s system 515% cpu 8:03.24 total

@mathstuf

This comment has been minimized.

Show comment
Hide comment
@mathstuf

mathstuf Jul 11, 2013

Shake commit 3ce16e0; Ninja 1.3.4.

Shake commit 3ce16e0; Ninja 1.3.4.

@ndmitchell

This comment has been minimized.

Show comment
Hide comment
@ndmitchell

ndmitchell Jul 15, 2013

Owner

@mathstuf Thanks for your information, much appreciated. There's definitely something going on here, I'll let you know once I have tracked it down.

Owner

ndmitchell commented Jul 15, 2013

@mathstuf Thanks for your information, much appreciated. There's definitely something going on here, I'll let you know once I have tracked it down.

@ndmitchell

This comment has been minimized.

Show comment
Hide comment
@ndmitchell

ndmitchell Aug 1, 2013

Owner

I think I've finally figured out what is going on, and it seems to be a GHC runtime bug - essentially if you have enough threads with the wrong dependencies between them (related to #28) then the GHC runtime hits a worse than constant-time case in figuring out which thread should run next. I am designing a fix to switch to Shake scheduling its threads itself, which should fix the problem.

Owner

ndmitchell commented Aug 1, 2013

I think I've finally figured out what is going on, and it seems to be a GHC runtime bug - essentially if you have enough threads with the wrong dependencies between them (related to #28) then the GHC runtime hits a worse than constant-time case in figuring out which thread should run next. I am designing a fix to switch to Shake scheduling its threads itself, which should fix the problem.

@nh2

This comment has been minimized.

Show comment
Hide comment
@nh2

nh2 Aug 29, 2013

Contributor

Maybe its also worth checking out how GHC 7.8 performs on this?

Contributor

nh2 commented Aug 29, 2013

Maybe its also worth checking out how GHC 7.8 performs on this?

@ndmitchell

This comment has been minimized.

Show comment
Hide comment
@ndmitchell

ndmitchell Oct 31, 2013

Owner

I switched to a version of Shake that doesn't execute the commands and ran it on a prebuilt version of LLVM. That gives us almost all the Shake overhead, with no actual commands, and runs in 1 min. Profiling shows that parsing the dependency Makefiles takes up ~60%, and normalising filepaths takes up ~30%. All of this overhead is trivial to eliminate (just switch to the right string type), so I'll do that and then we can remeasure.

Owner

ndmitchell commented Oct 31, 2013

I switched to a version of Shake that doesn't execute the commands and ran it on a prebuilt version of LLVM. That gives us almost all the Shake overhead, with no actual commands, and runs in 1 min. Profiling shows that parsing the dependency Makefiles takes up ~60%, and normalising filepaths takes up ~30%. All of this overhead is trivial to eliminate (just switch to the right string type), so I'll do that and then we can remeasure.

@mathstuf

This comment has been minimized.

Show comment
Hide comment
@mathstuf

mathstuf Oct 31, 2013

On Thu, Oct 31, 2013 at 07:30:05 -0700, Neil Mitchell wrote:

I switched to a version of Shake that doesn't execute the commands and
ran it on a prebuilt version of LLVM. That gives us almost all the
Shake overhead, with no actual commands, and runs in 1 min. Profiling
shows that parsing the dependency Makefiles takes up ~60%, and
normalising filepaths takes up ~30%. All of this overhead is trivial
to eliminate (just switch to the right string type), so I'll do that
and then we can remeasure.

FYI, Ninja now dumps out a parsed structure for dependency files to
avoid the cost of reading them. It does the parsing while waiting for
the compile to complete to minimize the wait time. See the docs[1]. It
might be worthwhile to get shake to do something similar.

[1]http://martine.github.io/ninja/manual.html#_deps

On Thu, Oct 31, 2013 at 07:30:05 -0700, Neil Mitchell wrote:

I switched to a version of Shake that doesn't execute the commands and
ran it on a prebuilt version of LLVM. That gives us almost all the
Shake overhead, with no actual commands, and runs in 1 min. Profiling
shows that parsing the dependency Makefiles takes up ~60%, and
normalising filepaths takes up ~30%. All of this overhead is trivial
to eliminate (just switch to the right string type), so I'll do that
and then we can remeasure.

FYI, Ninja now dumps out a parsed structure for dependency files to
avoid the cost of reading them. It does the parsing while waiting for
the compile to complete to minimize the wait time. See the docs[1]. It
might be worthwhile to get shake to do something similar.

[1]http://martine.github.io/ninja/manual.html#_deps

@ndmitchell

This comment has been minimized.

Show comment
Hide comment
@ndmitchell

ndmitchell Oct 31, 2013

Owner

@mathstuf Shake already supports the deps stuff. The deps stuff of Ninja is for faster rebuilds, not the initial parsing, which is where Shake has the bottleneck. With Ninja if you don't have deps on, it reparses all Makefiles each time, and if you do have deps it stores them in the database. With Shake it always puts the parsed Makefile in the database, and if you have deps off it does a few extra file modification checks (equivalent to the Ninja behaviour, but faster in the case of deps not being set). Thanks thinking of it.

Owner

ndmitchell commented Oct 31, 2013

@mathstuf Shake already supports the deps stuff. The deps stuff of Ninja is for faster rebuilds, not the initial parsing, which is where Shake has the bottleneck. With Ninja if you don't have deps on, it reparses all Makefiles each time, and if you do have deps it stores them in the database. With Shake it always puts the parsed Makefile in the database, and if you have deps off it does a few extra file modification checks (equivalent to the Ninja behaviour, but faster in the case of deps not being set). Thanks thinking of it.

@ndmitchell

This comment has been minimized.

Show comment
Hide comment
@ndmitchell

ndmitchell Oct 31, 2013

Owner

As of now, the overhead from Shake should be 4 times less than it was before. Profiling shows half the remaining time goes in a simple to optimise function (file path normalisation) so I'll do that one tomorrow and see if any obvious hotspots are left after that. I wouldn't particularly recommend Ninja uses trying it out today, better to wait for tomorrow at least.

Owner

ndmitchell commented Oct 31, 2013

As of now, the overhead from Shake should be 4 times less than it was before. Profiling shows half the remaining time goes in a simple to optimise function (file path normalisation) so I'll do that one tomorrow and see if any obvious hotspots are left after that. I wouldn't particularly recommend Ninja uses trying it out today, better to wait for tomorrow at least.

@ndmitchell

This comment has been minimized.

Show comment
Hide comment
@ndmitchell

ndmitchell Nov 5, 2013

Owner

Time to build LLVM, not running any commands, has dropped from 58s to 5s. Normalise still accounts for about 25% of the time, so I'll try and improve that further.

Owner

ndmitchell commented Nov 5, 2013

Time to build LLVM, not running any commands, has dropped from 58s to 5s. Normalise still accounts for about 25% of the time, so I'll try and improve that further.

@ndmitchell

This comment has been minimized.

Show comment
Hide comment
@ndmitchell

ndmitchell Nov 7, 2013

Owner

I found that normalising the dependencies of makefiles is pointless, which takes normalise down to < 2% of the remaining time. With that, file modification time checking dominates even when building, which is how it should be. My system is having issues today with slow file modtime checks, but I think I'm seeing times of ~4s to rebuild everything (excluding running commands). @mathstuf - if you have any time to rebenchmark Shake, I'd love to hear how it compares to Ninja for you. Let me know if it's easier to do so once I release, and I'll make a quick release.

Time has gone down, so I ran with memory profiling, and got:

shake

This shows that 20Mb is consumed by strings (fair enough) and 80Mb by stack threads (way too much). The stuff in #28 should solve that.

Owner

ndmitchell commented Nov 7, 2013

I found that normalising the dependencies of makefiles is pointless, which takes normalise down to < 2% of the remaining time. With that, file modification time checking dominates even when building, which is how it should be. My system is having issues today with slow file modtime checks, but I think I'm seeing times of ~4s to rebuild everything (excluding running commands). @mathstuf - if you have any time to rebenchmark Shake, I'd love to hear how it compares to Ninja for you. Let me know if it's easier to do so once I release, and I'll make a quick release.

Time has gone down, so I ran with memory profiling, and got:

shake

This shows that 20Mb is consumed by strings (fair enough) and 80Mb by stack threads (way too much). The stuff in #28 should solve that.

@mathstuf

This comment has been minimized.

Show comment
Hide comment
@mathstuf

mathstuf Nov 7, 2013

Experiments with VXL here with ninja 1.3.4 and shake master:

ninja -> cold FS cache, empty ccache -> 2626.45s user 184.41s system 775% cpu 6:02.58 total
ninja -> hot FS cache, empty ccache -> 2629.27s user 185.50s system 771% cpu 6:05.00 total
shake -j10 -> hot FS cache, empty ccache -> 2441.32s user 168.99s system 762% cpu 5:42.20 total
ninja -> hot FS cache, empty ccache -> 2628.95s user 185.73s system 778% cpu 6:01.46 total

Things are looking good :) . With ninja master (43b3296f633658b303fa3194b093cccce77df013):

ninja -> hot FS cache, empty ccache -> 2629.15s user 185.52s system 777% cpu 6:02.03 total
ninja -> hot FS cache, empty ccache -> 2629.64s user 185.24s system 774% cpu 6:03.26 total

mathstuf commented Nov 7, 2013

Experiments with VXL here with ninja 1.3.4 and shake master:

ninja -> cold FS cache, empty ccache -> 2626.45s user 184.41s system 775% cpu 6:02.58 total
ninja -> hot FS cache, empty ccache -> 2629.27s user 185.50s system 771% cpu 6:05.00 total
shake -j10 -> hot FS cache, empty ccache -> 2441.32s user 168.99s system 762% cpu 5:42.20 total
ninja -> hot FS cache, empty ccache -> 2628.95s user 185.73s system 778% cpu 6:01.46 total

Things are looking good :) . With ninja master (43b3296f633658b303fa3194b093cccce77df013):

ninja -> hot FS cache, empty ccache -> 2629.15s user 185.52s system 777% cpu 6:02.03 total
ninja -> hot FS cache, empty ccache -> 2629.64s user 185.24s system 774% cpu 6:03.26 total
@ndmitchell

This comment has been minimized.

Show comment
Hide comment
@ndmitchell

ndmitchell Nov 7, 2013

Owner

So to confirm those numbers mean what I think they do - Shake is ~20s faster than Ninja and building VXL? I was aiming for a draw, so being faster would be impressively surprising.

Owner

ndmitchell commented Nov 7, 2013

So to confirm those numbers mean what I think they do - Shake is ~20s faster than Ninja and building VXL? I was aiming for a draw, so being faster would be impressively surprising.

@mathstuf

This comment has been minimized.

Show comment
Hide comment
@mathstuf

mathstuf Nov 7, 2013

On Thu, Nov 07, 2013 at 14:04:10 -0800, Neil Mitchell wrote:

So to confirm those numbers mean what I think they do - Shake is ~20s
faster than Ninja and building VXL? I was aiming for a draw, so being
faster would be impressively surprising.

That's what I saw. I gave ninja quite a few runs to beat shake's time. I
can try with larger projects later tonight (ParaView most likely).

mathstuf commented Nov 7, 2013

On Thu, Nov 07, 2013 at 14:04:10 -0800, Neil Mitchell wrote:

So to confirm those numbers mean what I think they do - Shake is ~20s
faster than Ninja and building VXL? I was aiming for a draw, so being
faster would be impressively surprising.

That's what I saw. I gave ninja quite a few runs to beat shake's time. I
can try with larger projects later tonight (ParaView most likely).

@ndmitchell

This comment has been minimized.

Show comment
Hide comment
@ndmitchell

ndmitchell Nov 8, 2013

Owner

You may also want to try VXL with Shake a few times. Shake builds things in a random order, to try and avoid disk/cpu bottlenecks, so the time may well vary more than Ninja.

Owner

ndmitchell commented Nov 8, 2013

You may also want to try VXL with Shake a few times. Shake builds things in a random order, to try and avoid disk/cpu bottlenecks, so the time may well vary more than Ninja.

@mathstuf

This comment has been minimized.

Show comment
Hide comment
@mathstuf

mathstuf Nov 8, 2013

On Fri, Nov 08, 2013 at 03:54:49 -0800, Neil Mitchell wrote:

You may also want to try VXL with Shake a few times. Shake builds
things in a random order, to try and avoid disk/cpu bottlenecks, so
the time may well vary more than Ninja.

Some more runs:

2625.47s user 182.77s system 732% cpu 6:23.18 total
2635.80s user 183.56s system 762% cpu 6:09.86 total
2637.48s user 183.41s system 766% cpu 6:08.22 total

Then my laptop started overheating:

3647.00s user 250.08s system 757% cpu 8:34.74 total

mathstuf commented Nov 8, 2013

On Fri, Nov 08, 2013 at 03:54:49 -0800, Neil Mitchell wrote:

You may also want to try VXL with Shake a few times. Shake builds
things in a random order, to try and avoid disk/cpu bottlenecks, so
the time may well vary more than Ninja.

Some more runs:

2625.47s user 182.77s system 732% cpu 6:23.18 total
2635.80s user 183.56s system 762% cpu 6:09.86 total
2637.48s user 183.41s system 766% cpu 6:08.22 total

Then my laptop started overheating:

3647.00s user 250.08s system 757% cpu 8:34.74 total
@ndmitchell

This comment has been minimized.

Show comment
Hide comment
@ndmitchell

ndmitchell Nov 13, 2013

Owner

Thanks for the further numbers. Given the degredation in Shake speed, I wonder if your laptop was already slightly warm by the time you were running Shake the second time? I guess running shake, then ninja, then shake, then ninja is probably the only way to get comparable times out (that's what I did for the LLVM timings).

Owner

ndmitchell commented Nov 13, 2013

Thanks for the further numbers. Given the degredation in Shake speed, I wonder if your laptop was already slightly warm by the time you were running Shake the second time? I guess running shake, then ninja, then shake, then ninja is probably the only way to get comparable times out (that's what I did for the LLVM timings).

@mathstuf

This comment has been minimized.

Show comment
Hide comment
@mathstuf

mathstuf Nov 13, 2013

On Wed, Nov 13, 2013 at 11:47:59 -0800, Neil Mitchell wrote:

Thanks for the further numbers. Given the degredation in Shake speed,
I wonder if your laptop was already slightly warm by the time you were
running Shake the second time? I guess running shake, then ninja, then
shake, then ninja is probably the only way to get comparable times out
(that's what I did for the LLVM timings).

Possibly. I also found out that my power cord has been intermittently
working based on the angle of the cable end, so some power throttling
may be in there as well. I'll run more tests on a desktop when I get the
chance.

On Wed, Nov 13, 2013 at 11:47:59 -0800, Neil Mitchell wrote:

Thanks for the further numbers. Given the degredation in Shake speed,
I wonder if your laptop was already slightly warm by the time you were
running Shake the second time? I guess running shake, then ninja, then
shake, then ninja is probably the only way to get comparable times out
(that's what I did for the LLVM timings).

Possibly. I also found out that my power cord has been intermittently
working based on the angle of the cable end, so some power throttling
may be in there as well. I'll run more tests on a desktop when I get the
chance.

@ndmitchell

This comment has been minimized.

Show comment
Hide comment
@ndmitchell

ndmitchell May 8, 2014

Owner

I've got enough numbers from enough users to know that Shake is no longer massively slower. To move forward I'll likely need a new test case, with an actively interested user, and a particular benchmark in mind. I'll leave that to a new bug report.

Owner

ndmitchell commented May 8, 2014

I've got enough numbers from enough users to know that Shake is no longer massively slower. To move forward I'll likely need a new test case, with an actively interested user, and a particular benchmark in mind. I'll leave that to a new bug report.

@ndmitchell ndmitchell closed this May 8, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment