Optimize optimizer eliminate stage #3732

Merged
merged 26 commits into kripken:incoming from aidanhs:aphs-optimize-optimizer-4 on Sep 18, 2015

Conversation

3 participants
@aidanhs
Contributor

aidanhs commented Sep 2, 2015

These are just speed and style changes - there should be no output difference at all.

Using the .bc file from #3718, before:

# EMCC_DEBUG=1 emcc -O3 -o out.js slow/libemscripten_worker.bc 2>&1 | grep 'emcc step\|total time'
DEBUG    root: emcc step "parse arguments and setup" took 0.01 seconds
DEBUG    root: emcc step "bitcodeize inputs" took 0.00 seconds
DEBUG    root: emcc step "process inputs" took 0.00 seconds
DEBUG    root: emcc step "calculate system libraries" took 0.43 seconds
DEBUG    root: emcc step "link" took 0.63 seconds
DEBUG    root: emcc step "post-link" took 10.19 seconds
DEBUG    root: emcc step "emscript (llvm=>js)" took 8.62 seconds
DEBUG    root: emcc step "source transforms" took 0.34 seconds
DEBUG    root: emcc step "js opts" took 88.50 seconds
DEBUG    root: emcc step "final emitting" took 0.04 seconds
DEBUG    root: total time: 108.75 seconds

After:

# EMCC_DEBUG=1 emcc -O3 -o out.js slow/libemscripten_worker.bc 2>&1 | grep 'emcc step\|total time'
DEBUG    root: emcc step "parse arguments and setup" took 0.01 seconds
DEBUG    root: emcc step "bitcodeize inputs" took 0.00 seconds
DEBUG    root: emcc step "process inputs" took 0.00 seconds
DEBUG    root: emcc step "calculate system libraries" took 0.43 seconds
DEBUG    root: emcc step "link" took 0.63 seconds
DEBUG    root: emcc step "post-link" took 10.35 seconds
DEBUG    root: emcc step "emscript (llvm=>js)" took 9.11 seconds
DEBUG    root: emcc step "source transforms" took 0.78 seconds
DEBUG    root: emcc step "js opts" took 67.66 seconds
DEBUG    root: emcc step "final emitting" took 0.05 seconds
DEBUG    root: total time: 89.01 seconds

So a 20-25% optimisation speedup and a 15-20% overall speedup. The optimizer tests pass; I've not yet run any others, but will do so now.

I've got a few other optimisations, but want to get these and the switch optimisation fix (#3733) done first.

tools/optimizer/optimizer.cpp
+#ifdef PROFILING
+ tstmtscan += clock() - start;
+ start = clock();
+#endif


@waywardmonkeys

waywardmonkeys Sep 2, 2015

Contributor

Indentation looks off here and on the block above.


@aidanhs

Contributor

aidanhs commented Sep 2, 2015

All core asm2 tests succeed.

@kripken

Owner

kripken commented Sep 2, 2015

Reading the code, this all looks good (and I like the refactoring), and those numbers sound great. However, I tested on another codebase (a large game engine) and I actually see a slowdown: 7.37 seconds to 8.27 seconds.

I also tested on Poppler from the test suite, with EMCC_CORES=1 to make it slow enough to be noticeable, and I get 2.54 to 2.60 seconds, so a tiny slowdown, probably just noise.

This makes me think we should test this more; it might be that it helps the original testcase that motivated this but harms others.

@aidanhs

Contributor

aidanhs commented Sep 3, 2015

Is the game engine available for me to compile? Possibly as a .bc? Big slow testcases are good. I'll take a look at poppler as well and see if I can massage it into something usable.

I'm kinda surprised it's slower, but on reflection I can see that the changes I made could punish lots of small functions - I wonder if that's what's happening. There is a fairly invasive change I was considering, but I delayed it to see the feedback on these changes first. Let me work on that and see if it can deliver any improvement.

@kripken

Owner

kripken commented Sep 3, 2015

The engine isn't open, even as a .bc. But I'll test locally tomorrow with each patch separately, to at least try to narrow down the issue.

I'm hoping though that this isn't specific to unity. Maybe poppler or bananabread (I can provide a bc for that) with -s LINKABLE=1 (to disable dead code elimination and leave a lot more stuff) might also show something.

@aidanhs

Contributor

aidanhs commented Sep 3, 2015

I suspect you'll find it's the final commit, 31e528a. I really should've gone through the benchmarks beforehand; I'll try to do so tomorrow.

@kripken

Owner

kripken commented Sep 3, 2015

Yes, it does look like the last commit ("Avoid iterating over irrelevant tracked variables") is where the unity slowdown comes from.

@aidanhs

Contributor

aidanhs commented Sep 4, 2015

Can you try the latest branch against unity?

@aidanhs

Contributor

aidanhs commented Sep 4, 2015

Sorry, that posted before I was ready - I've got rid of the problematic commit, and there's still a speedup on the lambdas test run, from 42s to 18s. There's even a very tiny sqlite speedup (~3%).

@aidanhs

Contributor

aidanhs commented Sep 7, 2015

I've added a couple of new commits after managing to grab the unity .bc file from a unity install and I'm mostly happy with where this is up to.

I did benchmarking of the optimizer run directly against an arbitrary unity js chunk (14M), the sqlite (14M) and the lambdas testcase (22M). Each measurement is for all O3 optimisation passes, best of 5 runs, user time as reported by the time bash builtin:

Case      Before branch    After branch
Unity     27.431s          26.247s
Sqlite    1m20.520s        1m18.458s
Lambdas   1m19.287s        52.067s

It's worth taking a look at "Avoid array bounds checking for traverse* functions" - it's theoretically less safe because Ref[] does bounds checks, but I feel that these functions are small enough to be verifiable.


@kripken

Owner

kripken commented Sep 8, 2015

I still see a slowdown here, on unity. When I set EMCC_CORES=1 to reduce noise, I get 50.88 and 50.99 seconds for the js opts stage in EMCC_DEBUG=1 before this patch, and 51.12 and 51.20 seconds after this patch. It's around a 0.5% slowdown, so not large, but still in the wrong direction.

How are you measuring your times?

@aidanhs

Contributor

aidanhs commented Sep 8, 2015

That's very interesting. My measurement was taken by pausing the build process during js opts, taking one of the chunks out of there and running the optimizer directly against it (optimizer $file asm eliminate simplifyExpressions simplifyIfs registerizeHarder minifyLocals asmLastOpts last minifyWhitespace).

When I do a full compilation using EMCC_CORES=1 EMCC_DEBUG=1 emcc -s LINKABLE=1 -o /tmp/out.js -O2 slow/UnityNative.bc (LINKABLE=1 to avoid dropping functions) I still get the speedups - from 243.82 and 243.83 seconds to 232.21 and 233.13 seconds (the same order of magnitude as in my table).

Assuming you're on Linux, what distro and gcc version are you using? I've got Ubuntu 14.04 and the default compiler, gcc 4.8.4 - I wonder if a different compiler version is doing something differently.

@aidanhs

Contributor

aidanhs commented Sep 8, 2015

Ignore me, I've done something very stupid, let me look again...

@aidanhs

Contributor

aidanhs commented Sep 9, 2015

Summary of changes:

  • add an explicit reminder (for people like me) to enable release mode when cmaking the optimizer
  • use a name->vec lookup for dependencies rather than name->set as vecs are faster
  • new commit to clean up ref construction

Now that I've fixed my issues and tweaked the dep lookup, the unity js opts phase drops from about 54.1s to 53.3s on a single core. other.test_js_opts passes.


@kripken

Owner

kripken commented Sep 9, 2015

I still see qualitatively different numbers here: no change or a tiny slowdown on unity, a tiny slowdown on poppler, and no change on bullet.

For poppler, for example, I run EMCC_CORES=1 EMCC_DEBUG=1 ./emcc -O2 poppler.bc -s LINKABLE=1 and I get 7.62, 7.60, 7.61 before and 7.67, 7.68, 7.67 after. A very small slowdown, but consistent.

Maybe it's just a matter of a different cpu / different local compiler and stdlib that accounts for us seeing different things. But if so, then it suggests the changes here are going to vary across machines.

If we can't figure this out, it might make sense to still merge this or parts of this, as the refactorings are nice. But perhaps we should only take the refactorings for clarity, and not for speed?

@aidanhs

Contributor

aidanhs commented Sep 9, 2015

No no, it's not good enough to slow things down! I will rummage around for some low-hanging fruit.

@kripken

Owner

kripken commented Sep 9, 2015

Another thought, there might be some bigger things to optimize in that pass. If I recall correctly, around the comment "Look for statements, including while-switch pattern" we have the main loop, which traverses the whole tree and calls scan. And scan does its own traversal on the entire tree shown to it. I don't remember all the details, but I wonder if we couldn't stop those traversals in some cases, and avoid a big chunk of O(N^2) work. But, this is tricky stuff.


aidanhs added some commits Sep 10, 2015

Minor parser cleanups
 - !isSpace in Frag() is not necessary because that case will fall
   through and abort anyway
 - any successful switch case will assign str, no need to recheck
 - there is only one callsite of parseAfterIdent, move the skipspace
   inside the call (and therefore avoid calling skipspace twice
   before expressions)
 - no need for two checks for 0 bytes
Don't perform unnecessary handling of VAR nodes
We know that there will be no assignment part of var nodes because
normalisation has either removed them or aborted because the var node
needs fixing.
@aidanhs

Contributor

aidanhs commented Sep 17, 2015

I've pushed a bunch of changes. I opted not to go for the harder idea, and instead pursued the low-hanging fruit. There are some speedups from:

  • Reserving space in vectors on creation, rather than relying on pushing to reallocate ("Reserve space on vector creation", "Also reserve space when creation nodes during runtime").
  • Allocate vectors in an arena, like Values ("Allocate vecs in arena").
  • Improving memory locality in traverse* by sticking values in locals ("Caching values in locals is faster than hitting memory", "Cache array size, use pointer arithmetic to bypass some indirection").

Unfortunately, the final item makes the traverse* functions really ugly.

However, the rest of the commits are generally just refactoring. You may wish to examine the following two with suspicion:

  • "Don't pass optional blocks as arguments" - removes optional parameters, in exchange for needing a function to transfer block contents from one node to another. I prefer this way because it makes it very clear that the functions are always going to create a block.
  • "Don't perform unnecessary handling of VAR nodes" ⚠️ - the commit message provides justification, but was it deliberate that processing would continue in release mode (i.e. when the assert was removed)? I figured that requiring valid asm was a small price to pay for the code simplification.

The rest of the refactorings should be easily verifiable.

I see a ~10% speedup in the optimisation of unity (54.5s to 48.5s with EMCC_CORES=1 EMCC_DEBUG=1 emcc -O2 UnityNative.bc -s LINKABLE=1), sqlite (6.45s to 5.87s with asm2.test_sqlite) and poppler (4.97s to 4.48s with asm2.test_poppler).

I've run all asm2 tests (+other.test_{js,native}_optimizer) with assertions enabled on the optimizer and they seem fine (anything that fails also fails on incoming for me).


@kripken

Owner

kripken commented Sep 18, 2015

Confirmed, I see an 8% speedup. Nice! Reviewing now.

@kripken

kripken Sep 18, 2015

why does arr_index have curly braces?

@kripken

Owner

kripken commented Sep 18, 2015

Please add any changes after this point as follow-up commits, so I don't need to re-read the ones I am going through now.

@kripken

kripken Sep 18, 2015

Please rename this to allocArray

@kripken

Owner

kripken commented Sep 18, 2015

That's it for my comments. Excellent work!

@aidanhs

Contributor

aidanhs commented Sep 18, 2015

Made review changes.

@kripken kripken merged commit d2d975f into kripken:incoming Sep 18, 2015

@kripken

Owner

kripken commented Sep 18, 2015

Merged, thanks!

@aidanhs aidanhs deleted the aidanhs:aphs-optimize-optimizer-4 branch Sep 18, 2015
