Speed up registerizeHarder optimizer for huge functions #3364

Merged

2 commits merged into kripken:incoming on Apr 17, 2015

Conversation

aidanhs (Contributor) commented Apr 15, 2015

My benchmarking (with emtcl) indicates that this PR speeds up the js optimisation part of a -O3 build significantly for codebases with huge functions (and therefore especially with llvm-lto). As with my previous optimizer PR, 'caveat emptor' - I first used C++11 a couple of weeks ago. Please review carefully in case I've made a critical mistake which makes it all wrong (😩).

The first commit is the main one. My reading of the code is that (previously) a vector was used to attempt to keep track of the 'last' junction, in order to try to go from back to front in one shot. Unfortunately, using a vector as a stack isn't actually a reliable way to do this, as we push junctions back onto the list in an arbitrary order (with for (auto b : junc.inblocks)).
Using an ordered set:

  • lets us reliably choose the 'last' junction, regardless of insertion order
  • allows elimination of a data structure
  • allows us to not worry about checking before inserting

The second commit is just a minor tweak, I don't know if it gives any speedup. It was just misleading to compare against a different variable when we're actually checking for true.

I've run the first 50 tests (alphabetically) from test_core and all of test_other. They all (bar four seemingly irrelevant other tests) pass.

Python is a good example to see the speedup as it has a huge function:

$ # before
$ EMCC_DEBUG=1 emcc -O3 --llvm-lto 1 -s EMULATE_FUNCTION_POINTER_CASTS=1 -o out.js python/python.bc 2>&1 | grep 'js opts'
DEBUG    root: emcc step "js opts" took 9.12 seconds
$ # after
$ EMCC_DEBUG=1 emcc -O3 --llvm-lto 1 -s EMULATE_FUNCTION_POINTER_CASTS=1 -o out.js python/python.bc 2>&1 | grep 'js opts'
DEBUG    root: emcc step "js opts" took 5.91 seconds
aidanhs (Contributor) commented Apr 15, 2015

Looking at updating the js-optimizer at the moment.
Thinking about creating an 'orderedset' type.

kripken (Owner) commented Apr 15, 2015

js-optimizer.js is probably not worth updating; almost everyone should be using the native compiled one. The .js version is legacy at this point.

kripken (Owner) commented Apr 15, 2015

Make sure you rebuild the native optimizer to test it: emcc --clear-cache, or manually erase it from the cache dir.

Also important to test the other suite here, in particular ./tests/runner.py other.test_js_optimizer (which tests native as well, despite the name).

aidanhs (Contributor) commented Apr 16, 2015

Ah, OK - I figured we were keeping the js one around in case it wasn't possible to compile for some reason. But that doesn't make sense, since compiling LLVM is required to run Emscripten anyway.

I've already run the other.* tests (4 unrelated failures: test_crunch, test_scons, test_llvm_lit, test_outline).
The optimizer isn't in my cache, but in the name of belt and braces I've just wiped ~/.em* and run the other.test_js_optimizer test. It passed.

kripken (Owner) commented Apr 16, 2015

Hmm, the optimizer must be in the cache - or a placeholder file saying that building it failed. What does ls -al ~/.emscripten_cache show? Does EMCC_DEBUG=1 output show the native optimizer being used?

aidanhs (Contributor) commented Apr 16, 2015

The built optimizer lives in a directory alongside emscripten/incoming - I just cd into emscripten/incoming_optimizer_64bit (or something like that) and type make whenever I switch branches.

~/.emscripten points to the file there as well.

kripken (Owner) commented Apr 16, 2015

cc @rfk, who wrote this code

kripken commented on b75bb47 Apr 16, 2015

Seems like this assumes checkedLive was false before that call to analyzeJunction? Is that true?

kripken replied Apr 16, 2015

Oh wait, I get it.

kripken commented on 3f5a78f Apr 16, 2015

A set has an order, but it is based on the key. Inserting won't necessarily add an element at the "end", so popping from the end doesn't really matter - we could have grabbed the next one from the start, I think. However, I don't see a reason to change this, as it does look nice this way in terms of clarity.

kripken replied Apr 16, 2015

The only question is whether it is important for us to have it at the end of that list. @rfk?

aidanhs (Contributor) replied Apr 16, 2015

So, the whole reason I use a set is because of the ordering by key - it actually is helpful to grab from the largest key. Reasoning as follows:

  • First note that (in terms of the final result) it doesn't actually matter in which order we look at junctions. It's a bit like bubble sort, in that we keep looping until there are no more changes, i.e. we hit stability. Intuitively (though I haven't proved it) that's independent of the examination order - there's only one minimal set of variables, and if we loop until there are no more changes then we've hit that minimal set.
  • Because we propagate changes backwards, we want to start at the end - in theory, this means we begin with the junctions least likely to change once we've scanned them. If we started at the beginning, any junctions in the middle of the function with altered liveness could flow changes backwards all the way to the start. While junctions at the end can change, they're less likely to (I've empirically verified this by taking items from the beginning instead - it gives a 25% slowdown... which is still faster than the previous implementation).
  • This is where the set is valuable - we are able to always choose the last junction, because junction ids (seem to) correspond directly to the order of their locations. This contrasts with the vector-as-a-stack approach, which could end up choosing a bad order in which to reduce junctions.

One note about EXIT_JUNCTION: after looking at this again, I don't think we actually need to add it to the junctions to be considered at all (as long as we start by adding all inblocks to the block workset) - it has no live vars and is not the entry point for anything, so it will only be considered once.
(I'll do this in a separate PR)

kripken (Owner) commented Apr 16, 2015

Just to be safe, I ran benchmarks and a bunch of tests on this. I don't see any issues. I also measured on SQLite, where I see a 30% speedup on the optimizer phase, and 40% on poppler - nice! :)

So this looks great; just waiting on @rfk in case he has any comments.

rfk (Contributor) commented Apr 17, 2015

Code that's cleaner and faster? Nice! This looks good to me.

I believe that "order doesn't matter for correctness, but can hugely impact performance" is indeed a known result in this space. Junction ids are not in general ordered by flow position, but it's probably a reasonable approximation, and there's no arguing with the empirical results :-)

@kripken kripken merged commit b75bb47 into kripken:incoming Apr 17, 2015

kripken (Owner) commented Apr 17, 2015

Great! Merged. Looking forward to the followup stuff.

@aidanhs aidanhs deleted the aidanhs:aphs-optimize-optimizer branch Apr 17, 2015
