
Speed up registerizeHarder variable conflict checking #3371

Merged
merged 7 commits into kripken:incoming from aidanhs:aphs-optimize-optimizer-3 on Apr 21, 2015

Conversation

@aidanhs
Contributor

aidanhs commented Apr 20, 2015

Results in a 50% speedup in sqlite compilation!

This PR is based on the observation that vector access is much faster than set/map access. By mapping names to numbers and using a vector we can speed up all conflict setting (first 3 commits). I then mitigate the perf hit to iteration by filtering the vector outside of the loop which would iterate over it (final commit).
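
As a rough illustration of the approach, here's a hypothetical sketch (made-up names, not the actual registerizeHarder code) of interning names to dense indices so that conflict marking becomes plain vector indexing:

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical sketch, not the optimizer code: each variable name is
    // interned to a dense index once, so the hot conflict-marking path is
    // vector indexing rather than string hashing into a set/map.
    struct ConflictMatrix {
      std::unordered_map<std::string, std::size_t> nameToNum; // paid once per name
      std::vector<std::vector<bool>> conflicts;               // conflicts[a][b]

      explicit ConflictMatrix(const std::vector<std::string>& names)
        : conflicts(names.size(), std::vector<bool>(names.size(), false)) {
        for (std::size_t i = 0; i < names.size(); i++) nameToNum[names[i]] = i;
      }

      void markConflict(std::size_t a, std::size_t b) { // no hashing on the hot path
        conflicts[a][b] = conflicts[b][a] = true;
      }
      bool hasConflict(std::size_t a, std::size_t b) const { return conflicts[a][b]; }
    };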

All asm3 tests pass, as does test_other. I should note that I don't have spidermonkey installed, so some parts of the tests (e.g. actual sqlite execution, though I did check that with nodejs) aren't being run. This obviously also affects the before/after for sqlite below.

Before (after #3368):

    ~/em/emsdk/emscripten/incoming/tests $ python runner.py asm3.test_sqlite 2>&1 | grep 'test in'
    Ran 1 test in 61.199s
    ~/em/emsdk/emscripten/incoming/tests $ EMCC_DEBUG=1 emcc -O3 --llvm-lto 1 -s EMULATE_FUNCTION_POINTER_CASTS=1 -o out.js python/python.bc 2>&1 | grep 'js opts'
    DEBUG    root: emcc step "js opts" took 5.23 seconds
    ~/em/emsdk/emscripten/incoming/tests $ cd ~/em/BananaBread/cube2/src/web
    ~/em/BananaBread/cube2/src/web $ (make && time make client server) 2>&1 | grep real
    real    0m28.940s

After this PR:

    ~/em/emsdk/emscripten/incoming/tests $ python runner.py asm3.test_sqlite 2>&1 | grep 'test in'
    Ran 1 test in 29.944s
    ~/em/emsdk/emscripten/incoming/tests $ EMCC_DEBUG=1 emcc -O3 --llvm-lto 1 -s EMULATE_FUNCTION_POINTER_CASTS=1 -o out.js python/python.bc 2>&1 | grep 'js opts'
    DEBUG    root: emcc step "js opts" took 3.17 seconds
    ~/em/emsdk/emscripten/incoming/tests $ cd ~/em/BananaBread/cube2/src/web
    ~/em/BananaBread/cube2/src/web $ (make && time make client server) 2>&1 | grep real
    real    0m18.117s

Owner

kripken commented Apr 21, 2015

I haven't read the patches yet, but the js opts part on sqlite goes from 35.84 seconds to... 5.73 (!). That's much more than a 50% speedup! But it's so big I was kind of worried that something had changed in the output for the worse, so I ran the benchmark suite on this... and it looks like code speed and size remain the same. Awesome!

Starting to read the patches now.

Contributor

aidanhs commented Apr 21, 2015

Yes, my 'flagship measurement' at the top is based on the overall compilation time, just as an interesting demonstration of how much js optimisation was previously dominating compilation time. Bananabread is similarly measuring overall compilation time. Only python drills down to js opts time. If I start putting together more optimisations maybe I'll try and present both the individual and overall speedup for clarity. Though maybe I need to focus elsewhere now :)

I too was very surprised, which is why I ran all of test_other and asm3 and tweaked the sqlite test to run under nodejs on my PC just to check it actually ran. I'll have to add the benchmark suite to my list for validating my changes.

kripken Apr 21, 2015

It's actually not obvious that clear leaves the reserved space there. Did you read somewhere that it would?

But it might, and there's no reason not to give it the opportunity. Although I wonder whether calling reserve after the clear would be a more reliable way to accomplish this. Maybe worth measuring? Can do it as an optional followup.

Contributor

aidanhs replied Apr 21, 2015

You're right, this is non-obvious. It's definitely a case of "I read it somewhere", but I can't remember where and I doubt it was authoritative.

However, _Hashtable on my linux PC (which, if I'm grepping /usr/include correctly, is what unordered_map uses behind the scenes) appears to implement clear() as follows:

    template<typename _Key, typename _Value,
             typename _Alloc, typename _ExtractKey, typename _Equal,
             typename _H1, typename _H2, typename _Hash, typename _RehashPolicy,
             typename _Traits>
      void
      _Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal,
                 _H1, _H2, _Hash, _RehashPolicy, _Traits>::
      clear() noexcept
      {
        _M_deallocate_nodes(_M_begin());
        __builtin_memset(_M_buckets, 0, _M_bucket_count * sizeof(__bucket_type));
        _M_element_count = 0;
        _M_before_begin()._M_nxt = nullptr;
      }

Note no call to _M_deallocate_buckets.

So a measurement by me won't show up anything interesting. If you think it's worth doing I'll add a 'code review comments' commit containing this (and any other changes from your other comments).
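
For what it's worth, whether a given standard library keeps the buckets across clear() is easy to check directly; a standalone snippet (not part of this PR) along these lines:

    #include <cstdio>
    #include <unordered_map>

    int main() {
      std::unordered_map<int, int> m;
      m.reserve(1000);
      std::printf("buckets before clear: %zu\n", m.bucket_count());
      m.clear();
      std::printf("buckets after clear:  %zu\n", m.bucket_count());
      // On libstdc++ both counts match, consistent with the
      // _Hashtable::clear() above; other implementations are free to shrink.
      return 0;
    }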

Contributor

aidanhs replied Apr 21, 2015

(on the subject of vector reservation - http://stackoverflow.com/a/6882816/2352259)

Contributor

aidanhs commented Apr 21, 2015

FYI, I've just pushed a tiny followup commit to improve code clarity. It has no performance impact I can measure (possibly a very slight improvement).

kripken Apr 21, 2015

Why do you think that the reserve property is maintained through the assign?

I don't see a reason not to do this, the cost is low, and some implementations might do what you want. But I don't know if they would.

Might be worth measuring it specifically, but it could change by implementation for all we know ;)

Contributor

aidanhs replied Apr 21, 2015

Well, reserve requests that the vector be able to hold at least N elements, and assign populates the vector with N elements. I can't imagine why the vector would need to change size for the second part!
The assign is necessary so I can use it as a contiguous array at will. The reserve probably isn't necessary - I'd expect assign to do a reserve plus one or more memcpys (or similar) internally.
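
For reference, the guarantee being relied on here can be demonstrated in isolation (a standalone snippet, not code from this PR):

    #include <cassert>
    #include <vector>

    int main() {
      std::vector<int> v;
      v.reserve(100);              // capacity() >= 100 from here on
      v.assign(100, 0);            // size() never exceeds capacity(),
                                   // so this cannot force a reallocation
      assert(v.capacity() >= 100); // capacity only shrinks via
                                   // shrink_to_fit()/swap()
      int* raw = v.data();         // contiguous storage, usable as an array
      (void)raw;
      return 0;
    }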

Anyway, this bit of code is gone now :)

kripken Apr 21, 2015

Taking this line as an example: we used to map names to JuncVars. Now we map names to nums, then look up the nums in a vector. So we still have a map lookup, but also added a vector lookup.

Is this beneficial because we have a single name-to-num map? Or is it a downside that is offset by other upsides?
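
Schematically, the change being described (JuncVar here is just a stub standing in for the real struct):

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct JuncVar { /* stub for illustration */ };

    // Before: one hash lookup straight to the JuncVar.
    std::unordered_map<std::string, JuncVar> byName;

    // After: a hash lookup to get the num, plus a (cheap) vector index.
    std::unordered_map<std::string, std::size_t> nameToNum;
    std::vector<JuncVar> juncVars;

    JuncVar& lookup(const std::string& name) {
      return juncVars[nameToNum[name]]; // two steps where there was one
    }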

Contributor

aidanhs replied Apr 21, 2015

This is a downside that's offset by other upsides.

This section of code is pretty quick. The main time sinks in my experiments were previously live variable analysis (addressed in PRs 1 and 2) and conflict analysis (addressed in PRs 2 and 3). What I call the 'block processing' and 'reconstruction' stages (i.e. the stages affected by the extra vector lookup you mention) are proportionally larger after my three PRs, but are still (even when summed) generally swamped by live variable analysis time.

My fourth and final PR in this set will probably be to clean up my high-level clock-based profiling and submit that for potential future use.
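
As a sketch, such clock-based stage profiling might look something like this (hypothetical, not the actual instrumentation from these PRs):

    #include <cstdio>
    #include <ctime>

    // Prints how long a stage took when the timer goes out of scope.
    struct StageTimer {
      const char* name;
      std::clock_t start;
      explicit StageTimer(const char* n) : name(n), start(std::clock()) {}
      ~StageTimer() {
        double secs = double(std::clock() - start) / CLOCKS_PER_SEC;
        std::fprintf(stderr, "stage %s took %.3fs\n", name, secs);
      }
    };

    int main() {
      { StageTimer t("liveness");  /* ... live variable analysis ... */ }
      { StageTimer t("conflicts"); /* ... conflict analysis ... */ }
      return 0;
    }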

kripken Apr 21, 2015

Hmm, it's not a map - why call it one?

kripken Apr 21, 2015

Oh, I see the next commit addresses that.

Contributor

aidanhs replied Apr 21, 2015

FWIW, it was being used as a map (lookup table) rather than a vector (access via iteration).
I just needed a different name to convert this lookup table into the thing I'd access via iteration.

Owner

kripken commented Apr 21, 2015

Looks good. I had just a few minor questions.

Owner

kripken commented Apr 21, 2015

Ok, that addressed all my questions, thanks. Merging.

@kripken kripken merged commit 3bc4e10 into kripken:incoming Apr 21, 2015

@aidanhs aidanhs deleted the aidanhs:aphs-optimize-optimizer-3 branch Apr 22, 2015
