Try branch-free version of Eytzinger binary search #42

chshersh · 2018-07-11T08:55:59Z

No description provided.

qnikst · 2018-10-14T08:23:52Z

The issue turned out to be much more complex than I anticipated. I've tried to use the approach described in the paper, but there are many problems along the way:

This approach relies on the compiler generating a cmov instruction so that the code is branch-free. But we don't have direct control over the instructions emitted by the code generator, and at least GHC 8.4.3 does not generate them.
It's not possible to help the compiler by having let !unlifted = .... in go unlifted, as unlifted values can't be in a let binding.
The solution relies on the __builtin_ffs intrinsic to decode the result, but implementing __builtin_ffs either in terms of existing primops, as Wordsize - (ctz# (complement value)), or via unsafe FFI leads to a larger overhead.
The solution relies on efficient preloading of the array so that it will be in the cache. But we keep 2 arrays and fetch values from both, so quite likely we pollute our caches and lose the performance benefit.
There are no efficient functions for comparing 128-bit values (though we can have such in C).
I've tried to use prefetch alone, but either I did that wrong or the performance hit from introducing ST (so that we have a state token) outweighs the performance benefits.
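
For readers who have not seen the technique, the branch-free search that these points refer to can be sketched in C roughly as follows. This is only a minimal sketch under stated assumptions: a 0-based Eytzinger layout over int keys, GCC or Clang builtins, and hypothetical names (eytzinger_build, eytzinger_lower_bound). It is not the code from the branch mentioned in this thread, prefetching is left out, and whether the ternary actually compiles to cmov depends on the compiler and flags.

```c
#include <stddef.h>

/* Hypothetical helper: copy sorted[0..n) into eyt[0..n) in Eytzinger (BFS)
   order via an in-order walk of the implicit tree rooted at k.
   Call as eytzinger_build(sorted, eyt, n, 0, 0). */
static size_t eytzinger_build(const int *sorted, int *eyt, size_t n,
                              size_t pos, size_t k) {
    if (k < n) {
        pos = eytzinger_build(sorted, eyt, n, pos, 2 * k + 1); /* left subtree */
        eyt[k] = sorted[pos++];                                /* this node */
        pos = eytzinger_build(sorted, eyt, n, pos, 2 * k + 2); /* right subtree */
    }
    return pos;
}

/* Returns the index in eyt of the smallest element >= x, or n if none. */
static unsigned eytzinger_lower_bound(const int *eyt, unsigned n, int x) {
    unsigned i = 0;
    while (i < n) {
        /* The only data-dependent choice; ideally emitted as cmov. */
        i = (x <= eyt[i]) ? 2 * i + 1 : 2 * i + 2;
    }
    /* i + 1 encodes the path: 0 bit = went left, 1 bit = went right.
       Dropping the trailing 1 bits plus one 0 bit recovers the last node
       at which we went left, which holds the answer. */
    unsigned k = i + 1;
    unsigned j = k >> __builtin_ffs((int)~k);
    return j == 0 ? n : j - 1;
}
```

The final shift is the step that needs ffs: without it, the loop would have to remember the last "went left" node, which adds another data-dependent update per iteration.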

What can be done:

1. Introduce a new primop for ffs into GHC.
2. Instead of keeping 2 unboxed vectors, keep only one and calculate the offset on our own.
3. Write the lookup procedure in cmm directly.
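
The single-vector idea can be illustrated with a rough C sketch. It assumes, which this thread does not spell out, that the two vectors hold the high and low 64-bit halves of a 128-bit key; the interleaved layout and the names key_le and lower_bound_128 are hypothetical. Key i lives at keys[2*i] and keys[2*i + 1], so both halves normally share a cache line when the array is 16-byte aligned, and the 128-bit comparison is written without short-circuit operators so the compiler has a chance to keep it branch-free.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: key i = (keys[2*i], keys[2*i + 1]) = (high, low).
   Returns 1 when (hi, lo) <= key i, else 0; uses | and & instead of
   || and && to avoid introducing branches in the source. */
static inline int key_le(const uint64_t *keys, size_t i,
                         uint64_t hi, uint64_t lo) {
    uint64_t khi = keys[2 * i];
    uint64_t klo = keys[2 * i + 1];
    return (hi < khi) | ((hi == khi) & (lo <= klo));
}

/* Eytzinger lower bound over n interleaved 128-bit keys:
   index of the smallest key >= (hi, lo), or n if there is none. */
static size_t lower_bound_128(const uint64_t *keys, size_t n,
                              uint64_t hi, uint64_t lo) {
    size_t i = 0;
    while (i < n) {
        /* left child (2i + 1) when target <= key, right child (2i + 2) otherwise */
        i = 2 * i + 2 - (size_t)key_le(keys, i, hi, lo);
    }
    size_t k = i + 1;
    size_t j = k >> __builtin_ffsll((long long)~k);
    return j == 0 ? n : j - 1;
}
```

Whether this beats two separate vectors would have to be measured; the sketch only shows the offset calculation.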

I'm not sure if any of that (except for 1.) is worth the trouble. For anyone interested, I have code at https://github.com/qnikst/typerep-map/tree/qn/nobranch, but it leads to a 1.5x slowdown. Maybe someone else will be luckier than me.

chshersh · 2018-10-14T09:48:10Z

@qnikst Thanks for this extremely valuable comment and your investigation!

qnikst · 2018-10-14T10:36:20Z

I've pushed a C version that is called via unsafe ccall, but surprisingly enough it is slower than the native Haskell version (though faster than the "no-branch" Haskell version). It seems that using a vector of 128-bit words may be the only way to really improve speed (if needed).

qnikst · 2018-10-14T12:51:04Z

Update: I now have a version that performs the same as or a bit better than the Haskell version (the benchmarks are not precise enough to confirm that), so we can try to add support for it and have a flag to disable the C code (may be relevant for ghcjs).

chshersh added enhancement question bench labels Jul 11, 2018

chshersh added good first issue help wanted labels Aug 20, 2018

vrom911 added the Hacktoberfest https://hacktoberfest.digitalocean.com/ label Sep 30, 2018

vrom911 added this to To do in #2: Hacktoberfest (October, 2018) via automation Sep 30, 2018

chshersh removed the Hacktoberfest https://hacktoberfest.digitalocean.com/ label Nov 7, 2018

chshersh removed this from To do in #2: Hacktoberfest (October, 2018) Nov 7, 2018
