Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try branch-free version of Eytzinger binary search #42

Open
chshersh opened this issue Jul 11, 2018 · 4 comments
Open

Try branch-free version of Eytzinger binary search #42

chshersh opened this issue Jul 11, 2018 · 4 comments

Comments

@chshersh
Copy link
Contributor

No description provided.

@qnikst
Copy link
Contributor

qnikst commented Oct 14, 2018

The issue appeared to be much more complex then I anticipated. I've tried to use the approach described in the paper. But there are many problem on the road:

  1. This approach relies on a compiler to generate cmov instruction so the code is branch-free. But we don't have direct control of the instruction that are emitted from code generator. And at least GHC 8.4.3 do not generate them.
  2. It's not possible to help compiler by having let !unlifted = .... in go unlifted as unlifted values can't be in let binding.
  3. The solution relies on the __builtin_ffs intrinsic to decode result, but implementing __builtin_ffs either in terms of existing ones Wordsize - (ctz# (complement value)) or using unsafe ffi leads to a larger overhead.
  4. The solution relies on efficient preloading of the array so it will be in the cache. But we keep 2 arrays and fetch values from them, so quite likely we efficiently corrupt our caches, breaking performance.
  5. There are no efficient functions for comparing 128bit values (though we can have such in C).
  6. I've tried to use prefetch alone, but either I did that wrong, or performance hit for introducing ST (so we have state token) outterweight performance benefits.

What can be done:

  1. Introduce new primop for ffs into GHC.
  2. Instead of keeping 2 unboxed vectors - keep only one and calculate offset on our own.
  3. Write lookup procedure in cmm directly.

I'm not sure if any of that (except for 1.) worth the trouble. For anyone interested in I have code in https://github.com/qnikst/typerep-map/tree/qn/nobranch but it leads to 1.5x slowdown, maybe anyone would be more lucky than me..

@chshersh
Copy link
Contributor Author

@qnikst Thanks for this extremely valuable comment and your investigation!

@qnikst
Copy link
Contributor

qnikst commented Oct 14, 2018

I've pushed a C version that is called via unsafe ccall, but surprisingly enough it works slower than native Haskell version (but faster than "no-branch" Haskell version). Seems that using a vector of 128bit words may be the only way to really improve speed (if needed)

@qnikst
Copy link
Contributor

qnikst commented Oct 14, 2018

Update now I have a version that works the same or a bit better then Haskell version (benchmarks are not precise enough to validate that), but so we can try to add support for that and have a flag to disable C code (may be relevant for ghcjs).

@chshersh chshersh removed the Hacktoberfest https://hacktoberfest.digitalocean.com/ label Nov 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants