Optimize intrinsic functions with pre-processor-fu #1720
BTW this is very interesting:
Using our CPU intrinsics format, and AVX2 (all of this is on "well"):
Same hardware, but using our OpenCL code (and an AVX2-aware CPU driver):
Above, the device asked for scalar code so we served it that; it was then auto-vectorized and actually ended up faster than our intrinsics.
Here's what we get when forcing 8x vectorized source code:
Slightly slower than auto-vectorized in this case, but still faster than our CPU format.
We don't get results this good for OpenCL with all formats yet, but some day we may. Scalar OpenCL is very easy to write, much easier than writing plain CPU code using intrinsics.
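To illustrate what "scalar OpenCL" means here, a minimal sketch (toy mixing function and names, not actual JtR code): each work-item handles one candidate in plain scalar C, and an AVX2-aware CPU driver is free to pack work-items into SIMD lanes on its own.

```c
/* Toy scalar kernel: one candidate per work-item, no explicit vectors.
 * A CPU OpenCL driver can implicitly vectorize this across work-items. */
inline uint toy_round(uint a, uint b)
{
    /* add + rotate mixing, MD4/MD5 style; purely illustrative */
    return (a + b) ^ ((a << 7) | (a >> 25));
}

__kernel void crypt_all(__global const uint *in, __global uint *out,
                        const uint rounds)
{
    uint gid = get_global_id(0);
    uint h = in[gid];

    for (uint i = 0; i < rounds; i++)
        h = toy_round(h, i);

    out[gid] = h;    /* one scalar result per work-item */
}
```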
You can't define a macro within a macro, but you can call a macro from a macro, nested as many levels deep as you want.
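For example (illustrative macros, not the actual JtR ones):

```c
#include <stdio.h>

/* A macro body can't contain another #define, but it can invoke
 * other macros, which expand through arbitrarily many levels. */
#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
#define STEP(a, b)   ROTL32((a) + (b), 7)     /* calls ROTL32 */
#define ROUND(a, b)  STEP(STEP(a, b), b)      /* calls STEP, two levels deep */

int main(void)
{
    printf("%u\n", ROUND(1U, 2U));  /* expands fully at preprocessing time */
    return 0;
}
```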
Currently only MD4 & MD5 are done, and not completely. What is done is that there are now (behind the curtain) two different functions: one for a single/first block and another for "reload". Also, the "flat to interleaved" conversion has been moved to a separate function, but that too is hidden by PP macros (and mostly optimized away, since the SSEi_flags are compile-time constants).
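Roughly this shape, as a sketch (the signature is simplified and the worker/flag names are placeholders, not the real ones): because the flags are literal constants at each call site, the compiler folds the dead branches away and only the selected worker call survives.

```c
#include <stdint.h>

/* Placeholder flags modeled on the SSEi_* constants */
#define SSEi_FLAT_IN 0x01
#define SSEi_RELOAD  0x02

static void md5_first_block(void *in, void *out) { (void)in; (void)out; /* ... */ }
static void md5_reload(void *in, void *out)      { (void)in; (void)out; /* ... */ }
static void md5_flat_to_interleaved(void *buf)   { (void)buf;           /* ... */ }

/* One public entry point; with constant flags, dead branches vanish
 * and each call site compiles down to a single direct function call. */
#define SIMDmd5body(in, out, flags)               \
    do {                                          \
        if ((flags) & SSEi_FLAT_IN)               \
            md5_flat_to_interleaved(in);          \
        if ((flags) & SSEi_RELOAD)                \
            md5_reload((in), (out));              \
        else                                      \
            md5_first_block((in), (out));         \
    } while (0)

int main(void)
{
    uint32_t buf[64] = { 0 }, out[4];

    SIMDmd5body(buf, out, SSEi_FLAT_IN);           /* first/single block */
    SIMDmd5body(buf, out, SSEi_RELOAD);            /* subsequent blocks */
    return 0;
}
```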
The boost is 5-10% depending on format. Still, I'm not quite sure we want to walk this path.