
Optimize intrinsic functions with pre-processor-fu #1720

Open
magnumripper opened this issue Aug 29, 2015 · 8 comments

@magnumripper (Owner) commented Aug 29, 2015

I'm pretty sure we can drop most branches in the SIMD functions using some clever rewrites and macros. The OpenCL functions already do that. Granted, that's not entirely a fair comparison, since in OpenCL everything is always inlined anyway.

@magnumripper (Owner) commented Aug 29, 2015

Example: the actual MD4 function would be SIMD_MD4(w, a, b, c, d). Everything else would be preprocessor stuff; at compile time most of it would disappear and there would be NO branches. I'm pretty sure the boost would be noticeable.

@magnumripper (Owner) commented Aug 29, 2015

One thing that complicates things is that macros can't contain macros. So we can't put pseudo-intrinsics in macros...

@jfoug (Collaborator) commented Aug 29, 2015

Debugging is also very hard with macro usage like this. I am seeing this in parts of dyna. You want to be DAMN sure things are correct before macro-izing them, and you want a fallback that allows debugging.

@magnumripper (Owner) commented Aug 31, 2015

BTW this is very interesting:

Using our CPU intrinsics format, and AVX2 (all this is on well):

```
$ ../run/john -test -form=wpapsk
Benchmarking: wpapsk, WPA/WPA2 PSK [PBKDF2-SHA1 256/256 AVX2 8x]... (8xOMP) DONE
Raw:    13116 c/s real, 1649 c/s virtual
```

Same hardware, but using our OpenCL code (and an AVX2-aware CPU driver):

```
$ ../run/john -test -form=wpapsk-opencl -dev=0
Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: wpapsk-opencl, WPA/WPA2 PSK [PBKDF2-SHA1 OpenCL]... DONE
Raw:    13540 c/s real, 1699 c/s virtual
```

Above, the device asked for scalar code so we served it scalar code; it was then auto-vectorized and actually ended up faster than our intrinsics.

Here's forcing 8x vector source code:

```
$ ../run/john -test -form=wpapsk-opencl -dev=0 -force-vector=8
Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: wpapsk-opencl, WPA/WPA2 PSK [PBKDF2-SHA1 OpenCL 8x]... DONE
Raw:    13320 c/s real, 1668 c/s virtual
```

Slightly slower than auto-vectorized in this case, but still faster than our CPU format.

We don't get results this good from OpenCL for all formats yet, but some day we may. Scalar OpenCL is very easy to write, much easier than writing plain CPU code with intrinsics.

@magnumripper (Owner) commented Sep 5, 2015

> One thing that complicates things is that macros can't contain macros. So we can't put pseudo-intrinsics in macros...

This is not correct. They can, but you'll need to be careful.

@jfoug (Collaborator) commented Sep 5, 2015

How do you define a macro in a macro? This shortfall was why I did dynamic-big-hash.c the way I did (using an external script to do my macro expansions).

@magnumripper (Owner) commented Sep 5, 2015

You cannot define a macro in a macro, but you can call a macro from a macro, in as many levels as you want.

http://stackoverflow.com/questions/7972785/can-a-c-macro-definition-refer-to-other-macros

@magnumripper (Owner) commented Sep 7, 2015

A little experiment is now in the cpp-intrinsics topic branch. Specifically 51f3fe6 for now.

Currently only MD4 & MD5 are done, and not completely. What is done: there are now (behind the curtain) two different functions, one for a single/first block and another for "reload". Also, the "flat to interleaved" conversion has been moved to a separate function, but that too is hidden by preprocessor macros (and mostly optimized away, since the SSEi_flags are constants).

The boost is 5-10%, depending on format. Still, I'm not quite sure we want to walk this path.
