Unify SHA-1's H() a.k.a F3() a.k.a SHA-2's Maj() implementations #1727

Closed
5 tasks done
magnumripper opened this issue Sep 2, 2015 · 8 comments

@magnumripper
Member

http://www.openwall.com/lists/john-dev/2015/09/02/5

  • Change all OpenCL definitions using bitselect to 2-op version.
  • Change all OpenCL non-bitselect fallbacks, and CUDA versions, to 4-op version.
  • Change all Ch() for CUDA and (non-bitselect) OpenCL to 3-op.
  • Same for SIMD intrinsics.
  • Have a look at the scalar plain C stuff while at it.
@magnumripper magnumripper self-assigned this Sep 2, 2015
@magnumripper magnumripper added this to the 1.8.0-jumbo-2 milestone Sep 2, 2015
@jfoug
Collaborator

jfoug commented Sep 2, 2015

That was a good catch. It is why I cringe at people writing all the inline stuff just to gain a percent or two, thus HIDING the things that can easily give bigger gains (such as an improved algorithm or other simplification tricks). I know we have done many things recently that unified code (the pbkdf2_*.h stuff is a great example).

@magnumripper
Member Author

Note to self: bitselect(x, y, z) in XOP is _mm_cmov_si128(y, x, z) (mind the order). z is inverted.

magnumripper added a commit that referenced this issue Sep 2, 2015
SHA-2's Ch() implementations, using better optimized ones.
OpenCL and CUDA formats. See #1727.
magnumripper added a commit that referenced this issue Sep 2, 2015
SHA-2's Ch() implementations, using better optimized ones.
Intrinsics formats. See #1727.
@magnumripper
Member Author

All done.

Re-assigning to @zzlei, please test/benchmark on NEON and Altivec if/when you can. I will test for regressions in OpenCL and Intel CPU.

@magnumripper
Member Author

Added e8703bb and 957a538 too, after realizing that MD4/MD5 F() is also the same as Ch().

@magnumripper
Member Author

Oh, and MD4 G() is the same as SHA-2 Maj(): 7071b4a. Also, Solar found a new way of doing MD5 I() using one fewer op: 382a961.

@lei-april
Contributor

I just tried it on Power. The only access I have to Power is through the GCC farm, and it fluctuates badly (too many users, perhaps).

Here are just 3 consecutive runs:

[zlei@gcc2-power8 src]$ ../run/john --test --format=pbkdf2-hmac-sha1
Will run 152 OpenMP threads
Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 128/128 AltiVec 4x]... (152xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw:    29257 c/s real, 2307 c/s virtual

[zlei@gcc2-power8 src]$ ../run/john --test --format=pbkdf2-hmac-sha1
Will run 152 OpenMP threads
Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 128/128 AltiVec 4x]... (152xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw:    133032 c/s real, 1706 c/s virtual

[zlei@gcc2-power8 src]$ ../run/john --test --format=pbkdf2-hmac-sha1
Will run 152 OpenMP threads
Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 128/128 AltiVec 4x]... (152xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw:    93388 c/s real, 1609 c/s virtual

I don't think I can get any useful benchmark results on this machine.

@magnumripper
Member Author

At least we know it's working 😄

What if you run a lot fewer threads, like 4 or 8?

@lei-april
Contributor

> What if you run a lot fewer threads, like 4 or 8?

Yes, that works! I'll post the result on john-dev.
