GPU mask mode, general discussion #1037
I saw 6-7G c/s on a GTX 980 with current code. My laptop GT 650M does 3-400M c/s.
FWIW I saw this once while running TS
...but I can't reproduce it. This was with Apple's CPU device and TS sets LWS=8 and GWS=64 (although this device will force LWS down to 1).
I'm getting some issues with CPU devices too. Apparently GWS=512 and LWS=64 seem to be the minimum limits.
But there is no particular reason there would be such a limit, right? Maybe it's just another barrier needed somewhere, or something like that.
@Sayantan2048 I think this patch fixes the performance counters for good, please review
This code assumes …
BTW we should consider the possibility of overrun here. The result of …
Here's code that is safe for that. However, I think we'll never need that high a number for a single crypt call.
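The overrun concern above can be sketched like this (a minimal illustration with made-up names, not the actual patch): the product of two 32-bit values is computed in 32-bit arithmetic unless one operand is widened first, so it can wrap at 2^32 even though the destination counter is 64-bit.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical sketch of the overrun discussed above; function names
 * are illustrative only. */
uint64_t add_count_unsafe(uint64_t total, uint32_t keys, uint32_t cands)
{
    return total + keys * cands;           /* 32-bit product: may wrap */
}

uint64_t add_count_safe(uint64_t total, uint32_t keys, uint32_t cands)
{
    return total + (uint64_t)keys * cands; /* widened first: 64-bit product */
}
```

With `keys` and `cands` both 0x10000, the unsafe version adds 0 (the 32-bit product wraps), while the safe version adds 2^32.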
It always is - but I totally fail to see how/where that happens! So I don't dare commit this.
add32to64(&status.cands, crk_key_index * …

Are you sure it won't interfere with *pcount, as it is already updated inside crypt_all()? I mean, please check that we don't have a situation where mask_int_cand.num_int_cand is multiplied twice: once inside crypt_all() and again inside crk_salt_loop().
See line 17 of mask_ext.c. It is set to 1 even when there is no mask mode.
That I'm sure of. The thing you mention happens once per salt and updates everything but p/s. This one takes care of p/s and is not multiplied by salts. Ah, yes, it's statically initialized to 1. I will merge this now then!
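The invariant being checked above can be sketched as follows (an illustrative toy, not John the Ripper's actual code): crypt_all() is the one place that multiplies the candidate count by the internal expansion factor, and since that factor is statically 1 when mask mode is off, the multiplication is harmless in the non-mask case.

```c
#include <assert.h>

/* Toy sketch of the single-multiplication invariant discussed above.
 * num_int_cand plays the role of mask_int_cand.num_int_cand; it is 1
 * when no GPU-side mask is active. */
int crypt_all_sketch(int *pcount, int num_int_cand)
{
    /* ... kernel tests (*pcount keys) x (num_int_cand mask expansions) ... */
    *pcount *= num_int_cand;   /* apply the factor exactly once, here */
    return *pcount;
}
```

Any caller that multiplied the returned count by the factor again (as feared for crk_salt_loop()) would double-count candidates.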
Mask has some problems with -dev=cpu. Trying to debug it; it always blames /usr/lib/libamdocl64.so.
[Fixed now; works on both CPU and GPU]
Where was the problem: with mask mode, the format, or the common OpenCL code?
Is it possible that this mask (sometimes) misbehaves?
I checked every format allocation (all details). One of them was causing it.
Off the top of my head, I think a mask like …
Ok, I will try to nail the problem with this particular mask. A side note: adding this mask stuff hurts benchmark numbers for 'a regular run' a lot. But a real 'regular run' has basically the same performance it had with the old mask-less source code. I compared sha256 (new and old) and md4 and md5 (I guess it makes sense). I was planning to create two kernels, but it seems useless to a real user (no gain or loss). Does anyone disagree?
Once GPU-side mask is universally working, the self-test will benchmark it, and (I guess) show a separate figure for that speed. |
There is something wrong with this mask expansion. Result of analysis (--skip-self-tests and no autotune):
Somehow, whatever is calling set_key() behaves in two different ways, and for some reason it sometimes fails.
The above is an example; I used sha256 to get the numbers.
When the GPU-side mask is activated (currently only available in raw-md4-opencl), it is expected that set_key() is called fewer times, as some portion of the key is generated by the format. The above case looks like a bug to me.
@magnumripper commit a649032 is causing performance degradation due to register spilling on 7970 with Catalyst 14.12. Speed is now reduced from 3.9Gc/s to 1.8Gc/s for raw-md4-opencl.
That's in raw-MD4 or what? That's odd; this is an optimization we should have, and the parens merely make it more obvious to the compiler. An alternative is actually coding the optimization with a tmp variable.
BTW I will try it with 14.9 - the 14.12 is known as the worst driver version ever.
password -> 8a9d093f14f8701df17732b2bb182c74
I know that. My point is: when running in GPU mask mode.
@magnumripper I tried to build john on well, but it failed.
Regarding commit a649032, should we wait for the next driver release?
Disable native tests:
My understanding of UTF-16 is somewhat fuzzy at the moment, and I need some guidance on handling UTF-16 characters.
Good question, that could be 100,000 characters. I think it's best to keep using an "internal encoding" from the user perspective even though we are not really limited to it. So it'll be like this: A. User picks (or has as a default) internal encoding CP1234. Our current CPU-side mask mode works just like that except (D).
Assuming we go with my answer above, same applies here. We do exactly as we do today, using the internal encoding, and then convert it to a set of UTF-16 code points.
The command line is decoded either as UTF-8 or as a code page, depending on john.conf or command-line encoding settings. Even when using UTF-8 we have the notion of an internal encoding (which defaults to ISO-8859-1). So everything is quite normal C strings up to and including the format's set_key(). A UTF-16 format (e.g. NT) should be "encoding aware": it knows the string sent to set_key() is to be decoded, and converts it to a "string" of unsigned shorts. For UTF-8 this is done using a (fast and simple) function; for legacy code pages it's a LUT. For the special case of ISO-8859-1 it's actually just a cast: the character 0xA3 (a pound sign) in UTF-16 is 0x00A3. BTW nt-opencl currently can't handle anything but ISO-8859-1. All other UTF-16 formats in Jumbo can, AFAIK.
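The ISO-8859-1 "just a cast" case described above can be sketched like this (the function name is made up for illustration, not unicode.c's actual API): Latin-1 code points coincide with U+0000..U+00FF, so converting to UTF-16 code units is plain zero-extension.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Illustrative sketch (not unicode.c's real API): convert a NUL-terminated
 * ISO-8859-1 string to UTF-16 code units by zero-extending each byte.
 * E.g. 0xA3 (pound sign) becomes the code unit 0x00A3. */
size_t latin1_to_utf16(const unsigned char *src, uint16_t *dst, size_t maxlen)
{
    size_t i;
    for (i = 0; i < maxlen && src[i]; i++)
        dst[i] = (uint16_t)src[i];  /* Latin-1 == U+0000..U+00FF */
    dst[i] = 0;
    return i;  /* code units written, excluding the terminator */
}
```

This is why the ISO-8859-1 case is essentially free, while UTF-8 needs a decoder and legacy code pages need a lookup table.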
Instead of -stdout, you can run the above with netntlmv2 or ntlmv2-opencl and follow what happens and where. The latter case fully works but is not very efficient: it converts on GPU, but the transfer is a huge bottleneck. It should use GPU-side mask.
I don't quite understand your question. See unicode.c for shared generic functions. Also see iconv(1) for verifying stuff.
while it doesn't. This is a bad bug: False negatives!
@Sayantan2048 can we not get rid of this warning that is printed whenever the keyspace is exhausted?
It's confusing. If it's needed as an assertion, we should try to tweak it so it's muted in the normal no-problem situation.
with --internal-encoding. See #1037.
@Sayantan2048 I committed a first version of full Unicode support for NT-opencl in 9cecf81. This version doesn't change the underlying functions - it just decodes on GPU as needed. Cases like this one work fine (with any supported encoding):
This fails (a UTF-8 character preceding the mask place-holders, and no internal encoding):
The reason it fails is that mask mode thinks the first … As soon as you add --internal-encoding, the problem goes away:
(EDIT: b2227bb makes sure you can't run the above without an internal encoding.) Performance should be totally unaffected when not actually using UTF-8 or a codepage (it actually builds different kernels). And even when you do, performance is still pretty good. However, if we changed mask mode's …
…ding so we don't get false negatives from improper use. See #1037.
Wow, on the Titan X performance is more or less unaffected even with codepage table lookups in the inner loop. I get over 15 Gp/s using LWS: 256, GWS: 36864 and mask mode either way. On the 7970, performance drops from 7.5 Gp/s to 5 Gp/s when using internal encoding. Maybe I should have implemented this prior to CMIYC-2015... 😢 |
mode with --internal-encoding. See #1037.
…larms and Sayantan wont even comment it. See #1037.
Help me understand this and correct me where I'm wrong. Mask mode internally only supports uint8 chars, and I believe encodings requiring uint16 or uint32 are first converted to uint8 chars and then fed into mask mode. So if ö takes a uint16, I suppose it is split into two uint8 chars; mask mode treats these chars as separate and assigns them separate locations to iterate over. However, NT must treat them as one char and put them at the same location, assuming we're using uint16. I see we're using PUTSHORT macros, which stuff uint8 keys typecast as shorts into nt_buffer. So the two uint8 chars of ö end up at different locations within nt_buffer: when they should have shared one uint16 char, they are using two!! This is where I'm getting confused!! Or better, please explain the whole chain of conversion going on, starting with the initial mask.
Here's our current chain (using an internal encoding) for a Euro sign:
This works perfectly fine, except you need to find an internal encoding that holds any characters you will need (and that's not always possible: for example, you might have problems finding a codepage that can hold a string containing a Russian character and a Euro sign). Unicode-aware mask mode would work like this:
But now we'd get the opposite problem for non-Microsoft formats actually using UTF-8 as target encoding: a "€" will expand to …
To get around this I guess you'd need to introduce another piece of data to the mask struct: "for position n our place-holder will eat m bytes" or something like that. So that one will depend on target encoding: in the case of NT it will always be 1 (as in one uint16), while for raw-MD5 it will be 1 for ASCII characters and 2, 3 or 4 bytes for non-ASCII...
Hm no, that will not do. Consider the mask … Maybe we should just stick to the internal encoding. It really solves most problems, but has its limitations.
In UTF-8, € is represented by three bytes (0xE2 0x82 0xAC).
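To make the variable byte counts above concrete, here is a generic UTF-8 encoder sketch (standard Unicode encoding rules, not JtR's code): a single mask position can expand to one through four output bytes depending on the code point, which is exactly the SIMD-unfriendly branching discussed below. U+20AC (€) encodes to the three bytes 0xE2 0x82 0xAC.

```c
#include <stdint.h>
#include <assert.h>

/* Encode one Unicode code point as UTF-8; returns the byte count (1-4).
 * Standard UTF-8 rules per RFC 3629 (no validation of surrogates here). */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                          /* ASCII: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                         /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                       /* 3 bytes, e.g. U+20AC */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));  /* 4 bytes */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For NT the target is UTF-16, so a BMP place-holder is always one 16-bit unit; for raw-MD5 with UTF-8 plaintexts, the per-position length varies as above.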
@Sayantan2048 you have implemented your own auto-tune in seven GPU-mask formats, with no shared code at all. That's the opposite direction from what I and @claudioandre have struggled with for a long time. I tried changing NT-opencl to use our shared auto-tune in 2772b8f, and I see no downsides (actually it seems to work better). I'm planning to do the same with the six others. I can understand if mscash2 needs its own auto-tune (for multi-device support), but for other formats it's really the wrong way to go. Once we use shared code, it'll be a much simpler task to implement GPU-mask auto-tune.
Thank you, it should and will work for raw hashes; however, for salted hashes, do they set a valid salt before benchmark? I think you'll also run into problems with descrypt and lm-opencl, where kernels are rebuilt as needed during auto-tune.
My initial thought is that this complicates the GPU-side mask!! For raw-md5, we'll need to do 1, 2, 3 or 4 putchars instead of one if we treat every placeholder as UTF-32. The worst part is, it won't be SIMD friendly.
Not exactly, but we'll need too many scalar branches, which ideally shouldn't cause performance degradation, but when put inside loops they tend to perform very poorly.
Yeah I think we should stick to "internal encoding" for now. Simple is beautiful.
The shared auto-tune uses salts and ciphertexts from the test vectors, so it should be fine. I won't touch DES or LM. I'm looking at fixing the following:
(I think nsldap and raw-sha1 will be merged in the process, like we did for CPU formats) |
I did mscash and it seemed to work at first but something's not right with salts (as you said). Just what the heck are you doing in there? Are there good reasons for deviating from the specified interfaces? The shared auto-tune works like a champ for almost 50 formats, salted or not. Reverted it while investigating. |
OK, I think I get the picture. I probably wasn't very far from a working version. I'll continue later. |
Works now. I will polish it a bit more. @Sayantan2048 perhaps the shared code should include the ability to build a "fake db" out of test vectors? Or would that be too format-specific? BTW you really should move some of your duplicated code for hash tables and such to shared code, or at least a shared source file (a C header).
I'm closing this generic issue now. We'll open specific issues when needed. I played a lot with GPU mask formats lately, including with various encodings, and they work damn well and damn fast 👍
For future experimental/incomplete commits (eg. when taking a stab at some UTF-16 format), please use the topic branch again so we keep the amount of likely-broken code in bleeding-jumbo as low as possible. But let's merge things as soon as they seem stable so it doesn't diverge too much. Also, we often don't find out about problems until we actually merge (#1036 is a good example).
Let's use this issue for general discussion. For specific problems we should create separate issues.