
GPU mask mode, general discussion #1037

Closed
magnumripper opened this issue Jan 28, 2015 · 68 comments

@magnumripper
Member

For future experimental/incomplete commits (eg. when taking a stab at some UTF-16 format), please use the topic branch again so we keep the amount of likely-broken code in bleeding-jumbo as low as possible. But let's merge things as soon as they seem stable so it doesn't diverge too much. Also, we often don't find out about problems until we actually merge (#1036 is a good example).

Let's use this issue for general discussion. For specific problems we should create separate issues.

@magnumripper
Member Author

I saw 6-7G c/s on a GTX 980 with current code. My laptop's GT 650M does 300-400M c/s.

@magnumripper
Member Author

FWIW I saw this once while running TS

form=raw-md4-opencl               guesses:    0 -show=   0 unk unk : Expected count(s) (1500)  [!!!FAILED!!!  return code 256]
Self test failed (cmp_one(1))

...but I can't reproduce it. This was with Apple's CPU device and TS sets LWS=8 and GWS=64 (although this device will force LWS down to 1).

@sayan1an
Contributor

I'm getting some issues with CPU devices too. Apparently GWS=512 and LWS=64 seem to be the minimum limits.

@magnumripper
Member Author

But there is no particular reason there would be such a limit, right? Maybe it's just a matter of another barrier needed somewhere, or something like that.

@magnumripper
Member Author

@Sayantan2048 I think this patch fixes the performance counters for good, please review

diff --git a/src/cracker.c b/src/cracker.c
index 45fb099..4aadc71 100644
--- a/src/cracker.c
+++ b/src/cracker.c
@@ -47,6 +47,7 @@
 #include "recovery.h"
 #include "external.h"
 #include "options.h"
+#include "mask_ext.h"
 #include "mask.h"
 #include "unicode.h"
 #include "john.h"
@@ -747,7 +748,8 @@ static int crk_salt_loop(void)
    } while ((salt = salt->next));

    if (done >= 0)
-       add32to64(&status.cands, crk_key_index);
+       add32to64(&status.cands, crk_key_index *
+                 mask_int_cand.num_int_cand);

    if (salt)
        return 1;

This code assumes mask_int_cand.num_int_cand is always 1 unless GPU generation is active. In particular, it has to be 1 even if mask mode was not used or initialized at all (or always initialized to a degree).

@magnumripper
Member Author

BTW we should consider the possibility of overflow here. The result of crk_key_index * mask_int_cand.num_int_cand must fit in 32 bits, or we'll need to use add64to64() instead.

@magnumripper
Member Author

Here's code that is safe in that regard. However, I think we'll never need such a high number for a single crypt call.

    if (done >= 0) {
        int64 totcand;
        mul32by32(&totcand, crk_key_index, mask_int_cand.num_int_cand);
        add64to64(&status.cands, &totcand);
    }       

@magnumripper
Member Author

This code assumes mask_int_cand.num_int_cand is always 1 unless GPU generation is active.

It always is - but I totally fail to see how/where that happens! So I don't dare committing this.

@sayan1an
Contributor

sayan1an commented Feb 4, 2015

add32to64(&status.cands, crk_key_index *
mask_int_cand.num_int_cand);

Are you sure it won't interfere with *pcount, as it's already updated inside crypt_all()? I mean, please check we don't have a situation where mask_int_cand.num_int_cand is multiplied twice, once inside crypt_all() and again inside crk_salt_loop().

@sayan1an
Contributor

sayan1an commented Feb 4, 2015

It always is - but I totally fail to see how/where that happens! So I don't dare committing this.

See line 17 of mask_ext.c. It is set to 1 even when there is no mask mode.

@magnumripper
Member Author

Are you sure it won't interfere with *pcount as it already updates inside crypt_all()

That I'm sure of. The thing you mention happens once per salt and updates everything but p/s. This one takes care of p/s and is not multiplied by the number of salts.

Ah, yes it's statically initialized to 1. I will merge this now then!

@magnumripper
Member Author

154f00d

@claudioandre-br
Member

Mask has some problems with -dev=cpu. Trying to debug, it always blames /usr/lib/libamdocl64.so

[..] hashes.txt -form=raw-md4-opencl --mask=passwor?l -dev=1
Device 1: AMD Phenom(tm) II X6 1075T Processor
Local worksize (LWS) 32, global worksize (GWS) 262144
Using Mask Mode with internal candidate generation,global worksize(GWS) set to 16384
Loaded 3 password hashes with no different salts (Raw-MD4-opencl [MD4 OpenCL (inefficient, development use only)])
Press 'q' or Ctrl-C to abort, almost any other key for status
Segmentation fault (core dumped) *segfault*

[Fixed now, works now on CPU and GPU]
BTW: the same happens in raw-sha256-opencl (I wasn't able to nail it); on GPU it works fine

0g 0:00:00:42 N/A 0g/s 147410Kp/s 147410Kc/s 1031MC/s GPU:57°C util:99% fan:46% aaaicbua..aaavdhua

@sayan1an
Contributor

Where was the problem, with mask mode or the format or common opencl code?

@claudioandre-br
Member

Is it possible that this mask (sometimes) misbehaves: --mask=[Pp][Aa@][Ss5][Ss5][Ww][Oo0][Rr][Dd]

  1. Is [#ASSWORD] expected? It missed the key "password":
0g 0:00:00:00  0g/s 2592p/s 2592c/s 15552C/s GPU:45°C fan:40% #ASSWORD..O###W#RD####w#RD####W#rD####w#rD####W#Rd####w#Rd####W#r
  2. OK, [password and P@55w0rD] cracked keys:
2g 0:00:00:00  8.333g/s 5400p/s 5400c/s 32400C/s GPU:45°C fan:40% password..O###W#RDp###W#RDP###w#RDp###w#RDP###W#rDp###W#rDP###w#rW

@claudioandre-br
Member

Where was the problem, with mask mode or the format or common opencl code?

I checked every format allocation (all details). One of them was causing it.

@magnumripper
Member Author

[#ASSWORD] is expected?

Off the top of my head, I think a mask like -mask=?l?l?lword will only ever show as ###word in the status lines. The GPU side of the mask is shown as #'s.

@claudioandre-br
Member

Ok, I will try to nail the problem with this particular mask.


A side note: putting this mask stuff in hurts benchmark numbers for 'a regular run' a lot. But a real 'regular run' has basically the same performance it had with the old mask-less source code. I compared sha256 (new and old) plus md4 and md5 (I guess it makes sense).

I was planning to create two kernels, but it seems useless to a real user (no gain or loss). Does anyone disagree?

@magnumripper
Member Author

Once GPU-side mask is universally working, the self-test will benchmark it, and (I guess) show a separate figure for that speed.

@claudioandre-br
Member

There is something wrong with this mask expansion. Results of analysis (--skip-self-tests and no autotune):

  • When it works (and cracks), set_key() is called 16 times, when it fails, set_key() is called 8 times

Somehow, what is calling set_key behaves in 2 different ways. And, for some reason, it fails sometimes.

rm ../run/*.pot; LWS=128 GWS=1048576 ../run/john ~/testhashes -form=raw-md4-opencl --mask=[Pp][Aa@][Ss5][Ss5][Ww][Oo0][Rr][Dd] 
Device 0: Juniper [AMD Radeon HD 6700 Series]
Local worksize (LWS) 128, global worksize (GWS) 1048576
Loaded 1 password hash (Raw-MD4-opencl [MD4 OpenCL (inefficient, development use only)])
Press 'q' or Ctrl-C to abort, almost any other key for status
Get key error! 647 647
0g 0:00:00:00  0g/s 3600p/s 3600c/s 3600C/s GPU:43°C fan:40% #ASSWORD..#ASSWORD
Session completed

Above is an example, I used sha256 to get the numbers.

@sayan1an
Contributor

When GPU side mask is activated (currently only available on raw-md4-opencl), it is expected that set_keys() is called fewer times as some portion of the key is generated by the format.

The above case seems like a bug to me:
Get key error! 647 647
Can you give me the hash you were supposed to crack in the above raw-md4-opencl example?

Update: I have found some ASAN bug in core mask mode code. Now fixed. commit a67153c

@sayan1an
Contributor

@magnumripper commit a649032 is causing performance degradation due to register spilling on 7970 with Catalyst 14.12. Speed is now reduced from 3.9Gc/s to 1.8Gc/s for raw-md4-opencl.

@magnumripper
Member Author

That's in raw-MD4 or what? That's odd; this is an optimization we should have, and the parens merely make it more obvious to the compiler. An alternative is actually coding the optimization with a temp variable.

@magnumripper
Member Author

BTW I will try it with 14.9 - the 14.12 is known as the worst driver version ever.

@claudioandre-br
Member

Can you give me the hash

password -> 8a9d093f14f8701df17732b2bb182c74

@claudioandre-br
Member

When GPU side mask is activated [..] it is expected that set_keys() is called fewer times

I know that. My point is: when running in GPU mask mode.

  • 16 set_key() calls: 16 is the correct number of set_key() calls that has to be done. The right solution is produced (all 1296 candidates generated). It always cracks the hash, producing the correct key (in this example the word password) on GPU.
  • 8 set_key() calls: '8' means a bug somewhere. It does not crack. It does not produce the right key. It misses some of the 1296 candidates and can't produce the correct key.

@sayan1an
Contributor

@magnumripper I tried to build john on well, but it failed.

/tmp/ccpEnmM3.s: Assembler messages:
/tmp/ccpEnmM3.s:434: Error: no such instruction: `vfmadd312sd .LC5(%rip),%xmm0,%xmm2'

Regarding commit a649032, should we wait for the next driver release ?

@claudioandre-br
Member

Disable native tests:

It will end up with errors like "no such instruction: `vfmadd312sd ...". 
The workaround is to add the option "--disable-native-march" to configure, 
which will stop it from ever adding that compiler option.

@sayan1an
Contributor

My understanding of UTF-16 is somewhat fuzzy at the moment and I need some guidance on handling UTF-16 characters.

  1. What is the range of values covered by the set '?s' in UTF-16 mode?
  2. How do we specify custom sets in UTF-16, like ?1 or ?2 etc.?
  3. How do we handle UTF-16 on the CPU side? One UTF-16 char as two UTF-8/ASCII chars? On the GPU, I suppose it is implementation dependent and could vary from format to format.
  4. Can you give me an example of a UTF-16 mask? I would like to tinker with it to better understand how it is handled on the CPU side so that we can make a decision regarding the GPU-side mask.
  5. What is codepage conversion? Specifically, what is the bit pattern of the string, say "bit", in UTF-16, and after codepage conversion to UTF-8 in little endian?

@magnumripper
Member Author

  1. What is the range of values covered by the set '?s' in UTF-16 mode?

Good question, that could be 100,000 characters. I think it's best to keep using an "internal encoding" from the user perspective even though we are not really limited to it. So it'll be like this:

A. User picks (or has as a default) internal encoding CP1234.
B. User picks a mask of ?s.
C. Mask mode decides ?s for CP1234 is [range] (a string encoded in CP1234).
D. GPU part of mask mode gets that same range "string" - but encoded as UTF-16 (possibly it doesn't resemble a string at all, it could be in any format we chose).

Our current CPU-side mask mode works just like that except (D).

  2. How do we specify custom sets in UTF-16, like ?1 or ?2 etc.?

Assuming we go with my answer above, same applies here. We do exactly as we do today, using the internal encoding, and then convert it to a set of UTF-16 code points.

  3. How do we handle UTF-16 on the CPU side? One UTF-16 char as two UTF-8/ASCII chars? On the GPU, I suppose it is implementation dependent and could vary from format to format.

Command line input is decoded either as UTF-8 or as a code page, depending on john.conf or command-line encoding settings. Even when using UTF-8 we have the notion of an internal encoding (which defaults to ISO-8859-1). So everything is quite normal cstrings up to and including the format's set_key().

A UTF-16 format (eg. NT) should be "encoding aware". It knows the string sent to set_key() is to be decoded and converts it to a "string" of unsigned shorts. For UTF-8 this is done using a (fast and simple) function, for legacy code pages it's a LUT. For the special case of ISO-8859-1 it's actually just a cast - the character 0xA3 (a pound sign) in UTF-16 is 0x00A3. BTW nt-opencl currently can't handle anything but ISO-8859-1. All other UTF-16 formats in Jumbo can, AFAIK.

  4. Can you give me an example of a UTF-16 mask?
$ ../run/john -stdout -inp:utf8 -int:cp1252 -1:'[€$£]' -mask:?1 -max-len=1
€
$
£
3p 0:00:00:00 100.00% (2015-04-23 08:15) 13.04p/s £

Instead of -stdout, you can run the above with netntlmv2 or ntlmv2-opencl and follow what happens and where. The latter case fully works but is not very effective, it converts on GPU but transfer is a huge bottleneck. It should use GPU mask.

  5. What is codepage conversion? Specifically, what is the bit pattern of the string, say "bit", in UTF-16 and after codepage conversion to UTF-8 in little endian?

I don't quite understand your question. See unicode.c for shared generic functions. Also see iconv(1) for verifying stuff

$ echo müller | hd
00000000  6d c3 bc 6c 6c 65 72 0a                           |m..ller.|
00000008

$ echo müller | iconv -t cp1252 | hd
00000000  6d fc 6c 6c 65 72 0a                              |m.ller.|
00000007

$ echo müller | iconv -t utf-16le | hd
00000000  6d 00 fc 00 6c 00 6c 00  65 00 72 00 0a 00        |m...l.l.e.r...|
0000000e

@magnumripper
Member Author

@Sayantan2048 can we not get rid of this warning that is printed whenever keyspace is exhausted?

Get key error! 90249 90249

It's confusing. If it's needed as an assertion we should try to tweak it so it's muted for the normal no-problem situation.

magnumripper added a commit that referenced this issue Sep 22, 2015
@magnumripper
Member Author

@Sayantan2048 I committed a first version of full Unicode support for NT-opencl in 9cecf81. This version doesn't change the underlying functions - it just decodes on GPU as needed.

Cases like this one work fine (with any supported encoding):

$ ../run/john -form:nt-opencl -dev=2 test.in -enc:utf8 -int:latin1 -mask:?l?L?l?ler
Device 2: Tahiti [AMD Radeon HD 7900 Series]
Rules/masks using ISO-8859-1
Loaded 1 password hash (nt-opencl, NT [MD4 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
möller           (u0)
1g 0:00:00:01  0.7092g/s 436283p/s 436283c/s 436283C/s #ö##er..#ÿ##er

This fails (UTF-8 character preceding mask place-holders, and no internal encoding):

$ ../run/john -form:nt-opencl -dev=6 test.in -enc:utf8 -mask:mö?l?ler
Device 6: GeForce GTX TITAN X
Loaded 1 password hash (nt-opencl, NT [MD4 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:01  0g/s 352.0p/s 352.0c/s 352.0C/s GPU:39°C util:46% fan:22% mö##er..????????
Session completed

The reason it fails is that mask mode thinks the first ?l should be inserted at pos. 3 (counting from 0), because it thinks 'ö' is two characters (it is indeed two bytes). It should be inserted at pos. 2.

As soon as you add --internal-encoding the problem goes away:

$ ../run/john -form:nt-opencl -dev=6 test.in -enc:utf8 -mask:mö?l?ler -int:cp1252
Device 6: GeForce GTX TITAN X
Rules/masks using ISO-8859-1
Loaded 1 password hash (nt-opencl, NT [MD4 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
möller           (u0)
1g 0:00:00:02  0.5000g/s 338.0p/s 338.0c/s 338.0C/s GPU:39°C util:17% fan:22% mö##er..õõõõõõõõ
Use the "--show" option to display all of the cracked passwords reliably
Session completed

(EDIT: b2227bb makes sure you can't run the above without internal encoding)

Performance should be totally unaffected when not actually using UTF-8 or codepage (it actually builds different kernels). And even when you do, performance is still pretty good.

However, if we changed mask mode's int_keys to be an array of uint16_t instead of uint8_t, we would get rid of the codepage table lookups in inner loop.

magnumripper added a commit that referenced this issue Sep 22, 2015
…ding

so we don't get false negatives from improper use. See #1037.
@magnumripper
Member Author

performance is still pretty good.

Wow, on the Titan X performance is more or less unaffected even with codepage table lookups in the inner loop. I get over 15 Gp/s using LWS: 256, GWS: 36864 and mask mode either way.

On the 7970, performance drops from 7.5 Gp/s to 5 Gp/s when using internal encoding.

Maybe I should have implemented this prior to CMIYC-2015... 😢

magnumripper added a commit that referenced this issue Sep 22, 2015
magnumripper added a commit that referenced this issue Sep 23, 2015
@sayan1an
Contributor

However, if we changed mask mode's int_keys to be an array of uint16_t instead of uint8_t, we would get rid of the codepage table lookups in inner loop.

Help me understand this and correct me where I'm wrong.

Mask mode internally only supports uint8 chars, and I believe encodings requiring uint16 or uint32 are first converted to uint8 chars and then fed into mask mode. So if ö takes a uint16, I suppose it is split into two uint8 chars; mask mode treats these as separate and assigns them separate locations to iterate over. However, NT must treat them as one char and put them at the same location, assuming we're using uint16. I see we're using PUTSHORT macros which stuff uint8 keys, typecast as shorts, into nt_buffer. So the two uint8 chars of ö end up at different locations within nt_buffer. When they should have shared one uint16 char, they are using two!! This is where I'm getting confused!!

Or better, please explain the whole chain of conversion going on starting with initial mask.

@magnumripper
Member Author

Here's our current chain (using an internal encoding) for a Euro sign:

  1. Mask and placeholders are converted from UTF-8 to some internal 8-bit codepage (eg. ISO-8859-15 which has "€" as 0xa4) in mask_init.
  2. Throughout core and mask mode, that "€" is obviously just a char like any other.
  3. When we arrive at the PUTSHORT macros, we do a table lookup of cp[0xa4] for ISO-8859-15 which yields 0x20ac. The cp[] array is defined in opencl_unicode.h.

This works perfectly fine, except you need to find an internal encoding that holds any characters you will need (and it's not always possible - for example you might have problems finding a codepage that can hold a string containing a russian character and a Euro sign).

Unicode-aware mask mode would work like this:

  1. Mask and placeholders are converted to UTF-32 in mask_init.
  2. Everything works like today except all "strings", arrays and "chars" are made of uint instead of char/uchar - throughout mask mode. There are no variable lengths - all characters are 32 bits.
  3. When we arrive at the PUTSHORT macros, for NT we'd just do (c & 0xffff) to get UCS-2. No table lookup. But for eg. raw-MD5 where we do want UTF-8 as target encoding, we'd have to convert to UTF-8 here. That is very cheap though. And actually I see a way to get rid of that too, but that's a later discussion.

But now we'd get the opposite problem for non-Microsoft formats actually using UTF-8 as target encoding: A "€" will expand to three bytes at that time while an "a" would be just one.

@magnumripper
Member Author

But now we'd get the opposite problem for non-Microsoft formats actually using UTF-8 as target encoding: A "€" will expand to two bytes at that time while an "a" would be just one.

To get around this I guess you'd need to introduce another piece of data to the mask struct: "For position n our place-holder will eat m bytes" or something like that. So that one will depend on target encoding: In case of NT it will always be 1 (as in one uint16) while for raw-MD5 it will be 1 for ASCII characters and 2, 3 or 4 bytes for non-ASCII...

@magnumripper
Member Author

Hm no, that will not do. Consider the mask a[bö]c for UTF-8 target encoding. For the first candidate, we need the initial word prepared as a#c, and then we can insert the b. But for the second candidate we'd need the initial word prepared as a##c to make room for the two-byte ö. For a "€" it would even need to be a###c.

Maybe we should just stick to the internal encoding. It really solves most problems, but has its limitations.

@frank-dittrich
Collaborator

In UTF-8, € is represented by three bytes (0xE2 0x82 0xAC).

@magnumripper
Member Author

@Sayantan2048 you have implemented your own auto-tune in seven GPU-mask formats, with no shared code at all. That's the opposite direction from what I and @claudioandre have struggled with for a long time. I tried changing NT-opencl to use our shared auto-tune in 2772b8f and I see no downsides (actually it seems to work better). I'm planning to do the same with the six others. I can understand if mscash2 needs its own auto-tune (for multi-device support), but for the other formats it's really the wrong way to go.

Once we use shared code it'll be a much simpler task implementing GPU-mask-autotune.

@sayan1an
Contributor

Thank you, it should and will work for raw hashes; however, for salted hashes, do they set a valid salt before benchmark? I think you'll also run into problems with descrypt and lm-opencl, where kernels are rebuilt as needed during auto-tune.

@sayan1an
Contributor

To get around this I guess you'd need to introduce another piece of data to the mask struct: "For position n our place-holder will eat m bytes" or something like that. So that one will depend on target encoding: In case of NT it will always be 1 (as in one uint16) while for raw-MD5 it will be 1 for ASCII characters and 2, 3 or 4 bytes for non-ASCII...

My initial thought is that this complicates the GPU-side mask!! For raw-md5, we'll need to do 1, 2, 3 or 4 putchars instead of one if we treat every placeholder as UTF-32. Worst part is, it won't be SIMD friendly.

@sayan1an
Contributor

My initial thought is that this complicates the GPU-side mask!! For raw-md5, we'll need to do 1, 2, 3 or 4 putchars instead of one if we treat every placeholder as UTF-32. Worst part is, it won't be SIMD friendly.

Not exactly, but we'll need too many scalar branches, which ideally shouldn't cause performance degradation, but when put inside loops they tend to perform very poorly.

@magnumripper
Member Author

we'll need too many scalar branches

Yeah I think we should stick to "internal encoding" for now. Simple is beautiful.

for salted hashes, do they set a valid salt before benchmark? I think you'll also run into problems with descrypt and lm-opencl, where kernels are rebuilt as needed during auto-tune.

The shared auto-tune uses salt and ciphertexts from the test vectors, so it should be fine. I won't touch DES or LM; I'm looking at fixing the following:

$ git grep -l "auto_tune("
opencl_mscash_fmt_plug.c
opencl_nsldap_fmt_plug.c
opencl_rawmd4_fmt_plug.c
opencl_rawmd5_fmt_plug.c
opencl_rawsha1_fmt_plug.c
opencl_salted_sha_fmt_plug.c

(I think nsldap and raw-sha1 will be merged in the process, like we did for CPU formats)

@magnumripper
Member Author

I did mscash and it seemed to work at first but something's not right with salts (as you said). Just what the heck are you doing in there? Are there good reasons for deviating from the specified interfaces? The shared auto-tune works like a champ for almost 50 formats, salted or not.

Reverted it while investigating.

@magnumripper
Member Author

OK, I think I get the picture. I probably wasn't very far from a working version. I'll continue later.

@magnumripper
Member Author

Works now. I will polish it a bit more.

@Sayantan2048 perhaps the shared code should include the ability to build a "fake db" out of test vectors? Or would that be too format-specific?

BTW you really should move some of your dupe code for hash tables and stuff, to shared code or at least shared source (a C header).

@magnumripper
Member Author

I'm closing this generic issue now. We'll open specific issues when needed. I played a lot with GPU mask formats lately, including with various encodings and they work damn good and damn fast 👍
