Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LM problem w/ -fork #758

Closed
frank-dittrich opened this issue Sep 25, 2014 · 67 comments
Closed

LM problem w/ -fork #758

frank-dittrich opened this issue Sep 25, 2014 · 67 comments
Labels

Comments

@frank-dittrich
Copy link
Collaborator

./jtrts.pl -noprelims -passthru '--fork=3'
...
form=pwdump_lm                    guesses: 1122 -show=2757 0:00:00:00 DONE : Expected count(s) (986)(-show2760)  [!!!FAILED!!!]
.pot CHK:pwdump_lm                guesses:  983 -show=2757 0:00:00:00 DONE : Expected count(s) (986)  [!!!FAILED!!!]
...
Some tests had Errors. Performed 280 tests.1 errors  1 errors reprocessing the .POT files
Time used was 465 seconds

This is with latest jtrts commit (19ca304106ebebb5d0d9b717adc2cb4626cc9808) and latest bleeding-jumbo commit (4454406)

$ ./john --list=build-info 
Version: 1.8.0.2-bleeding-jumbo
Build: linux-gnu 64-bit AVX-autoconf
Arch: 64-bit LE
$JOHN is ./
Format interface version: 12
Max. number of reported tunable costs: 3
Rec file version: REC4
Charset file version: CHR3
CHARSET_MIN: 1 (0x01)
CHARSET_MAX: 255 (0xff)
CHARSET_LENGTH: 24
Max. Markov mode level: 400
Max. Markov mode password length: 30
Compiler version: 4.2.1 Compatible Clang 3.4 (tags/RELEASE_34/final)
gcc version: 4.2.1
clang version: 3.4 (tags/RELEASE_34/final)
OpenSSL library version: 01000105f
OpenSSL 1.0.1e-fips 11 Feb 2013
GMP library version: 5.1.2
NSS library version: 3.17 Basic ECC (loaded: 3.17 Extended ECC)
NSPR library version: 4.10.7
Kerberos version 5 support enabled
fseek(): fseek
ftell(): ftell
fopen(): fopen
memmem(): System's
@frank-dittrich
Copy link
Collaborator Author

I get

form=pwdump_lm                    guesses: 1125 0:00:00:00 DONE  [PASSED]
.pot CHK:pwdump_lm                guesses:  981 -show=2734 0:00:00:00 DONE : Expected count(s) (986)  [!!!FAILED!!!]

with a GCC build (OMP enabled) when I pass --fork=3.

 $ ./john --list=build-info 
Version: 1.8.0.2-bleeding-jumbo OMP
Build: linux-gnu 64-bit AVX-autoconf
Arch: 64-bit LE
$JOHN is ./
Format interface version: 12
Max. number of reported tunable costs: 3
Rec file version: REC4
Charset file version: CHR3
CHARSET_MIN: 1 (0x01)
CHARSET_MAX: 255 (0xff)
CHARSET_LENGTH: 24
Max. Markov mode level: 400
Max. Markov mode password length: 30
Compiler version: 4.8.3 20140911 (Red Hat 4.8.3-7)
gcc version: 4.8.3
OpenSSL library version: 01000105f
OpenSSL 1.0.1e-fips 11 Feb 2013
GMP library version: 5.1.2
NSS library version: 3.17 Basic ECC (loaded: 3.17 Extended ECC)
NSPR library version: 4.10.7
Kerberos version 5 support enabled
fseek(): fseek
ftell(): ftell
fopen(): fopen
memmem(): System's

@jfoug
Copy link
Collaborator

jfoug commented Sep 25, 2014

I get lm fails on GCC build, OMP -fork=3

JtR/bleed/test$ ./jtrts.pl -noprelims -passthru '--fork=3' -q
.......
form=lm guesses: 1493 -show=2992 0:00:00:00 DONE : Expected count(s) (1500)(-show3000) [!!!FAILED!!!]
.pot CHK:lm guesses: 1487 -show=2987 0:00:00:00 DONE : Expected count(s) (1500) [!!!FAILED!!!]

NOTE, if I just run against lm, I get LM failures, then LM working, then lm_pwdump working sometimes and failing others.

The runs are inconsistant. I do not think this is a problem with TS, it is something between LM and fork mode.

@magnumripper
Copy link
Member

I can't reproduce (on x86-64 OSX). Note that the exact number for "guesses" is supposed to vary "randomly" depending on how many dupe cracks you happened to get, but the -show figure should not vary at all.

In general, issues like this "can't" really be due to -fork. Basically a forked process is no different than a -node process - there is virtually no fork specific code anywhere in john after the cracking has started. But -fork may reveal a bug that happened to hide without it.

@jfoug
Copy link
Collaborator

jfoug commented Sep 25, 2014

Run this over and over again, and see if you get failures at times:

./jtrts.pl -noprelims -passthru '--fork=3' lm

On my ubuntu-64 (OMP build), sometimes it works properly, sometimes not. Yes, I understand that guesses will vary, but show should not, but I am seeing that at times jtr is NOT cracking all the hashes. I have not looked any deeper than showing that I see the exact same thing Frank is seeing.

@magnumripper
Copy link
Member

Nope, but I made a for loop with increasing fork count and found I can trigger it with sufficiently high number of forks.

$ ./jtrts.pl -noprelims -passthru '--fork=9' lm --stoponerror
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.16, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [darwin13.4.0 64-bit AVX-autoconf]
--------------------------------------------------------------------------------

John Jumbo build detected.

form=lm                           guesses: 1503 -show=2997 0:00:00:00 DONE : Expected count(s) (1500)(-show3000)  [!!!FAILED!!!]
Exiting on error.  The pot file ./tst.pot contains the found data
The command used to run this test was:
../run/john -ses=./tst --fork=9  -pot=./tst.pot LM_tst.in --wordlist=pw.dic -form=lm

$ ../run/john -pot=./tst.pot LM_tst.in -form:lm -show=left
u540-LM:2:$LM$8756e789c3e08314
u441-LM:2:$LM$34acd11f14f1a3ec
u108-LM:2:$LM$11028e91b87b43a5

1497 password hashes cracked, 3 left

Indeed, three hashes are not cracked. So this is some problem in JtR.

I can't seem to trigger the problem when using -node=1/9 instead of -fork=9... and I did try 2/9 and so on too.

@magnumripper magnumripper changed the title ./jtrts fails with 64bit linux clang build and -passthru '--fork=3' LM problem w/ -fork Sep 25, 2014
@magnumripper
Copy link
Member

I have verified that this problem exist in versions prior to our recently added self-tests. So they are not causing it.

@magnumripper
Copy link
Member

What scares me the most is that this is not likely a problem with LM at all. It might be wordlist mode vs. node/fork, or perhaps pot sync.

@frank-dittrich
Copy link
Collaborator Author

Yes, I also get a varying number of guesses when I repeatedly run

for i in `seq 2 40`; do echo $i; ./jtrts.pl -noprelims -q -passthru "--fork=$i" lm --stoponerror; done

1st run:

2
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.15, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [linux-gnu 64-bit AVX-autoconf]
--------------------------------------------------------------------------------
form=pwdump_lm                    guesses:  982 -show=2756 0:00:00:00 DONE : Expected count(s) (986)(-show2760)  [!!!FAILED!!!]
...
40
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.15, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [linux-gnu 64-bit AVX-autoconf]
--------------------------------------------------------------------------------
form=lm                           guesses: 1481 -show=2980 0:00:00:00 DONE : Expected count(s) (1500)(-show3000)  [!!!FAILED!!!]

2nd run:

2
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.15, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [linux-gnu 64-bit AVX-autoconf]
--------------------------------------------------------------------------------
All tests passed without error.  Performed 2 tests.  Time used was 1 seconds
...
40
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.15, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [linux-gnu 64-bit AVX-autoconf]
--------------------------------------------------------------------------------
form=lm                           guesses: 1511 -show=2995 0:00:00:00 DONE : Expected count(s) (1500)(-show3000)  [!!!FAILED!!!]

@magnumripper
Copy link
Member

This is too hard to debug using gdb or the like. We should probably use git-bisect.

@magnumripper
Copy link
Member

Core John does not seem to have the issue (but TS has an issue openwall/john-tests#17 when trying)

@magnumripper
Copy link
Member

git-bisect is a piece of work. Here's the offending commit: 1f827c7

Ouch! This doesn't make it easy.

@magnumripper
Copy link
Member

OK so it happens during -fork with a shared memory mapped buffer but per-node indexes (of each node's words in that shared buffer). But how can it be intermittent!?

@frank-dittrich
Copy link
Collaborator Author

And what causes only LM tests to fail? For --fork=2 ... --fork=14, I only see lm and pwdump_lm tests fail.

@magnumripper
Copy link
Member

Not sure if that's worth contemplating yet.

I tried disabling mmap() and the bug is still there. So it's not about the shared memory map but the per-node index buffer. That's a relief.

@magnumripper
Copy link
Member

I was wrong - we don't use any index buffer unless rules are involved. So what we have is a shared memory buffer, and mgetl() instead of fgetl(). But I tried disabling mmap() and bug still shows, which means we are just using fgetl() and no magic. We practically do not use anything that was touched in that commit. So what the heck is wrong!? And how come it was introduced with that patch?

@magnumripper
Copy link
Member

OK, it's not that commit! If I check out the commit before it, and add -mem=1 to disable memory buffering, the bug shows up!

./jtrts.pl -noprelims -passthru '--fork=9 -mem=1' lm
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.16, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [macosx-x86-64]
--------------------------------------------------------------------------------

John Jumbo build detected.

form=lm                           guesses: 1506 -show=2334 0:00:00:00 DONE : Expected count(s) (1500)(-show3000)  [!!!FAILED!!!]
.pot CHK:lm                       guesses:  834 -show=2334 0:00:00:00 DONE : Expected count(s) (1500)  [!!!FAILED!!!]

form=pwdump_lm                    guesses: 1276 -show=1571 0:00:00:00 DONE : Expected count(s) (986)(-show2760)  [!!!FAILED!!!]
.pot CHK:pwdump_lm                guesses:  318 -show=1571 0:00:00:00 DONE : Expected count(s) (986)  [!!!FAILED!!!]

Some tests had Errors. Performed 2 tests.2 errors  2 errors reprocessing the .POT files
Time used was 2 seconds

Back to bisecting. But this was a relief!!

@magnumripper
Copy link
Member

0b870fa is bad. The problem is I do not have any good commit because 0b870fa is the very merge from core that implemented -fork. So we can't bisect this. I need to re-test core.

@magnumripper
Copy link
Member

I really can't trigger the bug in core.

@magnumripper
Copy link
Member

OK I'm puzzled. I now tested bleeding-jumbo but with a wordlist.c from core John, just modified a tad for autoconf so it builds.

$ ./jtrts.pl -noprelims -passthru '--fork=9' lm 
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.16, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [darwin13.4.0 64-bit AVX-autoconf]
--------------------------------------------------------------------------------

John Jumbo build detected.

form=lm                           guesses: 1504 -show=2998 0:00:00:00 DONE : Expected count(s) (1500)(-show3000)  [!!!FAILED!!!]
.pot CHK:lm                       guesses: 1498 -show=2998 0:00:00:00 DONE : Expected count(s) (1500)  [!!!FAILED!!!]

form=pwdump_lm                    guesses: 1372 0:00:00:00 DONE  [PASSED]
.pot CHK:pwdump_lm                guesses:  984 -show=2758 0:00:00:00 DONE : Expected count(s) (986)  [!!!FAILED!!!]

Some tests had Errors. Performed 2 tests.1 errors  2 errors reprocessing the .POT files
Time used was 2 seconds

Whatever the bug is, it's not in wordlist.c

@magnumripper
Copy link
Member

bd10920 is the latest merge from core. That's the version to focus on - what differences do we have that can have this result?

  • LM_fmt.c has no differences against core
  • wordlist.c has differences, but using a wordlist.c from core still show the bug

@magnumripper
Copy link
Member

  • wordlist.c has differences, but using a wordlist.c from core still show the bug

Not true! If I use wordlist.c from core, in a bleeding-jumbo of bd10920, bug goes away. To use it I had to insert this line:

diff --git a/src/wordlist.c b/src/wordlist.c
index 064b33d..9a4b855 100644
--- a/src/wordlist.c
+++ b/src/wordlist.c
@@ -19,6 +19,7 @@
 #include "params.h"
 #include "common.h"
 #include "path.h"
+#include "os.h"
 #include "signals.h"
 #include "loader.h"
 #include "logger.h"

Weirdest thing, the same operation does not make the bug go away in HEAD of bleeding-jumbo.

@magnumripper
Copy link
Member

I officially give up here, for now.

@magnumripper
Copy link
Member

Heureka, I found a way to trigger it without using fork.

No, that just showed I'm too tired. I will give up now.

@frank-dittrich
Copy link
Collaborator Author

IMHO its not an issue with locking the pot file.
I ran

$ ./john --format=lm ../test/LM_tst.in --wordlist=../test/pw.dic --fork=11 > john.stdout ; sort -u john.pot |wc -l
Warning: OpenMP was disabled due to --fork; a non-OpenMP build may be faster
Node numbers 1-11 of 11 (fork)
2 138g 0:00:00:00 DONE (2014-09-26 13:40) 6900g/s 26650p/s 26650c/s 39975KC/s SKIPPIN..R1234B5
Press 'q' or Ctrl-C to abort, almost any other key for status
10 136g 0:00:00:00 DONE (2014-09-26 13:40) 4533g/s 17766p/s 17766c/s 24198KC/s B123456..Z123456
7 135g 0:00:00:00 DONE (2014-09-26 13:40) 4500g/s 17766p/s 17766c/s 24198KC/s Z123456..W123456
8 134g 0:00:00:00 DONE (2014-09-26 13:40) 4466g/s 17766p/s 17766c/s 24198KC/s LATE*NI..X123456
5 136g 0:00:00:00 DONE (2014-09-26 13:40) 4533g/s 17766p/s 17766c/s 24198KC/s 1234567..U123456
9 136g 0:00:00:00 DONE (2014-09-26 13:40) 4533g/s 17766p/s 17766c/s 24198KC/s ..Y123456
6 136g 0:00:00:00 DONE (2014-09-26 13:40) 4533g/s 17766p/s 17766c/s 24198KC/s BERT*ER..V123456
3 136g 0:00:00:00 DONE (2014-09-26 13:40) 4533g/s 17766p/s 17766c/s 24198KC/s BERT$ER..S12345C
4 135g 0:00:00:00 DONE (2014-09-26 13:40) 4500g/s 17766p/s 17766c/s 24198KC/s BERT$ER..T123456
11 137g 0:00:00:00 DONE (2014-09-26 13:40) 4566g/s 17766p/s 17766c/s 24198KC/s  ..0A23456
1 127g 0:00:00:00 DONE (2014-09-26 13:40) 3175g/s 13325p/s 13325c/s 2052KC/s SKIPPIN..Q123A45
Waiting for 10 children to terminate
Warning: passwords printed above might be partial and not be all those cracked
Use the "--show" option to display all of the cracked passwords reliably
Session completed
1486

and

$ ./john --format=lm ../test/LM_tst.in --wordlist=../test/pw.dic > john.stdout2 ; sort -u john.pot |wc -l
Warning: poor OpenMP scalability for this hash type, consider --fork=8
Will run 8 OpenMP threads
Press 'q' or Ctrl-C to abort, almost any other key for status
14g 0:00:00:00 DONE (2014-09-26 13:41) 350.0g/s 146575p/s 146575c/s 2052KC/s SKIPPIN..0A23456
Warning: passwords printed above might be partial and not be all those cracked
Use the "--show" option to display all of the cracked passwords reliably
Session completed
1500

Let's see what has not been cracked in the --fork=11 run:

$ cat john.stdout2 
Loaded 1500 password hashes with no different salts (LM [DES 128/128 AVX-16])
Remaining 14 password hashes with no different salts
AB1266           (u1266-LM:2)
AB1244           (u1244-LM:2)
AB1086           (u1086-LM:2)
AB936            (u936-LM:2)
AB760            (u760-LM:2)
AB755            (u755-LM:2)
AB723            (u723-LM:2)
AB716            (u716-LM:2)
AB540            (u540-LM:2)
AB441            (u441-LM:2)
AB268            (u268-LM:2)
AB265            (u265-LM:2)
AB232            (u232-LM:2)
AB78             (u78-LM:2)

$ grep AB1266 john.stdout*
john.stdout2:AB1266           (u1266-LM:2)

$ grep u1244-LM:2 john.stdout*
john.stdout2:AB1244           (u1244-LM:2)

@frank-dittrich
Copy link
Collaborator Author

I can reproduce the bug with Markov mode.

$ cut -d: -f 2- john.pot > lm.txt
$ ./calc_stat lm.txt lm-stats
zero -10*log proba1[65] (1499) / 1500 converted to 1
zero -10*log proba2[65*256+66] (1499) / 1499, converted to 1 to prevent infinite length candidates
zero -10*log proba2[75*256+73] (1) / 1, converted to 1 to prevent infinite length candidates
zero -10*log proba2[83*256+75] (1) / 1, converted to 1 to prevent infinite length candidates
$ ./mkvcalcproba lm-stats lm.txt |cut -f 3|sort -n|uniq -c
reading stats ... [0x1ac5010]
119 lines parsed [0x1ac5010]
scanning password file ...
freeing stuff ...
charsetsize = 17
      1 10
     13 28
      5 41
      5 46
     27 50
     93 51
      5 59
      9 63
     46 64
     24 68
     76 69
     27 72
    402 73
    650 74
      1 75
      5 77
     24 81
     86 82
      1 99

Then I created a Markov mode that should crack 1498 out of those 1500 passwords:

 [Markov:lm]
Statsfile = $JOHN/lm-stats
MkvLvl = 82
MkvMaxLen = 7
MkvMinLvl = 28
; MkvMinLen = 0

I verified that this Markov mode works as expected:

$ ./john --format=lm ../test/LM_tst.in --markov=lm > /dev/null ; sort -u john.pot |wc -l;     
Warning: poor OpenMP scalability for this hash type, consider --fork=8
Will run 8 OpenMP threads
Press 'q' or Ctrl-C to abort, almost any other key for status
MKV start (stats=$JOHN/lm-stats, lvl=28-82 len=7 pwd=1538)
1498g 0:00:00:00 DONE (2014-09-26 15:51) 49933g/s 51133p/s 51133c/s 76700KC/s AB1000..S
Warning: passwords printed above might be partial and not be all those cracked
Use the "--show" option to display all of the cracked passwords reliably
Session completed
1498

Then I tried --fork:

$ for i in `seq 2 40`; do echo $i; rm john.pot ; ./john --format=lm ../test/LM_tst.in --markov=lm --fork=$i > /dev/null ; sort -u john.pot |wc -l; done
...
4
Will run 2 OpenMP threads per process (8 total across 4 processes)
Node numbers 1-4 of 4 (fork)
2 367g 0:00:00:00 DONE (2014-09-26 15:53) 12233g/s 13100p/s 13100c/s 19650KC/s AB1341..AB322
Press 'q' or Ctrl-C to abort, almost any other key for status
MKV start (stats=$JOHN/lm-stats, lvl=28-82 len=7 pwd=1538 split over nodes)
3 383g 0:00:00:00 DONE (2014-09-26 15:53) 12766g/s 13133p/s 13133c/s 14906KC/s AB314..AB669
1 375g 0:00:00:00 DONE (2014-09-26 15:53) 9375g/s 9850p/s 9850c/s 3851KC/s AB1000..AB1350
Waiting for 3 children to terminate
4 373g 0:00:00:00 DONE (2014-09-26 15:53) 9325g/s 9600p/s 9600c/s 10896KC/s AB660..S
Warning: passwords printed above might be partial and not be all those cracked
Use the "--show" option to display all of the cracked passwords reliably
Session completed
1486
5
Warning: OpenMP was disabled due to --fork; a non-OpenMP build may be faster
Node numbers 1-5 of 5 (fork)
Press 'q' or Ctrl-C to abort, almost any other key for status
MKV start (stats=$JOHN/lm-stats, lvl=28-82 len=7 pwd=1538 split over nodes)
4 317g 0:00:00:00 DONE (2014-09-26 15:53) 15850g/s 15850p/s 15850c/s 23775KC/s AB452..AB737
2 296g 0:00:00:00 DONE (2014-09-26 15:53) 14800g/s 15850p/s 15850c/s 23775KC/s AB1272..AB189
3 310g 0:00:00:00 DONE (2014-09-26 15:53) 15500g/s 15800p/s 15800c/s 23700KC/s AB1814..AB460
5 303g 0:00:00:00 DONE (2014-09-26 15:53) 15150g/s 15400p/s 15400c/s 23100KC/s AB729..S
1 312g 0:00:00:00 DONE (2014-09-26 15:53) 15600g/s 15850p/s 15850c/s 23775KC/s AB1000..AB1281
Waiting for 4 children to terminate
Warning: passwords printed above might be partial and not be all those cracked
Use the "--show" option to display all of the cracked passwords reliably
Session completed
1498
6
Warning: OpenMP was disabled due to --fork; a non-OpenMP build may be faster
Node numbers 1-6 of 6 (fork)
Press 'q' or Ctrl-C to abort, almost any other key for status
MKV start (stats=$JOHN/lm-stats, lvl=28-82 len=7 pwd=1538 split over nodes)
2 266g 0:00:00:00 DONE (2014-09-26 15:53) 13300g/s 13300p/s 13300c/s 19950KC/s AB1226..AB1465
3 239g 0:00:00:00 DONE (2014-09-26 15:53) 11950g/s 13250p/s 13250c/s 19875KC/s AB1457..AB322
4 266g 0:00:00:00 DONE (2014-09-26 15:53) 13300g/s 13300p/s 13300c/s 19950KC/s AB314..AB553
1 248g 0:00:00:00 DONE (2014-09-26 15:53) 12400g/s 13300p/s 13300c/s 3511KC/s AB1000..AB1235
Waiting for 5 children to terminate
6 251g 0:00:00:00 DONE (2014-09-26 15:53) 25100g/s 25600p/s 25600c/s 38400KC/s AB776..S
5 266g 0:00:00:00 DONE (2014-09-26 15:53) 13300g/s 13300p/s 13300c/s 19950KC/s AB545..AB784
Warning: passwords printed above might be partial and not be all those cracked
Use the "--show" option to display all of the cracked passwords reliably
Session completed
1496
...

As with wordlist mode, the number of cracked passwords varies when I run the same command repeatedly.

@jfoug
Copy link
Collaborator

jfoug commented Sep 26, 2014

Should we focus on the obvious? LM built with OMP has some issues (somehow) with -fork

@frank-dittrich
Copy link
Collaborator Author

With --single=none and user name = password,

AB251:$LM$0004c62411e4e2a4
...

I am not able to reproduce the error.
I guess that's due to the fact that in this case it's not the list of password candidates that gets split across "nodes".

When I change 1499 user names into AB and use this as single mode

[List.Rules:lm]
:
$[0-9]
Az"[1-90][0-9]"
Az"[1-90][0-9][0-9]"
Az"1[0-4][0-9][0-9]"

I can reproduce the error using single mode:

$ for i in `seq 2 40`; do echo -n $i; echo -n " "; rm john.pot ; ./john --format=lm lm.hashes --single:lm --fork=$i > /dev/null 2> /dev/null ; sort -u john.pot |wc -l; done
2 1496
3 1494
4 1488
5 1484
6 1491
7 1495
8 1493
9 1491
10 1486
11 1474
12 1492
13 1479
14 1488
15 1478
16 1492
17 1490
18 1495
19 1491
20 1494
21 1496
22 1490
23 1496
24 1495
25 1489
26 1498
27 1491
28 1493
29 1492
30 1497
31 1480
32 1493
33 1477
34 1494
35 1489
36 1495
37 1485
38 1479
39 1479
40 1497

@frank-dittrich
Copy link
Collaborator Author

On 09/26/2014 04:07 PM, JimF wrote:

Should we focus on the obvious? LM built with OMP has some issues
(somehow) with -fork?

Since I reproduced this with clang (which doesn't support OMP), it is
not related to OMP.

And so far we couldn't reproduce it in core.
So, LM triggers it either because it is extremely fast (which helps to
trigger race conditions) or may be because hashes are split into 2 halves.
If it is because hashes are split, we should be able do reproduce this
with bigcrypt (descrypt) hashes. But I doubt that is the issue.

@frank-dittrich
Copy link
Collaborator Author

Then may be you did something wrong.
Even with ReloadAtCrack = Y, your latest commit 3b4c29f fixes that problem for me.
But when I checkout 479a6ea, the bug still exists.

@frank-dittrich
Copy link
Collaborator Author

This bug seems to be related to #798.
With latest bleeding-jumbo, I (rarely) get the problem with the number of forked processes close to the number when bug #798 occurs:

 $ (for i in `seq 650 999`; do echo -n "$i "; rm john.pot john.log; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=${i} > /dev/null 2> /dev/null; LC_ALL=C sort -u john.pot | wc -l; done); echo $?
650 1500
651 1500
652 1500
653 1500
654 1500
655 1500
656 1500
657 1500
658 1500
659 1500
660 1500
661 1500
662 1500
663 1499
664 1500
665 1500
666 1500
667 1500
668 1500
669 User defined signal 2
140

In this case, --fork=663 cracked just 1499 hashes.

@frank-dittrich
Copy link
Collaborator Author

At least, john now behaves in a consistent way:

$ (for i in `seq 1 99`; do rm john.pot john.log; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=663 > /dev/null 2> /dev/null; echo -n " `LC_ALL=C sort -u john.pot | wc -l`"; done); echo $?
 1499 1499 1499 1499 1499 1499 1499 1499 1499^C^C

Which reminds me of http://www.despair.com/consistency.html

@frank-dittrich
Copy link
Collaborator Author

With this john.local.conf

[Local:Options]
ReloadAtCrack = N
ReloadAtSave = N

I manage to get less than 1500 unique john.pot lines only if I hit the ulimit of (in my case currently 1024) processes per user.
That looks like this:

[fd@f20x32 run]$ (for i in `seq 1020 1024`; do echo -n -e "$i\t"; echo $i > i.txt; /bin/rm john.pot; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=${i} --session=t-$i > stdout-$i.txt 2> stderr-$i.txt; LC_ALL=C sort -u john.pot | wc -l; done |LC_ALL=C grep -v 1500 ); echo $?
/bin/rm: cannot remove ‘john.pot’: No such file or directory
bash: fork: retry: Resource temporarily unavailable
1020    174
bash: fork: retry: Resource temporarily unavailable
1021    32
1022    53
1023    15
1024    16
0
[fd@f20x32 run]$ grep error t-102*.log
t-1020.log:1 0:00:00:00 Terminating on error, john.c:496
t-1020.log:1 0:00:00:00 Terminating on error, john.c:496
t-1021.log:1 0:00:00:00 Terminating on error, john.c:496
t-1021.log:1 0:00:00:00 Terminating on error, john.c:496
t-1022.log:1 0:00:00:00 Terminating on error, john.c:496
t-1022.log:1 0:00:00:00 Terminating on error, john.c:496
t-1023.log:1 0:00:00:00 Terminating on error, john.c:496
t-1023.log:1 0:00:00:00 Terminating on error, john.c:496
t-1024.log:1 0:00:00:00 Terminating on error, john.c:496
t-1024.log:1 0:00:00:00 Terminating on error, john.c:496

So I guess I'll kep that john.local.conf ans issue a

killall -s USR2 john

Whenever I want a pot file sync.

@magnumripper
Copy link
Member

I think we should fix (or at least consider) openwall/john-tests#20 before going on with this.

@magnumripper
Copy link
Member

Anyway, I can not reproduce this bug on OSX anymore, even with default john.conf. Except when I go past ulimits.

@magnumripper
Copy link
Member

However, I can reproduce it using 9d9c271 (and -mem=1) which was before there was any pot reload or USR2 signalling at all.

@magnumripper
Copy link
Member

Can you reproduce this with current code?

@frank-dittrich
Copy link
Collaborator Author

Before Monday I'll not have access to the system where I want to do the re-test.

@frank-dittrich
Copy link
Collaborator Author

$ (for i in `seq 2 1024`; do echo -n -e "$i\t"; echo $i > i.txt; /bin/rm john.pot; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=${i} --session=t-$i > stdout-$i.txt 2> stderr-$i.txt ; echo -n -e "$?\t"; LC_ALL=C sort -u john.pot | wc -l; done | LC_ALL=C grep -v 1500 ); echo $?
323 0   1499
325 0   1499
331 0   1499
342 0   1499
346 0   1499
350 0   1499
352 0   1499
356 0   1499
358 0   1499
373 0   1499
374 0   1499
376 0   1499
377 0   1499
379 
0
$ sort -u john.pot | wc -l
1499
$ grep -i error t-3??.log
$ grep -c Cracked t-377.log 
1518
$ grep Cracked t-377.log |cut -d+ -f 2- |sort -u|wc -l 
1499
$ wc -l stdout-377.txt 
1519 stdout-377.txt
$ head stdout-377.txt 
Loaded 1500 password hashes with no different salts (LM [DES 128/128 SSE2])
AB208            (u208-LM:2)
AB585            (u585-LM:2)
AB962            (u962-LM:2)
AB1339           (u1339-LM:2)
AB207            (u207-LM:2)
AB584            (u584-LM:2)
AB961            (u961-LM:2)
AB1338           (u1338-LM:2)
AB216            (u216-LM:2)
$ head -n 4 stderr-377.txt 
Warning: OpenMP was disabled due to --fork; a non-OpenMP build may be faster
Node numbers 1-377 of 377 (fork)
6 4g 0:00:00:00 DONE (2014-11-09 22:42) 12.12g/s 48.48p/s 48.48c/s 72727C/s BERT*ER..IZE__12
3 4g 0:00:00:00 DONE (2014-11-09 22:42) 6.779g/s 27.11p/s 27.11c/s 40677C/s BERT$ER..L3__12
$ grep -c " DONE " stderr-377.txt 
377
$ grep -v -n " DONE " stderr-377.txt 
1:Warning: OpenMP was disabled due to --fork; a non-OpenMP build may be faster
2:Node numbers 1-377 of 377 (fork)
30:Press 'q' or Ctrl-C to abort, almost any other key for status
195:Waiting for 376 children to terminate
382:Warning: passwords printed above might be partial and not be all those cracked
383:Use the "--show" option to display all of the cracked passwords reliably
384:Session completed

So, I need a much higher number of forked processes to reproduce this.
But the number of forked processes (> 300, but < 400 on an otherwise idle system)is still way lower than the ulimit -u (1024).
I get no indication of errors in log files, in stderr or stdout output or exit code.

@magnumripper
Copy link
Member

Trying the same on OSX, it works fine up to 572, and from that point the problem can be seen in the stderr file: "fork: Resource temporarily unavailable". So I can't reproduce any problem with John. That's with an ulimit of 709 max user processes.

@frank-dittrich
Copy link
Collaborator Author

When I change Save = 60 to Save = 62 in john.conf, the error occurs much later:

606 0   1499
621 0   1499
626 0   1499
675 0   1499
688 0   1499
690 0   1499
692 0   1499
698 0   1499
706 0   1499
707 0   1499
708 0   1497
716 0   1499
718 0   1499
723 0   1499
726 

Still no indication of an error in *.log stderr-*.txt stdout-*.txt.

Can you try schedtool -a 0x1 -e ./john ... instead of ./john?

@magnumripper
Copy link
Member

I do not have anything named schedtool.

@magnumripper
Copy link
Member

I can't see why the Save timer would affect this. No session runs for that long anyway, right?

@magnumripper
Copy link
Member

I can't see why the Save timer would affect this

OK, it's because OS_TIMER counts backwards.

@frank-dittrich
Copy link
Collaborator Author

With Save = 2 I get

$ (for i in `seq 2 1024`; do echo -n -e "$i\t"; echo $i > i.txt; /bin/rm john.pot; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=${i} --session=t-$i > stdout-$i.txt 2> stderr-$i.txt ; echo -n -e "$?\t"; LC_ALL=C sort -u john.pot | wc -l; done | LC_ALL=C grep -v 1500 ); echo $?
268 0   1499
282 0   1499
293 0   1499
304 0   1499
318 0   1499
319 0   1499
325 0   1499
332 0   1499
333 0   1499
335 0   1497
340 0   1497
...
957 0   1499
958 0   1499
959 0   1499
960 0   1498
961 0   1496
bash: fork: retry: Resource temporarily unavailable
962 1   79
963 1   23
964 1   2
965 1   97
966 1   2
bash: fork: retry: Resource temporarily unavailable
967 1   0
968 1   0
969 1   41
970 1   7
971 1   0
972 1   0
973 1   0
974 1   4
975 1   0
976 1   5
977 1   29
978 1   0
979 1   0
bash: fork: retry: Resource temporarily unavailable
980 1   0
981 1   53
bash: fork: retry: Resource temporarily unavailable
982 1   12
983 1   25
984 1   31
985 1   0
986 1   0
987 1   0
988 1   0
989 1   7
bash: fork: retry: Resource temporarily unavailable
990 1   2
991 1   0
992 1   2
993 1   32
994 1   0
995 1   41
996 1   5
997 1   0
998 1   9
999 1   25
1000    1   15
1001    1   5
1002    1   33
1003    1   0
bash: fork: retry: Resource temporarily unavailable
1004    1   3
1005    1   13
1006    1   2
1007    1   5
1008    1   0
1009    1   2
1010    1   0
1011    1   28
1012    1   44
1013    1   18
1014    1   4
bash: fork: retry: Resource temporarily unavailable
1015    1   2
1016    1   10
1017    1   7
1018    1   2
1019    1   0
1020    1   25
1021    1   9
1022    1   19
1023    1   9
1024    1   1
0
$ grep -i error t-96[12].log
t-962.log:1 0:00:00:00 Terminating on error, john.c:489

@frank-dittrich
Copy link
Collaborator Author

On 11/09/2014 11:38 PM, magnum wrote:

Trying the same on OSX, it works fine up to 572, and from that point
the problem can be seen in the stderr file: "fork: Resource
temporarily unavailable". So I can't reproduce any problem with John.

I bet with less than 572 forks, john finishes in less than 1 second on
you OSX system, so you don't get the signals when 59 seconds of the 60
are left.

@frank-dittrich
Copy link
Collaborator Author

On my 64bit Linux system with an Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (quad core, no hyperthreading) I manage to get the john killed with SUGUSR2 at 761 forked processes (with Idle=Y in john.conf, when I run 4 other john processes (with Idle = N in john.conf) trying (not) to crack the rar test hashes: ./john --incremental --min-length=6 hashes.rar --session=rar4 ...

30670 fd        20   0  542700  48620   9224 R  92.8  0.6  18:16.65 ./john --session=rar1 --incremental --min-length=6 hashes.rar                                           
27686 fd        20   0  542704  45892   9188 R  90.8  0.6  14:33.13 ./john --incremental --min-length=6 hashes.rar --session=rar5                                           
20452 fd        20   0  542700  48624   9216 R  85.8  0.6  20:27.23 ./john --incremental --min-length=6 hashes.rar --session=rar3                                           
20484 fd        20   0  542700  48592   9188 R  85.8  0.6  20:00.28 ./john --incremental --min-length=6 hashes.rar --session=rar4                                           
10178 fd        20   0   93968  10644   6844 R   4.0  0.1   0:00.12 ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=510 --session=t-510

And some tests with a smaller number of forked processes exited with $? = 0, but less than 1500 unique hashes cracked.

$ (for i in `seq 500 1024`; do echo -n -e "$i\t"; echo $i > i.txt; /bin/rm john.pot; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=${i} --session=t-$i > stdout-$i.txt 2> stderr-$i.txt ; echo -n -e "$?\t"; LC_ALL=C sort -u john.pot | wc -l; done | LC_ALL=C grep -v 1500 ); echo $?
631 0   1499
726 0   1499
737 0   1499
743 0   1499
755 0   1498
761 
0

@frank-dittrich
Copy link
Collaborator Author

As root, with ulimits -u = 7794 (compared to 1024 as a regular user), nothing changes.

# (for i in `seq 2 1024`; do echo -n -e "$i\t"; echo $i > i.txt; /bin/rm john.pot; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=${i} --session=t-$i > stdout-$i.txt 2> stderr-$i.txt ; echo -n -e "$?\t"; LC_ALL=C sort -u john.pot | wc -l; done | LC_ALL=C grep -v 1500 ); echo $?
/bin/rm: cannot remove ‘john.pot’: No such file or directory
350 0   1499
355 0   1499
356 0   1499
358 0   1499
373 0   1499
374 0   1499
380 0   1497
389 0   1498
391 0   1499
394 
0
# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 7794
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 7794
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

With some more changes in ulimits

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 15588
max locked memory       (kbytes, -l) 256
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 1638400
real-time priority              (-r) 0
stack size              (kbytes, -s) 16384
cpu time               (seconds, -t) unlimited
max user processes              (-u) 7794
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I get

# (for i in `seq 2 1024`; do echo -n -e "$i\t"; echo $i > i.txt; /bin/rm john.pot; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=${i} --session=t-$i > stdout-$i.txt 2> stderr-$i.txt ; echo -n -e "$?\t"; LC_ALL=C sort -u john.pot | wc -l; done | LC_ALL=C grep -v 1500 ); echo $?
/bin/rm: cannot remove ‘john.pot’: No such file or directory
202 0   1499
207 0   1499
238 0   1498
266 0   1499
268 0   1499
271 
0

So, this is most likely not related to ulimit.

@magnumripper
Copy link
Member

I bet #798 (comment) would make it much better, (at least provided you don't use a Save interval that is a multiple of 3).

@frank-dittrich
Copy link
Collaborator Author

I bet #798 (comment) would make it much better, (at least provided you don't use a Save interval that is a multiple of 3).

I tested with the default Save interval of 60, which is a multiple of 3. Why do you think that matters here? All these test runs are much faster than 60 seconds, even on my old 32bit system.

(for i in `seq 300 1024`; do echo -n -e "$i\t"; echo $i > i.txt; /bin/rm john.pot; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=${i} --session=t-$i > stdout-$i.txt 2> stderr-$i.txt ; echo -n -e "$?\t"; LC_ALL=C sort -u john.pot | wc -l; done |grep -v 1500 ); echo $?
rm: cannot remove ‘t-*.rec’: No such file or directory
603 0   1499
608 0   1499
663 0   1499
675 0   1499
696 0   1499
700 0   1497
711 0   1498
714 0   1498
717 
0

No indication of problems in log file, stderr or stdout output.
So, the improvement is similar to what I got when I use Save = 60 without that patch.

@frank-dittrich
Copy link
Collaborator Author

BTW, the --fork=717 test that got killed produced a john.pot file with 1500 unique hashes.

@magnumripper
Copy link
Member

I tested with the default Save interval of 60, which is a multiple of 3. Why do you think that matters here? All these test runs are much faster than 60 seconds, even on my old 32bit system.

Because, like you found out, without that patch and with OS_TIMER, some things would happen after 0 seconds instead of after three seconds. Maybe that is unrelated.

@frank-dittrich
Copy link
Collaborator Author

Because, like you found out, without that patch and with OS_TIMER, some things would happen after 0 seconds instead of after three seconds. Maybe that is unrelated.

Yes, but that was without the patch.
With Save =60, after one second, you'd get the SIGUSR2 signals, because 59 & 3 == 3.

With your patch, you'd get the SIGUSR2 signals 2 seconds later, no matter if you have Save = 60 or Save = 62.
(60 - 57) & 3 == 3 and (62 - 59) & 3 == 3.

@magnumripper
Copy link
Member

I had the vague idea that under this crazy over-booking of resources there could be a difference. But now that I think about it, we should already be "protected" against USR2 anyway: The real key is calling sig_init() and sig_init_child() early enough and this doesn't change that. I can't think of any way to do those earlier than we do now iirc.

@magnumripper
Copy link
Member

I can't think of any way to do those earlier than we do now iirc.

On the other hand, if we indeed introduce yet another counter instead of using ((timer_save_interval - timer_save_value) & 3) == 3), we could also opt to never ever send a USR2 during the first minute, or whatever we choose.

@magnumripper
Copy link
Member

Anyway we put it, I think this issue and #798 are purely academical. If we can find solutions for them, fine. If we can't, I will not lose any sleep over it.

@magnumripper
Copy link
Member

I'm closing this. If some similar problem can still be triggered, please open a brand new issue. This issue is too clobbered anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants