Improve packaged Windows builds #3960

Closed
solardiz opened this issue May 16, 2019 · 18 comments

Comments

@solardiz
Member

solardiz commented May 16, 2019

I'll start recording related sub-issues in here.

  1. doc/README[.txt] turns from a symlink into a text file containing only the symlink's target filename. This isn't helpful, so the windows-package target should remove the file. Use an equivalent of these commands in the windows-package target:
rm ../doc/README
find ../doc ../run/rules -type f -exec sed -i -e 's/\r*$/\r/' {} ';'
sed -i -e 's/\r*$/\r/' ../README.md ../run/*.conf ../run/password.lst
find ../doc -type f -not -name '*.txt' -not -name '*.md' -exec mv -v '{}' '{}'.txt \;
  2. We should move all of the fallback executables into a subdirectory called e.g. fallback (but make sure this doesn't exclude them from the peflags). This more closely matches how I intended this functionality to be used: on Linux systemwide installs, /usr/libexec/john is used, which is why I didn't worry about those files confusing the user.

Otherwise, someone looking at the many filenames might run a suboptimal build, thinking it's the one best suited to their CPU.

Of course, this change also requires changing where the executables expect their next fallbacks.

  3. We should add AVX-512 builds, either as AVX512BW->AVX2->... or as AVX512BW->AVX512F->AVX2->...
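
A note on the sed expression in the first item: because it strips any run of trailing carriage returns before appending exactly one, it should be safe to re-run on files that are already CRLF. A minimal sketch of that property, assuming GNU sed (for -i and the \r escape) and a throwaway temp directory; to_crlf is a hypothetical wrapper name, not something in the tree:

```shell
# Hypothetical wrapper around the sed expression quoted above (GNU sed assumed).
to_crlf() {
    # Strip any run of trailing CRs, then append exactly one.
    sed -i -e 's/\r*$/\r/' "$1"
}

# Work on a scratch file with mixed line endings: LF, CRLF, and CR CR LF.
tmp=$(mktemp -d)
printf 'unix\nwindows\r\ndouble\r\r\n' > "$tmp/sample.txt"

to_crlf "$tmp/sample.txt"
first=$(cksum < "$tmp/sample.txt")

# A second pass must leave the file byte-for-byte unchanged.
to_crlf "$tmp/sample.txt"
second=$(cksum < "$tmp/sample.txt")

[ "$first" = "$second" ] && echo "idempotent"
```

This matters for a packaging target that may be invoked more than once on the same tree.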
@claudioandre-br
Member

  1. [...]

Fixed in #3959

  2. [...]
# Move all of the fallback executables under a subdirectory
mkdir ../run/libexec

# CPU (OMP and extensions fallback)
shell "./configure [...] && mv ../run/john ../run/libexec/john-sse2-non-omp"
[...]
shell "./configure [...] && make -sj2  && make -s strip"

@solardiz
Member Author

I don't mind using libexec for the directory name, but fallback is probably better in this case - no need to match the Unix'ish directory name there, and fallback is descriptive. It is reasonable for someone to knowingly use one of the fallback binaries in some special case.

@claudioandre-br
Member

Ok, fallback

@solardiz
Member Author

Don't forget you need to also specify the paths to fallback binaries during build of binaries that will invoke the fallbacks.

@claudioandre-br
Member

claudioandre-br commented May 16, 2019

I know (and that makes me a sad person). The madness of escaping.

@solardiz
Member Author

This change shouldn't require additional escaping, or does it?
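
To make the question concrete, here is a rough sketch of how the quoting might look when the fallback path gains a directory prefix. The macro names (OMP_FALLBACK, OMP_FALLBACK_BINARY) are as I recall them from the distro packaging notes and should be treated as assumptions to verify; fallback_cppflags is a hypothetical helper. The point is only that the slash in the path is not special to the shell or to the C preprocessor, so the escaping is the same with or without the subdirectory:

```shell
# Hypothetical helper: build the preprocessor flags for one fallback binary.
# $1 = macro prefix (e.g. OMP), $2 = path relative to the run directory.
# The macro names are assumptions; check them against the build docs.
fallback_cppflags() {
    # The value must reach the compiler as a C string literal, hence the
    # escaped inner quotes wrapped in shell-level quotes.
    printf '%s' "-D${1}_FALLBACK -D${1}_FALLBACK_BINARY=\"\\\"${2}\\\"\""
}

# Same quoting whether or not the path has a directory component.
fallback_cppflags OMP john-sse2-non-omp; echo
fallback_cppflags OMP fallback/john-sse2-non-omp; echo
```

The resulting string is what one would pass as the CPPFLAGS value to configure; only the path inside the innermost quotes changes.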

@claudioandre-br
Member

Testing now. Let's see.

@claudioandre-br
Member

It is basically done and works in https://ci.appveyor.com/project/claudioandre-br/johntheripper/builds/24608734.

C:\Temp\JohnTheRipper\run>john --list=build-info
Version: 1.9.0-jumbo-1
Build: cygwin 32-bit i686 SSE2 AC OMP
SIMD: SSE2, interleaving: MD4:3 MD5:3 SHA1:2 SHA256:1 SHA512:1
CPU tests: SSE2
OMP fallback binary: fallback/john-sse2-non-omp
$JOHN is
Format interface version: 14
Max. number of reported tunable costs: 4
Rec file version: REC4
Charset file version: CHR3
CHARSET_MIN: 1 (0x01)
CHARSET_MAX: 255 (0xff)
CHARSET_LENGTH: 24
SALT_HASH_SIZE: 1048576
SINGLE_IDX_MAX: 2147483648
SINGLE_BUF_MAX: 4294967295
Effective limit: Number of salts vs. SingleMaxBufferSize
Max. Markov mode level: 400
Max. Markov mode password length: 30
gcc version: 7.4.0
OpenCL headers version: 2.2
Crypto library: OpenSSL
OpenSSL library version: 01010102f
OpenSSL 1.1.1b  26 Feb 2019
GMP library version: 6.1.2
File locking: fcntl()
fseek(): fseeko
ftell(): ftello
fopen(): _fopen64
memmem(): System's

But there is a bad side effect, and a Windows symlink did not solve it (remember Win7 32-bit).

  • I need the libraries in two directories.
  • I had to copy and paste them (Ctrl+C, Ctrl+V).
17/05/2019  03:29    <DIR>          .
17/05/2019  03:29    <DIR>          ..
17/05/2019  03:26    <SYMLINK>      cygbz2-1.dll [..\cygbz2-1.dll]
17/05/2019  03:29    <SYMLINK>      cygcrypt-0.dll [..\cygcrypt-0.dll]
17/05/2019  03:29    <SYMLINK>      cygcrypt-2.dll [..\cygcrypt-2.dll]
17/05/2019  03:29    <SYMLINK>      cygcrypto-1.0.0.dll [..\cygcrypto-1.0.0.d

17/05/2019  03:29    <SYMLINK>      cygcrypto-1.1.dll [..\cygcrypto-1.1.dll]
17/05/2019  03:29    <SYMLINK>      cyggcc_s-1.dll [..\cyggcc_s-1.dll]
17/05/2019  03:29    <SYMLINK>      cyggmp-10.dll [..\cyggmp-10.dll]
17/05/2019  03:29    <SYMLINK>      cyggomp-1.dll [..\cyggomp-1.dll]
17/05/2019  03:29    <SYMLINK>      cygOpenCL-1.dll [..\cygOpenCL-1.dll]
17/05/2019  03:29    <SYMLINK>      cygssl-1.0.0.dll [..\cygssl-1.0.0.dll]
17/05/2019  03:29    <SYMLINK>      cygssl-1.1.dll [..\cygssl-1.1.dll]
17/05/2019  03:29    <SYMLINK>      cygwin1.dll [..\cygwin1.dll]
17/05/2019  03:29    <SYMLINK>      cygz.dll [..\cygz.dl]
16/05/2019  23:48         7.139.342 john-avx-non-omp.exe
16/05/2019  23:52         7.186.446 john-avx.exe
17/05/2019  00:04         7.143.438 john-avx2-non-omp.exe
16/05/2019  23:33         7.141.390 john-sse2-non-omp.exe
16/05/2019  23:37         7.189.006 john-sse2.exe
16/05/2019  23:40         7.160.334 john-sse41-non-omp.exe
16/05/2019  23:44         7.207.438 john-sse41.exe
16/05/2019  23:56         7.114.766 john-xop-non-omp.exe
17/05/2019  00:00         7.161.870 john-xop.exe
              22 file(s)     64.444.030 bytes
               2 dir(s)   37.389.836.288 bytes free

@solardiz
Member Author

* I need the libraries in two directories.

Oh, I missed that problem, which I now realize was to be expected. This may be a reason to revert to the other approach I mentioned on the 1.9.0-jumbo-1 meta-issue:

"maybe we should include the best SIMD+OMP not only as john.exe, but also with its full name consistent with the rest. [...] We could then also use our symlink.c to produce a tiny john.exe that merely executes the best one (letting it start the fallback chain if necessary)."
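
As a rough illustration of the selection such a launcher would rely on, here is a sketch that walks a preference list and reports the first build the CPU supports. It is deliberately simplified: the chain is centralized here, whereas in the approach quoted above each binary hands off to its own next fallback; the XOP builds are left out because they are a vendor-specific branch; and CHAIN, pick_build, and the caller-supplied feature list are all hypothetical names for this example.

```shell
# Hypothetical preference order, best first; names mirror the binaries in this thread.
CHAIN="avx512bw avx2 avx sse41 sse2"

pick_build() {
    # $1 = space-separated CPU feature names the machine supports
    #      (supplied by the caller; detection is out of scope here).
    for level in $CHAIN; do
        case " $1 " in
            *" $level "*) echo "john-$level"; return 0 ;;
        esac
    done
    return 1   # nothing in the chain is supported
}

pick_build "sse2 sse41 avx avx2"   # prints john-avx2
pick_build "sse2 sse41"            # prints john-sse41
```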

@claudioandre-br
Member

claudioandre-br commented May 17, 2019

Done!

The Windows release "reloaded" is available at: https://ci.appveyor.com/project/claudioandre-br/johntheripper/builds/24630991

  • 64bits only;
  • it contains the avx512bw binaries (OMP and non-OMP);
  • it contains a lightweight john.exe that handles the fallback;

The fallback is tested by CI itself (avx512bw->avx2), but the avx512bw binary deserves real testing: I have no idea how the avx512bw Windows binary behaves in the real world.

I added only avx512bw because we are very close to the CI limits (I shouldn't, or rather can't, add more stuff).

@solardiz
Member Author

* 64bits only

I guess this is temporary, just for the test build? We also need Win32 builds for our releases.

Regarding AVX-512 support in 32-bit builds, on one hand I doubt there's a 32-bit Windows that supports AVX-512 on context switches, but on the other hand someone might mistakenly install a 32-bit build that we release on 64-bit Windows on AVX-512 capable hardware. Do we care about having such installs use AVX-512?

@claudioandre-br
Member

claudioandre-br commented May 18, 2019

I guess this is temporary, just for the test build? We also need Win32 builds for our releases.

I removed all 32-bit testing, but I will still be able to release for 32-bit.

Do we care about having such installs use AVX-512?

I won't build AVX512BW for 32-bit (see #3962). I will only build AVX512BW for 64-bit.


Note to self: AVX512F versus AVX512BW (from a Linux machine).

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSE2 2x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	1026 c/s real, 523 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSE2 2x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	607 c/s real, 608 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSSE3 2x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	1189 c/s real, 596 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSSE3 2x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	617 c/s real, 617 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSE4.1 2x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	1188 c/s real, 595 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 SSE4.1 2x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	611 c/s real, 611 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 AVX 2x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	1612 c/s real, 807 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 AVX 2x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	770 c/s real, 771 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 256/256 AVX2 4x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	3130 c/s real, 1567 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 256/256 AVX2 4x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	1639 c/s real, 1641 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512F 8x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	7445 c/s real, 3730 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512F 8x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	3635 c/s real, 3635 c/s virtual

Will run 2 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512BW 8x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	7782 c/s real, 3891 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512BW 8x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	3755 c/s real, 3755 c/s virtual
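
For a rough reading of the last two pairs of results, the relative gain of AVX512BW over AVX512F can be computed from the "real" c/s figures above. This is only arithmetic on a single run from one machine, not a conclusion; gain_percent is a hypothetical helper name:

```shell
# Hypothetical helper: percentage gain of a new speed over a baseline.
# $1 = baseline c/s, $2 = new c/s
gain_percent() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

gain_percent 7445 7782   # 2xOMP: AVX512F vs AVX512BW, prints 4.5
gain_percent 3635 3755   # non-OMP: AVX512F vs AVX512BW, prints 3.3
```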

@solardiz
Member Author

I guess you meant to test 512BW in 2xOMP there, but didn't. Anyway, we first need to know which of our formats are expected to benefit from BW, then test those. @magnumripper Perhaps you can suggest specific formats to use for 512F vs. 512BW tests.

@claudioandre-br
Member

claudioandre-br commented May 18, 2019

I guess you meant to test 512BW in 2xOMP there

No, I did something wrong (fixed now).

@magnumripper
Member

(...) someone might mistakenly install a 32-bit build that we release on 64-bit Windows on AVX-512 capable hardware. Do we care about having such installs use AVX-512?

I don't think it's worth it.

@magnumripper Perhaps you can suggest specific formats to use for 512F vs. 512BW tests.

I believe the only difference between them (currently) is in swap32/swap64, so -BW probably doesn't gain very much (likely only noticeable for raw formats - unless it's hidden there too for other reasons).

@solardiz
Member Author

likely only noticeable for raw formats

Claudio shows a difference for sha512crypt, and I'm not too surprised it too might involve byte swaps - but we need to take a look at the code, or just run more benchmarks first to confirm there's a difference for that format. Anyway, I think we've decided on going 512BW for 64-bit Windows.

@magnumripper
Member

magnumripper commented May 19, 2019

The basic SHA512 function in simd-intrinsics.c only has swaps if we use it with "flat in" and/or "flat out" flags, that is, we feed it scalar buffers (it will also use scatter/gather instructions of course but they don't differ between -F and -BW).

The sha512crypt format indeed uses SSEi_FLAT_IN and also does some byte swaps on its own - but the latter is (ultimately) using __builtin_bswap64((x)) on scalars, so it will be the same speed on AVX-512F. I'm a bit surprised the gain is that big for BW, but that's a good thing of course. Now, I wonder if it would be possible to avoid those scalar swaps by using SSEi_FLAT_OUT in the very last SHA512 call; we should look into that. Apparently Jim wrote the SIMD support.

@claudioandre-br
Member

Closing for now. Everything is done.
