descrypt-opencl: Compile per-salt kernels using OpenMP #1618

Open
magnumripper opened this issue Aug 6, 2015 · 26 comments

@magnumripper
Member

This is for HARDCODE_SALT of course.

@Sayantan2048 this should be fairly trivial and would speed up the initial build of all salts' kernels a whole lot on multicore systems (e.g. super would do it 32x faster). However, you'd probably need to move some code around.

What do you think?

@magnumripper magnumripper added this to the 1.8.0-jumbo-2 milestone Aug 6, 2015
@sayan1an
Contributor

sayan1an commented Aug 7, 2015

It's doable, but I don't expect a speedup of 32x or anywhere near it, due to disk I/O limitations. I'll move the code around anyway because this format doesn't support LWS autotune. Also, if we are to build all kernels, we'd do it in reset() before the actual cracking starts, so it should be easy.

@magnumripper
Member Author

Even if it's not 32x, I'm sure it will be several hundred percent faster. Building JtR with or without -j32 makes a huge difference, and this is nearly the same thing.

You could add a macro to opencl_DES_hst_dev_shared.h to make it optional (in case there are problems on some systems, as Solar imagined).

#define HARDCODE_SALT          0
#define PARALLEL_BUILD         0
#define FULL_UNROLL            0
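
A minimal sketch of how such a switch might gate the build loop. `build_salt_kernel()` and `build_all_salts()` are hypothetical names, not functions from the JtR tree, and the stub only stands in for the real OpenCL compile. Without `-fopenmp` the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>

#define PARALLEL_BUILD 1        /* assumption: set to 1 here to exercise the pragma */
#define NUM_SALTS      4096     /* 64 * 64 possible descrypt salts */

/* Hypothetical stand-in for compiling one salt-specific kernel.
 * Returns a fake "program handle" derived from the salt. */
static int build_salt_kernel(int salt)
{
    assert(salt >= 0 && salt < NUM_SALTS);
    return salt + 1;
}

/* Fill handles[0..NUM_SALTS-1]; each iteration writes only its own slot. */
void build_all_salts(int *handles)
{
    int salt;

#if PARALLEL_BUILD
#pragma omp parallel for schedule(dynamic)
#endif
    for (salt = 0; salt < NUM_SALTS; salt++)
        handles[salt] = build_salt_kernel(salt);
}
```

Setting `PARALLEL_BUILD` to 0 removes the pragma at preprocessing time, which is the "optional" fallback for systems where parallel builds misbehave.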

@magnumripper
Member Author

Bump! Now even more kernels seem to be used just for -test, and they don't even seem to be cached to disk? I had to cancel it after it compiled 45 kernels.

All formats except yours do cache binaries, except on NVIDIA devices, which have caching in the driver. Regardless, I really think parallelizing builds with OpenMP would be a good idea.

@magnumripper
Member Author

Hmm, I see that on super's Tahiti only 9 kernels were built for -test, and on Titan X, 12 were built. On my MacBook using Intel HD Graphics 5000, like I said, I aborted it after 45... Why the difference?

@magnumripper
Member Author

Oh, maybe it's because of the "autotune fail"...

../run/john -test -form:descrypt-opencl
Device 1: Iris
Benchmarking: descrypt-opencl, traditional crypt(3) [DES OpenCL]... Possible auto_tune fail!!.
Salt compiled from Source:1
Salt compiled from Source:2
Salt compiled from Source:3
(...)
Salt compiled from Source:43
Salt compiled from Source:44
Salt compiled from Source:45
^CSession aborted

@sayan1an
Contributor

sayan1an commented Oct 1, 2015

Are clCreateProgramWithSource and clGetProgramBuildInfo thread-safe? Even if they are, they operate on one variable, 'program[sequential_id]', which makes it unsafe.

@magnumripper
Member Author

In OpenCL 1.1, all functions are thread-safe except for clSetKernelArg(). But that one is quick, so we'd just call it sequentially.
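
That split could look roughly like the sketch below: a parallel phase for the expensive builds, followed by a serial loop for the one call that is not thread-safe. `slow_build()` and `set_arg()` are hypothetical placeholders for `clBuildProgram()` and `clSetKernelArg()`; no OpenCL calls are made here.

```c
#include <assert.h>

enum { N_KERNELS = 64 };

static int built[N_KERNELS];    /* one slot per kernel, written by one iteration only */
static int args_set;            /* shared counter, touched only in the serial phase */

/* Placeholder for the expensive, thread-safe build step. */
static int slow_build(int id) { return id * 2 + 1; }

/* Placeholder for the cheap step that must not run concurrently. */
static void set_arg(int id)
{
    assert(built[id] != 0);     /* the build must have finished first */
    args_set++;
}

int build_then_bind(void)
{
    int i;

    args_set = 0;

    /* Phase 1: builds are independent, so they may run in parallel. */
#pragma omp parallel for
    for (i = 0; i < N_KERNELS; i++)
        built[i] = slow_build(i);

    /* Phase 2: plain sequential loop, entered after the implicit barrier. */
    for (i = 0; i < N_KERNELS; i++)
        set_arg(i);

    return args_set;
}
```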

@sayan1an
Contributor

sayan1an commented Oct 2, 2015

I think the program object needs to be thread-safe, i.e. 'program[sequential_id][thread_id]'. Otherwise we can't call clBuildProgram in parallel.

@magnumripper
Member Author

You may be right. We could probably work around it if we want to but it might be more complex than I hoped.

@sayan1an
Contributor

sayan1an commented Oct 2, 2015

Apart from this, the include_source() function is not thread-safe. We must also ensure that kernel_source is not modified while a kernel is being built, and that no other instance of opencl_read_source() is running in parallel.

@magnumripper
Member Author

Our own functions are no problem. I'll just make them thread-safe.

@sayan1an
Contributor

sayan1an commented Oct 2, 2015

I just made some changes required for parallel build and thread safety. Please review 0dbaf3a.

@magnumripper
Member Author

I'll try to digest it. I was thinking we could pass a (thread-local) buffer (or rather a pointer) to opencl_read_source() and then pass the same pointer to opencl_build() (or vice versa, something along those lines).
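
One way to picture that idea: the reader fills a caller-owned buffer and the builder consumes the same pointer, so no global `kernel_source` is shared between threads. The names `read_source_into()`, `build_from()` and `build_range()` are hypothetical, and the "source" is synthesized with `snprintf()` rather than read from disk.

```c
#include <stdio.h>
#include <string.h>

#define SRC_MAX 256

/* Hypothetical reader: writes the per-salt source into buf (caller-owned).
 * Returns 0 on success, -1 if the buffer is too small. */
static int read_source_into(char *buf, size_t size, int salt)
{
    int n = snprintf(buf, size, "#define SALT %d\n__kernel void k(void) {}\n", salt);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical builder: only looks at the buffer it is handed.
 * Returns the source length as a stand-in for a program handle. */
static size_t build_from(const char *buf)
{
    return strlen(buf);
}

/* Each iteration owns its buffer on the stack, so iterations share no state. */
void build_range(size_t *out, int n)
{
    int salt;

#pragma omp parallel for
    for (salt = 0; salt < n; salt++) {
        char src[SRC_MAX];

        out[salt] = (read_source_into(src, sizeof(src), salt) == 0)
                        ? build_from(src) : 0;
    }
}
```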

The fact we had hard-coded program[0] was kind of funny :fail:

@sayan1an
Contributor

sayan1an commented Oct 2, 2015

program[0] is a local object, not the global one. We'll be passing a thread-local program buffer to opencl_build() and build_from_binary().

@magnumripper
Member Author

As far as I can see it's a global, declared as cl_program program[MAX_GPU_DEVICES];.

@sayan1an
Contributor

sayan1an commented Oct 3, 2015

Unfortunately, we have the same name for the global program object and the pointer to a program object (as a function argument). I plan on using a program object supplied by the format for descrypt-opencl. All other formats would keep using the global program object declared in common-opencl.
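
The naming clash is ordinary C shadowing: inside a function, a parameter called `program` hides the file-scope array of the same name. A small sketch, with `int` standing in for `cl_program`; the function names and the value of `MAX_GPU_DEVICES` are assumptions for illustration only.

```c
#define MAX_GPU_DEVICES 8       /* assumption: illustrative value only */

int program[MAX_GPU_DEVICES];   /* file-scope object, as in the shared code */

/* The parameter shadows the global: this writes to the caller's storage. */
void build_into(int *program, int value)
{
    program[0] = value;
}

/* No parameter of that name, so this refers to the global array. */
void build_global(int dev, int value)
{
    program[dev] = value;
}
```

So `program[0]` inside `build_into()` means "first element of whatever the caller passed", which is why it can look hard-coded without touching the global.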

@sayan1an
Contributor

sayan1an commented Oct 4, 2015

@magnumripper I have implemented the parallel build in 830f6a0. You can turn it on by setting PARALLEL_BUILD to 1 in opencl_DES_hst_dev_shared.h. However, we should really make the path_expand() function thread-safe in order to reduce the number of critical sections and speed up the build process.
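
The cost being described can be sketched as follows, assuming the helper returns a pointer to static storage (which is what would make it unsafe to call concurrently). Such a helper has to be called and copied out under a critical section, whereas a variant writing into a caller-supplied buffer would not need one. `expand_static()` and `expand_copy()` are hypothetical stand-ins, not the JtR functions, and the `run/` prefix is made up.

```c
#include <stdio.h>
#include <string.h>

#define PATH_MAX_LEN 128

/* Hypothetical non-reentrant helper: result lives in a shared static buffer. */
static const char *expand_static(const char *name)
{
    static char path[PATH_MAX_LEN];

    snprintf(path, sizeof(path), "run/%s", name);
    return path;
}

/* Thread-safe wrapper: serialize the call and copy the result out
 * before another thread can overwrite the static buffer. */
void expand_copy(const char *name, char *out, size_t size)
{
    if (size == 0)
        return;

#pragma omp critical(path_expand_lock)
    {
        strncpy(out, expand_static(name), size - 1);
        out[size - 1] = '\0';
    }
}
```

Every thread queues on the same named lock here, which is the serialization a reentrant version would avoid.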

@magnumripper
Member Author

Cool, I'll try it out.

@magnumripper
Member Author

I used Solar's way of pre-compiling all 4096 salts to benchmark this. While the format works fine with other test files, this one makes it segfault:

$ perl -e '$c64 = "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"; foreach $c1 (split //, $c64) { foreach $c2 (split //, $c64) { print "$c1$c2...........\n"; } }' > pw-fakedes

$ head pw-fakedes 
.............
./...........
.0...........
.1...........
.2...........
.3...........
.4...........
.5...........
.6...........
.7...........

$ ../run/john pw-fakedes -form:descrypt-opencl -dev=2
Device 2: Tahiti [AMD Radeon HD 7900 Series]
Using default input encoding: UTF-8
Loaded 4096 password hashes with 4096 different salts (descrypt-opencl, traditional crypt(3) [DES OpenCL])
Salt compiled from Source:910
Salt compiled from Source:910
Salt compiled from Source:2275
Salt compiled from Source:990
Salt compiled from Source:0
Segmentation fault

$ ../run/john pw-fakedes -form:descrypt-opencl -dev=6
Device 6: GeForce GTX TITAN X
Using default input encoding: UTF-8
Loaded 4096 password hashes with 4096 different salts (descrypt-opencl, traditional crypt(3) [DES OpenCL])
Salt compiled from Source:910
Salt compiled from Source:910
Salt compiled from Source:2275
Salt compiled from Source:990
Salt compiled from Source:0
Segmentation fault

@magnumripper
Member Author

$ gdb --args ../run/john pw-fakedes -form:descrypt-opencl -dev=6
(gdb) r
Starting program: /home/magnum/src/john/run/john pw-fakedes -form:descrypt-opencl -dev=6
Device 6: GeForce GTX TITAN X
Using default input encoding: UTF-8
Loaded 4096 password hashes with 4096 different salts (descrypt-opencl, traditional crypt(3) [DES OpenCL])
Salt compiled from Source:910
Salt compiled from Binary:910
Salt compiled from Binary:2275
Salt compiled from Binary:990
Salt compiled from Binary:0
Program received signal SIGSEGV, Segmentation fault.
0x000000000074fc23 in remove_duplicates_64 (num_loaded_hashes=1, hash_table_size=128, verbosity=0)
    at bt_hash_type_64.c:440
440             loaded_hashes_64[i] = loaded_hashes_64[num_unique_hashes];
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6_6.7.x86_64 gmp-4.3.1-7.el6_2.2.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-37.el6_6.x86_64 libX11-1.6.0-2.2.el6.x86_64 libXau-1.0.6-4.el6.x86_64 libXext-1.3.2-2.1.el6.x86_64 libXinerama-1.1.3-2.1.el6.x86_64 libcom_err-1.41.12-21.el6.x86_64 libgcc-4.4.7-11.el6.x86_64 libgomp-4.4.7-11.el6.x86_64 libselinux-2.0.94-5.8.el6.x86_64 libstdc++-4.4.7-11.el6.x86_64 libxcb-1.9.1-2.el6.x86_64 mesa-libGL-10.1.2-2.el6.x86_64 nss-softokn-freebl-3.14.3-22.el6_6.x86_64 numactl-2.0.9-2.el6.x86_64 opencl-1.2-intel-cpu-3.1.1.11385-1.x86_64 opencl-1.2-intel-mic-3.1.1.11385-1.x86_64 openssl-1.0.1e-30.el6_6.11.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x000000000074fc23 in remove_duplicates_64 (num_loaded_hashes=1, hash_table_size=128, 
    verbosity=0) at bt_hash_type_64.c:440
#1  0x000000000074cd62 in create_perfect_hash_table (htype=<value optimized out>, 
    loaded_hashes_ptr=<value optimized out>, num_ld_hashes=1, offset_table_ptr=0x7fffffffa0d0, 
    offset_table_sz_ptr=0x7fffffffa0d8, hash_table_sz_ptr=0x7fffffffa0dc, verb=0) at bt.c:680
#2  0x00000000005de2a2 in fill_buffer (salt=<value optimized out>, 
    max_uncracked_hashes=<value optimized out>, max_hash_table_size=0xd0c2a4)
    at opencl_DES_bs_plug.c:224
#3  0x00000000005de842 in build_tables (db=<value optimized out>) at opencl_DES_bs_plug.c:423
#4  0x00000000005d678f in reset (db=0xdb3da0) at opencl_DES_bs_f_plug.c:645
#5  0x00000000007053d7 in john_run () at john.c:1587
#6  0x0000000000705bc7 in main (argc=4, argv=0x7fffffffe3d8) at john.c:1883
(gdb) 

@magnumripper
Member Author

Hmm @Sayantan2048 is the problem that we have 4096 unique salts but only 1 unique binary?

@magnumripper
Member Author

No, that doesn't seem to be it.

@magnumripper
Member Author

BTW note that I did not even enable OpenMP builds yet! I was going to make a baseline first.

@sayan1an
Contributor

sayan1an commented Oct 7, 2015

On Wed, Oct 7, 2015 at 3:13 AM, magnum notifications@github.com wrote:

BTW note that I did not even enable OpenMP builds yet!


Another edge case issue!! Binaries are all zero!! These tables use zero to mark invalid/duplicate hashes.
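
To illustrate the edge case with a toy (this is not the code in bt_hash_type_64.c): when 0 marks an invalid or discarded entry, a genuine value of 0 cannot be told apart from "no entry", so a dedupe pass that filters on 0 also drops every all-zero binary. The function name and layout below are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy dedupe: duplicates are overwritten with 0, then non-zero entries are
 * compacted to the front. Returns the number of entries kept.
 * O(n^2) for brevity. */
size_t dedupe_zero_sentinel(uint64_t *h, size_t n)
{
    size_t i, j, kept = 0;

    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            if (h[i] != 0 && h[i] == h[j])
                h[j] = 0;               /* mark duplicate as invalid */

    for (i = 0; i < n; i++)
        if (h[i] != 0)                  /* 0 means "invalid": real zeros vanish too */
            h[kept++] = h[i];

    return kept;
}
```

In this toy, a single all-zero input leaves zero entries kept, a state that later code may not be prepared for.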

@magnumripper
Member Author

LOL. OK, the following works better:

perl -e '$c64 = "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"; foreach $c1 (split //, $c64) { foreach $c2 (split //, $c64) { print "$c1${c2}Nf8Sbh3HDfQ\n"; } }' > pw-fakedes
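
For reference, the enumeration order of the two nested Perl loops can be expressed in C as an index-to-salt mapping. This is only a sketch; `salt_from_index()` is a hypothetical helper, not something in the JtR tree.

```c
#include <assert.h>

/* Same 64-character alphabet as $c64 in the Perl one-liner. */
static const char c64[] =
    "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

/* Write the two-character salt for index idx (0..4095) into out[0..2]. */
void salt_from_index(int idx, char out[3])
{
    assert(idx >= 0 && idx < 64 * 64);
    out[0] = c64[idx / 64];     /* outer loop variable $c1 */
    out[1] = c64[idx % 64];     /* inner loop variable $c2 */
    out[2] = '\0';
}
```

Indices 0, 1 and 2 give `..`, `./` and `.0`, matching the first lines of the `head pw-fakedes` output earlier in the thread.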

@magnumripper magnumripper modified the milestones: 1.8.0-jumbo-2, 1.8.0-jumbo-3 Nov 12, 2015
@magnumripper magnumripper modified the milestones: 1.8.0-jumbo-3, 1.8.0-jumbo-2 Nov 12, 2015
@magnumripper
Member Author

It really shouldn't segfault though, that's nasty.
