
Nvidia busy-wait workaround #4781

Closed
asaidac opened this issue Aug 9, 2021 · 15 comments
Labels: enhancement, RFC / discussion, Help or comments wanted

Comments

@asaidac commented Aug 9, 2021

Hello,

Not sure if JtR is still maintained - but I'll try anyway.
I have a headless rig with 4 GPUs and a CPU with 4 physical cores.

I'm using OpenCL with "-dev=1,2,3,4", where -dev takes the GPU numbers assigned by JtR.
The GPUs are working well - the problem is the CPU.

During the run, all 4 cores are loaded to about 100%.
I know that part of the job is done by the CPU - but it shouldn't be that kind of load.

I suspect that
cudaThreadSynchronize() is effectively a spin lock which polls the GPU at a rather high frequency, waiting until the GPU kernel is finished. Because the CPU thread is just sitting in a polling loop, it isn't actually doing much work - just using unnecessary power.

Any ideas how to fix this?

Thanks

Checklist

  • 🥇 I've read and understood these instructions;
    • This is not a support forum, it's a bug tracker. For questions and support, review postings on the john-users mailing list.
  • 👍 I've tested using latest bleeding version from this repository.
    • Be clear about your environment and what you are doing. Share a sample hash or file that can be used to reproduce.
  • 😕 I'm confused and I need guidance.

IMPORTANT

We expect only reports of issues with the latest revision found in this GitHub repository. We do not expect in here, and have little use for, reports of issues only seen in a release or in a distro package.

Attach details about your OS and about john, including:

  • The output of ./john --list=build-info.
  • The command line you are using.
@claudioandre-br (Member)

If you are using NVIDIA (which seems to be the case), this is a known issue (examples):

BTW: the workaround looks ugly.

@claudioandre-br (Member)

@solardiz this sounds bad:

Not sure if the JtR is still maintained.

We are old and tired but not dead.


BTW: I think we will close this issue shortly; it is known, and it is "by design" (?) from NVIDIA itself.

@asaidac (Author) commented Aug 9, 2021 via email

@asaidac (Author) commented Aug 9, 2021 via email

@magnumripper (Member)

why not moving to CUDA (since currently OpenCL is deprecated)

If you are willing to spend a couple thousand hours coding CUDA for us, you are very welcome to do so! 😉
We had CUDA along with OpenCL years ago, but since CUDA is proprietary and we lacked volunteers, we ended up with OpenCL only. Not sure why you say it's deprecated?

there is a piece of code that can even fix the OpenCL problem

I might experiment with that some time. I never cared much about it, though.

In my case I have 4 cores ALL at 100%

Sure, one per GPU. It's stupid, but it's not the end of the world - you have a multitasking OS.

magnumripper self-assigned this Aug 9, 2021
@claudioandre-br (Member)

there is a piece of code that can even fix the OpenCL

  • Is there a fix or a workaround?
  • CUDA was removed from JtR years ago.
  • And JtR is maintained - and it is still used by many people.

Here you can find a discussion about the topic.

magnumripper changed the title from "CPU empty spinning cycles" to "Nvidia busy-wait workaround" Aug 9, 2021
@asaidac (Author) commented Aug 9, 2021 via email

@claudioandre-br (Member)

Solar is going to complain, but that is too much.

Do you know who should I contact on the development team?

Are you asking magnum this question? Is that a joke? A Millennial(ness)? If you need paid support, you can contact Openwall - see the 3rd line below.

$ john
John the Ripper 1.9.0-jumbo-1+bleeding-2e6eba4 2021-07-07 17:16:06 +0200 OMP [linux-gnu 64-bit x86_64 AVX2 AC]
Copyright (c) 1996-2021 by Solar Designer and others
Homepage: https://www.openwall.com/john/

Usage: john [OPTIONS] [PASSWORD-FILES]

Use --help to list all available options.

@claudioandre-br (Member)

Just to reinforce (9 years ago, on john-dev, the complaint was about CUDA):

  1. "This is just an observation: somehow with CUDA we're fully wasting an entire CPU core per GPU, whereas with OpenCL we're only using some CPU time on one core."
  2. We found reports like this: "Increased CPU usage with last drivers starting from 270.xx".
  3. A workaround might "decrease latency when waiting for the device, but may lower the performance of CPU threads" - something we prefer to avoid.

@solardiz (Member) commented Aug 9, 2021

Hi @asaidac. John the Ripper is maintained. The fact that we might not fix this one issue doesn't mean the project as a whole is unmaintained. You don't need to contact anyone in particular - you've already contacted us here.

As to fixing this issue, ideally NVIDIA would. As far as we're currently aware, we can only do one of these things:

  1. Do nothing, leave everything as-is.
  2. Implement a workaround in our code, by estimating the kernel invocation duration and sleeping for most of this time, only calling the OpenCL API that on NVIDIA ends up busy-waiting when the time is almost up (a rough sketch of this idea follows this list). BTW, I wasn't aware other projects were doing this already, but per a link Claudio posted above, apparently hashcat was doing it in 2016? In my own experience with hashcat, it too uses 100% of a logical CPU per OpenCL device. However, recent versions of hashcat reintroduced CUDA support. So when CUDA works (in my experience, in some GPU hosting providers' containers only OpenCL works, and CUDA fails), hashcat avoids that problem.
  3. We can reintroduce CUDA support. Most likely, in a way similar to the way hashcat reintroduced it - avoiding having to maintain separate kernel sources for OpenCL and CUDA, instead writing them in a language that fits both (luckily, OpenCL and CUDA aren't too dissimilar, making this practical). As I understand, only OpenCL vs. CUDA specific include files would need to be separate.
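
For illustration, here is a minimal sketch of option 2 against the plain OpenCL event API. The function names, the 90% sleep fraction, and the 0.1 ms poll interval are all assumptions made up for this example, not our actual implementation:

```c
/* Sketch of the sleep-then-poll workaround; all names and constants here
 * are illustrative assumptions, not JtR code. */
#include <time.h>
#include <CL/cl.h>

static void sleep_ns(long ns)
{
	struct timespec ts = { ns / 1000000000L, ns % 1000000000L };
	nanosleep(&ts, NULL);
}

/* Wait on the kernel's event without spinning inside the NVIDIA runtime:
 * sleep through most of the estimated duration, then poll cheaply. */
static void nice_wait(cl_event ev, long estimated_ns)
{
	cl_int status = CL_QUEUED;

	sleep_ns(estimated_ns * 9 / 10);	/* sleep ~90% of expected time */

	while (clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
	                      sizeof(status), &status, NULL) == CL_SUCCESS &&
	       status > CL_COMPLETE)
		sleep_ns(100000);		/* 0.1 ms between polls */
}
```

The caller would feed estimated_ns from a measurement of the previous run of the same kernel (e.g. clock_gettime() around the wait), adapting as auto-tuning changes the work size.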

We don't currently intend to go and reintroduce CUDA support. Definitely not before the next release. Maybe later. However, I don't mind us experimenting with the workaround.

BTW, another workaround, and one a user can use (with taskset on Linux), is to bind all of those processes to the same CPU core, or at least to fewer cores. This will probably reduce efficiency of GPU usage, though - except when you still allocate one logical CPU (one hardware thread) per GPU. So e.g. if you have a 4-core 8-thread CPU and 4 GPUs, you can bind the 4 processes to 4 logical CPUs that correspond to 2 physical cores, leaving the other 2 physical cores completely idle. This should halve the power consumption without a slowdown.
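
For example, on a common Linux enumeration where logical CPUs 0/4 and 1/5 are the two hardware threads of the first two physical cores (verify with lscpu --extended - the numbering varies between systems), something like the following confines everything to 2 cores; the rest of the command line is just a placeholder:

```
$ taskset -c 0,4,1,5 ./john --devices=1,2,3,4 [other options] hashes.txt
```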

@solardiz (Member) commented Aug 9, 2021

using 4 cores at 100% = 4x 35w

I guess you can also lower that by forcing those cores into a lower power state (lower clock rate). Something like cpupower frequency-set -g powersave (requires certain kernel modules loaded), or in several other ways. See e.g. https://wiki.archlinux.org/title/CPU_frequency_scaling
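
For example (assuming the cpupower utility and a suitable cpufreq driver are installed; the available governors depend on the driver):

```
$ cpupower frequency-info                  # show current driver, governor, limits
$ sudo cpupower frequency-set -g powersave
```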

Of course, ideally the issue wouldn't exist in NVIDIA OpenCL, or would be worked around in JtR, so I am not posting the above suggestions about taskset and cpupower as excuses for NVIDIA and for us not doing things better. However, those are in fact additional workarounds you can use to mitigate the impact of this issue.

solardiz added the enhancement, RFC / discussion, and Help or comments wanted labels Aug 9, 2021
@magnumripper (Member) commented Aug 10, 2021

In a long run, usually 8 days, that is cca. 24 kWh WASTED energy

I hear you and fixing this, in any way, would be "timely" and better for our battered environment. The best (and easiest, and most effective for the world) would be for Nvidia to fix their shit.

We can reintroduce CUDA support. Most likely, in a way similar to the way hashcat reintroduced it - avoiding having to maintain separate kernel sources for OpenCL and CUDA, instead writing them in a language that fits both (luckily, OpenCL and CUDA aren't too dissimilar, making this practical). As I understand, only OpenCL vs. CUDA specific include files would need to be separate.

I wasn't aware of such dual-use kernels; that's conceptually very interesting. But it only solves the kernel side - I think hashcat has a whole lot more shared code for formats, whereas we'd need to write full CPU-side code for CUDA for each of our ~89 GPU formats. Obviously we would add an abstraction layer instead, but that would still be a rewrite of 89 formats (OTOH, it would pay off for every future format we add).
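
For what it's worth, the dual-use trick presumably boils down to a small compatibility header mapping the dialect-specific qualifiers to shared macros. A hypothetical sketch (the macro names are made up; this is not hashcat's actual scheme):

```c
/* Hypothetical compatibility header for kernels shared between OpenCL
 * and CUDA (e.g. via NVRTC); macro names are illustrative only. */
#ifdef __OPENCL_VERSION__
#define KERNEL		__kernel
#define GLOBAL		__global
#define GLOBAL_ID	get_global_id(0)
#else /* compiled as CUDA */
#define KERNEL		extern "C" __global__
#define GLOBAL		/* CUDA needs no address-space qualifier here */
#define GLOBAL_ID	(blockIdx.x * blockDim.x + threadIdx.x)
#endif

/* A kernel written once against these macros compiles in both dialects
 * (bounds checks omitted for brevity): */
KERNEL void xor_mask(GLOBAL unsigned int *buf, unsigned int mask)
{
	buf[GLOBAL_ID] ^= mask;
}
```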

If we could afford pulling off a GSoC again (provided they still run it?), it would be an excellent task: primarily, write all the shared code and move at least one format; secondarily, move more or all formats.

@asaidac (Author) commented Aug 10, 2021 via email

@magnumripper (Member)

OK I'm closing this, but we still have #4363 for possibly doing something about it.

@solardiz (Member)

@asaidac What JtR format(s) did you (or still do) need this for? We've just started working around this issue in #4944 for tezos-opencl, with some other formats probably to follow.
