Nvidia busy-wait workaround #4781

asaidac · 2021-08-09T16:31:18Z

Hello,

Not sure if the JtR is still maintained - but I'll try anyway.
I have a headless 4 GPUs rig, with a 4 physical cores CPU

Using OpenCL "-dev=1,2,3,4" where -dev are the GPU's assigned numbers by JtR.
The GPU is working well - the problem is the CPU.

During the run all 4 cores are loaded to cca. 100%.
I know that part of the job is done by CPU - but not a that kind of load.

I am suspecting that
cudaThreadSynchronize() is effectively a spin lock which polls the GPU at rather high frequency, waiting until the GPU kernel is finished. Because the CPU thread is just sitting in a polling loop, it actually isn’t doing much work - just using unnecessary power.

Any ideas how to fix this?

Thanks

Checklist

🥇 I've read and understood these instrucions;
- This is not a support forum, it's a bug tracker. For questions and support, review postings on the john-users mailing list.
👍 I've tested using latest bleeding version from this repository.
- Be clear about your environment and what you are doing. Share a sample hash or file that can be used to reproduce.
😕 I'm confused and I need guidance.
- Please, read the instructions at https://www.openwall.com/john/#lists, then join the list before posting.

IMPORTANT

We expect only reports of issues with the latest revision found in this GitHub repository. We do not expect in here, and have little use for, reports of issues only seen in a release or in a distro package.

Attach details about your OS and about john, including:

The output of ./john --list=build-info.
The command line you are using.

The text was updated successfully, but these errors were encountered:

claudioandre-br · 2021-08-09T16:49:32Z

If you are using NVIDIA (it seems to be the case), this is a known issue (examples):

BTW: the workaround looks ugly.

claudioandre-br · 2021-08-09T16:54:28Z

@solardiz this sounds bad:

Not sure if the JtR is still maintained.

We are old and tired but not dead.

BTW: I think we will close the issue shortly, it is known and it is "by design" (?) from NVIDIA itself.

asaidac · 2021-08-09T17:28:50Z

Hello I know about Nvidia OpenCL problem. My question - if JtR is still maintained - is why not moving to CUDA (since currently OpenCL is deprecated.) BTW there is a piece of code that can even fix the OpenCL problem if anyone cares about it

…

On Mon, Aug 9, 2021, 12:49 Claudio André ***@***.***> wrote: If you are using NVIDIA (it seems to be the case), this is a known issue (examples): - https://forums.developer.nvidia.com/t/opencl-busy-wait-still-not-fixed/46441 - openmm/openmm#1541 <openmm/openmm#1541> BTW: the workaround looks ugly. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#4781 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ALB37M7TJBNL4YYKX7R3I5LT4ABKPANCNFSM5B2L4OIQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email> .

asaidac · 2021-08-09T17:31:56Z

Note: in one of the filed bugs - OpenCL will take over one core at 100% In my case I have 4 cores ALL at 100%

…

On Mon, Aug 9, 2021, 12:49 Claudio André ***@***.***> wrote: If you are using NVIDIA (it seems to be the case), this is a known issue (examples): - https://forums.developer.nvidia.com/t/opencl-busy-wait-still-not-fixed/46441 - openmm/openmm#1541 <openmm/openmm#1541> BTW: the workaround looks ugly. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#4781 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ALB37M7TJBNL4YYKX7R3I5LT4ABKPANCNFSM5B2L4OIQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email> .

magnumripper · 2021-08-09T17:52:19Z

why not moving to CUDA (since currently OpenCL is deprecated

If you are willing to spend a couple thousand hours coding CUDA for us, you are very welcome to do so! 😉
We had CUDA along with OpenCL years ago but since CUDA is proprietary and we lacked volunteers we ended up with OpenCL only. Not sure why you say it's deprecated?

there is a piece of code that can even fix the OpenCL problem

I might experiment with that some time. I never cared a lot though.

In my case I have 4 cores ALL at 100%

Sure, one per GPU. It's stupid but it's not the end of the world, you have a multitasking OS.

claudioandre-br · 2021-08-09T17:52:34Z

there is a piece of code that can even fix the OpenCL

Is there a fix or a workaround?
CUDA was removed from JtR years ago.
And JtR is maintained. And it is still used by many people.

Here you can find a discussion about the toptic.

https://www.openwall.com/lists/john-dev/2012/05/16/12

asaidac · 2021-08-09T19:30:36Z

In my car using 4 cores at 100% = 4x 35w = 12oW => 2.88 KW / 24h In a long run usually 8 days that is cca 24 KW WAISTED energy + the power required by GPUs So, if the development for JtR is closed, I need to find another solution since JTR is too expensive to run. Do you know who should I contact on the development team?

…

On Mon, Aug 9, 2021 at 1:52 PM magnum ***@***.***> wrote: why not moving to CUDA (since currently OpenCL is deprecated If you are willing to spend a couple thousand hours coding CUDA for us, you are very welcome to do so! 😉 We had CUDA along with OpenCL years ago but since CUDA is proprietary and we lacked volunteers we ended up with OpenCL only. Not sure why you say it's deprecated? there is a piece of code that can even fix the OpenCL problem I might experiment with that some time. I never cared a lot though. In my case I have 4 cores ALL at 100% Sure, one per GPU. It's stupid but it's not the end of the world, you have a multitasking OS. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#4781 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ALB37M7WZTL6IJ63JPI7XVDT4AIV5ANCNFSM5B2L4OIQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email> .

claudioandre-br · 2021-08-09T19:46:35Z

Solar is going to complain, but that is too much.

Do you know who should I contact on the development team?

Are you asking magnum this question? Is that is a joke? A Millennial(ness)? If you need paid support you can contact Openwall. 3rd line below.

$ john
John the Ripper 1.9.0-jumbo-1+bleeding-2e6eba4 2021-07-07 17:16:06 +0200 OMP [linux-gnu 64-bit x86_64 AVX2 AC]
Copyright (c) 1996-2021 by Solar Designer and others
Homepage: https://www.openwall.com/john/

Usage: john [OPTIONS] [PASSWORD-FILES]

Use --help to list all available options.

claudioandre-br · 2021-08-09T19:52:35Z

Just to reinforce (9 year ago, john-dev, the complain is about CUDA):
1.

This is just an observation: somehow with CUDA we're fully wasting an
entire CPU core per GPU, whereas with OpenCL we're only using some CPU
time on one core:

We found that:

Like this Increased CPU usage with last drivers starting from 270.xx

A workaround might:

decrease latency when waiting for the device, but may lower the performance of CPU threads

Something we prefer to avoid.

solardiz · 2021-08-09T20:03:33Z

Hi @asaidac. John the Ripper is maintained. The fact that we might not fix this one issue doesn't mean the project as a whole is unmaintained. You don't need to contact anyone in particular - you've already contacted us here.

As to fixing this issue, ideally NVIDIA would. As we're currently aware, we can only do one of these things:

Do nothing, leave everything as-is.
Implement a workaround in our code, by estimating the kernel invocation duration and sleeping for most of this time, only calling the OpenCL API that on NVIDIA ends up busy-waiting when the time is almost up. BTW, I wasn't aware other projects were doing this already, but per a link Claudio posted above apparently hashcat was doing it in 2016? In my own experience with hashcat, it too uses 100% of a logical CPU per OpenCL device. However, recent versions of hashcat reintroduced CUDA support. So when CUDA works (in my experience, in some GPU hosting providers' containers only OpenCL works, and CUDA fails), then hashcat avoids that problem.
We can reintroduce CUDA support. Most likely, in a way similar to the way hashcat reintroduced it - avoiding having to maintain separate kernel sources for OpenCL and CUDA, instead writing them in a language that fits both (luckily, OpenCL and CUDA aren't too dissimilar, making this practical). As I understand, only OpenCL vs. CUDA specific include files would need to be separate.

We don't currently intend to go and reintroduce CUDA support. Definitely not before the next release. Maybe later. However, I don't mind us experimenting with the workaround.

BTW, another workaround, and one a user can use (with taskset on Linux), is to bind all of those processes to the same CPU core, or at least to fewer cores. This will probably reduce efficiency of GPU usage, though - except when you still allocate one logical CPU (one hardware thread) per GPU. So e.g. if you have a 4-core 8-thread CPU and 4 GPUs, you can bind the 4 processes to 4 logical CPUs that correspond to 2 physical cores, leaving the other 2 physical cores completely idle. This should halve the power consumption without a slowdown.

solardiz · 2021-08-09T20:17:36Z

using 4 cores at 100% = 4x 35w

I guess you can also lower that by forcing those cores into a lower power state (lower clock rate). Something like cpupower frequency-set -g powersave (requires certain kernel modules loaded), or in several other ways. See e.g. https://wiki.archlinux.org/title/CPU_frequency_scaling

Of course, ideally the issue wouldn't exist in NVIDIA OpenCL, or would be worked around in JtR, so I am not posting the above suggestions about taskset and cpupower as excuses for NVIDIA and for us not doing things better. However, those are in fact additional workarounds you can use to mitigate the impact of this issue.

magnumripper · 2021-08-10T12:18:24Z

In a long run usually 8 days that is cca 24 KW WAISTED energy

I hear you and fixing this, in any way, would be "timely" and better for our battered environment. The best (and easiest, and most effective for the world) would be for Nvidia to fix their shit.

We can reintroduce CUDA support. Most likely, in a way similar to the way hashcat reintroduced it - avoiding having to maintain separate kernel sources for OpenCL and CUDA, instead writing them in a language that fits both (luckily, OpenCL and CUDA aren't too dissimilar, making this practical). As I understand, only OpenCL vs. CUDA specific include files would need to be separate.

I wasn't aware of such dual-use kernels, that's conceptually very interesting. But it only solves kernel side - I think hashcat has a whole lot more shared code for formats whereas we'd need to write full CPU-side code for CUDA for each of our ~89 GPU formats. Obviously we would add an abstraction layer instead, but that would still be a rewrite of 89 formats (OTOH it would pay off for every future format we add).

If we could afford pulling off a GSoC again (provided they still run it?) it would be an excellent task: Primarily, write all shared code and move at least one format. Secondary, move more/all formats.

asaidac · 2021-08-10T12:44:52Z

Hello I consider this issue closed - I implement a solution that will alleviate the problem I WAS experienced. Since the CPU creates spin cycles / no actual work being done, and the fact the CPU will support OpenCL I've added the CPU to the device list Node numbers 1-5 of 5 (fork) Device 4: GeForce GTX 1080 Ti *Device 5: pthread-Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz* Device 3: GeForce GTX 1660 Device 2: GeForce GTX 1060 6GB Device 1: GeForce GTX 1080 In this configuration, the CPU does not make a big contribution to the general performance level but at least does not use empty power any longer. 1 0g 0:00:04:41 0.01% (8) (ETA: 2021-09-01 09:34) 0g/s 295710p/s 295710c/s 295710C/s Dev#1:57°C kt4jheaa..zi61heaa 3 0g 0:00:04:41 0.01% (8) (ETA: 2021-09-08 15:39) 0g/s 222116p/s 222116c/s 222116C/s Dev#3:60°C wy1xbbuu..q8z1bbuu 2 0g 0:00:04:42 0.01% (8) (ETA: 2021-09-18 15:39) 0g/s 165752p/s 165752c/s 165752C/s Dev#2:52°C 7tjo7sss..lovc7sss *5 0g 0:00:04:43 (8) 0g/s 567.9p/s 567.9c/s 567.9C/s v9t52222..b8t52222* 4 0g 0:00:04:43 0.02% (8) (ETA: 2021-08-26 14:17) 0g/s 401689p/s 401689c/s 401689C/s Dev#4:56°C rgoygzff..tk4fgzff I am tuning the GPUs for power control - hence the performance is 20-30% lower than the max. 20210810 08:35:40 0 159 57 - 89 4 0 0 4513 1822 20210810 08:35:40 1 101 51 - 100 2 0 0 3802 1708 20210810 08:35:40 2 100 60 - 79 3 0 0 4001 2115 20210810 08:35:40 3 174 55 - 88 3 0 0 5005 1708 John the Ripper 1.9.0-jumbo-1+bleeding-2e6eba49f 2021-07-07 17:16:06 +0200 OMP [linux-gnu 64-bit x86_64 AVX2 AC] Copyright (c) 1996-2021 by Solar Designer and others NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 Thank you for your time

…

On Tue, Aug 10, 2021 at 8:18 AM magnum ***@***.***> wrote: Most likely, in a way similar to the way hashcat reintroduced it - avoiding having to maintain separate kernel sources for OpenCL and CUDA, instead writing them in a language that fits both (luckily, OpenCL and CUDA aren't too dissimilar, making this practical). As I understand, only OpenCL vs. CUDA specific include files would need to be separate. I wasn't aware of such dual-use kernels, that's conceptually very interesting. But it only solves kernel side - I think hashcat has a whole lot more shared code for formats whereas we'd need to write full CPU-side code for CUDA for each of our ~89 GPU formats. Obviously we would add an abstraction layer instead, but that would still be a rewrite of 89 formats (OTOH it would pay off for every future format we add). If we could afford pulling off a GSoC again (provided they still run it?) it would be an excellent task: Primarily, write all shared code and move at least one format. Secondary, move more/all formats. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#4781 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ALB37M7JS3SWSXR73OPQULDT4EKJXANCNFSM5B2L4OIQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email> .

magnumripper · 2021-08-10T12:48:20Z

OK I'm closing this, but we still have #4363 for possibly doing something about it.

solardiz · 2021-12-14T21:08:40Z

@asaidac What JtR format(s) did (still do?) you need this for? We've just started working around this issue in #4944 for tezos-opencl, with some other formats probably to follow.

magnumripper self-assigned this Aug 9, 2021

magnumripper changed the title ~~CPU empty spinning cycles~~ Nvidia busy-wait workaround Aug 9, 2021

solardiz added enhancement RFC / discussion Help or comments wanted labels Aug 9, 2021

magnumripper mentioned this issue Aug 10, 2021

Use unified helper functions for creating and destroying OpenCL buffers/mappings - and/or for re-introducing CUDA #4363

Open

magnumripper removed their assignment Aug 10, 2021

magnumripper closed this as completed Aug 10, 2021

magnumripper mentioned this issue Jan 19, 2022

Avoid busy wait #5006

Merged

WincerChan mentioned this issue May 20, 2024

Ubuntu 20.04 + RTX 4090 (rented container) - Unlucky or bug? WincerChan/SolVanityCL#6

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nvidia busy-wait workaround #4781

Nvidia busy-wait workaround #4781

asaidac commented Aug 9, 2021

claudioandre-br commented Aug 9, 2021

claudioandre-br commented Aug 9, 2021

asaidac commented Aug 9, 2021 via email

asaidac commented Aug 9, 2021 via email

magnumripper commented Aug 9, 2021

claudioandre-br commented Aug 9, 2021

asaidac commented Aug 9, 2021 via email

claudioandre-br commented Aug 9, 2021

claudioandre-br commented Aug 9, 2021

solardiz commented Aug 9, 2021

solardiz commented Aug 9, 2021

magnumripper commented Aug 10, 2021 •

edited

Loading

asaidac commented Aug 10, 2021 via email

magnumripper commented Aug 10, 2021

solardiz commented Dec 14, 2021

Nvidia busy-wait workaround #4781

Nvidia busy-wait workaround #4781

Comments

asaidac commented Aug 9, 2021

Checklist

IMPORTANT

claudioandre-br commented Aug 9, 2021

claudioandre-br commented Aug 9, 2021

asaidac commented Aug 9, 2021 via email

asaidac commented Aug 9, 2021 via email

magnumripper commented Aug 9, 2021

claudioandre-br commented Aug 9, 2021

asaidac commented Aug 9, 2021 via email

claudioandre-br commented Aug 9, 2021

claudioandre-br commented Aug 9, 2021

solardiz commented Aug 9, 2021

solardiz commented Aug 9, 2021

magnumripper commented Aug 10, 2021 • edited Loading

asaidac commented Aug 10, 2021 via email

magnumripper commented Aug 10, 2021

solardiz commented Dec 14, 2021

magnumripper commented Aug 10, 2021 •

edited

Loading