
Workaround: Nvidia GPU OpenCL uses 100% CPU one core #1541

Closed
informatorius opened this issue Jul 16, 2016 · 20 comments

Comments

@informatorius
Contributor

informatorius commented Jul 16, 2016

When running OpenCL on an Nvidia GPU, one CPU core is used at 100%.
This is also the case with e.g. Folding@home.

Now another OpenCL tool, hashcat, has posted a workaround that reduces CPU usage to 10%.
Maybe this can also be applied to OpenMM?

> For example with the OpenCL runtime of NVidia, they still have a 5-year-old known bug which creates 100% CPU load on a single core per NVidia GPU (NVidia's OpenCL busy-wait). If you're using oclHashcat for quite a while you may remember the same bug happened to AMD years ago.
>
> Basically, what NVidia is missing here is that they use spinning instead of yielding. Their goal was to increase the performance, but in our case there's actually no gain from having a CPU-burning loop. The hashcat kernels run for ~100ms and that's quite a long time for an OpenCL kernel. At such a scale, spinning creates only disadvantages and there's no way to turn it off (only CUDA supports that).
>
> But why is this a problem? If the OpenCL runtime spins on a core to find out if a GPU kernel is finished, it creates 100% CPU load. Now imagine you have another OpenCL device, e.g. your CPU, also creating 100% CPU load: it will cause problems even if it's legitimate to do that here. The GPU's CPU-burning thread will slow down by 50%, and you end up with a slower GPU rate just by enabling your CPU too (--opencl-device-type 1). For AMD GPUs that's not the case (they fixed that bug years ago).
>
> To help mitigate this issue, I've implemented the following behavior:
>
> Hashcat will try to work around the problem by sleeping for some precalculated time after the kernel was queued and flushed. This will decrease the CPU load down to less than 10% with almost no impact on cracking performance.

https://github.com/hashcat/hashcat/releases/tag/v3.00

BOINC GPUGrid exposes this as the SWAN_SYNC option.

@informatorius informatorius changed the title Workaround?: Nvidia GPU OpenCL uses 100% CPU one core Workaround: Nvidia GPU OpenCL uses 100% CPU one core Jul 16, 2016
@Squall-Leonhart

I see pandegroup hasn't seen fit to make this change, a shame as it would free up cores for folding on.

I have a 12-thread Xeon and can only use 9 of them with 2 GPUs at work.

@jchodera
Member

It looks like the OpenCL workaround code is here. It monitors the execution time of the kernel and calls usleep with a fraction of this execution time in its own polling scheme. @peastman can comment on whether this is feasible to integrate into OpenMM's OpenCL platform---to me, it looks like this would require rewriting the whole Khronos-provided cl.hpp---but I can't imagine how this wouldn't disrupt performance when the sleep time fraction is large enough.

For the CUDA platform (which is coming to Folding@home in core22), it seems that NVIDIA has provided a cudaDeviceScheduleYield option that could be set via cudaSetDeviceFlags. It may be straightforward to let this be set as either an environment variable or CUDA platform property, which would allow us to yield on Folding@home core22 to be more friendly to foreground tasks.
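The CUDA-side knob is a one-line call made before the context is created. A hedged sketch (`cudaSetDeviceFlags` and the schedule flags are the real CUDA runtime API; which flag core22 should pick is exactly the open question here):

```cpp
#include <cuda_runtime.h>

int main() {
    // Must be called before the CUDA context is created for this process.
    // cudaDeviceScheduleYield: keep polling, but yield the core between polls
    //   (friendlier to foreground tasks, still shows some CPU use).
    // cudaDeviceScheduleBlockingSync: sleep on a synchronization primitive
    //   until the GPU finishes (near-zero CPU use, slightly higher latency).
    if (cudaSetDeviceFlags(cudaDeviceScheduleYield) != cudaSuccess)
        return 1;
    // ... create the context and run kernels as usual; synchronization calls
    // now yield instead of spinning on the core.
    return 0;
}
```

Building and running this requires the CUDA toolkit and an NVIDIA GPU, so it is offered as a configuration sketch rather than a tested program.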

@peastman: What do you think about the CUDA solution?

@Squall-Leonhart

> but I can't imagine how this wouldn't disrupt performance when the sleep time fraction is large enough.

It supposedly can, but in mbevand/silentarmy#60 they worked the usage down to as little as 4% of a core.

@jchodera
Member

Do you know if each "solution" is a separate kernel launch for that application? If so, I wonder if the number of kernel launches per second is very low---less than 100/second in the reports in that thread---compared to OpenMM, which may mean that the same strategy would have a larger impact on OpenMM GPU performance. If it wasn't ridiculously difficult to implement, this could also be controlled by a platform property, but I think the easiest route (if Folding@home behavior is the primary concern) is to use cudaSetDeviceFlags for a core22 version. We can potentially even just use that in the Folding@home core on its own to see if this helps---I'll make a note to introduce that into the first test builds of core22.

(I'm presuming Folding@home is the use case of interest here. If not, let us know!)

@Squall-Leonhart

Squall-Leonhart commented Mar 12, 2017

In my case, yes very much so.

The final commits in the pull I referenced actually increased the GPU hashing rate, I assume by minimising some latency caused by the Nvidia busy-wait.

Here is the final commit by mbevand
mbevand/silentarmy@a6c3517

Getting this in, or (when the cows come home) Nvidia fixing this in their OpenCL library, would free up cores to use on the x86 FAHCores.

@peastman
Member

You didn't mention what platform you're using. When running on Windows, the OpenCL platform already does a yield() after each time step. We assume that if you're on Windows, it's most likely F@H and you want to sacrifice some performance to keep the UI more responsive. On Linux, we assume you're more likely running on a server and you want the best simulation performance possible.

The behavior of CUDA is different. You can select whether the CPU should spin loop during kernels (better performance but uses more CPU) or sleep. You can set this with the CudaUseBlockingSync platform property when you create your Context.

@jchodera
Member

> You can set this with the CudaUseBlockingSync platform property when you create your Context.

Thanks, @peastman! That suggests I should set the following property in the OpenMM FAH core:

contextProperties["UseBlockingSync"] = "false";

CudaUseBlockingSync is listed as deprecated in favor of UseBlockingSync.

@peastman: I presume you're not interested in implementing a microsleep scheme platform option for OpenCL for UseBlockingSync = false?

@peastman
Member

> I presume you're not interested in implementing a microsleep scheme platform option for OpenCL for UseBlockingSync = false?

Is the current method of just calling yield() not working well enough? (Remember, we only do that on Windows.)

@bb30994

bb30994 commented Mar 13, 2017

On a dedicated OpenMM machine the option can be set by the scientist, but exposing that option to the Folding@home donor, as proposed in jchodera's issue for FAHCore_22, is a good idea.

If the (Windows) CUDA or OpenCL driver happens to be used on a gaming machine it probably makes sense for NVidia and AMD to default to spin-wait to maximize the frame rate, pleasing a majority of their customers. If the same drivers are used on a Folding@home machine where Folding@home is running on both a CPU and a GPU, freeing up CPU resources is probably a good thing, but making it an optional setting is still a good idea.

If the GPU happens to be a slow one, interactive changes to the screen can get bogged down and introduce noticeable visible "screen lag" unless OpenMM yields frequently enough for those updates to get dispatched.

@jchodera
Member

> Is the current method of just calling yield() not working well enough? (Remember, we only do that on Windows.)

Is it possible to let the yield() behavior (on all systems) be controlled by the UseBlockingSync platform property?

@peastman
Member

They're doing very different things, though they both have the goal of keeping the UI responsive. The blocking sync option determines whether the CPU spin loops when waiting for the GPU. Disabling it reduces overall CPU usage. The yield doesn't directly reduce CPU usage, but it makes it easier for other processes to interrupt it, so it will be friendlier to other things running on the system.

@Squall-Leonhart

I am on Windows, and as the OP said, yield doesn't actually yield. When using one GPU you lose 1 CPU core to it, and the CPU slot drops to, say, 11 threads.
With 2 GPUs, this goes down to 9 threads.

This is dropping overall PPD.

> If the GPU happens to be a slow one, interactive changes to the screen can get bogged down and introduce noticeable visible "screen lag" unless OpenMM yields frequently enough for those updates to get dispatched.

This is the experience presently with core 21 on slower GPUs. With 11 threads on the CPU (as defaulted) and 1 for a GTX 550 Ti, it wasn't very fun to do anything else on the machine.

@informatorius
Contributor Author

informatorius commented Sep 9, 2018

That script from the referenced ethereum #399 makes several changes in the Windows registry, but only one is related to performance ("Disable CPU Core Parking"). It has nothing to do with this OP.

@informatorius
Contributor Author

informatorius commented Jan 2, 2019

@peastman:

> The blocking sync option determines whether the CPU spin loops when waiting for the GPU. Disabling it reduces overall CPU usage.

I would like to have that option as a FAH user on Windows and Linux, because there are now systems with only 2 CPU cores but 4 or more GPUs, the typical mining rig.

@peastman
Member

peastman commented Jan 2, 2019

I believe core22 does that already. @jchodera can you confirm?

@jchodera
Member

jchodera commented Jan 2, 2019

@peastman: I've just checked the code, and we haven't yet uncommented this code for disabling the blocking sync with the CUDA platform:

    // Don't use blocking CUDA calls in order to be more friendly to CPUs.
    // TODO: Uncomment this for core22, but may need to check if platform supports this property first.
    //contextProperties["UseBlockingSync"] = "false";

Do I need to check if UseBlockingSync is supported by the Platform before setting this?

However, we haven't built the win core for CUDA just yet. Would this imply there is something wrong with the way that the OpenCL platform is yielding to the OS within OpenMM?
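Checking for support before setting the property could look like the following; a sketch against the OpenMM C++ API (Platform::getPropertyNames() lists the properties a platform accepts), not tested against core22:

```cpp
#include <algorithm>
#include <string>
#include <vector>
#include "OpenMM.h"

// True if `platform` advertises a property named `name`.
bool supportsProperty(const OpenMM::Platform& platform, const std::string& name) {
    const std::vector<std::string>& names = platform.getPropertyNames();
    return std::find(names.begin(), names.end(), name) != names.end();
}

// When assembling the context properties for the core:
//   if (supportsProperty(platform, "UseBlockingSync"))
//       contextProperties["UseBlockingSync"] = "false";
```

Compiling this requires linking against OpenMM, so it is a sketch rather than a verified snippet.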

@peastman
Member

peastman commented Jan 2, 2019

So far as I know, there's no corresponding option for OpenCL. It's up to the runtime to decide how it will handle synchronization.

@ph0b
Contributor

ph0b commented Jun 23, 2020

Since OpenCL 2.2, the cl_khr_throttle_hints extension is defined and allows specifying the throttling that's needed.
It allows creating a command queue with CL_QUEUE_THROTTLE_KHR set to CL_QUEUE_THROTTLE_LOW_KHR, which tells the runtime it may run in an energy-optimized way, with the lowest energy consumption.
On Intel GPUs at least, that hint is supported and translates to a blocking sync.
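Using the hint when creating the queue might look like the following; a sketch assuming a runtime that reports cl_khr_throttle_hints (error handling trimmed; `make_low_throttle_queue` is an illustrative name):

```cpp
#include <CL/cl.h>
#include <CL/cl_ext.h>  // defines CL_QUEUE_THROTTLE_KHR and its values

// Create a command queue hinted toward low power: on runtimes that honor
// cl_khr_throttle_hints this can translate to blocking rather than spinning.
cl_command_queue make_low_throttle_queue(cl_context ctx, cl_device_id dev) {
    const cl_queue_properties props[] = {
        CL_QUEUE_THROTTLE_KHR, CL_QUEUE_THROTTLE_LOW_KHR,
        0  // the properties list is zero-terminated
    };
    cl_int err = CL_SUCCESS;
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, props, &err);
    return (err == CL_SUCCESS) ? q : nullptr;
}
```

This needs an OpenCL SDK and a device whose runtime exposes the extension, so it is a configuration sketch rather than a tested snippet.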

@Squall-Leonhart

If only Nvidia supported OpenCL 2.2, that information would be relevant.

@peastman
Member

peastman commented Dec 1, 2020

It sounds like this should no longer be an issue in core22? I'm going to go ahead and close this issue.
