
Workaround: Nvidia GPU OpenCL uses 100% CPU one core #1541

Closed
informatorius opened this issue Jul 16, 2016 · 20 comments

Comments

@informatorius
Contributor

informatorius commented Jul 16, 2016

When running OpenCL on an Nvidia GPU, one CPU core is used at 100%.
This is also the case with e.g. Folding@home.

Now another OpenCL tool, hashcat, has posted a workaround that reduces CPU usage to 10%.
Maybe this can also be applied to OpenMM?

> For example with the OpenCL runtime of NVidia, they still have a 5-year-old known bug which creates 100% CPU load on a single core per NVidia GPU (NVidia's OpenCL busy-wait). If you're using oclHashcat for quite a while you may remember the same bug happened to AMD years ago.
>
> Basically, what NVidia is missing here is that they use spinning instead of yielding. Their goal was to increase the performance, but in our case there's actually no gain from having a CPU-burning loop. The hashcat kernels run for ~100ms and that's quite a long time for an OpenCL kernel. At such a scale, spinning creates only disadvantages and there's no way to turn it off (only CUDA supports that).
>
> But why is this a problem? If the OpenCL runtime spins on a core to find out if a GPU kernel is finished, it creates 100% CPU load. Now imagine you have another OpenCL device, e.g. your CPU, also creating 100% CPU load: it will cause problems even if it's legitimate to do that here. The GPU's CPU-burning thread will slow down by 50%, and you end up with a slower GPU rate just by enabling your CPU too (--opencl-device-type 1). For AMD GPUs that's not the case (they fixed that bug years ago).
>
> To help mitigate this issue, I've implemented the following behavior:
>
> Hashcat will try to work around the problem by sleeping for some precalculated time after the kernel was queued and flushed. This will decrease the CPU load down to less than 10% with almost no impact on cracking performance.

https://github.com/hashcat/hashcat/releases/tag/v3.00

BOINC GPUGrid exposes this as the SWAN_SYNC option.

@informatorius informatorius changed the title Workaround?: Nvidia GPU OpenCL uses 100% CPU one core Workaround: Nvidia GPU OpenCL uses 100% CPU one core Jul 16, 2016
@Squall-Leonhart

I see pandegroup hasn't seen fit to make this change, a shame as it would free up cores for folding on.

I have a 12-thread Xeon and can only use 9 of them with 2 GPUs at work.

@jchodera
Member

It looks like the OpenCL workaround code is here. It monitors the execution time of the kernel and calls usleep with a fraction of this execution time in its own polling scheme. @peastman can comment on whether this is feasible to integrate into OpenMM's OpenCL platform---to me, it looks like this would require rewriting the whole Khronos-provided cl.hpp---but I can't imagine how this wouldn't disrupt performance when the sleep time fraction is large enough.

For the CUDA platform (which is coming to Folding@home in core22), it seems that NVIDIA has provided a cudaDeviceScheduleYield option that could be set via cudaSetDeviceFlags. It may be straightforward to let this be set as either an environment variable or CUDA platform property, which would allow us to yield on Folding@home core22 to be more friendly to foreground tasks.
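The CUDA-side knob is a one-line call made before the context is created. A hedged sketch (`cudaSetDeviceFlags` and the schedule flags are the real CUDA runtime API; which flag core22 should pick is exactly the open question here):

```cpp
#include <cuda_runtime.h>

int main() {
    // Must be called before the CUDA context is created for this process.
    // cudaDeviceScheduleYield: keep polling, but yield the core between polls
    //   (friendlier to foreground tasks, still shows some CPU use).
    // cudaDeviceScheduleBlockingSync: sleep on a synchronization primitive
    //   until the GPU finishes (near-zero CPU use, slightly higher latency).
    if (cudaSetDeviceFlags(cudaDeviceScheduleYield) != cudaSuccess)
        return 1;
    // ... create the context and run kernels as usual; synchronization calls
    // now yield instead of spinning on the core.
    return 0;
}
```

Building and running this requires the CUDA toolkit and an NVIDIA GPU, so it is offered as a configuration sketch rather than a tested program.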

@peastman: What do you think about the CUDA solution?

@Squall-Leonhart

> but I can't imagine how this wouldn't disrupt performance when the sleep time fraction is large enough.

It supposedly can, but in mbevand/silentarmy#60 they worked the usage down to as little as 4% of a core.

@jchodera
Member

Do you know if each "solution" is a separate kernel launch for that application? If so, I wonder if the number of kernel launches per second is very low---less than 100/second in the reports in that thread---compared to OpenMM, which may mean that the same strategy would have a larger impact on OpenMM GPU performance. If it wasn't ridiculously difficult to implement, this could also be controlled by a platform property, but I think the easiest route (if Folding@home behavior is the primary concern) is to use cudaSetDeviceFlags for a core22 version. We can potentially even just use that in the Folding@home core on its own to see if this helps---I'll make a note to introduce that into the first test builds of core22.

(I'm presuming Folding@home is the use case of interest here. If not, let us know!)

@Squall-Leonhart

Squall-Leonhart commented Mar 12, 2017

In my case, yes very much so.

The final commits in the pull I referenced actually increased the GPU hashing rate, I assume by minimising some latency caused by the Nvidia busy-wait.

Here is the final commit by mbevand
mbevand/silentarmy@a6c3517

Getting this in, or (when the cows come home) Nvidia fixing this in their OpenCL library, would free up cores to use on the x86 FAHCores.

@peastman
Member

You didn't mention what platform you're using. When running on Windows, the OpenCL platform already does a yield() after each time step. We assume that if you're on Windows, it's most likely F@H and you want to sacrifice some performance to keep the UI more responsive. On Linux, we assume you're more likely running on a server and you want the best simulation performance possible.

The behavior of CUDA is different. You can select whether the CPU should spin loop during kernels (better performance but uses more CPU) or sleep. You can set this with the CudaUseBlockingSync platform property when you create your Context.

@jchodera
Member

> You can set this with the CudaUseBlockingSync platform property when you create your Context.

Thanks, @peastman! That suggests I should set the following property in the OpenMM FAH core:

contextProperties["UseBlockingSync"] = "false";

CudaUseBlockingSync is listed as deprecated in favor of UseBlockingSync.

@peastman: I presume you're not interested in implementing a microsleep scheme platform option for OpenCL for UseBlockingSync = false?

@peastman
Member

> I presume you're not interested in implementing a microsleep scheme platform option for OpenCL for UseBlockingSync = false?

Is the current method of just calling yield() not working well enough? (Remember, we only do that on Windows.)

@bb30994

bb30994 commented Mar 13, 2017

On a dedicated OpenMM machine the option can be set by the scientist, but exposing that option to the Folding@home donor, as proposed in jchodera's issue for FAHCore_22, is a good idea.

If the (Windows) CUDA or OpenCL driver happens to be used on a gaming machine it probably makes sense for NVidia and AMD to default to spin-wait to maximize the frame rate, pleasing a majority of their customers. If the same drivers are used on a Folding@home machine where Folding@home is running on both a CPU and a GPU, freeing up CPU resources is probably a good thing, but making it an optional setting is still a good idea.

If the GPU happens to be a slow one, interactive changes to the screen can get bogged down and introduce noticeable visible "screen lag" unless OpenMM yields frequently enough for those updates to get dispatched.

@jchodera
Member

> Is the current method of just calling yield() not working well enough? (Remember, we only do that on Windows.)

Is it possible to let the yield() behavior (on all systems) be controlled by the UseBlockingSync platform property?

@peastman
Member

They're doing very different things, though they both have the goal of keeping the UI responsive. The blocking sync option determines whether the CPU spin loops when waiting for the GPU. Disabling it reduces overall CPU usage. The yield doesn't directly reduce CPU usage, but it makes it easier for other processes to interrupt it, so it will be friendlier to other things running on the system.

@Squall-Leonhart

I am on Windows, and as the OP said, yield doesn't actually yield. When using one GPU you lose 1 CPU core to it, and the CPU slot drops to, say, 11 threads.
With 2 GPUs, this goes down to 9 threads.

This is dropping overall PPD.

> If the GPU happens to be a slow one, interactive changes to the screen can get bogged down and introduce noticeable visible "screen lag" unless OpenMM yields frequently enough for those updates to get dispatched.

This is the experience presently with core 21 on slower GPUs. With 11 threads on the CPU (as defaulted) and 1 for a GTX 550 Ti, it wasn't very fun to do anything else on the machine.

@informatorius
Contributor Author

informatorius commented Sep 9, 2018

That script from the referenced ethereum #399 makes several changes in the Windows registry, but only one is related to performance ("Disable CPU Core Parking"). It has nothing to do with this OP.

@informatorius
Contributor Author

informatorius commented Jan 2, 2019

@peastman:

> The blocking sync option determines whether the CPU spin loops when waiting for the GPU. Disabling it reduces overall CPU usage.

I would like to have that option as a FAH user on Windows and Linux, because there are now systems with only 2 CPU cores but 4 or more GPUs, the typical mining rig.

@peastman
Member

peastman commented Jan 2, 2019

I believe core22 does that already. @jchodera can you confirm?

@jchodera
Member

jchodera commented Jan 2, 2019

@peastman: I've just checked the code, and we haven't yet uncommented this code for disabling the blocking sync with the CUDA platform:

    // Don't use blocking CUDA calls in order to be more friendly to CPUs.
    // TODO: Uncomment this for core22, but may need to check if platform supports this property first.
    //contextProperties["UseBlockingSync"] = "false";

Do I need to check if UseBlockingSync is supported by the Platform before setting this?

However, we haven't built the win core for CUDA just yet. Would this imply there is something wrong with the way that the OpenCL platform is yielding to the OS within OpenMM?
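Checking for support before setting the property could look like the following; a sketch against the OpenMM C++ API (Platform::getPropertyNames() lists the properties a platform accepts), not tested against core22:

```cpp
#include <algorithm>
#include <string>
#include <vector>
#include "OpenMM.h"

// True if `platform` advertises a property named `name`.
bool supportsProperty(const OpenMM::Platform& platform, const std::string& name) {
    const std::vector<std::string>& names = platform.getPropertyNames();
    return std::find(names.begin(), names.end(), name) != names.end();
}

// When assembling the context properties for the core:
//   if (supportsProperty(platform, "UseBlockingSync"))
//       contextProperties["UseBlockingSync"] = "false";
```

Compiling this requires linking against OpenMM, so it is a sketch rather than a verified snippet.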

@peastman
Member

peastman commented Jan 2, 2019

So far as I know, there's no corresponding option for OpenCL. It's up to the runtime to decide how it will handle synchronization.

@ph0b
Contributor

ph0b commented Jun 23, 2020

Since OpenCL 2.2, the cl_khr_throttle_hints extension is defined and allows specifying the throttling that's needed.
It allows creating a command queue with CL_QUEUE_THROTTLE_KHR set to CL_QUEUE_THROTTLE_LOW_KHR, which tells the runtime it may run in an energy-optimized way, with the lowest energy consumption.
On Intel GPUs at least, that hint is supported and translates to a blocking sync.
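Using the hint when creating the queue might look like the following; a sketch assuming a runtime that reports cl_khr_throttle_hints (error handling trimmed; `make_low_throttle_queue` is an illustrative name):

```cpp
#include <CL/cl.h>
#include <CL/cl_ext.h>  // defines CL_QUEUE_THROTTLE_KHR and its values

// Create a command queue hinted toward low power: on runtimes that honor
// cl_khr_throttle_hints this can translate to blocking rather than spinning.
cl_command_queue make_low_throttle_queue(cl_context ctx, cl_device_id dev) {
    const cl_queue_properties props[] = {
        CL_QUEUE_THROTTLE_KHR, CL_QUEUE_THROTTLE_LOW_KHR,
        0  // the properties list is zero-terminated
    };
    cl_int err = CL_SUCCESS;
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, props, &err);
    return (err == CL_SUCCESS) ? q : nullptr;
}
```

This needs an OpenCL SDK and a device whose runtime exposes the extension, so it is a configuration sketch rather than a tested snippet.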

@Squall-Leonhart

If only Nvidia supported OpenCL 2.2, that information would be relevant.

@peastman
Member

peastman commented Dec 1, 2020

It sounds like this should no longer be an issue in core22? I'm going to go ahead and close this issue.
