Workaround: Nvidia GPU OpenCL uses 100% CPU one core #1541
I see pandegroup hasn't seen fit to make this change, which is a shame, as it would free up cores for folding on. I have a 12-thread Xeon and can only use 9 of those threads with 2 GPUs at work.
It looks like the OpenCL workaround code is here. It monitors the execution time of the kernel and calls a sleep function between status polls. For the CUDA platform (which is coming to Folding@home in core22), it seems that NVIDIA has provided a blocking-sync option.

@peastman: What do you think about the CUDA solution?
"but I can't imagine how this wouldn't disrupt performance when the sleep time fraction is large enough." It supposedly can, but in mbevand/silentarmy#60 they worked the usage down to as little as 4% of a core.
Do you know whether each "solution" is a separate kernel launch for that application? If so, I wonder if the number of kernel launches per second there is very low (less than 100/second in the reports in that thread) compared to OpenMM, which may mean that the same strategy would have a larger impact on OpenMM GPU performance. If it isn't ridiculously difficult to implement, this could also be controlled by a platform property, but I think the easiest route (if Folding@home behavior is the primary concern) is to use the CUDA option.

(I'm presuming Folding@home is the use case of interest here. If not, let us know!)
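The concern about launch rate can be made concrete with a back-of-envelope estimate (mine, not from the thread): if the host sleeps for a fixed interval between completion polls, each launch can waste up to one full interval of idle GPU time, so the worst-case slowdown is roughly the sleep interval divided by the kernel runtime.

```python
def polling_overhead(kernel_ms: float, sleep_ms: float) -> float:
    """Worst-case relative slowdown from sleep-based polling: a kernel
    may finish just after a poll, adding up to one full sleep interval
    of idle time per launch."""
    return sleep_ms / kernel_ms

# hashcat-style workload: ~100 ms kernels, 1 ms poll interval
print(f"{polling_overhead(100.0, 1.0):.1%}")  # → 1.0%
# short MD-step kernels: ~1 ms each, same poll interval
print(f"{polling_overhead(1.0, 1.0):.1%}")    # → 100.0%
```

This is why a scheme that is nearly free for hashcat's long kernels could hurt badly for an application launching hundreds of short kernels per second.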
In my case, yes, very much so. The final commits in the pull request I referenced actually increased the GPU hashing rate, I assume by minimising some latency caused by the NVIDIA busy-wait. Here is the final commit by mbevand. Getting this in (or, if the cows ever come home and NVIDIA fixes this in their OpenCL library) would free up cores to use on the x86 FahCores.
You didn't mention what platform you're using. When running on Windows, the OpenCL platform already does a yield while waiting on the GPU.

The behavior of CUDA is different. You can select whether the CPU should spin-loop during kernels (better performance, but uses more CPU) or sleep. You can set this with the UseBlockingSync context property.
Thanks, @peastman! That suggests I should set the following property in the OpenMM FAH core:

contextProperties["UseBlockingSync"] = "false";
@peastman: I presume you're not interested in implementing a microsleep-scheme platform option for OpenCL?
Is the current method of just calling a yield sufficient?
On a dedicated OpenMM machine the option can be set by the scientist, but providing that option to the Folding@home donor is a good idea, per @jchodera's issue for FAHCore_22. If the (Windows) CUDA or OpenCL driver happens to be used on a gaming machine, it probably makes sense for NVidia and AMD to default to spin-wait to maximize frame rate, pleasing the majority of their customers. If the same drivers are used on a Folding@home machine where Folding@home runs on both the CPU and a GPU, freeing up CPU resources is probably a good thing, but making it an optional setting is still a good idea. If the GPU happens to be a slow one, interactive changes to the screen can get bogged down and introduce noticeable "screen lag", unless OpenMM yields frequently enough for those updates to be dispatched.
Is it possible to let the yield serve the same purpose as the blocking-sync option?
They're doing very different things, though they both have the goal of keeping the UI responsive. The blocking-sync option determines whether the CPU spin-loops while waiting for the GPU; disabling it reduces overall CPU usage. The yield doesn't directly reduce CPU usage, but it makes the thread easier for other processes to interrupt, so it is friendlier to other things running on the system.
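The spin-versus-block distinction can be demonstrated without a GPU at all, using a thread event as a stand-in for kernel completion (a sketch; the 200 ms "kernel" and the function names are mine, not OpenMM's):

```python
import threading
import time

def wait_spin(evt):
    # Busy-wait: burns a core polling the flag, like NVIDIA's
    # OpenCL runtime waiting on a kernel.
    while not evt.is_set():
        pass

def wait_block(evt):
    # Blocking wait: the thread sleeps in the kernel until signalled,
    # roughly what UseBlockingSync gives you on CUDA.
    evt.wait()

def cpu_time_of(waiter):
    evt = threading.Event()
    threading.Timer(0.2, evt.set).start()  # "kernel" finishes after 200 ms
    start = time.process_time()
    waiter(evt)
    return time.process_time() - start

spin = cpu_time_of(wait_spin)
block = cpu_time_of(wait_block)
print(f"spin: {spin:.3f}s CPU, block: {block:.3f}s CPU")
```

Both variants wait the same 200 ms of wall time, but the spin version accumulates roughly 200 ms of CPU time while the blocking version accumulates almost none, which is exactly the CPU core the OP is losing per GPU.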
I am on Windows, and as the OP said, the yield doesn't actually yield. When using one GPU you lose one CPU core to it, and the CPU client then defaults to, say, 11 threads; this drops overall PPD.

"If the GPU happens to be a slow one, interactive changes to the screen can get bogged down and introduce noticeable visible 'screen lag' unless OpenMM yields frequently enough for those updates to get dispatched."

This is the experience at present with core21 on slower GPUs, with 11 threads on the CPU (as defaulted).
That script from the referenced ethereum #399 makes several changes to the Windows registry, but only one is related to performance: "Disable CPU Core Parking". It has nothing to do with this OP.
I would like to have that option as a FAH user on Windows and Linux, because there are now systems in use with only 2 CPU cores but 4 or more GPUs, the typical mining rig.
I believe core22 does that already. @jchodera can you confirm? |
@peastman: I've just checked the code, and we haven't yet uncommented this code for disabling the blocking sync with the CUDA platform:

// Don't use blocking CUDA calls in order to be more friendly to CPUs.
// TODO: Uncomment this for core22, but may need to check if platform supports this property first.
//contextProperties["UseBlockingSync"] = "false";

Do I need to check whether the platform supports this property first? However, we haven't built the new CUDA-enabled core yet.
So far as I know, there's no corresponding option for OpenCL. It's up to the runtime to decide how it will handle synchronization. |
Since OpenCL 2.2, the cl_khr_throttle_hints extension is defined; it allows specifying how much throttling is needed.
If only NVIDIA supported OpenCL 2.2; then that information would be relevant.
It sounds like this should no longer be an issue in core22? I'm going to go ahead and close this issue. |
When running OpenCL on an Nvidia GPU, one CPU core is used at 100%.
This is also the case with, e.g., Folding@home.
Now another OpenCL tool, hashcat, has posted a workaround to reduce CPU usage to 10%.
Maybe this can also be applied to OpenMM?
<<<
For example with the OpenCL runtime of NVidia, they still have a 5-year-old-known-bug which creates 100% CPU load on a single core per NVidia GPU (NVidia's OpenCL busy-wait). If you're using oclHashcat for quite a while you may remember the same bug happened to AMD years ago.
Basically, what NVidia is missing here is that they use spinning instead of yielding. Their goal was to increase the performance but in our case there's actually no gain from having a CPU burning loop. The hashcat kernels run for ~100ms and that's quite a long time for an OpenCL kernel. At such a scale, spinning creates only disadvantages and there's no way to turn it off (Only CUDA supports that).
But why is this a problem? If the OpenCL runtime spins on a core to find out if a GPU kernel is finished it creates 100% CPU load. Now imagine you have another OpenCL device, e.g. your CPU, creating also 100% CPU load, it will cause problems even if it's legitimate to do that here. The GPU's CPU-burning thread will slow down by 50%, and you end up with a slower GPU rate just by enabling your CPU too (--opencl-device-type 1). For AMD GPU that's not the case (they fixed that bug years ago.)
To help mitigate this issue, I've implemented the following behavior:
https://github.com/hashcat/hashcat/releases/tag/v3.00
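The release notes linked above don't spell out the mechanism, but a sleep-between-polls loop along the following lines is the usual shape of such a workaround: since the kernels run for a known ~100 ms, sleep a fraction of that between completion checks instead of spinning. This is an illustrative sketch (the function and its parameters are mine, not hashcat's actual code):

```python
import time

def wait_adaptive(is_done, expected_ms, fraction=0.1, floor_ms=0.1):
    """Poll for completion, sleeping a fraction of the expected kernel
    runtime between polls instead of busy-waiting. Returns the number
    of polls performed (a proxy for CPU burned)."""
    interval_s = max(expected_ms * fraction, floor_ms) / 1000.0
    polls = 0
    while not is_done():
        polls += 1
        time.sleep(interval_s)
    return polls

deadline = time.monotonic() + 0.1  # fake 100 ms "kernel"
polls = wait_adaptive(lambda: time.monotonic() >= deadline, expected_ms=100.0)
print(polls)  # roughly 10 polls, versus millions of spin iterations
```

With a 10% sleep fraction the worst-case completion latency grows by about 10%, while the CPU spends almost the entire wait asleep, which is consistent with the single-digit core usage reported in the silentarmy thread.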
BOINC GPUGrid has it as the option SWAN_SYNC.