OpenCL error: Irreducible ControlFlow Detected #2986
The part about the extension is just a warning. You can ignore it. Regarding the error, this looks relevant: https://community.khronos.org/t/errorirreducible-controlflow-detected/1986. Is that all we have in the log? Unfortunately, it doesn't give any indication about what kernel is causing the problem.
It's likely one of the Custom forces, since the next WU did not have any of those and ran successfully. Would we need to try to build and run a system with one Force at a time in order to debug? Is there a way to step through compiling kernel by kernel?
If we can add debugging code to the core, we could just have it print out the source of each kernel before compiling it.
We've seen the "Irreducible ControlFlow Detected" message before, though not frequently enough to identify a pattern. Is there a reasonable way to add in-line diagnostic information to that particular error? That just isn't enough information to answer John's question.
I searched through all the .cl and .cc kernel files. One candidate is the while loop at openmm/platforms/common/src/kernels/verlet.cc, lines 57 to 58 (commit fce2608).
I don't know what this while loop does, but it looks suspicious, as does the one at openmm/platforms/common/src/kernels/gbsaObc.cc, lines 206 to 220 (commit fce2608).
customGBValueN2.cc and nonbonded.cl have similar loops. The CPU version looks more likely to break out of the loop: openmm/platforms/common/src/kernels/gbsaObc_cpu.cc, lines 212 to 221 (commit fce2608).
Regarding that loop: did the system with the error involve implicit solvent? What integrator did it use? That will tell us whether the above code could be related.
This was FAH project 13438, for the COVID Moonshot, which involves a hybrid alchemical system with a good number of Custom forces. I've attached serialized XML files of the RUN that failed, in case they're of interest.
It doesn't have a GBSAOBCForce, and it uses a CustomIntegrator instead of a VerletIntegrator. So none of the loops mentioned above is involved.
I can't reproduce this on an AMD Navi GPU. The following script runs without problems.

```python
from simtk.openmm import *

system = XmlSerializer.deserialize(open('system.xml').read())
integrator = XmlSerializer.deserialize(open('integrator.xml').read())
state = XmlSerializer.deserialize(open('state.xml').read())
context = Context(system, integrator, Platform.getPlatformByName('OpenCL'))
context.setState(state)
context.getState(getForces=True)
```

It's a different GPU of course, and also a different OS (Ubuntu 20.04). Cape Verde is a pretty old GPU, released in 2012.
Thanks for trying this out! Is there any instrumentation we can add to the core to bring back more information? Failing that, we will keep trying to find someone experiencing this issue. It's unclear to me whether Cape Verde refers to the first release in 2012 or the architecture, which has been in production for many years.
Hm, I might be misreading the info about which GPUs featured Cape Verde.
Cape Verde was a specific GPU. It was based on the GCN 1.0 architecture. If you can find someone who is experiencing the problem, we definitely could create an instrumented core that would provide more information.
I have a Cape Verde card and am currently experiencing this issue on Windows. I wasn't getting it last week; I think my last GPU work unit completed Friday night, and nothing has changed as far as I know since then. I have both Windows updates and my graphics driver updates configured to notify me of available updates but not install anything automatically, so I feel confident that nothing was updated. I'd be happy to run a modified version of FAH to gather more information about this issue. Given that this started without any software changes, it is possible this issue is related to the work units being issued by the server, or something else that could change and 'fix' itself on its own, so we'll have to cross our fingers that it continues long enough to test.
Great! @jchodera, are you set up for building cores? Here are the lines where it compiles kernels: openmm/platforms/opencl/src/OpenCLContext.cpp, lines 616 to 622 (commit 9008050).
Immediately before those lines, add the line `cout<<src.str()<<endl;`. That will make it print the source of each kernel to the console (which I believe gets redirected to one of the logs?) just before attempting to compile it. Then we can see what the last kernel it attempted to compile was.
I haven't done any sort of development for the project, so I'm not sure if I'm set up for building cores. Based on the log messages, it seems like the Folding@home client may be doing the building for me. If you can point me to the common location for the core code, I'd be happy to modify it and see what I get in my logs.
I just realized that I have the file path from the logs, so that solves that. However, that's a temp directory and no longer exists for me. It seems like the Folding@home client is downloading the kernel code and cleaning it up rather quickly, so I'm not sure of the best way to jump in and interfere. Any suggestions?
Apologies we haven't been able to make progress on this yet. We're still working on automating core22 builds with @dotsdl, but hope to have something soon we can use to help debug this.
This seems to be specific to these older AMD cards. This issue may be related?
I am still seeing some of these errors on the Radeon HD 7770 under Windows 10 on project 13446.
@peastman Did we ever figure out where this is coming from? I'm still seeing a ton of this on Folding@home.
Not that I know of. I gave some suggestions above on how we could begin tracking it down.
Double precision FP was an extension to OpenCL 1.0 and 1.1. It became an optional part of OpenCL 1.2, but the extension was kept for backwards compatibility. Alternatively, clGetDeviceInfo can be used to query CL_DEVICE_DOUBLE_FP_CONFIG to confirm that the device supports double precision. An overzealous driver was probably to blame for throwing a warning when it hit this pragma: openmm/platforms/opencl/src/OpenCLContext.cpp, lines 606 to 607 (commit 76520ce).
Wrapping this pragma inside an OpenCL version check may avoid having the issue reappear.
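One possible shape for that version check, sketched as an OpenCL C preprocessor guard (this is not the actual OpenMM patch; `__OPENCL_C_VERSION__` may be undefined on very old compilers, hence the `!defined` fallback):

```c
/* Hypothetical guard, not OpenMM code: only emit the cl_khr_fp64
 * pragma on compilers that still need it. On OpenCL C 1.2 and later,
 * double precision is an optional core feature and the pragma is
 * unnecessary. */
#if !defined(__OPENCL_C_VERSION__) || (__OPENCL_C_VERSION__ < 120)
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#endif
```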
A PR removing the pragma would be welcome! There's no need for a version check. We don't support versions earlier than 1.2 anymore.
Any idea what might cause an error like this (on the Folding@home version, core22 0.0.14)?
The configuration is:
cc: https://foldingforum.org/viewtopic.php?p=348173#p348173