OpenCL error: Irreducible ControlFlow Detected #2986

Closed
jchodera opened this issue Jan 18, 2021 · 23 comments

@jchodera
Member

jchodera commented Jan 18, 2021

Any idea what might cause an error like this (on the Folding@home version, core22 0.0.14)?

Failed to create OpenCL context:
Error compiling kernel: "C:\Users\Owner\AppData\Local\Temp\OCL5264T24.cl", line 21: warning: OpenCL
          extension is now part of core
  #pragma OPENCL EXTENSION cl_khr_fp64 : enable
                           ^

Error:E010:Irreducible ControlFlow Detected

The configuration is:

************************************ System ************************************
        CPU: Intel(R) Pentium(R) CPU G840 @ 2.80GHz
     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
       CPUs: 2
     Memory: 15.98GiB
Free Memory: 11.53GiB
    Threads: WINDOWS_THREADS
 OS Version: 6.2
Has Battery: false
 On Battery: false
 UTC Offset: -5
        PID: 5264
        CWD: C:\ProgramData\FAHClient\work
************************************ OpenMM ************************************
   Revision: 189320d0
********************************************************************************
  -- 0 --
  PROFILE = FULL_PROFILE
  VERSION = OpenCL 2.1 AMD-APP (3188.4)
  NAME = AMD Accelerated Parallel Processing
  VENDOR = Advanced Micro Devices, Inc.

(1) device(s) found on platform 0:
  -- 0 --
  DEVICE_NAME = Capeverde
  DEVICE_VENDOR = Advanced Micro Devices, Inc.
  DEVICE_VERSION = OpenCL 1.2 AMD-APP (3188.4)
  DRIVER_VERSION = 3188.4

cc: https://foldingforum.org/viewtopic.php?p=348173#p348173

@peastman
Member

The part about the extension is just a warning. You can ignore it.

Regarding the error, this looks relevant: https://community.khronos.org/t/errorirreducible-controlflow-detected/1986. Is that all we have in the log? Unfortunately, it doesn't give any indication about what kernel is causing the problem.

@jchodera
Member Author

It's likely one of the Custom forces since the next WU did not have any of those and ran successfully.

Would we need to try to build and run a system with one Force at a time in order to debug? Is there a way to step through compiling kernel by kernel?
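The one-Force-at-a-time idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `find_failing_items` and `fake_build` are hypothetical names, the real build step would construct an OpenMM Context from a copy of the System that keeps a single Force (for example by calling `System.removeForce` on the others), and the force made to fail here is chosen arbitrarily for the demo, not a finding from this issue.

```python
from typing import Callable, Iterable, List, Tuple


def find_failing_items(
    items: Iterable[str],
    try_build: Callable[[str], None],
) -> List[Tuple[str, str]]:
    """Try each item in isolation and collect (item, error message) pairs."""
    failures = []
    for item in items:
        try:
            # Hypothetical step: build a Context containing only this Force.
            try_build(item)
        except Exception as exc:
            failures.append((item, str(exc)))
    return failures


def fake_build(force_name: str) -> None:
    """Stand-in for a real build; one force fails on purpose for the demo."""
    if force_name == "CustomNonbondedForce":
        raise RuntimeError("Error:E010:Irreducible ControlFlow Detected")


forces = ["HarmonicBondForce", "CustomNonbondedForce", "NonbondedForce"]
failing = find_failing_items(forces, fake_build)
print(failing)
```

Testing each Force alone will not catch a failure that only appears when several Forces are combined into one kernel, so a clean result from this loop would not rule the Forces out.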

@peastman
Member

If we can add debugging code to the core, we could just have it print out the source of each kernel before compiling it.

@bb30994

bb30994 commented Jan 18, 2021

We've seen the "Irreducible ControlFlow Detected" message before, though not frequently enough to identify a pattern. Is there a reasonable way to add in-line diagnostic information to that particular error? That just isn't enough information to answer John's question.

@bdenhollander
Contributor

> Regarding the error, this looks relevant: https://community.khronos.org/t/errorirreducible-controlflow-detected/1986. Is that all we have in the log? Unfortunately, it doesn't give any indication about what kernel is causing the problem.

Based on that thread, I searched all the for loops in the .cl and .cc files and found one instance where the starting condition is specified outside the loop. It should be obvious to the compiler, but it is stylistically inconsistent with the rest of the code base.

int index = GLOBAL_ID;
for (; index < numAtoms; index += GLOBAL_SIZE) {

I don't know what this while loop does but it looks suspicious since tbx is unchanged.

// Skip over tiles that have exclusions, since they were already processed.
SYNC_WARPS;
while (skipTiles[tbx+TILE_SIZE-1] < pos) {
    SYNC_WARPS;
    if (skipBase+tgx < NUM_TILES_WITH_EXCLUSIONS) {
        int2 tile = exclusionTiles[skipBase+tgx];
        skipTiles[LOCAL_ID] = tile.x + tile.y*NUM_BLOCKS - tile.y*(tile.y+1)/2;
    }
    else
        skipTiles[LOCAL_ID] = end;
    skipBase += TILE_SIZE;
    currentSkipIndex = tbx;
    SYNC_WARPS;
}

customGBValueN2.cc and nonbonded.cl have similar loops. The CPU version looks more likely to break out of the loop.

// Skip over tiles that have exclusions, since they were already processed.
while (nextToSkip < pos) {
    if (currentSkipIndex < NUM_TILES_WITH_EXCLUSIONS) {
        int2 tile = exclusionTiles[currentSkipIndex++];
        nextToSkip = tile.x + tile.y*NUM_BLOCKS - tile.y*(tile.y+1)/2;
    }
    else
        nextToSkip = end;
}

@peastman
Member

That loop is scanning through the exclusionTiles array to find a particular index. The exit condition isn't based on tbx changing. It's based on the values of the latest data that got loaded into skipTiles.
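To make the exit condition concrete, here is a small Python model of the CPU-style loop quoted earlier in the thread. It is a sketch, not the OpenMM source: the function name `tiles_to_process`, the initial value of -1 for `next_to_skip`, the surrounding loop over `pos`, the rule that a tile is processed when `next_to_skip != pos`, and the requirement that the exclusion tiles are sorted by linear index are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def tiles_to_process(
    exclusion_tiles: Sequence[Tuple[int, int]],
    num_blocks: int,
    start: int,
    end: int,
) -> List[int]:
    """Return the tile positions in [start, end) that are not exclusion tiles.

    exclusion_tiles holds (x, y) pairs, assumed sorted by linear tile index.
    """
    included = []
    current_skip_index = 0
    next_to_skip = -1  # assumed initial value
    for pos in range(start, end):
        # Advance through the exclusion list until reaching an entry at or
        # past pos. Each pass either consumes one entry or sets next_to_skip
        # to end, and end > pos, so the loop always terminates.
        while next_to_skip < pos:
            if current_skip_index < len(exclusion_tiles):
                x, y = exclusion_tiles[current_skip_index]
                current_skip_index += 1
                next_to_skip = x + y * num_blocks - y * (y + 1) // 2
            else:
                next_to_skip = end
        if next_to_skip != pos:
            included.append(pos)
    return included


# Four blocks give 10 upper-triangle tiles; the diagonal tiles map to 0, 4, 7, 9.
print(tiles_to_process([(0, 0), (1, 1), (2, 2), (3, 3)], 4, 0, 10))
```

The loop variable that changes is the data being scanned, not the index used to read it, which matches the point that the exit does not depend on tbx changing.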

Did the system with the error involve implicit solvent? What integrator did it use? That will tell us whether the above code could be related.

@jchodera
Member Author

This was FAH project 13438, for the COVID Moonshot, which involves a hybrid alchemical system with a good number of Custom*Force terms and NonbondedForce perturbation groups.

I've attached serialized XML files of the RUN that failed if that is of interest.

PROJ13438-RUN12681.zip

@peastman
Member

It doesn't have a GBSAOBCForce, and it uses a CustomIntegrator instead of a VerletIntegrator. So none of the loops mentioned above is involved.

@peastman
Member

I can't reproduce this on an AMD Navi GPU. The following script runs without problems.

from simtk.openmm import *

system = XmlSerializer.deserialize(open('system.xml').read())
integrator = XmlSerializer.deserialize(open('integrator.xml').read())
state = XmlSerializer.deserialize(open('state.xml').read())
context = Context(system, integrator, Platform.getPlatformByName('OpenCL'))
context.setState(state)
context.getState(getForces=True)

It's a different GPU of course, and also a different OS (Ubuntu 20.04). Cape Verde is a pretty old GPU, released in 2012.

@jchodera
Member Author

Thanks for trying this out!

Is there any instrumentation we can add to the core to bring back more information?

Failing that, we will keep trying to find someone experiencing this issue.

It's unclear to me whether Cape Verde refers to the first release in 2012 or the architecture, which has been in production for many years.

@jchodera
Member Author

Hm, I might be misreading the info about which GPUs featured Cape Verde:
https://www.techpowerup.com/gpu-specs/amd-cape-verde.g100

@peastman
Member

Cape Verde was a specific GPU. It was based on the GCN 1.0 architecture.

If you can find someone who is experiencing the problem, we definitely could create an instrumented core that would provide more information.

@weisspe

weisspe commented Jan 25, 2021

I have a Cape Verde card and am currently experiencing this issue on Windows.

I wasn't getting it last week. I think my last GPU work unit was completed Friday night, and nothing has changed since then as far as I know. I have both Windows updates and my graphics driver updates configured to notify me of available updates but not install anything automatically, so I feel confident of that.

I'd be happy to run a modified version of FAH to gather more information about this issue. Given that this started without any software changes, it is possible this issue is related to the work units being issued by the server, or to something else that could change and 'fix' itself on its own, so we'll have to cross our fingers that it continues long enough to test.

@peastman
Member

Great! @jchodera are you set up for building cores? Here are the lines where it compiles kernels:

cl::Program::Sources sources({src.str()});
cl::Program program(context, sources);
try {
    program.build(vector<cl::Device>(1, device), options.c_str());
} catch (cl::Error err) {
    throw OpenMMException("Error compiling kernel: "+program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device));
}

Immediately before those lines, add the line

cout<<src.str()<<endl;

That will make it print the source for each kernel to the console (which I believe gets redirected to one of the logs?) just before attempting to compile it. Then we can see which kernel it last attempted to compile.
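The idea behind that one-line change, log every source before compiling so that the last logged source identifies the failing kernel, can be modelled in a few lines of Python. This is only a sketch of the technique: `compile_with_logging` and `fake_compile` are hypothetical names, and the real change would be the C++ `cout` line in the core, not this code.

```python
from typing import Callable, List, Optional, Sequence


def compile_with_logging(
    sources: Sequence[str],
    compile_fn: Callable[[str], None],
    log: List[str],
) -> Optional[int]:
    """Log each source before compiling it; return the index of the first
    source that fails to compile, or None if all of them compile."""
    for index, src in enumerate(sources):
        # Log first, so the source is recorded even if the compiler fails.
        log.append(f"--- kernel source {index} ---\n{src}")
        try:
            compile_fn(src)
        except Exception as exc:
            log.append(f"compile failed: {exc}")
            return index
    return None


def fake_compile(src: str) -> None:
    """Stand-in compiler that rejects any source containing 'goto'."""
    if "goto" in src:
        raise RuntimeError("Error:E010:Irreducible ControlFlow Detected")


log: List[str] = []
sources = ["__kernel void a() {}", "__kernel void b() { goto x; x: ; }"]
failed_index = compile_with_logging(sources, fake_compile, log)
print(failed_index)
print(log[-2])
```

Because the log entry is written before the compile call, the second-to-last entry is always the source that triggered the failure.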

@weisspe

weisspe commented Jan 25, 2021

I haven't done any sort of development for the project, so I'm not sure if I'm set up for building cores. Based on the log messages, it seems like the Folding@home client may be doing the building for me. If you can point me to the common location for the core code, I'd be happy to modify it and see what I get in my logs.

@weisspe

weisspe commented Jan 25, 2021

I just realized that I have the file path from the logs, so that solves that. However, that's a temp directory and it no longer exists for me. It seems like the Folding@home client is downloading the kernel code and cleaning it up rather quickly, so I'm not sure of the best way to jump in and interfere. Any suggestions?

@jchodera
Member Author

jchodera commented Mar 1, 2021

Apologies we haven't been able to make progress on this yet. We're still working on automating core22 builds with @dotsdl but hope to have something soon we can use to help debug this.

@jchodera
Member Author

jchodera commented Mar 1, 2021

This seems to be specific to Custom*Forces, since it's only appearing with my COVID Moonshot alchemical free energy calculations.

This issue may be related?

@gunnarre

gunnarre commented May 2, 2021

I am still seeing some of these errors on the Radeon 7770 HD under Windows 10 on project 13446.

22:20:16:WU00:FS00:0x22:*************************** Core22 Folding@home Core ***************************
22:20:16:WU00:FS00:0x22:       Core: Core22
22:20:16:WU00:FS00:0x22:       Type: 0x22
22:20:16:WU00:FS00:0x22:    Version: 0.0.13
(....)
22:20:16:WU00:FS00:0x22:************************************ OpenMM ************************************
22:20:16:WU00:FS00:0x22:   Revision: 189320d0
22:20:16:WU00:FS00:0x22:********************************************************************************
22:20:16:WU00:FS00:0x22:Project: 13446 (Run 6351, Clone 17, Gen 0)
22:20:16:WU00:FS00:0x22:Unit: 0x00000000000000000000000000000000
22:20:16:WU00:FS00:0x22:Reading tar file core.xml
22:20:16:WU00:FS00:0x22:Reading tar file integrator.xml.bz2
22:20:16:WU00:FS00:0x22:Reading tar file state.xml.bz2
22:20:16:WU00:FS00:0x22:Reading tar file system.xml.bz2
22:20:16:WU00:FS00:0x22:Digital signatures verified
22:20:16:WU00:FS00:0x22:Folding@home GPU Core22 Folding@home Core
22:20:16:WU00:FS00:0x22:Version 0.0.13
22:20:17:WU00:FS00:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
22:20:17:WU00:FS00:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
22:20:17:WU00:FS00:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
22:20:17:WU00:FS00:0x22:  Global context and integrator variables write interval: 25000 steps (2.5%) [40 total]
22:20:17:WU00:FS00:0x22:There are 3 platforms available.
22:20:17:WU00:FS00:0x22:Platform 0: Reference
22:20:17:WU00:FS00:0x22:Platform 1: CPU
22:20:17:WU00:FS00:0x22:Platform 2: OpenCL
22:20:17:WU00:FS00:0x22:  opencl-device 0 specified
22:20:34:WU00:FS00:0x22:Attempting to create OpenCL context:
22:20:34:WU00:FS00:0x22:  Configuring platform OpenCL
22:20:42:WU00:FS00:0x22:Failed to create OpenCL context:
22:20:42:WU00:FS00:0x22:Error compiling kernel: "C:\Users\admin\AppData\Local\Temp\OCL6916T24.cl", line 21: warning: OpenCL
22:20:42:WU00:FS00:0x22:          extension is now part of core
22:20:42:WU00:FS00:0x22:  #pragma OPENCL EXTENSION cl_khr_fp64 : enable
22:20:42:WU00:FS00:0x22:                           ^
22:20:42:WU00:FS00:0x22:
22:20:42:WU00:FS00:0x22:Error:E010:Irreducible ControlFlow Detected
22:20:42:WU00:FS00:0x22:ERROR:125: Failed to create a GPU-enabled OpenMM Context.
22:20:42:WU00:FS00:0x22:Saving result file ..\logfile_01.txt
22:20:42:WU00:FS00:0x22:Saving result file science.log
22:20:42:WU00:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
22:20:42:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
22:20:42:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:13446 run:6351 clone:17 gen:0 core:0x22 unit:0x000000110000000000003486000018cf
22:20:42:WU00:FS00:Uploading 2.82KiB to 54.157.202.86
22:20:42:WU00:FS00:Connecting to 54.157.202.86:8080
22:20:43:WU00:FS00:Upload complete
22:20:43:WU00:FS00:Server responded WORK_ACK (400)
22:20:43:WU00:FS00:Cleaning up

@jchodera
Member Author

@peastman Did we ever figure out where this is coming from? I'm still seeing a ton of this on Folding@home.

jchodera added the bug label May 31, 2021
@peastman
Member

peastman commented Jun 1, 2021

Not that I know of. I gave some suggestions above on how we could begin tracking it down.

@bdenhollander
Contributor

Double precision FP was an extension to OpenCL 1.0 and 1.1. It became an optional part of OpenCL 1.2, but the extension was kept for backwards compatibility. Alternatively, clGetDeviceInfo can be used to check that CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE is greater than 0 to confirm a device supports double precision FP. Listing cl_khr_fp64 in CL_DEVICE_EXTENSIONS is still required in OpenCL 3.0 (pg. 77), so it will continue to be valid as a check for double precision.
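The two checks described here can be expressed as a small pure function. This is a sketch that assumes the caller has already queried the device: `extensions` stands for the space-separated CL_DEVICE_EXTENSIONS string and `native_vector_width_double` for the CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE value. The function name is hypothetical, and treating either check as sufficient is one possible reading of the comment.

```python
def supports_double_precision(
    extensions: str,
    native_vector_width_double: int,
) -> bool:
    """Report double-precision support from two device properties.

    extensions: space-separated extension names, as in CL_DEVICE_EXTENSIONS.
    native_vector_width_double: value of CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE.
    """
    # Split on whitespace so that a longer name merely containing the text
    # "cl_khr_fp64" does not count as a match.
    has_extension = "cl_khr_fp64" in extensions.split()
    return has_extension or native_vector_width_double > 0


print(supports_double_precision("cl_khr_fp64 cl_khr_byte_addressable_store", 0))
print(supports_double_precision("cl_khr_fp16", 0))
```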

An overzealous driver was probably to blame for throwing a warning when cl_khr_fp64 was explicitly enabled on OpenCL 1.2+.

if (supportsDoublePrecision)
    src << "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n";

Wrapping this pragma inside an OpenCL version check may avoid having the issue reappear.
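A version check of that kind might look like the following Python sketch. It assumes the version is read from a string in the CL_DEVICE_VERSION format seen in the logs above ("OpenCL 1.2 AMD-APP (3188.4)"). The function names are hypothetical, and the actual change would live in the C++ code that assembles the kernel source.

```python
import re
from typing import Tuple

FP64_PRAGMA = "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"


def parse_opencl_version(version_string: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as 'OpenCL 1.2 AMD-APP (3188.4)'."""
    match = re.match(r"OpenCL (\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized OpenCL version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def fp64_pragma(version_string: str, supports_double: bool) -> str:
    """Return the pragma only where fp64 is still a separate extension (before 1.2)."""
    if supports_double and parse_opencl_version(version_string) < (1, 2):
        return FP64_PRAGMA
    return ""


# The device from the original report identifies itself as OpenCL 1.2,
# so no pragma is emitted and the driver has nothing to warn about.
print(repr(fp64_pragma("OpenCL 1.2 AMD-APP (3188.4)", True)))
```

Comparing (major, minor) tuples avoids the string-comparison pitfall where "1.10" would sort before "1.2".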

@peastman
Member

A PR removing the pragma would be welcome! There's no need for a version check. We don't support versions earlier than 1.2 anymore.

bdenhollander added a commit to bdenhollander/openmm that referenced this issue Oct 30, 2023
peastman changed the title from "OpenCL error: extension is now part of core" to "OpenCL error: Irreducible ControlFlow Detected" Oct 30, 2023