openMM "Exception: Particle coordinate is nan" on Radeon VII, ROCm 3.5.1, Linux 5.7 #2813
Comments
Take a look at https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan. It discusses common causes of this error message. |
By the way, the particular step numbers where you're seeing this are because the OpenCL platform only does the check for nan coordinates every 250 steps. |
Thank you for the hints. The relevant Folding@Home work units seem to run without problems on NVIDIA GPUs and on AMD GPUs under Windows; only AMD GPUs under Linux seem to have these problems. So for now I assume it might be an AMD OpenCL issue. |
That could be. What AMD GPUs are you seeing this on? Also, how many atoms are in the system? |
@peastman : This is a WU from the Folding@home benchmark project that fails on a number of AMD GPUs, and I specifically asked @ThWuensche to open this issue so we might suggest what we can run on his configuration to further debug. These simulation initial conditions run just fine on many GPUs. To be precise, it fails 714 / 2048 times, and all of those examples seem to be with several kinds of AMD hardware:
Strikingly, all of the successes seem to be on Intel integrated graphics or NVIDIA GPUs. For context, see: https://foldingforum.org/viewtopic.php?f=19&t=35996&start=15#p341819 Here is a ZIP archive of the system that is failing and a test script to run some dynamics with it: |
For the record: I have seen this on RUN9 from the test package, but the latest tests have been performed on a WU captured from a FAH run: https://foldingforum.org/viewtopic.php?f=19&t=35996&start=30#p341857 As mentioned in the title, the GPUs are Radeon VII on Linux with the ROCm stack (before 3.5, yesterday updated to 3.7). |
To use this as a scratch board for further debugging ideas: so far we can observe that …
How can the NaN check be performed on every step? I tried to find something in the OpenMM code, but so far without success. The script to run the WU on OpenMM has: … Any hints welcome. |
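One way to get a per-step check without touching the OpenCL platform's built-in 250-step check is to do it from the Python side. This is a minimal sketch, assuming `context` and `integrator` have already been created as in the scripts above (the step count is arbitrary):

```python
import numpy as np
from simtk import unit

# Step one at a time and inspect the coordinates after each step.
for step in range(1000):
    integrator.step(1)
    state = context.getState(getPositions=True)
    positions = state.getPositions(asNumpy=True).value_in_unit(unit.nanometer)
    if np.any(np.isnan(positions)):
        print('First NaN coordinate seen after step', step + 1)
        break
```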
Most of the GPUs you mentioned are quite old. The only not quite as old one is gfx906, which is about two years old. Have you seen any failures on a recent (Navi) AMD GPU? I'm wondering if this could be related to #2817, which fails on Vega but works on Navi. But that error only appears at something around 2 million atoms, while your file only has a few thousand.

The test.py script included in your code doesn't run. It has errors, like calling …

Writing out a state that includes positions, velocities, and forces at every step is a good way to debug this. Since the error happens quickly, that will still be fast. Then we can identify where the error first appears (probably in either velocities or forces) and work back from there. |
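A per-step state dump along those lines might look like this minimal sketch, assuming `context` and `integrator` already exist (the file naming and step count are made up for illustration):

```python
from simtk import openmm

# Serialize the full state (positions, velocities, forces) after every step.
for step in range(250):
    integrator.step(1)
    state = context.getState(getPositions=True, getVelocities=True, getForces=True)
    with open('state_%05d.xml' % step, 'w') as outfile:
        outfile.write(openmm.XmlSerializer.serialize(state))
```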
The list of GPUs is coming from jchodera. I'm running on Radeon VII (gfx906); the thread on the F@H forum, however, relates to an RX 5700XT. It seems not to be the error mentioned in #2817: that task was running in double precision, but if I run in double precision, the error does not show.

I'm trying to write out values in a Python script that jchodera pasted me in the F@H thread to execute F@H WUs, calling getForces and getPositions on the state returned from context.getState. However, both for forces and positions I get an exception that the state does not include them. Am I making a stupid mistake? (Edit: solved that myself, I need to specify what to get in the state. I was misled by the empty call in the script.) That's the code of the script:

```python
from simtk import openmm, unit

def read_xml(filename):
    # assumed body: deserialize an OpenMM XML file
    with open(filename) as infile:
        return openmm.XmlSerializer.deserialize(infile.read())

system = read_xml('system.xml')
integrator = read_xml('integrator.xml')  # assumed, mirroring system.xml
state = read_xml('state.xml')            # assumed, as in typical F@H scripts

print('Creating Context...')
context = openmm.Context(system, integrator)
context.setState(state)

print('Running simulation...')
integrator.step(1000)

# Clean up
del context, integrator
```
|
When you call `newstate = context.getState(getPositions=True, getVelocities=True, getForces=True)`, the returned State will include positions, velocities, and forces. |
@jchodera: John, I have created a set of logfiles containing positions, velocities, and forces after every step. In this case the first NaNs occur between step 117 and step 118. However, interpreting them (the meaning of the values) is at the moment beyond my experience. Would it help if I send you a tarball with the files? If so, where should I send the logs? In general, the step in which the first NaNs occur varies between different runs. |
@peastman: I've just pulled some statistics using the translation table on this site:
This seems to confirm your hypothesis regarding Vega vs. Navi. |
You can upload ZIP files here! |
@peastman : Would it be worth having @ThWuensche run the unit test suite on his GPU? Do we have an easy way to do that (e.g. can you just post a tarball of binaries), or is the only way to do this by building locally and then running the tests? |
@peastman : Another idea: would it be easier if we wrote a script to go through step by step, looking for energy and force deviations between the CPU and OpenCL platforms, and then to write out a … We could then package that tool in … |
I'm the poster regarding my Navi card (5700XT) using the newest AMD drop |
@jchodera: Upload here does not work. The compressed archive has 853MB, and even if I throw away the logs after the first NaNs it will be too big. I have put it on Google Drive: https://drive.google.com/file/d/1fDV8SB8bazp5PFi13HrVyndnBxt9eqDB/view?usp=sharing Never used that before due to privacy concerns, but in this case it should not be an issue. Hope it works. |
@jchodera: John, I'm just trying to understand more about the simulation and what could lead to NaNs. One reason for a NaN would be a division by 0. Looking for such causes, I see in system.xml a division by sigma, by reff_sterics, and by r (the equation for reff_sterics seems duplicated). Could one of those (sigma, reff_sterics) get to 0, thus creating NaNs? As described, the simulation fails with precision single and mixed, but not with precision double. At double precision a calculated value could get much smaller before being considered 0.0.

Sorry if this is a dumb idea, I'm just forwarding thoughts so that nothing slips by. Of course that would not in the first place explain why the problems occur on AMD hardware but not on NVIDIA; however, this could depend on minor implementation details. I'm just reading about denormals being flushed to zero; no idea whether this could be relevant, but it could be such an implementation detail.

Parts of the script mentioned by you in a comment above I could write, but reasonable algorithms for judging the deviations would be needed from you. Maybe we could in addition compare to the results from a double precision simulation (as that works). |
@ThWuensche : @peastman can answer better than I can here as to what might be causing different architectures to behave so differently and be more likely to lead to NaNs. I suspect one way we could debug this is to follow the idea outlined here and check when the energies/forces start to deviate from the CPU platform by going step by step and comparing. Just waiting to hear from @peastman on whether it would be useful for one of us to write this script for you. |
@jchodera: Might a division by zero occur, as mentioned and in the context described above? Could that be avoided by making sure that the divisors cannot become zero? Here is the relevant part of system.xml:
...
|
I have created a small script to look at differences between CPU and OpenCL. It works, but probably another algorithm is needed for the judgement.
The output shows that at some step the force difference explodes. What makes me wonder is that position and force differences of approximately the same magnitude also exist if I run the two contexts both on CPU. Maybe I'm missing something in the script setup? Are the integrators completely independent, or could they be linked by global variables (_restorable__class_hash)?

Script output:
```
Reading system.xml...
```

As for the concerns about division by zero in the force setup, I had modified system.xml to test the variables in the divisor (sigma, reff_sterics) with select() against 0, replacing them with a small value in case of 0. It didn't change anything, so either there is no problem with division by 0 in that case, or I made a mistake in the setup of that test (I hope select() works for float 0.0 as selector). |
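For reference, a guard of that kind can be expressed directly in a custom force's energy expression, since select(x, y, z) evaluates to z when x is 0 and to y otherwise. This is a minimal sketch with a made-up Lennard-Jones expression and combination rules, not the actual expression from system.xml:

```python
from simtk import openmm

# Hypothetical energy expression: replace a zero denominator with a small
# epsilon via select() before dividing.
force = openmm.CustomNonbondedForce(
    '4*epsilon*((sigma/r_safe)^12 - (sigma/r_safe)^6);'
    'r_safe = select(r, r, 1e-6);'  # if r == 0, use 1e-6 instead
    'sigma = 0.5*(sigma1+sigma2);'
    'epsilon = sqrt(epsilon1*epsilon2)')
force.addPerParticleParameter('sigma')
force.addPerParticleParameter('epsilon')
```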
Awesome! Tiny differences between CPU and GPU numerical implementations will cause the trajectories to rapidly diverge, so the better approach is to do this in the inner loop:
Practically, this means your inner loop should look more like this:

```python
import numpy as np
from simtk import unit

nsteps = 1000  # take 1000 total steps, comparing the energy each step
for step in range(nsteps):
    # Take a single step
    integrator_ocl.step(1)
    # Get the current state (including forces and energies) from the GPU
    state_ocl = context_ocl.getState(getPositions=True, getVelocities=True, getEnergy=True, getForces=True, getParameters=True)
    # Set the state on the CPU
    context_cpu.setState(state_ocl)
    # Compute energy and forces on the CPU
    state_cpu = context_cpu.getState(getPositions=True, getVelocities=True, getEnergy=True, getForces=True, getParameters=True)
    # Now compare the energies and forces between CPU and GPU
    energy_unit = unit.kilocalories_per_mole
    energy_diff = (state_ocl.getPotentialEnergy() - state_cpu.getPotentialEnergy()) / energy_unit
    force_unit = unit.kilocalories_per_mole / unit.angstrom
    force_diff = (state_ocl.getForces(asNumpy=True) - state_cpu.getForces(asNumpy=True)) / force_unit
    force_rms = np.sqrt(np.mean(np.sum(force_diff**2, 1)))

# Clean up
del context_ocl, context_cpu
```

For posting output, this example of collapsible Markdown makes the output much easier to read! |
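The collapsible pattern referred to here is GitHub's standard `<details>` markup; a generic example looks like this:

````
<details>
<summary>Script output</summary>

```
...long output here...
```

</details>
````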
Division by zero would certainly cause NaNs, but the fact that there is such a clear divergence in behavior between hardware suggests something else is going on. There are a wide variety of ways that we can end up with NaNs.

Mixed precision is supposed to compute force terms in single precision but accumulate them in double, as well as perform all the constraint and integration operations in double. This is a tricky balancing act: if there is a single operation that causes under- or over-flow because it wasn't accumulated in the appropriate precision, it can cause NaNs to propagate everywhere. But there can be other things here, like synchronization issues or bugs in indexing that work fine on some architectures but not on others where the sizes of things are different. We've seen a number of those, and this might be the case here. |
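As a side note, the precision used for such a test is selected through a platform property when the Context is created. A minimal sketch, assuming `system` and `integrator` already exist (the property was named 'OpenCLPrecision' in older OpenMM releases and 'Precision' in newer ones):

```python
from simtk import openmm

platform = openmm.Platform.getPlatformByName('OpenCL')
# 'single', 'mixed', or 'double'; 'mixed' computes forces in single
# precision but accumulates and integrates in double.
properties = {'Precision': 'double'}
context = openmm.Context(system, integrator, platform, properties)
```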
Out of curiosity, I tried running this system with a workaround that @peastman suggested in #2817 (see …). My specs: …

Now let me explain what I did. I employed the same debugging techniques @peastman taught me while we were troubleshooting #2817. First, I dumped the system right before it crashed (see …).
Then I dumped all individual interactions for particle number 2 on both CUDA-mixed and OpenCL-mixed and sorted them (see …). |
Then I took a look at two pairs of particles, 2 & 12 and 2 & 4059. 2 & 12:
So the 'first' … 2 & 4059:
This seemed weird at first glance. CUDA does compute that force, though it doesn't appear in the logs. Well, that's because I don't output zero forces:

```c
if ((atom1 == debug_atom0 || atom2 == debug_atom0) && !isExcluded && dEdR != 0)
    printf("CUDA: %d %d %g\n", atom1, atom2, dEdR);
```

So in the 'second' … The next test I would personally do is stripping the system of nonbonded forces and seeing if problems appear in other interactions, but I don't see an easy way to do so without painstakingly taking apart those XML files. Here are my tests and logs: |
@jchodera: For what it's worth, here are the patches I tried to system.xml:
…
|
@ThWuensche see my comments above: #2813 (comment). Can you provide a working example that you've verified reproduces this problem? |
Hi @peastman : I wrote the original … |
Yes. I was able to fix the errors, but once I did, it ran without problems. So I need an example that's been verified to actually reproduce the problem. |
@peastman: The script here #2813 (comment) reproduces the problem on my Radeon VIIs. The output from the script (also in that comment) shows the problem. |
@peastman: |
Thanks! Let me see what I can figure out. |
@peastman I have packed the script with the modifications proposed by @jchodera, the output of the script (mixed.txt), and logfiles containing positions, velocities, and forces for every step, together with the system description of the captured F@H WU, here: https://drive.google.com/file/d/1_1JJGJJsXq6WsSjFygVquIjpWJnqMwXM/view?usp=sharing The archive is rather large (756MB) due to the logs. |
I've hopefully fixed the problem. At any rate, I fixed a problem. It didn't actually have anything to do with AMD GPUs, though. It was just an uninitialized memory error, related to the use of global parameters to modify the particle charges. The code keeps copies of the parameter values on both the host and the device. It was initializing the host values to 0 but failing to initialize the device values. If the global parameters all happened to actually be 0 at the start of the simulation, it would assume nothing had changed and not upload the correct values. The fix is in #2819.

Once I fixed that, your script works well. I modified the two print statements to change …

The numbers grow gradually, as you'd expect for two simulations that are diverging, but everything looks reasonable. |
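To make the failure mode concrete, here is a schematic sketch in Python (not OpenMM's actual C++ code): the host-side cache starts at zero while the device buffer holds garbage, so a "did anything change?" check wrongly passes when the true parameter values are also zero:

```python
import random

n = 4
host_cache = [0.0] * n  # host copy initialized to zero
device_buffer = [random.random() for _ in range(n)]  # stands in for uninitialized GPU memory

def upload(values):
    global device_buffer
    device_buffer = list(values)

def set_parameters(current):
    global host_cache
    # Skip the upload if nothing differs from the host cache --
    # wrong when the real values are also all zero.
    if current == host_cache:
        return
    host_cache = list(current)
    upload(host_cache)

set_parameters([0.0] * n)  # real parameters are 0, so the upload is skipped
print(device_buffer)       # still garbage, mirroring the reported bug
```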
@peastman: Thanks a lot! Below is my comment from a private message to @jchodera from the beginning of July :) My company develops and manufactures electronics for industrial automation, so I don't have experience with GPU-based computing, but I do have experience with embedded software development. From that experience I would consider as potential causes of the problems:
…
Such kinds of errors typically result in failures that depend on the environment (hardware, compilers, ...) and are difficult to reproduce. What has to be noted is that the WUs that fail mostly fail at the very beginning. Once a WU runs for some time, it most likely completes. |
That seems accurate. Can you try the fix and confirm whether it fixes the problem for you? |
Sorry for the dumb question: to try this, do I have to build OpenMM from git? That's what I'm doing right now. Or is there a build I can install through miniconda (which is my current install)? That would probably be faster to test. |
You would have to build it from source, though once we merge the PR it will be included in the next nightly build. We can do that if you have trouble compiling. |
Looks rather good! It ran two times up to the configured 1000 steps without breaking, which it never did before. @peastman, you found the reason rather quickly, congratulations! When I offered @jchodera to dive into the problem, I was expecting to spend a month digging into the source. I bothered both of you more than planned, but the speed of achieving the result probably was worth it. @jchodera, is it possible to get and run a modified FAH core to test it in the wild? |
Excellent! I'll merge the fix. I'm very happy to track down bugs. The main obstacle is getting a test case I can use to reproduce it. Once I have that, it's just a matter of running the test repeatedly, tracing the error backward until I identify the point where things first go wrong. But if I can't reproduce an error, that's when things are really hard. |
Running Folding@Home WUs of series 13422, I regularly get that exception at fixed step numbers, mostly after step 250, sometimes after step 501. Running them on OpenMM outside of FAH shows the same issue.
What I have found so far is that the exception happens only in precision mixed or single; in precision double the exception does not occur.
I'm diving into further debugging, but I'll leave (and update) the information here in case somebody with more experience has an idea how to go on with this issue.