
OpenMM "Exception: Particle coordinate is nan" on Radeon VII, ROCm 3.5.1, Linux 5.7 #2813

Closed
ThWuensche opened this issue Aug 24, 2020 · 39 comments · Fixed by #2819

@ThWuensche

Running Folding@Home WUs of series 13422 I regularly get this exception at fixed step numbers, mostly after step 250, sometimes after step 501. Running the same work unit with OpenMM outside of FAH shows the same issue.

What I have found so far is that the exception happens only with mixed or single precision; with double precision it does not occur.

I'm diving into further debugging, but will leave (and update) the information here in case somebody with more experience has an idea how to proceed with this issue.

@peastman
Member

Take a look at https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan. It discusses common causes of this error message.

@peastman
Member

By the way, you're seeing these particular step numbers because the OpenCL platform only checks for NaN coordinates every 250 steps.
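
If you want to catch it sooner, a minimal sketch along these lines (assuming a context and integrator set up as in the scripts later in this thread) checks the coordinates from Python after every step instead of waiting for the platform's 250-step check:

```
import numpy as np
from simtk import unit

def step_with_nan_check(context, integrator, nsteps):
    """Step one step at a time and check the positions for NaN after each step."""
    for step in range(nsteps):
        integrator.step(1)
        positions = context.getState(getPositions=True).getPositions(asNumpy=True)
        if np.isnan(positions.value_in_unit(unit.nanometer)).any():
            raise RuntimeError(f'NaN coordinate detected after step {step + 1}')
```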

@ThWuensche
Author

Thank you for the hints. The relevant Folding@Home work units seem to run without problems on NVIDIA GPUs and on AMD GPUs under Windows; only AMD GPUs under Linux seem to have these problems. So for now I assume it might be an AMD OpenCL issue.

@peastman
Member

That could be. What AMD GPUs are you seeing this on? Also, how many atoms are in the system?

@jchodera
Member

jchodera commented Aug 25, 2020

@peastman : This is a WU from the Folding@home benchmark project that fails on a number of AMD GPUs, and I specifically asked @ThWuensche to open this issue so we can suggest what to run on his configuration to debug further.

These simulation initial conditions run just fine on many GPUs. To be precise, the WU fails 714 / 2048 times, and all of those failures seem to be on several kinds of AMD hardware:

gfx906
Scrapper
Devastator
AMD A4-4000 APU with Radeon(tm) HD Graphics 
AMD A6-4400M APU with Radeon(tm) HD Graphics 
AMD A8-4555M APU with Radeon(tm) HD Graphics
AMD A8-5600K APU with Radeon(tm) HD Graphics 
AMD A8-6600K APU with Radeon(tm) HD Graphics
AMD A10-5800K APU with Radeon(tm) HD Graphics 
AMD A10-6790K APU with Radeon(tm) HD Graphics
AMD A10-6800K APU with Radeon(tm) HD Graphics  
...(many more like this)...

Strikingly, all of the successes seem to be on Intel integrated graphics or NVIDIA GPUs.
Considering we've tested on so many devices, this cannot be a coincidence.

For context, see : https://foldingforum.org/viewtopic.php?f=19&t=35996&start=15#p341819

Here is a ZIP archive of the system that is failing and a test script to run some dynamics with it:
RUN9.zip

@ThWuensche
Author

For the record: I have seen this on RUN9 from the test package, but the latest tests were performed on a WU captured from a FAH run: https://foldingforum.org/viewtopic.php?f=19&t=35996&start=30#p341857

As mentioned in the title, the GPUs are Radeon VII cards on Linux with the ROCm stack (previously 3.5, yesterday updated to 3.7).

@ThWuensche
Author

ThWuensche commented Aug 25, 2020

To use this as a scratch board for further debugging ideas:

So far we can observe that:

  • the error comes up on almost all of the simulation tasks in this run, whereas almost all other tasks run without problems
  • the error comes up right at the beginning, mostly before step 251
  • we don't know in more detail when it happens, since the first NaN test only runs after step 250; it would be interesting to have that test on every step
  • the error shows up as a NaN particle position, but by its nature I would assume the origin is not in the particle positions themselves but maybe in the force calculations; a plain particle-position error should produce "less reliable" failures across different simulations
  • it would be interesting to have and analyze the complete system state for the steps before and after the first NaN, with a NaN check on every step
  • once we know where the first NaN occurs, we can search further for what causes it

How can the NaN check be performed on every step? I tried to find something in the OpenMM code, but so far without success.

The script to run the WU on OpenMM has:
newstate = context.getState()
but printing newstate does not tell me a lot. (Edit: reading the docs of course helped, sorry.) It would be good to write the state to a file after every single step and compare the changes, looking out for NaNs and for sharp changes in particle positions, velocities or forces. Do you have a preferred format or a script that you would like to be used?

Any hints welcome

@peastman
Member

Most of the GPUs you mentioned are quite old. The only one that isn't quite as old is gfx906, which is about two years old. Have you seen any failures on a recent (Navi) AMD GPU? I'm wondering if this could be related to #2817, which fails on Vega but works on Navi. But that error only appears at around 2 million atoms, while your file only has a few thousand.

The test.py script included in your code doesn't run. It has errors, like calling getPlatform() instead of getPlatformByName() and setDefaultPlatformValue() instead of setPropertyDefaultValue(). Once I fix those, it runs without problem on my Radeon R9 M370X. Can you provide a working example that you've verified reproduces the error?

Writing out a state that includes positions, velocities, and forces at every step is a good way to debug this. Since the error happens quickly, that will still be fast. Then we can identify where the error first appears (probably in either velocities or forces) and work back from there.
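
For example, a sketch along these lines (the file names are illustrative) serializes the full state to an XML file after every step, using the same system.xml / state.xml / integrator.xml inputs as the test script:

```
from simtk import openmm

def read_xml(filename):
    with open(filename) as infile:
        return openmm.XmlSerializer.deserialize(infile.read())

system = read_xml('system.xml')
integrator = read_xml('integrator.xml')
platform = openmm.Platform.getPlatformByName('OpenCL')
platform.setPropertyDefaultValue('Precision', 'mixed')
context = openmm.Context(system, integrator, platform)
context.setState(read_xml('state.xml'))

for step in range(250):
    integrator.step(1)
    state = context.getState(getPositions=True, getVelocities=True, getForces=True)
    # Serialize the full State (positions, velocities, forces) so the step
    # where the first NaN appears can be pinpointed afterwards.
    with open(f'state_{step + 1:04d}.xml', 'w') as outfile:
        outfile.write(openmm.XmlSerializer.serialize(state))
```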

@ThWuensche
Author

ThWuensche commented Aug 25, 2020

The list of GPUs comes from @jchodera. I'm running on a Radeon VII (gfx906). The thread on the F@H forum, however, relates to an RX 5700XT.

This doesn't seem to be the error mentioned in #2817. That task was running in double precision, whereas if I run in double precision the error does not show up.

I'm trying to write out values in a Python script that @jchodera posted for me in the F@H thread to execute F@H WUs, calling getForces and getPositions on the state returned from context.getState. However, for both forces and positions I get an exception that the state does not include them. Am I making a stupid error? (Edit: Solved that myself; I need to specify what to include in the state. I was misled by the empty call in the script.)

That's the code of the script:

from simtk import openmm, unit

def read_xml(filename):
    """Deserialize OpenMM object from XML file"""
    print(f'Reading {filename}...')
    with open(filename, 'r') as infile:
        return openmm.XmlSerializer.deserialize(infile.read())

system = read_xml('system.xml')
state = read_xml('state.xml')
integrator = read_xml('integrator.xml')

print('Creating Context...')
platform = openmm.Platform.getPlatformByName('OpenCL')
#platform = openmm.Platform.getPlatformByName('CPU')
platform.setPropertyDefaultValue('Precision', 'mixed')
#platform.setPropertyDefaultValue('Precision', 'double')
#platform.setPropertyDefaultValue('Precision', 'single')
context = openmm.Context(system, integrator, platform)
context.setState(state)

print('Running simulation...')
niterations = 1000 #100
nsteps_per_iteration = 1 #500
#niterations = 100
#nsteps_per_iteration = 500
for iteration in range(niterations):
    integrator.step(nsteps_per_iteration)
    newstate = context.getState()
    print(newstate.getPositions())
    print(f'completed {(iteration+1)*nsteps_per_iteration} steps')

# Clean up
del context, integrator

@peastman
Member

When you call getState() you need to tell it what information to store in the state:

newstate = context.getState(getPositions=True, getVelocities=True, getForces=True)

@ThWuensche
Author

@jchodera: John, I have created a set of logfiles containing positions, velocities and forces after every step. In this case the first NaNs occur between step 117 and step 118. However, interpreting the meaning of those values is beyond my experience at the moment. Would it help if I send you a tarball with the files? In general the step at which the first NaNs occur varies between runs.

If so, where should I send the logs?

@jchodera
Member

jchodera commented Aug 25, 2020

Most of the GPUs you mentioned are quite old. The only one that isn't quite as old is gfx906, which is about two years old. Have you seen any failures on a recent (Navi) AMD GPU?
I'm wondering if this could be related to #2817, which fails on Vega but works on Navi. But that error only appears at around 2 million atoms, while your file only has a few thousand.

@peastman: I've just pulled some statistics using the translation table on this site:

  • gfx906 (VEGA): 1 success and 50 failures for RUN9
  • gfx1010 (NAVI): 121 successes and 1 failure for RUN9

This seems to confirm your hypothesis regarding VEGA vs NAVI.

@jchodera
Member

If so, where should I send the logs?

You can upload ZIP files here!

@jchodera
Member

@peastman : Would it be worth having @ThWuensche run the unit test suite on his GPU? Do we have an easy way to do that (e.g. can you just post a tarball of binaries), or is the only way to build locally and then run make test?

@jchodera
Member

jchodera commented Aug 25, 2020

@peastman : Another idea: Would it be easier if we wrote a script that goes through step by step, looking for energy and force deviations between the CPU and OpenCL platforms, writes out a {system,state}.xml along with force-by-force comparisons of energies and forces, and then does finer-grained debugging (zeroing out exceptions, particle charges, particle LJ)? That should be straightforward for @ThWuensche to run.

We could then package that tool in simtk.openmm.app so future users could simply run python -m simtk.openmm.app.diagnose system.xml state.xml integrator.xml in the future.
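
As a rough sketch of what that diagnose script could start from (this module does not exist yet; the per-force-group comparison below is just one possible first step, and it assumes the system has fewer than 32 forces since OpenMM supports force groups 0-31):

```
from simtk import openmm

def read_xml(filename):
    with open(filename) as infile:
        return openmm.XmlSerializer.deserialize(infile.read())

system = read_xml('system.xml')
state = read_xml('state.xml')

# Give every force its own force group so its energy can be queried separately.
for index in range(system.getNumForces()):
    system.getForce(index).setForceGroup(index)

platform_ocl = openmm.Platform.getPlatformByName('OpenCL')
platform_ocl.setPropertyDefaultValue('Precision', 'mixed')

# Throwaway integrators: we only evaluate energies, we never step.
integrator_cpu = openmm.VerletIntegrator(0.001)
integrator_ocl = openmm.VerletIntegrator(0.001)
context_cpu = openmm.Context(system, integrator_cpu, openmm.Platform.getPlatformByName('CPU'))
context_ocl = openmm.Context(system, integrator_ocl, platform_ocl)
context_cpu.setState(state)
context_ocl.setState(state)

for index in range(system.getNumForces()):
    name = system.getForce(index).__class__.__name__
    e_cpu = context_cpu.getState(getEnergy=True, groups={index}).getPotentialEnergy()
    e_ocl = context_ocl.getState(getEnergy=True, groups={index}).getPotentialEnergy()
    print(f'force {index:3d} {name:25s} CPU {e_cpu} OpenCL {e_ocl}')
```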

@jaos

jaos commented Aug 25, 2020

I'm the poster regarding my Navi card (5700XT) using the newest AMD drop amdgpu-pro-20.30-1109583-ubuntu-20.04. Let me know if I can help.

@ThWuensche
Author

@jchodera: Upload here does not work. The compressed archive is 853 MB, and even if I throw away the logs after the first NaNs it will be too big. I have put it on Google Drive:

https://drive.google.com/file/d/1fDV8SB8bazp5PFi13HrVyndnBxt9eqDB/view?usp=sharing

I've never used it before because of privacy concerns, but in this case it should not be an issue. I hope it works.

@ThWuensche
Author

ThWuensche commented Aug 26, 2020

@jchodera: John, I'm just trying to understand more about the simulation and what could lead to NaNs. One cause of a NaN would be a division by 0. Looking for such causes I see in system.xml a division by sigma, by reff_sterics and by r (the equation for reff_sterics appears twice). Could one of those (sigma, reff_sterics) become 0 and thus create NaNs? As described, the simulation fails with single and mixed precision but not with double precision, and in double precision a calculated value can get much smaller before being treated as 0.0.

Sorry if this is a dumb idea; I'm just forwarding thoughts so that nothing slips by. Of course this would not directly explain why the problems occur on AMD hardware but not on NVIDIA, although that could come down to minor implementation details. I'm just reading about denormals being flushed to zero; I have no idea whether that is relevant, but it could be such an implementation detail.

I could write parts of the script you mentioned in a comment above, but I would need reasonable criteria from you for judging the deviations. In addition we could perhaps compare against the results of a double-precision simulation (since that works).

@jchodera
Member

@ThWuensche : @peastman can answer better than I can what might be causing different architectures to behave so differently and be more likely to produce NaNs.

I suspect one way we could debug this is to follow the idea outlined here and check when the energies/forces start to deviate from the CPU platform by going step by step and comparing:
#2813 (comment)

Just waiting to hear from @peastman on whether it would be useful for one of us to write this script for you.

@ThWuensche
Author

ThWuensche commented Aug 26, 2020

@jchodera: Might a division by zero occur, as mentioned in the context above? Could that be avoided by making sure that the divisors cannot become zero?

Here is the relevant part of system.xml:

	<Force energy="U_electrostatics + U_sterics;U_sterics = 4*epsilon*x*(x-1.0); x = (sigma/reff_sterics)^6;reff_sterics = sigma*((softcore_alpha*lambda_alpha + (r/sigma)^6))^(1/6);reff_sterics = sigma*((softcore_alpha*lambda_alpha + (r/sigma)^6))^(1/6);lambda_alpha = new_interaction*(1-lambda_sterics_insert) + old_interaction*lambda_sterics_delete;U_electrostatics = (lambda_electrostatics_insert * unique_new + unique_old * (1 - lambda_electrostatics_delete)) * ONE_4PI_EPS0*chargeProd/r;ONE_4PI_EPS0 = 138.935456;epsilon = (1-lambda_sterics)*epsilonA + lambda_sterics*epsilonB;sigma = (1-lambda_sterics)*sigmaA + lambda_sterics*sigmaB;lambda_sterics = new_interaction*lambda_sterics_insert + old_interaction*lambda_sterics_delete;new_interaction = delta(1-unique_new); old_interaction = delta(1-unique_old);" forceGroup="0" type="CustomBondForce" usesPeriodic="0" version="3">

...

	</Force>

@ThWuensche
Author

ThWuensche commented Aug 26, 2020

I have created a small script to look at differences between the CPU and OpenCL platforms. It works, but probably a different criterion is needed for judging the deviations.

from simtk import openmm, unit
import numpy

def read_xml(filename):
    """Deserialize OpenMM object from XML file"""
    print(f'Reading {filename}...')
    with open(filename, 'r') as infile:
        return openmm.XmlSerializer.deserialize(infile.read())

system = read_xml('system.xml')
state = read_xml('state.xml')
integrator_ocl = read_xml('integrator.xml')
integrator_cpu = read_xml('integrator.xml')

#print(state.getPositions())


print('Creating openCL Context...')
platform_ocl = openmm.Platform.getPlatformByName('OpenCL')
platform_ocl.setPropertyDefaultValue('Precision', 'mixed')
#platform_ocl.setPropertyDefaultValue('Precision', 'double')

print('Creating CPU Context...')
platform_cpu = openmm.Platform.getPlatformByName('CPU')

context_ocl = openmm.Context(system, integrator_ocl, platform_ocl)
context_ocl.setState(state)

context_cpu = openmm.Context(system, integrator_cpu, platform_cpu)
context_cpu.setState(state)

print('Running simulation...')
niterations = 1000
nsteps_per_iteration = 1

for iteration in range(niterations):
    integrator_ocl.step(nsteps_per_iteration)
    integrator_cpu.step(nsteps_per_iteration)
    newstate_ocl = context_ocl.getState(getPositions=True,getVelocities=True,getForces=True)
    newstate_cpu = context_cpu.getState(getPositions=True,getVelocities=True,getForces=True)
    forcediff = numpy.subtract(newstate_cpu.getForces(), newstate_ocl.getForces())
    print(numpy.sum(numpy.abs(forcediff)))
    posdiff = numpy.subtract(newstate_cpu.getPositions(), newstate_ocl.getPositions())
    print(numpy.sum(numpy.abs(posdiff)))
#    logfile = open("logfile"+str(iteration)+".txt","w")
#    logfile.write("Iteration %d\n" % iteration)
#    logfile.write("Positions  "+str(newstate.getPositions())+"\n")
#    logfile.write("Velocities "+str(newstate.getVelocities())+"\n")
#    logfile.write("Forces     "+str(newstate.getForces())+"\n")
#    logfile.close()
    print(f'completed {(iteration+1)*nsteps_per_iteration} steps')

# Clean up
del context_ocl, context_cpu, integrator_ocl, integrator_cpu

The output shows that at some step the force difference explodes. What makes me wonder is that position and force differences of approximately the same magnitude also show up if I run the two contexts both on the CPU. Maybe I'm missing something in the script setup? Are the integrators completely independent, or could they be linked by global variables (_restorable__class_hash)?

Script output

```

Reading system.xml...
Reading state.xml...
Reading integrator.xml...
Reading integrator.xml...
Creating openCL Context...
Creating CPU Context...
Running simulation...
932732.649899893 kJ/(nm mol)
30.303013688575348 nm
completed 1 steps
2370550.6978229303 kJ/(nm mol)
94.29016402412209 nm
completed 2 steps
3801658.9826954347 kJ/(nm mol)
172.42520328599747 nm
completed 3 steps
5283776.243838893 kJ/(nm mol)
259.66397873651147 nm
completed 4 steps
6611074.632446245 kJ/(nm mol)
352.4964142512962 nm
completed 5 steps
7913640.82710628 kJ/(nm mol)
448.3413092071446 nm
completed 6 steps
9042412.976014834 kJ/(nm mol)
545.7401412028207 nm
completed 7 steps
10180953.320429089 kJ/(nm mol)
643.7746230001677 nm
completed 8 steps
11192866.367717529 kJ/(nm mol)
741.7006725816077 nm
completed 9 steps
12243500.696906028 kJ/(nm mol)
839.5428650284039 nm
completed 10 steps
13215870.715828337 kJ/(nm mol)
937.2901639585941 nm
completed 11 steps
14140142.297329616 kJ/(nm mol)
1035.125914098395 nm
completed 12 steps
15162262.061764142 kJ/(nm mol)
1133.1890994093162 nm
completed 13 steps
16114989.270913834 kJ/(nm mol)
1231.457498505044 nm
completed 14 steps
17077456.545753893 kJ/(nm mol)
1330.073026966462 nm
completed 15 steps
18083040.989561774 kJ/(nm mol)
1429.1043908997374 nm
completed 16 steps
19024904.931625158 kJ/(nm mol)
1528.954837096996 nm
completed 17 steps
20011920.774708644 kJ/(nm mol)
1629.6743955960542 nm
completed 18 steps
20953902.487869054 kJ/(nm mol)
1731.4673874618438 nm
completed 19 steps
21985005.280625135 kJ/(nm mol)
1834.0128031919294 nm
completed 20 steps
22918004.932529394 kJ/(nm mol)
1937.9105280989854 nm
completed 21 steps
23948655.73829493 kJ/(nm mol)
2042.6663114184598 nm
completed 22 steps
24867161.518914185 kJ/(nm mol)
2148.040705206395 nm
completed 23 steps
25837611.294789765 kJ/(nm mol)
2254.3041783693293 nm
completed 24 steps
26813587.009418402 kJ/(nm mol)
2361.600070915175 nm
completed 25 steps
27745958.963972945 kJ/(nm mol)
2470.429678221181 nm
completed 26 steps
28722972.251125596 kJ/(nm mol)
2580.614966539853 nm
completed 27 steps
29762942.231953174 kJ/(nm mol)
2692.371123393845 nm
completed 28 steps
30793774.66424491 kJ/(nm mol)
2805.914492612053 nm
completed 29 steps
31863870.596704617 kJ/(nm mol)
2920.909563776672 nm
completed 30 steps
32877269.44204122 kJ/(nm mol)
3036.924575002127 nm
completed 31 steps
33976134.24766429 kJ/(nm mol)
3154.147291312293 nm
completed 32 steps
34993628.75984716 kJ/(nm mol)
3272.5572596237657 nm
completed 33 steps
36006600.09290193 kJ/(nm mol)
3391.68645943666 nm
completed 34 steps
37095106.92264314 kJ/(nm mol)
3511.3794226989494 nm
completed 35 steps
38055818.30433831 kJ/(nm mol)
3632.128771970119 nm
completed 36 steps
39165028.269151315 kJ/(nm mol)
3754.127137280358 nm
completed 37 steps
40145315.6698465 kJ/(nm mol)
3877.413798430277 nm
completed 38 steps
41367225.84997678 kJ/(nm mol)
4002.113691019663 nm
completed 39 steps
42358939.26112457 kJ/(nm mol)
4128.2707950719405 nm
completed 40 steps
43519670.62780715 kJ/(nm mol)
4256.13563183548 nm
completed 41 steps
44641870.04580612 kJ/(nm mol)
4385.350130149105 nm
completed 42 steps
45742777.50730204 kJ/(nm mol)
4516.062062704922 nm
completed 43 steps
46755734.034774296 kJ/(nm mol)
4647.7412673022 nm
completed 44 steps
47815424.54732221 kJ/(nm mol)
4780.285669991736 nm
completed 45 steps
48877479.88173178 kJ/(nm mol)
4913.614697705198 nm
completed 46 steps
49885913.78212947 kJ/(nm mol)
5047.868375412301 nm
completed 47 steps
50809897.61464801 kJ/(nm mol)
5183.790159483085 nm
completed 48 steps
51784885.2155342 kJ/(nm mol)
5320.446483145012 nm
completed 49 steps
52766195.58616836 kJ/(nm mol)
10016.47559399881 nm
completed 50 steps
53768115.95965492 kJ/(nm mol)
10154.101056117885 nm
completed 51 steps
54883618.55741766 kJ/(nm mol)
10292.716149935124 nm
completed 52 steps
55933693.99371166 kJ/(nm mol)
10431.305481165557 nm
completed 53 steps
56944499.06778313 kJ/(nm mol)
10569.866064926404 nm
completed 54 steps
57921970.942374244 kJ/(nm mol)
10707.549573683169 nm
completed 55 steps
58900218.6391818 kJ/(nm mol)
10844.700833533854 nm
completed 56 steps
59819888.625867955 kJ/(nm mol)
10980.841975115198 nm
completed 57 steps
60839515.36976595 kJ/(nm mol)
11116.55308331614 nm
completed 58 steps
61670734.9173074 kJ/(nm mol)
11252.239565567317 nm
completed 59 steps
62664658.346695594 kJ/(nm mol)
11387.635728951103 nm
completed 60 steps
63379555.206574336 kJ/(nm mol)
11522.970323347286 nm
completed 61 steps
64260217.609011196 kJ/(nm mol)
11658.84795346009 nm
completed 62 steps
65075271.3853027 kJ/(nm mol)
11795.482222948434 nm
completed 63 steps
65861008.72106905 kJ/(nm mol)
11932.551240269355 nm
completed 64 steps
66840846.319570325 kJ/(nm mol)
12069.942328092204 nm
completed 65 steps
67705033.43343897 kJ/(nm mol)
12207.955618478367 nm
completed 66 steps
68484749.98776385 kJ/(nm mol)
12345.980942015554 nm
completed 67 steps
69257349.65339229 kJ/(nm mol)
12483.83407219957 nm
completed 68 steps
69986455.09649293 kJ/(nm mol)
12621.927515945614 nm
completed 69 steps
70781805.18062839 kJ/(nm mol)
12759.772921494905 nm
completed 70 steps
71595990.36649759 kJ/(nm mol)
12896.679445072567 nm
completed 71 steps
72275247.6784054 kJ/(nm mol)
13032.82501515648 nm
completed 72 steps
72999335.7391929 kJ/(nm mol)
13168.671381285983 nm
completed 73 steps
73715623.04603477 kJ/(nm mol)
13303.841308055486 nm
completed 74 steps
74324510.57824357 kJ/(nm mol)
13438.434062670836 nm
completed 75 steps
74969051.10737382 kJ/(nm mol)
13573.02764730132 nm
completed 76 steps
75572326.59090891 kJ/(nm mol)
13706.755377105073 nm
completed 77 steps
76063387.63631296 kJ/(nm mol)
13839.622219759929 nm
completed 78 steps
76890644.71309756 kJ/(nm mol)
13972.4500063213 nm
completed 79 steps
77463348.3423628 kJ/(nm mol)
14105.544706048673 nm
completed 80 steps
78448261.42407563 kJ/(nm mol)
14238.60207116144 nm
completed 81 steps
79148732.11060885 kJ/(nm mol)
14371.772345811978 nm
completed 82 steps
79702858.03923264 kJ/(nm mol)
14504.588549260485 nm
completed 83 steps
80429058.76139374 kJ/(nm mol)
14636.515310288727 nm
completed 84 steps
80998977.98888646 kJ/(nm mol)
14767.189063476779 nm
completed 85 steps
81778950.71718144 kJ/(nm mol)
14896.577115036476 nm
completed 86 steps
82199043.05230966 kJ/(nm mol)
15026.025091572992 nm
completed 87 steps
82829026.2310648 kJ/(nm mol)
15155.344851517773 nm
completed 88 steps
83250690.76876226 kJ/(nm mol)
15284.108803624611 nm
completed 89 steps
83862048.36157845 kJ/(nm mol)
15412.724524087958 nm
completed 90 steps
84281882.80578071 kJ/(nm mol)
15540.254048610208 nm
completed 91 steps
84871936.3278902 kJ/(nm mol)
15666.531233709484 nm
completed 92 steps
85452125.25564803 kJ/(nm mol)
15792.335397044479 nm
completed 93 steps
86148248.67350684 kJ/(nm mol)
15917.907157168122 nm
completed 94 steps
86574443.0697213 kJ/(nm mol)
16043.114658037373 nm
completed 95 steps
86948934.76289986 kJ/(nm mol)
16167.986521089722 nm
completed 96 steps
87637051.18215686 kJ/(nm mol)
16292.996711126396 nm
completed 97 steps
87790439.25176756 kJ/(nm mol)
16417.742728093523 nm
completed 98 steps
88530093.81609665 kJ/(nm mol)
16541.443282578886 nm
completed 99 steps
88648652.59959508 kJ/(nm mol)
16664.78673908967 nm
completed 100 steps
89317619.09414001 kJ/(nm mol)
16787.733388976416 nm
completed 101 steps
89587791.3737199 kJ/(nm mol)
16909.69831679242 nm
completed 102 steps
116077802.58023976 kJ/(nm mol)
17031.207350457404 nm
completed 103 steps
111438076.91091172 kJ/(nm mol)
17189.24339527146 nm
completed 104 steps
115545254.31098959 kJ/(nm mol)
17330.002909149927 nm
completed 105 steps
112501778.73219927 kJ/(nm mol)
17446.921657807503 nm
completed 106 steps
9345700929.511671 kJ/(nm mol)
17556.547240526914 nm
completed 107 steps
nan kJ/(nm mol)
nan nm
completed 108 steps
nan kJ/(nm mol)
nan nm
completed 109 steps
nan kJ/(nm mol)
nan nm
completed 110 steps
```

As for the concerns about division by zero in the force setup, I modified system.xml to test the variables in the divisors (sigma, reff_sterics) with select() against 0, replacing them with a small value in case of 0. It didn't change anything, so either there is no problem with division by 0 here, or I made a mistake in the setup of that test (I hope select() works with a float 0.0 as the selector).
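
For reference, my understanding is that in OpenMM custom-energy expressions select(x, y, z) evaluates to z when x is 0 and to y otherwise, so the guard pattern I used would look like this on a toy force (not the actual force from system.xml):

```
from simtk import openmm

# Toy force: guard the divisor with select() so that r = 0 is replaced by a
# small constant before the division happens.
force = openmm.CustomBondForce('k/r_safe;'
                               'r_safe = select(r, r, 0.0000001)')
force.addPerBondParameter('k')
```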

@jchodera
Member

Awesome!

Tiny differences between CPU and GPU numerical implementations will cause the trajectories to rapidly diverge, so the better approach is to do this in the inner loop:

  • Take a step on the GPU
  • Compute energies and forces on the GPU
  • Retrieve the State from the GPU context
  • Set the positions in the CPU context
  • Compute the energies and forces on the CPU context
  • Compare them
  • Repeat

Practically, this means your inner loop should look more like this:

import numpy as np

nsteps = 1000  # take 1000 total steps, comparing the energy each step
for step in range(nsteps):
    # Take a single step
    integrator_ocl.step(1)
    # Get the current state (including forces and energies) from the GPU
    state_ocl = context_ocl.getState(getPositions=True, getVelocities=True, getEnergy=True, getForces=True, getParameters=True)
    # Set the state on the CPU
    context_cpu.setState(state_ocl)
    # Compute energy and forces on the CPU
    state_cpu = context_cpu.getState(getPositions=True, getVelocities=True, getEnergy=True, getForces=True, getParameters=True)
    # Now compare the energies and forces between CPU and GPU
    energy_unit = unit.kilocalories_per_mole
    energy_diff = (state_ocl.getPotentialEnergy() - state_cpu.getPotentialEnergy()) / energy_unit
    force_unit = unit.kilocalories_per_mole / unit.angstrom
    force_diff = (state_ocl.getForces(asNumpy=True) - state_cpu.getForces(asNumpy=True)) / force_unit
    force_rms = np.sqrt(np.mean(np.sum(force_diff**2, 1)))
    print(f'step {step+1}: energy difference {energy_diff:.6f} kcal/mol, force RMS difference {force_rms:.6f} kcal/mol/A')

# Clean up
del context_ocl, context_cpu, integrator_ocl, integrator_cpu
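
If it helps, a small helper like the sketch below (illustrative file names, not an existing OpenMM utility) could be called from that loop to dump the serialized System and the diverging State once force_rms crosses some tolerance, so the failing configuration can be examined later:

```
from simtk import openmm

def dump_divergence(system, state, step, tag='diverged'):
    """Write the System and the State where CPU and OpenCL forces diverge to XML files."""
    with open(f'{tag}_system.xml', 'w') as outfile:
        outfile.write(openmm.XmlSerializer.serialize(system))
    with open(f'{tag}_state_{step:04d}.xml', 'w') as outfile:
        outfile.write(openmm.XmlSerializer.serialize(state))
```

For example, call dump_divergence(system, state_ocl, step) inside the loop when force_rms exceeds some tolerance.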

For posting output, this example of collapsible Markdown makes the output much easier to read!

@jchodera
Member

@jchodera: Might a division by zero occur, as mentioned in the context above? Could that be avoided by making sure that the divisors cannot become zero?

Division by zero would certainly cause NaNs, but the fact that there is such a clear divergence in behavior between hardware suggests something else is going on. There are a wide variety of ways that we can end up with NaNs.

Mixed precision is supposed to compute force terms in single precision but accumulate them in double, and to perform all the constraint and integration operations in double. This is a tricky balancing act: a single operation that under- or over-flows because it wasn't carried out in the appropriate precision can cause NaNs to propagate everywhere. But there can be other things going on here, like synchronization issues or indexing bugs that work fine on some architectures but not on others where the sizes of things differ. We've seen a number of those, and this might be the case here.

@Dann239
Contributor

Dann239 commented Aug 27, 2020

Out of curiosity, I tried running this system with a workaround that @peastman suggested in #2817 (see test.py and output.txt). Unfortunately, it doesn't help. This problem looks to be of a different nature.

My specs:
OpenCL: Radeon gfx906 Vega 20
OS: Ubuntu 18.04.2
Driver: ROCm 3.7

Now let me explain what I did. I employed the same debugging techniques @peastman taught me while we were troubleshooting #2817. First, I dumped the system right before it crashed (see critical_state.xml). Then I computed forces and took a look at some of them (see forces.cpp, though it won't compile without proper source modifications since it hooks into the low-level API; see #2817). Unfortunately, I only focused on nonbonded interactions since I had the debugging infrastructure set up there and only there.

id=0
k_ind1=0; k_ind2=0
f1 = ( 1015.16501775; -556.55103659; 1048.70070861 )
f2 = ( 1015.62475586; -557.11712646; 1049.58850098 )

id=1
k_ind1=1; k_ind2=1
f1 = ( -556445.87582530; 796168.75200768; -105177.39221812 )
f2 = ( -556456.56250000; 796168.56250000; -105187.83593750 )

id=2
k_ind1=2; k_ind2=2
f1 = ( -197.69593839; -806.84539419; -3622.15521289 )
f2 = ( -177.13220215; -818.42700195; -3620.41821289 )

id=3
k_ind1=3; k_ind2=3
f1 = ( -2026.02055368; -13.53427895; -1125.08539438 )
f2 = ( -2026.05310059; 16.61643600; -1106.81628418 )

id=4
k_ind1=4; k_ind2=4
f1 = ( -23315.01279736; 30.06598057; -3533.94379172 )
f2 = ( -23353.28320312; 10.35844421; -3538.60424805 )

id=5
k_ind1=5; k_ind2=5
f1 = ( -226442.70627862; 57450.10679949; -121765.70672039 )
f2 = ( -226434.40625000; 57440.19531250; -121762.73437500 )

Then I dumped all individual interactions for particle number 2 on both CUDA-mixed and OpenCL-mixed and sorted them (see parser.cpp, run9_raw.log and run9_sorted.log). The most notable features are:

  1. Not all of the forces match:
CUDA: 7 2 -72.808 OpenCL: 7 2 -72.1986
CUDA: 8 2 -13.0701 OpenCL: 8 2 -13.0255
CUDA: 9 2 -4.07822 OpenCL: 9 2 -4.06379
CUDA: 10 2 -0.173618 OpenCL: 10 2 -0.189778
CUDA: 11 2 -2.1891 OpenCL: 11 2 -2.6992
CUDA: 12 2 0.876459 OpenCL: 12 2 1.29725
CUDA: 13 2 -0.243756 OpenCL: 13 2 -0.317599
  2. There are particles with which OpenCL calculates interactions and CUDA doesn't:
OpenCL: 4059 2 -6.99899
OpenCL: 4059 2 -6.99899
OpenCL: 4063 2 -57.0905
OpenCL: 4063 2 -57.0905
OpenCL: 4065 2 25.7459
OpenCL: 4065 2 25.7459

Then I took a look at two pairs of particles: 2 & 12 and 2 & 4059:

2 & 12:

CUDA: 12 2 a
CUDA: 2 12 a
CUDA: 12 2 a
CUDA: 3067 tiles
CUDA: 2 12 a

OpenCL: 12 2 a
OpenCL: 2 12 a
OpenCL: 12 2 a
OpenCL: 2 12 a
OpenCL: 3067 tiles

So the 'first' COMPUTE_INTERACTION is invoked four times there in each case.

2 & 4059:

CUDA: 4059 2 b
CUDA: 4059 2 b
CUDA: 3067 tiles

OpenCL: 4059 2 b
OpenCL: 4059 2 b
OpenCL: 3067 tiles

This seemed weird at first glance. CUDA does compute that force, though it doesn't appear in the logs. Well, that's because I don't output zero forces:

if((atom1 == debug_atom0 || atom2 == debug_atom0) && !isExcluded && dEdR != 0)
    printf("CUDA: %d %d %g\n", atom1, atom2, dEdR);

So in the 'second' COMPUTE_INTERACTION OpenCL gives non-zero forces in places where they should be zero.

The next test I would personally do is stripping the system of nonbonded forces and seeing whether problems appear in other interactions, but I don't see an easy way to do so without painstakingly taking apart those XML files.
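
One possibly easier route (a hedged sketch; the class names to drop are just a guess at what counts as "nonbonded" here) would be to deserialize system.xml, remove the relevant Force objects programmatically, and re-serialize it, rather than editing the XML by hand:

```
from simtk import openmm

def read_xml(filename):
    with open(filename) as infile:
        return openmm.XmlSerializer.deserialize(infile.read())

system = read_xml('system.xml')

drop = ('NonbondedForce', 'CustomNonbondedForce', 'CustomBondForce')
# Remove from the end so earlier indices stay valid while deleting.
for index in reversed(range(system.getNumForces())):
    if system.getForce(index).__class__.__name__ in drop:
        system.removeForce(index)

with open('system_no_nonbonded.xml', 'w') as outfile:
    outfile.write(openmm.XmlSerializer.serialize(system))
```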

Here are my tests and logs:
RUN9_dann239.zip

@ThWuensche
Author

@jchodera:
Thank you for the proposed script. But shouldn't the state loaded into the CPU context be taken before the OpenCL step is computed? And should I perhaps also execute the step on the CPU, which is mentioned in your comment but does not seem to be executed, and then get both new states and compare them? Or am I missing something? I will continue with this in the evening and report back.

For what it's worth, here are the patches I tried to system.xml:

*** system.xml	2020-08-26 22:20:01.625782212 +0200
--- system.xml.check	2020-08-26 22:19:35.033131261 +0200
***************
*** 408593,408599 ****
  				<Exception eps=".10807766844265286" p1="4684" p2="90543" q="-.027745466831098173" sig=".2824772565607574"/>
  			</Exceptions>
  		</Force>
! 		<Force cutoff="1" energy="U_sterics;U_sterics = 4*epsilon*x*(x-1.0); x = (sigma/reff_sterics)^6;epsilon = (1-lambda_sterics)*epsilonA + lambda_sterics*epsilonB;reff_sterics = sigma*((softcore_alpha*lambda_alpha + (r/sigma)^6))^(1/6);sigma = (1-lambda_sterics)*sigmaA + lambda_sterics*sigmaB;lambda_alpha = new_interaction*(1-lambda_sterics_insert) + old_interaction*lambda_sterics_delete;lambda_sterics = core_interaction*lambda_sterics_core + new_interaction*lambda_sterics_insert + old_interaction*lambda_sterics_delete;core_interaction = delta(unique_old1+unique_old2+unique_new1+unique_new2);new_interaction = max(unique_new1, unique_new2);old_interaction = max(unique_old1, unique_old2);epsilonA = sqrt(epsilonA1*epsilonA2);epsilonB = sqrt(epsilonB1*epsilonB2);sigmaA = 0.5*(sigmaA1 + sigmaA2);sigmaB = 0.5*(sigmaB1 + sigmaB2);" forceGroup="0" method="2" switchingDistance="-1" type="CustomNonbondedForce" useLongRangeCorrection="0" useSwitchingFunction="0" version="2">
  			<PerParticleParameters>
  				<Parameter name="sigmaA"/>
  				<Parameter name="epsilonA"/>
--- 408593,408599 ----
  				<Exception eps=".10807766844265286" p1="4684" p2="90543" q="-.027745466831098173" sig=".2824772565607574"/>
  			</Exceptions>
  		</Force>
! 		<Force cutoff="1" energy="U_sterics;U_sterics = 4*epsilon*x*(x-1.0); x = (sigma/reff_sterics)^6;epsilon = (1-lambda_sterics)*epsilonA + lambda_sterics*epsilonB;reff_sterics = select(reff_sterics_unchecked,reff_sterics_unchecked,0.0000001);reff_sterics_unchecked = sigma*(max(0,(softcore_alpha*lambda_alpha + (r/sigma)^6)))^(1/6);sigma = select(sigma_unchecked,sigma_unchecked,0.0000001);sigma_unchecked = (1-lambda_sterics)*sigmaA + lambda_sterics*sigmaB;lambda_alpha = new_interaction*(1-lambda_sterics_insert) + old_interaction*lambda_sterics_delete;lambda_sterics = core_interaction*lambda_sterics_core + new_interaction*lambda_sterics_insert + old_interaction*lambda_sterics_delete;core_interaction = delta(unique_old1+unique_old2+unique_new1+unique_new2);new_interaction = max(unique_new1, unique_new2);old_interaction = max(unique_old1, unique_old2);epsilonA = sqrt(max(0,epsilonA1*epsilonA2));epsilonB = sqrt(max(0,epsilonB1*epsilonB2));sigmaA = 0.5*(sigmaA1 + sigmaA2);sigmaB = 0.5*(sigmaB1 + sigmaB2);" forceGroup="0" method="2" switchingDistance="-1" type="CustomNonbondedForce" useLongRangeCorrection="0" useSwitchingFunction="0" version="2">
  			<PerParticleParameters>
  				<Parameter name="sigmaA"/>
  				<Parameter name="epsilonA"/>
***************
*** 882499,882505 ****
  				</InteractionGroup>
  			</InteractionGroups>
  		</Force>
! 		<Force energy="U_electrostatics + U_sterics;U_sterics = 4*epsilon*x*(x-1.0); x = (sigma/reff_sterics)^6;reff_sterics = sigma*((softcore_alpha*lambda_alpha + (r/sigma)^6))^(1/6);reff_sterics = sigma*((softcore_alpha*lambda_alpha + (r/sigma)^6))^(1/6);lambda_alpha = new_interaction*(1-lambda_sterics_insert) + old_interaction*lambda_sterics_delete;U_electrostatics = (lambda_electrostatics_insert * unique_new + unique_old * (1 - lambda_electrostatics_delete)) * ONE_4PI_EPS0*chargeProd/r;ONE_4PI_EPS0 = 138.935456;epsilon = (1-lambda_sterics)*epsilonA + lambda_sterics*epsilonB;sigma = (1-lambda_sterics)*sigmaA + lambda_sterics*sigmaB;lambda_sterics = new_interaction*lambda_sterics_insert + old_interaction*lambda_sterics_delete;new_interaction = delta(1-unique_new); old_interaction = delta(1-unique_old);" forceGroup="0" type="CustomBondForce" usesPeriodic="0" version="3">
  			<PerBondParameters>
  				<Parameter name="chargeProd"/>
  				<Parameter name="sigmaA"/>
--- 882499,882505 ----
  				</InteractionGroup>
  			</InteractionGroups>
  		</Force>
! 		<Force energy="U_electrostatics + U_sterics;U_sterics = 4*epsilon*x*(x-1.0); x = (sigma/reff_sterics)^6;reff_sterics = select(reff_sterics_unchecked,reff_sterics_unchecked,0.0000001);reff_sterics_unchecked = sigma*(max(0,(softcore_alpha*lambda_alpha + (r/sigma)^6)))^(1/6);lambda_alpha = new_interaction*(1-lambda_sterics_insert) + old_interaction*lambda_sterics_delete;U_electrostatics = (lambda_electrostatics_insert * unique_new + unique_old * (1 - lambda_electrostatics_delete)) * ONE_4PI_EPS0*chargeProd/r;ONE_4PI_EPS0 = 138.935456;epsilon = (1-lambda_sterics)*epsilonA + lambda_sterics*epsilonB;sigma = select(sigma_unchecked,sigma_unchecked,0.0000001);sigma_unchecked = (1-lambda_sterics)*sigmaA + lambda_sterics*sigmaB;lambda_sterics = new_interaction*lambda_sterics_insert + old_interaction*lambda_sterics_delete;new_interaction = delta(1-unique_new); old_interaction = delta(1-unique_old);" forceGroup="0" type="CustomBondForce" usesPeriodic="0" version="3">
  			<PerBondParameters>
  				<Parameter name="chargeProd"/>
  				<Parameter name="sigmaA"/>

@peastman
Member

@ThWuensche see my comments above: #2813 (comment). Can you provide a working example that you've verified reproduces this problem?

@jchodera
Member

Hi @peastman : I wrote the original test.py script linked to in #2813 (comment), so it's likely I made those mistakes. Is that the test script you're referring to that needs to be fixed? If so, I'll get right on that.

@peastman
Member

Yes. I was able to fix the errors, but once I did it ran without problem. So I need an example that's been verified to actually reproduce the problem.

@ThWuensche
Author

@peastman: The script here #2813 (comment) reproduces the problem on my Radeon VIIs. The output from the script (also in that comment) shows the problem.

@Dann239
Contributor

Dann239 commented Aug 27, 2020

@peastman: The test.py that @jchodera provided reproduces the problem for me too (on a Radeon gfx906 Vega 20 AND on a Radeon RX 5700XT, where it even crashes on the very first step) after applying your fixes and some modifications that @jchodera suggested in order to compare OpenCL and CPU forces. However, it seems to work fine on a FirePro S9150. If you'd like to see the exact script I ran, it's in my previous comment.

@peastman
Member

The script here #2813 (comment) reproduces the problem on my Radeon VIIs.

Thanks! Let me see what I can figure out.

@ThWuensche
Author

@peastman I have packed the script with the modifications proposed by @jchodera, the output of the script (mixed.txt), and logfiles containing positions, velocities and forces for every step, together with the system description of the captured F@H WU, here: https://drive.google.com/file/d/1_1JJGJJsXq6WsSjFygVquIjpWJnqMwXM/view?usp=sharing. The archive is rather large (756 MB) due to the logs.

@peastman
Member

I've hopefully fixed the problem. At any rate, I fixed a problem, and it didn't actually have anything to do with AMD GPUs: it was just an uninitialized memory error. It was related to the use of global parameters to modify the particle charges. The code keeps copies of the parameter values on both the host and the device. It was initializing the host values to 0 but failing to initialize the device values, so if the global parameters all happened to actually be 0 at the start of the simulation, it would assume nothing had changed and never upload the correct values.

The fix is in #2819.
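
For intuition, here is a toy, Python-flavored illustration of the caching pattern described above; it is not the actual OpenMM code or the fix in #2819:

```
class GlobalParameterCache:
    """Toy model of the bug: the host cache starts at 0.0, the device copy is
    never initialized, and values are only uploaded when they appear to change."""

    def __init__(self, num_parameters):
        self.host_values = [0.0] * num_parameters      # initialized to zero
        self.device_values = [None] * num_parameters   # stands in for uninitialized GPU memory

    def set_parameters(self, new_values):
        for i, value in enumerate(new_values):
            if value != self.host_values[i]:           # "did anything change?"
                self.host_values[i] = value
                self.device_values[i] = value          # upload happens only here
        # If a parameter is 0.0 from the very start, it never looks "changed",
        # so device_values[i] keeps whatever garbage was already there.
```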

Once I fixed that, your script works well. I modified the two print statements to change np.sum to np.mean, so it outputs the mean difference in forces and positions for each particle. That's an easier number to interpret. Here's what I now get:

Running simulation...
2.386573478198737 kJ/(nm mol)
0.00011036955640494087 nm
completed 1 steps
6.760362108363513 kJ/(nm mol)
0.00034293573314681495 nm
completed 2 steps
11.72800892101006 kJ/(nm mol)
0.0006279327266006615 nm
completed 3 steps
16.5977852323136 kJ/(nm mol)
0.0009498048167477005 nm
completed 4 steps
21.109005194026416 kJ/(nm mol)
0.0012923188912590308 nm
completed 5 steps

The numbers grow gradually, as you'd expect for two simulations that are diverging, but everything looks reasonable.

@ThWuensche
Author

@peastman: Thanks a lot! Below is my comment from a private message to @jchodera from the beginning of July :)

My company develops and manufactures electronics for industrial automation, so I don't have experience with GPU-based computing, but I do have experience with embedded software development. From that experience I would consider the following potential causes of the problems:

  • improperly initialized or uninitialized memory
  • unsynchronized access to memory from different threads
  • improper use of memory locations (such as wrong pointers, or reading/writing beyond the end of a memory block or the stack)

Such errors typically result in failures that depend on the environment (hardware, compilers, ...) and are difficult to reproduce. It is worth noting that the WUs that fail mostly fail at the very beginning; once a WU has run for some time, it most likely completes.

@peastman
Member

That seems accurate. Can you try the fix and confirm whether it fixes the problem for you?

@ThWuensche
Author

ThWuensche commented Aug 27, 2020

Sorry for the dumb question: to try it, do I have to build OpenMM from git? That's what I'm doing right now. Or is there a build I can install through miniconda (which is my current install)? That would probably be faster to test.

@peastman
Member

You would have to build it from source. Although once we merge the PR, it will be included in the next nightly build. We can do that if you have trouble compiling.

@ThWuensche
Author

Looks rather good! It ran twice up to the configured 1000 steps without breaking, which it never did before.

@peastman, you found the cause rather quickly, congratulations! When I offered @jchodera that I would dive into the problem, I was expecting to spend a month digging into the source. I bothered both of you more than planned, but the speed of getting to the result was probably worth it.

@jchodera, is it possible to get and run a modified FAH core to test it in the wild?

@peastman
Member

Excellent! I'll merge the fix. I'm very happy to track down bugs. The main obstacle is getting a test case I can use to reproduce it. Once I have that, it's just a matter of running the test repeatedly, tracing the error backward until I identify the point where things first go wrong. But if I can't reproduce an error, that's when things are really hard.
