
Surprising differences between nnpops and torchani #82

Closed · wiederm opened this issue Feb 12, 2023 · 17 comments · Fixed by #83

Labels: help wanted (Extra attention is needed)

Comments

wiederm commented Feb 12, 2023

Hi,

I have been getting surprising results running waterbox simulations with the torchani vs. the nnpops implementation of ani2x. I used the openmmtools waterbox test system with an edge length of 20 Å and a 1 fs timestep, and simulated for 1 ns using a Langevin integrator with a collision rate of 1/ps. The system was set up with potential.createSystem().

When I run NpT simulations with the torchani implementation at 300 K, everything looks relatively normal (although the density is a bit too high and the RDF shows some surprising signal):
[image: density and RDF, torchani NpT]
When I perform the same simulation with the nnpops implementation I see this:
[image: density and RDF, nnpops NpT]
and the simulation box has shrunk (the initial box size is the yellow-outlined square). Also note the difference in the y-axis scale for the potential energy.
[image: simulation box and potential energy]

In NVT I observe vacuum bubbles with nnpops:
https://user-images.githubusercontent.com/31651017/218335894-0254ed80-e51f-4189-9bfc-ae94637cfd85.mp4

compared to the same simulation with torchani:
https://user-images.githubusercontent.com/31651017/218335817-3e911757-d19d-4f71-b922-8b9de913237e.mp4

I attached a minimal example to reproduce the simulations.

min_example.py.zip
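
For context, the NpT setup described above looks roughly like the sketch below. This is reconstructed from the description rather than taken from the attached script, and the barostat settings and reporting interval are assumptions:

# Rough sketch of the described NpT waterbox setup (reconstructed from the text;
# the barostat settings and reporting interval are assumptions, not from the attached script).
from sys import stdout
from openmm import unit, LangevinIntegrator, MonteCarloBarostat, Platform
from openmm.app import Simulation, StateDataReporter
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox

waterbox = WaterBox(20 * unit.angstrom, constrained=False)
potential = MLPotential("ani2x")
# implementation is either 'torchani' or 'nnpops'
system = potential.createSystem(waterbox.topology, implementation="nnpops",
                                removeConstraints=True)
system.addForce(MonteCarloBarostat(1 * unit.atmosphere, 300 * unit.kelvin))

integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond,
                                1 * unit.femtosecond)
platform = Platform.getPlatformByName("CUDA")
simulation = Simulation(waterbox.topology, system, integrator, platform,
                        {"Precision": "mixed"})
simulation.context.setPositions(waterbox.positions)
simulation.reporters.append(StateDataReporter(stdout, 1000, step=True,
                                              potentialEnergy=True,
                                              temperature=True, density=True))
simulation.step(1_000_000)  # 1 ns at a 1 fs timestep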

wiederm commented Feb 13, 2023

The relevant packages in my environment are:

openmm                    8.0.0           py310h5728c26_0    conda-forge
openmm-torch              1.0             cuda112py310hdb05021_0    conda-forge
openmmml                  1.0                      pypi_0    pypi
openmmtools               0.21.4             pyhd8ed1ab_0    conda-forge
nnpops                    0.2             cuda112py310h85a0d14_4    conda-forge

raimis added the 'help wanted' label (Extra attention is needed) on Feb 13, 2023
raimis added this to 'In progress' in Accelerated NNP in OpenMM via automation on Feb 13, 2023
raimis commented Feb 13, 2023

@RaulPPelaez could you try to reproduce and verify the issue?

RaulPPelaez commented:

Out of curiosity, what are you using for visualization?

RaulPPelaez commented:

Could you try running with nnpops=0.3?

wiederm commented Feb 14, 2023

I am using nglview in combination with ipywidgets.
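
For reference, loading an HDF5 trajectory (as written by mdtraj's HDF5Reporter) into nglview might look roughly like this sketch; the file name is illustrative:

# Minimal nglview/ipywidgets sketch for inspecting the waterbox trajectory in a
# Jupyter notebook (assumes a trajectory written with mdtraj's HDF5Reporter).
import mdtraj as md
import nglview as nv

traj = md.load_hdf5("tmp.h5")   # hypothetical trajectory file from the simulation
view = nv.show_mdtraj(traj)     # returns an NGLWidget
view.add_unitcell()             # draw the periodic box to spot shrinking or voids
view                            # last expression in a cell renders the widget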

I have started simulations with nnpops=0.3 and will update with the results in a few hours.

wiederm commented Feb 14, 2023

Looking at just the first few ps of simulation time, I see qualitatively the same behavior in NpT and NVT with ani2x and nnpops=0.3:
[image: simulation results with nnpops=0.3]

Relevant packages:

openmm                    8.0.0           py310h5728c26_0    conda-forge
openmm-torch              1.0             cuda112py310hbd91edb_0    conda-forge
openmmml                  1.0                      pypi_0    pypi
openmmtools               0.21.5             pyhd8ed1ab_0    conda-forge
nnpops                    0.3             cuda112py310hd4d1af5_2    conda-forge

wiederm commented Feb 15, 2023

With the updated nnpops=0.3, simulations with the torchani implementation that previously ran fine now explode (with the nnpops implementation they behave as described above). The following script runs fine with nnpops=0.2, but after upgrading to nnpops=0.3 the temperature increases significantly and hydrogen-oxygen bonds break.

This is a minimal example to reproduce the issue:

from openmm import unit, LangevinIntegrator, Platform
from openmm.app import Simulation, StateDataReporter
from mdtraj.reporters import HDF5Reporter
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
from openmm.unit import Quantity
from sys import stdout

waterbox = WaterBox(15 * unit.angstrom, constrained=False)
potential = MLPotential('ani2x')
system = potential.createSystem(waterbox.topology, implementation='torchani', removeConstraints=True)
####################
# define simulation parameters
stepsize = Quantity(1., unit.femto * unit.seconds)
collision_rate = 1 / Quantity(1, unit.pico * unit.second)
temperature = Quantity(300, unit.kelvin)

integrator = LangevinIntegrator(
    temperature, collision_rate, stepsize
)

platform = Platform.getPlatformByName('CUDA')
simulation = Simulation(waterbox.topology, system, integrator, platform=platform, platformProperties={'Precision': 'mixed'})
simulation.context.setPositions(waterbox.positions)

simulation.reporters.append(
    HDF5Reporter(
        "./tmp.h5",
        10,
    )
)
simulation.reporters.append(StateDataReporter(stdout, 10, step=True, potentialEnergy=True, temperature=True))
simulation.step(5_000)

RaulPPelaez commented:

I am able to reproduce this. We believe it might be a problem with periodic boundary conditions in nnpops.
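
One quick way to probe a PBC problem (a sketch of a generic check, not code from this thread): shift a single molecule by one full box vector and compare energies. A potential that applies the minimum-image convention should return essentially the same energy; an implementation that ignores the cell will not.

# Hypothetical PBC sanity check: translating one water molecule by a full box
# vector is a no-op under periodic boundary conditions, so the energy should
# be essentially unchanged.
import numpy as np
from openmm import unit, LangevinIntegrator, Platform
from openmm.app import Simulation
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox

testsystem = WaterBox(15 * unit.angstrom, cutoff=7 * unit.angstrom)
potential = MLPotential("ani2x")
system = potential.createSystem(testsystem.topology, implementation="nnpops")
integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 1 * unit.femtosecond)
simulation = Simulation(testsystem.topology, system, integrator,
                        Platform.getPlatformByName("CUDA"))
simulation.context.setPositions(testsystem.positions)
e0 = simulation.context.getState(getEnergy=True).getPotentialEnergy()

pos = np.array(testsystem.positions.value_in_unit(unit.nanometer))
box_a = np.array(testsystem.system.getDefaultPeriodicBoxVectors()[0].value_in_unit(unit.nanometer))
pos[0:3] += box_a  # move the first water molecule (atoms 0-2) by one box vector
simulation.context.setPositions(pos * unit.nanometer)
e1 = simulation.context.getState(getEnergy=True).getPotentialEnergy()
print(e0, e1)  # should agree closely if PBC is handled correctly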

RaulPPelaez commented:

I compared the norms of the forces provided by both implementations using this code:

from openmm.app import Simulation
from openmm import unit, LangevinIntegrator, Platform
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
import numpy as np

box_edge = 15 * unit.angstrom
testsystem = WaterBox(box_edge, cutoff=7 * unit.angstrom)
potential = MLPotential("ani2x")
platform = Platform.getPlatformByName("CUDA")
prop = dict(CudaPrecision="mixed")
forces={}
positions=[]
for s in ("nnpops","torchani"):
    system = potential.createSystem(
        testsystem.topology, implementation=s
    )
    integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 0 * unit.picoseconds)
    simulation=Simulation(testsystem.topology, system, integrator, platform, prop)
    simulation.context.setPositions(testsystem.positions)
    forces[s] = simulation.context.getState(getForces=True).getForces().value_in_unit(unit.kilojoules/unit.mole/unit.nanometer)
    positions = simulation.context.getState(getPositions=True, enforcePeriodicBox=True).getPositions().value_in_unit(unit.nanometer)

fnorms_nnpops=np.linalg.norm(forces["nnpops"],axis=1)
fnorms_torchani=np.linalg.norm(forces["torchani"],axis=1)
with open("fnorms.dat", 'w') as f:
    for p,i,j in zip(positions, fnorms_nnpops, fnorms_torchani):
        f.write(f"{p.x} {p.y} {p.z} {i} {j}\n")

error = np.abs((fnorms_nnpops - fnorms_torchani)/fnorms_torchani)

Then, I take all the particles with height near the center of the domain (z ∈ [0.55, 0.95] nm) and plot a heatmap of the relative error in the force norm:
[image: heatmap of the relative force-norm error]

This could be consistent with a bug in the PBC handling.
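
For completeness, the slab selection and plot could be reproduced roughly like this (a sketch that reads the fnorms.dat file written above and colors a scatter plot by the relative error; the exact binning of the original heatmap is not reproduced):

# Rough sketch of the slab selection and error plot, assuming fnorms.dat has
# columns: x y z |F_nnpops| |F_torchani| (as written by the script above).
import numpy as np
import matplotlib.pyplot as plt

x, y, z, f_nnpops, f_torchani = np.loadtxt("fnorms.dat", unpack=True)
rel_err = np.abs((f_nnpops - f_torchani) / f_torchani)

mask = (z > 0.55) & (z < 0.95)  # particles near the center of the box in z
plt.scatter(x[mask], y[mask], c=rel_err[mask], cmap="viridis")
plt.colorbar(label="relative error in |F|")
plt.xlabel("x (nm)")
plt.ylabel("y (nm)")
plt.savefig("force_error_map.png")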

RaulPPelaez commented:

I turned off the optimized components one by one here:

# Optimize the components of an ANI model
self.species_converter = TorchANISpeciesConverter(model.species_converter, atomicNumbers)
self.aev_computer = TorchANISymmetryFunctions(model.species_converter, model.aev_computer, atomicNumbers)
self.neural_networks = TorchANIBatchedNN(model.species_converter, model.neural_networks, atomicNumbers)
self.energy_shifter = TorchANIEnergyShifter(model.species_converter, model.energy_shifter, atomicNumbers)

The error arises from the aev_computer. Replacing that line with:

self.aev_computer = model.aev_computer

results in relative errors around machine precision, as one would expect.
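
The same bisection can also be done without editing the NNPOps source by swapping components on an OptimizedTorchANI instance. The sketch below is illustrative only: a single gas-phase water is used as the test molecule, and the attribute swap assumes the wrapper calls self.aev_computer with the TorchANI (species, coordinates) convention, as in the snippet above.

# Hypothetical bisection of the optimized components: build an OptimizedTorchANI
# wrapper, then swap one optimized module back to the reference TorchANI one.
import torch
import torchani
from NNPOps import OptimizedTorchANI

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchani.models.ANI2x(periodic_table_index=True).to(device)

species = torch.tensor([[8, 1, 1]], device=device)            # one water molecule
positions = torch.tensor([[[0.00, 0.00, 0.00],
                           [0.96, 0.00, 0.00],
                           [-0.24, 0.93, 0.00]]],
                         dtype=torch.float32, device=device)  # Angstrom

optimized = OptimizedTorchANI(model, species).to(device)

# Swap the optimized AEV computer for the reference TorchANI one; repeating this
# for each component isolates where the discrepancy comes from.
optimized.aev_computer = model.aev_computer

e_ref = model((species, positions)).energies
e_opt = optimized((species, positions)).energies
print((e_ref - e_opt).item())  # expected to be near machine precision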

wiederm commented Feb 22, 2023

Thanks for tracking this down! I will give it a try!

RaulPPelaez commented:

Many of the CUDA tests in NNPOps fail for me. @peastman @raimis, could you confirm?

Test project /shared/raul/NNPOps/build
      Start  1: TestCpuANISymmetryFunctions
 1/13 Test  #1: TestCpuANISymmetryFunctions ......   Passed    1.50 sec
      Start  2: TestCpuCFConv
 2/13 Test  #2: TestCpuCFConv ....................   Passed    0.48 sec
      Start  3: TestCudaANISymmetryFunctions
 3/13 Test  #3: TestCudaANISymmetryFunctions .....Subprocess aborted***Exception:   1.80 sec
      Start  4: TestCudaCFConv
 4/13 Test  #4: TestCudaCFConv ...................Subprocess aborted***Exception:   1.82 sec
      Start  5: TestBatchedNN


 5/13 Test  #5: TestBatchedNN ....................***Failed  203.79 sec
      Start  6: TestCFConv
 6/13 Test  #6: TestCFConv .......................   Passed    5.28 sec
      Start  7: TestCFConvNeighbors
 7/13 Test  #7: TestCFConvNeighbors ..............   Passed    3.38 sec
      Start  8: TestEnergyShifter
 8/13 Test  #8: TestEnergyShifter ................   Passed  108.41 sec
      Start  9: TestOptimizedTorchANI
 9/13 Test  #9: TestOptimizedTorchANI ............***Failed  216.59 sec
      Start 10: TestSpeciesConverter
10/13 Test #10: TestSpeciesConverter .............   Passed  111.40 sec
      Start 11: TestSymmetryFunctions
11/13 Test #11: TestSymmetryFunctions ............***Failed  129.94 sec
      Start 12: TestNeighbors
12/13 Test #12: TestNeighbors ....................***Exception: SegFault  5.57 sec
      Start 13: TestGetNeighborPairs
13/13 Test #13: TestGetNeighborPairs .............   Passed    2.47 sec

54% tests passed, 6 tests failed out of 13

Total Test time (real) = 792.56 sec

The following tests FAILED:
          3 - TestCudaANISymmetryFunctions (Subprocess aborted)
          4 - TestCudaCFConv (Subprocess aborted)
          5 - TestBatchedNN (Failed)
          9 - TestOptimizedTorchANI (Failed)
         11 - TestSymmetryFunctions (Failed)
         12 - TestNeighbors (SEGFAULT)

sef43 commented Feb 24, 2023

Mine mostly pass:

      Start  1: TestCpuANISymmetryFunctions
 1/13 Test  #1: TestCpuANISymmetryFunctions ......   Passed    2.22 sec
      Start  2: TestCpuCFConv
 2/13 Test  #2: TestCpuCFConv ....................   Passed    0.87 sec
      Start  3: TestCudaANISymmetryFunctions
 3/13 Test  #3: TestCudaANISymmetryFunctions .....   Passed    2.65 sec
      Start  4: TestCudaCFConv
 4/13 Test  #4: TestCudaCFConv ...................   Passed    3.03 sec
      Start  5: TestBatchedNN
 5/13 Test  #5: TestBatchedNN ....................***Failed  155.24 sec
      Start  6: TestCFConv
 6/13 Test  #6: TestCFConv .......................***Failed    4.81 sec
      Start  7: TestCFConvNeighbors
 7/13 Test  #7: TestCFConvNeighbors ..............   Passed    4.92 sec
      Start  8: TestEnergyShifter
 8/13 Test  #8: TestEnergyShifter ................   Passed   97.31 sec
      Start  9: TestOptimizedTorchANI
 9/13 Test  #9: TestOptimizedTorchANI ............   Passed  163.42 sec
      Start 10: TestSpeciesConverter
10/13 Test #10: TestSpeciesConverter .............   Passed  105.67 sec
      Start 11: TestSymmetryFunctions
11/13 Test #11: TestSymmetryFunctions ............   Passed  120.57 sec
      Start 12: TestNeighbors
12/13 Test #12: TestNeighbors ....................   Passed    9.12 sec
      Start 13: TestGetNeighborPairs
13/13 Test #13: TestGetNeighborPairs .............   Passed    2.79 sec

85% tests passed, 2 tests failed out of 13

Total Test time (real) = 672.64 sec

The following tests FAILED:
	  5 - TestBatchedNN (Failed)
	  6 - TestCFConv (Failed)

The failures are just a couple of floating-point assertion errors that might be stochastic.

This was using a build environment created with NNPOps/environment.yaml, but with cudatoolkit=11.7 to match my installed CUDA version.

sef43 commented Feb 24, 2023

I think these lines in SymmetryFunctions.cpp

            if (device.is_cpu()) {
                impl = std::make_shared<CpuANISymmetryFunctions>(numAtoms, numSpecies, Rcr, Rca, false, atomSpecies_, radialFunctions, angularFunctions, true);
                                                                                                 ^^^^^

#ifdef ENABLE_CUDA
            } else if (device.is_cuda()) {
                // PyTorch allow to chose GPU with "torch.device", but it doesn't set as the default one.
                CHECK_CUDA_RESULT(cudaSetDevice(device.index()));
                impl = std::make_shared<CudaANISymmetryFunctions>(numAtoms, numSpecies, Rcr, Rca, false, atomSpecies_, radialFunctions, angularFunctions, true);
                                                                                                 ^^^^^
#endif
            } else

mean that the periodic flag is hard-coded to false, so the templated <PERIODIC> displacement/distance calculations in CpuANISymmetryFunctions.cpp / CudaANISymmetryFunctions.cu are never actually used. Or am I misunderstanding how the .cpp/.cu code is called from Python?

peastman commented:

I believe you're correct about that. If we replace false with cellPtr != nullptr, that should pass the correct value for whether to use periodic boundary conditions.

RaulPPelaez commented:

Amazing catch, @sef43! @peastman's suggestion fixes the issue. The following test now passes:

from openmm.app import Simulation
from openmm import unit, LangevinIntegrator, Platform
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
import numpy as np

box_edge = 15 * unit.angstrom
testsystem = WaterBox(box_edge, cutoff=7 * unit.angstrom)
potential = MLPotential("ani2x")
platform = Platform.getPlatformByName("CPU")
prop = dict(CudaPrecision="mixed")
forces={}
positions=[]
for s in ("nnpops","torchani"):
    system = potential.createSystem(
        testsystem.topology, implementation=s
    )
    print(f"Implementation {s}")
    file=open(f"{s}.dat", 'w')
    integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 0 * unit.picoseconds)
    simulation=Simulation(testsystem.topology, system, integrator, platform, prop)
    simulation.context.setPositions(testsystem.positions)
    forces[s] = simulation.context.getState(getForces=True).getForces().value_in_unit(unit.kilojoules/unit.mole/unit.nanometer)
    positions = simulation.context.getState(getPositions=True, enforcePeriodicBox=True).getPositions().value_in_unit(unit.nanometer)

fnorms_nnpops=np.linalg.norm(forces["nnpops"],axis=1)
fnorms_torchani=np.linalg.norm(forces["torchani"],axis=1)

error = np.abs((fnorms_nnpops - fnorms_torchani)/fnorms_torchani)
print(f"Maximum error: {np.max(error)}")
print(f"Mean error: {np.mean(error)}")
print(f"Std error: {np.std(error)}")

Printing:

Maximum error: 1.1337166912735413e-05
Mean error: 1.2383139611804328e-06
Std error: 1.5981482745315712e-06

Accelerated NNP in OpenMM automation moved this from 'In progress' to 'Done' on Feb 28, 2023
wiederm commented Mar 2, 2023

This is great, thank you!
