torchani faster than nnpops for larger systems? #85

Open · wiederm opened this issue Mar 2, 2023 · 12 comments
Labels: help wanted (Extra attention is needed)

@wiederm commented Mar 2, 2023

Hi,

when I simulate a 15 Angstrom water box with both the torchani and the nnpops implementation, the torchani implementation is slightly faster. Does nnpops only outperform torchani for small system sizes? I have attached a minimal example that reproduces the output shown below.

# NNPOPS
Implementation: nnpops

MD run: 1000 steps
#"Step" "Time (ps)"     "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100     0.10000000000000007     -20461968.233400125     0
200     0.20000000000000015     -20462109.582584146     3.02
300     0.3000000000000002      -20462215.08696869      3.02
400     0.4000000000000003      -20462184.75506845      3.02
500     0.5000000000000003      -20462176.182438154     3.02
600     0.6000000000000004      -20462290.934872355     3.02
700     0.7000000000000005      -20462276.06124924      3.02
800     0.8000000000000006      -20462268.749944247     3.01
900     0.9000000000000007      -20462303.856101606     3.01
1000    1.0000000000000007      -20462353.939166784     3.01
# TorchANI
Implementation: torchani

MD run: 1000 steps
#"Step" "Time (ps)"     "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100     0.10000000000000007     -20456827.93509699      0
200     0.20000000000000015     -20453552.138266437     3.36
300     0.3000000000000002      -20446930.31249438      3.39
400     0.4000000000000003      -20442156.674454395     3.39
500     0.5000000000000003      -20434295.0773298       2.97
600     0.6000000000000004      -20432329.317804128     3.03
700     0.7000000000000005      -20427635.139502555     3
800     0.8000000000000006      -20422604.906581655     3.04
900     0.9000000000000007      -20420074.77440338      3.07
1000    1.0000000000000007      -20414884.105911426     3.09

min.py.zip

@RaulPPelaez (Contributor)

In your code, AFAIK, you are using the CPU implementation. Is this intended?

@JohannesKarwou

It's not intended to use the CPU implementation.
I thought `implementation="nnpops"` uses the GPU by default (that's how nnpops is invoked in openmm-ml, https://github.com/openmm/openmm-ml/blob/c3d8c28eb92bf5c4b16efb81ad7a44b707fc5907/openmmml/models/anipotential.py#L89, when calling createSystem). When I run the script on my machine the GPU is used, but I get similar results to @wiederm (torchani and nnpops are equally fast).
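(A quick way to verify this is to query the OpenMM context for the platform it actually ended up on; a minimal sketch, assuming a `simulation` object like the one created in the scripts in this thread:)

# Minimal check of which OpenMM platform a Simulation actually ended up on.
# Assumes an existing `simulation` object, as in the scripts in this thread.
platform = simulation.context.getPlatform()
print("Platform:", platform.getName())  # e.g. "CUDA" or "CPU"
# Platform-specific properties (e.g. the CUDA device index and precision)
for name in platform.getPropertyNames():
    print(name, "=", platform.getPropertyValue(simulation.context, name))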

@RaulPPelaez (Contributor) commented Mar 2, 2023

Take this minimal example, which is similar to the one provided by @wiederm:

import sys

from openmm import LangevinIntegrator, unit, Platform
from openmm.app import Simulation, StateDataReporter
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox

# parameters that the user may want to modify
steps = 1000
waterbox = WaterBox(box_edge=15 * unit.angstrom)
nnp = MLPotential("ani2x")
platform = Platform.getPlatformByName("CUDA")
prop = dict(CudaPrecision="mixed")

for implementation in ("nnpops","torchani"):
    print(f"Implementation: {implementation}")
    ml_system = nnp.createSystem(waterbox.topology, implementation=implementation)
    simulation = Simulation(
        waterbox.topology,
        ml_system,
        LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 1 * unit.femtosecond),
        platform, prop
    )
    simulation.context.setPositions(waterbox.positions)
    # Production
    if steps > 0:
        print("\nMD run: %s steps" % steps)
        simulation.reporters.append(
            StateDataReporter(
                sys.stdout,
                reportInterval=100,
                step=True,
                time=True,
                potentialEnergy=True,
                speed=True,
                separator="\t",
            )
        )
        simulation.step(steps)

On my GPU, an RTX 2080 Ti, I get this:

Implementation: nnpops

MD run: 1000 steps
#"Step"	"Time (ps)"	"Potential Energy (kJ/mole)"	"Speed (ns/day)"
100	0.10000000000000007	-20461978.001629103	0
200	0.20000000000000015	-20462133.848855495	6.8
300	0.3000000000000002	-20462153.789688706	6.79
400	0.4000000000000003	-20462202.823631693	6.79
500	0.5000000000000003	-20462257.760451913	6.79
600	0.6000000000000004	-20462329.421256337	6.79
700	0.7000000000000005	-20462362.9969222	6.8
800	0.8000000000000006	-20462488.402703974	6.8
900	0.9000000000000007	-20462532.231097963	6.8
1000	1.0000000000000007	-20462481.48763666	6.8
Implementation: torchani

MD run: 1000 steps
#"Step"	"Time (ps)"	"Potential Energy (kJ/mole)"	"Speed (ns/day)"
100	0.10000000000000007	-20456285.324413814	0
200	0.20000000000000015	-20451616.878087416	2.48
300	0.3000000000000002	-20445519.385244645	2.49
400	0.4000000000000003	-20438851.384950936	2.19
500	0.5000000000000003	-20431004.40918439	2.13
600	0.6000000000000004	-20426584.870540198	2.18
700	0.7000000000000005	-20415840.214279402	2.22
800	0.8000000000000006	-20411478.48251822	2.24
900	0.9000000000000007	-20409822.772401713	2.26
1000	1.0000000000000007	-20402172.29296462	2.27

Note, however, that the GPU utilization I am seeing for the torchani implementation is low (under 30%), whereas NNPOps is using 100%.
Which GPU are you running on?
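(For anyone who wants to check this themselves, GPU utilization can be sampled from Python by shelling out to nvidia-smi; a minimal sketch, assuming nvidia-smi is on the PATH:)

# Sample current GPU utilization via nvidia-smi; a quick sketch,
# not part of the benchmark script itself.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
)
print(f"GPU utilization: {out.stdout.strip()}%")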

@JohannesKarwou

I'm using an RTX 2060. If I run your script, I get this output:

Warning on use of the timeseries module: If the inherent timescales of the system are long compared to those being analyzed, this statistical inefficiency may be an underestimate.  The estimate presumes the use of many statistically independent samples.  Tests should be performed to assess whether this condition is satisfied.   Be cautious in the interpretation of the data.
Implementation: nnpops
/scratch/data/johannes/miniconda3/envs/openmmml-test/lib/python3.10/site-packages/torchani/__init__.py:55: UserWarning: Dependency not satisfied, torchani.ase will not be available
  warnings.warn("Dependency not satisfied, torchani.ase will not be available")
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
/scratch/data/johannes/miniconda3/envs/openmmml-test/lib/python3.10/site-packages/torchani/resources/

MD run: 1000 steps
#"Step"	"Time (ps)"	"Potential Energy (kJ/mole)"	"Speed (ns/day)"
100	0.10000000000000007	-20461915.171353746	0
200	0.20000000000000015	-20462175.772429183	3.8
300	0.3000000000000002	-20462196.65597004	3.77
400	0.4000000000000003	-20462139.43374104	3.78
500	0.5000000000000003	-20462263.351597134	3.79
600	0.6000000000000004	-20462383.616304602	3.79
700	0.7000000000000005	-20462338.201707978	3.79
800	0.8000000000000006	-20462476.36408945	3.79
900	0.9000000000000007	-20462493.882426914	3.79
1000	1.0000000000000007	-20462619.096036546	3.8
Implementation: torchani
/scratch/data/johannes/miniconda3/envs/openmmml-test/lib/python3.10/site-packages/torchani/resources/

MD run: 1000 steps
[W BinaryOps.cpp:594] Warning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (function operator())
#"Step"	"Time (ps)"	"Potential Energy (kJ/mole)"	"Speed (ns/day)"
100	0.10000000000000007	-20461319.38262451	0
200	0.20000000000000015	-20461412.834623236	3.64
300	0.3000000000000002	-20461563.888187505	3.62
400	0.4000000000000003	-20461594.997226883	3.65
500	0.5000000000000003	-20461698.649372432	3.68
600	0.6000000000000004	-20461768.927100796	3.68
700	0.7000000000000005	-20461759.561995145	3.7
800	0.8000000000000006	-20462005.177399218	3.7
900	0.9000000000000007	-20462068.944122538	3.71
1000	1.0000000000000007	-20462172.986246087	3.7

For nnpops I see a GPU utilization of 100%, and for torchani around 30%. These are the packages in my environment:

cudatoolkit               11.4.2              h7a5bcfd_11    conda-forge
nnpops                    0.3             cuda112py310h8b99da5_1    conda-forge
openmm                    8.0.0           py310h5728c26_0    conda-forge
openmm-ml                 1.0                pyhd8ed1ab_0    conda-forge
openmm-torch              1.0             cuda112py310hb8f62fa_0    conda-forge
openmmtools               0.21.4             pyhd8ed1ab_0    conda-forge
pytorch                   1.12.1          cuda112py310he33e0d6_201    conda-forge
torchani                  2.2.2           cuda112py310h98dee98_6    conda-forge

@RaulPPelaez (Contributor)

My GPU is much more powerful, and yet you see higher torchani speeds than I do. I am clueless as to why; let's see if the others have some insights. @raimis @peastman @sef43 Any ideas?

@sef43 (Member) commented Mar 2, 2023

This is what I get on an RTX 3090. NNPOps is faster, as expected, but torchani is slower than in all of the above:

Implementation: nnpops

MD run: 1000 steps
#"Step"	"Time (ps)"	"Potential Energy (kJ/mole)"	"Speed (ns/day)"
100	0.10000000000000007	-20461951.441185422	0
200	0.20000000000000015	-20462188.18474654	9.89
300	0.3000000000000002	-20462210.048553117	9.92
400	0.4000000000000003	-20462229.56560606	9.91
500	0.5000000000000003	-20462397.756919313	9.92
600	0.6000000000000004	-20462263.190097418	9.91
700	0.7000000000000005	-20462424.72236422	9.91
800	0.8000000000000006	-20462420.394422207	9.91
900	0.9000000000000007	-20462443.257273544	9.9
1000	1.0000000000000007	-20462578.838789392	9.9

Implementation: torchani


MD run: 1000 steps
#"Step"	"Time (ps)"	"Potential Energy (kJ/mole)"	"Speed (ns/day)"
100	0.10000000000000007	-20456325.17729549	0
200	0.20000000000000015	-20449428.474614438	0.685
300	0.3000000000000002	-20439698.984288476	0.97
400	0.4000000000000003	-20431854.536180284	1.13
500	0.5000000000000003	-20424569.579427063	1.22
600	0.6000000000000004	-20421131.122765917	1.29
700	0.7000000000000005	-20415078.246097725	1.34
800	0.8000000000000006	-20411468.562179044	1.37
900	0.9000000000000007	-20405688.49984586	1.38
1000	1.0000000000000007	-20401371.05406119	1.4

@raimis (Contributor) commented Mar 2, 2023

On my ancient GTX 1080 Ti:

MD run: 1000 steps
#"Step"	"Time (ps)"	"Potential Energy (kJ/mole)"	"Speed (ns/day)"
100	0.10000000000000007	-20461986.05720992	0
200	0.20000000000000015	-20462011.271196663	4.32
300	0.3000000000000002	-20462185.580720104	4.29
400	0.4000000000000003	-20462218.965465758	4.29
500	0.5000000000000003	-20462187.70901094	4.28
600	0.6000000000000004	-20462401.649187673	4.29
700	0.7000000000000005	-20462252.494809993	4.29
800	0.8000000000000006	-20462291.34049955	4.29
900	0.9000000000000007	-20462193.367134728	4.29
1000	1.0000000000000007	-20462424.46822126	4.29
Implementation: torchani

MD run: 1000 steps
#"Step"	"Time (ps)"	"Potential Energy (kJ/mole)"	"Speed (ns/day)"
100	0.10000000000000007	-20461523.953313816	0
200	0.20000000000000015	-20461520.8654142	1.72
300	0.3000000000000002	-20461375.710345387	1.64
400	0.4000000000000003	-20461691.991264936	1.7
500	0.5000000000000003	-20461734.920769025	1.8
600	0.6000000000000004	-20461817.857133113	1.83
700	0.7000000000000005	-20461817.208004408	1.89
800	0.8000000000000006	-20462153.94805858	1.92
900	0.9000000000000007	-20462099.346131183	1.96
1000	1.0000000000000007	-20462146.46586435	1.96

@raimis added the help wanted (Extra attention is needed) label Mar 2, 2023
@wiederm (Author) commented Mar 2, 2023

I have tested your script with two modifications (5K steps, write frequency set to 200 steps) on an RTX 3070 (not the same machine on which I produced my initial data), and I get the following:

Implementation: nnpops

MD run: 5000 steps
#"Step"	"Time (ps)"	"Potential Energy (kJ/mole)"	"Speed (ns/day)"
200	0.20000000000000015	-20462058.886696402	0
400	0.4000000000000003	-20462261.770402238	5.13
600	0.6000000000000004	-20462289.659775756	5.13
800	0.8000000000000006	-20462297.250888396	5.13
1000	1.0000000000000007	-20462514.469884958	5.13
1200	1.1999999999999786	-20462481.41815422	5.13
1400	1.3999999999999566	-20462505.46846665	5.13
1600	1.5999999999999346	-20462626.978850227	5.13
1800	1.7999999999999126	-20462635.409385815	5.13
2000	1.9999999999998905	-20462620.65219273	5.13
2200	2.1999999999998687	-20462562.552356746	5.13
2400	2.3999999999998467	-20462679.17831285	5.13
2600	2.5999999999998247	-20462662.771694366	5.13
2800	2.7999999999998026	-20462858.939390674	5.13
3000	2.9999999999997806	-20462769.659467954	5.13
3200	3.1999999999997586	-20462989.99891446	5.13
3400	3.3999999999997366	-20462942.58059461	5.13
3600	3.5999999999997145	-20463033.362214305	5.13
3800	3.7999999999996925	-20462924.417510215	5.13
4000	3.9999999999996705	-20463011.34567156	5.13
4200	4.199999999999737	-20462983.498863857	5.13
4400	4.399999999999804	-20463055.54401257	5.13
4600	4.599999999999871	-20463034.46516973	5.13
4800	4.799999999999938	-20463039.68574196	5.13
5000	5.000000000000004	-20463034.924004197	5.13
Implementation: torchani

MD run: 5000 steps
#"Step"	"Time (ps)"	"Potential Energy (kJ/mole)"	"Speed (ns/day)"
200	0.20000000000000015	-20461516.526204765	0
400	0.4000000000000003	-20461600.373352136	4.85
600	0.6000000000000004	-20461745.880840868	4.85
800	0.8000000000000006	-20461985.43875364	4.82
1000	1.0000000000000007	-20462241.704375446	4.76
1200	1.1999999999999786	-20462383.980617914	4.73
1400	1.3999999999999566	-20462447.03686768	4.74
1600	1.5999999999999346	-20462761.832365412	4.75
1800	1.7999999999999126	-20462865.32426318	4.77
2000	1.9999999999998905	-20462951.05244407	4.78
2200	2.1999999999998687	-20462853.98548076	4.79
2400	2.3999999999998467	-20463030.63675009	4.77
2600	2.5999999999998247	-20463056.779047217	4.75
2800	2.7999999999998026	-20463190.083291873	4.76
3000	2.9999999999997806	-20463250.281372238	4.77
3200	3.1999999999997586	-20463391.37891716	4.78
3400	3.3999999999997366	-20463425.523587838	4.78
3600	3.5999999999997145	-20463549.743160456	4.79
3800	3.7999999999996925	-20463630.800368927	4.79
4000	3.9999999999996705	-20463500.444433052	4.8
4200	4.199999999999737	-20463560.039706334	4.8
4400	4.399999999999804	-20463650.1258757	4.8
4600	4.599999999999871	-20463781.969737258	4.81
4800	4.799999999999938	-20463734.84937812	4.81
5000	5.000000000000004	-20463774.00930356	4.81

@RaulPPelaez (Contributor)

Given torchani's low GPU utilization, CPU performance may be playing a role here; perhaps your original RTX 2060 machine has a particularly powerful CPU.
The others may have more insight, but reporting only every 100 steps could also be hiding potential gains on some systems.
In essence, I do not see anything weird going on here. Hopefully in the future we can make the gains even better :P
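(One way to test the reporter hypothesis would be to time the steps directly, with no StateDataReporter attached; a minimal sketch, reusing the `simulation` object from the script above:)

# Time raw stepping with no reporters attached, to rule out reporting overhead.
# Reuses the `simulation` object from the script above (assumption).
import time

simulation.step(100)  # warm-up (kernel compilation / autotuning)
n_steps = 1000
start = time.perf_counter()
simulation.step(n_steps)
elapsed = time.perf_counter() - start
# With a 1 fs timestep: ns/day = n_steps * 1e-6 ns / (elapsed / 86400 s per day)
print(f"{n_steps * 1e-6 * 86400 / elapsed:.2f} ns/day")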

@sef43 (Member) commented Mar 3, 2023

Yes, it seems to be very GPU dependent, with the higher-end cards (more CUDA cores) getting much more of a speedup from NNPOps.

@sef43 (Member) commented Mar 3, 2023

I think it would be useful to collate performance benchmarks like the ones above across different hardware and system sizes, so people can check whether their systems are running at the expected speed, similar to https://openmm.org/benchmarks.
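(A minimal sketch of what such a benchmark sweep could look like, building on the script earlier in this thread; the box edges and step counts are arbitrary illustrative choices:)

# Sketch of a benchmark sweep over water-box sizes and implementations,
# building on the script earlier in this thread.
import time

from openmm import LangevinIntegrator, unit, Platform
from openmm.app import Simulation
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox

platform = Platform.getPlatformByName("CUDA")
prop = dict(CudaPrecision="mixed")
nnp = MLPotential("ani2x")

for box_edge in (10, 15, 20, 25):  # Angstrom; arbitrary illustrative sizes
    waterbox = WaterBox(box_edge=box_edge * unit.angstrom)
    for implementation in ("nnpops", "torchani"):
        system = nnp.createSystem(waterbox.topology, implementation=implementation)
        simulation = Simulation(
            waterbox.topology,
            system,
            LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 1 * unit.femtosecond),
            platform, prop,
        )
        simulation.context.setPositions(waterbox.positions)
        simulation.step(100)  # warm-up before timing
        start = time.perf_counter()
        simulation.step(1000)
        elapsed = time.perf_counter() - start
        # 1 fs timestep: 1000 steps = 1e-3 ns
        print(f"{box_edge} A, {implementation}: {1000 * 1e-6 * 86400 / elapsed:.2f} ns/day")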

@jchodera (Member) commented Jul 6, 2023

Did we ever figure out what was happening here?
