oemol.GetConfs() consuming large amount of memory even when no conformers are present #1855

Open
lilyminium opened this issue Apr 5, 2024 · 4 comments

@lilyminium
Collaborator

lilyminium commented Apr 5, 2024

Describe the bug

Not a bug per se, but it could affect toolkit usability for large molecules. While debugging openforcefield/openff-nagl#101, I saw that converting molecules to and from OpenEye consumes a large amount of memory that is not seen with RDKit. For a 5177-atom protein, calling Molecule.from_openeye consumes about 800 MiB. Memray attributes most of this to oeconf.GetCoords, even though no conformers are generated or attached to the molecule at any point. Would it be possible to check for conformers before calling conf.GetCoords? (It may be that the check itself triggers the same memory-consuming process, though!)

To Reproduce

mre.py (also attached):

from openff.toolkit import Molecule

protein = Molecule.from_smiles(
    "CC[C@H](C)[C@H](NC(=O)CNC(=O)CNC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CCCNC(N)=[NH2+])NC(=O)CNC(=O)[C@H](CS)NC(=O)[C@@H](NC(=O)[C@H](CCCNC(N)=[NH2+])NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)[C@H](CC(C)C)NC(=O)CNC(=O)[C@H](CCCNC(N)=[NH2+])NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](NC(=O)[C@@H]1CCCN1C(=O)[C@H](CC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@@H]1CCCN1C(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CC(N)=O)NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@H](C)NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O)[C@@H](NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@H](CO)NC(=O)[C@H](C)NC(=O)[C@@H]1CCCN1C(=O)[C@@H](NC(=O)[C@H](CO)NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CO)NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](Cc1cnc[nH]1)NC(=O)CNC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](C)NC(=O)[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](C)NC(=O)[C@@H](NC(=O)[C@H](CO)NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@@H]1CCCN1C(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CO)NC(=O)[C@@H](NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CC(=O)[O-])NC(=O)CNC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@@H](NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CCC(=O)[O-])NC(=O)CNC(=O)[C@H](CC(N)=O)NC(=O)[C@@H](NC(=O)CNC(=O)CNC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@H](CC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@H](C)NC(=O)CNC(=O)[C@H](CCCC[NH3+])NC(=O)[C@H](CCC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](C)NC(=O)CNC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@H](CC(N)=O)NC(=O)[C@@H]1CCCN1C(=O)CNC(=O)[C@@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H]1CCCN1C(=O)[C@H](C)NC(=O)CNC(=O)[C@@H]
(NC(=O)[C@H](C)NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@@H]([NH3+])CCSC)C(C)C)C(C)C)[C@@H](C)CC)C(C)C)[C@@H](C)O)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)CC)C(C)C)C(C)C)[C@@H](C)CC)C(C)C)C(C)C)C(C)C)C(C)C)C(C)C)C(C)C)[C@@H](C)CC)C(C)C)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)O)[C@@H](C)O)C(C)C)[C@@H](C)O)[C@@H](C)O)[C@@H](C)O)C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](Cc1cnc[nH]1)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@H](C(=O)N1CCC[C@H]1C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)NCC(=O)N[C@@H](CCC(N)=O)C(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@H](C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(N)=O)C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@H](C(=O)N[C@@H](CCSC)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CCC(=O)[O-])C(=O)NCC(=O)N[C@H](C(=O)N[C@@H](CC(N)=O)C(=O)N[C@H](C(=O)N[C@@H](CS[C@H]1CC(=O)N(c2ccc3c(c2)C(=O)OC32c3ccc(O)cc3Oc3cc(O)ccc32)C1=O)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CO)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCSC)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)NCC(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](CCCNC(N)=[NH2+])C(=O)N[C@@H](CCC(N)=O)C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(=O)[O-
])C(=O)N[C@@H](CCC(N)=O)C(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC(=O)[O-])C(=O)NCC(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)N[C@@H](CC(N)=O)C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](C)C(=O)N[C@@H](CO)C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](C)C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)NCC(=O)N[C@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CC(N)=O)C(=O)N[C@H](C(=O)N[C@@H](Cc1ccccc1)C(=O)NCC(=O)N1CCC[C@H]1C(=O)N[C@@H](CC(=O)[O-])C(=O)N1CCC[C@H]1C(=O)N[C@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCNC(N)=[NH2+])C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CC(N)=O)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(N)=O)C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CO)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CCSC)C(=O)N[C@@H](CCC(N)=O)C(=O)N[C@@H](C)C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](CO)C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCCC[NH3+])C(=O)NCC(=O)N[C@@H](CO)C(=O)N[C@@H](Cc1ccccc1)C(=O)N1CCC[C@H]1C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)N[C@@H](CC(C)C)C(=O)NCC(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)NCC(=O)N[C@@H](CC(=O)[O-])C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(C)C)C(=O)N1CCC[C@
H]1C(=O)NCC(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@H](C(=O)N[C@@H](CCSC)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCCC[NH3+])C(=O)NCC(=O)N1CCC[C@H]1C(=O)N[C@@H](CC(=O)[O-])C(=O)NCC(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@H](C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@H](C(=O)N[C@@H](CCC(N)=O)C(=O)N[C@@H](CCC(N)=O)C(=O)NC)[C@@H](C)CC)[C@@H](C)O)C(C)C)[C@@H](C)CC)[C@@H](C)O)C(C)C)C(C)C)[C@@H](C)CC)[C@@H](C)O)C(C)C)[C@@H](C)O)[C@@H](C)O)[C@@H](C)O)C(C)C)[C@@H](C)CC)C(C)C)[C@@H](C)CC)C(C)C)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)O)[C@@H](C)CC)[C@@H](C)CC)C(C)C)[C@@H](C)CC)C(C)C)C(C)C)C(C)C)[C@@H](C)O)C(C)C)[C@@H](C)O)[C@@H](C)O)[C@@H](C)CC)[C@@H](C)CC)C(C)C"
)
oemol = protein.to_openeye()
offmol = Molecule.from_openeye(oemol)

Requires memray installed:

$ python -m memray run mre.py
[Screenshot of the memray flamegraph, 2024-04-05]

The screenshot points to this line:

off_atom_coords = conf.GetCoords()[oe_id]
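
One possible reading of this line, assuming it sits inside a per-atom loop (the surrounding toolkit code is not shown in this issue): if each `conf.GetCoords()` call builds a full mapping of coordinates for every atom, then indexing a fresh mapping once per atom allocates N mappings of N entries each. The sketch below uses a hypothetical `FakeConf` stand-in (not the OpenEye API) to contrast that pattern with hoisting the call out of the loop.

```python
class FakeConf:
    """Hypothetical stand-in for an OpenEye conformer; NOT the real API."""

    def __init__(self, n_atoms):
        self.n_atoms = n_atoms
        self.calls = 0  # how many times GetCoords() was invoked

    def GetCoords(self):
        # Assumption: a fresh mapping of atom index -> (x, y, z)
        # is built on every call.
        self.calls += 1
        return {i: (0.0, 0.0, 0.0) for i in range(self.n_atoms)}


def coords_per_atom_call(conf):
    # Pattern resembling the flagged line: one GetCoords() call per atom.
    return [conf.GetCoords()[oe_id] for oe_id in range(conf.n_atoms)]


def coords_hoisted(conf):
    # Build the mapping once, then index into it.
    coords = conf.GetCoords()
    return [coords[oe_id] for oe_id in range(conf.n_atoms)]


slow, fast = FakeConf(100), FakeConf(100)
assert coords_per_atom_call(slow) == coords_hoisted(fast)
print(slow.calls, fast.calls)  # 100 1
```

Whether this accounts for the 800 MiB depends on how the real `GetCoords` allocates, which this sketch does not model.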

Output

Computing environment (please complete the following information):

  • Operating system
  • Output of running conda list
  Name                     Version    Build         Channel
─────────────────────────────────────────────────────────────────
  openff-amber-ff-ports    0.0.4      pyhca7485f_0  conda-forge
  openff-forcefields       2024.03.0  pyhca7485f_0  conda-forge
  openff-interchange-base  0.3.25     pyhd8ed1ab_0  conda-forge
  openff-models            0.1.2      pyhca7485f_0  conda-forge
  openff-nagl              0.3.6      pyhd8ed1ab_0  conda-forge
  openff-nagl-base         0.3.6      pyhd8ed1ab_0  conda-forge
  openff-nagl-models       0.1.2      pyhd8ed1ab_0  conda-forge
  openff-recharge          0.5.2      pyhd8ed1ab_0  conda-forge
  openff-toolkit-base      0.15.2     pyhd8ed1ab_0  conda-forge
  openff-units             0.2.2      pyhca7485f_0  conda-forge
  openff-utilities         0.1.12     pyhd8ed1ab_0  conda-forge

Additional context

mre.zip

Manifest:

  • mre.py (includes the protein SMILES)
  • memray-mre.py.10332.bin: the output of memray
  • memray-flamegraph-mre.py.10332.html: the interactive graph in the screenshot
@mattwthompson
Member

I tried implementing this since it should be easy, but it's not. Simply adding a NumConfs() call doesn't do the trick. I don't know how to check for OpenEye's annoying "courtesy conformer" without calling out to GetConfs(), which I understand to be the problem:

In [32]: oemol = Molecule.from_smiles("CCO").to_openeye()

In [33]: oemol.NumConfs()
Out[33]: 1

In [34]: [*oemol.GetConfs()][0].GetCoords()
Out[34]:
{0: (0.0, 0.0, 0.0),
 1: (0.0, 0.0, 0.0),
 2: (0.0, 0.0, 0.0),
 3: (0.0, 0.0, 0.0),
 4: (0.0, 0.0, 0.0),
 5: (0.0, 0.0, 0.0),
 6: (0.0, 0.0, 0.0),
 7: (0.0, 0.0, 0.0),
 8: (0.0, 0.0, 0.0)}

In [35]: molecule = Molecule.from_smiles("O=S(=O)(N)c1c(Cl)cc2c(c1)S(=O)(=O)NCN2")

In [36]: molecule.generate_conformers(n_conformers=1)

In [37]: oemol = molecule.to_openeye()

In [38]: oemol.NumConfs()
Out[38]: 1

In [39]: [*oemol.GetConfs()][0].GetCoords()
Out[39]:
{0: (1.8719326257705688, 3.7204949855804443, 2.2212681770324707),
 1: (1.2912099361419678, 4.097604274749756, 0.9475870132446289),
 2: (0.3753527104854584, 5.218091011047363, 0.8554574251174927),
 3: (2.534075975418091, 4.290732383728027, -0.20339979231357574),
 4: (0.4765625, 2.689453125, 0.296875),
 5: (-0.5654296875, 2.794921875, -0.62060546875),
 6: (-1.133737325668335, 4.327646255493164, -1.178165078163147),
 7: (-1.181640625, 1.642578125, -1.119140625),
 8: (-0.7685546875, 0.360595703125, -0.7216796875),
 9: (0.280029296875, 0.290283203125, 0.2086181640625),
 10: (0.90185546875, 1.43359375, 0.71923828125),
 11: (0.84521484375, -1.2734375, 0.802734375),
 12: (1.9853515625, -1.6318359375, -0.0174713134765625),
 13: (0.93994140625, -1.193359375, 2.24609375),
 14: (-0.484130859375, -2.26953125, 0.403076171875),
 15: (-0.9970703125, -2.099609375, -0.96826171875),
 16: (-1.4326171875, -0.7451171875, -1.2314453125),
 17: (2.4708824157714844, 5.089303016662598, -0.8458374738693237),
 18: (3.4932398796081543, 4.066527366638184, 0.08694052696228027),
 19: (-2.0012941360473633, 1.7379204034805298, -1.8306996822357178),
 20: (1.7089277505874634, 1.3463486433029175, 1.4417718648910522),
 21: (-1.2067643404006958, -2.400458812713623, 1.1223875284194946),
 22: (-0.20102126896381378, -2.3644607067108154, -1.6715291738510132),
 23: (-1.8237838745117188, -2.7984254360198975, -1.1371092796325684),
 24: (-2.3384859561920166, -0.6081215143203735, -1.6641637086868286)}
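
For illustration only, one heuristic for telling the two cases above apart is to test whether every atom sits exactly at the origin, as in the ethanol output. The helper name `looks_like_courtesy_conformer` is hypothetical, and the check operates on a coordinate mapping that has already been fetched, so it does not avoid the `GetCoords()` call itself.

```python
def looks_like_courtesy_conformer(coords):
    """Return True if every atom sits exactly at the origin.

    `coords` maps atom index -> (x, y, z), like the dicts printed above.
    Assumption: a molecule without real 3D coordinates reports
    (0.0, 0.0, 0.0) for every atom, as in the ethanol example.
    """
    # A single atom at the origin is ambiguous, so only flag molecules
    # with more than one atom.
    if len(coords) <= 1:
        return False
    return all(tuple(xyz) == (0.0, 0.0, 0.0) for xyz in coords.values())


placeholder = {i: (0.0, 0.0, 0.0) for i in range(9)}
real = {0: (1.87, 3.72, 2.22), 1: (1.29, 4.10, 0.95)}
print(looks_like_courtesy_conformer(placeholder))
print(looks_like_courtesy_conformer(real))
```

This treats a single-atom molecule as having a real conformer, which is the conservative choice: coordinates are kept rather than discarded.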

@mattwthompson
Member

Okay, actually thinking about this a little more clearly, using GetConfs (which returns an iterator over all conformers) might be the issue if it's not a generator. I can't tell from the docs and the SWIG magic whether it's lazy like a generator or EEAAO (everything everywhere all at once), more like a list.
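
As a rough way to probe the lazy-versus-eager question from the Python side, something like the following could be pointed at whatever `GetConfs()` returns. The helper `describe_iterable` is hypothetical, and it only inspects the Python-level object: a SWIG-wrapped iterator could still materialise everything in C++ underneath, so this is a first check, not a conclusive one.

```python
import inspect


def describe_iterable(obj):
    """Classify an iterable by how it behaves at the Python level."""
    if inspect.isgenerator(obj):
        return "generator"  # lazy: values are produced on demand
    if iter(obj) is obj:
        return "iterator"  # single-pass object, possibly lazy
    if hasattr(obj, "__len__"):
        return "sized container"  # eager: all items already exist
    return "iterable"


print(describe_iterable(x for x in range(3)))  # generator
print(describe_iterable(iter([1, 2, 3])))      # iterator
print(describe_iterable([1, 2, 3]))            # sized container
```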

@mattwthompson
Member

There's also GetConfIter, which only exists when there are two or more conformers. It could provide a useful branching point, but when there is only one conformer it cannot tell whether that conformer is real or not.

@lilyminium
Collaborator Author

Hm, the courtesy conformer is annoying. This is low priority at best since it's a very moderate amount of memory even for a decent sized protein. Thanks for looking into it!
