Exceptionally long runtime for a few organic structures from materials project #20

Closed
NiklasGebauer opened this issue Nov 4, 2020 · 7 comments

@NiklasGebauer

Hello,

First of all, thanks for this great script! It is really useful and does a good job of solving this tricky task.

I was using it on a few thousand organic molecules from the Materials Project database and noticed that a few structures always lead to exceptionally long runtimes (~10 minutes, compared to <1 second for most other molecules).

  • Python 3.7.5
  • RDKit 2019.09.2

I'm calling with:
python xyz2mol.py molecule.xyz --use-huckel --charge 0
When I set --no-charged-fragments, the calculation takes only ~1 minute instead, but this is still much longer than for other structures.

Do you have any idea why these structures take so long? Is there anything I could do about it?

You can find the .xyz of two example structures and the resulting SMILES strings below:

  • 1 N#CC(C#N)=C1c2cc([N+](=O)[O-])cc([N+](=O)[O-])c2-c2c1cc([N+](=O)[O-])cc2[N+](=O)[O-]
34
Properties=species:S:1:pos:R:3 pbc="F F F"
C        0.73692500       0.51863000      -0.03802500
C       -0.73709800       0.51853500       0.03824500
C        1.17836400      -0.83219800       0.04762500
C        1.70690900       1.51141400      -0.22401300
C       -1.17839500      -0.83234800      -0.04754000
C       -1.70721100       1.51120700       0.22420600
C        0.00006400      -1.72012500       0.00007400
C        2.53564000      -1.15199000       0.09649200
C        3.06628200       1.21639600      -0.15076800
N        1.37416700       2.88102000      -0.66635000
C       -2.53564500      -1.15224700      -0.09678300
C       -3.06652800       1.21608200       0.15058700
N       -1.37459700       2.88074400       0.66686200
C        0.00030300      -3.09187400       0.00024400
C        3.45228900      -0.10548900       0.03352700
O        0.38166200       2.99492000      -1.38745000
O        2.13692500       3.78391800      -0.35163400
C       -3.45237300      -0.10580900      -0.03401100
O       -0.38185500       2.99457700       1.38765000
O       -2.13743300       3.78365300       0.35239600
C       -1.18832100      -3.88288100      -0.11103400
C        1.18921900      -3.88238300       0.11179100
N        4.89748200      -0.41699000       0.11258300
N       -4.89754100      -0.41737300      -0.11356500
N       -2.12696000      -4.56470000      -0.20142600
N        2.12802200      -4.56397100       0.20223900
O        5.20955100      -1.59603200       0.25743400
O        5.68097700       0.52605800       0.03167900
O       -5.20951000      -1.59645600      -0.25831000
O       -5.68110900       0.52560000      -0.03243400
H        2.90408600      -2.16593900       0.17608200
H        3.80486200       2.00100800      -0.26123200
H       -2.90394700      -2.16622200      -0.17654900
H       -3.80521500       2.00060700       0.26094000
  • 2 [NH2+]=C1N=CN=C2N3[C@@H]4O[C@H](CO[P@@](=O)(O[P@](=O)(O)O[P@@]([O-])(O)=[OH+])OC35[N-]C125)[C@@H](O)[C@H]4O
45
Properties=species:S:1:pos:R:3 pbc="F F F"
O        4.36294500      -1.55898200       1.19840700
C        3.10513300      -0.92751400       0.97875100
C        2.86634300      -0.72258900      -0.53503300
C        2.02171300      -1.95959500       1.33356600
N        1.91655400       0.40588800      -0.76397200
O        2.34711500      -1.92674100      -1.03009800
O        2.29221600      -2.71678000       2.48855300
C        1.92958000      -2.80125800       0.04338700
C        2.15543200       1.70089100      -0.30613100
C        0.73195100       0.48380800      -1.46547600
C        0.54836100      -3.34299100      -0.26937900
C        1.06929500       2.46700000      -0.72438200
N        3.21680800       2.13289500       0.38457700
N        0.19196000       1.68276100      -1.45614000
O       -0.39916100      -2.24266900      -0.37758300
C        1.08410500       3.83051700      -0.35954800
C        3.12623500       3.44233300       0.65789300
P       -0.83728200      -1.70962400      -1.81922600
N        0.10761000       4.70578100      -0.70064300
N        2.14223600       4.29011900       0.33496900
O       -2.20702200      -0.94827500      -1.47786400
O        0.23407600      -0.52044500      -2.20670500
O       -0.92585400      -2.67604700      -2.91904500
P       -2.85940700       0.33289900      -0.70924700
O       -1.97321500       0.42381000       0.64791500
O       -2.45515700       1.58735300      -1.56091900
O       -4.30444600       0.14212700      -0.44905300
P       -2.44426400       0.14947500       2.21721200
O       -1.53871300       0.80975600       3.17277700
O       -3.97216200       0.57457000       2.23570700
O       -2.45168500      -1.46410900       2.28946500
H        5.04302400      -0.88269600       1.34022000
H        3.02449600      -0.00216500       1.54945000
H        3.78897400      -0.48062500      -1.06991600
H        1.07625300      -1.43888400       1.50963300
H        3.25561600      -2.86261800       2.51878700
H        2.63099400      -3.64393100       0.10106200
H       -1.44141000       1.72447800      -1.63391000
H        0.53920000      -3.92335800      -1.19561200
H        0.19327800      -3.96716600       0.55497200
H        3.95475600       3.87280900       1.21424500
H       -0.77975900       4.37659500      -1.05328500
H        0.13788800       5.62638600      -0.28299800
H       -4.43938400       0.38504100       1.38103200
H       -1.79620500      -1.78939000       2.92936300
@jhjensen2
Member

Glad you're finding xyz2mol useful!

Molecules with many nitro and phosphate groups will take a long time, and there isn't really a general way to speed this up without changing the entire approach.

One could make some hacks that identify these specific groups and deal with them differently but I am reluctant to put that in the official version of the code. Let me know if you want to implement it locally and I can give you some tips.
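As a rough illustration of the "identify these specific groups" idea, here is a minimal sketch that counts nitro-like groups directly from the coordinates, so that such molecules can be flagged before xyz2mol is run. It is not part of xyz2mol: the function names `parse_xyz` and `count_nitro_like` and the 1.4 Å N-O distance cutoff are assumptions made for this example, and the heuristic only looks for a nitrogen with at least two close oxygens.

```python
def parse_xyz(text):
    """Parse an XYZ block (atom count, comment line, then 'symbol x y z' rows)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    atoms = []
    for line in lines[2:2 + n_atoms]:
        sym, x, y, z = line.split()[:4]
        atoms.append((sym, float(x), float(y), float(z)))
    return atoms


def count_nitro_like(atoms, cutoff=1.4):
    """Count N atoms that have at least two O atoms within `cutoff` angstroms.

    The cutoff is an assumed value chosen for this sketch, not a parameter
    taken from xyz2mol.
    """
    count = 0
    for sym, x, y, z in atoms:
        if sym != "N":
            continue
        close_oxygens = 0
        for other, ox, oy, oz in atoms:
            if other != "O":
                continue
            dist = ((x - ox) ** 2 + (y - oy) ** 2 + (z - oz) ** 2) ** 0.5
            if dist <= cutoff:
                close_oxygens += 1
        if close_oxygens >= 2:
            count += 1
    return count
```

Molecules with a high count could then be routed to a separate queue or run with a time limit. A similar check for phosphate groups would need its own cutoff and is not covered here.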

I also note that xyz2mol didn't identify the bonding correctly on the second molecule. Maybe removing the Huckel option will fix it. In general it's a good idea to use both and visually inspect those where they differ as a sanity check.

@NiklasGebauer
Author

Alright, thanks a lot!

I think it should be fine for now since only very few structures seem to be affected.
I will try to use the script to analyze 3D structures generated with a generative model. There shouldn't be too many nitro and phosphate groups, as they are also rare in the training data. But if the process slows down too much, I will get back to you and think about implementing the hacks.

Also, I will keep the sanity check with and without Huckel option in mind, thanks for the hint!
In general, I do not need the script to get the bonding right in every case, as long as it does so in the vast majority of cases. It seems to accomplish that on the training data; we'll see how it performs on generated structures, which can be less accurate.
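The sanity check with and without the Huckel option could be automated along these lines. This is only a sketch: it assumes the SMILES strings from both runs have already been collected into two dictionaries keyed by structure name, and it compares the strings as-is, which is only meaningful if both runs write SMILES in the same canonical form. The function name `flag_disagreements` is made up for this example.

```python
def flag_disagreements(smiles_huckel, smiles_plain):
    """Return the sorted names of structures whose two SMILES differ.

    A structure that is missing from one of the two runs (for example
    because that run failed or timed out) is also flagged.
    """
    names = set(smiles_huckel) | set(smiles_plain)
    return sorted(
        name for name in names
        if smiles_huckel.get(name) != smiles_plain.get(name)
    )
```

Only the flagged structures would then need a visual inspection, instead of the whole data set.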

@NiklasGebauer
Author

I just tried the second example without Huckel and it finished in less than a second and obtained the correct bonding (however, the first example also takes a long time without Huckel).

Is there a general rule for which of the two approaches is faster or more reliable, or does it depend strongly on the structure?
It would be nice to settle on one of them without manually checking the cases where the results differ, but that would of course only make sense if one approach is better on average at obtaining the correct bonding.

@jhjensen2
Member

In my experience, the Huckel option is more reliable. In fact, I was really surprised to see that it failed for molecule 2. Molecules with many nitro groups will always take a long time, but it's hard to predict in general.

Anyway, just be aware that xyz2mol will occasionally screw up and it's hard to predict if and when it happens.

@NiklasGebauer
Author

Thanks again!

When a molecule takes this long, is it always stuck in the same part of the code, e.g. a particular loop?
I am thinking about implementing a kill switch that raises a timeout error if the calculation takes longer than, e.g., 5 minutes.
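One possible way to build such a kill switch without touching xyz2mol itself is to run the script as a child process and let `subprocess.run` enforce the time limit. The sketch below is an assumption about how this could be done, not something taken from xyz2mol; the helper name `run_with_timeout` is hypothetical, and it returns `None` on timeout rather than raising, which may or may not suit a given pipeline.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run `cmd` (a list of strings) and return its stripped stdout.

    Returns None if the command did not finish within `timeout_s` seconds;
    subprocess.run kills the child process in that case before raising
    TimeoutExpired.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout.strip()


# Example call with the command line from this issue and a 5 minute limit:
# smiles = run_with_timeout(
#     [sys.executable, "xyz2mol.py", "molecule.xyz", "--use-huckel", "--charge", "0"],
#     timeout_s=300,
# )
```

Running each structure in its own process costs some startup time per molecule, but it means a stuck calculation can be terminated cleanly. The sketch does not check the exit code, so a crashed run would return whatever was printed before the crash.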

@jhjensen2
Member

It is the loop over valences in AC2BO that can take a long time.
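To see why a loop over valences can blow up, here is a toy sketch. It assumes the loop enumerates every combination of allowed valences per atom (a Cartesian product), and it uses a made-up valence table; neither the table `EXAMPLE_VALENCES` nor the function names are copied from xyz2mol, and the real code may prune many of these combinations.

```python
import itertools

# Illustrative valence options per element. These values are an assumption
# for this example and are NOT taken from xyz2mol.
EXAMPLE_VALENCES = {"H": [1], "C": [4], "N": [3, 4], "O": [2, 1, 3]}


def iter_valence_combinations(symbols, table=EXAMPLE_VALENCES):
    """Yield one tuple of valences per possible assignment to the atoms."""
    return itertools.product(*(table[s] for s in symbols))


def count_valence_combinations(symbols, table=EXAMPLE_VALENCES):
    """Count the assignments without enumerating them."""
    total = 1
    for s in symbols:
        total *= len(table[s])
    return total
```

Under these assumed options, a single nitro group (one N, two O) allows 2 * 3 * 3 = 18 combinations, so four nitro groups allow 18**4 = 104976 combinations before any other atom is considered. Atoms with a single option, such as C and H in this table, do not add to the count.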

@NiklasGebauer
Author

Okay, I should be fine with handling these special cases then.
Keep up the good work!
