Exceptionally long runtime for a few organic structures from materials project #20

NiklasGebauer · 2020-11-04T12:37:06Z

Hello,

First of all, thanks for this great script! It is really useful and does a good job at solving this tricky task.

I was using it on a few thousand organic molecules from the Materials Project database and noticed that a few structures consistently lead to exceptionally long runtimes (~10 minutes compared to <1 second for most other molecules).

Python 3.7.5
RDKit 2019.09.2

I'm calling with:
python xyz2mol.py molecule.xyz --use-huckel --charge 0
When I set --no-charged-fragments, the calculation takes only ~1 minute instead, but this is still much longer than for other structures.

Do you have any idea why these structures take so long? Is there anything I could do about it?

You can find the .xyz of two example structures and the resulting SMILES strings below:

1 N#CC(C#N)=C1c2cc([N+](=O)[O-])cc([N+](=O)[O-])c2-c2c1cc([N+](=O)[O-])cc2[N+](=O)[O-]

34
Properties=species:S:1:pos:R:3 pbc="F F F"
C        0.73692500       0.51863000      -0.03802500
C       -0.73709800       0.51853500       0.03824500
C        1.17836400      -0.83219800       0.04762500
C        1.70690900       1.51141400      -0.22401300
C       -1.17839500      -0.83234800      -0.04754000
C       -1.70721100       1.51120700       0.22420600
C        0.00006400      -1.72012500       0.00007400
C        2.53564000      -1.15199000       0.09649200
C        3.06628200       1.21639600      -0.15076800
N        1.37416700       2.88102000      -0.66635000
C       -2.53564500      -1.15224700      -0.09678300
C       -3.06652800       1.21608200       0.15058700
N       -1.37459700       2.88074400       0.66686200
C        0.00030300      -3.09187400       0.00024400
C        3.45228900      -0.10548900       0.03352700
O        0.38166200       2.99492000      -1.38745000
O        2.13692500       3.78391800      -0.35163400
C       -3.45237300      -0.10580900      -0.03401100
O       -0.38185500       2.99457700       1.38765000
O       -2.13743300       3.78365300       0.35239600
C       -1.18832100      -3.88288100      -0.11103400
C        1.18921900      -3.88238300       0.11179100
N        4.89748200      -0.41699000       0.11258300
N       -4.89754100      -0.41737300      -0.11356500
N       -2.12696000      -4.56470000      -0.20142600
N        2.12802200      -4.56397100       0.20223900
O        5.20955100      -1.59603200       0.25743400
O        5.68097700       0.52605800       0.03167900
O       -5.20951000      -1.59645600      -0.25831000
O       -5.68110900       0.52560000      -0.03243400
H        2.90408600      -2.16593900       0.17608200
H        3.80486200       2.00100800      -0.26123200
H       -2.90394700      -2.16622200      -0.17654900
H       -3.80521500       2.00060700       0.26094000

2 [NH2+]=C1N=CN=C2N3[C@@H]4O[C@H](CO[P@@](=O)(O[P@](=O)(O)O[P@@]([O-])(O)=[OH+])OC35[N-]C125)[C@@H](O)[C@H]4O

45
Properties=species:S:1:pos:R:3 pbc="F F F"
O        4.36294500      -1.55898200       1.19840700
C        3.10513300      -0.92751400       0.97875100
C        2.86634300      -0.72258900      -0.53503300
C        2.02171300      -1.95959500       1.33356600
N        1.91655400       0.40588800      -0.76397200
O        2.34711500      -1.92674100      -1.03009800
O        2.29221600      -2.71678000       2.48855300
C        1.92958000      -2.80125800       0.04338700
C        2.15543200       1.70089100      -0.30613100
C        0.73195100       0.48380800      -1.46547600
C        0.54836100      -3.34299100      -0.26937900
C        1.06929500       2.46700000      -0.72438200
N        3.21680800       2.13289500       0.38457700
N        0.19196000       1.68276100      -1.45614000
O       -0.39916100      -2.24266900      -0.37758300
C        1.08410500       3.83051700      -0.35954800
C        3.12623500       3.44233300       0.65789300
P       -0.83728200      -1.70962400      -1.81922600
N        0.10761000       4.70578100      -0.70064300
N        2.14223600       4.29011900       0.33496900
O       -2.20702200      -0.94827500      -1.47786400
O        0.23407600      -0.52044500      -2.20670500
O       -0.92585400      -2.67604700      -2.91904500
P       -2.85940700       0.33289900      -0.70924700
O       -1.97321500       0.42381000       0.64791500
O       -2.45515700       1.58735300      -1.56091900
O       -4.30444600       0.14212700      -0.44905300
P       -2.44426400       0.14947500       2.21721200
O       -1.53871300       0.80975600       3.17277700
O       -3.97216200       0.57457000       2.23570700
O       -2.45168500      -1.46410900       2.28946500
H        5.04302400      -0.88269600       1.34022000
H        3.02449600      -0.00216500       1.54945000
H        3.78897400      -0.48062500      -1.06991600
H        1.07625300      -1.43888400       1.50963300
H        3.25561600      -2.86261800       2.51878700
H        2.63099400      -3.64393100       0.10106200
H       -1.44141000       1.72447800      -1.63391000
H        0.53920000      -3.92335800      -1.19561200
H        0.19327800      -3.96716600       0.55497200
H        3.95475600       3.87280900       1.21424500
H       -0.77975900       4.37659500      -1.05328500
H        0.13788800       5.62638600      -0.28299800
H       -4.43938400       0.38504100       1.38103200
H       -1.79620500      -1.78939000       2.92936300


jhjensen2 · 2020-11-04T14:54:48Z

Glad you're finding xyz2mol useful!

Molecules with many nitro and phosphate groups will take a long time, and there isn't really a general way to speed this up without changing the entire approach.

One could add hacks that identify these specific groups and handle them differently, but I am reluctant to put that in the official version of the code. Let me know if you want to implement it locally and I can give you some tips.

I also note that xyz2mol didn't identify the bonding correctly on the second molecule. Maybe removing the Huckel option will fix it. In general it's a good idea to use both and visually inspect those where they differ as a sanity check.
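A minimal sketch of that sanity check, assuming you have already collected one SMILES string per structure from each mode. The function name and the dictionaries are hypothetical, and the comparison assumes both runs print canonical SMILES so that plain string equality is enough; if that does not hold, canonicalize the strings first.

```python
def find_disagreements(smiles_huckel, smiles_plain):
    """Return the sorted names of structures that need a visual check.

    Both arguments map a structure name (e.g. the .xyz file name) to the
    SMILES string from that run, or to None if the run failed or timed out.
    A structure is flagged if either result is missing or the two differ.
    """
    flagged = []
    for name in set(smiles_huckel) | set(smiles_plain):
        a = smiles_huckel.get(name)
        b = smiles_plain.get(name)
        if a is None or b is None or a != b:
            flagged.append(name)
    return sorted(flagged)
```

Only the flagged structures then need to be inspected by eye; everything else agreed between the two modes.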

NiklasGebauer · 2020-11-04T16:29:28Z

Alright, thanks a lot!

I think it should be fine for now since only very few structures seem to be affected.
I will try to use the script to analyze 3D structures generated with a generative model. There shouldn't be too many nitro and phosphate groups, as they are also rare in the training data. But if the process slows down too much, I will get back to you and think about implementing the hacks.

Also, I will keep the sanity check with and without Huckel option in mind, thanks for the hint!
In general, I do not require the script to obtain the correct bonding in all cases as long as it detects the bonding correctly in the vast majority of the cases (and it seems to accomplish that on the training data, we'll see how it performs on generated structures which can be more inaccurate).

NiklasGebauer · 2020-11-04T16:40:46Z

I just tried the second example without Huckel and it finished in less than a second and obtained the correct bonding (however, the first example also takes a long time without Huckel).

Is there a general rule for which of the two approaches is faster or more reliable, or does it strongly depend on the structure?
It would be nice to settle on one of them without manually checking the results when they differ, but this would of course only make sense if one approach is superior (on average) at obtaining the correct bonding.

jhjensen2 · 2020-11-05T06:47:53Z

In my experience, the Huckel option is more reliable. In fact, I was really surprised to see that it failed for molecule 2. Molecules with many nitro groups will always take a long time, but it's hard to predict in general.

Anyway, just be aware that xyz2mol will occasionally screw up and it's hard to predict if and when it happens.

NiklasGebauer · 2020-11-05T09:16:22Z

Thanks again!

Is it always the same lines of code where molecules get stuck when it takes so long, e.g. a loop?
I am thinking about implementing a kill switch that raises a timeout error if the calculations take longer than e.g. 5 minutes.
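A minimal sketch of such a kill switch, assuming the script is run as a separate process as in the command from my first post. The helper name `run_with_timeout` is made up for illustration; it relies only on the standard-library `subprocess.run` and its `timeout` argument.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its stripped stdout, or None on timeout.

    A non-zero exit status raises subprocess.CalledProcessError, so
    crashes are not silently confused with timeouts.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,  # requires Python 3.7+
            text=True,
            timeout=timeout_s,
            check=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return None
    return result.stdout.strip()


# Intended use (not executed here), with a 5 minute limit:
# smiles = run_with_timeout(
#     [sys.executable, "xyz2mol.py", "molecule.xyz", "--use-huckel", "--charge", "0"],
#     timeout_s=300,
# )
```

Running each molecule in its own process costs some start-up time, but it means a stuck calculation can be stopped from outside without touching the script itself.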

jhjensen2 · 2020-11-05T09:44:23Z

It is the loop over valences in AC2BO that can take a long time.
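To illustrate why such a loop can blow up: if every atom has a short list of candidate valences and the loop tries each combination, the number of combinations is the product of the list lengths. The sketch below only counts them. The candidate table and the function name are made up for illustration and are not the table used in xyz2mol, so the numbers should be read as an upper bound under these assumptions, not as a measurement.

```python
# Hypothetical candidate valences per element, for illustration only.
CANDIDATE_VALENCES = {"H": [1], "C": [4], "N": [3, 4], "O": [2, 1, 3]}


def count_valence_combinations(symbols, table=CANDIDATE_VALENCES):
    """Count the per-atom valence assignments an exhaustive loop would visit."""
    total = 1
    for symbol in symbols:
        total *= len(table[symbol])
    return total


# One nitro group (N, O, O) under this table: 2 * 3 * 3 = 18 combinations.
one_nitro = count_valence_combinations(["N", "O", "O"])
# Four nitro groups multiply: 18 ** 4 = 104976 combinations.
four_nitro = count_valence_combinations(["N", "O", "O"] * 4)
```

Atoms with a single candidate contribute a factor of 1, which fits the observation that most molecules finish quickly while molecules with many N and O atoms do not.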

NiklasGebauer · 2020-11-05T10:29:00Z

Okay, I should be fine with handling these special cases then.
Keep up the good work!

NiklasGebauer closed this as completed Nov 5, 2020
