Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDF and MOL file reader import errors in xtb Version 6.2 RC2 #26

Closed
tobigithub opened this issue Oct 20, 2019 · 12 comments · Fixed by #30
Closed

SDF and MOL file reader import errors in xtb Version 6.2 RC2 #26

tobigithub opened this issue Oct 20, 2019 · 12 comments · Fixed by #30
Labels
bug Something isn't working

Comments

@tobigithub
Copy link

tobigithub commented Oct 20, 2019

Intro
I highly appreciate the MOL and SDF file reader and import, because it narrows the gap between cheminformatics with millions of molecules from PubChem and ChemSpider and the quantum chemistry world. So instead of going via information devoid xyz or coord files, the SDF and mol files can be enriched directly with meta information (energies, fukui indices, EA, HOMO-LUMO gaps etc). While we have tools to convert files (OpenBabel, ChemAxon molconvert) the direct calculation on MOL and SDF files is a really important feature, because it will allow for easier workflows and finally also broader use within the community. I am writing this somewhat lengthy intro because it is easy to just get rid of problematic features by excluding them, but that is not my intention.

Describe the bug
The mol/sdf file reader is reading only very specifically formatted molecules, but throws errors on correctly formatted standard MOL (MDL V2000) formatted files. This applies to simple molecules, not exotic SDF options. When it does that it actually leads to many different errors, from falsely breaking bonds to throwing errors regarding atom numbers and optimization failures. I am also adding additional recommendations, or if needed can file them as individual issues?

To Reproduce
Steps to reproduce the behavior:

XTB version: Version 6.2 RC2 (SAW190805)
I checked and converted most molecules with OpenBabel and ChemAxon mview and molconvert. These tools have been used to convert millions/billions of connection tables. They are pretty robust. Case 2-6 shows the errors.

Case 1 (Expected computational behavior):
Warning, this is a faulty SDF/mol molecule recreated from the xtb documentation, it can not be imported by OpenBabel or by MarvinView. There are not enough spaces after the atom coordinates (x y z O), therefore the molecules is corrupt. The xtb file reader probably just reads coordinates.

xtb h2o.sdf --opt  
TOTAL ENERGY               -5.070544442942 Eh   
GRADIENT NORM               0.000055195691 Eh/α 
HOMO-LUMO GAP              14.391940156700 eV   
normal termination of xtb

Case 2 (C6H6-m213-2D.sdf) benzene standard and error free 2D formatted mod/sdf file:

xtb C6H6-m213-2D.sdf --opt
#ERROR! (goedecker_solve) DSYSV failed
abnormal termination of xtb

Case 3 (C6H6-m213-aroma.sdf) (benzene aromatized bonds):

xtb C6H6-m213-aroma.sdf --opt
#ERROR! (goedecker_solve) DSYSV failed
abnormal termination of xtb

Case 4 (water-chemaxon.sdf) (correctly formatted SDF) atom number mismatch:

xtb water-chemaxon.sdf --opt
#ERROR! Atom number missmatch in Xmol file
abnormal termination of xtb

Case 5 (xtb water-openbabel-2D.sdf) (correctly formatted 2D SDF):

 --opt xtb water-openbabel-2D.sdf --opt
#ERROR! (goedecker_solve) DSYSV failed
abnormal termination of xtb

Case 6 (xtb water-openbabel-3D.sdf) (correctly formatted 3D SDF):
This case is interesting, because a different energy is presented, even with --opt extreme options. When compared to the working h2o.sdf example from the handbook, the energy is different and somewhat interesting the optimization failed, even when repeated.

xtb water-openbabel-3D.sdf --opt extreme

optimized geometry written to: xtbopt.sdf
           -------------------------------------------------
          | TOTAL ENERGY               63.813219402131 Eh   |
          | GRADIENT NORM            1205.618864654011 Eh/α |
          | HOMO-LUMO GAP              16.735238594075 eV   |
           -------------------------------------------------

########################################################################
# WARNING! Some non-fatal runtime exceptions were caught, please check:
#  - GEOMETRY OPTIMIZATION FAILED!
########################################################################

Please provide all input and output file such that we confirm your report.
See attached. XTB-sdf-issues.zip contains the examples shown here.

Additional recommendations

  • (R1) Input check, add 2D or 3D format requirement (will xtb be able to handle flat 2D mol files?)
  • (R2) Input check, add information if Kekule or aromatized bonds are required, if this information is ignored check for bond lengths. Aromatized bonds are "4" in MDL format, this is a common error in many SDF/mol software tools
  • (R3) Input check, add information if explicit hydrogens are required in the mol/sdf file (xtb does not care about molecules to be chemically valid, they currently can have hydrogens attached or not, xtb will perform an optimization either way) give potential warning if hydrogens are missing.
  • (R4) Input check, add warning if multiple molecules are contained in an SDF file. The molecules are detected by "$$$$". By definition SDF files allow for multiple molecules, while MOL files do not. Currently xtb only reads the first molecule and ignores all other molecules. Potentially it would be nice if xtb could stream and process a single SDF file with thousands/millions of molecules.
  • (R5) Input check, check for charge, radicals, multiplicity many SDF/MOL input files can have different representations for charges (nitro groups etc).
  • (R6) Add warning if final optimized connection table changed (check via InchiKey or bond length). That happens quite often and is an extended feature. It has implications on database duplicates.

Additional context

# sdf extension (molconvert)
molconvert sdf C6H6-3D-quickH-opt-ok.sdf -m -o C6H6-m.sdf

# mol extension (molconvert)
molconvert mol C6H6-3D-quickH-opt-ok.sdf -m -o C6H6-m.mol

# tmol extension (openbabel)
obabel C6H6-3D-quickH-opt-ok.sdf -otmol -O C6H6-m.tmol -m
@tobigithub tobigithub added the unconfirmed This report has not yet been confirmed by the developers label Oct 20, 2019
@awvwgk
Copy link
Member

awvwgk commented Oct 20, 2019

Thanks for reporting. There is indeed a bug in the released version, which should be fixed by #4.

@awvwgk awvwgk added bug Something isn't working and removed unconfirmed This report has not yet been confirmed by the developers labels Oct 20, 2019
@awvwgk
Copy link
Member

awvwgk commented Oct 21, 2019

I would very much appreciate having your recommendations as separate issues, since I cannot come up if a single pull request addressing all of them. Maybe a few thoughts here

  1. 2D structures are of course not reasonable as input for an GFN-xTB calculation, but could be used after a preprocessing/conversion step, either done externally or potentially on the xtb side.
  2. I don't get this one, xtb doesn't use any bond topology information (it actually drops them while reading the SDF, which is part of the problem you reported). In principle we are producing bond order information in xtb and could amend those from the SDF.
  3. Same as for 1., we need hydrogen atoms for the calculation, unless we preprocess the structure explicitly such input is not suitable for GFN-xTB calculations.
  4. I always thought this is out-of-scope for xtb. I know that GFN-xTB is suited for this kind of high-throughput calculations, but the xtb program is not designed to work this way. Don't get me wrong here, I like the idea, but I don't think it is possible to get this running in xtb in the near future.
  5. This is kind of related to 1. and 3., first we don't use this information as per atom information, but we need to sum it up as molecular information. So it is more likely a bug that we are not taking this into account properly.
  6. This is actually the most easiest thing to address, since we are producing topological information. If we stop throwing away the SDF information, we could perform this check immediately.

Let's sum this up:

  • we need to reimplement the SDF reader beyond the small fix in GBSA refactoring and API fixes #4, R5 is kind of a bug in the current SDF reader and should be fixed along with this issue.
  • R2 and R6 could be implemented with only minor changes. (separate issue or PR appreciated)
  • R1 and R3 are both a larger feature request, I will discuss those with my boss and our group, maybe we can come up with something useful here.
  • R4 is out of scope in my opinion, that means, I will not spend time to implement it in xtb.

@ghutchis
Copy link

IMHO, if a 2D case or an SDF is supplied without explicit hydrogens, I'd suggest calling obabel to convert. (Granted, I'm biased.)

@tobigithub
Copy link
Author

@awvwgk thank you for the additional comments, I will open individual requests with the new version. As you mentioned, not everything will be doable, but I have a test pipeline of a couple thousand of structures I want to send through. I will see what problems remain open after the patch. In the mean time I will just use the coord (tmol) input.

@ghutchis
Copy link

I've run through ~100k compounds using XYZ from Open Babel, but I have a validation set I've been using for holdout that I can run ~70k optimization + Hessian.

@tobigithub
Copy link
Author

@ghutchis will you be able to share the results and your methods?

I am using either bash shell scripts or some single liner constructs that may add some penalty to the overall computational time. For example using the C60 fullerene data from Comprehensive theoretical study of all 1812 C60 isomers [XYZ] one can calculate single point energies with a wonderful speedup factor of 80 in 0.7 seconds for each molecule on a node with 44 CPUs/88 threads. The final time boils down to 22 min for 1812 molecules, based on the really awesome scaling of xtb.

$ for d in ./*/ ; do (cd "$d" && echo -n "active dir: $d" && xtb geom.xyz 
| grep --line-buffered "TOTAL ENERGY"); done

active dir: ./1/          | TOTAL ENERGY             -128.453291352107 Eh   |
normal termination of xtb
active dir: ./10/          | TOTAL ENERGY             -128.331358624594 Eh   |
normal termination of xtb
active dir: ./100/          | TOTAL ENERGY             -128.265600476068 Eh   |
normal termination of xtb
active dir: ./1000/          | TOTAL ENERGY             -128.166664861876 Eh   |
normal termination of xtb
active dir: ./1001/          | TOTAL ENERGY             -128.168084576528 Eh   |
normal termination of xtb
active dir: ./1002/          | TOTAL ENERGY             -128.169887114880 Eh   |
normal termination of xtb
active dir: ./1003/          | TOTAL ENERGY             -128.168197428449 Eh   |

image

Of course once we calculate opt-freq we need to store the data, but I guess even 100k molecules should be doable in a couple of hours on multiple nodes. I wonder if one could pipe the SDF conversions from Open Babel directly to xtb and then extract the energies or other values one the fly. Its probably best done with the python API? I have not looked into that yet.

@ghutchis
Copy link

@tobigithub - this seems outside the discussion on this issue. Feel free to contact me. At the moment, the data isn't public, but we should be making several GFN-related manuscripts and data sets available soon. We generally use Python, but your shell script is similar and won't add much overhead.

awvwgk referenced this issue in awvwgk/xtb Oct 29, 2019
- PDB reader will try to read atom types from atom name
- VASP reader will keep information on the input format
- SDF reader will save information on atom tags
- SDF reader will warn if hydrogen atoms are missing, fixes #26
@awvwgk
Copy link
Member

awvwgk commented Oct 29, 2019

I guess we are almost there, see #30. So there are a few open questions here, I need a reliable way to determine:

  • that a structure is 2D (e.g. benzen is 2D in reality, so it should not catch this one)
  • if explicit hydrogen atoms are missing (e.g. S8 doesn't need hydrogen atoms)

If it is not easily possible, I will go with garbage in / garbage out here.

@ghutchis
Copy link

A SDF is supposed to note whether it's a 2D depiction or a 3D coordinates (that might be flat):
https://gist.github.com/ghutchis/270df9db7c4f2f4a30519aabbaa3f73d

Note that in the 2nd line, a 2D depiction ends in '2D' and a 3D depiction ends in '3D'

As far as hydrogen atoms, I'd probably go with 'we're going to run what you sent us' - maybe with a warning if the charge/spin seem weird.

One request though would be to add up the formal charges in a SDF file from the M CHG lines:
https://gist.github.com/ghutchis/29130f768fd85444333b2a0ec643a441

M CHG 1 3 -1
(one charge on atom 3, -1)
M CHG 2 3 -1 5 -1
(two charges, atom 3 is -1 and atom 5 is -1)

I'd have to check on the radical options (M RAD), but there are a lot more cases with formal charges (e.g., amino acids, side chains, etc.)

M ISO might be useful for some purposes too.

@tobigithub
Copy link
Author

I guess we are almost there, see #30. So there are a few open questions here, I need a reliable way to determine:

  • that a structure is 2D (e.g. benzen is 2D in reality, so it should not catch this one)
  • if explicit hydrogen atoms are missing (e.g. S8 doesn't need hydrogen atoms)

If it is not easily possible, I will go with garbage in / garbage out here.

@awvwgk I think that is something to check with tests, while it is impossible to cover all MDL MOL definitions, it will still leave room for errors. But covering 90% of the correct cases should be fine.

Meaning if molecules pass openbabel or molconvert they are fine, If xtb accepts molecules that do not pass a openbabel conversion something is wrong. Plus xtb should be able to calculate energies for a random selection of good molecules such as those below.

Data sources of good molecules can be PubChem, EBI, HMDB, NCI
ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound_3D/01_conf_per_cmpd/SDF/
ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/
https://www.ebi.ac.uk/chebi/downloadsForward.do

@tobigithub
Copy link
Author

@awvwgk

need a reliable way to determine:
that a structure is 2D (e.g. benzen is 2D in reality, so it should not catch this one)

A way to determine if a structure is 2D in MOL or SDF is that a single column's coordinates sum equals zero. In the case below, the z-axis is zero. It could also be x or y but most packages just zero out the z-axis. While there could be other cases (rotation along a specific bond) this is 99% never the case. I am sure there are matrix operations to test for all cases, but the sum of z-axis coordinates equals zero is a good test for having 2D molecules. Its what most of the software tools do.

Another observation is that xtb actually handles 2D molecules quite well. So it might not be a big issue, but additional tests are required. I am running the CHEBI molecules (https://www.ebi.ac.uk/chebi/downloadsForward.do) with a single energy point calculation to see if something is wrong.


  Marvin  05071312412D          

 12 13  0  0  0  0            999 V2000
   16.4064   -5.7650    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   16.4064   -6.6172    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   17.0908   -7.0082    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   17.8173   -6.6172    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   17.8173   -5.7650    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   17.0908   -5.3740    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   18.5856   -5.5415    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   19.0745   -6.1981    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   18.5856   -6.8407    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   17.0908   -4.5637    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   19.8987   -6.1981    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   15.7079   -7.0223    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  5  4  2  0  0  0  0
  9  4  1  0  0  0  0
  3  4  1  0  0  0  0
  7  5  1  0  0  0  0
  6  5  1  0  0  0  0
  8  7  1  0  0  0  0
  8  9  1  0  0  0  0
 11  8  2  0  0  0  0
  2  3  2  0  0  0  0
  2  1  1  0  0  0  0
  6  1  1  0  0  0  0
 12  2  1  0  0  0  0
 10  6  2  0  0  0  0
M  END
> <ChEBI ID>
CHEBI:52617

> <ChEBI Name>
7,8-dihydro-8-oxoguanine

> <Star>
3


@tobigithub
Copy link
Author

@awvwgk

if explicit hydrogen atoms are missing (e.g. S8 doesn't need hydrogen atoms)

The test would be for organic compounds, most commonly carbon bound to hydrogen, nitrogen, oxygen, sulfur. So I would not be concerned with exotic cases, including metals or single elements. Elements CHNSOP are enough. Elements such as SNP will have multiple valencies, but here one could use the most common valence states for each element. The cheapest way would be to test if there is any carbon at all, short of fullerenes and pure carbon compounds.

A way to test would be via Lewis and Senior rules and check that against the charge. See
ANALOGOUS ODD-EVEN PARITIESIN MATHEMATICS AND CHEMISTRY
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.665.1134&rep=rep1&type=pdf

@awvwgk awvwgk closed this as completed in #30 Nov 4, 2019
awvwgk added a commit that referenced this issue Nov 4, 2019
- add type-bound procedures for tb_molecule class
  - generic reader from unit
  - generic writer to unit
  - allow additional tags on atoms
    - molfile/SDF specific tags
    - PDB specific tags
  - allow meta data stored along with the molecule
    - VASP input format information
    - SDF info block (raw string block)
- rework backend reader/writer for common formats (remove rewind statements)
  - Turbomole coordinate format
  - xyz coordinate format
  - molfile coordinate format
  - SDF coordinate format (see #26)
  - VASP coordinate format
  - PDB coordinate format
- add topology information to molecule class (new topology class)
- add fragment information to molecule class (new fragment class)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants