TautomerCanonicalizer gives unexpected/forbidden form of phosphoric acid #20
Comments
NADH looks like this:
from rdkit import Chem
from rdkit.Chem import Draw
from molvs.tautomer import TautomerCanonicalizer
original_smiles = 'NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](N4C=NC5=C4N=CN=C5N)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1'
original_mol = Chem.MolFromSmiles(original_smiles)
tautomerized_mol = TautomerCanonicalizer().canonicalize(original_mol)
Draw.MolsToGridImage([original_mol, tautomerized_mol],
                     molsPerRow=1, subImgSize=(600, 300),
                     legends=['original', 'tautomer'])
I think this is caused by the phosphonic acid rules: https://github.com/mcs07/MolVS/blob/master/molvs/tautomer.py#L130 It can probably be fixed by making the SMARTS pattern stricter so that it matches only the intended target.
You are correct: removing that rule stops that moiety from being modified. When you say "more strict", do you mean specifying an explicit number of bonds on the phosphorus in the SMARTS pattern? Why does RDKit allow 7 bonds on the phosphorus? RDKit is a vast package, but looking at its definition of phosphorus, it has a maximum of 5 bonds. If I run SanitizeMol, the hydrogen stays put. When I paste the structure into ChemDraw, it's not valid.
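To make the question about counting bonds in SMARTS concrete, here is a minimal sketch that inspects the phosphorus atom in a few small acids with RDKit. It assumes the usual RDKit mapping of SMARTS `D` to `Atom.GetDegree()` (explicit neighbours) and `X` to `Atom.GetTotalDegree()` (neighbours plus hydrogens); the helper name `phosphorus_counts` is made up for illustration and is not part of RDKit or MolVS.

```python
from rdkit import Chem


def phosphorus_counts(smiles):
    """Return (degree, total_degree, total_Hs) for the first P atom found."""
    mol = Chem.MolFromSmiles(smiles)
    for atom in mol.GetAtoms():
        if atom.GetSymbol() == 'P':
            # GetDegree counts explicit neighbours only; GetTotalDegree
            # also counts hydrogens attached to the atom.
            return atom.GetDegree(), atom.GetTotalDegree(), atom.GetTotalNumHs()
    raise ValueError('no phosphorus atom in ' + smiles)


# Hypothetical sample inputs: a P(III) acid, a P-H acid, and a phosphonic acid.
for smi in ['CP(O)O', 'CP(=O)O', 'CP(=O)(O)O']:
    print(smi, phosphorus_counts(smi))
```

A pattern that pins both counts can therefore tell a three-coordinate phosphorus apart from a four-coordinate phosphate or phosphonate centre.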
Updates SMARTS definitions for phosphinic acids. Requires 3 explicit (D3) and 3 total (X3) connections for tautomerizing phosphinic acids. New behavior properly handles compounds with 4 connections (e.g., phosphates, phosphonic acids).

```python
from rdkit import Chem
from molvs.tautomer import TautomerCanonicalizer, TautomerTransform
import pandas as pd

# Stricter phosphorus rules: only a phosphorus with 3 explicit (D3) and
# 3 total (X3) connections is eligible for the forward transform.
my_transforms = (
    TautomerTransform('phosphonic acid f', '[OH]-[PD3X3H0]', bonds='='),
    TautomerTransform('phosphonic acid r', '[PD3X3H1]=[O]', bonds='-'),
)

cpds = ['methylphosphinic acid', 'methylphosphonous acid', 'methylphosphonic acid', 'NADPH']
smiles = ['CP(=O)O',
          'CP(O)O',
          'CP(=O)(O)O',
          'NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](N4C=NC5=C4N=CN=C5N)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1']

# Canonicalize each molecule using only the custom transforms.
mols = [Chem.MolFromSmiles(smi) for smi in smiles]
can_taut = [TautomerCanonicalizer(transforms=my_transforms).canonicalize(mol) for mol in mols]
smiles_taut = [Chem.MolToSmiles(mol) for mol in can_taut]
df = pd.DataFrame({'cpd': cpds, 'smi': smiles, 'taut_smi': smiles_taut})

# Resulting DataFrame:
#    cpd                     smi         taut_smi
# 0  methylphosphinic acid   CP(=O)O     C[PH](=O)O
# 1  methylphosphonous acid  CP(O)O      C[PH](=O)O
# 2  methylphosphonic acid   CP(=O)(O)O  CP(=O)(O)O
# 3  NADPH
#    smi:      NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](N4C=NC5=C4N=CN=C5N)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1
#    taut_smi: NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1
```
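As a quick cross-check that needs only RDKit, the forward pattern from the transform above can be tested directly against a few small molecules. This is a sketch rather than part of the patch; the dictionary of example molecules (including phosphoric acid, which the issue mentions) is chosen here for illustration.

```python
from rdkit import Chem

# Forward pattern copied from the 'phosphonic acid f' transform above.
strict = Chem.MolFromSmarts('[OH]-[PD3X3H0]')

# Illustrative inputs: one three-coordinate phosphorus, two four-coordinate.
examples = {
    'methylphosphonous acid': 'CP(O)O',
    'methylphosphonic acid': 'CP(=O)(O)O',
    'phosphoric acid': 'OP(=O)(O)O',
}

# Record whether the pattern matches anywhere in each molecule.
matches = {}
for name, smi in examples.items():
    mol = Chem.MolFromSmiles(smi)
    matches[name] = mol.HasSubstructMatch(strict)
    print(name, matches[name])
```

Only the three-coordinate phosphorus should match, which is consistent with phosphates and phosphonic acids being left untouched in the table above.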
I'm converting all the molecules in my database to canonical tautomers and noticed that things like NADH looked weird. You can see it most plainly for phosphoric acid. I didn't expect the hydrogen on the phosphorus. Is this the correct/expected behavior?