init draft aromatic #38

Merged
merged 18 commits into from May 8, 2024

fgrunewald
Collaborator

@fgrunewald fgrunewald commented May 3, 2024

This is the initial draft of the new aromaticity algorithm following the ideas outlined here and here.

The key difference is that we are not trying to assign chemical aromaticity but rather kekulize the molecule (i.e. fixing hcount). In a nutshell, the algorithm proceeds as follows:

  1. Assign a preliminary hcount to all non-hydrogen atoms. This is a bit awkward but needed for the next step, because we want to be able to deal with cases where implicit hydrogens are part of the non-aromatic atoms.
  2. Remove all nodes that have a full valence, which we can only assess after implicit hydrogens have been added to non-aromatic residues.
  3. Get the connected components of the resulting fragmented graph. Each component has to be a delocalized system and is potentially aromatic.
  4. For each component we compute a maximum matching and check whether that matching is perfect. If it is not perfect, the delocalized system is written incorrectly and a syntax error is raised. It's like checking whether we have perfectly alternating single and double bonds.
    5a. If the system is cyclic we assume it to be (anti-)aromatic and give it a bond order of 1.5.
    5b. If it is not cyclic then we simply assign a bond order of 2 to the edges that constitute the perfect matching.
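
The steps above can be sketched roughly as follows. This is a minimal illustration using networkx, not the code from this PR: the function name `kekulize_sketch`, the `saturated` argument (standing in for steps 1 and 2), and the `order` edge attribute are assumptions made for the example, and "cyclic" is taken to mean that the component contains at least one ring.

```python
import networkx as nx


def kekulize_sketch(mol, saturated):
    """Assign bond orders to the delocalized subsystems of `mol`.

    `mol` is an undirected nx.Graph of atoms. `saturated` is the set of
    nodes already known to have a full valence (a hypothetical input that
    replaces the preliminary hcount assignment of step 1).
    """
    # Step 2: drop every atom that already has a full valence.
    unsaturated = [n for n in mol if n not in saturated]
    delocalized = mol.subgraph(unsaturated)

    # Step 3: each connected component is one delocalized system.
    for nodes in list(nx.connected_components(delocalized)):
        component = delocalized.subgraph(nodes)

        # Step 4: a maximum matching must cover every atom (be perfect).
        matching = nx.max_weight_matching(component, maxcardinality=True)
        if 2 * len(matching) != component.number_of_nodes():
            raise SyntaxError("delocalized system cannot be kekulized")

        # A connected graph has a ring iff it has at least as many edges as nodes.
        if component.number_of_edges() >= component.number_of_nodes():
            # Step 5a: cyclic system, every bond gets order 1.5.
            for u, v in list(component.edges):
                mol.edges[u, v]["order"] = 1.5
        else:
            # Step 5b: acyclic system, matched edges become double bonds.
            for u, v in matching:
                mol.edges[u, v]["order"] = 2


benzene = nx.cycle_graph(6)  # six unsaturated ring atoms
kekulize_sketch(benzene, saturated=set())
print(benzene.edges[0, 1]["order"])  # 1.5
```

In this sketch a five-membered ring with one saturated atom (as in c1c[nH]cc1) leaves a four-atom chain, which gets two double bonds, while a five-membered ring with no saturated atom (as in c1cncc1) has no perfect matching and raises the error.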

Some differences in behaviour to the previous version:

| SMILES | VALID | AROMATIC | old | new |
|---|---|---|---|---|
| c1c[nH]cc1 | yes | no | pass | pass |
| c1cNcc1 | yes | no | pass | pass |
| c1cncc1 | no | raises Error | fail (no hydrogen on N) | pass |
| c1cscc1 | yes | no | fail | pass |
| c1cScc1 | yes | no | pass | pass |
| c1cnc[nH]1 | yes | no | pass | pass |
| c1cncN1 | yes | no | pass | pass |
| N12ccccc1ccc2 | yes | no | pass | pass |
| n12ccccc1ccc2 | yes | no | pass | pass |
| c12ccccc1Ncc2 | yes | no | pass | pass |
| c12ccccc1[nH]cc2 | yes | no | pass | pass |
| c12ccccc1ncc2 | no | raises Error | fail (no hydrogen on N) | error |
| c1cscn1 | yes | no | fail | pass |
| cccc | yes | no | fail (raises Error) | pass |
| OCCn2c(=N)n(CCOc1ccc(Cl)cc1Cl)c3ccccc23 | yes | partly | fail (h on aromatic n) | pass |

The molecule from this blog-post mentioned in #19 is also fixed.

Overall I'd say this algorithm is more robust as it raises an Error for hard fails like c1cncc1 but is also lenient towards chemically intuitive SMILES like cccc.

The major problems are:

  • how to deal with wildcards? For now they are just ignored because we don't know the valency and bond order. That means [*]1[*][*]1 is not aromatic anymore.
  • how to deal with charges in the initial valence assignment? I think it should be missing = valence - bonds + charges
  • how to assign aromaticity for fused rings. Currently naphthalene is not aromatic, which might be fine???
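
As a small worked example of the charge formula proposed in the second point, the sketch below evaluates missing = valence - bonds + charges for a few atoms. The function name `missing_hydrogens` is hypothetical, and `bonds` is assumed to mean the sum of bond orders to neighbouring heavy atoms; neither is taken from the PR code.

```python
def missing_hydrogens(valence, bonds, charge):
    """Proposed rule from the PR description: missing = valence - bonds + charge.

    `bonds` is assumed to be the sum of bond orders to heavy-atom neighbours.
    """
    return valence - bonds + charge


# N in C[N+](C)C: valence 3, three single bonds, charge +1
print(missing_hydrogens(3, 3, +1))  # 1

# O in C[O-]: valence 2, one single bond, charge -1
print(missing_hydrogens(2, 1, -1))  # 0

# Central C in C[C+](C)C: valence 4, three single bonds, charge +1
print(missing_hydrogens(4, 3, +1))  # 2
```

The first two results match the expected hydrogen counts. The carbocation case gives 2 even though that carbon carries no hydrogens, so the rule may not hold for every element and charge combination.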

@pckroon
Owner

pckroon commented May 6, 2024

Many thanks! First your questions, I'll have a look at the code as well.

how to deal with wildcards ? For now they are just ignored because we don't know the valency and bond order. That means [*]1[*][*]1 is not aromatic anymore.

Wildcards should always be able to form a double bond without inducing a charge, so they should be part of the delocalized subgraph. I'm not a hundred percent sure why [*]1[*][*]1 should be aromatic though, since it only has 3 atoms.

how to deal with charges in the initial valence assignment ? I think it should be missing = valence - bonds + charges

I need to brood on this. Can this be summarized in such a general way, or do we need to apply the actual octet rule?

how to assign aromaticity for fused rings. Currently naphthalene is not aromatic, which might be fine???

This is a problem IMO, naphthalene does show DIME.

Owner

@pckroon pckroon left a comment

Very nice! I do think it can be cleaned up a little bit more though.
We can also draw some inspiration from https://github.com/aspuru-guzik-group/selfies/blob/master/selfies/mol_graph.py#L287

We can assume that any atom that did not specify any number of hydrogens is saturated and can be pruned from the DS. So no need to prefill the valence.
The wildcard atoms can become an issue if they result in a non-perfect matching, so maybe all wildcards that are not in a maximal matching should be removed before checking whether it's perfect.
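
The wildcard suggestion could look roughly like the sketch below: compute a maximum matching, drop the wildcards it leaves uncovered, and only then test whether the matching is perfect. The function name and the `element` node attribute with `"*"` marking a wildcard are assumptions made for this example, not the PR's implementation.

```python
import networkx as nx


def kekulizable_ignoring_wildcards(component):
    """Return True if `component` has a perfect matching once wildcards
    that the maximum matching leaves uncovered are removed.

    Nodes are assumed to carry an `element` attribute; "*" is a wildcard.
    """
    matching = nx.max_weight_matching(component, maxcardinality=True)
    matched = {node for edge in matching for node in edge}

    # Wildcards outside the matching are pruned instead of causing an error.
    removable = [
        n for n in component
        if n not in matched and component.nodes[n].get("element") == "*"
    ]
    pruned = component.copy()
    pruned.remove_nodes_from(removable)

    # Perfect means every remaining atom is covered by a matched edge.
    return 2 * len(matching) == pruned.number_of_nodes()


triangle = nx.cycle_graph(3)
nx.set_node_attributes(triangle, "*", "element")
print(kekulizable_ignoring_wildcards(triangle))  # True
```

A maximum matching is generally not unique, so when a wildcard and a regular atom compete for the same partner the outcome of this simple version can depend on which matching the solver returns.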

@fgrunewald
Collaborator Author

@pckroon small update: naphthalene is now correctly assigned; I had an error earlier. All systems that show DIME are identified as aromatic as far as the test cases go. Only those like thiophene are not.

@fgrunewald fgrunewald requested a review from pckroon May 8, 2024 13:19
Owner

@pckroon pckroon left a comment

Love it. Small nitpick on some code, and a doubt on the indole test

Owner

@pckroon pckroon left a comment

Awesome work, I love it <3

@pckroon
Owner

pckroon commented May 8, 2024

Could you add 2 more test cases? One with a triangle, and one that cannot be kekulized and trips the error. That should bring the coverage back up.

@pckroon pckroon merged commit a1684af into pckroon:master May 8, 2024
5 of 6 checks passed