Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Molecules update #440

Merged
merged 114 commits into from
Feb 7, 2023
Merged

Conversation

espottesmith
Copy link
Contributor

(Mostly) small changes to the recently included molecule builder pipeline. The biggest change is in creation of an MPculeID, a new ID format. Previous discussions led me to believe that an ID based on task IDs is fragile, potentially breaking if old tasks become deprecated. In order to avoid this problem while also guaranteeing the uniqueness of IDs, the MPculeID follows the format:

(prefix-)hash-charge-spin

where the prefix is optional and the hash is a Weisfeiler Lehman on a molecular graph representation (generated using networkx). The node attributes used to generate the WL hash are either the XYZ coordinates of the molecule (used for molecule association) or the atom species (used for constructing final molecule docs). Because charge and spin are also included in the ID, the risk of IDs no longer being unique due to a hash collision becomes negligible. To further protect against this, we could add the alphabetical formula, so that the ID would become (prefix-)hash-formula-charge-spin.

I note that these IDs are not easily understood or remembered by humans but can be easily generated (there is a utility function in emmet.core.utils; this utility might eventually be better placed in pymatgen).

Contributor Checklist

  • I have broken down my PR scope into the following TODO tasks
    • Make it so bonding is not a required component of a build pipeline
    • Fix SMD string representation for G2 solvent
    • Add MPcule ID format
  • I have run the tests locally and they passed.
  • I have added tests, or extended existing tests, to cover any new features or bugs fixed in this PR

"ACETONITRILE",
"BENZENE",
"METHANOL",
]

PCM_DIELECTRICS = {
"WATER": 78.39,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @espottesmith ! I don't have any comments on the molecule ID questions; I defer to you and others on that one. Your approach sounds robust to me.

Since you're making some updates to calc_types I wonder if I can request a small change to roll into this PR?

At this line you use 2 decimals of precision on the dielectric constant to make sure a PCM(water) task is valid. That precision seems a little much to me (and it caused the build pipeline to fail when I had run calculations using a dielectric of 78.4). Do you think it would be reasonable to relax this to 78.4 (or maybe even 78), and/or to use a rounding approach to validate the task type (e.g., if the dielectric is within 0.1 units of 78.4, call it PCM(water)? A similar consideration might apply to other calc_types that use PCM.

Happy to hear alternative suggestions; I just don't feel the build pipeline should fail to validate a task just because the dielectric is 0.01 units different than the setting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll play around with this.

Honestly, I'm on the fence about the whole naming solvents thing. It's nice from a "human readability" perspective - what does "78.4" mean? It means "water"! It's also kind of important to have standard names for level of theory comparisons. You don't want to mix calculations with different solvents. Even if 78 and 78.39 are similar, it's strictly speaking not sound to directly compare the two. Just like you shouldn't directly compare energies obtained with wB97X-D and wB97X-D3.

On the other hand, as you say, there's differing degrees of specificity that you could use. And it's also not easy to assign a name to a dielectric constant - dimethoxyethane and tetrahydrofuran are both around 7.

@codecov-commenter
Copy link

codecov-commenter commented Jun 23, 2022

Codecov Report

Base: 92.74% // Head: 87.77% // Decreases project coverage by -4.98% ⚠️

Coverage data is based on head (04e27f4) compared to base (91e7bf6).
Patch coverage: 86.15% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #440      +/-   ##
==========================================
- Coverage   92.74%   87.77%   -4.98%     
==========================================
  Files         134      108      -26     
  Lines       24791     6157   -18634     
==========================================
- Hits        22992     5404   -17588     
+ Misses       1799      753    -1046     
Impacted Files Coverage Δ
emmet-core/emmet/core/settings.py 88.33% <ø> (-10.00%) ⬇️
emmet-core/emmet/core/thermo.py 96.70% <ø> (ø)
emmet-core/emmet/core/vasp/validation.py 69.87% <ø> (ø)
emmet-core/emmet/core/qchem/molecule.py 28.81% <26.47%> (-50.04%) ⬇️
emmet-core/emmet/core/utils.py 60.60% <56.75%> (-25.38%) ⬇️
emmet-core/emmet/core/structure_group.py 49.57% <71.42%> (ø)
emmet-core/emmet/core/molecules/thermo.py 87.50% <75.00%> (+22.63%) ⬆️
emmet-core/emmet/core/qchem/task.py 91.54% <83.33%> (+0.78%) ⬆️
emmet-core/emmet/core/mpid.py 91.08% <89.36%> (-5.21%) ⬇️
emmet-core/emmet/core/molecules/atomic.py 93.42% <94.44%> (+62.07%) ⬆️
... and 63 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

espottesmith and others added 5 commits June 23, 2022 10:05
… names and instead create utility function to define a solvent string
Reminder: need to regenerate "builder_XXX_set.json" test files once builders are all done
@munrojm
Copy link
Member

munrojm commented Jul 1, 2022

@espottesmith are you good if I merge when the tests pass? Is there any remaining issue you would like to discuss?

@espottesmith
Copy link
Contributor Author

I'm making some changes in response to @rkingsbury's comments. I'll let you know when I'm ready to merge.

Comment on lines 235 to 242
doc = MoleculeDoc.from_tasks(group)
molecule_id = "{}-{}-{}".format(
doc.coord_hash,
str(int(doc.charge)).replace("-", "m"),
doc.spin_multiplicity,
)
doc.molecule_id = molecule_id
molecules.append(doc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@espottesmith when testing out the build pipeline on solvated ion clusters I found a bug in this section of code. For monatomic ions with the same charge, this will result in duplicate molecule_id b/c the coordinates, charge, and multiplicity can all be identical.

I worked around the issue by doing this; although there is probably a more robust / general solution:

        for group in self.filter_and_group_tasks(tasks):
            try:
                # TODO - this molecule_id formula will return identical ids for single atoms
                # that have the same charge and spin
                doc = MoleculeDoc.from_tasks(group)
                # use something different for single atoms
                if len(doc.composition)==1:
                    molecule_id = MPculeID("{}-{}-{}".format(
                        doc.species_hash,
                        str(int(doc.charge)).replace("-", "m"),
                        doc.spin_multiplicity,
                    ))
                else:
                    molecule_id = MPculeID("{}-{}-{}".format(
                        doc.coord_hash,
                        str(int(doc.charge)).replace("-", "m"),
                        doc.spin_multiplicity,
                    ))
                doc.molecule_id = molecule_id
                molecules.append(doc)

I tried just changing the coord_hash to species_hash for all tasks, but that can result in duplicated IDs for other reasons, so I think it's important to keep the coordinates in there in most cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I realized this a while back and hadn't had a chance to fix it. The solution that I'm going to go for is actually just making MPculeID a little longer and including the formula. I actually also think that'll help it be more readable. The hash still won't be easily interpreted, but the user will at least have a clue what they're looking at if the formula is there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. If you have any prototype code for that (even if very rough), feel free to send my way and I can test it on my builds (but if you haven't got there yet, no worries)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized we also need to consider what happens if we have the same single atom / ion in different solvents. My solutions breaks in that case - all solvents collapse to the same MP-ID which results in calcs getting lost in the build pipeline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be fine, I think? Since I've changed how level of theory is handled, and molecules in different solvents should be collapsed into a single document anyways. You're right that this will be treated somewhat differently in the association stage, because in general molecules in different solvents won't have identical structures. But I don't believe it's inconsistent. Will test this to confirm.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah as long as molecules in different solvents wind up in the same document / ID this should be OK. I'm not seeing that happen in my current build pipeline testing (it seems that only Water gets past the association stage), but I haven't had time to troubleshoot much yet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can send me a minimal test case (let's say one molecule in 2 or 3 different solvents), that'd actually be a big help.

@espottesmith
Copy link
Contributor Author

espottesmith commented Feb 5, 2023

Found an edge case where monatomic hydrogen (H1) was causing certain builders to fail. Resolved now - have built collections involving 48,672 unique molecules.

Should be good to merge now.

@munrojm
Copy link
Member

munrojm commented Feb 6, 2023

Found an edge case where monatomic hydrogen (H1) was causing certain builders to fail. Resolved now - have built collections involving 48,672 unique molecules.

Should be good to merge now.

Great, that sounds good. Working to get the tests fixed, then I will merge.

@rkingsbury
Copy link
Collaborator

@espottesmith apologies I haven't had a chance to do much testing with this yet. It is still on the docket though, possibly later this week / early next. I don't have any objections to merging, but if there's anything in particular you want me to test out please let me know. I'd say this is a tremendous foundation for molecule docs though, and we can always address small issues in subsequent PRs.

@munrojm munrojm merged commit 5ff41fa into materialsproject:main Feb 7, 2023
@espottesmith espottesmith deleted the molecules_update branch February 7, 2023 23:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants