Molecules update #440

espottesmith · 2022-06-22T19:01:37Z

(Mostly) small changes to the recently included molecule builder pipeline. The biggest change is the creation of the MPculeID, a new ID format. Previous discussions led me to believe that an ID based on task IDs is fragile and could break if old tasks are deprecated. To avoid this problem while still guaranteeing unique IDs, the MPculeID follows the format:

(prefix-)hash-charge-spin

where the prefix is optional and the hash is a Weisfeiler-Lehman (WL) hash of a molecular graph representation (generated using networkx). The node attributes used to generate the WL hash are either the XYZ coordinates of the molecule (used for molecule association) or the atom species (used for constructing final molecule docs). Because charge and spin are also included in the ID, the risk of two IDs colliding because of a hash collision becomes negligible. To further protect against this, we could add the alphabetical formula, so that the ID would become (prefix-)hash-formula-charge-spin.

I note that these IDs are not easily understood or remembered by humans but can be easily generated (there is a utility function in emmet.core.utils; this utility might eventually be better placed in pymatgen).
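To make the ID scheme above concrete, here is a minimal, standard-library-only sketch of Weisfeiler-Lehman-style hashing followed by assembly of a (prefix-)hash-charge-spin string. It is not the emmet implementation: the PR uses networkx, whereas the functions `wl_hash` and `mpcule_id`, the digest size, and the number of iterations below are assumptions chosen for illustration. The only detail taken from the PR itself is that a negative charge is written with "m" in place of "-".

```python
import hashlib
from collections import Counter


def _digest(text: str) -> str:
    """Short, stable hex digest of a string."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=8).hexdigest()


def wl_hash(labels: dict, edges: list, iterations: int = 3) -> str:
    """Weisfeiler-Lehman-style hash of an undirected, node-labelled graph.

    labels maps node -> initial label (for example the atom species);
    edges is a list of (node, node) pairs (for example bonds).
    """
    neighbors = {node: [] for node in labels}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    current = {node: str(label) for node, label in labels.items()}
    for _ in range(iterations):
        # Relabel each node from its own label plus its sorted neighbor labels.
        current = {
            node: _digest(
                current[node] + "|" + ",".join(sorted(current[n] for n in neighbors[node]))
            )
            for node in current
        }

    # Only the multiset of final labels is hashed, so node numbering is irrelevant.
    counts = sorted(Counter(current.values()).items())
    return _digest(repr(counts))


def mpcule_id(graph_hash: str, charge: int, spin: int, prefix: str = "") -> str:
    """Assemble (prefix-)hash-charge-spin; a negative charge is written with 'm'."""
    charge_token = str(int(charge)).replace("-", "m")
    core = f"{graph_hash}-{charge_token}-{spin}"
    return f"{prefix}-{core}" if prefix else core


# Water described with two different node numberings.
water_a = wl_hash({0: "O", 1: "H", 2: "H"}, [(0, 1), (0, 2)])
water_b = wl_hash({0: "H", 1: "O", 2: "H"}, [(1, 0), (1, 2)])
h2s = wl_hash({0: "S", 1: "H", 2: "H"}, [(0, 1), (0, 2)])
```

Because each round only looks at sorted neighbor labels, the two water graphs hash identically regardless of atom ordering, while swapping O for S changes the hash. Using coordinates instead of species as the initial labels would follow the same pattern.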

Contributor Checklist

I have broken down my PR scope into the following TODO tasks
- Make it so bonding is not a required component of a build pipeline
- Fix SMD string representation for G2 solvent
- Add MPcule ID format
I have run the tests locally and they passed.
I have added tests, or extended existing tests, to cover any new features or bugs fixed in this PR

rkingsbury · 2022-06-22T19:30:47Z

emmet-core/emmet/core/qchem/calc_types/calc_types.py

+    "ACETONITRILE",
+    "BENZENE",
+    "METHANOL",
+]

 PCM_DIELECTRICS = {
    "WATER": 78.39,


Thanks @espottesmith ! I don't have any comments on the molecule ID questions; I defer to you and others on that one. Your approach sounds robust to me.

Since you're making some updates to calc_types I wonder if I can request a small change to roll into this PR?

At this line you use 2 decimals of precision on the dielectric constant to make sure a PCM(water) task is valid. That precision seems a little much to me (and it caused the build pipeline to fail when I had run calculations using a dielectric of 78.4). Do you think it would be reasonable to relax this to 78.4 (or maybe even 78), and/or to use a rounding approach to validate the task type (e.g., if the dielectric is within 0.1 units of 78.4, call it PCM(water))? A similar consideration might apply to other calc_types that use PCM.

Happy to hear alternative suggestions; I just don't feel the build pipeline should fail to validate a task just because the dielectric is 0.01 units different than the setting.

Yeah, I'll play around with this.

Honestly, I'm on the fence about the whole naming solvents thing. It's nice from a "human readability" perspective - what does "78.4" mean? It means "water"! It's also kind of important to have standard names for level of theory comparisons. You don't want to mix calculations with different solvents. Even if 78 and 78.39 are similar, it's strictly speaking not sound to directly compare the two. Just like you shouldn't directly compare energies obtained with wB97X-D and wB97X-D3.

On the other hand, as you say, there are differing degrees of specificity that you could use. And it's also not easy to assign a name to a dielectric constant - dimethoxyethane and tetrahydrofuran are both around 7.
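One way to picture the tolerance-based check suggested above, and the ambiguity it can introduce, is the sketch below. It is only an illustration under stated assumptions: the function `match_solvents` is hypothetical, the WATER value comes from the PCM_DIELECTRICS snippet in the diff, and the tetrahydrofuran and dimethoxyethane values are approximate numbers added here for demonstration, not values from emmet.

```python
# WATER is taken from the PCM_DIELECTRICS snippet in the diff; the other two
# entries are approximate, illustrative values and are NOT from emmet.
PCM_DIELECTRICS = {
    "WATER": 78.39,
    "TETRAHYDROFURAN": 7.58,
    "DIMETHOXYETHANE": 7.2,
}


def match_solvents(dielectric: float, tolerance: float = 0.1) -> list:
    """Return every named solvent whose reference dielectric is within tolerance."""
    return sorted(
        name
        for name, reference in PCM_DIELECTRICS.items()
        if abs(dielectric - reference) <= tolerance
    )


water_hits = match_solvents(78.4)                  # 0.01 away from 78.39
no_hits = match_solvents(78.0)                     # 0.39 away, outside 0.1
ambiguous_hits = match_solvents(7.4, tolerance=0.3)
```

Returning a list rather than a single name makes the trade-off visible: a tight tolerance accepts 78.4 as water, while a looser one can match more than one solvent when reference values sit close together.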

codecov-commenter · 2022-06-23T00:03:37Z

Codecov Report

Base: 92.74% // Head: 87.77% // Decreases project coverage by -4.98% ⚠️

Coverage data is based on head (04e27f4) compared to base (91e7bf6).
Patch coverage: 86.15% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #440      +/-   ##
==========================================
- Coverage   92.74%   87.77%   -4.98%     
==========================================
  Files         134      108      -26     
  Lines       24791     6157   -18634     
==========================================
- Hits        22992     5404   -17588     
+ Misses       1799      753    -1046

| Impacted Files | Coverage Δ |
|---|---|
| emmet-core/emmet/core/settings.py | 88.33% <ø> (-10.00%) ⬇️ |
| emmet-core/emmet/core/thermo.py | 96.70% <ø> (ø) |
| emmet-core/emmet/core/vasp/validation.py | 69.87% <ø> (ø) |
| emmet-core/emmet/core/qchem/molecule.py | 28.81% <26.47%> (-50.04%) ⬇️ |
| emmet-core/emmet/core/utils.py | 60.60% <56.75%> (-25.38%) ⬇️ |
| emmet-core/emmet/core/structure_group.py | 49.57% <71.42%> (ø) |
| emmet-core/emmet/core/molecules/thermo.py | 87.50% <75.00%> (+22.63%) ⬆️ |
| emmet-core/emmet/core/qchem/task.py | 91.54% <83.33%> (+0.78%) ⬆️ |
| emmet-core/emmet/core/mpid.py | 91.08% <89.36%> (-5.21%) ⬇️ |
| emmet-core/emmet/core/molecules/atomic.py | 93.42% <94.44%> (+62.07%) ⬆️ |

... and 63 more

munrojm · 2022-07-01T02:51:13Z

@espottesmith are you good if I merge when the tests pass? Is there any remaining issue you would like to discuss?

espottesmith · 2022-07-01T05:16:47Z

I'm making some changes in response to @rkingsbury's comments. I'll let you know when I'm ready to merge.

rkingsbury · 2022-08-08T14:28:34Z

emmet-builders/emmet/builders/qchem/molecules.py

+                doc = MoleculeDoc.from_tasks(group)
+                molecule_id = "{}-{}-{}".format(
+                    doc.coord_hash,
+                    str(int(doc.charge)).replace("-", "m"),
+                    doc.spin_multiplicity,
+                )
+                doc.molecule_id = molecule_id
+                molecules.append(doc)


@espottesmith when testing out the build pipeline on solvated ion clusters I found a bug in this section of code. For monatomic ions with the same charge, this will result in duplicate molecule_id b/c the coordinates, charge, and multiplicity can all be identical.

I worked around the issue by doing this; although there is probably a more robust / general solution:

    for group in self.filter_and_group_tasks(tasks):
        try:
            # TODO - this molecule_id formula will return identical ids for single atoms
            # that have the same charge and spin
            doc = MoleculeDoc.from_tasks(group)
            # use something different for single atoms
            if len(doc.composition) == 1:
                molecule_id = MPculeID(
                    "{}-{}-{}".format(
                        doc.species_hash,
                        str(int(doc.charge)).replace("-", "m"),
                        doc.spin_multiplicity,
                    )
                )
            else:
                molecule_id = MPculeID(
                    "{}-{}-{}".format(
                        doc.coord_hash,
                        str(int(doc.charge)).replace("-", "m"),
                        doc.spin_multiplicity,
                    )
                )
            doc.molecule_id = molecule_id
            molecules.append(doc)

I tried just changing the coord_hash to species_hash for all tasks, but that can result in duplicated IDs for other reasons, so I think it's important to keep the coordinates in there in most cases.

Yeah, I realized this a while back and hadn't had a chance to fix it. The solution that I'm going to go for is actually just making MPculeID a little longer and including the formula. I actually also think that'll help it be more readable. The hash still won't be easily interpreted, but the user will at least have a clue what they're looking at if the formula is there.
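The collision described above, and the effect of adding the formula, can be sketched as follows. This is a hedged illustration rather than the code that was eventually merged: the helper names, the placeholder hash "deadbeef", and the formula strings "Li1" and "Na1" are all assumptions made for the example. Only the "m"-for-minus convention and the hash-formula-charge-spin ordering come from the discussion.

```python
def charge_token(charge: int) -> str:
    """Write the charge as in the builder snippet: '-' becomes 'm'."""
    return str(int(charge)).replace("-", "m")


def id_without_formula(graph_hash: str, charge: int, spin: int) -> str:
    """Original format: hash-charge-spin."""
    return f"{graph_hash}-{charge_token(charge)}-{spin}"


def id_with_formula(graph_hash: str, formula: str, charge: int, spin: int) -> str:
    """Proposed format: hash-formula-charge-spin."""
    return f"{graph_hash}-{formula}-{charge_token(charge)}-{spin}"


# Two monatomic cations at identical coordinates share a coordinate-based hash.
# "deadbeef" is a placeholder, not a real hash value.
shared_hash = "deadbeef"

li_old = id_without_formula(shared_hash, 1, 1)
na_old = id_without_formula(shared_hash, 1, 1)

li_new = id_with_formula(shared_hash, "Li1", 1, 1)
na_new = id_with_formula(shared_hash, "Na1", 1, 1)
```

With the formula included, the two ions receive distinct IDs even though hash, charge, and spin are identical, and the ID also gives a reader a hint about what the molecule is.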

Sounds good. If you have any prototype code for that (even if very rough), feel free to send my way and I can test it on my builds (but if you haven't got there yet, no worries)

I just realized we also need to consider what happens if we have the same single atom / ion in different solvents. My solution breaks in that case - all solvents collapse to the same MP-ID, which results in calcs getting lost in the build pipeline.

That should be fine, I think? Since I've changed how level of theory is handled, and molecules in different solvents should be collapsed into a single document anyways. You're right that this will be treated somewhat differently in the association stage, because in general molecules in different solvents won't have identical structures. But I don't believe it's inconsistent. Will test this to confirm.

Yeah as long as molecules in different solvents wind up in the same document / ID this should be OK. I'm not seeing that happen in my current build pipeline testing (it seems that only Water gets past the association stage), but I haven't had time to troubleshoot much yet

If you can send me a minimal test case (let's say one molecule in 2 or 3 different solvents), that'd actually be a big help.

espottesmith · 2023-02-05T15:42:40Z

Found an edge case where monatomic hydrogen (H1) was causing certain builders to fail. Resolved now - have built collections involving 48,672 unique molecules.

Should be good to merge now.

munrojm · 2023-02-06T20:03:48Z

> Found an edge case where monatomic hydrogen (H1) was causing certain builders to fail. Resolved now - have built collections involving 48,672 unique molecules.
>
> Should be good to merge now.

Great, that sounds good. Working to get the tests fixed, then I will merge.

rkingsbury · 2023-02-06T21:56:44Z

@espottesmith apologies I haven't had a chance to do much testing with this yet. It is still on the docket though, possibly later this week / early next. I don't have any objections to merging, but if there's anything in particular you want me to test out please let me know. I'd say this is a tremendous foundation for molecule docs though, and we can always address small issues in subsequent PRs.

espottesmith added 15 commits May 5, 2022 15:52

Merge remote-tracking branch 'materialsproject/main' into main

7ff11d4

Merge remote-tracking branch 'materialsproject/main' into main

e6f333a

Make bonding not mandatory

0308994

Fix bug for case where no tasks are valid

cfba6b3

Small type change

4ad6b6c

Disallow certain task types in Q-Chem builder by default

84312ca

Merge remote-tracking branch 'materialsproject/main' into molecules_update

1ed42fe

Add graph hashes to molecule document format

80d48e1

Change SMD parameters for G2

dbc049b

change id format

c02e1aa

Fixed bugs with graph hashes; added new ID class (needs to propagate through molecule world)

381c277

Working hard at getting new ID format working

b752036

Tests now pass on all builders; just need to make sure all corresponding tests pass in emmet-core (including remaking JSON files for summary)

313de1b

Updated tests; everything (I think) passes; black

cf1c77b

Now everything actually passes

c9866de

rkingsbury reviewed Jun 22, 2022

espottesmith added 3 commits June 22, 2022 16:17

Remove unused variable

b921303

mypy changes

9115abe

ID didn't match - trying without?

fbdb5cb

espottesmith and others added 5 commits June 23, 2022 10:05

Separate level of theory from solvent information; get rid of solvent names and instead create utility function to define a solvent string

f6fe2da

Create composite lot_solvent concept (capturing what we used to do with LevelOfTheory with pre-defined solvents)

d0110ea

Updates to molecule doc (change to ID not done yet); now need to propagate to property documents

cca022e

Core should be updated post-solvent LOT changes (surprisingly easy).

47f95ef

Reminder: need to regenerate "builder_XXX_set.json" test files once builders are all done

Merge branch 'main' into molecules_update

08fcd13

espottesmith added 2 commits July 14, 2022 10:21

Merge remote-tracking branch 'origin/molecules_update' into molecules_update

a3a9e3a

Need to regenerate test files

f9925a6

rkingsbury reviewed Aug 8, 2022

espottesmith added 18 commits January 9, 2023 14:42

Merge branch 'main' into molecules_update

cc6ed4a

Merge branch 'main' into molecules_update

082439e

Hopefully this fixes everything

Testing import issues

0b58ae5

Modify requirements

0b87ffa

Seems the root of the issue is that eigen is missing?

e3b33e0

One more attempt before I call in the big guns

3ad78f8

Last try before I just gut these tests; not worth it

29a3eb9

jkjk

3daa5c6

Trying to figure out new bug on property builders - task docs apparently with no orig???

ce074b6

Use debugger more effectively

429e64b

Trying to catch an error again

c4f7445

Resolved issue?

0066554

Reverting testing change; think all is well in the world.

7e1fdfb

Floating validation on partial charges/spin docs

f4564dc

Fix for H1 specifically (I think)

34fb516

Remove unnecessary printing

e495a1d

Merge branch 'testing' into molecules_update

9ff0cdd

Merge branch 'main' into molecules_update

8e3f07c

espottesmith and others added 2 commits February 5, 2023 08:04

Fix mypy issue

f18245a

Add pip install to openbabel macos step

9e86e60

munrojm added 5 commits February 6, 2023 12:26

Full custom openbabel mac install

8b96c7e

Retry brew install

7e8b4c3

Add back pip install

290aab1

Switch to macos-11 for testing

bf95cde

Temp remove mac testing

04e27f4

munrojm merged commit 5ff41fa into materialsproject:main Feb 7, 2023

espottesmith deleted the molecules_update branch February 7, 2023 23:41
