-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Molecules update #440
Molecules update #440
Conversation
…through molecule world)
…ing tests pass in emmet-core (including remaking JSON files for summary)
"ACETONITRILE", | ||
"BENZENE", | ||
"METHANOL", | ||
] | ||
|
||
PCM_DIELECTRICS = { | ||
"WATER": 78.39, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @espottesmith ! I don't have any comments on the molecule ID questions; I defer to you and others on that one. Your approach sounds robust to me.
Since you're making some updates to calc_types
I wonder if I can request a small change to roll into this PR?
At this line you use 2 decimals of precision on the dielectric constant to make sure a PCM(water) task is valid. That precision seems a little much to me (and it caused the build pipeline to fail when I had run calculations using a dielectric of 78.4). Do you think it would be reasonable to relax this to 78.4 (or maybe even 78), and/or to use a rounding approach to validate the task type (e.g., if the dielectric is within 0.1 units of 78.4, call it PCM(water)? A similar consideration might apply to other calc_types that use PCM.
Happy to hear alternative suggestions; I just don't feel the build pipeline should fail to validate a task just because the dielectric is 0.01 units different than the setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll play around with this.
Honestly, I'm on the fence about the whole naming solvents thing. It's nice from a "human readability" perspective - what does "78.4" mean? It means "water"! It's also kind of important to have standard names for level of theory comparisons. You don't want to mix calculations with different solvents. Even if 78 and 78.39 are similar, it's strictly speaking not sound to directly compare the two. Just like you shouldn't directly compare energies obtained with wB97X-D and wB97X-D3.
On the other hand, as you say, there's differing degrees of specificity that you could use. And it's also not easy to assign a name to a dielectric constant - dimethoxyethane and tetrahydrofuran are both around 7.
Codecov ReportBase: 92.74% // Head: 87.77% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #440 +/- ##
==========================================
- Coverage 92.74% 87.77% -4.98%
==========================================
Files 134 108 -26
Lines 24791 6157 -18634
==========================================
- Hits 22992 5404 -17588
+ Misses 1799 753 -1046
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
… names and instead create utility function to define a solvent string
…th LevelOfTheory with pre-defined solvents)
…agate to property documents
Reminder: need to regenerate "builder_XXX_set.json" test files once builders are all done
@espottesmith are you good if I merge when the tests pass? Is there any remaining issue you would like to discuss? |
I'm making some changes in response to @rkingsbury's comments. I'll let you know when I'm ready to merge. |
doc = MoleculeDoc.from_tasks(group) | ||
molecule_id = "{}-{}-{}".format( | ||
doc.coord_hash, | ||
str(int(doc.charge)).replace("-", "m"), | ||
doc.spin_multiplicity, | ||
) | ||
doc.molecule_id = molecule_id | ||
molecules.append(doc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@espottesmith when testing out the build pipeline on solvated ion clusters I found a bug in this section of code. For monatomic ions with the same charge, this will result in duplicate molecule_id
b/c the coordinates, charge, and multiplicity can all be identical.
I worked around the issue by doing this; although there is probably a more robust / general solution:
for group in self.filter_and_group_tasks(tasks):
try:
# TODO - this molecule_id formula will return identical ids for single atoms
# that have the same charge and spin
doc = MoleculeDoc.from_tasks(group)
# use something different for single atoms
if len(doc.composition)==1:
molecule_id = MPculeID("{}-{}-{}".format(
doc.species_hash,
str(int(doc.charge)).replace("-", "m"),
doc.spin_multiplicity,
))
else:
molecule_id = MPculeID("{}-{}-{}".format(
doc.coord_hash,
str(int(doc.charge)).replace("-", "m"),
doc.spin_multiplicity,
))
doc.molecule_id = molecule_id
molecules.append(doc)
I tried just changing the coord_hash
to species_hash
for all tasks, but that can result in duplicated IDs for other reasons, so I think it's important to keep the coordinates in there in most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I realized this a while back and hadn't had a chance to fix it. The solution that I'm going to go for is actually just making MPculeID a little longer and including the formula. I actually also think that'll help it be more readable. The hash still won't be easily interpreted, but the user will at least have a clue what they're looking at if the formula is there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. If you have any prototype code for that (even if very rough), feel free to send my way and I can test it on my builds (but if you haven't got there yet, no worries)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized we also need to consider what happens if we have the same single atom / ion in different solvents. My solutions breaks in that case - all solvents collapse to the same MP-ID which results in calcs getting lost in the build pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should be fine, I think? Since I've changed how level of theory is handled, and molecules in different solvents should be collapsed into a single document anyways. You're right that this will be treated somewhat differently in the association stage, because in general molecules in different solvents won't have identical structures. But I don't believe it's inconsistent. Will test this to confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah as long as molecules in different solvents wind up in the same document / ID this should be OK. I'm not seeing that happen in my current build pipeline testing (it seems that only Water gets past the association stage), but I haven't had time to troubleshoot much yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can send me a minimal test case (let's say one molecule in 2 or 3 different solvents), that'd actually be a big help.
Hopefully this fixes everything
…tly with no orig???
Found an edge case where monatomic hydrogen (H1) was causing certain builders to fail. Resolved now - have built collections involving 48,672 unique molecules. Should be good to merge now. |
Great, that sounds good. Working to get the tests fixed, then I will merge. |
@espottesmith apologies I haven't had a chance to do much testing with this yet. It is still on the docket though, possibly later this week / early next. I don't have any objections to merging, but if there's anything in particular you want me to test out please let me know. I'd say this is a tremendous foundation for molecule docs though, and we can always address small issues in subsequent PRs. |
(Mostly) small changes to the recently included molecule builder pipeline. The biggest change is in creation of an MPculeID, a new ID format. Previous discussions led me to believe that an ID based on task IDs is fragile, potentially breaking if old tasks become deprecated. In order to avoid this problem while also guaranteeing the uniqueness of IDs, the MPculeID follows the format:
(prefix-)hash-charge-spin
where the prefix is optional and the hash is a Weisfeiler Lehman on a molecular graph representation (generated using networkx). The node attributes used to generate the WL hash are either the XYZ coordinates of the molecule (used for molecule association) or the atom species (used for constructing final molecule docs). Because charge and spin are also included in the ID, the risk of IDs no longer being unique due to a hash collision becomes negligible. To further protect against this, we could add the alphabetical formula, so that the ID would become (prefix-)hash-formula-charge-spin.
I note that these IDs are not easily understood or remembered by humans but can be easily generated (there is a utility function in emmet.core.utils; this utility might eventually be better placed in pymatgen).
Contributor Checklist