-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow for a catch-all bin for atom types during graph construction #164
Allow for a catch-all bin for atom types during graph construction #164
Conversation
Wouldn't it be better for there to be a method that gets a list of elements from a list of structures instead? That is easy to code. Also, the DEFAULT_ELEMENT_TYPES can be used, which pretty much includes all elements in the periodic table. That should cover all instances. |
Perhaps, but not at this level of abstraction. For the single graph use case, based on the existing code, I think this is the way to go. I'm basing my changes here off of the current structure of the code.
The problem is that is a huge list. So for one-hot encoding, you might end up with a huge matrix of 0's and 1's that you don't need. With these changes, the user can decide on a feature vector length of a fixed size. It could be very useful when you have a dataset with many instances of elements X, Y and Z, but very few instances of 50 different elements. I say this because I actually need this feature in my own work and these are the problems I'm running into! 😁 |
@shyuep I suppose if you're ok with this change I can also propagate it through to some of the models (which I also want to use with this code). Ultimately though I'd like to know if you're ok with the changes. I'm happy to explore defining the elements list from the dataset, but I think that's a higher-level feature which can be distinct from this minor API change. |
I think the get_element_list method in matgl.ext.pymatgen already accomplishes what you are trying to do. A similar method can be coded for ASE quite easily. My main issue with this is that the list of elements is constantly being updated with new structures and there are a lot of checks embedded into the code at runtime. |
@shyuep It doesn't. It does something quite different. Consider the following situation. I have 100 structures each with 10 atoms each in the unit cell. Say 80% of the atoms are Ti, and 15% are O. The remaining 5% are spread between 50 other types of atoms. If I used the In my implemented method, I would set the elements list to simply The utility here is significantly reducing the feature vector sizes for very diverse databases. I think this change would be quite useful for users and it doesn't impact the default use cases... am I missing something here? |
Ah I see. Ok. I misunderstood that. I have no problems with the changes in general. Pls add tests for this feature. One thing though, I am not sure that having a catch_all would be good ML practice given that essentially, you are saying the identity of any atoms other than Ti or O has no/small effect on the properties. But this is a question I will leave it to you as the end user to determine. |
No worries at all!
Yup it's on my todo list.
Absolutely a fair criticism and there's no doubt that physically this is bad. But from the data perspective, there might not be enough variance for that element across the dataset to provide meaningful information anyway. So the tradeoff is bagging a feature that doesn't occur very often and in return you get a feature vector that might be 10x smaller, which can help training in other ways. Anyway I'll ping you when the tests are done 👍 |
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #164 +/- ##
==========================================
- Coverage 98.07% 97.82% -0.25%
==========================================
Files 28 28
Lines 1820 1842 +22
==========================================
+ Hits 1785 1802 +17
- Misses 35 40 +5
☔ View full report in Codecov by Sentry. |
@shyuep I'm happy to make the changes in the models (meaning allowing the user to pass these arguments forward where necessary) as well or I can save that for another PR. |
@shyuep Actually I think it might make more sense to make the changes to the models as well before merging this (allow for pass-through of this feature to the models). What do you think? |
I've made the changes to Going to rebase onto the new changes in |
In the construction of graphs from structures and molecules, a list of allowed elements is provided. However, if an atom is provided in a structure not found in the provided lists, a Value or KeyError is raised. This commit allows the user to set an argument `allow_other_atoms` during instantiation of `Structure2Graph` and `Molecule2Graph`, which if True, treats any atom not found in the provided list as its own category. For example, if `element_types=["Ti", "O"]` but a structure contains Ti, O and Mg, then Ti will be treated by index 0, O will be treated by index 1 and Mg will be treated by index `len(element_types)`, which in this case is 2.
ef3b0bd
to
f55e2c6
Compare
Hi @matthewcarbone, sorry for the late reply. I think raising an error for predicting structures consisting of elements not in the training set is necessary since the property or PES models are not very reliable during extrapolation for unseen elements. |
Certainly true in cases where those elements are key features. And the default behavior of the approach should definitely not be changed. No model is going to be reliable if it's extrapolating, so with this I completely agree. However...
... the current way that the featurization happens is no better. If Ti occurs 10k times and U occurs 3 times in a dataset, the information captured by U is going to just be lost in the gradient updates anyway. The model is going to learn that it can't really determine the impact of those elements amongst all of the other more commonly occurring elements. Allowing for a single "misc" atom type might help with training (i.e. might just let the network learn this faster). With the embedding method that you're using in the code, it won't cause any extra memory/storage problems though I concede that, but it's still a form of embedding prior knowledge about the distributions of atoms in the training set. I'm certainly not suggesting this as the default, just as an option for the user to set. Also keep in mind people might want to use these models for more than just PES 😊 |
Thanks for your quick reply and for raising interesting questions! I agree that the gradient update for U may be not sensitive for such extreme cases as you mentioned if we do gradient update per atom. But I would say it depends on how you update these weights. if you update the weights per element by doing proper accumulation of gradients and normalizations, this should address the problem. As for allowing a single "misc" atom type, I am not sure that the model will learn faster. In contrast, I think it will introduce contradicting information for the training. For example, if we consider "H" and "O" as a single misc atom type and we put an H2 molecule on the Cu surface and an O2 molecule on the Cu surface with exactly the same configuration, the property and PES models will predict the same result for these two cases. For current featurization in MatGL, at least H and O are embedded into atomic feature vectors with different values and they are able to give different predictions. Please let me know if I understand anything wrong. Overall, the current featurization scheme has many possible improvements and this is definitely an interesting direction for further improving the transferability and accuracy of property and PES models😊. |
@matthewcarbone Hi Matthew, I would like to ask some questions to better understand what improvements this PR is leading to. Take the Ti, O, X system as an example, the dimension of feature in the atom embedding layer is shortened to 3, while it can no longer distinguish different X. Can I ask what improvements can be achieved with this shortening of atom-embedding features? It would be nice if you can give some related quantitative results. Also, wouldn't this make the X elements indistinguishable? In which case can this be beneficial? |
It's hard to explain the utility of this idea without considering an entire database. Let's say I want to train a new potential or some other model on all titanium oxides I can find. I end up with a database that, by atoms, has something like 10k Ti, 8k O and maybe 500 atoms which are one of 50 other atom types. On average, there are 500 / 50 = 10 atoms for each of these 50 atom types. There is no way that any model is going to learn how this particular atom type affects the results unless it makes a really large difference. Hence, my suggestion is to give the user the option to bin these 50 atoms into a single "misc" atom type
Regarding point 2 a bit more, you might say that this is not a good idea because we should only use the models with atoms that were used during training. But this is not the complete story: my counter argument to this would be that if these models are trained on MP data, there is likely a large discrepancy in the different proportions of atoms in the database anyway. So while your trained models might work with uranium, do we really know if there are enough uranium atoms in the training database to make accurate predictions? In the silly limiting case where there is a single U atom in the entire training database, you're still allowing your users to feed in U atoms when really there's epsilon small information about how U impacts the results. If it's working, you're getting lucky. But at the end of the day, as I said, I'm not suggesting a change to any defaults here... this will just allow users another option to experiment. |
Thanks for the further explanation and summarizing these two points. And I have a quick follow up question. Let's continue to use Ti, O, X system as an example. If you are sure that X shouldn't make a difference and the M3GNet model doesn't need to distinguish X, then you may pre-process the dataset to replace all X to a single specie, e.g., Al. This pre-processing step should already accomplish what you would achieve here without changing the matgl codes. Do you agree with it? I would like to know if I understand it correctly. @matthewcarbone Besides, I'm quite interested in the rare element's training and predictions by M3GNet. I don't think we cannot learn them well because there is just a small amount of related training data. If that's the case, we will try to improve the model or training data so that it can learn those rare elements well. |
👍
Yes I agree with this.
I do think it's interesting, but I'm just not sure everyone wants to train their models like this.
I'm pretty sure any single example will get "lost in the gradients" of the other updates during training. It's part of the reason deep learning architectures are robust to outliers. Sticking with the silly uranium example, if you have 0 U in your training database, your user's materials in their "uranium oxide" database will not even be "featurizable". If you have 1 U in your training database, your users's data will be featurizable, but it probably shouldn't be.
That's fair, but we often have situations where we just don't have the data available at training time. |
Hi @matthewcarbone, it seems the simplest solution would be Ji's suggestion that can address your problem. Given that different "X" elements would predict the same results and it seems not physical to me, I would be very careful about merging experimental features in the release version of MatGL before having preliminary results to show the actual improvement. This can avoid many unnecessary questions raised by other users. |
Let me state my final decision on this:
|
Sounds good, all. @kenko911 I definitely respect not want to push super experimental stuff, but I don't think this is too crazy. It won't change any defaults and it will allow users to experiment a bit. @shyuep Sounds good. How would you like to proceed? This PR doesn't exactly do what you're suggesting (allow for "X" in the species list). |
In the construction of graphs from structures and molecules, a list of allowed elements is provided. However, if an atom is provided in a structure not found in the provided lists, a Value or KeyError is raised.
This commit allows the user to set an argument
allow_other_atoms
during instantiation ofStructure2Graph
andMolecule2Graph
, which if True, treats any atom not found in the provided list as its own category.For example, if
element_types=["Ti", "O"]
but a structure contains Ti, O and Mg, then Ti will be treated by index 0, O will be treated by index 1 and Mg will be treated by indexlen(element_types)
, which in this case is 2.Similarly, if a structure contains Ti, O, Mg and Fe, then both Mg and Fe will have index 2.
Summary
Major changes:
matgl.graph.converters.GraphConverter
to support this new feature.Structure2Graph
andMolecule2Graph
inmatgl.ext.pymatgen
to support this new feature.allow_other_atoms
is False and an atom is in a structure which is not in the list of elements provided.Todos
I'll probably need to write a test for this. It should be completely backwards compatible for when
allow_other_atoms
is False.Checklist
ruff
.mypy
.duecredit
@due.dcite
decorators to reference relevant papers by DOI (example)