
Allow for a catch-all bin for atom types during graph construction #164

Conversation

@matthewcarbone commented Sep 13, 2023

In the construction of graphs from structures and molecules, a list of allowed elements is provided. However, if a structure contains an element not in that list, a ValueError or KeyError is raised.

This commit allows the user to set an argument `allow_other_atoms` during instantiation of `Structure2Graph` and `Molecule2Graph` which, if True, treats any atom not found in the provided list as its own category.

For example, if `element_types=["Ti", "O"]` but a structure contains Ti, O, and Mg, then Ti is mapped to index 0, O to index 1, and Mg to index `len(element_types)`, which in this case is 2.

Similarly, if a structure contains Ti, O, Mg, and Fe, then both Mg and Fe are mapped to index 2.
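The mapping described above can be sketched as follows. This is a minimal illustration only; `element_index` is a hypothetical helper, not the actual matgl API:

```python
def element_index(symbol: str, element_types: list[str], allow_other_atoms: bool = False) -> int:
    """Map an element symbol to its node-feature index.

    Elements in ``element_types`` keep their list position; any other
    element shares the single catch-all index ``len(element_types)``.
    """
    if symbol in element_types:
        return element_types.index(symbol)
    if allow_other_atoms:
        return len(element_types)  # shared "everything else" bin
    raise ValueError(
        f"Element {symbol!r} is not in element_types={element_types} "
        "and allow_other_atoms is False."
    )

element_types = ["Ti", "O"]
print(element_index("Ti", element_types, allow_other_atoms=True))  # 0
print(element_index("Mg", element_types, allow_other_atoms=True))  # 2
print(element_index("Fe", element_types, allow_other_atoms=True))  # 2
```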

Summary

Major changes:

  • Modified matgl.graph.converters.GraphConverter to support this new feature.
  • Changed the Structure2Graph and Molecule2Graph in matgl.ext.pymatgen to support this new feature.
  • More detailed error messages for when `allow_other_atoms` is False and a structure contains an element not in the provided list.

Todos

I'll probably need to write a test for this. It should be completely backward compatible when `allow_other_atoms` is False.

Checklist

  • Google format doc strings added. Check with ruff.
  • Type annotations included. Check with mypy.
  • Tests added for new features/fixes.
  • If applicable, new classes/functions/modules have duecredit @due.dcite decorators to reference relevant papers by DOI (example)

@shyuep (Contributor) commented Sep 13, 2023

Wouldn't it be better for there to be a method that gets a list of elements from a list of structures instead? That is easy to code. Also, the DEFAULT_ELEMENT_TYPES can be used, which pretty much includes all elements in the periodic table. That should cover all instances.

@matthewcarbone (Author)

> Wouldn't it be better for there to be a method that gets a list of elements from a list of structures instead? That is easy to code.

Perhaps, but not at this level of abstraction. For the single graph use case, based on the existing code, I think this is the way to go. I'm basing my changes here off of the current structure of the code.

> Also, the DEFAULT_ELEMENT_TYPES can be used, which pretty much includes all elements in the periodic table. That should cover all instances.

The problem is that that's a huge list. For one-hot encoding, you might end up with a huge matrix of 0s and 1s that you don't need. With these changes, the user can decide on a feature vector of a fixed length. It could be very useful when you have a dataset with many instances of elements X, Y, and Z, but very few instances of 50 different elements.

I say this because I actually need this feature in my own work and these are the problems I'm running into! 😁

@matthewcarbone (Author)

@shyuep I suppose if you're ok with this change I can also propagate it through to some of the models (which I also want to use with this code).

Ultimately though I'd like to know if you're ok with the changes. I'm happy to explore defining the elements list from the dataset, but I think that's a higher-level feature which can be distinct from this minor API change.

@matthewcarbone marked this pull request as draft September 13, 2023 14:55
@shyuep (Contributor) commented Sep 13, 2023

I think the get_element_list method in matgl.ext.pymatgen already accomplishes what you are trying to do. A similar method can be coded for ASE quite easily. My main issue with this is that the list of elements is constantly being updated with new structures and there are a lot of checks embedded into the code at runtime.

@matthewcarbone (Author) commented Sep 13, 2023

@shyuep It doesn't. It does something quite different.

Consider the following situation. I have 100 structures, each with 10 atoms in the unit cell. Say 80% of the atoms are Ti and 15% are O. The remaining 5% are spread across 50 other types of atoms.

If I used the `get_element_list` featurizer, I would end up with a feature vector of length 52. Two of these columns would be densely populated, but the rest would be very sparse.

In my implementation, I would set the elements list to simply `["Ti", "O"]` with `allow_other_atoms=True`. Then my feature vector would be of length 3 only: the first two entries correspond to whether an atom is Ti or O, and the last corresponds to "everything else".

The utility here is significantly reducing the feature vector sizes for very diverse databases.
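To make the size argument concrete, here is a toy one-hot sketch (a hypothetical helper, not matgl code) showing the length-3 vectors for `element_types=["Ti", "O"]` with a trailing catch-all bin:

```python
def one_hot(symbol, element_types):
    """One-hot encode with a trailing catch-all slot for unlisted elements."""
    size = len(element_types) + 1  # +1 for the "everything else" bin
    idx = element_types.index(symbol) if symbol in element_types else size - 1
    vec = [0] * size
    vec[idx] = 1
    return vec

print(one_hot("Ti", ["Ti", "O"]))  # [1, 0, 0]
print(one_hot("Mg", ["Ti", "O"]))  # [0, 0, 1]
```

Without the catch-all, the same dataset would need length-52 vectors, one column per element that appears anywhere in the database.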

I think this change would be quite useful for users and it doesn't impact the default use cases... am I missing something here?

@shyuep (Contributor) commented Sep 13, 2023

Ah I see. Ok. I misunderstood that. I have no problems with the changes in general. Pls add tests for this feature. One thing though, I am not sure that having a catch_all would be good ML practice given that essentially, you are saying the identity of any atoms other than Ti or O has no/small effect on the properties. But this is a question I will leave it to you as the end user to determine.

@matthewcarbone (Author)

> Ah I see. Ok. I misunderstood that. I have no problems with the changes in general.

No worries at all!

> Pls add tests for this feature.

Yup it's on my todo list.

> One thing though, I am not sure that having a catch_all would be good ML practice given that essentially, you are saying the identity of any atoms other than Ti or O has no/small effect on the properties. But this is a question I will leave it to you as the end user to determine.

Absolutely a fair criticism, and there's no doubt that physically this is bad. But from the data perspective, there might not be enough variance for that element across the dataset to provide meaningful information anyway. So the tradeoff is binning features that don't occur very often, and in return you get a feature vector that might be 10x smaller, which can help training in other ways.

Anyway I'll ping you when the tests are done 👍

@codecov bot commented Sep 13, 2023

Codecov Report

Patch coverage: 80.00% and project coverage change: -0.25% ⚠️

Comparison is base (c55cf17) 98.07% compared to head (f55e2c6) 97.82%.
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #164      +/-   ##
==========================================
- Coverage   98.07%   97.82%   -0.25%     
==========================================
  Files          28       28              
  Lines        1820     1842      +22     
==========================================
+ Hits         1785     1802      +17     
- Misses         35       40       +5     
Files Changed Coverage Δ
matgl/graph/converters.py 87.50% <75.00%> (-12.50%) ⬇️
matgl/ext/pymatgen.py 100.00% <100.00%> (ø)
matgl/models/_m3gnet.py 100.00% <100.00%> (ø)


@matthewcarbone marked this pull request as ready for review September 13, 2023 17:42
@matthewcarbone (Author) commented Sep 13, 2023

@shyuep I'm happy to make the changes in the models (meaning allowing the user to pass these arguments forward where necessary) as well or I can save that for another PR.

@matthewcarbone (Author)

@shyuep Actually I think it might make more sense to make the changes to the models as well before merging this (allow for pass-through of this feature to the models). What do you think?

@matthewcarbone marked this pull request as draft September 14, 2023 12:32
@matthewcarbone (Author)

I've made the changes to _m3gnet. Happy to revert if you'd like, but this will allow users to pass through the changes I've made to M3GNet itself, so I think it'll be useful.

Going to rebase onto the new changes in main as well and will force push so things are synced.

@kenko911 (Contributor)

Hi @matthewcarbone, sorry for the late reply. I think raising an error for predicting structures consisting of elements not in the training set is necessary since the property or PES models are not very reliable during extrapolation for unseen elements.
Regarding the issue of large 0s and 1s (one-hot encoding), are you referring to the graph construction part or isolated atomic energies for PES?

@matthewcarbone (Author)

> Hi @matthewcarbone, sorry for the late reply. I think raising an error for predicting structures consisting of elements not in the training set is necessary since the property or PES models are not very reliable during extrapolation for unseen elements.

Certainly true in cases where those elements are key features. And the default behavior of the approach should definitely not be changed. No model is going to be reliable if it's extrapolating, so with this I completely agree. However...

> Regarding the issue of large 0s and 1s (one-hot encoding), are you referring to the graph construction part or isolated atomic energies for PES?

... the current way that the featurization happens is no better. If Ti occurs 10k times and U occurs 3 times in a dataset, the information captured by U is going to be lost in the gradient updates anyway. The model is going to learn that it can't really determine the impact of those elements amongst all of the other, more commonly occurring elements. Allowing for a single "misc" atom type might help with training (i.e., it might let the network learn this faster). I concede that with the embedding method you're using in the code, it won't cause any extra memory/storage problems, but it's still a way of embedding prior knowledge about the distribution of atoms in the training set.

I'm certainly not suggesting this as the default, just as an option for the user to set.

Also keep in mind people might want to use these models for more than just PES 😊

@kenko911 (Contributor) commented Sep 17, 2023

Thanks for your quick reply and for raising interesting questions! I agree that the gradient update for U may not be sensitive in such extreme cases as you mentioned if we do the gradient update per atom. But I would say it depends on how you update these weights: if you update the weights per element by doing proper accumulation of gradients and normalization, this should address the problem.

As for allowing a single "misc" atom type, I am not sure that the model will learn faster. In contrast, I think it will introduce contradicting information for the training. For example, if we consider "H" and "O" as a single misc atom type and we put an H2 molecule on the Cu surface and an O2 molecule on the Cu surface with exactly the same configuration, the property and PES models will predict the same result for these two cases. For current featurization in MatGL, at least H and O are embedded into atomic feature vectors with different values and they are able to give different predictions. Please let me know if I understand anything wrong.

Overall, the current featurization scheme has many possible improvements and this is definitely an interesting direction for further improving the transferability and accuracy of property and PES models😊.

@JiQi535 (Contributor) commented Sep 17, 2023

@matthewcarbone Hi Matthew, I would like to ask some questions to better understand what improvements this PR is leading to.

Taking the Ti, O, X system as an example, the dimension of the features in the atom embedding layer is shortened to 3, but the model can no longer distinguish different X. Can I ask what improvements can be achieved with this shortening of the atom-embedding features? It would be nice if you could give some related quantitative results. Also, wouldn't this make the X elements indistinguishable? In which cases can this be beneficial?

@matthewcarbone (Author) commented Sep 18, 2023

> @matthewcarbone Hi Matthew, I would like to ask some questions to better understand what improvements this PR is leading to.
> Take the Ti, O, X system as an example, the dimension of feature in the atom embedding layer is shortened to 3, while it can no longer distinguish different X. Can I ask what improvements can be achieved with this shortening of atom-embedding features? It would be nice if you can give some related quantitative results. Also, wouldn't this make the X elements indistinguishable? In which case can this be beneficial?

It's hard to explain the utility of this idea without considering an entire database. Let's say I want to train a new potential or some other model on all titanium oxides I can find. I end up with a database that, by atoms, has something like 10k Ti, 8k O and maybe 500 atoms which are one of 50 other atom types.

On average, there are 500 / 50 = 10 atoms for each of these 50 atom types. There is no way that any model is going to learn how this particular atom type affects the results unless it makes a really large difference. Hence, my suggestion is to give the user the option to bin these 50 atoms into a single "misc" atom type X. I believe this could help with two things:

  1. It might help with model training. The model will not have to learn that it can't really determine the impact of each of these 50 atoms. Thus it is a way to encode prior knowledge that we simply don't have enough information.
  2. It would allow the network to work with any input material, since even an atom that wasn't used during training can be labeled as X.
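The counting argument above can be sketched as follows. The helper name, the toy data, and the `min_count` threshold are all assumptions for illustration, not matgl code:

```python
from collections import Counter

def choose_kept_elements(all_symbols, min_count):
    """Keep elements with enough training examples; the rest fall into the misc bin."""
    counts = Counter(all_symbols)
    return sorted(e for e, c in counts.items() if c >= min_count)

# Toy dataset: 10k Ti, 8k O, and a long tail of rare elements (~10 atoms each).
symbols = ["Ti"] * 10_000 + ["O"] * 8_000 + ["U", "Mg", "Fe"] * 10
print(choose_kept_elements(symbols, min_count=100))  # ['O', 'Ti']
```

Everything not returned by such a helper would then be mapped to the shared "misc" index during graph construction.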

Regarding point 2 a bit more, you might say that this is not a good idea because we should only use the models with atoms that were used during training. But this is not the complete story: my counter argument to this would be that if these models are trained on MP data, there is likely a large discrepancy in the different proportions of atoms in the database anyway. So while your trained models might work with uranium, do we really know if there are enough uranium atoms in the training database to make accurate predictions? In the silly limiting case where there is a single U atom in the entire training database, you're still allowing your users to feed in U atoms when really there's epsilon small information about how U impacts the results. If it's working, you're getting lucky.

But at the end of the day, as I said, I'm not suggesting a change to any defaults here... this will just allow users another option to experiment.

@JiQi535 (Contributor) commented Sep 18, 2023

Thanks for the further explanation and for summarizing these two points. I have a quick follow-up question.

Let's continue to use the Ti, O, X system as an example. If you are sure that X shouldn't make a difference and the M3GNet model doesn't need to distinguish X, then you may pre-process the dataset to replace all X with a single species, e.g., Al. This pre-processing step should already accomplish what you would achieve here without changing the matgl code. Do you agree? I would like to know if I understand it correctly. @matthewcarbone
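The pre-processing idea can be sketched without touching matgl at all. This is a pure-Python illustration over symbol lists (the helper and placeholder are hypothetical; in practice one might do the substitution on the structures themselves):

```python
def collapse_rare(symbols, keep, placeholder="X"):
    """Map every element symbol outside ``keep`` to a single placeholder species."""
    return [s if s in keep else placeholder for s in symbols]

print(collapse_rare(["Ti", "O", "Mg", "Fe"], keep={"Ti", "O"}))  # ['Ti', 'O', 'X', 'X']
```

The element list for graph construction then only ever sees `{"Ti", "O", "X"}`, which is the same effect as the catch-all bin, achieved by data preparation instead of an API change.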

Besides, I'm quite interested in the training and prediction of rare elements by M3GNet. I don't think we cannot learn them well just because there is only a small amount of related training data. If that's the case, we will try to improve the model or training data so that it can learn those rare elements well.

@matthewcarbone (Author)

> Thanks for the further explanation and summarizing these two points. And I have a quick follow up question.

👍

> Let's continue to use the Ti, O, X system as an example. If you are sure that X shouldn't make a difference and the M3GNet model doesn't need to distinguish X, then you may pre-process the dataset to replace all X with a single species, e.g., Al. This pre-processing step should already accomplish what you would achieve here without changing the matgl code. Do you agree? I would like to know if I understand it correctly. @matthewcarbone

Yes I agree with this.

> Besides, I'm quite interested in the rare element's training and predictions by M3GNet.

I do think it's interesting, but I'm just not sure everyone wants to train their models like this.

> I don't think we cannot learn them well just because there is only a small amount of related training data.

I'm pretty sure any single example will get "lost in the gradients" of the other updates during training. It's part of the reason deep learning architectures are robust to outliers.

Sticking with the silly uranium example, if you have 0 U in your training database, your users' "uranium oxide" materials will not even be featurizable. If you have 1 U in your training database, your users' data will be featurizable, but it probably shouldn't be.

> If that's the case, we will try to improve the model or training data so that it can learn those rare elements well.

That's fair, but we often have situations where we just don't have the data available at training time.

@kenko911 (Contributor)

Hi @matthewcarbone, it seems the simplest solution would be Ji's suggestion, which can address your problem. Given that different "X" elements would predict the same results, which seems unphysical to me, I would be very careful about merging experimental features into the release version of MatGL before having preliminary results that show an actual improvement. This would avoid many unnecessary questions from other users.

@shyuep (Contributor) commented Sep 18, 2023

Let me state my final decision on this:

  1. The elemental vector should support DummySpecies, e.g., X. This will allow us to model pseudo-atoms such as defect sites too. I believe this will be important in the long run.
  2. For the use case that @matthewcarbone is thinking of, he will map the structures such that the irrelevant sites are mapped to X and then can use the code as is.
  3. @JiQi535 and @kenko911 We do not need to worry about the physicality of other models. The end user will determine if the model makes sense. We worry about the correctness of the code and functionality.

@matthewcarbone (Author)

Sounds good, all.

@kenko911 I definitely respect not wanting to push super experimental stuff, but I don't think this is too crazy. It won't change any defaults, and it will allow users to experiment a bit.

@shyuep Sounds good. How would you like to proceed? This PR doesn't exactly do what you're suggesting (allow for "X" in the species list).

@kenko911 closed this Mar 3, 2024