Optical and transport data as elemental pseudo-inverse contributions #892

Merged
merged 52 commits into from
Apr 10, 2024

gbrunin
Contributor

@gbrunin gbrunin commented Nov 18, 2022

Summary

This is work done with @davidwaroquiers, @gpetretto and @gmrigna.

The idea is to use the data from refractiveindex.info and the transport properties from the Materials Project to featurize new systems based on their composition.

As an example, let's take the effective mass of electrons. From the MP, we have > 45 000 systems with corresponding effective masses. We can write the equation
Composition matrix × Pseudo-inverse contributions ≃ Effective masses
where each of these matrices has > 45 000 rows. The composition matrix has one column per chemical element present in the dataset, and the other two have a single column. The pseudo-inverse contributions can be computed for a given dataset. They are the least-squares solution relating the compositions to the effective masses, and can be seen as the average contribution of each element to the effective mass when it is present in a system (a contribution can be negative if the presence of an element generally decreases the effective mass).
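
For illustration, here is a minimal NumPy sketch of that least-squares step. It is not the code in this PR: the elements "A", "B", "C", the toy composition matrix, the target values and the helper name `fit_contributions` are all made up for the example.

```python
import numpy as np


def fit_contributions(composition: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Least-squares elemental contributions via the Moore-Penrose pseudo-inverse."""
    return np.linalg.pinv(composition) @ target


# Hypothetical toy dataset: one row per system, one column per element (A, B, C).
# Each row holds the atomic fractions of that system.
composition = np.array(
    [
        [0.50, 0.50, 0.00],
        [0.00, 0.50, 0.50],
        [1.00, 0.00, 0.00],
        [0.25, 0.50, 0.25],
    ]
)

# Made-up "true" per-element contributions, used only to build a consistent target.
true_contributions = np.array([0.10, 0.05, -0.02])
target = composition @ true_contributions

contributions = fit_contributions(composition, target)

# The same solution obtained with an explicit least-squares solver.
lstsq_contributions = np.linalg.lstsq(composition, target, rcond=None)[0]

# Featurizing a new composition: weight the contributions by the atomic fractions.
new_composition = np.array([0.2, 0.3, 0.5])
feature = new_composition @ contributions
```

On real data the fit is not exact, so the recovered contributions are only an average trend per element, and the negative entry shows how an element can lower the predicted property.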

From our tests on industrial cases, including these pseudo-inverse contributions as composition features improves the ML models (it will depend on what is predicted though).

In this PR, we have done this for optical data (refractive index, extinction coefficient, reflectivity), as taken from refractiveindex.info, and for transport properties (all those present in the MP). For optical data, the properties are spectra, and by default 10 wavelengths are selected in the visible range. This range and frequency selection can be changed by the user if, say, the IR spectrum is more important for their application.
The code can be used to generate new pseudo-inverse contributions from new data and add these as features as well.
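
As a rough sketch of what sampling a spectrum at a fixed number of wavelengths can look like (again not the code in this PR): the helper `sample_spectrum`, the 380-780 nm bounds taken here as "visible", the linear interpolation and the made-up spectrum are all assumptions for the example.

```python
import numpy as np


def sample_spectrum(wavelengths_nm, values, min_nm=380.0, max_nm=780.0, n_points=10):
    """Linearly interpolate a spectrum onto n_points evenly spaced wavelengths."""
    grid = np.linspace(min_nm, max_nm, n_points)
    # np.interp expects the measured wavelengths in increasing order.
    return grid, np.interp(grid, wavelengths_nm, values)


# Made-up spectrum, linear in wavelength so the interpolation is easy to check.
measured_wl = np.array([300.0, 500.0, 700.0, 900.0, 2000.0])
measured_n = 1.5 + 1e-4 * measured_wl

# Default: 10 points in the (assumed) visible range.
visible_grid, visible_n = sample_spectrum(measured_wl, measured_n)

# A user more interested in the IR could pick another range and sampling.
ir_grid, ir_n = sample_spectrum(measured_wl, measured_n, min_nm=1000.0, max_nm=2000.0, n_points=5)
```

Each sampled wavelength then gives one property column to fit, so the number of features scales with the number of selected wavelengths.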

TODO

Since the user can change the range and sampling of the optical spectra, the whole database from refractiveindex.info should be stored. We have added it in tar.xz format (< 2 MB). The code starts by untarring the file into a ~/.matminer directory, which the user can change manually if this is not desirable. This avoids adding too many untarred files to the source code, which would more than double the current size of the repo.
This is of course open for discussion, depending on what you would prefer.
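
To make the idea concrete, here is a small sketch of an extract-once step using the standard library. It is not the implementation in this PR: the function name `extract_database`, the `.extracted` marker file and the demo archive are hypothetical, and the demo writes to a temporary directory rather than to ~/.matminer.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_database(archive: Path, target_dir: Path) -> Path:
    """Extract a .tar.xz archive into target_dir unless this was already done."""
    marker = target_dir / ".extracted"  # hypothetical marker to skip repeated work
    if marker.exists():
        return target_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, mode="r:xz") as tar:
        # Use the safer "data" extraction filter when this Python version has it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(target_dir, **kwargs)
    marker.touch()
    return target_dir


# Demo on a throwaway archive; a real call might instead pass
# Path.home() / ".matminer" (or a user-chosen directory) as target_dir.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    source = tmp_path / "spectrum.txt"
    source.write_text("wavelength n k\n")

    archive = tmp_path / "database.tar.xz"
    with tarfile.open(archive, mode="w:xz") as tar:
        tar.add(source, arcname="spectrum.txt")

    cache = extract_database(archive, tmp_path / "cache")
    extracted_text = (cache / "spectrum.txt").read_text()
    # The second call finds the marker and returns without extracting again.
    second_call = extract_database(archive, tmp_path / "cache")
```

Keeping the target directory as a parameter is what lets a user redirect the extraction somewhere other than their home directory.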

We are open to having a chat about all this if you think it is necessary. Maybe I did not explain everything correctly and things have to be clarified.

gbrunin and others added 26 commits June 28, 2022 14:52
…d the optical database, better use of files for restart
Added pseudo-inverse contributions of elements to properties as Element features. The properties already included are the optical data from refractiveindex.info and the transport properties from the Materials Project.
@davidwaroquiers

Good work @gbrunin!

Following up on this PR, @computron is there anyone we should contact to have it merged or discussed?

Thanks,

David

@davidwaroquiers

Hello @janosh,

Would you need any additional information to follow up on this PR?

Thanks,

David

@janosh
Member

janosh commented Feb 14, 2023

@davidwaroquiers Sorry to say I'm prob not the right person to merge this. Maybe ping @computron and @ardunn again for green-lighting a big PR like this one.

@davidwaroquiers

> @davidwaroquiers Sorry to say I'm prob not the right person to merge this. Maybe ping @computron and @ardunn again for green-lighting a big PR like this one.

Hello @janosh,

Ok, thanks for the update!

@computron and @ardunn do you need any additional input about this topic?

Best,

@ml-evs ml-evs self-requested a review March 29, 2024 15:04
Collaborator

@ml-evs ml-evs left a comment


Some minor comments, hope we can get this in very soon @gbrunin!

I'll keep triggering the workflows if you make any changes but will also try to make the effort to test this myself very soon.

Collaborator


This is the only "big" file that I can see, and it is approximately twice the size of the next biggest descriptor data (Jarvis). If I compress it with gzip or xz I get it down to ~2.6 MB which might be preferable but will need the decompression logic added too.

I'm not sure it's worth using LFS for this (as GH enforces bandwidth limits much more heavily on LFS than on normal transactions, and very annoyingly counts GH Actions traffic towards this bandwidth in a very hard-to-cache way). An extra 2 MB for a file that will never change does not seem like a big deal to me compared to the additional effort.

The alternative would be hosting the file on figshare/Zenodo and downloading it on-demand. Any thoughts?

Contributor Author


I compressed it; the total repo size now increases by 20%. I agree that the additional space taken by the repo does not justify the additional effort to either use LFS or Zenodo. We should check with Anubhav whether that's fine by him.

Collaborator


Great, are you happy to contact him again? I can't see there being a problem... (you have my blessing)

@ml-evs
Collaborator

ml-evs commented Apr 9, 2024

3.9 failures are still simply for the Test PyPI upload which won't work from forks (I'll probably fix this at some point after this PR) -- see #933

Collaborator

@ml-evs ml-evs left a comment


This is looking good to me now, let's just wait to check about the data size before merging. Thanks @gbrunin!

@ml-evs
Collaborator

ml-evs commented Apr 10, 2024

I'm happy that everything works locally, and we're fine to merge the dataset in. I'll raise a couple of minor issues that have come up, but otherwise great work and thanks again @gbrunin!

@ml-evs ml-evs merged commit 0763527 into hackingmaterials:main Apr 10, 2024
3 of 4 checks passed
4 participants