Add MLBuilder and MLDoc to record machine learning property predictions #806

Merged
merged 45 commits into from
Sep 26, 2023

janosh
Member

@janosh janosh commented Aug 14, 2023

This PR implements a new MLIPBuilder that generates a complete set of matcalc property predictions using a given machine learning interatomic potential such as M3GNet or CHGNet. This is what a complete set of properties looks like:

- metadata
    - material_id (str): MP ID
    - structure (Structure): pymatgen Structure object
    - deprecated (bool): whether this material is deprecated in MP
    - calculator (str): name of model used as ML potential.
    - version (str): version of matcalc used to generate this document
- relaxation
    - final_structure: relaxed pymatgen Structure object
    - energy (float): final energy in eV
    - volume (float): final volume in Angstrom^3
    - lattice parameters (float): a, b, c, alpha, beta, gamma
- equation of state
    - eos (dict[str, list[float]]): with keys energies and volumes
    - bulk_modulus_bm (float): Birch-Murnaghan bulk modulus in GPa
- phonon
    - temperatures (list[float]): temperatures in K
    - free_energy (list[float]): Helmholtz energies at those temperatures in eV
    - entropy (list[float]): entropies at those temperatures in eV/K
    - heat_capacities (list[float]): heat capacities at constant volume in eV/K
- elasticity
    - elastic_tensor (ElasticTensorDoc): pydantic model from emmet.core.elasticity
    - shear_modulus (ShearModulus): Voigt-Reuss-Hill shear modulus
    - bulk_modulus (BulkModulus): Voigt-Reuss-Hill bulk modulus
    - youngs_modulus (float): Young's modulus
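
For illustration, the field list above could map onto a flat pydantic model roughly like the sketch below. The class name `MLDocSketch`, the choice of which fields are optional, and the plain `dict` standing in for a pymatgen `Structure` are assumptions made so the example runs without pymatgen or emmet; this is not the model added by the PR, and only a subset of the listed fields is shown.

```python
from typing import Dict, List, Optional

from pydantic import BaseModel, Field


class MLDocSketch(BaseModel):
    """Hypothetical, trimmed-down stand-in for the document described above."""

    # metadata
    material_id: str = Field(..., description="MP ID")
    # Placeholder type: the real field would hold a pymatgen Structure.
    structure: dict = Field(..., description="Serialized structure")
    deprecated: bool = Field(False, description="Whether deprecated in MP")
    calculator: str = Field(..., description="Name of model used as ML potential")
    version: str = Field(..., description="matcalc version used for this document")

    # relaxation (optionality is an assumption of this sketch)
    energy: Optional[float] = Field(None, description="Final energy in eV")
    volume: Optional[float] = Field(None, description="Final volume in Angstrom^3")

    # equation of state
    eos: Optional[Dict[str, List[float]]] = Field(
        None, description="Dict with keys 'energies' and 'volumes'"
    )
    bulk_modulus_bm: Optional[float] = Field(
        None, description="Birch-Murnaghan bulk modulus in GPa"
    )

    # phonon
    temperatures: Optional[List[float]] = Field(None, description="Temperatures in K")
    free_energy: Optional[List[float]] = Field(
        None, description="Helmholtz energies in eV"
    )


# All values below are made up purely to show construction and validation.
doc = MLDocSketch(
    material_id="mp-000",
    structure={},
    calculator="example-potential",
    version="0.0.0",
    energy=-1.0,
    eos={"energies": [-1.0, -0.9], "volumes": [40.0, 42.0]},
)
print(doc.material_id, doc.deprecated, doc.bulk_modulus_bm)
```

Keeping every property group on one model mirrors the "single document" preference voiced later in the thread; splitting each group into its own model would be an equally valid sketch.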

Next steps

Run the MLIPBuilder with M3GNet and CHGNet over all structures in MP Core.

…PRelaxationDoc

for interatomic potential predictions
runs CHGNet and MEGNet over all structures in the materials store and with every matcalc PropCalc class
@shyuep
Member

shyuep commented Aug 15, 2023

I personally don't think each property needs to be a separate doc. Otherwise I don't think I have much to say here

@janosh janosh changed the title from "WIP: MLIPBuilder and pydantic models for matcalc.PropCalc classes" to "Add MLIPBuilder and pydantic model MLIPDoc" Aug 15, 2023
@munrojm
Member

munrojm commented Aug 15, 2023

Other than the small comment I added, this is looking good to me.

@mkhorton
Member

I don’t have any stake here, but wouldn’t it be ideal to have a single ElasticityDoc, regardless of the method that was used to generate it? Perhaps with metadata to track method used for generation, fields can be optional if not all moduli are predicted, etc. but elastic properties are elastic properties at the end of the day.

The motivator here is that any downstream analysis that is built to accept elasticity data should still work regardless of the provenance of that data.
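
A minimal sketch of that idea, assuming a hypothetical `ElasticityRecord` model (not emmet's actual `ElasticityDoc`) with a `method` field for provenance and optional moduli. The downstream helper `pugh_ratio` is also made up for this example; the point is only that it never inspects where the data came from.

```python
from typing import Optional

from pydantic import BaseModel


class ElasticityRecord(BaseModel):
    """Hypothetical shared schema; field names are assumptions of this sketch."""

    material_id: str
    method: str  # provenance, e.g. "DFT" or the name of an ML potential
    bulk_modulus_vrh: Optional[float] = None  # GPa
    shear_modulus_vrh: Optional[float] = None  # GPa


def pugh_ratio(record: ElasticityRecord) -> Optional[float]:
    """Return bulk/shear modulus ratio, or None if either value is unavailable."""
    k, g = record.bulk_modulus_vrh, record.shear_modulus_vrh
    if k is None or g is None or g == 0:
        return None
    return k / g


# Made-up numbers; the same function handles both provenances.
dft = ElasticityRecord(
    material_id="mp-000", method="DFT",
    bulk_modulus_vrh=100.0, shear_modulus_vrh=50.0,
)
ml_partial = ElasticityRecord(
    material_id="mp-000", method="example-potential", bulk_modulus_vrh=98.0
)
print(pugh_ratio(dft), pugh_ratio(ml_partial))
```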

@janosh
Member Author

janosh commented Aug 23, 2023

@mkhorton Good point! I think providing users with a consistent schema across data sources is a huge upside.

Afaik, so far there's been a one-to-one mapping from pydantic models to data sources. Breaking that by using the existing ElasticityDoc for matcalc results would couple the schema in matcalc to that in emmet.

But this might be a non-issue given pydantic supports aliases on each Field. I checked with @munrojm, and it sounds like fields that matcalc might add later and that are not yet in ElasticityDoc could simply be added to it as optional fields.
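
As a rough sketch of the alias mechanism: a field can carry one name on the model while being populated from a differently named key. Both the attribute name `k_vrh` and the incoming key `bulk_modulus_vrh` are assumptions chosen for illustration and are not taken from emmet or matcalc; `later_added_property` is likewise hypothetical.

```python
from typing import Optional

from pydantic import BaseModel, Field


class ElasticityAliasSketch(BaseModel):
    # Attribute name used by the schema; the alias is the key expected on input.
    k_vrh: float = Field(..., alias="bulk_modulus_vrh")
    # A field introduced later can be optional so older data still validates.
    later_added_property: Optional[float] = None


# Made-up result dict shaped like the output of some calculator.
calc_output = {"bulk_modulus_vrh": 98.5}
doc = ElasticityAliasSketch(**calc_output)
print(doc.k_vrh, doc.later_added_property)
```

With an alias in place, renaming a key on the producing side only requires touching the alias, not every consumer of the model attribute.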

One thing that probably has to be manually ensured is that equivalent fields in matcalc and ElasticityDoc have the same units. There would be a little less chance of user error from mismatched units if the two documents were separate.

@mkhorton
Member

> One thing that probably has to be manually ensured is that equivalent fields in matcalc and ElasticityDoc have the same units.

Yes, the units question is an important one, and one that we never fully resolved during my time there.

One proposal was that we could lay the groundwork for better unit support in future, and that this could start by making sure every field with units in a PropertyDoc has a shadow field specifying the unit (e.g., bulk_modulus, _bulk_modulus_unit etc.), or equivalently make the field be an object {value: ..., unit: ...}. These could be given default values so would not need to be stored in a database etc. unless overridden with non-default values.
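
A small sketch of the `{value: ..., unit: ...}` variant, assuming a hypothetical `Quantity` model with a default unit and a hand-written `to_storage` helper (neither exists in emmet). The helper drops the unit whenever it equals the default, which is one way the "not stored unless overridden" behavior could look.

```python
from pydantic import BaseModel


class Quantity(BaseModel):
    """Hypothetical value-with-unit object; GPa as default is an assumption."""

    value: float
    unit: str = "GPa"


def to_storage(q: Quantity, default_unit: str = "GPa") -> dict:
    """Serialize for a database, omitting the unit when it is the default."""
    if q.unit == default_unit:
        return {"value": q.value}
    return {"value": q.value, "unit": q.unit}


# Made-up values: one in the default unit, one overriding it.
print(to_storage(Quantity(value=100.0)))
print(to_storage(Quantity(value=0.62, unit="eV/A^3")))
```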

This would allow for potential future development e.g. during validation, the appropriate type with units could be reconstructed with an appropriate units library. Values with units could be type checked, etc. The MPContribs approach of storing the value with its unit as a string is also possible, but risks losing precision and prevents a lot of indexing/searching operations on that field in the database.

An easier option is just requiring one canonical unit, and making sure this is specified in the help string. This probably should be done regardless, although of course you can't do automated unit checking this way.

Overall, not an easy problem and probably out of scope for this PR in any case :)

@shyuep
Member

shyuep commented Aug 23, 2023

  1. I wouldn't store something like units in the DB. The reason is very simple - it is a complete waste of space. Just think about how you usually store data. I can tell you all my data is in GPa, and give you values 1, 2, 100. Or I can write it out as 1GPa, 2GPa, 100GPa. The latter has a lot of redundancy and the multiple GPas give zero information. In other words, you are causing information entropy to decrease. I would rather you simply state that all elasticity values in the DB are in GPa and that is a single point of definition.

  2. As for a single elasticity doc, I think the schema definition can be the same and defined somewhere (isn't pymatgen.elasticity the canonical definition place?). But I wouldn't bother with creating hierarchies upon hierarchies of docs. MLP elasticity will basically be available for all materials. I doubt DFT elasticity will catch up anytime soon. If we talk about expt elasticity, you are lucky if you have the bulk and shear modulus and maybe 1-2 elastic tensor values.
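
The redundancy argument in point 1 can be made concrete with a quick size comparison, using the same three example values. The two JSON layouts below are assumptions made for this illustration, not a description of how any database actually stores these documents.

```python
import json

values = [1, 2, 100]

# Layout A: every value carries its own unit.
per_value = [{"value": v, "unit": "GPa"} for v in values]

# Layout B: the unit is stated once, as a single point of definition.
single_definition = {"unit": "GPa", "values": values}

size_per_value = len(json.dumps(per_value))
size_single = len(json.dumps(single_definition))
print(size_per_value, size_single)
```

The gap grows linearly with the number of values, since layout A repeats the unit string for each one while layout B states it once.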

@matthewkuner

@janosh might one of these eventually replace the ForceFieldTaskDocument from Atomate2?

@janosh
Member Author

janosh commented Aug 29, 2023

@matthewkuner Good question! Haven't looked at ForceFieldTaskDocument recently and it's not my decision to remove or merge it into sth else anyway but I think it's worth checking how much duplication there is.

@janosh
Member Author

janosh commented Sep 23, 2023

@munrojm Tests here are failing due to matcalc ImportError. I added it to extra deps 'all' alongside e.g. robocrys. Should it go somewhere else? Or do the auto-generated req files need to be updated first?

@janosh janosh added the enhancement, Core (Any updates for Emmet-Core), and Builders (Any updates for Emmet-Builders) labels Sep 26, 2023
janosh and others added 9 commits September 25, 2023 19:13
* update dependencies for emmet-api (ubuntu-latest/py3.10)

* update dependencies for emmet-api (ubuntu-latest/py3.11)

* update dependencies for emmet-api (ubuntu-latest/py3.8)

* update dependencies for emmet-api (ubuntu-latest/py3.9)

* update dependencies for emmet-builders (ubuntu-latest/py3.10)

* update dependencies for emmet-builders (ubuntu-latest/py3.11)

* update dependencies for emmet-builders (ubuntu-latest/py3.8)

* update dependencies for emmet-builders (ubuntu-latest/py3.9)

* update dependencies for emmet-core (ubuntu-latest/py3.10)

* update dependencies for emmet-core (ubuntu-latest/py3.11)

* update dependencies for emmet-core (ubuntu-latest/py3.8)

* update dependencies for emmet-core (ubuntu-latest/py3.9)

---------

Co-authored-by: github-actions <github-actions@github.com>
@codecov-commenter

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (e4bb49f) 89.87% compared to head (88b9bbe) 78.97%.

Additional details and impacted files
@@             Coverage Diff             @@
##             main     #806       +/-   ##
===========================================
- Coverage   89.87%   78.97%   -10.90%     
===========================================
  Files         107       75       -32     
  Lines        9527     4204     -5323     
===========================================
- Hits         8562     3320     -5242     
+ Misses        965      884       -81     
Files Coverage Δ
emmet-core/emmet/core/feff/task.py 61.53% <ø> (ø)
emmet-core/emmet/core/thermo.py 52.74% <ø> (-43.96%) ⬇️
emmet-core/emmet/core/structure.py 67.16% <0.00%> (-32.84%) ⬇️
emmet-core/emmet/core/utils.py 32.00% <20.00%> (-32.80%) ⬇️

... and 42 files with indirect coverage changes


@janosh janosh changed the title from "Add MLIPBuilder and pydantic model MLIPDoc" to "Add MLBuilder and MLDoc to record machine learning property predictions" Sep 26, 2023
@munrojm munrojm merged commit ff4b452 into main Sep 26, 2023
1 of 10 checks passed
@janosh janosh deleted the mlip-builder branch September 26, 2023 21:38
7 participants