
Exploiting redundancy in large materials datasets for efficient machine learning with less data


Repo for the paper "Exploiting redundancy in large materials datasets for efficient machine learning with less data", published in Nat. Commun. 14, 7283 (2023):

Extensive efforts to gather materials data have largely overlooked potential data redundancy. In this study, we present evidence of a significant degree of redundancy across multiple large datasets for various material properties, by revealing that up to 95 % of data can be safely removed from machine learning training with little impact on in-distribution prediction performance. The redundant data is related to over-represented material types and does not mitigate the severe performance degradation on out-of-distribution samples. In addition, we show that uncertainty-based active learning algorithms can construct much smaller but equally informative datasets. We discuss the effectiveness of informative data in improving prediction performance and robustness and provide insights into efficient data acquisition and machine learning training. This work challenges the "bigger is better" mentality and calls for attention to the information richness of materials data rather than a narrow emphasis on data volume.

Schematic of redundancy evaluation

Figure: (a) Dataset splits. (b) Three prediction tasks to evaluate model performance and data redundancy.


Data redundancy manifested in in-distribution (ID) performance

Figure: RMSE on the ID test sets. (a-c) JARVIS18, MP18, and OQMD14 formation energy prediction. (d-f) JARVIS18, MP18, and OQMD14 band gap prediction. The random baseline results for the XGB and RF (or ALIGNN) models are obtained by averaging the results of 10 (or 5) random data selections for each training set size. The x-axis is on a log scale.
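The random baseline described above amounts to training on progressively larger random subsets and averaging the test error over repeated draws. The sketch below is an illustration of that idea only, not the repository's actual pipeline: the data, feature arrays, and XGBoost settings are placeholders, and it assumes pre-featurized numpy arrays.

# Minimal sketch of the random-selection baseline: for each training-set size,
# train on several random subsets and average the RMSE on a fixed ID test set.
# X and y below are stand-in data, not one of the paper's datasets.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(2000, 20)), rng.normal(size=2000)  # placeholder features/targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

sizes = [100, 200, 400, 800, 1600]   # training-set sizes (log-spaced, as on the x-axis)
n_repeats = 10                       # e.g. 10 random selections per size for XGB/RF

for size in sizes:
    rmses = []
    for rep in range(n_repeats):
        idx = rng.choice(len(X_train), size=size, replace=False)
        model = XGBRegressor(n_estimators=200).fit(X_train[idx], y_train[idx])
        pred = model.predict(X_test)
        rmses.append(np.sqrt(mean_squared_error(y_test, pred)))
    print(f"train size {size}: mean RMSE {np.mean(rmses):.3f}")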


Data redundancy manifested in out-of-distribution (OOD) performance

Figure: RMSE on the OOD test sets. (a) JARVIS formation energy prediction. (b) MP formation energy prediction. (c) OQMD band gap prediction. Performance on the ID test set is shown for comparison.


In-distribution performance during active learning campaigns

Figure: RMSE on the ID test sets for XGB models trained on data selected by the active learning algorithms. (a) MP21 formation energy prediction. (b) JARVIS22 formation energy prediction. (c) OQMD14 band gap prediction. QBC: query by committee; RF-U: random forest uncertainty; XGB-U: XGBoost uncertainty. Performance obtained with random sampling and the pruning algorithm is shown for comparison.
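As a rough illustration of how such an uncertainty-driven loop works (not the exact implementation behind run_al.bash), the sketch below uses the spread of per-tree random forest predictions as the uncertainty signal, in the spirit of RF-U. The data, seed-set size, batch size, and model settings are placeholders.

# Minimal sketch of uncertainty-based active learning: at each iteration, fit a
# random forest on the labeled set, score the pool by the standard deviation of
# per-tree predictions, and add the most uncertain points to the labeled set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_pool, y_pool = rng.normal(size=(2000, 20)), rng.normal(size=2000)  # stand-in data

labeled = list(rng.choice(len(X_pool), size=100, replace=False))  # initial random seed set
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

batch_size, n_iterations = 100, 5
for it in range(n_iterations):
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
    rf.fit(X_pool[labeled], y_pool[labeled])

    # Uncertainty = standard deviation across the individual trees' predictions.
    tree_preds = np.stack([tree.predict(X_pool[unlabeled]) for tree in rf.estimators_])
    uncertainty = tree_preds.std(axis=0)

    # Query the most uncertain pool points and move them to the labeled set.
    query = np.argsort(uncertainty)[-batch_size:]
    newly_labeled = [unlabeled[i] for i in query]
    labeled.extend(newly_labeled)
    unlabeled = [i for i in unlabeled if i not in set(newly_labeled)]
    print(f"iteration {it}: labeled set size = {len(labeled)}")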


Dependencies

Please follow setup_env.bash to set up the Python environment:

env_name="test"
conda create -n $env_name -y python=3.10
conda activate $env_name
conda install -y -c conda-forge scikit-learn py-xgboost-gpu  pandas matplotlib

# For data preprocessing and featurization 
conda install -y -c conda-forge pymatgen

# For getting uncertainty estimates of XGBoost
pip install ibug

# For using ALIGNN
pip install dgl -f https://data.dgl.ai/wheels/cu116/repo.html
pip install dglgo -f https://data.dgl.ai/wheels-test/repo.html
pip install alignn 

Data

The JARVIS, Materials Project, and OQMD dataset snapshots considered in this study, along with their descriptions, can be found on Zenodo.
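For composition-based featurization (the role pymatgen plays in the dependencies above), a minimal, hypothetical sketch that turns chemical formulas into element-fraction vectors is shown below. The column names, toy values, and feature choice are illustrative only; the featurization actually used in the paper is produced by get_featurized_data.bash and the scripts in the codes folder.

# Minimal sketch: convert a formula column into element-fraction features with pymatgen.
# The dataframe below is a hypothetical stand-in for one of the dataset snapshots.
import pandas as pd
from pymatgen.core import Composition, Element

df = pd.DataFrame({"formula": ["Fe2O3", "NaCl", "SiO2"],
                   "formation_energy": [-2.5, -2.1, -3.0]})  # toy values

elements = [Element.from_Z(z).symbol for z in range(1, 104)]  # H through Lr

def element_fractions(formula: str) -> list[float]:
    """Return the atomic fraction of each element in `elements` for a formula."""
    comp = Composition(formula)
    return [comp.get_atomic_fraction(Element(el)) for el in elements]

X = pd.DataFrame([element_fractions(f) for f in df["formula"]], columns=elements)
y = df["formation_energy"]
print(X.shape, y.shape)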

Scripts

The bash scripts get_featurized_data.bash and run_al.bash are provided as a demo to reproduce the QBC active learning results for the JARVIS22 dataset, which has a relatively low training cost. Scripts in the codes folder can be used to reproduce the full results but require a much larger training budget.
