Hi developers, after reading the "Train MatterGen yourself" section of the README, I want to train MatterGen on my own dataset, and I have the following questions:
- How many structures does a training dataset need? Taking the mp_20 dataset as an example, I notice that train.csv contains about 20k structures. If I only have a small dataset of about 200 structures, can the model still be trained effectively?
- Can disordered structures with partially occupied sites be included in the training dataset? For example, the following CIF file has partially occupied Sr sites:
_atom_site_type_symbol
_atom_site_label
_atom_site_symmetry_multiplicity
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_occupancy
Sr Sr0 1 0.89006600 0.10993400 0.25000000 0.5
Sr Sr1 1 0.10993400 0.89006600 0.75000000 0.5
- I notice that mp_20.zip contains three files, train.csv, test.csv and val.csv, which I assume are the training, test and validation sets. After collecting all the structures I need, is it necessary to split the whole dataset into training, test and validation sets? In other words, if I only have a training set (train.csv), can the model still be trained effectively?
- For base model training, I notice that properties such as "dft_band_gap" and their values are already specified in train.csv. Do the properties and their values need to be set before the base model is trained?
- The header of train.csv contains many columns:
,material_id,formation_energy_per_atom,dft_band_gap,pretty_formula,e_above_hull,elements,cif,spacegroup_number,azure_bulk_modulus,larsen_score_2d,Si_100_mismatch,azure_band_gap,dft_bulk_modulus,dft_poisson_ratio,dft_mag_density
If I want to add my own property such as "dft_density", which of these columns are required in the header of train.csv? Could I simply set the header to ",dft_density"?
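To make the questions about disordered structures and dataset splitting concrete, here is a minimal sketch of the preprocessing I have in mind. It uses only the Python standard library and does not call any MatterGen code. The function names, the 80/10/10 split and the fixed seed are my own arbitrary choices, and the "cif" column name is taken from the mp_20 header above. Whether MatterGen actually requires this filtering or this split is exactly what I am asking.

```python
import random


def has_partial_occupancy(cif_text, tol=1e-6):
    """Return True if any atom-site row in the CIF text has occupancy < 1.

    Hypothetical helper: a simple line-based scan of the _atom_site_ loop,
    not a full CIF parser.
    """
    columns, seen_rows = [], False
    for raw in cif_text.splitlines():
        line = raw.strip()
        if line.startswith("_atom_site_"):
            if seen_rows:
                # A new block of tags starts after data rows: start over.
                columns, seen_rows = [], False
            columns.append(line.split()[0])
        elif columns and line and not line.startswith(("_", "loop_", "#")):
            seen_rows = True
            fields = line.split()
            if "_atom_site_occupancy" not in columns or len(fields) != len(columns):
                continue
            value = fields[columns.index("_atom_site_occupancy")]
            try:
                # Drop a trailing uncertainty such as "0.5(1)".
                occupancy = float(value.split("(")[0])
            except ValueError:
                continue
            if occupancy < 1.0 - tol:
                return True
        else:
            columns, seen_rows = [], False
    return False


def drop_disordered(rows, cif_column="cif"):
    """Keep only rows whose CIF has no partially occupied site."""
    return [row for row in rows if not has_partial_occupancy(row[cif_column])]


def split_rows(rows, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle rows and split them into (train, val, test) lists.

    The fractions are an assumption on my side, not a MatterGen requirement.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * fractions[0])
    n_val = int(len(rows) * fractions[1])
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test
```

The rows would come from `csv.DictReader` on my own CSV file, and each split would be written back out with `csv.DictWriter` as train.csv, val.csv and test.csv, mirroring the layout of mp_20.zip.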