Some questions on preparing my own training dataset #86

@yuedawang

Hi developers, after reading the "Train MatterGen yourself" section of the README, I want to train MatterGen on my own dataset, and I have the following questions:

  1. Number of structures: taking the mp_20 dataset as an example, I notice that train.csv contains about 20k structures. How many structures does my own training dataset need to contain? If I only have a small dataset of about 200 structures, can the model still be trained efficiently?

  2. Can disordered structures with partially occupied sites be included in the training dataset? For example, the following CIF file, in which the Sr sites are partially occupied (occupancy 0.5):

 _atom_site_type_symbol
 _atom_site_label
 _atom_site_symmetry_multiplicity
 _atom_site_fract_x
 _atom_site_fract_y
 _atom_site_fract_z
 _atom_site_occupancy
  Sr  Sr0  1  0.89006600  0.10993400  0.25000000  0.5
  Sr  Sr1  1  0.10993400  0.89006600  0.75000000  0.5
  3. I notice that mp_20.zip contains three files, train.csv, test.csv and val.csv, which I assume are the training, test and validation sets. After collecting all the structures I need, is it necessary to split the whole dataset into training, test and validation sets? In other words, if I only have a training set (train.csv), can the model still be trained efficiently?

  4. For base model training, I notice that properties such as "dft_band_gap" and their values are already specified in train.csv. Do the properties and their values need to be set before training the base model?

  5. The header of train.csv lists many properties:
     ,material_id,formation_energy_per_atom,dft_band_gap,pretty_formula,e_above_hull,elements,cif,spacegroup_number,azure_bulk_modulus,larsen_score_2d,Si_100_mismatch,azure_band_gap,dft_bulk_modulus,dft_poisson_ratio,dft_mag_density
     If I want to add my own property such as "dft_density", which columns are required in the header of train.csv? Could I simply set the header to ",dft_density"?
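
To make questions 3 and 5 concrete, here is a minimal sketch of how I would prepare the files. It only uses the Python standard library. The column set (material_id and cif copied from the mp_20 header, plus my own dft_density), the 80/10/10 split and the placeholder rows are all my own assumptions, not something I found in the README. Whether MatterGen accepts such files is exactly what I am asking.

```python
import csv
import io
import random


def split_rows(rows, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle rows reproducibly and split them into train/val/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


def rows_to_csv(rows, columns):
    """Serialize a list of dict rows to CSV text with the given header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# ASSUMPTION: I do not know which columns are actually required.
# material_id and cif come from the mp_20 header; dft_density is my own property.
COLUMNS = ["material_id", "cif", "dft_density"]

# Placeholder rows standing in for my ~200 structures (not real data).
rows = [
    {"material_id": f"my-{i}", "cif": f"<cif text {i}>", "dft_density": 0.0}
    for i in range(200)
]

train, val, test = split_rows(rows)
train_csv = rows_to_csv(train, COLUMNS)  # text I would save as train.csv
val_csv = rows_to_csv(val, COLUMNS)      # text I would save as val.csv
test_csv = rows_to_csv(test, COLUMNS)    # text I would save as test.csv
```

With 200 structures this leaves 160 for training and 20 each for validation and testing, which is why I am unsure whether such a split is worthwhile for a dataset this small.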
