Some questions on preparing my own training dataset #86

Hi developers, after reading the "Train MatterGen yourself" section of the README, I want to train MatterGen on my own dataset, and I have the following questions:

1. Number of structures: taking the mp_20 dataset as an example, I notice that train.csv contains about 20k structures. How many structures does my own training dataset need? If I only have a small dataset of about 200 structures, can the model still be trained effectively?

2. Can disordered structures with partially occupied sites be included in the training dataset? For example, the following CIF file, in which the Sr sites are partially occupied:
```
 _atom_site_type_symbol
 _atom_site_label
 _atom_site_symmetry_multiplicity
 _atom_site_fract_x
 _atom_site_fract_y
 _atom_site_fract_z
 _atom_site_occupancy
  Sr  Sr0  1  0.89006600  0.10993400  0.25000000  0.5
  Sr  Sr1  1  0.10993400  0.89006600  0.75000000  0.5
```

3. I notice that mp_20.zip contains three files, train.csv, test.csv and val.csv, which are presumably the training, test and validation sets. After collecting all the structures I need, is it necessary to split the whole dataset into training, test and validation sets? In other words, if I only have a training set (train.csv), can the model still be trained?

4. For base model training, I notice that properties such as "dft_band_gap" and their values are already specified in train.csv. Do the properties and their values need to be set before training the base model?

5. The header of train.csv lists many properties: `,material_id,formation_energy_per_atom,dft_band_gap,pretty_formula,e_above_hull,elements,cif,spacegroup_number,azure_bulk_modulus,larsen_score_2d,Si_100_mismatch,azure_band_gap,dft_bulk_modulus,dft_poisson_ratio,dft_mag_density`. If I want to add my own property, such as "dft_density", which columns are required in the header of train.csv? Could I simply set the header to ",dft_density"?
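
To make question 3 concrete, here is a minimal sketch of how I would split a single CSV of my structures into three files. The 80/10/10 ratio, the fixed seed, and the helper names `split_rows` and `write_splits` are my own assumptions; only the output file names follow mp_20.zip. It uses only the Python standard library and keeps whatever header the input file has, so it does not assume which columns MatterGen requires.

```python
import csv
import random
from pathlib import Path


def split_rows(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows reproducibly and split them into train/val/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


def write_splits(all_csv, out_dir, **kwargs):
    """Read one CSV and write train.csv, val.csv and test.csv with the same header."""
    # newline="" lets the csv module handle multi-line quoted fields such as CIF strings.
    with open(all_csv, newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        rows = list(reader)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, part in zip(("train", "val", "test"), split_rows(rows, **kwargs)):
        with open(out_dir / f"{name}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=header)
            writer.writeheader()
            writer.writerows(part)
```

With 200 structures and the default fractions this would give 160 training, 20 validation and 20 test rows, for example via `write_splits("all.csv", "my_dataset")`, where both names are placeholders. Whether splits this small are usable for training is exactly what I am unsure about.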
