This section introduces the construction of the dataset and the training of the machine learning model. 

# Dataset Construction

The construction of the dataset is primarily divided into two steps: 
- The first step involves organizing scattered surface structure files and their corresponding target quantities into an ASE database.
- The second step relies on the ASE database to construct graph structures and add feature quantities to them.



In [15]:
from heaict.data.dataset import build_database, get_n_neighbor_slab
from pymatgen.core import Structure
from pymatgen.io.ase import AseAtomsAdaptor
from ase.visualize import view
from ase.db import connect
import pandas as pd

## Get the N-nearest-neighbor slab
We recommend training on the initial, unrelaxed surface structures.

On extended surfaces, adsorption energies are predicted from an N-nearest-neighbor surface model built around the adsorption site. During training we therefore use the same N-nearest-neighbor surface model, which makes the dataset more reliable.

Building this model depends on identifying the adsorption site and its neighboring atoms, and this identification is more accurate on unrelaxed configurations.

You first need a set of parameters for the function **infer_adsorption_site** that correctly determines the adsorption site in every configuration, as discussed in the previous tutorial (all parameters used here are the default values). You can then build the model with the function **get_n_neighbor_slab**.

Here we generate a 3-nearest-neighbor surface model (the first nearest-neighbor shell is the adsorption site itself) for the structure 109_N2h_120_17b.

In [4]:
slab = Structure.from_file(f'Data/Example/Site_Identification/Calculated_adss_initial/109_N2h_120_17b.vasp')
nn_slab = get_n_neighbor_slab(slab=slab, number_site_atoms=4, adsorbate_elements=['N', 'H'], number_neighbors=3, cutoff=3)
nn_slab = AseAtomsAdaptor.get_atoms(nn_slab)
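
The shell construction described above can be pictured as a stepwise expansion outward from the adsorption site. The following is a minimal pure-Python sketch under stated assumptions, not the implementation of get_n_neighbor_slab: the helper `neighbor_shells` and the toy atom chain are hypothetical, periodic boundaries and adsorbate atoms are ignored, and each new shell is assumed to contain the unvisited atoms within `cutoff` of the previous shell.

```python
from math import dist


def neighbor_shells(positions, site_indices, number_neighbors, cutoff):
    """Return a list of shells; the first shell holds the site atoms themselves.

    Hypothetical helper: each further shell collects the unvisited atoms that
    lie within `cutoff` of any atom in the previous shell.
    """
    shells = [sorted(site_indices)]
    seen = set(site_indices)
    for _ in range(number_neighbors - 1):
        frontier = shells[-1]
        new_shell = sorted(
            j
            for j in range(len(positions))
            if j not in seen
            and any(dist(positions[i], positions[j]) <= cutoff for i in frontier)
        )
        seen.update(new_shell)
        shells.append(new_shell)
    return shells


# Toy chain of six atoms spaced 2.5 Å apart; atom 0 plays the adsorption site.
positions = [(2.5 * k, 0.0, 0.0) for k in range(6)]
shells = neighbor_shells(positions, site_indices=[0], number_neighbors=3, cutoff=3.0)
print(shells)  # [[0], [1], [2]]
```

With `number_neighbors=3`, atoms 3 to 5 of the chain fall outside the third shell and would be dropped from the reduced model.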

In [9]:
view(atoms=nn_slab)


## Build the ASE database
By passing the corresponding parameters to the function **build_database**, you can build an ASE database of N-nearest-neighbor surface structures. You also need to provide the target quantities for each sample, which requires a data table and the names of the columns that hold the target quantities.

In [14]:
df = pd.read_csv(f'Data/Example/Site_Identification/data2.csv', index_col=0)
build_database(
    slabs_direction=f'Data/Example/Site_Identification/Calculated_adss_initial/',
    df_target=df,
    targets=['G', 'Cadsb', 'Csite'],
    database=f'Data/Example/db_and_ML/slab.db',
    disable_tqdm=False,
    jupyter_tqdm=True,
    adsorbate_elements=['N', 'H'],
    para_n_neib_slab={},  # empty dict: use all default parameters
    para_infer_ads_site={}  # empty dict: use all default parameters
)

  0%|          | 0/18 [00:00<?, ?it/s]
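
Before building the database, it can help to check that every structure has a complete row in the target table. The sketch below uses only the standard library and rests on assumptions: the table's first column is taken to hold the structure name without file extension (consistent with `index_col=0` and the `struc_name` key shown in the next cells), the column header `name` is hypothetical, and the inline CSV is a stand-in for the real data2.csv.

```python
import csv
import io

# Hypothetical stand-in for the target table; only the first row's values
# are taken from the example database entry shown in this tutorial.
csv_text = """name,G,Cadsb,Csite
109_H_0_0h,-0.459272385,4.0,1.0
"""

# Structure names as they would be derived from the .vasp file stems.
structure_names = ["109_H_0_0h", "109_N2h_120_17b"]
targets = ["G", "Cadsb", "Csite"]

rows = {row["name"]: row for row in csv.DictReader(io.StringIO(csv_text))}

# Structures without any row in the table.
missing = [name for name in structure_names if name not in rows]

# Rows that lack a value for at least one requested target column.
incomplete = [
    name
    for name, row in rows.items()
    if any(row.get(target) in (None, "") for target in targets)
]

print("missing rows:", missing)
print("incomplete rows:", incomplete)
```

In this toy table the second structure has no row, so it is reported as missing.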

In [17]:
db_slab = connect(f'Data/Example/db_and_ML/slab.db')

In [20]:
db_slab[1].key_value_pairs

{'struc_name': '109_H_0_0h', 'G': -0.459272385, 'Cadsb': 4.0, 'Csite': 1.0}

## Slab structure to graph data

To convert a structural file into graph data and add features, the following three functions need to be used in sequence. 
- **construct_graph** is used to convert the structural file into graph data. 
- **grab_element_feature** obtains a table containing the features of the required elements. 
- **add_features** is then used to add feature vectors such as element features and structural features to the graph data.


In [21]:
from heaict.data.dataset import construct_graph, grab_element_feature, add_features

First, convert the first slab structure in the ASE database into a graph structure.

In [27]:
graph = construct_graph(
    slab=AseAtomsAdaptor.get_structure(db_slab[1].toatoms()),
    cutoff_substrate=3,
    cutoff_adsorbate=1.3,
    cutoff_bond=2.7,
    adsb=['N', 'H'],
    forbidden_bonds=[['H', 'H']],
    para_infer_ads_site={}
)
graph

Data(edge_index=[2, 414], atomic_number=[56], atomic_type=[56], bond_pair=[3, 2], num_bond=[1], bonded_pair=[3, 1])
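
To make the role of the cutoffs more concrete, here is a minimal pure-Python sketch of distance-based edge construction. It is an illustration under assumptions, not the code of construct_graph: the helper `build_edges` is hypothetical, periodic images are ignored, and we assume `cutoff_substrate` applies to substrate-substrate pairs, `cutoff_adsorbate` to adsorbate-adsorbate pairs, `cutoff_bond` to adsorbate-substrate pairs, and that pairs listed in `forbidden_bonds` never receive an edge.

```python
from itertools import combinations
from math import dist


def build_edges(symbols, positions, adsorbate_elements,
                cutoff_substrate, cutoff_adsorbate, cutoff_bond,
                forbidden_bonds=()):
    """Return a [sources, destinations] edge list with both directions stored."""
    forbidden = {frozenset(pair) for pair in forbidden_bonds}
    src, dst = [], []
    for i, j in combinations(range(len(symbols)), 2):
        if frozenset((symbols[i], symbols[j])) in forbidden:
            continue
        # Count how many of the two atoms belong to the adsorbate (0, 1 or 2).
        n_ads = (symbols[i] in adsorbate_elements) + (symbols[j] in adsorbate_elements)
        cutoff = (cutoff_substrate, cutoff_bond, cutoff_adsorbate)[n_ads]
        if dist(positions[i], positions[j]) <= cutoff:
            src += [i, j]
            dst += [j, i]
    return [src, dst]


# Toy cluster: two metal atoms with an NH2 fragment above the first one.
symbols = ["Ni", "Cu", "N", "H", "H"]
positions = [
    (0.0, 0.0, 0.0),
    (2.5, 0.0, 0.0),
    (0.0, 0.0, 1.9),
    (0.0, 0.6, 2.7),
    (0.0, -0.6, 2.7),
]
edge_index = build_edges(
    symbols, positions, adsorbate_elements=["N", "H"],
    cutoff_substrate=3, cutoff_adsorbate=1.3, cutoff_bond=2.7,
    forbidden_bonds=[["H", "H"]],
)
print(edge_index)
```

The two hydrogen atoms are 1.2 Å apart, which is within the adsorbate cutoff, but the forbidden H-H pair keeps them unconnected. Storing each edge in both directions mirrors the `[2, num_edges]` layout of `edge_index`.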

Then, generate a table of element features. For the available features and further details, see the function's docstring:

In [None]:
grab_element_feature?

In [35]:
df_feature = grab_element_feature(
    elements=['Ni', 'Pd', 'Cu', 'Mo', 'Mn', 'Fe', 'Ru', 'Co', 'N', 'H'],
    features='All',
    use_onehot=True,
    onehot_dim=10,
    compress_onehot=True,
)
df_feature

atomic_number,number-1,number-7,number-25,number-26,number-27,number-28,number-29,number-42,number-44,number-46,row-1,row-2,row-4,row-5,group-1,group-6,group-7,group-8,group-9,group-10,group-11,group-15,atomic_radius-0,atomic_radius-3,atomic_radius-8,atomic_radius-9,atomic_mass-0,atomic_mass-1,atomic_mass-5,atomic_mass-9,Molar volume-0,Molar volume-1,Molar volume-2,Molar volume-4,Molar volume-6,Molar volume-9,X-0,X-1,X-2,X-4,X-9,electron_affinity-0,electron_affinity-2,electron_affinity-3,electron_affinity-6,electron_affinity-7,electron_affinity-8,electron_affinity-9,ionization_energy-0,ionization_energy-1,ionization_energy-8,ionization_energy-9,valence-1,valence-3,valence-5,valence-6,valence-7,valence-8,valence-10,Atomic orbitals-0,Atomic orbitals-2,Atomic orbitals-4,Atomic orbitals-5,Atomic orbitals-7,Atomic orbitals-8,Atomic orbitals-9
28,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
46,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0
29,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
42,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0
25,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
26,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
44,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0
27,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
7,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0
1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
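
The column names in this table (for example X-0, X-1, X-2, X-4, X-9) suggest that each continuous property is split into `onehot_dim` bins, one-hot encoded, and then stripped of bins that no element occupies. The sketch below illustrates that idea in pure Python. It is an inference, not the confirmed implementation of grab_element_feature: the helpers `onehot_bins` and `compress` are hypothetical, equal-width bins between the minimum and maximum value are assumed, and the inputs are standard Pauling electronegativities.

```python
def onehot_bins(values, dim):
    """One-hot encode each value by its equal-width bin between min and max."""
    lo, hi = min(values.values()), max(values.values())
    width = (hi - lo) / dim
    table = {}
    for key, value in values.items():
        # The maximum value would land in bin `dim`, so clamp it to the last bin.
        idx = min(int((value - lo) / width), dim - 1)
        table[key] = [1 if k == idx else 0 for k in range(dim)]
    return table


def compress(table, prefix):
    """Drop bins that are zero for every element and name the remaining columns."""
    dim = len(next(iter(table.values())))
    keep = [k for k in range(dim) if any(row[k] for row in table.values())]
    columns = [f"{prefix}-{k}" for k in keep]
    return columns, {key: [row[k] for k in keep] for key, row in table.items()}


# Pauling electronegativities of the ten elements used in this tutorial.
electronegativity = {
    "Ni": 1.91, "Pd": 2.20, "Cu": 1.90, "Mo": 2.16, "Mn": 1.55,
    "Fe": 1.83, "Ru": 2.20, "Co": 1.88, "N": 3.04, "H": 2.20,
}
columns, compressed = compress(onehot_bins(electronegativity, dim=10), prefix="X")
print(columns)
print(compressed["Ni"])
```

Under these assumptions the surviving columns are X-0, X-1, X-2, X-4 and X-9, and Ni falls into X-2, which agrees with the X columns of the table above.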


The function **add_features** assigns features and target quantities to the graph data passed to it. Target quantities are left out for now.
In addition to element features, this function can assign four further types of graph-structure features:
- **Distance features** on edges and nodes, describing their distance to the adsorbate molecule node.
- **Coordination number** features, distinguishing whether atomic nodes belong to the bulk, the surface, or other regions.
- **Bond features**, such as metallic bonds, covalent bonds within molecules, and chemical bonds between molecules and metals.
- **NRR-related features**, such as the bond order of N–N bonds. Note that this feature is not universal.

For specific parameter settings, please refer to the relevant documentation:

In [None]:
add_features?

In [36]:
graph = add_features(
    graph=graph,
    df_feature=df_feature,
    targets=None,
    bond_feature=True,
    coordinate=[9, 12],
    distance_feature=True,
    use_NRRBOP=False
)

Now the node and edge features have been added to the x and edge_attr attributes of the graph, respectively.

In [42]:
graph.x

tensor([[0., 0., 0.,  ..., 0., 1., 0.],
        [0., 0., 0.,  ..., 0., 0., 1.],
        [0., 0., 0.,  ..., 0., 1., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 1., 0.],
        [0., 0., 0.,  ..., 0., 0., 1.],
        [1., 0., 0.,  ..., 0., 0., 0.]])

In [43]:
graph.edge_attr

tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 1., 0.,  ..., 0., 1., 0.],
        [0., 1., 0.,  ..., 0., 1., 0.],
        [0., 1., 0.,  ..., 0., 1., 0.]])

## Batch construction of graph data
The function **load_graph_from_database** reads structures and target quantities directly from the ASE database and constructs the graph structures in batch. To do so, it needs parameters for the three functions above, which are passed to load_graph_from_database as dictionaries.

The parameter dictionaries used in the paper for some of these functions are stored in the **para** module. Here we use the paper's settings for grab_element_feature and add_features, while construct_graph uses its default parameters.

In [46]:
from heaict.data.dataset import load_graph_from_database
from heaict.para import para_add_feat, para_grab_feature

In [47]:
graphs = load_graph_from_database(
    database=f'Data/Example/db_and_ML/slab.db',
    targets=['G', 'Cadsb', 'Csite'],
    save=True,
    pt_file=f'Data/Example/db_and_ML/graphs.pt',
    disable_tqdm=False, jupyter_tqdm=True,
    para_grab_ele_feature=para_grab_feature,
    para_add_feature=para_add_feat,
    para_construct_graph={}
)

  0%|          | 0/18 [00:00<?, ?it/s]
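
The `para_*` arguments follow a common Python pattern: a dictionary of keyword arguments is forwarded to an inner function, and an empty dictionary leaves that function's defaults untouched. The sketch below shows the pattern with stand-in functions. The names `construct_graph_stub` and `load_graphs_stub` and the default values in the stub are hypothetical placeholders, not the package's actual signatures or defaults.

```python
def construct_graph_stub(slab, cutoff_substrate=3.0, cutoff_bond=2.7):
    """Stand-in for a graph builder; the defaults here are placeholders."""
    return {
        "slab": slab,
        "cutoff_substrate": cutoff_substrate,
        "cutoff_bond": cutoff_bond,
    }


def load_graphs_stub(slabs, para_construct_graph=None):
    """Forward a parameter dictionary to the builder for every slab."""
    # Copy the dictionary so the caller's object is never modified.
    para = dict(para_construct_graph or {})
    return [construct_graph_stub(slab, **para) for slab in slabs]


# An empty dictionary keeps every default of the inner function.
default_graphs = load_graphs_stub(["slab_a", "slab_b"], para_construct_graph={})

# A partial dictionary overrides only the keys it names.
custom_graphs = load_graphs_stub(["slab_a"], para_construct_graph={"cutoff_bond": 2.5})

print(default_graphs[0]["cutoff_bond"], custom_graphs[0]["cutoff_bond"])
```

Keeping such dictionaries in one module, as the **para** module does for the paper's settings, makes it easy to reuse the same configuration for dataset construction and later prediction.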

In [49]:
graphs[0].x

tensor([[0., 0., 1.,  ..., 0., 1., 0.],
        [0., 0., 1.,  ..., 0., 0., 1.],
        [0., 0., 1.,  ..., 0., 1., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 1., 0.],
        [0., 0., 0.,  ..., 0., 0., 1.],
        [1., 0., 0.,  ..., 0., 0., 0.]])

In [50]:
graphs[0].edge_attr

tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 1., 0.,  ..., 0., 0., 1.],
        [0., 1., 0.,  ..., 0., 0., 1.],
        [0., 1., 0.,  ..., 0., 0., 1.]])

In [51]:
graphs[0].y

tensor([[-0.4593,  4.0000,  1.0000]])