In [1]:
!pip install matgraphdb
!pip install ipykernel



# 03 - Graph Generators in ParquetGraphDB

In this notebook, we'll learn how to:

1. Create a node generator
2. Add the node generator to the graph
3. Create an edge generator
4. Add the edge generator to the graph
5. Define dependencies between generators

We'll use the `ParquetGraphDB` class from `parquetdb` to demonstrate these features. If you haven't already installed `parquetdb`, run the previous cell.


## Download example data

For this tutorial, we will start from example materials data. You can download the data by running the following cell, which fetches it from the [ParquetDB GitHub repository](https://github.com/lllangWV/ParquetDB/tree/main/tests/graphdb/data/materials).

In [14]:
from pathlib import Path
import os
import shutil
import pandas as pd

FILE_DIR = Path(".")  # A Path, so that `FILE_DIR / "GraphDB"` works below

Next, we load the materials data, which can be retrieved using `ParquetDB`, and add it to `ParquetGraphDB`.

In [12]:
from parquetdb import ParquetGraphDB

# Create a temporary directory for our database
GRAPH_DB_DIR = FILE_DIR / "GraphDB"
if GRAPH_DB_DIR.exists():
    shutil.rmtree(GRAPH_DB_DIR)
GRAPH_DB_DIR.mkdir(parents=True, exist_ok=True)


# Initialize ParquetGraphDB
db = ParquetGraphDB(storage_path=GRAPH_DB_DIR)

data = [
    {},
    {}
    ]
db.add_nodes(node_type="material", data=data)

print(db.summary(show_column_names=True))


ArrowInvalid: Field type did not match data type

## Generators

A **Generator** is a callable (function) that returns a [PyArrow Table](https://arrow.apache.org/docs/python/api/table.html) of either nodes or edges. By adding a generator to `ParquetGraphDB`, you can:

1. Register the generator, so it can be re-run on demand.
2. Optionally specify arguments/kwargs to pass into the generator.
3. Automatically store the output in a **NodeStore** or **EdgeStore** with the same name as the generator function (or a custom name, if you prefer).

This is especially handy for generating nodes from external data sources or from computational routines.
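
To make the three steps above concrete, here is a minimal pure-Python sketch of the idea. The `GeneratorRegistry` class, its methods, and the toy `elements` function are hypothetical stand-ins for illustration only; they are not the library's implementation, and plain lists of dicts stand in for PyArrow Tables.

```python
from typing import Any, Callable, Dict, List, Optional


class GeneratorRegistry:
    """Hypothetical in-memory stand-in for a generator store (illustration only)."""

    def __init__(self) -> None:
        self.generators: Dict[str, Dict[str, Any]] = {}
        self.stores: Dict[str, List[dict]] = {}

    def add(
        self,
        func: Callable[..., List[dict]],
        kwargs: Optional[dict] = None,
        run_immediately: bool = True,
    ) -> str:
        # Step 1: register the generator under its function name, with its kwargs.
        name = func.__name__
        self.generators[name] = {"func": func, "kwargs": dict(kwargs or {})}
        if run_immediately:
            self.run(name)
        return name

    def run(self, name: str) -> List[dict]:
        # Step 2: re-run on demand, reusing the stored kwargs.
        entry = self.generators[name]
        rows = entry["func"](**entry["kwargs"])
        # Step 3: store the output under the generator's name.
        self.stores[name] = rows
        return rows


def elements(symbols=("H", "He")):
    """Toy generator: one row per symbol."""
    return [{"id": i, "symbol": s} for i, s in enumerate(symbols)]


registry = GeneratorRegistry()
registry.add(elements, kwargs={"symbols": ("H", "He", "Li")}, run_immediately=False)
deferred = "elements" not in registry.stores  # nothing has been generated yet
rows = registry.run("elements")
```

Because the function and its kwargs are kept together, the caller only needs the generator's name to regenerate the store later.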

In the following sections we will create custom node and edge generators. These can be created by wrapping existing functions with the `node_generator` or `edge_generator` decorators.

These can be imported like:

```python
from parquetdb import node_generator, edge_generator
```

### Element Node Generator


#### 1. Define the Generator

In our first example, we will create a node generator that creates element nodes. 

As mentioned above, to create a node generator we wrap an existing function with the `node_generator` decorator. The function name becomes the name of the node type.

```python
@node_generator
def elements():
    ...
```

For this example, we will import existing periodic table data from the `matgraphdb` package. This is a dataframe with 118 rows, one for each element of the periodic table. We also apply some transformations to make the data more useful for our purposes.
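
Several list-valued columns in this file arrive as strings, so the generator has to parse them back into Python lists. The sketch below shows one way to do that parsing with `ast.literal_eval`, which only accepts Python literals. The helper name `parse_number_list` and the sample strings are hypothetical; the whitespace-separated format is an assumption inferred from the bracket-stripping and split/join steps in the generator.

```python
import ast


def parse_number_list(text: str) -> list:
    """Parse strings such as "[-1  1]" or "[1312.0, 2372.3]" into a list of numbers."""
    # Drop surrounding whitespace and the enclosing brackets.
    inner = text.strip().strip("[]")
    # Treat commas and whitespace alike, then re-join the tokens with commas.
    tokens = inner.replace(",", " ").split()
    # literal_eval only evaluates literals, so it cannot run arbitrary code.
    return ast.literal_eval("[" + ",".join(tokens) + "]")


oxidation_states = parse_number_list("[-1  1]")
ionization_energies = parse_number_list("[1312.0, 2372.3]")
empty = parse_number_list("[]")

# A non-literal expression is rejected instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```

Keeping the parsing in a small helper also makes it easy to test the transformation separately from the file loading.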

In [4]:
### Element Node Generator
import ast


# Define the generator with the @node_generator decorator
@node_generator
def elements(base_file=BASE_ELEMENT_FILE):
    """
    Creates Element nodes from a local file (CSV or Parquet).
    Returns a Pandas DataFrame with one row per element, or None on error.
    """

    try:
        # Read the file, dispatching on its extension ("parquet" or "csv")
        file_ext = os.path.splitext(base_file)[-1][1:].lower()
        if file_ext == "parquet":
            df = pd.read_parquet(base_file)
        elif file_ext == "csv":
            df = pd.read_csv(base_file)
        else:
            raise ValueError("base_file must be a parquet or csv file")

        # Oxidation states: strip the brackets, join the whitespace-separated
        # tokens with commas, then parse the result as a list literal
        df["oxidation_states"] = df["oxidation_states"].apply(
            lambda x: x.replace("]", "").replace("[", "")
        )
        df["oxidation_states"] = df["oxidation_states"].apply(
            lambda x: ",".join(x.split())
        )
        df["oxidation_states"] = df["oxidation_states"].apply(
            lambda x: ast.literal_eval("[" + x + "]")
        )
        # These columns hold list literals as strings; ast.literal_eval parses
        # them without executing arbitrary code, unlike eval
        df["experimental_oxidation_states"] = df["experimental_oxidation_states"].apply(
            ast.literal_eval
        )
        df["ionization_energies"] = df["ionization_energies"].apply(ast.literal_eval)

    except Exception as e:
        print(f"Error reading element file: {e}")
        return None

    return df  # Return the transformed dataframe


df = elements()

print(df)

       long_name symbol  abundance_universe  abundance_solar  \
0       Hydrogen      H        7.500000e+01     7.500000e+01   
1         Helium     He        2.300000e+01     2.300000e+01   
2        Lithium     Li        6.000000e-07     6.000000e-09   
3      Beryllium     Be        1.000000e-07     1.000000e-08   
4          Boron      B        1.000000e-07     2.000000e-07   
..           ...    ...                 ...              ...   
113    Flerovium     Fl        0.000000e+00     0.000000e+00   
114    Moscovium     Mc        0.000000e+00     0.000000e+00   
115  Livermorium     Lv        0.000000e+00     0.000000e+00   
116   Tennessine     Ts        0.000000e+00     0.000000e+00   
117    Oganesson     Og        0.000000e+00     0.000000e+00   

     abundance_meteor  abundance_crust  abundance_ocean  abundance_human  \
0            2.400000     1.500000e-01     1.100000e+01     1.000000e+01   
1            0.000000     5.500000e-07     7.200000e-10     0.000000e+00   
2  

#### 2. Add the Generator to the MatGraphDB

Now that we have defined the generator, we can add it to the `MatGraphDB` instance by calling the `add_node_generator` method. We pass the generator function together with its arguments (`generator_args`) and keyword arguments (`generator_kwargs`). The `run_immediately` flag controls whether the generator runs as soon as it is added; it defaults to `True`.

The node generator will be stored in the `node_generator_store` of the `MatGraphDB` instance.



In [5]:
mdb.add_node_generator(
    generator_func=elements,
    generator_args={},
    generator_kwargs={"base_file": BASE_ELEMENT_FILE},
    run_immediately=False,  # We have the option to run the generator immediately or later. Default is True.
)

# Check the node generators in the MatGraphDB

print(mdb.node_generator_store)

GENERATOR STORE SUMMARY
• Number of generators: 1
Storage path: MatGraphDB\node_generators


############################################################
METADATA
############################################################
• class: GeneratorStore
• class_module: matgraphdb.core.generator_store

############################################################
GENERATOR DETAILS
############################################################
• Columns:
    - generator_func
    - generator_kwargs.base_file
    - generator_name
    - id

• Generator names:
    - elements



### Running a Node Generator Later

Now we can run the node generator with `mdb.run_node_generator(generator_name)`.
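
Because the arguments are saved alongside the generator, a later run only needs the generator's name, and call-time values can replace the saved ones. The sketch below illustrates one plausible merge rule, in which call-time values win. The `resolve_kwargs` helper and the file names are hypothetical, and this notebook does not show how overrides are actually passed to the library, so treat the rule as an assumption.

```python
from typing import Any, Dict, Optional


def resolve_kwargs(
    stored: Dict[str, Any], overrides: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Merge stored kwargs with call-time overrides; overrides win on conflicts."""
    return {**stored, **(overrides or {})}


# Hypothetical kwargs saved when the generator was registered.
stored = {"base_file": "elements.parquet", "verbose": False}

default_call = resolve_kwargs(stored)  # no overrides: stored values are used
override_call = resolve_kwargs(stored, {"base_file": "elements.csv"})
```

The merge builds a new dictionary, so the stored kwargs stay unchanged for the next run.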


In [6]:
# Here we run the node generator. Notice that we do not need to pass the arguments or kwargs; they are stored in the node generator store.
# However, we can override the arguments or kwargs if we want to.
mdb.run_node_generator("elements")

Unnamed: 0,long_name,symbol,abundance_universe,abundance_solar,abundance_meteor,abundance_crust,abundance_ocean,abundance_human,adiabatic_index,allotropes,...,is_halogen,is_lanthanoid,is_metal,is_metalloid,is_noble_gas,is_post_transition_metal,is_quadrupolar,is_rare_earth_metal,experimental_oxidation_states,ionization_energies
0,Hydrogen,H,7.500000e+01,7.500000e+01,2.400000,1.500000e-01,1.100000e+01,1.000000e+01,5-Jul,Dihydrogen,...,False,False,False,False,False,False,True,False,[],[1312.0]
1,Helium,He,2.300000e+01,2.300000e+01,0.000000,5.500000e-07,7.200000e-10,0.000000e+00,3-May,,...,False,False,False,False,True,False,False,False,[],"[2372.3, 5250.5]"
2,Lithium,Li,6.000000e-07,6.000000e-09,0.000170,1.700000e-03,1.800000e-05,3.000000e-06,,,...,False,False,True,False,False,False,True,False,[1],"[520.2, 7298.1, 11815.0]"
3,Beryllium,Be,1.000000e-07,1.000000e-08,0.000003,1.900000e-04,6.000000e-11,4.000000e-08,,,...,False,False,True,False,False,False,True,False,[2],"[899.5, 1757.1, 14848.7, 21006.6]"
4,Boron,B,1.000000e-07,2.000000e-07,0.000160,8.600000e-04,4.400000e-04,7.000000e-05,,"Alpha Rhombohedral Boron, Beta Rhombohedral Bo...",...,False,False,False,True,False,False,True,False,[3],"[800.6, 2427.1, 3659.7, 25025.8, 32826.7]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113,Flerovium,Fl,0.000000e+00,0.000000e+00,0.000000,0.000000e+00,0.000000e+00,0.000000e+00,,,...,False,False,False,False,False,False,False,False,[2],"[832.2, 1600.0, 3370.0, 4400.0, 5850.0]"
114,Moscovium,Mc,0.000000e+00,0.000000e+00,0.000000,0.000000e+00,0.000000e+00,0.000000e+00,,,...,False,False,False,False,False,False,False,False,[3],"[538.3, 1760.0, 2650.0, 4680.0, 5720.0]"
115,Livermorium,Lv,0.000000e+00,0.000000e+00,0.000000,0.000000e+00,0.000000e+00,0.000000e+00,,,...,False,False,False,False,False,False,False,False,[-2],"[663.9, 1330.0, 2850.0, 3810.0, 6080.0]"
116,Tennessine,Ts,0.000000e+00,0.000000e+00,0.000000,0.000000e+00,0.000000e+00,0.000000e+00,,,...,False,False,False,False,False,False,False,False,[-1],"[736.9, 1435.4, 2161.9, 4012.9, 5076.4]"


Let's check the node store for the elements.


In [7]:
element_node_store = mdb.get_node_store("elements")
print(element_node_store)

NODE STORE SUMMARY
Node type: elements
• Number of nodes: 118
• Number of features: 99
Storage path: MatGraphDB\nodes\elements


############################################################
METADATA
############################################################
• class: NodeStore
• class_module: matgraphdb.core.nodes
• node_type: elements
• name_column: id

############################################################
NODE DETAILS
############################################################
• Columns:
    - abundance_crust
    - abundance_human
    - abundance_meteor
    - abundance_ocean
    - abundance_solar
    - abundance_universe
    - adiabatic_index
    - allotropes
    - appearance
    - atomic_mass
    - atomic_number
    - block
    - boiling_point
    - classifications_cas_number
    - classifications_cid_number
    - classifications_dot_hazard_class
    - classifications_dot_numbers
    - classifications_rtecs_number
    - coefficient_of_linear_thermal_expansion
    - conducti

### Material-Element Edge Generator

#### 1. Define the Generator

An **edge generator** is similar to a node generator but returns a PyArrow Table describing edges. Each generated edge must have at least these fields:

- `source_id` (int)
- `source_type` (string)
- `target_id` (int)
- `target_type` (string)

Additionally, an edge generator must take the corresponding node stores of the `MatGraphDB` instance as arguments. This ensures that the node ids are valid and come from the correct node store.

For edges we use the `edge_generator` decorator.
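
The required fields and the id check described above can be sketched as a small validation routine. This is an illustration in plain Python, not the library's own validation: the `validate_edges` function, the `REQUIRED_EDGE_FIELDS` mapping, and the sample ids are all hypothetical, and edges are represented as dicts.

```python
from typing import Dict, List, Set

# Required edge fields and their expected Python types.
REQUIRED_EDGE_FIELDS = {
    "source_id": int,
    "source_type": str,
    "target_id": int,
    "target_type": str,
}


def validate_edges(edges: List[dict], valid_ids: Dict[str, Set[int]]) -> List[str]:
    """Return a list of problems; valid_ids maps a node type to its known ids."""
    problems = []
    for i, edge in enumerate(edges):
        # Check that every required field is present and has the expected type.
        for field, expected in REQUIRED_EDGE_FIELDS.items():
            if field not in edge:
                problems.append(f"edge {i}: missing {field}")
            elif not isinstance(edge[field], expected):
                problems.append(f"edge {i}: {field} should be {expected.__name__}")
        # Check that each endpoint id exists in the node store of its type.
        for side in ("source", "target"):
            node_type = edge.get(f"{side}_type")
            node_id = edge.get(f"{side}_id")
            if node_type is None or node_id is None:
                continue  # already reported as missing above
            if node_type not in valid_ids:
                problems.append(f"edge {i}: unknown node type {node_type!r}")
            elif node_id not in valid_ids[node_type]:
                problems.append(f"edge {i}: {side}_id {node_id} not in {node_type!r}")
    return problems


valid_ids = {"materials": {0, 1}, "elements": {0, 1, 2}}
edges = [
    {"source_id": 0, "source_type": "materials", "target_id": 2, "target_type": "elements"},
    {"source_id": 5, "source_type": "materials", "target_id": 1, "target_type": "elements"},
    {"source_id": 1, "source_type": "materials", "target_type": "elements"},
]
problems = validate_edges(edges, valid_ids)
```

Here the second edge points at a material id that does not exist and the third lacks `target_id`, so both are reported while the first edge passes.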


In [8]:
from matgraphdb import edge_generator
import pyarrow as pa


@edge_generator
def material_element_has(
    material_store, element_store
):  # We have the material_store and element_store as an argument
    try:
        connection_name = "has"

        # We select only the necessary columns from the node stores
        material_table = material_store.read_nodes(
            columns=["id", "core.material_id", "core.elements"]
        )
        element_table = element_store.read_nodes(columns=["id", "symbol"])

        # We rename for utility purposes
        material_table = material_table.rename_columns(
            {"id": "source_id", "core.material_id": "material_name"}
        )
        material_table = material_table.append_column(
            "source_type", pa.array(["material"] * material_table.num_rows)
        )

        element_table = element_table.rename_columns({"id": "target_id"})
        element_table = element_table.append_column(
            "target_type", pa.array(["elements"] * element_table.num_rows)
        )

        # We convert the tables to pandas for easier manipulation
        material_df = material_table.to_pandas()
        element_df = element_table.to_pandas()

        # We create a map of the element symbols to the target_id for quick lookup
        element_target_id_map = {
            row["symbol"]: row["target_id"] for _, row in element_df.iterrows()
        }

        # We create a dictionary to store the edge data
        table_dict = {
            "source_id": [],
            "source_type": [],
            "target_id": [],
            "target_type": [],
            "edge_type": [],
            "name": [],
            "weight": [],
        }

        # We iterate over the material nodes
        for _, row in material_df.iterrows():
            # We get the elements composing the material
            elements = row["core.elements"]
            source_id = row["source_id"]
            material_name = row["material_name"]
            if elements is None:
                continue

            # We iterate over the elements
            for element in elements:
                # We get the target_id for the element
                target_id = element_target_id_map[element]

                # We append the edge data to the dictionary. Here we could also define the reverse edge as well.
                table_dict["source_id"].append(source_id)
                table_dict["source_type"].append(material_store.node_type)
                table_dict["target_id"].append(target_id)
                table_dict["target_type"].append(element_store.node_type)
                table_dict["edge_type"].append(connection_name)

                name = f"{material_name}_{connection_name}_{element}"
                table_dict["name"].append(name)
                table_dict["weight"].append(1.0)

        # edge_table = ParquetDB.construct_table(table_dict)

        # logger.debug(
        #     f"Created material-element-has relationships. Shape: {edge_table.shape}"
        # )
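        df = pd.DataFrame(table_dict)
    except Exception as e:
        print(f"Error creating material-element-has relationships: {e}")
        # Return early: `df` is undefined here, so falling through to
        # `return df` would raise UnboundLocalError
        return None

    return df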

#### 2. Add the Generator to the MatGraphDB

Now that we have defined the generator, we can add it to the `MatGraphDB` instance. We do this by calling the `add_edge_generator` method.

The edge generator will be stored in the `edge_generator_store` of the `MatGraphDB` instance.


In [9]:
element_store = mdb.get_node_store("elements")
material_store = mdb.get_node_store("materials")

mdb.add_edge_generator(
    generator_func=material_element_has,
    generator_args={
        "material_store": material_store,
        "element_store": element_store,
    },
    generator_kwargs={},
    run_immediately=True,
)

Let's check the edge generator store.

In [10]:
print(mdb.edge_generator_store)

GENERATOR STORE SUMMARY
• Number of generators: 1
Storage path: MatGraphDB\edge_generators


############################################################
METADATA
############################################################
• class: GeneratorStore
• class_module: matgraphdb.core.generator_store

############################################################
GENERATOR DETAILS
############################################################
• Columns:
    - generator_args.element_store
    - generator_args.material_store
    - generator_func
    - generator_name
    - id

• Generator names:
    - material_element_has



Let's check whether the edge generator created the edges in the edge store.

In [11]:
edge_store = mdb.get_edge_store("material_element_has")
print(edge_store)

EDGE STORE SUMMARY
Edge type: material_element_has
• Number of edges: 3348
• Number of features: 8
Storage path: MatGraphDB\edges\material_element_has


############################################################
METADATA
############################################################
• class: EdgeStore
• class_module: matgraphdb.core.edges

############################################################
EDGE DETAILS
############################################################
• Columns:
    - edge_type
    - id
    - name
    - source_id
    - source_type
    - target_id
    - target_type
    - weight



### Updates to Node Stores

By default, when node and edge generators are added, the stores passed as their arguments are registered as dependencies in the `MatGraphDB` instance. When a parent store is updated, each dependent generator reruns and refreshes its corresponding store.

These dependencies are stored in the `MatGraphDB/generator_dependency.json` file.
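
The rerun behaviour can be pictured as a walk over a dependency map. The sketch below is a hedged illustration: the JSON layout (store name mapped to the generators that depend on it) and the `generators_to_rerun` helper are hypothetical, since the actual schema of `generator_dependency.json` is not shown in this notebook.

```python
import json
from collections import deque
from typing import Dict, List

# Hypothetical layout: each store name maps to the generators that depend on it.
dependency_json = """
{
    "materials": ["material_element_has"],
    "elements": ["material_element_has"],
    "material_element_has": []
}
"""


def generators_to_rerun(dependencies: Dict[str, List[str]], updated_store: str) -> List[str]:
    """Breadth-first walk: collect every generator downstream of updated_store."""
    order: List[str] = []
    seen = set()
    queue = deque(dependencies.get(updated_store, []))
    while queue:
        name = queue.popleft()
        if name in seen:
            continue  # run each generator once, and guard against cycles
        seen.add(name)
        order.append(name)
        # A regenerated store may itself have dependents.
        queue.extend(dependencies.get(name, []))
    return order


dependencies = json.loads(dependency_json)
rerun = generators_to_rerun(dependencies, "materials")
chained = generators_to_rerun({"a": ["b"], "b": ["c"], "c": []}, "a")
```

With this layout, updating `materials` selects `material_element_has` for a rerun, and the walk then checks whether that store has dependents of its own.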


In [12]:
materials_df = mdb.read_materials(columns=["id"], ids=[0]).to_pandas()
print(materials_df)

mdb.delete_materials(ids=[0])

materials_df = mdb.read_materials(columns=["id"], ids=[0]).to_pandas()
print(materials_df)

   id
0   0
2025-02-11 10:52:11 - matgraphdb.materials.nodes.materials - INFO - Deleting data [0]
2025-02-11 10:52:11 - matgraphdb.materials.nodes.materials - INFO - Data deleted successfully.
2025-02-11 10:52:11 - matgraphdb.core.graph_db - INFO - Running dependent generators: materials
2025-02-11 10:52:11 - matgraphdb.core.graph_db - INFO - Running dependent generator: material_element_has
2025-02-11 10:52:12 - matgraphdb.core.graph_db - INFO - Removing existing edge store: material_element_has
2025-02-11 10:52:12 - matgraphdb.core.graph_db - INFO - Removing edge store of type material_element_has
2025-02-11 10:52:12 - matgraphdb.core.graph_db - INFO - Running dependent generators: material_element_has
2025-02-11 10:52:12 - matgraphdb.core.graph_db - INFO - Creating edges of type 'material_element_has'
2025-02-11 10:52:12 - matgraphdb.core.graph_db - INFO - Creating new EdgeStore for type: material_element_has
2025-02-11 10:52:12 - matgraphdb.core.edges - INFO - Successfully created 

In [13]:
edge_store = mdb.get_edge_store("material_element_has")
print(edge_store)

EDGE STORE SUMMARY
Edge type: material_element_has
• Number of edges: 3345
• Number of features: 8
Storage path: MatGraphDB\edges\material_element_has


############################################################
METADATA
############################################################
• class: EdgeStore
• class_module: matgraphdb.core.edges

############################################################
EDGE DETAILS
############################################################
• Columns:
    - edge_type
    - id
    - name
    - source_id
    - source_type
    - target_id
    - target_type
    - weight



In [14]:
df = edge_store.read_edges().to_pandas()
print(df)

     edge_type    id               name  source_id source_type  target_id  \
0          has     0   mp-1222351_has_F          1   materials          8   
1          has     1  mp-1222351_has_Fe          1   materials         25   
2          has     2  mp-1222351_has_Li          1   materials          2   
3          has     3    mp-651087_has_F          2   materials          8   
4          has     4   mp-651087_has_Gd          2   materials         63   
...        ...   ...                ...        ...         ...        ...   
3340       has  3340  mp-2714707_has_Al        999   materials         12   
3341       has  3341  mp-2714707_has_Na        999   materials         10   
3342       has  3342   mp-2714707_has_O        999   materials          7   
3343       has  3343   mp-2714707_has_S        999   materials         15   
3344       has  3344  mp-2714707_has_Zn        999   materials         29   

     target_type  weight  
0       elements     1.0  
1       elements     

In [15]:
print(mdb)

GRAPH DATABASE SUMMARY
Name: MatGraphDB
Storage path: MatGraphDB
└── Repository structure:
    ├── nodes/                 (MatGraphDB\nodes)
    ├── edges/                 (MatGraphDB\edges)
    ├── edge_generators/       (MatGraphDB\edge_generators)
    ├── node_generators/       (MatGraphDB\node_generators)
    └── graph/                 (MatGraphDB\graph)

############################################################
NODE DETAILS
############################################################
Total node types: 2
------------------------------------------------------------
• Node type: materials
  - Number of nodes: 999
  - Number of features: 136
  - Columns:
       - bonding.cutoff_method.bond_connections
       - bonding.electric_consistent.bond_connections
       - bonding.electric_consistent.bond_orders
       - bonding.geometric_consistent.bond_connections
       - bonding.geometric_consistent.bond_orders
       - bonding.geometric_electric_consistent.bond_connections
       - bond

## Summary

In this notebook, we defined custom node and edge generators, added them to the graph database, and ran them. We also saw how dependent generators rerun when a parent store is updated.