In [1]:
!pip install parquetdb
!pip install ipykernel



# 01 - Managing Graphs in ParquetGraphDB

In this notebook, we'll learn how to:

1. Add new node types and nodes.
2. Read and update nodes through a node store.
3. Add new edge types and edges.
4. Update edges.

Node and edge generators, which automatically produce nodes and edges from a predefined function, are covered in the next notebook.

We'll use the `ParquetGraphDB` class from `parquetdb` to demonstrate these features. If you haven't already installed `parquetdb`, run the previous cell.

## Setup

Here we set up the directory structure where our test data will be stored.

In [1]:
from pathlib import Path
import shutil
import pandas as pd


FILE_DIR = Path(".")
DATA_DIR = FILE_DIR / "data"

if DATA_DIR.exists():
    shutil.rmtree(DATA_DIR)
    
DATA_DIR.mkdir(parents=True, exist_ok=True)

## Initializing ParquetGraphDB


Here we initialize the `ParquetGraphDB` instance. We can print a summary of the database by printing the instance directly or by calling the `summary` method. The `summary` method accepts additional keyword arguments that show more information, such as column names (fields).

In [2]:
from parquetdb import ParquetGraphDB

# Create a fresh directory for our database
GRAPH_DB_DIR = DATA_DIR / "GraphDB"
if GRAPH_DB_DIR.exists():
    shutil.rmtree(GRAPH_DB_DIR)
GRAPH_DB_DIR.mkdir(parents=True, exist_ok=True)


# Initialize ParquetGraphDB
db = ParquetGraphDB(storage_path=GRAPH_DB_DIR)

print(db)

[INFO] 2025-04-30 08:10:23 - parquetdb.utils.config[37][load_config] - Config file: C:\Users\lllang\AppData\Local\parquetdb\parquetdb\config.yml
[INFO] 2025-04-30 08:10:23 - parquetdb.utils.config[41][load_config] - Setting data_dir to Z:\data\parquetdb\data
[INFO] 2025-04-30 08:10:24 - parquetdb.graph.parquet_graphdb[39][__init__] - Initializing GraphDB at root path: GraphDB
[INFO] 2025-04-30 08:10:24 - parquetdb.graph.parquet_graphdb[174][_load_existing_node_stores] - Loading existing node stores
[INFO] 2025-04-30 08:10:24 - parquetdb.graph.parquet_graphdb[203][_load_existing_stores] - Found 0 store types
[INFO] 2025-04-30 08:10:24 - parquetdb.graph.parquet_graphdb[182][_load_existing_edge_stores] - Loading existing edge stores
[INFO] 2025-04-30 08:10:24 - parquetdb.graph.parquet_graphdb[203][_load_existing_stores] - Found 0 store types
[INFO] 2025-04-30 08:10:24 - parquetdb.core.parquetdb[200][__init__] - Initializing ParquetDB with db_path: c:\Users\lllang\Desktop\Current_Projects\

Each `ParquetGraphDB` is a directory containing the following:

- `nodes`: A directory containing the node data. Nodes are stored here as `NodeStore` instances, which extend the `ParquetDB` class.
- `edges`: A directory containing the edge data. Edges are stored here as `EdgeStore` instances, which extend the `ParquetDB` class.
- `node_generators`: A directory containing the node generator data. It stores the node generator functions, which control the creation and update of child nodes that depend on parent nodes or edges.
- `edge_generators`: A directory containing the edge generator data. It stores the edge generator functions, which control the creation and update of child edges that depend on parent nodes or edges.
- `generator_dependency.json`: A JSON file containing the dependency graph of the generators. It is used to determine the order in which the generators execute.
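To check this layout on disk, a small standard-library helper can list the top-level entries of a database directory. This is only a sketch: `describe_layout` is a hypothetical helper, not part of `parquetdb`, and the stand-in directory below is populated by hand with the names from the list above rather than by the library.

```python
import tempfile
from pathlib import Path


def describe_layout(root):
    """Map each top-level entry under `root` to 'dir' or 'file'."""
    return {
        entry.name: ("dir" if entry.is_dir() else "file")
        for entry in sorted(Path(root).iterdir())
    }


# Stand-in directory built by hand for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("nodes", "edges", "node_generators", "edge_generators"):
        (root / name).mkdir()
    (root / "generator_dependency.json").write_text("{}")
    layout = describe_layout(root)

for name, kind in layout.items():
    print(f"{kind:4} {name}")
```

With a real database, you would pass `GRAPH_DB_DIR` to the helper instead; the entries it reports may differ from the list above depending on the library version.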

As you can see, it is currently empty. Let's add some nodes and edges to the database.


## 1. New Nodes

By default, the database contains no node types. You can initialize a new node type in three ways:

1. You can add your own node types via `add_node_type(...)`, which creates an empty `NodeStore` for that type. 
2. You can add nodes directly by using the `add_nodes(node_type, data)` method and supply the `node_type` and `data`.
3. You can add `NodeStore` instances directly by using the `add_node_store(node_store)` method.

Once the node type is initialized, the main method to add nodes is through the `add_nodes(node_type, data)` method.


Let's initialize a new node type.


In [3]:
# Add a node type called 'users'
custom_node_type = "users"

db.add_node_type(custom_node_type)

# These nodes will be stored in GraphDB/nodes/users
print("Current node_stores:", list(db.node_stores.keys()))

print(db.summary(show_column_names=True))

Current node_stores: ['users']
GRAPH DATABASE SUMMARY
Name: GraphDB
Storage path: GraphDB
└── Repository structure:
    ├── nodes/                 (GraphDB\nodes)
    ├── edges/                 (GraphDB\edges)
    ├── edge_generators/       (GraphDB\edge_generators)
    ├── node_generators/       (GraphDB\node_generators)
    └── graph/                 (GraphDB\graph)

############################################################
NODE DETAILS
############################################################
Total node types: 1
------------------------------------------------------------
• Node type: users
  - Number of nodes: 0
  - Number of features: 1
  - Columns:
       - id
  - db_path: GraphDB\nodes\users
------------------------------------------------------------

############################################################
EDGE DETAILS
############################################################
Total edge types: 0
------------------------------------------------------------

#######

The node store instances are stored in the `node_stores` attribute, which is a dictionary of `node_type` to `NodeStore` instances.

When we print the summary now, the new node type is included. We also print the column names for the node store: the `id` column is the unique local identifier for each node, and new nodes are assigned an id automatically.

Our database is currently empty, so let's add some nodes to it.


### Adding Nodes

As mentioned above, once a node type is registered, you can add nodes to it using the `add_nodes(node_type, data)` method. The `data` argument can take the following forms:

1. list of dictionaries (Each dictionary represents a node)
2. dictionary of arrays (Each key is a column name and each value is an array holding that column's values, one entry per node)
3. `pandas.DataFrame` (Each row is a node)
4. `pyarrow.Table` (Each row is a node)

> Note: calling `add_nodes` with a node type that does not exist yet registers that node type automatically.
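The first two input forms carry the same information in different orientations. As a rough illustration in plain Python (the helper `rows_to_columns` is hypothetical and is not how `parquetdb` converts data internally), a list of row dictionaries can be turned into a dictionary of columns like this:

```python
def rows_to_columns(rows):
    """Convert a list of row dicts into a dict of equal-length column lists.

    Keys missing from a row are filled with None.
    """
    column_names = []
    for row in rows:
        for key in row:
            if key not in column_names:
                column_names.append(key)
    return {name: [row.get(name) for row in rows] for name in column_names}


users_as_rows = [{"name": "Jimmy"}, {"name": "John"}]
users_as_columns = rows_to_columns(users_as_rows)
print(users_as_columns)  # {'name': ['Jimmy', 'John']}
```

Either form should describe the same two nodes when passed as `data`.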


Here, we'll add data to the existing node type and add a new node type at the same time.

In [4]:
# Add some user nodes
users = [
    {"name": "Jimmy"},
    {"name": "John"},
]

computers = [
    {
        "name": "Computer1",
        "specs": {"cpu": "AMD Ryzen 9", "ram": "32GB", "storage": "1TB"},
    },
    {
        "name": "Computer2",
        "specs": {"cpu": "Intel i7", "ram": "16GB", "storage": "512GB"},
    },
]

users_node_type = "users"
computers_node_type = "computers"

db.add_nodes(node_type=users_node_type, data=users)
db.add_nodes(node_type=computers_node_type, data=computers)

print(db.summary(show_column_names=True))

GRAPH DATABASE SUMMARY
Name: GraphDB
Storage path: GraphDB
└── Repository structure:
    ├── nodes/                 (GraphDB\nodes)
    ├── edges/                 (GraphDB\edges)
    ├── edge_generators/       (GraphDB\edge_generators)
    ├── node_generators/       (GraphDB\node_generators)
    └── graph/                 (GraphDB\graph)

############################################################
NODE DETAILS
############################################################
Total node types: 2
------------------------------------------------------------
• Node type: users
  - Number of nodes: 2
  - Number of features: 2
  - Columns:
       - id
       - name
  - db_path: GraphDB\nodes\users
------------------------------------------------------------
• Node type: computers
  - Number of nodes: 2
  - Number of features: 5
  - Columns:
       - id
       - name
       - specs.cpu
       - specs.ram
       - specs.storage
  - db_path: GraphDB\nodes\computers
---------------------------------

Great! Now we have two node types, `users` and `computers`, each with some nodes. The summary now lists both node types with their details. Note that the nested `specs` dictionary appears as the flattened columns `specs.cpu`, `specs.ram`, and `specs.storage`.
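The dotted column names in the summary (`specs.cpu`, `specs.ram`, `specs.storage`) suggest that nested dictionaries are stored as flattened columns. The sketch below mimics that naming in plain Python; `flatten_record` is a hypothetical helper for illustration and does not reflect the library's actual implementation.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the dotted prefix along.
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


computer = {
    "name": "Computer1",
    "specs": {"cpu": "AMD Ryzen 9", "ram": "32GB", "storage": "1TB"},
}
print(sorted(flatten_record(computer)))
# ['name', 'specs.cpu', 'specs.ram', 'specs.storage']
```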

Now that we have added some nodes, let's look at how to manage them.


### Managing the node store

Once the data is registered, you can access it through the corresponding node store. You can get the node store either through the `node_stores` attribute or the `get_node_store(node_type)` method.


In [5]:
computers_node_store = db.get_node_store(computers_node_type)
print(type(computers_node_store))
print(computers_node_store)


users_node_store = db.node_stores[users_node_type]
print(type(users_node_store))

print(users_node_store)

<class 'parquetdb.graph.nodes.NodeStore'>
NODE STORE SUMMARY
Node type: computers
• Number of nodes: 2
• Number of features: 5
Storage path: GraphDB\nodes\computers


############################################################
METADATA
############################################################
• class: NodeStore
• class_module: parquetdb.graph.nodes
• node_type: computers
• name_column: id

############################################################
NODE DETAILS
############################################################

<class 'parquetdb.graph.nodes.NodeStore'>
NODE STORE SUMMARY
Node type: users
• Number of nodes: 2
• Number of features: 2
Storage path: GraphDB\nodes\users


############################################################
METADATA
############################################################
• class: NodeStore
• class_module: parquetdb.graph.nodes
• node_type: users
• name_column: id

############################################################
NODE DETAILS
########

### Reading from the node store

There are multiple ways to read from a node store: the `read_nodes` method of the `ParquetGraphDB` instance, the `read_nodes` method of the `NodeStore` instance, or the `read` method of the `NodeStore` instance. These methods behave much like the read features introduced in the previous notebook; for example, you can select specific columns and apply filters.

In [9]:
import pyarrow.compute as pc

df = db.read_nodes(node_type=users_node_type).to_pandas()
print(df)

print("-"*100)

df = computers_node_store.read().to_pandas()

print(df)

print("-"*100)

# We can filter this similar to ParquetDB
df = computers_node_store.read(
    filters=[pc.field("specs.cpu") == "Intel i7"]
).to_pandas()

print(df)
print("-"*100)


# Note: with rebuild_nested_struct=True, nested fields are addressed as
# pc.field("specs", "cpu") rather than by the flattened name pc.field("specs.cpu")
df = computers_node_store.read_nodes(
    columns=["name", "id", "specs"],
    filters=[pc.field("specs", "cpu") == "AMD Ryzen 9"],
    rebuild_nested_struct=True,
).to_pandas()
print(df)




   id   name
0   0  Jimmy
1   1   John
----------------------------------------------------------------------------------------------------
   id       name    specs.cpu specs.ram specs.storage
0   0  Computer1  AMD Ryzen 9      32GB           1TB
1   1  Computer2     Intel i7      16GB         512GB
----------------------------------------------------------------------------------------------------
   id       name specs.cpu specs.ram specs.storage
0   1  Computer2  Intel i7      16GB         512GB
----------------------------------------------------------------------------------------------------
        name  id                                              specs
0  Computer1   0  {'cpu': 'AMD Ryzen 9', 'ram': '32GB', 'storage...


### Updating the node store

You can update the node store by using the `update_nodes` method from the `ParquetGraphDB` instance, or the `update_nodes` method from the `NodeStore` instance.

In [10]:
computer_update_data = [
    {"name": "Computer1", "specs": {"ram": "128GB", "storage": "1TB"}},
    {"name": "Computer2", "specs": {"ram": "256GB", "storage": "2TB"}},
]

db.update_nodes(
    node_type=computers_node_type, data=computer_update_data, update_keys=["name"]
)

df = db.read_nodes(node_type=computers_node_type).to_pandas()
print(df)

   id       name    specs.cpu specs.ram specs.storage
0   0  Computer1  AMD Ryzen 9     128GB           1TB
1   1  Computer2     Intel i7     256GB           2TB
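In the output above, only the supplied fields changed: `specs.ram` and `specs.storage` were overwritten, while `specs.cpu` kept its value. A plain-Python sketch of this kind of key-matched partial update follows. The helpers `deep_merge` and `update_records` are hypothetical, and silently skipping updates with no matching row is an assumption of the sketch, not documented library behavior.

```python
import copy


def deep_merge(target, patch):
    """Recursively copy values from `patch` into `target`, in place."""
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            deep_merge(target[key], value)
        else:
            target[key] = value
    return target


def update_records(records, updates, update_key):
    """Return a copy of `records` with each update merged into its matching row."""
    result = copy.deepcopy(records)
    index = {row[update_key]: row for row in result}
    for patch in updates:
        row = index.get(patch[update_key])
        if row is not None:  # assumption: unmatched updates are skipped
            deep_merge(row, patch)
    return result


computers = [
    {"id": 0, "name": "Computer1",
     "specs": {"cpu": "AMD Ryzen 9", "ram": "32GB", "storage": "1TB"}},
    {"id": 1, "name": "Computer2",
     "specs": {"cpu": "Intel i7", "ram": "16GB", "storage": "512GB"}},
]
updated = update_records(
    computers,
    [{"name": "Computer1", "specs": {"ram": "128GB"}}],
    update_key="name",
)
print(updated[0]["specs"])
# {'cpu': 'AMD Ryzen 9', 'ram': '128GB', 'storage': '1TB'}
```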


## 2. Adding New Edges

Edges are managed in the same way as nodes, but they are stored in `EdgeStore` instances. An `EdgeStore` differs from a `NodeStore` in that it must store the source and target node ids, as well as the edge type. These must be specified to add an edge.

You can create a new edge type using `add_edge_type(edge_type)`. Then, you can add edges by calling `add_edges(edge_type, data)`. Each edge must specify:
- `source_id` and `source_type`
- `target_id` and `target_type`


The ids and types must match the ids and node types of nodes that exist in the `ParquetGraphDB`.
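Before adding edges, it can help to check that every endpoint refers to an existing node. The following plain-Python sketch does this against a dictionary of known ids per node type; `find_dangling_edges` is a hypothetical helper, not a `parquetdb` function.

```python
def find_dangling_edges(edges, node_ids_by_type):
    """Return the edges whose source or target is not a known node."""
    dangling = []
    for edge in edges:
        source_ok = edge["source_id"] in node_ids_by_type.get(edge["source_type"], set())
        target_ok = edge["target_id"] in node_ids_by_type.get(edge["target_type"], set())
        if not (source_ok and target_ok):
            dangling.append(edge)
    return dangling


node_ids_by_type = {"users": {0, 1}, "computers": {0, 1}}
candidate_edges = [
    {"source_id": 0, "source_type": "users", "target_id": 1, "target_type": "computers"},
    {"source_id": 5, "source_type": "users", "target_id": 0, "target_type": "computers"},
]
dangling_edges = find_dangling_edges(candidate_edges, node_ids_by_type)
print(len(dangling_edges))  # 1
```

With a real database, the id sets could be collected from the `id` column returned by `db.read_nodes(node_type=...)` for each node type.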

In [41]:
# Name of the new edge type
edge_type_test = "user_access"

# Connect 'users' nodes to 'computers' nodes (and computers to each other)
edge_data = [
    {
        "source_id": 0,  # This is the id of the user node
        "source_type": users_node_type,
        "target_id": 0,  # This is the id of the computer node
        "target_type": computers_node_type,
        "edge_type": edge_type_test,
        "name": "Jimmy has access to Computer1",
    },
    {
        "source_id": 0,  # This is the id of the user node
        "source_type": users_node_type,
        "target_id": 1,  # This is the id of the computer node
        "target_type": computers_node_type,
        "edge_type": edge_type_test,
        "name": "Jimmy has access to Computer2",
    },
    {
        "source_id": 1,
        "source_type": users_node_type,
        "target_id": 1,
        "target_type": computers_node_type,
        "edge_type": edge_type_test,
        "name": "John has access to Computer2",
    },
    {
        "source_id": 0,
        "source_type": computers_node_type,
        "target_id": 1,
        "target_type": computers_node_type,
        "edge_type": edge_type_test,
        "name": "Computer1 has access to Computer2",
    },
    {
        "source_id": 1,
        "source_type": computers_node_type,
        "target_id": 0,
        "target_type": computers_node_type,
        "edge_type": edge_type_test,
        "name": "Computer2 has access to Computer1",
    },
    {
        "source_id": 0,
        "source_type": computers_node_type,
        "target_id": 0,
        "target_type": computers_node_type,
        "edge_type": edge_type_test,
        "name": "Computer1 has access to Computer1",
        "extra_detail": "This is the main computer",
    },
]

db.add_edges(edge_type=edge_type_test, data=edge_data)

edges = db.read_edges(edge_type=edge_type_test)
print(f"Number of edges of type '{edge_type_test}':", len(edges))
df_edges = edges.to_pandas()
print(df_edges)

Number of edges of type 'user_access': 6
     edge_type               extra_detail  id  \
0  user_access                       None   0   
1  user_access                       None   1   
2  user_access                       None   2   
3  user_access                       None   3   
4  user_access                       None   4   
5  user_access  This is the main computer   5   

                                name  source_id source_type  target_id  \
0      Jimmy has access to Computer1          0       users          0   
1      Jimmy has access to Computer2          0       users          1   
2       John has access to Computer2          1       users          1   
3  Computer1 has access to Computer2          0   computers          1   
4  Computer2 has access to Computer1          1   computers          0   
5  Computer1 has access to Computer1          0   computers          0   

  target_type  
0   computers  
1   computers  
2   computers  
3   computers  
4   computers  
5 

In this example, we defined access edges between users and computers. Note that we can create self-edges, and that we set the direction of an edge by choosing which node is the source and which is the target.

We are also free to add extra columns (features) to the edges, such as `extra_detail` here.
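To make directionality and self-edges concrete, the sketch below builds an adjacency mapping from edge records in plain Python. `build_adjacency` is a hypothetical helper for illustration; nodes are identified by a `(type, id)` pair because ids are only unique within a node type.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map each (type, id) source node to the list of (type, id) targets."""
    adjacency = defaultdict(list)
    for edge in edges:
        source = (edge["source_type"], edge["source_id"])
        target = (edge["target_type"], edge["target_id"])
        adjacency[source].append(target)
    return dict(adjacency)


edges = [
    {"source_type": "users", "source_id": 0, "target_type": "computers", "target_id": 0},
    {"source_type": "computers", "source_id": 0, "target_type": "computers", "target_id": 1},
    {"source_type": "computers", "source_id": 0, "target_type": "computers", "target_id": 0},
]
adjacency = build_adjacency(edges)

# A self-edge is one where a node appears among its own targets.
self_loops = [node for node, targets in adjacency.items() if node in targets]
print(self_loops)  # [('computers', 0)]
```

Because edges are directed, `('users', 0)` points to `('computers', 0)` but not the other way around.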

### Updating the edges

You can update the edges by using the `update_edges` method from the `ParquetGraphDB` instance, or the `update_edges` method from the `EdgeStore` instance.


In [42]:
update_data = [
    {"id": 0, "weight": 1.0},
    {"id": 1, "weight": 1.0},
]

db.update_edges(edge_type=edge_type_test, data=update_data)

edges = db.read_edges(
    edge_type=edge_type_test, columns=["id", "source_id", "target_id", "weight", "name"]
).to_pandas()
print(f"Number of edges of type '{edge_type_test}':", len(edges))
print(edges)

Number of edges of type 'user_access': 6
   id  source_id  target_id  weight                               name
0   0          0          0     1.0      Jimmy has access to Computer1
1   1          0          1     1.0      Jimmy has access to Computer2
2   2          1          1     NaN       John has access to Computer2
3   3          0          1     NaN  Computer1 has access to Computer2
4   4          1          0     NaN  Computer2 has access to Computer1
5   5          0          0     NaN  Computer1 has access to Computer1


We can also update by specifying the source and target ids and types. To do this we need to specify `source_id`, `target_id`, `source_type`, and `target_type` in the `update_keys` argument.
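All four keys matter because ids are only unique within a node type: in the edge table above, edges 0 and 5 share `source_id` 0 and `target_id` 0 and differ only in `source_type`. The plain-Python sketch below shows matching on a tuple of key columns and filling a newly introduced column with `None` for untouched rows. `update_by_keys` is a hypothetical helper, and the sketch assumes each key combination identifies at most one row.

```python
def update_by_keys(rows, updates, update_keys):
    """Apply updates to rows matched on a tuple of key columns.

    Columns touched by an update are added to every row, with None
    where no value was supplied.
    """
    result = [dict(row) for row in rows]
    index = {tuple(row[k] for k in update_keys): row for row in result}
    touched_columns = set()
    for patch in updates:
        row = index.get(tuple(patch[k] for k in update_keys))
        if row is not None:
            row.update(patch)
            touched_columns.update(patch)
    for row in result:
        for column in touched_columns:
            row.setdefault(column, None)
    return result


rows = [
    {"id": 0, "source_id": 0, "source_type": "users",
     "target_id": 0, "target_type": "computers"},
    {"id": 5, "source_id": 0, "source_type": "computers",
     "target_id": 0, "target_type": "computers"},
]
keys = ["source_id", "source_type", "target_id", "target_type"]
patch = {"source_id": 0, "source_type": "users",
         "target_id": 0, "target_type": "computers", "weight": 0.5}
updated_rows = update_by_keys(rows, [patch], keys)
print([(row["id"], row["weight"]) for row in updated_rows])
# [(0, 0.5), (5, None)]
```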


In [43]:
update_data = [
    {
        "source_id": 0,
        "source_type": users_node_type,
        "target_id": 0,
        "target_type": computers_node_type,
        "weight": 0.5,
    },
]

db.update_edges(
    edge_type=edge_type_test,
    data=update_data,
    update_keys=["source_id", "target_id", "source_type", "target_type"],
)


edges = db.read_edges(
    edge_type=edge_type_test, columns=["id", "source_id", "target_id", "weight", "name"]
).to_pandas()
print(f"Number of edges of type '{edge_type_test}':", len(edges))
print(edges)

Number of edges of type 'user_access': 6
   id  source_id  target_id  weight                               name
0   0          0          0     0.5      Jimmy has access to Computer1
1   1          0          1     1.0      Jimmy has access to Computer2
2   2          1          1     NaN       John has access to Computer2
3   3          0          1     NaN  Computer1 has access to Computer2
4   4          1          0     NaN  Computer2 has access to Computer1
5   5          0          0     NaN  Computer1 has access to Computer1


In [44]:
print(db)

GRAPH DATABASE SUMMARY
Name: GraphDB
Storage path: GraphDB
└── Repository structure:
    ├── nodes/                 (GraphDB\nodes)
    ├── edges/                 (GraphDB\edges)
    ├── edge_generators/       (GraphDB\edge_generators)
    ├── node_generators/       (GraphDB\node_generators)
    └── graph/                 (GraphDB\graph)

############################################################
NODE DETAILS
############################################################
Total node types: 2
------------------------------------------------------------
• Node type: users
  - Number of nodes: 2
  - Number of features: 2
  - db_path: GraphDB\nodes\users
------------------------------------------------------------
• Node type: computers
  - Number of nodes: 2
  - Number of features: 5
  - db_path: GraphDB\nodes\computers
------------------------------------------------------------

############################################################
EDGE DETAILS
####################################

## Conclusion

In this notebook, we explored the process of managing graphs using ParquetGraphDB. Specifically, we:

- Added new node types and registered nodes within those types.
- Learned how to create and manage edge types, including adding and updating edges.
- Explored the functionality of reading and updating data from both node and edge stores.

These capabilities form the foundation for representing and manipulating complex graph-based data efficiently. 

### What's Next?

In the next notebook, we will go into adding node and edge generators. Generators allow the creation of nodes and edges dynamically based on predefined functions. This allows ParquetGraphDB to propagate updates to dependent nodes and edges if there are any changes to the parent nodes or edges.
