# Export and integration with PyG example

This notebook exemplifies how to use the `graphdatascience` and PyTorch Geometric (PyG) Python libraries to:
* Import the CORA dataset directly into GDS
* Sample a part of CORA using the [GDS Random walk with restarts sampling algorithm](https://neo4j.com/docs/graph-data-science/current/algorithms/alpha/rwr/)
* Export the CORA sample client side
* Define and train a Graph Convolutional Neural Network (GCN) on the CORA sample
* Evaluate the GCN on a test set

## Prerequisites

Running this notebook requires a Neo4j server with a recent GDS version (2.2+) installed.
We recommend using Neo4j Desktop with GDS, or AuraDS.

The following Python libraries are also required:
* `graphdatascience` (see [docs](https://neo4j.com/docs/graph-data-science-client/current/installation/) for installation instructions)
* PyG (see [PyG docs](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html) for installation instructions)

## Setup

We start by importing our dependencies and setting up our GDS client connection to the database.

In [None]:
# Install necessary dependencies
%pip install torch torch_scatter torch_sparse torch_geometric pandas graphdatascience

In [None]:
import pandas as pd
from graphdatascience import GraphDataScience
import torch
from torch_geometric.data import Data
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.transforms import RandomNodeSplit
import random
import numpy as np

In [None]:
# Set seeds for consistent results
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

In [None]:
# Override NEO4J_URI and NEO4J_AUTH here according to your setup
NEO4J_URI = "bolt://localhost:7687"
NEO4J_AUTH = None
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)

# Necessary if you enabled Arrow on the db
gds.set_database("neo4j")
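
If you would rather not hard-code connection details in the notebook, one option is to read them from the environment. The sketch below is only an illustration: the helper `read_connection_settings` and the variable names `NEO4J_USER` and `NEO4J_PASSWORD` are assumptions made for this example, not something the `graphdatascience` client requires.

```python
import os


def read_connection_settings(env=os.environ):
    """Return (uri, auth) for the GDS client from environment variables.

    The variable names are a convention assumed for this sketch.
    """
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    user = env.get("NEO4J_USER")
    password = env.get("NEO4J_PASSWORD")
    # Fall back to no authentication, as in the cell above, unless both are set
    auth = (user, password) if user and password else None
    return uri, auth


NEO4J_URI, NEO4J_AUTH = read_connection_settings()
# The result can then be passed on as before:
# gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)
```

With no variables set, this falls back to the same local, unauthenticated defaults used above.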

## Sampling CORA

Next, we use the built-in CORA loader to get the data into GDS. We will then sample it to get a smaller graph to train on.

In [None]:
G = gds.graph.load_cora()

# Let's make sure we have exactly 2708 "Paper" nodes
assert G.node_labels() == ["Paper"]
assert G.node_count() == 2708

In [None]:
# We use the random walk with restarts sampling algorithm with default values
G_sample, _ = gds.alpha.graph.sample.rwr("cora_sample", G, randomSeed=42, concurrency=1)

# We should have somewhere around 0.15 * 2708 ~ 406 nodes in our sample
print(G_sample.node_count())
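
To build some intuition for what random walk with restarts sampling does, here is a toy, pure-Python sketch of the idea. It is not the GDS implementation: the function `rwr_sample`, the restart probability of 0.1, and the ring-shaped toy graph are all assumptions made for illustration. The walk moves to a random neighbor at each step, occasionally jumps back to its start node, and the visited nodes form the sample.

```python
import random


def rwr_sample(adjacency, start_node, target_size, restart_probability=0.1, seed=42):
    """Collect nodes visited by a random walk that sometimes jumps back to the start."""
    rng = random.Random(seed)
    visited = {start_node}
    current = start_node
    # Cap the number of steps so the walk ends even if the target size is unreachable
    for _ in range(100 * target_size):
        if len(visited) >= target_size:
            break
        neighbors = adjacency.get(current, [])
        if not neighbors or rng.random() < restart_probability:
            # Restart: jump back to where the walk began
            current = start_node
        else:
            current = rng.choice(neighbors)
            visited.add(current)
    return visited


# Toy graph: a ring of 10 nodes where each node links to its two neighbors
toy_adjacency = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
toy_sample = rwr_sample(toy_adjacency, start_node=0, target_size=4)
print(sorted(toy_sample))
```

Because the walk keeps returning to its start node, the sample tends to stay in one connected neighborhood rather than being scattered across the graph.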

## Exporting sampled CORA

We can now export the topology and node properties of the sampled graph that we need to train our model.

In [None]:
# By using `by_rel_type` we get the topology in a format that can be used as input to several GNN frameworks
sample_topology = gds.beta.graph.relationships.stream(G_sample).by_rel_type()

# Let's make sure that our expected "CITES" relationship type is the only one present
sample_topology.keys()
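
As we use it further down, the result of `by_rel_type` maps each relationship type to two parallel lists: one of source node ids and one of target node ids. The sketch below builds the same shape from a handful of made-up relationships; the helper `group_by_rel_type` and the toy ids are hypothetical and only meant to illustrate the layout.

```python
from collections import defaultdict


def group_by_rel_type(relationships):
    """Group (source, target, rel_type) triples into {rel_type: [sources, targets]}."""
    topology = defaultdict(lambda: [[], []])
    for source, target, rel_type in relationships:
        topology[rel_type][0].append(source)
        topology[rel_type][1].append(target)
    return dict(topology)


# Made-up relationships, not taken from CORA
toy_relationships = [(10, 11, "CITES"), (11, 12, "CITES"), (12, 10, "CITES")]
toy_topology = group_by_rel_type(toy_relationships)
print(toy_topology)  # {'CITES': [[10, 11, 12], [11, 12, 10]]}
```

This two-row layout is what lets us later turn `sample_topology["CITES"]` into a PyG `edge_index` tensor with very little reshaping.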

In [None]:
# We also need to export the node properties holding our class labels and features, stored in the
# "subject" and "features" node properties respectively
sample_node_properties = gds.graph.nodeProperties.stream(
    G_sample,
    ["subject", "features"],
    separate_property_columns=True,
)

# Let's make sure we got the data we expected
display(sample_node_properties)

## Constructing GCN input

Now that we have all the information we need on the client side, we can construct the PyG `Data` object we will use as training input.

In [None]:
# In order for the node ids used in the `topology` to be consecutive and starting from zero,
# we will need to remap them. This way they will also align with the row numbering of the
# `sample_node_properties` data frame
def normalize_topology_index(new_idx_to_old, topology):
    # Invert the new idx -> old idx mapping so we can look up new ids by old id
    old_idx_to_new = {old: new for new, old in new_idx_to_old.items()}
    return [[old_idx_to_new[node_id] for node_id in nodes] for nodes in topology]
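
To see the remapping in action, here is a small self-contained example. It repeats the function so that it runs on its own; the three node ids and two relationships are made up for illustration.

```python
def normalize_topology_index(new_idx_to_old, topology):
    # Invert the new idx -> old idx mapping so we can look up new ids by old id
    old_idx_to_new = {old: new for new, old in new_idx_to_old.items()}
    return [[old_idx_to_new[node_id] for node_id in nodes] for nodes in topology]


# Hypothetical sample: three nodes with database ids 10, 42 and 7,
# listed in the row order of the node property data frame
toy_new_idx_to_old = {0: 10, 1: 42, 2: 7}
# Two relationships, 10 -> 42 and 7 -> 10, given as [sources, targets]
toy_topology = [[10, 7], [42, 10]]

toy_edge_index = normalize_topology_index(toy_new_idx_to_old, toy_topology)
print(toy_edge_index)  # [[0, 2], [1, 0]]
```

After remapping, every id in the topology is a valid row number in the node property data frame, which is what PyG expects of `edge_index`.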

In [None]:
# We use the ordering of node ids in `sample_node_properties` as our remapping
edge_index = torch.tensor(
    normalize_topology_index(dict(sample_node_properties["nodeId"]), sample_topology["CITES"]), dtype=torch.long
)

# We specify the node property "features" as the zero-layer node embeddings
x = torch.tensor(sample_node_properties["features"], dtype=torch.float)
# We specify the node property "subject" as class labels
y = torch.tensor(sample_node_properties["subject"], dtype=torch.long)

data = Data(x=x, y=y, edge_index=edge_index)
num_classes = y.unique().shape[0]

In [None]:
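# Do a random split of the data so that ~10% goes into a test set and the rest is used for training.
# We reassign `data` since, depending on the PyG version, the transform may return a modified copy
transform = RandomNodeSplit(num_test=40, num_val=0)
data = transform(data)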
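
Conceptually, the split attaches boolean masks to the data that mark which nodes belong to which set. The pure-Python sketch below mimics that idea for a test set of fixed size with all remaining nodes used for training. The helper `random_node_split` is hypothetical, and the node count of 406 is just the approximate sample size estimated earlier, so treat this as an illustration rather than PyG's implementation.

```python
import random


def random_node_split(num_nodes, num_test, seed=42):
    """Return boolean (train_mask, test_mask) lists with num_test random test nodes."""
    rng = random.Random(seed)
    test_nodes = set(rng.sample(range(num_nodes), num_test))
    test_mask = [i in test_nodes for i in range(num_nodes)]
    # Every node that is not in the test set is used for training
    train_mask = [not is_test for is_test in test_mask]
    return train_mask, test_mask


toy_train_mask, toy_test_mask = random_node_split(num_nodes=406, num_test=40)
print(sum(toy_train_mask), sum(toy_test_mask))  # 366 40
```

During training and evaluation, masks like these are used to index into the model output and the labels so that each step only sees its own set of nodes.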

## Training and evaluating a GCN

Let's now define and train the GCN using PyG and our sampled CORA as input. We adapt the CORA GCN example from the [PyG documentation](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html#learning-methods-on-graphs).

In this example, we evaluate the model on a test set of the sampled CORA. Note, however, that since GCN is an inductive algorithm, we could also have evaluated it on the full CORA dataset, or even on another (similar) graph.

In [None]:
num_classes = y.unique().shape[0]

# Define the GCN architecture
class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(data.num_node_features, 16)
        self.conv2 = GCNConv(16, num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)
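
To give a feel for what a graph convolution layer does with the topology, here is a simplified, pure-Python sketch of the neighborhood aggregation step, roughly `D^-1/2 (A + I) D^-1/2 X`. It deliberately leaves out the learnable weights and bias, and it assumes relationships are treated as undirected, so it is a simplification for intuition and not a substitute for `GCNConv`. The function `gcn_propagate` and the toy inputs are made up for this example.

```python
import math


def gcn_propagate(features, edges):
    """Symmetrically normalized neighbor aggregation with self-loops, without weights."""
    num_nodes = len(features)
    # Neighbor sets, with each node also counted as its own neighbor (self-loop)
    neighbors = [{i} for i in range(num_nodes)]
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    degree = [len(n) for n in neighbors]
    dim = len(features[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in neighbors[i]:
            # Scale each contribution by 1 / sqrt(deg(i) * deg(j))
            weight = 1.0 / math.sqrt(degree[i] * degree[j])
            for k in range(dim):
                out[i][k] += weight * features[j][k]
    return out


# Two connected nodes with one feature each
toy_out = gcn_propagate([[1.0], [3.0]], [(0, 1)])
print(toy_out)  # [[2.0], [2.0]]
```

Each node ends up with a blend of its own features and those of its neighbors. Stacking two such layers, as in the model above, lets information flow between nodes that are two hops apart.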

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GCN().to(device)
data = data.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

# Train the GCN using the CORA sample represented by `data`
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

In [None]:
# Evaluate the trained GCN model on our test set
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f"Accuracy: {acc:.4f}")

## Cleanup

We remove the CORA graphs from the GDS graph catalog.

In [None]:
G_sample.drop()
G.drop()