## Graph neural network basics

In this Colab, we are going to introduce some basics of graph neural network (GNN) and build a pipeline for node classification tasks by PyTorch Geometric (PyG). See more introduction about [PyG](https://pytorch-geometric.readthedocs.io/en/latest/).




## Outline



- Basic operation of PyG
- Build a GNN by PyG For Node Classification
- Link Prediction Task by Pyg
- Graph Classification task by Pyg

## Basic operation of PyG

In [3]:
# import the pytorch library into environment and check its version
import os
import torch
print("Using torch", torch.__version__)

Using torch 2.3.0


Let's start installing PyG by `pip`. The version of PyG should match the current version of PyTorch. Here we follow the [instruction](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html) of PyG:

In [4]:
#!pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-2.0.1+cu118.html

### Create a Graph

A single graph in PyG is described by an instance of `torch_geometric.data` which holds the some important attributes by default, like edge_index. We can easily create a graph of various number of edges and nodes by PyG. Take the following graph as an example:

![](https://github.com/Graph-and-Geometric-Learning/CPSC483-colab/blob/main/fig/graph_example.png?raw=1)


In [5]:
# import torch_geometric.data into environment
from torch_geometric.data import Data
from torch_geometric import nn
import torch_geometric.transforms as T

  Referenced from: <CA14ED34-FA3D-31FE-B4AD-2B2A8446B324> /opt/anaconda3/lib/python3.11/site-packages/libpyg.so
  Reason: tried: '/Library/Frameworks/Python.framework/Versions/3.11/Python' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Library/Frameworks/Python.framework/Versions/3.11/Python' (no such file), '/Library/Frameworks/Python.framework/Versions/3.11/Python' (no such file)
  Referenced from: <2711955E-91F2-3C44-B702-16E8D8D60085> /opt/anaconda3/lib/python3.11/site-packages/torch_scatter/_version_cpu.so
  Expected in:     <E6933B13-F4A0-3821-8466-03F22A3B3739> /opt/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cpu.dylib
  Referenced from: <C1232DBC-0962-3DA1-AA56-38919D5E14F0> /opt/anaconda3/lib/python3.11/site-packages/torch_cluster/_version_cpu.so
  Expected in:     <E6933B13-F4A0-3821-8466-03F22A3B3739> /opt/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cpu.dylib
  Referenced from: <4D0712B5-3B32-32C8-8D39-B31B447BC8EE> /opt/anaconda3/lib/pytho

In [6]:
from torch_geometric.datasets import AmazonBook

In [7]:
dataset = AmazonBook(root = './amazonbook')
data = dataset[0]

We can see the number of the nodes and edges in cora:

In [8]:
num_nodes = data.num_nodes
print('amazonbook has {} nodes'.format(num_nodes))

num_edges = data.num_edges
print('amazonbook has {} edges'.format(num_edges))

amazonbook has 144242 nodes
amazonbook has 4761460 edges


## Link Prediction



### Dataset preprocess

As shown in the following figure, link prediction is to predict whether two nodes in a graph have a link, which can be considered as a binary classification task. We will construct a link prediction dataset containing training, validation, and test set based on Cora.

<br/>
<center>
<img src="https://i0.wp.com/spotintelligence.com/wp-content/uploads/2024/01/link-prediction-graphical-neural-network-1024x576.webp?resize=1024%2C576&ssl=1" height="200" width="350"/>
</center>
<br/>




Given a graph, we divide the initial edge set into three distinct edge sets which represent the training, validation, and test set. Training set and validation set share a same graph structure. Test set contains some edges which does not exist in training and validation set to prevent data leakage.
<!-- Training set does not include edges in validation and test set, and the validation split does not include edges in the test split. Validation and test data should not be leaked into the training set. -->

Our model will be optimized on the training set. We can use `transforms` function in PyG to easily generate the data splits:

The data will be transformed from a data object to three tuples, where each element represents the corresponding split:

In order to split the data for link prediction, we need to 
1. Prepare the edges (extract the existing edges (positive edges)), generate negative edges for the pairs of nodes not connected 
2. Split the edges 
3. create subgraphs for training and evaluation 


Decided not to transform the heterogenous graph into a homogenous graph because we want to preserve the unique relationships between various nodes

In [9]:
from torch_geometric.utils import negative_sampling, train_test_split_edges
from torch_geometric.data import HeteroData
from sklearn.model_selection import train_test_split 

In [10]:
print(data.metadata())


(['user', 'book'], [('user', 'rates', 'book'), ('book', 'rated_by', 'user')])


In [11]:
print("Node types:", data.node_types)

e1 = data.edge_types[0]
e2 = data.edge_types[1]
print("Edge types:", e1, e2)


Node types: ['user', 'book']
Edge types: ('user', 'rates', 'book') ('book', 'rated_by', 'user')


In [12]:
print('user nodes: ', data['user'].num_nodes)
print('book nodes: ', data['book'].num_nodes)

user nodes:  52643
book nodes:  91599


Because this colab is just a way for us to learn and study the dataset we will create a subset of the Amazonbook data set with just 100 user and 100 book nodes. 

In [13]:
# Function to filter edges based on node subsets
def filter_edges(edge_index, valid_src, valid_dst):
    src_mask = torch.isin(edge_index[0], valid_src)
    dst_mask = torch.isin(edge_index[1], valid_dst)
    edge_mask = src_mask & dst_mask
    return edge_index[:, edge_mask]


In [14]:
selected_books = torch.randint(0, data['book'].num_nodes, (5000,))
selected_users = torch.randint(0, data['user'].num_nodes, (5000,))

In [15]:

# Step 2: Filter `user -> book` edges
user_book_edges = data[e1].edge_index
filtered_user_book_edges = filter_edges(user_book_edges, selected_users, selected_books)

# Step 3: Identify the remaining valid users and books
valid_users = torch.unique(filtered_user_book_edges[0])
valid_books = torch.unique(filtered_user_book_edges[1])

# Step 4: Filter `book -> user` edges
book_user_edges = data[e2].edge_index
filtered_book_user_edges = filter_edges(book_user_edges, valid_books, valid_users)

# Step 5: Finalize the subgraph with sizes 
final_users = torch.unique(filtered_book_user_edges[1])
final_books = torch.unique(filtered_book_user_edges[0])

# Step 6: Re-filter edges to match final users and books
filtered_user_book_edges = filter_edges(user_book_edges, final_users, final_books)
filtered_book_user_edges = filter_edges(book_user_edges, final_books, final_users)

# Step 7: Create the subgraph
subset_data = HeteroData()
subset_data['user'].num_nodes = final_users.size(0)
subset_data['book'].num_nodes = final_books.size(0)
subset_data['user', 'rates', 'book'].edge_index = filtered_user_book_edges
subset_data['book', 'rated_by', 'user'].edge_index = filtered_book_user_edges


In [16]:
# Verify the subgraph
print(f"Number of users: {subset_data['user'].num_nodes}")
print(f"Number of books: {subset_data['book'].num_nodes}")
print(f"User -> Book edges: {subset_data['user', 'rates', 'book'].edge_index.shape[1]}")
print(f"Book -> User edges: {subset_data['book', 'rated_by', 'user'].edge_index.shape[1]}")

Number of users: 3686
Number of books: 3665
User -> Book edges: 11278
Book -> User edges: 11278


## Create Node Features for each Node Type

AmazonBook does not come with node features. As such, we need to create features for each node type 

In [17]:
print("Node types:", subset_data.node_types)
print("Edge types:", subset_data.edge_types)


Node types: ['user', 'book']
Edge types: [('user', 'rates', 'book'), ('book', 'rated_by', 'user')]


In [18]:
feature_dim = 16  # Choose a smaller dimension to save memory

In [19]:
subset_data['user'].x=torch.eye(subset_data['user'].num_nodes)
subset_data['book'].x = torch.eye(subset_data['book'].num_nodes)


# Splitting into train, test, and validate data

In [20]:
train_data = HeteroData()
val_data = HeteroData()
test_data = HeteroData()


In [27]:
train_data.node_types=subset_data.node_types 
train_data.edge_types= subset_data.edge_types

test_data.node_types= subset_data.node_types
test_data.edge_types=subset_data.edge_types

val_data.node_types = subset_data.node_types
val_data.edge_types =subset_data.edge_types

In [28]:
for node_type in data.node_types:
    if node_type in train_data:
        print(f"{node_type} exists in train_data.")
    else:
        print(f"{node_type} is missing in train_data.")


user is missing in train_data.
book is missing in train_data.


In [29]:
train_data = HeteroData()
val_data = HeteroData()
test_data = HeteroData()

# Ensure num_nodes is set in subset_data
for node_type in subset_data.node_types:
    if 'num_nodes' not in subset_data[node_type] or subset_data[node_type].num_nodes is None:
        subset_data[node_type].num_nodes = int(subset_data[node_type].x.shape[0])

# Set x_dict globally for all subsets
train_data.x_dict = subset_data.x_dict
val_data.x_dict = subset_data.x_dict
test_data.x_dict = subset_data.x_dict

for edge_type in subset_data.edge_types:
    # Edge splitting
    edge_index = subset_data[edge_type].edge_index.T
    train_edges, test_edges = train_test_split(edge_index, test_size=0.2, random_state=42)
    train_edges, val_edges = train_test_split(train_edges, test_size=0.25, random_state=42)

    # Convert back to tensors
    train_edges = torch.tensor(train_edges).T
    val_edges = torch.tensor(val_edges).T
    test_edges = torch.tensor(test_edges).T

    # Assign edges to corresponding HeteroData objects
    train_data[edge_type].edge_index = train_edges
    val_data[edge_type].edge_index = val_edges
    test_data[edge_type].edge_index = test_edges

    # Assign edge_label_index
    train_data[edge_type].edge_label_index = train_edges
    val_data[edge_type].edge_label_index = val_edges
    test_data[edge_type].edge_label_index = test_edges

    # Assign edge_label (positive class for all edges)
    train_data[edge_type].edge_label = torch.ones(train_edges.shape[1])
    val_data[edge_type].edge_label = torch.ones(val_edges.shape[1])
    test_data[edge_type].edge_label = torch.ones(test_edges.shape[1])

    # Negative sampling
    num_nodes = max(subset_data[edge_type[0]].num_nodes, subset_data[edge_type[-1]].num_nodes)
    neg_samples = negative_sampling(
        edge_index=train_edges,
        num_nodes=num_nodes,
        num_neg_samples=min(train_edges.shape[1], 1000)
    )
    train_data[edge_type].neg_edge_index = neg_samples
    train_data[edge_type].neg_edge_label = torch.zeros(neg_samples.shape[1])

    print(f"Edge type {edge_type}:")
    print("Number of nodes:", subset_data[edge_type[0]].num_nodes)
    print("Train edges:", train_edges.shape[1])
    print("Validation edges:", val_edges.shape[1])
    print("Test edges:", test_edges.shape[1])
    print("Edge labels in train:", train_edges.shape[1])


Edge type ('user', 'rates', 'book'):
Number of nodes: 3686
Train edges: 6766
Validation edges: 2256
Test edges: 2256
Edge labels in train: 6766
Edge type ('book', 'rated_by', 'user'):
Number of nodes: 3665
Train edges: 6766
Validation edges: 2256
Test edges: 2256
Edge labels in train: 6766


  train_edges = torch.tensor(train_edges).T
  val_edges = torch.tensor(val_edges).T
  test_edges = torch.tensor(test_edges).T


In [22]:
train_data.node_types

[]

In [24]:
data.node_types

['user', 'book']

In [25]:
subset_data.node_types

['user', 'book']

{}

In [33]:
train_data = HeteroData()
val_data = HeteroData()
test_data = HeteroData()

# Ensure num_nodes is set in subset_data
for node_type in subset_data.node_types:
    if 'num_nodes' not in subset_data[node_type] or subset_data[node_type].num_nodes is None:
        subset_data[node_type].num_nodes = int(subset_data[node_type].x.shape[0])

# Set x_dict globally for all subsets
train_data.x_dict = subset_data.x_dict
val_data.x_dict = subset_data.x_dict
test_data.x_dict = subset_data.x_dict

for edge_type in subset_data.edge_types:
    # Edge splitting
    edge_index = subset_data[edge_type].edge_index.T
    train_edges, test_edges = train_test_split(edge_index, test_size=0.2, random_state=42)
    train_edges, val_edges = train_test_split(train_edges, test_size=0.25, random_state=42)

    # Convert back to tensors
    train_edges = torch.tensor(train_edges).T
    val_edges = torch.tensor(val_edges).T
    test_edges = torch.tensor(test_edges).T

    # Assign edges to corresponding HeteroData objects
    train_data[edge_type].edge_index = train_edges
    val_data[edge_type].edge_index = val_edges
    test_data[edge_type].edge_index = test_edges

    # Assign edge_label_index
    train_data[edge_type].edge_label_index = train_edges
    val_data[edge_type].edge_label_index = val_edges
    test_data[edge_type].edge_label_index = test_edges

    # Assign edge_label (positive class for all edges)
    train_data[edge_type].edge_label = torch.ones(train_edges.shape[1])
    val_data[edge_type].edge_label = torch.ones(val_edges.shape[1])
    test_data[edge_type].edge_label = torch.ones(test_edges.shape[1])

    # Negative sampling
    num_nodes = max(subset_data[edge_type[0]].num_nodes, subset_data[edge_type[-1]].num_nodes)
    neg_samples = negative_sampling(
        edge_index=train_edges,
        num_nodes=num_nodes,
        num_neg_samples=min(train_edges.shape[1], 1000)
    )
    train_data[edge_type].neg_edge_index = neg_samples
    train_data[edge_type].neg_edge_label = torch.zeros(neg_samples.shape[1])

    print(f"Edge type {edge_type}:")
    print("Number of nodes:", subset_data[edge_type[0]].num_nodes)
    print("Train edges:", train_edges.shape[1])
    print("Validation edges:", val_edges.shape[1])
    print("Test edges:", test_edges.shape[1])
    print("Edge labels in train:", train_edges.shape[1])



Edge type ('user', 'rates', 'book'):
Number of nodes: 3686
Train edges: 6766
Validation edges: 2256
Test edges: 2256
Edge labels in train: 6766
Edge type ('book', 'rated_by', 'user'):
Number of nodes: 3665
Train edges: 6766
Validation edges: 2256
Test edges: 2256
Edge labels in train: 6766


  train_edges = torch.tensor(train_edges).T
  val_edges = torch.tensor(val_edges).T
  test_edges = torch.tensor(test_edges).T


In [35]:
# Ensure num_nodes is correctly set for all node types
for node_type in train_data.node_types:
    if train_data[node_type].num_nodes is None:
        train_data[node_type].num_nodes = train_data[node_type].x.shape[0]

# Filter edge_index to ensure indices are valid
for edge_type in train_data.edge_types:
    edge_index = train_data[edge_type].edge_index
    num_nodes_src = train_data[edge_type[0]].num_nodes
    num_nodes_dst = train_data[edge_type[-1]].num_nodes

    # Check for out-of-range indices
    mask = (edge_index[0] < num_nodes_src) & (edge_index[1] < num_nodes_dst)
    train_data[edge_type].edge_index = edge_index[:, mask]

    # Print edge index ranges for debugging
    print(f"Edge type {edge_type}:")
    print(f"Max index in edge_index[0]: {train_data[edge_type].edge_index[0].max().item()}")
    print(f"Max index in edge_index[1]: {train_data[edge_type].edge_index[1].max().item()}")




TypeError: '<' not supported between instances of 'Tensor' and 'NoneType'

In [None]:
train_data

In Link Prediction, we have to add labels to indicate the edges to be predicted 

In [32]:
for edge_type in subset_data.edge_types:
    # Use the same edge indices as edge_label_index
    train_data[edge_type].edge_label_index = train_data[edge_type].edge_index
    val_data[edge_type].edge_label_index = val_data[edge_type].edge_index
    test_data[edge_type].edge_label_index = test_data[edge_type].edge_index


In [33]:
from torch_geometric.utils import negative_sampling

for edge_type in subset_data.edge_types:
    num_nodes = subset_data[edge_type[0]].num_nodes  # Source node type: Users 
    neg_samples = negative_sampling(
        edge_index=train_data[edge_type].edge_index,
        num_nodes=num_nodes,
        num_neg_samples=train_data[edge_type].edge_index.size(1)
    )
    train_data[edge_type].neg_edge_index = neg_samples


In [31]:
for node_type in train_data.node_types:
    print(f"Node type: {node_type}, num_nodes = {train_data[node_type].num_nodes}")


In [32]:
train_data.node_types

[]

In [34]:
for edge_type in subset_data.edge_types:
    print(f"Edge type {edge_type}:")
    print("Number of nodes:", train_data[edge_type[0]].num_nodes)
    print("Train edges:", train_data[edge_type].edge_index.shape[1])
    print("Validation edges:", val_data[edge_type].edge_index.shape[1])
    print("Test edges:", test_data[edge_type].edge_index.shape[1])
    print("Edge labels in train:", train_data[edge_type].edge_label_index.shape[1])


Edge type ('user', 'rates', 'book'):
Number of nodes: None
Train edges: 6777
Validation edges: 2259
Test edges: 2259
Edge labels in train: 6777
Edge type ('book', 'rated_by', 'user'):
Number of nodes: None
Train edges: 6777
Validation edges: 2259
Test edges: 2259
Edge labels in train: 6777




Now data object has two attributes of edge: `edge_index` and `edge_label_index`. `edge_index` denotes the graph structure used for performing message passing in GNN. `edge_label_index` denotes the edge index used to calculate loss in training set, or to evaluate the model in validation and test set.


Printing the statistics of data:

### Pipeline

Homework originally uses a GCNConv but we can't because it is only for homogenous graphs. 
Using SAFEConv because need a layer that supports bipartite graphs. These layers are designed for graphs where edges connect nodes of different types 

In [30]:
from torch_geometric.nn import HeteroConv, SAGEConv
import torch 

class HeteroGCN(torch.nn.Module):
    def __init__(self, metadata, hidden_channels, out_channels):
        super().__init__()
        self.convs = HeteroConv({
            edge_type: SAGEonv((-1,-1), hidden_channels, add_self_loops=False) for edge_type in metadata[1]
        }, aggr='sum')  # Aggregates messages across edge types
        self.out_conv = HeteroConv({
            edge_type: SAGEConv((-1,-1), out_channels, add_self_loops=False) for edge_type in metadata[1]
        }, aggr='sum')

    def forward(self, x_dict, edge_index_dict):
        # x_dict: Dict of node features for each node type
        # edge_index_dict: Dict of edge_index for each edge type
        x_dict = self.convs(x_dict, edge_index_dict)
        x_dict = {key: torch.relu(x) for key, x in x_dict.items()}
        x_dict = self.out_conv(x_dict, edge_index_dict)
        return x_dict


In [36]:
subset_data.metadata()

(['user', 'book'], [('user', 'rates', 'book'), ('book', 'rated_by', 'user')])

In [37]:
model = HeteroGCN(subset_data.metadata(), hidden_channels=128, out_channels=64)

In [38]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)

Similar as the what we do in the node classification task, we first apply the GCN model to produce the representation of each node in the graph. Usually we will use **inner product** to measure the similarity between two node representations to determine how likely it is for these two nodes to be connected.

#### Question 7 (5 points)

Following the instruction and implement the function to calculate the inner product:

In [39]:
def compute_similarity(node_embs, edge_index):
    result = 0

    # TODO: Define similarity function.
    # 1. calculate the inner product between all the pairs in the edge_index
    # Note: the shape of node_embs is [n, h] where n is the number of nodes, and h is the embedding size
    # the shape of edge_index is [2, m] where m is the number of edges

    ############# Your code here ############
    ## (~1 line of code)
    result = (node_embs[edge_index[0]] *node_embs[edge_index[1]]).sum(dim=1)
    #########################################

    return result

n, h = 5, 10  # number of nodes and embedding size
node_embs = torch.rand(n, h)
edge_index = torch.tensor([[0, 1, 2, 3],
                           [2, 3, 0, 1]])  # compute the similarity of (0, 2), (1, 3), (2, 0), (3, 1)
similarity = compute_similarity(node_embs, edge_index)
print("Similairty:", similarity)

Similairty: tensor([2.5647, 1.3936, 2.5647, 1.3936])


We optimize the model by minimizing the loss function. Here we consider the link prediction task as a binary classification task (edge exists or no), and apply binary cross entropy loss:

In [40]:
loss_fn = torch.nn.BCEWithLogitsLoss()

The edges in the graph will be taken as the positive examples with label=1 in the loss function. To prevent model from collapse, we usually will feed some **negative examples** to the loss function, which is the non-existing edges in the graph. The number of negative examples should equal to the number of positive ones.

With the help of PyG, we can easily perform the negative sampling. Here is an example:

In [41]:
for edge_type in subset_data.edge_types:
    print(f"Edge type {edge_type}:")
    print("Number of nodes:", train_data[edge_type[0]].num_nodes)
    print("Train edges:", train_data[edge_type].edge_index.shape[1])
    print("Validation edges:", val_data[edge_type].edge_index.shape[1])
    print("Test edges:", test_data[edge_type].edge_index.shape[1])
    print("Edge labels in train:", train_data[edge_type].edge_label_index.shape[1])


Edge type ('user', 'rates', 'book'):
Number of nodes: None
Train edges: 6777
Validation edges: 2259
Test edges: 2259
Edge labels in train: 6777
Edge type ('book', 'rated_by', 'user'):
Number of nodes: None
Train edges: 6777
Validation edges: 2259
Test edges: 2259
Edge labels in train: 6777


In [42]:
# from torch_geometric.utils import negative_sampling

# for edge_type in data.edge_types:
#     num_nodes = data[edge_type[0]].num_nodes  # Source node type: Users 
#     neg_samples = negative_sampling(
#         edge_index=train_data[edge_type].edge_index,
#         num_nodes=num_nodes,
#         num_neg_samples=train_data[edge_type].edge_index.size(1)
#     )
#     train_data[edge_type].neg_edge_index = neg_samples


t1 = subset_data.edge_types[0]
t2 = subset_data.edge_types[1]
print("Positive edges")
print(train_data[t1].edge_label_index)
print(train_data[t2].edge_label_index)
print("============================")
print("Negative edge")
print(train_data[t1].neg_edge_index)
print((train_data[t2].neg_edge_index))

Positive edges
tensor([[12919, 25169, 23894,  ..., 10945, 29776,   625],
        [11678, 48532, 43909,  ..., 49199, 37157, 34127]])
tensor([[11678, 48532, 43909,  ..., 49199, 37157, 34127],
        [12919, 25169, 23894,  ..., 10945, 29776,   625]])
Negative edge
tensor([[ 575,  150, 3215,  ..., 2111, 2579, 1439],
        [2732,  308, 1923,  ...,  325, 2454, 1441]])
tensor([[2032, 3261, 2969,  ..., 2642,  859, 1161],
        [2007, 2210, 3042,  ..., 3365,  474, 1140]])


In [43]:
#code for homogenous graph 
# from torch_geometric.utils import negative_sampling

# neg_edge_index = negative_sampling(
#       edge_index=train_data.edge_index,  # positive edges in the graph
#       num_nodes=train_data.num_nodes,  # number of nodes
#       num_neg_samples=5,  # number of negative examples
#     )

# print("shape of neg_edge_index:", neg_edge_index.shape)  # [2, num_neg_samples]
# print("negative examples:", neg_edge_index)

Positive examples (`edge_label_index`) will be assigned the label 1, and negative ones will be assigned the label 0. We can obtain the label of positive examples like this:

Now we can construct training and testing pipeline.

#### Question 8 (15 points)

Please follow the instruction and implement a function that trains a model.

In [47]:
for edge_type in train_data.edge_types:
    print(f"{edge_type}: Train={train_data[edge_type].edge_index.shape[1]}, "
          f"Val={val_data[edge_type].edge_index.shape[1]}, Test={test_data[edge_type].edge_index.shape[1]}")


('user', 'rates', 'book'): Train=6777, Val=2259, Test=2259
('book', 'rated_by', 'user'): Train=6777, Val=2259, Test=2259


In [54]:
# def train(model, data, optimizer, loss_fn):

#     loss = 0

#     # ARGS: 
#     #model - the heterogeneous GCN model: HeteroGCN 
#     #data - HeteroData object with x_dict, edge_index_dict and target labels zz
#     #optimizer: Adam 
#     #criterion: Loss function, Binary Cross Entropy Loss

#     ############# Your code here ############
#     ## (~10 line of code)
#     model.train() 
#     optimizer.zero_grad() 

#     #forward pass: pass node features and edge indices through the model 
#     out_dict = model(data.x_dict, data.edge_index_dict)
#     print("Model output:", {key: out.shape for key, out in out_dict.items()})

#     #compute loss for all edge types (eg. link prediction)
#     total_loss = 0 
#     for edge_type in data.edge_types: 
#         if 'edge_label' in data[edge_type]: #check if edge labels exist for this type
#             edge_label_index = data[edge_type].edge_label_index
#             pred = compute_similarity(out_dict[edge_type[-1]], edge_label_index).view(-1)
#             target = data[edge_type].edge_label#true labels for this edge type
#             total_loss += loss_fn(pred, target)


#     total_loss.backward() 
#     optimizer.step()

#     #########################################

#     return total_loss.item()

def train(model, data, optimizer, criterion):
    model.train()
    optimizer.zero_grad()

    # Verify node features
    for node_type in data.node_types:
        if 'x' not in data[node_type]:
            raise ValueError(f"Node type {node_type} is missing 'x' (node features).")

    # Forward pass
    out_dict = model(data.x_dict, data.edge_index_dict)

    # Compute loss for all edge types
    total_loss = 0
    for edge_type in data.edge_types:
        if 'edge_label' in data[edge_type]:
            pred = out_dict[edge_type[-1]]  # Predictions for target node type
            target = data[edge_type].edge_label
            total_loss += criterion(pred, target)

    total_loss.backward()
    optimizer.step()
    return total_loss.item()


We usually use [AUC score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html) to evaluate the performance of model on binary classification task. The test function is as followed:

In [48]:
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def test(model, data):
    model.eval()

    total_auc = 0 
    total_edges = 0 

    for edge_type in data.edge_types: 
        if 'edge_label' in data[edge_type]:#ensure labels exist 
            #perform message passing and get predictions for this edge type 
            out = model(data.x_dict, data.edge_index_dict)
            node_embeddings = out[edge_type[-1]]# target node type embeddings

            #compute similarity for edges in edge_label_index
            edge_label_index = data[edge_type].edge_label_index
            predictions = compute_similarity(out[edge_type[-1]], edge_label_index).view(-1).sigmoid()
            print("Predictions shape:", predictions.shape)
            print("Edge labels shape:", edge_label_index.shape)


            #compute AUC for this edge type 
            edge_label = data[edge_type].edge_label
            auc = roc_auc_score(edge_label.cpu().numpy(), predictions.cpu().numpy())
            
            # Weighted aggregation of AUC scores
            num_edges = edge_label.size(0)
            total_auc += auc * num_edges
            total_edges += num_edges

    # Compute the final AUC (weighted average across edge types)
    return total_auc / total_edges if total_edges > 0 else 0

Now we can start to train our model based on `train` and `test` function:

In [53]:
epochs = 50

best_val_auc = final_test_auc = 0
for epoch in range(1, epochs + 1):
    loss = train(model, train_data, optimizer, loss_fn)
    
    valid_auc = test(model, val_data)
    test_auc = test(model, test_data)
    if valid_auc > best_val_auc:
        best_val_auc = valid_auc
        final_test_auc = test_auc
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Val: {valid_auc:.4f}, Test: {test_auc:.4f}')

KeyError: "Tried to collect 'x' but did not find any occurrences of it in any node and/or edge type"