# PyTorch Geometric (PyG) Tutorial

Borrowed from: https://pytorch-geometric.readthedocs.io/en/latest/get_started/introduction.html

PyG (PyTorch Geometric) is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs) for a wide range of applications related to structured data.

Installation:

>pip install torch_geometric

To use the full set of PyG features, you may also want to install several additional libraries:

- pyg-lib
- torch-scatter
- torch-sparse
- torch-cluster
- torch-spline-conv

### Data in PyG

A graph is used to model pairwise relations (edges) between objects (nodes). A single graph in PyG is described by an instance of `torch_geometric.data.Data`, which holds the following attributes by default (although none of these attributes are strictly required):

- data.x: Node feature matrix with shape [num_nodes, num_node_features]

- data.edge_index: Graph connectivity with shape [2, num_edges] and type torch.long

- data.edge_attr: Edge feature matrix with shape [num_edges, num_edge_features]

- data.y: Target to train against (may have arbitrary shape), e.g., node-level targets of shape [num_nodes, *] or graph-level targets of shape [1, *]

In [9]:
import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)

#Note: the elements in edge_index only hold indices in the range { 0, ..., num_nodes - 1}
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index)

> Basic attributes of Data object:

In [10]:
print(data.num_nodes)
print(data.num_edges)
print(data.num_node_features)
print(data.has_isolated_nodes())
print(data.has_self_loops())
print(data.is_directed())

3
4
1
False
False
False
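
The three boolean checks printed above can be read directly off `edge_index`. The following is a minimal, library-free sketch of what they mean, not PyG's actual implementation. It assumes, for illustration, that self-loops do not count when deciding whether a node is isolated, and it ignores edge attributes.

```python
# edge_index as two parallel lists: sources in row 0, targets in row 1.
edge_index = [[0, 1, 1, 2],
              [1, 0, 2, 1]]
num_nodes = 3

# Pair up the two rows to get (source, target) tuples.
edges = set(zip(edge_index[0], edge_index[1]))

# A self-loop is an edge whose source equals its target.
has_self_loops = any(src == dst for src, dst in edges)

# A node is isolated if no edge (other than a self-loop) touches it.
touched = {node for src, dst in edges if src != dst for node in (src, dst)}
has_isolated_nodes = len(touched) < num_nodes

# The graph is directed if some edge (u, v) lacks its reverse edge (v, u).
is_directed = any((dst, src) not in edges for src, dst in edges)

num_edges = len(edge_index[0])
print(num_edges, has_isolated_nodes, has_self_loops, is_directed)
# 4 False False False
```

Each undirected edge is stored twice, once per direction, which is why this three-node path graph reports four edges.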


### Common Benchmark Datasets

PyG provides access to a wide range of popular benchmark datasets, including all Planetoid datasets (Cora, Citeseer, Pubmed), graph classification datasets from TUDatasets (along with their cleaned versions), and several 3D mesh/point cloud datasets like FAUST, ModelNet10/40, and ShapeNet.

Loading a dataset is simple—PyG automatically downloads and processes the raw data into the standardized Data format.

Let’s download Cora, the standard benchmark dataset for semi-supervised graph node classification. It consists of scientific publications, where:
- Nodes represent individual research papers.
- Edges represent citation relationships between these papers.
- Each node is described by a feature vector based on the presence of certain words in the paper's abstract.
- Nodes are labeled with one of seven classes, corresponding to the paper's topic (e.g., Neural Networks, Reinforcement Learning).

The Cora dataset is commonly used to evaluate graph neural networks (GNNs) due to its graph-structured nature and relatively small size, making it an accessible starting point for experiments on citation networks.

In [13]:
# Load Cora dataset:

from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='tmp/Cora', name='Cora')

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
Processing...
Done!


In [17]:
# number of graphs
print("Number of graphs: ", len(dataset))
# number of features
print("Number of features: ", dataset.num_features)
# number of classes
print("Number of classes: ", dataset.num_classes)

# select the only graph
data = dataset[0]
# number of nodes
print("Number of nodes: ", data.num_nodes)
# number of edges
print("Number of edges: ", data.num_edges)
# check if directed
print("Is directed: ", data.is_directed())

#standard splits
# training nodes
print("# of nodes to train on: ", data.train_mask.sum().item())
# test nodes
print("# of nodes to test on: ", data.test_mask.sum().item())
# validation nodes
print("# of nodes to validate on: ", data.val_mask.sum().item())

print("X shape: ", data.x.shape)
print("Edge shape: ", data.edge_index.shape)
print("Y shape: ", data.y.shape)

Number of graphs:  1
Number of features:  1433
Number of classes:  7
Number of nodes:  2708
Number of edges:  10556
Is directed:  False
# of nodes to train on:  140
# of nodes to test on:  1000
# of nodes to validate on:  500
X shape:  torch.Size([2708, 1433])
Edge shape:  torch.Size([2, 10556])
Y shape:  torch.Size([2708])
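
The `train_mask`, `val_mask`, and `test_mask` attributes are boolean vectors with one entry per node, so summing a mask counts the nodes in that split. A minimal plain-Python sketch of how such a mask selects nodes, using hypothetical labels and predictions for a six-node graph (not Cora data):

```python
# Hypothetical labels, predictions, and mask for a 6-node graph.
y         = [0, 1, 1, 2, 0, 2]
pred      = [0, 1, 0, 2, 1, 2]
test_mask = [False, False, True, True, True, True]

# Summing a boolean mask counts the selected nodes.
num_test = sum(test_mask)

# Keep only the (prediction, label) pairs where the mask is True.
selected = [(p, t) for p, t, m in zip(pred, y, test_mask) if m]

# Accuracy is computed over the selected nodes only.
correct = sum(p == t for p, t in selected)
accuracy = correct / num_test
print(num_test, correct, accuracy)  # 4 2 0.5
```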


Let's try another one! The ENZYMES dataset in the TUDataset collection is a popular benchmark dataset used for graph classification tasks, particularly in bioinformatics. It consists of graphs representing protein structures, where the goal is to classify enzymes into one of six different enzyme classes.

Here are the key details:

- Graphs: The dataset contains 600 graphs, each representing the tertiary structure of a protein.
- Nodes: Each node in a graph represents a secondary structure element (SSE), such as an α-helix or a β-sheet.
- Edges: Edges between nodes represent spatial proximity or structural relationships between SSEs.
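- Features: With the default loading options used below, each node carries a 3-dimensional one-hot vector encoding the type of its SSE (helix, sheet, or turn). Additional continuous node attributes are only loaded when `use_node_attr=True` is passed to `TUDataset`.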
- Labels: The graphs are classified into 6 enzyme classes, based on the enzyme's function.

In [18]:
from torch_geometric.datasets import TUDataset
dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')

Downloading https://www.chrsmrrs.com/graphkerneldatasets/ENZYMES.zip
Processing...
Done!


In [20]:
print(len(dataset))
print(dataset.num_classes)
print(dataset.num_node_features)
print(dataset[0])
# Note that the first graph in the dataset contains 37 nodes, each one having 3 features. 
# There are 168/2 = 84 undirected edges and the graph is assigned to exactly one class. 
# In addition, the data object is holding exactly one graph-level target.

600
6
3
Data(edge_index=[2, 168], x=[37, 3], y=[1])


### Batch graph processing

Neural networks are typically trained using batches. PyTorch Geometric (PyG) facilitates parallel processing within a mini-batch by constructing sparse block diagonal adjacency matrices (as defined by edge_index) and concatenating feature and target matrices along the node dimension. This approach accommodates graphs with varying numbers of nodes and edges within a single batch.
<center>
<img src="images/batch.png" width="500"/>
</center>
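
The block-diagonal construction can be sketched without any library: node indices of each graph are shifted by the number of nodes that came before it, and a vector records which graph every node belongs to. The two toy graphs and the `collate` helper below are hypothetical; the `batch` and `ptr` names merely mirror the fields PyG reports on a batch object.

```python
# Two toy graphs in plain Python (hypothetical data, not taken from ENZYMES).
graphs = [
    {"num_nodes": 3, "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]]},
    {"num_nodes": 2, "edge_index": [[0, 1], [1, 0]]},
]

def collate(graphs):
    """Merge graphs into one disconnected graph by offsetting node indices."""
    sources, targets, batch, ptr = [], [], [], [0]
    offset = 0
    for graph_id, g in enumerate(graphs):
        src, dst = g["edge_index"]
        # Shift this graph's node indices past all previously added nodes.
        sources += [s + offset for s in src]
        targets += [t + offset for t in dst]
        # Record the owning graph of every node.
        batch += [graph_id] * g["num_nodes"]
        offset += g["num_nodes"]
        # ptr holds the cumulative node count at each graph boundary.
        ptr.append(offset)
    return [sources, targets], batch, ptr

edge_index, batch, ptr = collate(graphs)
print(edge_index)  # [[0, 1, 1, 2, 3, 4], [1, 0, 2, 1, 4, 3]]
print(batch)       # [0, 0, 0, 1, 1]
print(ptr)         # [0, 3, 5]
```

Because no edge connects nodes of different graphs, message passing on the merged graph never mixes information between examples.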

In [27]:
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    pass
print(batch)  # a DataBatch object; its `batch` attribute is a vector mapping each node to its graph in the batch
print(batch.num_graphs)



DataBatch(edge_index=[2, 2842], x=[735, 3], y=[24], batch=[735], ptr=[25])
24
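
The node-to-graph assignment vector is what makes graph-level readouts possible: node features can be averaged per graph, which is the idea behind `global_mean_pool` used later in this tutorial. A minimal plain-Python sketch with hypothetical features (the `mean_pool` helper is illustrative, not PyG's implementation):

```python
# Hypothetical node features for 5 nodes spread over 2 graphs.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [10.0, 0.0], [20.0, 2.0]]
batch = [0, 0, 0, 1, 1]  # graph id of each node

def mean_pool(x, batch):
    """Average node feature rows per graph id."""
    num_graphs = max(batch) + 1
    dim = len(x[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for row, graph_id in zip(x, batch):
        counts[graph_id] += 1
        for j, value in enumerate(row):
            sums[graph_id][j] += value
    # One output row per graph, regardless of how many nodes it has.
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]

pooled = mean_pool(x, batch)
print(pooled)  # [[3.0, 4.0], [15.0, 1.0]]
```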


### Learning a Graph Model

In [41]:
from torch_geometric.nn import GCNConv
import torch.nn.functional as F

from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='tmp/Cora', name='Cora')
data = dataset[0]

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)

        return x

model = GCN()
data = dataset[0]
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

Evaluate the model on the test nodes:

In [43]:
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f'Accuracy: {acc:.4f}')

Accuracy: 0.8060
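
To make the `GCNConv` layers in the model above less of a black box, here is a library-free sketch of the propagation rule they are based on: add self-loops, transform the features linearly, and sum neighbor messages scaled by `1 / sqrt(deg(i) * deg(j))`. It reuses the three-node graph from the start of this tutorial. The 1x1 weight matrix is a hypothetical value chosen for illustration, and the bias term is omitted.

```python
import math

edge_index = [[0, 1, 1, 2],
              [1, 0, 2, 1]]
x = [[-1.0], [0.0], [1.0]]
weight = [[2.0]]  # hypothetical 1x1 weight matrix

def gcn_layer(x, edge_index, weight):
    num_nodes = len(x)
    out_dim = len(weight[0])
    # Add a self-loop to every node.
    src = edge_index[0] + list(range(num_nodes))
    dst = edge_index[1] + list(range(num_nodes))
    # Node degrees, counted with the self-loops included.
    deg = [0] * num_nodes
    for d in dst:
        deg[d] += 1
    # Linear transform: h = x @ weight.
    h = [[sum(row[k] * weight[k][j] for k in range(len(row)))
          for j in range(out_dim)] for row in x]
    # Sum symmetrically normalized messages from each source into its target.
    out = [[0.0] * out_dim for _ in range(num_nodes)]
    for s, d in zip(src, dst):
        norm = 1.0 / math.sqrt(deg[s] * deg[d])
        for j in range(out_dim):
            out[d][j] += norm * h[s][j]
    return out

out = gcn_layer(x, edge_index, weight)
print(out)  # [[-1.0], [0.0], [1.0]]
```

The middle node receives opposite messages from its two neighbors, which cancel, so its output is zero.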


### An example of graph-level classification using minibatches

In [54]:
import torch
import torch.nn.functional as F
from torch_geometric.datasets import TUDataset
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GCNConv, global_mean_pool
from sklearn.model_selection import train_test_split
dataset = TUDataset(root='/path/to/your/dataset', name='PROTEINS')
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

dataset = dataset.shuffle()
#splitting the dataset into training and testing
train_dataset, test_dataset = train_test_split(dataset, test_size=0.2)

#creating dataloaders
BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

print(f'Number of training graphs: {len(train_dataset)}')
print(f'Number of test graphs: {len(test_dataset)}')

class GCN(torch.nn.Module):
    def __init__(self, hidden_channels_1, hidden_channels_2, hidden_channels_3, hidden_channels_4):
        """Graph Convolutional Network with 3 layers followed by a post-message-passing MLP"""

        super(GCN, self).__init__()
        torch.manual_seed(12345)
        self.conv1 = GCNConv(dataset.num_node_features, hidden_channels_1)
        self.conv2 = GCNConv(hidden_channels_1, hidden_channels_2)
        self.conv3 = GCNConv(hidden_channels_2, hidden_channels_3)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(hidden_channels_3, hidden_channels_4),
            torch.nn.Dropout(0.2),
            torch.nn.Linear(hidden_channels_4, dataset.num_classes)
        )

    def forward(self, x, edge_index, batch):
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.2, training=self.training)

        x = self.conv2(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.25, training=self.training)

        x = self.conv3(x, edge_index)
        x = F.dropout(x, p=0.5, training=self.training)
        
        x = global_mean_pool(x, batch)

        x = self.mlp(x)
        embedding = x

        x = F.log_softmax(x, dim=1)
        return x, embedding
    
model = GCN(32,64,64,32)
# The model already returns log-probabilities (log_softmax), so NLLLoss is the matching loss
loss_fn = torch.nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=8e-5, weight_decay = 1e-4)

#Create empty results dictionary
results = {
    "train_loss": [],
    "train_accuracy": [],
    "test_loss": [],
    "test_accuracy": []
}

for epoch in range(1,201):
    model.train()
    
    train_loss, train_accuracy = 0, 0

    for data in train_loader:
        optimizer.zero_grad()
        output, _ = model(data.x, data.edge_index, data.batch)
        loss = loss_fn(output, data.y)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        train_accuracy += (output.argmax(dim=1) == data.y).sum().item()

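    # Note: each loss.item() is already a batch mean, so dividing the sum by the
    # number of graphs reports a value scaled down by roughly BATCH_SIZE.
    train_loss /= len(train_loader.dataset)
    train_accuracy /= len(train_loader.dataset)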
    with torch.no_grad():
        model.eval()

        test_loss, test_accuracy = 0, 0

        for data in test_loader:
            output, _ = model(data.x, data.edge_index, data.batch)
            loss = loss_fn(output, data.y)

            test_loss += loss.item()
            test_accuracy += (output.argmax(dim=1) == data.y).sum().item()

        test_loss /= len(test_loader.dataset)
        test_accuracy /= len(test_loader.dataset)

    results["train_loss"].append(train_loss)
    results["train_accuracy"].append(train_accuracy)
    results["test_loss"].append(test_loss)
    results["test_accuracy"].append(test_accuracy)
    if epoch%10==0:
        print(f'Epoch: {epoch+1:03d}, Train Loss: {train_loss:.4f}, Train Accuracy: {train_accuracy:.4f}, Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}')


Dataset: PROTEINS(1113):
Number of graphs: 1113
Number of features: 3
Number of classes: 2
Number of training graphs: 890
Number of test graphs: 223
Epoch: 011, Train Loss: 0.0211, Train Accuracy: 0.5955, Test Loss: 0.0211, Test Accuracy: 0.5964
Epoch: 021, Train Loss: 0.0207, Train Accuracy: 0.6191, Test Loss: 0.0209, Test Accuracy: 0.5874
Epoch: 031, Train Loss: 0.0199, Train Accuracy: 0.6674, Test Loss: 0.0205, Test Accuracy: 0.6143
Epoch: 041, Train Loss: 0.0197, Train Accuracy: 0.6876, Test Loss: 0.0204, Test Accuracy: 0.6502
Epoch: 051, Train Loss: 0.0193, Train Accuracy: 0.6933, Test Loss: 0.0204, Test Accuracy: 0.6592
Epoch: 061, Train Loss: 0.0194, Train Accuracy: 0.6933, Test Loss: 0.0204, Test Accuracy: 0.6502
Epoch: 071, Train Loss: 0.0195, Train Accuracy: 0.6865, Test Loss: 0.0204, Test Accuracy: 0.6457
Epoch: 081, Train Loss: 0.0193, Train Accuracy: 0.7000, Test Loss: 0.0204, Test Accuracy: 0.6457
Epoch: 091, Train Loss: 0.0191, Train Accuracy: 0.6910, Test Loss: 0.0204, 
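
The reported losses sit near 0.02 rather than near ln 2 ≈ 0.69, the value expected from an untrained binary classifier. A likely reason is the averaging in the loop above: `loss.item()` is already a mean over the batch (the loss function's default reduction), yet the accumulated sum is divided by the number of graphs. The sketch below uses hypothetical batch sizes and losses, not values from this run, to contrast that with a size-weighted average.

```python
# Hypothetical per-batch mean losses and batch sizes (not taken from the run above).
batch_sizes = [32, 32, 24]
batch_mean_losses = [0.70, 0.68, 0.66]
num_graphs = sum(batch_sizes)

# Summing batch means and dividing by the number of graphs, as in the loop above.
scaled = sum(batch_mean_losses) / num_graphs

# Weighting each batch mean by its batch size recovers the mean loss per graph.
per_graph = sum(loss * n for loss, n in zip(batch_mean_losses, batch_sizes)) / num_graphs

print(f"{scaled:.4f}")     # 0.0232
print(f"{per_graph:.4f}")  # 0.6818
```

In a PyG training loop, the same weighting would multiply `loss.item()` by `data.num_graphs` before accumulating. The accuracy figures are unaffected, since they count correct predictions directly.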

---
### Bonus Tip: Loading custom datasets

To create datasets that are not included with PyTorch Geometric (PyG), you can subclass `torch_geometric.data.Dataset`. This approach closely mirrors the structure of datasets in torchvision, making it intuitive for users familiar with image data handling. A subclass is expected to implement the following methods:

- Dataset.len(): Returns the number of examples in your dataset.

- Dataset.get(): Implements the logic to load a single graph.

Here is a practical example of how a custom dataset can be defined in PyG:

The implemented `GMLDataset` class is tailored to load graph data from GML files, specifically for the `MUTAG` dataset, which includes `188` graphs stored individually in `GML` format. The class initializes with the root directory and a label file, automatically identifies the GML files, and reads the corresponding labels.

Key methods include `len`, which returns the total number of graphs, and `get`, which loads a specific graph, converts it from NetworkX format to a PyTorch Geometric Data object, and assigns the appropriate label.

This class serves as a blueprint for creating custom datasets in PyTorch Geometric. By inheriting from `torch_geometric.data.Dataset`, you can implement essential methods for customization based on specific datasets. 

In [64]:
import os
import torch
from torch_geometric.data import Dataset, Data
import networkx as nx
from torch_geometric.utils import from_networkx

class GMLDataset(Dataset):
    def __init__(self, root, label_file, transform=None, pre_transform=None):
        super(GMLDataset, self).__init__(root, transform, pre_transform)
        self.root = root
        self.label_file = label_file
        self.graph_files = self.raw_file_names
        self.labels = self.load_labels()

    @property
    def raw_file_names(self):
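        # Find all GML files in the root directory. os.listdir returns entries in
        # arbitrary order, so sort them to keep the file order deterministic; this
        # assumes the sorted file names line up with the lines of the label file.
        return sorted(f for f in os.listdir(self.root) if f.endswith('.gml'))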

    def load_labels(self):
        # Load labels from the label.txt file
        label_path = os.path.join(self.root, self.label_file)
        with open(label_path, 'r') as f:
            labels = [int(line.strip()) for line in f.readlines()]
        return labels

    def len(self):
        return len(self.graph_files)

    def get(self, idx):
        file_path = os.path.join(self.root, self.graph_files[idx])
        # Load the graph from GML file using networkx
        nx_graph = nx.read_gml(file_path, label="id")
        data = from_networkx(nx_graph)
        # Assign the corresponding label to the data object
        data.y = torch.tensor([self.labels[idx]], dtype=torch.long)
        return data
# Define the root directory where the GML dataset is located
# Download the dataset from https://github.com/BorgwardtLab/P-WL/tree/master/data/MUTAG 
root_dir = 'data/MUTAG'

# Create an instance of the dataset
dataset = GMLDataset(root=root_dir, label_file='Labels.txt')

# Print the number of graphs in the dataset
print("Number of graphs:", len(dataset))

# Access the first graph data (this will trigger loading)
graph_data = dataset[0]
print("Edge index shape:", graph_data.edge_index.shape)
print("Label:", graph_data.y)

Number of graphs: 188
Edge index shape: torch.Size([2, 38])
Label: tensor([1])
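
Because `get` pairs the file at position `idx` with the label at position `idx`, the order of the file list matters, and a directory listing makes no ordering guarantee. If the file names contain numbers without zero padding, plain alphabetical sorting is not enough either. A minimal sketch of a numeric-aware sort, using hypothetical file names (the actual MUTAG file names may differ):

```python
import re

# Hypothetical file names in the arbitrary order a directory listing might return.
listed = ["graph_10.gml", "graph_2.gml", "Labels.txt", "graph_1.gml"]

def numeric_key(name):
    """Sort key that compares embedded integers numerically, not as text."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

# Keep only GML files and order them so that graph_2 comes before graph_10.
gml_files = sorted((f for f in listed if f.endswith(".gml")), key=numeric_key)
print(gml_files)  # ['graph_1.gml', 'graph_2.gml', 'graph_10.gml']
```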
