# Introduction to PyTorch Geometric by Examples
Learning from: https://pytorch-geometric.readthedocs.io/en/latest/get_started/introduction.html

## Data Handling of Graphs and `torch_geometric.data.Data`

### `torch_geometric.data.Data`
A single graph in PyG is described by an instance of `torch_geometric.data.Data`, which holds the following attributes by default:

- `data.x`: Node feature matrix with shape `[num_nodes, num_node_features]`, i.e., $n_f \times d_f$ where $n_f$ is the number of nodes and $d_f$ is the number of features per node.
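- `data.edge_index`: Graph connectivity in [COO format](https://pytorch.org/docs/stable/sparse.html#sparse-coo-docs) with shape `[2, num_edges]` and type `torch.long`.
    - COO (coordinate) format is a sparse storage format that lists only the existing entries: column `j` of `edge_index` holds the source node (first row) and the target node (second row) of edge `j`.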
- `data.edge_attr`: Edge feature matrix with shape `[num_edges, num_edge_features]`, i.e., $n_e \times d_e$ where $n_e$ is the number of edges and $d_e$ is the number of features per edge.
- `data.y`: Target to train against (may have arbitrary shape), e.g., node-level targets of shape `[num_nodes, *]` or graph-level targets of shape `[1, *]` or edge-level targets of shape `[num_edges, *]`.
- `data.pos`: Node position matrix with shape `[num_nodes, num_dimensions]`, e.g., spatial coordinates.

Note: none of these attributes is required. A `Data` object can be extended dynamically with arbitrary attributes via assignment, e.g. `data.yo = ...`, and existing attributes can be replaced or removed.
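
To make the COO layout concrete, here is a small library-free sketch that converts a dense adjacency matrix into the two-row `[2, num_edges]` form used by `edge_index`. The helper `dense_to_coo` is a hypothetical name, not a PyG function, and plain Python lists stand in for tensors.

```python
def dense_to_coo(adj):
    """Return [sources, targets] for every non-zero entry of a dense adjacency matrix."""
    sources, targets = [], []
    for i, row in enumerate(adj):
        for j, value in enumerate(row):
            if value != 0:
                sources.append(i)
                targets.append(j)
    return [sources, targets]

# Path graph 0 - 1 - 2, with each undirected edge stored in both directions.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
edge_index = dense_to_coo(adj)
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

Only the four non-zero entries are stored, which is why the second dimension of `edge_index` equals the number of (directed) edges rather than the number of nodes.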

#### Simple Example 1

In [1]:
import torch
from torch_geometric.data import Data

In [2]:
edge_index = torch.tensor([
    [0, 1, 1, 2],
    [1, 0, 2, 1],
], dtype=torch.long)
# this defines a graph with 3 nodes and 4 directed edges or 2 undirected edges
# 0->1, 1->0, 1->2, 2->1 or 0<->1, 1<->2

x = torch.tensor([
    [-1], [0], [1]
], dtype=torch.float) 
# this defines the node features, here we have 3 nodes and each node has 1 feature

data = Data(x=x, edge_index=edge_index)
# this defines the graph data structure
# x: node features, edge_index: edge index tensor

print(data) # data.__str__() = data.__repr__()
data # data.__repr__() -> gives us the shape of the tensors, x and edge_index

Data(x=[3, 1], edge_index=[2, 4])


Data(x=[3, 1], edge_index=[2, 4])

#### Simple Example 2

In [3]:
edge_index2 = torch.tensor([
    [0, 1],
    [1, 0],
    [1, 2],
    [2, 1]
])
# this defines a graph with 3 nodes and 4 directed edges or 2 undirected edges
# same as the previous `edge_index`, but in a different format

data2 = Data(x=x, edge_index=edge_index2.t().contiguous())

data2

Data(x=[3, 1], edge_index=[2, 4])

#### Some of the `Data` attributes and methods
Note: `Data` behaves much like a Python `dict`: attributes can be read by key, iterated over, and tested with `in`.

In [4]:
data.validate(raise_on_error=False) # checks if the data object is valid
# one of the checks it runs is to see if the edge_index values are within the range of the number of nodes
# the values must be between 0 and the number of nodes - 1

True
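
The range check mentioned in the comment can be sketched without PyG. This illustrates only that single check and is not the library's implementation; `edge_index_in_range` is a hypothetical helper.

```python
def edge_index_in_range(edge_index, num_nodes):
    """True if every node id in edge_index lies in [0, num_nodes - 1]."""
    return all(0 <= node < num_nodes for row in edge_index for node in row)

# The graph from Simple Example 1: 3 nodes, ids 0..2.
valid = edge_index_in_range([[0, 1, 1, 2], [1, 0, 2, 1]], num_nodes=3)
# Node id 3 does not exist in a 3-node graph.
invalid = edge_index_in_range([[0, 1, 1, 3], [1, 0, 3, 1]], num_nodes=3)
print(valid, invalid)  # True False
```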

In [5]:
print(data.keys) # `keys` is a method here, so without parentheses this prints the bound method object, not the list of keys

<bound method BaseData.keys of Data(x=[3, 1], edge_index=[2, 4])>


In [6]:
print(data.keys()) # this is how we call a method

['edge_index', 'x']


In [7]:
print(data['x']) # this is how we access the node features

tensor([[-1.],
        [ 0.],
        [ 1.]])


In [8]:
data.edge_index

tensor([[0, 1, 1, 2],
        [1, 0, 2, 1]])

In [9]:
data2.edge_index

tensor([[0, 1, 1, 2],
        [1, 0, 2, 1]])

In [10]:
for key, item in data:
    print(f"'{key}' found in data")

'x' found in data
'edge_index' found in data


In [11]:
'edge_attr' in data # checks if the data object has an edge_attr attribute

False

In [12]:
print("number of nodes:", data.num_nodes) # number of nodes in the graph
print("number of edges:", data.num_edges) # number of edges in the graph
print("number of node features:", data.num_node_features) # number of node features
print("number of edge features:", data.num_edge_features) # number of edge features
print("has isolated nodes:", data.has_isolated_nodes()) # checks if the graph has isolated nodes
print("has self-loops:", data.has_self_loops()) # checks if the graph has self-loops
print("is directed:", data.is_directed()) # checks if the graph is directed
print("is undirected:", data.is_undirected()) # checks if the graph is undirected

number of nodes: 3
number of edges: 4
number of node features: 1
number of edge features: 0
has isolated nodes: False
has self-loops: False
is directed: False
is undirected: True
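
The structural checks printed above can be reproduced on plain lists. This is a rough sketch under two assumptions: edge attributes are ignored, and a node whose only edge is a self-loop counts as isolated. The function names mirror the PyG methods for readability, but these are standalone helpers, not the library code.

```python
def has_self_loops(edge_index):
    src, dst = edge_index
    return any(u == v for u, v in zip(src, dst))

def has_isolated_nodes(edge_index, num_nodes):
    src, dst = edge_index
    # Assumption: self-loops do not count as a connection to the rest of the graph.
    touched = {n for u, v in zip(src, dst) if u != v for n in (u, v)}
    return len(touched) < num_nodes

def is_undirected(edge_index):
    src, dst = edge_index
    edges = set(zip(src, dst))
    # Undirected means every edge (u, v) has its reverse (v, u).
    return all((v, u) in edges for u, v in edges)

edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
print(has_self_loops(edge_index))         # False
print(has_isolated_nodes(edge_index, 3))  # False
print(is_undirected(edge_index))          # True
print(is_undirected([[0, 1], [1, 2]]))    # False: 0->1 and 1->2 have no reverse edges
```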


In [13]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [14]:
data = data.to(device) # moves the data object to the device

## Common benchmark datasets

### TUDataset - ENZYMES (Graph Level Classification)

In [15]:
from torch_geometric.datasets import TUDataset

In [16]:
dataset = TUDataset(root='datasets/ENZYMES', name='ENZYMES') # downloads the ENZYMES dataset to datasets/ENZYMES, a collection of 600 graphs (Data objects)

In [17]:
print(len(dataset)) # number of graphs in the dataset
print(dataset.num_classes)
print(dataset.num_node_features)

600
6
3


In [18]:
dataset

ENZYMES(600)

In [19]:
data = dataset[0] # get the first graph object in the dataset
data

Data(edge_index=[2, 168], x=[37, 3], y=[1])

In [20]:
data.is_undirected()

True

In [21]:
data.y

tensor([5])

In [22]:
print(dataset.y.shape)
dataset.y

torch.Size([600])


tensor([5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
        5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
        5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
        5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
        5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
        4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

#### Graph Level Dataset splitting

In [23]:
train_dataset = dataset[:540] # first 540 graphs
test_dataset = dataset[540:] # last 60 graphs

In [24]:
train_dataset

ENZYMES(540)

#### Graph Level Dataset shuffle

In [25]:
# method 1
dataset = dataset.shuffle() # shuffles the dataset

In [26]:
dataset[0]

Data(edge_index=[2, 128], x=[36, 3], y=[1])

In [27]:
# method 2
perm = torch.randperm(len(dataset)) # generates a random permutation of the indices
dataset = dataset[perm] # shuffles the dataset
dataset[0]

Data(edge_index=[2, 94], x=[25, 3], y=[1])
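
One practical consequence: the `dataset.y` output earlier shows the ENZYMES labels stored in blocks of one class each, so slicing `dataset[540:]` before shuffling yields a test set drawn from a single class. The sketch below mimics this with plain index lists. The labels are synthetic (blocks of 100 per class, in a made-up class order), and `random.Random` plays the role of `torch.randperm`.

```python
import random

num_graphs = 600
# Synthetic labels: 6 classes in contiguous blocks of 100 (class order is made up).
labels = [c for c in range(6) for _ in range(100)]

rng = random.Random(0)          # fixed seed so the split is reproducible
perm = list(range(num_graphs))
rng.shuffle(perm)               # random permutation of the graph indices

train_idx, test_idx = perm[:540], perm[540:]

naive_test_classes = set(labels[540:])                 # split without shuffling
shuffled_test_classes = {labels[i] for i in test_idx}  # split after shuffling

print(len(train_idx), len(test_idx))  # 540 60
print(naive_test_classes)             # {5}: a single class only
print(len(shuffled_test_classes))     # number of classes present after shuffling
```

Shuffling first and slicing afterwards keeps the 540/60 sizes while mixing the classes across both parts.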

### Planetoid - Cora (Node Level Semi-Supervised Classification)

In [28]:
from torch_geometric.datasets import Planetoid

In [29]:
dataset = Planetoid(root='datasets/Cora/', name='Cora') # downloads the Cora dataset to datasets/Cora/; Cora is a single citation graph with 2708 nodes (one Data object)
dataset

Cora()

In [30]:
len(dataset)

1

In [31]:
print("num classes:", dataset.num_classes)
print("num node features:", dataset.num_node_features)
print("num edge features:", dataset.num_edge_features)
dataset.y

num classes: 7
num node features: 1433
num edge features: 0


tensor([3, 4, 4,  ..., 3, 3, 3])

In [32]:
data = dataset[0]
print("has isolated nodes:", data.has_isolated_nodes())
print("has self-loops:", data.has_self_loops())
print("is directed:", data.is_directed())
data

has isolated nodes: False
has self-loops: False
is directed: False


Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

In [33]:
for key, item in data:
    print(f"'{key}' found in data")

'x' found in data
'edge_index' found in data
'y' found in data
'train_mask' found in data
'val_mask' found in data
'test_mask' found in data


#### Node Level Dataset splitting (train, val, test) through `masking`

In [34]:
data['train_mask'] # a mask to identify the nodes that are to be used for training

tensor([ True,  True,  True,  ..., False, False, False])

In [35]:
print(data.train_mask.sum().item()) # number of nodes to be used for training
print(data.train_mask.sum().item() / data.num_nodes) # fraction of nodes to be used for training
print(data.val_mask.sum().item()) # number of nodes to be used for validation
print(data.val_mask.sum().item() / data.num_nodes) # fraction of nodes to be used for validation
print(data.test_mask.sum().item()) # number of nodes to be used for testing

140
0.051698670605613
500
0.18463810930576072
1000
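
Boolean masks select a subset of nodes without splitting the graph itself. The sketch below uses made-up labels and masks for six nodes, with plain lists instead of tensors, and then repeats the bookkeeping for the Cora counts printed above.

```python
# Toy labels and masks for six nodes (made-up values).
y          = [3, 4, 4, 0, 3, 2]
train_mask = [True, True, False, False, False, False]
val_mask   = [False, False, True, True, False, False]
test_mask  = [False, False, False, False, True, False]

def select(values, mask):
    """Keep the entries whose mask value is True (what `values[mask]` does for tensors)."""
    return [v for v, keep in zip(values, mask) if keep]

train_labels = select(y, train_mask)
train_fraction = sum(train_mask) / len(y)

# Nodes that belong to none of the three masks.
unused = [i for i in range(len(y))
          if not (train_mask[i] or val_mask[i] or test_mask[i])]

# Same bookkeeping with the Cora numbers: 2708 nodes, 140 / 500 / 1000 masked.
cora_unmasked = 2708 - (140 + 500 + 1000)

print(train_labels)    # [3, 4]
print(unused)          # [5]
print(cora_unmasked)   # 1068
```

So in Cora, 1068 nodes belong to none of the three masks. They remain part of the graph (their features and edges are still there), but they contribute to neither the loss nor the reported metrics.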


In [36]:
data.shuffle() # can't shuffle a data object, i.e., a single graph 

AttributeError: 'GlobalStorage' object has no attribute 'shuffle'

## Mini-batches and `torch_geometric.loader.DataLoader`
> PyG achieves parallelization over a mini-batch by creating sparse block diagonal adjacency matrices (defined by `edge_index`) and concatenating feature and target matrices in the node dimension.

\- PyG Docs
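
The quoted idea can be sketched with plain lists: node features are concatenated, and each graph's edge indices are shifted by the number of nodes placed before it, which is equivalent to a block-diagonal adjacency matrix. The `collate` helper and the dict-based graph format are assumptions made for illustration. The `batch` and `ptr` names follow the fields visible in the `DataBatch` outputs, but this is not PyG's implementation.

```python
def collate(graphs):
    """graphs: list of dicts with 'x' (list of feature rows) and 'edge_index' ([src, dst])."""
    x, src, dst, batch, ptr = [], [], [], [], [0]
    for graph_id, g in enumerate(graphs):
        offset = ptr[-1]                      # nodes already placed in the batch
        x.extend(g["x"])                      # concatenate along the node dimension
        src.extend(u + offset for u in g["edge_index"][0])
        dst.extend(v + offset for v in g["edge_index"][1])
        batch.extend([graph_id] * len(g["x"]))  # maps each node to its graph
        ptr.append(offset + len(g["x"]))        # cumulative node counts
    return {"x": x, "edge_index": [src, dst], "batch": batch, "ptr": ptr}

g1 = {"x": [[-1.0], [0.0], [1.0]], "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]]}
g2 = {"x": [[5.0], [6.0]], "edge_index": [[0, 1], [1, 0]]}
big = collate([g1, g2])

print(big["edge_index"])  # [[0, 1, 1, 2, 3, 4], [1, 0, 2, 1, 4, 3]]
print(big["batch"])       # [0, 0, 0, 1, 1]
print(big["ptr"])         # [0, 3, 5]
```

Because no edge connects nodes of different graphs, message passing on the merged graph treats every original graph independently. `ptr` has one more entry than there are graphs, which matches `ptr=[33]` for batches of 32 graphs.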

In [37]:
from torch_geometric.loader import DataLoader

In [38]:
dataset_cora = Planetoid(root='datasets/Cora/', name='Cora')
dataset_enzymes = TUDataset(root='datasets/ENZYMES/', name='ENZYMES', use_node_attr=True) # use_node_attr=True also loads the continuous node attributes, which is why x has 21 features per node below instead of the 3 reported earlier

In [39]:
loader_cora = DataLoader(dataset_cora, batch_size=32, shuffle=True) # creates a data loader object

In [40]:
for batch in loader_cora:
    print(batch.num_graphs, batch) # a batch of graphs

1 DataBatch(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], batch=[2708], ptr=[2])


In [41]:
dataset_enzymes

ENZYMES(600)

In [42]:
loader_enzymes = DataLoader(dataset_enzymes, batch_size=32, shuffle=True)
loader_enzymes

<torch_geometric.loader.dataloader.DataLoader at 0x2998179d6d0>

In [43]:
for i, batch in enumerate(loader_enzymes):
    print(i, batch.num_graphs, batch)

0 32 DataBatch(edge_index=[2, 4164], x=[1055, 21], y=[32], batch=[1055], ptr=[33])
1 32 DataBatch(edge_index=[2, 4044], x=[1020, 21], y=[32], batch=[1020], ptr=[33])
2 32 DataBatch(edge_index=[2, 4182], x=[1045, 21], y=[32], batch=[1045], ptr=[33])
3 32 DataBatch(edge_index=[2, 4036], x=[1010, 21], y=[32], batch=[1010], ptr=[33])
4 32 DataBatch(edge_index=[2, 3556], x=[913, 21], y=[32], batch=[913], ptr=[33])
5 32 DataBatch(edge_index=[2, 4126], x=[1059, 21], y=[32], batch=[1059], ptr=[33])
6 32 DataBatch(edge_index=[2, 3986], x=[1051, 21], y=[32], batch=[1051], ptr=[33])
7 32 DataBatch(edge_index=[2, 4336], x=[1129, 21], y=[32], batch=[1129], ptr=[33])
8 32 DataBatch(edge_index=[2, 3658], x=[948, 21], y=[32], batch=[948], ptr=[33])
9 32 DataBatch(edge_index=[2, 4184], x=[1035, 21], y=[32], batch=[1035], ptr=[33])
10 32 DataBatch(edge_index=[2, 4042], x=[1098, 21], y=[32], batch=[1098], ptr=[33])
11 32 DataBatch(edge_index=[2, 4016], x=[1207, 21], y=[32], batch=[1207], ptr=[33])
12 32 

## Data Transforms

In [44]:
from torch_geometric.datasets import ShapeNet

In [45]:
dataset_shapenet = ShapeNet(root='datasets/ShapeNet/', categories=['Airplane']) # downloads ShapeNet to datasets/ShapeNet/ and loads the Airplane category; the loaded split holds 2349 point clouds (Data objects), see the output below
dataset_shapenet

ShapeNet(2349, categories=['Airplane'])

In [46]:
print("num data(graphs)", len(dataset_shapenet))
print("num classes:", dataset_shapenet.num_classes)
print("num node features:", dataset_shapenet.num_node_features)
print("num edge features:", dataset_shapenet.num_edge_features)
print("num labels:", len(dataset_shapenet.y))
print("")
dataset_shapenet.y

num data(graphs) 2349
num classes: 50
num node features: 3
num edge features: 0
num labels: 6044171



tensor([0, 0, 3,  ..., 0, 0, 0])

In [47]:
dataset_shapenet[0] 
# 2518 nodes (points), 3 features per node, 2518 labels (one per node), 1 category for the whole object
# there is no `edge_index` yet: this is a raw point cloud, so 0 edges and 0 edge features
# each node is a point in 3D space (`pos`), and the points together form a 3D object (!)

Data(x=[2518, 3], y=[2518], pos=[2518, 3], category=[1])

In [48]:
dataset_shapenet[0].x

tensor([[-0.0392,  0.3344,  0.9416],
        [ 0.0011,  0.3488, -0.9372],
        [-0.2507,  0.9366,  0.2447],
        ...,
        [ 0.6270, -0.5863,  0.5130],
        [-0.2090,  0.9760, -0.0607],
        [-0.2459,  0.9653, -0.0878]])

In [49]:
dataset_shapenet[0].y

tensor([0, 0, 3,  ..., 3, 1, 1])

In [50]:
dataset_shapenet[0].pos

tensor([[-0.0145, -0.0164,  0.0320],
        [-0.0119, -0.0657,  0.0145],
        [-0.1424, -0.0370, -0.0519],
        ...,
        [ 0.0342, -0.0931, -0.0523],
        [-0.0108, -0.0600,  0.0522],
        [-0.0165, -0.0593,  0.0560]])

In [51]:
dataset_shapenet[0].y.unique(return_counts=True) # classes of the nodes

(tensor([0, 1, 2, 3]), tensor([1326,  589,  323,  280]))

In [52]:
# applying transforms to the dataset
import torch_geometric.transforms as T

In [53]:
dataset_shapenet_transformed = ShapeNet(
    root='datasets/ShapeNet_Transformed/', 
    categories=['Airplane'], 
    pre_transform=T.KNNGraph(k=6)
) # creates a k-NN graph for each graph in the dataset

In [54]:
dataset_shapenet_transformed2 = ShapeNet(
    root='datasets/ShapeNet_Transformed2/', 
    categories=['Airplane'], 
    pre_transform=T.KNNGraph(k=6), 
    transform=T.RandomJitter(0.01)
) # creates a k-NN graph for each graph in the dataset and applies random jittering to the node positions

In [55]:
dataset_shapenet_transformed

ShapeNet(2349, categories=['Airplane'])

In [56]:
dataset_shapenet_transformed2

ShapeNet(2349, categories=['Airplane'])

In [57]:
dataset_shapenet_transformed[0]

Data(x=[2518, 3], y=[2518], pos=[2518, 3], category=[1], edge_index=[2, 15108])

In [58]:
dataset_shapenet_transformed2[0]

Data(x=[2518, 3], y=[2518], pos=[2518, 3], category=[1], edge_index=[2, 15108])
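
What a k-NN graph transform does can be sketched by brute force on a handful of points: each point receives one incoming edge from each of its `k` nearest neighbours. The helper `knn_edges` is hypothetical, the edge direction (neighbour to centre) and the exclusion of self-loops are assumptions of this sketch, and the quadratic search is for illustration only.

```python
import math

def knn_edges(pos, k):
    """Connect every point to its k nearest neighbours (no self-loops)."""
    src, dst = [], []
    for i, p in enumerate(pos):
        # Distance from point i to every other point, sorted ascending.
        others = sorted((math.dist(p, q), j) for j, q in enumerate(pos) if j != i)
        for _, j in others[:k]:
            src.append(j)  # neighbour
            dst.append(i)  # centre point
    return [src, dst]

pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
edge_index = knn_edges(pos, k=2)

print(edge_index)          # [[1, 2, 0, 2, 0, 1, 2, 1], [0, 0, 1, 1, 2, 2, 3, 3]]
print(len(edge_index[0]))  # 8 = 4 points * 2 neighbours
```

The edge count is `num_nodes * k`, consistent with the transformed sample above: 2518 * 6 = 15108. Note that the result is not necessarily symmetric: the far-away point 3 receives edges from points 2 and 1, but it is nobody's nearest neighbour, so no edge starts at it.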

## Implementing a Graph Convolutional Network (GCN) for Node Classification

In [59]:
dataset_cora

Cora()

In [60]:
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

### GCN Model

In [61]:
class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = GCNConv(dataset_cora.num_node_features, 16) # 1433 input features, 16 output features
        self.conv2 = GCNConv(16, dataset_cora.num_classes) # 16 input features, 7 output features
    
    def forward(self, data: Data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        
        x = self.conv2(x, edge_index)
        
        return F.log_softmax(x, dim=1)
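
For intuition about what a GCN layer does to the node features, the sketch below applies only the propagation step of the standard GCN formulation, $\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} X$, to the three-node path graph from Simple Example 1. It deliberately omits the learnable weight matrix and bias that a real `GCNConv` layer also applies. `gcn_propagate` is a hypothetical helper written with plain lists.

```python
import math

def gcn_propagate(x, edge_index, num_nodes):
    """Symmetrically normalised neighbourhood averaging with added self-loops."""
    # Add a self-loop for every node: A + I.
    src = list(edge_index[0]) + list(range(num_nodes))
    dst = list(edge_index[1]) + list(range(num_nodes))

    # Degree of each node, counted on the graph with self-loops.
    deg = [0] * num_nodes
    for v in dst:
        deg[v] += 1

    out = [[0.0] * len(x[0]) for _ in range(num_nodes)]
    for u, v in zip(src, dst):
        norm = 1.0 / math.sqrt(deg[u] * deg[v])  # entry of D^-1/2 (A + I) D^-1/2
        for f, value in enumerate(x[u]):
            out[v][f] += norm * value            # message from u aggregated at v
    return out

x = [[-1.0], [0.0], [1.0]]
edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
out = gcn_propagate(x, edge_index, num_nodes=3)
print(out)  # approximately [[-0.5], [0.0], [0.5]]
```

Each node ends up with a degree-weighted average over itself and its neighbours. Stacking two such layers, as the model above does, lets information travel two hops.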

In [62]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Training

In [63]:
model = GCN().to(device)
data = dataset_cora[0].to(device)

In [64]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4) # weight_decay adds an L2 penalty on the parameters (L2 regularization), which discourages large weights

In [65]:
model.train() # sets the model to training mode
for epoch in range(5):
    optimizer.zero_grad() # clears the gradients
    z = model(data) # forward pass
    loss = F.nll_loss(z[data.train_mask], data.y[data.train_mask]) # loss calculation
    loss.backward() # backward pass
    optimizer.step() # updates the parameters
    
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

Epoch: 000, Loss: 1.9537
Epoch: 001, Loss: 1.8424
Epoch: 002, Loss: 1.7221
Epoch: 003, Loss: 1.6156
Epoch: 004, Loss: 1.4786
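
A quick sanity check on the first loss value: `log_softmax` followed by `nll_loss` is the cross-entropy, and a model that predicts all 7 Cora classes with equal probability has a cross-entropy of $\ln 7 \approx 1.9459$, close to the 1.9537 printed at epoch 0. The sketch below recomputes this with the standard library. The two helper functions are simplified stand-ins for the PyTorch functions of the same name (mean reduction, no class weights), and the targets are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of one row of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def nll_loss(log_probs, targets):
    """Mean negative log-probability assigned to the target classes."""
    return -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)

# Four nodes, 7 classes, identical logits: a uniform prediction.
uniform_logits = [[0.0] * 7 for _ in range(4)]
targets = [3, 4, 4, 0]

loss = nll_loss([log_softmax(row) for row in uniform_logits], targets)
print(round(loss, 4))  # 1.9459
```

An untrained network starts near this value, and the loss falls as the predicted probability of the correct class increases.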


In [66]:
model.train() # sets the model to training mode
for epoch in range(epoch+1, 45): # resumes from the last epoch of the previous cell, so that cell must run first
    optimizer.zero_grad() # clears the gradients
    z = model(data) # forward pass
    loss = F.nll_loss(z[data.train_mask], data.y[data.train_mask]) # loss calculation
    loss.backward() # backward pass
    optimizer.step() # updates the parameters
    
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

Epoch: 005, Loss: 1.3317
Epoch: 006, Loss: 1.2439
Epoch: 007, Loss: 1.1144
Epoch: 008, Loss: 1.0214
Epoch: 009, Loss: 0.8766
Epoch: 010, Loss: 0.7678
Epoch: 011, Loss: 0.6962
Epoch: 012, Loss: 0.5990
Epoch: 013, Loss: 0.5918
Epoch: 014, Loss: 0.4556
Epoch: 015, Loss: 0.4027
Epoch: 016, Loss: 0.3995
Epoch: 017, Loss: 0.3156
Epoch: 018, Loss: 0.2991
Epoch: 019, Loss: 0.2431
Epoch: 020, Loss: 0.2286
Epoch: 021, Loss: 0.2087
Epoch: 022, Loss: 0.2212
Epoch: 023, Loss: 0.1821
Epoch: 024, Loss: 0.1498
Epoch: 025, Loss: 0.1630
Epoch: 026, Loss: 0.1141
Epoch: 027, Loss: 0.1274
Epoch: 028, Loss: 0.1039
Epoch: 029, Loss: 0.1172
Epoch: 030, Loss: 0.0987
Epoch: 031, Loss: 0.0842
Epoch: 032, Loss: 0.0810
Epoch: 033, Loss: 0.0920
Epoch: 034, Loss: 0.0869
Epoch: 035, Loss: 0.0844
Epoch: 036, Loss: 0.0723
Epoch: 037, Loss: 0.0592
Epoch: 038, Loss: 0.0763
Epoch: 039, Loss: 0.0803
Epoch: 040, Loss: 0.0863
Epoch: 041, Loss: 0.0718
Epoch: 042, Loss: 0.0612
Epoch: 043, Loss: 0.0578
Epoch: 044, Loss: 0.0750


### Evaluation

In [67]:
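model.eval() # sets the model to evaluation mode (disables dropout)
pred = model(data).argmax(dim=1) # predicted class = index of the largest log-probability
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum() # correct predictions on the test nodes only
acc = int(correct) / int(data.test_mask.sum()) # fraction of test nodes classified correctly
print(f'Accuracy: {acc:.4f}')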

Accuracy: 0.7870
