# Introduction to PyTorch Geometric (PyG)

PyTorch Geometric (PyG): https://pytorch-geometric.readthedocs.io/en/latest/

## 0. Installation

See https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html

## 1. Data Format

See https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html#data-handling-of-graphs and understand the meaning of `edge_index`.
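
As a minimal sketch of the format (the three-node graph is made up for illustration): `edge_index` is a `[2, num_edges]` integer tensor in COO format, where row 0 holds the source nodes and row 1 holds the target nodes. An undirected edge is stored as two directed edges.

```python
import torch

# Hypothetical toy graph with 3 nodes and undirected edges 0-1 and 1-2.
# Each undirected edge appears twice, once per direction.
edge_index = torch.tensor([[0, 1, 1, 2],   # source nodes
                           [1, 0, 2, 1]],  # target nodes
                          dtype=torch.long)

num_nodes = 3

# Rebuild the dense adjacency matrix to check the encoding.
adj = torch.zeros(num_nodes, num_nodes, dtype=torch.long)
adj[edge_index[0], edge_index[1]] = 1

print(edge_index.shape)
print(adj)
```

Because every edge is listed in both directions, the rebuilt adjacency matrix is symmetric.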

## 2. Example

The following code uses PyG to build a GCN for node classification on the Citeseer dataset. We will walk through the code and add comments during this lecture.

In [9]:
from torch_geometric.datasets import Planetoid # Planetoid: dataset class for the Cora, CiteSeer, and PubMed citation graphs
import torch
dataset = Planetoid(root='/tmp/Citeseer', name='Citeseer') # root: directory to save dataset files; name: dataset name

In [10]:
dataset 

Citeseer()

In [11]:
len(dataset)

1

In [12]:
dataset[0]

Data(x=[3327, 3703], edge_index=[2, 9104], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327])

In [13]:
print('#training nodes:', (dataset[0].train_mask).sum())

#training nodes: tensor(120)


In [14]:
print('#validation nodes:', (dataset[0].val_mask).sum())

#validation nodes: tensor(500)


In [15]:
dataset[0].train_mask

tensor([ True,  True,  True,  ..., False, False, False])

In [16]:
torch.arange(3327)[dataset[0].train_mask]

tensor([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
         14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
         28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,
         42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,
         56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,  68,  69,
         70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,
         84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,
         98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
        112, 113, 114, 115, 116, 117, 118, 119])
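
The mask cells above rely on boolean indexing. A small sketch with a made-up six-node mask shows how a mask relates to a node count and to an index list:

```python
import torch

# Hypothetical mask over 6 nodes; True marks a training node.
train_mask = torch.tensor([True, True, False, False, True, False])

# Summing a boolean tensor counts the True entries.
num_train = int(train_mask.sum())

# Two equivalent ways to recover the indices of the True entries.
train_idx = train_mask.nonzero(as_tuple=True)[0]
same_idx = torch.arange(train_mask.numel())[train_mask]  # the trick used above

print(num_train, train_idx.tolist())
```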

In [17]:
dataset[0].x.shape

torch.Size([3327, 3703])

In [18]:
print("#classes:", dataset[0].y.max() + 1)

#classes: tensor(6)


In [19]:
dataset[0].edge_index

tensor([[   0,    1,    1,  ..., 3324, 3325, 3326],
        [ 628,  158,  486,  ..., 2820, 1643,   33]])

In [20]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv # torch_geometric.nn: provides many GNN layers

class GCN(torch.nn.Module):
    """Parameters
    ----
    num_node_features: input feature dimension for the GCN
    num_hidden: hidden dimension of the GCN
    num_classes: output dimension of the GCN
    """
    
    def __init__(self, num_node_features, num_hidden, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, num_hidden) 
        self.conv2 = GCNConv(num_hidden, num_classes)

    def forward(self, x, edge_index):
        """
        ---
        x: node features
        edge_index: edges for the graph (could be converted from the adjacency matrix)
        """
        x = self.conv1(x, edge_index)
        x = F.relu(x) # activation function
        x = F.dropout(x, training=self.training) # dropout to alleviate overfitting (active only in training mode)
        x = self.conv2(x, edge_index) 

        return F.log_softmax(x, dim=1) # log-softmax: log-probabilities over the classes for each node
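
To see what one GCN layer computes, here is a dense sketch of the propagation rule from the GCN paper, D^{-1/2} (A + I) D^{-1/2} X W. It assumes `GCNConv`'s default behaviour (self-loops added, symmetric normalization) and leaves out the bias term and the sparse message passing that PyG actually uses. The helper name `dense_gcn_propagate` and the toy graph are made up for illustration.

```python
import torch

def dense_gcn_propagate(x, adj, weight):
    """Compute D^{-1/2} (A + I) D^{-1/2} X W with dense tensors (no bias)."""
    a_hat = adj + torch.eye(adj.size(0))      # add self-loops
    deg = a_hat.sum(dim=1)                    # degrees, counting the self-loop
    d_inv_sqrt = deg.pow(-0.5)
    # Scale entry (i, j) by 1 / sqrt(deg_i * deg_j).
    norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
    return norm_adj @ x @ weight

# Hypothetical 3-node path graph 0-1-2 with one-hot node features.
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
x = torch.eye(3)
weight = torch.ones(3, 2)                     # maps 3 input features to 2 outputs

out = dense_gcn_propagate(x, adj, weight)
print(out.shape)
```

Each output row mixes a node's own features with those of its neighbours, which is why the layer needs both `x` and the graph structure.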

In [21]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data = dataset[0].to(device)
model = GCN(num_node_features=data.x.shape[1], 
            num_hidden=16,
            num_classes=(data.y.max()+1).item()
           ).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train() # put the model in training mode (enables dropout)
x, edge_index = data.x, data.edge_index

for epoch in range(200): # training loop
    optimizer.zero_grad() # reset the gradients accumulated in the previous step
    out = model(x, edge_index) # forward pass through the GCN
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask]) # loss on the training nodes only
    loss.backward() # backpropagation: compute the gradients for the parameters
    optimizer.step() # update the parameters using the computed gradients
    if epoch % 10 == 0:
        print('Epoch {0}: {1}'.format(epoch, loss.item()))

Epoch 0: 1.7929905652999878
Epoch 10: 0.3723503351211548
Epoch 20: 0.09584302455186844
Epoch 30: 0.06954531371593475
Epoch 40: 0.051150355488061905
Epoch 50: 0.031871359795331955
Epoch 60: 0.03918081894516945
Epoch 70: 0.02406204678118229
Epoch 80: 0.02507113851606846
Epoch 90: 0.04897269234061241
Epoch 100: 0.03246430307626724
Epoch 110: 0.04963958263397217
Epoch 120: 0.03701436519622803
Epoch 130: 0.041598498821258545
Epoch 140: 0.020775815472006798
Epoch 150: 0.01904776133596897
Epoch 160: 0.02503553405404091
Epoch 170: 0.03285317122936249
Epoch 180: 0.026185614988207817
Epoch 190: 0.020884616300463676
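
The model returns log-probabilities and the loop applies `F.nll_loss` to them. As a quick check on made-up scores, this combination gives the same value as `F.cross_entropy` applied to the raw scores. The tensors below are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 6)                    # hypothetical raw scores: 5 nodes, 6 classes
labels = torch.tensor([0, 2, 5, 1, 3])
mask = torch.tensor([True, False, True, True, False])  # hypothetical training mask

# Loss as computed in the training loop: log-softmax followed by NLL.
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1)[mask], labels[mask])

# Same quantity computed directly from the raw scores.
loss_ce = F.cross_entropy(logits[mask], labels[mask])

print(loss_nll.item(), loss_ce.item())
```

Only the masked rows enter the loss, so the unlabeled nodes still take part in message passing without contributing to the gradient through their labels.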


In [22]:
model.eval() # put the model in evaluation mode (disables dropout)
pred = model(x, edge_index).argmax(dim=1) # obtain the prediction for all the nodes
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f'Accuracy: {acc:.4f}')

Accuracy: 0.6760
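
The evaluation step can be wrapped in a small helper. This is a sketch: `masked_accuracy` is a hypothetical name, and a single linear layer stands in for the trained GCN. Wrapping inference in `torch.no_grad()` is optional but avoids building the autograd graph.

```python
import torch

def masked_accuracy(logits, y, mask):
    """Fraction of correctly classified nodes among those selected by mask."""
    pred = logits.argmax(dim=1)
    correct = (pred[mask] == y[mask]).sum().item()
    return correct / int(mask.sum())

# Hypothetical stand-in for a trained model: a single linear layer.
model = torch.nn.Linear(4, 3)
x = torch.randn(5, 4)

model.eval()                 # evaluation mode, as in the cell above
with torch.no_grad():        # skip building the autograd graph
    logits = model(x)

# Hand-written scores so the result is easy to verify: predictions are [0, 1, 2, 0].
toy_logits = torch.tensor([[2., 0., 0.],
                           [0., 2., 0.],
                           [0., 0., 2.],
                           [2., 0., 0.]])
toy_y = torch.tensor([0, 1, 1, 2])
toy_mask = torch.tensor([True, True, True, False])

acc = masked_accuracy(toy_logits, toy_y, toy_mask)   # 2 of the 3 masked nodes are correct
print(f'Accuracy: {acc:.4f}')
```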


## Q: How to use PyG in our project?

Recall that `x = self.conv1(x, edge_index)` takes two inputs: the node features `x` and the edge indices `edge_index`.

So the main task is to convert the provided data into the PyG format.

In [32]:
import scipy.sparse as sp
import numpy as np
import json

adj = sp.load_npz('adj.npz')
feat  = np.load('features.npy')
labels = np.load('labels.npy')
with open('splits.json') as f: # close the file after reading
    splits = json.load(f)

idx_train, idx_test = splits['idx_train'], splits['idx_test']
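
The Planetoid example selects nodes with boolean masks, while the project data provides index lists. One possible way to convert between the two, sketched with made-up sizes and indices (`make_mask` is a hypothetical helper):

```python
import torch

num_nodes = 8                 # hypothetical graph size
idx_train = [0, 3, 5]         # hypothetical index lists, as loaded from JSON
idx_test = [1, 6]

def make_mask(idx, size):
    """Return a boolean mask of length size that is True at the given indices."""
    mask = torch.zeros(size, dtype=torch.bool)
    mask[torch.as_tensor(idx, dtype=torch.long)] = True
    return mask

train_mask = make_mask(idx_train, num_nodes)
test_mask = make_mask(idx_test, num_nodes)

print(train_mask)
print(test_mask)
```

Indexing with the list directly, as in `out[idx_train]`, also works. The masks are mainly useful for reusing code written for the Planetoid data.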


In [61]:
feat.shape

(2480, 1390)

In [56]:
from torch_geometric.utils import from_scipy_sparse_matrix
our_edge_index, _ = from_scipy_sparse_matrix(adj) 
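
Conceptually, the conversion reads the row and column indices of the nonzero entries and stacks them. The NumPy/SciPy sketch below mirrors that idea on a made-up three-node matrix. It illustrates the idea only and is not the library's source code.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical 3-node symmetric adjacency matrix (path graph 0-1-2).
adj = sp.csr_matrix(np.array([[0, 1, 0],
                              [1, 0, 1],
                              [0, 1, 0]]))

coo = adj.tocoo()                                   # COO exposes .row and .col
edge_index = np.vstack([coo.row, coo.col]).astype(np.int64)

print(edge_index.shape)   # one column per nonzero entry
print(edge_index)
```

A tensor version can then be obtained with `torch.from_numpy(edge_index)`. The second value returned by `from_scipy_sparse_matrix` holds the edge weights, which the cell above discards with `_`.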


In [57]:
our_x = torch.FloatTensor(feat[idx_train]) # convert the training rows of the NumPy array to a torch.FloatTensor
our_x.shape

torch.Size([496, 1390])

In [58]:
our_model = GCN(num_node_features=our_x.shape[1], 
            num_hidden=16,
            num_classes=int(labels.max()) + 1, # plain Python int rather than a NumPy integer
           ).to(device)

In [59]:
our_model

GCN(
  (conv1): GCNConv(1390, 16)
  (conv2): GCNConv(16, 7)
)

In [60]:
our_model(our_x, our_edge_index).shape # forward pass of the GCN; returns one output row per node

RuntimeError: index 1084 is out of bounds for dimension 0 with size 496
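
This error is consistent with `our_x` holding only the 496 training rows while `our_edge_index` still uses node ids from the whole graph (for example, 1084). Below is a minimal sketch of the same failure and one way around it. It assumes a made-up six-node graph and uses a plain neighbour sum in place of a real GCN layer: keep all node features for the forward pass and select the training rows from the output.

```python
import torch

num_nodes = 6                                   # hypothetical graph size
feat = torch.arange(num_nodes * 2, dtype=torch.float).reshape(num_nodes, 2)
edge_index = torch.tensor([[0, 1, 4, 5],        # source nodes
                           [1, 0, 5, 4]])       # target nodes
idx_train = [0, 1]

# Slicing the features first breaks the node ids used by edge_index.
x_train_only = feat[idx_train]                  # only 2 rows remain
try:
    x_train_only[edge_index[0]]                 # node id 4 has no row here
    failed = False
except (IndexError, RuntimeError):
    failed = True

# Keep every node for the forward pass, then select the training rows afterwards.
x_all = feat
out = torch.zeros_like(x_all)
out.index_add_(0, edge_index[1], x_all[edge_index[0]])  # sum features of incoming neighbours
out_train = out[idx_train]

print(failed, out_train)
```

In the notebook this would correspond to building `our_x` from all of `feat` and indexing the model output with `idx_train` when computing the loss.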

In [44]:
our_model(our_x, our_edge_index).argmax(1)

tensor([3, 4, 4,  ..., 4, 4, 1])

## Q/A 
Any other questions on PyG and our project?

Top 3 performances: 65-68

20-23%: is your model wrong, or is the data wrong?

[IMPORTANT] On April 15th, I updated `labels.npy`.