# RTML Lab 11: GNNs

Today we explore GNNs, in particular [Graph Convolutional Networks by Kipf and Welling (2017)](https://arxiv.org/abs/1609.02907).

This tutorial is based on the [PyTorch Geometric tutorial for GCN](https://pytorch-geometric.readthedocs.io/en/latest/get_started/introduction.html).

## Setup

PyG only seems to work with recent versions of PyTorch, so you may need to upgrade:

In [1]:
id = 123012
name = 'Todsavad Tangtortan'

In [2]:
# !pip install --upgrade torch
# !pip install torch_geometric

Be sure to restart your kernel if you're doing this in Jupyter.

## Our first graph

OK, with that set up, let's use PyG to build a graph:

In [3]:
import torch_geometric
torch_geometric.__version__

'2.3.0'

In [4]:
import torch
from torch_geometric.data import Data
import os

# os.environ['http_proxy'] = 'http://192.41.170.23:3128'
# os.environ['https_proxy'] = 'http://192.41.170.23:3128'
    
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index)

In [5]:
data

Data(x=[3, 1], edge_index=[2, 4])

So we see that this defines a graph with three nodes and two edges. The edge index
is in so-called [COO format](https://pytorch.org/docs/stable/sparse.html#sparse-coo-docs) (a format for representing sparse matrices)
and in this case is a $2\times 2|E|$ tensor.

The indices are indexing the set of nodes, in the range 0 to $N-1$.
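
To make the COO layout concrete, here is a minimal, dependency-free sketch that expands the edge index above into a dense adjacency matrix. The helper name `coo_to_dense` is hypothetical and is not part of PyG; it only illustrates how each column of `edge_index` names one (source, target) entry.

```python
# Edge index in COO layout: row 0 holds source nodes, row 1 holds target nodes.
edge_index = [[0, 1, 1, 2],
              [1, 0, 2, 1]]
num_nodes = 3

def coo_to_dense(edge_index, num_nodes):
    """Build a dense 0/1 adjacency matrix from a 2 x num_entries edge index."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    sources, targets = edge_index
    for src, dst in zip(sources, targets):
        adj[src][dst] = 1
    return adj

adjacency = coo_to_dense(edge_index, num_nodes)
for row in adjacency:
    print(row)
```

The matrix is symmetric because every undirected edge appears once in each direction.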

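The same graph can also be written as a list of index pairs, which must then be transposed into the $2\times 2|E|$ layout before it is passed to `Data`. Note that each undirected edge is denoted by two pairs of indices:

In [6]:
edge_index = torch.tensor([[0, 1],
                           [1, 0],
                           [1, 2],
                           [2, 1]], dtype=torch.long)
# Transpose from shape [4, 2] to the [2, 4] layout that PyG expects
edge_index = edge_index.t().contiguous()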

The tensor `x` is providing the *features* of the nodes in the graph, in this case a scalar for each node. As we know,
node features can be vectors as well, but all nodes must have the same feature dimensionality.

We can check the validity of our graph with the `validate()` method:

In [7]:
data.validate(raise_on_error=True)

True

Try some other attributes and methods:
- `x`
- `edge_index`
- `edge_attr`
- `num_nodes`
- `num_edges`
- `num_node_features`
- `has_isolated_nodes()`
- `has_self_loops()`
- `is_directed()`
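
As a rough illustration of what the last three checks mean, the following sketch re-implements them in plain Python on a COO edge index. These are simplified stand-ins written for this lab, not PyG's actual code, and the function bodies are assumptions about the intended semantics (for instance, a node whose only edge is a self-loop is treated as isolated here).

```python
def has_self_loops(edge_index):
    """True if any edge starts and ends at the same node."""
    return any(s == t for s, t in zip(*edge_index))

def is_directed(edge_index):
    """True if some edge (s, t) has no matching reverse edge (t, s)."""
    edges = set(zip(*edge_index))
    return any((t, s) not in edges for s, t in edges)

def has_isolated_nodes(edge_index, num_nodes):
    """True if some node is not an endpoint of any non-self-loop edge."""
    touched = {n for s, t in zip(*edge_index) if s != t for n in (s, t)}
    return len(touched) < num_nodes

edge_index = [[0, 1, 1, 2],
              [1, 0, 2, 1]]
print(has_self_loops(edge_index))         # no edge of the form (i, i)
print(is_directed(edge_index))            # every edge has its reverse
print(has_isolated_nodes(edge_index, 3))  # all three nodes have a neighbor
```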

And you can move a graph to a GPU just like a tensor:

In [8]:
device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
data = data.to(device)

## PyG datasets (Cora)

Let's load the Cora dataset:

In [9]:
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='./data/Cora', name='Cora')

In [10]:
data = dataset[0]
data

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

Try a few of the dataset's methods:
- `len(dataset)`
- `dataset.num_classes`
- `dataset.num_node_features`
- `dataset[0]`

From this we see that Cora is a single graph.
We will only be doing transduction with this dataset.
After assigning the single graph to a variable such as `data`,
explore its characteristics:
- `data.is_undirected()`
- `data.train_mask.sum().item()`
- `data.val_mask.sum().item()`
- `data.test_mask.sum().item()`

As this graph is mainly used to test semi-supervised learning algorithms, we see that the validation and test data are more numerous than the training data.


## Mini-batches

How can we train on multiple examples in parallel? PyG puts multiple graphs together with block-diagonal adjacency matrices and concatenated features.
See the [tutorial section on mini-batches](https://pytorch-geometric.readthedocs.io/en/latest/get_started/introduction.html#mini-batches) for more
information.
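
The block-diagonal idea can be sketched without any library: shifting the node indices of each graph by the number of nodes that came before it is equivalent to placing the adjacency matrices on a block diagonal. The helper `batch_graphs` and the second toy graph are hypothetical; the `batch` vector records which graph each node came from.

```python
def batch_graphs(graphs):
    """Merge graphs given as (features, edge_index) into one disconnected graph."""
    x, sources, targets, batch = [], [], [], []
    offset = 0
    for graph_id, (features, (src, dst)) in enumerate(graphs):
        x.extend(features)                        # concatenate node features
        sources.extend(s + offset for s in src)   # shift indices past earlier graphs
        targets.extend(t + offset for t in dst)
        batch.extend([graph_id] * len(features))  # node -> graph assignment
        offset += len(features)
    return x, [sources, targets], batch

g1 = ([[-1.0], [0.0], [1.0]], [[0, 1, 1, 2], [1, 0, 2, 1]])
g2 = ([[5.0], [6.0]], [[0, 1], [1, 0]])
x, edge_index, batch = batch_graphs([g1, g2])
print(edge_index)
print(batch)
```

No edge connects nodes from different graphs, so message passing on the merged graph treats each example independently.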

For a GCN on Cora, however, we don't need fancy data loaders or parallel mini-batches, as we're dealing with just one graph.

## Define a GCN

So let's define a GCN for Cora, then:

In [11]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
import torch.nn as nn

class GCN(torch.nn.Module):
    def __init__(self,dataset):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.normalize(x)
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)
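
For intuition about what each `GCNConv` layer computes, here is a small sketch of the propagation rule from Kipf and Welling, $\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} X W$, written with plain Python lists. It omits the bias and the nonlinearity, and the weight matrix `[[1.0]]` is an arbitrary value chosen for illustration, not a trained parameter.

```python
import math

def normalized_adjacency(edge_index, num_nodes):
    """Return D^-1/2 (A + I) D^-1/2 as a dense list of lists."""
    a_hat = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
             for i in range(num_nodes)]            # start from the identity (self-loops)
    for src, dst in zip(*edge_index):
        a_hat[src][dst] = 1.0
    deg = [sum(row) for row in a_hat]              # degrees including self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]

def gcn_layer(a_norm, x, weight):
    """One propagation step: A_norm @ (X @ W)."""
    xw = [[sum(row[k] * weight[k][c] for k in range(len(weight)))
           for c in range(len(weight[0]))] for row in x]
    return [[sum(a_norm[i][j] * xw[j][c] for j in range(len(xw)))
             for c in range(len(xw[0]))] for i in range(len(a_norm))]

edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]   # the 3-node path graph from above
x = [[-1.0], [0.0], [1.0]]
a_norm = normalized_adjacency(edge_index, 3)
out = gcn_layer(a_norm, x, [[1.0]])
print(out)
```

The middle node averages two neighbors with opposite features, so its output cancels to zero, while the end nodes keep half of their own value.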

Moving the model to a GPU and training is the same as for any other PyTorch model. The "data" is treated as a single example:

In [12]:
device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
model = GCN(dataset).to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    if (epoch+1) % 50 == 0: 
        print(f'Epoch : {epoch+1}')

Epoch : 50
Epoch : 100
Epoch : 150
Epoch : 200


After training, we can evaluate on the test dataset:

In [13]:
model.eval() #No dropout
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f'Accuracy: {acc:.4f}')

Accuracy: 0.8170
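
Both the training loss and the test accuracy above are computed on the full set of node outputs and then restricted with a boolean mask. The following plain-Python sketch mirrors that pattern on four made-up nodes; the logits, labels and masks are invented for illustration and the helper names are hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def masked_nll(logits, labels, mask):
    """Mean negative log-likelihood over the nodes selected by mask."""
    picked = [-log_softmax(l)[y] for l, y, keep in zip(logits, labels, mask) if keep]
    return sum(picked) / len(picked)

def masked_accuracy(logits, labels, mask):
    """Fraction of masked nodes whose argmax prediction equals the label."""
    hits = [max(range(len(l)), key=l.__getitem__) == y
            for l, y, keep in zip(logits, labels, mask) if keep]
    return sum(hits) / len(hits)

logits = [[2.0, 0.0], [0.0, 1.0], [3.0, 1.0], [1.0, 0.5]]
labels = [0, 1, 1, 0]
train_mask = [True, True, False, False]
test_mask = [False, False, True, True]

train_loss = masked_nll(logits, labels, train_mask)
test_acc = masked_accuracy(logits, labels, test_mask)
print(f'loss={train_loss:.4f} acc={test_acc:.4f}')
```

The model still sees every node and edge during the forward pass; only the loss and the metric are limited to the masked nodes, which is what makes the setting transductive.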


## Exercises

1. Implement early stopping using the validation set for the Cora GCN.
2. Reproduce the rest of the results from Kipf and Welling with your GCN.
3. Implement the [Graph Attention Network](https://arxiv.org/abs/1710.10903) and reproduce their results on the datasets available via PyG.


## 1. Early Stopping 
Implement early stopping using the validation set for the Cora GCN.

In [9]:
def train(model,data):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    return loss

def valid(model,data):
    model.eval() #No dropout
    # Validation only measures the loss, so no gradients are computed
    with torch.no_grad():
        out = model(data)
        loss = F.nll_loss(out[data.val_mask], data.y[data.val_mask])
    return loss

def test(model,data):
    model.eval() #No dropout
    pred = model(data).argmax(dim=1)
    correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
    acc = int(correct) / int(data.test_mask.sum())
    return acc

In [10]:
class EarlyStopping:
    def __init__(self, monitor='val_loss', patience=10, mode='min', delta=0.00001):
        self.monitor = monitor
        self.patience = patience
        self.mode = mode
        self.delta = delta
        self.counter = 0
        self.best_score = None
        self.early_stop = False

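    def __call__(self, epoch, current_score):
        if self.mode == 'min':
            # Flip the sign for minimization. For a tensor argument, `*=` works
            # in place, so the caller's variable is negated as well.
            current_score *= -1
        if self.best_score is None:
            self.best_score = current_score
        elif current_score < self.best_score + self.delta:
            # No sufficient improvement in this epoch
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_score = current_score
            self.counter = 0  # improvement: restart the patience count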
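
To see the patience logic in isolation, here is a small sketch that runs over a fixed list of validation losses instead of a real model. It assumes the common convention that the counter restarts whenever the loss improves by more than `delta`; the function name `early_stop_epoch` and the loss values are made up for illustration.

```python
def early_stop_epoch(val_losses, patience=3, delta=0.0):
    """Return the index of the epoch at which training would stop, or None."""
    best = float('inf')
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - delta:
            best = loss          # new best validation loss
            bad_epochs = 0       # restart the patience count
        else:
            bad_epochs += 1      # another epoch without improvement
            if bad_epochs >= patience:
                return epoch
    return None

losses = [1.0, 0.8, 0.7, 0.71, 0.69, 0.70, 0.72, 0.695, 0.60]
stop_short = early_stop_epoch(losses, patience=3)
stop_long = early_stop_epoch(losses, patience=5)
print(stop_short, stop_long)
```

With a patience of 3 the run stops just before the loss drops to 0.60, while a patience of 5 survives the plateau. This is the usual trade-off when choosing the patience value.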

In [11]:
import torch

device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
model = GCN(dataset).to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
num_epochs = 200
early_stopping = EarlyStopping(patience=20)

In [12]:
import numpy as np

for epoch in range(num_epochs):
    train_loss = train(model, data)
    val_loss  = valid(model, data)
    test_acc = test(model,data)
    
    early_stopping(epoch, val_loss)
    if early_stopping.early_stop:
        print("We are at epoch:", epoch)
        print('Epoch: {:03d}'.format(epoch+1),
            'Train Loss: {:.4f}'.format(train_loss.item()),
            'Val Loss: {:.4f}'.format(val_loss.item()*-1))
        break
              
    if (epoch+1) % 10 == 0: 
        # print(f'Epoch : {epoch+1:03d}, Loss: {val_loss:.4f}, Test Acc: {test_acc:.4f}')
        print('Epoch: {:03d}'.format(epoch+1),
            'Train Loss: {:.4f}'.format(train_loss.item()),
            'Val Loss: {:.4f}'.format(val_loss.item()*-1))

Epoch: 010 Train Loss: 1.5452 Val Loss: 1.6988
Epoch: 020 Train Loss: 1.0170 Val Loss: 1.3865
Epoch: 030 Train Loss: 0.6126 Val Loss: 1.0893
Epoch: 040 Train Loss: 0.3385 Val Loss: 0.9202
Epoch: 050 Train Loss: 0.2685 Val Loss: 0.8397
Epoch: 060 Train Loss: 0.2027 Val Loss: 0.8114
Epoch: 070 Train Loss: 0.1924 Val Loss: 0.7849
Epoch: 080 Train Loss: 0.1634 Val Loss: 0.7695
Epoch: 090 Train Loss: 0.1485 Val Loss: 0.7542
Epoch: 100 Train Loss: 0.1606 Val Loss: 0.7391
Epoch: 110 Train Loss: 0.1282 Val Loss: 0.7420
We are at epoch: 110
Epoch: 111 Train Loss: 0.1061 Val Loss: 0.7409


In [13]:
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f'Accuracy: {acc:.4f}')

Accuracy: 0.8150


## 2. Kipf and Welling
Reproduce the rest of the results from Kipf and Welling with your GCN.

Kipf and Welling report the following hyperparameters:
- Cora: 0.5 (dropout rate), $5 \cdot 10^{-4}$ (L2 regularization) and 16 (number of hidden units)
- NELL: 0.1 (dropout rate), $1 \cdot 10^{-5}$ (L2 regularization) and 64 (number of hidden units)
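
One way to keep these settings in a single place is a small lookup table. The sketch below is only a suggestion: the dictionary name `HYPERPARAMS` and its keys are hypothetical, and the values are copied from the list above.

```python
# Per-dataset settings from the list above (key names are our own choice)
HYPERPARAMS = {
    'Cora': {'dropout': 0.5, 'weight_decay': 5e-4, 'hidden_units': 16},
    'NELL': {'dropout': 0.1, 'weight_decay': 1e-5, 'hidden_units': 64},
}

def describe(name):
    """Format the settings of one dataset as a single line."""
    cfg = HYPERPARAMS[name]
    return (f"{name}: dropout={cfg['dropout']}, "
            f"weight_decay={cfg['weight_decay']}, hidden={cfg['hidden_units']}")

for dataset_name in HYPERPARAMS:
    print(describe(dataset_name))
```

Passing `cfg['hidden_units']` and `cfg['dropout']` into the model constructor would avoid defining a separate class per dataset.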


In [14]:
# !pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 torch_geometric --extra-index-url https://download.pytorch.org/whl/cu117
# !pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-1.13.0+cu117.html

In [14]:
from torch_geometric.datasets import Planetoid
from torch_geometric.datasets import NELL

dataset_cora = Planetoid(root='./data/Cora', name='Cora')
# dataset_Citeseer = Planetoid(root='./data/Citeseer', name='Citeseer')
# dataset_Pubmed = Planetoid(root='./data/Pubmed', name='Pubmed')
dataset_NELL = NELL(root='./data/NELL')

In [15]:
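def init_weights(model):
    # Applies only to plain nn.Linear modules; other layer types are left untouched
    if isinstance(model, nn.Linear):
        # In-place initializer; the variant without the trailing underscore is deprecated
        torch.nn.init.xavier_uniform_(model.weight)
        if model.bias is not None:
            model.bias.data.fill_(0.01)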

In [16]:
import torch

device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')

data = dataset_cora[0].to(device)

model = GCN(dataset_cora).to(device)
model.apply(init_weights)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
num_epochs = 200
early_stopping = EarlyStopping(patience=20)

In [17]:
import numpy as np

for epoch in range(num_epochs):
    train_loss = train(model, data)
    val_loss  = valid(model, data)
    # test_acc = test(model,data)
    
    early_stopping(epoch, val_loss)
    if early_stopping.early_stop:
        print("We are at epoch:", epoch)
        print('Epoch: {:03d}'.format(epoch+1),
            'Train Loss: {:.4f}'.format(train_loss.item()),
            'Val Loss: {:.4f}'.format(val_loss.item()*-1))
        break
              
    if (epoch+1) % 10 == 0: 
        # print(f'Epoch : {epoch+1:03d}, Loss: {val_loss:.4f}, Test Acc: {test_acc:.4f}')
        print('Epoch: {:03d}'.format(epoch+1),
            'Train Loss: {:.4f}'.format(train_loss.item()),
            'Val Loss: {:.4f}'.format(val_loss.item()*-1))

Epoch: 010 Train Loss: 1.5303 Val Loss: 1.6895
Epoch: 020 Train Loss: 1.0476 Val Loss: 1.3734
Epoch: 030 Train Loss: 0.5806 Val Loss: 1.0790
Epoch: 040 Train Loss: 0.3818 Val Loss: 0.9020
Epoch: 050 Train Loss: 0.2339 Val Loss: 0.8162
Epoch: 060 Train Loss: 0.1933 Val Loss: 0.7873
Epoch: 070 Train Loss: 0.1571 Val Loss: 0.7653
Epoch: 080 Train Loss: 0.1482 Val Loss: 0.7530
Epoch: 090 Train Loss: 0.1430 Val Loss: 0.7410
Epoch: 100 Train Loss: 0.1310 Val Loss: 0.7386
Epoch: 110 Train Loss: 0.1166 Val Loss: 0.7275
Epoch: 120 Train Loss: 0.1172 Val Loss: 0.7265
We are at epoch: 123
Epoch: 124 Train Loss: 0.1099 Val Loss: 0.7249


In [18]:
def accuracy(data):
    model.eval() #No dropout
    pred = model(data).argmax(dim=1)
    correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
    acc = int(correct) / int(data.test_mask.sum())
    print(f'Accuracy: {acc:.4f}')

accuracy(data)

Accuracy: 0.8130


In [1]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
import torch.nn as nn

class GCN_NELL(torch.nn.Module):
    def __init__(self,dataset):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 64)
        self.conv2 = GCNConv(64, dataset.num_classes)

    def forward(self, data):
        x, edge_index = data.x.to_dense(), data.edge_index.to_dense()

        x = F.normalize(x)
        
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training, p=0.1)
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)



**Warning** 

*Memory requirement* (from Kipf and Welling): In the current setup with full-batch gradient descent, the memory requirement
grows linearly in the size of the dataset. The authors show that for large graphs that do not fit in GPU
memory, training on CPU can still be a viable option. 

Mini-batch stochastic gradient descent can
alleviate this issue. The procedure of generating mini-batches, however, should take into account the
number of layers in the GCN model, as the Kth-order neighborhood for a GCN with K layers has to
be stored in memory for an exact procedure. For very large and densely connected graph datasets,
further approximations might be necessary.
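
The remark about the Kth-order neighborhood can be illustrated with a short breadth-first expansion: to compute exact outputs for a set of seed nodes with a K-layer GCN, every node within K hops of a seed is needed. The helper `k_hop_nodes` and the 5-node path graph are hypothetical examples, not part of the lab code.

```python
def k_hop_nodes(edge_index, seeds, num_hops):
    """Return all nodes within num_hops message-passing steps of the seeds."""
    incoming = {}
    for src, dst in zip(*edge_index):
        incoming.setdefault(dst, set()).add(src)   # dst receives messages from src
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(num_hops):
        # Nodes that send messages to the current frontier and are not yet collected
        frontier = {n for node in frontier for n in incoming.get(node, ())} - visited
        visited |= frontier
    return visited

# Undirected path graph 0 - 1 - 2 - 3 - 4
edge_index = [[0, 1, 1, 2, 2, 3, 3, 4],
              [1, 0, 2, 1, 3, 2, 4, 3]]
two_layer_support = k_hop_nodes(edge_index, seeds=[0], num_hops=2)
print(sorted(two_layer_support))
```

On a sparse path graph the set grows slowly, but on densely connected graphs it can cover most of the graph after only a few hops, which is why the text mentions further approximations.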

In [10]:
# device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
device = 'cpu'

data = dataset_NELL[0].to(device)

model = GCN_NELL(dataset_NELL).to(device)
model.apply(init_weights)

# Note: the hyperparameters listed earlier specify 1e-5 (L2) for NELL; this run uses 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)
num_epochs = 100
early_stopping = EarlyStopping(patience=20)

In [12]:
import numpy as np

for epoch in range(num_epochs):
    train_loss = train(model, data)
    val_loss  = valid(model, data)
    # test_acc = test(model,data)
    
    early_stopping(epoch, val_loss)
    if early_stopping.early_stop:
        print("We are at epoch:", epoch)
        print('Epoch: {:03d}'.format(epoch+1),
            'Train Loss: {:.4f}'.format(train_loss.item()),
            'Val Loss: {:.4f}'.format(val_loss.item()*-1))
        break
              
    if (epoch+1) % 10 == 0: 
        # print(f'Epoch : {epoch+1:03d}, Loss: {val_loss:.4f}, Test Acc: {test_acc:.4f}')
        print('Epoch: {:03d}'.format(epoch+1),
            'Train Loss: {:.4f}'.format(train_loss.item()),
            'Val Loss: {:.4f}'.format(val_loss.item()*-1))

Epoch: 010 Train Loss: 4.6822 Val Loss: 4.7744
Epoch: 020 Train Loss: 3.6778 Val Loss: 4.3350
Epoch: 030 Train Loss: 2.4306 Val Loss: 3.8665
Epoch: 040 Train Loss: 1.2967 Val Loss: 3.3216
Epoch: 050 Train Loss: 0.7001 Val Loss: 2.9760
Epoch: 060 Train Loss: 0.4625 Val Loss: 2.8736
Epoch: 070 Train Loss: 0.3803 Val Loss: 2.7510
Epoch: 080 Train Loss: 0.3105 Val Loss: 2.6610
Epoch: 090 Train Loss: 0.2754 Val Loss: 2.5940
Epoch: 100 Train Loss: 0.2468 Val Loss: 2.5692


In [14]:
def accuracy(data):
    model.eval() #No dropout
    pred = model(data).argmax(dim=1)
    correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
    acc = int(correct) / int(data.test_mask.sum())
    print(f'Accuracy: {acc:.4f}')

accuracy(data)

Accuracy: 0.4861


## 3. GAT via PyG
Implement the [Graph Attention Network](https://arxiv.org/abs/1710.10903) and reproduce their results on the datasets available via PyG.

**Transductive learning** 
For the transductive learning tasks, we apply a two-layer GAT model. 
Its architectural hyperparameters have been optimized on the Cora dataset.

The 1st layer: K = 8 attention heads, 8 features each (a total of 64 features), followed by an exponential linear unit (ELU) nonlinearity. 

The 2nd layer is used for classification: a single attention head that computes C features followed by a softmax activation. 

During training, we apply L2 regularization with λ = 0.0005. Furthermore, dropout with p = 0.6 is applied to both layers’ inputs, as well as to the normalized attention coefficients (critically, this means that at each training iteration, each node is exposed to a stochastically sampled neighborhood). 
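
For reference, the normalized attention coefficients of a single head can be sketched in plain Python. The score for a pair is $\mathrm{LeakyReLU}(a^\top [W h_i \,\|\, W h_j])$, followed by a softmax over the neighborhood of node $i$. In this sketch the features `h` are assumed to be already multiplied by $W$, the attention vector is split into two halves `a_src` and `a_dst`, and all numbers are invented for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leaky_relu(v, slope=0.2):
    return v if v > 0 else slope * v

def attention_coefficients(h, neighbors, a_src, a_dst):
    """Softmax-normalized attention weights alpha[i][j] for each node i.

    h: transformed node features (W h_i), one list per node.
    neighbors[i]: nodes attended to by node i, including i itself.
    a_src, a_dst: the two halves of the attention vector a.
    """
    alpha = {}
    for i, nbrs in neighbors.items():
        # a^T [W h_i || W h_j] splits into a_src . W h_i + a_dst . W h_j
        scores = [leaky_relu(dot(a_src, h[i]) + dot(a_dst, h[j])) for j in nbrs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(nbrs, exps)}
    return alpha

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
alpha = attention_coefficients(h, neighbors, a_src=[0.5, -0.5], a_dst=[1.0, 0.0])
for i, row in alpha.items():
    print(i, {j: round(w, 3) for j, w in row.items()})
```

Dropout on these coefficients randomly zeroes some of the weights during training, which is the "stochastically sampled neighborhood" mentioned above.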

In [24]:
import torch.nn as nn
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT_Transductive(torch.nn.Module):
    def __init__(self, dataset):
        super().__init__()
        self.head = 8
        self.feature = 8 
        self.conv1 = GATConv(in_channels = dataset.num_node_features, 
                             out_channels = self.feature, 
                             heads = self.head, 
                             dropout=0.6)
        self.conv2 = GATConv(in_channels = self.feature * self.head, 
                             out_channels = dataset.num_classes, 
                             heads=1, 
                             concat=True,
                             dropout=0.6)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        
        x = F.normalize(x)

        x = self.conv1(x, edge_index)
        x = F.elu(x)
        # x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)

In [25]:
from torch_geometric.datasets import Planetoid
dataset_cora = Planetoid(root='./data/Cora', name='Cora')

In [26]:
device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')

data = dataset_cora[0].to(device)

model = GAT_Transductive(dataset_cora).to(device)
model.apply(init_weights)

optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)
num_epochs = 100
early_stopping = EarlyStopping(patience=20)

In [27]:
def train(model, data):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    return loss

def valid(model,data):
    model.eval() #No dropout
    # Validation only measures the loss, so no gradients are computed
    with torch.no_grad():
        out = model(data)
        loss = F.cross_entropy(out[data.val_mask], data.y[data.val_mask])
    return loss

def test(model,data):
    model.eval() #No dropout
    pred = model(data).argmax(dim=1)
    correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
    acc = int(correct) / int(data.test_mask.sum())
    return acc

In [28]:
import numpy as np

for epoch in range(num_epochs):
    train_loss = train(model, data)
    val_loss  = valid(model, data)
    # test_acc = test(model,data)
    
    early_stopping(epoch, val_loss)
    if early_stopping.early_stop:
        print("We are at epoch:", epoch)
        print('Epoch: {:03d}'.format(epoch+1),
            'Train Loss: {:.4f}'.format(train_loss.item()),
            'Val Loss: {:.4f}'.format(val_loss.item()*-1))
        break
              
    if (epoch+1) % 10 == 0: 
        # print(f'Epoch : {epoch+1:03d}, Loss: {val_loss:.4f}, Test Acc: {test_acc:.4f}')
        print('Epoch: {:03d}'.format(epoch+1),
            'Train Loss: {:.4f}'.format(train_loss.item()),
            'Val Loss: {:.4f}'.format(val_loss.item()*-1))

Epoch: 010 Train Loss: 1.5202 Val Loss: 1.6461
Epoch: 020 Train Loss: 0.9906 Val Loss: 1.2806
Epoch: 030 Train Loss: 0.7572 Val Loss: 0.9640
Epoch: 040 Train Loss: 0.5484 Val Loss: 0.7996
Epoch: 050 Train Loss: 0.5442 Val Loss: 0.7287
Epoch: 060 Train Loss: 0.4993 Val Loss: 0.6829
Epoch: 070 Train Loss: 0.5078 Val Loss: 0.6785
Epoch: 080 Train Loss: 0.3625 Val Loss: 0.6598
Epoch: 090 Train Loss: 0.4113 Val Loss: 0.6464
Epoch: 100 Train Loss: 0.4098 Val Loss: 0.6457


After training, we can evaluate on the test dataset:

In [29]:
def accuracy(data):
    model.eval() #No dropout
    pred = model(data).argmax(dim=1)
    correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
    acc = int(correct) / int(data.test_mask.sum())
    print(f'Accuracy: {acc:.4f}')

accuracy(data)

Accuracy: 0.8230


## Conclusion

| Model                               | Cora         | NELL   |
|-------------------------------------|--------------|--------|
| GCN (paper)                         | 81.5 %       | 66.0 % |
| GCN (own, early stopping)           | 81.5 %       |        |
| GCN (own, Kipf and Welling settings)| 81.3 %       | 48.6 % |
| GAT (paper)                         | 83.0 ± 0.7 % |        |
| GAT (own)                           | 82.3 %       |        |

With the hyperparameters of Kipf and Welling, our GCN reaches 81.3 % test accuracy on Cora, close to the 81.5 % reported in the paper. On NELL, in contrast, our GCN reaches only 48.6 %, well below the 66.0 % reported in the paper.

For the GAT experiment we followed the configuration described in the paper. The resulting 82.3 % on Cora lies at the lower edge of the reported 83.0 ± 0.7 %.