# Dense GNN implementation

In this exercise we implement a GNN from scratch using dense matrices.
Because the memory requirement of a dense adjacency matrix scales quadratically with the number of nodes in a graph, this approach limits us to datasets with small graphs.
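
To make the quadratic scaling concrete, here is a small back-of-the-envelope sketch. The helper name `dense_adjacency_bytes` is made up for illustration, and it assumes one float32 entry (4 bytes) per node pair.

```python
def dense_adjacency_bytes(num_nodes: int, bytes_per_entry: int = 4) -> int:
    """Memory needed for one dense num_nodes x num_nodes adjacency matrix."""
    return num_nodes * num_nodes * bytes_per_entry


# Compare a molecule-sized graph with two much larger graphs
for n in (50, 1_000, 100_000):
    print(f"{n:>7} nodes: {dense_adjacency_bytes(n) / 1e6:,.2f} MB")
```

Under these assumptions a graph with 1,000 nodes needs 4 MB, while 100,000 nodes would already need 40 GB for a single adjacency matrix.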

We will use the MolHIV dataset (ogbg-molhiv).

For the network we need a message-passing layer and a pooling function.

1. Describe the dataset in your own words. Also discuss its features and the statistical properties of the graphs and labels.
1. Implement the class GCNLayer to perform one round of message passing. You may use any variant of message passing here.
1. Implement a pooling layer like MeanPooling or SumPooling (or both).
1. Implement a one-hot encoding of the atom type (this will positively affect classification performance).
1. Implement the model class GraphGCN that builds upon your GCNLayer and pooling layer.
1. Create and train a GraphGCN model on MolHIV. As MolHIV is highly imbalanced, it makes sense to use class weights in your loss function.

For MolHIV we aim to reach a ROC-AUC of about 0.64 (or higher). Note that training was quite unstable for me, so several runs got stuck at 0.5.

Note: In this exercise, we use PyG only for utilities and not to build models. Feel free to edit/ignore any of the provided code as you see fit.

In [None]:
import torch
import torch_geometric as pyg
import numpy as np
from ogb.graphproppred import PygGraphPropPredDataset, Evaluator

from tqdm import tqdm

In [45]:
# find device
if torch.cuda.is_available(): # NVIDIA
    device = torch.device('cuda')
elif torch.backends.mps.is_available(): # apple silicon
    device = torch.device('mps') 
else:
    device = torch.device('cpu') # fallback
device

device(type='cuda')

In [None]:
class GCNLayer(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int, activation=torch.nn.functional.relu):
        super(GCNLayer, self).__init__()
        raise NotImplementedError

    def forward(self, H: torch.Tensor, adj: torch.Tensor):
        raise NotImplementedError
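
One possible variant of message passing is the symmetric-normalised GCN propagation rule, H' = act(D^-1/2 (A + I) D^-1/2 H W). The following is only a sketch of that rule, not a reference solution: the class name `DenseGCNLayerSketch` is made up, and it assumes batched inputs with H of shape (batch, nodes, in_features) and adj of shape (batch, nodes, nodes).

```python
import torch


class DenseGCNLayerSketch(torch.nn.Module):
    """Hypothetical dense GCN layer: H' = act(D^-1/2 (A + I) D^-1/2 H W)."""

    def __init__(self, in_features: int, out_features: int,
                 activation=torch.nn.functional.relu):
        super().__init__()
        self.linear = torch.nn.Linear(in_features, out_features)
        self.activation = activation

    def forward(self, H: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Assumed shapes: H (batch, nodes, in_features), adj (batch, nodes, nodes)
        eye = torch.eye(adj.shape[-1], device=adj.device, dtype=adj.dtype)
        adj_hat = adj + eye                    # add self-loops
        deg = adj_hat.sum(dim=-1)              # (batch, nodes), at least 1 due to self-loops
        d_inv_sqrt = deg.pow(-0.5)
        # Symmetric normalisation: entry (i, j) is scaled by 1 / sqrt(deg_i * deg_j)
        adj_norm = d_inv_sqrt.unsqueeze(-1) * adj_hat * d_inv_sqrt.unsqueeze(-2)
        out = adj_norm @ self.linear(H)        # aggregate the transformed features
        return self.activation(out) if self.activation is not None else out
```

Note that in this sketch padded nodes also receive a self-loop, so their rows are not zero after the layer (they carry the bias of the linear map). If graphs are zero-padded to a common size, it may be worth masking those rows before pooling.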

In [None]:
class MeanPooling(torch.nn.Module):
    def __init__(self, dim: int | tuple[int, ...]):
        raise NotImplementedError

    def forward(self, H: torch.Tensor):
        raise NotImplementedError

In [None]:
class SumPooling(torch.nn.Module):
    def __init__(self, dim: int | tuple[int, ...]):
        raise NotImplementedError

    def forward(self, H: torch.Tensor):
        raise NotImplementedError
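
As a rough sketch of what such pooling layers could look like, the following reduces the node features along a configurable dimension. The class names with the `Sketch` suffix are made up, and the default `dim=1` assumes H has shape (batch, nodes, features).

```python
import torch


class SumPoolingSketch(torch.nn.Module):
    """Hypothetical sum pooling: add up node features along `dim`."""

    def __init__(self, dim: int = 1):
        super().__init__()
        self.dim = dim

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        return H.sum(dim=self.dim)


class MeanPoolingSketch(torch.nn.Module):
    """Hypothetical mean pooling: average node features along `dim`."""

    def __init__(self, dim: int = 1):
        super().__init__()
        self.dim = dim

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        return H.mean(dim=self.dim)


H = torch.ones(2, 3, 4)                       # 2 graphs, 3 nodes, 4 features
graph_sum = SumPoolingSketch(dim=1)(H)        # shape (2, 4)
graph_mean = MeanPoolingSketch(dim=1)(H)      # shape (2, 4)
```

With zero-padded graphs, the plain mean divides by the padded node count rather than the true number of nodes, so a masked mean may be preferable.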

In [None]:
class GraphGCN(torch.nn.Module):
    def __init__(self, ):
        super(GraphGCN, self).__init__()
        raise NotImplementedError

    def forward(self, H_in: torch.Tensor, adj: torch.Tensor):
        raise NotImplementedError


## MolHIV

PyTorch Geometric stores its graphs in a sparse format, as a list of edges in the variable edge_index.
We will thus need to create our own (torch) dataloader and convert the graphs into dense adjacency matrices.
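
The conversion from edge_index to a dense matrix can be sketched in plain torch as follows. The helper name `edge_index_to_dense` and the `max_nodes` padding size are assumptions for illustration; PyG also ships `torch_geometric.utils.to_dense_adj`, which could be used instead.

```python
import torch


def edge_index_to_dense(edge_index: torch.Tensor, max_nodes: int) -> torch.Tensor:
    """Build a zero-padded (max_nodes, max_nodes) adjacency matrix.

    edge_index is expected to have shape (2, num_edges) and integer dtype,
    with row 0 holding source nodes and row 1 holding target nodes.
    """
    adj = torch.zeros(max_nodes, max_nodes, dtype=torch.float32)
    src, dst = edge_index
    adj[src, dst] = 1.0                       # mark every listed edge
    return adj


# Toy path graph 0 - 1 - 2, stored with both edge directions, padded to 4 nodes
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
adj = edge_index_to_dense(edge_index, max_nodes=4)
```

Padding every graph to the same `max_nodes` is what allows the adjacency matrices to be stacked into a single tensor for batching.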

For model accuracy, it helped me a lot to add an "atom encoding", i.e. a one-hot encoding of the atom types, instead of only having the atomic numbers appear in the first column of the node features.
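
A minimal sketch of such an encoding is shown below. The function name `one_hot_atoms` is hypothetical, and it assumes the atom types are given as a 1-D integer tensor. In the exercise the mapping would presumably have to be built once over all graphs, so that every graph uses the same columns.

```python
import torch


def one_hot_atoms(atom_types: torch.Tensor):
    """Map each distinct atom type to a one-hot vector.

    Returns the (num_atoms, num_types) encoding and a dict that maps
    each atom type to its column index.
    """
    unique = torch.unique(atom_types)                     # sorted distinct values
    atoms_to_index = {int(a): i for i, a in enumerate(unique)}
    indices = torch.searchsorted(unique, atom_types)      # position of each atom in `unique`
    one_hot = torch.nn.functional.one_hot(indices, num_classes=len(unique)).float()
    return one_hot, atoms_to_index


atom_types = torch.tensor([6, 8, 6, 1])                   # toy example: C, O, C, H
encoding, atoms_to_index = one_hot_atoms(atom_types)
```

The one-hot columns can then be concatenated with (or replace) the first column of the node features.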

In [None]:
class GraphDataset(torch.utils.data.Dataset):
    def __init__(self, adjacencies, features, targets):
        self.adjacencies = torch.tensor(adjacencies, dtype=torch.float32)
        self.features = torch.tensor(features, dtype=torch.float32)
        self.targets = targets
    
    def __len__(self):
        return len(self.targets)
    
    def __getitem__(self, idx):
        return self.adjacencies[idx], self.features[idx], self.targets[idx]
    
    def num_features(self):
        return self.features.shape[-1]
    
    def compute_class_weights(self):
        raise NotImplementedError
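
One common way to obtain class weights is inverse class frequency, w_c = N / (K * n_c), where N is the number of samples, K the number of classes and n_c the number of samples in class c. The sketch below uses a made-up helper name and toy labels, and it is only one of several reasonable weighting schemes.

```python
import torch


def inverse_frequency_weights(targets: torch.Tensor, num_classes: int = 2) -> torch.Tensor:
    """Weight each class by N / (K * n_c), so rare classes get larger weights."""
    counts = torch.bincount(targets.flatten().long(), minlength=num_classes).float()
    return counts.sum() / (num_classes * counts.clamp(min=1))  # clamp avoids division by zero


targets = torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])   # toy labels with 10% positives
weights = inverse_frequency_weights(targets)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)       # weighted loss for two-class logits
```

It is probably safest to compute the weights on the training split only, so that no label statistics from the validation or test set leak into training.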


In [None]:
def extract_graphs_and_features(dataset):
    raise NotImplementedError

### Create Data Loaders for MolHIV

In [None]:
batch_size = 32

molHIV = PygGraphPropPredDataset(name="ogbg-molhiv")
split_idx = molHIV.get_idx_split() 
all_adjacencies, all_features, all_targets, atoms_to_index = extract_graphs_and_features(molHIV)
all_targets = all_targets.to(torch.int64)

# Create datasets using split_idx indices
graph_dataset = GraphDataset(all_adjacencies, all_features, all_targets)
train_dataset = torch.utils.data.Subset(graph_dataset, split_idx["train"])
val_dataset = torch.utils.data.Subset(graph_dataset, split_idx["valid"])
test_dataset = torch.utils.data.Subset(graph_dataset, split_idx["test"]) 

# Create DataLoaders
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


### Model and Training for MolHIV

MolHIV (like all other datasets from OGB) should be evaluated through an Evaluator. You can also experiment with learning rate schedulers.
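
As one example of a learning rate scheduler, the sketch below uses `torch.optim.lr_scheduler.ReduceLROnPlateau` to halve the learning rate when the validation ROC-AUC stops improving. The stand-in model, the hyperparameters and the validation scores are all made up for illustration.

```python
import torch

model = torch.nn.Linear(4, 2)                             # stand-in for a real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# mode="max" because a higher ROC-AUC is better
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2
)

fake_val_scores = [0.60, 0.61, 0.61, 0.61, 0.61]          # made-up validation ROC-AUC values
for epoch, val_rocauc in enumerate(fake_val_scores):
    # ... one training epoch would run here ...
    scheduler.step(val_rocauc)                            # reduce LR once the score plateaus
    print(epoch, optimizer.param_groups[0]["lr"])
```

In a real training loop, `val_rocauc` would come from evaluating the model on the validation loader after each epoch.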

In [None]:
evaluator = Evaluator(name='ogbg-molhiv')

def evaluate(model, loader):
    model.eval()

    y_true = list()
    y_pred = list()

    for adjacencies, features, targets in loader:
        adjacencies, features = adjacencies.to(device), features.to(device)

        with torch.no_grad():
            pred = model(features, adjacencies)
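        # ROC-AUC needs scores, not hard labels: use the positive-class probability
        # (assumes the model returns two-class logits of shape (batch, 2))
        y_pred.append(pred.softmax(dim=-1)[:, 1:2])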
        y_true.append(targets)

    y_true = torch.cat(y_true, dim=0).detach().cpu()
    y_pred = torch.cat(y_pred, dim=0).detach().cpu()

    input_dict = {"y_true": y_true, "y_pred": y_pred}

    return evaluator.eval(input_dict)['rocauc']

In [None]:
# Model definition and Training loop
raise NotImplementedError