# Project Assignment: Amazon products dataset

## Students

* Team: `9`
* Students: `Mateja Ilić & Miloš Novaković` (for the team submission)

## Part 2.2. Trying out a Graph Neural Network model

There are strong indications in the literature that exploiting the underlying structure of the data to automatically extract the most informative features for a learning task can work much better than manual feature engineering, and much faster than loading the raw data into one big neural network and hoping it extracts the structure on its own. This approach was successful with CNNs on 2D regular grid data (i.e. images) and with RNNs on time series. Graph Neural Networks (GNNs) try to recreate it for non-Euclidean structured data; in our case, the feature vectors are node attributes on a graph.

GraphSAGE is a message-passing GNN introduced by W. L. Hamilton, R. Ying, and J. Leskovec in "Inductive Representation Learning on Large Graphs". It learns node embeddings by sampling and aggregating features from a node’s local neighborhood, which should allow for better scalability. Both the function that generates embeddings and the function that aggregates them from layer to layer are learnable. Because the model learns these functions rather than one fixed embedding per node, GraphSAGE can generate embeddings for previously unseen nodes.
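To make the "sample and aggregate" idea concrete, here is a minimal pure-Python sketch of a single aggregation step with a mean aggregator on a toy graph. It is an illustration under stated assumptions, not the PyTorch Geometric implementation: the helper names `sample_neighbors` and `mean_aggregate`, the toy graph, and the fan-out `k` are all hypothetical, and the learned linear transform and non-linearity that GraphSAGE applies after aggregation are left out.

```python
import random


def sample_neighbors(adj, node, k, rng):
    """Uniformly sample at most k neighbors of `node` (all of them if it has <= k)."""
    neighbors = adj[node]
    if len(neighbors) <= k:
        return list(neighbors)
    return rng.sample(neighbors, k)


def mean_aggregate(features, adj, node, k, rng):
    """Return (own features, element-wise mean of the sampled neighbors' features)."""
    sampled = sample_neighbors(adj, node, k, rng)
    dim = len(features[node])
    if not sampled:
        # isolated node: there is nothing to aggregate
        return features[node], [0.0] * dim
    mean = [sum(features[n][d] for n in sampled) / len(sampled) for d in range(dim)]
    return features[node], mean


# toy star graph (hypothetical): node 0 is connected to nodes 1, 2 and 3
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
features = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [2.0, 2.0], 3: [4.0, 2.0]}

rng = random.Random(0)
own, neigh = mean_aggregate(features, adj, node=0, k=3, rng=rng)
print(own, neigh)
```

A real GraphSAGE layer would then combine `own` and `neigh` through learned weight matrices. The cost of this step depends on the fan-out `k` rather than on the full degree of the node, which is where the scalability claim comes from.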

In [None]:
# downloading the necessary packages
!pip install torch-scatter torch-sparse torch-cluster torch-geometric -f https://data.pyg.org/whl/torch-1.11.0+cu113.html
!pip install torchmetrics
!pip install class-resolver

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://data.pyg.org/whl/torch-1.11.0+cu113.html
Collecting torch-scatter
  Downloading https://data.pyg.org/whl/torch-1.11.0%2Bcu113/torch_scatter-2.0.9-cp37-cp37m-linux_x86_64.whl (7.9 MB)
Collecting torch-sparse
  Downloading https://data.pyg.org/whl/torch-1.11.0%2Bcu113/torch_sparse-0.6.13-cp37-cp37m-linux_x86_64.whl (3.5 MB)
Collecting torch-cluster
  Downloading https://data.pyg.org/whl/torch-1.11.0%2Bcu113/torch_cluster-1.6.0-cp37-cp37m-linux_x86_64.whl (2.5 MB)
Collecting torch-geometric
  Downloading torch_geometric-2.0.4.tar.gz (407 kB)
Building wheels for collected packages: torch-geometric
  Building whe

In [None]:
# importing the necessary packages and classes
import torch
import torch.nn.functional as F
from torch import nn
import torch_geometric as pyg
from torch_geometric.data import Data
from torch_geometric.datasets import AmazonProducts
from torch_geometric.nn import GraphSAGE
import torchmetrics
import matplotlib.pyplot as plt
import numpy as np
import time

In [None]:
if torch.cuda.is_available():
    device = 'cuda'
else:
    print("No GPU :(")
    device = 'cpu'

print(f"The device used is {device}.")

The device used is cuda.


In [None]:
# upload your code to your Google Drive to import the data automatically
from google.colab import drive
drive.mount('/content/drive')
data_folder_path = '/content/drive/MyDrive/data/NML_Final_Project/'

Mounted at /content/drive


In [None]:
# we load the dataset
dataset = AmazonProducts(root=data_folder_path)
data = dataset[0]
print(data)

Data(x=[1569960, 200], edge_index=[2, 264339468], y=[1569960, 107], train_mask=[1569960], val_mask=[1569960], test_mask=[1569960])


In [None]:
num_nodes = data.x.shape[0]
num_features = data.x.shape[1]
num_classes = data.y.shape[1]
num_edges = data.num_edges
num_edge_features = data.num_edge_features

print(f"Feature X matrix \n {data.x}", end = "\n\n")
print(f"Feature X matrix shape \n {(data.x.shape[0], data.x.shape[1])}", end = "\n------------------------------------------------------------------\n")
print(f"Label Y matrix \n {data.y}", end = "\n\n")
print(f"Label Y matrix shape \n {(data.y.shape[0], data.y.shape[1])}", end = "\n------------------------------------------------------------------\n")
print(f"Edge List E \n {data.edge_index}", end = "\n\n")
print(f"Edge List E shape \n {(data.edge_index.shape[0], data.edge_index.shape[1])}", end = "\n------------------------------------------------------------------\n")
print(f"Number of nodes N = {num_nodes}", end = "\n\n")
print(f"Number of node features D = {num_features}", end = "\n\n")
print(f"Number of label classes C = {num_classes}", end = "\n\n")
print(f"Number of edges E = {num_edges}", end = "\n\n")

Feature X matrix 
 tensor([[-0.1466,  0.2226, -0.3597,  ...,  0.1699,  0.8974,  1.6527],
        [-0.2805,  0.0190,  0.4301,  ..., -1.1758, -1.8365, -1.1693],
        [ 0.2554,  0.2519, -0.0291,  ...,  1.3751, -0.0735,  0.6262],
        ...,
        [-0.8121,  0.3626, -0.7781,  ...,  0.0639,  0.8645,  0.0389],
        [ 1.5977, -2.3989, -0.0569,  ..., -1.4413,  0.2966,  0.0985],
        [-0.1663,  0.0629, -0.0474,  ...,  0.1853, -0.1216, -0.9181]])

Feature X matrix shape 
 (1569960, 200)
------------------------------------------------------------------
Label Y matrix 
 tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]])

Label Y matrix shape 
 (1569960, 107)
------------------------------------------------------------------
Edge List E 
 tensor([[      0,       0,       0,  ..., 1569958, 1569959, 1569959],
        [     

### Notes on the architecture and training:

We try to match the dimensions of the MLP used to establish our performance baseline. GNNs typically don't go too deep in terms of layers, so for most models 2-3 embedding layers at most are recommended. After the node embedding, we feed the data into an MLP classifier, so the rest of the architecture and the training are the same as when using just an MLP for a multi-label classification task (i.e. a sigmoid activation on the output layer combined with a binary cross-entropy loss).
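As a sketch of the objective this describes, assuming the default mean reduction of `F.binary_cross_entropy`: let $z_{ic}$ be the classifier output for node $i$ and class $c$, $y_{ic} \in \{0, 1\}$ the corresponding label, $\mathcal{V}_{\text{train}}$ the set of nodes selected by `train_mask`, and $C = 107$ the number of classes. The symbols are our own notation, introduced only for this illustration.

$$
\hat{y}_{ic} = \sigma(z_{ic}) = \frac{1}{1 + e^{-z_{ic}}},
\qquad
\mathcal{L} = -\frac{1}{|\mathcal{V}_{\text{train}}|\, C} \sum_{i \in \mathcal{V}_{\text{train}}} \sum_{c=1}^{C} \Big[ y_{ic} \log \hat{y}_{ic} + (1 - y_{ic}) \log \big(1 - \hat{y}_{ic}\big) \Big]
$$

Each of the $C$ labels is treated as an independent binary problem, which is why a sigmoid is used on the output instead of a softmax.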

In [None]:
hidden_dim = 512

class GNN_SAGE(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.gnn_block = GraphSAGE(in_channels=num_features,
                                   hidden_channels=hidden_dim,
                                   num_layers=2,
                                   out_channels=hidden_dim)
        self.classifier = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                        nn.ReLU(),
                                        nn.Linear(hidden_dim, num_classes))
        
    def forward(self, x, edge_index) -> torch.Tensor:
        x = self.gnn_block(x, edge_index)
        return torch.sigmoid(self.classifier(x))

In [None]:
def train(
    model: nn.Module,
    data: Data,
    optimizer: torch.optim.Optimizer,
    nb_epochs: int
):
    model.to(device)
    model.train()
    
    # container to save the losses for plotting
    losses = []
    
    for epoch in range(nb_epochs):
      optimizer.zero_grad()
      
      # forward pass of the model generates the prediction
      y_pred = model(data.x.to(device), data.edge_index.to(device))
      
      # calculating the loss on the training nodes only
      loss = F.binary_cross_entropy(input = y_pred[data.train_mask],
                                    target = data.y[data.train_mask].to(device, dtype=torch.float))
      
      # after each calculation of loss function, save it to losses
      losses.append(loss.item())
      
      # backward pass (computing the gradients of the loss w.r.t. the model parameters)
      loss.backward()
      
      # update of the parameters according to the chosen optimizer
      optimizer.step()

    return losses

def evaluate(
    model: nn.Module,
    metric: torchmetrics.Metric,
    data: Data,
    mask: torch.Tensor):
  
    model.eval()  # Deactivate dropout
    model.to(device)

    with torch.no_grad():
        # do a forward pass (i.e. a prediction) of the model
        # on the features x and edge_index
        y_pred = model(data.x.to(device), data.edge_index.to(device))

        # keep only the predicted probabilities of the nodes selected by the mask
        y_pred = y_pred[mask]

        # get the ground truth labels
        y = data.y[mask].to(device)

        # update the metric state with the predictions and the ground truth
        metric.update(y_pred, y)

    return metric.compute().item()

In [None]:
# Define the Model that will be trained
model = GNN_SAGE()

# define the learning rate
lr = 1 * 1e-2

# define the optimization algorithm used to train the model
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

# define the number of epochs in the training loop
nb_epochs = 50

# call the training function
start_training_time = time.time()
losses_ = train(model, data, optimizer, nb_epochs)
end_training_time = time.time()

# total training time
print(f"Total training time is {(end_training_time - start_training_time) // 60 : .2f} minutes and {(end_training_time - start_training_time) % 60 : .2f} seconds.")

RuntimeError: ignored

### Results:

No evaluation was run, since the training could not be completed on any hardware that was available to us, including Google Colab Pro+, which allocates around 51 GB of RAM to its user. It appears that, even with the added scalability of node neighborhood subsampling, this approach is out of reach for single users working on datasets as large as this one. The original paper specifies that the initial GraphSAGE tests were run on a machine with 256 GB of RAM.
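A rough back-of-the-envelope estimate helps put this failure in context. The sketch below uses the tensor shapes printed when the dataset was loaded and rests on several assumptions that we did not verify: features and activations are stored as 32-bit floats, `edge_index` as 64-bit integers, and the message-passing layer materializes one message per edge when the whole graph is passed through the model in a single forward pass, as the training loop above does. It ignores gradients, optimizer state and framework overhead, so it is not a measurement.

```python
# Shapes taken from the printed Data object
num_nodes = 1_569_960
num_features = 200
num_edges = 264_339_468
hidden_dim = 512

FLOAT32_BYTES = 4  # assumption: features and activations are float32
INT64_BYTES = 8    # assumption: edge_index is int64


def to_gb(num_bytes: int) -> float:
    """Convert a byte count to gigabytes (10^9 bytes)."""
    return num_bytes / 1e9


estimates = {
    "node features x": num_nodes * num_features * FLOAT32_BYTES,
    "edge_index": 2 * num_edges * INT64_BYTES,
    "one hidden activation [N, 512]": num_nodes * hidden_dim * FLOAT32_BYTES,
    # assumption: one 200-dimensional message is stored per edge in the first layer
    "per-edge messages, layer 1": num_edges * num_features * FLOAT32_BYTES,
}

for name, num_bytes in estimates.items():
    print(f"{name:<32s} {to_gb(num_bytes):8.2f} GB")
```

Under these assumptions the per-edge messages of the first layer alone would need roughly 211 GB, far more than the inputs themselves.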

This is why we turn to our final GNN model, GraphSAINT, which promises even better scalability by sampling minibatches of subgraphs from the whole graph instead of subsampling node neighborhoods.