# **CS224W - Colab 2**

In Colab 2, we will construct our own graph neural network using PyTorch Geometric (PyG) and then apply that model to two Open Graph Benchmark (OGB) datasets. These two datasets will be used to benchmark your model's performance on two different graph-based tasks: 1) node property prediction, predicting properties of single nodes, and 2) graph property prediction, predicting properties of entire graphs or subgraphs.

First, we will learn how PyTorch Geometric stores graphs as PyTorch tensors.

Then, we will load and inspect one of the Open Graph Benchmark (OGB) datasets by using the `ogb` package. OGB is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. The `ogb` package not only provides data loaders for each dataset but also model evaluators.

Lastly, we will build our own graph neural network using PyTorch Geometric. We will then train and evaluate our model on the OGB node property prediction and graph property prediction tasks.

**Note**: Make sure to **run all the cells in each section sequentially**, so that the intermediate variables and packages carry over to the next cell.

We recommend you save a copy of this colab in your drive so you don't lose progress!

Have fun and good luck on Colab 2 :)

# Device
You might need to use a GPU for this Colab to run quickly.

Please click `Runtime` and then `Change runtime type`. Then set the `hardware accelerator` to **GPU**.

# Setup
As discussed in Colab 0, the installation of PyG on Colab can be a little bit tricky. First, let us check which version of PyTorch you are running.

In [1]:
import torch
import os
print("PyTorch has version {}".format(torch.__version__))

PyTorch has version 2.2.1


Install the necessary packages for PyG. Make sure that your version of torch matches the output from the cell above. In case of any issues, more information can be found on the [PyG installation page](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html).

In [None]:
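# Install torch geometric
# NOTE: the wheel URLs below target torch 1.13.1+cu116; change them to match
# the PyTorch (and CUDA) version printed by the cell above before running.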
# if 'IS_GRADESCOPE_ENV' not in os.environ:
  # !pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.13.1+cu116.html
  # !pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.13.1+cu116.html
  # !pip install torch-geometric
  # !pip install ogb

# 1) PyTorch Geometric (Datasets and Data)


PyTorch Geometric has two modules for storing and/or transforming graphs into tensor format. One is `torch_geometric.datasets`, which contains a variety of common graph datasets. The other is `torch_geometric.data`, which handles graphs as PyTorch tensors.

In this section, we will learn how to use `torch_geometric.datasets` and `torch_geometric.data` together.

## PyG Datasets

The `torch_geometric.datasets` module provides many common graph datasets. Here we will explore its usage through one example dataset.

In [2]:
from torch_geometric.datasets import TUDataset

In [None]:
# if 'IS_GRADESCOPE_ENV' not in os.environ:
  # root = './enzymes'
  # name = 'ENZYMES'

  # The ENZYMES dataset
  # pyg_dataset= TUDataset(root, name)

  # You will find that there are 600 graphs in this dataset
  # print(pyg_dataset)

In [3]:
# root directory where dataset should be saved
root = "../datasets/enzymes"
name = "ENZYMES"

In [4]:
pyg_dataset = TUDataset(root, name) # The ENZYMES dataset

In [5]:
print(pyg_dataset)

ENZYMES(600)


In [6]:
pyg_dataset.num_features

3

## Question 1: What is the number of classes and number of features in the ENZYMES dataset? (5 points)

In [7]:
def get_num_classes(pyg_dataset):
  # TODO: Implement a function that takes a PyG dataset object
  # and returns the number of classes for that dataset.

  num_classes = 0

  ############# Your code here ############
  ## (~1 line of code)
  ## Note
  ## 1. Colab autocomplete functionality might be useful.
  num_classes = pyg_dataset.num_classes  

  #########################################

  return num_classes

def get_num_features(pyg_dataset):
  # TODO: Implement a function that takes a PyG dataset object
  # and returns the number of features for that dataset.

  num_features = 0

  ############# Your code here ############
  ## (~1 line of code)
  ## Note
  ## 1. Colab autocomplete functionality might be useful.
  num_features = pyg_dataset.num_features  

  #########################################

  return num_features

# if 'IS_GRADESCOPE_ENV' not in os.environ:
num_classes = get_num_classes(pyg_dataset)
num_features = get_num_features(pyg_dataset)
print("{} dataset has {} classes".format(name, num_classes))
print("{} dataset has {} features".format(name, num_features))

ENZYMES dataset has 6 classes
ENZYMES dataset has 3 features


## PyG Data

Each PyG dataset stores a list of `torch_geometric.data.Data` objects, where each `torch_geometric.data.Data` object represents a graph. We can easily get the `Data` object by indexing into the dataset.

For more information such as what is stored in the `Data` object, please refer to the [documentation](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data).
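To make the tensor layout concrete, here is a minimal plain-Python sketch of the fields a `Data` object holds for a tiny, made-up graph. The toy graph and the helper name `describe` are illustrative assumptions, and nested lists stand in for PyTorch tensors.

```python
# Toy undirected graph with 3 nodes: 0 - 1 - 2 (hypothetical example data).
# Connectivity is stored in COO format: row 0 holds source nodes and
# row 1 holds target nodes, so each undirected edge appears twice.
edge_index = [
    [0, 1, 1, 2],  # sources
    [1, 0, 2, 1],  # targets
]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # node features, shape [3, 2]
y = [4]                                    # one graph-level label


def describe(x, edge_index, y):
    """Return shapes in the same style as the printed Data summary."""
    return {
        "x": [len(x), len(x[0])],
        "edge_index": [len(edge_index), len(edge_index[0])],
        "y": [len(y)],
    }


shapes = describe(x, edge_index, y)
print(shapes)
```

The shapes follow the same pattern as the `Data(edge_index=[2, 168], x=[37, 3], y=[1])` summary printed for an ENZYMES graph: two rows of edge endpoints, one feature row per node, and one label per graph.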

In [8]:
len(pyg_dataset)

600

## Question 2: What is the label of the graph with index 100 in the ENZYMES dataset? (5 points)

In [9]:
def get_graph_class(pyg_dataset, idx):
  # TODO: Implement a function that takes a PyG dataset object,
  # an index of a graph within the dataset, and returns the class/label
  # of the graph (as an integer).

  label = -1

  ############# Your code here ############
  ## (~1 line of code)
  label = pyg_dataset[idx].y.item()
  #########################################

  return label

# Here pyg_dataset is a dataset for graph classification
# if 'IS_GRADESCOPE_ENV' not in os.environ:
graph_0 = pyg_dataset[0]
print(graph_0)
idx = 100
label = get_graph_class(pyg_dataset, idx)
print('Graph with index {} has label {}'.format(idx, label))

Data(edge_index=[2, 168], x=[37, 3], y=[1])
Graph with index 100 has label 4


In [12]:
d = pyg_dataset[0]

In [13]:
d.has_self_loops()

False

In [14]:
d = pyg_dataset[0].edge_index
d[:, d[0,:] <= d[1,:]]

tensor([[ 0,  0,  0,  1,  1,  1,  1,  2,  2,  2,  3,  3,  3,  4,  4,  4,  5,  5,
          5,  6,  6,  7,  7,  7,  8,  9,  9,  9, 10, 10, 11, 11, 12, 12, 13, 13,
         13, 13, 14, 14, 14, 15, 15, 16, 17, 17, 18, 18, 19, 20, 20, 20, 20, 21,
         21, 21, 21, 22, 22, 23, 24, 24, 24, 25, 25, 26, 26, 27, 27, 28, 30, 30,
         30, 31, 31, 31, 32, 32, 32, 33, 33, 33, 34, 34],
        [ 1,  2,  3,  2,  3, 24, 27,  3, 27, 28,  4,  5, 28,  5,  6, 29,  6,  7,
         29,  7,  8,  8,  9, 10,  9, 10, 11, 12, 11, 12, 12, 26, 25, 26, 14, 15,
         16, 25, 15, 16, 25, 16, 17, 17, 18, 19, 19, 20, 20, 21, 22, 23, 30, 22,
         23, 30, 35, 23, 35, 33, 27, 28, 29, 26, 29, 28, 29, 28, 29, 29, 33, 34,
         35, 32, 34, 36, 33, 34, 36, 34, 35, 36, 35, 36]])

## Question 3: How many edges does the graph with index 200 have? (5 points)

In [15]:
def get_graph_num_edges(pyg_dataset, idx):
  # TODO: Implement a function that takes a PyG dataset object,
  # the index of a graph in the dataset, and returns the number of
  # edges in the graph (as an integer). You should not count an edge
  # twice if the graph is undirected. For example, in an undirected
  # graph G, if two nodes v and u are connected by an edge, this edge
  # should only be counted once.

  num_edges = 0

  ############# Your code here ############
  ## Note:
  ## 1. You can't return the data.num_edges directly
  ## 2. We assume the graph is undirected
  ## 3. Look at the PyG dataset built in functions
  ## (~4 lines of code)
  data = pyg_dataset[idx].edge_index # get the graph connectivity (COO edge list)
  for i in range(data.shape[1]): # iterate over the columns of the edge list
      if data[0][i] <= data[1][i]: # count an edge only when source <= target, so each undirected edge is counted once
          num_edges+=1 # `<=` rather than `<` so that self-loops are also counted
          
  # or we can use torch masking
  # edge = data[:, data[0,:] <= data[1,:]]  
  # num_edges = edge.shape[1]
  #########################################

  return num_edges

# if 'IS_GRADESCOPE_ENV' not in os.environ:
idx = 200
num_edges = get_graph_num_edges(pyg_dataset, idx)
print('Graph with index {} has {} edges'.format(idx, num_edges))

Graph with index 200 has 53 edges
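As an alternative view of the same counting problem, the sketch below collapses both directions of each edge into one canonical pair and counts the distinct pairs. It is a plain-Python illustration on a made-up edge list, not the expected answer format; the name `count_undirected_edges` is hypothetical. Unlike the loop above, it also ignores exact duplicate entries.

```python
def count_undirected_edges(edge_index):
    """Count each undirected edge once in a COO edge list."""
    sources, targets = edge_index
    # (min, max) maps (u, v) and (v, u) to the same key.
    unique = {(min(u, v), max(u, v)) for u, v in zip(sources, targets)}
    return len(unique)


# Hypothetical graph: edges 0-1 and 1-2 stored in both directions,
# plus one self-loop on node 2 stored once.
toy_edge_index = [[0, 1, 1, 2, 2], [1, 0, 2, 1, 2]]
num_toy_edges = count_undirected_edges(toy_edge_index)
print(num_toy_edges)
```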


# 2) Open Graph Benchmark (OGB)

The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. Its datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can then be evaluated by using the OGB Evaluator in a unified manner.

## Dataset and Data

OGB also supports PyG dataset and data classes. Here we take a look at the `ogbn-arxiv` dataset.

In [16]:
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset

In [17]:
# if 'IS_GRADESCOPE_ENV' not in os.environ:
dataset_name = 'ogbn-arxiv'
root = "../datasets/"

# Load the dataset and convert its connectivity to a sparse tensor
# Note: the adjacency matrix has 1,166,243 non-zero entries
dataset = PygNodePropPredDataset(name=dataset_name,
                                 root=root,
                              transform=T.ToSparseTensor())

print('The {} dataset has {} graph'.format(dataset_name, len(dataset)))

# Extract the graph
data = dataset[0]
print(data)

The ogbn-arxiv dataset has 1 graph
Data(num_nodes=169343, x=[169343, 128], node_year=[169343, 1], y=[169343, 1], adj_t=[169343, 169343, nnz=1166243])


## Question 4: How many features are in the ogbn-arxiv graph? (5 points)

In [18]:
def graph_num_features(data):
  # TODO: Implement a function that takes a PyG data object,
  # and returns the number of features in the graph (as an integer).

  num_features = 0

  ############# Your code here ############
  ## (~1 line of code)
  num_features = data.num_features
  #########################################

  return num_features

# if 'IS_GRADESCOPE_ENV' not in os.environ:
num_features = graph_num_features(data)
print('The graph has {} features'.format(num_features))

The graph has 128 features


# 3) GNN: Node Property Prediction

In this section we will build our first graph neural network using PyTorch Geometric. Then we will apply it to the task of node property prediction (node classification).

Specifically, we will use GCN as the foundation for our graph neural network ([Kipf & Welling (2017)](https://arxiv.org/pdf/1609.02907.pdf)). To do so, we will work with PyG's built-in `GCNConv` layer.
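For intuition, a single GCN layer propagates features with the symmetrically normalized adjacency matrix with self-loops, $D^{-1/2}(A + I)D^{-1/2}$, before applying a learned weight matrix and a nonlinearity. The sketch below shows only the propagation step on scalar features for a toy path graph. It is a plain-Python illustration under those simplifying assumptions (no weights, no bias, no activation), and `gcn_propagate` is a hypothetical helper, not part of PyG.

```python
import math


def gcn_propagate(num_nodes, edges, x):
    """Apply D^{-1/2} (A + I) D^{-1/2} to scalar node features x."""
    # Neighbor sets with self-loops added, i.e. the support of A + I.
    neighbors = {i: {i} for i in range(num_nodes)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = {i: len(neighbors[i]) for i in range(num_nodes)}

    out = []
    for i in range(num_nodes):
        total = 0.0
        for j in neighbors[i]:
            # Each message is scaled by 1 / sqrt(deg(i) * deg(j)).
            total += x[j] / math.sqrt(degree[i] * degree[j])
        out.append(total)
    return out


# Path graph 0 - 1 - 2 with one scalar feature per node.
propagated = gcn_propagate(3, [(0, 1), (1, 2)], [1.0, 2.0, 3.0])
print(propagated)
```

Each output value is a degree-weighted average of a node's own feature and its neighbors' features, which is why stacking layers lets information travel further across the graph.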

## Setup

In [19]:
import torch
import pandas as pd
import torch.nn.functional as F
print(torch.__version__)

# The PyG built-in GCNConv
from torch_geometric.nn import GCNConv

import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator

2.2.1


## Load and Preprocess the Dataset

In [20]:
# if 'IS_GRADESCOPE_ENV' not in os.environ:
dataset_name = 'ogbn-arxiv'
root = "../datasets/"

dataset = PygNodePropPredDataset(name=dataset_name,
                                 root=root,
                              transform=T.ToSparseTensor())
data = dataset[0]

In [21]:
print(data)

Data(num_nodes=169343, x=[169343, 128], node_year=[169343, 1], y=[169343, 1], adj_t=[169343, 169343, nnz=1166243])


In [82]:
# the original dataset is directed

data.is_directed()

True

In [83]:
# Make the adjacency matrix symmetric,
# i.e. convert the directed graph to an undirected one

data.adj_t = data.adj_t.to_symmetric()

In [84]:
data.is_directed()

False

In [90]:
print(data)

Data(num_nodes=169343, x=[169343, 128], node_year=[169343, 1], y=[169343, 1], adj_t=[169343, 169343, nnz=2315598])
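Note that the number of non-zero entries grows from 1,166,243 to 2,315,598, which is slightly less than 2 × 1,166,243 = 2,332,486. One explanation consistent with these numbers is that an edge already present in both directions is not duplicated when the matrix is symmetrized. The sketch below illustrates that effect with Python sets on made-up edges; `to_symmetric` here is a hypothetical stand-in, not the sparse-tensor method used above.

```python
def to_symmetric(directed_edges):
    """Return the edge set with every edge mirrored (duplicates collapse)."""
    edges = set(directed_edges)
    return edges | {(v, u) for u, v in edges}


# Hypothetical citation edges: 0 and 1 cite each other, 2 -> 0 is one-way.
toy_edges = [(0, 1), (1, 0), (2, 0)]
sym_edges = to_symmetric(toy_edges)
print(len(toy_edges), "->", len(sym_edges))
```

Here three directed edges become four entries rather than six, because the reciprocal pair was already symmetric.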


In [93]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# If you use GPU, the device should be cuda
print('Device: {}'.format(device))

data = data.to(device) # move data to device
split_idx = dataset.get_idx_split() # returns a dict with the 'train', 'valid', and 'test' node indices
train_idx = split_idx['train'].to(device) # move the training indices to the device

Device: cpu


In [104]:
train_idx.shape  # ~90k nodes for training

torch.Size([90941])

## GCN Model

Now we will implement our GCN model!

Please follow the figure below to implement the `forward` function.


[test](https://drive.google.com/uc?id=128AuYAXNXGg7PIhJJ7e420DoPWKb-RtL)

In [50]:
class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers,
                 dropout, return_embeds=False):
        # TODO: Implement a function that initializes self.convs,
        # self.bns, and self.softmax.
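        '''
        input_dim: the dimension of the input node features
        hidden_dim: the dimension of the hidden layers
        output_dim: the dimension of the final layer's output (e.g. the number of classes)
        num_layers: the number of GCNConv layers
        dropout: the probability of zeroing an element in each dropout step
        return_embeds: if True, skip the log-softmax and return the raw embeddings
        '''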

        super(GCN, self).__init__()

        # A list of GCNConv layers
        '''
        We need at least two layers: the input layer and the output layer.
        range(n) yields an empty sequence when n <= 0, so with num_layers == 2
        no intermediate hidden layers are created.
        '''
        self.convs = torch.nn.ModuleList(
            [GCNConv(in_channels=input_dim, out_channels=hidden_dim)] +
            [GCNConv(in_channels=hidden_dim, out_channels=hidden_dim) for i in range(num_layers - 2)] +
            [GCNConv(in_channels=hidden_dim, out_channels=output_dim)]
        )

        # A list of 1D batch normalization layers
        '''
        There is a batch norm layer after every GCNConv layer except the last one.
        num_features is the number of channels produced by the preceding layer.
        '''
        self.bns = torch.nn.ModuleList([torch.nn.BatchNorm1d(num_features=hidden_dim) for i in range(num_layers-1)])

        # The log softmax layer
        self.softmax = torch.nn.LogSoftmax(dim=-1) # compute softmax in the last dimension

        ############# Your code here ############
        ## Note:
        ## 1. You should use torch.nn.ModuleList for self.convs and self.bns
        ## 2. self.convs has num_layers GCNConv layers
        ## 3. self.bns has num_layers - 1 BatchNorm1d layers
        ## 4. You should use torch.nn.LogSoftmax for self.softmax
        ## 5. The parameters you can set for GCNConv include 'in_channels' and
        ## 'out_channels'. For more information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv
        ## 6. The only parameter you need to set for BatchNorm1d is 'num_features'
        ## For more information please refer to the documentation:
        ## https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html
        ## (~10 lines of code)
        

        #########################################

        # Probability of an element getting zeroed
        self.dropout = dropout

        # Skip classification layer and return node embeddings
        self.return_embeds = return_embeds

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters() # it resets the model weights 
        for bn in self.bns:
            bn.reset_parameters()

    def forward(self, x, adj_t):
        # TODO: Implement a function that takes the feature tensor x and
        # edge_index tensor adj_t and returns the output tensor as
        # shown in the figure.

        out = None

        ############# Your code here ############
        ## Note:
        ## 1. Construct the network as shown in the figure
        ## 2. torch.nn.functional.relu and torch.nn.functional.dropout are useful
        ## For more information please refer to the documentation:
        ## https://pytorch.org/docs/stable/nn.functional.html
        ## 3. Don't forget to set F.dropout training to self.training
        ## 4. If return_embeds is True, then skip the last softmax layer
        ## (~7 lines of code)
        for conv_layer, bn_layer in zip(self.convs[:-1],self.bns): # ignore output layer
            x = F.relu(bn_layer(conv_layer(x, adj_t)))
            x = F.dropout(x, p=self.dropout, training=self.training) 
            
        out = self.convs[-1](x, adj_t) # feed current tensor to output layer
        if not self.return_embeds: # apply log-softmax unless raw embeddings were requested
            out = self.softmax(out)
        #########################################

        return out # log-probabilities, or raw embeddings if return_embeds is True
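The note about passing `training=self.training` to `F.dropout` matters because dropout must behave differently at training and evaluation time. The sketch below illustrates the usual "inverted dropout" behavior in plain Python: elements are zeroed with probability `p` and survivors are rescaled by `1 / (1 - p)` during training, while evaluation leaves the input unchanged. The function name and the explicit `rng` argument are assumptions made for this illustration, and `p < 1` is assumed.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout: zero each element with probability p while training."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: identity
    scale = 1.0 / (1.0 - p)  # keeps the expected value unchanged
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)
features = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(features, p=0.5, training=True, rng=rng)
eval_out = dropout(features, p=0.5, training=False, rng=rng)
print(train_out, eval_out)
```

If the flag were left at a fixed `True`, predictions made after `model.eval()` would still be randomly perturbed.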

In [164]:
def train(model, data, train_idx, optimizer, loss_fn):
    # TODO: Implement a function that trains the model by
    # using the given optimizer and loss_fn.
    model.train()
    loss = 0

    ############# Your code here ############
    ## Note:
    ## 1. Zero grad the optimizer
    ## 2. Feed the data into the model
    ## 3. Slice the model output and label by train_idx
    ## 4. Feed the sliced output and label to loss_fn
    ## (~4 lines of code)
    optimizer.zero_grad()
    output = model(data.x, data.adj_t) # full-batch forward pass over all nodes

    train_pred = output[train_idx] # get predictions for training nodes
    train_labels = data.y[train_idx].squeeze() # transform to vector as required by nll_loss

    loss = loss_fn(train_pred, train_labels)
    #########################################

    loss.backward() # compute gradients
    optimizer.step() # update weights

    return loss.item()

In [165]:
# Test function here
@torch.no_grad() # the function uses the no_grad() decorator to disable gradient computation during inference
def test(model, data, split_idx, evaluator, save_model_results=False):
    # TODO: Implement a function that tests the model by
    # using the given split_idx and evaluator.
    model.eval() # set model to eval mode

    # The output of model on all data
    out = None

    ############# Your code here ############
    ## (~1 line of code)
    ## Note:
    ## 1. No index slicing here
    out = model(data.x, data.adj_t)
    #########################################

    y_pred = out.argmax(dim=-1, keepdim=True) # predict the index with the largest probability

    train_acc = evaluator.eval({
        'y_true': data.y[split_idx['train']],
        'y_pred': y_pred[split_idx['train']],
    })['acc']
    valid_acc = evaluator.eval({
        'y_true': data.y[split_idx['valid']],
        'y_pred': y_pred[split_idx['valid']],
    })['acc']
    test_acc = evaluator.eval({
        'y_true': data.y[split_idx['test']],
        'y_pred': y_pred[split_idx['test']],
    })['acc']

    if save_model_results:
      print ("Saving Model Predictions")

      data = {}
      data['y_pred'] = y_pred.view(-1).cpu().detach().numpy()

      df = pd.DataFrame(data=data)
      # Save locally as csv
      df.to_csv('ogbn-arxiv_node.csv', sep=',', index=False) # saves predictions of the model 


    return train_acc, valid_acc, test_acc
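The `'acc'` value read from the evaluator above is a classification accuracy. As a rough sketch of what happens between the model output and that number, the plain-Python example below takes the argmax of each row of log-probabilities and computes the fraction of correct predictions. The data is made up, and `accuracy` and `argmax` are hypothetical helpers rather than the OGB implementation.

```python
def argmax(row):
    """Index of the largest entry in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(y_true, y_pred):
    """Fraction of nodes whose predicted class equals the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical log-probabilities for 3 nodes and 3 classes.
log_probs = [[-0.1, -2.5, -3.0], [-1.2, -0.4, -2.0], [-2.0, -1.5, -0.3]]
y_pred = [argmax(row) for row in log_probs]
toy_acc = accuracy([0, 1, 1], y_pred)
print(y_pred, toy_acc)
```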

In [174]:
# Please do not change the args
# if 'IS_GRADESCOPE_ENV' not in os.environ:
args = {
  'device': device,
  'num_layers': 3,
  'hidden_dim': 256,
  'dropout': 0.5,
  'lr': 0.01,
  'epochs': 10,
}
args

{'device': 'cpu',
 'num_layers': 3,
 'hidden_dim': 256,
 'dropout': 0.5,
 'lr': 0.01,
 'epochs': 10}

In [None]:
# if 'IS_GRADESCOPE_ENV' not in os.environ:
model = GCN(data.num_features, args['hidden_dim'],
          dataset.num_classes, args['num_layers'],
          args['dropout']).to(device)

In [None]:
evaluator = Evaluator(name='ogbn-arxiv')

In [176]:
'''
Get a summary from the model
'''

from torch_geometric.nn import summary

x = torch.randn(100, 128, device=device) # create dummy node data on the same device as the model
edge_index = torch.randint(100, size=(2, 20), device=device) # create dummy edge data

print(summary(model, x, edge_index))

+-----------------------+---------------------+----------------+----------+
| Layer                 | Input Shape         | Output Shape   | #Param   |
|-----------------------+---------------------+----------------+----------|
| GCN                   | [100, 128], [2, 20] | [100, 40]      | 110,120  |
| ├─(convs)ModuleList   | --                  | --             | 109,096  |
| │    └─(0)GCNConv     | [100, 128], [2, 20] | [100, 256]     | 33,024   |
| │    └─(1)GCNConv     | [100, 256], [2, 20] | [100, 256]     | 65,792   |
| │    └─(2)GCNConv     | [100, 256], [2, 20] | [100, 40]      | 10,280   |
| ├─(bns)ModuleList     | --                  | --             | 1,024    |
| │    └─(0)BatchNorm1d | [100, 256]          | [100, 256]     | 512      |
| │    └─(1)BatchNorm1d | [100, 256]          | [100, 256]     | 512      |
| ├─(softmax)LogSoftmax | [100, 40]           | [100, 40]      | --       |
+-----------------------+---------------------+----------------+----------+
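The parameter counts in the table can be checked by hand, assuming each `GCNConv` layer holds one weight matrix plus a bias vector and each `BatchNorm1d` layer holds a learnable scale and shift per feature. The helper names below are hypothetical.

```python
def gcn_conv_params(in_channels, out_channels):
    """Weight matrix plus bias vector of one GCNConv layer."""
    return in_channels * out_channels + out_channels


def batch_norm_params(num_features):
    """Learnable scale and shift of one BatchNorm1d layer."""
    return 2 * num_features


conv_counts = [
    gcn_conv_params(128, 256),  # input layer
    gcn_conv_params(256, 256),  # hidden layer
    gcn_conv_params(256, 40),   # output layer (40 classes)
]
bn_total = 2 * batch_norm_params(256)
total_params = sum(conv_counts) + bn_total
print(conv_counts, bn_total, total_params)
```

The results match the `#Param` column: 33,024, 65,792, and 10,280 for the three convolutions, 1,024 for the two batch norm layers, and 110,120 in total.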


In [177]:
# Please do not change these args
# Training should take <10min using GPU runtime
import copy
import time
# if 'IS_GRADESCOPE_ENV' not in os.environ:
# reset the parameters to initial random value
model.reset_parameters()

optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
loss_fn = F.nll_loss # input is expected to be in log-probabilities
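Because the model ends in a log-softmax layer, pairing it with the negative log-likelihood loss is equivalent to applying cross-entropy to the raw scores. The sketch below works through that on made-up logits in plain Python; `log_softmax` and `nll_loss` are simplified stand-ins written for this illustration, with the loss averaged over the examples.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll_loss(log_probs, targets):
    """Mean negative log-probability assigned to the target classes."""
    picked = [row[t] for row, t in zip(log_probs, targets)]
    return -sum(picked) / len(targets)


logits = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
targets = [0, 1]
toy_loss = nll_loss([log_softmax(row) for row in logits], targets)
print(toy_loss)
```

Feeding plain probabilities or raw logits into the NLL loss would silently give a wrong value, which is why the comment above stresses log-probabilities.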

In [None]:
best_model = None
best_valid_acc = 0

for epoch in range(1, 1 + args["epochs"]):
    start_time = time.time()
    loss = train(model, data, train_idx, optimizer, loss_fn) # train the model and compute loss
    result = test(model, data, split_idx, evaluator) # evaluate the model on the train, validation, and test splits
    train_acc, valid_acc, test_acc = result
    end_time = time.time()
    if valid_acc > best_valid_acc: # if the validation acc is better than the best one
        best_valid_acc = valid_acc
        best_model = copy.deepcopy(model) # save a copy of the best model
    print(f'Epoch: {epoch:02d}, '
          f'Loss: {loss:.4f}, '
          f'Train: {100 * train_acc:.2f}%, '
          f'Valid: {100 * valid_acc:.2f}% '
          f'Test: {100 * test_acc:.2f}%')
    print(f"Time elapsed: {round(end_time - start_time)}")

## Question 5: What are your `best_model` validation and test accuracies? (20 points)

Run the cell below to see the results of your best model and save your model's predictions to a file named *ogbn-arxiv_node.csv*.

You can view this file by clicking on the *Folder* icon on the left side panel. As in Colab 1, when you submit your assignment, you will have to download this file and attach it to your submission.

In [None]:
best_result = test(best_model, data, split_idx, evaluator, save_model_results=True)
train_acc, valid_acc, test_acc = best_result
print(f'Best model: '
    f'Train: {100 * train_acc:.2f}%, '
    f'Valid: {100 * valid_acc:.2f}% '
    f'Test: {100 * test_acc:.2f}%')

# 4) GNN: Graph Property Prediction

In this section we will create a graph neural network for graph property prediction (graph classification).


## Load and preprocess the dataset

In [27]:
from ogb.graphproppred import PygGraphPropPredDataset, Evaluator
from torch_geometric.loader import DataLoader
from tqdm.notebook import tqdm

In [28]:
# if 'IS_GRADESCOPE_ENV' not in os.environ:
# Load the dataset
dataset = PygGraphPropPredDataset(name='ogbg-molhiv', root='../datasets/')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device: {}'.format(device))

split_idx = dataset.get_idx_split()

# Check task type
print('Task type: {}'.format(dataset.task_type))

Device: cpu
Task type: binary classification


In [29]:
len(dataset) # number of graphs in the dataset

41127

In [30]:
# Load the dataset splits into corresponding dataloaders
# We will train the graph classification task on a batch of 32 graphs
# Shuffle the order of graphs for training set
if 'IS_GRADESCOPE_ENV' not in os.environ:
  train_loader = DataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True, num_workers=0)
  valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=32, shuffle=False, num_workers=0)
  test_loader = DataLoader(dataset[split_idx["test"]], batch_size=32, shuffle=False, num_workers=0)

In [34]:
len(train_loader), len(valid_loader), len(test_loader)

(1029, 129, 129)

In [41]:
for batch in train_loader:
    example_batch = batch
    break

In [76]:
# The batch holds 32 graphs merged into one large disconnected graph; each graph can still be indexed individually
example_batch

DataBatch(edge_index=[2, 1540], edge_attr=[1540, 3], x=[727, 9], y=[32, 1], num_nodes=727, batch=[727], ptr=[33])

In [49]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  # Please do not change the args
  args = {
      'device': device,
      'num_layers': 5,
      'hidden_dim': 256,
      'dropout': 0.5,
      'lr': 0.001,
      'epochs': 30,
  }
  args

## Graph Prediction Model

### Graph Mini-Batching
Before diving into the actual model, we introduce the concept of mini-batching with graphs. In order to parallelize the processing of a mini-batch of graphs, PyG combines the graphs into a single disconnected graph data object (*torch_geometric.data.Batch*). *torch_geometric.data.Batch* inherits from *torch_geometric.data.Data* (introduced earlier) and contains an additional attribute called `batch`.

The `batch` attribute is a vector mapping each node to the index of its corresponding graph within the mini-batch:

    batch = [0, ..., 0, 1, ..., n - 2, n - 1, ..., n - 1]

This attribute is crucial for associating which graph each node belongs to and can be used to e.g. average the node embeddings for each graph individually to compute graph level embeddings.
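The following plain-Python sketch shows one way such a mini-batch could be assembled: node ids of each graph are shifted by the number of nodes seen so far, and the `batch` vector records which graph every node came from. It also builds a `ptr` list of cumulative node offsets, which would explain why the 32-graph example batch printed earlier has `ptr=[33]`. The function `collate` and the toy graphs are illustrative assumptions, not PyG's implementation.

```python
def collate(graphs):
    """Merge graphs into one disconnected graph with batch and ptr vectors.

    Each graph is a (num_nodes, edge_list) pair.
    """
    batch, ptr, edges = [], [0], []
    for graph_id, (num_nodes, edge_list) in enumerate(graphs):
        offset = ptr[-1]
        # Shift node ids so they stay unique inside the merged graph.
        edges.extend((u + offset, v + offset) for u, v in edge_list)
        batch.extend([graph_id] * num_nodes)
        ptr.append(offset + num_nodes)
    return batch, ptr, edges


# Two hypothetical graphs: a 2-node edge and a 3-node path.
toy_graphs = [(2, [(0, 1)]), (3, [(0, 1), (1, 2)])]
toy_batch, toy_ptr, toy_edges = collate(toy_graphs)
print(toy_batch, toy_ptr, toy_edges)
```

Because no edge crosses between graphs, message passing on the merged graph gives the same result as running on each graph separately.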



### Implementation
Now, we have all of the tools to implement a GCN Graph Prediction model!  

We will reuse the existing GCN model to generate `node_embeddings` and then use `Global Pooling` over the nodes to create graph-level embeddings that can be used to predict properties for each graph. Remember that the `batch` attribute will be essential for performing Global Pooling over our mini-batch of graphs.
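To see how the `batch` attribute drives pooling, here is a minimal plain-Python sketch of global mean pooling: node embeddings are grouped by their graph id and averaged per graph. The function name `global_mean_pool_sketch` and the toy data are assumptions for illustration, and every graph is assumed to contain at least one node.

```python
def global_mean_pool_sketch(node_embeddings, batch):
    """Average node embeddings per graph; batch[i] is node i's graph id."""
    num_graphs = max(batch) + 1
    dim = len(node_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for emb, g in zip(node_embeddings, batch):
        counts[g] += 1
        for d in range(dim):
            sums[g][d] += emb[d]
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]


# Three nodes with 2-dimensional embeddings; the first two share graph 0.
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
toy_batch_vec = [0, 0, 1]
graph_embeddings = global_mean_pool_sketch(toy_embeddings, toy_batch_vec)
print(graph_embeddings)
```

The output has one row per graph, which is the shape the final linear layer expects.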

In [51]:
from ogb.graphproppred.mol_encoder import AtomEncoder
from torch_geometric.nn import global_add_pool, global_mean_pool

In [75]:
# count the total number of nodes in this batch
sum([example_batch[i].num_nodes for i in range(len(example_batch))])

727

In [88]:
# Test out the molecule encoder
encoder = AtomEncoder(100) # create embeddings of dimension 100

print(example_batch[2].x[:3]) # example of node data
print(example_batch[2].x.shape)

print(encoder(example_batch[2].x[0:1]))
print(encoder(example_batch[2].x).shape) # example of output embeddings

tensor([[5, 0, 4, 5, 3, 0, 2, 0, 0],
        [7, 0, 2, 5, 0, 0, 1, 0, 0],
        [5, 0, 3, 5, 0, 0, 1, 1, 1]])
torch.Size([20, 9])
tensor([[-0.1715,  0.3477,  0.3981, -0.1655, -0.1977,  0.1557, -0.3424,  0.2687,
         -0.0577, -0.5748, -0.5736,  0.0199,  0.1888,  0.2286,  0.2420, -0.2467,
          0.3970,  0.1647,  0.4794,  0.4815,  0.1288, -0.6507,  0.2565,  0.9478,
         -0.6620, -0.3255,  0.4599,  0.0568, -0.2444, -0.2136, -0.8143, -0.6217,
         -0.4580,  0.4377, -0.8823, -1.1215,  0.9894,  0.4021, -0.4093, -0.1832,
          0.1327,  0.5727,  0.1113, -0.3365, -0.2101, -0.2479,  0.3047,  0.2963,
         -0.2182, -0.1673,  0.4669, -0.5770,  0.6525, -0.4454,  0.7122, -0.0149,
         -0.1781, -0.0908, -0.5356,  0.3018, -0.3281,  0.0104,  0.4993, -0.2459,
         -0.2312,  0.5500,  0.5061, -0.0865,  0.2999, -0.0628,  0.4464, -0.1697,
          0.7962,  0.2416, -0.3783,  0.0551, -0.3426,  0.3654, -0.9410,  0.4510,
          0.4122,  0.4775, -0.1227, -0.5058, -0.2735,  0.1
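The cell above shows that each atom is described by 9 integer-coded categorical features and that the encoder maps them to a single dense vector. A common way to do this, and our understanding of what `AtomEncoder` does (worth confirming against the `ogb` source), is to keep one embedding table per feature and sum the looked-up rows. The sketch below illustrates that idea in plain Python with random tables. All names and the vocabulary sizes are hypothetical.

```python
import random


def make_atom_encoder(feature_sizes, emb_dim, seed=0):
    """Build one random lookup table per categorical atom feature."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(size)]
        for size in feature_sizes
    ]


def encode_atom(tables, atom):
    """Sum the embedding rows selected by each categorical feature value."""
    emb_dim = len(tables[0][0])
    out = [0.0] * emb_dim
    for table, value in zip(tables, atom):
        for d in range(emb_dim):
            out[d] += table[value][d]
    return out


# Hypothetical vocabulary sizes for 3 features (the real dataset has 9).
tables = make_atom_encoder(feature_sizes=[8, 4, 6], emb_dim=5)
atom_vec = encode_atom(tables, [5, 0, 4])
print(atom_vec)
```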

In [92]:
### GCN to predict graph property
class GCN_Graph(torch.nn.Module):
    def __init__(self, hidden_dim, output_dim, num_layers, dropout):
        super(GCN_Graph, self).__init__()

        # Load encoders for Atoms in molecule graphs
        self.node_encoder = AtomEncoder(hidden_dim) # embeds each atom's (node's) categorical features into hidden_dim dimensions

        # Node embedding model
        # Note that the input_dim and output_dim are set to hidden_dim
        self.gnn_node = GCN(hidden_dim, hidden_dim,
            hidden_dim, num_layers, dropout, return_embeds=True) # reuse our own GCN implementation

        ############# Your code here ############
        ## Note:
        ## 1. Initialize self.pool as a global mean pooling layer
        ## For more information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#global-pooling-layers
        
        self.pool = global_mean_pool # it averages the node features over the node dimension for a given batch
        #########################################

        # Output layer
        self.linear = torch.nn.Linear(hidden_dim, output_dim) # a linear layer to transform aggregated embeddings to outputs


    def reset_parameters(self):
      self.gnn_node.reset_parameters()
      self.linear.reset_parameters()

    def forward(self, batched_data):
        # TODO: Implement a function that takes as input a
        # mini-batch of graphs (torch_geometric.data.Batch) and
        # returns the predicted graph property for each graph.
        #
        # NOTE: Since we are predicting graph level properties,
        # your output will be a tensor with dimension equaling
        # the number of graphs in the mini-batch


        # Extract important attributes of our mini-batch
        '''
        x - the features of all nodes in the batch
        edge_index - all edges in the batch
        batch - maps each node to the index of its graph within the batch
        '''
        x, edge_index, batch = batched_data.x, batched_data.edge_index, batched_data.batch
        embed = self.node_encoder(x)

        out = None

        ############# Your code here ############
        ## Note:
        ## 1. Construct node embeddings using existing GCN model
        ## 2. Use the global pooling layer to aggregate features for each individual graph
        ## For more information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#global-pooling-layers
        ## 3. Use a linear layer to predict each graph's property
        ## (~3 lines of code)
        out = self.gnn_node(embed, edge_index) # compute node embeddings with the GCN
        out = self.pool(out, batch) # average the node embeddings of each graph (global mean pooling)
        out = self.linear(out) # send graph embeddings to the prediction head
        #########################################

        return out

In [103]:
def train(model, device, data_loader, optimizer, loss_fn):
    # TODO: Implement a function that trains your model by
    # using the given optimizer and loss_fn.
    model.train()
    loss = 0

    for step, batch in enumerate(tqdm(data_loader, desc="Iteration")):
      batch = batch.to(device)

      if batch.x.shape[0] == 1 or batch.batch[-1] == 0: # skip batches that contain only one node or only one graph
          pass
      else:
        ## ignore nan targets (unlabeled) when computing training loss.
        is_labeled = batch.y == batch.y 

        ############# Your code here ############
        ## Note:
        ## 1. Zero grad the optimizer
        ## 2. Feed the data into the model
        ## 3. Use `is_labeled` mask to filter output and labels
        ## 4. You may need to change the type of label to torch.float32
        ## 5. Feed the output and label to the loss_fn
        ## (~3 lines of code)
        optimizer.zero_grad()
        out = model(batch) # forward pass on the mini-batch
        pred_labels = out[is_labeled] # boolean masking returns a 1-D tensor
        true_labels = batch.y[is_labeled].to(torch.float32) # already 1-D after masking, so no squeeze is needed
        loss = loss_fn(pred_labels, true_labels)  
        #########################################

        loss.backward()
        optimizer.step()

    return loss.item()
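The mask `batch.y == batch.y` in the training loop relies on the fact that NaN is the only floating-point value that is not equal to itself, so the comparison is `False` exactly where a label is missing. The plain-Python sketch below demonstrates the trick on made-up values; the variable and function names are illustrative.

```python
def labeled_mask(labels):
    """True where a label is present; NaN never equals itself."""
    return [y == y for y in labels]


toy_labels = [1.0, float("nan"), 0.0]
toy_logits = [0.3, -1.2, 2.0]

mask = labeled_mask(toy_labels)
# Keep only the predictions and labels of the labeled examples.
kept_logits = [z for z, keep in zip(toy_logits, mask) if keep]
kept_labels = [y for y, keep in zip(toy_labels, mask) if keep]
print(mask, kept_logits, kept_labels)
```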

In [109]:
# The evaluation function
def eval(model, device, loader, evaluator, save_model_results=False, save_file=None):
    model.eval()
    y_true = []
    y_pred = []

    for step, batch in enumerate(tqdm(loader, desc="Iteration")):
        batch = batch.to(device)

        if batch.x.shape[0] == 1:
            pass
        else:
            with torch.no_grad():
                pred = model(batch) 

            y_true.append(batch.y.view(pred.shape).detach().cpu())
            y_pred.append(pred.detach().cpu())

    y_true = torch.cat(y_true, dim = 0).numpy()
    y_pred = torch.cat(y_pred, dim = 0).numpy()

    input_dict = {"y_true": y_true, "y_pred": y_pred}

    if save_model_results:
        print ("Saving Model Predictions")

        # Create a pandas dataframe with two columns
        # y_pred | y_true
        data = {}
        data['y_pred'] = y_pred.reshape(-1)
        data['y_true'] = y_true.reshape(-1)

        df = pd.DataFrame(data=data)
        # Save to csv
        df.to_csv('ogbg-molhiv_graph_' + save_file + '.csv', sep=',', index=False)

    return evaluator.eval(input_dict)
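For this task the metric of interest is ROC-AUC (see Question 6). One way to read it is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. Since only the ranking matters, raw model outputs can be scored directly. The sketch below computes that quantity by brute force over all positive/negative pairs. It is a plain-Python illustration on made-up data, not the evaluator's implementation, and it is quadratic in the number of examples.

```python
def roc_auc(y_true, y_score):
    """Chance that a random positive outranks a random negative (ties count half)."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

In this toy case three of the four positive/negative pairs are ordered correctly, giving 0.75.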

In [105]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  model = GCN_Graph(args['hidden_dim'],
              dataset.num_tasks, args['num_layers'],
              args['dropout']).to(device)
  evaluator = Evaluator(name='ogbg-molhiv')

In [None]:
# Please do not change these args
# Training should take <10min using GPU runtime
import copy

if 'IS_GRADESCOPE_ENV' not in os.environ:
  model.reset_parameters()

  optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
  loss_fn = torch.nn.BCEWithLogitsLoss()

  best_model = None
  best_valid_acc = 0

  for epoch in range(1, 1 + args["epochs"]):
    print('Training...')
    loss = train(model, device, train_loader, optimizer, loss_fn)

    print('Evaluating...')
    train_result = eval(model, device, train_loader, evaluator)
    val_result = eval(model, device, valid_loader, evaluator)
    test_result = eval(model, device, test_loader, evaluator)

    train_acc, valid_acc, test_acc = train_result[dataset.eval_metric], val_result[dataset.eval_metric], test_result[dataset.eval_metric]
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        best_model = copy.deepcopy(model)
    print(f'Epoch: {epoch:02d}, '
          f'Loss: {loss:.4f}, '
          f'Train: {100 * train_acc:.2f}%, '
          f'Valid: {100 * valid_acc:.2f}% '
          f'Test: {100 * test_acc:.2f}%')
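The loss used above, `BCEWithLogitsLoss`, takes raw scores (logits) rather than probabilities. A minimal plain-Python sketch of the underlying computation is shown below, using the standard numerically stable rewriting of binary cross-entropy so that large logits do not overflow. The function name and inputs are illustrative, and the result is averaged over the examples.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


toy_bce = bce_with_logits([0.0, 2.0, -3.0], [1.0, 1.0, 0.0])
print(toy_bce)
```

Working from logits is why `GCN_Graph` ends in a plain linear layer with no sigmoid.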

## Question 6: What are your `best_model` validation and test ROC-AUC scores? (20 points)

Run the cell below to see the results of your best model and save your model's predictions over the validation and test datasets. The resulting files are named *ogbg-molhiv_graph_valid.csv* and *ogbg-molhiv_graph_test.csv*.

Again, you can view these files by clicking on the *Folder* icon on the left side panel. As in Colab 1, when you submit your assignment, you will have to download these files and attach them to your submission.

In [None]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  train_acc = eval(best_model, device, train_loader, evaluator)[dataset.eval_metric]
  valid_acc = eval(best_model, device, valid_loader, evaluator, save_model_results=True, save_file="valid")[dataset.eval_metric]
  test_acc  = eval(best_model, device, test_loader, evaluator, save_model_results=True, save_file="test")[dataset.eval_metric]

  print(f'Best model: '
      f'Train: {100 * train_acc:.2f}%, '
      f'Valid: {100 * valid_acc:.2f}% '
      f'Test: {100 * test_acc:.2f}%')

## Question 7 (Optional): Experiment with the two other global pooling layers in PyTorch Geometric.
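Besides `global_mean_pool`, PyG provides `global_add_pool` and `global_max_pool`. As a starting point for this experiment, the plain-Python sketch below shows how sum and max pooling differ on the same toy batch. The generic `global_pool` helper is a hypothetical illustration of the idea, not PyG code.

```python
def global_pool(node_embeddings, batch, reduce):
    """Group node embeddings by graph id, then reduce each feature column."""
    groups = {}
    for emb, g in zip(node_embeddings, batch):
        groups.setdefault(g, []).append(emb)
    # zip(*rows) iterates over feature columns of one graph's nodes.
    return [[reduce(col) for col in zip(*groups[g])] for g in sorted(groups)]


toy_nodes = [[1.0, -2.0], [3.0, 4.0], [10.0, 0.0]]
toy_ids = [0, 0, 1]
added = global_pool(toy_nodes, toy_ids, sum)
maxed = global_pool(toy_nodes, toy_ids, max)
print(added, maxed)
```

Sum pooling grows with graph size while max pooling keeps only the strongest activation per feature, so the two can behave quite differently on molecules of varying size.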

# Submission

To submit Colab 2, please submit to the following assignments on Gradescope:

1. "Colab 2": submit your answers to the questions in this assignment
2. "Colab 2 Code": submit your completed *CS224W_Colab2.ipynb*. From the "File" menu select "Download .ipynb" to save a local copy of your completed Colab. **PLEASE DO NOT CHANGE THE NAME!** The autograder depends on the .ipynb file being called "CS224W_Colab2.ipynb".