# **CS224W - Colab 2**

In this Colab, we will construct our own graph neural network using PyTorch Geometric (PyG) and apply the model to two Open Graph Benchmark (OGB) datasets. The two datasets benchmark model performance on two different graph-related tasks. One is node property prediction: predicting properties of single nodes. The other is graph property prediction: predicting properties of entire graphs or subgraphs.

First, we will learn how PyTorch Geometric stores graphs as PyTorch tensors.

We will then load and take a quick look at one of the Open Graph Benchmark (OGB) datasets using the `ogb` package. OGB is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. The `ogb` package provides not only data loaders for the datasets but also evaluators.

Finally, we will build our own graph neural networks using PyTorch Geometric, then apply and evaluate the models on node property prediction and graph property prediction tasks.

**Note**: Make sure to **sequentially run all the cells in each section**, so that the intermediate variables / packages will carry over to the next cell.

Have fun on Colab 2 :)

# Device
You might need to use GPU for this Colab.

Please click `Runtime` and then `Change runtime type`. Then set the `hardware accelerator` to **GPU**.

# Installation

In [1]:
import torch
torch.__version__

'1.8.1+cu101'

In [2]:
# !pip uninstall torch-scatter -y -q
# !pip uninstall torch-sparse -y -q
# !pip uninstall torch-geometric -y -q
# !pip uninstall ogb -y -q

In [3]:
!pip install torch-scatter -q -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-sparse -q -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-cluster -q -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-spline-conv -q -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-geometric -q
!pip install ogb -q


# 1 PyTorch Geometric (Datasets and Data)


PyTorch Geometric has two main modules for storing graphs and transforming them into tensor format. One is `torch_geometric.datasets`, which contains a variety of common graph datasets. The other is `torch_geometric.data`, which provides data handling of graphs as PyTorch tensors.

In this section, we will learn how to use `torch_geometric.datasets` and `torch_geometric.data`.

## PyG Datasets

`torch_geometric.datasets` has many common graph datasets. Here we explore its usage with one example dataset.  
  
[Related PyG doc](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html)  
  
We load the ENZYMES (enzymes) dataset from TUDataset.  
It consists of 600 graphs in 6 classes.

In [4]:
from torch_geometric.datasets import TUDataset

root = './enzymes'
name = 'ENZYMES'

# The ENZYMES dataset
pyg_dataset = TUDataset(root, name)

# You can find that there are 600 graphs in this dataset
print(pyg_dataset)

Downloading https://www.chrsmrrs.com/graphkerneldatasets/ENZYMES.zip
Extracting enzymes/ENZYMES/ENZYMES.zip
Processing...
Done!
ENZYMES(600)


## Question 1: What is the number of classes and number of features in the ENZYMES dataset? (5 points)

In [5]:
def get_num_classes(pyg_dataset):
    # TODO: Implement this function that takes a PyG dataset object
    # and return the number of classes for that dataset.

    num_classes = 0

    ############# Your code here ############
    ## (~1 line of code)
    ## Note
    ## 1. Colab autocomplete functionality might be useful.
    num_classes = pyg_dataset.num_classes

    #########################################

    return num_classes

def get_num_features(pyg_dataset):
    # TODO: Implement this function that takes a PyG dataset object
    # and return the number of features for that dataset.

    num_features = 0

    ############# Your code here ############
    ## (~1 line of code)
    ## Note
    ## 1. Colab autocomplete functionality might be useful.
    num_features = pyg_dataset.num_features
    
    #########################################

    return num_features

# You may find that some information needs to be stored at the dataset level,
# specifically when there are multiple graphs in the dataset

num_classes = get_num_classes(pyg_dataset)
num_features = get_num_features(pyg_dataset)
print("{} dataset has {} classes".format(name, num_classes))
print("{} dataset has {} features".format(name, num_features))

ENZYMES dataset has 6 classes
ENZYMES dataset has 3 features


## PyG Data

Each PyG dataset usually stores a list of `torch_geometric.data.Data` objects. Each `torch_geometric.data.Data` object usually represents one graph. You can get a `Data` object by indexing into the dataset.

For more information, such as what is stored in a `Data` object, please refer to the [documentation](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data).
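To make the storage format concrete, here is a minimal plain-Python sketch (not PyG code) of the COO-style layout used by the `edge_index` field of a `Data` object: a `[2, num_edges]` array of source and target node indices, where an undirected edge is stored once per direction. The helper name `to_edge_index` is hypothetical.

```python
def to_edge_index(undirected_edges):
    """Build a [2, num_edges] COO edge list, storing each undirected edge in both directions."""
    src, dst = [], []
    for u, v in undirected_edges:
        src += [u, v]
        dst += [v, u]
    return [src, dst]

# A triangle on nodes 0, 1, 2
edge_index = to_edge_index([(0, 1), (1, 2), (0, 2)])
print(edge_index)
print(len(edge_index[0]))  # number of stored (directed) edges
```

Three undirected edges become six stored entries, which is why the edge count printed for a `Data` object is usually twice the number of undirected edges.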

## Question 2: What is the label of the graph (index 100 in the ENZYMES dataset)? (5 points)

In [6]:
def get_graph_class(pyg_dataset, idx):
  # TODO: Implement this function that takes a PyG dataset object,
  # the index of the graph in dataset, and returns the class/label 
  # of the graph (in integer).

  label = -1

  ############# Your code here ############
  ## (~1 line of code)
  # label = pyg_dataset.data.y[idx]
  # .item() converts the single-element label tensor to a Python int
  label = pyg_dataset[idx].y.item()

  #########################################

  return label

# Here pyg_dataset is a dataset for graph classification
graph_0 = pyg_dataset[0]
print(graph_0)
idx = 100
label = get_graph_class(pyg_dataset, idx)
print('Graph with index {} has label {}'.format(idx, label))

Data(edge_index=[2, 168], x=[37, 3], y=[1])
Graph with index 100 has label 4


## Question 3: What is the number of edges for the graph (index 200 in the ENZYMES dataset)? (5 points)

In [7]:
def get_graph_num_edges(pyg_dataset, idx):
    # TODO: Implement this function that takes a PyG dataset object,
    # the index of the graph in dataset, and returns the number of 
    # edges in the graph (in integer). You should not count an edge 
    # twice if the graph is undirected. For example, in an undirected 
    # graph G, if two nodes v and u are connected by an edge, this edge
    # should only be counted once.

    num_edges = 0

    ############# Your code here ############
    ## Note:
    ## 1. You can't return the data.num_edges directly
    ## 2. We assume the graph is undirected
    ## (~4 lines of code)
    # Each undirected edge is stored twice in edge_index (once per direction)
    num_edges = pyg_dataset[idx].num_edges // 2
    
    #########################################

    return num_edges

idx = 200
num_edges = get_graph_num_edges(pyg_dataset, idx)
print('Graph with index {} has {} edges'.format(idx, num_edges))

Graph with index 200 has 53 edges
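Halving `num_edges` relies on every edge appearing in both directions. A sketch that does not depend on that assumption counts unique unordered node pairs instead. It is written in plain Python on a nested list; with a real `Data` object one would first convert `edge_index` to a list (for example via `edge_index.tolist()`). The function name is hypothetical.

```python
def count_undirected_edges(edge_index):
    """Count unique unordered node pairs in a [2, num_edges] edge list."""
    src, dst = edge_index
    unique_pairs = {(min(u, v), max(u, v)) for u, v in zip(src, dst)}
    return len(unique_pairs)

# Edges 0-1 and 1-2 appear in both directions, edge 3-0 in only one
edge_index = [[0, 1, 1, 2, 3], [1, 0, 2, 1, 0]]
print(count_undirected_edges(edge_index))
```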


# 2 Open Graph Benchmark (OGB)

The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. Its datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can also be evaluated by using the OGB Evaluator in a unified manner.

## Dataset and Data

OGB also supports the PyG dataset and data classes. Here we take a look at the `ogbn-arxiv` dataset.

In [8]:
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset

dataset_name = 'ogbn-arxiv'
# Load the dataset and transform it to sparse tensor
dataset = PygNodePropPredDataset(name=dataset_name,
                                 transform=T.ToSparseTensor())
print('The {} dataset has {} graph'.format(dataset_name, len(dataset)))

# Extract the graph
data = dataset[0]
print(data)

Downloading http://snap.stanford.edu/ogb/data/nodeproppred/arxiv.zip
Extracting dataset/arxiv.zip
Processing...
Loading necessary files...
This might take a while.
Processing graphs...
Converting graphs into PyG objects...
Saving...
Done!
The ogbn-arxiv dataset has 1 graph
Data(adj_t=[169343, 169343, nnz=1166243], node_year=[169343, 1], x=[169343, 128], y=[169343, 1])


## Question 4: What is the number of features in the ogbn-arxiv graph? (5 points)

In [9]:
def graph_num_features(data):
  # TODO: Implement this function that takes a PyG data object,
  # and returns the number of features in the graph (in integer).

  num_features = 0

  ############# Your code here ############
  ## (~1 line of code)
  num_features = data.num_features

  #########################################

  return num_features

num_features = graph_num_features(data)
print('The graph has {} features'.format(num_features))

The graph has 128 features


# 3 GNN: Node Property Prediction

In this section we will build our first graph neural network by using PyTorch Geometric and apply it on node property prediction (node classification).

We will build the graph neural network using the GCN operator ([Kipf et al. (2017)](https://arxiv.org/pdf/1609.02907.pdf)).

You should use the PyG built-in `GCNConv` layer directly. 
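For intuition about what a GCN layer aggregates, the sketch below applies the symmetric normalization from the Kipf and Welling paper, `D^-1/2 (A + I) D^-1/2 X`, to a tiny graph in plain Python. It omits the learnable weight matrix and bias, and the function name `gcn_aggregate` is hypothetical, so treat it as an illustration of the propagation rule rather than a description of PyG's implementation.

```python
import math

def gcn_aggregate(num_nodes, undirected_edges, x):
    """Compute D^-1/2 (A + I) D^-1/2 x for scalar node features x."""
    # Neighbor sets with a self-loop added to every node
    neighbors = {i: {i} for i in range(num_nodes)}
    for u, v in undirected_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Degree counted after adding self-loops
    deg = {i: len(neighbors[i]) for i in range(num_nodes)}
    out = []
    for i in range(num_nodes):
        total = sum(x[j] / math.sqrt(deg[i] * deg[j]) for j in neighbors[i])
        out.append(total)
    return out

# Two connected nodes with features 1.0 and 3.0
out = gcn_aggregate(2, [(0, 1)], [1.0, 3.0])
print(out)
```

With self-loops both nodes have degree 2, so each output is the average of the two features.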

## Setup

In [10]:
import torch
import torch.nn.functional as F
print(torch.__version__)

# The PyG built-in GCNConv
from torch_geometric.nn import GCNConv

import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator

1.8.1+cu101


### Wandb setup

In [11]:
# install wandb
!pip install wandb -q


In [12]:
import wandb

In [13]:
!wandb login

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [14]:
wandb.init(project="GCN_tutorial",
           magic=True)
wandb.watch_called = False

wandb: Currently logged in as: hoyoon (use `wandb login --relogin` to force relogin)

## Load and Preprocess the Dataset

Information on the data split: https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv

In [15]:
dataset_name = 'ogbn-arxiv'
dataset = PygNodePropPredDataset(name=dataset_name,
                                 transform=T.ToSparseTensor())
data = dataset[0]

# Make the adjacency matrix symmetric
data.adj_t = data.adj_t.to_symmetric()

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# If you use GPU, the device should be cuda
print('Device: {}'.format(device))

data = data.to(device)  # Move the data to the selected device (the GPU when available)

split_idx = dataset.get_idx_split()
train_idx = split_idx['train'].to(device)

Device: cuda


In [16]:
data.adj_t

SparseTensor(row=tensor([     0,      0,      0,  ..., 169341, 169342, 169342], device='cuda:0'),
             col=tensor([   411,    640,   1162,  ..., 163274,  27824, 158981], device='cuda:0'),
             size=(169343, 169343), nnz=2315598, density=0.01%)
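The jump in `nnz` from 1166243 to 2315598 comes from adding the reverse of every directed citation edge. A plain-Python sketch of that symmetrization on a set of edges (an illustration of the idea, not the sparse-tensor implementation):

```python
def to_symmetric(edges):
    """Return the edge set together with all reversed edges."""
    return set(edges) | {(v, u) for u, v in edges}

# (0, 1) and (1, 0) are already reciprocated; the other two edges are not
directed = {(0, 1), (1, 0), (1, 2), (3, 2)}
symmetric = to_symmetric(directed)
print(len(directed), len(symmetric))
```

The result is smaller than twice the input whenever some edges are already reciprocated, which is consistent with 2315598 being slightly less than 2 × 1166243 = 2332486.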

## GCN Model

Now we will implement our GCN model!

Please follow the figure below to implement your `forward` function.


![test](https://drive.google.com/uc?id=128AuYAXNXGg7PIhJJ7e420DoPWKb-RtL)

In [17]:
# Exploring nll_loss
input = torch.randn(3, 5, requires_grad=True)  # 3 samples, 5 classes
target = torch.tensor([1, 0, 4])               # one class index per sample

log_probs = F.log_softmax(input, dim=1)
print(log_probs)

# nll_loss expects log-probabilities of shape (N, C) and targets of shape (N,)
output = F.nll_loss(log_probs, target)
print(output)
output.backward()
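`F.nll_loss` expects log-probabilities, which is why the model ends in a log-softmax layer. The plain-Python sketch below (helper names are hypothetical) computes log-softmax and the mean negative log-likelihood by hand for class-index targets.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def nll_loss(log_probs, targets):
    """Mean of -log p(target) over all samples."""
    return -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)

logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
log_probs = [log_softmax(row) for row in logits]
print(nll_loss(log_probs, targets))
```

Feeding plain softmax probabilities into the same function would not give a proper cross-entropy, since the inputs would no longer be logarithms.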

In [18]:
class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers,
                 dropout, loss_fn, return_embeds=False):
        # TODO: Implement this function that initializes self.convs, 
        # self.bns, and self.softmax.
        super(GCN, self).__init__()

        ############# Your code here ############
        ## Note:
        ## 1. You should use torch.nn.ModuleList for self.convs and self.bns
        ## 2. self.convs has num_layers GCNConv layers
        ## 3. self.bns has num_layers - 1 BatchNorm1d layers
        ## 4. You should use torch.nn.LogSoftmax for self.softmax
        ## 5. The parameters you can set for GCNConv include 'in_channels' and 
        ## 'out_channels'. More information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv
        ## 6. The only parameter you need to set for BatchNorm1d is 'num_features'
        ## More information please refer to the documentation: 
        ## https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html
        ## (~10 lines of code)

        self.num_layers = num_layers

        # A list of GCNConv layers
        self.convs = torch.nn.ModuleList()
        for i in range(num_layers):
            if i == 0:
                self.convs.append(GCNConv(in_channels=input_dim,
                                          out_channels=hidden_dim))
            elif i == num_layers-1:
                self.convs.append(GCNConv(in_channels=hidden_dim,
                                          out_channels=output_dim))
            else:
                self.convs.append(GCNConv(in_channels=hidden_dim,
                                          out_channels=hidden_dim))

        # A list of 1D batch normalization layers
        self.bns = torch.nn.ModuleList()
        for i in range(num_layers-1):
            self.bns.append(torch.nn.BatchNorm1d(num_features=hidden_dim))

        # The log softmax layer
        if loss_fn == "LogSoftmax":
            self.softmax = torch.nn.LogSoftmax()
        elif loss_fn == "Softmax":
            self.softmax = torch.nn.Softmax()

        # Probability of an element to be zeroed
        self.dropout = dropout

        # Skip classification layer and return node embeddings
        self.return_embeds = return_embeds
        #########################################

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.bns:
            bn.reset_parameters()

    def forward(self, x, adj_t):
        # TODO: Implement this function that takes the feature tensor x,
        # edge_index tensor adj_t and returns the output tensor as
        # shown in the figure.

        out = None
        ############# Your code here ############
        ## Note:
        ## 1. Construct the network as showing in the figure
        ## 2. torch.nn.functional.relu and torch.nn.functional.dropout are useful
        ## More information please refer to the documentation:
        ## https://pytorch.org/docs/stable/nn.functional.html
        ## 3. Don't forget to set F.dropout training to self.training
        ## 4. If return_embeds is True, then skip the last softmax layer
        ## (~7 lines of code)
        for i in range(self.num_layers-1):
            x = self.convs[i](x, adj_t)
            x = self.bns[i](x)
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)

        # Last layer: no batch norm, ReLU, or dropout. Indexing with -1
        # (rather than i+1) also works when num_layers == 1.
        x = self.convs[-1](x, adj_t)
        if self.return_embeds:
            out = x
        else:
            out = self.softmax(x)
            # out = F.log_softmax(x)
        #########################################

        return out

In [19]:
def train(model, data, train_idx, optimizer, loss_fn):
    # TODO: Implement this function that trains the model by 
    # using the given optimizer and loss_fn.
    model.train()
    loss = 0

    ############# Your code here ############
    ## Note:
    ## 1. Zero grad the optimizer
    ## 2. Feed the data into the model
    ## 3. Slicing the model output and label by train_idx
    ## 4. Feed the sliced output and label to loss_fn
    ## (~4 lines of code)
    optimizer.zero_grad()
    out = model(data.x, data.adj_t)
    print(f"model output shape : {out.shape}")
    print(f"train_idx : {len(train_idx)}")

    # Shapes are (batch, prediction vector) and (batch,)
    output, label = out[train_idx], torch.reshape(data.y[train_idx], (-1,))

    print(f"Get loss by using only {output.shape}")

    loss = loss_fn(output, label)
    #########################################

    loss.backward()
    optimizer.step()

    # Log the values we care about
    wandb.log({'Loss' : loss.item()})

    return loss.item()

In [20]:
# Test function here
@torch.no_grad()
def test(model, data, split_idx, evaluator):
    # TODO: Implement this function that tests the model by 
    # using the given split_idx and evaluator.
    model.eval()

    # The output of model on all data
    out = None

    ############# Your code here ############
    ## (~1 line of code)
    ## Note:
    ## 1. No index slicing here
    out = model(data.x, data.adj_t)
    #########################################

    y_pred = out.argmax(dim=-1, keepdim=True)

    train_acc = evaluator.eval({
        'y_true': data.y[split_idx['train']],
        'y_pred': y_pred[split_idx['train']],
    })['acc']
    valid_acc = evaluator.eval({
        'y_true': data.y[split_idx['valid']],
        'y_pred': y_pred[split_idx['valid']],
    })['acc']
    test_acc = evaluator.eval({
        'y_true': data.y[split_idx['test']],
        'y_pred': y_pred[split_idx['test']],
    })['acc']

    # Log the values we care about
    wandb.log({
        'train_acc' : train_acc,
        'valid_acc' : valid_acc,
        'test_acc' : test_acc
    })

    return train_acc, valid_acc, test_acc
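The accuracy used for this task can be understood as the fraction of nodes whose arg-max prediction matches the label. A plain-Python sketch of that computation (not the OGB `Evaluator` code; function names and scores are hypothetical):

```python
def argmax(scores):
    """Index of the largest score in one row."""
    return max(range(len(scores)), key=lambda k: scores[k])

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy log-probabilities for three nodes and two classes
out = [[-0.1, -2.3], [-1.6, -0.2], [-0.7, -0.69]]
y_pred = [argmax(row) for row in out]
y_true = [0, 1, 0]
print(y_pred, accuracy(y_true, y_pred))
```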

In [21]:
# wandb config
config = wandb.config
config.device = device
config.num_layers = 3
config.hidden_dim = 256
config.dropout = 0.5
config.lr = 0.01
config.epochs = 100
config.loss_fn = 'LogSoftmax'

In [22]:
# # Please do not change the args
# args = {
#     'device': device,
#     'num_layers': 3,
#     'hidden_dim': 256,
#     'dropout': 0.5,
#     'lr': 0.01,
#     'epochs': 3,
# }
# args

In [23]:
model = GCN(data.num_features, config.hidden_dim,
            dataset.num_classes, config.num_layers,
            config.dropout, config.loss_fn).to(device)
            
evaluator = Evaluator(name='ogbn-arxiv')

model

GCN(
  (convs): ModuleList(
    (0): GCNConv(128, 256)
    (1): GCNConv(256, 256)
    (2): GCNConv(256, 40)
  )
  (bns): ModuleList(
    (0): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (softmax): LogSoftmax(dim=None)
)

In [24]:
# Log to wandb
wandb.watch(model, log='all')

[<wandb.wandb_torch.TorchGraph at 0x7f90358b5290>]

In [25]:
import copy

def train_pipe():
    loss_fn = F.nll_loss
    best_model = None
    best_valid_acc = 0

    with wandb.init(config=config):
        # Read hyperparameters from the active run so that sweep values
        # override the defaults set above
        cfg = wandb.config

        model = GCN(data.num_features, cfg.hidden_dim,
            dataset.num_classes, cfg.num_layers,
            cfg.dropout, cfg.loss_fn).to(device)
        
        # Reset the parameters to their initial random values
        model.reset_parameters()
        optimizer = torch.optim.Adam(model.parameters(), lr=cfg.lr)
        
        for epoch in range(1, 1 + cfg.epochs):
            loss = train(model, data, train_idx, optimizer, loss_fn)
            result = test(model, data, split_idx, evaluator)
            train_acc, valid_acc, test_acc = result
            if valid_acc > best_valid_acc:
                best_valid_acc = valid_acc
                best_model = copy.deepcopy(model)
            print(f'Epoch: {epoch:02d}, '
                    f'Loss: {loss:.4f}, '
                    f'Train: {100 * train_acc:.2f}%, '
                    f'Valid: {100 * valid_acc:.2f}% '
                    f'Test: {100 * test_acc:.2f}%')
        
        best_result = test(best_model, data, split_idx, evaluator)
        train_acc, valid_acc, test_acc = best_result

        wandb.log({'Best Model Train_acc(%)' : 100 * train_acc,
                   'Best Model valid acc(%)' : 100 * valid_acc,
                   'Best Model Test acc(%)' : 100 * test_acc})
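The selection rule in the loop keeps the model from the epoch with the highest validation accuracy and reports the test accuracy of that model, rather than the best test accuracy seen. A plain-Python sketch with made-up accuracy values (illustrative numbers only, not results from this notebook; the function name is hypothetical):

```python
def select_best_epoch(history):
    """history: list of (valid_acc, test_acc) per epoch. Returns (epoch, valid_acc, test_acc)."""
    best = None
    for epoch, (valid_acc, test_acc) in enumerate(history, start=1):
        # Strict comparison: on ties the earlier epoch is kept
        if best is None or valid_acc > best[1]:
            best = (epoch, valid_acc, test_acc)
    return best

history = [(0.60, 0.58), (0.70, 0.66), (0.69, 0.68)]
print(select_best_epoch(history))
```

Epoch 3 has the higher test accuracy, but epoch 2 is selected because only the validation score may guide the choice.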


In [27]:
# best_result = test(best_model, data, split_idx, evaluator)
# train_acc, valid_acc, test_acc = best_result
# print(f'Best model: '
#       f'Train: {100 * train_acc:.2f}%, '
#       f'Valid: {100 * valid_acc:.2f}% '
#       f'Test: {100 * test_acc:.2f}%')

In [28]:
sweep_config = {
    'method': 'grid',
    'metric' : {
        'name' : 'Loss',
        'goal' : 'minimize'
    },
    'parameters': {
        'num_layers': {
            'values': [3, 6, 9]
        },
        'hidden_dim' : {
            'values' : [64, 256, 512]
        },
        'dropout' : {
            'values' : [0.1, 0.5, 0.7]
        },
        'lr' : {
            'values' : [0.1, 0.01, 0.001]
        },
        'loss_fn' : {
            'values' : ["LogSoftmax", "Softmax"]
        }
    }
}
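A grid sweep runs every combination of the listed values, so it helps to count the runs before launching the agent. A small sketch using `itertools.product` on a copy of the parameter grid above:

```python
from itertools import product

parameters = {
    'num_layers': [3, 6, 9],
    'hidden_dim': [64, 256, 512],
    'dropout': [0.1, 0.5, 0.7],
    'lr': [0.1, 0.01, 0.001],
    'loss_fn': ['LogSoftmax', 'Softmax'],
}

# One tuple per hyperparameter combination in the grid
combinations = list(product(*parameters.values()))
print(len(combinations))
```

Each combination corresponds to one full call to `train_pipe`.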


In [29]:
sweep_id = wandb.sweep(sweep_config, project='GCN_tutorial')



Create sweep with ID: xfxygcc7
Sweep URL: https://wandb.ai/hoyoon/GCN_tutorial/sweeps/xfxygcc7


In [30]:
wandb.agent(sweep_id, function=train_pipe)


## Question 5: What are your `best_model` validation and test accuracy? Please report them on Gradescope. For example, for an accuracy such as 50.01%, just report 50.01 and please don't include the percent sign. (20 points).  
  
The answer is 71.7

# 4 GNN: Graph Property Prediction

In this section we will create a graph neural network for graph property prediction (graph classification).


## Load and preprocess the dataset

In [31]:
from ogb.graphproppred import PygGraphPropPredDataset, Evaluator
from torch_geometric.data import DataLoader
from tqdm.notebook import tqdm

# Load the dataset 
dataset = PygGraphPropPredDataset(name='ogbg-molhiv')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device: {}'.format(device))

split_idx = dataset.get_idx_split()

# Check task type
print('Task type: {}'.format(dataset.task_type))

Downloading http://snap.stanford.edu/ogb/data/graphproppred/csv_mol_download/hiv.zip
Extracting dataset/hiv.zip
Processing...
Loading necessary files...
This might take a while.
Processing graphs...
Converting graphs into PyG objects...
Saving...
Done!
Device: cuda
Task type: binary classification


In [32]:
dataset[0]

Data(edge_attr=[40, 3], edge_index=[2, 40], x=[19, 9], y=[1, 1])

In [33]:
dataset.meta_info

num tasks                                                                1
eval metric                                                         rocauc
download_name                                                          hiv
version                                                                  1
url                      http://snap.stanford.edu/ogb/data/graphproppre...
add_inverse_edge                                                      True
data type                                                              mol
has_node_attr                                                         True
has_edge_attr                                                         True
task type                                            binary classification
num classes                                                              2
split                                                             scaffold
additional node files                                                 None
additional edge files    

In [34]:
dataset.get_idx_split()

{'test': tensor([    0,     1,     2,  ..., 10122, 10124, 10125]),
 'train': tensor([    3,     4,     5,  ..., 41124, 41125, 41126]),
 'valid': tensor([10127, 10129, 10132,  ..., 22785, 22786, 22788])}

In [35]:
# Train/valid/test split sizes
print(f"Train data size : {len(split_idx['train'])}\n\
Valid data size :  {len(split_idx['valid'])}\n\
Test data size :   {len(split_idx['test'])}")

Train data size : 32901
Valid data size :  4113
Test data size :   4113


In [36]:
# Load the data sets into dataloader
# We will train the graph classification task on a batch of 32 graphs
# Shuffle the order of graphs for training set
train_loader = DataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True, num_workers=0)
valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=32, shuffle=False, num_workers=0)
test_loader = DataLoader(dataset[split_idx["test"]], batch_size=32, shuffle=False, num_workers=0)
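When several graphs are collated into one mini-batch, PyG treats them as one large disconnected graph: node features are concatenated, edge indices are shifted by the number of preceding nodes, and a `batch` vector records which graph each node belongs to. A plain-Python sketch of that collation (the function `collate_graphs` is hypothetical, not the `DataLoader` implementation):

```python
def collate_graphs(graphs):
    """graphs: list of (num_nodes, edge_index) with edge_index = [src_list, dst_list]."""
    src, dst, batch = [], [], []
    offset = 0
    for graph_id, (num_nodes, edge_index) in enumerate(graphs):
        # Shift node indices so they stay unique within the batch
        src += [u + offset for u in edge_index[0]]
        dst += [v + offset for v in edge_index[1]]
        batch += [graph_id] * num_nodes
        offset += num_nodes
    return [src, dst], batch

g0 = (2, [[0, 1], [1, 0]])
g1 = (3, [[0, 1, 1, 2], [1, 0, 2, 1]])
edge_index, batch = collate_graphs([g0, g1])
print(edge_index)
print(batch)
```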

In [37]:
# Please do not change the args
args = {
    'device': device,
    'num_layers': 5,
    'hidden_dim': 256,
    'dropout': 0.5,
    'lr': 0.001,
    'epochs': 30,
}
args

{'device': 'cuda',
 'dropout': 0.5,
 'epochs': 30,
 'hidden_dim': 256,
 'lr': 0.001,
 'num_layers': 5}

## Graph Prediction Model

Now we will implement our GCN Graph Prediction model!

We will reuse the existing GCN model to generate `node_embeddings` and apply global pooling over the nodes to predict properties of the whole graph.
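Global mean pooling averages the node embeddings of each graph in a batch, using a per-node graph id vector to group nodes. A plain-Python sketch of the operation (an illustration of what `global_mean_pool` computes, with a hypothetical function name and toy values):

```python
def mean_pool(x, batch):
    """x: list of node embedding vectors; batch: graph id for each node."""
    num_graphs = max(batch) + 1
    dim = len(x[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for vec, g in zip(x, batch):
        counts[g] += 1
        for k, value in enumerate(vec):
            sums[g][k] += value
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]

# Two nodes in graph 0, one node in graph 1
x = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
batch = [0, 0, 1]
print(mean_pool(x, batch))
```

Skipping the division gives sum pooling, and taking an element-wise maximum instead gives max pooling.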

In [38]:
from ogb.graphproppred.mol_encoder import AtomEncoder
from torch_geometric.nn import global_add_pool, global_mean_pool

### GCN to predict graph property
class GCN_Graph(torch.nn.Module):
    def __init__(self, hidden_dim, output_dim, num_layers, dropout):
        super(GCN_Graph, self).__init__()

        # Load encoders for Atoms in molecule graphs
        self.node_encoder = AtomEncoder(hidden_dim)

        # Node embedding model
        # Note that the input_dim and output_dim are set to hidden_dim
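        # loss_fn is passed only to satisfy the GCN signature; with
        # return_embeds=True the softmax layer is skipped
        self.gnn_node = GCN(hidden_dim, hidden_dim,
            hidden_dim, num_layers, dropout, loss_fn="LogSoftmax",
            return_embeds=True)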

        self.pool = None

        ############# Your code here ############
        ## Note:
        ## 1. Initialize the self.pool to global mean pooling layer
        ## More information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#global-pooling-layers
        ## (~1 line of code)
        self.pool = global_mean_pool

        #########################################

        # Output layer
        self.linear = torch.nn.Linear(hidden_dim, output_dim)


    def reset_parameters(self):
      self.gnn_node.reset_parameters()
      self.linear.reset_parameters()

    def forward(self, batched_data):
        # TODO: Implement this function that takes the input tensor batched_data,
        # returns a batched output tensor for each graph.
        x, edge_index, batch = batched_data.x, batched_data.edge_index, batched_data.batch
        embed = self.node_encoder(x)

        out = None

        ############# Your code here ############
        ## Note:
        ## 1. Construct node embeddings using existing GCN model
        ## 2. Use global pooling layer to construct features for the whole graph
        ## More information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#global-pooling-layers
        ## 3. Use a linear layer to predict the graph property 
        ## (~3 lines of code)
        x = self.gnn_node(embed, edge_index)
        x = self.pool(x, batch)
        out = self.linear(x)


        #########################################

        return out

In [39]:
def train(model, device, data_loader, optimizer, loss_fn):
    # TODO: Implement this function that trains the model by 
    # using the given optimizer and loss_fn.
    model.train()
    loss = 0

    for step, batch in enumerate(tqdm(data_loader, desc="Iteration")):
      batch = batch.to(device)

      if batch.x.shape[0] == 1 or batch.batch[-1] == 0:
          pass
      else:
        ## ignore nan targets (unlabeled) when computing training loss.
        is_labeled = batch.y == batch.y

        ############# Your code here ############
        ## Note:
        ## 1. Zero grad the optimizer
        ## 2. Feed the data into the model
        ## 3. Use `is_labeled` mask to filter output and labels
        ## 4. You might change the type of label
        ## 5. Feed the output and label to loss_fn
        ## (~3 lines of code)
        optimizer.zero_grad()
        out = model(batch)
        # Cast the labels to float32 to match the model output dtype
        loss = loss_fn(out[is_labeled], batch.y[is_labeled].to(torch.float32))
        #########################################

        loss.backward()
        optimizer.step()

    return loss.item()
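The `batch.y == batch.y` comparison works as a label mask because NaN is the only floating-point value that is not equal to itself. A plain-Python sketch of the same trick:

```python
labels = [1.0, float('nan'), 0.0]

# NaN != NaN, so unlabeled entries come out as False
is_labeled = [y == y for y in labels]
print(is_labeled)

kept = [y for y, keep in zip(labels, is_labeled) if keep]
print(kept)
```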

In [40]:
# The evaluation function
def eval(model, device, loader, evaluator):
    model.eval()
    y_true = []
    y_pred = []

    for step, batch in enumerate(tqdm(loader, desc="Iteration")):
        batch = batch.to(device)

        if batch.x.shape[0] == 1:
            pass
        else:
            with torch.no_grad():
                pred = model(batch)

            y_true.append(batch.y.view(pred.shape).detach().cpu())
            y_pred.append(pred.detach().cpu())

    y_true = torch.cat(y_true, dim = 0).numpy()
    y_pred = torch.cat(y_pred, dim = 0).numpy()

    input_dict = {"y_true": y_true, "y_pred": y_pred}

    return evaluator.eval(input_dict)

In [41]:
model = GCN_Graph(args['hidden_dim'],
            dataset.num_tasks, args['num_layers'],
            args['dropout']).to(device)
evaluator = Evaluator(name='ogbg-molhiv')


In [None]:
import copy

model.reset_parameters()

optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
loss_fn = torch.nn.BCEWithLogitsLoss()

best_model = None
best_valid_acc = 0

for epoch in range(1, 1 + args["epochs"]):
  print('Training...')
  loss = train(model, device, train_loader, optimizer, loss_fn)

  print('Evaluating...')
  train_result = eval(model, device, train_loader, evaluator)
  val_result = eval(model, device, valid_loader, evaluator)
  test_result = eval(model, device, test_loader, evaluator)

  train_acc, valid_acc, test_acc = train_result[dataset.eval_metric], val_result[dataset.eval_metric], test_result[dataset.eval_metric]
  if valid_acc > best_valid_acc:
      best_valid_acc = valid_acc
      best_model = copy.deepcopy(model)
  print(f'Epoch: {epoch:02d}, '
        f'Loss: {loss:.4f}, '
        f'Train: {100 * train_acc:.2f}%, '
        f'Valid: {100 * valid_acc:.2f}% '
        f'Test: {100 * test_acc:.2f}%')

In [None]:
train_acc = eval(best_model, device, train_loader, evaluator)[dataset.eval_metric]
valid_acc = eval(best_model, device, valid_loader, evaluator)[dataset.eval_metric]
test_acc = eval(best_model, device, test_loader, evaluator)[dataset.eval_metric]

print(f'Best model: '
      f'Train: {100 * train_acc:.2f}%, '
      f'Valid: {100 * valid_acc:.2f}% '
      f'Test: {100 * test_acc:.2f}%')

In [None]:
def rocauc(model, data_loader):
    from sklearn.metrics import roc_auc_score

    model.eval()
    y_true = []
    y_pred = []

    for step, batch in enumerate(tqdm(data_loader, desc="Iteration")):
        batch = batch.to(device)

        if batch.x.shape[0] == 1:
            pass
        else:
            with torch.no_grad():
                pred = model(batch)

            y_true.append(batch.y.view(pred.shape).detach().cpu())
            y_pred.append(pred.detach().cpu())

    y_true = torch.cat(y_true, dim = 0).numpy()
    y_pred = torch.cat(y_pred, dim = 0).numpy()

    return roc_auc_score(y_true, y_pred)

rocauc(best_model, test_loader)
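ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting one half. A plain-Python sketch of that pairwise definition (quadratic in the number of examples, so only for illustration; the function name is hypothetical and `roc_auc_score` remains what the cell above uses):

```python
def pairwise_roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(pairwise_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Three of the four positive-negative pairs are ordered correctly here, so the score is 0.75.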

## Question 6: What are your `best_model` validation and test ROC-AUC score? Please report them on Gradescope. For example, for an ROC-AUC score such as 50.01%, just report 50.01 and please don't include the percent sign. (20 points)  
  
> Measuring ROC-AUC with the best_model obtained after training gives approximately 0.75.

## Question 7 (Optional): Experiment with two other global pooling layers besides mean pooling in PyTorch Geometric.

# Submission

In order to get credit, you must submit your answers on Gradescope.

Also, you need to submit the `ipynb` file of Colab 2, by clicking `File` and `Download .ipynb`. Please make sure that your output of each cell is available in your `ipynb` file.

In [None]:
while True:pass