<a href="https://colab.research.google.com/github/huangtinglin/test_colab/blob/main/CPSC483_colab1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Graph neural network basics

In this Colab, we introduce the basics of graph neural networks (GNNs) and build a node classification pipeline with PyTorch Geometric (PyG). For more background, see the [PyG documentation](https://pytorch-geometric.readthedocs.io/en/latest/).




## Outline



- Basic operation of PyG
- Build a GNN by PyG
- Training and testing

## Basic operation of PyG

In [12]:
# import the pytorch library into environment and check its version
import os
import torch
print("Using torch", torch.__version__)

Using torch 1.12.1+cu113


Let's start by installing PyG with `pip`. The PyG wheels must match the installed versions of PyTorch and CUDA. Here we follow the PyG [installation instructions](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html):

In [None]:
!pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.12.0+cu113.html
!pip install ogb  # for datasets
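To make the version matching concrete, here is a minimal sketch that derives the `-f` wheel index URL from a PyTorch version string. The helper name `pyg_wheel_index_url` is hypothetical, and the sketch assumes the index page is named after the `x.y.0` release, as in the command above (PyTorch 1.12.1 is installed, while the URL says `torch-1.12.0`). Check the PyG installation page before relying on it for other versions.

```python
def pyg_wheel_index_url(torch_version):
    """Build a PyG wheel index URL from a string like '1.12.1+cu113'.

    Assumption: the index is keyed by major.minor with the patch set to 0,
    mirroring the install command used in this notebook.
    """
    base, _, build_tag = torch_version.partition("+")
    major, minor = base.split(".")[:2]
    build_tag = build_tag or "cpu"  # no '+cuXXX' suffix: assume a CPU-only build
    return f"https://data.pyg.org/whl/torch-{major}.{minor}.0+{build_tag}.html"


print(pyg_wheel_index_url("1.12.1+cu113"))
```

In the notebook you would pass `torch.__version__` instead of a hard-coded string.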

### Create a Graph

A single graph in PyG is described by an instance of `torch_geometric.data.Data`, which holds a few important attributes by default, such as `edge_index`. With PyG we can easily create graphs with any number of nodes and edges. Take the following graph as an example:

![](https://github.com/huangtinglin/test_colab/blob/main/graph_example.png?raw=1)


In [9]:
# import torch_geometric.data into environment
from torch_geometric.data import Data

This graph has 3 nodes and 3 undirected edges. PyG stores each undirected edge as two directed edges, one per direction, so the edge index has 6 columns:

In [13]:
edge_index = torch.tensor([[0, 1, 1, 2, 0, 2],
                           [1, 0, 2, 1, 2, 0]], dtype=torch.long)
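As a minimal sketch of where those 6 columns come from, the snippet below starts from a list that names each undirected edge once and appends the reversed pairs using plain PyTorch. The variable names are illustrative only, and the column order differs from the tensor above, although the set of edges is the same.

```python
import torch

# Each undirected edge is listed once (illustrative variable name).
undirected_edges = [(0, 1), (1, 2), (0, 2)]

# Shape [2, 3]: row 0 holds source nodes, row 1 holds target nodes.
pairs = torch.tensor(undirected_edges, dtype=torch.long).t()

# Append the reversed pairs so that every edge appears in both directions.
both_directions = torch.cat([pairs, pairs.flip(0)], dim=1)

print(both_directions.shape)
```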

Each node can also carry a feature vector that describes its properties:

In [15]:
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

Then we can define a `Data` object with edge index and node attribute:

In [16]:
data = Data(x=x, edge_index=edge_index)

A `Data` object provides many useful utility functions. For example, we can query the number of nodes and check whether the graph is directed:

In [None]:
num_nodes = data.num_nodes
print("number of nodes is:", num_nodes)

is_directed = data.is_directed()
print("graph is directed or not:", is_directed)
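For intuition, a graph stored this way is undirected when every edge also appears in the reverse direction. The sketch below checks that property with plain PyTorch and Python sets. It is an illustration of the idea only, not PyG's own implementation, and `has_all_reverse_edges` is a hypothetical helper name.

```python
import torch

edge_index = torch.tensor([[0, 1, 1, 2, 0, 2],
                           [1, 0, 2, 1, 2, 0]], dtype=torch.long)


def has_all_reverse_edges(edge_index):
    # Collect the (source, target) pairs, then look up each reversed pair.
    edges = set(map(tuple, edge_index.t().tolist()))
    return all((dst, src) in edges for src, dst in edges)


print(has_all_reverse_edges(edge_index))         # triangle graph from above
print(has_all_reverse_edges(edge_index[:, :1]))  # only the single edge 0 -> 1
```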

### Question 1 (5 points)

How many neighbors does node 0 have in the graph?

In [None]:
def get_n_neighbors(graph, idx):
  # TODO: Implement a function that takes a Data object,
  # an index of a node, and returns the number of the neighbors 
  # of this node (as an integer).

  n_neighbors = 0

  ############# Your code here ############
  ## (~1 line of code)

  #########################################

  return n_neighbors

idx = 0
n_neighbors = get_n_neighbors(data, idx)
print('Node with index {} has {} neighbors'.format(idx, n_neighbors))

PyG ships with many graph datasets of various scales. Cora is one of the most widely used datasets in graph learning, and we can load it with PyG:

In [22]:
from torch_geometric.datasets import Planetoid

dataset = Planetoid('/tmp/cora', 'cora')
data = dataset[0]

We can check the number of nodes and edges in Cora:

In [None]:
num_nodes = data.num_nodes
print('cora has {} nodes'.format(num_nodes))

num_edges = data.num_edges
print('cora has {} edges'.format(num_edges))

### Question 2 (10 points)

1. How many classes does the Cora dataset have?
2. Which node in Cora has the most neighbors?

In [None]:
def get_num_classes(data):
  # TODO: Implement a function that takes a dataset object
  # and returns the number of classes for that dataset.

  num_classes = 0

  ############# Your code here ############
  ## (~1 line of code)
  
  #########################################

  return num_classes

def get_idx_with_most_neighbors(data):
  # TODO: Implement a function that takes a dataset object
  # and returns the index of the node with the most neighbors.

  idx = -1

  ############# Your code here ############
  ## (~2 lines of code)
  
  #########################################

  return idx

num_classes = get_num_classes(data)
print("cora has {} classes".format(num_classes))

idx = get_idx_with_most_neighbors(data)
print("Node {} in cora has the most neighbors".format(idx))

## Build a GNN by PyG

In this section, we use PyG to build a classic graph neural network, the GCN ([Kipf & Welling, 2017](https://arxiv.org/pdf/1609.02907.pdf)), and then apply it to node classification on Cora.
A GCN is built by stacking graph convolution layers (`GCNConv`), each of which passes messages from a node's neighbors to that node. We can define a `GCNConv` layer with PyG:


In [33]:
from torch_geometric.nn import GCNConv

conv = GCNConv(in_channels=1433, out_channels=200, normalize=True)

`in_channels` is the dimension of each node's input feature, `out_channels` is the dimension of each node's output representation, and `normalize` controls whether to add self-loops and apply symmetric normalization to the adjacency matrix.
Node features in Cora have 1433 dimensions, so `in_channels` is set to 1433. We can perform one round of message passing on Cora like this:

In [None]:
node_feature = data.x
edge_index = data.edge_index

node_representation = conv(node_feature, edge_index)

print("dimension of node_feature:", node_feature.shape)
print("dimension of node_representation:", node_representation.shape)

We can see that `GCNConv` takes the node features and the edge index as inputs. The convolution module then performs GCN-style message passing.
Recall the MLP we built in colab0. Here we also use `nn.Module`, this time to define a GCN class built from `GCNConv` modules.
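To make the effect of `normalize=True` concrete, the sketch below reproduces the propagation step on the small 3-node graph from earlier using a dense adjacency matrix: add self-loops, then scale by $D^{-1/2}(A + I)D^{-1/2}$. This is a simplified illustration under stated assumptions, not how `GCNConv` is implemented internally. It uses dense tensors and leaves out the learnable weight matrix and bias that a real layer applies.

```python
import torch

# Triangle graph from the earlier example: 3 nodes, every pair connected.
edge_index = torch.tensor([[0, 1, 1, 2, 0, 2],
                           [1, 0, 2, 1, 2, 0]], dtype=torch.long)
x = torch.tensor([[-1.0], [0.0], [1.0]])
num_nodes = x.size(0)

# Dense adjacency matrix with self-loops: A + I.
adj = torch.zeros(num_nodes, num_nodes)
adj[edge_index[0], edge_index[1]] = 1.0
adj_hat = adj + torch.eye(num_nodes)

# Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}.
deg_inv_sqrt = adj_hat.sum(dim=1).pow(-0.5)
norm_adj = deg_inv_sqrt.unsqueeze(1) * adj_hat * deg_inv_sqrt.unsqueeze(0)

# One propagation step: each node averages over itself and its neighbors.
aggregated = norm_adj @ x

print(norm_adj)
print(aggregated)
```

In this graph every node has degree 3 after adding its self-loop, so each entry of the normalized matrix is 1/3 and every node receives the mean of the three features.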

### Question 3 (10 points)

Follow the instructions below to build a GCN class using `GCNConv` modules.


In [39]:
class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()

        # TODO: Define two GCNConv modules and a ReLU function.
        # The input size and output size of first GCNConv module should be in_channels and hidden_channels
        # The input size and output size of second GCNConv module should be hidden_channels and out_channels

        ############# Your code here ############
        ## (~3 lines of code)

        #########################################

    def forward(self, node_feature, edge_index):

        output = None

        # TODO: Use the modules you defined in __init__ to perform message passing.
        # The ReLU function should be applied between the two GCNConv modules.

        ############# Your code here ############
        ## (~3 lines of code)

        #########################################

        return output

## Training and Testing

Now we can try to 