<a href="https://colab.research.google.com/github/hli8nova/DGL-KDD20-Hands-on-Tutorial/blob/master/DGL_Node_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install required packages.
import os
import torch
os.environ['TORCH'] = torch.__version__
print(torch.__version__)

#!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}.html
#!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}.html
#!pip install -q git+https://github.com/pyg-team/pytorch_geometric.git

# Helper function for visualization.
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize(h, color):
    z = TSNE(n_components=2).fit_transform(h.detach().cpu().numpy())

    plt.figure(figsize=(10,10))
    plt.xticks([])
    plt.yticks([])

    plt.scatter(z[:, 0], z[:, 1], s=70, c=color, cmap="Set2")
    plt.show()

1.13.1+cu116


In [2]:
!pip install -q dgl -f https://data.dgl.ai/wheels/cu116/repo.html
!pip install -q dglgo -f https://data.dgl.ai/wheels-test/repo.html

# Node Classification with Graph Neural Networks

[Previous: Introduction: Hands-on Graph Neural Networks](https://colab.research.google.com/drive/1h3-vJGRVloF5zStxL5I0rSy4ZUPNsjy8)

This tutorial will teach you how to apply **Graph Neural Networks (GNNs) to the task of node classification**.
Here, we are given the ground-truth labels of only a small subset of nodes, and want to infer the labels for all the remaining nodes (*transductive learning*).

To demonstrate, we make use of the `Cora` dataset, which is a **citation network** where nodes represent documents.
Each node is described by a 1433-dimensional bag-of-words feature vector.
Two documents are connected if there exists a citation link between them.
The task is to infer the category of each document (7 in total).
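
To make the feature representation concrete, here is a minimal plain-Python sketch of a bag-of-words vector. The vocabulary and the document are made up for illustration (Cora's vocabulary has 1433 entries), and recording only word presence is an assumption about the encoding, not something the dataset description above states.

```python
def bag_of_words(text, vocabulary):
    """Return a 0/1 vector with one entry per vocabulary word (presence only)."""
    words = set(text.lower().split())
    return [1.0 if word in words else 0.0 for word in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["graph", "neural", "network", "genetic", "bayesian"]
document = "A neural network for graph data"
bow = bag_of_words(document, vocabulary)
print(bow)
```

Each node in the graph carries one such vector, and the classifier sees only these vectors plus, for a GNN, the citation links.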

This dataset was first introduced by [Yang et al. (2016)](https://arxiv.org/abs/1603.08861) as one of the datasets of the `Planetoid` benchmark suite.
We can use DGL for easy access to this dataset via [`dgl.data.CoraGraphDataset`](https://docs.dgl.ai/en/0.9.x/generated/dgl.data.CoraGraphDataset.html#dgl.data.CoraGraphDataset):

In [3]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F

In [4]:
import dgl.data

dataset = dgl.data.CoraGraphDataset()
print('Number of categories:', dataset.num_classes)

  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done loading data from cached files.
Number of categories: 7


In [5]:
print()
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
g = dataset[0]  # Get the first graph object.
print(g)

print('\nNode features:')
print(g.ndata)
print('\nEdge features:')
print(g.edata)


Dataset: Dataset("cora_v2", num_graphs=1, save_path=/root/.dgl/cora_v2):
Number of graphs: 1
Graph(num_nodes=2708, num_edges=10556,
      ndata_schemes={'feat': Scheme(shape=(1433,), dtype=torch.float32), 'label': Scheme(shape=(), dtype=torch.int64), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'train_mask': Scheme(shape=(), dtype=torch.bool)}
      edata_schemes={})

Node features:
{'feat': tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]]), 'label': tensor([3, 4, 4,  ..., 3, 3, 3]), 'test_mask': tensor([False, False, False,  ...,  True,  True,  True]), 'val_mask': tensor([False, False, False,  ..., False, False, False]), 'train_mask': tensor([ True,  True,  True,  ..., False, False, False])}

Edge features:
{}


In [6]:
features = g.ndata['feat']
number_features = features.shape[1]  # second dimension of the [num_nodes, num_features] tensor

print(f'Number of features: {number_features}')
print(f'Number of classes: {dataset.num_classes}')

print()
print('===========================================================================================================')

# Gather some statistics about the graph.
print(f'Number of nodes: {g.num_nodes()}')
print(f'Number of edges: {g.num_edges()}')
average_node_degree = g.num_edges() / g.num_nodes()
print(f'Average node degree: {average_node_degree:.2f}')

train_mask = g.ndata['train_mask']
print(f'Number of training nodes: {train_mask.sum()}')


Number of features: 1433
Number of classes: 7

Number of nodes: 2708
Number of edges: 10556
Average node degree: 3.90
Number of training nodes: 140


We can see that the `Cora` network holds 2,708 nodes and 10,556 edges, resulting in an average node degree of 3.9.
For training on this dataset, we are given the ground-truth categories of 140 nodes (20 for each class).
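
These figures follow directly from the printed statistics. The quick check below uses only numbers copied from the output above. The last line additionally assumes that the training, validation, and test masks do not overlap.

```python
num_nodes, num_edges = 2708, 10556
num_classes = 7
num_train, num_val, num_test = 140, 500, 1000

average_degree = num_edges / num_nodes
train_per_class = num_train // num_classes
labeled_fraction = num_train / num_nodes
# Assumption: the three splits are disjoint.
unused_nodes = num_nodes - (num_train + num_val + num_test)

print(f"Average node degree: {average_degree:.2f}")
print(f"Training nodes per class: {train_per_class}")
print(f"Fraction of nodes with training labels: {labeled_fraction:.1%}")
print(f"Nodes in none of the three splits: {unused_nodes}")
```

Only about 5% of the nodes carry a label that the model may learn from, which is why the setting is described as having a small labeled subset.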

The graph also holds the attributes `val_mask` and `test_mask`, which denote the nodes that should be used for validation and testing.
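
As a rough illustration of how such boolean masks are used, the sketch below selects entries with plain Python lists. The six-node example and the `select` helper are hypothetical. In the notebook the masks are boolean tensors and the same selection is written as `labels[train_mask]`.

```python
# Toy example with 6 nodes; the values are made up for illustration.
toy_labels = [3, 4, 4, 0, 2, 3]
toy_train_mask = [True, True, False, False, False, False]
toy_val_mask = [False, False, True, True, False, False]
toy_test_mask = [False, False, False, False, True, True]

def select(values, mask):
    """Keep the entries of `values` whose mask entry is True."""
    return [value for value, keep in zip(values, mask) if keep]

train_labels = select(toy_labels, toy_train_mask)
test_labels = select(toy_labels, toy_test_mask)

# In this toy split every node belongs to at most one mask.
overlaps = [sum(flags) > 1 for flags in zip(toy_train_mask, toy_val_mask, toy_test_mask)]
print(train_labels, test_labels, any(overlaps))
```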


## Training a Multi-layer Perceptron (MLP)

In theory, we should be able to infer the category of a document solely based on its content, *i.e.* its bag-of-words feature representation, without taking any relational information into account.

Let's verify that by constructing a simple MLP that solely operates on input node features (using shared weights across all nodes):

In [7]:
import torch
from torch.nn import Linear
import torch.nn.functional as F


class MLP(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        torch.manual_seed(12345)
        self.lin1 = Linear(number_features, hidden_channels)
        self.lin2 = Linear(hidden_channels, dataset.num_classes)

    def forward(self, g, in_feat):
        h = self.lin1(in_feat)
        h = h.relu()
        h = F.dropout(h, p=0.5, training=self.training)
        h = self.lin2(h)
        return h

model = MLP(hidden_channels=256)
print(model)

MLP(
  (lin1): Linear(in_features=1433, out_features=256, bias=True)
  (lin2): Linear(in_features=256, out_features=7, bias=True)
)


Our MLP is defined by two linear layers and enhanced by [ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html?highlight=relu#torch.nn.ReLU) non-linearity and [dropout](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html?highlight=dropout#torch.nn.Dropout).
Here, we first reduce the 1433-dimensional feature vector to a low-dimensional embedding (`hidden_channels=256`), while the second linear layer acts as a classifier that maps each low-dimensional node embedding to one of the 7 classes.
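
The interplay of ReLU and dropout can be sketched in plain Python. The `dropout` and `relu` functions below are simplified stand-ins written for this illustration, not the PyTorch implementations. They mirror the documented behaviour of `F.dropout`: during training each element is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`, while in evaluation mode the input passes through unchanged.

```python
import random

def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]

def dropout(values, p=0.5, training=True, rng=None):
    """Inverted dropout on a list of floats (plain-Python stand-in for F.dropout)."""
    if not training or p == 0.0:
        return list(values)
    if p >= 1.0:
        return [0.0 for _ in values]
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

hidden = relu([-1.0, 0.5, 2.0, -0.3, 1.5])
train_out = dropout(hidden, p=0.5, training=True, rng=random.Random(0))
eval_out = dropout(hidden, p=0.5, training=False)
print(hidden)
print(train_out)
print(eval_out)
```

A freshly constructed `torch.nn.Module` is in training mode, so dropout stays active until `model.eval()` is called.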

Let's train our simple MLP by following a similar procedure as described in [the first part of this tutorial](https://colab.research.google.com/drive/1h3-vJGRVloF5zStxL5I0rSy4ZUPNsjy8).
We again make use of the **cross entropy loss** and **Adam optimizer**.
This time, the `train` function also reports how well the model performs on the validation and test node sets (whose labels have not been observed during training).
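
To show what the masked cross-entropy loss computes, here is a small plain-Python sketch. The `cross_entropy` helper and the toy inputs are written for this illustration and assume that at least one node is selected by the mask. In the notebook the same quantity is obtained with `F.cross_entropy(logits[train_mask], labels[train_mask])`.

```python
import math

def cross_entropy(logits, labels, mask):
    """Mean cross-entropy over the nodes selected by `mask`."""
    losses = []
    for row, label, keep in zip(logits, labels, mask):
        if not keep:
            continue
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)

num_classes = 7
# Three toy nodes with identical logits for every class (an uninformed model).
toy_logits = [[0.0] * num_classes for _ in range(3)]
toy_labels = [3, 4, 0]
toy_train_mask = [True, True, False]

loss = cross_entropy(toy_logits, toy_labels, toy_train_mask)
print(f"{loss:.3f}")  # ln(7), the loss of a uniform prediction over 7 classes
```

A model that spreads its probability evenly over the 7 classes has a loss of ln 7 ≈ 1.946, which is a useful reference point when reading the loss at epoch 0.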

In [15]:
import math
def train(g, model, epoch = 200):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    best_val_acc = 0
    best_test_acc = 0
    print_step = max(1, epoch // 10)  # at least 1, so `e % print_step` cannot divide by zero when epoch < 10

    features = g.ndata['feat']
    labels = g.ndata['label']
    train_mask = g.ndata['train_mask']
    val_mask = g.ndata['val_mask']
    test_mask = g.ndata['test_mask']
    for e in range(epoch+1):
        # Forward
        logits = model(g, features)

        # Compute prediction
        pred = logits.argmax(1)

        # Compute loss
        # Note that you should only compute the losses of the nodes in the training set.
        loss = F.cross_entropy(logits[train_mask], labels[train_mask])

        # Compute accuracy on training/validation/test
        train_acc = (pred[train_mask] == labels[train_mask]).float().mean()
        val_acc = (pred[val_mask] == labels[val_mask]).float().mean()
        test_acc = (pred[test_mask] == labels[test_mask]).float().mean()

        # Save the best validation accuracy and the corresponding test accuracy.
        if best_val_acc < val_acc:
            best_val_acc = val_acc
            best_test_acc = test_acc

        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if e % print_step == 0:
            print('In epoch {}, loss: {:.3f}, val acc: {:.3f} (best {:.3f}), test acc: {:.3f} (best {:.3f})'.format(
                e, loss, val_acc, best_val_acc, test_acc, best_test_acc))
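
The "best" test accuracy printed by `train` is the test accuracy at the epoch with the highest validation accuracy, not the highest test accuracy seen, so the test set is never used to pick the model. The sketch below isolates that bookkeeping. `select_by_validation` is a hypothetical helper and the accuracy values are made up.

```python
def select_by_validation(history):
    """Return (best_val_acc, test_acc_at_that_epoch) from (val_acc, test_acc) pairs."""
    best_val_acc, best_test_acc = 0.0, 0.0
    for val_acc, test_acc in history:
        # Strict comparison: on ties the earlier epoch is kept, as in `train`.
        if best_val_acc < val_acc:
            best_val_acc, best_test_acc = val_acc, test_acc
    return best_val_acc, best_test_acc

# Made-up accuracies for four epochs, for illustration only.
history = [(0.50, 0.48), (0.54, 0.53), (0.52, 0.56), (0.54, 0.51)]
selected = select_by_validation(history)
print(selected)
```

In this toy history the highest test accuracy (0.56) is not reported, because it occurred at an epoch whose validation accuracy was not the best.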


In [21]:
# importing the module
import time
# records start time
start = time.perf_counter()

g = g.to('cuda')
mlp_model = MLP(hidden_channels=256).to('cuda')
train(g, mlp_model, 100)

# record end time
end = time.perf_counter()
 
# find elapsed time in seconds
elapsed = end - start
print(f"\nElapsed {elapsed:.03f} secs.")


In epoch 0, loss: 1.946, val acc: 0.060 (best 0.060), test acc: 0.064 (best 0.064)
In epoch 10, loss: 1.460, val acc: 0.446 (best 0.476), test acc: 0.449 (best 0.424)
In epoch 20, loss: 0.465, val acc: 0.486 (best 0.512), test acc: 0.507 (best 0.482)
In epoch 30, loss: 0.060, val acc: 0.508 (best 0.522), test acc: 0.510 (best 0.516)
In epoch 40, loss: 0.012, val acc: 0.530 (best 0.530), test acc: 0.488 (best 0.488)
In epoch 50, loss: 0.005, val acc: 0.522 (best 0.532), test acc: 0.500 (best 0.507)
In epoch 60, loss: 0.004, val acc: 0.496 (best 0.532), test acc: 0.500 (best 0.507)
In epoch 70, loss: 0.002, val acc: 0.510 (best 0.544), test acc: 0.516 (best 0.530)
In epoch 80, loss: 0.002, val acc: 0.520 (best 0.544), test acc: 0.506 (best 0.530)
In epoch 90, loss: 0.003, val acc: 0.512 (best 0.544), test acc: 0.513 (best 0.530)
In epoch 100, loss: 0.002, val acc: 0.534 (best 0.544), test acc: 0.505 (best 0.530)

Elapsed 0.414 secs.


As one can see, our MLP performs rather poorly, with only about 53% test accuracy.
But why does the MLP not perform better?
The main reason is that this model suffers from heavy overfitting because it only has access to a **small number of training nodes**, and it therefore generalizes poorly to unseen node representations.
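
A back-of-the-envelope count makes the imbalance visible. The sketch below counts the trainable parameters of the two linear layers using the sizes printed earlier (1433 input features, 256 hidden channels, 7 classes) and compares them with the 140 training nodes. `linear_params` is a helper written for this illustration.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

num_features, hidden_channels, num_classes = 1433, 256, 7
num_train_nodes = 140

total = (linear_params(num_features, hidden_channels)
         + linear_params(hidden_channels, num_classes))
print(f"Trainable parameters: {total}")
print(f"Parameters per training node: {total / num_train_nodes:.0f}")
```

With several thousand parameters per labeled example, the model can memorize the training set, which matches the training loss dropping close to zero in the log above while the validation accuracy stalls.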

It also fails to incorporate an important bias into the model: **Cited papers are very likely related to the category of a document**.
That is exactly where Graph Neural Networks come into play and can help to boost the performance of our model.
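
One simple way to quantify this bias is the fraction of edges that connect two nodes of the same category. The sketch below computes it for a made-up five-node graph. `edge_homophily` is a hypothetical helper, and no value for Cora is claimed here.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    same = sum(1 for src, dst in edges if labels[src] == labels[dst])
    return same / len(edges)

# Made-up 5-node citation graph, for illustration only.
toy_labels = [0, 0, 0, 1, 1]
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (2, 3)]
homophily = edge_homophily(toy_edges, toy_labels)
print(homophily)
```

The closer this fraction is to 1, the more a node's neighbors tell us about its own label, which is the signal a GNN can exploit and an MLP cannot.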



## Training a Graph Neural Network (GNN)

We can easily convert our MLP to a GNN by swapping the `torch.nn.Linear` layers with DGL's GNN operators.

Following up on [the first part of this tutorial](https://colab.research.google.com/drive/1h3-vJGRVloF5zStxL5I0rSy4ZUPNsjy8), we replace the linear layers with DGL's `GraphConv` module.
To recap, the **GCN layer** ([Kipf and Welling (2017)](https://arxiv.org/abs/1609.02907)) is defined as

$$
\mathbf{x}_v^{(\ell + 1)} = \mathbf{W}^{(\ell + 1)} \sum_{w \in \mathcal{N}(v) \, \cup \, \{ v \}} \frac{1}{c_{w,v}} \cdot \mathbf{x}_w^{(\ell)}
$$

where $\mathbf{W}^{(\ell + 1)}$ denotes a trainable weight matrix of shape `[num_output_features, num_input_features]` and $c_{w,v}$ refers to a fixed normalization coefficient for each edge.
In contrast, a single `Linear` layer is defined as

$$
\mathbf{x}_v^{(\ell + 1)} = \mathbf{W}^{(\ell + 1)} \mathbf{x}_v^{(\ell)}
$$

which does not make use of neighboring node information.
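
The difference between the two formulas can be traced by hand on a tiny graph. The following plain-Python sketch applies one GCN layer without a non-linearity. It assumes the symmetric normalization $c_{w,v} = \sqrt{\deg(w)\,\deg(v)}$ with degrees that count the self-loop, which the text above leaves unspecified. The three-node path graph, the features, and the `gcn_layer` helper are made up for illustration and are not DGL's implementation.

```python
import math

def gcn_layer(neighbors, feats, weight):
    """One GCN layer: x_v' = W @ sum over w in N(v) + {v} of x_w / c_{w,v}."""
    # Degree including the self-loop (assumption: c_{w,v} = sqrt(deg(w) * deg(v))).
    deg = {v: len(nbrs) + 1 for v, nbrs in neighbors.items()}
    out = {}
    for v, nbrs in neighbors.items():
        agg = [0.0] * len(feats[v])
        for w in list(nbrs) + [v]:
            c = math.sqrt(deg[w] * deg[v])
            for i, value in enumerate(feats[w]):
                agg[i] += value / c
        # Weight matrix of shape [num_output_features, num_input_features].
        out[v] = [sum(w_ij * a for w_ij, a in zip(row, agg)) for row in weight]
    return out

# Path graph 0 - 1 - 2 with 2-dimensional node features.
toy_neighbors = {0: [1], 1: [0, 2], 2: [1]}
toy_feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]

out = gcn_layer(toy_neighbors, toy_feats, identity)
for v in sorted(out):
    print(v, [round(value, 3) for value in out[v]])
```

With the identity as weight matrix, a `Linear` layer would return each node's features unchanged, whereas the GCN output of node 0 already contains a contribution from node 1.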

In [17]:
from dgl.nn import GraphConv

class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GCN, self).__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = F.dropout(h, p=0.5, training=self.training)
        h = self.conv2(g, h)
        return h

# Create the model with given dimensions
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)

#Train the GNN

In [18]:
# importing the module
import time
# records start time
start = time.perf_counter()

g = g.to('cpu')
model = GCN(g.ndata['feat'].shape[1], 256, dataset.num_classes).to('cpu')
train(g, model, 50)

# record end time
end = time.perf_counter()
 
# find elapsed time in seconds
elapsed = end - start
print(f"\nElapsed {elapsed:.03f} secs.")

In epoch 0, loss: 1.947, val acc: 0.136 (best 0.136), test acc: 0.114 (best 0.114)
In epoch 5, loss: 1.704, val acc: 0.618 (best 0.618), test acc: 0.633 (best 0.633)
In epoch 10, loss: 1.273, val acc: 0.730 (best 0.730), test acc: 0.751 (best 0.751)
In epoch 15, loss: 0.760, val acc: 0.760 (best 0.768), test acc: 0.774 (best 0.768)
In epoch 20, loss: 0.392, val acc: 0.774 (best 0.778), test acc: 0.785 (best 0.783)
In epoch 25, loss: 0.199, val acc: 0.784 (best 0.784), test acc: 0.783 (best 0.782)
In epoch 30, loss: 0.096, val acc: 0.768 (best 0.784), test acc: 0.784 (best 0.782)
In epoch 35, loss: 0.055, val acc: 0.788 (best 0.788), test acc: 0.782 (best 0.782)
In epoch 40, loss: 0.029, val acc: 0.766 (best 0.788), test acc: 0.781 (best 0.782)
In epoch 45, loss: 0.018, val acc: 0.762 (best 0.788), test acc: 0.776 (best 0.782)
In epoch 50, loss: 0.014, val acc: 0.768 (best 0.790), test acc: 0.769 (best 0.774)

Elapsed 4.174 secs.


The training and testing procedure is once again the same, but this time the GCN model receives both the node features (`g.ndata['feat']`) **and** the graph connectivity (`g`) as input.

**There it is!**
By simply swapping the linear layers with GNN layers, we can reach **about 78% test accuracy**!
This is in stark contrast to the 53% test accuracy obtained by our MLP, indicating that relational information plays a crucial role in obtaining better performance.


We can also specify a GPU for training, which gives much better speed on large datasets.
- Note: the total wall-clock time may not improve much for smaller datasets because of the overhead of transferring data to and from the GPU.
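
When comparing CPU and GPU timings, keep in mind that CUDA work is launched asynchronously, so a host-side timer can stop before the GPU has finished unless the device is synchronized first. The small helper below is a sketch of one way to wrap the timing pattern used in this notebook. `timed` and its `sync` argument are hypothetical names introduced here. On a GPU one would pass `torch.cuda.synchronize` as `sync`. The workload is a stand-in so that the block runs without a GPU.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, sync=None):
    """Print the wall-clock time of the enclosed block.

    `sync` is an optional zero-argument callable that is run before each clock
    reading, e.g. `torch.cuda.synchronize` when the work runs on a GPU.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        if sync is not None:
            sync()
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.3f} secs.")

# Stand-in workload; in the notebook this would be `train(g, model, 50)`.
with timed("toy workload") as timing:
    checksum = sum(i * i for i in range(100_000))
```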

In [19]:
# importing the module
import time
# records start time
start = time.perf_counter()

g = g.to('cuda')
model = GCN(g.ndata['feat'].shape[1], 256, dataset.num_classes).to('cuda')
train(g, model, 50)

# record end time
end = time.perf_counter()
 
# find elapsed time in seconds
elapsed = end - start
print(f"\nElapsed {elapsed:.03f} secs.")

In epoch 0, loss: 1.946, val acc: 0.074 (best 0.074), test acc: 0.131 (best 0.131)
In epoch 5, loss: 1.700, val acc: 0.604 (best 0.604), test acc: 0.625 (best 0.625)
In epoch 10, loss: 1.268, val acc: 0.722 (best 0.722), test acc: 0.748 (best 0.748)
In epoch 15, loss: 0.773, val acc: 0.754 (best 0.766), test acc: 0.770 (best 0.759)
In epoch 20, loss: 0.389, val acc: 0.758 (best 0.776), test acc: 0.773 (best 0.784)
In epoch 25, loss: 0.189, val acc: 0.778 (best 0.778), test acc: 0.779 (best 0.779)
In epoch 30, loss: 0.096, val acc: 0.764 (best 0.786), test acc: 0.794 (best 0.787)
In epoch 35, loss: 0.054, val acc: 0.786 (best 0.786), test acc: 0.786 (best 0.787)
In epoch 40, loss: 0.027, val acc: 0.778 (best 0.786), test acc: 0.777 (best 0.787)
In epoch 45, loss: 0.021, val acc: 0.774 (best 0.786), test acc: 0.778 (best 0.787)
In epoch 50, loss: 0.015, val acc: 0.772 (best 0.786), test acc: 0.756 (best 0.787)

Elapsed 0.339 secs.


## Conclusion

In this chapter, you have seen how to apply GNNs to real-world problems, and, in particular, how they can effectively be used for boosting a model's performance.
In the next section, we will look into how GNNs can be used for the task of graph classification.

[Next: Graph Classification with Graph Neural Networks](https://colab.research.google.com/drive/1I8a0DfQ3fI7Njc62__mVXUlcAleUclnb)