## Node Classification with Deep Graph Library

In [1]:
!pip install dgl

Collecting dgl
  Downloading dgl-0.6.1-cp37-cp37m-manylinux1_x86_64.whl (4.4 MB)
     |████████████████████████████████| 4.4 MB 5.1 MB/s
Installing collected packages: dgl
Successfully installed dgl-0.6.1


In [2]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F

DGL backend not selected or invalid.  Assuming PyTorch for now.


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


Using backend: pytorch


## Loading Cora Dataset

In [3]:
import dgl.data as data
dataset = data.CoraGraphDataset()

Downloading /root/.dgl/cora_v2.zip from https://data.dgl.ai/dataset/cora_v2.zip...
Extracting file to /root/.dgl/cora_v2
Finished data loading and preprocessing.
  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done saving data into cached files.


In [4]:
print(f'Number of categories: {dataset.num_classes}')

Number of categories: 7


A DGL Dataset object may contain one or more graphs. The Cora dataset used in this tutorial consists of a single graph.

In [5]:
print(f'Number of graphs: {len(dataset)}')

Number of graphs: 1


In [6]:
g = dataset[0]

In [7]:
print(g.num_nodes())
print(g.num_edges())
print(g.num_src_nodes())

2708
10556
2708


A DGL graph can store node features and edge features in two dictionary-like attributes called ndata and edata. In the DGL Cora dataset, the graph contains the following node features:

- train_mask: A boolean tensor indicating whether the node is in the training set.

- val_mask: A boolean tensor indicating whether the node is in the validation set.

- test_mask: A boolean tensor indicating whether the node is in the test set.

- label: The ground-truth category of the node.

- feat: The node features.
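
To make the role of the three masks concrete, here is a minimal sketch with invented values. NumPy arrays stand in for the tensors stored in ndata (an assumption, so that the snippet runs without DGL); boolean indexing selects entries the same way for this simple case.

```python
import numpy as np

# Toy stand-ins for g.ndata['label'] and the three masks (values are invented).
label = np.array([3, 4, 4, 0, 3, 2])
train_mask = np.array([True, True, False, False, False, False])
val_mask = np.array([False, False, True, True, False, False])
test_mask = np.array([False, False, False, False, True, True])

# Boolean indexing keeps only the entries whose mask value is True.
train_labels = label[train_mask]
test_labels = label[test_mask]

# Summing a boolean mask counts the nodes in that split.
split_sizes = {
    "train": int(train_mask.sum()),
    "val": int(val_mask.sum()),
    "test": int(test_mask.sum()),
}

# In this toy example every node belongs to exactly one split.
membership = train_mask.astype(int) + val_mask.astype(int) + test_mask.astype(int)
disjoint = bool(np.all(membership == 1))

print(train_labels, test_labels, split_sizes, disjoint)
```

The same pattern, `tensor[mask]`, is what later restricts the loss and the accuracy to one split at a time.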

In [8]:
print('Node features')
print(g.ndata)

Node features
{'train_mask': tensor([ True,  True,  True,  ..., False, False, False]), 'val_mask': tensor([False, False, False,  ..., False, False, False]), 'test_mask': tensor([False, False, False,  ...,  True,  True,  True]), 'label': tensor([3, 4, 4,  ..., 3, 3, 3]), 'feat': tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])}


In [9]:
num_train_examples = g.ndata['train_mask'].sum()
num_test_examples = g.ndata['test_mask'].sum()
num_val_examples = g.ndata['val_mask'].sum()

print(f'Number of train examples: {num_train_examples}')
print(f'Number of test examples: {num_test_examples}')
print(f'Number of val examples: {num_val_examples}')

Number of train examples: 140
Number of test examples: 1000
Number of val examples: 500


In [10]:
print(g.ndata['feat'].shape)

torch.Size([2708, 1433])


In [11]:
print('Edge features')
print(g.edata)

Edge features
{}


## Defining a GCN

This section builds a **2-layer** [GCN](http://tkipf.github.io/graph-convolutional-networks/) (graph convolutional network).

Each layer computes new node representations by aggregating information from each node's neighbors.

To build a multi-layer GCN, stack **dgl.nn.GraphConv** modules, which inherit from **torch.nn.Module**.
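
As a rough illustration of what "aggregating neighbor information" means, the sketch below implements one graph-convolution step in plain NumPy, using the textbook form ReLU(D^-1/2 (A + I) D^-1/2 X W) from the linked GCN post. The function name `gcn_layer`, the toy graph, and the added self-loops are assumptions made for illustration; GraphConv's own normalization and self-loop handling may differ.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    # Add self-loops so each node keeps its own features (assumption, see above).
    a_hat = adj + np.eye(adj.shape[0])
    # Symmetric normalization by the square root of the node degrees.
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Aggregate neighbor features, apply the linear map, then the nonlinearity.
    return np.maximum(a_norm @ feats @ weight, 0.0)

# Toy path graph 0 - 1 - 2 with 2 input features per node (values are invented).
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
feats = np.array([[1., 0.],
                  [0., 1.],
                  [1., 1.]])
weight = np.eye(2)  # identity weights so the aggregation itself is visible

out = gcn_layer(adj, feats, weight)
print(out.shape)
```

Applying such a step twice lets information travel two hops, which is why a 2-layer GCN sees each node's neighbors and their neighbors.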

In [12]:
from dgl.nn import GraphConv

class GCN(nn.Module):
  def __init__(self, in_features, h_features, num_classes):
    super(GCN, self).__init__()
    self.conv1 = GraphConv(in_feats=in_features, out_feats=h_features)
    self.conv2 = GraphConv(in_feats=h_features, out_feats=num_classes)

  def forward(self, g, in_feat):
    # shape of in_feat: (number_of_nodes x number_of_features)
    h = self.conv1(g, in_feat)
    h = F.relu(h)
    # shape of h: (number_of_nodes x number_of_hidden_features)
    o = self.conv2(g, h)
    # shape of o: (number_of_nodes x number_of_classes)
    return o

In [13]:
# Testing GCN architecture
X = g.ndata['feat']
model = GCN(X.shape[1], 128, dataset.num_classes)
output = model(g, X)
print(output.shape)

torch.Size([2708, 7])


## Training the GCN

In [30]:
def train(g, model, epochs=100):
  optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
  best_val_acc = 0
  best_test_acc = 0

  features = g.ndata['feat']    # shape: (num_nodes, num_features)
  labels = g.ndata['label']     # shape: (num_nodes)

  train_mask = g.ndata['train_mask']
  test_mask = g.ndata['test_mask']
  val_mask = g.ndata['val_mask']

  for epoch in range(epochs):
    # forward pass
    logits = model(g, features)

    # compute prediction
    pred = logits.argmax(1)

    # compute loss
    loss = F.cross_entropy(logits[train_mask], labels[train_mask])

    # compute accuracy on training/validation/test
    train_acc = (pred[train_mask] == labels[train_mask]).float().mean()
    val_acc = (pred[val_mask] == labels[val_mask]).float().mean()
    test_acc = (pred[test_mask] == labels[test_mask]).float().mean()

    # save best validation accuracy and corresponding test accuracy
    if val_acc > best_val_acc:
      best_val_acc = val_acc
      best_test_acc = test_acc
    
    # Backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 5 == 0:
      print(f'Epoch: [{epoch+1}/{epochs}], Loss: {loss:.6f}, Val Acc: {val_acc:.3f}, Best Val Acc: {best_val_acc:.3f}, Test Acc: {test_acc:.3f}, Best Test Acc: {best_test_acc:.3f}')
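
The bookkeeping in `train` means that "Best Test Acc" is the test accuracy at the epoch with the best validation accuracy, not the highest test accuracy seen. A minimal sketch of that selection rule in plain Python follows; the helper name `select_by_validation` and the accuracy values are invented for illustration.

```python
def select_by_validation(history):
    """Return (best_val_acc, test_acc_at_best_val) from per-epoch (val, test) pairs."""
    best_val_acc = 0.0
    best_test_acc = 0.0
    for val_acc, test_acc in history:
        # Only a strict improvement on the validation set updates the record,
        # so the test accuracy is never used to pick the epoch.
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_test_acc = test_acc
    return best_val_acc, best_test_acc

# Invented per-epoch accuracies: the last epoch has the highest test accuracy,
# but the second epoch has the highest validation accuracy.
history = [(0.50, 0.48), (0.70, 0.69), (0.68, 0.75)]
best_val, test_at_best = select_by_validation(history)
print(best_val, test_at_best)
```

Because the comparison is strict, an epoch that only ties the best validation accuracy leaves the stored test accuracy unchanged, so the printed "Test Acc" and "Best Test Acc" can differ even when "Val Acc" equals "Best Val Acc".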

In [31]:
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)

In [32]:
train(g, model)

Epoch: [1/100], Loss: 1.946356, Val Acc: 0.146, Best Val Acc: 0.146, Test Acc: 0.150, Best Test Acc: 0.150
Epoch: [6/100], Loss: 1.887570, Val Acc: 0.354, Best Val Acc: 0.354, Test Acc: 0.343, Best Test Acc: 0.342
Epoch: [11/100], Loss: 1.804297, Val Acc: 0.390, Best Val Acc: 0.390, Test Acc: 0.390, Best Test Acc: 0.390
Epoch: [16/100], Loss: 1.698399, Val Acc: 0.554, Best Val Acc: 0.554, Test Acc: 0.575, Best Test Acc: 0.575
Epoch: [21/100], Loss: 1.572961, Val Acc: 0.638, Best Val Acc: 0.638, Test Acc: 0.664, Best Test Acc: 0.654
Epoch: [26/100], Loss: 1.429610, Val Acc: 0.672, Best Val Acc: 0.672, Test Acc: 0.701, Best Test Acc: 0.701
Epoch: [31/100], Loss: 1.272831, Val Acc: 0.720, Best Val Acc: 0.720, Test Acc: 0.738, Best Test Acc: 0.738
Epoch: [36/100], Loss: 1.109048, Val Acc: 0.738, Best Val Acc: 0.738, Test Acc: 0.746, Best Test Acc: 0.746
Epoch: [41/100], Loss: 0.946338, Val Acc: 0.750, Best Val Acc: 0.750, Test Acc: 0.755, Best Test Acc: 0.755
Epoch: [46/100], Loss: 0.79309

## Training on GPU

In [None]:
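# Use the GPU when available; g.to('cuda') also needs a CUDA-enabled DGL build,
# which the plain `pip install dgl` above may not provide.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
g = g.to(device)
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes).to(device)
train(g, model)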