<div style="float:left;"><img src="logo.png" width="500"/></div>

# Node Classification

This demo covers a more advanced topic: using a Graph Convolutional Network (GCN) for node classification. The notebook requires the *DGL* and *PyTorch* Python packages.

For DGL installation instructions, see:
https://www.dgl.ai/pages/start.html

In network analysis, **node classification** is a widely applied task that involves training a machine learning model to assign each node in a network to one of two or more classes or categories.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import itertools, random
import pandas as pd
import numpy as np
import scipy.sparse as sp

# imports for DGL
import dgl
import dgl.data
import dgl.function as fn
from dgl.nn import GraphConv

# display settings
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams.update({'font.size': 14})

Seed the random number generators so that results are reproducible:

In [None]:
def set_all_seeds(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
set_all_seeds(100)
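
As a minimal sketch of why seeding matters, the example below uses only Python's standard `random` module to show that re-seeding with the same value reproduces the same sequence of draws. The helper `draw_sample` is hypothetical and not part of the notebook.

```python
import random

def draw_sample(seed, n=5):
    """Hypothetical helper: seed the generator, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw_sample(100)
second = draw_sample(100)
other = draw_sample(101)

print(first == second)  # same seed, so the two sequences are compared for equality
print(first == other)   # a different seed gives a different sequence here
```

The same principle applies to the NumPy and PyTorch generators seeded in `set_all_seeds`.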

## Data Preparation

Load the "Cora" scientific citation network dataset, which is included as part of DGL.

Note that we could also create a NetworkX graph and convert it for use with DGL, using *dgl.

In [None]:
dataset = dgl.data.CoraGraphDataset()
# the dataset contains a single graph; retrieve it and check its structure
g = dataset[0]
print(g)

In [None]:
num_classes = dataset.num_classes
print("Dataset has %d class labels" % num_classes)

The Cora dataset includes a predefined training/validation/test split of the nodes:

In [None]:
features = g.ndata['feat']
labels = g.ndata['label']
# get the masks for the training/validation/test nodes
train_mask = g.ndata['train_mask']
val_mask = g.ndata['val_mask']
test_mask = g.ndata['test_mask']
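
The masks are boolean vectors with one entry per node, and indexing with a mask keeps only the nodes in that split. The sketch below illustrates this with small NumPy arrays standing in for the real tensors. All values are made up for illustration; PyTorch tensors support the same boolean indexing.

```python
import numpy as np

# Toy stand-ins for g.ndata['label'] and the boolean masks (illustrative values only)
labels = np.array([0, 2, 1, 1, 0, 2])
train_mask = np.array([True, True, False, False, False, False])
val_mask = np.array([False, False, True, True, False, False])
test_mask = np.array([False, False, False, False, True, True])

# Boolean indexing keeps only the entries where the mask is True
train_labels = labels[train_mask]

# The sum of a boolean mask is the number of nodes in that split
split_sizes = {
    "train": int(train_mask.sum()),
    "val": int(val_mask.sum()),
    "test": int(test_mask.sum()),
}

# In this toy example no node belongs to more than one split
overlap = (train_mask & val_mask) | (train_mask & test_mask) | (val_mask & test_mask)
disjoint = not bool(overlap.any())

print(train_labels, split_sizes, disjoint)
```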

## Model Preparation

As our architecture we will use a two-layer Graph Convolutional Network (GCN), where each layer computes a new representation for every node by aggregating information from its neighbours. This is similar to the idea of node embeddings that we saw previously.
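
To make the aggregation step concrete, here is a minimal NumPy sketch of a single graph convolution on a toy graph. It follows the propagation rule commonly associated with GCNs: add self-loops, normalise the adjacency matrix symmetrically by node degree, then multiply by the features and a weight matrix. The graph, features and weights are all illustrative assumptions, and DGL's `GraphConv` may differ in details such as self-loop handling and the bias term.

```python
import numpy as np

# Toy undirected path graph with 4 nodes: 0-1, 1-2, 2-3 (illustrative only)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)                        # one-hot input features, one row per node
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))          # weight matrix (random here; learned in practice)

A_hat = A + np.eye(4)                # add self-loops so each node keeps its own features
deg = A_hat.sum(axis=1)              # node degrees, including the self-loop
D_inv_sqrt = np.diag(deg ** -0.5)    # D^(-1/2)
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetrically normalised adjacency

# Aggregate over neighbours, apply the linear transform, then ReLU
H = np.maximum(A_norm @ X @ W, 0)
print(H.shape)
```

Each row of `H` depends only on the corresponding node and its direct neighbours, so stacking two such layers lets information travel two hops.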

In [None]:
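# define the GCN model
class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super().__init__()
        # first layer: input features -> hidden representation
        self.conv1 = GraphConv(in_feats, h_feats)
        # second layer: hidden representation -> one score per class
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        # return raw logits; the softmax is applied inside the loss function
        return h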

## Training Phase

Create the model and optimizer:

In [None]:
# create the GCN model with given dimensions
model = GCN(g.ndata['feat'].shape[1], 16, num_classes)
# create the Adam optimiser with a learning rate of 0.01
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

Train the model for the specified number of epochs, recording the loss and the accuracy on each split at every epoch:

In [None]:
max_epochs = 100
loss_scores, train_acc, val_acc, test_acc = {}, {}, {}, {}
for e in range(1, max_epochs+1):
    # Forward
    logits = model(g, features)
    # Compute the prediction
    pred = logits.argmax(1)
    
    # Compute the loss on the training set
    loss = F.cross_entropy(logits[train_mask], labels[train_mask])
    loss_scores[e] = float(loss)

    # Compute accuracy on each of the training/validation/test sets at each epoch
    train_acc[e] = float((pred[train_mask] == labels[train_mask]).float().mean())
    val_acc[e] = float((pred[val_mask] == labels[val_mask]).float().mean())
    test_acc[e] = float((pred[test_mask] == labels[test_mask]).float().mean())

    # Backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if e % 20 == 0:
        print('Epoch %d/%d' % (e, max_epochs))      
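
The loop above computes the loss on the training nodes only, while predictions are made for every node. The NumPy sketch below reproduces that pattern on toy values: a softmax cross-entropy averaged over the masked nodes, followed by per-split accuracy. The `cross_entropy` function here is a hand-written stand-in for illustration, not PyTorch's implementation, and all numbers are made up.

```python
import numpy as np

# Toy logits for 4 nodes and 3 classes (illustrative values only)
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3],
                   [0.1, 0.2, 3.0],
                   [1.0, 1.2, 0.8]])
labels = np.array([0, 1, 2, 0])
train_mask = np.array([True, True, False, False])

def cross_entropy(logits, labels):
    """Mean negative log-probability of the true class under a softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# The loss only sees the training nodes
loss = cross_entropy(logits[train_mask], labels[train_mask])

# Predictions cover every node; accuracy is then measured per split
pred = logits.argmax(axis=1)
train_accuracy = float((pred[train_mask] == labels[train_mask]).mean())
heldout_accuracy = float((pred[~train_mask] == labels[~train_mask]).mean())

print(float(loss), train_accuracy, heldout_accuracy)
```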

Plot the trajectory of the loss function:

In [None]:
ax = pd.Series(loss_scores).plot(figsize=(10,6), zorder=3)
ax.set_xlabel("Training Epoch")
ax.set_ylabel("Loss")
ax.yaxis.grid()
ax.set_xlim(1, max_epochs)
ax.set_ylim(0);

Plot the accuracy scores for each of the splits:

In [None]:
df_acc = pd.DataFrame({"Train": pd.Series(train_acc), 
                      "Validation": pd.Series(val_acc), "Test": pd.Series(test_acc)})
ax = df_acc.plot(figsize=(10, 6), zorder=3)
ax.set_xlabel("Training Epoch")
ax.set_ylabel("Accuracy")
ax.yaxis.grid()
plt.legend(loc='lower right')
ax.set_xlim(1, max_epochs)
ax.set_ylim(0);
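
One common way to use the recorded scores is to select the epoch with the highest validation accuracy and report the test accuracy at that epoch. The sketch below assumes dictionaries in the same `{epoch: score}` form as `val_acc` and `test_acc`; the values are hypothetical and are not results from this notebook.

```python
# Hypothetical per-epoch accuracies, in the same {epoch: score} form as val_acc and test_acc
val_acc = {1: 0.30, 2: 0.55, 3: 0.72, 4: 0.70, 5: 0.71}
test_acc = {1: 0.28, 2: 0.52, 3: 0.69, 4: 0.71, 5: 0.70}

# Pick the epoch with the highest validation accuracy (the first one on ties)
best_epoch = max(val_acc, key=val_acc.get)

# Report the test accuracy at that epoch rather than the maximum test accuracy
print("Best epoch: %d, test accuracy: %.2f" % (best_epoch, test_acc[best_epoch]))
```

Selecting on the validation split keeps the test split out of the decision, so the reported test accuracy is not biased by the selection.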