<a href="https://colab.research.google.com/github/minhld99/dgl-tutorials/blob/main/dgl_tutorials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overall Results:


1.   `Node Classification` with DGL

  - Cora Dataset
      - NumNodes: 2708
      - NumEdges: 10556
      - NumFeats: 1433
      - NumClasses: 7
      - NumTrainingSamples: 140
      - NumValidationSamples: 500
      - NumTestSamples: 1000
      - Number of categories: 7
  - 100 epochs
  - Accuracy: 0.768
  - CPU time: 1.2 s
  - GPU time: 488 ms 

2.   `Link Prediction` using Graph Neural Networks

  - Cora Dataset
      - Train/Test = 90%/10% for positive examples
      - Negative examples: randomly sampled node pairs with no edge between them (the code draws `g.number_of_edges() // 2` such pairs)
  - 100 epochs
  - Optimizer: Adam `(lr=0.01)`
  - Accuracy: `(AUC)` 0.8632950742346308
  - Time: 4.39 s

3.   Training a GNN for `Graph Classification`

  - Small dataset from the paper [How Powerful Are Graph Neural Networks](https://arxiv.org/abs/1810.00826).
      - Node feature dimensionality: 3
      - Number of graph categories: 2 
  - 20 epochs
  - Optimizer: Adam `(lr=0.01)`
  - Accuracy: 0.21524663677130046
  - Time: 9.85 s

In [None]:
!pip3 install ipython-autotime
%load_ext autotime

Collecting ipython-autotime
  Downloading https://files.pythonhosted.org/packages/b4/c9/b413a24f759641bc27ef98c144b590023c8038dfb8a3f09e713e9dff12c1/ipython_autotime-0.3.1-py2.py3-none-any.whl
Installing collected packages: ipython-autotime
Successfully installed ipython-autotime-0.3.1
time: 585 µs (started: 2021-03-19 18:54:08 +00:00)


# Node Classification with DGL



- Load a DGL-provided dataset.

- Build a GNN model with DGL-provided neural network modules.

- Train and evaluate a GNN model for node classification on either CPU or GPU.

In [None]:
!pip3 install dgl-cu101

Collecting dgl-cu101
  Downloading https://files.pythonhosted.org/packages/ab/f3/a1b614ca8e12e92b53f4673b12a4968da2b94c357df8597f41b1d568755b/dgl_cu101-0.6.0.post1-cp37-cp37m-manylinux1_x86_64.whl (36.0MB)
Installing collected packages: dgl-cu101
Successfully installed dgl-cu101-0.6.0.post1


In [None]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F

Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


DGL backend not selected or invalid.  Assuming PyTorch for now.
Using backend: pytorch


## Loading Cora Dataset

In [None]:
import dgl.data

dataset = dgl.data.CoraGraphDataset()
print('Number of categories:', dataset.num_classes)

Downloading /root/.dgl/cora_v2.zip from https://data.dgl.ai/dataset/cora_v2.zip...
Extracting file to /root/.dgl/cora_v2
Finished data loading and preprocessing.
  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done saving data into cached files.
Number of categories: 7


In [None]:
g = dataset[0]

In [None]:
print('Node features')
print(g.ndata)
print('Edge features')
print(g.edata)

Node features
{'train_mask': tensor([ True,  True,  True,  ..., False, False, False]), 'val_mask': tensor([False, False, False,  ..., False, False, False]), 'test_mask': tensor([False, False, False,  ...,  True,  True,  True]), 'label': tensor([3, 4, 4,  ..., 3, 3, 3]), 'feat': tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])}
Edge features
{}


## Defining a Graph Convolutional Network (GCN)


In [None]:
from dgl.nn import GraphConv

class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GCN, self).__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h

# Create the model with given dimensions
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)
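
To make the two-layer model above less opaque, here is a dependency-free sketch of the propagation step that a single graph convolution performs: every node sums its in-neighbors' features, each scaled by the inverse square root of (source out-degree × destination in-degree), and then applies a linear projection. The symmetric normalization is assumed here to correspond to the `GraphConv` default; the helper name `gcn_layer` is hypothetical, and the bias term is omitted for brevity.

```python
import math

def gcn_layer(edges, feats, weight):
    """One GCN propagation step on a directed edge list, without bias.

    edges  : list of (src, dst) pairs
    feats  : list of per-node feature vectors (lists of floats)
    weight : in_dim x out_dim matrix given as a list of rows
    """
    n = len(feats)
    out_deg = [0] * n
    in_deg = [0] * n
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1

    in_dim = len(weight)
    agg = [[0.0] * in_dim for _ in range(n)]
    for u, v in edges:
        # Symmetric normalization; degrees are clamped to 1 to avoid dividing by zero.
        c = 1.0 / math.sqrt(max(out_deg[u], 1) * max(in_deg[v], 1))
        for k in range(in_dim):
            agg[v][k] += c * feats[u][k]

    out_dim = len(weight[0])
    return [[sum(agg[i][k] * weight[k][j] for k in range(in_dim)) for j in range(out_dim)]
            for i in range(n)]

# Path 0 - 1 - 2, with each undirected edge stored in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
feats = [[1.0], [2.0], [3.0]]
weight = [[1.0]]  # identity projection, so the output is the normalized aggregate
out = gcn_layer(edges, feats, weight)
print([[round(x, 3) for x in row] for row in out])  # roughly [[1.414], [2.828], [1.414]]
```

Stacking two such steps with a ReLU in between mirrors the structure of the `GCN` class, with the second projection mapping to one score per class.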

## Training the GCN

In [None]:
g = g.to('cpu')

time: 1.72 ms (started: 2021-03-19 18:57:53 +00:00)


In [None]:
def train(g, model):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    best_val_acc = 0
    best_test_acc = 0

    features = g.ndata['feat']
    labels = g.ndata['label']
    train_mask = g.ndata['train_mask']
    val_mask = g.ndata['val_mask']
    test_mask = g.ndata['test_mask']
    for e in range(100):
        # Forward
        logits = model(g, features)

        # Compute prediction
        pred = logits.argmax(1)

        # Compute loss
        # Note that you should only compute the losses of the nodes in the training set.
        loss = F.cross_entropy(logits[train_mask], labels[train_mask])

        # Compute accuracy on training/validation/test
        train_acc = (pred[train_mask] == labels[train_mask]).float().mean()
        val_acc = (pred[val_mask] == labels[val_mask]).float().mean()
        test_acc = (pred[test_mask] == labels[test_mask]).float().mean()

        # Save the best validation accuracy and the corresponding test accuracy.
        if best_val_acc < val_acc:
            best_val_acc = val_acc
            best_test_acc = test_acc

        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if e % 5 == 0:
            print('In epoch {}, loss: {:.3f}, val acc: {:.3f} (best {:.3f}), test acc: {:.3f} (best {:.3f})'.format(
                e, loss, val_acc, best_val_acc, test_acc, best_test_acc))

time: 20.7 ms (started: 2021-03-19 18:57:01 +00:00)


In [None]:
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)
train(g, model)

In epoch 0, loss: 1.946, val acc: 0.104 (best 0.104), test acc: 0.096 (best 0.096)
In epoch 5, loss: 1.901, val acc: 0.416 (best 0.446), test acc: 0.410 (best 0.456)
In epoch 10, loss: 1.826, val acc: 0.462 (best 0.462), test acc: 0.458 (best 0.458)
In epoch 15, loss: 1.728, val acc: 0.590 (best 0.590), test acc: 0.629 (best 0.629)
In epoch 20, loss: 1.608, val acc: 0.634 (best 0.634), test acc: 0.665 (best 0.665)
In epoch 25, loss: 1.468, val acc: 0.672 (best 0.672), test acc: 0.698 (best 0.690)
In epoch 30, loss: 1.312, val acc: 0.708 (best 0.708), test acc: 0.731 (best 0.731)
In epoch 35, loss: 1.147, val acc: 0.730 (best 0.730), test acc: 0.746 (best 0.742)
In epoch 40, loss: 0.980, val acc: 0.750 (best 0.750), test acc: 0.754 (best 0.754)
In epoch 45, loss: 0.819, val acc: 0.752 (best 0.752), test acc: 0.756 (best 0.758)
In epoch 50, loss: 0.672, val acc: 0.766 (best 0.766), test acc: 0.760 (best 0.759)
In epoch 55, loss: 0.545, val acc: 0.766 (best 0.766), test acc: 0.761 (best 0

In [None]:
# Example: compute classification accuracy on the nodes selected by `mask`.
def evaluate(model, graph, features, labels, mask):
    model.eval()
    with torch.no_grad():
        logits = model(graph, features)
        logits = logits[mask]
        labels = labels[mask]
        _, indices = torch.max(logits, dim=1)
        correct = torch.sum(indices == labels)
        return correct.item() * 1.0 / len(labels)

evaluate(model, g, g.ndata['feat'], g.ndata['label'], g.ndata['test_mask'])

0.749

## Training on GPU

In [None]:
g = g.to('cuda')

time: 2.15 ms (started: 2021-03-19 18:59:51 +00:00)


In [None]:
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes).to('cuda')
train(g, model)

In epoch 0, loss: 1.946, val acc: 0.224 (best 0.224), test acc: 0.249 (best 0.249)
In epoch 5, loss: 1.897, val acc: 0.558 (best 0.558), test acc: 0.596 (best 0.596)
In epoch 10, loss: 1.820, val acc: 0.696 (best 0.696), test acc: 0.696 (best 0.696)
In epoch 15, loss: 1.718, val acc: 0.696 (best 0.696), test acc: 0.697 (best 0.696)
In epoch 20, loss: 1.592, val acc: 0.700 (best 0.700), test acc: 0.718 (best 0.718)
In epoch 25, loss: 1.442, val acc: 0.708 (best 0.708), test acc: 0.724 (best 0.724)
In epoch 30, loss: 1.277, val acc: 0.728 (best 0.728), test acc: 0.743 (best 0.743)
In epoch 35, loss: 1.103, val acc: 0.742 (best 0.742), test acc: 0.746 (best 0.744)
In epoch 40, loss: 0.932, val acc: 0.738 (best 0.742), test acc: 0.754 (best 0.744)
In epoch 45, loss: 0.772, val acc: 0.744 (best 0.744), test acc: 0.758 (best 0.757)
In epoch 50, loss: 0.629, val acc: 0.752 (best 0.752), test acc: 0.764 (best 0.762)
In epoch 55, loss: 0.508, val acc: 0.760 (best 0.760), test acc: 0.760 (best 0

# How Does DGL Represent A Graph?

- Construct a graph in DGL from scratch.

- Assign node and edge features to a graph.

- Query properties of a DGL graph such as node degrees and connectivity.

- Transform a DGL graph into another graph.

- Load and save DGL graphs.

## DGL Graph Construction

In [None]:
import dgl
import numpy as np
import torch

g = dgl.graph(([0, 0, 0, 0, 0], [1, 2, 3, 4, 5]), num_nodes=6)
# Equivalently, PyTorch LongTensors also work.
g = dgl.graph((torch.LongTensor([0, 0, 0, 0, 0]), torch.LongTensor([1, 2, 3, 4, 5])), num_nodes=6)

# You can omit the number of nodes argument if you can tell the number of nodes from the edge list alone.
g = dgl.graph(([0, 0, 0, 0, 0], [1, 2, 3, 4, 5]))

In [None]:
# Print the source and destination nodes of every edge.
print(g.edges())

(tensor([0, 0, 0, 0, 0]), tensor([1, 2, 3, 4, 5]))


## Assigning Node and Edge Features to Graph

In [None]:
# Assign a 3-dimensional node feature vector for each node.
g.ndata['x'] = torch.randn(6, 3)
# Assign a 4-dimensional edge feature vector for each edge.
g.edata['a'] = torch.randn(5, 4)
# Assign a 5x4 node feature matrix for each node.  Node and edge features in DGL can be multi-dimensional.
g.ndata['y'] = torch.randn(6, 5, 4)

print(g.edata['a'])

tensor([[-1.6067, -0.3907,  1.5339,  0.4664],
        [ 1.4048,  0.4527,  0.1840, -0.4003],
        [-0.8095,  1.4550, -1.4049,  1.3023],
        [-1.2192, -1.8538, -0.5934,  1.0736],
        [ 1.1103,  1.1093, -0.6367, -1.8115]])


## Querying Graph Structures

In [None]:
print(g.num_nodes())
print(g.num_edges())
# Out degrees of the center node
print(g.out_degrees(0))
# In degrees of the center node - note that the graph is directed so the in degree should be 0.
print(g.in_degrees(0))

6
5
5
0
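
The four numbers above follow directly from the edge list. As a minimal sketch in plain Python (independent of DGL, and assuming the graph is fully described by its source and destination lists), the same quantities can be recomputed by counting:

```python
from collections import Counter

# The star graph from above: node 0 points to nodes 1..5.
src = [0, 0, 0, 0, 0]
dst = [1, 2, 3, 4, 5]

out_degree = Counter(src)           # how many edges leave each node
in_degree = Counter(dst)            # how many edges enter each node
num_nodes = max(src + dst) + 1      # node IDs are assumed to be 0..N-1

print(num_nodes, len(src))                 # 6 5
print(out_degree[0], in_degree[0])         # 5 0
```

Because the graph is directed, node 0 has five outgoing edges and no incoming ones; a `Counter` returns 0 for keys it has never seen, which is why the in-degree lookup works without special handling.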


## Graph Transformations

In [None]:
# Induce a subgraph from node 0, node 1 and node 3 from the original graph.
sg1 = g.subgraph([0, 1, 3])
# Induce a subgraph from edge 0, edge 1 and edge 3 from the original graph.
sg2 = g.edge_subgraph([0, 1, 3])

In [None]:
# The original IDs of each node in sg1
print(sg1.ndata[dgl.NID])
# The original IDs of each edge in sg1
print(sg1.edata[dgl.EID])
# The original IDs of each node in sg2
print(sg2.ndata[dgl.NID])
# The original IDs of each edge in sg2
print(sg2.edata[dgl.EID])

tensor([0, 1, 3])
tensor([0, 2])
tensor([0, 1, 2, 4])
tensor([0, 1, 3])
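
The difference between the two kinds of subgraph can be reproduced by hand. The sketch below uses hypothetical helper names and plain lists: a node-induced subgraph keeps the edges whose endpoints both survive, while an edge-induced subgraph keeps the nodes touched by the chosen edges. Node IDs in the edge-induced case are sorted here to match the output printed above.

```python
def node_subgraph(src, dst, nodes):
    """Keep the given nodes and every edge whose endpoints are both kept."""
    keep = set(nodes)
    edge_ids = [e for e, (u, v) in enumerate(zip(src, dst)) if u in keep and v in keep]
    return list(nodes), edge_ids

def edge_subgraph(src, dst, edge_ids):
    """Keep the given edges and every node that one of them touches."""
    touched = set()
    for e in edge_ids:
        touched.update((src[e], dst[e]))
    return sorted(touched), list(edge_ids)

src = [0, 0, 0, 0, 0]
dst = [1, 2, 3, 4, 5]

print(node_subgraph(src, dst, [0, 1, 3]))   # ([0, 1, 3], [0, 2])
print(edge_subgraph(src, dst, [0, 1, 3]))   # ([0, 1, 2, 4], [0, 1, 3])
```

The returned lists play the role of `dgl.NID` and `dgl.EID`: they record which original nodes and edges each subgraph element came from.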


In [None]:
# The original node feature of each node in sg1
print(sg1.ndata['x'])
# The original edge feature of each edge in sg1
print(sg1.edata['a'])
# The original node feature of each node in sg2
print(sg2.ndata['x'])
# The original edge feature of each edge in sg2
print(sg2.edata['a'])

tensor([[-1.7200,  1.9185, -1.2768],
        [ 0.6329,  1.2710,  0.0509],
        [ 1.7366,  0.5627, -0.9224]])
tensor([[-1.6067, -0.3907,  1.5339,  0.4664],
        [-0.8095,  1.4550, -1.4049,  1.3023]])
tensor([[-1.7200,  1.9185, -1.2768],
        [ 0.6329,  1.2710,  0.0509],
        [-0.4717,  0.8141,  0.0573],
        [-0.4007, -0.2372,  1.1367]])
tensor([[-1.6067, -0.3907,  1.5339,  0.4664],
        [ 1.4048,  0.4527,  0.1840, -0.4003],
        [-1.2192, -1.8538, -0.5934,  1.0736]])


In [None]:
newg = dgl.add_reverse_edges(g)
newg.edges()

(tensor([0, 0, 0, 0, 0, 1, 2, 3, 4, 5]),
 tensor([1, 2, 3, 4, 5, 0, 0, 0, 0, 0]))

## Loading and Saving Graphs

In [None]:
# Save graphs
dgl.save_graphs('graph.dgl', g)
dgl.save_graphs('graphs.dgl', [g, sg1, sg2])

# Load graphs
(g,), _ = dgl.load_graphs('graph.dgl')
print(g)
(g, sg1, sg2), _ = dgl.load_graphs('graphs.dgl')
print(g)
print(sg1)
print(sg2)

Graph(num_nodes=6, num_edges=5,
      ndata_schemes={'y': Scheme(shape=(5, 4), dtype=torch.float32), 'x': Scheme(shape=(3,), dtype=torch.float32)}
      edata_schemes={'a': Scheme(shape=(4,), dtype=torch.float32)})
Graph(num_nodes=6, num_edges=5,
      ndata_schemes={'y': Scheme(shape=(5, 4), dtype=torch.float32), 'x': Scheme(shape=(3,), dtype=torch.float32)}
      edata_schemes={'a': Scheme(shape=(4,), dtype=torch.float32)})
Graph(num_nodes=3, num_edges=2,
      ndata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), 'x': Scheme(shape=(3,), dtype=torch.float32), 'y': Scheme(shape=(5, 4), dtype=torch.float32)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), 'a': Scheme(shape=(4,), dtype=torch.float32)})
Graph(num_nodes=4, num_edges=3,
      ndata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), 'x': Scheme(shape=(3,), dtype=torch.float32), 'y': Scheme(shape=(5, 4), dtype=torch.float32)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), 'a': Scheme

# Write your own GNN module

- Understand DGL’s message passing APIs.

- Implement a GraphSAGE convolution module on your own.

In [None]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F

## Message passing and GNNs

In [None]:
import dgl.function as fn

class SAGEConv(nn.Module):
    """Graph convolution module used by the GraphSAGE model.

    Parameters
    ----------
    in_feat : int
        Input feature size.
    out_feat : int
        Output feature size.
    """
    def __init__(self, in_feat, out_feat):
        super(SAGEConv, self).__init__()
        # A linear submodule for projecting the input and neighbor feature to the output.
        self.linear = nn.Linear(in_feat * 2, out_feat)

    def forward(self, g, h):
        """Forward computation

        Parameters
        ----------
        g : Graph
            The input graph.
        h : Tensor
            The input node feature.
        """
        with g.local_scope():
            g.ndata['h'] = h
            # update_all is a message passing API.
            g.update_all(message_func=fn.copy_u('h', 'm'), reduce_func=fn.mean('m', 'h_N'))
            h_N = g.ndata['h_N']
            h_total = torch.cat([h, h_N], dim=1)
            return self.linear(h_total)
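
The `update_all` call in the module above combines a message step (`copy_u`: send the source feature along each edge) with a reduce step (`mean`: average whatever arrives at each node). The following plain-Python sketch spells out that computation on a tiny graph. The function name is hypothetical, and giving zeros to nodes that receive no messages is an assumption made for this sketch.

```python
def mean_neighbor_features(src, dst, h):
    """For every node, average the features of the nodes that point to it."""
    dim = len(h[0])
    mailbox = [[] for _ in h]
    for u, v in zip(src, dst):
        mailbox[v].append(h[u])          # message: copy the source feature
    h_n = []
    for msgs in mailbox:
        if msgs:
            h_n.append([sum(m[k] for m in msgs) / len(msgs) for k in range(dim)])
        else:
            h_n.append([0.0] * dim)      # assumption: no messages means a zero vector
    return h_n

src = [0, 1, 2, 2]
dst = [2, 2, 0, 1]
h = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

h_n = mean_neighbor_features(src, dst, h)
# Concatenate each node's own feature with its neighborhood mean, as torch.cat(..., dim=1) does.
h_total = [own + neigh for own, neigh in zip(h, h_n)]
print(h_total)   # [[1.0, 0.0, 5.0, 4.0], [3.0, 2.0, 5.0, 4.0], [5.0, 4.0, 2.0, 1.0]]
```

The concatenated vector has twice the input width, which is why the linear layer in `SAGEConv` is declared with `in_feat * 2` inputs.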

In [None]:
class Model(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(Model, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats)
        self.conv2 = SAGEConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h

## Training loop

In [None]:
import dgl.data

dataset = dgl.data.CoraGraphDataset()
g = dataset[0]

def train(g, model):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    all_logits = []
    best_val_acc = 0
    best_test_acc = 0

    features = g.ndata['feat']
    labels = g.ndata['label']
    train_mask = g.ndata['train_mask']
    val_mask = g.ndata['val_mask']
    test_mask = g.ndata['test_mask']
    for e in range(200):
        # Forward
        logits = model(g, features)

        # Compute prediction
        pred = logits.argmax(1)

        # Compute loss
        # Note that we should only compute the losses of the nodes in the training set,
        # i.e. with train_mask 1.
        loss = F.cross_entropy(logits[train_mask], labels[train_mask])

        # Compute accuracy on training/validation/test
        train_acc = (pred[train_mask] == labels[train_mask]).float().mean()
        val_acc = (pred[val_mask] == labels[val_mask]).float().mean()
        test_acc = (pred[test_mask] == labels[test_mask]).float().mean()

        # Save the best validation accuracy and the corresponding test accuracy.
        if best_val_acc < val_acc:
            best_val_acc = val_acc
            best_test_acc = test_acc

        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        all_logits.append(logits.detach())

        if e % 5 == 0:
            print('In epoch {}, loss: {:.3f}, val acc: {:.3f} (best {:.3f}), test acc: {:.3f} (best {:.3f})'.format(
                e, loss, val_acc, best_val_acc, test_acc, best_test_acc))

model = Model(g.ndata['feat'].shape[1], 16, dataset.num_classes)
train(g, model)

  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done loading data from cached files.
In epoch 0, loss: 1.952, val acc: 0.114 (best 0.114), test acc: 0.103 (best 0.103)
In epoch 5, loss: 1.873, val acc: 0.400 (best 0.408), test acc: 0.391 (best 0.392)
In epoch 10, loss: 1.725, val acc: 0.458 (best 0.458), test acc: 0.500 (best 0.500)
In epoch 15, loss: 1.502, val acc: 0.592 (best 0.592), test acc: 0.602 (best 0.602)
In epoch 20, loss: 1.211, val acc: 0.660 (best 0.660), test acc: 0.674 (best 0.674)
In epoch 25, loss: 0.886, val acc: 0.720 (best 0.720), test acc: 0.727 (best 0.727)
In epoch 30, loss: 0.581, val acc: 0.756 (best 0.756), test acc: 0.749 (best 0.749)
In epoch 35, loss: 0.349, val acc: 0.770 (best 0.770), test acc: 0.771 (best 0.771)
In epoch 40, loss: 0.200, val acc: 0.768 (best 0.770), test acc: 0.776 (best 0.771)
In epoch 45, loss: 0.116, val acc: 0.758 (best 0.770), test acc:

## More customization

In [None]:
class WeightedSAGEConv(nn.Module):
    """Graph convolution module used by the GraphSAGE model with edge weights.

    Parameters
    ----------
    in_feat : int
        Input feature size.
    out_feat : int
        Output feature size.
    """
    def __init__(self, in_feat, out_feat):
        super(WeightedSAGEConv, self).__init__()
        # A linear submodule for projecting the input and neighbor feature to the output.
        self.linear = nn.Linear(in_feat * 2, out_feat)

    def forward(self, g, h, w):
        """Forward computation

        Parameters
        ----------
        g : Graph
            The input graph.
        h : Tensor
            The input node feature.
        w : Tensor
            The edge weight.
        """
        with g.local_scope():
            g.ndata['h'] = h
            g.edata['w'] = w
            g.update_all(message_func=fn.u_mul_e('h', 'w', 'm'), reduce_func=fn.mean('m', 'h_N'))
            h_N = g.ndata['h_N']
            h_total = torch.cat([h, h_N], dim=1)
            return self.linear(h_total)

In [None]:
class Model(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(Model, self).__init__()
        self.conv1 = WeightedSAGEConv(in_feats, h_feats)
        self.conv2 = WeightedSAGEConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat, torch.ones(g.num_edges()).to(g.device))
        h = F.relu(h)
        h = self.conv2(g, h, torch.ones(g.num_edges()).to(g.device))
        return h

model = Model(g.ndata['feat'].shape[1], 16, dataset.num_classes)
train(g, model)

In epoch 0, loss: 1.949, val acc: 0.122 (best 0.122), test acc: 0.130 (best 0.130)
In epoch 5, loss: 1.863, val acc: 0.458 (best 0.458), test acc: 0.456 (best 0.456)
In epoch 10, loss: 1.713, val acc: 0.230 (best 0.458), test acc: 0.230 (best 0.456)
In epoch 15, loss: 1.494, val acc: 0.292 (best 0.458), test acc: 0.280 (best 0.456)
In epoch 20, loss: 1.218, val acc: 0.380 (best 0.458), test acc: 0.381 (best 0.456)
In epoch 25, loss: 0.910, val acc: 0.532 (best 0.532), test acc: 0.507 (best 0.507)
In epoch 30, loss: 0.616, val acc: 0.626 (best 0.626), test acc: 0.605 (best 0.605)
In epoch 35, loss: 0.382, val acc: 0.710 (best 0.710), test acc: 0.680 (best 0.680)
In epoch 40, loss: 0.224, val acc: 0.728 (best 0.728), test acc: 0.722 (best 0.722)
In epoch 45, loss: 0.130, val acc: 0.730 (best 0.730), test acc: 0.731 (best 0.723)
In epoch 50, loss: 0.078, val acc: 0.728 (best 0.732), test acc: 0.738 (best 0.736)
In epoch 55, loss: 0.049, val acc: 0.732 (best 0.732), test acc: 0.739 (best 0

## Even more customization by user-defined function

In [None]:
def u_mul_e_udf(edges):
    return {'m' : edges.src['h'] * edges.data['w']}

In [None]:
def sum_udf(nodes):
    return {'h': nodes.mailbox['m'].sum(1)}
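
In DGL, the two user-defined functions above would be passed to message passing as `g.update_all(u_mul_e_udf, sum_udf)`. What that pair computes can be sketched without DGL: each edge carries its source feature scaled by the edge weight, and each node sums the messages it receives. The helper name `weighted_sum` and the toy data are illustrative only.

```python
def weighted_sum(src, dst, h, w):
    """Message: source feature times edge weight. Reduce: sum of incoming messages."""
    dim = len(h[0])
    out = [[0.0] * dim for _ in h]
    for e, (u, v) in enumerate(zip(src, dst)):
        message = [w[e] * x for x in h[u]]                   # role of u_mul_e_udf for edge e
        out[v] = [a + b for a, b in zip(out[v], message)]    # role of sum_udf at node v
    return out

src = [0, 1, 2]
dst = [2, 2, 0]
h = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [0.5, 2.0, 1.0]

print(weighted_sum(src, dst, h, w))   # [[5.0, 6.0], [0.0, 0.0], [6.5, 9.0]]
```

Node 1 receives no messages in this toy graph, so its result stays at the zero initial value used by the sketch.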

## Best practice of writing custom GNN modules
DGL recommends the following practices, ranked by preference:

- Use `dgl.nn` modules.

- Use `dgl.nn.functional` functions, which provide lower-level operations such as computing a softmax for each node over its incoming edges.

- Use `update_all` with builtin message and reduce functions.

- Use user-defined message or reduce functions.

# Link Prediction using Graph Neural Networks

In [None]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
import itertools
import numpy as np
import scipy.sparse as sp

time: 2.55 ms (started: 2021-03-19 19:13:30 +00:00)


## Overview of Link Prediction with GNN
This tutorial formulates the link prediction problem as a binary classification problem as follows:

- Treat the edges in the graph as positive examples.

- Sample a number of non-existent edges (i.e. node pairs with no edges between them) as negative examples.

- Divide the positive examples and negative examples into a training set and a test set.

- Evaluate the model with any binary classification metric such as Area Under Curve (AUC).
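
The first three steps can be sketched in plain Python on a toy graph. This is a simplified illustration, not the sampling procedure used later in this notebook: it draws distinct negative pairs by rejection sampling, whereas the notebook enumerates all non-edges with a dense matrix and samples from them. The helper names, the seed, and the toy edge list are all hypothetical.

```python
import random

def sample_negative_edges(num_nodes, pos_edges, k, seed=0):
    """Draw k distinct node pairs (u, v), u != v, that are not positive edges."""
    rng = random.Random(seed)
    positives = set(pos_edges)
    negatives = set()
    while len(negatives) < k:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v and (u, v) not in positives:
            negatives.add((u, v))
    return sorted(negatives)

def split(examples, test_fraction, seed=0):
    """Shuffle the examples and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy directed graph with 6 nodes and 10 edges.
pos_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0),
             (0, 2), (1, 3), (2, 4), (3, 0), (4, 1)]
neg_edges = sample_negative_edges(6, pos_edges, k=len(pos_edges))

train_pos, test_pos = split(pos_edges, 0.1)
train_neg, test_neg = split(neg_edges, 0.1)
print(len(train_pos), len(test_pos), len(train_neg), len(test_neg))   # 9 1 9 1
```

Rejection sampling avoids materializing an N×N matrix, which matters on graphs much larger than Cora; the trade-off is that it slows down when almost every node pair is already an edge.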

## Loading graph and features

In [None]:
import dgl.data

dataset = dgl.data.CoraGraphDataset()
g = dataset[0]

  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done loading data from cached files.
time: 114 ms (started: 2021-03-19 19:13:34 +00:00)


## Prepare training and testing sets

In [None]:
# Split edge set for training and testing
u, v = g.edges()

eids = np.arange(g.number_of_edges())
eids = np.random.permutation(eids)
test_size = int(len(eids) * 0.1)
train_size = g.number_of_edges() - test_size
test_pos_u, test_pos_v = u[eids[:test_size]], v[eids[:test_size]]
train_pos_u, train_pos_v = u[eids[test_size:]], v[eids[test_size:]]

# Find all negative edges and split them for training and testing.
# adj is the sparse adjacency matrix of the graph.
adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy())))
# adj_neg is 1 at node pairs (i, j), i != j, that have no edge between them.
adj_neg = 1 - adj.todense() - np.eye(g.number_of_nodes())
neg_u, neg_v = np.where(adj_neg != 0)

neg_eids = np.random.choice(len(neg_u), g.number_of_edges() // 2)
test_neg_u, test_neg_v = neg_u[neg_eids[:test_size]], neg_v[neg_eids[:test_size]]
train_neg_u, train_neg_v = neg_u[neg_eids[test_size:]], neg_v[neg_eids[test_size:]]

In [None]:
train_g = dgl.remove_edges(g, eids[:test_size])

time: 9.63 ms (started: 2021-03-19 19:42:31 +00:00)


## Define a GraphSAGE model

In [None]:
from dgl.nn import SAGEConv

# ----------- 2. create model -------------- #
# build a two-layer GraphSAGE model
class GraphSAGE(nn.Module):
    def __init__(self, in_feats, h_feats):
        super(GraphSAGE, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, 'mean')
        self.conv2 = SAGEConv(h_feats, h_feats, 'mean')

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h

time: 6.44 ms (started: 2021-03-19 19:43:38 +00:00)


## Positive graph, negative graph, and `apply_edges`





In [39]:
train_pos_g = dgl.graph((train_pos_u, train_pos_v), num_nodes=g.number_of_nodes())
train_neg_g = dgl.graph((train_neg_u, train_neg_v), num_nodes=g.number_of_nodes())

test_pos_g = dgl.graph((test_pos_u, test_pos_v), num_nodes=g.number_of_nodes())
test_neg_g = dgl.graph((test_neg_u, test_neg_v), num_nodes=g.number_of_nodes())

time: 9.52 ms (started: 2021-03-19 20:09:09 +00:00)


In [40]:
import dgl.function as fn

class DotPredictor(nn.Module):
    def forward(self, g, h):
        with g.local_scope():
            g.ndata['h'] = h
            # Compute a new edge feature named 'score' by a dot-product between the
            # source node feature 'h' and destination node feature 'h'.
            g.apply_edges(fn.u_dot_v('h', 'h', 'score'))
            # u_dot_v returns a 1-element vector for each edge so you need to squeeze it.
            return g.edata['score'][:, 0]

time: 2.98 ms (started: 2021-03-19 20:09:11 +00:00)


In [41]:
class MLPPredictor(nn.Module):
    def __init__(self, h_feats):
        super().__init__()
        self.W1 = nn.Linear(h_feats * 2, h_feats)
        self.W2 = nn.Linear(h_feats, 1)

    def apply_edges(self, edges):
        """
        Computes a scalar score for each edge of the given graph.

        Parameters
        ----------
        edges :
            Has three members ``src``, ``dst`` and ``data``, each of
            which is a dictionary representing the features of the
            source nodes, the destination nodes, and the edges
            themselves.

        Returns
        -------
        dict
            A dictionary of new edge features.
        """
        h = torch.cat([edges.src['h'], edges.dst['h']], 1)
        return {'score': self.W2(F.relu(self.W1(h))).squeeze(1)}

    def forward(self, g, h):
        with g.local_scope():
            g.ndata['h'] = h
            g.apply_edges(self.apply_edges)
            return g.edata['score']

time: 9.52 ms (started: 2021-03-19 20:09:13 +00:00)


## Training loop



In [42]:
from sklearn.metrics import roc_auc_score

model = GraphSAGE(train_g.ndata['feat'].shape[1], 16)
# You can replace DotPredictor with MLPPredictor.
#pred = MLPPredictor(16)
pred = DotPredictor()

def compute_loss(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat([torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])])
    return F.binary_cross_entropy_with_logits(scores, labels)

def compute_auc(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score]).numpy()
    labels = torch.cat(
        [torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])]).numpy()
    return roc_auc_score(labels, scores)

time: 18.5 ms (started: 2021-03-19 20:09:17 +00:00)
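
`compute_loss` and `compute_auc` delegate to library routines. For intuition, here is a dependency-free sketch of both quantities, with hypothetical helper names: binary cross-entropy evaluated directly on raw scores (logits), and AUC as the fraction of (positive, negative) pairs in which the positive edge receives the higher score.

```python
import math

def bce_with_logits(scores, labels):
    """Mean binary cross-entropy computed from raw scores (logits)."""
    total = 0.0
    for s, y in zip(scores, labels):
        # max(s, 0) - s*y + log(1 + exp(-|s|)) is a numerically stable form.
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total / len(scores)

def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

pos = [2.0, 0.5, -0.3]    # toy scores for existing edges
neg = [-1.0, 0.7, -2.0]   # toy scores for sampled non-edges

loss = bce_with_logits(pos + neg, [1.0] * len(pos) + [0.0] * len(neg))
print(round(loss, 4))
print(round(auc(pos, neg), 4))   # 0.7778, i.e. 7 of the 9 pairs are ranked correctly
```

The pairwise loop is quadratic in the number of examples, so it is only suitable for illustration; a sort-based routine such as `roc_auc_score` is the practical choice.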


In [43]:
# ----------- 3. set up loss and optimizer -------------- #
# In this case, the loss is computed inside the training loop.
optimizer = torch.optim.Adam(itertools.chain(model.parameters(), pred.parameters()), lr=0.01)

# ----------- 4. training -------------------------------- #
all_logits = []
for e in range(100):
    # forward
    h = model(train_g, train_g.ndata['feat'])
    pos_score = pred(train_pos_g, h)
    neg_score = pred(train_neg_g, h)
    loss = compute_loss(pos_score, neg_score)

    # backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if e % 5 == 0:
        print('In epoch {}, loss: {}'.format(e, loss))

# ----------- 5. check results ------------------------ #
from sklearn.metrics import roc_auc_score
with torch.no_grad():
    pos_score = pred(test_pos_g, h)
    neg_score = pred(test_neg_g, h)
    print('AUC', compute_auc(pos_score, neg_score))



In epoch 0, loss: 0.6177338361740112
In epoch 5, loss: 0.6021549701690674
In epoch 10, loss: 0.5715932250022888
In epoch 15, loss: 0.5184945464134216
In epoch 20, loss: 0.44610998034477234
In epoch 25, loss: 0.3957952857017517
In epoch 30, loss: 0.34818312525749207
In epoch 35, loss: 0.3054206073284149
In epoch 40, loss: 0.2675386667251587
In epoch 45, loss: 0.23587919771671295
In epoch 50, loss: 0.20788024365901947
In epoch 55, loss: 0.18020963668823242
In epoch 60, loss: 0.15578201413154602
In epoch 65, loss: 0.1317504197359085
In epoch 70, loss: 0.110292449593544
In epoch 75, loss: 0.09030572324991226
In epoch 80, loss: 0.07257099449634552
In epoch 85, loss: 0.05705918371677399
In epoch 90, loss: 0.04370430111885071
In epoch 95, loss: 0.032618604600429535
AUC 0.8632950742346308
time: 4.39 s (started: 2021-03-19 20:09:19 +00:00)


# Training a GNN for Graph Classification
Train a graph classification model for a small dataset from the paper [How Powerful Are Graph Neural Networks](https://arxiv.org/abs/1810.00826).



In [44]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F

time: 1.46 ms (started: 2021-03-19 20:10:12 +00:00)


## Loading Data

In [45]:
import dgl.data

# Load the PROTEINS graph classification dataset and add a self-loop to every node.
dataset = dgl.data.GINDataset('PROTEINS', self_loop=True)

Downloading /root/.dgl/GINDataset.zip from https://raw.githubusercontent.com/weihua916/powerful-gnns/master/dataset.zip...
Extracting file to /root/.dgl/GINDataset
time: 23 s (started: 2021-03-19 20:10:17 +00:00)


In [46]:
print('Node feature dimensionality:', dataset.dim_nfeats)
print('Number of graph categories:', dataset.gclasses)

Node feature dimensionality: 3
Number of graph categories: 2
time: 1.56 ms (started: 2021-03-19 20:10:40 +00:00)


## Defining Data Loader

In [47]:
from dgl.dataloading import GraphDataLoader
from torch.utils.data.sampler import SubsetRandomSampler

num_examples = len(dataset)
num_train = int(num_examples * 0.8)

train_sampler = SubsetRandomSampler(torch.arange(num_train))
test_sampler = SubsetRandomSampler(torch.arange(num_train, num_examples))

train_dataloader = GraphDataLoader(
    dataset, sampler=train_sampler, batch_size=5, drop_last=False)
test_dataloader = GraphDataLoader(
    dataset, sampler=test_sampler, batch_size=5, drop_last=False)

time: 5.02 ms (started: 2021-03-19 20:12:41 +00:00)
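
The cell above assigns the first 80% of the indices to training and the remainder to testing. If the dataset happens to be stored grouped by label (an assumption worth checking, not something this notebook confirms), such a contiguous split can leave the test set with a very different class balance from the training set. A hedged alternative is to shuffle the indices before splitting; the helper below is hypothetical and uses only the standard library.

```python
import random

def shuffled_split(num_examples, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices) drawn from a random permutation."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_train = int(num_examples * train_fraction)
    return indices[:num_train], indices[num_train:]

# 100 is a placeholder for len(dataset).
train_idx, test_idx = shuffled_split(100)
print(len(train_idx), len(test_idx))   # 80 20
```

The two index lists could then be passed to `SubsetRandomSampler` in place of the `torch.arange(...)` ranges; fixing the seed keeps the split reproducible across runs.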


In [48]:
it = iter(train_dataloader)
batch = next(it)
print(batch)

[Graph(num_nodes=288, num_edges=1294,
      ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'attr': Scheme(shape=(3,), dtype=torch.float32)}
      edata_schemes={}), tensor([0, 0, 0, 0, 0])]
time: 9.23 ms (started: 2021-03-19 20:12:44 +00:00)


## A Batched Graph in DGL

In [49]:
batched_graph, labels = batch
print('Number of nodes for each graph element in the batch:', batched_graph.batch_num_nodes())
print('Number of edges for each graph element in the batch:', batched_graph.batch_num_edges())

# Recover the original graph elements from the minibatch
graphs = dgl.unbatch(batched_graph)
print('The original graphs in the minibatch:')
print(graphs)

Number of nodes for each graph element in the batch: tensor([38, 85, 48, 28, 89])
Number of edges for each graph element in the batch: tensor([198, 401, 196, 124, 375])
The original graphs in the minibatch:
[Graph(num_nodes=38, num_edges=198,
      ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'attr': Scheme(shape=(3,), dtype=torch.float32)}
      edata_schemes={}), Graph(num_nodes=85, num_edges=401,
      ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'attr': Scheme(shape=(3,), dtype=torch.float32)}
      edata_schemes={}), Graph(num_nodes=48, num_edges=196,
      ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'attr': Scheme(shape=(3,), dtype=torch.float32)}
      edata_schemes={}), Graph(num_nodes=28, num_edges=124,
      ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'attr': Scheme(shape=(3,), dtype=torch.float32)}
      edata_schemes={}), Graph(num_nodes=89, num_edges=375,
      ndata_schemes={'label': Scheme(shape=(), dtype=

## Define Model

In [50]:
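from dgl.nn import GraphConv

class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super().__init__()
        # Two graph convolution layers: input -> hidden -> class scores
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        # Store the per-node outputs, then average them within each graph
        # so that every graph in the batch gets a single prediction vector.
        g.ndata['h'] = h
        return dgl.mean_nodes(g, 'h')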

time: 8.86 ms (started: 2021-03-19 20:12:52 +00:00)
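
The model's forward pass ends with `dgl.mean_nodes(g, 'h')`, which reduces per-node outputs to one vector per graph in the batch. The sketch below illustrates that readout in plain Python under simplifying assumptions (features as nested lists, nodes of each graph stored contiguously); `mean_readout` is a hypothetical helper, not a DGL function.

```python
# Conceptual sketch only: `mean_readout` is a hypothetical helper, not a DGL function.

def mean_readout(node_feats, batch_num_nodes):
    """Average node feature vectors separately for each graph in a batch."""
    readouts = []
    start = 0
    for n in batch_num_nodes:
        segment = node_feats[start:start + n]
        dim = len(segment[0])
        # Column-wise mean over the nodes that belong to this graph.
        readouts.append([sum(row[d] for row in segment) / n for d in range(dim)])
        start += n
    return readouts


# Five nodes with 2-dimensional features; the first three belong to graph 0,
# the last two to graph 1.
node_feats = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [10.0, 0.0], [20.0, 2.0]]
graph_feats = mean_readout(node_feats, [3, 2])
print(graph_feats)  # [[2.0, 2.0], [15.0, 1.0]]
```

The result has one row per graph, which is why the model output can be compared directly against one label per graph.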


## Training Loop

In [51]:
# Create the model with given dimensions
model = GCN(dataset.dim_nfeats, 16, dataset.gclasses)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Train for 20 epochs over minibatches of graphs
model.train()
for epoch in range(20):
    for batched_graph, labels in train_dataloader:
        pred = model(batched_graph, batched_graph.ndata['attr'].float())
        loss = F.cross_entropy(pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Evaluate on the test set without tracking gradients
model.eval()
num_correct = 0
num_tests = 0
with torch.no_grad():
    for batched_graph, labels in test_dataloader:
        pred = model(batched_graph, batched_graph.ndata['attr'].float())
        num_correct += (pred.argmax(1) == labels).sum().item()
        num_tests += len(labels)

print('Test accuracy:', num_correct / num_tests)

Test accuracy: 0.21524663677130046
time: 9.85 s (started: 2021-03-19 20:12:59 +00:00)
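
The evaluation loop accumulates `num_correct` and `num_tests` across batches and divides once at the end. When batches differ in size, this is not the same as averaging per-batch accuracies. A small sketch with made-up predictions and labels (hypothetical values, not taken from the dataset) shows the difference:

```python
# Hypothetical per-batch predictions and labels, standing in for the test dataloader.
batches = [
    ([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]),  # 4 of 5 correct
    ([1, 1, 0], [0, 1, 1]),              # 1 of 3 correct
]

num_correct = 0
num_tests = 0
for preds, labels in batches:
    num_correct += sum(int(p == y) for p, y in zip(preds, labels))
    num_tests += len(labels)

# Overall accuracy weights every example equally.
overall_accuracy = num_correct / num_tests
# Averaging per-batch accuracies would over-weight the smaller batch.
mean_of_batch_accuracies = (4 / 5 + 1 / 3) / 2
print(overall_accuracy)  # 0.625
```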


# Make Your Own Dataset
Create your own graph dataset for node classification, link prediction, or graph classification.

## `DGLDataset` Object Overview

Your custom graph dataset should inherit from the `dgl.data.DGLDataset` class and implement the following methods:

`__getitem__(self, i)`: retrieve the i-th example of the dataset. An example often contains a single DGL graph, and occasionally its label.

`__len__(self)`: the number of examples in the dataset.

`process(self)`: load and process raw data from disk.
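
The datasets defined later in this section are constructed with a plain `Dataset()` call and never invoke `process()` themselves, so the base class is expected to do that during construction. The sketch below mimics that contract in plain Python. `MiniDataset` and `ToyDataset` are hypothetical names for illustration; the sketch does not reproduce `DGLDataset`'s download or caching behavior.

```python
# Conceptual sketch only: `MiniDataset` is a hypothetical stand-in for the base
# class and does not reproduce DGLDataset's caching or download logic.

class MiniDataset:
    def __init__(self, name):
        self.name = name
        # Assumption for this sketch: the base constructor triggers process().
        self.process()

    def process(self):
        raise NotImplementedError

    def __getitem__(self, i):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError


class ToyDataset(MiniDataset):
    def __init__(self):
        super().__init__(name='toy')

    def process(self):
        # Stand-in for loading raw data from disk: (edge list, label) pairs.
        self.examples = [
            ([(0, 1), (1, 2)], 0),
            ([(0, 1)], 1),
        ]

    def __getitem__(self, i):
        return self.examples[i]

    def __len__(self):
        return len(self.examples)


toy = ToyDataset()
print(len(toy))  # 2
print(toy[1])    # ([(0, 1)], 1)
```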

## Creating a Dataset for Node Classification or Link Prediction from CSV

In [None]:
import urllib.request
import pandas as pd
urllib.request.urlretrieve(
    'https://data.dgl.ai/tutorial/dataset/members.csv', './members.csv')
urllib.request.urlretrieve(
    'https://data.dgl.ai/tutorial/dataset/interactions.csv', './interactions.csv')

members = pd.read_csv('./members.csv')
members.head()

interactions = pd.read_csv('./interactions.csv')
interactions.head()

|   | Src | Dst | Weight |
|---|-----|-----|--------|
| 0 | 0 | 1 | 0.043591 |
| 1 | 0 | 2 | 0.282119 |
| 2 | 0 | 3 | 0.370293 |
| 3 | 0 | 4 | 0.73057 |
| 4 | 0 | 5 | 0.821187 |


In [None]:
import dgl
from dgl.data import DGLDataset
import torch
import pandas as pd

class KarateClubDataset(DGLDataset):
    def __init__(self):
        super().__init__(name='karate_club')

    def process(self):
        nodes_data = pd.read_csv('./members.csv')
        edges_data = pd.read_csv('./interactions.csv')
        node_features = torch.from_numpy(nodes_data['Age'].to_numpy())
        node_labels = torch.from_numpy(nodes_data['Club'].astype('category').cat.codes.to_numpy())
        edge_features = torch.from_numpy(edges_data['Weight'].to_numpy())
        edges_src = torch.from_numpy(edges_data['Src'].to_numpy())
        edges_dst = torch.from_numpy(edges_data['Dst'].to_numpy())

        self.graph = dgl.graph((edges_src, edges_dst), num_nodes=nodes_data.shape[0])
        self.graph.ndata['feat'] = node_features
        self.graph.ndata['label'] = node_labels
        self.graph.edata['weight'] = edge_features

        # If your dataset is a node classification dataset, you will need to assign
        # masks indicating whether a node belongs to training, validation, and test set.
        n_nodes = nodes_data.shape[0]
        n_train = int(n_nodes * 0.6)
        n_val = int(n_nodes * 0.2)
        train_mask = torch.zeros(n_nodes, dtype=torch.bool)
        val_mask = torch.zeros(n_nodes, dtype=torch.bool)
        test_mask = torch.zeros(n_nodes, dtype=torch.bool)
        train_mask[:n_train] = True
        val_mask[n_train:n_train + n_val] = True
        test_mask[n_train + n_val:] = True
        self.graph.ndata['train_mask'] = train_mask
        self.graph.ndata['val_mask'] = val_mask
        self.graph.ndata['test_mask'] = test_mask

    def __getitem__(self, i):
        return self.graph

    def __len__(self):
        return 1

dataset = KarateClubDataset()
graph = dataset[0]

print(graph)

Graph(num_nodes=34, num_edges=156,
      ndata_schemes={'feat': Scheme(shape=(), dtype=torch.int64), 'label': Scheme(shape=(), dtype=torch.int8), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool)}
      edata_schemes={'weight': Scheme(shape=(), dtype=torch.float64)})
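
The mask arithmetic in `process` can be checked by hand for the 34 nodes reported above. The sketch below redoes the 60/20/20 split in plain Python with boolean lists instead of tensors; `make_split_masks` is a hypothetical helper written for this illustration.

```python
def make_split_masks(n_nodes, train_frac=0.6, val_frac=0.2):
    """Return boolean train/val/test masks over node IDs 0..n_nodes-1."""
    n_train = int(n_nodes * train_frac)
    n_val = int(n_nodes * val_frac)
    train_mask = [i < n_train for i in range(n_nodes)]
    val_mask = [n_train <= i < n_train + n_val for i in range(n_nodes)]
    test_mask = [i >= n_train + n_val for i in range(n_nodes)]
    return train_mask, val_mask, test_mask


train_mask, val_mask, test_mask = make_split_masks(34)
print(sum(train_mask), sum(val_mask), sum(test_mask))  # 20 6 8
# Every node belongs to exactly one split.
print(all(t + v + s == 1 for t, v, s in zip(train_mask, val_mask, test_mask)))  # True
```

Because the masks are contiguous slices, the split follows the row order of `members.csv`. A randomized split would need the node IDs to be shuffled first.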


  


## Creating a Dataset for Graph Classification from CSV
This tutorial demonstrates how to create a graph classification dataset with the following synthetic CSV data:

- `graph_edges.csv`: containing three columns:

  - `graph_id`: the ID of the graph.

  - `src`: the source node of an edge of the given graph.

  - `dst`: the destination node of an edge of the given graph.

- `graph_properties.csv`: containing three columns:

  - `graph_id`: the ID of the graph.

  - `label`: the label of the graph.

  - `num_nodes`: the number of nodes in the graph.
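
To make the two-file layout concrete, the sketch below parses miniature in-memory versions of both tables with the standard library and groups the edges by `graph_id`. The CSV contents are hypothetical and only follow the column layout described above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature versions of the two CSV files.
EDGES_CSV = """graph_id,src,dst
0,0,1
0,1,2
1,0,1
"""
PROPERTIES_CSV = """graph_id,label,num_nodes
0,1,3
1,0,2
"""

# graph_id -> list of (src, dst) pairs.
edges_by_graph = defaultdict(list)
for row in csv.DictReader(io.StringIO(EDGES_CSV)):
    edges_by_graph[int(row['graph_id'])].append((int(row['src']), int(row['dst'])))

# graph_id -> (label, num_nodes).
properties_by_graph = {
    int(row['graph_id']): (int(row['label']), int(row['num_nodes']))
    for row in csv.DictReader(io.StringIO(PROPERTIES_CSV))
}

for graph_id, edge_list in sorted(edges_by_graph.items()):
    label, num_nodes = properties_by_graph[graph_id]
    print(graph_id, label, num_nodes, edge_list)
```

Storing `num_nodes` separately matters because a node with no edges never appears in the edge table, so the node count cannot always be recovered from the edges alone.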

In [None]:
urllib.request.urlretrieve(
    'https://data.dgl.ai/tutorial/dataset/graph_edges.csv', './graph_edges.csv')
urllib.request.urlretrieve(
    'https://data.dgl.ai/tutorial/dataset/graph_properties.csv', './graph_properties.csv')
edges = pd.read_csv('./graph_edges.csv')
properties = pd.read_csv('./graph_properties.csv')

edges.head()

properties.head()

class SyntheticDataset(DGLDataset):
    def __init__(self):
        super().__init__(name='synthetic')

    def process(self):
        edges = pd.read_csv('./graph_edges.csv')
        properties = pd.read_csv('./graph_properties.csv')
        self.graphs = []
        self.labels = []

        # Create a graph for each graph ID from the edges table.
        # First process the properties table into two dictionaries with graph IDs as keys.
        # The label and number of nodes are values.
        label_dict = {}
        num_nodes_dict = {}
        for _, row in properties.iterrows():
            label_dict[row['graph_id']] = row['label']
            num_nodes_dict[row['graph_id']] = row['num_nodes']

        # For the edges, first group the table by graph IDs.
        edges_group = edges.groupby('graph_id')

        # For each graph ID...
        for graph_id in edges_group.groups:
            # Find the edges as well as the number of nodes and its label.
            edges_of_id = edges_group.get_group(graph_id)
            src = edges_of_id['src'].to_numpy()
            dst = edges_of_id['dst'].to_numpy()
            num_nodes = num_nodes_dict[graph_id]
            label = label_dict[graph_id]

            # Create a graph and add it to the list of graphs and labels.
            g = dgl.graph((src, dst), num_nodes=num_nodes)
            self.graphs.append(g)
            self.labels.append(label)

        # Convert the label list to tensor for saving.
        self.labels = torch.LongTensor(self.labels)

    def __getitem__(self, i):
        return self.graphs[i], self.labels[i]

    def __len__(self):
        return len(self.graphs)

dataset = SyntheticDataset()
graph, label = dataset[0]
print(graph, label)



Graph(num_nodes=15, num_edges=45,
      ndata_schemes={}
      edata_schemes={}) tensor(0)
