The following is from [this article](https://medium.com/towards-data-science/a-beginners-guide-to-graph-neural-networks-using-pytorch-geometric-part-1-d98dc93e7742) in Medium.

# 1. Getting started.

Let’s pick a simple graph dataset like [Zachary’s Karate Club](https://en.wikipedia.org/wiki/Zachary%27s_karate_club). Here, the nodes represent 34 students who were involved in the club and the links represent 78 different interactions between pairs of members outside the club. There are two different types of labels i.e, the two factions. We can use this information to formulate a node classification task.

We divide the graph into train and test sets where we use the train set to build a graph neural network model and use the model to predict the missing node labels in the test set.

Here, we use [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric) (PyG) python library to model the graph neural network. Alternatively, [Deep Graph Library](https://docs.dgl.ai/) (DGL) can also be used for the same purpose.

PyTorch Geometric is a geometric deep learning library built on top of PyTorch. Several popular graph neural network methods have been implemented using PyG and you can play around with the code using built-in datasets or create your own dataset. PyG uses a nifty implementation where it provides an [InMemoryDataset](https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html) class which can be used to create the custom dataset (*Note: InMemoryDataset should be used for datasets small enough to load in the memory*).

# 2. Formulate the problem.

In order to formulate the problem, we need:

1. The graph itself and the labels for each node
2. The edge data in the [Coordinate Format](https://scipy-lectures.org/advanced/scipy_sparse/coo_matrix.html) (COO)
3. Embeddings or numerical representations for the nodes

> Note: For the numerical representation for nodes, we can use graph properties like degree or use different embedding generation methods like [node2vec](https://github.com/eliorc/node2vec), [DeepWalk](https://github.com/phanein/deepwalk) etc. In this example, I will be using node degree as its numerical representation.

# 3. Preparations.

In [1]:
import networkx as nx
import numpy as np
import torch
from sklearn.preprocessing import StandardScaler

In [2]:
# load graph from networkx library
G = nx.karate_club_graph()

In [3]:
G.nodes

NodeView((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33))

In [4]:
G.nodes[0]

{'club': 'Mr. Hi'}

In [5]:
G.nodes[33]

{'club': 'Officer'}

In [6]:
# retrieve the labels for each node
labels = np.asarray([G.nodes[i]["club"] != "Mr. Hi" for i in G.nodes]).astype(np.int64)

In [7]:
labels

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

In [8]:
G.edges

EdgeView([(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 10), (0, 11), (0, 12), (0, 13), (0, 17), (0, 19), (0, 21), (0, 31), (1, 2), (1, 3), (1, 7), (1, 13), (1, 17), (1, 19), (1, 21), (1, 30), (2, 3), (2, 7), (2, 8), (2, 9), (2, 13), (2, 27), (2, 28), (2, 32), (3, 7), (3, 12), (3, 13), (4, 6), (4, 10), (5, 6), (5, 10), (5, 16), (6, 16), (8, 30), (8, 32), (8, 33), (9, 33), (13, 33), (14, 32), (14, 33), (15, 32), (15, 33), (18, 32), (18, 33), (19, 33), (20, 32), (20, 33), (22, 32), (22, 33), (23, 25), (23, 27), (23, 29), (23, 32), (23, 33), (24, 25), (24, 27), (24, 31), (25, 31), (26, 29), (26, 33), (27, 33), (28, 31), (28, 33), (29, 32), (29, 33), (30, 32), (30, 33), (31, 32), (31, 33), (32, 33)])

In [9]:
# create edge index from
adj = nx.to_scipy_sparse_array(G).tocoo()
row = torch.from_numpy(adj.row.astype(np.int64)).to(torch.long)
col = torch.from_numpy(adj.col.astype(np.int64)).to(torch.long)
edge_index = torch.stack([row, col], dim=0)

In [10]:
adj.row

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  2,  2,  2,  2,
        2,  3,  3,  3,  3,  3,  3,  4,  4,  4,  5,  5,  5,  5,  6,  6,  6,
        6,  7,  7,  7,  7,  8,  8,  8,  8,  8,  9,  9, 10, 10, 10, 11, 12,
       12, 13, 13, 13, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19,
       19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 25,
       25, 25, 26, 26, 27, 27, 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30,
       30, 30, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32,
       32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
       33, 33, 33])

In [11]:
adj.col

array([ 1,  2,  3,  4,  5,  6,  7,  8, 10, 11, 12, 13, 17, 19, 21, 31,  0,
        2,  3,  7, 13, 17, 19, 21, 30,  0,  1,  3,  7,  8,  9, 13, 27, 28,
       32,  0,  1,  2,  7, 12, 13,  0,  6, 10,  0,  6, 10, 16,  0,  4,  5,
       16,  0,  1,  2,  3,  0,  2, 30, 32, 33,  2, 33,  0,  4,  5,  0,  0,
        3,  0,  1,  2,  3, 33, 32, 33, 32, 33,  5,  6,  0,  1, 32, 33,  0,
        1, 33, 32, 33,  0,  1, 32, 33, 25, 27, 29, 32, 33, 25, 27, 31, 23,
       24, 31, 29, 33,  2, 23, 24, 33,  2, 31, 33, 23, 26, 32, 33,  1,  8,
       32, 33,  0, 24, 25, 28, 32, 33,  2,  8, 14, 15, 18, 20, 22, 23, 29,
       30, 31, 33,  8,  9, 13, 14, 15, 18, 19, 20, 22, 23, 26, 27, 28, 29,
       30, 31, 32])

In [12]:
adj.data

array([4, 5, 3, 3, 3, 3, 2, 2, 2, 3, 1, 3, 2, 2, 2, 2, 4, 6, 3, 4, 5, 1,
       2, 2, 2, 5, 6, 3, 4, 5, 1, 3, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 2, 3,
       3, 5, 3, 3, 3, 2, 5, 3, 2, 4, 4, 3, 2, 5, 3, 3, 4, 1, 2, 2, 3, 3,
       3, 1, 3, 3, 5, 3, 3, 3, 3, 2, 3, 4, 3, 3, 2, 1, 1, 2, 2, 2, 1, 3,
       1, 2, 2, 2, 3, 5, 4, 3, 5, 4, 2, 3, 2, 5, 2, 7, 4, 2, 2, 4, 3, 4,
       2, 2, 2, 3, 4, 4, 2, 2, 3, 3, 3, 2, 2, 7, 2, 4, 4, 2, 3, 3, 3, 1,
       3, 2, 5, 4, 3, 4, 5, 4, 2, 3, 2, 4, 2, 1, 1, 3, 4, 2, 4, 2, 2, 3,
       4, 5], dtype=int32)

In [13]:
edge_index

tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  3,
          3,  3,  3,  3,  3,  4,  4,  4,  5,  5,  5,  5,  6,  6,  6,  6,  7,  7,
          7,  7,  8,  8,  8,  8,  8,  9,  9, 10, 10, 10, 11, 12, 12, 13, 13, 13,
         13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 20, 20, 21,
         21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 25, 25, 25, 26, 26, 27, 27,
         27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31,
         31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33,
         33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33],
        [ 1,  2,  3,  4,  5,  6,  7,  8, 10, 11, 12, 13, 17, 19, 21, 31,  0,  2,
          3,  7, 13, 17, 19, 21, 30,  0,  1,  3,  7,  8,  9, 13, 27, 28, 32,  0,
          1,  2,  7, 12, 13,  0,  6, 10,  0,  6, 10, 16,  0,  4,  5, 16,  0,  1,
          2,  3,  0,  2, 30, 32, 33,  2, 33,  0,  4

In [14]:
# using degree as embedding
embeddings = np.array(list(dict(G.degree()).values()))

In [15]:
G.degree()  # number of edges adjacent to the node

DegreeView({0: 16, 1: 9, 2: 10, 3: 6, 4: 3, 5: 4, 6: 4, 7: 4, 8: 5, 9: 2, 10: 3, 11: 1, 12: 2, 13: 5, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 3, 20: 2, 21: 2, 22: 2, 23: 5, 24: 3, 25: 3, 26: 2, 27: 4, 28: 3, 29: 4, 30: 4, 31: 6, 32: 12, 33: 17})

In [16]:
embeddings

array([16,  9, 10,  6,  3,  4,  4,  4,  5,  2,  3,  1,  2,  5,  2,  2,  2,
        2,  2,  3,  2,  2,  2,  5,  3,  3,  2,  4,  3,  4,  4,  6, 12, 17])

In [17]:
# normalizing degree values
scale = StandardScaler()
embeddings = scale.fit_transform(embeddings.reshape(-1, 1))

In [18]:
embeddings

array([[ 2.98709092],
       [ 1.15480319],
       [ 1.41655858],
       [ 0.36953702],
       [-0.41572915],
       [-0.15397376],
       [-0.15397376],
       [-0.15397376],
       [ 0.10778163],
       [-0.67748454],
       [-0.41572915],
       [-0.93923993],
       [-0.67748454],
       [ 0.10778163],
       [-0.67748454],
       [-0.67748454],
       [-0.67748454],
       [-0.67748454],
       [-0.67748454],
       [-0.41572915],
       [-0.67748454],
       [-0.67748454],
       [-0.67748454],
       [ 0.10778163],
       [-0.41572915],
       [-0.41572915],
       [-0.67748454],
       [-0.15397376],
       [-0.41572915],
       [-0.15397376],
       [-0.15397376],
       [ 0.36953702],
       [ 1.94006936],
       [ 3.24884631]])

The karate club dataset can be loaded directly from the NetworkX library. We retrieve the labels from the graph and create an edge index in the coordinate format. The node degree was used as embeddings/ numerical representations for the nodes (In the case of a directed graph, in-degree can be used for the same purpose). Since degree values tend to be diverse, we normalize them before using the values as input to the GNN model.

With this, we have prepared all the necessary parts to construct the Pytorch Geometric custom dataset.

# 4. The Custom Dataset.

In [19]:
import pandas as pd
import torch_geometric.transforms as T
from sklearn.model_selection import train_test_split
from torch_geometric.data import Data, InMemoryDataset

In [20]:
# custom dataset
class KarateDataset(InMemoryDataset):
    def __init__(self, transform=None):
        super(KarateDataset, self).__init__(".", transform, None, None)

        data = Data(edge_index=edge_index)

        data.num_nodes = G.number_of_nodes()

        # embedding
        data.x = torch.from_numpy(embeddings).type(torch.float32)

        # labels
        y = torch.from_numpy(labels).type(torch.long)
        data.y = y.clone().detach()

        data.num_classes = 2

        # splitting the data into train, validation and test
        X_train, X_test, y_train, y_test = train_test_split(
            pd.Series(list(G.nodes())),
            pd.Series(labels),
            test_size=0.30,
            random_state=42,
        )

        n_nodes = G.number_of_nodes()

        # create train and test masks for data
        train_mask = torch.zeros(n_nodes, dtype=torch.bool)
        test_mask = torch.zeros(n_nodes, dtype=torch.bool)
        train_mask[X_train.index] = True
        test_mask[X_test.index] = True
        data["train_mask"] = train_mask
        data["test_mask"] = test_mask

        self.data, self.slices = self.collate([data])

    def _download(self):
        return

    def _process(self):
        return

    def __repr__(self):
        return "{}()".format(self.__class__.__name__)

In [21]:
dataset = KarateDataset()

In [22]:
dataset[0]

Data(edge_index=[2, 156], num_nodes=34, x=[34, 1], y=[34], num_classes=2, train_mask=[34], test_mask=[34])

In [23]:
data = dataset[0]

In [24]:
data.train_mask

tensor([False,  True,  True,  True,  True,  True,  True,  True, False, False,
         True,  True, False,  True,  True, False,  True,  True,  True, False,
         True, False,  True,  True, False,  True, False, False,  True,  True,
         True,  True, False,  True])

In [25]:
data.train_mask.sum()

tensor(23)

In [26]:
data.test_mask

tensor([ True, False, False, False, False, False, False, False,  True,  True,
        False, False,  True, False, False,  True, False, False, False,  True,
        False,  True, False, False,  True, False,  True,  True, False, False,
        False, False,  True, False])

In [27]:
data.test_mask.sum()

tensor(11)

The KarateDataset class inherits from the InMemoryDataset class and use a Data object to collate all information relating to the karate club dataset. The graph data is then split into train and test sets, thereby creating the train and test masks using the splits.

This custom dataset can now be used with several graph neural network models from the Pytorch Geometric library. Let’s pick a Graph Convolutional Network model and use it to predict the missing labels on the test set.

> *Note: PyG library focuses more on node classification task but it can also be used for link prediction.*

# 5. Graph Convolutional Network.

In [28]:
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

In [29]:
data.num_features

1

In [30]:
data.x

tensor([[ 2.9871],
        [ 1.1548],
        [ 1.4166],
        [ 0.3695],
        [-0.4157],
        [-0.1540],
        [-0.1540],
        [-0.1540],
        [ 0.1078],
        [-0.6775],
        [-0.4157],
        [-0.9392],
        [-0.6775],
        [ 0.1078],
        [-0.6775],
        [-0.6775],
        [-0.6775],
        [-0.6775],
        [-0.6775],
        [-0.4157],
        [-0.6775],
        [-0.6775],
        [-0.6775],
        [ 0.1078],
        [-0.4157],
        [-0.4157],
        [-0.6775],
        [-0.1540],
        [-0.4157],
        [-0.1540],
        [-0.1540],
        [ 0.3695],
        [ 1.9401],
        [ 3.2488]])

In [31]:
data.x.shape

torch.Size([34, 1])

In [32]:
data.edge_index

tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  3,
          3,  3,  3,  3,  3,  4,  4,  4,  5,  5,  5,  5,  6,  6,  6,  6,  7,  7,
          7,  7,  8,  8,  8,  8,  8,  9,  9, 10, 10, 10, 11, 12, 12, 13, 13, 13,
         13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 20, 20, 21,
         21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 25, 25, 25, 26, 26, 27, 27,
         27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31,
         31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33,
         33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33],
        [ 1,  2,  3,  4,  5,  6,  7,  8, 10, 11, 12, 13, 17, 19, 21, 31,  0,  2,
          3,  7, 13, 17, 19, 21, 30,  0,  1,  3,  7,  8,  9, 13, 27, 28, 32,  0,
          1,  2,  7, 12, 13,  0,  6, 10,  0,  6, 10, 16,  0,  4,  5, 16,  0,  1,
          2,  3,  0,  2, 30, 32, 33,  2, 33,  0,  4

In [33]:
# GCN model with 2 layers
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = GCNConv(data.num_features, 16)
        self.conv2 = GCNConv(16, int(data.num_classes))

    def forward(self):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(
            x, training=self.training
        )  # self.training becomes either True or False based on model.train()/eval()
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)  # Applies a softmax followed by a logarithm

In [34]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [35]:
device

device(type='cuda')

In [36]:
data = data.to(device)

In [37]:
model = Net().to(device)

The GCN model is built with 2 hidden layers and each hidden layer contains 16 neurons. Let’s train the model!

# 6. Train the GCN model.

In [38]:
torch.manual_seed(42)

<torch._C.Generator at 0x20f91b948f0>

In [39]:
optimizer_name = "Adam"
lr = 1e-1
optimizer = getattr(torch.optim, optimizer_name)(model.parameters(), lr=lr)
epochs = 200

In [40]:
def train():
    model.train()
    optimizer.zero_grad()
    F.nll_loss(model()[data.train_mask], data.y[data.train_mask]).backward()
    optimizer.step()

In [41]:
@torch.no_grad()
def test():
    model.eval()
    logits = model()
    mask1 = data["train_mask"]
    pred1 = logits[mask1].max(1)[1]
    acc1 = pred1.eq(data.y[mask1]).sum().item() / mask1.sum().item()
    mask = data["test_mask"]
    pred = logits[mask].max(1)[1]
    acc = pred.eq(data.y[mask]).sum().item() / mask.sum().item()  # eq == equal
    return acc1, acc

In [42]:
for epoch in range(1, epochs):
    train()

In [43]:
train_acc, test_acc = test()

In [44]:
print("#" * 70)
print("Train Accuracy: %s" % train_acc)
print("Test Accuracy: %s" % test_acc)
print("#" * 70)

######################################################################
Train Accuracy: 0.8695652173913043
Test Accuracy: 0.7272727272727273
######################################################################


# 7. A Summary.

To summarize everything we have done so far:

1. Generate numerical representations for each node in the graph (node degree in this case).
2. Construct a PyG custom dataset and split data into train and test.
3. Use a GNN model like GCN and train the model.
4. Make predictions on the test set and calculate the accuracy score.