# New developments in deep learning - R-GCNs
Github repository: https://github.com/phucdev/NDinDL

Isotropic Graph Neural Networks 
- Representations are learned via differentiable message passing scheme 
-  All neighbors are treated as equally important 
-  Starting points: 
  -  Kipf & Welling: “Semi-Supervised Classification with Graph Convolutional Networks” (https://arxiv.org/abs/1609.02907) 
  -  Schlichtkrull et al.: “Modeling Relational Data with Graph Convolutional Networks” (https://arxiv.org/abs/1703.06103) 
- Task: 
  - Implement a Relational Graph Convolutional Network (R-GCN) for node classification

Blog posts:
- https://towardsdatascience.com/how-to-do-deep-learning-on-graphs-with-graph-convolutional-networks-7d2250723780
- http://tkipf.github.io/graph-convolutional-networks/

Jupyter notebook tutorial:
- https://github.com/TobiasSkovgaardJepsen/posts/blob/master/HowToDoDeepLearningOnGraphsWithGraphConvolutionalNetworks/Part2_SemiSupervisedLearningWithSpectralGraphConvolutions/notebook.ipynb

Keras implementation:
- https://github.com/tkipf/relational-gcn

PyTorch implementations:
- https://github.com/tkipf/pygcn 
- https://github.com/masakicktashiro/rgcn_pytorch_implementation
- https://github.com/mjDelta/relation-gcn-pytorch
- https://docs.dgl.ai/en/0.4.x/tutorials/models/1_gnn/4_rgcn.html
- https://github.com/rusty1s/pytorch_geometric/blob/master/examples/rgcn.py


Datasets directly available via `torch_geometric.datasets.Entities`: AIFB, MUTAG
- Overview: https://www.uni-mannheim.de/dws/research/resources/sw4ml-benchmark/
- Download: http://data.dws.informatik.uni-mannheim.de/rmlod/LOD_ML_Datasets/data/datasets/RDF_Datasets/

```tex
@InProceedings{10.1007/978-3-319-46547-0_20,
  author="Ristoski, Petar
  and de Vries, Gerben Klaas Dirk
  and Paulheim, Heiko",
  editor="Groth, Paul
  and Simperl, Elena
  and Gray, Alasdair
  and Sabou, Marta
  and Kr{\"o}tzsch, Markus
  and Lecue, Freddy
  and Fl{\"o}ck, Fabian
  and Gil, Yolanda",
  title="A Collection of Benchmark Datasets for Systematic Evaluations of Machine Learning on the Semantic Web",
  booktitle="The Semantic Web -- ISWC 2016",
  year="2016",
  publisher="Springer International Publishing",
  address="Cham",
  pages="186--194",
  abstract="In the recent years, several approaches for machine learning on the Semantic Web have been proposed. However, no extensive comparisons between those approaches have been undertaken, in particular due to a lack of publicly available, acknowledged benchmark datasets. In this paper, we present a collection of 22 benchmark datasets of different sizes. Such a collection of datasets can be used to conduct quantitative performance testing and systematic comparisons of approaches.",
  isbn="978-3-319-46547-0"
}
```

Dataset information:
- The AIFB dataset describes the AIFB research institute in terms of its staff, research groups, and publications. In the original paper, the dataset was first used to predict the affiliation (i.e., research group) of people in the dataset. The dataset contains 178 members of a research group; however, the smallest group contains only 4 people and is removed from the dataset, leaving 4 classes. The employs relation, which is the inverse of the affiliation relation, is also removed.
(176 labeled instances, 4 classes, 8k entities, 45 relations, 28k edges)
- The MUTAG dataset is distributed as an example dataset for the DL-Learner toolkit. It contains information about complex molecules that are potentially carcinogenic, which is given by the `isMutagenic` property. MUTAG is a dataset of molecular graphs, which was later converted to RDF format, where relations either indicate atomic bonds or merely the presence of a certain feature. Labeled entities in MUTAG are only connected via high-degree hub nodes that encode a certain feature. (340 instances, 2 classes, 23k entities, 23 relations, 74k edges)

Entity classification results (accuracy averaged over 10 runs) reported in the R-GCN paper:
- AIFB: 95.83
- MUTAG: 73.23

## Relational GCNs
Extension of GCNs: Use a set of relation-specific weight matrices $W_r^{(l)}$, where $r \in R$ denotes the relation type

Propagation model:
> $h_i^{(l+1)} = \sigma\left(\sum_{r\in R}\sum_{j\in N^r_i}\frac{1}{c_{i,r}}W_r^{(l)}h_j^{(l)}+\underbrace{W_0^{(l)}h_i^{(l)}}_{\text{self-connection}}\right)$

where 
- $N^r_i$ denotes the set of neighbor indices of node $i$ under relation $r \in R$, 
- $c_{i,r}$ is a problem-specific normalization constant that can either be learned or chosen in advance (such as $c_{i,r} = |N_i^r|$).

Neural network layer update: evaluate message passing update in parallel for every node $i \in V$.
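
To make the propagation model concrete, here is a minimal NumPy sketch of a single layer update on a toy graph. It assumes ReLU for $\sigma$, the normalization $c_{i,r} = |N_i^r|$, and the row-vector convention $h W$ instead of $W h$. The function name `rgcn_propagate` and the toy matrices are illustrative and not part of the notebook's implementation.

```python
import numpy as np

def rgcn_propagate(H, adjs, W_rel, W_self):
    """One R-GCN layer update with ReLU and c_{i,r} = |N_i^r|.

    H:      (num_nodes, d_in) node representations h^{(l)}
    adjs:   list of (num_nodes, num_nodes) 0/1 float matrices,
            adjs[r][i, j] = 1 if j is a neighbor of i under relation r
    W_rel:  list of (d_in, d_out) relation-specific weights W_r
    W_self: (d_in, d_out) self-connection weight W_0
    """
    out = H @ W_self                                   # self-connection term
    for A, W in zip(adjs, W_rel):
        deg = A.sum(axis=1, keepdims=True)             # |N_i^r| for every node i
        # Average over neighbors; nodes without neighbors keep a zero row
        A_norm = np.divide(A, deg, out=np.zeros_like(A), where=deg != 0)
        out += A_norm @ H @ W
    return np.maximum(out, 0.0)                        # ReLU

H = np.array([[1., 0.], [0., 1.], [1., 1.]])
A0 = np.array([[0., 1., 1.], [0., 0., 0.], [0., 0., 0.]])  # node 0 <- {1, 2}
A1 = np.array([[0., 0., 0.], [1., 0., 0.], [0., 0., 0.]])  # node 1 <- {0}
W_rel = [np.eye(2), 2.0 * np.eye(2)]
W_self = np.eye(2)

H_next = rgcn_propagate(H, [A0, A1], W_rel, W_self)
print(H_next)
```

Node 0 adds the mean of its two neighbors under the first relation to its own representation, node 1 adds twice the representation of node 0, and node 2 has no neighbors and only keeps its self-connection.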

Parameter sharing for highly multi-relational data: basis decomposition of relation-specific weight matrices
> $W_r^{(l)} = \sum^B_{b=1}a^{(l)}_{r,b}V_b^{(l)}$

Linear combination of basis transformations $V_b^{(l)} \in \mathbb{R}^{d^{(l+1)}\times d^{(l)}}$ with learnable coefficients $a^{(l)}_{r,b}$ such that only the coefficients depend on $r$. $B$, the number of basis functions, is a hyperparameter.
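
The basis decomposition can be sketched in a few lines of NumPy. The sizes below are arbitrary toy values, and the weights use the shape `(d_in, d_out)` (row-vector convention, as in the code further down) rather than $d^{(l+1)}\times d^{(l)}$. The sketch also compares the number of parameters with and without parameter sharing.

```python
import numpy as np

rng = np.random.default_rng(0)
num_rels, num_bases, d_in, d_out = 6, 2, 4, 3   # toy sizes, chosen for illustration

V = rng.normal(size=(num_bases, d_in, d_out))   # shared basis transformations V_b
a = rng.normal(size=(num_rels, num_bases))      # coefficients a_{r,b}

# W_r = sum_b a_{r,b} V_b, computed for all relations at once
W = np.einsum('rb,bio->rio', a, V)              # (num_rels, d_in, d_out)

# The same weight for a single relation, written as the explicit sum
W_3 = sum(a[3, b] * V[b] for b in range(num_bases))

params_full = num_rels * d_in * d_out                            # one matrix per relation
params_basis = num_bases * d_in * d_out + num_rels * num_bases   # bases + coefficients
print(params_full, params_basis)
```

Only the small coefficient matrix `a` grows with the number of relations, which is what makes the decomposition attractive for datasets with many relation types.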

For entity classification, the paper minimizes the cross-entropy loss over all labeled nodes:
> $L = -\sum_{i\in Y}\sum^K_{k=1}t_{i,k}\ln h_{i,k}^{(l)}$

where:
- $Y$ is the set of node indices with labels
- $K$ is the number of classes
- $t_{i,k}$ is the ground-truth label
- $h_{i,k}^{(l)}$ is the $k$-th entry of the network output for the $i$-th labeled node
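
As a small worked example of this loss, the following NumPy sketch evaluates it for a toy output matrix in which only two of four nodes are labeled. The helper name `masked_cross_entropy` and all numbers are made up for illustration.

```python
import numpy as np

def masked_cross_entropy(probs, labels, labeled_idx):
    """L = -sum_{i in Y} sum_k t_{i,k} ln h_{i,k} with one-hot targets t."""
    num_classes = probs.shape[1]
    t = np.eye(num_classes)[labels]      # one-hot ground truth, one row per labeled node
    h = probs[labeled_idx]               # network output restricted to labeled nodes
    return -np.sum(t * np.log(h))

# Rows are nodes, columns are class probabilities (each row sums to 1)
probs = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.80, 0.10],
                  [0.30, 0.30, 0.40],
                  [0.25, 0.25, 0.50]])
labeled_idx = np.array([0, 3])           # Y: only nodes 0 and 3 carry labels
labels = np.array([0, 2])                # their classes

loss = masked_cross_entropy(probs, labels, labeled_idx)
print(loss)                              # equals -(ln 0.7 + ln 0.5)
```

Note that the formula sums over the labeled nodes, whereas `nn.CrossEntropyLoss` in PyTorch averages over them by default (`reduction='mean'`), so the two differ by a factor of $|Y|$.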

## Training and evaluation
- 2 layer model with 16 hidden units (dimension of hidden node representation)
- 50 epochs with learning rate 0.01 using Adam optimizer
- normalization constant $c_{i,r} = |N_i^r|$, i.e. average all incoming messages from a particular relation type
- $l2$ penalty on first layer weights $C_{l2} \in \{0, 5\cdot 10^{-4}\}$
- number of basis functions $B \in \{0, 10, 20, 30, 40\}$
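
The paper applies the $l2$ penalty to the first-layer weights only, whereas passing `weight_decay` to the optimizer as a single value penalizes every parameter. One possible way to restrict it is to use per-parameter-group options of `torch.optim.Adam`. The sketch below uses a hypothetical two-layer stand-in model with plain linear layers; the layer names are assumptions, not the notebook's model.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a two-layer model; names and sizes are illustrative only
model = nn.ModuleDict({
    "first": nn.Linear(8, 16, bias=False),
    "second": nn.Linear(16, 4, bias=False),
})

optimizer = torch.optim.Adam(
    [
        # L2 penalty C_l2 = 5e-4 on the first layer only
        {"params": model["first"].parameters(), "weight_decay": 5e-4},
        # No penalty on the remaining layer
        {"params": model["second"].parameters(), "weight_decay": 0.0},
    ],
    lr=0.01,
)

decays = [group["weight_decay"] for group in optimizer.param_groups]
print(decays)
```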

Results reported
- Accuracy and standard error over 10 runs
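
For reference, the mean and standard error over runs can be computed as in the sketch below. The accuracy values are made-up placeholders, not results from the paper or from this notebook.

```python
import numpy as np

def mean_and_sem(values):
    """Mean and standard error of the mean (sample standard deviation, ddof=1)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(len(values))
    return mean, sem

accs = [0.90, 0.94, 0.92, 0.96, 0.88]   # placeholder accuracies of five runs
mean, sem = mean_and_sem(accs)
print(f"{100 * mean:.2f} +- {100 * sem:.2f}")
```

`scipy.stats.sem`, which is used in the experiment function below, also defaults to `ddof=1` and should give the same value.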

## Implementation

### Imports
We mainly use PyTorch Geometric to load the datasets, NumPy and SciPy to process the data, and PyTorch for the model.

In [None]:
from torch_geometric.datasets import Entities
from torch_geometric.utils.num_nodes import maybe_num_nodes
import numpy as np
import scipy.sparse as sp
from scipy import stats
import torch
import torch.nn as nn
import torch.nn.functional as F
from functools import partial
import os.path as osp

### R-GCN layer
This part is inspired by the Keras implementation by Thomas Kipf and by PyTorch-based implementations:
- https://github.com/tkipf/pygcn
- https://github.com/masakicktashiro/rgcn_pytorch_implementation
- https://github.com/mjDelta/relation-gcn-pytorch

In [None]:
class RGCNConv(nn.Module):
    def __init__(self,
                 input_dim,
                 output_dim,
                 num_rels,
                 num_bases=-1,
                 bias=False,
                 activation=None,
                 dropout=0.5,
                 is_output_layer=False):
        r"""The relational graph convolutional operator from the `"Modeling
        Relational Data with Graph Convolutional Networks"
        <https://arxiv.org/abs/1703.06103>`_ paper

        Propagation model:
        (1) $h_i^{(l+1)} = \sigma\left(\sum_{r\in R}\sum_{j\in N^r_i}\frac{1}{c_{i,r}}W_r^{(l)}h_j^{(l)}+
        \underbrace{W_0^{(l)}h_i^{(l)}}_{\text{self-connection}}\right)$

        where
        - $N^r_i$ denotes the set of neighbor indices of node $i$ under relation $r \in R$,
        - $c_{i,r}$ is a problem-specific normalization constant that can either be learned or chosen in advance
          (such as $c_{i,r} = |N_i^r|$).

        Neural network layer update: evaluate message passing update in parallel for every node $i \in V$.

        Parameter sharing for highly multi-relational data: basis decomposition of relation-specific weight matrices
        (2) $W_r^{(l)} = \sum^B_{b=1}a^{(l)}_{r,b}V_b^{(l)}$

        Linear combination of basis transformations $V_b^{(l)} \in \mathbb{R}^{d^{(l+1)}\times d^{(l)}}$ with learnable
        coefficients $a^{(l)}_{r,b}$ such that only the coefficients depend on $r$. $B$, the number of basis functions,
        is a hyperparameter.

        :param input_dim: Input dimension
        :param output_dim: Output dimension
        :param num_rels: Number of relation types
        :param num_bases: Number of bases used in basis decomposition of relation-specific weight matrices
        :param bias: Optional additive bias
        :param activation: Activation function
        :param dropout: Dropout
        :param is_output_layer: Indicates whether this layer is the output layer
        """
        super(RGCNConv, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.num_rels = num_rels
        self.num_bases = num_bases
        self.bias = bias
        self.activation = activation
        self.dropout = dropout
        self.is_output_layer = is_output_layer

        # Number of bases for the basis decomposition can be less or equal to 
        # the number of relation types
        if self.num_bases <= 0 or self.num_bases > self.num_rels:
            self.num_bases = self.num_rels

        # Weight bases in equation (2)
        # V_b if self.num_bases < self.num_rels, 
        # W_r if self.num_bases == self.num_rels
        self.weight = nn.Parameter(
            torch.Tensor(self.num_bases * self.input_dim, self.output_dim))

        # Use basis decomposition if num_bases < num_rels; otherwise
        # self.weight already holds one weight matrix per relation type
        if self.num_bases < self.num_rels:
            # linear combination coefficients a^{(l)}_{r, b} in equation (2)
            self.w_comp = nn.Parameter(
                torch.Tensor(self.num_rels, self.num_bases))

        if self.bias:
            self.b = nn.Parameter(torch.Tensor(self.output_dim))
        self.reset_parameters()

    def reset_parameters(self):
        # Initialize trainable parameters, see following link for explanation:
        # https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79
        # Xavier initialization: improved weight initialization method enabling 
        # quicker convergence and higher accuracy
        # gain is an optional scaling factor, here we use the recommended gain 
        # value for the given nonlinearity function
        nn.init.xavier_uniform_(self.weight, gain=nn.init.calculate_gain('relu'))
        if self.num_bases < self.num_rels:
            nn.init.xavier_uniform_(self.w_comp, gain=nn.init.calculate_gain('relu'))
        if self.bias:
            # Xavier initialization is undefined for 1-D tensors, so start the bias at zero
            nn.init.zeros_(self.b)

    def forward(self, x, adj_t):
        supports = []
        num_nodes = adj_t[0].shape[0]
        for i, adj in enumerate(adj_t):
            if x is not None:
                supports.append(torch.spmm(adj, x))
            else:
                supports.append(adj)
        supports = torch.cat(supports, dim=1)   # (num_nodes, num_rels * input_dim)

        # Calculate relation specific weight matrices
        if self.num_bases < self.num_rels:
            # Generate all weights from bases as in equation (2)
            bases = self.weight.reshape(self.num_bases, self.input_dim, self.output_dim)

            # W_r = sum_b a_{r, b} V_b for all relations at once:
            # (num_rels, num_bases) x (num_bases, input_dim, output_dim)
            #   -> (num_rels, input_dim, output_dim)
            weight = torch.einsum('rb,bio->rio', self.w_comp, bases)
            # Stack relation by relation so that the rows line up with the
            # relation-wise column blocks of supports
            weight = weight.reshape(self.num_rels * self.input_dim, self.output_dim)
        else:
            weight = self.weight

        out = torch.spmm(supports, weight)  # (num_nodes, output_dim)

        # If x is None (featureless input), apply dropout to the output by
        # elementwise multiplying with a column vector of ones to which dropout
        # has been applied. Passing self.training disables it in eval mode.
        if x is None:
            temp = torch.ones(num_nodes, device=out.device)
            temp_drop = F.dropout(temp, self.dropout, self.training)
            out = (out.transpose(1, 0) * temp_drop).transpose(1, 0)

        if self.bias:
            out += self.b

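        # activation defaults to None, in which case the layer stays linear
        if self.activation is not None:
            out = self.activation(out)
        return out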
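
The `forward` method concatenates the per-relation supports side by side and multiplies them with a single stacked weight matrix. The NumPy sketch below, with random toy data, checks that this layout is equivalent to the sum over relations in equation (1) without the self-connection term, provided the weight rows are stacked relation by relation. It also covers the featureless case, where `x is None` corresponds to $X = I$.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, num_rels, d_in, d_out = 5, 3, 4, 2   # toy sizes

A = rng.integers(0, 2, size=(num_rels, num_nodes, num_nodes)).astype(float)
X = rng.normal(size=(num_nodes, d_in))
W = rng.normal(size=(num_rels, d_in, d_out))

# Concatenated layout: supports side by side, weights stacked relation by relation
supports = np.concatenate([A[r] @ X for r in range(num_rels)], axis=1)
stacked = W.reshape(num_rels * d_in, d_out)
out_concat = supports @ stacked

# Reference: explicit sum over relations
out_sum = sum(A[r] @ X @ W[r] for r in range(num_rels))

# Featureless case: X = I, so A_r X = A_r and the input dimension is num_nodes
W_id = rng.normal(size=(num_rels, num_nodes, d_out))
out_featureless = np.concatenate(list(A), axis=1) @ W_id.reshape(num_rels * num_nodes, d_out)
out_featureless_sum = sum(A[r] @ W_id[r] for r in range(num_rels))

print(np.allclose(out_concat, out_sum), np.allclose(out_featureless, out_featureless_sum))
```

This is why the first layer takes `num_nodes` as its input dimension when no node features are available.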

### R-GCN model
This part is somewhat inspired by: https://docs.dgl.ai/en/0.4.x/tutorials/models/1_gnn/4_rgcn.html.
We borrowed the idea of building the model using separate build functions for the input, hidden and output layer in order to allow building models with multiple hidden layers.

In [None]:
class RGCN(torch.nn.Module):
    def __init__(self, num_nodes, h_dim, out_dim, num_rels,
                 num_bases=-1, num_hidden_layers=1, dropout=0.5, bias=False):
        """
        Implementation of R-GCN from the `"Modeling
        Relational Data with Graph Convolutional Networks"
        <https://arxiv.org/abs/1703.06103>`_ paper

        :param num_nodes: Number of nodes (input dimension)
        :param h_dim: Hidden dimension
        :param out_dim: Output dimension
        :param num_rels: Number of relation types
        :param num_bases: Number of basis functions
        :param num_hidden_layers: Number of hidden layers
        :param dropout: Dropout probability
        :param bias: Whether to use an additive bias
        """
        super(RGCN, self).__init__()
        self.num_nodes = num_nodes
        self.h_dim = h_dim
        self.out_dim = out_dim
        self.num_rels = num_rels
        self.num_bases = num_bases
        self.num_hidden_layers = num_hidden_layers
        self.dropout = dropout
        self.bias = bias

        self.layers = nn.ModuleList()
        # create rgcn layers
        self.build_model()

    def build_model(self):
        # input to hidden
        i2h = self.build_input_layer()
        self.layers.append(i2h)
        # hidden to hidden
        for _ in range(self.num_hidden_layers):
            h2h = self.build_hidden_layer()
            self.layers.append(h2h)
        # hidden to output
        h2o = self.build_output_layer()
        self.layers.append(h2o)

    def build_input_layer(self):
        return RGCNConv(self.num_nodes, self.h_dim, self.num_rels, self.num_bases, activation=F.relu,
                        dropout=self.dropout, bias=self.bias)

    def build_hidden_layer(self):
        return RGCNConv(self.h_dim, self.h_dim, self.num_rels, self.num_bases, activation=F.relu,
                        dropout=self.dropout, bias=self.bias)

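    def build_output_layer(self):
        # Emit log-probabilities: nn.CrossEntropyLoss applies log-softmax itself, and
        # log-softmax is idempotent, so the loss is a proper cross-entropy and argmax
        # predictions are unchanged (a plain softmax here would be normalized twice)
        return RGCNConv(self.h_dim, self.out_dim, self.num_rels, self.num_bases,
                        activation=partial(F.log_softmax, dim=-1),
                        dropout=self.dropout, is_output_layer=True, bias=self.bias)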

    def reset_parameters(self):
        for layer in self.layers:
            layer.reset_parameters()

    def forward(self, x, adj_t):
        out = x
        for layer in self.layers:
            out = layer(out, adj_t)
            if not layer.is_output_layer:
                out = F.dropout(out, self.dropout, self.training)
        return out

### Training and evaluation functions

In [None]:
def train(model, x, adj_t, optimizer, loss_fn, train_idx, train_y):
    model.train()

    # Zero grad the optimizer
    optimizer.zero_grad()
    # Feed the data into the model
    out = model(x, adj_t)
    # Feed the sliced output and label to loss_fn
    labels = train_y.to(out.device)
    loss = loss_fn(out[train_idx], labels)

    # Backpropagation, optimizer
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def test(model, x, adj_t, train_idx, train_y, test_idx, test_y):
    model.eval()

    # Output of model on all data
    out = model(x, adj_t)
    # Get predicted class labels
    pred = out.argmax(dim=-1).cpu()

    # Evaluate prediction accuracy
    train_acc = pred[train_idx].eq(train_y).to(torch.float).mean()
    test_acc = pred[test_idx].eq(test_y).to(torch.float).mean()
    return train_acc.item(), test_acc.item()

### Data preprocessing functions
To use the data from `torch_geometric.datasets.Entities` with our R-GCN implementation, we have to convert the dataset and construct one adjacency matrix per relation type from the edge index and edge type arrays.

In [None]:
def get_adjacency_matrices(data):
    """
    Converts torch_geometric.datasets.entities data to relation type specific 
    adjacency matrices
    :param data: torch_geometric.datasets.entities data
    :return:
        A: list of relation type specific adjacency matrices
    """
    num_rels = data.num_rels
    num_nodes = data.num_nodes

    A = []
    source_nodes = data.edge_index[0].numpy()
    target_nodes = data.edge_index[1].numpy()

    # Get edges for given (relation) edge type and construct adjacency matrix
    for i in range(num_rels):
        indices = np.argwhere(np.asarray(data.edge_type) == i).squeeze(axis=1)
        r_source_nodes = source_nodes[indices]
        r_target_nodes = target_nodes[indices]
        a = sp.csr_matrix(
            (np.ones(len(indices)), (r_source_nodes, r_target_nodes)), 
            shape=(num_nodes, num_nodes))
        A.append(a)

    return A
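
To illustrate the conversion on a toy graph, the sketch below builds dense per-relation adjacency matrices from an edge index and an edge type array. It uses dense NumPy arrays for readability, whereas the function above uses SciPy sparse matrices. The helper name and the toy edges are illustrative.

```python
import numpy as np

def adjacency_per_relation(edge_index, edge_type, num_nodes, num_rels):
    """Dense per-relation adjacency matrices from COO edge arrays (toy sizes only)."""
    A = np.zeros((num_rels, num_nodes, num_nodes))
    src, dst = edge_index
    # np.add.at accumulates duplicate edges, like SciPy does for repeated COO entries
    np.add.at(A, (edge_type, src, dst), 1.0)
    return A

# Four edges: row 0 holds the source nodes, row 1 the target nodes
edge_index = np.array([[0, 0, 1, 2],
                       [1, 2, 2, 0]])
edge_type = np.array([0, 1, 0, 1])       # relation type of each edge

A = adjacency_per_relation(edge_index, edge_type, num_nodes=3, num_rels=2)
print(A[0])   # edges 0->1 and 1->2 under relation 0
print(A[1])   # edges 0->2 and 2->0 under relation 1
```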

The following functions are for normalizing the matrices individually and converting them to sparse tensors.

In [None]:
def normalize(adj_matrix):
    """
    Normalizes the adjacency matrix
    :param adj_matrix: Adjacency matrix
    :return:
        out: Normalized adjacency matrix
    """
    node_degrees = np.array(adj_matrix.sum(axis=1)).flatten()
    # Essentially 1. / node_degrees, while avoiding division by zero warning
    norm_const = np.divide(np.ones_like(node_degrees), node_degrees, out=np.zeros_like(node_degrees),
                           where=node_degrees != 0)
    D_inv = sp.diags(norm_const)
    out = D_inv.dot(adj_matrix).tocsr()
    return out
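
The effect of this normalization is easiest to see on a small dense matrix. The sketch below is a dense NumPy counterpart of the row normalization (dividing each row by its sum, i.e. $c_{i,r} = |N_i^r|$), including a node without neighbors. The helper name `row_normalize` is illustrative.

```python
import numpy as np

def row_normalize(A):
    """Divide every row by its sum; rows that sum to zero stay zero."""
    deg = A.sum(axis=1, keepdims=True)
    return np.divide(A, deg, out=np.zeros_like(A), where=deg != 0)

A = np.array([[0., 1., 1.],    # node 0 has two neighbors
              [1., 0., 0.],    # node 1 has one neighbor
              [0., 0., 0.]])   # node 2 has none
A_norm = row_normalize(A)
print(A_norm)
```

Multiplying `A_norm` with a feature matrix then averages the features of each node's neighbors under that relation.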

In [None]:
def to_sparse_tensor(sparse_array):
    """
    Converts sparse array (normalized adjacency matrix) to sparse tensor
    :param sparse_array: Sparse array (normalized adjacency matrix)
    :return:
        sparse_tensor: Converted sparse tensor
    """
    coo = sparse_array.tocoo()
    # Indices (2, nnz) and values (nnz,) of the stored entries; an empty
    # matrix simply yields an empty sparse tensor of the same shape
    indices = torch.from_numpy(np.vstack((coo.row, coo.col))).long()
    values = torch.from_numpy(coo.data).float()
    sparse_tensor = torch.sparse_coo_tensor(indices, values, torch.Size(coo.shape))
    return sparse_tensor

## Experiments
In theory, this model should work with all 4 datasets (AIFB, MUTAG, BGS and AM). However, BGS and AM contain huge graphs, which require a lot of memory. We recommend using only AIFB or MUTAG.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Data loading function

In [None]:
def load_data(dataset_name):
  # Load data via pytorch geometric
  path = osp.join('.', 'data', 'Entities')
  dataset = Entities(path, dataset_name)
  data = dataset[0]

  data.num_nodes = maybe_num_nodes(data.edge_index)
  data.num_rels = dataset.num_relations

  # Construct relation type specific adjacency matrices from data.edge_index and data.edge_type in utils
  A = get_adjacency_matrices(data)

  adj_t = []
  # Normalize matrices individually and convert to sparse tensors
  for a in A:
      nor_a = normalize(a)
      if len(nor_a.nonzero()[0]) > 0:
          tensor_a = to_sparse_tensor(nor_a)
          adj_t.append(tensor_a.to(device))

  # Replace if features are available
  x = None    
  return dataset, data, adj_t, x

Experiment function

In [None]:
def run_experiment(dataset_name, args, runs=10):
  dataset, data, adj_t, x = load_data(dataset_name)
  # Initialize RGCN model
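  model = RGCN(
      num_nodes=data.num_nodes,
      h_dim=args["h_dim"],
      out_dim=dataset.num_classes,
      # load_data drops relations without edges, so count the matrices actually passed
      num_rels=len(adj_t),
      num_bases=args["num_bases"],
      # Without these two arguments the defaults (1 hidden layer, no bias) are used
      # and the corresponding entries in args are silently ignored
      num_hidden_layers=args["num_hidden_layers"],
      dropout=args["dropout"],
      bias=args["bias"]
  ).to(device)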

  test_accs = []

  for i in range(1, runs+1):
    print('------------------------------------------------')
    print(f'Model run {i}')
    print('------------------------------------------------')
    # Reset the parameters to initial random values
    model.reset_parameters()

    optimizer = torch.optim.Adam(model.parameters(), lr=args["lr"], weight_decay=args["l2"])
    loss_fn = nn.CrossEntropyLoss()

    best_test_acc = 0
    test_acc = 0
    # Train and evaluate model
    for epoch in range(1, args["epochs"] + 1):
        loss = train(model, x, adj_t, optimizer, loss_fn, data.train_idx, data.train_y)
        train_acc, test_acc = test(model, x, adj_t, data.train_idx, data.train_y, data.test_idx, data.test_y)
        if test_acc > best_test_acc:
          best_test_acc = test_acc
        if epoch == 1 or (epoch % 10) == 0:
          print(f'Epoch: {epoch:02d}, Loss: {loss:.4f}, Train: {train_acc:.4f} '
                f'Test: {test_acc:.4f}')
    test_accs.append(test_acc)  # alternatively use the best test acc
    print(f'Best test accuracy: {best_test_acc:.4f}')
  
  avg_test_acc = np.mean(test_accs)
  sem_test_acc = stats.sem(test_accs)

  print('------------------------------------------------')
  print(f'Average test accuracy over {runs} runs: {100 * avg_test_acc:.2f}+-{100 * sem_test_acc:.2f}')

### AIFB

In [None]:
# Parameters from the RGCN paper
args = {
        'h_dim': 16,
        'num_bases': -1,
        'num_hidden_layers': 0,
        'dropout': 0.,
        'lr': 0.01,
        'l2': 0.,
        'bias': False,
        'epochs': 50,
    }

Training and Evaluation

In [None]:
run_experiment(dataset_name="AIFB", args=args, runs=10)

### MUTAG

In [None]:
dataset_name = "MUTAG"  # choices=['AIFB', 'MUTAG', 'BGS', 'AM']

In [None]:
# Parameters from the RGCN paper
args = {
        'h_dim': 16,
        'num_bases': 30,
        'num_hidden_layers': 0,
        'dropout': 0.,
        'lr': 0.01,
        'l2': 0.0005,
        'bias': False,
        'epochs': 50,
    }

Training and Evaluation

In [None]:
run_experiment(dataset_name="MUTAG", args=args, runs=10)