## How to do Deep Learning on Graphs with Graph Convolutional Networks

Adapted from

https://towardsdatascience.com/how-to-do-deep-learning-on-graphs-with-graph-convolutional-networks-7d2250723780

E. Klyshko

### Why represent data as graphs rather than vectors or images?

<img src="https://www.researchgate.net/profile/Pascal_Welke/publication/308383358/figure/fig1/AS:502229565808640@1496752267428/Glucose-molecule-in-its-3D-left-and-graph-representation-right-All-bonds-in-glucose.png" width="500" align="left">

![](https://www.learnopencv.com/wp-content/uploads/2020/05/cnn-vs-gcn-comparison.png)

### How do they work?

<img src="https://miro.medium.com/max/1200/1*gC8q4uABSQtM8zyXG2wOog.png" width="700" align="left">

More formally, a graph convolutional network (GCN) is a neural network that operates on graphs. Given a graph G = (V, E), a GCN takes as input

- an N × F⁰ input feature matrix, X, where N is the number of nodes and F⁰ is the number of input features for each node, and
- an N × N matrix representation of the graph structure such as the adjacency matrix A of G.

A hidden layer in the GCN can thus be written as Hⁱ = f(Hⁱ⁻¹, A), where H⁰ = X and f is a propagation rule [1]. Each layer Hⁱ corresponds to an N × Fⁱ feature matrix where each row is a feature representation of a node. At each layer, these features are aggregated to form the next layer’s features using the propagation rule f. In this way, features become increasingly abstract at each consecutive layer. In this framework, variants of GCN differ only in the choice of propagation rule f.

![](https://miro.medium.com/max/1000/1*pCeWhIrEFXoEgsB5eEB6sw.png)

### Propagation rule

One of the simplest possible propagation rules is [1]:

f(Hⁱ, A) = σ(AHⁱWⁱ)

where Wⁱ is the weight matrix for layer i and σ is a non-linear activation function such as the ReLU function. The weight matrix has dimensions Fⁱ × Fⁱ⁺¹; in other words the size of the second dimension of the weight matrix determines the number of features at the next layer. If you are familiar with convolutional neural networks, this operation is similar to a filtering operation since these weights are shared across nodes in the graph.
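
To make the shapes concrete, here is a minimal sketch of this rule with plain NumPy arrays. The function name `propagate`, the random graph, and the sizes `N`, `F_in` and `F_out` are illustrative assumptions, not part of the original post.

```python
import numpy as np

def propagate(A, H, W):
    """Apply f(H, A) = ReLU(A H W) for a single layer."""
    return np.maximum(0, A @ H @ W)

rng = np.random.default_rng(0)
N, F_in, F_out = 5, 3, 2                        # hypothetical sizes

A = (rng.random((N, N)) < 0.4).astype(float)    # random adjacency matrix, N x N
H = rng.normal(size=(N, F_in))                  # node features, N x F_in
W = rng.normal(size=(F_in, F_out))              # layer weights, F_in x F_out

H_next = propagate(A, H, W)
print(H_next.shape)   # (5, 2): one row per node, F_out features each
```

The same `W` multiplies every node's aggregated features, which is the weight sharing mentioned above.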

### Simplifications
Let’s examine the propagation rule at its simplest. Let
- i = 1, so that f is applied to the input feature matrix,
- σ be the identity function, and
- the weights be chosen such that AH⁰W⁰ = AXW⁰ = AX.

In other words, f(X, A) = AX. This propagation rule is perhaps a bit too simple, but we will add in the missing parts later. As a side note, AX is now equivalent to the input layer of a multi-layer perceptron.

### A Simple Graph Example
As a simple example, we’ll use the following graph:


![](https://miro.medium.com/max/432/1*jTW7doI_cqC_p9XQrmuu9A.png)

In [36]:
## Adjacency matrix

import numpy as np

A = np.matrix([
    [0, 1, 0, 0],
    [0, 0, 1, 1], 
    [0, 1, 0, 0],
    [1, 0, 1, 0]],
    dtype=float
)

In [37]:
## Feature matrix: row i holds the features of node i

X = np.matrix(
    [
            [i, -i]
            for i in range(A.shape[0])
    ], dtype=float
)

print(X)

[[ 0.  0.]
 [ 1. -1.]
 [ 2. -2.]
 [ 3. -3.]]


### Applying a propagation rule

In [38]:
A * X

matrix([[ 1., -1.],
        [ 5., -5.],
        [ 1., -1.],
        [ 2., -2.]])

The representation of each node (each row) is now the sum of its neighbors' features! In other words, the graph convolutional layer represents each node as an aggregate of its neighborhood. I encourage you to check the calculation for yourself. Note that in this case a node n is a neighbor of node v if there exists an edge from v to n.
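
One way to check the calculation is to sum the neighbor rows by hand and compare the result with the matrix product. This is a small sketch using plain NumPy arrays instead of `np.matrix`; the variable name `manual` is only for illustration.

```python
import numpy as np

A = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0]], dtype=float)
X = np.array([[i, -i] for i in range(A.shape[0])], dtype=float)

# For each node v, sum the feature rows of every node n with an edge v -> n.
manual = np.array([
    X[np.flatnonzero(A[v])].sum(axis=0)
    for v in range(A.shape[0])
])

print(manual)
print(np.array_equal(manual, A @ X))   # True
```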

## Problems

- The aggregated representation of a node does not include its own features! The representation is an aggregate of the features of neighbor nodes, so only nodes that have a self-loop will include their own features in the aggregate.
- Nodes with large degrees will have large values in their feature representation while nodes with small degrees will have small values. This can cause vanishing or exploding gradients, but is also problematic for stochastic gradient descent algorithms which are typically used to train such networks and are sensitive to the scale (or range of values) of each of the input features.
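
Both problems can be seen on the example graph. In the sketch below every node is given the same feature value (an assumption made only for this illustration), so any difference in the aggregate comes purely from the graph structure.

```python
import numpy as np

A = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0]], dtype=float)

# Problem 1: the diagonal is all zeros, so no node sees its own features.
print(np.diag(A))            # [0. 0. 0. 0.]

# Problem 2: with identical features, the aggregate is just the node's
# out-degree, so higher-degree nodes get larger values.
ones = np.ones((A.shape[0], 1))
out_degree = A.sum(axis=1)
aggregated = (A @ ones).ravel()
print(out_degree)            # [1. 2. 1. 2.]
print(aggregated)            # [1. 2. 1. 2.]
```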

### Adding self-loops

In [39]:
I = np.matrix(np.eye(A.shape[0]))
print(I)
A_hat = A + I

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]


In [40]:
A_hat * X

matrix([[ 1., -1.],
        [ 6., -6.],
        [ 3., -3.],
        [ 5., -5.]])

### Normalization

The feature representations can be normalized by node degree by multiplying the adjacency matrix A by the inverse of the degree matrix D. Note that the code below builds D from the column sums of A, that is, from each node's in-degree.
Thus our simplified propagation rule looks like this:
f(X, A) = D⁻¹AX

In [41]:
D = np.array(np.sum(A, axis=0))[0]
D = np.matrix(np.diag(D))

print(D)

[[1. 0. 0. 0.]
 [0. 2. 0. 0.]
 [0. 0. 2. 0.]
 [0. 0. 0. 1.]]


### Before

In [42]:
A

matrix([[0., 1., 0., 0.],
        [0., 0., 1., 1.],
        [0., 1., 0., 0.],
        [1., 0., 1., 0.]])

### After

In [43]:
D**-1 * A

matrix([[0. , 1. , 0. , 0. ],
        [0. , 0. , 0.5, 0.5],
        [0. , 0.5, 0. , 0. ],
        [1. , 0. , 1. , 0. ]])
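
Because D is built from column sums (in-degrees) while aggregation runs over each row's outgoing edges, the rows of the normalized matrix above do not all sum to 1. The sketch below compares this with normalizing by row sums (out-degrees). The out-degree variant is an alternative shown only for comparison and is not what the notebook uses; the names `by_in` and `by_out` are illustrative.

```python
import numpy as np

A = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0]], dtype=float)

in_degree = A.sum(axis=0)     # column sums, as in the cell above
out_degree = A.sum(axis=1)    # row sums (alternative, for comparison only)

# Dividing row v by a degree is the same as left-multiplying by D^-1.
by_in = A / in_degree[:, None]
by_out = A / out_degree[:, None]

print(by_in.sum(axis=1).tolist())    # [1.0, 1.0, 0.5, 2.0]
print(by_out.sum(axis=1).tolist())   # [1.0, 1.0, 1.0, 1.0]
```

With the out-degree variant each node's representation becomes the mean of its neighbors' features. For an undirected graph such as the karate club network used later, the two choices coincide.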

### Putting it all together

In [14]:
W = np.matrix(
    [
         [1, -1],
         [-1, 1]
    ]
)

In [44]:
D_hat = np.array(np.sum(A_hat, axis=0))[0]
D_hat = np.matrix(np.diag(D_hat))

D_hat**-1 * A_hat * X * W

matrix([[ 1., -1.],
        [ 4., -4.],
        [ 2., -2.],
        [ 5., -5.]])

In [45]:
def relu(X):
    return np.maximum(0,X)

In [46]:
relu(D_hat**-1 * A_hat * X * W)

matrix([[1., 0.],
        [4., 0.],
        [2., 0.],
        [5., 0.]])
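
NumPy's documentation no longer recommends `np.matrix` for new code, so here is a sketch of the same layer written with plain arrays and the `@` operator. The function name `gcn_layer_arrays` is hypothetical. It keeps the notebook's choice of building the degree matrix from column sums, and on the example graph it should reproduce the values printed above.

```python
import numpy as np

def gcn_layer_arrays(A, X, W):
    """One layer: ReLU(D_hat^-1 A_hat X W), with D_hat from column sums."""
    A_hat = A + np.eye(A.shape[0])      # add self-loops
    d_hat = A_hat.sum(axis=0)           # degrees from column sums, as in the notebook
    # Dividing row v of (A_hat X) by d_hat[v] equals multiplying by D_hat^-1.
    return np.maximum(0, ((A_hat @ X) / d_hat[:, None]) @ W)

A = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0]], dtype=float)
X = np.array([[i, -i] for i in range(A.shape[0])], dtype=float)
W = np.array([[1, -1], [-1, 1]], dtype=float)

H = gcn_layer_arrays(A, X, W)
print(H)
```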

## Zachary’s Karate Club

Zachary’s karate club is a commonly used social network where nodes represent members of a karate club and edges represent their mutual relations. While Zachary was studying the karate club, a conflict arose between the administrator and the instructor, which resulted in the club splitting in two. The figure below shows the graph representation of the network, with nodes labeled according to which part of the club they ended up in. The administrator and instructor are marked with ‘A’ and ‘I’, respectively.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Social_Network_Model_of_Relationships_in_the_Karate_Club.png/250px-Social_Network_Model_of_Relationships_in_the_Karate_Club.png)

In [47]:
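from networkx import karate_club_graph, to_numpy_array

# to_numpy_matrix was removed in NetworkX 3.0; this small shim keeps the
# np.matrix-based cells below working unchanged.
def to_numpy_matrix(G, nodelist=None):
    return np.matrix(to_numpy_array(G, nodelist=nodelist))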

zkc = karate_club_graph()
order = sorted(list(zkc.nodes()))

In [49]:
zkc.nodes(), zkc.edges()

(NodeView((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33)),
 EdgeView([(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 10), (0, 11), (0, 12), (0, 13), (0, 17), (0, 19), (0, 21), (0, 31), (1, 2), (1, 3), (1, 7), (1, 13), (1, 17), (1, 19), (1, 21), (1, 30), (2, 3), (2, 7), (2, 8), (2, 9), (2, 13), (2, 27), (2, 28), (2, 32), (3, 7), (3, 12), (3, 13), (4, 6), (4, 10), (5, 6), (5, 10), (5, 16), (6, 16), (8, 30), (8, 32), (8, 33), (9, 33), (13, 33), (14, 32), (14, 33), (15, 32), (15, 33), (18, 32), (18, 33), (19, 33), (20, 32), (20, 33), (22, 32), (22, 33), (23, 25), (23, 27), (23, 29), (23, 32), (23, 33), (24, 25), (24, 27), (24, 31), (25, 31), (26, 29), (26, 33), (27, 33), (28, 31), (28, 33), (29, 32), (29, 33), (30, 32), (30, 33), (31, 32), (31, 33), (32, 33)]))

In [51]:

A = to_numpy_matrix(zkc, nodelist=order)
I = np.eye(zkc.number_of_nodes())

A_hat = A + I
D_hat = np.array(np.sum(A_hat, axis=0))[0]
D_hat = np.matrix(np.diag(D_hat))

In [52]:
### random weights initialization
W_1 = np.random.normal(
    loc=0, scale=1, size=(zkc.number_of_nodes(), 4))
W_2 = np.random.normal(
    loc=0, size=(W_1.shape[1], 2))

In [53]:
I

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

### Stack layers

In [54]:
def gcn_layer(A_hat, D_hat, X, W):
    return relu(D_hat**-1 * A_hat * X * W)

H_1 = gcn_layer(A_hat, D_hat, I, W_1)
H_2 = gcn_layer(A_hat, D_hat, H_1, W_2)

output = H_2

In [55]:
feature_representations = {
    node: np.array(output)[node] 
    for node in zkc.nodes()
}

### Embedding without even training

![](https://miro.medium.com/max/700/1*Voir16IcvOvmWyO3nX4WZA.png)
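
To inspect such an untrained embedding numerically, the sketch below recomputes the two-layer forward pass with plain arrays and averages the 2-D points per faction. It makes several assumptions not in the original: a fixed seed so the run is repeatable, a binarized adjacency matrix so that any edge weights are ignored, and the `club` node attribute that NetworkX ships with this dataset as the faction label.

```python
import numpy as np
import networkx as nx

zkc = nx.karate_club_graph()
order = sorted(zkc.nodes())
n = len(order)

# Assumption: treat every edge as weight 1, whatever weights the graph carries.
A = (nx.to_numpy_array(zkc, nodelist=order) > 0).astype(float)
A_hat = A + np.eye(n)
d_hat = A_hat.sum(axis=0)

rng = np.random.default_rng(0)          # fixed seed (assumption)
W_1 = rng.normal(size=(n, 4))
W_2 = rng.normal(size=(4, 2))

def layer(H, W):
    """ReLU(D_hat^-1 A_hat H W) using plain arrays."""
    return np.maximum(0, ((A_hat @ H) / d_hat[:, None]) @ W)

# Identity features: each node starts as a one-hot vector.
embedding = layer(layer(np.eye(n), W_1), W_2)
print(embedding.shape)                  # (34, 2)

# Mean position of each faction in the 2-D embedding space.
for club in sorted({zkc.nodes[v]["club"] for v in order}):
    members = [v for v in order if zkc.nodes[v]["club"] == club]
    print(club, embedding[members].mean(axis=0))
```

The weights are random, so the printed positions change with the seed; some seeds may push many nodes to zero after the ReLU.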

## References
- [1] Thomas Kipf, blog post on graph convolutional networks. (https://tkipf.github.io/graph-convolutional-networks/)

- [2] Thomas Kipf and Max Welling, Semi-Supervised Classification with Graph Convolutional Networks. (https://arxiv.org/abs/1609.02907)