# Deep Learning for graph data

Now we know what a graph is. But how can we apply *machine learning* techniques to
graph data? More specifically, how can we use breakthroughs in *neural networks* for graph data?


## Naive approach #1

We will start with the simplest GNN architecture, one where we learn new embeddings for graph attributes (nodes, edges, global), but where we do not yet use the connectivity of the graph. 
This GNN uses a separate multilayer perceptron on each component of a graph; we call this a GNN layer. For each node vector, we apply the MLP and get back a learned node-vector. We do the same for each edge, learning a per-edge embedding, and also for the global-context vector, learning a single embedding for the entire graph.
As is common with neural network modules or layers, we can stack these GNN layers together.

<center><img src="images/graph_attributes.png" width=500></center>
<center><img src="images/naive_1.png" width=500></center>
<small>A single layer of a simple GNN. A graph is the input, and each component (V,E,U) gets updated by an MLP to produce a new graph. Each function subscript indicates a separate function for a different graph attribute at the n-th layer of a GNN model.</small>
<i><center><small>images from https://distill.pub/2021/gnn-intro/</small></center></i>
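
As a rough illustration of the layer described above, the sketch below applies one independent MLP to the node, edge, and global attributes of a toy graph. It uses NumPy with random, untrained weights; the helper `make_mlp`, the attribute sizes, and the hidden width are assumptions made for the example, not part of the original figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(dim_in, dim_hidden, dim_out):
    """Return a tiny two-layer MLP (random, untrained weights) as a function."""
    W1 = rng.normal(scale=0.1, size=(dim_in, dim_hidden))
    W2 = rng.normal(scale=0.1, size=(dim_hidden, dim_out))

    def mlp(x):
        return np.maximum(x @ W1, 0.0) @ W2  # linear -> ReLU -> linear

    return mlp

# Toy graph attributes (sizes are arbitrary, chosen for illustration)
V = rng.normal(size=(4, 5))  # 4 nodes, 5 features each
E = rng.normal(size=(3, 2))  # 3 edges, 2 features each
U = rng.normal(size=(1, 6))  # one global-context vector

# One simple GNN layer: a separate MLP per attribute type
f_V, f_E, f_U = make_mlp(5, 16, 8), make_mlp(2, 16, 8), make_mlp(6, 16, 8)

V_new, E_new, U_new = f_V(V), f_E(E), f_U(U)
print(V_new.shape, E_new.shape, U_new.shape)  # (4, 8) (3, 8) (1, 8)
```

No adjacency information appears anywhere in this sketch: every node and edge is transformed on its own, which is exactly why this layer does not yet use the connectivity of the graph.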

Example in Python: *Classifying nodes* with vanilla neural networks

<center><img src="images/cora.png" width=600></center>

In [46]:
from torch_geometric.datasets import Planetoid
import torch

# Load the CORA dataset
dataset = Planetoid(root=".", name="Cora")
# Access the first graph object
data = dataset[0]
print(f"number of nodes: {data.x.shape[0]}")
print(f"number of features: {data.x.shape[1]}")
print(f"number of classes: {torch.unique(data.y).size()[0]}")

# The data object contains train_mask, val_mask, and test_mask
train_mask = data.train_mask
val_mask = data.val_mask
test_mask = data.test_mask

number of nodes: 2708
number of features: 1433
number of classes: 7


In [76]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset
import numpy as np

class MLP(nn.Module):
    def __init__(self, dim_in, dim_h, dim_out):
        super().__init__()
        self.linear1 = nn.Linear(dim_in, dim_h)
        self.linear2 = nn.Linear(dim_h, dim_out)

    def forward(self, x):
        x = self.linear1(x)
        x = torch.relu(x)
        x = self.linear2(x)
        return x
    
mlp = MLP(dataset.num_features, 16, dataset.num_classes)
optimizer = optim.Adam(mlp.parameters(), lr=0.01, weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()

train_data = TensorDataset(torch.as_tensor(data.x[train_mask,:]),torch.as_tensor(data.y[train_mask]))
test_data = TensorDataset(torch.as_tensor(data.x[test_mask,:]),torch.as_tensor(data.y[test_mask]))

n_epochs = 200

def accuracy(y_pred, y_true):
    return torch.sum(y_pred == y_true) / len(y_true)


for epoch in range(n_epochs):
    prediction = mlp(train_data.tensors[0])
    loss = loss_fn(prediction,train_data.tensors[1])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_acc = accuracy(torch.argmax(prediction, dim=1),train_data.tensors[1])

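    # Evaluate on the test nodes. Note that `loss` is overwritten here,
    # so the value printed below is the test loss, not the training loss.
    with torch.no_grad():
        prediction = mlp(test_data.tensors[0])
        loss = loss_fn(prediction,test_data.tensors[1])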
        test_acc = accuracy(torch.argmax(prediction, dim=1),test_data.tensors[1])
    if epoch%10==0:
        print(f'Epoch: {epoch}, Loss: {loss.item():.4f}, Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}')

Epoch: 0, Loss: 1.9247, Train Accuracy: 0.1214, Test Accuracy: 0.2130
Epoch: 10, Loss: 1.5306, Train Accuracy: 0.9929, Test Accuracy: 0.4990
Epoch: 20, Loss: 1.3995, Train Accuracy: 1.0000, Test Accuracy: 0.5070
Epoch: 30, Loss: 1.4643, Train Accuracy: 1.0000, Test Accuracy: 0.4810
Epoch: 40, Loss: 1.4934, Train Accuracy: 1.0000, Test Accuracy: 0.4950
Epoch: 50, Loss: 1.4664, Train Accuracy: 1.0000, Test Accuracy: 0.5050
Epoch: 60, Loss: 1.4076, Train Accuracy: 1.0000, Test Accuracy: 0.5210
Epoch: 70, Loss: 1.3570, Train Accuracy: 1.0000, Test Accuracy: 0.5230
Epoch: 80, Loss: 1.3292, Train Accuracy: 1.0000, Test Accuracy: 0.5260
Epoch: 90, Loss: 1.3164, Train Accuracy: 1.0000, Test Accuracy: 0.5280
Epoch: 100, Loss: 1.3045, Train Accuracy: 1.0000, Test Accuracy: 0.5290
Epoch: 110, Loss: 1.2928, Train Accuracy: 1.0000, Test Accuracy: 0.5280
Epoch: 120, Loss: 1.2843, Train Accuracy: 1.0000, Test Accuracy: 0.5320
Epoch: 130, Loss: 1.2776, Train Accuracy: 1.0000, Test Accuracy: 0.5360
Epo

## Naive Approach #2

*DeepWalk* is a groundbreaking method that uses techniques from language modeling to learn how to represent nodes in a network. It focuses on the local connections and relationships between nodes to create these representations.


DeepWalk works in two main steps:

- Random Walks: It explores the network by randomly moving from one node to another, capturing the local structure and neighborhood relationships.
- SkipGram: It then uses a technique called SkipGram to learn node embeddings, which are representations that incorporate the patterns and structures found in the first step.

### Random Walk:

In this step, the goal is to find neighborhoods in the network. Here's how it works:

Generate Random Walks: We start at each node and create a set number (k) of random paths, each with a fixed length (l). After this process, we have k sequences per starting node, each sequence being at most l nodes long. The intuition is that nodes next to each other in these paths are similar and should have similar embeddings. Nodes that frequently appear together in these random walks are considered similar, as edges in a network usually connect similar or interacting nodes.

### SkipGram:

The SkipGram algorithm is a well-known method used to learn word embeddings. It was introduced by Mikolov and his team in their famous word2vec paper.

Here's how it works:

- Context Windows: Given a text and a window size, SkipGram looks at words that appear close to each other within these windows (also called contexts).
- Learning Embeddings: The algorithm aims to make the embeddings (representations) of words that occur in the same context similar to each other.

The idea behind this is simple: words that appear together often usually have similar meanings, so their embeddings should be close to each other.
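
To make the notion of a context window concrete, here is a minimal sketch that lists the (center, context) pairs SkipGram would train on for a short sentence. The function name `skipgram_pairs` and the example sentence are made up for illustration.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token and its neighbors within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["graphs", "connect", "similar", "nodes"]
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

With `window=1`, each word is paired only with its immediate left and right neighbors; a larger window produces more pairs per word.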

For an intuitive tutorial on Word2Vec, see https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/


To use SkipGram for networks:

- Context in Networks: Treat each random walk as a context, similar to how word windows work in text.
- Starting Point: Begin with random vectors for each node in the network.
- Embedding Updates: Iterate over the random walks and adjust the node embeddings so that nodes appearing together in a walk end up with similar embeddings. The adjustment is made by a simple neural network trained with gradient descent, where a softmax over all nodes gives the probability that a node appears in the context of another.
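
The three steps above can be sketched in a few lines of NumPy. This is a simplified illustration with a full softmax and plain stochastic gradient descent, not the optimized training used by word2vec libraries; the toy walks, the embedding size, and the learning rate are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy walks over 6 nodes: two triangles {0, 1, 2} and {3, 4, 5}
walks = [[0, 1, 2, 0, 1], [1, 2, 0, 2, 1], [3, 4, 5, 3, 4], [4, 5, 3, 5, 4]]
n_nodes, dim, window, lr = 6, 4, 1, 0.1

# Context in networks: nodes at most `window` steps apart in the same walk
pairs = [(w[i], w[j]) for w in walks for i in range(len(w))
         for j in range(max(0, i - window), min(len(w), i + window + 1)) if j != i]

# Starting point: random vectors for each node
W_in = rng.normal(scale=0.1, size=(n_nodes, dim))   # node embeddings
W_out = rng.normal(scale=0.1, size=(n_nodes, dim))  # output ("context") vectors

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mean_loss():
    """Average cross-entropy of predicting the context node from the center node."""
    return float(np.mean([-np.log(softmax(W_out @ W_in[c])[o]) for c, o in pairs]))

loss_before = mean_loss()

# Embedding updates: one gradient-descent step per (center, context) pair
for epoch in range(100):
    for c, o in pairs:
        h = W_in[c].copy()
        p = softmax(W_out @ h)           # predicted distribution over context nodes
        p[o] -= 1.0                      # gradient of the loss w.r.t. the scores
        W_in[c] -= lr * (W_out.T @ p)    # update the center node's embedding
        W_out -= lr * np.outer(p, h)     # update the output vectors

loss_after = mean_loss()
print(f"mean loss: {loss_before:.3f} -> {loss_after:.3f}")
```

After training, the rows of `W_in` are the node embeddings. In practice, libraries such as gensim replace the full softmax with cheaper approximations, which matters once the graph has more than a handful of nodes.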

<center><img src="images/deepwalk.jpg" width=400></center>
<center><small>image from https://www.geeksforgeeks.org/deepwalk-algorithm/</small></center>

Example of DeepWalk in Python:

To use DeepWalk for node classification on the Cora dataset, we'll follow these steps:

- Load the Cora dataset.
- Generate random walks to capture the local structure of the graph.
- Learn node embeddings using the SkipGram model.
- Train a classifier (e.g., logistic regression) using the learned embeddings.

In [78]:
import random

def generate_random_walks(G, num_walks, walk_length):
    walks = []
    nodes = list(G.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for node in nodes:
            walk = [node]
            while len(walk) < walk_length:
                cur = walk[-1]
                neighbors = list(G.neighbors(cur))
                if neighbors:
                    walk.append(random.choice(neighbors))
                else:
                    break
            walks.append(walk)
    return walks

In [109]:
import numpy as np
import torch
import torch_geometric
from torch_geometric.datasets import Planetoid
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load Cora dataset
dataset = Planetoid(root=".", name="Cora")
data = dataset[0]

# The data object contains train_mask, val_mask, and test_mask
train_mask = data.train_mask
val_mask = data.val_mask
test_mask = data.test_mask

# Convert to NetworkX graph for random walk generation
G = torch_geometric.utils.to_networkx(data, to_undirected=True)
# Parameters
num_walks = 10
walk_length = 80
# Generate random walks
walks = generate_random_walks(G, num_walks, walk_length)

from gensim.models import Word2Vec

# Convert walks to strings for gensim
walks_str = [[str(node) for node in walk] for walk in walks]

# Train Word2Vec model
model = Word2Vec(walks_str, vector_size=128, window=10, min_count=0, sg=1, workers=4, epochs=10)

# Extract embeddings
node_embeddings = {int(node): model.wv[str(node)] for node in G.nodes()}

# Prepare data for classification
X_train = np.array([node_embeddings[node.item()] for node in torch.where(train_mask)[0]])
y_train = np.array(data.y[train_mask])

# Train logistic regression classifier
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)

# Evaluate the classifier
X_test = np.array([node_embeddings[node.item()] for node in torch.where(test_mask)[0]])
y_test = np.array(data.y[test_mask])
y_pred = classifier.predict(X_test)
# Named test_accuracy to avoid shadowing the accuracy() helper defined earlier
test_accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy of DeepWalk is: {test_accuracy:.4f}')

Test Accuracy of DeepWalk is: 0.7060


A vanilla neural network (NN) focuses solely on the attributes of nodes, disregarding the underlying structure of the graph. This means it cannot leverage the relationships and connections between nodes, which are often crucial for tasks like node classification. On the other hand, DeepWalk excels in capturing the structural information of the graph by generating node embeddings based on random walks, but it completely ignores node attributes. These two approaches are at opposite extremes: one neglects structure and the other overlooks attributes. A more integrated approach is needed to combine both node features and graph structure, providing a more comprehensive representation that leverages the strengths of both attributes and connectivity for more effective learning on graph-structured data.

<center><img src="images/Two-extremes.png" width=500></center>

So how can we consider both attributes and connectivity?

## Graph Neural Network: Combining Attributes and Connectivity

Suppose our graph neural network starts with a graph and a set of feature vectors, where each node in the graph has its own feature vector.
<center><img src="images/gcn_intro.png" width="300"></center>

Let $X \in \mathbb{R}^{n \times d}$ represent the feature matrix, where $n$ denotes the number of nodes and $d$ is the feature dimension. To apply a single layer of a neural network to these features, a straightforward approach generates new embedding features as:

\begin{equation*}
 H^{(1)} = \sigma(XW), 
\end{equation*}

where $\sigma$ is a point-wise nonlinear function (such as ReLU), and $W \in \mathbb{R}^{d \times b}$ is a weight matrix learned during training. However, this equation overlooks the local neighborhood of each node. To incorporate connectivity, we modify the approach as follows:

\begin{equation*}
 H^{(1)} = \sigma(AXW),
\end{equation*}

where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix. This concept is illustrated in the figure below:

<center><img src="images/GCN_naive.png" width="500"></center>

Notice how the embedding of node A now depends on its *neighbors* B and C. The adjacency matrix encodes which pairs of nodes in the graph are connected. Multiplying the feature matrix by this adjacency matrix sums, for each node, the features of its neighbors.
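
A small numerical check can make this aggregation concrete. The sketch below assumes a three-node graph in which A is connected to B and C (and B and C are not connected to each other); the feature values are made up.

```python
import numpy as np

# Node order: A, B, C. A is connected to B and C.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)

# One 2-dimensional feature vector per node (values are made up)
X = np.array([[1.0, 0.0],   # x_A
              [0.0, 2.0],   # x_B
              [3.0, 1.0]])  # x_C

AX = A @ X
print(AX[0])  # row for node A: x_B + x_C = [3. 3.]
```

The row for node A is the sum of the feature vectors of B and C. Node A's own features, `[1.0, 0.0]`, do not appear in it.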

However, there are still two problems:

- **Problem #1:** The new embedding for each node does not consider the feature of the node itself.
    - **Solution:** Add an identity matrix $I$ to $A$ to create a new adjacency matrix $\tilde{A}$. This addition includes self-loops, ensuring the central node is also considered:

    \begin{equation*} \tilde{A} = A + I \end{equation*}

- **Problem #2:** If node $A$ has 1,000 neighbors and node $B$ has only 1, the embedding $H_A$ will have much larger values than $H_B$. This disparity makes meaningful comparisons between embeddings difficult.
    - **Solution:** Normalize the embeddings by the number of neighbors. 

    \begin{equation*} H[i,:] = \frac{1}{\text{deg}(i)} \sum_{j \in \mathcal{N}_i} X[j,:]W \end{equation*}

Let $D$ be the diagonal degree matrix, where $D[i,i] = \sum_{j=1}^n \tilde{A}[i,j]$. The inverse of this matrix $D^{-1}$ provides the normalization coefficients.

To incorporate normalization, we consider:

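- $D^{-1} \tilde{A} XW$ normalizes each row of $\tilde{A}$, so every node takes the average of the features in its neighborhood.
- $\tilde{A} D^{-1} XW$ normalizes each column of $\tilde{A}$, so every node's contribution is scaled by its own degree.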
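
The two options can be compared numerically. The sketch below uses a made-up star graph in which node 0 is connected to the three other nodes, and checks which sums equal one under each normalization.

```python
import numpy as np

# Star graph: node 0 has 3 neighbors, the others have 1 (a made-up example)
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)

A_tilde = A + np.eye(4)          # add self-loops
deg = A_tilde.sum(axis=1)        # degrees including self-loops: [4, 2, 2, 2]
D_inv = np.diag(1.0 / deg)

row_norm = D_inv @ A_tilde       # each row sums to 1
col_norm = A_tilde @ D_inv       # each column sums to 1
print(row_norm.sum(axis=1))
print(col_norm.sum(axis=0))

# With row normalization, node 0 receives the mean over itself and its neighbors
X = np.arange(8, dtype=float).reshape(4, 2)
print((row_norm @ X)[0], X.mean(axis=0))
```

Because node 0 is connected to every node in this example, its row-normalized aggregate equals the mean of all feature vectors.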

Combining both, we derive the normalization formula:

\begin{equation*} D^{-\frac{1}{2}} \tilde{A} D^{-\frac{1}{2}} XW \end{equation*}

This defines a layer of a *Graph Convolutional Network (GCN)*.
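
For readers who prefer code to formulas, here is a minimal NumPy sketch of such a layer, with the nonlinearity $\sigma$ taken to be ReLU. The function name `gcn_layer`, the toy graph, and the random weight matrix are assumptions for illustration; a real implementation would learn $W$ and use sparse matrices.

```python
import numpy as np

def gcn_layer(A, H, W):
    """Sketch of one GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))   # diagonal of D^{-1/2}
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0), A_hat

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
W = np.random.default_rng(0).normal(size=(2, 4))      # random stand-in for learned weights

H1, A_hat = gcn_layer(A, X, W)
print(H1.shape)     # (3, 4)
print(A_hat[0, 1])  # 1 / sqrt(3 * 2): node 0 has degree 3, node 1 degree 2 (with self-loops)
```

Each entry of the normalized matrix is $1/\sqrt{d_i d_j}$ for connected nodes $i$ and $j$, so the matrix stays symmetric, unlike the row-only or column-only variants.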

## GCN: More formal explanation

*[Note: borrowed heavily from excellent blog https://mbernste.github.io/posts/gcn/]*

A GCN consists of several graph convolutional layers that progressively change the feature vectors at each node. The result is a graph where each node has a new output vector. These output vectors can have different dimensions than the original input vectors.

**Message Passing** is the core concept of GCN layers. The convolutional layer $l$ in a GCN takes the node vectors from the previous layer, $H^{(l-1)}$ (for the first layer these are the input feature vectors, that is, $H^{(0)}=X$), and creates new output vectors for each node. It does this by combining the vectors from each node's neighbors, as shown below:

<center><img src="images/GCN_overview.png" width="500"></center>

<div style="text-align: center;">
<i>Node $A$’s vector, $x_A$, is pooled/aggregated with the vectors of its neighbors, $x_B$ and $x_C$. This pooled vector is then transformed/updated to form node $A$’s vector in the next layer. The same procedure is carried out for every node.</i></div>

<center><small>image from https://mbernste.github.io/posts/gcn/ </small></center>

The intuitive concept of message-passing is shown in the following gif. Each node sends its vector to its neighbors to help update their vectors. Essentially, the "message" from each node is its own vector.

<center><img src="images/message_passing.gif"></center>
<center><small>gif from https://medium.com/@ashes192000/graph-neural-networks-part-4-e54251d0256d</small></center>
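
The message-passing view can also be written as an explicit loop over nodes. The sketch below is a simplified illustration that only sums messages; it leaves out the degree normalization and the learned weight matrix, and the toy graph and helper name `message_passing_step` are made up.

```python
import numpy as np

# Adjacency list of a hypothetical toy graph: node 0 is connected to nodes 1 and 2
neighbors = {0: [1, 2], 1: [0], 2: [0]}
x = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 2.0]),
     2: np.array([3.0, 1.0])}

def message_passing_step(neighbors, x):
    """Each node sums its own vector and the vectors ("messages") sent by its neighbors."""
    new_x = {}
    for node, nbrs in neighbors.items():
        inbox = [x[n] for n in nbrs]          # messages received from neighbors
        new_x[node] = x[node] + sum(inbox)    # aggregate them with the node's own vector
    return new_x

new_x = message_passing_step(neighbors, x)
print(new_x[0])  # x_0 + x_1 + x_2 = [4. 3.]

# The same result in matrix form: (A + I) X
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = np.stack([x[i] for i in range(3)])
print((A + np.eye(3)) @ X)
```

The loop and the matrix product compute the same thing, which is why the layer can be written compactly in terms of the adjacency matrix.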

The graph convolutional layer can be seen as a function that takes two inputs, the adjacency matrix $A$ and the node vectors from the previous layer $H^{(l-1)}$, and produces a matrix with the updated vectors for each node:

\begin{equation*} H^{(l)}=\sigma \big(D^{-\frac{1}{2}} (A+I) D^{-\frac{1}{2}} H^{(l-1)}W\big) \end{equation*}

<small>*Note that the matrix $D$ is the degree matrix of $A+I$.*</small>

A diagram showing multiple stacked graph convolutional layers is shown below:

<center><img src="images/GCN_stacked.png" width="400"></center>
<center><small> image from https://mbernste.github.io/posts/gcn/</small></center>
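
The following sketch stacks two such layers in NumPy to show two effects: the output vectors can have a different dimension at each layer, and each additional layer lets a node receive information from one hop further away. The path graph, the layer sizes, and the random (untrained) weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

# Path graph 0 - 1 - 2 - 3 (a made-up example)
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_hat = normalized_adjacency(A)

H0 = rng.normal(size=(4, 5))             # input features X
W1 = rng.normal(size=(5, 8))             # layer-1 weights (random, untrained)
W2 = rng.normal(size=(8, 3))             # layer-2 weights (random, untrained)

H1 = np.maximum(A_hat @ H0 @ W1, 0.0)    # layer 1
H2 = np.maximum(A_hat @ H1 @ W2, 0.0)    # layer 2
print(H0.shape, H1.shape, H2.shape)      # (4, 5) (4, 8) (4, 3)

# One layer: node 0 does not see node 2. Two layers: it does, via node 1.
print(A_hat[0, 2] == 0.0, (A_hat @ A_hat)[0, 2] > 0.0)  # True True
```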

## Okay... but where is the convolution?

In fact, a Graph Convolutional Network (GCN) performs operations similar to those in Convolutional Neural Networks (CNNs) used for images. In a GCN, we can think of the message passing process as moving a filter (or kernel) over each node in the graph. When the filter is on a node, it gathers and combines data from nearby nodes to create the output for that node.

<img src="images/CNN_filter.png" style="width: 45%; display: inline-block;">
<img src="images/GCN_filter.png" style="width: 45%; display: inline-block;">
<center>convolution in CNN vs convolution in GCN</center>
<center><small>images from https://mbernste.github.io/posts/gcn/</small></center>

As in a CNN, a GCN passes a filter over each node, and the values of the **neighboring nodes** are combined to form the output value at the next layer.
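
A one-dimensional example makes the analogy tangible. In the sketch below, a row of five "pixels" is treated both as a signal filtered by a 3-tap averaging kernel and as a path graph whose nodes average over themselves and their neighbors. Note the simplifications: it uses mean (row) normalization rather than the symmetric GCN normalization, and a fixed averaging kernel rather than learned weights; the pixel values are made up.

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 8.0, 5.0])  # a 1-D "image" with 5 pixels

# CNN view: slide a 3-tap averaging kernel over the signal (zero padding at the ends)
kernel = np.ones(3) / 3
cnn_out = np.convolve(x, kernel, mode="same")

# Graph view: the same pixels as a path graph 0-1-2-3-4 with self-loops
n = len(x)
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_tilde = A + np.eye(n)
graph_out = (A_tilde / A_tilde.sum(axis=1, keepdims=True)) @ x

# Interior pixels have exactly two neighbors, so both views agree there
print(cnn_out[1:-1])
print(graph_out[1:-1])
```

The two outputs differ only at the two end pixels, where the convolution pads with zeros while the graph version averages over the neighbors that actually exist. This hints at the key difference: a graph filter has to cope with a varying number of neighbors per node.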

## GCN in Python using the PyTorch Geometric library

This code implements and trains a Graph Convolutional Network (GCN) using the PyTorch Geometric library on the Cora citation dataset. The GCN model consists of two graph convolutional layers that process the graph-structured data to classify nodes into different classes. 

In [21]:
from torch_geometric.datasets import Planetoid
import torch_geometric
from torch_geometric.nn import GCNConv
import torch.nn as nn
import torch.optim as optim
import torch

# Load the Cora dataset
dataset = Planetoid(root=".", name="Cora")
data = dataset[0]

# Extract training, validation, and test masks from the data
# These masks are boolean arrays indicating which nodes are used for training, validation, and testing
train_mask = data.train_mask
val_mask = data.val_mask
test_mask = data.test_mask

# Define the GCN (Graph Convolutional Network) model
class GCN(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        # Define the first graph convolution layer
        self.gcn1 = GCNConv(in_dim, hidden_dim)
        # Define the second graph convolution layer
        self.gcn2 = GCNConv(hidden_dim, out_dim)
    
    def forward(self, x, edge_index):
        # Apply the first graph convolution layer and ReLU activation
        h = self.gcn1(x, edge_index)
        h = torch.relu(h)
        # Apply the second graph convolution layer
        h = self.gcn2(h, edge_index)
        return h

# Instantiate the GCN model
gcn = GCN(dataset.num_features, 16, dataset.num_classes)

# Define the optimizer (Adam) and the loss function (Cross Entropy Loss)
optimizer = optim.Adam(gcn.parameters(), lr=0.01, weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()

# Set the number of training epochs
n_epochs = 200

# Define a function to calculate accuracy
def accuracy(y_pred, y_true):
    return torch.sum(y_pred == y_true) / len(y_true)

# Training loop
for epoch in range(n_epochs):
    # Forward pass: Compute predictions
    prediction = gcn(data.x, data.edge_index)
    
    # Compute the loss on the training data
    loss = loss_fn(prediction[train_mask, :], data.y[train_mask])
    
    # Zero the gradients, perform backpropagation, and update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Calculate training accuracy
    train_acc = accuracy(torch.argmax(prediction[train_mask, :], dim=1), data.y[train_mask])

    # Evaluate on the test nodes, reusing the predictions from the forward pass above
    # (these were computed before this epoch's weight update)
    with torch.no_grad():
        test_acc = accuracy(torch.argmax(prediction[test_mask, :], dim=1), data.y[test_mask])
    
    # Print progress every 10 epochs
    if epoch % 10 == 0:
        print(f'Epoch: {epoch}, Loss: {loss.item():.4f}, Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}')


Epoch: 0, Loss: 1.9465, Train Accuracy: 0.1500, Test Accuracy: 0.0980
Epoch: 10, Loss: 0.6023, Train Accuracy: 0.9857, Test Accuracy: 0.7270
Epoch: 20, Loss: 0.1072, Train Accuracy: 1.0000, Test Accuracy: 0.7960
Epoch: 30, Loss: 0.0270, Train Accuracy: 1.0000, Test Accuracy: 0.7840
Epoch: 40, Loss: 0.0144, Train Accuracy: 1.0000, Test Accuracy: 0.7880
Epoch: 50, Loss: 0.0128, Train Accuracy: 1.0000, Test Accuracy: 0.7920
Epoch: 60, Loss: 0.0143, Train Accuracy: 1.0000, Test Accuracy: 0.7960
Epoch: 70, Loss: 0.0159, Train Accuracy: 1.0000, Test Accuracy: 0.7970
Epoch: 80, Loss: 0.0166, Train Accuracy: 1.0000, Test Accuracy: 0.8010
Epoch: 90, Loss: 0.0162, Train Accuracy: 1.0000, Test Accuracy: 0.8010
Epoch: 100, Loss: 0.0154, Train Accuracy: 1.0000, Test Accuracy: 0.7980
Epoch: 110, Loss: 0.0145, Train Accuracy: 1.0000, Test Accuracy: 0.7970
Epoch: 120, Loss: 0.0138, Train Accuracy: 1.0000, Test Accuracy: 0.7960
Epoch: 130, Loss: 0.0131, Train Accuracy: 1.0000, Test Accuracy: 0.8010
Epo

## Resources

- https://mbernste.github.io/posts/gcn/
- https://medium.com/analytics-vidhya/an-intuitive-explanation-of-deepwalk-84177f7f2b72
- LABONNE, Maxime. "Hands-On Graph Neural Networks Using Python: Practical techniques and architectures for building powerful graph and deep learning apps with PyTorch." Packt Publishing Ltd, 2023.