#### Graph Neural Networks (a class of deep learning models that can natively learn patterns from graph structures)

In the following exercise we'll cover two variants of GNN:
1. graph convolutional network (GCN)
2. graph attention network (GAT)

Capturing the graph structure in addition to the node-specific information is what gave rise to GNNs as a new class of deep learning models; otherwise, regular NNs would have been sufficient.

So, in the process of embedding nodes or edges, the embedding function uses the graph structure and its information to place nodes with similar properties close to each other and to embed edges (the relationships between nodes) in a (roughly) similar direction.

Using the adjacency matrix and node(edge)-level features, we can train a regular NN model or, more specifically, a GNN variant.

For a GNN we need more than just the adjacency matrix and node(edge)-level information:
- Because of the neighborhood constraint (reaching nodes more than just *one* edge away), we use computational graphs: each node in the graph is represented by its node-level features, and if a node has multiple neighboring nodes, these neighbors' features are **aggregated** before being propagated to the given node. For each node in the graph, we create a **computational graph** in which we represent the node’s neighborhood *layer-wise*, each layer representing a different degree of connection. **It is important to note that weights are shared between all NNs of a given layer and between all computational graphs**. This weight sharing prevents the number of overall model parameters from exploding as the number of graph nodes or the depth of the computational graph increases. Weight sharing also makes the system robust against different orderings of nodes or the addition of a new node to the original graph.

- With the **aggregation** mechanism we address the limitation of permutation and rotation of the nodes in the computational graph, because the ordering of nodes in the graph doesn’t matter anymore. This implicitly means that the aggregation function needs to be order independent, such as *sum, average, maximum, and minimum*. We cannot use aggregation functions like concatenation, whose output changes with the order of the neighbors.

- By using computational graphs, we don't need to retrain the whole model from scratch when a new node is added; we only need to create one more computational graph for the newly added node.

- Also, with this method (computational graphs) we address the sparsity of the adjacency matrix, because we do not use the adjacency matrix information with computational graphs and rely only on the *intrinsic node features*.
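The order-independence requirement on the aggregation function can be checked numerically. The following is a minimal sketch using NumPy and a made-up 3-neighbor feature matrix (both are assumptions for illustration, not part of the exercise data): it aggregates the same neighbors in two different orders and compares the results.

```python
import numpy as np

# Hypothetical toy data: features of three neighbors of some node (3 neighbors x 2 features).
neighbor_feats = np.array([[1.0, 0.0],
                           [0.0, 2.0],
                           [3.0, 1.0]])

# The same neighbors, visited in a different order.
shuffled = neighbor_feats[[2, 0, 1]]

# Order-independent aggregators give identical results for both orderings.
aggregators = {"sum": np.sum, "mean": np.mean, "max": np.max, "min": np.min}
invariant = {
    name: bool(np.allclose(fn(neighbor_feats, axis=0), fn(shuffled, axis=0)))
    for name, fn in aggregators.items()
}
print(invariant)

# Concatenation depends on the neighbor order, so the two results differ.
concat_same = bool(np.array_equal(neighbor_feats.reshape(-1), shuffled.reshape(-1)))
print("concatenation invariant:", concat_same)
```

All four reductions report `True`, while the concatenated vectors differ, which is why concatenation is ruled out as an aggregator.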

We have three types of graph learning tasks:
1. For **Node-level tasks**, the **latent feature representations** of each node (final output of the node’s computational graph) are used to train a downstream task.
2. Typically, an **edge-level task** uses the accompanying *node features* as well as the *edge features* to train a downstream classification task such as predicting the type of relationship. To find an equivalent in the world of computer vision, **image scene understanding** is an appropriate example of an edge-level task: different objects in an image are the nodes, the relationships between these objects are the edges, and the goal is to predict the type of relationship between these objects in the image.
3. In **graph-level tasks**, we predict a class or a numerical value for the entire graph by using the (permutation-invariant) aggregation of the latent features of all nodes in the graph. In the world of images, image classification is the graph-level equivalent, because all pixels (nodes) of the image (graph) are used to attribute a single value to the entire image (graph).
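To make the three task types concrete, here is a small sketch of how the input to a downstream model could be assembled from the same matrix of latent node features. The matrix `H`, the node indices, and the use of concatenation for the edge-level input are all illustrative assumptions; other ways of combining the endpoint embeddings are possible.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))  # hypothetical latent features: 5 nodes x 4 dimensions

# Node-level: the latent vector of a single node feeds the downstream model.
node_input = H[2]

# Edge-level: combine the two endpoint embeddings (concatenation is one option;
# edge features, if available, could be appended as well).
u, v = 0, 3
edge_input = np.concatenate([H[u], H[v]])

# Graph-level: a permutation-invariant readout over all nodes (mean here).
graph_input = H.mean(axis=0)

# Reordering the nodes does not change the graph-level readout.
perm = rng.permutation(len(H))
assert np.allclose(graph_input, H[perm].mean(axis=0))

print(node_input.shape, edge_input.shape, graph_input.shape)
```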

**more on computational graph**:

The concept of a computational graph in Graph Neural Networks (GNNs) refers to the structured representation of how computations are organized and executed during the training and inference processes. This graph is not the same as the input graph data; instead, it is a tool used to describe the flow of data and operations within the neural network itself.

- **Forward** and **Backward** Passes:
    - Forward Pass: This involves propagating input data through the network, applying each operation in sequence to compute the output. In GNNs, this includes **message passing** between nodes to update their representations.
    - Backward Pass: This is used during training to compute gradients of the loss function with respect to model parameters, facilitating optimization via backpropagation.
  
**message passing mechanism**:

$$h_v^{(l+1)} = \sigma \left( \sum_{u \in \mathcal{N}(v)} W^{(l)} h_u^{(l)} \right)$$
where:
- $h_v^{(l)}$: embedding of node $v$ at layer $l$
- $\mathcal{N}(v)$: set of neighbors of node $v$
- $W^{(l)}$: learnable weight matrix (parameters)
- $\sigma$: activation function (e.g., ReLU)
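The update rule above can be traced by hand on a tiny graph. The sketch below is a plain NumPy rendering of one message-passing step, assuming ReLU as $\sigma$ and a hypothetical three-node path graph; the function name `message_passing_step` is made up for this illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def message_passing_step(h, neighbors, W):
    """One step of h_v <- relu(sum over u in N(v) of W h_u).

    h: (N, d_in) embeddings at layer l
    neighbors: dict mapping node id -> list of neighbor ids
    W: (d_out, d_in) weight matrix shared by all nodes
    """
    h_next = np.zeros((h.shape[0], W.shape[0]))
    for v, nbrs in neighbors.items():
        # Sum the transformed embeddings of all neighbors of v.
        agg = sum((W @ h[u] for u in nbrs), np.zeros(W.shape[0]))
        h_next[v] = relu(agg)
    return h_next

# Hypothetical path graph 0 - 1 - 2 with 2-dimensional features.
h0 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}
W = np.array([[1.0, -1.0],
              [0.5, 0.5]])

h1 = message_passing_step(h0, neighbors, W)
print(h1)
```

Nodes 0 and 2 each have node 1 as their only neighbor, so they end up with the same embedding after this step, while node 1 receives the sum of two messages.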

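The same mechanism can be written in matrix form as the **GCN layer of Kipf & Welling (2016)**:
$$H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)$$
where $\tilde{A} = A + I$ is the adjacency matrix with added self-loops, $\tilde{D}$ is its diagonal degree matrix, and $H^{(l)}$ stacks the node embeddings $h_v^{(l)}$ as rows.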

Let's separate the process to understand how GCNs leverage the graph topology under the message passing mechanism:
1. Aggregation (structure-driven): $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)}$:
   - This part computes a degree-normalized sum (a weighted average) over a node’s neighbors and the node itself.
   - It is structure-aware — it considers the graph topology.
   - No trainable parameters here.
   - You’re collecting messages — **“raw influence” from neighbors**.
  
2. Transformation (feature-driven): $\left( \text{Aggregated Messages} \right) \cdot W^{(l)}$:
   - Now, you apply the *same* linear transformation to all nodes.
   - $W^{(l)}$ acts like an MLP layer: it projects node features from $\mathbb{R}^{d_{in}} \rightarrow \mathbb{R}^{d_{out}}$
   - This is the **trainable part**.
   - The weight matrix learns what to “look for” in the aggregated messages. It’s not learning the importance of each neighbor per se — it’s learning how to interpret the combined signal from the neighbors.
  
So, $W$ acts after the neighbors are combined. It says “*when a signal arrives from the neighborhood, how do I process it*?”
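The two steps above can be sketched directly in NumPy. This is an illustrative dense implementation on a made-up three-node graph, assuming ReLU as the activation; it is not how PyTorch Geometric implements `GCNConv` internally, which works on sparse edge lists.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: relu(D~^{-1/2} A~ D~^{-1/2} H W), split into two steps."""
    A_tilde = A + np.eye(A.shape[0])      # add self-loops
    deg = A_tilde.sum(axis=1)             # node degrees including the self-loop
    D_inv_sqrt = np.diag(deg ** -0.5)
    A_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt

    aggregated = A_norm @ H               # step 1: aggregation, no trainable parameters
    transformed = aggregated @ W          # step 2: shared trainable transformation
    return np.maximum(transformed, 0.0), A_norm

# Hypothetical path graph 0 - 1 - 2 with 2 input features and 3 output features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.array([[1.0, -1.0, 0.5],
              [0.5, 0.5, 2.0]])

H_next, A_norm = gcn_layer(A, H, W)
print(A_norm.round(3))
print(H_next.shape)
```

Note that `A_norm` depends only on the graph structure, while `W` is the only quantity that would be updated during training.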

One important characteristic of GNNs and their variants is:
- Unlike CNNs (spatial locality) or RNNs (temporal locality), **GNNs leverage the graph topology**. The same parameters $W$ are shared across all nodes, enabling generalization across arbitrary graph structures.

### Reviewing prominent GNN models:
- GCN
- GAT
- GraphSAGE

##### GCN

First, let's remind ourselves that the term **convolution** in the GCN architecture comes from the **shared weights** used across computational graphs.

<img src=../images/GCN-computation_graph.png width=700>

The figure above shows the computational graph of node A, which embeds the features of its neighbors' neighbors:

A two-layer GCN-based node classification model, demonstrating the computational graph
for node A: in the first layer, the features from the second-level neighbors (neighbors of neighbors)
of node A are aggregated to produce a latent representation of node A’s neighbors; in the second
layer, these latent representations are aggregated to produce the final features for node A, which are
then used for node classification.

While we are using node classification as an example to discuss GCNs, GCNs can
also perform *graph classification, wherein the aggregation is performed across all nodes of the graph.*

A brief note about the **projection layer**:

A projection layer is a linear transformation that maps input features into a new space. In a GCN it corresponds to:
$$Z = \hat{A}_{norm} \cdot X \cdot W$$

where:
- $X$: input node feature matrix $\in \mathbb{R}^{N \times F}$
- $W$: learnable weight matrix $\in \mathbb{R}^{F \times F'}$
- $\hat{A}_{norm} = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$: normalized adjacency matrix
- $Z$: output features $\in \mathbb{R}^{N \times F'}$

It projects node features from their original space (e.g., 5-dim) to a new embedding space (e.g., 64-dim). This is equivalent to a Linear Layer (also known as fully connected / dense).

So the difference between a fully connected (feed-forward) layer and a GCN layer is that the GCN layer consists of a fully connected layer **but also contains the neighborhood feature aggregation component**, which is the secret sauce that makes GCNs work well on graph datasets.
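One way to see the relationship between the two layers is to remove all edges: the normalized adjacency matrix then becomes the identity and the projection formula collapses to a plain linear layer $XW$. The sketch below checks this with NumPy on random data; the helper `normalized_adjacency` and the graph sizes are assumptions chosen for illustration, and bias terms are left out.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix A."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = A_hat.sum(axis=1) ** -0.5
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))   # 4 nodes, 5 input features
W = rng.normal(size=(5, 3))   # projection to 3 output features

# No edges: the GCN projection is exactly the linear layer X @ W.
A_empty = np.zeros((4, 4))
Z_no_edges = normalized_adjacency(A_empty) @ X @ W
assert np.allclose(Z_no_edges, X @ W)

# A single edge 0 - 1: only the two connected nodes get mixed features.
A_edge = np.zeros((4, 4))
A_edge[0, 1] = A_edge[1, 0] = 1.0
Z_edge = normalized_adjacency(A_edge) @ X @ W
changed_rows = ~np.isclose(Z_edge, X @ W).all(axis=1)
print(changed_rows)
```

Only the rows of nodes that have a neighbor differ from the linear-layer output, which isolates the contribution of the aggregation component.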

##### GAT

While GCNs use the averaging of information from neighbors, which already extracts valuable graph
information, a lot of work has been done on *finding better ways of aggregating information from neighbors*. An important milestone in this aspect of GNN research is GAT.

As noted above, GCN uses averaging to aggregate information from neighboring node features. This
has an inherent limitation, as it assumes that all neighbors are to be treated equally, which might not
necessarily be the case. For example, if two nodes, $X$ and $Y$, have the same initial feature values and
the same set of neighbors, a GCN model would assign them to the same class or cluster. But this
might not be true. To capture this level of nuanced information in graphs, we can replace the simple
averaging mechanism with the attention mechanism. This is where GATs come into play.

In the context of GATs, attention allows the model to place different weights on different
neighbors of a node while classifying the node type, thereby enabling a more complex and powerful
model. With the attention mechanism, we learn an attention coefficient for each neighbor, which adds more
trainable parameters to the model.
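As a rough illustration of what "learning attention coefficients" means, the sketch below computes softmax-normalized weights for the neighbors of one node, following the scoring form used in the original GAT paper (a LeakyReLU applied to a learned vector `a` dotted with the concatenated, projected features). The text above does not spell out this formula, so treat the details as an assumption; the data is random, only a single attention head is shown, and the self-loop and output nonlinearity are omitted for brevity.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_coefficients(h, W, a, v, neighbors):
    """Softmax-normalized attention weights of node v over its neighbors.

    h: (N, d_in) node features, W: (d_in, d_out) shared projection,
    a: (2 * d_out,) attention vector (trainable in a real model).
    """
    z = h @ W
    scores = np.array(
        [leaky_relu(a @ np.concatenate([z[v], z[u]])) for u in neighbors]
    )
    scores = np.exp(scores - scores.max())  # numerically stable softmax
    return scores / scores.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
a = rng.normal(size=4)

nbrs = [1, 2, 3]
alpha = attention_coefficients(h, W, a, v=0, neighbors=nbrs)

# Attention-weighted aggregation replaces the plain average used by GCN.
h0_new = (alpha[:, None] * (h[nbrs] @ W)).sum(axis=0)
print(alpha, alpha.sum())
```

With uniform `alpha` this reduces to a plain mean over the neighbors, which is the GCN-style behavior the attention mechanism generalizes.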

The image below demonstrates the difference between the GCN and GAT architectures:

<img src=../images/GCN-vs-GAT.png width=750>

In GAT, scalability can become a bottleneck on large graphs. This is where **GraphSAGE** comes into play.

##### GraphSAGE

The paper that introduced [GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf)

The paper introduces the proposed method as follows:

The key idea behind our approach is that we learn how to aggregate feature information from a
node’s local neighborhood (e.g., the degrees or text attributes of nearby nodes). We first describe
the GraphSAGE embedding generation (i.e., forward propagation) algorithm, which generates
embeddings for nodes assuming that the GraphSAGE model parameters are already learned. We then describe how the GraphSAGE model parameters can be learned using standard stochastic gradient descent and backpropagation techniques.

**GraphSAGE**, short for graph sample and aggregate, randomly and uniformly samples neighbors
for a given node and uses only these selected neighbors to extract graph information, as opposed to
GCN and GAT, which use all neighbors. This algorithm is hence useful for large and dense graphs.
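The sampling step can be sketched with the standard library alone. The adjacency dictionary and the helper `sample_neighbors` below are hypothetical; as a simplification, the helper returns all neighbors when a node has fewer than the requested number, whereas other implementations may sample with replacement in that case.

```python
import random

def sample_neighbors(adj, node, num_samples, rng):
    """Uniformly sample up to `num_samples` neighbors of `node` without replacement."""
    nbrs = adj[node]
    if len(nbrs) <= num_samples:
        return list(nbrs)
    return rng.sample(nbrs, num_samples)

# Hypothetical graph: node 0 is a hub with five neighbors.
adj = {
    0: [1, 2, 3, 4, 5],
    1: [0],
    2: [0, 3],
    3: [0, 2],
    4: [0],
    5: [0],
}

rng = random.Random(0)  # seeded for reproducibility
sampled_hub = sample_neighbors(adj, 0, num_samples=2, rng=rng)
sampled_leaf = sample_neighbors(adj, 1, num_samples=2, rng=rng)
print(sampled_hub, sampled_leaf)
```

Because the number of sampled neighbors is bounded, the cost of building each computational graph no longer grows with the degree of hub nodes.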

<img src=../images/GraphSAGE.png width=750>

Node Embedding algorithm provided in the paper:

<img src=../images/Node_embedding_GraphSAGE.png width=850>

The intuition behind the algorithm is that at each iteration, or search depth, nodes aggregate information
from their local neighbors, and as this process iterates, nodes incrementally gain more and more
information from further reaches of the graph.

So the main difference lies in the aggregation techniques that GraphSAGE provides, listed below:
- Mean aggregator: the paper calls this modified mean-based aggregator *convolutional*, since it is a rough, linear approximation of
a localized spectral convolution:
$$\mathbf{h}_v^k \leftarrow \sigma\left(\mathbf{W} \cdot \text{MEAN}\left(\left\{\mathbf{h}_v^{k-1}\right\} \cup \left\{\mathbf{h}_u^{k-1}, \forall u \in \mathcal{N}(v)\right\}\right)\right).$$

- LSTM aggregator: We also examined a more complex aggregator based on an LSTM architecture. Compared to the mean aggregator, LSTMs have the advantage of larger expressive capability. However, it is important to note that LSTMs are not inherently symmetric (i.e., they are not permutation invariant), since they process their inputs in a sequential manner. We adapt LSTMs to operate on an unordered set by simply applying the LSTMs to a random permutation of the node’s neighbors.

- Pooling aggregator: The final aggregator we examine is both symmetric and trainable. In this
pooling approach, each neighbor’s vector is independently fed through a fully-connected neural
network; following this transformation, an elementwise max-pooling operation is applied to aggregate
information across the neighbor set:
$$\text{AGGREGATE}_k^{\text{pool}} = \max\left(\left\{\sigma\left(\mathbf{W}_{\text{pool}}\mathbf{h}_{u_i}^k + \mathbf{b}\right), \forall u_i \in \mathcal{N}(v)\right\}\right)$$
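The mean and pooling aggregators above translate into a few lines of NumPy. In this sketch ReLU stands in for the nonlinearity $\sigma$ and all weights are random placeholders, both of which are assumptions; the function names are made up for illustration. The LSTM aggregator is left out because it needs a recurrent cell.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mean_aggregator_update(h_v, h_nbrs, W):
    """h_v <- relu(W . MEAN({h_v} U {h_u for u in N(v)}))."""
    stacked = np.vstack([h_v[None, :], h_nbrs])
    return relu(W @ stacked.mean(axis=0))

def pool_aggregate(h_nbrs, W_pool, b):
    """Elementwise max over independently transformed neighbor vectors."""
    transformed = relu(h_nbrs @ W_pool.T + b)
    return transformed.max(axis=0)

rng = np.random.default_rng(0)
h_v = rng.normal(size=3)            # embedding of the target node
h_nbrs = rng.normal(size=(4, 3))    # embeddings of 4 sampled neighbors
W = rng.normal(size=(2, 3))
W_pool = rng.normal(size=(5, 3))
b = rng.normal(size=5)

mean_out = mean_aggregator_update(h_v, h_nbrs, W)
pool_out = pool_aggregate(h_nbrs, W_pool, b)

# Both aggregators are symmetric: shuffling the neighbors changes nothing.
perm = rng.permutation(len(h_nbrs))
mean_same = np.allclose(mean_out, mean_aggregator_update(h_v, h_nbrs[perm], W))
pool_same = np.allclose(pool_out, pool_aggregate(h_nbrs[perm], W_pool, b))
print(mean_out.shape, pool_out.shape, mean_same, pool_same)
```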

### Building a GCN model using PyTorch Geometric
***Let's get hands-on***

In [1]:
!pip install torch==2.2
!pip install torch_geometric==2.4.0
!pip install seaborn==0.12.2
!pip install networkx==2.8.5
!pip install scikit-learn==1.3.2
!pip install matplotlib==3.5.2
!pip install pandas==1.4.3

In [2]:
# setting up the notebook width to 100% of the screen
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [3]:
%pip install decorator==5.0.9
# %pip install torch-geometric==2.3.1

%matplotlib inline
import os
import pandas as pd
import seaborn as sns
import networkx as nx
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

import torch
from torch.nn import Linear
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GATConv
from torch_geometric.utils import to_networkx
from torch_geometric.datasets import Planetoid

Note: you may need to restart the kernel to use updated packages.


#### Helper functions

In [4]:
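def visualize(data, labels):
    """Project `data` (rows = nodes) to 2-D with t-SNE and plot it colored by `labels`."""
    # Reduce the feature/embedding matrix to two dimensions.
    tsne = TSNE(n_components=2, init='pca', random_state=7)
    tsne_res = tsne.fit_transform(data)

    # Collect the original columns, the labels, and the 2-D coordinates in one frame.
    v = pd.DataFrame(data, columns=[str(i) for i in range(data.shape[1])])
    v['color'] = labels
    v['label'] = v['color'].apply(str)
    v['dim1'] = tsne_res[:, 0]
    v['dim2'] = tsne_res[:, 1]

    plt.figure(figsize=(12, 12))

    # One point per node, colored by its class label.
    sns.scatterplot(
        x="dim1", y="dim2",
        hue="color",
        palette=sns.color_palette(["#52D1DC", "#8D0004", "#845218", "#563EAA", "#E44658", "#63C100", "#FF7800"]),
        legend=False,
        data=v,
    )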

In [5]:
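def visualize_graph(G, color):
    """Draw the networkx graph `G` with a spring layout, coloring nodes by `color`."""
    plt.figure(figsize=(75,75))
    plt.xticks([])
    plt.yticks([])
    # A fixed seed keeps the layout reproducible between runs.
    nx.draw_networkx(G, pos=nx.spring_layout(G, seed=42), with_labels=False,
                     node_color=color, cmap="Set2")
    plt.show()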

#### Load graph data:

In [None]:
dataset = Planetoid(root='data/Planetoid', name='CiteSeer')
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

In [None]:
# describing the dataset
data = dataset[0]  # Get the first graph object.

print(data)
print('==============================================================')

# Gather some statistics about the graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
print(f'Number of training nodes: {int(data.train_mask.sum())}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
print(f'Has isolated nodes: {data.has_isolated_nodes()}')
print(f'Has self-loops: {data.has_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')

In [None]:
G = to_networkx(data)
visualize_graph(G, color=data.y)

#### First, let's build a baseline NN-based node classifier