## Task 9: How Expressive are GNNs

Neighbor aggregation is a key step in any GNN framework. It can be abstracted as a function over a multi-set, that is, a set that may contain repeated elements. The expressive power of a GNN can be characterized by that of the neighbor aggregation function it uses:

- A more expressive aggregation function leads to a more expressive GNN.
- An **injective aggregation function** leads to the most expressive GNN.

In this task, we theoretically analyze the expressive power of different aggregation functions.

## GCN

GCN (Kipf & Welling, ICLR 2017) uses **element-wise mean pooling** over neighboring node features, $\mathrm{Mean}(\{x_u\}_{u\in N(v)})$, followed by a linear function and a ReLU activation.

GCN’s aggregation function <span style="color:red">cannot distinguish different multi-sets</span> with the <span style="color:blue">same color proportion</span> (Xu et al. ICLR 2019).

Assume node features are one-hot encoded, so that each color corresponds to one feature dimension. The failure case is illustrated below:

![](https://joyenjoye-assets.s3.ap-northeast-2.amazonaws.com/datawhale_team_learning/202302_graph_ml/GCN_failure.png)
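The failure case can also be checked numerically. The following is a minimal sketch in plain Python, assuming two hypothetical one-hot colors `YELLOW` and `BLUE` and ignoring the linear layer and ReLU that follow the pooling step:

```python
def mean_pool(multiset):
    """Element-wise mean over a multi-set of equal-length feature vectors."""
    n = len(multiset)
    return [sum(column) / n for column in zip(*multiset)]

# Hypothetical one-hot colors, used only for illustration.
YELLOW = [1, 0]
BLUE = [0, 1]

small = [YELLOW, BLUE]                # 1 yellow, 1 blue
large = [YELLOW, YELLOW, BLUE, BLUE]  # 2 yellow, 2 blue

print(mean_pool(small))  # [0.5, 0.5]
print(mean_pool(large))  # [0.5, 0.5]
```

Both multi-sets have the same color proportion, so mean pooling maps them to the same vector, and nothing applied after the pooling step can separate them again.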


## GraphSAGE


GraphSAGE (Hamilton et al. NeurIPS 2017) applies a Multi-Layer Perceptron (MLP) to each neighboring node feature and then uses **element-wise max pooling** over the results.

GraphSAGE’s aggregation function <span style="color:red">cannot distinguish different multi-sets</span> with the <span style="color:blue">same set of distinct colors</span> (Xu et al. ICLR 2019).

The failure case is illustrated below:

![](https://joyenjoye-assets.s3.ap-northeast-2.amazonaws.com/datawhale_team_learning/202302_graph_ml/GraphSAGE_failure.png)
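A small numerical check of this failure case, again as a sketch in plain Python. It assumes hypothetical one-hot colors `YELLOW` and `BLUE` and, for simplicity, treats the MLP applied before pooling as the identity:

```python
def max_pool(multiset):
    """Element-wise max over a multi-set of equal-length feature vectors."""
    return [max(column) for column in zip(*multiset)]

# Hypothetical one-hot colors, used only for illustration.
YELLOW = [1, 0]
BLUE = [0, 1]

a = [YELLOW, BLUE]
b = [YELLOW, YELLOW, BLUE]
c = [YELLOW, BLUE, BLUE, BLUE]

for multiset in (a, b, c):
    print(max_pool(multiset))  # [1, 1] each time
```

All three multi-sets contain the same set of distinct colors, so max pooling discards the counts and returns the same vector for each of them.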

## GIN

To obtain a maximally powerful GNN within the class of message-passing GNNs, we can design an injective neighbor aggregation function over multi-sets. An injective multi-set function can be expressed as:

$$\Phi\left(\sum_{x\in S}f(x)\right)$$


where $\Phi$ and $f$ are non-linear functions, and $\sum_{x\in S}$ sums over the input multi-set $S$.
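To see why summation helps, here is a minimal sketch in plain Python. It assumes hypothetical one-hot colors `YELLOW` and `BLUE`, takes $f$ to be the identity on one-hot inputs, and leaves out $\Phi$:

```python
def sum_pool(multiset):
    """Element-wise sum over a multi-set of equal-length feature vectors."""
    return [sum(column) for column in zip(*multiset)]

# Hypothetical one-hot colors, used only for illustration.
YELLOW = [1, 0]
BLUE = [0, 1]

small = [YELLOW, BLUE]                # same proportion as `large`
large = [YELLOW, YELLOW, BLUE, BLUE]
mixed = [YELLOW, YELLOW, BLUE]        # same distinct colors as `small`

print(sum_pool(small))  # [1, 1]
print(sum_pool(large))  # [2, 2]
print(sum_pool(mixed))  # [2, 1]
```

With one-hot inputs the sum is simply the vector of color counts, so different multi-sets give different outputs, including the pairs that mean and max pooling cannot tell apart.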

Graph Isomorphism Network (GIN) uses a neural network to model this injective multi-set function (Xu et al. ICLR 2019). Specifically, $\Phi$ and $f$ are each modeled by an MLP:

$$\mathrm{MLP}_\Phi\left(\sum_{x\in S}\mathrm{MLP}_f(x)\right)$$

By the **Universal Approximation Theorem**, an MLP with one hidden layer of sufficiently large dimensionality and an appropriate non-linearity can approximate any continuous function to arbitrary accuracy (Hornik et al. 1989). In practice, an MLP hidden dimensionality of 100 to 500 is usually sufficient.


Graph Isomorphism Network (GIN) is <span style="color:green">the most expressive GNN among the above message-passing GNNs</span>.

### WL Kernel

GIN is closely related to the Weisfeiler-Lehman (WL) kernel: it is a "neural network" version of the WL graph kernel. Recall that the WL kernel is computed with the **color refinement algorithm**:
- Assign an initial color $c^{(0)}(v)$ to each node $v$.
- Iteratively refine node colors by

    $$c^{(k+1)}(v) = \mathrm{HASH}\left(\underbrace{c^{(k)}(v)}_\text{root node features},\underbrace{\left\{c^{(k)}(u)\right\}_{u\in N(v)}}_\text{neighboring node features}\right)\tag{1}\label{eq:wl}$$

    where $\mathrm{HASH}$ maps different inputs to different colors.

- After $K$ steps of color refinement, $c^{(K)}(v)$ summarizes the structure of the $K$-hop neighborhood.


Note that the **HASH function used in the color refinement algorithm is injective**.
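The color refinement steps can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation: graphs are hypothetical adjacency dictionaries, every node starts with the same color `0`, and the injective $\mathrm{HASH}$ is emulated by using the pair (own color, sorted neighbor colors) directly as the new color, which keeps colors comparable across graphs.

```python
from collections import Counter

def color_refinement(adj, num_iters):
    """Run WL color refinement and return the histogram of final colors.

    adj maps each node to the list of its neighbors.
    """
    colors = {v: 0 for v in adj}  # identical initial color for every node
    for _ in range(num_iters):
        # New color = (own color, sorted multi-set of neighbor colors).
        # Using the nested tuple itself as the color acts as an injective HASH.
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())

# Two small graphs with 4 nodes each (hypothetical examples).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(color_refinement(path, 2) == color_refinement(star, 2))  # False

# Two disjoint triangles vs. one 6-cycle: every node has degree 2.
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(color_refinement(two_triangles, 3) == color_refinement(hexagon, 3))  # True
```

The path and the star get different color histograms, so color refinement distinguishes them. The two triangles and the 6-cycle are a case it cannot separate: all nodes look identical at every step, so the histograms coincide.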


GIN essentially uses a **neural network** to model the injective $\mathrm{HASH}$ function. Specifically, it models the injective function as follows:

$$\text{GINConv}\left(\underbrace{c^{(k)}(v)}_\text{root node features},\underbrace{\left\{c^{(k)}(u)\right\}_{u\in N(v)}}_\text{neighboring node features}\right) = \mathrm{MLP}_\Phi\left((1+\epsilon)\cdot\mathrm{MLP}_f\left(c^{(k)}(v)\right)+\sum_{u\in N(v)} \mathrm{MLP}_f\left(c^{(k)}(u)\right)\right)$$

where $\epsilon$ is a learnable scalar. If the node features $c^{(k)}(v)$ are one-hot encoded, direct summation is already injective, so only $\Phi$ is needed to ensure injectivity. The update can then be written as:

$$\text{GINConv}\left(\underbrace{c^{(k)}(v)}_\text{root node features},\underbrace{\left\{c^{(k)}(u)\right\}_{u\in N(v)}}_\text{neighboring node features}\right) = \mathrm{MLP}_\Phi\left((1+\epsilon)\cdot c^{(k)}(v)+\sum_{u\in N(v)} c^{(k)}(u)\right)\tag{2}\label{eq:gin_conv}$$
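The one-hot form of the update can be sketched in plain Python as follows. The function names, the fixed value of `eps`, and the hand-picked MLP weights are all hypothetical placeholders for illustration; in a real GIN layer both $\epsilon$ and the MLP parameters would be learned.

```python
def gin_aggregate(root, neighbors, eps):
    """Compute (1 + eps) * c(v) + sum of neighbor features, element-wise."""
    total = [(1.0 + eps) * x for x in root]
    for neighbor in neighbors:
        total = [t + x for t, x in zip(total, neighbor)]
    return total

def mlp(x, w1, b1, w2, b2):
    """One-hidden-layer MLP with ReLU; w1 and w2 are lists of weight rows."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def gin_conv(root, neighbors, eps, params):
    """Sketch of the one-hot GINConv update: MLP applied to the aggregate."""
    return mlp(gin_aggregate(root, neighbors, eps), *params)

# Hypothetical one-hot colors and placeholder (not learned) MLP weights.
YELLOW, BLUE = [1, 0], [0, 1]
params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[1.0, 2.0]], [0.0])

# With eps = 0 the root and its neighbor are interchangeable.
print(gin_aggregate(YELLOW, [BLUE], 0.0))  # [1.0, 1.0]
print(gin_aggregate(BLUE, [YELLOW], 0.0))  # [1.0, 1.0]

# With eps = 0.5 the root color is weighted differently from the neighbors.
print(gin_aggregate(YELLOW, [BLUE], 0.5))  # [1.5, 1.0]
print(gin_aggregate(BLUE, [YELLOW], 0.5))  # [1.0, 1.5]

print(gin_conv(YELLOW, [BLUE], 0.5, params))  # [3.5]
print(gin_conv(BLUE, [YELLOW], 0.5, params))  # [4.0]
```

The example suggests why the $(1+\epsilon)$ factor matters: it lets the update separate the root node's own color from the colors of its neighbors, mirroring the two separate arguments of $\mathrm{HASH}$.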


Comparing $\eqref{eq:wl}$ with $\eqref{eq:gin_conv}$ shows that GIN can be understood as a neural network version of the WL graph kernel.

The advantages of GIN over the WL graph kernel are:
- Node embeddings are low-dimensional; hence, they can capture the fine-grained similarity of different nodes.
- Parameters of the update function can be learned for the downstream tasks.

Because of this relation between GIN and the WL graph kernel, their expressive power is exactly the same. The WL kernel has been shown, both theoretically and empirically, to distinguish most real-world graphs (Cai et al. 1992). Hence, GIN is also powerful enough to distinguish most real-world graphs.

<!-- ## Tips & Tricks
- Data preprocessing is important
    - Node attributes can vary a lot! Use normalization E.g. probability ranges (0,1), but some inputs could have much larger range, say (−1000, 1000)
- Optimizer: ADAM is relatively robust to learning rate
- Activation function
    - ReLU activation function often works well
    - Other good alternatives: LeakyReLU, PReLU
    - No activation function at your output layer
    - Include bias term in every layer
- Embedding dimensions: 32, 64 and 128 are often good starting points

- Debug issues: Loss/accuracy not converging during training
    - Check pipeline (e.g. in PyTorch we need zero_grad)
    - Adjust hyperparameters such as learning rate
    - Pay attention to weight parameter initialization
    - Scrutinize loss function!
- Model development
    - Overfit on (part of) training data: With a small training dataset, loss should be essentially close to 0, with an expressive neural network
    - Monitor the training & validation loss curve -->

## References

\[1\][CS224W: Machine Learning with Graphs](http://web.stanford.edu/class/cs224w/)  
\[2\][Theory of Graph Neural Network Slides. Stanford CS224W: Machine Learning with Graphs | 2023](http://web.stanford.edu/class/cs224w/slides/07-theory.pdf)