## Applications of Graph Neural Networks

#### Over-Smoothing Problem
Stacking too many GNN layers leads to the over-smoothing problem: all node embeddings converge to the same value. This is a serious problem because we rely on node embeddings to differentiate nodes.

The **receptive field** is the set of nodes that determine the embedding of a node of interest. In a K-layer GNN, the receptive field of each node is its K-hop neighborhood. The number of neighbors shared by two nodes grows quickly as we increase the number of hops (that is, the number of GNN layers).
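As a rough illustration of how shared neighbors grow with the number of hops, the sketch below computes K-hop receptive fields with a breadth-first search. The toy path graph, the adjacency-list representation, and the function name `k_hop_receptive_field` are assumptions made for this example, not part of the original notes.

```python
from collections import deque

def k_hop_receptive_field(adj, node, k):
    """Return every node within k hops of `node`, including `node` itself."""
    seen = {node}
    frontier = deque([(node, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in adj[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# Hypothetical toy graph (not from the text): the path 0-1-2-3-4-5.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

overlaps = []
for k in range(1, 5):
    shared = k_hop_receptive_field(adj, 1, k) & k_hop_receptive_field(adj, 4, k)
    overlaps.append(len(shared))
    print(f"K={k}: nodes 1 and 4 share {len(shared)} receptive-field nodes")
```

On this toy graph, nodes 1 and 4 share no receptive-field nodes at K=1 and share the entire graph at K=4.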

Over-smoothing can be explained via the notion of the receptive field. The embedding of a node is determined by its receptive field, so if two nodes have highly overlapping receptive fields, their embeddings will be highly similar.

Stacking many GNN layers leads to nodes with highly overlapping receptive fields. Their embeddings then become highly similar, which is exactly the over-smoothing problem. How do we overcome this?
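The effect can be sketched with a deliberately simplified "layer" that only averages each node with its neighbors. This is an assumption for illustration: real GNN layers also have learned weights and nonlinearities, and the toy graph and scalar features below are hypothetical. The spread between the largest and smallest embedding shrinks as layers are stacked.

```python
def mean_aggregate(adj, h):
    """One parameter-free layer: each node averages itself and its neighbors."""
    return {
        v: (h[v] + sum(h[u] for u in adj[v])) / (1 + len(adj[v]))
        for v in adj
    }

# Hypothetical toy graph: the path 0-1-2-3-4-5, with scalar features 0.0 .. 5.0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
h = {v: float(v) for v in adj}

# Track how far apart the most different embeddings are after each layer.
spreads = [max(h.values()) - min(h.values())]
for _ in range(50):
    h = mean_aggregate(adj, h)
    spreads.append(max(h.values()) - min(h.values()))

for num_layers in (0, 2, 10, 50):
    print(f"{num_layers:2d} layers: spread = {spreads[num_layers]:.4f}")
```

Because every update is an average, the spread can never increase, and on a connected graph it keeps shrinking toward zero.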

The first lesson is that we need to be cautious when adding GNN layers. Adding more GNN layers does not always help.

We can also make shallow GNNs more expressive. One option is to turn the aggregation/transformation step inside each GNN layer into a deep neural network. Another is to add layers that do not pass messages, since a GNN does not have to consist only of GNN layers. For example, we can add MLP layers (applied to each node) before and after the GNN layers as pre-processing and post-processing layers.

**Pre-processing layers** are important when node features need to be encoded (e.g. when nodes represent images or text). **Post-processing layers** are important when reasoning or transformation over node embeddings is needed (e.g. graph classification, knowledge graphs). These layers work really well in practice.

If we absolutely need many layers, we can add skip connections (shortcuts). The basic idea is that a layer which computes F(x) without a shortcut computes F(x) + x with one. This creates a mixture of models: N skip connections lead to $2^{N}$ possible paths, because at each one the signal can either pass through the layer or take the shortcut. Each path can contain up to N modules, so we automatically get a mixture of shallow GNNs and deep GNNs.
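A minimal sketch of both ideas follows. The function `layer` is only a stand-in for a real GNN layer F (here an assumed fixed scaling), and `path_lengths` is a hypothetical helper that enumerates the routes through N layers with skip connections.

```python
from collections import Counter
from itertools import product

def layer(x):
    """Stand-in for a GNN layer F (assumption: a fixed scaling by 0.5)."""
    return [0.5 * v for v in x]

def layer_with_skip(x):
    """The same layer with a shortcut: F(x) + x."""
    return [fx + xi for fx, xi in zip(layer(x), x)]

def path_lengths(num_layers):
    """Count paths by how many modules F they pass through."""
    routes = product(("F", "skip"), repeat=num_layers)
    return Counter(route.count("F") for route in routes)

print(layer_with_skip([2.0, 4.0]))  # F(x) + x for x = [2.0, 4.0]

lengths = path_lengths(3)
print(sum(lengths.values()), "paths in total")
print(dict(lengths))  # number of paths per depth (modules traversed)
```

With 3 skip connections there are 8 paths: one of depth 0, three of depth 1, three of depth 2, and one of depth 3, which is the mixture of shallow and deep models described above.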

#### Graph Augmentation for GNNs
It's highly unlikely that the raw input graph happens to be the optimal computational graph for embeddings. We may not want to use the raw input graph as the computational graph because it may lack features, or because it may be too sparse, too dense, or too large.

Graph feature augmentation is used when the graph lacks features. Graph structure augmentation is used when the graph is too sparse, too dense, or too large.

The standard approaches to feature augmentation are assigning a constant value to every node and assigning a unique ID to every node, where the IDs are converted into one-hot vectors.
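Both approaches are easy to sketch for a graph with no input features. The function names and the choice of 1.0 as the constant are assumptions for illustration.

```python
def constant_features(num_nodes, value=1.0):
    """Give every node the same 1-dimensional feature."""
    return [[value] for _ in range(num_nodes)]

def one_hot_features(num_nodes):
    """Give node i a unique ID, encoded as a one-hot vector of length num_nodes."""
    return [
        [1.0 if i == j else 0.0 for j in range(num_nodes)]
        for i in range(num_nodes)
    ]

print(constant_features(3))
print(one_hot_features(3))
```

Note the trade-off visible in the shapes: constant features stay 1-dimensional regardless of graph size, while one-hot features have one dimension per node.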

Feature augmentation is also important when certain structures are really hard for GNNs to learn.

When we want to augment sparse graphs, we can add virtual edges or virtual nodes.
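The sketch below shows one possible form of each augmentation on an adjacency-list graph. Connecting 2-hop neighbors as virtual edges and connecting a single virtual node to every existing node are assumed choices for this example; the notes do not fix a particular scheme, and both helper names are hypothetical.

```python
def add_two_hop_edges(adj):
    """Virtual edges: also connect each node to the neighbors of its neighbors."""
    new_adj = {}
    for v, neighbors in adj.items():
        reachable = set(neighbors)
        for u in neighbors:
            reachable.update(adj[u])
        reachable.discard(v)  # no self-loops
        new_adj[v] = sorted(reachable)
    return new_adj

def add_virtual_node(adj):
    """Virtual node: add one extra node connected to every existing node."""
    virtual = max(adj) + 1  # assumes integer node IDs
    new_adj = {v: list(neighbors) + [virtual] for v, neighbors in adj.items()}
    new_adj[virtual] = sorted(adj)
    return new_adj, virtual

# Hypothetical sparse graph: the path 0-1-2-3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

print(add_two_hop_edges(adj))
augmented, virtual = add_virtual_node(adj)
print(virtual, augmented)
```

After adding the virtual node, any two original nodes are at most two hops apart, which shortens message-passing paths in a sparse graph.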

#### GNN Predictions
Several prediction heads are possible, covering node-level, edge-level, and graph-level tasks. Different task levels require different prediction heads.

For node-level prediction, we can make predictions directly from node embeddings. For edge-level prediction, we can make predictions from pairs of node embeddings. For graph-level prediction, we can make predictions from all the node embeddings in the graph.
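The three heads can be sketched as follows. The specific choices here (a linear map for nodes, a dot product for edges, mean pooling for graphs) are assumptions for illustration, as are the toy embeddings and the weight matrix; other heads fit the same description.

```python
def node_head(h_v, weight):
    """Node level: linear map of one embedding (weight has shape k x d)."""
    return [sum(w * x for w, x in zip(row, h_v)) for row in weight]

def edge_head(h_u, h_v):
    """Edge level: score a pair of embeddings with a dot product."""
    return sum(a * b for a, b in zip(h_u, h_v))

def graph_head(embeddings, weight):
    """Graph level: mean-pool all node embeddings, then apply the linear map."""
    dim = len(embeddings[0])
    pooled = [sum(h[i] for h in embeddings) / len(embeddings) for i in range(dim)]
    return node_head(pooled, weight)

# Hypothetical 2-dimensional embeddings for three nodes, and a 1 x 2 weight matrix.
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
weight = [[1.0, -1.0]]

print(node_head(h[0], weight))               # uses one node embedding
print(edge_head(h[0], h[2]))                 # uses a pair of node embeddings
print(graph_head(list(h.values()), weight))  # uses all node embeddings
```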

In supervised learning, the labels come from external sources tied to a specific use case. In unsupervised learning, the training signal comes from the graph itself. The difference between the two can be blurry, since we still have "supervision" in unsupervised learning. Supervised labels include node labels, edge labels, and graph labels. It is best to reduce your task to node/edge/graph labels since they're easiest to work with.

To compute the final loss, there are two common choices: classification loss and regression loss. Classification is used when the labels take discrete values; regression is used when the labels take continuous values. GNNs can be applied to both settings. The two differ in their loss functions and evaluation metrics.


#### Classification Loss
Cross entropy (CE) is a very common loss function in classification. For K-way prediction on the i-th data point:

$CE(y^{(i)},\hat{y}^{(i)}) = -\sum^{K}_{j=1} y_j^{(i)} \log(\hat{y}_{j}^{(i)})$

where:

$y^{(i)} \in \mathbb{R}^{K}$ = one-hot label encoding

$\hat{y}^{(i)} \in \mathbb{R}^{K}$ = prediction after Softmax

The total loss over all N training examples is:

$Loss = \sum^{N}_{i=1}CE(y^{(i)},\hat{y}^{(i)})$
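The two formulas above translate fairly directly into code. This is a sketch using only the standard library; the function names are assumptions, and skipping zero-label terms is a choice made here to avoid evaluating `log(0)` (it relies on the usual convention that 0 · log 0 = 0).

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, y_hat):
    """CE for one data point: y is one-hot, y_hat is the softmax output."""
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

def total_loss(labels, predictions):
    """Sum of the per-example cross entropies over all N training examples."""
    return sum(cross_entropy(y, y_hat) for y, y_hat in zip(labels, predictions))

# Hypothetical 3-way example with two data points.
labels = [[0, 1, 0], [1, 0, 0]]
predictions = [softmax([0.0, 2.0, 0.0]), softmax([0.0, 0.0, 0.0])]

print(cross_entropy(labels[0], predictions[0]))
print(total_loss(labels, predictions))
```

Since the label is one-hot, only the predicted probability of the true class contributes to each example's loss.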

#### Regression Loss
For regression tasks we often use mean squared error (MSE), also known as L2 loss.

For K-way regression on the i-th data point:

$MSE(y^{(i)},\hat{y}^{(i)}) = \sum^{K}_{j=1} (y^{(i)}_{j} - \hat{y}^{(i)}_{j})^2$

where:

$y^{(i)} \in \mathbb{R}^{K}$ = real-valued vector of targets

$\hat{y}^{(i)} \in \mathbb{R}^{K}$ = real-valued vector of predictions
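A minimal sketch of the per-example formula above, with a hypothetical function name and toy values. It follows the formula as written, which sums the K squared differences; some libraries divide by K to report a true mean, and that variant is not what is implemented here.

```python
def squared_error(y, y_hat):
    """Per-example loss as written above: sum of squared differences over K targets."""
    return sum((t - p) ** 2 for t, p in zip(y, y_hat))

# Hypothetical 2-way regression example.
target = [1.0, 2.0]
prediction = [0.0, 4.0]

print(squared_error(target, prediction))  # (1 - 0)^2 + (2 - 4)^2
```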

## Theory of Graph Neural Networks