# Task 8: Graph Neural Networks

In Task 2, we learnt traditional feature-based methods: for a given input graph, node-, link- and graph-level features are extracted so that they can be fed into a model (SVM, neural network) that maps features to target labels.

In Task 4, we learnt graph representation learning, which efficiently learns task-independent features for downstream models. It uses a ***shallow*** **Encoder** to map nodes to embeddings and a **Decoder** to map embeddings to a similarity score.

The limitations of shallow embedding methods are as follows:
- The number of parameters is $O(|V|)$, as there is no sharing of parameters between nodes and every node has its own unique embedding.
- They are inherently transductive and cannot generate embeddings for nodes that were not seen during training.
- Node features are not incorporated.
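To make these limitations concrete, here is a minimal sketch of a shallow encoder and decoder. The sizes, the random initialisation and the function names `encode` and `decode` are illustrative assumptions, not part of the original task.

```python
import numpy as np

rng = np.random.default_rng(0)

num_nodes, dim = 5, 4                      # |V| and embedding size (illustrative)
Z = rng.normal(size=(num_nodes, dim))      # one trainable row per node, nothing shared

def encode(v):
    """Shallow encoder: a plain embedding lookup."""
    return Z[v]

def decode(z_u, z_v):
    """Decoder: dot-product similarity score."""
    return float(z_u @ z_v)

score = decode(encode(0), encode(3))
num_params = Z.size                        # grows linearly with |V|

# A node id outside 0..num_nodes-1 has no row in Z, so it cannot be encoded
# without retraining. Node features never enter the computation.
```

The table `Z` is the only set of parameters here, which is why the parameter count scales with the number of nodes.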


In this task, we learn how to use deep learning methods (graph neural networks, GNNs) to get a deep encoder to map nodes to embeddings. 

Essentially, the deep encoder consists of multiple layers of non-linear transformations based on graph structure.




## Basics of Deep Learning

### Machine Learning as Optimization


#### Objective Function

Formulate the task as an optimization problem. Here, we optimize $\Theta$ to minimize the objective function $\mathcal{L}(\mathbf{y},f(\mathbf{x}))$.

$$\min_{\Theta} \mathcal{L}(\mathbf{y},f(\mathbf{x}))$$

Where $\Theta$ contains parameters of $f$. 

The [loss function](https://pytorch.org/docs/stable/nn.html) $\mathcal{L}$ can take many forms: L1, L2, Huber, max margin (hinge loss), cross entropy, etc. A common loss for classification is the cross entropy loss.

Consider a multi-class classification task where the target variable belongs to one of 3 classes: Class 1, Class 2, Class 3.


|Target Variable|
|:--:|
|Class 1|
|Class 3|
|Class 2|
|Class 2|
|Class 1|



Applying one-hot encoding to the target variable gives $\mathbf{y}$, an $n \times 3$ matrix with one row per instance, where ${y}_i$ is the actual value for the $i$-th class of a given instance.


|Class 1 ($y_1$)|Class 2 ($y_2$)|Class 3 ($y_3$)|
|:--:|:--:|:--:|
|1|0|0|
|0|0|1|
|0|1|0|
|0|1|0|
|1|0|0|



We want to train a model $\hat {\mathbf{y}} = f(\mathbf{x})$ to predict the probability of each class. $\hat {{y}}_i$ is the predicted value for the $i$-th class of a given instance.

|Class 1 ($\hat{y}_1$)|Class 2 ($\hat {y}_2$)|Class 3 ($\hat {y}_3$)|
|:--:|:--:|:--:|
|<span style="color:blue">0.80</span>|0.11|0.09|
|0.05|0.05|<span style="color:blue">0.90</span>|
|0.20|<span style="color:blue">0.70</span>|0.10|
|0.10|<span style="color:blue">0.88</span>|0.02|
|<span style="color:blue">0.65</span>|0.25|0.10|
    
The probabilities of the classes sum to 1 for each instance. We then predict that an instance belongs to the class with the maximum probability. In general, to ensure that the outputs of $f(x)$ are probabilities that sum to 1, the softmax function $\sigma$ is applied to the output $g(x)$ of the previous step:
    
$$\hat{{y}} = f(x) = \sigma\big(g(x)\big)$$ 

The predicted value for the $i$-th class of a given instance, $\hat{{y}}_i$, is thus:


$$ \hat{y}_i =f(x)_i = \sigma\big(g(x)\big)_i = \frac{e^{g(x)_i}}{ \sum_{j=1}^C e^{g(x)_j}}$$
    

Where $C$ is the number of classes. In the above example, $C=3$.
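The softmax step can be sketched in a few lines of NumPy. The raw scores `g_x` are hypothetical values chosen for illustration, and subtracting the maximum is a common numerical safeguard that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Map raw scores g(x) to probabilities that sum to 1."""
    shifted = scores - np.max(scores)      # same result, avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

g_x = np.array([2.0, 0.5, -1.0])           # hypothetical raw scores for C = 3 classes
y_hat = softmax(g_x)
predicted_class = int(np.argmax(y_hat)) + 1  # classes are numbered from 1
```

Because the exponential is monotonic, the class with the largest raw score also has the largest probability.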
    
   

The cross entropy loss for each instance is thus:
    
$$\mathrm{CE}\big(y,f(x)\big) =-\sum_{i=1}^C {y}_i \log \hat {{y}}_i =-\sum_{i=1}^C {y}_i \log f({x})_i$$
    
The lower the loss, the closer the prediction $\hat {{y}}$ is to the one-hot encoded true label ${y}$. Summing the loss over all training examples gives the loss function:
    
$$\mathcal{L}(\mathbf{y},f(\mathbf{x})) = \sum_{(x,y)\in\mathcal{T}} \mathrm{CE}\big(y,f(x)\big) $$

where $\mathcal{T}$ is the training set containing all pairs of data and labels.
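As a small worked example, the sketch below evaluates the cross entropy for three instances and sums them. The labels and probabilities are illustrative; the helper name `cross_entropy` is an assumption, not a library function.

```python
import numpy as np

def cross_entropy(y, y_hat):
    """CE for one instance: -sum_i y_i * log(y_hat_i)."""
    return float(-np.sum(y * np.log(y_hat)))

# One-hot labels and predicted probabilities for three instances (illustrative)
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0]])
Y_hat = np.array([[0.80, 0.11, 0.09],
                  [0.20, 0.70, 0.10],
                  [0.65, 0.25, 0.10]])

per_instance = [cross_entropy(y, y_hat) for y, y_hat in zip(Y, Y_hat)]
total_loss = sum(per_instance)             # loss summed over the training set
```

With one-hot labels, only the term for the true class survives, so each per-instance loss is simply the negative log of the probability assigned to the correct class.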


#### How to Optimize?

Once we have the objective function, the next question is how to optimize it?

**Gradient Descent** is an iterative algorithm that repeatedly updates the weights $\Theta$ in the opposite direction of the gradient until the objective function converges.


$$ \Theta \leftarrow  \Theta - \mathbf{\eta} \, \nabla_\Theta \mathcal{L} $$


Where $\eta$ is the learning rate, a hyperparameter that controls the size of the gradient step. It can vary over the course of training. The gradient vector $\nabla_\Theta \mathcal{L}$ can be computed as follows:



$$ \nabla_\Theta \mathcal{L}= (\frac{\partial \mathcal{L}}{\partial \, {\Theta_1} } ,\frac{\partial \mathcal{L}}{\partial\,{\Theta_2} },...)$$

Ideally, we would like to terminate the iteration when the gradient equals 0. In practice, we stop training when it no longer improves performance on the validation set.
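The update rule can be tried on a toy problem. The sketch below assumes a least-squares objective on synthetic, noise-free data; the data, the learning rate `eta = 0.005` and the fixed number of steps are illustrative choices rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true                          # noise-free targets

def loss(theta):
    return float(np.sum((X @ theta - y) ** 2))

def grad(theta):
    # Gradient of the squared error with respect to theta
    return 2.0 * X.T @ (X @ theta - y)

theta = np.zeros(2)
eta = 0.005                                 # learning rate
for step in range(500):
    theta = theta - eta * grad(theta)       # Theta <- Theta - eta * gradient
```

Every step here uses all 50 data points to compute the gradient, which is what becomes expensive on large datasets.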

The problem with gradient descent is that the exact gradient requires computing $\nabla_\Theta \mathcal{L}(\mathbf{y},f(\mathbf{x}))$ where $\mathbf{x}$ is the entire dataset, i.e. summing the gradient contributions over all data points in the dataset. Modern datasets often contain billions of data points, which makes every gradient descent step extremely expensive to compute.

One solution is **Stochastic Gradient Descent (SGD)**. At each step, it picks a different minibatch $\mathcal{B}$ containing a subset of the dataset and uses it as the input $\mathbf{x}$. The SGD process involves the following concepts:
- *Batch size*: the number of data points in a minibatch
- *Iteration*: 1 step of SGD on a minibatch
- *Epoch*: one full pass over the dataset (the number of iterations per epoch equals the dataset size divided by the batch size)

The SGD gradient is an unbiased estimator of the full gradient, but there is no guarantee on the rate of convergence, and in practice it often requires tuning of the learning rate. Common optimizers that improve over SGD include Adam, Adagrad, Adadelta and RMSprop.
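The relationship between batch size, iterations and epochs can be seen in a minimal minibatch loop. This is a sketch on synthetic least-squares data; the dataset size, batch size, learning rate and number of epochs are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true                             # noise-free targets

batch_size = 20
iterations_per_epoch = len(X) // batch_size    # 200 / 20 = 10
theta = np.zeros(2)
eta = 0.01                                     # learning rate
num_iterations = 0

for epoch in range(30):
    order = rng.permutation(len(X))            # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ theta - yb)  # gradient on the minibatch only
        theta = theta - eta * grad             # one iteration
        num_iterations += 1
```

Each iteration touches only 20 data points, so 30 epochs amount to 300 cheap updates instead of 30 full-dataset gradient computations.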


When updating the weights $\Theta$ iteratively, there are 2 steps in each iteration:
- **Forward Propagation**: compute $f(\mathbf{x})$ given $\mathbf{x}$ and the current $\Theta$. Use the computed $f(\mathbf{x})$ and the labels $\mathbf{y}$ to compute $\mathcal{L}$.

- **Back Propagation**: use the chain rule to propagate gradients through the intermediate steps, and finally obtain the gradient $\nabla_\Theta \mathcal{L}$.
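The two steps can be written out by hand for a very small model. The sketch below assumes a linear model $f(\mathbf{x}) = W\mathbf{x}$ with a squared-error loss (an illustrative choice), derives the gradient with the chain rule, and compares one entry against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
x = rng.normal(size=3)
y = rng.normal(size=2)

def forward(W):
    """Forward propagation: compute f(x), then the squared-error loss."""
    f = W @ x
    residual = y - f
    return float(residual @ residual), residual

loss, residual = forward(W)

# Back propagation via the chain rule:
# dL/dresidual = 2 * residual, dresidual/df = -1, df/dW_ij = x_j
grad_W = np.outer(-2.0 * residual, x)

# Central finite-difference check of a single entry of the gradient
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (forward(W_plus)[0] - forward(W_minus)[0]) / (2 * eps)
```

Deep learning frameworks automate exactly this bookkeeping, so the gradient rarely has to be derived manually.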


### Linear Function

In the previous section, we formulated machine learning as an optimization problem.

$$\min_{\Theta} \mathcal{L}(\mathbf{y},f(\mathbf{x}))$$

Now let's try to apply it. To start, consider a linear function.

$$f(\mathbf{x}) = W \cdot \mathbf{x}, \quad \Theta = \{W\}$$

- If $f$ returns a scalar, then $W$ is a weight vector.
- If $f$ returns a vector, then $W$ is a weight matrix, which is also the Jacobian matrix of $f$ with respect to $\mathbf{x}$.
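The link between $W$ and the Jacobian can be checked numerically. The matrix `W` and the point `x0` below are arbitrary illustrative values; the Jacobian is estimated column by column with finite differences.

```python
import numpy as np

W = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])           # 2 outputs, 3 inputs

def f(x):
    return W @ x

x0 = np.array([0.5, -1.0, 2.0])
eps = 1e-6
# Column j of the Jacobian is the change in f per unit change in x_j
jacobian = np.column_stack([
    (f(x0 + eps * np.eye(3)[j]) - f(x0)) / eps for j in range(3)
])
```

Up to floating-point error, `jacobian` matches `W` regardless of the choice of `x0`, because the function is linear.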

### 2-Layer Linear Network

To make it a bit more complex, let's look at a 2-layer linear network:

$$f(\mathbf{x}) = g(h(\mathbf{x})) =W_2( W_1 \mathbf{x}), \quad \Theta = \{W_1,W_2\}$$

Here we use $h(\mathbf{x})= W_1 \mathbf{x}$ to denote the hidden layer.

Assume we use the L2 loss as the objective function and SGD for optimization. Then for each minibatch $\mathcal{B}$, the loss can be calculated as follows:

$$ \mathcal{L} = \sum_{(x,y)\in\mathcal{B}} ||\mathbf{y}-f(\mathbf{x})||_2$$
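A minimal sketch of this network and its minibatch loss follows. The layer sizes and the random minibatch are assumptions for illustration. The last line also checks that stacking two linear layers is equivalent to a single matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))     # hidden layer: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))     # output layer: 4 hidden units -> 2 outputs

def f(x):
    h = W1 @ x                   # hidden layer h(x)
    return W2 @ h                # output g(h(x))

# A hypothetical minibatch of 5 (x, y) pairs
X_batch = rng.normal(size=(5, 3))
Y_batch = rng.normal(size=(5, 2))

# L2 loss summed over the minibatch: sum of ||y - f(x)||_2
loss = sum(np.linalg.norm(y - f(x)) for x, y in zip(X_batch, Y_batch))

# The two layers collapse into one matrix: W2 (W1 x) = (W2 W1) x
W_combined = W2 @ W1
```

Since `W_combined @ x` reproduces `f(x)` for every input, the extra layer adds parameters but no expressive power.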

### Multilayer Perceptron



Note that in the 2-layer linear network, $W_2 W_1$ is just another matrix, so $f(\mathbf{x})$ is still linear w.r.t. $\mathbf{x}$. To introduce non-linearity, we need to apply a non-linear transformation. Popular non-linear transformation functions include:

- Rectified Linear Unit(ReLU) 
    $$\mathrm{ReLU}(x)= \max(x,0)$$
    
- Sigmoid
   $$\sigma(x)= \frac{1}{1+e^{-x}}$$
   
Each layer of a Multilayer Perceptron (MLP) combines a linear and a non-linear transformation.

$$\mathbf{x}^{(l+1)} = \sigma(W_l\mathbf{x}^{(l)}+b^{l})$$

- $W_l$ is the weight matrix that transforms the hidden representation at layer $l$ to layer $l+1$,
- $b^{l}$ is the bias at layer $l$, and is added to the linear transformation of $\mathbf{x}$.
- $\sigma$ is the non-linearity (e.g., sigmoid)
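The layer update can be sketched as a short forward pass. The function name `mlp_forward`, the layer sizes and the random weights are illustrative assumptions; ReLU is used by default, and sigmoid can be passed in instead.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, biases, activation=relu):
    """Apply x <- activation(W x + b) once per layer."""
    for W, b in zip(weights, biases):
        x = activation(W @ x + b)
    return x

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]   # 3 -> 4 -> 2
biases = [rng.normal(size=4), rng.normal(size=2)]

x = rng.normal(size=3)
out_relu = mlp_forward(x, weights, biases)
out_sigmoid = mlp_forward(x, weights, biases, activation=sigmoid)
```

In practice the final layer often uses a different activation from the hidden layers (for example softmax for classification); the sketch applies the same one everywhere for brevity.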

<!-- ## Ideas for Deep Learning for Graphs


### Convolutional Networks

### Permutation Invariance

### Permutation Equivariance

### Graph Neural Network

Graph neural networks consist of multiple permutation equivariant / invariant functions.
 -->

## Graph Convolutional Networks

Assume we have a graph $G$:
- $V$ is the vertex set
- $\mathbf{A}$ is the adjacency matrix (assume binary)
- $\mathbf{X} \in \mathbb{R}^{m \times |V|}$ is a matrix of node features
- $v$ is a node in $V$ and $\mathrm{N}(v)$ is the set of neighbors of $v$.

Nodes aggregate information from their neighbors using neural networks. The important consideration is how to aggregate information across the layers. A basic approach is to average the neighbor messages and apply a neural network.

$$h_v^{(l+1)} = \sigma\big(W_l \sum_{u\in N(v)}\frac{h_u^{(l)}}{|\mathrm{N}(v)|}+\mathrm{B}_l h_v^{(l)} \big), \quad \forall \, l \in \{0,...,L-1\}$$

- $\sum_{u\in N(v)}\frac{h_u^{(l)}}{|\mathrm{N}(v)|}$ is the average of the neighbors' previous-layer embeddings
- $h_v^{(l)}$ is the embedding of $v$ at layer $l$. When $l=0$, $h_v^{(0)} = x_v$, i.e. the initial 0-th layer embedding equals the node features.
- $L$ is the total number of layers.
- $W_l$ is the weight matrix for neighborhood aggregation
- $B_l$ is the weight matrix for transforming hidden vector. 

After $L$ layers of neighborhood aggregation, we get the embedding of $v$ as follows:

$$z_v = h_v^{(L)}$$
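The node-wise update can be sketched directly from the formula. Several things below are assumptions for illustration: the 4-node graph, the random weights, the choice of ReLU for $\sigma$, and storing features one row per node (the transpose of $\mathbf{X}$ as defined above). The function name `gcn_layer` is hypothetical.

```python
import numpy as np

def gcn_layer(H, neighbors, W, B):
    """One layer: h_v <- ReLU(W * mean of neighbor embeddings + B * h_v)."""
    H_next = np.zeros((H.shape[0], W.shape[0]))
    for v, nbrs in neighbors.items():
        mean_nbr = H[nbrs].mean(axis=0)             # average over N(v)
        H_next[v] = np.maximum(W @ mean_nbr + B @ H[v], 0.0)
    return H_next

# Hypothetical 4-node graph with edges 0-1, 0-2, 1-2, 2-3
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                         # row v holds x_v
layers = [(rng.normal(size=(5, 3)), rng.normal(size=(5, 3))),   # (W_0, B_0)
          (rng.normal(size=(2, 5)), rng.normal(size=(2, 5)))]   # (W_1, B_1)

H = X                                               # h_v^(0) = x_v
for W, B in layers:
    H = gcn_layer(H, neighbors, W, B)
Z = H                                               # z_v = h_v^(L), here L = 2
```

The same `W` and `B` are applied at every node within a layer, so the number of parameters does not depend on $|V|$. The sketch assumes every node has at least one neighbor.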


### Matrix Formulation

Let 

$$H^{(l)} =[h_1^{(l)}, ... , h_{|V|}^{(l)}]^T$$

Then, writing $A_v$ for the row of $\mathbf{A}$ that corresponds to node $v$,

$$\sum_{u\in N(v)}{h_u^{(l)}} = A_vH^{(l)}$$
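This identity is easy to verify numerically. The adjacency matrix below describes a hypothetical 4-node graph. The diagonal degree matrix $D$ is not introduced in the text; it is used here only to show how the neighbor sum becomes a neighbor average for all nodes at once.

```python
import numpy as np

# Hypothetical binary adjacency matrix (edges 0-1, 0-2, 1-2, 2-3)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                    # row v is h_v

v = 2
neighbor_sum = sum(H[u] for u in np.flatnonzero(A[v]))   # explicit sum over N(v)
row_product = A[v] @ H                                    # A_v H

# Averaging for every node at once: D^{-1} A H, with D the diagonal degree matrix
D_inv = np.diag(1.0 / A.sum(axis=1))
neighbor_mean = D_inv @ A @ H
```

Writing the aggregation as a matrix product lets the whole layer be computed with (sparse) matrix multiplication instead of a loop over nodes.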

<!-- ### How to train -->

## GNN vs. CNN vs. Transformer

**CNN** can be seen as a special GNN with fixed neighbor size and ordering:
- The size of the filter is pre-defined for a CNN.
- The advantage of a GNN is that it can process arbitrary graphs with a different degree for each node.

**Transformer** is one of the most popular architectures and achieves great performance in many sequence modeling tasks. The key component of the Transformer is the self-attention mechanism, where each word attends to all the other words. In this case, the computation graph of a Transformer layer is identical to that of a GNN on the fully-connected “word” graph.
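The correspondence can be illustrated with a stripped-down, single-head self-attention. This is a sketch under simplifying assumptions: random embeddings and projection matrices, no multi-head split, residual connection or normalization, and each word is also allowed to attend to itself.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)      # numerical safeguard
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 3                                   # 4 "words", 3-dimensional embeddings
H = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = H @ Wq, H @ Wk, H @ Wv
attn = softmax_rows(Q @ K.T / np.sqrt(d))     # attn[i, j]: weight of word j for word i
out_matrix = attn @ V                         # usual matrix form of self-attention

# The same computation written as message passing on a fully-connected graph
neighbors = {i: list(range(n)) for i in range(n)}   # every word is a neighbor
out_graph = np.stack([
    sum(attn[i, j] * V[j] for j in neighbors[i]) for i in range(n)
])
```

Both forms give the same output: the attention weights play the role of learned, input-dependent edge weights in the neighbor aggregation.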


<!-- ## 
## A Single Layer of a GNN
## GNN Layers in Practice

### Batch Normalization

### Dropout

### Activation

## Stacking Layers of a GNN
## Graph Manipulation in GNN -->