# Task 4: Node Embeddings

In Task 2, we learnt traditional feature-based methods: for a given input graph, node-, link- and graph-level features are extracted so that they can be fed into a model (SVM, neural network) that maps features to target labels.

Graph representation learning alleviates the need for manual feature engineering and instead learns the features automatically.

The goal of graph representation learning here is to efficiently learn task-independent features for downstream models.

The task of learning node embeddings is to map nodes to an embedding space so that the similarity between node embeddings indicates the similarity of the nodes in the network. In other words, the embeddings should capture the network information.

Assume we have a graph $G$:
- $V$ is the vertex set. 
- $A$ is the adjacency matrix (assumed binary). 

The goal is to encode nodes so that similarity in the embedding space approximates similarity in the graph.

There are two key components in the above process:

- **Encoder** maps from nodes to embeddings. 

    
    $$\mathrm{ENC}(v) = z_v$$

- **Decoder** maps a pair of embeddings to a similarity score for nodes $u$ and $v$.

    
    $$\mathrm{DEC}(z_v, z_u) = z_v^T z_u$$
    
    which is the dot product between the embeddings of node $v$ and node $u$.
  
So the problem here is to optimize the parameters of the encoder so that

$$\mathrm{similarity}(u,v) \approx z_v^T z_u$$

Here $\mathrm{similarity}(u,v)$ refers to the similarity of the nodes in the original network.

To carry out this optimization, we need to define both the encoder and the node similarity.

## Shallow Encoder

The simplest encoding approach is to treat the encoder as just an embedding lookup table.

$$\mathrm{ENC}(v) = z_v = Z \cdot v$$

where $Z \in \mathbb{R}^{d \times |V|}$ is a matrix in which each column is a node embedding, and $v\in\mathbb{I}^{|V|}$ is an indicator vector: all zeros except for a one in the position corresponding to node $v$.

In this case, each node is assigned a unique embedding vector, and we directly optimize the embedding of each node, that is, the entries of $Z$.
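As a minimal sketch of the lookup-table view (assuming NumPy, with arbitrary illustrative sizes for $d$ and $|V|$ and a hypothetical helper `encode`), multiplying $Z$ by an indicator vector simply selects one column of $Z$:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 5, 3  # |V| and d, chosen arbitrarily for illustration

# Z has one column per node, so its shape is (d, |V|).
Z = rng.normal(size=(dim, num_nodes))

def encode(Z, v):
    """ENC(v) = Z v, where node index v is turned into a one-hot indicator vector."""
    indicator = np.zeros(Z.shape[1])
    indicator[v] = 1.0
    return Z @ indicator

z_2 = encode(Z, 2)

# Multiplying by a one-hot vector just selects a column, so in practice
# the encoder can be implemented as a direct column lookup.
same_as_lookup = np.allclose(z_2, Z[:, 2])
```

The number of parameters is $d \cdot |V|$, since every node gets its own column.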

## Node Similarity

Intuitively, if two nodes are linked, share neighbors, or have similar structural roles, they are more likely to have similar embeddings.


### Node Linkage

The simplest notion of node similarity: node $v$ and node $u$ are similar if they are connected by an edge. This means: 

$$z_v^T z_u = A_{u,v}$$

which is the $(u,v)$ entry of the graph adjacency matrix $A$. Therefore,

$$Z^TZ = A$$

Exact factorization $A=Z^TZ$ is generally not possible. However, we can learn $Z$ approximately. 

Specifically, we optimize $Z$ to minimize the Frobenius norm (the entrywise L2 norm) of $A-Z^TZ$. The objective function is thus:

$$\min_Z\lVert A-Z^TZ\rVert_F$$
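The following is a rough sketch of this approximate factorization using plain gradient descent on the squared Frobenius norm, which has the same minimizer as the norm itself. The example graph (a 4-cycle), the learning rate, the step count, and the function names are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Adjacency matrix of a small undirected graph: the 4-cycle 0-1-2-3-0.
A = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

def frobenius_loss(A, Z):
    """Squared Frobenius norm of the residual A - Z^T Z."""
    R = A - Z.T @ Z
    return float(np.sum(R ** 2))

def factorize(A, Z_init, lr=0.01, steps=2000):
    """Plain gradient descent on ||A - Z^T Z||_F^2, starting from Z_init."""
    Z = Z_init.copy()
    for _ in range(steps):
        R = A - Z.T @ Z
        grad = -2.0 * Z @ (R + R.T)  # gradient of the squared norm w.r.t. Z
        Z -= lr * grad
    return Z

# Small random initialization with d = 2 and |V| = 4.
Z_init = 0.1 * np.random.default_rng(0).normal(size=(2, 4))
initial_loss = frobenius_loss(A, Z_init)
Z_hat = factorize(A, Z_init)
final_loss = frobenius_loss(A, Z_hat)
```

For this particular graph the loss cannot reach zero: $Z^TZ$ is always positive semidefinite, while the adjacency matrix of the 4-cycle has a negative eigenvalue. This illustrates why an exact factorization is generally not possible.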




### Random Walk

Another way to define node similarity is through random walks.

Given a graph and a starting point, we select one of its neighbors at random and move to it; then we select a neighbor of this new point at random, move to it, and so on. The random sequence of points visited this way is a **random walk** on the graph.
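A single unbiased random walk can be sketched in a few lines of standard-library Python. The adjacency-list graph and the function name `random_walk` below are hypothetical examples chosen for illustration.

```python
import random

# Adjacency list of a small undirected example graph (hypothetical data).
graph = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4],
    4: [3],
}

def random_walk(graph, start, length, rng):
    """Return the nodes visited by an unbiased walk that takes `length` steps."""
    walk = [start]
    for _ in range(length):
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))  # uniform choice among neighbors
    return walk

rng = random.Random(0)
walk = random_walk(graph, start=0, length=5, rng=rng)
```

The returned list contains the start node followed by one node per step, and every consecutive pair in it is an edge of the graph.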

The probability that node $v$ and node $u$ co-occur on a random walk over the graph can be used to measure the similarity between node $v$ and node $u$. The intuition is that if a random walk starting from node $u$ visits $v$ with high probability, then $u$ and $v$ are similar (this captures high-order, multi-hop information). As such, we can write:

$$P_R(v|u) \approx z_v^T z_u$$

There are two reasons to use random walks for node similarity:
- **Expressivity**: A flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information.
- **Efficiency**: We do not need to consider all node pairs when training; we only need to consider pairs that co-occur on random walks.

#### Random Walk Strategies
There are different strategies for running random walks:

- Fixed-length, unbiased random walk
    - DeepWalk: [(Perozzi et al., 2014)](https://arxiv.org/abs/1403.6652)
    - The issue is that this notion of similarity is too constrained.
- Biased random walks:
    - Node2Vec: [(Grover and Leskovec, 2016)](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf). 
    - Based on node attributes (Dong et al., 2017).
    - Based on learned weights (Abu-El-Haija et al., 2017)
    
- Alternative optimization schemes:
    - Directly optimize based on 1-hop and 2-hop random walk probabilities (as in LINE from Tang et al. 2015).

- Network preprocessing techniques:
    - Run random walks on modified versions of the original network (e.g., Ribeiro et al. 2017’s struc2vec, Chen et al. 2016’s HARP).

## Optimization

Now we can formally define the optimization. For a given $G = (V,E)$, the goal is to learn a mapping $f:u \rightarrow \mathbb{R}^d: f(u) = \mathbf{z}_u$ that maximizes the log-likelihood objective:

$$\max_f\sum_{u \in V}\log P(N_R(u)|\mathbf{z}_u)\tag{1}$$

where $N_R(u)$ is the neighborhood of node $u$ obtained by strategy $R$.

The optimization takes the following steps:

- Run short fixed-length random walks starting from each node $u$ in the graph using some random walk strategy $R$.
- For each node $u$, collect $N_R(u)$, the multiset of nodes visited on random walks starting from $u$.
- Optimize the embeddings $\mathbf{z}_u$ so that, for a given node $u$, they predict its neighbors $N_R(u)$.
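The first two steps above can be sketched as follows, assuming an unbiased fixed-length walk as the strategy $R$. The example graph, the number of walks per node, the walk length, and the helper names are all illustrative assumptions.

```python
import random
from collections import Counter

# Adjacency list of a small undirected example graph (hypothetical data).
graph = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4],
    4: [3],
}

def random_walk(graph, start, length, rng):
    """Unbiased walk of `length` steps; every node here has a neighbor."""
    walk = [start]
    for _ in range(length):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

def collect_neighborhoods(graph, walks_per_node, walk_length, seed=0):
    """Build N_R(u) for every u as a multiset (Counter) of visited nodes."""
    rng = random.Random(seed)
    neighborhoods = {u: Counter() for u in graph}
    for u in graph:
        for _ in range(walks_per_node):
            walk = random_walk(graph, u, walk_length, rng)
            # Skip the starting position; u can still appear later in the
            # walk if the walk returns to it, and repeats are counted.
            neighborhoods[u].update(walk[1:])
    return neighborhoods

N_R = collect_neighborhoods(graph, walks_per_node=10, walk_length=4)
```

Using a `Counter` keeps the multiset semantics: a node visited several times contributes several terms to the objective.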


Maximizing the objective above is equivalent to minimizing the following loss:

$$\mathcal{L} = \sum_{u \in V}\sum_{v \in N_R(u)}-\log P(v|\mathbf{z}_u)\tag{2}\label{eq:loss_function}$$

Here, we optimize the embeddings $\mathbf{z}_u$ to maximize the likelihood of random walk co-occurrences. 

The probability $P(v|\mathbf{z}_u)$ can be parameterized using a softmax:

$$
P(v|\mathbf{z}_u) = \frac{\exp(z_u^T z_v)}{\sum_{n \in V}\exp(z_u^T z_n)} \tag{3}\label{eq:likelihood}
$$

Plugging $\eqref{eq:likelihood}$ into $\eqref{eq:loss_function}$, we get the following loss function:


$$\mathcal{L} = \sum_{u \in V}\sum_{v \in N_R(u)}-\log \frac{\exp(z_u^T z_v)}{\sum_{n \in V}\exp(z_u^T z_n)}\tag{4}\label{eq:loss_function_final}$$

But doing this naively is too expensive, because the nested sum over nodes gives a complexity of $\mathrm{O}(|V|^2)$.
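To make the cost concrete, here is a naive sketch of the softmax loss, taking the denominator for node $u$ to be the sum of $\exp(z_u^T z_n)$ over all nodes $n$. The function name and the toy inputs are assumptions for illustration. The full $|V| \times |V|$ score matrix is what makes the computation quadratic.

```python
import numpy as np

def softmax_loss(Z, neighborhoods):
    """Sum over u and v in N_R(u) of -log softmax probability of v given u.

    Z has shape (d, |V|); neighborhoods maps u to a list of visited nodes.
    """
    scores = Z.T @ Z  # scores[u, n] = z_u^T z_n, a |V| x |V| matrix
    # Log of the softmax denominator for every u, computed stably.
    row_max = scores.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    loss = 0.0
    for u, visited in neighborhoods.items():
        for v in visited:
            loss += log_norm[u] - scores[u, v]
    return float(loss)

# With all-zero embeddings every node is equally likely, so each of the
# three (u, v) pairs below contributes log|V| = log 4.
uniform_loss = softmax_loss(np.zeros((2, 4)), {0: [1, 2], 1: [0]})
```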


### Negative Sampling


To address this, we can approximate the normalization term as follows:

$$\log \frac{\exp(z_u^T z_v)}{\sum_{n \in V}\exp(z_u^T z_n)} \approx \log\big(\sigma(z_u^T z_v)\big) + \sum_{i=1}^{k}\log\big(\sigma(-z_u^T z_{n_i})\big), \quad n_i \sim P_V$$

where $\sigma$ is the sigmoid function. Each negative sample enters through $\sigma(-z_u^T z_{n_i})$, so the objective rewards a high score for the target $v$ and low scores for the sampled nodes.

Instead of normalizing w.r.t. all nodes, we just normalize against $k$ random "negative samples" $n_i$.

This approximation is called negative sampling. It is a form of noise contrastive estimation (NCE) that approximately maximizes the log probability of the softmax. 

The new formulation corresponds to using logistic regression to distinguish the target node $v$ from nodes $n_i$ sampled from the background distribution $P_V$. For more details, refer to [Goldberg, Y. and Levy, O., 2014](https://arxiv.org/pdf/1402.3722.pdf).

Considerations for negative sampling:
- Higher $k$ gives more robust estimates.
- Higher $k$ corresponds to higher bias on negative events. In practice, $k$ is typically between 5 and 20.
- Can a negative sample be any node, or only a node not on the walk? People often sample from all nodes (for efficiency). However, the most “correct” way is to use only nodes not on the walk.
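A minimal sketch of the negative-sampling loss for one $(u, v)$ pair follows. It uses the standard word2vec-style form $\log\sigma(z_u^T z_v) + \sum_i \log\sigma(-z_u^T z_{n_i})$ and, as a simplifying assumption, draws negatives uniformly from all nodes, since the background distribution $P_V$ is not specified here. The function names are hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)

def negative_sampling_loss(Z, u, v, k, rng):
    """Negative-sampling surrogate for -log P(v | z_u) with k negatives.

    Z has shape (d, |V|). Negatives are drawn uniformly from all nodes,
    which is an assumption made for this sketch.
    """
    negatives = rng.integers(0, Z.shape[1], size=k)
    z_u = Z[:, u]
    positive = log_sigmoid(z_u @ Z[:, v])
    negative = log_sigmoid(-(z_u @ Z[:, negatives])).sum()
    return float(-(positive + negative))

# With all-zero embeddings every sigmoid equals 1/2, so the loss is
# (k + 1) * log 2 regardless of which negatives are drawn.
loss = negative_sampling_loss(np.zeros((3, 6)), u=0, v=1, k=5,
                              rng=np.random.default_rng(0))
```

Each evaluation touches only $k + 1$ embeddings besides $z_u$, instead of all $|V|$.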

### Stochastic Gradient Descent

Now that we have the loss function, we need to optimize (minimize) it.

**Gradient Descent** is a simple way to minimize $\mathcal{L}$ :
- Initialize $z_u$ at some randomized value for all nodes $u$.
- Iterate until convergence:
    - For all $u$, compute the derivative 
    
        $$ \frac{\partial \mathcal{L}}{\partial z_u}$$
    
    - For all $u$, make a step in reverse direction of derivative: 
    
        $$ z_u \leftarrow z_u - \eta \frac{\partial \mathcal{L}}{\partial z_u}$$
    
        where $\eta$ is the learning rate.
    
**Stochastic Gradient Descent** evaluates the gradient on one training example at a time instead of over all examples. Here, a training example is a node $u$ with its loss term

$$\mathcal{L}^{(u)} = \sum_{v \in N_R(u)}-\log P(v|\mathbf{z}_u)$$

- Initialize $z_u$ at some randomized value for all nodes $u$.
- Iterate until convergence: 
    - Sample a node $u$ and, for all $v$, compute the derivative 
    
        $$ \frac{\partial \mathcal{L}^{(u)}}{\partial z_v}$$
    
    - For all $v$, update
    
        $$ z_v \leftarrow z_v - \eta \frac{\partial \mathcal{L}^{(u)}}{\partial z_v}$$
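As a rough illustration of one SGD update, the sketch below applies the rule $z \leftarrow z - \eta \, \partial \mathcal{L} / \partial z$ to the negative-sampling loss of a single $(u, v)$ pair rather than to the full softmax term, because its gradient has a simple closed form. The fixed list of negatives, the learning rate, the number of steps, and the function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(Z, u, v, negatives):
    """-log sigma(z_u.z_v) - sum_n log sigma(-z_u.z_n) for one (u, v) pair."""
    z_u = Z[:, u]
    loss = -np.log(sigmoid(z_u @ Z[:, v]))
    for n in negatives:
        loss -= np.log(sigmoid(-(z_u @ Z[:, n])))
    return float(loss)

def sgd_step(Z, u, v, negatives, lr):
    """One in-place gradient step on pair_loss; only a few columns change."""
    grad = np.zeros_like(Z)
    z_u = Z[:, u]
    # Label 1 for the true neighbor v, label 0 for each negative sample.
    for t, label in [(v, 1.0)] + [(n, 0.0) for n in negatives]:
        g = sigmoid(z_u @ Z[:, t]) - label  # derivative of loss w.r.t. score
        grad[:, u] += g * Z[:, t]
        grad[:, t] += g * z_u
    Z -= lr * grad

rng = np.random.default_rng(0)
Z = 0.1 * rng.normal(size=(4, 6))  # d = 4, |V| = 6
u, v, negatives = 0, 1, [3, 4, 5]

loss_before = pair_loss(Z, u, v, negatives)
for _ in range(50):
    sgd_step(Z, u, v, negatives, lr=0.1)
loss_after = pair_loss(Z, u, v, negatives)
```

In a full training loop, the pair $(u, v)$ and the negatives would be resampled at every step instead of being held fixed.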
    


## Limitations

Limitations of node embeddings via matrix factorization and random walks:
- Cannot obtain embeddings for nodes not in the training set
- Cannot capture structural similarity
- Cannot utilize node, edge and graph features

To address these limitations, deep representation learning and graph neural networks can be used.

## How to Use Embedding

For **node-level tasks** such as clustering/community detection and node classification, we can directly use the node embedding $z_i$ of a given node $i$.

For **link-level tasks** that predict an edge $(i,j)$ based on $(z_i,z_j)$, we can concatenate, average, multiply, or take the difference
between the embeddings to get the link embedding.
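The four ways of combining two node embeddings can be sketched as below. The function name `link_embedding` and the operator labels are hypothetical, and "product" is assumed here to mean the element-wise (Hadamard) product.

```python
import numpy as np

def link_embedding(z_i, z_j, op):
    """Combine two node embeddings into one link embedding."""
    if op == "concat":
        return np.concatenate([z_i, z_j])
    if op == "average":
        return (z_i + z_j) / 2.0
    if op == "hadamard":  # element-wise product (assumed meaning of "product")
        return z_i * z_j
    if op == "difference":
        return z_i - z_j
    raise ValueError(f"unknown operator: {op}")

z_i = np.array([1.0, 2.0])
z_j = np.array([3.0, 4.0])

concat = link_embedding(z_i, z_j, "concat")
average = link_embedding(z_i, z_j, "average")
hadamard = link_embedding(z_i, z_j, "hadamard")
difference = link_embedding(z_i, z_j, "difference")
```

Averaging and the element-wise product give the same result for $(i,j)$ and $(j,i)$, while concatenation and the signed difference depend on the order of the two nodes.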

For **graph-level tasks** that classify graphs into different classes, we can get a graph embedding through the following three approaches. Examples of graph-level tasks are classifying toxic vs. non-toxic molecules and identifying anomalous graphs. 
- **Approach 1**: Embed nodes and aggregate the node embeddings.  
    - Run a standard node embedding technique on the (sub)graph $G$.
    - Then just sum (or average) the node embeddings in the (sub)graph $G$.
    
    It is simple but effective and was used by [Duvenaud et al., 2016](https://arxiv.org/abs/1509.09292) to classify molecules based on their graph structure.

- **Approach 2**: Introduce and embed virtual node.
    - Create a super-node that spans the (sub)graph and then embed that node.
    - Use the virtual node embedding as the graph embedding.

    It was proposed by [Li et al., 2016](https://arxiv.org/abs/1511.05493) as a general
    technique for subgraph embedding.

- **Approach 3**: Hierarchical embeddings.
    - Hierarchically cluster the nodes in the graph.
    - Then sum (or average) the node embeddings according to these clusters.
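Approach 1 above amounts to pooling the embedding columns of the nodes in the (sub)graph. A minimal sketch, with a hypothetical function name and a toy embedding matrix:

```python
import numpy as np

def graph_embedding(Z, nodes, how="sum"):
    """Pool the columns of Z (shape d x |V|) that belong to the (sub)graph."""
    selected = Z[:, list(nodes)]
    if how == "sum":
        return selected.sum(axis=1)
    if how == "mean":
        return selected.mean(axis=1)
    raise ValueError(f"unknown pooling: {how}")

# Toy embedding matrix with d = 3 and |V| = 4; column j is node j.
Z = np.arange(12, dtype=float).reshape(3, 4)

summed = graph_embedding(Z, nodes=[0, 2], how="sum")
averaged = graph_embedding(Z, nodes=[0, 2], how="mean")
```

Summing is sensitive to the number of nodes in the (sub)graph, while averaging is not; which behavior is preferable depends on the task.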



## References

\[1\][Node Embeddings Youtube. Stanford CS224W: Machine Learning with Graphs | 2021](https://www.youtube.com/watch?v=rMq21iY61SE&list=PLoROMvodv4rPLKxIpqhjhPgdQy7imNkDn&index=7)

\[2\][Node Embeddings Slides. Stanford CS224W: Machine Learning with Graphs | 2021](http://web.stanford.edu/class/cs224w/slides/03-nodeemb.pdf)