# GraphVAE for network generation

### Variational Autoencoders

First, a brief reminder on **Variational Autoencoders (VAEs)**, first introduced by [*D. P. Kingma and M. Welling, 'Auto-Encoding Variational Bayes', 2014*](https://arxiv.org/pdf/1312.6114.pdf).

VAE is a neural network architecture belonging to the family of variational Bayesian methods.

From a probabilistic point of view, we want to maximize the likelihood of our data **x** given a set of parameters **$\theta$**, as in a standard maximum likelihood estimation (MLE) problem: $p_{\theta}(x) = p(x|\theta)$. If we neglect all moments from the third upwards, we can approximate the distribution with a normal distribution $\mathcal{N}(x|\mu,\sigma)$. Simple distributions such as the normal are usually easy to maximize; however, once we introduce a latent variable $z$ with a prior over it, the posterior usually becomes intractable.

By marginalizing over $z$ we obtain:

$$p_{\theta}(x) = \int_{\mathcal{Z}}{p_{\theta}(x,z)dz} = \int_{\mathcal{Z}}{p_{\theta}(x|z)p_{\theta}(z)dz}$$
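
As a sanity check of the marginalization above, here is a minimal sketch on a toy model that is not part of the original discussion: we assume $p(z)=\mathcal{N}(0,1)$ and $p(x|z)=\mathcal{N}(z,1)$, for which the integral has the closed form $p(x)=\mathcal{N}(x|0,2)$. The function names are illustrative only. The integral is estimated by averaging the likelihood over samples drawn from the prior.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate normal N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def marginal_likelihood_mc(x, num_samples=100_000, seed=0):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x|z)].

    Toy model (an assumption for illustration): p(z) = N(0, 1), p(x|z) = N(z, 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)         # sample z from the prior
        total += normal_pdf(x, z, 1.0)  # evaluate the likelihood p(x|z)
    return total / num_samples

x = 0.7
estimate = marginal_likelihood_mc(x)
exact = normal_pdf(x, 0.0, 2.0)  # closed form for this toy model only
print(f"MC estimate: {estimate:.4f}, exact: {exact:.4f}")
```

In this one-dimensional case the estimate is cheap, but the number of samples needed grows quickly with the dimension of $z$, which is one reason the exact evidence is usually avoided.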

So we may define the set of relationships between the input data and the latent space through:
- $p_{\theta}(z)$ the prior distribution of the latent space
- $p_{\theta}(x|z)$ the likelihood
- $p_{\theta}(z|x)$ the posterior

Using Bayes' theorem we get:

$$p_{\theta}(z|x) = \frac{p_{\theta}(x|z)p_{\theta}(z)}{p_{\theta}(x)}$$

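but computing it is usually expensive, if not intractable, because the evidence $p_{\theta}(x)$ in the denominator requires the integral over $z$ written above. However, it is possible to approximate the posterior with a parametric distribution $q_{\phi}$: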

$$ q_{\phi}(z|x)\simeq p_{\theta}(z|x)$$
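
Why this approximation is useful can be seen from a standard identity, sketched here in the notation used so far. For any choice of $q_{\phi}(z|x)$, the log-evidence decomposes as:

$$\log p_{\theta}(x) = \mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right] - KL\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z)\right) + KL\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)$$

The last term is non-negative and vanishes only when $q_{\phi}(z|x)$ matches the true posterior, so the first two terms form a lower bound on $\log p_{\theta}(x)$ (the evidence lower bound, ELBO) that does not involve the intractable posterior.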

### Variational Graph Autoencoders

Variational Graph Autoencoders (VGAE) [Kipf and Welling, 2016](https://arxiv.org/pdf/1611.07308.pdf) extend the VAE framework to graphs.

Our problem can be formalized as follows. We are given an undirected graph $\mathcal{G}=(\mathcal{V}, \mathcal{E})$ with $N$ nodes, a feature/attribute matrix $X\in\mathbb{R}^{N\times C}$ and an adjacency matrix $A\in\mathbb{R}^{N\times N}$ with self-loops included. Assuming that each node $i$ is associated with a latent variable $z_i$, the $i$-th row of $Z\in\mathbb{R}^{N\times F}$ with $F$ the latent space dimension, we are interested in inferring the latent variables of the nodes and in decoding the edges from them.
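
To make the shapes concrete, the inputs can be sketched for a toy graph. The helper `build_graph` and the random placeholder features are assumptions made for illustration; real data would supply $X$.

```python
import numpy as np

def build_graph(num_nodes, edges, num_features, seed=0):
    """Return (A, X) for an undirected graph given as a list of (i, j) edges.

    A is symmetric with self-loops on the diagonal; X holds random
    placeholder features (an assumption, real data would supply them).
    """
    A = np.eye(num_nodes)        # self-loops included
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0  # undirected graph: A is symmetric
    X = np.random.default_rng(seed).standard_normal((num_nodes, num_features))
    return A, X

# Path graph 0-1-2-3 with N = 4 nodes and C = 3 features per node.
A, X = build_graph(num_nodes=4, edges=[(0, 1), (1, 2), (2, 3)], num_features=3)
print(A.shape, X.shape)  # (4, 4) (4, 3)
```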

Similarly to a VAE, a VGAE consists of an **encoder** $q_{\phi}(Z|A,X)$, a **decoder** $p_{\theta}(A|Z)$ and a prior $p(Z)$.
- The **encoder** learns a distribution over the latent variables associated with each node, conditioned on the node features $X$ and the adjacency matrix $A$. One efficient option is to instantiate $q_{\phi}(Z|A,X)$ as a graph neural network with learnable parameters $\phi$. In particular, VGAE assumes a node-independent encoder, so that the probability factorizes: $$q_{\phi}(Z|A,X) = \prod_{i=1}^{N}q_{\phi}(z_{i}|A,X)$$ Then, neglecting all moments from the third upwards, each factor is modelled as a Gaussian with diagonal covariance: $$q_{\phi}(z_{i}|A,X)=\mathcal{N}(z_{i}|\mu_{i},diag(\sigma_{i}^2))$$ $$\mathbf{\mu},\mathbf{\sigma} = GCN_{\phi}(X,A)$$ where $z_{i}, \mu_{i},\sigma_{i}$ are the $i$-th rows of the matrices $Z,\mu$ and $\sigma$. The mean and the diagonal covariance are predicted by the encoder network, written $GCN_{\phi}$ above, which is a GCN followed by two MLP heads. For a two-layer $GCN$ we have: $$ H=\tilde{A}\sigma{(\tilde{A}XW_{1})}W_{2}$$ where $H\in\mathbb{R}^{N\times d_{H}}$ contains the node representations (each node is associated with a vector of size $d_{H}$) and $\tilde{A}=D^{-\frac{1}{2}}(A+I)D^{-\frac{1}{2}}$ is the normalized adjacency matrix described in the [original 2016 GCN paper by Kipf and Welling](https://arxiv.org/abs/1609.02907), with $D$ the diagonal degree matrix of $A+I$ (if $A$ already contains self-loops, the $+I$ term is not needed). Inside this formula only, $\sigma(\cdot)$ denotes a pointwise nonlinearity (e.g. a ReLU), not the standard deviation, and $\{W_{1},W_{2}\}$ are trainable weight matrices with the biases absorbed into them. From the learned node representations, the parameters of the distribution are computed as: $$\mu=MLP_{\mu}(H)$$ $$\log{\sigma}=MLP_{\sigma}{(H)}$$ where $\mu_{i},\sigma_{i}$ are the $i$-th rows of the MLP predictions. Therefore, the set of parameters $\phi$ consists of the trainable parameters of the two MLPs and of the aforementioned GCN.
We remark that the neural networks underlying each Gaussian ('GNN+MLP') are powerful function approximators, so the conditional distributions are expressive enough to capture the uncertainty of the latent variables while remaining computationally cheap, since one forward pass yields the posterior parameters for all nodes.
- Graph VAEs often adopt a **prior** that remains fixed during training. A common choice is a node-independent Gaussian: $$p(Z)=\prod_{i=1}^{N}{p(z_{i})}$$ $$p(z_i)=\mathcal{N}(z_i|0,I)$$ Of course, this prior can be replaced by more powerful models, such as autoregressive models, at the cost of more computational resources. Nevertheless, a simple prior like the one above is usually the starting point against which more complicated alternatives are benchmarked.
- The aim of the **decoder** is to construct a probability distribution over the graph and its features/attributes conditioned on the latent variables, $p_{\theta}(\mathcal{G}|Z)$. Strictly speaking, one should consider all possible node permutations, since each permutation reorders the rows and columns of the adjacency matrix while leaving the graph unchanged: $$ p_{\theta}(\mathcal{G}|Z) = \sum_{P\in\Pi_{\mathcal{G}}} {p_{\theta}(PAP^{T},PX|Z)}$$ where $P$ ranges over the set $\Pi_{\mathcal{G}}$ of node permutation matrices, but we'll neglect this for the moment. A simple and popular construction of the probability distribution is: $$ p_{\theta}(A,X|Z)=\prod_{i,j}p_{\theta}(A_{ij}|Z)\prod_{i=1}^{N}p_{\theta}(x_i|Z)$$ $$p_{\theta}(A_{ij}|Z)=Bernoulli(\Theta_{ij})$$ $$p_{\theta}(x_i|Z)=\mathcal{N}(x_i|\tilde{\mu}_{i},\tilde{\sigma}_i)$$ where, once again, the parameters are learned through MLPs, with $[z_{i}\,\|\,z_j]$ the concatenation of the two latent vectors and $\Theta_{ij}$ constrained to $[0,1]$ (e.g. through a final sigmoid): $$\Theta_{ij}=MLP_{\Theta}([z_{i}\,\|\,z_j])$$ $$\tilde{\mu}_{i}=MLP_{\tilde{\mu}}(z_i)$$ $$\tilde{\sigma}_{i}=MLP_{\tilde{\sigma}}(z_i)$$
- The **objective** of the graph VAE is to maximize the evidence lower bound (ELBO): $$\max_{\theta,\phi}\; \mathbb{E}_{q_{\phi}(Z|A,X)}\left[\log{p_{\theta}(\mathcal{G}|Z)}\right] - KL\left(q_{\phi}(Z|A,X)\,\|\,p(Z)\right)$$ where the Kullback-Leibler divergence $KL(q\|p)$ measures how far the approximate posterior is from the prior, and so acts as a regularizer on the encoder.
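
The encoder, prior, decoder and objective listed above can be put together in a short NumPy sketch. It is only an illustration under several assumptions that are not in the text: the weights are random and untrained, single linear layers stand in for $MLP_{\mu}$ and $MLP_{\sigma}$, the decoder MLP has one hidden layer of arbitrary size, the feature term $p(x_i|Z)$ is omitted, and $A$ is taken to already contain self-loops so no identity is added during normalization. All function and parameter names are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}.

    Assumption: A already contains self-loops, so no identity is added here.
    """
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def encode(A_norm, X, params):
    """Two-layer GCN followed by two linear heads (stand-ins for the MLPs)."""
    H = A_norm @ np.maximum(A_norm @ X @ params["W1"], 0.0) @ params["W2"]
    return H @ params["W_mu"], H @ params["W_sigma"]  # mu, log(sigma)

def decode_edges(Z, params):
    """Theta_ij = sigmoid(MLP([z_i || z_j])) for every ordered pair (i, j)."""
    N = Z.shape[0]
    # Row i * N + j of `pairs` holds the concatenation [z_i || z_j].
    pairs = np.concatenate([np.repeat(Z, N, axis=0), np.tile(Z, (N, 1))], axis=1)
    hidden = np.maximum(pairs @ params["V1"], 0.0)
    logits = hidden @ params["V2"]
    return (1.0 / (1.0 + np.exp(-logits))).reshape(N, N)

def elbo_estimate(A, X, params, rng):
    """One-sample ELBO estimate using only the edge term of the decoder."""
    mu, log_sigma = encode(normalize_adjacency(A), X, params)
    Z = mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)  # Z ~ q(Z|A,X)
    theta = np.clip(decode_edges(Z, params), 1e-7, 1.0 - 1e-7)
    log_lik = np.sum(A * np.log(theta) + (1.0 - A) * np.log(1.0 - theta))
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over nodes.
    kl = 0.5 * np.sum(np.exp(2.0 * log_sigma) + mu**2 - 1.0 - 2.0 * log_sigma)
    return log_lik - kl, theta, kl

# Toy path graph 0-1-2-3 with self-loops; all sizes are arbitrary.
rng = np.random.default_rng(0)
N, C, d_H, F, d_dec = 4, 3, 8, 2, 8
A = np.eye(N)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
X = rng.standard_normal((N, C))
params = {
    "W1": 0.1 * rng.standard_normal((C, d_H)),
    "W2": 0.1 * rng.standard_normal((d_H, d_H)),
    "W_mu": 0.1 * rng.standard_normal((d_H, F)),
    "W_sigma": 0.1 * rng.standard_normal((d_H, F)),
    "V1": 0.1 * rng.standard_normal((2 * F, d_dec)),
    "V2": 0.1 * rng.standard_normal((d_dec, 1)),
}
elbo, theta, kl = elbo_estimate(A, X, params, rng)
print(f"ELBO estimate: {elbo:.3f}, KL term: {kl:.3f}")
```

Note that a decoder acting on the concatenation $[z_i\,\|\,z_j]$ is not symmetric in $i$ and $j$ by construction, so $\Theta_{ij}$ and $\Theta_{ji}$ generally differ here. In an actual training loop the parameters would be updated by gradient ascent on this estimate, which requires an automatic differentiation library rather than plain NumPy.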