(app:ch13:spectral)=
# Spectral Method Theory

So far, we have tried to present this material so that it is accessible with a passing familiarity with probability and statistics and a working knowledge of machine learning. That is no longer possible: this section is mathematically demanding. To follow it properly, you should have a firm grasp of linear algebra and, ideally, a working knowledge of matrix analysis and multivariate probability theory. We have already seen one concentration inequality in the last section (Chebyshev's inequality); before proceeding, you should also understand how concentration inequalities extend to random vectors and matrices. We recommend the excellent primer by Roman Vershynin, [High Dimensional Probability](https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.html), which provides a good foundation for many of the results we discuss here. We do not prove most of these results; for more details, see the [excellent paper on Random Dot Product Graphs](https://arxiv.org/abs/1709.05454) by Avanti Athreya and her research team from 2017.

Buckle up!

## Disclaimer about classical statistical asymptotic theory

Classical statistics has a large literature deriving the large-sample properties of estimators, but these concepts are more challenging in network analysis for several reasons. To start with, the very notion of sample size is not clear. We usually associate sample size with the number of observations, which are typically assumed to be independent of each other (for example, the number of participants in a poll used to estimate voting preferences). In a network, independent observations are no longer available in the same way, since all we observe are edges, and each edge reflects an interaction between two vertices. We therefore often treat the vertices as the sampled units. However, every time a new vertex is added, a new set of interactions with all the existing vertices is added to the model, which often requires including more parameters. This is the second important challenge in studying networks. A body of recent literature has addressed some of these challenges for the models and estimators introduced in the previous sections, and we review some of these results here.

## Adjacency spectral embedding

In the following sections, we summarize some of the main results in the literature about spectral embeddings. A more in-depth review of these results is presented in Dr. Athreya's paper. In this section, we review some theoretical properties of the adjacency spectral embedding (ASE), introduced in [Section 6.3](#link?).

We consider a sequence of independent random adjacency matrices, denoted by $\mathbf A_{n_0}, \mathbf A_{n_0+1}, \mathbf A_{n_0+2}, \ldots$, where $n_0$ is some positive integer and the subscript $n$ denotes the number of vertices in the observed network. We assume that each matrix $\mathbf A_n$ is a random adjacency matrix generated from a random dot product graph (RDPG) model with latent positions $\mathbf X_n\in\mathbb R^{n\times d}$. We write $(\mathbf X_n)_i$ to represent the $i$-th row of $\mathbf X_n$, and we assume that the rows of $\mathbf X_n$, which correspond to the latent positions of the vertices, are independent and identically distributed with $(\mathbf X_n)_1, \ldots, (\mathbf X_n)_n\overset{\text{i.i.d.}}{\sim} F$, where $F$ is a distribution with support $\mathcal{X}\subset\mathbb R^d$. We also assume that all eigenvalues of the second moment matrix $\mathbf{\Delta} = \mathbb{E}[(\mathbf X_n)_1(\mathbf X_n)_1^\top]\in\mathbb R^{d\times d}$ are non-zero (that is, $\mathbf{\Delta}$ has full rank). We use $\widehat{\mathbf X}_n=ASE(\mathbf A_n)\in\mathbb R^{n\times d}$ to denote the $d$-dimensional adjacency spectral embedding of $\mathbf A_n$, and $\widetilde{\mathbf X}_n = LSE(\mathbf A_n)$ to denote its $d$-dimensional Laplacian spectral embedding.

### Error of the adjacency spectral embedding

The first result concerns the error of the adjacency spectral embedding (ASE), which is an estimator of the latent positions. This estimator is consistent: as the sample size (number of vertices) increases, the estimated latent positions approach the true latent positions. The typical distance between the two can be quantified explicitly in terms of the sample size $n$, the dimension of the latent positions $d$, and a constant $C$ that depends only on the distribution of the latent positions $F$. It has been shown that, with probability tending to one, there exists an orthogonal matrix $\mathbf W_n\in\mathbb R^{d\times d}$ such that
the largest distance between the true and estimated latent positions satisfies:
\begin{align*}
\max_{i\in[n]}\|(\widehat{\mathbf X}_n)_i - \mathbf W_n(\mathbf X_n)_i\| \leq \frac{Cd^{1/2}\log^2 n}{\sqrt{n}}. \label{eq:thm-ASE-const}
\end{align*}
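
The shrinking error can be checked empirically. The following is a minimal simulation sketch, not the procedure used in the cited papers: it assumes a distribution $F$ with two equally likely point masses (a two-block SBM written as an RDPG), and the helper names `ase`, `sample_rdpg` and `max_row_error` are hypothetical. The orthogonal matrix is estimated by Procrustes alignment and applied on the right of the estimate, which is equivalent to the convention in the bound up to transposition.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def ase(A, d):
    """Adjacency spectral embedding: top-d eigenvectors scaled by sqrt(|eigenvalue|)."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]  # d largest-magnitude eigenvalues
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))


def sample_rdpg(X, rng):
    """Sample a symmetric, hollow adjacency matrix with edge probabilities X X^T."""
    P = X @ X.T
    upper = np.triu(rng.random(P.shape) < P, k=1)
    return (upper | upper.T).astype(float)


def max_row_error(n, rng):
    """Largest row-wise distance between aligned estimates and true positions."""
    # Assumed F: two equally likely point masses (illustrative choice).
    support = np.array([[0.9, 0.1], [0.1, 0.9]])
    X = support[rng.integers(0, 2, size=n)]
    X_hat = ase(sample_rdpg(X, rng), d=2)
    W, _ = orthogonal_procrustes(X_hat, X)  # orthogonal W minimizing ||X_hat W - X||_F
    return np.max(np.linalg.norm(X_hat @ W - X, axis=1))


rng = np.random.default_rng(0)
errors = {n: max_row_error(n, rng) for n in (100, 400, 1600)}
for n, err in errors.items():
    print(n, round(err, 3))
```

Under these assumptions, the printed maximum error should decrease as $n$ grows, in line with the $n^{-1/2}$ rate (up to logarithmic factors) in the bound.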

The orthogonal matrix in the previous result appears because the latent positions are identifiable only up to orthogonal transformations (for instance, changing the signs of the columns of $\mathbf X_n$ does not change the inner products of its rows). The result implies that the distance between the true and estimated latent positions shrinks as $n$ grows, so we can construct an accurate estimator of these latent positions. This justifies the use of $\widehat{\mathbf X}_n$ in place of $\mathbf X_n$ for subsequent inference tasks, such as community detection, vertex nomination, or classification.

A further result on the asymptotic properties of the ASE concerns the distribution of the difference between an estimated latent position $(\widehat{\mathbf X}_n)_i$ and the true parameter $(\mathbf X_n)_i$. By consistency, this difference shrinks with $n$, but that fact alone does not say how large the difference typically is. Quantifying the estimation error matters for statistical tasks such as hypothesis testing or confidence interval estimation.
Distributional results on the rows of the adjacency spectral embedding show that the error in estimating the true latent positions is asymptotically normally distributed. That is, as $n$ grows, the distribution of this difference, scaled by $\sqrt{n}$, approaches a multivariate normal distribution. In particular, it has been shown that the scaled error converges in distribution to a mixture of mean-zero multivariate normal distributions, that is, for any $\mathbf{z}\in\mathbb R^{d}$,
\begin{equation}
\mathbb{P}\left(\sqrt{n}\left(\widehat{\mathbf X}_n\mathbf W_n - \mathbf X_n\right)_i\leq \mathbf{z} \right) \approx \int_{\mathcal{X}}\Phi(\mathbf{z}, \mathbf{\Sigma}(\mathbf{x}))\  dF(\mathbf{x}),\label{eq:thm-ASE-CLT}
\end{equation}
where $\Phi(\mathbf{z}, \mathbf{\Sigma}(\mathbf{x}))$ is the cumulative distribution function of a multivariate normal distribution with mean zero and a covariance matrix $\mathbf{\Sigma}(\mathbf{x})\in\mathbb R^{d\times d}$ that is a function of $\mathbf{x}\in\mathcal{X}$.
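
For reference, the survey cited at the start of this section gives an explicit form for this covariance. We reproduce it here as we recall it, so the exact statement and its regularity conditions should be checked against that paper. With $\mathbf{\Delta}$ the second moment matrix defined above and $\mathbf X_1\sim F$:
\begin{align*}
\mathbf{\Sigma}(\mathbf{x}) = \mathbf{\Delta}^{-1}\,\mathbb{E}\left[\left(\mathbf{x}^\top\mathbf X_1 - (\mathbf{x}^\top\mathbf X_1)^2\right)\mathbf X_1\mathbf X_1^\top\right]\mathbf{\Delta}^{-1}.
\end{align*}
As a worked special case, consider a stochastic block model with $K$ communities, where $F$ places mass $\pi_k$ on a single point $\mathbf{x}_k$ for each community (the symbols $\pi_k$ and $\mathbf{x}_k$ are notation introduced only for this example). The integral then reduces to a finite mixture:
\begin{align*}
\int_{\mathcal{X}}\Phi(\mathbf{z}, \mathbf{\Sigma}(\mathbf{x}))\  dF(\mathbf{x}) = \sum_{k=1}^K \pi_k\,\Phi(\mathbf{z}, \mathbf{\Sigma}(\mathbf{x}_k)),
\end{align*}
so the scaled errors of the vertices in community $k$ are approximately normal with covariance $\mathbf{\Sigma}(\mathbf{x}_k)$.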


The Laplacian spectral embedding (LSE) possesses analogous properties. In particular, the estimated latent positions obtained from LSE converge to a scaled version of the latent positions as $n$ grows, and the difference has a similar asymptotic distribution. Because the scaled latent positions are themselves of order $n^{-1/2}$, the error is multiplied by $n$ rather than $\sqrt{n}$:
\begin{align*}
\mathbb{P}\left\{n\left(\mathbf W_n(\widetilde{\mathbf X}_n)_i - \frac{(\mathbf X_n)_i}{\sqrt{\sum_{j}(\mathbf X_n)_i^\top (\mathbf X_n)_j }} \right)\leq \mathbf{z}\right\}  \approx \int_{\mathcal{X}}\Phi(\mathbf{z}, \widetilde{\mathbf{\Sigma}}(\mathbf{x}))\  dF(\mathbf{x}),\label{eq:thm-LSE-CLT}
\end{align*}
for some covariance matrix $\widetilde{\mathbf{\Sigma}}(\mathbf{x})$.

## Theory for multiple network models

### Spectral embeddings

Having asymptotic properties of the estimators facilitates performing statistical inference. 

Comparing the distributions of two populations is a frequent problem in statistics and across multiple domains. In classical statistics, a typical strategy is to compare the means of the two populations by using an appropriate test statistic. Theoretical results on the distribution of this statistic (either exact or asymptotic) are then used to derive a measure of uncertainty for this problem (such as p-values or confidence intervals). Similarly, when comparing two observed graphs, we may wonder whether they were generated by the same mechanism. The results discussed above have been used to develop valid statistical tests for two-network hypothesis testing questions.

A semiparametric network hypothesis test for the equivalence between the latent positions of the vertices of a pair of networks can be constructed by using the estimates of the latent positions. Formally, for each fixed $n$ let $\mathbf X_n, \mathbf Y_n\in\mathbb R^{n\times d}$ be latent position matrices, and define
$\mathbf A_n\sim RDPG(\mathbf X_n)$, $\mathbf B_n\sim RDPG(\mathbf Y_n)$ as independent random adjacency matrices. The problem of testing the equality of the distributions of $\mathbf A_n$ and $\mathbf B_n$ is defined as:
\begin{align*}
\mathcal{H}^n_0:\mathbf X_n =_{\mathbf W} \mathbf Y_n\quad\quad\quad \text{ vs.}\quad\quad\quad \mathcal{H}^n_a:\mathbf X_n \neq_{\mathbf W} \mathbf Y_n,
\end{align*}
where $\mathbf X_n =_{\mathbf W}\mathbf Y_n$ denotes that $\mathbf X_n$ and $\mathbf Y_n$ are equivalent up to an orthogonal transformation $\mathbf W\in\mathcal{O}_d$, and $\mathcal{O}_d$ is the set of $d\times d$ orthogonal matrices. To define the test statistic, denote  $\widehat{\mathbf X}_n = ASE(\mathbf A_n)$, $\widehat{\mathbf Y}_n=ASE(\mathbf B_n)$, and for a matrix $\mathbf A\in\mathbb R^{n\times n}$ with singular values $\sigma_1(\mathbf A) \geq \ldots\geq \sigma_n(\mathbf A)\geq 0$ and largest observed degree $\delta(\mathbf A) = \max_{i\in[n]}\sum_{j=1}^n\mathbf A_{ij}$, define 
$$\gamma(\mathbf A):=\frac{\sigma_d(\mathbf A) - \sigma_{d+1}(\mathbf A)}{\delta(\mathbf A)}.$$ 
Define $T_n$ as the test statistic:
\begin{align*}
T_n : = \frac{\min_{\mathbf W\in\mathcal{O}_d} \|\widehat{\mathbf X}_n\mathbf W - \widehat{\mathbf Y}_n\|_F}{\sqrt{d\gamma^{-1}(\mathbf A_n)} + \sqrt{d\gamma^{-1}(\mathbf B_n)}}.
\end{align*}
It can be shown that, under appropriate regularity conditions, rejecting the null hypothesis when $T_n > C$ yields a consistent test for the hypothesis testing problem described above, in the following sense. For any significance level $\alpha$ and any constant $C>1$, $\mathbb{P}(T_n> C)\leq \alpha$ for $n$ sufficiently large under $\mathcal{H}^n_0$ (type I error control). If, in addition, $\lim_{n\rightarrow\infty}\min_{\mathbf W\in\mathcal{O}_d} \|\widehat{\mathbf X}_n\mathbf W - \widehat{\mathbf Y}_n\|_F=\infty$, then $\mathbb{P}(T_n> C)\rightarrow 1$ under $\mathcal{H}^n_a$ (that is, the type II error vanishes).
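
To make the definition of $T_n$ concrete, here is a minimal NumPy sketch that computes it for a pair of vertex-aligned graphs. It is an illustration under assumptions, not a reference implementation: the helper names (`ase`, `gamma`, `semipar_statistic`, `sample_rdpg`) are hypothetical, the minimization over $\mathcal{O}_d$ is solved with SciPy's orthogonal Procrustes routine, and the simulated latent positions (a two-point distribution, and a rescaled copy under the alternative) are arbitrary choices.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def ase(A, d):
    """Adjacency spectral embedding: top-d eigenvectors scaled by sqrt(|eigenvalue|)."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))


def gamma(A, d):
    """(sigma_d - sigma_{d+1}) / largest observed degree, as defined in the text."""
    s = np.linalg.svd(A, compute_uv=False)  # singular values in descending order
    return (s[d - 1] - s[d]) / A.sum(axis=1).max()


def semipar_statistic(A, B, d):
    """Test statistic T_n for two graphs on the same, aligned vertex set."""
    X_hat, Y_hat = ase(A, d), ase(B, d)
    W, _ = orthogonal_procrustes(X_hat, Y_hat)  # minimizes ||X_hat W - Y_hat||_F
    numerator = np.linalg.norm(X_hat @ W - Y_hat, ord="fro")
    denominator = np.sqrt(d / gamma(A, d)) + np.sqrt(d / gamma(B, d))
    return numerator / denominator


def sample_rdpg(X, rng):
    """Sample a symmetric, hollow adjacency matrix with edge probabilities X X^T."""
    P = X @ X.T
    upper = np.triu(rng.random(P.shape) < P, k=1)
    return (upper | upper.T).astype(float)


rng = np.random.default_rng(1)
n, d = 400, 2
support = np.array([[0.9, 0.1], [0.1, 0.9]])  # assumed latent positions
X = support[rng.integers(0, 2, size=n)]
A = sample_rdpg(X, rng)

t_null = semipar_statistic(A, sample_rdpg(X, rng), d)       # same latent positions
t_alt = semipar_statistic(A, sample_rdpg(0.6 * X, rng), d)  # rescaled latent positions
print(round(t_null, 3), round(t_alt, 3))
```

In this simulation the statistic should be noticeably larger under the rescaled alternative than under the null, which is the behavior the consistency result describes.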

When the vertices of the networks are not necessarily aligned (including cases in which the networks do not have the same number of vertices), testing equality of latent positions is inappropriate. In those settings, the methods introduced in chapter ??? can be used to test the equality of the distribution of the latent positions.
For a pair of matrices $\mathbf X_n\in\mathbb R^{n\times d}$ and $\mathbf Y_m\in\mathbb R^{m\times d}$ with their rows distributed as $(\mathbf X_n)_i\overset{\text{i.i.d.}}{\sim} F$ and $(\mathbf Y_m)_i\overset{\text{i.i.d.}}{\sim} G$, and a pair of independent adjacency matrices $\mathbf A_n\sim RDPG(\mathbf X_n)$, $\mathbf B_m\sim RDPG(\mathbf Y_m)$, the nonparametric network hypothesis testing problem is given by:
\begin{align*}
\mathcal{H}^n_0:F \perp G \quad\quad\quad \text{ vs.}\quad\quad\quad \mathcal{H}^n_a: F \not\perp G,
\end{align*}
where $F\perp G$ indicates equality of the distributions up to an orthogonal transformation. To test such hypothesis, the following statistic is used:
\begin{align*}
U_{n,m}(\mathbf X, \mathbf Y)=& \frac{1}{n(n-1)}\sum_{j\neq i}\kappa(X_i, X_j)-\frac{2}{mn}\sum_{i=1}^n\sum_{k=1}^m\kappa(X_i, Y_k)\\
& + \frac{1}{m(m-1)}\sum_{l\neq k}\kappa(Y_k, Y_l),
\end{align*}
where $\kappa:\mathcal{X}\times \mathcal{X}\rightarrow\mathbb R$ is a positive definite kernel. As with the semiparametric test, theoretical guarantees are available: it can be shown that $U_{n,m}(\mathbf X, \mathbf Y)$ is a consistent and unbiased estimate of the maximum mean discrepancy between the distributions $F$ and $G$. Furthermore, under the null hypothesis, the quantity $(m+n)U_{n,m}(\mathbf X, \mathbf Y)$ converges in distribution to an infinite weighted sum of independent chi-squared random variables as $n,m\rightarrow \infty$, provided that $\frac{n}{n+m}\rightarrow \rho \in (0, 1)$. Moreover, when the estimated latent positions are used in place of the true latent positions, the difference between $U_{n,m}(\widehat{\mathbf X}, \widehat{\mathbf Y})$ and $U_{n,m}(\mathbf X, \mathbf Y)$ converges to zero sufficiently fast to yield a consistent test procedure.
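
The statistic $U_{n,m}$ is straightforward to compute once a kernel is fixed. The sketch below is a minimal illustration that assumes a Gaussian kernel with bandwidth 0.5 (the text does not specify a kernel) and uses hypothetical function names. For simplicity it is applied to points sampled directly from two distributions rather than to estimated latent positions, so it ignores the orthogonal non-identifiability discussed above.

```python
import numpy as np


def gaussian_kernel(X, Y, bandwidth=0.5):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq_dist = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2 * X @ Y.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2 * bandwidth**2))


def mmd_u_statistic(X, Y, bandwidth=0.5):
    """U-statistic estimate of the squared maximum mean discrepancy."""
    n, m = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    # Within-sample sums exclude the diagonal (pairs with i == j).
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    # Kxy.mean() is the double sum over all (i, k) pairs divided by n * m.
    return term_x - 2 * Kxy.mean() + term_y


rng = np.random.default_rng(2)
X = rng.uniform(0.2, 0.8, size=(200, 2))   # sample from F
X2 = rng.uniform(0.2, 0.8, size=(300, 2))  # second sample from F
Y = rng.uniform(0.4, 1.0, size=(300, 2))   # sample from a shifted G

u_same = mmd_u_statistic(X, X2)
u_diff = mmd_u_statistic(X, Y)
print(round(u_same, 4), round(u_diff, 4))
```

Because the estimate is unbiased, `u_same` fluctuates around zero and can be slightly negative, while `u_diff` should be clearly positive.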

### Omnibus Embedding (omni)

The omnibus embedding described in [Section 6.7] jointly estimates the latent positions under the joint random dot product network ($JRDPG$) model, where $(\mathbf A^{(1)}, \ldots, \mathbf A^{(m)})\sim JRDPG(\mathbf X_n)$, and the rows of $\mathbf X_n\in\mathbb R^{n\times d}$ are an i.i.d. sample from some distribution $F$. Let $\widehat{\mathbf{O}}\in\mathbb R^{mn\times mn}$ be the omnibus matrix of $\mathbf A^{(1)}, \ldots, \mathbf A^{(m)}$ and $\widehat{\mathbf Z}_n = ASE(\widehat{\mathbf{O}})\in\mathbb R^{mn\times d}$ its adjacency spectral embedding.
Under this setting, it can be shown that the rows of $\widehat{\mathbf Z}_n$ are a consistent estimator of the latent positions of each individual network as $n\rightarrow\infty$, and that:
\begin{equation}
\max_{i\in[n],j\in[m]}\|(\widehat{\mathbf Z}_n)_{(j-1)n + i} - \mathbf W_n(\mathbf X_n)_{i}\| \leq \frac{C\sqrt{m}\log(mn)}{\sqrt{n}}. \label{eq:OMNI-consistency}    
\end{equation}
Furthermore, a central limit theorem for the rows of the omnibus embedding  asserts that:
\begin{equation}
\lim_{n\rightarrow\infty} \mathbb{P}\left\{\sqrt{n}\left(\mathbf W_n(\widehat{\mathbf Z}_n)_{(j-1)n + i} - (\mathbf X_n)_i\right)\leq \mathbf{z}\right\}  = \int_{\mathcal{X}}\Phi(\mathbf{z}, \widehat{\mathbf{\Sigma}}(\mathbf{x}))\  dF(\mathbf{x}),\label{eq:thm-OMNI-CLT}
\end{equation}
for some covariance matrix $\widehat{\mathbf{\Sigma}}(\mathbf{x})$.
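
The row indexing $(j-1)n + i$ in these results is easier to read next to a concrete construction. The sketch below assumes the standard omnibus matrix, whose $(j,k)$ block is $(\mathbf A^{(j)} + \mathbf A^{(k)})/2$; the function names and the simulated latent positions are hypothetical choices for illustration.

```python
import numpy as np


def ase(A, d):
    """Adjacency spectral embedding: top-d eigenvectors scaled by sqrt(|eigenvalue|)."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))


def omnibus_matrix(graphs):
    """Build the mn x mn matrix whose (j, k) block is (A_j + A_k) / 2."""
    m, n = len(graphs), graphs[0].shape[0]
    O = np.zeros((m * n, m * n))
    for j, Aj in enumerate(graphs):
        for k, Ak in enumerate(graphs):
            O[j * n:(j + 1) * n, k * n:(k + 1) * n] = (Aj + Ak) / 2
    return O


def sample_rdpg(X, rng):
    """Sample a symmetric, hollow adjacency matrix with edge probabilities X X^T."""
    P = X @ X.T
    upper = np.triu(rng.random(P.shape) < P, k=1)
    return (upper | upper.T).astype(float)


rng = np.random.default_rng(3)
n, d, m = 150, 2, 3
support = np.array([[0.9, 0.1], [0.1, 0.9]])  # assumed latent positions
X = support[rng.integers(0, 2, size=n)]
graphs = [sample_rdpg(X, rng) for _ in range(m)]

O = omnibus_matrix(graphs)
Z = ase(O, d)                  # mn x d embedding
Z_blocks = Z.reshape(m, n, d)  # Z_blocks[j, i] is row j * n + i (zero-based)
print(O.shape, Z_blocks.shape)
```

Each slice `Z_blocks[j]` is the estimate of the latent positions for network $j$, and all slices live in the same coordinate system, so no additional alignment between networks is needed.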

### Multiple adjacency spectral embedding (MASE)

The $COSIE$ model described in [Section 5.5](#link?) gives a joint model that characterizes the distribution of multiple networks with expected probability matrices that share the same common invariant subspace. The $MASE$ algorithm (see [Section 5.5](#link?)) is a consistent estimator for this common invariant subspace, and results in asymptotically normal estimators for the individual symmetric matrices. Specifically, let $\mathbf V_n\in\mathbb R^{n\times d}$ be a sequence of matrices with orthonormal columns and $\mathbf R^{(1)}_n, \ldots, \mathbf R^{(m)}_n\in\mathbb R^{d\times d}$ a sequence of score matrices such that $\mathbf{P}^{(l)}_n=\mathbf V_n\mathbf R^{(l)}_n\mathbf V_n^\top\in[0,1]^{n\times n}$, let $(\mathbf A_n^{(1)}, \ldots, \mathbf A_n^{(m)})\sim COSIE(\mathbf V_n; \mathbf R^{(1)}_n, \ldots, \mathbf R^{(m)}_n)$, and let $\widehat{\mathbf V}, \widehat{\mathbf R}^{(1)}_n, \ldots, \widehat{\mathbf R}^{(m)}_n$ be the estimators obtained by $MASE$. Under appropriate regularity conditions, the estimate for $\mathbf V$ is consistent as $n,m\rightarrow\infty$, and there exists some constant $C>0$ such that:
\begin{align*}
\mathbb{E}\left[\min_{\mathbf W\in\mathcal{O}_d}\|\widehat{\mathbf V}-\mathbf V\mathbf W\|_F\right] \leq C\left(\sqrt{\frac{1}{mn}} + {\frac{1}{n}}\right). \label{eq:theorem-bound}
\end{align*}
In addition, the entries of $\widehat{\mathbf{R}}^{(l)}_n$, $l\in[m]$ are asymptotically normally distributed. Namely, there exists a sequence of orthogonal matrices $\mathbf W$ such that:
$$\frac{1}{\sigma_{l,j,k}}\left(\widehat{\mathbf R}^{(l)}_n - \mathbf W^\top\mathbf R^{(l)}_n\mathbf W + \mathbf H_m^{(l)}\right)_{jk} \overset{d}{\rightarrow} \mathcal{N}(0, 1), $$
as $n\rightarrow\infty$, where:
$\mathbb{E}[\|\mathbf H_m^{(l)}\|]=O\left(\frac{d}{\sqrt{m}}\right)$ and $\sigma^2_{l,j,k} = O(1)$. 
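
To see what these estimators look like in practice, here is a minimal sketch of one common formulation of MASE, assumed here for illustration: take the $d$ leading eigenvectors of each graph, concatenate them, keep the $d$ leading left singular vectors as $\widehat{\mathbf V}$, and set $\widehat{\mathbf R}^{(l)} = \widehat{\mathbf V}^\top\mathbf A^{(l)}\widehat{\mathbf V}$. The function names and the simulated two-block model are hypothetical.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def leading_eigvecs(A, d):
    """Eigenvectors of A for the d largest-magnitude eigenvalues (unscaled)."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top]


def mase(graphs, d):
    """Sketch of MASE: common subspace estimate plus per-graph score matrices."""
    U = np.hstack([leading_eigvecs(A, d) for A in graphs])  # n x (m * d)
    V_hat = np.linalg.svd(U, full_matrices=False)[0][:, :d]
    R_hats = [V_hat.T @ A @ V_hat for A in graphs]
    return V_hat, R_hats


def sample_graph(P, rng):
    """Sample a symmetric, hollow adjacency matrix with edge probabilities P."""
    upper = np.triu(rng.random(P.shape) < P, k=1)
    return (upper | upper.T).astype(float)


rng = np.random.default_rng(4)
n, d, m = 300, 2, 4
Z = np.repeat(np.eye(2), n // 2, axis=0)  # n x 2 community membership matrix
V = Z / np.sqrt(n / 2)                    # orthonormal basis of the shared subspace
# Two assumed block probability matrices, alternated across the m graphs.
blocks = [np.array([[0.6, 0.2], [0.2, 0.5]]), np.array([[0.3, 0.1], [0.1, 0.7]])]
graphs = [sample_graph(Z @ blocks[l % 2] @ Z.T, rng) for l in range(m)]

V_hat, R_hats = mase(graphs, d)
W, _ = orthogonal_procrustes(V, V_hat)  # orthogonal W minimizing ||V W - V_hat||_F
subspace_error = np.linalg.norm(V_hat - V @ W, ord="fro")
print(round(subspace_error, 3))
```

The quantity `subspace_error` is the simulated counterpart of $\min_{\mathbf W\in\mathcal{O}_d}\|\widehat{\mathbf V}-\mathbf V\mathbf W\|_F$ in the bound above; rerunning with larger `n` or `m` gives a rough empirical check of how it shrinks.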


### Graph Matching for Correlated Networks

Given a pair of network adjacency matrices $\mathbf A$ and $\mathbf B$ with $n$ vertices each (but possibly permuted), when is it possible to match the networks correctly? In principle, if the two networks are isomorphic (i.e., equal up to some permutation of the vertices), matching the vertices across the networks should be theoretically possible, as long as the exact match is unique. However, in the presence of edge noise, it might not be possible in general. This raises the question of whether matching the vertices is feasible for a particular joint model that describes the edges of the networks.

A body of literature has studied the feasibility of finding the correct matching under different random network models, including correlated Erdős-Rényi and Bernoulli networks. In this section we review some of the results for the correlated Erdős-Rényi model described in [Section 5.5](#link?).

Formally, given parameters $\rho_n\in[0,1]$ and $q_n \in(0, 1-\xi_1)$ for some small $\xi_1>0$, the $n\times n$ adjacency matrices $\mathbf A_n$ and $\mathbf B_n$ are distributed as correlated Erdős-Rényi if their marginal distributions are $\mathbf A_n\sim ER_n(q_n)$, $\mathbf B_n\sim ER_n(q_n)$, but the edge pairs satisfy $\text{Corr}((\mathbf A_n)_{ij},(\mathbf{Q}_n^\top\mathbf B_n\mathbf{Q}_n)_{ij})=\rho_n$, where $\mathbf{Q}_n\in\mathcal{P}_n$ is a permutation matrix that gives the correct alignment between the vertices (here $\mathcal{P}_n$ denotes the set of $n\times n$ permutation matrices).

One natural question is when the solution of the optimization problem defined in [Section 9.3](#link?) recovers the correct alignment. It can be shown that this is possible when the networks follow the correlated Erdős-Rényi model and neither the correlation between the networks nor the edge probability is too small. Formally, if
 $\rho_n\geq c_1\sqrt{\frac{\log n}{n}}$ and $q_n\geq c_2 \frac{\log n }{n}$ for some fixed positive constants $c_1, c_2$, then $\mathbf{Q}_n$ is correctly recovered with probability 1 for $n$ sufficiently large.

While the solution of the quadratic assignment problem can correctly recover the vertex alignment in theory, the optimization problem is computationally challenging to solve due to the non-convexity of the loss function. Introducing seed nodes (as described in [Section 9.3](#link?)) can greatly improve the performance of the algorithm. In particular, in the presence of a logarithmic number of seed vertices $s_n$, graph matching succeeds in polynomial time in recovering the alignment of the remaining $n-s_n$ vertices.
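
The effect of correlation and seeds can be explored with a small simulation. The sketch below samples a correlated Erdős-Rényi pair, hides the alignment with a random permutation, and runs the approximate solver `scipy.optimize.quadratic_assignment` with `method="faq"`, passing seeds through its `partial_match` option. It is an illustration under assumptions: the sampler `sample_correlated_er` is a hypothetical helper, the parameter values are arbitrary, and the solver is a heuristic that is not guaranteed to find the global optimum.

```python
import numpy as np
from scipy.optimize import quadratic_assignment


def sample_correlated_er(n, q, rho, rng):
    """Sample (A, B) with ER(q) marginals and edge-wise correlation rho."""
    iu = np.triu_indices(n, k=1)
    a = rng.random(len(iu[0])) < q
    # Conditional probability of an edge in B given the matching edge in A;
    # this keeps the marginal equal to q and the correlation equal to rho.
    p_b = np.where(a, q + rho * (1 - q), q * (1 - rho))
    b = rng.random(len(iu[0])) < p_b
    A, B = np.zeros((n, n)), np.zeros((n, n))
    A[iu], B[iu] = a, b
    return A + A.T, B + B.T


rng = np.random.default_rng(5)
n, n_seeds = 100, 10
A, B = sample_correlated_er(n, q=0.3, rho=0.9, rng=rng)

perm = rng.permutation(n)
B_shuffled = B[np.ix_(perm, perm)]  # hide the true alignment
truth = np.argsort(perm)            # vertex i of A matches vertex truth[i] of B_shuffled

# Seeds: pairs (vertex in A, matching vertex in B_shuffled) assumed known in advance.
seed_idx = np.arange(n_seeds)
seeds = np.column_stack([seed_idx, truth[seed_idx]])

res = quadratic_assignment(
    A, B_shuffled, method="faq",
    options={"maximize": True, "partial_match": seeds},
)
accuracy = np.mean(res.col_ind == truth)
print(round(accuracy, 2))
```

Lowering `rho` or `q`, or removing the seeds, is a simple way to observe matching accuracy degrading as the conditions stated above are weakened.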