# Stepping stone exploration 2

**authors:** Joseph Marcus

Here I explore the stepping stone model with possible approximations building off the classic results of Bodmer and Cavalli-Sforza 1967. The notes here are a bit scattered and I will clean them up later.

Consider a single bi-allelic SNP with haploid individuals carrying either the $A$ or $a$ allele dispersed throughout a habitat. The habitat is discretized and defined on a graph $\mathcal{G}$ over geographic space with $d$ nodes and a migration matrix $\mathbf{M}$ which specifies the edge weights. Note that $\mathbf{M}$ can be interpreted as a "backwards" migration matrix where $m_{ij} \geq 0$ and $\sum_{j=1}^d m_{ij} = 1$, so that $m_{ij}$ is the probability that an individual in node $i$ has a parent from node $j$. Let $p_{i,t}$ be the frequency of the $A$ allele at node $i$ and time $t$, where time is discrete as well. Each generation we can describe the evolution of the allele frequency in two steps: first a deterministic migration event, where individuals are exchanged only between neighboring nodes, and then a drift event, which is a random fluctuation in allele frequency within each node whose variance is inversely proportional to that node's population size. The migration step alone gives

$$
p_{i,t} = \sum_{j=1}^d m_{ij} p_{j,t-1}
$$

Or in matrix notation 

$$
\mathbf{p}_t = \mathbf{M}\mathbf{p}_{t-1}
$$
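As a concrete illustration of the migration step, here is a minimal sketch for a one-dimensional stepping stone habitat. The helper name `linear_migration_matrix`, the chain-shaped graph, and the migration rate `m = 0.1` are assumptions made for this example and are not part of the notes above.

```python
import numpy as np

def linear_migration_matrix(d, m):
    """Hypothetical helper: backward migration matrix for a chain of d nodes.

    Each node draws a fraction m of its parents from each adjacent node;
    the remaining mass stays on the diagonal so every row sums to 1.
    """
    M = np.zeros((d, d))
    for i in range(d):
        if i > 0:
            M[i, i - 1] = m
        if i < d - 1:
            M[i, i + 1] = m
        M[i, i] = 1.0 - M[i].sum()
    return M

M = linear_migration_matrix(d=5, m=0.1)

# Allele frequencies in the previous generation (assumed values).
p_prev = np.array([0.9, 0.7, 0.5, 0.3, 0.1])

# Deterministic migration step: p_t = M p_{t-1}.
p_next = M @ p_prev
print(p_next)
```

Migration pulls each node toward its neighbors, so the frequency gradient along the chain flattens slightly at the two boundary nodes while interior nodes on a linear cline are unchanged.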

For now we don't assume any distributional form for $\mathbf{p}_{t}$ but do define its conditional moments

$$
\begin{aligned}
E\big(\mathbf{p}_t | \mathbf{p}_{t-1}\big) &= \mathbf{M}\mathbf{p}_{t-1} \\
Var\big(\mathbf{p}_t | \mathbf{p}_{t-1}\big) &= diag\Big(\frac{1}{\mathbf{N}} \odot \mathbf{M}\mathbf{p}_{t-1} \odot \big(\mathbf{1} - \mathbf{M}\mathbf{p}_{t-1}\big) \Big)
\end{aligned}
$$
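The two conditional moments can be sketched numerically as follows. The function names `conditional_moments` and `drift_step`, the two-node migration matrix, and the population sizes are illustrative assumptions; the drift step uses binomial sampling of gametes, which is the sampling model whose variance appears above.

```python
import numpy as np

def conditional_moments(M, p_prev, N):
    """Conditional mean and covariance of p_t given p_{t-1}."""
    mean = M @ p_prev
    # Binomial sampling variance in each node, placed on the diagonal.
    cov = np.diag(mean * (1.0 - mean) / N)
    return mean, cov

def drift_step(M, p_prev, N, rng):
    """One generation: deterministic migration, then binomial drift."""
    mean = M @ p_prev
    counts = rng.binomial(N, mean)  # number of A alleles sampled per node
    return counts / N

# Assumed toy values for two nodes.
M = np.array([[0.9, 0.1],
              [0.1, 0.9]])
p_prev = np.array([0.2, 0.8])
N = np.array([100, 400])

mean, cov = conditional_moments(M, p_prev, N)
rng = np.random.default_rng(0)
p_next = drift_step(M, p_prev, N, rng)
```

The larger node has the smaller conditional variance, which is the sense in which drift is weaker in larger populations.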

Here $\mathbf{N}$ is a $d$-vector of population sizes within each node, $\frac{1}{\mathbf{N}}$ is taken element-wise, and $\odot$ refers to element-wise multiplication. Note that this corresponds exactly to the process described previously: a deterministic migration event followed by variance induced by the random sampling of gametes due to genetic drift. We now make the simplifying assumption that we focus only on common SNPs, so that the binomial sampling term $p(1-p)$ has a small range and can be replaced by a constant $\sigma^2$, and approximate the conditional variance as

$$
Var\big(\mathbf{p}_t | \mathbf{p}_{t-1}\big) \approx \sigma^2 diag\Big(\frac{1}{\mathbf{N}}\Big)
$$
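To get a rough sense of how tight this approximation is, the sketch below evaluates $p(1-p)$ over a range of frequencies. Treating "common" as frequencies between 0.2 and 0.8, and taking $\sigma^2$ to be the average of $p(1-p)$ over that range, are both assumptions made here for illustration.

```python
import numpy as np

# Assumed definition of "common": frequencies between 0.2 and 0.8.
p = np.linspace(0.2, 0.8, 61)
het = p * (1.0 - p)  # the binomial sampling term p(1 - p)

lo, hi = het.min(), het.max()
sigma2 = het.mean()  # one possible constant stand-in for p(1 - p)
print(lo, hi, hi / lo, sigma2)
```

Over this range $p(1-p)$ varies by a factor of about 1.6, so a single constant is a coarse but not unreasonable summary.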

Now let's make the further assumption that the changes in frequency due to drift are normally distributed

$$
\begin{aligned}
\mathbf{p}_t &= \mathbf{M}\mathbf{p}_{t-1} + \epsilon \\
\epsilon | \sigma^2, \mathbf{N} &\sim \mathcal{N}\Bigg(\mathbf{0}, \sigma^2 diag\Big(\frac{1}{\mathbf{N}}\Big)\Bigg)
\end{aligned}
$$
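The Gaussian recursion can be simulated directly. This is a minimal sketch: the function name `simulate_gaussian_drift`, the three-node migration matrix, and the parameter values are assumptions. Note that under the normal approximation nothing keeps the simulated values inside $[0, 1]$.

```python
import numpy as np

def simulate_gaussian_drift(M, p0, N, sigma2, n_gens, rng):
    """Iterate p_t = M p_{t-1} + eps with eps ~ N(0, sigma2 * diag(1/N))."""
    p0 = np.asarray(p0, dtype=float)
    sd = np.sqrt(sigma2 / np.asarray(N, dtype=float))  # per-node noise sd
    traj = np.empty((n_gens + 1, p0.size))
    traj[0] = p0
    for t in range(1, n_gens + 1):
        eps = rng.normal(0.0, sd)  # independent noise in each node
        traj[t] = M @ traj[t - 1] + eps
    return traj

# Assumed toy values for three nodes on a chain.
M = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
p0 = np.array([0.3, 0.5, 0.7])
N = np.array([1000, 1000, 1000])

rng = np.random.default_rng(1)
traj = simulate_gaussian_drift(M, p0, N, sigma2=0.25, n_gens=50, rng=rng)
```

Setting `sigma2=0.0` recovers the purely deterministic migration dynamics, which is a convenient sanity check.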


Letting $\pi$ be the vector of allele frequencies at stationarity, and assuming a stationary distribution exists, then heuristically

$$
\begin{aligned}
\pi &= \mathbf{M}\pi + \epsilon \\
\pi - \mathbf{M}\pi &= \epsilon \\
(\mathbf{I} - \mathbf{M})\pi &= \epsilon
\end{aligned}
$$

Recall the graph Laplacian is defined as $\mathbf{L} = \mathbf{D} - \mathbf{M}$ where $\mathbf{D}$ is diagonal with $d_{ii} = \sum_{j=1}^d m_{ij}$. Because $\mathbf{M}$ is a stochastic matrix its rows sum to 1, so $\mathbf{D} = \mathbf{I}$ and $\mathbf{L} = \mathbf{I} - \mathbf{M}$.
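A quick numerical check of this Laplacian is sketched below, using an assumed three-node migration matrix. Because the rows of $\mathbf{M}$ sum to 1, the rows of $\mathbf{L}$ sum to 0, so $\mathbf{L}\mathbf{1} = \mathbf{0}$ and $\mathbf{L}$ is rank deficient. Any inverse of $\mathbf{L}$ in these notes should therefore be read formally, which seems related to the TODO at the end.

```python
import numpy as np

# Assumed toy backward migration matrix for three nodes on a chain.
M = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])

L = np.eye(3) - M  # Laplacian of a row-stochastic M

row_sums = L.sum(axis=1)                     # should all be ~0
rank = np.linalg.matrix_rank(L, tol=1e-10)   # d - 1 for a connected graph
print(row_sums, rank)
```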

$$
\begin{aligned}
\mathbf{L}\pi &= \epsilon \\
\pi &= \mathbf{L}^{-1}\epsilon
\end{aligned}
$$

Letting $\mathbf{Q} = diag\Big(\frac{1}{\mathbf{N}}\Big)$ and using $Var(\mathbf{L}^{-1}\epsilon) = \sigma^2\mathbf{L}^{-1}\mathbf{Q}\mathbf{L}^{-T}$, our stationary distribution is

$$
\pi | \sigma^2, \mathbf{L}, \mathbf{Q} \sim \mathcal{N}\big(\mathbf{0}, \sigma^2(\mathbf{L}^T\mathbf{Q}^{-1}\mathbf{L})^{-1}\big)
$$

As a simple example, let all population sizes be the same with size $N$ across the nodes, so that $\mathbf{Q} = \frac{1}{N}\mathbf{I}$

$$
\pi | \sigma^2, \mathbf{L}, N \sim \mathcal{N}\Big(\mathbf{0}, \frac{\sigma^2}{N}(\mathbf{L}^T\mathbf{L})^{-1}\Big)
$$
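The equal-population-size covariance can be sketched numerically, with one caveat: for a row-stochastic $\mathbf{M}$ the Laplacian is singular, so this example substitutes the Moore-Penrose pseudoinverse for $\mathbf{L}^{-1}$. That substitution is an assumption made here and is not something the derivation above establishes. The migration matrix is also assumed symmetric, in which case $\mathbf{L}\mathbf{L}^T = \mathbf{L}^T\mathbf{L}$; the values of `sigma2` and `N` are arbitrary.

```python
import numpy as np

# Assumed symmetric backward migration matrix for three nodes on a chain.
M = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
L = np.eye(3) - M

sigma2 = 0.25  # assumed constant standing in for p(1 - p)
N = 100        # assumed common population size

# Pseudoinverse used in place of L^{-1} because L is singular (assumption).
L_pinv = np.linalg.pinv(L, rcond=1e-10)
cov = (sigma2 / N) * L_pinv @ L_pinv.T

print(np.round(cov, 4))
```

In this toy example the covariance between adjacent nodes is larger than between the two end nodes, which is the qualitative isolation-by-distance pattern one would hope to see.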

This is very similar to the covariance matrix derived in Hanks 2017, and in this form it is a bit easier to see some of the connections.

*TODO: show formally the stationary distribution exists ... perhaps this also needs the constraint of $\mathbf{1}^T\epsilon = 0$*