# Triangular lattice barrier

**authors:** Joseph Marcus
 
Here I simulate genetic data under the coalescent in a triangular lattice with a barrier and explore the fit of different ways to compute expected genetic distances on simulated genotypes.

Let's load the necessary packages and modules to get started.

In [2]:
%load_ext autoreload
%autoreload 2

import numpy as np
import networkx as nx

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import sys
sys.path.append("../code/")
from habitat import *
from genotype_simulator import *



Plot size configuration

In [3]:
sns.set_style('white')
plt.style.use('bmh')
mpl.rcParams['font.size'] = 14
mpl.rcParams['figure.figsize'] = 8, 6

## Set up the habitat

Here we define a triangular lattice with 8 rows and 8 columns, so we have 64 demes in total.

In [4]:
hab = TriangularLattice(8, 8)

Next we need to define a migration surface, which is a function on the nodes of the graph that defines the edge weights. Here I choose a quadratic function to match the simulations in the EEMS paper.

In [34]:
def quad_barrier_migration(self, m_min, m_max):
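    """
    Assign quadratic "barrier" migration rates to the edges of self.g,
    using the x-coordinates of the node positions in self.s.

    Arguments:
        m_min : float
            migration rate at the center of the barrier (smallest edge weight)
        m_max : float
            upper bound applied to every edge weight

    Side effects:
        sets the "m" attribute on each edge of self.g and stores the
        normalized, symmetric migration matrix in self.m
    """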
    s0_max = np.max(self.s[:,0])
    s0_med = np.median(self.s[:,0]) + .25
    for i,j in self.g.edges():
        mu = np.mean([self.s[i,0], self.s[j,0]])
        m = (s0_max / s0_med ** 2) * (mu - s0_med) ** 2 + m_min
        self.g[i][j]["m"] = min(m, m_max)
        
    z = nx.adjacency_matrix(self.g, weight='m')
    z = z.toarray()
    z_norm = z / (2 * z.sum(axis=1, keepdims=True))
    z_norm_tril = np.tril(z_norm)
    self.m = z_norm_tril + z_norm_tril.T
    
    
    #z_triu = np.triu(z)
    #z_triu_norm = z_triu / (z_triu.sum(axis=1, keepdims=True))
    #z_triu_norm[np.isnan(z_triu_norm)] = 0.0
    #self.m = z_triu_norm + z_triu_norm.T - np.diag(z_triu_norm.diagonal())
    
    # normalize to sum to 1
    #self.m = self.m / (np.sum(self.m, axis=1, keepdims=True))
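
To get a feel for the surface defined above, the following sketch evaluates the same quadratic formula on a hypothetical row of x-coordinates (0 through 7). The real coordinates in `hab.s` may differ, and `quad_barrier_rate` is an illustrative name that is not part of `habitat.py`.

```python
import numpy as np

def quad_barrier_rate(mu, s0_max, s0_med, m_min, m_max):
    """Quadratic migration rate at edge midpoint(s) mu, capped at m_max."""
    m = (s0_max / s0_med ** 2) * (mu - s0_med) ** 2 + m_min
    return np.minimum(m, m_max)

# hypothetical x-coordinates for one row of an 8-column lattice
x = np.arange(8, dtype=float)
s0_max = x.max()
s0_med = np.median(x) + 0.25  # same offset as in quad_barrier_migration

# midpoints of horizontally adjacent nodes
mu = (x[:-1] + x[1:]) / 2
rates = quad_barrier_rate(mu, s0_max, s0_med, m_min=0.01, m_max=3.0)

for mu_k, m_k in zip(mu, rates):
    print("midpoint x = {:.2f} -> m = {:.3f}".format(mu_k, m_k))
```

Under these assumed coordinates the rate is lowest for the edge nearest `s0_med` and hits the `m_max` cap towards the left and right boundaries, which is what produces the barrier in the middle of the habitat.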

We then assign this method to the habitat object

In [35]:
hab.migration_surface = quad_barrier_migration
hab.migration_surface(hab, .01, 3.)

# compute graph laplacian
hab.get_graph_lapl()
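
Because the function is assigned to the instance rather than the class, Python does not bind it, which is why `hab` is passed explicitly as the first argument. One alternative is to bind it with `types.MethodType`. The sketch below uses a hypothetical stand-in class (`Habitat`) and a hypothetical surface (`constant_migration`), not the classes from `habitat.py`.

```python
import types

class Habitat:
    """Hypothetical stand-in for the notebook's habitat object."""
    def __init__(self):
        self.m = None

def constant_migration(self, rate):
    """Toy migration surface: store a single rate on the object."""
    self.m = rate

hab_demo = Habitat()

# bind the function to this particular instance
hab_demo.migration_surface = types.MethodType(constant_migration, hab_demo)

# the instance is now passed implicitly
hab_demo.migration_surface(0.5)
print(hab_demo.m)
```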

In [36]:
np.sum(hab.m, axis=1)

array([0.22563143, 0.50911974, 0.61212124, 0.60992685, 0.25504578,
       0.27459944, 0.41575149, 0.42692441, 0.54612011, 0.59344231,
       0.69402607, 0.44979909, 0.27057003, 0.43517317, 0.6014354 ,
       0.43373704, 0.35028016, 0.57272219, 0.63868374, 0.65504897,
       0.30497844, 0.40078946, 0.49156077, 0.5616992 , 0.54612011,
       0.59344231, 0.69402607, 0.44979909, 0.27057003, 0.43517317,
       0.6014354 , 0.43373704, 0.35028016, 0.57272219, 0.63868374,
       0.65504897, 0.30497844, 0.40078946, 0.49156077, 0.5616992 ,
       0.54612011, 0.59344231, 0.69402607, 0.44979909, 0.27057003,
       0.43517317, 0.6014354 , 0.43373704, 0.41694683, 0.68963316,
       0.73014518, 0.72944564, 0.31874051, 0.47518613, 0.58302222,
       0.69527684, 0.47941759, 0.61527503, 0.74598725, 0.5       ,
       0.25401275, 0.38472497, 0.60391574, 0.5       ])
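
The row sums are not constant because the row-normalized matrix is symmetrized by copying its lower triangle into the upper triangle. The toy example below repeats the same three steps on a made-up 3-node weight matrix; the weights are arbitrary and only serve to illustrate the effect.

```python
import numpy as np

# arbitrary symmetric edge weights on a 3-node graph
z = np.array([
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
])

# step 1: normalize each row so that it sums to 1/2
z_norm = z / (2 * z.sum(axis=1, keepdims=True))
print(z_norm.sum(axis=1))

# step 2: keep the lower triangle and mirror it to get a symmetric matrix
z_norm_tril = np.tril(z_norm)
m = z_norm_tril + z_norm_tril.T

# step 3: the rows no longer share a common sum
print(m.sum(axis=1))
```

Only the last node keeps a row sum of exactly 0.5, since its entire row lies in the lower triangle. The same pattern shows up in the final entry of the array above.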

Let's visualize the triangular lattice with edge widths proportional to the defined edge weights. Note that we multiply the weights by a constant purely for visualization purposes. Additionally, the nodes are colored according to their position on the map, with differences in x position emphasized more than differences in y.

In [None]:
hab.plot_habitat(200, 2, False)

As expected, we see that the migration matrix $\mathbf{M}$ is extremely sparse, as only neighboring nodes are connected.

In [None]:
hab.plot_migration_matrix()

## Simulate genotypes

Here we simulate genotypes under the coalescent using msprime; this may take a bit of time. Specifically, we simulate 10 haploid individuals per deme in 5000 independent regions of the genome. See `../code/genotype_simulator.py` for the default parameters and the implementation of the simulation object.

In [None]:
sim_path = "../output/simulations/trilat_bar.pkl"
geno = GenotypeSimulator(hab, sim_path, n_rep=2e3)
print(geno.y.shape)

Here we visualize the site frequency spectrum, which seems to match the neutral expectation.

In [None]:
geno.plot_sfs()
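
For reference, under the standard neutral model of a constant-size, panmictic population, the expected number of sites with derived allele count $i$ is proportional to $1/i$. The sketch below compares that expectation with an SFS computed from a tiny made-up genotype matrix. It assumes a samples-by-SNPs 0/1 layout, which may not match `geno.y`, and both function names are illustrative.

```python
import numpy as np

def observed_sfs(y):
    """Normalized unfolded SFS from an (n_samples x n_snps) 0/1 matrix."""
    n = y.shape[0]
    counts = y.sum(axis=0).astype(int)  # derived allele count per SNP
    # keep only the segregating classes 1, ..., n - 1
    sfs = np.bincount(counts, minlength=n + 1)[1:n]
    return sfs / sfs.sum()

def neutral_sfs(n):
    """Normalized neutral expectation, proportional to 1 / i."""
    w = 1.0 / np.arange(1, n)
    return w / w.sum()

# toy data: 4 haploid samples, 3 SNPs
y_toy = np.array([
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
])
sfs_obs = observed_sfs(y_toy)
sfs_exp = neutral_sfs(y_toy.shape[0])
print(sfs_obs)
print(sfs_exp)
```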

Let's filter out variants that are too rare, leaving us with fewer SNPs.

In [None]:
geno.filter_rare_var()
print(geno.y.shape)
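
A minimal sketch of what such a filter can look like, assuming a samples-by-SNPs 0/1 matrix and a minor allele frequency cutoff. The function name and the default threshold of 0.05 are hypothetical and are not taken from `genotype_simulator.py`.

```python
import numpy as np

def filter_rare_variants(y, min_maf=0.05):
    """Drop columns (SNPs) whose minor allele frequency is below min_maf."""
    freq = y.mean(axis=0)               # derived allele frequency per SNP
    maf = np.minimum(freq, 1.0 - freq)  # fold to the minor allele
    return y[:, maf >= min_maf]

# toy data: 4 haploid samples, 4 SNPs (the second SNP is monomorphic)
y_toy = np.array([
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])
y_filt = filter_rare_variants(y_toy)
print(y_toy.shape, y_filt.shape)
```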

Let's perform PCA on the genotype matrix and visualize the first two PCs. Note that I center and scale the data matrix before running PCA.

In [None]:
geno.pca()
geno.plot_pca(geno.pcs, geno.pves)
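
The centering and scaling step can be sketched with a plain SVD, as below. This is one common way to compute PCs and the proportion of variance explained, not necessarily how `geno.pca()` is implemented; `pca_svd` and the random toy data are illustrative.

```python
import numpy as np

def pca_svd(y, k=2):
    """PCA of an (n_samples x n_snps) matrix via SVD of the standardized data."""
    mu = y.mean(axis=0)
    sd = y.std(axis=0)  # assumes no monomorphic (zero-variance) columns
    z = (y - mu) / sd   # center and scale each SNP
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    pcs = u[:, :k] * s[:k]              # sample coordinates on the top k PCs
    pves = s[:k] ** 2 / np.sum(s ** 2)  # proportion of variance explained
    return pcs, pves

rng = np.random.default_rng(0)
y_toy = rng.binomial(1, 0.3, size=(20, 50)).astype(float)
y_toy = y_toy[:, y_toy.std(axis=0) > 0]  # drop monomorphic columns first

pcs, pves = pca_svd(y_toy)
print(pcs.shape, pves)
```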

We see a strong signature of the barrier, with two clusters that correspond to geographic position along the x-axis.

## Expected genetic distances

We can see that the graph Laplacian $\mathbf{L}$ is sparse because $\mathbf{M}$ is sparse. We can think of $\mathbf{L}$ here as a sparse precision matrix.

In [None]:
hab.plot_precision_matrix(hab.l)

We can see that $\mathbf{L}\mathbf{L}^T$ is also sparse, though less so than $\mathbf{L}$. It seems to have an additional off-diagonal band.

In [None]:
hab.plot_precision_matrix(hab.l @ hab.l.T)
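
The extra band has a simple explanation: entry $(i, j)$ of $\mathbf{L}\mathbf{L}^T$ can be nonzero only when rows $i$ and $j$ of $\mathbf{L}$ share a nonzero column, that is, when demes $i$ and $j$ are neighbors or have a neighbor in common. The sketch below shows this on a small path graph with unit weights, assuming a standard combinatorial Laplacian (degree matrix minus adjacency matrix), which may differ from how `hab.l` is constructed.

```python
import numpy as np

def bandwidth(a, tol=1e-12):
    """Largest |i - j| over the nonzero entries of a."""
    i, j = np.nonzero(np.abs(a) > tol)
    return int(np.max(np.abs(i - j)))

# path graph on 6 nodes with unit edge weights
n = 6
adj = np.zeros((n, n))
idx = np.arange(n - 1)
adj[idx, idx + 1] = 1.0
adj[idx + 1, idx] = 1.0

# combinatorial Laplacian: degree matrix minus adjacency matrix
lap = np.diag(adj.sum(axis=1)) - adj

bw_l = bandwidth(lap)           # only direct neighbors are coupled
bw_llt = bandwidth(lap @ lap.T) # neighbors of neighbors are coupled too
print(bw_l, bw_llt)
```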

Next we compute the observed genetic distances along with several models for the expected genetic distances. Note that I center the data matrix before computing genetic distances, but I do not scale it.

In [None]:
# lower triangular indices
tril_idx = np.tril_indices(geno.n, -1)

# observed genetic distance
d_geno = geno.geno_dist()
d_geno_tril = d_geno[tril_idx]

# geographic distance
d_geo = geno.node_to_obs_mat(hab.geo_dist(), geno.n, geno.v)
d_geo_tril = d_geo[tril_idx]

# resistance distance
d_res = geno.node_to_obs_mat(hab.rw_dist(hab.l), geno.n, geno.v)
d_res_tril = d_res[tril_idx]

# random-walk distance
d_rw = geno.node_to_obs_mat(hab.rw_dist(hab.l @ hab.l.T), geno.n, geno.v)
d_rw_tril = d_rw[tril_idx]
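
For context, the textbook resistance distance between nodes $i$ and $j$ is $R_{ij} = L^{+}_{ii} + L^{+}_{jj} - 2 L^{+}_{ij}$, where $\mathbf{L}^{+}$ is the Moore-Penrose pseudo-inverse of the graph Laplacian. The sketch below applies that formula to a 3-node path graph. The internals of `hab.rw_dist` are not shown in this notebook, so this is an illustration of the general idea rather than a description of that method.

```python
import numpy as np

def resistance_distance(lap):
    """Pairwise resistance distances from a graph Laplacian."""
    lap_pinv = np.linalg.pinv(lap)
    d = np.diag(lap_pinv)
    return d[:, None] + d[None, :] - 2.0 * lap_pinv

# path graph 0 - 1 - 2 with unit edge weights
adj = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])
lap = np.diag(adj.sum(axis=1)) - adj

r = resistance_distance(lap)
print(np.round(r, 6))
```

On this path graph the result matches the intuition of resistors in series: adjacent nodes are at distance 1 and the two end nodes are at distance 2.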

In [None]:
geno.plot_dist(d_geo_tril, d_geno_tril, "Geographic Distance", "Genetic Distance")

print('geo r = {}'.format(np.corrcoef(d_geo_tril, d_geno_tril)[0, 1]))

In [None]:
geno.plot_dist(d_res_tril, d_geno_tril, "Resistance Distance", "Genetic Distance")

print('res r = {}'.format(np.corrcoef(d_res_tril, d_geno_tril)[0, 1]))

In [None]:
geno.plot_dist(d_rw_tril, d_geno_tril, "Random Walk Distance", "Genetic Distance")

print('rw r = {}'.format(np.corrcoef(d_rw_tril, d_geno_tril)[0, 1]))

In summary ...

In [None]:
# NOTE: d_coal_tril (the coalescent distance) is not computed in the cells above;
# it must be defined before this cell is run
print('coal r = {}'.format(np.corrcoef(d_coal_tril, d_geno_tril)[0, 1]))
print('geo r = {}'.format(np.corrcoef(d_geo_tril, d_geno_tril)[0, 1]))
print('res r = {}'.format(np.corrcoef(d_res_tril, d_geno_tril)[0, 1]))
print('rw r = {}'.format(np.corrcoef(d_rw_tril, d_geno_tril)[0, 1]))
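
Note that `np.corrcoef` returns the Pearson correlation coefficient $r$; the coefficient of determination for a simple linear fit is its square. A small helper makes the distinction explicit. The helper name and the numbers below are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two 1-d arrays."""
    return np.corrcoef(x, y)[0, 1]

# made-up distances, roughly linear in each other
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

r = pearson_r(x_toy, y_toy)
print('r = {:.3f}, r^2 = {:.3f}'.format(r, r ** 2))
```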

Interestingly, the correlation between the random walk distance and genetic distance is quite similar to the correlation between the coalescent distance and genetic distance! This is appealing because the coalescent distance is computed under the model we simulate under, so it's as good as it gets. One caveat is that we have to solve a system of equations, which might have numerical precision issues. Here we use the conjugate gradient sparse solver implemented in scipy as a first pass.
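
As a rough illustration of the numerical caveat, the sketch below solves a small sparse, symmetric positive definite system with `scipy.sparse.linalg.cg` and checks the relative residual. The matrix is a made-up tridiagonal example, not the system solved for the coalescent distance, which is not shown in this notebook.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

# made-up sparse SPD system: tridiagonal and strictly diagonally dominant
n = 50
a = sp.diags([-1.0, 3.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# cg returns the solution and a status flag (0 means it converged)
x, info = cg(a, b)

# always worth checking the residual when precision is a concern
rel_resid = np.linalg.norm(a @ x - b) / np.linalg.norm(b)
print(info, rel_resid)
```

Conjugate gradient assumes a symmetric positive definite matrix, so convergence can be slow or fail on poorly conditioned or singular systems. Checking `info` and the residual is a cheap safeguard.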

In [None]:
geno.plot_dist(d_rw_tril, d_coal_tril, "Random Walk Distance", "Coalescent Distance")

It seems like the random walk distance does poorly when the coalescent distance is small.

In [None]:
np.diag(hab.l)