# Simulating aliquot sequencing to build a covariance matrix

Outside of this notebook I simulated sequencing of 192 aliquots of 20 gametes apiece, with 100 markers and 3,840,000 reads spread across those markers.

Here, I will revisit the recombination map and show how a **covariance matrix** contains the signal of genome-wide recombination variation.

### imports

In [1]:
import scipy.stats as st
import numpy as np
import scipy.integrate as integrate
import toyplot
import h5py

from tqdm.notebook import tqdm

import poolparty

# 1. Show the recombination map

### Define an irregularly shaped recombination map with no negative values, scaled to integrate to 1 over [0, 1].

In [2]:
# scaling factor: one over the integral of the unscaled map on [0, 1]
scalar = 1 / integrate.quad(lambda x: (1+1*np.cos(21*x+np.sin(60*x))), 0, 1)[0]

# check: the scaled map should integrate to 1
integrate.quad(lambda x: (1+1*np.cos(21*x+np.sin(60*x))) * scalar, 0, 1)[0]

1.0

In [3]:
x = np.linspace(0, 1, 1000)
toyplot.scatterplot(x,
                    (1 + np.cos(21*x + np.sin(60*x))) * scalar,  # scaled recombination map
                    height=300,
                    width=500);
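
The map expression is repeated in several cells, so it may be convenient to define it once as a function. The following is a minimal sketch of that idea, not the code used for the simulations: the names `raw_rate` and `normalized_rate` are hypothetical, and a simple midpoint-rule sum stands in for `integrate.quad` so that only NumPy is needed.

```python
import numpy as np

def raw_rate(x):
    """Unscaled recombination map from this notebook (never negative)."""
    return 1 + np.cos(21 * x + np.sin(60 * x))

def normalized_rate(x, n_grid=100_000):
    """Map rescaled so that it integrates to ~1 over [0, 1].

    The integral is approximated with a midpoint rule on n_grid cells
    (an illustrative substitute for scipy.integrate.quad).
    """
    mids = (np.arange(n_grid) + 0.5) / n_grid
    total = raw_rate(mids).mean()  # cell width is 1/n_grid, so mean == integral
    return raw_rate(x) / total

# sanity check on the same grid: the rescaled map should integrate to ~1
mids = (np.arange(100_000) + 0.5) / 100_000
area = normalized_rate(mids).mean()
```

With a helper like this, each plotting cell could call `normalized_rate(np.linspace(0, 1, 1000))` instead of repeating the formula.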

# 2. Simulate!

### This has already been done, with `poolparty`. 

# 3. Load simulated data.

In [72]:
haps = h5py.File('/pinky/patrick/poolparty_sims/sims/20gpa_192nali_100loci_3840e3reads/haplotypes.hdf5','r')

# 4. Examine distribution of markers.

### Where are the markers?

In [73]:
np.array(haps['loci_locs'])

array([0.01713903, 0.01939749, 0.03608618, 0.04482447, 0.05775471,
       0.05786713, 0.07590005, 0.08368368, 0.09580599, 0.10956318,
       0.11469308, 0.14549169, 0.14693218, 0.15594299, 0.16176115,
       0.1753468 , 0.17840618, 0.19012154, 0.19409192, 0.19450069,
       0.21793627, 0.2184831 , 0.24468038, 0.2523716 , 0.26221637,
       0.27584935, 0.28674107, 0.29964363, 0.32011889, 0.32895052,
       0.34597573, 0.37708686, 0.38280757, 0.38854184, 0.39250975,
       0.39839582, 0.40865566, 0.41356142, 0.43724481, 0.44045411,
       0.45562362, 0.46724957, 0.46746545, 0.48273026, 0.48536054,
       0.49024082, 0.49425538, 0.49434697, 0.4967879 , 0.49750747,
       0.51126486, 0.51980264, 0.53949535, 0.54851223, 0.57439135,
       0.59464173, 0.59874063, 0.60096924, 0.6056355 , 0.61057238,
       0.62353998, 0.67581999, 0.69065215, 0.69994757, 0.70049625,
       0.70289751, 0.70894795, 0.72931759, 0.73379558, 0.73901322,
       0.7412148 , 0.7464459 , 0.74745078, 0.75745521, 0.76035

### Look at the distribution...

In [74]:
toyplot.scatterplot(np.array(haps['loci_locs']),np.repeat(0,len(np.array(haps['loci_locs']))),height=300,width=1000);
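
The marker positions are coordinates on [0, 1], so one way to relate them to the recombination map is to integrate the map between adjacent markers. The sketch below does this for the first five positions printed above. The helper `map_mass` and the midpoint-rule integration are illustrative assumptions, not part of `poolparty`.

```python
import numpy as np

def rate(x):
    """Unscaled recombination map from section 1."""
    return 1 + np.cos(21 * x + np.sin(60 * x))

def map_mass(a, b, n_grid=10_000):
    """Midpoint-rule approximation of the integral of the map on [a, b]."""
    mids = a + (np.arange(n_grid) + 0.5) * (b - a) / n_grid
    return rate(mids).mean() * (b - a)

total = map_mass(0.0, 1.0)

# first five marker positions from the output above
locs = np.array([0.01713903, 0.01939749, 0.03608618, 0.04482447, 0.05775471])

# share of the total map falling between each pair of adjacent markers
adjacent = np.array([map_mass(a, b) / total
                     for a, b in zip(locs[:-1], locs[1:])])
```

Under this reading, marker pairs with a larger share of the map between them sit in regions where more recombination is expected.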

# 5. Make a covariance matrix showing how the 100 markers covary in haplotype proportions across the 192 aliquots:

In [111]:
# build a (markers x aliquots) matrix of hap1 proportions;
# np.cov() is applied to it in the next cell
cov_mat = np.zeros((100,192))
for row in tqdm(range(100)):
    # at locus Alpha, what fraction of the observations in each aliquot are hap1?
    locusnumAlpha = row # locus index (out of 100)
    alphavals = np.zeros((192))
    for i in range(192):
        obs = haps[str(i)][str(locusnumAlpha)]
        alphavals[i] = np.sum(obs) / len(obs)

    cov_mat[row,:] = alphavals

In [112]:
# 100 by 100 -- because we have 100 markers
np.cov(cov_mat).shape

(100, 100)
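
The 100 by 100 shape follows from how `np.cov` reads its input: by default each row is a variable (here, a marker) and each column is an observation (here, an aliquot). A small sketch with made-up data illustrates this. The binomial draws below are hypothetical stand-ins, not the simulated haplotypes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_markers, n_aliquots, n_gametes = 5, 50, 20

# hypothetical hap1 counts per marker and aliquot (NOT the poolparty data)
counts = rng.binomial(n_gametes, 0.5, size=(n_markers, n_aliquots))
props = counts / n_gametes           # shape: (markers, aliquots)

cov = np.cov(props)                  # rows are variables -> (markers, markers)

# the diagonal holds each marker's sample variance across aliquots
diag_matches = np.allclose(np.diag(cov), props.var(axis=1, ddof=1))
```

If the matrix were stored as aliquots by markers instead, `np.cov(props, rowvar=False)` would give the same marker-by-marker result.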

In [115]:
# plotting markers and recombination map on top, covariance matrix below:
canvas = toyplot.Canvas(width=800, height=300)
axes = canvas.cartesian(label = "Recombination map and loci locations")
mark = axes.scatterplot(np.linspace(0,1,1000), 
                   (1+1*np.cos(21*np.linspace(0,1,num=1000)+np.sin(60*np.linspace(0,1,num=1000))))*scalar)
mark = axes.scatterplot(np.array(haps['loci_locs']),np.repeat(1,len(np.array(haps['loci_locs']))))
toyplot.matrix(np.cov(cov_mat), tshow=False, lshow=False, label="Covariance matrix",width=800,height=800);

# 6. Same process, new (better) data!
## Let's compare the results with a new dataset that uses the same number of markers but includes more aliquots and many more reads. This is close to the "true" amount of information we might get from a lane of sequencing.

## This new dataset includes 576 aliquots, 20 gametes per aliquot, 100 markers, and 19,200,000 reads.

In [116]:
haps = h5py.File('/pinky/patrick/poolparty_sims/sims/20gpa_576nali_100loci_19200e3reads/haplotypes.hdf5','r')

# 6.1 Make a covariance matrix showing how the 100 markers covary in haplotype proportions across the 576 aliquots:

In [117]:
# build a (markers x aliquots) matrix of hap1 proportions;
# np.cov() is applied to it in the next cell
cov_mat = np.zeros((100,576))
for row in tqdm(range(100)):
    # at locus Alpha, what fraction of the observations in each aliquot are hap1?
    locusnumAlpha = row # locus index (out of 100)
    alphavals = np.zeros((576))
    for i in range(576):
        obs = haps[str(i)][str(locusnumAlpha)]
        alphavals[i] = np.sum(obs) / len(obs)

    cov_mat[row,:] = alphavals

In [118]:
# 100 by 100 -- because we have 100 markers
np.cov(cov_mat).shape

(100, 100)
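
The proportion-building loop is identical for both datasets apart from the number of aliquots, so it could be wrapped in a helper. The sketch below assumes only what the loops above show: that `haps[str(aliquot)][str(locus)]` returns a sequence of 0/1 observations. The function name `haplotype_proportions` is hypothetical, and a nested dict stands in for the HDF5 file.

```python
import numpy as np

def haplotype_proportions(haps, n_loci, n_aliquots):
    """Return a (n_loci, n_aliquots) array of hap1 proportions.

    `haps` is any mapping indexed as haps[str(aliquot)][str(locus)],
    such as the h5py file used above or a nested dict.
    """
    props = np.zeros((n_loci, n_aliquots))
    for locus in range(n_loci):
        for aliquot in range(n_aliquots):
            obs = np.asarray(haps[str(aliquot)][str(locus)])
            props[locus, aliquot] = obs.sum() / len(obs)
    return props

# toy stand-in for the HDF5 file: 2 aliquots, 3 loci, 4 observations each
fake_haps = {str(a): {str(l): [1, 0, 1, 1] for l in range(3)} for a in range(2)}
props = haplotype_proportions(fake_haps, n_loci=3, n_aliquots=2)
```

With the real files, the two datasets would then differ only in the call, for example `haplotype_proportions(haps, 100, 192)` versus `haplotype_proportions(haps, 100, 576)`.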

In [119]:
# plotting markers and recombination map on top, covariance matrix below:
canvas = toyplot.Canvas(width=800, height=300)
axes = canvas.cartesian(label = "Recombination map and loci locations")
mark = axes.scatterplot(np.linspace(0,1,1000), 
                   (1+1*np.cos(21*np.linspace(0,1,num=1000)+np.sin(60*np.linspace(0,1,num=1000))))*scalar)
mark = axes.scatterplot(np.array(haps['loci_locs']),np.repeat(1,len(np.array(haps['loci_locs']))))
toyplot.matrix(np.cov(cov_mat), tshow=False, lshow=False, label="Covariance matrix",width=800,height=800);

# You can see that this covariance matrix, produced from aliquot sequence data alone, very clearly captures much of the fine-scale variation in recombination rates.