# Gamete simulation, and simulating selection at a single locus

This notebook:  
1) Uses `poolparty` to simulate a bunch of single gametes under a user-provided recombination map.  
2) Plots the resulting distribution of recombination events  
3) Simulates selection against a particular parental haplotype at a locus  

### imports

In [59]:
import poolparty
import h5py
import toyplot
import numpy as np
import scipy.integrate as integrate
import scipy.stats as st
from tqdm.notebook import tqdm

# 1. Define the recombination map

### Make a cool-shaped probability density function with no negative values. Integrating between two points on this map gives the probability of a forced recombination event landing there.

The second argument in the call below is an arbitrary function that stands in for the pdf of the recombination map.

In [6]:
toyplot.scatterplot(np.linspace(0,1,1000), 
                   (1+1*np.cos(21*np.linspace(0,1,num=1000))), # cool equation here
                   height=300,
                   width=500);

### 1.1 Does it integrate to 1?
A valid pdf must integrate to 1 over its range, here [0,1]. The unscaled function comes out slightly too large:

In [10]:
integrate.quad(lambda x: (1+1*np.cos(21*x)), 0, 1)[0]

1.0398407446921933

We will multiply by a scalar to make our recombination map integrate to 1:

In [11]:
# now define scaling by that previous number:
scalar = 1 / integrate.quad(lambda x: (1+1*np.cos(21*x)), 0, 1)[0] # one over previous line

# now look at new result (should equal 1!)
integrate.quad(lambda x: (1+1*np.cos(21*x)) * scalar, 0, 1)[0]

0.9999999999999999
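
As a cross-check, the normalizing constant also has a closed form: the integral of 1 + cos(21x) over [0,1] is 1 + sin(21)/21. The sketch below compares that value with the numerical result. The variable names are illustrative and not part of the notebook.

```python
import numpy as np
import scipy.integrate as integrate

# Closed-form area under 1 + cos(21x) on [0, 1]:
# the antiderivative x + sin(21x)/21 evaluated at 1 (it is 0 at x = 0).
analytic_area = 1 + np.sin(21) / 21

# Numerical area, computed the same way as in the cells above.
numeric_area = integrate.quad(lambda x: 1 + np.cos(21 * x), 0, 1)[0]

# The normalizing scalar is the reciprocal of the area.
analytic_scalar = 1 / analytic_area

print(analytic_area, numeric_area, analytic_scalar)
```

The two areas should agree to within the integrator's tolerance, and `analytic_scalar` should match the `scalar` used in the rest of the notebook.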

### 1.2 Now look at the scaled, final recombination map:

In [13]:
toyplot.scatterplot(np.linspace(0,1,1000), 
                   (1+1*np.cos(21*np.linspace(0,1,1000))) * scalar,
                   height=300,
                   width=500);

# 2. Simulate gametes with `poolparty`

The `poolparty` package automates gamete simulation. It induces one crossover event per gamete, plus an extra crossover event with probability 0.2. The location of each crossover event is determined by sampling from the pdf defined above.
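
To make the model concrete, here is a minimal NumPy sketch of the sampling scheme described above. It is not `poolparty`'s implementation. The function names, the grid-based inverse-CDF sampler, and the assumption that the two crossovers are drawn independently from the same pdf are all illustrative choices.

```python
import numpy as np

# Normalizing constant for 1 + cos(21x) on [0, 1] (closed form).
scalar = 1 / (1 + np.sin(21) / 21)

def sample_positions(n, rng, grid_size=10_001):
    """Draw n crossover positions by inverting the CDF on a fine grid."""
    grid = np.linspace(0, 1, grid_size)
    # CDF of the scaled map: scalar * (x + sin(21x)/21).
    cdf = scalar * (grid + np.sin(21 * grid) / 21)
    # Map uniform draws through the inverse CDF by linear interpolation.
    return np.interp(rng.random(n), cdf, grid)

def simulate_gametes(num_gams, rng, p_extra=0.2):
    """One crossover per gamete, plus a second one with probability p_extra."""
    n_co = 1 + rng.binomial(1, p_extra, size=num_gams)
    gam_ids = np.repeat(np.arange(num_gams), n_co)    # gamete index of each crossover
    positions = sample_positions(gam_ids.size, rng)   # location of each crossover
    start_haplo = rng.integers(0, 2, size=num_gams)   # assumed 50/50 starting haplotype
    return gam_ids, positions, start_haplo, n_co

rng = np.random.default_rng(0)
gam_ids, positions, start_haplo, n_co = simulate_gametes(5000, rng)
print(n_co.mean())  # average number of crossovers per gamete, expected near 1.2
```

The `gam_ids` and `positions` arrays play the same roles as the two columns of the `crossovers` dataset read back from the HDF5 file later in this notebook.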

First, define the pdf in a way that allows us to sample from it:

In [17]:
scalar = 0.961685724573154
class my_pdf(st.rv_continuous):
    def _pdf(self,x):
        expression = (1+1*np.cos(21*x)) * scalar # scale so that the pdf integrates to 1 over [0,1]
        return expression  # normalized over its range, in this case [0,1]
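
As a quick sanity check on a class like this, one can confirm that the total probability mass is 1 and draw a few positions. This is a sketch only. `RecombMap` is a hypothetical stand-in for `my_pdf`, and drawing this way is slow because SciPy has to invert the CDF numerically when only `_pdf` is defined.

```python
import numpy as np
import scipy.stats as st

scalar = 0.961685724573154  # normalizing constant from section 1.1

class RecombMap(st.rv_continuous):  # hypothetical name; mirrors my_pdf
    def _pdf(self, x):
        return (1 + np.cos(21 * x)) * scalar

recomb_map = RecombMap(a=0, b=1)   # support is [0, 1]

# Total probability mass over the support; should be very close to 1.
total_mass = recomb_map.cdf(1.0)

# A handful of crossover positions drawn from the map.
draws = recomb_map.rvs(size=5, random_state=0)

print(total_mass, draws)
```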

Now, create a simulation object. We have to tell it where to save the gamete file and the number of gametes to simulate.

In [22]:
sim_obj = poolparty.Sim_Gamete_Sequencing(
                      directory='/pinky/patrick/poolparty_sims/sims/20e3gametes/',
                      pdf=my_pdf(a=0,b=1), # this is our recombination map
                      num_gams = int(20e3),
                      gpa = None, # this and the below parameters are related to sequencing, not relevant here
                      nali=None,
                      ncutsites=None,
                      num_reads = None,
                 )

Because we aren't simulating sequencing of the gametes, we use the `sim_gametes_only()` function. This will save a .hdf5 file with all of the gametes and their respective recombination breakpoints.

In [23]:
sim_obj.sim_gametes_only()

# 3. Load the simulated gametes

We have now simulated 20000 gametes, which are saved to a file. We can now inspect where the recombination events occurred, and we can look at how selection would have influenced the results.

In [32]:
gamsfile = h5py.File('/pinky/patrick/poolparty_sims/sims/20e3gametes/gams.hdf5','r')
gamsfile.keys()

<KeysViewHDF5 ['crossovers', 'num_crossovers', 'start_haplo']>

# 4. Find and plot the recombination breakpoints (similar to results from single-cell sequencing):

In [30]:
toyplot.bars(np.histogram(gamsfile['crossovers'][:,1],100));

# 5. Simulate selection at a locus:

## 5.1 Let's impose complete selection against haplotype 1 at location 0.3 (in a high recombining region)

Across all gametes, if they have haplotype 1 at location 0.3, we will filter out the whole gamete.

In [82]:
loc = 0.3 # this is the location on the chromosome where selection is happening
selected_haplotype = 1
selection_intensity = 1


surviving_idxs = np.zeros((20000), dtype=bool)
for i in tqdm(range(20000)): # for each gamete
    crossover_locs = gamsfile['crossovers'][(gamsfile['crossovers'][:,0] == i),1]
    start_haplo = gamsfile['start_haplo'][i]
    if (np.sum(crossover_locs < loc) % 2) == 0:
        haplo_at_loc = start_haplo
    else:
        haplo_at_loc = 1-start_haplo
    
    if haplo_at_loc == selected_haplotype:
        if np.random.binomial(1,selection_intensity):
            surviving_idxs[i] = False
        else:
            surviving_idxs[i] = True
    else:
        surviving_idxs[i] = True
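
The same rule can be written without the per-gamete loop. The sketch below is an illustrative vectorized version run on toy arrays, not the notebook's data. The function names are hypothetical. With the real file, the gamete-index column would likely need converting first, e.g. `gamsfile['crossovers'][:, 0].astype(int)`, which is an assumption about how the indices are stored.

```python
import numpy as np

def haplotype_at(loc, gam_ids, positions, start_haplo):
    """Haplotype (0 or 1) carried by every gamete at chromosome position loc."""
    # Count, per gamete, the crossovers that fall to the left of loc.
    left_counts = np.bincount(gam_ids[positions < loc], minlength=start_haplo.size)
    # An even count keeps the starting haplotype; an odd count flips it.
    return np.where(left_counts % 2 == 0, start_haplo, 1 - start_haplo)

def surviving_mask(loc, selected_haplotype, selection_intensity,
                   gam_ids, positions, start_haplo, rng):
    """Boolean mask of gametes that survive selection at loc."""
    haplo = haplotype_at(loc, gam_ids, positions, start_haplo)
    # A gamete carrying the selected haplotype is removed with
    # probability selection_intensity.
    removed = (haplo == selected_haplotype) & (
        rng.random(start_haplo.size) < selection_intensity
    )
    return ~removed

# Toy data: three gametes with 1, 2 and 1 crossovers respectively.
gam_ids = np.array([0, 1, 1, 2])
positions = np.array([0.1, 0.2, 0.5, 0.8])
start_haplo = np.array([0, 1, 1])

haplo = haplotype_at(0.3, gam_ids, positions, start_haplo)
keep = surviving_mask(0.3, 1, 1.0, gam_ids, positions, start_haplo,
                      np.random.default_rng(0))
print(haplo, keep)  # → [1 0 1] [False  True False]
```

Reading the two datasets into memory once and counting with `np.bincount` avoids re-reading the HDF5 dataset for every gamete.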

### 5.1.0 Now plot the distribution of recombination events after selection:

You might wonder if selecting out the gametes that have a specific haplotype at a specific locus will influence the total distribution of remaining crossover events. Here we show that it **does not**.

In [83]:
# get the crossover distribution from the surviving gametes
co_events = []
for i in tqdm(range(20000)):
    if surviving_idxs[i]:
        co_events.extend(gamsfile['crossovers'][(gamsfile['crossovers'][:,0] == i),1])

In [86]:
# plotting:
canvas = toyplot.Canvas()
axes = canvas.cartesian(label="Crossover locations in surviving gametes")

mark = axes.bars(np.histogram(co_events,100));

mark = axes.plot(a=np.array([0,250]),b=np.array([0.3,0.3]), # place a red line where we've selected against hap1
                 color='red',
                 along='y') # plots a red line at selected location

As we might have expected, there's **no signal** of the selection in the **distribution of recombination events.** Selection simply cuts the number of sampled recombination events roughly in half, since about half of the neutrally simulated gametes carried haplotype 1 at location 0.3.

### 5.1.1 However, we should see a signal of the selection in the ratio of haplotype 1 to haplotype 0 across gametes at each location on the chromosome in the post-selection gamete pool:

In [87]:
# across the chromosome, compute the ratio of haplotype 1 to haplotype 0 among the surviving gametes:
haplotype_ratios = np.zeros(100)
for idx, pos in enumerate(tqdm(np.linspace(0,1,100))): # `pos` avoids overwriting the selected `loc`
    num_haplo_1 = 0
    num_haplo_0 = 0
    for i in range(20000):
        if surviving_idxs[i]:
            crossover_locs = gamsfile['crossovers'][(gamsfile['crossovers'][:,0] == i),1]
            start_haplo = gamsfile['start_haplo'][i]
            # an even number of crossovers left of `pos` keeps the starting haplotype
            if (np.sum(crossover_locs < pos) % 2) == 0:
                haplo_at_loc = start_haplo
            else:
                haplo_at_loc = 1-start_haplo

            if haplo_at_loc == 1:
                num_haplo_1 += 1
            else:
                num_haplo_0 += 1
    haplotype_ratios[idx] = num_haplo_1 / num_haplo_0

In [88]:
canvas = toyplot.Canvas(width=800, height=300)
axes = canvas.cartesian(label="Ratio of Haplotype 1 to Haplotype 0 genome-wide")

mark = axes.scatterplot(np.linspace(0,1,100), haplotype_ratios)

mark = axes.plot(a=np.array([0,2]),b=np.array([0.3,0.3]),
                 color='red',
                 along='y')

mark = axes.scatterplot(np.linspace(0,1,1000), 
                   ((1+1*np.cos(21*np.linspace(0,1,1000))) * scalar));

In the above plot, the green-dotted line shows the ratio of haplotype 1 to haplotype 0 at points along the genome in the post-selection pool of gametes. The vertical red line shows the x-axis location of the locus where haplotype 1 was selected against. The orange line shows the pdf of the recombination map.

We can see that the selection drives the hap1:hap0 ratio down at the selected locus, but the ratio rises quickly away from this point because the locus sits in a high-recombining region. Because of the way recombination is modeled (i.e. inducing just one or two recombination events per chromosome), most surviving gametes switch haplotype exactly once along the chromosome, so the ratio skews far above 1:1 on the right side of the plot.
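
The shape of this curve can be approximated analytically. This is a sketch under stated assumptions, not something `poolparty` provides: complete selection against haplotype 1, one crossover per gamete plus a second with probability 0.2, and both crossovers drawn independently from the same map. A surviving gamete carries haplotype 0 at the selected locus, so it carries haplotype 1 at position x only if an odd number of crossovers fall between the two points. The function names are illustrative.

```python
import numpy as np

# Normalizing constant for 1 + cos(21x) on [0, 1] (closed form).
scalar = 1 / (1 + np.sin(21) / 21)

def cdf(x):
    """CDF of the scaled recombination map on [0, 1]."""
    return scalar * (x + np.sin(21 * x) / 21)

def expected_ratio(x, selected_loc, p_extra=0.2):
    """Expected hap1:hap0 ratio at x after complete selection against hap1."""
    # Probability that a single crossover lands between x and the selected locus.
    p = np.abs(cdf(x) - cdf(selected_loc))
    # Odd number of crossovers in between: the one crossover (no extra),
    # or exactly one of two independent crossovers (with an extra).
    p_odd = (1 - p_extra) * p + p_extra * 2 * p * (1 - p)
    return p_odd / (1 - p_odd)

xs = np.linspace(0, 1, 100)
ratio_curve = expected_ratio(xs, selected_loc=0.3)
print(expected_ratio(1.0, 0.3))  # expected ratio at the right end of the chromosome
```

Under these assumptions the ratio is zero at the selected locus and well above 1 at the right end, which matches the qualitative pattern in the plot.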

## 5.2 Selection in a low-recomb part of the chromosome
### We'll try it at 0.15 and see how the hap1:hap0 ratio is affected.

Impose selection:

In [75]:
loc = 0.15 # this is the location on the chromosome where selection is happening
selected_haplotype = 1
selection_intensity = 1


surviving_idxs = np.zeros((20000), dtype=bool)
for i in tqdm(range(20000)): # for each gamete
    crossover_locs = gamsfile['crossovers'][(gamsfile['crossovers'][:,0] == i),1]
    start_haplo = gamsfile['start_haplo'][i]
    if (np.sum(crossover_locs < loc) % 2) == 0:
        haplo_at_loc = start_haplo
    else:
        haplo_at_loc = 1-start_haplo
    
    if haplo_at_loc == selected_haplotype:
        if np.random.binomial(1,selection_intensity):
            surviving_idxs[i] = False
        else:
            surviving_idxs[i] = True
    else:
        surviving_idxs[i] = True

Get genome-wide haplotype ratios:

In [76]:
# across the chromosome, compute the ratio of haplotype 1 to haplotype 0 among the surviving gametes:
haplotype_ratios = np.zeros(100)
for idx, pos in enumerate(tqdm(np.linspace(0,1,100))): # `pos` avoids overwriting the selected `loc`
    num_haplo_1 = 0
    num_haplo_0 = 0
    for i in range(20000):
        if surviving_idxs[i]:
            crossover_locs = gamsfile['crossovers'][(gamsfile['crossovers'][:,0] == i),1]
            start_haplo = gamsfile['start_haplo'][i]
            # an even number of crossovers left of `pos` keeps the starting haplotype
            if (np.sum(crossover_locs < pos) % 2) == 0:
                haplo_at_loc = start_haplo
            else:
                haplo_at_loc = 1-start_haplo

            if haplo_at_loc == 1:
                num_haplo_1 += 1
            else:
                num_haplo_0 += 1
    haplotype_ratios[idx] = num_haplo_1 / num_haplo_0

Plot:

In [81]:
# plotting
canvas = toyplot.Canvas(width=800, height=300)
axes = canvas.cartesian(label="Ratio of Haplotype 1 to Haplotype 0 genome-wide")

mark = axes.scatterplot(np.linspace(0,1,100), haplotype_ratios)

mark = axes.plot(a=np.array([0,2]),b=np.array([0.15,0.15]),
                 color='red',
                 along='y')

mark = axes.scatterplot(np.linspace(0,1,1000), 
                   ((1+1*np.cos(21*np.linspace(0,1,1000))) * scalar));

In the above plot, the green-dotted line shows the ratio of haplotype 1 to haplotype 0 at points along the genome in the post-selection pool of gametes. The vertical red line shows the x-axis location of the locus where haplotype 1 was selected against. The orange line shows the pdf of the recombination map.

Because selection is happening in a low-recombining region, the hap1:hap0 ratio is depressed across a much larger part of the chromosome. In turn, this pushes the hap1:hap0 ratio even higher than in the previous example when we move far away from the selected position.
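
One way to quantify "low-recombining" versus "high-recombining" is to integrate the map over a small window around each selected locus. The sketch below does that. The window half-width of 0.05 and the helper names are arbitrary illustrative choices.

```python
import numpy as np
import scipy.integrate as integrate

# Normalizing constant for 1 + cos(21x) on [0, 1] (closed form).
scalar = 1 / (1 + np.sin(21) / 21)

def recomb_pdf(x):
    """Scaled recombination map used throughout this notebook."""
    return (1 + np.cos(21 * x)) * scalar

def crossover_prob(lo, hi):
    """Probability that a single crossover lands between lo and hi."""
    return integrate.quad(recomb_pdf, lo, hi)[0]

half_window = 0.05  # hypothetical window half-width around each locus

# Probability of a crossover close to each selected locus.
near_030 = crossover_prob(0.30 - half_window, 0.30 + half_window)
near_015 = crossover_prob(0.15 - half_window, 0.15 + half_window)

print(near_030, near_015)
```

A crossover is far less likely to land near 0.15 than near 0.3, so haplotypes stay linked to the selected locus over a wider stretch of the chromosome at 0.15.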