# Lecture 05: Genetic linkage and its effects

### Reduction in diversity near a beneficial mutation after a selective sweep

One dramatic consequence of genetic linkage is that neutral alleles near an allele under selection can rise (or fall) sharply in frequency due to linkage disequilibrium. This process is called [genetic hitchhiking](https://en.wikipedia.org/wiki/Genetic_hitchhiking).

Here, we will simulate the loss of diversity at a neutral locus at various "distances" from an allele undergoing a selective sweep. Distance is measured by the ratio of the recombination probability $r$ to the selection coefficient of the beneficial allele, $s$.

To simplify the analysis, we'll consider a haploid model instead of a diploid one. In this model there are two loci, labeled 1 and 2, each of which carries either an A or a B allele. Genotypes are written with the allele at site 1 first, so BA carries B at site 1 and A at site 2. We'll assume that the B allele at site 1 is beneficial with selection coefficient $s$, and both alleles at site 2 are neutral. The fitness values of the different genotypes are then

$$w_{BA} = w_{BB} = 1 + s, \quad w_{AA} = w_{AB} = 1\,.$$

Recall that under the Wright-Fisher model, the frequency of genotype $G$ after selection is

$$ p^\prime_G = \frac{w_G}{\bar{w}}p_G\,, \quad \bar{w} = \sum_G w_G p_G\,. $$

If the population is haploid, then this is also the genotype frequency after one full generation.
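As a quick check of the selection update above, here is a minimal sketch of a single selection step. The function name `selection_step` and the genotype ordering (AA, AB, BA, BB) are illustrative choices, and the starting frequencies assume one BA individual in a population of 2000.

```python
import numpy as np

def selection_step(p, w):
    """Return genotype frequencies after one round of selection (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    w = np.asarray(w, dtype=float)
    w_bar = np.sum(w * p)    # mean fitness of the population
    return w * p / w_bar     # p'_G = (w_G / w_bar) * p_G

s = 0.01
# Genotype order (an assumption of this sketch): AA, AB, BA, BB
w = np.array([1.0, 1.0, 1.0 + s, 1.0 + s])
p = np.array([0.4995, 0.5, 0.0005, 0.0])

p_next = selection_step(p, w)
```

Dividing by $\bar{w}$ keeps the frequencies normalized, so `p_next` still sums to one while the BA genotype gains slightly at the expense of AA and AB.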

Now we will add recombination to this model. The probability of recombination per reproductive cycle is $r$, and the recombination partner is chosen at random from the population. The probability of producing a particular genotype $G = \{g_1, g_2\}$ through recombination is equal to the product of the corresponding allele frequencies at sites one and two, $p_1(g_1)\,p_2(g_2)$. Combining selection and recombination, the genotype frequencies in the next generation are

$$ p^\prime_G = \left(1 - r\right) \frac{w_G}{\bar{w}} p_G + r\, p_1(g_1)\, p_2(g_2)\,. $$
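The combined update can be written compactly if the genotype frequencies are stored in a 2×2 array. The sketch below is one possible way to do this, not the only one: the function name `next_generation` and the array layout `p[g1, g2]` (index 0 for allele A, 1 for allele B) are assumptions made for illustration. The allele frequencies in the recombination term are computed from the current generation's genotype frequencies.

```python
import numpy as np

def next_generation(p, w, r):
    """One generation of selection plus recombination (hypothetical helper).

    p[g1, g2] is the frequency of the genotype with allele g1 at site 1
    and allele g2 at site 2, where index 0 means A and index 1 means B.
    """
    p = np.asarray(p, dtype=float)
    w = np.asarray(w, dtype=float)
    p_1 = p.sum(axis=1)       # allele frequencies at site 1: [p_1(A), p_1(B)]
    p_2 = p.sum(axis=0)       # allele frequencies at site 2: [p_2(A), p_2(B)]
    w_bar = np.sum(w * p)     # mean fitness
    # (1 - r) * selection term + r * product of allele frequencies
    return (1 - r) * w * p / w_bar + r * np.outer(p_1, p_2)

s = 0.01
w = np.array([[1.0,     1.0],        # AA, AB
              [1.0 + s, 1.0 + s]])   # BA, BB
p = np.array([[0.4995, 0.5],         # AA, AB
              [0.0005, 0.0]])        # BA, BB

p_new = next_generation(p, w, r=0.001)
```

Even though the BB genotype starts at frequency zero, `p_new[1, 1]` is positive: recombination between BA and AB individuals moves the beneficial allele onto the other neutral background.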

With this model in hand, we can simulate how genetic diversity at the neutral locus (site two) changes due to linkage with site one. As in the book, let's assume that one copy of the beneficial B allele at site one appears together with an A allele at site two, and the initial allele frequency at site two is $p_2(A) = p_2(B) = 1/2$. This choice is arbitrary; the beneficial allele could just as well have appeared together with a B allele at site two.

In [None]:
import numpy as np          # here we import numpy
import numpy.random as rng  # and here we import the random number generation (sub-)library


# Set the starting parameters for the simulations

N    = 2*1000        # population size -- this should be divisible by 2
s    = 0.01          # selection coefficient for B at site 1
n_BA = 1             # starting number of BA genotypes
n_AA = N/2 - n_BA    # starting number of AA genotypes
n_AB = N/2           # starting number of AB genotypes
n_BB = 0             # starting number of BB genotypes
w_BA = w_BB = 1 + s  # fitness with beneficial alleles
w_AA = w_AB = 1      # fitness with neutral alleles only
p_end = 0.999        # run the simulation until the beneficial allele is >= this frequency

In [None]:
# Parameters used to iterate over and store our results

r    = np.linspace(0, 0.5*s, 51)   # recombination probabilities from 0 to s/2 in steps of 0.01*s
h    = np.zeros(len(r))            # final "heterozygosity" values
p_AA = (n_AA/N) * np.ones_like(r)  # AA frequency
p_AB = (n_AB/N) * np.ones_like(r)  # AB frequency
p_BA = (n_BA/N) * np.ones_like(r)  # BA frequency
p_BB = (n_BB/N) * np.ones_like(r)  # BB frequency
p_1A = p_AA + p_AB  # frequency of A at site 1
p_1B = p_BA + p_BB  # frequency of B at site 1
p_2A = p_AA + p_BA  # frequency of A at site 2
p_2B = p_AB + p_BB  # frequency of B at site 2
h0   = p_2A[0] * p_2B[0]  # initial heterozygosity at site two

for i in range(len(r)):

    while p_1B[i] < p_end:
        
        # Get new genotype frequencies
        w_bar = (w_AA * p_AA[i]) + (w_AB * p_AB[i]) + (w_BA * p_BA[i]) + (w_BB * p_BB[i])
        new_p_AA = (1 - r[i])*w_AA*p_AA[i]/w_bar + r[i] * p_1A[i] * p_2A[i]
        new_p_AB = (1 - r[i])*w_AB*p_AB[i]/w_bar + r[i] * p_1A[i] * p_2B[i]
        new_p_BA = (1 - r[i])*w_BA*p_BA[i]/w_bar + r[i] * p_1B[i] * p_2A[i]
        new_p_BB = (1 - r[i])*w_BB*p_BB[i]/w_bar + r[i] * p_1B[i] * p_2B[i]
        
        # Overwrite old genotype frequencies
        p_AA[i] = new_p_AA
        p_AB[i] = new_p_AB
        p_BA[i] = new_p_BA
        p_BB[i] = new_p_BB
        
        # Overwrite old allele frequencies
        p_1A[i] = p_AA[i] + p_AB[i]
        p_1B[i] = p_BA[i] + p_BB[i]
        p_2A[i] = p_AA[i] + p_BA[i]
        p_2B[i] = p_AB[i] + p_BB[i]

    # After the B allele at site one sweeps, save the scaled heterozygosity at site 2
    h[i] = p_2A[i] * p_2B[i] / h0

In [None]:
# Finally, let's make a plot

import seaborn as sns            # import seaborn
import matplotlib.pyplot as plt  # and matplotlib


# As in the textbook, we can reflect the "heterozygosity" scores around a distance of zero

r_r = np.concatenate((-r[:0:-1], r))  # mirrored distances: -r_max, ..., 0, ..., r_max
h_r = np.concatenate((h[:0:-1], h))   # matching heterozygosity values


# Plot the results
sns.lineplot(x=np.array(r_r)/s, y=h_r)
    

plt.xlabel('Distance (r/s)')
plt.ylabel('Heterozygosity')
plt.xlim(-0.5, 0.5)
plt.ylim(0, 1);

**Discussion.** How does the plot above change as we vary $N$ and $s$? What does that tell us about the effect of genetic hitchhiking in different scenarios?
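To explore this question, it may help to wrap the simulation in a function that takes $N$, $s$, and $r$ as arguments. The following is a minimal sketch under the same assumptions as the cells above (a single initial BA copy, $p_2(A) = p_2(B) = 1/2$). The name `sweep_heterozygosity` and the `max_generations` safety cap are additions for illustration and are not part of the original notebook. Because the model is deterministic, $N$ enters only through the initial frequency $1/N$ of the beneficial allele.

```python
import numpy as np

def sweep_heterozygosity(N, s, r, p_end=0.999, max_generations=1_000_000):
    """Scaled heterozygosity at site 2 after B sweeps at site 1 (hypothetical helper)."""
    # p[g1, g2]: index 0 is allele A, index 1 is allele B
    p = np.array([[0.5 - 1.0 / N, 0.5],    # AA, AB
                  [1.0 / N,       0.0]])   # BA, BB
    w = np.array([[1.0,     1.0],
                  [1.0 + s, 1.0 + s]])
    h0 = p.sum(axis=0).prod()              # initial p_2(A) * p_2(B)

    for _ in range(max_generations):       # cap is a safety guard, not in the original
        if p[1].sum() >= p_end:            # stop once B at site 1 has swept
            break
        p_1 = p.sum(axis=1)                # allele frequencies at site 1
        p_2 = p.sum(axis=0)                # allele frequencies at site 2
        w_bar = np.sum(w * p)
        p = (1 - r) * w * p / w_bar + r * np.outer(p_1, p_2)

    p_2 = p.sum(axis=0)
    return p_2[0] * p_2[1] / h0

# Example: same s and r, different population sizes
results = {N: sweep_heterozygosity(N, s=0.01, r=0.002) for N in (200, 2000, 20000)}
```

Comparing the values in `results`, or repeating the comparison for several values of `s` at a fixed `r`, gives a direct way to see how the width of the dip in diversity depends on each parameter.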