# Phasing and recombination tutorial

In this tutorial you will extract information about the genotype phases and the recombination rate starting from VCF files.

## How to make this notebook work

* In this notebook we will use both command-line `bash` commands and `R` to set up the folders and run the analysis.
* Because we switch between two languages, you need to choose a kernel each time the language changes. A kernel provides a programming language and the packages needed to run the course material. To choose a kernel, open the menu at the top of the page, select `Kernel --> Change Kernel`, and pick the one you need. Throughout the notebook, a picture tells you when to change kernel. The two pictures are shown below:

<img src="img/bash.png" alt="Bash" width="80"> Shift to the `Bash` kernel

<img src="img/R.png" alt="R" width="80"> Shift to the `popgen course` kernel
* You can run the code in each cell by clicking on the run cell sign in the toolbar, or simply by pressing <kbd>Shift</kbd>+<kbd>Enter</kbd>. When the code is done running, a small green check sign will appear on the left side.
* You need to run the cells in sequential order to execute the analysis. Please do not run a cell until the one above is done running, and do not skip any cells.
* The code goes along with textual descriptions to help you understand what happens. Please try not to focus on understanding the code itself in too much detail, but rather on the explanations and on the output of the commands.
* You can create new code cells by pressing `+` in the menu bar above or by pressing <kbd>B</kbd> after you select a cell.


## Learning outcomes

At the end of this tutorial you will be able to:

- **Create** phased data and **discuss** the results through the genome viewer
- **Estimate and analyze** recombination maps
- **Plot** recombination maps in `R` and **compare** different populations

## Setting up folders

Here we set up a link to the `Data` folder and create the `Results` folder.

<img src="img/bash.png" alt="Bash" width="80"> Choose the `Bash` kernel

In [None]:
ln -s ../../Data
mkdir -p Results

# Inferring the genotype phase (AKA Phasing)

During base calling, we identified the two bases at each position in each diploid individual. However, we do not know which base goes on which of the two chromosomes. That means that we do not know if the two haploid chromosomes look like the left or right example below:

    -----A-----C------         -----T-----C------ 
    -----T-----G------    or   -----A-----G------


To infer which configuration is the correct one, we use the program [Beagle](http://faculty.washington.edu/browning/beagle/beagle.html), which applies a clustering algorithm to call the genotype phase.
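The ambiguity described above can be made concrete with a minimal sketch. It takes the two heterozygous sites from the example, written as unphased genotypes, and prints the two haplotype configurations that are compatible with them. The variable names are only illustrative, and this is not how Beagle works internally.

```shell
# Two heterozygous sites from the example above, written as unphased genotypes.
site1="A/T"
site2="C/G"

# Split each genotype into its two alleles (POSIX parameter expansion).
a1=${site1%/*}; b1=${site1#*/}
a2=${site2%/*}; b2=${site2#*/}

# The two possible ways to place the alleles on the two chromosomes.
# Each string is "<chromosome 1>|<chromosome 2>".
config_left="${a1}${a2}|${b1}${b2}"
config_right="${b1}${a2}|${a1}${b2}"

echo "Possible phase 1: $config_left"
echo "Possible phase 2: $config_right"
```

With only two sites there are two candidate configurations; phasing chooses between them using information from the other individuals in the sample.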

We put the jointly called bases for Africans, West Eurasians, and East Asians in these three files:

- Africa (9 individuals): `Data/vcf/Allvariants_africa.vcf`
- West Eurasia (10 individuals): `Data/vcf/Allvariants_westeurasia.vcf`
- East Asia (8 individuals): `Data/vcf/Allvariants_eastasia.vcf`

In this exercise we will just use the Africans and the West Eurasians.

## Running Beagle

For additional information, see the [Beagle 4.1 manual](https://faculty.washington.edu/browning/beagle/beagle_4.1_03Oct15.pdf).

To obtain phased data, Beagle needs a [genetic map](https://www.genome.gov/genetics-glossary/Genetic-Map). Genetic maps for widely used organisms (humans, mice, yeast, ...) are already available online. You can find the genetic map for chr2 (hg19 assembly) here: `Data/genetic_map/plink.chr2.GRCh37.map`.
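To get a feeling for what a genetic map contains, here is a small sketch that assumes the PLINK map layout (chromosome, marker ID, position in cM, position in bp) and computes the recombination rate in cM/Mb between consecutive markers. The three map lines are invented toy values, not taken from the course file.

```shell
# Toy PLINK-style map: chromosome, marker ID, genetic position (cM), physical position (bp).
# The values are made up for illustration.
toy_map=$(mktemp)
printf '2 . 0.00 10000\n2 . 0.05 60000\n2 . 0.06 160000\n' > "$toy_map"

# For each pair of consecutive markers: (difference in cM) / (difference in Mb).
rates=$(awk '
  NR > 1 { printf "%d-%d %.2f\n", prev_bp, $4, ($3 - prev_cm) / (($4 - prev_bp) / 1e6) }
  { prev_cm = $3; prev_bp = $4 }
' "$toy_map")

echo "$rates"
rm -f "$toy_map"
```

To try the same calculation on the real map, replace `"$toy_map"` with `Data/genetic_map/plink.chr2.GRCh37.map`, after checking with `head` that the columns match the layout assumed here.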

We run `Beagle` on the African VCF file. `Beagle` is written in Java and distributed as a `.jar` file, so we call it using `java -jar`.

In [None]:
java -jar Data/software/beagle.08Jun17.d8b.jar \
        gt=Data/vcf/Allvariants_africa.vcf \
        map=Data/genetic_map/plink.chr2.GRCh37.map \
        out=Results/Allvariants_africa_phased

We run the same command for the West Eurasian VCF file:

In [None]:
java -jar Data/software/beagle.08Jun17.d8b.jar \
        gt=Data/vcf/Allvariants_westeurasia.vcf \
        map=Data/genetic_map/plink.chr2.GRCh37.map \
        out=Results/Allvariants_westeurasia_phased

VCF files can be either compressed (`.gz`) or uncompressed. Beagle writes compressed output, and we will load uncompressed files into IGV, so decompress the results using:

In [None]:
gunzip -c Results/Allvariants_africa_phased.vcf.gz > Results/Allvariants_africa_phased_t.vcf

In [None]:
gunzip -c Results/Allvariants_westeurasia_phased.vcf.gz > Results/Allvariants_westeurasia_phased_t.vcf

`gunzip -c` writes the decompressed data to stdout, and `>` redirects it into the file name of your choice.
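In a VCF file, phased genotypes are written with `|` between the two alleles (for example `0|1`), while unphased genotypes use `/`. As a quick sanity check, the sketch below counts both kinds of genotypes in a tiny made-up VCF. The helper `count_genotypes` is a hypothetical name introduced only for this illustration.

```shell
# Build a tiny toy VCF with two samples and two sites (invented data).
toy_vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf '2\t100\t.\tA\tT\t.\tPASS\t.\tGT\t0|1\t1|1\n'
  printf '2\t200\t.\tC\tG\t.\tPASS\t.\tGT\t1|0\t0/1\n'
} > "$toy_vcf"

# Count genotypes whose GT subfield contains the given separator.
# $1 = VCF file, $2 = separator ("|" for phased, "/" for unphased).
count_genotypes() {
  awk -F'\t' -v sep="$2" '
    !/^#/ {
      for (i = 10; i <= NF; i++) {      # sample columns start at column 10
        split($i, field, ":")           # GT is the first colon-separated subfield
        if (index(field[1], sep) > 0) n++
      }
    }
    END { print n + 0 }
  ' "$1"
}

phased=$(count_genotypes "$toy_vcf" "|")
unphased=$(count_genotypes "$toy_vcf" "/")
echo "phased genotypes: $phased, unphased genotypes: $unphased"
rm -f "$toy_vcf"
```

Calling `count_genotypes Results/Allvariants_africa_phased_t.vcf "/"` on the decompressed Beagle output should report no unphased genotypes if phasing worked as intended.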

## Browsing the phased results

Download the phased VCF files to your computer and open them in IGV (Integrative Genomics Viewer):

1. Choose Human hg19 as the reference genome.
2. Click `File > Load from File...` and select your phased VCF file.

Explore phases of haplotypes at two positions in the alignment:

Select chr2, zoom all the way in, and find the base at position 136608646. First, take a look at the West Eurasian samples. Consider these questions while zooming further out:

1. What do the haplotypes look like?
2. Do you see any long stretches of homozygosity?
3. Which haplotypes agree?
4. How wide is the region where they agree?

To help derive your answers, make use of the metadata file: `Data/metadata/Sample_meta_subset.tsv`

Now, compare them with the African samples.

Search for the position chr2:136608646 in the [UCSC genome browser](https://genome-euro.ucsc.edu/cgi-bin/hgGateway?redirect=manual&source=genome.ucsc.edu). Remember that we are using the hg19 assembly of the human reference genome. Can you find anything that explains your observations? (HINT: https://omim.org/entry/601806#0004)

# Estimating a recombination map

We use `LDhat`, a package for the analysis of recombination rates from population genetic data, to estimate a recombination map from the VCF files. The proportion of crossovers occurring between two genes indicates the distance between them, which makes it possible to construct a genetic map showing how the genes in the genome are arranged relative to one another.
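As a small worked example of how a proportion of crossovers translates into a map distance: for closely linked loci, a recombination fraction of 1% corresponds to roughly 1 centimorgan (cM). The counts below are invented, and the approximation only holds for short distances.

```shell
# Hypothetical counts: 12 recombinant offspring observed out of 400.
recombinants=12
offspring=400

# Map distance in cM is approximately 100 * recombination fraction (short distances only).
map_distance_cm=$(awk -v r="$recombinants" -v n="$offspring" \
  'BEGIN { printf "%.2f", 100 * r / n }')

echo "Recombination fraction $recombinants/$offspring is approx. $map_distance_cm cM"
```

LDhat does not count crossovers directly as in this toy case. It infers the population-scaled recombination rate from patterns of linkage disequilibrium among the sampled haplotypes.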

## Format input data for LDhat

LDhat needs its input data in a particular format ([LDhat manual](https://github.com/auton1/LDhat/blob/master/manual.pdf)). We will use `vcftools` to produce these input files from the phased VCF file.

For Africa:

In [None]:
vcftools --gzvcf Results/Allvariants_africa_phased.vcf.gz --chr 2 --ldhat --out Results/recmap_data_africa

For West Eurasia:

In [None]:
vcftools --gzvcf Results/Allvariants_westeurasia_phased.vcf.gz --chr 2 --ldhat --out Results/recmap_data_westeurasia

Have a look at the two files produced:

In [None]:
head Results/recmap_data_westeurasia.ldhat.locs

In [None]:
head Results/recmap_data_westeurasia.ldhat.sites

How do you think the information is encoded in these files?

## Running LDhat

To speed up computations you can make a lookup table first. That takes a while, so we did it for you. The table is made with the `complete` program that comes with LDhat, as shown below (the command does nothing here because it is commented out with the `#` symbol). The parameters are:
- `-n 20`: the number of haplotypes (2 * 10).
- `-rhomax 100`: the maximum rho ($4N_e r$) allowed: 100 (recommended).
- `-n_pts 101`: number of points in grid: 101 (recommended).
- `-theta 0.0001`: human theta ($4N_e \mu$).

In [None]:
#Data/software/builds/LDhat/complete -n 20 -rhomax 100 -n_pts 101 -theta 0.0001

The output is a file that serves as a lookup table for the algorithm. It contains coalescent likelihoods for each pair of SNPs over a grid of recombination rates. You can find it in `Data/ldhat/new_lk.txt`.
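The numbers passed to `complete` can be checked with a little arithmetic. The sketch below assumes that the grid of rho values is evenly spaced from 0 to `rhomax`, and the variable names are only illustrative.

```shell
# Values from the `complete` command above.
n_individuals=10   # diploid individuals
ploidy=2
rhomax=100
n_pts=101

# Each diploid individual contributes two haplotypes.
n_haplotypes=$((n_individuals * ploidy))

# Assuming an evenly spaced grid from 0 to rhomax, n_pts points give n_pts - 1 intervals.
grid_step=$(awk -v m="$rhomax" -v n="$n_pts" 'BEGIN { print m / (n - 1) }')

echo "haplotypes: $n_haplotypes"
echo "spacing between grid points: $grid_step"
```

Under that assumption, 101 points up to a maximum of 100 give one grid point for every integer value of rho.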

The next step is to calculate the recombination map. This command will take some time to run (~ 6 min). The options we are using are:
- `-lk`: likelihood lookup table.
- `-its`: number of iterations of the MCMC.
- `-samp`: number of iterations between sampling events, i.e., how often to sample from the MCMC.
- `-burn`: how many of the initial iterations to discard. Here we set it to zero to keep all samples. We will decide later how much burn-in to discard.
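The options above determine how many samples the MCMC produces. The following sketch does the arithmetic for the values used in this tutorial (`-its 100000 -samp 500`), together with the burn-in of 60 samples that is discarded later with `stat`. The sample count is approximate, since the program may also record the starting state.

```shell
its=100000        # total MCMC iterations (-its)
samp=500          # iterations between samples (-samp)
burn_samples=60   # samples discarded later with `stat -burn 60`

n_samples=$((its / samp))                  # 200 samples in total
kept_samples=$((n_samples - burn_samples)) # 140 samples left after burn-in
burn_iterations=$((burn_samples * samp))   # 30000 iterations discarded

echo "samples written: $n_samples"
echo "samples kept after burn-in: $kept_samples"
echo "iterations treated as burn-in: $burn_iterations"
```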

Africa:

In [None]:
Data/software/builds/LDhat/rhomap -seq Results/recmap_data_africa.ldhat.sites \
                                        -loc Results/recmap_data_africa.ldhat.locs \
                                        -prefix Results/recmap_data_africa. \
                                        -lk Data/ldhat/new_lk.txt -its 100000 -samp 500 -burn 0

West Eurasia:

In [None]:
Data/software/builds/LDhat/rhomap -seq Results/recmap_data_westeurasia.ldhat.sites \
                                        -loc Results/recmap_data_westeurasia.ldhat.locs \
                                        -prefix Results/recmap_data_westeurasia. \
                                        -lk Data/ldhat/new_lk.txt -its 100000 -samp 500 -burn 0

When `rhomap` completes, it writes three files, each named with the `-prefix` value in front (for example `Results/recmap_data_africa.rates.txt`):

- `acceptance_rates.txt`: acceptance rates of the MCMC. If they are lower than 1%, the program should be run with more iterations.
- `summary.txt`: (quoting the manual) for each SNP interval, the estimated genetic map position, the estimated recombination rate, and the hotspot density (the number of hotspots per kb per iteration).
- `rates.txt`: (quoting the manual) is the output from each sample detailing the recombination rate (expressed in $4N_e r$ per kb) between each SNP.
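Because the rates are expressed as $4N_e r$ per kb, converting them to the more familiar cM/Mb requires a value for the effective population size. The sketch below uses $N_e = 10000$ and a rate of 0.4 per kb. Both are assumptions chosen for illustration, not outputs of this analysis.

```shell
rho_per_kb=0.4   # hypothetical estimate of 4*Ne*r per kb
ne=10000         # assumed effective population size

# r per bp per generation = rho / (4 * Ne) / 1000
# cM per Mb               = r * 100 (Morgan -> cM) * 1e6 (bp -> Mb)
cm_per_mb=$(awk -v rho="$rho_per_kb" -v ne="$ne" \
  'BEGIN { printf "%.2f", rho / (4 * ne) / 1000 * 100 * 1e6 }')

echo "rho = $rho_per_kb per kb with Ne = $ne is approx. $cm_per_mb cM/Mb"
```

The result depends directly on the assumed $N_e$, so the absolute scale of a converted map should be treated with caution, while the relative differences along the sequence are unaffected.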

## Analyze results in R

<img src="img/R.png" alt="R" width="80"> Shift to the `popgen course` kernel

We now import the resulting files from `LDhat` into `R` using functions created by the author of `LDhat`. The summary produced by these functions will be useful for the analysis later on.

In [None]:
source("Data/software/ldhat.r")

One of the loaded functions is `summarise.rhomap`, which produces two plots:
- A graph of the recombination rate at each polymorphic locus, along with confidence intervals.
- A plot showing how the estimate of the recombination rate has progressed with each MCMC sample. Notice that the initial run of MCMC samples is atypical. This is the "burn-in" of the MCMC. We want to remove it, so take note of how many samples it corresponds to. Later we will produce a new set of estimates that excludes this burn-in using the `stat` program that comes with LDhat.

We first call `summarise.rhomap` for each population:

In [None]:
summarise.rhomap(rates.file = "Results/recmap_data_africa.rates.txt", 
                 locs.file="Results/recmap_data_africa.ldhat.locs")

In [None]:
summarise.rhomap(rates.file = "Results/recmap_data_westeurasia.rates.txt", 
                 locs.file="Results/recmap_data_westeurasia.ldhat.locs")

<img src="img/bash.png" alt="Bash" width="80"> Shift to the `Bash` kernel

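We run the `stat` program from LDhat, using `-burn` to discard the initial samples identified as burn-in (60 here). This produces a file called `res.txt`, which describes the confidence in the estimated recombination rate along the sequence. We move it into `Results` under a population-specific name so that the second run does not overwrite it.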

In [None]:
Data/software/builds/LDhat/stat -input Results/recmap_data_africa.rates.txt \
                                -loc Results/recmap_data_africa.ldhat.locs \
                                -burn 60
mv res.txt Results/recmap_data_africa.res.txt

In [None]:
Data/software/builds/LDhat/stat -input Results/recmap_data_westeurasia.rates.txt \
                                -loc Results/recmap_data_westeurasia.ldhat.locs \
                                -burn 60
mv res.txt Results/recmap_data_westeurasia.res.txt
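The effect of discarding the burn-in can be illustrated with a toy trace. The values below are invented: the first three samples are still far from the region where the chain settles, and including them pulls the mean upwards. This only mimics the idea and is not what `stat` computes internally.

```shell
# Hypothetical trace of one parameter across seven MCMC samples.
trace="10 8 6 2 2 2 2"
burn=3   # number of initial samples to discard

# Mean over all samples.
mean_all=$(printf '%s\n' $trace | awk '{ s += $1 } END { printf "%.2f", s / NR }')

# Mean over the samples that remain after dropping the first $burn.
mean_kept=$(printf '%s\n' $trace | awk -v b="$burn" \
  'NR > b { s += $1; n++ } END { printf "%.2f", s / n }')

echo "mean with burn-in included: $mean_all"
echo "mean after discarding burn-in: $mean_kept"
```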

<img src="img/R.png" alt="R" width="80"> Shift to the `popgen course` kernel

Now we plot the final results. To get the positions of the loci for which we have estimated mean recombination rates, we will combine the new `res.txt` datasets with the summary files generated by LDhat. First, load the packages we need:

In [None]:
library(ggplot2)
library(dplyr)
library(tidyr)
library(magrittr)

We plot the mean recombination rate along genome positions. The blue line shows the average recombination rate, and the gray band shows the 95% confidence interval.

In [None]:
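# Read the rhomap summary (positions) and the stat output (burn-in removed)
summary_tab <- read.table("Results/recmap_data_africa.summary.txt", header = TRUE)
rates <- read.table("Results/recmap_data_africa.res.txt", header = TRUE)
rates %>%
  filter(Loci > 0) %>%                               # keep rows with a positive locus index
  mutate(pos = summary_tab$Position.kb. * 1000) %>%  # convert kb to bp
  ggplot(aes(x = pos, y = Mean_rho, ymin = L95, ymax = U95)) +
  geom_line(color = "blue") +                        # mean recombination rate
  geom_ribbon(alpha = 0.1) +                         # 95% confidence interval
  theme_bw()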

In [None]:
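# Read the rhomap summary (positions) and the stat output (burn-in removed)
summary_tab <- read.table("Results/recmap_data_westeurasia.summary.txt", header = TRUE)
rates <- read.table("Results/recmap_data_westeurasia.res.txt", header = TRUE)
rates %>%
  filter(Loci > 0) %>%                               # keep rows with a positive locus index
  mutate(pos = summary_tab$Position.kb. * 1000) %>%  # convert kb to bp
  ggplot(aes(x = pos, y = Mean_rho, ymin = L95, ymax = U95)) +
  geom_line(color = "blue") +                        # mean recombination rate
  geom_ribbon(alpha = 0.1) +                         # 95% confidence interval
  theme_bw()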

Look at the plots and ponder the following questions:

- Are there any recombination hotspots?
- Are there any regions where the estimated recombination rate is really low? 
- Can you see any hotspots in Africans that are not found in West Eurasians, or the other way around?
- What does the recombination rate look like around the lactase gene?
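One possible way to make the hotspot question more concrete is to flag loci whose mean rate exceeds some multiple of the average rate. The sketch below does this for a toy table with the `Loci` and `Mean_rho` columns used in the plots above. The data and the threshold factor of 3 are arbitrary choices for illustration, not an established definition of a hotspot.

```shell
# Toy table in the spirit of res.txt (invented values).
toy_res=$(mktemp)
cat > "$toy_res" <<'EOF'
Loci Mean_rho
1 0.5
2 0.4
3 6.0
4 0.6
5 0.5
EOF

factor=3   # arbitrary threshold: rate > factor * average rate

# The file is read twice: first to compute the average, then to report loci above the threshold.
hotspots=$(awk -v k="$factor" '
  FNR == 1 { next }                    # skip the header line in both passes
  NR == FNR { sum += $2; n++; next }   # pass 1: accumulate the sum and count
  $2 > k * sum / n { print $1 }        # pass 2: print loci above the threshold
' "$toy_res" "$toy_res")

echo "candidate hotspot loci: $hotspots"
rm -f "$toy_res"
```

Running the same filter on both populations and comparing the flagged positions gives a rough, threshold-dependent answer to the third question.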