# 1. Resources and setup
- [`PSMC`](https://github.com/lh3/psmc)
- [`VCF file format`](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
- [`vcftools`](https://vcftools.github.io/index.html)
- [`1000 genomes high coverage variant calls`](https://www.internationalgenome.org/data-portal/sample)
- `Cattle genomes high coverage variant calls`

We have prepared all required tools and data for this exercise. Please enjoy the exercise!

We will use the `psmc` tool to estimate the effective population size from our data. Instead of starting directly from bam files, we will start from a variant call format (VCF) file. In this tutorial, we will first look at a vcf file, then see how a vcf file is converted into input for `psmc` using a custom Python script, and finally run `psmc` on simulated and real data to see how well we can estimate the effective population size $N_e$. Note that `psmc` relies on high-coverage data from which genotypes can be called accurately, which is one of the reasons we work with vcf files in these exercises.

# 2. Data
In this tutorial, we will use 3 sets of data. First, we will use three simulated samples to estimate the effective population size in three simulated populations. The advantage of simulated data is that we know the true population sizes, so we can test whether `psmc` recovers them. Second, we will use data from the 1000 Genomes Project, namely 2 human individuals: NA12718, a CEU female sample (Northern European ancestry), and NA19471, a female Luhya sample from Kenya. We will compare and contrast how the effective population sizes vary between these two populations. Lastly, we will use three cattle samples: Limousin_ROUSTAN, a European cattle (*Bos taurus*); N_7B, a Bali cattle (a domesticated population of the banteng, *Bos javanicus*, a close relative of European cattle); and N_31B, the hybrid offspring of a Bali cattle and a regular cattle. The last sample will help us understand what happens when we analyze an admixed sample using `psmc`.

Before we begin, let us take a look at the vcf file format using the simulated data with three individuals, one from each simulated population. Note that all the vcf files are zipped, so we will use `zcat` to display them.

In [1]:
zcat /course/bgi23/rasmus/shyam/psmc_tutorial/data/threeinds.recode.vcf.gz | head -10

##fileformat=VCFv4.2
##source=tskit 0.5.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr42,length=100000000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	s1_1	s2_1	s3_1
chr42	84	0	A	T	.	PASS	.	GT	0|0	0|0	0|0
chr42	147	1	T	G	.	PASS	.	GT	0|0	0|0	0|0
chr42	334	2	G	A	.	PASS	.	GT	0|0	0|0	0|0
chr42	406	3	C	A	.	PASS	.	GT	0|0	0|0	0|0

gzip: stdout: Broken pipe


All the lines starting with `##` are information header lines which describe the VCF file. The line beginning with `#CHROM` is the header line describing the contents of the subsequent data lines. Each data line contains information on a single variant. As you can also see, each individual s1_1, s2_1 and s3_1 has a column with the genotype for that individual at each variant. Here "0|0" is homozygous for the reference (REF) allele, "0|1" is heterozygous, and "1|1" is homozygous for the alternate (ALT) allele.
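
The genotype encoding described above can be sketched in a few lines of Python. This is only an illustration, not part of the course scripts: the function names `classify_genotype` and `genotypes_from_line` are made up here, the code assumes diploid genotypes, and the data line in the usage example is invented to show all three genotype classes.

```python
def classify_genotype(gt):
    """Classify a diploid VCF GT string such as '0|1' or '0/1'."""
    # '|' separates phased alleles and '/' unphased ones; treat both alike.
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if alleles[0] != alleles[1]:
        return "heterozygous"
    if alleles[0] == "0":
        return "homozygous reference"
    return "homozygous alternate"


def genotypes_from_line(header_line, data_line):
    """Map each sample name to its genotype class for one VCF data line."""
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = data_line.rstrip("\n").split("\t")
    # The FORMAT column (9th field) tells us where GT sits in each sample field.
    gt_index = fields[8].split(":").index("GT")
    return {
        sample: classify_genotype(value.split(":")[gt_index])
        for sample, value in zip(samples, fields[9:])
    }


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1_1\ts2_1\ts3_1"
# Invented data line for illustration (not taken from the tutorial file).
line = "chr42\t1000\t9\tA\tT\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1"
print(genotypes_from_line(header, line))
```

Only the heterozygous class matters for `psmc`, since its input records where the two haplotypes of one individual differ.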

We will use the vcf files to generate input files for our `psmc` analysis. Note that this is __NOT__ the preferred method to generate input files for `psmc`. The preferred method is to generate the input files directly from bam files, and that information can be found [here](https://github.com/lh3/psmc).

__Question:__ Why do you think vcf files are not the preferred format? What information do you think is missing?

In [2]:
cp /davidData/data/course/bgi23/rasmus/shyam/psmc_tutorial/images/*.png ./

# 3. Working with simulated data

In this first exercise, we will look at simulated data. The simulation setup looked like this:

(please run the following code to show the plot)

In [3]:
## show the simulation setup
from IPython.display import Image # import image module 
Image(url="popsize.png", width=600, height=600)

We will now use `psmc` to see whether we can faithfully reconstruct these population sizes. First, let us construct a psmcfa file, the fasta-like input file for psmc, from the vcf file.

## psmcfa file

A psmcfa file is a fasta representation of the genome (or part of a genome), where each letter indicates whether a fixed-size window contains a heterozygous variant. Let us try to understand this by converting our vcf file to a psmcfa file. We will use the custom script `vcf2psmcfa.py` to do this. Note that this script expects the vcf file to contain only 1 "chromosome", which is not the case for a general vcf file. The script takes 2 input parameters: the vcf file name and the name of the sample. The output file name is derived from the name of the sample. The script uses a fixed window size of 100 bp.

In [4]:
python2 /course/bgi23/rasmus/shyam/psmc_tutorial/scripts/vcf2psmcfa.py /course/bgi23/rasmus/shyam/psmc_tutorial/data/threeinds.recode.vcf.gz s1_1
python2 /course/bgi23/rasmus/shyam/psmc_tutorial/scripts/vcf2psmcfa.py /course/bgi23/rasmus/shyam/psmc_tutorial/data/threeinds.recode.vcf.gz s2_1
python2 /course/bgi23/rasmus/shyam/psmc_tutorial/scripts/vcf2psmcfa.py /course/bgi23/rasmus/shyam/psmc_tutorial/data/threeinds.recode.vcf.gz s3_1

Let us take a quick look at one of the psmcfa files. 

In [5]:
head /course/bgi23/rasmus/shyam/psmc_tutorial/data/s2_1.psmcfa

>chr1
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTTTTTTTTTTTKTTTTTKKTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTTKTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTKTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTKTTTTTTTTTTTTTTTTTTTTTTTTTKTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTKTTTTTTTTTKTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTKTTTTTTTTTTTTTTTTTTTTTTTKTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTTKTTKTTTTTTTTTTTTTTTT


It appears to be a regular fasta file, with only "T" and "K". Here a "T" represents a 100 bp window without any heterozygous sites in it, whereas a "K" represents a 100 bp window with at least 1 heterozygous site in it. 
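
To make the windowing concrete, here is a minimal Python 3 sketch of the idea. It is not the course's `vcf2psmcfa.py`: the function names, the 60-letter line width, and the example positions are assumptions chosen for illustration, and it takes a plain list of heterozygous positions instead of reading a vcf file.

```python
def het_positions_to_windows(het_positions, seq_length, window=100):
    """Return one letter per window: 'K' if it holds a het site, else 'T'."""
    n_windows = (seq_length + window - 1) // window  # round up
    letters = ["T"] * n_windows
    for pos in het_positions:
        # VCF positions are 1-based, so position 100 falls in window 0.
        letters[(pos - 1) // window] = "K"
    return "".join(letters)


def format_fasta(name, seq, width=60):
    """Wrap a sequence into fasta text with a fixed line width."""
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


# Hypothetical heterozygous positions on a 500 bp sequence.
windows = het_positions_to_windows([84, 147, 250], seq_length=500)
print(format_fasta("chr1", windows), end="")
```

Note that a window with several heterozygous sites still yields a single "K", so the psmcfa file records presence or absence per window, not the number of sites.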

__Question:__ Given the population histories, which of the three samples would you expect to have the highest number of heterozygous windows?

## Running psmc

Now it is time for us to run our first `psmc` analysis. Let us first look at the options for running psmc. 

In [6]:
/course/bgi23/rasmus/psmc/psmc


Program: psmc (Pairwise SMC Model)
Version: 0.6.5-r67
Contact: <http://hengli.uservoice.com/>

Usage:   psmc [options] input.txt

Options: -p STR      pattern of parameters [4+5*3+4]
         -t FLOAT    maximum 2N0 coalescent time [15]
         -N INT      maximum number of iterations [30]
         -r FLOAT    initial theta/rho ratio [4]
         -c FILE     CpG counts generated by cntcpg [null]
         -o FILE     output file [stdout]
         -i FILE     input parameter file [null]
         -T FLOAT    initial divergence time; -1 to disable [-1]
         -b          bootstrap (input be preprocessed with split_psmcfa)
         -S          simulate sequence
         -d          perform decoding
         -D          print full posterior probabilities




In particular, note the `-p` option, which specifies a pattern of parameters. As discussed in the lecture, time is divided into discrete bins, and the pattern lets several consecutive bins share one parameter. For example, the default pattern 4+5\*3+4 splits time into 23 bins (4+15+4): one parameter is estimated for the first 4 bins, one parameter for each of the next 5 triples of bins, and one parameter for the last 4 bins. So the total number of parameters we are estimating is 7 (1+5+1). You will see this as stretches of constant value in the effective population size plots.
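
The bin and parameter counts for any pattern can be checked with a short Python sketch. The helper name `parse_pattern` is made up for this illustration; it reads a term `n*w` as `n` parameters that each span `w` bins, which is how the default pattern is described above.

```python
def parse_pattern(pattern):
    """Expand a psmc -p pattern into the number of bins per free parameter."""
    spans = []
    for term in pattern.split("+"):
        if "*" in term:
            # "5*3" means 5 parameters, each covering 3 consecutive bins.
            repeats, width = (int(x) for x in term.split("*"))
            spans.extend([width] * repeats)
        else:
            # A bare number is one parameter covering that many bins.
            spans.append(int(term))
    return spans


for pattern in ["4+5*3+4", "4+25*2+4+6", "4+10*2+4+6"]:
    spans = parse_pattern(pattern)
    print(pattern, "bins:", sum(spans), "parameters:", len(spans))
```

The same helper applies to the finer patterns used later in this tutorial, which trade more parameters for less data per parameter.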

Let us now run psmc for the first time - we will just use the default values for the options, while still explicitly specifying the parameter pattern. This pattern is quite coarse, but we will try and estimate it again with finer time bins in the next section.  

__Question:__ Can you think of a reason why the most recent and oldest time bins are quite often merged to get only 1 parameter?

In [7]:
/course/bgi23/rasmus/psmc/psmc -p "4+5*3+4" -o s1_1_coarsePattern.psmc /course/bgi23/rasmus/shyam/psmc_tutorial/data/s1_1.psmcfa &
/course/bgi23/rasmus/psmc/psmc -p "4+5*3+4" -o s2_1_coarsePattern.psmc /course/bgi23/rasmus/shyam/psmc_tutorial/data/s2_1.psmcfa &
/course/bgi23/rasmus/psmc/psmc -p "4+5*3+4" -o s3_1_coarsePattern.psmc /course/bgi23/rasmus/shyam/psmc_tutorial/data/s3_1.psmcfa

[1] 451122
[2] 451123
[1]-  Done                    /course/bgi23/rasmus/psmc/psmc -p "4+5*3+4" -o s1_1_coarsePattern.psmc /course/bgi23/rasmus/shyam/psmc_tutorial/data/s1_1.psmcfa


Let us take a look at one of the output files, and discuss its contents. 

In [8]:
tail -34 s1_1_coarsePattern.psmc

//
IT	1301
RD	30
LK	-6048.127908
QD	-0.987841 -> -0.895559
RI	0.0013290298
TR	0.012441	0.011574
MT	27.920540
MM	C_pi: 1.955250, n_recomb: 1209.203524
RS	0	0.000000	58721.822094	0.000595	0.000000	0.000000
RS	1	0.029196	58721.822094	0.000768	0.000001	0.000000
RS	2	0.066916	58721.822094	0.000993	0.000001	0.000000
RS	3	0.115649	58721.822094	0.001283	0.000001	0.000001
RS	4	0.178609	1.652714	57.496115	0.047549	0.037859
RS	5	0.259952	1.652714	70.251382	0.058097	0.055644
RS	6	0.365044	1.652714	84.451425	0.069841	0.069953
RS	7	0.500817	1.790183	92.146706	0.076204	0.075640
RS	8	0.676232	1.790183	106.556138	0.088121	0.088476
RS	9	0.902860	1.790183	119.305718	0.098665	0.100785
RS	10	1.195655	2.057115	112.996050	0.093447	0.096319
RS	11	1.573934	2.057115	118.651040	0.098123	0.100018
RS	12	2.062655	2.057115	117.310827	0.097015	0.098087
RS	13	2.694063	1.614276	130.058273	0.107557	0.108800
RS	14	3.509816	1.614276	95.324766	0.078833	0.080210
RS	15	4.563736	1.614276	59.361086	0.049091	0.050224
RS	16	5.92

Let us now cat the psmc outputs for all 3 samples, and plot them into a pdf. 

In [9]:
cat s1_1_coarsePattern.psmc \
    s2_1_coarsePattern.psmc s3_1_coarsePattern.psmc > combined_coarsePattern.psmc 

We will use the inbuilt plotting functions in psmc to plot them. 

In [10]:
perl /course/bgi23/rasmus/psmc/utils/psmc_plot.pl


Usage:   psmc_plot.pl [options] <out.prefix> <in.psmc>

Options: -u FLOAT   absolute mutation rate per nucleotide [2.5e-08]
         -s INT     skip used in data preparation [100]
         -X FLOAT   maximum generations, 0 for auto [0]
         -x FLOAT   minimum generations, 0 for auto [10000]
         -Y FLOAT   maximum popsize, 0 for auto [0]
         -m INT     minimum number of iteration [5]
         -n INT     take n-th iteration (suppress GOF) [20]
         -M titles  multiline mode [null]
         -f STR     font for title, labels and tics [Helvetica,16]
         -g INT     number of years per generation [25]
         -w INT     line width [4]
         -P STR     position of the keys [right top]
         -T STR     figure title [null]
         -N FLOAT   false negative rate [0]
         -S         no scaling
         -L         show the last bin
         -p         convert to PDF (with epstopdf)
         -R         do not remove temporary files
         -G         plot grid




In [11]:
perl /course/bgi23/rasmus/psmc/utils/psmc_plot.pl -u 2.5e-08 -s 100 -Y 1 -m 5 -n 30 -p -M "Pop1, Pop2, Pop3" Simulations_coarsePattern combined_coarsePattern.psmc
convert Simulations_coarsePattern.pdf Simulations_coarsePattern.png

In [12]:
## check the output image
from IPython.display import Image # import image module 
Image(url="Simulations_coarsePattern.png", width=600, height=600)
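
The plot shows years and effective population size, while the psmc output file stores scaled values. As a rough sketch of the conversion, assuming the scaling described in the PSMC README ($N_0 = \theta_0 / (4 \mu s)$, time in generations $= 2 N_0 t_k$, $N_e = N_0 \lambda_k$), the following Python snippet rescales values from the `TR` and `RS` lines. The function name `scale_rs_rows` is made up, and the mutation rate, window size `s`, and generation time are the plotting defaults shown in the usage message above.

```python
def scale_rs_rows(theta0, rs_rows, mu=2.5e-8, bin_size=100, gen_time=25):
    """Convert (t_k, lambda_k) pairs into (generations, years, Ne) tuples.

    theta0 is the first value on the TR line; t_k and lambda_k are the
    second and third columns of each RS line (assumed column meaning).
    """
    n0 = theta0 / (4 * mu * bin_size)
    scaled = []
    for t_k, lambda_k in rs_rows:
        generations = 2 * n0 * t_k
        scaled.append((generations, generations * gen_time, n0 * lambda_k))
    return scaled


# Values copied from the TR line and two RS lines of the output shown earlier.
rows = scale_rs_rows(0.012441, [(0.178609, 1.652714), (0.500817, 1.790183)])
for generations, years, ne in rows:
    print(round(generations), round(years), round(ne))
```

Because both axes depend on the assumed mutation rate and generation time, changing `-u` or `-g` shifts and rescales the curves without changing their shape.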


__Question:__ Can you see the differences in the population sizes? Do they make sense to you? Do they match up with the simulation scheme?

Now that we know how to run psmc and plot the output, we can rerun psmc on the same samples with a much finer time pattern, "4+25\*2+4+6", and plot the output.

__Question:__ What would you expect the output to be? 


__Question:__ Does the output plot from the finer time pattern match the coarser time pattern? Which one would you prefer?


__Question:__ Is it always better to use a finer resolution? Why or why not?


# 4. PSMC on real world samples

## 4.1 human data sets
We are now going to shift gears and estimate the effective population sizes from 2 samples from the 1000 Genomes Consortium. We will only be using data from chromosome 1 due to time constraints. Further, we have already run the commands that generate the psmcfa files using our custom script, so you do not need to do that.

In [13]:
echo python2 /course/bgi23/rasmus/shyam/psmc_tutorial/scripts/vcf2psmcfa.py /course/bgi23/rasmus/shyam/psmc_tutorial/data/1kg_2samps_chr1_noindels.recode.vcf.gz NA12718
echo python2 /course/bgi23/rasmus/shyam/psmc_tutorial/scripts/vcf2psmcfa.py /course/bgi23/rasmus/shyam/psmc_tutorial/data/1kg_2samps_chr1_noindels.recode.vcf.gz NA19471

python2 /course/bgi23/rasmus/shyam/psmc_tutorial/scripts/vcf2psmcfa.py /course/bgi23/rasmus/shyam/psmc_tutorial/data/1kg_2samps_chr1_noindels.recode.vcf.gz NA12718
python2 /course/bgi23/rasmus/shyam/psmc_tutorial/scripts/vcf2psmcfa.py /course/bgi23/rasmus/shyam/psmc_tutorial/data/1kg_2samps_chr1_noindels.recode.vcf.gz NA19471


__Task:__ The psmcfa files are called NA12718_chr1.psmcfa and NA19471_chr1.psmcfa. Using the same sets of commands we used in the previous section, run psmc on these 2 samples using the pattern "4+25\*2+4+6". Note that this will take about 10 minutes to run.

If you are short on time, or are tired of waiting, I have the psmc files for these samples already generated. They are called NA12718_chr1.psmc and NA19471_chr1.psmc. Use the plotting command - you might have to remove the -Y option from the command from the last section - to plot these. 

Here is my output from psmc runs on these 2 samples. Please run the following code to show the output plot.

In [14]:
## check the output image
from IPython.display import Image # import image module 
Image(url="1kg_chr1.png", width=600, height=600)

__Question:__ Interpret these effective population size plots for a European and an African population.

__Question:__ Connect the recent high population size of African populations to what you know about the genetic variation seen in Africa. 

### Effect of low coverage

In this section, we will explore the effects of low coverage on effective population size estimation. First, a question:

__Question:__ What will happen to genotype calling with decreasing coverage, for a single individual? Think about which kind of genotypes will be hard to call? Homozygote or heterozygote?

I have run psmc 2 extra times for the sample NA12718, where I intentionally mask 10% and 20% of the heterozygote sites to mimic the effects of low coverage. 
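
The masking idea can be illustrated with a small Python sketch. This is a simplification and not the procedure used to produce the runs above: it works on the window letters of a psmcfa sequence instead of individual sites, and the function names, the seed, and the toy sequence are made up for illustration.

```python
import random


def mask_het_windows(seq, fraction, seed=0):
    """Turn a random fraction of 'K' windows into 'T' to mimic missed hets."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return "".join(
        "T" if letter == "K" and rng.random() < fraction else letter
        for letter in seq
    )


def het_fraction(seq):
    """Fraction of called windows ('T' or 'K') that are heterozygous."""
    called = sum(letter in "TK" for letter in seq)
    return seq.count("K") / called if called else 0.0


toy = "TTKTTTTKTT" * 100  # toy sequence with 20% heterozygous windows
for fraction in (0.0, 0.1, 0.2):
    masked = mask_het_windows(toy, fraction)
    print(fraction, het_fraction(masked))
```

Comparing the heterozygous fraction before and after masking gives a first hint about the direction in which the estimates will move.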

__Question:__ What do you expect will happen to the effective population size estimates?

Here is my psmc plot for these three scenarios: (Please run the following code to show the plot)

In [15]:
## check the output image
from IPython.display import Image # import image module 
Image(url="NA12718.png", width=600, height=600)

__Question:__ Can you explain the reduction in the $N_e$ with increasing missingness of heterozygotes?

### Estimating uncertainty in $N_e$ estimates

Due to time constraints, we will not delve into details here, but one can estimate the uncertainty in the $N_e$ estimates from psmc using bootstrapping, where the genome is broken into chunks and resampled. An example of bootstrapped estimates: (Please run the following code to see the plot)
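
The resampling step can be sketched in Python as follows. This is only a conceptual illustration with made-up names and a toy sequence; in practice psmc has its own `-b` option for bootstrapping, which, per the usage message shown earlier, expects preprocessed (split) input.

```python
import random


def bootstrap_chunks(seq, chunk_len, seed=0):
    """Split a sequence into chunks and resample them with replacement."""
    rng = random.Random(seed)
    chunks = [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]
    # Draw as many chunks as we started with; some repeat, some drop out.
    return [rng.choice(chunks) for _ in chunks]


toy = "TTKTTTTTTTTKKTTTTTTT" * 50  # toy window sequence, 1000 letters
replicates = [bootstrap_chunks(toy, chunk_len=100, seed=s) for s in range(3)]
for replicate in replicates:
    print(len(replicate), sum(chunk.count("K") for chunk in replicate))
```

Running the inference once per replicate and overlaying the curves gives a spread like the one in the bootstrap plot.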

In [16]:
## check the output image
from IPython.display import Image # import image module 
Image(url="bootstrap.png", width=600, height=600)

__Question:__ Why do the most recent time periods have the highest variance in $N_e$ estimates?

## 4.2 Cattle data
We are now going to estimate the effective population sizes from 3 cattle samples. We will only be using data from chromosome 1 due to time constraints. We will also slightly reduce the resolution, but feel free to try the same resolution settings as above for the human data.

**Note: running PSMC may take around 10 mins**

In [19]:
python2 /course/bgi23/rasmus/cattles/vcf2psmcfa.py /course/bgi23/rasmus/cattles/Cattles.4inds.vcf.gz N_7B &
python2 /course/bgi23/rasmus/cattles/vcf2psmcfa.py /course/bgi23/rasmus/cattles/Cattles.4inds.vcf.gz N_31B &
python2 /course/bgi23/rasmus/cattles/vcf2psmcfa.py /course/bgi23/rasmus/cattles/Cattles.4inds.vcf.gz Limousin_ROUSTAN

[3] 451545
[4] 451546


In [18]:
/course/bgi23/rasmus/psmc/psmc -p "4+10*2+4+6" -o N_7B.425246.psmc N_7B.psmcfa &
/course/bgi23/rasmus/psmc/psmc -p "4+10*2+4+6" -o N_31B.425246.psmc N_31B.psmcfa &
/course/bgi23/rasmus/psmc/psmc -p "4+10*2+4+6" -o Limousin_ROUSTAN.425246.psmc Limousin_ROUSTAN.psmcfa

[1] 451332
[2] 451333


In [20]:
cat N_7B.425246.psmc Limousin_ROUSTAN.425246.psmc > two.combined_cattles.425246.psmc
perl /course/bgi23/rasmus/psmc/utils/psmc_plot.pl -u 1.26e-08 -g 6 -s 100 -m 5 -n 30 -p -M "PureBali,EuropeanTaurine" two.combined_cattles.425246.psmc.plot two.combined_cattles.425246.psmc
convert two.combined_cattles.425246.psmc.plot.pdf two.combined_cattles.425246.psmc.plot.png
cat N_7B.425246.psmc N_31B.425246.psmc Limousin_ROUSTAN.425246.psmc > combined_cattles.425246.psmc
perl /course/bgi23/rasmus/psmc/utils/psmc_plot.pl -u 1.26e-08 -g 6 -s 100 -m 5 -n 30 -p -M "PureBali,Hybrid,EuropeanTaurine" combined_cattles.425246.psmc.plot combined_cattles.425246.psmc
convert combined_cattles.425246.psmc.plot.pdf combined_cattles.425246.psmc.plot.png



Check the outputs:

In [21]:
## check the output image
from IPython.display import Image # import image module
Image(url="combined_cattles.425246.psmc.plot.png", width=600, height=600)

In [22]:
Image(url="two.combined_cattles.425246.psmc.plot.png", width=600, height=600)

### Effect of admixture

__Question__: Look at the curves for cattle and Bali cattle - what can we see here?

__Question__: How about the hybrid sample? How does admixture impact the inference of effective population size? 