# 1. Resources and set up
- [`PSMC`](https://github.com/lh3/psmc)
- [`VCF file format`](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
- [`vcftools`](https://vcftools.github.io/index.html)
- [`1000 genomes high coverage variant calls`](https://www.internationalgenome.org/data-portal/sample)


You will need to start the exercises by copying the data and tutorial into your own directory.

```bash
cp -r /course/popgen24/shyam/psmc_tutorial .
```

For convenience's sake, this tutorial is presented as a python notebook. Almost all of the commands in this tutorial are run in bash.

We will use the `psmc` tool to estimate the effective population size for our data. Instead of starting directly from bam files, we will use a variant call format (vcf) file as the starting point. In this tutorial, we will first take a look at a vcf file, then see how a vcf file is converted to an input for `psmc` using a custom python script, and finally run `psmc` on simulated and real data to see how well we can estimate the effective population size $N_e$. Note that, unlike many of the previous tools you used in this course, `psmc` relies on high coverage data where we can estimate genotypes accurately, which is one of the reasons we are working with vcf files in our exercises.

# 2. Data
In this tutorial, we will use 2 sets of data. First, we will use samples from 3 simulated populations: one with constant population size, one with population growth, and one with population decline (see the figure later in the tutorial). In the second half of the exercises, we will use data from the 1000 genomes project, namely 2 individuals: NA12718, a female CEU sample (Northern European ancestry), and NA19471, a female Luhya sample from Kenya. We will compare and contrast how the effective population sizes vary between these two populations. Finally, if you want to play with some data on your own, you can use the data from wildebeest to recreate their population history.

Before we begin, let us take a look at the vcf file format using the simulated data with three individuals, one from each simulated population. Note that all the vcf files are zipped, so we will use `zcat` to display them.

In [1]:
!zcat data/threeinds.recode.vcf.gz | head -10

##fileformat=VCFv4.2
##source=tskit 0.5.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr42,length=100000000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	s1_1	s2_1	s3_1
chr42	84	0	A	T	.	PASS	.	GT	0|0	0|0	0|0
chr42	147	1	T	G	.	PASS	.	GT	0|0	0|0	0|0
chr42	334	2	G	A	.	PASS	.	GT	0|0	0|0	0|0
chr42	406	3	C	A	.	PASS	.	GT	0|0	0|0	0|0

gzip: stdout: Broken pipe


All the lines starting with `##` are information header lines which describe the VCF file. The line beginning with `#CHROM` is the header line describing the contents of the subsequent data lines. Each data line contains information on a single variant. As you can also see, each individual s1_1, s2_1 and s3_1 has a column with the genotype for that individual at each variant. Here "0|0" is homozygous for the reference (REF) allele, "0|1" is heterozygous, and "1|1" is homozygous for the alternate (ALT) allele.
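
To make the genotype encoding concrete, here is a minimal Python sketch that classifies the genotypes on one data line. The function name `classify_genotype` is made up for this illustration, and the sketch assumes diploid samples with `GT` as the only FORMAT field, as in the simulated file above. Note that both "0|1" and "1|0" count as heterozygous.

```python
# Minimal sketch: classify the genotypes on a single VCF data line.
# Assumes diploid samples and that GT is the only FORMAT field.

def classify_genotype(gt):
    """Return 'hom_ref', 'het', 'hom_alt' or 'missing' for a diploid GT string."""
    alleles = gt.replace("|", "/").split("/")  # accept phased and unphased calls
    if "." in alleles:
        return "missing"
    if alleles[0] != alleles[1]:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"

# Header and first data line copied from the output above.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1_1\ts2_1\ts3_1"
record = "chr42\t84\t0\tA\tT\t.\tPASS\t.\tGT\t0|0\t0|0\t0|0"

samples = header.split("\t")[9:]    # sample names start at column 10
genotypes = record.split("\t")[9:]  # one genotype per sample
calls = {s: classify_genotype(g) for s, g in zip(samples, genotypes)}
print(calls)
```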

We will use the vcf files to generate input files for our `psmc` analysis. Note that this is __NOT__ the preferred method to generate input files for `psmc`. The preferred method is to generate the input files directly from bam files, and that information can be found [here](https://github.com/lh3/psmc).

__Question:__ Why do you think vcf files are not the preferred format? What information do you think is missing?

# 3. Working with simulated data

In this first exercise, we will look at data simulated under the following demography: ![population sizes](images/popsize.png).

We will use `psmc` to see if we can faithfully reconstruct these population sizes. First, let us construct a psmcfa file, an input fasta file for psmc, from the vcf files.

## psmcfa file

A psmcfa file is a fasta representation of the genome (or part of a genome), where each entry (letter) indicates whether there is a heterozygous variant in a fixed size window. Let us try and understand this by converting our vcf file to a psmcfa file. We will use the custom script `vcf2psmcfa.py` to do this. Note that this script expects the vcf file to contain only 1 "chromosome", which is not the case for a general vcf file. The script takes 2 input parameters: the vcf file name and the name of the sample. The output file name is derived from the name of the sample. The script uses a fixed window size of 100 bp.

In [2]:
!scripts/vcf2psmcfa.py data/threeinds.recode.vcf.gz s1_1
!scripts/vcf2psmcfa.py data/threeinds.recode.vcf.gz s2_1
!scripts/vcf2psmcfa.py data/threeinds.recode.vcf.gz s3_1

Let us take a quick look at one of the psmcfa files. 

In [3]:
!head s2_1.psmcfa

>chr1
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTTTTTTTTTTTKTTTTTKKTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTTKTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTKTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTKTTTTTTTTTTTTTTTTTTTTTTTTTKTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTKTTTTTTTTTKTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTKTTTTTTTTTTTTTTTTTTTTTTTKTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTTKTTKTTTTTTTTTTTTTTTT


It appears to be a regular fasta file, with only "T" and "K". Here a "T" represents a 100 bp window without any heterozygous sites in it, whereas a "K" represents a 100 bp window with at least 1 heterozygous site in it. 
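
The windowing idea can be sketched in a few lines of Python. This is not the course script `vcf2psmcfa.py`, only an illustration of the encoding described above. The function name `het_positions_to_psmcfa` and the toy positions are made up, and the sketch assumes 1-based positions and 60 letters per output line.

```python
# Minimal sketch of the T/K encoding; NOT the actual vcf2psmcfa.py script.

def het_positions_to_psmcfa(het_positions, seq_length, window=100, line_width=60):
    """Encode 1-based heterozygous positions as a list of T/K fasta lines."""
    n_windows = (seq_length + window - 1) // window  # round up to cover the end
    symbols = ["T"] * n_windows                      # start with no het windows
    for pos in het_positions:
        symbols[(pos - 1) // window] = "K"           # mark the window holding pos
    seq = "".join(symbols)
    return [seq[i:i + line_width] for i in range(0, len(seq), line_width)]

# Toy example: 1000 bp, with het sites in the 2nd window (twice) and the 5th.
lines = het_positions_to_psmcfa([150, 199, 420], seq_length=1000)
print(">toy_chromosome")
print("\n".join(lines))
```

Two heterozygous sites in the same window still give a single "K", so the psmcfa file records whether a window is heterozygous, not how many heterozygous sites it contains.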

__Question:__ Given the population histories, which of the three samples would you expect to have the highest number of heterozygous windows?

## Running psmc

Now it is time for us to run our first `psmc` analysis. Let us first look at the options for running psmc. 

In [4]:
!psmc/psmc


Program: psmc (Pairwise SMC Model)
Version: 0.6.5-r67
Contact: <http://hengli.uservoice.com/>

Usage:   psmc [options] input.txt

Options: -p STR      pattern of parameters [4+5*3+4]
         -t FLOAT    maximum 2N0 coalescent time [15]
         -N INT      maximum number of iterations [30]
         -r FLOAT    initial theta/rho ratio [4]
         -c FILE     CpG counts generated by cntcpg [null]
         -o FILE     output file [stdout]
         -i FILE     input parameter file [null]
         -T FLOAT    initial divergence time; -1 to disable [-1]
         -b          bootstrap (input be preprocessed with split_psmcfa)
         -S          simulate sequence
         -d          perform decoding
         -D          print full posterior probabilities



In particular, note the `-p` option, which allows us to specify a pattern of parameters. As we discussed in the lecture, we divide time into discrete bins. The pattern lets us estimate a single parameter for multiple consecutive time bins. For example, the default pattern 4+5\*3+4 splits time into 23 bins (4+15+4), where 1 parameter is estimated for the first 4 bins, 1 parameter is estimated for each of the next 5 triples of bins, and 1 parameter is estimated for the last 4 bins. So the total number of population size parameters we are estimating is 7 (1+5+1). You will also see this in the plots of the effective population size, when we make them.
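
If you want to check the arithmetic for other patterns, the following small Python sketch expands a pattern into the number of bins covered by each parameter. The helper `parse_pattern` is written for this tutorial and is not part of `psmc`. It assumes that a term `a*b` means `a` parameters that each span `b` bins.

```python
# Minimal sketch: expand a psmc -p pattern into bins per parameter.
# parse_pattern is a helper made up for this tutorial, not part of psmc.

def parse_pattern(pattern):
    """Return a list with the number of time bins covered by each parameter."""
    spans = []
    for term in pattern.split("+"):
        if "*" in term:
            count, width = (int(x) for x in term.split("*"))
            spans.extend([width] * count)  # count parameters, width bins each
        else:
            spans.append(int(term))        # one parameter shared by these bins
    return spans

for pattern in ("4+5*3+4", "4+25*2+4+6"):
    spans = parse_pattern(pattern)
    print(pattern, "->", sum(spans), "bins,", len(spans), "parameters")
```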

Let us now run psmc for the first time - we will just use the default values for the options, while still explicitly specifying the parameter pattern. This pattern is quite coarse, but we will try and estimate it again with finer time bins in the next section.  

__Question:__ Can you think of a reason why the most recent and oldest time bins are quite often merged to get only 1 parameter?

__Task (bonus)__: A recent [article](https://www.biorxiv.org/content/10.1101/2024.06.17.599025v1) investigated the effects of the time partitioning parameters on the signal of recent expansion seen in many PSMC analyses. Try and see if using different parameters will help - first keep the same total number of time slices but change the partitioning, and then change the number of time slices as well. Some examples could be - "2+10\*2+1", "6+11\*1+6", "4+20\*2+6", "2+30\*1+2". Try these parameters and see how the output changes - also look at the convergence and the output log file.

In [5]:
!psmc/psmc -p "4+5*3+4" -o s1_1_coarsePattern.psmc s1_1.psmcfa
!psmc/psmc -p "4+5*3+4" -o s2_1_coarsePattern.psmc s2_1.psmcfa
!psmc/psmc -p "4+5*3+4" -o s3_1_coarsePattern.psmc s3_1.psmcfa

Let us take a look at one of the output files, and discuss its contents. 

In [6]:
!tail -34 s1_1_coarsePattern.psmc

//
IT	1327
RD	30
LK	-6015.274045
QD	-0.974007 -> -0.880735
RI	0.0014122029
TR	0.012531	0.011684
MT	27.555660
MM	C_pi: 1.933349, n_recomb: 1204.525381
RS	0	0.000000	98287.905916	0.000353	0.000000	0.000000
RS	1	0.029119	98287.905916	0.000456	0.000000	0.000000
RS	2	0.066717	98287.905916	0.000589	0.000000	0.000000
RS	3	0.115264	98287.905916	0.000760	0.000001	0.000001
RS	4	0.177946	1.648980	57.121265	0.047422	0.037435
RS	5	0.258881	1.648980	69.764081	0.057918	0.055311
RS	6	0.363384	1.648980	83.838296	0.069603	0.069975
RS	7	0.498317	1.694196	96.171145	0.079842	0.079550
RS	8	0.672541	1.694196	110.535319	0.091767	0.092333
RS	9	0.897497	1.694196	122.825500	0.101970	0.104154
RS	10	1.187958	2.119479	106.716113	0.088596	0.091275
RS	11	1.562999	2.119479	112.883875	0.093716	0.095418
RS	12	2.047248	2.119479	112.707953	0.093570	0.094600
RS	13	2.672505	1.582841	131.761145	0.109388	0.110663
RS	14	3.479831	1.582841	96.024355	0.079720	0.081149
RS	15	4.522243	1.582841	59.396654	0.04

Let us now cat the psmc outputs for all 3 samples, and plot them into a pdf. 

In [7]:
!cat s1_1_coarsePattern.psmc s2_1_coarsePattern.psmc s3_1_coarsePattern.psmc > combined_coarsePattern.psmc 

We will use the inbuilt plotting functions in psmc to plot them. 

In [8]:
!psmc/utils/psmc_plot.pl


Usage:   psmc_plot.pl [options] <out.prefix> <in.psmc>

Options: -u FLOAT   absolute mutation rate per nucleotide [2.5e-08]
         -s INT     skip used in data preparation [100]
         -X FLOAT   maximum generations, 0 for auto [0]
         -x FLOAT   minimum generations, 0 for auto [10000]
         -Y FLOAT   maximum popsize, 0 for auto [0]
         -m INT     minimum number of iteration [5]
         -n INT     take n-th iteration (suppress GOF) [20]
         -M titles  multiline mode [null]
         -f STR     font for title, labels and tics [Helvetica,16]
         -g INT     number of years per generation [25]
         -w INT     line width [4]
         -P STR     position of the keys [right top]
         -T STR     figure title [null]
         -N FLOAT   false negative rate [0]
         -S         no scaling
         -L         show the last bin
         -p         convert to PDF (with epstopdf)
         -R         do not remove temporary files
         -G

In [9]:
!psmc/utils/psmc_plot.pl -u 2.5e-08 -s 100 -Y 1 -m 5 -n 30 -p -M "Pop1, Pop2, Pop3" Simulations_coarsePattern combined_coarsePattern.psmc
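
The `-u`, `-s` and `-g` options control how the scaled psmc output is converted to real units. As far as we understand the scaling described in the PSMC README, $N_0 = \theta_0 / (4 \mu s)$, time in generations is $2 N_0 t_k$, and $N_e$ in bin $k$ is $N_0 \lambda_k$, where $\theta_0$ is the first value on the `TR` line and $t_k$ and $\lambda_k$ are the 3rd and 4th columns of the `RS` lines. The Python sketch below applies this to the `TR` line and the `RS 4` line from the output shown earlier. The function `scale_psmc` is a made-up helper, and you should check the result against the plot rather than treat it as authoritative.

```python
# Minimal sketch of the psmc scaling, assuming the formulas stated above.
# scale_psmc is a helper made up for this tutorial, not part of psmc.

def scale_psmc(theta0, t_k, lambda_k, mu=2.5e-8, bin_size=100, gen_time=25):
    """Convert scaled psmc values to N0, time in years and Ne."""
    n0 = theta0 / (4 * mu * bin_size)  # theta0 is per bin of bin_size bp
    years = 2 * n0 * t_k * gen_time    # scaled time -> generations -> years
    ne = n0 * lambda_k                 # relative size -> effective size
    return n0, years, ne

# theta0 from the TR line, t_k and lambda_k from the "RS 4" line above.
n0, years, ne = scale_psmc(theta0=0.012531, t_k=0.177946, lambda_k=1.648980)
print(f"N0 = {n0:.1f}, time = {years:.0f} years, Ne = {ne:.0f}")
```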

Here is my plot (remember to change the path to your own output once you have generated your own images) - ![Estimated population sizes](images/Simulation_coarsePattern.png).

__Question:__ Can you see the differences in the population sizes? Do they make sense to you? Do they match up with the simulation scheme?

Now that we know how to run psmc, and plot the output, run the psmc for the same samples, but with a much finer time pattern of "4+25\*2+4+6", and plot the output. 

__Question:__ What would you expect the output to be?

__Question:__ Does the output plot from the finer time pattern match the coarser time pattern? Which one would you prefer?

__Question:__ Why would you not always use a finer time partition?


# 4. PSMC on real world samples

We are now going to shift gears to estimate the effective population sizes from 2 samples from the 1000 genomes consortium. We will only be using data from chromosome 1 due to time constraints. Further, I have already run the commands to generate the psmcfa files using our custom script.

In [10]:
!echo scripts/vcf2psmcfa.py data/1kg_2samps_chr1_noindels.recode.vcf.gz NA12718
!echo scripts/vcf2psmcfa.py data/1kg_2samps_chr1_noindels.recode.vcf.gz NA19471

scripts/vcf2psmcfa.py data/1kg_2samps_chr1_noindels.recode.vcf.gz NA12718
scripts/vcf2psmcfa.py data/1kg_2samps_chr1_noindels.recode.vcf.gz NA19471


__Task:__ The psmcfa files are called NA12718_chr1.psmcfa and NA19471_chr1.psmcfa. Using the same sets of commands we used in the previous section, run psmc on these 2 samples using the pattern "4+25\*2+4+6". Note that this will take about 10 minutes to run.

If you are short on time, or are tired of waiting, I have the psmc files for these samples already generated. They are called NA12718_chr1.psmc and NA19471_chr1.psmc. Use the plotting command - you might have to remove the -Y option from the command from the last section - to plot these. 

Here is my output from psmc runs on these 2 samples (again remember to change the file paths to your own outputs when you generate them). ![PSMC 1000 genomes](images/1kg_chr1.png)

__Question:__ Interpret these effective population size plots for a European and an African population.

__Question:__ Connect the recent high population size of African populations to what you know about the genetic variation seen in Africa. 

## Effect of low coverage

In the last section, we will explore the effects of low coverage on effective population size estimation. First, let me ask you a question.

__Question:__ What will happen to genotype calling with decreasing coverage, for a single individual? Think about which kind of genotypes will be hard to call? Homozygote or heterozygote?

I have run psmc 2 extra times for the sample NA12718, where I intentionally mask 10% and 20% of the heterozygote sites to mimic the effects of low coverage. 
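
One way such masking could be done is sketched below in Python. This is not the procedure used to make the plots in this section. It assumes that "masking" a heterozygous site means replacing its genotype with a homozygous reference call, which mimics a heterozygote being miscalled as a homozygote at low coverage. The function name `mask_heterozygotes` and the toy genotypes are made up.

```python
import random

# Minimal sketch of masking heterozygotes; NOT the procedure used for the plots.
# Assumption: a masked heterozygote is replaced by a homozygous reference call.

def mask_heterozygotes(genotypes, fraction, seed=0):
    """Return a copy of genotypes with a fraction of the het calls set to 0|0."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    het_idx = [i for i, gt in enumerate(genotypes) if gt in ("0|1", "1|0")]
    n_mask = round(fraction * len(het_idx))
    masked = list(genotypes)
    for i in rng.sample(het_idx, n_mask):  # choose hets without replacement
        masked[i] = "0|0"
    return masked

# Toy data: 10 heterozygous and 10 homozygous sites.
genotypes = ["0|1"] * 10 + ["0|0"] * 5 + ["1|1"] * 5
masked = mask_heterozygotes(genotypes, fraction=0.2)
n_het = sum(gt in ("0|1", "1|0") for gt in masked)
print("heterozygous sites left:", n_het)
```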

__Question:__ What do you expect will happen to the effective population size estimates?

Here is my psmc plot for these three scenarios. ![PSMC missingness](images/NA12718.png)


__Question:__ Can you explain the reduction in the $N_e$ with increasing missingness of heterozygotes?

## Estimating uncertainty in $N_e$ estimates

Due to time constraints, we will not delve into details here, but one can estimate the uncertainty in the $N_e$ estimates from psmc using bootstrapping, where the genome is broken into chunks and resampled. An example of bootstrapped estimates: ![Bootstrap](images/bootstrap.png)
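
The resampling idea itself can be illustrated with a short Python sketch. In practice you would use the `-b` option of `psmc` on input preprocessed as described in the usage message shown earlier, so treat this only as a picture of what "broken into chunks and resampled" means. The function name `bootstrap_chunks` and the toy sequence are made up.

```python
import random

# Minimal sketch of chunk resampling; in practice psmc -b does the bootstrapping.

def bootstrap_chunks(seq, chunk_size, seed=0):
    """Split seq into chunks and resample them with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    chunks = [seq[i:i + chunk_size] for i in range(0, len(seq), chunk_size)]
    # Draw as many chunks as we started with; a chunk can be drawn repeatedly.
    return [rng.choice(chunks) for _ in chunks]

# Toy psmcfa-like sequence of 500 windows, split into 5 chunks of 100 windows.
seq = "TTTTKTTTTT" * 50
replicate = bootstrap_chunks(seq, chunk_size=100)
print(len(replicate), "chunks in this bootstrap replicate")
```

Running psmc on many such replicates and overlaying the resulting curves gives a spread like the one in the figure.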

__Question:__ Why do the most recent time periods have the highest variance in $N_e$ estimates?

## Bonus task: Wildebeest demography

If you have reached here already, then here is the final challenge. Here is the bcf file for wildebeests (chr 27) `/davidData/users/thomas/workshop/CTauTzS_8872.chr27.filtered.bcf.gz`.

First convert it to a vcf file using `bcftools view`, then use the pipeline from the rest of the tutorial to run PSMC, and see if you can estimate the $N_e$ for them.