# 1. Resources and setup
- [`PSMC`](https://github.com/lh3/psmc)
- [`VCF file format`](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
- [`vcftools`](https://vcftools.github.io/index.html)
- [`1000 genomes high coverage variant calls`](https://www.internationalgenome.org/data-portal/sample)


Let us start by defining some variables and setting up the environment before we run the analyses.

In [None]:
## Set up location for the software
TOOLS_PATH=/course/popgen25/software
DATA_PATH=/course/popgen25/demography/data
SCRIPTS_PATH=/course/popgen25/demography/scripts

## Set up vars for executables
PSMC=${TOOLS_PATH}/psmc/psmc
VCF2PSMCFA=${SCRIPTS_PATH}/vcf2psmcfa.py
PSMC_PLOT=${TOOLS_PATH}/psmc/utils/psmc_plot.pl

Now, let us set up the directories for running the demography analyses.

In [None]:
## Make directory
cd ~
mkdir -p demography
cd demography

In [None]:
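import os
## Use an absolute path so this works regardless of the notebook's starting directory
os.chdir(os.path.expanduser("~/demography"))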

For convenience, this tutorial is presented as a Python notebook, but almost all of its commands are run in bash.

We will use the `psmc` tool to estimate the effective population size for our data. Instead of starting directly from bam files, we will start from a variant call format (vcf) file. In this tutorial, we will first look at a vcf file, then see how a vcf file is converted to an input for `psmc` using a custom python script, and finally run `psmc` on simulated and real data to see how well we can estimate the effective population size $N_e$. Note that, unlike many of the previous tools you used in this course, `psmc` relies on high coverage data where genotypes can be called accurately, which is one of the reasons we work with vcf files in these exercises.

# 2. Data
In this tutorial, we will use 2 sets of data. First, we will use samples from 3 simulated populations: one with constant population size, one with population growth, and one with population decline (see figure later in the tutorial). In the second half of the exercises, we will use data from the 1000 genomes project, namely 2 individuals, NA12718 - a CEU female sample (Northern European ancestry), and NA19471 - a female Luhya sample from Kenya, to compare and contrast how the effective population sizes vary between these two populations. Finally, if you want to play with some data on your own, you can use the data from wildebeest to recreate their population history.

Before we begin, let us take a look at the vcf file format using the simulated data with three individuals, one from each simulated population. Note that all the vcf files are zipped, so we will use `zcat` to display them.

In [None]:
zcat ${DATA_PATH}/threeinds.recode.vcf.gz | head -10

All the lines starting with `##` are information header lines which describe the VCF file. The line beginning with `#CHROM` is the header line describing the contents of the subsequent data lines. Each data line contains information on a single variant. As you can also see, each of the individuals s1_1, s2_1 and s3_1 has a column with the genotype for that individual at each variant. Here "0|0" is homozygous for the reference (REF) allele, "0|1" (or "1|0") is heterozygous, and "1|1" is homozygous for the alternate (ALT) allele.
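
To make the genotype encoding concrete, here is a minimal Python sketch that reads a few VCF-style lines and classifies each genotype. The data lines (positions and alleles) are made up for illustration and are not taken from the course file, and the function names are our own. The sketch assumes diploid genotypes with `GT` as the first FORMAT field.

```python
# Hypothetical tab-separated VCF lines shaped like the output above;
# the positions and alleles are invented for illustration.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1_1\ts2_1\ts3_1",
    "1\t1042\t.\tA\tT\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1",
    "1\t2310\t.\tG\tC\t.\tPASS\t.\tGT\t1|0\t0|0\t0|1",
]

def classify_genotype(gt):
    """Return 'hom_ref', 'het', 'hom_alt' or 'missing' for a diploid GT string."""
    alleles = gt.replace("/", "|").split("|")
    if "." in alleles:
        return "missing"
    if alleles[0] != alleles[1]:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"

def genotypes_by_sample(lines):
    """Collect (position, genotype class) pairs for every sample column."""
    samples, calls = [], {}
    for line in lines:
        if line.startswith("##"):
            continue  # information header lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            calls = {sample: [] for sample in samples}
            continue
        for sample, value in zip(samples, fields[9:]):
            gt = value.split(":")[0]  # assumes GT is the first FORMAT field
            calls[sample].append((int(fields[1]), classify_genotype(gt)))
    return calls

calls = genotypes_by_sample(vcf_lines)
for sample, sample_calls in calls.items():
    print(sample, sample_calls)
```

Only the heterozygous calls matter for `psmc`, which is why the conversion step later in the tutorial keeps track of them per sample.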

We will use the vcf files to generate input files for our `psmc` analysis. Note that this is __NOT__ the preferred method to generate input files for `psmc`. The preferred method is to generate the input files directly from bam files, and that information can be found [here](https://github.com/lh3/psmc). 

__Question:__ Why do you think vcf files are not the preferred format? What information do you think is missing?

# 3. Working with simulated data

In this first exercise, we will look at data simulated under demographic histories that look like this.

In [None]:
## We are using python to see the images - we will do the same later for our PSMC results.
from matplotlib import pyplot as plt
from matplotlib import image as mpimg
fig, ax = plt.subplots(figsize=(5, 10))
ax.axis('off')
image = mpimg.imread("/course/popgen25/demography/images/popsize.png")
ax.imshow(image)

We will use `psmc` to see whether we can faithfully reconstruct these population sizes. First, let us construct a psmcfa file, the fasta-like input file for psmc, from the vcf file.

## psmcfa file

A psmcfa file is a fasta representation of the genome (or part of a genome), where each entry (letter) indicates whether a heterozygous variant exists in a fixed-size window. Let us try and understand this by converting our vcf file to a psmcfa file. We will use the custom script `vcf2psmcfa.py` to do this. Note that this script expects the vcf file to contain only 1 "chromosome", which is not the case for a general vcf file. The script takes 2 input parameters - the vcf file name and the name of the sample. The output file name is derived from the name of the sample. The script uses a fixed window size of 100 bp.

In [None]:
$VCF2PSMCFA ${DATA_PATH}/threeinds.recode.vcf.gz s1_1
$VCF2PSMCFA ${DATA_PATH}/threeinds.recode.vcf.gz s2_1
$VCF2PSMCFA ${DATA_PATH}/threeinds.recode.vcf.gz s3_1

Let us take a quick look at one of the psmcfa files. 

In [None]:
head s2_1.psmcfa

It appears to be a regular fasta file, with only "T" and "K". Here a "T" represents a 100 bp window without any heterozygous sites in it, whereas a "K" represents a 100 bp window with at least 1 heterozygous site in it. 
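
The windowing idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions (1-based positions, no missing data), not the course's `vcf2psmcfa.py` script; the function name and the heterozygous positions are hypothetical.

```python
def encode_windows(seq_length, het_positions, window=100):
    """Encode a sequence as one letter per window: 'K' if the window holds
    at least one heterozygous site, 'T' otherwise. Positions are 1-based."""
    n_windows = (seq_length + window - 1) // window  # round up to cover the tail
    letters = ["T"] * n_windows
    for pos in het_positions:
        letters[(pos - 1) // window] = "K"
    return "".join(letters)

# Hypothetical heterozygous positions on a 1,000 bp sequence.
encoded = encode_windows(1000, [57, 63, 412, 900])
print(encoded)
print("heterozygous windows:", encoded.count("K"), "of", len(encoded))
```

Note that the two sites at positions 57 and 63 fall in the same window and therefore contribute a single "K": the encoding records whether a window contains a heterozygous site, not how many.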

__Question:__ Given the population histories, which of the three samples would you expect to have the highest number of heterozygous windows?

## Running psmc

Now it is time for us to run our first `psmc` analysis. Let us first look at the options for running psmc. 

In [None]:
$PSMC

In particular, note the `-p` option, which specifies the pattern of parameters. As we discussed in the lecture, time is divided into discrete bins, and the pattern lets us estimate a single parameter for several consecutive time bins. For example, the default pattern 4+5\*3+4 splits time into 23 bins (4+15+4): 1 parameter is estimated for the first 4 bins, 1 parameter for each of the next 5 triples of bins, and 1 parameter for the last 4 bins. So the total number of parameters we are estimating is 7 (1+5+1). You will also see this in the plots of the effective population size.
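
The bookkeeping for a pattern can be checked with a short Python sketch. The helper name `parse_pattern` is our own and is not part of `psmc`; it only expands the pattern string as described above.

```python
def parse_pattern(pattern):
    """Expand a psmc -p pattern into a list with one entry per free parameter,
    giving the number of time bins that parameter spans."""
    spans = []
    for term in pattern.split("+"):
        if "*" in term:
            repeats, width = term.split("*")
            spans.extend([int(width)] * int(repeats))  # e.g. 5*3 -> five spans of 3 bins
        else:
            spans.append(int(term))
    return spans

for pattern in ["4+5*3+4", "4+25*2+4+6"]:
    spans = parse_pattern(pattern)
    print(pattern, "-> bins:", sum(spans), "parameters:", len(spans))
```

For the default pattern this gives 23 bins and 7 parameters, matching the count above. The finer pattern "4+25*2+4+6" used later in the tutorial expands to 64 bins and 28 parameters.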

Let us now run psmc for the first time - we will just use the default values for the options, while still explicitly specifying the parameter pattern. This pattern is quite coarse, but we will try and estimate it again with finer time bins in the next section.

__Question:__ Can you think of a reason why the most recent and oldest time bins are quite often merged to get only 1 parameter? Maybe you can answer it better after seeing the output.

In [None]:
$PSMC -p "4+5*3+4" -o s1_1_coarsePattern.psmc s1_1.psmcfa
$PSMC -p "4+5*3+4" -o s2_1_coarsePattern.psmc s2_1.psmcfa
$PSMC -p "4+5*3+4" -o s3_1_coarsePattern.psmc s3_1.psmcfa

Let us take a look at one of the output files, and discuss its contents. 

In [None]:
tail -34 s1_1_coarsePattern.psmc

Let us now cat the psmc outputs for all 3 samples, and plot them into a pdf. 

In [None]:
cat s1_1_coarsePattern.psmc s2_1_coarsePattern.psmc s3_1_coarsePattern.psmc > combined_coarsePattern.psmc 

We will use the inbuilt plotting functions in psmc to plot them. 

In [None]:
$PSMC_PLOT

In [None]:
$PSMC_PLOT -u 2.5e-08 -s 100 -Y 1 -m 5 -n 30 -p -M "Pop1, Pop2, Pop3" Simulations_coarsePattern combined_coarsePattern.psmc

Let us take a look at the plot. First we will convert the pdf to a png for easier viewing in python.

In [None]:
#We convert the pdf file to a png file. pdftoppm adds a -1.png to the output file name
pdftoppm -png Simulations_coarsePattern.pdf Simulations_coarsePattern

In [None]:
plt.axis('off')
image = mpimg.imread("Simulations_coarsePattern-1.png")
plt.imshow(image, aspect="auto")

__Question:__ Can you see the differences in the population sizes? Do they make sense to you? Do they match up with the simulation scheme?
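
The `-u` and `-s` options in the plotting command above are the per-generation mutation rate and the window size used when building the psmcfa file. As far as we understand the scaling described in the PSMC README, they are used to turn the scaled output into generations and effective sizes: $N_0 = \theta_0 / (4 \mu s)$, $T_k = 2 N_0 t_k$ and $N_k = N_0 \lambda_k$. Here is a minimal sketch of that arithmetic. The values of `theta0`, `times` and `lambdas` are made up and are not taken from any of our psmc runs, and the function name is hypothetical.

```python
def scale_psmc(theta0, times, lambdas, mu=2.5e-08, window=100):
    """Convert scaled psmc estimates to generations and effective sizes."""
    n0 = theta0 / (4 * mu * window)            # baseline effective size N_0
    generations = [2 * n0 * t for t in times]  # T_k = 2 * N_0 * t_k
    sizes = [n0 * lam for lam in lambdas]      # N_k = N_0 * lambda_k
    return n0, generations, sizes

# Hypothetical values, not taken from any real psmc output.
theta0 = 0.1
times = [0.0, 0.5, 1.0]
lambdas = [1.0, 2.0, 0.5]

n0, generations, sizes = scale_psmc(theta0, times, lambdas)
print(f"N0 = {n0:.0f}")
for g, n in zip(generations, sizes):
    print(f"{g:.0f} generations ago: Ne = {n:.0f}")
```

This also shows why the choice of mutation rate matters: changing `mu` rescales both axes of the plot without changing the shape of the curve.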

Now that we know how to run psmc and plot the output, run psmc on the same samples, but with a much finer time pattern of "4+25\*2+4+6", and plot the output. 

__Question:__ What would you expect the output to be? 

__Question:__ Does the output plot from the finer time pattern match the coarser time pattern? Which one would you prefer?

__Question:__ When would you not want to use a finer time partition?

__Task (bonus)__: A recent [article](https://www.biorxiv.org/content/10.1101/2024.06.17.599025v1) investigated the effects of the time partitioning parameters on the signal of recent expansion seen in many PSMC analyses. Try and see if using different parameters will help - keep the same number of total time slices, but change the partitioning, and subsequently change the number of time slices as well. Some examples could be - "2+10\*2+1", "6+11\*1+6", "4+20\*2+6", "2+30\*1+2". Try these parameters and see how the output changes - also look at the convergence, and output log file.


# 4. PSMC on real world samples

We are now going to shift gears and estimate the effective population sizes from 2 samples in the 1000 genomes consortium data. We will only be using data from chromosome 1 due to time constraints. Further, I have already run the commands to generate the psmcfa files using our custom script. 

In [None]:
echo $VCF2PSMCFA 1kg_2samps_chr1_noindels.recode.vcf.gz NA12718
echo $VCF2PSMCFA 1kg_2samps_chr1_noindels.recode.vcf.gz NA19471

__Task:__ The psmcfa files are called NA12718_chr1.psmcfa and NA19471_chr1.psmcfa. They are in the `$DATA_PATH` location. Using the same set of commands we used in the previous section, run psmc on these 2 samples using the pattern "4+25\*2+4+6". Note that this will take about 10 minutes to run.

In [None]:
## Write your PSMC commands here.



If you are short on time, or are tired of waiting, I have the psmc files for these samples already generated. They are called NA12718_chr1.psmc and NA19471_chr1.psmc and are present in the `$DATA_PATH` folder. Use the plotting command - you might have to remove the -Y option from the command from the last section - to plot these. 

Here is my output from psmc runs on these 2 samples (again remember to change the file paths to your own outputs when you generate them).

In [None]:
plt.axis('off')
image = mpimg.imread("/course/popgen25/demography/images/1kg_chr1.png")
plt.imshow(image)

__Question:__ Interpret these effective population size plots for a European and an African population. 

__Question:__ Connect the recent high population size of African populations to what you know about the genetic variation seen in Africa. 

## Effect of low coverage

In this section, we will explore the effects of low coverage on effective population size estimation. First, let me ask you a question.

__Question:__ What will happen to genotype calling with decreasing coverage, for a single individual? Think about which kind of genotypes will be hard to call? Homozygote or heterozygote?

I have run psmc 2 extra times for the sample NA12718, after intentionally masking 10% and 20% of the heterozygous sites to mimic the effects of low coverage. The input files - called NA12718_chr1_10pctHetMiss.psmcfa and NA12718_chr1_20pctHetMiss.psmcfa - are in the `$DATA_PATH` folder. You can run these with `psmc` if you want.
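
To give a feel for what such masking does to the input, here is a minimal Python sketch that randomly drops a fraction of heterozygous sites. This is an assumption about how the masking could be done, not necessarily how the files above were produced; the function name and the positions are hypothetical.

```python
import random

def drop_het_sites(het_positions, fraction, seed=0):
    """Randomly remove a fraction of heterozygous sites (hypothetical masking)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_drop = round(len(het_positions) * fraction)
    dropped = set(rng.sample(het_positions, n_drop))
    return [pos for pos in het_positions if pos not in dropped]

# 200 hypothetical heterozygous positions spread along 100 kb.
het_positions = list(range(50, 100_000, 500))
for fraction in (0.0, 0.1, 0.2):
    kept = drop_het_sites(het_positions, fraction)
    print(f"{fraction:.0%} masked: {len(kept)} heterozygous sites remain")
```

A dropped site is treated as if it had been called homozygous, so the windows that contained it may no longer be marked as heterozygous in the psmcfa file.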

__Question:__ What do you expect will happen to the effective population size estimates?

Here is my psmc plot for these three scenarios.

In [None]:
plt.axis('off')
image = mpimg.imread("/course/popgen25/demography/images/NA12718.png")
plt.imshow(image)

__Question:__ Can you explain the reduction in the $N_e$ with increasing missingness of heterozygotes?

## Estimating uncertainty in $N_e$ estimates

Due to time constraints, we will not delve into details here, but one can estimate the uncertainty in the $N_e$ estimates from psmc using bootstrapping, where the genome is broken into chunks and resampled. An example of bootstrapped estimates: 

In [None]:
plt.axis('off')
image = mpimg.imread("/course/popgen25/demography/images/bootstrap.png")
plt.imshow(image)

__Question:__ Why do the most recent time periods have the highest variance in $N_e$ estimates?
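
The resampling step behind the bootstrap described in this section can be sketched as follows. This is a toy illustration on a made-up sequence with hypothetical function names; in a real analysis each resampled sequence would be run through `psmc` again (the PSMC README describes a `splitfa` utility and a `-b` option for this purpose).

```python
import random

def split_into_chunks(sequence, chunk_size):
    """Cut a sequence into consecutive chunks of at most chunk_size letters."""
    return [sequence[i:i + chunk_size] for i in range(0, len(sequence), chunk_size)]

def bootstrap_replicate(chunks, rng):
    """Draw as many chunks as there are in the input, with replacement."""
    return [rng.choice(chunks) for _ in chunks]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
# Hypothetical psmcfa-style sequence of 500 windows, about 20% heterozygous.
sequence = "".join(rng.choice("TTTTK") for _ in range(500))
chunks = split_into_chunks(sequence, chunk_size=50)

replicates = []
for index in range(3):
    resampled = "".join(bootstrap_replicate(chunks, rng))
    replicates.append(resampled)
    het_fraction = resampled.count("K") / len(resampled)
    print(f"replicate {index}: {len(resampled)} windows, het fraction {het_fraction:.3f}")
```

The spread of the $N_e$ curves across replicates is what gives the band of uncertainty around the original estimate.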

## Bonus task: Wildebeest demography

If you have reached here already, then here is the final challenge. Here is the bcf file for wildebeests (chr 27) `CTauTzS_8872.chr27.filtered.bcf.gz`. It is located in the `$DATA_PATH` folder.

First convert it to a vcf file using `bcftools view`, then use the pipeline from the rest of the tutorial to run PSMC, and see if you can estimate the $N_e$ for them. 

In [None]:
# You can put your code below here
