# lh3/psmc

Implementation of the Pairwise Sequentially Markovian Coalescent (PSMC) model
C TeX Perl Lua JavaScript Makefile
 Failed to load latest commit information. utils Jan 21, 2016 .gitignore Apr 29, 2015 LICENSE.txt Apr 29, 2015 Makefile Apr 29, 2015 README Apr 29, 2015 aux.c Apr 29, 2015 cli.c Apr 29, 2015 core.c Apr 29, 2015 em.c Jan 22, 2011 khmm.c Jan 23, 2011 khmm.h Jan 23, 2011 kmin.c Jan 23, 2011 kmin.h Jan 23, 2011 kseq.h Jan 23, 2011 main.c Jul 13, 2011 psmc.bib May 16, 2011 psmc.h Apr 29, 2015 psmc.tex Jul 13, 2011

This software package infers population size history from a diploid sequence
using the Pairwise Sequentially Markovian Coalescent (PSMC) model. The
detailed model is described in file psmc.tex'.

To compile the binaries, you may run

make; (cd utils; make)

After that, you may try

utils/fq2psmcfa -q20 diploid.fq.gz > diploid.psmcfa
psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o diploid.psmc diploid.psmcfa
utils/psmc2history.pl diploid.psmc | utils/history2ms.pl > ms-cmd.sh
utils/psmc_plot.pl diploid diploid.psmc

where diploid.fq.gz' is typically the whole-genome diploid consensus sequence
of one human individual, which can be generated by, for example:

samtools mpileup -C50 -uf ref.fa aln.bam | bcftools view -c - \
| vcfutils.pl vcf2fq -d 10 -D 100 | gzip > diploid.fq.gz

Here option -d sets and minimum read depth and -D sets the maximum. It is
recommended to set -d to a third of the average depth and -D to twice.  Program
fq2psmcfa' transforms the consensus sequence into a fasta-like format where
the i-th character in the output sequence indicates whether there is at least
one heterozygote in the bin [100i, 100i+100).

Program psmc' infers the population size history. In particular, the -p'
option specifies that there are 64 atomic time intervals and 28 (=1+25+1+1)
free interval parameters. The first parameter spans the first 4 atomic time
intervals, each of the next 25 parameters spans 2 intervals, the 27th spans 4
intervals and the last parameter spans the last 6 time intervals. The -p' and
-t' options are manually chosen such that after 20 rounds of iterations, at
least ~10 recombinations are inferred to occur in the intervals each parameter
spans. Impropriate settings may lead to overfitting. The command line in the
example above has been shown to be suitable for modern humans.

The psmc' program infers the scaled mutation rate, the recombination rate and
the free population size parameters. All these parameters are scaled to 2N0. You
may run psmc2history.pl' combined with history2ms.pl' to generate the ms
command line that simulates the history inferred by PSMC, or visualize the result
with psmc_plot.pl'.

To perform bootstrapping, one has to run splitfa first to split long chromosome
sequences to shorter segments. When the -b' option is applied, psmc will then
randomly sample with replacement from these segments. As an example, the
following command lines perform 100 rounds of bootstrapping:

utils/fq2psmcfa -q20 diploid.fq.gz > diploid.psmcfa
utils/splitfa diploid.psmcfa > split.psmcfa
psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o diploid.psmc diploid.psmcfa
seq 100 | xargs -i echo psmc -N25 -t15 -r5 -b -p "4+25*2+4+6" \
-o round-{}.psmc split.fa | sh
cat diploid.psmc round-*.psmc > combined.psmc
utils/psmc_plot.pl -pY50000 combined combined.psmc

One probably wants to modify the "xargs" command-line to parallelize PSMC.

You do not need to register unless you also want to modify your own questions.
You may also post comments at github (if you have a github account). I want to
make the question and the answer public such that others can see them and I do
not need to answer the same question multiple times. Thank you for using PSMC.

APPENDIX I: Scaling the PSMC output
===================================

The PSMC output is scaled to the 2N_0. There are two ways of rescaling the time
and the popuation size more meaningfully.

Firstly, suppose we know the per-site per-generation mutation rate \mu, we can
compute N_0 as:

N_0 = \theta_0 / (4\mu) / s

where \theta_0 is given at the 2nd column of "TR" lines, and s is the bin size
we use for generating the PSMC input. Knowing N_0, we can scale time to
generations and relative population size to effective size by

T_k = 2N_0 * t_k
N_k = N_0 * \lambda_k

where t_k and \lambda_k are given at the 3rd and 4th columns of "RS" lines,
respectively.

A problem with the above strategy is that we do not know a definite answer of
\mu and in fact it various with regions and mutation types. An alternative way
is to use per-site pairwise sequence divergence to represent time:

d_k = 2\mu * T_k = t_k * \theta_0 / s

and use scaled mutation rate to represent population size:

\theta_k = 4N_k * \mu = \lambda_k * \theta_0 / s

where, again, t_k and \lambda_k are given at the "RS" line, \theta_0 at the
"TR" line and s is the bin size, which defaults to 100 in fq2psmcfa.

APPENDIX II: Correcting for low coverage
========================================

For diploid genomes sequenced to low coverage, heterozygotes will be randomly
lost due to the lack of coverage of both alleles. This has the same effect as
smaller mutation rate and can be corrected. If you know the fraction of hets
missed due to low coverage, you can generate the PSMC plot with:

psmc_plot.pl -M "sample1=0.1,sample2=0.2" prefix sample1.psmc sample2.psmc

This says that sample1 has 10% false negative rate (FNR) on hets and sample2
has 20%. The plotting script does not correct FNR for bootstrapping. If you
want to plot the result with your own scripts, you can increase \theta_0 to
\theta_0/(1-FNR).

Unfortunately, I haven't found a reliable way to estimate the background FNR
relevant to PSMC. The simple and unscientific approach is to align the PSMC
curves by eye. Probably a better solution is to downsample a high-coverage
sample to a certain coverage and measures FNR. I have not done this.

Not correcting for low coverage is the most common pitfall when using PSMC.