msstats - read data from ms via stdin, calculate common summary statistics
Copyright (C) 2002 Kevin Thornton
msstats is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Comments are welcome.
- Kevin Thornton <email@example.com>
The final release of this software is version 0.3.5, which is compatible with libsequence 1.9.2. No further updates will occur, and the GitHub repository is archived/read-only.
[coalesent simulation] | msstats | gzip > stats.gz
"coalescent simulation is assumed to be any program that prints biallelic marker output in the format of Dick Hudson's ms (http://home.uchicago.edu/~rhudson1)
This format looks like this:
positions: 0.1 0.9
The above is for 2 "SNPs" in a sample of size n = 2
sudo make install
##If dependent libraries are in non-standard locations. For example, "/opt":
CXXFLAGS=-I/opt/include LDFLAGS="$LDFLAGS -L/opt/lib" ./configure
sudo make install
##Installing somewhere other than /usr/local/bin
will result in msstats being in ~/bin.
##Notes on calculations for metapopulations.
If the input data contain ordered samples from multiple demes, you may pass that info to msstats as follows:
-I D n0 n1 ... nD,
where D is the number of demes and n0 is the sample size in the first deme, etc.
When -I is used, summary statistics will be calculated for each deme within each replicate. The columns "rep" and "pop" will tell you which deme in which replicate each statistic corresponds to.
The program will exit with an error if the sum of deme sizes does not equal the input sample size read from STDIN.
When the -I option is used, msstats will report Hudson, Slatkin, and Maddison's Fst statistic. The reference for this statistic is:
Hudson, Slatkin and Maddison (1992) Estimation of levels of gene flow from population data. Genetics 132:583-589
For all pairwise comparisions amongst the D demes, you will see output columns with headers hsmij, which is the Fst statistic calculated between demes i and j. These numbers will be repeated for each deme within each replicate. Thus, to get the actual distribution of Fst for a simulation, you should condition on the line of results for just one population. For example:
ms 30 100 -t 10 -I 3 10 10 10 1 | ./src/msstats -I 3 10 10 10 > output
Then, using R,
##For a single population
There are 21 columns in the output:
- S = the number of "segregating sites", aka mutations. (Watterson, 1975)
- n1 = the number of singletons in the data. This is the number of mutations at frequency 1 and n-1 in the sample.
- next = the number of "external mutations" (sensu Fu) = the number of derived singletons.
- theta = Watterson's estimate of theta
- pi = "sum of site heterogzygosity" = sum of 2pq over the S sites. (Nei, Tajima, others).
- thetaH = Fay and Wu's estimator of theta. Their H statistic is pi - thetaH, hence no column for it.
- Hprime = Zeng et al.'s normalized Fay and Wu's H.
- tajd = Tajima's D
- fulif = Fu and Li's F
- fulid = Fu and Li's D
- fulifs = Fu and Li's F-star
- fulifds = Fu and Li's D-star
- rm = Hudson and Kaplan's Rmin
- rmmg = Myers and Griffiths simplest lowest bound on Rmin
- nhap = Number of distinct haplotypes. Depaulis and Veuille
- hdiv = haplotype diversity. Depaulis and Veuille
- wallB = Jeff Wall's B
- wallQ = Jeff Wall's Q
- rosarf = Ramos-Onsins and Rozas Rf statistic
- rosarf = Ramos-Onsins and Rozas Ru statistic
- zns = Kelly's Zns = average pairwise r-squared.