sowhat automates the SOWH phylogenetic topology test, which uses parametric bootstrapping and is described by the manuscripts listed in FURTHER READING. It works on amino acid, nucleotide, and binary character state datasets.
A peer-reviewed manuscript describing sowhat is available at Systematic Biology: http://sysbio.oxfordjournals.org/content/early/2015/07/30/sysbio.syv055.abstract
sowhat includes several features that provide flexibility and aid in the interpretation and assessment of SOWH test results, including:
- The test can be performed with the adjustment suggested by Susko 2014 (http://dx.doi.org/10.1093/molbev/msu039), which is the default behavior, or as originally described.
- Partitions, including partitions by codon position, can be used.
- Gaps are propagated from the original dataset to the simulated dataset.
- Likelihood searches can be performed with RAxML or GARLI.
- Bootstrap replicate datasets can be simulated with Seq-Gen or PhyloBayes.
- Different models can be used for simulation and inference.
- Confidence intervals are estimated for the p-value, which helps the investigator assess if a sufficient number of bootstrap replicates have been sampled.
- Variability in the maximum likelihood searches can optionally be accounted for by estimating the test statistic and parameters for each new alignment.
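As a rough illustration of why a confidence interval on the p-value helps judge whether enough replicates were sampled, the sketch below computes a Wilson score interval for a proportion. This is an assumption made for illustration: the interval sowhat itself reports may be computed differently, and the function name and replicate counts are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical outcomes: the same observed proportion (3%) from 100 and 1000 replicates.
low_100, high_100 = wilson_interval(3, 100)
low_1000, high_1000 = wilson_interval(30, 1000)
print(f"100 replicates:  {low_100:.3f} to {high_100:.3f}")
print(f"1000 replicates: {low_1000:.3f} to {high_1000:.3f}")
```

With the smaller sample the interval spans a 0.05 threshold, so more replicates would be needed before drawing a conclusion at that level; with the larger sample it does not.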
sowhat is in active development. Please use with caution. We appreciate hearing about your experience with the program via the issue tracker.
Download sowhat from https://github.com/josephryan/sowhat (click the "Download ZIP" button at the bottom of the right column).
To install sowhat and documentation, type the following:
    perl Makefile.PL
    make
    make test
    sudo make install
To install without root privileges try:
    perl Makefile.PL PREFIX=/home/myuser/scripts
    make
    make test
    make install
We have tested sowhat on OS X 10.9, OS X 10.10, Ubuntu Server 10.04 (Amazon ami-d05e75b8), and Ubuntu Desktop 15.04. It will likely work on a variety of other Unix-like operating systems.
The dependencies listed below are required by sowhat. They must be installed and available in the appropriate PATH. If they are not installed already, follow the installation instructions in the links provided for each tool. We have tested sowhat with the indicated dependency versions. Other versions may be incompatible and should be used with caution. These external tools are the result of a considerable amount of work by other investigators; please also cite them when you cite sowhat.
General system tools:
- Perl, which comes with most operating systems
- The Statistics::R Perl module.
  Statistics::R has additional requirements, as described at http://search.cpan.org/dist/Statistics-R/README. Use the local::lib option to install without sudo. See the bootstrap method described at http://search.cpan.org/~haarg/local-lib-2.000004/lib/local/lib.pm for installation information. Once local::lib and R have been installed, install the Statistics::R package as you would normally. The use local::lib option must be activated in the program as well.
- The IPC::Run Perl module (optional) is currently needed for make test to work correctly.
To use more complex models for simulation, you will need to install the following optional dependency:
To print results to a json file, you will need to install the following optional dependency:
- The JSON Perl module.
You can install SOWHAT and all the required dependencies listed above on a clean Ubuntu 15.04
machine with the following commands (executables will be placed in
    sudo apt-get update
    sudo apt-get install -y r-base-core cpanminus unzip gcc git
    sudo cpanm Statistics::R
    sudo cpanm JSON
    sudo Rscript -e "install.packages('ape', dependencies = T, repos='http://cran.rstudio.com/')"
    cd ~
    git clone https://github.com/josephryan/sowhat.git
    cd sowhat/
    # To work on the development branch (not recommended) execute: git checkout -b Development origin/Development
    sudo ./build_3rd_party.sh
    perl Makefile.PL
    make
    make test
    sudo make install
build_3rd_party.sh installs some dependencies from versions that are cached in
this repository. They may be out of date.
Several test datasets are provided in the
examples/ directory. To run example analyses
on these datasets, execute:
See examples.sh and the resulting test.output/ directory for more on the specifics of sowhat.
Warning: Some of the examples take time (especially those that use Garli). For a quick example run make test and see the output in the test.output directory.
GETTING STARTED WITH YOUR OWN ANALYSES
1. Alignment (DNA, AA, or binary characters)
Format: non-interleaved PHYLIP format
Description: This can be DNA, amino acid, or binary characters. Often, you would have performed phylogenetic analyses on this alignment and recovered a result that was in conflict with an a priori hypothesis. You will specify the a priori hypothesis in a constraint tree (next section).
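As a minimal sketch of what a non-interleaved alignment looks like, the function below writes a dictionary of aligned sequences as a header line (number of taxa and number of sites) followed by one line per taxon with its full sequence. The function name and the toy sequences are hypothetical, and the sketch assumes the relaxed name format (a name without whitespace, a space, then the sequence) rather than strict 10-character names.

```python
def to_sequential_phylip(seqs):
    """Return a non-interleaved PHYLIP string for a dict of name -> aligned sequence."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences in an alignment must have the same length")
    n_sites = lengths.pop()
    lines = [f"{len(seqs)} {n_sites}"]
    for name, seq in seqs.items():
        if any(c.isspace() for c in name):
            raise ValueError(f"taxon name contains whitespace: {name!r}")
        # Each taxon occupies exactly one line: the whole sequence is never split.
        lines.append(f"{name} {seq}")
    return "\n".join(lines) + "\n"

# Hypothetical toy alignment of three taxa and eight sites.
alignment = {"A": "ACGTACGT", "B": "ACGTAC-T", "C": "ACGAACGT"}
phylip_text = to_sequential_phylip(alignment)
print(phylip_text, end="")
```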
2. Constraint tree
Format: Newick format
The constraint tree represents a hypothesis that you would like to compare to the ML tree or some alternative hypothesis. In most cases you will want a tree that is mostly unresolved but includes the clade being tested. For example, if your ML tree showed a sister relationship between two taxa 'A' and 'B' and you want to compare this result to a topology with a sister relationship between 'A' and 'C', you would create the following constraint tree:
    ((A,C),B,D,E,F);
Note that B, D, E, and F are unresolved.
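For alignments with many taxa, a constraint tree of this shape can be generated rather than typed by hand. The sketch below is a hypothetical helper (not part of sowhat) that places the clade being tested inside one pair of parentheses and leaves every other taxon unresolved at the root.

```python
def constraint_newick(clade, others):
    """Build a Newick constraint tree with one resolved clade and all other taxa unresolved."""
    if len(clade) < 2:
        raise ValueError("the constrained clade needs at least two taxa")
    if not others:
        raise ValueError("at least one taxon must fall outside the clade")
    overlap = set(clade) & set(others)
    if overlap:
        raise ValueError(f"taxa listed both inside and outside the clade: {sorted(overlap)}")
    return "((" + ",".join(clade) + ")," + ",".join(others) + ");"

tree = constraint_newick(["A", "C"], ["B", "D", "E", "F"])
print(tree)  # ((A,C),B,D,E,F);
```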
3. RAxML model
The only other required parameter when using RAxML is --raxml_model.
This option can specify any of the models that are available to RAxML. Running sowhat with the option --raxml_model=available will provide a list of all possible models.
Other RAxML parameters (including number of threads) can be specified with the option:
--rax='/usr/local/bin/raxmlHPC-PTHREADS -T 20'
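Because the value passed to --rax contains spaces, quoting matters when the command is assembled by a script. The sketch below builds an argument list using only option names that appear in this document; the file names, run name, output directory, and the helper function itself are hypothetical, and GTRGAMMA is used as an example RAxML model.

```python
import shlex

def sowhat_command(constraint, aln, name, out_dir, raxml_model, rax=None):
    """Assemble a sowhat argument list for a RAxML-based run."""
    args = [
        "sowhat",
        f"--constraint={constraint}",
        f"--aln={aln}",
        f"--name={name}",
        f"--dir={out_dir}",
        f"--raxml_model={raxml_model}",
    ]
    if rax is not None:
        # Kept as a single argument so the embedded spaces survive.
        args.append(f"--rax={rax}")
    return args

cmd = sowhat_command(
    constraint="constraint.tre",      # hypothetical file name
    aln="alignment.phy",              # hypothetical file name
    name="my_test",                   # hypothetical run name
    out_dir="sowhat_output",          # hypothetical output directory
    raxml_model="GTRGAMMA",
    rax="/usr/local/bin/raxmlHPC-PTHREADS -T 20",
)
print(shlex.join(cmd))  # shell-safe rendering of the command line
```

The list form can be handed directly to subprocess.run(cmd), which avoids shell quoting altogether.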
4. For using GARLI instead of RAxML
RAxML is much faster than GARLI and can use multiple processors, but GARLI has more available models. To use GARLI, you need to provide the option --usegarli.
Example Garli configuration files are available (examples/garli.conf and examples/aa.garli.conf). For an in-depth explanation of all of the options, see the Garli manual available from: http://www.bio.utexas.edu/faculty/antisense/garli/garli.html
The nucleotide model specified in examples/garli.conf is GTR+GAMMA. The amino acid model specified in examples/aa.garli.conf is WAG+GAMMA. To change either model, adjust the following parameters in the corresponding configuration file:
_ratematrix_, _statefrequencies_, _ratehetmode_, _numratecats_, _invariantsites_
We highly recommend not adjusting other parameters in the garli conf files as this could cause sowhat to fail.
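One way to follow this recommendation is to compare an edited configuration file against the example it was copied from and flag any changed setting outside the five model parameters. The sketch below assumes the usual key = value layout of GARLI configuration files; the two configuration fragments are hypothetical and heavily shortened.

```python
MODEL_KEYS = {"ratematrix", "statefrequencies", "ratehetmode",
              "numratecats", "invariantsites"}

def parse_conf(text):
    """Collect key = value settings, skipping blank lines, comments, and [section] headers."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("[", "#")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

def unexpected_changes(reference, edited):
    """Return the sorted names of changed settings that are not model parameters."""
    ref, new = parse_conf(reference), parse_conf(edited)
    changed = {k for k in ref.keys() | new.keys() if ref.get(k) != new.get(k)}
    return sorted(changed - MODEL_KEYS)

# Hypothetical, shortened configuration fragments.
reference_conf = """[general]
searchreps = 2
[model1]
ratematrix = 6rate
invariantsites = none
"""
edited_conf = """[general]
searchreps = 5
[model1]
ratematrix = 6rate
invariantsites = estimate
"""
flagged = unexpected_changes(reference_conf, edited_conf)
print(flagged)  # ['searchreps']
```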
See examples.sh for examples of sowhat command lines.
Examining the results
The results of the SOWH test are written to a file called sowhat.results.txt, which can be found in the directory specified with the --dir option in the sowhat command line. The bottom of sowhat.results.txt includes a p-value representing the probability that a test statistic at least as extreme as the empirical one would be observed under the null hypothesis.
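The logic behind such a p-value can be sketched as follows, assuming the test statistic is the difference in log-likelihood between the unconstrained and constrained searches and that the p-value is the fraction of simulated datasets whose statistic is at least as large as the empirical one. This is a simplified illustration with hypothetical numbers, not a description of how sowhat computes the value it reports.

```python
def sowh_p_value(observed_delta, null_deltas):
    """Fraction of null-distribution statistics at least as large as the observed one."""
    if not null_deltas:
        raise ValueError("the null distribution is empty")
    extreme = sum(1 for d in null_deltas if d >= observed_delta)
    return extreme / len(null_deltas)

# Hypothetical log-likelihoods from the empirical alignment.
lnl_unconstrained = -10000.0
lnl_constrained = -10016.5
observed = lnl_unconstrained - lnl_constrained  # 16.5

# Hypothetical statistics from ten simulated alignments.
null = [0.4, 1.2, 3.1, 0.0, 7.9, 2.2, 18.3, 0.9, 5.5, 1.1]
p = sowh_p_value(observed, null)
print(p)  # 0.1
```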
Additional outputs include detailed information on the model used for simulating new alignments in the file sowhat.model.txt, information on the null distribution in sowhat.distribution.txt, and all program files, which are written to the directory sowhat_scratch. Within this directory, the files ending in i.0.0 represent the initial search of the empirical alignment file. Results can be printed to a file sowhat.results.json using the option --json.
MORE COMPLEX SOWHAT OPTIONS
Parametric bootstrap tests rely heavily on the model used for data simulation. sowhat provides a number of options for exploring models and examining the results.
Using PhyloBayes CAT-GTR model
When using sowhat with RAxML, the user can specify that data be simulated using parameters estimated with the CAT-GTR model in a posterior probability framework in PhyloBayes. This model allows more parameters to vary freely. The likelihood scoring will still be performed using the RAxML model specified. The option is --usepb.
Using the maximum parameters
With either RAxML or GARLI, the user can specify that parameter estimation and data simulation be performed using the maximum number of free parameters. For example, in RAxML with nucleotide data, the model GTRGAMMAIX would be used for data simulation, which allows rates, frequencies, alpha value, and invariant sites to all be estimated using likelihood. This option is --max.
Using a specified model
The user can additionally specify a model for data simulation. The format for this model is demonstrated in the files examples/simulation.... For nucleotide data, rates and frequencies must be specified. For amino acid data, a matrix may be provided, or if the GTR model is specified, a rate file can be provided which includes a symmetrical 20 by 20 matrix of amino acid rates. Alpha and invariant site parameters may also be included. This option is --usegenmodel.
The classic Goldman+Susko SOWH test
This test evaluates whether a null hypothesis can be rejected, given the data. Data are simulated using parameters estimated under the null hypothesis. The generating topology uses a polytomy for the conflicting clades of interest, as recommended by Susko 2014. No additional options need be specified.
Testing two trees
This test evaluates whether the data, assuming a specific topology, provides significant support to reject a second topology. Use the options --constraint=... --treetwo=... --resolved
Testing the SOWH test
There are a number of options for verifying the results of the SOWH test. Multiple simultaneous SOWH tests can be performed and the mean and the ratio of the means can be plotted and examined (files ending in .eps in the specified directory). Use the options --runs=(number of tests) --plot
Specifying an evolutionary hypothesis
The user can simulate data under very specific models, including specifying abnormal rate matrices and a specific generating topology. This test evaluates whether the null hypothesis can be rejected, assuming evolution under these conditions. Use the options --usegenmodel=... --usegentree=...
The most thorough approach to parametric bootstrapping is one in which the user changes the model options, evaluates the effects on the resulting p-value, and reports any indication that the null hypothesis cannot be rejected. To accomplish this, the user can perform a series of SOWH tests changing the model, using each of the following options:
- --max, which uses a model with a high number of free parameters to simulate the data
- --rerun, which recalculates the test statistic and parameters each iteration, to marginalize over any effects resulting from the likelihood software failing to find the optimal topology
- --nogaps, which simulates data without any gaps; more information will be present in simulated data than in empirical data, which can affect the null distribution
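The effect of gaps on simulated data can be pictured with a small masking step: wherever the empirical alignment has a gap, the same position in the simulated alignment is replaced by that gap character, so the simulated data carry no more information than the empirical data. The sketch below is a hypothetical illustration of that idea, not sowhat's implementation; treating both '-' and '?' as gap characters is an assumption.

```python
def propagate_gaps(empirical, simulated, gap_chars="-?"):
    """Copy gap characters from the empirical alignment onto a simulated alignment."""
    masked = {}
    for name, emp_seq in empirical.items():
        sim_seq = simulated[name]
        if len(sim_seq) != len(emp_seq):
            raise ValueError(f"length mismatch for taxon {name!r}")
        # Keep the simulated character unless the empirical site is a gap.
        masked[name] = "".join(
            e if e in gap_chars else s for e, s in zip(emp_seq, sim_seq)
        )
    return masked

# Hypothetical toy alignments with matching taxa and lengths.
empirical = {"A": "AC-GT", "B": "A??GT"}
simulated = {"A": "ACGGT", "B": "TTTTT"}
masked = propagate_gaps(empirical, simulated)
print(masked)  # {'A': 'AC-GT', 'B': 'T??TT'}
```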
    sowhat \
      --constraint=NEWICK_CONSTRAINT_TREE \
      --aln=PHYLIP_ALIGNMENT \
      --name=NAME_FOR_REPORT \
      --dir=DIR \
      [--debug] \
      [--garli=GARLI_BINARY_OR_PATH_PLUS_OPTIONS] \
      [--garli_conf=PATH_TO_GARLI_CONF_FILE] \
      [--help] \
      [--initial] \
      [--json] \
      [--max] \
      [--raxml_model=MODEL_FOR_RAXML] \
      [--nogaps] \
      [--partition=PARTITION_FILE] \
      [--pb=PB_BINARY_OR_PATH_PLUS_OPTIONS] \
      [--pb_burn=BURNIN_TO_USE_FOR_PB_TREE_SIMULATIONS] \
      [--plot] \
      [--ppred=PPRED_BINARY_OR_PATH_PLUS_OPTIONS] \
      [--rax=RAXML_BINARY_OR_PATH_PLUS_OPTIONS] \
      [--reps=NUMBER_OF_REPLICATES] \
      [--resolved] \
      [--rerun] \
      [--restart] \
      [--runs=NUMBER_OF_TESTS_TO_RUN] \
      [--seqgen=SEQGEN_BINARY_OR_PATH_PLUS_OPTIONS] \
      [--treetwo=NEWICK_ALTERNATIVE_TO_CONST_TREE] \
      [--usepb] \
      [--usegarli] \
      [--usegentree=NEWICK_TREE_FOR_SIMULATING_DATA] \
      [--version]
Extensive documentation is embedded inside of sowhat in POD format and can be viewed by running any of the following:
    sowhat --help
    perldoc sowhat
    man sowhat  # available after installation
A peer-reviewed manuscript describing sowhat is available at Systematic Biology:
Church, Samuel H., Joseph F. Ryan, and Casey W. Dunn. "Automation and Evaluation of the SOWH Test with SOWHAT" Systematic Biology 2015 Nov;64(6):1048-58. doi: 10.1093/sysbio/syv055
Also see the file sowhat.bibtex
Goldman, Nick, Jon P. Anderson, and Allen G. Rodrigo. "Likelihood-based tests of topologies in phylogenetics." Systematic Biology 49.4 (2000): 652-670. doi:10.1080/106351500750049752
Swofford, David L., Gary J. Olsen, Peter J. Waddell, and David M. Hillis. Phylogenetic inference. (1996): 407-514. http://www.sinauer.com/molecular-systematics.html
COPYRIGHT AND LICENSE
Copyright (C) 2015 Samuel H. Church, Joseph F. Ryan, Casey W. Dunn
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program in the file LICENSE. If not, see http://www.gnu.org/licenses/.