Sensitivity analysis of RADseq assembly parameter effects on downstream genetic analyses
All code within the
RAPFX v0.1.0 repository is available "AS IS" under a generous GNU license. See the LICENSE file for more information.
If you use scripts from this repository as part of your published research, I require that you cite the repository as follows (also see DOI information below):
- Bagley, J.C. 2017. RAPFX. GitHub repository, Available at: http://github.com/justincbagley/RAPFX.
Alternatively, please provide the following link to this software repository in your manuscript:
The DOI for
RAPFX, via Zenodo, is as follows: . Here are some examples of citing
RAPFX using the DOI:
Bagley, J.C. 2017. RAPFX. GitHub package, Available at: http://doi.org/10.5281/zenodo.890870.
Bagley, J.C. 2017. RAPFX. Zenodo, Available at: http://doi.org/10.5281/zenodo.890870.
This repository is the GitHub home for
RAPFX, a new script pipeline for testing the sensitivity of downstream genetic analyses to varying RADseq assembly parameters in
pyRAD (Eaton 2014) or
ipyrad (Eaton and Overcast 2016), which is currently under development. Molecular ecologists often select an assembly or SNP calls from a single run of an assembler with a fixed set of parameter values (e.g. Phred offset, clustering percentage), and the final manuscript containing the work usually only presents results from that assembly. This is problematic because assembly parameters (1) can have substantial effects on the resulting assembly and (2) can also substantially affect downstream genetic analyses (e.g. population structure inference, phylogeny reconstruction).
RAPFX enters the scene and takes steps towards automating testing of the sensitivity of assemblies and genetic inferences to
ipyrad parameters. Thus,
RAPFX can help users assess the sensitivity of their results by probing assembly parameters, which should make reviewers happy.
RAPFX can also help users select an assembly with more optimal output, for example one that yielded genetic inferences tending towards an average (across assemblies) rather than extreme values (say, for number of genetic clusters in the data, or K).
What can you do with RAPFX?
- Infer maximum-likelihood gene trees in
RAxMLfor all loci produced by a set of n
ipyradassemblies (however, this is not a primary function of the software; for greater flexibility in doing this, users should refer to MAGNET, which is also distributed within PIrANHA)
- Infer population genetic structure (number of clusters) in
fastSTRUCTUREfor each assembly in a set of n assemblies
- Qualitatively assess sensitivity of genetic inferences to
- Quantify sensitivity of resulting (downstream) genetically-based results (number of clusters, gene tree distances) to varying
- Use some available code for parameter importance testing
Given a set of n
pyRAD-inferred assemblies of RADseq data (run by the user a priori over a range of parameter values),
RAPFX automates estimating a maximum-likelihood (ML) gene tree in
RAxML, and population admixture groups in
fastSTRUCTURE, for each assembly. Assembly parameters,
RAxML trees, and best K (cluster/admixture groups) results are then organized, and relationships among them are explored, through a customized analysis in
R. Output includes graphical summaries comparing results from the different assemblies. The following figure, Figure 1, presents a general overview of the
After inferring gene trees in
RAPFX plots graphics of the gene trees from each assembly and compares them in multivariate treespace. This is somewhat crudely done based on Robinson-Foulds distances (dRF, or "RF dist"), which reflect topological differences, and geodesic distances (dG, or "Geodesic dist"), which reflect branch length variation.
Variance-based sensitivity analysis & parameter importance
Code is also provided in
RAPFX for (1) variance-based sensitivity analysis using Sobol' sensitivity indices (Sobol' 1990, 1993; Saltelli et al. 2008) estimated using the Latin hypercube space method of Tissot & Prieur (2015), and (2) parameter relative importance analyses in
R, with quantitative results output to file. For these quantitative sensitivity analyses, it is recommended that users vary the following three assembly parameters in
pyRAD (at least; given with my own arbitrary abbreviations): Phred score offset, pO; minimum depth of coverage, mD; clustering percentage threshold, cP.
Variance-based sensitivity analysis
During variance-based sensitivity analysis, for each model input variable Xi, Sobol' first-order indices reflect the contribution to total output variance of the decomposed variance attributable to Xi. Likewise, Sobol' second-order indices reflect the effect of varying two parameters, Xi and Xj, simultaneously, beyond the effect of the variation of each individual parameter, akin to interaction effects between variables (Saltelli et al. 2008). You can read more about Sobol' indices here, here, and here, and in papers listed under the Recommended Reading section at the bottom of this document.
Example of how to implement sensitivity analysis in RAPFX
In a recent paper that is presently in revision (Bagley et al. in revision), my colleagues and I compared 36
pyRADassemblies, across which we varied each of the three parameters just mentioned above (pO, mD, and cP) over two- or three-value ranges within up to 30% of the original assembly parameters of the "target assembly" (our original a priori preferred assembly), and we also conducted runs with versus without the outgroup taxon. In this example, a total of four parameters were varied across the assemblies. We passed these assemblies as input to
RAPFX, which we then used to generate gene trees and bootstrap trees in
RAxML from RAD loci in each assembly (using the GTRGAMMA model) and
fastSTRUCTURE models estimating cluster assignments for K = 1–15 clusters. We also estimated Sobol' first-order and second-order indices of sensitivity of
pyRAD parameters based on a vector of four assembly outcome variables: mean best-tree dRF from all other best trees; mean best-tree dG; best K; and number of RAD loci. All analyses on the assemblies were conducted in
RAPFX (which calls the other programs listed above).
The end results suggested that downstream genetic analyses (1) are influenced by
pyRAD assembly parameters, but that (2) K varied over assemblies but was probably biased in assemblies with small numbers of loci/RAD tags while (3) gene trees from different assemblies showed massive overlap in "treespace", suggesting phylogenetic inference was probably not greatly influenced by the assembly parameters. A final take home message from this sensitivity analysis was (4) that our genetic inferences based on ddRAD-seq data from Ion Torrent semiconductor sequencing, assembled in
pyRAD, were slightly more sensitive to varying Phred offset score (pO) than the other parameters of the assembler, such as clustering threshold. These results have implications for future use of
pyRAD with an outgroup when using (a) Ion Torrent data (b) from similar organisms to those investigated in our study (Neotropical teleost freshwater fishes).
RAPFX takes as input a set of assembly folders generated using
ipyrad. Beyond the assembler, RAPFX has four main dependencies, listed below. It is critical that the dependencies are available from the command line interface on your local machine (i.e. in your
fastSTRUCTUREv1.0 (available at: https://rajanil.github.io/fastStructure/), including several key scripts in the
Perl(available at: https://www.perl.org/get.html).
Pythonv2.7 and/or 3++ (available at: https://www.python.org/downloads/)
Rv3++ (available at: https://cran.r-project.org/)
RAPFX comprises a pipeline of UNIX shell scripts, and generates customized
R scripts; thus, no installation is required, apart from downloading the master distro folder to your local machine and making sure that the shell scripts are present in any working directory you would like to run from.
RAPFX should run from any folder on UNIX/LINUX-like systems such as Mac OS X.
RAPFX, simply clone this repository and modify permissions, as follows:
$ cd ~ $ git clone https://github.com/justincbagley/RAPFX.git $ cd ./RAPFX/ $ chmod u+x ./*.sh
RAPFX pipeline expects as input the path to working directory (e.g. current dir,
.) containing sub-folders containing results from multiple
ipyrad (Eaton 2014; Eaton and Overcast 2016) assembly runs over ranges of assembly parameter values (e.g. set by hand, or through 'branching' analysis after step 2 in ipyrad). Alternatively, results from a different assembler can be used, if they are organized in similar fashion to pyRAD/ipyrad output (i.e. have an 'outfiles' sub-folder containing output files formatted exactly) like those output by
Sequence names (passed to the assembler and propagated into the output files) may not include hyphen characters, or there will be issues.
Users can access short or long help texts using different command line options. Access the regular Usage text by typing
-help after the name of the main shell script in the pipeline,
RAPFX.sh. Access the much longer and more detailed "verbose" help text by using the uppercase
-Help flags. The code snippet below provides a printout of the verbose Usage for the current version of
$ chmod u+x ./RAPFX.sh $ ./RAPFX.sh -H Usage: RAPFX.sh [Help: -h help H Help] [Options: -b r o l u n p s f d] workingDir ## Help: -h help text (also: -help) -H verbose help text (also: -Help) ## Options: -b numBootstraps (def: 100) RAxML bootstrap pseudoreplicates -r raxmlModel (def: $MY_RAXML_MODEL; other: GTRGAMMAI, GTRCAT, GTRCATI) -o outgroup (def: NULL) allows specifying an outgroup for phylogenetic analyses and tree plotting -l lK (def: 1) lowest value of K to be modeled in fastSTRUCTURE -u uK (def: 10) upper value of K to be modeled in fastSTRUCTURE -n fsOutName (def: 'fs_simple_iter') basename for fastSTRUCTURE output files -p popFile (def: NULL) specifies a population file for grouping samples by categorical variables in DISTRUCT plots of fastSTRUCTURE results -s Sobol2relImpor (def: 0) turns variance-based sensitivity analysis and parameter relative importance estimations on (1) or off (0) -f numFactors (def: 3) number of parameters changed across assemblies -d repDesignSize (def: 9) replicate design size N, where 2*N is total number of assemblies OVERVIEW Runs a sensitivity analysis pipeline to qualitatively and/or quantitatively explore the effects of varying RADseq assembly parameters on downstream phylogenetic gene tree and population structure analyses. Reads path to working directory (e.g. current dir, '.') containing sub-folders with results from multiple pyRAD/ipyrad (Eaton 2014; Eaton and Overcast 2016) assembly runs over ranges of assembly parameter values (e.g. set by hand, or through 'branching' analysis after step 2 in ipyrad). Alternatively, results from a different assembler can be used, if they are organized in similar fashion to pyRAD/ipyrad output. Next, uses Phylip-and STRUCTURE-formatted DNA sequence/genotype files in 'outfiles' dir to infer a maximum-likelihood gene tree in RAxML (Stamatakis 2014) and estimate population admixture groups in fastSTRUCTURE (Raj et al. 2014). Analyses are conducted in separate 'raxml' and 'fs' folders created within each assembly sub-folder. Assembly parameters, RAxML trees, and best K (no. clusters/admixture groups) results are then organized, and relationships among them are explored, through a customized R analysis comparing and plotting trees and multivariate treespace. Although currently in experimental stages of development, formal 1) variance-based sensitivity analyses (Tissot and Prieur 2015; Pujol et al. 2017) and 2) parameter relative importance analyses are also performed in R, and quantitative results are output to file, if 's' option is turned on. Sequence names may not include hyphen characters, or there will be issues. For info on various dependencies, see 'README.md' file in the distribution folder; however, it is key that the dependencies are available from the command line interface. Docs supporting install and use of some of the dependencies are available from the README. DETAILS The -b flag sets the number of boostrap pseudoreplicates for RAxML (Stamatakis 2014) to perform while estimating the gene tree for each locus. The default is 100; remove bootstrapping by setting to 0. The -r flag sets the RAxML model for analyses. This uses the full default GTRGAMMA model, and at present it is not possible to vary the model across assemblies. Several GTR-based models are available, but if you want to use HKY or K80, you will need to manually change the 'RAxMLRunner.sh' section of this script. The -o flag allows the user to specify the name of a *single* outgroup. The default value is NULL, and if by contrast the user gives an outgroup that outgroup is used during all RAxML runs, and later it is passed to an external 'outgroup.txt' file that is tested and read into R and used to root plotted trees. The -l flag sets the lower K value in the range of K values modeled in fastSTRUCTURE. The default value is 1. The -u flag sets the upper K value to be modeled in fastSTRUCTURE. The default value is 10. The -n flag sets the name used for fastSTRUCTURE output files, which ideally would be changed from the default value to something indicative of the particular species and/or assembly parameters from which data being analyzed originated. However, It is **recommended** that user-specified output names end in 'iter' and contain no numbers. NOTES: 1) fastSTRUCTURE input names are taken from the assembly params/results, and 2) the RAPFX script organizes fastSTRUCTURE output in a straightforward fashion, within an 'fs' folder inside each assembly sub-folder, so it is easy to keep track of results regardless of output name. ### EXPERIMENTAL OPTIONS (not yet recommended for use; e-mail author for info!): ##--Variance-based sensitivity analysis & parameter relative importance options: A quantitative sensitivity analysis portion of the RAPFX workflow is currently under development, and it uses the sensitivity R package (Pujol et al. 2017) to estimate first- order sensitivity indices of assembly parameters using Tissot and Prieur's (2015) LHS method. The following options apply __only__ to this section of the pipeline, which is referred to as 's' (short for 's2r' or 'Sobol2relImpor'): The -s flag allows turning the Sobol2relImpor section of the pipeline on (when set to 1) or off (set to 0). If 1, a customized Rscript will be written and executed to perform the analyses, and __users must also specify options for the following flags__: The -f flag sets the number of factors used during estimation of Sobol' (1990, 1993) sensitivity indices during the quantitative sensitivity analysis portion of the RAPFX workflow, which uses the sensitivity R package (Pujol et al. 2017) to estimate the first- order indices using the LHS method of Tissot and Prieur (2015). The number of factors is the number of assembly parameters that were changed during pyRAD/ipyrad (Eaton 2014; Eaton and Overcast 2016; or other assembler) runs. The -d flag sets the replicate design size N. Here, each assembly run is a model evaluation and the total number of model evaluations must be 2*N. N is supplied to sensitivity pkg during estimation of Sobol' sensitivity indices. The default values for -f and -d flags are 3 and 9, respectively, because RAPFX code was developed using a case study dataset of 18 assemblies in a replicated design changing each of 3 parameters multiple times across 2- to 3-value ranges. ## Examples #1-2, _Without_ Sobol2relImpor analysis: "$0" -b 100 -r GTRGAMMA -o HypJCB310 -u 10 . "$0" -b 100 -r GTRGAMMA -o HypJCB310 -u 10 -s 0 . ## Example #3, _With_ Sobol2relImpor analysis: "$0" -b 100 -r GTRGAMMA -o HypJCB310 -u 10 -s 1 -f 3 -d 9 . CITATION Bagley, J.C. 2017. RAPFX v0.1.0. GitHub repository, Available at: <http://github.com/justincbagley/RAPFX>. REFERENCES Eaton DA (2014) PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics, 30, 1844-1849. Eaton DAR, Overcast I (2016) ipyrad: interactive assembly and analysis of RADseq data sets. Available at: <http://ipyrad.readthedocs.io/>. Pujol G, Iooss B, Janon A (2017) sensitivity: Global Sensitivity Analysis of Model Outputs. R package version 1.14.0. Available at: <https://CRAN.R-project.org/package=sensitivity> Sobol' IM (1990) Sensitivity estimates for nonlinear mathematical models. Matematicheskoe Modelirovani, 2, 112–118. Sobol' IM (1993) Sensitivity estimates for nonlinear mathematical models [English Translation]. Mathematical Modeling and Computational Experiment, 1, 407–414. Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30, 1312-1313. Tissot JY, Prieur C (2015) Estimating Sobol's indices combining Monte Carlo integration and Latin hypercube sampling, Journal of Statistical Computation and Simulation, 85, 1358- 1381.
pyRADuser manual and tutorials listed on this page.
ipyradReadTheDocs documentation pages here.
- Documentation for
fastSTRUCTUREhere and on GitHub.
DISTRUCTdocumentation, available here.
- The Saltelli et al. (2008) Global sensitivity analysis book cited herein.
- The Tissot and Prieur (2015) paper cited herein.
- Bagley, J.C. 2017. RAPFX. Zenodo, Available at: http://doi.org/10.5281/zenodo.890870.
- Bagley JC, Aquino PPU, Hrbek T, Hernandez S, Langeani F, Colli GR (in revision) Using ddRAD-seq phylogeography to test for genetic effects of headwater river capture in suckermouth armored catfish (Loricariidae: Hypostomus) from the central Brazilian Shield. Molecular Ecology.
- Eaton DA (2014) PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics, 30, 1844–1849.
- Eaton DAR, Overcast I (2016) ipyrad: interactive assembly and analysis of RADseq data sets. Available at: http://ipyrad.readthedocs.io/.
- Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE (2012) Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One, 7, e37135.
- Raj A, Stephens M, and Pritchard JK (2014) fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Data Sets. Genetics, 197, 573–589.
- Saltelli A, Ratto M, Andres T, Campolongo F, Cariboni J, Gatelli D, Saisana M, Tarantola S (2008) Global sensitivity analysis: the primer. John Wiley & Sons.
- Sobol' IM (1990) Sensitivity estimates for nonlinear mathematical models. Matematicheskoe Modelirovani, 2, 112–118.
- Sobol' IM (1993) Sensitivity estimates for nonlinear mathematical models [English Translation]. Mathematical Modeling and Computational Experiment, 1, 407–414.
- Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30, 1312–1313.
- Tissot JY, Prieur C (2015) A randomized orthogonal array-based procedure for the estimation of first- and second-order Sobol' indices. Journal of Statistical Computation and Simulation, 85:1358-1381.
September 25, 2017 Justin C. Bagley, Richmond, VA, USA