Sensitivity analysis of RADseq assembly parameter effects on downstream genetic analyses
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
DISTRUCTsvgstack.sh
DISTRUCTsvgutils.sh
KLikelihoodProfiler.sh
LICENSE
RAPFX.sh
README.md
changeLog.md
compareTreesAndPopStruct.r
visualizeMargLikes.r

README.md

Codacy Badge License

Sensitivity analysis of RADseq assembly parameter effects on downstream genetic analyses

LICENSE

All code within the RAPFX v0.1.0 repository is available "AS IS" under a generous GNU license. See the LICENSE file for more information.

CITATION

If you use scripts from this repository as part of your published research, I require that you cite the repository as follows (also see DOI information below):

Alternatively, please provide the following link to this software repository in your manuscript:

DOI

The DOI for RAPFX, via Zenodo, is as follows:  DOI. Here are some examples of citing RAPFX using the DOI:

Bagley, J.C. 2017. RAPFX. GitHub package, Available at: http://doi.org/10.5281/zenodo.890870.

Bagley, J.C. 2017. RAPFX. Zenodo, Available at: http://doi.org/10.5281/zenodo.890870.

INTRODUCTION

This repository is the GitHub home for RAPFX, a new script pipeline for testing the sensitivity of downstream genetic analyses to varying RADseq assembly parameters in pyRAD (Eaton 2014) or ipyrad (Eaton and Overcast 2016), which is currently under development. Molecular ecologists often select an assembly or SNP calls from a single run of an assembler with a fixed set of parameter values (e.g. Phred offset, clustering percentage), and the final manuscript containing the work usually only presents results from that assembly. This is problematic because assembly parameters (1) can have substantial effects on the resulting assembly and (2) can also substantially affect downstream genetic analyses (e.g. population structure inference, phylogeny reconstruction).

Here, RAPFX enters the scene and takes steps towards automating testing of the sensitivity of assemblies and genetic inferences to pyRAD/ipyrad parameters. Thus, RAPFX can help users assess the sensitivity of their results by probing assembly parameters, which should make reviewers happy. RAPFX can also help users select an assembly with more optimal output, for example one that yielded genetic inferences tending towards an average (across assemblies) rather than extreme values (say, for number of genetic clusters in the data, or K).

What can you do with RAPFX?

  • Infer maximum-likelihood gene trees in RAxML for all loci produced by a set of n pyRAD/ipyrad assemblies (however, this is not a primary function of the software; for greater flexibility in doing this, users should refer to MAGNET, which is also distributed within PIrANHA)
  • Infer population genetic structure (number of clusters) in fastSTRUCTURE for each assembly in a set of n assemblies
  • Qualitatively assess sensitivity of genetic inferences to pyRAD/ipyrad assembly parameters
  • Quantify sensitivity of resulting (downstream) genetically-based results (number of clusters, gene tree distances) to varying pyRAD/ipyrad assembly parameters
  • Use some available code for parameter importance testing

Pipeline

Given a set of n pyRAD-inferred assemblies of RADseq data (run by the user a priori over a range of parameter values), RAPFX automates estimating a maximum-likelihood (ML) gene tree in RAxML, and population admixture groups in fastSTRUCTURE, for each assembly. Assembly parameters, RAxML trees, and best K (cluster/admixture groups) results are then organized, and relationships among them are explored, through a customized analysis in R. Output includes graphical summaries comparing results from the different assemblies. The following figure, Figure 1, presents a general overview of the RAPFX workflow:

Figure 1

Treespace

After inferring gene trees in RAxML, RAPFX plots graphics of the gene trees from each assembly and compares them in multivariate treespace. This is somewhat crudely done based on Robinson-Foulds distances (dRF, or "RF dist"), which reflect topological differences, and geodesic distances (dG, or "Geodesic dist"), which reflect branch length variation.

Variance-based sensitivity analysis & parameter importance

Code is also provided in RAPFX for (1) variance-based sensitivity analysis using Sobol' sensitivity indices (Sobol' 1990, 1993; Saltelli et al. 2008) estimated using the Latin hypercube space method of Tissot & Prieur (2015), and (2) parameter relative importance analyses in R, with quantitative results output to file. For these quantitative sensitivity analyses, it is recommended that users vary the following three assembly parameters in pyRAD (at least; given with my own arbitrary abbreviations): Phred score offset, pO; minimum depth of coverage, mD; clustering percentage threshold, cP.

Variance-based sensitivity analysis

During variance-based sensitivity analysis, for each model input variable Xi, Sobol' first-order indices reflect the contribution to total output variance of the decomposed variance attributable to Xi. Likewise, Sobol' second-order indices reflect the effect of varying two parameters, Xi and Xj, simultaneously, beyond the effect of the variation of each individual parameter, akin to interaction effects between variables (Saltelli et al. 2008). You can read more about Sobol' indices here, here, and here, and in papers listed under the Recommended Reading section at the bottom of this document.

Example of how to implement sensitivity analysis in RAPFX

In a recent paper that is presently in revision (Bagley et al. in revision), my colleagues and I compared 36 pyRADassemblies, across which we varied each of the three parameters just mentioned above (pO, mD, and cP) over two- or three-value ranges within up to 30% of the original assembly parameters of the "target assembly" (our original a priori preferred assembly), and we also conducted runs with versus without the outgroup taxon. In this example, a total of four parameters were varied across the assemblies. We passed these assemblies as input to RAPFX, which we then used to generate gene trees and bootstrap trees in RAxML from RAD loci in each assembly (using the GTRGAMMA model) and fastSTRUCTURE models estimating cluster assignments for K = 1–15 clusters. We also estimated Sobol' first-order and second-order indices of sensitivity of pyRAD parameters based on a vector of four assembly outcome variables: mean best-tree dRF from all other best trees; mean best-tree dG; best K; and number of RAD loci. All analyses on the assemblies were conducted in RAPFX (which calls the other programs listed above).

The end results suggested that downstream genetic analyses (1) are influenced by pyRAD assembly parameters, but that (2) K varied over assemblies but was probably biased in assemblies with small numbers of loci/RAD tags while (3) gene trees from different assemblies showed massive overlap in "treespace", suggesting phylogenetic inference was probably not greatly influenced by the assembly parameters. A final take home message from this sensitivity analysis was (4) that our genetic inferences based on ddRAD-seq data from Ion Torrent semiconductor sequencing, assembled inpyRAD, were slightly more sensitive to varying Phred offset score (pO) than the other parameters of the assembler, such as clustering threshold. These results have implications for future use of pyRAD with an outgroup when using (a) Ion Torrent data (b) from similar organisms to those investigated in our study (Neotropical teleost freshwater fishes).

GETTING STARTED

Dependencies

RAPFX takes as input a set of assembly folders generated using pyRAD or ipyrad. Beyond the assembler, RAPFX has four main dependencies, listed below. It is critical that the dependencies are available from the command line interface on your local machine (i.e. in your $PATH).

Installation

💻 RAPFX comprises a pipeline of UNIX shell scripts, and generates customized R scripts; thus, no installation is required, apart from downloading the master distro folder to your local machine and making sure that the shell scripts are present in any working directory you would like to run from. RAPFX should run from any folder on UNIX/LINUX-like systems such as Mac OS X.

To install RAPFX, simply clone this repository and modify permissions, as follows:

$ cd ~
$ git clone https://github.com/justincbagley/RAPFX.git
$ cd ./RAPFX/
$ chmod u+x ./*.sh

Input Format

📄 The RAPFX pipeline expects as input the path to working directory (e.g. current dir, .) containing sub-folders containing results from multiple pyRAD/ipyrad (Eaton 2014; Eaton and Overcast 2016) assembly runs over ranges of assembly parameter values (e.g. set by hand, or through 'branching' analysis after step 2 in ipyrad). Alternatively, results from a different assembler can be used, if they are organized in similar fashion to pyRAD/ipyrad output (i.e. have an 'outfiles' sub-folder containing output files formatted exactly) like those output by pyRAD/ipyrad.

Sequence names (passed to the assembler and propagated into the output files) may not include hyphen characters, or there will be issues.

USAGE

Users can access short or long help texts using different command line options. Access the regular Usage text by typing -h or -help after the name of the main shell script in the pipeline, RAPFX.sh. Access the much longer and more detailed "verbose" help text by using the uppercase -H or -Help flags. The code snippet below provides a printout of the verbose Usage for the current version of RAPFX.

🚧 However, as noted in the Usage, some of the options are experimental at present, so you should not use them, and if you do want to use them then you should email me first!!!

🚧 The main reason that I deem this part of the script "experimental" is that it has not been reworked to be customizable to all input assemblies, but I plan on developing this functionality soon (for a future alpha, or major version 1, release).

$ chmod u+x ./RAPFX.sh
$ ./RAPFX.sh -H

Usage: RAPFX.sh [Help: -h help H Help] [Options: -b r o l u n p s f d] workingDir 
 ## Help:
  -h   help text (also: -help)
  -H   verbose help text (also: -Help)

 ## Options:
  -b   numBootstraps (def: 100) RAxML bootstrap pseudoreplicates
  -r   raxmlModel (def: $MY_RAXML_MODEL; other: GTRGAMMAI, GTRCAT, GTRCATI)
  -o   outgroup (def: NULL) allows specifying an outgroup for phylogenetic analyses and tree plotting
  -l   lK (def: 1) lowest value of K to be modeled in fastSTRUCTURE
  -u   uK (def: 10) upper value of K to be modeled in fastSTRUCTURE
  -n   fsOutName (def: 'fs_simple_iter') basename for fastSTRUCTURE output files
  -p   popFile (def: NULL) specifies a population file for grouping samples by categorical variables
       in DISTRUCT plots of fastSTRUCTURE results
  -s   Sobol2relImpor (def: 0) turns variance-based sensitivity analysis and parameter relative 
       importance estimations on (1) or off (0)
  -f   numFactors (def: 3) number of parameters changed across assemblies
  -d   repDesignSize (def: 9) replicate design size N, where 2*N is total number of assemblies

 OVERVIEW
 Runs a sensitivity analysis pipeline to qualitatively and/or quantitatively explore the
 effects of varying RADseq assembly parameters on downstream phylogenetic gene tree and
 population structure analyses.
	Reads path to working directory (e.g. current dir, '.') containing sub-folders with 
 results from multiple pyRAD/ipyrad (Eaton 2014; Eaton and Overcast 2016) assembly runs over 
 ranges of assembly parameter values (e.g. set by hand, or through 'branching' analysis after 
 step 2 in ipyrad). Alternatively, results from a different assembler can be used, if they are 
 organized in similar fashion to pyRAD/ipyrad output. 
	Next, uses Phylip-and STRUCTURE-formatted DNA sequence/genotype files in 'outfiles' 
 dir to infer a maximum-likelihood gene tree in RAxML (Stamatakis 2014) and estimate population 
 admixture groups in fastSTRUCTURE (Raj et al. 2014). Analyses are conducted in separate 'raxml' 
 and 'fs' folders created within each assembly sub-folder. Assembly parameters, RAxML trees, 
 and best K (no. clusters/admixture groups) results are then organized, and relationships 
 among them are explored, through a customized R analysis comparing and plotting trees and 
 multivariate treespace. 
	Although currently in experimental stages of development, formal 1) variance-based 
 sensitivity analyses (Tissot and Prieur 2015; Pujol et al. 2017) and 2) parameter relative 
 importance analyses are also performed in R, and quantitative results are output to file, 
 if 's' option is turned on.
	Sequence names may not include hyphen characters, or there will be issues. For info on 
 various dependencies, see 'README.md' file in the distribution folder; however, it is key 
 that the dependencies are available from the command line interface. Docs supporting install
 and use of some of the dependencies are available from the README.

 DETAILS
 The -b flag sets the number of boostrap pseudoreplicates for RAxML (Stamatakis 2014) to 
 perform while estimating the gene tree for each locus. The default is 100; remove 
 bootstrapping by setting to 0.

 The -r flag sets the RAxML model for analyses. This uses the full default GTRGAMMA model,
 and at present it is not possible to vary the model across assemblies. Several GTR-based
 models are available, but if you want to use HKY or K80, you will need to manually change 
 the 'RAxMLRunner.sh' section of this script.

 The -o flag allows the user to specify the name of a *single* outgroup. The default value
 is NULL, and if by contrast the user gives an outgroup that outgroup is used during all
 RAxML runs, and later it is passed to an external 'outgroup.txt' file that is tested and
 read into R and used to root plotted trees.

 The -l flag sets the lower K value in the range of K values modeled in fastSTRUCTURE. The 
 default value is 1.

 The -u flag sets the upper K value to be modeled in fastSTRUCTURE. The default value is 10.

 The -n flag sets the name used for fastSTRUCTURE output files, which ideally would be 
 changed from the default value to something indicative of the particular species and/or
 assembly parameters from which data being analyzed originated. However, It is **recommended** 
 that user-specified output names end in 'iter' and contain no numbers. NOTES: 1) fastSTRUCTURE 
 input names are taken from the assembly params/results, and 2) the RAPFX script organizes 
 fastSTRUCTURE output in a straightforward fashion, within an 'fs' folder inside each assembly 
 sub-folder, so it is easy to keep track of results regardless of output name.

	### EXPERIMENTAL OPTIONS (not yet recommended for use; e-mail author for info!):

	##--Variance-based sensitivity analysis & parameter relative importance options:
	A quantitative sensitivity analysis portion of the RAPFX workflow is currently under
	development, and it uses the sensitivity R package (Pujol et al. 2017) to estimate first- 
	order sensitivity indices of assembly parameters using Tissot and Prieur's (2015) LHS 
	method. The following options apply __only__ to this section of the pipeline, which is 
	referred to as 's' (short for 's2r' or 'Sobol2relImpor'): 
	
	The -s flag allows turning the Sobol2relImpor section of the pipeline on (when set to 
	1) or off (set to 0). If 1, a customized Rscript will be written and executed to perform 
	the analyses, and __users must also specify options for the following flags__:
	
	The -f flag sets the number of factors used during estimation of Sobol' (1990, 1993) 
	sensitivity indices during the quantitative sensitivity analysis portion of the RAPFX
	workflow, which uses the sensitivity R package (Pujol et al. 2017) to estimate the first-
	order indices using the LHS method of Tissot and Prieur (2015). The number of factors 
	is the number of assembly parameters that were changed during pyRAD/ipyrad (Eaton 2014; 
	Eaton and Overcast 2016; or other assembler) runs.

	The -d flag sets the replicate design size N. Here, each assembly run is a model evaluation
	and the total number of model evaluations must be 2*N. N is supplied to sensitivity pkg
	during estimation of Sobol' sensitivity indices. 

	The default values for -f and -d flags are 3 and 9, respectively, because RAPFX code was
	developed using a case study dataset of 18 assemblies in a replicated design changing each
	of 3 parameters multiple times across 2- to 3-value ranges.

		## Examples #1-2, _Without_ Sobol2relImpor analysis: 
		"$0" -b 100 -r GTRGAMMA -o HypJCB310 -u 10 .
		"$0" -b 100 -r GTRGAMMA -o HypJCB310 -u 10 -s 0 .
	
		## Example #3, _With_ Sobol2relImpor analysis: 
		"$0" -b 100 -r GTRGAMMA -o HypJCB310 -u 10 -s 1 -f 3 -d 9 .

 CITATION
 Bagley, J.C. 2017. RAPFX v0.1.0. GitHub repository, Available at: 
	<http://github.com/justincbagley/RAPFX>.

 REFERENCES
 Eaton DA (2014) PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. 
	Bioinformatics, 30, 1844-1849.
 Eaton DAR, Overcast I (2016) ipyrad: interactive assembly and analysis of RADseq data sets. 
	Available at: <http://ipyrad.readthedocs.io/>.
 Pujol G, Iooss B, Janon A (2017) sensitivity: Global Sensitivity Analysis of Model Outputs. 
	R package version 1.14.0. Available at: <https://CRAN.R-project.org/package=sensitivity>
 Sobol' IM (1990) Sensitivity estimates for nonlinear mathematical models. Matematicheskoe 
	Modelirovani, 2, 112–118. 
 Sobol' IM (1993) Sensitivity estimates for nonlinear mathematical models [English Translation]. 
	Mathematical Modeling and Computational Experiment, 1, 407–414.
 Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of 
	large phylogenies. Bioinformatics, 30, 1312-1313.
 Tissot JY, Prieur C (2015) Estimating Sobol's indices combining Monte Carlo integration and 
	Latin hypercube sampling, Journal of Statistical Computation and Simulation, 85, 1358-
	1381.

RECOMMENDED READING

  • pyRAD user manual and tutorials listed on this page.
  • ipyrad ReadTheDocs documentation pages here.
  • Documentation for fastSTRUCTURE here and on GitHub.
  • DISTRUCT documentation, available here.
  • The Saltelli et al. (2008) Global sensitivity analysis book cited herein.
  • The Tissot and Prieur (2015) paper cited herein.

REFERENCES

  • Bagley, J.C. 2017. RAPFX. Zenodo, Available at: http://doi.org/10.5281/zenodo.890870.
  • Bagley JC, Aquino PPU, Hrbek T, Hernandez S, Langeani F, Colli GR (in revision) Using ddRAD-seq phylogeography to test for genetic effects of headwater river capture in suckermouth armored catfish (Loricariidae: Hypostomus) from the central Brazilian Shield. Molecular Ecology.
  • Eaton DA (2014) PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics, 30, 1844–1849.
  • Eaton DAR, Overcast I (2016) ipyrad: interactive assembly and analysis of RADseq data sets. Available at: http://ipyrad.readthedocs.io/.
  • Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE (2012) Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One, 7, e37135.
  • Raj A, Stephens M, and Pritchard JK (2014) fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Data Sets. Genetics, 197, 573–589.
  • Saltelli A, Ratto M, Andres T, Campolongo F, Cariboni J, Gatelli D, Saisana M, Tarantola S (2008) Global sensitivity analysis: the primer. John Wiley & Sons.
  • Sobol' IM (1990) Sensitivity estimates for nonlinear mathematical models. Matematicheskoe Modelirovani, 2, 112–118.
  • Sobol' IM (1993) Sensitivity estimates for nonlinear mathematical models [English Translation]. Mathematical Modeling and Computational Experiment, 1, 407–414.
  • Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30, 1312–1313.
  • Tissot JY, Prieur C (2015) A randomized orthogonal array-based procedure for the estimation of first- and second-order Sobol' indices. Journal of Statistical Computation and Simulation, 85:1358-1381.

September 25, 2017 Justin C. Bagley, Richmond, VA, USA