Sequence alignment statistics were calculated using AMAS (v0.95, Git commit 84a679a). Statistics based on trees, or on trees and sequences together (regression), were calculated in R using the ape (Paradis et al. 2004) and seqinr (Charif and Lobry 2007) libraries. The code is in tree_props.R; it is a slightly modified version of the code for the metazoan phylogeny paper (Borowiec et al. 2015; GitHub), with a “clock-likeness” measure added. Average sequence heterogeneity was calculated using p4 (Foster 2004) with the aln_hetero.py script. What follows is a brief explanation of what is calculated, and how, in the file good_gene_stats.csv. For each alignment and its corresponding tree I calculated:

  • alignment length
  • number of taxa
  • total matrix cells
  • count of undetermined characters (X, N, O, -, ?)
  • percent missing (using the above)
  • number of variable sites (undetermined chars are excluded when determining this)
  • proportion of variable sites
  • number of parsimony informative sites (excluding undetermined)
  • proportion of parsimony informative sites
  • AT content
  • GC content
  • counts of all nucleotides, gap -, and missing ?
  • average matrix heterogeneity (the mean of the Euclidean distance matrix of compositions)
  • average bootstrap
  • average branch length (including internal branches; possible rate proxy, lower number means slower-evolving)
  • “clocklikeness” score (a measure of how close to ultrametric a tree is; the algorithm finds the root that minimizes the coefficient of variation of root-to-tip distances and returns that value; lower values are more clock-like, and an ultrametric tree has a score of 0)
  • average uncorrected p-distance
  • regression slope of identity distances plotted against branch lengths (the higher the value, the more closely the alignment fits a linear regression, which means lower saturation potential)
  • R-squared of the regression (as above, a higher value means a better fit to the linear regression and less saturation potential)
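Several of the alignment-based statistics above can be illustrated with a minimal Python sketch. This is not the AMAS or tree_props.R code; it only mirrors the definitions in the list. All function names are hypothetical, the set of undetermined characters is copied from the list above, and the use of the sample standard deviation in the coefficient of variation is an assumption. The clocklikeness helper scores a single, already chosen root; the search for the root that minimizes the value is left out.

```python
from itertools import combinations
from math import sqrt

# Characters treated as undetermined, copied from the list above.
UNDETERMINED = set("XNO-?")

def columns(seqs):
    """Return alignment columns as tuples of upper-case characters."""
    return zip(*(s.upper() for s in seqs))

def percent_missing(seqs):
    """Undetermined cells as a percentage of all matrix cells."""
    cells = sum(len(s) for s in seqs)
    missing = sum(ch in UNDETERMINED for s in seqs for ch in s.upper())
    return 100.0 * missing / cells

def variable_sites(seqs):
    """Count columns with at least two distinct determined states."""
    return sum(len(set(col) - UNDETERMINED) > 1 for col in columns(seqs))

def parsimony_informative_sites(seqs):
    """Count columns where at least two states each occur at least twice."""
    count = 0
    for col in columns(seqs):
        states = [ch for ch in col if ch not in UNDETERMINED]
        repeated = [s for s in set(states) if states.count(s) >= 2]
        count += len(repeated) >= 2
    return count

def p_distance(a, b):
    """Proportion of differing sites among sites determined in both sequences.

    Assumes the two sequences share at least one determined site.
    """
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x not in UNDETERMINED and y not in UNDETERMINED]
    return sum(x != y for x, y in pairs) / len(pairs)

def mean_p_distance(seqs):
    """Average uncorrected p-distance over all sequence pairs."""
    dists = [p_distance(a, b) for a, b in combinations(seqs, 2)]
    return sum(dists) / len(dists)

def clocklikeness_cv(root_to_tip):
    """Coefficient of variation of root-to-tip distances for ONE fixed root.

    Assumption: sample standard deviation (n - 1) divided by the mean.
    """
    mean = sum(root_to_tip) / len(root_to_tip)
    var = sum((d - mean) ** 2 for d in root_to_tip) / (len(root_to_tip) - 1)
    return sqrt(var) / mean

# Toy four-taxon alignment, made up for illustration.
seqs = ["ACGTAC", "ACGTTC", "AC-TTG", "ATGAAG"]
print(variable_sites(seqs))               # 4
print(parsimony_informative_sites(seqs))  # 2
print(clocklikeness_cv([0.5, 0.5, 0.5]))  # 0.0 for an ultrametric tree
```

In the toy alignment, the single gap is excluded both from the variable-site test for its column and from every pairwise comparison that involves it, which matches the "undetermined chars are excluded" notes in the list.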

The code in plotting_correlations.R helps visualize correlations among 12 variables:

  • score
  • normalized score
  • alignment length
  • percent missing
  • proportion of parsimony informative sites
  • GC content
  • average matrix heterogeneity
  • average bootstrap
  • average branch length
  • clocklikeness
  • average p-distance
  • R-squared of regression
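As a rough illustration of what a pairwise correlation among such variables looks like, here is a small Python sketch. It is not a port of plotting_correlations.R: the choice of Pearson correlation is an assumption, the function names are hypothetical, and the numbers are made up rather than taken from good_gene_stats.csv.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(table):
    """Map every ordered pair of variable names to its correlation.

    `table` is a dict of variable name -> list of per-locus values.
    """
    names = list(table)
    return {(a, b): pearson(table[a], table[b]) for a in names for b in names}

# Made-up values for three of the variables, one entry per locus.
toy = {
    "alignment_length": [300, 450, 600, 900],
    "average_bootstrap": [55.0, 62.0, 70.0, 88.0],
    "percent_missing": [12.0, 9.0, 7.5, 3.0],
}
corr = correlation_matrix(toy)
for (a, b), r in corr.items():
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: r = {r:.3f}")
```

With all 12 variables the same dictionary would hold 12 lists, and the resulting matrix is what a correlation plot would display.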

To cite:

Borowiec, M.L. 2016. AMAS: a fast tool for alignment manipulation and computing of summary statistics. PeerJ 4:e1660.

Borowiec, M. L., Lee, E. K., Chiu, J. C., & Plachetzki, D. C. 2015. Extracting phylogenetic signal and accounting for bias in whole-genome data sets supports the Ctenophora as sister to remaining Metazoa. BMC Genomics; 16(1):987.

Charif, D., & Lobry, J. R. 2007. SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. In: Structural approaches to sequence evolution: Molecules, networks, populations (U. Bastolla, M. Porto, H. E. Roman, and M. Vendruscolo, Eds.). Biological and Medical Physics, Biomedical Engineering; pp. 207–232.

Foster, P. G. 2004. Modeling compositional heterogeneity. Systematic Biology; 53(3):485–495.

Paradis, E., Claude, J., & Strimmer, K. 2004. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics; 20:289–290.