Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
C Shell Makefile Other
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
This is the README file for the biomc2 suite. ======== (1) Installation For general instructions on how to install the package, please refer to the INSTALL file. The configuration is handled by autoconf/automake. I've tested only on GNU/Debian linux machines (intel and AMD64), and MacOSX using fink (http://fink.sourceforge.net/) but it may work on Windows using cygwin (http://www.cygwin.com/). (1.1) quick install guidelines tar jxvf biomc2-1.9.tbz (if you downloaded the bzip2'ed version) or tar zxvf biomc2-1.9.tgz (if you downloaded the gzip'ed version) to extract the files; cd biomc2-1.9/build ../configure --enable-static-binary --prefix=$HOME make make install The "build" directory is where the compilation will be done. It is not necessary, but it is good practice to build the program in a separate directory. You may choose some other directory at will, adapting the instructions as necessary. The option "--enable-static-binary" will link the compiled files with "-static", so that you can copy the binaries to other machines with the same architecture/OS but that lack the libraries. Another option specific to these programs is "--enable-optimization", that will include optimization flags when compiling. The program may become faster, but I don't know if it works for other architectures or older compilers. An important parameter is "--prefix=SOMEWHERE", since the compiled programs will go to "SOMEWHERE/bin/" and so forth. The default is '/usr/local', but it may be better to install it locally. In the example above, it will go to your home directory ($HOME). OBS: to compile it statically, I had to configure with the command (at least the pkg_config and -static are obligatory...): PKG_CONFIG="pkg-config --static" CFLAGS="-pthread -static -flto -ffat-lto-objects -fPIC" ../biomc2-1.11/configure --enable-static-binary --enable-optimization (1.2) required libraries - The GNU C library (http://www.gnu.org/software/libc/), known as glibc. - The GTK development library, for plotting the figure when running biomc2.summarise. You can install it through the command apt-get install libgtk2.0-dev libcairo2-dev in a debian/ubuntu linux or a Mac OSX with fink installed. (1.3) recommended software/libraries - The R software for statistical computing (http://www.r-project.org/). It is an excellent environment for statistical analysis and graphics, better than any non-free equivalent (IMHO). It is useful for analysing the output of the program, plotting, etc. The following libraries for R are very handy: - ape (http://cran.r-project.org/src/contrib/Descriptions/ape.html), for plotting the trees, calculating consensus trees, even estimating population parameters; - coda (http://cran.r-project.org/src/contrib/Descriptions/coda.html), for analysing and diagnosing MCMC simulations; - bioconductor (http://www.bioconductor.org/), for analysis of genomic data in general (I use it for creating nice figures). - MrBayes (http://mrbayes.csit.fsu.edu/), for analysing the posterior distribution of topologies. In the first version of our program, MrBayes was necessary. From version 1.9 on the program biomc2.summarise can calculate the MAP topologies. - paml (http://abacus.gene.ucl.ac.uk/software/paml.html), for simulating alignments along a given phylogeny. - paup (http://paup.csit.fsu.edu/), for generating random topologies. This is not free software, so you may try the 'rtree' function from the "ape" package in R. The program 'biomc2.distance' also generates trees, but they are correlated by the recombination distance. - gawk, perl; to play with the files (2) Programs This suite is composed of four programs: "biomc2", "biomc2.summarise", "biomc.sprdist" and "biomc2.distance", which are explained in more detail below. Despite my efforts in complying with the GNU standards, the usage of the programs is quite "unusual": the parameters are given in a specific order. If you simply run the program without any parameters, it will give a minimalistic hint about its usage. (2.1) biomc2 Usage: biomc2 control_file This is the main program, that given a DNA alignment and a tree file will try to estimate the recombination break-points and the number of recombinations for each break-point. The alignment file should be in INTERLEAVED nexus format. Or at least have the word "INTERLEAVE", since as far as I understand the non-interleaved format is a special case (the whole sequence in one line). The tree file is just an initial guess and should be in nexus format. If you feel unconfortable about the arbitrariness of this initial state you can give a tree file with lots of trees. The program will choose one randomly. All trees should be bifurcating (no politomies), with exception of the root node - since unrooted trees are represented by a trifurcation, in parenthetic format. The program can read rooted and unrooted trees, since in the end it will remove an eventual root node. The only input parameter to the program is the control file name. In this file there will be information about the DNA and tree files, along with other model parameters. A template file can be found at doc/ctrl.biomc2 and an example is available at example/ctrl.biomc2 . The program outputs to the screen progress information, consisting of the acceptance rates for each move and a point estimate of the number of recombinations nSPR and the number of break-points nCOP in the form [nSPR nCOP] . It also outputs the sampled values of each parameter to the files "post.tre" and "post.dist". If you are recovering the prior distribution (in other words, if you set "prior=1" in control_file) then the output files will be named "prior.tre" and "prior.dist". (in the original version the program would produce way too many output files, restricting the maximum number of segments). The file "post.dist" has a variable number of columns and should be processed by program "biomc2.summarise" but the four first columns (after the third row) are respectively "iteration", "posterior probability" (proportional to; in log scale), "number of SPRs" (=lower bound on number of recomb events), "number of break-points" (one break-point can harbor more than one recomb event), "prior average rate" and "prior kappa of HKY". The file "post.tree" is a standard nexus tree file with trees mapped to segments and samples (you need "biomc2.summarise" to make sense out of them, though). (2.2) biomc2.summarise Usage: biomc2.summarise post.tre post.dist This program reads the treefile and distfile from the MCMC sampler and summarises the posterior distribution of break-points. It changed a lot from the previous version, and is focused on determining a likely mosaic structure. The caveat is that it can no longer summarise the posterior distribution of average rates per segment - but this was not our main purpose in the first place. If necessary these and other values can be output with a little programming (on function sample_to_output_file() of file run_sampler.c). The program uses two main approaches to estimate a break-point mosaic structure: the piece-wise median and the centroid sample. The piece-wise median is based on the idea that the posterior distribution of break-points will have several modes, one for each break-point, along the alignment. Thus we partition the distribution based on the CDF where the number of partitions equals the median number of break-points, and for each partition we find its median value. The centroid sample is in fact the sample whose mosaic structure minimizes the total distance to all other samples. This distance algorithm will be published in the "Annals of the Institute of Statistical Mathematics". For each estimated mosaic structure we infer the topology as the MAP topology over all segments composing the non-recombinant regions. The program output is composed of three files: "mosaicTreeFreq.txt", "mosaicTrees.tre" and "recomb_freq.pdf". The file "mosaicTrees.tre" contains a list with all relevant topologies, named as character "t" followed by a numerical ID (used by mosaicTreeFreq.txt). The file "mosaicTreeFreq.txt" contains a table with 8 columns, where each row represents one segment: - loc: location of last site of segment; - mapfreq: posterior frequency of MAP (Maximum A Posteriori, the mode) topology for segment; - maptre: ID of MAP tree (according to mosaicTrees.tre); - map2freq: posterior frequency of second most frequent topology; - map2tre: tree ID of second most frequent topology; - mapBP: indicator of whether the MAP topologies changed or not between the segment and next; - mosBP: mosaic structure according to centroid sample (zero means no break-point and number other than zero is the tree ID up to this segment); - medBP: mosaic structure based on median (the values different from zero indicate the estimated topology ID for segments up to this one); The file "recomb_freq.pdf" is a figure with the posterior distribution of (approx.) recombinations, where the upper panel uses all samples and lower panel uses only the fraction of samples that exhibit the modal (most frequent) number of break-points. For each panel, we have at the top the mosaic based on the piece-wise median (together with 95% inter-quantile ranges). If this estimate does not coincide with the "mosaicTreeFreq.txt" please use only the mosaicTreeFreq.txt one and report to me. I haven't verified the plotting functions... At the bottom we have blue and red bars: the blue bars are segments where in no sample we observed a break-point (cold spots). The red bars represent the 95% credible set (together, they are responsible for 95% of the recombinational signal). If you are curious: if a third argument is given to biomc2.summarise with a tree file it will compare the trees in this file with the posterior trees and output the corresponding index (tree ID) in the posterior samples (I used this obscure option when comparing with simulations, where we know the true tree and want to see if the sampler can find it). Don't worry bout this option. (2.3) biomc2.distance Usage: biomc2.distance tree_file number_of_SPR number_of_replicates > output_file This program may be run independently from the analysis. Given an initial tree it simulates new topologies whose recombination distance from the previously simulated tree is given by the user. The initial tree is sampled from 'tree_file', and the program will create a chain of topologies such that all have the same recombination distance from their neighbors. The program may not simulate perfectly the given recombination distance since it applies the specified number of SPRs on the current tree to generate the next tree, but for large numbers we may have cycles. The program tries to minimize this by choosing the prune and regraft edges without replacement. The output_file will contain the tree with uniformly sampled branch lengths, preceded by : the simulation number, real and estimated recombination distances, respectively. The program will also print to the stderr the mean and standard deviation of a few statistics: the overall error, given by the overestimated distance minus the underestimated distance; the modular error, given by the overestimated distance plus the underestimated distance; the overestimation error; and the underestimation error. (2.4) biomc2.sprdist Usage: biomc2.sprdist tree1 tree2 Calculates d_SPR between a pair of tree files and outputs the histogram of distances. A file "recomb_leaves.txt" will contain the frequency of recombination for each leaf. Remember that this frequency is an approximation. (this program is not described in the paper and we don't use it yet). leo at gmail dot com