Unit 18 Bayesian Phylogenetics
download and install tracer on your computer [LINK]
qlogin with 1 processor
mkdir ~/unit18 cd ~/unit18
Download the data
wget -O primates_alnMuscle.nex https://www.dropbox.com/s/zvdjtg8al6jrl17/primates_alnMuscle.nex?dl=0
We are using the same primate data from last week (1 Cytb sequence per genus for all (available) primates). The only difference is the data has been structured according to the nexus format. The file now looks like this:
#NEXUS begin data; dimensions nchar=1140 ntax=50; format missing=? gap=- matchchar=. datatype=DNA; matrix Allocebus_trichotis.AY441461 ATGACCAACATTCGAAAAATCCACCCCTTAATAAAAG... Alouatta_belzebul.AY374353.2 ATGACTACCCCCCGCAAAACACACCCACTAGCAAAAA... Aotus_azarai.DQ098863 ATGACTTCNCCCCGCAAAACACACCCACTAGCAAAGA... Arctocebus_calabarensis.AY441474 ATGACAATTATACGAAAACAACACCCATTAGCAAAAA... ... <break> ... Varecia_variegata.AF081047 ATGACTAACACCCGAAAAAACCACCCTCTAATAAAAA... Tupaia_belangeri.EU531776 ATGACCAACATACGAAAAAACCACCCATTACTAAAGA... ; end;
A nexus file always begins with
#NEXUS and data is organised in blocks. Here all of the sequence data is given in a single
data block. The
data block begins with
begin data; and ends with
end;. It is further broken down into specific commands that end with a
;. For example. The
dimensions command has two options
format command has options for
matchchar (matching character) coding, and
datatype. All of the sequences are in the
matrix section that ends with the
; after the last seqeunce.
I absolutely DESPISE entiring commands directly into
mrbayes. I keep two windows (or panels...if you are cool and use tmux) open to work with MrBayes. Open another window. You do not need to qlogin for this additional window.
cd unit18 nano primates_alnMuscle.nex
Now you can edit your file in this window while running it in the other.
mpirun -np 1 mb
Execute your data within
This loads our sequence data into
mrbayes. So how to build a tree?.
This gives us all of the commands that are available in
mrbayes. Look through them. If you need parameter options for each command then type
help <command. For example, to find out about the Monte-Carlo Markov Chain parameters you could use:
MrBayes > help mcmcp --------------------------------------------------------------------------- Mcmcp This command sets the parameters of the Markov chain Monte Carlo (MCMC) analysis without actually starting the chain. This command is identical in all respects to Mcmc, except that the analysis will not start after this command is issued. For more details on the options, check the help menu for Mcmc. Parameter Options Current Setting ----------------------------------------------------- Ngen <number> 1000000 Nruns <number> 2 Nchains <number> 4 Temp <number> 0.100000 Reweight <number>,<number> 0.00 v 0.00 ^ Swapfreq <number> 1 Nswaps <number> 1 Samplefreq <number> 500 Printfreq <number> 1000 Printall Yes/No Yes Printmax <number> 8 Mcmcdiagn Yes/No Yes Diagnfreq <number> 5000 Diagnstat Avgstddev/Maxstddev Avgstddev Minpartfreq <number> 0.10 Allchains Yes/No No Allcomps Yes/No No Relburnin Yes/No Yes Burnin <number> 0 Burninfrac <number> 0.25 Stoprule Yes/No No Stopval <number> 0.05 Savetrees Yes/No No Checkpoint Yes/No Yes Checkfreq <number> 2000 Filename <name> primates_alnMuscle.nex.<p/t> Startparams Current/Reset Current Starttree Current/Random/ Current Parsimony Nperts <number> 0 Data Yes/No Yes Ordertaxa Yes/No No Append Yes/No No Autotune Yes/No Yes Tunefreq <number> 100 ---------------------------------------------------------------------------
From this help menu not only do you get an idea of parameters, but you can find out what the default/current settings are. In this case, we would run two independent mcmc runs with 1M generations and four chains sampling every 500 generations.
To change a parameter:
Now set the bayes run to
mcmcp ngen=10000 samplefreq=100 printfreq=1000 filename=primates_10K-100s-1r-4c.bayes
Now execute the MCMC run.
This should run in less than 5 minutes. While its running we can discuss what is actually going on and what all of the output means.
<...> The MCMC sampler will use the following moves: With prob. Chain will use move 0.98 % Dirichlet(Pi) 0.98 % Slider(Pi) 9.80 % ExtSPR(Tau,V) 9.80 % ExtTBR(Tau,V) 9.80 % NNI(Tau,V) 9.80 % ParsSPR(Tau,V) 39.22 % Multiplier(V) 13.73 % Nodeslider(V) 5.88 % TLMultiplier(V) <...> Chain results (100000 generations requested): 0 -- [-47508.809] (-46388.062) (-46133.286) (-46633.652) (...0 remote chains...) 5000 -- [-28952.864] (-28977.447) (-28943.149) (-28960.214) (...0 remote chains...) -- 0:01:16 10000 -- (-28976.122) [-28958.717] (-28961.856) (-28962.195) (...0 remote chains...) -- 0:01:12 15000 -- (-28964.889) (-28947.323) (-28959.086) [-28970.997] (...0 remote chains...) -- 0:01:13 20000 -- (-28952.999) (-28968.473) [-28949.931] (-28970.990) (...0 remote chains...) -- 0:01:08 <...>
Other terms to be aware of:
- chains (hot vs. cold)
So this has generated
samplefreq trees that have been saved into the
primates_100K-100s-1r-4c.bayes.t and parameters have been saved in
3) Summarizing the run
Lets look at the distribution of our data. Download the
primates_100K-100s-1r-4c.bayes.p file and open it in
Here we can see if our data reached stationarity wihtin the ranges of our MCMC parameters. You can really see the importance of setting an approporiate burnin here.
Based on the tracer visualizaiton, we know that we need to burnin the first trees and ultimatley increase the number sampled. We will increase our sampling in a bit, but for know we just want to know what our tree looks like...but we generated more than 1K. All trees can be summarized via the following:
Why did we not include a burnin when summing the trees?
The files that were generated include
primates_10K-100s-1r-4c.bayes.vstat- length information
primates_10K-100s-1r-4c.bayes.trprobs- topologies in order of posterior probability
primates_10K-100s-1r-4c.bayes.parts- partition frequencies (of the consensus tree)
primates_10K-100s-1r-4c.bayes.con.tre- consensus tree.
The model parameteres and tree stats can be summarized using:
In this version of
mrbayes there is a lot of overlap in the information generated here and with Tracer.
Files generated are:
primates_10K-100s-1r-4c.bayes.pstat- model parameter statistics
primates_10K-100s-1r-4c.bayes.lstat- likelihood statistics
To this point you have given
mrbayes commands via the CLI. It is much easier to add the commands to a
mrbayes block in your nexus file. Lets re-run the analysis using a
mrbayes block while adding some new commands.
4) Running a more complex analysis via the #Nexus file
Add the following to the end of the
begin mrbayes; [set outgroup] outgroup Tupaia_belangeri.EU531776; [designate character sets] charset cytbPos1 = 1-1140\3; charset cytbPos2 = 2-1140\3; charset cytbPos3 = 3-1140\3; [combine character sets into partitions] partition cytbFull = 2: cytbPos1 cytbPos2, cytbPos3; [set parameters of the likelihood model] lset applyto=(all) nst=6 rates=invgamma; unlink shape=(all) revmat=(all) statefreq=(all) shape=(all) pinvar=(all) brlens=(all) topology=(all); [set priors on parameters - calculated in jModelTest] prset applyto=(all) revmatpr=fixed(0.15646,5.9045,3.590,0.3203,4.9898,1.00) statefreqpr=fixed(0.3851,0.364,0.0554,0.1955) pinvarpr=fixed(0.3220) shapepr=fixed(0.540); [set parameters of the monte-carlo markov chain] mcmcp ngen=2500000 nruns=2 samplefreq=100 printfreq=1000 filename=primates_2.5M-100s-2r-4c.bayes; [being the search] mcmc; [summarize trees and parameters] sumt contype=allcompat; sump; end;
While this is running think about the following questions:
- How many chains are being run?
- How many total trees are being sampled?
- How many trees are being summarized?
- Some people are hesitant to use Bayesian probablilities? What argument(s) could you make against Bayesian phylogenetics.
Open the one of the parameter files in Tracer. Look at the ESSs. They dropped. Why? Did this run reach stationarity? How many more samples would you guess we need? Generations?
How would a real
mrbayes run look? I normally run 4-8 runs of 4 chains with 10M generations, sampling every 100-1K. Why would one increase runs over chains or chains over runs?
Since this analsysis will take a while, you can download there results here
wget -O primates_2.5M-100s-2r-4c.bayes.tgz https://www.dropbox.com/s/7u1l3j5sb7ahjyl/primates_2.5M-100s-2r-4c.bayes.tgz?dl=0
4) Generating a "publication quailty" tree
Since we don't have time in class to generate a tree using more appropriate parameters, lets use the
primates_100K-100s-2r-4c.bayes.con.tre to see how we can take a raw tree and generate a publication quality image.
primates_100K-100s-2r-4c.bayes.con.tre in FigTree.
- Put the clade probability values (CPVs) on the branches.
- Color the clades for major primate groups.
- Order the taxa in a reasonable way.
- export as an SVG
Open the SVG in Inkscape
- remove CPVs that fail to meet our significance threshold...which is?
- Clean up the names for a few of the taxa.
- Save as an .svg and as a .tiff or .png.
- These trees are easy to manipulate and can be uploaded straight to the publishers website as a figure. There is no jumping through hurddles trying to meet resolution requirments.
A note about MPI and multithreading.
Normally you could run
mrbayes on our cluster using a simple command:
We are using:
mpirun -np 1 mb
This is telling a seperate program
mpi to run
-np threads. In this case 1.
mrbayes can take advantage of mp and run a number of threads equivelant to the number of runs. So...as an example, if you have the following
mcmcp ngen=1000000 nruns=2 samplefreq=100 printfreq=1000 nchains=10;
then you could distribute your analysis across 20 threads (
nchains = threads).
16 nov 2015