PhyGeo
is a tool for phylogenetic biogeography analysis.
There are two ways to install PhyGeo.
If you are only interested in the command-line tool,
just go to the Releases tab,
select the last release,
choose an executable for your system architecture,
renamed at your will
(Here,
it is assumed that you use the name phygeo
or phygeo.exe
if you use Windows),
and put it in your default bin
directory.
If you want an up-to-date tool
you require the go tool
and install the PhyGeo package by running:
go install github.com/js-arias/phygeo@latest
If you want to use the package in your own code, just import the package, for example:
import "github.com/js-arias/phygeo/infer/diffusion"
PhyGeo is a command-line tool formed by a set of commands. Many commands have their own sub-commands. To see the list of commands, just type the name of the application:
phygeo
The best way to learn about the commands
is by reading the included on-line help,
using the command help
,
and using the command of interest as a parameter:
phygeo help diff map
This simple example dataset, which includes the instructions to run it, will be helpful to start using the program.
A diffusion analysis with PhyGeo requires three data sources: a phylogenetic tree, the distribution range of the terminals, and a paleogeography model. These data sources are stored in a project file, so you don't have to define them every time. They also give you the possibility of having multiple projects based on slightly different data (for example, the same phylogenetic tree and distribution ranges, but a different paleogeographic model).
Maybe the best way to start a project is by setting the paleogeography model:
phygeo geo add --type geomotion project.tab muller-2022-motion.tab
phygeo geo add --type landscape project.tab cao-2017-landscape.tab
In this example,
a plate motion model called muller-2022-motion.tab
and a landscape model called cao-2017-landscape.tab
are added to project.tab
.
As it is possible that project.tab
does not exist,
it will be created automatically in the first call.
As paleogeography models are quite specialized datasets, here is a repository with several models ready to be used with PhyGeo.
To define the priors of the pixels, you must define a pixel prior file (here is an example of this kind of file) and then add it to the project:
phygeo geo prior --add model-pix-prior.tab project.tab
Phylogenetic trees in PhyGeo must be time-calibrated. The trees must be fully dichotomous. The preferred tree file is a tab-delimited file (here is an example file). To add a tree file, you must define the name of the destination file (if it is the first tree to be added), the project file, and one or more files with the trees (usually a single file):
phygeo tree add -f data-tree.tab project.tab vireya-tree.tab
Most users usually have the trees in newick format. To import a newick tree, use the flag --newick with the name of the added tree:
phygeo tree add -f data-tree.tab --newick vireya project.tab vireya.tre
See the documentation of the command tree add
for more information about adding trees.
Specimen records are stored as a set of pixels (presence-absence pixels) or range maps. Both files have the same format (here is an example file). To import a set of records to a project, you must define a destination file (if it is the first set of records to be added), the kind of file (points or ranges), the project file, and one or more files with the records:
phygeo range add -f data-points.tab project.tab vireya-points.tab pseudovireya-points.tab
It is possible that your file with specimen records is a table with latitudes and longitudes or a table downloaded from GBIF. In such cases, you can import them using the flag --format:
phygeo range add -f data-points.tab --format darwin project.tab gbif-download.csv
See the documentation of the command range add
for more information
about adding specimen records.
Note that the pixelation used for the specimen records must be of the same resolution as the paleogeography models.
To be sure that all terminals in all the trees in the project have at least a single valid record, use:
phygeo range taxa --val project.tab
if there is no problem, the command will finish silently; otherwise, it will report the name of the terminals without geographic data.
With a valid project, it is possible to make inferences from the data. There are several possibilities. Maybe the most simple is to just attempt a likelihood estimation of the data with an a priori lambda value, just to see what happens, or because from a previous analysis, you know that the given lambda is the maximum likelihood value:
phygeo diff like --lambda 100 -o like project.tab
This analysis will create a new file
with the prefix like
(for example, like-project.tab-vireya.100.000000-down.tab
),
which contains the down-pass likelihoods
(i.e., the conditional likelihoods)
of each pixel in each node,
so it is a large file.
As the calculation of the likelihood conditionals
is a time-consuming operation,
this file will be helpful
to skip that operation in further analysis.
For example,
it can be used to perform stochastic mapping:
phygeo diff particles -i like-project.tab-vireya-100.000000-down.tab -o p project.tab
This analysis
will create a new file with the prefix p
(for example,
'p-vireya-100.000000x1000.tab';
here is an example file.
This file will contain all the simulated dispersal paths,
so it is usually a large file.
As you probably want to know
the maximum likelihood estimate of lambda,
you can use the command diff ml
:
phygeo diff ml project.tab
The maximum likelihood estimation will be printed on the screen. It uses a simple hill-climbing algorithm that stops by default when the step size is smaller than 1.0; you can set a more detailed bound (but with a larger execution time).
Maybe you prefer a Bayesian analysis. As the only free parameter is the lambda value, you can make a simple integration:
phygeo diff integrate --min 100 --max 300 --parts 500 project.tab > log-like.tab
and then,
using any program to read tab-delimited data
(in this case log-like.tab
,
here is an example file),
you can provide the prior for lambda
(or just use the integration output,
assuming a flat uniform prior).
To sample from the posterior
(or for any distribution),
you can use the same diff integrate
command,
but define a sampling function
(at the moment,
it just implements the gamma distribution):
phygeo diff integrate --distribution "gamma=75,0.5" -p 100 --parts 1000 project.tab
In this execution,
for each sample
(it will make 1000,
defined with the flag --parts
),
it will make 100 stochastic mappings
(defined with the flag -p
).
The output will have the prefix
sample
(for example,
sample.tab-project.tab-vireya-sampling-1000x100.tab
).
These files are usually large
and are of the same format
as the output files produced
by the diff particles
command.
The results of the diff particles
command,
or diff integrate --distribution
command,
form the most important output of the program.
These files contain one
or more stochastic mappings
(usually more than 100),
i.e.,
the pixel locations of the nodes
and internodes
(branches that cross a time stage
defined by the paleogeography model).
We can transform the stochastic maps into pixel frequencies, which are the approximation of the pixel posterior at each node. These frequencies can be raw (i.e., just counts of sampled pixels) or smoothed using a spherical KDE:
phygeo diff freq --kde 1000 -i p-vireya-100.000000x1000.tab -o kde project.tab
In this example,
the pixel frequencies
will be stored in a file with the kde
prefix
(for example,
kde-project.tab-p-vireya-100.000000x1000.tab.tab
),
which is usually large.
Then we can create an image map of the frequencies:
phygeo diff map -c 1440 --key landscape-key.tab --gray -i kde-project.tab-p-vireya-100.000000x1000.tab.tab -o "ml-95/ml-95" project.tab
The command diff map
will create a reconstruction
from the frequency file
using a rainbow color scheme
(from blue for pixels with low posterior probability
to red for pixels with a high posterior);
see this directory
for an example output.
The command diff map
has a lot of output options;
see the command help for more information.
Here are some options:
to produce rotated
(the default)
or unrotated maps
(maps with current geographic locations,
--unrot
flag),
to output each node
(the default),
or output by time stage
(--richness
flag).
A key file
(here is an example)
can be used to define the colors
for the background geography
(with the flag --gray
,
it will use a grey scale).
As stochastic maps
include the starting and ending pixel
at each node,
it is possible to measure
the distance traveled by a particle
and its speed.
Use the command diff speed
to retrieve general speed results:
phygeo diff speed --tree speed --step 5 --box 5 -i ml-project.tab project.tab > speed.txt
This example
produces a tree
with the speed values colored
in a rainbow color scheme
(faster lineages in red,
slower in blue,
see this example file),
in an svg format,
and a log file
(here is an example file),
with the speed
and distance traveled on average
for each node.
With this command,
it is also possible
to measure the speed
at different time stages
(using the flag --time
).
Consult the help diff speed
command
to learn more about this command.
Here is the suggested citation for the PhyGeo tool:
Arias, J.S. (2023) PhyGeo: a tool for phylogenetic biogeography. (available at: https://github.com/js-arias/phygeo). doi:10.5281/zenodo.10636373.
The description of the method is available as a pre-print:
Arias, J.S. (2023) Phylogenetic biogeography inference using dynamic paleogeography models and explicit geographic ranges. BioRXiv 2023.11.16.567427. doi:10.1101/2023.11.16.567427.
See the manuscript for more information about previous proposals for phylogenetic biogeography methods using explicit geographic ranges or the diffusion model.
- Paleogeographic models for PhyGeo
- Sample PhyGeo datasets:
- Other tools:
The best way to contribute to the package is by running the program, detecting bugs, or asking for features. Use the tab issues to file a bug or ask for a feature.
If you like programming, you can create tools and packages to import export, or analyze data and results to or from PhyGeo. If you send me the link, I will post the link of your tool or package.
Of course, this package is open source, so you can modify it at your will!
Copyright © 2023 J. Salvador Arias jsalarias@gmail.com. All rights reserved. Distributed under BSD2 licenses that can be found in the LICENSE file.