Skip to content

Latest commit

 

History

History
122 lines (91 loc) · 8.08 KB

developer-guide.md

File metadata and controls

122 lines (91 loc) · 8.08 KB

This document will be used to document the code. Algorithms are given in the papers and will not be repeated here.

The project comes with an .project file that can be imported to Eclipse. Make sure you select at least Java 1.6 for this to compile fine in your Eclipse.

Versioning

Please update the version number in the Commandline.java file everytime the code changed. After updating the version, you need to commit to the repository the new file. To do that,

  • run make.sh
  • remove the old zip file
  • git add the new zip file
  • commit to git

Design

The code is designed such that various phylogeny reconstruction methods based on dynamic programming could be implemented. This generality creates some complications, because lots of functions have to be abstracted out to base classes. Currently, the general algorithm is instantiated twice, once for ASTRAL, and once for DynaDup. While ASTRAL is under constant development, testing, and use, DynaDup has not been tested or used in a long time (it may or may not be functioning at this point).

By Java convention, we use an I prefix for interfaces, and an Abstract prefix for abstract classes that need to be instantiated for each individual algorithm (e.g., for ASTARL). Instances for DynaDup are prefixed with DL and those relevant to ASTRAL are prefixed by WQ.

Another important point to consider is that the code was originally based on phylonet and inherits some of phylonet's structure to date. Many of the classes are used in binary form and as is from phylonet. A few of them are overwritten in ASTRAL.

Code components

The code is in several packages but phylonet.coalescent is where the majority of the code resides. When we don't mention a package, we mean this place.

General

In ASTRAL, each clade or partition is represented as a BitSet.

  • phylonet.util.BitSet: This simply modifies Java's Bitset class for improved efficiency and usability.
  • phylonet.tree.model.sti.STITreeCluster: This is a cluster, which is simply a subset of the set of leaves. Each Cluster should be associated with a taxon identifier, and inside includes a bitset.
    • phylonet.tree.model.sti.STITreeCluster.Vertex: This is an inner class inside STITreeCluster, which simply constitutes a node in the dynamic programming. It has various extra fields (in addition to the implicit reference to its outter class, which is the cluster in the dynamic programming) for backtracking of the dynamic programming.

Then, there are interfaces and abstract classes:

  • IClusterCollection, AbstractClusterCollection: Keep the set of clusters (used as the set X in ASTRAL)
    • Always has a top cluster; all other clusters are a subset of the top cluster
    • Has the important capacity to generate a subset of itself limited only to those clusters that are a subset of a given cluster; thus, it can generate a subset of itself with a different top element
    • Can find all resolutions of the top element among pairs of clusters in it. Useful in the dynamic programming.
  • AbstractDataCollection: Mostly just has methods for setting up the set X. Thus, this sets up and maintains an instance of IClusterCollection
  • AbstractInference: The hub for all types of operations that ASTRAL can perform (dynamic programming inference, scoring, building set X, etc.). Actual computations are delegated to other classes. This just "manages" stuff. Has access to input options, to data collections, and to weight calculators. Sets everything up and starts computations.
  • AbstractComputeMinCostTask: This is where the dynamic programming algorithm is actually implemented. Uses Vertex and IClusterCollection, in addition to the WeightCalculator to perform the dynamic programming.
  • AbstractWeightCalculator: Subclasses of this are where the weights in the dynamic programming are computed. The abstract class does little useful work. Important methods are abstract.

Relevant to ASTRAL

Classes specific to ASTRAL are:

  • WQClusterCollection: Doesn't do anything. Work done in the abstract class.
  • WQComputeMinCostTask: Doesn't do anything. Work done in the abstract class.
  • WQDataCollection: Has important functionality related to building the search space, including all the ASTRAL-II heuristics for augmenting the set X, the heuristics for completing incomplete gene trees, heuristics for handling multi-individual datasets, etc.
  • WQInference: has implementation of functions for setting up the data structures, and also:
    • Has functions for scoring a given species tree
    • Has functions (scoreBrances) for branch annotations using BipartitionWeightCalculator.
  • WQWeightCalculator: This class knows how to find the score of a tripartition from the gene trees. Has two algorithms, implemented in two nested inner classes:
    • WQWeightCalculator.SetWeightCalculator: this is the old algorithm, used in ASTRAL-I and is based on set intersections
    • WQWeightCalculator.TraversalWeightCalculator: this is the new algorithm based on tree traversal, used in ASTRAL-II
  • BipartitionWeightCalculator: This class computes a weight of Bipartition versus a set of input tripartitions. This is relevant to our local posterior probability paper and computation of branch lengths
  • Posterior: This computes posterior probabilities given quartet frequencies
  • SimilarityMatrix: This implements simple functionalities for a distance matrix. The distance matrix is used in completion of gene trees and in adding to set X.
  • Tripartition: This class represents a tripartition

Relevant to input/output

Classes relevant to the command-line GUI, including handling of input and output and taxon names are:

  • GlobalMaps: Static class, keeping instances of singleton classes that should be accessed only through here:
    • taxon identifier
    • random number generator
    • taxon name map
  • CommandLine: Reads command line input using the JSAP library and starts the correct inference based on the options provided by the user
    • Handles bootstrapping options (to be refactored out)
    • Handles creation of some singletons, such as taxon identifier, name maps, etc.
  • Options: Saves user input options. Inference class has an instance, which is created by the CommandLine.

Relevant to species names, individual names, etc.

  • TaxonIdentifier: Maps taxon names to taxon IDs.
  • TaxonNameMap: Maps the names in the gene trees to the names in the species tree
  • SpeciesMapper: Maps IDs between gene trees and the species tree
  • SingleIndividualSample.java: This class keeps track of a single-individual sample of a multi-individual dataset. For each species, we will include only one of its individuals in any instance of this class.
  • Solution: Used for building and keeping the output (not important; to be removed maybe)
  • phylonet.tree.io.NewickWriter: simple modifications to phylonet's newick writer