Fault-Tolerant RAxML Next Generation (FT-RAxML-NG)

Introduction

RAxML-NG is a phylogenetic tree inference tool which uses maximum-likelihood (ML) optimality criterion. Its search heuristic is based on iteratively performing a series of Subtree Pruning and Regrafting (SPR) moves, which allows to quickly navigate to the best-known ML tree. RAxML-NG is a successor of RAxML (Stamatakis 2014) and leverages the highly optimized likelihood computation implemented in libpll (Flouri et al. 2014).

RAxML-NG offers improvements in speed, flexibility and user-friendliness over the previous RAxML versions. It also implements some of the features previously available in ExaML (Kozlov et al. 2015), including checkpointing and efficient load balancing for partitioned alignments (Kobert et al. 2014).

Fault Tolerance

FT-RAxML-NG is expanded to include support for failing nodes during computations on large clusters. ULFM is used to detect failures and fix the MPI communicator. ULFM is a expansion of the OpenMPI MPI implementation. To get failure tolerance, you need to compile FT-RAxML-NG against ULFM and run in using ULFM's mpirun. FT-RAxML-NG will then automatically detect and mitigate failing ranks.

Installing ULFM on a cluster

If your cluster environment does not include an up-to date ULFM installation, you can try installing one into your home directory.

First, create the target directry, for example mkdir "$HOME/ulfm".
Download and extract the current version of ULFM from the official repo¹
Configure ULFM using ./configure --prefix="$HOME/ulfm"
Build using make -j
Install ULFM locally using make install

¹ Note, that we encountered problems with ULFM 4.0.2 on some systems. You can also try current development version from git. We had success on our systems with commit 0823ee3e57d.

Ensure to set the following environment variables when compiling and running FT-RAxML-NG.

# In this example $HOME/ulfm" is the prefix we installed ULFM into
export PATH="$HOME/ulfm/bin:$PATH"
export CPATH="$HOME/ulfm/src:$CPATH"
export LD_LIBRARY_PATH="$HOME/ulfm/lib:$LD_LIBRARY_PATH"

Your can verify that ULFM is used instead of a pre-installed MPI implementation by checking the output of which mpirun, which should point to your just installed binary. You can then proceed to build the MPI version of FT-RAxML-NG (see below).

Installing (FT-)RAxML-NG

Please clone this repository and build RAxML-NG from scratch.

Install the dependecies. On Ubuntu (and other Debian-based systems), you can simply run:

sudo apt-get install flex bison libgmp3-dev

For other systems, please make sure you have following packages/libraries installed:
GNU Bison Flex GMP

Build FT-RAxML-NG.

MPI version:

git clone --recursive https://github.com/lukashuebner/ft-raxml-ng
cd ft-raxml-ng
mkdir build && cd build
cmake -DUSE_MPI=ON ..
make

Running FT-RAxML-NG

Make sure you are using the mpirun shipped with ULFM. You can test this using which mpirun. We recommend the following settings to reduce the number of false positive failure reports:

mpirun -n $SLURM_NTASKS \
  --mca mpi_ft_enable true \
  --mca mpi_ft_detector_thread true \
  --mca mpi_ft_detector_period 0.3 \
  --mca ft_detector_timeout 1 \
  ../raxml-ng-mpi --search --msa inputMSA.phy --model inputModel.model --threads 1

This will increase ULFM's heartbeat period to 300 ms, and timeout to 1 s. It will also enable a separate thread responsible for sending and receiving heartbeats. See this article on the ULFM website for details.

Limitations

Some RAxML-ng features are not failure-tolerant yet. For example, currently only -search mode without bootstrap replicas is supported. Also, only fine-grained parallelization with a single thread per rank is supported.

Numerical Instability of Allreduce Operations

Allreduce operations on floating-point values are numerically unstable. If the number of PEs which take part in the allreduce operation changes, the result might change as well. This is, because floating-point operations are only approximately associative and commutative. The changed order of operations will cause the small inaccuracies to pile up differently. This has impacts on the reproducibility of tree searches. When no failure occurred, we can always reproduce a result by using the same number of nodes, cores per node, SIMD kernel, compiler, linker, OS version and the same random seed on the same system. If a failure occurred, RAxML-ng will conduct different allreduce operations with a different number of PEs. To reproduce this result, we would have to simulate a failure at the exact same moment in the tree search. Implementing either a numerically stable allreduce operation or a failure-log to enhance reproducibility is subject of future work.

Documentation and Support

RAxML-NG, including documentation, a tutorial, and support can be found here.

License and citation

The code is currently licensed under the GNU Affero General Public License version 3.

When using the fault-tolerant FT-RAxML-NG, please cite this paper:

Lukas Hübner, Alexey M. Kozlov, Demian Hespe, Peter Sanders, Alexandros Stamatakis (2021) Exploring Parallel MPI Fault Tolerance Mechanisms for Phylogenetic Inference with RAxML-NG Preprint at bioRxiv doi:10.1101/2021.01.15.426773

Additionally, please mention that FT-RAxML-NG is an extension of RAxML-NG and cite this paper as well:

Alexey M. Kozlov, Diego Darriba, Tomáš Flouri, Benoit Morel, and Alexandros Stamatakis (2019) RAxML-NG: A fast, scalable, and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics, btz305 doi:10.1093/bioinformatics/btz305

References

Stamatakis A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9): 1312-1313. doi:10.1093/bioinformatics/btu033
Flouri T., Izquierdo-Carrasco F., Darriba D., Aberer AJ, Nguyen LT, Minh BQ, von Haeseler A., Stamatakis A. (2014) The Phylogenetic Likelihood Library. Systematic Biology, 64(2): 356-362. doi:10.1093/sysbio/syu084
Kozlov A.M., Aberer A.J., Stamatakis A. (2015) ExaML version 3: a tool for phylogenomic analyses on supercomputers. Bioinformatics (2015) 31 (15): 2577-2579. doi:10.1093/bioinformatics/btv184
Kobert K., Flouri T., Aberer A., Stamatakis A. (2014) The divisible load balance problem and its application to phylogenetic inference. Brown D., Morgenstern B., editors. (eds.) Algorithms in Bioinformatics, Vol. 8701 of Lecture Notes in Computer Science. Springer, Berlin, pp. 204–216

Name		Name	Last commit message	Last commit date
Latest commit History 486 Commits
libs		libs
scripts		scripts
src		src
test		test
.gitignore		.gitignore
.gitmodules		.gitmodules
.travis.yml		.travis.yml
CMakeLists.txt		CMakeLists.txt
LICENSE.txt		LICENSE.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fault-Tolerant RAxML Next Generation (FT-RAxML-NG)

Introduction

Fault Tolerance

Installing ULFM on a cluster

Installing (FT-)RAxML-NG

Running FT-RAxML-NG

Limitations

Numerical Instability of Allreduce Operations

Documentation and Support

License and citation

References

About

Releases

Packages

Languages

License

lukashuebner/ft-raxml-ng

Folders and files

Latest commit

History

Repository files navigation

Fault-Tolerant RAxML Next Generation (FT-RAxML-NG)

Introduction

Fault Tolerance

Installing ULFM on a cluster

Installing (FT-)RAxML-NG

Running FT-RAxML-NG

Limitations

Numerical Instability of Allreduce Operations

Documentation and Support

License and citation

References

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages