# Alignment to ancestors pipeline in individual steps

We recommend performing ancestral inference in a high-performance computing environment. When working locally or in Colab, this notebook serves as an example that lets the user see how changing parameters affects ancestral reconstruction on a toy dataset, without needing a computing cluster. See *03_alignment_to_ancestors.ipynb* to run the entire pipeline on topiary's toy dataset with one code block.


The alignment-to-ancestors pipeline takes an alignment, finds the best phylogenetic model, builds a maximum-likelihood (ML) gene tree, reconciles this tree with the species tree, infers ancestral protein sequences, and then determines statistical supports for the existence of each ancestor. Because these steps are computationally intensive and have different parallelization requirements, the pipeline is broken into two scripts: *alignment-to-ancestors* and *bootstrap-reconcile*.

<a href="https://githubtocolab.com/harmslab/topiary-examples/blob/main/notebooks/04_alignment_to_ancestors_step_wise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Setup
Start by running the cells below to set up topiary and all required software.

In [None]:
### THIS CELL SETS UP TOPIARY IN A GOOGLE COLAB ENVIRONMENT. 
### IF RUNNING THIS NOTEBOOK LOCALLY, IT MAY BE SAFELY DELETED.

#@title Install software

#@markdown #### Installation requires two steps.

#@markdown 1. Install the software by pressing the _Play_ button on the left.
#@markdown Please be patient. This will take several minutes. <font color='teal'>
#@markdown After the  installation is complete, the kernel will reboot 
#@markdown and Colab will complain that the session crashed. This is normal.</font>
#@markdown <br/>

install_raxml = True    #@param {type:"boolean"}
install_generax = True  #@param {type:"boolean"}

#@markdown 2. After this cell runs, run the "Initialize environment" cell that follows.

try:
    import google.colab
    RUNNING_IN_COLAB = True
except ImportError:
    RUNNING_IN_COLAB = False
except Exception as e: 
    err = "Could not figure out if running in a colab notebook\n"
    raise Exception(err) from e

if RUNNING_IN_COLAB:

    import os
    os.chdir("/content/")

    import urllib.request
    urllib.request.urlretrieve("https://raw.githubusercontent.com/harmslab/topiary-examples/main/notebooks/colab_installer.py",
                              "colab_installer.py")

    import colab_installer
    colab_installer.install_topiary(install_raxml=install_raxml,
                                    install_generax=install_generax)

In [None]:
### IF RUNNING LOCALLY, ACTIVATE THE TOPIARY ENVIRONMENT IN CONDA
### AND RE-OPEN THIS NOTEBOOK.

import topiary
import numpy as np
import pandas as pd 

### EVERYTHING AFTER THIS LINE IS USED TO SET UP TOPIARY IN A GOOGLE
### COLAB ENVIRONMENT. IF RUNNING THIS NOTEBOOK LOCALLY, THE LINES BELOW
### IN THIS CELL MAY BE SAFELY DELETED. 

#@title Initialize environment

#@markdown  Run this cell to initialize the environment after installation.
#@markdown (This cell can also be run if the kernel dies during a calculation,
#@markdown allowing you to reload modules without having to
#@markdown reinstall). 

#@markdown We recommend setting up a working directory on your google drive. This is a 
#@markdown convenient way to pass files to topiary and will allow you to save
#@markdown your work. For example, if you type `topiary_work` into the form
#@markdown field below, topiary will save all of its calculations in the 
#@markdown `topiary_work` directory in MyDrive (i.e. the top directory at
#@markdown https://drive.google.com). This script will create the directory if 
#@markdown it does not already exist. If the directory already exists, any files
#@markdown that are already in that directory will be available to topiary. You could, 
#@markdown for example, put a file called `seed.csv` in `topiary_work` and then
#@markdown access it as "seed.csv" in all cells below.

#@markdown Note: Google may prompt you for permission to access the drive. 
#@markdown To work in a temporary colab environment, leave this blank. 

# Select a working directory on google drive
google_drive_directory = "" #@param {type:"string"}

try:
    import google.colab
    RUNNING_IN_COLAB = True
except ImportError:
    RUNNING_IN_COLAB = False
except Exception as e: 
    err = "Could not figure out if running in a colab notebook\n"
    raise Exception(err) from e

if RUNNING_IN_COLAB:

    import os
    os.chdir("/content/")

    topiary._in_notebook = "colab"
    import colab_installer
    colab_installer.initialize_environment()
    colab_installer.mount_google_drive(google_drive_directory)

--------

# *Alignment-to-ancestors*

--------

# 00. Infer the evolutionary model

The first step in a maximum likelihood phylogenetic analysis is determining the maximum likelihood model of sequence evolution. This includes the matrix for amino acid substitution (e.g., LG, JTT, WAG), the stationary frequencies for that model, rate variation parameters (𝚪 distribution, rate categories, etc.), and the proportion of invariant sites. Topiary uses a conventional method to find the best model (Abascal F, 2005). It uses RAxML-NG to generate a maximum parsimony tree from the alignment. It then optimizes branch lengths and other parameters on that tree for each of the 360 combinations of model parameters implemented in the computational library that underlies RAxML-NG and GeneRax. Finally, it ranks these models based on the corrected Akaike Information Criterion (AICc), which penalizes models with excess parameters to prevent overfitting.
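The count of 360 follows directly from the four groups of options passed to `find_best_model` in the cell below (20 matrices, 2 rate settings, 3 frequency settings, 3 invariant-site settings). The sketch below only enumerates the combinations. Joining the pieces with `+` is an assumption about how a model string might be written, and the helper `model_string` is hypothetical rather than part of topiary.

```python
from itertools import product

# Option lists copied from the find_best_model call in this notebook.
matrices = ["cpREV", "Dayhoff", "DCMut", "DEN", "Blosum62", "FLU", "HIVb",
            "HIVw", "JTT", "JTT-DCMut", "LG", "mtART", "mtMAM", "mtREV",
            "mtZOA", "PMB", "rtREV", "stmtREV", "VT", "WAG"]
rates = ["", "G8"]
freqs = ["", "FC", "FO"]
invariant = ["", "IC", "IO"]


def model_string(matrix, rate, freq, inv):
    """Hypothetical helper: join the non-empty pieces with '+'."""
    return "+".join(part for part in (matrix, rate, freq, inv) if part)


# One entry per combination of the four option lists.
all_models = [model_string(*combo)
              for combo in product(matrices, rates, freqs, invariant)]

print(len(all_models))
print(all_models[:5])
```

Passing shorter lists for any of the four options shrinks the search proportionally, which can be useful when exploring parameters on a laptop or in Colab.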

Although this protocol is done automatically, topiary returns a variety of statistics including AIC, AICc, and BIC to help users who want more control over model selection. Via the API, users can also specify a custom input tree or a subset of the models to test. (Note: as of the current version, topiary excludes the LG4M and LG4X models, as these cause GeneRax to crash during gene-species tree reconciliation).
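For readers who want to compare models by hand, the three statistics can be computed from a model's log-likelihood, its number of free parameters, and the sample size. The sketch below uses the standard textbook formulas. Using the alignment length as the sample size is an assumption, the function name is hypothetical, and the numbers are made up for illustration rather than taken from a topiary run.

```python
import math


def information_criteria(log_likelihood, num_params, num_sites):
    """Return AIC, AICc, and BIC for one fitted model.

    Assumes num_sites (used as the sample size) exceeds num_params + 1.
    """
    aic = 2 * num_params - 2 * log_likelihood
    # AICc adds a small-sample correction to AIC.
    correction = (2 * num_params * (num_params + 1)) / (num_sites - num_params - 1)
    aicc = aic + correction
    bic = num_params * math.log(num_sites) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


# Made-up values: model_B fits slightly better but has more parameters.
fits = {
    "model_A": information_criteria(log_likelihood=-5000.0, num_params=50, num_sites=300),
    "model_B": information_criteria(log_likelihood=-4990.0, num_params=70, num_sites=300),
}

# Lower is better for all three criteria.
best_by_aicc = min(fits, key=lambda name: fits[name]["AICc"])
print(best_by_aicc, fits[best_by_aicc])
```

In this toy comparison the extra parameters of `model_B` cost more than its gain in likelihood, so the simpler model is preferred.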


This cell can be run without updating any parameters. For a full description of the meanings of all parameters, see the [topiary documentation](https://topiary-asr.readthedocs.io/en/latest/topiary.pipeline.html#module-topiary.pipeline.alignment_to_ancestors).

In [None]:
# 00. Find the best phylogenetic model

# This cell takes the alignment and finds the best evolutionary model
# to explain relationships of sequences in the alignment. The output
# from this cell shows the comparison of all 360 combinations of
# defined model parameters conducted to find the best model. 

# Colab users: If you did not set a working directory on your Google 
# Drive to save your results to, remove the "../../../" from the 
# alignment_dataframe path.

alignment_dataframe = "../../../software/topiary/tests/data/small-phylo/dataframe.csv"

# Local users: Access the small-phylo alignment dataframe by commenting out the 
# path above and using the following path instead:

# alignment_dataframe = "../data/dataframe.csv"

alignment_df = pd.read_csv(alignment_dataframe)

topiary.find_best_model(df=alignment_df,
                        calc_dir="00_find-model",                        
                        starting_tree=None,
                        #seed=12345,
                        model_matrices=["cpREV","Dayhoff","DCMut","DEN","Blosum62","FLU","HIVb","HIVw","JTT","JTT-DCMut","LG","mtART","mtMAM","mtREV","mtZOA","PMB","rtREV","stmtREV","VT","WAG"],
                        model_rates=["","G8"],
                        model_freqs=["","FC","FO"] ,
                        model_invariant=["","IC","IO"],
                        num_threads=-1,
                        restart=False,
                        overwrite=False)


# 01. Build a maximum likelihood gene tree

Topiary next infers an ML gene tree using the inferred phylogenetic model with the default RAxML-NG settings for the “--search” protocol. This starts the inference from 10 random trees and 10 different parsimony trees. It then optimizes the tree topology using an SPR subtree cutoff of 1, with an automatically selected fast versus slow SPR radius. Branch lengths are optimized using the NR-FAST algorithm. The tree with the highest likelihood is selected and used for downstream analyses. Advanced users have full access to all RAxML-NG options.

This cell builds an ML tree using the best model found in the previous step.

In [None]:
# 01. Generate the maximum likelihood tree

topiary.generate_ml_tree(prev_calculation="00_find-model",
                        calc_dir="01_ml-tree",
                        num_threads=-1,
                        bootstrap=False,
                        restart=False,
                        overwrite=False)


# 02. Reconcile gene and species trees

The next step in the pipeline is to reconcile the gene tree with the species tree. This automatically roots the tree and has been shown to improve the quality of reconstructed sequences (Groussin M, 2015). For this purpose, we use GeneRax, a new high-performance program for reconciling gene and species trees. Unlike other, heuristic, methods, GeneRax uses an explicit likelihood framework (Morel B, 2020). The final tree is the maximum likelihood estimate for an evolutionary model that includes both sequence evolution (e.g., LG) and evolutionary events (speciation, duplication, loss, and, if specified, lateral gene transfer).

To do this, topiary uses the ML evolutionary model and ML gene tree inferred previously as inputs to GeneRax. For the rooted species tree, topiary automatically downloads the most recent synthetic tree from the Open Tree of Life (OTL) database (Rees J, 2017; McTavish EJ, 2021). (Previous steps in the pipeline ensure that all sequences that have made it to this step come from species that are present in the OTL database.) Any polytomies in this tree are resolved arbitrarily prior to the reconciliation inference. Topiary runs GeneRax with the default parameters and the UndatedDL model (Morel B, 2020). The UndatedDL model accounts for duplication and loss events. Topiary users can instead select the other implemented model, UndatedDTL, which also allows lateral transfer, if they anticipate that lateral gene transfer occurs for their gene of interest.
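To make "resolved arbitrarily" concrete, the sketch below turns any node with more than two children into a series of two-child nodes. It is an illustration only: the nested-tuple tree representation and the function name are assumptions, branch lengths are ignored, and this is not the code topiary uses.

```python
def resolve_polytomies(node):
    """Return a copy of a nested-tuple tree in which every internal node
    has at most two children.

    Leaves are strings; internal nodes are tuples of child nodes.
    """
    if isinstance(node, str):
        return node
    children = [resolve_polytomies(child) for child in node]
    # Arbitrary choice: repeatedly pair up the first two children.
    while len(children) > 2:
        children = [(children[0], children[1])] + children[2:]
    return tuple(children)


# Toy species tree with two polytomies (the root and the mammal clade).
species_tree = ("human", ("mouse", "rat", "rabbit"), "chicken")
resolved = resolve_polytomies(species_tree)
print(resolved)
```

Because the pairing order is arbitrary, a different but equally valid resolution would group, for example, `rat` with `rabbit` first.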

The resulting tree is a maximum likelihood reconciled gene-species tree with optimized branch lengths and nodes labeled with inferred evolutionary events (speciation, duplication, or transfer). GeneRax returns a variety of other outputs that are made accessible to topiary users, but only the reconciled tree is used further in the pipeline.

This cell uses GeneRax software to improve the likelihood of the final
topology of the gene tree by reconciling the ML gene tree with the species
tree. GeneRax will only make changes to the ML gene tree if the
current topology requires a complicated series of evolutionary steps
relating sequences that could be more easily explained if the
topology matched the species tree.

If reconstructing microbe-specific ancestral proteins, reconciling
your gene tree with the species tree might not improve the confidence
in the final tree. Topiary will not reconcile a gene tree to the species tree if there are only microbial genes present unless the user sets `--force_reconcile`. In this case, the user may also opt to allow the probability of horizontal and/or lateral gene transfer between species to play a role in building the reconciled tree. To do this, use the flags `--horizontal_gene_transfer` and/or `--UndatedDTL`.

The user can also use `--force_no_reconcile` if they do not want to reconcile the ML gene tree with the species tree. Note that when allowing reconciliation, ancestors and statistical supports will be built for both the ML gene tree and the species-reconciled tree (see the [topiary documentation](https://topiary-asr.readthedocs.io/en/latest/protocol.html#interpret-the-results)).

This cell can be run without updating any parameters. For a full description of the meanings of all parameters, see the [topiary documentation](https://topiary-asr.readthedocs.io/en/latest/topiary.pipeline.html#module-topiary.pipeline.alignment_to_ancestors).

In [None]:
# 02. Reconcile the ML gene tree with species tree

topiary.reconcile(prev_calculation="01_ml-tree",
                  calc_dir="02_reconciliation",
                  species_tree=None,
                  horizontal_transfer=False,
                  num_threads=1, 
                  bootstrap=False,
                  restart=False,
                  overwrite=False)


# 03. Reconstruct ancestors

The next step is to infer sequences of ancestral nodes on the reconciled gene-species tree. For this, we use RAxML-NG, which implements a standard marginal ancestral reconstruction method (Yang Z, 1995). (This differs from previous versions of RAxML, which used a non-standard reconstruction method that was not comparable to other approaches.) RAxML-NG finds the amino acid at each site in each ancestor that maximizes the likelihood of observing the sequence alignment given the tree, branch lengths, and phylogenetic model. This returns a matrix of posterior probabilities for each amino acid at each site in the alignment for each ancestral node. Topiary extracts the sequence of the maximum likelihood ancestor, as well as the so-called altAll version of the ancestor that incorporates alternate reconstructed amino acids at ambiguous positions. It uses a default cutoff of 0.25 to identify ambiguous sites (Eick GN, 2016); this can be set by the user.
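The relationship between the posterior probability matrix, the ML ancestor, and the altAll ancestor can be sketched with NumPy. This is a minimal illustration under stated assumptions: the function name and toy matrix are hypothetical, a site is treated as ambiguous when its second-best amino acid has a posterior probability at or above the cutoff, and topiary's own implementation may differ in its details.

```python
import numpy as np

AMINO_ACIDS = np.array(list("ACDEFGHIKLMNPQRSTVWY"))


def ml_and_altall(posteriors, alt_cutoff=0.25):
    """Return (ML sequence, altAll sequence, ambiguous-site mask).

    posteriors has shape (num_sites, 20), columns ordered as AMINO_ACIDS.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    # Column indices sorted from most to least probable at each site.
    order = np.argsort(posteriors, axis=1)[:, ::-1]
    best, second = order[:, 0], order[:, 1]
    second_pp = posteriors[np.arange(len(posteriors)), second]
    # Assumption: ">=" is used for the cutoff comparison.
    ambiguous = second_pp >= alt_cutoff
    alt = np.where(ambiguous, second, best)
    return "".join(AMINO_ACIDS[best]), "".join(AMINO_ACIDS[alt]), ambiguous


def toy_site(**probs):
    """Build one row of the matrix from keyword arguments such as A=0.95."""
    row = np.zeros(len(AMINO_ACIDS))
    for aa, pp in probs.items():
        row[np.where(AMINO_ACIDS == aa)[0][0]] = pp
    return row


# Three toy sites: confident, ambiguous, and below the cutoff.
pp_matrix = np.vstack([
    toy_site(A=0.95, C=0.05),
    toy_site(L=0.60, I=0.40),
    toy_site(K=0.70, R=0.20, Q=0.10),
])

ml_seq, alt_seq, ambiguous = ml_and_altall(pp_matrix, alt_cutoff=0.25)
print(ml_seq, alt_seq, ambiguous)
```

Only the second site differs between the two sequences, because it is the only site whose runner-up amino acid reaches the 0.25 cutoff. Raising `alt_cutoff` makes the altAll sequence converge on the ML sequence.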

The evolutionary models used by RAxML-NG do not explicitly treat gaps; therefore, the first draft of the reconstructed ancestor will be ungapped. Topiary assigns gaps by treating them as characters during ancestral character reconstruction (ACR). For this purpose, topiary uses the DOWNPASS (Maddison DR, 2000) algorithm as implemented by the PastML package (Ishikawa SA, 2019). The final output for this step consists of the gapped sequences of the maximum likelihood and altAll ancestors for each node. These have associated statistical supports: posterior probabilities for each reconstructed amino acid and support for gaps. Topiary also outputs a variety of summary graphs to help select high-quality sequences.
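The idea of treating a gap as a character can be illustrated with a Fitch-style bottom-up parsimony pass on a single alignment column. This is a simplified sketch, not PastML's DOWNPASS implementation: it performs only the bottom-up pass, and the nested-tuple tree, state labels, and function name are assumptions made for the example.

```python
def fitch_bottom_up(node, leaf_states, node_sets):
    """Compute the candidate state set for every node of a binary tree.

    Leaves are strings; internal nodes are 2-tuples of child nodes.
    Results are stored in node_sets, keyed by node.
    """
    if isinstance(node, str):
        states = {leaf_states[node]}
    else:
        left, right = (fitch_bottom_up(child, leaf_states, node_sets)
                       for child in node)
        common = left & right
        # Keep shared states if any exist; otherwise keep both options.
        states = common if common else left | right
    node_sets[node] = states
    return states


# One alignment column: sequences A and B have a gap, C and D do not.
tree = ((("A", "B"), "C"), "D")
column = {"A": "gap", "B": "gap", "C": "residue", "D": "residue"}

node_sets = {}
root_states = fitch_bottom_up(tree, column, node_sets)

print(node_sets[("A", "B")])         # ancestor of A and B
print(node_sets[(("A", "B"), "C")])  # ancestor of A, B, and C
print(root_states)                   # root of the tree
```

In this toy column the ancestor of A and B is reconstructed with a gap, the root is reconstructed with a residue, and the node in between is left undecided by the bottom-up pass alone.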


In [None]:
# 03. Infer ancestral proteins.

topiary.generate_ancestors(prev_calculation="02_reconciliation",
                          calc_dir="03_ancestors",
                          num_threads=1,
                          alt_cutoff=0.25,
                          restart=False,
                          overwrite=False)


# Assess posterior probabilities for individual ancestral sequences

It is useful to visualize ancestral sequence posterior probabilities (PP) in detail at this stage (see output from the cell above). In particular, it is important to check ancestral sequences with low average posterior probability labels in the summary tree. Regions where the ML and altAll constructions have similar posterior probability represent amino acids in the protein sequence that are highly ambiguous given the multiple sequence alignment. Generally, we recommend moving forward with ancestors with high (>0.85) PP. However, functionally competent ancestors have been resurrected from reconstructions with an average PP > 0.75.

Assess whether there is enough statistical support for the reconstructed amino acid sequences of poorly supported ancestors. If there is not, it may help to add more sequences to the multiple sequence alignment to provide additional sequence signal for that particular ancestor.
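A per-ancestor summary of the average posterior probability can make this check quicker. The sketch below runs on a small made-up table. The column names `anc`, `site`, and `ml_pp` are hypothetical stand-ins, so inspect the columns of your own `ancestor-data.csv` and adapt the names before using it. The "high", "moderate", and "low" labels simply encode the 0.85 and 0.75 guidelines from the text above.

```python
import pandas as pd

# Made-up per-site data for two ancestors (column names are hypothetical).
toy = pd.DataFrame({
    "anc": ["anc1"] * 4 + ["anc2"] * 4,
    "site": [0, 1, 2, 3] * 2,
    "ml_pp": [0.99, 0.95, 0.90, 0.96, 0.80, 0.55, 0.70, 0.75],
})


def classify(mean_pp):
    """Label an ancestor using the guideline thresholds from the text."""
    if mean_pp > 0.85:
        return "high"
    if mean_pp > 0.75:
        return "moderate"
    return "low"


# Average the ML posterior probability over all sites of each ancestor.
summary = toy.groupby("anc")["ml_pp"].mean().to_frame("mean_pp")
summary["support"] = summary["mean_pp"].apply(classify)
print(summary)
```

Ancestors labeled "low" are the ones most likely to benefit from adding sequences to the alignment.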

In [None]:
#@title (Recommended) View amino acid-level statistical support for ancestral sequences  

#@markdown The ancestor-data.csv file
#@markdown shows posterior probability values calculated for the
#@markdown maximum likelihood and next most likely amino acid
#@markdown (or alternate state) at each site along the reconstructed
#@markdown ancestral sequences. The location of the file is shown
#@markdown below and its contents can be viewed by running the cell.

ancestor_data = "03_ancestors/output/reconciled-tree_ancestors/ancestor-data.csv" #@param {type:"string"}

df = pd.read_csv(ancestor_data)
df

--------

# *Bootstrap-reconcile*

--------

# 04. Branch supports

To determine branch supports, topiary uses non-parametric bootstrapping (Felsenstein J, 1985). Briefly, RAxML-NG generates pseudoreplicate alignments by sampling columns, with replacement, from the input alignment. RAxML-NG then infers an evolutionary tree for each of these alignments. Topiary generates up to 1,000 bootstrap pseudoreplicates, using RAxML-NG’s automatic Extended Majority Rules (autoMRE) method with a cutoff of 0.03 to determine the exact number. The output from RAxML-NG is a collection of pseudoreplicate alignments and pseudoreplicate gene trees. Because we are reconstructing ancestors on the reconciled tree, we pass each pseudoreplicate alignment and gene tree into GeneRax for gene-species tree reconciliation, yielding a final collection of pseudoreplicate reconciled trees. Topiary then uses RAxML-NG to map these pseudoreplicate trees onto the ML reconciled tree as branch supports. Topiary also assesses convergence for the branch support estimate using the `--bsconverge` option.
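Sampling columns with replacement is easy to picture on a tiny alignment. The sketch below shows the resampling step only, using NumPy. It is not how RAxML-NG is invoked, and the function name, toy sequences, and seed are assumptions for illustration.

```python
import numpy as np


def bootstrap_alignment(sequences, rng):
    """Return one pseudoreplicate alignment and the sampled column indices.

    sequences is a list of equal-length aligned strings.
    """
    matrix = np.array([list(seq) for seq in sequences])
    num_sites = matrix.shape[1]
    # Draw as many columns as the alignment has, with replacement.
    columns = rng.integers(0, num_sites, size=num_sites)
    resampled = matrix[:, columns]
    return ["".join(row) for row in resampled], columns


alignment = ["MKTAYIA", "MKSAYLA", "MRTA-IA"]
rng = np.random.default_rng(12345)  # fixed seed so the example is repeatable

replicate, columns = bootstrap_alignment(alignment, rng)
for original, resampled in zip(alignment, replicate):
    print(original, "->", resampled)
print("sampled columns:", columns)
```

Every sequence is resampled with the same column indices, so each column of the pseudoreplicate is still a real column of the original alignment. Some columns appear more than once and others not at all.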

This cell first reconciles each bootstrap pseudoreplicate ML gene tree.
It then calculates branch support values from the frequencies of seeing
the same ancestral nodes found in the species-reconciled ML gene tree.
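The frequency-based support described above can be sketched by representing each ancestral node as the set of descendant leaves beneath it and counting how often that set reappears among the pseudoreplicate trees. This is a simplified illustration on rooted nested-tuple trees with hypothetical function names. RAxML-NG performs the real calculation, and its treatment of tree rooting may differ.

```python
def collect_clades(node, found):
    """Return the leaf set under node and record every internal node's
    leaf set in found."""
    if isinstance(node, str):
        return frozenset([node])
    leaves = frozenset().union(*(collect_clades(child, found) for child in node))
    found.add(leaves)
    return leaves


def branch_supports(reference_tree, replicate_trees):
    """Fraction of replicate trees that contain each reference clade."""
    reference_clades = set()
    collect_clades(reference_tree, reference_clades)
    counts = {clade: 0 for clade in reference_clades}
    for tree in replicate_trees:
        replicate_clades = set()
        collect_clades(tree, replicate_clades)
        for clade in reference_clades & replicate_clades:
            counts[clade] += 1
    return {clade: count / len(replicate_trees) for clade, count in counts.items()}


# Toy reference tree and four toy pseudoreplicate trees.
reference = ((("A", "B"), "C"), "D")
replicates = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
    ((("A", "C"), "B"), "D"),
    (("A", "B"), ("C", "D")),
]

supports = branch_supports(reference, replicates)
for clade, support in sorted(supports.items(), key=lambda item: len(item[0])):
    print(sorted(clade), support)
```

An ancestor that appears in most pseudoreplicate trees receives a support close to 1, while one that depends on a fragile grouping of sequences receives a lower value.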


In [None]:
# 04. Determine statistical supports for the existence of each reconstructed ancestor

topiary.generate_bootstraps(prev_calculation="03_ancestors",
                            calc_dir="04_bootstraps",
                            num_threads=-1,
                            restart=False,
                            overwrite=False)


# 05. Summary reports

Execute this cell to generate an interactive html report summarizing
the results of your ancestral sequence reconstruction.

In [None]:
# 05. Create an html report for the calculation

topiary.tree_report(ancestor_directory="03_ancestors",
                    tree_directory="04_bootstraps",
                    output_directory="05_reports")

# Interpret the results

See the [topiary documentation](https://topiary-asr.readthedocs.io/en/latest/protocol.html#interpret-the-results) for a detailed description of how to determine whether a particular ancestral sequence has reasonable statistical support to have existed. Such ancestors can be resurrected and functionally characterized.