# Alignment to ancestors

We recommend performing the ancestral inference in a high-performance computing environment. This notebook is an example that shows the steps taken and how the code should run.

The pipeline performs the following steps:
+ Infer the maximum likelihood model of sequence evolution
+ Infer a maximum likelihood gene tree
+ Infer ancestors on the maximum likelihood gene tree
+ Generate bootstrap replicates for the maximum likelihood gene tree
+ Reconcile the gene and species trees (for non-microbial proteins)

Because of different parallelization requirements, the ancestral inference step uses two scripts run in sequence (*alignment-to-ancestors* and *bootstrap-reconcile*). The *alignment-to-ancestors* script infers the evolutionary model, builds the maximum likelihood gene tree, reconciles the gene tree with the species tree, reconstructs ancestors, and generates bootstrap pseudoreplicates of the gene tree for statistical analysis in the *bootstrap-reconcile* script. The results of each of these steps in *alignment-to-ancestors* can be visualized as summary tree PDF files written out at each step. Reconstructing ancestors should take about a day for a typical alignment (1,000 columns, 500 sequences) running on a 30-core compute node.

The *bootstrap-reconcile* script reconciles each pseudoreplicate gene tree to the species tree and constructs the final branch supports. Bootstrap sampling the gene-species reconciliation is computationally intensive but can be readily parallelized. For a full alignment, it will likely take approximately a week spread across several cores.

In this notebook, the two scripts are initiated sequentially with a single block of code.
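
The hand-off between the two steps can be sketched as a simple check on the output directory. This is a minimal sketch, not topiary's own logic: `needs_bootstrap_reconcile` is a hypothetical helper, and it assumes (as the pipeline cell later in this notebook does) that a reconciled-tree calculation leaves an entry whose name contains `reconciled` in the output directory.

```python
import glob
import os


def needs_bootstrap_reconcile(out_dir):
    """Return True if out_dir holds any entry whose name contains 'reconciled'.

    Hypothetical helper: it mirrors the glob test this notebook uses to decide
    whether to run the second (bootstrap-reconcile) step.
    """
    pattern = os.path.join(out_dir, "*reconciled*")
    return len(glob.glob(pattern)) > 0
```

If the check returns `False` (for example, because reconciliation was skipped), there are no reconciliation bootstraps for the second step to process.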

<a href="https://githubtocolab.com/harmslab/topiary-examples/blob/main/notebooks/03_alignment_to_ancestors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Setup
Run the next two cells to initialize the environment to run topiary.

In [4]:
### THIS CELL SETS UP TOPIARY IN A GOOGLE COLAB ENVIRONMENT. 
### IF RUNNING THIS NOTEBOOK LOCALLY, IT MAY BE SAFELY DELETED.

#@title Install software

#@markdown #### Installation requires two steps.

#@markdown 1. Install the software by pressing the _Play_ button on the left.
#@markdown Please be patient. This will take several minutes. <font color='teal'>
#@markdown After the  installation is complete, the kernel will reboot 
#@markdown and Colab will complain that the session crashed. This is normal.</font>
#@markdown <br/><br/>(If you wish to install raxml or generax, select the check boxes below. 
#@markdown These packages are only required for running the
#@markdown alignment-to-ancestors pipeline. Note: you can select
#@markdown the checkboxes and re-run this cell after doing the initial 
#@markdown installation.)

install_raxml = True    #@param {type:"boolean"}
install_generax = True  #@param {type:"boolean"}

#@markdown 2. After this cell runs, run the "Initialize environment" cell that follows.

try:
    import google.colab
    RUNNING_IN_COLAB = True
except ImportError:
    RUNNING_IN_COLAB = False
except Exception as e: 
    err = "Could not figure out if running in a Colab notebook\n"
    raise Exception(err) from e

if RUNNING_IN_COLAB:

    import os
    os.chdir("/content/")

    import urllib.request
    urllib.request.urlretrieve("https://raw.githubusercontent.com/harmslab/topiary-examples/main/notebooks/colab_installer.py",
                              "colab_installer.py")

    import colab_installer
    colab_installer.install_topiary(install_raxml=install_raxml,
                                    install_generax=install_generax)

In [5]:
### IF RUNNING LOCALLY, ACTIVATE THE TOPIARY ENVIRONMENT IN CONDA
### AND RE-OPEN THIS NOTEBOOK.

import topiary
import numpy as np
import pandas as pd
import glob
import os

### EVERYTHING AFTER THIS LINE IS USED TO SET UP TOPIARY IN A GOOGLE
### COLAB ENVIRONMENT. IF RUNNING THIS NOTEBOOK LOCALLY, THE LINES BELOW
### IN THIS CELL MAY BE SAFELY DELETED. 

#@title Initialize environment

#@markdown  Run this cell to initialize the environment after installation.
#@markdown (This cell can also be run if the kernel dies during a calculation,
#@markdown allowing you to reload modules without having to
#@markdown reinstall). 

#@markdown We recommend setting up a working directory on your google drive. This is a 
#@markdown convenient way to pass files to topiary and will allow you to save
#@markdown your work. For example, if you type `topiary_work` into the form
#@markdown field below, topiary will save all of its calculations in the 
#@markdown `topiary_work` directory in MyDrive (i.e. the top directory at
#@markdown https://drive.google.com). This script will create the directory if 
#@markdown it does not already exist. If the directory already exists, any files
#@markdown that are already in that directory will be available to topiary. You could, 
#@markdown for example, put a file called `seed.csv` in `topiary_work` and then
#@markdown access it as "seed.csv" in all cells below.
#@markdown <br/><br/>
#@markdown Note: Google may prompt you for permission to access the drive. 
#@markdown To work in a temporary colab environment, leave this blank. 

# Select a working directory on google drive
google_drive_directory = "" #@param {type:"string"}

try:
    import google.colab
    RUNNING_IN_COLAB = True
except ImportError:
    RUNNING_IN_COLAB = False
except Exception as e: 
    err = "Could not figure out if running in a Colab notebook\n"
    raise Exception(err) from e

if RUNNING_IN_COLAB:

    import os
    os.chdir("/content/")

    topiary._in_notebook = "colab"
    import colab_installer
    colab_installer.initialize_environment()
    colab_installer.mount_google_drive(google_drive_directory)

# Alignment-to-ancestors and bootstrap-reconcile

This pipeline takes an alignment, finds the best phylogenetic model
to explain relationships of sequences in the alignment, builds a
maximum likelihood tree, reconciles this tree with the species tree,
and then infers ancestral proteins.

If reconstructing microbe-specific ancestral proteins, reconciling
your gene tree with the species tree might not improve the confidence
in the final tree. Topiary will not reconcile a gene tree to the species tree if there are only microbial genes present unless the user sets `--force_reconcile`. In this case, the user may also opt to allow the probability of horizontal and/or lateral gene transfer between species to play a role in building the reconciled tree. To do this, use the flags `--horizontal_gene_transfer` and/or `--UndatedDTL`.

The user can also use `--force_no_reconcile` if they do not want to reconcile the ML gene tree with the species tree. Note that when allowing reconciliation, ancestors and statistical supports will be built for both the ML gene tree and the species-reconciled tree (see the [topiary documentation](https://topiary-asr.readthedocs.io/en/latest/protocol.html#interpret-the-results)).
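
The reconciliation rules described above can be summarized as a small decision function. This is a sketch of the stated behavior only, not topiary's implementation: `should_reconcile` and its `is_microbial` argument are hypothetical names, and raising an error when both force options are set is an assumption.

```python
def should_reconcile(is_microbial, force_reconcile=False, force_no_reconcile=False):
    """Decide whether to reconcile the gene tree with the species tree.

    Hypothetical helper illustrating the rules described in the text.
    """
    if force_reconcile and force_no_reconcile:
        # Assumption: the two options contradict each other, so reject both.
        raise ValueError("force_reconcile and force_no_reconcile cannot both be set")
    if force_no_reconcile:
        return False
    if force_reconcile:
        return True
    # Default: reconcile only when the dataset is not purely microbial.
    return not is_microbial
```

With the defaults, a non-microbial dataset is reconciled and a purely microbial one is not; the two force options override that default in either direction.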

The pipeline cell below can be run without updating any parameters. For a full description of all parameters, see the [topiary documentation](https://topiary-asr.readthedocs.io/en/latest/topiary.pipeline.html#module-topiary.pipeline.alignment_to_ancestors).
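
The model-search parameters in the pipeline cell define a grid of candidate models. As a rough sketch, assuming every combination of matrix, rate, frequency, and invariant-site option is tested (which is consistent with the 360-step progress bar in the example output), the grid can be enumerated as follows. The `+`-joined labels are illustrative only and are not guaranteed to match the model names topiary passes to raxml-ng.

```python
import itertools

model_matrices = ("cpREV Dayhoff DCMut DEN Blosum62 FLU HIVb HIVw JTT JTT-DCMut "
                  "LG mtART mtMAM mtREV mtZOA PMB rtREV stmtREV VT WAG").split()
# An empty string means "leave this option out of the model", as in the pipeline cell.
model_rates = ["", "G8"]
model_freqs = ["", "FC", "FO"]
model_invariant = ["", "IC", "IO"]

combinations = list(itertools.product(model_matrices,
                                      model_rates,
                                      model_freqs,
                                      model_invariant))

# Illustrative labels: join the non-empty parts of each combination with "+".
labels = ["+".join(part for part in combo if part) for combo in combinations]

print(len(labels))  # 20 matrices x 2 rates x 3 freqs x 3 invariant options = 360
```

Trimming any of the four lists shrinks the grid multiplicatively, which is the simplest way to shorten the model search for a quick test run.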


In [6]:
#@title Run the alignment_to_ancestors pipeline.

alignment_dataframe = "https://raw.githubusercontent.com/harmslab/topiary-examples/main/data/dataframe.csv" #@param {type:"string"} 

out_dir = "ali-to-anc"              #@param {type:"raw"}
starting_tree = None                #@param {type:"raw"}
no_bootstrap = False                #@param {type:"boolean"}
force_reconcile = False             #@param {type:"boolean"}
force_no_reconcile = False          #@param {type:"boolean"}
horizontal_transfer = False         #@param {type:"boolean"}
alt_cutoff = 0.25                   #@param {type:"number"}
model_matrices = "cpREV Dayhoff DCMut DEN Blosum62 FLU HIVb HIVw JTT JTT-DCMut LG mtART mtMAM mtREV mtZOA PMB rtREV stmtREV VT WAG" #@param {type: "string"}
model_rates = "G8"                  #@param {type:"string"}
model_freqs = "FC FO"               #@param {type:"string"}
model_invariant = "IC IO"           #@param {type:"string"}
num_threads = 1                     #@param {type:"integer"}
restart = False                     #@param {type:"boolean"}
overwrite = False                   #@param {type:"boolean"}

alignment_df = topiary.read_dataframe(alignment_dataframe)

model_matrices = model_matrices.split()
model_rates = model_rates.split()
model_rates.insert(0,"")
model_freqs = model_freqs.split()
model_freqs.insert(0,"")
model_invariant = model_invariant.split()
model_invariant.insert(0,"")

topiary.alignment_to_ancestors(df=alignment_df,
                               out_dir=out_dir,
                               starting_tree=starting_tree,
                               no_bootstrap=no_bootstrap,
                               force_reconcile=force_reconcile,
                               force_no_reconcile=force_no_reconcile,
                               horizontal_transfer=horizontal_transfer,
                               alt_cutoff=alt_cutoff,
                               model_matrices=model_matrices,
                               model_rates=model_rates,
                               model_freqs=model_freqs,
                               model_invariant=model_invariant,
                               num_threads=num_threads,
                               restart=restart,
                               overwrite=overwrite)

if len(glob.glob(os.path.join(out_dir,"*reconciled*"))) > 0:
    topiary.bootstrap_reconcile(out_dir,
                                num_threads=num_threads)



Non-microbial dataset detected. Gene/species tree reconciliation will be performed
----------------------------------------------------------------------
Checking raxml-ng
----------------------------------------------------------------------

    installed:       Y
    binary_path:     /Users/harmsm/local/bin/raxml-ng
    binary runs:     Y
    version:         1.1.0
    minimum version: 1.1
    passes:          Y

----------------------------------------------------------------------
Checking generax
----------------------------------------------------------------------

    installed:       Y
    binary_path:     /Users/harmsm/local/bin/generax
    binary runs:     Y
    version:         2.0.4
    minimum version: 2.0
    passes:          Y

----------------------------------------------------------------------
Checking mpirun
----------------------------------------------------------------------

    installed:       Y
    binary_path:     /Users/harmsm/miniconda3/bin/mpirun
    b

  0%|          | 0/360 [00:00<?, ?it/s]


Top 10 models:

               model           AICc prob
               rtREV               0.996
                 DEN               0.003
                 PMB               0.001
                  LG               0.000
                  VT               0.000
            Blosum62               0.000
                 WAG               0.000
           JTT-DCMut               0.000
                 JTT               0.000
               cpREV               0.000


topiary ran a find_best_model calculation in ./00_find-model:

+ Completed in 0:00:51.719681 (H:M:S)
+ Wrote results to ./00_find-model/output

----------------------------------------------------------------------



----------------------------------------------------------------------

topiary is starting a ml_tree calculation in ./01_gene-tree:

Launching raxml-ng, 0:00:00.001627 (H:M:S)
Running '/Users/harmsm/local/bin/raxml-ng --search --msa alignment.phy --model rtREV --seed 3258301272 --threads auto{1}'

RAxML-NG v. 

  0%|          | 0/1000 [00:00<?, ?it/s]

Running bootstrap calculations., 0:00:07.379601 (H:M:S)

Generating reconciliation bootstraps.



100%|██████████| 2/2 [00:00<00:00,  3.95it/s]   


Combining bootstrap calculations., 0:00:10.825172 (H:M:S)

Compressing replicates.



topiary ran a reconcile_bootstrap calculation in ./06_reconciled-tree-bootstraps:

+ Completed in 0:00:12.490707 (H:M:S)
+ Wrote results to ./06_reconciled-tree-bootstraps/output

----------------------------------------------------------------------


Generating report in ali-to-anc/results/gene-tree/
Generating report in ali-to-anc/results/reconciled-tree/


# Interpret the results

See the [topiary documentation](https://topiary-asr.readthedocs.io/en/latest/protocol.html#interpret-the-results) for a detailed description of how to determine whether a particular ancestral sequence has reasonable statistical support. Such ancestors can be resurrected and functionally characterized.
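
The example run above reports writing its summaries to `results/gene-tree/` and `results/reconciled-tree/` inside the output directory. A minimal sketch for listing whichever report directories exist, assuming that `results/` layout; `list_reports` is a hypothetical helper, not part of topiary.

```python
import os


def list_reports(out_dir):
    """Return the sorted names of the sub-directories in out_dir/results.

    Hypothetical helper; returns an empty list if no results directory exists.
    """
    results_dir = os.path.join(out_dir, "results")
    if not os.path.isdir(results_dir):
        return []
    return sorted(name for name in os.listdir(results_dir)
                  if os.path.isdir(os.path.join(results_dir, name)))
```

For the example run, `list_reports("ali-to-anc")` would be expected to include `gene-tree` and `reconciled-tree`, based on the log lines shown in the output.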

To edit the alignment, load `06_alignment.fasta` into an alignment editor. This file will be in the directory you specified for `out_dir` above.
If running on Google Colab, download the `06_alignment.fasta` file onto your computer. Click on the folder icon on the Colab menu on the left side of the window. Navigate into the seed-to-ali folder (or the name you gave your output directory in the previous step), hover over `06_alignment.fasta`, click on the three dots to the right, and choose `Download`. If you mounted your Google Drive, the file will be in `gdrive/MyDrive/seed-to-ali` and will also be accessible directly on your Google Drive.
