# Alignment to ancestors

This notebook demonstrates the *alignment-to-ancestors* and *bootstrap-reconcile* pipelines. For an actual inference, we recommend running these pipelines in a high-performance computing environment. 

<a href="https://githubtocolab.com/harmslab/topiary-examples/blob/main/notebooks/alignment-to-ancestors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Setup
Run the next two cells to initialize the environment to run topiary.

In [None]:
### THIS CELL SETS UP TOPIARY IN A GOOGLE COLAB ENVIRONMENT. 
### IF RUNNING THIS NOTEBOOK LOCALLY, IT MAY BE SAFELY DELETED.

#@title Install software

#@markdown #### Installation requires two steps.

#@markdown 1. Install the software by pressing the _Play_ button on the left.
#@markdown Please be patient. This will take several minutes. <font color='teal'>
#@markdown After the installation is complete, the kernel will reboot 
#@markdown and Colab will complain that the session crashed. This is normal.</font>
#@markdown <br/><br/>

install_raxml = True    #@param {type:"boolean"}
install_generax = True  #@param {type:"boolean"}

#@markdown 2. After this cell runs, run the "Initialize environment" cell that follows.

try:
    import google.colab
    RUNNING_IN_COLAB = True
except ImportError:
    RUNNING_IN_COLAB = False
except Exception as e: 
    err = "Could not figure out if running in a Colab notebook\n"
    raise Exception(err) from e

if RUNNING_IN_COLAB:

    import os
    os.chdir("/content/")

    import urllib.request
    urllib.request.urlretrieve("https://raw.githubusercontent.com/harmslab/topiary-examples/main/notebooks/colab_installer.py",
                              "colab_installer.py")

    import colab_installer
    colab_installer.install_topiary(install_raxml=install_raxml,
                                    install_generax=install_generax)

In [None]:
### IF YOU ARE RUNNING LOCALLY, make sure you have activated
### the topiary conda environment. (If you did not start this notebook
### within that environment, close the session, activate the topiary
### environment, and restart.)

import topiary
import numpy as np
import pandas as pd
import glob
import os

### EVERYTHING AFTER THIS LINE IS USED TO SET UP TOPIARY IN A GOOGLE
### COLAB ENVIRONMENT. IF RUNNING THIS NOTEBOOK LOCALLY, THE LINES BELOW
### IN THIS CELL MAY BE SAFELY DELETED. 

#@title Initialize environment

#@markdown  Run this cell to initialize the environment after installation.
#@markdown (This cell can also be run if the kernel dies during a calculation,
#@markdown allowing you to reload modules without having to
#@markdown reinstall). 

#@markdown We recommend setting up a working directory on your Google Drive. This is a 
#@markdown convenient way to pass files to topiary and will allow you to save
#@markdown your work. For example, if you type `topiary_work` into the form
#@markdown field below, topiary will save all of its calculations in the 
#@markdown `topiary_work` directory in MyDrive (i.e. the top directory at
#@markdown https://drive.google.com). This script will create the directory if 
#@markdown it does not already exist. If the directory already exists, any files
#@markdown that are already in that directory will be available to topiary. You could, 
#@markdown for example, put a file called `seed.csv` in `topiary_work` and then
#@markdown access it as "seed.csv" in all cells below.
#@markdown <br/><br/>
#@markdown Note: Google may prompt you for permission to access the drive. 
#@markdown To work in a temporary Colab environment, leave the field blank. 

# Select a working directory on Google Drive
google_drive_directory = "" #@param {type:"string"}

try:
    import google.colab
    RUNNING_IN_COLAB = True
except ImportError:
    RUNNING_IN_COLAB = False
except Exception as e: 
    err = "Could not figure out if running in a Colab notebook\n"
    raise Exception(err) from e

if RUNNING_IN_COLAB:

    import os
    os.chdir("/content/")

    topiary._in_notebook = "colab"
    import colab_installer
    colab_installer.initialize_environment()
    colab_installer.mount_google_drive(google_drive_directory)

## Alignment-to-ancestors and bootstrap-reconcile

The following cell takes a topiary dataframe as input. It will then:

+ Infer the maximum likelihood model of sequence evolution
+ Infer a maximum likelihood gene tree
+ Infer ancestors on the maximum likelihood gene tree
+ Generate bootstrap replicates for the maximum likelihood gene tree
+ Reconcile the gene and species trees (for a non-microbial dataset)
+ Generate bootstrap replicates for the reconciled tree (for a non-microbial dataset)

This cell can be run without updating any parameters. For a description of the meanings of all parameters, see the topiary documentation for [alignment_to_ancestors](https://topiary-asr.readthedocs.io/en/latest/topiary.pipeline.html#module-topiary.pipeline.alignment_to_ancestors) and [bootstrap_reconcile](https://topiary-asr.readthedocs.io/en/latest/topiary.pipeline.html#topiary.pipeline.bootstrap_reconcile.bootstrap_reconcile).
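
To give a sense of how many candidate models the default settings imply, the sketch below enumerates every combination of the `model_matrices`, `model_rates`, `model_freqs`, and `model_invariant` values used in the cell below, with an empty string prepended to the last three as that cell does. It assumes a full grid and joins the components with `+`; the helper `candidate_models` is hypothetical and is not part of topiary, which may build or prune its model list differently.

```python
from itertools import product

# Default values from the form fields in the pipeline cell.
model_matrices = ("cpREV Dayhoff DCMut DEN Blosum62 FLU HIVb HIVw JTT JTT-DCMut "
                  "LG mtART mtMAM mtREV mtZOA PMB rtREV stmtREV VT WAG").split()

# The empty string stands for "leave this component out of the model".
model_rates = [""] + "G8".split()
model_freqs = [""] + "FC FO".split()
model_invariant = [""] + "IC IO".split()


def candidate_models(matrices, rates, freqs, invariant):
    """Hypothetical helper: one '+'-joined label per combination.

    Empty components are skipped, so ("LG", "", "FC", "") becomes "LG+FC".
    """
    labels = []
    for combo in product(matrices, rates, freqs, invariant):
        labels.append("+".join(part for part in combo if part))
    return labels


models = candidate_models(model_matrices, model_rates,
                          model_freqs, model_invariant)

print(len(models))   # 20 matrices * 2 rates * 3 freqs * 3 invariant = 360
print(models[0])     # cpREV
print(models[-1])    # WAG+G8+FO+IO
```

Under this assumption, trimming `model_matrices` to a handful of matrices shrinks the grid proportionally, which can be useful for a quick test run.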

**Note**: You may get an MPI error if you are running this on *small-topiary-dataframe.csv* and increase the number of threads. For tiny datasets, multithreaded GeneRax gets out of sync and crashes. Try decreasing the number of threads or running on a larger dataset. 
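
One way to act on this note is to cap the thread count before launching the pipeline. The helper below is a hypothetical sketch (`choose_num_threads` is not part of topiary). It assumes you decide yourself whether the dataset counts as tiny, uses a single thread in that case, and otherwise never requests more threads than `os.cpu_count()` reports.

```python
import os


def choose_num_threads(requested, tiny_dataset=False):
    """Hypothetical helper: pick a conservative thread count.

    requested    -- number of threads the user would like to use
    tiny_dataset -- True for very small inputs, where the note above
                    recommends fewer threads
    """
    if tiny_dataset:
        return 1

    # os.cpu_count() can return None; fall back to a single thread.
    available = os.cpu_count() or 1

    # Never go below 1 or above the number of available CPUs.
    return max(1, min(requested, available))


num_threads = choose_num_threads(4, tiny_dataset=True)
print(num_threads)  # 1
```

The returned value could then be used for `num_threads` in the pipeline cell.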

### Output

This cell will create the file *out_dir/results.zip*, which contains the ancestors in a human-readable format. If you uncompress and open *results/index.html* you can view the results in a web browser. See the [topiary documentation](https://topiary-asr.readthedocs.io/en/latest/protocol.html#interpret-the-results) for how to interpret the output. 
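
If you prefer to unpack the archive from Python rather than by hand, the sketch below uses only the standard library. It assumes the archive contains a top-level `results` directory holding `index.html`, as the path above suggests; the function name `extract_results` is hypothetical and is not part of topiary.

```python
import zipfile
from pathlib import Path


def extract_results(out_dir="ali-to-anc", dest=None):
    """Hypothetical helper: unpack out_dir/results.zip, return index.html path.

    out_dir -- pipeline output directory containing results.zip
    dest    -- where to unpack (defaults to out_dir itself)
    """
    zip_path = Path(out_dir) / "results.zip"
    dest = Path(dest) if dest is not None else Path(out_dir)

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)

    # Assumption: the archive unpacks to results/index.html.
    index = dest / "results" / "index.html"
    if not index.is_file():
        raise FileNotFoundError(f"expected {index} after extraction")
    return index
```

After the pipeline finishes, calling `extract_results("ali-to-anc")` returns the path to open in a web browser.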


In [None]:
#@title Run the alignment_to_ancestors pipeline.

alignment_dataframe = "https://raw.githubusercontent.com/harmslab/topiary-examples/main/data/small-topiary-dataframe.csv" #@param {type:"string"} 
out_dir = "ali-to-anc"              #@param {type:"raw"}
starting_tree = None                #@param {type:"raw"}
no_bootstrap = False                #@param {type:"boolean"}
force_reconcile = False             #@param {type:"boolean"}
force_no_reconcile = False          #@param {type:"boolean"}
horizontal_transfer = False         #@param {type:"boolean"}
alt_cutoff = 0.25                   #@param {type:"number"}
model_matrices = "cpREV Dayhoff DCMut DEN Blosum62 FLU HIVb HIVw JTT JTT-DCMut LG mtART mtMAM mtREV mtZOA PMB rtREV stmtREV VT WAG" #@param {type: "string"}
model_rates = "G8"                  #@param {type:"string"}
model_freqs = "FC FO"               #@param {type:"string"}
model_invariant = "IC IO"           #@param {type:"string"}
num_threads = 1                     #@param {type:"integer"}
restart = False                     #@param {type:"boolean"}
overwrite = False                   #@param {type:"boolean"}

# Instructions for colab users
#@markdown **To download results** click the folder icon on the left and look
#@markdown for out_dir/results.zip. (ali-to-anc/results.zip, by default). 
#@markdown Right-click and download this file. You can then uncompress this 
#@markdown on a local computer and open results/index.html to browse the 
#@markdown inferred trees and ancestors. See the [topiary documentation](https://topiary-asr.readthedocs.io/en/latest/protocol.html#interpret-the-results)
#@markdown for how to interpret the output. 

alignment_df = topiary.read_dataframe(alignment_dataframe)

# These should be lists of strings. Note that we add "" to the front of
# model_rates, model_freqs, and model_invariant so that models without an
# explicit rate, frequency, or invariant-site component are also tested.
model_matrices = model_matrices.split()
model_rates = model_rates.split()
model_rates.insert(0,"")
model_freqs = model_freqs.split()
model_freqs.insert(0,"")
model_invariant = model_invariant.split()
model_invariant.insert(0,"")

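# If installed by conda, RAxML-NG runs multithreaded on a single processor.
# NOTE: horizontal_transfer is set in the form above but is not passed to the
# call below, so changing it has no effect as written.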
topiary.alignment_to_ancestors(df=alignment_df,
                               out_dir=out_dir,
                               starting_tree=starting_tree,
                               no_bootstrap=no_bootstrap,
                               force_reconcile=force_reconcile,
                               force_no_reconcile=force_no_reconcile,
                               alt_cutoff=alt_cutoff,
                               model_matrices=model_matrices,
                               model_rates=model_rates,
                               model_freqs=model_freqs,
                               model_invariant=model_invariant,
                               num_threads=num_threads,
                               restart=restart,
                               overwrite=overwrite)

# bootstrap_reconcile calculates bootstrap replicates on individual threads. It
# can be run across multiple processors with minimal overhead. We recommend
# running with as many threads on as many cores as possible. If you are adapting
# this notebook to run in a high-performance computing environment, we recommend
# running this calculation in its own notebook.
# The check below runs it only if out_dir contains an entry whose name includes
# "reconciled", i.e. only if a reconciliation was performed above.
if len(glob.glob(os.path.join(out_dir,"*reconciled*"))) > 0:
    topiary.bootstrap_reconcile(out_dir,
                                num_threads=num_threads)
