# Seed to alignment

The seed-to-alignment pipeline takes a small seed dataframe, BLASTs to find sequence hits, performs quality control, lowers alignment redundancy in a taxonomically informed fashion, and generates an alignment.

<a href="https://githubtocolab.com/harmslab/topiary-examples/blob/main/notebooks/seed-to-alignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Setup

Run the next two cells to initialize the environment to run topiary. 

In [None]:
### THIS CELL SETS UP TOPIARY IN A GOOGLE COLAB ENVIRONMENT. 
### IF RUNNING THIS NOTEBOOK LOCALLY, IT MAY BE SAFELY DELETED.

#@title Install software

#@markdown #### Installation requires two steps.

#@markdown 1. Install the software by pressing the _Play_ button on the left.
#@markdown Please be patient. This will take several minutes. <font color='teal'>
#@markdown After the  installation is complete, the kernel will reboot 
#@markdown and Colab will complain that the session crashed. This is normal.</font>
#@markdown <br/>
#@markdown 2. After this cell runs, run the "Initialize environment" cell that follows.

try:
    import google.colab
    RUNNING_IN_COLAB = True
except ImportError:
    RUNNING_IN_COLAB = False
except Exception as e: 
    err = "Could not figure out if running in a colab notebook\n"
    raise Exception(err) from e

if RUNNING_IN_COLAB:

    import os
    os.chdir("/content/")

    import urllib.request
    urllib.request.urlretrieve("https://raw.githubusercontent.com/harmslab/topiary-examples/main/notebooks/colab_installer.py",
                              "colab_installer.py")

    import colab_installer
    colab_installer.install_topiary(install_raxml=False,
                                    install_generax=False)

In [None]:
### IF YOU ARE RUNNING LOCALLY, make sure you activated 
### the topiary conda environment. (If you did not start this notebook
### within that environment, close the session, activate the topiary
### environment, and restart). 

import topiary
import numpy as np
import pandas as pd 

### EVERYTHING AFTER THIS LINE IS USED TO SET UP TOPIARY IN A GOOGLE
### COLAB ENVIRONMENT. IF RUNNING THIS NOTEBOOK LOCALLY, THE LINES BELOW
### IN THIS CELL MAY BE SAFELY DELETED. 

#@title Initialize environment

#@markdown  Run this cell to initialize the environment after installation.
#@markdown (This cell can also be run if the kernel dies during a calculation,
#@markdown allowing you to reload modules without having to
#@markdown reinstall.) 

#@markdown We recommend setting up a working directory on your google drive. This is a 
#@markdown convenient way to pass files to topiary and will allow you to save
#@markdown your work. For example, if you type `topiary_work` into the form
#@markdown field below, topiary will save all of its calculations in the 
#@markdown `topiary_work` directory in MyDrive (i.e. the top directory at
#@markdown https://drive.google.com). This script will create the directory if 
#@markdown it does not already exist. If the directory already exists, any files
#@markdown that are already in that directory will be available to topiary. You could, 
#@markdown for example, put a file called `seed.csv` in `topiary_work` and then
#@markdown access it as "seed.csv" in all cells below.
#@markdown <br/><br/>
#@markdown Note: Google may prompt you for permission to access the drive. 
#@markdown To work in a temporary colab environment, leave this blank. 

# Select a working directory on google drive
google_drive_directory = "" #@param {type:"string"}

try:
    import google.colab
    RUNNING_IN_COLAB = True
except ImportError:
    RUNNING_IN_COLAB = False
except Exception as e: 
    err = "Could not figure out if running in a colab notebook\n"
    raise Exception(err) from e

if RUNNING_IN_COLAB:

    import os
    os.chdir("/content/")

    topiary._in_notebook = "colab"
    import colab_installer
    colab_installer.initialize_environment()
    colab_installer.mount_google_drive(google_drive_directory)

## Construct a seed dataset
The first step in a topiary ASR calculation is to construct a seed dataset. This dataset defines protein family members of interest and the distribution of these proteins across species. Topiary uses this seed dataset to automatically find and download sequences to put into the alignment and, ultimately, evolutionary tree. An example for the LY86/LY96 protein family, a pair of closely related innate immune proteins, is shown below. 

name | species      | sequence   | aliases
---- | ------------ | ---------- | -------------------------------------------------------------------------------------------
LY96 | Homo sapiens | MLPFLFF... | ESOP1;Myeloid Differentiation Protein-2;MD-2;lymphocyte antigen 96;LY-96
LY96 | Danio rerio  | MALWCPS... | ESOP1;Myeloid Differentiation Protein-2;MD-2;lymphocyte antigen 96;LY-96
LY86 | Homo sapiens | MKGFTAT... | Lymphocyte Antigen 86;LY86;Myeloid Differentiation Protein-1;MD-1;RP105-associated 3;MMD-1
LY86 | Danio rerio  | MKTYFNM... | Lymphocyte Antigen 86;LY86;Myeloid Differentiation Protein-1;MD-1;RP105-associated 3;MMD-1

[Download the full spreadsheet](https://topiary-asr.readthedocs.io/en/latest/_static/data/seed-dataframe_example.csv)

### To prepare the table:

We present this briefly here. For details see the [topiary documentation](https://topiary-asr.readthedocs.io/en/latest/protocol.html).

1. **Choose the proteins of interest for your ASR calculation.** In our example, we included two paralogs: LY86 and LY96. The choice of proteins sets the scope of the evolutionary study. To study the deepest ancestor of LY86, we want to include LY96 as the relevant outgroup. In our experience, you generally want ~1-5 paralogs for a robust ASR investigation. As you add more paralogs, you need more sequences to resolve the evolutionary tree, slowing the calculation and—eventually—making the problem computationally intractable.
2. **Determine the taxonomic distribution of the protein family.** LY86 and LY96 are found across bony vertebrates (humans and bony fishes, but not sharks). If you are unsure of the taxonomic distribution of your proteins of interest, we discuss BLAST strategies for asking this question in the online [topiary documentation](https://topiary-asr.readthedocs.io/en/latest/protocol.html#determine-what-sequences-to-include).
3. **Choose two or three key species with well-annotated genomes that span the whole taxonomic distribution of your proteins of interest.** For LY86 and LY96, we selected humans and zebrafish, covering the breadth of species over which these proteins are found. Choosing humans and chimps would be a poor choice, as this covers only primates; even choosing humans and chickens would be non-optimal, as this covers only amniotes.
4. **Add sequences for each protein from your key species to the table.** These sequences are the basis for automatic dataset construction; they should therefore be high quality sequences: canonical rather than isoform, not hypothetical, not partial, etc. Our usual source for these seed sequences is Uniprot, but these can come from any source.
5. **Compile a list of aliases for each protein.** The same protein can have different names across different databases and species. Even in the same genome, gene nomenclature can be inconsistent. By using a human-curated list of aliases, topiary is more effective at identifying sequences that truly correspond to the paralogs of interest. Aliases can be found in many online databases. (A list of databases is given in the online documentation.)
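If you would rather assemble the table in code than in a spreadsheet program, the steps above can be sketched with pandas. This is a minimal illustration, not part of topiary: the variable name `example_seed_df` is hypothetical, and the sequence strings are placeholders that must be replaced with full-length, canonical sequences (for example, from Uniprot). The column names and aliases are taken from the example table above.

```python
import pandas as pd

# Aliases copied from the example table; semicolons separate the entries.
ly96_aliases = "ESOP1;Myeloid Differentiation Protein-2;MD-2;lymphocyte antigen 96;LY-96"
ly86_aliases = ("Lymphocyte Antigen 86;LY86;Myeloid Differentiation Protein-1;"
                "MD-1;RP105-associated 3;MMD-1")

# PLACEHOLDER sequences: these strings are NOT real proteins. Replace each
# with the full-length sequence for that protein and species.
rows = [
    {"name": "LY96", "species": "Homo sapiens", "sequence": "PLACEHOLDER_LY96_HUMAN", "aliases": ly96_aliases},
    {"name": "LY96", "species": "Danio rerio", "sequence": "PLACEHOLDER_LY96_ZEBRAFISH", "aliases": ly96_aliases},
    {"name": "LY86", "species": "Homo sapiens", "sequence": "PLACEHOLDER_LY86_HUMAN", "aliases": ly86_aliases},
    {"name": "LY86", "species": "Danio rerio", "sequence": "PLACEHOLDER_LY86_ZEBRAFISH", "aliases": ly86_aliases},
]

example_seed_df = pd.DataFrame(rows, columns=["name", "species", "sequence", "aliases"])

# Basic sanity check: every cell should be filled in.
if example_seed_df.isna().any().any():
    raise ValueError("The seed table has empty cells.")

# To save the table for later use, uncomment the next line:
# example_seed_df.to_csv("seed.csv", index=False)
```

A dataframe built this way could be assigned to `seed_df` in the loading cell instead of pointing `seed_dataset` at a file.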

### Load the table

Run the following cell to load your seed dataframe into the variable `seed_df`.

In [None]:
### IF RUNNING LOCALLY: set `seed_dataset =` to point to your desired csv or xlsx file. 
### Alternatively, you can set `seed_df` to point to a pandas dataframe holding the
### seed dataset. 

seed_dataset = "https://raw.githubusercontent.com/harmslab/topiary-examples/main/data/ly86-ly96.csv"
seed_df = None

# -----------------------------------------------------------------------------
# COLAB SPECIFIC BLOCK

#@title Load seed dataset

#@markdown Before running this cell, specify either: 
#@markdown + A file containing a seed dataset in your working
#@markdown directory (your google drive specified above).
#@markdown The default input file is an example LY86/LY96 seed dataset.
#@markdown + Select `upload_file` to upload a file directly from your computer. 

try:
    import google.colab
    RUNNING_IN_COLAB = True
except ImportError:
    RUNNING_IN_COLAB = False
except Exception as e: 
    err = "Could not figure out if running in a colab notebook\n"
    raise Exception(err) from e

if RUNNING_IN_COLAB:

    seed_dataset = "https://raw.githubusercontent.com/harmslab/topiary-examples/main/data/ly86-ly96.csv" #@param {type:"string"}
    upload_file = False #@param {type:"boolean"}

    if isinstance(seed_dataset,str):
        seed_dataset = seed_dataset.strip()

    if seed_dataset != "" and upload_file:
        err = "Please give a seed_dataset OR select upload file\n"
        raise ValueError(err)

    if seed_dataset == "" and not upload_file:
        err = "Please either give a seed_dataset or select upload file\n"
        raise ValueError(err)

    if upload_file:

        try:
            from google.colab import files
            uploaded_files = files.upload()
            keys = list(uploaded_files.keys())
            seed_dataset = keys[0] #uploaded_files[keys[0]]
        except ImportError:
            pass

# END COLAB SPECIFIC BLOCK
# -----------------------------------------------------------------------------

if seed_df is None:

    try:
        seed_df = pd.read_csv(seed_dataset)
    except Exception:
        # Could not parse as csv; fall back to reading as an Excel file
        try:
            seed_df = pd.read_excel(seed_dataset)
        except Exception as e:
            err = f"Could not read {seed_dataset}. This should be a csv or xlsx file\n"
            raise ValueError(err) from e

seed_df 

## Seed-to-alignment

The seed dataset is passed directly into the topiary seed-to-alignment pipeline. This script uses BLAST to build a dataset of thousands of protein sequences, performs quality control, lowers alignment redundancy in a taxonomically informed fashion, and then generates an alignment of sequences. This generally takes less than an hour on a modern laptop. The slowest step in this pipeline is often the initial NCBI BLAST step. If your connection is unstable or the NCBI server proves too slow, topiary can BLAST against local databases or load previously saved BLAST XML results. (If running in Google Colab, you would have to upload the XML files to your Google Drive to access them). 

This script will generate and save a series of spreadsheets, each capturing the state of the dataset at one step in the pipeline. The final output consists of a single spreadsheet (`05_clean-aligned-dataframe.csv`) and a single fasta file (`06_alignment.fasta`) holding the alignment. The results can be found in the `out_dir` folder (`seed-to-ali` by default).
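Once the pipeline has finished, a quick way to see which files were written is to list the spreadsheets and fasta files in the output directory. The sketch below uses only the Python standard library; `list_pipeline_outputs` is a hypothetical helper written for this example, not a topiary function.

```python
from pathlib import Path

def list_pipeline_outputs(out_dir="seed-to-ali"):
    """Return the sorted names of .csv and .fasta files in out_dir.

    Returns an empty list if the directory does not exist (for example,
    if the pipeline has not been run yet).
    """
    out_path = Path(out_dir)
    if not out_path.is_dir():
        return []
    return sorted(p.name for p in out_path.iterdir()
                  if p.suffix in (".csv", ".fasta"))

# Print whatever the pipeline has written so far.
for file_name in list_pipeline_outputs("seed-to-ali"):
    print(file_name)
```

Because the file names start with a step number, sorting them alphabetically should list them in pipeline order.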

This cell can be run without updating any parameters. For a full description of the meanings of all parameters, see the [topiary documentation](https://topiary-asr.readthedocs.io/en/latest/topiary.pipeline.html#module-topiary.pipeline.seed_to_alignment).

**NOTE**: If this cell gives the error `HTTPError: HTTP Error 429: Too Many Requests`, wait a few minutes and try again. This is because the NCBI BLAST servers have hit their limit. 
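If you prefer not to re-run the cell by hand, the wait-and-retry advice can be automated. The sketch below is an assumption-laden illustration: `retry_on_429` is a hypothetical helper, and it assumes the error reaches your code as a `urllib.error.HTTPError` with `code == 429`, which matches the message quoted above but is not guaranteed.

```python
import time
import urllib.error

def retry_on_429(func, max_tries=3, wait_seconds=300, sleep=time.sleep):
    """Call func(); if it raises HTTP 429, wait and try again.

    Any other error, or a 429 on the final attempt, is re-raised unchanged.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except urllib.error.HTTPError as e:
            if e.code != 429 or attempt == max_tries:
                raise
            print(f"Attempt {attempt} hit HTTP 429; waiting {wait_seconds} s.")
            sleep(wait_seconds)

# Hypothetical usage (assumes `topiary` and `seed_df` from the cells above).
# Depending on how far the failed run got, you may also need to set
# `restart=True` or `overwrite=True`:
#
# df = retry_on_429(lambda: topiary.seed_to_alignment(seed_df=seed_df,
#                                                     out_dir="seed-to-ali"))
```

The `sleep` argument exists only so the wait can be swapped out when testing; in normal use, leave it at its default.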


In [None]:
#@title Run the seed-to-alignment script

#@markdown Please execute this cell by pressing the _Play_ button
#@markdown to run the full seed-to-alignment pipeline.

# parameters                    # google colab parameter selectors
out_dir = "seed-to-ali"         #@param {type:"string"}
seqs_per_column = 1             #@param {type:"number"}
max_seq_number = 500            #@param {type:"integer"}
redundancy_cutoff = 0.90        #@param {type:"number"}
worst_align_drop_fx = 0.1       #@param {type:"number"}
sparse_column_cutoff = 0.80     #@param {type:"number"}
align_trim_first = 0.05         #@param {type:"number"}
align_trim_last = 0.95          #@param {type:"number"}

force_species_aware = False     #@param {type:"boolean"}
force_not_species_aware = False #@param {type:"boolean"}

ncbi_blast_db = None            #@param {type:"string"}
local_blast_db = None           #@param {type:"string"}
blast_xml = None                #@param {type:"string"}

move_mrca_up_by = 2             #@param {type:"integer"}
local_recip_blast_db = None     #@param {type:"string"}
min_call_prob = 0.95            #@param {type:"slider", min:0.01, max:0.99, step:0.01}
partition_temp = 1              #@param {type:"number"}

hitlist_size = 5000             #@param {type:"integer"}
e_value_cutoff = 0.001          #@param {type:"number"}
gapcost_gap_exists = 11         #@param {type:"integer"}
gapcost_per_residue = 1         #@param {type:"integer"}
num_ncbi_blast_threads = 1      #@param {type:"integer"}
num_local_blast_threads = -1    #@param {type:"integer"}

restart = False                 #@param {type:"boolean"}
overwrite = False               #@param {type:"boolean"}
keep_recip_blast_xml = False    #@param {type:"boolean"}

df = topiary.seed_to_alignment(seed_df=seed_df,
                               out_dir=out_dir,
                               seqs_per_column=seqs_per_column,
                               max_seq_number=max_seq_number,
                               redundancy_cutoff=redundancy_cutoff,
                               worst_align_drop_fx=worst_align_drop_fx,
                               sparse_column_cutoff=sparse_column_cutoff,
                               align_trim=(align_trim_first,align_trim_last),
                               ncbi_blast_db=ncbi_blast_db,
                               local_blast_db=local_blast_db,
                               blast_xml=blast_xml,
                               move_mrca_up_by=move_mrca_up_by,
                               local_recip_blast_db=local_recip_blast_db,
                               min_call_prob=min_call_prob,
                               partition_temp=partition_temp,
                               hitlist_size=hitlist_size,
                               e_value_cutoff=e_value_cutoff,
                               gapcosts=(gapcost_gap_exists,gapcost_per_residue),
                               num_ncbi_blast_threads=num_ncbi_blast_threads,
                               num_local_blast_threads=num_local_blast_threads,
                               restart=restart,
                               overwrite=overwrite,
                               keep_recip_blast_xml=keep_recip_blast_xml)

df

## Inspect and edit alignment

Before reconstructing a phylogenetic tree and ancestors, we strongly recommend inspecting and possibly editing the alignment. There are a variety of pieces of software for visualizing alignments, including AliView, JALView, and MEGA. We generally use [AliView](https://ormbunkar.se/aliview/) because of its balance of utility and simplicity.

### Load the alignment fasta file in an alignment editor

To edit the alignment load `06_alignment.fasta` into an alignment editor. This will be in the directory you specified for `out_dir` above. 
If running on Google Colab, download the `06_alignment.fasta` file onto your computer. Click on the folder icon on the Colab menu on the left side of the window. Navigate into the seed-to-ali folder (or the name you gave your output directory in the previous step), hover over `06_alignment.fasta`, click on the three dots to the right and choose `Download`. If you mounted your Google Drive it will be in `gdrive/MyDrive/seed-to-ali`, and also be accessible directly on your Google Drive. 

### (Possibly) edit the alignment

There are differing views on whether to manually edit an alignment (e.g. [Catanach 2019](https://peerj.com/articles/6142/) vs. [Morrison 2006](https://doi.org/10.1071/SB06020)); the topiary package allows a user to manually edit their alignment but does not require it. We generally recommend making a few adjustments to alignments. Importantly, if we edit an alignment, we publish the alignment as supplemental material in the resulting manuscript so others can reproduce our work. 

IMPORTANT: when editing an alignment, *do not change the names of the sequences* as this is how topiary maps the alignment back into the dataframe. Also, *do not add sequences to the alignment*. To add sequences, you should add them to the dataframe itself before writing out the alignment. 
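One way to guard against accidental renaming is to compare the sequence names in the original and edited fasta files before reading the alignment back into topiary. The sketch below is plain Python; `read_fasta` and `check_edited_alignment` are hypothetical helpers written for this example, not topiary functions, and the parser assumes a simple fasta file in which each header line starts with `>`.

```python
def read_fasta(path):
    """Parse a fasta file into a dictionary of {header: sequence}."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:]
                if name in records:
                    raise ValueError(f"Duplicate sequence name: {name}")
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}

def check_edited_alignment(original_fasta, edited_fasta):
    """Compare sequence names between the original and edited alignments."""
    original = read_fasta(original_fasta)
    edited = read_fasta(edited_fasta)
    return {
        # Names only in the edited file: sequences were renamed or added.
        "added_or_renamed": sorted(set(edited) - set(original)),
        # Names only in the original file: sequences were deleted.
        "removed": sorted(set(original) - set(edited)),
        # Every row of an alignment should have the same number of columns.
        "equal_length": len({len(seq) for seq in edited.values()}) <= 1,
    }

# Hypothetical usage, with file names matching the defaults in this notebook:
# report = check_edited_alignment("seed-to-ali/06_alignment.fasta",
#                                 "edited-alignment.fasta")
```

An empty `added_or_renamed` list and `equal_length` of `True` suggest the names are intact. Entries in `removed` are expected if you deliberately deleted sequences.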

When editing alignments, we use the four "moves" listed below (see the [documentation](https://topiary-asr.readthedocs.io/en/latest/protocol.html#visually-inspect-and-possibly-edit-the-alignment) for detailed instructions and examples).

1. **Trim variable-length N- and C-terminal regions from the alignment.** A huge number of sparse and variable columns will slow evolutionary analyses and will generally not provide enough signal to be reconstructed with confidence.
2. **Delete sequences with long, unique insertions or deletions (indels).** Indels can lead to alignment ambiguity around flanking regions. Further, they provide no information for most ancestors, most of whom do not have the indel, while increasing the computational cost of the phylogenetic analysis. Note, we do not make internal edits to sequences (say, by deleting a long lineage-specific insertion) as this becomes difficult to track or justify upon future realignment steps.
3. **Delete lineage-specific duplicates, selecting the sequence with the greatest sequence coverage.** The pipeline generally does a good job of deleting sequences in this class; however, if such sequences slip through, we delete them from the alignment. Because trying to align long, unique, and variable sequences can affect the alignment of other sequences, we generally use Muscle5 to re-align the full MSA after we perform steps 1-3. This can be done directly from AliView. (We will often iterate through steps 1-3 and full alignment several times.)
4. **Finally, after we are satisfied that we have sequences of reasonable length and composition, we carefully inspect the alignment and may correct “obvious” local misalignments.** In our view, these edits make the alignment a more accurate description of sequence homology than otherwise; however, we recognize that this is subjective and difficult to quantify. As noted above, we publish our alignment with our final ancestors to allow others to assess our judgement and promote reproducibility.

### Save your alignment

If you edited the alignment in the editor, save that edited alignment out as a `.fasta` file. 

### Read the edited alignment back into the topiary dataframe
If you made edits to the alignment, it needs to be read back into the topiary dataframe to infer ancestors. You have two options to do this.

+ Run the following cell. Change `previous_dataframe` and `edited_fasta_file` to point to the relevant files. This will create an output file with the name specified in `output_file_name`. (If you are running on Google Colab, you should upload your edited fasta file to the relevant directory on your Google Drive.)
+ If you are using a cluster to do the ancestral inference steps (recommended), you can upload both your edited alignment and the topiary dataframe from the last step up to the cluster. You can then run a command line script: `topiary-fasta-into-dataframe 05_clean-aligned-dataframe.csv edited_alignment.fasta final-dataframe.csv`. This will create a new file (`final-dataframe.csv`) that can be fed into subsequent analyses. 

In [None]:
#@title Load the topiary spreadsheet

previous_dataframe = "seed-to-ali/05_clean-aligned-dataframe.csv" #@param {type: "string"}
edited_fasta_file = "edited-alignment.fasta"                      #@param {type: "string"}
output_file_name = "final-dataframe.csv"                          #@param {type: "string"}

df = topiary.read_dataframe(previous_dataframe)
alignment_df = topiary.read_fasta_into(df,edited_fasta_file)
topiary.write_dataframe(alignment_df,output_file_name)

alignment_df

## Complete!

You should now have a spreadsheet (i.e. `final-dataframe.csv`) that has all of your sequences aligned, with metadata. This is the only required input for the next step. We strongly recommend including this dataframe as a supplemental file in a manuscript using topiary, as it has the accessions, sequences, and alignment necessary for others to reproduce your work. (And since it's a simple spreadsheet, anyone can read it, not just people with topiary installed.)