In [1]:
import topiary
import pandas as pd
import numpy as np

## Project organization


Keeping track of all of the files is always annoying in these projects.  I'm not sure there is a "right" way to do it, but here are a few principles that have kept me sane over the years:

+ Whenever you have to save out a file, do it with a numbered prefix and description.  Something like `00_initial-sequence-to-blast.fasta`, `01_blast-results.xml`, etc. This way, if you sort the directory, you can see what you did, in what order. 
+ Keep your sequences in a `.csv` file with columns for species, database id, raw sequence, aligned sequence, etc. You can then share this `.csv` file as the supplement to your paper.  *This protocol implements this `.csv` strategy throughout.*

The general paradigm throughout is to modify the dataframe, then immediately write it out:
```
df = do_something(df)
topiary.write_dataframe(df,"dataset.csv")
```
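
One way to keep the numbered prefixes consistent is to let a small helper pick the next number. The sketch below is not part of topiary; `next_numbered_name` is a hypothetical helper that assumes existing files follow the `NN_description.ext` convention described above.

```python
import os
import re


def next_numbered_name(directory, description, extension):
    """Return a file name like '02_description.csv' with the next free prefix.

    Hypothetical helper (not part of topiary). Files that do not start with
    two digits and an underscore are ignored.
    """
    pattern = re.compile(r"^(\d{2})_")
    used = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            used.append(int(match.group(1)))

    # Start at 00 in an empty directory, otherwise continue the sequence.
    next_index = max(used) + 1 if used else 0
    return f"{next_index:02d}_{description}.{extension}"
```

With this in place, a call such as `topiary.write_dataframe(df, next_numbered_name(".", "initial-hits", "csv"))` would name the output after whatever step came before it.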


## Create a set of BLAST .xml files containing sequences

Building sequence datasets is complicated.  How you go about it depends on your goals for the project.  In this protocol, I am going to assume we are looking for members of protein families (meaning a mixture of orthologs and paralogs) from animals (meaning minimal lateral gene transfer). 

I'll mention two questions that often come up.  The first is: *should I include paralogs and orthologs?* For a gene tree (what we normally generate), you generally want a large number of orthologs and paralogs. A good check for the quality of the tree is whether the orthologs group together and roughly reproduce the species tree.  In practice, this means BLASTing away and then grabbing as many sequences as possible without worrying too much about whether you are pulling down orthologs or paralogs.  

One reason this is useful is that sequence databases sometimes have paralogs and orthologs mislabeled.  Imagine you are building a tree of protein *X*.  You BLAST against NCBI and pull out a sequence labeled protein *Y*.  You then build a tree, including this protein, and find it groups squarely with the *X* proteins, but not the other *Y* proteins in your dataset.  The simplest explanation for this result is that the protein was annotated incorrectly.   If you had dropped this sequence from your analysis, just because it was labeled *Y*, you would have removed valuable information from your dataset.  

A second question: *should I include "hypothetical," "predicted," and "low-quality" sequences?* As is usual for bioinformatics, the answer is "it depends." If the protein has a huge number of isoforms and weirdo splice sites, the predicted genes could be difficult to align and thus mess up your phylogeny. On the other hand, if the protein has a very well defined set of exons, you'd probably be okay. Further, you sometimes really need sequences from taxa for which sequencing data are particularly poor, and a "low-quality" sequence is the best you can do.  In the following pipeline, we include everything up front, and then remove redundant sequences.  We preferentially remove these low-quality sequences if better sequences are available. 

### Practical thoughts:

+ The bigger the initial dataset, the better. We can pare down later. 

+ Do not worry about duplicate sequences at this point; we'll remove them later. 

+ BLAST with multiple paralogs for the gene of interest, sampled from multiple species. 

+ Use PSI-BLAST to iteratively build your dataset.  In PSI-BLAST, your results from the first BLAST search are fed into the next BLAST search, yielding more hits overall.  This process can be repeated until you find no more new hits. 

+ Check for good taxonomic sampling. (You gain more information from 1,000 sequences from across mammals than from 1,000 sequences from rodents alone.) The `Taxonomy` tab on the BLAST interface is useful for this. If you are not seeing sequences from key species, you can do targeted BLAST within a clade.  

  + To do this, use the `Organism` selection flag to make sure you get sequences from key lineages.  Since we often study vertebrate proteins, here's a useful strategy: do separate BLAST searches for: `Mammalia (taxid:40674)`, `Sauropsida (taxid:8457)`, `Amphibia (taxid:8292)`,  `bony fishes  (taxid:7898)`,  `Elasmobranchii (taxid:7778)` (skates, rays and sharks), `jawless vertebrates (taxid:1476529)` (hagfish and lampreys), and `tunicates (taxid:7712)` (closest non-vertebrate outgroup). I also usually do `lobe-finned fishes (taxid:118072)`, excluding `Tetrapoda (taxid:32523)`.  This should pull up lungfish and coelacanth sequences if available. 
  + If you get a ton of hits in your earliest-diverging lineage (`tunicates` above), it suggests the protein evolved earlier than you have sampled.  If so, expand to earlier-diverging groups like `Ecdysozoa (taxid:1206794)` (which includes both *D. melanogaster* and *C. elegans*), `Lophotrochozoa (taxid:1206795)` and `Hemichordata (taxid:10219)`. 
  + If one of those searches yields a small number of hits, it might be worthwhile to search for non-NCBI genomes and transcriptomes in an effort to fill out the taxa. Some lineages just have few sequence resources.  Amphibians are notoriously undersampled, and there are very few extant jawless vertebrates. As a result, these bits of vertebrate protein/gene trees are often sparse. 
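
If you script these clade-restricted searches rather than using the `Organism` box, the same restriction is usually written as an Entrez query. The sketch below only builds the query strings; it assumes the standard Entrez `txidNNNN[ORGN]` syntax, uses the taxids listed above, and `entrez_query` is a hypothetical helper name rather than anything topiary provides.

```python
# Clades and taxids copied from the list above.
VERTEBRATE_CLADES = {
    "Mammalia": 40674,
    "Sauropsida": 8457,
    "Amphibia": 8292,
    "bony fishes": 7898,
    "Elasmobranchii": 7778,
    "jawless vertebrates": 1476529,
    "tunicates": 7712,
}


def entrez_query(include_taxid, exclude_taxid=None):
    """Build an Entrez organism filter such as 'txid40674[ORGN]'."""
    query = f"txid{include_taxid}[ORGN]"
    if exclude_taxid is not None:
        query += f" NOT txid{exclude_taxid}[ORGN]"
    return query


queries = {name: entrez_query(taxid) for name, taxid in VERTEBRATE_CLADES.items()}

# Lobe-finned fishes, excluding Tetrapoda.
queries["lobe-finned fishes"] = entrez_query(118072, exclude_taxid=32523)

for name, query in queries.items():
    print(f"{name}: {query}")
```

Each string can be pasted into the "Entrez Query" field on the BLAST page. If you drive BLAST from Python, Biopython's `NCBIWWW.qblast` accepts an `entrez_query` argument that should take the same text.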
  
### Other sources of sequences

*Note: If you have to go this route, you'll have to manually add the non-NCBI results to an existing topiary dataframe in Excel or the like. See #12 below for an example. After manually adding information to the .csv file, re-load it into a pandas dataframe and then write it back out to a new .csv file. This seems to avoid downstream problems, possibly because it puts everything into the right dtype.*

+ Some good places to look for sequences outside of NCBI are [Fish10K](https://db.cngb.org/datamart/animal/DATAani16/) and [Bird10K](https://b10k.genomics.cn/index.html). You can also switch to a [tblastn](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=tblastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) approach to see if there are nucleic acid sequences corresponding to your protein that have not been annotated as proteins.  I find [https://ensembl.org/](https://ensembl.org/) is the easiest database for this task. Finally, if you are desperate, you can look for transcriptomes in the [short read archive](https://www.ncbi.nlm.nih.gov/sra).  These often have to be assembled (a somewhat tedious process). 

+ [PFAM](http://pfam.xfam.org/) has some amazing pre-built alignments.  I've generally found them more useful for studies of whole domains evolving over very long timescales than our typical vertebrate protein work, but it's worth keeping in mind these alignments are out there. 

## Load sequences from xml files into a dataframe

This will create a data frame and .csv file with all of the sequences you identified by BLAST.  Most of the columns are self-explanatory, but there are two important columns that bear more explanation: 
 + `uid`: a random 10-letter string unique to each sequence, used for generating files compatible with PAML. 
 + `keep`: whether or not the sequence should be written out in alignments. Sequences are never deleted from the dataframe, just marked as `keep = False`. 
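
To make the two columns concrete, here is a minimal sketch of how unique 10-letter identifiers and a `keep` flag could be attached to a set of records. This is not topiary's implementation; `make_uids` is a hypothetical helper and the records are plain dictionaries standing in for dataframe rows.

```python
import random
import string


def make_uids(n, length=10, seed=None):
    """Return n distinct random strings made of ASCII letters."""
    rng = random.Random(seed)
    uids = set()
    # A set guarantees uniqueness; keep drawing until we have enough.
    while len(uids) < n:
        uids.add("".join(rng.choices(string.ascii_letters, k=length)))
    return sorted(uids)


# Toy records standing in for dataframe rows.
records = [{"accession": "NP_056179.4"}, {"accession": "NP_001123946.1"}]
for record, uid in zip(records, make_uids(len(records), seed=0)):
    record["uid"] = uid
    record["keep"] = True   # every sequence starts out kept

# "Removing" a sequence only flips the flag; the row itself stays.
records[1]["keep"] = False
```

Because rows are flagged rather than deleted, any earlier decision can be reversed later by setting `keep` back to `True`.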
 
Running this code requires you provide a list of `.xml` files.  You can also 

In [2]:
# List of xml files to load and process
list_of_xml_files = ["../data/tiny.xml"] #,"example/sauropsid.xml"]

# Load sequences into data frame
df = topiary.ncbi_blast_xml_to_df(list_of_xml_files) #,aliases=alias_dictionary)

# Write output file
topiary.write_dataframe(df,"01_initial-hits.csv")

# Print to notebook
df

Downloading 1 blocks of <=50 sequences... 


  0%|          | 0/1 [00:00<?, ?it/s]

Done.


Unnamed: 0,keep,species,name,sequence,uid,accession,xml,length,evalue,start,end,structure,low_quality,precursor,predicted,isoform,hypothetical,partial,raw_line
0,True,Homo sapiens,lymphocyte antigen 96 isoform 1 precursor,MLPFLFFSTLFSSIFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,ZRDNxgNWHM,NP_056179.4,../data/tiny.xml,160,2.53915e-116,1,160,False,False,True,False,True,False,False,ref|NP_056179.4| lymphocyte antigen 96 isoform...
1,True,Pan troglodytes,lymphocyte antigen 96 precursor,MLPFLFFSTLFSSIFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,zquSRbrlOA,NP_001123946.1,../data/tiny.xml,160,4.17052e-115,1,160,False,False,True,False,False,False,False,ref|NP_001123946.1| lymphocyte antigen 96 prec...
2,True,Pongo abelii,lymphocyte antigen 96,MLPFLFFSTLFSSIFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,hwvOhKSHWr,XP_002819229.1,../data/tiny.xml,160,1.1076800000000001e-114,1,160,False,False,False,False,False,False,False,ref|XP_002819229.1| lymphocyte antigen 96 [Pon...
3,True,Gorilla gorilla,lymphocyte antigen 96 precursor,MLPFLFFSTLFSSTFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,cCVAZtDgBe,NP_001266676.1,../data/tiny.xml,160,2.09351e-114,1,160,False,False,True,False,False,False,False,ref|NP_001266676.1| lymphocyte antigen 96 prec...
4,True,Hylobates moloch,lymphocyte antigen 96,MLPFLFFSTLFSSIFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,qiVpUYglzx,XP_032614344.1,../data/tiny.xml,160,2.75448e-114,1,160,False,False,False,False,False,False,False,ref|XP_032614344.1| lymphocyte antigen 96 [Hyl...
5,True,Nomascus leucogenys,lymphocyte antigen 96,MLPFLFFSTLFSSIFTEPQKQYWVCNSSDASISYTYCDKMQYPISI...,wLQULRDQyk,XP_003269495.1,../data/tiny.xml,160,6.5972e-111,1,160,False,False,False,False,False,False,False,ref|XP_003269495.1| lymphocyte antigen 96 [Nom...
6,True,Colobus angolensis palliatus,PREDICTED: lymphocyte antigen 96 isoform X1,MLPFLFFSTLFSSIFTEAQKHYWVCNSSDASISYTYCDKMQYPISI...,uWQUZwJHPz,XP_011792572.1,../data/tiny.xml,160,6.743599999999999e-111,1,160,False,False,False,True,True,False,False,ref|XP_011792572.1| PREDICTED: lymphocyte anti...
7,True,Chlorocebus sabaeus,lymphocyte antigen 96 isoform X1,MLPFLFFSTLFSSIFTEAQKHYWVCNSSDASISYTYCDKMQYPISI...,WvHPOPkHtB,XP_007999098.1,../data/tiny.xml,160,7.13828e-110,1,160,False,False,False,False,True,False,False,ref|XP_007999098.1| lymphocyte antigen 96 isof...
8,True,Rhinopithecus roxellana,lymphocyte antigen 96,MLPFLFFSTLFSSIFTEAQKHYWVCNSSDASISYTYCDKMQYPISI...,YEmpyViUYd,XP_010364303.1,../data/tiny.xml,160,3.66201e-109,1,160,False,False,False,False,False,False,False,ref|XP_010364303.1| lymphocyte antigen 96 [Rhi...
9,True,Piliocolobus tephrosceles,lymphocyte antigen 96 isoform X1,MLPFLFFSTLFSSIFTEAQKHYWVCNSSDASISYTYCDKIQYPISI...,ExMhNdHxmQ,XP_023039859.1,../data/tiny.xml,160,6.55105e-109,1,160,False,False,False,False,True,False,False,ref|XP_023039859.1| lymphocyte antigen 96 isof...


## Assign nicknames to sequences

In [3]:
# Aliases as a dictionary. It will map anything in the values to the key. For
# this example, "lymphocyte antigen 96", "MD2", and "MD-2" will all be replaced
# by "LY96". 
alias_dictionary = {"LY96":("lymphocyte antigen 96","MD2","MD-2"),
                    "LY86":("lymphocyte antigen 86","MD1","MD-1")}

df = topiary.create_nicknames(df,aliases=alias_dictionary)
df

Unnamed: 0,nickname,keep,species,name,sequence,uid,accession,xml,length,evalue,start,end,structure,low_quality,precursor,predicted,isoform,hypothetical,partial,raw_line
0,LY96,True,Homo sapiens,lymphocyte antigen 96 isoform 1 precursor,MLPFLFFSTLFSSIFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,ZRDNxgNWHM,NP_056179.4,../data/tiny.xml,160,2.53915e-116,1,160,False,False,True,False,True,False,False,ref|NP_056179.4| lymphocyte antigen 96 isoform...
1,LY96,True,Pan troglodytes,lymphocyte antigen 96 precursor,MLPFLFFSTLFSSIFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,zquSRbrlOA,NP_001123946.1,../data/tiny.xml,160,4.17052e-115,1,160,False,False,True,False,False,False,False,ref|NP_001123946.1| lymphocyte antigen 96 prec...
2,LY96,True,Pongo abelii,lymphocyte antigen 96,MLPFLFFSTLFSSIFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,hwvOhKSHWr,XP_002819229.1,../data/tiny.xml,160,1.1076800000000001e-114,1,160,False,False,False,False,False,False,False,ref|XP_002819229.1| lymphocyte antigen 96 [Pon...
3,LY96,True,Gorilla gorilla,lymphocyte antigen 96 precursor,MLPFLFFSTLFSSTFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,cCVAZtDgBe,NP_001266676.1,../data/tiny.xml,160,2.09351e-114,1,160,False,False,True,False,False,False,False,ref|NP_001266676.1| lymphocyte antigen 96 prec...
4,LY96,True,Hylobates moloch,lymphocyte antigen 96,MLPFLFFSTLFSSIFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,qiVpUYglzx,XP_032614344.1,../data/tiny.xml,160,2.75448e-114,1,160,False,False,False,False,False,False,False,ref|XP_032614344.1| lymphocyte antigen 96 [Hyl...
5,LY96,True,Nomascus leucogenys,lymphocyte antigen 96,MLPFLFFSTLFSSIFTEPQKQYWVCNSSDASISYTYCDKMQYPISI...,wLQULRDQyk,XP_003269495.1,../data/tiny.xml,160,6.5972e-111,1,160,False,False,False,False,False,False,False,ref|XP_003269495.1| lymphocyte antigen 96 [Nom...
6,LY96,True,Colobus angolensis palliatus,PREDICTED: lymphocyte antigen 96 isoform X1,MLPFLFFSTLFSSIFTEAQKHYWVCNSSDASISYTYCDKMQYPISI...,uWQUZwJHPz,XP_011792572.1,../data/tiny.xml,160,6.743599999999999e-111,1,160,False,False,False,True,True,False,False,ref|XP_011792572.1| PREDICTED: lymphocyte anti...
7,LY96,True,Chlorocebus sabaeus,lymphocyte antigen 96 isoform X1,MLPFLFFSTLFSSIFTEAQKHYWVCNSSDASISYTYCDKMQYPISI...,WvHPOPkHtB,XP_007999098.1,../data/tiny.xml,160,7.13828e-110,1,160,False,False,False,False,True,False,False,ref|XP_007999098.1| lymphocyte antigen 96 isof...
8,LY96,True,Rhinopithecus roxellana,lymphocyte antigen 96,MLPFLFFSTLFSSIFTEAQKHYWVCNSSDASISYTYCDKMQYPISI...,YEmpyViUYd,XP_010364303.1,../data/tiny.xml,160,3.66201e-109,1,160,False,False,False,False,False,False,False,ref|XP_010364303.1| lymphocyte antigen 96 [Rhi...
9,LY96,True,Piliocolobus tephrosceles,lymphocyte antigen 96 isoform X1,MLPFLFFSTLFSSIFTEAQKHYWVCNSSDASISYTYCDKIQYPISI...,ExMhNdHxmQ,XP_023039859.1,../data/tiny.xml,160,6.55105e-109,1,160,False,False,False,False,True,False,False,ref|XP_023039859.1| lymphocyte antigen 96 isof...
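
The comment in the cell above says anything in the dictionary values is mapped to its key. A minimal sketch of that idea, assuming a case-insensitive substring match on the `name` column, is shown here. `assign_nickname` is a hypothetical stand-in; topiary's own matching rules may differ.

```python
def assign_nickname(name, aliases):
    """Return the first alias key with a pattern found in name, else None.

    Hypothetical stand-in for the matching step, using a case-insensitive
    substring test.
    """
    lowered = name.lower()
    for nickname, patterns in aliases.items():
        if any(pattern.lower() in lowered for pattern in patterns):
            return nickname
    return None


alias_dictionary = {"LY96": ("lymphocyte antigen 96", "MD2", "MD-2"),
                    "LY86": ("lymphocyte antigen 86", "MD1", "MD-1")}

names = ["lymphocyte antigen 96 isoform 1 precursor",
         "PREDICTED: lymphocyte antigen 86",
         "toll-like receptor 4"]
nicknames = [assign_nickname(n, alias_dictionary) for n in names]
print(nicknames)
```

Names that match no alias come back as `None` in this sketch, which makes them easy to find and inspect by hand.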


## Find unique species identifiers

The idea here is to make sure we actually know what species each sequence comes from. NCBI and the Open Tree of Life use slightly different species names, so we look up an Open Tree identifier (the `ott` column) for each species. 


In [5]:
df = topiary.get_ott_id(df,phylo_context="Animals")
df

Unnamed: 0,nickname,keep,species,name,sequence,uid,accession,xml,length,evalue,...,structure,low_quality,precursor,predicted,isoform,hypothetical,partial,raw_line,orig_species,ott
0,LY96,True,Homo sapiens,lymphocyte antigen 96 isoform 1 precursor,MLPFLFFSTLFSSIFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,ZRDNxgNWHM,NP_056179.4,../data/tiny.xml,160,2.53915e-116,...,False,False,True,False,True,False,False,ref|NP_056179.4| lymphocyte antigen 96 isoform...,Homo sapiens,ott770315
1,LY96,True,Pan troglodytes,lymphocyte antigen 96 precursor,MLPFLFFSTLFSSIFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,zquSRbrlOA,NP_001123946.1,../data/tiny.xml,160,4.17052e-115,...,False,False,True,False,False,False,False,ref|NP_001123946.1| lymphocyte antigen 96 prec...,Pan troglodytes,ott417950
2,LY96,True,Pongo abelii,lymphocyte antigen 96,MLPFLFFSTLFSSIFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,hwvOhKSHWr,XP_002819229.1,../data/tiny.xml,160,1.1076800000000001e-114,...,False,False,False,False,False,False,False,ref|XP_002819229.1| lymphocyte antigen 96 [Pon...,Pongo abelii,ott770295
3,LY96,True,Gorilla gorilla,lymphocyte antigen 96 precursor,MLPFLFFSTLFSSTFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,cCVAZtDgBe,NP_001266676.1,../data/tiny.xml,160,2.09351e-114,...,False,False,True,False,False,False,False,ref|NP_001266676.1| lymphocyte antigen 96 prec...,Gorilla gorilla,ott417965
4,LY96,True,Hylobates moloch,lymphocyte antigen 96,MLPFLFFSTLFSSIFTEAQKQYWVCNSSDASISYTYCDKMQYPISI...,qiVpUYglzx,XP_032614344.1,../data/tiny.xml,160,2.75448e-114,...,False,False,False,False,False,False,False,ref|XP_032614344.1| lymphocyte antigen 96 [Hyl...,Hylobates moloch,ott732213
5,LY96,True,Nomascus leucogenys,lymphocyte antigen 96,MLPFLFFSTLFSSIFTEPQKQYWVCNSSDASISYTYCDKMQYPISI...,wLQULRDQyk,XP_003269495.1,../data/tiny.xml,160,6.5972e-111,...,False,False,False,False,False,False,False,ref|XP_003269495.1| lymphocyte antigen 96 [Nom...,Nomascus leucogenys,ott1029454
6,LY96,True,Colobus angolensis palliatus,PREDICTED: lymphocyte antigen 96 isoform X1,MLPFLFFSTLFSSIFTEAQKHYWVCNSSDASISYTYCDKMQYPISI...,uWQUZwJHPz,XP_011792572.1,../data/tiny.xml,160,6.743599999999999e-111,...,False,False,False,True,True,False,False,ref|XP_011792572.1| PREDICTED: lymphocyte anti...,Colobus angolensis palliatus,ott713997
7,LY96,True,Chlorocebus sabaeus,lymphocyte antigen 96 isoform X1,MLPFLFFSTLFSSIFTEAQKHYWVCNSSDASISYTYCDKMQYPISI...,WvHPOPkHtB,XP_007999098.1,../data/tiny.xml,160,7.13828e-110,...,False,False,False,False,True,False,False,ref|XP_007999098.1| lymphocyte antigen 96 isof...,Chlorocebus sabaeus,ott571316
8,LY96,True,Rhinopithecus roxellana,lymphocyte antigen 96,MLPFLFFSTLFSSIFTEAQKHYWVCNSSDASISYTYCDKMQYPISI...,YEmpyViUYd,XP_010364303.1,../data/tiny.xml,160,3.66201e-109,...,False,False,False,False,False,False,False,ref|XP_010364303.1| lymphocyte antigen 96 [Rhi...,Rhinopithecus roxellana,ott77083
9,LY96,True,Piliocolobus tephrosceles,lymphocyte antigen 96 isoform X1,MLPFLFFSTLFSSIFTEAQKHYWVCNSSDASISYTYCDKIQYPISI...,ExMhNdHxmQ,XP_023039859.1,../data/tiny.xml,160,6.55105e-109,...,False,False,False,False,True,False,False,ref|XP_023039859.1| lymphocyte antigen 96 isof...,Piliocolobus tephrosceles,ott1013938


### THINK WE SHOULD HAVE SOMETHING LIKE THIS FOR MR/GR?

In these example data, "Apteryx mantelli mantelli" is not found in the Open Tree of Life.  If you start typing the species name into the Open Tree of Life search engine (https://tree.opentreeoflife.org/), it pops up "Apteryx australis mantelli".  A quick Google search reveals these are the same species. You can update this with the following commands:

In [None]:
df.loc[df.species == "Apteryx mantelli mantelli","species"] = "Apteryx australis mantelli"
df.loc[df.species == "Apteryx australis mantelli","keep"] = True
df = topiary.get_ott_id(df,phylo_context="Animals")
df[df.ott.isnull()]

topiary.write_dataframe(df,"02_with-ott.csv")

## Check sequence identities using reverse BLAST

BLAST can pull down sequences that are homologous, but outside the clade of interest.  (For example, we might want to study TLR4, but BLAST also pulls up TLR2). To identify these sequences, we reverse BLAST each sequence against the human genome. We will keep only those sequences that pull up the proteins of interest from the human genome. 
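
The decision rule can be sketched without running BLAST at all. The example below assumes the call is made from the best (first) hit only and that patterns are matched as case-insensitive substrings; both are assumptions, and `call_from_hits` is a hypothetical helper rather than topiary's code.

```python
def call_from_hits(hit_descriptions, call_dict):
    """Return the call whose pattern matches the top reverse-BLAST hit.

    hit_descriptions: hit titles ordered best-first.
    call_dict: maps a call (e.g. "LY96") to a list of text patterns.
    Returns None when there are no hits or the top hit matches no pattern.
    """
    if not hit_descriptions:
        return None
    top_hit = hit_descriptions[0].lower()
    for call, patterns in call_dict.items():
        if any(pattern.lower() in top_hit for pattern in patterns):
            return call
    return None


call_dict = {"LY96": ["lymphocyte antigen 96"],
             "LY86": ["lymphocyte antigen 86"]}

# Toy hit lists, best hit first.
hits_a = ["lymphocyte antigen 96 isoform 1 precursor [Homo sapiens]",
          "lymphocyte antigen 86 [Homo sapiens]"]
hits_b = ["toll-like receptor 2 [Homo sapiens]",
          "lymphocyte antigen 96 isoform 1 precursor [Homo sapiens]"]

print(call_from_hits(hits_a, call_dict))   # best hit is LY96
print(call_from_hits(hits_b, call_dict))   # best hit is outside the family
```

A sequence like `hits_b` would be flagged `keep = False` under this rule, because its best human match is not one of the proteins of interest.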


In [None]:
# Read the data frame from the previously written file.  This is not necessary
# if you are running the notebook in order, but is super handy if you want to 
# start the notebook midway through the analysis. 
df = topiary.read_dataframe("02_with-ott.csv")

# Perform reverse blast, looking for hits on "lymphocyte antigen 96" and 
# "lymphocyte antigen 86" from the human genome, labeling them as LY96 
# and LY86 respectively. 

# Command to blast against NCBI nr database, selecting only human. To search
# based on more than one taxid, pass a list of taxids
# df = topiary.reverse_blast(df,
#                            call_dict={"LY96":["lymphocyte antigen 96"],
#                                       "LY86":["lymphocyte antigen 86"]},
#                            ncbi_rev_blast_db="nr",taxid=9606)

# Command to blast against a local database named GRCh38
df = topiary.reverse_blast(df,
                          call_dict={"LY96":["lymphocyte antigen 96"],
                                     "LY86":["lymphocyte antigen 86"]},
                          local_rev_blast_db="GRCh38")

# Write output file
topiary.write_dataframe(df,"03_reverse-blasted.csv")

# Print to notebook
df



## Lower redundancy of sequences

To make the computation faster and avoid bias from inclusion of many very similar sequences, we usually remove sequences that are highly similar to one another. The code below will combine sequences with identities greater than 0.9, using relatively intelligent criteria to choose the higher quality sequence. 

The `key_species` list is a list of species that will be given preference over others. If we specify `["Homo sapiens","Mus musculus"]` as key species and then find a human and chimp sequence are highly similar, the software will drop the chimp. The software does two loops.  In the first loop, it discards similar sequences within each species. This *will* drop sequences from key species. (If you had two human sequences 99% identical, it would drop the lower quality sequence).  In the second loop, the software discards similar sequences between species.  That loop *will not* drop sequences from key species.  
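
The two-loop logic can be sketched on plain dictionaries. This is an illustration of the idea, not topiary's algorithm: `remove_redundancy_sketch` is a hypothetical name, `difflib.SequenceMatcher` is used as a crude stand-in for pairwise sequence identity, and "quality" is simply the number of warning flags set on a record.

```python
from difflib import SequenceMatcher

QUALITY_FLAGS = ("low_quality", "predicted", "hypothetical", "partial")


def similarity(seq_a, seq_b):
    """Crude stand-in for pairwise identity (0.0 to 1.0)."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def penalty(record):
    """Count quality flags that are set; lower is better."""
    return sum(bool(record.get(flag)) for flag in QUALITY_FLAGS)


def drop_worse(rec_a, rec_b, key_species, protect_key_species):
    """Set keep=False on one record of a redundant pair."""
    a_key = rec_a["species"] in key_species
    b_key = rec_b["species"] in key_species
    if protect_key_species and a_key and b_key:
        return                      # never choose between two key species
    if protect_key_species and a_key != b_key:
        loser = rec_b if a_key else rec_a
    else:
        # Drop the record with more flags; on a tie, drop the second one.
        loser = rec_a if penalty(rec_a) > penalty(rec_b) else rec_b
    loser["keep"] = False


def remove_redundancy_sketch(records, cutoff=0.90, key_species=()):
    # Loop 1: within each species (key species are NOT protected).
    # Loop 2: between species (key species ARE protected).
    for within_species in (True, False):
        for i, rec_a in enumerate(records):
            for rec_b in records[i + 1:]:
                if not (rec_a["keep"] and rec_b["keep"]):
                    continue
                same = rec_a["species"] == rec_b["species"]
                if same != within_species:
                    continue
                if similarity(rec_a["sequence"], rec_b["sequence"]) > cutoff:
                    drop_worse(rec_a, rec_b, key_species,
                               protect_key_species=not within_species)
    return records


# Toy data: two identical human sequences, an identical chimp, a distinct fish.
records = [
    {"species": "Homo sapiens", "sequence": "MLPFLFFSTLFSSIFTEAQKQYWVCNSSDA",
     "predicted": False, "keep": True},
    {"species": "Homo sapiens", "sequence": "MLPFLFFSTLFSSIFTEAQKQYWVCNSSDA",
     "predicted": True, "keep": True},
    {"species": "Pan troglodytes", "sequence": "MLPFLFFSTLFSSIFTEAQKQYWVCNSSDA",
     "predicted": False, "keep": True},
    {"species": "Danio rerio", "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLI",
     "predicted": False, "keep": True},
]
remove_redundancy_sketch(records, 0.90,
                         key_species=["Homo sapiens", "Mus musculus"])
kept = [(r["species"], r["predicted"]) for r in records if r["keep"]]
print(kept)
```

In the toy data, the first loop drops the predicted human duplicate and the second loop drops the chimp in favor of the human, while the divergent fish sequence is untouched.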

In [None]:
# Read the data frame from the previously written file.  This is not necessary
# if you are running the notebook in order, but is super handy if you want to 
# start the notebook midway through the analysis. 
df = topiary.read_dataframe("03_reverse-blasted.csv")

# Remove redundancy
key_species = ["Homo sapiens","Mus musculus","Monodelphis domestica","Gallus gallus",
               "Xenopus laevis","Danio rerio"] #,"Tachyglossus aculeatus", "Ornithorhynchus anatinus"
df = topiary.remove_redundancy(df,0.90,key_species=key_species)

# Write out file
topiary.write_dataframe(df,"04_removed-redundancy.csv")

# Print in notebook
df

## Check and edit reduced data frame

At this point, you might want to look over the set of sequences and see if you like the result of the automatic redundancy reduction.  Some things to look for:

+ Is your sequence set still huge (>1000) or too small (<100)? (If so, you might want to play with the redundancy cutoff above). 
+ Do you still have the sequences from the modern species you care about? (If not, you may want to manually set those sequences to `keep = True`). 

You could load up the nonredundant set in Excel, but I *strongly* recommend you do these manipulations using the pandas dataframe. Pandas slicing allows you to easily pull out rows you care about based on selection criteria. The cells below show a few of these slicing approaches as templates. 


In [None]:
# How many sequences are there in the dataset?
df = topiary.read_dataframe("04_removed-redundancy.csv")
print(np.sum(df.keep))

In [None]:
# How many of each of the key species (defined above) are in the dataset?
key_species = ["Homo sapiens","Mus musculus","Monodelphis domestica","Gallus gallus",
               "Xenopus laevis","Danio rerio"] #,"Tachyglossus aculeatus", "Ornithorhynchus anatinus"

for k in key_species:
    print(k,np.sum(np.logical_and(df.keep,df.species==k)))

In [None]:
# Show all of the human sequences we're keeping
df[np.logical_and(df.keep,df.species=="Homo sapiens")]

In [None]:
# Show all of the Danio rerio sequences we started with
df1 = topiary.read_dataframe("01_initial-hits.csv")
df1 = df1.loc[df1.species == "Danio rerio",:]
df1

## Align the sequences using MUSCLE

We now have a database of sequences and need to align them to one another.  The code below will create a file called `05_to-align.fasta` that has all of the sequences flagged with `keep = True`. The sequences will be assigned "pretty" names that have a defined structure: `ortholog_call|species|accession`. 
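
As an illustration of the naming scheme, the sketch below writes FASTA text for kept records only. It assumes the ortholog call lives in the `nickname` column, which may not match what topiary uses internally; `pretty_name` and `to_fasta` are hypothetical helpers.

```python
def pretty_name(record):
    """Build 'ortholog_call|species|accession' for one record."""
    return "|".join([record["nickname"], record["species"], record["accession"]])


def to_fasta(records):
    """Return FASTA text for every record flagged keep=True."""
    lines = []
    for record in records:
        if record["keep"]:
            lines.append(">" + pretty_name(record))
            lines.append(record["sequence"])
    return "\n".join(lines) + "\n"


# Toy records with truncated sequences, for illustration only.
records = [
    {"nickname": "LY96", "species": "Homo sapiens",
     "accession": "NP_056179.4", "sequence": "MLPFLFFSTL", "keep": True},
    {"nickname": "LY96", "species": "Pan troglodytes",
     "accession": "NP_001123946.1", "sequence": "MLPFLFFSTL", "keep": False},
]
fasta_text = to_fasta(records)
print(fasta_text)
```

Keeping the accession as the last field makes it straightforward to match alignment rows back to the dataframe later.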

In [None]:
# Read the data frame from the previously written file.  This is not necessary
# if you are running the notebook in order, but is super handy if you want to 
# start the notebook midway through the analysis. 
df = topiary.read_dataframe("04_removed-redundancy.csv")

# Write fasta file. 
topiary.write_fasta(df,"05_to-align.fasta")


topiary.run_muscle(input_fasta="05_to-align.fasta",
                   output_fasta="06_aligned.fasta")

## Manually edit the alignment using AliView

XX NOTE THIS IS ASR, SO SLIGHTLY DIFFERENT THAN OTHER TREE INFERENCE PROBLEMS XX
Human brains are still better than computers at identifying patterns in sequence data. We're going to edit the alignment manually. Two consequences of this:

1. We need to publish our final alignment with our manuscript. It needs to be available for evaluation by readers and/or to reproduce the work. 

2.  If we're doing ancestral sequence reconstruction and we *delete* a column, we need to make sure we're okay with not including that column in the final reconstruction.  This probably makes sense for N- and C-terminal extensions, but makes less sense for columns in the middle of the protein. 

With that in mind, open the `06_aligned.fasta` file in AliView and do the following:

1. <span style="color:blue">*Look for sequences that are way longer or shorter than the average.*</span>  Super long sequences may align poorly--and suck other sequences into that poor alignment.  Super short sequences are also difficult to align and provide little taxonomic information.  Delete these sequences by selecting them and going to `Edit->Delete selected`.  When you reimport the alignment into your dataframe, the software will set `keep = False` for any sequence you deleted. 

2. <span style="color:blue">*Trim random long N-terminal and C-terminal extensions from alignment.*</span>  Deleting sequences that align poorly will not bias your tree, but including incorrectly aligned regions might. Select these sequence regions and go to `Edit->Clear selected bases`. 

3. <span style="color:blue">*Manually realign problematic regions.*</span>  Sometimes, alignment programs will make obvious mistakes, where the same sequence element is aligned different ways right next to one another.  To correct for this, you can manually move groups of amino acids by selecting them and dragging them right or left. You can also select whole blocks of the alignment and then go to `Align->Realign selected block` to re-run MUSCLE on that block. 

4. <span style="color:blue">*Remove empty columns and rows.*</span> Remove gaps-only columns (`Edit->Delete gap-only columns`) and any empty sequences (`Edit->Delete empty sequences`). 

5. <span style="color:blue">*Save out the edited alignment*</span> as:
```
06_aligned-edited.fasta
```
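
Before opening AliView, it can help to get a quick list of candidates for steps 1 and 4. The sketch below flags sequences whose ungapped length is far from the median and strips gap-only columns. The 0.5x and 1.5x thresholds are arbitrary assumptions, the helpers are hypothetical, and the code assumes every aligned sequence has the same length. It does not replace looking at the alignment by eye.

```python
from statistics import median


def length_outliers(alignment, low=0.5, high=1.5):
    """Return names whose ungapped length is < low or > high times the median."""
    lengths = {name: len(seq.replace("-", "")) for name, seq in alignment.items()}
    mid = median(lengths.values())
    return [name for name, n in lengths.items() if n < low * mid or n > high * mid]


def drop_gap_only_columns(alignment):
    """Remove columns that contain '-' in every sequence."""
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if any(ch != "-" for ch in col)]
    rows = ["".join(chars) for chars in zip(*kept)] if kept else [""] * len(names)
    return dict(zip(names, rows))


# Toy alignment: sequence "c" is much shorter than the others.
alignment = {"a": "MK-LV-", "b": "MR-LI-", "c": "M-----"}
outliers = length_outliers(alignment)
trimmed = drop_gap_only_columns(alignment)
print(outliers)
print(trimmed)
```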

## Load newly aligned sequences into our dataframe and write out .csv file for tree building.

We will now load our alignment back into our dataframe.  This will create a new column called "alignment" with the aligned sequences and will also set all sequences *not* in the alignment file to have "keep = False".  We will then write out the aligned sequences into a new .csv file.
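
The bookkeeping in this step can be sketched as follows. The example assumes alignment rows are matched to dataframe rows by accession, taken as the last `|`-separated field of the FASTA header; topiary may match on a different field. `parse_fasta` and `load_alignment` are hypothetical helpers.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    sequences = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            sequences[header] = ""
        elif header is not None:
            sequences[header] += line   # sequences may span several lines
    return sequences


def load_alignment(records, fasta_text):
    """Attach aligned sequences by accession; unmatched records get keep=False."""
    by_accession = {header.split("|")[-1]: seq
                    for header, seq in parse_fasta(fasta_text).items()}
    for record in records:
        aligned = by_accession.get(record["accession"])
        record["alignment"] = aligned
        record["keep"] = aligned is not None
    return records


# Toy data: the second record was deleted from the alignment by hand.
records = [{"accession": "NP_056179.4", "keep": True},
           {"accession": "XP_002819229.1", "keep": True}]
fasta_text = ">LY96|Homo sapiens|NP_056179.4\nMLPF--LFF\nSTL\n"
load_alignment(records, fasta_text)
print(records)
```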

In [None]:
# Read the data frame from the previously written file.  This is not necessary
# if you are running the notebook in order, but is super handy if you want to 
# start the notebook midway through the analysis. 
df = topiary.read_dataframe("04_removed-redundancy.csv")

# Load the alignment into the data frame
df = topiary.read_fasta(df,"06_aligned-edited.fasta",load_into_column="alignment")

# Write out file
topiary.write_dataframe(df,"07_seq-database.csv")

# Print df to notebook
df

## AFTER HERE NOT UPDATED

In [None]:
import ete3

In [None]:
## KO note -  
## If you make manual changes to your .csv file after this step, for example if you
## decide to add some sequences from taxa that were not found in your ncbi search 
## and then add those sequences by hand into your .csv file, it seems to cause issues.
## I found that my issues were remedied if I re-loaded my .csv into a pandas dataframe 
## and then wrote it back out into a new .csv file. Maybe this puts everything into the 
## right dtype (?)

# Read in the .csv file that has been manually edited
df = topiary.read_dataframe("07.1_seq-database.csv")

# Load the alignment into the data frame
df = topiary.load_fasta(df,"06_aligned-edited.fasta",load_into_column="alignment")

# Write out file
topiary.write_dataframe(df,"07.2_seq-database.csv")

# Print df to notebook
df

## 10. Generate ML phylogenetic tree using RAxML

Our next steps are to find a good evolutionary model that describes our data and to build a maximum likelihood phylogenetic tree.  This is likely something you will want to run on a high-performance computing cluster.  

You need to copy `07_seq-database.csv` (or your most current version) and `run_raxml.srun` (from Harms Lab GitHub asr-protocol/template/copy-to-hpc/) to a working directory on whatever server you use. *Hopefully it already has "raxmlHPC" installed. (KO note: what does this mean?)*

You then need to execute two commands, using the helper `run_raxml.srun` script. See below:

### This is what your run_raxml.srun file should look like.

    #!/bin/bash -l
    #SBATCH --account=harmslab      ### change this to your actual account for charging
    #SBATCH --job-name=raxml        ### job name
    #SBATCH --output=hostname.out   ### file in which to store job stdout
    #SBATCH --error=hostname.err    ### file in which to store job stderr
    #SBATCH --partition=long        ### can be short long fat longfat
    #SBATCH --time=07-00:00:00      ### Run for 7 days
    #SBATCH --nodes=1               ### Run on a single node
    #SBATCH --ntasks-per-node=1     ### Run one job on the node
    #SBATCH --cpus-per-task=28      ### Use 28 cores to run job (should match threads below)

    module load gcc

    # Find the best phylogenetic model
    run-raxml model -c 07_seq-database.csv -o find-model -T 28

    # Construct the ML tree
    run-raxml ml -c 07_seq-database.csv -m `cat find-model/best-model.txt` -o ml-tree -T 28

    # Construct ancestors on the ML tree.
    run-raxml anc -c 07_seq-database.csv -m `cat find-model/best-model.txt` -t ml-tree/02_ml-tree.newick -o ml-anc -T 28

    # Construct ancestors on the reconciled generax tree with supports
    # run-raxml anc -c topiary.csv -m `cat find-model/best-model.txt` -t generax-run/reconciled-tree.newick -o ml-anc_reconciled --anc-support-tree generax-run/reconciled-tree-with-supports.newick -T 28

#### What does this script say?
The computing cluster at the University of Oregon uses SLURM, which reads the `#SBATCH` directives at the top of the script. Change these to match your cluster's scheduler and account settings.

The Environment Modules package is a tool that simplifies shell initialization and lets users easily modify their environment during a session using modulefiles. 
```
module load gcc
```

The first `run-raxml` command searches through a collection of different models of evolutionary rate distribution, amino acid substitution probability, etc. and finds the model that gives the highest likelihood.  

```
run-raxml model -c 07_seq-database.csv -o find-model -T 28
```

This will print out the best model at the end, along with an `AIC Prob` score.  Hopefully, that number is close to 1.0, meaning that the chosen model is clearly a better choice than all the others. (If not, we may need to reconstruct our ancestors using different models to make sure the results are robust to the choice of model).  If you want to see the likelihoods and AIC probabilities for all models, check out `find-model/model-comparison.csv`. 
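
To see why a value near 1.0 is reassuring, it helps to work through the arithmetic. The sketch below assumes `AIC Prob` is the usual Akaike weight, $w_i = e^{-\Delta_i/2} / \sum_j e^{-\Delta_j/2}$ with $\Delta_i = AIC_i - AIC_{min}$; check the `run-raxml` output to confirm that is what it reports. The model names and log-likelihoods are invented for illustration.

```python
import math


def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood


def akaike_weights(aic_values):
    """Convert AIC scores into relative model probabilities that sum to 1."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (value - best)) for value in aic_values]
    total = sum(relative)
    return [r / total for r in relative]


# Invented (log-likelihood, free parameters) pairs for three toy models.
models = {"model_A": (-12000.0, 1),
          "model_B": (-12010.0, 1),
          "model_C": (-12025.0, 1)}

scores = [aic(lnL, k) for lnL, k in models.values()]
weights = akaike_weights(scores)
for name, weight in zip(models, weights):
    print(f"{name}: {weight:.6f}")
```

A gap of only ten log-likelihood units is already enough to push nearly all of the weight onto the best model, which is why a weight well below 1.0 signals that several models fit about equally well.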

We next need to build the maximum likelihood phylogenetic tree for our alignment. To do that, run the following.  It will automatically grab the best model from the previous calculation. 
```
run-raxml ml -c 07_seq-database.csv -m `cat find-model/best-model.txt` -o ml-tree -T 28
```

Constructing ancestors on the ML tree is not necessary at this step, but you can delete the comment here and have it run if you would like to see these ancestors. We cannot yet construct ancestors on the reconciled generax tree with supports, so leave this commented out.

Once complete, you can download the resulting `find-model/` and `ml-tree/` directories to your local computer. 

## 11. Evaluate tree

The next step is to look at the ML tree and make sure it is well supported and sensible. First, write out the tree with useful names for each taxon by running the following.

KO note: When trying to open my ML tree in FigTree I got an error saying there was a missing close parenthesis. It turns out I had some unwanted punctuation in some of my manually entered accession numbers. This was causing problems with FigTree interpreting where to arrange species on a tree. Updated step 2 cell addresses this early on.

In [None]:
# Read the data frame from the previously written file.  This is not necessary
# if you are running the notebook in order, but is super handy if you want to 
# start the notebook midway through the analysis. 
df = pd.read_csv("07_seq-database.csv")

# Make the sequence names on the output tree human readable. 
topiary.util.uid_to_pretty(df,"ml-tree/02_ml-tree.newick",out_file="08_ml-tree.newick")

Open `08_ml-tree.newick` in FigTree.  On the left-hand panel, go to "Branch-labels" and select "Display: label". This will label each branch with its SH support.  SH support values go from 0 (no support at all) to 100 (excellent support).  Then look at the following:

+ *Are the major clades well supported?* Major branch points should (hopefully) have $SH \ge 85$. If not, we may need to do our reconstructions on multiple versions of the tree to see if our ancestral sequences are robust to the tree topology. 
+ *Is the species tree approximately correct?* Do you see birds with birds, mammals with mammals, etc.?   If not, there could be a problem with the current alignment, or we might need to add more sequences to the alignment. 
+ *Are there long branches?* A long branch is one where you have a bunch of sequence change (say 0.7 subs/site) without branching.  This means the evolutionary model runs and runs without getting input from branches.  This can lead to bias and will certainly lead to very poor reconstructions of ancestors near the long branch. If there is a long branch for a single sequence, delete it from the alignment. It is too divergent or too poorly aligned to include effectively. If a long branch occurs between clades, you can try to find new sequences that "break" the branch.  For example, if there is a long branch between bony fishes and birds, adding amphibian sequences will cut the branch (about) in half and should improve the inference. 
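
For big trees, it can be tedious to hunt for long single-sequence branches by eye. The sketch below pulls terminal branch lengths out of a Newick string with a regular expression and reports the ones above a cutoff. It is a rough helper under my own assumptions, not a topiary function: it assumes unquoted taxon names, it only looks at branches leading to individual sequences (not long branches between clades), and the tree and the 0.7 cutoff are illustrative.

```python
import re

# A leaf is a label that directly follows "(" or "," and is followed by ":length"
LEAF = re.compile(r"(?<=[(,])([^(),:;]+):([0-9.eE+-]+)")

def long_terminal_branches(newick, cutoff=0.7):
    """Return {leaf_name: branch_length} for terminal branches longer than cutoff."""
    hits = {}
    for name, length in LEAF.findall(newick):
        if float(length) > cutoff:
            hits[name.strip()] = float(length)
    return hits

# Made-up tree with support values on the internal nodes
tree = "((human:0.05,mouse:0.08)98:0.10,(chicken:0.12,zebrafish:0.95)87:0.20,frog:0.30);"
print(long_terminal_branches(tree))
```

To use it on a real tree, read the file into a string first, for example `open("08_ml-tree.newick").read()`.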

## 12. Iteratively add sequences to alignment

At this point, you may need to go back and add sequences to the alignment. To do so, you have a couple of options.  One possibility is to open `07_seq-database.csv` in Excel and manually add any new sequences to the database. Make sure you fill out every column, including pasting the sequence into the `alignment` column. You could also add new rows via pandas (see #5 for some examples). 
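
Because it is easy to forget a column when adding rows by hand, here is a small sketch of a pandas helper that refuses to add a row unless every existing column is filled in. The helper name, column names, accessions, and sequences are all hypothetical; your real database will have more columns.

```python
import pandas as pd

def append_sequence(df, new_row):
    """Append one row to df, refusing rows that leave any existing column empty."""
    missing = [col for col in df.columns
               if col not in new_row or pd.isna(new_row[col]) or new_row[col] == ""]
    if missing:
        raise ValueError(f"new row is missing values for: {missing}")
    # Restrict to the existing columns so stray keys do not create new columns
    row_df = pd.DataFrame([new_row], columns=df.columns)
    return pd.concat([df, row_df], ignore_index=True)

# Hypothetical miniature database (illustration only)
df_example = pd.DataFrame({
    "species": ["Homo sapiens"],
    "accession": ["FAKE_0001.1"],
    "sequence": ["MKTLLV"],
    "alignment": ["MKT-LLV"],
})
df_new = append_sequence(df_example, {
    "species": "Xenopus laevis",
    "accession": "FAKE_0002.1",
    "sequence": "MKSLIV",
    "alignment": "MKS-LIV",
})
print(df_new)
```

Calling `append_sequence` with an incomplete row raises a `ValueError` that names the missing columns, rather than silently writing blanks into the database.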

After you've edited the sequence database, save it out as `09_seq-database.csv`, then write it out as a fasta file.  You can then load the fasta file into AliView, edit the alignment, and repeat steps 6-9 until you are satisfied with the ML tree.  (The following code block is an example of what you might run in a Jupyter notebook.) 

```
# Read in the manually edited sequence database
df = pd.read_csv("09_seq-database.csv")

# Write out a fasta file. 
topiary.write_fasta(df,"10_new-alignment.fasta",seq_name="pretty",seq_column="alignment")

### edit in AliView and save as 11_aligned-edited.fasta

# Load the alignment into the data frame
df = topiary.load_fasta(df,"11_aligned-edited.fasta",load_into_column="alignment")

# Write out file
df.to_csv("12_seq-database.csv",index=False)

```

Copy `12_seq-database.csv` to your favorite cluster and use it to calculate a new maximum likelihood tree using an edited `run_raxml.srun` script that looks like this:

```
#!/bin/bash -l
#SBATCH --account=harmslab      ### change this to your actual account for charging
#SBATCH --job-name=raxml        ### job name
#SBATCH --output=hostname.out   ### file in which to store job stdout
#SBATCH --error=hostname.err    ### file in which to store job stderr
#SBATCH --partition=long        ### can be short long fat longfat
#SBATCH --time=07-00:00:00      ### Run for 7 days
#SBATCH --nodes=1               ### Run on a single node
#SBATCH --ntasks-per-node=1     ### Run one job on the node
#SBATCH --cpus-per-task=28      ### Use 28 cores to run job (should match threads below)

module load gcc

# Find the best phylogenetic model
# run-raxml model -c 07_seq-database.csv -o find-model -T 28

# Construct the ML tree
run-raxml ml -c 12_seq-database.csv -m `cat find-model/best-model.txt` -o 13_ml-tree -T 28

# Construct ancestors on the ML tree.
run-raxml anc -c 12_seq-database.csv -m `cat find-model/best-model.txt` -t 13_ml-tree/02_ml-tree.newick -o 14_ml-anc -T 28

# Construct ancestors on the reconciled generax tree with supports
# run-raxml anc -c topiary.csv -m `cat find-model/best-model.txt` -t generax-run/reconciled-tree.newick -o ml-anc_reconciled --anc-support-tree generax-run/reconciled-tree-with-supports.newick -T 28

```

## 13. Assign paralogs

#### KO note: Not sure if we have to do this...

Once you have a tree and alignment that you are happy with, you now need to identify which sequence corresponds to which paralog. The reverse BLAST protocol we used above is error-prone, particularly for highly diverged sequences. 

In [None]:
# Read the data frame from the previously written file.  This is not necessary
# if you are running the notebook in order, but is super handy if you want to 
# start the notebook midway through the analysis. 
df = pd.read_csv("07_seq-database.csv") # or "12_seq-database.csv"

df.loc[df.accession == "OCT56128.1","paralog"] = "LY96a"
df.loc[df.accession == "XP_018120758.1","paralog"] = "LY96b"
df.loc[df.accession == "XP_040288185.1","paralog"] = "LY86b"
df.loc[df.accession == "XP_029446462.1","paralog"] = "LY86b"
df.loc[df.accession == "XP_029447354.1","paralog"] = "LY86a"
df.loc[df.accession == "CAF5201687.1","paralog"] = "LY96a"
df.loc[df.accession == "XP_033791070.1","paralog"] = "LY96a"
df.loc[df.accession == "XP_040288186.1","paralog"] = "LY86a"

# Write out the database with the updated paralog calls
df.to_csv("15_seq-database.csv",index=False)


# Make the sequence names on the output tree human readable. 
#topiary.util.uid_to_pretty("13_ml-tree/final-tree.newick","15_ml-tree.newick",df)


Now, load up `15_ml-tree.newick` in FigTree. Look for well-supported clades that contain sequences with known paralogy and make sure all of the sequences in that clade have the same paralog name.  For example, if you have a clade with SH = 100 that contains human S100A9, but also a pangolin protein labeled S100A8, the pangolin protein is labeled incorrectly. 

Either via pandas or Excel, edit the `paralog` column of your sequence database with the correct call for each ortholog. 
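
One thing to watch for when editing via pandas: a `df.loc[df.accession == ...]` assignment does nothing, without any warning, if the accession has a typo and matches no rows. The sketch below wraps the same assignment in a helper that reports accessions it could not find. The helper name and the `"XP_TYPO.1"` accession are hypothetical; the other two accessions are taken from the cell above.

```python
import pandas as pd

def assign_paralogs(df, calls):
    """Apply {accession: paralog} calls; return (updated_df, accessions_not_found)."""
    out = df.copy()
    not_found = []
    for accession, paralog in calls.items():
        mask = out["accession"] == accession
        if mask.any():
            out.loc[mask, "paralog"] = paralog
        else:
            # Nothing matched, so record it instead of failing silently
            not_found.append(accession)
    return out, not_found

# Miniature example database (illustration only)
df_example = pd.DataFrame({
    "accession": ["OCT56128.1", "XP_018120758.1"],
    "paralog": ["unassigned", "unassigned"],
})
calls = {"OCT56128.1": "LY96a", "XP_018120758.1": "LY96b", "XP_TYPO.1": "LY86a"}
df_called, not_found = assign_paralogs(df_example, calls)
print(df_called)
print("not found:", not_found)
```

If `not_found` is not empty, check those accessions against the database before writing out the new `.csv` file.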

## 14. Generate species tree(s) and evaluate the maximum likelihood tree with bootstrapping

Open a command or terminal window and log into your cluster account. In your home directory, start an interactive job. Example:
```
    srun --account=mylab --pty bash
```

Download the most recent version of topiary from the Harms Lab GitHub and prepare the topiary directory for installing python packages:

```
git clone https://github.com/harmslab/topiary.git
cd topiary
python setup.py install
```
`cd` into the working directory containing your `07_seq-database.csv` file (or its most recent version) and use the `setup-generax` script to create a directory called `generax-trees`. This directory contains the bootstrap directories and an `ml` directory holding a newly made species tree, your alignment, and information about the maximum likelihood tree.

```
# Usage: setup-generax -c CSV -t TREE -m MODEL -b BS_DIR -o OUTPUT
# Example:
setup-generax -c 07_seq-database.csv -t ml-tree/01_make-ml-tree/alignment.raxml.bestTree -m `cat find-model/best-model.txt` -b ml-tree/01_make-ml-tree/ -o generax-trees
```
`cd` into the OUTPUT directory you just made (it should be called `generax-trees`) and launch the generax bootstrapping script. This is very computationally heavy.
```
    bash 00_launch_generax_bootstrap.sh 20  
```
Note: changing "20" to "1" will make this step run faster because you will be allocating 1 job to 1 node, instead of 20 jobs to 1 node. You may also want to edit the `00_launch_generax_bootstrap.sh` file to run for longer than 1 day. For large alignments (~1000 sequences), allocate two weeks of run time (a "long" run). You might need even more than this.

## 15. Assemble bootstraps
When it's all done running, in the "generax-trees" directory run the generax bootstrap assemble script:
```
    bash 01_assemble_generax_bootstrap.sh
```

## 16. Reconstruct ancestors on tree with supports

If all of that goes smoothly, then go back to your original working directory that contains all of your files as well as ml and generax-trees directories.

Reconstruct ancestors using the `run_raxml.srun` script by removing the comment from the very last line (after # Construct ancestors on the reconciled generax tree with supports) and commenting out the previous raxml lines. Make sure that this last line of code points to each of the modified files you wish to include in the reconstruction. Your .srun file should look like this:
```
#!/bin/bash -l
#SBATCH --account=harmslab      ### change this to your actual account for charging
#SBATCH --job-name=raxml        ### job name
#SBATCH --output=hostname.out   ### file in which to store job stdout
#SBATCH --error=hostname.err    ### file in which to store job stderr
#SBATCH --partition=long        ### can be short long fat longfat
#SBATCH --time=07-00:00:00      ### Run for 7 days
#SBATCH --nodes=1               ### Run on a single node
#SBATCH --ntasks-per-node=1     ### Run one job on the node
#SBATCH --cpus-per-task=28      ### Use 28 cores to run job (should match threads below)

module load gcc

# Find the best phylogenetic model
# run-raxml model -c 07_seq-database.csv -o find-model -T 28

# Construct the ML tree
# run-raxml ml -c 07_seq-database.csv -m `cat find-model/best-model.txt` -o ml-tree -T 28

# Construct ancestors on the ML tree.
# run-raxml anc -c 07_seq-database.csv -m `cat find-model/best-model.txt` -t ml-tree/02_ml-tree.newick -o ml-anc -T 28

# Construct ancestors on the reconciled generax tree with supports
run-raxml anc -c topiary.csv -m `cat find-model/best-model.txt` -t generax-run/reconciled-tree.newick -o ml-anc_reconciled --anc-support-tree generax-run/reconciled-tree-with-supports.newick -T 28
```
Then, in your working directory, run your raxml script:
```
    sbatch run_raxml.srun
```

## 17. Evaluate ancestors
Download your cluster working directory onto your computer for safe-keeping and further analysis.
There should be a file called `10_ancestors_all.newick`. This is your phylogenetic tree, constructed with information from the protein sequences you found and the known species tree. Open `10_ancestors_all.newick` in FigTree or your tree viewer of choice and show the branch support values. Branch support values indicate how confident one can be that the branch represents some "signal" present in the data. In other words, these values indicate how many times out of 100 (where 100/100 = 1) the same branch was observed when the phylogenetic reconstruction was repeated on re-sampled sets of the data during the bootstrap analysis. Each reconstructed ancestral sequence is associated with a branch support value. We use these values to evaluate how confident we are in the reconstructed sequence. 
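
To make the "how many times out of 100" idea concrete, here is a toy sketch of how a bootstrap support value can be counted. Each tree is represented as a set of clades, and each clade is a `frozenset` of taxon names. This is only a conceptual illustration under my own simplifications (rooted clades, four made-up replicates); it is not how generax or topiary implement the calculation.

```python
def clade_support(reference_clades, replicate_trees):
    """Percent of replicate trees (each a set of clades) that contain each reference clade."""
    n = len(replicate_trees)
    return {
        clade: 100.0 * sum(clade in tree for tree in replicate_trees) / n
        for clade in reference_clades
    }

# Clades on the (hypothetical) reference tree
hm = frozenset({"human", "mouse"})
cz = frozenset({"chicken", "zebrafish"})

# Four made-up bootstrap replicates, each reduced to the clades it contains
replicates = [
    {hm, cz},
    {hm, frozenset({"chicken", "frog"})},
    {hm, cz},
    {frozenset({"human", "chicken"}), frozenset({"mouse", "zebrafish"})},
]
support = clade_support([hm, cz], replicates)
for clade, value in support.items():
    print(sorted(clade), value)
```

Here the human/mouse clade appears in three of four replicates (support of 75), while the chicken/zebrafish clade appears in two of four (support of 50).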

Furthermore, we can evaluate the ambiguity of our reconstructed ancestral sequences. Inside the `ml-anc/02_final-ancestors/` directory you will find `ancNodeXXX.pdf` files. Opening one of these files shows a graph with posterior probability on the y-axis and alignment site on the x-axis. The black dots show the posterior probability that the amino acid called in the ancestor is the correct amino acid. The red dots show the posterior probability of the next most probable amino acid that could have been called at the same position. The dashed line represents the posterior probability value if the site showed equal preference for any amino acid (ambiguous site), suggesting that red dots above this value represent amino acid substitutions that occurred frequently while red dots below this line suggest that the site's preference is skewed towards a specific amino acid. The histogram to the right displays the distribution of where these black and red points lie in the range of posterior probability. We have high confidence in our reconstructed ancestral node if the black and red histograms have little to no overlap.
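
If you would rather have numbers than eyeball the plots, the sketch below summarizes one ancestor from two lists of per-site posterior probabilities: the black dots (most likely amino acid) and the red dots (next most likely). The function name, the example values, and the 0.25 cutoff for calling a site "ambiguous" are my own assumptions for illustration; they are not values defined by this protocol, and you would need to pull the real per-site probabilities from your own output files.

```python
def summarize_ancestor(ml_pp, alt_pp, alt_cutoff=0.25):
    """Return (mean posterior probability of the ML states, indices of ambiguous sites)."""
    if len(ml_pp) != len(alt_pp):
        raise ValueError("ml_pp and alt_pp must have one entry per site")
    mean_pp = sum(ml_pp) / len(ml_pp)
    # A site is flagged when the second-best amino acid is still fairly probable
    ambiguous = [i for i, alt in enumerate(alt_pp) if alt >= alt_cutoff]
    return mean_pp, ambiguous

# Made-up posterior probabilities for a four-site ancestor
ml_pp = [0.99, 0.55, 0.97, 0.48]
alt_pp = [0.01, 0.40, 0.02, 0.45]
mean_pp, ambiguous_sites = summarize_ancestor(ml_pp, alt_pp)
print(f"mean posterior probability: {mean_pp:.4f}")
print("ambiguous sites (0-indexed):", ambiguous_sites)
```

An ancestor with a high mean posterior probability and few flagged sites corresponds to the case above where the black and red histograms barely overlap.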