# <span style="color:red">Warning: This tutorial is currently under development.</span>


# Introduction

This tutorial will walk you through a preliminary similarity searching analysis using scripts in the AMOEBAE toolkit. As a simple example, we will consider the distribution of orthologues of subunits of the Adaptor Protein (AP) 2 vesicle adaptor complex, and several other membrane-trafficking proteins, in five model eukaryotes: the plant *Arabidopsis thaliana*, the yeast *Saccharomyces cerevisiae*, the fungus *Allomyces macrogynus*, the amoeba *Dictyostelium discoideum*, and the pathogenic protist *Trypanosoma brucei*. AP-2 subunits are homologous to subunits of other AP complexes (Robinson, 2004; Hirst et al., 2011), and published work has traced their evolution among plants (Larson et al., 2019), Fungi (Barlow et al., 2014), and trypanosomatid parasites (Manna et al., 2013). Thus, the protein subunits of the AP-2 complex provide a useful test of how well similarity searching methods distinguish between orthologues and paralogues, and the results can be compared to those of previous comprehensive studies. The membrane trafficking proteins Sec12 (a component of the COPII vesicle coat complex), SNAP33 (a Qbc-SNARE), and Rab2 (a small GTPase) are included to further explore the potential sources of error involved in identification of orthologous proteins. The end result of running this code successfully is a spreadsheet summarizing the results of the similarity searches, as well as a plot of those results.


## Objectives


-  Perform similarity searches using the BLASTP, TBLASTN, and HMMer algorithms simultaneously using AMOEBAE code.

-  Apply a reciprocal-best-hit search strategy using AMOEBAE code.

- Practice interpreting interesting similarity search results obtained using AMOEBAE.
 


## Requirements

- MacOS or Linux operating system (or possibly a work-around on Windows, although this has not been tested).

- Before running this code, you will need to have set up AMOEBAE according to the instructions in the main documentation file.

- The code in this notebook will take approximately <span style="color:red">XXXXXX</span> minutes to run.

# Check that dependencies are installed.
You should have already pulled the amoebae git repository to your computer as described in the main documentation file.

In [1]:
%%bash
amoebae check_depend

blastp: 2.10.0+
 Package: blast 2.10.0, build Dec  3 2019 18:03:18
# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.3 (Nov 2019); http://hmmer.org/
# Copyright (C) 2019 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# esl-sfetch :: retrieve sequence(s) from a file
# Easel 0.46 (Nov 2019)
# Copyright (C) 2019 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
MUSCLE v3.8.31 by Robert C. Edgar
IQ-TREE multicore version 1.6.12 for Linux 64-bit built Aug 15 2019
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor,
Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Usage: iqtree -s <alignment> [OPTIONS]

GENERAL OPTIONS:
  -? or -h             Print this help dialog
  -version             Display version number
  -s <alignment>       Input alignment in PHYLIP/FASTA/NEXUS/CLUSTAL/MSF format
  -st <data_type>      BIN, DNA, AA, NT2AA, CODON, MORPH (default: auto-detect)
  -q <partit

In [1]:
%%bash
amoebae check_imports


Non-redundant list of import statements:

1. import sys  # add_seq_man.py
2. import os  # add_seq_man.py
3. import shutil  # add_seq_man.py
4. import time  # add_seq_man.py
5. from module_afa_to_nex import afa_to_nex, nex_to_afa  # add_seq_man.py
6. from afa_to_fa import afa_to_fa  # add_seq_man.py
7. from module_afa_to_nex import align_one_fa  # add_seq_man.py
8. from subprocess import call  # add_seq_man.py
9. from parse_mod_num import update_mod_num_numeric  # add_seq_man.py
10. import subprocess  # boots_on_best_ml.py
11. import glob  # boots_on_best_ml.py
12. import settings  # boots_on_best_ml.py
13. from module_amoebae_name_replace import write_newick_tree_with_uncoded_names  # boots_on_best_ml.py
14. import re  # boots_on_mb.py
15. from ete3 import Tree  # boots_on_mb.py
16. from settings import raxmlname  # boots_on_mb.py
17. from module_boots_on_mb import reformat_combined_supports, combine_supports,\  # boots_on_mb.py
18. mbcontre_to_newick_w_probs, contre_to_newick  # boot

Traceback (most recent call last):
  File "./get_nonredun_import_statments_for_amoebae_output_793.py", line 51, in <module>
    from ete3 import Tree, TreeStyle, TextFace
ImportError: cannot import name 'TreeStyle'


# Import some basic python modules

In [13]:
import os
import sys
import subprocess
from Bio import SeqIO
from Bio import Entrez
import glob
from Bio.Blast import NCBIXML
import pandas as pd
from IPython.display import display, HTML
#sys.path.append(os.path.dirname(os.path.dirname(sys.path[0])))
sys.path.append('/opt/notebooks')

# Download peptide and nucleotide sequences for specific genomes.

Let's download the predicted peptide sequences, genomic assembly (nucleotide
sequences of assembled chromosomes), and annotation files (in GFF3 format) for the following eukaryotes from NCBI:

- *Arabidopsis thaliana*
- *Trypanosoma brucei*
- *Dictyostelium discoideum*
- *Allomyces macrogynus*
- *Saccharomyces cerevisiae*


This could take a while.

In [4]:
# Initiate a list of file paths for downloaded sequence and annotation files.
datafile_path_list = []

# Define a dictionary of source URLs and new filenames for sequence and annotation files.
datafile_dict = {"Athaliana_database.faa.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_protein.faa.gz",
                 "Athaliana_database.fna.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz",
                 "Athaliana_database.gff3.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.gff.gz",
                 "Scerevisiae_database.faa.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_protein.faa.gz",
                 "Scerevisiae_database.fna.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz",
                 "Scerevisiae_database.gff3.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gff.gz",
                 "Tbrucei_database.faa.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/210/295/GCF_000210295.1_ASM21029v1/GCF_000210295.1_ASM21029v1_protein.faa.gz",
                 "Tbrucei_database.fna.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/210/295/GCF_000210295.1_ASM21029v1/GCF_000210295.1_ASM21029v1_genomic.fna.gz",
                 "Tbrucei_database.gff3.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/210/295/GCF_000210295.1_ASM21029v1/GCF_000210295.1_ASM21029v1_genomic.gff.gz",
                 "Ddiscoideum_database.faa.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/004/695/GCF_000004695.1_dicty_2.7/GCF_000004695.1_dicty_2.7_protein.faa.gz",
                 "Ddiscoideum_database.fna.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/004/695/GCF_000004695.1_dicty_2.7/GCF_000004695.1_dicty_2.7_genomic.fna.gz",
                 "Ddiscoideum_database.gff3.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/004/695/GCF_000004695.1_dicty_2.7/GCF_000004695.1_dicty_2.7_genomic.gff.gz",
                 "Amacrogynus_database.faa.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/151/295/GCA_000151295.1_A_macrogynus_V3/GCA_000151295.1_A_macrogynus_V3_protein.faa.gz",
                 "Amacrogynus_database.fna.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/151/295/GCA_000151295.1_A_macrogynus_V3/GCA_000151295.1_A_macrogynus_V3_genomic.fna.gz",
                 "Amacrogynus_database.gff3.gz": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/151/295/GCA_000151295.1_A_macrogynus_V3/GCA_000151295.1_A_macrogynus_V3_genomic.gff.gz"
          }

# Make a new temporary directory to store data files.
temp_db_dir_name = 'temporary_db_dir'
assert not os.path.isdir(temp_db_dir_name)
os.mkdir(temp_db_dir_name)

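# Download all the data files via NCBI's FTP server and decompress them.
for filename, url in datafile_dict.items():
    filepath = os.path.join(temp_db_dir_name, filename)
    # Path of the decompressed file that gunzip will produce (".gz" removed).
    unzipped_filepath = filepath[:-len('.gz')]
    # Skip files that were already downloaded and decompressed.
    if not os.path.isfile(unzipped_filepath):
        # check_call raises an error if curl or gunzip fails.
        subprocess.check_call(['curl', url, '--output', filepath])
        subprocess.check_call(['gunzip', filepath])
    datafile_path_list.append(unzipped_filepath)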

# Initiating a data directory structure.
To generate a directory structure and spreadsheets for storing formatted sequence files
and metadata for each sequence file, use the 'mkdatadir' command (this takes a
single argument which is the full path that you want your new directory to be
written to):

In [6]:
%%bash
export DATADIR="AMOEBAE_Data"
amoebae mkdatadir $DATADIR


        
        To allow AMOEBAE scripts to locate your new data directory, change the
        value of the root_amoebae_data_dir variable in the settings.py file to
        the full path to the directory:

        AMOEBAE_Data
        


This will prompt you to set the 'root\_amoebae\_data\_dir' variable in the
settings.py file to this new directory path so that AMOEBAE scripts can locate
your files.

After editing settings.py, check that the variable is set correctly as follows:

In [22]:
# Check that the path indicated in the settings file is correct.
import settings
print(settings.root_amoebae_data_dir)
assert settings.root_amoebae_data_dir == "AMOEBAE_Data"

AMOEBAE_Data


# Preparing databases for searching.
To format the downloaded sequence and annotation files and add them to your new
data directory, run the 'add_to_dbs' command on each file in the temporary
download directory (this takes the path to the file to be added as an
argument).

This will take several minutes.

In [23]:
%%bash
for X in temporary_db_dir/*; do amoebae add_to_dbs $X; done



Building a new DB, current time: 02/20/2020 02:25:26
New DB name:   /opt/notebooks/notebooks/AMOEBAE_Data/Genomes/Amacrogynus_database.faa
New DB title:  AMOEBAE_Data/Genomes/Amacrogynus_database.faa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 19447 sequences in 2.01947 seconds.


Creating SSI index for AMOEBAE_Data/Genomes/Amacrogynus_database.faa...    done.
Indexed 19447 sequences (19447 names).
SSI index written to file AMOEBAE_Data/Genomes/Amacrogynus_database.faa.ssi


Building a new DB, current time: 02/20/2020 02:25:31
New DB name:   /opt/notebooks/notebooks/AMOEBAE_Data/Genomes/Amacrogynus_database.fna
New DB title:  AMOEBAE_Data/Genomes/Amacrogynus_database.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 101 sequences in 1.27208 seconds.


Creating SSI index for AMOEBAE_Data/Genomes/Amacrogynus_database.fna...    done.
Indexed 101 sequences (101 names).
S

In [24]:
%%bash
# List the databases now accessible by AMOEBAE.
amoebae list_dbs

Amacrogynus_database.faa
Amacrogynus_database.fna
Athaliana_database.faa
Athaliana_database.fna
Ddiscoideum_database.faa
Ddiscoideum_database.fna
Scerevisiae_database.faa
Scerevisiae_database.fna
Tbrucei_database.faa
Tbrucei_database.fna


This may take some time, because an SQL database will be generated to store information from the GFF3 annotation file (this is what will be listed in the genome info CSV file).

When this is finished, copy the name of the .sql file to the row for the corresponding genomic assembly (.fna) file in the column with the header "Annotations file", and do the same for the row describing the corresponding peptide sequence (.faa) file. This allows the correct GFF3 file to be used for the assembly (.fna file) and predicted amino acid sequences (.faa).

Next you must manually modify the spreadsheet so that it has the correct metadata for this sequence file. Open it with Excel or Open Office, and enter the following information:
- Fill the "Superbranch", "Supergroup", "Group", and "Species (if applicable)" fields with the values "Diaphoretickes", "Archaeplastida", "Embryophyta", and "Arabidopsis thaliana", respectively. These are arbitrarily selected taxonomic groups to which Arabidopsis belongs (Adl et al., 2018), but if you note similar taxonomic information for each genome you download, it will help you to keep organized.
- Fill the "Taxon" field with the abbreviation "Athaliana". This is used for abbreviating names when necessary.
- Fill in the other fields as you see fit. It is recommended that you keep track of where you downloaded files from, and which assembly you used.
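
If you prefer not to edit the spreadsheet by hand, the same fields could in principle be filled in with a short script. The following is only a sketch using Python's standard `csv` module on a cut-down, made-up stand-in for the genome info spreadsheet: the "Filename" column name and the example rows are assumptions, so check the headers in your own file before adapting it.

```python
import csv
import io


def fill_genome_metadata(csv_text, filename, metadata):
    """Return CSV text with `metadata` filled in for the row matching `filename`.

    Assumes (hypothetically) that the spreadsheet has a "Filename" column.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames
    rows = list(reader)
    for row in rows:
        if row["Filename"] == filename:
            for column, value in metadata.items():
                if column not in fieldnames:
                    raise KeyError("No such column: " + column)
                row[column] = value
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up, cut-down stand-in for the genome info spreadsheet.
example_csv = (
    "Filename,Superbranch,Supergroup,Group,Species (if applicable),Taxon\n"
    "Athaliana_database.faa,,,,,\n"
    "Scerevisiae_database.faa,,,,,\n"
)

# Values taken from the instructions above.
athaliana_metadata = {
    "Superbranch": "Diaphoretickes",
    "Supergroup": "Archaeplastida",
    "Group": "Embryophyta",
    "Species (if applicable)": "Arabidopsis thaliana",
    "Taxon": "Athaliana",
}

updated_csv = fill_genome_metadata(
    example_csv, "Athaliana_database.faa", athaliana_metadata
)
print(updated_csv)
```

Rows for other genomes are left untouched, so the function can be called once per sequence file.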



# Enter your email to access the NCBI protein database via NCBI Entrez.

In [25]:
Entrez.email = input("Enter your email address here: ")  # Tell NCBI who you are.

Enter your email address here: lael@ualberta.ca


# Download single-sequence queries

In [26]:
# Define a dictionary with NCBI sequence accessions as keys and filenames to write
# the corresponding sequences to as values.
query_dict = {"NP_194077.1": "AP1beta_Athaliana_NP_194077.1_query.faa",
              "NP_851058.1": "AP2alpha_Athaliana_NP_851058.1_query.faa",
              "NP_974895.1": "AP2mu_Athaliana_NP_974895.1_query.faa",
              "NP_175219.1": "AP2sigma_Athaliana_NP_175219.1_query.faa",
              "NP_566961.1": "Sec12_Athaliana_NP_566961.1_query.faa",
              "NP_200929.1": "SNAP33_Athaliana_NP_200929.1_query.faa",
              "NP_193449.1": "Rab2_Athaliana_NP_193449.1_query.faa"
          }

# Make a new temporary directory to store sequence files.
temp_query_dir_name = 'temporary_query_dir'
assert not os.path.isdir(temp_query_dir_name), """Directory already exists."""
os.mkdir(temp_query_dir_name)

# Loop over accessions and filenames in the query_dict dictionary.
for accession, filename in query_dict.items():
    # Write each sequence to a file in the temporary query directory.
    filepath = os.path.join(temp_query_dir_name, filename)
    # Only download sequences that have not already been downloaded.
    if not os.path.isfile(filepath):
        # Download the sequence from NCBI via Entrez, using the Biopython module.
        net_handle = Entrez.efetch(db="protein", id=accession, rettype="fasta", retmode="text")
        with open(filepath, "w") as out_handle:
            out_handle.write(net_handle.read())
        net_handle.close()

# Prepare single-sequence queries for searching

Queries must be formatted and stored in a similar manner to genomic data files. The query files will include FASTA files containing one sequence and FASTA files containing multiple sequences.
Now we are going to generate the query files and add them to your AMOEBAE_Data/Queries directory, in a similar way to how we added genomic data files to the AMOEBAE_Data/Genomes directory. Since you already downloaded all the peptide sequences for Arabidopsis thaliana, you can retrieve these from your downloaded data using one of the scripts in the amoebae/misc_scripts folder. First, let’s generate a query for the A. thaliana AP-1/2 beta subunit(s), which is a component of both the AP-1 and AP-2 complexes, using a representative sequence:

In [27]:
%%bash
for QUERYFILE in temporary_query_dir/*.faa; do amoebae add_to_queries $QUERYFILE; done

Information added to spreadsheet 0_query_info.csv:
	Filename: AP1beta_Athaliana_NP_194077.1_query.faa
	Query title: AP1beta
	Query source description: Athaliana
	Query taxon (species if applicable): -
	Data type: prot
	File type: faa
	Date added: 2020/02/20
	Citation: ?
	Query database filename (if applicable): -
Information added to spreadsheet 0_query_info.csv:
	Filename: AP2alpha_Athaliana_NP_851058.1_query.faa
	Query title: AP2alpha
	Query source description: Athaliana
	Query taxon (species if applicable): -
	Data type: prot
	File type: faa
	Date added: 2020/02/20
	Citation: ?
	Query database filename (if applicable): -
Information added to spreadsheet 0_query_info.csv:
	Filename: AP2mu_Athaliana_NP_974895.1_query.faa
	Query title: AP2mu
	Query source description: Athaliana
	Query taxon (species if applicable): -
	Data type: prot
	File type: faa
	Date added: 2020/02/20
	Citation: ?
	Query database filename (if applicable): -
Information added to spreadsheet 0_query_info.csv:
	Filen

In [28]:
%%bash
amoebae list_queries

AP1beta_Athaliana_NP_194077.1_query.faa
AP2alpha_Athaliana_NP_851058.1_query.faa
AP2mu_Athaliana_NP_974895.1_query.faa
AP2sigma_Athaliana_NP_175219.1_query.faa
Rab2_Athaliana_NP_193449.1_query.faa
SNAP33_Athaliana_NP_200929.1_query.faa
Sec12_Athaliana_NP_566961.1_query.faa


Now complete the information in the spreadsheet (AMOEBAE_Data/Queries/0_query_info.csv). Make sure that the query titles AP1beta, AP2alpha, AP2mu, and AP2sigma are entered in the appropriate rows in the "Query title" column. This allows multiple query files to be associated with the same query title if they are to be used to search for the same set of homologues.
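
The `add_to_queries` output above suggests that the query title and source description are taken from the first two underscore-separated parts of each filename. As a rough cross-check of which files share a query title, the filenames can be split in the same way. This is a sketch of that apparent naming convention, not AMOEBAE's own parsing code, and the `parse_query_filename` helper is hypothetical.

```python
from collections import defaultdict


def parse_query_filename(filename):
    """Split '<title>_<source>_<accession>_query.<ext>' into its three parts."""
    stem = filename.rsplit(".", 1)[0]  # drop the filename extension
    assert stem.endswith("_query"), "Unexpected filename: " + filename
    # Split on the first two underscores only, because accessions such as
    # NP_194077.1 contain an underscore themselves.
    title, source, accession = stem[: -len("_query")].split("_", 2)
    return title, source, accession


query_files = [
    "AP1beta_Athaliana_NP_194077.1_query.faa",
    "AP2alpha_Athaliana_NP_851058.1_query.faa",
    "AP2mu_Athaliana_NP_974895.1_query.faa",
]

# Group filenames by query title.
files_by_title = defaultdict(list)
for name in query_files:
    title, source, accession = parse_query_filename(name)
    files_by_title[title].append(name)

for title, names in sorted(files_by_title.items()):
    print(title, names)
```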

# Construct alignments for profile searching.

In [29]:
# Define a dictionary with query titles as keys and comma-separated strings of NCBI sequence accessions as values (the sequences for each title are written to one multi-sequence file).
query_title_dict = {"AP1beta": "NP_194077.1,CBI34366.3,XP_015631818.1,XP_024516549.1,OAE33273.1",
                    "AP2alpha": "NP_851058.1,XP_002270388.1,XP_015631820.1,PTQ35247.1,XP_024525508.1",
                    "AP2mu": "NP_974895.1,XP_002281297.1,XP_015627628.1,OAE25965.1,XP_002973295.1",
                    "AP2sigma": "NP_175219.1,XP_015618362.1,PTQ50284.1,XP_002275803.1,XP_024518676.1",
                    "Sec12": "NP_566961.1,XP_002262948.1,XP_015647566.1,OAE21792.1,XP_024530559.1",
                    "SNAP33": "NP_200929.1,XP_002284486.1,AAW82752.1,EFJ31467.1,OAE29824.1,XP_006270633.1,XP_006010378.1,XP_006625751.1,NP_001080510.1,XP_020370357.1,XP_015181699.1,XP_031769811.1",
                    "Rab2": "NP_193449.1,XP_003635585.2,XP_015626284.1,XP_002965710.1,PTQ28228.1"
                   }
                    

# Make a new temporary directory to store sequence files.
temp_alignment_dir_name = 'temporary_alignment_dir'
assert not os.path.isdir(temp_alignment_dir_name), """Directory already exists."""
os.mkdir(temp_alignment_dir_name)

# Download query sequences and write to multiple-sequence FASTA files.
for query_title in query_title_dict.keys():
    accession_list_string = query_title_dict[query_title]
    filepath = os.path.join(temp_alignment_dir_name, query_title + '_hmm1.faa')
    if not os.path.isfile(filepath):
        net_handle = Entrez.efetch(db="protein", id=accession_list_string, rettype="fasta", retmode="text")
        out_handle = open(filepath, "w")
        out_handle.write(net_handle.read())
        out_handle.close()
        net_handle.close()

In [30]:
%%bash
for X in temporary_alignment_dir/*.faa; do amoebae align_fa $X --output_format fasta; done


MUSCLE v3.8.31 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

AP1beta_hmm1 5 seqs, max length 920, avg  length 901
00:00:00     10 MB(1%)  Iter   1    6.67%  K-mer dist pass 100:00:00     10 MB(1%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00     10 MB(1%)  Iter   1    6.67%  K-mer dist pass 200:00:00     10 MB(1%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00     11 MB(1%)  Iter   1   25.00%  Align node       00:00:00     15 MB(1%)  Iter   1   50.00%  Align node00:00:00     16 MB(1%)  Iter   1   75.00%  Align node00:00:00     16 MB(1%)  Iter   1  100.00%  Align node00:00:00     16 MB(1%)  Iter   1  100.00%  Align node
00:00:00     16 MB(1%)  Iter   1   20.00%  Root alignment00:00:00     16 MB(1%)  Iter   1   40.00%  Root alignment00:00:00     16 MB(1%)  Iter   1   60.00%  Root alignment00:00:00     16 MB(1%)  Iter   1   80.00%  Root alignment00:00:00     16 MB(

# Visually inspect alignments
Alignments used as queries should be visually inspected to make sure that there are no obvious errors in the alignment.
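
A few basic checks can complement (but not replace) visual inspection in an alignment viewer. The sketch below parses an aligned FASTA file with plain Python, confirms that every sequence has the same aligned length, and reports the fraction of gap characters per sequence, since a sequence that is mostly gaps often deserves a closer look. The toy alignment and helper names are made up for illustration.

```python
def read_aligned_fasta(lines):
    """Parse aligned FASTA lines into a dict of {header: aligned sequence}."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        else:
            records[header] += line
    return records


def alignment_report(records):
    """Check that all sequences have equal length; return gap fractions."""
    lengths = {len(seq) for seq in records.values()}
    assert len(lengths) == 1, "Aligned lengths differ: %s" % sorted(lengths)
    return {name: seq.count("-") / len(seq) for name, seq in records.items()}


# Made-up toy alignment; in practice, pass the lines of an aligned FASTA file,
# e.g. open(path).read().splitlines().
toy_alignment = """>seq1
MKT-LLV
>seq2
MKTALLV
>seq3
M----LV
""".splitlines()

report = alignment_report(read_aligned_fasta(toy_alignment))
for name, gap_fraction in report.items():
    print(name, round(gap_fraction, 2))
```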


# Prepare query alignments for searching

In [None]:
%%bash
for QUERYFILE in temporary_alignment_dir/*.afaa; do amoebae add_to_queries $QUERYFILE; done

# List queries

In [None]:
%%bash
amoebae list_queries

# Generate lists of potential redundant sequences among A. thaliana peptide sequences

In this tutorial, a reciprocal-best-hit search strategy will be used. If you are using a reciprocal-best-hit search strategy, then your initial round of searches will be performed using your original queries (assembled above) to search your genomes of interest. This initial round of searches will be referred to herein as "forward searches", and subsequent searches using forward search hits as queries into reference genomes will be referred to as "reverse searches".
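
The core logic of a reciprocal-best-hit test can be illustrated with a few lines of Python. This is a deliberately simplified sketch, not AMOEBAE's implementation: it considers only the single best forward hit, the sequence IDs and E-values are made up, and hits are represented as `(ID, E-value)` tuples.

```python
def best_hit(hits):
    """Return the ID of the hit with the lowest E-value, or None if no hits."""
    if not hits:
        return None
    return min(hits, key=lambda hit: hit[1])[0]


def is_reciprocal_best_hit(query_id, forward_hits, reverse_hits_by_seq):
    """True if the best forward hit retrieves the query as its best reverse hit."""
    forward_best = best_hit(forward_hits)
    if forward_best is None:
        return False
    reverse_best = best_hit(reverse_hits_by_seq.get(forward_best, []))
    return reverse_best == query_id


# Made-up forward search results: query_1 searched into a genome of interest.
forward_hits = [("target_A", 1e-50), ("target_B", 1e-5)]

# Made-up reverse search results: forward hits searched into the reference genome.
reverse_hits = {"target_A": [("query_1", 1e-48), ("query_2", 1e-10)]}

print(is_reciprocal_best_hit("query_1", forward_hits, reverse_hits))
print(is_reciprocal_best_hit("query_2", [("target_A", 1e-8)], reverse_hits))
```

In the second call, `target_A` is the best forward hit for `query_2`, but its best reverse hit is `query_1`, so the test fails.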

A slight complication to this search strategy is that the NCBI RefSeq peptide sequences for the A. thaliana genome include alternative transcripts and lineage-specific inparalogues (as do other databases), implying that if these were retrieved as the top hits in the reverse searches instead of the original query sequence, then this would still potentially be a positive result. So, to properly interpret reverse search results it will be necessary to determine which sequences in our A. thaliana .faa file are redundant for our purposes. To do this we will use the get_redun_hits command:

In [None]:
%%bash
mkdir Redundant_hits
amoebae list_queries > Redundant_hits/queries.txt
amoebae get_redun_hits Redundant_hits --query_list_file Redundant_hits/queries.txt --db_name Athaliana_database.faa

This will output a directory in the Redundant_hits folder with a .csv file. Open the CSV file. This file contains a summary of BLASTP or HMMer search results for searches with the specified queries into the A. thaliana predicted proteins. In the column with the header "Positive/redundant (+) or negative (-) hit for queries with query title (edit this column)", change the '-' to '+' for hits that are the original query, or redundant with the original query for the purposes of this analysis.
It should be apparent upon inspection of the ranking of hits and comparison of the associated E-values which hits are redundant with your queries. The redundant accessions for each query (both single sequence and HMM queries for the same AP-2 subunit) should be similar to the following:

# Identify redundant sequences

In [None]:
# Define a dictionary with query titles as keys and lists of sequence IDs as values, where the IDs are for A. thaliana sequences that are redundant with the original A. thaliana query sequence.
redun_seq_dict = {"AP1beta": ["NP_194077.1",
                              "NP_192877.1",
                              "NP_001328014.1",
                              "NP_001190701.1"
                              ],
                  "AP2alpha": ["NP_851058.1",
                               "NP_851057.1",
                               "NP_197669.1",
                               "NP_001330971.1",
                               "NP_001330970.1",
                               "NP_001330969.1",
                               "NP_197670.1",
                               "NP_001330127.1"
                               ],
                  "AP2mu": ["NP_974895.1",
                            "NP_199475.1"
                            ],
                  "AP2sigma": ["NP_175219.1"
                               ],
                  "Sec12": ["NP_566961.1",
                            "NP_568738.1",
                            "NP_680414.1",
                            "NP_178256.1"
                            ],
                  "SNAP33": ["NP_200929.1",
                             "NP_001332102.1",
                             "NP_172842.1",
                             "NP_001318998.1",
                             "NP_196405.1",
                             "NP_001318503.1"
                             ],
                  "Rab2": ["NP_193449.1",
                           "NP_193450.1",
                           "NP_195311.1",
                           "NP_001078499.1"
                           ]
                   }


# Identify path to redundant seqs CSV file.
redundant_seqs_csv = glob.glob(os.path.join('Redundant_hits', os.path.join('redun_hits_*', '0_redun_hits_*.csv')))[0]

# Define path for new modified redundant seqs CSV file.
redundant_seqs_csv2 = redundant_seqs_csv.rsplit(".", 1)[0] + '_2.csv'

# Open the redundant seqs CSV file, and a new one.
with open(redundant_seqs_csv) as infh, open(redundant_seqs_csv2, 'w') as o:
    # Loop over lines in the CSV file.
    for i in infh:
        if not i.startswith("Query Title"):
            # Identify query title in line.
            line_query_title = i.split(',')[0].strip()
            # Identify accession/id for sequence hit represented in this row.
            line_accession = i.split(',')[9].strip().strip('\"')
            # Loop over keys (query titles) in the redundant seqs dictionary.
            query_title_in_keys = False
            for query_title in redun_seq_dict.keys():
                if line_query_title == query_title:
                    query_title_in_keys = True
                    #print('YYY')
                    #print(line_accession)
                    #print(redun_seq_dict[line_query_title])
                    # Determine whether the accession is a redundant accession.
                    for acc in redun_seq_dict[line_query_title]:
                        #print(line_accession, acc)
                        if line_accession == acc:
                            # Change the - to + so that the accession will be included in the list of redundant accessions used by AMOEBAE.
                            i = ','.join(i.split(',')[:4]) + ',+,' + ','.join(i.split(',')[5:])
                            break
                # Break loop if the corresponding query title was found already.
                if query_title_in_keys:
                    break
            # Check that a query title could be recognized as one that is a key in the dictionary.
            assert query_title_in_keys, """Could not find query title %s in dictionary.""" % line_query_title
        # Write (modified) line to new CSV file.
        o.write(i)

Remember to update the 0\_query\_info.csv file with the correct query titles for
the alignment queries, which should match those of the single-sequence queries.
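
The loop above that edits the redundant hits CSV file splits each line on commas, which can misbehave if a field is quoted or contains a comma. A possible alternative, sketched here with Python's standard `csv` module, reads and writes whole rows instead. The column positions (0 for the query title, 4 for the +/- column, 9 for the accession) are taken from that loop; the toy rows and the `made_up_accession` entry are invented for illustration. Note that `csv.writer` only quotes fields where necessary, so quotation marks in the output may differ from those in the input.

```python
import csv
import io


def mark_redundant_rows(csv_text, redundant_ids_by_title,
                        title_col=0, flag_col=4, accession_col=9):
    """Return CSV text with '+' in flag_col for rows with redundant accessions."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        is_header = bool(row) and row[title_col].startswith("Query Title")
        if row and not is_header:
            title = row[title_col].strip()
            accession = row[accession_col].strip()
            if accession in redundant_ids_by_title.get(title, []):
                row[flag_col] = "+"
        writer.writerow(row)
    return out.getvalue()


# Toy CSV with ten columns; only columns 0, 4, and 9 matter here.
toy_csv = (
    'Query Title,c1,c2,c3,Flag,c5,c6,c7,c8,Accession\n'
    'AP2mu,x,x,x,-,x,x,x,x,"NP_199475.1"\n'
    'AP2mu,x,x,x,-,x,x,x,x,"made_up_accession"\n'
)

marked = mark_redundant_rows(
    toy_csv, {"AP2mu": ["NP_974895.1", "NP_199475.1"]}
)
print(marked)
```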

# Running forward searches

To begin searching, make a new folder to contain search results, and write text files listing the names (not full paths) of FASTA files you want to use as queries and those that you want to search in.

In [None]:
%%bash
# Make a new directory to contain search results.
mkdir AMOEBAE_Search_Results_1
# Write query and database list files.
amoebae list_queries > AMOEBAE_Search_Results_1/queries.txt
amoebae list_dbs > AMOEBAE_Search_Results_1/databases.txt

Set up searches using the setup_fwd_srch command:

In [None]:
%%bash
# Set up forward searches.
amoebae setup_fwd_srch AMOEBAE_Search_Results_1\
                       AMOEBAE_Search_Results_1/queries.txt\
                       AMOEBAE_Search_Results_1/databases.txt\
                       --outdir AMOEBAE_Search_Results_1/fwd_srch_1

This will output a new sub-directory with a name that starts with "fwd_srch_". Now run the searches with this directory as input via the run_fwd_srch command. Forward search criteria may be selected at this point (view the relevant optional arguments via the -h option).

In [None]:
%%bash
# Run forward searches. This could take a while.
amoebae run_fwd_srch AMOEBAE_Search_Results_1/fwd_srch_1


This will run BLASTP or HMMer for searches into the .faa files (depending on whether queries are single- or multi-fasta), or TBLASTN for searches into the .fna files with any single-fasta queries.
Now we can generate a summary of the raw output files. Important criteria may be customized here as well, specifically the forward search E-value threshold and the maximum number of nucleotide bases allowed between TBLASTN HSPs for them to be considered part of the same gene (view optional arguments via the -h option).
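
To make these two criteria concrete, here is a toy illustration of what an E-value cutoff and a maximum gap between HSPs mean. It is a sketch only: the function names, coordinates, and threshold values are hypothetical, strand is ignored, and AMOEBAE's own logic and defaults may differ.

```python
def filter_by_evalue(hits, max_evalue):
    """Keep only hits whose E-value is at or below the threshold."""
    return [hit for hit in hits if hit["evalue"] <= max_evalue]


def cluster_hsps(hsps, max_gap):
    """Group (start, end) HSPs on one subject sequence into putative genes.

    Consecutive HSPs separated by at most `max_gap` bases share a cluster.
    """
    clusters = []
    cluster_end = None
    for start, end in sorted(hsps):
        if clusters and start - cluster_end <= max_gap:
            clusters[-1].append((start, end))
            cluster_end = max(cluster_end, end)
        else:
            clusters.append([(start, end)])
            cluster_end = end
    return clusters


# Made-up hits and HSP coordinates.
hits = [{"id": "hit_1", "evalue": 1e-30}, {"id": "hit_2", "evalue": 0.5}]
print(filter_by_evalue(hits, max_evalue=1e-5))

hsps = [(900, 1200), (100, 400), (20000, 20300)]
print(cluster_hsps(hsps, max_gap=1000))
```

With a maximum gap of 1000 bases, the first two HSPs are treated as parts of one gene and the third as a separate locus; a larger value would merge more distant HSPs.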

In [None]:
%%bash
# Summarize forward search results in a CSV file.
amoebae sum_fwd_srch AMOEBAE_Search_Results_1/fwd_srch_1\
                     AMOEBAE_Search_Results_1/fwd_srch_1_sum.csv

Examine the resulting CSV file. Note that maximum E-value cutoffs, and other criteria were applied as specified.

# Running reverse searches

Now, to determine which of the "forward hits" in these search results are really specific to our original A. thaliana queries, let’s search with these hits as queries back into the A. thaliana genome (i.e., perform "reverse" searches).

Similar to the forward searches, we need to first set up the reverse search directory:



In [None]:
%%bash
amoebae setup_rev_srch AMOEBAE_Search_Results_1\
                       AMOEBAE_Search_Results_1/fwd_srch_1_sum.csv_out.csv\
                       Athaliana_database.faa\
                       --outdir AMOEBAE_Search_Results_1/rev_srch_1

This will output a new directory with "rev_srch_" and a timestamp in the name. Run reverse searches using the path to this directory as an input:

In [None]:
%%bash
amoebae run_rev_srch AMOEBAE_Search_Results_1/rev_srch_1

Now append columns summarizing the results of these reverse searches to our CSV file. This is where the file listing redundant hits for each query title is used. Also, a criterion is applied here based on the order of magnitude difference in E-value between the original query (or redundant hits) in the reverse search results compared to other hits (if present), and this can be optionally modified (view optional arguments via the -h option).
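
The E-value criterion described above can be pictured as follows. This sketch assumes that a reverse search counts as positive when the top hit is the original query (or a sequence redundant with it) and its E-value is at least a given number of orders of magnitude lower than that of the best non-redundant hit. The threshold value, the `other_protein` ID, and the E-values are made up; AMOEBAE's actual default and implementation may differ.

```python
import math


def evalue_orders_of_magnitude(best_evalue, other_evalue, floor=1e-300):
    """Return log10(other / best), treating E-values of 0 as `floor`."""
    best = max(best_evalue, floor)
    other = max(other_evalue, floor)
    return math.log10(other) - math.log10(best)


def passes_reverse_criterion(reverse_hits, redundant_ids, min_orders):
    """reverse_hits is a list of (sequence ID, E-value) tuples."""
    if not reverse_hits:
        return False
    ranked = sorted(reverse_hits, key=lambda hit: hit[1])
    # The top reverse hit must be the original query or redundant with it.
    if ranked[0][0] not in redundant_ids:
        return False
    others = [evalue for seq_id, evalue in ranked if seq_id not in redundant_ids]
    if not others:
        return True
    return evalue_orders_of_magnitude(ranked[0][1], others[0]) >= min_orders


# Redundant A. thaliana accessions for the AP2mu query title.
redundant = {"NP_974895.1", "NP_199475.1"}

clear_case = [("NP_974895.1", 1e-80), ("NP_199475.1", 1e-75), ("other_protein", 1e-20)]
close_case = [("NP_974895.1", 1e-30), ("other_protein", 1e-29)]

print(passes_reverse_criterion(clear_case, redundant, min_orders=2))
print(passes_reverse_criterion(close_case, redundant, min_orders=2))
```

In the second case the query is still the top hit, but it is separated from the next hit by only about one order of magnitude, so the hit is rejected under this hypothetical threshold.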

This could take a while.

**Error: extra quotation marks were written to the redundant hits CSV file...**


In [None]:
%%bash
amoebae sum_rev_srch AMOEBAE_Search_Results_1/fwd_srch_1_sum.csv_out.csv\
                     AMOEBAE_Search_Results_1/rev_srch_1\
                     --redun_hit_csv Redundant_hits/redun_hits_20200121161452/0_redun_hits_20200121161452_2.csv

By default, this will output a CSV file with the same path as the forward search summary CSV file, but with a "_1" added before the filename extension. Examine the resulting CSV file.
You could run additional reverse searches into different files, appending columns to the same summary spreadsheet. Reverse searches into the A. thaliana peptide sequences are all that is necessary for this tutorial.

Next run the interp_srchs command to do an additional interpretation of the results (if reverse searches into multiple reference databases were performed then this would be done following summarization of all the reverse searches). Again, customized criteria may be applied at this point using the optional arguments.


In [None]:
%%bash
amoebae interp_srchs AMOEBAE_Search_Results_1/fwd_srch_1_sum.csv_out_1.csv

Again, examine the resulting CSV file to see whether the results match your expectations. You will notice that the results in this file do not account for the fact that the HMMer, BLASTP, and TBLASTN hits are redundant in many cases, as might be expected if each of these search algorithms were effective.

# Sorting out which positive hits are redundant

We need to determine which hits correspond to the same loci based on having identical accessions or being associated with the same locus in the GFF3 annotation file, or likely represent distinct paralogous gene loci based on sequence similarity in a multiple sequence alignment (see Larson et al. (2019) for explanation of how these are identified).
To do this, first we will append a column listing what alignment to use (by default it will be the alignments that are used as queries for the corresponding query title):


In [None]:
%%bash
amoebae find_redun_seqs AMOEBAE_Search_Results_1/fwd_srch_1_sum.csv_out_1_interp_20200122122113.csv --add_ali_col

Now identify distinct paralogues (use the -h option to view optional arguments):

In [None]:
%%bash
amoebae find_redun_seqs AMOEBAE_Search_Results_1/fwd_srch_1_sum.csv_out_1_interp_20200122122113_with_ali_col.csv

This will output another copy of the CSV file with additional columns. Take some time to decide whether you agree with the exclusion of some of the hits, as indicated in the appended columns.
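
To make the idea of redundancy by locus concrete, here is a minimal Python sketch of grouping hits that share an accession or map to the same locus. It is not the code that find_redun_seqs runs. The function `group_hits_by_locus`, the accessions, and the locus IDs are all hypothetical, and the accession-to-locus mapping is assumed to have been extracted from a GFF3 file beforehand.

```python
from collections import defaultdict

def group_hits_by_locus(hits, accession_to_locus):
    """Group hit accessions by locus.

    Hits without a locus annotation are treated as their own group,
    and duplicate accessions are listed only once.
    """
    groups = defaultdict(list)
    for accession in hits:
        locus = accession_to_locus.get(accession, accession)
        if accession not in groups[locus]:
            groups[locus].append(accession)
    return dict(groups)

# Hypothetical accessions and locus IDs, for illustration only.
hits = ["pep_A.1", "pep_A.2", "pep_B.1", "pep_A.1"]
accession_to_locus = {
    "pep_A.1": "locus_A",
    "pep_A.2": "locus_A",
    "pep_B.1": "locus_B",
}
groups = group_hits_by_locus(hits, accession_to_locus)
for locus, accessions in groups.items():
    print(locus, accessions)
```

In this toy example, the two "pep_A" splice variants collapse into one locus, so only one of them would be counted.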

# Plotting search results

Finally, we can plot the results of the searches. To customize the organization of the output Coulson plot, an additional input CSV file may optionally be provided. This file simply contains the names of protein complexes in the first column and, in the second column, the query titles for the proteins that you want to include in each complex (see the example file provided with this tutorial).

In [None]:
%%bash

echo \
"AP2,AP1beta
AP2,AP2alpha
AP2,AP2mu
AP2,AP2sigma
COPII,Sec12
SNAREs,SNAP33
Rabs,Rab2" > AMOEBAE_Search_Results_1/complex_info_1.csv
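
As a quick sanity check of the two-column layout written above, the following Python sketch parses the same text and groups query titles by complex. The function `read_complex_info` is a hypothetical helper, not part of AMOEBAE, and the sketch reads the text from a string rather than from the file so that it is self-contained.

```python
import csv
import io

# Same content as the complex_info_1.csv file written in the cell above.
COMPLEX_INFO = """AP2,AP1beta
AP2,AP2alpha
AP2,AP2mu
AP2,AP2sigma
COPII,Sec12
SNAREs,SNAP33
Rabs,Rab2
"""

def read_complex_info(handle):
    """Map each complex name (column 1) to its list of query titles (column 2)."""
    complexes = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        complexes.setdefault(row[0].strip(), []).append(row[1].strip())
    return complexes

complexes = read_complex_info(io.StringIO(COMPLEX_INFO))
for name, queries in complexes.items():
    print(name, queries)
```

To check the real file, pass an open file handle for AMOEBAE_Search_Results_1/complex_info_1.csv instead of the `io.StringIO` object.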



In [None]:
%%bash
# Problem: species name not automatically added to genome_info.csv..
amoebae plot AMOEBAE_Search_Results_1/fwd_srch_1_sum.csv_out_1_interp_20200122122113_with_ali_col_paralogue_count_20200122232356_2.csv\
             --complex_info AMOEBAE_Search_Results_1/complex_info_1.csv\
             --out_pdf AMOEBAE_Search_Results_1/plot.pdf

Examine the resulting PDF files. Your Coulson plot should look something like that in Figure 1. Compare with the results of searches for AP-2 subunits published by Manna et al. (2013), Barlow et al. (2014), and Larson et al. (2019). You will need to customize the formatting of Coulson plots output by the 'plot' command using software such as Adobe Illustrator.

<img src="AMOEBAE_Search_Results_1/plot_coulson_both.png" style="width: 500px;">

Figure 1: A Coulson plot summarizing similarity search results for AP-2 complex subunits in Trypanosoma brucei gambiense and Saccharomyces cerevisiae peptide and nucleotide sequences using Arabidopsis thaliana queries and Hidden Markov Models generated from alignments of embryophyte orthologues. BLASTP and TBLASTN were used to search peptide and nucleotide sequences, respectively, with single sequence queries, and the HMMer3 package was used to perform profile searches. Subplot sectors with blue fill indicate that one or more sequences were found to meet the search criteria applied (with the number being indicated within each subplot sector). Note that the ancestral eukaryotic AP-1 and AP-2 complexes shared a single beta subunit (Dacks et al., 2008). This is why identified "AP1beta" orthologues are shown as a component of the AP-2 complex here, even though T. brucei lacks an AP-2 complex (Manna et al., 2013). These results are comparable to the relevant results published by Manna et al. (2013), Barlow et al. (2014), and Larson et al. (2019).

# Interpretation and re-analysis

It should be clear that AMOEBAE identifies "positive" and "negative" results simply by applying criteria that the user specifies. So, it is entirely the user's responsibility to select appropriate criteria and interpret the results critically.

Points to consider regarding interpretation of the results of the analysis in this tutorial include the following:
- The BLASTP and HMMer searches (both followed by reverse BLASTP searches) yielded the same results in this analysis.
- The TBLASTN searches were able to identify all of the genes represented by the peptide sequences identified by BLASTP and HMMer searches.
- A TBLASTN hit in the A. thaliana chromosome 5 (NC_003076.8) met the forward and reverse search criteria, but was excluded because the translation of the region that aligned to the query was only 50 amino acids long (this sequence also contained stop codons). If you look on the NCBI genome browser for A. thaliana you will see that this region on chromosome 5 (as indicated in the summary CSV file) corresponds to a pseudogene for AP-2 sigma with the gene ID AT5G42568.
- The two A. thaliana AP-1/2 beta paralogues and the two S. cerevisiae paralogues are brassicalid and fungal inparalogues, respectively, which arose from independent gene duplications. Phylogenetic analysis would be required to determine this (see Larson et al. (2019) and Barlow et al. (2014)).
- An Arabidopsis thaliana AP-2 mu splice variant was excluded after running the 'find_redun_seqs' command, because it was found to be encoded by the same gene as the other splice variant based on information in the GFF3 annotation file.
- An A. thaliana AP-2 alpha gene was excluded after running the 'find_redun_seqs' command, because it shows over 98% identity with the other AP-2 alpha gene. The summary CSV file indicates which file contains an alignment of these two sequences (see Larson et al. (2019) for relevant discussion).
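
The percent identity criterion mentioned in the last point can be illustrated with a short Python sketch. This is not necessarily how AMOEBAE computes identity: the function `percent_identity` is hypothetical, and it assumes that identity is measured only over alignment columns where neither sequence has a gap. The toy sequences are invented.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # ignore columns containing a gap (an assumption)
        compared += 1
        if a.upper() == b.upper():
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Invented toy alignment: 7 ungapped columns, 6 of which are identical.
identity = percent_identity("MKT-LLVA", "MKTALLIA")
print(round(identity, 1))
```

Other definitions use the full alignment length or the shorter sequence length as the denominator, so a threshold such as 98% is only meaningful once the definition is fixed.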

If the analysis in this tutorial were a project you were working on for publication, then upon completing the above analysis steps your work would have only just begun. AMOEBAE merely finds sequences that match your specified search criteria, which may or may not be sufficient to accurately identify homologues of interest. Careful inspection of the summary CSV file will reveal that minor adjustments to the search criteria would cause the analysis to yield different results. Moreover, there are many different possibilities that would lead to inaccurate results based on the criteria applied in the above analysis. A comprehensive discussion of this is beyond the scope of this tutorial, but one obvious example would be if an identified sequence contained a domain that was not present in the query sequence, causing sequences to be retrieved in the reverse search with no homology to the original query. Therefore, it is recommended that you commit to an iterative approach to analysis involving adjustment of search criteria and re-analysis to include sequences that you know are homologues of interest, but to exclude those that you know are not homologues of interest.
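
The sensitivity of results to the chosen criteria can be seen even in a toy example. The Python sketch below applies three different E-value cutoffs to the same set of hits; the hit names, E-values, and the function `passing_hits` are all hypothetical, and real AMOEBAE criteria involve more than a single cutoff.

```python
def passing_hits(hits, max_evalue):
    """Return the names of hits whose E-value is at or below the cutoff."""
    return [name for name, evalue in hits if evalue <= max_evalue]

# Hypothetical hits as (name, E-value) pairs, for illustration only.
hits = [("hit_1", 1e-50), ("hit_2", 3e-6), ("hit_3", 0.02)]

for cutoff in (1e-10, 1e-5, 0.05):
    print(cutoff, passing_hits(hits, cutoff))
```

Each cutoff yields a different set of "positive" hits from the same data, which is why the criteria should be revisited against sequences whose homology is already known.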

To generate an alignment of similar sequences identified using AMOEBAE, use the 'csv_to_fasta' command to generate FASTA files for alignment, and then align using your preferred software (e.g., MUSCLE or MAFFT). For visually assessing the sequences for possible issues such as contrasting domain topologies, you may wish to generate FASTA files including all your forward search results for each query title:

In [None]:
%%bash
amoebae csv_to_fasta AMOEBAE_Search_Results_1/fwd_srch_1_sum.csv_out_1_interp_20200122122113_with_ali_col_paralogue_count_20200122232356_2.csv\
                     --all_hits --split_by_query_title

If you are planning to run a phylogenetic analysis, you may wish to generate a FASTA file with only those sequences that match all your search criteria, and with abbreviated sequence headers:

In [None]:
%%bash
amoebae csv_to_fasta AMOEBAE_Search_Results_1/fwd_srch_1_sum.csv_out_1_interp_20200122122113_with_ali_col_paralogue_count_20200122232356_2.csv\
                     --abbrev --split_by_query_title
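
Before aligning, it can be useful to check how many sequences each FASTA file contains and how long they are. The following is a minimal sketch of a plain-Python FASTA reader. The function `read_fasta` is a hypothetical helper, and the example records are invented rather than taken from AMOEBAE output.

```python
import io

def read_fasta(handle):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Invented example records; replace with open("your_file.fa") for real output.
example = io.StringIO(">seq_1\nMKTLLV\nAAG\n>seq_2\nMKT\n")
records = read_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

Unusually short or long sequences in this listing are good candidates for closer inspection in the alignment.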

# Delete search output files (optional).

In [None]:
%%bash
#rm *blastp_search_output.txt
#rm *blastp_search_output.xml
#rm *reverse_query.faa
#rm *blastp_reverse_search_output.txt
#rm *blastp_reverse_search_output.xml
#rm 0_summary_of_forward_blastp_searches.csv
#rm 0_summary_of_forward_and_reverse_blastp_searches.csv
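
Before uncommenting the rm commands above, you may want to list what they would match. The Python sketch below performs such a dry run using the same file name patterns; the function `files_to_delete` is a hypothetical helper, and nothing is deleted.

```python
import glob
import os

# File name patterns copied from the commented rm commands above.
PATTERNS = [
    "*blastp_search_output.txt",
    "*blastp_search_output.xml",
    "*reverse_query.faa",
    "*blastp_reverse_search_output.txt",
    "*blastp_reverse_search_output.xml",
    "0_summary_of_forward_blastp_searches.csv",
    "0_summary_of_forward_and_reverse_blastp_searches.csv",
]

def files_to_delete(directory=".", patterns=PATTERNS):
    """Return a sorted list of files in `directory` matching any pattern."""
    matches = set()
    for pattern in patterns:
        matches.update(glob.glob(os.path.join(directory, pattern)))
    return sorted(matches)

# Dry run: print the matches without removing anything.
for path in files_to_delete():
    print("would remove:", path)
```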

# Where to go from here?

You can customize this notebook to search with different queries in different genomes.

# References

Barlow, L.D., Dacks, J.B., Wideman, J.G., 2014. From all to (nearly) none: Tracing adaptin evolution in Fungi. Cellular Logistics 4, e28114. https://doi.org/10.4161/cl.28114

Hirst, J., Barlow, L.D., Francisco, G.C., Sahlender, D.A., Seaman, M.N.J., Dacks, J.B., Robinson, M.S., 2011. The Fifth Adaptor Protein Complex. PLoS Biology 9, e1001170. https://doi.org/10.1371/journal.pbio.1001170

Larson, R.T., Dacks, J.B., Barlow, L.D., 2019. Recent gene duplications dominate evolutionary dynamics of adaptor protein complex subunits in embryophytes. Traffic 20, 961–973. https://doi.org/10.1111/tra.12698

Manna, P.T., Kelly, S., Field, M.C., 2013. Adaptin evolution in kinetoplastids and emergence of the variant surface glycoprotein coat in African trypanosomatids. Molecular Phylogenetics and Evolution 67, 123–128. https://doi.org/10.1016/j.ympev.2013.01.002

Robinson, M.S., 2004. Adaptable adaptors for coated vesicles. Trends in Cell Biology 14, 167–174. https://doi.org/10.1016/j.tcb.2004.02.002

