# Introduction

This tutorial will walk you through a preliminary similarity searching analysis making use of scripts in the AMOEBAE toolkit. While AMOEBAE was not originally written to be used via the command line, Jupyter notebooks provide an easy means of guiding new users through an example analysis with limited need for manual input. The end result of running this code successfully is a spreadsheet summarizing results of similarity searches, as well as a plot to visualize the results.

As a simple example, we will consider the distribution of orthologues of subunits of the Adaptor Protein (AP) 2 vesicle adaptor complex, and several other membrane-trafficking proteins, in five model eukaryotes: the plant *Arabidopsis thaliana*, the yeast *Saccharomyces cerevisiae*, the fungus *Allomyces macrogynus*, the amoeba *Dictyostelium discoideum*, and the pathogenic protist *Trypanosoma brucei*. AP-2 subunits are homologous to subunits of other AP complexes (Robinson, 2004; Hirst et al., 2011), and published work has traced their evolution among plants (Larson et al., 2019), Fungi (Barlow et al., 2014), and trypanosomatid parasites (Manna et al., 2013). Thus, the protein subunits of the AP-2 complex provide a useful test of how well similarity searching methods distinguish between orthologues and paralogues, and the results can be compared to those of previous studies. In addition, the membrane-trafficking proteins Sec12 (a component of the COPII vesicle coat complex), SNAP33 (a Qbc-SNARE), and Rab2 (a small GTPase) are included to further explore the potential sources of error involved in identification of orthologous proteins.

## Objectives of this tutorial


-  Perform similarity searches using the BLASTP, TBLASTN, and HMMER algorithms simultaneously using AMOEBAE scripts.

-  Apply a reciprocal-best-hit search strategy using AMOEBAE code.

- Practice interpreting similarity search results obtained using AMOEBAE.
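
The reciprocal-best-hit idea named in the objectives can be illustrated with a toy sketch. This is not AMOEBAE's implementation: the function names, sequence identifiers, and scores below are all made up for illustration, and real analyses apply further criteria beyond the single top hit.

```python
# Toy illustration of a reciprocal-best-hit check; NOT AMOEBAE's implementation.
# All identifiers and scores below are hypothetical.

def best_hit(hits):
    """Return the subject ID with the highest score, or None if there are no hits."""
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[1])[0]


def is_reciprocal_best_hit(query_id, forward_hits, reverse_hits_by_subject):
    """True if the query's top forward hit retrieves the query as its own top hit."""
    top = best_hit(forward_hits)
    if top is None:
        return False
    return best_hit(reverse_hits_by_subject.get(top, [])) == query_id


# Hypothetical forward search: one query searched against a target genome.
forward = [("target_prot_1", 310.0), ("target_prot_2", 120.0)]

# Hypothetical reverse searches: each target hit searched back against the
# genome that the original query came from.
reverse = {
    "target_prot_1": [("AP2mu_query", 305.0), ("AP1mu_paralogue", 140.0)],
    "target_prot_2": [("AP1mu_paralogue", 200.0), ("AP2mu_query", 118.0)],
}

print(is_reciprocal_best_hit("AP2mu_query", forward, reverse))
```

In this made-up example, a forward hit whose own best match is a paralogue of the query (as for `target_prot_2`) would not pass the check, which is the kind of orthologue/paralogue distinction this tutorial explores.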


## Requirements

- Before running this code, you will need to have set up AMOEBAE according to the instructions in the main documentation file here (which you likely have already done): [AMOEBAE_documentation.pdf](https://github.com/laelbarlow/amoebae/blob/master/documentation/AMOEBAE_documentation.pdf).

- macOS or Linux operating system (or possibly a workaround on Windows, although this has not been tested).

- Approximately 3GB of storage space.

- An internet connection.

- At least an hour of your time (the code in this notebook will take approximately 60 minutes to run).

- Running the code in this notebook is more computationally intensive than web browsing, for example, so if you are running this on a laptop computer, make sure it is connected to an electrical outlet.

## Testing
If you wish to simply run all the code in this notebook for testing purposes: Select "Cell" > "Run All" from the Jupyter menu above.

# Preliminary steps

## Find the amoebae script

The directory containing the amoebae executable script must be present in your $PATH.

In [None]:
%%bash
printf "\nThis is the directory that this notebook is run in:\n"
pwd
echo
printf "\nThis is the path to the amoebae executable script that will be used:\n"
command -v amoebae
#echo
#printf "\nThese are all the paths in the \$PATH variable:\n"
#tr ':' '\n' <<< "$PATH"
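
The same check can be made from Python, which may be convenient in later cells. The following is a minimal sketch using the standard library's `shutil.which`; the helper name `find_executable` is hypothetical and not part of AMOEBAE.

```python
import shutil


def find_executable(name):
    """Return the full path to `name` if it is found on PATH, otherwise None."""
    return shutil.which(name)


amoebae_path = find_executable("amoebae")
if amoebae_path is None:
    print("amoebae was not found on PATH; check your installation.")
else:
    print("amoebae executable:", amoebae_path)
```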

## Check that dependencies are installed

You should have already pulled the amoebae git repository to your computer as described in the main documentation file.

In [None]:
%%bash
# This command simply prints the versions of some dependencies which are now available for use by amoebae.
amoebae check_depend

In [None]:
%%bash
# This command tests all the import statements in amoebae modules.
amoebae check_imports

## Import some basic python modules

In [None]:
import os
import sys
import platform
import subprocess
from Bio import SeqIO
import glob
from Bio.Blast import NCBIXML
import pandas as pd
from IPython.display import display, HTML, Image
import requests

## Update the Python module search path so that additional modules can be imported.

In [None]:
# Add the parent directory (the main amoebae repository directory) to Python's
# module search path (sys.path). Note that this is distinct from the shell's $PATH.
sys.path.append('..')
print(sys.path)
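
Because a relative entry such as `'..'` is resolved against the current working directory, and this notebook changes directory later on, an absolute path may be more robust. A minimal sketch, assuming the notebook is started from a subdirectory of the main repository (the variable name `parent_dir` is illustrative):

```python
import os
import sys

# Resolve the parent directory to an absolute path so that the entry still
# points to the same place after the working directory changes.
parent_dir = os.path.abspath("..")
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

# sys.path controls Python imports; the PATH environment variable controls
# which executables the shell can find. They are independent of each other.
print("On module search path:", parent_dir in sys.path)
print("Number of PATH entries:", len(os.environ.get("PATH", "").split(os.pathsep)))
```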

In [None]:
import settings

## Record the specific version of AMOEBAE code used

In [None]:
# Record git repository version information.
script_dir = os.getcwd()
# check_output returns bytes in Python 3, so decode before stripping whitespace.
git_hash = subprocess.check_output(["git", "rev-parse", "HEAD"], cwd=script_dir).decode().strip()
git_branch = subprocess.check_output(["git", "rev-parse", "--abbrev-ref", "HEAD"], cwd=script_dir).decode().strip()
print('\nGit repository (code) version: ' + git_hash + ' (branch name: ' + git_branch + ')\n')
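
If the notebook might be run outside a git repository (for example, from a downloaded archive), the `git` calls raise an exception. The following is a minimal sketch of a more forgiving variant; the helper name `git_rev_parse` and the `"unknown"` fallback value are assumptions for illustration, not part of AMOEBAE.

```python
import subprocess


def git_rev_parse(args, cwd="."):
    """Run `git rev-parse` with the given arguments and return its output as text.

    Returns "unknown" if git is not installed, the directory does not exist,
    or the directory is not inside a git repository.
    """
    try:
        output = subprocess.check_output(
            ["git", "rev-parse"] + list(args), cwd=cwd, stderr=subprocess.DEVNULL
        )
    except (subprocess.CalledProcessError, OSError):
        return "unknown"
    # check_output returns bytes in Python 3; decode to get a plain string.
    return output.decode().strip()


print("Commit:", git_rev_parse(["HEAD"]))
print("Branch:", git_rev_parse(["--abbrev-ref", "HEAD"]))
```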

## Make a subdirectory to store output.

In [None]:
%%bash
# The -p option prevents an error if the directory already exists (e.g., on re-running).
mkdir -p amoebae_tutorial_2_output

In [None]:
%cd amoebae_tutorial_2_output

In [None]:
%env DATADIR=AMOEBAE_Data

In [None]:
%%bash
amoebae mkdatadir $DATADIR

In [None]:
# Check that the path indicated in the settings file is correct.
print(settings.root_amoebae_data_dir)
assert settings.root_amoebae_data_dir == "AMOEBAE_Data"

# Set up queries

## Download single-sequence queries

In [None]:
%%time

# Define a dictionary with NCBI sequence accessions as keys and filenames to write
# the corresponding sequences to as values.
query_dict = {"NP_194077.1": "AP1beta_Athaliana_NP_194077.1_query.faa",
              "NP_851058.1": "AP2alpha_Athaliana_NP_851058.1_query.faa",
              "NP_974895.1": "AP2mu_Athaliana_NP_974895.1_query.faa",
              "NP_175219.1": "AP2sigma_Athaliana_NP_175219.1_query.faa",
              "NP_566961.1": "Sec12_Athaliana_NP_566961.1_query.faa",
              "NP_200929.1": "SNAP33_Athaliana_NP_200929.1_query.faa",
              "NP_193449.1": "Rab2_Athaliana_NP_193449.1_query.faa"
          }

# Make a new temporary directory to store sequence files.
temp_query_dir_name = 'temporary_query_dir'
if not os.path.isdir(temp_query_dir_name):
    os.mkdir(temp_query_dir_name)

# Loop over accessions and their output filenames in the query_dict dictionary.
for accession, filename in query_dict.items():
    filepath = os.path.join(temp_query_dir_name, filename)
    # Only download sequences that have not already been downloaded.
    if not os.path.isfile(filepath):
        # Download the sequence from the NCBI Protein database.
        url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&retmode=text&id=' + accession
        r = requests.get(url)
        # Only write the file if the request succeeded and returned FASTA text,
        # so that a failed download is not mistaken for a sequence file.
        if r.ok and r.text.startswith('>'):
            with open(filepath, 'w') as o:
                o.write(r.text)
    # Check that the sequence was actually downloaded.
    assert os.path.isfile(filepath), """The sequence with the following accession could not be downloaded from NCBI: %s\n
    Try re-running this cell.""" % accession
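
Checking that a file exists does not confirm that it holds a sequence. As a minimal sketch, the downloaded files could also be checked to contain exactly one FASTA record each. The helper names below are hypothetical, and the demonstration uses made-up files in a temporary directory rather than real NCBI downloads.

```python
import os
import tempfile


def count_fasta_records(path):
    """Count header lines (those starting with '>') in a FASTA file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def check_single_sequence_queries(directory, filenames):
    """Return the filenames that do not contain exactly one FASTA record."""
    problems = []
    for name in filenames:
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or count_fasta_records(path) != 1:
            problems.append(name)
    return problems


# Demonstration with made-up files: one valid record, one error message
# saved in place of a sequence, and one file that was never written.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "good.faa"), "w") as fh:
        fh.write(">example_1 hypothetical protein\nMKVLA\n")
    with open(os.path.join(demo_dir, "bad.faa"), "w") as fh:
        fh.write("Error: request failed\n")
    demo_problems = check_single_sequence_queries(
        demo_dir, ["good.faa", "bad.faa", "missing.faa"]
    )

print(demo_problems)
```

In this notebook, the equivalent call would be `check_single_sequence_queries(temp_query_dir_name, query_dict.values())`, and an empty list would indicate that every query file holds a single record.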