# Introduction

## Purpose

This Jupyter notebook is intended purely for training purposes. It illustrates how similarity searches can be run, and their results summarized, with a few short lines of code. None of the code in this notebook depends on the main AMOEBAE library; instead, it reproduces some of the core functionality in a self-contained manner, which makes it easier to see how individual lines of code produce individual lines in the output files. For an introduction to running the main AMOEBAE scripts, see the amoebae/notebooks/amoebae_tutorial_2.ipynb notebook.


## Objectives

- Apply a reciprocal-best-hit (RBH) search strategy using Basic Local Alignment Search Tool for Protein (BLASTP) to search for orthologues of a small collection of membrane trafficking proteins in predicted peptide sequences from a handful of genomes.

- Generate a spreadsheet summarizing results of reciprocal BLASTP searches.

- Visually inspect the summary of results to distinguish between positive and negative results.
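
The reciprocal-best-hit criterion named in the first objective can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AMOEBAE implementation: the function name `is_reciprocal_best_hit` and the toy identifiers are hypothetical, and each dictionary simply maps a query ID to the ID of its top-ranked hit.

```python
def is_reciprocal_best_hit(query_id, forward_best, reverse_best):
    """Return True if the top forward hit of query_id points back to query_id.

    forward_best maps each original query ID to its top hit in the target genome.
    reverse_best maps each forward hit ID to its top hit back in the query genome.
    """
    forward_hit = forward_best.get(query_id)
    if forward_hit is None:
        # No forward hit at all, so there is nothing to check reciprocally.
        return False
    return reverse_best.get(forward_hit) == query_id


# Toy data with made-up identifiers (not real accessions).
forward_best = {"queryA": "target1", "queryB": "target2"}
reverse_best = {"target1": "queryA", "target2": "queryC"}

print(is_reciprocal_best_hit("queryA", forward_best, reverse_best))  # True
print(is_reciprocal_best_hit("queryB", forward_best, reverse_best))  # False
```

The rest of this notebook applies the same idea, but records the evidence in a spreadsheet so that each case can be judged by eye.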


## Requirements
 
If you are new to Jupyter notebooks, see the documentation at https://jupyter-notebook.readthedocs.io/en/stable/notebook.html or the user manual at https://jupyter.brynmawr.edu/services/public/dblank/Jupyter%20Notebook%20Users%20Manual.ipynb. Alternatively, just try it out; it is fairly intuitive.

You do not necessarily need to be able to read or write complex computer code to use this notebook. However, a basic understanding of bash (the language used in the Unix/Linux shell) and Python (version 3) would be advantageous. The code in the cells of this notebook is written in either bash or Python, and the bash cells have "%%bash" as the first line to indicate that bash is being used.

The dependencies for this notebook are simply NCBI BLAST+ and some popular Python libraries (biopython, pandas, and requests) that can be installed via the pip command.
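
Before running the cells, it can help to confirm that these dependencies are actually available. The check below is a rough sketch using only the standard library; the function name `check_dependencies` is hypothetical, and the lists of executables and modules are assumptions based on what the cells in this notebook call.

```python
import importlib.util
import shutil


def check_dependencies(executables=("blastp", "makeblastdb"),
                       modules=("Bio", "pandas", "requests")):
    """Map each dependency name to True if it was found, else False."""
    status = {}
    for exe in executables:
        # shutil.which returns None when the executable is not on the PATH.
        status[exe] = shutil.which(exe) is not None
    for mod in modules:
        # find_spec returns None when a top-level module cannot be found.
        status[mod] = importlib.util.find_spec(mod) is not None
    return status


for name, found in check_dependencies().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Anything reported as missing would need to be installed before the later cells can run.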


# Import Python modules

In [None]:
import os
import platform
import subprocess
from Bio import SeqIO
import glob
from Bio.Blast import NCBIXML
import pandas as pd
from IPython.display import display, HTML
import requests

# Record which version of the AMOEBAE repository this notebook is from

In [None]:
# Record git repository version information.
script_dir = os.getcwd()
# check_output returns bytes, so decode to text before printing.
git_hash = subprocess.check_output(["git", "rev-parse", "HEAD"], cwd=script_dir).decode().strip()
git_branch = subprocess.check_output(["git", "rev-parse", "--abbrev-ref", "HEAD"], cwd=script_dir).decode().strip()
print('\nGit repository (code) version: ' + git_hash + ' (branch name: ' + git_branch + ')\n')

# Record system information.
print('System info: ' + str(platform.uname()) + '\n')

# Download all RefSeq peptide sequences for specific genomes.

In [None]:
# Make a new subdirectory to contain output files.
%env SD=amoebae_tutorial_1_output
!mkdir -p $SD

In [None]:
%%bash

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_protein.faa.gz --output $SD/Athaliana_database.faa.gz
gunzip $SD/Athaliana_database.faa.gz

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/210/295/GCF_000210295.1_ASM21029v1/GCF_000210295.1_ASM21029v1_protein.faa.gz --output $SD/Tbrucei_database.faa.gz
gunzip $SD/Tbrucei_database.faa.gz

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_protein.faa.gz --output $SD/Scerevisiae_database.faa.gz
gunzip $SD/Scerevisiae_database.faa.gz

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/151/295/GCA_000151295.1_A_macrogynus_V3/GCA_000151295.1_A_macrogynus_V3_protein.faa.gz --output $SD/Amacrogynus_database.faa.gz
gunzip $SD/Amacrogynus_database.faa.gz

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/004/695/GCF_000004695.1_dicty_2.7/GCF_000004695.1_dicty_2.7_protein.faa.gz --output $SD/Ddiscoideum_database.faa.gz
gunzip $SD/Ddiscoideum_database.faa.gz


In [None]:
%%bash
# List downloaded FASTA files.
ls $SD/*database.faa
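
As a quick sanity check on the downloads, the number of sequences in each FASTA file can be counted. The sketch below assumes only that every record starts with a `>` header line; the helper name `count_fasta_records` is hypothetical.

```python
import glob
import os


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Fall back to the directory name used in this notebook if SD is not set.
output_dir = os.environ.get("SD", "amoebae_tutorial_1_output")
for fasta_path in sorted(glob.glob(os.path.join(output_dir, "*_database.faa"))):
    print(fasta_path, count_fasta_records(fasta_path))
```

A count of zero, or a missing file, would suggest that a download or the decompression step failed.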

# Generate BLASTable databases from sequence files.

In [None]:
%%bash
for X in $SD/*_database.faa; do makeblastdb -in $X -dbtype prot; done

In [None]:
%%bash
# List BLASTable database files.
ls $SD/*database.faa.p*

# Download query peptide sequences.

In [None]:
# Define a dictionary with NCBI sequence accessions as keys and filenames to write
# the corresponding sequences to as values.
query_dict = {"NP_194077.1": "AP1beta_Athaliana_NP_194077.1_query.faa",
              "NP_851058.1": "AP2alpha_Athaliana_NP_851058.1_query.faa",
              "NP_974895.1": "AP2mu_Athaliana_NP_974895.1_query.faa",
              "NP_175219.1": "AP2sigma_Athaliana_NP_175219.1_query.faa",
              "NP_566961.1": "Sec12_Athaliana_NP_566961.1_query.faa",
              "NP_200929.1": "SNAP33_Athaliana_NP_200929.1_query.faa",
              "NP_193449.1": "Rab2_Athaliana_NP_193449.1_query.faa"
          }

# Loop over keys in the query_dict dictionary.
for accession in query_dict.keys():
    # Retrieve the corresponding filename from the dictionary.
    filename = os.path.join(os.environ['SD'], query_dict[accession])
    # Only download sequences that have not already been downloaded.
    if not os.path.isfile(filename):
        # Download the sequence from the NCBI Protein database.
        accessions = [accession]
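        url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&retmode=text&id=' + ','.join(accessions)
        r = requests.get(url)
        # Stop here if the request failed, rather than writing an error page to the file.
        r.raise_for_status()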
        with open(filename, 'w') as o:
            o.write(r.text)
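
The request URL in the cell above is assembled by string concatenation. An alternative, shown here as a sketch rather than as the method this notebook uses, is to build the query string with `urllib.parse.urlencode` and to check that a response looks like FASTA before saving it. The helper names `build_efetch_url` and `looks_like_fasta` are hypothetical; the parameter names are the ones already used above.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions):
    """Build an efetch URL requesting FASTA text for protein accessions."""
    params = {
        "db": "protein",
        "rettype": "fasta",
        "retmode": "text",
        "id": ",".join(accessions),
    }
    return EFETCH_URL + "?" + urlencode(params)


def looks_like_fasta(text):
    """Return True if the text starts with a FASTA header line."""
    return text.lstrip().startswith(">")


print(build_efetch_url(["NP_194077.1"]))
print(looks_like_fasta(">NP_194077.1 example header\nMSEQ\n"))  # True
```

Checking the response this way guards against saving an error message under a `.faa` filename, which would otherwise only surface later as a BLASTP failure.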

In [None]:
%%bash
# List downloaded query FASTA files.
ls $SD/*_Athaliana_*_query.faa

# Run BLASTP searches with all queries in all databases.

In [None]:
%%bash
cd $SD
for QUERY in *_query.faa
do
    for DATABASE in *_database.faa
    do
        OUTPUT=$QUERY'__'$DATABASE'__blastp_search_output.txt'
        blastp -query $QUERY -db $DATABASE -out $OUTPUT
        OUTPUT2=$QUERY'__'$DATABASE'__blastp_search_output.xml'
        blastp -query $QUERY -db $DATABASE -out $OUTPUT2 -outfmt 5
    done
done
cd ..
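
The nested loop above runs every query against every database and encodes both names in the output filename, separated by double underscores. The same logic can be sketched in Python by building the command lines without running them. The function name `build_blastp_commands` is hypothetical; the `blastp` options are the ones used in the bash cell.

```python
def build_blastp_commands(queries, databases):
    """Return blastp argument lists for every query/database pair."""
    commands = []
    for query in queries:
        for database in databases:
            prefix = f"{query}__{database}__blastp_search_output"
            # Human-readable text output.
            commands.append(["blastp", "-query", query, "-db", database,
                             "-out", prefix + ".txt"])
            # XML output (format 5), which is easier to parse later.
            commands.append(["blastp", "-query", query, "-db", database,
                             "-out", prefix + ".xml", "-outfmt", "5"])
    return commands


# Example filenames following the naming convention used in this notebook.
for command in build_blastp_commands(["Rab2_Athaliana_NP_193449.1_query.faa"],
                                     ["Tbrucei_database.faa"]):
    print(" ".join(command))
```

Each argument list could then be passed to `subprocess.run(command, check=True)` from within the output directory, which would be one way to run the searches without a bash cell.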

In [None]:
%%bash
# List forward search BLASTP output files.
ls $SD/*__blastp_search_output.*

# Summarize initial search results in a spreadsheet.

In [None]:
# Open a CSV file.
summary_file_path = os.path.join(os.environ['SD'], '0_summary_of_forward_blastp_searches.csv')
with open(summary_file_path, 'w') as o:
    # Write a line containing column headers.
    o.write(','.join(['Query',
                      'Database',
                      'Hit rank',
                      'ID',
                      'Description',
                      'E-value\n']))
    # Loop over the XML format BLASTP output files.
    for blastp_output in glob.glob(os.path.join(os.environ['SD'], '*blastp_search_output.xml')):
        # Open XML file.
        with open(blastp_output) as blastp_output_handle:
            # Loop over BLAST results (only one query was used, so there should only be one BLAST result anyway).
            for blast_record in NCBIXML.parse(blastp_output_handle):
                hit_rank = 0
                # Loop over hits in the BLAST result.
                for hit in blast_record.descriptions:
                    hit_rank += 1
                    # Ignore hits after the first 10 hits.
                    if hit_rank <= 10:
                        # Parse the sequence ID/accession out of the title attribute of the hit object.
                        hit_id = hit.title.split(' ', 2)[1]
                        # Parse the sequence description out of the title attribute of the hit object.
                        hit_description = hit.title.split(' ', 2)[2]
                        # Write a line with information about this hit to the open CSV file. 
                        o.write(','.join([os.path.basename(blastp_output).split('__')[0], os.path.basename(blastp_output).split('__')[1],
                            str(hit_rank), hit_id, '\"' + hit_description + '\"', str(hit.e)]) + '\n')
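
The cell above splits each hit title on spaces and writes the CSV lines by hand, quoting only the description. A more defensive variant, sketched below, lets the standard `csv` module handle quoting so that commas or quotation marks inside a description cannot break a row. The helper name `split_hit_title` and the example title are hypothetical, and the title layout (a placeholder token, then the accession, then the description) is an assumption taken from how the cell above indexes the split.

```python
import csv
import io


def split_hit_title(title):
    """Split a hit title into (accession, description).

    Assumes the layout 'placeholder accession description'. A title with
    fewer than three space-separated parts raises ValueError.
    """
    _, accession, description = title.split(" ", 2)
    return accession, description


# Made-up title used only for illustration.
example_title = 'gnl|BL_ORD_ID|0 XP_000000.1 example protein, "putative" [Example species]'
accession, description = split_hit_title(example_title)

# Write to an in-memory buffer here; a real file handle works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Query", "Database", "Hit rank", "ID", "Description", "E-value"])
writer.writerow(["q_query.faa", "db_database.faa", 1, accession, description, 1e-50])
print(buffer.getvalue())
```

Because `csv.writer` doubles any embedded quotation marks, the description survives a round trip through `csv.reader` or `pandas.read_csv` unchanged.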

# Visually inspect summary of forward search results

In [None]:
# Load data from the CSV file using the pandas library.
df = pd.read_csv(summary_file_path)
# Display the data in an HTML table.
print('Contents of %s:' % summary_file_path)
display(HTML(df.to_html()))

# Generate reverse search query files.

In [None]:
# Initiate a dictionary to keep track of which sequences...
db_hit_id_dict = {}
with open(summary_file_path) as infh:
    for line in infh:
        # Skip the header line and any blank lines.
        if not line.startswith('Query') and not line.startswith('\n'):
            # The database filename is in the second column and the hit ID is in the fourth.
            db_file = os.path.join(os.environ['SD'], line.split(',')[1])
            hit_id = line.split(',')[3]
            # Add the hit ID to the list for its database.
            db_hit_id_dict.setdefault(db_file, []).append(hit_id)
# Write each hit sequence to its own FASTA file for use as a reverse search query.
for database, hit_ids in db_hit_id_dict.items():
    hit_id_set = set(hit_ids)
    with open(database) as infh:
        for seq in SeqIO.parse(infh, 'fasta'):
            if seq.id in hit_id_set:
                with open(database + '_' + seq.id + '_reverse_query.faa', 'w') as o:
                    SeqIO.write(seq, o, 'fasta')
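
The first half of the cell above groups forward hit IDs by the database they came from. That grouping step can be sketched on its own with `csv.DictReader` and `collections.defaultdict`; the function name `group_hit_ids_by_database` and the example rows are hypothetical, while the column names match the forward search summary file.

```python
import csv
import io
from collections import defaultdict


def group_hit_ids_by_database(csv_handle):
    """Map each database filename to the set of hit IDs found in it."""
    hit_ids = defaultdict(set)
    for row in csv.DictReader(csv_handle):
        hit_ids[row["Database"]].add(row["ID"])
    return hit_ids


# Made-up rows in the same layout as the forward search summary.
example_csv = io.StringIO(
    "Query,Database,Hit rank,ID,Description,E-value\n"
    'q1_query.faa,db1_database.faa,1,XP_000001.1,"protein one, putative",1e-50\n'
    'q2_query.faa,db1_database.faa,1,XP_000001.1,"protein one, putative",1e-10\n'
    'q1_query.faa,db2_database.faa,1,XP_000002.1,"protein two",1e-30\n'
)
for database, ids in group_hit_ids_by_database(example_csv).items():
    print(database, sorted(ids))
```

Using sets means that a sequence hit by several queries is listed only once per database, so each reverse search query file needs to be written only once.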

In [None]:
%%bash
# List reverse search query FASTA files.
ls $SD/*_reverse_query.faa

# Run BLASTP to search with all reverse search queries in a sequence database.

In [None]:
%%bash
REVSRCHDB='Athaliana_database.faa'
cd $SD
for QUERY in *_reverse_query.faa
do
    OUTPUT=$QUERY'__'$REVSRCHDB'__blastp_reverse_search_output.txt'
    blastp -query $QUERY -db $REVSRCHDB -out $OUTPUT
    OUTPUT2=$QUERY'__'$REVSRCHDB'__blastp_reverse_search_output.xml'
    blastp -query $QUERY -db $REVSRCHDB -out $OUTPUT2 -outfmt 5
done
cd ..

In [None]:
%%bash
# List reverse search BLAST output files.
ls $SD/*__blastp_reverse_search_output.*

# Summarize reverse search results by appending columns in a new spreadsheet.

In [None]:
summary_file_path_2 = os.path.join(os.environ['SD'], '0_summary_of_forward_and_reverse_blastp_searches.csv')
with open(summary_file_path) as infh,\
    open(summary_file_path_2, 'w') as o:
    o.write(','.join(['Query',
                      'Database',
                      'Hit rank',
                      'ID',
                      'Hit description',
                      'E-value',
                      'Top reverse search hit ID',
                      'Top reverse search hit description',
                      'Top reverse search hit E-value\n']))
           
    for line in infh:
        if not line.startswith('Query'):
            fwd_hit_id = line.split(',')[3]
            for blastp_output in glob.glob(os.path.join(os.environ['SD'], '*blastp_reverse_search_output.xml')):
                if fwd_hit_id in blastp_output:
                    # Parse the reverse search output once. Each file holds results
                    # for a single query, so only the first record is needed.
                    with open(blastp_output) as blastp_output_handle:
                        rev_hits = next(NCBIXML.parse(blastp_output_handle)).descriptions
                    if rev_hits:
                        # Report the top-ranked reverse search hit.
                        top_rev_hit = rev_hits[0]
                        top_rev_hit_id = top_rev_hit.title.split(' ', 2)[1]
                        top_rev_hit_description = top_rev_hit.title.split(' ', 2)[2]
                        o.write(','.join([line.strip(), top_rev_hit_id, '\"' + top_rev_hit_description + '\"', str(top_rev_hit.e)]) + '\n')
                    else:
                        o.write(line.strip() + ',No reverse search hits\n')
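
The cell above finds the reverse search output for a forward hit by checking whether the hit ID occurs anywhere in the output file path. A stricter lookup, sketched here under the assumption that the file naming convention used earlier in this notebook is followed, rebuilds the expected reverse query filename and compares it exactly. The function names and example filenames are hypothetical.

```python
import os


def reverse_query_name(database, hit_id):
    """Rebuild the reverse query filename used for a given hit."""
    return f"{database}_{hit_id}_reverse_query.faa"


def find_reverse_output(database, hit_id, output_paths):
    """Return the reverse search output path for a hit, or None if absent."""
    expected = reverse_query_name(database, hit_id)
    for path in output_paths:
        # The query filename is the part before the first double underscore.
        if os.path.basename(path).split("__")[0] == expected:
            return path
    return None


# Made-up output filenames following the naming convention.
paths = [
    "out/db1_database.faa_XP_000011.1_reverse_query.faa__ref_database.faa__blastp_reverse_search_output.xml",
    "out/db1_database.faa_XP_000001.1_reverse_query.faa__ref_database.faa__blastp_reverse_search_output.xml",
]
print(find_reverse_output("db1_database.faa", "XP_000001.1", paths))
```

An exact comparison avoids pairing a hit with the output for a different sequence whose filename happens to contain the same characters.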

# Visually inspect summary of forward and reverse search results

The cell below will display the data in an HTML table, but you will probably find it more useful to view the contents of the CSV file using a spreadsheet program such as Microsoft Excel.

In [None]:
# Load data from the CSV file using the pandas library.
df = pd.read_csv(summary_file_path_2)
# Display the data in an HTML table.
print('Contents of %s:' % summary_file_path_2)
display(HTML(df.to_html()))
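
When inspecting the combined table, a useful first pass is to flag rows in which the top reverse search hit is the original query sequence. The sketch below does this for a few made-up rows; the helper names are hypothetical, the column names match the combined summary file, and the query accession is read from the query filename on the assumption that it follows the `Name_Species_Accession_query.faa` pattern used in this notebook.

```python
import csv
import io


def accession_from_query_filename(filename):
    """Extract the accession from a 'Name_Species_Accession_query.faa' filename."""
    suffix = "_query.faa"
    stem = filename[:-len(suffix)] if filename.endswith(suffix) else filename
    return stem.split("_", 2)[2]


def is_reciprocal(row):
    """Return True if the top reverse hit is the original query sequence."""
    return row["Top reverse search hit ID"] == accession_from_query_filename(row["Query"])


# Made-up hits for one of the queries defined earlier in this notebook.
example_csv = io.StringIO(
    "Query,Database,Hit rank,ID,Hit description,E-value,"
    "Top reverse search hit ID,Top reverse search hit description,"
    "Top reverse search hit E-value\n"
    'AP2mu_Athaliana_NP_974895.1_query.faa,Example_database.faa,1,XP_000001.1,"hypothetical protein",1e-100,NP_974895.1,"example description",1e-100\n'
    'AP2mu_Athaliana_NP_974895.1_query.faa,Example_database.faa,2,XP_000002.1,"hypothetical protein",1e-20,NP_000000.1,"example description",1e-30\n'
    'AP2mu_Athaliana_NP_974895.1_query.faa,Example_database.faa,3,XP_000003.1,"hypothetical protein",0.5,No reverse search hits\n'
)
results = [(row["ID"], is_reciprocal(row)) for row in csv.DictReader(example_csv)]
for hit_id, reciprocal in results:
    print(hit_id, reciprocal)
```

A strict identity test like this is only a starting point. For example, if the query has close paralogues in the *A. thaliana* genome, the top reverse hit may be a paralogue rather than the query itself, so such rows still deserve a look by eye.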

# Interpret results
Modify the table below to reflect your interpretation of the above results. Carefully consider potential sources of error.

**Table 1: Summary of similarity search results.** Numbers indicate the number of orthologues of each protein in each genome.

|     Genome      | AP-2 beta | AP-2 alpha | AP-2 mu | AP-2 sigma | Sec12 | SNAP33 | Rab2 |
|       ---       |    ---    |    ---     |   ---   |    ---     |  ---  |   ---  | ---  |
| *A. thaliana*   |     0     |     0      |    0    |     0      |   0   |    0   |  0   |
| *T. brucei*     |     0     |     0      |    0    |     0      |   0   |    0   |  0   |
| *D. discoideum* |     0     |     0      |    0    |     0      |   0   |    0   |  0   |
| *A. macrogynus* |     0     |     0      |    0    |     0      |   0   |    0   |  0   |
| *S. cerevisiae* |     0     |     0      |    0    |     0      |   0   |    0   |  0   |


# Checkpoint this notebook

In [None]:
%%javascript
// Save and checkpoint the current notebook (same as doing it manually through the GUI).
require(["base/js/namespace"],function(Jupyter) {
    Jupyter.notebook.save_checkpoint();
});

# Print this notebook
This step is optional, and requires installation of additional dependencies: nbconvert, pandoc, and LaTeX.

In [None]:
# Define author name for this notebook.
author_name = ""

# Define title of this notebook.
notebook_title = 'amoebae_tutorial_1'.replace('_', ' ')

In [None]:
# Import modules.
import os
from string import Template

# Write a latex template file for converting this notebook to latex (as an intermediate to PDF).
latex_template_string = Template(r"""
((*- extends 'article.tplx' -*))

((* block author *))
\author{$an}
((* endblock author *))

((* block title *))
\title{$nt}
((* endblock title *))
""")
latex_file_contents =\
latex_template_string.substitute(an=author_name,
                                 nt=notebook_title
                                )
latex_template_file_path = 'latex_template.tplx'
with open(latex_template_file_path, 'w') as o:
    o.write(latex_file_contents)

# Convert notebook to PDF (with latex as an intermediate to process bibtex citations, etc.).
!jupyter nbconvert ./amoebae_tutorial_1.ipynb --to pdf --template {latex_template_file_path}

# Remove the LaTeX template file.
os.remove(latex_template_file_path)

# References

**The first publication to report the use of AMOEBAE for comparative genomics**:

Larson, R.T., Dacks, J.B., Barlow, L.D., 2019. Recent gene duplications dominate evolutionary dynamics of adaptor protein complex subunits in embryophytes. Traffic tra.12698. https://doi.org/10.1111/tra.12698


**The AMOEBAE GitHub Repository**:

https://github.com/laelbarlow/amoebae