# Introduction

## Purpose

This Jupyter notebook is intended purely for training purposes. It illustrates how similarity searches can be performed, and their results summarized, with a few short lines of code. None of the code in this notebook depends on the main AMOEBAE library; instead, it reproduces some of the core functionality in a self-contained manner, which makes it easier to see how individual lines of code produce lines of results in the output files. For an introduction to running the main AMOEBAE scripts, see the amoebae/notebooks/amoebae_tutorial_2.ipynb notebook.


## Objectives

- The code in this notebook performs a form of reciprocal-best-hit (RBH) search using BLASTP to look for orthologues of a small collection of membrane trafficking proteins in a handful of genomes.

- The main output from running this code successfully is a spreadsheet summarizing the results of Basic Local Alignment Search Tool for proteins (BLASTP) searches in a selection of peptide sequence databases, together with the top BLASTP hit retrieved from the *A. thaliana* database when each initial hit is used as a query (the reverse search).
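
The reciprocal-best-hit criterion mentioned above can be sketched in a few lines of Python. This is only a minimal illustration and is not code from AMOEBAE: the function name `is_reciprocal_best_hit` and the toy IDs are hypothetical, and real searches return ranked lists of hits with E-values rather than a single ID.

```python
def is_reciprocal_best_hit(query_id, forward_top_hit, reverse_top_hits):
    """Return True if the top forward hit retrieves the original query
    as its own top hit in the reverse search.

    forward_top_hit: ID of the best hit for query_id in a subject database
        (None if the forward search found nothing).
    reverse_top_hits: dict mapping each subject sequence ID to the ID of
        its best hit back in the query's own genome.
    """
    if forward_top_hit is None:
        return False
    return reverse_top_hits.get(forward_top_hit) == query_id


# Toy data (hypothetical IDs, for illustration only).
reverse_top_hits = {"subject_1": "query_A", "subject_2": "query_B"}

print(is_reciprocal_best_hit("query_A", "subject_1", reverse_top_hits))  # True
print(is_reciprocal_best_hit("query_A", "subject_2", reverse_top_hits))  # False
```

The notebook itself is less strict than this sketch: it records up to ten forward hits per search along with the top reverse hit for each, and leaves the interpretation to you.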


## Requirements
 
If you are new to Jupyter notebooks, see the official documentation (https://jupyter-notebook.readthedocs.io/en/stable/notebook.html) or this user manual (https://jupyter.brynmawr.edu/services/public/dblank/Jupyter%20Notebook%20Users%20Manual.ipynb). Alternatively, just try it out; it is rather intuitive.

You do not necessarily need to be able to read or write complex computer code to use this notebook. However, a basic understanding of bash (the language used in the Unix/Linux shell) and Python (version 3) would be advantageous. The code in the cells of this notebook is written in either bash or Python, and the bash cells have "%%bash" as the first line to indicate that bash is being used.

The dependencies for this notebook are simply NCBI BLAST+ and some popular Python libraries. This notebook is intended to be run in a virtual environment set up using Docker (see the main documentation file for instructions). Doing so ensures that all the dependencies are available for use by the code in this notebook.
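
If you are unsure whether the command-line dependencies are available in your environment, a check along the following lines may help. It is a sketch using only the Python standard library; the helper name `find_missing_programs` is hypothetical, and the program names are the ones called in the bash cells of this notebook.

```python
import shutil


def find_missing_programs(programs):
    """Return the names of programs that cannot be found on the PATH."""
    return [name for name in programs if shutil.which(name) is None]


# Programs called from the bash cells in this notebook.
missing = find_missing_programs(["curl", "makeblastdb", "blastp"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All command-line programs used in this notebook were found.")
```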


In [1]:
%%bash
pwd

/opt/notebooks/notebooks


# Import Python modules.

In [3]:
import os
from Bio import SeqIO
from Bio import Entrez
import glob
from Bio.Blast import NCBIXML
#import ipywidgets as widgets
import pandas as pd
from IPython.display import display, HTML

# Download all RefSeq peptide sequences for specific genomes.

In [4]:
%%bash

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_protein.faa.gz --output Athaliana_database.faa.gz
gunzip Athaliana_database.faa.gz

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/210/295/GCF_000210295.1_ASM21029v1/GCF_000210295.1_ASM21029v1_protein.faa.gz --output Tbrucei_database.faa.gz
gunzip Tbrucei_database.faa.gz

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_protein.faa.gz --output Scerevisiae_database.faa.gz
gunzip Scerevisiae_database.faa.gz

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/151/295/GCA_000151295.1_A_macrogynus_V3/GCA_000151295.1_A_macrogynus_V3_protein.faa.gz --output Amacrogynus_database.faa.gz
gunzip Amacrogynus_database.faa.gz

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/004/695/GCF_000004695.1_dicty_2.7/GCF_000004695.1_dicty_2.7_protein.faa.gz --output Ddiscoideum_database.faa.gz
gunzip Ddiscoideum_database.faa.gz



In [17]:
%%bash
# List downloaded FASTA files.
ls *database.faa

Amacrogynus_database.faa
Athaliana_database.faa
Ddiscoideum_database.faa
Scerevisiae_database.faa
Tbrucei_database.faa
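
As a quick sanity check on the downloads, the number of sequences in each FASTA file can be counted by counting header lines. The sketch below uses only the standard library, and the helper name `count_fasta_records` is hypothetical. The counts can later be compared with the "added N sequences" lines that makeblastdb reports.

```python
import glob


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Report the number of sequences in each downloaded database file.
for fasta_path in sorted(glob.glob("*_database.faa")):
    print(fasta_path, count_fasta_records(fasta_path))
```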


# Generate BLASTable databases from sequence files.

In [5]:
%%bash
for X in *_database.faa; do makeblastdb -in $X -dbtype prot; done



Building a new DB, current time: 02/01/2020 14:45:51
New DB name:   /opt/notebooks/notebooks/Amacrogynus_database.faa
New DB title:  Amacrogynus_database.faa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 19447 sequences in 2.49697 seconds.


Building a new DB, current time: 02/01/2020 14:45:53
New DB name:   /opt/notebooks/notebooks/Athaliana_database.faa
New DB title:  Athaliana_database.faa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 48265 sequences in 5.1589 seconds.


Building a new DB, current time: 02/01/2020 14:45:58
New DB name:   /opt/notebooks/notebooks/Ddiscoideum_database.faa
New DB title:  Ddiscoideum_database.faa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 13315 sequences in 1.73108 seconds.


Building a new DB, current time: 02/01/2020 14:46:00
New DB name:   /opt/notebooks/notebooks/Scerevisiae_d

In [18]:
%%bash
# List BLASTable database files.
ls *database.faa.p*

Amacrogynus_database.faa.phr
Amacrogynus_database.faa.pin
Amacrogynus_database.faa.psq
Athaliana_database.faa.phr
Athaliana_database.faa.pin
Athaliana_database.faa.psq
Ddiscoideum_database.faa.phr
Ddiscoideum_database.faa.pin
Ddiscoideum_database.faa.psq
Scerevisiae_database.faa.phr
Scerevisiae_database.faa.pin
Scerevisiae_database.faa.psq
Tbrucei_database.faa.phr
Tbrucei_database.faa.pin
Tbrucei_database.faa.psq
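
The same check can be done from Python, which is convenient if you want to stop early when a database was not built. This is a sketch under the assumption that each database consists of the three index files listed above (`.phr`, `.pin`, `.psq`); other BLAST+ versions or very large databases may produce additional or differently named files. The helper name `missing_index_files` is hypothetical.

```python
import glob
import os


def missing_index_files(fasta_path, extensions=(".phr", ".pin", ".psq")):
    """Return the expected BLAST index files that do not exist."""
    return [fasta_path + extension for extension in extensions
            if not os.path.isfile(fasta_path + extension)]


# Report the status of each database built from a *_database.faa file.
for fasta_path in sorted(glob.glob("*_database.faa")):
    missing = missing_index_files(fasta_path)
    print(fasta_path, "OK" if not missing else "missing: " + ", ".join(missing))
```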


# Enter your email to access the NCBI protein database via NCBI Entrez.

In [6]:
Entrez.email = input("Enter your email address here: ")  # Tell NCBI who you are.

Enter your email address here: lael@ualberta.ca


# Download query peptide sequences.

In [7]:
# Define a dictionary with NCBI sequence accessions as keys and filenames to write
# the corresponding sequences to as values.
query_dict = {"NP_194077.1": "AP1beta_Athaliana_NP_194077.1_query.faa",
              "NP_851058.1": "AP2alpha_Athaliana_NP_851058.1_query.faa",
              "NP_974895.1": "AP2mu_Athaliana_NP_974895.1_query.faa",
              "NP_175219.1": "AP2sigma_Athaliana_NP_175219.1_query.faa",
              "NP_566961.1": "Sec12_Athaliana_NP_566961.1_query.faa",
              "NP_200929.1": "SNAP33_Athaliana_NP_200929.1_query.faa",
              "NP_193449.1": "Rab2_Athaliana_NP_193449.1_query.faa"
          }

# Loop over accessions and filenames in the query_dict dictionary.
for accession, filename in query_dict.items():
    # Only download sequences that have not already been downloaded.
    if not os.path.isfile(filename):
        # Download the sequence from NCBI via Entrez, using the Biopython module.
        net_handle = Entrez.efetch(db="protein", id=accession, rettype="fasta", retmode="text")
        # Write the FASTA text to the output file (closed automatically).
        with open(filename, "w") as out_handle:
            out_handle.write(net_handle.read())
        net_handle.close()

In [20]:
%%bash
# List downloaded query FASTA files.
ls *_Athaliana_*_query.faa

AP1beta_Athaliana_NP_194077.1_query.faa
AP2alpha_Athaliana_NP_851058.1_query.faa
AP2mu_Athaliana_NP_974895.1_query.faa
AP2sigma_Athaliana_NP_175219.1_query.faa
Rab2_Athaliana_NP_193449.1_query.faa
SNAP33_Athaliana_NP_200929.1_query.faa
Sec12_Athaliana_NP_566961.1_query.faa


# Run BLASTP searches with all queries in all databases.

In [8]:
%%bash
for QUERY in *_query.faa
do
    for DATABASE in *_database.faa
    do
        OUTPUT=$QUERY'__'$DATABASE'__blastp_search_output.txt'
        blastp -query $QUERY -db $DATABASE -out $OUTPUT
        OUTPUT2=$QUERY'__'$DATABASE'__blastp_search_output.xml'
        blastp -query $QUERY -db $DATABASE -out $OUTPUT2 -outfmt 5
    done
done
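
For readers more comfortable with Python than bash, the nested loop above could be written roughly as follows. This is a sketch, not part of the original notebook: the function names are hypothetical, only the blastp options already used above (`-query`, `-db`, `-out`, `-outfmt 5`) appear, and `run_forward_searches` is defined but not called here because it requires BLAST+ and the downloaded files.

```python
import glob
import subprocess


def build_blastp_command(query, database, output, xml=False):
    """Build the blastp argument list for one query/database pair."""
    command = ["blastp", "-query", query, "-db", database, "-out", output]
    if xml:
        # -outfmt 5 requests XML output, as in the bash cell above.
        command += ["-outfmt", "5"]
    return command


def run_forward_searches():
    """Run blastp with every query against every database (requires BLAST+)."""
    # Skip reverse search queries, which also end in '_query.faa'.
    queries = [q for q in sorted(glob.glob("*_query.faa"))
               if not q.endswith("_reverse_query.faa")]
    for query in queries:
        for database in sorted(glob.glob("*_database.faa")):
            prefix = f"{query}__{database}__blastp_search_output"
            subprocess.run(build_blastp_command(query, database, prefix + ".txt"), check=True)
            subprocess.run(build_blastp_command(query, database, prefix + ".xml", xml=True), check=True)


# Show the command for one hypothetical pair without running it.
print(" ".join(build_blastp_command("q_query.faa", "db_database.faa", "out.xml", xml=True)))
```

Passing the arguments as a list avoids shell quoting problems with unusual filenames, and `check=True` stops the loop if a search fails.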

In [21]:
%%bash
# List forward search BLASTP output files.
ls *__blastp_search_output.*

AP1beta_Athaliana_NP_194077.1_query.faa__Amacrogynus_database.faa__blastp_search_output.txt
AP1beta_Athaliana_NP_194077.1_query.faa__Amacrogynus_database.faa__blastp_search_output.xml
AP1beta_Athaliana_NP_194077.1_query.faa__Athaliana_database.faa__blastp_search_output.txt
AP1beta_Athaliana_NP_194077.1_query.faa__Athaliana_database.faa__blastp_search_output.xml
AP1beta_Athaliana_NP_194077.1_query.faa__Ddiscoideum_database.faa__blastp_search_output.txt
AP1beta_Athaliana_NP_194077.1_query.faa__Ddiscoideum_database.faa__blastp_search_output.xml
AP1beta_Athaliana_NP_194077.1_query.faa__Scerevisiae_database.faa__blastp_search_output.txt
AP1beta_Athaliana_NP_194077.1_query.faa__Scerevisiae_database.faa__blastp_search_output.xml
AP1beta_Athaliana_NP_194077.1_query.faa__Tbrucei_database.faa__blastp_search_output.txt
AP1beta_Athaliana_NP_194077.1_query.faa__Tbrucei_database.faa__blastp_search_output.xml
AP2alpha_Athaliana_NP_851058.1_query.faa__Amacrogynus_database.faa__blastp_search_output.txt

# Summarize initial search results in a spreadsheet.

In [9]:
# Open a CSV file.
with open('0_summary_of_forward_blastp_searches.csv', 'w') as o:
    # Write a line containing column headers.
    o.write(','.join(['Query',
                      'Database',
                      'Hit rank',
                      'ID',
                      'Description',
                      'E-value\n']))
    # Loop over the XML format BLASTP output files.
    for blastp_output in glob.glob('*blastp_search_output.xml'):
        # Open XML file.
        with open(blastp_output) as blastp_output_handle:
            # Loop over BLAST results (only one query was used, so there should only be one BLAST result anyway).
            for blast_record in NCBIXML.parse(blastp_output_handle):
                hit_rank = 0
                # Loop over hits in the BLAST result.
                for hit in blast_record.descriptions:
                    hit_rank += 1
                    # Ignore hits after the first 10 hits.
                    if hit_rank <= 10:
                        # Parse the sequence ID/accession out of the title attribute of the hit object.
                        hit_id = hit.title.split(' ', 2)[1]
                        # Parse the sequence description out of the title attribute of the hit object.
                        hit_description = hit.title.split(' ', 2)[2]
                        # Write a line with information about this hit to the open CSV file. 
                        o.write(','.join([blastp_output.split('__')[0], blastp_output.split('__')[1],
                            str(hit_rank), hit_id, '\"' + hit_description + '\"', str(hit.e)]) + '\n')
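
The cell above builds each CSV line by hand, which works as long as descriptions do not themselves contain double quotes. As a hedged alternative, the sketch below uses Python's standard `csv` module, which handles quoting automatically. The helper name `split_hit_title` and the example title are hypothetical; the title layout (an internal BLAST identifier, then the accession, then the description) is an assumption matching what the parsing in the cell above relies on.

```python
import csv
import io


def split_hit_title(title):
    """Split a hit title into (sequence ID, description).

    Assumes the form '<internal ID> <accession> <description>'.
    """
    parts = title.split(" ", 2)
    hit_id = parts[1]
    description = parts[2] if len(parts) > 2 else ""
    return hit_id, description


# Hypothetical title containing both a comma and double quotes.
example_title = 'gnl|BL_ORD_ID|0 XP_000000.1 protein "X", putative [Example species]'
hit_id, description = split_hit_title(example_title)

# Write to an in-memory buffer; csv.writer adds quoting where needed.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ID", "Description"])
writer.writerow([hit_id, description])

# Reading the text back recovers the original fields intact.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[1])
```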

# Generate reverse search query files.

In [10]:
# Build a dictionary that maps each database filename to the IDs of the
# forward search hits found in that database.
db_hit_id_dict = {}
with open('0_summary_of_forward_blastp_searches.csv') as infh:
    for line in infh:
        # Skip the header line and any blank lines.
        if not line.startswith('Query') and not line.startswith('\n'):
            db_file = line.split(',')[1]
            hit_id = line.split(',')[3]
            # Use a set so that each hit ID is stored only once per database.
            db_hit_id_dict.setdefault(db_file, set()).add(hit_id)
# Write each hit sequence to its own FASTA file for use as a reverse search query.
for database in db_hit_id_dict.keys():
    with open(database) as infh:
        for seq in SeqIO.parse(infh, 'fasta'):
            if seq.id in db_hit_id_dict[database]:
                with open(database + '_' + seq.id + '_reverse_query.faa', 'w') as o:
                    SeqIO.write(seq, o, 'fasta')
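
The grouping step in the cell above (collecting hit IDs per database) can also be expressed with `csv.DictReader` and `collections.defaultdict`, which reads columns by name rather than by position. The following is a sketch on toy data: the function name `group_hit_ids_by_database` and the toy rows are hypothetical, while the column headers match those written to the forward search summary.

```python
import csv
import io
from collections import defaultdict


def group_hit_ids_by_database(csv_handle):
    """Map each database filename to the set of hit IDs found in it."""
    hit_ids_by_database = defaultdict(set)
    for row in csv.DictReader(csv_handle):
        hit_ids_by_database[row["Database"]].add(row["ID"])
    return hit_ids_by_database


# Toy CSV text with the same column headers as the forward search summary.
toy_csv = io.StringIO(
    "Query,Database,Hit rank,ID,Description,E-value\n"
    'q1.faa,db1.faa,1,seq_1,"desc, with comma",1e-50\n'
    "q2.faa,db1.faa,1,seq_1,desc,1e-40\n"
    "q1.faa,db2.faa,1,seq_2,desc,1e-30\n"
)
grouped = group_hit_ids_by_database(toy_csv)
print({database: sorted(ids) for database, ids in grouped.items()})
```

Because sets are used, a sequence retrieved by two different queries (like `seq_1` here) is written out as a reverse search query only once.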

In [22]:
%%bash
# List reverse search query FASTA files.
ls *_reverse_query.faa

Amacrogynus_database.faa_KNE54303.1_reverse_query.faa
Amacrogynus_database.faa_KNE54381.1_reverse_query.faa
Amacrogynus_database.faa_KNE54458.1_reverse_query.faa
Amacrogynus_database.faa_KNE54583.1_reverse_query.faa
Amacrogynus_database.faa_KNE56115.1_reverse_query.faa
Amacrogynus_database.faa_KNE56227.1_reverse_query.faa
Amacrogynus_database.faa_KNE56932.1_reverse_query.faa
Amacrogynus_database.faa_KNE57938.1_reverse_query.faa
Amacrogynus_database.faa_KNE58904.1_reverse_query.faa
Amacrogynus_database.faa_KNE59113.1_reverse_query.faa
Amacrogynus_database.faa_KNE60410.1_reverse_query.faa
Amacrogynus_database.faa_KNE60456.1_reverse_query.faa
Amacrogynus_database.faa_KNE61506.1_reverse_query.faa
Amacrogynus_database.faa_KNE61706.1_reverse_query.faa
Amacrogynus_database.faa_KNE61724.1_reverse_query.faa
Amacrogynus_database.faa_KNE61753.1_reverse_query.faa
Amacrogynus_database.faa_KNE61843.1_reverse_query.faa
Amacrogynus_database.faa_KNE61855.1_reverse_query.faa
Amacrogynus_database.faa_KNE

# Run BLASTP to search with all reverse search queries in a sequence database.

In [11]:
%%bash
REVSRCHDB='Athaliana_database.faa'
for QUERY in *_reverse_query.faa
do
    # Name the output files after both the query and the reverse search database.
    OUTPUT=$QUERY'__'$REVSRCHDB'__blastp_reverse_search_output.txt'
    blastp -query $QUERY -db $REVSRCHDB -out $OUTPUT
    OUTPUT2=$QUERY'__'$REVSRCHDB'__blastp_reverse_search_output.xml'
    blastp -query $QUERY -db $REVSRCHDB -out $OUTPUT2 -outfmt 5
done

In [23]:
%%bash
# List reverse search BLAST output files.
ls *__blastp_reverse_search_output.*

Amacrogynus_database.faa_KNE54303.1_reverse_query.faa__Athaliana_database.faa__blastp_reverse_search_output.txt
Amacrogynus_database.faa_KNE54303.1_reverse_query.faa__Athaliana_database.faa__blastp_reverse_search_output.xml
Amacrogynus_database.faa_KNE54381.1_reverse_query.faa__Athaliana_database.faa__blastp_reverse_search_output.txt
Amacrogynus_database.faa_KNE54381.1_reverse_query.faa__Athaliana_database.faa__blastp_reverse_search_output.xml
Amacrogynus_database.faa_KNE54458.1_reverse_query.faa__Athaliana_database.faa__blastp_reverse_search_output.txt
Amacrogynus_database.faa_KNE54458.1_reverse_query.faa__Athaliana_database.faa__blastp_reverse_search_output.xml
Amacrogynus_database.faa_KNE54583.1_reverse_query.faa__Athaliana_database.faa__blastp_reverse_search_output.txt
Amacrogynus_database.faa_KNE54583.1_reverse_query.faa__Athaliana_database.faa__blastp_reverse_search_output.xml
Amacrogynus_database.faa_KNE56115.1_reverse_query.faa__Athaliana_database.faa__blastp_reverse_search_output.txt
Amacrogynus_database.faa_KNE56115.1_reverse_query.faa__Athaliana_database.faa__blastp_reverse_search_output.xml
Amacrogynus_database.faa_KNE56227.1_reverse_query.faa__Athaliana_database.faa__blastp_reverse_search_output.txt
Amacrogynu

# Summarize reverse search results by appending columns in a new spreadsheet.

In [14]:
with open('0_summary_of_forward_blastp_searches.csv') as infh,\
    open('0_summary_of_forward_and_reverse_blastp_searches.csv', 'w') as o:
    # Write a line containing column headers.
    o.write(','.join(['Query',
                      'Database',
                      'Hit rank',
                      'ID',
                      'Hit description',
                      'E-value',
                      'Top reverse search hit ID',
                      'Top reverse search hit description',
                      'Top reverse search hit E-value\n']))
    # Loop over the forward search hits, skipping the header line.
    for line in infh:
        if not line.startswith('Query'):
            fwd_hit_id = line.split(',')[3]
            # Find the reverse search output file for this forward hit.
            for blastp_output in glob.glob('*blastp_reverse_search_output.xml'):
                if '_' + fwd_hit_id + '_reverse_query.faa' in blastp_output:
                    # Parse the XML file once (only one query was used, so there is only one BLAST result).
                    with open(blastp_output) as blastp_output_handle:
                        blast_record = next(NCBIXML.parse(blastp_output_handle))
                    if len(blast_record.descriptions) >= 1:
                        # Append the top reverse search hit to the forward search line.
                        top_rev_hit = blast_record.descriptions[0]
                        top_rev_hit_id = top_rev_hit.title.split(' ', 2)[1]
                        top_rev_hit_description = top_rev_hit.title.split(' ', 2)[2]
                        o.write(','.join([line.strip(), top_rev_hit_id, '\"' + top_rev_hit_description + '\"', str(top_rev_hit.e)]) + '\n')
                    else:
                        # Leave the description and E-value columns empty.
                        o.write(line.strip() + ',No reverse search hits,,\n')
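
When reading the combined spreadsheet, a natural first question for each row is whether the top reverse search hit is the original query itself. The sketch below shows one way to check this by extracting the accession from the query filename. The function names and the regular expression are hypothetical, and the pattern assumes RefSeq-style accessions (two letters, an underscore, digits, and a version) as used in the query filenames in this notebook.

```python
import re

# Matches e.g. '_NP_193449.1_query.faa' at the end of a query filename.
ACCESSION_PATTERN = re.compile(r"_([A-Z]{2}_\d+\.\d+)_query\.faa$")


def query_accession(query_filename):
    """Extract the accession from a query filename, or None if absent."""
    match = ACCESSION_PATTERN.search(query_filename)
    return match.group(1) if match else None


def retrieves_original_query(query_filename, top_reverse_hit_id):
    """Return True if the top reverse hit is the original query sequence."""
    accession = query_accession(query_filename)
    return accession is not None and accession == top_reverse_hit_id


print(query_accession("Rab2_Athaliana_NP_193449.1_query.faa"))  # NP_193449.1
print(retrieves_original_query("Rab2_Athaliana_NP_193449.1_query.faa", "NP_193449.1"))  # True
print(retrieves_original_query("Rab2_Athaliana_NP_193449.1_query.faa", "NP_193450.1"))  # False
```

A mismatch does not by itself rule out orthology, and a match does not prove it; the checklists later in this notebook list reasons why either outcome can be misleading.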

# Visually inspect output spreadsheet

The cell below will display the data in an HTML table, but you will probably find it more useful to view the contents of the CSV file using a spreadsheet program such as Microsoft Excel.

In [15]:
# Load data from the CSV file using the pandas library.
df = pd.read_csv('0_summary_of_forward_and_reverse_blastp_searches.csv')
# Display the data in an HTML table.
display(HTML(df.to_html()))

|   | Query | Database | Hit rank | ID | Hit description | E-value | Top reverse search hit ID | Top reverse search hit description | Top reverse search hit E-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Rab2_Athaliana_NP_193449.1_query.faa | Ddiscoideum_database.faa | 1 | XP_629685.1 | Rab GTPase [Dictyostelium discoideum AX4] | 3.8356e-101 | NP_193450.1 | RAB GTPase homolog B1C [Arabidopsis thaliana] | 1.2443200000000001e-117 |
| 1 | Rab2_Athaliana_NP_193449.1_query.faa | Ddiscoideum_database.faa | 2 | XP_640740.1 | Rab GTPase [Dictyostelium discoideum AX4] | 7.13085e-79 | NP_193450.1 | RAB GTPase homolog B1C [Arabidopsis thaliana] | 1.0153900000000001e-82 |
| 2 | Rab2_Athaliana_NP_193449.1_query.faa | Ddiscoideum_database.faa | 3 | XP_645208.1 | Rab GTPase [Dictyostelium discoideum AX4] | 1.2046599999999999e-70 | NP_193450.1 | RAB GTPase homolog B1C [Arabidopsis thaliana] | 1.55882e-76 |
| 3 | Rab2_Athaliana_NP_193449.1_query.faa | Ddiscoideum_database.faa | 4 | XP_641380.1 | Rab GTPase [Dictyostelium discoideum AX4] | 8.25828e-64 | NP_850696.1 | RAB GTPase homolog 8 [Arabidopsis thaliana] | 4.06126e-106 |
| 4 | Rab2_Athaliana_NP_193449.1_query.faa | Ddiscoideum_database.faa | 5 | XP_643115.1 | Rab GTPase [Dictyostelium discoideum AX4] | 1.50976e-62 | NP_850696.1 | RAB GTPase homolog 8 [Arabidopsis thaliana] | 8.32972e-104 |
| 5 | Rab2_Athaliana_NP_193449.1_query.faa | Ddiscoideum_database.faa | 6 | XP_629589.1 | Rab GTPase [Dictyostelium discoideum AX4] | 1.21715e-60 | NP_193450.1 | RAB GTPase homolog B1C [Arabidopsis thaliana] | 2.81261e-64 |
| 6 | Rab2_Athaliana_NP_193449.1_query.faa | Ddiscoideum_database.faa | 7 | XP_642309.1 | Rab GTPase [Dictyostelium discoideum AX4] | 4.9017600000000003e-60 | NP_171715.1 | RAS 5 [Arabidopsis thaliana] | 1.14546e-99 |
| 7 | Rab2_Athaliana_NP_193449.1_query.faa | Ddiscoideum_database.faa | 8 | XP_639975.1 | Rab GTPase [Dictyostelium discoideum AX4] | 6.02265e-59 | NP_568678.1 | RAB GTPase homolog 1A [Arabidopsis thaliana] | 7.31024e-97 |
| 8 | Rab2_Athaliana_NP_193449.1_query.faa | Ddiscoideum_database.faa | 9 | XP_646937.1 | Rab GTPase [Dictyostelium discoideum AX4] | 4.41913e-58 | NP_195311.1 | GTP-binding 2 [Arabidopsis thaliana] | 4.15972e-61 |
| 9 | Rab2_Athaliana_NP_193449.1_query.faa | Ddiscoideum_database.faa | 10 | XP_638915.1 | Rab GTPase [Dictyostelium discoideum AX4] | 2.6427200000000002e-55 | NP_171715.1 | RAS 5 [Arabidopsis thaliana] | 8.89518e-107 |


# Interpret results
Modify the table below to reflect your interpretation of the above results.

**Table 1: Summary of similarity search results.** Numbers indicate the number of orthologues of each protein in each genome.

|     Genome      | AP-1 beta | AP-2 alpha | AP-2 mu | AP-2 sigma | Sec12 | SNAP33 | Rab2 |
|       ---       |    ---    |    ---     |   ---   |    ---     |  ---  |   ---  | ---  |
| *A. thaliana*   |     0     |     0      |    0    |     0      |   0   |    0   |  0   |
| *T. brucei*     |     0     |     0      |    0    |     0      |   0   |    0   |  0   |
| *D. discoideum* |     0     |     0      |    0    |     0      |   0   |    0   |  0   |
| *A. macrogynus* |     0     |     0      |    0    |     0      |   0   |    0   |  0   |
| *S. cerevisiae* |     0     |     0      |    0    |     0      |   0   |    0   |  0   |
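
Once you have decided which reverse search results you accept, the counts for Table 1 can be tallied programmatically as a starting point for manual curation. The sketch below is an assumption-laden illustration: the function name `tally_candidates`, the toy rows, and the acceptance rule are hypothetical, and only the column names match those in the combined CSV file.

```python
from collections import Counter


def tally_candidates(rows, is_candidate):
    """Count rows accepted by is_candidate, per (database, query) pair."""
    counts = Counter()
    for row in rows:
        if is_candidate(row):
            counts[(row["Database"], row["Query"])] += 1
    return counts


# Toy rows using column names from the combined CSV file.
rows = [
    {"Query": "q1_query.faa", "Database": "db1.faa", "Top reverse search hit ID": "acc_1"},
    {"Query": "q1_query.faa", "Database": "db1.faa", "Top reverse search hit ID": "acc_2"},
    {"Query": "q1_query.faa", "Database": "db2.faa", "Top reverse search hit ID": "acc_1"},
]

# Hypothetical rule: accept hits whose top reverse hit is in a chosen set.
accepted = {"acc_1"}
counts = tally_candidates(rows, lambda row: row["Top reverse search hit ID"] in accepted)
print(counts[("db1.faa", "q1_query.faa")])  # 1
```

Automated counts like these should be revised after working through the checklists in the next section.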


# Analyze results critically

Work through the following checklists. For each potential source of error, consider whether you have addressed it, and describe the steps you took, or that would be required, to do so.

## Potential causes of false-positive results:

**Mistaken identity of query** sequences and of sequences expected to be retrieved in reverse searches. Sometimes terminology and annotations are misleading.


    Notes:


**Outparalogues of the query are absent from the query database**, but present in a subject database. As a result, the original query may be retrieved as the top hit in reverse searches with outparalogues, not only in reverse searches with orthologues.


    Notes:

**Presence of a highly conserved additional domain** in the query sequence(s) that is also found in non-orthologous sequences. For example, WD40 repeat regions.


    Notes:

**Pseudogenes**. If your purpose is to search for genes that are expressed, then you shouldn’t count these, even though they might meet some basic search criteria.


    Notes:

**Contamination** of sequence data. For example, human sequences in parasite data.


    Notes:

**Split gene models**. Coding regions for exons of the same gene are often distributed among two or more genomic nucleotide sequences, especially when introns are large. You should avoid counting these as separate genes.


    Notes:

**Redundant sequences**. That is, presence of multiple sequences corresponding to the same genomic locus. This can be due to the presence of splice variants (isoforms) in a database, identification of peptide and nucleotide sequences corresponding to the same gene (*e.g.*, overlapping BLASTP and TBLASTN results), or presence of alleles (sometimes included erroneously in assembly of non-haploid genomes). Also, false segmental duplications due to genome assembly errors can result in apparently paralogous loci which actually just correspond to different alleles for the same gene.

    Notes:

## Potential causes of false-negative results:

**Additional domains** present in identified sequences compared to original query sequences. This can cause sequences to be retrieved in the reverse searches which are not homologous to the original query.

    Notes:

**Presence of a highly conserved domain in the query sequence(s)** that is also found in non-orthologous sequences. For example, WD40 repeat regions.

    Notes:

**High levels of sequence identity between paralogues**. This may make reverse search results uninformative, as orthologous and non-orthologous sequences may be retrieved with similar E-values.

    Notes:

**High levels of sequence divergence** among orthologues. This may also make reverse search results uninformative, as orthologous and non-orthologous sequences may be retrieved with similar E-values.

    Notes:
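
One way to make "similar E-values" concrete is to express the difference between two hits in orders of magnitude. The sketch below is illustrative only: the function name and the `floor` value are assumptions, and it requires the E-value of the second-best reverse hit, which the cells in this notebook do not record. The floor is needed because BLAST reports an E-value of 0.0 for very strong hits, and the logarithm of zero is undefined.

```python
import math


def evalue_gap_orders_of_magnitude(best_evalue, next_evalue, floor=1e-300):
    """Return log10(next_evalue / best_evalue), clamping zeros to `floor`."""
    best = max(best_evalue, floor)
    following = max(next_evalue, floor)
    return math.log10(following) - math.log10(best)


# A large gap: the best hit is clearly separated from the next one.
print(round(evalue_gap_orders_of_magnitude(1e-100, 1e-60), 1))  # 40.0
# A small gap: the two hits are retrieved with similar E-values.
print(round(evalue_gap_orders_of_magnitude(1e-62, 1e-60), 1))  # 2.0
```

How large a gap is large enough is a judgment call that depends on the protein family.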

**Insufficient sensitivity of search methods**. Some homologues can be detected by some methods but not others.

    Notes:

**Sequencing or assembly errors**. This may result in genomic assemblies that are incomplete.

    Notes:

**Gene prediction errors** resulting in failure to predict genes that are in fact present in a genomic assembly.

    Notes:

**Lack of expression** of a gene at the time of transcript collection for transcriptome sequencing.

    Notes:

# Delete search output files (optional).

In [24]:
%%bash
rm *blastp_search_output.txt
rm *blastp_search_output.xml
rm *reverse_query.faa
rm *blastp_reverse_search_output.txt
rm *blastp_reverse_search_output.xml
#rm 0_summary_of_forward_blastp_searches.csv
#rm 0_summary_of_forward_and_reverse_blastp_searches.csv
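
The `rm` commands above print an error when a pattern matches no files (for example, if the cell is run twice). A Python equivalent that simply skips patterns without matches might look like the sketch below. The helper name `remove_matching_files` is hypothetical, the patterns are the ones used in the bash cell, and the demonstration runs in a temporary directory so that it does not touch your real output files.

```python
import glob
import os
import tempfile

PATTERNS = [
    "*blastp_search_output.txt",
    "*blastp_search_output.xml",
    "*reverse_query.faa",
    "*blastp_reverse_search_output.txt",
    "*blastp_reverse_search_output.xml",
]


def remove_matching_files(patterns, directory="."):
    """Delete files in `directory` matching any pattern; return their paths."""
    removed = []
    for pattern in patterns:
        for path in glob.glob(os.path.join(directory, pattern)):
            os.remove(path)
            removed.append(path)
    return removed


# Demonstrate in a scratch directory with two empty placeholder files.
with tempfile.TemporaryDirectory() as scratch:
    open(os.path.join(scratch, "x__blastp_search_output.txt"), "w").close()
    open(os.path.join(scratch, "keep_me.csv"), "w").close()
    remove_matching_files(PATTERNS, scratch)
    print(sorted(os.listdir(scratch)))  # ['keep_me.csv']
```

To clean up the real output files, call `remove_matching_files(PATTERNS)` from the notebook's working directory.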

# Save this notebook as an HTML or PDF document

In the menu bar above, select File > Download as > HTML.