<h1 id="toctitle">Exercise solutions</h1>
<ul id="toc"/>

The first step is to work out how to run the query on the GenBank website. For now, let's assume that we can get the results we want by searching for 

`Nematoda COX1 complete`

We will re-use the code from the example to search. This time, we want the protein database, not the nucleotide one. We will limit the results to ten records for speed:

In [6]:
from Bio import Entrez, SeqIO
Entrez.email = "martin.jones@ed.ac.uk"
mysearch = Entrez.esearch(db="protein", term="Nematoda COX1 complete", retmax="10")
result = Entrez.read(mysearch)
print(result['Count'])
for accession in result['IdList']:
    genbank = Entrez.efetch(db="protein", id=accession, rettype="gb", retmode="text")
    record = SeqIO.read(genbank, "genbank")
    print(record.description)

132
cytochrome c oxidase subunit I (mitochondrion) [Litoditis aff. marina PmIV].
cytochrome c oxidase subunit I (mitochondrion) [Litoditis aff. marina PmII].
cytochrome c oxidase subunit I (mitochondrion) [Litoditis aff. marina PmI].
RecName: Full=SURF1-like protein.
RecName: Full=SURF1-like protein.
cytochrome c oxidase subunit I (mitochondrion) [Oxyuris equi].
cytochrome c oxidase subunit I (mitochondrion) [Pseudoterranova azarasi].
cytochrome c oxidase subunit 1 (mitochondrion) [Oxyuris equi].
cytochrome c oxidase subunit 1 (mitochondrion) [Pseudoterranova azarasi].
cytochrome c oxidase subunit I (mitochondrion) [Meloidogyne enterolobii].


The search matches 132 records in total, of which we fetch only the first ten.
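
Two of the ten hits listed above are SURF1-like proteins rather than COX1, which suggests that the free-text term matches more than we intend. One possible refinement, sketched below, is to attach Entrez field tags to each word. The helper name `build_search_term` is hypothetical, and the choice of the `[Organism]` and `[Gene Name]` fields is an assumption: whether the tagged query returns the records we want should be checked on the GenBank website first.

```python
def build_search_term(taxonomy, gene, extra="complete"):
    """Build an Entrez query that restricts each word to one field.

    Hypothetical helper, not part of Biopython. The field tags used
    here are an assumption about which fields are most useful.
    """
    parts = [f"{taxonomy}[Organism]", f"{gene}[Gene Name]"]
    if extra:
        # any extra word is left as free text
        parts.append(extra)
    return " AND ".join(parts)


print(build_search_term("Nematoda", "COX1"))
# Nematoda[Organism] AND COX1[Gene Name] AND complete
```

The resulting string could be passed as the `term` argument of `Entrez.esearch` in place of the free-text query.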

Next, we want to get the length for each of the first ten. The easiest way to do this is to get the sequence, then use `len()` to get the length:

In [8]:
from Bio import Entrez, SeqIO
Entrez.email = "martin.jones@ed.ac.uk"
mysearch = Entrez.esearch(db="protein", term="Nematoda COX1 complete", retmax="10")
result = Entrez.read(mysearch)
print(result['Count'])
for accession in result['IdList']:
    genbank = Entrez.efetch(db="protein", id=accession, rettype="gb", retmode="text")
    record = SeqIO.read(genbank, "genbank")
    print(len(record.seq))

132
525
525
525
317
323
520
525
520
525
507


Now rather than just printing the lengths, let's add them up and divide by the number of records:

In [11]:
from Bio import Entrez, SeqIO
Entrez.email = "martin.jones@ed.ac.uk"
mysearch = Entrez.esearch(db="protein", term="Nematoda COX1 complete", retmax="10")
result = Entrez.read(mysearch)
print(result['Count'])

total_length = 0

for accession in result['IdList']:
    genbank = Entrez.efetch(db="protein", id=accession, rettype="gb", retmode="text")
    record = SeqIO.read(genbank, "genbank")
    total_length = total_length + len(record.seq)
    
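# divide by the number of records actually fetched (integer division rounds down)
average_length = total_length // len(result['IdList'])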
print(average_length)

132
481


We get our answer: 481 amino acid residues. 
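
The arithmetic can be checked offline using the ten lengths printed earlier. This is only a sketch: the list `lengths` is copied by hand from that output, and the median is included as one possible way to see how much the two shorter sequences pull the mean down.

```python
from statistics import median

# Lengths copied by hand from the output of the earlier cell
lengths = [525, 525, 525, 317, 323, 520, 525, 520, 525, 507]

total_length = sum(lengths)
mean_length = total_length / len(lengths)     # true division
floored_mean = total_length // len(lengths)   # integer division, rounds down

print(total_length)     # 4812
print(mean_length)      # 481.2
print(floored_mean)     # 481
print(median(lengths))  # 522.5
```

The exact mean is 481.2; the figure of 481 is what integer division gives.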

We can now turn this code into a function that takes a taxonomic name and a gene name and returns the average length of the first ten results. 

In [14]:
def get_average_length(taxonomy, gene):
    # figure out what the search term will be
    search_term = taxonomy + " " + gene + " complete"

    from Bio import Entrez, SeqIO
    Entrez.email = "martin.jones@ed.ac.uk"
    mysearch = Entrez.esearch(db="protein", term=search_term, retmax="10")
    result = Entrez.read(mysearch)
    # report how many records matched the search in total
    print(result['Count'])

    total_length = 0

    for accession in result['IdList']:
        # fetch from the same database that we searched
        genbank = Entrez.efetch(db="protein", id=accession, rettype="gb", retmode="text")
        record = SeqIO.read(genbank, "genbank")
        total_length = total_length + len(record.seq)

    # divide by the number of records actually fetched (integer division rounds down)
    average_length = total_length // len(result['IdList'])
    return average_length

Let's try it. Remember that this function will take a while to run as it has to wait to download the records:

In [15]:
get_average_length('Nematoda', 'COX1')

132


481

In [16]:
get_average_length('Arthropoda', 'ATP6')

1515


224
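
Most of the waiting comes from making one `efetch` request per record. A possible variation, sketched here under assumptions, asks for all the IDs in a single request and parses the combined response with `SeqIO.parse`. The function names `fetch_records` and `average_record_length` are hypothetical, the email address in the commented usage is a placeholder, and the averaging step is kept separate so that it can be tried without a network connection.

```python
from types import SimpleNamespace


def fetch_records(search_term, email, retmax=10):
    """Search the protein database and fetch all hits in one request."""
    # Imported here so the averaging helper below works without Biopython
    from Bio import Entrez, SeqIO

    Entrez.email = email
    search = Entrez.esearch(db="protein", term=search_term, retmax=retmax)
    result = Entrez.read(search)
    search.close()
    if not result["IdList"]:
        return []
    # A comma-separated ID string asks for every record at once
    handle = Entrez.efetch(db="protein", id=",".join(result["IdList"]),
                           rettype="gb", retmode="text")
    records = list(SeqIO.parse(handle, "genbank"))
    handle.close()
    return records


def average_record_length(records):
    """Return the mean sequence length, or None if there are no records."""
    lengths = [len(record.seq) for record in records]
    if not lengths:
        return None
    return sum(lengths) / len(lengths)


# Offline check with stand-in objects that only have a .seq attribute
fake_records = [SimpleNamespace(seq="MKVL"), SimpleNamespace(seq="MK")]
print(average_record_length(fake_records))  # 3.0

# With a network connection (the address is a placeholder):
# records = fetch_records("Nematoda COX1 complete", "you@example.org")
# print(average_record_length(records))
```

Returning `None` for an empty result avoids a division by zero when a search matches nothing.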

In [3]:
# ignore this cell, it's for loading custom js code
from IPython.core.display import Javascript
Javascript(filename="custom.js")

<IPython.core.display.Javascript object>

In [2]:
# ignore this cell, it's for loading custom css code
from IPython.core.display import HTML
HTML(filename="custom.css")