![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

# Python for Genomics 
## Section 4: SeqFeatures Exercises

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## 1 - Look into a file and observe the SeqFeatures

Ebola outbreaks periodically emerge in the Democratic Republic of the Congo. We'll take a look at sequences from a recent outbreak that started in 2018 in North Kivu and Ituri.

Again, we will be trying to extract the nucleoprotein (NP) gene.

Here is the GenBank record for one of those sequences:


In [1]:
from IPython.display import IFrame
url = "https://www.ncbi.nlm.nih.gov/nucleotide/MK007329"
IFrame(url, 800, 400)

I have downloaded this record to `data/MK007329.1.gb`.

### Read that file in and create a `set` of the types of features that are annotated in record MK007329.1.

In [2]:
# your code goes here...

In [7]:
# Solution

from Bio import SeqIO

DRC_Ebola = SeqIO.read('data/MK007329.1.gb', 'genbank')
set(feature.type for feature in DRC_Ebola.features)

{'CDS', 'gene', 'mRNA', 'misc_feature', 'regulatory', 'source'}

## 2 - What is the type and name of the feature at index 6 (`features[6]`) in MK007329.1?

In [None]:
# your code goes here...

In [8]:
# Solution

feature = DRC_Ebola.features[6]  # zero-based index 6
# feature.type gives the feature type; the qualifiers give its name
feature.qualifiers

OrderedDict([('gene', ['VP35'])])

## 3 - Extract the sequence for the same feature (`features[6]`) in MK007329.1.

In [None]:
# your code goes here...

In [9]:
feature.extract(DRC_Ebola.seq)

Seq('GATGAAGATTAAAACCTTCATCATCCTTACGTCAATTGAATTCTCTAGCACTCG...AAA', IUPACAmbiguousDNA())

## 4 - Can you make a list of all the gene names contained in MK007329.1? 

In [1]:
# your code goes here...

In [10]:
gene_list = []

for feature in DRC_Ebola.features:
    if feature.type == 'gene':
        gene_list.append(feature.qualifiers['gene'])

gene_list

[['NP'], ['VP35'], ['VP40'], ['GP'], ['VP30'], ['VP24'], ['L']]
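
Each entry above is itself a list, because GenBank qualifier values are stored as lists of strings. If a flat list of names is easier to work with, a nested list comprehension is one way to get it. This is a small sketch that starts from the output shown above rather than from the file; the name `gene_names` is just an illustrative choice.

```python
# Nested list copied from the output of the solution cell above
gene_list = [['NP'], ['VP35'], ['VP40'], ['GP'], ['VP30'], ['VP24'], ['L']]

# Flatten: take every name out of every inner list, keeping the order
gene_names = [name for names in gene_list for name in names]

print(gene_names)  # ['NP', 'VP35', 'VP40', 'GP', 'VP30', 'VP24', 'L']
```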

## 5 - Extract the sequence for the nucleoprotein (NP) and save as a Seq Object.

In [None]:
# your code goes here...

In [11]:
for feature in DRC_Ebola.features:
    # Keep the gene feature whose 'gene' qualifier is exactly ['NP']
    if feature.type == 'gene' and feature.qualifiers['gene'] == ['NP']:
        NP_seq = feature.extract(DRC_Ebola.seq)

NP_seq

Seq('GAGGAAGATTAATAATTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGA...AAA', IUPACAmbiguousDNA())
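
The lookup `feature.qualifiers['gene']` assumes every gene feature carries a `gene` qualifier; a feature without one would raise a `KeyError`. Below is a minimal sketch of a more defensive lookup. It uses a made-up toy record built in memory, so the sequence, the ID `TOY1`, the `locus_tag` value, and the helper name `extract_gene` are all hypothetical and not taken from the Ebola data.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Toy record for illustration only (not real Ebola data)
toy = SeqRecord(
    Seq("AAACCCGGGTTT"),
    id="TOY1",
    features=[
        # A gene feature with no 'gene' qualifier at all
        SeqFeature(FeatureLocation(6, 12), type="gene",
                   qualifiers={"locus_tag": ["X1"]}),
        # A gene feature named NP covering the first six bases
        SeqFeature(FeatureLocation(0, 6), type="gene",
                   qualifiers={"gene": ["NP"]}),
    ],
)


def extract_gene(record, name):
    """Return the sequence of the first gene feature called `name`, or None."""
    for feature in record.features:
        # .get() returns an empty list when the qualifier is missing
        if feature.type == "gene" and name in feature.qualifiers.get("gene", []):
            return feature.extract(record.seq)
    return None


np_seq = extract_gene(toy, "NP")
print(np_seq)  # AAACCC
```

Returning `None` when nothing matches also makes it easy to tell afterwards which records had no annotated gene of that name.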

## 6 - Can you loop through a GenBank file ("Ituri_sequences.gb") containing 61 records and create a list with all the NP sequences? You can return the entire list since it's not too big.

I have saved a file named "Ituri_sequences.gb", which contains GenBank records from the Ituri/North Kivu outbreak. (They are not associated with a BioProject; they are a set of sequences from studies done by the United States Army Medical Research Institute of Infectious Diseases (USAMRIID).)

In [13]:
# Solution

ituri_seqs = SeqIO.parse('data/Ituri_sequences.gb', 'genbank')

NP_list = []

for record in ituri_seqs:
    for feature in record.features:
        if feature.type == 'gene' and feature.qualifiers['gene'] == ['NP']:
            NP_gene = feature.extract(record.seq)
            NP_list.append(NP_gene)

NP_list
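
One detail worth knowing: `SeqIO.parse` returns an iterator, and an iterator can only be walked through once. That is why the later cells call `SeqIO.parse` again instead of reusing `ituri_seqs`. The sketch below shows the behaviour with a plain Python iterator standing in for the parser; the record names are placeholders.

```python
# A plain iterator stands in for the object returned by SeqIO.parse
records = iter(["record_1", "record_2", "record_3"])

first_pass = [rec for rec in records]
second_pass = [rec for rec in records]  # the iterator is already exhausted

print(len(first_pass))   # 3
print(len(second_pass))  # 0

# Storing the items in a list once allows as many passes as needed
records_list = list(iter(["record_1", "record_2", "record_3"]))
print(len(records_list), len(list(records_list)))  # 3 3
```

For a file of 61 records, keeping everything in a list is cheap; for very large files, re-parsing is usually the gentler option on memory.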

## 7 - Can you count the number of records in the file and compare it to the number of NP seq objects in your list?

In [22]:
# your code goes here...

The numbers don't have to match 😉

In [21]:
count = 0

ituri = SeqIO.parse('data/Ituri_sequences.gb', 'genbank')
for rec in ituri:
    count += 1

print("There are %i records in the Ituri_sequences.gb file" % count)
print("Our script created %i Seq Objects in NP_list" % len(NP_list))

There are 61 records in the Ituri_sequences.gb file
Our script created 59 Seq Objects in NP_list


Uh oh! You didn't do anything wrong. This could mean a few things... 

- Perhaps a few records contain an NP gene that was not annotated.
- Perhaps there was a sequencing error.
- Maybe they were only partially sequenced.

It's just good to know moving forward that you don't have all 61 sequences. 

Of course, now that you know the basics of these objects, you could loop through the entire file and return the records that do not contain an annotated NP gene.

Would you like to try? Give it a go below 👇🏼 

In [None]:
# Extra credit code goes here

In [30]:
# There are plenty of ways to do this, but I found it easiest to think of two lists:
# one with the IDs of all the records, and one with the IDs of records that have an NP gene.
# A list comprehension then compares the lists and returns the difference.

all_records_list = []
yes_NP_list = []

ituri_seqs = SeqIO.parse('data/Ituri_sequences.gb', 'genbank')

for record in ituri_seqs:
    all_records_list.append(record.id)
    for feature in record.features:
        if feature.type == 'gene' and feature.qualifiers['gene'] == ['NP']:
            yes_NP_list.append(record.id)

no_NP_list = [record_id for record_id in all_records_list if record_id not in yes_NP_list]
print(no_NP_list)

['MK731991.1', 'MK731984.1']
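
The same comparison can also be written with a set, which makes the membership test fast and the intent a little more explicit. This is only a sketch on stand-in data: the two IDs without NP come from the output above, while the `EXAMPLE_*` IDs are hypothetical placeholders for records that do have an NP gene.

```python
# Stand-in record IDs; the "EXAMPLE_*" entries are hypothetical placeholders
all_ids = ["EXAMPLE_1.1", "MK731991.1", "EXAMPLE_2.1", "MK731984.1"]
ids_with_np = {"EXAMPLE_1.1", "EXAMPLE_2.1"}  # a set gives fast membership tests

# Filter against the set while keeping the original file order
ids_without_np = [rec_id for rec_id in all_ids if rec_id not in ids_with_np]
print(ids_without_np)  # ['MK731991.1', 'MK731984.1']

# If order does not matter, a set difference gives the same IDs
unordered = set(all_ids) - ids_with_np
```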


If you would like to see the GenBank record for one of these and why it doesn't contain the NP gene, the link is below.
Notice that it is a partial genome. It slipped through the cracks because it was within the range I specified for a full genome, yet it is classified as "partial".

In [31]:
from IPython.display import IFrame
url = "https://www.ncbi.nlm.nih.gov/nuccore/MK731991.1"
IFrame(url, 800, 400)