![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

![filo_virion](https://user-images.githubusercontent.com/22747792/73687685-7111bc00-467f-11ea-906e-e16132529840.png)

# Python for Genomics 
## Section 5: Extracting Sequences 

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)




![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Now let's start pulling in sequences using everything we know about Seq, SeqRecord, and SeqFeature objects!


The Gire, et al. paper states:

"One notable intrahost variation is the RNA editing site of the glycoprotein (GP) gene ..."

So let's look into the glycoprotein (GP in yellow, below) gene. 

![filo_genome](https://user-images.githubusercontent.com/22747792/73678324-02c3fe00-466d-11ea-90f9-73ea6e741877.png)

---

Let's start with a single GenBank file, KM034562.

We are looking for a feature that is a gene called GP.

First we need to figure out what features we have in this record. I find it easiest to view the feature types and go from there.

What kinds of feature 'types' are available for KM034562?

In [4]:
from Bio import SeqIO

ebola_gb = SeqIO.read('data/KM034562.gb', 'genbank')
ebola_gb

SeqRecord(seq=Seq('CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTA...GTC', IUPACAmbiguousDNA()), id='KM034562.1', name='KM034562', description='Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome', dbxrefs=['BioProject:PRJNA257197', 'BioSample:SAMN02951978'])

In [5]:
list(feature.type for feature in ebola_gb.features)

['source',
 'gene',
 'mRNA',
 'regulatory',
 'CDS',
 'regulatory',
 'gene',
 'mRNA',
 'regulatory',
 'CDS',
 'gene',
 'mRNA',
 'regulatory',
 'regulatory',
 'CDS',
 'regulatory',
 'gene',
 'regulatory',
 'mRNA',
 'CDS',
 'mRNA',
 'CDS',
 'mRNA',
 'CDS',
 'misc_feature',
 'misc_feature',
 'gene',
 'mRNA',
 'regulatory',
 'regulatory',
 'CDS',
 'regulatory',
 'gene',
 'mRNA',
 'regulatory',
 'CDS',
 'regulatory',
 'gene',
 'mRNA',
 'regulatory',
 'regulatory',
 'CDS',
 'regulatory']
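With this many features, a tally is easier to scan than the raw list. The sketch below counts feature types with `collections.Counter`. So that it runs without the data file, it uses `SimpleNamespace` stand-ins (hypothetical, not Biopython objects) that carry only a `.type` attribute, and `count_feature_types` is an illustrative helper name. With the real record you would pass `ebola_gb.features` instead.

```python
from collections import Counter
from types import SimpleNamespace


def count_feature_types(features):
    """Tally how many features of each type a record holds."""
    return Counter(feature.type for feature in features)


# Hypothetical stand-ins for SeqFeature objects; only `.type` is needed here.
toy_features = [
    SimpleNamespace(type=t)
    for t in ["source", "gene", "mRNA", "CDS", "gene", "mRNA", "CDS"]
]

type_counts = count_feature_types(toy_features)
for feature_type, n in type_counts.most_common():
    print(feature_type, n)
```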

Let's place all of our genes in a list so we can find our GP sequence. (Remember, each one will be a SeqFeature object.)

## The process follows this logic:

1. Read in file containing genbank record
2. Within that record, loop through each feature
3. Is the feature of type 'gene'?
4. If so, please add to our list
5. Next feature...

In [7]:
gene_list = []

ebola_gb = SeqIO.read('data/KM034562.gb', 'genbank')

for feature in ebola_gb.features:
    if feature.type == 'gene':
        gene_list.append(feature)
    
        
print("There are %s genes in %s" % (str(len(gene_list)), ebola_gb.description))

There are 7 genes in Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome


Awesome, looks like we've found our genes, but can we get the names of the genes? 

In [8]:
gene_list

[SeqFeature(FeatureLocation(ExactPosition(55), ExactPosition(3026), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(3031), ExactPosition(4407), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(4389), ExactPosition(5894), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(5899), ExactPosition(8305), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(8287), ExactPosition(9740), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(9884), ExactPosition(11518), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(11500), ExactPosition(18282), strand=1), type='gene')]

🤔 Looks like the names of the genes are hidden within the features.

Let's open up one of the features and see if there are any attributes that specify the gene names.

In [9]:
dir(gene_list[0])

['__bool__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_flip',
 '_get_location_operator',
 '_get_ref',
 '_get_ref_db',
 '_get_strand',
 '_set_location_operator',
 '_set_ref',
 '_set_ref_db',
 '_set_strand',
 '_shift',
 'extract',
 'id',
 'location',
 'location_operator',
 'qualifiers',
 'ref',
 'ref_db',
 'strand',
 'translate',
 'type']

The qualifiers are where the gene names hang out. So let's look at the `qualifiers` attribute for the first feature in gene_list:

In [10]:
# we see that the names are contained within a dictionary, and the gene name is contained in a list

gene_list[0].qualifiers

OrderedDict([('gene', ['NP'])])

In [11]:
# accessing the names can be done by using the keys, in this case 'gene'

for gene in gene_list:
    print(gene.qualifiers['gene'])

['NP']
['VP35']
['VP40']
['GP']
['VP30']
['VP24']
['L']
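Each qualifier value is a list, which is why the loop prints `['NP']` rather than `NP`. Here is a minimal sketch of pulling out the bare name while tolerating features that have no `gene` key. The helper name `gene_name` and the plain dicts standing in for `feature.qualifiers` are assumptions for illustration.

```python
def gene_name(qualifiers, default=None):
    """Return the first value stored under 'gene', or `default` if absent."""
    values = qualifiers.get("gene")
    return values[0] if values else default


# Plain dicts standing in for `feature.qualifiers` (hypothetical data).
toy_qualifiers = [{"gene": ["NP"]}, {"gene": ["GP"]}, {"note": ["no gene key"]}]

names = [gene_name(q, default="unnamed") for q in toy_qualifiers]
print(names)
```

With the real features, the call would look like `gene_name(gene.qualifiers)` inside the loop over `gene_list`.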


Ok, we're getting deep into these SeqFeature objects, but we're getting closer to obtaining our sequence.

We have a list of features that are of the type 'gene', and from those we select the one that is named 'GP'.

We do this by looking into the qualifiers for each SeqFeature object.

In [13]:
for gene in gene_list:
    if gene.qualifiers['gene'] == ['GP']:
        print(gene)

type: gene
location: [5899:8305](+)
qualifiers:
    Key: gene, Value: ['GP']



In [14]:
# Remember to use the parent sequence for the extract method.
# Use the `.seq` attribute to retrieve the Seq object within the ebola_gb SeqRecord.

for gene in gene_list:
    if gene.qualifiers['gene'] == ['GP']:
        GP_sequence = gene.extract(ebola_gb.seq)

GP_sequence

Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA())

Et voilà! We extracted the sequence of that feature by calling the `.extract()` method and passing the parent sequence as the argument.
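The location printed earlier, `[5899:8305](+)`, uses Python-style coordinates: zero-based, with the end excluded. For a simple forward-strand feature, extraction therefore amounts to slicing the parent sequence, and for a reverse-strand feature the slice is reverse-complemented. The toy below illustrates that idea with plain strings. `extract_region` is a hypothetical helper, not part of Biopython, and it ignores compound locations and ambiguity codes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def extract_region(parent, start, end, strand=1):
    """Slice parent[start:end]; reverse-complement it when strand is -1."""
    region = parent[start:end]
    if strand == -1:
        region = region.translate(COMPLEMENT)[::-1]
    return region


parent = "AACCGGTTAC"  # made-up parent sequence
forward = extract_region(parent, 2, 6)       # "CCGG"
reverse = extract_region(parent, 0, 4, -1)   # reverse complement of "AACC"
print(forward, reverse)

# The GP gene location [5899:8305] spans end - start bases.
gp_length = 8305 - 5899
print(gp_length)
```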

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## For multiple records we use the same methodology, just with a loop

So we have our BioProject PRJNA257197:

In [16]:
from IPython.display import IFrame
url = "https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA257197"
IFrame(url, 800, 400)

## The process follows this logic:

1. Read in large file containing multiple genbank records
2. Loop through each record
3. Within that record, loop through each feature
4. Is the feature of type 'gene' and with a qualifier of name 'GP'?
5. Extract that sequence, and place it in our list.
6. Next record...and so on.

In [17]:
bioproject = SeqIO.parse('data/PRJNA257197.gb', 'genbank')
GP_list = []

for record in bioproject:
    for feature in record.features:
        if feature.type == 'gene' and feature.qualifiers['gene'] == ['GP']:
            GP_gene = feature.extract(record.seq)
            GP_list.append(GP_gene)

GP_list[0:4]

[Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA()),
 Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA()),
 Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA()),
 Seq('GATGAAGATTAAGCCGACAGTGAGCGTAATCTTCATCTCTCTTAGATTATTTGT...AAA', IUPACAmbiguousDNA())]

We now have a list of all our GP genes as Seq objects. 
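A plain list drops the link between each sequence and the record it came from. One option, sketched here under assumptions, is to key the extracted sequences by `record.id`. The function `collect_gene_by_record` and the stand-in objects are hypothetical. The stand-ins mimic only the attributes used in the loop above (`.id`, `.seq`, `.features`, `.type`, `.qualifiers`, `.extract`), so the snippet runs without the data file.

```python
from types import SimpleNamespace


def collect_gene_by_record(records, gene_name):
    """Map record id -> extracted sequence of the named gene feature."""
    sequences = {}
    for record in records:
        for feature in record.features:
            if feature.type == "gene" and feature.qualifiers.get("gene") == [gene_name]:
                sequences[record.id] = feature.extract(record.seq)
    return sequences


def toy_feature(feature_type, name, start, end):
    """Stand-in for a SeqFeature; its extract() just slices the parent."""
    return SimpleNamespace(
        type=feature_type,
        qualifiers={"gene": [name]},
        extract=lambda parent: parent[start:end],
    )


# Made-up records and sequences, for illustration only.
toy_records = [
    SimpleNamespace(id="toy_1", seq="AAACCCGGGTTT",
                    features=[toy_feature("gene", "NP", 0, 3),
                              toy_feature("gene", "GP", 3, 9)]),
    SimpleNamespace(id="toy_2", seq="TTTGGGCCCAAA",
                    features=[toy_feature("gene", "GP", 3, 9),
                              toy_feature("CDS", "GP", 4, 8)]),
]

gp_by_record = collect_gene_by_record(toy_records, "GP")
print(gp_by_record)
```

With the real data, the first argument would be `SeqIO.parse('data/PRJNA257197.gb', 'genbank')`.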

Let's do a lazy validation and compare counts between our list and our initial file.

In [18]:
count = 0

bioproject = SeqIO.parse('data/PRJNA257197.gb', 'genbank')
for rec in bioproject:
    count += 1

print("There are %i records in the PRJNA257197.gb file" % count)
print("Our script created %i Seq Objects in GP_list" % len(GP_list))

There are 249 records in the PRJNA257197.gb file
Our script created 249 Seq Objects in GP_list
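Matching totals are reassuring, but they would also match if one record contributed two GP features and another contributed none. A slightly stricter check, sketched below, counts GP gene features per record and reports any record whose count is not exactly one. `gp_counts_per_record` and the stand-in records are hypothetical and carry only the attributes the check reads.

```python
from types import SimpleNamespace


def gp_counts_per_record(records, gene_name="GP"):
    """Return {record id: number of 'gene' features named gene_name}."""
    counts = {}
    for record in records:
        counts[record.id] = sum(
            1
            for feature in record.features
            if feature.type == "gene"
            and feature.qualifiers.get("gene") == [gene_name]
        )
    return counts


def toy_gene(name):
    """Stand-in for a 'gene' SeqFeature with a single gene qualifier."""
    return SimpleNamespace(type="gene", qualifiers={"gene": [name]})


# Made-up records: one normal, one with a duplicate GP, one with no GP.
toy_records = [
    SimpleNamespace(id="toy_1", features=[toy_gene("NP"), toy_gene("GP")]),
    SimpleNamespace(id="toy_2", features=[toy_gene("GP"), toy_gene("GP")]),
    SimpleNamespace(id="toy_3", features=[toy_gene("NP")]),
]

counts = gp_counts_per_record(toy_records)
problems = {rec_id: n for rec_id, n in counts.items() if n != 1}
print(sum(counts.values()), len(toy_records))  # the totals agree
print(problems)                                # yet two records are off
```

An empty `problems` dict on the real file would confirm that every record contributed exactly one GP gene.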
