![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

# SeqRecord exercises



![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### 1 - Let's read in some sequences using SeqIO.

We have a FASTA file named `KR105316.1.fasta` in the `Data/` folder that contains a single sequence. Use `SeqIO.read()` to read it in.
(Remember to import the `SeqIO` module from the `Bio` package.)

In [None]:
# your code goes here...

In [3]:
# Solution

from Bio import SeqIO

ebola_fasta_seq = SeqIO.read('Data/KR105316.1.fasta', 'fasta')
ebola_fasta_seq

SeqRecord(seq=Seq('GTGTGTGAATAACTATGAGGAAGATTAATAATTTTCCTCTCATTGAAATTTATA...TGT', SingleLetterAlphabet()), id='KR105316.1', name='KR105316.1', description='KR105316.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G5684.1, partial genome', dbxrefs=[])
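To see what `SeqIO.read()` does without the exercise file, here is a minimal sketch that parses a made-up single-record FASTA string held in memory. The IDs `toy1` and `toy2` and their sequences are placeholders, not data from this exercise. It also shows that `SeqIO.read()` raises a `ValueError` when the input holds more than one record.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical single-record FASTA text; not part of the exercise data.
single_fasta = ">toy1 made-up example record\nGTGTGTGAATAACT\n"

record = SeqIO.read(StringIO(single_fasta), "fasta")
print(record.id)           # first word of the header line
print(record.description)  # the header line without the leading '>'
print(len(record.seq))

# SeqIO.read() expects exactly one record and raises ValueError otherwise.
two_fasta = single_fasta + ">toy2 second record\nACGT\n"
try:
    SeqIO.read(StringIO(two_fasta), "fasta")
    read_failed = False
except ValueError:
    read_failed = True
print(read_failed)
```

For inputs with several records, `SeqIO.parse()` is the appropriate call.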

We also have a GenBank file for the same accession: `KR105316.1.gb`. Read that file in. Note any differences between the resulting SeqRecord objects when you read in a FASTA file vs. a GenBank file.


In [4]:
# your code goes here...

In [7]:
# Solution

ebola_gb_seq = SeqIO.read('Data/KR105316.1.gb', 'genbank')
ebola_gb_seq

SeqRecord(seq=Seq('GTGTGTGAATAACTATGAGGAAGATTAATAATTTTCCTCTCATTGAAATTTATA...TGT', IUPACAmbiguousDNA()), id='KR105316.1', name='KR105316', description='Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G5684.1, partial genome', dbxrefs=['BioProject:PRJNA257197', 'BioSample:SAMN03254195'])

Now we have files that contain many sequences, and we'd like to investigate a bit. 

Read in the large file `PRJNA278759.gb` using `SeqIO.parse()` and save the returned iterator to a variable of your choice.

Can you use the iterator to count how many records there are in this file?

In [None]:
# your code goes here...

In [14]:
# Solution

ebola_seqs = SeqIO.parse('Data/PRJNA278759.gb', 'genbank')

# Count the records by consuming the iterator once
count = 0
for record in ebola_seqs:
    count += 1
print(f'There are {count} records in PRJNA278759.gb')

There are 166 records in PRJNA278759.gb
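One point worth keeping in mind is that the object returned by `SeqIO.parse()` can only be walked through once. The sketch below uses a made-up three-record FASTA string (the record names `a`, `b` and `c` are placeholders) to count the records twice with the same iterator.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical three-record FASTA text standing in for a multi-record file.
multi_fasta = ">a\nACGT\n>b\nGGCC\n>c\nTTAA\n"

records = SeqIO.parse(StringIO(multi_fasta), "fasta")

first_count = sum(1 for _ in records)   # consumes the iterator
second_count = sum(1 for _ in records)  # nothing is left to iterate over

print(first_count, second_count)
```

If the records are needed again after counting, call `SeqIO.parse()` a second time or store them with `list(...)`, at the cost of holding everything in memory.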


With the same file, we'd like to be able to index its contents for easy retrieval.

Read in the same file, `PRJNA278759.gb`, using `SeqIO.index()` and retrieve the record with the accession number KT725386.1.

In [None]:
# your code goes here...

In [5]:
# Solution

ebola_index = SeqIO.index('Data/PRJNA278759.gb', 'genbank')
ebola_index['KT725386.1']

SeqRecord(seq=Seq('TAAATTGAAATTGTTACTGTAATCATACCTGGTTTGTTTCAGAGCCATATCACC...AGG', IUPACAmbiguousDNA()), id='KT725386.1', name='KT725386', description='Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/LBR/2014/Makona-LIBR10033, partial genome', dbxrefs=['BioProject:PRJNA278759', 'BioSample:SAMN04040242'])
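The dictionary-like behaviour of `SeqIO.index()` can be tried on a small scale as well. This is a sketch under the assumption of a throwaway FASTA file written to a temporary directory. The IDs `rec1.1` and `rec2.1` are invented for illustration. A real file is needed because `SeqIO.index()` takes a filename rather than an open handle.

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical records, written to disk because SeqIO.index() needs a filename.
fasta_text = (
    ">rec1.1 first toy record\nACGTACGT\n"
    ">rec2.1 second toy record\nGGCCGGCC\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.fasta")
    with open(path, "w") as handle:
        handle.write(fasta_text)

    toy_index = SeqIO.index(path, "fasta")
    ids = sorted(toy_index)            # keys are the record IDs
    found = "rec2.1" in toy_index      # membership test, like a dict
    sequence = str(toy_index["rec2.1"].seq)
    toy_index.close()  # release the file before the directory is removed

print(ids, found, sequence)
```

Records are parsed only when they are looked up, which is why an index suits large files better than loading every record into a list.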

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### 2 - Let's investigate some SeqRecord attributes


You now have an index of our BioProject PRJNA278759. Find the record with accession number KR013754.3 and assign the SeqRecord to a new variable.


In [None]:
# your code goes here...

In [12]:
# Solution

KR013754 = ebola_index['KR013754.3']
KR013754

SeqRecord(seq=Seq('TTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC...ATA', IUPACAmbiguousDNA()), id='KR013754.3', name='KR013754', description='Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3864.1, partial genome', dbxrefs=['BioProject:PRJNA278759', 'BioSample:SAMN03447201'])

Take a look at the `description` and `annotations` attributes of this record, as well as its `format()` method.

In [8]:
# your code goes here...

In [11]:
# Solution

print(KR013754.description)
print(KR013754.annotations)


Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3864.1, partial genome
{'molecule_type': 'cRNA', 'topology': 'linear', 'data_file_division': 'VRL', 'date': '11-APR-2016', 'accessions': ['KR013754'], 'sequence_version': 3, 'keywords': [''], 'source': 'Zaire ebolavirus', 'organism': 'Zaire ebolavirus', 'taxonomy': ['Viruses', 'ssRNA viruses', 'ssRNA negative-strand viruses', 'Mononegavirales', 'Filoviridae', 'Ebolavirus'], 'references': [Reference(title='Real-Time Monitoring of Ebolavirus Genomic Drift in Liberia', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)], 'comment': 'On Apr 11, 2016 this sequence version replaced KR013754.2.', 'structured_comment': OrderedDict([('Assembly-Data', OrderedDict([('Assembly Method', 'Bowtie2 v. 2.1.0'), ('Genome Coverage', '300x'), ('Sequencing Technology', 'Illumina MiSeq')]))])}
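Since `annotations` is an ordinary dictionary, individual entries can be pulled out by key. The following sketch builds a toy `SeqRecord` by hand (the ID `TOY0001.1`, the description and the annotation values are all made up) and also calls `format()`, which is a method that returns the record as a string in the requested file format.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hand-built toy record; every value here is invented for illustration.
toy = SeqRecord(
    Seq("ACGTACGT"),
    id="TOY0001.1",
    description="toy record for illustration",
    annotations={"organism": "Example virus", "topology": "linear"},
)

organism = toy.annotations["organism"]
# .get() returns a default instead of raising KeyError for a missing key
comment = toy.annotations.get("comment", "no comment")

fasta_text = toy.format("fasta")  # the record rendered as FASTA text

print(organism)
print(comment)
print(fasta_text)
```

On the exercise record, the same pattern would apply to keys visible in the output above, such as `'organism'` or `'taxonomy'`.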


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### 3 - Let's investigate some SeqFeature attributes

![178-Ebola_Virus_Proteins-EbolaProteins2018](https://user-images.githubusercontent.com/22747792/73787722-b5b95800-4750-11ea-85cf-d94489fa5ce9.jpg)

The glycoprotein (GP) was inferred to contain many nonsynonymous mutations.

Within the KR013754 record, identify and extract all the gene names.

In [13]:
KR013754.features

[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(18905), strand=1), type='source'),
 SeqFeature(FeatureLocation(ExactPosition(29), ExactPosition(3000), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(29), ExactPosition(3000), strand=1), type='mRNA'),
 SeqFeature(FeatureLocation(ExactPosition(443), ExactPosition(2663), strand=1), type='CDS'),
 SeqFeature(FeatureLocation(ExactPosition(2988), ExactPosition(3000), strand=1), type='regulatory'),
 SeqFeature(FeatureLocation(ExactPosition(3005), ExactPosition(4381), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(3005), ExactPosition(4381), strand=1), type='mRNA'),
 SeqFeature(FeatureLocation(ExactPosition(3005), ExactPosition(3017), strand=1), type='regulatory'),
 SeqFeature(FeatureLocation(ExactPosition(3102), ExactPosition(4125), strand=1), type='CDS'),
 SeqFeature(FeatureLocation(ExactPosition(4363), ExactPosition(5868), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(436

In [15]:
genes = []

for feature in KR013754.features:
    if feature.type == "gene":
        genes.append(feature)
genes

[SeqFeature(FeatureLocation(ExactPosition(29), ExactPosition(3000), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(3005), ExactPosition(4381), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(4363), ExactPosition(5868), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(5873), ExactPosition(8279), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(8261), ExactPosition(9714), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(9858), ExactPosition(11492), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(11474), ExactPosition(18256), strand=1), type='gene')]

In [17]:
dir(genes[0])

['__bool__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_flip',
 '_get_location_operator',
 '_get_ref',
 '_get_ref_db',
 '_get_strand',
 '_set_location_operator',
 '_set_ref',
 '_set_ref_db',
 '_set_strand',
 '_shift',
 'extract',
 'id',
 'location',
 'location_operator',
 'qualifiers',
 'ref',
 'ref_db',
 'strand',
 'translate',
 'type']

In [18]:
for feature in genes:
    print(feature.qualifiers)

OrderedDict([('gene', ['NP'])])
OrderedDict([('gene', ['VP35'])])
OrderedDict([('gene', ['VP40'])])
OrderedDict([('gene', ['GP'])])
OrderedDict([('gene', ['VP30'])])
OrderedDict([('gene', ['VP24']), ('note', ['putative'])])
OrderedDict([('gene', ['L'])])


In [26]:
for feature in genes:
    print(feature.qualifiers['gene'])

['NP']
['VP35']
['VP40']
['GP']
['VP30']
['VP24']
['L']
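Each `qualifiers['gene']` value is a list, so the names print with brackets. One possible way to get a flat list of names, and the sequence under each gene, is sketched below on a hand-built toy record. The ID `TOY0002.1`, the gene names `geneA` and `geneB`, and the coordinates are hypothetical.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Toy record with one 'source' feature and two 'gene' features (all made up).
toy = SeqRecord(
    Seq("ATGAAATTTGGGCCCTAA"),
    id="TOY0002.1",
    features=[
        SeqFeature(FeatureLocation(0, 18), type="source"),
        SeqFeature(FeatureLocation(0, 9, strand=1), type="gene",
                   qualifiers={"gene": ["geneA"]}),
        SeqFeature(FeatureLocation(9, 18, strand=1), type="gene",
                   qualifiers={"gene": ["geneB"]}),
    ],
)

# Flatten the per-feature name lists into a single list of strings.
gene_names = [
    name
    for feature in toy.features
    if feature.type == "gene"
    for name in feature.qualifiers.get("gene", [])
]

# Map each gene name to the subsequence its location covers.
gene_seqs = {
    feature.qualifiers["gene"][0]: str(feature.extract(toy.seq))
    for feature in toy.features
    if feature.type == "gene"
}

print(gene_names)
print(gene_seqs)
```

Using `.get("gene", [])` skips any gene feature that lacks a `gene` qualifier instead of raising a `KeyError`.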
