![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

# Python for Genomics 
## Section 3: Reading Files and SeqRecord Objects Exercises


<b> All of the files mentioned in these exercises are located in the '03 SeqRecord Obj/data' folder.</b>

<br>

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### 1. Access the fasta sequence in file 'KR105316.1.fasta'.

There is only one sequence in this file. Read the file and save the resulting SeqRecord to a variable of your choice.

(Remember to import the `SeqIO` module from the `Bio` package.)

In [4]:
# Solution

from Bio import SeqIO

ebola_fasta_seq = SeqIO.read('data/KR105316.1.fasta', 'fasta')
ebola_fasta_seq

SeqRecord(seq=Seq('GTGTGTGAATAACTATGAGGAAGATTAATAATTTTCCTCTCATTGAAATTTATA...TGT', SingleLetterAlphabet()), id='KR105316.1', name='KR105316.1', description='KR105316.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G5684.1, partial genome', dbxrefs=[])

### 2. File 'KR105316.1.gb' contains the same sequence as above, but in genbank format. Read that file into a new variable.

What are the differences between the resulting SeqRecord objects when you read in a fasta file vs. a genbank file?


In [4]:
# your code goes here...

In [6]:
# Solution

ebola_gb_seq = SeqIO.read('data/KR105316.1.gb', 'genbank')
ebola_gb_seq

SeqRecord(seq=Seq('GTGTGTGAATAACTATGAGGAAGATTAATAATTTTCCTCTCATTGAAATTTATA...TGT', IUPACAmbiguousDNA()), id='KR105316.1', name='KR105316', description='Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G5684.1, partial genome', dbxrefs=['BioProject:PRJNA257197', 'BioSample:SAMN03254195'])

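The difference is that the genbank file contains more metadata, so the resulting SeqRecord object has more of its attributes populated. For example, the record above has a non-empty 'dbxrefs' (database cross-references) list holding the BioProject and BioSample accession numbers, whereas the 'dbxrefs' list of the fasta record is empty. The 'name' and 'description' fields differ as well: the fasta record repeats the accession at the start of its description, while the genbank record does not.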
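One way to list such differences programmatically is to compare a handful of attributes side by side. The sketch below is only an illustration: `diff_records` is a hypothetical helper, not a Biopython function, and the two `SimpleNamespace` objects are stand-ins that copy the `id`, `name`, `description` and `dbxrefs` values from the outputs shown above so the example runs without the data files.

```python
from types import SimpleNamespace


def diff_records(first, second, fields=("id", "name", "description", "dbxrefs")):
    """Return {field: (first_value, second_value)} for every field that differs.

    Hypothetical helper for illustration; works on any objects with these attributes.
    """
    differences = {}
    for field in fields:
        a = getattr(first, field, None)
        b = getattr(second, field, None)
        if a != b:
            differences[field] = (a, b)
    return differences


# Stand-ins for the two SeqRecords (values copied from the outputs above).
fasta_like = SimpleNamespace(
    id="KR105316.1",
    name="KR105316.1",
    description="KR105316.1 Zaire ebolavirus isolate Ebola virus/"
    "H.sapiens-wt/SLE/2014/Makona-G5684.1, partial genome",
    dbxrefs=[],
)
genbank_like = SimpleNamespace(
    id="KR105316.1",
    name="KR105316",
    description="Zaire ebolavirus isolate Ebola virus/"
    "H.sapiens-wt/SLE/2014/Makona-G5684.1, partial genome",
    dbxrefs=["BioProject:PRJNA257197", "BioSample:SAMN03254195"],
)

diffs = diff_records(fasta_like, genbank_like)
for field, (fasta_value, genbank_value) in diffs.items():
    print(f"{field}:\n  fasta:   {fasta_value!r}\n  genbank: {genbank_value!r}")
```

With the real records, the same helper could be called as `diff_records(ebola_fasta_seq, ebola_gb_seq)`. Adding `"annotations"` and `"features"` to `fields` would also compare those two SeqRecord attributes.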

### 3. 'PRJNA278759.gb' is a file containing the genbank records of BioProject PRJNA278759.

Open PRJNA278759.gb using `SeqIO.parse()` and save the returned iterator to a variable of your choice.

Can you use the iterator to count the number of records in this file?

In [None]:
# your code goes here...

In [14]:
# Solution

ebola_seqs = SeqIO.parse('data/PRJNA278759.gb', 'genbank')


count = 0
for record in ebola_seqs:
    count += 1
print('There are %s records in PRJNA278759.gb' % (count))


There are 166 records in PRJNA278759.gb
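Note that the iterator returned by `SeqIO.parse()` can be consumed only once, so after the counting loop above `ebola_seqs` has nothing left to yield. The sketch below illustrates this behaviour with a plain Python generator. `fake_parse` is a hypothetical stand-in for `SeqIO.parse()`, used here so the example runs without Biopython or the data file.

```python
def fake_parse(n_records):
    """Hypothetical stand-in for SeqIO.parse(): lazily yields placeholder records."""
    for i in range(n_records):
        yield f"record_{i}"


records = fake_parse(3)

# First pass: counting consumes every item of the iterator.
first_count = sum(1 for _ in records)

# Second pass over the same iterator: nothing is left to count.
second_count = sum(1 for _ in records)

print(first_count, second_count)  # prints: 3 0
```

To go through the records again, call `SeqIO.parse()` a second time, or wrap the iterator in `list()` if the file is small enough to hold in memory.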


### 4. With the same 'PRJNA278759.gb' file, index its contents and retrieve the record with the accession 'KT725386.1'. Can you provide the reverse complement of this record as a Seq object?


In [None]:
# your code goes here...

In [45]:
# Solution

ebola_index = SeqIO.index('data/PRJNA278759.gb', 'genbank')
ebola_index['KT725386.1'].reverse_complement().seq

Seq('CCTTCTGATGAGCGTGGTNAATGCCTTAATTATCATTAACACGAAGATTATTAT...TTA', IUPACAmbiguousDNA())
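To make clear what `reverse_complement()` computes, here is a minimal pure-Python sketch of the same operation on a plain string. The names `COMPLEMENT` and `reverse_complement_str` are illustrative, not part of Biopython, and this version only handles A, C, G, T and N, whereas Biopython also handles the other IUPAC ambiguity codes.

```python
# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_str(dna):
    """Complement every base, then reverse the resulting string."""
    return dna.translate(COMPLEMENT)[::-1]


example = "ATGCN"
result = reverse_complement_str(example)
print(result)  # prints: NGCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check for any reverse-complement implementation.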

### Great job!