![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

# Python for Genomics 
## Section 3: Reading Sequence Files and SeqRecord Objects

Most of the time, we will be dealing with sequences that are saved as files. 

There are plenty of file formats that are compatible with BioPython. Take a look here:

[BioPython file formats](https://biopython.org/DIST/docs/api/Bio.SeqIO-module.html)

The file formats we use will largely depend on the type of analysis we want to perform. 


Here, we are going to work with .fasta files, a common sequence file format.


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## What is a fasta file?
 

In [74]:
! head data/KM034562.fasta

>KM034562.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCATACCTGGTTTGTTT
CAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAGTGTGCTCAGT
TGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAACTCTGCAGGGTGATCC
AACAACCTTAATAGCAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTGAGAA
TTAACTTTGATTTTGAACCTGAACACCCAGAGGACTGGAGACTCAACAACCCTAAAGCCTGGGGTAAAAC
ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA
GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGTCTGTCCG
TTCAACAGGGGATTGTTCGGCAAAGAGTCATCCCAGTGTATCAAGTAAACAATCTTGAGGAAATTTGCCA


#### 👉🏼 A fasta file contains:

* one or more nucleotide or amino acid sequences, represented by their single-letter codes;
* for each sequence, a header line beginning with ‘>’ followed by an ID and description, then one or more lines containing the sequence itself (often wrapped at a fixed width, as in the output above); and
* a file extension of .fasta, .fa, .fna, .ffn, .faa (amino acids), or .frn.
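
To make this layout concrete, here is a minimal sketch of a FASTA reader in plain Python. The `read_fasta` helper and the two toy records are made up for illustration and are not part of BioPython. Note how a sequence wrapped across several lines, as in the `head` output above, is joined back into one string.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle (hypothetical helper)."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence lines are joined later
    if header is not None:
        yield header, "".join(chunks)

# toy data: seq1 is wrapped over two lines, seq2 fits on one
toy_fasta = StringIO(">seq1 first toy record\nACGT\nTTGA\n>seq2 second toy record\nGGCC\n")
records = list(read_fasta(toy_fasta))
print(records)
```

In practice we will let BioPython do this parsing for us, since it also handles the metadata and many other formats.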


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## How do we access sequence data from external files?

There are 3 ways to read in external sequence files.

1. SeqIO.read()
2. SeqIO.parse()
3. SeqIO.index()

Notice that these are all functions in BioPython's `SeqIO` module.

When you read sequence files in with SeqIO, each sequence is converted to a <b>SeqRecord object</b>. 

#### 👉🏼 SeqRecord objects contain a Seq object + metadata on the sequence. 
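
To see what "a Seq object + metadata" looks like, here is a small sketch that builds a SeqRecord by hand. The sequence, ID, name, and description are invented for illustration; normally `SeqIO` fills these in from the file.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# a made-up record: the Seq goes first, the metadata are keyword arguments
toy_record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGC"),
    id="toy_001",
    name="toy",
    description="made-up record for illustration",
)

print(type(toy_record.seq).__name__)  # the wrapped Seq object
print(toy_record.id)                  # metadata stored alongside it
print(len(toy_record))                # length of the underlying sequence
```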

Let's import some fasta files using these functions.

---

## 1. `SeqIO.read()`

* Use when you have <b>one</b> sequence
* Requires two arguments:
    * location/name of file
    * the file format

Let's look at a fasta file that only contains one sequence.


In [75]:
from Bio import SeqIO

ebola_seq = SeqIO.read('data/KM034562.fasta', 'fasta')

In [76]:
# check out our first SeqRecord object 👇🏼

ebola_seq

SeqRecord(seq=Seq('CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTA...GTC', SingleLetterAlphabet()), id='KM034562.1', name='KM034562.1', description='KM034562.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome', dbxrefs=[])
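
`SeqIO.read()` expects exactly one record. The sketch below uses made-up FASTA text held in memory (via `StringIO`, which `SeqIO` accepts in place of a filename) to show what happens when the input holds more than one sequence.

```python
from io import StringIO
from Bio import SeqIO

# toy data, invented for illustration
one_record = StringIO(">toy1 made-up record\nACGTACGT\n")
two_records = StringIO(">toy1 made-up record\nACGTACGT\n>toy2 another one\nGGCC\n")

record = SeqIO.read(one_record, "fasta")
print(record.id, len(record.seq))

try:
    SeqIO.read(two_records, "fasta")
    error_message = None
except ValueError as err:
    # SeqIO.read() raises ValueError when it finds zero or several records
    error_message = str(err)
print(error_message)
```

If you hit this error on a real file, that is the cue to switch to `SeqIO.parse()`.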

---

## 2. `SeqIO.parse()`

* Use when you have <b>multiple</b> sequences
* Like `SeqIO.read()`, requires two arguments:
    * your file location/name
    * the file format
* Does not return the SeqRecords themselves, but <b>an iterator</b> that yields them one at a time

Our Ebola project is listed in NCBI's BioProject Repository as:

[PRJNA257197](https://www.ncbi.nlm.nih.gov/nuccore?term=257197%5BBioProject%5D)

The downloaded file contains all the fasta sequences associated with this project. 

In [77]:
ebola_seqs = SeqIO.parse('data/PRJNA257197.fasta', 'fasta')
ebola_seqs

<generator object FastaIterator at 0x7f871bfa2cf0>

In [78]:
# we use the iterator to retrieve only the information we want for our dataset, saving memory with large files.

count = 0
for seq in ebola_seqs:
    count += 1
    
print(count)

249


In [79]:
ebola_seqs.close()
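
One thing to keep in mind is that the iterator can only be walked through once. Here is a small sketch on made-up in-memory FASTA text (the record names are invented) showing `next()`, a list comprehension, and what is left afterwards.

```python
from io import StringIO
from Bio import SeqIO

# toy data, invented for illustration
toy_fasta = ">toy1 first\nACGT\n>toy2 second\nGGCCTT\n>toy3 third\nAT\n"

records = SeqIO.parse(StringIO(toy_fasta), "fasta")
first = next(records)                        # pull a single record
remaining_ids = [rec.id for rec in records]  # consumes the rest
leftover = list(records)                     # nothing left: the iterator is exhausted

# To loop again, call SeqIO.parse() again on a fresh handle.
lengths = {rec.id: len(rec.seq) for rec in SeqIO.parse(StringIO(toy_fasta), "fasta")}
print(first.id, remaining_ids, leftover, lengths)
```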

---

## 3. `SeqIO.index()`

* Use for files with multiple sequences
* Again, like `SeqIO.read()` and `SeqIO.parse()`, this requires two arguments:
    * your file name/location
    * the file format
* Does not return SeqRecords directly, but a dict-like object <b>that is searchable by accession (ID)</b>

In [80]:
seq_index = SeqIO.index('data/PRJNA257197.fasta', 'fasta')

type(seq_index)

Bio.File._IndexedSeqFileDict

We can then retrieve records based on their ID:

In [81]:
seq_index['KM034562.1']

SeqRecord(seq=Seq('CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTA...GTC', SingleLetterAlphabet()), id='KM034562.1', name='KM034562.1', description='KM034562.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome', dbxrefs=[])

In [82]:
seq_index.close()
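
Because the index behaves like a dictionary, the usual dictionary operations are available. The sketch below writes two invented records to a temporary file (`SeqIO.index()` needs a file on disk rather than an in-memory handle) and tries a few of them.

```python
import os
import tempfile
from Bio import SeqIO

# toy data, invented for illustration
toy_fasta = ">toy1 first\nACGT\n>toy2 second\nGGCCTT\n"

# write the toy data to a temporary file so it can be indexed
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(toy_fasta)
    toy_path = tmp.name

toy_index = SeqIO.index(toy_path, "fasta")
try:
    ids = sorted(toy_index.keys())           # all record IDs in the file
    has_toy2 = "toy2" in toy_index           # membership test by ID
    toy2_seq = str(toy_index["toy2"].seq)    # the record is parsed on lookup
    missing = toy_index.get("not_there")     # .get() avoids a KeyError
finally:
    toy_index.close()
    os.remove(toy_path)

print(ids, has_toy2, toy2_seq, missing)
```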

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## What's a SeqRecord object?

Let's switch it up a bit and work with some genbank files (.gb), which contain metadata in addition to the sequence itself. 

Here's a genbank file. If we open it, it's pretty similar to what we see posted online on KM034562.1's page in [NCBI's Nucleotide Repository](https://www.ncbi.nlm.nih.gov/nuccore/KM034562.1/)

In [83]:
! cat data/KM034532.gb

LOCUS       KM034562               18957 bp    cRNA    linear   VRL 15-DEC-2014
DEFINITION  Zaire ebolavirus isolate Ebola
            virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome.
ACCESSION   KM034562
VERSION     KM034562.1
DBLINK      BioProject: PRJNA257197
            BioSample: SAMN02951978
KEYWORDS    .
SOURCE      Zaire ebolavirus
  ORGANISM  Zaire ebolavirus
            Viruses; ssRNA viruses; ssRNA negative-strand viruses;
            Mononegavirales; Filoviridae; Ebolavirus.
REFERENCE   1  (bases 1 to 18957)
  AUTHORS   Gire,S.K., Goba,A., Andersen,K.G., Sealfon,R.S., Park,D.J.,
            Kanneh,L., Jalloh,S., Momoh,M., Fullah,M., Dudas,G., Wohl,S.,
            Moses,L.M., Yozwiak,N.L., Winnicki,S., Matranga,C.B.,
            Malboeuf,C.M., Qu,J., Gladden,A.D., Schaffner,S.F., Yang,X.,
            Jiang,P.P., Nekoui,M., Colubri,A., Coomber,M.R., Fonnie,M.,
            Moigboi,A., Gbakie,M., Kamara,F.K., Tucker,V., Konuwa,E., Saffa,S.,
            Sellu,J., Ja

We have much more information available to us than in a fasta file; this metadata will be available for us to use if we want, along with the sequence.


To read in different filetypes using the same `SeqIO` functions above, all we have to do is change:

1. the filename; and
2. the file format argument 

You can always refer back to [BioPython's Bio.SeqIO documentation](https://biopython.org/DIST/docs/api/Bio.SeqIO-module.html) to check what the argument is for each file format.


In [84]:
ebola_gb = SeqIO.read('data/KM034532.gb', 'genbank')
ebola_gb

SeqRecord(seq=Seq('CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTA...GTC', IUPACAmbiguousDNA()), id='KM034562.1', name='KM034562', description='Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome', dbxrefs=['BioProject:PRJNA257197', 'BioSample:SAMN02951978'])
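
The same pattern holds for other formats. As a sketch, here is the same made-up read stored once as fasta and once as fastq text (both invented for illustration and held in memory); only the format argument changes, and the fastq version additionally carries per-base quality scores.

```python
from io import StringIO
from Bio import SeqIO

# toy data, invented for illustration; 'I' encodes a Phred quality of 40
toy_fasta = ">read1 made-up read\nACGT\n"
toy_fastq = "@read1 made-up read\nACGT\n+\nIIII\n"

fasta_rec = SeqIO.read(StringIO(toy_fasta), "fasta")
fastq_rec = SeqIO.read(StringIO(toy_fastq), "fastq")

print(str(fasta_rec.seq) == str(fastq_rec.seq))  # same sequence either way
print(fasta_rec.letter_annotations)              # fasta has no per-letter data
print(fastq_rec.letter_annotations)              # fastq adds quality scores
```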

In [85]:
dir(ebola_gb)

['__add__',
 '__bool__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__le___',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_per_letter_annotations',
 '_seq',
 '_set_per_letter_annotations',
 '_set_seq',
 'annotations',
 'dbxrefs',
 'description',
 'features',
 'format',
 'id',
 'letter_annotations',
 'lower',
 'name',
 'reverse_complement',
 'seq',
 'translate',
 'upper']

Notice that our SeqRecord object has several attributes that we can now access easily.

In [86]:
ebola_gb.seq

Seq('CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTA...GTC', IUPACAmbiguousDNA())

In [87]:
ebola_gb.name

'KM034562'

In [88]:
ebola_gb.description

'Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome'

In [89]:
ebola_gb.id

'KM034562.1'
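
The `dir()` listing above also shows `annotations` and `features`. For a genbank file, `annotations` is a dictionary of record-level metadata (with keys such as `'organism'`) and `features` is a list of annotated regions. The sketch below fills both in by hand on an invented record so that it runs without the data file; the organism name and the CDS coordinates are made up.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# a made-up record standing in for one parsed from a genbank file
toy_record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGA"),
    id="toy_001",
    description="made-up record for illustration",
)
toy_record.annotations["organism"] = "Hypothetical virus"
toy_record.features.append(SeqFeature(FeatureLocation(0, 9), type="CDS"))

organism = toy_record.annotations["organism"]  # dictionary-style lookup
cds = toy_record.features[0]                   # features is a plain list
cds_seq = str(cds.extract(toy_record.seq))     # slice out the feature's bases
print(organism, cds.type, cds_seq)
```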

In [90]:
# SeqRecord also has some handy methods, much like our Seq objects

ebola_gb.reverse_complement()

SeqRecord(seq=Seq('GACACACAAAAAAGAAGAAATAGATTTATTTTTAAATTTTTGTGTGCGACCATT...CCG', IUPACAmbiguousDNA()), id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=[])
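
Notice that the ID, name, and description were not carried over to the new record. `reverse_complement()` takes keyword arguments to control this: passing `True` copies the value from the original, and passing a string sets a new value. A small sketch on an invented record:

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# a made-up record for illustration
toy_record = SeqRecord(Seq("ATGGCC"), id="toy_001", name="toy", description="made-up record")

# default: sequence is reverse-complemented, identifying metadata is dropped
rc_default = toy_record.reverse_complement()

# keep the id and name, and supply a new description
rc_named = toy_record.reverse_complement(
    id=True, name=True, description="reverse complement of toy_001"
)

print(rc_default.seq, rc_default.id)
print(rc_named.seq, rc_named.id, rc_named.description)
```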