# Day 3 Part 1 - more exercises!
---


One powerful feature of Biopython is its ability to read the file formats used for biological data. Let's read a single FASTA sequence from the file "myFavouriteProtein_ncbi.fasta".

In [None]:
from Bio import SeqIO

# Here we import the SeqIO module; let's see what it is
print(SeqIO)

### Reading a single FASTA-formatted sequence

Reminder: a FASTA file stores each sequence as a header line starting with `>`, followed by one or more lines of sequence.
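
As a minimal sketch of that layout, the snippet below holds one FASTA record in a Python string and splits it into header and sequence using only built-in string methods. The record ID, description and sequence are invented for illustration and do not come from the course files.

```python
# A made-up FASTA record (hypothetical ID, description and sequence)
example_fasta = """>demo_id_1 hypothetical example protein [Example organism]
MKTAYIAKQRQISFVKSHFSRQ
LEERLGLIEVQAPILSRVGDGT
"""

lines = example_fasta.strip().splitlines()
header = lines[0]               # the line that starts with '>'
sequence = "".join(lines[1:])   # the sequence lines joined together

# The ID is the first word of the header, without the leading '>'
example_id = header[1:].split()[0]

print(header)
print(example_id)
print(len(sequence))
```

Biopython does this splitting for you and stores the pieces in a record object, which is what the next cells explore.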

In [None]:
# myFastaRecord is a single fasta record from the file 
# myFastaRecord IS NOT a string
# myFastaRecord IS NOT a list
# myFastaRecord is a Biopython Object = SeqRecord
myFastaRecord = SeqIO.read("myFavouriteProtein_ncbi.fasta", "fasta")

# what does it look like?
print(myFastaRecord)

In [None]:
# Let's call (not print) myFastaRecord
myFastaRecord

Above, we can see that this is a 'SeqRecord' object, and the output lists all the elements stored in it. You can access each of these elements by typing myFastaRecord.XXX, where XXX is the name of the element.

In [None]:
# this is the sequence as a biopython object
myFastaRecord.seq

In [None]:
# printing the sequence shows it as plain text
print(myFastaRecord.seq)

In [None]:
# you can query the length of the biopython object using len()
len(myFastaRecord.seq)

In [None]:
# In this box, try to query another element of myFastaRecord.
# call myFastaRecord to see your options (press play)
# or try tab completion after typing 'myFastaRecord.'
myFastaRecord.description


Let's try to get some information out of the description. Write an expression that tests whether the string 'protein' appears in the description of the sequence.
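
Keep in mind that `in` is case-sensitive when used on strings, so 'protein' and 'Protein' are treated as different words. The sketch below uses a made-up description string (an assumption, not the text in the course file) to show the difference, and how lowercasing first gives a case-insensitive test.

```python
# Hypothetical description text, invented for this example
example_description = "demo_id_1 Hypothetical Protein kinase [Example organism]"

# 'in' on strings is case-sensitive
exact_match = "protein" in example_description

# Lowercase the description first for a case-insensitive test
any_case_match = "protein" in example_description.lower()

print(exact_match)
print(any_case_match)
```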

In [None]:
# Does this protein description contain the word 'protein'?
"protein" in myFastaRecord.description


In [None]:
# What is the accession number (=id) of this protein? 
myFastaRecord.id
# store the accession number in a variable 
mySeqID = myFastaRecord.id
mySeqID

## Parsing a file with more than one sequence

To parse a larger file, it is best to read through the file sequence by sequence and store only the information you are interested in. We can do this using SeqIO.parse().

In [None]:
SeqIO.parse("NFU1_proteins.fasta", "fasta")

The output says that this object is a 'Fasta Iterator'. This is a hint that the object is iterable: you can loop over it, just as you would loop over a list, dictionary or string.
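
One practical difference from a list is that an iterator can usually be walked through only once. The sketch below uses a plain Python generator as a stand-in for the FASTA iterator (the function `fake_records` and its IDs are hypothetical), so it runs without any sequence file.

```python
def fake_records():
    """Stand-in for a record iterator: yields made-up IDs one at a time."""
    for demo_id in ["demo_1", "demo_2", "demo_3"]:
        yield demo_id

records = fake_records()

first_pass = [r for r in records]    # consumes the iterator
second_pass = [r for r in records]   # nothing is left to read

print(first_pass)
print(second_pass)

# Converting to a list once keeps the items available for reuse
stored = list(fake_records())
print(len(stored))
```

If you need to go through the records more than once, you can either call SeqIO.parse() again or store the records in a list, at the cost of holding them all in memory.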

In [None]:
for record in SeqIO.parse("NFU1_proteins.fasta", "fasta"):
    print(record)

In [None]:
# let's print just the first 3 records to the screen.
# hint: we can use 'break' to stop the for loop if 
# we need to

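counter = 0
# iterate directly over the parser: wrapping it in list() would
# read the whole file before the loop even starts
for record in SeqIO.parse("NFU1_proteins.fasta", "fasta"):
    counter += 1
    if counter > 3:
        break
    print(record)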
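
An alternative to counting and breaking is `itertools.islice` from the standard library, which takes the first n items from any iterable. This sketch uses a hypothetical generator, `fake_records`, in place of `SeqIO.parse()`; with the real file you would, under that assumption, pass the parse call to `islice` instead.

```python
from itertools import islice

def fake_records():
    """Hypothetical stand-in for SeqIO.parse(): yields made-up record names."""
    for number in range(1, 11):
        yield "demo_record_%i" % number

# Take only the first 3 items; the remaining ones are never generated
first_three = list(islice(fake_records(), 3))

for name in first_three:
    print(name)
```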



In [None]:
# Find all the records with 'Drosophila' in the description
# print them to screen AND store their record object in a list 
# myFlies 

myFlies = []

# one pass over the file is enough: print and store in the same loop
for record in SeqIO.parse("NFU1_proteins.fasta", "fasta"):
    if 'Drosophila' in record.description:
        print(record.description)
        myFlies.append(record)

print(myFlies)


In [None]:
# Let's try to perform some actions on the myFlies records
# Try printing the length of every sequence

for record in myFlies:
    print("The sequence %s is %i amino acids long" % (record.id, len(record.seq)))

In [None]:
# Let's count the number of Alanines in each sequence from myFlies
# print 'this sequence has XX Alanines'
for record in myFlies:
    sequence = record.seq
    print("The sequence %s has %i Alanines" % (record.id, sequence.count('A')))