# Day 3 Part 1 - more exercises!
---


One powerful feature of Biopython is its ability to read the file formats used for biological data. Let's try reading a single FASTA sequence from the file "myFavouriteProtein_ncbi.fasta".

In [6]:
from Bio import SeqIO

# Here, we import the module SeqIO; let's see what it is
SeqIO
print(SeqIO)

<module 'Bio.SeqIO' from '/home/julieb/anaconda3/envs/PythonCourses22/lib/python3.10/site-packages/Bio/SeqIO/__init__.py'>


### Reading a single FASTA-formatted sequence

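Reminder: a FASTA record is a header line starting with '>' (the ID, then a free-text description), followed by one or more lines of sequence.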
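
A minimal sketch of the layout, using a made-up record held in a string. The ID, description and sequence below are invented for illustration and are not the contents of the course file; in practice SeqIO does this splitting for you.

```python
# A made-up FASTA record (hypothetical ID and sequence, for illustration only).
fasta_text = """>demo_1 hypothetical example protein [Example organism]
MNSLRITSLQ
RCCSIGFRQF
"""

lines = fasta_text.strip().splitlines()

# The header is the first line; it starts with '>' and holds the ID and description.
header = lines[0]
record_id = header[1:].split()[0]

# The sequence is everything after the header, with the line breaks removed.
sequence = "".join(lines[1:])

print(record_id)
print(sequence)
print(len(sequence))
```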

In [7]:
# myFastaRecord is a single fasta record from the file 
# myFastaRecord IS NOT a string
# myFastaRecord IS NOT a list
# myFastaRecord is a Biopython Object = SeqRecord
myFastaRecord = SeqIO.read("myFavouriteProtein_ncbi.fasta", "fasta")

# what does it look like?
print(myFastaRecord)

ID: AKA62179.1
Name: AKA62179.1
Description: AKA62179.1 putative rhodoquinone biosynthesis methyltransferase-like protein RQUA [Pygsuia biforma]
Number of features: 0
Seq('MNSLRITSLQRCCSIGFRQFSSLRNTFGCRSFLHSSKFFHSTTVRGNDKEELPE...RIA')


In [8]:
# Let's call (not print) myFastaRecord
myFastaRecord

SeqRecord(seq=Seq('MNSLRITSLQRCCSIGFRQFSSLRNTFGCRSFLHSSKFFHSTTVRGNDKEELPE...RIA'), id='AKA62179.1', name='AKA62179.1', description='AKA62179.1 putative rhodoquinone biosynthesis methyltransferase-like protein RQUA [Pygsuia biforma]', dbxrefs=[])

Above, we can see a 'SeqRecord' object together with all the attributes stored in it (seq, id, name, description, dbxrefs). You can access any of them with dot notation, i.e. myFastaRecord.XXX, where XXX is the attribute name.

In [9]:
# this is the sequence as a biopython object
myFastaRecord.seq

Seq('MNSLRITSLQRCCSIGFRQFSSLRNTFGCRSFLHSSKFFHSTTVRGNDKEELPE...RIA')

In [10]:
# printing the Seq object shows the full sequence as plain text
print(myFastaRecord.seq)

MNSLRITSLQRCCSIGFRQFSSLRNTFGCRSFLHSSKFFHSTTVRGNDKEELPEYMANTYHWAYVNPRNVALLDNNFVVNTILFGNYIRIQNFALSEIKQGDQVFMPASVYGSACRNIAKAVGEAGRLDIIDISPIQVVRNTRKLSRYPQVTVLRGDARSFDLQAAYDVACSFMLLHEIPDENKSSVVNNVLNSVKVGGKAVFIDYGRPSTLHPVRPILSFVNDWLEPWAKTLWAHPISSFAAPESQDHFVWETERTIFGGVYQKVVAHRIA


In [11]:
# you can query the length of the biopython object using len()
len(myFastaRecord.seq)

272
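
Beyond `len()`, a `Seq` object supports many of the operations you already know from strings. The sketch below builds a short `Seq` by hand from the first ten residues shown above (an assumption made so the example runs without the course file) and tries slicing, counting, and conversion to a plain string.

```python
from Bio.Seq import Seq

# First ten residues of the sequence above, typed in by hand for illustration.
short_seq = Seq("MNSLRITSLQ")

# Slicing a Seq gives back another Seq, not a str.
first_three = short_seq[:3]
print(repr(first_three))

# count() works as it does on strings.
print(short_seq.count("S"))

# str() converts to an ordinary Python string when one is needed.
as_text = str(short_seq)
print(type(as_text).__name__, as_text)
```

Converting with `str()` is handy when you want to pass the sequence to code that expects an ordinary string.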

In [13]:
# In this box, try to query another element of myFastaRecord.
# call myFastaRecord to see your options (press play)
# or try tab completion after typing 'myFastaRecord.'
myFastaRecord.description


'AKA62179.1 putative rhodoquinone biosynthesis methyltransferase-like protein RQUA [Pygsuia biforma]'

Let's try and get some information out of the description.  Write a conditional statement asking whether there is a string 'protein' in the description of the sequence.  

In [16]:
# Does this protein description contain the word 'protein'?
"protein" in myFastaRecord.description


True
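
String methods can pull other pieces out of the description too. As a sketch, the snippet below extracts the organism name, assuming (as in the NCBI-style description above) that it sits in square brackets at the end of the line. The description is copied in as a plain string so the example runs on its own.

```python
# Description copied from the record above.
description = ("AKA62179.1 putative rhodoquinone biosynthesis "
               "methyltransferase-like protein RQUA [Pygsuia biforma]")

# Split once at the last '[' and drop the closing ']'.
# Assumption: the organism is the final bracketed field of the description.
organism = description.rsplit("[", 1)[1].rstrip("]")

print(organism)
print("Pygsuia" in organism)
```

Descriptions from other databases may not follow this bracket convention, so check a few headers before relying on it.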

In [19]:
# What is the accession number (=id) of this protein? 
myFastaRecord.id
# store the accession number in a variable 
mySeqID = myFastaRecord.id
mySeqID

'AKA62179.1'

## Parsing a file with more than one sequence

To parse a larger file, it is best to read through the file sequence by sequence and store only the information you are interested in. We can do this using SeqIO.parse().

In [20]:
SeqIO.parse("NFU1_proteins.fasta", "fasta")

<Bio.SeqIO.FastaIO.FastaIterator at 0x7f2150e5e9e0>

The output says that this object is a 'FastaIterator'. This is a hint that the object is iterable: you can loop over it one record at a time, just as you would loop over a list, dictionary or string.
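
One consequence worth seeing for yourself: unlike a list, an iterator hands out each record once and is then used up. The sketch below uses a made-up two-record FASTA text wrapped in `io.StringIO` in place of the course file (the IDs and sequences are invented for illustration).

```python
from io import StringIO
from Bio import SeqIO

# Two made-up records standing in for a multi-sequence FASTA file.
fasta_text = ">seqA first demo record\nMAAT\n>seqB second demo record\nMSKL\n"

records = SeqIO.parse(StringIO(fasta_text), "fasta")

# next() pulls a single record from the iterator.
first = next(records)
print(first.id)

# Looping continues from where next() stopped.
remaining_ids = [record.id for record in records]
print(remaining_ids)

# The iterator is now exhausted; a second pass yields nothing.
leftover = list(records)
print(leftover)
```

This is why it is common to call `SeqIO.parse()` afresh each time you want to loop over the file again.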

In [None]:
for record in SeqIO.parse("NFU1_proteins.fasta", "fasta"):
    print(record)

In [23]:
# let's print just the first 3 records to the screen.
# hint: we can use 'break' to stop the for loop
# once we have seen enough

counter = 0
# no need to wrap the iterator in list(): we can loop over it directly
for record in SeqIO.parse("NFU1_proteins.fasta", "fasta"):
    counter += 1
    if counter > 3:
        break
    print(record)



ID: Q9UMS0.2
Name: Q9UMS0.2
Description: Q9UMS0.2 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; AltName: Full=HIRA-interacting protein 5; Flags: Precursor [Homo sapiens]
Number of features: 0
Seq('MAATARRGWGAAAVAAGLRRRFCHMLKNPYTIKKQPLHQFVQRPLFPLPAAFYH...NSP')
ID: Q9QZ23.2
Name: Q9QZ23.2
Description: Q9QZ23.2 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; AltName: Full=HIRA-interacting protein 5; Short=mHIRIP5; Flags: Precursor [Mus musculus]
Number of features: 0
Seq('MAAAERAWGAAVGVVRLCRRFCHVATPHTFKKQPLHQYVRRPLFPLRAPLCNTV...NSS')
ID: B4M375.1
Name: B4M375.1
Description: B4M375.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila virilis]
Number of features: 0
Seq('MSKLISYAAKNTLRNTRLGANPICQHATRDYMHLAAASAARNTYSTPAVGFAKQ...TPN')
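
An alternative to counting by hand is `itertools.islice` from the standard library, which stops after a fixed number of items without asking for the rest. A sketch, again using invented records in a `StringIO` in place of `NFU1_proteins.fasta`:

```python
from io import StringIO
from itertools import islice
from Bio import SeqIO

# Four made-up records standing in for the course file.
fasta_text = "".join(f">demo{i} invented record {i}\nMAAT\n" for i in range(1, 5))

# islice takes the first 3 records and then stops asking for more.
first_three = list(islice(SeqIO.parse(StringIO(fasta_text), "fasta"), 3))

for record in first_three:
    print(record.id, len(record.seq))
```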


In [46]:
# Find all the records with 'Drosophila' in the description
# print them to screen AND store their record object in a list
# myFlies

myFlies = []

# one pass over the file is enough to do both jobs
for record in SeqIO.parse("NFU1_proteins.fasta", "fasta"):
    if 'Drosophila' in record.description:
        print(record.description)   # show the matching description
        myFlies.append(record)      # and keep the whole record object

print(myFlies)


B4M375.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila virilis]
B3MRT7.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila ananassae]
B4JWR9.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila grimshawi]
B4H303.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila persimilis]
B5DKJ8.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila pseudoobscura pseudoobscura]
B4NE93.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila willistoni]
B4PZ52.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila yakuba]
Q8SY96.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila melanogaster]
B4IMF6.1 RecNa

In [52]:
# Let's try and perform some actions on the myFlies records
# Try printing the length of every sequence

for record in myFlies:
    print("The sequence %s is %i amino acids long" % (record.id, len(record.seq)))

The sequence B4M375.1 is 298 amino acids long
The sequence B3MRT7.1 is 286 amino acids long
The sequence B4JWR9.1 is 298 amino acids long
The sequence B4H303.1 is 282 amino acids long
The sequence B5DKJ8.1 is 286 amino acids long
The sequence B4NE93.1 is 289 amino acids long
The sequence B4PZ52.1 is 283 amino acids long
The sequence Q8SY96.1 is 283 amino acids long
The sequence B4IMF6.1 is 283 amino acids long
The sequence B3NYF7.1 is 283 amino acids long
The sequence B4R3T1.1 is 283 amino acids long


In [57]:
# Let's count the number of Alanines in each sequence from myFlies
# print 'this sequence has XX Alanines'
for record in myFlies:
    sequence = record.seq
    print("The sequence %s has %i Alanines" % ((record.id), sequence.count('A')))

The sequence B4M375.1 has 24 Alanines
The sequence B3MRT7.1 has 21 Alanines
The sequence B4JWR9.1 has 22 Alanines
The sequence B4H303.1 has 17 Alanines
The sequence B5DKJ8.1 has 16 Alanines
The sequence B4NE93.1 has 20 Alanines
The sequence B4PZ52.1 has 17 Alanines
The sequence Q8SY96.1 has 13 Alanines
The sequence B4IMF6.1 has 16 Alanines
The sequence B3NYF7.1 has 18 Alanines
The sequence B4R3T1.1 has 16 Alanines
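
Raw counts are hard to compare when the sequences differ in length, so a natural next step is to express the count as a fraction. The helper below, `residue_fraction`, is a hypothetical name introduced here. It relies only on `count()` and `len()`, so it should work on a `Seq` as well as on the plain string used to keep this sketch self-contained.

```python
def residue_fraction(sequence, residue="A"):
    """Return the fraction of positions in `sequence` that are `residue`."""
    if len(sequence) == 0:
        return 0.0  # avoid dividing by zero on an empty sequence
    return sequence.count(residue) / len(sequence)

# Made-up sequence for illustration: 2 alanines out of 8 residues.
demo = "MAKLAGTS"
print("%.1f%% alanine" % (100 * residue_fraction(demo)))
```

Inside the loop over myFlies you would call it as `residue_fraction(record.seq)`.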
