# Day 3 - more exercises!
---


One powerful feature of Biopython is its ability to read the file formats used for biological data. Let's read in a single FASTA sequence from the file "myFavouriteProtein_ncbi.fasta".

In [1]:
from Bio import SeqIO

# Here, we import the module SeqIO; let's see what it is
SeqIO
print(SeqIO)

<module 'Bio.SeqIO' from '/home/julieb/anaconda3/lib/python3.9/site-packages/Bio/SeqIO/__init__.py'>


### Reading a single FASTA-formatted sequence

Reminder: a FASTA record is a header line starting with '>' followed by one or more lines of sequence.

In [3]:
# myFastaRecord is a single fasta record from the file 
# myFastaRecord IS NOT a string
# myFastaRecord IS NOT a list
# myFastaRecord is a Biopython Object = SeqRecord
myFastaRecord = SeqIO.read("myFavouriteProtein_ncbi.fasta", "fasta")

# what does it look like?
print(myFastaRecord)

ID: AKA62179.1
Name: AKA62179.1
Description: AKA62179.1 putative rhodoquinone biosynthesis methyltransferase-like protein RQUA [Pygsuia biforma]
Number of features: 0
Seq('MNSLRITSLQRCCSIGFRQFSSLRNTFGCRSFLHSSKFFHSTTVRGNDKEELPE...RIA')


In [4]:
# Let's call (not print) myFastaRecord
myFastaRecord

SeqRecord(seq=Seq('MNSLRITSLQRCCSIGFRQFSSLRNTFGCRSFLHSSKFFHSTTVRGNDKEELPE...RIA'), id='AKA62179.1', name='AKA62179.1', description='AKA62179.1 putative rhodoquinone biosynthesis methyltransferase-like protein RQUA [Pygsuia biforma]', dbxrefs=[])

Above, we can see a 'SeqRecord' object that lists all the elements it stores. You can access each of these elements by typing myFastaRecord.XXX, where XXX is the name of the element (for example, seq or id).

In [5]:
# this is the sequence as a biopython object
myFastaRecord.seq

Seq('MNSLRITSLQRCCSIGFRQFSSLRNTFGCRSFLHSSKFFHSTTVRGNDKEELPE...RIA')

In [6]:
# printing the sequence shows it in full, as plain text
print(myFastaRecord.seq)

MNSLRITSLQRCCSIGFRQFSSLRNTFGCRSFLHSSKFFHSTTVRGNDKEELPEYMANTYHWAYVNPRNVALLDNNFVVNTILFGNYIRIQNFALSEIKQGDQVFMPASVYGSACRNIAKAVGEAGRLDIIDISPIQVVRNTRKLSRYPQVTVLRGDARSFDLQAAYDVACSFMLLHEIPDENKSSVVNNVLNSVKVGGKAVFIDYGRPSTLHPVRPILSFVNDWLEPWAKTLWAHPISSFAAPESQDHFVWETERTIFGGVYQKVVAHRIA


In [7]:
# you can query the length of the biopython object using len()
len(myFastaRecord.seq)

272

In [8]:
# In this box, try to query another element of myFastaRecord.
# Call myFastaRecord to see your options (press play),
# or try tab completion after typing 'myFastaRecord.'
myFastaRecord

SeqRecord(seq=Seq('MNSLRITSLQRCCSIGFRQFSSLRNTFGCRSFLHSSKFFHSTTVRGNDKEELPE...RIA'), id='AKA62179.1', name='AKA62179.1', description='AKA62179.1 putative rhodoquinone biosynthesis methyltransferase-like protein RQUA [Pygsuia biforma]', dbxrefs=[])
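
As a hedged illustration of attribute access, the sketch below reads one record from an in-memory FASTA string instead of the course file. The record ID `demo1`, its header and its short sequence are made up for the example.

```python
from io import StringIO
from Bio import SeqIO

# Hypothetical one-record FASTA text (not taken from the course files)
fasta_text = ">demo1 example protein [Demo species]\nMNSLRITSLQ\n"

# SeqIO.read also accepts an open handle, here a StringIO wrapper
demo_record = SeqIO.read(StringIO(fasta_text), "fasta")

print(demo_record.id)           # first word of the header line
print(demo_record.description)  # the whole header line, without '>'
print(len(demo_record.seq))     # number of residues in the sequence
```

The same attribute names should work on myFastaRecord, since both are SeqRecord objects.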

Let's try to get some information out of the description. Write a conditional statement that checks whether the string 'protein' occurs in the description of the sequence.

In [None]:
# Does this protein description have the word 'protein'?

In [21]:
# What is the accession number (=id) of this protein? 
# store the accession number in a variable 
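
One possible shape for a substring check, sketched on a plain string rather than on the record itself. The description text is copied from the output printed earlier; the variable name `has_protein` is only illustrative.

```python
# Description copied from the record printed earlier in this notebook
description = ("AKA62179.1 putative rhodoquinone biosynthesis "
               "methyltransferase-like protein RQUA [Pygsuia biforma]")

# 'in' tests whether one string occurs inside another (case-sensitive)
if "protein" in description:
    has_protein = True
else:
    has_protein = False

print(has_protein)
```

Because the test is case-sensitive, 'Protein' with a capital P would not match this description.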

## Parsing a file with more than one sequence

To parse a larger file, it is best to read through the file sequence by sequence and store only the information you are interested in. We can do this using SeqIO.parse(); SeqIO.read() only works on files that contain exactly one record.

In [9]:
SeqIO.parse("NFU1_proteins.fasta", "fasta")

<Bio.SeqIO.FastaIO.FastaIterator at 0x7f51440dd6a0>

The output says that this object is a 'FastaIterator'. This is a hint that the object is iterable: you can loop over it just as you would over a list, dictionary or string. Unlike a list, it yields one record at a time rather than holding them all in memory.

In [10]:
for record in SeqIO.parse("NFU1_proteins.fasta", "fasta"):
    # do something 
    pass   # pass does nothing
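
A small sketch of what "iterator" means in practice, assuming two made-up records held in a string instead of a file: records are handed out one at a time, and once they have been consumed the iterator is empty.

```python
from io import StringIO
from Bio import SeqIO

# Two hypothetical records held in memory instead of a file
fasta_text = ">seqA first demo\nMKT\n>seqB second demo\nMAAG\n"

records = SeqIO.parse(StringIO(fasta_text), "fasta")

first = next(records)       # pull a single record by hand
remaining = list(records)   # collect whatever is left
leftover = list(records)    # nothing is left: the iterator is exhausted

print(first.id, len(remaining), len(leftover))
```

To go through the records a second time, call SeqIO.parse() again.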

In [22]:
# let's print just the first 3 records to the screen.
# hint: we can use 'break' to stop the for loop if 
# we need to

counter = 0
# loop over the iterator directly; there is no need to build a list first
for record in SeqIO.parse("NFU1_proteins.fasta", "fasta"):
    counter += 1
    if counter <= 3:
        print(record)
    else:
        break



ID: Q9UMS0.2
Name: Q9UMS0.2
Description: Q9UMS0.2 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; AltName: Full=HIRA-interacting protein 5; Flags: Precursor [Homo sapiens]
Number of features: 0
Seq('MAATARRGWGAAAVAAGLRRRFCHMLKNPYTIKKQPLHQFVQRPLFPLPAAFYH...NSP')
ID: Q9QZ23.2
Name: Q9QZ23.2
Description: Q9QZ23.2 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; AltName: Full=HIRA-interacting protein 5; Short=mHIRIP5; Flags: Precursor [Mus musculus]
Number of features: 0
Seq('MAAAERAWGAAVGVVRLCRRFCHVATPHTFKKQPLHQYVRRPLFPLRAPLCNTV...NSS')
ID: B4M375.1
Name: B4M375.1
Description: B4M375.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila virilis]
Number of features: 0
Seq('MSKLISYAAKNTLRNTRLGANPICQHATRDYMHLAAASAARNTYSTPAVGFAKQ...TPN')
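
An alternative to the counter-and-break pattern, shown here as a sketch on four made-up records: `itertools.islice` from the standard library stops an iterator after a given number of items.

```python
from io import StringIO
from itertools import islice
from Bio import SeqIO

# Four hypothetical records; only the first three are wanted
fasta_text = ">r1\nMA\n>r2\nMK\n>r3\nML\n>r4\nMV\n"

# islice(iterable, 3) yields at most the first three items
first_three = list(islice(SeqIO.parse(StringIO(fasta_text), "fasta"), 3))
ids = [record.id for record in first_three]

print(ids)
```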


In [50]:
# Find all the records with 'Drosophila' in the description
# print them to screen AND store their record object in a dictionary 
# myFlies 

myFlies = {}

for record in SeqIO.parse("NFU1_proteins.fasta", "fasta"):
    description = record.description
    if 'Drosophila' in description:
        # store the whole record object, keyed by its accession number
        myFlies[record.id] = record
        print(description)

# https://www.biostars.org/p/169723/

B4M375.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila virilis]
B3MRT7.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila ananassae]
B4JWR9.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila grimshawi]
B4H303.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila persimilis]
B5DKJ8.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila pseudoobscura pseudoobscura]
B4NE93.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila willistoni]
B4PZ52.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila yakuba]
Q8SY96.1 RecName: Full=NFU1 iron-sulfur cluster scaffold homolog, mitochondrial; Flags: Precursor [Drosophila melanogaster]
B4IMF6.1 RecNa

Let's try to perform some actions on the myFlies records.

In [44]:
# Note, this record variable is UNRELATED to our variables above.
# Loop over .values() to get the stored SeqRecord objects;
# looping over myFlies itself would give the keys (strings).
for record in myFlies.values():
    print("This sequence is %i amino acids long" % len(record.seq))
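
Looping directly over a dictionary yields its keys, not the objects stored in it. That matters when the keys are strings and the values are records. A minimal sketch, using a made-up dictionary `demo_flies` that maps accession-like keys to plain sequence strings:

```python
# Hypothetical stand-in for myFlies: key -> sequence string
demo_flies = {"fly1": "MSKLIS", "fly2": "MAAT"}

keys_seen = [key for key in demo_flies]                # a plain loop gives the keys
values_seen = [value for value in demo_flies.values()] # .values() gives stored objects
lengths = {key: len(value) for key, value in demo_flies.items()}  # .items() gives both

print(keys_seen)
print(values_seen)
print(lengths)
```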

In [45]:
# Let's count the number of Alanines in each sequence from myFlies
# print 'this sequence has XX Alanines'
for record in myFlies.values():
    sequence = record.seq
    # Seq objects have a count() method, just like strings
    print("this sequence has %i Alanines" % sequence.count("A"))

Read all the records again (I started it for you). This time, add the records whose sequences have an odd number of amino acids to a list, then write this list to a file using SeqIO.write.

Recommended: 

 - Make a function 'odd_num' like we did before that returns 'True' if a number is odd.
 - Read your fasta file (I started it for you below).
 - Make a variable that will store your sequence.
 - Calculate the length of that variable.
 - Check if that length is odd.
 - If it is, append the record (not just the sequence) to a list, because SeqIO.write expects records.
 - Outside of your for loop, write the odd-list to file.
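
The steps above can be sketched on in-memory data. This is one possible shape, not the course's reference answer: the `odd_num` body is an assumed version of the earlier function, the three records are made up, and the output goes to a StringIO handle so that no file is created.

```python
from io import StringIO
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def odd_num(number):
    """Return True if number is odd (assumed version of the earlier helper)."""
    return number % 2 == 1

# Hypothetical records standing in for those parsed from the FASTA file
demo_records = [
    SeqRecord(Seq("MKT"), id="odd1", description=""),
    SeqRecord(Seq("MKTA"), id="even1", description=""),
    SeqRecord(Seq("MKTAG"), id="odd2", description=""),
]

odd_records = []
for record in demo_records:
    if odd_num(len(record.seq)):
        odd_records.append(record)  # keep the whole record, not just the sequence

# SeqIO.write takes a filename or an open handle and returns the number written
handle = StringIO()
n_written = SeqIO.write(odd_records, handle, "fasta")

print(n_written)
print(handle.getvalue())
```

Passing a filename such as "myOddSequences.fasta" instead of the handle would write the same text to disk.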




In [67]:
for record in SeqIO.parse("NFU1_proteins.fasta", "fasta"):

    pass

#SeqIO.write( ??? , "myOddSequences.fasta","fasta")



## Rosalind

Try to solve the following problems with Biopython.

In [None]:
# http://rosalind.info/problems/gc/
# GC content

# hint, try from Bio.SeqUtils import GC
# (newer Biopython releases replace GC with gc_fraction)
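
If you prefer to see the arithmetic spelled out, GC content can also be computed without Biopython. The helper name `gc_percent` is made up for this sketch, and the input is an arbitrary short string, not the Rosalind dataset.

```python
def gc_percent(dna):
    """Percentage of G and C bases in a DNA string (0.0 for an empty string)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    gc_count = dna.count("G") + dna.count("C")
    return 100.0 * gc_count / len(dna)

print(gc_percent("AGCTATAG"))
```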


In [None]:
# http://rosalind.info/problems/prtm/
# Mass of a protein

# hint: make a dictionary of the table
#
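
A partial sketch of the dictionary idea, covering only six residues. The masses are quoted from memory of the monoisotopic mass table on the problem page, so check them against that table and extend the dictionary to all twenty amino acids before using it.

```python
# Partial monoisotopic mass table (assumed values; verify against the problem page)
MASS = {
    "A": 71.03711,
    "D": 115.02694,
    "E": 129.04259,
    "K": 128.09496,
    "S": 87.03203,
    "Y": 163.06333,
}

def protein_mass(protein):
    """Sum the residue masses of a protein string using the MASS table."""
    return sum(MASS[residue] for residue in protein)

total = protein_mass("SKADYEK")
print(round(total, 3))
```

A residue missing from the table raises a KeyError, which is a useful reminder that the dictionary is incomplete.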

In [None]:
# http://rosalind.info/problems/orf/
# Finding all possible ORFs

In [None]:
# http://rosalind.info/problems/splc/
# Splice splice baby. 

In [None]:
# http://rosalind.info/problems/revp/
# Palindromania

# hint: don't write out every possible palindrome. 
# this one looks fun!
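
Rather than listing palindromes, one option is to test each candidate substring: a reverse palindrome equals its own reverse complement. The sketch below shows only that test, with a made-up helper name, and leaves the search over positions and lengths to you.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def is_reverse_palindrome(dna):
    """Return True if dna equals its own reverse complement."""
    return dna == dna.translate(COMPLEMENT)[::-1]

print(is_reverse_palindrome("GCATGC"))
print(is_reverse_palindrome("ATGCAA"))
```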