# **Day 2 - Part 2** 

## Learning objectives

- Solve problems using core Python
- Introduction to Biopython



## Reminder: Lists and Dictionaries and the 0-index
---

### Lists
**Lists** (similar to 'arrays' or 'vectors' in other languages) are ordered collections of items. Lists are written in square brackets, with the items separated by commas.

In [None]:
this_is_a_list_of_integers = [1,2,3,4]
this_is_a_list_of_strings = ["a","c","b"]

# You can also have lists within lists
this_is_a_list_of_integers_strings_and_lists = [1,2,3,4,"a","c","b",this_is_a_list_of_strings]
print(this_is_a_list_of_integers_strings_and_lists)

You can retrieve items from a list based on their position in the list. However, Python uses a 0-index, which means the first item in the list is 'item 0', the second item is 'item 1', and the last item is at index (length of the list - 1).

In [None]:
myList = [1,2,3,4]

# first item in the list
first = myList[0] 
print("First item is", first)

# last item in the list
last = myList[-1]  
print("Last item is", last)

# second last item in the list
second_last = myList[-2]
print("Second last item is", second_last)

# provide the list starting with the 2nd item (index 1)
skip_first = myList[1:] 
print("List without first item is", skip_first)

# provide the last two items in the list
last_two = myList[-2:] 
print("Last two items are", last_two)

# provide item 2 and item 3 in the list
print("Items 2 and 3 in the list are", myList[1:3])


You can add an item to a list using `.append()`, or remove items from a list using `.pop()`, `del`, or `.remove()`.

In [None]:
myList = [1,2,3,4]
myList.append("apple")
print(myList)

myList.pop(1) # 'pops' out the 2nd item in myList
print(myList)

del myList[1] # deletes the 2nd item in the new myList
print(myList)

myList.remove("apple") # removes the item called 'apple'
print(myList)
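
The three removal tools behave differently, which matters when choosing one. The sketch below (the list `bases` is made up for illustration) shows that `.pop()` returns the item it removes, `del` removes by position without returning anything, and `.remove()` deletes the first matching value and raises a `ValueError` if the value is absent.

```python
bases = ["a", "c", "g", "t", "a"]

removed = bases.pop(1)   # removes the item at index 1 and returns it
print(removed, bases)

del bases[0]             # removes the item at index 0, returns nothing
print(bases)

bases.remove("a")        # removes the first "a" it finds
print(bases)

try:
    bases.remove("n")    # "n" is not in the list, so this raises ValueError
except ValueError:
    print("'n' is not in the list")
```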

### Dictionaries 

Dictionaries are incredibly powerful tools for storing information. A dictionary is a collection of keys (=entries), each paired with a value (=attribute) you would like to store with it. Each key is separated from its value by a ':', pairs are separated by commas, and the whole dictionary is enclosed in {}.

In [None]:
myDictionary = {"a":1,"b":2,"c":3}

# You can access the value of key 'b' using the following syntax.  
print("The value for key b is:", myDictionary["b"])

# You can add elements to a dictionary
myDictionary["d"] = 4
print(myDictionary)

# Overwrite elements of a key 
myDictionary["d"] = 5
print(myDictionary)

# Add to (or subtract from) the value of a key
myDictionary["d"] += 5
print(myDictionary)

# Remove keys from a dictionary 
del myDictionary["d"] 
myDictionary.pop("a")
print(myDictionary)
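
Two more dictionary operations tend to come up in the exercises that follow: looping over the contents, and looking up a key that may not exist. This is a small sketch; the dictionary `gene_lengths` and its values are invented for illustration.

```python
gene_lengths = {"geneA": 1200, "geneB": 450, "geneC": 980}

# Loop over keys and values together with .items()
for gene, length in gene_lengths.items():
    print(gene, "is", length, "bp long")

# 'in' tests whether a key exists
print("geneB" in gene_lengths)

# .get() returns a default value instead of raising a KeyError
missing = gene_lengths.get("geneZ", 0)
print(missing)
```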

### Section 01 - Analyzing biological data with python core
---

We are going to start playing with actual biological data. 

[Here](https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf) you can find a useful cheat sheet with common python core functions.


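Let's reverse complement a DNA sequence. Hint: you can reverse a string with the slice `[::-1]`. Strings have no `.reverse()` method (that one belongs to lists), but `"".join(reversed(myDNA))` also works.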

In [None]:
myDNA = "agttgcccag"
print("Original sequence is:", myDNA)

RC_dict = {"a":"t","t":"a","c":"g","g":"c"} # dictionary with the complement of each base

myDNA_RC ="" # new string to store the complemented sequence

for i in myDNA:
    myDNA_RC += RC_dict[i] # for each base in myDNA, look up its complement in the dictionary and add it to the new string

print("Complement is:", myDNA_RC) # the complemented sequence
print("Reverse complement is:", myDNA_RC[::-1]) # reversing the complemented string gives the reverse complement

Work on the following problems:

### Rosalind Problem - Complementing a Strand of DNA

Go to the [Rosalind](http://rosalind.info/problems/revc/) website.

> In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

> The reverse complement of a DNA string 's' is the string 'sc' formed by reversing the symbols of 's', then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

> Given: A DNA string 's' of length at most 1000 bp.

> Return: The reverse complement 'sc' of 's'.

Try to find other ways to reverse complement a sequence (without a dictionary).

In [None]:
## Correction

with open('rosalind_revc_1_dataset.txt', 'r') as f:
    S = f.read().strip() # strip() removes the trailing newline

print("Original sequence", S)

#method1 using a for loop

revS = S[::-1] # reverse the string

print("Reversed sequence", revS)

comp="" #new string to store the complement

for i in revS:
    if i == 'A':              
        comp+='T'             
    elif i == 'C':            
        comp+='G'             
    elif i == 'T':            
        comp+='A'             
    elif i == 'G':            
        comp+='C' 

        
print("\nReverse complement", comp)



In [None]:
#method2 using a built-in function
with open('rosalind_revc_1_dataset.txt', 'r') as f:
    S = f.read().strip() # strip() removes the trailing newline

print("Original sequence", S)

revS = S[::-1]
print("Reversed sequence", revS)

revS = revS.replace('A', 't') # why are we using lower case here?
print("All A are replaced by t", revS)
revS = revS.replace('T', 'a')
print("All T are replaced by a", revS)
revS = revS.replace('C', 'g')
print("All C are replaced by g", revS)
revS = revS.replace('G', 'c')
print("All G are replaced by c", revS)
revS = revS.upper() # convert all characters back to upper case

print("Reverse complement #1 ", revS)

#method2 simplified
with open('rosalind_revc_1_dataset.txt', 'r') as f:
    S = f.read().strip() # strip() removes the trailing newline

S = S.replace('A', 't').replace('T', 'a').replace('C', 'g').replace('G', 'c').upper()[::-1]
print("Reverse complement #2", S)
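
Another possible approach, sketched here as one option among several, is a translation table built with `str.maketrans()` and applied with `str.translate()`. It swaps all bases in a single pass, so the lower-case trick used with `.replace()` is not needed. The function name `reverse_complement` and the table name are chosen for this example only.

```python
# Translation table mapping each base to its complement (both cases)
complement_table = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    # translate() complements every base, [::-1] reverses the result
    return seq.translate(complement_table)[::-1]

print(reverse_complement("GTCA"))        # the example from the problem statement
print(reverse_complement("agttgcccag"))  # the sequence used earlier in this section
```

Characters that are not in the table (for example 'N') are left unchanged by `translate()`.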

### Rosalind Problem - Counting DNA nucleotides

Go to the [Rosalind](http://rosalind.info/problems/dna/) website.

> A string is simply an ordered collection of symbols selected from some alphabet and formed into a word

> The length of a string is the number of symbols that it contains.

> An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

> Given: A DNA string 's' of length at most 1000 nt.

> Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in 's'.


In [None]:
## Correction

with open('rosalind_dna_1_dataset.txt', 'r') as f:
    seq = f.read().strip()
print("Original sequence:", seq)

countA = 0
countC = 0
countG = 0
countT = 0

for n in seq:
    if n == 'A':
        countA += 1
    if n == 'C':
        countC += 1
    if n == 'G':
        countG += 1
    if n == 'T':
        countT += 1

print("Count A is", countA, "| Count C is", countC, "| Count G is", countG, "| Count T is", countT)
print(countA, countC, countG, countT) # four integers separated by spaces, as the problem asks
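
The same counts can be obtained without writing the loop by hand. The sketch below shows two alternatives, using the example string from the problem statement: the string method `.count()`, and `Counter` from the standard `collections` module.

```python
from collections import Counter

seq = "ATGCTTCAGAAAGGTCTTACG"  # the 21-symbol example from the problem statement

# Option 1: str.count() scans the string once per symbol
counts = [seq.count(base) for base in "ACGT"]
print(*counts)

# Option 2: Counter tallies every symbol in a single pass
tally = Counter(seq)
print(tally["A"], tally["C"], tally["G"], tally["T"])
```

A `Counter` returns 0 for symbols that never occur, so looking up a missing base does not raise an error.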

### Rosalind Problem - Transcribing DNA into RNA

Go to the [Rosalind](http://rosalind.info/problems/rna/) website.

> An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

> Given a DNA string 't' corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u

> Given: A DNA string 't' having length at most 1000 nt.

> Return: The transcribed RNA string of 't'

In [None]:
## Correction

#method1
with open('rosalind_rna_1_dataset.txt', 'r') as f:
    dna = f.read().strip()
print("DNA sequence", dna)

rna = dna.replace("T", "U")
print("RNA sequence", rna)


### Rosalind Problem - Counting Point Mutations

Go to the [Rosalind](http://rosalind.info/problems/hamm/) website.

> Given two strings 's' and 't' of equal length, the Hamming distance between 's' and 't', denoted dH(s,t), is the number of corresponding symbols that differ in 's' and 't'

> Given: Two DNA strings 's' and 't' of equal length (not exceeding 1 kbp).

> Return: The Hamming distance dH(s,t).



In [None]:
## Correction

#method1 with a while loop
a = "TTAGTTTCAATCGTCCCTAATGCCTTTCCTGGAAATCCAGTCTACGCTGGGCTATTAACGCCGGAACTTTTAGAGTGCCTCGACCTGTTCCTAAA"
b = "TTATACGAAATCATCAGTTGTTACAGTCCAGGTAAACCAATGGGACCGTCGGTATCCCCGAGGAAGCTCATACGAAGGGGCGCAATGTGAACCAT"
n = 0
hamming = 0

while n <= len(a)-1: # loop while n is a valid index of a (0 to len(a)-1)
    if a[n] != b[n]: # compare each base of a with the base at the same position in b
        hamming += 1 # if they differ, add 1 to the hamming counter
    n += 1 # move to the next position; the loop stops once n reaches len(a)
    
print("The number of mutations between sequence A and B is", hamming)

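#method2 with a function
# (named hamming_distance so it does not overwrite the 'hamming' counter above)
def hamming_distance(a, b):
    # zip() pairs up the bases; filter() keeps only the pairs that differ
    return len(list(filter(lambda pair: pair[0]!=pair[1], zip(a,b))))

print(hamming_distance(a,b))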
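
A third variant, offered as a sketch rather than the reference answer, sums the comparisons directly. Because `zip()` silently stops at the end of the shorter string, the example checks the lengths first. The function name `count_point_mutations` and the two short test strings are made up for illustration.

```python
def count_point_mutations(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    # each comparison is True or False; True counts as 1 when summed
    return sum(x != y for x, y in zip(s, t))

print(count_point_mutations("GATTACA", "GACTATA"))
```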

### Rosalind Problem - Finding a Motif in DNA

Go to the [Rosalind](http://rosalind.info/problems/subs/) website.

> Given two strings 's' and 't', 't' is a substring of 's' if 't' is contained as a contiguous collection of symbols in 's' (as a result, t must be no longer than s).

> The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). 

> The symbol at position i of 's' is denoted by s[i]

> A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s; 

> for example, if s = "AUGCUUCAGAAAGGUCUUACG", then s[2:5] = "UGCU".

> The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in 's' if it occurs more than once as a substring of 's' 

> Given: Two DNA strings 's' and 't' (each of length at most 1 kbp).

> Return: All locations of 't' as a substring of 's'.

In [None]:
## Correction
s,t = open('rosalind_subs_1_dataset.txt').read().split()
print("Sequence is", s)
print("Motif to find is", t)

loc = []
for i in range(len(s)):
    if s[i:].startswith(t):
        print(i+1) # why? hint: python starts counting at 0
        loc.append(i+1)

print("The motif can be found at positions:", loc)
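
The problem statement counts positions from 1 and includes the end position, whereas Python counts from 0 and excludes the end of a slice. The sketch below illustrates that conversion and shows an alternative search based on `str.find()`, which accepts a start index. The function name `find_motif` and the test strings are assumptions made for this example.

```python
s = "AUGCUUCAGAAAGGUCUUACG"
# Rosalind's s[2:5] (1-based, end included) is Python's s[1:5] (0-based, end excluded)
print(s[1:5])

def find_motif(s, t):
    """Return the 1-based start positions of every occurrence of t in s."""
    positions = []
    start = s.find(t)                  # find() returns -1 when there is no match
    while start != -1:
        positions.append(start + 1)    # convert the 0-based index to a 1-based position
        start = s.find(t, start + 1)   # advance by one so overlapping matches are kept
    return positions

print(find_motif("GATATATGCATATACTT", "ATAT"))
```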
        


### Section 02 - Biopython
---

When you start to code, chances are you are not the first person to try to write code to solve a given problem. Some very clever biologists wrote a package called Biopython that provides much of the functionality you are likely to need when working with sequences in a Python program.

[See the cookbook for biopython](http://biopython.org/DIST/docs/tutorial/Tutorial.html)



In [None]:
# if biopython not installed
!pip install biopython

In [None]:
# if no pip (but you should have it automatically installed with python3)
# !python -m ensurepip --upgrade

In [None]:
# Check the installation and path
from Bio import SeqIO
print(SeqIO.__file__)



In [None]:
# For example
from Bio.Seq import Seq

# we can convert a string into a sequence object using Seq()
myDNA = Seq("agttgcccag")

# Use the function 'reverse_complement()'
print(myDNA.reverse_complement())

### Rosalind Problem - Transcribing DNA into RNA

Repeat this [Rosalind](http://rosalind.info/problems/rna/) exercise using a Biopython function.

> An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

> Given a DNA string 't' corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u

> Given: A DNA string 't' having length at most 1000 nt.

> Return: The transcribed RNA string of 't'

In [None]:
with open('rosalind_rna_1_dataset.txt', 'r') as f:
    dna = f.read().strip() # strip() removes the trailing newline
dna = Seq(dna)

#dna = Seq("GTGCTTAGACACTACCCTGCTGCTACGTTTGTTACGACGTGTCTATACCTGTTATCTAGACTTTACAAACCGAAGTAGGCCAACATTTGCGAAATTTAAAGTGACGAGTGCCAG")
print(dna)
print(dna.transcribe())

To convert DNA to RNA we can use the `transcribe()` method, and to convert DNA or RNA to amino acids we can use the `translate()` method:

In [None]:
myDNA = Seq("agtcaatcccaacgggagccccag")


myRNA = myDNA.transcribe()
myProtein = myDNA.translate()

print(myRNA)
print(myProtein)

### Reading fasta files - Python Core

One of the most common ways to store DNA and protein information is in FASTA files. They consist of two main elements:

- the title of the sequence (also called header) preceded by a '>'
- the sequence


Notice how the sequence can be wrapped over multiple lines? This can be frustrating when trying to read FASTA files using core Python functions.

In [None]:
# Let's try and make a dictionary storing 
# the sequence title as a key and the sequence as the value
seqDict={}
myLines = open("MyFastaLines.txt").readlines()

print(myLines) # can you see the \n characters? These are 'end-of-line' characters
# we can remove them using the '.strip()' method

for line in myLines:
    print(line)  
    l = line.strip()
    if  l.startswith(">"):
        #Easy, if the line starts with a > then it must be the header
        titleline= l
        seqDict[titleline]="" # the header line is set as a key in the dictionary
    else:
        # Any other line must be part of the sequence
        seqDict[titleline]+= l # the sequence lines are appended to the value in the dictionary
        
# does this look right?
print(seqDict)

# How could you fix this code?
# seqDict[titleline]+= l
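
The same idea can be wrapped in a reusable function. The sketch below is one possible version, not the only correct one: it drops the leading '>' from the header, skips blank lines, and joins wrapped sequence lines. The name `read_fasta` and the two records are invented for this example, and `io.StringIO` stands in for a real file so the example runs without `MyFastaLines.txt`.

```python
import io

def read_fasta(handle):
    """Parse FASTA text from an open handle into a {header: sequence} dictionary."""
    sequences = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue                     # skip blank lines
        if line.startswith(">"):
            header = line[1:]            # drop the leading '>'
            sequences[header] = ""
        elif header is not None:
            sequences[header] += line    # join wrapped sequence lines
    return sequences

# Made-up records; the second sequence is wrapped over two lines
fasta_text = ">seq1\nAGTCGTA\n>seq2\nGTCA\nCTGC\n"
records = read_fasta(io.StringIO(fasta_text))
print(records)
```

With a real file, the call would look like `read_fasta(open("MyFastaLines.txt"))`, ideally inside a `with` block.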

### Writing fasta files - Python Core

You can create a fasta file using python core from variables of your choosing. 

In [None]:
seqDict = {"seq1":"AGTCGTA", "seq2":"GTCACTGC"}

#one way
with open("myoutputSeq.fasta","w") as output:
    for seq in seqDict:
        output.write(">" + seq + "\n" + seqDict[seq] + "\n")
        
#another way using % operator
with open("myoutputSeq.fasta","w") as output:
    for seq in seqDict:
        # %s operator is read as a placeholder for the strings in the parentheses 
        # after the string
        output.write(">%s\n%s\n" % (seq, seqDict[seq]))
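
Long sequences are often wrapped at a fixed width when written to FASTA. The following sketch shows one way to do that with slicing; the helper name `format_fasta` and the default width of 60 characters are assumptions for this example, not a requirement of the format.

```python
def format_fasta(name, sequence, width=60):
    """Return one FASTA record with the sequence wrapped at `width` characters."""
    # cut the sequence into consecutive chunks of at most `width` characters
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">%s\n%s\n" % (name, "\n".join(lines))

record_text = format_fasta("seq1", "ACGT" * 5, width=8)
print(record_text)
```

The returned string can be passed straight to `output.write()` inside the loops shown above.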


### Biopython and Fasta files 

Of course, there is a much easier way to read and write FASTA files: Biopython's `SeqIO` module.

In [None]:
from Bio import SeqIO

# SeqIO.read takes two arguments, the file and the format of the file
# you can use 'read' when your file contains exactly one sequence
record = SeqIO.read("myFavouriteProtein_ncbi.fasta", 'fasta')

# Here is some of the information available in the 'record' object
header = record.description
seq_id = record.id # named seq_id to avoid shadowing Python's built-in id()
sequence = record.seq

print(header)
print(seq_id)
print(sequence)

# You can write this record to another file
SeqIO.write(record, "myFavouriteProtein_ncbi_copy.fasta", "fasta")

In [None]:
# You can read a file with more than one sequence using  SeqIO.parse()
outputList = []
for record in SeqIO.parse("NFU1_proteins.fasta", 'fasta'):
    outputList.append(record)
    # to write more than one sequence to a file using SeqIO,
    # one option is to collect the record objects in a list and write them all at once

SeqIO.write(outputList, "NFU1_proteins_copy.fasta", "fasta")
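
A few more `SeqIO` features can be tried without any files on disk. This sketch assumes Biopython is installed and uses `StringIO` with two made-up protein records in place of `NFU1_proteins.fasta`. It shows that `SeqIO.parse()` accepts an open handle, that `SeqIO.write()` returns the number of records written, and that `SeqIO.to_dict()` indexes records by their id.

```python
from io import StringIO
from Bio import SeqIO

fasta_text = ">seq1\nMKVLA\n>seq2\nMKT\n"  # made-up records for illustration

# SeqIO.parse() accepts an open handle as well as a file name
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

# SeqIO.write() returns how many records it wrote
output = StringIO()
n_written = SeqIO.write(records, output, "fasta")
print(n_written)

# SeqIO.to_dict() builds a dictionary of records keyed by record.id
by_id = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))
print(len(by_id["seq1"].seq))
```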

In [None]:
# HAND IN #5

# Write a fasta parser for the fastafile 'NFU1_proteins.fasta'
# Use this parser to indicate how many sequences are in the file

# make a file with the longest sequence called 'NFU1_longestSeq.fasta'

# save the lengths of all the proteins as a list 'all_seq_lengths'




In [None]:
from Bio import SeqIO

maxSeqs=[] # the records of the longest sequence(s)
maxSeq_length = 0 # the length of the longest sequence seen so far
all_seq_lengths = [] # a list to store the lengths of all the sequences

for record in SeqIO.parse("NFU1_proteins.fasta", 'fasta'):
    all_seq_lengths.append(len(record.seq))
    if len(record.seq) > maxSeq_length:
        # wipe the list! we found a new winner!
        maxSeqs=[]
        maxSeq_length = len(record.seq)
        maxSeqs.append(record)
    elif len(record.seq) == maxSeq_length:
        # we found something the same length as the best, add it to the list!
        maxSeqs.append(record)
        
print("There are %i sequences in the file" % (len(all_seq_lengths)))

print("The longest sequence has a length of", maxSeq_length)
print("Here are all the sequence lengths", all_seq_lengths)

SeqIO.write(maxSeqs, "NFU1_longestSeq.fasta", "fasta")