# Big Data for Biologists: Decoding Genomic Function- Class 4

## How can we compare two or more DNA sequences? 

##  Learning Objectives
***Students should be able to***
 <ol>
   <li><a href=#SeqAlignIntro>Identify ways that DNA sequence alignments can provide insights into human biology</a></li>
 <li><a href=#Import>Import a module into a Python script</a></li>
 <li><a href=#ModuleHelp>Run the help command to get information about Python modules</a></li>
 <li><a href=#WriteModuleHelp>Write help information for a Python module</a></li>
 <li><a href=#Package>Explain what a Python package is and how to import modules from a package </a></li>
 <li><a href=#SeqIO>Read in a sequence using SeqIO and be able to identify the type and attributes of an object in Python</a></li>
 <li><a href=#Align2>Align two sequences using modules from the BioPython package </a></li>
 <li><a href=#DataStructures> Identify the difference between tuple, list, and dictionary data structures in Python</a></li>
 <li><a href=#Align2>Interpret the output of a pairwise2 sequence alignment from the BioPython package </a></li>
 </ol>


# How can DNA sequence alignments provide insights into human biology?<a name='SeqAlignIntro' />


<i>
    
    * "What model organism can I use to study a gene that has been associated with a human disease?"
    
    * "I made a discovery about how a gene works in fruit flies, could my finding also be relevant in humans?"  
    
    * "How can I analyze my DNA sequencing results to determine if I am at risk of a disease?"  
    
    * "How different are humans from Neanderthals or other ancient humans?"
</i>


**ALL of these questions utilize the tools of sequence alignment**   


In today's class we will look at an important procedure: DNA sequence alignment. 

In this class and the next, we will look at examples of two types of sequence alignment:

* Comparing two sequences: **pairwise sequence alignment**  
* Comparing three or more sequences: **multiple sequence alignment**

In our examples today we will use DNA sequences, but there are also algorithms that can be applied to aligning protein sequences. 
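
To make the idea of a pairwise alignment concrete, here is a small sketch in plain Python. The two aligned strings are made-up toy sequences (not real genes), with "-" marking a gap, and the function name count_identical_positions is our own illustration rather than part of any package.

```python
def count_identical_positions(aligned_a, aligned_b):
    """Count the columns where two aligned sequences carry the same base."""
    identical = 0
    for base_a, base_b in zip(aligned_a, aligned_b):
        # a gap ("-") never counts as a match
        if base_a == base_b and base_a != '-':
            identical += 1
    return identical

# toy sequences that are already aligned (same length, "-" marks a gap)
aligned_1 = 'ATGCCGTA'
aligned_2 = 'ATG-CGTT'

matches = count_identical_positions(aligned_1, aligned_2)
print(matches, 'of', len(aligned_1), 'aligned positions are identical')
```

An alignment algorithm does the harder part for us: deciding where to place the gaps so that as many positions as possible line up.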

We will be showing you how to perform alignments in Python to continue building your skills in Python, but there are a number of web-based tools for performing both pairwise and multiple sequence alignments such as [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) for pairwise alignments and [CLUSTAL Omega](http://www.ebi.ac.uk/Tools/msa/clustalo/) for multiple sequence alignments. 


# Import a module into a Python script<a name='Import' />

Writing the algorithms for sequence alignments is beyond the scope of this class. However, we can perform sequence alignments with the help of algorithms that have been developed and shared by others. 

In order to use code that has been shared by others, we first need to learn how to import the code into Python. 

As a starting example, we can look first at how to import the module that we created in the last class, central_dogma_helpers.py, into a Python script.

Remember that we defined three functions in central_dogma_helpers.py: 

    read_nt_from_fastasequence
    write_RNA_from_DNA
    write_protein_1_letter_aa_from_RNA

Once the module is imported, we will be able to call these three functions by name in our code. We will not have to write out the entire function.  

In our example, we will also use the sys module that comes with the Python distribution. 

We've seen a few examples of the import command already in earlier classes, and now you should have a better understanding of what that command means. 
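
As a quick sketch of the different ways an import can be written, the example below uses the math module from the standard Python distribution as a stand-in for our own helper module. The same patterns should apply to central_dogma_helpers once its folder is on the path.

```python
# Style 1: import the whole module, then call module_name.function_name
import math
print(math.sqrt(16))

# Style 2: import one function by name, then call it directly
from math import sqrt
print(sqrt(16))

# Style 3: import the module under a shorter nickname (an alias)
import math as m
print(m.sqrt(16))

# "from math import *" would import every public name at once;
# it is convenient, but it hides where each function came from
```

Style 1 keeps it obvious which module a function belongs to, which can help when two modules define functions with the same name.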

As a reminder, the box below has the helper functions that we wrote in the last class.

In [None]:
%%writefile ../helpers/central_dogma_helpers.py

#read in a nucleotide (DNA or RNA) sequence from a FASTA sequence

def read_nt_from_fastasequence(FASTAsequence):
    #open the file, skip the header line, and close the file automatically
    with open(FASTAsequence,'r') as fasta_file:
        nt_sequence=fasta_file.readlines()[1:]
    nt_sequence=''.join(nt_sequence)
    nt_sequence=nt_sequence.replace('\n','')
    return(nt_sequence)

#write an RNA sequence from a DNA sequence
def write_RNA_from_DNA(DNAsequence):
    RNAsequence=DNAsequence.replace('T','U')
    return(RNAsequence)

#Write out the protein 1-letter amino acid from an mRNA sequence
def write_protein_1_letter_aa_from_RNA(RNAsequence):

#defines the python dictionary for the one letter genetic code 
    geneticcode1let={'UUU':'F','UUC':'F','UUA':'L','UUG':'L',
     'CUU':'L','CUC':'L','CUA':'L','CUG':'L',
     'AUU':'I','AUC':'I','AUA':'I','AUG':'M',
     'GUU':'V','GUC':'V','GUA':'V','GUG':'V',
     'UCU':'S','UCC':'S','UCA':'S','UCG':'S',
     'CCU':'P','CCC':'P','CCA':'P','CCG':'P',
     'ACU':'T','ACC':'T','ACA':'T','ACG':'T',
     'GCU':'A','GCC':'A','GCA':'A','GCG':'A',
     'UAU':'Y','UAC':'Y','UAA':'*','UAG':'*',
     'CAU':'H','CAC':'H','CAA':'Q','CAG':'Q',
     'AAU':'N','AAC':'N','AAA':'K','AAG':'K',
     'GAU':'D','GAC':'D','GAA':'E','GAG':'E',
     'UGU':'C','UGC':'C','UGA':'*','UGG':'W',
     'CGU':'R','CGC':'R','CGA':'R','CGG':'R',
     'AGU':'S','AGC':'S','AGA':'R','AGG':'R',
     'GGU':'G','GGC':'G','GGA':'G','GGG':'G'}

#defines the string variable proteinseq
    proteinseq=''

#range command (start,stop(not included),step)

    for i in range(0,len(RNAsequence),3): 
        proteinseq=proteinseq+str(geneticcode1let[RNAsequence[i:i+3]])
    return (proteinseq)

In [None]:
#Write a line of code to list the files in the helpers directory
###BEGIN SOLUTION
###END SOLUTION

In [None]:
#Tells python where to look for .py files
#adds ../helpers to the list of directories where to look for .py files. 
#The list of directories to look in is called the "path". 
#sys is a pre-installed module that comes with standard Python distributions (https://docs.python.org/3.6/library/sys.html)
 
import sys
sys.path.append('../helpers')

#Imports the module central_dogma_helpers.py
import central_dogma_helpers

#Import the names of all the functions in central_dogma_helpers.py
#The * stands for the names of all the functions. 
#You could also import each function by its individual name. 
#Or you could call each function by using the syntax central_dogma_helpers.function_name

from central_dogma_helpers import *

#Runs the three functions in central_dogma_helpers

insulin_DNA_sequence=read_nt_from_fastasequence('../class_01_Gene_Sequences/data/Human-Insulin NM_000207.2.txt')

insulin_RNA_sequence=write_RNA_from_DNA(insulin_DNA_sequence)
                        
insulin_protein_sequence=write_protein_1_letter_aa_from_RNA(insulin_RNA_sequence[59:389])

#Prints the output 
print('RNAsequence:  '+ insulin_RNA_sequence )
print('\n'+'Protein Sequence:'+ insulin_protein_sequence)

In [None]:
import os
os.getcwd()

In [None]:
#Add the helpers folder using the absolute path
###BEGIN SOLUTION
###END SOLUTION

## How can I get information about a Python module that I imported?<a name='ModuleHelp' />

After you import a module into Python, you can use the help command to get more information about it. 

In [None]:
import sys
help (sys)

## How can I write help  information for a module that I write?<a name='WriteModuleHelp' />

In [None]:
def read_nt_from_fastasequence(FASTAsequence):
    '''this function reads in a sequence in FASTA format and
       returns the nucleotide sequence without the header information'''
    #open the file, skip the header line, and close the file automatically
    with open(FASTAsequence,'r') as fasta_file:
        nt_sequence=fasta_file.readlines()[1:]
    nt_sequence=''.join(nt_sequence)
    nt_sequence=nt_sequence.replace('\n','')
    return(nt_sequence)

In [None]:
help (read_nt_from_fastasequence)
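
The text in triple quotes directly under the def line is called a docstring, and it is what the help command displays. The sketch below uses a small hypothetical function, count_bases, to show that Python stores the docstring on the function itself. A string placed at the very top of a .py file works the same way for the module as a whole.

```python
def count_bases(nt_sequence):
    '''this function returns the number of nucleotides in a sequence'''
    return len(nt_sequence)

# the docstring is stored in the __doc__ attribute of the function
print(count_bases.__doc__)

# the function itself still works as usual
print(count_bases('ATGC'))
```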

## What are Python packages and how can I import packages?<a name='Package' />

There are many publicly available modules that can be imported into Python. Often, modules are made available as part of **packages**, which are sets of module files that users can install to expand Python's functionality. A package is a collection of modules in a directory that usually also contains a file called  \_\_init\_\_.py, which instructs Python to treat the directory as a package. 

See the figure below for the relationship between functions, modules and packages. 

<img src="../Images/4-Package.png" style="width: 30%; height: 30%" align="center"/>
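
To see how the pieces fit together, the sketch below builds a tiny throwaway package in a temporary folder and imports a module from it. The names demo_pkg, dna_tools, and complement_base are made up for this illustration and are not part of Biopython or any other real package.

```python
import importlib
import os
import sys
import tempfile

# create a temporary folder that will hold the package directory
tmp_dir = tempfile.mkdtemp()
pkg_dir = os.path.join(tmp_dir, 'demo_pkg')
os.makedirs(pkg_dir)

# __init__.py marks the directory as a package; it can be empty
with open(os.path.join(pkg_dir, '__init__.py'), 'w') as handle:
    handle.write('')

# a module inside the package that defines one function
with open(os.path.join(pkg_dir, 'dna_tools.py'), 'w') as handle:
    handle.write("def complement_base(base):\n"
                 "    return {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}[base]\n")

# tell Python where to look, then import the module from the package
sys.path.append(tmp_dir)
importlib.invalidate_caches()
from demo_pkg import dna_tools

print(dna_tools.complement_base('A'))
```

The import line reads from the outside in: package first (demo_pkg), then module (dna_tools), then the function inside the module.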

To use a package, the package first needs to be installed. How you install a package will depend on the system that you are using. 

For this class we have pre-installed the packages that you will need. Today we will be using a package called Biopython which you can learn more about [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html). 

You can check to see if a package has been installed by running the import command and seeing if there is an error. 

In the code below, we are checking to be sure that the Bio package from BioPython has been installed. We can also check the version number. 

In [None]:
import Bio 
print(Bio.__version__)
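
If you would rather get a True/False answer than an error message, one option is to wrap the import in a try/except block. This is only a sketch: the function name can_import is our own, json is a module that ships with Python, and the second name is deliberately made up so that the import fails.

```python
import importlib

def can_import(package_name):
    """Return True if the package can be imported, False otherwise."""
    try:
        importlib.import_module(package_name)
        return True
    except ImportError:
        # raised when Python cannot find the package
        return False

print(can_import('json'))
print(can_import('not_a_real_package_12345'))
```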

As a summary, we've looked today at how to import three types of modules into Python:

* modules that you wrote and saved as .py files (e.g., central_dogma_helpers.py from the last class) 
* modules that came with the Python distribution 
* modules that come from packages that you install 

One final note is that you may hear packages being referred to as **libraries**.  

Now that we've set our system up to use the BioPython package we are going to look at ways that we can use it. 

## Read in a sequence using SeqIO and identify the type and attributes of an object in Python <a name='SeqIO'/>

In the last class we looked at the sequence for human insulin. 

Mice have two copies of the insulin gene. How similar are these sequences to the human gene, and which one is more similar? 

This is an example question that we'll look at as we learn about pairwise sequence alignments.    

We saved the FASTA sequences for the two mouse genes as Mouse Insulin GeneID 16333.txt and Mouse Insulin Gene ID 16334.txt in the data directory for this class.  

In this example we are going to use two modules:

* SeqIO, a convenient tool for reading FASTA sequences into Python
* pairwise2, a module for aligning two sequences

If you remember back to the first class, we wrote some code to read a FASTA sequence into Python, but we had to separate the header (the first line starting with >) from the actual sequence. 

SeqIO is a Biopython module that conveniently reads in a file and separates (or **parses**) a FASTA sequence into its ID, name, description, features, and the sequence itself.
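
To illustrate what parsing means, here is a plain-Python sketch that splits one FASTA record into pieces. It is not how SeqIO is implemented, and it handles only a single record; the function name parse_fasta_text and the toy record are made up for this example.

```python
def parse_fasta_text(fasta_text):
    """Split one FASTA record into its identifier, description, and sequence."""
    lines = fasta_text.strip().splitlines()
    header = lines[0][1:]            # drop the leading ">"
    record_id = header.split()[0]    # the text up to the first space
    # join the remaining lines into one sequence string
    sequence = ''.join(line.strip() for line in lines[1:])
    return {'id': record_id, 'description': header, 'sequence': sequence}

# a made-up FASTA record: one header line followed by two sequence lines
toy_fasta = ">toy_seq_1 made-up example record\nATGCCG\nTTAGGC\n"

record = parse_fasta_text(toy_fasta)
print(record['id'])
print(record['description'])
print(record['sequence'])
```

A tool like SeqIO saves us from writing and testing this kind of code for every file format we come across.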

Before we run SeqIO, here is a link to the documentation. 

In [None]:
from IPython.display import IFrame

display(IFrame("https://biopython.org/DIST/docs/api/Bio.SeqIO-pysrc.html",height=1000,width=1000))

In [None]:
#note: "import Bio" was run above, but "from Bio import SeqIO" would also work on its own

#imports the sequence reading module SeqIO from the Bio package
from Bio import SeqIO

#Reads the FASTA sequences 
human_seq=SeqIO.read('../class_01_Gene_Sequences/data/Human-Insulin-NG_007114.1.txt',"fasta")

print(human_seq)

Up until now, we have been working mostly with <b> string variables </b> which are simple sequences of characters. As you can see, the output of the print(human_seq) command is more complex. 

A simple variable in Python is assigned a single value (e.g., sequence='ATGC'), but you often want to create more complex data structures that have multiple properties and associated operations you can perform on them. 

Python therefore supports classes and objects, which are more complex data structures made up of
1. <b>attributes</b>: these define properties of the class
2. <b>functions</b>: these functions, which are specifically associated with a class (and are also called methods), define commands that can be applied to objects of that class.

Classes are "blueprints" or templates for objects. Objects are specific instantiations of a class.

An analogy may be helpful:

We can define a class called Fruit with attributes color, shape, and taste. We can also define some functions like squeeze() and print() that perform operations specifically on objects of the class Fruit. 

We can then instantiate orange as an object, which is a specific instance of the class Fruit with attributes color='orange', shape='round', taste='sweet'.
If we run squeeze(orange), it may, for example, change the shape attribute from shape='round' to shape='oblong' and produce an output object of another class called Juice.
If we run print(orange), it may print a picture of an orange.

If a function is not defined for a class, you will get an error if you try to apply it to an object of that class.
For example, if you try to run plant(orange), it will throw an error, since our current definition of the class has no such function. We can of course add such a function to the definition of the class Fruit, in which case we will be able to apply it to any object of the class Fruit.
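
The fruit analogy can be sketched in Python as follows. The classes Fruit and Juice are hypothetical and exist only for this illustration. Note that a function defined inside a class is usually called with the "." syntax, so the squeeze(orange) of the analogy is written orange.squeeze() in the code.

```python
class Juice:
    """A simple class for what we get when a fruit is squeezed."""
    def __init__(self, flavor):
        self.flavor = flavor

class Fruit:
    """A blueprint for fruit objects with three attributes."""
    def __init__(self, color, shape, taste):
        self.color = color
        self.shape = shape
        self.taste = taste

    def squeeze(self):
        # squeezing changes the shape attribute and returns a Juice object
        self.shape = 'oblong'
        return Juice(flavor=self.taste)

# orange is an object: one specific instance of the class Fruit
orange = Fruit(color='orange', shape='round', taste='sweet')
print(type(orange))
print(orange.shape)

juice = orange.squeeze()
print(orange.shape)
print(type(juice))

# plant() was never defined for Fruit, so calling it raises an error
try:
    orange.plant()
except AttributeError as error:
    print('Error:', error)
```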

The Python object that gets created with the SeqIO.read command is an object in the class Bio.SeqRecord.SeqRecord. This can be determined by using the type command. 

In [None]:
#print the type of the human_seq variable  

print(type(human_seq))

The sequence variables that we have been looking at are in the class str (string). In the code box below, check the type of the insulin_RNA_sequence variable that was defined above. 

In [None]:
#print the type of the insulin_RNA_sequence variable
###BEGIN SOLUTION
###END SOLUTION

The Bio.SeqRecord.SeqRecord class has a set of attributes associated with it. These attributes include the ID, the name, and the description, in addition to the sequence itself. The list of attributes is defined within the SeqRecord class in the Biopython code. 

You can get more information about the attributes associated with a particular variable class using the help command as shown below.

In [None]:
#The help command can be used to obtain more information about a variable class 
help(Bio.SeqRecord.SeqRecord)

If you run the help command on str, you can see that this class has many functions associated with it, but no attributes such as an ID or a description. 

In [None]:
help(str)
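
As a small sketch of this difference, the code below uses the built-in hasattr function to ask whether a string has a given name attached to it. The variable toy_sequence is a made-up example.

```python
toy_sequence = 'ATGC'

# functions that belong to the str class are called with the "." syntax
print(toy_sequence.lower())
print(toy_sequence.count('A'))

# a plain string has no "id" attribute, but it does have an "upper" function
print(hasattr(toy_sequence, 'id'))
print(hasattr(toy_sequence, 'upper'))
```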

The attributes of an object can each be called separately using a "." extension. For example, the identifier of the human_seq variable can be referred to as human_seq.id.

In [None]:
#prints the identifier of the human_seq record that was read in with SeqIO above
print(human_seq.id)

In the box below, revise the code above to print out the gene sequence instead of the identifier. 

In [None]:
#prints the nucleotide sequence of the human_seq record
###BEGIN SOLUTION
###END SOLUTION 

## Align two sequences using Pairwise2<a name='Align2' />
Now we are ready to use the SeqIO module together with the pairwise2 module to run the sequence alignment. For more information on the pairwise2 module see the BioPython documentation [here](http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html).

In [None]:
# Write code to do a pairwise sequence alignment between the Human-Insulin Gene NG_007114.1
# and the Mouse Insulin GeneID 16333

# note we used the import Bio and import SeqIO command above otherwise we would need to have them here.

#imports the pairwise sequence alignment module pairwise2 from the Bio package. 
from Bio import pairwise2 

#import the sequence_alignment_helpers.py file in the helpers directory
import sequence_alignment_helpers
from sequence_alignment_helpers import *


#Reads the FASTA sequences 
human_seq=SeqIO.read('../class_01_Gene_Sequences/data/Human-Insulin-NG_007114.1.txt',"fasta") 
mouse_seq_16333=SeqIO.read('data/Mouse Insulin GeneID 16333.txt',"fasta")

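#Conducts a global pairwise alignment between the two sequences  
#the xx gives instructions about how to calculate the alignment score:
#the first x means a match scores 1 and a mismatch scores 0; the second x means gaps are not penalized
#http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html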

alignments = pairwise2.align.globalxx(human_seq.seq, mouse_seq_16333.seq)

print(alignments)
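
To give a feel for where the alignment score comes from, here is a minimal sketch in plain Python that computes the best score under the same rules as the xx setting (1 point per match, 0 for a mismatch, no gap penalty). It is not the pairwise2 code, it returns only the score and not the aligned sequences, and the function name and toy sequences are our own.

```python
def global_xx_score(seq_a, seq_b):
    """Best global alignment score with match=1, mismatch=0, and free gaps."""
    rows = len(seq_a) + 1
    cols = len(seq_b) + 1
    # score[i][j] holds the best score for the first i bases of seq_a
    # aligned against the first j bases of seq_b
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            match = 1 if seq_a[i - 1] == seq_b[j - 1] else 0
            score[i][j] = max(score[i - 1][j - 1] + match,  # align the two bases
                              score[i - 1][j],              # gap in seq_b
                              score[i][j - 1])              # gap in seq_a
    return score[-1][-1]

# toy sequences: the second is the first with one T removed
print(global_xx_score('GATTACA', 'GATACA'))
```

Because gaps are free under these rules, the score simply counts how many bases can be matched up while keeping both sequences in order.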

In [None]:
!cat ../helpers/sequence_alignment_helpers.py

In [None]:
#Write the code to determine the variable type for alignments 
###BEGIN SOLUTION
###END SOLUTION

## Identify the difference between list, tuple, and dictionary data structures in Python  <a name='DataStructures' />

The alignments output of the pairwise2 algorithm is a list whose elements are a data type called a tuple. 

* A **list** is denoted by square brackets. For example: 

       alignments=[alignment0, alignment1, alignment2] 

       Individual elements of a list can be referred to using an index. 

       alignments[0]= alignment0  
       alignments[1]= alignment1 
       alignments[2]= alignment2  
       
       Values of lists can be changed, for example:  
       
       alignments[2]= new_alignment2 
       
       alignments=[alignment0, alignment1,new_alignment2]
       
       Values in lists can be different data types. In the alignments example, the data types are tuples. 


* A **tuple** is denoted by parentheses. A tuple behaves similarly to a list, but it is "immutable". That means that once you define a tuple in your script or program, you cannot change, add or remove elements. Why would you ever want this constraint? There are two main reasons: using tuples can make some operations faster due to how they are stored internally in the computer's memory. Additionally, tuples can be used as dictionary keys (more on dictionary keys below), while lists cannot.

       alignments[0]=('sequence1','sequence2',alignment_score,start,stop) 
       
       Individual elements of a tuple can be referred to using an index. 
       
       alignments[0][0]='sequence1' 
       alignments[0][1]='sequence2' 
       alignments[0][2]= alignment_score
       alignments[0][3]=start
       alignments[0][4]=stop
       
       alignments[0][1]='new_sequence2' will give error message: 'tuple' object does not support item assignment
       
       

* As a review, a third data type that we have already seen is a **dictionary**. Dictionaries are denoted by curly braces and map each key to a value.  
       
       In the last class, for example, we saw: 
       
       DNAdict={'A':'T','T':'A','G':'C','C':'G'}
       
       Elements in dictionaries are referred to as keys and values. 

       a={key1:value1,key2:value2,key3:value3}

       a[key1]=value1

       a[key2]=value2

       a[key3]=value3
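
The three data structures described above can be compared side by side in a short sketch. All of the values here are made up for illustration and are not real alignment results.

```python
# a list can be changed after it is created
toy_scores = [12.0, 9.5, 14.0]
toy_scores[1] = 10.0
print(toy_scores)

# a tuple cannot be changed: item assignment raises a TypeError
toy_alignment = ('ATG-C', 'ATGTC', 4.0, 0, 5)
try:
    toy_alignment[2] = 5.0
except TypeError as error:
    print('Error:', error)

# individual elements of a tuple can still be read using an index
print(toy_alignment[2])

# a tuple can be used as a dictionary key, while a list cannot
pair_scores = {('seq_a', 'seq_b'): 4.0}
pair_scores[('seq_a', 'seq_c')] = 3.0
print(pair_scores[('seq_a', 'seq_b')])
```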



## Interpret the output of a pairwise2 sequence alignment from the BioPython package  <a name='Align2' />

Now that we have defined lists, tuples and dictionaries, let's look more carefully at the BioPython pairwise2 sequence alignment output. 

In [None]:
print (alignments)

    What is alignments[0][0] (describe in a word)?
    What is alignments[0][1] (describe in a word)?
    What is alignments[0][2] (a number)?
    What is alignments[0][3] (a number)?
    What is alignments[0][4] (a number)?

The sequence_alignment_helpers functions: 

    insert_newlines 
    format_alignment_linebreak  
    
can be used to print the alignments output from pairwise2 in a format that is easier to read.  

In [None]:
#uses the sequence_alignment_helpers functions to print the alignments in an easier-to-read format
align1_linebreaks=insert_newlines(alignments[0][0])
align2_linebreaks=insert_newlines(alignments[0][1])
 

#format_alignment_linebreak inputs are: align1_linebreaks,align2_linebreaks,score,begin,end,seq1.id,seq2.id
print(format_alignment_linebreak(align1_linebreaks,align2_linebreaks,alignments[0][2],alignments[0][3],
                                 alignments[0][4],str(human_seq.id),str(mouse_seq_16333.id)))
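
Long sequences are easier to read when they are broken into lines of a fixed length. The sketch below shows one way this could be done in plain Python. It is not the code in sequence_alignment_helpers.py; the function name wrap_sequence, the line_length parameter, and the toy sequence are assumptions made for this illustration.

```python
def wrap_sequence(sequence, line_length=60):
    """Break a long string into lines of at most line_length characters."""
    chunks = []
    # range command (start,stop(not included),step)
    for i in range(0, len(sequence), line_length):
        chunks.append(sequence[i:i + line_length])
    return '\n'.join(chunks)

# a made-up aligned sequence, wrapped every 5 characters
toy_aligned_sequence = 'ATG-CGTTAGGCTA'
print(wrap_sequence(toy_aligned_sequence, line_length=5))
```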

In [None]:
print(type(alignments))

Now we can return to the question that we started with by looking at whether it is the mouse insulin GeneID 16333 or 16334 that is more similar to the human insulin gene. 

In the space below, write the code to run a pairwise sequence alignment with the Mouse Insulin GeneID 16334.txt file. 

In [None]:
# Write code to do a pairwise sequence alignment between the Human-Insulin Gene NG_007114.1
# and the Mouse Insulin GeneID 16334

###BEGIN SOLUTION 
###END SOLUTION

Looking at the output, which mouse insulin gene do you think is more similar to human insulin?

We'll talk more in the next class about the scoring of alignments. 