# Lab 2 : Sequences and Strings

The biologist's string quartet is composed of nucleic acids.  The computer-science terminology is a little different from the biology terminology. In computer science parlance, a sequence of symbols is called a string. For instance, this sentence is a string. A language is a (finite or infinite) set of strings. You often hear bioinformaticians referring to an actual sequence of DNA or protein as a "string," as opposed to its representation as sequence data. This is an example of the terminologies of the two disciplines crossing over into one another.

In today's lab we will begin working with DNA and RNA strings (sequences) and you will write programs to join exons, transcribe DNA into RNA and find codons in your mRNA. 

## 2.1 The Programming Process

Fundamental elements in the programming process include:

* Making an overall design for the program, including the general algorithm by which the program computes the output.
* Identifying the required inputs.
* Deciding how the outputs will print; for example, to files or displayed graphically.
* Refining the overall design by specifying more detail.
* Writing the Python program code.

It may seem like writing code is the hard part, but the design phase is often the trickier step, because you need to understand what you can do with Python before you start writing.  As you learn the Python programming language some programs will become trivial, while others may take hours or days to design.  Usually the best way to tackle the design is to break the program into small parts and then connect them together.  There is a whole field of specialization, called software engineering, that addresses these issues.  The style of programming we will be using is called imperative (procedural) programming, which relies on dividing a problem into interacting subroutines.  Another popular programming style is object-oriented programming which, as we will see later, is used in Biopython.

### Writing in Pseudocode

Writing in pseudocode is a lot like scribbling ideas down on a napkin.  If I wanted to find all heme motifs in a genome I might begin by writing

* Get sequence from GenBank.  
* Extract the coding sequences.
* Find heme motifs in coding sequence.
* Save proteins with heme motifs to a file.

Then you begin writing code to accomplish these tasks.  Often as you work on one task you may find that it needs to be broken down into even smaller parts.  This modular approach to programming is also very useful for finding errors in the smaller parts of your program before you stick them together into a complete program.

### Comments

Often the pseudocode is integrated into the program using comments.  Comments are all lines beginning with # and are ignored by the Python interpreter (except the first line of your program as you will see below).  They are extremely useful in understanding what a program does.  Comments can make your program intelligible by others, but more importantly they will help you understand what you did when you revise the program a month later.  You can also use the # sign to "comment out" portions of your code when you are testing and debugging the program.
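
As a minimal sketch of how the heme-motif pseudocode above might first appear in a program, each step can be written as a comment and filled in with code later. The sequence below is a made-up placeholder (an assumption for illustration, not real GenBank data), and only the first step has any code so far.

```python
# Step 1: Get sequence from GenBank.
# (Placeholder: a short made-up sequence stands in for the real download.)
sequence = 'ATGGCCTGCCATTGA'

# Step 2: Extract the coding sequences.
# Step 3: Find heme motifs in coding sequence.
# Step 4: Save proteins with heme motifs to a file.

# A line can be "commented out" while testing and debugging:
# print('Debugging: the sequence is', sequence)
print(sequence)
```

Only the last line produces output; everything after a # is skipped by the interpreter.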

### Changing TextWrangler Preferences

It is very helpful for interpreting error messages to see the line number in your program.   To do this in TextWrangler in the menu bar go to TextWrangler-> Preferences -> Text Status Display.  Click on "Show line numbers".

If the small type drives you nuts or if you want me to be able to read your code you can make the font bigger by going to TextWrangler-> Preferences -> Editor Defaults.  I recommend 12pt font or greater.  


## 2.2 Python standard data types

The data stored in computer memory can be of many types. For example, a person's telephone number is stored as a numeric value and his or her address is stored as alphanumeric characters. Python has five standard data types that are used to define the operations possible on them and the storage method for each of them.

* Numbers - Number data types store numeric values. 
* Strings - A string is any piece of text, whether it is a single character or a complete genome. Because of the many ways that Python provides for working with strings, it is a great language for text processing and bioinformatic applications.
* Lists - A list stores a group of items one after the other, just like a grocery list or a list of gene names. It keeps them in order for you. You can add or remove items, sort your list or search through it to see if it contains a specific item.
* Tuples - Tuples are a lot like lists in that they are used to store sequences of information, but they can't be altered after they have been created.  In some cases tuples are more computationally efficient than lists.
* Dictionaries - Dictionaries store values with keys, so you can look them up with the key. You can use them to build complex nested data-structures or just for storing values. The values can be numbers, strings, lists, tuples and even other dictionaries.

Today we will work with Strings and Numbers.
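
To make the five types concrete, here is a small sketch that stores one value of each kind and asks Python what type it is using the built-in type() function. The variable names and values (including the gene names) are made up for illustration.

```python
gene_count = 3                      # Number (integer)
gc_fraction = 0.46                  # Number (floating point)
dna = 'AGTTGT'                      # String
genes = ['recA', 'gyrB', 'rpoB']    # List (can be changed later)
codon = ('A', 'U', 'G')             # Tuple (cannot be changed later)
amino_acid = {'AUG': 'Met'}         # Dictionary (key 'AUG' looks up 'Met')

print(type(gene_count))
print(type(gc_fraction))
print(type(dna))
print(type(genes))
print(type(codon))
print(type(amino_acid))
print(amino_acid['AUG'])
```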

## 2.3 String quotes: single, double, and triple!

In our first program in Session 1 we printed "Hello World!  I am writing computer programs."  The quotes can actually be three different flavors. The first two, single (a.k.a. the apostrophe) and double, are familiar (although don't confuse the single quote (') with the backtick (`) -- the one that's probably with the tilde (~) on your keyboard).

Single and double quotes can more or less be used interchangeably, the only exception being which type of quote is allowed to appear inside the string. If the string itself is double-quoted, single quotes may appear inside the string, and vice versa:

In [None]:
print ('Hello "world"! I am writing computer programs.')

In [None]:
print ("Hello world! I'm writing computer programs.")

The key things to notice here are that double quotes appear inside the single-quoted string in the first example, and a single quote appears inside the double-quoted string in the second, but the two cannot be combined. In order to use both single and double quotes in the same print statement, employ the extra-spiffy triple quote, which is just three single quotes (three double quotes work as well):

In [None]:
print ('''Hello "world"! I'm writing computer programs.''')

This snippet does almost exactly the same thing as the last snippet.

Note two aspects of the triple quotes:

1. Both single and double quotes can be used inside triple quotes.
2. Triple quoted strings can span multiple lines, and line breaks inside the quoted string are stored and faithfully displayed in the print operation.
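
The second point can be seen in a short sketch. The text is the same greeting as above, simply broken over three lines inside the triple quotes.

```python
# The line breaks typed inside the triple quotes are stored in the string
message = '''Hello "world"!
I'm writing
computer programs.'''

print(message)
```

The output appears on three lines, exactly as typed between the triple quotes.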

## 2.4 A Program to Store a DNA Sequence

Let's write a small program that stores some DNA in a variable and prints it to the screen. The DNA is written in the usual fashion, as a string made of the letters A, C, G, and T, and we'll call the variable DNA. In other words, DNA is the name of the DNA sequence data used in the program. Note that in Python, a variable is really the name for some data you wish to use. The name gives you full access to the data. Example 2.1 shows the entire program.

In [1]:
#!/usr/bin/env python

# Example 2.1
# Name: DNA_printout.py
# Description:  This program stores a DNA sequence in a variable and prints out the DNA sequence

# First store the DNA in a variable called DNA
DNA = 'AGTTGTAATGAGGCTGCCGTGATA'

# Next, print the DNA onto the screen (Terminal or Command Prompt Window)
print (DNA)

AGTTGTAATGAGGCTGCCGTGATA


The second step is to run the program. Type the above code into your text editor and save the file as DNA_printout.py.  Now in the Terminal or Command Prompt type

<pre>
    python DNA_printout.py
</pre> 
If you've successfully run the program, you'll see the output printed on your computer screen in the Terminal or Command Prompt.  If not, one common error when starting out is that the file is not saved in the directory you are running the command from.  Try typing ls (Unix/OS X) or dir (Windows) to check whether your file is in the current directory.

Example 2.1 illustrates many of the ideas all our Python programs will rely on. One of these ideas is control flow, or the order in which the statements in the program are executed by the computer.  Every program starts at the first line and executes the statements one after the other until it reaches the end, unless it is explicitly told to do otherwise. Example 2.1 simply proceeds from top to bottom, with no detours.  In later sessions, you'll learn how programs can control the flow of execution.

In the above code the first line is commonly called the shebang line.  A shebang is the character sequence consisting of the number sign and exclamation mark (that is, "#!") occurring as the first two characters on the first line of a script.  When a script with a shebang is run as a program on a Unix-like system, the program loader parses the rest of the script's first line as an interpreter directive; the specified interpreter program is run instead, which in our example is Python.  It is good practice to start off all of your Python code with this line.

The next set of lines are comments (denoted by the #) indicating who wrote the program when and for what purpose. Comments also explain what each section of the code is for and sometimes give explanations on how the code achieves its goals.

 Make sure you save it to your "AdvGen or Bioinformatics" directory.  Here is an example of how I often begin my programs.


    #!/usr/bin/env python

    #####
    # Name:     DNA_printout.py
    # Author:   Jeff Blanchard
    # Date:     9/1/2013
    #
    # Description:  This program stores a DNA sequence in a variable and prints out the DNA sequence
    #
    # Usage: python DNA_printout.py
    #####

It's tempting to belabor the point about the importance of comments. Suffice it to say that in most university-level, computer-science class assignments, the program without comments typically gets a low or failing grade; also, the programmer on the job who doesn't comment code is liable to have a short and unsuccessful career.

Now let's look at the variable DNA. Strings in Python are a type of variable.  We will learn about the other two types, integers and floating point numbers, later on. The variable name DNA is somewhat arbitrary. You can pick another name for it, and the program behaves the same way. For instance, if you replace the two lines:

<pre>
DNA = 'AGTTGTAATGAGGCTGCCGTGATA'
print (DNA)
</pre>
with these
<pre>
A_poem_by_Emily_Dickinson = 'AGTTGTAATGAGGCTGCCGTGATA'
print (A_poem_by_Emily_Dickinson)
</pre>

the program behaves in exactly the same way, printing out the DNA to the computer screen. The computer attaches no meaning to the use of the string name DNA instead of A_poem_by_Emily_Dickinson, but whoever reads the program certainly will. One name makes perfect sense, clearly indicates what the string is for in the program, and eases the chore of understanding the program. The other name makes it unclear what the program is doing or what the variable is for. Using well-chosen names is part of what's called self-documenting code. You'll still need comments, but perhaps not as many, if you pick your names well.  Another important point along the same lines is using blank lines and comments to make your code more easily read by humans. 

Here are a few basic rules for variable names:

1. Python variable names are case-sensitive, so Var and var are different variables.
2. Though variable names can contain letters, numbers and underscores ( _ ), they MUST start with a letter (a-z or A-Z) or an underscore, never with a number.
3. Variable names CANNOT contain spaces or special non-alphanumeric characters (e.g. holyS#+%? is naughty, but holyMackerel is kid tested, mother approved), nor can they be any of the following words that already have special meaning in Python 3:

>         False    None     True     and      as       assert   async    await
>         break    class    continue def      del      elif     else     except
>         finally  for      from     global   if       import   in       is
>         lambda   nonlocal not      or       pass     raise    return   try
>         while    with     yield

For the most part, your text editor will remind you that these words are off-limits by coloring these words in helpful ways when you type them.
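
The exact set of reserved words depends on the Python version you are running. As a sketch, the standard library's keyword module can list them for your installation, and the string method isidentifier() checks whether a name follows the spelling rules above. Note that isidentifier() does not check for reserved words, so both tests are needed.

```python
import keyword

# Every reserved word in the Python version that is running this script
print(keyword.kwlist)

# Is a particular word reserved?
print(keyword.iskeyword('lambda'))      # True
print(keyword.iskeyword('DNA'))         # False

# Does a name follow the spelling rules (letters, numbers, underscores)?
print('holyMackerel'.isidentifier())    # True
print('holyS#+%?'.isidentifier())       # False, special characters
print('2nd_exon'.isidentifier())        # False, starts with a number
```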



## 2.4. String Operators - Concatenating DNA Fragments




An important task for many biologists is to merge different strings of DNA into a single sequence, as in gene splicing, where 2 exons are brought together, or in phylogenetic analyses, in which genes are concatenated.   We can modify the previous script to concatenate two distinct DNA sequences into one.

In [2]:
#!/usr/bin/env python

# Example 2.2
# DNA_concatenate.py
# A program that concatenates 2 DNA fragments

# Store two DNA fragments into two variables called DNA1 and DNA2
DNA1 = 'AGTTGTAATGAGGCTGCCGTGATA'
DNA2 = 'CGATTACGGCATCATTTAAAGGGCAGGAGGGTA'

# Print the DNA onto the screen
print ("Here are the original two DNA fragments:")
print (DNA1)
print (DNA2)

# Concatenate the DNA fragments into a third variable and print them

DNA3 = DNA1 + DNA2

print ("Here is the concatenation of the first two fragments:")
print (DNA3)

# An alternative way to concatenate using the print command
print ("Here is an alternative concatenation of the first two fragments:")
print (DNA1 + DNA2)

Here are the original two DNA fragments:
AGTTGTAATGAGGCTGCCGTGATA
CGATTACGGCATCATTTAAAGGGCAGGAGGGTA
Here is the concatenation of the first two fragments:
AGTTGTAATGAGGCTGCCGTGATACGATTACGGCATCATTTAAAGGGCAGGAGGGTA
Here is an alternative concatenation of the first two fragments:
AGTTGTAATGAGGCTGCCGTGATACGATTACGGCATCATTTAAAGGGCAGGAGGGTA
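
One detail worth checking for yourself is that the order of concatenation matters: DNA1 + DNA2 is not the same string as DNA2 + DNA1. The sketch below uses the comparison operator == and the len() function, which are introduced properly later; the names forward and backward are made up for this example.

```python
DNA1 = 'AGTTGTAATGAGGCTGCCGTGATA'
DNA2 = 'CGATTACGGCATCATTTAAAGGGCAGGAGGGTA'

forward = DNA1 + DNA2     # DNA1 followed by DNA2
backward = DNA2 + DNA1    # DNA2 followed by DNA1

print(forward)
print(backward)
print(forward == backward)                      # False: the order differs
print(len(forward) == len(DNA1) + len(DNA2))    # True: no letters lost or added
```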


## 2.5  Printing - Inserting and formatting variables in strings

After each print statement runs, a new line is started, so DNA1 and DNA2 are printed on separate lines as in the example above.  If we want to print them on the same line, a comma is used to separate the variables. This code almost works, but it leaves a space between each DNA string and the period that follows it.


In [3]:
#!/usr/bin/env python

# Example 2.2
# DNA_concatenate.py
# A program that concatenates 2 DNA fragments

# Store two DNA fragments into two variables called DNA1 and DNA2
DNA1 = 'AGTTGTAATGAGGCTGCCGTGATA'
DNA2 = 'CGATTACGGCATCATTTAAAGGGCAGGAGGGTA'

# Print the DNA onto the screen
print ('My DNA sequence 1 is DNA1', DNA1, ". " 'My DNA sequence 2 is', DNA2, '.')

My DNA sequence 1 is DNA1 AGTTGTAATGAGGCTGCCGTGATA . My DNA sequence 2 is CGATTACGGCATCATTTAAAGGGCAGGAGGGTA .
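
The extra spaces come from print() itself, which by default places one space between the items separated by commas. One possible fix, sketched below, is the sep argument of print(): with sep='' nothing is added between items, so any spaces you do want have to be typed inside the quoted strings.

```python
DNA1 = 'AGTTGTAATGAGGCTGCCGTGATA'
DNA2 = 'CGATTACGGCATCATTTAAAGGGCAGGAGGGTA'

# sep='' tells print() to put nothing between the comma-separated items
print('My DNA sequence 1 is ', DNA1, '. My DNA sequence 2 is ', DNA2, '.', sep='')
```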


 The other method python offers, called string interpolation, for injecting variables into strings looks like the following:

In [4]:
#!/usr/bin/env python

# Example 2.2
# DNA_concatenate.py
# A program that concatenates 2 DNA fragments

# Store two DNA fragments into two variables called DNA1 and DNA2
DNA1 = 'AGTTGTAATGAGGCTGCCGTGATA'
DNA2 = 'CGATTACGGCATCATTTAAAGGGCAGGAGGGTA'

# Print the DNA onto the screen
print ('My DNA sequence 1 %s. My DNA sequence 2 is %s.' % (DNA1, DNA2))

My DNA sequence 1 AGTTGTAATGAGGCTGCCGTGATA. My DNA sequence 2 is CGATTACGGCATCATTTAAAGGGCAGGAGGGTA.


This handily replaces all those comma and + operations with a very readable string, where %s represents spots where the variables or values you supply next will be inserted, in the order you supply them. After the string comes a solitary %, then a set of values in parentheses. These are the values to interpolate, and there must be as many of these as there are %s elements in your string. This is a nice way of composing a string of other strings.  
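
The % style is the one used throughout this lab. For comparison only, Python 3 offers two other ways of writing the same interpolation that you may meet in other people's code: the .format() method and, in Python 3.6 or later, f-strings. The sketch below builds the same sentence all three ways; the variable names with_percent, with_format and with_fstring are made up for this example.

```python
DNA1 = 'AGTTGTAATGAGGCTGCCGTGATA'
DNA2 = 'CGATTACGGCATCATTTAAAGGGCAGGAGGGTA'

# 1. The % operator used in this lab
with_percent = 'My DNA sequence 1 is %s. My DNA sequence 2 is %s.' % (DNA1, DNA2)

# 2. The .format() method: {} marks each insertion point
with_format = 'My DNA sequence 1 is {}. My DNA sequence 2 is {}.'.format(DNA1, DNA2)

# 3. An f-string (Python 3.6+): variable names go directly inside the braces
with_fstring = f'My DNA sequence 1 is {DNA1}. My DNA sequence 2 is {DNA2}.'

print(with_percent)
print(with_percent == with_format == with_fstring)   # True
```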

## 2.6 Integers and Floating point numbers

So far we have been working mainly with one type of variable, strings.  Integers and floating point numbers are the other 2 main types.  As you might expect there is a lot of math that can be done.  Remember that CAPITALIZATION matters when working with variable names.

In [5]:
#!/usr/bin/env python

# Example 2.3
# Floating_point_math.py
# A program that tests basic arithmetic operators on floating point numbers

x = 4.0
y = 10.0

print (x + y)
print (x - y)
print (x * y)
print (x / y)

14.0
-6.0
40.0
0.4


In [6]:
#!/usr/bin/env python

# Example 2.4
# Integer_math.py
# A program that tests basic arithmetic operators on integers

x = 4
y = 10

print (x + y)
print (x - y)
print (x * y)
print (x / y)

14
-6
40
0.4


In the integer example 4 / 10 = 0.4, because in Python 3 the / operator always returns a floating point number, even when both operands are integers.  (In Python 2, 4 / 10 returned 0.)  To get integer (floor) division in Python 3, use the // operator, so 4 // 10 = 0.  Operations on combinations of integers and floating point numbers return floating point numbers, so 4 / 10.0 = 0.4 as well.  However, adding a string to either an integer or a floating point number results in an error.
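
A short sketch of the division operators in Python 3, using the same x and y as above. The // (floor division) and % (remainder) operators are shown alongside / for comparison.

```python
x = 4
y = 10

print(x / y)     # 0.4  true division always gives a floating point number
print(x // y)    # 0    floor division drops the fractional part
print(y // x)    # 2    4 goes into 10 two whole times
print(y % x)     # 2    with a remainder of 2
print(x / 2)     # 2.0  still a floating point number, even though it divides evenly
```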


In [7]:
#!/usr/bin/env python

# Example 2.5
# Floating_point_strings.py
# A program that tests basic arithmetic operators on integers and strings

x = '4'
y = 10

print (x + y)

TypeError: must be str, not int

If you had instead somehow managed to get a number like '4' stored as a string (for instance, you took it as input from a file or user), then you would need a way to convince Python to let you use the number as…well…a number! Your tools for this are coercion functions. You'll see these again and in more detail tomorrow, but for now just know that if something looks like a number, but has quotes around it, the functions int() and float() will give you back real numbers to play with.  Since strings cannot be changed, you need to assign the result of int() or float() to a variable.  Use them like so:

In [8]:
#!/usr/bin/env python

# Example 2.5
# Floating_point_strings.py
# A program that tests basic arithmetic operators on integers and strings

x = '4'
x0 = int(x)
y = 10

print (x0 + y)

14
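
The float() function works the same way for text containing a decimal point. The sketch below also shows what happens when the text does not look like the kind of number requested; it uses try and except, which have not been covered yet, only so that the script keeps running after the failed conversion. The variable names are made up for this example.

```python
reading = '3.5'

value = float(reading)     # '3.5' (text) becomes 3.5 (number)
print(value * 2)           # 7.0

count = int('42')          # '42' (text) becomes 42 (number)
print(count + 1)           # 43

# int() refuses text that is not a whole number and raises a ValueError
try:
    int(reading)
except ValueError as error:
    print('Could not convert:', error)
```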


Similarly an integer or floating point number can be converted to a string.

In [9]:
#!/usr/bin/env python

# Example 2.5
# Floating_point_strings.py
# A program that tests basic arithmetic operators on integers and strings

x = 4
y = 10
x0 = str(x)
y0 = str(y)

print (x + y)
print (x0 + y0)

14
410


Note that in the above example adding two strings concatenated them.  This may not have been what we expected, but it is the same as concatenating the two DNA strings above.

To print out numbers in a sentence or with other strings, you can supply them to strings with %s elements (like we just did with string variables), but there are also special interpolation operators for numbers %d and %f (corresponding to integer and floating point, respectively). For a full workup, see http://docs.python.org/lib/typesseq-strings.html , but here's a start:

In [10]:
#!/usr/bin/env python
   
x = 4
y = 10.0002
 
print ('Variables can be interpolated as strings here %s and here %s.' % (x,y)) 

Variables can be interpolated as strings here 4 and here 10.0002.


To get 2 decimal places for the floating point number, write %.2f in place of the second %s:

In [11]:
#!/usr/bin/env python
   
x = 4
y = 10.0002
 
print ('Variables can be interpolated as strings here %s and here %.2f.' % (x,y)) 

Variables can be interpolated as strings here 4 and here 10.00.


Practically speaking, the most commonly used formatting tools are %s to shove variables of any and all types into strings, and %.xf where x is the number of decimal places to display for floating point numbers. Most commonly, you will see and employ a lot of '%.2f' string interpolations, and almost never see or use any of the other numerical interpolators.
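
For reference, here is a small sketch of how the number after the decimal point in %.xf changes the result, along with %d for integers. It reuses x and y from the examples above; the other variable names are made up for this example.

```python
x = 4
y = 10.0002

as_integer = '%d' % x          # '4'
one_place = '%.1f' % y         # '10.0'
three_places = '%.3f' % y      # '10.000'
four_places = '%.4f' % y       # '10.0002'
truncated = '%d' % y           # '10'  (%d drops the fractional part)

print(as_integer, one_place, three_places, four_places, truncated)
```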

## 2.7 More work with strings 

### DNA sequence length 

In addition to concatenating strings there are several built in operations that we can use such as finding the length of the DNA sequence.

In [12]:
#!/usr/bin/env python

# Example 2.6
# DNA_length.py
# A program that determines the length of a DNA sequence

DNA = 'AGTTGTAATGAGGCTGCCGTGATA'
print ('There are %s nucleotides in my DNA sequence.' % (len(DNA)))

There are 24 nucleotides in my DNA sequence.
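
Another built-in string operation, not used elsewhere in this lab, is the .count method, which reports how many times a substring occurs. As a sketch, it can be combined with len() to estimate the fraction of G and C nucleotides; the variable names below are made up for this example.

```python
DNA = 'AGTTGTAATGAGGCTGCCGTGATA'

g_count = DNA.count('G')    # number of Gs in the sequence
c_count = DNA.count('C')    # number of Cs in the sequence
gc_fraction = (g_count + c_count) / len(DNA)

print('There are %s Gs and %s Cs in my DNA sequence.' % (g_count, c_count))
print('The GC fraction is %.2f.' % (gc_fraction))
```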


### A substring of DNA

Indexing and slice notation (square brackets) can be used to extract a single nucleotide or a substring of the DNA sequence.

In [13]:
#!/usr/bin/env python

# Example 2.7
# DNA_position.py
# A program that finds the nucleotide at a specified position

DNA = 'AGTTGTAATGAGGCTGCCGTGATA'
print ('In my DNA sequence the first nucleotide is %s' % (DNA[0]))
print ('In my DNA sequence the last nucleotide is %s.' % (DNA[23]))
print ('Another way to find the last nucleotide which is %s.' % (DNA[-1]))
print ('The first 3 nucleotides are %s.' % (DNA[0:3]))

In my DNA sequence the first nucleotide is A
In my DNA sequence the last nucleotide is A.
Another way to find the last nucleotide which is A.
The first 3 nucleotides are AGT.


In Python the first position of a string is 0, not 1.  This is common in many computer languages.  It also means that every subsequent position is offset by one, including the final position, which is the length of the string minus 1 (23 for our 24-nucleotide sequence).  To further muck you up, a slice starts at the first position given and runs up to, but does not include, the second position, such that DNA[0:3] returns only 3 nucleotides (positions 0, 1 and 2) instead of 4.  You can also count from the end of the string using negative positions, as in the above example.
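
A few more slices of the same sequence may help make the counting rules concrete. This is only a sketch; the variable names are made up for this example.

```python
DNA = 'AGTTGTAATGAGGCTGCCGTGATA'

first_codon = DNA[0:3]      # positions 0, 1, 2
second_codon = DNA[3:6]     # positions 3, 4, 5
same_start = DNA[:3]        # leaving out the start means "from position 0"
last_three = DNA[-3:]       # leaving out the end means "to the end of the string"
also_last = DNA[21:24]      # the same three nucleotides, counted from the front

print(first_codon, second_codon, same_start, last_three, also_last)
```

This prints AGT TGT AGT ATA ATA.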

### Finding the reverse complement

In order to transcribe DNA into RNA we need to find the reverse complement of the DNA sequence and replace the Thymidines (T) with Uracils (U).  We can use a trick in Python string slicing (DNA[::-1]) to reverse the string, and then the str.maketrans function to make a translation table for the complement of the given DNA strand. The translation table is used by the translate() method to make the complement.  Then .replace is used to change the Ts to Us.

In [14]:
#!/usr/bin/env python

# Example 2.8
# Reverse_Complement.py
# A program that prints the reverse complement of a DNA sequence

DNA = 'AGTTGTAATGAGGCTGCCGTGATA'

# Reverse the DNA sequence

REV = DNA[::-1]

# Get the Complement using the built-in string function .maketrans and the string translate method

NUC = "ATCG"
NUCCOMP = "TAGC"
trantab = str.maketrans(NUC, NUCCOMP)

REVCOMP = REV.translate(trantab)

# Substitute Uracil (U) for Thymidine (T)
RNA = REVCOMP.replace('T', 'U')

print (DNA)
print (REV)
print (REVCOMP)
print (RNA)

AGTTGTAATGAGGCTGCCGTGATA
ATAGTGCCGTCGGAGTAATGTTGA
TATCACGGCAGCCTCATTACAACT
UAUCACGGCAGCCUCAUUACAACU
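
One way to sanity-check a reverse complement, sketched below, is to take the reverse complement a second time: the result should be the original sequence. The variable name BACK is made up for this example; everything else follows Example 2.8.

```python
DNA = 'AGTTGTAATGAGGCTGCCGTGATA'

# Translation table: A<->T and C<->G
trantab = str.maketrans('ATCG', 'TAGC')

# Reverse, then complement
REVCOMP = DNA[::-1].translate(trantab)

# Doing the same thing again should undo it
BACK = REVCOMP[::-1].translate(trantab)

print(REVCOMP)
print(BACK == DNA)    # True
```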


### Finding codons in the DNA sequence

Next we can use a string operator (in) to find codons in the RNA sequence.

In [15]:
#!/usr/bin/env python

# Example 2.9
# Find_codon.py
# A program that finds codons in a sequence

# Store two DNA fragments into two variables called DNA1 and DNA2
DNA1 = 'AGTTGTAATGAGGCTGCCGTGATA'
DNA2 = 'CGATTACGGCATCATTTAAAGGGCAGGAGGGTA'

# Concatenate (join) the DNA fragments (exons)
DNA3 = DNA1 + DNA2

# Reverse the DNA sequence
REV = DNA3[::-1]

# Get the Complement using the built-in string function .maketrans and the string translate method
NUC = "ATCG"
NUCCOMP = "TAGC"
trantab = str.maketrans(NUC, NUCCOMP)
REVCOMP = REV.translate(trantab)

# Substitute Uracil (U) for Thymidine (T)
RNA = REVCOMP.replace('T', 'U')

# Find codons in the sequence

print (RNA)
print ('Is there a start codon in my RNA sequence? - %s' % ('AUG' in RNA))
print ('Is there a stop codon in my RNA sequence? - %s' % ('UGA' in RNA))

UACCCUCCUGCCCUUUAAAUGAUGCCGUAAUCGUAUCACGGCAGCCUCAUUACAACU
Is there a start codon in my RNA sequence? - True
Is there a stop codon in my RNA sequence? - True


Next, a string method (.find) is used to find the codon position in the RNA sequence.  If a codon is not found the value -1 is returned.

In [16]:
#!/usr/bin/env python

# Example 2.10
# Find_codon_position.py
# A program that finds the position of codons in a sequence

# Store two DNA fragments into two variables called DNA1 and DNA2
DNA1 = 'AGTTGTAATGAGGCTGCCGTGATA'
DNA2 = 'CGATTACGGCATCATTTAAAGGGCAGGAGGGTA'

# Concatenate (join) the DNA fragments (exons)
DNA3 = DNA1 + DNA2

# Reverse the DNA sequence
REV = DNA3[::-1]

# Get the Complement using the built-in string function .maketrans and the string translate method
NUC = "ATCG"
NUCCOMP = "TAGC"
trantab = str.maketrans(NUC, NUCCOMP)
REVCOMP = REV.translate(trantab)

# Substitute Uracil (U) for Thymidine (T)
RNA = REVCOMP.replace('T', 'U')

# print the results
print (RNA)

# Use .find to identify possible codons

print ('The position of the start codon is %s.' % (RNA.find('AUG')))
print ('The position of the GGA (glycine) codon is %s.' % (RNA.find('GGA')))
print ('The position of the stop codon is %s.' % (RNA.find('UGA')))

UACCCUCCUGCCCUUUAAAUGAUGCCGUAAUCGUAUCACGGCAGCCUCAUUACAACU
The position of the start codon is 18.
The position of the GGA (glycine) codon is -1.
The position of the stop codon is 19.


This doesn't make a lot of biological sense, since the start and stop codons overlap and are not in the same reading frame.  The .find method only returns the position of the first match, so later matches (e.g. additional codons) are not found.  We will make improvements in later labs.  For more information on strings you can consult the Python documentation at https://docs.python.org/3/library/stdtypes.html#string-methods.  For more advanced string analyses we will use regular expressions in our next session.
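
As a small sketch of both points, .find accepts an optional second argument giving the position at which to start searching, which allows a later match to be found, and the remainder operator % shows which of the three reading frames a position falls in. This is only one possible approach and not necessarily the one used in later labs; the variable names are made up for this example.

```python
RNA = 'UACCCUCCUGCCCUUUAAAUGAUGCCGUAAUCGUAUCACGGCAGCCUCAUUACAACU'

first = RNA.find('AUG')                 # 18, the first match
second = RNA.find('AUG', first + 1)     # 21, searching again from position 19
third = RNA.find('AUG', second + 1)     # -1, there is no third match

print(first, second, third)

# Positions with the same remainder after dividing by 3 share a reading frame
stop = RNA.find('UGA')                  # 19
print(first % 3, stop % 3)              # 0 and 1: different reading frames
```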

## 2.8 Exercises

Turn in the above example programs along with the below exercises. First do the examples and exercises in your text editor and then when you have them running, cut and paste them into the Jupyter notebook. Remember to load the html file (and not the .ipynb file) into Moodle.

1. Write a program that puts your full name together using a separate variable for first, middle and last name and prints your full name.

2. Write a program that calculates and prints the number of codons possible in DNA1 from the above exercises (Hint: Find the length of the sequence and divide by 3).

3. Write a program that converts 'UACCCUCCUGCCCUUUAAAUGAUGCCGUAAUCGUAUCACGGCAGCCUCAUUACAACU' into lower case letters (Hint: Consult the python documentation https://docs.python.org/3/library/stdtypes.html#string-methods or Search the internet for "Python String Methods").

4. In the RNA sequence in exercise 3, write a program using the find method to identify threonine and cysteine codons. Does this method find all possible codons or just 1?  See http://openwetware.org/wiki/Codon_table for the RNA codon table.

* Next - <a href="http://nbviewer.ipython.org/github/jeffreyblanchard/EvoGenV5/blob/master/EvoGenV5_Lab3.ipynb">Session 3 : Detecting Selection in Strings</a>
* Previous - <a href="http://nbviewer.ipython.org/github/jeffreyblanchard/EvoGenV5/blob/master/EvoGenV5_Lab1.ipynb">Session 1 : Computational Frameworks for Evolutionary Genomics</a> 