# Bioinformatics exercises for Python (Informatics I)
##### I.   Use flow control to check user input
##### II.  Program in functions

## Introduction

Author: Jurre Hageman
Date: 2017-09-18

This lesson is about flow control and functions.

For this lesson we will use the previous exercise:
We will write a small program that converts a DNA sequence into its reverse, complementary, and reverse complementary strands.

We will rewrite the previous program using functions.
In addition, we will check the user input: we will check that the argument passed on the command line contains only valid bases (A, T, C, G).

Description of the program (this is still the same): <br>
DNA is a double-stranded helix and can be depicted as follows (strongly simplified):<br>
GACCATGGAC<br>
CTGGTACCTG<br>

Bioinformaticians often store only one strand in databases. This saves considerable space.
When one strand is given, the other strand can be generated, because a T always pairs with an A and a C always pairs with a G.<br>
If we look at the strand: <br>
GACCATGGAC<br>
Then we can generate the following: <br>
reverse strand: CAGGTACCAG<br>
complementary strand: CTGGTACCTG<br>
reverse complementary strand: GTCCATGGTC <br>
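
The pairing rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `pairs`, `strand`, `complement`, `reverse` and `reverse_complement` are chosen here for the example, and the dictionary lookup is just one possible way to apply the rule.

```python
# Pairing rule from the text: A pairs with T, and C pairs with G.
# The dictionary name "pairs" is an assumption made for this sketch.
pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}

strand = "GACCATGGAC"

# Build the complementary strand one base at a time.
complement = ""
for base in strand:
    complement += pairs[base]

# A slice with step -1 walks through a string from back to front.
reverse = strand[::-1]
reverse_complement = complement[::-1]

print("reverse strand:", reverse)
print("complementary strand:", complement)
print("reverse complementary strand:", reverse_complement)
```

The three printed strands should match the ones listed above.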

Online tools exist to convert DNA, such as [this tool](http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html).

Your task is to write a similar DNA conversion tool. However, for simplicity, we will code it as a command line tool.


Let's first think of what the program should do:
- it should catch a DNA string from the command line
- it should check that the string contains only valid bases
- it should reverse the string
- it should complement the string
- it should reverse-complement the string

## Flow control


Open IDLE3.
First create a variable valid_dna and assign it the string "ATCG".


In [1]:
valid_dna = "ATCG"
print(valid_dna)

ATCG


In [2]:
input_dna = 'attacgga'
input_dna = input_dna.upper() #Note that the old variable gets overwritten. This is OK in this situation.
print(input_dna)

ATTACGGA


Now we will check if the DNA sequence contains only valid bases. We can check if a single letter is a member of a collection using the "in" operator. This works for any collection, including strings (a string is a collection of characters).

In [3]:
valid_dna = "ATCG"
print("A" in valid_dna)
print("Q" in valid_dna)

True
False


In [4]:
valid_dna = "ATCG"
dna = "atcgga"
dna = dna.upper()
#The for loop will loop through each base
for base in dna:
    print(base)

A
T
C
G
G
A


As you can see, the for loop loops through the string. In each iteration, the placeholder 'base' is overwritten with the next base in the sequence.

Now we can check if each base is a valid DNA character using an "if" statement.
As soon as we find an invalid character we will break out of the loop.
It is no longer required to finish the loop as we do not want any further processing to take place.
We can end the loop by using the "break" statement.

In [5]:
valid_dna = "ATCG"
dna = "atcQga"
dna = dna.upper()
#The for loop will loop through each base
for base in dna:
    if base not in valid_dna:
        print("invalid character:", base)
        break

invalid character: Q


As you can see, the for loop ends directly after the Q.
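
To make the effect of "break" more visible, here is a small sketch with two invalid characters in the sequence. The list `reported` and the example sequence are assumptions made for this illustration. Because the loop stops at the first invalid character, the second one is never reached.

```python
valid_dna = "ATCG"
dna = "atcQgaXt".upper()  # two invalid characters: Q and X

reported = []  # collects the invalid characters seen before the loop stopped
for base in dna:
    if base not in valid_dna:
        reported.append(base)
        break  # leave the loop at the first invalid character

print(reported)
```

Only the Q ends up in the list. Removing the "break" line would let the loop continue, so the X would be collected as well.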

## Functions

Now that we have code that checks if characters are valid DNA bases we can organise the code in functions.
A function is a block of organised, reusable code that performs a single action.
Do not write long functions that do many different things; a function should do a single task.
In this way, functions provide a high degree of code reuse.
Below is the simplest function possible:

In [6]:
def is_valid_dna():
    pass

The above function has a function name (is_valid_dna) and accepts no arguments, as shown by the empty parentheses (). The pass statement is a placeholder: Python requires a function body, and pass fills that role without doing anything. In other words: the function does nothing.
This is how we can call the function:

In [7]:
#declare
def is_valid_dna():
    pass

#call
print(is_valid_dna())

None


You can see that it does nothing (yet). It only returns None.

Now we will add functionality:

In [8]:
dna_seq = "ATCGQACT"
#declare
def is_valid_dna(seq):
    valid_dna = "ATCG"
    for base in seq:
        if base not in valid_dna:
            print("Not a valid base", base)
            break

#call 
is_valid_dna(dna_seq)


Not a valid base Q


is_valid_dna(dna_seq) calls the function. <br>
dna_seq is passed as an argument in the is_valid_dna() function. <br>
The function header: "def is_valid_dna(seq):" contains the parameter "seq"<br>
The argument dna_seq is passed to the is_valid_dna() function and becomes the parameter seq.<br>
Note that the names do not need to match. The match is by position. Therefore, we call these positional arguments. <br>
If you have more arguments, you can separate them with commas. <br>
Example: <br>
print("Hello", "World") <br>
The above code will call the print function with two positional arguments. <br>
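
The following sketch illustrates matching by position. The function `label_pair()` and its parameters `first` and `second` are hypothetical names invented for this example. Swapping the arguments in the call changes which parameter receives which value.

```python
def label_pair(first, second):
    # The first argument of the call becomes "first",
    # the second argument becomes "second".
    return "first=" + first + ", second=" + second

strand_a = "GAC"
strand_b = "CTG"

print(label_pair(strand_a, strand_b))
print(label_pair(strand_b, strand_a))  # same values, swapped positions
```

The variable names strand_a and strand_b play no role in the matching; only the order in the call matters.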

Note that valid_dna is defined within the function. This way, the variable valid_dna is scoped to the function, which ensures that it is only accessible within the function. The following code demonstrates this:

In [9]:
def my_func():
    mssg = "hello"
    return mssg

print("within my_func", my_func())
#print("outside my_func", mssg) #This line without the starting # will raise an error.
    

within my_func hello


Thus, the variable mssg is only accessible within the function definition. Imagine functions as separate rooms with doors. Objects within a room are only visible within that particular room, not from other rooms. They can be passed from room to room via doors. This is exactly what the return statement does: it hands the value of mssg back to the place where the function was called.

The is_valid_dna() function prints a message when a non-valid character is encountered. Also note that the break statement will stop the loop as soon as a non-valid character is encountered. There is no need to loop further as soon as a non-valid base is encountered. <br>
<br>
The function works but it is better to return a boolean (False) when a non-valid base is encountered. When the whole loop is finished and a non-valid base was not encountered, we can safely return the boolean True. In code:

In [10]:
dna_seq1 = "ATCGQACT"
dna_seq2 = "ATCGGACT"
#declare
def is_valid_dna(seq):
    valid_dna = "ATCG"
    for base in seq:
        if base not in valid_dna:
            #the break statement is not needed anymore as return automatically breaks the loop.
            return False
    #the loop is finished so we are sure no non-valid bases were encountered.
    #we can return True now
    return True
            

#call 
print(is_valid_dna(dna_seq1))
print(is_valid_dna(dna_seq2))

False
True


The above function returns a boolean. The code is organised in a function. There still is some code outside a function. We can add another level of organisation: a main function that calls all the other functions. Study the code below:

In [11]:
def is_valid_dna(seq):
    #checks if all letters of seq are valid bases
    valid_dna = "ATCG"
    for base in seq:
        if base not in valid_dna:
            #the break statement is not needed anymore as return automatically breaks the loop.
            return False
    #the loop is finished so we are sure no non-valid bases were encountered.
    #we can return True now
    return True

def main():
    #main function calls other functions
    dna_seq1 = "ATCGQACT"
    dna_seq2 = "ATCGGACT"
    #the following line is a shorthand for:
    #if is_valid_dna(dna_seq1) == True:
    if is_valid_dna(dna_seq1):
        print("dna_seq1 is valid")
    else:
        print("dna_seq1 is not valid")
    if is_valid_dna(dna_seq2):
        print("dna_seq2 is valid")
    else:
        print("dna_seq2 is not valid")

#call the main function:
main()

dna_seq1 is not valid
dna_seq2 is valid


All code above is organised in functions. The only line which is not organised in a function is the call to the main function: <br>
main()

## Exercise: DNA converter organised in functions

Now we come to the final exercise: <br>
Use the code from the previous exercise. Organise all code in functions.
Code a program that will catch a DNA sequence from the command line. Remember that you can catch arguments using the sys module and its sys.argv attribute. This yields a list with the command-line arguments:

In [12]:
import sys

def main():
    args = sys.argv #this will provide you with a list of arguments. Use indexing to select the correct item.

main()

We can also check if an argument is given by the user:
If no argument is given, the length of the list will be smaller than 2, because the first item is always the script name.
sys.exit() will cause the script to stop.
In code:

In [13]:
if len(sys.argv) < 2:
    sys.exit()
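
As a minimal sketch of how the two snippets above could fit together, the hypothetical helper `get_sequence()` below takes a list shaped like sys.argv and returns the sequence in upper case, or None when no sequence was given. The function name and the hand-made example lists are assumptions made for this illustration; in a real script you would pass sys.argv itself.

```python
def get_sequence(args):
    # args is expected to look like sys.argv:
    # item 0 is the script name, item 1 is the first argument.
    if len(args) < 2:
        return None  # no sequence was given
    return args[1].upper()

# Hand-made lists that imitate sys.argv (the script name is made up).
print(get_sequence(["my_script.py"]))
print(get_sequence(["my_script.py", "attacgga"]))
```

Returning None instead of calling sys.exit() inside the helper leaves the decision to stop the script to main().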

Now organise your code in the following functions:
- The is_valid_dna() function returns True if the DNA sequence contains only valid bases; otherwise it returns False.
- The main() function catches the original sequence as a command-line argument.
- The reverse_dna() function returns the reverse DNA sequence.
- The complement_dna() function returns the complementary DNA sequence.
- The reverse_complement_dna() function returns the reverse-complement DNA sequence by calling the reverse_dna() function and the complement_dna() function. Do NOT repeat code from the reverse_dna() function or the complement_dna() function.

Bonus: stop the script if no sequence is provided or if non-valid characters are given. Hint: you can stop a script using the sys.exit() function.

## Solutions

Needs to be updated!

<p><a href="L2_solutions/excercise01.py">excercise01.py</a></p>

