<img src="IMAGES_PYTHON_COURSE_2018/LOGO_MEINBIO_TRANSPARENT.png" width="200" height="200" />

# <center>Python course 2018
### <center>MeInBio Training Group 
#### <center> by Florian Heyl & Francesco Ferrari 

# PART 5: CONDITIONAL STATEMENTS

A program becomes much more useful when it can evaluate whether certain conditions are met and act accordingly. For example, imagine we have a DNA sequence and we want to print it to the screen only if it is longer than 300 bp. Or imagine we have a count table (from an RNA-Seq experiment) with the number of reads mapped to each gene, and we want our program to discard any gene that has fewer than 10 reads mapped to it. 

Whenever we want our program to check whether a condition is met, we use a conditional statement. Let's have a look at what conditions are and how conditional statements work in Python. 

### What is a condition? 

A condition is simply a bit of code that can produce a true or false answer.

In [None]:
# examples of conditions

print(len('ATCCGTA') > 2)
print(20 == 30) ### PLEASE NOTICE THAT WE USE "==" AND NOT "=" TO TEST IF TWO ITEMS ARE THE SAME
print("I Love Pizza".startswith('I'))
print("I Also Love Carbonara".endswith("y"))
print("apple" in ["banana","orange","cherry"])
print("Anakin" != "Luke")
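
Because a condition evaluates to a Boolean value (True or False), its result can also be stored in a variable and reused later. The following is a minimal sketch; the sequence, the threshold and the variable names are made up for illustration.

```python
# hypothetical example values, chosen only for illustration
sequence = "ATCCGTA"
min_length = 5

# the comparison is evaluated first, and its result (a bool) is stored
is_long_enough = len(sequence) > min_length

print(is_long_enough)        # True, because 7 > 5
print(type(is_long_enough))  # <class 'bool'>
```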

- # If else Statements

We have just seen how to build some conditions and have our code output True or False depending on whether the condition is met. Now we are ready to make our program take decisions and act according to a condition's outcome.

The simplest kind of conditional statement is an "if statement". In Python, the syntax for if statements is very simple:

We write the word "if", followed by a condition, and end the first line with a colon. There follows a block of indented lines of code (the body of the if statement), which will only be executed if the condition is true. An optional "else" line, also ending with a colon and followed by its own indented block, introduces the code that is executed only if the condition is false.

In [None]:
# example of if statement 1 (easy)

my_sequence = "AAAATCCCCGTACT"
if len(my_sequence) > 4:
    print("my sequence is long".upper())
else:
    print("my sequence is short".lower())
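
The "else" branch can be left out when nothing needs to happen for a false condition. As a sketch of the read-count filter mentioned in the introduction (the gene names, counts and threshold variable below are invented for illustration), genes below the threshold are simply skipped:

```python
# hypothetical (gene_name, read_count) pairs, invented for illustration
read_counts = [("geneA", 250), ("geneB", 3), ("geneC", 10), ("geneD", 0)]
min_reads = 10

kept_genes = []
for gene, count in read_counts:
    # no else branch: genes with fewer than min_reads reads are ignored
    if count >= min_reads:
        kept_genes.append(gene)

print(kept_genes)  # ['geneA', 'geneC']
```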

### Exercise n° 1 (medium): 

Given a list of tuples of the form (chr, gene_name), identify all genes located on the X chromosome and save them to a text file.

In [None]:
import sys
import os
sys.path.append(os.path.abspath("./BOOK_and_other_resources/"))

In [None]:
import python_course_functions as pcf
gene_list = pcf.loadGeneList("./BOOK_and_other_resources/gene_list.txt")

print("You are working with {} genes".format(len(gene_list)))

gene_list[0:10] # to see the structure of your data

In [None]:
### insert your commands in the space below






- ##  Elif statements

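If we have more than two possible branches, we add one or more "elif" (short for "else if") branches after the "if". A final "else" is optional and catches every case not matched by the conditions above it. 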

In [None]:
# example of elif statement
my_fav_fruit = ["banana", "apple", "cherry", "kiwi", "pear", "raspberry", "blueberry", "peach", "ca"]

for fruit in my_fav_fruit:
    if len(fruit) == 4:
        print("'{}' has 4 letters\n".format(fruit))
    elif len(fruit) == 5:
        print("'{}' has 5 letters\n".format(fruit))
    elif len(fruit) > 5:
        print("'{}' is a very long word\n".format(fruit))
    else:
        print("'{}' is a very short word\n".format(fruit))
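
Python checks the branches from top to bottom and runs only the first one whose condition is true, so the order of the tests matters when they overlap. A small sketch, using made-up read counts and labels:

```python
# hypothetical read counts, invented for illustration
labels = []
for read_count in [5, 50, 500]:
    if read_count >= 100:     # most restrictive test first
        label = "high"
    elif read_count >= 10:    # only checked if the test above was false
        label = "medium"
    else:                     # reached only if every test above was false
        label = "low"
    labels.append(label)
    print(read_count, label)
```

If the two tests were swapped, a count of 500 would already satisfy `read_count >= 10` and would be labelled "medium", because the second test would never be reached.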

### Exercise n° 2 (medium): 

Using the same list of tuples of the form (chr, gene_name) from exercise 1, write to separate files all genes found on chromosome 1, chromosome 4, chromosome 11 and chromosome Y.

In [None]:
### insert your commands in the space below






- # Combining Conditions with and / or

It is possible to express conditions made up of several parts. For example, given a list of accession IDs, print all IDs that start with "a"/"A" and end with "z"/"Z". To do that, we use the "and" operator. The combined condition is True if all of its parts are satisfied, and False otherwise. 

Alternatively, we can write disjunctive conditions using the "or" logical operator. In this case, the combined condition is True if at least one of its parts is satisfied, and False only if none of them is.    

In [None]:
list_ID = ["ID_a132i398zzz",
           "ID_b34573y4jnj",
           "ID_S45u92s8fgz",
           "ID_Afkj3ehrdZZ",
           "ID_Askdsa3s4nz",
           "ID_EFGJNG24OIN",
           "ID_234oitfunkj",
           "ID_ASOI0934U5Z",
           "ID_ewowrit13ir"]

print("IDs starting with 'a' AND ending with 'z':")
for i in list_ID:
    id_seq = i.split("_")[1]
    if id_seq[0].lower() == "a" and id_seq[-1].lower() == "z":
        print(i)
print("\n")     
print("IDs starting with 'e' OR ending with 'j':")
for i in list_ID:
    id_seq = i.split("_")[1]
    if id_seq[0].lower() == "e" or id_seq[-1].lower() == "j":
        print(i)
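
When "and" and "or" appear in the same condition, Python evaluates "and" before "or", so parentheses are needed to group the "or" part. The sketch below uses a made-up sequence to show how the two readings can differ:

```python
# hypothetical sequence: does NOT start with ATG, but ends with TAG
sequence = "CCCGGGTAG"

# read by Python as: (starts with ATG and ends with TAA) or (ends with TAG)
without_parentheses = sequence.startswith("ATG") and sequence.endswith("TAA") or sequence.endswith("TAG")

# intended meaning: starts with ATG and (ends with TAA or ends with TAG)
with_parentheses = sequence.startswith("ATG") and (sequence.endswith("TAA") or sequence.endswith("TAG"))

print(without_parentheses)  # True
print(with_parentheses)     # False
```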

### Exercise n° 3 (medium):

A) Using the gene list from exercise 1, count how many genes are found on chromosome 1 and have an ID that ends with 1. 

B) Using the same list, count how many genes are found on chromosomes 2 and 4 and have an ID whose number after the dot is either a 3 or a 7. (PAY ATTENTION! THIS IS DIFFERENT FROM THE CONDITION OF 3A).  

In [None]:
### insert your commands in the space below





Expected Outcome:


3A): 2092          
3B): 559

- # Boolean functions and conditional statements

Sometimes you may need to check whether a complex condition holds (that is, a condition that cannot be expressed with the standard tools "==", ">", "<", "endswith", "startswith", etc.) and have your script react according to the outcome.

To do that, we must write our own function that returns True if the condition we want to test is met, and False otherwise. 

For example:
  
Say that you want to check whether a sequence is AT-rich. If the sequence is AT-rich, you print it to the screen together with its AT percentage. You also want to collect all sequences that are not AT-rich in a list, and print that list to the screen.

In [None]:
def is_AT_rich(sequence):
    sequence = sequence.lower()  # convert to lowercase to avoid mismatches due to capitalization
    a = sequence.count("a") # count how many "a" are in the sequence
    t = sequence.count("t") # count how many "t" are in the sequence
    fr_at = (a+t)/len(sequence) # compute the fraction of A and T in the sequence
    if fr_at >= 0.65:    # if the fraction is greater than or equal to 0.65 (65%) ...
        return (True, fr_at) # return a tuple whose first element is True (boolean) and whose second is the AT fraction
    else:                   # otherwise
        return (False, fr_at)  # return a tuple whose first element is False (boolean) and whose second is the AT fraction
    
sequences = ["ATTTGTCCAAAGTATATAT",
             "GGGCGAGGCGTTGCGAGGG",
             "GGCGAAAAAAATTTTTTTT",
             "AAATATATATATATATATA",
             "GGATTATCCTGGATATATAT",
             "TTTTTTTTTTTAAAAAAAAA",
             "GGGTTTTTATATATTTAAAT",
             "GGGTGGGCGCGCGCGAGGGG"]

not_AT_rich = []
for sequence in sequences:
    at_rich, fr_at = is_AT_rich(sequence)  # call the function once and unpack the returned tuple
    if at_rich:
        print("{} is {}% AT\n".format(sequence, round(fr_at*100, 1)))
    else:
        not_AT_rich.append(sequence)
print("{} sequences were not AT-rich: {}".format(len(not_AT_rich), not_AT_rich))
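
When only the yes/no answer is needed, a Boolean function can return the result of the comparison directly instead of a tuple. The sketch below is one possible way to write such a function; the name `is_GC_rich`, the default threshold of 0.6 and the test sequences are assumptions made for illustration.

```python
def is_GC_rich(sequence, threshold=0.6):
    """Return True if the GC fraction of sequence is at least threshold.

    The function name and the default threshold are hypothetical.
    """
    if len(sequence) == 0:
        return False  # an empty sequence would cause a division by zero below
    sequence = sequence.upper()  # ignore capitalization
    gc_fraction = (sequence.count("G") + sequence.count("C")) / len(sequence)
    return gc_fraction >= threshold  # the comparison itself is the bool we return


# hypothetical test sequences
test_sequences = ["GGGCGAGGCGTTGCGAGGG", "AAATATATATATATATATA"]

gc_rich = []
for seq in test_sequences:
    if is_GC_rich(seq):  # the returned bool is used directly as the condition
        gc_rich.append(seq)

print(gc_rich)  # ['GGGCGAGGCGTTGCGAGGG']
```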

### Exercise n° 4 (difficult):

Write a function that checks whether a sequence contains at least 4 CpG sites (a C immediately followed by a G, i.e. "CG") and returns True if so, False otherwise. 

Then, given a list of sequences, print to the screen those sequences that have at least 4 CpG sites in them.

In [None]:
new_sequences = ["CGGGGAGTCGAAATTGCTA",
                 "ATTCGCGCGCGATAATTCG",
                 "TTTCGGCCGCGATTTACGG",
                 "TTAATTCGGGATCGCGAAT"]

### insert your commands in the space below






Expected Outcome:

ATTCGCGCGCGATAATTCG

TTTCGGCCGCGATTTACGG