# Functions

You have already encountered many functions that are built into Python:

In [None]:
seq = ["AGTTTTA","CACAGAGG","TACCAGA","ACGG"]

# Len is a function
print(len(seq))

# append is a method: a function that belongs to the list
seq.append("CCACA")
print(seq)

### Why Functions?
We want to bundle a set of instructions together that we can use repeatedly. 

Remember when we calculated the GC content of a sequence:

In [3]:
seq = "ATGACCTTATGACTACTATACAT"
numGs_seq = seq.count('G')
numCs_seq = seq.count('C')
totalGCs_seq = numGs_seq + numCs_seq
numbases_seq = len(seq)
GCpercent_seq = 100*float(totalGCs_seq)/numbases_seq
print(GCpercent_seq)

30.434782608695652


Now what if we wanted to calculate GC content for another sequence? 

In [4]:
seq2 = "ACTTACTGGGCCTGATC"
numGs_seq2 = seq2.count('G')
numCs_seq2 = seq2.count('C')
totalGCs_seq2 = numGs_seq2 + numCs_seq2
numbases_seq2 = len(seq2)
GCpercent_seq2 = 100*float(totalGCs_seq2)/numbases_seq2
print(GCpercent_seq2)

52.94117647058823


This is too repetitive, and copying code around like this makes mistakes likely.

Defining a function that takes a sequence as input and calculates its GC content would make this much easier.

## Defining functions

Function definitions have the following hard rules:

* They start with the keyword `def`, to show that we're defining a function.
* They need a name for the function, followed by parentheses and a colon.
* The body of the function needs to be indented.

In [5]:
def hello():
    print("Hello World")

In [8]:
# Execute the function
hello()
hello()
print("as;ldjfa;s")
hello()

Hello World
Hello World
as;ldjfa;s
Hello World


When we run the function, the code inside it is executed as if it were all written right where the function was called. The definition can sit anywhere in our notebook or script, as long as it has been run before the first call.
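
To see when the body of a function actually runs, here is a small sketch. The `events` list and the `greet()` function are illustrative names that are not part of the lesson; the list simply records the order in which things happen.

```python
# Hypothetical helper list that records the order of events
events = []

def greet():
    # This line only runs when greet() is called, not when it is defined
    events.append("inside greet")

events.append("after definition")   # defining greet() ran none of its body
greet()
events.append("after call")

print(events)
```

The printed list is `['after definition', 'inside greet', 'after call']`: nothing inside `greet()` happened until the call.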

We can get more complicated with the code in our functions. Here I've written a function that prints out a rectangle with the help of a for loop:

In [9]:
def print_rectangle():
    height = 4
    width = 6
    for i in range(height):
        print("*" * width)
        
print_rectangle()

******
******
******
******


Now we just type `print_rectangle()` anywhere in our code to run this function. This can be useful if we need the same rectangle to be printed multiple times, but it is not configurable. You might try making functions for every rectangle size that we need:

In [10]:
def print_5_by_5_rectangle():
    height = 5
    width = 5
    for i in range(height):
        print("*" * width)
    
def print_4_by_6_rectangle():
    height = 4
    width = 6
    for i in range(height):
        print("*" * width)
    
def print_3_by_7_rectangle():
    height = 3
    width = 7
    for i in range(height):
        print("*" * width)
        
print_5_by_5_rectangle()
print("") #print empty line
print_4_by_6_rectangle()
print("") #print empty line
print_3_by_7_rectangle()

*****
*****
*****
*****
*****

******
******
******
******

*******
*******
*******


Hopefully it is clear that this is not the approach we want to take, as we end up writing almost exactly the same code over and over again. Functions can use **arguments**/**parameters** to add some customizability. The two words technically mean different things (a parameter is the name listed in the function definition, while an argument is the value supplied when the function is called), but you will hear people use them interchangeably to refer to data that we pass to a function. We will redefine the earlier `hello()` function to take in the number of times that we want to print hello on one line:

In [1]:
def hello(n):
    s = ""
    for i in range(n):
        s = s + "hello"
    
    print(s)
    
hello(1)
hello(4)

hello
hellohellohellohello


Here we've named our parameter `n`, and used it within the `range()` function. You can name parameters whatever you want, following the rules of variable naming. Once the function is called, the data that is passed to it is stored in these variables.

This can be a little confusing at first, which can be shown in the following example:

In [3]:
def print_twice(s):
    print(s)
    print(s)
        
x = "my_string"
print_twice(x)
#print(s)

my_string
my_string


NameError: name 's' is not defined

We used the parameter name `s` to store our string, and then used it in the `print()` function calls in the body of our function. However, when we used the function, nothing was called `s`. We used our variable `x` to store the string we wanted to send to the function. The variable `s` only exists in the context of the function we wrote, not outside of it. If you uncomment line 7 above, Python will complain that there is no variable called `s`. This is a very important concept called **scope**, which controls where and for how long a variable is accessible. Here's another example of scope coming into play, this time with a variable we define inside a function rather than a parameter:

In [6]:
def do_something():
    function_variable = 7
    function_variable = function_variable ** 8
    
do_something()
print(function_variable)

NameError: name 'function_variable' is not defined

This code throws a similar error to the one above. The variable named `function_variable` is defined inside our function, and only exists until the end of the function. When we try to use it outside of the function, Python has no idea what we are talking about.
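
A related sketch, using the made-up names `set_local()` and `value`: a variable assigned inside a function is separate from a variable of the same name outside it, so the assignment inside the function leaves the outer one alone.

```python
def set_local():
    value = 99      # this 'value' exists only inside set_local()
    return value

value = 1           # a different variable that happens to share the name
result = set_local()

print(value)        # 1, the outer variable was not touched
print(result)       # 99, the only thing that left the function is the return value
```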

Unfortunately, the same thing does not happen in the opposite case:

In [11]:
def do_something_else():
    print(outside_variable)
    
outside_variable = 7
do_something_else()

7


Here `outside_variable` is known as a **global variable**, because it can be accessed anywhere. Personally I find this behavior to be a huge issue, as it blurs the boundaries of scope. Other programming languages such as C also let functions read global variables, and relying on them is generally discouraged there as well. I would advise that you never use global variables in this way, and only use parameters to pass data to a function. The main reason for this is that it is not clear in every case where the data is coming from. If I defined `outside_variable` 2,000 lines of code before I called `do_something_else()`, it could be difficult to figure out where the number 7 was coming from. To correct the above code, I would make sure to pass the data as a parameter:

In [None]:
def do_something_else(our_param):
    print(our_param)
    
outside_variable = 7
do_something_else(outside_variable)

We get the same exact output, but the flow of information is much clearer here.
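
Here is a slightly larger sketch of why this matters. The names `min_gc`, `is_gc_rich_global()`, and `is_gc_rich()` are invented for illustration. The first function quietly depends on a global, so its answer changes when something elsewhere in the code changes that global; the second receives everything it needs through its parameters.

```python
min_gc = 50   # global variable (hypothetical GC cutoff, in percent)

def is_gc_rich_global(gc_percent):
    # Reads min_gc from outside the function: the result depends on hidden state
    return gc_percent > min_gc

def is_gc_rich(gc_percent, cutoff):
    # Everything the function needs arrives through its parameters
    return gc_percent > cutoff

print(is_gc_rich_global(52.9))   # True while min_gc is 50
min_gc = 60                      # changed somewhere else in the code
print(is_gc_rich_global(52.9))   # False, although the call looks identical
print(is_gc_rich(52.9, 50))      # True, and the cutoff is visible in the call
```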

So far we've only used functions to print things, but a lot of their power comes from the `return` statement. Just like functions in the world of math, Python functions can have inputs and outputs. So far we've only seen inputs, which we've called **arguments** or **parameters**. Output values are usually called **return values**, because we use the keyword `return` to define them. Here's a simple example of a function that adds 4 to a number passed to it:

In [13]:
def add_four(x):
    y = x + 4
    return y

x = 7
print(x)
output = add_four(x)
print(output)

7
11


Notice that the `print()` call occurs outside the function. Our function just does some addition and hands the value back with the `return` keyword. I've chosen to store our intermediate value in a function variable called `y`, but this is not necessary. This code will do exactly the same thing with just one line in the function:

In [4]:
def add_four(x):
    return (x + 4)

x = 7
print(x)
print(add_four(x))

7
11
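
The difference between printing a value and returning it can be sketched as follows. `print_four_more()` is a made-up function for comparison; it displays a result but has no `return` statement, so the call hands back Python's special value `None`.

```python
def print_four_more(x):
    print(x + 4)        # shows the value on screen, but hands nothing back

def add_four(x):
    return x + 4        # hands the value back to the caller

printed = print_four_more(7)   # prints 11 as a side effect
returned = add_four(7)         # prints nothing

print(printed)    # None, because there was no return statement
print(returned)   # 11
```

Only the returned value can be stored in a variable and used in later calculations.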


This is a good time to note that a function can do pretty much anything, not just what we want it to do. The following code is perfectly legal in Python, but does not really make any sense:

In [5]:
def double_my_number(x):
    return 8

num = 14
print(double_my_number(num))

8


We called our function `double_my_number()`, which should have a pretty well-defined behavior. However, the code just returns the number 8 no matter what, and completely ignores the `x` parameter that we pass to it. You always have to be careful to make sure that a function is doing what you want it to do, as Python will not enforce anything but the basic rules about how a function should behave.
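
One way to catch this kind of mistake is to call the function with a few inputs whose answers you already know. A minimal sketch, with a corrected `double_my_number()`:

```python
def double_my_number(x):
    return 2 * x

# A few quick checks with inputs whose answers we already know
print(double_my_number(14) == 28)
print(double_my_number(0) == 0)
print(double_my_number(-3) == -6)
```

Each line prints `True` here. With the broken version that always returns 8, all three checks would print `False`.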

### Exercise

Write a function that calculates the square root of a number, and returns it. Use that function to calculate the square root of 144.

In [26]:
def square_root(square):
    sr = square**0.5
    return sr

x = 144
y = square_root(x)
print(y)

12.0


There is one final "gotcha" that can happen with functions. With simple data types like `int` and `float`, the values are copied into the function as new variables when the function is called. You can see this behavior with the following code:

In [27]:
def add_one(x):
    x = x + 1
    return x

n = 8
m = add_one(n)
print(n)
print(m)

8
9


Even though we explicitly update the value of `x` in our function, this does not update the value of the variable that we passed in as `x` (in this case that variable was called `n`). Only the returned value that we stored in `m` reflects the addition.

When we pass something more complicated like a list, the behavior is different:

In [28]:
def append_to_list(x):
    x.append(7)
    return x

my_list = [4,5,6]
new_list = append_to_list(my_list)
print(my_list)
print(new_list)

[4, 5, 6, 7]
[4, 5, 6, 7]


Here you can see that the list stored in the variable `my_list` gets updated, even though that update only happens in the function. This is different behavior than in the above case where our parameter is an `int`. We can't go into too much detail in this class about why the behavior is different, but you should always make sure that your function is doing what you expect by running a few test cases.
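
If you want to leave the caller's list untouched, one option is to copy the list inside the function before changing it. This is only a sketch; `append_to_copy()` is a made-up name, and `list(x)` is used here to build a new list with the same items.

```python
def append_to_copy(x):
    # list(x) builds a new list, so the caller's list is left alone
    y = list(x)
    y.append(7)
    return y

my_list = [4, 5, 6]
new_list = append_to_copy(my_list)
print(my_list)    # [4, 5, 6]
print(new_list)   # [4, 5, 6, 7]
```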

Now that all of the warnings are out of the way, we can start working on some more advanced examples:

In [30]:
def subtract(x,y):
    z = x - y
    return(z)

In [31]:
subtract(10,6)

4

In [33]:
print(subtract(10,4))
print(subtract(y = 10,x = 4))

6
-6


Two new concepts are introduced in the above code. First, a function can take more than one parameter: `subtract()` takes both an `x` and a `y`. By default, arguments are matched to parameters by position, in the order the parameters are listed in the function definition. For `subtract()`, the first argument is assigned to `x` and the second to `y`, because that is the order in which we typed them in the definition. Second, we can bypass this behavior by explicitly stating which parameter gets which value, as in the line `subtract(y = 10,x = 4)`. The arguments here are in reverse order, but we tell the function which value goes to which parameter. This is usually called **passing by keyword**, and is considered good programming practice.
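
Positional and keyword arguments can also be mixed in one call, as long as the positional ones come first. A short sketch using the same `subtract()` function:

```python
def subtract(x, y):
    return x - y

print(subtract(10, 4))         # 6: matched by position
print(subtract(x=10, y=4))     # 6: matched by keyword
print(subtract(y=4, x=10))     # 6: keyword order does not matter
print(subtract(10, y=4))       # 6: positional first, then keyword

# subtract(x=10, 4) would be a SyntaxError, because a positional
# argument is not allowed to follow a keyword argument.
```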

Let's revisit our rectangle functions from earlier, and generalize them to work with any size:

In [36]:
def print_rectangle(height, width):
    for i in range(height):
        print("*" * width)
        
print_rectangle(5,5)
print("")    #empty line
print_rectangle(6,4)
print("")    #empty line
print_rectangle(height = 7, width = 2)     #passed by keyword
print("")
#print_rectangle(100,100)

*****
*****
*****
*****
*****

****
****
****
****
****
****

**
**
**
**
**
**
**



In [6]:
def return_rectangle(height, width):
    s = ""
    for i in range(height):
        s = s + ("*" * width) + '\n'
        
    return s

y = return_rectangle(5,5)

for i in range(3):
    print(y)

*****
*****
*****
*****
*****

*****
*****
*****
*****
*****

*****
*****
*****
*****
*****



Now we have cleaned up what used to be three functions into just one, which is much more pleasant to use. I called the function three times with different arguments. The last time I passed by keyword. Hopefully you can see how this makes the behavior of the function clearer to whoever reads it.

### Exercises

Write a function that prints a triangle, like the example we did last week. This function should have one parameter that controls how tall the triangle will be. Use this function to print a triangle that is 10 lines tall.

In [8]:
#*
#**
#***
#****

def triangle(height):
    t = ""
    for i in range(height):
        t = t + ("*" * (i + 1)) + '\n'
        
    return t
        
x = triangle(10)
print(x)




*
**
***
****
*****
******
*******
********
*********
**********



Write a function that calculates the GC content of a sequence. 

In [62]:
def GC_content(seq):     
    numGs_seq = seq.count('G')
    numCs_seq = seq.count('C')
    totalGCs_seq = numGs_seq + numCs_seq
    numbases_seq = len(seq)
    GCpercent_seq = 100*float(totalGCs_seq)/numbases_seq
    return GCpercent_seq
    
seq = "ATGACCTTATGACTACTATACAT"
seq2 = "ATGCATGCTAGACTGCGTATGCGT"
gc = GC_content(seq)
print(gc)

30.434782608695652


Write a function that calculates the AT content of a sequence.

In [63]:
def AT_content(seq):
    return 100 - GC_content(seq)

AT_content(seq)

69.56521739130434
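
The functions above assume an uppercase, non-empty sequence. As a sketch of how those assumptions could be relaxed, the hypothetical `gc_content_safe()` below converts the sequence to uppercase first and returns 0.0 for an empty string. Returning 0.0 is an arbitrary choice made for this example; without that check, dividing by a length of zero would raise a `ZeroDivisionError`.

```python
def gc_content_safe(seq):
    seq = seq.upper()            # treat 'g' and 'G' the same way
    if len(seq) == 0:
        return 0.0               # assumption: report 0 for an empty sequence
    total_gc = seq.count("G") + seq.count("C")
    return 100 * total_gc / len(seq)

print(gc_content_safe("ATGACCTTATGACTACTATACAT"))
print(gc_content_safe("atgaccttatgactactatacat"))   # same result in lowercase
print(gc_content_safe(""))
```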

### A more complicated example

This function takes in a list of genes and returns only the genes that don't start with A:

In [64]:
# In this case, the parameter genelist will contain a list of genes
def getGenesDoNotBeginWithA(genelist):
    # Define an empty list variable that will contain the genes that don't begin with A
    genesToReturn = []
    
    #Loop through the genelist
    for gene in genelist:
        # Check if first letter of gene is not A
        if gene[0]!='A':
            # If the gene does not start with A, add this gene to the genesToReturn list
            genesToReturn.append(gene)
    
    return(genesToReturn)

In [65]:
genes_of_interest = ['ABL1','AKT2','BCL2','BRAF','EGFR','KRAS','MAFB','MYC','TET2']
genes_DoNotBeginWithA = getGenesDoNotBeginWithA(genes_of_interest)
print(genes_DoNotBeginWithA)

['BCL2', 'BRAF', 'EGFR', 'KRAS', 'MAFB', 'MYC', 'TET2']


### Advanced Exercises

Modify the function we created above to take in another parameter that defines what letter the gene should not begin with, instead of always using "A":
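
One possible approach is sketched below. The function name `get_genes_not_starting_with()` and the parameter name `letter` are made up for this example, and the sketch assumes every gene name contains at least one character.

```python
def get_genes_not_starting_with(genelist, letter):
    genes_to_return = []
    for gene in genelist:
        # Keep the gene only if its first character differs from the given letter
        if gene[0] != letter:
            genes_to_return.append(gene)
    return genes_to_return

genes_of_interest = ['ABL1', 'AKT2', 'BCL2', 'BRAF', 'EGFR', 'KRAS', 'MAFB', 'MYC', 'TET2']
print(get_genes_not_starting_with(genes_of_interest, "B"))
```

With `"B"` as the letter, this prints `['ABL1', 'AKT2', 'EGFR', 'KRAS', 'MAFB', 'MYC', 'TET2']`.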

Modify the function we created to return every even gene name, and only odd gene names that don't start with the specified letter.

Write a function that returns the average value of a list of numbers. For example,


In [None]:
my_list =  [51, 87, 67, 70, 85, 49, 21, 68, 95, 73, 83, 48, 55, 58,  5, 27, 36, 65, 67,  5,  3, 96, 81, 84]
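
A minimal sketch of one solution, using the built-in `sum()` and `len()` functions. The name `average()` is just a choice made here, and the sketch assumes the list is not empty, since an empty list would cause a `ZeroDivisionError`.

```python
def average(numbers):
    # Total of all values divided by how many values there are
    return sum(numbers) / len(numbers)

my_list = [51, 87, 67, 70, 85, 49, 21, 68, 95, 73, 83, 48, 55, 58, 5, 27, 36, 65, 67, 5, 3, 96, 81, 84]
print(average(my_list))
print(average([1, 2, 3]))   # 2.0
```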

Write a function that will take in a file that contains gene names and returns all genes that begin with the letter B or the letter R. Run this function on `SmokingRelatedGenes.txt`

Write a function that will take in a file of sequences, calculate the GC content of each sequence, and write both into another file. Each line of the output file should contain a sequence and its GC content, separated by a tab.
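
A rough sketch of one way to structure this, under the assumption that the input file holds one sequence per line; the exercise does not specify the layout. The names `gc_content()`, `write_gc_table()`, `in_path`, and `out_path` are invented for this example.

```python
def gc_content(seq):
    total_gc = seq.count("G") + seq.count("C")
    return 100 * total_gc / len(seq)

def write_gc_table(in_path, out_path):
    # Assumes one sequence per line in the input file
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for line in infile:
            seq = line.strip()      # drop the newline and surrounding spaces
            if seq == "":
                continue            # skip blank lines
            outfile.write(seq + "\t" + str(gc_content(seq)) + "\n")
```

It could then be called with two file paths of your choosing, for example `write_gc_table("sequences.txt", "gc_table.txt")`, where both file names are placeholders.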

Write a function that will take in a FASTA file as the input and return a written FASTA file that contains the reverse complement of all the sequences that have a GC content greater than 50%. 

In [96]:

f = open("/home/josh/tmp/biostats.csv")

names = []
sexes = []
ages = []
for i, line in enumerate(f):
    if i != 0:
        s = line.split(",")
        name = s[0]
        sex = s[1]
        age = int(s[2])
        names.append(name)
        sexes.append(sex)
        ages.append(age)

print(ages)

[41, 42, 32, 39, 30, 33, 26, 30, 53, 32, 47, 34, 23, 36, 38, 31, 29, 28]
