<h1 id="toctitle">Writing functions</h1>
<ul id="toc"/>

##Writing a simple function

Look at the AT content code from the start of the course:

In [5]:
from __future__ import division

my_dna = "ACTGATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')
at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))

AT content is 0.678571428571


There's one line to define the input:

```python
my_dna = "ACTGATACATATATATCGATGCGTTCAT"
```

and one line to define the output:

```python
print("AT content is " + str(at_content))
```

and only four lines to do the actual work:

```python
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')
at_content = (a_count + t_count) / length
```

We can turn this bit of code into a function that we can call:

In [13]:
# define a new function
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content

# call the function
at = get_at_content("ACTGATACATATATATCGATGCGTTCAT")
print(at)

0.678571428571


Things to note about the definition line:

- the definition line starts with `def`
- next comes the name of our new function (arbitrary)
- function argument names go in brackets (also arbitrary)
- def line ends with a colon
- body is indented
- use the argument names inside the body
- body ends with `return` to return the result

__Defining__ a function doesn't cause it to run; that only happens when we __call__ it. 

Argument variables (e.g. `dna`) only exist inside the function.

When we write a function we don't have to worry about how it will be used - we just need to know the __inputs__ (arguments) and __output__.

When we use a function we don't have to worry about how it works inside - we just need to know the __inputs__ and __output__.
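
To see that argument and local variables really do stay inside the function, here is a small sketch. It assumes Python 3 (or the `from __future__ import division` line from the first example) and a fresh session in which no variable called `length` has been defined yet; the `leaked` flag is just a name made up for this illustration.

```python
def get_at_content(dna):
    # dna, length, a_count and t_count only exist while the function runs
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    return (a_count + t_count) / length

at = get_at_content("ACTGATACATATATATCGATGCGTTCAT")

# outside the function, the local variable 'length' is not defined
try:
    print(length)
    leaked = True
except NameError:
    leaked = False
    print("length is not defined outside the function")
```

The only thing that comes out of the function is the return value, which we stored in `at`.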

###Calling our new function

Once we've written our function, we can use it in lots of different ways:


Calculate the AT content of a sequence in a file
```python
# strip the trailing newline so it isn't counted in the length
dna = open('dna.txt').read().rstrip("\n")
at = get_at_content(dna)
```

Print the AT content for a given sequence
```python
print(get_at_content("ATTAGCGTAGC"))
```

Write the AT content for a sequence to a file
```python
result = open('output.txt', 'w')
# write() needs a string, so convert the number first
result.write(str(get_at_content('ACTGTCGA')))
result.close()
```

This separation between how a function works and how it is used is very valuable - it's called __encapsulation__.


##Things to avoid when writing functions
###No input

We can write a function that relies on variables defined outside it rather than arguments:

In [15]:
dna = "ATCGCTAGCTGC"
def get_at_content():
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content

at = get_at_content()
print(at)

0.416666666667


This breaks encapsulation - now we have to know which variables are defined outside the function in order to write it, and anyone calling it has to set them first.
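
One way to see the problem is to change the outside variable between two calls. This is only a sketch (assuming Python 3 division; the names `first` and `second` are made up for the illustration), but it shows the same call giving two different answers:

```python
def get_at_content():
    # relies on a variable called dna defined outside the function
    return (dna.count('A') + dna.count('T')) / len(dna)

dna = "ATCGCTAGCTGC"
first = get_at_content()

# reassigning dna anywhere in the program silently changes the result
dna = "AAAA"
second = get_at_content()

print(first)
print(second)
```

Nothing in the call `get_at_content()` tells us which sequence was measured, which makes this kind of function hard to reuse and hard to debug.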

###No output

We can write a function that prints the value instead of returning it:

In [16]:
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    print(at_content)

get_at_content("ATGCGTATTTGAGCA")

0.6


but this also breaks encapsulation - we have to know how the function will be used in order to write it. 

A good rule of thumb: _information gets in by arguments, information gets out by return value_.
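
A quick sketch of why the printing version is less useful (assuming Python 3 division; `print_at_content` is a hypothetical name used here only to keep the two versions apart). A function with no `return` gives back `None`, so there is nothing to store or compare:

```python
def print_at_content(dna):
    at_content = (dna.count('A') + dna.count('T')) / len(dna)
    print(at_content)  # no return statement, so the call evaluates to None

def get_at_content(dna):
    at_content = (dna.count('A') + dna.count('T')) / len(dna)
    return at_content

printed = print_at_content("ATGCGTATTTGAGCA")
returned = get_at_content("ATGCGTATTTGAGCA")

print(printed)         # None: the value was shown on screen but not handed back
print(returned > 0.5)  # the returned value can be used in further calculations
```

With the returning version, the caller decides whether to print the value, write it to a file or use it in a condition.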

##Improving our function
###Adding another argument

One problem currently is that we get too many decimal places:

In [20]:
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content

get_at_content("ATGCGTATTTTTGAGCA")

0.6470588235294118

We can get round this by calling the `round()` function on the answer before we return it:

In [28]:
round(1.23456789, 5)

1.23457

In [22]:
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, 2)
                 
get_at_content("ATGCGTATTTTTGAGCA")

0.65

What if we want more/fewer decimal places? Make the argument to `round()` an argument of our function:

In [25]:
def get_at_content(dna, sig_figs):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, sig_figs)

get_at_content("ATGCGTATTTTTGAGCA", 2)

0.65

In [26]:
get_at_content("ATGCGTATTTTTGAGCA", 4)

0.6471
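
One thing to be aware of: the second argument to `round()` is a number of decimal places, not a number of significant figures, so the name `sig_figs` is slightly loose. A small sketch with two made-up numbers (`small` and `large` are just illustrative names):

```python
# both numbers are rounded to two digits after the decimal point
small = round(0.0647058, 2)
large = round(64.7058, 2)

print(small)  # only one significant figure survives
print(large)  # four significant figures survive
```

For AT content, which always lies between 0 and 1, the difference rarely matters, but it would if the function returned a percentage instead.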

###Default values for function arguments

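In many cases, we don't really care about picking the number of decimal places (despite its name, `sig_figs` is passed straight to `round()`, which counts decimal places rather than significant figures). We can add a default to the definition: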

In [29]:
def get_at_content(dna, sig_figs=2):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, sig_figs)

In [31]:
get_at_content("ATGCGTATTTTTGAGCA", 4)

0.6471

In [32]:
get_at_content("ATGCGTATTTTTGAGCA")

0.65

###Keyword arguments

In all the examples above, we supply the arguments in the same order as the definition. If we want to use a different order (or just be more explicit) we can use keyword arguments:

In [33]:
get_at_content(sig_figs=3, dna="ATGCGTATTTTTGAGCA")

0.647
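
Positional and keyword arguments can also be mixed in one call, as long as the positional ones come first. A short sketch, assuming Python 3 division and the function definition from the previous section (the names `a`, `b` and `c` are made up for the illustration):

```python
def get_at_content(dna, sig_figs=2):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, sig_figs)

# positional first, then keyword
a = get_at_content("ATGCGTATTTTTGAGCA", sig_figs=3)
# all keywords, in definition order
b = get_at_content(dna="ATGCGTATTTTTGAGCA", sig_figs=3)
# all keywords, in a different order
c = get_at_content(sig_figs=3, dna="ATGCGTATTTTTGAGCA")

print(a == b == c)

# get_at_content(sig_figs=3, "ATGCGTATTTTTGAGCA") would be a SyntaxError,
# because a positional argument cannot follow a keyword argument
```

Naming the argument at the call site is often worth it purely for readability: `sig_figs=3` says more than a bare `3`.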

##Testing functions

When we're working on a new function, we might want to test if it's working correctly. Use `assert` with a condition:

In [35]:
assert get_at_content("A") == 1
assert get_at_content("G") == 0
assert get_at_content("ATGC") == 0.5
assert get_at_content("AGG") == 0.33
assert get_at_content("AGG", 1) == 0.3
assert get_at_content("AGG", 5) == 0.33333

If the condition in an `assert` is false, the program stops with an `AssertionError`:

In [36]:
assert get_at_content("G") == 1

AssertionError: 
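
The error above is empty, which isn't much help when tracking down a problem. An `assert` can take an optional message after a comma. Here is a sketch, assuming Python 3 division and the function definition from earlier; the `try`/`except` and the `message` variable are only there so that the example carries on after the failure and we can look at the text:

```python
def get_at_content(dna, sig_figs=2):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, sig_figs)

try:
    # the text after the comma becomes the error message
    assert get_at_content("G") == 1, "expected 1 but got " + str(get_at_content("G"))
except AssertionError as error:
    message = str(error)
    print(message)
```

In normal use we would leave out the `try`/`except` and let the failed assertion stop the program.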

Assertions are good for:

- providing some documentation about the behaviour of the function
- reassuring you that your function is giving the right answer
- letting you know if you break the function 
- demonstrating how the function can be used to other people

and also for

- making it easy for me to write exercises!

---

##Exercises

###Amino acid percentage, part one

Write a function that takes two arguments – a protein sequence and an amino acid residue code – and returns the percentage of the protein that the amino acid makes up. Use the following assertions to test your function:

```python
assert my_function("MSRSLLLRFLLFLLLLPPLP", "M") == 5
assert my_function("MSRSLLLRFLLFLLLLPPLP", "r") == 10
assert my_function("MSRSLLLRFLLFLLLLPPLP", "L") == 50
assert my_function("MSRSLLLRFLLFLLLLPPLP", "Y") == 0
```

Remember to change the name of the function in the `assert` statements to match your function name. 

###Amino acid percentage, part two

Modify the function from part one so that it accepts a list of amino acid residues rather than a single one. If no list is given, the function should return the percentage of hydrophobic amino acid residues (A, I, L, M, F, W, Y and V). Your function should pass the following assertions:

```python
assert my_function("MSRSLLLRFLLFLLLLPPLP", ["M"]) == 5
assert round(my_function("MSRSLLLRFLLFLLLLPPLP", ['M', 'L']), 8) == 55 
assert my_function("MSRSLLLRFLLFLLLLPPLP", ['F', 'S', 'L']) == 70
assert my_function("MSRSLLLRFLLFLLLLPPLP") == 65
```

###Base counter

Write a function that will take a DNA sequence along with an optional threshold and return `True` or `False` to indicate whether the DNA sequence contains a high proportion of undetermined bases (i.e. not A, T, G or C).
