# Notebook 6.3: Getting started with functions

### What are functions?
A function performs a task based on a particular input. Functions are the bread and butter of any programming language. We have already used many functions that are built in to the objects we have interacted with. For example, we saw that `string` objects have functions to capitalize letters, add spacing, or query their length. Similarly, `list` objects have functions to search for elements or to sort them. The next step in our journey is to begin writing our own functions. This is only an introduction; over time we will continue to learn new ways to write more and more advanced functions.

### The basic structure of a function
In Python, functions are defined using the keyword `def`. Optionally, we can have the function return a result by ending it with a `return` statement. This is not required, but it is usually desirable if we want to assign the result of the function to a variable. If no return statement is added, the function will return an object called `None`. This is a special built-in value in Python, like `True` and `False`.

In [1]:
# a simple function to add 100
def myfunc(x):
    return x + 100

In [2]:
# let's run our function on an integer
myfunc(200)

300
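
To see what happens without a `return` statement, here is a minimal sketch. The function `no_return` is a hypothetical example, not part of the original notebook: it computes a value but never returns it, so the call evaluates to `None`.

```python
# hypothetical example: a function with no return statement
def no_return(x):
    x + 100  # the sum is computed and then discarded

result = no_return(200)
print(result)           # None
print(result is None)   # True
```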

### More structure: doc string
So the basic elements are to have an input variable and a return variable. The next important thing is to add some documentation to our function. This reminds us what the function is for, and also allows other users to see how the function works. A documentation string (docstring) is simply a string object put directly into the code like below. It does not affect how the function functions at all.

In [3]:
def myfunc2(x):
    "This function adds 100 to an int or float and returns the result"
    return x + 100

In [12]:
myfunc2(300.3)

400.3
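
Although the docstring does not change how the function runs, Python does store it on the function object, which is how documentation tools find it. The sketch below uses a hypothetical function name, `add_hundred`, to show one way to read it back.

```python
# hypothetical example function with a one-line docstring
def add_hundred(x):
    "Adds 100 to an int or float and returns the result"
    return x + 100

# the docstring is stored in the function's __doc__ attribute
print(add_hundred.__doc__)
```

In a notebook, calling `help(add_hundred)` or typing `add_hundred?` should display the same text along with the function signature.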

There is no hard-set rule on how to write your documentation string, but there are suggested conventions. Below is one of them, which starts with a brief summary of what the function does, followed by a list of the input types, and finally a listing of the returned values. When writing short scripts for practice, as we are now, the short description above is adequate and a full-length docstring like the one below is not needed. But in the future we will be writing full docs.

In [5]:
def myfunc3(x):
    """
    A function that takes a numeric input and adds 
    100 to it and returns it.
    
    Parameters:
    -----------
    x (int, float):
        An integer or float input.
        
    Returns:
    ---------
    int or float
    """
    return x + 100

### More structure: handling exceptions
The next step is to beef up our function a bit. Let's add some conditional statements to make sure that users don't misuse the function in a way that we did not intend. For example, this function tries to add an integer to the input, which is fine for an int or float input. But if the input is some other type, we want our function to raise a warning or fail gracefully. In fact, it already does this by raising a Python TypeError. But let's catch the error first and warn the user.

There are two general concepts for catching errors in programming, called `EAFP` and `LBYL`. These stand for "it's easier to ask forgiveness than permission" and "look before you leap". The idea is that you can either write your program to first try to do something and only bother handling exceptions when you get caught with an error, or, alternatively, write your code to check that everything is properly formatted and that no errors will be raised before it tries to execute any code. In general, the `EAFP` (ask forgiveness after getting caught) method is preferable, but both are used frequently in any program.

#### EAFP
Asking forgiveness is a bit faster because, when the type is correct, we do not waste time checking it. We only do extra work if the code raises an exception. We use a statement called a `try/except` statement. The indentation of the code is important here: if a TypeError is raised anywhere within the indented `try` section, it will be caught by the `except` clause. We capture the exception in a variable `e` and print its message for the user.

In [6]:
def myfunc4(x):
    "return x + 100"
    try: 
        return x + 100
    except TypeError as e:
        print("There was an error: {}".format(e))

In [7]:
myfunc4('a')

There was an error: can only concatenate str (not "int") to str


In the example above we caught the exception and did not re-raise it, so it was suppressed. We could alternatively catch it and re-raise it so that the code stops running at the point of the exception. The only difference a try/except clause makes in that case is that we get to do something after the exception occurs and before it is re-raised. Here we print a custom message, but it could be something else as well.

In [8]:
def myfunc4raise(x):
    "return x + 100"
    try: 
        return x + 100
    except TypeError as e:
        print("There was an error: {}".format(e))
        raise

In [9]:
myfunc4raise('a')

There was an error: can only concatenate str (not "int") to str


TypeError: can only concatenate str (not "int") to str

#### LBYL
Look before you leap checks the type of our input right away. This costs one more operation than the EAFP example, but it also ensures that we know the type of the data, and so helps us to avoid errors a bit better. Here we use a conditional `if/else` statement to check the type of the input. In general you should not worry about the time it takes to perform a conditional check, since it is very fast. I only mention speed as a reason that people might choose between the two approaches.

In [13]:
def myfunc5(x):
    "return x + 100"
    if isinstance(x, (int, float)):
        return x + 100
    else:
        return "There was an error: x is not an int or float"    

In [14]:
myfunc5('a')

'There was an error: x is not an int or float'
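
Returning an error message as a string means the caller receives a normal-looking value and has to inspect it to notice the problem. One alternative, sketched here under the assumption that we would rather stop execution, is to check the type first and raise a `TypeError` ourselves. The name `myfunc5_strict` is hypothetical and not part of the original notebook.

```python
# hypothetical variant of myfunc5 that raises instead of returning a message
def myfunc5_strict(x):
    "return x + 100, or raise a TypeError for non-numeric input"
    # look before you leap: check the type before doing any arithmetic
    if not isinstance(x, (int, float)):
        raise TypeError(
            "x must be an int or float, got {}".format(type(x).__name__)
        )
    return x + 100

print(myfunc5_strict(200))   # 300

# the caller can still catch the error if it wants to continue
try:
    myfunc5_strict('a')
except TypeError as e:
    print("There was an error: {}".format(e))
```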

## Multiple inputs 
Of course we often want to write functions that take multiple inputs. This is easy. 

In [15]:
def sumfunc1(arg1, arg2):
    "returns the sum of two input args"
    return arg1 + arg2

In [16]:
sumfunc1(10, 20)

30
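
As a small illustration of how multiple inputs are matched to the parameter names, the sketch below calls the same function by position and by name. It repeats the definition of `sumfunc1` so that it runs on its own.

```python
def sumfunc1(arg1, arg2):
    "returns the sum of two input args"
    return arg1 + arg2

# positional call: values are matched to arg1, arg2 by their order
print(sumfunc1(10, 20))             # 30

# keyword call: values are matched by name, so order does not matter
print(sumfunc1(arg2=20, arg1=10))   # 30
```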

### Writing a useful function
Let's write a function to perform the task that we ran in a previous challenge, which is to find the number of differences between two DNA strings. Here we write a function that counts the differences between the DNA strings below. Try editing the DNA strings to add changes and make sure that it still works. Our function is now complex enough that it is useful to add comment lines to the code to make clear what we are doing. This is done using the # comment character.

In [17]:
def seqdiff1(seq1, seq2):
    "return the number of differences between two sequences"
    # a counter to store the number of diffs
    count = 0

    # iterate over the index of bases and add to count if diff
    for idx in range(len(seq1)):
        if seq1[idx] != seq2[idx]:
            count += 1
    return count

In [20]:
dna1 = "TCAAAGTTGCCAGGAGATGACAGAAAGGTGTGGGTTACAACTCTCTCTAATTTAAGGGCCAATTAACATT"
dna2 = "ACAGAGTCGCCAGGAGATGACAGAAAGGTCTGGGTTACAACTCTCTCTAAAATAAGGGCCAATTAACGTT"

seqdiff1(dna1, dna2)

7

### What if the sequences are different lengths?
Below we add an operation to compute the length of the two sequences and then use `min` to get the shortest one, and we only iterate over the length of the shortest sequence.
 

In [21]:
def seqdiff2(seq1, seq2):
    """
    return the number of differences between two sequences,
    compares sequences from start to the end of the shortest seq.
    """
    # a counter to store the number of diffs
    count = 0
    
    # get the shortest input sequence length
    slen = min([len(i) for i in (seq1, seq2)])
    
    # iterate over the index of bases and add to count if diff
    for idx in range(slen):
        if seq1[idx] != seq2[idx]:
            count += 1
    return count

In [22]:
dna1 = "ACAGAGTTGCCAGGAGATGACAGAAAGGTGTGGGTTAC"
dna2 = "ACAGAGTCGCCAGGAGATGACAGAAAGGTCTGGGTTACAACTCTCTCTA"

seqdiff2(dna1, dna2)

2
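
The same comparison can also be sketched with the built-in `zip`, which pairs up items by position and stops at the end of the shorter input, so no explicit length calculation is needed. The function name `seqdiff_zip` is hypothetical; this is one possible alternative, not a replacement for the version above.

```python
# hypothetical alternative to seqdiff2 using zip
def seqdiff_zip(seq1, seq2):
    "return the number of differences over the shared length of two sequences"
    count = 0
    # zip pairs bases by position and stops at the end of the shorter seq
    for base1, base2 in zip(seq1, seq2):
        if base1 != base2:
            count += 1
    return count

dna1 = "ACAGAGTTGCCAGGAGATGACAGAAAGGTGTGGGTTAC"
dna2 = "ACAGAGTCGCCAGGAGATGACAGAAAGGTCTGGGTTACAACTCTCTCTA"
print(seqdiff_zip(dna1, dna2))   # 2
```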

## Challenges: 

<div class="alert alert-success">
Write functions and create sequences and test on them. 
For the challenges below try to write proper functions that include a documentation string and comments. 
</div>

A. Write a function that will generate and return a random sequence of "A", "C", "G", or "T" characters of length N as a string. Hint: for this, use the `random` package from the standard library.

In [59]:
import random

def rand_dna(x):
    "Return a random DNA sequence of length x"
    dna = []
    for i in range(x):
        dna.append(random.choice("ACGT"))   
    return ''.join(dna)

In [60]:
rand_dna(15)

'AGGTGTCTACAATCA'

B. Write a function to calculate and return the frequency of As, Cs, Ts and Gs in a sequence string from challenge A. 

In [61]:
def which_base(x):
    "Calculate frequency of each nucleotide in DNA sequence"
    # use a name other than `dict` to avoid shadowing the built-in type
    counts = {}
    for i in x:
        if i not in counts:
            counts[i] = 1
        else:
            counts[i] += 1
    return counts

In [62]:
dna = rand_dna(15)
which_base(dna)

{'G': 2, 'C': 6, 'T': 4, 'A': 3}

C. Write a function to concatenate (join end-to-end) two sequences and return the result.

In [63]:
def join_dna(seq1, seq2):
    "Join two DNA sequences"
    return seq1 + seq2

In [64]:
dna1 = "ACGTGCTAG"
dna2 = "AGCTCGTGA"

join_dna(dna1, dna2)

'ACGTGCTAGAGCTCGTGA'

D. Write a function to take two sequences of different lengths and return both trimmed down to be the same length. 

In [88]:
def trim_dna(seq1, seq2):
    "Return two DNA sequences of different lengths trimmed down to length of shortest sequence"
    slen = min([len(i) for i in (seq1, seq2)])
    seq1 = seq1[:slen]
    seq2 = seq2[:slen]
    return seq1, seq2

In [89]:
dna3 = "ACTTTTTG"
dna4 = "ACTG"
trim_dna(dna3, dna4)

('ACTT', 'ACTG')

E. Write a function to return the proportion of bases across the shared length between two sequences that are the same. In this function, use the function that you created in `D` above to convert the sequences to be the same length (even if this is not necessarily the most efficient way to complete this task). So this function should include within it a call of your previous function.

In [107]:
def same_dna_prop(seq1, seq2):
    "Calculate the proportion of bases that are the same across the shared length of two DNA sequences"
    
    trim1, trim2 = trim_dna(seq1, seq2)
    
    count = 0
    
    # iterate over the index of bases and add to count if they match
    for idx in range(len(trim1)):
        if trim1[idx] == trim2[idx]:
            count += 1
            
    return count/len(trim1)

In [108]:
same_dna_prop(dna3, dna4)

0.75

## Deren's solutions

In [96]:
def random_dna(length):
    "returns a random string of A,C,G,T characters of specified length"
    return "".join(random.choices("ACGT", k=length))

In [97]:
# must be a string, not a list.
random_dna(20)

'TAAGCGCCGGGCCTGAGAAT'

In [98]:
def get_frequency(dnastring):
    """
    Calculates the frequency of each base in a dna string
    and returns as a dictionary (good choice of object)
    """
    freqs = {}
    for char in "ACGT":
        freqs[char] = dnastring.count(char) / len(dnastring)
    return freqs

In [99]:
get_frequency(random_dna(200))

{'A': 0.295, 'C': 0.25, 'G': 0.19, 'T': 0.265}
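
For comparison, here is a sketch of the same calculation using `collections.Counter` from the standard library, which tallies every character in a single pass. The function name `get_frequency_counter` is hypothetical, and the sketch assumes a non-empty input string.

```python
from collections import Counter

# hypothetical alternative to get_frequency using collections.Counter
def get_frequency_counter(dnastring):
    "returns a dict mapping each base in 'ACGT' to its frequency in dnastring"
    counts = Counter(dnastring)
    # a Counter returns 0 for missing keys, so absent bases get frequency 0.0
    return {base: counts[base] / len(dnastring) for base in "ACGT"}

print(get_frequency_counter("AACGTTTT"))
# {'A': 0.25, 'C': 0.125, 'G': 0.125, 'T': 0.5}
```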

In [100]:
def concat_seqs(dnastring1, dnastring2):
    """
    concatenates two dna strings.
    """
    return dnastring1 + dnastring2

In [101]:
concat_seqs(random_dna(10), random_dna(15))

'GCCACCGAAATTTGTTAGGGACTAT'

In [102]:
def trim_seqs_equal_len(dnastring1, dnastring2):
    """
    trims two dna strings to the length of the shorter one.
    """
    minlen = min(len(i) for i in (dnastring1, dnastring2))
    return dnastring1[:minlen], dnastring2[:minlen]

In [103]:
trim_seqs_equal_len(random_dna(50), random_dna(5))

('TCACG', 'ATCGG')

In [104]:
def get_overlap_freq(seq1, seq2):
    """
    calculates proportion of matching bases over length of overlap
    of two sequences.
    """
    trim1, trim2 = trim_seqs_equal_len(seq1, seq2)
    match = 0
    for i, j in zip(trim1, trim2):
        if i == j:
            match += 1
    return match / len(trim1)

In [105]:
# two random sequences of diff lengths
seq1 = random_dna(100)
seq2 = random_dna(50)

In [106]:
# their overlap freq.
get_overlap_freq(seq1, seq2)

0.34

## Finished
Save this notebook and close it. Commit and push changes to your repo.