# Notebook 3.4: Getting started with functions

The code in this notebook corresponds to notes in lecture 3. In this notebook you can follow along and execute or modify code as we go. All code in this notebook uses the `Python3` standard library. 

### What are functions?
A function is used to perform a task based on a particular input. Functions are the bread and butter of any programming language. We have already used many functions that are built into the objects we have interacted with. For example, we saw that `string` objects have functions to capitalize letters, add spacing, or query their length. Similarly, `list` objects have functions to search for elements in them, or to sort them. The next step in our journey is to begin writing our own functions. This is only an introduction; over time we will continue to learn many new ways to write more advanced functions.

### The basic structure of a function
In Python, functions are defined using the keyword `def`. Optionally, we can have the function return a result by ending it with a `return` statement. This is not required, but it is usually desirable if we want to assign the result of the function to a variable.

In [1]:
## a simple function to add 100
def myfunc(x):
    return x + 100

In [2]:
## let's run our function on an integer
myfunc(200)

300
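
To see why `return` matters, here is a small sketch comparing a function that returns its result with one that only prints it. The names `add_and_return` and `add_and_print` are hypothetical and are not part of the lecture code.

```python
def add_and_return(x):
    # hand the result back to the caller
    return x + 100

def add_and_print(x):
    # show the result on screen, but hand nothing back
    print(x + 100)

stored = add_and_return(200)    # stored now holds 300
missing = add_and_print(200)    # 300 is printed, but missing holds None
```

A function without a `return` statement still hands something back to the caller: the special value `None`.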

### More structure: doc string
So the basic elements are to have an input variable and a return variable. The next important thing is to add some documentation to our function. This reminds us what the function is for, and also allows other users to see how the function works. 

In [3]:
def myfunc2(x):
    "This function adds 100 to an int or float and returns"
    return x + 100

In [4]:
myfunc2(300.3)

400.3

There is no hard-set rule on how to write your documentation string, but there are suggested conventions. Below is one of them, which starts with a brief summary of what the function does, followed by a list of the input types, and finally a listing of the returned values. When writing short scripts for practice, as we are now, the short description above is adequate and a full-length docstring like the one below is not needed. In the future, however, we will be writing full docs.

In [5]:
def myfunc3(x):
    """
    A function that adds 100 and returns
    
    Parameters:
    -----------
    x (int, float):
        An integer or float input.
        
    Returns:
    ---------
    int or float
    """
    return x + 100
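
A docstring is stored on the function itself, so it can be read back later. The sketch below uses a hypothetical function named `add_hundred` and reads its docstring through the `__doc__` attribute; calling the builtin `help()` on the function displays the same text in a formatted way.

```python
def add_hundred(x):
    "This function adds 100 to an int or float and returns the result"
    return x + 100

# the docstring is available as an attribute of the function
doc = add_hundred.__doc__
print(doc)
```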

### More structure: handling exceptions
The next step is to beef up our function a bit. Let's add some conditional statements to it to make sure that users don't misuse the function in a way that we did not intend. For example, this function tries to add an integer to the input, which is fine for an int or float input, but what if the input is some other type? In that case we want our function to raise a warning. In fact, it will already do this by raising a Python `TypeError`. But let's catch the error first and warn the user.

There are two general approaches to handling errors in programming, called `EAFP` and `LBYL`. These stand for "it's easier to ask forgiveness than permission" and "look before you leap". The idea is that you can either write your program to first try to do something and only bother handling exceptions when an error actually occurs, or you can write your code to check that everything is properly formatted, so that no errors will be raised, before it tries to execute anything. In general, the `EAFP` (ask forgiveness after getting caught) method is preferable, but both are used frequently in any program.

#### EAFP
Easier to ask forgiveness is a bit faster because when the type is correct we do not waste time checking whether it is correct or not. We only bother if an exception is raised by the code. To do this we use a `try/except` statement. The indentation of the code is important here: if a `TypeError` is raised anywhere within the indented `try` section, it will be caught by the `except` clause. We store the exception in a variable `e` and print its message for the user.

In [6]:
def myfunc4(x):
    "return x + 100"
    try: 
        return x + 100
    except TypeError as e:
        print("There was an error: {}".format(e))

In [7]:
myfunc4('a')

There was an error: must be str, not int


#### LBYL
Look before you leap checks the type of our input right away. This costs one more operation than the EAFP example, but it also ensures that we know the type of the data, and so helps us to avoid errors a bit better. Here we use a conditional `if/else` statement to check the type of the input.

In [8]:
def myfunc5(x):
    "return x + 100"
    if isinstance(x, (int, float)):
        return x + 100
    else:
        return "There was an error: x is not an int or float"
    

In [9]:
myfunc5('a')

'There was an error: x is not an int or float'
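
Returning the error message as a string means the caller receives a `str` where a number was expected. One alternative, sketched here under the hypothetical name `add_hundred_strict`, is to keep the LBYL type check but raise the `TypeError` ourselves with a clearer message, and let the caller decide how to handle it.

```python
def add_hundred_strict(x):
    "return x + 100, raising a TypeError for non-numeric input"
    # look before you leap: check the type first
    if not isinstance(x, (int, float)):
        raise TypeError(
            "x must be an int or float, got {}".format(type(x).__name__)
        )
    return x + 100

# the caller catches the exception and chooses what to do with it
try:
    result = add_hundred_strict("a")
except TypeError as e:
    result = None
    message = str(e)
```

Here `message` ends up holding `x must be an int or float, got str`, and `result` is left as `None` rather than a misleading string.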

## Multiple inputs 
Of course we often want to write functions that take multiple inputs. This is easy. 

In [10]:
def sumfunc1(arg1, arg2):
    "returns the sum of two input args"
    return arg1 + arg2

In [11]:
sumfunc1(10, 20)

30
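
As a small sketch of how multiple inputs are matched to argument names, the hypothetical function `sum_two` below is called first by position and then by name. Because `+` is also defined for strings, the same function joins two sequences.

```python
def sum_two(arg1, arg2):
    "returns the sum of two input args"
    return arg1 + arg2

by_position = sum_two(10, 20)         # arg1=10, arg2=20
by_name = sum_two(arg2=20, arg1=10)   # order does not matter when named
joined = sum_two("AC", "GT")          # + concatenates strings
```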

### Writing a useful function
Let's write a function to perform the task that we ran in a previous challenge, which is to find the number of differences between two DNA strings. Write a function that will find the five differences between the DNA strings below. Then make your own strings and test it to make sure it works on any arbitrary input sequence. Now that our function is getting more complex, it is useful to add some comment lines to the code to make clear what we are doing.

In [19]:
def seqdiff1(seq1, seq2):
    "return the number of differences between two sequences"
    ## a counter to store the number of diffs
    count = 0

    ## iterate over the index of bases and add to count if diff
    ## (uses the length of seq1, so seq2 must be at least as long)
    for idx in range(len(seq1)):
        if seq1[idx] != seq2[idx]:
            count += 1
    return count

In [20]:
dna1 = "ACAGAGTTGCCAGGAGATGACAGAAAGGTGTGGGTTACAACTCTCTCTAATTTAAGGGCCAATTAACATT"
dna2 = "ACAGAGTCGCCAGGAGATGACAGAAAGGTCTGGGTTACAACTCTCTCTAAAATAAGGGCCAATTAACGTT"

seqdiff1(dna1, dna2)

5

### What if the sequences are different lengths?
Below we add an operation to compute the length of the two sequences and then use `min` to get the shortest one. 
 

In [None]:
def seqdiff2(seq1, seq2):
    """
    return the number of differences between two sequences,
    compares sequences from start to the end of the shortest seq.
    """
    ## a counter to store the number of diffs
    count = 0
    
    ## get the shortest input sequence length
    slen = min(len(seq1), len(seq2))
    
    ## iterate over the index of bases and add to count if diff
    for idx in range(slen):
        if seq1[idx] != seq2[idx]:
            count += 1
    return count

In [None]:
dna1 = "ACAGAGTTGCCAGGAGATGACAGAAAGGTGTGGGTTAC"
dna2 = "ACAGAGTCGCCAGGAGATGACAGAAAGGTCTGGGTTACAACTCTCTCTA"

seqdiff2(dna1, dna2)
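
An alternative sketch of the same idea uses the builtin `zip`, which pairs up items from both sequences and stops at the end of the shorter one, so no explicit length calculation is needed. The name `seqdiff_zip` is hypothetical.

```python
def seqdiff_zip(seq1, seq2):
    "return the number of differences over the shared length of two sequences"
    ## a counter to store the number of diffs
    count = 0

    ## zip stops at the end of the shorter sequence
    for base1, base2 in zip(seq1, seq2):
        if base1 != base2:
            count += 1
    return count

dna1 = "ACAGAGTTGCCAGGAGATGACAGAAAGGTGTGGGTTAC"
dna2 = "ACAGAGTCGCCAGGAGATGACAGAAAGGTCTGGGTTACAACTCTCTCTA"
ndiffs = seqdiff_zip(dna1, dna2)
```

For these two strings `ndiffs` is 2: the extra bases at the end of `dna2` are ignored.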

## Challenges: Write functions and create sequences and test on them. 
For the challenges below try to write proper functions that include a documentation string and comments. 

A. Write a function that will generate and return a random sequence of bases of length N. Hint: for this, use a new package from the standard library that we haven't used yet, called `random`. You will need to import the package and then look for commands that you can use. One that would work is `random.sample`, but there are other ways as well. If you get stuck on how to use it, try asking Google.

In [32]:
#got help from: https://stackoverflow.com/questions/21205836/generating-random-sequences-of-dna 

#import package
import random

#function selects N items from C, G, T, and A
def RandBase(N):
    """
    A function that selects N random bases and returns
    
    Parameters:
    -----------
    N (int):
        An integer input.
        
    Returns:
    ---------
    string
    """
    #initialize string for holding bases
    DNA=""
    #iterate across length of bases desired
    for count in range(N):
        #for each place in the length select one base and add it to the string
        DNA+=random.choice("CGTA")
    return DNA

In [33]:
RandBase(5)

'AGTAA'
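
A more compact sketch of the same task builds the string with `"".join` and a generator expression. The name `random_bases` is hypothetical. Note that `random.sample` draws without replacement, so on its own it cannot return more than four bases from `"ACGT"`; `random.choice` picks one base at a time and can repeat.

```python
import random

def random_bases(n):
    "return a random DNA string of length n"
    # pick one base n times and join the picks into a single string
    return "".join(random.choice("ACGT") for _ in range(n))

seq = random_bases(12)
```

The result differs on every run unless the generator is seeded first with `random.seed`.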

B. Write a function to calculate and return the frequency of As, Cs, Ts and Gs in a sequence. 

In [45]:
#Got help from: http://www.ics.uci.edu/~thornton/cosmos/Lab2/Solutions/Program4.py 

def BaseFreq(dnaSequence):
    """
    A function that counts the frequency of each base and returns
    
    Parameters:
    -----------
    dnaSequence (str):
        A string input.
        
    Returns:
    ---------
    (int, int, int, int)
    """
    
    #initialize counters
    aCount = 0
    cCount = 0
    tCount = 0
    gCount = 0

    #loop through characters in sequence (lower-cased so that 'A' and 'a' both count)
    #and increment the relevant counter
    for c in dnaSequence.lower():
        if c == 'a':
            aCount = aCount + 1
        elif c == 'c':
            cCount = cCount + 1
        elif c == 't':
            tCount = tCount + 1
        elif c == 'g':
            gCount = gCount + 1

    #return  the count for each base
    return aCount, cCount, tCount, gCount

#would be ideal to get a label for each of the bases ... 

In [46]:
BaseFreq("atcgatgagagctagcgata")

(7, 3, 4, 6)
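
One way to get a label for each base, as the comment above wishes for, is to return a dictionary instead of a tuple. This is only a sketch: the name `base_freq_labeled` is hypothetical, and it uses the string method `count` rather than a loop with four counters.

```python
def base_freq_labeled(dna_sequence):
    "return a dict mapping each base to its count in the sequence"
    # lower-case first so that 'A' and 'a' are counted together
    seq = dna_sequence.lower()
    return {base: seq.count(base) for base in "acgt"}

counts = base_freq_labeled("atcgatgagagctagcgata")
```

Each count can then be looked up by name, for example `counts["a"]` is 7 and `counts["g"]` is 6 for this sequence.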

C. Write a function to concatenate (join end-to-end) two sequences and return it

In [60]:
def Conc(str1, str2):
    """
    A function that takes any two inputs and returns them concatenated as a string
    
    Parameters:
    -----------
    str1, str2 (str, int, float):
        A string, integer, or float input.
        
    Returns:
    ---------
    (str)
    """
    #convert str1 to string so that it concatenates properly
    str1a = str(str1)
    #convert str2  to string so that it concatenates properly
    str2a = str(str2)
    #concatenate strings
    out = str1a + str2a
    #return concatenation
    return out

In [62]:
Conc('abc','def')

'abcdef'

D. Write a function to take two sequences of different lengths and return both trimmed down to be the same length. 

In [68]:
def same_length(seq1, seq2):
    """
    A function that takes any two sequences and returns both trimmed to the same length
    
    Parameters:
    -----------
    seq1, seq2 (str, int, float):
        A string, integer, or float input.
        
    Returns:
    ---------
    (str, str)
    """
    #convert sequences to strings for proper handling
    seq1 = str(seq1)
    seq2 = str(seq2)
    
    #find length of each string
    len1 = len(seq1)
    len2 = len(seq2)
    
    #compare lengths and trim the longer string to the length of the shorter one
    if len1 < len2:
        out1 = seq1
        out2 = seq2[:len1]
    elif len2 < len1:
        out1 = seq1[:len2]
        out2 = seq2
    else:
        out1 = seq1
        out2 = seq2
    
    return out1, out2

In [69]:
same_length(2222,33)

('22', '33')
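
Because slicing past the end of a string simply returns the whole string, the three-way comparison can be collapsed into a single slice on each input. The sketch below, under the hypothetical name `trim_to_shared`, assumes both inputs are already strings.

```python
def trim_to_shared(seq1, seq2):
    "return both sequences cut to the length of the shorter one"
    # length of the shorter sequence
    shared = min(len(seq1), len(seq2))
    # slicing the shorter sequence to its own length leaves it unchanged
    return seq1[:shared], seq2[:shared]

trimmed = trim_to_shared("ACGTACGT", "ACG")
```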

E. Write a function to return the proportion of bases across the shared length between two sequences that are the same. In this function, use the function that you created in `D` above to convert the sequences to be the same length (even if this is not necessarily the most efficient way to complete this task). 

In [91]:
def prop_bases(bases1,bases2):
    """
    A function that takes any two sequences and returns the proportion of overlap in base composition between them
    
    Parameters:
    -----------
    bases1, bases2 (str):
        A string input.
        
    Returns:
    ---------
    (float)
    """
    #cut the sequences to the same length
    bases = same_length(bases1, bases2)
    
    #get the frequency of each base letter from each sequence
    freqs1 = BaseFreq(bases[0])
    freqs2 = BaseFreq(bases[1])
    
    #get the value for the smaller number of bases of that letter (this is the max overlap of that base)
    a = min(freqs1[0], freqs2[0])
    c = min(freqs1[1], freqs2[1])
    t = min(freqs1[2], freqs2[2])
    g = min(freqs1[3], freqs2[3])
    
    #add the overlapping bases of each type and divide by the length of the sequence
    prop = (a + c + t + g) / len(bases[0])
    
    #return this proportion
    return prop

In [92]:
prop_bases("ccctga","gctactg")

0.8333333333333334
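
The function above measures how much the base composition of the two trimmed sequences overlaps. If the challenge is read instead as asking for the proportion of positions at which the two sequences carry the same base, a sketch could look like the one below. The name `prop_identical` is hypothetical, the trimming is done inline so that the block runs on its own, and returning `0.0` for an empty shared length is an assumption made to avoid dividing by zero.

```python
def prop_identical(bases1, bases2):
    "return the proportion of shared positions where the bases match"
    # length of the region that both sequences cover
    shared = min(len(bases1), len(bases2))
    if shared == 0:
        return 0.0

    # zip pairs up bases position by position, stopping at the shorter input
    matches = sum(1 for b1, b2 in zip(bases1, bases2) if b1 == b2)
    return matches / shared

prop = prop_identical("ccctga", "gctactg")
```

For these inputs only the second position matches (`c` and `c`) out of six shared positions, so `prop` is 1/6, or about 0.167.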

## Finished
Save this notebook and close it. Push a copy of the notebook to the `assignment/` directory with your name in the filename like `./assignment/<myname>-3.4.ipynb`. 