### Lecture 13 2018-10-02: functions, modules

This worksheet accompanies the lecture notes.



## Function Definition, Abstraction ##

Functions are collections of code that transform input objects into return objects. 
Functions can return *any* object (lists, tuples, strings, even functions).
They *encapsulate* code, so that you can *reuse* code. 
Object methods are functions that exist inside an object. But functions need not be inside objects.

If you find yourself copying and pasting the same code, but with different parameters, consider defining and using a function.
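
As a small illustration of the point that functions can return any object, the sketch below returns a tuple from one function and a function from another. The names `min_max` and `make_counter` are hypothetical, chosen only for this example.

```python
def min_max(values):
    '''Return the smallest and largest entries of values as a tuple.'''
    return (min(values), max(values))

def make_counter(code):
    '''Return a new function that counts occurrences of code in a sequence.'''
    def counter(sequence):
        return sequence.count(code)
    return counter

low, high = min_max([3, 1, 4, 1, 5])
count_g = make_counter('G')   # count_g is itself a function
print(low, high)              # 1 5
print(count_g('GATTACAGG'))   # 3
```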

### creating functions

Use *def* to begin a function definition and *return* to return a value. 
Remember that indentation defines which code belongs in this definition, so *return* can appear anywhere inside the indented body.

To define a function f that takes input p and returns r, where both p and r can be a variable, list, tuple, function, etc., the syntax is:

```
def f(p):
    '''
    long comments here to describe f, p, r
    optional, but STRONGLY recommended
    '''
    ...your code here...
    return(r)
```

To use a function f with input parameters p:

>f(p)


Return can be formatted with or without parentheses.
That is, to return *value* you can use either

>return(value)

or 
>return value

You can have multiple *return* statements; the function exits at the first one it reaches.

In [2]:
def is_even(n):
    if n%2 == 0: return True
    else: return False

If there is no *return*, the function returns *None*. 
This can be confusing, so always return a value explicitly. 
If you do not want to return a value, then 

>return None
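
A quick check of what Python hands back when a function finishes without reaching a *return*. The function names are made up for this sketch.

```python
def no_return(n):
    '''Compute a value but never return it.'''
    doubled = 2 * n

def explicit_none(n):
    '''Same computation, but the missing result is stated explicitly.'''
    doubled = 2 * n
    return None

print(no_return(3))       # None
print(explicit_none(3))   # None
```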

### Example of functions 

Suppose we have

In [3]:
read_codes = ["A","C","G","T","N","Y","U"]

nucleotides = read_codes[:4]
ambiguity_codes = read_codes[4:]

sequence="CGCAGCNNYYGCATTUUNAAGCYACTCCGYYCCTGGGGAGTNNNTTGAA"

Suppose we want to write a function that takes a nucleotide sequence and a set of character codes for reads, and returns a list with corresponding counts of the number of times each code appears in the sequence.

For example, we want code such that 

>count_codes('ACCGN', 'AGCTNV') returns [1, 1, 2, 0, 1, 0]

*Always* use good names for functions and input/output parameters, and include a description of the function in long comments just after the definition. 


In [4]:
def count_codes(sequence, codes):
    '''
    Input: a string of characters, sequence, and a list of codes, codes
    
    Output: a list [c1, ... ci, ... cn] where ci is the number of times the ith entry in codes appears in sequence
    '''
    counts=[]
    for next_code in codes:
        counts.append( sequence.count(next_code) )
    return(counts) 

In [None]:
counts = count_codes( sequence, nucleotides )

print( '\ncount_codes({}, {}) returns {}'.format(sequence, nucleotides, counts) )

print( '\ncount_codes( sequence, read_codes ) returns {}'.format(count_codes( sequence, read_codes)))

print( '\n{0} has {1} nucleotides and {2} ambiguity codes, where nucleotides = {3} and ambiguity codes = {4}'.format(
    sequence, 
    count_codes( sequence, nucleotides ), 
    count_codes( sequence, ambiguity_codes ), 
    nucleotides, 
    ambiguity_codes) 
    )

#### better code

A more Pythonic version uses a list comprehension and returns a dict, so individual counts are easier to look up.

In [None]:
def count_codes(sequence, codes):
    '''
    Input: a string, list, or tuple of characters, sequence, and an iterable of codes, codes
    
    Output: a dict {c1: n1, c2: n2, ...} where ni is the number of times ci in codes appears in sequence
    '''
    counts=[sequence.count(char) for char in codes]
    return dict(zip(codes, counts))

# example
counts = count_codes( sequence, nucleotides )
print(counts)
print(counts['G'])

### reusing the code

To get tab-delimited output, which may be easier for humans to read:

In [13]:
for char, count in sorted(count_codes( sequence, nucleotides ).items() ):
    print('{}\t{}'.format(char,count) )

A	8
C	10
G	11
T	7


The function allows us to re-use the code with different parameters.

Note that we can use lists, strings, or tuples for either the sequences or the code sets (for the code sets even dicts work, since iterating over a dict yields its keys).
This is an example of "hardened" code, which is forgiving of user input. 

But this code ignores the occurrences of codes in the sequence that don't appear in the code list. 
For example, in 'ACNTKT' there are two 'T's, but if one uses code set ('A', 'C'), they are just ignored.

*Is this the right thing to do?*

In [None]:
sequences = ['ACNTKT', 'GN?ZX',]
code_set = [["A","C","G","T","N","Y","U"], 'ACGTN?XYK', ('A','C'), {'A':'a', 'C': 'c', '?': '-'} ]
for next_sequence in sequences:
    for next_code_set in code_set:
        print('\nworking on sequence {} and code_set {}'.format(
            next_sequence, next_code_set) 
             )
        for char, count in sorted(count_codes( next_sequence, next_code_set).items() ):
            print('{}\t{}'.format(char,count) )

Because the function returns a dict, one can also look up the count for a specific code.
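
One possible answer to the question above is to report the unrecognized characters instead of silently dropping them. This is only a sketch: the function name `count_codes_with_other` and the `'other'` key are assumptions made for illustration, not part of the lecture code.

```python
def count_codes_with_other(sequence, codes):
    '''
    Input: an iterable of characters, sequence, and an iterable of codes, codes

    Output: a dict of counts for each code, plus an 'other' entry that counts
    the characters of sequence that are not in codes
    '''
    counts = {code: 0 for code in codes}
    other = 0
    for char in sequence:
        if char in counts:
            counts[char] += 1
        else:
            other += 1          # character is not in the code set
    counts['other'] = other
    return counts

print(count_codes_with_other('ACNTKT', ('A', 'C')))
# {'A': 1, 'C': 1, 'other': 4}
```

Whether to count, warn about, or reject unknown characters depends on how the results will be used.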

### Optional parameter with default values

If a parameter is of the form *p=default* then it is optional, and has the value *default* if it is omitted. 
To use a value other than the default, just include *p=my_value* in the argument list.

In a function definition, optional parameters must come after the required ones. In a call, optional parameters passed by name can be in any order.

In this example, the sequence parameter is required, and the default code list is 'ACGTNXY?'. To count only non-ambiguous characters, override the default with *'ACGT'*.

In [18]:
def count_codes(sequence, codes = 'ACGTNXY?'):
    '''
    Input: a string, list, or tuple of characters, sequence, and an iterable of codes, codes
    
    Output: a dict {c1: n1, c2: n2, ...} where ni is the number of times ci in codes appears in sequence
    '''
    counts=[sequence.count(char) for char in codes]
    return dict(zip(codes, counts))

In [None]:
# use default ambiguity codes
count_codes('ACCGN')

In [None]:
# count only non-ambiguous codes
count_codes('ACCGN', codes='ACGT')
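
To illustrate that optional parameters passed by name can be given in any order, here is a sketch with a second, hypothetical optional parameter `skip_zero` (not part of the lecture code) that drops codes with a count of zero. The function name `count_codes_opt` is also made up for this example.

```python
def count_codes_opt(sequence, codes='ACGTNXY?', skip_zero=False):
    '''
    Input: a string of characters, sequence; an iterable of codes, codes;
    and a flag, skip_zero, that drops codes which never appear

    Output: a dict mapping each code to its count in sequence
    '''
    counts = {code: sequence.count(code) for code in codes}
    if skip_zero:
        counts = {code: n for code, n in counts.items() if n > 0}
    return counts

# optional parameters passed by name can be given in either order
a = count_codes_opt('ACCGN', codes='ACGT', skip_zero=True)
b = count_codes_opt('ACCGN', skip_zero=True, codes='ACGT')
print(a)        # {'A': 1, 'C': 2, 'G': 1}
print(a == b)   # True
```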

### scope and introspection

Variables within the function definition are in the "scope" of the function, and do not exist outside the function. 

>what happens in the function stays in the function

To return a value, the function definition needs *return*

In [None]:
#del(counts)

def count_codes(sequence, codes = 'ACGTNXY?'):
    '''
    Input: a string, list, or tuple of characters, sequence, and an iterable of codes, codes
    
    Output: a dict {c1: n1, c2: n2, ...} where ni is the number of times ci in codes appears in sequence
    '''
    counts=[sequence.count(char) for char in codes]
    return dict(zip(codes, counts))

count_codes('ACCGN', codes='ACGT')

#counts does not exist at this point
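
The sketch below shows the scope rule directly: a name assigned inside a function is not visible after the function returns. The names `count_a` and `inner_total` are hypothetical.

```python
def count_a(sequence):
    '''Return the number of 'A' characters in sequence.'''
    inner_total = sequence.count('A')   # exists only while count_a runs
    return inner_total

result = count_a('GATTACA')
print(result)   # 3

try:
    print(inner_total)   # not defined out here
except NameError as err:
    print('NameError:', err)
```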

To see the long comment (the *docstring*) in a Jupyter notebook, use

>function?

This is why you should **always** include good long comments in every function you define.

To see more information about the function:

>function??

In [27]:
count_codes?
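
The *?* syntax is specific to Jupyter and IPython. In plain Python the same long comment is available through the built-in *help()* function and the `__doc__` attribute. A minimal sketch, using a made-up function `gc_count`:

```python
def gc_count(sequence):
    '''
    Input: a string of nucleotides, sequence

    Output: the number of G and C characters in sequence
    '''
    return sequence.count('G') + sequence.count('C')

print(gc_count.__doc__)   # the raw docstring
help(gc_count)            # formatted help, including the signature
```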

### Recursion

Functions can call themselves. This is called *recursion*, or *recursive code*. 
Recursive functions have (at least) two parts,

* one part calls the function on a smaller input
* one part does not call itself (this is the *base case*)

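If you get this wrong, your computer will eat itself and implode (more precisely, the function keeps calling itself until Python raises a *RecursionError* or the kernel hangs). 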
Use the *Kernel* function in the menu at the top of your jupyter notebook to kill the kernel if this happens.

Here are two examples. The *fibonacci* code is **very** inefficient (meaning you can't run it with very large n). There are cool ways to speed this up. 

In [28]:
def fibonacci(n, trace=False):
    '''
    for integer n, return the nth fibonacci number. 
    By definition: f(0) = 1, f(1) = 1, f(n) = f(n-1) + f(n-2)
    WARNING: this code is VERY inefficient
    '''
    if trace: print('in fibonacci({})'.format(n))
    if n < 2:
        return 1
    else:
        return fibonacci(n-1, trace) + fibonacci(n-2, trace)

In [None]:
#fibs = [fibonacci(x) for x in range(6)]
#print(fibs)

fibonacci(5, trace=True)
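
One of the ways to speed this up is *memoization*: remember results that were already computed, so each value is calculated only once. The sketch below uses `functools.lru_cache` from the standard library; the name `fibonacci_memo` is an assumption for this example, and it keeps the same convention f(0) = f(1) = 1 as the code above.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci_memo(n):
    '''
    for integer n >= 0, return the nth fibonacci number,
    with f(0) = 1, f(1) = 1, f(n) = f(n-1) + f(n-2)
    results are cached, so each n is computed only once
    '''
    if n < 2:
        return 1
    return fibonacci_memo(n-1) + fibonacci_memo(n-2)

print([fibonacci_memo(x) for x in range(6)])   # [1, 1, 2, 3, 5, 8]
print(fibonacci_memo(80))   # fast here; far too slow without the cache
```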

In [25]:
codons = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }

In [34]:
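def translate( dna, codon_table=codons):
    '''
    Input: a DNA string, dna, and a dict, codon_table, mapping codons to amino acids

    Output: the amino acid string for dna; leftover bases at the end are ignored
    '''
    if len(dna) < 3:
        return ''
    else:
        return codon_table[dna[:3]]+translate(dna[3:], codon_table)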

In [35]:
translate('ACCGTTTTC')

'TVF'
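
Each recursive call adds a level to Python's call stack, so the recursive *translate* runs into the recursion limit (typically around 1000 calls by default) on long sequences. A loop avoids that. The sketch below is one way to write it; `translate_loop` and the three-entry `small_table` are stand-ins invented for this example so the block runs on its own.

```python
small_table = {'ACC': 'T', 'GTT': 'V', 'TTC': 'F'}   # tiny stand-in for codons

def translate_loop(dna, codon_table=small_table):
    '''
    Input: a DNA string, dna, and a dict mapping codons to amino acids

    Output: the translated string; trailing bases that do not
    fill a whole codon are ignored, as in the recursive version
    '''
    protein = []
    for start in range(0, len(dna) - 2, 3):
        protein.append(codon_table[dna[start:start+3]])
    return ''.join(protein)

print(translate_loop('ACCGTTTTC'))               # TVF
print(len(translate_loop('ACCGTTTTC' * 2000)))   # 6000
```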

## Modules

Modules are special purpose collections of code, including functions and objects (with methods).
It is more efficient to require the user to import special purpose code than to include it in basic python. 
Moreover, Python has a large user base, so there are many special purpose modules available.

Some particularly useful modules are:
* numpy: high performance support for scientific numerical computing (we will cover this)
* pandas: support for dataframes, which are commonly used in scientific data analysis (we will cover this)
* matplotlib: support for plotting data and showing graphs
* pprint: print data objects in a more readable way
* sys: gives access to system objects, such as STDIN, STDOUT, and the command-line arguments to a script
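
As a small illustration of the *sys* entry in the list above, the sketch below reads the command-line arguments and writes to standard output and standard error.

```python
import sys

# sys.argv is a list of strings: the script name followed by its arguments
script_name = sys.argv[0] if sys.argv else ''
arguments = sys.argv[1:]

sys.stdout.write('running {} with {} argument(s)\n'.format(script_name, len(arguments)))
sys.stderr.write('messages for the user can go to STDERR\n')
```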

To use a module you need to *import* it.
Import creates a new namespace with all the parts of the module in that namespace.

A *namespace* is the collection of all names of functions and objects to which one's code has access. For example *print()* and *len()* are the names of functions in the default namespace, which you get for free.

To import a module, use

>import module

or 

>import module as mod

The first creates a namespace named *module* that contains everything in the module named "module"

The second puts module's contents into a namespace named *mod*.

As with all python objects, you access the contents of the module by using the name of the namespace followed by '.'. 
You can find all the contents with introspection, and get help with a trailing '?'.

For example, *pprint* is a function in the *pprint* module. 

In [None]:
import pprint

pprint.pprint({'a': 1, 'b':2})
#pprint({'a': 1, 'b':2})   # fails: here pprint is the module, not the function


In [None]:
import pprint as pp
pp.pprint({'a': 1, 'b':2})

In [None]:
from pprint import pprint
pprint({'a': 1, 'b':2})
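
The introspection mentioned above also works outside Jupyter with the built-in *dir()* function, which lists the names in a module's namespace. A short sketch using the standard library *math* module (the alias `m` is an arbitrary choice):

```python
import math as m

names = dir(m)                 # every name defined in the math namespace
public = [name for name in names if not name.startswith('_')]

print(public[:5])
print('sqrt' in public)        # True
print(m.sqrt(16))              # 4.0
```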

Modules can contain other modules.
Sometimes, with a large module, you only want a particular sub-module.

For example, the *numpy* module contains a module named *random*. Rather than reach it through
the (very large) numpy namespace, it is common to import just the *random* module by name, possibly with a separate namespace for clarity. 

>from numpy import random as rnd


In [None]:
from numpy import random as rnd
# introspect into rnd
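
The same *from ... import ... as ...* pattern works for any package that contains sub-modules. The sketch below uses the standard library, where *os* contains the sub-module *path*; the alias `osp` and the file name are arbitrary choices for this example.

```python
from os import path as osp

filename = osp.basename('data/reads.fasta')
stem, extension = osp.splitext(filename)

print(filename)          # reads.fasta
print(stem, extension)   # reads .fasta
```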

Also, the *matplotlib* module contains a sub-module *pyplot*, which is often the only part you want.

In [None]:
from matplotlib import pyplot as plt
# introspect into plt

In [None]:
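# less trivial example
import numpy as np
from matplotlib import pyplot as plt

x_values = np.arange(0,1,.01)
squares = [x**2 for x in x_values]
cubes = [y**3 for y in x_values]

plt.title('Cubes versus squares')
plt.plot(x_values, squares)
plt.plot(x_values, cubes)
plt.show()   # needed when running outside a notebook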