##First, some FP concepts

###State and mutability
A variable is **mutable** if it can change **state** while a program is running:

In [2]:
x = 0
for i in range(10):
    x = x + i
print(x)

45


The value of `x` changes as the program runs. 

The same thing without using state:

In [4]:
x = sum(range(10))
print(x)

45


###Side effects
A function has **side effects** if it changes state outside its own scope, for example by modifying one of its arguments:

In [9]:
def my_function(i): 
    i.append('a') 
    return(i) 

foo = [1,2,3] 
print(foo) 
bar = my_function(foo)

[1, 2, 3]


Has the value of `foo` been changed? We can't know without inspecting `my_function`.

Same but without side effects:

In [12]:
def my_function(i): 
    return(i + ['a']) 

my_function([1,2,3])

[1, 2, 3, 'a']

It's even harder to tell whether an argument has been changed if we have a big stack of functions:

In [17]:
def f1(i): 
    i.append('a') 
    # real code...
    return(i) 

def f2(i): 
    j = f1(i)
    # real code...
    return(j) 

def f3(i): 
    k = f2(i)
    # real code...
    return(k) 


foo = [1,2,3] 
print(foo) 
bar = f3(foo)

[1, 2, 3]


In order to figure out whether `foo` has changed we have to look at `f3`, then `f2`, then `f1`, etc.
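
A quick way to check the stack above empirically is to print `foo` after the call. The sketch below repeats the three functions without the placeholder comments. The `safe` and `result` names and the idea of passing a copy are illustrative additions, not part of the original notebook:

```python
def f1(i):
    i.append('a')    # mutates the caller's list in place
    return i

def f2(i):
    return f1(i)

def f3(i):
    return f2(i)

foo = [1, 2, 3]
bar = f3(foo)
print(foo)           # [1, 2, 3, 'a']: the side effect in f1 reached foo
print(bar is foo)    # True: both names refer to the same list object

# Passing a copy protects the caller's list from the side effect
safe = [1, 2, 3]
result = f3(list(safe))
print(safe)          # [1, 2, 3]
```

Copying works here, but it only hides the problem: the caller still has to know (or suspect) that something down the stack mutates its argument.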

Another question: does the function always return the same output when given the same input?

In [19]:
def my_function(i): 
    return(i + to_add) 


to_add =['a', 'b', 'c'] 
x = [1,2,3] 
print(my_function(x)) 
to_add =['x', 'y', 'z'] 
print(my_function(x)) 

[1, 2, 3, 'a', 'b', 'c']
[1, 2, 3, 'x', 'y', 'z']


Above, identical function calls give different results depending on the value of `to_add`.

Functions which 

- don't have any side effects and
- always return the same output for the same input

are called **pure functions**.
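
As a minimal sketch of what that looks like in practice, the function from the last example can be made pure by passing the extra items in as an argument instead of reading them from a global variable. The name `add_items` is made up for this illustration:

```python
def add_items(items, to_add):
    # Pure: the result depends only on the two arguments, and neither
    # argument is modified (list + list builds a new list).
    return items + to_add

x = [1, 2, 3]
first = add_items(x, ['a', 'b', 'c'])
second = add_items(x, ['a', 'b', 'c'])

print(first)            # [1, 2, 3, 'a', 'b', 'c']
print(first == second)  # True: same input, same output
print(x)                # [1, 2, 3]: the input list is unchanged
```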



###Functions as objects

Functions are values which can be passed around a program just like ints, strings, etc.

A function which takes a list and the name of a function, and prints the result of running the function on each element of the list:

In [21]:
def print_list_with_function(my_list, my_function): 
    # the my_function argument is the name of a function
    for element in my_list: 
        print(my_function(element)) 

input = ['abc', 'defhij', 'kl'] 
print_list_with_function(input, len) 

3
6
2


How about with the name of a function that we define?

In [23]:
def get_second(input): 
    return input[1] 

print_list_with_function(input, get_second)

b
e
l


For one-line functions we can use a shortcut: **lambda expressions**:

In [25]:
get_second = lambda input: input[1] 

print_list_with_function(input, get_second)

b
e
l


Or even:

In [27]:
print_list_with_function(input, lambda input: input[1])

b
e
l
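
Because functions are ordinary values, they can also be stored in data structures. The following sketch (with the hypothetical helpers `get_first` and `get_last`) keeps several functions in a list and applies each of them in turn:

```python
def get_first(word):
    return word[0]

def get_last(word):
    return word[-1]

# Functions can be stored in a list like any other value
extractors = [len, get_first, get_last]

words = ['abc', 'defhij', 'kl']
for word in words:
    # call every stored function on the current word
    print([f(word) for f in extractors])
```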


###How to get the answer vs. what the answer is

In non-functional (imperative) code, we describe how to obtain the answer:

>Create a variable to hold a running total, and set it to zero. Then, for each number between zero and ten, add that number to the total. Finally, print the total.

In [29]:
total = 0 
for i in range(11): 
    total = total + i 
print(total) 

55


In functional code, we describe the result we want:

>the sum of the numbers between zero and ten

and let the computer figure out how to calculate it:

In [31]:
print(sum(range(11))) 

55


Another example: getting a list of the second letter of each element in a list of words:

In [120]:
input = ['hello', 'world']

# how to get the answer we want
result1 = []
for i in input:
    result1.append(i[1]) 

# describe the answer we want
result2 = map(lambda x : x[1], input)
result1,result2

(['e', 'o'], ['e', 'o'])

##Looking at some built in higher order functions

###map()

For a list of DNA sequences, create a list of their lengths:

In [35]:
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG'] 

lengths = [] 
for dna in dna_list: 
    lengths.append(len(dna)) 
lengths

[4, 8, 3, 8]

For the same list, create a list of their AT contents:

In [36]:
from __future__ import division

at_contents = [] 
for dna in dna_list: 
    at_contents.append((dna.count('A') + dna.count('T')) / len(dna)) 
at_contents

[0.5, 0.5, 0.6666666666666666, 0.375]

Lots of duplicated logic here. For the general pattern of transforming each element of a list, use `map()`. Its first argument is the name of a transformation function, either built in:

In [38]:
lengths = map(len, dna_list) 
lengths

[4, 8, 3, 8]

or defined:

In [41]:
def get_at(dna): 
    return (dna.count('A') + dna.count('T')) / len(dna) 

at_contents = map(get_at, dna_list) 
at_contents

[0.5, 0.5, 0.6666666666666666, 0.375]

Notice the one-in-one-out function signature: the transformation function takes a single element and returns a single value.

Remember lambda expressions:

In [42]:
at_contents = map( 
    lambda dna : (dna.count('A') + dna.count('T')) / len(dna), 
    dna_list 
)
at_contents

[0.5, 0.5, 0.6666666666666666, 0.375]

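Note: `map()` is **lazy** in Python 3: it returns an iterator rather than a list, so wrap the call in `list()` to get list output like that shown in these examples.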
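
A small sketch of what laziness means in practice, assuming the code is run under Python 3 (under Python 2 `map()` already returns a list, so the last line would print the full list again):

```python
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

lengths = map(len, dna_list)
# Python 3: lengths is a lazy iterator, nothing has been computed yet.
# list() pulls every value out of it.
lengths_list = list(lengths)
print(lengths_list)   # [4, 8, 3, 8]

# Python 3: the iterator is now exhausted, so a second pass yields nothing
leftover = list(lengths)
print(leftover)       # []
```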

We've talked about lists, but `map()` really works on any iterable:

In [43]:
map(lambda x: x.lower(), 'ABCDEF')

['a', 'b', 'c', 'd', 'e', 'f']

###filter()

Selecting elements from a list that fit a specific criterion e.g. minimum length, low AT content:

In [46]:
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG'] 

long_dna = [] 
for dna in dna_list: 
    if len(dna) > 5: 
        long_dna.append(dna) 
print(long_dna)

at_poor_dna = [] 
for dna in dna_list: 
    if (dna.count('A') + dna.count('T')) / len(dna) < 0.6: 
        at_poor_dna.append(dna) 
print(at_poor_dna)

['ACGTATGC', 'ACGGCTAG']
['TAGC', 'ACGTATGC', 'ACGGCTAG']


Lots of repetitive code. 

Just like `map()`, `filter()` takes a function as its first argument. Here the function must return True or False, and only the elements for which it returns True are kept:

In [45]:
def is_long(dna): 
    return len(dna) > 5 

def is_at_poor(dna): 
    at = (dna.count('A') + dna.count('T')) / len(dna) 
    return at < 0.6 

long_dna = filter(is_long, dna_list) 
at_poor_dna = filter(is_at_poor, dna_list) 

print(long_dna)
print(at_poor_dna)

['ACGTATGC', 'ACGGCTAG']
['TAGC', 'ACGTATGC', 'ACGGCTAG']


`filter()` is also lazy under Python 3, so wrap it in `list()` if you need a list.
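
Since both functions accept any iterable, the output of `filter()` can be fed straight into `map()`. As a sketch, assuming Python 3 division (or `from __future__ import division` as used earlier in this notebook), this lists the AT content of just the long sequences:

```python
def is_long(dna):
    return len(dna) > 5

def get_at(dna):
    # true division assumed (Python 3)
    return (dna.count('A') + dna.count('T')) / len(dna)

dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

# filter() selects the long sequences, map() transforms each survivor;
# list() forces the lazy result so that it can be printed
long_at = list(map(get_at, filter(is_long, dna_list)))
print(long_at)   # [0.5, 0.375]
```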

###sorted()

By default the `sorted()` function sorts alphabetically:

In [48]:
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG'] 
sorted(dna_list) 

['ACGGCTAG', 'ACGTATGC', 'ATG', 'TAGC']

For custom sorting, we pass in a `key` keyword argument which is the name of a transformation function. So to sort by length, pass in the name of the `len` function:

In [50]:
sorted(dna_list, key=len) 

['ATG', 'TAGC', 'ACGTATGC', 'ACGGCTAG']

Notice how we don't have to worry about writing any actual sorting code. Remember, we describe what the answer looks like:

> the elements of `dna_list` sorted according to their length

and Python figures out how to give it to us.

Reverse the order with `reverse` keyword argument:

In [52]:
sorted(dna_list, key=len, reverse=True) 

['ACGTATGC', 'ACGGCTAG', 'TAGC', 'ATG']

As illustrated, `sorted()` leaves its input unchanged and returns a new sorted list, whether it is given a list or any other iterable:

In [55]:
sorted('atcgatcg')

['a', 'a', 'c', 'c', 'g', 'g', 't', 't']

There is also `list.sort()`, which takes the same `key` and `reverse` arguments but sorts the original list in place and returns `None`.
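
The difference between the two is easiest to see side by side. This short sketch uses the same `dna_list` as above. The names `by_length` and `returned` are just for illustration:

```python
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

by_length = sorted(dna_list, key=len)   # builds a new list
print(dna_list)    # ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG'] (unchanged)
print(by_length)   # ['ATG', 'TAGC', 'ACGTATGC', 'ACGGCTAG']

returned = dna_list.sort(key=len)       # sorts dna_list in place
print(returned)    # None
print(dna_list)    # ['ATG', 'TAGC', 'ACGTATGC', 'ACGGCTAG']
```

In functional-style code `sorted()` is usually the better fit, because it has no side effects on its input.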

Key functions can be arbitrarily complex.

To sort by AT content, we can just re-use our `get_at()` function from before:

In [57]:
def get_at(dna): 
    return (dna.count('A') + dna.count('T')) / len(dna) 

sorted(dna_list, key=get_at) 

['ACGGCTAG', 'TAGC', 'ACGTATGC', 'ATG']

To sort by length of poly-A tail, we just have to write a function which takes a single DNA sequence and returns the poly-A tail length:

In [60]:
import re 
def poly_a_length(dna): 
    poly_a_match = re.search(r'A+$', dna) 
    if poly_a_match: 
        return len(poly_a_match.group()) 
    else: 
        return 0 

poly_a_length('ACGTGC')

0

and use it as the key to `sorted()`:

In [63]:
dna_list = ['ATCGA', 'ACGG', 'CGTAAA', 'ATCGAA']
sorted(dna_list, key=poly_a_length)

['ACGG', 'ATCGA', 'ATCGAA', 'CGTAAA']

`map()` and `sorted()` both use transformation-type functions, so we can also just list the poly-A tail lengths:

In [66]:
map(poly_a_length, dna_list)

[1, 0, 3, 2]

Using the `key` argument lets us reach inside complex data structures to find the bit of data we need to sort on. 

Say we have a list of tuples which store gene expression measurements for two conditions:

In [67]:
# tuples are:
#    gene name 
#    expression in condition one
#    expression in condition two

measurements = [ 
    ('gene1', 121, 98), 
    ('gene2', 56,  32), 
    ('gene3', 1036, 1966), 
    ('gene4', 543, 522) 
] 

We want to find genes which are over-expressed in condition two compared to condition one. A transformation function takes a single tuple and returns the expression ratio:

In [68]:
def get_ratio(measurement): 
    return measurement[2] / measurement[1] 

Now we can use `map()` to list the ratios:

In [71]:
map(get_ratio, measurements)

[0.8099173553719008,
 0.5714285714285714,
 1.8976833976833978,
 0.9613259668508287]

and `sorted()` to order the genes by ratio:

In [75]:
sorted(measurements, key=get_ratio, reverse=True)

[('gene3', 1036, 1966),
 ('gene4', 543, 522),
 ('gene1', 121, 98),
 ('gene2', 56, 32)]

or even `filter()` to find just genes with a ratio over a given threshold:

In [77]:
filter(lambda x : get_ratio(x) > 1.5, measurements)

[('gene3', 1036, 1966)]

A final useful sorting trick: sorts in Python are stable, i.e. elements whose keys are equal stay in their original order:

In [81]:
sorted(['yx', 'cd', 'jr'], key=len)

['yx', 'cd', 'jr']

So doing multi-level sorts is easy. Given a list of tuples which store chromosome number, base position, and gene name for a bunch of loci:

In [83]:
# tuples are:
#    chromosome 
#    start base
#    locus name

loci = [ 
    (4, 9200, 'gene1'), 
    (6, 63788, 'gene2'), 
    (4, 7633, 'gene3'), 
    (2, 8766, 'gene4') 
] 

We want to sort first by chromosome number then within each chromosome by base position. Solution: sort by base position first, then by chromosome number:

In [86]:
sorted_by_base = sorted(loci, key=lambda x : x[1])
sorted_by_base

[(4, 7633, 'gene3'),
 (2, 8766, 'gene4'),
 (4, 9200, 'gene1'),
 (6, 63788, 'gene2')]

In [89]:
final_sort = sorted(sorted_by_base, key = lambda x : x[0])
final_sort

[(2, 8766, 'gene4'),
 (4, 7633, 'gene3'),
 (4, 9200, 'gene1'),
 (6, 63788, 'gene2')]

More readably, using named helper functions to get the chromosome and base:

In [91]:
def get_chromosome(locus): 
    return locus[0] 
 
def get_base_position(locus): 
    return locus[1] 

sorted_by_base = sorted(loci, key=get_base_position) 
final_sort = sorted(sorted_by_base, key=get_chromosome) 
final_sort

[(2, 8766, 'gene4'),
 (4, 7633, 'gene3'),
 (4, 9200, 'gene1'),
 (6, 63788, 'gene2')]

Less readably, using nested sorts (probably don't do this unless your keyboard will literally fall apart if you type the extra characters):

In [93]:
sorted(sorted(loci, key=lambda x : x[1]), key = lambda x : x[0])

[(2, 8766, 'gene4'),
 (4, 7633, 'gene3'),
 (4, 9200, 'gene1'),
 (6, 63788, 'gene2')]
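
As an alternative to the two-pass approach, not shown in the original notebook, a key function can return a tuple. Python compares tuples element by element, so a single sort on `(chromosome, base position)` should give the same ordering. A minimal sketch:

```python
loci = [
    (4, 9200, 'gene1'),
    (6, 63788, 'gene2'),
    (4, 7633, 'gene3'),
    (2, 8766, 'gene4')
]

# The key is a (chromosome, start base) tuple: ties on the first
# element are broken by the second element.
final_sort = sorted(loci, key=lambda locus: (locus[0], locus[1]))
print(final_sort)
```

The two-pass version is still handy when the levels need different directions, for example ascending chromosome but descending base position.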

##Writing higher-order functions

Question: when might it be helpful to write a higher-order function?

A normal function lets us abstract part of the code. Here are functions to get a list of 4mers and 6mers from a DNA sequence:

In [98]:
def get_4mers(dna):
    all_4mers = [] 
    for i in range(len(dna) - 3): 
        all_4mers.append(dna[i:i+4]) 
    return all_4mers

def get_6mers(dna):
    all_6mers = [] 
    for i in range(len(dna) - 5): 
        all_6mers.append(dna[i:i+6]) 
    return all_6mers

dna = "acggcatcgtacg"
print(get_4mers(dna))
print(get_6mers(dna))

['acgg', 'cggc', 'ggca', 'gcat', 'catc', 'atcg', 'tcgt', 'cgta', 'gtac', 'tacg']
['acggca', 'cggcat', 'ggcatc', 'gcatcg', 'catcgt', 'atcgta', 'tcgtac', 'cgtacg']


What do these two functions have in common?
 - making an empty list
 - iterating over the sequence
 - extracting the kmer
 - adding it to the result
 - returning the result
 
What is different between the two functions?
 - the length of the kmer
 
So, we take the length of the kmer and turn it into a function argument i.e. we abstract it away:

In [100]:
def get_kmers(dna, k):
    kmers = [] 
    for i in range(len(dna) - k +1): 
        kmers.append(dna[i:i+k]) 
    return kmers

print(get_kmers(dna, 4))
print(get_kmers(dna, 6))

['acgg', 'cggc', 'ggca', 'gcat', 'catc', 'atcg', 'tcgt', 'cgta', 'gtac', 'tacg']
['acggca', 'cggcat', 'ggcatc', 'gcatcg', 'catcgt', 'atcgta', 'tcgtac', 'cgtacg']


OK, different problem. Just looking at 6mers, getting a list of the AT contents vs getting a list of the CG dinucleotide counts:

In [105]:
def get_6mers_at(dna): 
    result = [] 
    for i in range(len(dna) - 5): 
        one_6mer = dna[i:i+6] 
        at = (one_6mer.count('a') + one_6mer.count('t')) / 6 
        result.append(at) 
    return result 

def get_6mers_cg(dna): 
    result = [] 
    for i in range(len(dna) - 5): 
        one_6mer = dna[i:i+6] 
        cg = one_6mer.count('cg') 
        result.append(cg) 
    return result

print(get_6mers_at(dna))
print(get_6mers_cg(dna))

[0.3333333333333333, 0.3333333333333333, 0.3333333333333333, 0.3333333333333333, 0.5, 0.6666666666666666, 0.5, 0.3333333333333333]
[1, 1, 0, 1, 1, 1, 1, 2]


What do these two functions have in common?
 - making an empty list
 - iterating over the sequence
 - extracting the kmer
 - carrying out some calculation on it
 - adding the calculation to the result
 - returning the result
 
What is different between the two functions?
 - the thing that we do to the kmer to get a single element of the result
 
So, we take *the-thing-that-we-do-to-the-kmer-to-get-a-single-element-of-the-result* and turn it into a function argument. 

What is *the-thing-that-we-do-to-the-kmer-to-get-a-single-element-of-the-result*? Yep, a transformation function:

In [108]:
def get_at(dna): 
    return (dna.count('a') + dna.count('t')) / len(dna) 

def get_6mers_f(dna, analyze_6mer): 
    result = [] 
    for i in range(len(dna) - 5): 
        one_6mer = dna[i:i+6] 
        result.append(analyze_6mer(one_6mer)) 
    return result 

get_6mers_f(dna, get_at)

[0.3333333333333333,
 0.3333333333333333,
 0.3333333333333333,
 0.3333333333333333,
 0.5,
 0.6666666666666666,
 0.5,
 0.3333333333333333]

Let's do the CG dinucleotide example with a lambda expression:

In [110]:
get_6mers_f(dna, lambda dna : dna.count('cg'))

[1, 1, 0, 1, 1, 1, 1, 2]
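
The two abstractions from this section can be combined. The sketch below, using the made-up name `analyze_kmers`, takes both the kmer length and the analysis function as arguments:

```python
def analyze_kmers(dna, k, analyze_kmer):
    # Apply analyze_kmer to every kmer of length k in dna
    result = []
    for i in range(len(dna) - k + 1):
        result.append(analyze_kmer(dna[i:i+k]))
    return result

dna = "acggcatcgtacg"

cg_in_6mers = analyze_kmers(dna, 6, lambda kmer: kmer.count('cg'))
print(cg_in_6mers)   # [1, 1, 0, 1, 1, 1, 1, 2]

cg_in_4mers = analyze_kmers(dna, 4, lambda kmer: kmer.count('cg'))
print(cg_in_4mers)   # [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]
```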

##Bonus round: returning functions from functions

Above, we considered passing a function as an argument to a function. Can we do the opposite: return a function from a function?

Here's a function factory that builds functions to generate lists of kmers from DNA sequences:

In [112]:
def kmer_generator_factory(k):
    
    def kmer_generator(dna):
        result = [] 
        for i in range(len(dna) - k +1): 
            kmer = dna[i:i+k] 
            result.append(kmer) 
        return result 
    
    return kmer_generator

We call this function with a value of k and it returns a function that takes a DNA sequence and returns a list of kmers of length k. We can check this by looking at the type of the return value:

In [115]:
get_4mers = kmer_generator_factory(4)
type(get_4mers)

function

Now we can call the function that the factory created:

In [117]:
get_4mers(dna)

['acgg',
 'cggc',
 'ggca',
 'gcat',
 'catc',
 'atcg',
 'tcgt',
 'cgta',
 'gtac',
 'tacg']

We can use this factory to build functions for any kmer length:

In [119]:
get_2mers = kmer_generator_factory(2)
get_5mers = kmer_generator_factory(5)
get_2mers(dna), get_5mers(dna)

(['ac', 'cg', 'gg', 'gc', 'ca', 'at', 'tc', 'cg', 'gt', 'ta', 'ac', 'cg'],
 ['acggc',
  'cggca',
  'ggcat',
  'gcatc',
  'catcg',
  'atcgt',
  'tcgta',
  'cgtac',
  'gtacg'])
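
For the common case of fixing one argument of an existing function, the standard library offers `functools.partial` as an alternative to writing a factory by hand. This is not what the notebook uses. It is shown here only as a sketch, re-using the `get_kmers()` function from earlier:

```python
from functools import partial

def get_kmers(dna, k):
    kmers = []
    for i in range(len(dna) - k + 1):
        kmers.append(dna[i:i+k])
    return kmers

# partial() returns a new callable with k already filled in
get_4mers = partial(get_kmers, k=4)
get_2mers = partial(get_kmers, k=2)

dna = "acggcatcgtacg"
print(get_4mers(dna))
print(get_2mers(dna))
```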

#Exercises

##BLAST result parser

The file *blast_result.txt* contains a BLAST result in tabular format. Each row represents a hit and the fields, in order, give:

1. the name of the query sequence
2. the name of the subject sequence
3. the percentage of positions that are identical between the two sequences
4. the alignment length
5. the number of mismatches
6. the number of gap opens
7. the position of the start of the match on the query sequence
8. the position of the end of the match on the query sequence
9. the position of the start of the match on the subject sequence
10. the position of the end of the match on the subject sequence
11. the evalue for the hit
12. the bit score for the hit

Example:

In [None]:
#gi|322830704:1426-2962	gi|188011119|ref|YP_001905892.1|	61.31	442	170	1	10	1335	14	454	4e-144	 429
#gi|322830704:1426-2962	gi|225622184|ref|YP_002725698.1|	61.49	444	170	1	4	1335	12	454	6e-144	 429
#gi|322830704:1426-2962	gi|171260186|ref|YP_001795390.1|	61.54	442	169	1	10	1335	15	455	6e-144	 429
#gi|322830704:1426-2962	gi|288903312|ref|YP_003434040.1|	61.99	442	167	1	10	1335	14	454	6e-144	 429
#gi|322830704:1426-2962	gi|49146527|ref|YP_026087.1|	61.54	442	169	1	10	1335	14	454	9e-144	 428

Open this file in a text editor and take a look at it. 

Use a combination of `map()`, `filter()` and `sorted()` to answer the following questions:

- How many hits have fewer than 20 mismatches?
- List the subject sequence names for the ten matches with the lowest percentage of identical positions
- For matches where the subject sequence name includes the string "COX1", list the start position on the query as a proportion of the length of the match
 
Hint: think about how to store the hits in a data structure
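
One possible starting point for the data structure, offered as a sketch rather than the intended solution: turn each line into a dictionary. It assumes the fields are tab-separated, and the `FIELDS` names and the `parse_hit` function are invented for this illustration:

```python
# Hypothetical field names, in the order listed above
FIELDS = ['query', 'subject', 'identity', 'length', 'mismatches', 'gap_opens',
          'q_start', 'q_end', 's_start', 's_end', 'evalue', 'bit_score']

INT_FIELDS = ['length', 'mismatches', 'gap_opens',
              'q_start', 'q_end', 's_start', 's_end']
FLOAT_FIELDS = ['identity', 'evalue', 'bit_score']

def parse_hit(line):
    # Assumes one hit per line with tab-separated fields
    values = [value.strip() for value in line.rstrip('\n').split('\t')]
    hit = dict(zip(FIELDS, values))
    for name in INT_FIELDS:
        hit[name] = int(hit[name])
    for name in FLOAT_FIELDS:
        hit[name] = float(hit[name])
    return hit

# First row of the example above, with the tabs written out explicitly
example = ("gi|322830704:1426-2962\tgi|188011119|ref|YP_001905892.1|\t"
           "61.31\t442\t170\t1\t10\t1335\t14\t454\t4e-144\t 429")
hit = parse_hit(example)
print(hit['subject'], hit['mismatches'], hit['identity'])
```

With every hit stored this way, the functions passed to `map()`, `filter()` and `sorted()` only need to look up the relevant field.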

## FASTA editor

Write a function that copies FASTA format sequences from an input file to an output file while allowing for arbitrary modification of both the header and the sequence.
 
Your function should take four arguments: the name of the input file, the name of the output file, a header-modification function and a sequence-modification function:

In [None]:
modify_fasta(
    "input.fasta",
    "output.fasta",
    fix_header,
    fix_sequence)

Write some code that uses your FASTA copying function to fix these common FASTA file problems, one at a time:

- The sequence is in lower case and you need it in upper case
- The sequence contains unknown bases that should be removed
- The headers contain spaces that should be changed to underscores
- The headers are too long and need to be truncated to ten characters
- Append the length of the sequence to the header
- Append the AT content of the sequence to the header
- If the sequence starts with ATG and ends with a poly-A tail, append the phrase "putative transcript" to the header

Use the file *sequences.fasta* to test your code.  
 
**Hint: the sequences are single-line to make the parsing code easier**

**Hint: start by writing a program which simply copies the records**
