# Functions

Much like functions in mathematics, Python functions operate on arguments and use them to compute a value. In programming, functions are used to encapsulate code fragments that we want to run multiple times, but on different values. Consider the mathematical function $f: x \to 3 x^2 + 2 x + 7$. It would be coded in Python as follows.

In [3]:
def f(x):
    return 3*x**2 + 2*x + 7

The name of the function is `f`, the name of its argument is `x`, and it returns a value that is the result of some arithmetic operations on `x`.

Calling the function `f`, or applying it to a value, is again quite similar to mathematics, where we would write, e.g., $f(-1)$ to apply the function $f$ to the value $-1$.

In [7]:
print(f(-1), f(3), f(7))

8 40 168


A function can have multiple arguments, e.g., $g: (x, y) \to \sqrt{x^2 + y^2}$, which would translate to Python as follows.

In [8]:
from math import sqrt
def g(x, y):
    return sqrt(x**2 + y**2)

In [9]:
print(g(1, 2), g(3, 4))

2.23606797749979 5.0


The examples so far map numbers to numbers, but functions are not restricted in that sense. Simply think of a function we've already used a few times, `len`. It maps strings, lists, tuples, sets and dictionaries to non-negative integers.
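
As a quick illustration of that point, the sketch below applies `len` to several kinds of containers, including an empty list, for which `len` gives 0. The sample values are made up for this example.

```python
# len accepts many container types and always returns an int
samples = ['ACGT', [1, 2, 3], (1.5, 2.5), {'A', 'C'}, {'A': 2, 'G': 1}, []]
lengths = [len(sample) for sample in samples]
print(lengths)  # [4, 3, 2, 2, 2, 0]
```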

What makes a good function name? A function performs some action on its arguments, or uses their values to compute something, so it is an action. Hence a verb that describes the action makes a good function name.

Let's create a function using the code we wrote to count the number of nucleotides in a DNA string, taking into account that invalid characters might be present.

In [11]:
def count_nucl(sequence):
    nucl_count = dict()
    for i, nucl in enumerate(sequence):
        if nucl not in 'ACGT':
            print(f'### error: invalid symbol {nucl} at position {i} in {sequence}')
            return None
        if nucl not in nucl_count:
            nucl_count[nucl] = 0
        nucl_count[nucl] += 1
    return nucl_count

The function `count_nucl` takes a DNA sequence as an argument (a `str`), and returns a `dict` with the nucleotide counts if there are no invalid symbols. Otherwise it returns the special value `None`. Note that the `return` statement has two tasks: it determines the value produced by the function, and it ends the function's execution.

In [12]:
sequences = ['AGGTACC', 'AQTTAC', 'GGATA', 'AATTQ']
for sequence in sequences:
    nucl_count = count_nucl(sequence)
    if nucl_count is not None:
        print(nucl_count)

{'A': 2, 'G': 2, 'T': 1, 'C': 2}
### error: invalid symbol Q at position 1 in AQTTAC
{'G': 2, 'A': 2, 'T': 1}
### error: invalid symbol Q at position 4 in AATTQ
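
The two tasks of `return` can also be seen in isolation. The sketch below uses a hypothetical helper, `first_invalid_position`, which is not part of the code above. It stops at the first invalid symbol, and when the loop ends without reaching a `return`, the function produces `None` on its own.

```python
def first_invalid_position(sequence):
    # hypothetical helper: position of the first symbol that is not A, C, G or T
    for i, nucl in enumerate(sequence):
        if nucl not in 'ACGT':
            return i  # leaves the function immediately, later symbols are not inspected
    # no return statement is reached here, so the function returns None


print(first_invalid_position('AQTTQC'))  # 1, the second Q is never looked at
print(first_invalid_position('ACGT'))    # None
```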


#### Your turn now: to `count` or not to `count`?

In the definition of the function `count_nucl`, we didn't use the `count` method defined on `str`.  Could we and should we?

#### Your turn now: count all

Let's return to the code fragment we used to illustrate functions. For the sake of the argument, we will now assume that the DNA fragments contain no invalid symbols, so the function implementation can be simplified.

In [24]:
def count_nucl(sequence):
    nucl_count = dict()
    for nucl in sequence:
        if nucl not in nucl_count:
            nucl_count[nucl] = 0
        nucl_count[nucl] += 1
    return nucl_count

Let's complicate things a little bit by computing the counts over all sequences together, rather than for the individual sequences. Write a function `count_all_nucl` that takes a list of DNA sequences as an argument, and returns a `dict` with the counts for A, C, G, T. It uses the function `count_nucl` above.

In [None]:
def count_all_nucl(dna_sequences):
    ____

The code fragment below defines a list of DNA sequences, calls the function, and prints the resulting `dict`.

In [None]:
sequences = ['AGGTACC', 'AGTTAC', 'GGATA', 'AATTC']
nucl_count = count_all_nucl(sequences)
print(nucl_count)

## Optional arguments

Maybe you remember that the `split` method defined on `str` can be called without any arguments, or with a separator string. The argument to `split` is optional: if we do not specify it, a default is used (the string is split on whitespace).

In [13]:
data_str = '3.1; 5.2; 7.3'
print(data_str.split())
print(data_str.split(';'))

['3.1;', '5.2;', '7.3']
['3.1', ' 5.2', ' 7.3']


If you look at the documentation, you will see that `split` actually has two optional arguments.

In [14]:
help(str.split)

Help on method_descriptor:

split(...)
    S.split(sep=None, maxsplit=-1) -> list of strings
    
    Return a list of the words in S, using sep as the
    delimiter string.  If maxsplit is given, at most maxsplit
    splits are done. If sep is not specified or is None, any
    whitespace string is a separator and empty strings are
    removed from the result.



The first optional argument is `sep`, the separator used to split the string; the second is `maxsplit`, the maximum number of splits to do. The fact that those arguments are optional is indicated by the default values in the signature, e.g., `sep=None`. The argument `sep` in the function will get the value `None` when it is called with no value for `sep` given. Similarly, if no value is passed for `maxsplit`, it will be assigned `-1` in the function.

Optional arguments to functions can (and very often have to) be specified by name in function calls, e.g.,

In [15]:
'gjb I like Python programming'.split(maxsplit=1)

['gjb', 'I like Python programming']

If we had not named the `maxsplit` value, Python would consider it to be the value of `sep`, and an error would be reported, since `sep` should be a `str` value, or `None`.

In [16]:
'gjb I like Python programming'.split(1)

TypeError: must be str or None, not int
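
Positional arguments are matched in order, so to pass `maxsplit` without naming it, a value for `sep` has to come first. A minimal sketch, reusing the sample string from above:

```python
text = 'gjb I like Python programming'

# by position: the first value is sep, the second is maxsplit
by_position = text.split(None, 1)
# by name: sep is not mentioned, so it keeps its default value None
by_name = text.split(maxsplit=1)

print(by_position)
print(by_position == by_name)  # True
```

Naming the argument is usually easier to read, since the call makes clear which option is being set.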

Your solution to the 'count all' exercise probably looked something like the one below.

In [27]:
def count_all_nucl(dna_sequences):
    total_nucl_count = dict()
    for sequence in dna_sequences:
        nucl_count = count_nucl(sequence)
        for nucl in nucl_count:
            if nucl not in total_nucl_count:
                total_nucl_count[nucl] = 0
            total_nucl_count[nucl] += nucl_count[nucl]
    return total_nucl_count

In [28]:
sequences = ['AGGTACC', 'AGTTAC', 'GGATA', 'AATTC']
nucl_count = count_all_nucl(sequences)
print(nucl_count)

{'A': 8, 'G': 5, 'T': 6, 'C': 4}


Although this works, it is a bit of a hassle. We can simplify matters by modifying `count_nucl` to take an optional second argument: a `dict` that already contains nucleotide counts. Its default value is `None`, which means that no such `dict` was passed.

In [29]:
def count_nucl(sequence, nucl_count=None):
    if nucl_count is None:
        nucl_count = dict()
    for nucl in sequence:
        if nucl not in nucl_count:
            nucl_count[nucl] = 0
        nucl_count[nucl] += 1
    return nucl_count

If we use the optional `nucl_count` in a call to the `count_nucl` function, we can keep adding to an existing `dict`. This simplifies the code for `count_all_nucl` considerably.

In [30]:
def count_all_nucl(dna_sequences):
    total_nucl_count = dict()
    for sequence in dna_sequences:
        count_nucl(sequence, total_nucl_count)
    return total_nucl_count

In [31]:
sequences = ['AGGTACC', 'AGTTAC', 'GGATA', 'AATTC']
nucl_count = count_all_nucl(sequences)
print(nucl_count)

{'A': 8, 'G': 5, 'T': 6, 'C': 4}


Note that we can keep calling `count_nucl` as we did before, with just a single DNA sequence as an argument. This works because the second argument, `nucl_count`, is optional. If it is omitted from the function call, the `nucl_count` variable is `None` (the default value), and it will be initialized as an empty `dict`.
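
There is a reason the default value of `nucl_count` is `None` rather than an empty `dict`. A default value is evaluated only once, when the function is defined, so a mutable default would be shared by all calls that omit the argument. The sketch below shows this with a hypothetical variant, `count_nucl_shared`, written for illustration only.

```python
def count_nucl_shared(sequence, nucl_count={}):
    # hypothetical variant, for illustration only: the default dict is created
    # once, at definition time, and reused by every call that omits nucl_count
    for nucl in sequence:
        if nucl not in nucl_count:
            nucl_count[nucl] = 0
        nucl_count[nucl] += 1
    return nucl_count


first = count_nucl_shared('AC')
second = count_nucl_shared('GT')
print(second)           # the counts of the first call leak into the second
print(first is second)  # True, both names refer to the same dict
```

Using `None` as the default and creating the `dict` inside the function gives every such call a fresh, empty `dict`.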

Of course, there is another implementation of `count_all_nucl` you may have come up with, which is even simpler.

In [32]:
def count_all_nucl(dna_sequences):
    return count_nucl(''.join(dna_sequences))

In [33]:
sequences = ['AGGTACC', 'AGTTAC', 'GGATA', 'AATTC']
nucl_count = count_all_nucl(sequences)
print(nucl_count)

{'A': 8, 'G': 5, 'T': 6, 'C': 4}


Another example of a function that has optional arguments is useful. Up to this point, the DNA sequences we used as examples have been created by hand, which is fine for short sequences, but would be rather tedious for long ones. We will define a function `random_dna_seq` that generates a DNA sequence of a given length, 10 by default.

Python has a nice choice of functions to generate pseudo-random numbers or sequences in the `random` module of its standard library. One of those functions, `choices`, is very useful for the task at hand.

In [36]:
import random
help(random.choices)

Help on method choices in module random:

choices(population, weights=None, *, cum_weights=None, k=1) method of random.Random instance
    Return a k sized list of population elements chosen with replacement.
    
    If the relative weights or cumulative weights are not specified,
    the selections are made with equal probability.



In [37]:
def random_dna_seq(length=10):
    return ''.join(random.choices('ACGT', k=length))

In [39]:
for seq_length in range(5, 15, 3):
    print(random_dna_seq(seq_length), random_dna_seq(seq_length))

CGCCC AGAGT
ACAGGCGG AAATTTCA
GAATGTCGTGA TCGTAGCGAAA
TTCACTAACCTTAT GCCACAGCACGTGA


Since there is only a single optional argument, we need not specify its name when calling `random_dna_seq`. As expected, when we omit the argument in the function call, `length` will be assigned the default value 10, and we get a random DNA sequence of that length.

In [40]:
random_dna_seq()

'CCGGAGCCAT'
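
Because the sequences are pseudo-random, each run of the cells above produces different output. When reproducible sequences are needed, for instance while testing, the generator can be seeded with `random.seed`. This is a minimal sketch; the seed value 1234 is arbitrary, and the function is repeated only to keep the example self-contained.

```python
import random


def random_dna_seq(length=10):
    return ''.join(random.choices('ACGT', k=length))


random.seed(1234)        # arbitrary seed, chosen for this example
first = random_dna_seq()
random.seed(1234)        # same seed again, so the same sequence is generated
second = random_dna_seq()
print(first == second, len(first))  # True 10
```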

#### Your turn now: random length sequences

Replace each `____` in the function definition below such that the function will generate a random DNA sequence with a length such that `min_length <= length < max_length`. The function should have two optional arguments, `min_length` and `max_length`, with default values of 5 and 10 respectively. To determine a value for the actual random length, you can use a function in the `random` module; look it up in the [documentation](https://docs.python.org/3/library/random.html).

In [None]:
import random
def random_dna_seq(____, ____):
    length = ____
    seq_list = random.choices('ACGT', k=length)
    return ____