<span style="float:left;">Licence CC BY-NC-ND</span><span style="float:right;">François Rechenmann &amp; Thierry Parmentelat&nbsp;<img src="media/inria-25.png" style="display:inline"></span><br/>

# Computing the frequencies of the 4 bases

### Implementation in python

In this notebook we are going to implement the algorithm that was described in the video, that computes the respective frequency of appearance for each base `A`, `C`, `G` and `T` in a DNA fragment.

Unlike the illustration in the video, we are no longer dealing with pseudo-code: this time we will write **executable code** that, thanks to notebook technology, we can run right inside this document.

We start, as we have seen in the notebook on the basics in python, with a few magic formulas that will make our code run under both python2 and python3.

In [None]:
# this is so that we can use print() in python2 like in python3
from __future__ import print_function
# with this, division will behave in python2 like in python3
from __future__ import division

### The algorithm (1st version)

In its most elementary form, this first algorithm, as explained in the video, can be written as follows. We start by initializing our variables:

In [None]:
### initializing variables
# numbers of occurrences
nbA = nbC = nbG = nbT = nbTotal = 0

# the input sequence
dna = "TATCCTGACTGGACGACAACGACGCAAT"

This does not cause anything to be printed, which is the expected behaviour. We can display the contents of one of these variables like this:

In [None]:
print(dna)

or, even simpler, taking advantage of the fact that the notebook displays the value of the last expression in a cell:

In [None]:
dna

We can now scan the input sequence, and update the numbers of occurrences as we go, as well as the total number of bases:

In [None]:
# it is very simple to scan an entire string in python
for nucleotide in dna:
    if nucleotide == 'A':
        nbA += 1
    elif nucleotide == 'C':
        nbC += 1
    elif nucleotide == 'G':
        nbG += 1
    elif nucleotide == 'T':
        nbT += 1
    nbTotal += 1

Again this does not result in anything being printed. If we want to see the results:

In [None]:
print("Total sequence length", nbTotal)
print("A = ", 100 * nbA / nbTotal)
print("C = ", 100 * nbC / nbTotal)
print("G = ", 100 * nbG / nbTotal)
print("T = ", 100 * nbT / nbTotal)

This algorithm works well, but it can be improved in several ways, which we are going to see step by step in the rest of this notebook.
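Before moving on, the counts produced by the loop can be cross-checked with the built-in string method `str.count`. This is not the algorithm from the video, only an independent sanity check; the name `counts_check` is introduced here for illustration, and the sketch assumes python3.

```python
# the same input sequence as above, repeated so that this cell runs on its own
dna = "TATCCTGACTGGACGACAACGACGCAAT"

# str.count returns the number of non-overlapping occurrences of a substring
counts_check = {base: dna.count(base) for base in "ACGT"}

print(counts_check)
# for a sequence made only of A, C, G and T, the four counts add up to its length
print("Total sequence length", len(dna))
```

If the sequence contained other characters, the four counts would no longer add up to `len(dna)`, which is also what happens with `nbTotal` in the loop version.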

##### Cosmetics

For starters, we will improve the way the results are displayed: 2 digits for the decimal part are accurate enough. It also turns out that python provides a format code meant specifically for percentages, `%`, which removes the need to multiply the ratio by 100:

In [None]:
print("Total sequence length", nbTotal)
print("A = {:.2%}".format(nbA / nbTotal))
print("C = {:.2%}".format(nbC / nbTotal))
print("G = {:.2%}".format(nbG / nbTotal))
print("T = {:.2%}".format(nbT / nbTotal))
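
To make the effect of `{:.2%}` concrete, here is a small sketch comparing it with the manual multiplication by 100. The values 9 and 28 are only an illustration, and the division assumes python3 (or the `from __future__ import division` line from the top of this notebook).

```python
# e.g. 9 occurrences in a sequence of length 28
ratio = 9 / 28

# formatting by hand: multiply by 100, keep 2 decimals, append the % sign
by_hand = "{:.2f}%".format(100 * ratio)

# the % format code does the multiplication and appends the sign for us
with_percent = "{:.2%}".format(ratio)

print(by_hand)
print(with_percent)
```

Both strings are identical here, so the `%` code is purely a convenience.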

***

### Using a function (2nd version)

Our output now looks better, but we are left with a deeper issue: we cannot easily use this code on another DNA fragment. Imagine that we now have a second fragment:

In [None]:
dna2 = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

If we want to run the same code on it, we have to type everything all over again. This is of course not desirable, and avoiding it is precisely the purpose of functions in python. Here is what this gives us:

In [None]:
def display_freq_bases_v1(dna):
    """
    A function that displays the frequencies 
    of the 4 bases in a DNA sequence
    """
    # variables initialization
    nbA = nbC = nbG = nbT = nbTotal = 0
    # scanning the sequence
    for nucleotide in dna:
        if nucleotide == 'A':
            nbA += 1
        elif nucleotide == 'C':
            nbC += 1
        elif nucleotide == 'G':
            nbG += 1
        elif nucleotide == 'T':
            nbT += 1
        # this one of course always needs to be incremented
        nbTotal += 1
    # displaying the result
    print("Total sequence length", nbTotal)
    print("A = {:.2%}".format(nbA/nbTotal))
    print("C = {:.2%}".format(nbC/nbTotal))
    print("G = {:.2%}".format(nbG/nbTotal))
    print("T = {:.2%}".format(nbT/nbTotal))

No output occurs after evaluating this cell, and this is fine: we have only taught the python interpreter what it is expected to do when this function **is** called.

So now that the function is known to python, it can be called, several times with different inputs if needed, like this:

In [None]:
# the first input
print("input", dna)
display_freq_bases_v1(dna)

In [None]:
# the second input
print("input", dna2)
display_freq_bases_v1(dna2)
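
One caveat worth noting: with an empty sequence, `nbTotal` is 0 and the divisions in the function raise `ZeroDivisionError`. A possible guard is sketched below; `safe_frequency` is a hypothetical helper that is not part of the original notebook, and returning 0.0 for an empty sequence is an arbitrary choice made for this sketch (python3 division assumed).

```python
def safe_frequency(count, total):
    """
    hypothetical helper: returns count / total,
    or 0.0 when the sequence is empty (total == 0)
    """
    if total == 0:
        return 0.0
    return count / total

# an empty sequence no longer triggers a division by zero
print("A = {:.2%}".format(safe_frequency(0, 0)))
# a regular case behaves like a plain division
print("A = {:.2%}".format(safe_frequency(9, 28)))
```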

### Separate computation from printing (3rd version)

In this third and last version, we are going to separate the actual computation from the printing; indeed, we are very likely to meet situations where we need the result of the computation without printing it.

To do this, we will use a very common technique, which allows a function to *return* a value. Let us first see this technique on a small unrelated example.

##### A python function can return a value

In [None]:
# an example of a function that returns a value
# in this case, we return the double of the input
def double(integer):
    return 2 * integer

With this in place, we can store the function result like this:

In [None]:
x = double(10)
y = double(25)

Here again, notice that this does not trigger any printing, but we could see the result with `print`:

In [None]:
print("twice 10:", x)
print("twice 25:", y)

##### A python function can even return several values

A function can also return several values at once, for example:

In [None]:
def doubles(integer1, integer2):
    return 2 * integer1, 2 * integer2

And now we can obtain and print the results like this:

In [None]:
x, y = doubles(10, 25)
print("twice 10:", x, "and twice 25:", y)

Technically, `doubles` actually returns one single object, which is a tuple, but let us keep things simple...
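For curious readers, this can be checked directly. The short sketch below stores the result of `doubles` in a single variable (the name `result` is chosen here for illustration) before unpacking it.

```python
# same function as above, repeated so that this cell runs on its own
def doubles(integer1, integer2):
    return 2 * integer1, 2 * integer2

# storing the result in one single variable gives us the tuple itself
result = doubles(10, 25)
print(type(result))
print(result)

# writing x, y = ... simply unpacks that tuple into two variables
x, y = result
print("twice 10:", x, "and twice 25:", y)
```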

##### Let us proceed

Now that we know how to write and use a function that returns several values, we can use this feature to rewrite our algorithm a third and last time:

In [None]:
# The function that computes
def count_bases(dna):
    """
    returns 5 values:
    * total sequence length
    * number of occurrences of 'A'
    * number of 'C'
    * number of 'G'
    * number of 'T'
    """
    nbA = nbC = nbG = nbT = nbTotal = 0
    for nucleotide in dna:
        if nucleotide == 'A':
            nbA += 1
        elif nucleotide == 'C':
            nbC += 1
        elif nucleotide == 'G':
            nbG += 1
        elif nucleotide == 'T':
            nbT += 1
        nbTotal += 1
    return (nbTotal, nbA, nbC, nbG, nbT)

In [None]:
# The function that displays
def display_freq_bases_v2(counts):
    """
    displays the result of count_bases
    """
    # we extract the 5 values from count_bases
    nbTotal, nbA, nbC, nbG, nbT = counts
    # and we print them
    print("Total sequence length", nbTotal)
    print("A = {:.2%}".format(nbA / nbTotal))
    print("C = {:.2%}".format(nbC / nbTotal))
    print("G = {:.2%}".format(nbG / nbTotal))
    print("T = {:.2%}".format(nbT / nbTotal))
    # optionally we could also display
    # the proportions of CG and TA
    print("CG = {:.2%}".format((nbC + nbG) / nbTotal))
    print("TA = {:.2%}".format((nbT + nbA) / nbTotal))

And now again we can use all this code on several input fragments:

In [None]:
# the first fragment
print("input", dna)
counts = count_bases(dna)
display_freq_bases_v2(counts)

In [None]:
# the second fragment
print("input", dna2)
# equivalently, if you want to be shorter
display_freq_bases_v2(count_bases(dna2))
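
Because `count_bases` only returns numbers, its result can also feed other computations without printing anything. As a sketch of this, the hypothetical helper `gc_content` below (not part of the original notebook) reuses `count_bases` to compare two fragments; it assumes python3 division and a non-empty sequence.

```python
def count_bases(dna):
    # same function as above, repeated so that this cell runs on its own
    nbA = nbC = nbG = nbT = nbTotal = 0
    for nucleotide in dna:
        if nucleotide == 'A':
            nbA += 1
        elif nucleotide == 'C':
            nbC += 1
        elif nucleotide == 'G':
            nbG += 1
        elif nucleotide == 'T':
            nbT += 1
        nbTotal += 1
    return (nbTotal, nbA, nbC, nbG, nbT)

def gc_content(dna):
    """
    hypothetical helper: proportion of C and G in the sequence
    (assumes the sequence is not empty)
    """
    nbTotal, nbA, nbC, nbG, nbT = count_bases(dna)
    return (nbC + nbG) / nbTotal

dna = "TATCCTGACTGGACGACAACGACGCAAT"
dna2 = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

# no printing happens inside the computation; we decide here what to display
if gc_content(dna) > gc_content(dna2):
    print("the first fragment has the higher CG proportion")
else:
    print("the second fragment has the higher (or an equal) CG proportion")
```

With the second version of the code, where computing and printing were mixed in one function, such a comparison would not have been possible without duplicating the counting loop.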