<span style="float:left;">Licence CC BY-NC-ND</span><span style="float:right;">François Rechenmann &amp; Thierry Parmentelat&nbsp;<img src="media/inria-25.png" style="display:inline"></span><br/>

# Counting nucleotides over a window

In this notebook we are going to write a Python program that lets us visualize the counts of nucleotides over sliding, overlapping windows, as explained in the video.

Let us start, as always, with our Python 2 / Python 3 compatibility cell:

In [None]:
# this is so that we can use print() in python2 like in python3
from __future__ import print_function
# with this, division will behave in python2 like in python3
from __future__ import division

And likewise, we are going to need `matplotlib` for drawing results:

In [None]:
# so that the graphics appear inside the notebook
%matplotlib inline
# importing the library
import matplotlib.pyplot as pyplot

# finally: the sizes to use when drawing figures
import pylab
pylab.rcParams['figure.figsize'] = 10., 6.

### Counting on a DNA fragment

The very first algorithm that we wrote in Python counted the respective frequencies of the 4 bases over **a whole DNA string**. In the present context, this needs to be improved so that we consider only **a segment** of the input string.

For that reason, we will start with a few Python notions that will turn out to be useful.

### Indices in python

To access a character in a string from its index, the Python syntax is, as always, very simple: just like in the pseudo-language from the video, we use square brackets.

However, we need to be careful here because in Python, unlike the assumptions in the video, **indices start at 0**. Nothing to be concerned about though, everything remains quite simple:

In [None]:
string = "abc"
print("at index 0:", string[0])
print("at index 1:", string[1])
print("at index 2:", string[2])
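
As a small complement, here is a minimal sketch on a made-up 4-letter string (the value of `dna` is purely illustrative). It shows that the last valid index is `len(dna) - 1`, and that a negative index counts from the end, which is the notation behind `X[-1]` used later in this notebook.

```python
# made-up fragment, for illustration only
dna = "ACGT"
# the first character sits at index 0, the last one at index len(dna) - 1
first = dna[0]
last = dna[len(dna) - 1]
# a negative index counts from the end, so -1 also designates the last character
also_last = dna[-1]
print(first, last, also_last)
```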

### *slicing* in python

Python also offers a less common mechanism, called *slicing*, which extracts a substring from a sequence with the notation `[begin:end]`; the character at index `begin` is included, the one at index `end` is not. Let us start with a simple example:

In [None]:
string = "abcdefghijklmnopqrstuv"
zoom = string[3:6]
print(zoom)

To clear up any confusion about the limits, observe that consecutive slices chain without any convoluted computation, since the end of one slice is simply the beginning of the next:

In [None]:
string[0:3]

In [None]:
string[3:6]

In [None]:
string[6:9]

We can even take advantage of a very useful trick: you can give a very high right limit, it does not matter, the slice simply stops at the end of the string:

In [None]:
string[9:200]
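
The properties above can be checked in one go. The following sketch reuses the same `string` as in the cells above; the variable names `piece` and `tail` are only there for the illustration.

```python
string = "abcdefghijklmnopqrstuv"

# begin is included, end is excluded: the slice holds end - begin characters
piece = string[3:6]
print(piece, len(piece))

# consecutive slices share their boundary and glue back without gap or overlap
print(string[0:3] + string[3:6] == string[0:6])

# an oversized right limit is clipped to the end of the string
tail = string[9:200]
print(tail, len(tail))
```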

### Let us proceed

With these new tools at our disposal, we can improve our counting function so that it counts only over the subsequence between indices `begin` (included) and `end` (excluded):

In [None]:
def count_c_g(dna, begin, end):
    # return values
    c = g = 0
    # scan only over the segment of interest
    for nucleo in dna[begin:end]:
        if nucleo == 'C':
            c += 1
        elif nucleo == 'G':
            g += 1
    # return both results
    return c, g
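
One way to gain confidence in such a function is to compare it against an independent computation. The sketch below is a hypothetical variant, named `count_c_g_builtin` for the occasion, that takes the same inputs and returns the same pair as `count_c_g` but relies on the built-in `str.count`; the fragment `sample` is made up.

```python
def count_c_g_builtin(dna, begin, end):
    # hypothetical variant: same inputs and outputs as count_c_g,
    # but the scan is delegated to the built-in str.count
    segment = dna[begin:end]
    return segment.count('C'), segment.count('G')

# made-up fragment, for illustration only
sample = "ACCGTTGCGA"
first_half = count_c_g_builtin(sample, 0, 5)
second_half = count_c_g_builtin(sample, 5, 10)
# an oversized end is clipped by the slice, exactly as in count_c_g
clipped = count_c_g_builtin(sample, 5, 200)
print(first_half, second_half, clipped)
```

Calling `count_c_g` with the same arguments should give the same pairs.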

### Sliding windows

Using this counting function, we are now able to write the algorithm that we have in mind.

As with the DNA walk, we need to compute two lists, holding the X's and the Y's of the graph we want to draw. We choose:

* to use for X the beginning of the sliding window; this is admittedly arbitrary, we could as well choose the middle or the end, but that would only result in a slightly translated figure;
* to draw for Y the ratio $\frac{G-C}{G+C}$

Finally and before we go ahead, let us notice that we need to be careful, because in the - unlikely, but not entirely impossible - case where a window would have **no `C` and no `G`**, then we **cannot divide by $C+G=0$**. So in these cases we decide that the ratio in question is `0`.
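
To make this convention concrete, here is a minimal sketch of the ratio alone. The helper name `g_c_ratio` is an assumption made for this illustration, and the code relies on true division (Python 3, or the `__future__` import from the first cell).

```python
def g_c_ratio(c, g):
    # hypothetical helper: (G-C)/(G+C), with 0 by convention
    # when the window contains no C and no G
    if c + g == 0:
        return 0.
    return (g - c) / (g + c)

print(g_c_ratio(5, 5))   # as many C as G: the ratio is 0
print(g_c_ratio(0, 5))   # only G: the ratio is 1
print(g_c_ratio(5, 0))   # only C: the ratio is -1
print(g_c_ratio(0, 0))   # pathological case, 0 by convention
```

The ratio therefore always lies between -1 and 1, and its sign tells which of `G` or `C` dominates in the window.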

All this leads us to the following code:

In [None]:
def window_x_y(dna, window, overlap):
    """
    inputs
      dna:          input DNA
      window:       the window width
      overlap:      how much do two successive windows overlap
    outputs
      X:            list of X's - these are multiples of (window - overlap)
      Y:            list of Y's - the value of (G-C)/(G+C) on that window
    """

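    # the window must move forward, otherwise the loop below never ends
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    # compute length of input once and for all
    length = len(dna)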
    # beginning of the window
    begin = 0
    # the two resulting lists
    X = []
    Y = []
    
    while begin < length:
        # with slicing it is no problem if we overspill on the right
        c, g = count_c_g(dna, begin, begin + window)
        # X denotes the beginning of the window
        x = begin
        # pathological case with no C and no G
        if c == 0 and g == 0:
            y = 0.
        else:
            y = (g - c) / (g + c)
        # store that point in the results
        X.append(x)
        Y.append(y)
        # we shift the window by (window - overlap)
        begin += (window - overlap)

    # we are done, let us return the results
    return X, Y
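
The successive values taken by `begin`, which are also the X's of the result, can be previewed on their own. The sketch below uses a hypothetical helper `window_starts`; it also makes explicit that the shift `window - overlap` has to be strictly positive for the window to move forward.

```python
def window_starts(length, window, overlap):
    # hypothetical helper: the successive beginnings of the window
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window")
    return list(range(0, length, step))

# a 30-character input, windows of width 10
print(window_starts(30, 10, 5))
print(window_starts(30, 10, 3))
```

With an overlap of 5 the window moves 5 characters at a time, and with an overlap of 3 it moves 7 characters at a time.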

### Shortcut

As we did for the DNA walk, we are going to define a shortcut that computes and displays the result in a single call. To improve legibility, we also take advantage of that shortcut to draw a dashed red line that marks $y=0$:

In [None]:
def sliding_window(dna, window, overlap):
    X, Y = window_x_y(dna, window, overlap)
    pyplot.plot(X, Y)
    # add a line y=0 on the whole width of the figure
    # this width is obtained from the last element in X, in python X[-1]
    pyplot.plot([0, X[-1]], [0, 0], color='r', linestyle='dashed', linewidth=2)
    pyplot.show()

### On test data

Before we run this on real data, let us convince ourselves that it does behave as expected on data where we can easily do the computations manually, like the following data:

In [None]:
test = 3 * ( 5 * 'C' + 5 * 'G')
print(test)

In [None]:
sliding_window(test, 10, 5)

We correctly obtain here a null value on all non-truncated windows, because any window that is 10 characters wide always contains 5 `C` and 5 `G`. The last window however, because it only contains the last 5 letters, holds nothing but `G`s and so yields a ratio of 1.

As an exercise, we suggest that you check that the result is still correct when we modify the `overlap` value:
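
The manual computation can be laid out window by window. The following sketch recomputes the same quantities in a compact, self-contained way (using `str.count` rather than the functions above, and assuming true division as set up in the first cell); the list `rows` is only there to collect the results.

```python
test = 3 * (5 * 'C' + 5 * 'G')
window, overlap = 10, 5

rows = []
for begin in range(0, len(test), window - overlap):
    segment = test[begin:begin + window]
    c, g = segment.count('C'), segment.count('G')
    # same convention as in window_x_y: 0 when the window has no C and no G
    ratio = (g - c) / (g + c) if c + g else 0.
    rows.append((begin, segment, c, g, ratio))
    print(begin, segment, c, g, ratio)
```

Each printed line shows the beginning of the window, its content, the two counts and the resulting ratio; only the last, truncated window departs from 0.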

In [None]:
sliding_window(test, 10, 3)

### Real data

Let us consider the Borrelia sample on which we had observed such a clear loopback point:

In [None]:
# the borrelia sample from sequence 7 on walking DNA
from samples import borrelia
print("length", len(borrelia))

We will see that the present technique will also suggest the presence of that loopback point:

In [None]:
sliding_window(borrelia, 400, 100)

This remains true at coarser scales. Here is, for example, what we have been able to obtain:
![](media/fenetre-borrelia.png)

We suggest that you try several values for the `window` and `overlap` parameters in an attempt to reach a similar output:

In [None]:
sliding_window(borrelia, 2000, 500)

***

### From ENA

Optionally, here is a skeleton that lets you run our algorithm on any sequence of your choice from ENA. You just need to select a search key, and to adjust the parameters passed to `sliding_window` according to the actual length of the sequence.

In [None]:
import fetch

In [None]:
from_ena = fetch.fetch('AE000789')
print("length", len(from_ena))

In [None]:
sliding_window(from_ena, 300, 100)
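
When adjusting the parameters, it can help to know in advance how many points the figure will contain. Here is a rough sketch with a hypothetical helper `count_windows`; the length of 10000 is a made-up value, to be replaced with `len(from_ena)`.

```python
def count_windows(length, window, overlap):
    # hypothetical helper: number of points that window_x_y produces,
    # i.e. length / (window - overlap) rounded up, in integer arithmetic
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window")
    return (length + step - 1) // step

# made-up length, for illustration only
length = 10000
print(count_windows(length, 300, 100))
print(count_windows(length, 2000, 500))
```

Wide windows with a large shift give few points and a smoother curve, while narrow windows give many points and a noisier one.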