<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>François Rechenmann &amp; Thierry Parmentelat</span>
<span><img src="media/inria-25-alpha.png" /></span>
</div>

# Walking the DNA

This notebook presents an executable version of the algorithm that walks along DNA fragments.

Our goal is thus to draw a DNA sequence, with the dot moving along one of the 4 directions depicted below:

![Excerpt from the slide](media/directions.png)

### The `matplotlib` library

We will use a library named `matplotlib` to actually draw the paths, mostly because it is very widely used for visualizing data.

In [None]:
# so that the graphics appear inside the notebook
%matplotlib inline
# importing the library
import matplotlib.pyplot as pyplot

# finally: the sizes to use when drawing figures
pyplot.rcParams['figure.figsize'] = 8., 8.

`matplotlib` can quite easily draw the path that we are interested in,
as soon as we provide it with two lists of values, usually called `X` and `Y`,
of the same length, that contain the coordinates of the points in that path.

Let us see this right away on a manually constructed example; 
let us imagine that we want to draw a path that goes through the following points:

* first point (0, 0)
* second point (2, 1)
* third point (1, 0)
* fourth point (3, 4)

In [None]:
# let us build the list of the first coordinates 
X = [0, 2, 1, 3]
# and the list in the second dimension
Y = [0, 1, 0, 4]

From that point we can simply draw the path using the `plot` function like this:

In [None]:
pyplot.plot(X, Y)
pyplot.show()
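The two lists above were typed by hand. As a minimal sketch, if the points are available as `(x, y)` pairs, the built-in `zip` can split them into the two lists that `plot` expects; the name `points` is introduced here for illustration only.

```python
# the four points from the example, written as (x, y) pairs
points = [(0, 0), (2, 1), (1, 0), (3, 4)]

# zip(*points) groups the first coordinates together, then the second ones
X, Y = zip(*points)
# zip produces tuples; convert them to lists to match the cells above
X, Y = list(X), list(Y)

print("X =", X)
print("Y =", Y)
```

The resulting `X` and `Y` are the same lists as the ones built manually, so they can be passed to `pyplot.plot` in the same way.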

### Functions that return 2 values

So in order to draw a DNA fragment, we just need to compute the coordinates of the dots in the path, expressed as a list of X's and a list of Y's.

We are thus faced with the problem of writing a function that computes and returns 2 lists, ideally in a single pass so as to be as efficient as possible.

It is very easy in Python to return several values from a function. Remember the notebook about computing the frequencies of the 4 bases in a DNA fragment, where we were already computing several items in a single pass.

Let us see that again on a very simple example: a function that computes the square and the cube of a number.

In [None]:
# a function that returns 2 values
def square_and_cube(x):
    square = x * x
    cube = x ** 3
    # technically : we return a tuple with these 2 values
    return square, cube

In order to use the two results, one simply uses the following syntax:

In [None]:
a, b = square_and_cube(5)
print("a=", a)
print("b=", b)
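As a small aside, the value returned by such a function is a single tuple, and the `a, b = ...` syntax merely unpacks it. The sketch below repeats `square_and_cube` so that it runs on its own, and shows both ways of getting at the results; the variable `result` is introduced here for illustration only.

```python
def square_and_cube(x):
    # same function as above, repeated so that this cell stands alone
    return x * x, x ** 3

# keeping the result as one object: it is a tuple
result = square_and_cube(5)
print(type(result).__name__, result)

# unpacking gives the same values as indexing the tuple
a, b = result
print(a == result[0], b == result[1])
```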

### Using a dictionary

Before we can see the DNA walking algorithm per se, we still need to decide how to map our 4 letters `C`, `A`, `G` and `T` onto the corresponding moves in the plane.

For this, it is natural in Python to use a *dictionary*. As we have seen in the notebook on Python basics, a dictionary lets us associate keys with values, like this:

In [None]:
moves = {
    'C' : [1, 0],
    'A' : [0, 1],
    'G' : [-1, 0],
    'T' : [0, -1],
    }

This way we can easily look up how to move the current dot when we walk into a `C`:

In [None]:
moves['C']

This means that whenever we see a `C`, we have to:

 * increment `x` by 1,
 * and leave `y` intact (add `0` to it).

We can write this using the same syntax as above:

In [None]:
delta_x, delta_y = moves['C']
print("to be added to X", delta_x)
print("to be added to Y", delta_y)

### Scanning 

We now have all the elements to write a function that

* expects as input a DNA fragment encoded as a string of characters among the 4 abbreviations,
* and returns two lists, that correspond to the X's and Y's of the path.

In [None]:
# this function computes the X and Y parts of the path
def path_x_y(dna):
    # initializing the results
    path_x, path_y = [], []
    # starting in the middle of the plane
    x, y = 0, 0
    # starting point is in the path
    path_x.append(x)
    path_y.append(y)

    # walking along the DNA
    for nucleotide in dna:
        # what move must we do next ?
        delta_x, delta_y = moves[nucleotide]
        # implement it
        x += delta_x
        y += delta_y
        # store the dot in the result(s)
        path_x.append(x)
        path_y.append(y)

    return path_x, path_y

Let us first see what this gives us on a very small DNA fragment:

In [None]:
small_dna = "CAGACCACT"
X, Y = path_x_y(small_dna)
print("the X part", X)

In [None]:
pyplot.plot(X, Y)
pyplot.show()
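One way to sanity-check such a path: since `C` and `G` move in opposite directions, as do `A` and `T`, the end point of a walk depends only on how many times each letter occurs. The sketch below checks this on the small fragment; `end_point` is a hypothetical helper introduced for this illustration, not part of the course code.

```python
small_dna = "CAGACCACT"

def end_point(dna):
    # C adds 1 to x and G subtracts 1; A adds 1 to y and T subtracts 1
    x = dna.count('C') - dna.count('G')
    y = dna.count('A') - dna.count('T')
    return x, y

# should match the last values of the X and Y lists computed by path_x_y
print(end_point(small_dna))
```

On this fragment there are 4 `C`, 1 `G`, 3 `A` and 1 `T`, so the walk ends at `(3, 2)`.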

### A shortcut

It is probably convenient to glue all this together in a single function:

In [None]:
def walk(dna):
    print("input sequence length", len(dna))
    X, Y = path_x_y(dna)
    pyplot.plot(X, Y)
    pyplot.show()

In [None]:
walk(small_dna)

### Larger inputs

Let us now run this code on the DNA sequence that is illustrated in the slide for sequence 7:

In [None]:
from samples import sample_week1_sequence7
print(sample_week1_sequence7)

It can be drawn like this:

In [None]:
walk(sample_week1_sequence7)

### Walking actual DNA sequences

If you go and browse http://www.ebi.ac.uk/ena, you will find that you can run all kinds of searches and obtain real data that you can work with.

##### A very visible loopback point: Borrelia 

We will start by running this code on *Borrelia*, which you [can see here](http://www.ebi.ac.uk/ena/data/view/CP000013), or find by yourself if you go to [http://ebi.ac.uk/ena](http://ebi.ac.uk/ena) and search for key `CP000013`. We have loaded it for you in the course (see below to learn how you can load other specimens yourself):

In [None]:
from samples import borrelia
print("size for borrelia", len(borrelia))

With this sample, you can see very clearly the point where the path essentially comes back on its own track:

In [None]:
walk(borrelia)

##### A counter-example: Synechocystis

On the other hand, here is what is obtained with *Synechocystis* (key `BA000022`). Please be a little patient, for this sequence contains no less than 3.5 million nucleotides.

In [None]:
from samples import synechosystis
walk(synechosystis)
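With such a long sequence, part of the waiting time may be spent drawing millions of tiny segments. As a rough sketch, one could compute the full path and then draw only one point out of every `step`; the helper `thin` below is hypothetical and only meant to illustrate the idea, and the drawing it produces is an approximation of the actual path.

```python
def thin(values, step):
    # keep one value out of every `step`
    thinned = values[::step]
    # always keep the final value, so that the end of the path is drawn
    if values and (len(values) - 1) % step != 0:
        thinned.append(values[-1])
    return thinned

# tiny demonstration on a list of 10 values
print(thin(list(range(10)), 4))
```

Applying the same `step` to both lists keeps them aligned, for instance `pyplot.plot(thin(X, 100), thin(Y, 100))`.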

### Real data

In order to illustrate what can very easily be achieved nowadays, I went [to the European Nucleotide Archive website](http://www.ebi.ac.uk/ena), searched for "Borrelia burgdorferi B31" and came up with this page:

[http://www.ebi.ac.uk/ena/data/view/AE000789](http://www.ebi.ac.uk/ena/data/view/AE000789)

We provide you with a (very rustic) utility that will let you download such sequences and manipulate them **right in this notebook**:

In [None]:
import fetch

You can for example fetch the sequence for key `AE000789` like this:

In [None]:
burgdorferi = fetch.fetch('AE000789')

In [None]:
# so you can draw it as well using our algorithm
walk(burgdorferi)
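Sequences downloaded from public archives may contain symbols other than the 4 letters, such as `N` for an undetermined base, in which case `moves[nucleotide]` raises a `KeyError`. The sketch below is one possible way to cope with this, under the assumption that staying in place on an unknown symbol is acceptable; `path_x_y_tolerant` is a hypothetical name, not part of the course code.

```python
# same moves as in the dictionary defined earlier, repeated to stand alone
moves = {'C': (1, 0), 'A': (0, 1), 'G': (-1, 0), 'T': (0, -1)}

def path_x_y_tolerant(dna):
    # the starting point is part of the path
    x, y = 0, 0
    path_x, path_y = [x], [y]
    for nucleotide in dna:
        # assumption of this sketch: an unknown symbol means "do not move"
        delta_x, delta_y = moves.get(nucleotide, (0, 0))
        x += delta_x
        y += delta_y
        path_x.append(x)
        path_y.append(y)
    return path_x, path_y

print(path_x_y_tolerant("CNA"))
```

Other policies are possible, for example skipping the symbol entirely or reporting it to the user.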

### Interactive path exploration

`matplotlib` was first designed to draw pictures on paper. We will end this notebook by talking about other possibilities that can turn out to be interesting in some cases. When a user interacts with a screen through a mouse and keyboard, it is possible to provide finer-grained tools to explore the details of those paths.

To this end, we are going to use an additional library on top of `matplotlib`, named `mpld3`, and here is what it looks like:

In [None]:
# need to import the library to use it
import mpld3

With this new tool, we can now display the same graphs:

In [None]:
def zoomable_walk(dna):
    print("input sequence length", len(dna))
    X, Y = path_x_y(dna)
    pyplot.plot(X, Y)
    # instead of displaying with pyplot.show()
    # we return a HTML object, that is 
    # rendered by the notebook
    return mpld3.display()

In [None]:
zoomable_walk(sample_week1_sequence7)

The graph is the same, but you can now zoom and move within the picture using the 3 little tools that are displayed in the bottom left area of the drawing when your mouse hovers over it:

* Home: come back to initial scale
* Move: change your viewpoint
* Zoom: click on a rectangle to zoom inside the figure

However, this kind of capability is more fun than actually useful: in practice it is not very convenient to perform such a fine-grained inspection on real data, and it is often preferable to tune the algorithm instead, as we will see in a later course.

******

##### Warning

The `fetch` function is, as we already mentioned, very limited. For those of you who are more familiar with Python, here is its source code, in case you are curious about how to implement this sort of function, or would like to improve it to better fit your needs:

In [None]:
fetch.list_module(fetch)