<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>François Rechenmann &amp; Thierry Parmentelat</span>
<span><img src="media/inria-25-alpha.png" /></span>
</div>

# Distances array

In this notebook we will implement the algorithm that computes the distances array between all species in a given set.

### Text file

As in the video, let us assume we have obtained a text file that contains the DNA sequences for the species that we are interested in.

Here are the contents of the sample file that we will use when running our algorithm:

In [None]:
with open('data/species.txt') as input:
    for line in input:
        print(line)

You will notice that the lines are separated by blank lines; this is because the `line` variable still contains the *newline* character read from the file, and this adds to the *newline* that `print` appends by default. To avoid this duplicate *newline*, we have 2 options.
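
To see this extra character for yourself, one possibility is to print the `repr` of each line. The sketch below uses an in-memory `io.StringIO` object as a stand-in for the file, with made-up content (the name `fake_file` and the two sequences are assumptions for illustration only), so that it runs without `data/species.txt`:

```python
import io

# hypothetical content standing in for data/species.txt
fake_file = io.StringIO("ACGT\nTTGA\n")

raw_lines = []
for line in fake_file:
    # repr() makes the trailing newline visible as \n
    print(repr(line))
    raw_lines.append(line)
```

Each line read this way ends with `\n`, which is the character that both options below deal with.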

##### `print` without *newline*

The first option is to tell `print` to refrain from adding this extra *newline*:

In [None]:
with open('data/species.txt') as input:
    for line in input:
        # we leave a newline in line
        # but tell print to *not* add an extra newline
        # using end=""
        print(line, end="")

##### Removing *newline*s

The other option is to remove the *newline* from the `line` variable; this is the option we will use in this notebook, so as to remain consistent with the other algorithms written so far:

In [None]:
with open('data/species.txt') as input:
    for line in input:
        # directly remove newline from line
        line = line.strip()
        # and now we can print as usual
        print(line)
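
Note that `strip()` removes all leading and trailing whitespace, not only the final *newline*. For DNA sequences this makes no practical difference, but the following small sketch, on a made-up string, shows how it compares with `rstrip("\n")`, which removes trailing newlines only:

```python
# made-up line with surrounding spaces and a trailing newline
raw = "  ACGT \n"

# strip() removes whitespace on both sides, newline included
stripped = raw.strip()

# rstrip("\n") removes trailing newline characters only
only_newline = raw.rstrip("\n")

print(repr(stripped))
print(repr(only_newline))
```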

##### Line numbers with `enumerate`

We can also use `enumerate`, as we have already done here and there; this gives us access to a line counter. As always in Python, indices start at `0`, but as we will see, this is rather a good thing. This leads us to:

In [None]:
with open('data/species.txt') as input:
    for index, line in enumerate(input):
        # directly remove newline from line
        line = line.strip()
        # and now we can print as usual
        print(index, line)
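
One reason why starting at `0` is convenient: the index produced by `enumerate` is exactly the position of the item in a list. Here is a minimal sketch on a made-up list of sequences (the names `sequences` and `indices` are illustrative only):

```python
# made-up sequences, standing in for the lines of the file
sequences = ["ACGT", "TTGA", "CCGA"]

for index, seq in enumerate(sequences):
    # the index can be used directly to look the item up again
    assert sequences[index] == seq
    print(index, seq)

# the indices seen during the loop
indices = [index for index, _ in enumerate(sequences)]
```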

### Needleman and Wunsch's distance

We import the `distance` function, as we wrote it last week in sequence 9, in the iterative form of Needleman and Wunsch's algorithm:

In [None]:
from w4_s09_c1_needleman_wunsh_iter import needleman_wunsch, distance

And as a reminder, for illustrative purposes:

In [None]:
sample1 = "ACCTCTGTATCTATTCGGCATCGATCAT"
sample2 = "ACCTCGTGTATCTCTTCGGCATCATCAT"

needleman_wunsch(sample1, sample2)

In [None]:
# and indeed
distance(sample1, sample2)

### Dictionary indexed on tuples (simplified version)

For those of you who chose to skip the optional section on this topic in sequence 9 last week, here is a condensed version of what you need to know to understand the algorithm in the present notebook.

Short version: one can create a dictionary:

In [None]:
# starting from a dictionary 
d = {}

# we have seen that we can insert keys that are integers
d[1] = "un"
# or strings
d["deux"] = 2
print(d)

Well, one can also add keys that are tuples (in our case, pairs), and it looks like this:

In [None]:
d[(1, 2)] = "the 1,2 couple"
print(d)

There is no restriction of any kind: this dictionary can be used exactly as usual, and so we can use that same tuple to retrieve the value:

In [None]:
d[(1, 2)]

or even more simply:

In [None]:
d[1, 2]

This technique is useful to us here because it helps reduce the memory footprint: as we have seen in the video, the distances array is symmetric, so there is no need to store a whole matrix. We will see in the next section an even more interesting advantage of this feature, but let us not get ahead of ourselves.
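
To give an idea of the saving, here is a small sketch that counts the entries needed for a hypothetical set of `n = 5` species, first with one entry per ordered pair, then with only the pairs such that $i>j$ (the names `full` and `triangle` are made up for this illustration):

```python
n = 5  # hypothetical number of species

# full square matrix: one entry per ordered pair (i, j)
full = {(i, j): None for i in range(n) for j in range(n)}

# lower triangle only: pairs with i > j
triangle = {(i, j): None for i in range(n) for j in range(i)}

print(len(full), len(triangle))
```

With $n$ species the triangle holds $n(n-1)/2$ entries instead of $n^2$, that is, fewer than half.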

### Computing the distances array

With all these tools at our disposal, it is now very simple to write a function that computes the array of all distances, with this code:

In [None]:
def all_distances(filename):
    """
    Reads the input file, which is expected to contain one DNA sequence per line

    Returns:
    * a list of the input sequences
    * a dictionary keyed by pairs of indices (i, j) with i > j, whose values are the corresponding distances
    """

    # we first read the file and store all sequences in 'dnas'
    dnas = []
    distances = {}
    
    with open(filename) as input:
        for line in input:
            dnas.append(line.strip())
            
    # compute each distance only once, for the pairs (i, j) with j < i
    for i, dnai in enumerate(dnas):
        for j in range(i):
            dnaj = dnas[j]
            distances[i, j] = distance(dnai, dnaj)

    return dnas, distances

In [None]:
all_distances("data/species.txt")
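
For readers who do not have the course module or the data file at hand, here is a self-contained sketch of the same double loop. It takes a list of sequences instead of a filename, and it relies on a deliberately naive stand-in, `toy_distance` (number of mismatching positions plus the difference in lengths), which is NOT the Needleman and Wunsch distance. The names `toy_distance` and `all_distances_from_list`, as well as the sample strings, are made up for illustration.

```python
def toy_distance(s1, s2):
    # naive stand-in, NOT Needleman-Wunsch: count mismatches
    # position by position, plus the difference in lengths
    mismatches = sum(c1 != c2 for c1, c2 in zip(s1, s2))
    return mismatches + abs(len(s1) - len(s2))

def all_distances_from_list(dnas, distance_function=toy_distance):
    # same double loop as all_distances: only pairs (i, j) with j < i
    distances = {}
    for i, dnai in enumerate(dnas):
        for j in range(i):
            distances[i, j] = distance_function(dnai, dnas[j])
    return distances

# made-up sequences
sample = ["ACGT", "ACGA", "TCG"]
toy = all_distances_from_list(sample)
print(toy)
```

Passing the distance function as a parameter is only a convenience here; it would allow plugging in the real `distance` function when it is available.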

The only minor drawbacks of this technique are that:

  * (a) the dictionary does not present the values as a readable table (and in Python versions before 3.7 it did not even guarantee to keep track of the order in which values are inserted),
  * (b) and also of course, the tuple needs to be built *in the right order*, that is, with $i>j$.

Here is, for example, how we could work around these drawbacks and improve the overall layout:

In [None]:
def get_distance(d, i, j):
    return 0 if i == j \
        else d[(i, j)] if i > j \
        else d[(j, i)]

# displaying on 4 characters
space = 4*" "
formatr = "{:4}"
formatl = "{:<4}"

def pretty_distances(filename):
    dnas, distances = all_distances(filename)
    l = len(dnas)
    # first line : headers
    print(space + "".join([ formatr.format(i) for i in range(l)]))
    # for each line
    for i in range(l):
        print(formatl.format(i) 
              + "".join([formatr.format(get_distance(distances, i, j)) 
                                   for j in range(l)]))

In [None]:
pretty_distances("data/species.txt")
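
The ordering constraint (b) mentioned above can be checked on a tiny hand-made dictionary. In the sketch below, the distances are made up, and `symmetric_lookup` is a hypothetical variant of `get_distance` written with `max` and `min`; it is one possible way of normalising the pair, not necessarily the best one:

```python
def symmetric_lookup(d, i, j):
    # same idea as get_distance: the diagonal is 0, and otherwise
    # the pair is reordered so that the larger index comes first
    if i == j:
        return 0
    return d[max(i, j), min(i, j)]

# made-up distances, stored only for i > j
toy = {(1, 0): 3, (2, 0): 5, (2, 1): 4}

# a direct access with the indices in the wrong order fails
try:
    toy[0, 1]
    wrong_order_ok = True
except KeyError:
    wrong_order_ok = False

print(wrong_order_ok)
print(symmetric_lookup(toy, 0, 1), symmetric_lookup(toy, 1, 0))
print(symmetric_lookup(toy, 2, 2))
```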