<span style="float:left;">Licence CC BY-NC-ND</span><span style="float:right;">François Rechenmann &amp; Thierry Parmentelat&nbsp;<img src="media/inria-25.png" style="display:inline"></span><br/>

# RBS adjustment

In this notebook, we implement the RBS adjustment algorithm. But first, as usual:

In [None]:
# this is so that we can use print() in python2 like in python3
from __future__ import print_function
# with this, division will behave in python2 like in python3
from __future__ import division

### Variation on coding regions, with RBS

To implement the adjustment idea described in the video, we modify `coding_regions_one_phase` as follows. Notice that the difference from the simple version is very small:

In [None]:
# importing the utility functions that we had used to implement
# the simple version of this algorithm
from w3_s03_c2_next_codon import next_start_codon, next_stop_codon

# here again the default minimal length is 300
def coding_regions_one_phase_rbs(dna, phase, rbs, minimal_length=300):
    # initializing index
    # remember that next_start_codon and next_stop_codon 
    # always remain on the same phase
    index = phase
    # we return results as a list of couples 
    # [start_gene, stop_gene]
    genes = []
    # stop1 is the stop "on the left hand side"
    # at this point, it is the first stop on the phase
    stop1 = next_stop_codon(dna, index)
    # if we have no stop at all in the sequence, we're done
    if not stop1:
        return genes
    # main loop
    while True:
        # look for next stop after stop1
        # which is the "right hand side" stop
        stop2 = next_stop_codon(dna, stop1 + 3)
        # if there is none, we are done
        if not stop2:
            return genes
        # also it needs to be far enough
        if stop2 - stop1 < minimal_length:
            # too short : we skip this fragment
            stop1 = stop2
            continue
        # at this point, we found an ORF, we just need to find the correct Start
        start = next_start_codon(dna, stop1)
        # if there is none, it means we will not find anything more
        # and so we are done
        if not start:
            return genes
        # can we find an RBS ?
        next_rbs = dna.find(rbs, start)
        # is it in the region ?
        if start <= next_rbs <= stop2:
            # yes, the RBS is in the region
            # we adjust the beginning of the coding region
            # as the next START on the right of that RBS
            start = next_start_codon(dna, next_rbs)
        if not start or stop2 - start < minimal_length:
            # no START was found after the RBS, or the region
            # is too small: it is ignored
            pass
        else:
            # this time, we found a gene, we add it to the results
            genes.append( [start, stop2] )
        # we can move on to the next ORF
        stop1 = stop2
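
To make the adjustment step concrete, here is a minimal sketch on a toy sequence. It is only an illustration under stated assumptions: `next_start_codon_sketch` is a hypothetical stand-in for the course's `next_start_codon` (assumed to move 3 by 3 from the given index and to return `None` when nothing is found), and `adjust_start` is a name introduced here, not part of the course code. The toy sequence is built so that the RBS sits on the same phase as the START codons.

```python
def next_start_codon_sketch(dna, index):
    """Hypothetical stand-in: first ATG at or after index,
    moving 3 by 3; returns None if there is none."""
    for i in range(index, len(dna) - 2, 3):
        if dna[i:i+3] == "ATG":
            return i
    return None

def adjust_start(dna, start, stop, rbs):
    """Return the START adjusted to the right of the RBS,
    or the original start if no RBS lies between start and stop."""
    next_rbs = dna.find(rbs, start)
    # str.find returns -1 when the RBS is absent,
    # which fails the test below since start >= 0
    if start <= next_rbs <= stop:
        return next_start_codon_sketch(dna, next_rbs)
    return start

# toy sequence: START, filler, RBS, filler, START, filler, STOP
toy = "ATG" + "CCC" + "AGGAGG" + "CCC" + "ATG" + "CCCCCC" + "TAA"
toy_stop = toy.find("TAA")

print(adjust_start(toy, 0, toy_stop, "AGGAGG"))
print(adjust_start(toy, 0, toy_stop, "TTTTTT"))
```

With the RBS present, the start moves from the first `ATG` to the one located after the RBS; with a motif that does not occur in the sequence, the start is left untouched.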

### On real data

Let us see the results of our two versions of `coding_regions` on a real bacterium, the famous *Escherichia coli*, whose genome is available from the ENA under accession number `U00096`, and which we have downloaded for you:

In [None]:
# we avoid naming the file object `input`, which would shadow the builtin
with open("data/escherichia-coli-U00096") as ecoli_file:
    e_coli = ecoli_file.read()
print("Escherichia coli has {} nucleotides".format(len(e_coli)))

Using the simple algorithm:

In [None]:
# the search for coding regions on one phase
# as we saw it in sequence 2
from w3_s02_c1_regions_codantes_v1 import coding_regions_one_phase

we obtain on this sample:

In [None]:
for phase in 0, 1, 2:
    simple_algo = coding_regions_one_phase(e_coli, phase)
    # the average size
    how_many = len(simple_algo)
    total_length = sum(stop - start for start, stop in simple_algo)
    average_size = total_length / how_many
    print("Simple algorithm finds {} regions, avg size = {:.02f} on phase {}"
          .format(how_many, average_size, phase))

If we now use the `AGGAGG` sequence as the RBS, we obtain:

In [None]:
rbs_coli = 'AGGAGG'

for phase in 0, 1, 2:
    rbs_algo = coding_regions_one_phase_rbs(e_coli, phase, rbs_coli)
    # the average size
    how_many = len(rbs_algo)
    total_length = sum(stop - start for start, stop in rbs_algo)
    average_size = total_length / how_many
    print("RBS algorithm finds {} regions, avg size = {:.02f} on phase {}"
          .format(how_many, average_size, phase))
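
The statistics computed for the simple version and for the RBS version are the same, so they could be factored out. Here is a minimal sketch; `region_stats` is a name introduced here for illustration, and returning an average of `0.0` for an empty list (instead of failing with a `ZeroDivisionError`) is a design choice of this sketch, not something the course prescribes.

```python
def region_stats(regions):
    """Return (how_many, average_size) for a list of [start, stop] pairs.

    The average is 0.0 when the list is empty.
    """
    how_many = len(regions)
    if how_many == 0:
        return 0, 0.0
    total_length = sum(stop - start for start, stop in regions)
    # float() keeps a true division under python2 as well
    return how_many, float(total_length) / how_many

# toy values, made up for the example
how_many, average_size = region_stats([[0, 300], [100, 700]])
print("{} regions, avg size = {:.02f}".format(how_many, average_size))
```

Each loop body then reduces to one call to `region_stats` followed by the `print`.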

You can see that this method yields a rather small adjustment, one that preserves orders of magnitude:
* the number of coding regions is approximately the same; the difference is mainly due to regions that the adjustment shortens below the minimal length;
* the sizes of the coding regions also remain in the same order of magnitude.
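
To check where the difference in counts comes from, the two result lists can be compared region by region. The sketch below is only an illustration: `compare_regions` is a hypothetical helper, and it assumes that both versions report the same stop position for a given ORF, so that regions can be matched on their stop. The toy lists are made up for the example and are not results on *Escherichia coli*.

```python
def compare_regions(simple, adjusted):
    """Classify the regions of `simple` by looking them up in `adjusted`.

    Both arguments are lists of [start, stop] pairs; regions are
    matched on their stop, which the adjustment never moves.
    """
    adjusted_starts = {stop: start for start, stop in adjusted}
    summary = {"unchanged": 0, "shortened": 0, "dropped": 0}
    for start, stop in simple:
        if stop not in adjusted_starts:
            # the region vanished, e.g. it fell below the minimal length
            summary["dropped"] += 1
        elif adjusted_starts[stop] == start:
            summary["unchanged"] += 1
        else:
            # same stop, but the start was moved by the adjustment
            summary["shortened"] += 1
    return summary

# made-up values, for illustration only
simple_toy = [[10, 400], [500, 900], [1000, 1600]]
adjusted_toy = [[10, 400], [1150, 1600]]
print(compare_regions(simple_toy, adjusted_toy))
```

On real data, one would pass the lists returned for a given phase by `coding_regions_one_phase` and `coding_regions_one_phase_rbs`.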