<span style="float:left;">Licence CC BY-NC-ND</span><span style="float:right;">François Rechenmann &amp; Thierry Parmentelat&nbsp;<img src="media/inria-25.png" style="display:inline"></span><br/>

# Translating an RNA into amino acids

Let us now see a python implementation of the algorithm that translates a DNA (or RNA) into a sequence of amino acids.

And as usual we want to be compatible with python2 and python3:

In [None]:
# this is so that we can use print() in python2 like in python3
from __future__ import print_function
# with this, division will behave in python2 like in python3
from __future__ import division

### Dictionaries

We start with a small digression. When we drew the DNA walk, we already used python dictionaries. Let us look at them again, because this data type gives us a solution that is both elegant and efficient for implementing this algorithm.

In python, a dictionary is a collection of *key* $\rightarrow$ *value* pairs. This is best seen on an example:

In [None]:
# an address book implemented as a dictionary
address_book = {
    'peter' : '14 Woodstock Street',
    'sherlock' : '221b Baker Street',
    'john' : '3 Hamstead Drive', 
}

The great advantage of a dictionary is that it is optimised for retrieving the value attached to a key:

In [None]:
# Looking up a key in a dictionary is done using []
# it is an efficient operation regardless of the dict's size
address_book['john']

In practical terms, python dictionaries are implemented using hash tables. Without going into too much detail, let us remember that **lookup is done in constant time** on average; this means that looking up a key in a 10,000-entry dict takes about as long as in a 10-entry one, and certainly not 1,000 times longer.
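
As a small illustration of the lookup operations discussed above, here is a sketch that re-declares an `address_book` like the one defined earlier, so that the cell is self-contained. The `in` operator and the `get()` method are standard dictionary features; `mycroft` is just a made-up key that is not in the dictionary.

```python
# a sketch: common ways to look up a key in a dictionary
address_book = {
    'peter': '14 Woodstock Street',
    'sherlock': '221b Baker Street',
    'john': '3 Hamstead Drive',
}

# the 'in' operator tells whether a key is present
has_john = 'john' in address_book
has_mycroft = 'mycroft' in address_book

# get() returns a default value instead of failing on a missing key
unknown = address_book.get('mycroft', 'unknown address')

# whereas [] raises a KeyError when the key is missing
try:
    address_book['mycroft']
    lookup_failed = False
except KeyError:
    lookup_failed = True

print("john present: {}".format(has_john))
print("mycroft present: {}".format(has_mycroft))
print("get() with a default: {}".format(unknown))
print("[] raised KeyError: {}".format(lookup_failed))
```

This matters for what follows: a lookup with `[]` on a key that is not in the dictionary stops the program with a `KeyError`.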

### Mapping codon $\rightarrow$ amino acid

You have probably already guessed where we are headed: a dictionary is the natural choice for implementing the association between a codon and its corresponding amino acid. This leads us quite simply to the following table, assuming that we decide to encode the `Stop` signal with the `#` sign:

In [None]:
# mapping codon -> amino acid
lookup_table = {
    'UUU' : 'F', 'UCU' : 'S', 'UAU' : 'Y', 'UGU' : 'C', 
    'UUC' : 'F', 'UCC' : 'S', 'UAC' : 'Y', 'UGC' : 'C', 
    'UUA' : 'L', 'UCA' : 'S', 'UAA' : '#', 'UGA' : '#', 
    'UUG' : 'L', 'UCG' : 'S', 'UAG' : '#', 'UGG' : 'W', 
    'CUU' : 'L', 'CCU' : 'P', 'CAU' : 'H', 'CGU' : 'R', 
    'CUC' : 'L', 'CCC' : 'P', 'CAC' : 'H', 'CGC' : 'R', 
    'CUA' : 'L', 'CCA' : 'P', 'CAA' : 'Q', 'CGA' : 'R', 
    'CUG' : 'L', 'CCG' : 'P', 'CAG' : 'Q', 'CGG' : 'R', 
    'AUU' : 'I', 'ACU' : 'T', 'AAU' : 'N', 'AGU' : 'S', 
    'AUC' : 'I', 'ACC' : 'T', 'AAC' : 'N', 'AGC' : 'S', 
    'AUA' : 'I', 'ACA' : 'T', 'AAA' : 'K', 'AGA' : 'R', 
    'AUG' : 'M', 'ACG' : 'T', 'AAG' : 'K', 'AGG' : 'R', 
    'GUU' : 'V', 'GCU' : 'A', 'GAU' : 'D', 'GGU' : 'G', 
    'GUC' : 'V', 'GCC' : 'A', 'GAC' : 'D', 'GGC' : 'G', 
    'GUA' : 'V', 'GCA' : 'A', 'GAA' : 'E', 'GGA' : 'G', 
    'GUG' : 'V', 'GCG' : 'A', 'GAG' : 'E', 'GGG' : 'G', 
}

With this in place, we can achieve what one of the previous videos (Week 2, Sequence 4) called `lookupGeneticCode`, by simply doing for example:

In [None]:
lookup_table['ACG']
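
Typing 64 entries by hand is error-prone, so it can be reassuring to cross-check the table. The sketch below rebuilds the same mapping from a compact description and runs a few sanity checks. The names `bases`, `amino_acids` and `generated_table` are introduced here for illustration only; the cell does not rely on `lookup_table`, so it can be run on its own.

```python
# a sketch: rebuilding the codon table from a compact description
bases = "UCAG"
# amino acids in the order UUU, UUC, UUA, UUG, UCU, ... GGG
# with '#' standing for Stop, as in the table above
amino_acids = ("FFLLSSSSYY##CC#W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

# enumerate the 64 codons in that same order
codons = [b1 + b2 + b3 for b1 in bases for b2 in bases for b3 in bases]
generated_table = dict(zip(codons, amino_acids))

# a few sanity checks
stop_codons = sorted(codon for codon in generated_table
                     if generated_table[codon] == '#')
distinct_values = set(generated_table.values())

print("number of codons: {}".format(len(generated_table)))
print("stop codons: {}".format(stop_codons))
print("distinct values (20 amino acids + Stop): {}".format(len(distinct_values)))
```

If `lookup_table` is defined in your session, evaluating `generated_table == lookup_table` is a quick way to compare the two.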

### Translation - 1st version

Thanks to this lookup dictionary, we can write the first version of our translation algorithm. In this first, simplistic version, we slice the RNA into 3-letter chunks and look up the corresponding amino acid for each of them. If we are left with 1 or 2 letters at the end, we just ignore them:

In [None]:
def translate_rna_to_amino_acids_1(rna):
    """
    Translation of an RNA string into a
    string of amino acids.
    Input string is cut into 3-letter chunks
    starting at index 0
    Extraneous letters if any are ignored
    """
    # the start of a 3-letter chunk 
    offset = 0
    # length of the incoming RNA - computed once and for all
    length = len(rna)
    # the result
    result = ""
    # main loop
    while offset <= length - 3:
        # use slicing to cut a chunk
        codon = rna[offset:offset + 3]
        # use += as usual to append to the resulting string
        result += lookup_table[codon]
        # move to the next codon for next iteration
        offset += 3
    # we're done
    return result
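
For comparison, here is a more compact sketch of the same idea, written with `range()` and `join()`. It is not the course's version: the name `translate_rna_compact` and the `table` parameter are introduced here, and `tiny_table` is a deliberately incomplete table with just enough codons for the example.

```python
def translate_rna_compact(rna, table):
    """
    Same result as the while-loop version above,
    written with range() and join().
    table is any codon -> amino acid dictionary.
    """
    # offsets 0, 3, 6, ... of all the complete codons
    offsets = range(0, len(rna) - 2, 3)
    # look up each codon and glue the results together
    return "".join(table[rna[offset:offset + 3]] for offset in offsets)

# a deliberately tiny table, just enough for this example
tiny_table = {'AUG': 'M', 'GGA': 'G', 'UAA': '#'}

# 11 letters: 3 complete codons, the trailing 'GG' is ignored
print(translate_rna_compact("AUGGGAUAAGG", tiny_table))
```

Passing the table as a parameter keeps the function independent from any global variable, which makes it easier to test on small inputs.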

Let us see what this gives us on an example:

In [None]:
from samples import small
print(small)

But beware, this is a piece of DNA; we must first translate it into RNA, using the function that we wrote in notebook *Week 2, Sequence 1*. Let us import that code here:

In [None]:
# translating DNA into RNA - see previous notebook
from w2_s02_c1_translate_dna_rna import translate_dna_to_rna

We may now compute its RNA:

In [None]:
# the corresponding RNA
small_rna = translate_dna_to_rna(small)
print(small_rna)

In [None]:
# the first version for translating into amino acids
translate_rna_to_amino_acids_1(small_rna)

### A second version

As we will see later on, when dealing with a DNA fragment, most of the time we do not know precisely where the starting point is. This means that the 3-letter chunks are not guaranteed to start at indices that are multiples of 3, as our first version assumes.

This is why our second version accepts an additional argument: the first index to take into account when splitting into codons. This leads us to this slightly different version:

In [None]:
def translate_rna_to_amino_acids_2(rna, phase):
    """
    Translation of an RNA string into a
    string of amino acids.
    Input string is cut into 3-letter chunks
    starting at index phase
    Extraneous letters if any are ignored
    """
    # the start of a 3-letter chunk 
    offset = phase
    # length of the incoming RNA - computed once and for all
    length = len(rna)
    # the result
    result = ""
    # main loop
    while offset <= length - 3:
        # use slicing to cut a chunk
        codon = rna[offset:offset + 3]
        # use += as usual to append to the resulting string
        result += lookup_table[codon]
        # move to the next codon for next iteration
        offset += 3
    # we're done
    return result

As you can see, the change is minimal: there is only an additional `phase` parameter, which is used as the initial value for `offset`.

With this in place, we can easily display the translations obtained with the 3 possible phases, through a shortcut function that first converts the incoming DNA into RNA and then calls the above function with the 3 possible values for `phase`:

In [None]:
def dna_to_amino_acids(dna):
    print("DNA = {}".format(dna))
    rna = translate_dna_to_rna(dna)
    print("RNA = {}".format(rna))
    print("Translating into amino acids by phase:")
    for phase in [0, 1, 2]:
        print("phase {} -> {}".format(
            phase, translate_rna_to_amino_acids_2(rna, phase)))

In [None]:
# that we can call like this
dna_to_amino_acids(small)

### Checking

In order to convince ourselves that our algorithm is correct, it is possible to check it manually on our small input sample:

    RNA:    GGACGGACGUUGACU

    phase=0 GGA-CGG-ACG-UUG-ACU
             G   R   T   L   T
    phase=1 G-GAC-GGA-CGU-UGA-CU
               D   G   R   #
    phase=2 GG-ACG-GAC-GUU-GAC-U
                T   D   V   D
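
The manual check above can also be written down as assertions, so that it can be replayed at will. The sketch below is self-contained: `sample_table` only holds the codons that occur in this sample (copied from the full table), and `translate_with_phase` is a stand-in, introduced here, that uses the same loop as our second version but takes the table as a parameter.

```python
# only the codons that occur in the sample, copied from the full table
sample_table = {
    'GGA': 'G', 'CGG': 'R', 'ACG': 'T', 'UUG': 'L', 'ACU': 'T',
    'GAC': 'D', 'CGU': 'R', 'UGA': '#', 'GUU': 'V',
}

def translate_with_phase(rna, phase, table):
    """
    Same loop as in the second version,
    with the lookup table passed as a parameter
    """
    result = ""
    offset = phase
    while offset <= len(rna) - 3:
        result += table[rna[offset:offset + 3]]
        offset += 3
    return result

sample_rna = "GGACGGACGUUGACU"
# the results of the manual check, one per phase
expected = {0: "GRTLT", 1: "DGR#", 2: "TDVD"}

for phase in [0, 1, 2]:
    computed = translate_with_phase(sample_rna, phase, sample_table)
    # the assert fails loudly if code and manual check disagree
    assert computed == expected[phase], (phase, computed)

print("all 3 phases match the manual check")
```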

### A little more legible - optional

To push a little further and show a few more features of python, for those of you who would like to dig a bit deeper, let us notice that we can display the resulting amino acids in a more legible way; we just need a second dictionary for that purpose:

In [None]:
amino_acid_names = {
    'A' : ('Ala', 'Alanine'),
    'R' : ('Arg', 'Arginine'),
    'N' : ('Asn', 'Asparagine'),
    'D' : ('Asp', 'Aspartic acid'),
    'C' : ('Cys', 'Cysteine'),
    'E' : ('Glu', 'Glutamic acid'),
    'Q' : ('Gln', 'Glutamine'),
    'G' : ('Gly', 'Glycine'),
    'H' : ('His', 'Histidine'),
    'I' : ('Ile', 'Isoleucine'),
    'L' : ('Leu', 'Leucine'),
    'K' : ('Lys', 'Lysine'),
    'M' : ('Met', 'Methionine'),
    'F' : ('Phe', 'Phenylalanine'),
    'P' : ('Pro', 'Proline'),
    'S' : ('Ser', 'Serine'),
    'T' : ('Thr', 'Threonine'),
    'W' : ('Trp', 'Tryptophan'),
    'Y' : ('Tyr', 'Tyrosine'),
    'V' : ('Val', 'Valine'),
    # an extra entry for the '#' that marks 'Stop'
    '#' : ('Stp', 'STOP'),
}

This way, for each character in a string of amino acids, we can find a full name for a more pleasant display. The only new thing here is that the value attached to, say, the `A` key is `('Ala', 'Alanine')`, which is a **tuple** and not a list. But as we will see, this does not change much in the way we use this data:

In [None]:
# a utility to display the amino acids 
# with a more complete name
def display_amino_acids(acids_string):
    for index, letter in enumerate(acids_string):
        # avoid the name 'long', which is a builtin type in python2
        short_name, long_name = amino_acid_names[letter]
        print("{:03d}:{} [{}] -> {}".format(index, letter, short_name, long_name))

In [None]:
# starting back from small_rna
print(small_rna)

In [None]:
# if we split it using phase 0
acids = translate_rna_to_amino_acids_1(small_rna)
print(acids)

In [None]:
# result can now be displayed this way
display_amino_acids(acids)
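
A variation on the same idea, given here as a sketch only: rather than printing one line per amino acid, we can build a single string out of the three-letter abbreviations. The function `format_amino_acids` and the `short_names` dictionary are introduced for this example; `short_names` only covers the letters of our small sample, with the same abbreviations as in the dictionary above.

```python
# three-letter abbreviations, for the letters of the sample only
short_names = {'G': 'Gly', 'R': 'Arg', 'T': 'Thr', 'L': 'Leu', '#': 'Stp'}

def format_amino_acids(acids_string, names):
    """
    Returns a dash-separated string of abbreviations,
    e.g. 'Gly-Arg' for 'GR', using the names dictionary
    """
    return "-".join(names[letter] for letter in acids_string)

# the phase 0 translation of our small sample
print(format_amino_acids("GRTLT", short_names))
```

Returning a string rather than printing it leaves it to the caller to decide what to do with the result.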

### A final note - optional

For those who are interested in python beyond our simplistic usage, I would also like to point out, for your curiosity, that in a more realistic application it would make sense to define the second version of the algorithm like this:

    def translate_rna_to_amino_acids_2(rna, phase=0):
        <code unchanged>

which would make it possible to call `translate_rna_to_amino_acids_2(rna)` without providing any value for the `phase` parameter, which would then be passed to the code as `0`.
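
To see a default value at work without depending on the rest of this notebook, here is a small self-contained sketch. The helper `split_into_codons` is hypothetical, it is not part of the course code; it only cuts the RNA into codons, which is enough to show the three ways of passing `phase`.

```python
def split_into_codons(rna, phase=0):
    """
    Returns the list of complete codons in rna,
    starting at index phase (0 if not specified)
    """
    return [rna[offset:offset + 3]
            for offset in range(phase, len(rna) - 2, 3)]

rna = "GGACGGACGUUGACU"

# no value provided: phase is 0
print(split_into_codons(rna))
# value provided by position
print(split_into_codons(rna, 1))
# value provided by name
print(split_into_codons(rna, phase=2))
```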