# Using dictionaries to look up DNA codons

The power of dictionaries lies in the ability to quickly look up a value associated with a key.  Here's an example to show how you can use a dictionary to translate a DNA sequence into a protein sequence. 

### The Central Dogma

![Central Dogma](./images/central_dogma.png "Central Dogma")


The genetic code consists of 64 triplets of nucleotides called codons. Three of these are 'stop' codons that terminate translation; each of the remaining 61 encodes one of the 20 amino acids used in the synthesis of proteins. Because there are more codons than amino acids, the code is degenerate: most amino acids are encoded by more than one codon.
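These counts can be checked with a few lines of Python. The sketch below is only an illustration: it rebuilds the standard code from a compact 64-letter string, assuming the bases are ordered T, C, A, G and that '*' marks a stop codon. The names `bases`, `amino_acids`, `codon_table` and `codons_per_residue` are introduced here for the example and are not part of the translation program in this notebook.

```python
from collections import Counter
from itertools import product

# Assumed layout: codons are enumerated with bases in the order T, C, A, G,
# the third base varying fastest; '*' marks a stop codon.
bases = "TCAG"
amino_acids = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

# Pair every triplet (TTT, TTC, TTA, ...) with its one-letter amino acid code
codons = ["".join(triplet) for triplet in product(bases, repeat=3)]
codon_table = dict(zip(codons, amino_acids))

# Count how many codons map to each amino acid (or to stop)
codons_per_residue = Counter(codon_table.values())

print(len(codon_table))              # total number of codons: 64
print(codons_per_residue["*"])       # stop codons: 3
print(len(codons_per_residue) - 1)   # distinct amino acids (excluding stop): 20
print(codons_per_residue["L"], codons_per_residue["M"])  # 6 codons for Leu, 1 for Met
```

Leucine, serine and arginine each have six codons, while methionine and tryptophan have only one, which is why the code is degenerate "in most cases" rather than in all.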

For programming purposes, we do not need to first transcribe DNA sequences into RNA (which contains uracil in place of thymine); we can simply use a table of DNA codons.


<table>
<caption>Standard genetic code</caption>
<tr>
<th rowspan="2">1st<br />
base</th>
<th colspan="8">2nd base</th>
<th rowspan="2">3rd<br />
base</th>
</tr>
<tr>
<th colspan="2">T</th>
<th colspan="2">C</th>
<th colspan="2">A</th>
<th colspan="2">G</th>
</tr>
<tr>
<th rowspan="4">T</th>
<td>TTT</td>
<td rowspan="2" style="background-color:#ffe75f">(Phe/F) </td>
<td>TCT</td>
<td rowspan="4" style="background-color:#b3dec0">(Ser/S) </td>
<td>TAT</td>
<td rowspan="2" style="background-color:#b3dec0">(Tyr/Y) </td>
<td>TGT</td>
<td rowspan="2" style="background-color:#b3dec0">(Cys/C) </td>
<th>T</th>
</tr>
<tr>
<td>TTC</td>
<td>TCC</td>
<td>TAC</td>
<td>TGC</td>
<th>C</th>
</tr>
<tr>
<td>TTA</td>
<td rowspan="6" style="background-color:#ffe75f">(Leu/L) </td>
<td>TCA</td>
<td>TAA</td>
<td style="background-color:#B0B0B0;"> Stop</td>
<td>TGA</td>
<td style="background-color:#B0B0B0;"> Stop</td>
<th>A</th>
</tr>
<tr>
<td>TTG</td>
<td>TCG</td>
<td>TAG</td>
<td style="background-color:#B0B0B0;"> Stop</td>
<td>TGG</td>
<td style="background-color:#ffe75f;">(Trp/W) &#160;&#160;&#160;&#160;</td>
<th>G</th>
</tr>
<tr>
<th rowspan="4">C</th>
<td>CTT</td>
<td>CCT</td>
<td rowspan="4" style="background-color:#ffe75f">(Pro/P) </td>
<td>CAT</td>
<td rowspan="2" style="background-color:#bbbfe0">(His/H) </td>
<td>CGT</td>
<td rowspan="4" style="background-color:#bbbfe0">(Arg/R) </td>
<th>T</th>
</tr>
<tr>
<td>CTC</td>
<td>CCC</td>
<td>CAC</td>
<td>CGC</td>
<th>C</th>
</tr>
<tr>
<td>CTA</td>
<td>CCA</td>
<td>CAA</td>
<td rowspan="2" style="background-color:#b3dec0">(Gln/Q) </td>
<td>CGA</td>
<th>A</th>
</tr>
<tr>
<td>CTG</td>
<td>CCG</td>
<td>CAG</td>
<td>CGG</td>
<th>G</th>
</tr>
<tr>
<th rowspan="4">A</th>
<td>ATT</td>
<td rowspan="3" style="background-color:#ffe75f">(Ile/I) </td>
<td>ACT</td>
<td rowspan="4" style="background-color:#b3dec0">(Thr/T) &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;</td>
<td>AAT</td>
<td rowspan="2" style="background-color:#b3dec0">(Asn/N) </td>
<td>AGT</td>
<td rowspan="2" style="background-color:#b3dec0">(Ser/S) </td>
<th>T</th>
</tr>
<tr>
<td>ATC</td>
<td>ACC</td>
<td>AAC</td>
<td>AGC</td>
<th>C</th>
</tr>
<tr>
<td>ATA</td>
<td>ACA</td>
<td>AAA</td>
<td rowspan="2" style="background-color:#bbbfe0">(Lys/K) </td>
<td>AGA</td>
<td rowspan="2" style="background-color:#bbbfe0">(Arg/R) </td>
<th>A</th>
</tr>
<tr>
<td>ATG<sup class="reference" id="ref_methionineA"></sup></td>
<td style="background-color:#ffe75f;">(Met/M) </td>
<td>ACG</td>
<td>AAG</td>
<td>AGG</td>
<th>G</th>
</tr>
<tr>
<th rowspan="4">G</th>
<td>GTT</td>
<td rowspan="4" style="background-color:#ffe75f">(Val/V) </td>
<td>GCT</td>
<td rowspan="4" style="background-color:#ffe75f">(Ala/A) </td>
<td>GAT</td>
<td rowspan="2" style="background-color:#f8b7d3">(Asp/D) </td>
<td>GGT</td>
<td rowspan="4" style="background-color:#ffe75f">(Gly/G) </td>
<th>T</th>
</tr>
<tr>
<td>GTC</td>
<td>GCC</td>
<td>GAC</td>
<td>GGC</td>
<th>C</th>
</tr>
<tr>
<td>GTA</td>
<td>GCA</td>
<td>GAA</td>
<td rowspan="2" style="background-color:#f8b7d3">(Glu/E) </td>
<td>GGA</td>
<th>A</th>
</tr>
<tr>
<td>GTG</td>
<td>GCG</td>
<td>GAG</td>
<td>GGG</td>
<th>G</th>
</tr>
</table>

In this program, we use a dictionary to store each codon and the amino acid it codes for as a key:value pair. That way, we can look up the amino acid directly from the codon; for example, codon_dict['ATA'] returns 'I'.
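As a small illustration of the lookup, the sketch below uses a four-entry subset of the codon table. The name `mini_codon_dict` and the placeholder 'X' for an unrecognised codon are assumptions made for this example only.

```python
# Illustrative subset of the codon dictionary (the full table has 64 entries)
mini_codon_dict = {'ATA': 'I', 'ATG': 'M', 'TGG': 'W', 'TAA': '.'}

# Direct lookup by key
print(mini_codon_dict['ATA'])            # I

# Square brackets raise KeyError for a missing key, so either test
# membership first or supply a default value with .get()
print('NNN' in mini_codon_dict)          # False
print(mini_codon_dict.get('NNN', 'X'))   # X (assumed placeholder for "unknown")
```

Testing membership with `in` is the approach that fits a sequence which may contain ambiguous bases such as N.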

The next stage is to loop through each codon in the sequence. We create a loop that generates every multiple of three from zero up to (but not including) the length of the sequence (0, 3, 6, 9, 12 ...) as n. We can then extract the codon from the DNA using slice notation, dna[n:n+3], and use it as the key to look up the amino acid in the codon dictionary.
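To see what the indices and slices look like on their own, here is a minimal sketch using a hypothetical 11-base sequence, `short_dna`, whose length is deliberately not a multiple of three.

```python
# Hypothetical sequence: three complete codons plus two leftover bases
short_dna = 'ATGTATCCTTA'

codons = []
for n in range(0, len(short_dna), 3):   # n takes the values 0, 3, 6, 9
    codon = short_dna[n:n+3]            # slice of up to three bases starting at n
    codons.append(codon)
    print(n, codon)

print(codons)   # ['ATG', 'TAT', 'CCT', 'TA']
```

Slicing past the end of a string does not raise an error, so the final slice is simply shorter than three bases. A partial codon like this will not be found in a table of three-letter keys, so any lookup needs to allow for it.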


```python
dna = 'ATGTATCCTTATACTCACAACTCGAAGATTCTTCTTTCTGCACGAGAAGCGTGGGAATCCATGGAATAA'

# Codon dictionary: key is a DNA codon, value is the one-letter amino acid code
# ('.' marks a stop codon)
codon_dict = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'.', 'TAG':'.',
    'TGC':'C', 'TGT':'C', 'TGA':'.', 'TGG':'W',
}

# Start with an empty string and add each amino acid to it
proteinsequence = ''

# Loop through the sequence in blocks of three DNA nucleotides
for n in range(0, len(dna), 3):  # (start, stop, step)
    codon = dna[n:n+3]  # extract the three-letter codon at position n of the DNA
    # Skip codons that are not in the table, such as those containing an
    # ambiguous base (N); there are none in this sequence
    if codon in codon_dict:
        proteinsequence += codon_dict[codon]  # append the one-letter amino acid code

print(proteinsequence)
```

Output:

```
MYPYTHNSKILLSAREAWESME.
```
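The loop above silently skips any codon it does not recognise and keeps translating past a stop codon. One possible variation, sketched below, wraps the same idea in a function that marks unknown codons with a placeholder and stops at the first stop codon. The function name `translate`, its parameters, the 'X' placeholder and the small `demo_table` are all assumptions made for this illustration; in practice the full codon dictionary would be passed in instead.

```python
def translate(dna, table, stop_symbol='.', unknown='X'):
    """Translate dna codon by codon, stopping at the first stop codon."""
    residues = []
    # Ending the range two bases early ignores an incomplete trailing codon
    for n in range(0, len(dna) - 2, 3):
        codon = dna[n:n+3].upper()
        residue = table.get(codon, unknown)   # placeholder if codon is not in table
        if residue == stop_symbol:
            break                             # end translation at the stop codon
        residues.append(residue)
    return ''.join(residues)


# Small demonstration table; a real run would use all 64 codons
demo_table = {'ATG': 'M', 'TGG': 'W', 'TAA': '.'}

print(translate('ATGTGGNNNTAAATGTG', demo_table))   # MWX
```

Collecting the residues in a list and joining them once at the end avoids building a new string on every iteration, which matters little here but is the usual idiom for long sequences.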
