<h1 id="toctitle">Dictionaries exercise solutions</h1>
<ul id="toc"/>

## DNA translation

First, we need to think about splitting up the DNA into codons. We can do the first few manually:

In [1]:
dna = "ATGTTCGGT"
codon1 = dna[0:3]
codon2 = dna[3:6]
codon3 = dna[6:9]
print(codon1, codon2, codon3)

('ATG', 'TTC', 'GGT')


Doing a few by hand reveals the pattern: the start position goes up by three each time, and the stop position is always three greater than the start. So we can generate the start positions with a range:

In [3]:
dna = "ATGTTCGGT" 
for start in range(0,7,3): 
    codon = dna[start:start+3] 
    print("one codon is " + codon) 

one codon is ATG
one codon is TTC
one codon is GGT
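
To look at the start positions on their own, here is a small sketch. The `list` call is only there so the values can be printed; the loop above does not need it.

```python
dna = "ATGTTCGGT"

# range(0, 7, 3) yields every third position and stops before 7
starts = list(range(0, 7, 3))
print(starts)   # [0, 3, 6]

# slicing from each start to start + 3 gives the codons
codons = []
for start in starts:
    codons.append(dna[start:start + 3])
print(codons)   # ['ATG', 'TTC', 'GGT']
```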


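This works for this particular sequence, but we need a more general solution. We always start at zero and always go up by three, but the second argument to range (the stop value) is trickier. range never yields the stop value itself, so we set it to `len(dna) - 2`: the largest possible start position is then `len(dna) - 3`, which is the last position from which a complete codon can be read: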

In [6]:
dna = "ATGTTCGTGACGAGGGT" 

# stop value for range(): the final codon can start no later than len(dna) - 3
last_codon_start = len(dna) - 2 

# process the dna sequence in three base chunks
for start in range(0,last_codon_start,3): 
    codon = dna[start:start+3] 
    print("one codon is " + codon) 

one codon is ATG
one codon is TTC
one codon is GTG
one codon is ACG
one codon is AGG


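This version works for a DNA sequence of any length. If the length is not a multiple of three, the one or two leftover bases at the end are ignored (here, the final `GT`).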
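
As a quick check of the `len(dna) - 2` stop value, the sketch below lists the start positions for sequences of length 6 to 9. The helper name `codon_starts` is hypothetical and exists only for this illustration. Lengths 7 and 8 give the same two codons as length 6, because the trailing bases cannot form a full codon.

```python
def codon_starts(dna):
    # hypothetical helper: the same range as in the loop above
    return list(range(0, len(dna) - 2, 3))

for dna in ["ATGTTC", "ATGTTCG", "ATGTTCGG", "ATGTTCGGT"]:
    codons = []
    for start in codon_starts(dna):
        codons.append(dna[start:start + 3])
    print(str(len(dna)) + " bases: " + " ".join(codons))
```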

Getting the amino acid for a given codon is easy: we just look it up in a dict of the genetic code:

In [7]:
gencode = {
'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}


dna = "ATGTTCGTGACGAGGGT" 

last_codon_start = len(dna) - 2 

for start in range(0,last_codon_start,3): 
    codon = dna[start:start+3] 
    aa = gencode[codon]
    print(codon, aa) 

('ATG', 'M')
('TTC', 'F')
('GTG', 'V')
('ACG', 'T')
('AGG', 'R')


Now all we need to do is build up the protein sequence one amino acid at a time:

In [8]:
dna = "ATGTTCGTGACGAGGGT" 

last_codon_start = len(dna) - 2 

protein = ""
for start in range(0,last_codon_start,3): 
    codon = dna[start:start+3] 
    aa = gencode[codon]
    protein = protein + aa
    print(codon, aa, protein) 

('ATG', 'M', 'M')
('TTC', 'F', 'MF')
('GTG', 'V', 'MFV')
('ACG', 'T', 'MFVT')
('AGG', 'R', 'MFVTR')


We can see how one amino acid gets added to the `protein` string each time round the loop. At the end of the loop we have the complete protein:

In [9]:
dna = "ATGTTCGTGACGAGGGT" 

last_codon_start = len(dna) - 2 

protein = ""
for start in range(0,last_codon_start,3): 
    codon = dna[start:start+3] 
    aa = gencode[codon]
    protein = protein + aa
    
print(protein) 

MFVTR
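
An alternative sketch, not the approach used in the rest of this solution, is to collect the amino acids in a list and join them at the end. The small `gencode_excerpt` dict is an assumption made to keep the example self-contained; it holds only the five codons needed here.

```python
# excerpt of the genetic code: only the codons used in this example
gencode_excerpt = {'ATG': 'M', 'TTC': 'F', 'GTG': 'V', 'ACG': 'T', 'AGG': 'R'}

dna = "ATGTTCGTGACGAGGGT"

amino_acids = []
for start in range(0, len(dna) - 2, 3):
    codon = dna[start:start + 3]
    amino_acids.append(gencode_excerpt[codon])

# join the single-letter codes into one string
protein = "".join(amino_acids)
print(protein)   # MFVTR
```

For short sequences the two approaches behave the same; repeated string concatenation is simply easier to follow step by step.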


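This is a good candidate for a function, so let's turn it into one. Note that the function looks codons up with `gencode.get(codon)`, which returns `None` for a codon that is not in the dict instead of raising a `KeyError`: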

In [10]:
def translate_dna(dna): 
    last_codon_start = len(dna) - 2 
    protein = "" 
    for start in range(0,last_codon_start,3): 
        codon = dna[start:start+3] 
        aa = gencode.get(codon) 
        protein = protein + aa 
    return protein 

In [11]:
translate_dna("ATGTTCGTGACGAGGGT")

'MFVTR'

And we'll try a few more DNA sequences to see what happens:

In [12]:
print(translate_dna("ATGTTCGGT")) 
print(translate_dna("ATCGATCGATCGTTGCTTATCGATCAG")) 
print(translate_dna("actgatcgtagctagctgacgtatcgtat")) 
print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))

MFG
IDRSLLIDQ


TypeError: cannot concatenate 'str' and 'NoneType' objects

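We get an error on the third DNA sequence because it's in lower case: the lower-case codons are not keys in the dict, so `get` returns `None`, which cannot be added to a string. The fix is to convert each codon to upper case before looking it up: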

In [13]:
def translate_dna(dna): 
    last_codon_start = len(dna) - 2 
    protein = "" 
    for start in range(0,last_codon_start,3): 
        codon = dna[start:start+3] 
        aa = gencode.get(codon.upper()) 
        protein = protein + aa 
    return protein 

In [14]:
print(translate_dna("ATGTTCGGT")) 
print(translate_dna("ATCGATCGATCGTTGCTTATCGATCAG")) 
print(translate_dna("actgatcgtagctagctgacgtatcgtat")) 
print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))

MFG
IDRSLLIDQ
TDRS_LTYR


TypeError: cannot concatenate 'str' and 'NoneType' objects

Now the third DNA sequence works but we get an error on the fourth one. What is wrong? Let's print out the codon at each step to figure it out:

In [17]:
def translate_dna(dna): 
    last_codon_start = len(dna) - 2 
    protein = "" 
    for start in range(0,last_codon_start,3): 
        codon = dna[start:start+3] 
        print(codon)
        aa = gencode.get(codon.upper()) 
        protein = protein + aa 
    return protein 
print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))

ACG
ATC
GAT
CGT
NAC


TypeError: cannot concatenate 'str' and 'NoneType' objects
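
The `NoneType` in the message is a clue. Here is a minimal sketch of where it comes from, using a two-codon excerpt of the dict (an assumption, to keep the example self-contained). `get` returns `None` for a key that is not present, and adding `None` to a string raises the `TypeError`.

```python
# two-entry excerpt of the genetic code, for illustration only
gencode_excerpt = {'ACG': 'T', 'ATC': 'I'}

aa = gencode_excerpt.get('NNN')   # 'NNN' is not a key in the dict
print(aa)                         # None

try:
    protein = "TI" + aa           # str + None is not allowed
except TypeError as error:
    print("TypeError: " + str(error))
```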

The error comes from the codon `NAC`, i.e. there is an N in the sequence. The best way to fix it is to say that the amino acid for any codon that isn't in the dict is `X`, meaning unknown. To do this we just switch from

```python
aa = gencode.get(codon.upper())
```

to 

```python
aa = gencode.get(codon.upper(), 'X')
```

like this:

In [18]:
def translate_dna(dna): 
    last_codon_start = len(dna) - 2 
    protein = "" 
    for start in range(0,last_codon_start,3): 
        codon = dna[start:start+3] 
        print(codon)
        aa = gencode.get(codon.upper(), 'X') 
        protein = protein + aa 
    return protein 
print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))

ACG
ATC
GAT
CGT
NAC
GTA
CGA
TCG
TAC
TCG
TIDRXVRSYS


Now we get a complete protein translation. Notice that it has an X in the middle, where the codon `NAC` was.
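
With the debugging print removed, the finished function might look like the sketch below. The eight-entry `gencode_excerpt` and the `table` parameter are assumptions made to keep the example self-contained; the excerpt covers only the codons in this test sequence, and in the notebook the full `gencode` dict would be used instead. Upper-casing the whole sequence once, rather than each codon, is a small variation on the code above.

```python
# excerpt of the genetic code: only the codons in the test sequence
gencode_excerpt = {
    'ACG': 'T', 'ATC': 'I', 'GAT': 'D', 'CGT': 'R',
    'GTA': 'V', 'CGA': 'R', 'TCG': 'S', 'TAC': 'Y'}

def translate_dna(dna, table=gencode_excerpt):
    dna = dna.upper()                     # accept lower-case input
    protein = ""
    for start in range(0, len(dna) - 2, 3):
        codon = dna[start:start + 3]
        # codons missing from the table become 'X' (unknown)
        protein = protein + table.get(codon, 'X')
    return protein

print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))   # TIDRXVRSYS
print(translate_dna("acgatcgatcgtnacgtacgatcgtactcg"))   # TIDRXVRSYS
```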


