# The central dogma

Your task is to write a Python program that translates a set of gene sequences (provided in CSV format) into protein sequences, and then searches those protein sequences for a particular sequence of amino acids that represents a cleavage site for enterokinase. Enterokinase is a specific protease that cleaves after the lysine (K) of its cleavage site DDDDK; however, it will not cleave at a site that is followed by a proline (P).
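One possible way to express the cleavage rule is a regular expression with a negative lookahead. The sketch below is only an illustration: the names `ENTEROKINASE_SITE` and `has_cleavage_site` are hypothetical, and the peptides are made up for the example rather than taken from the data file.

```python
import re

# DDDDK that is NOT immediately followed by proline (P).
# The negative lookahead (?!P) rejects DDDDKP without consuming a residue,
# so a motif at the very end of a protein still matches.
ENTEROKINASE_SITE = re.compile(r"DDDDK(?!P)")


def has_cleavage_site(protein):
    """Return True if the protein contains at least one cleavable DDDDK site."""
    return ENTEROKINASE_SITE.search(protein) is not None


# Made-up peptides, for illustration only.
print(has_cleavage_site("MADDDDKGS"))   # True: motif followed by G
print(has_cleavage_site("MADDDDKPGS"))  # False: motif followed by P
print(has_cleavage_site("MADDDDK"))     # True: motif at the C-terminus
```

A plain substring test such as `"DDDDK" in protein` would also accept DDDDKP, which is why the lookahead (or an equivalent explicit check of the next residue) is needed.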

<table>
<caption>Standard genetic code</caption>
<tr>
<th rowspan="2">1st<br />
base</th>
<th colspan="8">2nd base</th>
<th rowspan="2">3rd<br />
base</th>
</tr>
<tr>
<th colspan="2">T</th>
<th colspan="2">C</th>
<th colspan="2">A</th>
<th colspan="2">G</th>
</tr>
<tr>
<th rowspan="4">T</th>
<td>TTT</td>
<td rowspan="2" style="background-color:#ffe75f">(Phe/F) </td>
<td>TCT</td>
<td rowspan="4" style="background-color:#b3dec0">(Ser/S) </td>
<td>TAT</td>
<td rowspan="2" style="background-color:#b3dec0">(Tyr/Y) </td>
<td>TGT</td>
<td rowspan="2" style="background-color:#b3dec0">(Cys/C) </td>
<th>T</th>
</tr>
<tr>
<td>TTC</td>
<td>TCC</td>
<td>TAC</td>
<td>TGC</td>
<th>C</th>
</tr>
<tr>
<td>TTA</td>
<td rowspan="6" style="background-color:#ffe75f">(Leu/L) </td>
<td>TCA</td>
<td>TAA</td>
<td style="background-color:#B0B0B0;"> Stop</td>
<td>TGA</td>
<td style="background-color:#B0B0B0;"> Stop</td>
<th>A</th>
</tr>
<tr>
<td>TTG</td>
<td>TCG</td>
<td>TAG</td>
<td style="background-color:#B0B0B0;"> Stop</td>
<td>TGG</td>
<td style="background-color:#ffe75f;">(Trp/W) &#160;&#160;&#160;&#160;</td>
<th>G</th>
</tr>
<tr>
<th rowspan="4">C</th>
<td>CTT</td>
<td>CCT</td>
<td rowspan="4" style="background-color:#ffe75f">(Pro/P) </td>
<td>CAT</td>
<td rowspan="2" style="background-color:#bbbfe0">(His/H) </td>
<td>CGT</td>
<td rowspan="4" style="background-color:#bbbfe0">(Arg/R) </td>
<th>T</th>
</tr>
<tr>
<td>CTC</td>
<td>CCC</td>
<td>CAC</td>
<td>CGC</td>
<th>C</th>
</tr>
<tr>
<td>CTA</td>
<td>CCA</td>
<td>CAA</td>
<td rowspan="2" style="background-color:#b3dec0">(Gln/Q) </td>
<td>CGA</td>
<th>A</th>
</tr>
<tr>
<td>CTG</td>
<td>CCG</td>
<td>CAG</td>
<td>CGG</td>
<th>G</th>
</tr>
<tr>
<th rowspan="4">A</th>
<td>ATT</td>
<td rowspan="3" style="background-color:#ffe75f">(Ile/I) </td>
<td>ACT</td>
<td rowspan="4" style="background-color:#b3dec0">(Thr/T) &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;</td>
<td>AAT</td>
<td rowspan="2" style="background-color:#b3dec0">(Asn/N) </td>
<td>AGT</td>
<td rowspan="2" style="background-color:#b3dec0">(Ser/S) </td>
<th>T</th>
</tr>
<tr>
<td>ATC</td>
<td>ACC</td>
<td>AAC</td>
<td>AGC</td>
<th>C</th>
</tr>
<tr>
<td>ATA</td>
<td>ACA</td>
<td>AAA</td>
<td rowspan="2" style="background-color:#bbbfe0">(Lys/K) </td>
<td>AGA</td>
<td rowspan="2" style="background-color:#bbbfe0">(Arg/R) </td>
<th>A</th>
</tr>
<tr>
<td>ATG<sup class="reference" id="ref_methionineA"></sup></td>
<td style="background-color:#ffe75f;">(Met/M) </td>
<td>ACG</td>
<td>AAG</td>
<td>AGG</td>
<th>G</th>
</tr>
<tr>
<th rowspan="4">G</th>
<td>GTT</td>
<td rowspan="4" style="background-color:#ffe75f">(Val/V) </td>
<td>GCT</td>
<td rowspan="4" style="background-color:#ffe75f">(Ala/A) </td>
<td>GAT</td>
<td rowspan="2" style="background-color:#f8b7d3">(Asp/D) </td>
<td>GGT</td>
<td rowspan="4" style="background-color:#ffe75f">(Gly/G) </td>
<th>T</th>
</tr>
<tr>
<td>GTC</td>
<td>GCC</td>
<td>GAC</td>
<td>GGC</td>
<th>C</th>
</tr>
<tr>
<td>GTA</td>
<td>GCA</td>
<td>GAA</td>
<td rowspan="2" style="background-color:#f8b7d3">(Glu/E) </td>
<td>GGA</td>
<th>A</th>
</tr>
<tr>
<td>GTG</td>
<td>GCG</td>
<td>GAG</td>
<td>GGG</td>
<th>G</th>
</tr>
</table>


## The task

1) Write a Python 3 program that will translate each of the cDNA sequences into an amino acid sequence in single letter code. It should use the CSV format file "cDNA\_sequences.csv" as input. The file is comma-delimited, and has the following layout (sequences are truncated in length for clarity):

<table>
<tr><td>ENST0000039045</td><td>ATGGTCCTGAAATTCTCC</td></tr>
<tr><td>ENST0000055716</td><td>ATGGAGACTCTCCTGAAAGTGCTTT</td></tr>
<tr><td>ENST0000039046</td><td>ATGGAGACTGTTCTGCAAGTACTCC</td></tr>
</table>

Amino acid (protein) sequences should be output as a **FASTA format file** named "protein_sequences.fasta". 

2) Your program should search the translated (amino acid) sequences for the enterokinase motif: DDDDK, but exclude those where the next amino acid after the motif is a proline - i.e. DDDDKP. How many of the sequences contain one or more enterokinase motifs?
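As a starting point, the translation and FASTA steps might be sketched as follows. This is a minimal sketch under several assumptions that the task description does not state: the CSV file has no header row, translation starts at the first base and stops at the first stop codon, a trailing incomplete codon is ignored, and any codon that is not in the table is written as `X`. The function names `translate`, `to_fasta` and `translate_file` are hypothetical.

```python
import csv
from itertools import product

# Standard genetic code, listed in the same T, C, A, G order as the table above.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cdna):
    """Translate a cDNA string into single-letter amino acid code."""
    cdna = cdna.strip().upper()
    protein = []
    # Step through complete codons only; leftover bases at the end are ignored.
    for i in range(0, len(cdna) - 2, 3):
        aa = CODON_TABLE.get(cdna[i:i + 3], "X")  # assumption: unknown codon -> X
        if aa == "*":  # assumption: stop at the first stop codon
            break
        protein.append(aa)
    return "".join(protein)


def to_fasta(seq_id, protein, width=60):
    """Format one record as FASTA text, wrapping the sequence at `width`."""
    lines = [protein[i:i + width] for i in range(0, len(protein), width)]
    return ">" + seq_id + "\n" + "\n".join(lines) + "\n"


def translate_file(csv_path="cDNA_sequences.csv",
                   fasta_path="protein_sequences.fasta"):
    """Translate every row of the CSV file and write the proteins as FASTA."""
    proteins = {}
    with open(csv_path, newline="") as src, open(fasta_path, "w") as out:
        for row in csv.reader(src):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            seq_id = row[0].strip()
            proteins[seq_id] = translate(row[1])
            out.write(to_fasta(seq_id, proteins[seq_id]))
    return proteins
```

Calling `translate_file()` with the default arguments reads "cDNA_sequences.csv" from the current directory and returns a dictionary of identifier to protein sequence, which can then be scanned for the enterokinase motif.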


# Code Golf!

Code Golf is a computer programming competition in which participants strive to achieve the shortest possible source code that implements an algorithm. 

The code that results from Code Golf naturally has some terrible stylistic problems, such as a lack of comments, very short variable names and sometimes obscure ways of solving problems. **Please do not try to develop your real code in this style!**

Code Golf is a very good exercise to make you think about other ways in which a task can be completed and understand the many alternative ways in which an algorithm can be written to solve a problem.


In [None]:
#Use Code golf to restrict the number of characters used. 


# timeit magic challenge 

You can use a variety of 'cell magics' to make use of additional features and functionality in Jupyter notebooks - see https://ipython.readthedocs.io/en/stable/interactive/magics.html for details. 

If you include %%timeit at the beginning of your cell, Jupyter will run the cell multiple times and report the mean time taken to run it. For example:

In [34]:
%%timeit
# Find prime numbers by trial division
def generate_primes(max_num):
    primes = []
    for num in range(2, max_num + 1):
        is_prime = True
        # Only divisors up to the square root of num need to be tested
        for denominator in range(2, int(num ** 0.5) + 1):
            if num % denominator == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(num)
    return primes
generate_primes(30)

51.5 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In this exercise, instead of minimising the number of characters in your translation and sequence-searching code, optimise it to run as fast as possible. Avoid looping through data structures repeatedly, use dictionaries instead of lists for lookups, and think about whether all the steps in your program are necessary and optimised.
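To illustrate the point about dictionaries, the sketch below times a codon lookup done by scanning a list against the same lookup done with a dictionary. It uses the standard library `timeit` module rather than the `%%timeit` cell magic so that it also runs outside Jupyter. The helper names (`lookup_list`, `lookup_dict`, `run_list`, `run_dict`) and the repeat counts are arbitrary choices for this example, and the timings it prints will vary from machine to machine.

```python
import timeit
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(codon) for codon in product(BASES, repeat=3)]
CODON_DICT = dict(zip(CODONS, AMINO_ACIDS))


def lookup_list(codon):
    # list.index scans the list from the start on every call
    return AMINO_ACIDS[CODONS.index(codon)]


def lookup_dict(codon):
    # a dictionary lookup hashes the key and jumps straight to the value
    return CODON_DICT[codon]


# Every codon, repeated, so both approaches do the same amount of work.
test_codons = CODONS * 100


def run_list():
    return [lookup_list(codon) for codon in test_codons]


def run_dict():
    return [lookup_dict(codon) for codon in test_codons]


list_time = timeit.timeit(run_list, number=20)
dict_time = timeit.timeit(run_dict, number=20)
print(f"list scan:   {list_time:.4f} s")
print(f"dict lookup: {dict_time:.4f} s")
```

Both functions return the same amino acids, so any difference in the reported times comes from the choice of data structure alone.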

You can also repeat this type of optimisation for your answers to Workshops 3 and 4; there is much more scope for optimising that code than there is for the translation algorithm.