# Homework 5: translation, ORFs

Here, you will write simple Python code, building on a template I have provided. 

Fill out this Jupyter notebook, adding code to the cells that say to do so. Be sure to save your final version.

**Assigned:** 20 September, Lecture 10

**Due:** 9 October, Lecture 15

Please put all your work in a directory named HW5 on your private repo, so we can keep the different homework solutions separated.

Here we will use *dict*s to translate DNA sequences and to find all *open reading frames*.

Remember the central dogma of molecular biology: DNA is transcribed into RNA, which is then translated into protein (a sequence of amino acids), with each codon (a triplet of nucleotides) specifying one amino acid (see Lecture 10).
The (most common) mapping of codons to amino acids, known as the *translation table*, is given in the notes for Lecture 10, on 20 September. 
Translation begins with the codon *ATG*, skipping sequence before that if necessary.
Translation continues until it reaches a *stop codon*: *TAA*, *TAG*, or *TGA*.

The region of the DNA between *ATG* and a stop codon is called an *open reading frame* (ORF). 
By convention, we are excluding the start and stop codons from the ORF.
An ORF is potentially a gene, since it has the potential to be translated into protein. 

Remember also that DNA sequences have six *reading frames*, three in the 5' to 3' direction, each offset by one nucleotide from the other two; and three on the reverse complement sequence (the 5' to 3' sequence reversed, with A, C, G, T replaced by T, G, C, A respectively). 
Thus, an **open** reading frame is part of a given reading frame which may be translated into a protein sequence.
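
The reverse-complement rule described above can be sketched in a few lines. This is only an illustration, assuming the sequence contains nothing but A, C, G, and T; the names `COMPLEMENT` and `reverse_complement` are made up for this example and are not part of the template.

```python
# Hypothetical helper for illustration only (not part of the homework template).
# Assumes the input contains only the letters A, C, G, T (in either case).
COMPLEMENT = str.maketrans('ACGT', 'TGCA')   # A<->T, C<->G

def reverse_complement(seq):
    """Return the reverse complement of a DNA string, in upper case."""
    complemented = seq.upper().translate(COMPLEMENT)  # swap each base for its partner
    return complemented[::-1]                         # then read it backwards

example = 'ATGCCG'
print(reverse_complement(example))  # CGGCAT
```

Taking the reverse complement twice should give back the original (upper-cased) sequence, which is a handy sanity check.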

In this assignment, you will be given a DNA sequence, which by convention is in the 5' to 3' direction. 
We will use only the reading frame that starts at the first nucleotide of this sequence, ignoring the other five reading frames.
You will build a list of codons for this sequence. 
You will then translate this list into a string representing the encoded protein. 

Here are two examples.

Given

>'ccATGgaaggctgcttcacagctgtgtacttacgaggtctctcaagcgaagttagtacc'

return

>list_of_orfs = [] # (the only 'atg' is out of frame)
>encoded_proteins = [] # because there's nothing to translate

But given

>'cccATGgaaTAGggctgcttcATGacagctgtgtacttaTGAcgaggtctctcaagcgaagttagtacc'

return

>list_of_orfs = ['gaa', 'acagctgtgtactta']
>encoded_proteins = ['E', 'TAVYL']

You will be working with 

>list_of_codons = ['cgg', 'cgg', 'ATG', 'aac', 'TAG', 'aac', 'ATG', 'cgc', 'ggc', 'gta', 'TAA', 'tcg', 'tta', 'att', 'ATG', 'tcc', 'cct', 'cca', 'gtt', 'gat', 'aag', 'att', 'TAA', 'tgt']

>encoded_proteins = ['N', 'RGV', 'SPPVDKI']

(note: I won't give you the tricky case where there are two start or end codons in a row! But think about that possibility.)

(**important note:** the *case* of nucleotides or amino acids in this notebook is only there to make it easier to see start and stop codons. In practice, you should never rely on the case of a nucleotide designation; different sequence-providing programs do different things.)

## Your assignment

Translate all the ORFs in the following sequence (and reading frame) into amino acid sequences.

>'cggcggATGaacaacTAGcgcggcgtaATGtcgttaattTGAtccATGcctccagttgatTGAaagatttgt'

Remember: your code should not be case-dependent.

Also, a *very* useful trick is to convert a list of strings into a string by *join*ing the members of the list with the following code:

>''.join(my_list)

### Create a dict to look up codon:amino acid code
You know you'll need a way to translate codons into amino acids.
A dict is (clearly) the best data structure for the job. 
So, given a list of codons and a corresponding list of amino acids, zip them together into a dict called *codon_translation*, where 
>codon_translation['codon'] == the amino acid translation of 'codon'

You could just copy and paste from lecture 10. But you need practice with *zip*, so...fill in "zip(,)" using the two lists below.
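
If *zip* is still unfamiliar, here is a minimal sketch of the idea on two toy lists that have nothing to do with codons. The names `letters`, `numbers`, `pairs`, and `lookup` are invented for this illustration.

```python
# Toy data for illustration only; these lists are not part of the assignment.
letters = ['a', 'b', 'c']
numbers = [1, 2, 3]

# zip pairs up the i-th item of each list
pairs = list(zip(letters, numbers))
print(pairs)        # [('a', 1), ('b', 2), ('c', 3)]

# dict turns those (key, value) pairs into a lookup table
lookup = dict(zip(letters, numbers))
print(lookup['b'])  # 2
```

The first list supplies the keys and the second supplies the values, so the order of the two arguments matters.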

In [None]:
### Create a dict to look up codon:amino acid code
codons = ['TAG', 'CCT', 'TAT', 'CTT', 'CAG', 'GTA', 
          'GGT', 'ATT', 'TGT', 'ACC', 'GTC', 'CGT', 
          'AGG', 'GCA', 'TTG', 'AAG', 'AGT', 'CCC', 
          'ACG', 'GGC', 'TCG', 'AAC', 'GAC', 'GAT', 
          'ATA', 'TCC', 'TAC', 'GTT', 'ACA', 'ATC', 
          'CCA', 'CTG', 'GAA', 'TCA', 'CGG', 'AGC', 
          'CAA', 'CAC', 'GCC', 'TGC', 'CGC', 'TTA', 
          'GTG', 'ATG', 'CTC', 'ACT', 'TTT', 'GCT', 
          'CAT', 'TCT', 'AAA', 'TAA', 'GCG', 'CCG', 
          'GAG', 'GGA', 'TGA', 'GGG', 'TTC', 'TGG', 
          'AAT', 'AGA', 'CTA', 'CGA']
amino_acids = ['_', 'P', 'Y', 'L', 'Q', 'V', 'G', 'I', 'C', 
               'T', 'V', 'R', 'R', 'A', 'L', 'K', 'S', 'P', 
               'T', 'G', 'S', 'N', 'D', 'D', 'I', 'S', 'Y', 
               'V', 'T', 'I', 'P', 'L', 'E', 'S', 'R', 'S', 
               'Q', 'H', 'A', 'C', 'R', 'L', 'V', 'M', 'L', 
               'T', 'F', 'A', 'H', 'S', 'K', '_', 'A', 'P', 
               'E', 'G', '_', 'G', 'F', 'W', 'N', 'R', 'L', 'R']

codon_translation = dict(zip(,))  # complete this function

### Set up your DNA string variable

We'll use this later.
To keep case from being a problem, just convert everything to upper case now!

In [None]:
DNA = ''.upper()   # copy and paste is fine here. Note how we set things up to ignore case!

### Get a list of codons from the input string

See code comments for what you need to do. 

Remember, you can use a shorter DNA string of your own choosing while developing the code. It's easier to work with a simple example to which you know the answer, until you're convinced your code works!
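
For instance, a short test string makes it easy to check the result by eye. The sketch below is a different way of cutting a string into codons (slicing by index rather than shortening the string as the template does), shown only so you know what output to expect; `toy_dna`, `codon_length`, and `toy_codons` are made-up names.

```python
# Illustration only: a toy string, not the assignment sequence.
toy_dna = 'atgaaatagcc'.upper()   # 11 nucleotides; the trailing 'CC' is not a full codon
codon_length = 3

# Number of complete codons that fit in the string (integer division)
n_codons = len(toy_dna) // codon_length

# Slice out each complete codon by its start and end index
toy_codons = [toy_dna[i * codon_length:(i + 1) * codon_length]
              for i in range(n_codons)]
print(toy_codons)  # ['ATG', 'AAA', 'TAG']
```

Note that leftover nucleotides at the end, which do not make up a full codon, are simply dropped.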

In [None]:
# make list of codons
list_of_codons =    # initialize this (hint: what is your "list of codons" so far?)

for i in range(len(DNA) // 3):   # integer division: number of complete codons
    list_of_codons.append(DNA[:???]) # fill in ??? (hint: how long is a codon?)
    DNA = DNA[???:]

print(list_of_codons)   # do this to be sure you've got it right

### Get a list of the ORFs in the input

We know that *list_of_codons* is a list of the codons in *DNA*. 
So, now we need to find start and stop codon pairs, and keep just the codons between the two. 

The following code is a little tricky, so you only need to initialize the two lists *orf_list* and *next_orf*.

But, make an effort to understand what this code does.
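
One way to study the logic is to write the same idea as a small function and try it on a list whose answer you already know. The sketch below is a hypothetical variant, not the template code: the names `find_orfs` and `STOP_CODONS` are invented here, and it makes one assumption the template does not, namely that an *ATG* found inside an ORF is kept as an ordinary codon rather than restarting the ORF.

```python
# Hypothetical variant for study purposes; not the template's implementation.
STOP_CODONS = {'TAA', 'TAG', 'TGA'}

def find_orfs(codons):
    """Return the codons strictly between each ATG and the next stop codon."""
    orfs = []
    current = None                 # None means "not inside an ORF right now"
    for codon in codons:
        if current is None:
            if codon == 'ATG':     # found a start: begin collecting after it
                current = []
        elif codon in STOP_CODONS: # found a stop: the ORF is complete
            orfs.append(current)
            current = None
        else:                      # inside an ORF: keep the codon
            current.append(codon)
    return orfs                    # an ORF with no stop codon is discarded

toy = ['CCC', 'ATG', 'GAA', 'TAG', 'GGC', 'ATG', 'ACA', 'GCT', 'TGA', 'CGA']
print(find_orfs(toy))  # [['GAA'], ['ACA', 'GCT']]
```

Try feeding it lists with no start codon, or with a start codon but no stop codon, and compare what it returns with what the template code does.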

In [None]:
# make list of ORFs
orf_list =     # FILL IN
next_orf =     # FILL IN

found_start, found_stop = False, False
for next_codon in list_of_codons:
    if next_codon == 'ATG':    # found start
        found_start = True   # start looking for stop
        found_stop = False
        next_orf = []
    if next_codon in ('TAG', 'TGA', 'TAA'):    # found stop
        found_stop = True   # start looking for next start
    if found_start:
        next_orf.append(next_codon)
        if found_stop:
            orf_list.append(next_orf[1:-1])   # drop the start and stop codons
            next_orf = []
            found_start, found_stop = False, False 

print('List of codons: ',list_of_codons,'\nlist of orfs: ',orf_list)

### Translate all orfs into proteins

This code walks through the list of ORFs you made above and translates each one into a protein using the dict you created for translation.

The one "trick" in this code is that it first translates the ORF (which is a list of codons) into a list of amino acids, then converts that list of amino acids into a string representing the protein (using *''.join(protein_translation)*, the trick introduced above). 
We build a list first because strings are immutable: adding a letter to a string creates a new string each time, whereas appending to a list changes it in place, so for long sequences the list is typically **much** more efficient. 
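
Here is a small sketch of the two approaches side by side, on a toy sequence of letters. Both give the same answer; the difference is only in how much copying happens along the way. The variable names are invented for this illustration.

```python
# Illustration only: building the same string two ways.
letters = 'TAVYL'

# Approach 1: collect pieces in a list, then join once at the end
pieces = []
for letter in letters:
    pieces.append(letter)      # the list grows in place
joined = ''.join(pieces)       # one string is built at the end

# Approach 2: grow a string directly
concatenated = ''
for letter in letters:
    concatenated += letter     # strings are immutable, so conceptually this makes a new string each time

print(joined)                  # TAVYL
print(joined == concatenated)  # True
```

For five letters the difference is invisible, but the habit pays off once sequences get long.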

Fix the errors in the code. (there are two!)

In [None]:
all_translated_proteins = []
for next_orf in orf_list:
    protein_translation = []
    for next_codon in next_orf:
        protein_translation.append(codon_translations[next_codon])
    all_translated_protein.append(''.join(protein_translation))

print(all_translated_proteins)

### (Extra Credit) ORFs in all reading frames

The assignment above assumes that you have one DNA sequence in *DNA* and that this is the only reading frame of interest.

But in reality there are six different *reading frames*, depending on whether you start at positions 0, 1, or 2 in either the DNA sequence (actually, we should be using RNA) or its reverse complement. 

1. Create a list named *reading_frames* which contains each of the six reading frames for *DNA*.

In [None]:
#your code here
reading_frames = [ ... ] # fill this in

2. (Discuss) What would you need to do to create a list of all open reading frames for each reading frame?

(put your discussion here)

## Grading
There is little scope for cleverness here, so we will use the following grades:

grade | if
----- | :---------
0     | you don't turn in anything (in the *master* branch of *HW5* in your private repo)
1     | you turn in something but it isn't correct
2     | you turn in something, some is correct but some isn't
3     | All of your code is correct, but it violates proper style, or has poor/missing comments
4     | what you turned in is correct, stylish, and complete