# DNA Translation 


## Introduction 
DNA translation is the final stage of gene expression: the information copied from DNA into messenger RNA (mRNA) is turned into a chain of amino acids joined by peptide bonds.
## DNA Data 
In this case study we need to download the data manually. The data is available on <abbr title = "National Center for Biotechnology Information">[NCBI](https://www.ncbi.nlm.nih.gov/nuccore/NM_207618.2)</abbr>, the United States' main public repository of DNA-related information. 
From this repository we will download, or rather copy and paste, two files. The first is a strand of DNA and the second is the protein sequence of amino acids translated from it. 

## Processing & Loading DNA Data 

While loading the DNA sequence, we need to remove the extra characters, such as line breaks, that were introduced by copying and pasting the data.

In [1]:
def read_seq(input_file):
    """Read the input file and return the sequence with line-break characters removed."""
    with open(input_file,"r") as f:
        seq = f.read()
    seq = seq.replace("\n","") # "\n" is the line feed (newline) character
    seq = seq.replace("\r","") # "\r" is the carriage return character (part of Windows line endings)
    return seq

In [2]:
# reading data 
prt = read_seq("protein.txt")
print(prt[:4])
dna = read_seq("dna.txt")
print(dna[:12]) # 4 amino acids correspond to 4 triplets of nucleotides

MSTH
GGTCAGAAAAAG
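
Text pasted from a web page can carry more than line breaks, for example spaces, tabs, or position numbers. As a rough sketch, the hypothetical helper `clean_seq` below (not part of the original notebook) keeps only the letters and upper-cases them. The pasted snippet is made up for illustration.

```python
def clean_seq(raw):
    """Keep only alphabetic characters and upper-case them (hypothetical helper)."""
    return "".join(ch for ch in raw if ch.isalpha()).upper()


# A made-up snippet mimicking a paste that includes numbering and spacing
raw = "  1 ggtcagaaaa ag\r\n"
print(clean_seq(raw))  # GGTCAGAAAAAG
```

This assumes the file holds a single sequence and no header text, since any letters in a header would be kept as well.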


## Translation 

In [3]:
def translate(seq):
    """Translate a string containing a nucleotide sequence into a string containing the corresponding sequence of amino acids.
        Nucleotides are translated in triplets using the dictionary table; each amino acid is encoded with a string of length 1.

        args:
            seq: the nucleotide sequence to be translated; its length must be a multiple of 3

        returns:
            protein: the sequence of amino acids that forms the protein molecule,
            or an empty string if len(seq) is not a multiple of 3.

    """
    table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
    protein = ""
    if len(seq) % 3 == 0:
        for i in range(0,len(seq),3):
            codon = seq[i:i+3]
            protein += table[codon] 

    return protein

> **Note**: The function returns an empty string if the length of the sequence is not divisible by 3, meaning that the sequence does not consist of complete codons.
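
Returning an empty string can hide the problem from the caller. As an alternative sketch, the hypothetical `translate_strict` below raises a `ValueError` instead. To keep the block short and self-contained, it rebuilds the same 64-codon table from the standard genetic code listed in `TCAG` base order; the names `BASES`, `AMINO_ACIDS`, and `CODON_TABLE` are assumptions of this sketch.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G; "_" marks a stop
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY__CC_W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_strict(seq):
    """Translate seq, raising ValueError instead of returning an empty string."""
    if len(seq) % 3 != 0:
        raise ValueError(f"sequence length {len(seq)} is not a multiple of 3")
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    try:
        return "".join(CODON_TABLE[codon] for codon in codons)
    except KeyError as err:
        # Lowercase letters or ambiguous bases such as "N" end up here
        raise ValueError(f"unknown codon {err}") from None


print(translate_strict("ATGTGGTAA"))  # MW_  (ATG -> M, TGG -> W, TAA -> stop)
```

Whether to raise or to return an empty string is a design choice; raising makes an incomplete sequence impossible to mistake for an empty protein.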

In [4]:
# check whether the sequence length is a multiple of 3 
print(len(dna)%3) 

2


So the `translate()` function cannot be applied to the full DNA sequence, because its length is not a multiple of 3. If we go back to the CDS (coding sequence) section on the NCBI page, we find that the translation covers positions **21** to **938** of the DNA sequence. The page numbers positions from **1** to **1157**, whereas Python indexes from 0.
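
The conversion from 1-based, inclusive positions to a 0-based, end-exclusive Python slice can be sketched as follows. The helper name `cds_slice` is hypothetical and only illustrates the arithmetic: subtract 1 from the start and leave the end as is.

```python
def cds_slice(start, end):
    """Convert 1-based inclusive coordinates to a Python slice (hypothetical helper)."""
    return slice(start - 1, end)


region = cds_slice(21, 938)
print(region.start, region.stop)          # 20 938
print(region.stop - region.start)         # 918 nucleotides
print((region.stop - region.start) % 3)   # 0

# Toy check: positions 2..4 of "ABCDE", counted from 1, are "BCD"
print("ABCDE"[cds_slice(2, 4)])           # BCD
```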


In [5]:
dna_new = dna[20:938] # start is 21 - 1 (0-based); end stays 938 because slices exclude the end index 
print(len(dna_new)%3) # we are ready to go :)

0


In [6]:
trans_prt = translate(dna_new)
print(trans_prt == prt)
print(trans_prt[-11:])
print(prt[-10:])

False
KGPCSVFFNC_
KGPCSVFFNC


If we look at the tails of the two protein sequences, we find that they are identical, except that the sequence generated by our `translate()` function ends with an underscore. The underscore represents a stop codon, which acts like the period at the end of a sentence: it signals where protein synthesis ends, and this is also what happens in nature. The stop codon does not code for an amino acid, so it does not appear in the protein sequence we got from the website, and we need to exclude it from our translation.
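
Instead of adjusting the slice, the stop symbol can also be dropped after translation. The sketch below uses a hypothetical helper `strip_stop`, which cuts the translated string at the first `_`. It assumes `_` is the only stop marker, as in the table used by `translate()`.

```python
def strip_stop(protein, stop="_"):
    """Cut the translated sequence at the first stop symbol, if any."""
    return protein.split(stop, 1)[0]


print(strip_stop("KGPCSVFFNC_"))  # KGPCSVFFNC
print(strip_stop("KGPCSVFFNC"))   # KGPCSVFFNC (unchanged when there is no stop)
```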


In [7]:
trans_prt = translate(dna[20:935]) # excluding the stop codon 
print(trans_prt == prt) # done :)

True


So, the main idea behind DNA translation is **decoding** sequences. We can extend this idea and apply it in many different contexts like constructing a **cipher decoder/encoder**. Let's take a look.  

# Cipher Decoder/Encoder 

A cipher is a secret code for a language. We will explore a cipher that is reported by contemporary Greek historians to have been used by Julius Caesar to send secret messages to generals during times of war. 

The Caesar cipher shifts each letter of a message to another letter in the alphabet located a fixed distance from the original letter. If our encryption key were `1`, we would shift `h` to the next letter `i`, `i` to the next letter `j`, and so on. Our alphabet also contains the space character, so shifting past `z` lands on the space, and shifting once more loops back to `a`. To decode the message, we make a similar shift, except we move the same number of steps backwards in the alphabet.
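
The shift described above is modular arithmetic on a character's index. Here is a minimal sketch for a single character, assuming a 27-symbol lowercase alphabet with the space at index 0; the names `ALPHABET` and `shift` are illustrative only.

```python
import string

ALPHABET = " " + string.ascii_lowercase  # 27 symbols, space at index 0


def shift(char, key):
    """Shift one symbol of ALPHABET by key positions, wrapping around."""
    return ALPHABET[(ALPHABET.index(char) + key) % len(ALPHABET)]


print(shift("h", 1))             # i
print(repr(shift("z", 1)))       # ' ' (index 27 wraps to index 0)
print(shift(" ", 1))             # a
print(shift(shift("h", 5), -5))  # h (a negative key undoes the shift)
```

Python's `%` operator returns a non-negative result for a positive modulus, so negative keys wrap around correctly without extra handling.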

In the following few cells, we will create our own Caesar cipher, as well as a message decoder for this cipher. 

## Alphabet Letters  

In [8]:
import string 
uppercase_letters = string.ascii_uppercase
lowercase_letters = string.ascii_lowercase
upper_alphabet = " " + uppercase_letters 
lower_alphabet = " " + lowercase_letters
alphabet = [lower_alphabet,upper_alphabet]
len(alphabet[0]) 

27

Next, we will construct a dictionary of two nested dictionaries, where each nested dictionary maps the characters of the corresponding alphabet to the numbers from 0 to 26.

In [9]:
lookup = dict()
for i,alph in enumerate(["lower","upper"]):
    # create a dictionary from (character, index) pairs
    lookup[alph] = dict(zip(alphabet[i],range(len(alphabet[i]))))
        

print(lookup["lower"])
print(lookup["upper"])

{' ': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}
{' ': 0, 'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'H': 8, 'I': 9, 'J': 10, 'K': 11, 'L': 12, 'M': 13, 'N': 14, 'O': 15, 'P': 16, 'Q': 17, 'R': 18, 'S': 19, 'T': 20, 'U': 21, 'V': 22, 'W': 23, 'X': 24, 'Y': 25, 'Z': 26}


## Caesar Cipher
Let's encode a message with a Caesar cipher. 


In [15]:
message = "Hi, my name is Youssef"
def encode(message, alphabet, lookup,key=1):
    """Encode the given message with a Caesar cipher.

        args:
            message: the message to be encoded
            alphabet: a list of two strings: [lower_alphabet, upper_alphabet]
            lookup: a dictionary of two nested dictionaries ("lower" and "upper");
            each maps the characters of the corresponding alphabet to their indices.
            key: encryption key (shift distance) of the shift-based encoding system

        returns:
            encoded_message: the Caesar cipher version of the message.
    """
    encoded_message = "" 
    for char in message:
        if not char.isalpha():
            encoded_message += char
        elif char.isupper():
            result = (lookup["upper"][char] + key) % 27 # wrap around so the index stays within 0..26
            encoded_message += alphabet[1][result] # at index 1 is upper alphabet
        else:
            result = (lookup["lower"][char] + key) % 27
            encoded_message += alphabet[0][result] # at index 0 is lower alphabet
    return encoded_message

first_encoded_message = encode(message,alphabet,lookup)
print(first_encoded_message)
second_encoded_message = encode(message,alphabet,lookup,3)
print(second_encoded_message)

Ij, nz obnf jt Zpvttfg
Kl, pa qdph lv Arxvvhi


## Generic Shift-based Encryption System  

With the help of the function above, we can build a decoder for any shift-based encryption system, given its encryption key, by reversing the direction of the shift.

In [16]:
first_decoded_message = encode(first_encoded_message,alphabet,lookup,-1)
second_decoded_message = encode(second_encoded_message,alphabet,lookup,-3)
print(first_decoded_message)
print(second_decoded_message)

Hi, my name is Youssef
Hi, my name is Youssef
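
One caveat is worth checking: `encode()` copies spaces unchanged, so a letter that is shifted onto the space character (for example `z` with key `1`) cannot be shifted back when decoding. The sample message happens to avoid this case. As a hedged alternative, the sketch below assumes plain 26-letter alphabets without the space, which makes the round trip hold for every key. The function name `caesar` and the use of `str.maketrans` are choices of this sketch, not of the original notebook.

```python
import string


def caesar(message, key):
    """Shift letters by key within a 26-letter alphabet; copy everything else."""
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    k = key % 26  # normalises negative keys as well
    table = str.maketrans(lower + upper,
                          lower[k:] + lower[:k] + upper[k:] + upper[:k])
    return message.translate(table)


print(caesar("xyz XYZ", 3))  # abc ABC

# Decoding with the negated key restores the message for every possible key
message = "Crazy zebra, size XXL"
for key in range(26):
    assert caesar(caesar(message, key), -key) == message
```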
