## Part II: Translating DNA to protein

In this lesson, we will learn how to translate DNA into protein. To do the translation we will use Biopython, a Python library built specifically for biology.

DNA is made up of 4 nucleotides, yet it has to encode proteins, which are built from 20 amino acids.
The genetic code is the mapping between triplets of nucleotides, called codons, and the amino acids.
Because there are 4 × 4 × 4 = 64 possible codons but only 20 amino acids, the genetic code is degenerate, meaning that a single amino acid can be encoded by multiple codons.
The genetic code is also nearly universal: apart from a few exceptions (for example, in mitochondria), the same codon encodes the same amino acid in all organisms.
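
To make the degeneracy concrete, the sketch below builds the standard genetic code in plain Python and counts how many codons map to each amino acid. The names `BASES`, `AMINO_ACIDS` and `CODON_TABLE` are introduced here for illustration only and are not part of the lesson's helper modules; Biopython ships equivalent tables in `Bio.Data.CodonTable`.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional T, C, A, G order
# (first base changes slowest, third base fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to the one-letter code of its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Count how many codons encode each amino acid, setting the stop codons aside.
codons_per_aa = Counter(CODON_TABLE.values())
stop_count = codons_per_aa.pop("*")

print(f"{len(CODON_TABLE)} codons encode {len(codons_per_aa)} amino acids")
print(f"{stop_count} codons are stop codons")
for aa, n in sorted(codons_per_aa.items(), key=lambda item: -item[1]):
    print(aa, n)
```

Leucine (`L`), serine (`S`) and arginine (`R`) each have six codons, while methionine (`M`) and tryptophan (`W`) have only one.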

Imagine that DNA is like a recipe book that contains instructions for making proteins. These instructions are written in small units called "codons," which are like words of three letters each.

Just like in a recipe, there are certain words that tell you when to stop doing something, like "stop stirring" or "turn off the oven." In DNA, there are special codons called "stop codons" that tell the cell when to stop making a protein.

So, when a cell is reading the DNA to make a protein, it keeps going until it reaches a stop codon. When it sees this stop codon, it knows it's reached the end of the instructions, and it stops making the protein. 

Just like how "stop" signals the end of something in a recipe, stop codons signal the end of making a protein in a cell. These stop codons help make sure the proteins are the right length and shape for cells to work properly.
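
The reading process described above can be sketched in a few lines of plain Python: split the DNA into three-letter codons and keep reading until a stop codon appears. The function names and the short example sequence are made up for illustration; they are not part of the lesson's helper modules.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def split_into_codons(dna):
    """Split a DNA string into complete three-letter codons, dropping leftover bases."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def codons_before_stop(dna):
    """Return the codons read before the first stop codon (all of them if there is none)."""
    codons = []
    for codon in split_into_codons(dna):
        if codon in STOP_CODONS:
            break  # the cell stops making the protein here
        codons.append(codon)
    return codons


example = "ATGGCATTTTAAGGC"  # hypothetical sequence: ATG GCA TTT TAA GGC
print(split_into_codons(example))   # ['ATG', 'GCA', 'TTT', 'TAA', 'GGC']
print(codons_before_stop(example))  # ['ATG', 'GCA', 'TTT']
```

The final codon `GGC` is never read, because the stop codon `TAA` comes before it.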

In [20]:
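# load libraries

# course helper modules (expected to provide random_DNA, mutate_n, view_alignment)
from utils import *
from sequence_alignment_viewer import *
from build_tree import *

import panel as pn  # assumption: `pn` used below is Panel; explicit in case the wildcard imports do not expose it
from Bio.Seq import Seq  # Biopython sequence class, used for translation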


In [21]:
# repeat the same steps as last lesson

seq_length = 40

seq1 = random_DNA(length=seq_length)
print("DNA sequence of our ancestor: " + seq1)

# start the collection with the ancestor, then add 10 mutated copies of it
sequences = {'Ancestor': seq1}
for i in range(10):
    sequences[f'mutant_{i}'] = mutate_n(seq1, 10)

p = view_alignment(sequences, language='DNA', plot_width=1000)
pn.pane.Bokeh(p)

DNA sequence of our ancestor: TCTGCGTGAGTCAAGTATCTGAATGAGTAACGACGGACAG



Now, we are going to translate these DNA sequences into protein sequences.

In [22]:
# translate the dictionary of DNA sequences to a dictionary of protein sequences

# create a dictionary for the protein sequences
sequences_protein = {}
# iterate over the dictionary of DNA sequences
for name, seq in sequences.items():
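    # keep only complete codons (40 bases is not a multiple of 3), then translate to protein
    usable_length = len(seq) - len(seq) % 3
    bio_seq = Seq(seq[:usable_length])
    protein_seq = bio_seq.translate()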
    # add the protein sequence to the dictionary
    sequences_protein[name] = str(protein_seq)

# view the protein sequences
p = view_alignment(sequences_protein, language='protein', plot_width=800)
pn.pane.Bokeh(p)


The amino acids are coloured according to their chemical characteristics:
- Red amino acids are 'hydrophobic', meaning they avoid water
- Blue amino acids are 'basic', meaning they can be positively charged
- Green amino acids are 'acidic', meaning they can be negatively charged
- Yellow amino acids are 'polar', meaning they don't have a charge, but can interact with water and other polar molecules
- Pink amino acids have special qualities and may perform particularly important roles in protein structure or function
- Black '*'s are stop codons, which tell the cell's machinery to stop making the protein - any amino acids after a stop codon wouldn't actually be incorporated into the real protein
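
The last point can be checked with Biopython itself. As a minimal sketch, the example below translates the ancestor sequence printed earlier in two ways: once in full, and once with `to_stop=True`, which ends the protein at the first stop codon. The sequence is trimmed to a multiple of three first, since Biopython typically warns about a partial codon when it is not. The variable names are chosen here for illustration.

```python
from Bio.Seq import Seq

# Ancestor sequence from the notebook output earlier in this lesson (40 bases).
ancestor = "TCTGCGTGAGTCAAGTATCTGAATGAGTAACGACGGACAG"

# Keep only complete codons: 40 is not a multiple of 3, so the last base is dropped.
trimmed = ancestor[: len(ancestor) - len(ancestor) % 3]

full = Seq(trimmed).translate()              # stop codons appear as '*'
real = Seq(trimmed).translate(to_stop=True)  # translation ends at the first stop codon

print(full)  # SA*VKYLNE*RRT
print(real)  # SA
```

For this particular random sequence the third codon, `TGA`, is already a stop codon, so the protein a cell would actually make is only two amino acids long.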

Notice that some of the amino acids have changed to another one of the same colour. This reflects the structure of the genetic code: codons that differ by a single nucleotide often encode the same amino acid or a chemically similar one, so a mutation is more likely to swap in a similar amino acid than a dissimilar one.
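
One way to explore this is to list every codon that is a single mutation away from a given codon and see which amino acids result. The sketch below does this in plain Python for the standard genetic code; `single_base_mutants` and `classify` are hypothetical helper names introduced for this example, not functions from the lesson's modules.

```python
from collections import Counter
from itertools import product

# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_base_mutants(codon):
    """Yield every codon that differs from `codon` at exactly one position."""
    for position, original in enumerate(codon):
        for base in BASES:
            if base != original:
                yield codon[:position] + base + codon[position + 1:]


def classify(codon, mutant):
    """Label a mutation by its effect on the encoded amino acid."""
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        return "synonymous"  # same amino acid
    if after == "*":
        return "nonsense"    # creates a stop codon
    return "missense"        # different amino acid


# Example: all nine single-base mutants of CTT (leucine).
for mutant in single_base_mutants("CTT"):
    print(mutant, CODON_TABLE[mutant], classify("CTT", mutant))

# Tally the effect of every single-base mutation of every non-stop codon.
tally = Counter(
    classify(codon, mutant)
    for codon, aa in CODON_TABLE.items()
    if aa != "*"
    for mutant in single_base_mutants(codon)
)
print(tally)
```

For `CTT`, all three changes at the third position still give leucine, and the three changes at the first position give phenylalanine, isoleucine and valine, which are hydrophobic like leucine. Across the whole table, 134 of the 549 possible single-base changes to a non-stop codon leave the amino acid unchanged.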