<a name="top"></a>
# Introduction to Python Programming for Bioinformatics. Lesson 10

<details>
<summary>
About this notebook
</summary>

This notebook was originally written by [Marc Cohen](https://github.com/mco-gh), an engineer at Google. The original source can be found on [Marc's short link service](https://mco.fyi/), starting with [Python lesson 0](https://mco.fyi/py0). I encourage you to work through that notebook if you find some details missing here.

Rob Edwards edited the notebook, adapted it for bioinformatics, using some simple geneticy examples, condensed it into a single notebook, and rearranged some of the lessons, so if some of it does not make sense, it is Rob's fault!

It is intended as a hands-on companion to an in-person course, and if you would like Rob to teach this course (or one of the other courses) don't hesitate to get in touch with him.

</details>
<details>
<summary>
Using this notebook
</summary>

You can download the original version of this notebook from [GitHub](https://linsalrob.github.io/ComputationalGenomicsManual/Python/Python_Lesson_9.ipynb) and from [Rob's Google Drive]()

**You should make your own copy of this notebook by selecting File->Save a copy in Drive from the menu bar above, and then you can edit the code and run it as your own**

There are several lessons, and you can do them in any order. I've tried to organise them in the order I think most appropriate, but you may disagree!

</details>

<a name="lessons"></a>
# Lesson Links

* [Lesson 10 - Translating a DNA sequence](#Lesson-10---Translating-a-DNA-sequence)

Previous Lesson: [GitHub](Python_Lesson_9.ipynb) | [Google Colab](https://colab.research.google.com/drive/1JGRJpUPKkkVukyNvtfEJYVVCcdpkyRLZ)

Next Lesson: [GitHub](Python_Lesson_11.ipynb) | [Google Colab](https://colab.research.google.com/drive/1N2WL7WDjUQkb7BLWqKYALWCwsUEAVVdf)

<!-- #region id="qXu_bY7yPpsS" -->

# Lesson 10 - Translating a DNA sequence

Now that we have covered several concepts, we are going to put them together to translate a DNA sequence into a protein sequence. Although the Central Dogma of Biology includes an RNA step, we usually skip it and translate DNA directly into protein, acknowledging that the mRNA would really contain uracil (U) in place of thymine (T).

Here is a function that translates a codon into an amino acid. You have seen most of this previously:


```python
def translate_dna(codon):
  """
  Translate a codon sequence into an amino acid sequence.
  :param codon: a codon sequence
  :type codon: str
  :return: an amino acid sequence
  :rtype: str
  """

  codon_to_amino_acid = {
      'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L',
      'TCT': 'S', 'TCC': 'S', 'TCA': 'S', 'TCG': 'S',
      'TAT': 'Y', 'TAC': 'Y', 'TAA': '*', 'TAG': '*',
      'TGT': 'C', 'TGC': 'C', 'TGA': '*', 'TGG': 'W',
      'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L',
      'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
      'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
      'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
      'ATT': 'I', 'ATC': 'I', 'ATA': 'I', 'ATG': 'M',
      'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',
      'AAT': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',
      'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
      'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',
      'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
      'GAT': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
      'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G'
  }

  if codon.upper() not in codon_to_amino_acid:
    print(f"Invalid codon: {codon}")
    return None
  return codon_to_amino_acid[codon.upper()]
```

Next, we need to take a sequence and, reading it three letters at a time, translate each codon:

```python
sequence = "ATGATCGACAAGCTACGCTACGATCAGACTGCATCAGATTAA"
print(len(sequence))
for codon in [sequence[i:i+3] for i in range(0, len(sequence), 3)]:
  print(translate_dna(codon), end='')
print()
```

```
42
MIDKLRYDQTASD*
```
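
The codon-by-codon loop can also be wrapped in a reusable function. The sketch below is one possible way to do that and is not part of the original lesson: the names `CODON_TABLE` and `translate_sequence` are hypothetical, and the table is rebuilt compactly here only so that the block runs on its own (it encodes the same standard genetic code as the dictionary in `translate_dna`). It assumes that trailing bases that do not form a complete codon should be ignored and that unrecognised codons should be marked with `X`.

```python
from itertools import product

# Standard genetic code, with bases in T, C, A, G order at each codon position.
# This gives the same mapping as the dictionary in translate_dna.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sequence(sequence, unknown="X"):
    """Translate a DNA sequence in the first reading frame (hypothetical helper)."""
    sequence = sequence.upper()
    # Stop before any incomplete codon at the end of the sequence.
    last_full_codon = len(sequence) - len(sequence) % 3
    protein = []
    for i in range(0, last_full_codon, 3):
        # Codons that are not in the table (e.g. containing N) become `unknown`.
        protein.append(CODON_TABLE.get(sequence[i:i + 3], unknown))
    return "".join(protein)


print(translate_sequence("ATGATCGACAAGCTACGCTACGATCAGACTGCATCAGATTAA"))
```

Returning a string rather than printing each amino acid makes the result easier to reuse, for example when comparing two translated sequences.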


## Challenge 1

Here are two DNA sequences:

```
TCGCGCACGCTGATCGTGGGGTGA
AGTAAAACTTTAATTGTTGGTTAA
```

1. What is the percent identity of these two sequences at the DNA level?
2. After you have translated them using the genetic code above, what is the percent identity at the amino acid level?

Does this result surprise you?
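
If you are not sure where to start, one common way to define percent identity for two sequences of the same length is the number of matching positions divided by the total number of positions, multiplied by 100. The sketch below assumes that definition (no gaps and no alignment step) and uses a hypothetical helper name, `percent_identity`. The toy sequences are made up so that the answer to the challenge is not given away.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Compare the sequences position by position, ignoring case.
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100 * matches / len(seq_a)


# Toy example: three of the four positions match.
print(percent_identity("ACGT", "ACGA"))  # 75.0
```

The same function works on protein sequences, since it only compares characters.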

## Challenge 2

Can you write the code to translate all the ORFs in the Bc01.fasta file that we worked on earlier?
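
As a starting point, here is a minimal sketch of reading FASTA-formatted records. It assumes the usual FASTA layout (a header line starting with `>` followed by one or more sequence lines). `read_fasta` is a hypothetical helper, and the records are invented and held in memory rather than read from the Bc01.fasta file. Translating each record is left to you.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from FASTA-formatted lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


# Invented example records; an open file handle could be passed instead.
example = io.StringIO(">orf1\nATGAAA\nTAA\n>orf2\nATGTGA\n")
for name, dna in read_fasta(example):
    print(name, dna)
```

Because the function accepts any iterable of lines, it should work the same way with a handle from `open()` as with the in-memory example.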