<a href="https://colab.research.google.com/github/mohamedyosef101/Python-for-AI/blob/main/04_DNA_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Translation Theory: DNA → RNA → Protein


Life depends on the ability of cells to store, retrieve, and translate genetic instructions. These instructions are needed to make and maintain living organisms. For a long time, it was not clear which molecules were able to copy and transmit genetic information. We now know that this information is carried by deoxyribonucleic acid (DNA) in all living things.

**DNA:** DNA is a discrete code physically present in almost every cell of an organism. We can think of DNA as a one-dimensional string of characters with four characters to choose from: A, C, G, and T. These are the first letters of the four nucleotides used to construct DNA: Adenine, Cytosine, Guanine, and Thymine. Each three-character sequence of nucleotides, sometimes called a nucleotide triplet, corresponds to one amino acid (or to a stop signal). The sequence of amino acids is unique for each type of protein, and all proteins are built from the same set of just 20 amino acids for all living things.

Instructions in the DNA are first transcribed into RNA and the RNA is then translated into proteins. We can think of DNA, when read as sequences of three letters, as a dictionary of life.
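
As a rough illustration of that flow, the sketch below treats transcription as simply swapping T for U on the coding strand, which is a common simplification and an assumption here. The helper names `transcribe` and `codons` are hypothetical and are not used elsewhere in this notebook.

```python
def transcribe(dna):
    """Return the RNA version of a coding-strand DNA string (T becomes U)."""
    return dna.replace("T", "U")


def codons(seq):
    """Split a sequence into consecutive, non-overlapping three-letter chunks.

    Any incomplete chunk at the end is dropped.
    """
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


rna = transcribe("ATGTCTACT")
print(rna)          # AUGUCUACU
print(codons(rna))  # ['AUG', 'UCU', 'ACU']
```

The rest of the notebook skips the RNA step and looks up DNA triplets directly, which gives the same protein because the table is written in terms of DNA letters.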

**Aim:** Convert a given sequence of DNA into its Protein equivalent.

**File:** Download a DNA strand as a text file from [**NCBI**](https://www.ncbi.nlm.nih.gov/), a public web-based repository of DNA sequences. The nucleotide sample is NM_207618.2, which can be found [**here**](https://www.ncbi.nlm.nih.gov/nuccore/NM_207618.2).

**Sources:** I took the project idea from [HarvardX Python for Research](https://courses.edx.org/courses/course-v1:HarvardX+PH526x+3T2016/info) course on edX. Also, I used some content from an [article by GeeksforGeeks](https://www.geeksforgeeks.org/dna-protein-python-3/).

## Project overview
### **The goal:**
Translating the DNA sequence into amino acids.

```
ATA -> I
ATG -> M
CAA -> Q
TCT -> S
TGG -> W
```

### What we need to do

1. Download a DNA sequence from NCBI.
2. Translate the DNA sequence into amino acids.
3. Download the amino acid sequence to check our solution.

### Tasks

1. Manually download the DNA and protein sequence data.
2. Import the DNA data into Python.
3. Create an algorithm to translate the DNA.
4. Check if translation matches your download.

## Download and read the data

- Download two files: the first is the strand of DNA, and the second is the corresponding protein sequence.
    - On the NCBI site, choose the “Nucleotide” database instead of “All Databases”.
    - In the search field, type “NM_207618.2” and hit search.
    - At the top of the page, find and click the word “FASTA”.
    - Use your mouse to select and copy the strand of DNA, then save it in a .txt file.
    - Go back to the data page and click on “CDS” to get the corresponding protein sequence.
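
If the FASTA text is saved together with its header (the line starting with “>”), that line has to be skipped before translating. The following is a minimal sketch of one way to do this; `read_fasta` and the file name `example.fasta` are hypothetical, and the tiny record is made up for the demonstration. It assumes a single record per file.

```python
import os
import tempfile


def read_fasta(path):
    """Return the sequence of a single-record FASTA file as one string."""
    with open(path, "r") as handle:
        # Skip the ">" header line and join the remaining lines,
        # stripping the line endings along the way.
        return "".join(
            line.strip() for line in handle if not line.startswith(">")
        )


# Write a tiny made-up record to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta")
    with open(path, "w") as handle:
        handle.write(">example header (hypothetical)\nGGTCAG\nAAAAAG\n")
    fasta_seq = read_fasta(path)

print(fasta_seq)  # GGTCAGAAAAAG
```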

In [1]:
%%writefile dna.txt
GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA
GATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT
CCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT
TAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT
CAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG
AGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA
ACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA
GGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT
TTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA
GTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA
CCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT
TATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT
GCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG
TCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTT
GCTAATACCATTAAATACTTTATTCCATAAATATGTTTTTAAAAGCTTGTATGAACAAGGTATGGTGCTC
ACTGCTATACTTATAAAAGAGTAAGGTTATAATCACTTGTTGATATGAAAAGATTTCTGGTTGGAATCTG
ATTGAAACAGTGAGTTATTCACCACCCTCCATTCTCT

Writing dna.txt


In [None]:
%%writefile protein.txt
MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPIST
GSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARST
NLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTG
PQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRM
QYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCND
ILVSGFPTISPLLLTFRDPKGPCSVFFNC

- Read the files

In [4]:
inputfile = "dna.txt"
with open(inputfile, "r") as f:  # "r" for reading; the file is closed automatically
  seq = f.read()
print(seq)

GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA
GATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT
CCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT
TAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT
CAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG
AGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA
ACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA
GGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT
TTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA
GTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA
CCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT
TATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT
GCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG
TCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTT
GCTAAT

You might run into an issue with the copied file: the text contains newline characters (“\n”). To get rid of them, use the replace method.

In [None]:
# replace the newlines \n with nothing
seq = seq.replace("\n", "")

Sometimes you’ll also find extra characters in the copied text called carriage returns (“\r”). We’ll remove those too, just in case.

In [None]:
# Replace carriage return with nothing
seq = seq.replace("\r", "")
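
A quick way to confirm the cleanup worked is to check that only the four nucleotide letters remain. This is a small sketch, not part of the original project; `unexpected_characters` is a hypothetical helper and the `raw` string is a made-up example.

```python
def unexpected_characters(seq, alphabet="ACGT"):
    """Return the set of characters in seq that are not in the alphabet."""
    return set(seq) - set(alphabet)


raw = "GGTCAG\r\nAAAAAG\n"
cleaned = raw.replace("\n", "").replace("\r", "")

# Before cleaning, the newline and carriage return show up as leftovers.
print(unexpected_characters(raw))
# After cleaning, nothing unexpected is left.
print(unexpected_characters(cleaned))  # set()
```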

### Put it all together

In [None]:
inputfiles = ["dna.txt", "protein.txt"]
seq = []

for file in inputfiles:
  # the with block closes each file automatically
  with open(file, "r") as f:
    text = f.read()
  # remove newlines and carriage returns
  seq.append(text.replace("\n", "").replace("\r", ""))

dna = seq[0]
protein = seq[1]

### Translation table

- You don’t need to work out the translation rules yourself; the table below maps every three-letter codon to its amino acid, with “_” marking a stop codon.

In [None]:
table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
}
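
One way to double-check a hand-typed table like this is to regenerate it from the compact, 64-letter form of the standard genetic code and inspect a few properties. This is only a sketch: `BASES`, `AMINO_ACIDS`, and `generated` are illustrative names, and “_” is used for stop codons to match the table above. In the notebook, `generated == table` could then be used as a comparison.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons enumerated in
# TCAG order (third base changing fastest). "_" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY__CC_W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

generated = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in generated.items() if aa == "_")

print(len(generated))                      # 64
print(generated["ATG"], generated["TGG"])  # M W
print(stop_codons)                         # ['TAA', 'TAG', 'TGA']
```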

## Build the algorithm

1. Check that the length of the sequence is divisible by 3.
2. Look up each 3-letter string in the table and store the result.
3. Stop when you reach the end of the sequence.
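
The three steps above can be sketched as a single function. This is one possible shape, under the assumption that an indivisible length should raise an error rather than be silently truncated; `translate` and `demo_table` are hypothetical names, and the three-entry table is only there to keep the example self-contained.

```python
def translate(seq, table):
    """Translate a DNA string into amino acids using a codon lookup table."""
    # Step 1: the sequence must split evenly into three-letter codons.
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not divisible by 3")
    # Steps 2 and 3: look up each codon in turn until the sequence ends.
    return "".join(table[seq[i:i + 3]] for i in range(0, len(seq), 3))


demo_table = {"ATG": "M", "TCT": "S", "ACT": "T"}
print(translate("ATGTCTACT", demo_table))  # MST
```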

In [None]:
# === The small demo ====
s = "abcdef"
t = {'abc': '1', "def": '2'}
res = []
i = 0

while i < len(s):
  print(f"i={i}")
  test = s[i: i+3]
  print(f"testing {test}...")
  result = t[test]
  print(f"this is {result}")
  res.append(result)
  print(f"the full list: {res}")
  i = i + 3

In [None]:
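translated_seq = ''
i = 0
# The coding region starts at the ATG at index 20; stopping at 935
# leaves out the final stop codon (TGA), which is not in protein.txt.
coding_dna = dna[20:935]

while i < len(coding_dna):
  clip = coding_dna[i: i+3]
  if len(clip) < 3:
    break  # ignore an incomplete codon at the end
  result = table[clip]
  translated_seq += result
  i += 3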
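
Indices like 20 and 935 come from the CDS annotation on the NCBI page, but they can also be estimated from the sequence itself. The sketch below looks for the first ATG and then reads codons until the first in-frame stop codon. It is an assumption that the first ATG is the real start codon; that is not guaranteed for every sequence, although it should hold for the one used here. `find_coding_region` and `example` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_coding_region(seq):
    """Return (start, end) such that seq[start:end] runs from the first ATG
    up to, but not including, the first in-frame stop codon."""
    start = seq.find("ATG")
    if start == -1:
        raise ValueError("no ATG start codon found")
    # Step through the sequence codon by codon, staying in frame with ATG.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return start, i
    raise ValueError("no in-frame stop codon found")


example = "CCATGTCTACTTGAGG"
start, end = find_coding_region(example)
print(start, end)          # 2 11
print(example[start:end])  # ATGTCTACT
```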

In [None]:
# check the results
print(translated_seq == protein)
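
If the comparison prints False, a plain True/False does not say where the two strings diverge. A small helper along these lines can point to the first differing position; `first_mismatch` is a hypothetical name, and the short strings are made-up examples.

```python
from typing import Optional


def first_mismatch(a: str, b: str) -> Optional[int]:
    """Return the index of the first difference between a and b, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # All shared positions match; a length difference is still a mismatch.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


print(first_mismatch("MSTHDT", "MSTHDT"))   # None
print(first_mismatch("MSTHDT", "MSTQDT"))   # 3
print(first_mismatch("MSTHDT", "MSTHDT_"))  # 6
```

A mismatch at the very end, as in the last line, would typically mean a stop codon was included in the translated slice.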