# Open Reading Frame

## Transcription May Begin Anywhere

In “Transcribing DNA into RNA”, we discussed the transcription of DNA into RNA, and in “Translating RNA into Protein”, we examined the translation of RNA into a chain of amino acids for the construction of proteins. We can view these two processes as a single step in which we directly translate a DNA string into a protein string, thus calling for a DNA codon table.

However, three immediate wrinkles of complexity arise when we try to pass directly from DNA to proteins. First, not all DNA will be transcribed into RNA: so-called junk DNA appears to have no practical purpose for cellular function. Second, we can begin translation at any position along a strand of RNA, meaning that any substring of a DNA string can serve as a template for translation, as long as it begins with a start codon, ends with a stop codon, and has no other stop codons in the middle. See Figure 1. As a result, the same RNA string can actually be translated in three different ways, depending on how we group triplets of symbols into codons. For example, ...AUGCUGAC... can be translated as ...AUGCUG..., ...UGCUGA..., and ...GCUGAC..., which will typically produce wildly different protein strings.
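The three groupings in the example above can be reproduced with a short sketch. The helper name `reading_frames` is hypothetical and not part of the original notebook; it only slices the string into complete triplets at offsets 0, 1, and 2.

```python
def reading_frames(rna):
    """Return the list of complete codons for each of the three frame offsets."""
    frames = []
    for offset in range(3):
        # len(rna) - 2 is the last index at which a full triplet can start,
        # so any trailing one or two bases are dropped.
        codons = [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]
        frames.append(codons)
    return frames


for offset, codons in enumerate(reading_frames("AUGCUGAC")):
    print(offset, codons)
```

Each offset yields a different codon sequence from the same bases, which is why the resulting protein strings usually differ.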

## Problem

Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.

An open reading frame (ORF) is a reading frame that begins with a start codon and ends with a stop codon, with no other stop codons in between. A candidate protein string is therefore obtained by translating an ORF into amino acids until the stop codon is reached.
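
As a rough illustration of the six reading frames, the sketch below builds the reverse complement and pairs each strand with the three possible offsets. The names `reverse_complement` and `six_frames` are assumptions made for this example, and the input is assumed to contain only the letters A, C, G, and T.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return the six reading frames: offsets 0-2 on each of the two strands."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(strand[offset:])
    return frames


for frame in six_frames("AGCCATG"):
    print(frame)
```

The first three frames come from the string itself and the last three from its reverse complement.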

### Given: 
A DNA string s of length at most 1 kbp in FASTA format.

### Return: 

Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.
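
One possible way to meet this specification, shown here as a minimal pure-Python sketch rather than the notebook's own approach: scan both strands for every `ATG`, translate from there, and keep the result only if a stop codon is actually reached. The names `parse_fasta`, `translate_orf`, and `candidate_proteins` are hypothetical, the codon table is the standard genetic code written in `TCAG` order, and the input is assumed to be a single FASTA record containing only A, C, G, and T.

```python
from itertools import product

# Standard genetic code; product() enumerates codons in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...), and "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def parse_fasta(text):
    """Join the sequence lines of a single-record FASTA string."""
    lines = (line.strip() for line in text.splitlines())
    return "".join(line for line in lines if line and not line.startswith(">"))


def translate_orf(dna, start):
    """Translate from `start`; return None if no stop codon is reached."""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[i:i + 3]]
        if amino == "*":
            return "".join(protein)
        protein.append(amino)
    return None


def candidate_proteins(dna):
    """Collect the distinct proteins encoded by ORFs on either strand."""
    complement = str.maketrans("ACGT", "TGCA")
    strands = (dna, dna.translate(complement)[::-1])
    found = set()
    for strand in strands:
        start = strand.find("ATG")
        while start != -1:
            protein = translate_orf(strand, start)
            if protein is not None:
                found.add(protein)
            start = strand.find("ATG", start + 1)
    return found


sample = """>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG"""
for protein in sorted(candidate_proteins(parse_fasta(sample))):
    print(protein)
```

Collecting results in a set handles the "distinct" requirement, and returning `None` when no stop codon is found excludes start codons that never close an ORF.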

In [1]:
# Sample Dataset
sd = """>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG"""
print(sd.split('\n')[1])

AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG


### Sample Output
```
MLLGSFRLIPKETLIQVAGSSPCNLS
M
MGMTPRLGLESLLE
MTPRLGLESLLE
```

In [16]:
# Libraries
import re
import Bio.Seq as bio

In [3]:
# load fasta

In [4]:
# Start Codon
start = "ATG"

In [5]:
# Stop Codons
stop = ["TAG", "TGA", "TAA"]

In [6]:
# Codon dictionary
codons = {'TTT': 'F', 'CTT': 'L', 'ATT': 'I', 'GTT': 'V', 'TTC': 'F', 'CTC': 'L', 'ATC': 'I', 'GTC': 'V', 'TTA': 'L', 'CTA': 'L', 
           'ATA': 'I', 'GTA': 'V', 'TTG': 'L', 'CTG': 'L', 'ATG': 'M', 'GTG': 'V', 'TCT': 'S', 'CCT': 'P', 'ACT': 'T', 'GCT': 'A', 
           'TCC': 'S', 'CCC': 'P', 'ACC': 'T', 'GCC': 'A', 'TCA': 'S', 'CCA': 'P', 'ACA': 'T', 'GCA': 'A', 'TCG': 'S', 'CCG': 'P', 
           'ACG': 'T', 'GCG': 'A', 'TAT': 'Y', 'CAT': 'H', 'AAT': 'N', 'GAT': 'D', 'TAC': 'Y', 'CAC': 'H', 'AAC': 'N', 'GAC': 'D', 
           'TAA': 'Stop', 'CAA': 'Q', 'AAA': 'K', 'GAA': 'E', 'TAG': 'Stop', 'CAG': 'Q', 'AAG': 'K', 'GAG': 'E', 'TGT': 'C', 
           'CGT': 'R', 'AGT': 'S', 'GGT': 'G', 'TGC': 'C', 'CGC': 'R', 'AGC': 'S', 'GGC': 'G', 'TGA': 'Stop', 'CGA': 'R', 'AGA': 'R', 
           'GGA': 'G', 'TGG': 'W', 'CGG': 'R', 'AGG': 'R', 'GGG': 'G' 
} 

In [7]:
# Find index of start codons
def checkMotif(fasta):
    """Return the 1-based position of every start codon in the sequence."""
    print(fasta)
    # "ATG" cannot overlap itself, so non-overlapping regex matches find every occurrence
    inxLst = [(i.start() + 1) for i in re.finditer(start, fasta)]
    return inxLst

In [8]:
# Convert sequence to amino acids
def translate(seq):
    """Print the amino acids encoded by seq, stopping at the first stop codon."""
    decoded = ""
    # len(seq) - 2 is the last index at which a complete codon can start
    for i in range(0, len(seq) - 2, 3):
        # str() lets both plain strings and Bio.Seq objects index the codon table
        amino = codons[str(seq[i:i+3])]
        if amino == "Stop":
            break
        decoded += amino
    # the result is printed even when no stop codon was reached
    print(decoded)

In [9]:
print(checkMotif(sd.split('\n')[1]))

AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
[5, 25, 31, 76]


In [46]:
def trans2(seq):
    # trim trailing bases that do not form a complete codon
    oh = len(seq)%3
    if oh != 0:
        seq = seq[:-oh]
    try:
        print(seq.translate(to_stop=True))
    except Exception:
        print("")

In [47]:
def main():
    # get sequence
    raw = sd.split('\n')[1]
    seq = bio.Seq(raw)
    # forward strand: translate from every start codon with both implementations
    fIndx = checkMotif(raw)
    for i in fIndx:
        translate(seq[(i-1):])
        trans2(seq[(i-1):])


In [48]:
main()

AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
M
M
MGMTPRLGLESLLE
MGMTPRLGLESLLE
MTPRLGLESLLE
MTPRLGLESLLE
MIRVASQ
MIRVASQ


In [12]:
# Reversed string only; the bases are not complemented here
print(sd.split('\n')[1][::-1])

GACTCTACGATGAGCCTAGTAAGTCCGAATAAGGTTTTCTCTGAGATTAGGTTCAGCGCCCCAGTAGGGGTACATTGGACTCAATCGATGTACCGA


In [50]:
import Bio.Seq as bio
sq = bio.Seq(sd.split('\n')[1])
print(sq)

print(sq.translate())

rcseq = sq.reverse_complement()
trans2(rcseq)

AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ
LRCYSDHSGLFQKRL
