# Open Reading Frames

## Problem
    Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.

    An open reading frame (ORF) is a reading frame that begins with a start codon and ends with a stop codon, with no other stop codons in between. A candidate protein string is thus obtained by translating an open reading frame into amino acids until the stop codon is reached.
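
The six reading frames described above can be enumerated directly. The following is a minimal sketch, assuming an uppercase DNA string containing only A, C, G and T; the helper names `reverse_complement` and `six_frames` are illustrative and not part of the solution further down.

```python
def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    # Swap each base for its complement, then reverse the result.
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Split both strands into codons at offsets 0, 1 and 2."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            # Step three bases at a time; an incomplete trailing codon is dropped.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames


# Hypothetical toy input: frames 0-2 come from the string itself,
# frames 3-5 from its reverse complement.
for codons in six_frames("ATGCCATAG"):
    print(" ".join(codons))
```

Note that an ORF may start at any `ATG` inside a frame, so a full search still has to consider every start codon and not only the first codon of each frame.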
    
### Given
    A DNA string s of length at most 1 kbp in FASTA format.
### Return
    Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.
    
### Sample Dataset
    >Rosalind_99
    AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGG
    ATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG

### Sample Output
    MLLGSFRLIPKETLIQVAGSSPCNLS
    M
    MGMTPRLGLESLLE
    MTPRLGLESLLE


In [1]:
DNA_CODON_TABLE = {
    'TTT': 'F',     'CTT': 'L',     'ATT': 'I',     'GTT': 'V',
    'TTC': 'F',     'CTC': 'L',     'ATC': 'I',     'GTC': 'V',
    'TTA': 'L',     'CTA': 'L',     'ATA': 'I',     'GTA': 'V',
    'TTG': 'L',     'CTG': 'L',     'ATG': 'M',     'GTG': 'V',
    'TCT': 'S',     'CCT': 'P',     'ACT': 'T',     'GCT': 'A',
    'TCC': 'S',     'CCC': 'P',     'ACC': 'T',     'GCC': 'A',
    'TCA': 'S',     'CCA': 'P',     'ACA': 'T',     'GCA': 'A',
    'TCG': 'S',     'CCG': 'P',     'ACG': 'T',     'GCG': 'A',
    'TAT': 'Y',     'CAT': 'H',     'AAT': 'N',     'GAT': 'D',
    'TAC': 'Y',     'CAC': 'H',     'AAC': 'N',     'GAC': 'D',
    'TAA': 'Stop',  'CAA': 'Q',     'AAA': 'K',     'GAA': 'E',
    'TAG': 'Stop',  'CAG': 'Q',     'AAG': 'K',     'GAG': 'E',
    'TGT': 'C',     'CGT': 'R',     'AGT': 'S',     'GGT': 'G',
    'TGC': 'C',     'CGC': 'R',     'AGC': 'S',     'GGC': 'G',
    'TGA': 'Stop',  'CGA': 'R',     'AGA': 'R',     'GGA': 'G',
    'TGG': 'W',     'CGG': 'R',     'AGG': 'R',     'GGG': 'G'
}
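
A table like `DNA_CODON_TABLE` is enough to translate an ORF without any external library. The sketch below is one possible way to do that, not the method used in the next cell. So that it runs on its own, it rebuilds the same mapping compactly under the assumed name `CODON_TABLE`; the function name `translate_from` is also hypothetical.

```python
from itertools import product

# Assumption: rebuild the standard codon table in the same shape as
# DNA_CODON_TABLE. Bases are ordered T, C, A, G, and '*' in the string marks
# a stop codon, stored here as 'Stop' to match the table above.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): ("Stop" if aa == "*" else aa)
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def translate_from(dna, start):
    """Translate dna codon by codon from index start.

    Returns the protein string if an in-frame stop codon is reached,
    or None if the string ends first (so there is no ORF at this start).
    """
    protein = []
    for i in range(start, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[i:i + 3]]  # KeyError on non-ACGT characters
        if residue == "Stop":
            return "".join(protein)
        protein.append(residue)
    return None


print(translate_from("ATGGCCTGA", 0))
print(translate_from("ATGGCC", 0))
```

Returning `None` when no stop codon follows keeps unterminated frames out of the candidate list, which matches the ORF definition in the problem statement.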

In [1]:
import re
from Bio import SeqIO
from Bio.Seq import Seq

# The outer lookahead lets overlapping ORFs all be found. The group captures
# from ATG, codon by codon, up to (not including) the first in-frame stop codon.
ORF_PATTERN = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

def ORF(file):
    record = SeqIO.read(file, 'fasta')
    forward = record.seq
    reverse = forward.reverse_complement()
    sequences = []

    # Scan the forward strand first, then its reverse complement.
    for strand in (forward, reverse):
        for dna in ORF_PATTERN.findall(str(strand)):
            protein = str(Seq(dna).translate())
            if protein not in sequences:    # keep only distinct proteins
                sequences.append(protein)

    for protein in sequences:
        print(protein)
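
The lookahead pattern used in `ORF` is what allows nested ORFs, which share a stop codon, to be reported separately. The short sketch below isolates that behaviour with the standard library only; the wrapper name `find_orfs` and the toy input are illustrative assumptions.

```python
import re

# Same pattern as in ORF(): the whole expression is a zero-width lookahead, so
# the regex engine retries at every position and overlapping hits are kept.
ORF_PATTERN = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')


def find_orfs(dna):
    """Return the DNA of every ORF in one strand, without the stop codon."""
    # With a single capture group, findall returns the captured strings.
    return ORF_PATTERN.findall(dna)


# Two start codons share the stop codon TAA, so two ORFs are reported.
print(find_orfs("ATGATGTAA"))   # ['ATGATG', 'ATG']

# A start codon with no in-frame stop codon yields nothing.
print(find_orfs("ATGAAACC"))    # []
```

Because the inner quantifier is lazy, each match ends at the first in-frame stop codon after its `ATG`, so no captured ORF contains an internal stop.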

In [5]:
ORF('/home/kip/Downloads/rosalind_orf.txt')

MDDRFTTCYLGRRVSLHITKSIVLPSDPRVKDKNLPYLSHCP
MMAERCLPAIPGFLI
MAERCLPAIPGFLI
MFAGYTRFSYLTHSALLSKPQHSNGTLTRMINI
MEL
MINI
MGHACCRLTERVSRWH
MRAVD
MALRHD
MYIWHRP
M
MVCST
MPHSLERVVRRVCEVSSLIIVGVPLHRLGNPTSDAAAVMHKLTRRTEVLAHAATFSIE
MQRRSCINLPGGQRYSPMQPRSQ
MHKLTRRTEVLAHAATFSIE
MQPRSQ
MRITRG
MSC
MGEYLCPPGKFMHDRRCITSRVT
MHDRRCITSRVT
MTAAASLVGLPNRCRGTPTMIKDETSHTRRTTRSRLWGMPLGRTHHCTGPGEGESNSTNDTPLHCGSERSDLNVYLAPSLNTLASVFRDGAKYT
MIKDETSHTRRTTRSRLWGMPLGRTHHCTGPGEGESNSTNDTPLHCGSERSDLNVYLAPSLNTLASVFRDGAKYT
MGHAPR
MPLGRTHHCTGPGEGESNSTNDTPLHCGSERSDLNVYLAPSLNTLASVFRDGAKYT
MTRLYIAGVSGLT
MTSWITTSRASMPS
MPS
MSHLRRKHLSARCCKKGYLNIYHPS
ML
MIRAQLVSFRGSVTDKASFYP
MDFVMCKETLRPK
MCKETLRPK
