**Open Reading Frames**

Problem
Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.
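
As a rough illustration of the six frames, the sketch below splits a toy DNA string into codons at each of the three offsets on both strands. The names `reverse_complement` and `reading_frames` and the input `ATGAAC` are made up for this example and are not part of the solution that follows.

```python
def reverse_complement(dna):
    # Complement each base, then reverse the string.
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def reading_frames(dna):
    # Six lists of complete codons: offsets 0-2 on the given strand,
    # then offsets 0-2 on its reverse complement.
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            # Stopping at len(strand) - 2 drops a trailing partial codon.
            frames.append([strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)])
    return frames

frames = reading_frames("ATGAAC")  # hypothetical toy input
for codons in frames:
    print(codons)
```

The first three lists come from the string itself, the last three from its reverse complement.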

An open reading frame (ORF) is a reading frame that begins with a start codon and ends at a stop codon, with no other stop codon in between. A candidate protein string is therefore obtained by translating an ORF into amino acids until the stop codon is reached.
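
The definition can also be checked directly at the nucleotide level. The sketch below is one possible approach, not the method used in the cells that follow. It assumes an RNA alphabet, `AUG` as the only start codon, and a single strand; `ORF_PATTERN` and `nucleotide_orfs` are hypothetical names.

```python
import re

# The lookahead lets overlapping and nested ORFs all be reported. The lazy
# codon group ends the match at the first in-frame stop codon after each AUG.
ORF_PATTERN = re.compile(r"(?=(AUG(?:[ACGU]{3})*?(?:UAA|UAG|UGA)))")

def nucleotide_orfs(rna):
    # The pattern is tried at every position, so all three frames are covered.
    return [m.group(1) for m in ORF_PATTERN.finditer(rna)]

print(nucleotide_orfs("AUGGCCAUGAAAUAA"))
# ['AUGGCCAUGAAAUAA', 'AUGAAAUAA']
```

Both hits end at the same stop codon; the second starts at an `AUG` inside the first, which is why a single frame can yield several candidate proteins.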

In [1]:
import re

In [35]:
filepath = "/mnt/c/Data/ROSALIND_download/rosalind_orf.txt"
with open(filepath) as f:
    # Skip the FASTA header line(s) and join the remaining lines into one string.
    sequence = "".join(line.strip() for line in f if not line.startswith('>'))

# Transcribe DNA to RNA so the sequence matches the RNA codon table below.
sequence = sequence.replace('T', 'U')
sequence


'UUCCCUUUGUUCCAACAGUCGUCGAUUGAGCAUCUGGGCUAGUACGGCUAUACUAUUUUAUGUUUGGACGUAUAAUCUUCUAAUUACUGCAUUGUUUGCGACUAUCCCCCUCAUAAUGCCGUUGUUACUGUCGCGUCGAAGGGGGGCCUAUAUGGACAAUCGUGCCUAAUGUAUUCGGUACGGUCGAGUGCCCCUUGCGUGUACUGCAGGUAACGAUCACGUGUGUGUACUAAAAUCUUUGAGGAAUGUCGCACCAACGGGGCCGAAUGACCAGAGCUCGGCCCUCCUAUGACCAAAAUGGGUUUGAAGAGAACAAAUAAAAUUUUCAAAGAAAAGGUACUAUUACUACGUGAGGGGCCAACUAAAAAGAGAUACGCCAUCGUUUCCUCGAAUUGGUAUGGUGGGGAAAACGCUUUAUUCCAAAUGCAUGCGAGCUUCGUAGUGUUCAGCGAAAUAGCUAUUUCGCUGAACACUACGAAGCUCGCAUAAUCGACCUUAACCUCGUCCGUCCCCAGUUGGGGGCCAUACUUCUAAAUGUUUCUGUGUCCUCGUAGUAAUCUAAAGAGGCGCCGGAAUCCAUGAGCCCCGACCCCCACGCUCGUGAACAAGCAGAAACGAGUUAUUUGUCCCUGGUCCGAUGAGCUGCCCGGAUAGAUAUGUUGGCAUCGCAGAGCUCGGCCGAGCUCCACAAUCUUUCUGGGGUCGCUUUCAUAAGGAUCAUUUCUGGUUGAGUGGUAACGCAUGCUCUCGCCAGAUUGUUACUACCCCGAUCGAAAGGACCCCAUACAUAGAUAAUAGACGGAUACGUAGCUUCACGAAUGGCCACAUCGAAUAGUUAUUCGUUCUGAUGAGAGACCAUCUUUCGGUGACACAUAGGACCUGUCUCCCAAUUGGUCCCCUCGGGGAGUUUAAUA'

In [36]:
code = "UUU F UUC F UUA L UUG L CUU L CUC L CUA L CUG L AUU I AUC I AUA I AUG M GUU V GUC V GUA V GUG V UCU S UCC S UCA S UCG S CCU P CCC P CCA P CCG P ACU T ACC T ACA T ACG T GCU A GCC A GCA A GCG A UAU Y UAC Y CAU H CAC H CAA Q CAG Q AAU N AAC N AAA K AAG K GAU D GAC D GAA E GAG E UGU C UGC C UGG W CGU R CGC R CGA R CGG R AGU S AGC S AGA R AGG R GGU G GGC G GGA G GGG G"
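code_list = code.split()
# Even positions hold codons, odd positions hold the matching amino acids.
code_dict = dict(zip(code_list[::2], code_list[1::2]))
# Stop codons map to a newline, so splitting a translation on whitespace
# separates the stretches that lie between stop codons.
for stop_codon in ('UAA', 'UAG', 'UGA'):
    code_dict[stop_codon] = '\n'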
code_dict

{'UUU': 'F',
 'UUC': 'F',
 'UUA': 'L',
 'UUG': 'L',
 'CUU': 'L',
 'CUC': 'L',
 'CUA': 'L',
 'CUG': 'L',
 'AUU': 'I',
 'AUC': 'I',
 'AUA': 'I',
 'AUG': 'M',
 'GUU': 'V',
 'GUC': 'V',
 'GUA': 'V',
 'GUG': 'V',
 'UCU': 'S',
 'UCC': 'S',
 'UCA': 'S',
 'UCG': 'S',
 'CCU': 'P',
 'CCC': 'P',
 'CCA': 'P',
 'CCG': 'P',
 'ACU': 'T',
 'ACC': 'T',
 'ACA': 'T',
 'ACG': 'T',
 'GCU': 'A',
 'GCC': 'A',
 'GCA': 'A',
 'GCG': 'A',
 'UAU': 'Y',
 'UAC': 'Y',
 'CAU': 'H',
 'CAC': 'H',
 'CAA': 'Q',
 'CAG': 'Q',
 'AAU': 'N',
 'AAC': 'N',
 'AAA': 'K',
 'AAG': 'K',
 'GAU': 'D',
 'GAC': 'D',
 'GAA': 'E',
 'GAG': 'E',
 'UGU': 'C',
 'UGC': 'C',
 'UGG': 'W',
 'CGU': 'R',
 'CGC': 'R',
 'CGA': 'R',
 'CGG': 'R',
 'AGU': 'S',
 'AGC': 'S',
 'AGA': 'R',
 'AGG': 'R',
 'GGU': 'G',
 'GGC': 'G',
 'GGA': 'G',
 'GGG': 'G',
 'UAA': '\n',
 'UAG': '\n',
 'UGA': '\n'}
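
A quick way to cross-check a hand-typed codon table is to build one independently. The sketch below generates the standard genetic code from a 64-character amino acid string in `UCAG` base order; the names `BASES`, `AMINO_ACIDS`, and `codon_table` are assumptions for this example, and stop codons are marked with `*` here rather than with the newline used above.

```python
from itertools import product

BASES = "UCAG"
# Amino acids for the 64 codons in UCAG order (first base varies slowest);
# '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
stop_codons = sorted(c for c, aa in codon_table.items() if aa == "*")
print(len(codon_table), stop_codons)
# 64 ['UAA', 'UAG', 'UGA']
```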

In [37]:
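def DNA_to_prot(seq):
    # Translate an RNA string codon by codon; stop codons become '\n' (see code_dict).
    new_seq = ""
    for i in range(0, len(seq), 3):
        # '<=' keeps the final complete codon; a trailing partial codon is skipped.
        if i + 3 <= len(seq):
            new_seq = new_seq + code_dict[seq[i:i+3]]
    # '*' marks the last stretch, which is not closed by a stop codon.
    return new_seq + '*'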
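
The newline and `*` markers are easiest to see on a toy input. This is a minimal sketch with a four-codon table; `toy_table` and `translate` are hypothetical and stand in for the full table and function.

```python
# Hypothetical four-codon table; the stop codon maps to a newline.
toy_table = {"AUG": "M", "GCC": "A", "AAA": "K", "UAA": "\n"}

def translate(seq, table):
    # Every complete codon is translated; a trailing partial codon is dropped.
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    # The trailing '*' tags whatever follows the last stop codon.
    return "".join(table[c] for c in codons) + "*"

print(repr(translate("AUGGCCUAAAUGAAA", toy_table)))  # 'MA\nMK*'
print(repr(translate("AUGGCCUAA", toy_table)))        # 'MA\n*'
```

In the first call `MK*` is not closed by a stop codon, so it is not an ORF. In the second call the sequence ends exactly on a stop codon, so `MA` stays a complete fragment and the `*` stands alone.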

        
        

In [38]:
RF1, RF2, RF3 = DNA_to_prot(sequence), DNA_to_prot(sequence[1:]), DNA_to_prot(sequence[2:])

print(RF1)
print(RF2)
print(RF3)


FPLFQQSSIEHLG
YGYTILCLDV
SSNYCIVCDYPPHNAVVTVASKGGLYGQSCLMYSVRSSAPCVYCR
RSRVCTKIFEECRTNGAE
PELGPPMTKMGLKRTNKIFKEKVLLLREGPTKKRYAIVSSNWYGGENALFQMHASFVVFSEIAISLNTTKLA
STLTSSVPSWGPYF
MFLCPRSNLKRRRNP
APTPTLVNKQKRVICPWSDELPG
ICWHRRARPSSTIFLGSLS
GSFLVEW
RMLSPDCYYPDRKDPIHR

TDT
LHEWPHRIVIRSDERPSFGDT
DLSPNWSPRGV
*
SLCSNSRRLSIWASTAILFYVWTYNLLITALFATIPLIMPLLLSRRRGAYMDNRA
CIRYGRVPLACTAGNDHVCVLKSLRNVAPTGPNDQSSALL
PKWV
REQIKFSKKRYYYYVRGQLKRDTPSFPRIGMVGKTLYSKCMRAS
CSAK
LFR
TLRSSHNRP
PRPSPVGGHTSKCFCVLVVI
RGAGIHEPRPPRS
TSRNELFVPGPMSCPDRYVGIAELGRAPQSFWGRFHKDHFWLSGNACSRQIVTTPIERTPYIDNRRIRSFTNGHIE
LFVLMRDHLSVTHRTCLPIGPLGEFN*
PFVPTVVD
ASGLVRLYYFMFGRIIF
LLHCLRLSPS
CRCYCRVEGGPIWTIVPNVFGTVECPLRVLQVTITCVY
NL
GMSHQRGRMTRARPSYDQNGFEENK
NFQRKGTITT
GAN
KEIRHRFLELVWWGKRFIPNACELRSVQRNSYFAEHYEARIIDLNLVRPQLGAILLNVSVSS

SKEAPESMSPDPHAREQAETSYLSLVR
AARIDMLASQSSAELHNLSGVAFIRIISG
VVTHALARLLLPRSKGPHT
IIDGYVASRMATSNSYSF

ETIFR
HIGPVSQLVPSGSL*


In [39]:
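def find_ORF(seq):
    # Each whitespace-separated piece is a stretch of residues between stop codons.
    # Pieces containing '*' are not terminated by a stop codon, so they are skipped.
    fragments = [frag for frag in seq.split() if 'M' in frag and '*' not in frag]
    result_ORF = []
    for frag in fragments:
        # Every 'M' in a terminated fragment starts its own candidate protein.
        for m in re.finditer('M', frag):
            result_ORF.append(frag[m.start():])
    return result_ORF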
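
A small worked example of the same idea, on an invented translated string (the function name `candidate_proteins` and the input are hypothetical):

```python
import re

def candidate_proteins(translated):
    # Fragments are separated by whitespace (one newline per stop codon).
    # A fragment containing '*' ran off the end of the sequence without
    # reaching a stop codon, so it is not an ORF.
    proteins = []
    for fragment in translated.split():
        if "*" in fragment:
            continue
        # Each 'M' starts a candidate that runs to the end of the fragment.
        proteins.extend(fragment[m.start():] for m in re.finditer("M", fragment))
    return proteins

print(candidate_proteins("AMK\nMQMR\nLLG\nMT*"))
# ['MK', 'MQMR', 'MR']
```

`MQMR` contributes two candidates because it contains two start residues, `LLG` contributes none, and `MT*` is dropped because it has no terminating stop codon.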
            

In [40]:
# Collect the candidate proteins from the three forward reading frames.
ORFs = find_ORF(RF1) + find_ORF(RF2) + find_ORF(RF3)
print('\n'.join(ORFs))

MYSVRSSAPCVYCR
MTKMGLKRTNKIFKEKVLLLREGPTKKRYAIVSSNWYGGENALFQMHASFVVFSEIAISLNTTKLA
MGLKRTNKIFKEKVLLLREGPTKKRYAIVSSNWYGGENALFQMHASFVVFSEIAISLNTTKLA
MHASFVVFSEIAISLNTTKLA
MFLCPRSNLKRRRNP
MLSPDCYYPDRKDPIHR
MPLLLSRRRGAYMDNRA
MDNRA
MVGKTLYSKCMRAS
MRAS
MSCPDRYVGIAELGRAPQSFWGRFHKDHFWLSGNACSRQIVTTPIERTPYIDNRRIRSFTNGHIE
MFGRIIF
MSHQRGRMTRARPSYDQNGFEENK
MTRARPSYDQNGFEENK
MSPDPHAREQAETSYLSLVR
MLASQSSAELHNLSGVAFIRIISG
MATSNSYSF


In [41]:
# Reverse complement of the RNA string: complement each base, reading from the end.
rev_dict = {'A':'U', 'U':'A','G':'C','C':'G'}
rev_seq = ''.join(rev_dict[base] for base in reversed(sequence))
rev_seq


'UAUUAAACUCCCCGAGGGGACCAAUUGGGAGACAGGUCCUAUGUGUCACCGAAAGAUGGUCUCUCAUCAGAACGAAUAACUAUUCGAUGUGGCCAUUCGUGAAGCUACGUAUCCGUCUAUUAUCUAUGUAUGGGGUCCUUUCGAUCGGGGUAGUAACAAUCUGGCGAGAGCAUGCGUUACCACUCAACCAGAAAUGAUCCUUAUGAAAGCGACCCCAGAAAGAUUGUGGAGCUCGGCCGAGCUCUGCGAUGCCAACAUAUCUAUCCGGGCAGCUCAUCGGACCAGGGACAAAUAACUCGUUUCUGCUUGUUCACGAGCGUGGGGGUCGGGGCUCAUGGAUUCCGGCGCCUCUUUAGAUUACUACGAGGACACAGAAACAUUUAGAAGUAUGGCCCCCAACUGGGGACGGACGAGGUUAAGGUCGAUUAUGCGAGCUUCGUAGUGUUCAGCGAAAUAGCUAUUUCGCUGAACACUACGAAGCUCGCAUGCAUUUGGAAUAAAGCGUUUUCCCCACCAUACCAAUUCGAGGAAACGAUGGCGUAUCUCUUUUUAGUUGGCCCCUCACGUAGUAAUAGUACCUUUUCUUUGAAAAUUUUAUUUGUUCUCUUCAAACCCAUUUUGGUCAUAGGAGGGCCGAGCUCUGGUCAUUCGGCCCCGUUGGUGCGACAUUCCUCAAAGAUUUUAGUACACACACGUGAUCGUUACCUGCAGUACACGCAAGGGGCACUCGACCGUACCGAAUACAUUAGGCACGAUUGUCCAUAUAGGCCCCCCUUCGACGCGACAGUAACAACGGCAUUAUGAGGGGGAUAGUCGCAAACAAUGCAGUAAUUAGAAGAUUAUACGUCCAAACAUAAAAUAGUAUAGCCGUACUAGCCCAGAUGCUCAAUCGACGACUGUUGGAACAAAGGGAA'

In [42]:
rRF1, rRF2, rRF3 = DNA_to_prot(rev_seq), DNA_to_prot(rev_seq[1:]), DNA_to_prot(rev_seq[2:])
rORFs = find_ORF(rRF1) + find_ORF(rRF2) + find_ORF(rRF3)
# Merge both strands; the set drops candidate proteins that occur more than once.
total_orfs = list(set(rORFs + ORFs))
print('\n'.join(total_orfs))

MSHQRGRMTRARPSYDQNGFEENK
MPLLLSRRRGAYMDNRA
MILMKATPERLWSSAELCDANISIRAAHRTRDK
MAYLFLVGPSRSNSTFSLKILFVLFKPILVIGGPSSGHSAPLVRHSSKILVHTRDRYLQYTQGALDRTEYIRHDCPYRPPFDATVTTAL
MWPFVKLRIRLLSMYGVLSIGVVTIWREHALPLNQK
MLSPDCYYPDRKDPIHR
MRGIVANNAVIRRLYVQT
MTKMGLKRTNKIFKEKVLLLREGPTKKRYAIVSSNWYGGENALFQMHASFVVFSEIAISLNTTKLA
MGLKRTNKIFKEKVLLLREGPTKKRYAIVSSNWYGGENALFQMHASFVVFSEIAISLNTTKLA
MKATPERLWSSAELCDANISIRAAHRTRDK
MRAS
MCHRKMVSHQNE
MVSHQNE
MVGKTLYSKCMRAS
MATSNSYSF
MFLCPRSNLKRRRNP
MDSGASLDYYEDTETFRSMAPNWGRTRLRSIMRAS
MQ
MPTYLSGQLIGPGTNNSFLLVHERGGRGSWIPAPL
MHASFVVFSEIAISLNTTKLA
MTRARPSYDQNGFEENK
MYGVLSIGVVTIWREHALPLNQK
MGSFRSG
MLASQSSAELHNLSGVAFIRIISG
MDNRA
MAPNWGRTRLRSIMRAS
MSPDPHAREQAETSYLSLVR
MHLE
MRYHSTRNDPYESDPRKIVELGRALRCQHIYPGSSSDQGQITRFCLFTSVGVGAHGFRRLFRLLRGHRNI
MYSVRSSAPCVYCR
MFGRIIF
MSCPDRYVGIAELGRAPQSFWGRFHKDHFWLSGNACSRQIVTTPIERTPYIDNRRIRSFTNGHIE
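
Because a `set` does not keep insertion order, the listing above may come out in a different order from run to run. If a stable order is wanted, one option is to deduplicate through dictionary keys. A minimal sketch with made-up lists (`forward` and `reverse` are placeholders, not the real results):

```python
forward = ["MK", "MQMR", "MR"]   # placeholder forward-strand candidates
reverse = ["MR", "MA", "MK"]     # placeholder reverse-strand candidates

# dict keys keep insertion order (Python 3.7+), so duplicates are dropped
# while the first occurrence of each protein keeps its position.
unique_in_order = list(dict.fromkeys(forward + reverse))
print(unique_in_order)
# ['MK', 'MQMR', 'MR', 'MA']
```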
