## ORF - Open Reading Frames (Loop/Con)
### Theory
In “Transcribing DNA into RNA”, we discussed the transcription of DNA into RNA, and in “Translating RNA into Protein”, we examined the translation of RNA into a chain of amino acids for the construction of proteins. We can view these two processes as a single step in which we directly translate a DNA string into a protein string, thus calling for a DNA codon table.

However, three immediate wrinkles of complexity arise when we try to pass directly from DNA to proteins. First, not all DNA will be transcribed into RNA: so-called junk DNA appears to have no practical purpose for cellular function. Second, we can begin translation at any position along a strand of RNA, meaning that any substring of a DNA string can serve as a template for translation, as long as it begins with a start codon, ends with a stop codon, and has no other stop codons in the middle. As a result, the same RNA string can actually be translated in three different ways, depending on how we group triplets of symbols into codons. For example, ...AUGCUGAC... can be translated as ...AUGCUG..., ...UGCUGA..., and ...GCUGAC..., which will typically produce wildly different protein strings.
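
The three groupings of `AUGCUGAC` mentioned above can be reproduced with a minimal sketch. The helper name `reading_frames` is hypothetical and only serves this illustration; it keeps complete triplets and drops any leftover bases at the end.

```python
def reading_frames(fragment):
    """Return the codon grouping for each of the three offsets (0, 1, 2)."""
    frames = []
    for offset in range(3):
        # step through the fragment in threes, keeping complete triplets only
        codons = [fragment[i:i + 3] for i in range(offset, len(fragment) - 2, 3)]
        frames.append(codons)
    return frames

for offset, codons in enumerate(reading_frames('AUGCUGAC')):
    print(offset, codons)
# 0 ['AUG', 'CUG']
# 1 ['UGC', 'UGA']
# 2 ['GCU', 'GAC']
```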

### Exercise
The third wrinkle is that either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.

An open reading frame (ORF) is a reading frame that starts at a start codon and ends at a stop codon, with no other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.
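
One compact way to express this definition is a regular expression with a lookahead. This is only a sketch, and it is not the codon-list approach used in the Functions section of this notebook. The names `ORF_PATTERN` and `find_orfs` are hypothetical, the input is assumed to contain only the letters A, C, G and T, and only the given strand is scanned (the reverse complement would need a second call).

```python
import re

# ATG, then as few whole codons as possible, then the first in-frame stop codon.
# Wrapping the pattern in a lookahead (?=...) allows overlapping ORFs to be reported.
ORF_PATTERN = re.compile(r'(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))')

def find_orfs(dna):
    """Return every ORF substring on the given strand, ordered by start position."""
    return [match.group(1) for match in ORF_PATTERN.finditer(dna)]

print(find_orfs('ATGCATGTAAGCTGA'))
# ['ATGCATGTAAGCTGA', 'ATGTAA']
```

In the first ORF the triplet `TAA` appears in the text but not in frame, so it does not end translation; the second ORF starts at the inner `ATG`, where `TAA` is in frame.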

Given: A DNA string s of length at most 1 kbp in FASTA format.

E.g.
```
>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
```
Return: Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.

E.g.
```
MLLGSFRLIPKETLIQVAGSSPCNLS
M
MGMTPRLGLESLLE
MTPRLGLESLLE
```

## Notes
DNA double helix
- sequence
- reverse complement (complementary base pairs, in reversed order)
- each of the two strands has 3 reading frames, starting from the 1st, 2nd and 3rd nucleotide base from the left (6 in total, because we do not know where translation starts)
    - keep unique translated sequences only (we want every candidate that begins with a start codon and ends with a stop codon, whatever its starting position in the sequence)

## Input Data

In [22]:
# new version - used

# insert a file path for DNA strings in FASTA format
#path = input('insert a file path: ')
# fixed path
path = '/mnt/c/Users/guspa/Desktop/python_homework/data/rosalind/rosalind_orf.txt'
# reading mode
mode = 'r'

storage = []

s = ''

with open(path, mode) as file:
    for line in file:
        line = line.strip()

        # a header line closes the previous record (if any)
        if line.startswith('>'):
            if s:
                storage.append(s)
                s = ''

        else:
            s += line

# append the last record, which is not followed by another header
if s:
    storage.append(s)

print(storage) # header lines ('>') are not stored; they only mark where one merged sequence (joined across line breaks) ends

['AGACTAACTGAATAATTCCCTCACCATGCAGCGGCCTAGACGACCTTGTTCGGATTCGGACTACGCTCACTCATACGAGTCACCGTTTGTCCGGGGGGTTATGTAACTTCGCAGGCGCGTAAGACGTCTTCACGCTTGCAGTTCGAGGGTCGGATCCGTGGAAGGCCAAGATGACTACCATGCATTATGCGTCGTGCCATGGCGTCATAATACCCGGAGCTAACAGTGACTTCAGCTTAGCGGTGCAATAGATCCATGTCAATATCCAGATGATCCGCCTCCTAGCGGCTATGAACAGATACCGGCCGAAATCAGGATTTACCAAAATGTTATCATCTAGTGTCAGGCAAGCCGTTCGGACAAAGAGCAGCGATGCAACCCAAAACCAGGAGAGCATCCGACGGCGCGGTAATGGTGGGCCAATTCACGGGTTTATGAGTTCTCGGGGTTAGTAGCTACTAACCCCGAGAACTCATACCCGATCCTAAAGATCCAACTGATTCCCGTGCCTGATGTGTAAAGGGTTCAAATGTTGCCTCCTCGGCCCAGTCGAAACAGGCGTCCCAACGTCGAGACATAATGTACCGTGCTCGGACTAGCCCCACCAAACCTAACCAATCTATTACCCGGGGTACGCTCCAAGGCTAACCCCGCATGTGGACCCAGATACCCGGAGCGGCGCCCAACCTGCCAAATCTTGTCATACCGTATCAAGTGCGTCCAACCTACTGGTCCTTAAGATCCGGGATCCGGCCACAGGTGAAAGGACTCGTTGTCAACATCTCGCGACGACCGCGATGTTCTCAACGGGATTGCTATAGAGTGCTTGCTTCGAACAACCGGCTCGGGGTTCGAAATTTCGGTTTTTACTTCACGCGTGGTGAGCTCATGATGGCGTAGTTTCGGCTTG']
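
The reading loop above could also be wrapped in a reusable generator. The following is a sketch under stated assumptions: `read_fasta` is a hypothetical helper, every record is assumed to start with a `>` header line, and the record label `Rosalind_100` is made up for the demonstration.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # a new header closes the previous record
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # the last record is not followed by another header
    if header is not None:
        yield header, ''.join(chunks)

# in-memory stand-in for a file; with a real file use: with open(path) as handle: ...
example = io.StringIO('>Rosalind_99\nAGCCATGTAG\nCTAACTCAGG\n>Rosalind_100\nTTACAT\n')
print(list(read_fasta(example)))
# [('Rosalind_99', 'AGCCATGTAGCTAACTCAGG'), ('Rosalind_100', 'TTACAT')]
```

Collecting the line pieces in a list and joining them once avoids repeated string concatenation, which matters little at 1 kbp but scales better for long sequences.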


## Testing Ground

In [10]:
storage_test = ['AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG', 'CTGAGATGCTACTCGGATCATTCAGGCTTATTCCAAAAGAGACTCTAATCCAAGTCGCGGGGTCATCCCCATGTAACCTGAGTTAGCTACATGGCT']

print(storage_test)

# not used
combinations_rna = {
    'UUU' : 'F',    'CUU' : 'L', 'AUU' : 'I', 'GUU' : 'V',
    'UUC' : 'F',    'CUC' : 'L', 'AUC' : 'I', 'GUC' : 'V',
    'UUA' : 'L',    'CUA' : 'L', 'AUA' : 'I', 'GUA' : 'V',
    'UUG' : 'L',    'CUG' : 'L', 'AUG' : 'M', 'GUG' : 'V',
    'UCU' : 'S',    'CCU' : 'P', 'ACU' : 'T', 'GCU' : 'A',
    'UCC' : 'S',    'CCC' : 'P', 'ACC' : 'T', 'GCC' : 'A',
    'UCA' : 'S',    'CCA' : 'P', 'ACA' : 'T', 'GCA' : 'A',
    'UCG' : 'S',    'CCG' : 'P', 'ACG' : 'T', 'GCG' : 'A',
    'UAU' : 'Y',    'CAU' : 'H', 'AAU' : 'N', 'GAU' : 'D',
    'UAC' : 'Y',    'CAC' : 'H', 'AAC' : 'N', 'GAC' : 'D',
    'UAA' : 'Stop', 'CAA' : 'Q', 'AAA' : 'K', 'GAA' : 'E',
    'UAG' : 'Stop', 'CAG' : 'Q', 'AAG' : 'K', 'GAG' : 'E',
    'UGU' : 'C',    'CGU' : 'R', 'AGU' : 'S', 'GGU' : 'G',
    'UGC' : 'C',    'CGC' : 'R', 'AGC' : 'S', 'GGC' : 'G',
    'UGA' : 'Stop', 'CGA' : 'R', 'AGA' : 'R', 'GGA' : 'G',
    'UGG' : 'W',    'CGG' : 'R', 'AGG' : 'R', 'GGG' : 'G'
}

# used
combinations_dna = {
    'TTT' : 'F',      'CTT' : 'L',      'ATT' : 'I',      'GTT' : 'V',
    'TTC' : 'F',      'CTC' : 'L',      'ATC' : 'I',      'GTC' : 'V',
    'TTA' : 'L',      'CTA' : 'L',      'ATA' : 'I',      'GTA' : 'V',
    'TTG' : 'L',      'CTG' : 'L',      'ATG' : 'M',      'GTG' : 'V',
    'TCT' : 'S',      'CCT' : 'P',      'ACT' : 'T',      'GCT' : 'A',
    'TCC' : 'S',      'CCC' : 'P',      'ACC' : 'T',      'GCC' : 'A',
    'TCA' : 'S',      'CCA' : 'P',      'ACA' : 'T',      'GCA' : 'A',
    'TCG' : 'S',      'CCG' : 'P',      'ACG' : 'T',      'GCG' : 'A',
    'TAT' : 'Y',      'CAT' : 'H',      'AAT' : 'N',      'GAT' : 'D',
    'TAC' : 'Y',      'CAC' : 'H',      'AAC' : 'N',      'GAC' : 'D',
    'TAA' : 'Stop',   'CAA' : 'Q',      'AAA' : 'K',      'GAA' : 'E',
    'TAG' : 'Stop',   'CAG' : 'Q',      'AAG' : 'K',      'GAG' : 'E',
    'TGT' : 'C',      'CGT' : 'R',      'AGT' : 'S',      'GGT' : 'G',
    'TGC' : 'C',      'CGC' : 'R',      'AGC' : 'S',      'GGC' : 'G',
    'TGA' : 'Stop',   'CGA' : 'R',      'AGA' : 'R',      'GGA' : 'G',
    'TGG' : 'W',      'CGG' : 'R',      'AGG' : 'R',      'GGG' : 'G' 
}

print(combinations_rna); print(combinations_dna)

['AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG', 'CTGAGATGCTACTCGGATCATTCAGGCTTATTCCAAAAGAGACTCTAATCCAAGTCGCGGGGTCATCCCCATGTAACCTGAGTTAGCTACATGGCT']
{'UUU': 'F', 'CUU': 'L', 'AUU': 'I', 'GUU': 'V', 'UUC': 'F', 'CUC': 'L', 'AUC': 'I', 'GUC': 'V', 'UUA': 'L', 'CUA': 'L', 'AUA': 'I', 'GUA': 'V', 'UUG': 'L', 'CUG': 'L', 'AUG': 'M', 'GUG': 'V', 'UCU': 'S', 'CCU': 'P', 'ACU': 'T', 'GCU': 'A', 'UCC': 'S', 'CCC': 'P', 'ACC': 'T', 'GCC': 'A', 'UCA': 'S', 'CCA': 'P', 'ACA': 'T', 'GCA': 'A', 'UCG': 'S', 'CCG': 'P', 'ACG': 'T', 'GCG': 'A', 'UAU': 'Y', 'CAU': 'H', 'AAU': 'N', 'GAU': 'D', 'UAC': 'Y', 'CAC': 'H', 'AAC': 'N', 'GAC': 'D', 'UAA': 'Stop', 'CAA': 'Q', 'AAA': 'K', 'GAA': 'E', 'UAG': 'Stop', 'CAG': 'Q', 'AAG': 'K', 'GAG': 'E', 'UGU': 'C', 'CGU': 'R', 'AGU': 'S', 'GGU': 'G', 'UGC': 'C', 'CGC': 'R', 'AGC': 'S', 'GGC': 'G', 'UGA': 'Stop', 'CGA': 'R', 'AGA': 'R', 'GGA': 'G', 'UGG': 'W', 'CGG': 'R', 'AGG': 'R', 'GGG': 'G'}
{'TTT': 'F', 'CTT': '
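
Typing out 64 entries by hand is error-prone, so the same two tables can be generated from a compact string. This is a sketch, assuming the standard genetic code with bases ordered T, C, A, G; the names `BASES`, `AMINO_ACIDS`, `dna_table` and `rna_table` are hypothetical and do not replace the dictionaries above.

```python
from itertools import product

BASES = 'TCAG'
# amino acids of the standard genetic code in TCAG codon order; '*' marks a stop codon
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

# product() varies the last base fastest, which matches the order of AMINO_ACIDS
dna_table = {
    ''.join(codon): ('Stop' if amino_acid == '*' else amino_acid)
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# the RNA table is the same mapping with thymine replaced by uracil
rna_table = {codon.replace('T', 'U'): amino_acid for codon, amino_acid in dna_table.items()}

print(len(dna_table), dna_table['ATG'], dna_table['TGA'], rna_table['AUG'])
# 64 M Stop M
```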

## Functions

In [11]:
# not used 
def transcription(string):

    # replace every thymine (T) with uracil (U)
    return string.replace('T', 'U')

rna_sequence = transcription(storage_test[0]); print(rna_sequence) # changes thymine (dna) to uracil (rna)

AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG


In [19]:
def reverse(string):

    # create a sequence with complementing base pairs
    reversed_string = string.replace('A', 't')
    reversed_string = reversed_string.replace('T', 'a')
    reversed_string = reversed_string.replace('C', 'g')
    reversed_string = reversed_string.replace('G', 'c')

    reversed_string = reversed_string.upper() # capitalise letters

    reversed_string = reversed_string[::-1] # reverse order

    return reversed_string

reverse_complement = reverse(storage_test[0]); print(reverse_complement)

CTGAGATGCTACTCGGATCATTCAGGCTTATTCCAAAAGAGACTCTAATCCAAGTCGCGGGGTCATCCCCATGTAACCTGAGTTAGCTACATGGCT
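
The lower-case trick in `reverse` avoids swapping a base twice. An alternative sketch does the swap in a single pass with a translation table from `str.maketrans`; the names `COMPLEMENT` and `reverse_complement_of` are hypothetical, and the input is assumed to be upper-case DNA.

```python
# map each base to its complement in one pass
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement_of(dna):
    """Complement every base, then reverse the order."""
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement_of('AAAACCCGGT'))
# ACCGGGTTTT
```

Applying the function twice returns the original string, which is a quick sanity check for any reverse-complement implementation.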


In [25]:
def get_codons(string):

    codons = [[], [], []]

    # for each reading frame (3 in total)
    for frame in range(3): # 0, 1, 2 (offset)

        # acquire complete triplets only; the 1-2 nucleotide bases left over at the end are dropped
        for start in range(frame, len(string) - 2, 3):
            codons[frame].append(string[start:start + 3])

    return codons

print(get_codons(storage_test[0]))
print()
print(get_codons(storage[0]))

[['AGC', 'CAT', 'GTA', 'GCT', 'AAC', 'TCA', 'GGT', 'TAC', 'ATG', 'GGG', 'ATG', 'ACC', 'CCG', 'CGA', 'CTT', 'GGA', 'TTA', 'GAG', 'TCT', 'CTT', 'TTG', 'GAA', 'TAA', 'GCC', 'TGA', 'ATG', 'ATC', 'CGA', 'GTA', 'GCA', 'TCT', 'CAG'], ['GCC', 'ATG', 'TAG', 'CTA', 'ACT', 'CAG', 'GTT', 'ACA', 'TGG', 'GGA', 'TGA', 'CCC', 'CGC', 'GAC', 'TTG', 'GAT', 'TAG', 'AGT', 'CTC', 'TTT', 'TGG', 'AAT', 'AAG', 'CCT', 'GAA', 'TGA', 'TCC', 'GAG', 'TAG', 'CAT', 'CTC'], ['CCA', 'TGT', 'AGC', 'TAA', 'CTC', 'AGG', 'TTA', 'CAT', 'GGG', 'GAT', 'GAC', 'CCC', 'GCG', 'ACT', 'TGG', 'ATT', 'AGA', 'GTC', 'TCT', 'TTT', 'GGA', 'ATA', 'AGC', 'CTG', 'AAT', 'GAT', 'CCG', 'AGT', 'AGC', 'ATC', 'TCA']]

[['AGA', 'CTA', 'ACT', 'GAA', 'TAA', 'TTC', 'CCT', 'CAC', 'CAT', 'GCA', 'GCG', 'GCC', 'TAG', 'ACG', 'ACC', 'TTG', 'TTC', 'GGA', 'TTC', 'GGA', 'CTA', 'CGC', 'TCA', 'CTC', 'ATA', 'CGA', 'GTC', 'ACC', 'GTT', 'TGT', 'CCG', 'GGG', 'GGT', 'TAT', 'GTA', 'ACT', 'TCG', 'CAG', 'GCG', 'CGT', 'AAG', 'ACG', 'TCT', 'TCA', 'CGC', 'TTG', 'CAG', 'TT

In [14]:
def translation(codon_list, codon_combinations):

    proteins = []

    # loop through the codons of each reading frame
    for frame_codons in codon_list:
        for position_dna, codon in enumerate(frame_codons):

            # locate start codon
            if codon == 'ATG':

                protein_sequence = ''

                # translate from the start codon onwards until a stop codon is reached
                for codon_in_sequence in frame_codons[position_dna:]:

                    amino_acid = codon_combinations[codon_in_sequence]

                    if amino_acid == 'Stop':
                        # only a sequence closed by a stop codon counts as a candidate protein
                        proteins.append(protein_sequence)
                        break

                    protein_sequence += amino_acid

    return proteins

print(translation(get_codons(storage_test[0]), combinations_dna))
print()
print(translation(get_codons(storage_test[1]), combinations_dna))

['MGMTPRLGLESLLE', 'MTPRLGLESLLE', 'M']

['M', 'MLLGSFRLIPKETLIQVAGSSPCNLS']


## Programme

In [26]:
# get all possible protein sequences from the open reading frames
proteins = []

for dna_sequence in storage: # storage_test or storage
    
    reverse_complement = reverse(dna_sequence)

    dna_codons = get_codons(dna_sequence)
    dna_rc_codons = get_codons(reverse_complement)

    proteins.extend(translation(dna_codons, combinations_dna))
    proteins.extend(translation(dna_rc_codons, combinations_dna))


# keep unique protein sequences only (the first occurrence is kept, so the order is preserved)
unique_proteins = []

for protein in proteins:

    if protein not in unique_proteins:
        unique_proteins.append(protein)


# print the results
for protein in unique_proteins:
    print(protein)

MRRAMAS
MAS
MSISR
MQPKTRRASDGAVMVGQFTGL
MVGQFTGL
MYRARTSPTKPNQSITRGTLQG
MWTQIPGAAPNLPNLVIPYQVRPTYWSLRSGIRPQVKGLVVNISRRPRCSQRDCYRVLASNNRLGVRNFGFYFTRGELMMA
MMA
MA
MQRPRRPCSDSDYAHSYESPFVRGVM
M
MLPPRPSRNRRPNVET
MTTMHYASCHGVIIPGANSDFSLAVQ
MHYASCHGVIIPGANSDFSLAVQ
MIRLLAAMNRYRPKSGFTKMLSSSVRQAVRTKSSDATQNQESIRRRGNGGPIHGFMSSRG
MNRYRPKSGFTKMLSSSVRQAVRTKSSDATQNQESIRRRGNGGPIHGFMSSRG
MLSSSVRQAVRTKSSDATQNQESIRRRGNGGPIHGFMSSRG
MSSRG
MCKGFKCCLLGPVETGVPTSRHNVPCSD
MFSTGLL
MLSWFWVASLLFVRTACLTLDDNILVNPDFGRYLFIAARRRIIWILTWIYCTAKLKSLLAPGIMTPWHDA
MTPWHDA
MVVILAFHGSDPRTASVKTSYAPAKLHNPPDKR
MITFW
MVRELFS
MSSPRVK
MLTTSPFTCGRIPDLKDQ
MTRFGRLGAAPGIWVHMRG
MRG
MSRRWDACFDWAEEATFEPFTHQARESVGSLGSGMSSRG
MDLLHR
MARRIMHGSHLGLPRIRPSNCKREDVLRACEVT
MHGSHLGLPRIRPSNCKREDVLRACEVT
MSERSPNPNKVV


## Answer
- answer for the sample dataset (strings can be in any order)

```
MLLGSFRLIPKETLIQVAGSSPCNLS
M
MGMTPRLGLESLLE
MTPRLGLESLLE
```