# ros-orf level 1

### Solve Rosalind problem

https://rosalind.info/problems/orf/

Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.
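
To make the six reading frames concrete, here is a minimal sketch on a toy sequence. The helper names `reverse_complement` and `six_frames` are illustrative and not part of the Rosalind problem; the frame labels follow the `sense_N` / `asense_N` naming used later in this notebook.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Split both strands into codons for each of the three offsets."""
    strands = {"sense": dna, "asense": reverse_complement(dna)}
    frames = {}
    for name, seq in strands.items():
        for offset in range(3):
            # Only complete codons are kept; a trailing 1- or 2-base remainder is dropped
            frames[f"{name}_{offset + 1}"] = [
                seq[i : i + 3] for i in range(offset, len(seq) - 2, 3)
            ]
    return frames


# Toy input, chosen only for illustration
for name, codons in six_frames("ATGGCCTAA").items():
    print(name, codons)
```

Each strand contributes three codon lists, one per offset, so a 9-base string yields frames of 3, 2 and 2 codons per strand.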

An open reading frame (ORF) is one that starts at a start codon and ends at a stop codon, with no other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.

**Given:** A DNA string s of length at most 1 kbp in FASTA format.

**Return:** Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.

#### Strategy:
I am not taking the most straightforward path because I want to retain frame information.
1. Create dictionary with codons of all 6 possible frames (sense_1, sense_2, sense_3, asense_1, asense_2, asense_3).
2. Find all start and stop codon positions in frames dictionary and store positions in new dictionary (still organized by frame).
3. Select all the ORFs (i.e. codons from start to stop without intermediate stop codons), translate each into a peptide sequence and add the peptide to a new dictionary organized by frame.
4. Print all the unique possible peptides.

NOTE:
Because I also want to practice Biopython, I will create an alternative version using it.

In [10]:
codon_table = {
    'AUG': 'M',  # Start codon (Methionine)
    'UUU': 'F', 'UUC': 'F',  # Phenylalanine
    'UUA': 'L', 'UUG': 'L',  # Leucine
    'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L',  # Leucine
    'AUU': 'I', 'AUC': 'I', 'AUA': 'I',  # Isoleucine
    'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V',  # Valine
    'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S',  # Serine
    'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',  # Proline
    'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',  # Threonine
    'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',  # Alanine
    'UAU': 'Y', 'UAC': 'Y',  # Tyrosine
    'CAU': 'H', 'CAC': 'H',  # Histidine
    'CAA': 'Q', 'CAG': 'Q',  # Glutamine
    'AAU': 'N', 'AAC': 'N',  # Asparagine
    'AAA': 'K', 'AAG': 'K',  # Lysine
    'GAU': 'D', 'GAC': 'D',  # Aspartic acid
    'GAA': 'E', 'GAG': 'E',  # Glutamic acid
    'UGU': 'C', 'UGC': 'C',  # Cysteine
    'UGG': 'W',  # Tryptophan
    'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',  # Arginine
    'AGU': 'S', 'AGC': 'S',  # Serine
    'AGA': 'R', 'AGG': 'R',  # Arginine
    'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',  # Glycine
    'UAA': 'Stop', 'UAG': 'Stop', 'UGA': 'Stop'  # Stop codons
}


with open("rosalind_orf_sample.txt", "r") as file:
    infile = file.read().split("\n")
    DNA = "".join(infile[1:])
    RNA_sense = DNA.replace("T", "U")
    # Reverse complement: swap A<->U and C<->G in one pass, then reverse the string
    RNA_asense = RNA_sense.translate(str.maketrans("AUCG", "UAGC"))[::-1]
    RNA = {"RNA_sense": RNA_sense, "RNA_asense": RNA_asense}




# 1. Create dictionary with codons of all 6 possible frames:

frames = {}

for strand, seq in RNA.items():
    for frame_start in range(3):
        key = f"{strand}_{frame_start+1}"
        frames[key] = [seq[i : i + 3] for i in range(frame_start, len(seq) - 2, 3)]



# 2. Find all start and stop codon positions in frames dictionary and store them in new dictionaries.

start_codon = "AUG"
stop_codon = {"UAA", "UAG", "UGA"}

start_i = {}
stop_i = {}


for frame, codons in frames.items():
    start_i[frame] = [i for i, codon in enumerate(codons) if codon == start_codon]
    stop_i[frame] = [i for i, codon in enumerate(codons) if codon in stop_codon]


# 3. Select all ORFs (start to stop codons without an intermediate stop codon), translate into peptides and organize by frame in a new dictionary


ORFs = {}

for frame, codons in frames.items():
    starts = start_i[frame]
    stops = stop_i[frame]

    for start in starts:
        for stop in stops:
            if stop > start:
                mid_stops = [mid_codon for mid_codon in stops if start < mid_codon < stop]  # Find intermediate stop codons
                if not mid_stops:  # Proceed if there aren't intermediate stop codons
                    orf_codons = codons[start:stop]
                    peptide = "".join(codon_table[codon] for codon in orf_codons)
                    if frame not in ORFs:
                        ORFs[frame] = []
                    ORFs[frame].append(peptide)
                    break



# 4. Print all the distinct possible peptides

unique_peptides = set()  # Kind of like a list but for unique values and unordered

for peptide_list in ORFs.values():
    unique_peptides.update(peptide_list)

for unique_peptide in unique_peptides:
    print(unique_peptide)

AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG
CUGAGAUGCUACUCGGAUCAUUCAGGCUUAUUCCAAAAGAGACUCUAAUCCAAGUCGCGGGGUCAUCCCCAUGUAACCUGAGUUAGCUACAUGGCU
MLLGSFRLIPKETLIQVAGSSPCNLS
MGMTPRLGLESLLE
MTPRLGLESLLE
M


### Biopython version

In [17]:
from Bio.Seq import Seq
from Bio import SeqIO

with open("rosalind_orf_sample.txt", "r") as file:
    items = SeqIO.read(file, "fasta")
    DNA = items.seq
    RNA_sense = DNA.transcribe()
    # reverse_complement() returned DNA letters (T instead of U) here, hence the second transcribe()
    RNA_asense = RNA_sense.reverse_complement().transcribe()
    RNA = {"RNA_sense": RNA_sense, "RNA_asense": RNA_asense}


# 1. Create dictionary with codons of all 6 possible frames:

frames = {}
for strand, seq in RNA.items():
    for frame_start in range(3):
        key = f"{strand}_{frame_start+1}"
        frames[key] = [str(seq[i : i + 3]) for i in range(frame_start, len(seq) - 2, 3)]


# 2. Find all start and stop codons in frames dictionary, extract putative ORFs and store in new dictionary

start_codon = "AUG"
stop_codon = {"UAA", "UAG", "UGA"}

ORFs = {}

# Iterate over codons for each frame
for frame, codons in frames.items():

    # Add frame as key to ORFs dictionary
    ORFs[frame] = []

    # Store position of start and stop codons in two lists
    # enumerate(codons) generates tuple (index, codon)
    starts = [i for i, codon in enumerate(codons) if codon == start_codon]
    stops = [i for i, codon in enumerate(codons) if codon in stop_codon]

    # INTERMEDIATE REPORT:
    # print(f"\nFrame: {frame}")
    # print(f"Start codon positions: {starts}")
    # print(f"Stop codon positions: {stops}")

    # Iterate over start and stop codons positions
    for start in starts:
        for stop in stops:
            # If there is a start codon followed by stop codon...
            if stop > start:
                # with no stop codons in between,
                mid_stops = [mid_stop for mid_stop in stops if start < mid_stop < stop]
                if not mid_stops:  # you found a putative ORF!
                    # Assemble the mRNA sequence, translate into peptide and store it in ORFs dictionary
                    mrna = "".join(codons[start:stop])
                    peptide = Seq(mrna).translate()
                    ORFs[frame].append(str(peptide))
                    break


# 3. Print all the distinct peptides potentially encoded in the DNA sequence

unique_peptides = set()  # Kind of like a list but for unique values and unordered
for peptide_list in ORFs.values():
    unique_peptides.update(peptide_list)
for unique_peptide in unique_peptides:
    print(unique_peptide)

MLLGSFRLIPKETLIQVAGSSPCNLS
MGMTPRLGLESLLE
MTPRLGLESLLE
M


### Final thoughts

This is not a very efficient way to go about it. It works fine with small DNA sequences, but I expect it to scale poorly to large ones, because every start codon is compared against the list of stop codons in its frame. In the next step I will refrain from creating the frames dictionary.

In the future I can also skip the RNA step and run the ORF search directly on the DNA sequence.
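
As a rough idea of what that could look like, here is a minimal sketch that scans the DNA string directly, one pass per frame, without building a frames dictionary. It is not the version used above: the names `orf_spans` and `all_orfs` are hypothetical, and translation into peptides is left out, so the result is the set of ORF DNA strings (start codon included, stop codon excluded).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orf_spans(dna):
    """Yield (start, stop) indices of ORFs on one strand.

    `stop` is the index of the first base of the stop codon.
    """
    for offset in range(3):
        pending = []  # ATG positions seen since the last stop codon in this frame
        for i in range(offset, len(dna) - 2, 3):
            codon = dna[i : i + 3]
            if codon == "ATG":
                pending.append(i)
            elif codon in STOP_CODONS:
                # Every pending start is closed by this stop codon
                for start in pending:
                    yield start, i
                pending = []


def all_orfs(dna):
    """Return the distinct ORF DNA strings found on both strands."""
    found = set()
    for strand in (dna, reverse_complement(dna)):
        for start, stop in orf_spans(strand):
            found.add(strand[start:stop])
    return found


# Toy input, chosen only for illustration
print(sorted(all_orfs("ATGATGTAA")))
```

Because each frame is read once and start positions are held only until the next stop codon, no start has to be compared against the full list of stops. Start codons that are never followed by a stop in their frame are discarded, which matches the behaviour of the loops above.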

Does Biopython make things easier?
Maybe, once I know its quirks. It simplifies parsing the FASTA file, transcribing and translating.
But I found it very annoying that `.reverse_complement()` turns the sequence back into DNA. It took me a while to realize that this was why the code was not working.