# ORF: Open Reading Frames

### Problem  
https://rosalind.info/problems/orf/

Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.

An open reading frame (ORF) is one which starts from the start codon and ends by stop codon, without any other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.

**Given:** A DNA string s of length at most 1 kbp in FASTA format.

**Return:** Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.
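
To make the six reading frames concrete, here is a minimal sketch on a toy sequence. The sequence and the helper names `reverse_complement` and `six_frames` are illustrative assumptions, not part of the Rosalind problem, and the input is assumed to contain only uppercase A, C, G and T.

```python
def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna: str) -> dict:
    """Split both strands into complete codons for frame offsets 0, 1 and 2."""
    frames = {}
    for strand, seq in (("sense", dna), ("antisense", reverse_complement(dna))):
        for offset in range(3):
            # Stop at len(seq) - 2 so that only complete codons are kept.
            frames[f"{strand}_{offset + 1}"] = [
                seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)
            ]
    return frames


toy_frames = six_frames("ATGAAATAGC")
for name, codons in toy_frames.items():
    print(name, codons)
```

Each of the six lists is one way of reading the same stretch of DNA; only `sense_1` happens to begin with a start codon in this toy case.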

## Level 1: Solve Rosalind problem ORF

#### Strategy:
I'm not taking the most straightforward path, because I want to create useful intermediate steps and retain the frame information.
1. Create dictionary with codons of all 6 possible frames (sense_1, sense_2, sense_3, asense_1, asense_2, asense_3).
2. Find all start and stop codon positions in frames dictionary and store positions in new dictionary (still organized by frame).
3. Select all the ORFs (i.e. codons from start to stop without intermediate stop codons), translate them into peptide sequences and add each peptide to a new dictionary organized by frame.
4. Print all the unique possible peptides.

NOTE:
Because I also want to practice Biopython, I will create an alternative version using it whenever it is useful.
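
For comparison with the strategy above, the more direct route would look roughly like this: scan both strands for every `ATG` and translate until the first in-frame stop codon. This is only a sketch for cross-checking results. It works on DNA codons rather than RNA, builds the standard genetic code from a compact string, and assumes the input contains only uppercase A, C, G and T. The names `CODON_TABLE` and `candidate_peptides` are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code on DNA codons; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def candidate_peptides(dna):
    """Return the distinct peptides of every ATG...stop ORF on both strands."""
    peptides = set()
    for seq in (dna, reverse_complement(dna)):
        for start in range(len(seq) - 2):
            if seq[start:start + 3] != "ATG":
                continue
            peptide = []
            for pos in range(start, len(seq) - 2, 3):
                amino_acid = CODON_TABLE[seq[pos:pos + 3]]
                if amino_acid == "*":
                    # Only ORFs closed by a stop codon count as candidates.
                    peptides.add("".join(peptide))
                    break
                peptide.append(amino_acid)
    return peptides


print(candidate_peptides("ATGAAATGCTAGC"))
```

The second `ATG` in the toy sequence never reaches an in-frame stop codon, so it contributes no peptide. This version loses the per-frame bookkeeping, which is exactly what the strategy above is meant to keep.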

In [None]:
codon_table = {'UUU':'F', 'UUC':'F', 'UUA':'L', 'UUG':'L', 'UCU':'S',
               'UCC':'S', 'UCA':'S', 'UCG':'S', 'UAU':'Y', 'UAC':'Y',
               'UAA':'stop', 'UAG':'stop', 'UGU':'C', 'UGC':'C', 'UGA':'stop',
               'UGG':'W', 'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L',
               'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P', 'CAU':'H',
               'CAC':'H', 'CAA':'Q', 'CAG':'Q', 'CGU':'R', 'CGC':'R',
               'CGA':'R', 'CGG':'R', 'AUU':'I', 'AUC':'I', 'AUA':'I',
               'AUG':'M', 'ACU':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T',
               'AAU':'N', 'AAC':'N', 'AAA':'K', 'AAG':'K', 'AGU':'S',
               'AGC':'S', 'AGA':'R', 'AGG':'R', 'GUU':'V', 'GUC':'V',
               'GUA':'V', 'GUG':'V', 'GCU':'A', 'GCC':'A', 'GCA':'A',
               'GCG':'A', 'GAU':'D', 'GAC':'D', 'GAA':'E', 'GAG':'E',
               'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G'}

with open('rosalind_orf.txt','r') as file:
    infile = file.read().split('\n')
    DNA = ''.join(infile[1:])
    RNA_sense = DNA.replace('T','U')
    RNA_asense = RNA_sense.replace('A','u').replace('U','a').replace('C','g').replace('G','c').upper()[::-1]
    RNA= {'RNA_sense' : RNA_sense,
          'RNA_asense' : RNA_asense}


# 1. Create dictionary with codons of all 6 possible frames:

frames = {}
for strand, seq in RNA.items():
    for frame_start in range(3):
        key = f'{strand}_{frame_start+1}'
        frames[key] = [seq[i:i+3] for i in range(frame_start,len(seq)-2, 3)]


# 2. Find all start and stop codon positions in frames dictionary and store them in new dictionaries.
# Note: could have also stored in list, but I wanted to retain the frame information for myself

start_codon = 'AUG'
stop_codon = {'UAA','UAG','UGA'}


start_i = {}
stop_i = {}

for frame, codons in frames.items():
    start_i[frame]= [i for i, codon in enumerate(codons) if codon == start_codon]
    stop_i[frame]= [i for i, codon in enumerate(codons) if codon in stop_codon]


# 3. Select all ORFs (start to stop codons without an intermediate stop codon), translate into peptides and organize by frame in a new dictionary

ORFs = {}
for frame, codons in frames.items():
    starts = start_i[frame]
    stops = stop_i[frame]

    for start in starts:
        for stop in stops:
            if stop > start:
                mid_stops = [mid_codon for mid_codon in stops if start < mid_codon < stop] # Find intermediate stop codons
                if not mid_stops: # Proceed if there aren't intermediate stop codons
                    orf_codons = codons[start:stop]
                    peptide = ''.join(codon_table[codon] for codon in orf_codons)
                    if frame not in ORFs:
                        ORFs[frame] = []
                    ORFs[frame].append(peptide)
                    break


# 4. Print all the unique possible peptides

unique_peptides = set() # Kind of like a list but for unique values and unordered
for peptide_list in ORFs.values():
    unique_peptides.update(peptide_list)
for unique_peptide in unique_peptides:
    print(unique_peptide)

##### Biopython variant

# Leveling up

I want to expand on this Rosalind problem by tweaking the goals into incremental levels of complexity - each level introducing new skills to learn and build on.

- Level 1: Solve the problem
- Level 2: Analyse putative ORF data and generate fasta files
- Level 3: Scale to analysis of multiple sequences and run through bash
- Level 4: Interface


## Level 2: Analyse putative ORF data and generate fasta files

**Create program to assemble and analyse data on the putative ORFs and peptides for a DNA sequence.**

This means generating a data frame including:
- Unique putative ORF and peptide ID; Frame and Strand; Start and stop positions; ORF length and peptide size; Overlap information
And plotting some data:
- Size x position plot; Overlaps

Additionally, the program should be interactive: it should ask whether you want to generate FASTA files containing the putative ORFs (as DNA and/or RNA) and/or the respective peptides, and which sequence-identifying information to include.

Rather than just returning a list of all the putative peptides translated from a DNA string, I want to create a program that assembles a data frame storing more useful information about the putative ORFs of a DNA sequence. The resulting data frame would include:
- ID (generate an ID for each putative ORF: dnaID_orf_1, dnaID_orf_2...)
- Frame (Sense +1, Sense +2, Sense +3, Antisense +1, Antisense +2, Antisense +3)
- Start (position of first nucleotide in ORF)
- Stop (position of last nucleotide in ORF)
- Stop codon sequence (which of the three stop codons?)
- ORF length (nucleotides in putative ORF)
- Peptide size (number of amino acids in putative peptide)
- Overlap (does ORF overlap with another ORF?)
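
The "Overlap" column boils down to an interval-intersection test on duplex coordinates. A minimal sketch, assuming each ORF is described by a 1-based, inclusive (first bp, last bp) pair; the function name `find_overlaps` and the example IDs and coordinates are hypothetical:

```python
def find_overlaps(orfs):
    """Map each ORF ID to the IDs of the other ORFs sharing at least one bp.

    Coordinates are 1-based and inclusive, so two intervals overlap when
    each one starts no later than the other one ends.
    """
    overlaps = {}
    for orf_id, (first, last) in orfs.items():
        overlaps[orf_id] = [
            other_id
            for other_id, (other_first, other_last) in orfs.items()
            if other_id != orf_id and first <= other_last and other_first <= last
        ]
    return overlaps


example = {
    "dna_orf_1": (1, 30),
    "dna_orf_2": (25, 60),
    "dna_orf_3": (61, 90),
}
print(find_overlaps(example))
```

With inclusive coordinates the comparison uses `<=`; a strict `<` would miss two ORFs that share only a single boundary base pair.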

Learning goals:
- Intro to Pandas data analysis
    - Creating data-frames
    - Plotting data
- Writing fasta files
- Creating interactive functions

Steps:
- [x] Use Pandas to create data-frame
- [x] Plot data
- [ ] Write fasta files (DNA, RNA and protein)
- [ ] Integrate code into interactive function

#### 2.1 Use Pandas to create data-frame
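
Antisense ORFs are found on the reverse complement, so their 0-based indices have to be mapped back to 1-based positions on the input DNA. The sketch below isolates that conversion and the frame number. It assumes `i` and `j` are the 0-based indices of the first nucleotide of the start and stop codons on the scanned strand; the function names `frame_number` and `duplex_span` are hypothetical.

```python
def frame_number(i):
    """Reading frame (1, 2 or 3) of a codon starting at 0-based index i."""
    return i % 3 + 1


def duplex_span(i, j, seq_len, strand):
    """Return (first bp, last bp), 1-based and inclusive, on the input DNA.

    The span runs from the start codon through the last nt of the stop codon.
    """
    if strand == "Sense":
        return i + 1, j + 3
    # Index p on the reverse complement is index seq_len - 1 - p on the input.
    return seq_len - j - 2, seq_len - i


print(duplex_span(0, 6, 10, "Sense"))
print(duplex_span(1, 7, 10, "Antisense"))
```

Both calls describe a nine-nucleotide ORF in a ten-nucleotide sequence, once per strand, and both map to the same span of the duplex.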

In [None]:

# Step 1: Prepare sequence for analysis
with open('rosalind_orf.txt', 'r') as file:
    infile = file.read().splitlines()
    print(infile)
    dna_id = infile[0].replace('>','')
    dna_seq = ''.join(infile[1:])
    print(dna_seq)
    rna_seq_sense = dna_seq.replace('T','U')
    rna_seq_asense = rna_seq_sense.replace('A','u').replace('U','a').replace('C','g').replace('G','c').upper()[::-1]
    rna_seq = {'Sense' : rna_seq_sense, 'Antisense' : rna_seq_asense}
    dna_dict = {'ID' : [dna_id], 'DNA sequence' : [dna_seq] }
    dna_seq_asense = dna_seq.replace('A','t').replace('T','a').replace('C','g').replace('G','c').upper()[::-1]

# Step 2: Set start and stop codons
start_codon = ['AUG']
stop_codon = ['UAA','UAG','UGA']


# Step 3: Create each data variable as a list
index = []
start_codon_i = []
stop_codon_i = []
strand = []
frame = []
length = []
peptide_size = []
fromhere_dna = []
tohere_dna = []
start_dna = []
stop_dna = []
overlap = []


# Step 4: Fill-up the data lists
# Gather all the data on position, direction and size of putative ORFs
for dir, seq in rna_seq.items():
    for i in range(len(seq)-5): # -5 because I only want to find start codons that have enough subsequent sequence for a stop codon
        if seq[i:i+3] in start_codon: # if you find a start codon...
            for j in range(i+3 ,len(seq)-2, 3):
                if seq[j:j+3] in stop_codon: # ...then find a stop codon, fill-out the variables: 
                    if dir == 'Sense':
                        start_codon_i.append(i)
                        stop_codon_i.append(j)
                        fromhere_dna.append(i+1) # +1 to change to a 1-base count
                        tohere_dna.append(j+3) # +3 to include stop codon bp
                        start_dna.append(i+1)
                        stop_dna.append(j+3)
                        strand.append(dir)
                        if (i+3) % 3 == 0:
                            frame.append(f'{dir} +1')
                        elif (i+2) % 3 == 0:
                            frame.append(f'{dir} +2')
                        elif (i+1) % 3 == 0:
                            frame.append(f'{dir} +3')
                        length.append(j-i)
                        peptide_size.append((j-i)//3) # // is floor division, so the result is an int rather than a float
                        break
                    elif dir == 'Antisense':
                        start_codon_i.append(i)
                        stop_codon_i.append(j)
                        fromhere_dna.append(len(dna_seq)-j-2) # -2 to include whole stop codon
                        tohere_dna.append(len(dna_seq)-i)
                        start_dna.append(len(dna_seq)-i)
                        stop_dna.append(len(dna_seq)-j-2)
                        strand.append(dir)
                        if (i+3) % 3 == 0:
                            frame.append(f'{dir} +1')
                        elif (i+2) % 3 == 0:
                            frame.append(f'{dir} +2')
                        elif (i+1) % 3 == 0:
                            frame.append(f'{dir} +3')
                        length.append(j-i)
                        peptide_size.append((j-i)//3)
                        break

# Assign an individual ID to the putative ORFs, prefixed with the source DNA ID
orf_id = []
for i in range(len(fromhere_dna)):
    orf_id.append(f'{dna_id}_orf_{i+1}')

# Look for overlap between putative ORFs in the DNA duplex
for k in range(len(orf_id)):
    overlaps = []
    for l in range(len(fromhere_dna)):
        if k != l:
            if fromhere_dna[k] < tohere_dna[l] and tohere_dna[k] > fromhere_dna[l]:
                overlaps.append(orf_id[l])
    if overlaps:
        overlap.append(', '.join(overlaps))
    else:
        overlap.append('None')

# Step 5: Assemble data variable lists into a dictionary, then into a data-frame using pandas
orf_data = {
    'ID' : orf_id,
    'First bp' : fromhere_dna, # First bp of putative ORF in DNA duplex 
    'Last bp' : tohere_dna, # Last bp of putative ORF in DNA duplex
    'Frame' : frame, # Strand and coding frame of putative ORF
    'Strand' : strand, # Strand where putative ORF is
    'Length' : length, # Length of putative ORF in nt (from the start codon up to, but excluding, the stop codon)
    'Peptide size (aa)' : peptide_size,
    'Start codon index' : start_codon_i, # Index of start codon of putative ORF in respective strand. SEE NOTE BELOW.
    'Stop codon index' : stop_codon_i, # Index of stop codon of putative ORF in respective strand. SEE NOTE BELOW.
    'Start in DNA' : start_dna, # Position of start codon first nt in DNA duplex
    'Stop in DNA' : stop_dna, # Position of stop codon last nt in DNA duplex
    'Overlap in DNA' : overlap # Whether putative ORF overlaps with any others in the DNA duplex
}

# NOTE:
# The reason why I gathered seemingly redundant positional data:
# 'First bp' and 'Last bp' just indicate that an ORF goes from A to B in the DNA duplex, irrespective of the strand it is encoded in
# 'Start in DNA' and 'Stop in DNA' are similar but retain the ORF's orientation, i.e. if the ORF is encoded in the antisense strand, the 'Stop in DNA' value is lower than the 'Start in DNA' value
# The first set of variables is useful to simply know putative ORFs position in the DNA duplex
# The second set of variables is useful for plotting putative ORFs position along the DNA in their correct orientation

import pandas as pd
orf_df = pd.DataFrame(orf_data).sort_values(by='First bp') # turn dictionary into data-frame
orf_df.to_csv('orf_df.csv') # save in csv file


dna_df = pd.DataFrame(dna_dict)
dna_df.to_csv('inputdna_df.csv')


#### 2.2 Plot data

Maybe at some point I will try to do it in python, but for now I wanted to do it in R.
It would be a good way to warm up my R skills and get some analysis done before I dive into plotting with python.
In the next block, I will try to do that R plot. Remember to change to the R kernel for it to work!

#### 2.3 Create FASTA file

1. Use dataframe to generate putative ORF, mRNA and peptide sequences
2. Write interactive function to generate fasta files with these sequences
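
Keeping the formatting separate from the prompt makes the file writing easy to test; the interactive function can then be a thin wrapper around it. A minimal sketch, assuming records arrive as (ID, sequence) pairs and that sequences should be wrapped at a fixed line width (60 characters is a common convention, not a requirement of the format). `format_fasta` is a hypothetical helper.

```python
def format_fasta(records, width=60):
    """Format (record_id, sequence) pairs as FASTA text, wrapping sequences."""
    lines = []
    for record_id, sequence in records:
        lines.append(f">{record_id}")
        # Slice the sequence into chunks of at most `width` characters.
        lines.extend(sequence[k:k + width] for k in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


fasta_text = format_fasta(
    [("dna_orf_1", "ATGAAATAG"), ("dna_orf_2", "ATGTGA")], width=6
)
print(fasta_text)
```

Writing the result to disk is then a single `open(path, 'w')` and `write(fasta_text)`, whichever sequence type was chosen.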

In [69]:
# Step 1: Set codon table
rna_codon_table = {
    'AUG': 'M',  # Start codon (Methionine)
    'UUU': 'F', 'UUC': 'F',  # Phenylalanine
    'UUA': 'L', 'UUG': 'L',  # Leucine
    'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L',  # Leucine
    'AUU': 'I', 'AUC': 'I', 'AUA': 'I',  # Isoleucine
    'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V',  # Valine
    'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S',  # Serine
    'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',  # Proline
    'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',  # Threonine
    'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',  # Alanine
    'UAU': 'Y', 'UAC': 'Y',  # Tyrosine
    'CAU': 'H', 'CAC': 'H',  # Histidine
    'CAA': 'Q', 'CAG': 'Q',  # Glutamine
    'AAU': 'N', 'AAC': 'N',  # Asparagine
    'AAA': 'K', 'AAG': 'K',  # Lysine
    'GAU': 'D', 'GAC': 'D',  # Aspartic acid
    'GAA': 'E', 'GAG': 'E',  # Glutamic acid
    'UGU': 'C', 'UGC': 'C',  # Cysteine
    'UGG': 'W',  # Tryptophan
    'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',  # Arginine
    'AGU': 'S', 'AGC': 'S',  # Serine
    'AGA': 'R', 'AGG': 'R',  # Arginine
    'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',  # Glycine
    'UAA': 'Stop', 'UAG': 'Stop', 'UGA': 'Stop'  # Stop codons
}

# Step 2: Create lists of putative ORFs, mRNAs and peptides

putative_orfs = []
putative_mrnas = []
putative_peptides = []

for index, row in orf_df.iterrows():
    if row['Strand'] == 'Sense':
        putative_orfs.append(dna_seq[row['Start codon index']:(row['Stop codon index']+3)])
    elif row['Strand'] == 'Antisense':
        putative_orfs.append(dna_seq_asense[row['Start codon index']:(row['Stop codon index']+3)]) # To include the stop codon

for index, row in orf_df.iterrows():
    if row['Strand'] == 'Sense':
        putative_mrnas.append(rna_seq_sense[row['Start codon index']:row['Stop codon index']])
    elif row['Strand'] == 'Antisense':
        putative_mrnas.append(rna_seq_asense[row['Start codon index']:row['Stop codon index']])

for mrna in putative_mrnas:
    peptide = ''
    for codon_start in range(0, len(mrna), 3):
        codon = mrna[codon_start:codon_start+3]
        if codon in rna_codon_table:
            aa = rna_codon_table.get(codon)
            peptide += aa
        else:
            print(f'Unknown or incomplete codon: {codon}')
    putative_peptides.append(peptide)

# Print the unique putative peptides (Rosalind wants it that way)
# unique_putative_peptides = set(putative_peptides)
#for peptide in unique_putative_peptides:
    #print(peptide)

# Step 3: Add sequence lists to the dataframe

orf_df_seq = orf_df.copy() # copy, so that orf_df itself is left unchanged
orf_df_seq['DNA'] = putative_orfs
orf_df_seq['RNA'] = putative_mrnas
orf_df_seq['Peptide'] = putative_peptides

orf_df_seq.to_csv('orf_df_seq.csv') # save in csv file


In [None]:
# Step 4: Create interactive function to generate fasta files with putative ORF sequences (DNA, RNA or peptide)

def fasta_generator(df):
    valid_types = ['DNA', 'RNA', 'PEPTIDE']
    sequence_type = input(f"Which sequence type would you like for the FASTA file? {valid_types}\n").strip().upper()
    if sequence_type not in valid_types:
        print(f"Invalid choice. Please choose from {valid_types}.")
        return
    df = df.rename(columns=str.upper) # upper-case the column names on a copy, leaving the caller's dataframe untouched
    fasta_file = f"{dna_id}_putativeorfs_{sequence_type}.fasta"
    with open(fasta_file, 'w') as f:
        for index, row in df.iterrows():
            f.write(f">{row['ID']}\n")
            f.write(f"{row[sequence_type]}\n")
    print(f"FASTA file '{fasta_file}' with putative {sequence_type} sequences has been created.")

fasta_generator(orf_df_seq)

#### Incorporating Biopython

In [3]:
# Import packages
from Bio import SeqIO
from Bio.Seq import Seq
import pandas as pd


# PART 1: PREPARE DNA INPUT

def fasta_dna_parser(fasta_file):
    """Parse FASTA file with DNA sequences into dataframe with DNA ID and sequence.
    
    Input: path to FASTA file
    Output: pandas dataframe"""

    # Create empty dictionary
    dna_dict = {
        'DNA ID': [],
        'DNA Sequence': []
    }

    # Open and parse FASTA file, then store sequences in dictionary
    with open(fasta_file, 'r') as file:
        for item in SeqIO.parse(file, 'fasta'):
            dna_id = item.id
            dna_seq = str(item.seq)
            dna_dict['DNA ID'].append(dna_id)
            dna_dict['DNA Sequence'].append(dna_seq)
    
    # Return dna sequences in dataframe         
    return pd.DataFrame(dna_dict)


# PART 2: Find putative ORFs on DNA sequences

def orf_finder(dna_df):
    """Find putative ORFs in DNA sequences provided in dataframe,
    and return dataframe with location data for putative ORFs and unique putative ORF ID.
    Note that start and stop codon location is stored as 0-based python index of the first codon nt.
    
    Input: dataframe with DNA ID and sequences    
    Output: dataframe with putative ORF location data"""

    # Set start and stop codons
    start_codon = ['ATG']
    stop_codon = ['TAA', 'TAG', 'TGA'] # DNA alphabet, since the search runs on DNA

    # Create empty dictionary to store locations of putative ORFs
    orf_dict = {
        'Source DNA ID': [],
        'Putative ORF ID': [],
        'Strand': [],
        'Start codon index': [],
        'Stop codon index': []
    }

    # Prepare DNA from dataframe 
    for _, row in dna_df.iterrows():
        dna_id = row['DNA ID']
        dna_seq_sense = row['DNA Sequence']
        dna_seq_asense = str(Seq(dna_seq_sense).reverse_complement())

        # Search for putative ORFs and fill-out dictionary with data of found ORFs
        for dir, seq in {'Sense': dna_seq_sense, 'Antisense': dna_seq_asense}.items():
            for i in range(len(seq) - 5): 
                if seq[i:i+3] in start_codon:
                    for j in range(i+3, len(seq)-2, 3):
                        if seq[j:j+3] in stop_codon:
                            orf_id = f"{dna_id.lower()}_orf_{len(orf_dict['Putative ORF ID'])+1}"
                            orf_dict['Source DNA ID'].append(dna_id)
                            orf_dict['Putative ORF ID'].append(orf_id)
                            orf_dict['Strand'].append(dir)
                            orf_dict['Start codon index'].append(i)
                            orf_dict['Stop codon index'].append(j)
                            break
    
    # Return dataframe with location data of putative ORFs
    return pd.DataFrame(orf_dict)

# PART 3: ASSEMBLE DATAFRAME WITH MORE PUTATIVE ORF DATA

def orf_data_collector(orf_locations_df):
    """Gather more data regarding putative ORFs and store it in more comprehensive dataframe.
    This data can be useful for subsequent analysis and plotting.
    Examples of data to collect:
    Location of first and last bp of putative ORF in the DNA duplex (First bp and Last bp),
    an alternative location metric similar to the aforementioned but retaining orientation of the ORF (Start and Stop),
    length of the mRNA (Length of mRNA), size of the peptide (Peptide size),
    whether putative ORF overlaps with others in the DNA duplex (Overlap in DNA)...

    Input: dataframe with locations (0-based index) of putative ORFs
    Output: comprehensive dataframe with more data on putative ORFs"""
    return
    

# PART 4: GENERATE FASTA FILES WITH PUTATIVE ORFS

def orf_dna_fasta_generator(orf_df):
    """Write DNA FASTA files for putative ORFs found for each input DNA sequence. One file is created per source DNA.
    Each file contains the putative ORF DNA sequences, along with ORF ID, frame, length and overlap information."""
    return

def orf_mrna_fasta_generator(orf_df):
    """Write RNA FASTA files for putative ORFs found for each input DNA sequence. One file is created per source DNA.
    File contains the putative ORF mRNA sequences, along with ORF ID and length"""
    return

def orf_peptide_fasta_generator(orf_df):
    """Write peptide FASTA files for putative ORFs found for each input DNA sequence. One file is created per source DNA.
    File contains the putative ORF peptide sequences, along with ORF ID and size"""
    return



#ros_dna_df = fasta_dna_parser('rosalind_orf.txt')
#ros_orf_locations_df = orf_finder(ros_dna_df)