## restOfORF
Write a function `restOfORF(DNA)` that takes a DNA sequence as its input. It **assumes** that this DNA sequence begins with a start codon "ATG". It then finds the next in-frame stop codon ("TAG", "TAA", or "TGA") and returns the ORF from the start codon up to that stop codon. The returned sequence should include the start codon but not the stop codon. If there is no in-frame stop codon, `restOfORF` should assume that the reading frame extends through the end of the sequence and simply return the entire sequence.

In [5]:
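def restOfORF(DNA):
    # Walk the sequence one codon (3 nucleotides) at a time, starting at the ATG.
    for i in range(0, len(DNA), 3):
        if DNA[i:i+3] in ('TAG', 'TAA', 'TGA'):
            # In-frame stop codon: return everything before it.
            return DNA[:i]
    # No in-frame stop codon: the ORF runs to the end of the sequence.
    return DNA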
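
To make the "in-frame" requirement concrete, here is a minimal sketch that solves the same problem by first cutting the sequence into codons. The names `split_codons` and `rest_of_orf_by_codons` are hypothetical helpers introduced only for this illustration; they are not part of the assignment.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}

def split_codons(dna):
    """Hypothetical helper: cut dna into consecutive 3-letter chunks.

    The last chunk may be shorter than 3 if len(dna) is not a multiple of 3.
    """
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]

def rest_of_orf_by_codons(dna):
    """Sketch: keep codons until the first in-frame stop codon (assumes dna starts with ATG)."""
    kept = []
    for codon in split_codons(dna):
        if codon in STOP_CODONS:
            break
        kept.append(codon)
    return "".join(kept)

print(rest_of_orf_by_codons("ATGAAATAGCCC"))  # ATGAAA  (stops at the in-frame TAG)
print(rest_of_orf_by_codons("ATGCTGA"))       # ATGCTGA (the TGA at index 4 is out of frame)
```

The second call shows why scanning in steps of 3 matters: a stop codon that straddles two codons of the reading frame does not end the ORF.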

## oneFrame
Write a function called `oneFrame(DNA)` that takes a DNA string as input. It searches that string from left to right in multiples of 3 nucleotides--that is, in a single reading frame. When it hits a start codon "ATG" it calls restOfORF on the slice of the string beginning at that codon to get back an ORF. That ORF is added to a list and then the function skips ahead in the DNA string to the point right after the ORF that we just found. This is repeated until we've traversed the entire DNA string.

In [35]:
def oneFrame(DNA):
    # ORFs found in this reading frame, in left-to-right order
    orfs = []
    i = 0
    while i < len(DNA) - 2:
        if DNA[i:i+3] == 'ATG':
            orf = restOfORF(DNA[i:])
            orfs.append(orf)
            # Use the length of the returned ORF to jump to its stop codon
            # (or past the end of DNA if there was no in-frame stop).
            i += len(orf)
        else:
            i += 3
    return orfs
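
The skip-ahead step is easiest to follow with a trace. The sketch below repeats the same scan but also records the index at which each ORF starts. `rest_of_orf` is a restated copy of `restOfORF` so that the snippet runs on its own, and `one_frame_trace` is a hypothetical name used only for this illustration.

```python
STOPS = ("TAG", "TAA", "TGA")

def rest_of_orf(dna):
    """Restated copy of restOfORF so this snippet is self-contained."""
    for i in range(0, len(dna), 3):
        if dna[i:i + 3] in STOPS:
            return dna[:i]
    return dna

def one_frame_trace(dna):
    """Same scan as oneFrame, returning (start_index, orf) pairs."""
    found = []
    i = 0
    while i < len(dna) - 2:
        if dna[i:i + 3] == "ATG":
            orf = rest_of_orf(dna[i:])
            found.append((i, orf))
            i += len(orf)  # lands on the stop codon, or past the end
        else:
            i += 3
    return found

print(one_frame_trace("ATGCCCTAGATGAAATGA"))  # [(0, 'ATGCCC'), (9, 'ATGAAA')]
print(one_frame_trace("ATGATGTAG"))           # [(0, 'ATGATG')]
```

The second call shows the effect of skipping ahead: an "ATG" that lies inside an ORF already found is not reported as a separate ORF.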

## longestORF
Next, you will write a function `longestORF(DNA)` that takes a DNA string and returns the sequence of the longest open reading frame on it, in any of the three possible frames. This function will not consider the reverse complement of DNA.

It shouldn't take much work to write longestORF given that you've already written oneFrame.
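
As a quick illustration of what "three possible frames" means, the sketch below lists the complete codons seen in each frame of a short sequence. `codons_in_frame` is a hypothetical helper written for this example only.

```python
def codons_in_frame(dna, frame):
    """Hypothetical helper: complete codons of dna read from offset frame (0, 1, or 2)."""
    shifted = dna[frame:]
    # Stop at len - 2 so that a trailing partial codon is left out.
    return [shifted[i:i + 3] for i in range(0, len(shifted) - 2, 3)]

dna = "CATGAAATAG"
for frame in range(3):
    print(frame, codons_in_frame(dna, frame))
# 0 ['CAT', 'GAA', 'ATA']
# 1 ['ATG', 'AAA', 'TAG']
# 2 ['TGA', 'AAT']
```

Only frame 1 contains a start codon here, which is why slicing the string with `DNA[1:]` and `DNA[2:]` before scanning is enough to cover the other two frames.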

In [33]:
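def longestORF(DNA):
    # Gather the ORFs found in each of the three reading frames.
    orfs = oneFrame(DNA) + oneFrame(DNA[1:]) + oneFrame(DNA[2:])

    # Keep the first ORF of maximal length; stays '' if no ORF was found,
    # which can happen for sequences without any "ATG".
    longest = ''
    for orf in orfs:
        if len(orf) > len(longest):
            longest = orf
    return longest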

## longestORFBothStrands
We are given a DNA sequence in 5' to 3' order. A gene might appear on this strand or on its reverse complement. Thus, our next function is a very short one called `longestORFBothStrands(DNA)`. This function takes a DNA string as input and finds the longest ORF on that DNA string or its reverse complement. You can use the `longestORF` function you have already written. First ask it for the longest ORF in the given DNA, and then ask it for the longest ORF on its reverse complement (use your `reverseComplement` function to find the reverse complement). The longer of those two is the longest ORF possible (break ties arbitrarily).

In [40]:
from dna import *

def longestORFBothStrands(DNA):
    a = longestORF(DNA)
    b = longestORF(reverseComplement(DNA))
    if(len(a) > len(b)):
        return a
    else:
        return b

In [41]:
longestORFBothStrands('CTATTTCATG')

'ATGAAA'
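
The ORF in this example sits on the reverse strand. The `dna` module is not shown here, so the following is only a sketch of what `reverseComplement` is assumed to do: reverse the string and swap each base for its complement. The name `reverse_complement` is a stand-in used for this illustration.

```python
# Assumed behavior of reverseComplement from the dna module (not shown in this notebook).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna):
    """Sketch: read the strand backwards, replacing each base with its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))

print(reverse_complement("CTATTTCATG"))  # CATGAAATAG
```

Reading `CATGAAATAG` from its second nucleotide gives the codons ATG, AAA, TAG, so the ORF is `ATGAAA`. The given strand only contains a bare `ATG` at its very end, which is shorter.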

## longestORFNoncoding
To assess whether long ORFs are genes, a researcher would ask the question, "is this sequence length indicative of a coding region or would I expect to see sequences this long in garbage?" By "garbage", of course, we mean just random sequences of nucleotides.

We'll test this by generating a bunch of "garbage" sequences of the same length as our test DNA sequence, and measuring the maximum ORF length in each. Then, we'll ask the following question. Is the very longest ORF among these still shorter than some ORFs we observe in our real DNA? If the real DNA ORFs are significantly longer than what we see in the garbage sequence, that is a very strong indicator that we did in fact find genes in our original DNA!

Write a function `longestORFNoncoding(DNA, numReps)` that makes `numReps` garbage sequences, finds the very longest ORF in all of these, and returns its length. Note: this function returns a number rather than a DNA string.

OK, so now it's time to generate garbage sequences. We could generate totally random strings of the same length as our DNA string, but that might not be a very accurate test since our DNA string might have more nucleotides of one type and fewer of another. To be fair, our garbage strings should have the same nucleotides but just reordered or "shuffled" randomly.

To do that, you'll first need to take your DNA string and turn it into a list of its constituent symbols. There is a built-in function called list that takes as input a string and returns the list of symbols in that string.
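
The shuffle step on its own can be sketched as follows. `shuffled_copy` is a hypothetical helper name, and `str.join` is used here to glue the symbols back into a string; treat this as one possible way to do it rather than the required solution.

```python
import random
from collections import Counter

def shuffled_copy(dna, rng=random):
    """Sketch: return dna with the same nucleotides in a random order."""
    symbols = list(dna)      # e.g. "ATG" -> ['A', 'T', 'G']
    rng.shuffle(symbols)     # shuffles the list in place
    return "".join(symbols)  # glue the symbols back into one string

original = "ATGAAACCCGGGTTTTAG"
garbage = shuffled_copy(original, random.Random(42))  # seeded only to make the demo repeatable

# The order changes, but the nucleotide counts are preserved.
print(Counter(garbage) == Counter(original))  # True
```

Because the shuffled string has exactly the same composition as the original, any difference in ORF length comes from the ordering of the nucleotides, not from how many of each there are.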

In [None]:
import random

def collapse(L):
    output = ""
    for s in L:
        output = output + s
    return output


def longestORFNoncoding(DNA, numReps):
    dnaList = list(DNA)
    maxLen = 0
    for i in range(numReps):
        random.shuffle(dnaList)
        DNA = collapse(dnaList)
        l = longestORFBothStrands(DNA)