In [1]:
sequence = "ATCCATGACTTGACTATGACGTTCAGCATCTGAAGACTTATGAAAATGATCTCATGGATCGACT"

`               ↑↑↑---↑↑↑  ↑↑↑------------↑↑↑      ↑↑↑----↑↑↑    ↑↑↑--------`

If you double-click this cell in Jupyter, you can see that the arrows line up with where we inserted ORFs for testing purposes.

Please note that the third "ORF" is not a **valid** ORF (hence the quotation marks): there are 4 nucleotides between the start and the stop codon, so the stop codon isn't in the same reading frame and our algorithm should NOT pick it up!

Also, the last start codon does **not** have a matching stop codon, so it shouldn't be found, either.
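If you'd like to double-check the two claims above, here's a small sketch. The index pairs in `marked` are 0-based positions counted by hand from the test string, and the names `marked`, `STOP_CODONS` and `last` are made up for this example.

```python
sequence = "ATCCATGACTTGACTATGACGTTCAGCATCTGAAGACTTATGAAAATGATCTCATGGATCGACT"
STOP_CODONS = ("TAG", "TGA", "TAA")

# (start, stop) index pairs of the first three marked start/stop codons
marked = [(4, 10), (15, 30), (39, 46)]

for start, stop in marked:
    # Sanity check: the indices really point at a start and a stop codon
    assert sequence[start:start + 3] == "ATG"
    assert sequence[stop:stop + 3] in STOP_CODONS
    # The stop codon is in frame only if the distance is a multiple of 3
    print(start, stop, (stop - start) % 3 == 0)
# 4 10 True
# 15 30 True
# 39 46 False

# Codons read in frame from the last start codon (index 53)
last = [sequence[i:i + 3] for i in range(53, len(sequence) - 2, 3)]
print(last)  # ['ATG', 'GAT', 'CGA'] -> no stop codon among them
```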

Of course there are more things to test and consider -- for example an ORF inside another ORF. We encourage you to think about what might happen (step through the algorithm by hand to really try to understand what's happening) and finally verify your assumptions with an experiment.

Finally, it's quite possible that there are still bugs in here that we haven't caught. Please feel free to point these out to us :)!

In [2]:
def find_atg(sequence, position):
    """Return the position of the first encountered start
    codon in ``sequence``, starting from ``position``"""
    for start_pos in range(position, len(sequence)):
        if sequence[start_pos:(start_pos + 3)] == "ATG":
            return start_pos

Please note that there are definitely things we could still do to improve this approach! Here are some potentially interesting questions:

- Do we need to traverse the *entire* string, or might it be sensible to stop a couple of letters before the end?
- If we're taking the range up to `len(sequence)` but are then accessing `sequence[start_pos:(start_pos + 3)]`, why is it okay to even go beyond what should be the maximum index of our sequence?
- What if we didn't *just* want to look for "ATG" as a start codon? Could you think of ways to make this more flexible?
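Once you've thought about these questions yourself, here's one possible direction, purely as a sketch. The function name `find_start` and its `start_codons` parameter are made up for this example, and "GTG" is only used as an illustrative second start codon.

```python
def find_start(sequence, position, start_codons=("ATG",)):
    """Return the index of the first codon from ``start_codons``
    at or after ``position``, or None if there is none"""
    # A full codon can begin at index len(sequence) - 3 at the latest
    for start_pos in range(position, len(sequence) - 2):
        if sequence[start_pos:start_pos + 3] in start_codons:
            return start_pos
    return None

# Slicing past the end doesn't raise an error, it just returns a shorter string
print("ATG"[1:10])  # 'TG'

print(find_start("CCGTGAAATGA", 0))                               # 7
print(find_start("CCGTGAAATGA", 0, start_codons=("ATG", "GTG")))  # 2
```

Because a slice that runs past the end is simply shorter than three letters, it can never equal a codon, which is why the original loop bound works even though it looks too generous.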

Also, you may have noticed the text in triple quotation marks (`"""Hello!"""`) right below the function definition. These are called "docstrings", and I recommend reading up on them :)! They're there for documentation purposes.
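As a quick illustration of why docstrings are handy: Python stores them on the function itself, so you can read them back later. The function `codon_at` is made up for this example.

```python
def codon_at(sequence, position):
    """Return the three letters of ``sequence`` starting at ``position``"""
    return sequence[position:position + 3]

# The docstring is available as the function's __doc__ attribute;
# help(codon_at) shows the same text together with the signature
print(codon_at.__doc__)
```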

In [3]:
def find_stop(sequence, position):
    """Return an ORF's sequence by finding the first stop codon
    to appear after a start codon found at ``position``"""
    for stop_pos in range(position, len(sequence), 3):
        if sequence[stop_pos:(stop_pos + 3)] in ("TAG", "TGA", "TAA"):
            return sequence[position:(stop_pos + 3)]

Just like with the previous function definition, think about how you might implement this differently, or even how to improve on this solution.

Points to note:

- The usage of the `in` keyword, which is a *very* handy tool to have in your belt. If you've not encountered it before, I'd strongly recommend reading up on it!
- Also note that we're passing an unassuming `3` to the `range` function as well! What does this accomplish? (If you need a hint: think about something mentioned in the description of the "test case" string at the top!)
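If you want to experiment with both points, here's a small sketch using the first ORF of our test string (start codon at index 4). The names `positions` and `codons` are made up for this example.

```python
sequence = "ATCCATGACTTGACTATGACGTTCAGCATCTGAAGACTTATGAAAATGATCTCATGGATCGACT"

# Stepping by 3 from the start codon visits one codon at a time
positions = list(range(4, 13, 3))
codons = [sequence[p:p + 3] for p in positions]
print(positions)  # [4, 7, 10]
print(codons)     # ['ATG', 'ACT', 'TGA']

# `in` checks membership; it works on tuples, lists, sets, strings and dicts
print(codons[-1] in ("TAG", "TGA", "TAA"))  # True
print("ACT" in ("TAG", "TGA", "TAA"))       # False
```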

In [4]:
# Begin with initialising your variables
orfs = []
position = 0

# Loop through the sequence, i.e. step through every character
while position < len(sequence):
    
    # Find first (if any) start codon
    start = find_atg(sequence, position)
    
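    # If there's no start codon left in the rest of the sequence, just exit the loop.
    # Compare against None explicitly: a start codon at index 0 is a valid
    # result, but `not 0` is True
    if start is None:
        break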
    
    # Try to find a sequence by looking for stop codons after
    # the start codon we just determined
    orf = find_stop(sequence, start)
    
    # If we found one, add it to our list of ORFs, including where it started
    if orf:
        orfs.append((orf, start))
        
    # Finally, continue the search just after the start codon we found
    position = start + 1

Step through this by hand if you're uncertain about what's going on. Whenever a function is called, go to its definition and apply it step by step, too.

Things to note:

- Variable initialisation
- The use of a `while` loop – could/should we have used something different? What else needs to be taken care of? (Hint: why are we performing `position = start + 1` at the end?)
- `break` – another keyword to check up on (see https://docs.python.org/3/library/keyword.html)

And as always: how could we do better/differently? Do you see any issues?
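For comparison, here's one possible way to package the whole search into a single function. This is only a sketch, not the approach used in the cells above: the name `find_orfs` and the `STOP_CODONS` constant are made up for this example, and it relies on `str.find`, which returns `-1` (rather than `None`) when nothing is found.

```python
STOP_CODONS = ("TAG", "TGA", "TAA")

def find_orfs(sequence):
    """Return a list of (orf, start) tuples, one for every start
    codon that is followed by an in-frame stop codon"""
    orfs = []
    start = sequence.find("ATG")
    while start != -1:
        # Read codon by codon, beginning right after the start codon
        for stop_pos in range(start + 3, len(sequence) - 2, 3):
            if sequence[stop_pos:stop_pos + 3] in STOP_CODONS:
                orfs.append((sequence[start:stop_pos + 3], start))
                break
        # Look for the next start codon one character further on
        start = sequence.find("ATG", start + 1)
    return orfs

sequence = "ATCCATGACTTGACTATGACGTTCAGCATCTGAAGACTTATGAAAATGATCTCATGGATCGACT"
print(find_orfs(sequence))
# [('ATGACTTGA', 4), ('ATGACGTTCAGCATCTGA', 15)]

# An ORF nested inside another ORF, with the outer one starting at index 0
print(find_orfs("ATGAAAATGCCCTAA"))
# [('ATGAAAATGCCCTAA', 0), ('ATGCCCTAA', 6)]
```

Note that in this sketch a nested ORF is reported as its own entry, sharing the stop codon of the outer one. Whether that's what you want depends on the question you're asking.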

In [5]:
# Let's check our results thus far:
print(orfs)

[('ATGACTTGA', 4), ('ATGACGTTCAGCATCTGA', 15)]


In [6]:
def find_longest_orf(orfs):
    """Return the longest ORF in our list of ORFs"""
    
    # Initialise variable with the first ORF we found
    longest = orfs[0]
    
    # Step through each remaining element in the list and see
    # if any of these ORFs is longer than the first one
    for orf in orfs[1:]:
        if len(orf[0]) > len(longest[0]):
            longest = orf
    return longest

There's an issue in this! Can you spot it? (Hint: what do you think happens if the sequence we're checking doesn't have any ORFs in it? If you can't see it, try it out and figure out what's going wrong!)
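If you'd like to compare notes after trying it yourself: one possible way to make this more robust is the built-in `max` with a `key` function and a `default` value. This is just a sketch; the name `find_longest_orf_safe` is made up for this example, and returning `None` for an empty list is an assumption about what we'd want to happen.

```python
def find_longest_orf_safe(orfs):
    """Return the longest (orf, start) tuple, or None if ``orfs`` is empty"""
    # `key` makes max() compare by ORF length;
    # `default` is what max() returns for an empty list
    return max(orfs, key=lambda orf: len(orf[0]), default=None)

orfs = [('ATGACTTGA', 4), ('ATGACGTTCAGCATCTGA', 15)]
print(find_longest_orf_safe(orfs))  # ('ATGACGTTCAGCATCTGA', 15)
print(find_longest_orf_safe([]))    # None
```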

In [7]:
# Print our result:
print(find_longest_orf(orfs))

('ATGACGTTCAGCATCTGA', 15)


And done :)!