In [1]:
import os

# Exercise 4.1: Pathogenicity islands

For this and the next problem, we will work with real data from the Salmonella enterica genome. The section of the genome we will work with is in the file `~/git/bootcamp/data/salmonella_spi1_region.fna`. I cut it out of the full genome. It contains Salmonella pathogenicity island I (SPI1), which contains genes for surface receptors for host-pathogen interactions.

Pathogenicity islands are often marked by different GC content than the rest of the genome. We will try to locate the pathogenicity island(s) in our section of the Salmonella genome by computing GC content.

a) Use principles of TDD to write a function that divides a sequence into blocks and computes the GC content for each block, returning a tuple. The function signature should look like

    gc_blocks(seq, block_size)
    
To be clear, if `seq = 'ATGACTACGT'` and `block_size = 4`, the blocks to be considered are
    
    ATGA
    CTAC
    
and the function should return `(0.25, 0.5)`. Note that the blocks are non-overlapping and that we do not bother with the end of the sequence that does not fit completely in a block.

In [2]:
import gc_blocks

gc_blocks.gc_blocks('ATGACTACGT', 4)

(0.25, 0.5)
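
The `gc_blocks` module imported above is not shown in this notebook. A minimal sketch of what it might contain, with a few pytest-style tests in the spirit of part (a), is below. The helper `gc_content()`, the test function names, and the choice to raise a `ValueError` for a non-positive block size are assumptions, not part of the exercise statement.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    return (seq.count('G') + seq.count('C')) / len(seq)


def gc_blocks(seq, block_size):
    """GC content of consecutive, non-overlapping blocks of seq, as a tuple.

    Trailing bases that do not fill a complete block are ignored.
    """
    if block_size < 1:
        # Assumption: the exercise does not say how to handle this case
        raise ValueError('block_size must be a positive integer.')

    n_blocks = len(seq) // block_size
    return tuple(
        gc_content(seq[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    )


# Tests in pytest style (hypothetical names); under TDD these are written first
def test_example_from_problem_statement():
    assert gc_blocks('ATGACTACGT', 4) == (0.25, 0.5)


def test_sequence_shorter_than_block():
    assert gc_blocks('ATG', 4) == ()


def test_lowercase_input():
    assert gc_blocks('atgactacgt', 4) == (0.25, 0.5)
```

Saving the functions in `gc_blocks.py` and the tests in a separate file would let `pytest` pick the tests up automatically.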

b) Write a function that takes as input a sequence, block size, and a threshold GC content, and returns the original sequence where every base in a block with GC content above the threshold is capitalized and every base in a block with GC content below the threshold is lowercase. You would call the function like this:

    mapped_seq = gc_map(seq, block_size, gc_thresh)
    
For example,

    gc_map('ATGACTACGT', 4, 0.4)
    
returns `'atgaCTAC'`.

In [3]:
import gc_map

gc_map.gc_map('ATGACTACGT', 4, 0.4)

'atgaCTAC'
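
The `gc_map` module is likewise not shown here. The following is a minimal sketch consistent with the example above. The problem statement does not say what happens to a block whose GC content equals the threshold exactly; treating it as high GC is an assumption.

```python
def gc_map(seq, block_size, gc_thresh):
    """Return seq with each complete block upper- or lowercased by GC content.

    Trailing bases that do not fill a complete block are dropped.
    """
    mapped = []

    # Step through the start index of every complete block
    for i in range(0, len(seq) - block_size + 1, block_size):
        block = seq[i:i + block_size].upper()
        gc = (block.count('G') + block.count('C')) / block_size

        # Assumption: a block exactly at the threshold counts as high GC
        mapped.append(block if gc >= gc_thresh else block.lower())

    return ''.join(mapped)
```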

c) Use the `gc_map()` function to generate a GC content map for the Salmonella sequence with `block_size = 1000` and `gc_thresh = 0.45`. Where do you think the pathogenicity island is?

In [4]:
def read_FASTA(filename):
    """Reads in a sequence from a FASTA file that contains a single sequence.
    Returns the descriptor string and a string containing the sequence."""

    with open(filename, 'r') as f:

        # Initialize the sequence
        seq = ''
        
        for line in f:
            # Get the descriptor string, which is the first line of a FASTA file and starts with '>'
            if line.startswith('>'):
                description = line.rstrip()
            # Every other line is part of the sequence
            else:
                seq += line.rstrip()
    
    return description, seq

filename = 'data/salmonella_spi1_region.fna'
descriptor, seq = read_FASTA(filename)

In [5]:
seq_analyzed = gc_map.gc_map(seq, 1000, 0.45)

seq_analyzed

'AAAACCTTAGTAACTGGACTGCTGGGATTTTTCAGCCTGGATACGCTGGTAGATCTCTTCACGATGGACAGAAACTTCTTTCGGGGCGTTCACGCCAATACGCACCTGGTTGCCCTTCACCCCTAAAACTGTCACGGTGACCTCATCGCCAATCATGAGGGTCTCACCAACTCGACGAGTCAGAATCAGCATTCTTTGCTCCTTGAAAGATTAAAAGAGTCGGGTCTCTCTGTATCCCGGCATTATCCATCATATAACGCCAAAAAGTAAGCGATGACAAACACCTTAGGTGTAAGCAGTCATGGCATTACATTCTGTTAAACCTAAGTTTAGCCGATATACAAAACTTCAACCTGACTTTATCGTTGTCGATAGCGTTGACGTAAACGCCGCAGCACGGGCTGCGGCGCCAACGAACGCTTATAATTATTGCAATTTTGCGCTGACCCAGCCTTGTACACTGGCTAACGCTGCAGGCAGAGCTGCCGCATCCGTACCACCGGCTTGCGCCATGTCCGGACGACCGCCACCCTTACCGCCCACCTGCTGAGCGACCATCCCAATCAATTCCCCTGCTTTAACCCGGTCGGTCACATCCTTCGACACGCCCGCAATCAGAGAAACCTTACCTTCAACAACCGTTGCCAGTACGATAACGGTAGACCCCAGTTGATTTTTCAGATCATCAACCATGGTTCGCAGCATTTTCGGCTCAATACCAGCAAGCTCGCTAACCAGGAGCTTCACGCCGTTGAGATCGACCGCTTTACTGGAAAGATTTGCACTCTCCTGCGCTGCGGCCTGGTCCTTCAACTGCTGCAACTCTTTTTCCAGCTGACGTGTACGTTCCAGCACGGCACGCACTTTGTCGCCCAGATTCTGGCTGTCGCCCTTCAGCAGATGCGCAATATCGTTTAAGCGATCGCTTTGCGCATGAACGGTGGCCATAGCGCCTTCACCGGTTACCGCCTCAATACGACGAATGCCCGCTGCGGTGCC
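
The mapped sequence is too long to inspect by eye. One way to summarize it is to list the coordinates of each run of lowercase bases, that is, each stretch of consecutive blocks below the threshold. This is a sketch only: the function name `low_gc_regions()` and its `min_length` parameter are hypothetical, and reading long low-GC runs as candidate islands is an assumption.

```python
import re


def low_gc_regions(mapped_seq, min_length=1):
    """Return (start, end) index pairs of runs of lowercase bases.

    Indices follow Python slicing: mapped_seq[start:end] is the run.
    Runs shorter than min_length are skipped.
    """
    regions = []
    for match in re.finditer(r'[a-z]+', mapped_seq):
        if match.end() - match.start() >= min_length:
            regions.append((match.start(), match.end()))
    return regions
```

Calling `low_gc_regions(seq_analyzed, min_length=5000)` would report only stretches of at least five consecutive 1000-base blocks below the threshold.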

d) Write the GC-mapped sequence (with upper and lower characters) to a new FASTA file. Use the same description line (which began with a `>` in the original FASTA file), and have line breaks every 60 characters in the sequence.

In [6]:
newfilename = 'data/salmonella_spi1_region_analyzed.fna'

if os.path.isfile(newfilename):
    raise RuntimeError('File ' + newfilename + ' already exists.')
    
with open(newfilename, 'w') as f:
    f.write(descriptor+'\n')
    
    # Initialize counter
    i = 0
    
    while i < len(seq_analyzed):
        f.write(seq_analyzed[i:i+60]+'\n')
        i += 60
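
The same steps can be wrapped in a reusable function. This is a sketch: the name `write_fasta()` and the `line_width` and `overwrite` parameters are hypothetical, not part of the exercise.

```python
import os


def write_fasta(descriptor, seq, filename, line_width=60, overwrite=False):
    """Write one sequence to a FASTA file with fixed-width sequence lines."""
    # Refuse to clobber an existing file unless explicitly allowed
    if os.path.isfile(filename) and not overwrite:
        raise RuntimeError('File ' + filename + ' already exists.')

    with open(filename, 'w') as f:
        f.write(descriptor + '\n')

        # range() with a step yields the start index of each output line
        for i in range(0, len(seq), line_width):
            f.write(seq[i:i + line_width] + '\n')
```

Slicing preserves case, so passing the GC-mapped sequence keeps its upper- and lowercase blocks in the output file.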

In [7]:
!head 'data/salmonella_spi1_region_analyzed.fna'

>gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000
AAAACCTTAGTAACTGGACTGCTGGGATTTTTCAGCCTGGATACGCTGGTAGATCTCTTC
ACGATGGACAGAAACTTCTTTCGGGGCGTTCACGCCAATACGCACCTGGTTGCCCTTCAC
CCCTAAAACTGTCACGGTGACCTCATCGCCAATCATGAGGGTCTCACCAACTCGACGAGT
CAGAATCAGCATTCTTTGCTCCTTGAAAGATTAAAAGAGTCGGGTCTCTCTGTATCCCGG
CATTATCCATCATATAACGCCAAAAAGTAAGCGATGACAAACACCTTAGGTGTAAGCAGT
CATGGCATTACATTCTGTTAAACCTAAGTTTAGCCGATATACAAAACTTCAACCTGACTT
TATCGTTGTCGATAGCGTTGACGTAAACGCCGCAGCACGGGCTGCGGCGCCAACGAACGC
TTATAATTATTGCAATTTTGCGCTGACCCAGCCTTGTACACTGGCTAACGCTGCAGGCAG
AGCTGCCGCATCCGTACCACCGGCTTGCGCCATGTCCGGACGACCGCCACCCTTACCGCC


In [8]:
!tail 'data/salmonella_spi1_region_analyzed.fna'

ACGCATTTCTCCCGTGCAGGTCACATTTGCCCGACACGGCGGGGCAAGAGGCTTGAACAG
ACGTTCATTTTCCGTAAAACTGGCGTAATGTAAGCGTTTACCCACTATAGGTATTATCAT
GGCGACCATAAAAGATGTAGCCCGACTGGCCGGTGTTTCAGTCGCCACCGTTTCTCGCGT
TATTAACGATTCGCCAAAAGCCAGCGAAGCGTCCCGGCTGGCGGTAACCAGCGCAATGGA
GTCCCTGAGCTATCACCCTAACGCCAACGCGCGCGCGCTGGCACAGCAGGCAACGGAAAC
CCTCGGTCTGGTGGTCGGCGACGTTTCCGATCCTTTTTTCGGCGCGATGGTGAAAGCCGT
TGAACAGGTGGCGTATCACACCGGCAATTTTTTACTGATTGGCAACGGGTATCATAACGA
ACAAAAAGAGCGTCAGGCTATTGAACAGTTGATTCGTCATCGTTGCGCAGCGTTAGTGGT
GCACGCCAAAATGATTCCGGATGCGGACCTGGCCTCATTAATGAAGCAAATCCCCGGCAT
GGTGCTGATTAACCGCATTT


# Exercise 4.2: ORF detection

a) Write a function, `longest_orf()`, that takes a DNA sequence as input and finds the longest open reading frame (ORF) in the sequence (we will not consider reverse complements). A sequence fragment constitutes an ORF if the following are all true.

1. It begins with `ATG`.

1. It ends with any of `TGA`, `TAG`, or `TAA`.

1. The total number of bases is a multiple of 3.

Note that the sequence `ATG` may appear in the middle of an ORF.

*Hint: The statement for this problem is a bit ambiguous as it is written. What other specification might you need for this function?*

---
I think an ORF should not have a stop codon in the middle of it.
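
With that extra rule, the full specification can be written down as a small checker, which is handy for testing whatever `longest_orf()` returns. The name `is_orf()` is hypothetical; this is a sketch of the three stated conditions plus the no-internal-stop rule.

```python
def is_orf(fragment):
    """True if fragment is an ORF: starts with ATG, ends with a stop codon,
    has a length that is a multiple of 3, and has no in-frame stop codon
    before the last codon."""
    start = 'ATG'
    stops = ('TAA', 'TGA', 'TAG')

    fragment = fragment.upper()

    # Need at least a start and a stop codon, and whole codons only
    if len(fragment) < 6 or len(fragment) % 3 != 0:
        return False

    codons = [fragment[i:i + 3] for i in range(0, len(fragment), 3)]

    return (
        codons[0] == start
        and codons[-1] in stops
        and not any(codon in stops for codon in codons[1:-1])
    )
```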


---
- Check if all the characters are allowed

- Find the start codons
- Find the first stop codon that is in frame with each start codon
- Find the greatest difference between a start codon and stop codon

---

In [9]:
def longest_orf(seq):
    """
    Finds the longest open reading frame (ORF) in a sequence.
    The reverse complement is not considered.
    An ORF may have a start codon in the middle, but not a stop codon.
    """
    
    start = 'ATG'
    stops = ('TAA', 'TGA', 'TAG')
    
    # Convert to uppercase
    seq = seq.upper()
    
    # Initialize list of indices of start codon
    start_inds = []
    
    # Initialize list of ORF boundary index pairs
    pairs = []
    
    for i, base in enumerate(seq):
        
        # Check that all bases are valid
        if base not in ('A', 'T', 'C', 'G'):
            return base + ' is not a valid nucleotide.'
        
        # Find all the start codons
        if seq[i:i + 3] == start:
            start_inds.append(i)
    
    # Find the first stop codon that is in frame with each start codon
    for i in start_inds:
        # Initialize counter
        n = 1
        while i + 3 * n < len(seq):
            j = i + 3 * n
            if seq[j:j + 3] in stops:
                pairs.append((i, j+3))
                break
            n += 1
    
    # Find the longest ORF (greatest difference between ORF boundary index pairs)
    # Initialize list of differences
    diffs = []
    for pair in pairs:
        diffs.append(pair[1] - pair[0])
        
    longest_pair_ind = diffs.index(max(diffs))
    i, j = pairs[longest_pair_ind]
    
    return seq[i:j]

In [10]:
seq = 'GGATGATGATGTAAAAC'
longest_orf(seq)

'ATGATGATGTAA'

b) Use your function to find the longest ORF from the section of the Salmonella genome we are investigating.

In [11]:
filename = 'data/salmonella_spi1_region.fna'
_, seq = read_FASTA(filename)

longest_orf(seq)

'ATGACCAACTACAGCCTGCGCGCACGCATGATGATTCTGATCCTGGCCCCGACCGTCCTGATAGGTTTGCTGCTCAGTATCTTTTTTGTAGTGCACCGCTATAACGACCTGCAGCGTCAACTGGAAGATGCCGGCGCCAGTATTATTGAACCGCTCGCCGTCTCCAGCGAATATGGTATGAACTTACAAAACCGGGAGTCTATCGGCCAACTTATCAGCGTCCTGCACCGCAGACACTCTGATATTGTGCGGGCGATTTCCGTTTATGACGATCATAACCGTCTGTTTGTAACGTCTAATTTCCATCTGGACCCCTCACAGATGCAGCTTCCCGCCGGAGCGCCGTTTCCACGTCGTCTGAGCGTTGATCGCCACGGCGATATTATGATTCTGCGCACCCCAATTATCTCGGAGAGCTATTCGCCGGACGAGTCAGCGATTGCTGACGCGAAAAATACCAAAAATATGCTGGGGTATGTGGCGCTTGAACTGGATCTCAAGTCGGTCAGGCTACAGCAATACAAAGAGATTTTTATCTCCAGCGTGATGATGCTTTTTTGTATTGGCATTGCGCTGATCTTTGGCTGGCGGCTTATGCGCGATGTCACCGGGCCTATCCGTAATATGGTGAATACCGTTGACCGCATTCGCCGCGGACAACTGGATAGCCGGGTGGAAGGATTTATGCTGGGCGAACTGGATATGCTGAAAAACGGCATTAATTCCATGGCGATGTCGCTTGCCGCCTATCACGAAGAGATGCAGCATAATATCGATCAGGCCACTTCGGACCTGCGTGAAACCCTTGAGCAGATGGAAATCCAAAACGTTGAGCTGGATCTGGCGAAAAAGCGTGCCCAGGAAGCGGCGCGTATTAAGTCGGAGTTCCTGGCGAACATGTCGCACGAACTGCGAACGCCGCTGAACGGCGTCATTGGCTTTACCCGCCTGACATTAAAAACGGAGCTGAATCCCACCCAGCGCGACCATCTGAACACC

In [12]:
%timeit longest_orf(seq)

121 ms ± 3.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [13]:
def longest_orf(seq):
    """
    Finds the longest open reading frame (ORF) in a sequence.
    The reverse complement is not considered.
    An ORF may have a start codon in the middle, but not a stop codon.
    """
    
    start = 'ATG'
    stops = ('TAA', 'TGA', 'TAG')
    
    # Convert to uppercase
    seq = seq.upper()
    
    # Initialize list of indices of start codon
    start_inds = []
    
    # Initialize list of ORF boundary index pairs
    pairs = []
    
    for i, base in enumerate(seq):
        
        # Check that all bases are valid
        if base not in ('A', 'T', 'C', 'G'):
            return base + ' is not a valid nucleotide.'
        
        # Find all the start codons
        if seq[i:i + 3] == start:
            start_inds.append(i)
    
    # Find the first stop codon that is in frame with each start codon
    for i in start_inds:
        # Initialize counter
        j = 0
        while j < len(seq):
            if j > i and (j - i) % 3 == 0 and seq[j:j + 3] in stops:
                pairs.append((i, j+3))
                break
            j += 1
    
    # Find the longest ORF (greatest difference between ORF boundary index pairs)
    # Initialize list of differences
    diffs = []
    
    for pair in pairs:
        diffs.append(pair[1] - pair[0])
        
    longest_pair_ind = diffs.index(max(diffs))
    i, j = pairs[longest_pair_ind]
    
    return seq[i:j]

In [14]:
filename = 'data/salmonella_spi1_region.fna'
_, seq = read_FASTA(filename)

%time longest_orf(seq)

CPU times: user 33.6 s, sys: 370 ms, total: 33.9 s
Wall time: 34.9 s


'ATGACCAACTACAGCCTGCGCGCACGCATGATGATTCTGATCCTGGCCCCGACCGTCCTGATAGGTTTGCTGCTCAGTATCTTTTTTGTAGTGCACCGCTATAACGACCTGCAGCGTCAACTGGAAGATGCCGGCGCCAGTATTATTGAACCGCTCGCCGTCTCCAGCGAATATGGTATGAACTTACAAAACCGGGAGTCTATCGGCCAACTTATCAGCGTCCTGCACCGCAGACACTCTGATATTGTGCGGGCGATTTCCGTTTATGACGATCATAACCGTCTGTTTGTAACGTCTAATTTCCATCTGGACCCCTCACAGATGCAGCTTCCCGCCGGAGCGCCGTTTCCACGTCGTCTGAGCGTTGATCGCCACGGCGATATTATGATTCTGCGCACCCCAATTATCTCGGAGAGCTATTCGCCGGACGAGTCAGCGATTGCTGACGCGAAAAATACCAAAAATATGCTGGGGTATGTGGCGCTTGAACTGGATCTCAAGTCGGTCAGGCTACAGCAATACAAAGAGATTTTTATCTCCAGCGTGATGATGCTTTTTTGTATTGGCATTGCGCTGATCTTTGGCTGGCGGCTTATGCGCGATGTCACCGGGCCTATCCGTAATATGGTGAATACCGTTGACCGCATTCGCCGCGGACAACTGGATAGCCGGGTGGAAGGATTTATGCTGGGCGAACTGGATATGCTGAAAAACGGCATTAATTCCATGGCGATGTCGCTTGCCGCCTATCACGAAGAGATGCAGCATAATATCGATCAGGCCACTTCGGACCTGCGTGAAACCCTTGAGCAGATGGAAATCCAAAACGTTGAGCTGGATCTGGCGAAAAAGCGTGCCCAGGAAGCGGCGCGTATTAAGTCGGAGTTCCTGGCGAACATGTCGCACGAACTGCGAACGCCGCTGAACGGCGTCATTGGCTTTACCCGCCTGACATTAAAAACGGAGCTGAATCCCACCCAGCGCGACCATCTGAACACC

In [15]:
def longest_orf(seq):
    """
    Finds the longest open reading frame (ORF) in a sequence.
    The reverse complement is not considered.
    An ORF may have a start codon in the middle, but not a stop codon.
    """
    
    start = 'ATG'
    stops = ('TAA', 'TGA', 'TAG')
    
    # Convert to uppercase
    seq = seq.upper()
    
    # Initialize list of indices of start codon
    start_inds = []
    
    # Initialize list of ORF boundary index pairs
    pairs = []
    
    # Initialize list of boundary index pair differences
    diffs = []
    
    for i, base in enumerate(seq):
        
        # Check that all bases are valid
        if base not in ('A', 'T', 'C', 'G'):
            return base + ' is not a valid nucleotide.'
        
        # Find all the start codons
        if seq[i:i + 3] == start:
            start_inds.append(i)
    
        
            # Find the first stop codon after each start codon
            
            # Initialize counter
            n = 1
            while i + 3 * n < len(seq):
                j = i + 3 * n
                if seq[j:j + 3] in stops:
                    pairs.append((i, j+3))
                    diffs.append(j+3 - i)
                    break
                n += 1
            
    longest_pair_ind = diffs.index(max(diffs))
    i, j = pairs[longest_pair_ind]
    
    return seq[i:j]

In [16]:
filename = 'data/salmonella_spi1_region.fna'
_, seq = read_FASTA(filename)

orf = longest_orf(seq)
print(orf)
print('Length: ', len(orf))

ATGACCAACTACAGCCTGCGCGCACGCATGATGATTCTGATCCTGGCCCCGACCGTCCTGATAGGTTTGCTGCTCAGTATCTTTTTTGTAGTGCACCGCTATAACGACCTGCAGCGTCAACTGGAAGATGCCGGCGCCAGTATTATTGAACCGCTCGCCGTCTCCAGCGAATATGGTATGAACTTACAAAACCGGGAGTCTATCGGCCAACTTATCAGCGTCCTGCACCGCAGACACTCTGATATTGTGCGGGCGATTTCCGTTTATGACGATCATAACCGTCTGTTTGTAACGTCTAATTTCCATCTGGACCCCTCACAGATGCAGCTTCCCGCCGGAGCGCCGTTTCCACGTCGTCTGAGCGTTGATCGCCACGGCGATATTATGATTCTGCGCACCCCAATTATCTCGGAGAGCTATTCGCCGGACGAGTCAGCGATTGCTGACGCGAAAAATACCAAAAATATGCTGGGGTATGTGGCGCTTGAACTGGATCTCAAGTCGGTCAGGCTACAGCAATACAAAGAGATTTTTATCTCCAGCGTGATGATGCTTTTTTGTATTGGCATTGCGCTGATCTTTGGCTGGCGGCTTATGCGCGATGTCACCGGGCCTATCCGTAATATGGTGAATACCGTTGACCGCATTCGCCGCGGACAACTGGATAGCCGGGTGGAAGGATTTATGCTGGGCGAACTGGATATGCTGAAAAACGGCATTAATTCCATGGCGATGTCGCTTGCCGCCTATCACGAAGAGATGCAGCATAATATCGATCAGGCCACTTCGGACCTGCGTGAAACCCTTGAGCAGATGGAAATCCAAAACGTTGAGCTGGATCTGGCGAAAAAGCGTGCCCAGGAAGCGGCGCGTATTAAGTCGGAGTTCCTGGCGAACATGTCGCACGAACTGCGAACGCCGCTGAACGGCGTCATTGGCTTTACCCGCCTGACATTAAAAACGGAGCTGAATCCCACCCAGCGCGACCATCTGAACACCA

In [17]:
%timeit longest_orf(seq)

121 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
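
The versions above restart a scan for a stop codon at every start codon. An alternative is to walk each of the three reading frames once, remembering the first `ATG` seen since the last in-frame stop codon, so each base is visited a fixed number of times. The sketch below is not one of the timed versions: the name `longest_orf_linear()` is hypothetical, it skips the nucleotide validity check, it returns an empty string when no ORF exists, and it may pick a different ORF when several share the maximum length.

```python
def longest_orf_linear(seq):
    """Longest ORF in seq (forward strand only), scanning each frame once."""
    stops = ('TAA', 'TGA', 'TAG')
    seq = seq.upper()
    best = ''

    for frame in range(3):
        # Index of the first ATG since the last in-frame stop, if any
        start = None

        for j in range(frame, len(seq) - 2, 3):
            codon = seq[j:j + 3]

            if start is None:
                if codon == 'ATG':
                    start = j
            elif codon in stops:
                # Close the ORF and keep it if it is the longest so far
                if j + 3 - start > len(best):
                    best = seq[start:j + 3]
                start = None

    return best
```

Because the earliest in-frame `ATG` before a given stop codon always gives the longest ORF ending at that stop, tracking only that one start per stop is enough to find the overall maximum.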


c) Write a function that converts a DNA sequence into a protein sequence. You can of course use the `bootcamp_utils` module.

In [18]:
import bootcamp_utils

In [19]:
def DNA_to_protein(orf):
    """Converts a DNA sequence into a protein sequence."""
    
    i = 0
    aas = []
    
    while i < len(orf):
        aas.append(bootcamp_utils.codons[orf[i:i+3]])
        i += 3
        
    aa_seq = ''.join(aas)    
    
    return aa_seq
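
The function above relies on `bootcamp_utils.codons`, which is not shown in this notebook and is assumed here to be a dictionary mapping codon strings to one-letter amino acid codes. A self-contained sketch that builds the standard genetic code itself is below. The names `codon_table` and `translate()` are hypothetical, and ignoring trailing bases that do not form a complete codon is an assumption.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position
bases = 'TCAG'
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

codon_table = {
    b1 + b2 + b3: amino_acids[16 * i + 4 * j + k]
    for i, b1 in enumerate(bases)
    for j, b2 in enumerate(bases)
    for k, b3 in enumerate(bases)
}


def translate(dna):
    """Translate a DNA sequence into one-letter amino acid codes.

    Stop codons are written as '*'. Trailing bases that do not form a
    complete codon are ignored (an assumption).
    """
    dna = dna.upper()
    n_codons = len(dna) // 3
    return ''.join(codon_table[dna[3 * i:3 * i + 3]] for i in range(n_codons))
```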

d) Translate the longest ORF you generated in part (b) into a protein sequence and perform a BLAST search. Search for the protein sequence (a blastp query). What gene is it?

In [20]:
filename = 'data/salmonella_spi1_region.fna'
_, seq = read_FASTA(filename)

orf = longest_orf(seq)

DNA_to_protein(orf)

'MTNYSLRARMMILILAPTVLIGLLLSIFFVVHRYNDLQRQLEDAGASIIEPLAVSSEYGMNLQNRESIGQLISVLHRRHSDIVRAISVYDDHNRLFVTSNFHLDPSQMQLPAGAPFPRRLSVDRHGDIMILRTPIISESYSPDESAIADAKNTKNMLGYVALELDLKSVRLQQYKEIFISSVMMLFCIGIALIFGWRLMRDVTGPIRNMVNTVDRIRRGQLDSRVEGFMLGELDMLKNGINSMAMSLAAYHEEMQHNIDQATSDLRETLEQMEIQNVELDLAKKRAQEAARIKSEFLANMSHELRTPLNGVIGFTRLTLKTELNPTQRDHLNTIERSANNLLAIINDVLDFSKLEAGKLILESIPFPLRNTLDEVVTLLAHSSHDKGLELTLNIKNDVPDNVIGDPLRLQQVITNLVGNAIKFTESGNIDILVEKRALSNTKVQIEVQIRDTGIGIPERDQSRLFQAFRQADASISRRHGGTGLGLVITQKLVNEMGGDISFHSQPNRGSTFWFHINLDLNPNVIIDGPSTACLAGKRLAYVEPNATAAQCTLDLLSDTPVEVVYSPTFSALPLAHYDIMILSVPVTFREPLTMQHERLAKAASMTDFLLLALPCHAQINAEKLKQGGAAACLLKPLTSTRLLPALTEYCQLNHHPEPLLMDTSKITMTVMAVDDNPANLKLIGALLEDKVQHVELCDSGHQAVDRAKQMQFDLILMDIQMPDMDGIRACELIHQLPHQQQTPVIAVTAHAMAGQKEKLLSAGMNDYLAKPIEEEKLHNLLLRYKPGANVAARLMAPEPAEFIFNPNATLDWQLALRQAAGKPDLARDMLQMLIDFLPEVRNKIEEQLVGENPNGLVDLVHKLHGSCGYSGVPRMKNLCQLIEQQLRSGVHEEELEPEFLELLDEMDNVAREAKKILG*'

Using blastp, this protein seems to be: two-component sensor histidine kinase BarA.

e) [Bonus challenge] Modify your function to return the `n` longest ORFs. Compute the five longest ORFs for the Salmonella genome section we are working with. Perform BLAST searches on them. What are they?

In [64]:
def n_longest_orfs(seq, n):
    """
    Finds the n longest open reading frames (ORFs) in a sequence, 
    where each ORF is not contained within any other ORF.
    The reverse complement is not considered.
    An ORF may have a start codon in the middle, but not a stop codon.
    """
    
    start = 'ATG'
    stops = ('TAA', 'TGA', 'TAG')
    
    # Convert to uppercase
    seq = seq.upper()
    
    # Initialize list of indices of start codon
    start_inds = []
    
    # Initialize list of ORF boundary index pairs
    pairs = []
    
    # Initialize list of boundary index pair differences
    diffs = []
    
    for i, base in enumerate(seq):
        
        # Check that all bases are valid
        if base not in ('A', 'T', 'C', 'G'):
            return base + ' is not a valid nucleotide.'
        
        # Find all the start codons
        if seq[i:i + 3] == start:
            start_inds.append(i)
    
        
            # Find the first stop codon after each start codon
            
            # Initialize counter
            k = 1
            while i + 3 * k < len(seq):
                j = i + 3 * k
                if seq[j:j + 3] in stops:
                    pairs.append((i, j+3))
                    diffs.append(j+3 - i)
                    break
                k += 1
    
    # Find longest ORF
    longest_pair_ind = diffs.index(max(diffs))
    i, j = pairs[longest_pair_ind]
    orf = seq[i:j]
    
    # Find n longest ORFs
    list_of_orfs = [orf]
    
    # Stop early if the candidates run out before n ORFs are found
    while len(list_of_orfs) < n and diffs:
        longest_pair_ind = diffs.index(max(diffs))
        
        j, k = pairs[longest_pair_ind]
        
        # Skip a candidate that is contained in an ORF already in the list
        contained = any(seq[j:k] in orf for orf in list_of_orfs)
                
        if not contained:
            list_of_orfs.append(seq[j:k])

        # Remove the candidate so the next-longest one is considered
        del pairs[longest_pair_ind]
        del diffs[longest_pair_ind]
    
    return tuple(list_of_orfs)

In [65]:
filename = 'data/salmonella_spi1_region.fna'
_, seq = read_FASTA(filename)

five_longest_orfs = n_longest_orfs(seq, 5)
five_longest_orfs

('ATGACCAACTACAGCCTGCGCGCACGCATGATGATTCTGATCCTGGCCCCGACCGTCCTGATAGGTTTGCTGCTCAGTATCTTTTTTGTAGTGCACCGCTATAACGACCTGCAGCGTCAACTGGAAGATGCCGGCGCCAGTATTATTGAACCGCTCGCCGTCTCCAGCGAATATGGTATGAACTTACAAAACCGGGAGTCTATCGGCCAACTTATCAGCGTCCTGCACCGCAGACACTCTGATATTGTGCGGGCGATTTCCGTTTATGACGATCATAACCGTCTGTTTGTAACGTCTAATTTCCATCTGGACCCCTCACAGATGCAGCTTCCCGCCGGAGCGCCGTTTCCACGTCGTCTGAGCGTTGATCGCCACGGCGATATTATGATTCTGCGCACCCCAATTATCTCGGAGAGCTATTCGCCGGACGAGTCAGCGATTGCTGACGCGAAAAATACCAAAAATATGCTGGGGTATGTGGCGCTTGAACTGGATCTCAAGTCGGTCAGGCTACAGCAATACAAAGAGATTTTTATCTCCAGCGTGATGATGCTTTTTTGTATTGGCATTGCGCTGATCTTTGGCTGGCGGCTTATGCGCGATGTCACCGGGCCTATCCGTAATATGGTGAATACCGTTGACCGCATTCGCCGCGGACAACTGGATAGCCGGGTGGAAGGATTTATGCTGGGCGAACTGGATATGCTGAAAAACGGCATTAATTCCATGGCGATGTCGCTTGCCGCCTATCACGAAGAGATGCAGCATAATATCGATCAGGCCACTTCGGACCTGCGTGAAACCCTTGAGCAGATGGAAATCCAAAACGTTGAGCTGGATCTGGCGAAAAAGCGTGCCCAGGAAGCGGCGCGTATTAAGTCGGAGTTCCTGGCGAACATGTCGCACGAACTGCGAACGCCGCTGAACGGCGTCATTGGCTTTACCCGCCTGACATTAAAAACGGAGCTGAATCCCACCCAGCGCGACCATCTGAACAC
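
Repeatedly calling `max()` and removing the winner rescans the candidate list on every pass. An alternative sketch is to collect one candidate per stop codon (the longest ORF ending there), sort the candidates once by length, and then apply the same containment filter. The name `n_longest_orfs_sorted()` is hypothetical, the nucleotide validity check is omitted, and ties may be ordered differently than in the function above.

```python
def n_longest_orfs_sorted(seq, n):
    """Up to n longest ORFs in seq (forward strand only), longest first.

    An ORF that is a substring of an already chosen ORF is skipped.
    """
    stops = ('TAA', 'TGA', 'TAG')
    seq = seq.upper()
    candidates = []

    # One pass per reading frame; keep the longest ORF ending at each stop
    for frame in range(3):
        start = None
        for j in range(frame, len(seq) - 2, 3):
            codon = seq[j:j + 3]
            if start is None:
                if codon == 'ATG':
                    start = j
            elif codon in stops:
                candidates.append(seq[start:j + 3])
                start = None

    # Sort once, longest first
    candidates.sort(key=len, reverse=True)

    chosen = []
    for orf in candidates:
        if len(chosen) == n:
            break
        if not any(orf in longer for longer in chosen):
            chosen.append(orf)

    return tuple(chosen)
```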

In [66]:
longest_orfs_aas = []

# Loop over the ORFs directly; this avoids overwriting the variable seq
for orf in five_longest_orfs:
    longest_orfs_aas.append(DNA_to_protein(orf))

longest_orfs_aas

['MTNYSLRARMMILILAPTVLIGLLLSIFFVVHRYNDLQRQLEDAGASIIEPLAVSSEYGMNLQNRESIGQLISVLHRRHSDIVRAISVYDDHNRLFVTSNFHLDPSQMQLPAGAPFPRRLSVDRHGDIMILRTPIISESYSPDESAIADAKNTKNMLGYVALELDLKSVRLQQYKEIFISSVMMLFCIGIALIFGWRLMRDVTGPIRNMVNTVDRIRRGQLDSRVEGFMLGELDMLKNGINSMAMSLAAYHEEMQHNIDQATSDLRETLEQMEIQNVELDLAKKRAQEAARIKSEFLANMSHELRTPLNGVIGFTRLTLKTELNPTQRDHLNTIERSANNLLAIINDVLDFSKLEAGKLILESIPFPLRNTLDEVVTLLAHSSHDKGLELTLNIKNDVPDNVIGDPLRLQQVITNLVGNAIKFTESGNIDILVEKRALSNTKVQIEVQIRDTGIGIPERDQSRLFQAFRQADASISRRHGGTGLGLVITQKLVNEMGGDISFHSQPNRGSTFWFHINLDLNPNVIIDGPSTACLAGKRLAYVEPNATAAQCTLDLLSDTPVEVVYSPTFSALPLAHYDIMILSVPVTFREPLTMQHERLAKAASMTDFLLLALPCHAQINAEKLKQGGAAACLLKPLTSTRLLPALTEYCQLNHHPEPLLMDTSKITMTVMAVDDNPANLKLIGALLEDKVQHVELCDSGHQAVDRAKQMQFDLILMDIQMPDMDGIRACELIHQLPHQQQTPVIAVTAHAMAGQKEKLLSAGMNDYLAKPIEEEKLHNLLLRYKPGANVAARLMAPEPAEFIFNPNATLDWQLALRQAAGKPDLARDMLQMLIDFLPEVRNKIEEQLVGENPNGLVDLVHKLHGSCGYSGVPRMKNLCQLIEQQLRSGVHEEELEPEFLELLDEMDNVAREAKKILG*',
 'MNESFDKDFSNHTPMMQQYLKLKAQHPEILLFYRMGDFYELFYDDAKRASQLLDISLTKRGASAGEPIPMAGIP

In order, these are (using blastp):

1) two-component sensor histidine kinase BarA

2) DNA mismatch repair protein MutS

3) formate hydrogenlyase transcriptional activator FhlA

4) L-fucose isomerase

5) transcriptional regulator HilA
    

In [23]:
%reload_ext watermark
%watermark -v -p bootcamp_utils,pytest,os,jupyterlab

CPython 3.7.7
IPython 7.13.0

bootcamp_utils 0.0.5
pytest 5.4.2
os unknown
jupyterlab 1.2.6
