# Exercise 3.1: Parsing a FASTA file

a) Use command line tools to investigate the FASTA file located at `~/git/bootcamp/data/salmonella_spi1_region.fna`. This file contains a portion of the *Salmonella* genome (described in Exercise 4.1).

You will notice that the first line begins with a `>`, signifying that the line contains information about the sequence. The remainder of the lines are the sequence itself.

In [38]:
!head ~/git/bootcamp/data/salmonella_spi1_region.fna

>gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000
AAAACCTTAGTAACTGGACTGCTGGGATTTTTCAGCCTGGATACGCTGGTAGATCTCTTC
ACGATGGACAGAAACTTCTTTCGGGGCGTTCACGCCAATACGCACCTGGTTGCCCTTCAC
CCCTAAAACTGTCACGGTGACCTCATCGCCAATCATGAGGGTCTCACCAACTCGACGAGT
CAGAATCAGCATTCTTTGCTCCTTGAAAGATTAAAAGAGTCGGGTCTCTCTGTATCCCGG
CATTATCCATCATATAACGCCAAAAAGTAAGCGATGACAAACACCTTAGGTGTAAGCAGT
CATGGCATTACATTCTGTTAAACCTAAGTTTAGCCGATATACAAAACTTCAACCTGACTT
TATCGTTGTCGATAGCGTTGACGTAAACGCCGCAGCACGGGCTGCGGCGCCAACGAACGC
TTATAATTATTGCAATTTTGCGCTGACCCAGCCTTGTACACTGGCTAACGCTGCAGGCAG
AGCTGCCGCATCCGTACCACCGGCTTGCGCCATGTCCGGACGACCGCCACCCTTACCGCC


In [39]:
!tail ~/git/bootcamp/data/salmonella_spi1_region.fna

ACGCATTTCTCCCGTGCAGGTCACATTTGCCCGACACGGCGGGGCAAGAGGCTTGAACAG
ACGTTCATTTTCCGTAAAACTGGCGTAATGTAAGCGTTTACCCACTATAGGTATTATCAT
GGCGACCATAAAAGATGTAGCCCGACTGGCCGGTGTTTCAGTCGCCACCGTTTCTCGCGT
TATTAACGATTCGCCAAAAGCCAGCGAAGCGTCCCGGCTGGCGGTAACCAGCGCAATGGA
GTCCCTGAGCTATCACCCTAACGCCAACGCGCGCGCGCTGGCACAGCAGGCAACGGAAAC
CCTCGGTCTGGTGGTCGGCGACGTTTCCGATCCTTTTTTCGGCGCGATGGTGAAAGCCGT
TGAACAGGTGGCGTATCACACCGGCAATTTTTTACTGATTGGCAACGGGTATCATAACGA
ACAAAAAGAGCGTCAGGCTATTGAACAGTTGATTCGTCATCGTTGCGCAGCGTTAGTGGT
GCACGCCAAAATGATTCCGGATGCGGACCTGGCCTCATTAATGAAGCAAATCCCCGGCAT
GGTGCTGATTAACCGCATTT


b) The format of the *Salmonella* SPI1 region FASTA file is common for such files (though FASTA files often contain multiple sequences). Use the file I/O skills you have learned to write a function that reads a sequence from a FASTA file containing a single sequence (whose first line may begin with `>`). Your function should take the name of the FASTA file as input and return two strings: first, the descriptor string (which starts with `>`), and second, the sequence as a single string with no gaps or line breaks.

In [3]:
def read_FASTA(filename):
    """Reads in a sequence from a FASTA file that contains a single sequence.
    Returns the descriptor string and a string containing the sequence."""

    # Initialize the descriptor (stays empty if the file has no '>' line)
    # and the sequence
    description = ''
    seq = ''

    with open(filename, 'r') as f:
        for line in f:
            if line.startswith('>'):
                # The descriptor is the line that starts with '>'
                description = line.rstrip()
            else:
                # Every other line is sequence; strip the trailing newline
                seq += line.rstrip()

    return description, seq

In [41]:
filename = 'data/salmonella_spi1_region.fna'

read_FASTA(filename)

('>gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000',
 'AAAACCTTAGTAACTGGACTGCTGGGATTTTTCAGCCTGGATACGCTGGTAGATCTCTTCACGATGGACAGAAACTTCTTTCGGGGCGTTCACGCCAATACGCACCTGGTTGCCCTTCACCCCTAAAACTGTCACGGTGACCTCATCGCCAATCATGAGGGTCTCACCAACTCGACGAGTCAGAATCAGCATTCTTTGCTCCTTGAAAGATTAAAAGAGTCGGGTCTCTCTGTATCCCGGCATTATCCATCATATAACGCCAAAAAGTAAGCGATGACAAACACCTTAGGTGTAAGCAGTCATGGCATTACATTCTGTTAAACCTAAGTTTAGCCGATATACAAAACTTCAACCTGACTTTATCGTTGTCGATAGCGTTGACGTAAACGCCGCAGCACGGGCTGCGGCGCCAACGAACGCTTATAATTATTGCAATTTTGCGCTGACCCAGCCTTGTACACTGGCTAACGCTGCAGGCAGAGCTGCCGCATCCGTACCACCGGCTTGCGCCATGTCCGGACGACCGCCACCCTTACCGCCCACCTGCTGAGCGACCATCCCAATCAATTCCCCTGCTTTAACCCGGTCGGTCACATCCTTCGACACGCCCGCAATCAGAGAAACCTTACCTTCAACAACCGTTGCCAGTACGATAACGGTAGACCCCAGTTGATTTTTCAGATCATCAACCATGGTTCGCAGCATTTTCGGCTCAATACCAGCAAGCTCGCTAACCAGGAGCTTCACGCCGTTGAGATCGACCGCTTTACTGGAAAGATTTGCACTCTCCTGCGCTGCGGCCTGGTCCTTCAACTGCTGCAACTCTTTTTCCAGCTGACGTGTACGTTCCAGCACGGCACGCACTTTG
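
Part (b) mentions that FASTA files often contain multiple sequences. The following is a minimal sketch of how the same line-by-line idea might extend to that case. The name `read_fasta_records` and the toy records are hypothetical and not part of the exercise; the sketch assumes every record starts with a `>` line and skips anything before the first one.

```python
def read_fasta_records(lines):
    """Yield (descriptor, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration; lines that appear before the
    first '>' descriptor are skipped in this sketch.
    """
    descriptor = None
    chunks = []
    for line in lines:
        line = line.rstrip()
        if line.startswith('>'):
            # A new descriptor closes the previous record, if there was one
            if descriptor is not None:
                yield descriptor, ''.join(chunks)
            descriptor = line
            chunks = []
        elif line and descriptor is not None:
            # Non-empty line inside a record: part of the sequence
            chunks.append(line)
    # Emit the final record
    if descriptor is not None:
        yield descriptor, ''.join(chunks)


# Toy input made up for this example
example = """>seq1 first toy record
ACGT
TTGA
>seq2 second toy record
GGCC
""".splitlines()

records = list(read_fasta_records(example))
```

Because an open file is itself an iterable of lines, the same generator could be given a file handle, for example `with open(filename) as f: records = list(read_fasta_records(f))`.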

# Exercise 3.2: Restriction enzyme cut sites

a) New England Biolabs (NEB) sells purified DNA of the genome of λ-phage, a bacteriophage that infects *E. coli*. You can download the FASTA file containing the sequence. Use the function you wrote in Exercise 3.1 to extract the sequence.

In [42]:
lambda_genome = read_FASTA('lambdafsa.fasta')

In [14]:
# Stylistic note from class: you can get seq from the function like this (unpacking the tuple).
_, seq = read_FASTA('lambdafsa.fasta')

b) Write a function with call signature

```restriction_sites(seq, recog_seq)```

that takes as arguments a sequence and the recognition sequence of a restriction enzyme and returns the indices of the first base of each of the restriction sites in the sequence. Use this function to find the indices of the restriction sites of λ-DNA for HindIII, EcoRI, and KpnI. Compare your results to those reported on the New England Biolabs datasheet.

In [43]:
def restriction_sites(seq, recog_seq):
    """Finds the recognition sites for a restriction enzyme in a sequence.
    Returns the indices of the first base of each restriction site, with 0-indexing."""
    
    seq_copy = seq
    
    # Initialize list of restriction sites
    sites = []
    
    while recog_seq in seq_copy:
        # Find the last remaining site and record where it starts
        ind = seq_copy.rfind(recog_seq)
        sites.append(ind)
        # Drop only the last base of that site so overlapping sites are still found
        seq_copy = seq_copy[:ind + len(recog_seq) - 1]
        
    # List indices in ascending order
    sites.sort()    
        
    return tuple(sites)

In [12]:
# From class:

def restriction_sites(seq, recog_seq):
    """Finds the recognition sites for a restriction enzyme in a sequence.
    Returns the indices of the first base of each restriction site, with 0-indexing."""
    
    # Initialize list of restriction sites
    sites = []
    
    length = len(recog_seq)
    # Only start positions where a full-length match fits need checking
    for i in range(len(seq) - length + 1):
        if seq[i:i+length] == recog_seq:
            sites.append(i)
            
    return sites
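
As a point of comparison with the two versions above, here is a minimal sketch of a third approach that uses the optional start argument of `str.find` to step through the sequence. The name `restriction_sites_find` and the toy sequence are hypothetical and chosen only for illustration.

```python
def restriction_sites_find(seq, recog_seq):
    """Return the 0-based start index of every occurrence of recog_seq in seq,
    including overlapping occurrences. Hypothetical name, for illustration."""
    sites = []
    ind = seq.find(recog_seq)
    while ind != -1:
        sites.append(ind)
        # Resume the search one base past the last hit so overlapping sites are kept
        ind = seq.find(recog_seq, ind + 1)
    return sites


# Toy sequence made up for this example: EcoRI site, HindIII site, EcoRI site
toy_seq = 'GAATTCAAGCTTGAATTC'
ecori_sites = restriction_sites_find(toy_seq, 'GAATTC')
hindiii_sites = restriction_sites_find(toy_seq, 'AAGCTT')
```

Since `str.find` returns `-1` when there are no further matches, the loop ends without a separate membership test, and the indices come out in ascending order without sorting.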

In [15]:
recog_seqs = dict(HindIII = 'AAGCTT', EcoRI = 'GAATTC', KpnI = 'GGTACC')

In [16]:
restriction_sites(seq, recog_seqs['HindIII'])

[23129, 25156, 27478, 36894, 37458, 37583, 44140]

In [17]:
restriction_sites(seq, recog_seqs['EcoRI'])

[21225, 26103, 31746, 39167, 44971]

In [18]:
restriction_sites(seq, recog_seqs['KpnI'])

[17052, 18555]

Note that our indexing is 1 less than NEB's indexing, because our function uses 0-indexing. Also, NEB seems to have missed one of the HindIII restriction sites, 37583 (which would be 37584 in their indexing).
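
To make the comparison with the datasheet easier, the 0-indexed results can be shifted by one. This is a small sketch that assumes, as the note above suggests, that NEB counts the first base as position 1; the list is copied from the HindIII output above, and the variable names are hypothetical.

```python
# HindIII sites from the output above, 0-indexed
hindiii_sites = [23129, 25156, 27478, 36894, 37458, 37583, 44140]

# Shift to 1-based coordinates (assumption: the datasheet counts from 1)
hindiii_sites_1based = [i + 1 for i in hindiii_sites]

print(hindiii_sites_1based)
# [23130, 25157, 27479, 36895, 37459, 37584, 44141]
```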

In [49]:
%load_ext watermark
%watermark -v -p jupyterlab

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
CPython 3.7.7
IPython 7.13.0

jupyterlab 1.2.6
