# Day 6 notebook

The objective of this notebook is to practice the following concepts:

* de Bruijn assembly
* sequencing errors
* repeats

This notebook is intended to be solved by hand. You are welcome to use code if that helps you. You are strongly encouraged to work with your group members to understand and solve each problem.

## PROBLEM 1: Minimum overlap lengths and read spectra (1 POINT)
Recall that the de Bruijn approach to genome assembly with shotgun sequencing reads is to approximate the $k$-mer spectrum of the genome by the union of the $k$-mer spectra of the reads and then to use an Eulerian path approach to compute the assembly. In the absence of sequencing errors, this approach may be successful if the read spectrum is equal to the genome spectrum.
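To make the comparison between the read spectrum and the genome spectrum concrete, here is a minimal sketch. The helper names `spectrum` and `read_spectrum`, the toy genome, and the toy reads are all made up for illustration and are not part of the assignment.

```python
def spectrum(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def read_spectrum(reads, k):
    """Return the union of the k-mer spectra of all reads."""
    result = set()
    for read in reads:
        result |= spectrum(read, k)
    return result


# Hypothetical toy data: three error-free reads that tile the genome.
genome = "ATGGCGTGCA"
reads = ["ATGGCG", "GCGTGC", "GTGCA"]

# For a small k the read spectrum matches the genome spectrum.
small_k_matches = read_spectrum(reads, 3) == spectrum(genome, 3)

# For a larger k, the k-mer GGCGT spans a read boundary and is missed.
missing_at_5 = spectrum(genome, 5) - read_spectrum(reads, 5)
```

In this toy case `small_k_matches` is `True`, while `missing_at_5` contains the single 5-mer that no read covers in full.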

On the other hand, the fragment assembly approach for the same shotgun sequencing data is to find overlaps between pairs of reads and then to compute a superstring (ideally the shortest such superstring). In practice, one typically considers only overlaps between pairs of reads that are at least some minimum length, `min_overlap`. Thus, in the absence of sequencing errors, this approach may be successful if the reads cover every position of the genome *and* each pair of adjacent reads (in terms of their position along the genome) overlaps by at least `min_overlap` bases.

In this problem, we will explore the relationship between the `min_overlap` parameter for fragment assembly and the value of $k$ for the de Bruijn approach.

Suppose that a set of error-free shotgun sequencing reads satisfies the requirements for fragment assembly to be successful with a minimum overlap length of `min_overlap` and that all reads are longer than twice the value of `min_overlap`.  Write a function, `largest_k`, that given `min_overlap` as input, outputs the largest value of $k$ such that we are guaranteed that the read spectrum is equal to the genome spectrum (such that the de Bruijn approach may be successful).

*Hint: consider $k$-mers that come from a region of the genome where two reads overlap by exactly `min_overlap` bases.*

In [4]:
def largest_k(min_overlap):
    """Returns the largest value of k for which the read spectrum = genome spectrum
       when all reads overlap by at least min_overlap
    Args:
        min_overlap: an integer specifying the minimum overlap length between adjacent reads
    Returns:
        k, an integer
    """    
    ###
    ### YOUR CODE HERE
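    # When two adjacent reads overlap by exactly min_overlap bases, every
    # genome k-mer fits inside a single read only if k <= min_overlap + 1.
    k = min_overlap + 1
    return k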


###
### Written reasoning for your function
###


In [5]:
assert isinstance(largest_k(20), int), "return value should be an integer"
###
### AUTOGRADER TEST - DO NOT REMOVE
###
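The answer to Problem 1 can be sanity-checked by brute force on read positions alone. The sketch below is not part of the assignment: the helper names (`tiled_intervals`, `windows_covered`, `largest_guaranteed_k`) are hypothetical, and it assumes the worst case in which every pair of adjacent reads overlaps by exactly `min_overlap` bases and all reads have the same length.

```python
def tiled_intervals(read_len, min_overlap, n_reads):
    """Half-open (start, end) read intervals where adjacent reads
    overlap by exactly min_overlap bases (assumed worst case)."""
    step = read_len - min_overlap
    return [(i * step, i * step + read_len) for i in range(n_reads)]


def windows_covered(intervals, genome_len, k):
    """True if every length-k window of the genome lies inside some read."""
    return all(
        any(start <= s and s + k <= end for start, end in intervals)
        for s in range(genome_len - k + 1)
    )


def largest_guaranteed_k(min_overlap, read_len, n_reads=4):
    """Largest k for which every genome k-mer is contained in a read."""
    intervals = tiled_intervals(read_len, min_overlap, n_reads)
    genome_len = intervals[-1][1]
    best = 0
    for k in range(1, read_len + 1):
        if windows_covered(intervals, genome_len, k):
            best = k
    return best
```

Comparing `largest_guaranteed_k(m, 2 * m + 1)` with `largest_k(m)` for a few small values of `m` is one way to check the function above.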


## PROBLEM 2: Sequencing errors and $k$-mers (1 POINT)
Sequencing errors in reads result in the spectra for those reads having potentially false $k$-mers (i.e., $k$-mers that are not in the genome spectrum). Write a function `minimum_errors` that takes as input $k$ and the length, $L$, of a read (the argument `l` in the code below) and returns the *minimum* number of base substitution errors in the read such that the read may contain *only* false $k$-mers.

In [16]:
def minimum_errors(k, l):
    """Returns the minimum number of base substitution errors in a read
    such that the read may contain only false k-mers.
    Args:
        k: the length of substrings in the spectrum, an integer
        l: the length of a read, an integer
    Returns:
        number of base substitution errors, an integer
    """  
    
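    ### YOUR CODE HERE
    # The read contains l // k disjoint k-mers, each needing its own error;
    # one error at the last base of each of them hits every k-mer.
    return l // k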


In [17]:
assert isinstance(minimum_errors(13, 100), int), "minimum_errors should return an integer value"
###
### AUTOGRADER TEST - DO NOT REMOVE
###
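The answer to Problem 2 can be checked exhaustively for small inputs. The sketch below is not part of the assignment, and the name `brute_force_minimum_errors` is hypothetical. It treats a $k$-mer as false whenever it contains at least one substituted position, which is the best case the word "may" in the problem statement allows.

```python
from itertools import combinations


def brute_force_minimum_errors(k, read_len):
    """Smallest number of error positions such that every length-k
    window of a read of length read_len contains at least one error."""
    window_starts = range(read_len - k + 1)
    for n_errors in range(read_len + 1):
        for positions in combinations(range(read_len), n_errors):
            if all(
                any(s <= p < s + k for p in positions)
                for s in window_starts
            ):
                return n_errors
    return read_len  # unreachable: marking every position always works
```

The search is exponential in the read length, so it is only practical for short reads; comparing it with `minimum_errors(k, l)` for read lengths up to about 10 runs quickly.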


## PROBLEM 3: Repeats (1 POINT)
Suppose you are given the following set of reads:

    AAACT
    AAAGG
    AAAGT
    AAATT
    CCAAA
    CTAAA
    GGAAA
    TTAAA

How many shortest superstrings are there for this set of reads?

For autograding purposes, answer this question below by assigning the appropriate value to the variable `num_shortest_superstrings`.

In [12]:
###
# CCAAA must come first and AAAGT last; the CT, GG and TT pairs can be ordered in 3! ways.
num_shortest_superstrings = 6
###


In [13]:
# autograding tests for num_shortest_superstrings
assert isinstance(num_shortest_superstrings, int), "num_shortest_superstrings should be assigned an integer value"
###
### AUTOGRADER TEST - DO NOT REMOVE
###
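Because there are only eight reads, the count in Problem 3 can be checked by enumerating all $8!$ orderings. The sketch below is not part of the assignment: the helper names `overlap` and `shortest_superstrings` are hypothetical, and it assumes that no read is a substring of another (true here, since the reads are distinct and of equal length), so every shortest superstring comes from merging the reads in some order with maximal overlaps.

```python
from itertools import permutations

READS = ["AAACT", "AAAGG", "AAAGT", "AAATT",
         "CCAAA", "CTAAA", "GGAAA", "TTAAA"]


def overlap(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def shortest_superstrings(reads):
    """Set of distinct shortest superstrings, found by trying every order."""
    best_len = None
    best = set()
    for order in permutations(reads):
        merged = order[0]
        for prev, nxt in zip(order, order[1:]):
            # Append only the part of nxt not already covered by prev.
            merged += nxt[overlap(prev, nxt):]
        if best_len is None or len(merged) < best_len:
            best_len, best = len(merged), {merged}
        elif len(merged) == best_len:
            best.add(merged)
    return best
```

Calling `len(shortest_superstrings(READS))` gives the number of distinct shortest superstrings, which can be compared with `num_shortest_superstrings`.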
