# BMI/CS 576 HW1
The objectives of this homework are to practice

* working with the basic algorithms for sequence assembly
* reasoning about graphs and paths for the sequence assembly task

## HW policies
Before starting this homework, please read over the [homework policies](https://canvas.wisc.edu/courses/374201/pages/hw-policies) for this course.  In particular, note that homeworks are to be completed *individually* and plagiarism from any source (with the one exception noted below) will be considered **academic misconduct**.

You are welcome to use any code from the weekly notebooks (including the official solutions) in your solutions to the HW.

## PROBLEM 1: A "sometimes greedy" algorithm for fragment assembly (50 points)
Write a function, `sometimes_greedy_assemble`, that takes as input a list of read strings and uses a modified version of the greedy fragment assembly algorithm to assemble them into a single superstring.  The starting point is the greedy algorithm described for fragment assembly (Page 9 of the [Sequence Assembly - Graphs and fragment assembly](https://canvas.wisc.edu/courses/374201/pages/day-4-online-lecture-sequence-assembly-graphs-and-fragment-assembly) lecture).  In the modified version, the algorithm does not always add the next (largest overlap) compatible edge from the queue.  Instead, after popping the next edge off of the queue and checking that it is compatible with the current graph, we add it to the graph with probability $p$; otherwise, we add it to a list, $H$, of "held aside" edges that will be considered later, and then continue with the algorithm.  The modified pseudocode for the algorithm is:

* Let $G$ be a graph with fragments as vertices and no edges
* Create a queue, $Q$, of overlap edges (not currently in $G$), with edges in order of increasing weight (decreasing overlap length)
* Initialize $H$ to be an empty list
* While $G$ is disconnected
    * Pop the next possible edge $e = (u, v)$ off of $Q$
    * If $outdegree(u) = 0$ and $indegree(v) = 0$ and $e$ does not create a cycle
        * Let $x$ be a random number drawn uniformly from $[0, 1)$
        * If $x < p$
            * add *e* to *G*
            * move the edges in $H$ back to $Q$ in sorted order
        * Else
            * add *e* to *H*
    * If $Q$ is empty
        * move the edges in $H$ back to $Q$ in sorted order
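
The pseudocode above can be sketched in Python as follows.  This is a minimal sketch under stated assumptions, not the reference solution: the names `sketch_sometimes_greedy_assemble`, `overlap_length`, and `path_reaches` are illustrative, the queue $Q$ is kept as a heap of `(-overlap, source, target)` tuples (so popping always yields the next edge in sorted order), and the reads are assumed to be distinct with $p > 0$.

```python
import heapq
import random


def overlap_length(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def path_reaches(start, target, successor):
    """True if following successor links from start arrives at target."""
    node = start
    while node != target and node in successor:
        node = successor[node]
    return node == target


def sketch_sometimes_greedy_assemble(reads, p=1.0):
    # Hypothetical sketch; assumes distinct reads and p > 0.
    # Tuples sort by decreasing overlap, then source read, then target read.
    queue = [(-overlap_length(u, v), u, v) for u in reads for v in reads if u != v]
    heapq.heapify(queue)
    held = []                        # H: compatible edges that were held aside
    successor, predecessor = {}, {}  # G: at most one edge out of and into each read
    while len(successor) < len(reads) - 1:      # G is still disconnected
        edge = heapq.heappop(queue)
        _, u, v = edge
        compatible = (u not in successor and v not in predecessor
                      and not path_reaches(v, u, successor))
        if compatible:
            if random.random() < p:             # the only random draw
                successor[u], predecessor[v] = v, u
                for held_edge in held:          # move H back to Q
                    heapq.heappush(queue, held_edge)
                held = []
            else:
                held.append(edge)
        if not queue:                           # Q is empty: move H back to Q
            for held_edge in held:
                heapq.heappush(queue, held_edge)
            held = []
    # Spell the superstring by walking the single path from its first read.
    node = next(r for r in reads if r not in predecessor)
    superstring = node
    while node in successor:
        nxt = successor[node]
        superstring += nxt[overlap_length(node, nxt):]
        node = nxt
    return superstring
```

Because every vertex has at most one outgoing and one incoming edge, $G$ is a collection of paths, so an edge $(u, v)$ closes a cycle exactly when the path starting at $v$ ends at $u$; that is what `path_reaches` checks.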

A potential advantage of this randomized algorithm is that, in cases where the deterministic greedy algorithm fails to find the shortest superstring, there is a non-zero probability that the randomized algorithm will find a shorter superstring than the deterministic algorithm.  And we can run the randomized algorithm many times to increase these chances.
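
One way to exploit repeated runs is sketched below.  The helper name `shortest_of_many_runs` and its `trials` and `seed` parameters are hypothetical; `assemble` stands for any callable with the signature `assemble(reads, p)` that returns a string, such as a working `sometimes_greedy_assemble`.

```python
import random


def shortest_of_many_runs(assemble, reads, p, trials, seed=0):
    """Run a randomized assembler several times and keep the shortest result.

    `assemble` is assumed to be a callable assemble(reads, p) -> str.
    """
    random.seed(seed)  # fixed seed so the whole experiment is repeatable
    best = None
    for _ in range(trials):
        candidate = assemble(reads, p)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best
```

If a single run finds a shorter superstring with probability $q$, then $n$ independent runs find one with probability $1 - (1 - q)^n$, which approaches 1 as $n$ grows.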

To keep things simple for this homework we will allow overlaps of any length (including zero).  In practice, we would typically require some minimum overlap length.  For simplicity, we will also assume that:
1. we are assembling a single-stranded sequence and
2. no read is a substring of any other read.
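
The second assumption can be checked on a read set before assembling.  The helper below is a small illustrative sketch (its name is not part of the assignment); it lists every pair in which one read is contained in a different read.

```python
def reads_violating_substring_assumption(reads):
    """Return (inner, outer) pairs where one read is contained in a different read."""
    return [(a, b) for a in reads for b in reads if a != b and a in b]
```

An empty result means the read set satisfies the assumption.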

## Important implementation details

### Random number generation 
Random number generation should occur only at the line specified in the pseudocode, and you should use the [random.random](https://docs.python.org/3/library/random.html#random.random) function for this purpose.
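
The tests at the bottom of this notebook call `random.seed` before running the assembler, so the sequence of values returned by `random.random` is fixed for each test.  The short sketch below illustrates why any extra draw would change the outcome: one unplanned call shifts every later value.

```python
import random

random.seed(0)
first_run = [random.random() for _ in range(3)]

random.seed(0)
second_run = [random.random() for _ in range(3)]

# Same seed, same sequence of draws, each in [0, 1).
assert first_run == second_run
assert all(0.0 <= x < 1.0 for x in first_run)

# An extra draw shifts every later value by one position.
random.seed(0)
random.random()  # one unplanned draw
shifted_run = [random.random() for _ in range(3)]
assert shifted_run[:2] == first_run[1:]
```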

### Tie-breaking criteria

To make the order in which edges are considered well defined, we must establish tie-breaking criteria for edges in the overlap graph that have the same weight. For two edges with the same weight, we will first choose the edge whose source vertex read is first in lexicographical order. If the source vertices are identical, then we choose the edge whose target vertex read is first in lexicographical order. For example, if e1 = ATCGGA → GGAT and e2 = ATCGGA → GGAA, we will attempt to use edge e2 first because GGAA < GGAT according to lexicographical order.  You may find it useful that comparison operators for sequences in Python (e.g., tuples) use lexicographical ordering.  For example,

In [1]:
(-3, "ATCGGA", "GGAA") < (-3, "ATCGGA", "GGAT")

True
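
Building on this comparison, sorting a list of edge tuples produces the full queue order in one step.  The sketch below assumes edges are stored as `(-overlap_length, source, target)` tuples, as in the comparison above; the edge values are illustrative.

```python
# Overlap edges as (-overlap_length, source, target) tuples (illustrative values).
edges = [
    (-3, "ATCGGA", "GGAT"),
    (-2, "GGAT", "ATCGGA"),
    (-3, "ATCGGA", "GGAA"),
]
queue = sorted(edges)
# Largest overlap first; ties broken by source read, then target read.
assert queue == [
    (-3, "ATCGGA", "GGAA"),
    (-3, "ATCGGA", "GGAT"),
    (-2, "GGAT", "ATCGGA"),
]
```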

In [2]:
# Code for PROBLEM 1
# You are welcome to develop your code as a separate Python module
# and import it here if that is more convenient for you.
def sometimes_greedy_assemble(reads, p=1.0):
    """Assembles a set of reads using the graph-based 'sometimes' greedy algorithm.
    
    Args:
        reads: a list of strings
        p: probability of a compatible edge being added at each iteration
           (default: 1.0, which is equivalent to the deterministic greedy algorithm)
    Returns:
        A string that is a superstring of the input reads
    """
    ### BEGIN SOLUTION
    import assemble
    return assemble.sometimes_greedy_assemble(reads, p)
    ### END SOLUTION

Tests for `sometimes_greedy_assemble` are provided at the bottom of this notebook.

## PROBLEM 2: Assembling the SARS-CoV-2 genome (10 points)

Included with this notebook is the file `sarscov2_reads.fasta`, which contains a set of reads from a SARS-CoV-2 variant genome.  In this problem, we will use your `sometimes_greedy_assemble` function to assemble this genome and then *determine the identity of the variant*.  A few notes about these reads:

1. The reads are free of sequencing errors
2. The reads are all in the same orientation as the SARS-CoV-2 genome

**(a)** Write code to read in the SARS-CoV-2 reads and assemble them with your `sometimes_greedy_assemble` function, with `p = 1` (deterministic).  Write the code assuming your `sometimes_greedy_assemble` function is correct.  This problem will be graded manually.  

In [3]:
### BEGIN SOLUTION
import fasta
sarscov2_read_records = fasta.read_sequences_from_fasta_file("sarscov2_reads.fasta")
sarscov2_reads = [read for (name, read) in sarscov2_read_records]
assembly = sometimes_greedy_assemble(sarscov2_reads)
### END SOLUTION

Computing overlaps...
Running greedy algorithm...


**(b)** The file `sarscov2_variant_genomes.fasta` contains the genome sequences for ten SARS-CoV-2 variants of concern or variants of interest: alpha, beta, delta, gamma, epsilon, omicron BA.1, omicron BA.4, omicron BA.5, omicron EG.5.1, and omicron XBB.1.5.  If your code is correct, your assembly should be identical to one of these genomes (*note: typically, a newly sequenced viral genome will not match exactly to a reference genome, but we are keeping it simple in this assignment*).  **Which variant do these reads come from?**

A few notes:
1. If your `sometimes_greedy_assemble` function is not correct, you may use an alternative strategy for determining the identity of the variant (e.g., by examining the reads and the candidate variant genome sequences provided)
2. This problem will be manually graded.  The majority of the credit will be for providing the code that you used to determine the identity of the variant.

In [4]:
### BEGIN SOLUTION
variant_genomes = fasta.read_sequences_from_fasta_file("sarscov2_variant_genomes.fasta")
for name, genome in variant_genomes:
    if assembly == genome:
        print("The variant is:", name)

# alternative strategy
for name, genome in variant_genomes:
    if all(read in genome for read in sarscov2_reads):
        print("The variant is:", name)
### END SOLUTION

The variant is: omicron_eg_5.1
The variant is: omicron_eg_5.1


## PROBLEM 3: SBH graphs and Eulerian paths (20 points) 
For the following strings, (i) give the k = 3 spectrum for the string, (ii) draw the SBH graph for the spectrum, (iii) give one Eulerian path and its corresponding string for the SBH graph, and (iv) show whether or not there exists an Eulerian path in the graph that corresponds to the original string.

(a) `AGTTAAATTGCAG`

(b) `TATCGGATCGTTA`

### BEGIN SOLUTION  TEMPLATE=*YOUR ANSWER TO PROBLEM 3 HERE*

(a) AGTTAAATTGCAG

i. {AAA, AAT, AGT, ATT, CAG, GCA, GTT, TAA, TGC, TTA, TTG}

ii. See figure below.
![p3a](p3a.png)

iii. There is one Eulerian cycle in this graph
TA -> AA -> AA -> AT -> TT -> TG -> GC -> CA -> AG -> GT -> TT -> TA
which contains the following Eulerian paths and their corresponding strings:

    TAAATTGCAGTTA: TA -> AA -> AA -> AT -> TT -> TG -> GC -> CA -> AG -> GT -> TT -> TA
    AAATTGCAGTTAA: AA -> AA -> AT -> TT -> TG -> GC -> CA -> AG -> GT -> TT -> TA -> AA
    AATTGCAGTTAAA: AA -> AT -> TT -> TG -> GC -> CA -> AG -> GT -> TT -> TA -> AA -> AA
    ATTGCAGTTAAAT: AT -> TT -> TG -> GC -> CA -> AG -> GT -> TT -> TA -> AA -> AA -> AT
    TTGCAGTTAAATT: TT -> TG -> GC -> CA -> AG -> GT -> TT -> TA -> AA -> AA -> AT -> TT
    TGCAGTTAAATTG: TG -> GC -> CA -> AG -> GT -> TT -> TA -> AA -> AA -> AT -> TT -> TG
    GCAGTTAAATTGC: GC -> CA -> AG -> GT -> TT -> TA -> AA -> AA -> AT -> TT -> TG -> GC
    CAGTTAAATTGCA: CA -> AG -> GT -> TT -> TA -> AA -> AA -> AT -> TT -> TG -> GC -> CA
    AGTTAAATTGCAG: AG -> GT -> TT -> TA -> AA -> AA -> AT -> TT -> TG -> GC -> CA -> AG
    GTTAAATTGCAGT: GT -> TT -> TA -> AA -> AA -> AT -> TT -> TG -> GC -> CA -> AG -> GT
    TTAAATTGCAGTT: TT -> TA -> AA -> AA -> AT -> TT -> TG -> GC -> CA -> AG -> GT -> TT

iv. Yes, there exists an Eulerian path in the graph that corresponds to the
original string (see the ninth string and path in the list above).

(b) TATCGGATCGTTA

i. {ATC, CGG, CGT, GAT, GGA, GTT, TAT, TCG, TTA}

ii. See figure below.
![p3b](p3b.png)

iii. The two possible Eulerian paths in this graph and their corresponding
strings are:

    CGGATCGTTAT: CG -> GG -> GA -> AT -> TC -> CG -> GT -> TT -> TA -> AT
    CGTTATCGGAT: CG -> GT -> TT -> TA -> AT -> TC -> CG -> GG -> GA -> AT

iv. No.  No Eulerian path in this SBH graph can correspond to the original string because the original string has k-mers that are present more than once (ATC and TCG) and in the SBH graph each unique k-mer is represented by a single edge, which can only be traversed once by an Eulerian path.  One can also see this by noting that there are only 9 edges in the graph, and thus a string corresponding to an Eulerian path will be $11 = 9 + 3 - 1$ characters long, whereas the original string is $13$ characters long.
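
The spectra and the length argument above can be checked with a few lines of Python.  This is an illustrative sketch; the helper names `spectrum` and `degree_imbalance` are not part of the assignment.

```python
from collections import Counter


def spectrum(s, k):
    """Sorted list of the distinct k-mers of s."""
    return sorted({s[i:i + k] for i in range(len(s) - k + 1)})


def degree_imbalance(kmers):
    """Out-degree minus in-degree for every unbalanced node of the SBH graph."""
    balance = Counter()
    for kmer in kmers:
        balance[kmer[:-1]] += 1   # edge leaves its (k-1)-mer prefix
        balance[kmer[1:]] -= 1    # edge enters its (k-1)-mer suffix
    return {node: d for node, d in balance.items() if d != 0}


for s in ["AGTTAAATTGCAG", "TATCGGATCGTTA"]:
    kmers = spectrum(s, 3)
    path_string_length = len(kmers) + 3 - 1   # an Eulerian path uses each edge once
    print(s, len(kmers), path_string_length, degree_imbalance(kmers))
```

For string (a) every node is balanced, which is consistent with the Eulerian cycle found above.  For string (b) the node CG has one extra outgoing edge and AT has one extra incoming edge, so every Eulerian path must start at CG and end at AT.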

           
### END SOLUTION

## PROBLEM 4: Which data will assemble correctly? (20 points) 
Suppose that the true genome sequence of an organism is `AACGCCGCTAG`.

**(a)** Suppose you use the fragment assembly paradigm and the greedy algorithm to assemble three reads from this genome into a superstring.

**(a.i)** Identify three reads, each with length = 5, that cover the genome and for which the algorithm will *successfully* assemble the genome.  Draw the overlap graph and specify the order in which the edges are added to form the path.

**(a.ii)** Identify three reads, each with length = 5, that cover the genome and for which the algorithm will *fail* to assemble the genome.  Draw the overlap graph and specify the order in which the edges are added to form the path.

**(a.iii)** Identify three reads, each with length = 5, that cover the genome and for which the algorithm *may fail or may succeed* depending on how edge weight ties are arbitrarily broken.  Draw the overlap graph and specify the order in which the edges are added to form a correct assembly and also an order in which the edges are added to form an incorrect assembly.

**(b)** Suppose instead that you use the spectral assembly paradigm.  What is the smallest value of $k$ for which this assembly approach will succeed?  For this value of $k$, give the spectral assembly graph and an Eulerian path through the graph.

### BEGIN SOLUTION  TEMPLATE=*YOUR ANSWER TO PROBLEM 4 HERE*

**(a)** In order for the reads to cover the genome, reads corresponding to the first five bases (AACGC) and the last five bases (GCTAG) of the genome must be included in the read set.  To cover the remaining sixth base of the genome, the third read must be one of the following substrings: ACGCC, CGCCG, GCCGC, CCGCT, or CGCTA.

**(i)** Either CGCCG or CCGCT can be the third read to guarantee successful assembly.  With CGCCG the overlap graph is

![p4ai_overlap graph](p4ai_overlap_graph.png)

and the order of edge additions is as follows:
* Iteration 1: Edge AACGC → CGCCG added
* Iteration 2: Edge AACGC → GCTAG rejected
* Iteration 3: Edge CGCCG → GCTAG added

**(ii)** CGCTA as the third read will guarantee failed assembly.  The overlap graph is

![p4aii_overlap graph](p4aii_overlap_graph.png)

and the order of edge additions is as follows:
* Iteration 1: Edge CGCTA → GCTAG added
* Iteration 2: Edge AACGC → CGCTA added

**(iii)** ACGCC or GCCGC as the third read may result in failure or success depending on how edge weight ties are broken. For ACGCC, the overlap graph is

![p4aiii_overlap graph](p4aiii_overlap_graph.png)

An order of edge additions that leads to a correct assembly is:
* Iteration 1: Edge AACGC → ACGCC added (overlap length = 4)
* Iteration 2: Edge AACGC → GCTAG rejected (overlap length = 2)
* Iteration 3: Edge ACGCC → GCTAG added (overlap length = 0)

An order of edge additions that leads to an incorrect assembly is:
* Iteration 1: Edge AACGC → ACGCC added (overlap length = 4)
* Iteration 2: Edge AACGC → GCTAG rejected (overlap length = 2)
* Iteration 3: Edge GCTAG → AACGC added (overlap length = 0)
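
The overlap lengths used in parts (i) through (iii) can be recomputed with a short sketch.  The helper `overlap_length` is illustrative and simply finds the longest suffix of one read that is a prefix of another.

```python
def overlap_length(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


first, last = "AACGC", "GCTAG"
for third in ["CGCCG", "CGCTA", "ACGCC"]:
    reads = [first, third, last]
    overlaps = {(u, v): overlap_length(u, v) for u in reads for v in reads if u != v}
    # Show only the edges with a non-zero overlap; all others tie at zero.
    positive = {edge: n for edge, n in overlaps.items() if n > 0}
    print(third, positive)
```

With ACGCC as the third read, only two edges have a non-zero overlap, so the final edge must be chosen among edges that all tie at overlap length zero, which is why the outcome depends on tie-breaking.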

**(b)** $k = 4$.  For $k = 4$, all k-mers in the genome are unique, which allows the spectral assembly approach to succeed.  Note that for $k = 3$, not all k-mers in the genome are unique (CGC is found twice), thus 4 is the smallest value for $k$.  The spectral assembly graph for $k = 4$ is:

![p4b](p4b.png)

And the Eulerian path through this graph is AAC → ACG → CGC → GCC → CCG → CGC → GCT → CTA → TAG
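
The uniqueness criterion used above can be checked directly.  The sketch below uses a hypothetical helper, `smallest_k_with_unique_kmers`, that returns the smallest $k$ for which no k-mer of the genome repeats.  Unique k-mers are necessary for an Eulerian path to spell the genome; in general they do not by themselves rule out additional Eulerian paths, so the graph should still be inspected as done above.

```python
def smallest_k_with_unique_kmers(genome):
    """Smallest k such that every k-mer of genome occurs exactly once."""
    for k in range(1, len(genome) + 1):
        kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
        if len(set(kmers)) == len(kmers):
            return k


print(smallest_k_with_unique_kmers("AACGCCGCTAG"))
```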

### END SOLUTION

### Tests for problem 1

In [5]:
# read sets for testing
tiny_reads = ["ATAG", "CATA", "TAAT"]
single_base_reads = ["C", "A", "T", "G"]
medium_reads = ["ATGCT", "CTAT", "CCTATA", "CCC", "CTCC", "AAG"]

# utility functions for testing
import random

def read_strings_from_file(filename):
    # Use a context manager so the file is closed after reading
    with open(filename) as f:
        return [line.rstrip() for line in f]

def test_sometimes_greedy_assemble_with_files(reads_filename, superstring_filename, p = 1.0):
    reads = read_strings_from_file(reads_filename)
    superstring = read_strings_from_file(superstring_filename)[0]
    assert sometimes_greedy_assemble(reads, p) == superstring

In [6]:
# TEST: returns a string
assembly = sometimes_greedy_assemble(tiny_reads)
assert isinstance(assembly, str), "Return value of sometimes_greedy_assemble is not a str"
print("SUCCESS: returns a string passed!")

Computing overlaps...
Running greedy algorithm...
SUCCESS: returns a string passed!


In [7]:
# TEST: returns a superstring
def check_is_superstring(assembly, reads):
    for read in reads:
        assert read in assembly, f"read '{read}' is not contained in assembly"

assembly = sometimes_greedy_assemble(tiny_reads)
check_is_superstring(assembly, tiny_reads)
print("SUCCESS: returns a superstring passed!")

Computing overlaps...
Running greedy algorithm...
SUCCESS: returns a superstring passed!


In [8]:
# TEST: tiny_deterministic
assembly = sometimes_greedy_assemble(tiny_reads, 1.0)
assert assembly == 'CATAGTAAT'
print("SUCCESS: tiny_deterministic passed")

Computing overlaps...
Running greedy algorithm...
SUCCESS: tiny_deterministic passed


In [9]:
# TEST: tiny_single_skip
random.seed(0)
assembly = sometimes_greedy_assemble(tiny_reads, 0.8)
assert assembly == 'CATAATAG'
print("SUCCESS: tiny_single_skip passed!")

Computing overlaps...
Running greedy algorithm...
SUCCESS: tiny_single_skip passed!


In [10]:
# TEST: tiny_double_skip
random.seed(78)
assembly = sometimes_greedy_assemble(tiny_reads, 0.8)
assert assembly == 'ATAGCATAAT'
print("SUCCESS: tiny_double_skip passed!")

Computing overlaps...
Running greedy algorithm...
SUCCESS: tiny_double_skip passed!


In [11]:
# TEST: tiny_triple_skip
random.seed(23)
assembly = sometimes_greedy_assemble(tiny_reads, 0.8)
assert assembly == 'ATAGCATAAT'
print("SUCCESS: tiny_triple_skip passed!")

Computing overlaps...
Running greedy algorithm...
SUCCESS: tiny_triple_skip passed!


In [12]:
# TEST: tiny_empty_queue
import random
random.seed(17)
assembly = sometimes_greedy_assemble(tiny_reads, 0.8)
assert assembly == 'CATAGTAAT'
print("SUCCESS: tiny_empty_queue passed!")

Computing overlaps...
Running greedy algorithm...
SUCCESS: tiny_empty_queue passed!


In [13]:
# TEST: tiny_multiple_empty_queue
import random
random.seed(36)
assembly = sometimes_greedy_assemble(tiny_reads, 0.5)
assert assembly == 'TAATCATAG'
print("SUCCESS: tiny_multiple_empty_queue passed!")

Computing overlaps...
Running greedy algorithm...
SUCCESS: tiny_multiple_empty_queue passed!


In [14]:
# TEST: single_bases
assembly = sometimes_greedy_assemble(single_base_reads)
assert assembly == "ACGT"
print("SUCCESS: single_bases passed!")

Computing overlaps...
Running greedy algorithm...
SUCCESS: single_bases passed!


In [15]:
# TEST: medium
assembly = sometimes_greedy_assemble(medium_reads)
assert assembly == "CTCCCTATAAGATGCTAT"
print("SUCCESS: medium passed!")

Computing overlaps...
Running greedy algorithm...
SUCCESS: medium passed!


In [16]:
# TEST: medium_randomized
random.seed(15)
assembly = sometimes_greedy_assemble(medium_reads, p=0.7)
assert assembly == "CTATGCTCCCTATAAG"
print("SUCCESS: medium_randomized passed!")

Computing overlaps...
Running greedy algorithm...
SUCCESS: medium_randomized passed!


In [17]:
# TEST: large0
random.seed(0)
test_sometimes_greedy_assemble_with_files("tests/large0_reads.txt", "tests/large0_assembly.txt", p = 0.8)
print("SUCCESS: large0 passed!")

Computing overlaps...
Running greedy algorithm...
SUCCESS: large0 passed!


In [18]:
# TEST: large1
### BEGIN HIDDEN TESTS
random.seed(1)
test_sometimes_greedy_assemble_with_files("tests/large1_reads.txt", "tests/large1_assembly.txt", p = 0.8)
print("SUCCESS: large1 passed!")
### END HIDDEN TESTS

Computing overlaps...
Running greedy algorithm...
SUCCESS: large1 passed!


In [19]:
# TEST: large2
### BEGIN HIDDEN TESTS
random.seed(2)
test_sometimes_greedy_assemble_with_files("tests/large2_reads.txt", "tests/large2_assembly.txt", p = 0.8)
print("SUCCESS: large2 passed!")
### END HIDDEN TESTS

Computing overlaps...
Running greedy algorithm...
SUCCESS: large2 passed!


In [20]:
# TEST: large3
### BEGIN HIDDEN TESTS
random.seed(3)
test_sometimes_greedy_assemble_with_files("tests/large3_reads.txt", "tests/large3_assembly.txt", p = 0.8)
print("SUCCESS: large3 passed!")
### END HIDDEN TESTS

Computing overlaps...
Running greedy algorithm...
SUCCESS: large3 passed!


In [21]:
# TEST: large4
### BEGIN HIDDEN TESTS
random.seed(4)
test_sometimes_greedy_assemble_with_files("tests/large4_reads.txt", "tests/large4_assembly.txt", p = 0.8)
print("SUCCESS: large4 passed!")
### END HIDDEN TESTS

Computing overlaps...
Running greedy algorithm...
SUCCESS: large4 passed!


In [22]:
# TEST: large5
### BEGIN HIDDEN TESTS
random.seed(5)
test_sometimes_greedy_assemble_with_files("tests/large5_reads.txt", "tests/large5_assembly.txt", p = 0.8)
print("SUCCESS: large5 passed!")
### END HIDDEN TESTS

Computing overlaps...
Running greedy algorithm...
SUCCESS: large5 passed!
