# Module 4: Assembly

How do we solve the assembly problem with an algo? 

- shortest common superstring problem 
    - hard to solve quickly
    - one solution (greedy) is fast, but comes at a cost in accuracy

- when a genome is repetitive, any algorithm will make mistakes 
- we'll learn De Bruijn graph algorithms 

## Lecture: The shortest common superstring problem

The shortest common superstring (SCS) problem: given a set of strings (S), we'd like to find the shortest possible string that contains all the strings in S as substrings.
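
To make the definition concrete, here is a minimal sketch of the "contains all the strings in S" condition. The helper name `is_common_superstring` is hypothetical (it is not part of the course code); it only checks whether a candidate is a common superstring, not whether it is the shortest one.

```python
def is_common_superstring(candidate, strings):
    """Return True if every string in `strings` occurs as a substring of `candidate`."""
    return all(s in candidate for s in strings)

S = ['ABC', 'BCA', 'CAB']

# Simply concatenating the strings always gives a (long) common superstring.
print(is_common_superstring('ABCBCACAB', S))  # True, length 9

# A much shorter common superstring exists for this set.
print(is_common_superstring('ABCAB', S))      # True, length 5

# 'ABCA' is shorter still, but it does not contain 'CAB'.
print(is_common_superstring('ABCA', S))       # False
```

The SCS problem asks for the shortest candidate that passes this check.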

Fig 1
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/SCS.png" alt="Local Image" width="500">

Why is the SCS problem useful? If you give it all reads and it finds the shortest common superstring, the resulting superstring is the assembled genome. This assembled genome is the "most parsimonious", "most likely" genome. 

Downsides of SCS: 
- Not tractable: the problem is NP-complete, so no known algorithm solves it efficiently. 

Algorithm for SCS problem: 

Fig. 2 
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/SCS_algo.png" alt="Local Image" width="500">

Assume the set S contains input strings of the same length. 

Repeat the following steps for each possible ordering of the strings in S:
1. Place the strings in S in that order.
2. Build the superstring from left to right: find the overlap between the end of the superstring built so far and the next string, and append only the characters that do not overlap with the end of the superstring. 

Then:
- Pick the shortest superstring across all orderings. 
- If S contains n strings, there are n! possible orderings, so the time complexity is very bad.
- The algorithm will always report a correct shortest common superstring.
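
To get a feel for how quickly n! grows, here is a small sketch. The function name `count_orderings` is an illustrative assumption; it only counts how many orderings the brute-force search would have to try for n strings.

```python
import itertools
import math

def count_orderings(n):
    """Number of orderings the brute-force search must try for n strings."""
    return math.factorial(n)

# Sanity check against an explicit enumeration for a small n.
assert count_orderings(4) == len(list(itertools.permutations(range(4))))

for n in (3, 5, 10, 20):
    print(n, count_orderings(n))
# 3 6
# 5 120
# 10 3628800
# 20 2432902008176640000
```

Even 20 strings already give more than 10^18 orderings, and a real sequencing dataset contains far more reads than that.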

## Practical: Implementing shortest common superstring

In [29]:
from modules.all_functions import *
import itertools

""" My initial thoughts/algorithm: """

def SCS(ss): 
    """ Takes a set of strings (ss) and returns the shortest common superstring
    containing all of them.
    All reads in this set should be the same length. """
    
    all_supstrings = []
    
    # for each possible ordering, generate the shortest possible superstring 
    for ordering in itertools.permutations(ss): 

        superstring = '' # initialize empty superstring 
        # in each ordering, look at one sequence at a time. 
        for read in ordering: 
            last_bit_superstring = superstring[-len(read):]
            # for each sequence, search for one character at the end of the string
            for i in range(len(read)):
                # if the read character is already there, skip 
                if read[i] == last_bit_superstring[i]:
                    continue 
                # if the read character is not already there, add it. 
                superstring += read[i]
        all_supstrings += [superstring] # add the current superstring to the list.

    return min(all_supstrings)

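Critique: my initial approach was wrong and causes index errors. For the first read, `superstring` is still empty, so `last_bit_superstring[i]` raises an `IndexError`. Comparing characters position by position also cannot find a partial overlap.

It doesn't work for the case where the read and the superstring so far are 
act
xxxac 
because you'd be comparing xac to act position by position, which misses the real overlap (ac). 

In addition, `min(all_supstrings)` returns the lexicographically smallest string, not the shortest one.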
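
What the failed attempt is missing is a proper suffix-prefix overlap. The notes get an `overlap` function from `modules.all_functions`, which is not shown here, so the sketch below is only a hypothetical stand-in (named `suffix_prefix_overlap` to avoid confusion) and may differ from the real helper. It assumes the same meaning: the length of the longest suffix of `a` that matches a prefix of `b`, or 0 if that is shorter than `min_length`.

```python
def suffix_prefix_overlap(a, b, min_length=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_length` characters exists.
    Hypothetical stand-in for the imported `overlap` helper.
    """
    # Try the longest possible overlap first, so the first match is the longest.
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

# The case from the critique: the superstring ends in 'ac', the read starts with 'ac'.
print(suffix_prefix_overlap('xxxac', 'act'))               # 2
print(suffix_prefix_overlap('ACGGTACGAGC', 'GAGCTTCGGA'))  # 4
print(suffix_prefix_overlap('ABC', 'CAB', min_length=2))   # 0
```

With an overlap of 2, only the final `t` of `act` needs to be appended to `xxxac`.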

In [37]:
def scs(ss): 
    """ Returns a shortest common superstring of the strings in ss,
    found by trying every possible ordering. """
    superstring = None
    # for each possible ordering, build the superstring for that ordering 
    for ordering in itertools.permutations(ss): 
        # initialize the superstring with the first string in the ordering
        current_superstring = ordering[0] 
        # for every adjacent pair of strings in the ordering (starting with the first string)
        for i in range(len(ss)-1):
            read, next_read = ordering[i], ordering[i+1]
            # find overlap length between the current string and the next string
            olen = overlap(read, next_read, min_length=1)
            # add the non-overlapping bit 
            current_superstring += next_read[olen:]
        # keep the shortest superstring seen so far
        if superstring is None or len(superstring) > len(current_superstring):
            superstring = current_superstring
    return superstring

In [40]:
scs(['ACGGTACGAGC', 'GAGCTTCGGA', 'GACACGG'])

'GACACGGTACGAGCTTCGGA'

## Lecture: Greedy shortest common superstring

- downside of SCS: no efficient solution is known (the solution we just saw was very slow). 

- look at a faster alternative: greedy shortest common superstring 
    - greedy: at each step, the algorithm makes the choice that shortens the superstring the most. With a greedy algorithm, we cannot guarantee that we always get the correct (shortest) solution.

- the algorithm proceeds in rounds. In each round, pick the longest remaining overlap in the overlap map (the overlap graph). 
    - the longer the overlap between strings, the shorter the superstring. 
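
As a rough illustration of the overlap map, here is a sketch that stores it as a dictionary of edges and picks the longest one. The names `suffix_prefix_overlap`, `overlap_graph`, and `longest_edge` are hypothetical and not from the course code. One simplification to note: `max` returns the first longest edge it encounters instead of breaking ties at random.

```python
from itertools import permutations

def suffix_prefix_overlap(a, b, min_length=1):
    """Length of the longest suffix of a that is a prefix of b (0 if below min_length)."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def overlap_graph(reads, min_length=1):
    """Map each ordered pair (a, b) with a non-zero overlap to its overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        olen = suffix_prefix_overlap(a, b, min_length)
        if olen > 0:
            graph[(a, b)] = olen
    return graph

def longest_edge(graph):
    """Return ((a, b), olen) for an edge with the longest overlap, or None if there are no edges."""
    if not graph:
        return None
    return max(graph.items(), key=lambda edge: edge[1])

graph = overlap_graph(['ABCD', 'CDBC', 'BCDA'])
for (a, b), olen in graph.items():
    print(a, '->', b, olen)
print(longest_edge(graph))  # (('ABCD', 'BCDA'), 3)
```

In one greedy round, the two reads on the longest edge would be merged into a single node and the overlaps recomputed.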

Fig 1. 
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/greedy_map.png" alt="Local Image" width="500">

1. locate the edge corresponding to the longest overlap. If there is a tie, pick an edge at random 

Fig 2. 
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/merging_nodes.png" alt="Local Image" width="500">

2. Merge these two nodes 

Repeat steps 1 and 2 until no overlaps remain. 

If we end up with nodes that do not overlap (no edge between them, i.e. overlap = 0), concatenate their strings. 

- Pros: Fast
- Cons: Not always accurate. 

Fig 3. Not always accurate 
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/mistakes.png" alt="Local Image" width="500">

## Practical: Implementing greedy shortest common superstring

1. Find the two reads with the maximum overlap.


In [46]:
def pick_max_overlap(reads, k):
    """ Find the pair of reads with maximum overlap. """
    reada, readb = None, None
    best_olen = 0 

    # for each pair of reads possible, calculate overlap between them
    # store the reads if they have longest overlap 
    for a,b in itertools.permutations(reads, 2):
        olen = overlap(a, b, min_length=k)
        if olen > best_olen:
            reada, readb = a, b
            best_olen = olen 
    return reada, readb, best_olen

In [49]:
def greedy_scs(reads, k): 
    """ Greedily merges the pair of reads with the longest overlap (at least k)
    until no overlaps remain. Note: modifies the list `reads` in place. """
    read_a, read_b, olen = pick_max_overlap(reads, k)
    while olen > 0:
        # remove read a and b
        reads.remove(read_a)
        reads.remove(read_b)
        # put back a read which contains both of them 
        reads.append(read_a + read_b[olen:])

        # next iteration: find the new most overlapping pair of reads
        read_a, read_b, olen = pick_max_overlap(reads, k)
    return ''.join(reads) # return as one string; this also concatenates any remaining reads with overlap 0

In [50]:
greedy_scs(['ABC', 'BCA', 'CAB'], 2)

'CABCA'

In [52]:
greedy_scs(['ABCD', 'CDBC', 'BCDA'], 1)

'CDBCABCDA'

In [54]:
scs(['ABCD', 'CDBC', 'BCDA'])

'ABCDBCDA'

For this input the greedy result ('CDBCABCDA', 9 characters) is longer than the true shortest common superstring ('ABCDBCDA', 8 characters), so the greedy algorithm did not find the optimal answer here.

## Lecture: Third law of assembly: repeats are bad