# Module 4: Assembly

How do we solve the assembly problem with an algorithm?

- shortest common superstring problem
    - hard to solve quickly
    - one solution (the greedy one) is fast, but it comes at a cost: it is not always correct

- when a genome is repetitive, any algorithm will make mistakes
- we'll learn De Bruijn graph algorithms

## Lecture: The shortest common superstring problem

The shortest common superstring (SCS) problem: given a set of strings S, we'd like to find the shortest possible string that contains all the strings in S as substrings.

Fig 1
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/SCS.png" alt="Local Image" width="500">

Why is the SCS problem useful? If you give it all the reads, the shortest common superstring it finds is a candidate for the assembled genome. This assembled genome is the "most parsimonious" genome, i.e. the shortest one that explains every read.

Downsides of SCS: 
- Not tractable. The problem is NP-complete, so there is no known algorithm that solves it efficiently.

Algorithm for SCS problem: 

Fig. 2 
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/SCS_algo.png" alt="Local Image" width="500">

Assume set S = input strings, all of the same length.

Repeat the following steps for each possible ordering of the strings in S:
1. place the strings in S in that order
2. going through the strings one at a time, find the overlap between the end of the superstring built so far and the start of the next string. Only add the characters that don't overlap with the end of the superstring.

Then:
- pick the shortest superstring over all possible orderings.
- if S contains n strings, there are n! possible orderings, so the time complexity is very bad.
- the algorithm will always report a correct shortest common superstring.
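
The practical below calls an `overlap` helper that is imported from `modules.all_functions` and isn't shown in these notes. Here is a minimal sketch of what such a helper might look like. It assumes that it returns the length of the longest suffix of `a` that matches a prefix of `b`, or 0 if that overlap is shorter than `min_length`. The real helper in the module may differ.

```python
def overlap(a, b, min_length=3):
    """Assumed stand-in for the imported helper: length of the longest
    suffix of a that matches a prefix of b, or 0 if it is shorter than
    min_length."""
    start = 0  # where in a to start looking
    while True:
        # look for the first min_length characters of b somewhere in a
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0  # no candidate position left
        # check whether the rest of a (from start on) is a prefix of b
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1  # move past this candidate and keep looking


print(overlap('TTACGT', 'CGTGTGC'))           # 3 (suffix CGT = prefix CGT)
print(overlap('ABCD', 'CDBC', min_length=1))  # 2 (suffix CD = prefix CD)
print(overlap('ABCD', 'XYZ', min_length=1))   # 0 (no overlap)
```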

## Practical: Implementing shortest common superstring

In [78]:
from modules.all_functions import *
import itertools

""" My initial thoughts/algorithm: """

def SCS(ss): 
    """ Takes a set of strings (ss) and returns the shortest common superstring
    containing all of them.
    All reads in this set should be the same length. """
    
    all_supstrings = []
    
    # for each possible ordering, generate the shortest possible superstring
    for ordering in itertools.permutations(ss): 

        superstring = '' # initialize empty superstring
        # in each ordering, look at one sequence at a time. 
        for read in ordering: 
            last_bit_superstring = superstring[-len(read):]
            # for each sequence, search for one character at the end of the string
            for i in range(len(read)):
                # if the read character is already there, skip 
                if read[i] == last_bit_superstring[i]:
                    continue 
                # if the read character is not already there, add it. 
                superstring += read[i]
        all_supstrings += [superstring] # add the current superstring to the list.

    return min(all_supstrings)

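Critique: my initial approach was wrong and causes index problems. It compares the read and the end of the superstring position by position, instead of matching a suffix of the superstring against a prefix of the read. On the very first read the superstring is still empty, so last_bit_superstring[i] raises an IndexError. Also, min picks the alphabetically first string rather than the shortest one.

It doesn't work for a case like this, where the read is
act
and the superstring so far is
xxxac
You'd be comparing xac to act position by position, and that won't find the overlap ac.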

In [79]:
def scs(ss):
    """ Returns a shortest common superstring of the strings in ss by
    trying every possible ordering (brute force). """
    superstring = None
    # for each possible ordering, generate the shortest possible superstring
    for ordering in itertools.permutations(ss):
        # initialize the superstring with the first string in the ordering
        current_superstring = ordering[0]
        # for every adjacent pair of strings in the ordering
        for i in range(len(ss)-1):
            read, next_read = ordering[i], ordering[i+1]
            # find overlap length between the current string and the next string
            olen = overlap(read, next_read, min_length=1)
            # add the non-overlapping bit.
            current_superstring += next_read[olen:]
        # keep the shortest superstring seen so far
        if superstring is None or len(superstring) > len(current_superstring):
            superstring = current_superstring
    return superstring

In [80]:
scs(['ACGGTACGAGC', 'GAGCTTCGGA', 'GACACGG'])

'GACACGGTACGAGCTTCGGA'

## Lecture: Greedy shortest common superstring

- downside of SCS: no efficient solution (the solution we just saw was very slow).

- look at a faster alternative: greedy shortest common superstring
    - greedy: at each step, the algorithm makes the choice that shortens the superstring the most right now. With a greedy algorithm, we cannot guarantee that we always get the correct (shortest) solution.

- the algorithm proceeds in rounds. In each round, pick the longest remaining overlap in the overlap graph.
    - the longer the overlap between strings, the shorter the superstring.
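
The overlap structure that each round looks at can be sketched as a dictionary that maps an ordered pair of reads to their overlap length. This is only an illustration: `suffix_prefix_overlap` and `overlap_graph` are hypothetical names, not functions from the course module, and the reads are the toy strings used in the practical below.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_length=1):
    """Length of the longest suffix of a that equals a prefix of b
    (0 if no overlap of at least min_length exists)."""
    for olen in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:olen]):
            return olen
    return 0


def overlap_graph(reads, min_length=1):
    """Map each ordered pair (a, b) with a non-zero overlap to its length."""
    graph = {}
    for a, b in permutations(reads, 2):
        olen = suffix_prefix_overlap(a, b, min_length)
        if olen > 0:
            graph[(a, b)] = olen
    return graph


graph = overlap_graph(['ABCD', 'CDBC', 'BCDA'])
# the edge a greedy round would pick first: the longest overlap
best_edge = max(graph, key=graph.get)
print(best_edge, graph[best_edge])  # ('ABCD', 'BCDA') 3
```

Merging that pair first gives ABCDA, which no longer overlaps CDBC at all. That is why the greedy result in the practical ends up longer than the true SCS.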

Fig 1. 
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/greedy_map.png" alt="Local Image" width="500">

1. locate the edge corresponding to the longest overlap. If there is a tie, pick an edge at random 

Fig 2. 
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/merging_nodes.png" alt="Local Image" width="500">

2. Merge these two nodes 

Repeat steps 1 and 2 until no overlapping pairs remain.

If we end up with two or more non-overlapping nodes (edge = 0), concatenate their strings.

Pros: Fast
Cons: Not always accurate. 

Fig 3. Not always accurate 
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/mistakes.png" alt="Local Image" width="500">

## Practical: Implementing greedy shortest common superstring

In [81]:
def pick_max_overlap(reads, k):
    """ Find the pair of reads with maximum overlap. """
    reada, readb = None, None
    best_olen = 0 

    # for each pair of reads possible, calculate overlap between them
    # store the reads if they have longest overlap 
    for a,b in itertools.permutations(reads, 2):
        olen = overlap(a, b, min_length=k)
        if olen > best_olen:
            reada, readb = a, b
            best_olen = olen 
    return reada, readb, best_olen

In [82]:
def greedy_scs(reads, k): 
    read_a, read_b, olen = pick_max_overlap(reads, k)
    while olen > 0:
        # remove read a and b
        reads.remove(read_a)
        reads.remove(read_b)
        # put back a read which contains both of them 
        reads.append(read_a + read_b[olen:])

        # next iteration: find the new pair of reads with the longest overlap
        read_a, read_b, olen = pick_max_overlap(reads, k)
    return ''.join(reads) # return as one string; this also concatenates any remaining nodes with overlap 0

In [83]:
greedy_scs(['ABC', 'BCA', 'CAB'], 2)

'CABCA'

In [84]:
greedy_scs(['ABCD', 'CDBC', 'BCDA'], 1)

'CDBCABCDA'

In [85]:
scs(['ABCD', 'CDBC', 'BCDA'])

'ABCDBCDA'

## Lecture: Third law of assembly: repeats are bad

- most frustrating! 

Third law of assembly: the shortest common superstring is not the correct genome if the genome is repetitive.
- repeated elements get collapsed down into just one copy.
- so repetitive elements cause ambiguity
- the way repeats make an algorithm fail differs from algorithm to algorithm.
- about half of the human genome is covered by repeated sequences.
- summary: both scs and greedy_scs mishandle repeats, so we need alternative algorithms
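
A small check makes the collapsing concrete. The "genome" and the collapsed string below are made-up toy strings, assumed only for illustration. Every 6-mer read of the repetitive genome also occurs in a shorter string with one repeat copy missing, so the true genome cannot be the shortest common superstring of its reads.

```python
# toy repetitive "genome" (an assumption for illustration, not real data)
genome = 'a_long_long_long_time'
k = 6

# all distinct k-mer "reads" of the genome
reads = {genome[i:i + k] for i in range(len(genome) - k + 1)}

# the same string with one copy of the repeat collapsed away
collapsed = 'a_long_long_time'

# every read is still a substring of the shorter string
print(all(read in collapsed for read in reads))  # True
print(len(genome), len(collapsed))               # 21 16
```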

## Lecture: De Bruijn graphs and Eulerian walks

alternative algorithm: De Bruijn graph

Fig. 1 - De Bruijn Graph

<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/deb.png" alt="Local Image" width="500">
- make one node for each distinct word; the directed edges indicate the order in which the words follow each other.
- multigraph: there can be multiple edges between a given pair of nodes.

Fig. 2 - De Bruijn Graph for DNA

<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/dna_deb.png" alt="Local Image" width="500">
- extract all k-mers (k = size of the reads) from the genome.
- for each k-mer, extract its two (k-1)-mers: the left one (first k-1 characters) and the right one (last k-1 characters)
- add a node for each (k-1)-mer (if it doesn't already exist)
- make a directed edge from the left (k-1)-mer to the right (k-1)-mer

- each k-mer in the genome corresponds to an edge in the graph
- one node per distinct (k-1)-mer

Fig 3. Reconstruct the genome from the De Bruijn graph using an Eulerian walk

<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/e_walk.png" alt="Local Image" width="500">
- start at a node and follow the edges as you go.
- Eulerian walk: a walk that crosses each edge exactly once.
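
One way to sketch an Eulerian walk in code is Hierholzer's algorithm, shown below. This is not code from the lecture. The helper names (`de_bruijn_edges`, `eulerian_walk`, `spell`) are made up for this example, and the sketch assumes that the graph actually has an Eulerian walk.

```python
from collections import Counter, defaultdict


def de_bruijn_edges(seq, k):
    """One (left (k-1)-mer, right (k-1)-mer) edge per k-mer of seq."""
    return [(seq[i:i + k - 1], seq[i + 1:i + k])
            for i in range(len(seq) - k + 1)]


def eulerian_walk(edges):
    """Return the nodes of a walk that uses every edge exactly once
    (Hierholzer's algorithm; assumes such a walk exists)."""
    graph = defaultdict(list)
    balance = Counter()  # out-degree minus in-degree for each node
    for src, dst in edges:
        graph[src].append(dst)
        balance[src] += 1
        balance[dst] -= 1
    # start at the node with one extra outgoing edge, if there is one
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            walk.append(stack.pop())         # dead end: node is finished
    return walk[::-1]


def spell(walk):
    """Spell the sequence: first node plus the last character of the rest."""
    return walk[0] + ''.join(node[-1] for node in walk[1:])


genome = 'ACGCGTCG'
edges = de_bruijn_edges(genome, 3)
reconstructed = spell(eulerian_walk(edges))
print(reconstructed)
```

The reconstructed string uses exactly the same 3-mers as the input, but it is not guaranteed to be the input itself. This graph has more than one Eulerian walk because the node CG is visited several times.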

## Practical: Building a De Bruijn graph

In [86]:
def de_bruijn_ize(seq, k):
    edges = []
    nodes = set() # no duplicates

    # build all k-mers of the given string (the last one starts at len(seq)-k)
    for i in range(len(seq)-k+1):
        # append a tuple of both (k-1)-mers: [0] is the left one, [1] is the right one.
        # (a, b) means a points to b
        edges.append((seq[i:i+k-1], seq[i+1:i+k]))
        # add the (k-1)mers to the set of nodes
        nodes.add(seq[i:i+k-1])
        nodes.add(seq[i+1:i+k])
    return nodes, edges

In [87]:
nodes, edges = de_bruijn_ize('ACGCGTCG', 3)
print(nodes)

{'GT', 'TC', 'AC', 'CG', 'GC'}


In [88]:
print(edges) # each (a, b) pair is a directed edge from a to b

[('AC', 'CG'), ('CG', 'GC'), ('GC', 'CG'), ('CG', 'GT'), ('GT', 'TC'), ('TC', 'CG')]


In [89]:
# visualize the De Bruijn graph (to trace an Eulerian walk by eye)

def visualize_de_bruijn(st, k):
    """ Visualize a directed multigraph using graphviz """
    nodes, edges = de_bruijn_ize(st, k)
    dot_str = 'digraph "DeBruijn graph" {\n'
    for node in nodes:
        dot_str += '  %s [label="%s"] ;\n' % (node, node)
    for src, dst in edges:
        dot_str += '  %s -> %s ;\n' % (src, dst)
    return dot_str + '}\n'

In [90]:
!wget https://raw.github.com/cjdrake/ipython-magic/master/gvmagic.py

In [96]:
%load_ext gvmagic

The gvmagic extension is already loaded. To reload it, use:
  %reload_ext gvmagic


The extension didn't work for me, so I just used this image of the graph instead:

<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/map.png" alt="Local Image" width="200">

## Lecture: When Eulerian walks go wrong

Fig. 1: Does the De Bruijn graph approach overcollapse repeats?
- Sometimes not:
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/deb.png" alt="Local Image" width="500">

- We might not get overcollapsed repeats, but that doesn't mean the third law of assembly goes away. Repeats cause other issues:

    - Repeats mean there can be multiple Eulerian walks through the graph. This introduces ambiguity: only one walk spells the correct genome sequence, and we don't know which one it is. Every other walk is a wrong shuffling of the genome.
    <img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/deb.png" alt="Local Image" width="500">

    - By decreasing the k-mer length, we increase the chance of repeated (k-1)-mers and therefore of multiple Eulerian walks. The longer the k-mer, the less ambiguity.
    <img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/deb.png" alt="Local Image" width="500">

    - Real sequencing data violates the assumptions behind the Eulerian walk approach:
        - sequencing reads (100s of bp) are much longer than the k-mer length (typically 30-50), so the reads have to be chopped into k-mers.
        - there are sequencing errors, and not every section of the genome has good coverage
        - any such errors or gaps mean that, in practice, we end up with a non-Eulerian graph
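
To see how a single erroneous read can make the graph non-Eulerian, here is a small degree check. It is a sketch under stated assumptions: the function names are made up, the erroneous read `GTCA` is invented for illustration, and the check only looks at node degrees (it ignores whether the graph is connected).

```python
from collections import Counter


def kmer_edges(seq, k):
    """One (left (k-1)-mer, right (k-1)-mer) edge per k-mer of seq."""
    return [(seq[i:i + k - 1], seq[i + 1:i + k])
            for i in range(len(seq) - k + 1)]


def degrees_allow_eulerian_walk(edges):
    """True if the node degrees permit an Eulerian walk: every node is
    balanced, or exactly one node has one extra outgoing edge and exactly
    one node has one extra incoming edge."""
    balance = Counter()  # out-degree minus in-degree
    for src, dst in edges:
        balance[src] += 1
        balance[dst] -= 1
    unbalanced = sorted(b for b in balance.values() if b != 0)
    return unbalanced in ([], [-1, 1])


clean = kmer_edges('ACGCGTCG', 3)
# hypothetical read with a sequencing error (true sequence: GTCG)
with_error = clean + kmer_edges('GTCA', 3)

print(degrees_allow_eulerian_walk(clean))       # True
print(degrees_allow_eulerian_walk(with_error))  # False
```

The erroneous k-mer TCA adds a dead-end node CA, and the extra copy of GTC unbalances GT, so no walk can use every edge exactly once.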

## Lecture: Assemblers in practice

Objective: How do real assemblers deal with repeats and other challenges? 

Fig. 1 Two types of systems, both are graph-based

<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/deb.png" alt="Local Image" width="500">

Fig. 2 De Bruijn-based assembly

<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/deb.png" alt="Local Image" width="500">

- sequencing errors cause diversions in the graph that end in dead ends. (fig 3)
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/deb.png" alt="Local Image" width="500">

- the graphs can contain edges which don't tell us any new information
- the green edge is implied by the blue edges (fig 4)
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/deb.png" alt="Local Image" width="500">

- polyploidy may cause additional "bubbles" in the graph:
    - a single path diverges at the point where the reads start to differ, e.g. because of one different base between the copies of a chromosome
    - you can try to fix this by getting rid of the "bubbles" and putting a note there
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/deb.png" alt="Local Image" width="500">

- repeats cause ambiguity:
    - scs: overcollapses repeats
    - de bruijn: shuffles around bits of the genome between repetitive bits.
    - how do you deal with it?
        - chopping the assembly into pieces
            - there will be pieces of the graph/assembly where there is no ambiguity. Reconstruct those parts.
            - for ambiguous parts, we know what sequence was there, just not the number of times it was repeated.
            - assemble each of these unambiguous pieces into contigs.
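
One simplified way to picture "chopping the assembly into pieces" is to cut a De Bruijn graph at every branching node and spell out each non-branching path as a contig. The sketch below is only an illustration of that idea, not how any particular assembler does it. The function names are hypothetical, and isolated cycles are ignored.

```python
from collections import defaultdict


def kmer_edges(seq, k):
    """One (left (k-1)-mer, right (k-1)-mer) edge per k-mer of seq."""
    return [(seq[i:i + k - 1], seq[i + 1:i + k])
            for i in range(len(seq) - k + 1)]


def non_branching_paths(edges):
    """Maximal paths whose inner nodes have exactly one incoming and one
    outgoing edge (isolated cycles are not reported in this sketch)."""
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for src, dst in edges:
        out_edges[src].append(dst)
        in_degree[dst] += 1

    def is_simple(node):
        return in_degree[node] == 1 and len(out_edges[node]) == 1

    paths = []
    for node in list(out_edges):
        if is_simple(node):
            continue  # paths only start at branching or end nodes
        for nxt in out_edges[node]:
            path = [node, nxt]
            # extend through unambiguous nodes until the next branch point
            while is_simple(path[-1]):
                path.append(out_edges[path[-1]][0])
            paths.append(path)
    return paths


def spell(path):
    """Spell the sequence: first node plus the last character of the rest."""
    return path[0] + ''.join(node[-1] for node in path[1:])


contigs = sorted(spell(p) for p in non_branching_paths(kmer_edges('ACGCGTCG', 3)))
print(contigs)  # ['ACG', 'CGCG', 'CGTCG']
```

Each contig is unambiguous on its own. What the graph cannot tell us is the order in which the pieces that meet at the repeated node CG fit together.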

<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/deb.png" alt="Local Image" width="500">
