Practical complications that make genome assembly more difficult:

1. DNA is double-stranded, and we have no way of knowing a priori which strand a given read derives from, meaning that we will not know whether to use a read or its reverse complement when assembling a particular strand of a genome. 

2. Modern sequencing machines are not perfect, and the reads that they generate often contain errors. Sequencing errors complicate genome assembly because they prevent us from identifying all overlapping reads. 

3. Some regions of the genome may not be covered by any reads, making it impossible to reconstruct the entire genome.
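For the first complication, a common workaround is to consider every read together with its reverse complement. The following is a minimal sketch, assuming reads contain only the letters A, C, G, and T; the name `ReverseComplement` is ours and is not used elsewhere in this notebook.

```python
def ReverseComplement(read):
    '''Returns the reverse complement of a DNA string over the alphabet ACGT'''
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    # walk the read backwards and complement every base
    return ''.join(complement[base] for base in reversed(read))

print(ReverseComplement('AAAACCCGGT'))  # ACCGGGTTTT
```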

In [1]:
# 3a String Composition

def StringComposition(k, text):
    return sorted([text[i:i+k] for i in range(len(text) - k + 1)])

In [2]:
StringComposition(5, 'CAATCCAAC')

['AATCC', 'ATCCA', 'CAATC', 'CCAAC', 'TCCAA']

In [None]:
test_file = 'rosalind_ba3a.txt'
k_text = []
with open(test_file, 'r') as reader:
    for line in reader:
        k_text.append(line.strip('\n'))
for string in StringComposition(int(k_text[0]), k_text[1]):
    print(string)

Repeats complicate genome assembly

In [2]:
# 3b Reconstruct a String from its Genome Path
def ReconstructString(kmers):
    return kmers[0] + ''.join([kmer[-1] for kmer in kmers[1:]])

In [4]:
ReconstructString(['ACCGA', 'CCGAA', 'CGAAG', 'GAAGC', 'AAGCT'])

'ACCGAAGCT'

In [None]:
test_file = 'rosalind_ba3b.txt'
kmers = []
with open(test_file, 'r') as reader:
    for line in reader:
        kmers.append(line.strip('\n'))
ReconstructString(kmers)

In [3]:
# 3c Overlap Graph

def OverlapGraph(kmers):
    graphs = {}
    for kmer in kmers:
        if kmer not in graphs:
            rest_kmers = list(filter(lambda x: x != kmer, kmers))
            edges = [rest_kmer for rest_kmer in rest_kmers if kmer[1:] == rest_kmer[:-1]]
            if len(edges):
                graphs[kmer] = edges
    return graphs

In [6]:
OverlapGraph(['ATGCG', 'GCATG', 'CATGC', 'AGGCA', 'GGCAT'])

{'AGGCA': ['GGCAT'],
 'CATGC': ['ATGCG'],
 'GCATG': ['CATGC'],
 'GGCAT': ['GCATG']}

In [None]:
test_file = 'rosalind_ba3c.txt'
kmers = []
with open(test_file, 'r') as reader:
    for line in reader:
        kmers.append(line.strip('\n'))
for kmer, edges in OverlapGraph(kmers).items():
    # a k-mer may overlap several other k-mers, so print every edge
    for edge in edges:
        print(kmer + ' -> ' + edge)

__Hamiltonian Path Problem__: Construct a Hamiltonian path in a graph.

__Input__: A directed graph.

__Output__: A path visiting every node in the graph exactly once (if such a path exists).

A binary string is a string composed only of 0’s and 1’s; a binary string is k-universal if it contains every binary k-mer exactly once. For example, 0001110100 is a 3-universal string, as it contains each of the eight binary 3-mers (000, 001, 011, 111, 110, 101, 010, and 100) exactly once.

Finding a k-universal string is equivalent to solving the String Reconstruction Problem when the k-mer composition is the collection of all binary k-mers. Thus, finding a k-universal string can be reduced to finding a Hamiltonian path in the overlap graph formed on all binary k-mers (see the figure below).
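The reduction can be tried out on small inputs. The sketch below builds the overlap graph on all binary k-mers and looks for a Hamiltonian path by plain backtracking. The function names are ours, and the search is a brute-force illustration that takes exponential time in the worst case, so it is only practical for small k.

```python
from itertools import product

def IsUniversal(text, k):
    '''Checks that a binary string contains every binary k-mer exactly once'''
    kmers = [text[i:i+k] for i in range(len(text) - k + 1)]
    return sorted(kmers) == sorted(''.join(p) for p in product('01', repeat=k))

def HamiltonianPath(graph, path):
    '''Extends path by backtracking until it visits every node of graph once'''
    if len(path) == len(graph):
        return path
    for neighbor in graph[path[-1]]:
        if neighbor not in path:
            result = HamiltonianPath(graph, path + [neighbor])
            if result is not None:
                return result
    return None

def UniversalStringByHamiltonianPath(k):
    '''Spells a k-universal string from a Hamiltonian path in the overlap graph'''
    kmers = [''.join(p) for p in product('01', repeat=k)]
    # overlap graph: edge from a to b when the suffix of a equals the prefix of b
    graph = {a: [b for b in kmers if a[1:] == b[:-1]] for a in kmers}
    for start in kmers:
        path = HamiltonianPath(graph, [start])
        if path is not None:
            return path[0] + ''.join(kmer[-1] for kmer in path[1:])
    return None

print(IsUniversal('0001110100', 3))                         # True
print(IsUniversal(UniversalStringByHamiltonianPath(3), 3))  # True
```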

In [4]:
# 3d De Bruijn Graph

def generateEdges(kmer, kmers):
    rest_kmers = list(filter(lambda x: x != kmer, kmers))
    edges = [rest_kmer[:-1] for rest_kmer in rest_kmers if kmer[1:] == rest_kmer[:-1]]
    
    return edges

def DeBruijnFromString(k, text):
    kmers = [text[i:i+k] for i in range(len(text) - k + 1)]
    graphs = {}
    for kmer in kmers:
        if kmer[:-1] in graphs:
            graphs[kmer[:-1]].add(kmer[1:])
        else:
            graphs[kmer[:-1]] = {kmer[1:]}

    return graphs

In [5]:
DeBruijnFromString(4, 'AAGATTCTCTAC')

{'AAG': {'AGA'},
 'AGA': {'GAT'},
 'ATT': {'TTC'},
 'CTA': {'TAC'},
 'CTC': {'TCT'},
 'GAT': {'ATT'},
 'TCT': {'CTA', 'CTC'},
 'TTC': {'TCT'}}

In [6]:
import networkx as nx
nx.test()

In [100]:
test_file = 'rosalind_ba3d.txt'
with open(test_file, 'r') as reader:
    k = int(reader.readline())
    text = reader.readline().strip()
with open('debruijnstring.txt', 'a') as debruijn_file:
    debruijn = [' -> '.join([item[0], ','.join(item[1])]) for item in DeBruijnFromString(k, text).items()]
    debruijn_file.write('\n'.join(debruijn))


In [None]:
DeBruijnFromString(12, 'CTGAAGACCTCTCCACATTACTACGATATAAATCATTTCAGCCTCTAGATACGCCTTGGTGGGTGGGGTTGGCAATTTACGATATGTCCGAATGATTTGACACCAAATACCTTAGCTAGCCCCAAGGAAAATTCTGGGCTTTACGTTGGCCGAGCCACATTACTACAGTAAGGTTAAGCAACCAGCCAGTCGCTCATAAGGACTCCACGCCTCCCGTTACTGACTTCCAACAACAATGTGACAGTAGACTGGAACCTGGGAGGACATTATTGATTCGCCGCGAATCTTCTAAGGTATTTTACCCCCACTGGTCACCTTAACCATTAAGACCTCGAAGTGACACCTAGCCTCTTAACACCCAACTCCACCGACAATACCTATTCGCTGACAAGCGGGACATCCGATCGCCCCTGACTCGAGGTGTCTACCGTCCATCGATTGCTAAACTTTGTTAGGAGTCTAAGCGAACCATGGGAAGGGGGCGGCAGTCAACGTGCTCCTTTAGTGAGGTACCATATTCTTACAGCATGTGGAGCGCAGCAAACTAGCGACCGGGAGTACTCCCACAACCCTGGGTACGTACTGCACTTTTTTCAAGAGCCAGGGTCATTTAAATAGCATCTTTGCTCTTTCTGATAAGGGGGCGACCATCTCCGAATTGAGCCAAACGCTGGTATAAGACTCGTCTCATGACTCCCTAGCCATTTGTATGTTGTCATTTCTGATTTTAGCAGGTAAAACGTAAGGCCTGCTAAAGAATCACGCGGGGAGGCCTTAAATTTCGTCATGGAGCAATCGTCCTAGATTGCTGTGAAGGTTCGTACCAGTAGAGTCTAATGTGCGTAAATGTTAACTGGCCGTATATTCTCTGGTGAGCTGAAACAGAAAGCTGGCAGAAAGCCACTCTTGCTGTTTCGTGTGTACGGACATCGGGATAGTACCAAAAAGCATGTTCTTCATCTGGCGATGCTTGATGTCTACCGTAGACACCTTCATACGT')

Consider the following questions.

1. If we gave you the de Bruijn graph DeBruijn_k(Text) without giving you Text, could you reconstruct Text?
2. Construct the de Bruijn graphs DeBruijn_2(TAATGCCATGGGATGTT), DeBruijn_3(TAATGCCATGGGATGTT), and DeBruijn_4(TAATGCCATGGGATGTT). What do you notice?
3. How does the graph DeBruijn_3(TAATGCCATGGGATGTT) compare to DeBruijn_3(TAATGGGATGCCATGTT)?

Solving the String Reconstruction Problem reduces to finding a path in the de Bruijn graph that visits every _edge_ exactly once.

__Eulerian Path Problem__: Construct an Eulerian path in a graph.

__Input__: A directed graph.

__Output__: A path visiting every edge in the graph exactly once (if it exists).

For an arbitrary string Text, we define CompositionGraph_k(Text) as the graph consisting of |Text| − k + 1 isolated edges, where edges are labeled by the k-mers in Text; every edge labeled by a k-mer connects nodes labeled by the prefix and suffix of this k-mer. The graph CompositionGraph_k(Text) is just a collection of isolated edges representing the k-mers in the k-mer composition of Text, meaning that we can construct CompositionGraph_k(Text) from the k-mer composition of Text. Gluing nodes with the same label in CompositionGraph_k(Text) produces DeBruijn_k(Text). Thus, we can construct the de Bruijn graph of a genome without knowing the genome!
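The gluing step can be sketched in a few lines. The helper names `CompositionGraph` and `GlueNodes` are ours, and an isolated edge is represented simply as a (prefix, suffix) tuple. Sorting the edges before gluing discards the order in which the k-mers occur in Text, which illustrates that only the composition is needed.

```python
def CompositionGraph(k, text):
    '''Returns one isolated edge (prefix, suffix) per k-mer occurrence in text'''
    kmers = [text[i:i+k] for i in range(len(text) - k + 1)]
    return [(kmer[:-1], kmer[1:]) for kmer in kmers]

def GlueNodes(edges):
    '''Merges nodes that carry the same label, giving an adjacency list'''
    glued = {}
    for prefix, suffix in edges:
        glued.setdefault(prefix, []).append(suffix)
    return glued

edges = CompositionGraph(4, 'AAGATTCTCTAC')
# sorted() throws away the original order of the k-mers
print(GlueNodes(sorted(edges)))
```

For this input the glued graph has the same nodes and edges as the result of `DeBruijnFromString(4, 'AAGATTCTCTAC')` shown earlier.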

__DeBruijn Graph from k-mers Problem__: Construct the de Bruijn graph from a set of k-mers.

__Input__: A collection of k-mers Patterns.

__Output__: The adjacency list of the de Bruijn graph DeBruijn(Patterns).

In [9]:
# 3e De Bruijn Graph from Kmers
def DeBruijnFromKmers(kmers):
    graphs = {kmer[:-1]: [] for kmer in kmers}
    for kmer in kmers:
        graphs[kmer[:-1]].append(kmer[1:])
    return graphs

In [10]:
DeBruijnFromKmers(['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG'])

{'AGG': ['GGG'],
 'CAG': ['AGG', 'AGG'],
 'GAG': ['AGG'],
 'GGA': ['GAG'],
 'GGG': ['GGG', 'GGA']}

In [None]:
test_file = 'rosalind_ba3e.txt'
kmers = []
with open(test_file, 'r') as reader:
    for line in reader:
        kmers.append(line.strip('\n'))
#debruijn_file = open('debruijnskmers.txt', 'a')
debruijn = [' -> '.join([item[0], ','.join(item[1])]) for item in DeBruijnFromKmers(kmers).items()]
#debruijn_file.write('\n'.join(debruijn))
#debruijn_file.close()
debruijn

We have already defined an Eulerian path as a path in a graph traversing each edge of a graph exactly once. A cycle that traverses each edge of a graph exactly once is called an Eulerian cycle, and we say that a graph containing such a cycle is Eulerian.

__Eulerian Cycle Problem__: Find an Eulerian cycle in a graph.

__Input__: A graph.

__Output__: An Eulerian cycle in this graph, if one exists.

Computer scientists classify an algorithm as polynomial if its running time can be bounded by a polynomial in the length of the input data. On the other hand, an algorithm is exponential if its runtime on some datasets is exponential in the length of the input data. Although Euler's algorithm is polynomial, the Hamiltonian Path Problem belongs to a special class of problems for which all attempts to develop a polynomial algorithm have failed.

Thus, in order for a graph to be Eulerian, the number of incoming edges at any node must be equal to the number of outgoing edges at that node. We define the indegree and outdegree of a node v (denoted in(v) and out(v), respectively) as the number of edges leading into and out of v. 

A node v is __balanced__ if __in(v) = out(v)__, and a graph is balanced if all its nodes are balanced.

The graph in the figure below is balanced but not Eulerian because it is __disconnected__, meaning that some nodes cannot be reached from other nodes. In any disconnected graph, it is impossible to find an Eulerian cycle. In contrast, we say that a directed graph is strongly connected if it is possible to reach any node from every other node. We now know that an Eulerian graph must be both balanced and __strongly connected__. Euler’s Theorem states that these two conditions are sufficient to guarantee that an arbitrary graph is Eulerian. As a result, we can determine whether a graph is Eulerian without ever having to draw any cycles.

__Euler’s Theorem__: Every balanced, strongly connected directed graph is Eulerian.
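The two conditions of the theorem can be checked directly on an adjacency list. The following is a sketch with function names of our own choosing. It tests strong connectivity by searching from one node in the graph and again in the reversed graph, which is one of several possible ways to do so.

```python
def IsBalanced(adj_list):
    '''Checks that in(v) = out(v) for every node v'''
    degree_diff = {}
    for node, targets in adj_list.items():
        degree_diff[node] = degree_diff.get(node, 0) + len(targets)
        for target in targets:
            degree_diff[target] = degree_diff.get(target, 0) - 1
    return all(diff == 0 for diff in degree_diff.values())

def Reachable(adj_list, start):
    '''Returns the set of nodes reachable from start (iterative depth-first search)'''
    seen, stack = {start}, [start]
    while stack:
        for target in adj_list.get(stack.pop(), []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

def IsEulerian(adj_list):
    '''Applies Euler's Theorem: the graph must be balanced and strongly connected'''
    nodes = set(adj_list) | {t for targets in adj_list.values() for t in targets}
    if not nodes:
        return False  # assumption: an empty graph is not treated as Eulerian
    reverse = {}
    for node, targets in adj_list.items():
        for target in targets:
            reverse.setdefault(target, []).append(node)
    start = next(iter(nodes))
    # strongly connected: start reaches every node, and every node reaches start
    strongly_connected = (Reachable(adj_list, start) == nodes and
                          Reachable(reverse, start) == nodes)
    return strongly_connected and IsBalanced(adj_list)

print(IsEulerian({0: [3], 1: [0], 2: [1, 6], 3: [2], 4: [2], 5: [4],
                  6: [5, 8], 7: [9], 8: [7], 9: [6]}))   # True
print(IsEulerian({0: [1], 1: [0], 2: [3], 3: [2]}))      # False: balanced but disconnected
```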

In [50]:
from random import randint

test_edges = {0:[3], 1:[0], 2:[1, 6], 3:[2], 4:[2], 5:[4], 6: [5, 8], 7:[9], 8:[7], 9:[6]}

def EulerianCycle(adj_list):
    '''Generates an Eulerian Cycle given adjacency list'''
    current_node = list(adj_list.keys())[randint(0, len(adj_list.keys())-1)]
    path = [current_node]
    
    # get the initial cycle
    while True:
        path.append(adj_list[current_node][0])
        
        if len(adj_list[current_node]) == 1:
            del adj_list[current_node]
        else:
            adj_list[current_node] = adj_list[current_node][1:]
        
        if path[-1] in adj_list:
            current_node = path[-1]
        else:
            break
    
    # continually expand the initial cycle until we're out of adjacency list
    while len(adj_list) > 0:

        for i in range(len(path)):

            if path[i] in adj_list:
                current_node = path[i]
                cycle = [current_node]
                
                while True:
                    cycle.append(adj_list[current_node][0])
                    
                    if len(adj_list[current_node]) == 1:
                        del adj_list[current_node]
                    else:
                        adj_list[current_node] = adj_list[current_node][1:]
                    
                    if cycle[-1] in adj_list:
                        current_node = cycle[-1]
                    else:
                        break
                path = path[:i] + cycle + path[i+1:]
    return path
EulerianCycle(test_edges)

[6, 5, 4, 2, 1, 0, 3, 2, 6, 8, 7, 9, 6]

In [49]:
edges = [(0, 3), (1, 0), (2, 1), (2, 6), (3, 2), (4, 2), (5, 4), (6, 5), (6, 8), (7, 9), (8, 7), (9, 6)]
# use a directed graph so that networkx respects the edge directions
G = nx.DiGraph()
G.add_edges_from(edges)
G.number_of_edges()
print(list(nx.eulerian_circuit(G)))


In [198]:
with open('rosalind_ba3f.txt') as input_data:
    edges = {}
    for edge in [line.strip().split(' -> ') for line in input_data.readlines()]:
        if ',' in edge[1]:
            edges[int(edge[0])] = list(map(int,edge[1].split(',')))
        else:
            edges[int(edge[0])] = [int(edge[1])]

# Get the Eulerian cycle.
path = EulerianCycle(edges)

# Print and save the answer.
#print('->'.join(map(str,path)))
with open('euleriancycle.txt', 'w') as output_data:
    output_data.write('->'.join(list(map(str,path))))

In [59]:
from functools import reduce
from copy import deepcopy
# 3g Eulerian Path

def EulerianPath(adj_list):
    '''Returns an Eulerian path given adjacency list (adj_list is modified in place)'''

    # Flatten all edge targets; a node's indegree is its count in this list.
    targets = reduce(lambda a, b: a+b, adj_list.values())

    # Determine the unbalanced nodes.
    for node in set(targets + list(adj_list.keys())):
        in_degree = targets.count(node)
        if node in adj_list:
            out_degree = len(adj_list[node])
        else:
            out_degree = 0

        if out_degree < in_degree:
            # more incoming than outgoing edges: the path ends here
            unbalanced_from = node
        elif in_degree < out_degree:
            # more outgoing than incoming edges: the path starts here
            unbalanced_to = node

    # add an edge from the end node to the start node so that the graph becomes balanced
    if unbalanced_from in adj_list:
        adj_list[unbalanced_from].append(unbalanced_to)
    else:
        adj_list[unbalanced_from] = [unbalanced_to]

    # Get the Eulerian cycle from the edges, including the added edge
    cycle = EulerianCycle(adj_list)

    # Find the location of the added edge in the Eulerian cycle
    divide_point = list(filter(lambda i: cycle[i:i+2] == [unbalanced_from, unbalanced_to], 
                              range(len(cycle) - 1)))[0]

    # Remove the added edge, and shift appropriately, overlapping the head and tail
    return cycle[divide_point+1:] + cycle[1:divide_point+1]

In [68]:
test_edges = {0:[2], 1:[3], 2:[1], 3:[0, 4], 6: [3, 7], 7:[8], 8:[9], 9:[6]}
edgesList = [(k, i) for k, l in test_edges.items() for i in l]
G = nx.DiGraph()
G.add_edges_from(edgesList)
G.number_of_edges()
print(list(nx.eulerian_path(G)))

[(6, 7), (7, 8), (8, 9), (9, 6), (6, 3), (3, 0), (0, 2), (2, 1), (1, 3), (3, 4)]


In [67]:
test_edges = {0:[2], 1:[3], 2:[1], 3:[0, 4], 6: [3, 7], 7:[8], 8:[9], 9:[6]}
'->'.join([str(node) for node in EulerianPath(deepcopy(test_edges))])



'6->7->8->9->6->3->0->2->1->3->4'

In [54]:
with open('rosalind_ba3g.txt') as input_data:
    edges = {}
    for edge in [line.strip().split(' -> ') for line in input_data.readlines()]:
        if ',' in edge[1]:
            edges[int(edge[0])] = list(map(int,edge[1].split(',')))
        else:
            edges[int(edge[0])] = [int(edge[1])]
print(edges)
# Get the Eulerian path.
path = EulerianPath(edges)

# Print and save the answer.
#print('->'.join(map(str,path)))
with open('eulerianpath.txt', 'w') as output_data:
    output_data.write('->'.join(list(map(str,path))))



We can therefore summarize this solution using the following pseudocode, which relies on three problems that we have already solved:

1. The de Bruijn Graph Construction Problem;
2. The Eulerian Path Problem;
3. The String Spelled by a Genome Path Problem.

    StringReconstruction(Patterns)
        dB ← DeBruijn(Patterns)
        path ← EulerianPath(dB)
        Text ← PathToGenome(path)
        return Text

In [14]:
# 3h String Reconstruction
def StringReconstruction(k, Patterns):
    dB = DeBruijnFromKmers(Patterns)
    path = EulerianPath(dB)
    
    return ReconstructString(path)

In [15]:
StringReconstruction(4, ['CTTA', 'ACCA', 'TACC', 'GGCT', 'GCTT', 'TTAC'])

'GGCTTACCA'

In [None]:
k_patterns = []
with open('rosalind_ba3h.txt', 'r') as data:
    for line in data:
        k_patterns.append(line.strip('\n'))
k = int(k_patterns[0])
patterns = k_patterns[1:]
StringReconstruction(k, patterns)

In [16]:
from itertools import product

def kUniversalCircularString(k):
    # create the edges
    adj_list = {}
    for kmer in [''.join(item) for item in list(product('01', repeat = k))]:
        if kmer[:-1] in adj_list:
            adj_list[kmer[:-1]].append(kmer[1:])
        else:
            adj_list[kmer[:-1]] = [kmer[1:]]
    # get the eulerian cycle, remove the repeated last entry for the associated path
    
    path = EulerianCycle(adj_list)
    
    return ''.join([item[0] for item in path[:-1]])

In [17]:
kUniversalCircularString(4)

'0110000100111101'

In [18]:
kUniversalCircularString(9)

'10010100001111011111001011111010011111011011011111100011111101011111110011111111101110001101110010101010010101110011101010011101110100101110101101010101101110110100110110101110111100010111100110111101010111101100010101100011001100011101100100100101100101101100110100100110101100111000100111001100111100100111101000101101000110001000110101000111001000111100000101100000111110000101110000110110000111000000111010000011010000101010000000001000000101000100001000100100000100101001000101001100000001100001001100100001'

__String Reconstruction from Read-Pairs Problem__: Reconstruct a string from its paired composition.

__Input__: A collection of paired k-mers PairedReads and an integer d.

__Output__: A string Text with (k,d)-mer composition equal to PairedReads (if such a string exists).

Not every Eulerian path in the paired de Bruijn graph constructed from a (k, d)-mer composition spells out a solution of the String Reconstruction from Read-Pairs Problem.
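One way to detect such a path is to spell the string formed by the first k-mers and the string formed by the second k-mers separately, and then to check that they agree wherever they overlap. The sketch below does this for an ordered list of (k, d)-mers given as tuples. The function name is ours, and it assumes the list contains more than d pairs, so that the two strings really do overlap.

```python
def StringSpelledByGappedPatterns(k, d, gapped_patterns):
    '''Returns the string spelled by an ordered list of (k,d)-mers, or None if they are inconsistent'''
    first = gapped_patterns[0][0] + ''.join(pair[0][-1] for pair in gapped_patterns[1:])
    second = gapped_patterns[0][1] + ''.join(pair[1][-1] for pair in gapped_patterns[1:])
    # second starts k + d positions after first, so the overlapping parts must be identical
    for i in range(k + d, len(first)):
        if first[i] != second[i - k - d]:
            return None
    return first + second[-(k + d):]

text = 'GTGGTCGTGAGATGTTGA'
k, d = 4, 2
# the (k,d)-mers of text, in the order in which they occur
pairs = [(text[i:i+k], text[i+k+d:i+2*k+d]) for i in range(len(text) - 2*k - d + 1)]
print(StringSpelledByGappedPatterns(k, d, pairs) == text)  # True
```

A return value of `None` signals that the candidate path does not spell a valid string, so another Eulerian path would have to be tried.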

In [64]:
paired_list = ['GAGA|TTGA', 'TCGT|GATG', 'CGTG|ATGT', 'TGGT|TGAG', 'GTGA|TGTT', 'GTGG|GTGA', 'TGAG|GTTG', 'GGTC|GAGA', 'GTCG|AGAT']

def StringReconstructionFromReadPairs(k, d, pairedKmers):
    pairedKmers = [pair.split('|') for pair in pairedKmers]
    # Construct adjacency list from paired kmers
    paired_adj_list = {}
    for pair in pairedKmers:
        if (pair[0][:-1], pair[1][:-1]) in paired_adj_list:
            paired_adj_list[(pair[0][:-1], pair[1][:-1])].append((pair[0][1:], pair[1][1:]))
        else:
            paired_adj_list[(pair[0][:-1], pair[1][:-1])] = [(pair[0][1:], pair[1][1:])]
    
    # Get Eulerian path from the paired adjacency list
    paired_path = EulerianPath(paired_adj_list)

    # Recombine the paths, accounting for their overlaps
    strings = [paired_path[0][i] + ''.join(map(lambda x: x[i][-1], paired_path[1:])) for i in range(2)]
    
    text = strings[0][:k+d] + strings[1]
    
    return text

In [65]:
StringReconstructionFromReadPairs(4, 2, paired_list)

'GTGGTCGTGAGATGTTGA'

In [None]:
test_file = 'rosalind_ba3j.txt'
k_d_pairedkmers = []
with open(test_file, 'r') as reader:
    for line in reader:
        k_d_pairedkmers.append(line.strip('\n'))
k = int(k_d_pairedkmers[0].split(' ')[0])
d = int(k_d_pairedkmers[0].split(' ')[1])
pairedKmers = k_d_pairedkmers[1:]
StringReconstructionFromReadPairs(k, d, pairedKmers)

Even after read breaking, most assemblies still have gaps in k-mer coverage, causing the de Bruijn graph to have missing edges, and so the search for an Eulerian path fails. In this case, biologists often settle on assembling contigs (long, contiguous segments of the genome) rather than entire chromosomes. For example, a typical bacterial sequencing project may result in about a hundred contigs, ranging in length from a few thousand to a few hundred thousand nucleotides. For most genomes, the order of these contigs along the genome remains unknown. Needless to say, biologists would prefer to have the entire genomic sequence, but the cost of ordering the contigs into a final assembly and closing the gaps using more expensive experimental methods is often prohibitive.

Fortunately, we can derive contigs from the de Bruijn graph. A path in a graph is called non-branching if in(v) = out(v) = 1 for each intermediate node v of this path, i.e., for each node except possibly the starting and ending node of a path. A maximal non-branching path is a non-branching path that cannot be extended into a longer non-branching path. We are interested in these paths because the strings of nucleotides that they spell out must be present in any assembly with a given k-mer composition. For this reason, contigs correspond to strings spelled by maximal non-branching paths in the de Bruijn graph. 
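The definition above translates into a short procedure: start at every node that is not a 1-in-1-out node, follow each outgoing edge, and keep extending while the node reached is 1-in-1-out. The following is a sketch under the assumption that the input is a list of k-mers (repeated k-mers give repeated edges). The name `Contigs` and the helper inside it are ours. Isolated cycles, in which every node is 1-in-1-out, are handled in a second pass.

```python
def Contigs(kmers):
    '''Returns the strings spelled by maximal non-branching paths in the de Bruijn graph of kmers'''
    graph, indegree = {}, {}
    for kmer in kmers:
        graph.setdefault(kmer[:-1], []).append(kmer[1:])
        indegree[kmer[1:]] = indegree.get(kmer[1:], 0) + 1

    def is_one_in_one_out(node):
        return indegree.get(node, 0) == 1 and len(graph.get(node, [])) == 1

    contigs, visited = [], set()
    # paths that start at a branching node, a source, or a sink
    for node in graph:
        if is_one_in_one_out(node):
            continue
        for target in graph[node]:
            contig = node + target[-1]
            while is_one_in_one_out(target):
                visited.add(target)
                target = graph[target][0]
                contig += target[-1]
            contigs.append(contig)

    # any 1-in-1-out node not visited so far lies on an isolated cycle
    for node in graph:
        if is_one_in_one_out(node) and node not in visited:
            visited.add(node)
            contig, target = node, graph[node][0]
            while target != node:
                visited.add(target)
                contig += target[-1]
                target = graph[target][0]
            contigs.append(contig + node[-1])  # close the cycle
    return sorted(contigs)

print(Contigs(['ATG', 'ATG', 'TGT', 'TGG', 'CAT', 'GGA', 'GAT', 'AGA']))
# ['AGA', 'ATG', 'ATG', 'CAT', 'GAT', 'TGGA', 'TGT']
```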