Skip to content
Isaac Turner edited this page Feb 25, 2015 · 4 revisions

The core data structure of McCortex is a de Bruijn Graph, which is a graph where nodes are kmers (words of length k) and edges exist between all kmers which overlap by k-1 characters. Kmer size k is fixed for any given graph (more on kmer size). The kmers in genome assembly use bases (i.e. the DNA letters A,C,G,T).

An example with k=5:

ACCAT -> CCATC

which has two nodes (ACCAT,CCATC) and one edge between them. Traversing this graph from left to right we assemble the sequence ACCATC. From right to left we assemble the reverse complement GATGGT.

Each node in the McCortex de Bruijn graph represents both the kmer and is reverse complement. We refer to this representation as the kmer-key (lexically lowest of the two). Nodes can then be traversed with an orientation FORWARD (the kmer-key) or REVERSE (the reverse complement of the kmer-key).

By only allowing odd values of parameter k, a kmer can never be its own reverse complement. This is why --kmer-size parameter must always be odd.

Data structures

There are various ways to represent a de Bruijn graph in memory, the most obvious being a hash table where the key is the kmer-key. More compact representations have been developed and present different pay offs in speed and required memory.

More information on de Bruijn graphs: