Skip to content
/ odgi Public
forked from pangenome/odgi

optimized dynamic genome graph implementation

License

Notifications You must be signed in to change notification settings

mandosoft/odgi

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

odgi

install with bioconda

optimized dynamic genome/graph implemenation

Representing large genomic variation graphs with minimal memory overhead requires a careful encoding of the graph entities. It is possible to build succinct, static data structures to store queryable graphs, as in xg, but dynamic data structures are more tricky to implement.

odgi follows the dynamic GBWT in developing a byte-packed version of the graph and paths through it. Each node is represented by a byte array into which we write variable length integers to represent, 1) the node sequence, 2) its edges, and 3) the paths crossing the node.

The edges and path steps are recorded relativistically, as deltas between the current node id and the target node id, where the node id corresponds to the rank in the global array of nodes. Graphs built from biological data sets tend to have local partial order, and when sorted the stored deltas will tend to be small. This allows them to be compressed with a variable length integer representation, resulting in a small in-memory footprint at the cost of packing and unpacking.

The savings are substantial. In partially ordered regions of the graph, most deltas will require only a single byte. The resulting implementation is able to load the whole genome 1000 Genomes Project graph (described in our publication on vg in Nature Biotechnology) in around 20GB of RAM. Initially, odgi has been developed to allow in-memory manipulation of graphs produced by the seqwish variation graph inducer.

building

git clone --recursive https://github.com/vgteam/odgi.git
cd odgi
cmake -H. -Bbuild && cmake --build build -- -j 3

To build a static executable, use:

cmake -DBUILD_STATIC=1 -H. -Bbuild && cmake --build build -- -j 3

You'll need to set this flag to 0 or remove and rebuild your build directory if you want to unset this build behavior and get a dynamic binary again. Static builds are unlikely to be supported on OSX, and require appropriate static libraries on linux.

supported functionality

Currently, odgi includes subtools that allow the import of graphs in GFA format (odgi build), the extraction of summary statistics about the graph (odgi stats), topologically sorting the graph (odgi sort), export of the graph in GFA and other formats (odgi view), and enumeration and indexing of kmers (odgi kmers).

tests

Unittests from vg have been ported here and are used to validate the behavior of the algorithm. They can be run via odgi test.

API

odgi::graph_t is a MutablePathDeletableHandleGraph in the generic variation graph handle graph hierarchical API model. As such, it is possible to add, delete, and modify nodes, edges, and paths through the graph. Wherever possible, destructive operations on the graph maintain path validity.

name

odgi is a play on the Italian word "oggi" (/ˈɔd.dʒi/), which means "today". As of 2019, a standard refrain in genomics is that genome graphs will be useful in x years. But, if we make them efficient and scalable, they will be useful today.

About

optimized dynamic genome graph implementation

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • C++ 97.7%
  • CMake 2.3%