Graph Summarization Development Options #5
You might think about subclassing the XG object somehow, or adding additional indexes on top of / independently of it.
I am considering how to implement the web backend. There are three choices, described above.
An alternative idea is to have our own database for the summary layers. To summarize graphs in the database, we would need to prefetch all graphs into our database. The upside is that we can easily update our data model for visualization; the downside is that it might be redundant and might not stay compatible with new versions.
@6br I think we should build from, not into, the xg index. We can extend it with these kinds of indexes, maybe by precomputing summarized views and storing them in separate indexes, then translating between them as needed with a separate system that links them together. I think the idea of keeping your own database is fine too, and basically the same as this; it all depends on how much you optimize it for this particular application.
I believe these summary graphs will be useful for more than just visualization. Fundamentally, what I'm designing it to do is to group together haplotype blocks and iteratively find informative boundaries. That would be a very useful precompute step for any other analysis. Working with fewer, less noisy nodes would be a good place to start for other researchers. The difference between options 1 and 2 is most obvious when it comes to updates: when the XG standard format is updated, does the summary graph code also get updated before a release? Do we lock the version numbering together?
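(Purely as an illustration of the haplotype-block idea above, and not the actual summarization algorithm: one simple reading is that a block boundary falls wherever the set of haplotypes traversing consecutive nodes changes. The Python sketch below assumes a linear node order and a per-node haplotype set, both hypothetical.)

```python
# Toy illustration only: merge consecutive nodes traversed by the same set of
# haplotypes into a single block. The input structure and the merging rule are
# assumptions for illustration, not the project's algorithm.

def group_haplotype_blocks(node_order, haplotypes_at_node):
    """node_order: node ids in linear order.
    haplotypes_at_node: dict mapping node id -> frozenset of haplotype names.
    Returns a list of blocks, each a list of node ids sharing one haplotype set."""
    blocks = []
    current_block, current_set = [], None
    for node in node_order:
        hap_set = haplotypes_at_node[node]
        if hap_set == current_set:
            current_block.append(node)
        else:
            if current_block:
                blocks.append(current_block)
            current_block, current_set = [node], hap_set
    if current_block:
        blocks.append(current_block)
    return blocks

# Example: nodes 1 and 2 share haplotypes {A, B}, node 3 is only on A,
# so the result is [[1, 2], [3]].
print(group_haplotype_blocks(
    [1, 2, 3],
    {1: frozenset({"A", "B"}), 2: frozenset({"A", "B"}), 3: frozenset({"A"})},
))
```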
Further discussion over the following month has led pretty conclusively to an approach where we support import from both XG and GFA. Our database can handle the "new feature" concepts of links between summary layers and aggregation of paths into haplotypes and ribbons. Then we can export to the XG file format and still be interoperable with other tools. This seems the best of both worlds and doesn't put too much development burden on being able to add features. Each summarization layer can be exported as its own graph, but since other tools don't have the concept of linking graphs in a hierarchy, the exported layers won't be linked. This means that even if you don't care about our visualizations, it'll still be a useful tool for graph scrubbing or summarization. We can have a follow-up discussion about the feature differences between GFA and XG and whether either of those technical details conflicts with something we're doing in the database concepts. I personally think that discussion will be clearer once we have a functioning tool, so I'll skip the speculation for now, just implement what is possible, and see if there are any substantial snags along the way.
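(A hypothetical skeleton of the import, summarize, and export flow described above. All names are invented for illustration; this is not the project's actual code.)

```python
# Sketch only: import a graph into an internal base layer, build coarser
# summary layers that link down to the nodes they summarize, and export any
# single layer back out as a standalone graph.

class SummaryLayer:
    """One level of summarization; nodes link down to the layer below."""
    def __init__(self, level):
        self.level = level
        self.nodes = {}      # summary node id -> list of child node ids
        self.child = None    # next, more detailed layer (None for the base graph)

def import_graph(path):
    """Parse a GFA (or XG-exported) file into the base layer. Placeholder."""
    base = SummaryLayer(level=0)
    # ... populate base.nodes from the file ...
    return base

def build_summary(layer):
    """Produce a coarser layer whose nodes point at groups of child nodes."""
    parent = SummaryLayer(level=layer.level + 1)
    parent.child = layer
    # ... grouping logic (e.g. haplotype blocks) would go here ...
    return parent

def export_gfa(layer, path):
    """Write one layer as a standalone GFA file. The cross-layer links are
    kept only internally, since GFA has no concept of a layer hierarchy."""
    # ... serialize layer.nodes as GFA segments and links ...
    pass
```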
To store the graphs in RAM efficiently you might be looking at something very similar to the various HandleGraph implementations like xg. GFA is only an interchange format. The HandleGraphs are self-indexes that allow random access by any feature of the graph.

XG presents a HandleGraph interface, but is unique in allowing random query of path (sequence or reference) positions. We can find what paths are at a given node in O(1) time, and what node is at a given path position in something like O(log N) time, where N is the size of the path. The graph topology is packed into a single vector that supports efficient random lookup and cache-efficient O(1) relativistic traversal.

At the BioHackathon I will be implementing a server API on top of xg. It will expose a HandleGraph API. Let me know if y'all have any ideas for queries needed by the visualization service. I can add them to the API if they are not already implemented there.
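(As a rough illustration of the query complexities Erik describes, paths at a node in O(1) and node at a path position in about O(log N), here is a toy index in Python. It is not the libhandlegraph or xg API; the real implementations use succinct, compressed structures rather than dictionaries and sorted lists.)

```python
import bisect

# Toy path/position index, for illustration only; it mirrors the query
# complexities, not the storage layout, of xg.

class ToyPathIndex:
    def __init__(self):
        self.paths_at_node = {}   # node id -> set of path names (O(1) lookup)
        self.path_offsets = {}    # path name -> (sorted start offsets, node ids)

    def add_path(self, name, steps):
        """steps: list of (node_id, node_length) in path order."""
        offsets, nodes, pos = [], [], 0
        for node_id, length in steps:
            self.paths_at_node.setdefault(node_id, set()).add(name)
            offsets.append(pos)
            nodes.append(node_id)
            pos += length
        self.path_offsets[name] = (offsets, nodes)

    def paths_on_node(self, node_id):
        """Which paths touch this node? Average O(1)."""
        return self.paths_at_node.get(node_id, set())

    def node_at_position(self, name, position):
        """Which node covers this path position? O(log N) via binary search."""
        offsets, nodes = self.path_offsets[name]
        return nodes[bisect.bisect_right(offsets, position) - 1]

idx = ToyPathIndex()
idx.add_path("chr1", [(1, 8), (2, 1), (3, 5)])
print(idx.paths_on_node(2))             # {'chr1'}
print(idx.node_at_position("chr1", 8))  # 2 (the 1 bp node)
```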
Thank you for your comment, Erik. HandleGraph is probably relevant to us. Does this require storing the whole graph in memory? I don't quite have the mental bandwidth to think through it all right now, so please allow me to think out loud. I'm pretty sure a database would have the same O(1) Node retrieval or O(log N) binary search on paths. The difference may be that a DB handles paging to memory automatically. Or there may be no difference at all, in which case, yeah, I don't want to reinvent the wheel.

The factor driving the database decision is that we'll need a database to contain links and concepts not handled by XG. Fundamentally, we need a place to add new features. In order to have links between nodes in different summarization layers, I need to have two nodes in a DB and a link between them. In practical terms, that means every single Node needs to be present as a copy in the DB. Sure, I could skip storing sequence in them, even skip upstream and downstream connections. But at the point where you already have a data structure with every node, it seems you're 90% of the way to just handling the whole dataset internally, with an XG-mediated import option.

Sebastian recently suggested we collaborate to build a standard file format for summary levels. If we had a summary node link map (which is just a tree) plus XG retrieval, we could technically have all the data without a database, though it would be spread across many different files. The reason I'd decide against a pure file solution comes down to migrations. If I have a database schema, Django automatically generates migration scripts that update all data from any version. If we code file formats by hand, then all schema changes are breaking changes, or a lot of development time goes into writing version migrations by hand and hoping you never make a mistake that corrupts your users' data. It seems like with import/export from an internal database I get the best of both worlds, with clearly defined boundaries of responsibility.
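(To make the "two nodes in a DB and a link between them" idea concrete, a minimal Django-style schema might look like the sketch below. Model and field names are hypothetical, not the project's actual schema; the point is that Django migrations would then track any schema change automatically.)

```python
# Hypothetical Django models, for illustration only; belongs in an app's
# models.py. Field and model names are invented and may differ from the
# real schema.
from django.db import models

class GraphGenome(models.Model):
    name = models.CharField(max_length=255)

class ZoomLevel(models.Model):
    """One summarization layer of a genome graph."""
    genome = models.ForeignKey(GraphGenome, on_delete=models.CASCADE)
    level = models.IntegerField()  # 0 = base graph, higher = more summarized

class Node(models.Model):
    zoom_level = models.ForeignKey(ZoomLevel, on_delete=models.CASCADE)
    name = models.CharField(max_length=63)
    # Sequence could be omitted here and fetched from xg/GFA on demand.
    sequence = models.TextField(blank=True)
    # The cross-layer link: which node in the coarser layer summarizes this one.
    summarized_by = models.ForeignKey(
        "self", null=True, blank=True,
        on_delete=models.SET_NULL, related_name="summarizes",
    )
```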
I agree with you that a database is sufficient and vastly more flexible. The subtext here is that the size of the graphs can be very large if they are stored in uncompressed form. If we use even a handful of pointers and 64-bit integers per node or edge in the graph, we're going to run into storage costs in the terabyte range for just the 1000GP small variant graph. The implementations we've made keep the memory usage close to the 0-order entropy of the data while providing random access. They have to be customized for this particular application. I would suggest a kind of hybrid approach, where the links and annotations are stored in an overlay. However, if things are all being dropped into disk-backed databases and performance isn't critical, then maybe there's no reason to go this route.
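(A back-of-envelope estimate makes the size concern concrete. Every count below is an illustrative assumption, not a measurement of the actual 1000GP graph; the takeaway is that per-step path storage for thousands of haplotypes dominates and lands in the terabyte range.)

```python
# Illustrative estimate only; all counts are assumptions, not measured
# properties of the 1000GP variation graph.
GiB = 1024 ** 3

nodes = 500_000_000          # assumed node count for a dense variant graph
bytes_per_node = 6 * 8       # a handful of 64-bit pointers/ints per node
haplotypes = 2 * 2504        # two haplotypes per 1000GP sample
steps_per_haplotype = 100_000_000  # assumed nodes traversed per haplotype
bytes_per_step = 8           # one 64-bit node reference per path step

topology = nodes * bytes_per_node
paths = haplotypes * steps_per_haplotype * bytes_per_step

print(f"topology ~ {topology / GiB:.1f} GiB")        # roughly 22 GiB
print(f"paths    ~ {paths / GiB / 1024:.1f} TiB")    # roughly 3.6 TiB; paths dominate
```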
Hi Erik,