This repository has been archived by the owner on Mar 20, 2020. It is now read-only.

Graph Summarization Development Options #5

Closed · 6br opened this issue Jun 12, 2019 · 10 comments

@6br (Contributor) commented Jun 12, 2019

Development Options

  1. Extend VG / XG with our new Graph Summarization Concepts
  2. Summary is stored in our database, XG still used at nucleotide level
  3. Import entire XG graph into our database, then add summary layers
@ekg commented Jun 12, 2019

You might think about subclassing the XG object somehow, or adding additional indexes on top of / independently of it.

@6br (Contributor, Author) commented Jun 13, 2019

I am considering how to implement the web backend. There are the three choices described above.
Graph summarization adds several layers on top of the original graph genome. The lowest layer is the original "full" graph, the highest layer is a kind of bird's-eye view, and each layer is a smaller graph than the one below it. Each node of a smaller graph holds pointers to several nodes in the layer below, so, taken as a whole, the pointers form a tree. Every layer includes an index for retrieving subgraphs (see the sketch below).
@ekg, what do you think about adding these layers into the xg indices, or are there any other attempts to add such higher layers into vg? Currently, I am not sure whether this data structure is beneficial for anything other than visualization.
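To make the layer/pointer description concrete, here is a minimal sketch; every name in it (`SummaryNode`, `Layer`, `expand`) is hypothetical and not taken from vg or xg:

```python
# Layered summary hierarchy: each node in a summary layer points to several
# nodes in the layer below, so across all layers the pointers form a tree;
# each layer also keeps an index for retrieving subgraphs.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SummaryNode:
    node_id: int
    children: List["SummaryNode"] = field(default_factory=list)  # nodes one layer down

@dataclass
class Layer:
    level: int                                    # 0 = original "full" graph
    nodes: Dict[int, SummaryNode] = field(default_factory=dict)

    def subgraph(self, node_ids):
        """Index-style retrieval of a subgraph within this layer."""
        return {i: self.nodes[i] for i in node_ids if i in self.nodes}

def expand(node: SummaryNode) -> List[SummaryNode]:
    """Follow the pointer tree one layer down toward the full graph."""
    return node.children
```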

@6br (Contributor, Author) commented Jun 13, 2019

An alternative idea is to keep our own database for the summary layers. To summarize graphs in the database, we would need to prefetch all the graphs into it first. The upside is that we can easily update our data model for visualization; the downside is that this might be redundant and might not stay compatible with new versions.

@ekg commented Jun 13, 2019

@6br I think we should build from, not into, the xg index. We can extend it with these kinds of indexes, perhaps by precomputing summarized views and storing them in separate indexes, then translating between them as needed with a separate system that links them together.

I think the idea of keeping your own database is fine too, and it is basically the same as this; it all depends on how much you optimize it for this particular application.
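One way to read the "separate system that links them together" is a plain ID translation table between adjacent levels. A hedged sketch with made-up node IDs, not code from either project:

```python
# Hypothetical translation table between a summary level and the level below it.
summary_to_base = {
    101: [1, 2, 3],   # summary node 101 covers base nodes 1-3
    102: [4, 5],
}
# Invert it for the upward direction.
base_to_summary = {b: s for s, bases in summary_to_base.items() for b in bases}

def project_down(summary_id):
    return summary_to_base.get(summary_id, [])

def project_up(base_id):
    return base_to_summary.get(base_id)
```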

@josiahseaman (Member)

I believe these summary graphs will be useful for more than just visualization. Fundamentally, what I'm designing this to do is group together haplotype blocks and iteratively find informative boundaries. That would be a very useful precomputation step for any other analysis: working with fewer, less noisy nodes would be a good starting point for other researchers.

The difference between options 1 and 2 is most obvious when it comes to updates. When the XG standard format is updated, does the summary graph code also get updated before a release? Do we lock the version numbering together?

@josiahseaman (Member)

Further discussion over the following month has led pretty conclusively to an approach where we support import from both XG and GFA. Our database can handle the new feature concepts of links between summary layers and aggregation of paths into haplotypes and ribbons. We can then export to XG file formats and remain interoperable with other tools. This seems like the best of both worlds and doesn't put too much development burden on adding features. Each summarization layer can be exported as its own graph (see the sketch below), but since other tools don't have the concept of linking graphs into a hierarchy, the exported layers won't be linked. This means that even if you don't care about our visualizations, this will still be a useful tool for graph scrubbing and summarization.
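As a sketch of the per-layer export idea (the input shape and function name are assumptions, not our actual schema), each layer could be written to a standalone GFA 1.0 file; the cross-layer links are simply dropped, since GFA has no notion of a graph hierarchy:

```python
def write_gfa(nodes, edges, path):
    """Write one summarization layer as a standalone GFA 1.0 file.
    nodes: {node_id: sequence or None}; edges: [(from_id, to_id)]."""
    with open(path, "w") as f:
        f.write("H\tVN:Z:1.0\n")                      # header
        for node_id, seq in nodes.items():
            f.write(f"S\t{node_id}\t{seq or '*'}\n")  # one segment per node
        for a, b in edges:
            f.write(f"L\t{a}\t+\t{b}\t+\t0M\n")       # links, forward strand assumed

# One file per layer; summary nodes may carry no sequence ('*'):
# write_gfa({1: "ACGT", 2: "TTG"}, [(1, 2)], "layer0.gfa")
# write_gfa({101: None}, [], "layer1.gfa")
```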

We can have a follow-up discussion about the feature differences between GFA and XG and whether either of those technical details conflicts with something we're doing in the database concepts. I personally think that discussion will be clearer once we have a functioning tool, so I'll skip the speculation for now, implement what is possible, and see whether there are any substantial snags along the way.

@ekg commented Aug 15, 2019 via email

@josiahseaman (Member)

Thank you for your comment, Erik. HandleGraph is probably relevant to us. Does it require storing the whole graph in memory? I don't quite have the mental bandwidth to think it all through right now, so please allow me to think out loud. I'm pretty sure a database would give the same O(1) node retrieval or O(log N) binary search on paths; the difference may be that a DB handles paging to memory automatically. Or there may be no difference at all, in which case, yeah, I don't want to reinvent the wheel.

The factor driving the database decision is that we'll need a database to contain links and concepts not handled by XG. Fundamentally, we need a place to add new features. In order to have links between nodes in different summarization layers, I need two nodes in a DB and a link between them. In practical terms, that means every single node needs to be present as a copy in the DB (a model sketch follows below). Sure, I could skip storing sequence in them, and even skip upstream and downstream connections. But once you have a data structure containing every node, you're 90% of the way to handling the whole dataset internally, with an XG-mediated import option.
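A hedged sketch of what that could look like as Django models (field names are mine, not the project's): each node is a row, and a self-referential foreign key records which summary node it rolls up into, which is exactly the tree of cross-layer links:

```python
from django.db import models

class Node(models.Model):
    node_id = models.BigIntegerField()
    level = models.IntegerField(default=0)        # 0 = base graph, higher = coarser
    sequence = models.TextField(blank=True)       # could be skipped at summary levels
    summarized_by = models.ForeignKey(            # link into the layer above
        "self", null=True, blank=True,
        on_delete=models.SET_NULL, related_name="summarizes",
    )
```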

Sebastian recently suggested we collaborate on building a standard file format for summary levels. If we had a summary node link map (which is just a tree) plus XG retrieval, we could technically have all the data without a database, though it would be spread across many different files. The reason I'd decide against a pure file solution comes down to migrations: with a database schema, Django automatically generates migration scripts that update data from any previous version. If we code file formats by hand, then every schema change is a breaking change, or a lot of development time goes into writing version migrations by hand and hoping you never make a mistake that corrupts your users' data. With import/export from an internal database I get the best of both worlds, with clearly defined boundaries of responsibility.

josiahseaman reopened this Aug 16, 2019
@ekg commented Aug 16, 2019 via email

@josiahseaman (Member)

Hi Erik,
I will take your recommendation seriously. I haven't done the storage-size calculations yet, but I'd wager the DB is 2-4x your optimized size. To build an overlay for our summary connections we're going to need those DB objects anyway. The plan is that if storage or performance becomes an issue, we'll replace the DB object properties that store sequence-level nodes with a method that transparently fetches that data from HandleGraph instead (a sketch follows below). This leaves a slow DB implementation as a first development step that later becomes an interface to HandleGraph. Does that sound like a reasonable approach to balancing feature development against performance?
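A sketch of that swap (`NodeProxy` and `get_sequence` are assumed names, not a real HandleGraph binding): the object keeps its identity on the DB side, but `sequence` becomes a property fetched on demand instead of a stored column:

```python
class NodeProxy:
    """Stands in for a DB node whose sequence lives in a HandleGraph index."""
    def __init__(self, node_id, handle_graph):
        self.node_id = node_id
        self._hg = handle_graph   # assumed to expose get_sequence(node_id)

    @property
    def sequence(self):
        # fetched transparently instead of being duplicated in the database
        return self._hg.get_sequence(self.node_id)
```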
