Standalone libraries to work with GFA #3
Comments
@lh3, we have developed a library that provides a standard interface to sequence graphs with embedded paths: https://github.com/vgteam/libhandlegraph. The idea behind this interface hierarchy is to expose something based on a few primitive types without requiring the data structure itself to be implemented with those types. For instance, we often represent graphs using fully succinct data structures, which means that entities in the graph can't be represented as pointers to nodes or as atomic IDs. The handle concept refers to the bidirectional identifier that a particular implementation uses to refer to a node (S line) in the graph. The class hierarchy includes immutable sequence graphs, graphs with paths (the VG model), and mutable versions of both. It also exposes a positional index based on the embedded paths. Two implementations (xg and odgi) read GFA files into a self-index and expose aspects of this API on top of it. We have a study in progress to compare implementations. It should be easy enough to add a simpler, fixed C and C++ interface on top of these. I don't think the semantics would become radically different. There is a mismatch in the number of coordinate spaces, and there are some semantic mismatches with rGFA, but they can be resolved.
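To make the handle idea concrete, here is a minimal sketch of one way an implementation could pack a node ID and an orientation into a single opaque value. All names here (`Handle`, `make_handle`, `flip`, and so on) are hypothetical and chosen for illustration; this is not the libhandlegraph API, and a succinct backend may encode handles quite differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical opaque handle: callers treat it as a token and only use the
// accessor functions below, so a backend is free to change the encoding.
struct Handle {
    std::uint64_t packed;
};

// Assumed encoding: the low bit stores the orientation (1 = reverse strand),
// the remaining 63 bits store the node ID.
inline Handle make_handle(std::uint64_t id, bool reverse) {
    assert(id < (std::uint64_t{1} << 63));  // the ID must fit in 63 bits
    return Handle{(id << 1) | (reverse ? 1u : 0u)};
}

inline std::uint64_t node_id(Handle h) { return h.packed >> 1; }

inline bool is_reverse(Handle h) { return (h.packed & 1u) != 0; }

// The same node seen in the opposite orientation.
inline Handle flip(Handle h) { return Handle{h.packed ^ 1u}; }
```

Because callers only go through `node_id`, `is_reverse` and `flip`, the backing store can be a pointer-based graph or a succinct index without changing user code.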
An important question is the scope of the library. vg is too large. I think that, in its current form, libhandlegraph is too small. My preference is to include at least a GFA parser and an in-memory data structure like a handle graph. I don't have a strong opinion on serialization, indexing and other features. Another question is the terminology. The use of "(sequence) segment" and "link" can be traced back to the discussion on the FASTG format. Richard and I wanted to avoid "vertex", "node", "edge" and "arc" because in the assembly world, people always have different opinions. In a de Bruijn graph, "vertex" and "edge" are interchangeable to some extent, and as a result, a graph simplified from a de Bruijn graph is more often represented in the "edge way", with sequences put on edges instead of nodes. Adopting the GFA terminology will help to avoid such confusion.
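As a rough illustration of the "GFA parser plus in-memory structure" scope, the sketch below reads only GFA1 `S` (segment) and `L` (link) lines into plain vectors and ignores everything else, including optional tags. The type and function names (`Segment`, `Link`, `Gfa`, `parse_gfa`) are made up for this example and follow the segment/link terminology; they do not come from any existing library.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory model using the GFA terms "segment" and "link".
struct Segment {
    std::string name;
    std::string sequence;
};

struct Link {
    std::string from;
    char from_orient;  // '+' or '-'
    std::string to;
    char to_orient;    // '+' or '-'
    std::string overlap;  // CIGAR string or "*"
};

struct Gfa {
    std::vector<Segment> segments;
    std::vector<Link> links;
};

// Split one line into its tab-separated fields.
static std::vector<std::string> split_tabs(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream ss(line);
    while (std::getline(ss, field, '\t')) fields.push_back(field);
    return fields;
}

// Read S and L records; other record types and malformed lines are skipped.
Gfa parse_gfa(std::istream& in) {
    Gfa g;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;
        const std::vector<std::string> f = split_tabs(line);
        assert(!f.empty());  // a non-empty line yields at least one field
        if (f[0] == "S" && f.size() >= 3) {
            g.segments.push_back({f[1], f[2]});
        } else if (f[0] == "L" && f.size() >= 6 &&
                   f[2].size() == 1 && f[4].size() == 1) {
            g.links.push_back({f[1], f[2][0], f[3], f[4][0], f[5]});
        }
    }
    return g;
}
```

A real library would also validate orientations, resolve link endpoints against known segments, and keep tags. The sketch only shows how small the core can be.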
For clarity, we are rewriting all of vg to be based around libhandlegraph.
Version 2 will arrive when this transition is done.
I think we should consider extending the HandleGraph interfaces to match
what you are thinking of. Then we can peg a C interface to it. The benefit
is that the backend piece that stores and allows manipulation of the graph
can be changed. We have the impression that there is not one best solution
on this side, but it has helped a lot to specify a small API to these
graphs.
Libhandlegraph is missing anything to do with alignment. It might make
sense to mix this in somehow. In vg we had other primitives but we
shouldn't be stuck on them.
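A minimal sketch of what "a small API with a swappable backend, plus a C interface pegged to it" could look like. Everything here is an assumption made for illustration: `SequenceGraph`, `VectorGraph`, and the `sg_*` functions are hypothetical names, and the real HandleGraph interfaces are larger and differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal read-only interface; any backend can implement it.
class SequenceGraph {
public:
    virtual ~SequenceGraph() = default;
    virtual std::size_t segment_count() const = 0;
    virtual std::size_t segment_length(std::size_t index) const = 0;
};

// One possible backend: plain strings in a vector. A succinct index could
// implement the same interface without callers noticing.
class VectorGraph final : public SequenceGraph {
public:
    void add_segment(std::string seq) { seqs_.push_back(std::move(seq)); }
    std::size_t segment_count() const override { return seqs_.size(); }
    std::size_t segment_length(std::size_t index) const override {
        assert(index < seqs_.size());
        return seqs_[index].size();
    }

private:
    std::vector<std::string> seqs_;
};

// C-facing layer: C callers only ever see an opaque pointer type.
extern "C" {
typedef struct sg_graph sg_graph;
std::size_t sg_segment_count(const sg_graph* g);
std::size_t sg_segment_length(const sg_graph* g, std::size_t index);
}

// Convert between the C++ interface pointer and the opaque C pointer.
inline const sg_graph* sg_wrap(const SequenceGraph* g) {
    return reinterpret_cast<const sg_graph*>(g);
}

static const SequenceGraph* sg_unwrap(const sg_graph* g) {
    return reinterpret_cast<const SequenceGraph*>(g);
}

extern "C" std::size_t sg_segment_count(const sg_graph* g) {
    return sg_unwrap(g)->segment_count();
}

extern "C" std::size_t sg_segment_length(const sg_graph* g, std::size_t index) {
    return sg_unwrap(g)->segment_length(index);
}
```

The C functions dispatch through the virtual interface, so swapping the backend does not change the C ABI.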
(In reply to Heng Li's comment above, sent Thu, Jul 18, 2019, 19:11.)
How about https://github.com/edawson/gfakluge? I don't think it supports rGFA, though.
Another discussion thread. It is probably too early to implement libraries now, but it would be good to start thinking about the topic.
Currently, gfatools comes with very preliminary APIs to read rGFA into memory. The memory layout is described in gfa.h. It largely follows the model of string graphs. I quite like this model and will stick with it. However, I guess general devs will feel uncomfortable with this representation. I won't have the bandwidth to implement the more general path model any time soon, either. In addition, it is preferable to have independent implementations (e.g. samtools vs picard vs bamtools). I wonder if you (@ekg and @benedictpaten) are interested in implementing a standalone library to work with GFA. You already have in vg a GFA parser, an in-memory model and a serialization format. You can isolate the relevant code and expose stable C and C++ APIs to other devs. I know vg has APIs, but I guess other devs will prefer a more focused, lightweight library that is easier to build.
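For readers unfamiliar with the string-graph style of representation, here is a small sketch of the general idea: each segment contributes two oriented vertices, and each link is stored as an arc together with its reverse-complement arc. This is an illustration under my own assumptions, with hypothetical names (`Vertex`, `Arc`, `ArcGraph`); it is not the actual layout in gfa.h.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed encoding: (segment_index << 1) | strand, where strand 1 = reverse.
using Vertex = std::uint32_t;

struct Arc {
    Vertex from;
    Vertex to;
};

inline Vertex vertex(std::uint32_t segment, bool reverse) {
    return (segment << 1) | (reverse ? 1u : 0u);
}

class ArcGraph {
public:
    explicit ArcGraph(std::uint32_t n_segments) : n_vertices_(2 * n_segments) {}

    // A link from->to implies the complement arc (to^1)->(from^1).
    void add_link(Vertex from, Vertex to) {
        assert(from < n_vertices_ && to < n_vertices_);
        arcs_.push_back({from, to});
        arcs_.push_back({to ^ 1u, from ^ 1u});
    }

    // Sort arcs by source vertex, drop duplicates, and build an offset table
    // so the arcs leaving vertex v occupy [offsets_[v], offsets_[v + 1]).
    void finalize() {
        std::sort(arcs_.begin(), arcs_.end(), [](const Arc& a, const Arc& b) {
            return a.from != b.from ? a.from < b.from : a.to < b.to;
        });
        arcs_.erase(std::unique(arcs_.begin(), arcs_.end(),
                                [](const Arc& a, const Arc& b) {
                                    return a.from == b.from && a.to == b.to;
                                }),
                    arcs_.end());
        offsets_.assign(n_vertices_ + 1, 0);
        for (const Arc& a : arcs_) ++offsets_[a.from + 1];
        for (std::size_t i = 1; i < offsets_.size(); ++i) offsets_[i] += offsets_[i - 1];
    }

    // Vertices reachable by one arc from v; call finalize() first.
    std::vector<Vertex> successors(Vertex v) const {
        assert(v < n_vertices_ && offsets_.size() == n_vertices_ + 1);
        std::vector<Vertex> out;
        for (std::size_t i = offsets_[v]; i < offsets_[v + 1]; ++i) out.push_back(arcs_[i].to);
        return out;
    }

private:
    std::uint32_t n_vertices_;
    std::vector<Arc> arcs_;
    std::vector<std::size_t> offsets_;
};
```

Storing both an arc and its complement means a traversal can start from either strand of a segment with the same lookup, at the cost of doubling the arc count.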