Succinct graph representations #1041
By adjacency I guess that you mean
I will try to do this after #1029 gets merged.
Yes, or even just containsEdge(source, target). Maybe it's a linear search now? That would explain the timings. I got a few interesting ideas and I'm going to partially rewrite this PR. My goal is to have size 4-5 times smaller than the sparse implementation and faster accessors. Let's see whether I can get there :).
Yes, it is currently linear in the size of the neighborhood.
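To make the lookup cost discussed above concrete, here is a minimal sketch in plain Java. It is not JGraphT's code: the class and method names are hypothetical, and it assumes a compressed-sparse-row layout in which each vertex's successor list is stored in ascending order. Under that assumption the adjacency test can use binary search instead of a scan that is linear in the size of the neighborhood.

```java
import java.util.Arrays;

/** Hypothetical CSR-style adjacency store; an illustration only, not JGraphT code. */
final class CsrAdjacencySketch {
    // offsets[v] (inclusive) to offsets[v + 1] (exclusive) delimits the successors of v.
    private final int[] offsets;
    // Concatenated successor lists; each list is assumed to be sorted ascending.
    private final int[] successors;

    CsrAdjacencySketch(int[] offsets, int[] successors) {
        this.offsets = offsets;
        this.successors = successors;
    }

    /** Linear scan: cost proportional to the outdegree of source. */
    boolean containsEdgeLinear(int source, int target) {
        for (int i = offsets[source]; i < offsets[source + 1]; i++) {
            if (successors[i] == target) {
                return true;
            }
        }
        return false;
    }

    /** Binary search: cost logarithmic in the outdegree, but requires sorted lists. */
    boolean containsEdgeBinary(int source, int target) {
        return Arrays.binarySearch(successors, offsets[source], offsets[source + 1], target) >= 0;
    }
}
```

Both methods return the same answers; only the cost per query differs, which matters most for high-degree vertices.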
OK, this is probably the most sensible approach. There is an implementation mimicking the sparse one which is quite slow, and one using pairs as edges that is an order of magnitude faster. I'm still doing some speed tests and I have to review the docs but it looks pretty usable. Footprint is 3 to 10 times less than the sparse implementation, depending on density.
OK, I think this is ready to merge if you like it. I have written extensive Javadocs, as the tradeoffs between the two different kinds of implementations might not be trivial to understand. I don't know how difficult it might be, but having a bridge to Python for implementations using IntIntPair for edges might be very interesting: the succinct implementations using that instead of Integer are an order of magnitude faster.
BTW, do you guys think there's some value in a constructor for directed graphs that encodes only outgoing arcs? The space would be halved, but of course you wouldn't get incoming arcs, similarly to the forward-only constructor of the WebGraph adapters.
Yes, I did this recently for the
Ok, I'll do it ASAP.
…ructors acceping a supplier of streams of edges
I also exploited your new constructors using suppliers of streams of edges. BTW, is there any reason why sparse representations are not serializable? I've been reliably storing and loading such instances without problems (just adding Since it takes some time to build them I think it would be a useful feature. And people could easily publish graphs using that format.
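As an illustration of the store-and-load round trip mentioned above, here is a minimal sketch using standard Java serialization. What exactly was added to the sparse classes is not recoverable from the comment, so this only assumes the usual mechanism (a class implementing java.io.Serializable). ArrayGraphSketch is a hypothetical stand-in for an array-backed immutable graph, not a JGraphT class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical array-backed graph; stands in for a sparse representation. */
final class ArrayGraphSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    final int[] offsets;    // offsets[v]..offsets[v + 1] delimits the successors of v
    final int[] successors; // concatenated successor lists

    ArrayGraphSketch(int[] offsets, int[] successors) {
        this.offsets = offsets;
        this.successors = successors;
    }

    /** Serializes this instance to a byte array (a file stream would work the same way). */
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Rebuilds an instance from bytes produced by toBytes(). */
    static ArrayGraphSketch fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (ArrayGraphSketch) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the state is just primitive arrays, the default serialized form is enough here; a class holding non-serializable fields would need more care.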
Is there anything more I should do? Once this is released, I was thinking about making part of the LAW graph database (say, graphs with <2B nodes) available in this format. It would make it possible to easily test JGraphT on large graphs even with relatively little memory.
Looks good. I will wait a bit to see if John has any comments.
Just a few doc nits. Now we can proudly say, "JGraphT Sux!"
* <var>k</var>-th element of the sequence and some bit shifting (the encoding
* <var>x</var><var>n</var> + <var>y</var> would be slightly more compact, but much slower to
* decode). Since we know the list of cumulative outdegrees, we know which range of indices
* corresponds to the edges outgoing from each vertex. If we need to now whether
typo: "know"
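The Javadoc excerpt above starts mid-sentence, so the following is only a sketch of what it appears to describe, not the PR's code. It assumes an arc (x, y) in a graph with n vertices is packed into one long either by shifting x past the bits needed for y, or as x·n + y. The first form decodes with a shift and a mask, while the second needs a division and a remainder. The last method shows how a list of cumulative outdegrees yields the index range of the arcs leaving a vertex. All names are hypothetical.

```java
/** Illustration of packing an arc (x, y) into one long; names are hypothetical. */
final class ArcEncodingSketch {
    /** Smallest s such that 2^s >= n, for n >= 1. */
    static int shiftFor(int n) {
        return n <= 1 ? 0 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }

    // Shift-based encoding: decoding is a shift and a mask.
    static long encodeShift(int x, int y, int shift) {
        return ((long) x << shift) | y;
    }

    static int sourceShift(long e, int shift) {
        return (int) (e >>> shift);
    }

    static int targetShift(long e, int shift) {
        return (int) (e & ((1L << shift) - 1));
    }

    // Multiplicative encoding: never larger than the shift-based one, but decoding divides.
    static long encodeMul(int x, int y, int n) {
        return (long) x * n + y;
    }

    static int sourceMul(long e, int n) {
        return (int) (e / n);
    }

    static int targetMul(long e, int n) {
        return (int) (e % n);
    }

    /**
     * Index range [from, to) of the arcs leaving v, assuming cumulative[0] == 0 and
     * cumulative[v + 1] - cumulative[v] is the outdegree of v.
     */
    static int[] arcRange(int[] cumulative, int v) {
        return new int[] {cumulative[v], cumulative[v + 1]};
    }
}
```

For example, with n = 5 the shift is 3, so the arc (3, 4) becomes 28 under the shift encoding and 19 under the multiplicative one.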
* {@link SparseIntUndirectedGraph}).
*
* <p>
* {@linkplain org.jgrapht.GraphIterables#outgoingEdgesOf(Object) Enumeration of edges} is is very
typo "is is"
* are very fast and happen in almost constant time.
*
* <p>
* {@link SuccinctDirectedGraph} is a much slower implementation with a similar footprint using
I think this is supposed to reference SuccinctIntDirectedGraph instead?
Thank you for the check! All fixed. BTW, I once talked with a guy who wouldn't use the library because of the word pun 😂.
Just merged, thanks!
Great! Any planned release date for 1.5.1?
yes, yesterday :)
I don't know if this is the intended behavior, and I don't know if this might be an Ivy problem, but adding the artifact
Yes, you need to explicitly request the module that you want to use. The core library is jgrapht-core.
@vigna @d-michail
Prior to this PR, I personally hadn't heard about 'succinct' graphs. This might be used in a very specific field? We should try to make this accessible to 'standard' users. Also, from the wiki I understand that succinct graphs are incredibly space efficient. The only (?) reason to make graph storage this efficient is when you intend to store a massive graph (massive in terms of number of edges/vertices). The same question applies here: do the JGraphT algorithms perform well enough on those massive graphs? Here we can obviously limit ourselves to algorithms you would reasonably expect to execute on those graphs, e.g. you would not compute an exact TSP on a graph with 20MM vertices.
I tried to explain all this in the Javadoc, but we can rephrase this somewhere else. The point is that we already have succinct representations in WebGraph, but these are more targeted at JGraphT and in particular at the Python bridge. We plan to distribute all our datasets with fewer than 2^31 arcs in serialized succinct JGraphT form directly, so users can just load and use them (less friction than with adapters). Succinct data structures are fairly recent, with progress in implementations starting in the mid-2000s. For example, Facebook's graph and text index are all stored using partitioned Elias-Fano, a succinct data structure (you can see some public code here, but what they actually use is more complex). Lucene has a succinct Elias-Fano encoder. The reason to use succinct graphs is simply that you can analyze much larger graphs in core memory. Access is asymptotically the same as with a redundant format, with constant factors that can be large or small depending on the implementation.
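For readers who have not met Elias-Fano before, here is a deliberately naive sketch of the plain (non-partitioned) textbook encoding of a sorted sequence, written for this discussion and not taken from any of the projects mentioned above. Each value is split into low bits, stored verbatim, and high bits, stored in unary in a bit vector. To keep the code short it makes two simplifications that a real implementation would not: the low bits are kept in an unpacked long[] instead of a bit-packed array, and the i-th set bit is found by a linear scan instead of a constant-time select structure. The class name is hypothetical.

```java
import java.util.BitSet;

/** Naive Elias-Fano sketch for a non-decreasing sequence of values in [0, upperBound). */
final class EliasFanoSketch {
    private final int lowBits;  // number of low bits stored verbatim per element
    private final long[] low;   // low bits of each element (unpacked here for clarity)
    private final BitSet high;  // bit (x >>> lowBits) + i is set for the i-th element x

    EliasFanoSketch(long[] sorted, long upperBound) {
        int n = sorted.length;
        // lowBits = floor(log2(upperBound / n)), or 0 when the ratio is at most 1.
        long ratio = n == 0 ? 0 : upperBound / n;
        lowBits = ratio <= 1 ? 0 : 63 - Long.numberOfLeadingZeros(ratio);
        low = new long[n];
        high = new BitSet();
        long mask = (1L << lowBits) - 1;
        for (int i = 0; i < n; i++) {
            low[i] = sorted[i] & mask;
            high.set((int) (sorted[i] >>> lowBits) + i);
        }
    }

    /** Returns the i-th element; the scan below stands in for a fast select operation. */
    long get(int i) {
        int pos = -1;
        for (int k = 0; k <= i; k++) {
            pos = high.nextSetBit(pos + 1);
        }
        // The i-th set bit sits at position (high part) + i.
        return ((long) (pos - i) << lowBits) | low[i];
    }
}
```

With n elements below u, the packed form of this layout needs roughly 2 + log2(u/n) bits per element, which is where the space savings come from.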
We're having a look at the wiki with Dimitrios. Where do you think it would be sensible to put information about this (and WebGraph adapters)?
The correct place would be in the user guide (edit docs/guide-templates/UserOverview.md): https://jgrapht.org/guide/UserOverview#graph-adapters A new page (with code examples) linked from here would be best.
This PR adds to jgrapht-unimi-dsi two graph representations based on succinct data structures.
The WebGraph adapters already provide ways to use succinct representation (e.g., EFGraph), but the implementations in this PR are modeled after the sparse representations of JGraphT—nodes and arcs are represented by integers and numbered starting from zero (they should be usable from Python).
Unfortunately, JGraphT's architecture clashes a bit with the succinct representation. For example, getEdgeSource() and getEdgeTarget() have to make the same expensive call twice. The result is that while the graphs are about 5 times smaller, access is 5 times slower, too.
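The clash described above can be illustrated with a small sketch. It is not the PR's code: the class is hypothetical, a plain array lookup with a counter stands in for the expensive succinct decoding step, and arcs are assumed to be packed as (source << 32) | target. When an edge is just an integer index, asking for the source and then the target decodes the same entry twice; an accessor that returns both endpoints decodes it once.

```java
/** Hypothetical store where an edge is the index k of a packed arc; illustration only. */
final class EdgeAccessSketch {
    private final long[] packedArcs; // arc k stored as (source << 32) | target
    int decodeCalls = 0;             // counts how often the "expensive" lookup runs

    EdgeAccessSketch(long[] packedArcs) {
        this.packedArcs = packedArcs;
    }

    /** Stand-in for an expensive lookup in a succinct structure. */
    private long decode(int k) {
        decodeCalls++;
        return packedArcs[k];
    }

    int getEdgeSource(int k) {
        return (int) (decode(k) >>> 32);
    }

    int getEdgeTarget(int k) {
        return (int) decode(k); // narrowing keeps the low 32 bits
    }

    /** Returns {source, target} with a single decoding step. */
    int[] getEndpoints(int k) {
        long e = decode(k);
        return new int[] {(int) (e >>> 32), (int) e};
    }
}
```

Calling getEdgeSource(k) and getEdgeTarget(k) costs two lookups, while getEndpoints(k) costs one, which is the kind of saving an edge type that carries both endpoints can provide.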
UPDATE: My claims about speed are quite wrong.
The problem is that I (stupidly) quickly tested enumeration time for the whole arc set, which is not a good idea because in these graphs the arc set is trivial.
More accurate testing shows that the directed version is 50% slower than SparseIntDirectedGraph when enumerating successors, and it is 2-3 times faster when checking adjacency (but see #1042: the speed test in this case might be meaningless). These figures will vary with the density of the graph.
The undirected implementation is unfortunately significantly slower (5 times slower when enumerating successors). Adjacency, however, is about 150 times faster. Is there some reason why adjacency in SparseIntUndirectedGraph is so slow?