-
-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update calculation of triangles #6258
Conversation
This PR follows up on issue networkx#5246, the key function is mostly as suggested by @dschult and @zephyr111 (minor variable renaming, hopefully for better clarity). Note that the existing docs are missing the `int` return when a single node is provided (even though it is one of the examples), so the docs are modified to account for that also.
Co-authored-by: Erik Welch <erik.n.welch@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code seems to be fine
Do we have any evidence yet that this speeds up the calculation? It looks like a fair amount of extra code and that usually ends up slowing things down. My comment in the original issue about looking up one neighbor not speeding things up made me think of that again here. How is this version doing? |
Yes, there's definitely a speed-up (my original SO question was prompted by very slow triangles calculation). I didn't do comprehensive benchmarking, but here's a small python script to examine the times:https://gist.github.com/SultanOrazbayev/c4d5be83b5a8dd35ad13f0eff847135d. For an Erdos-Reny graph with 1500 nodes and 0.1 edge probability I get a speed up of about 10x-20x (note I gave up on trying larger graphs, since nx current version is taking rather long). |
And in the above script, I'm not controlling for seed etc, but the results are large and consistent enough. |
For the single-node case, though, yes, the current version appears to be much, much faster. This is likely due to the sub-graph creation part. Let me investigate a bit more. |
After further inspection, it seems that the speedup is observed only when computing triangles for the whole graph (possibly, there is also speed up when As a way to combine the strengths of both approaches, the old approach is preserved if |
To speedup the subgraph part, you might try making a copy of the subgraph. Subgraph Views are not very performant. When traversing the graph, you still have to look at all edges and reject the ones that are not in the graph. If memory is not a bottleneck, and if you expect to look at an edge more than once on average, it is likely to be faster to add |
Thank you for these details, I didn't think about the overhead of using views vs copy. Right now, the iterative approach of the original code is preserved for the case when nodes are provided. Do you think this will work? |
Yes -- that is a good way to handle the case when nodes are provided. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This look good!
The later_neighbors approach creates an extra data structure (more RAM) but avoids the repeated lookup of neighbors in the previous set(G[w]) - {w}
. I think the extra RAM is at most a dict of sets with N pointers to sets with at most N elements (N being the number of nodes). That's less than a copy of the graph so I view this as a good enhancement.
For performance, I checked a complete_graph on 1000 nodes and the old/current code is 28s while the PR code is 34s. So, the new version is slower for denser graphs. This makes sense because the bulk of the time savings with the new approach is pre-computation of set(G[w]) - {w}
and the cost of the new method is looping over nodes in neighbors & later_neighbors[node2]
instead of using it's length. When those "third node" sets are large the extra inner loop is costly. And when they are small, using lookups makes up for the extra inner loop cost. Ordering the nodes to avoid counting twice helps too, but the main savings is replacing set creation time with lookup time.
I then checked random graphs with 1500 nodes and p=0.3 and 0.5. When p=0.5, the two methods both took about 20.7s. So, I think that is the density where the methods take essentially the same time.
I was able to speed up the PR method by a factor of 2-4 by moving the update of triangle_counts
outside the inner loop and increasing it by m
instead of 1
where m
is the number of nodes to be looped over in the inner loop. Using a Counter
to handle the inner loop also helps speed things up.
I tried adding pre-computation to the current networkx version and it helped a lot -- about 3-4 times faster. But that is still slower than the current PR for sparse graphs and slower than the inner-loop-speedup version for complete graphs. So, overall we should go with the PR with the inner-loop speedup with the inner loop pushed into the compiled code that is within Counter.update()
.
We're looking at a speedup of 10+ times for sparse graphs and about 7 for density p=0.5 and about 4 for the complete_graph... The complete graph on 1500 nodes takes 1min 56sec for the existing code, and 28sec for the PR with inner loop speedup.
See the comments below...
- return to the original doc for default
- make corrections in the doc string
Co-authored-by: Dan Schult <dschult@colgate.edu>
- this is to make sure that the Counter contains a count for every node in the graph
- update doctest to pass a list rather that tuple
Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
TLDR; I simplified the code and made the behaviour consistent with the original version. This is ready for a review now. The ambiguity of def test_node_not_in_graph(self):
G = nx.Graph()
G.add_node((0, 0))
G.add_node((2, 3))
assert nx.triangles(G, (0, 1)) == 0 The tricky part here is that if one passes an iterable containing nodes that are not in the graph, one would expect a dictionary with keys corresponding to the nodes in the iterable and values being zero. (somewhat similar behaviour for a list containing some nodes in the graph and some nodes not in the graph). Since a node could potentially also be an iterable itself, this leads to ambiguity (if we should return zero because the iterable is not a node in the graph or try to iterate over the contents of the iterable and return a dict with key/val containing zeros for nodes that are not in the graph and n-triangles for nodes that are in the graph). Implementing this would not lead to a readable code, so for now the most efficient path is to revert to the original behaviour. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this -- and sorry for the delay. I tried to include the new changes in the helper function used by other functions for triangles, degree, generalized degree. But the loops are different enough that it didn't end up helping there. Let's get this merged. :)
Thanks!
Thank you very much, @dschult ! |
* Update calculation of triangles This PR follows up on issue networkx#5246, the key function is mostly as suggested by @dschult and @zephyr111 (minor variable renaming, hopefully for better clarity). Note that the existing docs are missing the `int` return when a single node is provided (even though it is one of the examples), so the docs are modified to account for that also. * fix a typo * fix formatting * ignore self-loops * remove print statements * Update networkx/algorithms/cluster.py Co-authored-by: Erik Welch <erik.n.welch@gmail.com> * Update cluster.py * Update cluster.py * Update cluster.py * Update cluster.py - return to the original doc for default * Update cluster.py - make corrections in the doc string * Update cluster.py * Update networkx/algorithms/cluster.py Co-authored-by: Dan Schult <dschult@colgate.edu> * Update cluster.py * Update cluster.py - this is to make sure that the Counter contains a count for every node in the graph * Update cluster.py - update doctest to pass a list rather that tuple * Update cluster.py * Update test_cluster.py * Update networkx/algorithms/cluster.py Co-authored-by: Ross Barnowski <rossbar@berkeley.edu> * simplify the code * trailing white space * add blank line (restart tests) * remove test * minor tweak to docs and rerun CI --------- Co-authored-by: Erik Welch <erik.n.welch@gmail.com> Co-authored-by: Dan Schult <dschult@colgate.edu> Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
* Update calculation of triangles This PR follows up on issue networkx#5246, the key function is mostly as suggested by @dschult and @zephyr111 (minor variable renaming, hopefully for better clarity). Note that the existing docs are missing the `int` return when a single node is provided (even though it is one of the examples), so the docs are modified to account for that also. * fix a typo * fix formatting * ignore self-loops * remove print statements * Update networkx/algorithms/cluster.py Co-authored-by: Erik Welch <erik.n.welch@gmail.com> * Update cluster.py * Update cluster.py * Update cluster.py * Update cluster.py - return to the original doc for default * Update cluster.py - make corrections in the doc string * Update cluster.py * Update networkx/algorithms/cluster.py Co-authored-by: Dan Schult <dschult@colgate.edu> * Update cluster.py * Update cluster.py - this is to make sure that the Counter contains a count for every node in the graph * Update cluster.py - update doctest to pass a list rather that tuple * Update cluster.py * Update test_cluster.py * Update networkx/algorithms/cluster.py Co-authored-by: Ross Barnowski <rossbar@berkeley.edu> * simplify the code * trailing white space * add blank line (restart tests) * remove test * minor tweak to docs and rerun CI --------- Co-authored-by: Erik Welch <erik.n.welch@gmail.com> Co-authored-by: Dan Schult <dschult@colgate.edu> Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
* Update calculation of triangles This PR follows up on issue networkx#5246, the key function is mostly as suggested by @dschult and @zephyr111 (minor variable renaming, hopefully for better clarity). Note that the existing docs are missing the `int` return when a single node is provided (even though it is one of the examples), so the docs are modified to account for that also. * fix a typo * fix formatting * ignore self-loops * remove print statements * Update networkx/algorithms/cluster.py Co-authored-by: Erik Welch <erik.n.welch@gmail.com> * Update cluster.py * Update cluster.py * Update cluster.py * Update cluster.py - return to the original doc for default * Update cluster.py - make corrections in the doc string * Update cluster.py * Update networkx/algorithms/cluster.py Co-authored-by: Dan Schult <dschult@colgate.edu> * Update cluster.py * Update cluster.py - this is to make sure that the Counter contains a count for every node in the graph * Update cluster.py - update doctest to pass a list rather that tuple * Update cluster.py * Update test_cluster.py * Update networkx/algorithms/cluster.py Co-authored-by: Ross Barnowski <rossbar@berkeley.edu> * simplify the code * trailing white space * add blank line (restart tests) * remove test * minor tweak to docs and rerun CI --------- Co-authored-by: Erik Welch <erik.n.welch@gmail.com> Co-authored-by: Dan Schult <dschult@colgate.edu> Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
This PR follows up on issue #5246, the key function is mostly as suggested by @dschult and @zephyr111 (minor variable renaming, hopefully for better clarity).
Note that the existing docs are missing the
int
return when a single node is provided (even though it is one of the examples), so the docs are modified to account for that also.