Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update calculation of triangles #6258

Merged
merged 25 commits into from
Jul 26, 2023
Merged

Conversation

SultanOrazbayev
Copy link
Contributor

This PR follows up on issue #5246, the key function is mostly as suggested by @dschult and @zephyr111 (minor variable renaming, hopefully for better clarity).

Note that the existing docs are missing the int return when a single node is provided (even though it is one of the examples), so the docs are modified to account for that also.

This PR follows up on issue networkx#5246, the key function is mostly as suggested by @dschult and @zephyr111 (minor variable renaming, hopefully for better clarity).

Note that the existing docs are missing the `int` return when a single node is provided (even though it is one of the examples), so the docs are modified to account for that also.
Co-authored-by: Erik Welch <erik.n.welch@gmail.com>
Copy link
Contributor Author

@SultanOrazbayev SultanOrazbayev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code seems to be fine

@dschult
Copy link
Member

dschult commented Dec 8, 2022

Do we have any evidence yet that this speeds up the calculation? It looks like a fair amount of extra code and that usually ends up slowing things down. My comment in the original issue about looking up one neighbor not speeding things up made me think of that again here. How is this version doing?

@SultanOrazbayev
Copy link
Contributor Author

SultanOrazbayev commented Dec 8, 2022

Yes, there's definitely a speed-up (my original SO question was prompted by very slow triangles calculation).

I didn't do comprehensive benchmarking, but here's a small python script to examine the times:https://gist.github.com/SultanOrazbayev/c4d5be83b5a8dd35ad13f0eff847135d. For an Erdos-Reny graph with 1500 nodes and 0.1 edge probability I get a speed up of about 10x-20x (note I gave up on trying larger graphs, since nx current version is taking rather long).

@SultanOrazbayev
Copy link
Contributor Author

And in the above script, I'm not controlling for seed etc, but the results are large and consistent enough.

@SultanOrazbayev
Copy link
Contributor Author

For the single-node case, though, yes, the current version appears to be much, much faster. This is likely due to the sub-graph creation part. Let me investigate a bit more.

@SultanOrazbayev
Copy link
Contributor Author

After further inspection, it seems that the speedup is observed only when computing triangles for the whole graph (possibly, there is also speed up when nodes contains a lot of nodes). For single node or for small subset of nodes the old approach is much, much faster.

As a way to combine the strengths of both approaches, the old approach is preserved if nodes is specified, if nodes kwarg is None, then the new approach is used.

@dschult
Copy link
Member

dschult commented Dec 8, 2022

To speedup the subgraph part, you might try making a copy of the subgraph. Subgraph Views are not very performant. When traversing the graph, you still have to look at all edges and reject the ones that are not in the graph. If memory is not a bottleneck, and if you expect to look at an edge more than once on average, it is likely to be faster to add .copy() to the end of your subgraph call. This does take time of course -- because you have to create that new graph to hold the subgraph. So, no guarantees...
That also would mean you should have some logic to handle the case where nodes is None so that you don't make a copy/subgraph of the whole graph. :)

@SultanOrazbayev
Copy link
Contributor Author

Thank you for these details, I didn't think about the overhead of using views vs copy. Right now, the iterative approach of the original code is preserved for the case when nodes are provided. Do you think this will work?

@dschult
Copy link
Member

dschult commented Dec 8, 2022

Yes -- that is a good way to handle the case when nodes are provided.

Copy link
Member

@dschult dschult left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look good!
The later_neighbors approach creates an extra data structure (more RAM) but avoids the repeated lookup of neighbors in the previous set(G[w]) - {w}. I think the extra RAM is at most a dict of sets with N pointers to sets with at most N elements (N being the number of nodes). That's less than a copy of the graph so I view this as a good enhancement.

For performance, I checked a complete_graph on 1000 nodes and the old/current code is 28s while the PR code is 34s. So, the new version is slower for denser graphs. This makes sense because the bulk of the time savings with the new approach is pre-computation of set(G[w]) - {w} and the cost of the new method is looping over nodes in neighbors & later_neighbors[node2] instead of using it's length. When those "third node" sets are large the extra inner loop is costly. And when they are small, using lookups makes up for the extra inner loop cost. Ordering the nodes to avoid counting twice helps too, but the main savings is replacing set creation time with lookup time.

I then checked random graphs with 1500 nodes and p=0.3 and 0.5. When p=0.5, the two methods both took about 20.7s. So, I think that is the density where the methods take essentially the same time.

I was able to speed up the PR method by a factor of 2-4 by moving the update of triangle_counts outside the inner loop and increasing it by m instead of 1 where m is the number of nodes to be looped over in the inner loop. Using a Counter to handle the inner loop also helps speed things up.

I tried adding pre-computation to the current networkx version and it helped a lot -- about 3-4 times faster. But that is still slower than the current PR for sparse graphs and slower than the inner-loop-speedup version for complete graphs. So, overall we should go with the PR with the inner-loop speedup with the inner loop pushed into the compiled code that is within Counter.update().

We're looking at a speedup of 10+ times for sparse graphs and about 7 for density p=0.5 and about 4 for the complete_graph... The complete graph on 1500 nodes takes 1min 56sec for the existing code, and 28sec for the PR with inner loop speedup.
See the comments below...

networkx/algorithms/cluster.py Outdated Show resolved Hide resolved
networkx/algorithms/cluster.py Outdated Show resolved Hide resolved
networkx/algorithms/cluster.py Outdated Show resolved Hide resolved
networkx/algorithms/cluster.py Outdated Show resolved Hide resolved
SultanOrazbayev and others added 7 commits December 11, 2022 09:03
- return to the original doc for default
- make corrections in the doc string
Co-authored-by: Dan Schult <dschult@colgate.edu>
- this is to make sure that the Counter contains a count for every node in the graph
- update doctest to pass a list rather that tuple
@SultanOrazbayev
Copy link
Contributor Author

TLDR; I simplified the code and made the behaviour consistent with the original version. This is ready for a review now.

The ambiguity of nodes kwarg makes it difficult to have a nice implementation that would return 0 if a node is not in the graph. As a result, this test that I wanted to add will not be added:

    def test_node_not_in_graph(self):
        G = nx.Graph()
        G.add_node((0, 0))
        G.add_node((2, 3))
        assert nx.triangles(G, (0, 1)) == 0

The tricky part here is that if one passes an iterable containing nodes that are not in the graph, one would expect a dictionary with keys corresponding to the nodes in the iterable and values being zero. (somewhat similar behaviour for a list containing some nodes in the graph and some nodes not in the graph). Since a node could potentially also be an iterable itself, this leads to ambiguity (if we should return zero because the iterable is not a node in the graph or try to iterate over the contents of the iterable and return a dict with key/val containing zeros for nodes that are not in the graph and n-triangles for nodes that are in the graph). Implementing this would not lead to a readable code, so for now the most efficient path is to revert to the original behaviour.

networkx/algorithms/cluster.py Outdated Show resolved Hide resolved
networkx/algorithms/cluster.py Show resolved Hide resolved
Copy link
Member

@dschult dschult left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this -- and sorry for the delay. I tried to include the new changes in the helper function used by other functions for triangles, degree, generalized degree. But the loops are different enough that it didn't end up helping there. Let's get this merged. :)
Thanks!

@SultanOrazbayev
Copy link
Contributor Author

Thank you very much, @dschult !

@jarrodmillman jarrodmillman merged commit adf6cff into networkx:main Jul 26, 2023
38 checks passed
@jarrodmillman jarrodmillman added this to the 3.2 milestone Jul 26, 2023
Alex-Markham pushed a commit to Alex-Markham/networkx that referenced this pull request Oct 13, 2023
* Update calculation of triangles

This PR follows up on issue networkx#5246, the key function is mostly as suggested by @dschult and @zephyr111 (minor variable renaming, hopefully for better clarity).

Note that the existing docs are missing the `int` return when a single node is provided (even though it is one of the examples), so the docs are modified to account for that also.

* fix a typo

* fix formatting

* ignore self-loops

* remove print statements

* Update networkx/algorithms/cluster.py

Co-authored-by: Erik Welch <erik.n.welch@gmail.com>

* Update cluster.py

* Update cluster.py

* Update cluster.py

* Update cluster.py

- return to the original doc for default

* Update cluster.py

- make corrections in the doc string

* Update cluster.py

* Update networkx/algorithms/cluster.py

Co-authored-by: Dan Schult <dschult@colgate.edu>

* Update cluster.py

* Update cluster.py

- this is to make sure that the Counter contains a count for every node in the graph

* Update cluster.py

- update doctest to pass a list rather that tuple

* Update cluster.py

* Update test_cluster.py

* Update networkx/algorithms/cluster.py

Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>

* simplify the code

* trailing white space

* add blank line (restart tests)

* remove test

* minor tweak to docs and rerun CI

---------

Co-authored-by: Erik Welch <erik.n.welch@gmail.com>
Co-authored-by: Dan Schult <dschult@colgate.edu>
Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
dschult added a commit to BrunoBaldissera/networkx that referenced this pull request Oct 23, 2023
* Update calculation of triangles

This PR follows up on issue networkx#5246, the key function is mostly as suggested by @dschult and @zephyr111 (minor variable renaming, hopefully for better clarity).

Note that the existing docs are missing the `int` return when a single node is provided (even though it is one of the examples), so the docs are modified to account for that also.

* fix a typo

* fix formatting

* ignore self-loops

* remove print statements

* Update networkx/algorithms/cluster.py

Co-authored-by: Erik Welch <erik.n.welch@gmail.com>

* Update cluster.py

* Update cluster.py

* Update cluster.py

* Update cluster.py

- return to the original doc for default

* Update cluster.py

- make corrections in the doc string

* Update cluster.py

* Update networkx/algorithms/cluster.py

Co-authored-by: Dan Schult <dschult@colgate.edu>

* Update cluster.py

* Update cluster.py

- this is to make sure that the Counter contains a count for every node in the graph

* Update cluster.py

- update doctest to pass a list rather that tuple

* Update cluster.py

* Update test_cluster.py

* Update networkx/algorithms/cluster.py

Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>

* simplify the code

* trailing white space

* add blank line (restart tests)

* remove test

* minor tweak to docs and rerun CI

---------

Co-authored-by: Erik Welch <erik.n.welch@gmail.com>
Co-authored-by: Dan Schult <dschult@colgate.edu>
Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
cvanelteren pushed a commit to cvanelteren/networkx that referenced this pull request Apr 22, 2024
* Update calculation of triangles

This PR follows up on issue networkx#5246, the key function is mostly as suggested by @dschult and @zephyr111 (minor variable renaming, hopefully for better clarity).

Note that the existing docs are missing the `int` return when a single node is provided (even though it is one of the examples), so the docs are modified to account for that also.

* fix a typo

* fix formatting

* ignore self-loops

* remove print statements

* Update networkx/algorithms/cluster.py

Co-authored-by: Erik Welch <erik.n.welch@gmail.com>

* Update cluster.py

* Update cluster.py

* Update cluster.py

* Update cluster.py

- return to the original doc for default

* Update cluster.py

- make corrections in the doc string

* Update cluster.py

* Update networkx/algorithms/cluster.py

Co-authored-by: Dan Schult <dschult@colgate.edu>

* Update cluster.py

* Update cluster.py

- this is to make sure that the Counter contains a count for every node in the graph

* Update cluster.py

- update doctest to pass a list rather that tuple

* Update cluster.py

* Update test_cluster.py

* Update networkx/algorithms/cluster.py

Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>

* simplify the code

* trailing white space

* add blank line (restart tests)

* remove test

* minor tweak to docs and rerun CI

---------

Co-authored-by: Erik Welch <erik.n.welch@gmail.com>
Co-authored-by: Dan Schult <dschult@colgate.edu>
Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

6 participants