Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Panther algorithm per #3849 #3886

Merged
merged 34 commits into from
Nov 28, 2020

Conversation

mrecachinas
Copy link
Contributor

@mrecachinas mrecachinas commented Mar 31, 2020

This addresses #3849.

A couple remaining questions:

  • Is there a better home for generate_random_paths? It’s a generic Markovian random walk that could be used elsewhere. Alternatively, if this already exists or there's a more efficient approach, I'm all ears.
  • I used numpy for a bit of this. Is there a desire to have a pure Python version as well (like simrank_similarity vs simrank_similarity_numpy)?
  • Should n_choose_k be elsewhere or perhaps should we use scipy.special.binom instead?
  • Do we want Panther++ added (requires a KD-tree, which we can use from scipy.spatial)?

@pep8speaks
Copy link

pep8speaks commented Mar 31, 2020

Hello @mrecachinas! Thanks for updating this PR.

Line 1460:1: W293 blank line contains whitespace
Line 1516:89: E501 line too long (92 > 88 characters)
Line 1517:54: W605 invalid escape sequence '\p'
Line 1518:47: W605 invalid escape sequence '\s'
Line 1642:89: E501 line too long (98 > 88 characters)

Please install and run psf/black.

Comment last updated at 2020-09-05 14:56:05 UTC

@mrecachinas mrecachinas changed the title [WIP] Add Panther algorithm per #3849 Add Panther algorithm per #3849 Apr 1, 2020
@dschult
Copy link
Member

dschult commented May 20, 2020

n_choose_k is nicely expressed in Python 3.8+ as math.comb(n, k).
But we probably need to keep something around until we stop supporting Python 3.7.

There is only a reason to keep a numpy version and plain python if there is an advantage to each approach. For example, the algorithm used might be different with one using matrix manipulations and one using graph search techniques, or similar.

Thanks for this! I think there could be demand for Panther++. Let me think about where generate_random_paths might fit better.

small tweaks -- typo "ictionary" in docs for generate_random_paths. And could the example do soemthing with random_path to show that it has two outputs and what someone might want to do with them? Perhaps you could return just the paths and have an optional argument index_map that could be filled if supplied. ??

@mrecachinas
Copy link
Contributor Author

mrecachinas commented May 20, 2020

@dschult Fixed the typo and moved index_map to be an optional kwarg that, if provided, will be populated. I also added an example in the docstring illustrating that.

Let me know if there's anything else. In the meantime, I'll finish my implementation of Panther++.

@storopoli
Copy link

Any news guys? Will it be released in the next version?

@mrecachinas
Copy link
Contributor Author

@dschult Think I addressed everything. Let me know if there's anything else you would like changed!

Also, I'll submit another PR for Panther++ once I finish it.

@storopoli
Copy link

@mrecachinas great job! Waiting for the panther++ algorithm also...

@dschult
Copy link
Member

dschult commented Sep 3, 2020

Quick question: In the docs, delta is:
The probability that 𝑆 is not an epsilon-approximation to (R, phi)

What are R and phi?

You can answer here-- but if you prefer, you can add more to that doc string.

@storopoli
Copy link

@dschult according to the original Panther article, $R$ is the number of random paths and $\phi$ is the probability that an element sampled from a set A $\subseteq$ D, where D is the domain.

ref: Zhang, J., Tang, J., Ma, C., Tong, H., Jing, Y., & Li, J. (2015). Panther: Fast Top-k Similarity Search on Large Networks. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1445–1454. https://doi.org/10.1145/2783258.2783267

@dschult
Copy link
Member

dschult commented Sep 3, 2020

Good Thanks!

But what is S... ? I don't think these symbols are defined in the doc_string.... only in the article.
The code refers to "S", but it is an array of values, which is hard to see how to be epsilon close to a 2-tuple (R, phi)...

@dschult dschult added this to the networkx-2.6 milestone Sep 3, 2020
@mrecachinas
Copy link
Contributor Author

mrecachinas commented Sep 4, 2020

Sorry - S is just the similarity adjacency matrix. Updating the docstring.

networkx/algorithms/similarity.py Show resolved Hide resolved
networkx/algorithms/similarity.py Outdated Show resolved Hide resolved
networkx/algorithms/similarity.py Outdated Show resolved Hide resolved
@mrecachinas
Copy link
Contributor Author

@jarrodmillman Let me know if I've addressed your change requests.

@rossbar rossbar self-requested a review November 24, 2020 07:16
@jarrodmillman
Copy link
Member

@mrecachinas Sorry for the delay. Ross or I will take a look very soon! We are planning to release 2.6 soonish and I want to get this in first. Thanks for your patience.

Despite our slow response time, we would also like to get Panther++ in soon. So I hope you are still interested in contributing it!

Copy link
Contributor

@rossbar rossbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead an made some small touch-ups in rossbar/networkx@0973646c. I tried to push these to your branch (to spare you the spam via the github suggestion feature) but was unable to do so, likely because it seems like your fork originated from somewhere other than networkx/networkx. I wasn't able to PR against your fork either, so if you could apply those changes via cherry-pick that'd be great. If you're not comfortable doing so, you can either:

  1. Add rossbar as a collaborator to your fork and I'll open a PR against your fork or
  2. We just put this in and I can make a follow-on PR

Generally this LGTM - I'm not very familiar with the algorithm so my opinion shouldn't be weighted too heavily but the implementation seems good! I too was struck by the question of whether there is a better place for generate_random_paths, but I don't have any ideas and I don't think that should be a sticking point. Thanks @mrecachinas !

@mrecachinas
Copy link
Contributor Author

mrecachinas commented Nov 25, 2020

@jarrodmillman I'll dust off and finish the Panther++ I started a few months ago (but I'll put it in a separate PR).

@rossbar Ahh, I think I forked someone else's fork for a previous PR (adding SimRank, IIRC), so yeah that's probably the issue. I'll cherry-pick if you're good with it.

@jarrodmillman jarrodmillman merged commit a1cad29 into networkx:master Nov 28, 2020
MridulS pushed a commit to MridulS/networkx that referenced this pull request Feb 4, 2023
* Add initial pass at panther algorithm

* Fix example imports and references

* Fix pep8 issues

* Remove unnecessary compare in panther conditional

* Fix doctest failure in generate_random_paths

* Fix n_choose_k when n == k

* Add tests for panther, n_choose_k, generate_random_paths

* Clean up panther_similarity docstring

* Fix panther and generate_random_paths to return node names

* Fix pep8 issue in similarity.py

* Fix typo in generate_random_paths docstring

* Add small microoptimizations in calculating transition probs

* Handle k > n in n_choose_k

* Change generate_random_paths to accept optional index_map vs return it

This also adds an example in the docstring for generate_random_paths
and fixes the relevant tests.

* Rename v to source in panther_similarity

* Change doc for c in panther_similarity

* Change generate_random_paths to return generator instead of list

Change generate_random_paths tests
with new API for generate_random_paths

* Increase random path sample size in docstring example

* Fix docstring example for generate_random_paths

* Update similarity.py per suggestions

- Remove `n_choose_k` from `__all__`
- Compute `1 / sample_size` once, and use that value in the loop
- Replace `setdefault` with if-block

* Add panther_similarity funcs to similarity.rst

* Remove nx import from docstring in similarity.py

* Fix typo in generate_random_paths

indexmap -> index_map

* Remove nx prefix from n_choose_k docstring examples

* Fix unit test for n_choose_k by importing it directly

* Add more context to panther docstring

* Rename n_choose_k to _n_choose_k and add deprecation note

* * DOC: minor documentation touch-ups.
 * MAINT: use pre-defined local variable.

* Run black on test_similarity.py

* Remove second setdefault in generate_random_paths

* Optimize panther using the inverted vertex-path index map

* Fix variable shadowing in panther similarity

* Remove debug printing in panther_similarity

Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants