New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PageRank: ARPACK method sometimes fails and produces negative values #1641
Comments
@ntamas I discovered this by finally implementing a warning in the Mathematica interface for when the eigenvalue is not 1, and fuzzing a bit to see if the tolerance I put in was good enough. If you don't already check this in the Python interface, it would be a good idea to do it ... I chose to check this allowing for up to 5 binary digits of difference, using |
The above test was with the vendored ARPACK. It also happens with an external ARPACK 3.8.0, but less frequently. However, the tolerance of 5 binary digits is not sufficient with the external ARPACK, and it also tends to run into convergence problems (ARPACK itself reports no convergence). I decided to switch to 7 digits (which is also Mathematica's default tolerance when comparing |
The result ARPACK returns here is not actually an eigenvector / eigenvalue pair at all. I am not sure why ARPACK won't detect the problem. I would have assumed that it attempts to verify it results. |
Most of the examples that you have shown above have sinks or sources; the Arnoldi method might have a problem with these, although in theory it shouldn't of course. PRPACK works around this by decomposing the graph to strongly connected components, calculating the PageRank scores in each of them, and then using some clever math to merge the PageRank scores of all SCCs so the result is consistent. In general, I would say that PRPACK is probably much more reliable than ARPACK exactly because of this. Example #2 is intriguing, but it is also not strongly connected; the leftmost vertex and the one that is connected to with a mutual edge form a strongly connected component so the PageRank "juice" mostly flows to these two nodes from the rest of the graph. When testing the eigenvector, I was wondering whether it would make sense to use a more sophisticated comparison between floats that tests both absolute and relative errors instead of simply comparing the differences with a tolerance. |
Yes, that is absolutely necessary, to handle both zeros (relative tol doesn't work) and non-zeros (absolute tol not good). Luckily, for PageRank we only need to test the eigenVALUE, as we know it should be one. So this function is fine. In the Mathematica interface, I do not return the eigenvalue, but I added a check which show s warning if it differs from 1 in more than 5 binary digits (see the PageRank test for something similar). With this amount of tolerance, I have not yet run into any case when it shows a warning even though nothing is wrong (e.g. when the eigenvalue is barely off, such as 0.999999999999)
ARPACK does not see the sinks. When we are in a sink, we can jump to any other vertex. Effectively, each sink is connected with an out-edge to all other vertices. Of course, it is possible that it is precisely this (overwhelming amount of effective connections) that is throwing ARPACK off. |
To clarify, I was suggesting verifying ARPACK's result in other cases than PageRank, such as with eigenvector centrality. |
For some context, this problem mostly affects the old version of ARPACK that is bundled with igraph. Newer versions (such as ARPACK-NG 3.8.0) are much less susceptible to this problem, but still not completely immune to it. Reproducing the problem is not easy, so no, it was not reported upstream. Now for IGraph/M specifically: It does come with the old, bundled version of ARPACK, so this is a real problem. However, for PageRank specifically, we also have the PRPACK method which is better than the ARPACK-based one in almost every situation (i.e. all reasonable damping factors, except maybe those very close to 1.0). Thus I suggest just using the default, PRPACK-based method. ARPACK is used in other functions though, such as eigenvector centrality or leading eigenvector based community detection. During practical work, I have not encountered this problem with those functions, but I know that it can theoretically appear. Did you run into it? A potential solution would be to link IGraph/M with a newer ARPACK, but doing that on all platforms is quite difficult ... it's not something I can promise. Investigating the use of FEAST for the core igraph C library is something interesting, but also a lot of work, so I don't expect it would happen for the next release (the TODO list for that is already too long). But it's definitely something to look at. We can continue discussing that in the other issue, #551. Using Mathematica's own ARPACK (or FEAST) is not something I'm keen to try, as they don't support this through any documented API. It would be a fragile hack. Still, I guess it wouldn't hurt to ask Wolfram whether they'd be willing to make this possible/easier, but I don't expect a positive response. |
Lack of time. If you have the time to debug this, of course contributions are welcome.
Again, a lot of work that will likely require an understanding of ARPACK's workings. ARPACK is written in Fortran. We use a version translated to C with f2c. f2c is no longer maintained and only works on F77 code. Newer ARPACK uses some F90. This is why upgrading it is problematic.
PRPACK is the default for PageRank calculations in all high-level interfaces of igraph, including IGraph/M. You will only encounter this issue if you explicitly choose the ARPACK methods. As I recall, that only makes sense for damping factors which are very close to 1.0, as PRPACK does very well for everything else. Similar issues may be present in other functions which use ARPACK. Since PRPACK is a PageRank calculation library, it can't be used for those other functions (such as eigenvector centrality). You can see the common theme here now: lack of time and resources. This would require a very large resource investment for a relatively small gain. At this time, there are other issues which affect more uses cases in more severe ways, so they are of higher priority. @ValZapod Did you encounter this bug in practice? If you have run into any problems during practical work, let us know, as that helps prioritize issues. |
Ultimately, the best solution for IGraph/M is to link to a newer ARPACK, not the old one bundled with igraph. Why is this difficult?
|
It seems that the fix for #2090 made this failure rarer, but it still occurs sometimes. Interestingly, when it does occur, in most cases ARPACK finds an eigenvalue of -0.85 which is certainly related to the default damping factor of 0.85. Once again, this is tested with the vendored ARPACK. |
No, I don't think so; I've attempted a conversion a few months ago and it seemed to me like the new codebase uses certain language constructs that |
Describe the bug
igraph_pagerank()
with the ARPACK method sometimes fails, produces an eigenvalue that is not 1, and produces negative results or even non-finite values (not sure if NaN or Infinity).To reproduce
This is easy to reproduce with fuzzing: create a directed random graph with at least 8 vertices, and run the function many times. Be sure to check if the eigenvalue differs from 1. Eventually you will find a case where it does. I see no correlation with the graph structure that causes the failure. Here are some examples:
One:
Two:
Three:
Four:
The failure is non-deterministic and is somehow related to the random (!!) way that
igraph_pagerank
initialized ARPACK's starting vector.Example in Mathematica:
Output:
PageRank values:
Graph:
For completeness, this specific graph's edge list in C format:
Python format:
The text was updated successfully, but these errors were encountered: