Reworked Dijkstra's algorithm for weighted betweenness metrics. #880
Conversation
This revamped piece of code does not need a 'seen' list. Also, fewer values need to be added to Q, which is used as a heap. Moreover, it does significantly fewer 'if' checks and other operations.
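For readers following along, here is a minimal sketch of the kind of single-dict Dijkstra variant being described: one `dist` dict plus lazy skipping of stale heap entries, with `sigma` path counts and predecessor lists kept for betweenness accumulation. The function name, the dict-of-dicts graph format, and the positive-weight assumption are mine for illustration; this is not the actual PR code.

```python
import heapq

def dijkstra_counts(G, s):
    """Single-source Dijkstra that also counts shortest paths.

    Illustrative sketch, not the PR's code. Assumes G is a dict mapping
    node -> dict of neighbor -> positive edge weight. Uses one 'dist'
    dict plus lazy skipping of stale heap entries, instead of separate
    'seen' and 'D' dicts.
    """
    dist = {s: 0.0}   # tentative distances; final once the node is popped
    sigma = {s: 1.0}  # number of shortest paths found to each node
    P = {s: []}       # predecessors of each node on shortest paths
    S = []            # nodes in the order they are finalized
    done = set()
    Q = [(0.0, s)]
    while Q:
        d, v = heapq.heappop(Q)
        if v in done:
            continue  # stale heap entry: v was already finalized
        done.add(v)
        S.append(v)
        for w, weight in G[v].items():
            vw = d + weight
            if w not in dist or vw < dist[w]:
                dist[w] = vw
                heapq.heappush(Q, (vw, w))
                # v is finalized here, so sigma[v] is already final
                sigma[w] = sigma[v]
                P[w] = [v]
            elif vw == dist[w]:
                sigma[w] += sigma[v]  # another shortest path via v
                P[w].append(v)
    return S, P, sigma, dist
```

Note that `sigma[v]` is only read after `v` has been popped and finalized, which is what makes the early `sigma[w] = sigma[v]` assignment safe in this sketch.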
There is another version of this algorithm in …
Yes, for the reasons I mentioned above. Maybe you have two, because this one is specialized for centrality measures and not just shortest paths.
Well? Isn't this all right?
Probably - I haven't had time yet to review it and figure out why we have multiple versions.
It looks like we have a stand-alone version of Dijkstra here because it returns more than just distance or paths. It keeps track of multiple paths as well as the order in which the nodes are traversed. This makes it easier to accumulate the paths into betweenness measures. Two comments on the current pull request.
I also have a question about the replacement of dicts "D" and "seen" with the single dict "dist". The difference between "D" and "seen" is that "seen" gets updated when a node is first put on the heap, and "D" is updated when the node is taken off the heap. The PR doesn't distinguish between the two, and this removes some "if" statements. My question is: can you explain why this gives the same results? Also, the removed if conditions may account for the speedup. Summary: …
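For contrast, here is a minimal sketch of the classic two-dict pattern the question refers to: `seen` holds tentative distances set when a node is pushed, and `D` holds final distances set when the node is popped. The names follow the discussion, but this is an illustration, not the library's exact code.

```python
import heapq

def dijkstra_two_dicts(G, s):
    """Classic two-dict Dijkstra, sketched for illustration.

    Assumes G is a dict mapping node -> dict of neighbor -> positive
    edge weight. 'seen' records a tentative distance when a node is
    first pushed onto the heap; 'D' records the final distance when
    the node is popped off the heap.
    """
    D = {}            # final distances (set when a node is popped)
    seen = {s: 0.0}   # tentative distances (set when a node is pushed)
    Q = [(0.0, s)]
    while Q:
        d, v = heapq.heappop(Q)
        if v in D:
            continue  # stale entry: v was already finalized
        D[v] = d
        for w, weight in G[v].items():
            vw = d + weight
            if w not in D and (w not in seen or vw < seen[w]):
                seen[w] = vw
                heapq.heappush(Q, (vw, w))
    return D
```

The `dist` dict in the PR plays both roles at once, which is why some of the `if` tests can disappear.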
Thanks for the reply, guys.
Additional notes: …
Guys?
It looks like this version is slower. The original takes about 70% of the time. But perhaps more important is that it isn't clear to me that this version is correct. The logic has changed in order to reduce the number of dicts.

With two dicts, the "seen" dict gets updated when the node is first connected to, and the "D" dict is updated when its neighbors are first accessed. With the new code, the "dist" dict is updated when the node is first connected to. It is likely that the logic of the if statements (the "continue" based on D, and the creating/updating of seen/dist) works out to be the same, but I haven't worked through all the possibilities.

But even if the logic is the same, the routine is slower. It looks like saving one dict loses some speed. I'm inclined to leave the current code unless the new code is faster or demonstrably thinner (less RAM).
Seems like line 251 is to blame. There must be a more efficient way to perform this action.
Try Q.remove(...), maybe?
I need a way to do this operation in less than O(n) time.
You can leave it in the queue and ignore it when it comes to the top of the queue. But that's what the old code did.
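The "leave it in the queue and ignore it" idea is the standard lazy-deletion trick for binary heaps: `Q.remove(...)` on a Python list costs O(n) plus a re-heapify, while skipping stale entries at pop time costs only an O(1) membership test. A tiny standalone demonstration (the values are made up):

```python
import heapq

# A heap of (distance, node) entries, as in Dijkstra's queue.
heap = [(5, "a"), (2, "b"), (7, "c")]
heapq.heapify(heap)

# Node "a" gets a better distance: instead of removing the stale
# (5, "a") entry (O(n)), just push the improved one (O(log n)).
heapq.heappush(heap, (1, "a"))

order = []
finalized = set()
while heap:
    d, v = heapq.heappop(heap)
    if v in finalized:
        continue  # the stale (5, "a") entry lands here and is ignored
    finalized.add(v)
    order.append((v, d))

print(order)  # [('a', 1), ('b', 2), ('c', 7)]
```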
Alright! Now it must be faster! It was for me, and it passed all the tests! 👍
Thanks for that! I'm getting closer to understanding. Two more questions, though:

You treat sigma differently than the original code does. You set sigma[w] = sigma[v] when w is first seen. The old code sets sigma[w] = 0.0 when first seen and then adds sigma[pred] (which is just sigma[v]) when the neighbors of w are first examined. At first the two methods look the same. But how do you know that sigma[v] hasn't changed between these times? By adding sigma[v] too early you may not get all the paths to v, and thus the number of paths to w will be incorrect.

The timing is still slower for the new routine. I think the test "if v in S" takes longer than checking "if v in D", especially for large graphs, because S is a list and D is a dict. I used gnp_random_graph(100, 0.5, seed=0) and it was 2 times slower, but with 1000 nodes it was 10 times slower. It seems like a tradeoff between space (an extra dict D keyed by node to distance) and time (looking up nodes in a dict rather than a list).
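The list-versus-dict membership point is easy to confirm with a micro-benchmark: `x in a_list` scans linearly (O(n)), while `x in a_dict` is an average O(1) hash lookup. The sizes and iteration counts below are arbitrary, chosen only to make the gap visible:

```python
import timeit

n = 10_000
S_list = list(range(n))            # stands in for the list S of finalized nodes
D_dict = dict.fromkeys(range(n))   # stands in for the dict D keyed by node

# Worst case for the list: the sought element is at the very end.
t_list = timeit.timeit(lambda: n - 1 in S_list, number=1000)
t_dict = timeit.timeit(lambda: n - 1 in D_dict, number=1000)

print(t_list > t_dict)  # list membership is far slower at this size
```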
Alright, this is the final piece of code, and it's as good as it gets. What do you think?
Changes Unknown when pulling 1bc32d0 on Geo-Piskas:master into networkx:master.
Changes Unknown when pulling ee5517c on Geo-Piskas:master into networkx:master.
This looks good; I don't see any issues with it. It's very helpful to have someone dig through the code for these common routines.

Actually, looking at the diff from the current "master", we're almost back to the original codebase! In fact, the only changes I can see are the name seen becoming dist, and the default weights being int whereas in the codebase they are float. dist is probably a better name than seen for this method, but we have a number of other routines that use the name seen, so I'd like to keep the name the same as those routines. For int vs float, it shouldn't matter, but Python 2 has the crazy int-division-truncates problem, so I'd like to keep that as a float.

So I'm going to close this PR. Thanks for the submission, and please submit other routines when they're ready.
Well, a change that actually offers, as far as I remember, a 5%-10% speedup is at line 245. Instead of doing "sigma = dict.fromkeys(G, 0.0)", I initialize sigma to 0.0 inside the previous loop.
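As a small illustration of the suggested change (the node names and the surrounding loop are invented here; line 245 refers to the PR's file, not this snippet), both forms produce the same initial `sigma`:

```python
G_nodes = ["a", "b", "c"]

# Separate initialization pass, as in the current code:
sigma1 = dict.fromkeys(G_nodes, 0.0)

# Fused into a loop that already iterates over the nodes:
sigma2 = {}
for v in G_nodes:
    sigma2[v] = 0.0  # piggybacks on per-node work already being done
    # ... other per-node setup would happen here ...

assert sigma1 == sigma2
```

The saving comes from avoiding a second full pass over the nodes, at the cost of slightly less obvious initialization.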
I have tested the implementation and it outputs the same results as the previous one.