Reworked Dijkstra's algorithm for weighted betweenness metrics. #880

Closed
gpiskas wants to merge 4 commits into networkx:master from Geo-Piskas:master
Conversation

@gpiskas commented Jun 1, 2013

This revamped piece of code does not need a 'seen' list. It also pushes fewer values onto Q, which is used as a heap. Moreover, it performs significantly fewer 'if' checks and other operations.

I have tested the implementation, and it outputs the same results as the previous one.

@ghost assigned hagberg Jun 6, 2013
@hagberg (Member) commented Jun 6, 2013

There is another version of this algorithm in
networkx/algorithms/shortest_paths/weighted.py.
Does that one have better performance?
In principle we should only have one implementation. I can't remember right now why there are two.

@gpiskas (Author) commented Jun 6, 2013

Yes, for the reasons I mentioned above. Maybe there are two because this one is specialized for centrality measures and not just shortest paths.

@gpiskas (Author) commented Jun 20, 2013

Well? Isn't this all right?

@hagberg (Member) commented Jun 20, 2013

Probably - I haven't had time yet to review it and figure out why we have multiple versions.

@dschult (Member) commented Jun 20, 2013

It looks like we have a stand-alone version of Dijkstra here because it returns more than just distance or paths. It keeps track of multiple paths as well as the order in which the nodes are traversed. This makes it easier to accumulate the paths into betweenness measures.

Two comments on the current pull request.

  1. I get different results for sigma. It looks like the PR version gives sigma values that are half of the original's, but that may only be for the networks I looked at. Is there some reason this doesn't matter in the end?
  2. A quick speed test shows the PR version is slower: the original takes about 70% of the time of the new one for random graphs with density 0.2.

I also have a question about the replacement of the dicts "D" and "seen" with the single dict "dist". The difference between "D" and "seen" is that "seen" gets updated when a node is first put on the heap and "D" is updated when the node is taken off the heap. The PR doesn't distinguish between the two, and this removes some "if" statements. My question is: can you explain why this gives the same results? Also, the if conditions may account for the speedup.
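
For reference, a minimal sketch of the two-dict pattern described above (assumed names, not the actual source):

```python
# "seen" holds tentative distances, written when a node is first pushed;
# "D" holds final distances, written when the node is popped (settled).
import heapq

def dijkstra_two_dicts(G, s, weight="weight"):
    # G is assumed to be a dict-of-dicts adjacency,
    # e.g. {u: {v: {"weight": 2.0}}}.
    D = {}           # final distances, filled on pop
    seen = {s: 0.0}  # tentative distances, filled on push
    Q = [(0.0, s)]
    while Q:
        dist, v = heapq.heappop(Q)
        if v in D:
            continue  # an older, longer entry for an already-settled node
        D[v] = dist
        for w, attrs in G[v].items():
            vw_dist = dist + attrs.get(weight, 1.0)
            if w not in D and (w not in seen or vw_dist < seen[w]):
                seen[w] = vw_dist
                heapq.heappush(Q, (vw_dist, w))
    return D
```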

Summary:
Geo-Piskas, can you verify that my observations on sigma and timing are correct? And is there a way to see that the removal of the if conditions (and thus one dict) gives the same results?
Thanks

@gpiskas (Author) commented Jun 20, 2013

Thanks for the reply, guys!

  1. The results of sigma are halved due to the removal of old line 248. There is no point making that summation, for the reason I will now explain.
     We need sigma for the accumulation part, in the method "_accumulate_edges(betweenness, S, P, sigma, s)". As you can see, we first calculate "coeff = (1.0 + delta[w]) / sigma[w]" and then "c = sigma[v] * coeff".
     In other words, c = (1.0 + delta[w]) * sigma[v] / sigma[w]. It is easy to see that there is no problem multiplying or dividing all the sigma values by 1/2, or by any constant, because (2*sigma[v]) / (2*sigma[w]) = sigma[v] / sigma[w] eventually. (See the sketch after this list.)

  2. As far as speed is concerned, I am sorry, but due to university exams I lack the time for a comprehensive test.
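
A quick numeric check of the cancellation argument in point 1 (hypothetical sigma and delta values, plugged into the coeff formula quoted above):

```python
# Only the ratio sigma[v] / sigma[w] enters the betweenness update,
# so a global scale factor on all sigma values cancels out.
sigma = {"v": 3.0, "w": 6.0}  # hypothetical path counts
delta_w = 0.5                 # hypothetical dependency value for w

def contribution(sig_v, sig_w, d_w):
    # c = sigma[v] * coeff, with coeff = (1.0 + delta[w]) / sigma[w]
    return sig_v * (1.0 + d_w) / sig_w

c_original = contribution(sigma["v"], sigma["w"], delta_w)
c_halved = contribution(sigma["v"] / 2, sigma["w"] / 2, delta_w)
assert c_original == c_halved  # the factor of 1/2 cancels exactly
```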

Additional notes:
The dist dictionary holds the distance from the source to every other node; infinity means the node has not been visited yet. Thus, we no longer need the D and seen structures to check whether a node has already been seen. I guess the way to verify the correctness of the new implementation is through testing.
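
A minimal sketch of this single-dict idea (a paraphrase under the assumptions above, not the PR's actual code):

```python
# dist starts at infinity for every node, so no separate "seen"/"D"
# bookkeeping is needed to tell visited nodes from unvisited ones.
import heapq
from math import inf

def dijkstra_single_dict(G, source, weight="weight"):
    # G is assumed to be a dict-of-dicts adjacency.
    dist = {v: inf for v in G}  # inf == not reached yet
    dist[source] = 0.0
    Q = [(0.0, source)]
    while Q:
        d, v = heapq.heappop(Q)
        if d > dist[v]:
            continue  # stale heap entry; a shorter path was found later
        for w, attrs in G[v].items():
            vw_dist = d + attrs.get(weight, 1.0)
            if vw_dist < dist[w]:
                dist[w] = vw_dist
                heapq.heappush(Q, (vw_dist, w))
    return dist
```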

@gpiskas (Author) commented Jun 30, 2013

Guys?

@dschult (Member) commented Jul 5, 2013

It looks like this version is slower; the original takes about 70% of the time. But perhaps more important is that it isn't clear to me that this version is correct. The logic has changed in order to reduce the number of dicts. With two dicts, the "seen" dict gets updated when the node is first connected to, and the "D" dict is updated when its neighbors are first accessed. With the new code, the "dist" dict is updated when the node is first connected to. It is likely that the logic of the if statements (the "continue" based on D, and the if's creating/updating seen/dist) works out to be the same, but I haven't worked through all the possibilities. But even if the logic is the same, the routine is slower. It looks like saving one dict loses some speed.

I'm inclined to leave the current code unless the new code is faster or demonstrably thinner (less RAM).

@gpiskas (Author) commented Jul 6, 2013

It seems like line 251 is to blame. There must be a more efficient way to perform this operation.

@dschult (Member) commented Jul 6, 2013

Try Q.remove(...), maybe?

@gpiskas (Author) commented Jul 6, 2013

I need a way to do this operation in less than O(n) time.

@dschult (Member) commented Jul 6, 2013

You can leave it in the queue and ignore it when it comes to the top of the queue. But that's what the old code did.
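
For anyone following along, a small sketch of the lazy-deletion trick (hypothetical heap contents):

```python
import heapq

# (7.0, "b") is outdated: a shorter path to "b" was found after it
# was pushed.
Q = [(2.0, "b"), (5.0, "c"), (7.0, "b")]
heapq.heapify(Q)
dist = {"b": 2.0, "c": 5.0}  # current best known distances

# Eager removal: Q.remove((7.0, "b")) is O(n), and restoring the heap
# invariant with heapq.heapify(Q) is O(n) again.

# Lazy removal: pop everything, but skip entries that no longer match
# the best known distance.
while Q:
    d, v = heapq.heappop(Q)
    if d > dist[v]:
        continue  # stale entry left over from an earlier push
    print(v, d)   # process node v at distance d
```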

@gpiskas (Author) commented Jul 9, 2013

Alright! Now it should be faster! It was for me, and it passed all the tests! 👍

@dschult (Member) commented Jul 15, 2013

Thanks for that! I'm getting closer to understanding. Two more questions though:

You treat sigma differently than the original code does. You set sigma[w] = sigma[v] when w is first seen. The old code sets sigma[w] = 0.0 when w is first seen and then adds sigma[pred] (which is just sigma[v]) when the neighbors of w are first examined. At first glance the two methods look the same. But how do you know that sigma[v] hasn't changed between these two points? By adding sigma[v] too early you may not count all the paths to v, and thus the number of paths to w will be incorrect.
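
To make the question concrete, here is a hedged sketch of the deferred ordering described above (a paraphrase of the original routine's structure, not the NetworkX source):

```python
# sigma[pred] is added to sigma[v] only when v is popped, at which
# point every shortest path to pred has already been counted, so
# sigma[pred] can no longer change.
import heapq
from itertools import count

def shortest_path_counts(G, s, weight="weight"):
    # G is assumed to be a dict-of-dicts adjacency,
    # e.g. {u: {v: {"weight": 1.0}}}.
    sigma = dict.fromkeys(G, 0.0)
    sigma[s] = 1.0
    D, seen = {}, {s: 0.0}
    c = count()  # tie-breaker so the heap never compares node objects
    Q = [(0.0, next(c), None, s)]
    while Q:
        dist, _, pred, v = heapq.heappop(Q)
        if v in D:
            continue  # stale entry: v was already settled
        if pred is not None:
            sigma[v] += sigma[pred]  # pred is settled, so this is final
        D[v] = dist
        for w, attrs in G[v].items():
            vw_dist = dist + attrs.get(weight, 1.0)
            if w not in D and (w not in seen or vw_dist < seen[w]):
                seen[w] = vw_dist
                sigma[w] = 0.0  # strictly shorter path: discard old counts
                heapq.heappush(Q, (vw_dist, next(c), v, w))
            elif vw_dist == seen.get(w):
                sigma[w] += sigma[v]  # another shortest path through v
    return sigma

# Demo: a diamond s->a->v, s->c->v, then v->w: two shortest paths to w.
G = {"s": {"a": {}, "c": {}}, "a": {"v": {}}, "c": {"v": {}},
     "v": {"w": {}}, "w": {}}
print(shortest_path_counts(G, "s"))  # sigma["w"] == 2.0
```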

The timing is still slower for the new routine. I think the test "if v in S" takes longer than checking "if v in D", especially for large graphs, because S is a list and D is a dict. I used gnp_random_graph(100, 0.5, seed=0) and the new version was 2 times slower; with 1000 nodes it was 10 times slower. It seems like a tradeoff between space (an extra dict D keyed by node to distance) and time (looking up nodes in a dict rather than a list).
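
A minimal harness illustrating the membership-test gap (assumed sizes; list membership is O(n), dict membership is O(1) on average):

```python
import timeit

n = 1000
S = list(range(n))               # settled-node list, as in the PR
D = {v: 0.0 for v in range(n)}   # distance dict, as in the original

# Worst case for the list: the node tested is at the end.
print(timeit.timeit("n - 1 in S", globals=globals(), number=10_000))
print(timeit.timeit("n - 1 in D", globals=globals(), number=10_000))
```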

@gpiskas (Author) commented Jul 27, 2013

Alright, this is the final piece of code, and it's as good as it gets. What do you think?

@coveralls commented

Coverage Status: changes unknown when pulling 1bc32d0 on Geo-Piskas:master into networkx:master.

@coveralls commented

Coverage Status: changes unknown when pulling ee5517c on Geo-Piskas:master into networkx:master.

@dschult (Member) commented Jul 29, 2013

This looks good--I don't see any issues with it. It's very helpful to have someone dig through the code for these common routines. Actually, looking at the diff against the current "master", we're almost back to the original codebase! In fact, the only changes I can see are the name seen becoming dist, and the default weights being int where the codebase uses float.

dist is probably a better name than seen for this method, but a number of our other routines use the name seen, so I'd like to keep the naming consistent with those. As for int vs. float, it shouldn't matter, but Python 2 has the crazy int-division-truncates problem, so I'd like to keep it as a float. So I'm going to close this PR.

Thanks for the submission and please submit other routines when they're ready.
Dan

@dschult closed this Jul 29, 2013
@gpiskas (Author) commented Jul 29, 2013

Well, one change that actually helps, offering as far as I remember a 5%-10% speedup, is at line 245: instead of doing "sigma = dict.fromkeys(G, 0.0)", I initialize sigma to 0.0 inside the previous loop.
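
Roughly like this (a sketch with assumed surrounding names; P stands for the predecessor-list dict built in the same loop in the original routine):

```python
# Hypothetical adjacency, standing in for a NetworkX graph.
G = {"a": {"b": {}}, "b": {}}

# Separate pass (old line 245):
sigma = dict.fromkeys(G, 0.0)

# Folded into the loop that already builds the predecessor lists:
P, sigma = {}, {}
for v in G:
    P[v] = []
    sigma[v] = 0.0
```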
