Reworked Dijkstra's algorithm for weighted betweenness metrics. #880

Closed
gpiskas wants to merge 4 commits into networkx:master from Geo-Piskas:master
Conversation

@gpiskas commented Jun 1, 2013

This revamped piece of code does not need a 'seen' list. It also pushes fewer values onto Q, which is used as a heap. Moreover, it performs significantly fewer 'if' checks and other operations.

I have tested the implementation, and it outputs the same results as the previous one.

@ghost assigned hagberg Jun 6, 2013
@hagberg (Member) commented Jun 6, 2013

There is another version of this algorithm in
networkx/algorithms/shortest_paths/weighted.py.
Does that one have better performance?
In principle we should only have one implementation. I can't remember right now why there are two.

@gpiskas (Author) commented Jun 6, 2013

Yes, for the reasons I mentioned above. Maybe there are two because this one is specialized for centrality measures and not just shortest paths.

@gpiskas (Author) commented Jun 20, 2013

Well? Isn't this all right?

@hagberg (Member) commented Jun 20, 2013

Probably - I haven't had time yet to review it and figure out why we have multiple versions.

@dschult (Member) commented Jun 20, 2013

It looks like we have a stand-alone version of Dijkstra here because it returns more than just distance or paths. It keeps track of multiple paths as well as the order in which the nodes are traversed. This makes it easier to accumulate the paths into betweenness measures.

Two comments on the current pull request.

  1. I get different results for sigma. It looks like the PR version gives sigma values that are half of the original's, but that may only be for the networks I looked at. Is there some reason this doesn't matter in the end?
  2. A quick speed test shows the PR version is slower: the original takes about 70% of the time of the new one for random graphs with density 0.2.

I also have a question about the replacement of the dicts "D" and "seen" with the single dict "dist". The difference between "D" and "seen" is that "seen" gets updated when a node is first put on the heap and "D" is updated when the node is taken off the heap. The PR doesn't distinguish between the two, and this removes some "if" statements. My question is: can you explain why this gives the same results? Also, the if conditions may account for the speedup.
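
For reference, a minimal sketch of the two-dict pattern described above (assumed names, not the actual source):

```python
# "seen" holds tentative distances, written when a node is first pushed;
# "D" holds final distances, written when the node is popped (settled).
import heapq

def dijkstra_two_dicts(G, s, weight="weight"):
    # G is assumed to be a dict-of-dicts adjacency,
    # e.g. {u: {v: {"weight": 2.0}}}.
    D = {}           # final distances, filled on pop
    seen = {s: 0.0}  # tentative distances, filled on push
    Q = [(0.0, s)]
    while Q:
        dist, v = heapq.heappop(Q)
        if v in D:
            continue  # an older, longer entry for an already-settled node
        D[v] = dist
        for w, attrs in G[v].items():
            vw_dist = dist + attrs.get(weight, 1.0)
            if w not in D and (w not in seen or vw_dist < seen[w]):
                seen[w] = vw_dist
                heapq.heappush(Q, (vw_dist, w))
    return D
```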

Summary:
Geo-Piskas, can you verify that my observations on sigma and timing are correct? And is there a way to see that the removal of the if conditions (and thus one dict) gives the same results?
Thanks

@gpiskas (Author) commented Jun 20, 2013

Thanks for the reply, guys!

  1. The results of sigma are halved due to the removal of old line 248. There is no point making that summation, for the reason I will now explain.
     We need sigma for the accumulation part, in the method "_accumulate_edges(betweenness, S, P, sigma, s)". As you can see, we first calculate "coeff = (1.0 + delta[w]) / sigma[w]" and then "c = sigma[v] * coeff".
     In other words, c = (1.0 + delta[w]) * sigma[v] / sigma[w]. It is easy to see that there is no problem multiplying or dividing all the sigma values by 1/2, or by any constant, because (2*sigma[v]) / (2*sigma[w]) = sigma[v] / sigma[w] eventually. (See the sketch after this list.)

  2. As far as speed is concerned, I am sorry, but due to university exams I lack the time for a comprehensive test.
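
A quick numeric check of the cancellation argument in point 1 (hypothetical sigma and delta values, plugged into the coeff formula quoted above):

```python
# Only the ratio sigma[v] / sigma[w] enters the betweenness update,
# so a global scale factor on all sigma values cancels out.
sigma = {"v": 3.0, "w": 6.0}  # hypothetical path counts
delta_w = 0.5                 # hypothetical dependency value for w

def contribution(sig_v, sig_w, d_w):
    # c = sigma[v] * coeff, with coeff = (1.0 + delta[w]) / sigma[w]
    return sig_v * (1.0 + d_w) / sig_w

c_original = contribution(sigma["v"], sigma["w"], delta_w)
c_halved = contribution(sigma["v"] / 2, sigma["w"] / 2, delta_w)
assert c_original == c_halved  # the factor of 1/2 cancels exactly
```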

Additional notes:
The dist dictionary holds the distance from the source to every other node; infinity means the node has not been visited yet. Thus, we no longer need the D and seen structures to check whether a node has already been seen. I guess the way to verify the correctness of the new implementation is through testing.
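
A minimal sketch of this single-dict idea (a paraphrase under the assumptions above, not the PR's actual code):

```python
# dist starts at infinity for every node, so no separate "seen"/"D"
# bookkeeping is needed to tell visited nodes from unvisited ones.
import heapq
from math import inf

def dijkstra_single_dict(G, source, weight="weight"):
    # G is assumed to be a dict-of-dicts adjacency.
    dist = {v: inf for v in G}  # inf == not reached yet
    dist[source] = 0.0
    Q = [(0.0, source)]
    while Q:
        d, v = heapq.heappop(Q)
        if d > dist[v]:
            continue  # stale heap entry; a shorter path was found later
        for w, attrs in G[v].items():
            vw_dist = d + attrs.get(weight, 1.0)
            if vw_dist < dist[w]:
                dist[w] = vw_dist
                heapq.heappush(Q, (vw_dist, w))
    return dist
```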

@gpiskas (Author) commented Jun 30, 2013

Guys?

@dschult (Member) commented Jul 5, 2013

It looks like this version is slower; the original takes about 70% of the time. But perhaps more important is that it isn't clear to me that this version is correct. The logic has changed in order to reduce the number of dicts. With two dicts, the "seen" dict gets updated when the node is first connected to, and the "D" dict is updated when its neighbors are first accessed. With the new code, the "dist" dict is updated when the node is first connected to. It is likely that the logic of the if statements (the "continue" based on D, and the if's creating/updating seen/dist) works out to be the same, but I haven't worked through all the possibilities. But even if the logic is the same, the routine is slower. It looks like saving one dict loses some speed.

I'm inclined to leave the current code unless the new code is faster or demonstrably thinner (less RAM).

@gpiskas (Author) commented Jul 6, 2013

It seems like line 251 is to blame. There must be a more efficient way to perform this operation.

@dschult (Member) commented Jul 6, 2013

Try Q.remove(...), maybe?

@gpiskas (Author) commented Jul 6, 2013

I need a way to do this operation in less than O(n) time.

@dschult (Member) commented Jul 6, 2013

You can leave it in the queue and ignore it when it comes to the top of the queue. But that's what the old code did.
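
For anyone following along, a small sketch of the lazy-deletion trick (hypothetical heap contents):

```python
import heapq

# (7.0, "b") is outdated: a shorter path to "b" was found after it
# was pushed.
Q = [(2.0, "b"), (5.0, "c"), (7.0, "b")]
heapq.heapify(Q)
dist = {"b": 2.0, "c": 5.0}  # current best known distances

# Eager removal: Q.remove((7.0, "b")) is O(n), and restoring the heap
# invariant with heapq.heapify(Q) is O(n) again.

# Lazy removal: pop everything, but skip entries that no longer match
# the best known distance.
while Q:
    d, v = heapq.heappop(Q)
    if d > dist[v]:
        continue  # stale entry left over from an earlier push
    print(v, d)   # process node v at distance d
```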

@gpiskas (Author) commented Jul 9, 2013

Alright! Now it should be faster! It was for me, and it passed all the tests! 👍

@dschult (Member) commented Jul 15, 2013

Thanks for that! I'm getting closer to understanding. Two more questions though:

You treat sigma differently than the original code does. You set sigma[w] = sigma[v] when w is first seen. The old code sets sigma[w] = 0.0 when w is first seen and then adds sigma[pred] (which is just sigma[v]) when the neighbors of w are first examined. At first glance the two methods look the same. But how do you know that sigma[v] hasn't changed between these two points? By adding sigma[v] too early you may not count all the paths to v, and thus the number of paths to w will be incorrect.
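
To make the question concrete, here is a hedged sketch of the deferred ordering described above (a paraphrase of the original routine's structure, not the NetworkX source):

```python
# sigma[pred] is added to sigma[v] only when v is popped, at which
# point every shortest path to pred has already been counted, so
# sigma[pred] can no longer change.
import heapq
from itertools import count

def shortest_path_counts(G, s, weight="weight"):
    # G is assumed to be a dict-of-dicts adjacency,
    # e.g. {u: {v: {"weight": 1.0}}}.
    sigma = dict.fromkeys(G, 0.0)
    sigma[s] = 1.0
    D, seen = {}, {s: 0.0}
    c = count()  # tie-breaker so the heap never compares node objects
    Q = [(0.0, next(c), None, s)]
    while Q:
        dist, _, pred, v = heapq.heappop(Q)
        if v in D:
            continue  # stale entry: v was already settled
        if pred is not None:
            sigma[v] += sigma[pred]  # pred is settled, so this is final
        D[v] = dist
        for w, attrs in G[v].items():
            vw_dist = dist + attrs.get(weight, 1.0)
            if w not in D and (w not in seen or vw_dist < seen[w]):
                seen[w] = vw_dist
                sigma[w] = 0.0  # strictly shorter path: discard old counts
                heapq.heappush(Q, (vw_dist, next(c), v, w))
            elif vw_dist == seen.get(w):
                sigma[w] += sigma[v]  # another shortest path through v
    return sigma

# Demo: a diamond s->a->v, s->c->v, then v->w: two shortest paths to w.
G = {"s": {"a": {}, "c": {}}, "a": {"v": {}}, "c": {"v": {}},
     "v": {"w": {}}, "w": {}}
print(shortest_path_counts(G, "s"))  # sigma["w"] == 2.0
```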

The timing is still slower for the new routine. I think the test "if v in S" takes longer than checking "if v in D", especially for large graphs, because S is a list and D is a dict. I used gnp_random_graph(100, 0.5, seed=0) and the new version was 2 times slower; with 1000 nodes it was 10 times slower. It seems like a tradeoff between space (an extra dict D keyed by node to distance) and time (looking up nodes in a dict rather than a list).
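
A minimal harness illustrating the membership-test gap (assumed sizes; list membership is O(n), dict membership is O(1) on average):

```python
import timeit

n = 1000
S = list(range(n))               # settled-node list, as in the PR
D = {v: 0.0 for v in range(n)}   # distance dict, as in the original

# Worst case for the list: the node tested is at the end.
print(timeit.timeit("n - 1 in S", globals=globals(), number=10_000))
print(timeit.timeit("n - 1 in D", globals=globals(), number=10_000))
```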

@gpiskas (Author) commented Jul 27, 2013

Alright, this is the final piece of code, and it's as good as it gets. What do you think?

@coveralls commented

Coverage Status: changes unknown when pulling 1bc32d0 on Geo-Piskas:master into networkx:master.

@coveralls commented

Coverage Status: changes unknown when pulling ee5517c on Geo-Piskas:master into networkx:master.

@dschult (Member) commented Jul 29, 2013

This looks good--I don't see any issues with it. It's very helpful to have someone dig through the code for these common routines. Actually, looking at the diff against the current "master", we're almost back to the original codebase! In fact, the only changes I can see are the name seen becoming dist, and the default weights being int where the codebase uses float.

dist is probably a better name than seen for this method, but a number of our other routines use the name seen, so I'd like to keep the naming consistent with those. As for int vs. float, it shouldn't matter, but Python 2 has the crazy int-division-truncates problem, so I'd like to keep it as a float. So I'm going to close this PR.

Thanks for the submission and please submit other routines when they're ready.
Dan

@dschult closed this Jul 29, 2013
@gpiskas (Author) commented Jul 29, 2013

Well, one change that actually helps, offering as far as I remember a 5%-10% speedup, is at line 245: instead of doing "sigma = dict.fromkeys(G, 0.0)", I initialize sigma to 0.0 inside the previous loop.
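
Roughly like this (a sketch with assumed surrounding names; P stands for the predecessor-list dict built in the same loop in the original routine):

```python
# Hypothetical adjacency, standing in for a NetworkX graph.
G = {"a": {"b": {}}, "b": {}}

# Separate pass (old line 245):
sigma = dict.fromkeys(G, 0.0)

# Folded into the loop that already builds the predecessor lists:
P, sigma = {}, {}
for v in G:
    P[v] = []
    sigma[v] = 0.0
```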
