Refactor connectivity package #1126
Conversation
Use the new algorithms and the new interfaces provided by the flow package. The code is much clearer, and the new algorithms and interface provide a significant speedup in running time. For some problems, this code is one order of magnitude faster (see networkx#1102 for benchmarks). So far the code is still backwards compatible, but we should discuss some possible non-backwards-compatible improvements.
Some benchmarks to show the strong and weak points of each algorithm:
As you can see, each algorithm is strong in different contexts. Thus we should allow the user to select one via the `flow_func` parameter.
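In current NetworkX this selection is exposed as a `flow_func` keyword argument. A minimal sketch, using the modern API names (which may differ from the PR-era code):

```python
import networkx as nx
from networkx.algorithms.flow import edmonds_karp, shortest_augmenting_path

# The icosahedral graph is 5-regular and 5-connected, so both
# algorithms must agree on the answer; only the run time differs.
G = nx.icosahedral_graph()
for flow_func in (edmonds_karp, shortest_augmenting_path):
    print(flow_func.__name__, nx.node_connectivity(G, flow_func=flow_func))
```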
I have yet to look at the code. Did you enable `cutoff`?
Yes, in the results above both algorithms use `cutoff`. It must be said that the run times posted above were computed on a considerably slower machine (Xeon X5650) than the run times that I've posted at #1102 (Haswell i7-4700QM). I see that it can be confusing because the run times at #1102 do not use `cutoff`.
The code that I'm using for the benchmarks is:

```python
import time

import networkx as nx
from networkx.utils import powerlaw_sequence, create_degree_sequence

flow_funcs = dict(
    edmonds_karp=nx.edmonds_karp,
    shortest_ap=nx.shortest_augmenting_path,
)


def build_power_law(n, exponent=2.0):
    deg_seq = create_degree_sequence(n, powerlaw_sequence, 100)
    G = nx.Graph(nx.configuration_model(deg_seq))
    G.remove_edges_from(G.selfloop_edges())
    G = sorted(nx.connected_component_subgraphs(G), key=len, reverse=True)[0]
    G.name = 'Power law configuration model: {0}'.format(n)
    return G


def benchmark_connectivity():
    graphs = []
    for p in [0.2, 0.5, 0.7]:
        G = nx.fast_gnp_random_graph(200, p)
        graphs.append(G)
    for n in [1000, 2000, 3000]:
        G = build_power_law(n)
        graphs.append(G)
    for G in graphs:
        print(nx.info(G))
        print("Computing node connectivity")
        for fname, flow_func in sorted(flow_funcs.items()):
            start = time.time()
            k = nx.node_connectivity(G, flow_func=flow_func)
            end = time.time() - start
            print(" " * 4 + "{0}:\t{1:.3f} seconds".format(fname, end))
        print("Computing edge connectivity")
        for fname, flow_func in sorted(flow_funcs.items()):
            start = time.time()
            k = nx.edge_connectivity(G, flow_func=flow_func)
            end = time.time() - start
            print(" " * 4 + "{0}:\t{1:.3f} seconds".format(fname, end))


if __name__ == '__main__':
    benchmark_connectivity()
```
Hmm, it seems that Travis is failing while trying to install coverage. I'll open an issue.
1. The node mapping needed for node connectivity and minimum node cuts is now a graph attribute of the auxiliary digraph. Thus there is no need for a mapping parameter in the local version of these functions.
2. Change the parameter name for the auxiliary digraph from `aux_digraph` to `auxiliary`. Also be consistent with the auxiliary digraph variable name in the code; now it is always `H`.
3. Added a small sanity check for the auxiliary digraph for node connectivity. If a digraph is passed as a parameter for reuse, check that it has a graph attribute with the node mapping. If not, we raise instead of rebuilding the auxiliary digraph.
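As a sketch of this using the current NetworkX API (today's function and attribute names, which may differ slightly from this PR's revision): the mapping travels with the auxiliary digraph as a graph attribute, so no separate parameter is needed.

```python
import networkx as nx
from networkx.algorithms.connectivity import build_auxiliary_node_connectivity

G = nx.path_graph(4)
H = build_auxiliary_node_connectivity(G)
# The node mapping is stored on the auxiliary digraph itself.
print(sorted(H.graph['mapping']))  # [0, 1, 2, 3]
```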
With the addition of the example of how to compute local node connectivity among all pairs of nodes reusing the data structures (added in the previous commit in the docstrings of local_node_connectivity), we can remove this function (which is the only one that has a numpy dependency in the connectivity package).
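The reuse pattern referred to here looks roughly like this in the current NetworkX API (a sketch, not the exact PR-era code): build the auxiliary digraph and residual network once, then pass them to every local call.

```python
import itertools

import networkx as nx
from networkx.algorithms.connectivity import (
    build_auxiliary_node_connectivity,
    local_node_connectivity,
)
from networkx.algorithms.flow import build_residual_network

G = nx.petersen_graph()
# Build the two data structures once...
H = build_auxiliary_node_connectivity(G)
R = build_residual_network(H, 'capacity')
# ...and reuse them for every pair of nodes.
result = {n: {} for n in G}
for u, v in itertools.combinations(G, 2):
    result[u][v] = local_node_connectivity(G, u, v, auxiliary=H, residual=R)
print(result[0][2])  # 3: the Petersen graph is 3-connected
```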
This change significantly improves the speed of node_connectivity in denser graphs. For very sparse ones it increases speed by ~5%.
This change is backwards incompatible. Updated docstrings for all affected functions with example usage. The global functions provide a good enough interface for most uses of connectivity algorithms. More sophisticated uses require explicit imports from the flow package anyway.
The failure in Python 3.2 is unrelated to this PR; it seems to be caused by functions that use scipy.
I've implemented some changes (a few backwards incompatible) in the commits above:
```python
H.add_edge('%dA' % i, '%dB' % i, capacity=1)
```

```python
edges = []
for (source, target) in G.edges():
```

Use `G.edges_iter()` here.
Save for some minor issues, I think that this is okay.
Thanks for looking at this @ysitu! I've done the changes that you suggested; I'm now using the nice `G.edges_iter()`. I think that this is ready for merging. The interfaces to connectivity algorithms are only a bit backwards incompatible (parameter renames such as `aux_digraph` to `auxiliary`).
```python
    nx.set_edge_attributes(H, 'capacity', capacity)
    return H
else:
    H = G.to_directed()
```
`to_directed`/`to_undirected` will also end up deep-copying user data of unknown size/copyability. The proper fix is to make them and `copy` accept a `data=False` argument. But that belongs to another PR.
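A small illustration of the deep-copy behavior being discussed, as it works in current NetworkX (a sketch, not code from this PR):

```python
import networkx as nx

G = nx.Graph()
G.add_edge(0, 1, payload=[1, 2, 3])  # mutable user data on an edge
D = G.to_directed()                  # deep-copies node/edge/graph data
D[0][1]['payload'].append(4)
print(G[0][1]['payload'])  # [1, 2, 3] -- the original is untouched
```

For large attribute payloads this copy is pure overhead when the caller only needs the graph structure, which is the reviewer's point.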
```python
for (source, target) in G.edges_iter():
    H.add_edges_from([(source, target), (target, source)])
capacity = dict((e, 1) for e in H.edges())
nx.set_edge_attributes(H, 'capacity', capacity)
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
Well, if a user gets a `NetworkXUnbounded` exception… All the connectivity and cut functions are supposed to work on graphs with capacities == 1 for all edges. The only exception is `minimum_st_edge_cut`. I'll do the changes that you propose shortly, and will also check the docstrings to make sure that we have no back-ticks missing.
If the user specifies the capacities of some but not all edges, they will likely get a `NetworkXUnbounded` exception.
Oh I see. You are right, I'll fix that. We should always build the auxiliary network with unit capacities even if the graph passed to `minimum_st_edge_cut` already has capacities.
All connectivity and cut algorithms are supposed to work on unit-capacity networks. minimum_st_edge_cut was the only exception: because it uses the new interface to flow algorithms, it is potentially able to compute weighted cutsets. However, this complicated the implementation and could result in a NetworkXUnbounded exception if a user passed as argument a graph with some but not all edges carrying an attribute called capacity. Since minimum_cut computes weighted cuts, there is no need to duplicate functionality here.
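The failure mode described in this commit message can be reproduced directly with the flow interface; a sketch using the current NetworkX API:

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge('s', 'a', capacity=1)
G.add_edge('a', 't')  # no 'capacity' attribute: treated as infinite
print(nx.maximum_flow_value(G, 's', 't'))  # 1, bounded by the s->a edge

G.add_edge('s', 't')  # an uncapacitated s->t path makes the flow unbounded
try:
    nx.maximum_flow_value(G, 's', 't')
except nx.NetworkXUnbounded:
    print('NetworkXUnbounded')
```

Edges without a `capacity` attribute are given infinite capacity by the flow algorithms, which is why a partially capacitated graph can blow up.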
Always build the auxiliary network, and simplify the process of building the auxiliary digraph.
After looking at the problem that @ysitu pointed out, I think that the cleanest option is to not allow weighted computations in `minimum_st_edge_cut`. Also improved the generation of the auxiliary digraph for edge connectivity and cleaned up the docstrings.
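With that change, `minimum_st_edge_cut` always works on the unit-capacity auxiliary network; weighted cuts are delegated to `nx.minimum_cut`. A usage sketch with the current NetworkX API:

```python
import networkx as nx
from networkx.algorithms.connectivity import minimum_st_edge_cut

G = nx.complete_graph(5)
cutset = minimum_st_edge_cut(G, 0, 4)
print(len(cutset))  # 4: in K5 every node has degree 4
```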
@jtorrents How common is it to want to calculate node connectivity between all pairs? If it is quite common, I'd lean towards including a simple function that does this for the user. One general complaint about Python libraries (e.g., in comparison to R) is that sometimes they tend to be too low level. The example you provided is >10 lines, right? It involves the use of itertools, auxiliary digraphs, and residuals. Users who only want node connectivity between pairs and don't care to learn the NetworkX implementation will thank you for being able to calculate it in one line.
Good point @chebee7i. We could add this function again. However I think that computing node connectivity between all pairs is not very common, because it is quite a slow computation; even for moderately sized problems it will not be practical. In fact, flow-based connectivity algorithms are based on clever ways to avoid computing a minimum cut among all pairs of nodes. However I agree that if a user needs to compute node connectivity among all pairs, they will have to dive into implementation details... and that might not be so pleasant for them as it is for us ;). So I'll add it again. I'm not sure which would be the best data structure for returning it. In the previous version it returned a 2d numpy array, but I'm thinking that a plain old dict might do. What do you think?
I'm fine with either. The NumPy array is more efficient, but it is a dependency. So dict works.
This reverts commit 7cb64b6.
1. Change function name from all_pairs_node_connectivity_matrix to all_pairs_node_connectivity.
2. The function now returns a dict instead of a numpy 2d array.
3. New parameter nbunch for computing node connectivity only among pairs of nodes in the container nbunch.
4. Added old and new tests for all_pairs_node_connectivity to test_connectivity.py.
I've added the function `all_pairs_node_connectivity` again.
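In today's NetworkX the re-added function looks roughly like this (a sketch with modern names):

```python
import networkx as nx
from networkx.algorithms.connectivity import all_pairs_node_connectivity

G = nx.cycle_graph(5)
K = all_pairs_node_connectivity(G)  # plain dict of dicts
print(K[0][2])  # 2: a cycle is 2-connected
# The nbunch parameter restricts the computation to pairs within a container.
K_sub = all_pairs_node_connectivity(G, nbunch=[0, 1, 2])
print(sorted(K_sub))  # [0, 1, 2]
```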
Almost forgot to comment that I've added the function `all_pairs_node_connectivity`.
Also need to put it in the Sphinx source.
Added it to the Sphinx source.
How about adding a test or two to check `edge_connectivity` against `stoer_wagner`?
Good idea @ysitu! I've added a test that uses several platonic graphs to check edge connectivity against `stoer_wagner`.
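Such a cross-check works because `stoer_wagner` computes a global minimum edge cut independently of the flow-based code; a sketch with the current API (edges without a `weight` attribute are treated as unit weight):

```python
import networkx as nx

G = nx.octahedral_graph()  # a platonic graph: 4-regular, 4-edge-connected
cut_value, partition = nx.stoer_wagner(G)
print(cut_value == nx.edge_connectivity(G))  # True
```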
Any other comment on this?
Use the new algorithms and the new interfaces provided by the flow package. I think that the code is much clearer by using the new interface, and the new algorithms and interface provide a significant speedup in running time. For some problems, this code is one order of magnitude faster (see #1102 for benchmarks). So far the code is still backwards compatible, but we should discuss some possible non-backwards-compatible improvements. I have some doubts about the implementation and documentation that I'd like to discuss.
1. I've added a `flow_func` parameter to all connectivity functions. It is great for tests to be able to use several flow algorithms. Also, the two algorithms that make sense in this scenario (`edmonds_karp` and `shortest_augmenting_path`) perform better in different scenarios: the former is faster in very sparse, power-law-like degree distribution networks, and the latter for denser networks with the edges more evenly distributed among nodes. Thus it makes sense to allow users to pick an algorithm. The implementation takes care of using the optimal parameters (`cutoff` for both and `two_phase` for SAP) for these two algorithms. So far I've set the default flow function to `edmonds_karp` because it is faster in a wide set of contexts. I'll prepare more detailed benchmarks between these two algorithms.
2. Reusing data structures in `local_node_connectivity` and `local_edge_connectivity`: do you think that we should document how to reuse the data structures?
3. There are two data structures (`residual` and `auxiliary`) that are conceptually different, but could be merged in a unique (more complex) data structure. This does not simplify the code a lot (we get rid of a parameter in function calls and a line of code each time we initialize the data structures) and I think it makes it more difficult to understand what the code is actually doing. So I'd prefer to keep them separated, but I'm open to merging them.
4. The main functions are `node_connectivity`, `edge_connectivity`, `minimum_node_cut` and `minimum_edge_cut`. These functions support computing that for the whole graph and also for two nodes. So I think that these are the functions that we should import into the base NetworkX namespace. The other functions (`local_*` and `minimum_st_*_cuts`) are more specialized and should be used by users interested in building connectivity algorithms themselves. These functions accept all parameters of the flow interface, and some more of their own, in order to reuse data structures and achieve a significant speedup. So I think that we could keep these functions only in the connectivity package and require users that want to use them to import them explicitly from the connectivity package.
5. I've removed `all_pairs_node_connectivity_matrix` because it seems easy to write the few lines of code required for computing it. However I'm a bit hesitant, because if you implement it without reusing the data structures, it will be a lot slower than the version that we provide here. This could be solved maybe with better documentation.