
Refactor connectivity package #1126

Merged: 24 commits merged into networkx:master from refactor-connectivity on May 20, 2014

Conversation

@jtorrents (Member)

Use the new algorithms and the new interfaces provided by the flow package. I think that the code is much clearer using the new interface, and the new algorithms and interface provide a significant speedup in running time. For some problems, this code is one order of magnitude faster (see #1102 for benchmarks). So far the code is still backwards compatible, but we should discuss some possible non-backwards-compatible improvements. I have some doubts about the implementation and documentation that I'd like to discuss.

  1. In this first version I've added the flow_func parameter to all connectivity functions. Being able to use several flow algorithms is great for tests. Also, the two algorithms that make sense in this scenario (edmonds_karp and shortest_augmenting_path) perform better in different scenarios: the former is faster in very sparse networks with power-law-like degree distributions, and the latter in denser networks with edges more evenly distributed among nodes. Thus it makes sense to allow users to pick an algorithm. The implementation takes care of using the optimal parameters for these two algorithms (cutoff for both, and two_phase for shortest_augmenting_path). So far I've set the default flow function to edmonds_karp because it is faster in a wide range of contexts. I'll prepare more detailed benchmarks comparing these two algorithms.
  2. Much of the speedup comes from reusing the data structures needed for the underlying maximum flow computations (the residual network) and for local node|edge connectivity (the auxiliary digraph). I think we should explain this to users and show them how they can do the same in their code. I've tried to do that in the docstrings of local_node_connectivity and local_edge_connectivity. Do you think that we should document how to reuse the data structures?
  3. The connectivity algorithms rely on two data structures (residual and auxiliary) that are conceptually different but could be merged into a single (more complex) data structure. This does not simplify the code much (we get rid of a parameter in function calls and one line of code each time we initialize the data structures), and I think it makes it harder to understand what the code is actually doing. So I'd prefer to keep them separate, but I'm open to merging them.
  4. These changes are, so far, backwards compatible, but I think we can improve the interface: a user who is only interested in computing node|edge cuts|connectivity will be well served by node_connectivity, edge_connectivity, minimum_node_cut and minimum_edge_cut. These functions support computing these measures for the whole graph and also for two nodes, so I think these are the functions we should import into the base NetworkX namespace. The other functions (local_* and minimum_st_*_cuts) are more specialized and should be used by users interested in building connectivity algorithms themselves. These functions accept all parameters of the flow interface, plus some of their own, in order to allow reuse of data structures and achieve a significant speedup. So I think we could keep these functions in the connectivity package and require users who want them to import them explicitly from there.
  5. I was tempted to remove some connectivity functions, such as all_pairs_node_connectivity_matrix, because it seems easy to write the few lines of code required to compute them. However, I'm a bit hesitant because implementations that do not reuse the data structures will be a lot slower than the version we provide here. Maybe this could be solved with better documentation.
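To make points 1 and 2 concrete, here is a minimal sketch of picking a flow algorithm via flow_func and reusing the auxiliary digraph and residual network across many local computations. Function names follow the connectivity and flow packages; the exact signatures shown are an assumption and may differ between NetworkX versions.

```python
import itertools

import networkx as nx
from networkx.algorithms.connectivity import (
    build_auxiliary_node_connectivity,
    local_node_connectivity,
)
from networkx.algorithms.flow import build_residual_network, shortest_augmenting_path

G = nx.icosahedral_graph()  # 5-regular and 5-connected

# Point 1: choose the flow algorithm explicitly.
k = nx.node_connectivity(G, flow_func=shortest_augmenting_path)

# Point 2: build the auxiliary digraph and residual network once,
# then reuse them for every pair instead of rebuilding each time.
H = build_auxiliary_node_connectivity(G)
R = build_residual_network(H, 'capacity')
pairwise = {
    (u, v): local_node_connectivity(G, u, v, auxiliary=H, residual=R)
    for u, v in itertools.combinations(G, 2)
}
```

The reuse pattern is exactly what makes all-pairs computations tractable: the expensive setup happens once rather than once per pair.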

Use the new algorithms and the new interfaces provided by the flow package.
The code is much clearer, and the new algorithms and interface provide
a significant speed up in running time. For some problems, this code is one order
of magnitude faster (see networkx#1102 for benchmarks). So far the code is still
backwards compatible, but we should discuss some possible non-backwards-compatible
improvements.
@jtorrents (Member Author)

Some benchmarks to show the strong and weak points of edmonds_karp and shortest_augmenting_path. The legacy ford_fulkerson implementation is not included because it is too slow, but it implements the same algorithm as edmonds_karp.

Run times in seconds:

Graph               edmonds_karp   SAP          edmonds_karp   SAP
                    (node-conn)    (node-conn)  (edge-conn)    (edge-conn)
-----------------   ------------   -----------  ------------   -----------
Gnp(200, 0.2)             2.858        2.859         0.099        0.139
Gnp(200, 0.5)            97.461       50.318         0.261        0.287
Gnp(200, 0.7)           208.808      115.520         0.327        0.396
-----------------   ------------   -----------  ------------   -----------
Powerlaw(1000, 2)         2.125       13.560         0.653        3.254
Powerlaw(2000, 2)        12.314       80.617         2.907       17.390
Powerlaw(3000, 2)        31.563      233.510         7.429       49.709

As you can see, each algorithm is strong in different contexts. Thus we should allow the user to select one via the flow_func parameter. I think that edmonds_karp is better as a default because the networks that I care about are very sparse with skewed degree distributions, but I'm open to discussing that.

@ysitu (Contributor) commented Apr 29, 2014

I have yet to look at the code. Did you enable cutoff in your tests?

@jtorrents (Member Author)

Yes, in the results above both algorithms use cutoff (in this version of the code both algorithms always use cutoff; at least that was my intent).

It must be said that the run times posted above were measured on a considerably slower machine (Xeon X5650) than those I posted in #1102 (Haswell i7-4700QM). I see that this can be confusing, because the run times in #1102 do not use cutoff but are (only slightly!) faster than the figures presented here.

@jtorrents (Member Author)

The code that I'm using for the benchmarks is:

import time
import networkx as nx
from networkx.utils import powerlaw_sequence, create_degree_sequence

flow_funcs = dict(
    edmonds_karp=nx.edmonds_karp,
    shortest_ap=nx.shortest_augmenting_path,
)

def build_power_law(n, exponent=2.0):
    # Largest connected component of a configuration model graph with a
    # power-law degree sequence and no self-loops (NetworkX 1.x APIs).
    deg_seq = create_degree_sequence(n, powerlaw_sequence, 100, exponent=exponent)
    G = nx.Graph(nx.configuration_model(deg_seq))
    G.remove_edges_from(G.selfloop_edges())
    G = sorted(nx.connected_component_subgraphs(G), key=len, reverse=True)[0]
    G.name = 'Power law configuration model: {0}'.format(n)
    return G

def benchmark_connectivity():
    graphs = []
    for p in [0.2, 0.5, 0.7]:
        G = nx.fast_gnp_random_graph(200, p)
        graphs.append(G)
    for n in [1000, 2000, 3000]:
        G = build_power_law(n)
        graphs.append(G)
    for G in graphs:
        print(nx.info(G))
        print("Computing node connectivity")
        for fname, flow_func in sorted(flow_funcs.items()):
            start = time.time()
            k = nx.node_connectivity(G, flow_func=flow_func)
            end = time.time() - start
            print(" " * 4 + "{0}:\t{1:.3f} seconds".format(fname, end))
        print("Computing edge connectivity")
        for fname, flow_func in sorted(flow_funcs.items()):
            start = time.time()
            k = nx.edge_connectivity(G, flow_func=flow_func)
            end = time.time() - start
            print(" " * 4 + "{0}:\t{1:.3f} seconds".format(fname, end))

if __name__ == '__main__':
    benchmark_connectivity()

@jtorrents (Member Author)

Hmm, it seems that Travis is failing while trying to install coverage. I'll open an issue.

@hagberg added this to the networkx-1.9 milestone May 2, 2014
1. The node mapping needed for node connectivity and minimum node cuts
is now a graph attribute of the auxiliary digraph. Thus there is no
need for a mapping parameter in the local version of these functions.

2. Change the parameter name for the auxiliary digraph from `aux_digraph`
to `auxiliary`. Also be consistent about the auxiliary digraph variable name
in the code. Now it is always `H`.

3. Added a small sanity check for the auxiliary digraph for node connectivity.
If a digraph is passed as a parameter for reuse, check that it has a graph
attribute with the node mapping. If not, we raise instead of rebuilding the
auxiliary digraph.
With the addition of the example of how to compute local node connectivity
among all pairs of nodes reusing the data structures (added in the previous
commit in the docstrings of local_node_connectivity), we can remove this
function (which is the only one that has a numpy dependency in the
connectivity package).
This change significantly improves the speed of node_connectivity in
denser graphs. For very sparse ones it increases speed by ~5%.
This change is backwards incompatible. Updated docstrings for all affected
functions with example usage. The global functions provide a good enough
interface for most uses of connectivity algorithms. More sophisticated uses
require explicit imports from the flow package anyway.
@jtorrents (Member Author)

The failure in Python 3.2 is unrelated to this PR; it seems to be caused by functions that use scipy:
https://travis-ci.org/networkx/networkx/jobs/24348012#L2166

I've implemented some changes (a few backwards incompatible) in the commits above:

  1. Improved auxiliary digraph for connectivity functions: the node mapping needed for node connectivity and minimum node cuts is now a graph attribute of the auxiliary digraph. Thus there is no need for a mapping parameter in the local version of these functions. Also changed the parameter name for the auxiliary digraph from aux_digraph to auxiliary.
  2. Added examples of reusing data structures in the local version of connectivity and cut functions, and improved docstrings for all functions.
  3. Improved cutoff handling in node_connectivity, based on the fact that node connectivity is bounded by the minimum degree. I only did this change for edge connectivity in the first commit. This change significantly improves the speed of node_connectivity in denser graphs (in my tests I see ~20% less time in networks with density 0.7; I should probably update the benchmarks). For very sparse networks the time reduction is ~5% or less.
  4. Removed the all_pairs_node_connectivity_matrix function: with the addition of the example of how to compute local node connectivity among all pairs of nodes reusing the data structures, we can remove this function (which is the only one that has a numpy dependency in the connectivity package).
  5. Removed local connectivity/cut functions from the base namespace: this change is backwards incompatible, but I think it is worth it. Also updated docstrings for all affected functions with example usage. The global functions provide a good enough interface for most uses of connectivity algorithms. More sophisticated uses require explicit imports from the flow package anyway.
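A sketch of change 1: the node mapping now travels with the auxiliary digraph as a graph attribute (the attribute name 'mapping' follows the connectivity utilities), so callers no longer pass a separate mapping parameter. The surrounding example graph is illustrative.

```python
import networkx as nx
from networkx.algorithms.connectivity import build_auxiliary_node_connectivity

G = nx.petersen_graph()
H = build_auxiliary_node_connectivity(G)

# The mapping is stored on the auxiliary digraph itself. Each original
# node is split into an 'A' node and a 'B' node joined by a
# unit-capacity edge, so H has twice as many nodes as G.
mapping = H.graph['mapping']
```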

H.add_edge('%dA' % i, '%dB' % i, capacity=1)

edges = []
for (source, target) in G.edges():
Contributor (inline review comment):

G.edges_iter()

@ysitu (Contributor) commented May 11, 2014

Save for some minor issues, I think that this is okay.

@jtorrents (Member Author)

Thanks for looking at this @ysitu! I've made the changes that you suggested. I'm now using the nice v = min(G, key=G.degree) way of selecting a node with minimum degree. I'm posting updated benchmarks with the latest changes. Using the fact that connectivity is bounded by degree significantly increases performance on dense networks.

I think that this is ready for merging. The interfaces to connectivity algorithms are only slightly backwards incompatible (the mapping parameter is gone and local functions are not imported into the base namespace), but exposing and using the new interfaces to flow algorithms is a big improvement. I did not run detailed benchmarks of this PR against the code in 1.8.1 that uses the legacy ford_fulkerson, but for some problems (sparse networks with skewed degree distributions) it is 10x faster (or more).

Run times in seconds:

Graph               edmonds_karp   SAP          edmonds_karp   SAP
                    (node-conn)    (node-conn)  (edge-conn)    (edge-conn)
-----------------   ------------   -----------  ------------   -----------
Gnp(200, 0.2)             1.946        2.520         0.064        0.097
Gnp(200, 0.5)            65.678       39.659         0.157        0.195
Gnp(200, 0.7)           150.120       94.216         0.215        0.265
-----------------   ------------   -----------  ------------   -----------
Powerlaw(1000, 2)         1.685       12.038         0.470        2.348
Powerlaw(2000, 2)        10.456       79.087         2.185       15.667
Powerlaw(3000, 2)        25.963      212.799         6.667       43.309
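The degree-bound trick can be sketched as follows; the min(G, key=G.degree) idiom is the one adopted in the PR, while the surrounding code is just an illustration on an example graph.

```python
import networkx as nx

G = nx.petersen_graph()  # 3-regular, node connectivity 3

# Node connectivity is bounded above by the minimum degree, so a
# minimum-degree node is a good source node and its degree a natural
# upper bound (cutoff) for the underlying flow computations.
v = min(G, key=G.degree)
bound = G.degree(v)
k = nx.node_connectivity(G)
```

Once a flow computation reaches the bound, further augmentation cannot change the answer, which is why cutting off early pays most on dense graphs.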

nx.set_edge_attributes(H, 'capacity', capacity)
return H
else:
H = G.to_directed()
Contributor (inline review comment):
to_directed/to_undirected will also end up deep-copying user data of unknown size/copyability. The proper fix is to make them and copy accept a data=False argument. But that belongs in another PR.

for (source, target) in G.edges_iter():
H.add_edges_from([(source, target), (target, source)])
capacity = dict((e, 1) for e in H.edges())
nx.set_edge_attributes(H, 'capacity', capacity)
Contributor (inline review comment):

Same as above.

@jtorrents (Member Author)

> Nowhere in the documentation is it mentioned that the graph can be capacitated. The user will be very surprised to see a NetworkXUnbounded due to an uncapacitated edge.

Well, if a user gets a NetworkXUnbounded exception, then it is a bug on our side, because what has to be capacitated is the auxiliary digraph, not the input graph that the user passes as an argument.

All the connectivity and cut functions are supposed to work on graphs with capacity == 1 for all edges. The only exception is minimum_st_edge_cut, which, because it uses the new minimum_cut interface, is able to compute weighted cuts. This is the only function that has a capacity parameter. I'll try to clarify its docstrings.

I'll do the changes that you propose shortly, and will also check the docstrings to make sure that we have no back-ticks missing.

@ysitu (Contributor) commented May 13, 2014

If the user specifies the capacities of some but not all edges, they will likely get a NetworkXUnbounded when they use the edge connectivity/cut functions.

@jtorrents (Member Author)

Oh, I see. You are right; I'll fix that. We should always build the auxiliary network with unit capacities, even if the graph passed to build_auxiliary_edge_connectivity already has a capacity edge attribute.
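A sketch of the fix, based on how build_auxiliary_edge_connectivity behaves in the connectivity utilities: the auxiliary digraph gets unit capacities on every edge regardless of any capacity attribute on the input graph.

```python
import networkx as nx
from networkx.algorithms.connectivity import build_auxiliary_edge_connectivity

G = nx.cycle_graph(4)
G[0][1]['capacity'] = 7  # a partial, user-set capacity on the input graph

# The auxiliary digraph is always built with capacity=1 on every edge
# (two directed edges per undirected edge), so the edge connectivity
# routines cannot run into NetworkXUnbounded.
H = build_auxiliary_edge_connectivity(G)
```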

All connectivity and cut algorithms are supposed to work on unit
capacity networks. minimum_st_edge_cut was the only exception: because
it uses the new interface to flow algorithms, it is potentially able to
compute weighted cutsets. However, this complicated the implementation
and could result in a NetworkXUnbounded exception if a user passed a
graph with some but not all edges carrying an attribute called capacity.
Since minimum_cut computes weighted cuts, there is no need to duplicate
that functionality here.
Always build the auxiliary network and simplify the building of the
auxiliary digraph.
@jtorrents (Member Author)

After looking at the problem that @ysitu pointed out, I think that the cleanest option is not to allow weighted computations in minimum_st_edge_cut. All connectivity and cut algorithms are supposed to work on unit capacity networks. minimum_st_edge_cut was the only exception: it is potentially able to compute weighted cutsets because it uses the new interface to flow algorithms. However, this complicated the implementation and could result in a NetworkXUnbounded exception if a user passed a graph with some but not all edges carrying an attribute called capacity. Since minimum_cut computes weighted cuts, there is no need to duplicate that functionality here.

Also improved the generation of the auxiliary digraph for edge connectivity and cleaned up the docstrings.
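Since weighted cutsets are out of scope for minimum_st_edge_cut after this change, users who need them can go through the flow package's minimum_cut directly. A minimal sketch on a small illustrative graph:

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge('s', 'a', capacity=3)
G.add_edge('a', 't', capacity=2)
G.add_edge('s', 'b', capacity=1)
G.add_edge('b', 't', capacity=4)

# minimum_cut handles capacitated (weighted) graphs and returns the
# cut value together with the node partition induced by the cut.
cut_value, (reachable, non_reachable) = nx.minimum_cut(G, 's', 't')
```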

@chebee7i (Member)

@jtorrents How common is it to want to calculate node connectivity between all pairs?

If it is quite common, I'd lean towards including a simple function that does this for the user. One general complaint about Python libraries (e.g., in comparison to R) is that they sometimes tend to be too low-level. The example you provided is >10 lines, right? It involves the use of itertools, auxiliary digraphs, and residual networks. Users who only want node connectivity between pairs and don't care to learn the NetworkX implementation will thank you for being able to calculate it in one line.

@jtorrents (Member Author)

Good point @chebee7i. We could add this function again. However, I think that computing node connectivity between all pairs is not very common, because it is quite a slow computation; even for moderately sized problems it will not be practical. In fact, flow-based connectivity algorithms are based on clever ways to avoid computing a minimum cut among all pairs of nodes.

However, I agree that if a user needs to compute node connectivity among all pairs, they would otherwise have to dive into implementation details... and that might not be as pleasant for them as it is for us ;). So I'll add it again. I'm not sure which data structure would be best for the result. The previous version returned a 2d numpy array, but I'm thinking that a plain old dict might do. What do you think?

@chebee7i (Member)

I'm fine with either. The NumPy array is more efficient, but it is a dependency. So dict works.

1. Change function name from all_pairs_node_connectivity_matrix to
all_pairs_node_connectivity.

2. The function now returns a dict instead of a numpy 2d array.

3. New parameter nbunch for computing node connectivity only among
pairs of nodes in the container nbunch.

4. Added old and new tests for all_pairs_node_connectivity to
test_connectivity.py.
@jtorrents (Member Author)

I've added the function all_pairs_node_connectivity_matrix again, with some modifications:

  1. Changed the function name from all_pairs_node_connectivity_matrix to
    all_pairs_node_connectivity.
  2. The function now returns a dict instead of a numpy 2d array.
  3. New parameter nbunch for computing node connectivity only among
    pairs of nodes in nbunch.
  4. Added old and new tests for all_pairs_node_connectivity to
    test_connectivity.py.
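Usage of the reinstated function, per the list above (dict-of-dicts result and the nbunch restriction); the example graph is illustrative:

```python
import networkx as nx

G = nx.petersen_graph()

# Dict of dicts: K[u][v] is the local node connectivity between u and v.
K = nx.all_pairs_node_connectivity(G)

# Restrict the computation to pairs drawn from a container of nodes.
K_sub = nx.all_pairs_node_connectivity(G, nbunch=[0, 1, 2])
```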

@jtorrents (Member Author)

Almost forgot to comment that I've added the function all_pairs_node_connectivity to the base NetworkX namespace. Not sure if this is necessary. We could also keep it in the connectivity package and require an explicit import.

@ysitu (Contributor) commented May 15, 2014

Also need to put it in the Sphinx source.

@jtorrents (Member Author)

Added all_pairs_node_connectivity to the Sphinx sources. Also added the functions for building auxiliary digraphs to the package documentation. And made a small fix in the stoer_wagner docstrings: only the first line shows up as the summary of the function in the generated documentation.

@ysitu (Contributor) commented May 16, 2014

How about adding a test or two to check edge_connectivity against stoer_wagner?

@jtorrents (Member Author)

Good idea @ysitu! I've added a test that uses several platonic graphs to check edge connectivity against stoer_wagner.
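A sketch of such a cross-check (the graph choices here are illustrative, not necessarily the ones added in the PR's test suite):

```python
import networkx as nx

# For an undirected graph, edge connectivity equals the global minimum
# cut, so the flow-based result should match stoer_wagner's cut value.
for G in (nx.tetrahedral_graph(), nx.octahedral_graph(), nx.icosahedral_graph()):
    cut_value, partition = nx.stoer_wagner(G)
    assert nx.edge_connectivity(G) == cut_value
```

stoer_wagner treats missing edge weights as 1, so it works directly on these unweighted graphs.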

@jtorrents (Member Author)

Any other comment on this?

ysitu added a commit that referenced this pull request May 20, 2014
@ysitu merged commit 4017b0c into networkx:master May 20, 2014
@jtorrents deleted the refactor-connectivity branch February 5, 2016 16:53