Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent dijkstra api #3923

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

nescio007
Copy link

The API for weighted shortest paths has two triples of functions of the form *_source_dijkstra, *_source_dijkstra_path, *_source_dijkstra_path_length, where the first returns both lengths and paths, the second only the path, and the third only the length.
While intuitively all three should function roughly the same, only the first version allows to specify a target parameter.

In short, this PR

  • Adds a target parameter to the *_source_dijkstra_path and *_source_dijkstra_path_length variants
  • Uses the new parameter to remove some duplicate code
  • Makes multi_source_dijkstra the effective "backend" of all versions, with a new capture_paths flag to prevent full path computation when they are not needed.

Previously only single_source_dijkstra had a target parameter, but not single_source_dijkstra_path or single_source_dijkstra_path_length
write variants of existing testcases to make use of the target parameter
@nescio007 nescio007 marked this pull request as ready for review April 16, 2020 21:02
@dschult
Copy link
Member

dschult commented May 29, 2020

This seems like a good idea. Unifying the API for those 3 functions is a good goal.
Is there a way to avoid the capture_paths flag by passing in a dict to fill with paths? (similar to _multisource?

@nescio007
Copy link
Author

I couldn't find an easy way:
multi_source_dijkstra is part of the public API, and thus should return both lengths and paths even when called just as multi_source_dijkstra(G, sources). A paths dict argument would thus need a good default value that would still return paths: An empty dict ({}) does not work (mutable), and None should be used to indicate that no paths should be captured. I guess one could use some fixed, non-dict value to indicate that a fresh paths-dict should be constructed, but I'm not sure that would yield a more elegant solution than the current capture_paths flag.

Additionally, a paths argument has somewhat weird semantics: as multi_source_dijkstra still has to return the captured paths (in order not to break the current API), there are now two places that contain the results: the dict given as the paths argument, and the value returned from multi_source_dijkstra.

I'm not 100% happy with the current capture_paths flag either, but it's the best I could come up with :-/

@dschult
Copy link
Member

dschult commented Jun 1, 2020

Here's an approach that would make that work smoothly (and other things):

  • instead of using multi_source_dijkstra as the function all others call, use the internal function _dijkstra_multisource.
  • The internal function should assume all inputs are reasonable and valid. This leaves the input checking to the outer-layer functions. Outer-layer functions that know "sources" is not empty don't have to check that. This reduces checking overall. Basically, put all checking in the outer functions. Yes, that might duplicate some code, but it saves code from being executed -- and it avoids an API with 7 functions that simply call a single function that could have been the API in and of itself... :}
  • The inner function can allow paths and pred inputs that get morphed. It knows those inputs are OK because our code (the outer-layer functions) created them. And it knows that those outer functions will ignore those input parameters if they aren't appropriate to be returned to user. In this sense, outer-layer functions construct all inputs and process all outputs. The inner function doesn't have to check inputs or worry about outputs.
  • preprocessing of the weight argument should also be done in the outer-layer functions. Again, that duplicates code in the file, but avoids calling unnecessary code. And it moves all input checking and construction to the outer-layer.
  • don't have outer-layer functions call other outer-layer functions. Calling many functions is slow in an interpreted language like Python and makes it hard for people to find what is done where in the code... They have to look up many functions to finally find the code that does the core of the algorithm. So, just have each function process inputs, construct inputs, call the single inner function that does everything (except check inputs), process output and return it to the user.

I think this will take care of the capture_paths argument in the API. And it'll make the rest of the code easier to parse for people trying to figure out how this code works.

Move `capture_paths` and argument checking to a "middle-layer" function `_dijkstra_multisource_checked`, before calling the "inner-layer" `_dijkstra_multisource` implementation.
This prevents adding `capture_paths` to the public API, but maintains a single place for argument checking/return value handling without code duplicates.
@nescio007
Copy link
Author

How about the following compromise, that moves capture_paths to a "middle-layer" function?
This leaves the public API as-is (no new arguments), but prevents code-duplication for argument checking.

To reply to some of your comments:

and it avoids an API with 7 functions that simply call a single function that could have been the API in and of itself... :}

Aren't we already in that situation?
dijkstra_path is the same as single_source_dijkstra_path (which is the same as multi_source_dijkstra_path with {source} as sources). Likewise, single_source_dijkstra and single_source_dijkstra_path are basically the same, only that the latter ignores one of the return values of the former.

don't have outer-layer functions call other outer-layer functions. Calling many functions is slow in an interpreted language like Python and makes it hard for people to find what is done where in the code... They have to look up many functions to finally find the code that does the core of the algorithm. So, just have each function process inputs, construct inputs, call the single inner function that does everything (except check inputs), process output and return it to the user.

Is the performance-penalty worth the code-duplication?
If dijkstra_path calls multi_source_dijkstra_path, the reader can immediately grasp that they are essentially the same.
If both were to call _dijkstra_multisource, but had their own pre-processing, the reader would first need to check that also this pre-processing is the same.

In terms of performance penalty, I did a quick benchmark using one of the small graphs from the test-suite:

import timeit

import networkx as nx

from networkx.algorithms.shortest_paths.weighted import _dijkstra_multisource_checked

def main():

    N = 100000

    XG = nx.DiGraph()
    XG.add_weighted_edges_from([('s', 'u', 10), ('s', 'x', 5),
                                     ('u', 'v', 1), ('u', 'x', 2),
                                     ('v', 'y', 1), ('x', 'u', 3),
                                     ('x', 'v', 5), ('x', 'y', 2),
                                     ('y', 's', 7), ('y', 'v', 6)])

    print("dijkstra_path:                ", end='\t')
    print(timeit.timeit(lambda: nx.dijkstra_path(XG, 's', 'v'), number=N))

    print("single_source_dijkstra_path:  ", end='\t')
    print(timeit.timeit(lambda: nx.single_source_dijkstra_path(XG, 's', 'v'), number=N))

    print("multi_source_dijkstra_path:   ", end='\t')
    print(timeit.timeit(lambda: nx.multi_source_dijkstra_path(XG, {'s'}, 'v'), number=N))

    print("_dijkstra_multisource_checked:", end='\t')
    print(timeit.timeit(lambda: _dijkstra_multisource_checked(XG, {'s'}, 'v'), number=N))

if __name__ == '__main__':
    main()

With this I get:

dijkstra_path:                	0.7512081320001016
single_source_dijkstra_path:  	0.7353465300002426
multi_source_dijkstra_path:   	0.7260776550010632
_dijkstra_multisource_checked:	0.7037647990000551

Which is an overhead of about 7% introduced by the dijkstra_path -> single_source_dijkstra_path -> multi_source_dijkstra_path -> _dijkstra_multisource_checked call-chain.

To be honest, I did not expect the function call overhead to be that high.
Then again, this is a very small graph, and the relative difference should be much smaller for graphs (where most of the time will be spent in the actual Dijkstra implementation).

Sorry for being such a nuisance, but I wanted to double-check before violating DRY.

@dschult
Copy link
Member

dschult commented Jul 5, 2020

In Python, function calls are relatively expensive. Compiled languages are very different that way.

And from my perspective, wrapping the API into knots (e.g. lots of layers of functions) to avoid DRY makes the code much less readable and harder to maintain which was the goal of DRY in the first place. And besides, input checking is usually simple code that should be easy to read and not a bother to repeat. We want to avoid repeating the hard work (algorithm logic).

What do you think? (I notice there are now conflicts with base due to a change in exception raising. If this is getting too much to deal with, let me know and I'll sort it out.) Thanks!!

@dschult dschult self-assigned this Jul 10, 2020
Base automatically changed from master to main March 4, 2021 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants