Consistent dijkstra api #3923

nescio007 · 2020-04-16T20:50:46Z

The API for weighted shortest paths has two triples of functions of the form *_source_dijkstra, *_source_dijkstra_path, *_source_dijkstra_path_length, where the first returns both lengths and paths, the second only the path, and the third only the length.
While intuitively all three should function roughly the same, only the first version allows to specify a target parameter.

In short, this PR

Adds a target parameter to the *_source_dijkstra_path and *_source_dijkstra_path_length variants
Uses the new parameter to remove some duplicate code
Makes multi_source_dijkstra the effective "backend" of all versions, with a new capture_paths flag to prevent full path computation when they are not needed.

Previously only single_source_dijkstra had a target parameter, but not single_source_dijkstra_path or single_source_dijkstra_path_length

write variants of existing testcases to make use of the target parameter

dschult · 2020-05-29T21:17:12Z

This seems like a good idea. Unifying the API for those 3 functions is a good goal.
Is there a way to avoid the capture_paths flag by passing in a dict to fill with paths? (similar to _multisource?

nescio007 · 2020-06-01T11:22:06Z

I couldn't find an easy way:
multi_source_dijkstra is part of the public API, and thus should return both lengths and paths even when called just as multi_source_dijkstra(G, sources). A paths dict argument would thus need a good default value that would still return paths: An empty dict ({}) does not work (mutable), and None should be used to indicate that no paths should be captured. I guess one could use some fixed, non-dict value to indicate that a fresh paths-dict should be constructed, but I'm not sure that would yield a more elegant solution than the current capture_paths flag.

Additionally, a paths argument has somewhat weird semantics: as multi_source_dijkstra still has to return the captured paths (in order not to break the current API), there are now two places that contain the results: the dict given as the paths argument, and the value returned from multi_source_dijkstra.

I'm not 100% happy with the current capture_paths flag either, but it's the best I could come up with :-/

dschult · 2020-06-01T14:09:47Z

Here's an approach that would make that work smoothly (and other things):

instead of using multi_source_dijkstra as the function all others call, use the internal function _dijkstra_multisource.
The internal function should assume all inputs are reasonable and valid. This leaves the input checking to the outer-layer functions. Outer-layer functions that know "sources" is not empty don't have to check that. This reduces checking overall. Basically, put all checking in the outer functions. Yes, that might duplicate some code, but it saves code from being executed -- and it avoids an API with 7 functions that simply call a single function that could have been the API in and of itself... :}
The inner function can allow paths and pred inputs that get morphed. It knows those inputs are OK because our code (the outer-layer functions) created them. And it knows that those outer functions will ignore those input parameters if they aren't appropriate to be returned to user. In this sense, outer-layer functions construct all inputs and process all outputs. The inner function doesn't have to check inputs or worry about outputs.
preprocessing of the weight argument should also be done in the outer-layer functions. Again, that duplicates code in the file, but avoids calling unnecessary code. And it moves all input checking and construction to the outer-layer.
don't have outer-layer functions call other outer-layer functions. Calling many functions is slow in an interpreted language like Python and makes it hard for people to find what is done where in the code... They have to look up many functions to finally find the code that does the core of the algorithm. So, just have each function process inputs, construct inputs, call the single inner function that does everything (except check inputs), process output and return it to the user.

I think this will take care of the capture_paths argument in the API. And it'll make the rest of the code easier to parse for people trying to figure out how this code works.

Move `capture_paths` and argument checking to a "middle-layer" function `_dijkstra_multisource_checked`, before calling the "inner-layer" `_dijkstra_multisource` implementation. This prevents adding `capture_paths` to the public API, but maintains a single place for argument checking/return value handling without code duplicates.

nescio007 · 2020-06-22T10:21:29Z

How about the following compromise, that moves capture_paths to a "middle-layer" function?
This leaves the public API as-is (no new arguments), but prevents code-duplication for argument checking.

To reply to some of your comments:

and it avoids an API with 7 functions that simply call a single function that could have been the API in and of itself... :}

Aren't we already in that situation?
dijkstra_path is the same as single_source_dijkstra_path (which is the same as multi_source_dijkstra_path with {source} as sources). Likewise, single_source_dijkstra and single_source_dijkstra_path are basically the same, only that the latter ignores one of the return values of the former.

don't have outer-layer functions call other outer-layer functions. Calling many functions is slow in an interpreted language like Python and makes it hard for people to find what is done where in the code... They have to look up many functions to finally find the code that does the core of the algorithm. So, just have each function process inputs, construct inputs, call the single inner function that does everything (except check inputs), process output and return it to the user.

Is the performance-penalty worth the code-duplication?
If dijkstra_path calls multi_source_dijkstra_path, the reader can immediately grasp that they are essentially the same.
If both were to call _dijkstra_multisource, but had their own pre-processing, the reader would first need to check that also this pre-processing is the same.

In terms of performance penalty, I did a quick benchmark using one of the small graphs from the test-suite:

import timeit

import networkx as nx

from networkx.algorithms.shortest_paths.weighted import _dijkstra_multisource_checked

def main():

    N = 100000

    XG = nx.DiGraph()
    XG.add_weighted_edges_from([('s', 'u', 10), ('s', 'x', 5),
                                     ('u', 'v', 1), ('u', 'x', 2),
                                     ('v', 'y', 1), ('x', 'u', 3),
                                     ('x', 'v', 5), ('x', 'y', 2),
                                     ('y', 's', 7), ('y', 'v', 6)])

    print("dijkstra_path:                ", end='\t')
    print(timeit.timeit(lambda: nx.dijkstra_path(XG, 's', 'v'), number=N))

    print("single_source_dijkstra_path:  ", end='\t')
    print(timeit.timeit(lambda: nx.single_source_dijkstra_path(XG, 's', 'v'), number=N))

    print("multi_source_dijkstra_path:   ", end='\t')
    print(timeit.timeit(lambda: nx.multi_source_dijkstra_path(XG, {'s'}, 'v'), number=N))

    print("_dijkstra_multisource_checked:", end='\t')
    print(timeit.timeit(lambda: _dijkstra_multisource_checked(XG, {'s'}, 'v'), number=N))

if __name__ == '__main__':
    main()

With this I get:

dijkstra_path:                	0.7512081320001016
single_source_dijkstra_path:  	0.7353465300002426
multi_source_dijkstra_path:   	0.7260776550010632
_dijkstra_multisource_checked:	0.7037647990000551

Which is an overhead of about 7% introduced by the dijkstra_path -> single_source_dijkstra_path -> multi_source_dijkstra_path -> _dijkstra_multisource_checked call-chain.

To be honest, I did not expect the function call overhead to be that high.
Then again, this is a very small graph, and the relative difference should be much smaller for graphs (where most of the time will be spent in the actual Dijkstra implementation).

Sorry for being such a nuisance, but I wanted to double-check before violating DRY.

dschult · 2020-07-05T20:22:54Z

In Python, function calls are relatively expensive. Compiled languages are very different that way.

And from my perspective, wrapping the API into knots (e.g. lots of layers of functions) to avoid DRY makes the code much less readable and harder to maintain which was the goal of DRY in the first place. And besides, input checking is usually simple code that should be easy to read and not a bother to repeat. We want to avoid repeating the hard work (algorithm logic).

What do you think? (I notice there are now conflicts with base due to a change in exception raising. If this is getting too much to deal with, let me know and I'll sort it out.) Thanks!!

nescio007 added 5 commits April 16, 2020 21:56

add target parameter to all dijkstra functions

e59355a

Previously only single_source_dijkstra had a target parameter, but not single_source_dijkstra_path or single_source_dijkstra_path_length

remove duplicate code

678ddc6

refactor multi_source_dijkstra_path_length

8bc3820

reformat PEP8

7f2b9e9

add testcases

fe1bc8d

write variants of existing testcases to make use of the target parameter

nescio007 marked this pull request as ready for review April 16, 2020 21:02

dschult self-assigned this Jul 10, 2020

Base automatically changed from master to main March 4, 2021 18:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistent dijkstra api #3923

Consistent dijkstra api #3923

nescio007 commented Apr 16, 2020

dschult commented May 29, 2020

nescio007 commented Jun 1, 2020

dschult commented Jun 1, 2020

nescio007 commented Jun 22, 2020

dschult commented Jul 5, 2020

Consistent dijkstra api #3923

Are you sure you want to change the base?

Consistent dijkstra api #3923

Conversation

nescio007 commented Apr 16, 2020

dschult commented May 29, 2020

nescio007 commented Jun 1, 2020

dschult commented Jun 1, 2020

nescio007 commented Jun 22, 2020

dschult commented Jul 5, 2020