Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing traveling salesman test #5913

Closed
jamesjer opened this issue Aug 2, 2022 · 12 comments · Fixed by #7335
Closed

Failing traveling salesman test #5913

jamesjer opened this issue Aug 2, 2022 · 12 comments · Fixed by #7335

Comments

@jamesjer
Copy link

jamesjer commented Aug 2, 2022

Current Behavior

I help maintain a NetworkX package for the Fedora Linux distribution. We like to run the tests when we build, to catch problems that might affect users. One traveling salesman test, test_asadpour_tsp at line 707 of networkx/algorithms/approximation/tests/test_traveling_salesman.py in NetworkX 2.8.5, sometimes succeeds and sometimes fails. When it fails, it finds the tour [3, 2, 1, 0, 5, 4, 3], which has the same weight (402) as the expected tour [3, 2, 0, 1, 4, 5, 3]. I'm curious about this comment at the bottom of the test:

    # However, we are using a fixed random number generator so we know what the
    # expected tour is.

Does that mean that something is wrong with the random number generator? If so, can you give me a hint as to what I should look for?

Expected Behavior

The tests should pass.

Steps to Reproduce

The test failure is intermittent. We do the build, then run pytest.

Environment

Python version: 3.11.0~b5
NetworkX version: 2.8.5

@mjschwenne
Copy link
Contributor

This is odd indeed. While I did author the code in question, I will admit that the random number generation pattern used in networkx isn't something that I've investigated. I haven't been able to replicate the error in my environment running Ubuntu 22.04 (python 3.10.4) against the main branch of the repo and haven't heard any reports of that test failing in our pipeline, but since you mention that it's an intermittent problem that doesn't exactly say anything. (Ugh those types of problems are the worst to track down!)

I did check that the seed parameter passed to asadpour_atsp is used in all of the places where random numbers are required. It is worth noting that the since their are already two tours in the expected_tours, the comment above it seems... not exactly correct, but I'm not sure where the existing non-determinism is coming from.

Since the returned answer has the same weight as the correct answer, the quick and dirty option here is to add that tour to the list of expected tours. Perhaps either @dschult or @rossbar would have a better intuition on whether this is a problem that we need to track down or if the fast solution is considered good enough.

@dschult
Copy link
Member

dschult commented Aug 3, 2022

I think the random number generation in asadpour is fine. But calling traveling_salesman_problem doesn’t present a way to provide the seed to the asadpour function directly. One of the tests shows that it can be called with a seed by overriding the default method with a function that calls asadpour with a seed. But… the test that is causing troubles doesn’t do that. So no seed is set for that test despite the comment saying it is. Perhaps a previous version did call the seed.

I think the easy fix is to change the test to use asadpour with a seed. Longer term maybe we should provide a seed argument to traveling_salesman_problem. But I’m not sure. If we don’t do that we should provide an example in the doc_string of how to call it with a seed. Thoughts?

@dschult dschult added this to the networkx-2.8.6 milestone Aug 3, 2022
@jamesjer
Copy link
Author

jamesjer commented Aug 3, 2022

Changing the test to provide a seed sounds like the right way to go. Thank you both for the quick responses.

@mjschwenne
Copy link
Contributor

Changing it to use a seed would be the best solution, but I believe that it already is? Line 742 in the test file of the 2.8.5 release does wrap the asadpour_atsp function and provides a seed value of 19 for the failing test unless I'm seriously mistaken.

@dschult
Copy link
Member

dschult commented Aug 4, 2022

Whoops! Sorry @mjschwenne I stand corrected. The seed is set in the test function. Not sure what I was looking at. :}

I have extracted the test function and run it many times with the same results. I have also added a print statement to show the next random number generated after it finishes and I get the same number from (0,1) many many times within a for loop. I'm not sure how best to debug this. @jamesjer can you repeatedly run this code in the environment the test is run in?

import networkx as nx
import numpy as np
edge_list = [
        (0, 1, 100),
        (0, 2, 100),
        (0, 5, 1),
        (1, 2, 100),
        (1, 4, 1),
        (2, 3, 1),
        (3, 4, 100),
        (3, 5, 100),
        (4, 5, 100),
        (1, 0, 100),
        (2, 0, 100),
        (5, 0, 1),
        (2, 1, 100),
        (4, 1, 1),
        (3, 2, 1),
        (4, 3, 100),
        (5, 3, 100),
        (5, 4, 100),
    ]

G = nx.DiGraph()
G.add_weighted_edges_from(edge_list)

def fixed_asadpour(G, weight):
    return nx_app.asadpour_atsp(G, weight, seed=19)
    
for i in range(15):
    nx_app.traveling_salesman_problem(G, weight="weight",method=fixed_asadpour)

@rossbar rossbar removed this from the networkx-3.0 milestone Nov 15, 2022
@rossbar
Copy link
Contributor

rossbar commented Nov 15, 2022

Are you still seeing this issue @jamesjer ? Have you tried NX>2.8.5? I can't reproduce it locally either.

@thesamesam
Copy link

I think we might be hitting this in Gentoo on x86 (https://bugs.gentoo.org/921958).

If i run that script above, I get:

$ python /tmp/foo.py
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]
[3, 2, 1, 0, 5, 4, 3]

But for me, I seem to always get the failure (and the same output from the script).

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Feb 20, 2024
The test gets a valid solution but an unexpected on on x86. It's a somewhat
long-standing failure, so skip for now on x86.

Bug: networkx/networkx#5913
Bug: https://bugs.gentoo.org/921958
Signed-off-by: Sam James <sam@gentoo.org>
@mjschwenne
Copy link
Contributor

What version of networkx and python are you using?

@jamesjer
Copy link
Author

I haven't seen this happen for quite awhile, but it did on a recent build of networkx 3.2.1 with python 3.12.1. It does not happen on every build. Running the script above in the build environment produces nothing but [3, 2, 0, 1, 4, 5, 3], every time.

Is there any parallelism in the test suite?

@rossbar
Copy link
Contributor

rossbar commented Feb 22, 2024

I took another look at the implementation and the first thing that jumped out as a potential point of variability is the use of scipy.optimize.linprog in the held_karp_ascent function. By default, linprog uses the "highs" method, which chooses between one of two sub-methods, either highs-ipm or highs-ds - see the linprog("highs") method docstring for details.

The docstring mentions that when the default "highs" is specified, the submethod is chosen by scipy. I didn't chase down the selection logic, but perhaps it's possible different methods are being selected on different platforms. Another possibility is that the methods give different results on different platforms - they are binary extensions after all, so their building/packaging may have something to do with it.

If someone were going to dive into this further, that's where I'd start looking. In the end though, there may not be a whole lot we can do about it. If we're lucky, perhaps explicitly specifying one of highs-ds or highs-ipm removes the variability. The worst case scenario, such as it is, is that we cannot guarantee exact reproducibility for cases where there are multiple valid solutions. IMV this is not the end of the world. I'd be inclined to just document it and update the test to include all valid solutions to the test case!

@thesamesam
Copy link

What version of networkx and python are you using?

networkx-3.2.1 w/ python-3.11.8

@mjschwenne
Copy link
Contributor

Well, I'm still not sure what is causing this issue. I'm using the NixOS compiled version of both numpy and scipy and I was still unable to reproduce the bug by changing the scipy.linprog method. I tend to agree that Rossbar that this was probably the most likely point of origin for the bug, pretty much everything outside of that call is python only and should be fairly robust to platform specific changes.

Either way, since the new tour returned is equally valid, I've added the reported path to the test and fixed the method to "highs-ipm". That should fix the bug, but since I can't reproduce the problem I'd really appreciate if you could test the update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
6 participants