AsyncResultTest.test_wait_for_send fails in 1-core VM #380

Closed
bmwiedemann opened this issue Aug 27, 2019 · 3 comments

@bmwiedemann

While working on reproducible builds for openSUSE, I found that
our python-ipyparallel-6.2.4 package failed to build on 1-core Linux VMs because
ipyparallel/tests/test_asyncresult.py:357 expects a timeout(0) to trigger an exception,
but that depends on the scheduling, which happens differently in 1-core VMs.

It seems one can trigger this behaviour by running the tests under taskset 1, which pins the process to a single CPU.

@minrk
Member

minrk commented Jun 4, 2021

Hi! I’m going through and cleaning up old/stale issues on this repo. Sorry for not responding in a reasonable amount of time!

I'd recommend skipping the test in that case. Race conditions are hard to rigorously eliminate in asynchronous distributed code, and this test depends on exactly such a race. It could probably be mocked sufficiently to fake the behavior, but then it wouldn't accurately test the relevant code anymore.

@minrk minrk closed this as completed Jun 4, 2021
@bmwiedemann
Author

Aren't the tests there to be able to find and fix such race conditions, especially because it is so hard to do?

@minrk
Member

minrk commented Jun 7, 2021

To be clear, this test failure is not due to a bug in the code. This is a test for low-level machinery coordinating information between libzmq's C++ io threads and Python. The race is in the test itself, not the code. The code is behaving correctly even in the failure case. The error is a failure to produce the intended test scenario, due to the configuration of the VM.

The case being tested:

  • we hand off a send to libzmq
  • on handoff, before the send completes, our 'sent' event is waiting
  • after the send completes, our 'sent' event is ready

The race arises because libzmq immediately begins attempting to process the send in another, GIL-less thread. Thread scheduling makes this nondeterministic, but in any realistic scenario the send takes a finite, nonzero amount of time, so the event is still waiting right after the handoff. I guess under taskset 1 it goes:

  1. handoff to libzmq
  2. wake libzmq thread
  3. libzmq thread processes send
  4. back to Python to wait on the 'sent' event, which by then is always (or at least more often) ready

The right thing to do is skip this test when run in an environment that cannot reproduce the test scenario. If there is an obvious way to detect this, I'd add the skip automatically.
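
One possible way to detect such an environment, as a rough sketch rather than anything ipyparallel is known to do: count the CPUs the process is allowed to run on and skip when there is only one. The helper name `usable_cpu_count` and the test class are hypothetical, and treating "fewer than 2 CPUs" as "cannot reproduce the scenario" is an assumption; it is only a proxy for the actual scheduling behaviour.

```python
import os
import unittest


def usable_cpu_count():
    """Hypothetical helper: number of CPUs this process may be scheduled on."""
    # sched_getaffinity reflects restrictions such as `taskset`, but it is
    # only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: total CPU count, which ignores affinity masks.
    return os.cpu_count() or 1


class AsyncResultSketch(unittest.TestCase):  # hypothetical stand-in test class
    @unittest.skipIf(
        usable_cpu_count() < 2,
        "needs at least 2 usable CPUs to observe a pending send (assumption)",
    )
    def test_wait_for_send(self):
        # The real test body would go here.
        pass
```

On platforms without `os.sched_getaffinity`, the fallback cannot see an affinity restriction, so the skip would not trigger there.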

If there were a mechanism to force a delay into libzmq's underlying send, that would help. But I'm not aware of such a mechanism.
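
For illustration only, here is the scenario the test is after, sketched with plain Python threads instead of libzmq. The `gate` event plays the role of the delay mechanism that, per the comment above, libzmq is not known to offer; `hand_off_send` and all other names are hypothetical and are not ipyparallel or pyzmq APIs.

```python
import threading


def hand_off_send(gate):
    """Hypothetical stand-in for handing a message to a background io thread.

    Returns the 'sent' event and the worker thread. The worker marks the
    send as complete only after `gate` has been opened.
    """
    sent = threading.Event()

    def io_thread():
        gate.wait()  # forced delay: the send cannot finish before this
        sent.set()   # the send has completed

    worker = threading.Thread(target=io_thread)
    worker.start()
    return sent, worker


gate = threading.Event()
sent, worker = hand_off_send(gate)

# On handoff, before the send completes, the event is still waiting.
ready_before = sent.wait(0)

# Let the send complete, then check again.
gate.set()
worker.join()
ready_after = sent.wait(0)
```

With the gate in place the ordering no longer depends on thread scheduling, which is exactly what the real test cannot guarantee.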
