Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValueError: file descriptor cannot be a negative integer (-1) #63

Closed
danielmitterdorfer opened this issue Apr 17, 2020 · 2 comments
Closed

Comments

@danielmitterdorfer
Copy link

danielmitterdorfer commented Apr 17, 2020

Environment

  • Python version: 3.8.0
  • OS (uname -a): Darwin io 18.7.0 Darwin Kernel Version 18.7.0
  • Thespian version: 3.10.0
  • Transport: TCPTransport (on a single machine)

Error

When running an actor system on a single machine we get the following stack trace after a while:

2020-04-17 10:51:26.012604 p24308 ERR  Actor esrally.driver.driver.Worker @ ActorAddr-(T|:53459) transport run exception: Traceback (most recent call last):
  File "/Users/daniel/Projects/rally/.venv/lib/python3.8/site-packages/thespian/system/actorManager.py", line 87, in run
    r = self.transport.run(self.handleMessages)
  File "/Users/daniel/Projects/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/wakeupTransportBase.py", line 74, in run
    rval = self._run_subtransport(incomingHandler, max_runtime)
  File "/Users/daniel/Projects/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/wakeupTransportBase.py", line 80, in _run_subtransport
    rval = self._runWithExpiry(incomingHandler)
  File "/Users/daniel/Projects/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/TCPTransport.py", line 1143, in _runWithExpiry
    rrecv, rsend, rerr = select.select(wrecv, wsend,
ValueError: file descriptor cannot be a negative integer (-1)

In the error log we also see:

2020-04-17 10:51:26.016014 p24308 ERR  ValueError on select(#0: [], #3: [-1, 14, 13], #3: {13, -1, 14}, 0.166675)

I saw that the code in TCPTransport has handling for invalid file descriptors but it expects OSError to be raised.

I can reproduce this in our application when an actor sends large log messages so I wonder whether it is related. I also managed to capture a complete debug log output of Thespian (~ 13MB expanded; the error is towards the end of the log file).

The error is pretty well reproducible but the reproduction scenario is a bit involved (it requires running an application that is somewhat complex to setup) so I'm not sure whether it is helpful to describe it here. I am of course happy to replicate the steps here, provide further details and / or participate in testing but I wanted to get the issue out here first.

@kquick kquick closed this as completed in a99fbf9 Apr 20, 2020
@kquick
Copy link
Owner

kquick commented Apr 20, 2020

Thank you for the report and the debug log to help with looking at this. Having looked through the log, it looks like the remote Actor closed the socket at some point while there were still message queue to send to that remote queued, and the send got attempted. It's entirely possible for sockets to close, and the TCPTransport should accomodate those situations, but as you noted, it wasn't properly handling the ValueError scenario where the socket fd was invalid.

I've uploaded a fix (a99fbf9) that should properly handle the ValueError scenario. It's a little tough to reproduce this with local tests, but it sounds like it's not too difficult for you to reproduce, so I'm hoping you can confirm that this helps.

It's entirely possible that there's some other underlying issue that resulted in the socket disconnect; please check that the updated version is giving you the expected results and let me know if there's something else I can help with here.

[Note: apparently pushing a commit with the word "Fix" followed by a hash and a bug number causes github to automatically close an issue. TIL. I've re-opened the issue to wait for your confirmation and further testing.]

@kquick kquick reopened this Apr 20, 2020
@danielmitterdorfer
Copy link
Author

danielmitterdorfer commented Apr 21, 2020

I ran tests for a full day, both with with version 3.10.0 and your commit a99fbf9 and can confirm that I never spotted the issue again with a99fbf9. Thanks Kevin for implementing the fix so quickly!

danielmitterdorfer added a commit to danielmitterdorfer/rally that referenced this issue Jul 15, 2020
With this commit we upgrade from Thespian 3.10.0 to 3.10.1 which
includes a bugfix that improves stability when a remote actor has closed
a socket which still had pending messages.

Relates kquick/Thespian#63
danielmitterdorfer added a commit to elastic/rally that referenced this issue Jul 15, 2020
With this commit we upgrade from Thespian 3.10.0 to 3.10.1 which
includes a bugfix that improves stability when a remote actor has closed
a socket which still had pending messages.

Relates kquick/Thespian#63
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants