ValueError: file descriptor cannot be a negative integer (-1) #63

danielmitterdorfer · 2020-04-17T12:14:56Z

Environment

Python version: 3.8.0
OS (uname -a): Darwin io 18.7.0 Darwin Kernel Version 18.7.0
Thespian version: 3.10.0
Transport: TCPTransport (on a single machine)

Error

When running an actor system on a single machine we get the following stack trace after a while:

2020-04-17 10:51:26.012604 p24308 ERR  Actor esrally.driver.driver.Worker @ ActorAddr-(T|:53459) transport run exception: Traceback (most recent call last):
  File "/Users/daniel/Projects/rally/.venv/lib/python3.8/site-packages/thespian/system/actorManager.py", line 87, in run
    r = self.transport.run(self.handleMessages)
  File "/Users/daniel/Projects/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/wakeupTransportBase.py", line 74, in run
    rval = self._run_subtransport(incomingHandler, max_runtime)
  File "/Users/daniel/Projects/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/wakeupTransportBase.py", line 80, in _run_subtransport
    rval = self._runWithExpiry(incomingHandler)
  File "/Users/daniel/Projects/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/TCPTransport.py", line 1143, in _runWithExpiry
    rrecv, rsend, rerr = select.select(wrecv, wsend,
ValueError: file descriptor cannot be a negative integer (-1)

In the error log we also see:

2020-04-17 10:51:26.016014 p24308 ERR  ValueError on select(#0: [], #3: [-1, 14, 13], #3: {13, -1, 14}, 0.166675)

I saw that the code in TCPTransport has handling for invalid file descriptors but it expects OSError to be raised.

I can reproduce this in our application when an actor sends large log messages so I wonder whether it is related. I also managed to capture a complete debug log output of Thespian (~ 13MB expanded; the error is towards the end of the log file).

The error is pretty well reproducible but the reproduction scenario is a bit involved (it requires running an application that is somewhat complex to setup) so I'm not sure whether it is helpful to describe it here. I am of course happy to replicate the steps here, provide further details and / or participate in testing but I wanted to get the issue out here first.

The text was updated successfully, but these errors were encountered:

kquick · 2020-04-20T06:19:02Z

Thank you for the report and the debug log to help with looking at this. Having looked through the log, it looks like the remote Actor closed the socket at some point while there were still message queue to send to that remote queued, and the send got attempted. It's entirely possible for sockets to close, and the TCPTransport should accomodate those situations, but as you noted, it wasn't properly handling the ValueError scenario where the socket fd was invalid.

I've uploaded a fix (a99fbf9) that should properly handle the ValueError scenario. It's a little tough to reproduce this with local tests, but it sounds like it's not too difficult for you to reproduce, so I'm hoping you can confirm that this helps.

It's entirely possible that there's some other underlying issue that resulted in the socket disconnect; please check that the updated version is giving you the expected results and let me know if there's something else I can help with here.

[Note: apparently pushing a commit with the word "Fix" followed by a hash and a bug number causes github to automatically close an issue. TIL. I've re-opened the issue to wait for your confirmation and further testing.]

danielmitterdorfer · 2020-04-21T08:20:23Z

I ran tests for a full day, both with with version 3.10.0 and your commit a99fbf9 and can confirm that I never spotted the issue again with a99fbf9. Thanks Kevin for implementing the fix so quickly!

With this commit we upgrade from Thespian 3.10.0 to 3.10.1 which includes a bugfix that improves stability when a remote actor has closed a socket which still had pending messages. Relates kquick/Thespian#63

kquick closed this as completed in a99fbf9 Apr 20, 2020

kquick reopened this Apr 20, 2020

danielmitterdorfer closed this as completed Apr 21, 2020

danielmitterdorfer mentioned this issue Jul 15, 2020

Upgrade to Thespian 3.10.1 elastic/rally#1025

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ValueError: file descriptor cannot be a negative integer (-1) #63

ValueError: file descriptor cannot be a negative integer (-1) #63

danielmitterdorfer commented Apr 17, 2020

kquick commented Apr 20, 2020

danielmitterdorfer commented Apr 21, 2020

ValueError: file descriptor cannot be a negative integer (-1) #63

ValueError: file descriptor cannot be a negative integer (-1) #63

Comments

danielmitterdorfer commented Apr 17, 2020

Environment

Error

kquick commented Apr 20, 2020

danielmitterdorfer commented Apr 21, 2020