[EDITED] Document cancel-after-final-timeout pattern #102

Open

lucagiovagnoli opened this issue Apr 21, 2016 · 10 comments

@lucagiovagnoli

I encountered an issue where crochet does not terminate the Twisted Deferred correctly when raising the TimeoutError exception.

Code for reference

import time

import crochet
import fido
import pytest


def test_eventual_result_timeout(server_url):

    # server is incredibly slow
    eventual_result = fido.fetch(server_url + '100sec')

    # make sure no timeout error is thrown here, only on result retrieval
    assert eventual_result.original_failure() is None

    with pytest.raises(crochet.TimeoutError):
        eventual_result.wait(timeout=1.0)

    assert eventual_result.original_failure() is None

    print("Sleeping a bit to make sure a potential exception in the reactor "
          "thread due to connectionLost is trapped in eventual_result")
    time.sleep(5)

    assert eventual_result.original_failure() is None

    print("Finished sleeping. In a moment we will lose references to "
          "eventual_result and hence the deferred will be closed incorrectly, "
          "causing the CRITICAL message to be logged right after this message....")


import crochet
from twisted.internet import reactor
from twisted.internet.defer import Deferred
from twisted.web.client import Agent

# _build_body_producer, HTTPBodyFetcher, listify_headers and the DEFAULT_*
# constants are helpers defined elsewhere in the project, omitted here.


@crochet.run_in_reactor
def fetch_inner(url, method, headers, body, timeout, connect_timeout):

    bodyProducer, headers = _build_body_producer(body, headers)

    deferred = Agent(reactor, connectTimeout=connect_timeout).request(
        method=method,
        uri=url,
        headers=listify_headers(headers),
        bodyProducer=bodyProducer)

    def response_callback(response):
        # collect the response body into a new Deferred
        finished = Deferred()
        response.deliverBody(HTTPBodyFetcher(response, finished))
        return finished

    deferred.addCallback(response_callback)

    return deferred


def fetch(
    url, method='GET', headers=None, body='', timeout=DEFAULT_TIMEOUT,
    connect_timeout=DEFAULT_CONNECT_TIMEOUT,
    content_type=DEFAULT_CONTENT_TYPE, user_agent=DEFAULT_USER_AGENT,
):

    crochet.setup()
    return fetch_inner(url, method, headers, body, timeout, connect_timeout)

Traceback
Everything was going fine (TimeoutError is correctly raised), but the Twisted reactor thread was failing because the connection was closed incorrectly. Traceback:

Sleeping a bit to make sure a potential exception in the reactor thread due to
connectionLost is trapped in eventual_result

Finished sleeping. In a moment we will lose references to eventual_result and hence
the deferred will be closed incorrectly, causing the CRITICAL message to
be logged right after this message....

CRITICAL:twisted:Unhandled error in EventualResult
Traceback (most recent call last):
Failure: twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

All assertions were true, showing that no exception is trapped in the EventualResult. The exception happens exactly when the eventual_result is garbage-collected.

The problem

From Twisted documentation:

If a Deferred is garbage-collected with an unhandled error (i.e. it would call the next errback if there was one), then Twisted will write the error’s traceback to the log file. This means that you can typically get away with not adding errbacks and still get errors logged. Be careful though; if you keep a reference to the Deferred around, preventing it from being garbage-collected, then you may never see the error (and your callbacks will mysteriously seem to have never been called).
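
For illustration, here is a minimal standalone sketch (not from the issue) of the behaviour the Twisted documentation describes: a Deferred holding an unhandled failure only writes its traceback to the log when it is garbage-collected.

import gc
import sys

from twisted.internet.defer import Deferred
from twisted.python import log

log.startLogging(sys.stdout)

d = Deferred()
d.errback(Exception("boom"))  # the failure now sits in the Deferred, unhandled
del d                         # drop the last reference; Twisted logs
                              # "Unhandled error in Deferred" once it is collected
gc.collect()                  # force prompt collection on non-refcounting runtimes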

In this case the ConnectionLost exception is caused by the garbage collector disposing of the deferred when we lose all references to the EventualResult (and hence the deferred). In fact, nothing is logged as long as the eventual_result is alive.

I think that raising TimeoutError here without disposing of the deferred causes the Agent connection to be closed incorrectly, and the exception to be thrown in the reactor thread after the TimeoutError has been raised.

Verification
As a last step to verify the problem, I added self.cancel() inside crochet, just before the TimeoutError is raised, so that the deferred is cancelled properly.

 if not self._result_set.is_set():
     self.cancel()
     raise TimeoutError()

I tested this with our internal acceptance test suite.

Why not handle twisted.internet.error.ConnectionLost with a Twisted errback?
The problem is a systematic, erroneous disposal of the deferred object that always closes the connection incorrectly in the same way. Say I were to handle this on our side like this:

    def handle_timeout_errors(error):
        if error.check(twisted.web.client.ResponseNeverReceived):
            if error.value.reasons[0].check(
                twisted.internet.error.ConnectionLost
            ):
                # we get here if crochet aborts the connection incorrectly
                # when the EventualResult timeout expires. Nothing to raise
                # because crochet will take care of raising a TimeoutError.
                return
        return error

    deferred.addErrback(handle_timeout_errors)

I would mistakenly be catching all potential ConnectionLost failures, even ones due to other causes (for example, the server aborting the connection).

Action points

I understand crochet would not want to dispose of deferred objects before raising the TimeoutError, so that multiple retries on result retrieval remain possible.

Q: Would crochet be interested in a way to dispose of/destroy the deferred? Something like this on Stack Overflow. Otherwise, I will have the same discussion on our side about handling this ourselves.

@itamarst
Owner

Hi, thanks for the detailed bug report! I'm just back from vacation and catching up with things; I'll try to get you an answer soon.

@lucagiovagnoli
Author

Hi, thanks very much for your answer! Please let me know what you think; I'm really curious to get a second opinion.

@itamarst
Owner

So... the @run_in_reactor and wait() API are designed such that you can call wait() multiple times. That means that when wait() raises a TimeoutError it cannot know whether it should cancel the underlying Deferred or not, since it might be called again.
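
For illustration, a sketch (not from the issue, reusing fetch and server_url from the example above) of the retry-style usage this design allows; auto-cancelling on the first TimeoutError would break it:

result = fetch(server_url + '100sec')  # EventualResult from @run_in_reactor
for _ in range(3):
    try:
        response = result.wait(timeout=1.0)
        break
    except crochet.TimeoutError:
        continue  # the Deferred is still running in the reactor; wait again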

If the interaction pattern is "if timeout is hit we're done", perhaps you want to use @wait_for(timeout=x)? That will do the cancellation when timeout is hit. Or to put it another way, why is @wait_for insufficient for this use case?
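
For comparison, a minimal sketch, not from the issue, of the @wait_for alternative: the decorated call blocks in the calling thread, and when the timeout is hit crochet cancels the underlying Deferred itself before raising crochet.TimeoutError.

import crochet
from twisted.internet import reactor
from twisted.web.client import Agent

crochet.setup()

@crochet.wait_for(timeout=1.0)
def fetch_blocking(url):
    # runs in the reactor thread; crochet waits on the returned Deferred
    # and cancels it if the timeout is hit
    return Agent(reactor).request(b'GET', url)

try:
    response = fetch_blocking(b'http://localhost:8080/100sec')
except crochet.TimeoutError:
    pass  # the request's Deferred has already been cancelled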

@itamarst
Owner

itamarst commented May 3, 2016

@lucagiovagnoli any thoughts on the above?

@lucagiovagnoli
Author

Hi! Thanks for your reply, and sorry for my delayed answer (I am supposed to be back working on this next week).
I remember checking @wait_for out, but I dismissed it because it seemed to be a blocking call; it calls the wait() primitive internally (please tell me if I am wrong).

What I was looking for was something which:

  1. does not block (so that we can do something else during the network calls)
  2. disposes of the deferred after calling wait() once for the amount of time we specify (so the connection gets closed cleanly).

@itamarst
Owner

itamarst commented May 3, 2016

Oh, I see. So this is for fire-and-forget operations, just something you want to start in the background?

@lucagiovagnoli
Author

Exactly. As in my example above, my deferred is returned from the Twisted Agent request, so I'd like to be able to fire the network call and retrieve the results later.

@itamarst
Owner

itamarst commented May 3, 2016

Retrieving results later isn't really fire-and-forget though. So OK, it's not that use case.

The use case is then... "I've started a Twisted thing in the background with @run_in_reactor, and I've hit a timeout long enough that I want to cancel the underlying Deferred; I am never going to try again".

Can you do this:

try:
    eventual_result.wait(timeout=1.0)
except crochet.TimeoutError:
    eventual_result.cancel()
    raise

@lucagiovagnoli
Author

I think that could actually solve my problem! I missed that the eventual_result could be canceled from outside while I was going crazy with Twisted deferreds.

Thanks for taking the time, and for the extra pair of eyes. I will try that out, but feel free to close this because I feel confident that's the way to go.

@itamarst
Owner

itamarst commented May 3, 2016

I'm thinking maybe some documentation of this pattern would be useful, so I will leave this open for that.
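
For reference, a sketch of what the documented pattern might look like (the helper name wait_or_cancel is hypothetical, not part of crochet's API): wait once, and if that was the final attempt, cancel the Deferred so the connection is torn down cleanly instead of erroring when the EventualResult is garbage-collected.

import crochet

def wait_or_cancel(eventual_result, timeout):
    """Wait for an EventualResult once; on timeout, cancel it and re-raise."""
    try:
        return eventual_result.wait(timeout=timeout)
    except crochet.TimeoutError:
        eventual_result.cancel()
        raise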

itamarst changed the title from "CRITICAL error ConnectionLost when crochet does not cancel the deferred" to "[EDITED] Document cancel-after-final-timeout pattern" on May 3, 2016