
Channel creation/destruction may leave dangling thread. #14338

Closed
meawoppl opened this issue Feb 6, 2018 · 12 comments

@meawoppl commented Feb 6, 2018

What version of gRPC and what language are you using?

  • Python==3.5.2
  • grpcio==1.8.4
  • grpcio-tools==1.8.4

What operating system (Linux, Windows, …) and version?

  • Linux Ubuntu 16.04
  • Windows 7

What runtime / compiler are you using (e.g. Python version or version of gcc)?

  • Python==3.5.2

What did you do?

Here is a minimal example of the helper function we wrote:

import numbers

import grpc


def wait_for_channel_ready(channel: grpc.Channel, timeout_ms: numbers.Number):
    """
    Wait for `channel` to enter the ready state.  If it does not enter this state
    within `timeout_ms` milliseconds, raise a TimeoutError.

    :param channel: a grpc.Channel instance.
    :param timeout_ms: Time in milliseconds to wait for the Channel to be ready.
    """
    fut = grpc.channel_ready_future(channel)
    try:
        fut.result(timeout=timeout_ms / 1000)
    except grpc.FutureTimeoutError as e:
        fut.cancel()
        raise TimeoutError(
            "Timeout waiting on {} for {} ms".format(channel, timeout_ms)) from e

And here is the corresponding test:

import gc

import grpc
import nose.tools

import grpc_helpers


def test_waiter_times_out():
    channel = grpc.insecure_channel("localhost:51234")  # No one answers here
    with nose.tools.assert_raises(TimeoutError):
        grpc_helpers.wait_for_channel_ready(channel, 5)

    # This tries to trigger the __del__ handlers that grpc has decided
    # actually work in Python:  https://github.com/grpc/grpc/issues/12531
    del channel
    gc.collect()

What did you expect to see?

The test passes, leaving no lingering non-daemonic threads behind.

What did you see instead?

After running the above, the process in which it ran will have accrued an unterminated non-daemonic thread. Code like the following (enforced by our CI) will fail:

import threading

for t in threading.enumerate():
    if t is threading.main_thread():
        continue
    assert t.daemon, t

Our CI requires that tests not leave threads of this nature behind after the test has completed, and it flags tests that leave unterminated threads around as failed. There seems to be no (documented) way to terminate these lingering threads.
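
For context, a check like the one above can be packaged as a reusable helper and run after every test. This is an illustrative sketch, not part of the original report, and the helper name is made up:

import threading


def assert_no_leaked_threads():
    """Fail if any non-daemonic thread other than the main thread is still alive."""
    leaked = [
        t for t in threading.enumerate()
        if t is not threading.main_thread() and not t.daemon
    ]
    assert not leaked, "non-daemonic threads left running: {}".format(leaked)

A helper like this could be called from a nose teardown or a pytest autouse fixture after each test.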

Anything else we should know about your project / environment?

I believe this is linked to #12531. The behavior is intermittent, and does not fail in all cases. Including sleep/timeout retry logic does not seem to help.
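
For reference, the kind of sleep/timeout retry logic mentioned above might look like the following sketch (illustrative only; the function name and timings are assumptions), which polls for the worker thread to exit after the channel is released:

import gc
import threading
import time


def non_daemonic_threads_exited(timeout_seconds=5.0, poll_interval=0.1):
    """Poll until only the main thread and daemon threads remain, or time out."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        gc.collect()
        leaked = [
            t for t in threading.enumerate()
            if t is not threading.main_thread() and not t.daemon
        ]
        if not leaked:
            return True
        time.sleep(poll_interval)
    return False

Even with polling of this sort, the channel's thread reportedly stayed alive in some runs.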

@meawoppl (Author) commented Feb 6, 2018

Possibly also #6666

@nathanielmanistaatgoogle (Member) commented Feb 14, 2018

I also believe that this is linked to issue 12531; let's reassess after that issue is closed.

@nathanielmanistaatgoogle (Member) commented May 15, 2018

@meawoppl: if you are able to work from source, it may be time to reassess this issue now that issue 12531 is closed. Otherwise the issue 12531 fix will be included in the upcoming 1.12 release, so reassess this issue then?

@meawoppl (Author) commented May 15, 2018

We are currently pinned to 1.11; once there is a 1.12 release it will be easy for me to test. If you ping back on this thread at that time, I would be happy to confirm or deny whether this got fixed.

@nathanielmanistaatgoogle (Member) commented May 15, 2018

grpcio 1.12.0 was uploaded about an hour ago; please start working with it. In particular, a grpc.Channel.close method is introduced; we'd be very keen to hear whether or not deliberately closing your channels resolves this issue.
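
A minimal sketch of how the failing test above might be adapted for grpcio >= 1.12 to close the channel explicitly instead of relying on __del__ (the grpc_helpers module and the timeout value are carried over from the original report):

import gc

import grpc
import nose.tools

import grpc_helpers


def test_waiter_times_out_with_explicit_close():
    channel = grpc.insecure_channel("localhost:51234")  # No one answers here
    try:
        with nose.tools.assert_raises(TimeoutError):
            grpc_helpers.wait_for_channel_ready(channel, 5)
    finally:
        # grpc.Channel.close (new in 1.12.0) releases the channel's resources
        # deterministically instead of waiting for garbage collection.
        channel.close()
    gc.collect()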

@nathanielmanistaatgoogle (Member) commented Jun 12, 2018

@meawoppl: poke poke? Does the problem continue to manifest? Should this issue remain open?

@meawoppl (Author) commented Jun 13, 2018

I gave this a shot today and traded flaky thread cleanup for segfaults. The test suite appears to have rotted a bit, so more soon.

@meawoppl (Author) commented Jun 22, 2018

Updates are in #12531; should I repost them here too?

@srini100 (Contributor) commented Jun 28, 2018

Not needed. These recent fixes on master, available in the upcoming 1.13.0, should resolve your segfault-on-shutdown issue. Well, hopefully. Update us when you get a chance to try these out.

#15825
#15727

@srini100 (Contributor) commented Jul 26, 2018

Since 1.13.0 has been out for a while, closing this issue. Please reopen if the issue still exists.

@srini100 closed this Jul 26, 2018

@meawoppl (Author) commented Jul 26, 2018 (comment minimized)

@nathanielmanistaatgoogle (Member) commented Jul 28, 2018

@meawoppl: if you see the problem with the latest release and it's still the same problem described here, reopen this issue rather than opening a new issue.

lock bot locked as resolved and limited conversation to collaborators Oct 26, 2018
